
HTTP input file names not respected in execution VM #4184

Open
chapmanb opened this issue Oct 1, 2018 · 13 comments
chapmanb commented Oct 1, 2018

Hi all;
In testing release 35 with CWL inputs I've also been looking at supporting remote URL references. This is working correctly for GS URLs but not for http URLs. I've put together a test case that demonstrates the problem:

https://github.com/bcbio/test_bcbio_cwl/tree/master/gcp

The somatic-workflow-http CWL workflow uses http URLs and doesn't work, while the comparable somatic-workflow CWL workflow uses GS URLs referencing the same data and does work.

The workflow fails with:

java.io.FileNotFoundException: Cannot hash file https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa

when running tasks. The files get downloaded to the input directories but are given numerical names instead of their original file names, so they never seem to sync over and get translated correctly for the workflow:

ls -lh cromwell_work/cromwell-executions/main-somatic.cwl/eaa632df-52a8-4aae-826f-647a42fa7145/call-prep_samples_to_rec/inputs/1515144/
total 136K
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 225050424226294657
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 2612405277530248055
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 503001634356675169
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 5802330287039666628
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 5809676514510180826
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 6090832304768530540
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 6105514522473810611
-rw------- 3 chapmanb chapmanb 37K Sep 26 14:07 6807576659333162957
-rw------- 3 chapmanb chapmanb 150 Sep 26 14:07 6853384576121493061
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7483350933664987331
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7538690575330349970
-rw------- 3 chapmanb chapmanb 37K Sep 26 14:07 7691692211431528147
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7783203266940950463
-rw------- 3 chapmanb chapmanb 150 Sep 26 14:07 8389565043859020157
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 8932347409858620277
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 993751307168383758

My configuration is:

engine {
  filesystems {
    gcs {
      auth = "application-default"
    }
    http {}
  }
}

backend {
  providers {
    Local {
      config {
        filesystems {
          http {}
        }
      }
    }
  }
}

Am I doing anything wrong with my configuration or setup that I could tweak? Thanks so much for any pointers/suggestions.


mcovarr commented Oct 1, 2018

The hash failures are expected with http inputs and should not be the cause of your workflow failure. Also, we don't currently support http in engine filesystems. Do you see any other error messages that might provide some insight into what's happening?

chapmanb commented Oct 1, 2018

Thanks very much for helping debug this.

Beyond the hash failure from Cromwell, the other errors I get are all from the workflow itself, due to the original file names not being preserved. The numerical hashes for files get passed directly into the downstream tools, stripping off any extensions or other identifying information. This confuses tools; for example, tabix can't tell whether a file was already gzipped:

ValueError: Unexpected tabix input: /home/chapmanb/drive/work/cwl/test_bcbio_cwl/gcp/cromwell_work/cromwell-executions/main-somatic.cwl/93ef2d1c-88ee-4dc2-af0a-e0ea86bc785e/call-prep_samples/shard-0/execution/bedprep/cleaned-8539016497173364825.gz

or bwa can't find all the other associated indices:

bwa mem /home/chapmanb/drive/work/cwl/test_bcbio_cwl/gcp/cromwell_work/cromwell-executions/main-somatic.cwl/93ef2d1c-88ee-4dc2-af0a-e0ea86bc785e/call-alignment/shard-1/wf-alignment.cwl/96d7b606-e0fe-4305-a586-e0fc4acf76f8/call-process_alignment/shard-0/inputs/1628767813 [...]

[E::bwa_idx_load_from_disk] fail to locate the index files

Is it expected to lose the original input file names when passing through the pipeline? A lot of tools are sensitive to these, and this might be the underlying issue.
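As an illustration of the secondary-file problem (a sketch with hypothetical paths, not Cromwell's actual staging logic): bwa locates its index by appending fixed suffixes to the reference path, so a hash-renamed reference breaks every lookup.

```shell
# Hypothetical illustration: bwa finds its index files by suffixing the
# reference path (ref.fa -> ref.fa.amb, ref.fa.ann, ref.fa.bwt, ...).
ref="inputs/hg19.fa"
for ext in amb ann bwt pac sa; do
  echo "bwa will look for: ${ref}.${ext}"
done
# When the staged reference is renamed to a bare number like "1628767813",
# none of "1628767813.amb" etc. exist, hence "fail to locate the index files".
```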

Regarding the configuration: without http {} under engine -> filesystems I get a complaint about it not being supported, even with http {} under backend -> providers -> Local -> config -> filesystems:

java.lang.IllegalArgumentException: Either https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa exists on a filesystem not supported by this instance of Cromwell, or a failure occurred while building an actionable path from it. Supported filesystems are: LinuxFileSystem. Failures: LinuxFileSystem: Cannot build a local path from https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa (RuntimeException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.cromwell$engine$workflow$lifecycle$materialization$MaterializeWorkflowDescriptorActor$$workflowInitializationFailed(MaterializeWorkflowDescriptorActor.scala:211)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:181)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:176)

I was trying to lift off how things were done with the Google/gcp resolution so added it in there to fix this issue. Is there a different configuration approach I should be using?

Thanks again for helping with this.

mcovarr commented Oct 1, 2018

Hi Brad

In addition to the http entry in the backend filesystems config, the http filesystem also needs to be defined system-wide. Cromwell's reference.conf defines this already, so as long as that's being pulled into your configuration and you're not overriding the filesystems, you should be set.
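For reference, a minimal sketch of how the layering could look (HOCON, assuming the standard Lightbend/Typesafe Config behavior Cromwell uses: reference.conf is merged automatically from the classpath, and the explicit include pulls in the bundled application settings):

```hocon
# Sketch: user config layered on top of Cromwell's defaults. reference.conf
# (which defines the system-wide http filesystem) is merged automatically;
# the include below additionally pulls in the bundled application settings.
include required(classpath("application"))

backend.providers.Local.config.filesystems {
  http {}
}
```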

chapmanb commented Oct 2, 2018

Thanks for the help with the configuration. I'm not intentionally overriding the global filesystem, but I don't have an explicit import of reference.conf. Do I need an import like we have for application? Do you spot anything else I might be doing wrong?

https://gist.github.com/chapmanb/72c6bf2d8282412b252f6192968b17cf

I appreciate all the help debugging this.

@gemmalam gemmalam added the Needs Triage Ticket needs further investigation and refinement prior to moving to milestones label Oct 4, 2018
@cjllanwarne

Within your http filesystem, I suspect you need enabled: true, à la https://github.com/broadinstitute/cromwell/blob/35_hotfix/core/src/main/resources/reference.conf#L349
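If that's the fix, the resulting backend config might look like this (a sketch based on the reference.conf line above; the exact key layout is an assumption):

```hocon
# Sketch (assumed layout): explicitly enable the http filesystem for the
# Local backend, mirroring the enabled flag in reference.conf.
backend.providers.Local.config.filesystems {
  http {
    enabled: true
  }
}
```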

@geoffjentry

@cjllanwarne Did you try that, or are you making an educated guess? I haven't yet seen anyone need to do that, although AFAIK (an educated guess myself) the http support isn't properly wired into the engine-level machinery at all. My suspicion, having run into something similar myself, is that this is a "doesn't really work with CWL" issue, but I could be wrong.

chapmanb commented Oct 5, 2018

Chris, thanks for the idea. I tried this and unfortunately had the same issue. From the behavior it looks like http is working, in that the files get downloaded, but they get numerical names instead of the expected file names. This disconnect seems to be what causes issues when passing them on to the CWL tools.

@rebrown1395 rebrown1395 removed the Needs Triage Ticket needs further investigation and refinement prior to moving to milestones label Nov 15, 2018
@cjllanwarne

Hi @chapmanb - sorry for the delay in responding here.

I was able to get http inputs to work in CWL against a default (i.e. no custom config specified) instance of Cromwell in server mode. The test case I used is in the linked PR (#4392).

I wonder whether you could confirm:

  • Whether this test case works for you, and if so:
    • Is your use of HTTP inputs different somehow?
    • How can I enhance my test case to cover whatever is different?
  • Or, whether this test case does not work for you, and if so:
    • We might try to work out what is different between your configuration and the default which might be breaking things

@chapmanb

Chris;
Thanks for working on this and for the test case to iterate with. This example does work for me in the sense that it generates an md5sum, but it also demonstrates the underlying issue I'm having with https inputs. The files get downloaded and staged into my pipeline, but the file names get mangled into random download numbers. md5sum is fine with this, but many of my real tasks fail because the expected file extensions and associated secondary file extensions get lost with the random file names.

Here's the example output I get from running this that demonstrates the file naming issue:

/usr/bin/md5sum '/home/chapmanb/tmp/cromwell/cromwell_work/cromwell-executions/main-http_inputs.cwl/093e2835-e4cc-4731-9248-88d74dec0977/call-sum/inputs/1515144/1710814112361209342' | cut -c1-32

This input should be called jamie_the_cromwell_pig.png but instead gets renamed to a long number. Is it possible to preserve the initial file names with https, as happens with other filesystem types?
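A sketch of what preserving the name could look like (hypothetical, not Cromwell code): the original file name is recoverable from the last segment of the URL path, so staging could use that instead of a content hash.

```shell
# Hypothetical sketch: derive the original file name from the URL path.
url="https://example.com/inputs/jamie_the_cromwell_pig.png"
name="${url##*/}"   # keep everything after the last '/'
echo "$name"        # -> jamie_the_cromwell_pig.png
```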

In terms of the test cases, it would be great if it also checked that the file extension and name get preserved.

Thanks again for looking at this.

@chapmanb

Is there a fix for this issue in the latest development version? I'm still stuck on this, so I'm not sure if I missed something.

@geoffjentry

I was wondering the same. @rebrown1395 it's not obvious why this issue was closed, and as it's from an external user it should be explained prior to closing.

@rebrown1395

@geoffjentry and @chapmanb, my apologies this wasn't supposed to be closed!

@rebrown1395 rebrown1395 reopened this Nov 27, 2018
@ruchim ruchim added the 🐛Bug label Feb 25, 2019
@gemmalam gemmalam changed the title CWL: http input file references not properly pulled in with release 35 HTTP input file names not respected in execution VM Feb 25, 2019
@gemmalam gemmalam removed the Red team label Apr 22, 2019
cnizo commented May 14, 2021

Hello! Is there an update or workaround for this? I am experiencing the same issue.
