
HTTP input file names not respected in execution VM #4184

Open
chapmanb opened this issue Oct 1, 2018 · 13 comments
chapmanb commented Oct 1, 2018

Hi all;
In testing release 35 with CWL inputs I've also been looking at supporting remote URL references. This is working correctly for GS URLs but not for http URLs. I've put together a test case that demonstrates the problem:

https://github.com/bcbio/test_bcbio_cwl/tree/master/gcp

The somatic-workflow-http CWL workflow uses http URLs and doesn't work, while the comparable somatic-workflow CWL workflow uses GS URLs referencing the same data and does work.

The workflow fails with:

java.io.FileNotFoundException: Cannot hash file https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa

when running tasks. The files get downloaded to the input directories but are given numerical names instead of their original file names, so they never seem to sync over and get translated correctly for the workflow:

ls -lh cromwell_work/cromwell-executions/main-somatic.cwl/eaa632df-52a8-4aae-826f-647a42fa7145/call-prep_samples_to_rec/inputs/1515144/
total 136K
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 225050424226294657
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 2612405277530248055
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 503001634356675169
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 5802330287039666628
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 5809676514510180826
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 6090832304768530540
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 6105514522473810611
-rw------- 3 chapmanb chapmanb 37K Sep 26 14:07 6807576659333162957
-rw------- 3 chapmanb chapmanb 150 Sep 26 14:07 6853384576121493061
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7483350933664987331
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7538690575330349970
-rw------- 3 chapmanb chapmanb 37K Sep 26 14:07 7691692211431528147
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7783203266940950463
-rw------- 3 chapmanb chapmanb 150 Sep 26 14:07 8389565043859020157
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 8932347409858620277
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 993751307168383758

My configuration is:

engine {
  filesystems {
    gcs {
      auth = "application-default"
    }
    http {}
  }
}

backend {
  providers {
    Local {
      config {
        filesystems {
          http {}
        }
      }
    }
  }
}

Am I doing anything wrong with my configuration or setup that I could tweak? Thanks so much for any pointers/suggestions.


mcovarr commented Oct 1, 2018

The hash failures are expected with http inputs and should not be the cause of your workflow failure. Also, we don't currently support http in engine filesystems. Do you see any other error messages that might provide some insight into what's happening?

chapmanb commented Oct 1, 2018

Thanks very much for helping debug this.

Beyond the hash failure from Cromwell, the other errors I get are all from the workflow itself, due to the original file names not being preserved. The numerical hashes for files get passed directly into the downstream tools, stripping off any extensions or other identifying information. This confuses tools; for example, tabix can't tell whether a file was already gzipped:

ValueError: Unexpected tabix input: /home/chapmanb/drive/work/cwl/test_bcbio_cwl/gcp/cromwell_work/cromwell-executions/main-somatic.cwl/93ef2d1c-88ee-4dc2-af0a-e0ea86bc785e/call-prep_samples/shard-0/execution/bedprep/cleaned-8539016497173364825.gz

or bwa can't find all the other associated indices:

bwa mem /home/chapmanb/drive/work/cwl/test_bcbio_cwl/gcp/cromwell_work/cromwell-executions/main-somatic.cwl/93ef2d1c-88ee-4dc2-af0a-e0ea86bc785e/call-alignment/shard-1/wf-alignment.cwl/96d7b606-e0fe-4305-a586-e0fc4acf76f8/call-process_alignment/shard-0/inputs/1628767813 [...]

[E::bwa_idx_load_from_disk] fail to locate the index files

Is it expected to lose the original input file names when passing through the pipeline? A lot of tools are sensitive to these, and this might be the underlying issue.
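As an illustration of the secondary-file problem (a sketch with hypothetical paths, not Cromwell's actual staging logic): bwa locates its index by appending fixed suffixes to the reference path, so a hash-renamed reference breaks every lookup.

```shell
# Hypothetical illustration: bwa finds its index files by suffixing the
# reference path (ref.fa -> ref.fa.amb, ref.fa.ann, ref.fa.bwt, ...).
ref="inputs/hg19.fa"
for ext in amb ann bwt pac sa; do
  echo "bwa will look for: ${ref}.${ext}"
done
# When the staged reference is renamed to a bare number like "1628767813",
# none of "1628767813.amb" etc. exist, hence "fail to locate the index files".
```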

Regarding the configuration: without http {} under engine -> filesystems I get a complaint about it not being supported, even with http {} under backend -> providers -> Local -> config -> filesystems:

java.lang.IllegalArgumentException: Either https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa exists on a filesystem not supported by this instance of Cromwell, or a failure occurred while building an actionable path from it. Supported filesystems are: LinuxFileSystem. Failures: LinuxFileSystem: Cannot build a local path from https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa (RuntimeException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.cromwell$engine$workflow$lifecycle$materialization$MaterializeWorkflowDescriptorActor$$workflowInitializationFailed(MaterializeWorkflowDescriptorActor.scala:211)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:181)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:176)

I was trying to lift off how things were done with the Google/gcp resolution so added it in there to fix this issue. Is there a different configuration approach I should be using?

Thanks again for helping with this.

mcovarr commented Oct 1, 2018

Hi Brad

In addition to the http entry in the backend filesystems config, the http filesystem also needs to be defined system-wide. Cromwell's reference.conf defines this already, so as long as that's being pulled into your configuration and you're not overriding the filesystems, you should be set.
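For reference, a minimal sketch of how the layering could look (HOCON, assuming the standard Lightbend/Typesafe Config behavior Cromwell uses: reference.conf is merged automatically from the classpath, and the explicit include pulls in the bundled application settings):

```hocon
# Sketch: user config layered on top of Cromwell's defaults. reference.conf
# (which defines the system-wide http filesystem) is merged automatically;
# the include below additionally pulls in the bundled application settings.
include required(classpath("application"))

backend.providers.Local.config.filesystems {
  http {}
}
```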

chapmanb commented Oct 2, 2018

Thanks for the help with the configuration. I'm not intentionally overriding the global filesystem, but I don't have an explicit import of reference.conf. Do I need an import like we have for application? Do you spot anything else I might be doing wrong?

https://gist.github.com/chapmanb/72c6bf2d8282412b252f6192968b17cf

I appreciate all the help debugging this.

@gemmalam gemmalam added the Needs Triage Ticket needs further investigation and refinement prior to moving to milestones label Oct 4, 2018
@cjllanwarne

Within your http filesystem, I suspect you need enabled: true, à la https://github.com/broadinstitute/cromwell/blob/35_hotfix/core/src/main/resources/reference.conf#L349
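If that's the fix, the resulting backend config might look like this (a sketch based on the reference.conf line above; the exact key layout is an assumption):

```hocon
# Sketch (assumed layout): explicitly enable the http filesystem for the
# Local backend, mirroring the enabled flag in reference.conf.
backend.providers.Local.config.filesystems {
  http {
    enabled: true
  }
}
```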

@geoffjentry

@cjllanwarne Did you try that, or are you making an educated guess? I haven't yet seen anyone need to do that, although AFAIK (an educated guess myself) the http support isn't properly wired into the engine-level machinery at all. My suspicion, having run into something similar myself, is that this is a "doesn't really work with CWL" issue, but I could be wrong.

chapmanb commented Oct 5, 2018

Chris, thanks for the idea. I tried this and unfortunately had the same issue. From the behavior it looks like http is working, in that the files get downloaded, but they get numerical names instead of the expected file names. This disconnect seems to be what causes issues when passing them on to the CWL tools.

@rebrown1395 rebrown1395 removed the Needs Triage Ticket needs further investigation and refinement prior to moving to milestones label Nov 15, 2018
@cjllanwarne

Hi @chapmanb - sorry for the delay in responding here.

I was able to get http inputs to work in CWL against a default (i.e. no custom config specified) instance of Cromwell in server mode. The test case I used is in the linked PR (#4392).

I wonder whether you could confirm:

  • Whether this test case works for you, and if so:
    • Is your use of HTTP inputs different somehow?
    • How can I enhance my test case to cover whatever is different?
  • Or, whether this test case does not work for you, and if so:
    • We might try to work out what is different between your configuration and the default which might be breaking things

@chapmanb

Chris;
Thanks for working on this and for the test case to iterate with. This example does work for me in the sense that it generates an md5sum, but it also demonstrates the underlying issue I'm having with https inputs. The files get downloaded and staged into my pipeline, but the file names get mangled into random download numbers. md5sum is fine with this, but many of my real tasks fail because the expected file extensions and associated secondary file extensions get lost with the random file names.

Here's the example output I get from running this that demonstrates the file naming issue:

/usr/bin/md5sum '/home/chapmanb/tmp/cromwell/cromwell_work/cromwell-executions/main-http_inputs.cwl/093e2835-e4cc-4731-9248-88d74dec0977/call-sum/inputs/1515144/1710814112361209342' | cut -c1-32

This input should be called jamie_the_cromwell_pig.png but instead gets renamed to a long number. Is it possible to preserve the initial file names with https, as happens with other filesystem types?
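A sketch of what preserving the name could look like (hypothetical, not Cromwell code): the original file name is recoverable from the last segment of the URL path, so staging could use that instead of a content hash.

```shell
# Hypothetical sketch: derive the original file name from the URL path.
url="https://example.com/inputs/jamie_the_cromwell_pig.png"
name="${url##*/}"   # keep everything after the last '/'
echo "$name"        # -> jamie_the_cromwell_pig.png
```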

In terms of the test cases, it would be great if it also checked that the file extension and name get preserved.

Thanks again for looking at this.

@chapmanb

Is there a fix for this issue in the latest development version? I'm still stuck on this, so I'm not sure if I missed something.

@geoffjentry

I was wondering the same. @rebrown1395 it's not obvious why this issue was closed, and as it's from an external user it should be explained prior to closing.

@rebrown1395

@geoffjentry and @chapmanb, my apologies this wasn't supposed to be closed!

@rebrown1395 rebrown1395 reopened this Nov 27, 2018
@ruchim ruchim added the 🐛Bug label Feb 25, 2019
@gemmalam gemmalam changed the title CWL: http input file references not properly pulled in with release 35 HTTP input file names not respected in execution VM Feb 25, 2019
@gemmalam gemmalam removed the Red team label Apr 22, 2019
cnizo commented May 14, 2021

Hello! Is there an update or workaround for this? I am experiencing the same issue.
