-
Notifications
You must be signed in to change notification settings - Fork 755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace bedtools sort with unix sort in BEDTOOLS_GENOMECOV #6063
Conversation
`bedtools sort` uses a large amount of CPUs and memory, but when using it here it doesn't require the additional genome based features of `bedtools`. Replacing it should speed up the process and make it many times more efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually had issues with bedtools sort in nf-core/atacseq, see this and this PR. Actually, in our case, we switch to use gnu sort and be able to use the --parallel
and --buffer-size
arguments, see here. The reason for these changes was that the previous implementation of the module was very memory gready when dealing with big files resulting for the merge of the files at the library level, making the pipeline to fail for many users.
Thanks @JoseEspinosa, do you think we should use |
So I am totally for this change. Also, I would suggest to use gnu sort and add an |
Allows customisation of GNU
Rather than complicate things with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not wrong sort in biocontainers/bedtools:2.31.1--hf5e1c6e_0
is not gnu sort and thus, the arguments --parallel
and the --buffer-size
won't work. This should do the trick. Also, not sure if we should also include LC_ALL=C sort
to be sure that the sort order remains the same across different systems. Finally for reference, our implementation in nf-core/atacseq was inspired by this implementation on hicar of single sort process.
the default biocontainer comes with a version of |
For me is OK to add them as default, this is what I actually did in the atacseq module. |
OK let's go for the Seqera container. I'll wait for someone else to throw an objection before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonderful! 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last thought, should we control for task.memory
being not null
(not set) ?
Good point. It should always be set but I've just moved the logic to handle the case where it has been explicitly set to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just add a comment as to why it has been changed please, so nobody decides to change it back in future ;)
Some HPC clusters refuse job submissions where memory has been set, so there is a valid use case for this being set to |
We still have an open discussion around Seqera containers and private registries. Clearly, people are using them (great!) but support for private or offline registries isn't solved and it makes me wary of going too far down this route. |
Are we actually having that discussion anywhere? |
Bedtools recommends this method itself https://bedtools.readthedocs.io/en/latest/content/tools/sort.html So method wise no controversy. Seqera containers, @maxulysse has been copying them over to quay.io for the simplicity of having all the containers in one place (Why that makes it easier I still don't understand). So I think that's your quick fix there.
|
if people set the |
@SPPearce several POCs here seqeralabs/nf-aggregate#43 seqeralabs/nf-aggregate#44 seqeralabs/nf-aggregate#45 seqeralabs/nf-aggregate#46 and here: #5832, nf-core/tools#2408 |
bedtools sort
uses a large amount of CPUs and memory, but when using it here it doesn't require the additional genome based features ofbedtools
. Replacing it should speed up the process and make it many times more efficient.First discovered by @pabloaledo
PR checklist
versions.yml
file.label
nf-core modules test <MODULE> --profile docker
nf-core modules test <MODULE> --profile singularity
nf-core modules test <MODULE> --profile conda
nf-core subworkflows test <SUBWORKFLOW> --profile docker
nf-core subworkflows test <SUBWORKFLOW> --profile singularity
nf-core subworkflows test <SUBWORKFLOW> --profile conda