Local MSA results different from MSA server #665

Open
sangyeon-hits opened this issue Nov 25, 2024 · 7 comments

Comments

@sangyeon-hits

sangyeon-hits commented Nov 25, 2024

Expected Behavior

I expect using colabfold_search with locally prepared DBs to give the same MSA results as those using colabfold_batch with the MSA server.

Current Behavior

The two give different MSA results given the same input .fasta.

Steps to Reproduce (for bugs)

  1. Install colabfold==1.5.5 via pip to a fresh new mamba environment (python==3.11.10).

  2. Build mmseqs2 of commit 71dd32ec43e3ac4dabf111bbc4b124f1c66a85f1 following ColabFold README.

  3. Execute the following to set up the DBs:

    MMSEQS_NO_INDEX=1 bash setup_databases.sh $colabfold_db_dir

    where I use the mmseqs2 built from step 2 for the tsv2exprofiledb commands.

  4. Prepare a sample .fasta file (say sample.fasta) of a single protein sequence.

  5. Get a locally generated MSA by:

    colabfold_search --mmseqs $mmseqs sample.fasta $colabfold_db_dir out_local
    # $mmseqs == mmseqs2 executable from step 2
    # Adding args `--db2 pdb100_230517` gave no change in the MSA outputs.
  6. Independently, get an MSA generated by querying the server:

    colabfold_batch sample.fasta out_server --msa-only
  7. Compare the .a3m files generated from steps 5 and 6.
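
One possible way to carry out the comparison in step 7 is to diff the hit identifiers found in the two files. The sketch below is only an illustration: it assumes the identifier is the first whitespace-delimited token of each `>` header line, the helper names `hit_ids` and `diff_ids` are hypothetical, and the toy A3M strings are made-up stand-ins rather than real ColabFold outputs.

```python
def hit_ids(a3m_text):
    """Collect the first whitespace-delimited token of every '>' header line."""
    ids = []
    for raw in a3m_text.splitlines():
        # Assumption: stray null bytes may appear in some A3M outputs; drop them.
        line = raw.replace("\x00", "").strip()
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def diff_ids(local_text, server_text):
    """Split hit identifiers into shared, local-only, and server-only groups."""
    local_set = set(hit_ids(local_text))
    server_set = set(hit_ids(server_text))
    return {
        "common": sorted(local_set & server_set),
        "local_only": sorted(local_set - server_set),
        "server_only": sorted(server_set - local_set),
    }


# Toy inputs (hypothetical identifiers and scores, for illustration only).
local_text = ">query\nMKT\n>hitA\t100\t0.9\nMKA\n>hitB\t80\t0.7\nM-T\n"
server_text = ">query\nMKT\n>hitB\t75\t0.7\nM-T\n>hitC\t60\t0.5\nMRT\n"

result = diff_ids(local_text, server_text)
print(result)
```

To apply this to real outputs, read each .a3m file into a string first (for example with `pathlib.Path(...).read_text()`) and pass the two strings to `diff_ids`.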

ColabFold Output (for bugs)

Omitted; I can attach outputs if necessary.

Context

I want to reproduce results from ColabFold notebooks on my local machine.

Your Environment

  • Git commit used: e2ca9e8
    (I used this checkout only to run setup_databases.sh. For the colabfold_{search,batch} commands, I used v1.5.5 installed via pip.)
  • Operating system and version: Red Hat Enterprise Linux 9.3 (Plow)
@sangyeon-hits
Author

Related: #263

@sangyeon-hits
Author

sangyeon-hits commented Nov 26, 2024

When I tried a short sequence input like

>A
MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFP

the server MSA and local MSA are the same. But with a longer input:

>A
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ

The server and local outputs differ.
In this case, ~60% of the sequences in the two MSAs coincide (ignoring the order within the MSA files and the numbers in the header lines), whereas the remaining ~40% differ.
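
A comparison of this kind (ignoring row order and header contents) could be sketched as a set comparison of the aligned sequence rows. This is a minimal illustration, not the exact procedure used for the numbers above: the function names `read_a3m_rows` and `overlap` are hypothetical, lines starting with `#` are assumed to be metadata and skipped, null bytes are stripped, and duplicate rows collapse because sets are used, so the percentages may differ slightly from a row-count-based tally.

```python
from io import StringIO


def read_a3m_rows(handle):
    """Return the aligned sequence rows of an A3M file, dropping all headers."""
    rows, current = [], []
    for raw in handle:
        line = raw.replace("\x00", "").strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#'-prefixed metadata lines
        if line.startswith(">"):
            if current:  # a new header closes the previous record
                rows.append("".join(current))
                current = []
        else:
            current.append(line)  # sequences may span several lines
    if current:
        rows.append("".join(current))
    return rows


def overlap(rows_a, rows_b):
    """Summarize how many distinct rows two MSAs share, ignoring order."""
    a, b = set(rows_a), set(rows_b)
    union = a | b
    shared = a & b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy inputs (made up for illustration; real files would be opened with open()).
local = StringIO(">q\nMKT\n>hit1 100 0.9\nMKA\n>hit2\nM-T\n")
server = StringIO("#3\t1\n>q\nMKT\n>hit2 55\nM-T\n>hit3\nMRT\n")

stats = overlap(read_a3m_rows(local), read_a3m_rows(server))
print(stats)
```

Rows are compared exactly as written, including gap characters and lowercase insertion states, so two hits to the same database entry with different alignments count as different rows.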

@milot-mirdita
Collaborator

Can you post the terminal output of the colabfold_search command please?

@sangyeon-hits
Author

sangyeon-hits commented Nov 26, 2024

@milot-mirdita Thank you for the response. The following is the stdout I got.
msa_out_202305.log
(The timestamp just denotes the mmseqs commit date I used.)

I also tried --db2 pdb100_230517 with --use-templates, but the run terminated with an error because the file pdb100_230517_seq is missing from the DBs.

@wehs7661

Hi @sangyeon-hits, have you resolved the issue? I am also having different MSA results between my local machine and the mmseqs server.

I used exactly the same procedure as you, except for the following minor differences in the environment/software:

  • I installed ColabFold from source (the latest version as of today, commit 406d4c6) in a mamba environment with Python 3.10.13.
  • For MMseqs2, as the flag --prefilter-mode is not compatible with the version corresponding to commit 71dd32ec43e3ac4dabf111bbc4b124f1c66a85f1, I used release 15 as instructed here.

In case anyone else is interested, here is the content of my input FASTA file:

>A|protein
MDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCP

The local MSA computation took a long time (> 1 hour), but to my understanding this might be because colabfold_search is meant for handling an input FASTA containing a large number of sequences. (Please correct me if I am wrong.) This is a separate issue that I will investigate later.

And here are the outputs from the two different approaches. I've changed the extension to txt to be able to upload them here.
protein_server.txt
protein_local.txt

I don't have the terminal output from the colabfold_search command but can rerun the command if desired.

@milot-mirdita I was also wondering if you have any insights into this issue. Thanks so much for your assistance!

@sangyeon-hits
Author

Hello, @wehs7661.
I couldn't resolve this issue. My team decided to use Jackhmmer and the larger genetic databases used by common works such as AF3 and Chai-1, so we haven't looked into this particular problem for a while.

@wehs7661

Hi @sangyeon-hits, thanks so much for the update! I guess I'll explore other options as well.
