Clarification on usage #2

Open
rlorigro opened this issue Dec 2, 2024 · 3 comments

rlorigro commented Dec 2, 2024

I have attempted to use SRFAligner on some of my own graphs, which were generated by a third-party tool. When I run SRFAligner, I get the following output:

./SRFAligner -g graph.gfa -f sequences.fasta -a test_srf.gaf -c 
Reading the graph...Error: sum of block heights does not correspond to node number!
 done.
Indexing the graph... done.
Locate
Cannot find any semi-repeat-free match of 92_1
Cannot find any semi-repeat-free match of 158_1
Cannot find any semi-repeat-free match of 116_1
GraphAligner Branch master commit daec67f67a2f50d648a6aa30cbbe5a2949583061 2024-01-19 10:52:13 +0200
GraphAligner Branch master commit daec67f67a2f50d648a6aa30cbbe5a2949583061 2024-01-19 10:52:13 +0200
GraphAligner Branch master commit daec67f67a2f50d648a6aa30cbbe5a2949583061 2024-01-19 10:52:13 +0200
GraphAligner Branch master commit daec67f67a2f50d648a6aa30cbbe5a2949583061 2024-01-19 10:52:13 +0200
Load graph from graph.gfa
Load graph from graph.gfa
Build alignment graph
Build alignment graph
Build minimizer seeder from the graph
Seeds from file
Seed cluster size 1
Extend up to 5 seed clusters
Alignment bandwidth 10
Clip alignment ends with identity < 66%
X-drop DP score cutoff 14705
Backtrace from 10 highest scoring local maxima per cluster
write alignments to test_srf.gaf
Align
Minimizer seeds, length 15, window size 20, density 10
Seed cluster size 1
Extend up to 5 seed clusters
Alignment bandwidth 10
Clip alignment ends with identity < 66%
X-drop DP score cutoff 14705
Backtrace from 10 highest scoring local maxima per cluster
write alignments to ./unaligned_reads_19749.gaf
Align
Alignment finished
Input reads: 293 (367168bp)
Seeds found: 641
Seeds extended: 144
Reads with a seed: 144 (180480bp)
Reads with an alignment: 144 (180341bp)
Alignments: 144 (180341bp)
End-to-end alignments: 54 (67696bp)
awk: not an option: -i

When I look at the output GAF, I see only 144 lines, which does not match the 293 input sequences.
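
To see which reads are missing rather than only comparing line counts, a small script can compare read names between the FASTA input and the GAF output. This is a minimal sketch, not part of SRFAligner: the function names are hypothetical, and it assumes that the GAF query name (column 1) equals the first whitespace-delimited token of the FASTA header. Note that one read can produce several GAF lines, so the line count alone can mislead.

```python
def fasta_names(lines):
    """Return read names (first token after '>') in FASTA order."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def gaf_query_names(lines):
    """Return the set of query names (column 1) found in GAF records."""
    names = set()
    for line in lines:
        line = line.rstrip("\n")
        if line:
            names.add(line.split("\t")[0])
    return names


def missing_reads(fasta_lines, gaf_lines):
    """List reads present in the FASTA input but absent from the GAF output."""
    aligned = gaf_query_names(gaf_lines)
    return [name for name in fasta_names(fasta_lines) if name not in aligned]
```

Both arguments can be open file handles, for example `missing_reads(open("sequences.fasta"), open("test_srf.gaf"))`.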

I see that stderr/stdout has the following concerning messages:

Reading the graph...Error: sum of block heights does not correspond to node number! 
Cannot find any semi-repeat-free match of 92_1
Cannot find any semi-repeat-free match of 158_1
Cannot find any semi-repeat-free match of 116_1
awk: not an option: -i

and unaligned_reads_19749.gaf is empty.

Maybe there is some mistake in my usage? If so, could you clarify the intended usage?

Thanks

nrizzo self-assigned this Dec 4, 2024
Collaborator

nrizzo commented Dec 4, 2024

Hi again,

Currently, SRFAligner and SRFChainer support only graphs generated by founderblockgraph, following a simple GFA extension describing the blocks. These MSA-based graphs have a restricted topology (they are elastic founder graphs) and the node segments are "unique" strings that are as short as possible. The README should contain this info and I will update it asap, so thank you very much for the question!

However, I think SRFAligner (but not SRFChainer) can be used on arbitrary GFA graphs using forward L links only (e.g. L s1 + s2 + 0M) since the seed step performed by efg-locate does not consider the topology, so the results might be interesting. Very short non-unique nodes might give you bad seeds, and very long nodes might result in no seeds at all (try option -m 1 or -m 2, and see option -p for a more understandable output).
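
Before trying SRFAligner on an arbitrary graph, it may help to check whether the GFA really contains forward L links only. The following is a minimal sketch with a hypothetical function name; it relies only on the standard GFA1 layout of L records (from, from-orientation, to, to-orientation, overlap). Treating any overlap other than `0M` as a problem is an assumption drawn from the example link above, so it can be switched off.

```python
def non_forward_links(gfa_lines, require_zero_overlap=True):
    """Return (line_number, line) for L records that are not 'L a + b + 0M'."""
    bad = []
    for number, raw in enumerate(gfa_lines, start=1):
        line = raw.rstrip("\n")
        fields = line.split("\t")
        if fields[0] != "L":
            continue  # only link records are inspected
        if len(fields) < 6:
            bad.append((number, line))  # malformed link record
            continue
        from_orient, to_orient, overlap = fields[2], fields[4], fields[5]
        forward = from_orient == "+" and to_orient == "+"
        overlap_ok = overlap == "0M" or not require_zero_overlap
        if not (forward and overlap_ok):
            bad.append((number, line))
    return bad
```

An empty result means every link has the `+`/`+` form; any reported line would need to be rewritten or removed before the graph matches the shape described above.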

For your specific execution, it seems that it's failing because SRFAligner -c is using GNU awk's option -i. I will also add it as a dependency.

~Nicola

Author

rlorigro commented Dec 4, 2024

Ah thank you. I did eventually realize my mistake, and I went to install vcf2multialign, but then ran into another issue when following their build instructions. At that point I grew a bit weary and gave up.

Do you think your speed advantage is mainly in the construction of the graph? In other words, should I expect to beat GraphAligner with SRFAligner, without using an EFG?

Collaborator

nrizzo commented Dec 5, 2024

Instead of explicitly declaring a dependency on GNU awk 4.1.0 or later, I now work around the problem in commit 93b1087 by using a temp file, so your original command should now work fully.

SRFAligner -c on iEFGs exploits the fact that a majority of the reads are aligned correctly by seeding a small subset of exact unique substrings in the graph/reference (see the preprint for more details). On general graphs, my intuition tells me that the speed of the SRFAligner part should not increase but the quality of the seeds/alignments should decrease. However, option -c realigns some of the reads, and speed will be affected: in the worst case, all initial alignments are thrown out and GraphAligner needs to do all of the work. For the seeding, it would be interesting to filter out all short nodes of length <= 10-14, so that efg-locate is not easily fooled.
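
One possible reading of the short-node filter is to build a separate copy of the GFA, used for seeding only, that omits short segments and the links touching them. The sketch below is an illustration under that assumption and not something SRFAligner provides: the function name and the default threshold of 14 are hypothetical, path (P/W) records and the block-describing GFA extension are not handled, and a segment whose sequence is `*` counts as length 1 and is dropped.

```python
def drop_short_nodes(gfa_lines, max_short_length=14):
    """Return GFA lines without segments of length <= max_short_length
    and without the links that touch those segments."""
    lines = [line.rstrip("\n") for line in gfa_lines]

    # First pass: collect the names of short segments.
    short = set()
    for line in lines:
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            if len(fields[2]) <= max_short_length:
                short.add(fields[1])

    # Second pass: keep everything except short segments and their links.
    kept = []
    for line in lines:
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 2 and fields[1] in short:
            continue
        if fields[0] == "L" and len(fields) >= 4:
            if fields[1] in short or fields[3] in short:
                continue
        kept.append(line)
    return kept
```

Removing nodes disconnects the graph, so the filtered copy is only suitable for finding seeds; the alignment step would still need the original graph.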

Let me know if you have any other questions!
