-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathabouelhoda_08_coconut_797586.pdf.txt
1952 lines (1468 loc) · 65.8 KB
/
abouelhoda_08_coconut_797586.pdf.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>1471-2105-9-476.fm</title>
<meta name="Author" content="wood02"/>
<meta name="Creator" content="FrameMaker 8.0"/>
<meta name="Producer" content="Acrobat Distiller 9.0.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<pre>
BMC Bioinformatics
BioMed Central
Open Access
Software
CoCoNUT: an efficient system for the comparison and analysis of
genomes
Mohamed I Abouelhoda*1,2, Stefan Kurtz3 and Enno Ohlebusch4
Address: 1Faculty of Engineering, Cairo University, Giza, Egypt, 2Nile University, Giza, Egypt, 3Center for Bioinformatics, University of Hamburg,
Bundesstraße 43, 20146 Hamburg, Germany and 4Faculty of Engineering and Computer Sciences, University of Ulm,89069 Ulm, Germany
Email: Mohamed I Abouelhoda* - [email protected]; Stefan Kurtz - [email protected];
Enno Ohlebusch - [email protected]
* Corresponding author
Published: 12 November 2008
BMC Bioinformatics 2008, 9:476
doi:10.1186/1471-2105-9-476
Received: 18 May 2008
Accepted: 12 November 2008
This article is available from: http://www.biomedcentral.com/1471-2105/9/476
© 2008 Abouelhoda et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: Comparative genomics is the analysis and comparison of genomes from different
species. This area of research is driven by the large number of sequenced genomes and heavily
relies on efficient algorithms and software to perform pairwise and multiple genome comparisons.
Results: Most of the software tools available are tailored for one specific task. In contrast, we have
developed a novel system CoCoNUT (Computational Comparative geNomics Utility Toolkit)
that allows solving several different tasks in a unified framework: (1) finding regions of high similarity
among multiple genomic sequences and aligning them, (2) comparing two draft or multichromosomal genomes, (3) locating large segmental duplications in large genomic sequences, and
(4) mapping cDNA/EST to genomic sequences.
Conclusion: CoCoNUT is competitive with other software tools w.r.t. the quality of the results.
The use of state of the art algorithms and data structures allows CoCoNUT to solve comparative
genomics tasks more efficiently than previous tools. With the improved user interface (including
an interactive visualization component), CoCoNUT provides a unified, versatile, and easy-to-use
software tool for large scale studies in comparative genomics.
Background
The size of genome sequence data has been rising at an
exponential rate for the past decade or two, and will dramatically increase with new sequencing technologies
becoming widely available. To analyze, annotate and
compare these genome sequences, new algorithms and
software for post-sequencing functional analysis are
demanded by the scientific community.
Whole genome comparisons can be used as a first step
toward solving genomic puzzles, such as determining
coding regions, discovering regulatory signals, and deduc-
ing the mechanisms and history of genome evolution. Of
importance to the genome annotation process, the
genome comparison approach obviates the need for a priori knowledge of a protein sequence motif and provides a
straightforward means for mapping information from the
stored annotated genomes to the novel ones.
Sequence comparison in the context of comparative
genomics is complicated by the fact that both local and
global mutations of the DNA molecules occur during evolution. Local mutations (point mutations) consist of substitutions, insertions or deletions of single nucleotides,
Page 1 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
while global mutations (genome rearrangements) change
the DNA molecules on a large scale. Global mutations
include inversions, transpositions, and translocations as
well as large-scale duplications, insertions, and deletions.
Thus, if the organisms under consideration are closely
related (that is, if no or only a few genome rearrangements
have occurred) or one compares regions which are suspected to be orthologous (regions in two or more
genomes in which orthologous genes occur in the same
order), then global alignments can, for example, be used
for the prediction of genes and regulatory elements. This
is because coding regions are relatively well preserved,
while non-coding regions tend to show varying degrees of
conservation. Genome comparisons of more closely
related species may also help to determine the genetic
basis for phenotype variation and may reveal species-specific regions (signatures) that can be targeted for identification.
For diverged genomic sequences, however, a global alignment strategy is likely predestined to failure for having to
align non-colinear and unrelated regions.
In fact, the realm of comparative genomics is not limited
to the comparison of two or multiple uni- or multi-chromosomal genomes. It also includes the comparison of
two or multiple draft genomic sequences, the comparison
of different assemblies, cDNA/EST mapping, and the
comparison of two cDNA/EST libraries from different species. In all these tasks, the key problem is to identify
regions of similarity among the sequences, and to align
them.
To cope with the shear volume of data, most of the comparative genomics software-tools use an anchor-based
method that is composed of three phases:
1. computation of fragments (segments in the sequences
that are similar),
2. computation of highest-scoring chains of colinear nonoverlapping fragments (these are the anchors that form
the basis of the alignment), and
http://www.biomedcentral.com/1471-2105/9/476
some use a graph based solution, and others use a geometric based solution.
Comparative genome analysis on a large scale
A tool for the systematic comparative study of sequences
as large as vertebrate or plant genomes must satisfy the
following criteria.
Versatility
To be useful for molecular biologists, such a tool should
be able to deal with versatile tasks. The CoCoNUT system
supports the following genome comparison tasks:
• Computation of a multiple alignment of closely related
(i.e., similar) sequences.
• Computation of regions of high similarity among multiple genomic sequences.
• Comparison of two draft or multi-chromosomal
genomes. (This task is similar to the comparison of two
cDNA/EST libraries).
• Identification of segmental duplications in whole
genomic sequences.
• cDNA/EST mapping.
To the best of our knowledge, there is no other softwaretool which covers so many tasks. CoCoNUT is freely available for non-commercial purposes.
Compositionality and usability
A complex system supporting the manifold tasks of
genome analysis usually consists of several advanced programs. Thus it must provide simple interfaces to enable
the composition of these programs. CoCoNUT uses variations of the above-mentioned anchor-based strategy to
support genome comparison tasks. The three phases (1)
computation of fragments, (2) chaining of fragments, and
(3) post-processing of chains are clearly separated. Thus,
it is possible to exchange a program performing one of the
phases without affecting the whole system. Moreover, it is
possible to stop the computation at any phase, and store
the intermediate results for later use.
3. alignment of the regions between the anchors.
See [1,2] for reviews about the tools using this strategy for
comparing whole genomic sequences, and see [3,4] and
the references therein for the tools addressing the task of
cDNA/EST mapping. All the tools employing this strategy
implement the three phases, but the details depend on the
task and are different among the tools. For example, some
tools use exact algorithms, some use greedy algorithms,
Efficiency
To analyze complete genomes of up to several billion base
pairs, the space and time used by the algorithms must
scale "almost" linearly with the sequence length and the
output size. CoCoNUT is based on the anchor based strategy mentioned before and its algorithms meet this
requirement. Our implementation of the crucial first
phase (computations of fragments) is linear, see [5] for
more details. The second phase uses techniques from
Page 2 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
computational geometry to chain the fragments. This
approach is "almost" linear in the number of fragments,
which is a considerable advantage over the straightforward graph based approach (which requires quadratic
running time). For more details about our chaining algorithms, see [4,6-8].
Experimental results show that CoCoNUT is able to efficiently handle large sequence sets. For example, four bacterial genomes are processed (i.e., finding regions of high
similarity and aligning them) in a few minutes. CoCoNUT
can process three large mammalian X-chromosomes in
about one hour on a standard workstation. However, for
the largest vertebrate chromosomes, a server class
machine (of say 16 or 32 GB of RAM) is probably
required. Comparing a set of complete mammalian
genomes in one run would require even more RAM, and
therefore we recommend to perform the comparison on
the chromosome level. However, a general statement
about the upper limit of the number and size of the
sequences which can be processed is difficult, because the
resource requirement very much depends on the similarity of the genomes. The more similar, the more matches
are to be computed and the more resources are required.
Interactive Visualization
The large amount of data delivered by comparative
genomics requires a visualization. CoCoNUT comes with
an interactive visualization tool called VisCHAINER. This
displays dot plots of the comparison results (with zoom
and select capabilities). In contrast to established dot-plot
tools like DOTTER [9] or Gepard [10], VisCHAINER can
automatically display dot plots of multiple genome comparisons. That is, all two-dimensional projections of the
common regions among multiple genomes are plotted.
Moreover, VisCHAINER has functionalities specific to the
anchor-based strategy.
Related work
Whole genome comparison
In [2], Treangen and Messeguer presented a classification
of genome comparison tools. In this classification, CoCoNUT falls into the category of large-scale multiple genome
comparison tools. Some of these (ABA [11], Mulan [12],
TBA [13], Mauve [14], and M-GCAT [2]) can deal with
genome rearrangements. In what follows, we briefly compare our system with these software-tools except for
Mulan, because Mulan is a network server based on TBA.
ABA and TBA employ a progressive alignment strategy,
i.e., they construct local alignments from pairwise comparisons, possibly following a "guide" tree. Both tools use
BLASTZ [15] to identify hits (small regions of similarity)
between pairs of genomes, and then they combine these
hits into larger alignment blocks. Therefore, these tools
can detect similarities that must not necessarily be present
http://www.biomedcentral.com/1471-2105/9/476
in all of the genomes under consideration. This is an
advantage over Mauve, M-GCAT, and CoCoNUT. On the
other hand, both tools suffer from large running times
even for short sequences; see [11,13] and [[2], page 2].
Mauve and M-GCAT use maximal unique matches as fragments. By definition, these matches occur only once in
each genome (or chromosome). As a consequence, for
genomes containing large-scale duplications (e.g., plants
genomes), the number of fragments may be very small
and thus no reasonable alignment can be produced. In
fact, this shortcoming was already mentioned in [16] and
[2].
Identification of large genomic duplications
There are many software tools for locating repeated segments in large genomic sequences; see [17] for a review.
CoCoNUT is different from other tools because it can efficiently locate large genomic duplications (such as di- and
tetraploidization). These are difficult to detect as they (1)
are very long, (2) may be interrupted by large gaps (due to
deletions or insertions of other repeat types), and (3)
might have undergone rearrangement events. As an example, we show how CoCoNUT can locate the genome
duplications in chromosome I of A. thaliana.
cDNA/EST mapping
Standard dynamic programming algorithms cannot be
used for high throughput mapping of cDNA sequences
because they have a quadratic running time. Hence, heuristic algorithms have been developed for this task. All of
them either use a seed-and-extend strategy or a chaining
strategy.
Tools applying the seed-and-extend strategy include,
among others, BLAT [18] and MGAlign [19]. These tools
differ in the type of seeds they use and in the way the seeds
are computed. Tools using the chaining based strategy
include, among others, GenomeThreader [20], GMAP
[21] and the program by Shibuya and Kurochkin [3].
The work of [3] is worth mentioning because it uses suffix
trees for the computation of exact matches and introduces
a geometric-based chaining algorithm. In [4], the algorithm of [3] was further refined by using enhanced suffix
arrays instead of suffix trees, by using maximal exact
matches instead of maximal unique matches, and by
using a chaining algorithm that is less complicated and
more suitable for cDNA mapping. Its sensitivity/specificity was compared to the program BLAT (the most popular program for cDNA mapping applying the k-mer based
seed-and-extend strategy). It was shown in [4] that the
chaining strategy is more specific than the seed-andextend strategy, while achieving the same level of sensitivity. Moreover, the algorithm obviates the need for mask-
Page 3 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
http://www.biomedcentral.com/1471-2105/9/476
ing the genomes, while unmasked sequences can often
not be processed by k-mer based seed-and-extend strategies, as the number of k-mers is too large. Seed-and-extend
strategies based on maximal exact matches (e.g., as implemented in [22]) may be able to process unmasked,
sequences, but for cDNA mapping they are less specific
than the chaining approach, see above. In CoCoNUT, the
algorithms and software prototypes presented in [4] were
rewritten and extended by additional splice site detection
methods.
Implementation
Computing the fragments
For i, 1 i k, let Si = Si[1..ni] denote a string of length ni
= |Si|. In our applications, Si is a long DNA sequence (e.g.,
a complete chromosome). Furthermore, Si[li... hi] is the
substring of Si starting at position li and ending at position
hi. A fragment consists of substrings S1[l1... h1], S2[l2...
h2],..., Sk[lk... hk] that are "similar". If S1[l1... h1] = S2[l2... h2]
= ... = Sk[lk... hk] (i.e., the substrings are identical), then
such a fragment is called exact fragment or multiple exact
match. A multiple exact match is called left maximal, if Si[li
- 1] Sj[lj - 1] for some i j, and it is called right maximal
if Si[hi + 1] Sj[hj + 1] for some i j. A multiple maximal
exact match (multiMEM for short) is left and right maximal. In other words, the constituent substrings cannot be
simultaneously extended to the left and to the right.
1
S1
4
2
5
7
A multiMEM is called rare if the constituent substrings
Si[li... hi] appear at most r times in Si, where 1 i k and r
is a natural number specified by the user. We call the value
r the rareness value. A multiMEM is called unique if r = 1. In
this case, we speak of a multiple maximal unique match or
multiMUM for short. Note that the maximal unique
matches used in the program MUMmer can be viewed as
multiMEMs with rareness value r = 1 for k = 2 sequences.
If character mismatches, deletions, or insertions are
allowed in the constituent substrings of a fragment, then
we speak of a non-exact fragment. The programs DIALIGN
[23] and LAGAN [24] compute fragments with substitutions, and the program BLASTZ [15] (used in PipMaker
[25]) computes fragments with substitutions, insertions,
and deletions.
Our system can use any kind of fragments, provided that
they are output in the CoCoNUT format. For our experiments, we use (rare) multiMEMs because these are easier
and faster to compute than non-exact matches. Using
spaced seeds [26] for pairwise comparisons would also be
reasonable. Note that multiMEMs of minimum length k
achieve the same level of sensitivity as approximate
matches computed by extending seeds of length k. Moreover, the number of multiMEMs is much smaller than the
number of k-mers, which results in faster processing and
better specificity.
t
S2
6
3
7
4
1
S2
5
3
2
3
7
6
2
(a)
o
(b)
S1
Figure 1
Fragment representation and global chaining
Fragment representation and global chaining. The fragments in (a) can be represented, as shown in (b), by hyper-rectangles in a k-dimensional space, where k is the number of genomes, and each axis corresponds to one genome. In this example, the optimal global chain of colinear non-overlapping fragments consists of the fragments 2, 3, and 7.
Page 4 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
http://www.biomedcentral.com/1471-2105/9/476
Geometrically, a fragment f of k genomes can be repre-
score(C) =
corner points beg(f) and end(f). beg(f) is the k-tuple (l1,
l2,..., lk), where l1,..., lk are the start positions of the fragments in S1,..., Sk. end(f) is the k-tuple (h1, h2,..., hk), where
h1,..., hk are the end positions of the fragments in S1,..., Sk,
respectively; see Figure 1. With every fragment f, we associate a positive weight f. weight .ޒThis weight can, for
example, be the length of the fragment (in case of exact
fragments) or its statistical significance. In our system, in
the default case, we use the fragment length as weight.
Chaining the fragments
We define a binary relation << on the set of fragments by
f << f' if and only if end(f).xi <beg(f').xi for all i, 1 i k. If
f << f', then f precedes f'. Two fragments in a chain are colinear if the order of their respective segments is the same in
all genomes. In the pictorial representation of Figure 1(a),
two fragments are colinear if the lines connecting their
segments are non-crossing (e.g., the fragments 1 and 4 are
colinear, while 1 and 2 are not).
A chain of colinear non-overlapping fragments (or chain
for short) is a sequence of fragments f1, f2,..., fᐍ such that fi
<< fi+1 for all 1 i < ᐍ. The score of a chain ΌC = f1, f2,..., fᐍ is
S
1
4
3
1
5
7
8
−1
k
sented by a hyper-rectangle in ≥0 with the two extreme
∑
f i ⋅ weight −
i =1
2
i +1 , f i )
i =1
where g(fi+1, fi) is the cost of connecting fragment fi to fi+1
in the chain. We will call this cost gap cost. The gap cost
implemented in the current version of CoCoNUT is
defined as follows. For two fragments f << f',
k
g( f ′, f ) =
∑ beg( f ′).x
i
− end( f ).x i
i =1
Given n weighted fragments from two or more genomes,
the following problems can be defined:
• The global chaining problem is to determine a chain of
maximum score starting at the origin 0 = (0,..., 0) and ending at the terminus point t = (n1 + 1,..., nk + 1). Such a
chain will be called optimal global chain. Figure 1 shows a
set of fragments and an optimal global chain.
• The local chaining problem is to determine a chain of maximum score 0. Such a chain will be called optimal local
chain. It is not necessary that this chain starts at the origin
or ends at the terminus. Figure 2 shows a set of fragments
and an optimal local chain.
• Given a threshold T, the all significant local chains problem
is to determine all chains of score T. Obviously, the all
S
2
6
∑ g( f
6
9
4
3
1
S2
1
4
6
7
8
8
9
5
7
2
(a)
o
(b)
S
1
Figure 2
Local chaining
Local chaining. Computation of an optimal local chain of colinear non-overlapping fragments. The optimal local chain is composed of the fragments 1, 4, and 6. When the start point of fragment 9 is scanned, a range maximum query searches for a fragment of highest score occurring in the shaded region.
Page 5 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
significant local chains problem is a generalization of the
local chaining problem.
In a solution to the all significant local chains problem,
some chains can share one or more fragments, composing
a cluster of fragments. In the example of Figure 2, the local
chains Ό1, 3, 6 and Ό1, 4, 6 share the fragments 1 and 6,
yielding the cluster Ό1, {3, 4}, 6. The cluster Ό7, {8, 9} represents two local chains Ό7, 8 and Ό7, 9. To reduce the output size, we report the clusters and from each cluster we
report a local chain of highest score as a representative
chain of this cluster. This representative chain is a significant local chain. In the example, the representative chains
are Ό1, 4, 6 and Ό7, 8.
Our chaining algorithm is not heuristic, i.e., it computes
an optimal chain w.r.t. the given constraints. It is based on
the line-sweep paradigm and uses range maximum queries
(RMQ) with activation. During the line sweep procedure,
the fragments are scanned w.r.t. their order in one of the
genomes. If an end point of a fragment is scanned, then it
is activated. If a start point of a fragment is scanned, then
we connect it to an activated fragment of highest score
occurring in the rectangular region bounded by the start
point of the fragment and the origin. This highest-scoring
fragment is found by an RMQ, see Figure 2(b). For more
details about our chaining algorithms; see [6,7,27]. In
practice, variations of the basic algorithms or certain preprocessing steps are required. Because these variations are
specific to each application, we handle them in detail in
the respective sections.
The data flow in CoCoNUT
Figure 3 summarizes the data flow and typical usage of
CoCoNUT.
The input to the system is a set of genomic sequences. For
genome analysis, all chromosomes are input. Usually,
each chromosome is given as a single FASTA file and one
compares a combination of chromosomes at a time.
(However, CoCoNUT can also compare two multi-chromosomal or draft genomes in a single run; this will be
explained in the following section.) Inversions can be
taken into account by considering the backward strands of
some chromosomes and the forward strands of the other
chromosomes. In CoCoNUT, all combinations of orientations are considered by default, but the user has the
option to restrict the comparison to the forward strands
only. For repeat analysis, the input is a single genomic
sequence in single or multiple FASTA files. For cDNA
mapping, the user submits one genomic sequence and
cDNA sequences in a multiple FASTA file.
Each comparison consists of a fragment generation phase
and a chaining phase. The fragments are usually generated
http://www.biomedcentral.com/1471-2105/9/476
by the program ramaco [5], which computes rare multiMEMs using an enhanced suffix array of one of the chromosomes. Alternatively, if one does not expect too many
repeats in the considered sequences (and therefore no
explosion in the number of multiMEMs), it may not be
necessary to specify a rareness parameter. In such a case,
one can use the program multimat [28] to compute the
multiMEMs. While ramaco can also compute multiMEMs
(without rareness parameters), multimat does this more
efficiently at the expense of a larger index size (as it
requires to build an enhanced suffix array of all chromosomes.) The program CHAINER carries out the chaining
phase and delivers all significant local chains, where each
chain corresponds to a region of similarity.
Upon completion of these two phases, CoCoNUT provides the functionality to
• visualize the resulting chains in 2D plots, or
• compute an alignment on the nucleotide level for each
chain (by the strategy described in the next section) and
filter out chains with low sequence identity, or
• compute and visualize regions of high similarity, or
• perform a second chaining step with the chains of the
first chaining step as new fragments.
To repeat parts of the comparison with different parameters, the user can re-start the comparison at four points:
(1) after the index generation, (2) after the fragment generation, (3) after the first chaining step, and (4) after the
alignment. For example, if the user has already computed
the fragments and chains, then he/she could run the alignment program later, based on the stored fragments and
chains. He/she could also repeat the chaining step with
the stored fragments, but with different parameters.
Results and discussion
Finding regions of high similarity
The first two phases of CoCoNUT are (1) the computation
of fragments (multiMEMs) and (2) the computation of all
significant local chains. These chains correspond to
regions of high similarity, but the reader should keep in
mind that the regions depend on the parameters with
which the program was called. This behavior bears resemblance to the widely used program BLAST [29] for comparing DNA or protein sequences. A BLAST search enables
a researcher to compare a query sequence with a database
of sequences, and identify sequences in the database that
are similar to the query sequence. The sequences delivered
by BLAST depend on the parameters with which the program was called, and the parameter choice is very important. The following scenario shows a typical usage of
Page 6 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
http://www.biomedcentral.com/1471-2105/9/476
> gi genome 1
GCCGCGCGT
CGGT
CG..
Input genomes
Fragment Generation Phase
Construct Index
use index & proceed
(.p[p|m]+) Compute Fragments
use fragments & proceed
Chaining Phase
(.chn, .ccn, .stc)
Chaining
use chains & proceed
Post−processing Phase
(.ps)
Alignment
2D Plots
(.align)
Filter alignment with
identity < T
use alignment & proceed
(.align.filtered, .chn.filtered,
.ccn.filtered, .stc)
2D Plots
(.filtered*.ps)
Recursive Chaining
Chaining
2D Plots
Figure 3
CoCoNUT block diagram
CoCoNUT block diagram. The data-flow in CoCoNUT for the task of comparing multiple genomes. The user can repeat
the comparison starting in any of the four phases (marked as use index, use fragments, use chains, and use alignment) and proceed
further in the comparison. The extensions of the files produced in each step are shown in brackets; see also Table 1.
Page 7 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
BLAST. Following the sequencing of a DNA segment of
functional importance in a certain species, a scientist will
typically perform a BLAST search against genomes of
related species. It is then a research hypothesis that the
sequences identified by the search are in fact homologous
(in the phylogenetic sense) to the query sequence. However, because sequence similarity may arise from different
ancestors (e.g., short sequences may be similar by chance
or sequences may be similar because both were selected to
bind to a particular protein, such as a transcription factor)
this working hypothesis must be corroborated. The same
is true for CoCoNUT. The regions of high similarity identified by CoCoNUT may or may not be homologous, and
an alignment of these may or may not be meaningful.
Other authors use the terms synteny or syntenic regions
instead of regions of high similarity. In genetics, synteny
describes the physical co-localization of genetic loci on
the same chromosome within an individual or species,
while shared synteny describes preserved co-localization of
genes on chromosomes of related species. The term shared
synteny is sometimes also used to describe preservation of
the precise order of genes on a chromosome passed down
from a common ancestor, but many geneticists reject this
use of the term. Passarge et al. [30] wrote: "We believe
molecular biologists ought to respect the original definition of synteny and its etymological derivation, especially
as this term is still needed to refer to genes located on the
same chromosome. We recognize the need to refer to gene
loci of common ancestry. Correct terms exist: 'paralogous'
for genes that arose from a common ancestor gene within
one species and 'orthologous' for the same gene in different species." However, in our context, the term orthologous
regions cannot be used either, simply because we cannot
generally infer orthology from sequence similarity alone
(nevertheless, shared synteny in the gene order sense is
one of the most reliable criteria for establishing the
orthology of genomic regions in different species).
Because there is no "right word" yet, we will use the term
regions of high similarity, although we feel that this term
does not have the right connotation (especially if one uses
a second chaining step, see below).
In contrast to global alignment tools (e.g., MGA [31]),
which assume global similarity, CoCoNUT can cope with
genome rearrangements. It uses the three step approach
depicted in Figure 3. The user can specify several parameters in the CoCoNUT system and a reasonable parameter
choice is very important.
In the fragment generation phase, the parameter "minimum fragment length" can be set by the user, but it is usually a good idea to first use the default parameter, which
is estimated based on the count statistics; see [32]. Furthermore, the user can specify the rareness value of a frag-
http://www.biomedcentral.com/1471-2105/9/476
ment. The rareness parameter depends on the number of
"important to see" repeated segments in the genomes, an
information that cannot be determined automatically.
Therefore, the user has to test different rareness values. In
our experiments, we found that 5 is a reasonable rareness
value to start with.
Only fragments (multiMEMs) whose lengths exceed the
minimum fragment length are generated. On the one
hand, if the minimum fragment length is too small or the
rareness value is too large, a large number of fragments is
generated. On the other hand, if the minimum fragment
length is too large or the rareness value is too small, too
few fragments for a meaningful comparison may be generated.
In the chaining phase, CHAINER solves the all significant
local chains problem. In addition, the user can specify an
upper bound on the gap length between fragments in a
chain. That is, two fragments can only be connected in a
chain if the number of characters separating them does
not exceed this user-defined maximum gap-length, which
is identical for all sequences. This option prevents unrelated fragments from extending a chain. The user can also
filter out chains based on their length or their score. (This
filtration can also be carried out using the visualization
tool VisCHAINER.)
The fragments of a local chain represent anchors that form
the basis of the local alignment. Only the regions between
them must be aligned on the nucleotide level. If one compares just two genomes, the regions between the anchors
are aligned by a global alignment algorithm based on
standard dynamic programming. For more than two
genomes, the program CLUSTALW [33] is used to align
these regions. This strategy is also used in the multiple global alignment tool MGA [31], albeit for a single global
chain.
For closely related genomes, it is recommended to
increase the minimum fragment length. This usually does
not affect the sensitivity of the procedure. Moreover, a single chaining step is usually enough to identify regions of
high similarity.
For distantly related genomes, the minimum fragment
length needs to be reduced to increase the sensitivity of
the comparison. This, however, has the effect that many
fragments appear by chance. To identify regions of high
similarity in the "noisy" fragment set, it is important to
use a double-chaining strategy. In the first chaining phase,
one computes chains of multiMEMs with a stringent gap
length. In the second chaining phase, the chains resulting
from the first chaining step are considered as new fragments. Moreover, the gap length is increased. In this way,
Page 8 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
it is possible to remove the noise without missing relevant
fragments.
We exemplify these two strategies by comparing three
related bacterial genomes and three distantly related
mammalian chromosomes. The experiments were carried
out on a Sun Sparc V processor with 1015 MHz and 6GB
RAM.
Comparing closely related bacterial genomes
We compared the three bacterial genomes E. coli, S. sonnei,
and S. boydii (accession numbers are NC_000913,
NC_007384, and NC_007613, respectively). As a reference, we first compared the proteomes of the three
genomes and obtained the best hit of all proteins encoded
in three genomes. This comparison was performed using
the Comprehensive Microbial Resource web-based comparison tools [34]. We used the option that reports the best
hits for each protein. Figure 4 (lower left) shows the projection E. coli vs. E. sonnei in which the hits that appear on
a vertical or horizontal line correspond to repeated segments in E. coli or E. sonnei encoding the same protein. (By
searching the non-redundant protein database using
BLAST, we found that the repeats are insertion elements.)
We used CoCoNUT to compare the three genomes on the
DNA level. The minimum length of the fragments was
between 15 to 18 and the rareness value was between 5
and 20. (The default values for minimum length and rareness are 18 and 5, respectively.) In the chaining step, the
maximum gap length was set to 1000 bp. For minimum
length 15 and rareness value 20, we obtained the best
results w.r.t. the reference comparison on the protein
level. All chains of length less than 500 bp were filtered
out. As can be seen in Figure 4, the regions containing
orthologs are covered by the local chains. The remaining
repeated segments visible in the DNA plot, but not in the
protein plot, are insertion elements that do not encode a
protein.
In this comparison, there was no need for a second chaining step because regions of high similarity could easily be
identified. All alignments derived from the chains show
an identity of more than 70%. The whole experiment,
including the computation of the multiple alignment,
took a few minutes.
We applied the program Mauve to the same three bacterial
genomes. Mauve uses fragments of the type multiMUMs,
and as shown in Figure 4, is not able to identify repeated
segments.
Comparing distantly related mammalian chromosomes
The X-chromosomes of human, mouse, and rat were compared by CoCoNUT. We used masked and unmasked
http://www.biomedcentral.com/1471-2105/9/476
sequences of the latest assemblies of the three genomes.
We used the human genome version 18 NCBI build 36,
the mouse genome version 9 NCBI build 37, and the rat
X-chromosome version 4 RGSC v3.4 from the UCSC
genome browser. The accession numbers of the X-chromosomes of human, mouse, and rat are NC_000023 and
NC_000086, and NC_005120, respectively.
As a reference, we used the BioMart system [35,36] to
retrieve all orthologous proteins among the three X-chromosomes. The human X-chromosome was taken as a reference. Figure 5 (upper left) shows the plot for the human
vs. mouse X-chromosome. (Projections w.r.t. other pairwise genomes are not shown.) Each point in this plot corresponds to a protein shared in all X-chromosomes with
identity larger than 25%.
We also compared our results with Bourque et al. [37],
who identified synteny blocks and used them to compute
genome rearrangement scenarios. (A synteny block in the
Bourque et al. paper is composed of non-repeated colinear regions of high similarity. As discussed above, some
readers may reject the use of the term synteny block, but we
will stick to the original terminology used by Bourque et
al. [37].) The synteny blocks were identified by first combining the results of all pairwise genome comparisons,
and then by verifying these using all pairwise proteome
comparisons.
We ran CoCoNUT using different parameters, starting
with the default values. The parameters that produced
good results were as follows: The minimum fragment
length was 20 and the rareness value was 10. Furthermore,
the gap length between two fragments in a chain was set
to 600 bp and chains with length less than 80 bp were filtered out. Figure 5 (upper right) shows the projections of
the resulting chains w.r.t. the human and mouse chromosomes. (Other projections are not shown.)
Although the results show that the regions containing the
orthologous proteins are covered by CoCoNUT local
alignments, it is difficult to automatically identify large
regions of high similarity by visual inspection. Therefore,
we performed a second chaining step. In this step, the gap
length between two fragments was 500 Kbp, and chains
with length less than 300 Kbp were filtered out. The resulting chains were considered to be the regions of high similarity. The parameters were chosen to mimic the strategy
of [37]. Figure 5 (lower left) depicts the results of the second chaining step w.r.t. the human and mouse chromosomes. (Other projections are not shown.) Table 1 in
Additional file 1 lists the exact coordinates of the regions
of high similarity.
Page 9 of 17
(page number not for citation purposes)
BMC Bioinformatics 2008, 9:476
http://www.biomedcentral.com/1471-2105/9/476
5e+06
5e+06
4.5e+06
4.5e+06
4e+06
4e+06
3.5e+06
3.5e+06
3e+06
3e+06
2.5e+06
2.5e+06
2e+06
2e+06
1.5e+06
1.5e+06
1e+06
1e+06
500000
500000
0
0
500000
1e+06
1.5e+06
2e+06
2.5e+06
3e+06
E. coli (x-axis) vs. S. sonnei
5e+06
3.5e+06
4e+06
4.5e+06
5e+06
©
4e+06
1e+06
1.5e+06
2e+06
2.5e+06
3e+06
3.5e+06
4e+06
4.5e+06
5e+06
(y-axis)
4.5e+06
©
4e+06
©
3.5e+06
3e+06
3e+06
©
2.5e+06
2.5e+06
2e+06
2e+06
1.5e+06
1.5e+06
1e+06
1e+06
500000
500000
0
500000
5e+06
©
3.5e+06
0
E. coli (x-axis) vs. S. boydii
©
4.5e+06
0
(y-axis)