-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathabby_10_detecting_800418.pdf.txt
1683 lines (1229 loc) · 57.1 KB
/
abby_10_detecting_800418.pdf.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests</title>
<meta name="Subject" content="BMC Bioinformatics 2010, 11:324. doi: 10.1186/1471-2105-11-324"/>
<meta name="Author" content="Sophie S Abby, Eric Tannier, Manolo Gouy, Vincent Daubin"/>
<meta name="Creator" content="FrameMaker 8.0"/>
<meta name="Producer" content="Acrobat Distiller 7.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<pre>
Abby et al. BMC Bioinformatics 2010, 11:324
http://www.biomedcentral.com/1471-2105/11/324
Open Access
METHODOLOGY ARTICLE
Detecting lateral gene transfers by statistical
reconciliation of phylogenetic forests
Methodology article
Sophie S Abby, Eric Tannier, Manolo Gouy and Vincent Daubin*
Abstract
Background: To understand the evolutionary role of Lateral Gene Transfer (LGT), accurate methods are needed to
identify transferred genes and infer their timing of acquisition. Phylogenetic methods are particularly promising for this
purpose, but the reconciliation of a gene tree with a reference (species) tree is computationally hard. In addition, the
application of these methods to real data raises the problem of sorting out real and artifactual phylogenetic conflict.
Results: We present Prunier, a new method for phylogenetic detection of LGT based on the search for a maximum
statistical agreement forest (MSAF) between a gene tree and a reference tree. The program is flexible as it can use any
definition of "agreement" among trees. We evaluate the performance of Prunier and two other programs (EEEP and
RIATA-HGT) for their ability to detect transferred genes in realistic simulations where gene trees are reconstructed from
sequences. Prunier proposes a single scenario that compares to the other methods in terms of sensitivity, but shows
higher specificity. We show that LGT scenarios carry a strong signal about the position of the root of the species tree
and could be used to identify the direction of evolutionary time on the species tree. We use Prunier on a biological
dataset of 23 universal proteins and discuss their suitability for inferring the tree of life.
Conclusions: The ability of Prunier to take into account branch support in the process of reconciliation allows a gain in
complexity, in comparison to EEEP, and in accuracy in comparison to RIATA-HGT. Prunier's greedy algorithm proposes a
single scenario of LGT for a gene family, but its quality always compares to the best solutions provided by the other
algorithms. When the root position is uncertain in the species tree, Prunier is able to infer a scenario per root at a
limited additional computational cost and can easily run on large datasets.
Prunier is implemented in C++, using the Bio++ library and the phylogeny program Treefinder. It is available at: http://
pbil.univ-lyon1.fr/software/prunier
Background
The systematic reconstruction of molecular phylogenies
based on the diversity of genes found in complete
genomes reveals an unforeseen degree of incongruence
among gene trees. Different reasons, either biological or
methodological, explain this diversity of patterns. First,
evolutionary mechanisms such as gene duplication and
loss, lateral gene transfer (LGT) and incomplete lineage
sorting generate gene histories that deviate from that of
species [1]. In unicellular organisms, and particularly
Bacteria and Archaea, most of the real phylogenetic conflict is likely the result of LGT [2,3]. On the other hand,
* Correspondence: [email protected]
1
Université de Lyon; Université Lyon 1; CNRS; INRIA; UMR 5558, Laboratoire de
Biométrie et Biologie Evolutive, 43 boulevard du 11 novembre 1918, F-69622
Villeurbanne, France
Full list of author information is available at the end of the article
the reconstruction of gene histories based on sequence
alignments is not trivial, and many artifacts are known
which produce aberrant phylogenies due to stochastic
effects or inadequate models of sequence evolution [4-7].
A challenge in the understanding of the patterns and processes of genome evolution is therefore to sort out these
different sources of conflict.
The question of identifying LGTs based on phylogenies
typically applies to the following data: first, a gene phylogeny, characterized by an unrooted tree topology, with
branch lengths and statistical support for internal
branches; and second, some knowledge of the evolutionary relationships among the organisms represented in
this tree, ideally a rooted species phylogeny. Events of
LGT can then be invoked to explain the topological discrepancies between the two trees. Among the various
approaches that have been proposed to resolve the prob-
© 2010 Abby et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Abby et al. BMC Bioinformatics 2010, 11:324
http://www.biomedcentral.com/1471-2105/11/324
lem of tree reconciliation, the MAF (maximum agreement forest), and the closely related SPR (subtree pruning
and regrafting) arguably represent the most appropriate
models for the replacement of genes by LGT. The MAF
problem consists in finding the smallest number of edges
to cut in both trees in order to obtain two identical "forests" of rooted subtrees [8,9]. The SPR problem also corresponds to minimizing the number of subtrees to cut in
a tree, but adds extra complexity by searching for the
optimal place to regraft them. In the case of a rooted species tree, both approaches are equivalent to minimizing
the number of LGTs that occurred in the gene family.
These problems are known to be computationally difficult, but several algorithms have been proposed, notably
to efficiently address the SPR problem. For instance,
Than and Nakhleh have proposed a decomposition
approach, implemented in the RIATA-HGT program
[10,11], which identifies regions of the tree where the
conflict can be resolved independently, and thus significantly reduces the complexity of the SPR reconciliation in
many cases. EEEP also implements such a decomposition
[12], and adds the possibility to restrict the type of SPR
moves to those that immediately reduce the discordance
among trees.
Obviously, an incorrectly reconstructed gene tree will
lead to the inference of erroneous LGTs. It is therefore
essential to take into account all information at hand on
the reliability of the observed topological conflict, such as
the length and support of all branches. Methods of LGT
detection based on topological comparisons sometimes
propose ways to incorporate the statistical information of
the gene tree [11,13,12]. RIATA-HGT [11] first performs
a purely topological reconciliation that proposes a collection of LGT scenarios. Each transfer is associated with a
value which depends on the statistical support of the conflict it resolves in the gene tree, and the user can choose
to ignore LGTs under a given threshold. EEEP [12] uses a
different approach in which internal branches of the gene
tree having a statistical support below a given threshold
are collapsed a priori, before the trees are reconciled.
The two approaches described above have been implemented and tested on simulated datasets [11,12]. The
main concern in both simulation setups was to evaluate
time performance and the ability of the methods to
recover the number of simulated LGTs in a gene tree.
However, biologists are usually interested not only in the
number of LGTs but more importantly in identifying the
actual events of gene transfer, i.e. the exact set of species
that are "misplaced" in the gene tree. Both approaches
generally propose a number of evolutionary scenarios,
but their accuracy has not been evaluated so far.
Here, we use simulations to explore the ability of different methods to detect events of transfers based on gene
Page 2 of 13
trees reconstructed from sequences. We introduce a
greedy algorithm called "Prunier" [14], which uses information on topology, statistical support and branch
lengths to quickly identify a maximum statistical agreement forest (MSAF) that corresponds to a most parsimonious scenario of transfer. Prunier uses a customizable
statistical agreement function. Two implementations of
this function were tested: a fast one based on branch support, and a more advanced one which uses the expected
likelihood weights (ELW) test [15]. We show that working
on reconstructed trees strongly affects the ability of different methods to identify transfer events. Although all
methods can roughly estimate the number of LGTs that
occurred in a gene history, the accuracy of proposed scenarios varies largely. In comparison with other methods,
Prunier has lower false positive rate, which makes it a
more accurate approach to detect LGTs in almost all simulated situations, especially for complicated gene histories. When tested on biological data of 23 universal
proteins with hypothetical reference trees, Prunier
revealed high rates of LGT, in particular in genes that are
known to be prone to transfer. However, the degree of
conflict in these genes raises concern on the approach
used to reconstruct universal trees.
Results
Prunier Algorithm
Objectives
The phylogenetic detection of lateral gene transfers relies
on differences between a gene tree TG, with branch
lengths and support values, and a reference species tree
TS on the same set of species S. Our method takes both
trees as input. For clarity, we will suppose that the species
tree is rooted, though the method offers the possibility of
using an unrooted reference tree as input, with a reasonable increase in complexity (see Material and Methods).
The input gene tree is always unrooted, and a byproduct
of the procedure is to output a restrained set of root locations on the gene tree. Here we consider that topological
differences between TS and TG can result either from LGT
or from stochastic effects in the process of gene tree
reconstruction. Therefore, an agreement function is
needed to decide whether observed topological differences among trees are significant or not. Examples of
such functions are maximum-likelihood tests comparing
trees given a sequence alignment: KH (Kishino-Hasegawa
[16], SH (Shimodaira-Hasegawa [17], AU (approximately
unbiased [18] or ELW (expected likelihood weights [15]
tests. Because these tests all take unrooted trees as input,
we used additional criteria to be able to handle the disagreement for the position of the root when necessary
(see Material and Methods). We also propose a faster
alternative which simply considers the statistical support
Abby et al. BMC Bioinformatics 2010, 11:324
http://www.biomedcentral.com/1471-2105/11/324
of internal branches. Two (sub-)trees are said to disagree
if they fail the agreement function.
If a species tree and a gene tree disagree (based on the
agreement function), the objective is to decompose them
into a statistically agreeing forest. For a subset Si of the set
of species S, we note T(Si) the subtree of T containing
exactly Si where internal nodes of degree 2 are contracted.
We aim at partitioning the set of species Sinto a minimum number of subsets S1 ,...,Sk such that for each subset
Si, the two subtrees TS(Si) and TG(Si) agree, and all TS(Si)
as well as all TG(Si) are disjoint. We call this partition a
maximal statistical agreement forest (MSAF), in reference to the maximal agreement forest [8,9], which is the
particular case where the agreement function gives a positive answer only if the two compared (sub-)trees are
homomorphic (have the same topology).
Among S1 ,...,Sk , only one subset contains the root of
the species tree. This is the non-transferred "backbone" of
the tree as the root is its most ancient node and cannot
acquire a gene from one of its descendants. The other
TS(Si) of the forest are interpreted as LGTs that occurred
in the last common ancestor of Si,. Their number (k-1) is
the minimum number of transfer events that have to be
invoked to explain the significant differences between the
two trees.
Greedy Algorithm
We use a greedy procedure to approach the maximum
statistical agreement forest of two trees (see Fig. 1). Conflicting edges of the gene tree are those inducing bipartitions that are not in the species tree. A transfer event is
characterized by a subtree TG(Si) that is in agreement
with the corresponding TS(Si) but whose position is conflicting. In other words, an LGT results in a non conflicting edge whose removal decreases the conflict among
trees. A non conflicting edge e cuts TS in two subtrees,
one containing the root, the other defining a common
clade Si. If TG(Si) and Ts(Si) agree, the "conflict score" of e
is a function (detailed in Material and Methods) of the
number and support of the conflicting edges that are
eliminated if e is removed from the tree. We consider the
non conflicting edge with the highest conflict score as
defining the most likely transfer. The procedure (Fig. 1)
can be briefly described as follows:
• While the gene tree and species tree disagree:
▪ Remove the edge defining a statistically agreeing
common clade with maximal conflict score
Often, several edges have the same conflict score. In
this case, a different criterion is used to choose what edge
to cut among top rating edges. We first use the alignment
of the gene family to estimate branch lengths on both the
gene and reference trees using Treefinder [19]. Then, for
each candidate edge, we compute the difference between
Page 3 of 13
its lengths in both trees and cut the one with the highest
difference. We hence remove the branch that is most
affected when constrained to its reference position. In
practice, branch lengths are estimated only once. This
step appears to be necessary in most gene families and it
accounts for most of the computing time in the "fast"
implementation of Prunier.
Because only non conflicting edges are removed, the
procedure always produces two subtrees, one of which is
in agreement by construction, and the other (which contains the root) can be used recursively as an input of the
function. Eventually, a statistical agreement forest is
reached and each component of this forest which does
not contain the root of TG is interpreted as originating
from a transfer event. A scenario of transfer can readily
be constructed from the comparison of TG with the forest.
The algorithm and its implementation are fully detailed
in the Material and Methods section. In particular, the
definition of the agreement function, conflict score and
the position of the root are discussed.
Accuracy on simulated LGTs
We used the procedure described in [20] (see Material
and Methods for details) to generate 330 gene trees and
corresponding sequence alignments with increasing
number (from 0 to 10) of LGTs. Gene trees were simulated with subtree pruning and regrafting (SPR) operations from a 40-taxa rooted reference tree. Gene
sequences were simulated along gene trees with variable
rates of evolution among branches and sites (See Material and Methods for details). Maximum-likelihood (ML)
trees were reconstructed using the resulting alignments.
We used a different model of substitution than that used
for sequence simulations to increase the chance of topological conflict resulting from reconstruction artifacts.
All results of the simulation procedure are available as
supplemental
information
(additional
file
1,
"simulated_dataset_prunier.zip").
We compared the performances of RIATA-HGT [11]
(available in the PhyloNet package [21], EEEP [12] and
Prunier based on this simulated dataset.
Accuracy of detected transfer events
Rather than focusing on the accuracy of the number of
detected LGTs as in previous simulation studies
[10,12,11], we concentrated on the comparison of
inferred scenarios with the true (simulated) one. Among
the proposed LGT events, we counted the number of true
and false positives (respectively TP and FP) for each
method. In subsequent comparisons, all three methods
used the same threshold for branch support: for EEEP,
branches under this threshold are collapsed a priori; for
RIATA-HGT, among all inferred transfers only those
Abby et al. BMC Bioinformatics 2010, 11:324
http://www.biomedcentral.com/1471-2105/11/324
Page 4 of 13
! ""
#
$%
&
'
(
("
)!
$%
#
'
&
("
)!
*+* , - +
.
!
Figure 1 Prunier algorithm: example of a Prunier run. Example of reconciliation of an unrooted gene tree TG with a rooted species tree TS by
searching for the maximum statistical agreement forest (MSAF). In the gene tree, blue branches are those common to the two trees whereas red
branches indicate conflicting edges, i.e., those found in TG but not in TS. Support values are shown for conflicting edges. In this example, two rounds
are needed to reconcile the two trees. The agreement function "Agree" corresponds here to the "fast" version of Prunier: a gene (sub-)tree is considered in statistical agreement with the species (sub-)tree if no conflicting edge above 80 exists. At each round, clades corresponding to common edges
(blue branches) are ranked by decreasing scores. This score reflects the conflict (combination of the support values) removed when the edge is cut
(symbolized by "Rm{clade}"). The highest scoring subtree candidate to transfer is removed if it is in agreement with the species tree. The output of
Prunier is a statistical agreement forest (SAF), composed of a reconciled backbone subtree (non-transferred sequences: {ABFGHIK}) and as many reconciled subtrees (transferred sequences) as lateral gene transfers (LGT). In this example, two LGT events are inferred: {J} and {CDE} have been transferred.
crossing a branch above this threshold were retained; and
for Prunier, two trees were in agreement if they had no
conflicting edges above this threshold. Fig. 2 shows the
number of TP (fig. 2A) and FP (fig. 2B) as a function of
the true number of LGT events, with a threshold of 0.90
(see additional files 2 and 3 for results with a 0.60 threshold:
"TP_HGT_multi-scenar_BP60.pdf"
and
"FP_HGT_multi-scenar_BP60.pdf" ). For clarity, the
results of another agreement function based on the ELW
test [15] (Prunier-slow) which gave results very close to
Abby et al. BMC Bioinformatics 2010, 11:324
http://www.biomedcentral.com/1471-2105/11/324
Page 5 of 13
scenario proposed (see Material and Methods for why
not all transfers are detectable). For this reference, the
estimated number of transfers is quite accurate but the
number of FP increases with the number of transfers,
remaining under 2 (fig. 2B). The three methods have different behaviors with increasing number of transfers.
RIATA-HGT performs relatively well at identifying true
transfer events (fig. 2A), but consistently infers false positives, even when no transfers have been simulated (fig.
2B). Surprisingly, this number of FP does not increase significantly with the number of transfers. EEEP is very good
at detecting zero or one transfer, but then increasingly
fails at producing any transfer scenario, with more than
50% of failure for the 10-transfer category. Prunier is
comparable to RIATA-HGT for the detection of TP (fig.
2A), but has lower rates of FP (fig. 2B). The rate of FP
increases with the number of simulated transfers to reach
the same level as RIATA-HGT with 10 transfers.
Prunier-fast are not presented. While Prunier gives a single scenario, both RIATA-HGT and EEEP generally propose several solutions that are equivalent in terms of
number of events. We chose to show the performance of
these programs by recording both the worst (with minimum TP and maximum FP) and best (i.e. with maximum
TP and minimum FP) scenarios.
To obtain an idea of the best possible estimate of the
number of detectable LGTs in our simulations, we ran
RIATA-HGT on the true gene trees (those used for
sequence simulations), and kept as a reference the best
!"
###
$
Figure 2 Accuracy of transfer events detection. A: numbers of true
positive LGTs (TP). B: numbers of false positive LGTs (FP). LGT events detected in maximum-likelihood gene trees are displayed as a function of
the real number of transfers per tree. Results were computed with a
support threshold for topological conflict significance of 0.90. As EEEP
and RIATA-HGT may propose multiple scenarios, grey areas represent
the variability between the worst and the best scenarios. Areas are delimited by blue borders for RIATA-HGT and green borders for EEEP. Prunier (fast version) results are drawn in magenta. The black curves
symbolize the reference boundary: those curves were obtained using
RIATA-HGT (best scenario selected) on true gene trees (before sequence simulation). The green numbers below the true positives area
of EEEP represent the number of unresolved cases (either quoted as
"Unsolved" by the program, or resulting from a program crash). The
blue numbers above RIATA-HGT TP area indicate the two cases for
which the program crashed. A program crash is counted as 0 TP and 0
FP. The grey line shows a relationship 1:1 between the two axes. The
estimated number of transfers by each method is the sum of the
curves shown in A and B.
Many studies trying to resolve controversial phylogenetic
relationships among species have used LGT detection as
a first step to retrieve sets of orthologous genes (e.g., [2226]. In such case, it is essential that all LGT events are
correctly detected in a gene family. We thus measured the
proportion of gene trees in which all events of LGT were
Accuracy of transfer scenarios
Figure 3 Inference of correct complete LGT scenarios. Proportion
of correctly inferred LGT scenarios, i.e. scenarios with 100% TP and 0%
FP as a function of the true number of LGT. Results were obtained with
a support threshold for topological conflict significance of 0.90. As
EEEP and RIATA-HGT may propose multiple scenarios, grey areas represent the variability between the worst and the best scenarios. Areas
are delimited by blue borders for RIATA-HGT and green borders for EEEP. Prunier (fast version) results are plotted in magenta. The black curve
symbolizes the reference boundary was obtained using RIATA-HGT
(best scenario selected) on true gene trees.
Abby et al. BMC Bioinformatics 2010, 11:324
http://www.biomedcentral.com/1471-2105/11/324
correctly identified (100% TP and 0% FP) (Fig. 3, see additional file 4 "Pc_true_LGT_events_BP60.pdf" for results
with a 0.60 threshold). For Prunier and EEEP, the percentage of exact scenarios is high for low numbers of transfers, but drops relatively quickly to zero. For gene families
with three transfers, these methods predict the correct
scenario in about 50% of the cases. For RIATA-HGT, the
proportion of correctly predicted scenarios is low even
with few transfers, owing to its high rate of false positives.
Transferred sequences
The positive predictive value (PPV) is the proportion of
transferred sequences among those predicted to be transferred. The negative predictive value (NPV) is the proportion of non-transferred sequences among those
predicted to be non-transferred. In this evaluation, we do
not present the results for EEEP as its rate of failure with
high rates of transfers precludes fair comparisons among
approaches. We measured the predictive power of
RIATA-HGT, selecting the best among all inferred scenarios, and Prunier for a set of simple agreement functions, defined by thresholds ranging from 0.50 to 0.95
(Fig. 4). We also tested for Prunier another agreement
function which uses comparisons of the likelihood of
entire tree topologies by the ELW test [15] (Prunierslow). All values were computed on the 330 simulated
gene trees.
The NPV is very similar among both methods (between
75% and 80%), and variations of the agreement function
have only little effect on this value, although for all
parameters, RIATA-HGT is slightly better than Prunier.
In contrast, the PPV varies greatly with the threshold,
especially for Prunier, but with consistently higher PPV
than RIATA-HGT. These results can be summarized by
computing the accuracy of the two methods, which is the
proportion of correctly classified sequences (transferred
or not). The accuracy ranges between 73% and 77%
(mean of 74%) for RIATA-HGT, and lies between 77%
and 79% (mean of 78%) for Prunier.
Impact of the root of the species tree on reconstructed
scenarios
Prunier proposes by default a scenario for every possible
root of the species tree (see Materials and Methods). Different rootings of the species tree are expected to give different LGT scenarios, because the choice of a root
constrains the clades that can be transferred. Especially,
the number of transfer events is expected to be different.
We examined the possibility that LGTs could inform on
the true position of the root. For each of the 77 possible
roots of our 40 leaves species tree, we counted the number of inferred LGTs in the 330 simulated gene families.
For instance, the total number of LGTs ranged between
1370 and 1549 with a threshold of 0.90. The true root was
Page 6 of 13
among the two locations with the minimal number of
LGT, the other one being on a neighboring edge. However, a Wilcoxon paired test showed that only 50 among
the 77 possible roots were significantly different from the
best root. This establishes LGT as a potential tool for
rooting species trees but suggests that many gene families
and relatively high rates of transfer are necessary to discern the true root.
Application to a biological dataset
We used the dataset of Brown and colleagues [22] which
contains 23 universal proteins distributed in 45 species
from the three domains of life. This study pioneered a
number of more recent analyses in which LGTs are
searched in order to obtain a set of orthologous genes
that can be concatenated to resolve a specific question
[24-26]. In their article, Brown et al. focused on the elimination of gene transfers among domains, by manually
removing those gene families that supported a nonmonophyletic bacterial domain. Nine gene families were
removed on this criterion, which reduced the dataset to
14 genes. Two different species trees were reconstructed:
a first one based on the whole dataset (23 genes), which
was deemed artifactual due to LGT and a second one
based on the cleansed dataset (14 genes). The two trees
mainly disagreed on the position of the early diverging
bacterial phyla, respectively spirochaetes and hyperthermophiles.
Although detecting transferred genes with a certain reference tree and using the remaining sequences to infer a
tree would be a circular reasoning, it is possible to use our
algorithm to test the hypothesis that a dataset is devoid of
LGT. We ran Prunier (slow and fast version with a threshold of 0.80 and 0.90) using both trees as a reference, and
looked at the number of transfers inferred in the 23
genes. The simulation results showed very similar results
for the slow and fast version of Prunier (for example, the
number of detected transfers by the slow and fast methods and the threshold of 0.80 were correlated with R =
0.97). With real data, we also observed a high correlation
among different agreement methods (the best being
between slow and fast_0.80: R = 0.63) but with marked
differences for some gene families. For instance the initiation factor 2 (IF-2) gene tree is found to be completely
congruent with the 23-gene reference tree using the slow
test, while the fast version infers 9 transfers. Reciprocally,
some other genes showed higher transfer rates with the
slow test, for instance ribosomal protein S11 where 13
(23-gene tree) or 11 (14-gene tree) LGTs were detected
with the slow version versus one with the fast version, for
both reference trees. The mean number of transfers
invoked in the two sets of genes identified by Brown et al.
is significantly different, for the fast version, regardless of
the reference tree (Wilcoxon test between numbers of
Abby et al. BMC Bioinformatics 2010, 11:324
http://www.biomedcentral.com/1471-2105/11/324
Page 7 of 13
Figure 4 Accuracy of transferred sequences detection. RIATA-HGT (best scenarios selected) and Prunier (fast version) results are respectively indicated by triangles and circles. The color code corresponds to the different support values tested as a threshold for topological conflict significance.
The figure represents the positive predictive value as a function of the negative predictive value (see Material and Methods) of the two methods. Black
lines represent the limits given by RIATA-HGT on true trees. These statistics show how often predictions are correct when sequences are classified as
transferred (PPV) or not transferred (NPV). A perfect method would have PPV = 1 and NPV = 1.
transfer events for each tree shown in Table 1: p-value =
0.04 for both fast_0.80 and fast_0.90 with the 23-gene
tree, p-value = 0.004 and 0.03 with the 14-gene tree for
fast_0.80 and fast_0.90, respectively). This suggests that
the manual criterion originally used by Brown et al. [22]
for gene exclusion correctly identified genes with high
transfer rates. Nevertheless, many among the 14 genes
retained in the second dataset show a significant amount
of transfers with our method. This raises the question of
whether the combination of such a limited amount of
genes with such a strong degree of conflict can really
yield a reliable species tree.
Discussion
Detecting LGTs using phylogenetic approaches is a challenge for several reasons. First, the reconstruction of
optimal scenarios explaining the discrepancies between
two trees is a complex algorithmic problem. Second, in
practice, not all of these discrepancies require a biological
explanation, because reconstructed gene trees are imper-
Abby et al. BMC Bioinformatics 2010, 11:324
http://www.biomedcentral.com/1471-2105/11/324
Page 8 of 13
Table 1: Analysis of 23 universal gene families with Prunier with two reference trees
Protein
Alignment
length
LGT inferred by Prunier (slow and fast version)
(* indicates LGT diagnosis by Brown et al. )
23 genes tree reference tree
14 genes tree reference tree
Slow
Fast-80
Fast-90
Slow
Fast-80
Fast-90
aspartyl-tRNA synthetase
249
4
8
8
6
7
7
glutamyl-tRNA synthetase
188
12
12
12
13
11
11
leucyl-tRNA synthetase
358
9
11
11
11
13
13
initiation factor 2
337
8
9
0
3
3
1
elongation factor G
536
10
11
5
11
9
4
elongation factor Tu
340
0
9
9
7
9
9
ribosomal protein L2
192
2
13
1
6
8
0
ribosomal protein S5
131
7
2
2
4
2
2
ribosomal protein S8
118
7
5
0
7
5
0
ribosomal protein S11
110
13
1
1
11
1
1
DNA-directed RNA polymerase β chain
537
0
8
2
8
7
2
DNA topoisomerase I
236
10
13
4
8
12
3
DNA polymerase III subunit
194
3
2
1
2
2
0
signal recognition particle protein
298
3
8
1
4
5
1
alanyl-tRNA synthetase (*)
502
5
7
3
4
11
3
histidyl-tRNA synthetase (*)
166
7
18
9
14
15
11
isoleucyl-tRNA synthetase (*)
552
8