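```{r,message=FALSE,echo=FALSE}
# Load the bibliography used for the inline citations in this chapter
library("knitcitations")
bib <- read.bibtex("bibtexlib.bib")
# Configure pandoc-style citations and set chunk defaults for the chapter
library("knitr")
cite_options(citation_format = "pandoc", max.names = 3, style = "html", hyperlink = "to.doc")
opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE)
```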
---
title: "Introduction"
subtitle: "*NJ Grünwald, ZN Kamvar and SE Everhart*"
---
This primer provides a concise introduction to conducting applied analyses of
population genetic data in R, with a special emphasis on non-model populations
including clonal or partially clonal organisms. It is not meant to be a textbook
on population genetics. Quite to the contrary, we refer the reader to several
textbooks on population genetics `r citep(bib[c("nielsen2013", "templeton2006",
"hartl2007")])`. Likewise, this book will not replace books on theory and
statistics of population genetics `r citep(bib[c("weir1996")])`. The reader is
thus expected to have a basic understanding of population genetic theory and
applications. Finally, this primer is focused on traditional population genetics
based on allele frequencies, rather than more sophisticated coalescent
approaches `r citep(bib[c("wakeley2009", "nielsen2013", "hein2004")])`, although
some of the material covered here also applies to them. In a nutshell, this primer
provides a valuable resource for tackling the nitty-gritty analysis of
populations that do not necessarily conform to textbook genetics and might or
might not be in Hardy-Weinberg equilibrium and panmixia.
This primer is geared towards biologists interested in analyzing their
populations. This primer does not require extensive knowledge of programming in
R, but the user is expected to install R and all packages required for this
primer.
Why use R?
-----
Until recently, one of the more tedious aspects of conducting a population
genetic analysis was the need to repeatedly reformat data in order to run
different, complementary analyses in different programs. Often, these programs
ran on only one platform. Now, R and its packages provide a toolbox that
allows analysis of most data without tedious reformatting on all
major computing platforms, including Microsoft Windows, Linux, and Apple's OS X.
R is an open source statistical programming and graphing language that includes
tools for statistical, population genetic, genomic, phylogenetic, and
comparative genomic analyses. This primer is for those of us who want to
conduct applied analyses of populations and make full use of the palette of
tools available in R, including, for example, the R Markdown code for this primer.
Note that the R user community is very active and that both R and its packages
are regularly updated, critically modified, and noted as deprecated (no longer
updated) as appropriate.
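
As a minimal sketch of how one might check what is installed, the helper below reports the installed version of each package in a list, or `NA` if it is missing. The function name `report_versions` and the package names passed to it are illustrative assumptions, not part of this primer's code.

```r
# Hypothetical helper: look up the installed version of each package.
# Returns NA for packages that are not installed.
report_versions <- function(pkgs) {
  installed <- rownames(installed.packages())
  versions <- vapply(pkgs, function(p) {
    if (p %in% installed) as.character(packageVersion(p)) else NA_character_
  }, character(1))
  data.frame(package = pkgs, version = versions,
             row.names = NULL, stringsAsFactors = FALSE)
}

R.version.string                      # version of R currently running
report_versions(c("poppr", "knitr"))  # package names are examples only
# update.packages(ask = FALSE)        # uncomment to update installed packages
```
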
> Any R user needs to make sure all components are up-to-date and that versions
> are compatible.
Population genetics
----
Traditional population genetics is based on analysis of observed allele
frequencies compared to frequencies expected, assuming a population genetic
model. For example, under a **Wright-Fisher model** you might expect to see
populations of diploid individuals that reproduce sexually, with non-overlapping
generations. This model ignores effects such as mutation, recombination,
selection or changes in population size or structure. More complex models can
incorporate different aspects of effects observed in real populations. However,
most of these models assume that populations reproduce sexually.
Here, we briefly review some terminology but assume that the reader has a basic
understanding of theory and applications of population genetics.
A **locus** is a position in the genome where we can observe one or several
alleles in different individuals. Loci used in population genetics are assumed
to be selectively neutral and can be an anonymous or non-coding region such as a
microsatellite locus (SSR), a single nucleotide polymorphism (SNP) or the
presence/absence of a band on a gel. A **genotype** is the combination of
alleles carried by a given individual at a particular set of loci. Individuals
carrying the same set of alleles are considered to have the same **multilocus
genotype** (MLG).
Basic metrics of interest in populations are polymorphisms, allele frequencies
and genotype frequencies. Polymorphism can be estimated in several ways, such as
the total number of loci observed that have more than one allele. The frequency
of an allele is calculated as the number of copies of that allele observed in a
population divided by the total number of allele copies at the locus, that is,
the number of individuals times the ploidy ($N$ in haploids or $2N$ in
diploids). For diploids we would observe
frequency $f_{A}$ and $f_{a}$ for alleles $A$ and $a$, respectively:
$f_{A} = \frac{N_{A}}{2N}$ and $f_{a} = \frac{N_{a}}{2N}$
where $N_{A}$ and $N_{a}$ are the numbers of copies of alleles $A$ and $a$
observed in the population genotyped.
By definition, at a biallelic locus frequencies sum to 1:
$f_{A} + f_{a} = 1$.
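
The allele frequency calculation can be sketched in a few lines of base R. The genotype counts below are made-up numbers used purely for illustration.

```r
# Hypothetical genotype counts at one biallelic locus in a diploid sample
genotype_counts <- c(AA = 30, Aa = 50, aa = 20)
N <- sum(genotype_counts)  # number of individuals genotyped

# Each AA individual carries two copies of A, each Aa individual carries one
N_A <- 2 * genotype_counts[["AA"]] + genotype_counts[["Aa"]]
N_a <- 2 * genotype_counts[["aa"]] + genotype_counts[["Aa"]]

# Divide by the total number of allele copies (2N in diploids)
f_A <- N_A / (2 * N)  # 110 / 200 = 0.55
f_a <- N_a / (2 * N)  #  90 / 200 = 0.45
f_A + f_a             # the two frequencies sum to 1
```
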
Genotype frequencies are the relative frequencies of each MLG observed in a
population. Thus, for diploids at a **biallelic locus** we can observe three
genotypes, $AA$, $Aa$, and $aa$, with their respective frequencies:
$f_{AA} = \frac{N_{AA}}{N}$ and $f_{Aa} = \frac{N_{Aa}}{N}$ and $f_{aa} =
\frac{N_{aa}}{N}$
where $N_{AA}$, $N_{Aa}$, and $N_{aa}$ are the numbers of individuals carrying
each genotype and $N$ is the total number of individuals.
These frequencies again add up to 1. In diploid organisms we can also calculate
the frequency of **homozygotes** ($AA$ or $aa$) or **heterozygotes** ($Aa$) as
the proportion of individuals falling into each class. The proportion of
individuals that are heterozygous is given by $f_{Aa}$ while the homozygous
proportion is given by $1-f_{Aa} = f_{AA}+f_{aa}$.
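
A short base R sketch of these genotype-level quantities follows, again using made-up counts. Genotype frequencies are computed by dividing each genotype count by the number of individuals $N$, so that they sum to 1.

```r
# Hypothetical genotype counts at one biallelic locus in a diploid sample
genotype_counts <- c(AA = 30, Aa = 50, aa = 20)
N <- sum(genotype_counts)  # number of individuals

# Genotype frequencies: counts of individuals divided by N
genotype_freqs <- genotype_counts / N
sum(genotype_freqs)        # the three frequencies sum to 1

# Observed proportions of heterozygotes and homozygotes
prop_het <- genotype_freqs[["Aa"]]  # 0.5
prop_hom <- 1 - prop_het            # equals f_AA + f_aa = 0.5
```
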
An important aspect of population structure is the heterozygosity observed
within subdivided populations, $H_{S}$, and across the total population, $H_{T}$. If
we assume two populations in Hardy-Weinberg equilibrium, with
frequencies $f_{A1}$ and $f_{A2}$ of allele $A$, then the frequency of allele
$A$ pooled over both populations would be:
$f_{A} = \frac{2N_{1}f_{A1} + 2N_{2}f_{A2}}{2N_{1} + 2N_{2}}$
where each allele frequency is weighted by the size of its population, $N_{1}$ or $N_{2}$,
and the sum is divided by the total number of alleles $2N_{1} + 2N_{2}$. Remember that in a
diploid population there are 2 alleles per locus per individual and hence every
population size is multiplied by 2.
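
The pooled frequency is simply a size-weighted mean, which can be sketched as follows. The function name `pooled_allele_freq` and the input values are illustrative assumptions.

```r
# Hypothetical helper: pooled frequency of allele A across diploid populations.
# f: allele frequency in each population; N: number of individuals in each.
pooled_allele_freq <- function(f, N) {
  stopifnot(length(f) == length(N))
  sum(2 * N * f) / sum(2 * N)
}

# Two populations with frequencies 0.8 and 0.2 and sizes 50 and 150:
# (2*50*0.8 + 2*150*0.2) / (2*50 + 2*150) = 140 / 400 = 0.35
pooled_allele_freq(f = c(0.8, 0.2), N = c(50, 150))
```

Because the larger population contributes more allele copies, the pooled value sits closer to 0.2 than to 0.8.
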
Genetic differentiation of two populations (e.g. subdivision of populations)
occurs if we observe differences in polymorphism, allele frequencies, genotype
frequencies, and heterozygosity. Wright's $F_{ST}$ is a commonly used measure
of the difference between heterozygosity in the overall population and
in the subdivided populations. It is defined as the difference between $H_{T}$ and
$H_{S}$ relative to $H_{T}$:
$F_{ST} = \frac{H_{T}-H_{S}}{H_{T}}$
If allele frequencies are identical in both subpopulations then $H_{T} = H_{S}$
and $F_{ST} = 0$.
If, on the other hand, allele frequencies vary between subdivided populations,
then $H_{T} > H_{S}$, so that $F_{ST} > 0$, and populations are considered to be genetically
differentiated.
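
To make the definition concrete, here is a minimal sketch for a single biallelic locus. It rests on assumptions that are not spelled out in the text above: heterozygosity is taken as the Hardy-Weinberg expectation $2 f_{A} f_{a}$, $H_{S}$ is the size-weighted mean of the subpopulation values, and $H_{T}$ is computed from the pooled allele frequency. The function name `fst_biallelic` is hypothetical, and dedicated packages use more refined estimators.

```r
# Hypothetical helper: F_ST at one biallelic locus, assuming Hardy-Weinberg
# expected heterozygosity (2 * f * (1 - f)) within each population.
# f: frequency of allele A in each population; N: population sizes.
fst_biallelic <- function(f, N) {
  stopifnot(length(f) == length(N))
  f_pooled <- sum(N * f) / sum(N)           # pooled allele frequency
  H_T <- 2 * f_pooled * (1 - f_pooled)      # expected heterozygosity, total
  H_S <- sum(N * 2 * f * (1 - f)) / sum(N)  # weighted mean within populations
  (H_T - H_S) / H_T
}

fst_biallelic(f = c(0.5, 0.5), N = c(100, 100))  # identical frequencies: 0
fst_biallelic(f = c(0.8, 0.2), N = c(100, 100))  # (0.5 - 0.32) / 0.5 = 0.36
```
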
Besides calculation of $F_{ST}$, other approaches can be used to assess
population structure including clustering. These approaches will be explored in
subsequent chapters.
The special case of clonal organisms
----
Plants and microorganisms such as fungi, oomycetes, and bacteria often reproduce
clonally or follow a mixed reproductive system, including various degrees of
clonality and sexuality `r citep(bib[c("halkett2005tackling")])`. Traditional
population genetic theory does not apply to these populations because they violate basic
assumptions of the corresponding analyses. Thus, a different set of tools is
required for analyses of clonal populations `r citep(bib[c("anderson1995clonality",
"balloux2003population", "de2006molecular",
"grunwald2011evolution", "halkett2005tackling", "milgroom1996recombination")])`.
Typically these focus on metrics that are model-free and do not assume random
mating or random sampling of alleles. Model-free metrics suitable for analyzing
these populations include measures of genotypic diversity or evenness, as well
as cluster analysis based on genetic distance. Additional useful measures
include indices of linkage disequilibrium among markers, which are used to infer
whether populations are reproducing sexually or clonally. We will explore these
analyses bit by bit throughout this primer. The *poppr* R package was
specifically developed to facilitate analyses of clonal populations
`r citep(bib[c("kamvar2014poppr", "kamvar2015novel")])`.
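
To give a flavor of what model-free genotypic diversity metrics look like, the base R sketch below summarizes a vector of MLG labels. The function name `genotypic_diversity`, the choice of indices (Shannon-Wiener $H$, Simpson's index, Stoddart and Taylor's $G$, and the evenness index $E_{5}$), and the example labels are assumptions made for illustration. In practice a dedicated package such as *poppr* would be the more convenient route.

```r
# Hypothetical helper: simple genotypic diversity summaries from MLG labels.
# mlg: character vector with one multilocus genotype label per individual.
genotypic_diversity <- function(mlg) {
  p <- as.numeric(table(mlg)) / length(mlg)  # relative frequency of each MLG
  n_mlg <- length(p)                         # number of distinct MLGs
  H <- -sum(p * log(p))                      # Shannon-Wiener index
  G <- 1 / sum(p^2)                          # Stoddart and Taylor's G
  simpson <- 1 - sum(p^2)                    # Simpson's index
  # Evenness E5 = (G - 1) / (exp(H) - 1); undefined when only one MLG exists
  E5 <- if (n_mlg > 1) (G - 1) / (exp(H) - 1) else NA_real_
  list(n_mlg = n_mlg, H = H, G = G, simpson = simpson, E5 = E5)
}

# Eight individuals, dominated by one clone (labels are made up)
genotypic_diversity(c("G1", "G1", "G1", "G1", "G2", "G2", "G3", "G4"))
```

An evenness near 1 indicates that MLGs occur at similar frequencies, whereas lower values indicate that one or a few clones dominate the sample.
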
Marker systems, ploidy and other kinds of data
----
Population genetic data come in many shapes and forms and a good analysis needs
to be tailored to the marker system, the ploidy of the organism and the corresponding genetic
models assumed `r citep(bib[c("grunwald2011evolution")])`. Organisms can be
haploid, diploid, with or without known phase, or polyploid. In the case of
diploid species, marker systems can be dominant (AFLP, RFLP or RAPD data) or
co-dominant (SSR, SNPs, allozymes) `r citep(bib[c("grunwald2003")])`. A locus can
have two alleles (0/1 for AFLP; C/T for SNPs) or many alleles
(A/B/C...N for SSRs or allozymes). In each case the data have to be coded and
analyzed differently. Mutation rates can also differ among marker
systems (SSR >> mitochondrial SNPs > nuclear SNPs)
`r citep(bib[c("grunwald2011evolution")])`. All these aspects of a particular data
set need to be considered in the data analyses.
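
As a rough illustration of how coding differs between marker systems, the sketch below sets up a small co-dominant table (alleles separated by `/`) and a dominant presence/absence matrix in base R. All individuals, loci, allele sizes, and band names are made up, and the layout is only one of several possible conventions.

```r
# Co-dominant diploid SSR data: one column per locus, two alleles per genotype
ssr <- data.frame(
  locus1 = c("120/124", "120/120", "124/128"),
  locus2 = c("200/200", "200/204", "204/204"),
  row.names = c("ind1", "ind2", "ind3"),
  stringsAsFactors = FALSE
)

# Dominant AFLP data: band present (1) or absent (0); heterozygotes cannot be
# distinguished from homozygotes carrying the band
aflp <- matrix(c(1, 0, 1,
                 0, 0, 1,
                 1, 1, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("ind1", "ind2", "ind3"),
                               c("band1", "band2", "band3")))

# With co-dominant data both alleles are visible, so heterozygotes can be scored
is_het <- sapply(ssr, function(genotypes) {
  alleles <- strsplit(genotypes, "/", fixed = TRUE)
  vapply(alleles, function(a) a[1] != a[2], logical(1))
})
is_het  # logical matrix: rows are individuals, columns are loci
```
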
Applications of population genetic analyses
----
Population genetic analyses are tremendously valuable for answering questions
ranging from applied to basic evolutionary ones `r citep(bib[c("grunwald2011evolution")])`.
For example, a typical concern when
finding a new, invasive organism is whether it was introduced to the area or
emerged from a resident population. Other questions of interest might include:
- Where is the center of origin?
- Does this organism reproduce sexually?
- Has the population gone through a genetic bottleneck?
- Are populations structured by region, geography, or micro-environment?
- What are source and sink populations and what are the rates of migration?
This primer will provide some of the basic tools needed to answer many of these
questions for populations that are clonal or partially clonal. Note, however,
that more powerful, complementary coalescent methods will not be covered here.
We hope you enjoy following along.
References
----------
<!--- ref --->