Week 6 Inferring Evolutionary Processes from Sequence Data

In this session, we will learn how to work with DNA sequence data in R. So far, much of our work has used allele-based models, where we considered individual markers rather than sequences as a whole. DNA sequence data, however, lets us take into account information from more than a single site, including our understanding of invariant sites. For this reason, a number of different statistics are often used when handling sequence data.

Today we will focus on estimating the number of segregating sites, nucleotide diversity and Tajima’s D. We focus on these simpler statistics because today's aim is also to familiarise you with working with large-scale genomic datasets. We will begin with a relatively straightforward example with just a few sequences, then move on to a larger dataset, and eventually look at patterns of nucleotide diversity at the scale of an entire chromosome. This is an important precursor for the following two sessions, where we will focus more and more on genomic data.

What to expect

Today we won’t introduce any new R concepts, but will jump straight into the action with some tools for handling sequence data. In this section we will:

reinforce our understanding of basic population genetic statistics estimated from nucleotide data
learn how to calculate these statistics on real data
perform a genome scan analysis using these statistics

Getting started

The first thing we need to do is set up the R environment. Today we’ll be using tidyverse, and we will also need three additional packages for this session: ape, pegas and PopGenome. To install these packages, use the following commands:

install.packages("ape")
install.packages("pegas")
install.packages("devtools")
devtools::install_github("pievos101/PopGenome")

Note that PopGenome is installed in a slightly different way from what you are used to: it comes from GitHub via the devtools package rather than through install.packages(). As long as you install devtools first, then PopGenome, you should have no issues.
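If you rerun your script often, you may prefer not to reinstall everything each time. The following is a minimal sketch of one way to install only what is missing. The helper name missing_packages is made up for illustration and is not part of any package; the GitHub repository name is the one given above.

```r
# Hypothetical helper (not from any package): return the names in `pkgs`
# that are not currently installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install the missing CRAN packages, devtools included
to_install <- missing_packages(c("ape", "pegas", "devtools"))
if (length(to_install) > 0) {
  install.packages(to_install)
}

# PopGenome is installed from GitHub, so it is handled separately
if (length(missing_packages("PopGenome")) > 0) {
  devtools::install_github("pievos101/PopGenome")
}
```

Running this a second time should do nothing, because every package is then already present.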

Once these packages are installed, we will clear the R environment with rm(list = ls()) and then load everything we need for this session.

# clear the R environment
rm(list = ls())
# load the packages needed for this session
library(tidyverse)
library(ape)
library(pegas)
library(PopGenome)

Remember that clearing the R environment when you start a script is good practice to make sure you don’t have any conflicts with previously loaded data. All three packages, ape, pegas and PopGenome, are really useful for handling genetic data in R; see each package's documentation for more information. Remember that you can always access the help page of any function with ?.
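As a quick check that the packages loaded properly, here is a small sketch that computes the three statistics named earlier on an example dataset and then opens two help pages. It assumes ape and pegas are installed. The woodmouse alignment is an example dataset bundled with ape and is used here purely for illustration; it is not necessarily the data used later in this session.

```r
library(ape)
library(pegas)

# woodmouse: example alignment shipped with ape (15 mitochondrial sequences)
data(woodmouse)

# number of segregating sites: seg.sites() returns the positions of
# variable sites, so we count them with length()
n_seg <- length(seg.sites(woodmouse))

# nucleotide diversity
pi_hat <- nuc.div(woodmouse)

# Tajima's D: tajima.test() returns a list containing D and two p-values
taj <- tajima.test(woodmouse)

n_seg
pi_hat
taj$D

# open the help page for a function, in two equivalent ways
?nuc.div
help("tajima.test")
```

If these calls return values without errors, the environment is ready for the rest of the session.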