6.1 Working with DNA sequence data

For the first part of the tutorial, we learn about how to handle DNA sequence data in R. To keep things simple at this stage, we will use a relatively small dataset. Once we have some familiarity with sequence data and the statistics we can calculate from it, we will learn how to handle data derived from whole genome sequencing. For much of this first part of the tutorial, we will use a series of functions in ape and also some from pegas.

6.1.1 Reading sequence data into R.

The first thing we will learn to do today is to read seqeunce data into R. Unlike in the previous tutorials, it doesn’t make much sense to store sequence data as a data.frame, so we need to use functions specifically designed for the purpose such as read.dna. We will be reading in a FASTA which is a standard file format for sequencing data. FASTA files can vary slightly, but the basic format is the same. The file we will use today can be downloaded here.

Before we actually read this file into R, let’s take a look at it:

file.show("example.fas")

The file.show function is quite self-explanatory. You can also click on the FASTA in the Files pane in R Studio and have a look at it in the console. In this FASTA file, there are two lines for each sequence - the sequence name and then the sequence itself. The sequence name line starts with > and then after it are the base calls. Again, it is worth checking out the wikipedia page for FASTA as it contains more information on how these files are formatted.

Next we will actually read our FASTA file in. To do this, we will use read.dna() like so:

mydna <- read.dna("example.fas", format = "fasta")

This will create an object, mydna which is an ape specific structure called a DNAbin - i.e. DNA data stored in binary. You don’t need to worry too much about the specifics of this data structure although you might want to check it for yourself. Try class(mydna) and ?DNAbin to get more information.

In the meantime, have a look at the mydna object and see what information is printed to the R console.

6.1.2 Exploring DNA sequence data

As you will have just seen, when you call the mydna object, you only get a summary in the R console. What if you actually want to look at the sequences? One way to do this is to use the as.alignment function - like so:

myalign <- as.alignment(mydna)

This function converts our DNAbin object into an alignment - which is essentially a list with different elements - i.e nb - the number of sequences, seq - the actual sequences themselves, nam - the names of the sequences.

alignment object is convenient in that you can access information the way you are used to, e.g. with the $ operator. You can look at the sequences as a character vector using the following R code:

myalign$seq

However, if you want to compare sequences, using the alview() function may be a better method:

alview(mydna)

Note that in this case, we used alview directly on our mydna object - not on the myalign alignment object. This prints our alignment to the screen and it makes it immediately obvious where there are nucleotide polymorphisms in our data. You should note however that the N bases in the first sequence are bases which could not be called by the sequencing machine - they are not valid base calls.

NOTE In this case, these sequences were read into R as an alignment already - meaning that each position in the sequences corresponds to one another. We did not actually align the sequences in R itself.

6.1.3 Calculating basic sequence statistics

6.1.3.1 Base composition

Using a couple of standard pegas functions, we can get some more information about our sequences. For example, we can calculate the frequency of the four nucleotides in our dataset:

base.freq(mydna)

We can also calculate the GC content - i.e. the proportion of the sequence that is either a G or C nucleotide.

GC.content(mydna)

To break down GC content even further, this is equivalent to:

sum(base.freq(mydna)[c(2, 3)])

6.1.3.2 Segregating sites

Looking at our aligned sequences, we can see that there are several positions whether there is a polymorphism. Using the seg.sites function, we can find the segregating sites:

seg.sites(mydna)

seg.sites returns the position in the sequence where polymorphisms occur in our 40 base pair sequence alignment. The positions are stored in a vector, and we can count the number of elements in the vector with the length() function to get the number of segregating sites.

length(seg.sites(mydna))

As this function makes pretty clear, segregating sites is just the number of polymorphic positions in a sequence. However, it is also something that scales with sample size - i.e. the more sequences we add, the higher the probability of finding segregating sites in our data. This is a point we will return to later.

One last thing about the segrating sites we calculated here - it is not standardised to the length of sequences (i.e., the capitalised $S$ rather than $s$). To achieve that we need to do the following.

# get segregating sites
S <- length(seg.sites(mydna))
# set sequence length
L <- 40
# standardise segregating sites by sequence length
s <- S/L

6.1.3.3 Nucleotide diversity

Nucleotide diversity or $\Pi$ is the average number of pairwise differences between sequences in a population or sample. We can calculate it using the following formula:

$\Pi = \displaystyle \frac{1}{[n(n-1)]/2} \sum_{i<j}\Pi_{ij}$

Where $\Pi_{ij}$ is the number of nucleotide differences between sequence $i$ and sequence $j$ and ${[n(n-1)]/2}$ is the number of possible pairwise sequence comparisons from all $n$ sequences. In other words, it is the sum of pairwise differences between each pair of sequences, divided by the number of pairs.

From seg.sites(mydna), we know there are two polymorphic positions at sites 35 and 36 in our 40 bp alignment. Using alview(mydna) we can visualise these and also actually count the differences. There are 2 nucleotide differences between sequences No305 and No304, 1 between 304 and 306 and then finally, 1 again between 305 and 306.

We can use R to calculate our nucleotide diversity by hand:

# set n sequences
n <- 3
# set differences
# 2 differences between 304 and 305, 1 difference in the other comparisons
Pi_ij <- c(2, 1, 1) 
# calculate the number of pairwise comparisons
np <- (n*(n-1))/2
# calculate the nucleotide diversity
Pi <- sum(Pi_ij)/np

You can compare this to the calculations in Chapter 7 of the textbook - as they are essentially identical. What we have here is the average number of nucleotide differences between our three sequences. This is not standardised to the sequence length (i.e., $\Pi$, not $\pi$), so we need to do that next.

# calculate standarised nucleotide diversity
L <- 40
pi <- Pi/L

Of course, calculating nucleotide diversity by hand is not that useful when we have more than even a few sequences (as we will shortly). Luckily, we can also calculate nucleotide diversity using the function nuc.div from pegas.

nuc.div(mydna)

Unlike the number of segregating sites, this estimate is already standardised to the length of our sequence. Take a moment to compare the output previous command with our hand calculated value of pi. They are different! Why is this? Well using alview again we can see that there are actually two sites where in the first sequence, No305, the base call is N. As we learned earlier, this means we do not have a reliable call for this position in the sequence (perhaps a sequencing error or an ambigous base).

The nuc.div function therefore corrects for these missing bases and shortens the total length of our sequence by two. This therefore changes the final estimate of $\pi$ it produces. Of course, manually counting differences between sequences and finding sequence lengths and adjusting for Ns is time-consuming and error-prone, so you will always be using the nuc.div() function directly when working with large data sets (doing it manually, however, may help you understand the theory behind).