7.3 Setting up sliding windows
So far, this will start to seem quite familiar! We learned in the last session that per-SNP estimates of statistics such as \(\pi\) can often be extremely noisy when you are calculating them on very large numbers of markers. As well as this, there are issues with the fact that SNP positions in close proximity are not always independent due to recombination - this is a theme we will return too shortly. So for this reason, it is often better to use a sliding-window approach - i.e. split the genome into windows of a particular size and then calculate the mean for a statistic within that window.
We know already that chromosome 8 is 49,693,984 bp long, so we can get an idea of how many sliding windows we would generate by using some R code. We’ll set our sliding window to be 100,000 bp wide - or 100 Kb. We will also set a step or jump for our window of 25,000 bp - or 25Kb.
# set chromosome size
chr8 <- 49693984
# set window size and window jump
window_size <- 100000
window_jump <- 25000
We use these values to set up our sliding windows for our sparrows dataset using the PopGenome
function, sliding.window.transform
# make a sliding window dataset
sparrows_sw <- sliding.window.transform(sparrows, width = window_size, jump = window_jump, type = 2)
Last week we calculated the window positions along the chromosome for you (inside a yellow box). This week, however, we will show you how you can use basic R commands to find these. We begin by making a sequence from 1 to the length of chromosome 8, with steps equal to our window size using seq()
.
# use seq to find the start points of each window
window_start <- seq(from = 1, to = chr8, by = window_jump)
Then we can find the end point of each window by adding the window size to each element of the vector.
Now we have generated two vectors: window_start
and window_stop
. The windows in the chromosome run from the positions in window_start
to the corresponding position in window_stop
For example, the first window runs from 1 to 100 Kb (i.e., window_start[1]
to window_stop[1]
), the second window from 25 Kb to 125 Kb (window_start[2]
to window_stop[2]
) and so on.
However, there is an issue here. Some of the windows stop after the end of the chromosome (compare e.g. chr8
with the final stop position window_stop[length(window_stop)]
25), so we need to remove these. You can use the following code and logical operations to see that all windows start before the end of the chromosome but that because of how we generated the stop windows, this is not the case for the stop positions26.
# no windows start after the end of chromosome 8
sum(window_start > chr8)
# but some window stop positions do occur past the final point
sum(window_stop > chr8)
In fact, there are 4 windows that are beyond the end of the chromosome. To remove them, we can use the same logical operations as above, just this time within square brackets to drop those positions.
# remove windows from the start and stop vectors
window_start <- window_start[window_stop < chr8]
window_stop <- window_stop[window_stop < chr8]
Here we wrapped our logical operation window_stop < chr8
in square brackets, which tells R to return the elements of the vector where this condition is TRUE
. Also note that we have to remove the start windows that meet this condition too. Why? Well because we are using a sliding window and our window size is 100 kb, the window starting at 49,675,001 will come close to the end of the chromosome.
Actually, this highlights an important point, our final window actually falls short of the end of the chromosome. You can check this like so:
This is something to be aware of, since our final window falls short of the end of the chromosome, we may not be including all our variants. This is not necessarily wrong, but it is important to note.
Anyway, although a little long-winded, this sliding window section is important as it will be useful for plotting later. For now, we will save our sliding window start/stop positions as a data.frame
. We’ll also calculate the midpoint for each window.
# save as a data.frame
windows <- data.frame(start = window_start, stop = window_stop,
mid = window_start + (window_stop-window_start)/2)
7.3.1 Calculating sliding window estimates of nucleotide diversity and differentiation
Now that we have set up the data, the population information and the sliding windows, it is quite straightforward for us to calculate some statistics we are interested in. In this case, we are going to calculate nucleotide diversity (i.e. \(\pi\)) and FST. We will also generate a third statistic, dXY, which is the absolute nucleotide divergence between two populations.
First we will calculate \(\pi\). Handily, the following command also sets up what we need for dXY.
Next we will calculate FST, which again is very straight forward with a single command.
Note that here we use mode = "nucleotide"
to specify we want it to be calculated sliding averages of nucleotides, rather than using haplotype data, which is the alternative. And that’s it for calculating the statistics! As you will see in the next section, extracting them from the sparrows_sw
object is actually more difficult than generating them…
7.3.2 Extracting statistics for visualisation
Since we ran our analysis on a sliding-window basis, we should have estimates of \(\pi\), FST and dXY for each window. What we want to do now is extract all our statistics and place them in a single data.frame
for easier downstream visualisation - this will let us identify how these statistics are interrelated.
The extraction process involves extracting data and manipulating strings to label things correctly. This is a bit too much to go into in this tutorial, but string tools can be very useful for biological data. If you’re interested, see if you can understand how sub()
and paste0()
is used below, you’ve come a long way if you do!
# extract nucleotide diversity and correct for window size
nd <- sparrows_sw@nuc.diversity.within/100000
# make population name vector
pops <- c("bactrianus", "house", "italian", "spanish", "tree")
# set population names
colnames(nd) <- paste0(pops, "_pi")
# extract fst values
fst <- t(sparrows_sw@nuc.F_ST.pairwise)
# extract dxy - pairwise absolute nucleotide diversity
dxy <- get.diversity(sparrows_sw, between = T)[[2]]/100000
## Name the fst and dxy columns properly
# get column names
x <- colnames(fst)
# replace "pop" with the proper names of the populations
for(i in 1:length(pops)){
x <- sub(paste0("pop", i), pops[i], x)
}
# replace forward slash with underline
x <- sub("/", "_", x)
# set column names of fst and dxy
colnames(fst) <- paste0(x, "_fst")
colnames(dxy) <- paste0(x, "_dxy")
## Combine all data with the windows object we made earlier
sparrow_data <- as_tibble(data.frame(windows, nd, fst, dxy))
We now have the data frame sparrow_data
, take a look at it to ensure that it looks correct (be aware of any errors you might get while running the code in yellow above). The data contain window positions as well as nucleotide diversity \(\pi\) and the two pairwise measures FST and dxy for all pairs of populations. Now we can finally investigate differences between the populations!
Note that here,
length(window_stop)
in the square brackets means we are evaluating the final value in thewindow_stop
vector.↩︎remember that
window_stop > chr8
gives a vector of the same length aswindow_stop
, containingTRUE
if the corresponding element ofwindow_stop
is larger thanchr8
, andFALSE
if it is smaller.sum()
treats eachTRUE
as 1 and eachFALSE
as 0, so it can effectively be used to count the number ofTRUE
s in a vector, as we do here.↩︎