7.3 Setting up sliding windows

So far, this will start to seem quite familiar! We learned in the last session that per-SNP estimates of statistics such as \(\pi\) can often be extremely noisy when you are calculating them on very large numbers of markers. As well as this, there are issues with the fact that SNP positions in close proximity are not always independent due to recombination - this is a theme we will return too shortly. So for this reason, it is often better to use a sliding-window approach - i.e. split the genome into windows of a particular size and then calculate the mean for a statistic within that window.

We know already that chromosome 8 is 49,693,984 bp long, so we can get an idea of how many sliding windows we would generate by using some R code. We’ll set our sliding window to be 100,000 bp wide - or 100 Kb. We will also set a step or jump for our window of 25,000 bp - or 25Kb. N.B For the the functions in GenoPop to properly overlap, we need to set this value to a negative one.

# set chromosome size
chr8 <- 49693984

# set window size and window jump
window_size <- 100000
window_jump <- -25000

We will shortly be using these objects to run our analysis.

7.3.1 Calculating sliding window estimates of nucleotide diversity and differentiation

Now that we have set up the R environment, the population information and the sliding windows, it is quite straightforward for us to calculate some statistics we are interested in. In this case, we are going to calculate nucleotide diversity (i.e. \(\pi\)) and FST. We will also generate a third statistic, dXY, which is the absolute nucleotide divergence between two populations.

First we will calculate \(\pi\). Here we will do this for the house sparrow, like you did in the previous exercise. You can also calculate this for the different species too if you want to.

# calculate nucleotide diversity - house
pi_windows_house <- Pi("./sparrow_chr8_downsample.vcf.gz", exclude_ind = c(bactrianus, spanish, italian, tree),
                 seq_length = chr8, window_size = window_size, skip_size = window_jump)
# calculate nucleotide diversity - bactrianus
pi_windows_bac <- Pi("./sparrow_chr8_downsample.vcf.gz", exclude_ind = c(house, spanish, italian, tree),
                 seq_length = chr8, window_size = window_size, skip_size = window_jump)

While we wait we can break this down: - vcf_file - just defines the path to our vcf file - exclude_ind - sets individuals to leave out of the analysis (i..e everything but house here) - seq_length - this is the length of the chromosome, taken from the genome information - window_size is the size of the windows we examine, here 100Kb - skip_size is the step size for our sliding windows - 25Kb in this instance.

Next we will calculate FST, which again is very straight forward with a single command. Here we will do this for house versus bactrianus sparrows. Feel free to experiment with other comparisons if you wish.

# calculate Fst
fst_windows_house_bac <- Fst(vcf_path = vcf_path, pop1_individuals = house, pop2_individuals = bactrianus,
                             weighted = TRUE, window_size = window_size, skip_size = window_jump)

Note that here we use weighted = TRUE to specify we want the weighted FST per window. This just means that our FST estimates are weighted by the number of SNPs in each window, such that windows with many versus those with very few SNPs are accounted for .

Finally, we will also repeat this for dXY, which is equally straightforward.

# calculate Dxy
dxy_windows_house_bac <- Dxy(vcf_path = vcf_path, pop1_individuals = house, pop2_individuals = bactrianus,
                             seq_length = chr8, window_size = window_size, skip_size = window_jump)

The only difference here is that we again need to specify the length of the chromosome with seq_length = chr8 - i.e. the variable we set up earlier. And that’s it for calculating the statistics!

7.3.2 Extracting statistics for visualisation

Since we ran our analysis on a sliding-window basis, we should have estimates of \(\pi\), FST and dXY for each window. What we want to do now is extract all our statistics and place them in a single data.frame for easier downstream visualisation - this will let us identify how these statistics are interrelated.

First up, we will combine all our statistics into a single data.frame, sort out the classes of some of the data and then make it a tibble for easier access. One issue we have to deal with is that the analysis produces NA for windows with no SNPs - this means the values of \(\pi\) differ for the house and bactrianus analyses. Below we first merge the \(\pi\) datasets to deal with this and then create our master dataset for visualisation. The last step just adds a midpoint value for all our windows for plotting later on.

# Join the pi datasets
pi <- left_join(pi_windows_house, pi_windows_bac, by = "Start", suffix = c("_house", "_bac"))

# create the master dataset 
sparrow_data <- data.frame(
  chromosome = as.character(pi_windows_house$Chromosome),
  start =  as.numeric(pi_windows_house$Start),
  end = as.numeric(pi_windows_house$End),
  pi_house = as.numeric(pi$Pi_house),
  pi_bac = as.numeric(pi$Pi_bac),
  fst = as.numeric(fst_windows_house_bac$Fst),
  dxy = as.numeric(dxy_windows_house_bac$Dxy)
) %>% as_tibble()

# get midpoint of genome windows
sparrow_data <- sparrow_data %>%
  mutate(mid = round((start + end)/2))

We now have the data frame sparrow_data, take a look at it to ensure that it looks correct (be aware of any errors you might get while running the code in yellow above). The data contain window positions as well as nucleotide diversity \(\pi\) and the two pairwise measures FST and dxy for our house and bactrianus comparison. Now we can finally investigate differences between the populations!