7.2 Returning to the sparrow dataset

In the last session, we used the PopGenome package to calculate sliding window estimates of nucleotide diversity across chromosome 8 of the house, Italian and Spanish sparrows with data from Ravinet et al. (2018). We will now return to this example and use it to demonstrate why we must interpret the genomic landscape of differentiation with caution.

7.2.1 Reading in the sparrow VCF

The first step we need to take is to read our VCF of the sparrow chromosome 8 into the R environment. This is exactly the same procedure as in the last session, but in case you missed those steps, here they are again. Remember that because the VCF is large, the file is compressed, and there are some preprocessing steps you will need to do before you can open it in R.

First, download the VCF.
Next, make a directory in your working directory (use getwd() if you don’t know where that is) and call it sparrow_snps.
Move the downloaded VCF into this new directory and then uncompress it. If you do not have a program for this, you can use either the Unarchiver (Mac OS X) or 7zip (Windows).
Make sure only the uncompressed file is present in the directory.
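
If you would rather do the directory and decompression steps from within R, the following is a minimal sketch using only base R. It assumes the download is gzip-compressed (for example a .vcf.gz file). The helper name prepare_vcf_dir and the file name in the usage line are hypothetical and not part of PopGenome.

```r
# Hypothetical helper: decompress a gzipped VCF into its own directory.
# Assumes gzip compression; this is not a PopGenome function.
prepare_vcf_dir <- function(vcf_gz, out_dir = "sparrow_snps") {
  # Create the target directory if it does not already exist
  if (!dir.exists(out_dir)) dir.create(out_dir)
  # Name the output after the input, dropping the .gz suffix
  out_file <- file.path(out_dir, sub("\\.gz$", "", basename(vcf_gz)))
  in_con <- gzfile(vcf_gz, open = "rb")
  out_con <- file(out_file, open = "wb")
  on.exit({ close(in_con); close(out_con) })
  # Copy in chunks of about 1 MB so the whole file is never held in memory
  repeat {
    chunk <- readBin(in_con, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, out_con)
  }
  invisible(out_file)
}

# Usage (the file name is a placeholder for whatever you downloaded):
# prepare_vcf_dir("sparrows_chr8.vcf.gz")
```

Because the compressed file is read from wherever you downloaded it and only the uncompressed copy is written to sparrow_snps, the directory ends up containing a single file, as the last step requires.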

With these steps carried out, you can read this data in like so:

MAC

sparrows <- readData("./sparrow_snps/", format = "VCF", include.unknown = TRUE, FAST = TRUE)

WINDOWS

sparrows <- readData("./sparrow_snps", format = "VCF", include.unknown = TRUE, FAST = TRUE)

Like last time, we then need to read the file with population information, and attach that to our sparrows object. You should have the file available from last week’s tutorial, otherwise it can be downloaded here.

sparrow_info <- read.table("./sparrow_pops.txt", sep = "\t", header = TRUE)
populations <- split(sparrow_info$ind, sparrow_info$pop)
sparrows <- set.populations(sparrows, populations, diploid = TRUE)
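
To see what the split() step produces before it is handed to set.populations, here is a small sketch with a made-up table. The individual IDs and the toy_ object names are invented for illustration; only the column names ind and pop come from the code above.

```r
# Toy stand-in for sparrow_pops.txt; individual IDs are invented
toy_info <- data.frame(
  ind = c("ind_01", "ind_02", "ind_03", "ind_04", "ind_05"),
  pop = c("house", "house", "italian", "spanish", "spanish"),
  stringsAsFactors = FALSE
)

# split() groups the individual names by population label,
# returning a named list with one character vector per population
toy_pops <- split(toy_info$ind, toy_info$pop)

names(toy_pops)    # population labels
lengths(toy_pops)  # number of individuals in each population
```

Each element of the list holds the names of the individuals in one population, which is the shape the populations object above has.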

7.2.2 Examining the variant data

Remember, you can look at the data we have read in using the following command:

get.sum.data(sparrows)

In this case, you can see from n.sites that the final site is at position 49,693,117. The actual chromosome is 49,693,984 bp long, so this confirms that the variants span the entire chromosome. Note that n.sites is a bit counter-intuitive here: it would only make sense as the number of sites if we had called nucleotides at every single position in the genome. Since this is a variant call format file containing only polymorphic positions, that is not the case. Furthermore, the data has actually been subset in order to make it more manageable for our purposes today.

Nonetheless, it is still substantial. From n.biallelic.sites we can see there are 91,312 biallelic SNPs, and from n.polyallelic.sites there are 1,092 positions with more than two alleles. So in total we have:

sparrows@n.biallelic.sites + sparrows@n.polyallelic.sites

A total of 92,404 SNPs: a big dataset that requires some specific approaches to handling the data.
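
As a quick sanity check on these figures, the arithmetic can be reproduced with plain numbers, without the sparrows object loaded. This is only a sketch: the values are copied from the text above and the variable names are illustrative.

```r
# Values reported in the text for sparrow chromosome 8
last_site   <- 49693117  # position of the final variant (n.sites)
chr_length  <- 49693984  # full length of chromosome 8
n_biallelic <- 91312     # n.biallelic.sites
n_poly      <- 1092      # n.polyallelic.sites

# Total number of variant positions
total_snps <- n_biallelic + n_poly
total_snps  # 92404

# Fraction of the chromosome reached by the last variant
coverage <- last_site / chr_length

# Average number of variants per kilobase in this subset
snps_per_kb <- total_snps / (chr_length / 1000)
```

The coverage value is very close to 1, which matches the observation that the variants span the whole chromosome. The per-kilobase density describes only this subset of the data, not the full call set.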