7.2 Returning to the sparrow dataset
In the last session, we used the PopGenome package to calculate sliding window estimates of nucleotide diversity across chromosome 8 of the house, Italian and Spanish sparrows with data from Ravinet et al. (2018). We will now return to this example and use it to demonstrate why we must interpret the genomic landscape of differentiation with caution.
7.2.1 Reading in the sparrow vcf
The first step is to read our VCF of sparrow chromosome 8 into the R environment. This is exactly the same procedure as in the last session, but in case you missed those steps, here they are again. Remember that because the VCF is large, the file is compressed, and there are some preprocessing steps you will need to carry out before you can open it in R.
- First, download the VCF
- Next, make a directory in your working directory (use getwd if you don’t know where that is) and call it sparrow_snps
- Move the downloaded VCF into this new directory and then uncompress it. If you do not have a program for this, you can use either the Unarchiver (Mac OS X) or 7zip (Windows).
- Make sure only the uncompressed file is present in the directory.
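The directory steps above can also be checked from within R. The following is a minimal sketch using only base R. It assumes you run it from your working directory and that you move and uncompress the VCF yourself.

```r
# Show the current working directory
getwd()

# Create the sparrow_snps directory if it does not already exist
if (!dir.exists("sparrow_snps")) {
  dir.create("sparrow_snps")
}

# List the contents; only the uncompressed VCF should be present
list.files("sparrow_snps")
```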
With these steps carried out, you can read this data in like so:
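A minimal sketch of this step, assuming the PopGenome package is installed and the uncompressed VCF is the only file in sparrow_snps. Note that readData takes the path to a directory rather than to a single file. The include.unknown and FAST settings shown here are assumptions and may need adjusting. On Windows, if the relative path is not found, try supplying the full path to the directory instead.

```r
library(PopGenome)

# readData() expects the path to a directory, not to a single file.
# include.unknown and FAST are assumed settings; adjust them if needed.
sparrows <- readData("./sparrow_snps/", format = "VCF",
                     include.unknown = TRUE, FAST = TRUE)
```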
Like last time, we then need to read the file with population information, and attach that to our sparrows object. You should have the file available from last week’s tutorial, otherwise it can be downloaded here.
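A sketch of attaching the population information, assuming the sparrows object already exists and that the population file is tab-delimited with one column of individual names and one of population labels. The file name sparrow_pops.txt and the column names ind and pop are hypothetical, so change them to match your own file.

```r
library(PopGenome)

# Hypothetical file name and column names (ind, pop); change to match your file
sparrow_info <- read.delim("./sparrow_pops.txt", header = TRUE,
                           stringsAsFactors = FALSE)

# Build a list with one character vector of individual names per population
populations <- split(sparrow_info$ind, sparrow_info$pop)

# Attach the populations to the sparrows object; the sparrows are diploid
sparrows <- set.populations(sparrows, populations, diploid = TRUE)

# Check that the populations were set
sparrows@populations
```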
7.2.2 Examining the variant data
Remember, you can look at the data we have read in using the following command:
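In PopGenome, a summary of this kind can be printed with get.sum.data. This sketch assumes the sparrows object created above.

```r
# Summarise the data that was read in (one row per region, here chromosome 8)
get.sum.data(sparrows)
```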
In this case, you can see from n.sites that the final site is at position 49,693,117. The actual chromosome is 49,693,984 bp long, so this confirms that the variants span the entire chromosome. Note that n.sites is a bit counter-intuitive here: it would only make sense as the number of sites if we had called nucleotides at every single position in the genome. Since this is a variant call format file containing only polymorphic positions, that is not the case. Furthermore, the data has actually been subset in order to make it more manageable for our purposes today.
Nonetheless, it is still substantial: from n.biallelic.sites we can see there are 91,312 biallelic SNPs, and from n.polyallelic.sites there are 1,092 positions with more than two alleles. So in total we have:
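With the data loaded, the total can be taken straight from the two slots of the sparrows object. The sketch below shows that call as a comment and then repeats the sum by hand using the counts quoted above, so it runs even without the data.

```r
# With the data loaded, sum the two slots directly:
# sparrows@n.biallelic.sites + sparrows@n.polyallelic.sites

# The same sum using the counts quoted above
n_biallelic <- 91312
n_polyallelic <- 1092
total_snps <- n_biallelic + n_polyallelic
total_snps
```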
A total of 92,404 SNPs - a big dataset which requires some specific approaches to handling the data.