2.1 Intro to data manipulation with tidyverse
Data manipulation might seem like quite a boring topic, but it is a crucial part of data science and, increasingly, of bioinformatics and evolutionary biology. For the average researcher working with biological data, we would estimate that the vast majority of analysis time is spent handling the data. By handling and manipulation, we mean exploring the data, shaping it into a form we want to work with, and extracting the information we find important or interesting. Getting to know your data is fundamental to understanding it properly, and that is why we have decided to dedicate time to it in this chapter.
At this point in our tutorial, we will use a series of packages collectively known as the tidyverse; in particular, we will focus on functions from a tidyverse package called dplyr. These packages grew from the approach of Hadley Wickham - a statistician responsible for popularising fresh approaches to R and data science. As with nearly all things in R, there are many, many ways to achieve the same goal, and the guidelines we give here are by no means definitive. However, we choose to introduce these principles now because, in our experience of data analysis, they have greatly improved our efficiency, the clarity of our R code and the way we work with data.
2.1.1 What is the tidyverse?
It’s important to emphasize that the tidyverse set of packages does mostly the same things that base R can already do. So what’s the difference? While base R is a collection of different methods and functions built up over years, the tidyverse is designed with a specific philosophy in mind. This leads to a consistent approach to solving problems that many find appealing. That being said, if you find you prefer the “regular” R functions over their tidyverse equivalents, go ahead and use those instead. There’s nothing wrong with that.
2.1.2 The dplyr package
dplyr is one of the packages in the tidyverse, and is focused on manipulating data in data frames. At its core, dplyr consists of combining 5 different verbs for data handling:
select() selects columns from your data
filter() filters rows based on certain criteria
mutate() creates new columns (not covered in this tutorial, but included in this list for completeness)
group_by() creates groups for summarizing data
summarise() summarises data based on the groups you have created
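As a rough preview of what these verbs look like in practice, here is a minimal sketch. It assumes dplyr is installed and uses the iris dataset that ships with R; that dataset, and the column name Sepal.Ratio, are our own choices for illustration and are not part of the tutorial data.

```r
library(dplyr)

# iris is a built-in data frame of flower measurements (illustrative only)

# select(): keep only the columns we name
sel <- select(iris, Species, Sepal.Length)

# filter(): keep only the rows that meet a condition
filt <- filter(iris, Sepal.Length > 7)

# mutate(): add a new column computed from existing ones
# (Sepal.Ratio is a hypothetical name chosen for this example)
mut <- mutate(iris, Sepal.Ratio = Sepal.Length / Sepal.Width)

# group_by(): mark the rows as belonging to groups, here one per species
grouped <- group_by(iris, Species)

# summarise(): collapse each group to a single row of summary values
summ <- summarise(grouped, mean_sepal_length = mean(Sepal.Length))

print(summ)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to combine.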
We will go through the use of these functions shortly. You may notice that you already learned how to select, filter and mutate data last week using [] and $. That is correct, and it is exactly what we mean when we say that base R and the tidyverse can do the same things.
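To make that equivalence concrete, the sketch below performs the same three operations twice, once with [] and $ and once with dplyr. It again assumes the built-in iris dataset, which is our choice for illustration; pull() is a dplyr function not otherwise covered here, used only as the counterpart of $.

```r
library(dplyr)

# Extract one column as a vector
base_col  <- iris$Sepal.Length
dplyr_col <- pull(iris, Sepal.Length)

# Keep some rows and some columns
base_rows  <- iris[iris$Sepal.Length > 7, c("Species", "Sepal.Length")]
dplyr_rows <- select(filter(iris, Sepal.Length > 7), Species, Sepal.Length)

# Add a new column (Sepal.Ratio is a hypothetical name for this example)
iris_base <- iris
iris_base$Sepal.Ratio <- iris_base$Sepal.Length / iris_base$Sepal.Width
iris_dplyr <- mutate(iris, Sepal.Ratio = Sepal.Length / Sepal.Width)

# The values agree; row names may differ between the two approaches,
# so we compare the columns rather than the whole data frames
identical(base_col, dplyr_col)
identical(base_rows$Sepal.Length, dplyr_rows$Sepal.Length)
all.equal(iris_base$Sepal.Ratio, iris_dplyr$Sepal.Ratio)
```

Which style you prefer is largely a matter of taste; the dplyr versions name the operation explicitly, while the base R versions rely on indexing.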