2.2 Using dplyr to work with your data
First, we have to install and load the tidyverse.7
Together with the tidyverse, you get a built in data set called starwars
, containing information about the characters from the Star Wars films. We will use this data set for most of this tutorial. Since it’s already loaded, all you have to do to access it is run:
starwars
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Or… 150 49 brown light brown 19 fema… femin…
#> 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
#> 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 8 R5-D4 97 32 <NA> white, red red NA none mascu…
#> 9 Biggs D… 183 84 black light brown 24 male mascu…
#> 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Additional info: starwars
is a data frame like you learned about last week, but you might notice that it’s printing a bit differently (e.g. with red color for NA
, condensing the output). This is because starwars
is a special kind of data frame introduced in the tidyverse, called a tibble
. For all practical purposes, a tibble and a data frame is the same, so throughout this course we won’t care much whether our data is contained in a tibble or a regular data frame.
2.2.1 The pipe
The tidyverse introduces a new operator called the pipe, which looks like this %>%
. Conceptually, the pipe allows you to do something with your data, and then send the result to a new function which does more work, sends it to the next function and so on until you’re satisfied. You can for instance use the pipe like this:
This is equivalent to writing:
Basically, you are putting the left hand side of the pipe into the parentheses in the function on the right hand side. This may not seem useful right now, but as we will see later, this can make code much easier to read.
You can also assign the results of your pipe to an object like any ordinary calculation:
There will be a lot of examples of using the pipe throughout this tutorial, showing how it can make quite complex code readable.
Important concept:
The pipe operator %>%
allows you to send an object from the left side of the pipe to a function on the right side.
2.2.2 Selecting columns with select()
Lets say we want to choose the name
and homeworld
columns from our starwars
data, how can we do that? With standard R, we might do something like this.
With dplyr
we can do the following:
select(starwars, name, homeworld)
#> # A tibble: 87 × 2
#> name homeworld
#> <chr> <chr>
#> 1 Luke Skywalker Tatooine
#> 2 C-3PO Tatooine
#> 3 R2-D2 Naboo
#> 4 Darth Vader Tatooine
#> 5 Leia Organa Alderaan
#> 6 Owen Lars Tatooine
#> 7 Beru Whitesun Lars Tatooine
#> 8 R5-D4 Tatooine
#> 9 Biggs Darklighter Tatooine
#> 10 Obi-Wan Kenobi Stewjon
#> # ℹ 77 more rows
The first argument here is your data, while the others are the columns you want to select. Note that you don’t need to use the quotes "
here, you generally don’t need those for the dplyr
functions.
select()
becomes even more intuitive when using the pipe:
This style of code is closer to how we would write in English: “Take the starwars
data, and select the name
and homeworld
columns”. From now on we will write all our dplyr
code using the pipe.
If you want to omit a column, you can use -
in front of its name:
select
also has additional ways of selecting columns, some examples of this is shown below:
# choose only columns containing an underscore
starwars %>% select(contains("_"))
# choose only columns beginning with "s"
starwars %>% select(starts_with("s"))
# choose only columns ending with "color"
starwars %>% select(ends_with("color"))
Exercise: Use select()
to select numeric columns, i.e. all the columns that contain numbers, and save it to an object with a meaningful name. You can do this manually by looking at the columns. If you want a challenge, see if you can figure out how to automatically select numeric columns from your data.
2.2.3 Filtering colums using filter()
Last week, you learned to filter a data set based on some criterion using the square brackets []
. To filter out only the humans from the starwars
data set, you could write:
dplyr
’s filter()
function does the same thing, but with a slightly different syntax:
This looks quite similar to using the square brackets, one notable difference being that you don’t need to use starwars$
within filter()
, the function already understands that we are working with starwars
.
Like with base R, you can use this to filter with the other logical operators as well, like >
and !=
:
# get people lower than 1m tall
starwars %>% filter(height < 100)
# get all non-humans
starwars %>% filter(species != "Human")
You can also filter using several criteria at once, simply separate the logical statements with a comma:
2.2.3.1 Combining filter()
and select()
The real power of the pipe shows when you chain several operations together. To both filter and select from your data, simply first do the filtering and then pipe the result to select
:8
starwars %>% filter(height < 100) %>%
select(name, height, birth_year)
#> # A tibble: 7 × 3
#> name height birth_year
#> <chr> <int> <dbl>
#> 1 R2-D2 96 33
#> 2 R5-D4 97 NA
#> 3 Yoda 66 896
#> 4 Wicket Systri Warrick 88 8
#> 5 Ratts Tyerel 79 NA
#> 6 Dud Bolt 94 NA
#> 7 R4-P17 96 NA
Again, the code looks like how you would explain what you’re doing: “take the starwars data, filter based on height, and select the name
, height
and birth_year
columns”. Note that you can have a line break after a pipe like you can inside parentheses, your code will still continue running.
Exercise: use filter()
to choose all the people that has “Naboo” as homeworld
. Select name, skin_color and eye_color and save the result to an object.
Important concept:
- filter()
is used to select specific rows. Example: filter(height < 100)
- select()
is used to select specific columns. Example: select(name, height)
2.2.4 Grouped summaries with group_by()
and summarise()
Imagine that you want to calculate the mean height of the people (and droids) in the starwars
data set. You could use mean()
on the column to achieve this (note the use of na.rm
since the height column contains NA
s):
But what if you want to calculate the mean height separately for e.g. the different species? One way to do this is to do a grouped summary. Your group is the species
column, and your summary statistic is mean
. We create groups using the group_by()
function:
starwars %>% group_by(species)
#> # A tibble: 87 × 14
#> # Groups: species [38]
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Or… 150 49 brown light brown 19 fema… femin…
#> 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
#> 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 8 R5-D4 97 32 <NA> white, red red NA none mascu…
#> 9 Biggs D… 183 84 black light brown 24 male mascu…
#> 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Notice that nothing has changed in the data, but at the top you can see the text # Groups: species [38]
, showing that you indeed have created a group, and that you have 38 different species in your data. The main use of group_by()
is together with summarise()
, which does a summary based on the groups you’ve created:
starwars %>%
group_by(species) %>% #create group
summarise(mean_height = mean(height, na.rm = TRUE)) # calculate summary statistic
#> # A tibble: 38 × 2
#> species mean_height
#> <chr> <dbl>
#> 1 Aleena 79
#> 2 Besalisk 198
#> 3 Cerean 198
#> 4 Chagrian 196
#> 5 Clawdite 168
#> 6 Droid 131.
#> 7 Dug 112
#> 8 Ewok 88
#> 9 Geonosian 183
#> 10 Gungan 209.
#> # ℹ 28 more rows
Note how again, like in data.frame
, the argument name to summarise
becomes the column name in your new data frame. You can use several summary functions inside summarise()
, like median()
, sd()
, sum()
and max()
to name some. You can also do several summaries within a single summarise()
function:
starwars %>%
group_by(species) %>%
summarise(mean_height = mean(height, na.rm = TRUE),
median_height = median(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 38 × 4
#> species mean_height median_height sd_height
#> <chr> <dbl> <dbl> <dbl>
#> 1 Aleena 79 79 NA
#> 2 Besalisk 198 198 NA
#> 3 Cerean 198 198 NA
#> 4 Chagrian 196 196 NA
#> 5 Clawdite 168 168 NA
#> 6 Droid 131. 97 49.1
#> 7 Dug 112 112 NA
#> 8 Ewok 88 88 NA
#> 9 Geonosian 183 183 NA
#> 10 Gungan 209. 206 14.2
#> # ℹ 28 more rows
We can even group by several variables, creating more detailed summaries:
starwars %>%
group_by(homeworld, sex) %>%
summarise(mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 60 × 3
#> # Groups: homeworld [49]
#> homeworld sex mean_height
#> <chr> <chr> <dbl>
#> 1 Alderaan female 150
#> 2 Alderaan male 190.
#> 3 Aleen Minor male 79
#> 4 Bespin male 175
#> 5 Bestine IV <NA> 180
#> 6 Cato Neimoidia male 191
#> 7 Cerea male 198
#> 8 Champala male 196
#> 9 Chandrila female 150
#> 10 Concord Dawn male 183
#> # ℹ 50 more rows
Now you get two groups for homeworld Alderaan, one with males and one with females. For the following homeworld groups there are only males (except Chandrila, with only females), so you just get one group for each (giving a pretty accurate picture of the gender balance in Star Wars).
2.2.4.1 Counting how many observations we have in our groups with tally()
When we have created our groups, we can also use the tally()
function to count the number of observations we have in the groups:
starwars %>%
group_by(species) %>%
tally()
#> # A tibble: 38 × 2
#> species n
#> <chr> <int>
#> 1 Aleena 1
#> 2 Besalisk 1
#> 3 Cerean 1
#> 4 Chagrian 1
#> 5 Clawdite 1
#> 6 Droid 6
#> 7 Dug 1
#> 8 Ewok 1
#> 9 Geonosian 1
#> 10 Gungan 3
#> # ℹ 28 more rows
This can be useful to get an overview of your data9
Important concept:
group_by()
can be combined with different functions to give an overview of your data.
group_by() %>% summarise()
does some calculation in each group. Example:group_by(homeworld, sex) %>% summarise(mean_height = mean(height))
group_by() %>% tally()
counts the number of observations in the groups. Example:group_by(species) %>% tally()
2.2.5 Using everything we’ve learned in a single pipe, and a dplyr
exercise
One advantage of pipes is that you can do everything you want in a single operation10. Below is an example using everything we’ve learned so far in a single pipe.
starwars %>%
select(-films, -vehicles, -starships) %>%
filter(species == "Human") %>%
group_by(sex) %>%
summarise(mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 2 × 2
#> sex mean_height
#> <chr> <dbl>
#> 1 female 164.
#> 2 male 182.
Exercise: Explain to another student what happens in the code above (or write it down if you’re doing this on your own).
Exercise: Take the starwars data set, filter so you keep all that are below the mean height. Then, calculate the mean height of these short individuals, grouped by homeworld.
Show hint
You can supply the mean height to the logical statement inside filter()
. Your filtering step should then look like this: filter(height < mean(height, na.rm = TRUE))
starwars %>%
filter(height < mean(height, na.rm = TRUE)) %>%
group_by(homeworld) %>%
summarise(mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 19 × 2
#> homeworld mean_height
#> <chr> <dbl>
#> 1 Alderaan 150
#> 2 Aleen Minor 79
#> 3 Chandrila 150
#> 4 Corellia 170
#> 5 Coruscant 168.
#> 6 Endor 88
#> 7 Iridonia 171
#> 8 Malastare 112
#> 9 Mirial 168
#> 10 Naboo 147
#> 11 Rodia 173
#> 12 Sullust 160
#> 13 Tatooine 153.
#> 14 Toydaria 137
#> 15 Troiken 122
#> 16 Tund 163
#> 17 Vulpter 94
#> 18 Zolan 168
#> 19 <NA> 81
Remember that you only need to install a package once, but that it needs to be loaded with
library()
every time you want to use it.↩︎Remember that what the pipe basically does is to put the left hand side of the pipe into the function on the right hand side. Without the pipe, filtering and selecting looks like this:
select(filter(starwars, height < 100), name, height, birth_year)
↩︎for example, you could realise that it doesn’t make sense to calculate mean and standard deviation when you only have a single value, like we’ve done quite a bit↩︎
In practice, it’s probably smart to make an object of your intermediary results every now and then.↩︎