2.2 Using dplyr to work with your data

First, we have to install and load the tidyverse.⁷

install.packages("tidyverse")
library(tidyverse)

Together with the tidyverse, you get a built in data set called starwars, containing information about the characters from the Star Wars films. We will use this data set for most of this tutorial. Since it’s already loaded, all you have to do to access it is run:

starwars
#> # A tibble: 87 × 14
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
#>  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
#>  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
#>  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
#>  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
#> 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Additional info: starwars is a data frame like you learned about last week, but you might notice that it’s printing a bit differently (e.g. with red color for NA, condensing the output). This is because starwars is a special kind of data frame introduced in the tidyverse, called a tibble. For all practical purposes, a tibble and a data frame is the same, so throughout this course we won’t care much whether our data is contained in a tibble or a regular data frame.

2.2.1 The pipe

The tidyverse introduces a new operator called the pipe, which looks like this %>%. Conceptually, the pipe allows you to do something with your data, and then send the result to a new function which does more work, sends it to the next function and so on until you’re satisfied. You can for instance use the pipe like this:

x <- 1:100
x^2 %>% mean()
#> [1] 3383.5

This is equivalent to writing:

mean(x^2)
#> [1] 3383.5

Basically, you are putting the left hand side of the pipe into the parentheses in the function on the right hand side. This may not seem useful right now, but as we will see later, this can make code much easier to read.

You can also assign the results of your pipe to an object like any ordinary calculation:

x2_mean <- x^2 %>% mean()

There will be a lot of examples of using the pipe throughout this tutorial, showing how it can make quite complex code readable.

Important concept:
The pipe operator %>% allows you to send an object from the left side of the pipe to a function on the right side.

2.2.2 Selecting columns with `select()`

Lets say we want to choose the name and homeworld columns from our starwars data, how can we do that? With standard R, we might do something like this.

# with names
starwars[, c('name', 'homeworld')]
# with indices
starwars[, c(1, 9)]

With dplyr we can do the following:

select(starwars, name, homeworld)
#> # A tibble: 87 × 2
#>    name               homeworld
#>    <chr>              <chr>    
#>  1 Luke Skywalker     Tatooine 
#>  2 C-3PO              Tatooine 
#>  3 R2-D2              Naboo    
#>  4 Darth Vader        Tatooine 
#>  5 Leia Organa        Alderaan 
#>  6 Owen Lars          Tatooine 
#>  7 Beru Whitesun Lars Tatooine 
#>  8 R5-D4              Tatooine 
#>  9 Biggs Darklighter  Tatooine 
#> 10 Obi-Wan Kenobi     Stewjon  
#> # ℹ 77 more rows

The first argument here is your data, while the others are the columns you want to select. Note that you don’t need to use the quotes " here, you generally don’t need those for the dplyr functions.

select() becomes even more intuitive when using the pipe:

starwars %>% select(name, homeworld)

This style of code is closer to how we would write in English: “Take the starwars data, and select the name and homeworld columns”. From now on we will write all our dplyr code using the pipe.

If you want to omit a column, you can use - in front of its name:

# choose all columns BUT name
starwars %>% select(-name)

select also has additional ways of selecting columns, some examples of this is shown below:

# choose only columns containing an underscore
starwars %>% select(contains("_"))
# choose only columns beginning with "s"
starwars %>% select(starts_with("s"))
# choose only columns ending with "color"
starwars %>% select(ends_with("color"))

Exercise: Use select() to select numeric columns, i.e. all the columns that contain numbers, and save it to an object with a meaningful name. You can do this manually by looking at the columns. If you want a challenge, see if you can figure out how to automatically select numeric columns from your data.

# manual method:
sw_hmb <- starwars %>% select(height, mass, birth_year)

# challenge:
sw_hmb <- starwars %>% select(where(is.numeric))

2.2.3 Filtering colums using `filter()`

Last week, you learned to filter a data set based on some criterion using the square brackets []. To filter out only the humans from the starwars data set, you could write:

starwars[starwars$species == "Human", ]

dplyr’s filter() function does the same thing, but with a slightly different syntax:

starwars %>% filter(species == "Human")

This looks quite similar to using the square brackets, one notable difference being that you don’t need to use starwars$ within filter(), the function already understands that we are working with starwars.

Like with base R, you can use this to filter with the other logical operators as well, like > and !=:

# get people lower than 1m tall
starwars %>% filter(height < 100)

# get all non-humans
starwars %>% filter(species != "Human")

You can also filter using several criteria at once, simply separate the logical statements with a comma:

# get all non-humans shorter than 1m
starwars %>% filter(height < 100, species != "Human")

2.2.3.1 Combining `filter()` and `select()`

The real power of the pipe shows when you chain several operations together. To both filter and select from your data, simply first do the filtering and then pipe the result to select:⁸

starwars %>% filter(height < 100) %>% 
  select(name, height, birth_year)
#> # A tibble: 7 × 3
#>   name                  height birth_year
#>   <chr>                  <int>      <dbl>
#> 1 R2-D2                     96         33
#> 2 R5-D4                     97         NA
#> 3 Yoda                      66        896
#> 4 Wicket Systri Warrick     88          8
#> 5 Ratts Tyerel              79         NA
#> 6 Dud Bolt                  94         NA
#> 7 R4-P17                    96         NA

Again, the code looks like how you would explain what you’re doing: “take the starwars data, filter based on height, and select the name, height and birth_year columns”. Note that you can have a line break after a pipe like you can inside parentheses, your code will still continue running.

Exercise: use filter() to choose all the people that has “Naboo” as homeworld. Select name, skin_color and eye_color and save the result to an object.

sw_naboo_color <- starwars %>% filter(homeworld == "Naboo") %>%
  select(name, skin_color, eye_color)

Important concept:
- filter() is used to select specific rows. Example: filter(height < 100) - select() is used to select specific columns. Example: select(name, height)

2.2.4 Grouped summaries with `group_by()` and `summarise()`

Imagine that you want to calculate the mean height of the people (and droids) in the starwars data set. You could use mean() on the column to achieve this (note the use of na.rm since the height column contains NAs):

mean(starwars$height, na.rm = TRUE)
#> [1] 174.6049

But what if you want to calculate the mean height separately for e.g. the different species? One way to do this is to do a grouped summary. Your group is the species column, and your summary statistic is mean. We create groups using the group_by() function:

starwars %>% group_by(species)
#> # A tibble: 87 × 14
#> # Groups:   species [38]
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
#>  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
#>  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
#>  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
#>  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
#> 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Notice that nothing has changed in the data, but at the top you can see the text # Groups: species [38], showing that you indeed have created a group, and that you have 38 different species in your data. The main use of group_by() is together with summarise(), which does a summary based on the groups you’ve created:

starwars %>% 
  group_by(species) %>% #create group
  summarise(mean_height = mean(height, na.rm = TRUE)) # calculate summary statistic
#> # A tibble: 38 × 2
#>    species   mean_height
#>    <chr>           <dbl>
#>  1 Aleena            79 
#>  2 Besalisk         198 
#>  3 Cerean           198 
#>  4 Chagrian         196 
#>  5 Clawdite         168 
#>  6 Droid            131.
#>  7 Dug              112 
#>  8 Ewok              88 
#>  9 Geonosian        183 
#> 10 Gungan           209.
#> # ℹ 28 more rows

Note how again, like in data.frame, the argument name to summarise becomes the column name in your new data frame. You can use several summary functions inside summarise(), like median(), sd(), sum() and max() to name some. You can also do several summaries within a single summarise() function:

starwars %>% 
  group_by(species) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            median_height = median(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 38 × 4
#>    species   mean_height median_height sd_height
#>    <chr>           <dbl>         <dbl>     <dbl>
#>  1 Aleena            79             79      NA  
#>  2 Besalisk         198            198      NA  
#>  3 Cerean           198            198      NA  
#>  4 Chagrian         196            196      NA  
#>  5 Clawdite         168            168      NA  
#>  6 Droid            131.            97      49.1
#>  7 Dug              112            112      NA  
#>  8 Ewok              88             88      NA  
#>  9 Geonosian        183            183      NA  
#> 10 Gungan           209.           206      14.2
#> # ℹ 28 more rows

We can even group by several variables, creating more detailed summaries:

starwars %>%
  group_by(homeworld, sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 60 × 3
#> # Groups:   homeworld [49]
#>    homeworld      sex    mean_height
#>    <chr>          <chr>        <dbl>
#>  1 Alderaan       female        150 
#>  2 Alderaan       male          190.
#>  3 Aleen Minor    male           79 
#>  4 Bespin         male          175 
#>  5 Bestine IV     <NA>          180 
#>  6 Cato Neimoidia male          191 
#>  7 Cerea          male          198 
#>  8 Champala       male          196 
#>  9 Chandrila      female        150 
#> 10 Concord Dawn   male          183 
#> # ℹ 50 more rows

Now you get two groups for homeworld Alderaan, one with males and one with females. For the following homeworld groups there are only males (except Chandrila, with only females), so you just get one group for each (giving a pretty accurate picture of the gender balance in Star Wars).

2.2.4.1 Counting how many observations we have in our groups with `tally()`

When we have created our groups, we can also use the tally() function to count the number of observations we have in the groups:

starwars %>% 
  group_by(species) %>%
  tally()
#> # A tibble: 38 × 2
#>    species       n
#>    <chr>     <int>
#>  1 Aleena        1
#>  2 Besalisk      1
#>  3 Cerean        1
#>  4 Chagrian      1
#>  5 Clawdite      1
#>  6 Droid         6
#>  7 Dug           1
#>  8 Ewok          1
#>  9 Geonosian     1
#> 10 Gungan        3
#> # ℹ 28 more rows

This can be useful to get an overview of your data⁹

Important concept:
group_by() can be combined with different functions to give an overview of your data.

group_by() %>% summarise() does some calculation in each group. Example: group_by(homeworld, sex) %>% summarise(mean_height = mean(height))
group_by() %>% tally() counts the number of observations in the groups. Example: group_by(species) %>% tally()

2.2.5 Using everything we’ve learned in a single pipe, and a `dplyr` exercise

One advantage of pipes is that you can do everything you want in a single operation¹⁰. Below is an example using everything we’ve learned so far in a single pipe.

starwars %>%
  select(-films, -vehicles, -starships) %>%
  filter(species == "Human") %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 2 × 2
#>   sex    mean_height
#>   <chr>        <dbl>
#> 1 female        164.
#> 2 male          182.

Exercise: Explain to another student what happens in the code above (or write it down if you’re doing this on your own).

Exercise: Take the starwars data set, filter so you keep all that are below the mean height. Then, calculate the mean height of these short individuals, grouped by homeworld.

Show hint

You can supply the mean height to the logical statement inside filter(). Your filtering step should then look like this: filter(height < mean(height, na.rm = TRUE))

starwars %>%
  filter(height < mean(height, na.rm = TRUE)) %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 19 × 2
#>    homeworld   mean_height
#>    <chr>             <dbl>
#>  1 Alderaan           150 
#>  2 Aleen Minor         79 
#>  3 Chandrila          150 
#>  4 Corellia           170 
#>  5 Coruscant          168.
#>  6 Endor               88 
#>  7 Iridonia           171 
#>  8 Malastare          112 
#>  9 Mirial             168 
#> 10 Naboo              147 
#> 11 Rodia              173 
#> 12 Sullust            160 
#> 13 Tatooine           153.
#> 14 Toydaria           137 
#> 15 Troiken            122 
#> 16 Tund               163 
#> 17 Vulpter             94 
#> 18 Zolan              168 
#> 19 <NA>                81

Remember that you only need to install a package once, but that it needs to be loaded with library() every time you want to use it.↩︎
Remember that what the pipe basically does is to put the left hand side of the pipe into the function on the right hand side. Without the pipe, filtering and selecting looks like this: select(filter(starwars, height < 100), name, height, birth_year)↩︎
for example, you could realise that it doesn’t make sense to calculate mean and standard deviation when you only have a single value, like we’ve done quite a bit↩︎
In practice, it’s probably smart to make an object of your intermediary results every now and then.↩︎