10.2 More on data handling: categorising and joining

Handling large amounts of data with few lines of code is one of R’s strong points. In this section we will show how you can use ifelse() and the left_join() function from dplyr to make categories and add information to your data.

10.2.1 ifelse() for making categories

Let’s first review what the ifelse() function does. As you learned back in chapter 5, ifelse() takes a logical condition and returns one value where the condition is TRUE and another where it is FALSE:

# make a small vector
y <- c(20, 30, 50)
# use ifelse to evaluate it
ifelse(y > 25, "Greater than 25", "Not greater than 25")
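
To make the element-wise behaviour concrete, here is a small sketch with a made-up vector (y_na and labels are names chosen for illustration, not part of the course data). It shows that the condition is checked once per element, and that a missing value in the condition gives a missing value in the result:

```r
# a small vector with one missing value (made-up numbers)
y_na <- c(20, 30, NA, 50)
# the condition is evaluated once per element
y_na > 25
# ifelse() returns one label per element; NA in the condition stays NA
labels <- ifelse(y_na > 25, "Greater than 25", "Not greater than 25")
labels
```

This is worth keeping in mind when working with real data, where missing values are common.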

This function can be extremely useful for creating new variables in datasets. Let’s return to the familiar starwars data from dplyr in order to use the function in this way.

starwars
#> # A tibble: 87 × 14
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
#>  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
#>  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
#>  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
#>  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
#> 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Now we can take a look at the starwars$species vector. There are a lot of different species, so what if we wanted to create a vector that simply states whether an individual is a droid or not?

# use ifelse() to identify droids and non-droids
ifelse(starwars$species == "Droid", "Droid", "Non-Droid")
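
The line above only prints the labels. As a sketch of how the result could be stored as a new variable in the data, the example below uses mutate() from dplyr; the names starwars_droid and is_droid are made up for illustration. Any individual whose species is missing would get NA rather than "Non-Droid", because the comparison itself is NA:

```r
library(dplyr)

# add the labels as a new column (is_droid is a name chosen for illustration)
starwars_droid <- mutate(
  starwars,
  is_droid = ifelse(species == "Droid", "Droid", "Non-Droid")
)
# count the rows per label; missing species, if any, show up as NA
count(starwars_droid, is_droid)
```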

This can be useful, for example, when comparing droids and non-droids in a plot or table. Now say we want to label each individual based on whether it is a droid, a human or neither of the two. A useful thing with ifelse() is that the third argument can be another ifelse() function! So we can actually chain ifelse() commands like this:

ifelse(starwars$species == "Droid", "Droid", ifelse(starwars$species == "Human", "Human", "Neither human nor droid"))

This is useful, but quickly becomes convoluted. Imagine how the code would look if we threw in a fourth and fifth category there! In cases like this, remember to use line breaks to make the code more readable. You can have a line break anywhere inside the parentheses of a function call, and R will still understand that it’s part of the same function. A suggestion for better formatting than above:

ifelse(
  starwars$species == "Droid", "Droid",
  ifelse(
    starwars$species == "Human", "Human",
    "Neither human nor droid"
  )
)

Still, if you have more than, say, four or five categories, this becomes difficult to read and time-consuming to write. For tasks such as adding more information to a data frame, joining may be a better alternative, which we will go through next.
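
Before moving on to joins, it is worth knowing that dplyr also has a function called case_when(), which is not covered in this section but can express the same categories without nesting. The following is only a sketch, assuming dplyr is loaded; species_group is a name made up for the example. Note that it does not behave exactly like the nested ifelse(): where species is NA, neither condition is TRUE, so the row falls through to the last line and gets the "Neither" label instead of NA.

```r
library(dplyr)

# one condition per line, checked from top to bottom
species_group <- case_when(
  starwars$species == "Droid" ~ "Droid",
  starwars$species == "Human" ~ "Human",
  # everything else, including missing species, ends up here
  TRUE ~ "Neither human nor droid"
)
# number of individuals in each category
table(species_group)
```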

10.2.2 Joining

For this section we will revisit the copepods.txt data that we encountered way back in week 2. Start by reading in this data. You should know enough by now to do this by yourself, so we won’t show you how.

copepods
#>   depth acartia calanus harpacticoida oithona oncaea temora
#> 1     0       0       3             0       2      0      0
#> 2     2       1       0             0       6      1      0
#> 3     4       1       0             0       7      0      1
#> 4     6      27       0             1       0      0      2
#> 5     8      11       0             2       6      0      3
#> 6    10      17       0             3       0      0      2
#> 7    12      13       0             1       0      0      1
#> 8    14       7       0            13       0      0      0
#> 9    16       6       0             6       0      0      1

Next, we use pivot_longer() to get all taxa in a single column, i.e., convert to long format. See if you manage to do this yourself before looking at the code below.

copepods_long <- pivot_longer(copepods, 
                              -depth, 
                              names_to = "taxon", 
                              values_to = "count")
copepods_long
#> # A tibble: 54 × 3
#>    depth taxon         count
#>    <int> <chr>         <int>
#>  1     0 acartia           0
#>  2     0 calanus           3
#>  3     0 harpacticoida     0
#>  4     0 oithona           2
#>  5     0 oncaea            0
#>  6     0 temora            0
#>  7     2 acartia           1
#>  8     2 calanus           0
#>  9     2 harpacticoida     0
#> 10     2 oithona           6
#> # ℹ 44 more rows

Now, say that you have recorded the temperature at each depth, and want to add that information to your copepod data. How would you go about doing that? First, here is the data in a data frame:

temps <- data.frame(
  depth = c(0,2,4,6,8,10,12,14,16),
  temp = c(15.5, 15.4, 15.2, 14.7, 11.4, 8.3, 7.6, 7.0, 6.8)
)
temps
#>   depth temp
#> 1     0 15.5
#> 2     2 15.4
#> 3     4 15.2
#> 4     6 14.7
#> 5     8 11.4
#> 6    10  8.3
#> 7    12  7.6
#> 8    14  7.0
#> 9    16  6.8

One way would be to use nested ifelse() functions, as we learned in the previous section. This is a lot of work and doesn’t look good, but it is written out below just to show you that it’s possible:

copepods_long$depthtemp <- ifelse(
  copepods_long$depth == 0, 15.5,
  ifelse(
    copepods_long$depth == 2, 15.4,
    ifelse(
      copepods_long$depth == 4, 15.2,
      ifelse(
        copepods_long$depth == 6, 14.7,
        ifelse(
          copepods_long$depth == 8, 11.4,
          ifelse(
            copepods_long$depth == 10, 8.3,
            ifelse(
              copepods_long$depth == 12, 7.6,
              ifelse(
                copepods_long$depth == 14, 7.0,
                ifelse(
                  copepods_long$depth == 16, 6.8,
                  NA
                )
              )
            )
          )
        )
      )
    )
  )
)

Instead, you can use the left_join() function from dplyr. You have to supply it with the original data, the data you want to join with, and a vector of column names to join by (here “depth”).

copepods_temp <- left_join(copepods_long, temps, by = "depth")
copepods_temp
#> # A tibble: 54 × 5
#>    depth taxon         count depthtemp  temp
#>    <dbl> <chr>         <int>     <dbl> <dbl>
#>  1     0 acartia           0      15.5  15.5
#>  2     0 calanus           3      15.5  15.5
#>  3     0 harpacticoida     0      15.5  15.5
#>  4     0 oithona           2      15.5  15.5
#>  5     0 oncaea            0      15.5  15.5
#>  6     0 temora            0      15.5  15.5
#>  7     2 acartia           1      15.4  15.4
#>  8     2 calanus           0      15.4  15.4
#>  9     2 harpacticoida     0      15.4  15.4
#> 10     2 oithona           6      15.4  15.4
#> # ℹ 44 more rows

left_join() matches one or more columns in your two data sets, and adds the columns from the second data set to the first data set in the correct place: each row of copepods_long gets the temperature from the row of temps with the same depth. You see that the temp column is equal to the depthtemp we created earlier, but it’s so much easier to work with! Keep in mind that it is this simple in our case because depth has the exact same name in both data frames. Remember this when recording data in the future!
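
If the column names do differ, the by argument can still pair them up with a named vector. The sketch below uses two small made-up data frames (counts, temps_m and the column name depth_m are hypothetical and only borrow a few values from the tables above). The temperature for depth 4 is left out on purpose, to show that a row without a match is kept and gets NA:

```r
library(dplyr)

# a few rows in long format (values borrowed from the copepod data)
counts <- data.frame(
  depth = c(0, 0, 2, 2, 4),
  taxon = c("acartia", "calanus", "acartia", "calanus", "acartia"),
  count = c(0, 3, 1, 0, 1)
)
# hypothetical temperature table where the depth column is called depth_m
temps_m <- data.frame(
  depth_m = c(0, 2),
  temp = c(15.5, 15.4)
)
# match depth in the first data frame with depth_m in the second
joined <- left_join(counts, temps_m, by = c("depth" = "depth_m"))
joined
```

All five rows of counts are kept, and the joined column keeps the name from the first data frame (depth).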

Important concept:
If you have your data spread out over multiple files, remember to name columns appropriately. All columns that contain the same kind of data should have the exact same name across all data sets. Similarly, the data should be entered in the same way in all data sets (e.g., don’t record depth as “2” in one data set and “2m” in the other). If you do this, you can easily join data sets with the left_join() function.
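
If you do end up with depths entered as text with a unit, they can often be cleaned up before joining. This is a minimal sketch using base R, assuming every value ends in a single "m"; depth_raw and depth_clean are made-up names:

```r
# hypothetical depths typed in with a unit
depth_raw <- c("0m", "2m", "4m")
# remove the trailing "m", then convert the text to numbers
depth_clean <- as.numeric(sub("m$", "", depth_raw))
depth_clean
```

After this step the values are numeric, so they can be matched against a numeric depth column in another data set.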