9.2 More on data handling: categorising and joining
Handling large amounts of data with few lines of code is one of R’s strong points. In this section we will show how you can use ifelse() and the left_join() function from dplyr to make categories and add information to your data.
9.2.1 ifelse() for making categories
Let’s first review what the ifelse() function does. As you learned back in chapter 5, ifelse() takes a logical statement and returns one value where the condition is TRUE and another where it is FALSE:
# make a small vector
y <- c(20, 30, 50)
# use ifelse to evaluate it
ifelse(y > 25, "Greater than 25", "Not greater than 25")
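The call above prints no result here, so the following minimal sketch stores the result and inspects it. The name labels is only chosen for illustration; the point is that ifelse() checks the condition for each element of y in turn and returns a vector of the same length.

```r
# same small vector as above
y <- c(20, 30, 50)

# ifelse() evaluates the condition element by element
labels <- ifelse(y > 25, "Greater than 25", "Not greater than 25")

# one label per element of y:
# the first is "Not greater than 25", the other two are "Greater than 25"
labels
length(labels) == length(y)
```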
This function can be extremely useful for creating new variables in datasets. Let’s return to the familiar starwars data from dplyr in order to use the function in this way.
starwars
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Or… 150 49 brown light brown 19 fema… femin…
#> 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
#> 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 8 R5-D4 97 32 <NA> white, red red NA none mascu…
#> 9 Biggs D… 183 84 black light brown 24 male mascu…
#> 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Now we can take a look at the starwars$species vector. There are a lot of different species, so what if we wanted to create a vector that simply states whether an individual is a droid or not?
# if else to identify droids and non-droids
ifelse(starwars$species == "Droid", "Droid", "Non-Droid")
This can be useful, for example, for comparing droids and non-droids in a plot or table. Now say we want to label each individual as a droid, a human, or neither of the two. A useful thing with ifelse() is that the third argument can be another ifelse() call! So we can actually chain ifelse() commands like this:
ifelse(starwars$species == "Droid", "Droid", ifelse(starwars$species == "Human", "Human", "Neither human nor droid"))
This is useful, but quickly becomes convoluted. Imagine how the code would look if we threw in a third and fourth category! In cases like this, remember to use line breaks to make the code more readable. You can put a line break anywhere after the opening parenthesis of a function call, and R will still understand that everything up to the closing parenthesis belongs to the same call. A suggestion for better formatting than above:
ifelse(
  starwars$species == "Droid", "Droid",
  ifelse(
    starwars$species == "Human", "Human",
    "Neither human nor droid"
  )
)
Still, if you have more than, say, four or five categories, nested ifelse() calls become difficult to read and time-consuming to write. When the goal is to add more information to a data frame, joining may be a better alternative, which we will go through next.
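One detail worth checking when you use ifelse() on real data is how missing values are handled. The sketch below uses a small made-up vector as a stand-in for a species column (the names species, individuals and droid_status are hypothetical), and shows that a comparison with NA gives NA, that %in% can be used if missing values should fall into the "else" category, and how to store the result as a new column.

```r
# a small stand-in for a species column, including a missing value
species <- c("Human", "Droid", NA, "Wookiee")

# comparing NA with == gives NA, so ifelse() returns NA for that element
with_na <- ifelse(species == "Droid", "Droid", "Non-Droid")
with_na
# elements: "Non-Droid", "Droid", NA, "Non-Droid"

# %in% never returns NA, so a missing species is labelled "Non-Droid"
without_na <- ifelse(species %in% "Droid", "Droid", "Non-Droid")
without_na
# elements: "Non-Droid", "Droid", "Non-Droid", "Non-Droid"

# store the result as a new column in a data frame
individuals <- data.frame(species = species)
individuals$droid_status <- ifelse(
  individuals$species == "Droid", "Droid", "Non-Droid"
)
individuals
```

Which of the two behaviours you want depends on the question: keeping the NA is usually the more honest choice when the species is simply unknown.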
9.2.2 Joining
For this section we will revisit the copepods.txt data that we encountered way back in week 2. Start by reading in this data. You should know enough by now to do this by yourself, so we won’t show you how.
copepods
#> depth acartia calanus harpacticoida oithona oncaea temora
#> 1 0 0 3 0 2 0 0
#> 2 2 1 0 0 6 1 0
#> 3 4 1 0 0 7 0 1
#> 4 6 27 0 1 0 0 2
#> 5 8 11 0 2 6 0 3
#> 6 10 17 0 3 0 0 2
#> 7 12 13 0 1 0 0 1
#> 8 14 7 0 13 0 0 0
#> 9 16 6 0 6 0 0 1
Next, we use pivot_longer() to get all taxa in a single column, i.e., convert the data to long format. See if you can manage this yourself; the result should look like this:
copepods_long
#> # A tibble: 54 × 3
#> depth taxon count
#> <int> <chr> <int>
#> 1 0 acartia 0
#> 2 0 calanus 3
#> 3 0 harpacticoida 0
#> 4 0 oithona 2
#> 5 0 oncaea 0
#> 6 0 temora 0
#> 7 2 acartia 1
#> 8 2 calanus 0
#> 9 2 harpacticoida 0
#> 10 2 oithona 6
#> # ℹ 44 more rows
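The reshaping step described above could be sketched as follows. This is one possible solution rather than the only one, and to keep the example self-contained it types in just the first three rows of the copepods data by hand instead of reading copepods.txt. The column names taxon and count are taken from the output shown above.

```r
library(tidyr)

# first three rows of the copepods data, typed in by hand for illustration
copepods <- data.frame(
  depth = c(0L, 2L, 4L),
  acartia = c(0L, 1L, 1L),
  calanus = c(3L, 0L, 0L),
  harpacticoida = c(0L, 0L, 0L),
  oithona = c(2L, 6L, 7L),
  oncaea = c(0L, 1L, 0L),
  temora = c(0L, 0L, 1L)
)

# gather every column except depth into two columns:
# taxon holds the old column names, count holds the values
copepods_long <- pivot_longer(
  copepods,
  cols = -depth,
  names_to = "taxon",
  values_to = "count"
)
copepods_long
```

With these three rows and six taxa the result has 18 rows; the full data set with nine depths gives the 54 rows shown above.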
Now, say that you have recorded the temperature at each depth, and want to add that information to your copepod data. How would you go about doing that? First, here is the data in a data frame:
temps <- data.frame(
  depth = c(0, 2, 4, 6, 8, 10, 12, 14, 16),
  temp = c(15.5, 15.4, 15.2, 14.7, 11.4, 8.3, 7.6, 7.0, 6.8)
)
temps
#> depth temp
#> 1 0 15.5
#> 2 2 15.4
#> 3 4 15.2
#> 4 6 14.7
#> 5 8 11.4
#> 6 10 8.3
#> 7 12 7.6
#> 8 14 7.0
#> 9 16 6.8
One way would be to use nested ifelse() functions, like we learned in the previous section. This is a lot of work and doesn’t look good, but it’s written out below just to show you it’s possible:
copepods_long$depthtemp <- ifelse(
  copepods_long$depth == 0, 15.5,
  ifelse(
    copepods_long$depth == 2, 15.4,
    ifelse(
      copepods_long$depth == 4, 15.2,
      ifelse(
        copepods_long$depth == 6, 14.7,
        ifelse(
          copepods_long$depth == 8, 11.4,
          ifelse(
            copepods_long$depth == 10, 8.3,
            ifelse(
              copepods_long$depth == 12, 7.6,
              ifelse(
                copepods_long$depth == 14, 7.0,
                ifelse(
                  copepods_long$depth == 16, 6.8,
                  NA
                )
              )
            )
          )
        )
      )
    )
  )
)
Instead, you can use the left_join() function from dplyr. You have to supply it the original data, the data you want to join with, and a vector of column names to join by (here “depth”).
copepods_temp <- left_join(copepods_long, temps, by = "depth")
copepods_temp
#> # A tibble: 54 × 5
#> depth taxon count depthtemp temp
#> <dbl> <chr> <int> <dbl> <dbl>
#> 1 0 acartia 0 15.5 15.5
#> 2 0 calanus 3 15.5 15.5
#> 3 0 harpacticoida 0 15.5 15.5
#> 4 0 oithona 2 15.5 15.5
#> 5 0 oncaea 0 15.5 15.5
#> 6 0 temora 0 15.5 15.5
#> 7 2 acartia 1 15.4 15.4
#> 8 2 calanus 0 15.4 15.4
#> 9 2 harpacticoida 0 15.4 15.4
#> 10 2 oithona 6 15.4 15.4
#> # ℹ 44 more rows
left_join() matches rows using one or more columns that your two data sets have in common, and adds the columns from the second data set to the matching rows of the first. You can see that the temp column is equal to the depthtemp column we created earlier, but it was so much easier to create! Keep in mind that it is this simple in our case because depth has the exact same name in both data frames. Remember this when recording data in the future!
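If the key column does not have the same name in both data frames, a join is still possible, just slightly less convenient. The sketch below uses two small made-up data frames (counts and temps_other, with the hypothetical column name depth_m) and the named-vector form of the by argument, which tells left_join() that depth in the first data set corresponds to depth_m in the second.

```r
library(dplyr)

# made-up example data: the depth column is named differently in each
counts <- data.frame(
  depth = c(0, 0, 2),
  taxon = c("acartia", "calanus", "acartia"),
  count = c(0, 3, 1)
)
temps_other <- data.frame(
  depth_m = c(0, 2),
  temp = c(15.5, 15.4)
)

# "depth" in counts is matched against "depth_m" in temps_other;
# the result keeps the column name from the first data set
joined <- left_join(counts, temps_other, by = c("depth" = "depth_m"))
joined
```

Using the same column name everywhere avoids having to remember this mapping each time you join.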
Important concept:
If you have your data spread out over multiple files, remember to name columns appropriately. All columns that contain the same kind of data should have the exact same name across all data sets. Similarly, the data should be entered in the same way in both data sets (e.g., don’t record depth as “2” in one data set and “2m” in the other). If you do this, you can easily join data sets with the left_join() function.
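To see why consistent data entry matters, here is a minimal sketch with made-up data in which depth is recorded as “2” in one data set and “2m” in the other. The data frame names (counts, temps_m) are hypothetical, and the clean-up with sub() is just one possible way of repairing the mismatch.

```r
library(dplyr)

# made-up example data with inconsistently recorded depths
counts <- data.frame(depth = c("0", "2"), count = c(3, 1))
temps_m <- data.frame(depth = c("0m", "2m"), temp = c(15.5, 15.4))

# no value of depth matches, so temp is NA in every row
mismatched <- left_join(counts, temps_m, by = "depth")
mismatched

# one possible repair: strip the trailing "m" and store depth as a number
# in both data sets before joining
counts$depth <- as.numeric(counts$depth)
temps_m$depth <- as.numeric(sub("m$", "", temps_m$depth))

fixed <- left_join(counts, temps_m, by = "depth")
fixed
```

Note that the first join runs without any error, so a mismatch like this is easy to miss unless you check the joined columns for unexpected NA values.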