5.1 More on functions: Vectorisation

Last week you learned a bit about creating your own functions, and hopefully saw how useful they can be for simplifying your workflow. This week you will learn more about how they work, and some tricks for using functions efficiently.

As an example, let’s use a simple function that takes radius and height as an input, and calculates the volume of a cylinder (\(\pi r^2h\)).

cyl_vol <- function(r, h){
  return(pi * r^2 * h)
}

First, let’s use it for a single value of r and h.

# radius 3, height 8
r <- 3
h <- 8
cyl_vol(r, h)
#> [1] 226.1947

Let’s say we want to calculate the volume of 10 cylinders with radius 3, and heights ranging from 1 through 10. In R, we can supply a vector of heights as the second argument, and do these 10 calculations in a single operation!

# radius 3, height 1:10
r <- 3
h <- 1:10
cyl_vol(r, h)
#>  [1]  28.27433  56.54867  84.82300 113.09734 141.37167 169.64600 197.92034
#>  [8] 226.19467 254.46900 282.74334

This is part of what we call vectorisation: taking a single function and applying it to a whole vector of values at once.
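The reason this works is that the arithmetic operators inside cyl_vol() are themselves vectorised, and a single value such as r is reused ("recycled") for every element of the longer vector. A minimal sketch of this, where the names heights and volumes are only chosen for illustration:

```r
# cyl_vol() as defined above, repeated so the example runs on its own
cyl_vol <- function(r, h){
  return(pi * r^2 * h)
}

heights <- 1:10

# `*` works element by element, so 2 is combined with every height
heights * 2
#>  [1]  2  4  6  8 10 12 14 16 18 20

# in the same way, the single radius 3 is reused for each of the 10 heights
volumes <- cyl_vol(3, heights)
length(volumes)
#> [1] 10
```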

We can even supply two vectors, one for r and one for h, if we want to vary both radius and height.

r <- 10:1
h <- 1:10
cyl_vol(r, h)
#>  [1] 314.15927 508.93801 603.18579 615.75216 565.48668 471.23890 351.85838
#>  [8] 226.19467 113.09734  31.41593

Note that this calculates volume for \(r=10, h=1\), then \(r=9, h=2\) and so on, not for all combinations of the vectors.
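If you do want every combination of radius and height, one option is the base R function outer(), which calls a vectorised function on all pairs of elements from two vectors and returns a matrix. The sketch below uses shorter vectors than the example above to keep the result small, and the name all_vols is just for illustration:

```r
# cyl_vol() as defined above, repeated so the example runs on its own
cyl_vol <- function(r, h){
  return(pi * r^2 * h)
}

r <- 1:3
h <- 1:2

# one row per radius, one column per height
all_vols <- outer(r, h, cyl_vol)
dim(all_vols)
#> [1] 3 2

# the volume for r = 3, h = 2 sits in row 3, column 2
all_vols[3, 2]
#> [1] 56.54867
```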

Important concept:
Many functions can be applied to a vector of values instead of a single value. This is a quick and simple way of running the same calculation on many values at once. Notice how the function call cyl_vol(r, h) is exactly the same for single values and vectors!

Tip:
Notice that you could also have used a for-loop to apply the function to a range of values. If you have experience with other programming languages, e.g. Python, you’re probably used to doing it that way! In R, however, vectorising functions is often faster than using for-loops if your vectors get really large. In some cases it’s also easier both to read and to write.
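For comparison, here is a sketch of how the 10 cylinder volumes could be calculated with a for-loop, next to the vectorised call. The names vol_loop and vol_vec are only chosen for this illustration:

```r
# cyl_vol() as defined above, repeated so the example runs on its own
cyl_vol <- function(r, h){
  return(pi * r^2 * h)
}

r <- 3
h <- 1:10

# loop version: make an empty result vector, then fill it one element at a time
vol_loop <- numeric(length(h))
for (i in seq_along(h)) {
  vol_loop[i] <- cyl_vol(r, h[i])
}

# vectorised version: a single call
vol_vec <- cyl_vol(r, h)

# both approaches give the same numbers
all.equal(vol_loop, vol_vec)
#> [1] TRUE
```

The loop needs four lines and an index variable to do what the vectorised call does in one.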

5.1.1 The `apply()` function

The apply() function is a tool to further vectorise your functions that works with matrices and data frames. apply() will apply a function to either each row or each column of your data frame/matrix. The general structure of using apply() is:

apply(data, row_or_colwise, function_name)

Where the second argument is either 1 for operating on each row or 2 for operating on each column¹⁹.

The best way to show how it works is probably through an example. Say you have the following data frame of species counts, where each column contains counts for a species, and each row is a location.

species <- data.frame(
  sp_1 = c(0, 3, 2, 6, 7),
  sp_2 = c(4, 2, 3, 0, 1),
  sp_3 = c(2, 2, 0, 0, 1))
species
#>   sp_1 sp_2 sp_3
#> 1    0    4    2
#> 2    3    2    2
#> 3    2    3    0
#> 4    6    0    0
#> 5    7    1    1

What if you want to take the mean count for each location, that is, calculate the mean of each row in the data set? You can’t simply use mean() on your data frame:

mean(species)
#> [1] NA

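R doesn’t understand what to do in this case: mean() expects a vector of numbers, not a whole data frame, so it returns NA (along with a warning). However, this is a perfect opportunity for using apply()! Remember that the second argument should be 1 since we want to work on rows here.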

apply(species, 1, mean)
#> [1] 2.000000 2.333333 1.666667 2.000000 3.000000

Great! Notice that mean is written without parentheses when using it inside apply(). This is because you are handing the function itself over to apply(), which then calls it on each row for you. It is easy to forget, so it is worth remembering.
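apply() is not limited to built-in functions like mean; you can also pass a function you have written yourself. As a sketch, the hypothetical helper n_present() below (not part of R, just made up for this example) counts how many species have a count above zero, and apply() runs it on each location:

```r
# the species data from above, repeated so the example runs on its own
species <- data.frame(
  sp_1 = c(0, 3, 2, 6, 7),
  sp_2 = c(4, 2, 3, 0, 1),
  sp_3 = c(2, 2, 0, 0, 1))

# hypothetical helper: number of species with a count above 0
n_present <- function(counts){
  return(sum(counts > 0))
}

# number of species present at each location (row)
apply(species, 1, n_present)
#> [1] 2 3 2 1 3

# the same thing with the function written directly inside apply()
apply(species, 1, function(counts) sum(counts > 0))
#> [1] 2 3 2 1 3
```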

Exercise: use apply() to calculate the mean count of each species, i.e. the means of the columns in your data.

Show hint

Use 2 as the second argument to work with columns instead of rows.

apply(species, 2, mean)
#> sp_1 sp_2 sp_3 
#>  3.6  2.0  1.0

Important concept:
Use apply() to use a function on either all rows (1) or all columns (2) of a data frame/matrix.
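For the specific case of means, base R also has the shortcut functions rowMeans() and colMeans(). The sketch below only checks that they agree with the apply() calls used above; unname() is used to drop any names before comparing, so that only the numbers are compared:

```r
# the species data from above, repeated so the example runs on its own
species <- data.frame(
  sp_1 = c(0, 3, 2, 6, 7),
  sp_2 = c(4, 2, 3, 0, 1),
  sp_3 = c(2, 2, 0, 0, 1))

# mean per location (row): shortcut vs. apply()
all.equal(unname(rowMeans(species)), unname(apply(species, 1, mean)))
#> [1] TRUE

# mean per species (column): shortcut vs. apply()
all.equal(unname(colMeans(species)), unname(apply(species, 2, mean)))
#> [1] TRUE
```

apply() remains the more general tool, since it works with any function and not just the mean.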

This section has been a small taste of what R can do when you vectorise your functions. Vectorisation can be a bit tricky to wrap your head around in the beginning, but if you keep using it, it eventually becomes a very useful tool for performing a lot of calculations simultaneously.

5.1.2 The `ifelse()` function

Say that you have a vector with values, and you want to group them somehow. For example, you have a vector of numbers between 0 and 10,²⁰ and want to label the values that are larger than 5.

# generate 10 numbers between 0 and 10

set.seed(14) # ensure same result of random drawing each time
x <- sample(0:10, 10) # draw 10 random numbers between 0 and 10
x
#>  [1]  8 10  3  2  5  6  9  1  7  0

Now, you may remember from the first week that you can check which numbers are more than 5 by using a logical statement:

x > 5
#>  [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE

However, what if you want to label the values “high” and “low”, respectively? For this, you can use the ifelse() function. Conceptually, ifelse() works like this:

ifelse(logical_statement, value_if_TRUE, value_if_FALSE)

This means that to get “high” and “low” for our values, we can write:

ifelse(x > 5, "high", "low")
#>  [1] "high" "high" "low"  "low"  "low"  "high" "high" "low"  "high" "low"
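If two groups are not enough, one possible approach is to put a second ifelse() in the value_if_FALSE position, so that values failing the first test are tested again. A minimal sketch, where the cut-offs 7 and 3 and the name size are arbitrary choices for this illustration, and x is typed in by hand to match the values drawn above:

```r
# the same values as in x above, typed in directly
x <- c(8, 10, 3, 2, 5, 6, 9, 1, 7, 0)

# above 7 is "high"; of the rest, above 3 is "medium"; everything else is "low"
size <- ifelse(x > 7, "high", ifelse(x > 3, "medium", "low"))

# the first three values (8, 10 and 3)
size[1:3]
#> [1] "high" "high" "low"
```

This gets hard to read with many levels, so it is best kept to a small number of groups.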

5.1.2.1 `ifelse()` examples

ifelse() can be used for a lot of things. Here are a couple more examples:

Check if a value is higher than the mean:

ifelse(x > mean(x), "above mean", "below mean")
#>  [1] "above mean" "above mean" "below mean" "below mean" "below mean"
#>  [6] "above mean" "above mean" "below mean" "above mean" "below mean"

Convert a value from count to binary presence/absence:

ifelse(x > 0, 1, 0)
#>  [1] 1 1 1 1 1 1 1 1 1 0

We can also use it for character vectors, looking for a specific word to make groups:

animals <- c("horse", "donkey", "zebra", "horse", "zebra", "mule")
ifelse(animals == "horse", "actual horse", "almost horse")
#> [1] "actual horse" "almost horse" "almost horse" "actual horse" "almost horse"
#> [6] "almost horse"

This usage can be convenient when plotting values, as you will see in the evolution part of the tutorial.
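As a rough sketch of the idea, ifelse() can build a vector with one colour per observation, which could then be handed to a plotting function. The colour names and the name animal_col are arbitrary choices for this illustration:

```r
animals <- c("horse", "donkey", "zebra", "horse", "zebra", "mule")

# one colour per animal: horses in orange, everything else in grey
animal_col <- ifelse(animals == "horse", "darkorange", "grey50")

# there is exactly one colour for each animal
length(animal_col) == length(animals)
#> [1] TRUE
```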

¹⁹ It can be tricky to remember which is which of these, so don’t worry if you find yourself looking at the help page with ?apply all the time, I sure do! Also, a general rule of thumb is that in R, rows always come before columns, like when you extract values from a data frame with square brackets [].
²⁰ Here, we create a vector of 10 random numbers between 0 and 10. The set.seed() function ensures that everyone will get the same random numbers (not so random after all, then). Just compare your result to the person beside you and to the output in the tutorial. If you change the seed, or run sample() multiple times without setting the seed every time, you will see the vector changing.