10.4 Vectorisation

During the tutorials in this course, you have encountered the term “vectorisation” a few times (e.g. in chapter 5). In short, this is the concept of applying functions and operations to a whole vector of values at once, in contrast to looping over each value and calculating one at a time. In many cases, this makes code both faster and more readable.

Just to remind you, here is the example we used in chapter 5:

# define function to calculate volume of a cylinder
cyl_vol <- function(r, h){
  return(pi * r^2 * h)
}

# use the function on single values
r <- 3
h <- 8
cyl_vol(r, h)
#> [1] 226.1947

# use the function on vectors of values
r <- 10:1
h <- 1:10
cyl_vol(r, h)
#>  [1] 314.15927 508.93801 603.18579 615.75216 565.48668 471.23890 351.85838
#>  [8] 226.19467 113.09734  31.41593

An equivalent for-loop to the last example would be:

# create empty vector
vols <- rep(NA, length(r))
# do calculations in loop
for (i in seq_along(r)){
  vols[i] <- cyl_vol(r[i], h[i])
}
vols
#>  [1] 314.15927 508.93801 603.18579 615.75216 565.48668 471.23890 351.85838
#>  [8] 226.19467 113.09734  31.41593
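As a quick sanity check, the following sketch recomputes both versions and confirms that they agree. It redefines cyl_vol(), r and h from above so that it runs on its own; the names vols_vectorised and vols_loop are introduced here purely for illustration.

```r
# redefine the function and inputs from above so the block runs on its own
cyl_vol <- function(r, h){
  return(pi * r^2 * h)
}
r <- 10:1
h <- 1:10

# vectorised version: one call on the whole vectors
vols_vectorised <- cyl_vol(r, h)

# loop version: one call per element
vols_loop <- numeric(length(r))
for (i in seq_along(r)){
  vols_loop[i] <- cyl_vol(r[i], h[i])
}

# compare the two results (all.equal() tolerates tiny floating point differences)
all.equal(vols_vectorised, vols_loop)
#> [1] TRUE
```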

Notice how much simpler the vectorised solution looks. For simple operations like these, vectorising instead of looping is a no-brainer. Now, we will look at some more ways to vectorise, with the sapply() and lapply() functions.

10.4.1 Using lapply() to vectorise

In chapter 5, you learned about the apply() function, for applying a function to a data frame or matrix. The lapply() function works similarly, but on vectors and lists instead.

Say you have a simple function that takes a vector as input, and rescales each element to its proportion of the total by dividing by the sum of the vector. In other words, it calculates and returns \(\frac{x}{sum(x)}\).

Exercise: Make a function that does the above, and test it with a couple of different vectors.

# define function
calc_proportion <- function(x){
  return(x/sum(x))
}

# test it
calc_proportion(c(1, 3, 6))
#> [1] 0.1 0.3 0.6
calc_proportion(c(17, 36, 24, 55))
#> [1] 0.1287879 0.2727273 0.1818182 0.4166667

Now, what if we want to do this with several vectors? We may for example have a list of vectors, and want to apply this function to all of them:

numbers_list <- list(
  c(1, 3, 6),
  c(17, 36, 24, 55),
  c(100, 500, 400, 38, 75)
)
numbers_list
#> [[1]]
#> [1] 1 3 6
#> 
#> [[2]]
#> [1] 17 36 24 55
#> 
#> [[3]]
#> [1] 100 500 400  38  75

If we simply try to use the function on the list directly, we get an error. R has no way of using sum() on a list!

calc_proportion(numbers_list)
#> Error in sum(x): invalid 'type' (list) of argument

This is where lapply() comes in. lapply() (or “list apply”) takes a list or vector as input, and applies a function to each element of the list/vector. The syntax is lapply(list, function). As with apply(), the function is passed by name, without parentheses after it. Let’s use this to apply calc_proportion() to numbers_list:

lapply(numbers_list, calc_proportion)
#> [[1]]
#> [1] 0.1 0.3 0.6
#> 
#> [[2]]
#> [1] 0.1287879 0.2727273 0.1818182 0.4166667
#> 
#> [[3]]
#> [1] 0.08984726 0.44923630 0.35938904 0.03414196 0.06738544

Now we get a list of our rescaled vectors! Note that lapply() always returns a list (hence “list apply”). Like earlier examples we’ve seen, it doesn’t matter if numbers_list has a single element or a million, the code would still look the same (see note 1 at the end of this section).
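Since lapply() also accepts a plain vector, here is a small sketch of that case. The vector areas is made up for illustration; sqrt() is a base R function.

```r
# a plain numeric vector (hypothetical example values)
areas <- c(4, 9, 16)

# lapply() treats each element separately and returns a list
side_lengths <- lapply(areas, sqrt)
side_lengths
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 3
#> 
#> [[3]]
#> [1] 4
```

Because sqrt() is already vectorised, sqrt(areas) would be the simpler choice in practice. The point here is only that lapply() returns a list even when the input is a vector.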

Important concept:
lapply() allows for fast and simple vectorisation when your function cannot be applied to your data directly. It takes a list/vector and a function as arguments, and returns a list of values.

10.4.2 sapply()

sapply() (or “simplified apply”) is very similar to lapply() in that it takes a list or vector and a function as arguments, and applies the function to each element. The key difference is that where lapply() always returns a list, sapply() tries to simplify the result to a vector or matrix, and only falls back to a list when that is not possible. For example, say we want to use sum() on each vector in numbers_list. With lapply() we get:

lapply(numbers_list, sum)
#> [[1]]
#> [1] 10
#> 
#> [[2]]
#> [1] 132
#> 
#> [[3]]
#> [1] 1113

However, each list element only contains one value. Wouldn’t it be easier to have it stored in a vector? This is what we get with sapply():

sapply(numbers_list, sum)
#> [1]   10  132 1113
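What sapply() returns depends on what the function gives back for each element. The sketch below, which redefines numbers_list and calc_proportion() so that it runs on its own, shows two further cases: a matrix when every result has the same length, and a list when the lengths differ. The names ranges and proportions are introduced here for illustration only.

```r
numbers_list <- list(
  c(1, 3, 6),
  c(17, 36, 24, 55),
  c(100, 500, 400, 38, 75)
)
calc_proportion <- function(x){
  return(x/sum(x))
}

# range() returns two values (min and max) for every vector,
# so sapply() simplifies the result to a matrix with one column per vector
ranges <- sapply(numbers_list, range)
ranges
#>      [,1] [,2] [,3]
#> [1,]    1   17   38
#> [2,]    6   55  500

# calc_proportion() returns vectors of different lengths (3, 4 and 5),
# so no simplification is possible and sapply() returns a list
proportions <- sapply(numbers_list, calc_proportion)
is.list(proportions)
#> [1] TRUE
```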

Important concept:
sapply() can be convenient to simplify the output of a vectorised operation. Use sapply() if you want the simplest data structure possible, and lapply() when you want to be sure that your operation returns a list.

10.4.3 Anonymous functions

A concept that is often used together with lapply() and sapply() is anonymous functions. As the term implies, this is a function without a name. For example, instead of creating the calc_proportion() function above, we could have defined the function inside our lapply() call like this (see note 2 at the end of this section):

lapply(numbers_list, function(x) x/sum(x))
#> [[1]]
#> [1] 0.1 0.3 0.6
#> 
#> [[2]]
#> [1] 0.1287879 0.2727273 0.1818182 0.4166667
#> 
#> [[3]]
#> [1] 0.08984726 0.44923630 0.35938904 0.03414196 0.06738544

It works in exactly the same way, but you don’t have to define a function beforehand, and the reader can see right away what is being done. Only use this for very simple functions that you are only going to use once. Otherwise, defining a named function is better. It is good to know about anonymous functions anyway, as they are frequently used.
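Anonymous functions work the same way with sapply(). As a minimal sketch, the example below computes the spread (largest minus smallest value) of each vector; numbers_list is redefined so that the block runs on its own, and spreads is a name chosen for illustration.

```r
numbers_list <- list(
  c(1, 3, 6),
  c(17, 36, 24, 55),
  c(100, 500, 400, 38, 75)
)

# define the function on the fly; each result is a single number,
# so sapply() returns a plain numeric vector
spreads <- sapply(numbers_list, function(x) max(x) - min(x))
spreads
#> [1]   5  38 462
```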


  1. Of course, this can also be achieved with a for-loop, which looks like this:

    # create empty list, confusingly with the vector() function ...
    proportion_list <- vector(mode = "list", length = length(numbers_list))
    
    # loop and calculate
    for (i in seq_along(numbers_list)){
      proportion_list[[i]] <- calc_proportion(numbers_list[[i]])
    }
    proportion_list
    #> [[1]]
    #> [1] 0.1 0.3 0.6
    #> 
    #> [[2]]
    #> [1] 0.1287879 0.2727273 0.1818182 0.4166667
    #> 
    #> [[3]]
    #> [1] 0.08984726 0.44923630 0.35938904 0.03414196 0.06738544


  2. Note that the function here is defined without curly brackets and return(). This shorthand is not unique to anonymous functions (although it is most often used there); it is simply an equivalent way to define a function in R, because a function returns the value of the last expression it evaluates. This means that the following four functions are all equivalent:

    f1 <- function(x) x/sum(x)
    f2 <- function(x) return(x/sum(x))
    f3 <- function(x){
      x/sum(x)
    }
    f4 <- function(x){
      return(x/sum(x))
    }

    When the function spans more than a single line, you should always use the curly brackets. I personally prefer to use an explicit return() in more complicated functions, but you will encounter both styles in code you read.