1.2 R essentials

In this part of the tutorial, we will learn the fundamentals of R programming, while investigating the demographics of the Nordic countries. All numbers are the 2020 populations taken from the International Data Base (IDB) from the US government.

1.2.1 Assigning to objects

1.2.1.1 Problem: numbers are complicated

Let’s start by looking at Norway and Sweden. Say you want to find out what proportion the Norwegian population makes up of the total population of Norway plus Sweden. Conceptually, it looks something like this:

\[\frac{Norwegian\ population}{Norwegian\ population+Swedish\ population}\]

Exercise: The Norwegian population is 5 465 387 and the Swedish is 10 185 555. Use R to calculate the proportion the Norwegian population makes up of the total population. Do the same for the Swedish population.


If you were able to do this, great work! Your solution probably looked something like this¹:

# Norwegian population
5465387 / (5456387 + 10185555)
# Swedish population
10185555 / (5465387 + 10185555)

However, there are several problems with doing it this way. First of all, 5465387 is a stupidly long number, and the probability of typing it wrong is rather high (I actually mistyped a number in the code above, can you spot the error?). Another problem is that if any of the populations change (e.g. you want to update to 2021 numbers), you would have to update it in four different places to make this simple change. Third: for someone looking at your code, how will they know what’s happening? These numbers have no meaning without their context.

Now we’re finally getting to the first taste of why R is more powerful than a regular calculator: we can give our numbers simple names.

1.2.1.2 Solution: assign the complicated numbers to named objects

What if you could write some text instead of our stupidly large numbers? Something like norway_pop instead of 5465387? Luckily, you can! You can do the following to give your variables a name:

# store the number to an object called norway_pop
norway_pop <- 5465387
# store another number to an object called sweden_pop
sweden_pop <- 10185555

This is called assignment,² and is done using the arrow <-. You have now created a named object, which is a very powerful tool. Your objects behave exactly like the numbers that are stored in them (try e.g. norway_pop*2 or sweden_pop+4445). This means that we can now simply write:

norway_pop / (norway_pop + sweden_pop)
#> [1] 0.349205
sweden_pop / (norway_pop + sweden_pop)
#> [1] 0.650795

Notice how similar this is to the conceptual version we saw earlier! This is much simpler, easier to read and way less error-prone than writing out the numbers each time. In addition, if you want to change any of the population sizes, you will just have to change it in one place instead of four.

To make things even easier, we could store norway_pop + sweden_pop as total_pop, and also store our results with a name as well:

# make total population object
total_pop <- norway_pop + sweden_pop

# make object for Norway's proportion
norway_proportion <- norway_pop / total_pop
# print the result
norway_proportion
#> [1] 0.349205

# make object for Sweden's proportion
sweden_proportion <- sweden_pop / total_pop
# print the result
sweden_proportion
#> [1] 0.650795

Additional info
If you have experience with Python, you would have printed the objects using print(norway_proportion). In R we don’t have to explicitly use print(). An object (or any calculation for that matter) is automatically printed when we run it. R does have a print() function, though, that can be used if you want to be explicit about printing something.

1.2.1.3 Notes about naming variables

You should always give your variables sensible names. In the code above, we could have saved ourselves some typing by e.g. calling the populations x and y respectively. However, this quickly becomes a nightmare to read when scripts get long, so even though norway_pop takes longer to write than x, you should go with norway_pop so it’s possible to understand what’s going on in your script.

You can name your variables just about anything, but there are some characters you can’t use in their names. Notably, spaces, dashes and a lot of special characters can’t be used, and the name cannot start with a number. If you stick to letters, underscores and numbers (except at the beginning), you should be fine!
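To make these naming rules concrete, here is a small sketch. The names pop_2020 and norway_pop2 are made up for illustration, and the commented-out lines are examples of names that R would reject or misread.

```r
# valid names: letters, underscores and numbers (not at the start)
pop_2020 <- 5465387
norway_pop2 <- 5465387

# invalid names (remove the leading # to see the errors for yourself):
# 2020_pop <- 5465387     # starts with a number
# norway pop <- 5465387   # contains a space
# norway-pop <- 5465387   # R reads the dash as a minus sign

# both valid objects hold the same number
pop_2020 - norway_pop2
#> [1] 0
```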

Another thing to note about variable names is that they are case sensitive, meaning for instance that A can contain something completely different from a. In the code above, we used all-lowercase variable names, so what happens if we try to run the object Norway_pop?³

Norway_pop
#> Error in eval(expr, envir, enclos): object 'Norway_pop' not found

R can’t find Norway_pop, because it doesn’t exist, only norway_pop does. Beware of this, as it’s often a source of annoying bugs in the code.

Important concept:
Assign the values you work with to named objects. It’s easier to read, and you’re less likely to make mistakes. Use good variable names, and beware of case sensitivity!

1.2.2 Vectors

Let’s expand our little Norway-Sweden project to all the Nordic countries. Table 1.1 shows the populations of all the Nordic countries.

Table 1.1: Population sizes of the Nordic countries.
Country   Population
Denmark      5868927
Finland      5572355
Iceland       350773
Norway       5465387
Sweden      10185555

Now, if we were to calculate the proportion of the total for all countries, like we did for Norway and Sweden earlier, the code would quickly become very long:

denmark_pop <- 5868927
finland_pop <- 5572355
iceland_pop <- 350773
nordic_total <- norway_pop + sweden_pop + denmark_pop + finland_pop + iceland_pop

norway_proportion <- norway_pop / nordic_total

sweden_proportion <- sweden_pop / nordic_total

denmark_proportion <- denmark_pop / nordic_total

finland_proportion <- finland_pop / nordic_total

iceland_proportion <- iceland_pop / nordic_total

norway_proportion
sweden_proportion
denmark_proportion
finland_proportion
iceland_proportion

That is a lot of typing! And remember that this is only for the Nordic countries. Imagine doing this for all the countries in the world! In R, we can avoid all this tedious work by storing our data in a vector.
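The examples in the rest of this section use a vector called nordic. As a minimal sketch, assuming the values are entered in the same order as in Table 1.1, such a vector can be created with the c() function, which combines its arguments into a single vector:

```r
# combine the five populations into one vector
# (order as in Table 1.1: Denmark, Finland, Iceland, Norway, Sweden)
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)
nordic
#> [1]  5868927  5572355   350773  5465387 10185555
```

All five numbers now live in one object, so any operation on nordic is applied to all the countries at once.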

1.2.2.2 Vector properties

1.2.2.2.1 Mathematical operations

When you have a vector of numbers, you can apply all the same mathematical operations that you can on a single number. The operation is then applied to each element of the vector separately. Try it!

# create a vector from 1 to 10
my_vector <- 1:10

# multiply each element by 2
my_vector * 2
#>  [1]  2  4  6  8 10 12 14 16 18 20

# add 5 to each element
my_vector + 5
#>  [1]  6  7  8  9 10 11 12 13 14 15

If you have two vectors of equal length, you can do mathematical operations on both, e.g. multiply two vectors. The first element of the first vector is then multiplied with the first element of the second vector, the second element with the second and so on. You can do the same for division, addition, subtraction and any operation you can think of.

# create another vector
my_vector2 <- 11:20

# multiply them together
my_vector * my_vector2
#>  [1]  11  24  39  56  75  96 119 144 171 200

# add them
my_vector + my_vector2
#>  [1] 12 14 16 18 20 22 24 26 28 30

If the two vectors aren’t of the same length, R will still perform the operation: the shorter vector starts over from its first element when it runs out of numbers. If the longer length is not a multiple of the shorter one (as in the example below), you will also get a warning.

# multiplying two vectors of unequal length
1:10 * 1:9
#>  [1]  1  4  9 16 25 36 49 64 81 10

# this is what R does:
#1*1
#2*2
#3*3
#4*4
#5*5
#6*6
#7*7
#8*8
#9*9 # here we use the last element of the vector 1:9
#10*1 # here R "recycles" the vector, using the first element of 1:9

Remember to keep all your vectors the same length, or something unexpected might happen! To check the length of a vector, you can use the length() function.

length(my_vector)
#> [1] 10
length(my_vector2)
#> [1] 10
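Recycling can also happen silently. The sketch below, using made-up vectors, shows a case where the longer length is a multiple of the shorter one, so R recycles without any warning. This is why checking with length() is a good habit.

```r
# 10 is a multiple of 2, so 1:2 is recycled five times without a warning
1:10 * 1:2
#>  [1]  1  4  3  8  5 12  7 16  9 20

# checking the lengths reveals the mismatch
length(1:10)
#> [1] 10
length(1:2)
#> [1] 2
```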

Exercise: express the Nordic population sizes in millions by dividing all the numbers in the nordic vector by 10^6.

nordic / 10^6

1.2.2.2.2 Extracting numbers

Having all numbers stored in one place is great, but what if you want to use just one of them? We use square brackets [] to access numbers inside vectors. You put the index of the element you want to extract inside the square brackets like this: my_vector[1]. This will extract the first element from the vector, my_vector[2] will extract the second. On our nordic vector, we could for instance do the following:

# extract the third element (Iceland) from the nordic vector
nordic[3]
#> [1] 350773

We can also select multiple numbers, by inputting a vector inside the square brackets. nordic[c(3, 5)] will extract the third and the fifth element of our nordic vector, and nordic[1:3] will extract elements 1 through 3.
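Run as code, the two selections described above look like this. The sketch recreates the nordic vector first, assuming the order of Table 1.1, so that it can be run on its own.

```r
# populations in the order of Table 1.1
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)

# third and fifth element (Iceland and Sweden)
nordic[c(3, 5)]
#> [1]   350773 10185555

# elements 1 through 3 (Denmark, Finland and Iceland)
nordic[1:3]
#> [1] 5868927 5572355  350773
```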

Exercise: extract elements 2 through 4 from the nordic vector. Then, extract only the Scandinavian countries, and store them in an object with a good name.

Vector indexing:
If you’re familiar with Python or another programming language, you are probably used to counting starting at 0 when indexing. In R, however, we start at 1. This can be annoying when switching between languages, but is unfortunately something you just need to remember.

1.2.2.2.3 Getting the sum and mean of a vector

You can do a variety of operations on vectors in addition to using the mathematical operators +, -, * and /. Two of the most common operations are calculating the sum and the mean of all the numbers in the vector, which is done with the sum() and mean() functions.

sum(my_vector)
#> [1] 55
mean(my_vector)
#> [1] 5.5

There’s a variety of other functions you can apply to a vector, such as max(), min() and median(). Try them out!
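As a quick sketch of what these functions return, here they are applied to the same 1:10 vector used above:

```r
my_vector <- 1:10

# largest and smallest value
max(my_vector)
#> [1] 10
min(my_vector)
#> [1] 1

# the middle value (here the average of 5 and 6)
median(my_vector)
#> [1] 5.5
```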

Exercise: Use what you have learned about vectors to calculate the population proportions of all the Nordic countries in a single operation. Save it to an object with a good name.

The solution to the exercise shows how powerful working with vectors can be.

nordic_prop <- nordic / sum(nordic)
nordic_prop
#> [1] 0.21385882 0.20305198 0.01278188 0.19915416 0.37115316

This is way simpler than doing this without using vectors! Also, imagine if you had a hundred, or even a million values in your vector. The code would still look exactly the same, making calculations with any number of values trivial.

Important concept
If you have many values that go together, store them together in a vector. You can do a variety of mathematical operations on vectors, but make sure that all vectors have the same length!

1.2.3 Strings

As mentioned way back in section 1.1.2.2, you have to write "hello" rather than hello to get the actual text “hello”. Now you may have figured out that this is because we have to separate objects from plain text in some way. Text within quotes is called a string in R (and in most programming languages). Strings can be stored in objects and combined into vectors, just like numbers.

"Hello, world!"
#> [1] "Hello, world!"
greeting <- "Hello, world!"
greeting
#> [1] "Hello, world!"
string_vector <- c("this", "is", "a", "vector", "of", "strings!")
string_vector
#> [1] "this"     "is"       "a"        "vector"   "of"       "strings!"

To combine several strings into one, or even combine numbers and strings, you can use the function paste():

paste("These two strings", "become one")
#> [1] "These two strings become one"

nordic_sum <- sum(nordic)
paste("The total population of the nordic countries is", nordic_sum, "people.")
#> [1] "The total population of the nordic countries is 27442997 people."
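By default, paste() puts a space between the pieces. As a small sketch going slightly beyond the text above, the sep argument of paste() changes that separator, and paste() also works element by element on vectors. The strings used here are made up for illustration.

```r
# use sep to change what goes between the pieces
paste("Nor", "way", sep = "")
#> [1] "Norway"
paste("2020", "01", "31", sep = "-")
#> [1] "2020-01-31"

# paste() works on vectors too: one string per element
paste("Country", 1:3)
#> [1] "Country 1" "Country 2" "Country 3"
```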

Exercise: Combine the names of the Nordic countries into a vector. Make sure the names are in the same order as in table 1.1.

nordic_names <- c("Denmark", "Finland", "Iceland", "Norway", "Sweden")

One of many uses for strings is to give names to a vector. You can use the names() function to see a vector’s names.

names(nordic)
#> NULL

As you can see, we get NULL here, which means that the vector elements don’t have any names. We can set the names like this:

names(nordic) <- nordic_names
nordic
#>  Denmark  Finland  Iceland   Norway   Sweden 
#>  5868927  5572355   350773  5465387 10185555

Now, when we print our nordic vector, we can see which population size belongs to which country! Neat! We can also extract values from our vector based on names rather than just position.

nordic["Denmark"] # equivalent to nordic[1]
#> Denmark 
#> 5868927
nordic[c("Finland", "Norway")] # equivalent to nordic[c(2, 4)]
#> Finland  Norway 
#> 5572355 5465387

Exercise: Do some calculations on the now named nordic vector (e.g. calculate the proportions again). What happens to the names?

nordic / sum(nordic)
#>    Denmark    Finland    Iceland     Norway     Sweden 
#> 0.21385882 0.20305198 0.01278188 0.19915416 0.37115316

# The names carry over to the new vector

1.2.4 Logical values

R has some words that have special meaning. Two of these are what we call logical (or boolean) values: TRUE and FALSE. These are often the result of checking if some condition is true or not, using logical operators.

The most important logical operators in R are larger than >, smaller than <, equal to ==, larger/smaller than or equal to >=/<= and not equal to !=. They return TRUE if the condition is true, and FALSE if the condition is false. In its simplest form, we can write:

# is 3 smaller than 4?
3 < 4
#> [1] TRUE
# is 3 exactly equal to 3.01?
3 == 3.01
#> [1] FALSE
# are these two strings the same?
"Norway" == "norway"
#> [1] FALSE

The operators also work on vectors. Then, every element of the vector is checked, and we get a vector of TRUE and FALSE.

# which numbers from 1 to 5 are larger than 3?
1:5 > 3
#> [1] FALSE FALSE FALSE  TRUE  TRUE
# which countries have a population larger than 5000000?
nordic > 5*10^6
#> Denmark Finland Iceland  Norway  Sweden 
#>    TRUE    TRUE   FALSE    TRUE    TRUE
# which countries in our vector are not named "Norway"?
names(nordic) != "Norway"
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE

A neat property of logical values is that they behave as numbers in certain contexts. For example, if you use the sum() function on them, each TRUE is counted as 1, and each FALSE is counted as 0.

sum(c(TRUE, FALSE, TRUE))
#> [1] 2

We can use this to check how many elements in our vector match a certain condition:

# how many numbers from 1 to 5 are larger than 3?
sum(1:5 > 3)
#> [1] 2
# how many countries have a population larger than 5000000?
sum(nordic > 5*10^6)
#> [1] 4
# how many countries in our vector are not named "Norway"?
sum(names(nordic) != "Norway")
#> [1] 4

Exercise: play around with the logical operators so you get a feel for how they work. Ask questions like: “is sum(3, 4) the same as 3+4?” or “which is larger, 10^6 or 6^10?”, and answer them using logical operators.
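One possible way to answer the two example questions is sketched below. The last line simply prints 6^10 so you can see why the comparison turns out the way it does.

```r
# is sum(3, 4) the same as 3 + 4?
sum(3, 4) == 3 + 4
#> [1] TRUE

# is 10^6 larger than 6^10?
10^6 > 6^10
#> [1] FALSE

# 6^10 is much larger than a million
6^10
#> [1] 60466176
```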

1.2.4.1 NA

R also has another important special value, NA, which stands for “not available”. It is mostly used to indicate missing data in data sets. One important property of NA is that almost any operation involving NA returns NA.

5 * NA
#> [1] NA
sum(c(5, 6, 10, NA, 1))
#> [1] NA
mean(c(3, 5, 10, NA))
#> [1] NA

You can write na.rm = TRUE within the sum() and mean() functions to ignore NAs.

sum(c(5, 6, 10, NA, 1), na.rm = TRUE)
#> [1] 22
mean(c(3, 5, 10, NA), na.rm = TRUE)
#> [1] 6
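If you want to find out where the missing values are, one option not covered above is the base R function is.na(), which returns TRUE for every NA. Combined with sum(), it counts the missing values. The vector measurements is made up for illustration.

```r
measurements <- c(3, 5, 10, NA)

# which elements are missing?
is.na(measurements)
#> [1] FALSE FALSE FALSE  TRUE

# how many elements are missing?
sum(is.na(measurements))
#> [1] 1

# na.rm = TRUE also works in other functions, such as max()
max(measurements, na.rm = TRUE)
#> [1] 10
```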

Important concept: TRUE, FALSE and NA are examples of special values in R.

  • TRUE and FALSE are called logical (or boolean) values, and are often the result of using logical operators like ==, >, != etc. TRUE and FALSE behave as 1 and 0 respectively when used inside e.g. the sum() function.
  • NA indicates missing data. Almost any operation involving NA returns NA.

1.2.5 Functions

So far you have seen a handful of functions being used, like c(), seq() and mean(), to name some. Functions are always written in the form functionname(). The text outside the parentheses is the function name, and whatever is inside the parentheses is called the arguments. A function can be seen as a series of operations that are applied to the arguments, or simply as somewhere you put something in, and get something else in return. They can often save you a tremendous amount of time: compare, for example, manually calculating the mean of 1000 numbers vs. using the mean() function.

If you want to know more about what a function does, you can write ? and then the function name, for example ?seq. Then you will get a help page that tells you more about how the function works, and some examples of use at the bottom. Be aware that the help page is written by programmers for programmers, and is often difficult to understand. In the beginning you will often be better off googling what a function does (but the examples are always useful).

Functions mostly have one or more arguments (sometimes none), which go inside the parentheses separated by commas, conceptually: function(arg1, arg2, arg3). The arguments can either be given in a set order, or you can name them. Consider the following:

seq(1, 10, 2) # "from" is argument no. 1, "to" is argument no. 2 etc.
#> [1] 1 3 5 7 9
seq(from = 1, to = 10, by = 2)
#> [1] 1 3 5 7 9

These are exactly the same, but one uses argument order, and the other the argument names. To figure out the order and names of arguments, you have to consult the help pages or the internet. For simple functions like seq() and mean() it’s common to omit the argument names, while for more complicated functions it’s better to include them, so you clearly show what you’re doing. Note that argument names are never enclosed in quotes.
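A consequence of naming arguments is that their order no longer matters, and that positional and named arguments can be mixed. A short sketch with the same seq() call as above:

```r
# named arguments can be given in any order
seq(by = 2, to = 10, from = 1)
#> [1] 1 3 5 7 9

# mixing: the first two arguments by position, the last one by name
seq(1, 10, by = 2)
#> [1] 1 3 5 7 9
```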

Line breaks in functions
A tip for making code more readable is that as long as we are inside a pair of parentheses, we can have line breaks in our code. This means that we can put each argument on a separate line. Instead of the code above, we could also have written:

seq(from = 1,
    to = 10,
    by = 2)
#> [1] 1 3 5 7 9

This may not matter much for such a simple function, but long, complicated function calls get way easier to read if they’re formatted this way. You will see this formatting in our next section about data frames.

1.2.6 Data frames

Let’s return to investigating more aspects of the Nordic countries. After all, there’s more to a country than just its population size. Table 1.2 shows some additional information about the Nordic countries:

Table 1.2: More information on the Nordic countries.
Country   Population size   Area (km²)   Life expectancy
Denmark           5868927        42434             81.24
Finland           5572355       303815             81.33
Iceland            350773       100250             83.26
Norway            5465387       304282             82.14
Sweden           10185555       410335             82.40

What if you want to use some more information about the countries, e.g. the area? One solution is to store the new information in a vector:

nordic_area <- c(42434, 303815, 100250, 304282, 410335)

However, as the number of variables increases, you can get quite a lot of vectors! Luckily there’s a way to keep everything together, in what R calls a data frame.

1.2.6.1 Creating a data frame

A data frame is conceptually similar to the table above. You have columns of different variables, and rows of observations of these variables. If you have a number of vectors (of the same length!), you can combine these into a data frame with the function data.frame():⁴

nordic_df <- data.frame(country = nordic_names, 
                        population = nordic, 
                        area = nordic_area)
nordic_df
#>         country population   area
#> Denmark Denmark    5868927  42434
#> Finland Finland    5572355 303815
#> Iceland Iceland     350773 100250
#> Norway   Norway    5465387 304282
#> Sweden   Sweden   10185555 410335

The arguments of data.frame() are kind of special in that you can name them whatever you want. Your argument names (i.e. the text before the equal sign) determine the names of the columns in your data frame.
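To see the link between argument names and column names in isolation, here is a minimal sketch with a made-up data frame (fruit_df and its contents are purely illustrative):

```r
# the argument names (fruit, count) become the column names
fruit_df <- data.frame(fruit = c("apple", "pear"),
                       count = c(3, 5))
fruit_df
#>   fruit count
#> 1 apple     3
#> 2  pear     5

# names() shows the column names of a data frame
names(fruit_df)
#> [1] "fruit" "count"
```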

1.2.6.2 Extracting the data within the data frame

You previously learned that you can access data in a vector using square brackets [], and the same goes for data frames. However, we now have two dimensions (rows and columns) instead of just 1. The syntax for extracting from a data frame is df[row, column]. For instance, to get the area (column 3) of Finland (row 2), we can run:

# extract row 2, column 3
nordic_df[2, 3]
#> [1] 303815

You can also use the name of the column instead of the position (remember quotes!).

nordic_df[2, "area"]
#> [1] 303815

If you leave the row field empty, you will get all rows, and if you leave the column field empty you will get all columns.

# Get area of all countries
nordic_df[,"area"]
#> [1]  42434 303815 100250 304282 410335
# Get all the information about Iceland
nordic_df[3,]
#>         country population   area
#> Iceland Iceland     350773 100250
# get all rows and columns (useless, but it works!)
nordic_df[,]
#>         country population   area
#> Denmark Denmark    5868927  42434
#> Finland Finland    5572355 303815
#> Iceland Iceland     350773 100250
#> Norway   Norway    5465387 304282
#> Sweden   Sweden   10185555 410335

Another way of getting data from your data frame is with the $ operator. Writing something like df$column returns the entire column as a vector.

nordic_df$population
#> [1]  5868927  5572355   350773  5465387 10185555

This is very useful, as you can do the same operations on these vectors that you can on any vector.

nordic_df$population / 10^6
#> [1]  5.868927  5.572355  0.350773  5.465387 10.185555
mean(nordic_df$area)
#> [1] 232223.2
nordic_df$area + nordic_df$population
#> [1]  5911361  5876170   451023  5769669 10595890
nordic_df$population > 5*10^6
#> [1]  TRUE  TRUE FALSE  TRUE  TRUE

Note that the $ syntax only works for columns, not for rows.
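To get a whole row you therefore still need square brackets. The sketch below rebuilds the data frame so it can be run on its own; setting row.names explicitly is an assumption made here to reproduce the row names shown in the printouts above.

```r
nordic_names <- c("Denmark", "Finland", "Iceland", "Norway", "Sweden")
nordic_df <- data.frame(country = nordic_names,
                        population = c(5868927, 5572355, 350773, 5465387, 10185555),
                        area = c(42434, 303815, 100250, 304282, 410335),
                        row.names = nordic_names)

# extract a row by position ...
nordic_df[4, ]
#>        country population   area
#> Norway  Norway    5465387 304282

# ... or by row name, which gives the same row
nordic_df["Norway", ]
#>        country population   area
#> Norway  Norway    5465387 304282
```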

Exercise: Calculate the population density (population/area) of the Nordic countries using the data frame and the $ operator.

nordic_df$population / nordic_df$area
#> [1] 138.307183  18.341277   3.498983  17.961585  24.822535

1.2.6.3 Adding data to the data frame

Adding data to the data frame can also be done with the $ operator. Instead of referencing an existing column, you can simply write a new name after the $, and assign to it as a variable like so: df$newcolumn <- c(1, 2, 3, 4, 5).

nordic_df$is_norway <- c("no", "no", "no", "yes", "no")
nordic_df
#>         country population   area is_norway
#> Denmark Denmark    5868927  42434        no
#> Finland Finland    5572355 303815        no
#> Iceland Iceland     350773 100250        no
#> Norway   Norway    5465387 304282       yes
#> Sweden   Sweden   10185555 410335        no

Now you have added a column named “is_norway” to your data frame, which is kind of useless, but still cool.

Exercise: For a more useful application of creating a new column, add a population density column to nordic_df (using the same calculation as in the previous exercise).

nordic_df$pop_density <- nordic_df$population / nordic_df$area

1.2.6.4 Subsetting with logical operators

Sometimes you don’t know exactly which rows you want to extract from your data. For example, you may want all countries with an area below 300 000 km², or all countries that aren’t Norway. For this, we can use the logical operators that you learned about in section 1.2.4.

If you want to figure out which countries have an area less than 300 000 km², you have learned that you can do the following:

# which elements have an area less than 300000?
nordic_df$area < 300000
#> [1]  TRUE FALSE  TRUE FALSE FALSE

The result shows that elements 1 and 3 are TRUE, while the rest are FALSE. If you put this same statement within square brackets (before the comma) to index your data frame, R will return all rows that are TRUE, and discard all rows that are FALSE.

# get all countries with an area below 300 000 km^2
nordic_df[nordic_df$area < 300000, ]
#>         country population   area is_norway pop_density
#> Denmark Denmark    5868927  42434        no  138.307183
#> Iceland Iceland     350773 100250        no    3.498983

We could do the same to get all rows where the country isn’t Norway:

# get all countries except Norway
nordic_df[nordic_df$country != "Norway", ] 
#>         country population   area is_norway pop_density
#> Denmark Denmark    5868927  42434        no  138.307183
#> Finland Finland    5572355 303815        no   18.341277
#> Iceland Iceland     350773 100250        no    3.498983
#> Sweden   Sweden   10185555 410335        no   24.822535

Two important things to note here are that you have to explicitly write nordic_df$ inside the square brackets, and that you have to end with a comma to tell R that you’re filtering rows (i.e. leave the column space empty). Neither nordic_df[area < 300000, ] nor nordic_df[nordic_df$area < 300000] will give you the rows you want.
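The sketch below illustrates both pitfalls on a small made-up data frame (small_df is hypothetical). The two problematic lines are commented out so the example runs; the comments describe what typically goes wrong, assuming no separate object called area exists in your session.

```r
small_df <- data.frame(country = c("Denmark", "Finland", "Iceland"),
                       area = c(42434, 303815, 100250))

# correct: the condition refers to small_df$area and is followed by a comma
small_df[small_df$area < 300000, ]
#>   country   area
#> 1 Denmark  42434
#> 3 Iceland 100250

# not what we want (remove the leading # to try them):
# small_df[area < 300000, ]          # R looks for a standalone object called area
# small_df[small_df$area < 300000]   # without the comma, R selects columns, not rows
```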

Exercise: make a subset of your data containing all the countries with population density of more than 18 (using the pop_density column you created in the previous exercise).

nordic_df[nordic_df$pop_density > 18, ]
#>         country population   area is_norway pop_density
#> Denmark Denmark    5868927  42434        no   138.30718
#> Finland Finland    5572355 303815        no    18.34128
#> Sweden   Sweden   10185555 410335        no    24.82254

Important concept
When you have a lot of data that belongs together, typically something you could store in an Excel document and show in a table, make it into a data frame. Many things we will be doing later in this course require that your data is in a data frame, so learn to recognize this type of object.


  1. with the comments, I hope!

  2. You can also use = instead of <-. If you know another programming language already, like Python, this may feel more natural. I like to use the arrow to remind my muscle memory that I’m working in R, but for the code in this tutorial it makes no practical difference which you use, so use whichever you like!

  3. As a side note, whenever you wonder “what happens if I do …”, try it! The worst thing that can happen if you try something is that you get an error, the best thing is that you learn something useful.

  4. You can see here that the names of nordic carry over when making the data frame. This results in the names of the countries being stored in what seems to be a nameless column. These are the row names, and can be accessed with row.names(nordic_df).