1.2 R essentials
In this part of the tutorial, we will learn the fundamentals of R programming, while investigating the demographics of the Nordic countries. All numbers are the 2020 populations taken from the International Data Base (IDB) from the US government.
1.2.1 Assigning to objects
1.2.1.1 Problem: numbers are complicated
Let’s start by looking at Norway and Sweden. Say you want to find out what proportion the Norwegian population makes up of the total population of Norway plus Sweden. Conceptually, it looks something like this:
\[\frac{\text{Norwegian population}}{\text{Norwegian population} + \text{Swedish population}}\]
Exercise: The Norwegian population is 5 465 387 and the Swedish is 10 185 555. Use R to calculate the proportion the Norwegian population makes up of the total population. Do the same for the Swedish population.
If you were able to do this, great work! Your solution probably looked something like this (see footnote 1):
However, there are several problems with doing it this way. First of all, 5465387 is a stupidly long number, and the probability of typing it wrong is rather high (I actually mistyped a number in the code above, can you spot the error?). Another problem is that if any of the populations change (e.g. you want to update to 2021 numbers), you would have to update it in four different places to make this simple change. Third: for someone looking at your code, how will they know what’s happening? These numbers have no meaning without their context.
Now we’re finally getting to the first taste of why R is more powerful than a regular calculator: we can give our numbers simple names.
1.2.1.2 Solution: assign the complicated numbers to named objects
What if you could write some text instead of our stupidly large numbers? Something like norway_pop instead of 5465387? Luckily, you can! You can do the following to give your variables a name:
# store the number to an object called norway_pop
norway_pop <- 5465387
# store another number to an object called sweden_pop
sweden_pop <- 10185555
This is called assignment, and is done using the arrow <- (see footnote 2). You have now created a named object, which is a very powerful tool. Your objects behave exactly like the numbers that are stored in them (try e.g. norway_pop*2 or sweden_pop+4445). This means that we can now simply write:
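A minimal sketch of that calculation, assuming the two objects are assigned exactly as above:

```r
# populations assigned as in the text above
norway_pop <- 5465387
sweden_pop <- 10185555

# Norway's share of the combined population
norway_pop / (norway_pop + sweden_pop)

# Sweden's share of the combined population
sweden_pop / (norway_pop + sweden_pop)
```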
Notice how similar this is to the conceptual version we saw earlier! This is much simpler, easier to read and way less error-prone than writing out the numbers each time. In addition, if you want to change any of the population sizes, you will just have to change it in one place instead of four.
To make things even easier, we could store norway_pop + sweden_pop as total_pop, and store our results under names too:
# make total population object
total_pop <- norway_pop + sweden_pop
# make object for Norway's proportion
norway_proportion <- norway_pop / total_pop
# print the result
norway_proportion
#> [1] 0.349205
# make object for Sweden's proportion
sweden_proportion <- sweden_pop / total_pop
# print the result
sweden_proportion
#> [1] 0.650795
Additional info
If you have experience with Python, you would have printed the objects using print(norway_proportion). In R we don’t have to explicitly use print(). An object (or any calculation for that matter) is automatically printed when we run it. R does have a print() function, though, that can be used if you want to be explicit about printing something.
1.2.1.3 Notes about naming variables
You should always give your variables sensible names. In the code above, we could have saved ourselves some typing by e.g. calling the populations x and y respectively. However, this quickly becomes a nightmare to read when scripts get long, so even though norway_pop takes longer to write than x, you should go with norway_pop so it’s possible to understand what’s going on in your script.
You can name your variables just about anything, but there are some characters you can’t use in their names. Notably, spaces, dashes and a lot of special characters can’t be used, and the name cannot start with a number. If you stick to letters, underscores and numbers (except at the beginning), you should be fine!
Another thing to note about variable names is that they are case sensitive, meaning for instance that A can contain something completely different than a. In the code above, we used all-lowercase variable names, so what happens if we try to run the object Norway_pop (see footnote 3)?
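The check below is a small sketch of that experiment. It uses exists(), a base R function that reports whether an object with a given name is defined, so that the script keeps running. The commented-out line is what you would type in the console, and the error text is the message recent R versions print; it may be worded slightly differently in yours.

```r
norway_pop <- 5465387

# the lowercase name is defined ...
exists("norway_pop")
#> [1] TRUE

# ... but the capitalised one is not
exists("Norway_pop")
#> [1] FALSE

# Typing the capitalised name directly stops with an error:
# Norway_pop
# Error: object 'Norway_pop' not found
```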
R can’t find Norway_pop because it doesn’t exist; only norway_pop does. Beware of this, as it’s often a source of annoying bugs in the code.
Important concept:
Assign the values you work with to named objects. It’s easier to read, and you’re less likely to make mistakes. Use good variable names, and beware of case sensitivity!
1.2.2 Vectors
Let’s expand our little Norway-Sweden project to all the Nordic countries. Table 1.1 shows the populations of all the Nordic countries.
Country | Population |
---|---|
Denmark | 5868927 |
Finland | 5572355 |
Iceland | 350773 |
Norway | 5465387 |
Sweden | 10185555 |
Now, if we were to calculate the proportion of the total for all countries, like we did for Norway and Sweden earlier, the code would quickly become very long:
denmark_pop <- 5868927
finland_pop <- 5572355
iceland_pop <- 350773
nordic_total <- norway_pop + sweden_pop + denmark_pop + finland_pop + iceland_pop
norway_proportion <- norway_pop / nordic_total
sweden_proportion <- sweden_pop / nordic_total
denmark_proportion <- denmark_pop / nordic_total
finland_proportion <- finland_pop / nordic_total
iceland_proportion <- iceland_pop / nordic_total
norway_proportion
sweden_proportion
denmark_proportion
finland_proportion
iceland_proportion
That is a lot of typing! And remember that this is only for the Nordic countries; imagine doing this for all the countries in the world! In R, we can avoid all this tedious work by storing our data in a vector.
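As a minimal sketch, a vector can be built with the c() function, which combines its arguments into a single object. The name nordic and the ordering (the row order of Table 1.1) are assumptions, chosen to match how the vector is used in the rest of this section.

```r
# combine the five populations into one vector, in the order of Table 1.1
# (Denmark, Finland, Iceland, Norway, Sweden)
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)

# print the vector
nordic
```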
1.2.2.2 Vector properties
1.2.2.2.1 Mathematical operations
When you have a vector of numbers, you can apply all the same mathematical operations that you can on a single number. The operation is then applied to each element of the vector separately. Try it!
# create a vector from 1 to 10
my_vector <- 1:10
# multiply each element by 2
my_vector * 2
#> [1] 2 4 6 8 10 12 14 16 18 20
If you have two vectors of equal length, you can do mathematical operations on both, e.g. multiply two vectors. The first element of the first vector is then multiplied with the first element of the second vector, the second element with the second and so on. You can do the same for division, addition, subtraction and any operation you can think of.
# create another vector
my_vector2 <- 11:20
# multiply them together
my_vector * my_vector2
#> [1] 11 24 39 56 75 96 119 144 171 200
If the two vectors aren’t of the same length, you will get a warning, but R will still try to perform the operation. The shorter vector will then start over when it runs out of numbers.
# this is what R does:
#1*1
#2*2
#3*3
#4*4
#5*5
#6*6
#7*7
#8*8
#9*9 # here we use the last element of the vector 1:9
#10*1 # here R "recycles" the vector, using the first element of 1:9
Remember to keep all your vectors the same length, or something unexpected might happen! To check the length of a vector, you can use the length() function.
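The recycling and the length check described above can be reproduced with a sketch like the following. The name short_vector is made up for this example and stands in for the vector 1:9. R still returns a result, but also warns that the longer length is not a multiple of the shorter one.

```r
my_vector <- 1:10      # 10 elements
short_vector <- 1:9    # 9 elements (hypothetical name for this example)

# element-wise multiplication; the 10th element is multiplied by short_vector[1]
recycled <- my_vector * short_vector   # R gives a warning here
recycled

# comparing the lengths shows why the warning appeared
length(my_vector)
length(short_vector)
```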
Exercise: express the Nordic population sizes in millions by dividing all the numbers in the nordic vector by 10^6.
1.2.2.2.2 Extracting numbers
Having all numbers stored in one place is great, but what if you want to use just one of them? We use square brackets [] to access numbers inside vectors. You put the index of the element you want to extract inside the square brackets like this: my_vector[1]. This will extract the first element from the vector, and my_vector[2] will extract the second. On our nordic vector, we could for instance do the following:
We can also select multiple numbers, by inputting a vector inside the square brackets. nordic[c(3, 5)] will extract the third and the fifth element of our nordic vector, and nordic[1:3] will extract elements 1 through 3.
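A short sketch of these indexing forms, assuming nordic holds the populations in the order of Table 1.1:

```r
# populations in the order Denmark, Finland, Iceland, Norway, Sweden
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)

# first element (Denmark)
nordic[1]

# third and fifth elements (Iceland and Sweden)
nordic[c(3, 5)]

# elements 1 through 3 (Denmark, Finland, Iceland)
nordic[1:3]
```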
Exercise: extract elements 2 through 4 from the nordic vector. Then, extract only the Scandinavian countries, and store them in an object with a good name.
Vector indexing:
If you’re familiar with Python or another programming language, you are probably used to counting starting at 0 when indexing. In R, however, we start at 1. This can be annoying when switching between languages, but is unfortunately something you just need to remember.
1.2.2.2.3 Getting the sum and mean of a vector
You can do a variety of operations on vectors in addition to using the mathematical operators +, -, * and /. Two of the most common operations are calculating the sum and the mean of all the numbers in the vector, which is done with the sum() and mean() functions.
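For example, applied to the Nordic populations (a sketch that assumes nordic is defined in the order of Table 1.1):

```r
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)

# total population of the Nordic countries
sum(nordic)

# average population per country
mean(nordic)
```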
There’s a variety of other functions you can apply to a vector, such as max(), min() and median(). Try them out!
Exercise: Use what you have learned about vectors to calculate the population proportions of all the Nordic countries in a single operation. Save it to an object with a good name.
The solution to the exercise shows how powerful working with vectors can be.
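One possible solution, sketched under the assumption that nordic is the population vector in Table 1.1 order; the name nordic_proportions is only a suggestion:

```r
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)

# divide every population by the Nordic total in a single operation
nordic_proportions <- nordic / sum(nordic)
nordic_proportions
```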
This is way simpler than doing this without using vectors! Also, imagine if you had a hundred, or even a million values in your vector. The code would still look exactly the same, making calculations with any number of values trivial.
Important concept
If you have many values that go together, store them together in a vector. You can do a variety of mathematical operations on vectors, but make sure that all vectors have the same length!
1.2.3 Strings
As mentioned way back in section 1.1.2.2, you have to write "hello" rather than hello to get the actual text “hello”. Now you may have figured out that this is because we have to separate objects from plain text in some way. Text within quotes is called a string in R (and in most programming languages). Strings can be stored in objects and combined into vectors just like numbers.
string_vector <- c("this", "is", "a", "vector", "of", "strings!")
string_vector
#> [1] "this" "is" "a" "vector" "of" "strings!"
To combine several strings into one, or even combine numbers and strings, you can use the function paste():
nordic_sum <- sum(nordic)
paste("The total population of the nordic countries is", nordic_sum, "people.")
#> [1] "The total population of the nordic countries is 27442997 people."
Exercise: Combine the names of the Nordic countries into a vector. Make sure the names are in the same order as in table 1.1.
One of many uses for strings is to give names to a vector. You can use the names() function to see a vector’s names.
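A quick sketch of that check on a freshly created, unnamed vector (again assuming nordic is the population vector in Table 1.1 order):

```r
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)

# ask for the names of the (so far unnamed) vector
names(nordic)
#> NULL
```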
As you can see, we get NULL here, which means that the vector elements don’t have any names. We can set the names like this:
names(nordic) <- nordic_names
nordic
#> Denmark Finland Iceland Norway Sweden
#> 5868927 5572355 350773 5465387 10185555
Now, when we print our nordic vector, we can see which population size belongs to which country! Neat! We can also extract values from our vector based on names rather than just position.
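Name-based extraction could look like the sketch below. It assumes nordic_names is the vector of country names from the exercise above, in Table 1.1 order.

```r
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)
nordic_names <- c("Denmark", "Finland", "Iceland", "Norway", "Sweden")
names(nordic) <- nordic_names

# extract one element by name (note the quotes)
nordic["Norway"]

# extract several elements by passing a vector of names
nordic[c("Denmark", "Iceland")]
```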
Exercise: Do some calculations on the now named nordic vector (e.g. calculate the proportions again). What happens to the names?
1.2.4 Logical values
R has some words that have special meaning. Two of these are what we call logical (or boolean) values: TRUE and FALSE. These are often the result of checking if some condition is true or not, using logical operators.
The most important logical operators in R are larger than (>), smaller than (<), equal to (==), larger/smaller than or equal to (>= / <=) and not equal to (!=). They return TRUE if the condition is true, and FALSE if the condition is false. In its simplest form, we can write:
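A few simple comparisons, as a sketch with numbers picked arbitrarily for illustration:

```r
# is 3 larger than 2?
3 > 2
#> [1] TRUE

# is 3 smaller than 2?
3 < 2
#> [1] FALSE

# is 2 + 2 equal to 4?
2 + 2 == 4
#> [1] TRUE

# is 5 not equal to 5?
5 != 5
#> [1] FALSE
```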
The operators also work on vectors. Then, every element of the vector is checked, and we get a vector of TRUE and FALSE.
# which countries have a population larger than 5000000?
nordic > 5*10^6
#> Denmark Finland Iceland Norway Sweden
#> TRUE TRUE FALSE TRUE TRUE
# which countries in our vector are not named "Norway"?
names(nordic) != "Norway"
#> [1] TRUE TRUE TRUE FALSE TRUE
A neat property of logical values is that they behave as numbers in certain contexts. For example, if you use the sum() function on them, each TRUE is counted as 1, and each FALSE is counted as 0.
We can use this to check how many elements in our vector match a certain condition:
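As a sketch, here is the counting behaviour on its own and then applied to the nordic vector (assumed to hold the populations in Table 1.1 order):

```r
nordic <- c(5868927, 5572355, 350773, 5465387, 10185555)

# TRUE counts as 1 and FALSE as 0
sum(c(TRUE, FALSE, TRUE))
#> [1] 2

# how many countries have more than 5 million inhabitants?
sum(nordic > 5 * 10^6)
#> [1] 4
```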
Exercise: play around with the logical operators so you get a feel for how they work. Ask questions like: “is sum(3, 4) the same as 3+4?” or “which is larger, 10^6 or 6^10?”, and answer them using logical operators.
1.2.4.1 NA
R also has another important special value, NA, which stands for “not available”. It is mostly used to indicate missing data in data sets. One important property of NA is that almost any calculation involving NA returns NA.
You can write na.rm = TRUE within the sum() and mean() functions to ignore NAs.
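The sketch below shows both behaviours. The vector values_with_na is made up for this example and is not part of the Nordic data.

```r
# a hypothetical vector with one missing value
values_with_na <- c(2, 4, NA, 8)

# calculations involving NA return NA
NA + 1
sum(values_with_na)
mean(values_with_na)

# na.rm = TRUE drops the missing value before calculating
sum(values_with_na, na.rm = TRUE)
mean(values_with_na, na.rm = TRUE)
```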
Important concept: TRUE, FALSE and NA are examples of special values in R.
TRUE and FALSE are called logical (or boolean) values, and are often the result of using logical operators like ==, >, != etc.
TRUE and FALSE behave as 1 and 0 respectively when used inside e.g. the sum() function.
NA indicates missing data. Almost any calculation involving NA returns NA.
1.2.5 Functions
So far you have seen a handful of functions being used, like c(), seq() and mean(), to name some. Functions are always written in the form functionname(). The text before the parentheses is the function name, and whatever is inside the parentheses is called the arguments. A function can be seen as a series of operations that are applied to the arguments, or simply as somewhere you put something in, and get something else in return. Functions can often save you a tremendous amount of time; compare for example manually calculating the mean of 1000 numbers with using the mean() function.
If you want to know more about what a function does, you can write ? and then the function name, for example ?seq. Then you will get a help page that tells you more about how the function works, with some examples of use at the bottom. Be aware that the help pages are written by programmers for programmers, and are often difficult to understand. In the beginning you will often be better off googling what a function does (but the examples are always useful).
Functions mostly have one or more arguments (sometimes none), which go inside the parentheses separated by commas, conceptually: function(arg1, arg2, arg3). The arguments can either be given in a set order, or you can name them. Consider the following:
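The original listing is not reproduced here, so the pair below is a sketch that assumes seq() as the example function. Its first three arguments are from, to and by.

```r
# arguments given by position: from, to, by
seq(1, 10, 2)
#> [1] 1 3 5 7 9

# the same call with the arguments named
seq(from = 1, to = 10, by = 2)
#> [1] 1 3 5 7 9
```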
These are exactly the same, but one uses argument order, and the other uses the argument names. To figure out the order and names of arguments, you have to consult the help pages or the internet. For simple functions like seq() and mean() it’s common to omit the argument names, while for more complicated functions it’s better to include them, so you clearly show what you’re doing. Note that argument names are written without quotes.
Line breaks in functions
A tip for making code more readable is that as long as we are inside parentheses, we can have line breaks in our code. This means that we can put each argument on a separate line. Instead of the code above, we could also have written:
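As a sketch, assuming the call in question is a named seq() call, the multi-line form could look like this. The object name spread_out is made up for the example.

```r
# each argument on its own line; R keeps reading until the parenthesis closes
spread_out <- seq(
  from = 1,
  to = 10,
  by = 2
)
spread_out
```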
This may not matter much for such a simple function, but long, complicated function calls are much easier to read when they are formatted this way. You will see this formatting in our next section about data frames.
1.2.6 Data frames
Let’s return to investigating more aspects of the Nordic countries. After all, there’s more to a country than just its population size. Table 1.2 shows some additional information about the Nordic countries:
Country | Population size | Area (km²) | Life expectancy (years) |
---|---|---|---|
Denmark | 5868927 | 42434 | 81.24 |
Finland | 5572355 | 303815 | 81.33 |
Iceland | 350773 | 100250 | 83.26 |
Norway | 5465387 | 304282 | 82.14 |
Sweden | 10185555 | 410335 | 82.40 |
What if you want to use some more information about the countries, e.g. the area? One solution is to store the new information in a vector:
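For the areas, that could look like the sketch below. The values are taken from Table 1.2 and kept in the same country order as the nordic vector; the name nordic_area matches the one used when the data frame is created later.

```r
# areas in km^2, in the order Denmark, Finland, Iceland, Norway, Sweden
nordic_area <- c(42434, 303815, 100250, 304282, 410335)
nordic_area
```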
However, as the number of variables increases, you can get quite a lot of vectors! Luckily there’s a way to keep everything together, in what R calls a data frame.
1.2.6.1 Creating a data frame
A data frame is conceptually similar to the table above. You have columns of different variables, and rows of observations of these variables. If you have a number of vectors (of the same length!), you can combine these into a data frame with the function data.frame() (see footnote 4):
nordic_df <- data.frame(country = nordic_names,
                        population = nordic,
                        area = nordic_area)
nordic_df
#> country population area
#> Denmark Denmark 5868927 42434
#> Finland Finland 5572355 303815
#> Iceland Iceland 350773 100250
#> Norway Norway 5465387 304282
#> Sweden Sweden 10185555 410335
The arguments of data.frame() are kind of special in that you can name them whatever you want. Your argument names (i.e. the text before the equal sign) determine the names of the columns in your data frame.
1.2.6.2 Extracting the data within the data frame
You previously learned that you can access data in a vector using square brackets [], and the same goes for data frames. However, we now have two dimensions (rows and columns) instead of just one. The syntax for extracting from a data frame is df[row, column]. For instance, to get the area (column 3) of Finland (row 2), we can run:
You can also use the name of the column instead of the position (remember quotes!).
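Both forms are sketched below. To keep the example self-contained, the data frame is rebuilt here from plain vectors, so unlike the one above it has no country row names; the extracted values are the same.

```r
nordic_df <- data.frame(
  country = c("Denmark", "Finland", "Iceland", "Norway", "Sweden"),
  population = c(5868927, 5572355, 350773, 5465387, 10185555),
  area = c(42434, 303815, 100250, 304282, 410335)
)

# area (column 3) of Finland (row 2), by position
nordic_df[2, 3]
#> [1] 303815

# the same value, using the column name in quotes
nordic_df[2, "area"]
#> [1] 303815
```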
If you leave the row field empty, you will get all rows, and if you leave the column field empty you will get all columns.
# Get all the information about Iceland
nordic_df[3,]
#> country population area
#> Iceland Iceland 350773 100250
# get all rows and columns (useless, but it works!)
nordic_df[,]
#> country population area
#> Denmark Denmark 5868927 42434
#> Finland Finland 5572355 303815
#> Iceland Iceland 350773 100250
#> Norway Norway 5465387 304282
#> Sweden Sweden 10185555 410335
Another way of getting data from your data frame is with the $ operator. Writing something like df$column returns the entire column as a vector.
This is very useful, as you can do the same operations on these vectors that you can on any vector.
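A small sketch of the $ operator in use. The data frame is rebuilt from plain vectors so that the example runs on its own.

```r
nordic_df <- data.frame(
  country = c("Denmark", "Finland", "Iceland", "Norway", "Sweden"),
  population = c(5868927, 5572355, 350773, 5465387, 10185555),
  area = c(42434, 303815, 100250, 304282, 410335)
)

# the population column as a plain vector
nordic_df$population

# vector functions work on the extracted column
sum(nordic_df$population)
max(nordic_df$area)
```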
Note that the $ syntax only works for columns, not for rows.
Exercise: Calculate the population density (population/area) of the Nordic countries using the data frame and the $ operator.
1.2.6.3 Adding data to the data frame
Adding data to the data frame can also be done with the $ operator. Instead of referencing an existing column, you can simply write a new name after the $, and assign to it as a variable like so: df$newcolumn <- c(1, 2, 3, 4, 5).
nordic_df$is_norway <- c("no", "no", "no", "yes", "no")
nordic_df
#> country population area is_norway
#> Denmark Denmark 5868927 42434 no
#> Finland Finland 5572355 303815 no
#> Iceland Iceland 350773 100250 no
#> Norway Norway 5465387 304282 yes
#> Sweden Sweden 10185555 410335 no
Now you have added a column named “is_norway” to your data frame, which is kind of useless, but still cool.
Exercise: For a more useful application of creating a new column, add a population density column to nordic_df (using the same calculation as in the previous exercise).
1.2.6.4 Subsetting with logical operators
Sometimes you don’t know exactly which rows you want to extract from your data. For example, you may want all countries with an area below 300 000 km², or all countries that aren’t Norway. For this, we can use the logical operators that you learned about in section 1.2.4.
If you want to figure out which countries have an area less than 300 000 km², you have learned that you can do the following:
# which elements have an area less than 300000?
nordic_df$area < 300000
#> [1] TRUE FALSE TRUE FALSE FALSE
The result shows that elements 1 and 3 are TRUE, while the rest are FALSE. If you put this same statement within square brackets (before the comma) to index your data frame, R will return all rows that are TRUE, and discard all rows that are FALSE.
# get all countries with an area below 300 000 km^2
nordic_df[nordic_df$area < 300000, ]
#> country population area is_norway pop_density
#> Denmark Denmark 5868927 42434 no 138.307183
#> Iceland Iceland 350773 100250 no 3.498983
We could do the same to get all rows where the country isn’t Norway:
# get all countries except Norway
nordic_df[nordic_df$country != "Norway", ]
#> country population area is_norway pop_density
#> Denmark Denmark 5868927 42434 no 138.307183
#> Finland Finland 5572355 303815 no 18.341277
#> Iceland Iceland 350773 100250 no 3.498983
#> Sweden Sweden 10185555 410335 no 24.822535
Two important things to note here are that you have to explicitly write nordic_df$ inside the square brackets, and that you have to end with a comma to tell R that you’re filtering rows (i.e. leave the column space empty). Neither nordic_df[area < 300000, ] nor nordic_df[nordic_df$area < 300000] will work.
Exercise: make a subset of your data containing all the countries with population density of more than 18 (using the pop_density column you created in the previous exercise).
Important concept
When you have a lot of data that belongs together, typically something you could store in an Excel document and show in a table, make it into a data frame. Many things we will be doing later in this course require that your data is in a data frame, so learn to recognize this type of object.
1. With the comments, I hope!
2. You can also use = instead of <-. If you know another programming language already, like Python, this may feel more natural. I like to use the arrow to remind my muscle memory that I’m working in R, but for simple assignments like the ones in this tutorial it makes no difference which you use, so use whichever you like!
3. As a side note, whenever you wonder “what happens if I do …”, try it! The worst thing that can happen if you try something is that you get an error, the best thing is that you learn something useful.
4. You can see here that the names of nordic carry over when making the data frame. This results in the names of the countries being stored in what seems to be a nameless column. These are the row names, and can be accessed with row.names(nordic_df).