A Complete Introduction to R for Data Science
Skipping the boring parts, this article teaches the core concepts you need to start working as a data scientist in R.
This article focuses on getting started with R for data science, and the topics are chosen to put you in a data science mindset from the start. I collected them from around the internet, practiced them myself, and arranged them into a step-by-step path.
Prerequisites:
You only need to know basic programming terms such as variables and functions, and intermediate mathematics such as square roots, vectors, and matrices.
These are the main topics that I’ll cover:
- Installations
- Basic Building Blocks
- Workspace & Files
- Sequences of Numbers
- Vectors
- Missing Values
- Subsetting Vectors
- Matrices & Data Frames
- Looking at Data
- Base Graphics
- Manipulating Data with dplyr
- Getting & Cleaning Data
- Tidying Data with tidyr
If you are just getting started with R, I recommend that you go step by step; otherwise, feel free to jump to the topic that interests you the most.
1. Installations
Let us install all the required things first…
Installing R:
Windows:
Here is a great video, (courtesy of Roger Peng at Johns Hopkins Biostatistics), that will guide you through the installation process step by step.
Alternatively, here are step-by-step instructions:
- Open an internet browser and go to www.r-project.org.
- Click the “download R” link in the middle of the page under “Getting Started.”
- Select a CRAN location (a mirror site) and click the corresponding link.
- Click on the “Download R for Windows” link at the top of the page.
- Click on the “install R for the first time” link at the top of the page.
- Click “Download R for Windows” and save the executable file somewhere on your computer. Run the .exe file and follow the installation instructions.
Mac OS X:
Here is a great video, (courtesy of Roger Peng at Johns Hopkins Biostatistics), that will guide you through the installation process step by step.
Alternatively, here are step-by-step instructions:
- Open an internet browser and go to www.r-project.org.
- Click the “download R” link in the middle of the page under “Getting Started.”
- Select a CRAN location (a mirror site) and click the corresponding link. Any mirror will work, but the one closest to you is usually the fastest.
- Click on the “Download R for (Mac) OS X” link at the top of the page.
- Save the .pkg file, double-click it to open, and follow the installation instructions.
- Install XQuartz
Get RStudio:
In addition to R, it is highly recommended that you install RStudio. It makes learning R easier, because it allows you to see your console, text editor, graphs, and past commands all in a single window.
Here are step by step instructions that will help guide you through the installation process:
Windows:
- Go to www.rstudio.com and click on the “Download RStudio” button.
- Click on “Download RStudio Desktop”, which is found under the Open Source Edition column.
- The next screen will tell you to “Choose your version of RStudio”
- Click on the version recommended for your system, or the latest Windows version, and save the executable file. Run the .exe file and follow the installation instructions.
Mac OS X:
- Go to www.rstudio.com and click on the “Download RStudio” button.
- Click on “Download RStudio Desktop” which is found under the Open Source Edition column.
- The next screen will tell you to “Choose your version of RStudio”. Pick the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop RStudio to your Applications folder.
2. Basic Building Blocks
R can be used as an interactive calculator. Open R (not RStudio) and type:
5 + 7
and press Enter.
R is a programming language, and often the reason we use a programming language rather than a calculator is to automate some process or avoid unnecessary repetition.
In this case, we may want to use our result from above in a second calculation. Instead of retyping 5+7 every time we need it, we can create a new variable that stores the result.
To assign the result of 5+7 to a new variable called x, type:
x <- 5+7
This can be read as ‘x gets 5 plus 7’. The assignment operator is just a ‘less than’ symbol followed by a ‘minus’ sign. Think of it as an arrow: you are assigning the value on the right side of the arrow to the variable name on the left side.
To view the contents of the variable x, just type x and press Enter.
x
Now, store the result of x-3 in a new variable called y, then print it.
y <- x - 3
y
Now, let’s create a small collection of numbers called a vector. Any object that contains data is called a data structure and numeric vectors are the simplest type of data structure in R. In fact, even a single number is considered a vector of length one.
The easiest way to create a vector is with the c() function, which stands for ‘concatenate’ or ‘combine’. To create a vector containing the numbers 1.1, 9 and 3.14, type:
z <- c(1.1,9,3.14)
You can combine vectors to make a new vector. Create a new vector that contains z, 555, then z again in that order.
c(z, 555, z)
Numeric vectors can be used in arithmetic expressions. Type the following to see what happens:
z * 2 + 100
First, R multiplied each of the three elements in z by 2. Then it added 100 to each element to get the result you see above.
Other common arithmetic operators are `+`, `-`, `/`, and `^` (where x^2 means ‘x squared’). To take the square root, use the sqrt() function, and to take the absolute value, use the abs() function.
Take the square root of z-1:
my_sqrt <- sqrt(z-1)
Now, create a new variable called my_div that gets the value of z divided by my_sqrt:
my_div <- z/my_sqrt
When given two vectors of the same length, R simply performs the specified arithmetic operation (`+`, `-`, `*`, etc.) element-by-element.
If the vectors are of different lengths, R ‘recycles’ the shorter vector until it is the same length as the longer vector. When we did z*2+100 in our earlier example, z was a vector of length 3, but technically 2 and 100 are each vectors of length 1. Behind the scenes, R is ‘recycling’ the 2 to make a vector of 2s and the 100 to make a vector of 100s. In other words, when you ask R to compute z * 2 + 100, what it really computes is this: z * c(2, 2, 2) + c(100, 100, 100).
Let’s see another example of how this vector recycling works.
c(1,2,3,4)+c(0,10)
If the length of the shorter vector does not divide evenly into the length of the longer vector, R will still apply the ‘recycling’ method, but will throw a warning to let you know something fishy might be going on. Try this example:
c(1,2,3,4)+c(0,10,100)
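The recycling rule described above can be made explicit with rep_len(), which repeats a vector out to a requested length. The following is a minimal sketch, not part of the original lesson; the variable names (long_vec, short_vec, and so on) are made up for illustration.

```r
# Recycling sketch: the shorter vector is repeated to match the longer one.
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(0, 10)

# What R does implicitly when the lengths differ.
recycled_sum <- long_vec + short_vec

# rep_len() spells out the same repetition by hand.
explicit_sum <- long_vec + rep_len(short_vec, length(long_vec))

print(recycled_sum)                           # 1 12 3 14
print(identical(recycled_sum, explicit_sum))  # TRUE

# When the lengths do not divide evenly, R still recycles but warns.
# tryCatch() captures the warning text instead of printing it.
warn_msg <- tryCatch(
  {
    long_vec + c(0, 10, 100)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
print(warn_msg)
```

Capturing the warning this way is only for demonstration; in everyday work you would simply read the warning in the console and check your vector lengths.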
That covers the most basic building blocks of R for data science.
3. Workspace & Files
In this lesson, you’ll learn how to examine your local workspace in R and begin to explore the relationship between your workspace and the file system of your machine.
Because different operating systems have different conventions with regard to things like file paths, the outputs of these commands may vary across machines. However, it’s important to note that R provides a common API (a common set of commands) for interacting with files, so your code will work across different kinds of computers.
Let’s jump right in so you can get a feel for how these special functions work! Determine which directory your R session is using as its current working directory using:
getwd()
List all the objects in your local workspace using:
ls()
List all the files in your working directory using:
list.files() #or you can also do dir()
The # sign starts a comment in R. When you see # in a code block, read the comment but don’t type it into the console.
Using the args() function on a function name is also a handy way to see what arguments a function can take.
args(list.files)
You can get extended help by typing:
?list.files
You can also store values from functions into variables like:
old.dir <- getwd() #will add current working directory to variable
Keep old.dir in mind, as we will use it at the end of this section.
To create a directory called testdir in the current directory, type:
dir.create("testdir")
We will do all the work in this new directory and then delete it after we are done. Set your working directory to testdir with the setwd() command.
setwd("testdir")
Create a file in your working directory called mytest.R:
file.create("mytest.R")
This should be the only file in this newly created directory. Let’s check this by listing all the files in the current directory by typing dir().
Check to see if mytest.R exists in the working directory using the file.exists() function.
Access information about the file mytest.R by using file.info().
You can use the $ operator, e.g., file.info("mytest.R")$mode, to grab specific items.
Change the name of the file mytest.R to mytest2.R by using file.rename().
file.rename("mytest.R","mytest2.R")
Try to delete mytest.R using file.remove('mytest.R'). That won’t work, since mytest.R no longer exists: you have already renamed it.
Make a copy of mytest2.R called mytest3.R using file.copy().
file.copy("mytest2.R","mytest3.R")
Provide the relative path to the file mytest3.R by using file.path().
You can use file.path() to construct file and directory paths. Pass folder1 and folder2 as arguments to file.path() to make a platform-independent pathname.
> file.path('folder1', 'folder2')
[1] "folder1/folder2"
Create a directory in the current working directory called testdir2 and a subdirectory for it called testdir3, all in one command, by using dir.create() and file.path().
dir.create(file.path('testdir2','testdir3'), recursive = TRUE)
You can get help at any point by putting ? in front of a function name, like ?dir.create.
Delete the testdir2 directory that you created by using unlink().
unlink("testdir2", recursive = TRUE)
If you don’t use recursive = TRUE, unlink() leaves the directory in place. R is concerned that you’re unaware that you’re deleting a directory and all of its contents. R reasons that, if you don’t specify that recursive equals TRUE, you don’t know that something is in the directory you’re trying to delete. R tries to prevent you from making a mistake.
Go back to your original working directory using setwd().
setwd(old.dir)
Delete the testdir directory that you just left (and everything in it):
unlink("testdir", recursive = TRUE)
So you now know how to examine your R workspace and work with the file system of your machine from within R.
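If you would like to rehearse the whole create, rename, copy, and clean-up cycle without touching your own project folders, one option is to run it inside R's per-session temporary directory. This is a sketch under that assumption; sandbox_demo and the notes*.R file names are hypothetical and do not appear in the steps above.

```r
# Remember where we started so we can come back.
old_dir <- getwd()

# Build a throwaway directory under R's per-session temp folder.
sandbox <- file.path(tempdir(), "sandbox_demo")   # hypothetical name
dir.create(sandbox, recursive = TRUE)
setwd(sandbox)

# Same sequence as in the section: create, rename, copy.
file.create("notes.R")                            # hypothetical file name
file.rename("notes.R", "notes2.R")
file.copy("notes2.R", "notes3.R")

files_seen <- list.files()            # the two files that now exist
old_exists <- file.exists("notes.R")  # FALSE: it was renamed

# Return to the original directory, then delete the sandbox and its contents.
setwd(old_dir)
unlink(sandbox, recursive = TRUE)
sandbox_gone <- !dir.exists(sandbox)

print(files_seen)
print(old_exists)
print(sandbox_gone)
```

Building the path with file.path() rather than pasting strings together keeps the example portable across operating systems.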
4. Sequences of Numbers
Sequences are a key component in data science. You’ll see how later in this article.
The simplest way to create a sequence of numbers in R is by using the : operator. Try:
> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
That gave us every integer between (and including) 1 and 20. We could also use it to create a sequence of real numbers. For example, try pi:10.
What happens if we do 15:1? Give it a try to find out.
Remember that if you have questions about a particular R function, you can access its documentation with a question mark followed by the function name: ?function_name_here. However, in the case of an operator like the colon used above, you must enclose the symbol in backticks like this: ?`:`. (NOTE: The backtick (`) key is generally located in the top left corner of a keyboard, above the Tab key. If you don’t have a backtick key, you can use regular quotes.) For example, to pull up the documentation for ‘:’ with regular quotes, use ?':'
Often, we’ll desire more control over a sequence than the : operator gives us. The seq() function serves this purpose. The most basic use of seq() does exactly the same thing as the : operator, meaning seq(1, 20) gives the same result as the 1:20 we tried above.
The seq() function also accepts an extra argument that tells R how much to increment the sequence by. To count from 0 to 10 in steps of 0.5, try:
seq(0, 10, by=0.5)
Or maybe we don’t care what the increment is and we just want a sequence of 30 numbers between 5 and 10. seq(5, 10, length=30) does the trick.
my_seq <- seq(5,10,length=30)
To confirm that my_seq has length 30, we can use the length() function.
length(my_seq)
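A side note that the text above does not state: the full name of this seq() argument is length.out, and writing length = 30 works because R allows argument names to be abbreviated. The sketch below compares the two spellings; full_name and short_name are made-up variable names.

```r
# Full argument name.
full_name <- seq(5, 10, length.out = 30)

# Abbreviated name, matched to length.out by partial argument matching.
short_name <- seq(5, 10, length = 30)

print(identical(full_name, short_name))  # TRUE
print(length(full_name))                 # 30
print(range(full_name))                  # 5 10
```

Spelling out length.out is a little more typing but makes the intent clearer to someone reading the code later.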
Let’s pretend we don’t know the length of my_seq, but we want to generate a sequence of integers from 1 to N, where N represents the length of the my_seq vector. In other words, we want a new vector (1, 2, 3, …) that is the same length as my_seq. There are often several approaches to solving the same problem, particularly in R. Simple approaches that involve less typing are generally best. It’s also important for your code to be readable, so that you and others can figure out what’s going on without too much hassle. You can do this by:
> 1:length(my_seq)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30
OR
> seq(along.with = my_seq)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30
However, as is the case with many common tasks, R has a separate built-in function for this purpose called seq_along(). Type:
> seq_along(my_seq)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30
If R has a built-in function for a particular task, it’s likely that function is highly optimized for that purpose and is your best option. As you become a more advanced R programmer, you’ll design your own functions to perform tasks when there are no better options. We’ll explore writing your own functions later in this article.
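One concrete reason to prefer seq_along() over 1:length() shows up with empty vectors. The sketch below uses a hypothetical empty vector, empty_vec, to compare the two approaches.

```r
# A vector with no elements, e.g. the result of a filter that matched nothing.
empty_vec <- numeric(0)

# 1:length(empty_vec) is 1:0, which counts DOWN and yields two indices.
via_colon <- 1:length(empty_vec)

# seq_along() correctly returns an empty integer vector.
via_seq_along <- seq_along(empty_vec)

print(via_colon)       # 1 0
print(via_seq_along)   # integer(0)
```

A loop written as for (i in 1:length(empty_vec)) would therefore run twice on an empty input, while for (i in seq_along(empty_vec)) would not run at all.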
One more function related to creating sequences of numbers is rep(), which stands for ‘replicate’. Let’s look at a few uses. If we want to create a vector that contains 40 zeros, we can use:
rep(0, times = 40)
If instead we want our vector to contain 10 repetitions of the vector (0, 1, 2), we can do:
rep(c(0,1,2), times = 10)
Finally, let’s say that rather than repeating the vector (0, 1, 2) over and over again, we want our vector to contain 10 zeros, then 10 ones, then 10 twos. We can do this with the each argument.
rep(c(0,1,2), each = 10)
So those are several ways to create sequences of numbers.
5. Vectors
We have been talking about vectors for quite some time now. Let’s see what they are and the different ways to create and work with them.
Vectors are the simplest and most common data structure in R.
Vectors come in two different flavors:
- Atomic vectors: an atomic vector contains exactly one data type.
- Lists: a list may contain multiple data types.
In the topics above we dealt entirely with numeric vectors, which are one type of atomic vector. Other types of atomic vectors include logical, character, integer, and complex. Let’s take a closer look at logical and character vectors.
Logical vectors can contain the values TRUE, FALSE, and NA (for ‘not available’). These values are generated as the result of logical ‘conditions’. Let’s see a simple example:
> num_vect <- c(0.5, 55, -10, 6) #creating a vector num_vect
> tf <- num_vect < 1 # tf is TRUE wherever num_vect is less than 1
> tf
[1] TRUE FALSE TRUE FALSE
The first element of num_vect is 0.5, which is less than 1, and therefore the statement 0.5 < 1 is TRUE. The second element of num_vect is 55, which is greater than 1, so the statement 55 < 1 is FALSE. The same logic applies for the third and fourth elements.
The `<` symbol in this example is a comparison operator, one of R’s ‘logical operators’. Others include `>`, `>=`, `<=`, `==` for exact equality, and `!=` for inequality. For example, you can do num_vect >= 6 and see the result.
If we have two logical expressions, A and B, we can ask whether at least one is TRUE with A | B (logical ‘or’, a.k.a. union) or whether they are both TRUE with A & B (logical ‘and’, a.k.a. intersection). Lastly, !A is the negation of A and is TRUE when A is FALSE and vice versa.
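These operators also work element by element on whole logical vectors, not just on single values. A small sketch, reusing the same four numbers as num_vect under the made-up name scores:

```r
scores <- c(0.5, 55, -10, 6)      # same values as num_vect, hypothetical name

is_small    <- scores < 1         # TRUE FALSE  TRUE FALSE
is_positive <- scores > 0         # TRUE  TRUE FALSE  TRUE

both   <- is_small & is_positive  # TRUE only where both conditions hold
either <- is_small | is_positive  # TRUE where at least one holds
not_sm <- !is_small               # flips every element

print(both)     # TRUE FALSE FALSE FALSE
print(either)   # TRUE TRUE TRUE TRUE
print(not_sm)   # FALSE TRUE FALSE TRUE
```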
It’s a good idea to spend some time playing around with various combinations of these logical operators until you get comfortable with their use. Here are three statements. Try to guess the answer to each first, and then run it in the R console.
> (3 > 5) & (4 == 4)
> (TRUE == TRUE) | (TRUE == FALSE)
> ((111 >= 111) | !(TRUE)) & ((4 + 1) == 5)
Don’t worry if you found these to be tricky. They’re supposed to be. Working with logical statements in R takes practice, but your efforts will be rewarded in topics like subsetting and control structures.
Character vectors are also very common in R. Double quotes are used to distinguish character objects, as in the following example, where we are creating a character vector that contains “My”, “name”, “is”.
> my_char <- c("My", "name", "is")
> my_char
[1] "My" "name" "is"
Right now, my_char is a character vector of length 3. Let’s say we want to join the elements of my_char together into one continuous character string (i.e. a character vector of length 1). We can do this using the paste() function.
> paste(my_char, collapse = " ")
[1] "My name is"
The collapse argument to the paste() function tells R that when we join together the elements of the my_char character vector, we’d like to separate them with single spaces.
It seems that we’re missing something…. Ah, yes! Your name!
Remember concatenate? This is exactly what we’ll use here:
> my_name <- c(my_char, "Talha")
> my_name
[1] "My" "name" "is" "Talha"
Now, use the paste() function once more to join the words in my_name together into a single character string.
> paste(my_name, collapse = " ")
[1] "My name is Talha"
In this example, we used the paste() function to collapse the elements of a single character vector. paste() can also be used to join the elements of multiple character vectors. In the simplest case, we can join two character vectors that are each of length 1 (i.e. join two words). Try paste("Hello", "world!", sep = " "), where the sep argument tells R that we want to separate the joined elements with a single space.
For a slightly more complicated example, we can join two vectors, each of length 3. Use paste() to join the integer vector 1:3 with the character vector c("X", "Y", "Z"). This time, use sep = "" to leave no space between the joined elements.
> paste(1:3, c("X", "Y", "Z"), sep = "")
[1] "1X" "2Y" "3Z"
Remember vector recycling? Let’s see how it works with character vectors.
Try paste(LETTERS, 1:4, sep = "-"), where LETTERS is a predefined variable in R containing a character vector of all 26 letters in the English alphabet.
> paste(LETTERS, 1:4, sep = "-")
[1] "A-1" "B-2" "C-3" "D-4" "E-1" "F-2" "G-3" "H-4" "I-1" "J-2" "K-3" "L-4" "M-1"
[14] "N-2" "O-3" "P-4" "Q-1" "R-2" "S-3" "T-4" "U-1" "V-2" "W-3" "X-4" "Y-1" "Z-2"
Since the character vector LETTERS is longer than the numeric vector 1:4, R simply recycles, or repeats, 1:4 until it matches the length of LETTERS. Also worth noting is that the numeric vector 1:4 gets coerced into a character vector by the paste() function. We’ll discuss coercion in another topic, but all it really means is that the numbers 1, 2, 3, and 4 in the output above are no longer numbers to R, but rather characters “1”, “2”, “3”, and “4”.
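To see the coercion for yourself, you can compare the class of the input with the class of what paste() returns. This is a small illustrative sketch; labels and digits are made-up names.

```r
# paste() always returns character, whatever it is given.
labels <- paste(1:3, c("X", "Y", "Z"), sep = "")

print(class(1:3))      # "integer"
print(class(labels))   # "character"

# Even a purely numeric input comes back as text...
digits <- paste(1:3)
print(digits)          # "1" "2" "3"

# ...so it must be converted back before doing arithmetic.
back_to_numbers <- as.integer(digits) + 1L
print(back_to_numbers) # 2 3 4
```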
6. Missing Values
If you have seen raw data in an Excel sheet, fresh out of a database and not yet cleaned, there is a good chance it had missing values: NA, blank cells, or whatever placeholder the organization uses for a missing value. Missing values play an important role in statistics and data analysis. Often, missing values must not be ignored, but rather carefully studied to see if there’s an underlying pattern or cause for their missingness.
In R, NA is used to represent any value that is ‘not available’ or ‘missing’ (in the statistical sense). In this topic we’ll explore missing values further. Any operation involving NA generally yields NA as the result. Let’s see an example:
x <- c(44, NA, 5, NA)
Now, let’s multiply x by 3:
> x*3
[1] 132 NA 15 NA
Notice that the elements of the resulting vector that correspond with the NA values in x are also NA. To make things a little more interesting, let’s create a vector containing 1000 draws from a standard normal distribution:
y <- rnorm(1000)
rnorm(n) generates n random numbers from a normal distribution (by default with mean 0 and standard deviation 1).
Now let’s generate a vector of 1000 NAs.
z <- rep(NA, 1000)
Let’s select 100 elements at random from these 2000 values (combining y and z), such that we don’t know how many NAs we’ll wind up with or what positions they’ll occupy in our final vector:
my_data <- sample(c(y,z),100)
Let’s first ask the question of where our NAs are located in our data. The is.na() function tells us whether each element of a vector is NA.
my_na <- is.na(my_data)
In the previous topic we saw the == operator as a method of testing for equality between two objects. So, you might think the expression my_data == NA yields the same results as is.na(). Give it a try:
my_data == NA
The result is a vector of all NAs. NA is not really a value but a placeholder for an unknown quantity, so the comparison itself is unknown. Don’t worry if that’s a little confusing. The key takeaway is to be cautious when using logical expressions anytime NAs might creep in, since a single NA value can derail the entire thing. So, back to the task at hand.
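The difference between == NA and is.na() is easiest to see on a tiny vector. A minimal sketch, using a hypothetical four-element vector called sample_vals:

```r
sample_vals <- c(44, NA, 5, NA)   # hypothetical values

# Comparing anything with NA gives NA, so this is NA in every position.
eq_result <- sample_vals == NA

# is.na() is the function designed for this test.
na_flags <- is.na(sample_vals)

# anyNA() answers the yes/no question "is anything missing?"
any_missing <- anyNA(sample_vals)

print(eq_result)    # NA NA NA NA
print(na_flags)     # FALSE TRUE FALSE TRUE
print(any_missing)  # TRUE
```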
Now that we have a vector, my_na, that has a TRUE for every NA and FALSE for every numeric value, we can compute the total number of NAs in our data.
Let’s give that a try here. R treats TRUE as 1 and FALSE as 0 in arithmetic, so calling the sum() function on my_na counts the total number of TRUEs in my_na, and thus the total number of NAs in my_data.
sum(my_na)
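Because logical values behave like 0 and 1 in arithmetic, the same idea gives both a count and a proportion of missing values. The sketch below also shows the na.rm argument, which many summary functions accept for skipping NAs; vals is a made-up example vector.

```r
vals <- c(44, NA, 5, NA)   # hypothetical data

n_missing     <- sum(is.na(vals))    # how many values are NA
share_missing <- mean(is.na(vals))   # what fraction are NA

# Without na.rm the NAs propagate; with it they are dropped first.
total_raw   <- sum(vals)
total_known <- sum(vals, na.rm = TRUE)

print(n_missing)      # 2
print(share_missing)  # 0.5
print(total_raw)      # NA
print(total_known)    # 49
```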
Now that we’ve got NAs down pat, let’s look at a second type of missing value, NaN, which stands for ‘not a number’. To generate NaN, try dividing (using a forward slash) 0 by 0 now.
> 0/0
[1] NaN
Let’s do one more just for fun. In R, Inf stands for infinity. What happens if you subtract Inf from Inf?
Inf - Inf
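R provides separate tests for these special values, and they do not all agree with each other. The following sketch puts an ordinary number, NA, two NaN results, and Inf in one hypothetical vector, special, and applies each test.

```r
special <- c(1, NA, 0/0, Inf - Inf, Inf)

na_test     <- is.na(special)      # TRUE for NA and for NaN
nan_test    <- is.nan(special)     # TRUE only for NaN
finite_test <- is.finite(special)  # TRUE only for ordinary numbers

print(na_test)      # FALSE  TRUE  TRUE  TRUE FALSE
print(nan_test)     # FALSE FALSE  TRUE  TRUE FALSE
print(finite_test)  #  TRUE FALSE FALSE FALSE FALSE
```

Note that is.na() treats NaN as missing, while is.nan() does not treat NA as NaN.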
So these are the common types of missing values. It will all come together when you start your first data science project; these are core concepts for becoming a data scientist in R.
7. Subsetting Vectors
In this topic we will see how to extract elements from a vector based on conditions that we specify.
Start by creating a vector:
> rand_nums <- rnorm(1000) #Random numbers
> nas <- rep(NA, 1000) #NAs
> x <- sample(c(rand_nums,nas),40) #getting a sample of 40
> x #print
The way you tell R that you want to select some particular elements (i.e. a ‘subset’) from a vector is by placing an ‘index vector’ in square brackets immediately following the name of the vector. For a simple example, try:
x[1:10] #Will print first ten elements of x.
Index vectors come in four different flavors:
- Logical vectors
- Vectors of positive integers
- Vectors of negative integers
- Vectors of character strings
One common scenario when working with real-world data is that we want to extract all elements of a vector that are not NA (i.e. missing data). Recall that is.na(x) yields a vector of logical values the same length as x, with TRUEs corresponding to NA values in x and FALSEs corresponding to non-NA values in x.
If you do:
> x[is.na(x)]
#Think for a minute about what will happen and select your answer before
#trying it in the console.
#1: A vector of TRUEs and FALSEs
#2: A vector of all NAs
#3: A vector with no NAs
#4: A vector of length 0
Recall that ! gives us the negation of a logical expression, so !is.na(x) can be read as ‘is not NA’. Therefore, if we want to create a vector called y that contains all of the non-NA values from x, we can use:
y <- x[!is.na(x)]
Now that we have a non-NA vector y, we can subset it as we please. For example, if we want all the positive elements (i.e. greater than zero), we can do:
y[y>0]
One might wonder why we didn’t just start with x[x>0] to isolate the positive elements. Well, try that and see why!
x[x>0]
Since NA is not a value, but rather a placeholder for an unknown quantity, the expression NA > 0 evaluates to NA. Hence we get a bunch of NAs mixed in with our positive numbers when we do this.
Combining our knowledge of logical operators with our new knowledge of subsetting, we could do this:
x[!is.na(x) & x > 0]
In this case, we request only values of x that are both non-missing AND greater than zero.
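The three approaches can be compared side by side on a small vector. The sketch below also shows which(), a base R function not used above, that returns the positions where a condition is TRUE and silently skips NAs. The vector vals is hypothetical.

```r
vals <- c(-1.2, NA, 0.7, NA, 3.1)   # hypothetical data with two NAs

# Naive filter: NA > 0 is NA, and an NA index returns NA.
with_na <- vals[vals > 0]

# Explicitly require non-missing AND positive.
clean <- vals[!is.na(vals) & vals > 0]

# Alternative: which() keeps only positions where the test is TRUE.
via_which <- vals[which(vals > 0)]

print(with_na)    # NA 0.7 NA 3.1
print(clean)      # 0.7 3.1
print(via_which)  # 0.7 3.1
```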
Many programming languages use what’s called ‘zero-based indexing’, which means that the first element of a vector is considered element 0. R uses ‘one-based indexing’, which means the first element of a vector is considered element 1.
I’ve already shown you how to subset just the first ten values of x using x[1:10]. In this case, we’re providing a vector of positive integers inside of the square brackets, which tells R to return only the elements of x numbered 1 through 10. Let’s subset the 3rd, 5th, and 7th elements of x.
x[c(3,5,7)] #or we can also do c(x[3],x[5],x[7])
To further understand the `one-based indexing` of R let’s see what the following expression gives us:
x[0]
As you might expect, we get nothing useful: x[0] returns an empty vector, numeric(0). Unfortunately, R doesn’t prevent us from doing this. What if we ask for the 3000th element of x?
x[3000]
Again, nothing useful (this time NA), but R doesn’t prevent us from asking for it. You should always make sure that what you are asking for is within the bounds of the vector you’re working with.
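Since R will not stop you from indexing out of bounds, a quick check against length() is one way to protect yourself. This is just one possible guard, sketched with a hypothetical three-element vector vals and a made-up position idx.

```r
vals <- c(10, 20, 30)     # hypothetical three-element vector

zero_index <- vals[0]     # empty result, no error
past_end   <- vals[5]     # NA, no error

# One possible guard before indexing with a computed position.
idx <- 5
in_bounds <- idx >= 1 && idx <= length(vals)

print(zero_index)   # numeric(0)
print(past_end)     # NA
print(in_bounds)    # FALSE
```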
If we’re interested in all elements of x EXCEPT the 2nd and 10th:
x[c(-2,-10)]
R accepts negative integer indexes. Where x[c(2, 10)] gives us ONLY the 2nd and 10th elements of x, x[c(-2, -10)] gives us all elements of x EXCEPT for the 2nd and 10th elements.
A shorthand way of specifying multiple negative numbers is to put the negative sign out in front of the vector of positive numbers. If you type x[-c(2,10)] you’ll get the exact same result.
So far, we’ve covered three types of index vectors: logical, positive integer, and negative integer. The only remaining type requires us to introduce the concept of ‘named’ elements.
Let’s create a numeric vector with three named elements:
> vect <- c(foo = 11, bar = 2, norf = NA)
# When we print vect to the console, you'll see that each element
# has a name.
> vect
foo bar norf
11 2 NA
We can also get the names of vect by passing vect as an argument to the names() function:
names(vect)
Alternatively, we can create an unnamed vector vect2 with values (11, 2, NA):
vect2 <- c(11,2,NA)
Then, we can add the names attribute to vect2 after the fact with:
names(vect2) <- c("foo", "bar", "norf")
Now, let’s check that vect and vect2 are the same by passing them as arguments to the identical() function.
identical(vect, vect2)
Now, back to the matter of subsetting a vector by named elements. If we want to get the element named “bar” (i.e. the second element of vect), we will do:
vect["bar"]
Likewise, we can specify a vector of names with:
vect[c("foo","bar")]
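Two details of name-based subsetting are worth trying out: single brackets keep the name attached while double brackets return the bare value, and asking for a name that does not exist quietly gives NA. A short sketch using the same vect; the name "baz" is deliberately one that is not in the vector.

```r
vect <- c(foo = 11, bar = 2, norf = NA)

single_named <- vect["bar"]     # value with its name still attached
single_bare  <- vect[["bar"]]   # just the value, name dropped

# Check for a name before relying on it.
has_bar <- "bar" %in% names(vect)
has_baz <- "baz" %in% names(vect)   # "baz" is intentionally absent

# Asking for a missing name returns NA rather than an error.
missing_name <- vect["baz"]

print(single_named)
print(single_bare)     # 2
print(has_bar)         # TRUE
print(has_baz)         # FALSE
print(missing_name)
```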
Now you know all four methods of subsetting data from vectors. Different approaches are best in different scenarios and when in doubt, try it out!
8. Matrices & Data Frames
Matrices and data frames both represent ‘rectangular’ data types, meaning that they are used to store tabular data, with rows and columns.
The main difference is that matrices can only contain a single class of data, while data frames can consist of many different classes of data.
Let’s create a vector:
my_vector <- 1:20 #Contains number 1 through 20
The dim() function tells us the ‘dimensions’ of an object. So if we do:
> dim(my_vector)
NULL
Clearly, that’s not very helpful! Since my_vector is a vector, it doesn’t have a dim attribute (so it’s just NULL), but we can find its length using length(my_vector), which returns 20, as expected. But what if we give my_vector a dim attribute?
dim(my_vector) <- c(4,5)
It’s okay if that last command seemed a little strange to you. It should! The dim() function allows you to get OR set the dim attribute for an R object. In this case, we assigned the value c(4, 5) to the dim attribute of my_vector. Use dim(my_vector) to confirm that we've set the dim attribute correctly.
Another way of seeing this is by calling the attributes() function on my_vector.
attributes(my_vector)
Just like in math class, when dealing with a 2-dimensional object (think rectangular table), the first number is the number of rows and the second is the number of columns. Therefore, we just gave my_vector 4 rows and 5 columns.
But, wait! That doesn’t sound like a vector any more. Well, it’s not. Now it’s a matrix. Let’s print the contents of my_vector to see what it looks like:
my_vector
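Now, let’s confirm it’s actually a matrix by using the class() function.
> class(my_vector)
[1] "matrix" "array"
Sure enough, my_vector is now a matrix. (R 4.0.0 and later also report "array", since a matrix is a two-dimensional array; older versions print only "matrix".) We should store it in a new variable that helps us remember what it is.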
my_matrix <- my_vector
The example that we’ve used so far was meant to illustrate the point that a matrix is simply an atomic vector with a dimension attribute. A more direct method of creating the same matrix uses the matrix() function. Let’s do that now:
my_matrix2 <- matrix(data = 1:20, nrow = 4, ncol = 5)
You can always use ?matrix to see the help page and understand the arguments.
Let’s confirm that my_matrix and my_matrix2 are actually identical.
identical(my_matrix,my_matrix2)
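One thing to notice about both approaches is the fill order: R fills a matrix column by column. The matrix() function has a byrow argument for filling row by row instead. A small sketch on a 2-by-3 matrix, with made-up names by_col and by_row:

```r
# Default: values run down the first column, then the second, and so on.
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: values run across the first row, then the second.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col[1, ])   # 1 3 5
print(by_row[1, ])   # 1 2 3

# The two layouts are transposes of each other only when the shapes swap,
# so here we simply confirm that they differ.
print(identical(by_col, by_row))   # FALSE
```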
Now, imagine that the numbers in our table represent some measurements from a clinical experiment, where each row represents one patient and each column represents one variable for which measurements were taken. We may want to label the rows, so that we know which numbers belong to each patient in the experiment.
One way to do this is to add a column to the matrix, which contains the names of all four people. So let’s create a character vector containing the names of our patients:
patients <- c("Bill", "Gina", "Kelly", "Sean")
Now we’ll use the cbind() function to ‘combine columns’.
cbind(patients, my_matrix)
Something is fishy about our result! It appears that combining the character vector with our matrix of numbers caused everything to be enclosed in double quotes. This means we’re left with a matrix of character strings, which is no good.
If you remember from the beginning of this topic, I told you that matrices can only contain ONE class of data. Therefore, when we tried to combine a character vector with a numeric matrix, R was forced to ‘coerce’ the numbers to characters, hence the double quotes. This is called ‘implicit coercion’, because we didn’t ask for it. It just happened.
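You can verify the coercion directly with typeof(), which reports how the values are stored. A minimal sketch on a smaller example; nums, names_col, and the two names in it are hypothetical.

```r
nums      <- matrix(1:4, nrow = 2)   # a small integer matrix
names_col <- c("Ann", "Bob")         # hypothetical labels

mixed <- cbind(names_col, nums)

print(typeof(nums))    # "integer"
print(typeof(mixed))   # "character"
print(dim(mixed))      # 2 3

# The numbers survive only as text, so arithmetic on them no longer works
# until they are converted back.
print(mixed[1, 2])                    # "1"
print(as.integer(mixed[1, 2]) + 1L)   # 2
```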
So, we’re still left with the question of how to include the names of our patients in the table without destroying the integrity of our numeric data. Try:
my_data <- data.frame(patients, my_matrix)
It looks like the data.frame() function allowed us to store our character vector of names right alongside our matrix of numbers. That’s exactly what we were hoping for! Behind the scenes, the data.frame() function takes any number of arguments and returns a single object of class data.frame that is composed of the original objects. You can confirm this by calling class(my_data).
It’s also possible to assign names to the individual rows and columns of a data frame, which presents another possible way of determining which row of values in our table belongs to each patient. However, since we’ve already solved that problem, let’s solve a different problem by assigning names to the columns of our data frame so that we know what type of measurement each column represents.
Since we have six columns (including patient names), we’ll need to first create a vector containing one element for each column.
cnames <- c("patient","age","weight","bp","rating","test")
Now, use the colnames() function to set the colnames attribute for our data frame. This is similar to the way we used the dim() function earlier.
> colnames(my_data) <- cnames
> my_data
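To confirm that the numeric columns really stayed numeric this time, you can ask for the class of every column. The sketch below rebuilds the same objects from scratch so it runs on its own; col_classes is a made-up name. The class of the patient column depends on your R version (character in R 4.0.0 and later, factor by default before that), so the check below only looks at the measurement columns.

```r
patients  <- c("Bill", "Gina", "Kelly", "Sean")
my_matrix <- matrix(1:20, nrow = 4, ncol = 5)

my_data <- data.frame(patients, my_matrix)
colnames(my_data) <- c("patient", "age", "weight", "bp", "rating", "test")

# One class per column, returned as a named character vector.
col_classes <- sapply(my_data, class)
print(col_classes)

# Unlike the cbind() result, the measurements are still numbers.
print(my_data$age + 1L)   # 2 3 4 5
```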
So these were the basics of working with two very important and common data structures. There is much more to them, which we will cover in future topics.
9. Looking at Data
Whenever you’re working with a new dataset, the first thing you should do is look at it! What is the format of the data? What are the dimensions? What are the variable names? How are the variables stored? Are there missing data? Are there any flaws in the data?
This topic will help you answer all of these questions and more using R’s built-in functions. We’ll be using a dataset constructed from the United States Department of Agriculture’s PLANTS database, which you can download from https://www.dropbox.com/s/nfi3d1d0wwrc6z8/plants.xlsx?dl=0. Move the file into your working directory.
Let’s load it into our R session first. Don’t worry if you don’t understand the code below; just follow along, as we’ll cover it in another topic.
# Install a library for reading Excel files
> install.packages("readxl")
# Load the library
> library("readxl")
# Read the Excel file, convert it to a data frame and assign it to the plants variable
> plants <- data.frame(read_excel("plants.xlsx"))
# Make sure that plants is a data frame
> class(plants)
Since the dataset is stored in a data frame, we know it is rectangular. In other words, it has two dimensions (rows and columns) and fits neatly into a table or spreadsheet. To see exactly how many rows and columns we’re dealing with, use:
> dim(plants)
[1] 5166 10
The first number you see (5166) is the number of rows (observations) and the second number (10) is the number of columns (variables).
You can also use nrow(plants)
to see only the number of rows and ncol(plants)
to see only the number of columns.
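If you do not have plants.xlsx at hand, the same three functions can be tried on `mtcars`, a data frame that ships with R (it is used here only as a convenient substitute for plants).

```r
# mtcars ships with R, so no download is needed
dim(mtcars)      # rows, then columns
nrow(mtcars)     # rows only
ncol(mtcars)     # columns only

# dim() is simply the two values together
identical(dim(mtcars), c(nrow(mtcars), ncol(mtcars)))
```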
If you are curious as to how much space the dataset is occupying in memory, you can use:
object.size(plants)
Now that we have a sense of the shape and size of the dataset, let’s get a feel for what’s inside.
names(plants)
will return a character vector of column (i.e. variable) names.
I’ve applied fairly descriptive variable names to this dataset, but that won’t always be the case. A logical next step is to peek at the actual data. However, our dataset contains over 5000 observations (rows), so it’s impractical to view the whole thing all at once. This is why the head()
function allows you to preview the top of the dataset.
head(plants)
By default, head()
shows you the first six rows of data. You can alter this behavior by passing as a second argument the number of rows you’d like to view.
head(plants, 10)
The same applies for using tail()
to preview the end of the dataset.
tail(plants, 15)
After reviewing the top and bottom of the data, you probably noticed lots of NAs, which are R’s placeholders for missing values. Use the summary() function to get a better feel for how each variable is distributed and how much of the dataset is missing.
summary(plants)
summary()
provides different output for each variable, depending on its class. For numeric data such as Precip_Min
, summary()
displays the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. These values help us understand how the data are distributed.
For categorical variables (called ‘factor’ variables in R), summary()
displays the number of times each value (or ‘level’) occurs in the data. For example, each value of Scientific_Name
only appears once, since it is unique to a specific plant. In contrast, the summary for Duration
(also a factor variable) tells us that our dataset contains 3031 Perennial plants, 682 Annual plants, etc.
You can see that R truncated the summary for Active_Growth_Period
by including a catch-all category called Other
. Since it is a categorical/factor variable, we can see how many times each value actually occurs in the data with:
table(plants$Active_Growth_Period)
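One caveat worth knowing: since R 4.0.0, data.frame() no longer converts text columns to factors by default, so on a recent installation the text columns of plants may be stored as character rather than factor, and summary() will then report only their length and class. The sketch below illustrates the difference on a made-up data frame called `toy`; it is an illustration, not the plants data.

```r
# Hypothetical toy data frame (not the plants dataset)
toy <- data.frame(
  height   = c(1.2, 3.4, NA, 2.2),
  duration = c("Annual", "Perennial", "Perennial", NA),
  stringsAsFactors = FALSE
)

summary(toy$height)              # quartiles, mean, and a count of NAs
summary(toy$duration)            # character column: only length, class and mode

# Converting to a factor gives per-level counts instead
summary(factor(toy$duration))

# table() counts every value; useNA = "ifany" also reports missing ones
table(toy$duration, useNA = "ifany")
```

If your own summary(plants) output shows no level counts, wrapping a column in factor() as above is one way to get them.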
Each of the functions we’ve introduced so far has its place in helping you better understand the structure of your data.
The most useful and concise function for understanding the *str*ucture of your data is str()
.
str(plants)
str()
is actually a very general function that you can use on most objects in R. Any time you want to understand the structure of something (a dataset, function, etc.), str()
is a good place to start.
In this topic, you learned how to get a feel for the structure and contents of a new dataset using a collection of simple and useful functions. Taking the time to do this upfront can save you time and frustration later on in your analysis.
10. Base Graphics
One of the greatest strengths of R, relative to other programming languages, is the ease with which we can create publication-quality graphics.
Before we start this topic I want to mention that I will not be covering the more advanced portions of graphics in R like ‘lattice’, ‘ggplot2’ and ‘ggvis’. There is a school of thought that this approach is backwards, that we should teach ggplot2 first. See http://varianceexplained.org/r/teach_ggplot2_to_beginners/ for an outline of this view.
Scatterplots
Load cars, one of the datasets that come pre-loaded with R, with:
data(cars)
Pull up help page for cars: ?cars
As you can see in the help page, the cars data set has only two variables: speed and stopping distance. Note that the data is from the 1920s. Run head()
on cars data.
head(cars)
Before plotting, it is always a good idea to get a sense of the data. Run the plot()
command on the cars dataset.
plot(cars)
As always, R tries very hard to give you something sensible given the information that you have provided to it. First, R notes that the data frame you have given it has just two columns, so it assumes that you want to plot one column versus the other. Second, since we do not provide labels for either axis, R uses the names of the columns. Third, it creates axis tick marks at nice round numbers and labels them accordingly. Fourth, it uses the other defaults supplied in plot()
.
We will now spend some time exploring plot, and many of the things covered in this topic will apply to most other R graphics functions. Note that ‘plot’ is short for scatterplot.
Open the help page of plot: ?plot
. The help page for plot()
highlights the different arguments that the function can take. The two most important are x and y, the variables that will be plotted. For the next set of examples, we will include the argument names in our answers. That is, we will not type plot(cars$speed, cars$dist)
, although that will work. Instead, we will use:
plot(x = cars$speed, y = cars$dist)
Note that this produces a slightly different answer than plot(cars)
. In this case, R is not sure what you want to use as the labels on axes, so it just uses the arguments which you pass in, data frame name and dollar signs included.
Note that there are other ways to call the plot command, e.g. using the “formula” interface. For example, we get a similar plot to the above with:
plot(dist ~ speed, cars)
However, we will wait till later in this topic before using the formula interface.
Use the plot()
command to show dist
on the x-axis and speed on the y-axis
from the cars data frame.
plot(x = cars$dist, y = cars$speed)
It probably makes more sense for speed to go on the x-axis since stopping distance is a function of speed more than the other way around.
Let’s change the x-axis label to “Speed” & y-axis to “Stopping Distance”:
plot(x = cars$speed, y = cars$dist, xlab = "Speed", ylab = "Stopping Distance")
The reason that plot(cars)
worked at the beginning of the topic was that R was smart enough to know that the first element (i.e., the first column) in cars should be assigned to the x
argument and the second element to the y
argument. To save typing, the next set of questions and answers will all be of the form, plot(cars, …) with various arguments added. For each question, we will only want one additional argument at a time. Of course, you can pass in more than one argument when doing a real project.
Q: Plot cars with a main title of “My Plot”. Note that the argument for the main title is main
not “title”.
plot(cars, main = "My Plot")
Q: Plot cars with a subtitle of “My Plot Subtitle”.
plot(cars, sub = "My Plot Subtitle")
The plot help page (?plot
) only covers a small number of the many arguments that can be passed in to plot() and to other graphical functions. To begin to explore the many other options, look at ?par
. Let’s look at some of the more commonly used ones. Continue using plot(cars, …) as the base answer to these questions.
Q: Plot cars so that the plotted points are colored red.
plot(cars, col = 2)
Q: Plot cars while limiting the x-axis to 10 through 15.
plot(cars, xlim = c(10,15))
You can also change the shape of the symbols in the plot. The help page for points (?points
) provides the details.
Q: Plot cars using triangles.
plot(cars, pch = 2)
Arguments like col
and pch
may not seem very intuitive. And that is because they aren’t! So, many/most people use more modern packages, like ggplot2, for creating their graphics in R. It is, however, useful to have an introduction to base graphics because many of the idioms in lattice and ggplot2 are modeled on them.
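In a real project you would usually pass several of these arguments in one call. A minimal sketch that combines the ones covered so far (the title text is just an example):

```r
# Several arguments at once: title, axis labels, color and symbol
plot(cars,
     main = "Stopping Distance vs. Speed",
     xlab = "Speed",
     ylab = "Stopping Distance",
     col  = "red",   # a color name is often easier to read than a number
     pch  = 2)       # 2 = open triangle, see ?points
```

Passing a color by name instead of by number is a readability choice; both forms are accepted by plot().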
Boxplots
Let’s now look at some other functions in base graphics that may be useful, starting with boxplots. For this let’s load a different built-in data frame called “mtcars”.
data(mtcars)
Anytime that you load up a new data frame, you should explore it before using it. Just to recall you can use dim(mtcars)
, head(mtcars)
etc.
To start on boxplots open its help page: ?boxplot
Instead of adding data columns directly as input arguments, as we did with plot()
it is often handy to pass in the entire data frame. This is what the data
argument in boxplot()
allows.
boxplot()
like many R functions, also takes a “formula” argument, generally an expression with a tilde ( ~
) which indicates the relationship between the input variables. This allows you to enter something like mpg ~ cyl
to plot the relationship between cyl
(number of cylinders) on the x-axis and mpg
(miles per gallon) on the y-axis.
Q: Use boxplot()
with formula = mpg ~ cyl
and data = mtcars
to create a box plot.
boxplot(mpg ~ cyl, mtcars)
The plot shows that mpg is much lower for cars with more cylinders.
Note that we can use the same set of arguments that we explored with plot()
above to add axis labels, titles and so on.
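As a quick illustration of reusing those arguments, the sketch below adds a title and axis labels to the same boxplot and also stores the return value. The label texts are made up for the example.

```r
# Same boxplot as above, with a title and axis labels added
b <- boxplot(mpg ~ cyl, data = mtcars,
             main = "Fuel Economy by Cylinder Count",
             xlab = "Number of Cylinders",
             ylab = "Miles per Gallon")

# boxplot() invisibly returns the numbers behind the picture
b$names   # the groups shown on the x-axis
b$n       # number of cars in each group
```

Looking at `b$n` is a handy sanity check, since a box drawn from very few observations deserves less trust.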
Histograms
When looking at a single variable, histograms are a useful tool; hist()
is the associated R function. Like plot()
, hist()
is best used by just passing in a single vector.
Q: Use hist()
with the vector mtcars$mpg
to create a histogram.
hist(mtcars$mpg)
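hist() accepts the same labelling arguments, plus a `breaks` argument that controls the number of bins. A small sketch, with the label texts chosen only for illustration:

```r
h <- hist(mtcars$mpg,
          breaks = 10,                  # a suggestion; R rounds to "pretty" cut points
          main   = "Distribution of mpg",
          xlab   = "Miles per Gallon")

h$breaks        # the bin edges R actually used
sum(h$counts)   # every car falls into exactly one bin
```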
This is it for this topic. The best place to go from here is to study the ggplot2 package.
If you want to learn ggplot2 here is a good resource: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
If you want to explore other elements of base graphics see http://www.ling.upenn.edu/~joseff/rstudy/week4.html
11. Manipulating Data with dplyr
First things first, install the dplyr
package:
> install.packages("dplyr")
> library("dplyr")
dplyr
is a fast and powerful R package written by Hadley Wickham and Romain Francois that provides a consistent and concise grammar for manipulating tabular data.
One unique aspect of dplyr is that the same set of tools allows you to work with tabular data from a variety of sources, including data frames, data tables, databases and multidimensional arrays. In this topic we will focus on data frames, but everything you learn will apply equally to other formats.
As you may know, “CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R” (http://cran.rstudio.com/). RStudio maintains one of these so-called ‘CRAN mirrors’ and they generously make their download logs publicly available (http://cran-logs.rstudio.com/). We’ll be working with the log from July 8, 2014, which contains information on roughly 225,000 package downloads.
Let’s first read the data file. This time it’s not an Excel sheet. The file format is .csv (Comma Separated Values).
Download Data file: https://www.dropbox.com/s/85umed14g0eumjz/2014-07-08.csv?dl=0 (Make sure to move it into your working directory, i.e. the directory returned by getwd())
#Read, convert and store the csv file in mydf
mydf <- data.frame(read.csv("2014-07-08.csv", stringsAsFactors = FALSE))
Let’s look at the dimensions of mydf
.
dim(mydf)
Now use head()
to preview the data:
head(mydf)
For this topic it is important that you have dplyr
version 0.4.0 or later. To confirm this, type: packageVersion("dplyr")
The first step of working with data in dplyr
is to load the data into what the package authors call a ‘data frame tbl’ or tbl_df
.
cran <- tbl_df(mydf)
To avoid any confusion and keep things running smoothly, let’s remove the original data frame from your workspace with:
rm("mydf")
The main advantage to using a tbl_df
over a regular data frame is the printing.
cran
First, we are shown the class and dimensions of the dataset. Just below that, we get a preview of the data. Instead of attempting to print the entire dataset, dplyr
just shows us the first 10 rows of data and only as many columns as fit neatly in our console. At the bottom, we see the names and classes for any variables that didn’t fit on our screen.
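A note for newer installations: in dplyr 1.0.0 and later, tbl_df() is deprecated in favor of as_tibble(). The sketch below shows the equivalent call on a made-up data frame; `toy_log` is a hypothetical stand-in for mydf.

```r
library(dplyr)

# Hypothetical stand-in for mydf
toy_log <- data.frame(package = c("swirl", "dplyr"), size = c(100, 200))

toy_tbl <- as_tibble(toy_log)   # the replacement for tbl_df() in dplyr >= 1.0.0
class(toy_tbl)                  # includes "tbl_df" alongside "data.frame"
toy_tbl
```

Because the result still inherits from data.frame, everything you learned about data frames keeps working on it.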
According to the “Introduction to dplyr” vignette written by the package authors, “The dplyr philosophy is to have small functions that each do one thing well.” Specifically, dplyr supplies five ‘verbs’ that cover most fundamental data manipulation tasks: select()
, filter()
, arrange()
, mutate()
, and summarize()
.
Use ?select
to pull up the documentation for the first of these core functions.
As may often be the case, particularly with larger datasets, we are only interested in some of the variables.
select(cran, ip_id, package, country)
The first thing to notice is that we don’t have to type cran$ip_id
, cran$package
, and cran$country
, as we normally would when referring to columns of a data frame. The select()
function knows we are referring to columns of the cran
dataset. Also, note that the columns are returned to us in the order we specified, even though ip_id
is the rightmost column in the original dataset.
Recall that in R, the :
operator provides a compact notation for creating a sequence of numbers. For example, try 5:20
. Normally, this notation is reserved for numbers, but select() allows you to specify a sequence of columns this way, which can save a bunch of typing. Like:
select(cran, r_arch: country)
# Will select all columns starting from r_arch and ending with country.
Similarly we can do select(cran, country: r_arch)
to print the same columns in reverse order.
Instead of specifying the columns we want to keep, we can also specify the columns we want to throw away. Let’s omit the time column:
select(cran, -time)
The negative sign in front of time
tells select()
that we DON’T want the time
column. Now, let’s combine strategies to omit all columns from X through size (X:size).
select(cran, -(X:size))
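The four select() patterns above can be compared side by side on a tiny table. The tibble `toy` below is hypothetical: it borrows a few column names from the text, but its values are made up.

```r
library(dplyr)

# Hypothetical miniature of the cran log
toy <- tibble(
  X       = 1:3,
  time    = c("t1", "t2", "t3"),
  size    = c(10, 20, 30),
  package = c("a", "b", "c"),
  country = c("US", "PK", "IN"),
  ip_id   = 1:3
)

select(toy, ip_id, package)   # columns come back in the order requested
select(toy, size:country)     # a run of adjacent columns
select(toy, -time)            # everything except time
select(toy, -(X:size))        # drop the whole run from X through size
```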
Now that you know how to select a subset of columns using select()
, a natural next question is “How do I select a subset of rows?” That’s where the filter()
function comes in. Let’s look at an example:
filter(cran, package == "swirl")
filter()
recognizes package
as a column of cran
, without you having to explicitly specify cran$package
.
The ==
operator asks whether the thing on the left is equal to the thing on the right. If yes, then it returns TRUE
. If no, then FALSE
. In this case, package is an entire vector (column) of values, so package == "swirl"
returns a vector of TRUEs and FALSEs. filter()
then returns only the rows of cran
corresponding to the TRUEs.
You can specify as many conditions as you want, separated by commas. For example filter(cran, r_version == "3.1.1", country == "US")
will return all rows of cran
corresponding to downloads from users in the US
running R version 3.1.1
.
The conditions passed to filter()
make use of any of the standard comparison operators. To see all comparison operators see ?Comparison
filter(cran, r_version <= "3.0.2", country == "PK")
This call to filter()
requests all rows for which some condition AND another condition are TRUE. We can also request rows for which EITHER one condition OR another condition is TRUE. For example:
filter(cran, country == "US" | country == "PK")
# will give us all rows for which the country variable equals either "US" or "PK".
Q: Use filter()
to fetch all rows for which size
is strictly greater than (>) 100500
(no quotes, since size is numeric) AND r_os
equals linux-gnu
.
filter(cran, size > 100500, r_os == "linux-gnu")
Q: Get only the rows for which the r_version
is not missing.
filter(cran, !is.na(r_version))
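Here is a small self-contained sketch of the filter() patterns covered above. The tibble `toy` and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical miniature of the cran log
toy <- tibble(
  package   = c("swirl", "dplyr", "swirl", "tidyr"),
  country   = c("US", "PK", "IN", "US"),
  size      = c(50000, 200000, 150000, 120000),
  r_version = c("3.1.1", NA, "3.0.2", "3.1.1")
)

filter(toy, package == "swirl")                  # one condition
filter(toy, size > 100500, country == "US")      # a comma means AND
filter(toy, country == "US" | country == "PK")   # | means OR
filter(toy, !is.na(r_version))                   # drop rows with a missing version
```

Note that the code must use straight quotes ("US"); curly quotes copied from a web page will cause a syntax error.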
We’ve seen how to select a subset of columns and rows from our dataset using select()
and filter()
, respectively. Inherent in select()
was also the ability to arrange our selected columns in any order we please.
Sometimes we want to order the rows of a dataset according to the values of a particular variable. This is the job of arrange()
.
To see how arrange() works, let’s first take a subset of cran
. select()
all columns from size
through ip_id
and store the result in cran2
.
cran2 <- select(cran, size:ip_id)
Now, to order the ROWS of cran2
so that ip_id
is in ascending order (from small to large), type:
arrange(cran2, ip_id)
To do the same, but in descending order, change the second argument as:
arrange(cran2, desc(ip_id)) #Where desc() stands for descending
We can also arrange the data according to the values of multiple variables. For example, arrange(cran2, package, ip_id)
will first arrange by package
names (ascending alphabetically), then by ip_id
. This means that if there are multiple rows with the same value for package
, they will be sorted by ip_id
(ascending numerically).
Q: Arrange cran2
by the following three variables, in this order: country
(ascending), r_version
(descending), and ip_id
(ascending).
arrange(cran2, country, desc(r_version), ip_id)
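The tie-breaking behavior of arrange() is easiest to see on a handful of rows. The tibble `toy` is hypothetical.

```r
library(dplyr)

toy <- tibble(
  package = c("b", "a", "b", "a"),
  ip_id   = c(3, 1, 2, 4)
)

arrange(toy, ip_id)            # smallest ip_id first
arrange(toy, desc(ip_id))      # largest ip_id first
arrange(toy, package, ip_id)   # by package, ties broken by ip_id
```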
To illustrate the next major function in dplyr
, let’s take another subset of our original data. Use select()
to grab 3 columns from cran
— ip_id
, package
, and size
(in that order) — and store the result in a new variable called cran3
.
cran3 <- select(cran, ip_id, package, size)
It’s common to create a new variable based on the value of one or more variables already in a dataset. The mutate()
function does exactly this.
The size variable represents the download size in bytes, which are units of computer memory. These days, megabytes (MB) are a more common unit of measurement. One megabyte is equal to 2²⁰ bytes. That’s 2 to the power of 20, which is approximately one million bytes! We want to add a column called size_mb
that contains the download size in megabytes. Here is how to do it:
mutate(cran3, size_mb = size / 2^20)
An even larger unit of memory is a gigabyte (GB), which equals 2¹⁰ megabytes. We might as well add another column for download size in gigabytes.
mutate(cran3, size_mb = size / 2^20, size_gb = size_mb / 2^10)
Let’s try one more for practice. Pretend we discovered a glitch in the system that provided the original values for the size variable. All of the values in cran3
are 1000 bytes less than they should be. Using cran3
, create just one new column called correct_size
that contains the correct size.
mutate(cran3, correct_size = size + 1000)
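To check the unit arithmetic, it helps to run mutate() on sizes whose megabyte values are obvious. The tibble below is a made-up example in which the first download is exactly one megabyte.

```r
library(dplyr)

# Hypothetical sizes in bytes: exactly 1 MB and 0.5 MB
toy <- tibble(package = c("a", "b"), size = c(2^20, 2^19))

res <- mutate(toy,
              size_mb = size / 2^20,      # bytes -> megabytes
              size_gb = size_mb / 2^10)   # a later column can use an earlier new one
res
```

mutate() returns a new table and leaves `toy` unchanged, so assign the result if you want to keep it.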
The last of 5 core dplyr verbs, summarize()
collapses the dataset to a single row. Let’s say we’re interested in knowing the average download size.
summarize(cran, avg_bytes = mean(size))
This will yield the mean value of the size
variable. Here I’ve chosen to label the result ‘avg_bytes
’, but I could have named it anything. We’ll look at grouped data in the next topic, but the idea is that summarize()
can give you the requested value FOR EACH group in your dataset.
In this topic you’ve learned how to manipulate data using dplyr’s five main functions. In the next lesson, we’ll look at how to take advantage of some other useful features of dplyr to make your life as a data analyst much easier.
12. Getting and Cleaning Data
For this topic I highly recommend using RStudio, as we are going to use functions like View() which work best there. Also make sure dplyr is loaded with library("dplyr").
In the last lesson, you learned about the five main data manipulation ‘verbs’ in dplyr: select()
, filter()
, arrange()
, mutate()
, and summarize()
. The last of these, summarize()
, is most powerful when applied to grouped data.
The main idea behind grouping data is that you want to break up your dataset into groups of rows based on the values of one or more variables. The group_by()
function is responsible for doing this. We’ll use the same CRAN download log data, so let’s load it in a tbl_df()
:
cran <- tbl_df(mydf)
In the previous topic we removed mydf, so run the read.csv() call again before the line above, then remove mydf from the workspace once more, since we only needed a copy of it: rm("mydf")
Bring up the help file for group_by()
.
Q: Group cran
by the package
variable and store the result in a new object called by_package
.
> by_package <- group_by(cran, package)
> by_package
At the top of the output above, you’ll see Groups: package
, which tells us that this tbl has been grouped by the package variable. Everything else looks the same, but now any operation we apply to the grouped data will take place on a per package
basis.
Recall that when we applied mean(size)
to the original tbl_df
via summarize()
, it returned a single number — the mean of all values in the size
column. We may care about what that number is, but wouldn’t it be so much more interesting to look at the mean download size for each unique package
?
summarize(by_package, mean(size))
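The difference between summarizing ungrouped and grouped data can be seen on three rows. The tibble `toy` is a hypothetical miniature with made-up values.

```r
library(dplyr)

toy <- tibble(
  package = c("a", "a", "b"),
  size    = c(10, 30, 100)
)

# Ungrouped: one row for the whole table
overall <- summarize(toy, avg_bytes = mean(size))
overall

# Grouped: one row per package
by_pkg  <- group_by(toy, package)
per_pkg <- summarize(by_pkg, avg_bytes = mean(size))
per_pkg
```

Naming the result (avg_bytes = ...) is optional, but it gives the output column a readable name instead of mean(size).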
Let’s look at an example which is a step further of summarize()
. Here is the scenario:
You are required to compute four values, in the following order, from the grouped data:
1. count = n()
2. unique = n_distinct(ip_id)
3. countries = n_distinct(country)
4. avg_bytes = mean(size)
Store the result in pack_sum (for ‘package_summary’).
Hints: Take a look at ?n and ?n_distinct, so that you really understand what is going on.
Here is the skeleton of a partially constructed call to summarize():
pack_sum <- summarize(by_package,
count = ,
unique = ,
countries = ,
avg_bytes = )
Finish the above in R or in RStudio as you prefer then look at the solution below.
pack_sum <- summarize(by_package,
count = n(),
unique = n_distinct(ip_id),
countries = n_distinct(country),
avg_bytes = mean(size))
Print the pack_sum
The ‘count
’ column, created with n()
, contains the total number of rows (i.e. downloads) for each package. The ‘unique
’ column, created with n_distinct(ip_id)
, gives the total number of unique downloads for each package, as measured by the number of distinct ip_id’s. The ‘countries
’ column, created with n_distinct(country)
, provides the number of countries in which each package was downloaded. And finally, the ‘avg_bytes
’ column, created with mean(size)
, contains the mean download size (in bytes) for each package.
Naturally, we’d like to know which packages were most popular on the day these data were collected (July 8, 2014). Let’s start by isolating the top 1% of packages, based on the total number of downloads as measured by the ‘count
’ column.
We need to know the value of count
that splits the data into the top 1% and bottom 99% of packages based on total downloads. In statistics, this is called the 0.99, or 99%, sample quantile.
> quantile(pack_sum$count, probs = 0.99)
99%
679.56
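If quantiles are new to you, the idea is easier to follow on numbers you can check by eye. The vector `counts` below is invented for the purpose and has nothing to do with the real download counts.

```r
# Hypothetical download counts for 101 packages: 0, 1, ..., 100
counts <- 0:100

cutoff <- quantile(counts, probs = 0.99)
cutoff                     # 99% of the values lie at or below this number

# Only the values above the cutoff make it into the "top 1%"
counts[counts > cutoff]
```

quantile() interpolates between data points when needed, which is why the cutoff for the real data (679.56) is not a whole number.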
Now we can isolate only those packages which had more than 679 total downloads.
Q: Using filter()
select all rows from pack_sum
for which count
is strictly greater (>
) than 679.
top_counts <- filter(pack_sum, count > 679)
There are only 61 packages in our top 1%, so we’d like to see all of them. Since dplyr only shows us the first 10 rows, we can use the View()
function to see more.
View(top_counts) #Note that the 'V' in View() is capitalized.
arrange()
the rows of top_counts
based on the ‘count
’ column and assign the result to a new object called top_counts_sorted
. We want the packages with the highest number of downloads at the top, which means we want ‘count
’ to be in descending order. If you need help, check out ?arrange
and/or ?desc
.
> top_counts_sorted <- arrange(top_counts, desc(count))
> View(top_counts_sorted)
If we use total number of downloads as our metric for popularity, then the above output shows us the most popular packages downloaded from the RStudio CRAN mirror on July 8, 2014. Not surprisingly, ggplot2 leads the pack with 4602 downloads, followed by Rcpp, plyr, rJava and so on.
Perhaps we’re more interested in the number of *unique* downloads on this particular day. In other words, if a package is downloaded ten times in one day from the same computer, we may wish to count that as only one download. That’s what the ‘unique
’ column will tell us. So just like how we did for count
, let’s find the 0.99 or 99% quantile for the unique
variable.
> quantile(pack_sum$unique, probs = 0.99)
99%
465
Apply filter()
to pack_sum
to select all rows corresponding to values of unique
that are strictly greater than 465.
top_unique <- filter(pack_sum, unique > 465)
Now arrange()
top_unique
by the unique
column, in descending order, to see which packages were downloaded from the greatest number of unique IP addresses.
> top_unique_sorted <- arrange(top_unique,desc(unique))
> View(top_unique_sorted)
Now Rcpp is in the lead, followed by stringr, digest, plyr, and ggplot2.
Our final metric of popularity is the number of distinct countries from which each package was downloaded. We’ll approach this one a little differently to introduce you to a method called chaining
or piping
.
Chaining allows you to string together multiple function calls in a way that is compact and readable, while still accomplishing the desired result. To make it more concrete, let’s compute our last popularity metric from scratch, starting with our original data.
top_countries <- filter(pack_sum, countries > 60)
result1 <- arrange(top_countries, desc(countries), avg_bytes)
It’s worth noting that we sorted primarily by countries
, but used avg_bytes
(in ascending order) as a tie breaker. This means that if two packages were downloaded from the same number of countries, the package with a smaller average download size received a higher ranking.
We’d like to accomplish the same result as the last script, but avoid saving our intermediate results. One option is to embed function calls within one another, but that quickly becomes hard to read. Chaining gives the same result in a more readable form. Have a look at this code block:
result3 <-
cran %>%
group_by(package) %>%
summarize(count = n(),
unique = n_distinct(ip_id),
countries = n_distinct(country),
avg_bytes = mean(size)
) %>%
filter(countries > 60) %>%
arrange(desc(countries), avg_bytes)
I’ve used a special chaining operator, %>%
, which was originally introduced in the magrittr
R package and has now become a key component of dplyr
. You can pull up the related documentation with ?chain
. The benefit of %>%
is that it allows us to chain the function calls in a linear fashion. The code to the right of %>%
operates on the result from the code to the left of %>%
. Note that when you’re using the chaining operator, you don’t need to specify the name of the data tbl in your call.
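To convince yourself that chaining is only a different way of writing the same calls, compare a nested version with a chained one. The tibble `toy` is hypothetical.

```r
library(dplyr)

toy <- tibble(
  package = c("a", "a", "b", "c"),
  size    = c(10, 30, 100, 5)
)

# Nested: has to be read from the inside out
nested <- arrange(filter(toy, size > 8), desc(size))

# Chained: read top to bottom; each step receives the previous result
chained <- toy %>%
  filter(size > 8) %>%
  arrange(desc(size))

all.equal(nested, chained)   # both approaches produce the same table
```

The chained form pays off as the number of steps grows, because each new step is one more line rather than one more level of nesting.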
View(result3)
It looks like Rcpp is on top with downloads from 84 different countries, followed by digest, stringr, plyr, and ggplot2.
Let’s work through a few more examples of chaining. Let’s build a chain of dplyr commands. Here is the scenario:
- select() the following columns from cran: ip_id, country, package and size. Keep in mind that when you’re using the chaining operator, you don’t need to specify the name of the data tbl in your call to select().
- Use mutate() to add a column called size_mb that contains the size of each download in megabytes (i.e. size / 2^20).
- Use filter() to select all rows for which size_mb is less than or equal to (<=) 0.5.
- arrange() the result by size_mb, in descending order.
Before seeing the solution below, you can look back at the chain example I shared above, in which I stored the result in result3. Just to make things easier, here is a hint: cran %>% select() %>%
Here is the solution:
cran %>%
select(ip_id, country, package, size) %>%
mutate(size_mb = size / 2^20) %>%
filter(size_mb <= 0.5) %>%
arrange(desc(size_mb))
So far you learned about grouping and chaining using dplyr. You combined some of the things you learned in the previous topics with these more advanced ideas to produce concise, readable, and highly effective code.
13. Tidying Data with tidyr
The Tidying Data with tidyr topic is taken from: http://rstudio-pubs-static.s3.amazonaws.com/404441_1c525a8cece7475699cfb24dac14ff26.html
This topic requires readr
and tidyr
packages. So let’s install them first.
> install.packages("readr")
> install.packages("tidyr")
> library(readr)
> library(dplyr) #you need to load dplyr before tidyr
> library(tidyr)
In this topic you’ll learn how to tidy your data with the tidyr
package. Parts of this lesson will require the use of dplyr
.
The author of tidyr, Hadley Wickham, discusses his philosophy of tidy data in his ‘Tidy Data’ paper: http://vita.had.co.nz/papers/tidy-data.pdf This paper should be required reading for anyone who works with data, but it’s not required in order to complete this topic.
Tidy data is formatted in a standard way that facilitates exploration and analysis and works seamlessly with other tidy data tools. Specifically, tidy data satisfies three conditions:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
Any dataset that doesn’t satisfy these conditions is considered ‘messy’ data. Therefore, all of the following are characteristics of messy data:
- Variables are stored in both rows and columns
- Column headers are values, not variable names
- Multiple variables are stored in one column
- A single observational unit is stored in multiple tables
- Multiple types of observational units are stored in the same table
Let’s work through a simple example of each of these messy data cases above, then tidy some real data.
The first problem is when you have column headers that are values, not variable names. Download and load this simple dataset called students that demonstrates this scenario: https://www.dropbox.com/s/b46cvs9qvxk67ob/students.xlsx?dl=0
> students <- data.frame(read_excel("students.xlsx"))
> View(students)
The first column represents each of five possible grades that students could receive for a particular class. The second and third columns give the number of male and female students, respectively, that received each grade.
This dataset actually has three variables: grade
, sex
, and count
. The first variable grade
, is already a column, so that should remain as it is. The second variable sex
, is captured by the second and third column headings. The third variable count
, is the number of students for each combination of grade
and sex
.
To tidy the students data, we need to have one column for each of these three variables. We will use the gather()
function from tidyr
to accomplish this. Pull up documentation for this function with ?gather
Q: Using this help file as a guide, call gather()
with the following arguments (in order): students
, sex
, count
, -grade
. Note the minus sign before grade, which says we want to gather all columns EXCEPT grade.
gather(students, sex, count, -grade)
Each row of the data now represents exactly one observation, characterized by a unique combination of the grade and sex variables. Each of our variables (grade, sex, and count) occupies exactly one column. That’s tidy data!
It is important to understand what each argument to gather()
means. The data
argument, students
, gives the name of the original dataset. The key
and value
arguments — sex
and count
, respectively — give the column names for our tidy dataset. The final argument, -grade
, says that we want to gather all columns EXCEPT the grade column (since grade is already a proper column variable.)
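If you want to experiment without downloading the Excel file, the sketch below applies the same gather() call to a made-up two-row table. It also shows pivot_longer(), which tidyr 1.0.0 and later recommend for the same job; the values are invented.

```r
library(tidyr)

# Hypothetical miniature of the students data
toy <- data.frame(
  grade  = c("A", "B"),
  male   = c(1, 5),
  female = c(5, 0)
)

tidy <- gather(toy, sex, count, -grade)
tidy   # one row per grade/sex combination

# Newer tidyr versions (1.0.0+) recommend pivot_longer() for the same job
longer <- pivot_longer(toy, cols = -grade, names_to = "sex", values_to = "count")
longer
```

Both calls produce the same three columns; they differ only in the order of the rows.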
The second messy data case we’ll look at is when multiple variables are stored in one column. For this, download and load this dataset called students2 that demonstrates this scenario: https://www.dropbox.com/s/srlo52yh836easq/students2.xlsx?dl=0
> students2 <- data.frame(read_excel("students2.xlsx"))
> View(students2)
Let’s get started by using gather()
to stack the columns of students2, like we just did with students. This time, name the key
column sex_class
and the value
column count
. Save the result to a new variable called res
. Do this on your RStudio console and then see the answer below:
> res <- gather(students2, sex_class, count, -grade)
> res
That got us halfway to tidy data, but we still have two different variables, sex and class, stored together in the sex_class column. tidyr offers a convenient separate() function for the purpose of separating one column into multiple columns. Pull up the help file for separate() now: ?separate
Call separate() on res to split the sex_class column into sex and class. You only need to specify the first three arguments: data = res, col = sex_class, into = c("sex", "class"). You don’t have to provide the argument names as long as they are in the correct order. Do it yourself and then see the answer below:
separate(res, sex_class, c("sex", "class"))
Conveniently, separate() was able to figure out on its own how to separate the sex_class column. Unless you request otherwise with the sep argument, it splits on non-alphanumeric values. In other words, it assumes that the values are separated by something other than a letter or number (in this case, an underscore).
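To see what the sep argument does, here is a small sketch that compares the default behaviour with an explicit underscore separator. The res table below is a made-up stand-in, and by_default and by_underscore are names I chose for the comparison.

```r
library(tidyr)

# Hypothetical gathered result (values are made up for illustration).
res <- data.frame(
  grade     = c("A", "B", "A", "B"),
  sex_class = c("male_1", "male_1", "female_2", "female_2"),
  count     = c(3, 6, 4, 1),
  stringsAsFactors = FALSE
)

# Default: split wherever a non-alphanumeric character appears.
by_default <- separate(res, sex_class, c("sex", "class"))

# Explicit: split only on the underscore.
by_underscore <- separate(res, sex_class, c("sex", "class"), sep = "_")

by_underscore
```

For this data both calls should split the column the same way. Setting sep yourself mostly matters when the values contain more than one kind of punctuation.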
Tidying students2 required both gather() and separate(), causing us to save an intermediate result (res). However, just like with dplyr, you can use the %>% operator to chain multiple function calls together. Let’s give this a try. Do it yourself, then see the solution below:
students2 %>%
  gather(sex_class, count, -grade) %>%
  separate(sex_class, c("sex", "class"))
The third symptom of messy data is when variables are stored in both rows and columns. Download the students3.xlsx dataset to demonstrate this: https://www.dropbox.com/s/b5075ypnd1vecsb/students3.xlsx?dl=0
> students3 <- data.frame(read_excel("students3.xlsx"))
> View(students3)
In students3, we have midterm and final exam grades for five students, each of whom was enrolled in exactly two of five possible classes.
The first variable, name, is already a column and should remain as it is. The headers of the last five columns, class1 through class5, are all different values of what should be a class variable. The values in the test column, midterm and final, should each be its own variable containing the respective grades for each student.
This will require multiple steps, which we will build up gradually using %>%. Read the instructions below and try to do it yourself first.
Step 1: Call gather() to gather the columns class1 through class5 into a new variable called class. The key should be class, and the value should be grade. tidyr makes it easy to reference multiple columns with class1:class5, just like with sequences of numbers. Since each student is only enrolled in two of the five possible classes, there are lots of missing values (i.e. NAs). Use the argument na.rm = TRUE to omit these values from the final result.
Step 2: This step requires the use of spread(), which will allow us to turn the values of the test column, midterm and final, into column headers (i.e. variables). You only need to specify two arguments to spread(). Can you figure out what they are by looking at the documentation (?spread)? (Hint: You don’t have to specify the data argument since we’re using the %>% operator.)
Step 3: readr is required for certain data manipulations, such as parse_number(), which will be used in this step. Make sure you have the package loaded, i.e. library(readr). We want the values in the class column to simply be 1, 2, …, 5 and not class1, class2, …, class5. We can use the parse_number() function from readr to accomplish this. Use the mutate() function from dplyr along with parse_number(). (Hint: you can overwrite a column with mutate() by assigning a new value to the existing column instead of creating a new column.)
Remember that when you’re using the %>% operator, the value to the left of it gets inserted as the first argument to the function on the right. Consult ?gather and/or ?chain if you get stuck.
Here is the skeleton of just step 1 to get you started:
students3 %>%
  gather( , , : , = TRUE) %>%
Here is the full solution:
students3 %>%
  gather(class, grade, class1:class5, na.rm = TRUE) %>%
  spread(test, grade) %>%
  mutate(class = parse_number(class))
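If you want to run the three steps without downloading the file, here is a sketch on a cut-down table in the same shape. It is an assumption-laden stand-in: the names and grades are made up, there are only two students, and only columns class1 through class3, so the range in gather() is class1:class3 instead of class1:class5. tidy3 is just a name I picked for the result.

```r
library(tidyr)
library(dplyr)
library(readr)

# Hypothetical excerpt shaped like students3 (two students, three classes).
students3 <- data.frame(
  name   = c("Sally", "Sally", "Jeff", "Jeff"),
  test   = c("midterm", "final", "midterm", "final"),
  class1 = c("A", "C", NA, NA),
  class2 = c(NA, NA, "D", "E"),
  class3 = c("B", "C", "A", "C"),
  stringsAsFactors = FALSE
)

tidy3 <- students3 %>%
  # Step 1: stack the class columns and drop the NA rows.
  gather(class, grade, class1:class3, na.rm = TRUE) %>%
  # Step 2: turn the values of test (midterm, final) into columns.
  spread(test, grade) %>%
  # Step 3: keep only the number from "class1", "class2", ...
  mutate(class = parse_number(class))

tidy3
```

Each row of tidy3 is now one student in one class, with the midterm and final grades side by side.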
The fourth messy data problem we’ll look at occurs when multiple observational units are stored in the same table. Download students4.xlsx and load it: https://www.dropbox.com/s/vif8aldhvae3sim/students4.xlsx?dl=0
> students4 <- data.frame(read_excel("students4.xlsx"))
> View(students4)
students4 is almost the same as our tidy version of students3. The only difference is that students4 provides a unique id for each student, as well as his or her sex (M = male; F = female). At first glance, there doesn’t seem to be much of a problem with students4. All columns are variables and all rows are observations. However, notice that each id, name, and sex is repeated twice, which seems quite redundant. This is a hint that our data contains multiple observational units in a single table.
Our solution will be to break students4 into two separate tables: one containing basic student information (id, name, and sex) and the other containing grades (id, class, midterm, final).
student_info <- students4 %>%
  select(id, name, sex)
Noticed anything strange about student_info? It contains five duplicate rows! This is something that unique() can fix. A nice feature of %>% is that we can omit the parentheses after the function name. So update the above code to:
student_info <- students4 %>%
  select(id, name, sex) %>%
  unique
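Here is a quick sketch to convince yourself that the bare function name and the version with parentheses do the same thing. The students4 table below is a made-up two-student excerpt, and info_bare, info_parens and info_distinct are names I chose. I also show dplyr's distinct() as an alternative way to drop duplicate rows.

```r
library(dplyr)

# Hypothetical excerpt shaped like students4: every student appears twice.
students4 <- data.frame(
  id      = c(1, 1, 2, 2),
  name    = c("Ana", "Ana", "Ben", "Ben"),
  sex     = c("F", "F", "M", "M"),
  class   = c(1, 5, 2, 3),
  midterm = c("A", "B", "C", "A"),
  final   = c("B", "A", "C", "B")
)

# Bare function name: %>% calls unique() on the piped-in data frame.
info_bare <- students4 %>% select(id, name, sex) %>% unique

# Same call with explicit parentheses.
info_parens <- students4 %>% select(id, name, sex) %>% unique()

# distinct() from dplyr also removes duplicate rows.
info_distinct <- students4 %>% select(id, name, sex) %>% distinct()

nrow(info_bare)
```

Note that dropping the parentheses is a feature of %>%. R's native pipe (|>) needs the parentheses.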
Now, create a second table called gradebook using the id, class, midterm, and final columns.
gradebook <- students4 %>%
  select(id, class, midterm, final)
It’s important to note that we left the id column in both tables. In the world of relational databases, id is called our primary key since it allows us to connect each student listed in student_info with their grades listed in gradebook. Without a unique identifier, we might not know how the tables are related. (In this case, we could have also used the name variable, since each student happens to have a unique name.)
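To make the role of id concrete, here is a sketch that puts the two tables back together with dplyr's inner_join(). This step is not part of the original exercise. The two tables below are made-up stand-ins for student_info and gradebook, and combined is a name I picked.

```r
library(dplyr)

# Hypothetical stand-ins for the two tables (values are made up).
student_info <- data.frame(
  id   = c(1, 2),
  name = c("Ana", "Ben"),
  sex  = c("F", "M")
)
gradebook <- data.frame(
  id      = c(1, 1, 2, 2),
  class   = c(1, 5, 2, 3),
  midterm = c("A", "B", "C", "A"),
  final   = c("B", "A", "C", "B")
)

# Match every gradebook row to its student through the shared id column.
combined <- inner_join(student_info, gradebook, by = "id")

combined
```

Because id appears in both tables, nothing is lost by splitting them: the join brings each student's name and sex back next to their grades.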
The fifth and final messy data scenario that we’ll address is when a single observational unit is stored in multiple tables. It’s the opposite of the fourth problem.
Download passed.xlsx (https://www.dropbox.com/s/43f57cy3whnvx1b/passed.xlsx?dl=0) and failed.xlsx (https://www.dropbox.com/s/69rrh5gvz0j9q4r/failed.xlsx?dl=0) and move them into your working directory.
> passed <- data.frame(read_excel("passed.xlsx"))
> failed <- data.frame(read_excel("failed.xlsx"))
Teachers decided to only take into consideration final exam grades in determining whether students passed or failed each class. As you may have inferred from the data, students passed a class if they received a final exam grade of A or B and failed otherwise.
The name of each dataset actually represents the value of a new variable that we will call status. Before joining the two tables together, we’ll add a new column to each containing this information so that it’s not lost when we put everything together. Use dplyr’s mutate() to add a new column to the passed table. The column should be called status and the value, “passed” (a character string), should be the same for all students. Overwrite the current version of passed with the new one.
passed <- passed %>% mutate(status = "passed")
Now, do the same for the failed table, except the status column should have the value “failed” for all students.
failed <- failed %>% mutate(status = "failed")
Now, pass the passed and failed tables (in that order) as arguments to the dplyr function bind_rows(), which will join them together into a single unit. Check ?bind_rows if you need help. Note: bind_rows() is only available in dplyr 0.4.0 or later.
bind_rows(passed, failed)
Of course, we could arrange the rows however we wish at this point, but the important thing is that each row is an observation, each column is a variable, and the table contains a single observational unit. Thus, the data are tidy.
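Here is the whole passed/failed sequence as one runnable sketch. The two small tables are made up so that you can try it without the spreadsheets, and all_results is a name I chose for the combined table.

```r
library(dplyr)

# Hypothetical stand-ins for passed.xlsx and failed.xlsx (values are made up).
passed <- data.frame(name = c("Ana", "Ben"), class = c(1, 2), final = c("A", "B"))
failed <- data.frame(name = c("Ana", "Cal"), class = c(3, 1), final = c("C", "D"))

# Record which table each row came from before stacking them.
passed <- passed %>% mutate(status = "passed")
failed <- failed %>% mutate(status = "failed")

# Stack the rows of both tables into one.
all_results <- bind_rows(passed, failed)

all_results
```

Without the status column, there would be no way to tell after bind_rows() which table a row came from.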
We have covered a lot in this topic. Let’s bring everything together and tidy a real dataset.
The SAT is a popular college-readiness exam in the United States that consists of three sections: critical reading, mathematics, and writing. Students can earn up to 800 points on each section. This dataset presents the total number of students, for each combination of exam section and sex, within each of six score ranges. It comes from the ‘Total Group Report 2013’, which can be found here: http://research.collegeboard.org/programs/sat/data/cb-seniors-2013
Download the SAT dataset here: https://www.dropbox.com/s/q4ky3cf4en9r47s/sat.xlsx?dl=0
> sat <- data.frame(read_excel("sat.xlsx"))
> View(sat)
Here is what you need to do with this dataset. Accomplish the following five goals:
1. select() all columns that do NOT contain the word total, since if we have the male and female data, we can always recreate the total count in a separate column, if we want it. (Hint: Use the contains() function, which you’ll find detailed in the ‘Special Functions’ section of ?select.)
2. gather() all columns EXCEPT score_range, using key = part_sex and value = count.
3. separate() part_sex into two separate variables (columns), called part and sex, respectively. You may need to check the ‘Examples’ section of ?separate to remember how the into argument should be phrased.
4. Use group_by() (from dplyr) to group the data by part and sex, in that order.
5. Use mutate() to add two new columns, whose values will be automatically computed group-by-group:
* total = sum(count)
* prop = count/total
As always, here is a code skeleton to get started:
sat %>%
  select(-contains(###)) %>%
  gather(###, ###, -###) %>%
  ### <Your call to separate()> %>%
  ### <Your call to group_by()> %>%
  ### <Your call to mutate()>
Here is the solution:
sat %>%
  select(-contains("total")) %>%
  gather(part_sex, count, -score_range) %>%
  separate(part_sex, c("part", "sex")) %>%
  group_by(part, sex) %>%
  mutate(total = sum(count),
         prop = count / total)
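To check the pipeline without the real file, here is a sketch on a tiny table in the same shape. The column names (read_male, read_fem and so on) and all the counts are my assumptions, not the real SAT data, and tidy_sat is a name I picked. The last line verifies that the proportions add up to 1 within each group.

```r
library(tidyr)
library(dplyr)

# Hypothetical two-row excerpt shaped like sat (names and counts are made up).
sat <- data.frame(
  score_range = c("700-800", "600-690"),
  read_male   = c(40, 120),
  read_fem    = c(38, 130),
  read_total  = c(78, 250),
  math_male   = c(70, 160),
  math_fem    = c(45, 140),
  math_total  = c(115, 300)
)

tidy_sat <- sat %>%
  select(-contains("total")) %>%            # drop the *_total columns
  gather(part_sex, count, -score_range) %>% # stack the remaining columns
  separate(part_sex, c("part", "sex")) %>%  # "read_male" -> "read", "male"
  group_by(part, sex) %>%
  mutate(total = sum(count),
         prop = count / total)

# Within each (part, sex) group the proportions should sum to 1.
tidy_sat %>% summarise(check = sum(prop))
```

Because of group_by(), sum(count) is computed separately for each combination of part and sex, so total is the group total and not the total of the whole table.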
So you learned how to tidy data with tidyr and dplyr. These tools will help you spend less time and energy getting your data ready to analyze and more time actually analyzing it. Still, this is the most important step in data science, as data scientists spend most of their time cleaning data. Hey, don’t take my word for it, ask Google!
That’s it! There is a lot in R that you are yet to discover. The topics that you have learnt give you a great overview of what to do when someone gives you a dataset and asks a question. As a last step, I recommend that you find datasets on the internet that require cleaning and start cleaning them in R. Here is a good link that a quick Google search showed me: https://makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice/
Do share this story and comment on what you liked and what needs improvement. Ciao!