This tutorial was adapted from Alberto Guidici's post (http://www.milanor.net/blog/reshape-data-r-tidyr-vs-reshape2/) by Aurora Tsai.
tidyr and reshape2 are packages that have many similar functions:
tidyr: gather(), spread(), and separate()
reshape2: melt(), dcast(), and colsplit()
Let's see how they compare.
Remember:
Pretend you have some "messy" data from 20 learners of Japanese with English, Korean, and Chinese as their first languages (L1s). They all took two language proficiency tests at the beginning and end of the semester. Your messy data frame has the proficiency test and times listed as four separate variables.
1) Copy the first part of this code into your R script to work with the "messy" dataframe.
messy < - data.frame(part = 1:20,
group = sample(rep(c('control', 'treatment'), each = 10)),
L1 = sample(c("Eng", "Kor", "Chi"), 20, replace = T, prob = c(0.4, 0.2, 0.4)),
sex = sample( c("F", "M"), 10, replace=TRUE, prob=c(0.5, 0.5)),
proftest1.T1 = as.integer(rnorm(20, 50, 20)),
proftest2.T1 = as.integer(rnorm(20, 50, 20)),
proftest1.T2 = as.integer(rnorm(20, 50, 20)),
proftest2.T2 = as.integer(rnorm(20, 50, 20)))
head(messy)
part group L1 sex proftest1.T1 proftest2.T1 proftest1.T2 proftest2.T2
1 1 treatment Eng F 132 36 45 33
2 2 treatment Kor F 70 23 42 18
3 3 control Kor M 62 41 35 56
4 4 treatment Chi M 22 80 66 77
5 5 control Kor M 43 46 32 35
6 6 treatment Kor M 90 106 21 42
You can use gather() to transform dataframe from wide to long format. This is similar to the melt() function in reshape2.
data | dataframe |
key.name | the new variable name you want all the language tests to be grouped under. Lets call them "test" |
value.name | often this is a dependent variable. In this case the scores students got on the language tests |
var1:var2 | indicates the slice of columns (variables) you want to reformat together into "long format" |
2) Install and require tidyr. Then try the following:
gather(messy, test, score, proftest1.T1:proftest2.T2)
What happened to your data? How was it rearranged?
Option 2: Instead of specifying a slice of variable columns (vectors) to include, you can indicate which ones to exclude using the "-" symbol infront of the variable name. For example, if we want to exclude the participant, group, L1, and sex column:
unmessy <- gather(messy, test, score, -part, -group, -L1, -sex)
head(unmessy)
Extra Stuff
Do you remember how to use the pipe operator, %>% with dplyr? You can also use it with tidyr. (allows you to exclude the dataframe name from the function(s) your using). How would the code look?
data | dataframe |
key.name | the name of the "key" column you want to spread into different columns (e.g., "test" -> 4 different language test columns) |
value.name | the values associated with each new column (e.g., the scores associated with each type of language test) |
Try it Out
You realized that you want to perform a mixed repeated ANOVA on the language test data, looking at the interaction between test type (1 & 2) and time (T1 & T2). In order to do this, you need to spread them back into four columns. How can you apply spread() to tidy your data?
Challenge Question
How would you do the same thing using melt() and dcast() in reshape2?
What is the difference between using tidyR and reshape2 if you do not specify key names and value names? e.g,. gather(messy) vs. melt(messy)
Instructors gave out two types of vocabulary tests to students. One was a vocabulary size test and the other was a vocabulary quiz you had in class. This is their data:
part <- rep(1:30, times = 1)
level <- sample(c("1", "2", "3"), 30, replace = T, prob = c(0.4, 0.2, 0.4))
vocab.size <- as.integer(rnorm(30, 500, sd = 200))
vocab.quiz<- rnorm(30, 50, sd=20)
read.test <- as.integer(rnorm(30, 50, sd = 20))
vdata <- data.frame(part, level, vocab.size, vocab.quiz, read.test)
They are listed as two different variables in your data frame, but you want to group them under the same column as "vocabtest". Use gather() to do this. Then try separating the data back into its original format.
Sometimes you might have two variables that you want to unite into one column. For example, take a look at the mtcars data:
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb gpm
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.04761905
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.04761905
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.04385965
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0.04672897
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.05347594
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0.05524862
What if you wanted to put "vs" (V/S) and "am" (transmission) into the same column? You can do so with the unite() function.
data | dataframe |
var | the new column/variable name you will put the data under |
... | the columns you want to unite. Again, you can write "vs:am" to indicate a slice of columns, use "-" to exclude specific columns, or simply list which columns you want to unite, separating them by commas |
sep = "_" | If you want to separate the two values with a symbol, put that symbol between the " ". The default is to separate it with "_" |
remove = T | When you set this as true, it removes the original columns from the dataframe |
Try it Out
mtunit <- unite(mtcars, vs_am, vs, am, sep = "-")
head(mtunit)
Oops.
You realize you made a mistake! You want the vs and am variables to be separate again. You can reverse what you did quite easily using the separate() function.
separate(mtunit, vs_am, into = c("vs", "am"), sep = "-", remove = T)
You want to give each student a unique ID by taking assigning them a number (1 = elementary, 2 = intermediate, 3 = advanced) based on their class enrollement and a number based on the order in which they participated (e.g., "1-8" for an elementary class student who is your eighth participant).
1) Unite these columns in the "vdata" dataframe.
2) Now separate the information from this column so that "level" and "part" are separate variables in your dataframe
Use can also use the pipe operator %>% to do multiple functions on one dataframe. For example, unite two columns and then separate them back.
mtcars %>%
unite(vs_am, vs, am) %>%
separate(vs_am, c("vs", "am"))
vdata %>%
unite(level_part, level, part) %>%
separate(level_part, c("level", "part"))
QuestionWe have already learned how to drop NA values by making our own function. Do you remember how to do this?
You can also use a built-in function from tidyr:
data | dataframe |
. . . | variables to consider dropping. If not filled in, then considers ALL variables |
It may be a redundant function, but it is easy to use if you want to specify certain columns where you want to drop NAs, but leave NAs in other columns.
drop_na(df) ##drops all NAs in the df
drop_na(df, x) ##drops NAs in column x
Or
df %>% drop_na() ##drops all NAs
df %>% drop_na(x) ##drops NAs in column x
Let's try it. First, add NAs to the "vdata" dataframe with the first part of this code:
add.na <- function(cc){
cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]
}
vdata$read.test <- add.na(read.test)
vdata$vocab.quiz <- add.na(vocab.quiz)
vdata[1:20]
part level vocab.size vocab.quiz read.test
1 1 3 428 65.545408 46
2 2 3 306 58.385198 56
3 3 1 364 18.081852 72
4 4 2 898 56.043479 33
5 5 2 568 41.964858 24
6 6 3 857 56.961470 52
7 7 3 586 -3.993332 NA
8 8 1 378 52.643251 55
...
Not all of your participants completed all your tasks/tests because they got busy with schoolwork. You want to find the correlation between vocabulary size and reading comprehension, but you can’t do it unless you drop participants who didn’t take the reading test. Try dropping participants in "vdata" that didn't take the reading test.