An easy way to "tidy" your data.

This tutorial was adapted from Alberto Guidici's post (http://www.milanor.net/blog/reshape-data-r-tidyr-vs-reshape2/) by Aurora Tsai.

tidyr and reshape2 are packages that have many similar functions:

tidyr: gather(), spread(), and separate()
reshape2: melt(), dcast(), and colsplit()

Let's see how they compare.


  1. Each column is a variable.
  2. Each row is an observation.

Gather and Spread your Data

Pretend you have some "messy" data from 20 learners of Japanese with English, Korean, and Chinese as their first languages (L1s). They all took two language proficiency tests at the beginning and end of the semester. Your messy data frame has the proficiency test and times listed as four separate variables.

1) Copy the first part of this code into your R script to work with the "messy" dataframe.

messy <- data.frame(part = 1:20,
                    group = sample(rep(c('control', 'treatment'), each = 10)),
                    L1 = sample(c("Eng", "Kor", "Chi"), 20, replace = T, prob =   c(0.4, 0.2, 0.4)),
                    sex = sample( c("F", "M"), 10, replace=TRUE, prob=c(0.5, 0.5)),
                    proftest1.T1 = as.integer(rnorm(20, 50, 20)),
                    proftest2.T1 = as.integer(rnorm(20, 50, 20)),
                    proftest1.T2 = as.integer(rnorm(20, 50, 20)),
                    proftest2.T2 = as.integer(rnorm(20, 50, 20)))

  part     group  L1 sex proftest1.T1 proftest2.T1 proftest1.T2 proftest2.T2
1    1 treatment Eng   F          132           36           45           33
2    2 treatment Kor   F           70           23           42           18
3    3   control Kor   M           62           41           35           56
4    4 treatment Chi   M           22           80           66           77
5    5   control Kor   M           43           46           32           35
6    6 treatment Kor   M           90          106           21           42

You can use gather() to transform dataframe from wide to long format. This is similar to the melt() function in reshape2.

gather(data, key.name, value.name, var1:var2)
key.namethe new variable name you want all the language tests to be grouped under. Lets call them "test"
value.nameoften this is a dependent variable. In this case the scores students got on the language tests
var1:var2indicates the slice of columns (variables) you want to reformat together into "long format"

2) Install and require tidyr. Then try the following:

gather(messy, test, score, proftest1.T1:proftest2.T2)

What happened to your data? How was it rearranged?

Option 2: Instead of specifying a slice of variable columns (vectors) to include, you can indicate which ones to exclude using the "-" symbol infront of the variable name. For example, if we want to exclude the participant, group, L1, and sex column:

unmessy <- gather(messy, test, score, -part, -group, -L1, -sex)

Extra Stuff
Do you remember how to use the pipe operator, %>% with dplyr? You can also use it with tidyr. (allows you to exclude the dataframe name from the function(s) your using). How would the code look?

spread(data, key.name, value.name)
key.namethe name of the "key" column you want to spread into different columns (e.g., "test" -> 4 different language test columns)
value.namethe values associated with each new column (e.g., the scores associated with each type of language test)

Try it Out
You realized that you want to perform a mixed repeated ANOVA on the language test data, looking at the interaction between test type (1 & 2) and time (T1 & T2). In order to do this, you need to spread them back into four columns. How can you apply spread() to tidy your data?

Challenge Question
How would you do the same thing using melt() and dcast() in reshape2?

What is the difference between using tidyR and reshape2 if you do not specify key names and value names? e.g,. gather(messy) vs. melt(messy)

Example Problem

Instructors gave out two types of vocabulary tests to students. One was a vocabulary size test and the other was a vocabulary quiz you had in class. This is their data:

part <- rep(1:30, times = 1)
level <- sample(c("1", "2", "3"), 30, replace = T, prob = c(0.4, 0.2, 0.4))
vocab.size <- as.integer(rnorm(30, 500, sd = 200))
vocab.quiz<- rnorm(30, 50, sd=20)
read.test <- as.integer(rnorm(30, 50, sd = 20))

vdata <- data.frame(part, level, vocab.size, vocab.quiz, read.test)

They are listed as two different variables in your data frame, but you want to group them under the same column as "vocabtest". Use gather() to do this. Then try separating the data back into its original format.

Unite and Separate Values

Sometimes you might have two variables that you want to unite into one column. For example, take a look at the mtcars data:


   mpg cyl disp  hp drat    wt  qsec vs am gear carb        gpm
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.04761905
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.04761905
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.04385965
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.04672897
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.05347594
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.05524862

What if you wanted to put "vs" (V/S) and "am" (transmission) into the same column? You can do so with the unite() function.

unite(data, var, ..., sep = "_", remove = TRUE)
varthe new column/variable name you will put the data under
...the columns you want to unite. Again, you can write "vs:am" to indicate a slice of columns, use "-" to exclude specific columns, or simply list which columns you want to unite, separating them by commas
sep = "_"If you want to separate the two values with a symbol, put that symbol between the " ". The default is to separate it with "_"
remove = TWhen you set this as true, it removes the original columns from the dataframe

Try it Out

mtunit <- unite(mtcars, vs_am, vs, am, sep = "-")

You realize you made a mistake! You want the vs and am variables to be separate again. You can reverse what you did quite easily using the separate() function.

separate(data, var, into = c("var1", "var2"), sep = "-", remove = T)

separate(mtunit, vs_am, into = c("vs", "am"), sep = "-", remove = T)

Example Problem

You want to give each student a unique ID by taking assigning them a number (1 = elementary, 2 = intermediate, 3 = advanced) based on their class enrollement and a number based on the order in which they participated (e.g., "1-8" for an elementary class student who is your eighth participant).
1) Unite these columns in the "vdata" dataframe.

2) Now separate the information from this column so that "level" and "part" are separate variables in your dataframe

Use can also use the pipe operator %>% to do multiple functions on one dataframe. For example, unite two columns and then separate them back.

mtcars %>%
  unite(vs_am, vs, am) %>%
  separate(vs_am, c("vs", "am"))
vdata %>%
  unite(level_part, level, part) %>%
  separate(level_part, c("level", "part"))

This may not be so useful for uniting and separating the same columns, but in what other situations could you use it?

Dropping NA values

We have already learned how to drop NA values by making our own function. Do you remember how to do this?

You can also use a built-in function from tidyr:

drop_na(data, . . .)

. . .variables to consider dropping. If not filled in, then considers ALL variables

It may be a redundant function, but it is easy to use if you want to specify certain columns where you want to drop NAs, but leave NAs in other columns.

drop_na(df)     ##drops all NAs in the df
drop_na(df, x)  ##drops NAs in column x


df %>% drop_na()     ##drops all NAs
df %>% drop_na(x)  ##drops NAs in column x

Let's try it. First, add NAs to the "vdata" dataframe with the first part of this code:

add.na <- function(cc){
  cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]
vdata$read.test <- add.na(read.test)
vdata$vocab.quiz <- add.na(vocab.quiz)


   part level vocab.size vocab.quiz read.test
1     1     3        428  65.545408        46
2     2     3        306  58.385198        56
3     3     1        364  18.081852        72
4     4     2        898  56.043479        33
5     5     2        568  41.964858        24
6     6     3        857  56.961470        52
7     7     3        586  -3.993332        NA
8     8     1        378  52.643251        55


Example Problem

Not all of your participants completed all your tasks/tests because they got busy with schoolwork. You want to find the correlation between vocabulary size and reading comprehension, but you can’t do it unless you drop participants who didn’t take the reading test. Try dropping participants in "vdata" that didn't take the reading test.