Random sampling of a dataset into training and test datasets

Training and test datasets represented as buckets

When you're building a model, you generally train it on one dataset and should evaluate it later on a different one. Google's dev crash course gives more information on why. An easy way to get two datasets is to split a single dataset in two, and R makes that straightforward. Say you have the dataframe below:

> df
# A tibble: 20 x 3
   Sample   cat1  cat2 
   <chr>    <chr> <chr>
sample1  A     X    
sample2  A     X    
sample3  A     X    
sample4  A     X    
sample5  A     X    
sample6  A     Y    
sample7  A     Y    
sample8  A     Y    
sample9  A     Y    
sample10 A     Y    
sample11 B     X    
sample12 B     X    
sample13 B     X    
sample14 B     X    
sample15 B     X    
sample16 B     Y    
sample17 B     Y    
sample18 B     Y    
sample19 B     Y    
sample20 B     Y  
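To follow along, a dataframe with the same contents can be rebuilt as sketched below. The construction itself is an assumption; only the printed contents of df come from the example above.

```r
library(dplyr)

# Rebuild the example data: 20 samples, 5 in each cat1/cat2 combination.
# (Assumed construction; the tutorial only shows the printed tibble.)
df <- tibble(
  Sample = paste0("sample", 1:20),
  cat1   = rep(c("A", "B"), each = 10),
  cat2   = rep(rep(c("X", "Y"), each = 5), times = 2)
)
```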

The following code splits the data by column cat1. First, it loads dplyr, which provides group_by() and slice_sample(). Starting with dataframe df, group_by() groups the rows by cat1 and passes the result to slice_sample(), which randomly picks 80% of the rows in each group. The result is stored in trainingDataset. Note: if you are following this tutorial exactly, don't expect to get the same rows, because slice_sample() chooses them randomly.

library(dplyr)  # provides %>%, group_by() and slice_sample()

# Randomly keep 80% of the rows within each cat1 group
trainingDataset <- df %>% group_by(cat1) %>% slice_sample(prop = 0.8)
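If a repeatable split is needed, one option is to fix R's random seed before sampling. The sketch below rebuilds df (an assumed construction) and uses an arbitrary seed of 42; repeatedSplit is a hypothetical name used only for the comparison. The rows chosen will still differ from the output shown in this tutorial.

```r
library(dplyr)

# Assumed reconstruction of the example dataframe
df <- tibble(
  Sample = paste0("sample", 1:20),
  cat1   = rep(c("A", "B"), each = 10),
  cat2   = rep(rep(c("X", "Y"), each = 5), times = 2)
)

# Fixing the seed makes the random draw repeatable
set.seed(42)
trainingDataset <- df %>% group_by(cat1) %>% slice_sample(prop = 0.8)

# Resetting the same seed and sampling again selects the same rows
set.seed(42)
repeatedSplit <- df %>% group_by(cat1) %>% slice_sample(prop = 0.8)

print(identical(trainingDataset$Sample, repeatedSplit$Sample))
```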

Sometimes the output looks like the one below. This trainingDataset holds 80% of each group in cat1 (8 of 10 rows), but cat2 is not balanced: each cat1 group ends up with 3 X rows and 5 Y rows.

> trainingDataset
# A tibble: 16 x 3
# Groups:   cat1 [2]
   Sample   cat1  cat2 
   <chr>    <chr> <chr>
sample8  A     Y    
sample6  A     Y    
sample5  A     X    
sample9  A     Y    
sample7  A     Y    
sample1  A     X    
sample10 A     Y    
sample2  A     X    
sample19 B     Y    
sample20 B     Y    
sample14 B     X    
sample11 B     X    
sample13 B     X    
sample17 B     Y    
sample16 B     Y    
sample18 B     Y  

To keep both categories balanced, we need to split the data by cat1 and cat2 together. Adding more column names to group_by() does this.

multiCatDataset <- df %>% group_by(cat1, cat2) %>% slice_sample(prop = 0.8)

Now every combination of cat1 and cat2 contributes 80% of its rows (4 of 5):

> multiCatDataset
# A tibble: 16 x 3
# Groups:   cat1, cat2 [4]
   Sample   cat1  cat2 
   <chr>    <chr> <chr>
sample3  A     X    
sample1  A     X    
sample5  A     X    
sample4  A     X    
sample10 A     Y    
sample9  A     Y    
sample6  A     Y    
sample8  A     Y    
sample13 B     X    
sample15 B     X    
sample11 B     X    
sample12 B     X    
sample19 B     Y    
sample17 B     Y    
sample18 B     Y    
sample20 B     Y  
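A quick way to compare the two splits is to count the rows per cat1/cat2 combination. This is only a sketch: df is rebuilt under an assumed construction, the seed is arbitrary, and cat1Counts and multiCatCounts are hypothetical names.

```r
library(dplyr)

# Assumed reconstruction of the example dataframe
df <- tibble(
  Sample = paste0("sample", 1:20),
  cat1   = rep(c("A", "B"), each = 10),
  cat2   = rep(rep(c("X", "Y"), each = 5), times = 2)
)

set.seed(42)  # arbitrary seed, only for repeatability
trainingDataset <- df %>% group_by(cat1) %>% slice_sample(prop = 0.8)
multiCatDataset <- df %>% group_by(cat1, cat2) %>% slice_sample(prop = 0.8)

# Rows per combination; ungroup() first so count() starts from a plain tibble
cat1Counts     <- trainingDataset %>% ungroup() %>% count(cat1, cat2)
multiCatCounts <- multiCatDataset %>% ungroup() %>% count(cat1, cat2)

print(cat1Counts)      # counts per combination may be uneven
print(multiCatCounts)  # every combination keeps the same number of rows
```

Grouping by both columns fixes the count in every combination, while grouping by cat1 alone only fixes the total per cat1 value.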

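Next, we need to grab the samples from df that are not in our training dataset. The code below uses trainingDataset, the split by cat1 alone; substitute multiCatDataset to get the complement of the two-category split: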

testDataset <- subset(df, !(Sample %in% trainingDataset$Sample))

This returns the four rows of df that were left out of trainingDataset. All four have cat2 = X, which mirrors the imbalance in that split.

> testDataset
# A tibble: 4 x 3
  Sample   cat1  cat2 
  <chr>    <chr> <chr>
1 sample3  A     X    
2 sample4  A     X    
3 sample12 B     X    
4 sample15 B     X 
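The same idea can also be written with dplyr's anti_join(), which keeps the rows of df that have no match in the training set. The sketch below applies it to the two-category split; the df construction and the seed are assumptions, and multiCatTest is a hypothetical name.

```r
library(dplyr)

# Assumed reconstruction of the example dataframe
df <- tibble(
  Sample = paste0("sample", 1:20),
  cat1   = rep(c("A", "B"), each = 10),
  cat2   = rep(rep(c("X", "Y"), each = 5), times = 2)
)

set.seed(42)  # arbitrary seed, only for repeatability
multiCatDataset <- df %>%
  group_by(cat1, cat2) %>%
  slice_sample(prop = 0.8) %>%
  ungroup()

# Keep every row of df whose Sample does not appear in the training set
multiCatTest <- anti_join(df, multiCatDataset, by = "Sample")

print(multiCatTest)
```

Because the training set took 4 of 5 rows from each combination, this test set holds exactly one row per combination of cat1 and cat2.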

The same steps can be repeated to separate a dataset into as many parts as your analysis needs, for example training, cross-validation and test datasets.
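As a rough illustration of repeating the split, the sketch below carves df into three parts. The 60/20/20 proportions, the seed, the df construction, and the names trainSet, remainder, validationSet and testSet are all assumptions made for the example.

```r
library(dplyr)

# Assumed reconstruction of the example dataframe
df <- tibble(
  Sample = paste0("sample", 1:20),
  cat1   = rep(c("A", "B"), each = 10),
  cat2   = rep(rep(c("X", "Y"), each = 5), times = 2)
)

set.seed(42)  # arbitrary seed, only for repeatability

# Step 1: take 60% of each cat1/cat2 combination for training
trainSet <- df %>%
  group_by(cat1, cat2) %>%
  slice_sample(prop = 0.6) %>%
  ungroup()

# Step 2: everything not used for training
remainder <- anti_join(df, trainSet, by = "Sample")

# Step 3: split the remainder in half, again within each combination
validationSet <- remainder %>%
  group_by(cat1, cat2) %>%
  slice_sample(prop = 0.5) %>%
  ungroup()

# Step 4: whatever is left becomes the test set
testSet <- anti_join(remainder, validationSet, by = "Sample")
```

Each later split samples only from the rows that earlier splits left behind, so no sample ends up in more than one part.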