Loading multiple files into a single df and subsetting them
Listing all files in a directory
To get your files into R, you need to know where they are. You could hard-code a list of files and read them in, which is fine if the analysis is a one-off on a small number of files, like so:
listOfFiles <- c('file1.tsv', 'SomeOtherFiles.tsv', 'A_Third_File.tsv')
For large numbers of files, or if the set of input files will change between runs of the analysis, it is better to list the directory contents programmatically.
listOfFiles <- list.files(path = '/location/of/the/files')
If the directory contains other files and you only want a certain type, you can specify a pattern that the filenames must match.
listOfFiles <- list.files(
  path = '/location/of/the/files',
  pattern = '\\.tsv')
The patterns used here are regular expressions (regexes). Info on them can be found here. For those who are already familiar with them, a cheatsheet is available here.
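In practice you may want to tighten this slightly. A minimal sketch (the directory path is still a placeholder): anchoring the pattern with $ avoids matching names that merely contain '.tsv', and full.names = TRUE returns complete paths, which read.table() will need if you are not working inside that directory.

listOfFiles <- list.files(
  path = '/location/of/the/files',   # placeholder path
  pattern = '\\.tsv$',               # only names that end in .tsv
  full.names = TRUE)                 # return full paths, not bare file names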
Inputting lots of files at once
Let’s say you have a single file, and you want to read it. You’d probably use something like the following (although there are dozens of different ways to do the same thing).
singleDF <- read.table('data.tsv', header = TRUE)
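One aside: read.table() splits on any whitespace by default, which usually works for tab-separated files, but a field containing spaces would be split incorrectly. A safer sketch names the separator explicitly:

# Explicitly tab-separated, so fields containing spaces stay intact
singleDF <- read.table('data.tsv', header = TRUE, sep = '\t')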
Reading multiple files can be done by putting read.table() into a function and iterating over your list of file names with it.
More info about lapply() can be found here, and the official docs are here.
If a function returns a df, then running lapply() over a list of file names with that function will produce a list of dfs.
The do.call("rbind", ...) call takes all of the dfs in the list and sticks them together into one.
# Read a single file and return it as a df
readFilesIn <- function(fileName) {
  outDF <- read.table(fileName, header = TRUE)
  return(outDF)
}

# Run the function over every file name to get a list of dfs,
# then stack them into one big df
listOfDFs <- lapply(listOfFiles, readFilesIn)
bigDF <- do.call("rbind", listOfDFs)
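When the wrapper is this small, the same result can be had inline with an anonymous function; this is just an equivalent sketch, not a change in behaviour:

# Equivalent one-liner using an anonymous function instead of a named wrapper
bigDF <- do.call("rbind", lapply(listOfFiles, function(f) read.table(f, header = TRUE)))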
Processing the data whilst inputting them
The beauty of reading each file within a function is that you can do some preprocessing of each df before concatenation. You can use this to cut down on the amount of data that you store in RAM, or add a marker to show which file the data comes from.
In the following example, each file looks like so:
a b
A 45
B 44
C 87
D 22
E 15
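If you want to follow along at home, here is a hedged sketch that writes two such files. The file names in.tsv and in2.tsv, and the 'A' and 'B' values in in2.tsv, match the output shown further down; the remaining in2.tsv values are arbitrary filler.

# Create two small tab-separated test files (rows C-E of in2.tsv are arbitrary)
write.table(
  data.frame(a = c('A', 'B', 'C', 'D', 'E'), b = c(45, 44, 87, 22, 15)),
  file = 'in.tsv', sep = '\t', quote = FALSE, row.names = FALSE)
write.table(
  data.frame(a = c('A', 'B', 'C', 'D', 'E'), b = c(74, 47, 12, 9, 60)),
  file = 'in2.tsv', sep = '\t', quote = FALSE, row.names = FALSE)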
Let’s say you only want the data from rows where column ‘a’ is ‘A’ or ‘B’.
readAndSubset <- function(fileName) {
  # Read in a file whose name is in the fileName variable
  outDF <- read.table(fileName, header = TRUE)
  # List the values in the rows that you want
  usefulRows <- c('A', 'B')
  # Delete all rows that don't match the above list
  outDF <- outDF[which(outDF$a %in% usefulRows), ]
  # Add a column containing the filename
  outDF$fileName <- fileName
  return(outDF)
}
listOfDFs <- lapply(listOfFiles, readAndSubset)
bigDF <- do.call("rbind", listOfDFs)
The output of the above, for two similar files called in.tsv and in2.tsv, is the following:
a b fileName
1 A 45 in.tsv
2 B 44 in.tsv
3 A 74 in2.tsv
4 B 47 in2.tsv
Since you remove the unwanted rows as each file is processed, the amount of data that R needs to hold in RAM at any point is only:
- The good rows that it has found so far
- The file that is currently being processed
… as opposed to reading all of the files in at once and processing the lot, where R would need to hold all of the data in RAM before removing the unwanted rows.
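If RAM is still tight, read.table() can also skip whole columns at read time via its colClasses argument, so data you never need is never loaded at all. A minimal sketch, assuming the files had a hypothetical third column you wanted to drop, would change the read.table() call inside readAndSubset() like so:

# Hypothetical variant: keep columns 'a' and 'b', never load a third, unneeded column
# (colClasses = 'NULL' tells read.table() to skip that column entirely)
outDF <- read.table(fileName, header = TRUE,
                    colClasses = c('character', 'numeric', 'NULL'))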