Loading multiple files into a single df and subsetting them
Listing all files in a directory
To get your files into R, you need to know where they are. You could hard-code a list of files and read them in, which is fine if the analysis is a one-off on a small number of files, like so:
listOfFiles <- c('file1.tsv', 'SomeOtherFiles.tsv', 'A_Third_File.tsv')
For large numbers of files, or if the set of input files will change between runs of the analysis, it is better to list the directory contents programmatically.
listOfFiles <- list.files(path = '/location/of/the/files')
If the directory contains other files and you only want a certain type, you can specify a pattern that the filenames must match.
listOfFiles <- list.files(
  path = '/location/of/the/files',
  pattern = '\\.tsv')
The patterns used here are regular expressions (regexes). Info on them can be found here. For those who are already familiar with them, a cheatsheet is available here.
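In practice you may want to tighten this slightly. A minimal sketch (the directory path is still a placeholder): anchoring the pattern with $ avoids matching names that merely contain '.tsv', and full.names = TRUE returns complete paths, which read.table() will need if you are not working inside that directory.

listOfFiles <- list.files(
  path = '/location/of/the/files',   # placeholder path
  pattern = '\\.tsv$',               # only names that end in .tsv
  full.names = TRUE)                 # return full paths, not bare file names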
Inputting lots of files at once
Let’s say you have a single file, and you want to read it. You’d probably use something like the following (although there are dozens of different ways to do the same thing).
singleDF <- read.table('data.tsv', header = TRUE)
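One aside: read.table() splits on any whitespace by default, which usually works for tab-separated files, but a field containing spaces would be split incorrectly. A safer sketch names the separator explicitly:

# Explicitly tab-separated, so fields containing spaces stay intact
singleDF <- read.table('data.tsv', header = TRUE, sep = '\t')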
Reading multiple files can be done by putting read.table() into a function and iterating over your list of file names with it.
More info about lapply() can be found here, and the official docs are here.
If a function returns a df, then running lapply() over a list of file names with that function will produce a list of dfs.
The do.call("rbind", ...) call takes all of the dfs in the list and sticks them together into one.
# Read a single file and return it as a df
readFilesIn <- function(fileName) {
  outDF <- read.table(fileName, header = TRUE)
  return(outDF)
}

# Run the function over every file name to get a list of dfs,
# then stack them into one big df
listOfDFs <- lapply(listOfFiles, readFilesIn)
bigDF <- do.call("rbind", listOfDFs)
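When the wrapper is this small, the same result can be had inline with an anonymous function; this is just an equivalent sketch, not a change in behaviour:

# Equivalent one-liner using an anonymous function instead of a named wrapper
bigDF <- do.call("rbind", lapply(listOfFiles, function(f) read.table(f, header = TRUE)))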
Processing the data whilst inputting them
The beauty of reading each file within a function is that you can do some preprocessing of each df before concatenation. You can use this to cut down on the amount of data that you store in RAM, or add a marker to show which file the data comes from.
In the following example, each file looks like so:
a b
A 45
B 44
C 87
D 22
E 15
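If you want to follow along at home, here is a hedged sketch that writes two such files. The file names in.tsv and in2.tsv, and the 'A' and 'B' values in in2.tsv, match the output shown further down; the remaining in2.tsv values are arbitrary filler.

# Create two small tab-separated test files (rows C-E of in2.tsv are arbitrary)
write.table(
  data.frame(a = c('A', 'B', 'C', 'D', 'E'), b = c(45, 44, 87, 22, 15)),
  file = 'in.tsv', sep = '\t', quote = FALSE, row.names = FALSE)
write.table(
  data.frame(a = c('A', 'B', 'C', 'D', 'E'), b = c(74, 47, 12, 9, 60)),
  file = 'in2.tsv', sep = '\t', quote = FALSE, row.names = FALSE)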
Let’s say you only want the data from rows where column ‘a’ is ‘A’ or ‘B’.
readAndSubset <- function(fileName) {
  # Read in a file whose name is in the fileName variable
  outDF <- read.table(fileName, header = TRUE)
  # List the values in the rows that you want
  usefulRows <- c('A', 'B')
  # Delete all rows that don't match the above list
  outDF <- outDF[which(outDF$a %in% usefulRows), ]
  # Add a column containing the filename
  outDF$fileName <- fileName
  return(outDF)
}
listOfDFs <- lapply(listOfFiles, readAndSubset)
bigDF <- do.call("rbind", listOfDFs)
The output of the above, for two similar files called in.tsv and in2.tsv, is the following:
a b fileName
1 A 45 in.tsv
2 B 44 in.tsv
3 A 74 in2.tsv
4 B 47 in2.tsv
Since you remove the unwanted rows as each file is processed, the amount of data that R needs to hold in RAM at any point is only:
- The good rows that it has found so far
- The file that is currently being processed
… as opposed to reading all of the files in at once and processing the lot, where R would need to hold all of the data in RAM before removing the unwanted rows.
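If RAM is still tight, read.table() can also skip whole columns at read time via its colClasses argument, so data you never need is never loaded at all. A minimal sketch, assuming the files had a hypothetical third column you wanted to drop, would change the read.table() call inside readAndSubset() like so:

# Hypothetical variant: keep columns 'a' and 'b', never load a third, unneeded column
# (colClasses = 'NULL' tells read.table() to skip that column entirely)
outDF <- read.table(fileName, header = TRUE,
                    colClasses = c('character', 'numeric', 'NULL'))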