Using sequence data in dataframes with Biostrings

Introduction

Sometimes you have a dataframe containing sequences that you have already processed, and you want to convert the sequence column into a DNAStringSet object. From there, you can keep processing the sequences with Biostrings functions, or write them out to a FASTA file.

Let’s say you have a dataframe like this:

SequencesDF <- data.frame(Title = c("wildType", "mutant", "Zoidberg"),
                          Seqs = c("AAATTCCC", "AAATGCCC", "GAGATATA"),
                          stringsAsFactors = FALSE)  # keep both columns as character in R < 4.0

     Title     Seqs
1 wildType AAATTCCC
2   mutant AAATGCCC
3 Zoidberg GAGATATA

Converting it to a DNAStringSet

library(Biostrings)

# Convert every sequence in the column (one per row) into a DNAStringSet,
# so the object has the same length and order as the dataframe
SequencesObj <- DNAStringSet(SequencesDF$Seqs)
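If the dataframe may contain repeated sequences and you only want one copy of each, one option is to drop the duplicate rows before converting, so that each remaining sequence stays paired with its title. This is a minimal sketch rather than the only way to do it; DupDF, DedupDF and DedupObj are hypothetical names made up for this illustration.

```r
library(Biostrings)

# Hypothetical dataframe with a repeated sequence (not part of the example above)
DupDF <- data.frame(Title = c("wildType", "mutant", "wildTypeCopy"),
                    Seqs = c("AAATTCCC", "AAATGCCC", "AAATTCCC"),
                    stringsAsFactors = FALSE)

# Keep only the first row for each distinct sequence;
# filtering rows (not just the Seqs vector) keeps Title and Seqs in sync
DedupDF <- DupDF[!duplicated(DupDF$Seqs), ]

# Convert the remaining sequences
DedupObj <- DNAStringSet(DedupDF$Seqs)

# The object and the filtered dataframe have the same length
length(DedupObj) == nrow(DedupDF)
```

Filtering the rows instead of calling unique() on the column alone means the titles can still be attached afterwards without a length mismatch.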

Adding names metadata

# Make the names "Sequence_1", "Sequence_2", ...
names(SequencesObj) <- paste0("Sequence_", seq_along(SequencesObj))

# Have a look
names(SequencesObj)

# Make the names the same as the "Title" column
names(SequencesObj) <- SequencesDF$Title
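
Writing to a FASTA file

The introduction mentions writing the sequences out to a FASTA file. A minimal sketch of that step follows, using writeXStringSet() and then reading the file back with readDNAStringSet() as a quick check. The output path is an assumption: a temporary file is used here, and FastaPath and CheckObj are names made up for this illustration.

```r
library(Biostrings)

# Rebuild the example so this block runs on its own
SequencesDF <- data.frame(Title = c("wildType", "mutant", "Zoidberg"),
                          Seqs = c("AAATTCCC", "AAATGCCC", "GAGATATA"),
                          stringsAsFactors = FALSE)
SequencesObj <- DNAStringSet(SequencesDF$Seqs)
names(SequencesObj) <- SequencesDF$Title

# Hypothetical output location; replace with your own file path
FastaPath <- tempfile(fileext = ".fasta")

# Write the sequences; the names become the FASTA header lines
writeXStringSet(SequencesObj, filepath = FastaPath, format = "fasta")

# Read the file back in to check the round trip
CheckObj <- readDNAStringSet(FastaPath)
names(CheckObj)
```

Because the names are used as the FASTA headers, it is worth setting them before writing the file.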