(QuickTip) Extracting SAM-formatted tags from data in R
Last modified: 2024-06-03
Sometimes you have a data frame or a list of strings containing SAM-formatted tags.
Maybe you’ve imported a FASTA file using library('Biostrings'), or you have a CSV file with some tags that you need to extract.
This is how to do it.
First, let’s generate some toy data that resembles what you might be working with. I have used a list here; if you have a data frame instead, keep reading.
# Generate a list of conceivable titles with a mixture of two SAM tags
listOfTitles <- list("seq1 RA:Z:54",
                     "seq2 RA:Z:126",
                     "amazingSeq RA:Z:56",
                     "SomethingElse RA:Z:61",
                     "BBTaggedThing BB:Z:56",
                     "SomethingWith2Tags BB:Z:99 RA:Z:11")
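For orientation: each SAM optional field has the form TAG:TYPE:VALUE, where TAG is two characters and TYPE is a single letter (Z marks a printable string). The following is only a small sketch that splits one tag from the toy data into its three parts. The variable names are mine, and it assumes the value itself contains no colon.

```r
# Split one SAM-style tag into its TAG, TYPE and VALUE parts.
# Assumption: the value contains no ':' (true for the toy data above).
exampleTag <- "RA:Z:54"
parts <- strsplit(exampleTag, ":", fixed = TRUE)[[1]]
names(parts) <- c("tag", "type", "value")
parts
```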
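This is the core of the process: first find which tag names occur anywhere in the titles, then pull each tag's values out into its own column.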
library('stringr')
# Go through the titles and collect the name of every tag that occurs
tagNameList <- gsub(':.*',
                    '',
                    unique(
                      unlist(
                        str_extract_all(listOfTitles,
                                        '([:upper:]{2}):[AifZHB]:'))))
# Define function for going through a list of titles, extracting a specific tag
# from it, and storing the result in a single-column dataframe
extractATag <- function(tag, titles) {
  extractedTags <- data.frame(tag = gsub('.*:',
                                         '',
                                         str_extract(titles,
                                                     paste0(tag,
                                                            ':[AifZHB]:\\w*'))))
  names(extractedTags)[names(extractedTags) == 'tag'] <- tag
  return(extractedTags)
}
# Run the above function on the list of possible tags, to generate a list of
# single columns, each containing the extracted tag values.
tagValList <- lapply(tagNameList, function(x) {extractATag(x, listOfTitles)})
# If you started with a list of titles, you can unlist it and combine it with
# the extracted tag columns like so (the first column is named 'title').
cbind(title = unlist(listOfTitles), as.data.frame(tagValList))
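The nested calls above can be hard to read, so here is a sketch that unrolls them one step at a time on a shortened copy of the toy data. The names `titles`, `hits`, `prefixes`, `raFull` and `raValues` are mine. I have also written the uppercase class as `[A-Z]`, which should behave the same as `[:upper:]` for plain ASCII tag names.

```r
library(stringr)

# Shortened copy of the toy data, as a plain character vector
titles <- c("seq1 RA:Z:54",
            "BBTaggedThing BB:Z:56",
            "SomethingWith2Tags BB:Z:99 RA:Z:11")

# Tag discovery, step by step
# 1. Every "XX:T:" prefix found in each title (a list, one element per title)
hits <- str_extract_all(titles, "[A-Z]{2}:[AifZHB]:")
# 2. Flatten the list and drop duplicates
prefixes <- unique(unlist(hits))
# 3. Strip everything from the first colon onwards, leaving only the tag name
tagNames <- gsub(":.*", "", prefixes)

# Value extraction for a single tag (here RA), step by step
# 1. The whole tag, or NA where the title does not carry it
raFull <- str_extract(titles, "RA:[AifZHB]:\\w*")
# 2. Strip everything up to the last colon, leaving only the value
raValues <- gsub(".*:", "", raFull)

tagNames
raValues
```

Titles that lack a tag simply produce NA in that tag's column, which is why every extracted column has the same length as the input.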
If you start with a data frame df that has a column called 'title' instead, run the following lines.
Note that the function extractATag() from above still needs to be defined; I have not repeated it here.
tagNameList <- gsub(':.*',
                    '',
                    unique(
                      unlist(
                        str_extract_all(df$title,
                                        '([:upper:]{2}):[AifZHB]:'))))
tagValList <- lapply(tagNameList, function(x) {extractATag(x, df$title)})
df <- cbind(df, as.data.frame(tagValList))
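To try the data frame route without any real data, here is a self-contained sketch. The toy `df` and its `id` column are hypothetical, and `id` is there only to show that existing columns are kept. The helper is repeated so the block runs on its own, and the uppercase class is written as `[A-Z]`.

```r
library(stringr)

# Same helper as earlier, repeated so this block is runnable by itself
extractATag <- function(tag, titles) {
  extractedTags <- data.frame(
    tag = gsub('.*:', '', str_extract(titles, paste0(tag, ':[AifZHB]:\\w*'))),
    stringsAsFactors = FALSE)
  names(extractedTags)[names(extractedTags) == 'tag'] <- tag
  return(extractedTags)
}

# Hypothetical toy data frame with a 'title' column
df <- data.frame(id = 1:3,
                 title = c("seq1 RA:Z:54",
                           "BBTaggedThing BB:Z:56",
                           "SomethingWith2Tags BB:Z:99 RA:Z:11"),
                 stringsAsFactors = FALSE)

# Find the tag names, extract one column per tag, and bind them onto df
tagNameList <- gsub(':.*', '',
                    unique(unlist(str_extract_all(df$title, '[A-Z]{2}:[AifZHB]:'))))
tagValList <- lapply(tagNameList, function(x) extractATag(x, df$title))
df <- cbind(df, as.data.frame(tagValList))
df
```

The result should have the original `id` and `title` columns followed by one column per tag (`RA` and `BB` here), with NA where a title does not carry that tag.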
Putting everything together:
library(Biostrings)
library(stringr)
# Read a single FASTA file in as a data frame
temp <- readDNAStringSet('file.fa')
dss2df <- function(dss) {
  data.frame(width = width(dss), seq = as.character(dss), names = names(dss))
}
tempdf <- dss2df(temp)
# Extract the SAM formatted tags from the title (names) column and put them into new columns
extractTagsFromCol <- function(df, colOfTitles) {
  tagNameList <- gsub(':.*', '', unique(unlist(str_extract_all(colOfTitles, '([:upper:]{2}):[AifZHB]:'))))
  extractATag <- function(tag, titles) {
    extractedTags <- data.frame(tag = gsub('.*:',
                                           '',
                                           str_extract(titles,
                                                       paste0(tag,
                                                              ':[AifZHB]:\\w*'))))
    names(extractedTags)[names(extractedTags) == 'tag'] <- tag
    return(extractedTags)
  }
  tagValList <- lapply(tagNameList, function(x) {extractATag(x, colOfTitles)})
  return(cbind(df, as.data.frame(tagValList)))
}
tempdf2 <- extractTagsFromCol(tempdf, tempdf$names)
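One caveat: the pattern used throughout only finds tag names made of two uppercase letters. The SAM specification allows a letter followed by a letter or digit, so a tag such as X0 would be skipped. If your data contains such tags, a broader discovery pattern along these lines may help. This is a sketch, not a drop-in replacement: the toy titles, the name `samTagPattern` and the `\\b` word boundary are my additions.

```r
library(stringr)

# Hypothetical titles that include a tag with a digit in its name (X0)
titles <- c("read1 X0:i:1 NM:i:3",
            "read2 NM:i:0 RA:Z:54")

# A letter followed by a letter or digit, then the type letter.
# \\b keeps the pattern from matching in the middle of a longer word.
samTagPattern <- "\\b[A-Za-z][A-Za-z0-9]:[AifZHB]:"

tagNames <- gsub(":.*", "",
                 unique(unlist(str_extract_all(titles, samTagPattern))))
tagNames
```

Note that the value pattern `\\w*` has a similar limitation: it stops at the first character that is not a letter, digit or underscore, so values such as negative numbers would need a wider pattern as well.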