(QuickTip) Extracting SAM-formatted tags from data in R

Last modified: 2024-06-03

Sometimes you have a data frame or a list of strings that contain SAM-formatted tags. Maybe you’ve imported a FASTA file using library('Biostrings'), or you have a CSV file with some tags that you need to extract. Each tag has the form TAG:TYPE:VALUE (for example RA:Z:54), and the goal is to pull every tag out into its own column. This is how to do it.

First, let’s generate some toy data that resembles what you might be working with. I have used a list here; if you have a data frame instead, keep reading.

# Generate a list of conceivable titles with a mixture of two SAM tags
listOfTitles <- list("seq1 RA:Z:54", 
                     "seq2 RA:Z:126", 
                     "amazingSeq RA:Z:56", 
                     "SomethingElse RA:Z:61", 
                     "BBTaggedThing BB:Z:56", 
                     "SomethingWith2Tags BB:Z:99 RA:Z:11")

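This is the meat of the process. It has two steps: find every tag name that occurs in the titles, then extract the value of each tag into its own column.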

library('stringr')

# Go through the titles and collect the name of every tag that appears in them
# (two uppercase letters followed by a SAM type code), without duplicates
tagNameList <- gsub(':.*', 
                    '', 
                    unique(
                      unlist(
                        str_extract_all(listOfTitles, 
                                        '([:upper:]{2}):[AifZHB]:'))))

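# Define a function that extracts one specific tag from a set of titles and
# stores the values in a single-column data frame named after the tag.
# Titles without the tag yield NA. Note that \\w* only captures word
# characters (letters, digits, underscore).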
extractATag <- function(tag, titles) {
  extractedTags <- data.frame(tag = gsub('.*:', 
                                         '', 
                                         str_extract(titles, 
                                                     paste0(tag, 
                                                            ':[AifZHB]:\\w*'))))
  names(extractedTags)[names(extractedTags) == 'tag'] <- tag
  return(extractedTags)
}

# Run the above function on the list of possible tags, to generate a list of
# single columns, each containing the extracted tag values.
tagValList <- lapply(tagNameList, function(x) {extractATag(x, listOfTitles)})

# If you started with a list of titles, you can unlist it and combine it with
# the extracted tag columns like so. Naming the first argument gives the
# titles column a readable name.
cbind(title = unlist(listOfTitles), as.data.frame(tagValList))
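
One thing to watch for: the value pattern \\w* used in extractATag() stops at the first character that is not a letter, digit or underscore, so a float such as 0.5 or a negative integer such as -3 would be cut short. The sketch below illustrates the difference with a more permissive pattern. It is written in base R rather than stringr, the titles and the helper name extractTagValue() are made up for illustration, and it assumes that a tag value ends at the next whitespace character.

```r
# Hypothetical titles (not from the toy data above): a float tag and a
# negative integer tag, plus one plain title for comparison
titles <- c("seq1 XS:f:0.5", "seq2 XI:i:-3", "seq3 RA:Z:54")

# Hypothetical helper: same idea as extractATag(), written in base R, with the
# value pattern exposed as an argument so the two variants can be compared
extractTagValue <- function(tag, titles, valuePattern = "\\S*") {
  pattern <- paste0("\\b", tag, ":[AifZHB]:", valuePattern)
  hit <- regexpr(pattern, titles, perl = TRUE)
  value <- rep(NA_character_, length(titles))
  # Drop the leading TAG:TYPE: part and keep whatever follows it
  value[hit != -1] <- sub("^[^:]+:[AifZHB]:", "", regmatches(titles, hit))
  value
}

strictXS  <- extractTagValue("XS", titles, valuePattern = "\\w*")  # stops at "."
lenientXS <- extractTagValue("XS", titles)                         # keeps "0.5"
lenientXI <- extractTagValue("XI", titles)                         # keeps "-3"
```

If your tag values only ever contain letters and digits, as in the toy data, the original \\w* pattern is fine.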

You could instead start with a data frame df that has a column called 'title', and run the following lines. Note that stringr must be loaded and the function extractATag() must already be defined as above; I don’t want to repeat myself.

tagNameList <- gsub(':.*', 
                    '', 
                    unique(
                      unlist(
                        str_extract_all(df$title, 
                                        '([:upper:]{2}):[AifZHB]:'))))
tagValList <- lapply(tagNameList, function(x) {extractATag(x, df$title)})
df <- cbind(df, as.data.frame(tagValList))
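
To make the data frame route concrete, here is a minimal, self-contained sketch that runs without any extra packages. It is a base R rewrite of the same two steps, not the stringr code above: the toy data frame, its 'id' column and the helper name extractOne() are assumptions made up for this example, and [A-Z] stands in for [:upper:].

```r
# Toy data frame; the 'id' column and its values are made up for illustration
df <- data.frame(
  id = 1:3,
  title = c("seq1 RA:Z:54",
            "BBTaggedThing BB:Z:56",
            "SomethingWith2Tags BB:Z:99 RA:Z:11"),
  stringsAsFactors = FALSE
)

# Step 1: collect every distinct tag name found in the title column
tagPattern <- "[A-Z]{2}:[AifZHB]:"
allTags <- unlist(regmatches(df$title,
                             gregexpr(tagPattern, df$title, perl = TRUE)))
tagNames <- unique(sub(":.*", "", allTags))

# Step 2: extract the value of one tag per title, NA where the tag is absent
# (hypothetical helper, mirrors what extractATag() does)
extractOne <- function(tag, titles) {
  hit <- regexpr(paste0(tag, ":[AifZHB]:\\w*"), titles, perl = TRUE)
  value <- rep(NA_character_, length(titles))
  value[hit != -1] <- sub(".*:", "", regmatches(titles, hit))
  value
}

tagCols <- lapply(tagNames, extractOne, titles = df$title)
names(tagCols) <- tagNames

# Append one column per tag to the original data frame
df <- cbind(df, as.data.frame(tagCols, stringsAsFactors = FALSE))
```

After this, df has the extra columns RA and BB, with NA wherever a title does not carry that tag.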

Putting everything together:

library(Biostrings)
library(stringr)

# Read a single FASTA file and convert it to a data frame
temp <- readDNAStringSet('file.fa')
dss2df <- function(dss) {
  data.frame(width = width(dss), seq = as.character(dss), names = names(dss))
}
tempdf <- dss2df(temp)

# Extract the SAM formatted tags from the title (names) column and put them into new columns
extractTagsFromCol <- function(df, colOfTitles) {
  tagNameList <- gsub(':.*', 
                      '', 
                      unique(
                        unlist(
                          str_extract_all(colOfTitles, 
                                          '([:upper:]{2}):[AifZHB]:'))))
  extractATag <- function(tag, titles) {
    extractedTags <- data.frame(tag = gsub('.*:', 
                                           '', 
                                           str_extract(titles, 
                                                       paste0(tag, 
                                                              ':[AifZHB]:\\w*'))))
    names(extractedTags)[names(extractedTags) == 'tag'] <- tag
    return(extractedTags)
  }
  tagValList <- lapply(tagNameList, function(x) {extractATag(x, colOfTitles)})
  return(cbind(df, as.data.frame(tagValList)))
}

tempdf2 <- extractTagsFromCol(tempdf, tempdf$names)
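
The extracted values are pieces of text, so a tag like RA:Z:54 ends up as "54" rather than the number 54. If you need numbers, one option is base R's type.convert(), sketched below. The data frame here is a hand-built stand-in shaped like the output of extractTagsFromCol(), not real output, and the tag column names are assumptions.

```r
# Hypothetical result, shaped like what extractTagsFromCol() returns
tagged <- data.frame(
  names = c("seq1 RA:Z:54", "BBTaggedThing BB:Z:56"),
  RA = c("54", NA),
  BB = c(NA, "56"),
  stringsAsFactors = FALSE
)

# Convert only the tag columns; as.is = TRUE keeps any non-numeric column as
# character instead of turning it into a factor
tagCols <- c("RA", "BB")
tagged[tagCols] <- lapply(tagged[tagCols], type.convert, as.is = TRUE)
```

Columns whose values are all whole numbers become integer; columns containing any non-numeric text are left as character.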