This is a regularly updated post that holds all of the small bits and tips that don’t warrant their own post. If a group of related tips reaches critical mass, it will be spun out into its own post, and a placeholder will remain here.

Introductory learning materials

An intro to R for biologists https://mbite.org/r-intro-biologists/intro_r_biologists.html
Problem-based learning https://rosalind.info/problems/locations/

Where to get other cheatsheets

For base R: base R cheatsheet
For string matching, substitutions and regexes: stringr cheatsheet
For data visualisation: ggplot2 cheatsheet
For data wrangling and transformations, in an intuitive way: dplyr/tidyr cheatsheet

Inputting data

Command line arguments, etc.

args <- commandArgs(trailingOnly = TRUE)

# Access individual arguments by position, e.g. the second one
args[2]
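
As a slightly fuller sketch of the same idea, the snippet below wraps the argument handling in a function so it can be tried out interactively. The function name parse_args, the two expected arguments (an input path and an optional output path) and the default "out.tsv" are all hypothetical, picked only for illustration.

```r
# Hypothetical helper: expects <input> and, optionally, <output>.
parse_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) < 1) {
    stop("Usage: Rscript script.R <input> [output]", call. = FALSE)
  }
  list(
    input  = args[1],
    # Fall back to an assumed default when no second argument is given
    output = if (length(args) >= 2) args[2] else "out.tsv"
  )
}

# Simulate `Rscript script.R reads.fa counts.tsv`
opts <- parse_args(c("reads.fa", "counts.tsv"))
opts$output
```

Passing the arguments in as a vector, as in the last lines, makes it possible to test the parsing without leaving the R console.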
  • Getting the location of the current script, regardless of whether it’s running in RStudio or not.
    library(tidyverse)
    getCurrentFileLocation <- function()
    {
        # Under Rscript, the script path is passed as --file=<path>
        this_file <- commandArgs() %>%
            tibble::enframe(name = NULL) %>%
            tidyr::separate(col = value, into = c("key", "value"), sep = "=", fill = "right") %>%
            dplyr::filter(key == "--file") %>%
            dplyr::pull(value)
        # No --file argument found: fall back to asking RStudio for the open file
        if (length(this_file) == 0)
        {
            this_file <- rstudioapi::getSourceEditorContext()$path
        }
        return(dirname(this_file))
    }
    
  • (from https://stackoverflow.com/questions/47044068/get-the-path-of-current-script)
  • More info available at https://www.r-bloggers.com/2015/09/passing-arguments-to-an-r-script-from-command-lines/

Loading a FASTA file and converting it into a df

If you need to install Biostrings:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")

How to actually do it

library(Biostrings)
temp <- readDNAStringSet('file.fa')
dss2df <- function(dss) data.frame(width=width(dss), seq=as.character(dss), names=names(dss))
tempdf <- dss2df(temp)
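
To try this without a real FASTA file to hand, here is a small round-trip sketch: it writes two made-up sequences to a temporary file, reads them back and converts them. The sequence names and contents are invented for the example, and it assumes Biostrings is installed.

```r
library(Biostrings)

# Same helper as above, repeated so this block runs on its own
dss2df <- function(dss) data.frame(width = width(dss), seq = as.character(dss), names = names(dss))

# Write two made-up sequences to a temporary FASTA file
toy <- DNAStringSet(c(seq1 = "ATGC", seq2 = "GGCCTT"))
fa_path <- tempfile(fileext = ".fa")
writeXStringSet(toy, fa_path)

# Read the file back in and convert it to a data frame
toy_df <- dss2df(readDNAStringSet(fa_path))
str(toy_df)
```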

Optional: removing SAM-formatted tags from the df and putting them into new columns

Loading multiple files into a single df and subsetting them

Previous post

Processing data

Join a large bunch of dataframes together

library(purrr)
library(dplyr)
bigDF <- reduce(listOfDFs, full_join, by = "Seq")
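
A toy run makes it easier to see what comes out. The three small tables and their column names (sampleA, sampleB, sampleC) are made up for this sketch; only the shared "Seq" key mirrors the snippet above.

```r
library(purrr)
library(dplyr)

# Three made-up count tables sharing a "Seq" key column
listOfDFs <- list(
  data.frame(Seq = c("AAA", "CCC"), sampleA = c(1, 2)),
  data.frame(Seq = c("AAA", "GGG"), sampleB = c(3, 4)),
  data.frame(Seq = c("CCC", "GGG"), sampleC = c(5, 6))
)

# reduce() applies full_join pairwise: first df1 with df2, then that result with df3
bigDF <- reduce(listOfDFs, full_join, by = "Seq")

# Sequences missing from one of the tables come through as NA in that column
bigDF
```

Because it is a full join, no sequence is dropped; swapping in inner_join would keep only sequences present in every table.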

Type conversion

Converting all columns of a df apart from one into numeric columns

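The sapply() form of this seems to work only on actual dataframes, not tibbles, because sapply() returns a matrix rather than a list of columns. The lapply() form below avoids that, but convert with as.data.frame() before you start if in doubt. In this case, the omitted column is outDF$Seq.

# Every column name except 'Seq'
cols <- setdiff(names(outDF), 'Seq')
outDF[cols] <- lapply(outDF[cols], as.numeric)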

Check the results with this:

sapply(outDF, class)
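
Here is a self-contained toy version to experiment with. The data frame contents are made up, and the sketch uses setdiff() and lapply() to pick and convert the columns, which is one of several ways to do it.

```r
# Made-up data frame where the counts were read in as text
outDF <- data.frame(
  Seq = c("AAA", "CCC"),
  s1  = c("1", "2"),
  s2  = c("3.5", "4"),
  stringsAsFactors = FALSE
)

# Every column except Seq
cols <- setdiff(names(outDF), "Seq")
outDF[cols] <- lapply(outDF[cols], as.numeric)

# Seq should still be character, the rest numeric
sapply(outDF, class)
```

One thing to watch: if a column is a factor, as.numeric() returns the internal level codes rather than the printed values, so run it through as.character() first.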

Making data shorter/collating data

Column totals by group

This sums only the numeric columns. Any column that is neither named in group_by() nor numeric is dropped from the result.

test2 %>%
  group_by(Seq) %>%
  summarise(across(where(is.numeric), ~ sum(.x)))
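
A quick toy example shows the behaviour with a non-numeric column. The table below is made up; the Note column is there only to demonstrate that non-numeric, non-grouping columns do not survive.

```r
library(dplyr)

# Made-up table: Seq is the group, Note is a non-numeric extra column
test2 <- data.frame(
  Seq  = c("AAA", "AAA", "CCC"),
  s1   = c(1, 2, 3),
  s2   = c(10, 20, 30),
  Note = c("x", "y", "z")
)

totals <- test2 %>%
  group_by(Seq) %>%
  summarise(across(where(is.numeric), ~ sum(.x)))

# Note is gone; s1 and s2 are summed within each Seq
totals
```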

Sequencing data-specific processing

Reverse complement a sequence without converting into a DNAString object

bla <- "ATGACTCATGCAGTCGCATCGACT"
stringi::stri_reverse(chartr("ATGC", "TACG", bla))
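
If you would rather not depend on stringi, the same trick can be wrapped in a small base R function. The name revcomp is hypothetical, and this sketch also covers lower-case bases; any other character (such as N) is passed through unchanged.

```r
# Hypothetical helper; works on a character vector of sequences
revcomp <- function(x) {
  # Complement upper- and lower-case bases; other characters are left as-is
  comp <- chartr("ACGTacgt", "TGCAtgca", x)
  # Reverse each string by splitting it into characters and pasting them back
  vapply(strsplit(comp, ""), function(ch) paste(rev(ch), collapse = ""), character(1))
}

bla <- "ATGACTCATGCAGTCGCATCGACT"
revcomp(bla)

# Applying it twice should give back the original sequence
revcomp(revcomp(bla)) == bla
```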

Outputting data

Functions