R cheatsheet
This is a regularly updated post that collects the small bits and tips that don’t warrant their own post. If a group of related tips reaches critical mass, it will be spun out into its own post, and a placeholder will remain here.
Introductory learning materials
| An intro to R for biologists | https://mbite.org/r-intro-biologists/intro_r_biologists.html |
| Problem-based learning | https://rosalind.info/problems/locations/ |
Where to get other cheatsheets
| A base R cheatsheet | base R cheatsheet |
| For string matching, substitutions and Regexes | stringr cheatsheet |
| For data visualisation | ggplot2 cheatsheet |
| For data wrangling and transformations, in an intuitive way | dplyr/tidyr cheatsheet |
Inputting data
Command line arguments, etc.
args <- commandArgs(trailingOnly = TRUE)
# Access an argument by position, e.g. the second one
args[2]
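As a rough sketch of how positional arguments might be checked before use: the helper name `parse_args`, the argument names `input_file` and `n_rows`, and the script name in the usage message are all made up for illustration. A fixed vector stands in for `commandArgs(trailingOnly = TRUE)` so the snippet runs outside of `Rscript`.

```r
# Hypothetical helper: turn positional arguments into a named list
parse_args <- function(args) {
  if (length(args) < 2) {
    stop("Usage: Rscript my_script.R <input_file> <n_rows>", call. = FALSE)
  }
  list(
    input_file = args[1],
    # Arguments always arrive as character strings, so convert explicitly
    n_rows     = as.integer(args[2])
  )
}

# In a real script this would be commandArgs(trailingOnly = TRUE)
opts <- parse_args(c("input.csv", "10"))
opts$input_file
opts$n_rows
```

Failing early with a usage message tends to be friendlier than letting `args[2]` silently return `NA` when an argument is missing.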
- Getting the location of the current script, regardless of whether it’s running in RStudio or from the command line.
library(tidyverse)
getCurrentFileLocation <- function() {
  this_file <- commandArgs() %>%
    tibble::enframe(name = NULL) %>%
    tidyr::separate(col = value, into = c("key", "value"), sep = "=", fill = "right") %>%
    dplyr::filter(key == "--file") %>%
    dplyr::pull(value)
  if (length(this_file) == 0) {
    this_file <- rstudioapi::getSourceEditorContext()$path
  }
  return(dirname(this_file))
}
- (from https://stackoverflow.com/questions/47044068/get-the-path-of-current-script)
- More info available at
https://www.r-bloggers.com/2015/09/passing-arguments-to-an-r-script-from-command-lines/
Loading a FASTA file and converting it into a df
If you need to install Biostrings:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biostrings")
How to actually do it
library(Biostrings)
temp <- readDNAStringSet('file.fa')
dss2df <- function(dss) data.frame(width=width(dss), seq=as.character(dss), names=names(dss))
tempdf <- dss2df(temp)
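If Biostrings is not available, the same three columns can be built with base R alone. This is a swapped-in technique, not the Biostrings approach: `fasta2df` is a hypothetical helper, and it assumes a plain FASTA file in which every header line is followed by at least one sequence line.

```r
# Hypothetical base R helper (not part of Biostrings)
fasta2df <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]          # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)            # record index for every line
  # Glue multi-line sequences back together, one string per record
  seqs <- unname(vapply(
    split(lines[!is_header], record[!is_header]),
    paste, character(1), collapse = ""
  ))
  data.frame(
    width = nchar(seqs),
    seq   = seqs,
    names = sub("^>", "", lines[is_header]),
    stringsAsFactors = FALSE
  )
}

# Small made-up FASTA file so the sketch is self-contained
fa <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ATGC", "GG", ">seq2", "TTT"), fa)
tempdf <- fasta2df(fa)
tempdf
```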
Optional: removing SAM formatted tags from the df and putting them into new columns
Loading multiple files into a single df and subsetting them
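One possible way to do this, as a minimal sketch in base R. It assumes the inputs are CSV files that all share the same columns; the file names, the `count` column and the threshold of 10 are invented for the example.

```r
# Write two small hypothetical CSV files so the sketch is self-contained
data_dir <- file.path(tempdir(), "multi_file_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(Seq = c("ATG", "CCC"), count = c(1, 20)),
          file.path(data_dir, "sampleA.csv"), row.names = FALSE)
write.csv(data.frame(Seq = c("GGG", "TTT"), count = c(30, 4)),
          file.path(data_dir, "sampleB.csv"), row.names = FALSE)

files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file and record which file every row came from
dfs <- lapply(files, function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  df$source <- basename(f)
  df
})

# Stack everything into a single df
allDF <- do.call(rbind, dfs)

# Subset: keep only rows with a count of at least 10
highDF <- subset(allDF, count >= 10)
highDF
```

Keeping the file name in a `source` column makes it possible to subset by file later as well as by value.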
Processing data
Join a large bunch of dataframes together
library(purrr)
library(dplyr)
bigDF <- reduce(listOfDFs, full_join, by = "Seq")
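To see what the reduce() call does, here is a small sketch with made-up data. The three tables and their `s1`, `s2`, `s3` columns are hypothetical; the only thing they need in common is the `Seq` key column.

```r
library(purrr)
library(dplyr)

# Hypothetical per-sample tables sharing a "Seq" key column
listOfDFs <- list(
  data.frame(Seq = c("ATG", "CCC"), s1 = c(1, 2)),
  data.frame(Seq = c("ATG", "GGG"), s2 = c(3, 4)),
  data.frame(Seq = c("CCC", "GGG"), s3 = c(5, 6))
)

# reduce() joins the first two, then joins that result with the third, and so on
bigDF <- reduce(listOfDFs, full_join, by = "Seq")
bigDF
```

Because this is a full join, every `Seq` seen in any table gets a row, and samples that lack it are filled with `NA`.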
Type conversion
Converting all columns of a df apart from one into numeric columns
This seems to work only on base data frames, not tibbles, so convert with as.data.frame() before you start.
In this case, the column left unconverted is outDF$Seq.
# All column names except 'Seq' (also works when only one other column exists)
cols <- setdiff(colnames(outDF), 'Seq')
outDF[cols] <- sapply(outDF[cols], as.numeric)
Check the results with this:
sapply(outDF, class)
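A small worked example on made-up data, in case it helps to see the before and after. The columns `a` and `b` are hypothetical, and the column names are picked here with setdiff().

```r
# Hypothetical data frame where the numbers were read in as text
outDF <- data.frame(
  Seq = c("ATG", "CCC"),
  a   = c("1", "2"),
  b   = c("3.5", "4"),
  stringsAsFactors = FALSE
)

# Every column except 'Seq'
cols <- setdiff(names(outDF), "Seq")
outDF[cols] <- sapply(outDF[cols], as.numeric)

# Seq stays character, a and b become numeric
sapply(outDF, class)
```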
Making data shorter/collating data
Column totals by group
This works only on numeric columns.
Any column that is neither named in group_by() nor numeric is dropped from the result.
test2 %>%
group_by(Seq) %>%
summarise(across(where(is.numeric), ~ sum(.x)))
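A quick sketch on toy data shows both the totals and the dropped column. The `sample`, `countA` and `countB` columns are made up for the example.

```r
library(dplyr)

# Hypothetical toy data: two count columns plus one non-numeric column
test2 <- tibble(
  Seq    = c("ATG", "ATG", "CCC"),
  sample = c("s1", "s2", "s1"),
  countA = c(1, 2, 5),
  countB = c(10, 20, 50)
)

totals <- test2 %>%
  group_by(Seq) %>%
  summarise(across(where(is.numeric), ~ sum(.x)))

# 'sample' is gone; countA and countB are summed per Seq
totals
```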
Sequencing data-specific processing
Reverse complement a sequence without converting into a DNAString object
bla <- "ATGACTCATGCAGTCGCATCGACT"
stringi::stri_reverse(chartr("ATGC", "TACG", bla))
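The same chartr() trick can be wrapped into a small function. This is only a sketch: `revcomp` is a hypothetical helper that uses base R for the reversal instead of stringi, works on a whole character vector, and keeps upper and lower case. Ambiguity codes such as N are left untouched.

```r
# Hypothetical helper built on the same chartr() idea, base R only
revcomp <- function(x) {
  # Complement each base, preserving case
  comp <- chartr("ATGCatgc", "TACGtacg", x)
  # Reverse every string in the vector
  vapply(
    strsplit(comp, ""),
    function(ch) paste(rev(ch), collapse = ""),
    character(1)
  )
}

revcomp(c("ATGC", "aacg"))
```

A handy sanity check is that applying the function twice gives back the original sequence.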