  • Using sequence data in dataframes with Biostrings

    Introduction

    Sometimes, you have a dataframe which contains sequences that you have processed, and you want to convert it into a DNAStringSet object. From there, you can continue to process them with Biostrings functions, or write them out to a FASTA file.
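    As a minimal sketch of the conversion (the column names `name` and `sequence` are assumptions, not the post's actual data):

```r
# Assumed toy data frame: one column of IDs, one of sequences
library(Biostrings)

df <- data.frame(
  name     = c("seq1", "seq2"),
  sequence = c("ATGCGT", "TTAGGC")
)

# Build a named DNAStringSet from the data frame columns
dss <- DNAStringSet(setNames(df$sequence, df$name))

# ...and write it out as a FASTA file if needed
writeXStringSet(dss, filepath = "sequences.fasta")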

  • How to move data from Rocket to AWS

    Interacting with AWS from a Mac/Linux Bash terminal is done using the AWS Command Line Interface (awscli). You’ll need to download and install that onto whatever system you are using (in this case, Rocket) in order to do so. Note: this will allow anyone with access to that system to do anything that your AWS user can do, so make sure that you trust the system before storing your credentials on it.

  • Plotting time series in R

    Intro

    Let’s say you have a time series of data which looks something like this:

  • Error catching in scripts and pipes

    Intro

    Error catching in Bash is really useful if you need to run things unattended, on a server or on the cloud. This guide runs through what exit codes are, how they can be used to detect how the commands in your script are doing, and some more complicated examples.
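    The core idea can be sketched in a few lines: every command sets the special variable `$?`, where 0 means success and anything else means failure.

```shell
# Run a command that is expected to fail, and capture its exit code.
# Checking it inside an `if` also keeps `set -e` scripts from aborting here.
if ls /this/path/does/not/exist 2>/dev/null; then
    status=0
else
    status=$?
    echo "ls failed with exit code $status"
fi

# In an unattended script you would typically react to the code:
# retry, log, or bail out with `exit "$status"` so the caller sees it too.
```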

  • Using R to get info from websites using APIs

    You may have used websites like https://www.ncbi.nlm.nih.gov and others to type queries into their web form boxes, and to get some answers. Sometimes, you have too many queries to process manually, and you want to do that programmatically, or even in batches. Websites actually often want you to do that too, as they can get away with just serving the results that you want, rather than having to send you all of the rubbish that comes with a whole web page. This is where APIs come in.

  • Perl cheatsheet

    This is a regularly modified post which holds all of the small bits and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • Getting to grips with R functions

    The situation

  • R cheatsheet

    This is a regularly modified post which holds all of the small bits and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • Loading multiple files into a single df and subsetting them

    Listing all files in a directory

    To get your files into R, you need to know where they are. You could hard-code a list of files, and input them, which is fine if this analysis is being done once on a small number of files, like so:
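    As a sketch of the scalable alternative (the `data` directory and CSV pattern are assumptions; adjust to your own files):

```r
# list.files() finds everything matching a pattern, so new files
# are picked up automatically on the next run
files <- list.files(path = "data", pattern = "\\.csv$", full.names = TRUE)

# Read each file, then stack them into one data frame,
# keeping track of which file each row came from
dfs <- lapply(files, read.csv)
combined <- do.call(rbind, Map(cbind, dfs, source_file = basename(files)))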

  • Biopython cheatsheet

    This is a regularly modified post which holds all of the small bits and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • Python cheatsheet

    This is a regularly modified post which holds all of the small bits and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • (QuickTip) Functions and scoping in Python

    Without scoping

    Once you get to writing more and more complex scripts in Python, one thing you will inevitably end up doing is accidentally reusing variable names. This can introduce bugs in your code when you assume a variable is empty and it isn’t. Let’s take the following example:
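    The shape of the problem, and the fix, can be sketched like this (variable and function names are illustrative, not the post's actual example):

```python
# Without scoping: `total` lingers from the first loop, so reusing it
# later without resetting it silently corrupts the result.
total = 0
for x in [1, 2, 3]:
    total += x          # total is now 6

# ...much later, we forget to reset `total` before reusing it
for x in [10, 20]:
    total += x          # total is 36, not the 30 we expected

# With a function, each call gets its own local `total`,
# so the two computations can no longer interfere
def sum_values(values):
    total = 0
    for v in values:
        total += v
    return total

first = sum_values([1, 2, 3])     # 6
second = sum_values([10, 20])     # 30
```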

  • Rocket Introduction

    Pre-requisites

    This guide is designed for people just starting with High-Performance Computing at Newcastle University. Please email me if you notice anything that is out of date, as I want to keep this current whilst I work here. The guide assumes that you have been taught some Linux before, and have got a login on Aidan. If the terms Aidan, unix.ncl.ac.uk, or SSH are alien to you, book yourself on a Linux Intro course (the Bioinformatics Support Unit run many) and ask about the following link: https://services.ncl.ac.uk/itservice/technical-services/unix-time-sharing/

  • (Personal) Moving this blog to a new machine, and nuking the bundle

    Note: This is a personal note, and is not formatted for wider consumption.

  • Wrapper scripts - Running your jobs on clusters efficiently

    Introduction

    Sometimes, you write a script that works on a single dataset (i.e.: it processes a single file), and you want to run it on many more. When running a script on many datasets, you will probably want to run it on your High-Performance Computing cluster, to save your laptop from exploding. This way, you can run many jobs in parallel, to save time too. This guide assumes that you have run scripts on an HPC system before.

  • Bioinformatics commands store

    This is a regularly modified post which holds all of the small commands and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • Using bucket policies to restrict access to an S3 bucket

    So you want to restrict access to a bucket to only certain users, sets of users, or roles. There are three main ways of doing this, and they can work together.

    1. The first is to give a user or a role permission to access named buckets, and only those buckets. This can be dangerous, and isn’t covered by this guide: a slight misconfiguration that grants a user s3:* without restricting it to certain buckets will give them s3:* on all buckets, and this mistake is very easy to make.
    2. The second way is to encrypt your bucket, and to give access to the key to named users. There are many guides on this online, so that isn’t covered here either.
    3. The third is to set a policy on the bucket to allow access from only certain users, certain roles, or certain sets of users.
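    As a hedged sketch of the third approach (the account ID, user name, and bucket name below are placeholders), a bucket policy granting access to one named IAM user looks roughly like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowNamedUserOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:user/alice" },
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}
```

    Note that Allow statements are additive on top of whatever IAM permissions principals already hold; actively blocking everyone else usually needs an explicit Deny statement as well.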

  • (QuickTip) Navigating large file trees (1)

    Going back and forth between a bunch of different directories/folders in a large tree can be taxing, especially when looking through large numbers of directories for different files. Sometimes, it’s helpful to be able to go back to a specific directory with a single command, without having to use pwd and note down where you were.
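    One portable way of doing this (the directories here are a throwaway /tmp demo): the shell always remembers the previous working directory in `$OLDPWD`, and `cd -` takes you straight back to it.

```shell
mkdir -p /tmp/demo/project/deep/sub/dir

cd /tmp/demo/project
cd deep/sub/dir        # wander off into the tree
cd - > /dev/null       # one command brings you straight back
```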

  • (QuickTip) Extracting SAM-formatted tags from data in R

    Last modified: 2024-06-03

  • Clustering in R with a custom distance function

    There are a lot of tutorials online which talk about how to cluster data, starting with a vector or list of inputs. The overall method used by most people is to create a distance matrix which represents the distance between every pair of items in the input. The resulting matrix is then used as the input to a clustering algorithm (the maths for which is handled by an external library). This tutorial goes over the whole process, including the theory behind it and plotting dendrograms afterwards, using DNA sequence objects as the data being clustered.
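    The overall shape is sketched below, using a deliberately toy distance (absolute difference in sequence length) as a stand-in for a real sequence distance:

```r
seqs <- c("ATG", "ATGCGT", "AT", "ATGCGTAA")

# Build the full pairwise distance matrix from the custom function
d <- outer(nchar(seqs), nchar(seqs), function(a, b) abs(a - b))

# Convert it to a dist object and hand it to a clustering algorithm
hc <- hclust(as.dist(d), method = "average")
plot(hc)   # dendrogram of the result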

  • How to access Newcastle University email from external email clients

    This post will change and be updated as new information comes out, to hopefully be as up to date as possible. The post date will be incremented accordingly.

  • Random sampling of a dataset into training and test datasets

    Training and test datasets represented as buckets

  • (QuickTip) Automatic installation of the latest HTSlib

    When setting up Ubuntu/Debian machines for biology, I sometimes need to have HTSlib installed separately from SAMtools. I also need this done automatically, without my intervention (e.g.: for the creation of Amazon Machine Images). To do this, I do the following as root.

  • (QuickTip) When Perl can't find the modules that you just installed

    Read the following story and see if the errors and troubleshooting that you’ve done match up. If so, the tip at the end might work for you.

  • Installing Snakemake and Tibanna using AWS Image Builder

    AWS EC2 Image Builder is a great way of automatically building machine images that can be used to run your data analyses. When you launch an EC2 instance, you want all of your software installed on the instance already, so that you can get your analysis started without having to manually install it every time. To do this, you need a recipe, which is an ordered list of components. Components like the one below can be used to install software, verify it, and test it. In this case, the component below installs the latest Snakemake, Tibanna, numpy, biopython and bioconda (and dependencies), and creates an environment called Tibanna which allows you to access them.
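    As a rough sketch of what such a component can look like (the names, paths, and exact commands here are illustrative, not the post's actual component):

```yaml
name: InstallSnakemakeTibanna
description: Install Snakemake and Tibanna into a conda environment
schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: InstallMiniconda
        action: ExecuteBash
        inputs:
          commands:
            - curl -sLo /tmp/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
            - bash /tmp/miniconda.sh -b -p /opt/miniconda
      - name: CreateEnvironment
        action: ExecuteBash
        inputs:
          commands:
            - /opt/miniconda/bin/conda create -y -n Tibanna -c conda-forge -c bioconda snakemake tibanna numpy biopython
  - name: validate
    steps:
      - name: CheckSnakemake
        action: ExecuteBash
        inputs:
          commands:
            - /opt/miniconda/bin/conda run -n Tibanna snakemake --version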

  • (QuickTip) Writing data frames or lists of data frames to files elegantly

    I produce a lot of data frames in my R code, and sometimes need to save them as TSVs for logs, etc. The following function and its input neatly name the output files and save them, in a properly vectorised manner.
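    A sketch of one way this can look (the function name and example data frames are hypothetical): given a named list of data frames, write each one to `<name>.tsv`.

```r
write_tsvs <- function(df_list, out_dir = ".") {
  # The list names become the file names
  paths <- file.path(out_dir, paste0(names(df_list), ".tsv"))
  Map(function(df, path) {
    write.table(df, path, sep = "\t", quote = FALSE, row.names = FALSE)
  }, df_list, paths)
  invisible(paths)
}

# Usage: samples_df and counts_df stand in for your own data frames
write_tsvs(list(samples = samples_df, counts = counts_df), out_dir = "logs")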

  • Creating launchers which pass all arguments to the target software

    Sometimes, especially when using institutional servers, you don’t have root or sudo access. That means you’ve downloaded and built a bunch of software somewhere in your home directory. It’s a pain to continually define the location of your software in order to use it (e.g.: /home/username/bin/package_1.3.5/bin/software). You could add an alias to your ~/.bashrc file, but adding one for every piece of software you install will quickly make your ~/.bashrc file long and messy. There is a better way.
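    The trick hinges on `"$@"`, which forwards every argument to the target untouched. A minimal sketch (the paths below are a throwaway /tmp demo, standing in for a deeply buried install):

```shell
mkdir -p /tmp/launcher-demo/real /tmp/launcher-demo/bin

# Stand-in for the deeply buried binary
cat > /tmp/launcher-demo/real/software <<'EOF'
#!/bin/sh
echo "got args: $@"
EOF
chmod +x /tmp/launcher-demo/real/software

# The launcher: exec the real binary, forwarding all arguments with "$@"
cat > /tmp/launcher-demo/bin/software <<'EOF'
#!/bin/sh
exec /tmp/launcher-demo/real/software "$@"
EOF
chmod +x /tmp/launcher-demo/bin/software

# With the launcher directory on PATH, call it like the real thing
PATH="/tmp/launcher-demo/bin:$PATH"
out=$(software --mode fast input.txt)
echo "$out"
```

    Using `exec` means the launcher replaces itself with the real program, so signals and exit codes pass through cleanly.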