  • Using sequence data in dataframes with Biostrings

    Introduction

    Sometimes, you have a dataframe which contains sequences that you have processed, and you want to convert it into a DNAStringSet object. From there, you can continue to process them with Biostrings functions, or write them out to a FASTA file.
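    As a minimal sketch of the conversion (the column names `name` and `sequence` are assumptions, not the post's actual data):

```r
# Assumed toy data frame: one column of IDs, one of sequences
library(Biostrings)

df <- data.frame(
  name     = c("seq1", "seq2"),
  sequence = c("ATGCGT", "TTAGGC")
)

# Build a named DNAStringSet from the data frame columns
dss <- DNAStringSet(setNames(df$sequence, df$name))

# ...and write it out as a FASTA file if needed
writeXStringSet(dss, filepath = "sequences.fasta")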

  • How to move data from Rocket to AWS

    Interacting with AWS from a Mac/Linux Bash terminal is done using the AWS Command Line Interface (awscli). You’ll need to download and install that onto whatever system you are using (in this case, Rocket) in order to do so. Note: this will allow anyone with access to that system to do anything that your AWS user can do, so make sure that you trust the system before storing your credentials on it.

  • Plotting time series in R

    Intro

    Let’s say you have a time series of data which looks something like this:

  • Error catching in scripts and pipes

    Intro

    Error catching in Bash is really useful if you need to run things unattended, on a server or on the cloud. This guide runs through what exit codes are, how they can be used to detect how the commands in your script are doing, and some more complicated examples.
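    The core idea can be sketched in a few lines: every command sets the special variable `$?`, where 0 means success and anything else means failure.

```shell
# Run a command that is expected to fail, and capture its exit code.
# Checking it inside an `if` also keeps `set -e` scripts from aborting here.
if ls /this/path/does/not/exist 2>/dev/null; then
    status=0
else
    status=$?
    echo "ls failed with exit code $status"
fi

# In an unattended script you would typically react to the code:
# retry, log, or bail out with `exit "$status"` so the caller sees it too.
```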

  • Using R to get info from websites using APIs

    You may have used websites like https://www.ncbi.nlm.nih.gov and others to type queries into their web form boxes, and to get some answers. Sometimes, you have too many queries to process manually, and you want to do that programmatically, or even in batches. Websites actually often want you to do that too, as they can get away with just serving the results that you want, rather than having to send you all of the rubbish that comes with a whole web page. This is where APIs come in.

  • Perl cheatsheet

    This is a regularly modified post which holds all of the small bits and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • Getting to grips with R functions

    The situation

  • R cheatsheet

    This is a regularly modified post which holds all of the small bits and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • Loading multiple files into a single df and subsetting them

    Listing all files in a directory

    To get your files into R, you need to know where they are. You could hard-code a list of files, and input them, which is fine if this analysis is being done once on a small number of files, like so:
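    As a sketch of the scalable alternative (the `data` directory and CSV pattern are assumptions; adjust to your own files):

```r
# list.files() finds everything matching a pattern, so new files
# are picked up automatically on the next run
files <- list.files(path = "data", pattern = "\\.csv$", full.names = TRUE)

# Read each file, then stack them into one data frame,
# keeping track of which file each row came from
dfs <- lapply(files, read.csv)
combined <- do.call(rbind, Map(cbind, dfs, source_file = basename(files)))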

  • Biopython cheatsheet

    This is a regularly modified post which holds all of the small bits and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • Python cheatsheet

    This is a regularly modified post which holds all of the small bits and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • (QuickTip) Functions and scoping in Python

    Without scoping

    Once you get to writing more and more complex scripts in Python, one thing you will inevitably end up doing is accidentally reusing variable names. This can introduce bugs in your code when you assume a variable is empty and it isn’t. Let’s take the following example:
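    The shape of the problem, and the fix, can be sketched like this (variable and function names are illustrative, not the post's actual example):

```python
# Without scoping: `total` lingers from the first loop, so reusing it
# later without resetting it silently corrupts the result.
total = 0
for x in [1, 2, 3]:
    total += x          # total is now 6

# ...much later, we forget to reset `total` before reusing it
for x in [10, 20]:
    total += x          # total is 36, not the 30 we expected

# With a function, each call gets its own local `total`,
# so the two computations can no longer interfere
def sum_values(values):
    total = 0
    for v in values:
        total += v
    return total

first = sum_values([1, 2, 3])     # 6
second = sum_values([10, 20])     # 30
```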

  • Rocket Introduction

    Pre-requisites

    This guide is designed for people just starting with High-Performance Computing at Newcastle University. Please email me if you notice anything that is out of date, as I want to keep this current whilst I work here. The guide assumes that you have been taught some Linux before, and have got a login on Aidan. If the terms Aidan, unix.ncl.ac.uk, or SSH are alien to you, book yourself on a Linux Intro course (the Bioinformatics Support Unit run many) and ask about the following link: https://services.ncl.ac.uk/itservice/technical-services/unix-time-sharing/

  • (Personal) Moving this blog to a new machine, and nuking the bundle

    Note: This is a personal note, and is not formatted for wider consumption.

  • Wrapper scripts - Running your jobs on clusters efficiently

    Introduction

    Sometimes, you write a script that works on a single dataset (i.e.: it processes a single file), and you want to run it on many more. When running a script on many datasets, you will probably want to run it on your High-Performance Computing cluster, to save your laptop from exploding. This way, you can run many jobs in parallel, to save time too. This guide assumes that you have run scripts on an HPC system before.

  • Bioinformatics commands store

    This is a regularly modified post which holds all of the small commands and tips that don’t warrant their own post. If there is a group of related tips that pass a critical mass, they will be spun out into their own post, and a placeholder will remain here.

  • Using bucket policies to restrict access to an S3 bucket

    So you want to restrict access to a bucket to only certain users, sets of users, or roles. There are three main ways of doing this, and they can work together.

    1. The first is to give a user or a role permission to access named buckets, and only those buckets. This can be dangerous, and isn’t covered by this guide: a slight misconfiguration that grants a user s3:* without restricting it to certain buckets will give them s3:* on all buckets, and this mistake is very easy to make.
    2. The second way is to encrypt your bucket, and to give access to the key to named users. There are many guides on this online, so that isn’t covered here either.
    3. The third is to set a policy on the bucket to allow access from only certain users, certain roles, or certain sets of users.
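    As a hedged sketch of the third approach (the account ID, user name, and bucket name below are placeholders), a bucket policy granting access to one named IAM user looks roughly like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowNamedUserOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:user/alice" },
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}
```

    Note that Allow statements are additive on top of whatever IAM permissions principals already hold; actively blocking everyone else usually needs an explicit Deny statement as well.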

  • (QuickTip) Navigating large file trees (1)

    Going back and forth between a bunch of different directories/folders in a large tree can be taxing, especially when looking through large numbers of directories for different files. Sometimes, it’s helpful to be able to go back to a specific directory with a single command, without having to use pwd and note down where you were.
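    One portable way of doing this (the directories here are a throwaway /tmp demo): the shell always remembers the previous working directory in `$OLDPWD`, and `cd -` takes you straight back to it.

```shell
mkdir -p /tmp/demo/project/deep/sub/dir

cd /tmp/demo/project
cd deep/sub/dir        # wander off into the tree
cd - > /dev/null       # one command brings you straight back
```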

  • (QuickTip) Extracting SAM-formatted tags from data in R

    Last modified: 2024-06-03

  • Clustering in R with a custom distance function

    There are a lot of tutorials online which talk about how to cluster data, starting with a vector or list of inputs. The overall method used by most people is to create a distance matrix which represents the distance between every pair of items in the input. The resulting matrix is then used as the input to a clustering algorithm (the maths for which is handled by an external library). This tutorial goes over the whole process, including the theory behind it and plotting dendrograms afterwards, using DNA sequence objects as the data being clustered.
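    The overall shape is sketched below, using a deliberately toy distance (absolute difference in sequence length) as a stand-in for a real sequence distance:

```r
seqs <- c("ATG", "ATGCGT", "AT", "ATGCGTAA")

# Build the full pairwise distance matrix from the custom function
d <- outer(nchar(seqs), nchar(seqs), function(a, b) abs(a - b))

# Convert it to a dist object and hand it to a clustering algorithm
hc <- hclust(as.dist(d), method = "average")
plot(hc)   # dendrogram of the result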

  • How to access Newcastle University email from external email clients

    This post will change and be updated as new information comes out, to hopefully be as up to date as possible. The post date will be incremented accordingly.

  • Random sampling of a dataset into training and test datasets

    Training and test datasets represented as buckets

  • (QuickTip) Automatic installation of the latest HTSlib

    When setting up Ubuntu/Debian machines for biology, I sometimes need to have HTSlib installed separately from SAMtools. I also need this done automatically, without my intervention (e.g.: for the creation of Amazon Machine Images). To do this, I do the following as root.

  • (QuickTip) When Perl can't find the modules that you just installed

    Read the following story and see if the errors and troubleshooting that you’ve done match up. If so, the tip at the end might work for you.

  • Installing Snakemake and Tibanna using AWS Image Builder

    AWS EC2 Image Builder is a great way of automatically building machine images that can be used to run your data analyses. When you launch an EC2 instance, you want all of your software installed on the instance already, so that you can get your analysis started without having to manually install it every time. To do this, you need a recipe, which is an ordered list of components. Components like the one below can be used to install software, verify it, and test it. In this case, the component below installs the latest Snakemake, Tibanna, numpy, biopython and bioconda (and dependencies), and creates an environment called Tibanna which allows you to access them.
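    As a rough sketch of what such a component can look like (the names, paths, and exact commands here are illustrative, not the post's actual component):

```yaml
name: InstallSnakemakeTibanna
description: Install Snakemake and Tibanna into a conda environment
schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: InstallMiniconda
        action: ExecuteBash
        inputs:
          commands:
            - curl -sLo /tmp/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
            - bash /tmp/miniconda.sh -b -p /opt/miniconda
      - name: CreateEnvironment
        action: ExecuteBash
        inputs:
          commands:
            - /opt/miniconda/bin/conda create -y -n Tibanna -c conda-forge -c bioconda snakemake tibanna numpy biopython
  - name: validate
    steps:
      - name: CheckSnakemake
        action: ExecuteBash
        inputs:
          commands:
            - /opt/miniconda/bin/conda run -n Tibanna snakemake --version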

  • (QuickTip) Writing data frames or lists of data frames to files elegantly

    I produce a lot of data frames in my R code, and sometimes need to save them as TSVs for logs, etc. The following function and its input neatly name the output files and save them, in a properly vectorised manner.
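    A sketch of one way this can look (the function name and example data frames are hypothetical): given a named list of data frames, write each one to `<name>.tsv`.

```r
write_tsvs <- function(df_list, out_dir = ".") {
  # The list names become the file names
  paths <- file.path(out_dir, paste0(names(df_list), ".tsv"))
  Map(function(df, path) {
    write.table(df, path, sep = "\t", quote = FALSE, row.names = FALSE)
  }, df_list, paths)
  invisible(paths)
}

# Usage: samples_df and counts_df stand in for your own data frames
write_tsvs(list(samples = samples_df, counts = counts_df), out_dir = "logs")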

  • Creating launchers which pass all arguments to the target software

    Sometimes, especially when using institutional servers, you don’t have root or sudo access. That means you’ve downloaded and built a bunch of software somewhere in your home directory. It’s a pain to continually define the location of your software in order to use it (e.g.: /home/username/bin/package_1.3.5/bin/software). You could add an alias to your ~/.bashrc file, but adding one for every piece of software you install will quickly make your ~/.bashrc file long and messy. There is a better way.
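    The trick hinges on `"$@"`, which forwards every argument to the target untouched. A minimal sketch (the paths below are a throwaway /tmp demo, standing in for a deeply buried install):

```shell
mkdir -p /tmp/launcher-demo/real /tmp/launcher-demo/bin

# Stand-in for the deeply buried binary
cat > /tmp/launcher-demo/real/software <<'EOF'
#!/bin/sh
echo "got args: $@"
EOF
chmod +x /tmp/launcher-demo/real/software

# The launcher: exec the real binary, forwarding all arguments with "$@"
cat > /tmp/launcher-demo/bin/software <<'EOF'
#!/bin/sh
exec /tmp/launcher-demo/real/software "$@"
EOF
chmod +x /tmp/launcher-demo/bin/software

# With the launcher directory on PATH, call it like the real thing
PATH="/tmp/launcher-demo/bin:$PATH"
out=$(software --mode fast input.txt)
echo "$out"
```

    Using `exec` means the launcher replaces itself with the real program, so signals and exit codes pass through cleanly.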