Introduction

Sometimes you write a script that works on a single dataset (i.e. it processes a single file), and then you want to run it on many more datasets. When you are running a script on that many datasets, you will probably want to run it on your High-Performance Computing (HPC) cluster, to save your laptop from exploding. This way, you can also run many jobs in parallel to save time. This guide assumes that you have run scripts on an HPC system before.

This guide will walk you through:

  1. Turning a script into a generic version that can be run on a different file or set of files each time
  2. Running that generic script on multiple files
  3. Creating a wrapper script to run the generic script on a cluster
  4. Running the two scripts efficiently on a cluster

Your script

For this example, let’s say you have an R script that looks something like the one below. This script reads in a TSV file, plots a bar chart, and saves the plot.

library(ggplot2)

# Read a file into a dataframe
inData <- read.delim('/home/username/awesomeTable.tsv')
# It looks like this:
#   Sample Value
# 1      1     8
# 2      2     4
# 3      3     2
# 4      4     8

# Plot and save
randomPlot <- ggplot(inData, aes(x = Sample, y = Value)) +
  geom_bar(stat = 'identity') +
  theme_classic()

ggsave("/home/username/awesomePlot.png", 
       plot = randomPlot, 
       path = NULL, 
       width = 26, 
       height = 13, 
       units = "cm", 
       dpi = 78, 
       device = 'png')

The problem is that every time you want to run this on a different table, you will need to change the argument of read.delim() and the output filename passed to ggsave().

Making your script able to run on any file, using arguments

Instead of hard-coding your filenames into the scripts themselves, you can accept arguments from the terminal when you run them. The following stores all arguments from the terminal in a vector called args, where the first element is the input file and the second is the output file. The hard-coded filenames from the script above have been replaced by references to elements of args.

library(ggplot2)

# Store arguments in a vector called args
args <- commandArgs(trailingOnly = TRUE)
# args[1] is the input file
# args[2] is the output file

# Read a file into a dataframe
inData <- read.delim(args[1])
# It looks like this:
#   Sample Value
# 1      1     8
# 2      2     4
# 3      3     2
# 4      4     8

# Plot and save
randomPlot <- ggplot(inData, aes(x = Sample, y = Value)) +
  geom_bar(stat = 'identity') +
  theme_classic()

ggsave(args[2], 
       plot = randomPlot, 
       path = NULL, 
       width = 26, 
       height = 13, 
       units = "cm", 
       dpi = 78, 
       device = 'png')

To run the script once, you can do something like the following.

Rscript ./Rscript.R awesomeTable.tsv awesomePlot.png
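
Because the script now depends on being given exactly two arguments, it can be worth failing early with a clear message if they are missing. This check is not part of the original script, just a small sketch you could add straight after the commandArgs() line:

# Stop with a usage message unless exactly two arguments were given
# (input TSV first, output PNG second)
if (length(args) != 2) {
  stop("Usage: Rscript Rscript.R <input.tsv> <output.png>")
}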

If you have a series of input files, you could run the script sequentially using a for loop. In the following example, the files awesomeTable1.tsv, awesomeTable2.tsv and awesomeTable3.tsv are stored one at a time in a variable $f. Each time the loop runs, $outFile is created from $f by replacing .tsv with .png. The R script is then run with these two filenames, once per iteration. In this case, I used the absolute (full) path to the script file, just to show you that it can be done.

NOTE: DO NOT DO THIS ON THE LOGIN NODE OF YOUR CLUSTER.

This kind of behaviour unnecessarily uses resources on the login nodes, which are meant for data transfer and job submission. Worker nodes are for data processing, and the next section shows you how to submit your jobs to them.

for f in awesomeTable1.tsv awesomeTable2.tsv awesomeTable3.tsv
do
    outFile=$(sed 's/\.tsv/.png/' <<< $f)

    Rscript /home/username/Rscript.R $f $outFile
done
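
As an aside, sed is not the only way to build the output filename: Bash parameter expansion can do the same substitution without calling an external program. A sketch of the same loop using that approach:

for f in awesomeTable1.tsv awesomeTable2.tsv awesomeTable3.tsv
do
    # ${f%.tsv} strips the .tsv suffix, then .png is appended
    outFile="${f%.tsv}.png"

    Rscript /home/username/Rscript.R "$f" "$outFile"
done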

Making a wrapper script to run it on a cluster

So you now have an R script that can be run separately on different files. A wrapper script can be used to specify your cluster settings. The following is written for Rocket, but the concept applies to many other HPC systems. This script also takes an argument from the terminal, but unlike R, Bash doesn’t need a function like commandArgs(): it stores the arguments in $1, $2, and so on.

#!/bin/bash -l
################################################################################
#                               Slurm env setup                                #

#   Set number of cores
#SBATCH -c 1

#   Set RAM per core
#SBATCH --mem-per-cpu=2G

#   Set mail preferences (NONE, BEGIN, END, FAIL, REQUEUE, ALL)
#SBATCH --mail-type=NONE

#   Set queue in which to submit: defq bigmem short power
#SBATCH -p short

#   Set wall clock time
#SBATCH -t 0-0:05:00

#                                                                              #
################################################################################

#   Generate output filename from input
outFile=$(sed 's/\.tsv/.png/' <<< $1)

#   Send it!
Rscript /home/username/Rscript.R $1 $outFile
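
One assumption in the wrapper above is that Rscript is already on the PATH for batch jobs. On many clusters you need to load an R environment module first; if that is the case on yours, add something like the following before the Rscript call (the exact module name varies between systems, so check what your cluster provides):

#   Load R so that Rscript is available (module name is cluster-specific)
module load R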

Now, you could run it on a single file on the cluster using:

sbatch \
    --job-name=AwesomeScriptOnRocket \
    AwesomeWrapperScript.slurm \
        /nobackup/username/awesomeTable.tsv
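
Once the job is submitted, you can check that it is queued or running with squeue, filtering either by your username or by the job name you just set:

#   All of your jobs currently in the queue
squeue -u $USER

#   Or just the job submitted above
squeue --name=AwesomeScriptOnRocket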

Using a for loop to handle your submission

To track your jobs in the queue, you want a unique job name for each input file. The sed command in this case replaces everything up to and including the last / in $f (i.e. the file path) with nothing, so you are left with just the filename. Concatenate that with a name for your process, like AwesomeJob_, and you have a decent unique job name!

cd /nobackup/username/inputDataLocation
for f in *.tsv
do
    jobName=AwesomeJob_$(sed 's/.*\///' <<< $f)

    sbatch \
        --job-name=$jobName \
        AwesomeWrapperScript.slurm \
            $f
done

Now, the for loop handles submission of the jobs to the cluster, so nothing runs on the login node apart from the loop itself, and each R job runs on a worker node.
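
Finally, it is worth knowing where the output goes. By default, Slurm writes each job’s output (anything the R script prints, plus any errors) to a file called slurm-<jobid>.out in the directory you submitted from, which is the first place to look when a job fails. If you would rather have these files named after your jobs, you can add an output directive to the wrapper script; %x and %j are Slurm’s placeholders for the job name and job ID:

#   Name the job's output file after the job name and job ID
#SBATCH -o %x-%j.out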