Biopython cheatsheet
This is a regularly updated post that collects the small tips that don’t warrant their own post. If a group of related tips reaches critical mass, it will be spun out into its own post, and a placeholder will remain here.
This post focuses on Biopython and on using its packages to work with sequencing data.
Installation
pip install biopython
Bio.SeqIO
This is the main interface for reading and writing sequences in Python. It supports most of the formats we need, including FASTA and FASTQ. Full tutorial here. Main documentation here. Bio.AlignIO handles alignment files, such as MAF.
Sequence input
FASTA input
Read a FASTA file and process it seq-by-seq
from Bio import SeqIO
for record in SeqIO.parse("FASTA.fa", "fasta"):
    print(record.id)
Read a FASTA file as a handle, then iterate over the handle.
from Bio import SeqIO
with open("example.fasta") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)
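Grab a FASTA file into a dict-like object for lookup by record ID.
SeqIO.index() parses records on demand instead of holding them all in memory, which is why it is preferred over Bio.SeqIO.to_dict(), which loads every record up front.
It still has to scan the whole file to build the index, so use with caution on very large files.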
from Bio import SeqIO
record_dict = SeqIO.index("example.fasta", "fasta")
print(record_dict["gi:12345678"]) # use any record ID
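To make the difference between the two approaches concrete, here is a minimal sketch that builds both from the same file. The temporary file and the record names (seqA, seqB) are made up for illustration.

```python
import os
import tempfile

from Bio import SeqIO

# Write a tiny, made-up FASTA file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seqA\nATGC\n>seqB\nGGCCTT\n")
    fasta_path = tmp.name

# SeqIO.index() returns a dict-like object that parses records on demand.
indexed = SeqIO.index(fasta_path, "fasta")
indexed_ids = sorted(indexed.keys())
seq_b = str(indexed["seqB"].seq)
indexed.close()  # release the underlying file handle

# SeqIO.to_dict() reads every record into an ordinary dict up front.
in_memory = SeqIO.to_dict(SeqIO.parse(fasta_path, "fasta"))

os.remove(fasta_path)
print(indexed_ids, seq_b, len(in_memory))
```

Both objects are keyed by record ID, so lookups read the same either way; the difference is when the records get parsed.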
Access the ID and sequence of a record using the .id and .seq attributes.
print(record.id)
print(record.seq)
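For a closer look at what a parsed record holds, here is a small sketch that parses FASTA text from memory rather than from a file. The sequence names and contents are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA content, held in memory so the example needs no file on disk.
fasta_text = ">seq1 first example\nATGCATGC\n>seq2 second example\nGGGCCC\n"

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # .id is the first word of the header line; .description is the whole header.
    print(record.id, len(record.seq), record.description)
```

Note that .id stops at the first whitespace in the header, so anything after it is only available through .description.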
FASTA output
from Bio import SeqIO
# Where outRecord is a SeqRecord object, such as the type produced by SeqIO
with open("outFile.txt", "w") as outFH:
    SeqIO.write(outRecord, outFH, "fasta")
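SeqIO.write() also takes a list (or any iterable) of records, which is usually what you want when writing a whole file. A minimal sketch, writing to an in-memory buffer so nothing touches the disk; the two records are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records, for illustration only.
records = [
    SeqRecord(Seq("ATGCATGC"), id="rec1", description="first"),
    SeqRecord(Seq("GGGCCC"), id="rec2", description="second"),
]

# SeqIO.write() accepts a single record or an iterable of records
# and returns how many records it wrote.
buffer = StringIO()
n_written = SeqIO.write(records, buffer, "fasta")

fasta_text = buffer.getvalue()
print(n_written)
print(fasta_text)
```

Swapping the buffer for a real file handle, or a filename, works the same way.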
Manually creating a SeqRecord object
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
manualRecord = SeqRecord(
    Seq("ATGCAGCTGCATAGTACGTGCATGACTGCATGTACGACTAGTC"),
    id="NameOfManualRecord",
    description="Description of manual record",
)
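A quick way to check a hand-built record is to render it as a string with the record's format() method. This sketch rebuilds the same record as above under a different variable name (manual_record) so it runs on its own.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

manual_record = SeqRecord(
    Seq("ATGCAGCTGCATAGTACGTGCATGACTGCATGTACGACTAGTC"),
    id="NameOfManualRecord",
    description="Description of manual record",
)

# format() returns the record as a string in the requested file format.
fasta_text = manual_record.format("fasta")
print(fasta_text)
```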
Common things to do
GC content
from Bio.SeqUtils import gc_fraction
print(gc_fraction("TGCAGTACTAGCTACGT"))
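gc_fraction() also accepts the Seq objects that SeqIO produces, so GC content per record is a one-liner. A small sketch, assuming made-up FASTA content held in memory; the record names are illustrative only.

```python
from io import StringIO

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

# Made-up FASTA content, for illustration only.
fasta_text = ">low_gc\nATATATGC\n>high_gc\nGGCCGGAT\n"

# gc_fraction() returns a value between 0 and 1, not a percentage.
gc_by_id = {
    record.id: gc_fraction(record.seq)
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
}

for record_id, gc in gc_by_id.items():
    print(f"{record_id}\t{gc:.2f}")
```

Multiply by 100 if you need a percentage.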
Translate an RNA seq
# seqRecord is a SeqRecord object, such as the type produced by SeqIO
outRecord = seqRecord.translate()
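Here is a self-contained sketch of the same call on a hand-built record, including the to_stop option for stopping at the first in-frame stop codon. The record ID (example_mrna) is made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up RNA record, for illustration only.
rna_record = SeqRecord(
    Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"),
    id="example_mrna",
)

# Translate the whole sequence; stop codons appear as "*".
full_protein = rna_record.translate()

# Stop at the first in-frame stop codon instead.
first_orf = rna_record.translate(to_stop=True)

print(full_protein.seq)
print(first_orf.seq)
```

translate() returns a new SeqRecord, so the protein sequence itself is on its .seq attribute.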