Using R to get info from websites using APIs
You may have used websites like https://www.ncbi.nlm.nih.gov and others to type queries into their web form boxes, and to get some answers.
Sometimes, you have too many queries to process manually, and you want to do that programmatically, or even in batches.
Websites actually often want you to do that too, as they can get away with just serving the results that you want, rather than having to send you all of the rubbish that comes with a whole web page.
This is where APIs come in.
APIs, or Application Programming Interfaces, allow a piece of software that you write on your computer to talk to the website or server, and to retrieve just the results in a sensible format. Detail is available here, and more technical detail is here. There are many flavours of API, but most bioinformatics websites have some kind of interface that uses HTTP, i.e. you can access it in the same way you would type a web address into your browser. Don’t worry about how it works for now; you just need to know how to use one.
Variant annotation using the NCBI E-utilities API
Let’s take the NCBI website, for example.
This website has a lot of databases that you can search, and this can help to do a lot of your work.
Imagine you have a lot of variants that you have dbSNP “rs” numbers for, and you want to check if any records associated with each are considered pathogenic.
You have been doing your analysis in R so far, so you have a data frame that looks something like this:
vars <- data.frame(Patient = c('ID_123456', 'ID_87654321', 'ID_9999999'),
                   Variant = c('RS34612342', 'RS3731249', 'RS28897689')
)
rownames(vars) <- vars$Patient
                Patient    Variant
ID_123456     ID_123456 RS34612342
ID_87654321 ID_87654321  RS3731249
ID_9999999   ID_9999999 RS28897689
To process these, you could search the ClinVar database using the website, look down the table, and note the results, but that will be time consuming if you have thousands of variants.
Prototyping
Manual search queries
Firstly, you need to check if a website offers an API.
This is easiest to do by Googling: websitename.com API.
For the NCBI, that brings you to their API landing page, which tells you that the NCBI has multiple APIs.
Of these, the Entrez Programming Utilities (E-utilities) one seems to be what you’re after, and there is documentation for it!
The Quick Start guide gives you a nice overview of how the API is used, and the most important part of this is that it can be accessed using HTTPS URLs.
The examples aren’t amazingly useful though, as they are for other databases.
Googling clinvar eutils reveals the page Accessing and using data in ClinVar, which has a lot of examples.
You can modify these examples a bit to get the information that you need.
Googling clinvar fields gives you How to search ClinVar, which shows you all of the different kinds of queries that you can use.
One of these queries is AND "clinsig pathogenic"[Properties] for pathogenic records.
Next, you can use the advanced search on the web form to see what kind of query you need to use to get the info that you need.
In this case, searching for the variant rs number in all fields returns all records associated with it.
Adding AND "clinsig pathogenic"[Properties] to the search query gives you only pathogenic records, and returns no results for variants with no pathogenic annotations.
Great!
Turning search queries into API requests
The thing about HTTP-based APIs is that you can use your browser to run queries and look at the outputs. This can be used to run trials. The above documentation tells you that you can assemble URLs containing the request to do so. Let’s go with a working example:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=FGFR3[gene]+AND+single_gene[prop]&retmax=500
retmax=500 sounds like “return max. 500 results”, so that can be omitted.
The rest can be changed to:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=RS28897689+AND+clinsig_pathogenic[properties]
Since URLs can’t have spaces in them, clinsig pathogenic from the search query is written as clinsig_pathogenic here, which turns out to work.
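As a minimal sketch of another way to deal with awkward characters, base R's URLencode() can percent-encode a search term before it is pasted into the request URL. The helper name buildEsearchUrl is made up for illustration, and whether E-utilities treats the percent-encoded query exactly like the hand-written one above is an assumption worth checking in the browser first.

```r
# Hypothetical helper: build an esearch URL for one or more search terms.
buildEsearchUrl <- function(term, db = 'clinvar') {
  base <- 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
  # reserved = TRUE also encodes characters such as spaces and square brackets
  encoded <- vapply(term, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  paste0(base, '?db=', db, '&term=', encoded)
}

term <- 'RS28897689 AND clinsig_pathogenic[properties]'
url <- buildEsearchUrl(term)
print(url)
```

Pasting the printed URL into the browser is a quick way to confirm that it returns the same XML as the hand-assembled version.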
Running that in the browser gives some output in XML format:
<eSearchResult>
<Count>1</Count>
<RetMax>1</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>531295</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>RS28897689[All Fields]</Term>
<Field>All Fields</Field>
<Count>2</Count>
<Explode>N</Explode>
</TermSet>
<TermSet>
<Term>clinsig_pathogenic[properties]</Term>
<Field>properties</Field>
<Count>175625</Count>
<Explode>N</Explode>
</TermSet>
<OP>AND</OP>
</TranslationStack>
<QueryTranslation>
RS28897689[All Fields] AND clinsig_pathogenic[properties]
</QueryTranslation>
</eSearchResult>
Looking at the output from a few queries with this and other rs numbers, it seems like the <IdList> is a list of returned results, and variants that don’t have results don’t have any <Id> records returned.
You can use this.
Putting everything into an R script
Googling how to use API in r gives the following tutorial: R API Tutorial: Getting Started with APIs in R
They recommend using the httr package to make HTTP requests, so let’s go with that.
The output is processed with jsonlite, but we are dealing with XML, not JSON, so we’ll leave that.
Besides, you just need to check whether the output contains <Id> or not, so you don’t need to understand (parse) the XML anyway.
Running a prototype API query
library(httr)
eUtilsUrl <- paste0('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=',
                    'RS28897689',
                    '+AND+clinsig_pathogenic[properties]')
output <- GET(eUtilsUrl)
Now, you have the response from the server in output.
Reading the help for GET(), the output is a response object.
The body of that response is retrieved using the content() function (naming the encoding explicitly avoids a message about httr defaulting to UTF-8):
outputText <- content(output, as = 'text', encoding = 'UTF-8')
outputText looks like this:
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n<!DOCTYPE eSearchResult PUBLIC \"-//NLM//DTD esearch 20060628//EN\" \"https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd\">\n<eSearchResult><Count>1</Count><RetMax>1</RetMax><RetStart>0</RetStart><IdList>\n<Id>531295</Id>\n</IdList><TranslationSet/><TranslationStack> <TermSet> <Term>RS28897689[All Fields]</Term> <Field>All Fields</Field> <Count>2</Count> <Explode>N</Explode> </TermSet> <TermSet> <Term>clinsig_pathogenic[properties]</Term> <Field>properties</Field> <Count>175625</Count> <Explode>N</Explode> </TermSet> <OP>AND</OP> </TranslationStack><QueryTranslation>RS28897689[All Fields] AND clinsig_pathogenic[properties]</QueryTranslation></eSearchResult>\n"
Now, you can check whether outputText contains <Id> using stringr:
library(stringr)
if (str_detect(outputText, '<Id>')) {
  print('Ya!')
} else {
  print('Na...')
}
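If you later want the record IDs themselves rather than a yes/no answer, a base R regular expression is enough for output this simple. The sketch below works on shortened copies of the response text; extractIds is a hypothetical helper name, the shape of the empty response is an assumption, and a real XML parser would be the safer choice if the response format were more complicated.

```r
# Hypothetical helper: pull every <Id>...</Id> value out of an esearch response.
extractIds <- function(xmlText) {
  # gregexpr() finds all matches; regmatches() returns them as a list
  hits <- regmatches(xmlText, gregexpr('<Id>[0-9]+</Id>', xmlText))[[1]]
  # Strip the surrounding tags, leaving just the numeric IDs as strings
  gsub('</?Id>', '', hits)
}

# Shortened copy of the response shown above, plus an assumed empty response
withHit <- '<eSearchResult><Count>1</Count><IdList>\n<Id>531295</Id>\n</IdList></eSearchResult>'
noHit <- '<eSearchResult><Count>0</Count><IdList/></eSearchResult>'

ids <- extractIds(withHit)
print(ids)
print(length(extractIds(noHit)) > 0)
```

Checking length(extractIds(outputText)) > 0 gives the same yes/no answer as the str_detect() test, while keeping the IDs available for follow-up queries.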
Putting it all together
Let’s say you want to add a column with whether a variant is associated with any ClinVar records that are “pathogenic”. You can add an empty column for the ClinVar sigs, and populate that with a function:
# Add an empty column for results
vars['ClinVarSig'] <- NA
library(httr)
library(stringr)
# Function which takes a patient ID, and checks the variant in that row for ClinSigs
isItPathogenic <- function(patient) {
  # Build the URL, request it, and read the response body as text
  eUtilsUrl <- paste0('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=',
                      vars[patient, 'Variant'],
                      '+AND+clinsig_pathogenic[properties]')
  output <- GET(eUtilsUrl)
  outputText <- content(output, as = 'text', encoding = 'UTF-8')
  # Return 'Y' if any ClinVar record ID came back, 'N' otherwise
  if (str_detect(outputText, '<Id>')) 'Y' else 'N'
}
# Run the function on all rows and store the answers in the new column
# (returning a value avoids modifying the global data frame with <<-)
vars$ClinVarSig <- vapply(vars$Patient, isItPathogenic, character(1), USE.NAMES = FALSE)
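For thousands of variants it may be worth pausing between requests and surviving the odd failed one. The sketch below is one possible shape for that, not the method used above: checkVariants, fetchText and the pause argument are illustrative names, and the default pause assumes NCBI's published guideline of at most three requests per second without an API key. The download step is passed in as a function so that the logic can be tried out without touching the network.

```r
# Default download step (assumes the httr package is installed)
fetchText <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)  # turn HTTP errors (4xx/5xx) into R errors
  httr::content(resp, as = 'text', encoding = 'UTF-8')
}

# Hypothetical helper: returns 'Y', 'N', or NA (request failed) per variant
checkVariants <- function(variants, fetch = fetchText, pause = 0.34) {
  vapply(variants, function(variant) {
    url <- paste0('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=',
                  variant, '+AND+clinsig_pathogenic[properties]')
    Sys.sleep(pause)  # stay under the assumed request rate limit
    tryCatch(
      if (grepl('<Id>', fetch(url), fixed = TRUE)) 'Y' else 'N',
      error = function(e) NA_character_
    )
  }, character(1), USE.NAMES = FALSE)
}

# Offline try-out with a stand-in for the server
fakeFetch <- function(url) {
  if (grepl('RS28897689', url, fixed = TRUE)) {
    '<IdList><Id>531295</Id></IdList>'
  } else if (grepl('BROKEN', url, fixed = TRUE)) {
    stop('simulated network failure')
  } else {
    '<IdList/>'
  }
}
demo <- checkVariants(c('RS28897689', 'RS3731249', 'BROKEN'), fetch = fakeFetch, pause = 0)
print(demo)
```

With the real server, the call would be something like vars$ClinVarSig <- checkVariants(vars$Variant), leaving NA wherever a request failed so that those rows can be retried later.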