Downloading EnteroBase genotyping schemes through the API

Many API users want to fetch the entire catalog of allele profiles and sequences for a given genotyping scheme. Some schemes, such as wgMLST, are about 1 GB and would be very slow to download by walking through the API as you would for other data (e.g. strain metadata). We therefore aim to provide daily ‘dumps’ of the entire database so that users can quickly capture its current state.

For users who wish to synchronize with EnteroBase, we recommend a workflow of:

  1. Initially download the daily dump of all data (.tar.gz).
  2. Append new information by polling EnteroBase at regular intervals through the main REST API.

Step 1. What are the schemes?

A simple request to the schemes endpoint returns a description of each scheme in EnteroBase, including a link to the static download for ST profiles. To fetch only the download link, use the ‘only_fields’ parameter: ‘?only_fields=download_sts_link’. The request below returns all fields for every scheme in the senterica database:

http://enterobase.warwick.ac.uk/api/v2.0/senterica/schemes?limit=1000
{
    "Schemes": [
        {
            "created": "2015-08-26T15:04:34.033635+00:00",
            "download_sts_link": "http://enterobase.warwick.ac.uk/download_data?allele=profiles&scheme=UoW&species=Salmonella",
            "label": "Achtman 7 Gene",
            "lastmodified": "2015-12-07T17:50:17.186416+00:00",
            "scheme_barcode": "SAL_AA0001AA_SC",
            "scheme_name": "MLST_Achtman",
            "version": 1
        },
        {
            "created": "2015-12-30T15:39:05.748726+00:00",
            "download_sts_link": "http://enterobase.warwick.ac.uk/download_data?allele=profiles&scheme=cgMLSTv4&species=Salmonella",
            "label": "cgMLST(3020) Beta",
            "lastmodified": "2015-12-30T15:39:05.748726+00:00",
            "scheme_barcode": "SAL_AA0010AA_SC",
            "scheme_name": "cgMLST",
            "version": 4
        },
        {
            "created": null,
            "download_sts_link": "http://enterobase.warwick.ac.uk/download_data?allele=profiles&scheme=wgMLSTv1&species=SALwgMLST",
            "label": "wgMLST",
            "lastmodified": null,
            "scheme_barcode": null,
            "scheme_name": "wgMLST",
            "version": null
        },
        {
            "created": null,
            "download_sts_link": "http://enterobase.warwick.ac.uk/download_data?allele=profiles&scheme=cgMLSTv1&species=SALwgMLST",
            "label": "cgMLST V2",
            "lastmodified": null,
            "scheme_barcode": null,
            "scheme_name": "cgMLST_v2",
            "version": null
        }
    ],
    "links": {
        "paging": {},
        "records": 4,
        "total_records": 4
    }
}
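
As a minimal sketch of this step (Python 3 standard library only; the helper names schemes_url and sts_links are illustrative and not part of EnteroBase), the query URL could be built and the download links pulled out of a parsed response of the shape shown above:

```python
from urllib.parse import urlencode

SERVER_ADDRESS = 'http://enterobase.warwick.ac.uk'

def schemes_url(database, limit=1000, only_fields=None):
    """Build a schemes query URL (hypothetical helper, not an EnteroBase function)."""
    params = {'limit': limit}
    if only_fields:
        # Field names are comma-delimited, so leave ',' unescaped
        params['only_fields'] = ','.join(only_fields)
    return '%s/api/v2.0/%s/schemes?%s' % (
        SERVER_ADDRESS, database, urlencode(params, safe=','))

def sts_links(response):
    """Return the download_sts_link of every scheme record that has one."""
    return [record['download_sts_link']
            for record in response.get('Schemes', [])
            if record.get('download_sts_link')]

# A cut-down response with the same shape as the one shown above
example = {'Schemes': [
    {'download_sts_link': SERVER_ADDRESS
     + '/download_data?allele=profiles&scheme=UoW&species=Salmonella'},
    {'download_sts_link': None},
]}
print(schemes_url('senterica', only_fields=['download_sts_link']))
print(sts_links(example))
```

Records without a link are skipped, so the caller only sees URLs that can actually be followed.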

Note that all downloads go through a single URL (http://enterobase.warwick.ac.uk/download_data), with the scheme and the species (database) passed as parameters.

Scheme name     Scheme description
MLST_Achtman    Achtman 7 Gene
cgMLSTv1        cgMLST version 1 (Beta) - deprecated
wgMLSTv1        Whole genome MLST (~21K)
cgMLSTv2        cgMLST version 2
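
Because every download link is the same URL with different query parameters, a link can be taken apart or reassembled with the standard library. The sketch below is Python 3; describe_download and download_link are hypothetical helper names, and whether the server accepts hand-built combinations of parameters is an assumption, so links returned by the API are the safer choice.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

DOWNLOAD_URL = 'http://enterobase.warwick.ac.uk/download_data'

def describe_download(link):
    """Return the query parameters of a download link as a plain dict."""
    query = parse_qs(urlsplit(link).query)
    # parse_qs gives a list per key; each key is expected only once here
    return {key: values[0] for key, values in query.items()}

def download_link(allele, scheme, species):
    """Rebuild a download link from its three parameters (illustrative only)."""
    query = urlencode([('allele', allele), ('scheme', scheme), ('species', species)])
    return DOWNLOAD_URL + '?' + query

link = DOWNLOAD_URL + '?allele=profiles&scheme=UoW&species=Salmonella'
print(describe_download(link))
print(download_link('aroC', 'UoW', 'Salmonella'))
```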

Step 2. Downloading the ST profile tarball

If you follow the ‘download_sts_link’, even in your browser, you will download a tar.gz file of the ST profiles.

The following Python 2 snippet illustrates Step 1 and then downloads the tarball. Remember to open the output file in binary mode (‘wb’).

from urllib2 import HTTPError
import urllib2
import base64
import json
import os

# You must have a valid API Token
API_TOKEN = os.getenv('ENTEROBASE_API_TOKEN', None)
SERVER_ADDRESS = 'http://enterobase.warwick.ac.uk'
DATABASE = 'senterica'
scheme = 'MLST_Achtman'

def __create_request(request_str):
    # The API token is sent as the user name of an HTTP basic-auth header
    request = urllib2.Request(request_str)
    base64string = base64.encodestring('%s:%s' % (API_TOKEN,'')).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)
    return request

# Step 1: ask the schemes endpoint for the download link only
address = SERVER_ADDRESS + '/api/v2.0/%s/schemes?scheme_name=%s&limit=%d&only_fields=download_sts_link' %(DATABASE, scheme, 4000)

# Create the output directory unless it already exists
if not os.path.exists(scheme):
    os.mkdir(scheme)
try:
    response = urllib2.urlopen(__create_request(address))
    data = json.load(response)
    for scheme_record in data['Schemes']:
        profile_link = scheme_record.get('download_sts_link', None)
        if profile_link:
            # Step 2: download the tarball and write it as binary
            response = urllib2.urlopen(profile_link)
            with open(os.path.join(scheme, 'MLST-profiles.gz'), 'wb') as output_profile:
                output_profile.write(response.read())
except HTTPError as Response_error:
    print '%d %s. <%s>\n Reason: %s' %(Response_error.code,
                                                      Response_error.msg,
                                                      Response_error.geturl(),
                                                      Response_error.read())
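
The snippet above targets Python 2 (urllib2, base64.encodestring). For readers on Python 3, the request helper might look roughly like the sketch below. It assumes the same scheme as above, i.e. the API token sent as the basic-auth user name with an empty password, and reads the token from the ENTEROBASE_API_TOKEN environment variable; the function name create_request is illustrative.

```python
import base64
import os
import urllib.request

API_TOKEN = os.getenv('ENTEROBASE_API_TOKEN', '')

def create_request(url, token=API_TOKEN):
    """Return a Request that carries the token as the basic-auth user name."""
    # "<token>:" means user name = token, password = empty string
    credentials = ('%s:' % token).encode('utf-8')
    encoded = base64.b64encode(credentials).decode('ascii')
    request = urllib.request.Request(url)
    request.add_header('Authorization', 'Basic %s' % encoded)
    return request

request = create_request(
    'http://enterobase.warwick.ac.uk/api/v2.0/senterica/schemes?limit=1000')
print(request.full_url)
```

The request would then be passed to urllib.request.urlopen in place of urllib2.urlopen, and HTTPError imported from urllib.error.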

Step 3. Fetching the Alleles

Steps 1 and 2 give you the allele profiles (each ST and its vector of allele numbers). The allele sequences are fetched through the Loci endpoint, and the same principle applies to downloading the allele sequence tarball.
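
Once a tarball has been saved to disk, it could be unpacked along the lines of the Python 3 sketch below. This assumes the download really is a gzip-compressed tar archive, as described above; if it were a plain gzip file, tarfile.open would raise tarfile.ReadError. The function name unpack_tarball and the file names are illustrative.

```python
import io
import os
import shutil
import tarfile
import tempfile

def unpack_tarball(archive_path, target_dir):
    """Extract the regular files of a .tar.gz archive and return their paths."""
    extracted = []
    target_root = os.path.realpath(target_dir)
    with tarfile.open(archive_path, 'r:gz') as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links and other special entries
            destination = os.path.realpath(os.path.join(target_root, member.name))
            # Refuse entries that would be written outside target_dir
            if os.path.commonpath([target_root, destination]) != target_root:
                continue
            os.makedirs(os.path.dirname(destination), exist_ok=True)
            with archive.extractfile(member) as source, open(destination, 'wb') as out:
                shutil.copyfileobj(source, out)
            extracted.append(destination)
    return extracted

# Self-contained demonstration on a tiny archive built in a temporary directory
workdir = tempfile.mkdtemp()
demo_archive = os.path.join(workdir, 'demo.tar.gz')
payload = b'ST\taroC\tdnaN\n1\t1\t1\n'
with tarfile.open(demo_archive, 'w:gz') as archive:
    info = tarfile.TarInfo('profiles.txt')
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
print(unpack_tarball(demo_archive, os.path.join(workdir, 'unpacked')))
```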

If you are interested in both the allele sequences and the allele numbers, we recommend a workflow such as:

  1. Query ‘Schemes’ for all schemes
  2. For each scheme:
    1. Download the ST profile tarball
    2. Query ‘Loci’ for all Loci in the scheme
    3. Download the Allele sequence tarball

For example, a request for the loci in the Achtman 7 gene MLST scheme:

http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/loci?limit=50&scheme=MLST_Achtman

would give you results like this:

{
  "links": {
    "paging": {},
    "records": 7,
    "total_records": 7
  },
  "loci": [
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/download_data?allele=aroC&scheme=UoW&species=Salmonella",
      "locus": "aroC",
      "locus_barcode": "SAL_AA0001AA_LO",
      "scheme": "UoW"
    },
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/download_data?allele=dnaN&scheme=UoW&species=Salmonella",
      "locus": "dnaN",
      "locus_barcode": "SAL_AA0002AA_LO",
      "scheme": "UoW"
    },
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/download_data?allele=hemD&scheme=UoW&species=Salmonella",
      "locus": "hemD",
      "locus_barcode": "SAL_AA0003AA_LO",
      "scheme": "UoW"
    },
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/download_data?allele=hisD&scheme=UoW&species=Salmonella",
      "locus": "hisD",
      "locus_barcode": "SAL_AA0004AA_LO",
      "scheme": "UoW"
    },
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/download_data?allele=purE&scheme=UoW&species=Salmonella",
      "locus": "purE",
      "locus_barcode": "SAL_AA0005AA_LO",
      "scheme": "UoW"
    },
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/download_data?allele=sucA&scheme=UoW&species=Salmonella",
      "locus": "sucA",
      "locus_barcode": "SAL_AA0006AA_LO",
      "scheme": "UoW"
    },
    {
      "database": "Salmonella",
      "download_alleles_link": "http://enterobase.warwick.ac.uk/download_data?allele=thrA&scheme=UoW&species=Salmonella",
      "locus": "thrA",
      "locus_barcode": "SAL_AA0007AA_LO",
      "scheme": "UoW"
    }
  ]
}
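
A response of this shape can be turned into a list of files to fetch for one scheme. The Python 3 sketch below is only an illustration of the workflow in this step; allele_links and download_queue are hypothetical helper names, and the example data is a cut-down copy of the response above.

```python
def allele_links(loci_response):
    """Map each locus name to its download_alleles_link."""
    return {record['locus']: record['download_alleles_link']
            for record in loci_response.get('loci', [])
            if record.get('locus') and record.get('download_alleles_link')}

def download_queue(sts_link, loci_response):
    """List the URLs for one scheme: the ST profiles first, then one per locus."""
    queue = [sts_link] if sts_link else []
    queue.extend(allele_links(loci_response).values())
    return queue

base = 'http://enterobase.warwick.ac.uk/download_data'
loci_example = {'loci': [
    {'locus': 'aroC',
     'download_alleles_link': base + '?allele=aroC&scheme=UoW&species=Salmonella'},
    {'locus': 'dnaN',
     'download_alleles_link': base + '?allele=dnaN&scheme=UoW&species=Salmonella'},
]}
profiles = base + '?allele=profiles&scheme=UoW&species=Salmonella'
for url in download_queue(profiles, loci_example):
    print(url)
```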

Step 4. Keeping in Sync

Once you have the static files, you may wish to keep polling EnteroBase to stay up to date. A simple request to the alleles endpoint, specifying the scheme, the locus and the number of days since your last update (the “reldate” parameter, for “relative date”), returns a list of allele sequences.

For example, suppose that we are interested in new allele sequences for aroC in the 7 gene MLST scheme for Salmonella in the last 20 days:

http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/alleles?reldate=20&locus=aroC&limit=50
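
The reldate value can be derived from the date of your last synchronization. A minimal Python 3 sketch follows, with hypothetical helper names reldate_since and alleles_url. How EnteroBase treats the boundary day is not stated here, so a cautious client might add a day of overlap and discard records it already holds.

```python
from datetime import date
from urllib.parse import urlencode

SERVER_ADDRESS = 'http://enterobase.warwick.ac.uk'

def reldate_since(last_sync, today=None):
    """Whole days between last_sync and today, never less than 1."""
    today = today or date.today()
    return max((today - last_sync).days, 1)

def alleles_url(database, scheme, locus, reldate, limit=50):
    """Build an alleles query for one locus, limited to the last reldate days."""
    query = urlencode([('reldate', reldate), ('locus', locus), ('limit', limit)])
    return '%s/api/v2.0/%s/%s/alleles?%s' % (SERVER_ADDRESS, database, scheme, query)

days = reldate_since(date(2017, 1, 18), today=date(2017, 2, 7))
print(alleles_url('senterica', 'MLST_Achtman', 'aroC', days))
```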

Alternatively, fetching the new STs in the last 20 days would be a request such as:

http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/sts?scheme=MLST_Achtman&show_alleles=false&limit=5&reldate=20

which would give you results like this:

{
  "STs": [
    {
      "ST_id": "3767",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7319AA_ST",
      "create_time": "2017-02-04 05:22:02.847270",
      "info": null,
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA3953AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7319AA_ST"
    },
    {
      "ST_id": "3768",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7322AA_ST",
      "create_time": "2017-02-04 07:31:04.645023",
      "info": {
        "lineage": "",
        "st_complex": "61",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA3967AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7322AA_ST"
    },
    {
      "ST_id": "3769",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7346AA_ST",
      "create_time": "2017-02-07 01:22:57.583287",
      "info": {
        "lineage": "",
        "st_complex": "401",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4517AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7346AA_ST"
    },
    {
      "ST_id": "3770",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7347AA_ST",
      "create_time": "2017-02-07 07:50:52.782618",
      "info": {
        "lineage": "",
        "st_complex": "65",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4540AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7347AA_ST"
    },
    {
      "ST_id": "3771",
      "barcode_link": "http://enterobase.warwick.ac.uk/api/v1.0/lookup?barcode=SAL_GB7348AA_ST",
      "create_time": "2017-02-07 10:23:56.025904",
      "info": {
        "lineage": "",
        "st_complex": "205",
        "subspecies": ""
      },
      "reference": {
        "lab_contact": "public",
        "refstrain": "SAL_QA4606AA_AS",
        "source": "mlst.warwick.ac.uk"
      },
      "scheme": "UoW",
      "st_barcode": "SAL_GB7348AA_ST"
    }
  ],
  "links": {
    "paging": {
      "next": "http://enterobase.warwick.ac.uk/api/v2.0/senterica/MLST_Achtman/sts?limit=5&offset=5&show_alleles=false&scheme=MLST_Achtman&reldate=20"
    },
    "records": 5,
    "total_records": 33
  }
}
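
Here ‘total_records’ (33) is larger than ‘records’ (5), so the remaining results have to be collected by following the ‘next’ link under ‘paging’. A minimal Python 3 sketch of that loop is given below. The function iter_records is a hypothetical helper, and fetch stands for any callable you supply that takes a URL and returns the parsed JSON (for instance one that sends the authenticated request and calls json.load on the response).

```python
def iter_records(first_url, fetch, key='STs'):
    """Yield every record, following links.paging.next until it is absent."""
    url, seen = first_url, set()
    while url and url not in seen:
        seen.add(url)  # guard against a 'next' link that points back
        page = fetch(url)
        for record in page.get(key, []):
            yield record
        url = page.get('links', {}).get('paging', {}).get('next')

# Two canned pages stand in for the server so the loop can be tried offline
pages = {
    'page1': {'STs': [{'ST_id': '3767'}, {'ST_id': '3768'}],
              'links': {'paging': {'next': 'page2'}}},
    'page2': {'STs': [{'ST_id': '3769'}],
              'links': {'paging': {}}},
}
for st in iter_records('page1', pages.__getitem__):
    print(st['ST_id'])
```

Passing the fetch function in keeps the paging logic separate from authentication and error handling.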

Downloading Assemblies

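Assemblies can be located through the straindata endpoint with a request of the following form, where %s is the database name and %d the maximum number of records:

http://enterobase.warwick.ac.uk/api/v2.0/%s/straindata?serotype=Agona&assembly_status=Assembled&limit=%d&only_fields=strain_name,download_fasta_link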

Key points to remember:

  • The straindata resource includes information about the Assemblies, Strain metadata and ST information. This allows us to search for assemblies where the strain metadata gives the serotype as ‘Agona’: ‘?serotype=Agona’.
  • The ‘only_fields’ parameter restricts the response to the fields you specify, making your queries much faster. Since we only want the link to download the FASTA file and the strain name (to rename our FASTA file), we use ‘&only_fields=strain_name,download_fasta_link’. Note that the fields are delimited by commas.
  • If you already have Assembly barcodes, you can fetch the assemblies through the Assemblies endpoint directly.

The following Python 2 script downloads the assembled genomes for the serotype ‘Agona’ into a ‘temp’ directory:

import os
import urllib2
import json
import base64
from urllib2 import HTTPError
import logging

# You must have a valid API Token
API_TOKEN = os.getenv('ENTEROBASE_API_TOKEN', None)
SERVER_ADDRESS = 'http://enterobase.warwick.ac.uk'
SEROTYPE = 'Agona'
DATABASE = 'senterica'

def __create_request(request_str):
    # The API token is sent as the user name of an HTTP basic-auth header
    request = urllib2.Request(request_str)
    base64string = base64.encodestring('%s:%s' % (API_TOKEN,'')).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)
    return request


# Create the output directory unless it already exists
if not os.path.exists('temp'):
    os.mkdir('temp')
# Request only the strain name and the FASTA download link
address = SERVER_ADDRESS + '/api/v2.0/%s/straindata?serotype=%s'\
    '&assembly_status=Assembled&limit=%d&only_fields=strain_name,download_fasta_link' \
    %(DATABASE, SEROTYPE, 40)
try:
    response = urllib2.urlopen(__create_request(address))
    data = json.load(response)
    # 'straindata' is a dictionary of records; each value holds the requested fields
    for record, record_values in data['straindata'].items():
        response = urllib2.urlopen(__create_request(record_values['download_fasta_link']))
        # Save each assembly as <strain_name>.fasta
        with open(os.path.join('temp', '%s.fasta' %record_values['strain_name']),'w') as out_ass:
            out_ass.write(response.read())
except HTTPError as Response_error:
    logging.error('%d %s. <%s>\n Reason: %s' %(Response_error.code,
                                              Response_error.msg,
                                              Response_error.geturl(),
                                              Response_error.read()))
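
Since the strain name becomes the file name, a name containing a path separator or other awkward characters could make the write fail or land in an unexpected place. The Python 3 sketch below shows one way to prepare the download targets. It assumes, as the loop above does, that ‘straindata’ is a dictionary of records; safe_filename and fasta_targets are hypothetical helper names and the barcodes in the example are made up.

```python
import os
import re

def safe_filename(strain_name):
    """Replace runs of characters outside A-Z, a-z, 0-9, '.', '_', '-' with '_'."""
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', strain_name).strip('._')
    return cleaned or 'unnamed'

def fasta_targets(response, out_dir='temp'):
    """Yield (download_fasta_link, local_path) for every record with a link."""
    for barcode, record in response.get('straindata', {}).items():
        link = record.get('download_fasta_link')
        if not link:
            continue  # nothing to download for this record
        # Fall back to the record key when no strain name was returned
        name = safe_filename(record.get('strain_name') or barcode)
        yield link, os.path.join(out_dir, name + '.fasta')

example = {'straindata': {
    'EXAMPLE_BARCODE_1': {'strain_name': 'SAL/Agona 01',
                          'download_fasta_link': 'http://example.org/a.fasta'},
    'EXAMPLE_BARCODE_2': {'strain_name': 'no_link_here'},
}}
for link, path in fasta_targets(example):
    print(link, path)
```

Two different strain names can map to the same cleaned name, so a real script may want to include the record key in the file name as well.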