Schemes

Schemes represent any analysis that is associated with an assembly of a strain entry, not just MLST. A Scheme needs to define the following three processes:

  • Sending an assembly to be analysed and obtaining/storing the results
  • Retrieving and searching data from the scheme
  • Displaying the data

The scheme is described in the scheme table of the species database and the scheme's fields are defined in the data_param table. The data for each assembly/scheme combination is stored in the assembly_lookup table. For MLST schemes, the ST barcode is stored in the ST_barcode column, so that the allele and ST data can be retrieved from nserv when requested. For non-MLST schemes, data can be stored in this table under the 'results' key in the other_data column. If the scheme is very simple, the actual data can be stored here, e.g. the O and H types in E. coli serotype prediction. However, this is usually used to store a pointer to a file or a key to a more complex data structure stored elsewhere.
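As a minimal sketch of the layout just described, the other_data payload for one assembly/scheme row of a simple scheme might look like the following. The field names (O, H) follow the serotype example above; the exact payload shape is an assumption for illustration.

```python
import json

# Hypothetical other_data payload for one assembly/scheme row in
# assembly_lookup: a simple scheme stores its values directly under
# the 'results' key (field names here follow the serotype example).
other_data = json.dumps({"results": {"O": "O45", "H": "H10"}})

# Reading the data back when the scheme is queried:
results = json.loads(other_data)["results"]
print(results["O"])  # the stored O type
```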

Scheme Parameters (scheme table)

  • description - This is actually the name of the scheme
  • name - The human readable label
  • param - A json column describing the scheme
    • display Can be public, private or none. Public schemes are always accessible on the main search page, whereas private schemes are only visible to those with permission. A value of none means the scheme is not displayed at all in the conventional way e.g. snp_calls
    • js_grid The name of the grid class that will be used to display the scheme. If not supplied then BaseExperimentGrid will be used
    • pipeline The name of the pipeline that nserv uses to run the job. If no pipeline is specified, then the scheme will not automatically be called on new assemblies
    • scheme The name of the MLST scheme used by nserv. If no scheme is specified, this implies the scheme is not an MLST scheme
    • barcode For MLST schemes, this describes where the ST barcode can be found in the json log output from nserv, e.g. a value of ST,1 means the barcode is the second item of a list with the key ST
    • query_method The name of the method in entero.ExtraFuncs.query_functions.py which will either supply data from the scheme for each specified strain, or run a query on the scheme and return the data and associated strains e.g. process_small_scheme_query
    • summary The field in the scheme that can be used to get a summary breakdown of the scheme. For example, in 7 gene MLST schemes it could be the ST or for serotype prediction, the O group. If not appropriate for the scheme then this can be absent
    • job_rate If this is specified then this is the number of jobs sent for this scheme when the script update_all_schemes is run on the database. If missing, the number supplied when running the script will be used
    • alternate_scheme If a scheme is a subset of another scheme, then alternate_scheme is the name of the subscheme. For example, Salmonella_CRISPR jobs return data for both the CRISPR and CRISPOL schemes, therefore CRISPOL will have scheme as Salmonella_CRISPR and alternate_scheme as Salmonella_CRISPOL. This way, only one job (for the CRISPR scheme) will be sent, but when data is returned, data for both the CRISPR and CRISPOL schemes will be processed
    • params For non-MLST (generic) schemes, these are the extra parameters required by the nserv pipeline. For example, the E. coli annotation scheme has the following: "taxon":"Escherichia","prefix":"{barcode}". If a value is enclosed in curly braces (barcode in this example), it is replaced by the value of that field in the assembly.
    • input_type The input to the pipeline. Can either be ‘assembly’ or ‘reads’
    • main_scheme if ‘true’ then this will be the main scheme used for calling genes - default is wgMLST
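The curly-brace substitution described for params above can be sketched as follows. The helper name build_pipeline_params and the example barcode value are assumptions for illustration; only the params/assembly shapes come from the text.

```python
# Hypothetical helper showing the curly-brace substitution described
# above: a param value such as "{barcode}" is replaced by that field's
# value on the assembly being processed.
def build_pipeline_params(scheme_params, assembly):
    resolved = {}
    for key, value in scheme_params.items():
        if value.startswith("{") and value.endswith("}"):
            # "{barcode}" -> assembly["barcode"]
            resolved[key] = assembly[value[1:-1]]
        else:
            resolved[key] = value
    return resolved

# The E. coli annotation scheme params from the text; the barcode
# value is made up for the example.
params = {"taxon": "Escherichia", "prefix": "{barcode}"}
assembly = {"barcode": "ESC_AA1234AA"}
print(build_pipeline_params(params, assembly))
# {'taxon': 'Escherichia', 'prefix': 'ESC_AA1234AA'}
```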

ToDo

Remove all redundant columns from the scheme table

Scheme Fields (data_param table)

The columns that are relevant to schemes in the data_param table are

  • tabname - The name of the scheme (the description field in the scheme table)
  • name - The database-friendly name of the field
  • mlst_field - The position where the value can be found in the json output of nserv, e.g. for ST complex in Salmonella rMLST this is ‘info,st_complex’. This is not required for loci, as these can be retrieved automatically
  • display_order - The order in the table that the field will be displayed
  • nested_order - For compound fields, the sub-order. For normal fields this should be 0. For loci, this specifies the order of the locus
  • label - The name that the user will see
  • datatype - The datatype of the field; can be integer, double or text
  • group_name - Only required for compound fields, this is the name describing the group of sub-fields. For MLST schemes, each locus should have ‘Locus’ for this field
  • param - A json column with extra parameters in key:value format
    • not_searchable If true then the field is either too complex or not relevant to be used to search for scheme information, e.g. the GFF or GBK file in the annotation scheme

ToDo A lot of columns can be moved from the main table and placed in the param column

Examples

Scheme Based on an NCBI API call

In this example, the ‘scheme’, which has been added to the miu database, simply shows the NCBI project name and project description. This is obtained from an API call to NCBI using the project accession field in Enterobase. Alternatively, the project description/name can be searched at NCBI and the returned project accessions used to retrieve records from Enterobase. Although not that useful in itself, this example shows how a scheme can get data via a third party. All that is required is an entry in each of the scheme and data_param tables and a single method in the query_functions module.

(1) Add details of the scheme/experiment to the scheme table. We need to add the description, which is actually the key (don’t ask why) and a name (that the user will see). The param column describes the scheme in json format. In this column we only need to specify two things, ‘display’ and ‘process_method’. The display is public so anybody can see/query this scheme, and the process_method points to the method in entero/ExtraFuncs/query_functions which will contain all the code for the scheme. This is not an MLST scheme, so ‘scheme’ is not needed, and we do not call any job to populate each entry, so ‘pipeline’ is not required

description    name          param
project_info   Project Info  {"display":"public","process_method":"process_project_info_query"}

(2) Next, add the fields in the scheme to the data_param table. We only have two fields, project_name and project_description, which both have the datatype text. tabname is actually the name of the scheme/experiment, which in this case is project_info, and label is what the user will see. nested_order should be 0, unless you are going to have compound fields.

tabname       name                 display_order  nested_order  label         datatype
project_info  project_name         1              0             Project Name  text
project_info  project_description  2              0             Description   text

(3) Finally we need the actual method. This method accepts five parameters. database and exp are just the names of the database and scheme/experiment respectively (sometimes the same method can be used for different schemes). query will be the SQL-like query, although this depends on the value of query_type, which is either ‘query’ (the default) or ‘assembly_ids’. If it is the latter, the query parameter will contain a string of comma-delimited assembly ids. The method returns a dictionary mapping assembly ids to a dictionary of key/value pairs for the fields specified in the scheme, and a list of assembly ids.

import requests
import ujson

def process_project_info_query(database,exp,query,query_type,extra_data=None):
    '''A toy method to demonstrate how to write a query method. The arguments and
    return values are standard. dbhandle (the per-database connection handle used
    below) is assumed to be available in this module.
    '''
    search_url= "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    summary_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    study_to_aid={}
    return_data={}
    return_list=[]

    #need to get project information from the assembly_ids supplied
    if query_type== "assembly_ids":
        #get all the study accessions from the strain database and link them to assembly id
        study_accessions=[]
        sql = 'SELECT study_accession AS  sa, best_assembly AS aid FROM strains WHERE best_assembly in (%s)' % query
        results = dbhandle[database].execute_query(sql)
        for item in results:
            study= item['sa']
            if not study:
                continue
            study_accessions.append(study)
            aids = study_to_aid.get(study)
            if not aids:
                aids=[]
                study_to_aid[study]=aids
            aids.append(item['aid'])
        #params for ncbi search are just all the project accessions
        params={"db":"bioproject","term":" OR ".join(study_accessions),"retmode":"json","retmax":1000}

    #need to search ncbi projects based on the query and retrieve any relevant records in Enterobase
    else:
        #Try and change the query into an api call to ncbi. The query is in SQL-like text, so this can be a bit
        #tricky. Here, we simply take the third word, remove any punctuation and use it as the term to
        #search ncbi
        arr=query.split()
        term = arr[2].replace("'","").replace("%","")
        #Another hack to narrow down search results - this should have been a value from the current database,
        #but since this example is in the test MIU database, this is not possible
        term=term+" AND Salmonella"
        params={"db":"bioproject","term":term,"retmode":"json"}

    #actually do the search and retrieve the record ids
    results = requests.post(search_url,data=params)
    data = ujson.loads(results.text)
    ids =data['esearchresult']['idlist']

    #use the ids to actually get the data
    params={"db":"bioproject","retmode":"json","id":",".join(ids),"retmax":1000}
    results=requests.post(summary_url,data=params)
    data=ujson.loads(results.text)

    #for all the project info, link it to assembly id using project id and the study_to_aid dictionary
    if query_type=='assembly_ids':
        for uid in data['result']:
            if uid == 'uids':
                continue
            info =data['result'][uid]
            aids =study_to_aid[info['project_acc']]
            for aid in aids:
                return_data[aid]={"project_name":info["project_name"],"project_description":info['project_description']}
                return_list.append(aid)

    # need to link the project  info obtained in the search to assembly ids
    else:
        project_to_info={}
        project_list=[]
        #go through returned data and get all project accessions and link them to project info
        for uid in data['result']:
            if uid == 'uids':
                continue
            info =data['result'][uid]
            project_list.append(info['project_acc'])
            project_to_info[info['project_acc']] ={"project_name":info["project_name"],"project_description":info['project_description']}
        #search strain database for the project accessions and get assembly ids
        projects= "('"+"','".join(project_list)+"')"
        sql= "SELECT study_accession AS  sa, best_assembly AS aid FROM strains WHERE study_accession in %s" % projects
        results = dbhandle[database].execute_query(sql)
        #link project info to assembly ids using the project to info dictionary
        for item in results:
            return_data[item['aid']]=dict(project_to_info[item['sa']])
            return_list.append(item['aid'])


    return return_data,return_list

SeroPred (Generic Scheme)

SeroPred is a generic scheme, so the handling of jobs and the processing of queries is already taken care of; all that is needed is to add some parameters to the scheme and data_param tables

(1) The details in the scheme table look like this

description  name                 param
SeroPred     Serotype Prediction  {"input_type":"assembly","pipeline":"SeroPred","query_method":"process_generic_query","params":{"taxon":"Escherichia"},"display":"public","js_grid":"BaseExperimentGrid","summary":"O"}

The params are slightly more complicated, since this scheme specifies a CRobot task that has to be run each time a new assembly is added to the database. The pipeline describes this task, in this case SeroPred. The input_type specifies that the assembly (not the reads) is to be sent to CRobot, and ‘params’ states that the input parameter taxon should be Escherichia. The summary parameter states that the O (O Antigen) field should be used to summarize this scheme, e.g. in the pie chart on the database home page

(2) The details in the data_param table look like this

tabname   name  mlst_field  display_order  nested_order  label      datatype
SeroPred  O     log,O       1              0             O Antigen  text
SeroPred  H     log,H       2              0             H Antigen  text

The mlst_field describes how the data is harvested when CRobot returns data: it is the location in the json output where the value for the field can be found. For example, the returned output of SeroPred looks like this

{
    "comment": "null",
    "source_ip": "137.205.123.127",
    "tag": 2673082,
    "query_time": "2018-01-06 20:45:22.475990",
    "log": {
        "H": "H10",
        "O": "O45"
    },
    "_cpu": 1
}

hence log,O and log,H would specify how to retrieve the values for the O and H fields respectively
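A comma-delimited location like this (or the barcode spec ST,1 described earlier) could be resolved against the returned JSON roughly as follows. The helper name resolve_field and the ST example values are assumptions for illustration, not the actual EnteroBase code.

```python
# Hypothetical helper resolving a comma-delimited location such as
# "log,O" (or "ST,1" for a barcode) against the JSON returned by
# CRobot/nserv. Numeric components are treated as list indices,
# everything else as dict keys.
def resolve_field(output, location):
    value = output
    for part in location.split(","):
        if part.isdigit():
            value = value[int(part)]   # e.g. the "1" in "ST,1"
        else:
            value = value[part]        # e.g. the "log" in "log,O"
    return value

# The SeroPred output fragment from the text:
output = {"log": {"H": "H10", "O": "O45"}}
print(resolve_field(output, "log,O"))  # O45
print(resolve_field(output, "log,H"))  # H10

# A made-up MLST log, to show the "ST,1" barcode case:
print(resolve_field({"ST": ["310", "SAL_BA1234AA"]}, "ST,1"))  # SAL_BA1234AA
```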

(3) The Query Method. Because pipeline was specified in the scheme table, each time a new assembly is added a job will be sent to CRobot and the results handled (the values for each field will be put in the other_data column of the assembly_lookup table under the key results). Queries to the scheme can thus be handled with the process_generic_query method, which was specified in the scheme table. This method assumes all fields are stored under results in the other_data column of the assembly_lookup table
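The core of such a generic query method can be sketched as below. This is an illustrative stand-in, not the real process_generic_query: the rows list stands in for a database query on assembly_lookup, and the helper name is made up.

```python
import json

# Hypothetical sketch of a generic query handler: for each assembly,
# pull the fields out of the 'results' dict stored in
# assembly_lookup.other_data (the layout described above). The rows
# argument stands in for the result of a database query.
def generic_query(rows):
    return_data, return_list = {}, []
    for aid, other_data in rows:
        results = json.loads(other_data).get("results")
        if results:
            return_data[aid] = results
            return_list.append(aid)
    return return_data, return_list

# One fake assembly_lookup row, using the SeroPred values from the text:
rows = [(101, json.dumps({"results": {"O": "O45", "H": "H10"}}))]
data, aids = generic_query(rows)
print(data[101]["O"])  # O45
```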

Universal Schemes

These schemes should automatically be added to each database when a database is created using the [add_new_database](Adding a Database) script

assembly_stats

Displays the assembly information

snp_calls

This scheme is never displayed, but is used to store any gff files (containing snp information) for each assembly/reference combination. This prevents duplicate jobs being sent. The information is stored in the assembly_lookup table with the assembly_id and scheme_id as normal, but the st_barcode specifies the barcode of the reference assembly. The pointer to the gff file is stored in other_data under ‘file_pointer’; however, due to a bug it may be doubly json encoded. The id of the snp_calls scheme may differ in each database, so it has to be retrieved using description=snp_calls.
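Because of the double-encoding bug mentioned above, any code reading file_pointer needs to decode defensively. A minimal sketch (helper name and file path are made up for the example):

```python
import json

# The text notes file_pointer may be doubly JSON-encoded due to a bug;
# a defensive reader keeps unwrapping while the value still parses as
# a JSON string. This is a sketch, not the actual EnteroBase code.
def get_file_pointer(other_data):
    value = json.loads(other_data)["file_pointer"]
    while isinstance(value, str):
        try:
            value = json.loads(value)
        except ValueError:
            break  # not JSON: this is the real pointer
    return value

# Normal encoding (path is made up):
once = json.dumps({"file_pointer": "/data/snps/ref1_asm2.gff"})
# Buggy double encoding of the pointer value:
twice = json.dumps({"file_pointer": json.dumps("/data/snps/ref1_asm2.gff")})
print(get_file_pointer(once))   # /data/snps/ref1_asm2.gff
print(get_file_pointer(twice))  # /data/snps/ref1_asm2.gff
```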

ref_masker

Again, this scheme is not for display, but is used for storing a gff file for an assembly, containing information about the repeat regions in the genome that are masked for snp calling. The pointer to the gff file is stored in other_data under file_location.

prokka_annotation

This scheme is displayed and allows download of the gff or gbk files that have been generated for a genome. It requires two entries in the data_param table (gff_file and gbk_file). The pointers to the files are stored in the assembly_lookup table in other_data > results