Schemes

Schemes represent any analysis that is associated with an assembly of a strain entry, not just MLST. A scheme needs to define the following three processes:

Sending an assembly to be analysed and obtaining/storing the results
Retrieving and searching data from the scheme
Displaying the data

The scheme is described in the scheme table of the species database and the scheme’s fields are defined in the data_param table. The data for each assembly/scheme combination is stored in the assembly_lookup table. For MLST schemes, the ST barcode is stored in the ST_barcode column, so that the allele and ST data can be retrieved from nserv when requested. For non-MLST schemes, data can be stored in this table under the ‘results’ key in the other_data column. If the scheme is very simple, then the actual data can be stored here, e.g. the O and H types in E. coli serotype prediction. Usually, however, this is used to store a pointer to a file or a key to a more complex data structure stored elsewhere.
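
The two storage styles for non-MLST schemes can be sketched as follows. This is an illustration only: the text states that results live under the ‘results’ key, but the inner field names (o_type, h_type, file_pointer), the example values and the helper unpack_results are all hypothetical.

```python
import json

# Hypothetical other_data values for two assembly_lookup rows.
# Only the 'results' key comes from the text; its contents are assumed.
simple_row = json.dumps({"results": {"o_type": "O157", "h_type": "H7"}})
pointer_row = json.dumps({"results": {"file_pointer": "/path/to/annotation.gff"}})


def unpack_results(other_data_json):
    """Return the dictionary stored under 'results', or an empty dict if absent."""
    other_data = json.loads(other_data_json) if other_data_json else {}
    return other_data.get("results", {})


print(unpack_results(simple_row))   # the data itself
print(unpack_results(pointer_row))  # a pointer to data held elsewhere
```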

Scheme Parameters (scheme table)

description - This is actually the name of the scheme
name - The human readable label
param - A json column describing the scheme
- display Can be public, private or none. Public schemes are always accessible on the main search page, whereas private schemes are only visible to those with permission. A value of none means the scheme is not displayed at all in the conventional way e.g. snp_calls
- js_grid The name of the grid class that will be used to display the scheme. If not supplied then BaseExperimentGrid will be used
- pipeline The name of the pipeline that nserv uses to run the job. If no pipeline is specified, then the scheme will not automatically be called on new assemblies
- scheme The name of the MLST scheme used by nserv. If no scheme is specified then this implies that the scheme is not an MLST scheme
- barcode For MLST schemes, this describes where the ST barcode can be found in the json log output from nserv, e.g. a value of ST,1 means that the barcode is the second item of a list with the key ST
- query_method The name of the method in entero.ExtraFuncs.query_functions.py which will either supply data from the scheme for each specified strain, or run a query on the scheme and return the data and associated strains e.g. process_small_scheme_query
- summary The field in the scheme that can be used to get a summary breakdown of the scheme. For example, in 7 gene MLST schemes it could be the ST or for serotype prediction, the O group. If not appropriate for the scheme then this can be absent
- job_rate If this is specified then this is the number of jobs sent for this scheme when the script update_all_schemes is run on the database. If missing, then the number supplied when running the script will be used
- alternate_scheme Supposedly for when a scheme is a subset of another scheme such that only one job is required in order to return data for two schemes. Specified for CRISPR schemes but never implemented.
- depends_on_scheme Used when a scheme depends on the output from another scheme. Enterobase only runs the pipeline job for this scheme when the job for the parent scheme has been successfully completed. An example is AMR_finder being dependent on prokka_annotation
- params For non-MLST (generic) schemes, these are the extra parameters required by the pipeline. For example, the E. coli annotation scheme has the following: "taxon":"Escherichia","prefix":"{barcode}". If a value is enclosed in curly braces (barcode in this example), it stands for the value of that field in the assembly.
- input_type The input to the pipeline. Can either be ‘assembly’ or ‘reads’
- main_scheme If ‘true’ then this will be the main scheme used for calling genes. The default is wgMLST
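
Two of the parameters above, barcode and params, use small conventions that are easier to see in code. The sketch below is one assumed interpretation, not Enterobase's own implementation: value_at and fill_params are hypothetical helpers, and the log fragment and barcode values are invented for illustration.

```python
import json
import re


def value_at(log, locator):
    """Follow a comma-separated locator such as 'ST,1' through nested JSON.

    Each part is used as a list index when the current value is a list,
    otherwise as a dictionary key.
    """
    current = log
    for part in locator.split(","):
        if isinstance(current, list):
            current = current[int(part)]
        else:
            current = current[part]
    return current


def fill_params(params, assembly):
    """Replace '{field}' placeholders in string values with assembly fields."""
    def substitute(value):
        if not isinstance(value, str):
            return value
        return re.sub(r"\{(\w+)\}", lambda m: str(assembly[m.group(1)]), value)
    return {key: substitute(value) for key, value in params.items()}


# Hypothetical nserv log fragment and assembly record.
nserv_log = json.loads('{"ST": ["SAL_AA0001AA", "SAL_AA0001AA_ST"]}')
assembly = {"barcode": "ESC_AA0001AA_AS"}
scheme_params = {"taxon": "Escherichia", "prefix": "{barcode}"}

print(value_at(nserv_log, "ST,1"))           # second item of the list under ST
print(fill_params(scheme_params, assembly))  # prefix takes the assembly barcode
```

The same locator style would also cover a two-key path such as info,st_complex, since each part is applied to whatever the previous part returned.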

ToDo: Remove all redundant columns from the scheme table

Scheme Fields (data_param table)

The columns that are relevant to schemes in the data_param table are

tabname - The name of the scheme (the description field in the scheme table)
name - The database-friendly name of the field
mlst_field - The position where the value can be found in the json output of nserv, e.g. for ST complex in Salmonella rMLST this is ‘info,st_complex’. This is not required for loci, as these can be automatically retrieved
display_order - The order in the table that the field will be displayed
nested_order - For compound fields, the sub-order. For normal fields this should be 0. For loci, this specifies the order of the locus
label - The name that the user will see
datatype - The datatype of the field, which can be integer, double or text
group_name - Only required for compound fields, this is the name describing the group of sub-fields. For MLST schemes, each locus should have ‘Locus’ for this field
param - A json column with extra parameters in key:value format
- not_searchable If true then the field is either too complex or not relevant to be used to search for scheme information, e.g. the GFF or GBK file in the annotation scheme
- admin_only If present and set to true then this is a column that is only visible to administrators
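
As an illustration of how display_order, nested_order, group_name and admin_only could work together, the sketch below orders a handful of rows and groups compound fields under their group name. It is one possible reading of the columns described above; build_columns is a hypothetical helper and the rows are invented, so the real grid classes may behave differently.

```python
def build_columns(rows, is_admin=False):
    """Order data_param rows for display and group compound fields.

    Returns a list of (heading, [labels]) tuples. A simple field uses its own
    label as the heading; a compound field is listed under its group_name.
    """
    visible = [
        row for row in rows
        if is_admin or not row.get("param", {}).get("admin_only")
    ]
    visible.sort(key=lambda row: (row["display_order"], row["nested_order"]))
    columns = {}
    for row in visible:
        heading = row.get("group_name") or row["label"]
        columns.setdefault(heading, []).append(row["label"])
    return list(columns.items())


# Hypothetical data_param rows for a small MLST scheme, deliberately unsorted.
rows = [
    {"label": "dnaN", "display_order": 2, "nested_order": 2, "group_name": "Locus"},
    {"label": "Raw log", "display_order": 3, "nested_order": 0, "param": {"admin_only": True}},
    {"label": "ST", "display_order": 1, "nested_order": 0},
    {"label": "aroC", "display_order": 2, "nested_order": 1, "group_name": "Locus"},
]

print(build_columns(rows))
print(build_columns(rows, is_admin=True))
```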

ToDo: A lot of columns can be moved from the main table and placed in the param column

Examples

Scheme Based on an NCBI API call

In this example, the ‘scheme’, which has been added to the miu database, will simply show the NCBI project name and project description. This is obtained from an API call to NCBI using the project accession field in Enterobase. Alternatively, the project description/name can be searched at NCBI and the returned project accessions used to return records from Enterobase. Although not that useful, this example shows how a scheme can get data via a third party. All that is required is entries in the scheme and data_param tables and a single method in the query_functions module.

(1) Add details of the scheme/experiment to the scheme table. A description needs to be added, which is actually the key (don’t ask why), along with a name (which the user will see). The param column describes the scheme in json format. In this column two things need to be specified: ‘display’ and ‘process_method’. The display is public, so anybody can see/query this scheme, and the process_method points to the method in entero/ExtraFuncs/query_functions which holds all the code for querying the results. This is not an MLST scheme so ‘scheme’ is not needed, and we do not call any job to populate each entry so ‘pipeline’ is not required.

description	name	param
project_info	Project Info	{"display":"public","process_method":"process_project_info_query"}

(2) Next add the fields in the scheme to the data_param table. Only two fields are needed: project_name and project_description, which both have the datatype text. tabname is the name of the scheme/experiment, which in this case is project_info; name is the database-friendly field name and label is what the user will see. nested_order should be 0 unless the field is part of a compound field, whose sub-fields are viewed by clicking on the result in the GUI.

tabname	name	display_order	nested_order	label	datatype
project_info	project_name	1	0	Project Name	text
project_info	project_description	2	0	Description	text
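
Before inserting the rows, it can help to check them against the rules described earlier (valid display value, matching tabname, known datatype). The sketch below holds the same rows as Python data and runs such a check. check_scheme is a hypothetical helper written for this example and is not part of Enterobase.

```python
import json

VALID_DATATYPES = {"integer", "double", "text"}
VALID_DISPLAY = {"public", "private", "none"}

scheme_row = {
    "description": "project_info",
    "name": "Project Info",
    "param": '{"display":"public","process_method":"process_project_info_query"}',
}
field_rows = [
    {"tabname": "project_info", "name": "project_name", "display_order": 1,
     "nested_order": 0, "label": "Project Name", "datatype": "text"},
    {"tabname": "project_info", "name": "project_description", "display_order": 2,
     "nested_order": 0, "label": "Description", "datatype": "text"},
]


def check_scheme(scheme, fields):
    """Return a list of problems found; an empty list means the rows agree."""
    problems = []
    param = json.loads(scheme["param"])  # raises ValueError if the json is malformed
    if param.get("display") not in VALID_DISPLAY:
        problems.append("display must be public, private or none")
    for field in fields:
        if field["tabname"] != scheme["description"]:
            problems.append("%s: tabname does not match scheme description" % field["name"])
        if field["datatype"] not in VALID_DATATYPES:
            problems.append("%s: unknown datatype" % field["name"])
    return problems


print(check_scheme(scheme_row, field_rows))
```

Note that json.loads only accepts straight double quotes, so a param value pasted with typographic quotes would fail at this step.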

(3) Finally the query method itself is needed. This method accepts five parameters. database and exp are just the names of the database and scheme/experiment respectively (sometimes the same method can be used for different schemes). query is the SQL-like query, although this depends on the value of query_type, which is either ‘query’ (default) or ‘assembly_ids’. If it is the latter, then the query parameter contains a string of comma-delimited assembly ids. extra_data is optional and is not used in this example. The method returns two values: a dictionary mapping each assembly id to a dictionary of key/value pairs for the fields specified for the scheme, and a list of assembly ids.

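import requests
import ujson

#dbhandle (used below) maps a database name to its database handle; it is
#assumed to be available already in query_functions

def process_project_info_query(database,exp,query,query_type,extra_data=None):
    '''A toy method to demonstrate how to write a query method. The arguments and
    return values are standard
    '''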
    search_url= "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    summary_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    study_to_aid={}
    return_data={}
    return_list=[]

    #need to get project information from the assembly_ids supplied
    if query_type== "assembly_ids":
        #get all the study accessions from the strain database and link them to assembly id
        study_accessions=[]
        sql = 'SELECT study_accession AS  sa, best_assembly AS aid FROM strains WHERE best_assembly in (%s)' % query
        results = dbhandle[database].execute_query(sql)
        for item in results:
            study= item['sa']
            if not study:
                continue
            study_accessions.append(study)
            aids = study_to_aid.get(study)
            if not aids:
                aids=[]
                study_to_aid[study]=aids
            aids.append(item['aid'])
        #params for ncbi search are just all the project accessions
        params={"db":"bioproject","term":" OR ".join(study_accessions),"retmode":"json","retmax":1000}

    #need to search ncbi projects based on query and retrieve any relevant records in Enterobase
    else:
        #Try and change the query into an api call to ncbi. The query is in SQL like text, so this can be a bit
        #tricky. Here, simply take the third word, remove any punctuation and use this as the term to
        #search ncbi
        arr=query.split()
        term = arr[2].replace("'","").replace("%","")
        #Another hack to narrow down search results - this should have been a value from the current database,
        #but since this example is in the test MIU database, this is not possible
        term=term+" AND Salmonella"
        params={"db":"bioproject","term":term,"retmode":"json"}

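    #actually do the search and retrieve record ids
    results = requests.post(search_url,data=params)
    data = ujson.loads(results.text)
    #an empty or rejected search has no idlist, so return empty results rather than fail
    ids = data.get('esearchresult',{}).get('idlist',[])
    if not ids:
        return return_data,return_list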

    #use the ids to actually get the data
    params={"db":"bioproject","retmode":"json","id":",".join(ids),"retmax":1000}
    results=requests.post(summary_url,data=params)
    data=ujson.loads(results.text)

    #for all the project info, link it to assembly id using project id and the study_to_aid dictionary
    if query_type=='assembly_ids':
        for uid in data['result']:
            if uid == 'uids':
                continue
            info =data['result'][uid]
            #the search may return projects that were not asked for, so skip any unknown accession
            aids = study_to_aid.get(info['project_acc'],[])
            for aid in aids:
                return_data[aid]={"project_name":info["project_name"],"project_description":info['project_description']}
                return_list.append(aid)

    # need to link the project  info obtained in the search to assembly ids
    else:
        project_to_info={}
        project_list=[]
        #go through returned data and get all project accessions and link them to project info
        for uid in data['result']:
            if uid == 'uids':
                continue
            info =data['result'][uid]
            project_list.append(info['project_acc'])
            project_to_info[info['project_acc']] ={"project_name":info["project_name"],"project_description":info['project_description']}
        #search strain database for the project accessions and get assembly ids
        projects= "('"+"','".join(project_list)+"')"
        sql= "SELECT study_accession AS  sa, best_assembly AS aid FROM strains WHERE study_accession in %s" % projects
        results = dbhandle[database].execute_query(sql)
        #link project info to assembly ids using the project to info dictionary
        for item in results:
            return_data[item['aid']]=dict(project_to_info[item['sa']])
            return_list.append(item['aid'])

    return return_data,return_list
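
Two details of the method above can be tried without a database or network access: the way the free-text branch derives its search term, and the shape of the two return values. The sketch below is illustrative only. extract_term copies the hack used in the example, rows_in_order is a hypothetical consumer, and the assembly ids and project text are invented.

```python
def extract_term(query):
    """Mimic the example's hack: third word of the query, quotes and % removed."""
    words = query.split()
    return words[2].replace("'", "").replace("%", "")


def rows_in_order(return_data, return_list):
    """Pair each assembly id in return_list with its field dictionary."""
    return [(aid, return_data[aid]) for aid in return_list]


# Hypothetical values, shaped like the method's two return values.
return_data = {
    101: {"project_name": "Example project", "project_description": "Demo"},
    102: {"project_name": "Example project", "project_description": "Demo"},
}
return_list = [102, 101]

print(extract_term("project_name LIKE '%typhi%'"))
print(rows_in_order(return_data, return_list))
```

Because the term is taken from a whitespace split, a multi-word search value such as '%Salmonella Typhi%' would be cut down to its first word, which is one reason the example describes this step as a hack.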