Implemented Schemes¶

There are four categories of schemes that can be found on Enterobase:

Universal schemes: Common to all species

Sequence Type schemes: Support 7-gene, cgMLST, wgMLST and similar schemes. The core implementation is in NServ

Generic schemes: Standard implementation that allows new analyses to be supported in a generic way

AMR analysis schemes: Integrate antimicrobial resistance (AMR) applications into Enterobase

Universal Schemes¶

These schemes should be added automatically to each database when it is created using the [add_new_database](Adding a Database) script.

assembly_stats

Displays the assembly information.

snp_calls

This scheme is never displayed, but is used to store any GFF files (containing SNP information) for each assembly/reference combination, which prevents duplicate jobs from being sent. The information is stored in the assembly lookup table with the assembly_id and scheme_id as normal, but the st_barcode specifies the barcode of the reference assembly. The pointer to the GFF file is stored in other_data under ‘file_pointer’; however, due to a bug, it may be doubly JSON-encoded. The id of the snp_calls scheme may differ between databases, so it has to be retrieved using description=snp_calls.
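Because the value may have been JSON-encoded twice, code that reads the pointer has to tolerate both forms. The following is a minimal sketch, assuming other_data arrives either as a dict or as a JSON string that may itself wrap another JSON string. The helper names decode_other_data and get_snp_gff_pointer, and the example path, are hypothetical and not part of Enterobase.

```python
import json


def decode_other_data(raw, max_depth=3):
    """Decode a value that may have been JSON-encoded more than once.

    Keeps calling json.loads while the value is still a string, up to
    max_depth times, so singly and doubly encoded values are both handled.
    """
    value = raw
    for _ in range(max_depth):
        if not isinstance(value, str):
            break
        value = json.loads(value)
    return value


def get_snp_gff_pointer(raw_other_data):
    """Return the 'file_pointer' entry from other_data, or None if absent."""
    data = decode_other_data(raw_other_data)
    if not isinstance(data, dict):
        raise ValueError("other_data did not decode to a JSON object")
    return data.get("file_pointer")


# Illustrative values only: the path is made up for this example.
encoded_once = json.dumps({"file_pointer": "/example/snps.gff"})
encoded_twice = json.dumps(encoded_once)
print(get_snp_gff_pointer(encoded_once))
print(get_snp_gff_pointer(encoded_twice))
```

Both calls yield the same pointer, so callers do not need to know whether a given row was affected by the bug.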

ref_masker

Again, this scheme is not for display. It stores a GFF file for an assembly that describes the repeat regions in the genome, which are masked for SNP calling. The pointer to the GFF file is stored in other_data under file_location.

prokka_annotation

This scheme is displayed and will allow download of gff or gbk files that have been generated for a genome. It requires two entries in the data_param table (gff_file and gbk_file). The file pointers to the files are stored in the lookup table in other_data > results

Sequence Type Schemes¶

Generic Schemes¶

SeroPred¶

SeroPred is a generic scheme, so job handling and query processing are already taken care of. All that is needed is to add some parameters to the scheme and data_param tables.

(1) The details in the scheme table look like this

Heading¶
description	name	param
SeroPred	Serotype Prediction	{"input_type":"assembly","pipeline":"SeroPred","query_method":"process_generic_query","params":{"taxon":"Escherichia"},"display":"public","js_grid":"BaseExperimentGrid","summary":"O"}

The param value is slightly more complicated because this scheme specifies a CRobot task that has to be run each time a new assembly is added to the database. The pipeline key names this task, in this case SeroPred. The input_type key specifies that the assembly (not the reads) is to be sent to CRobot, and params states that the input parameter taxon should be Escherichia. The summary key states that the O (O Antigen) field should be used to summarize this scheme, e.g. in the pie chart on the database home page.
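To make the role of each key concrete, the sketch below parses the param value from the scheme table and pulls out the parts that define the CRobot task. It only illustrates how the JSON is structured. The function describe_scheme_job is hypothetical and does not reflect how Enterobase itself builds jobs.

```python
import json

# The param column from the scheme table above, written as valid JSON.
SEROPRED_PARAM = (
    '{"input_type": "assembly", "pipeline": "SeroPred", '
    '"query_method": "process_generic_query", '
    '"params": {"taxon": "Escherichia"}, "display": "public", '
    '"js_grid": "BaseExperimentGrid", "summary": "O"}'
)


def describe_scheme_job(param_json):
    """Return the parts of a scheme's param value that define its task.

    Hypothetical helper for illustration only.
    """
    param = json.loads(param_json)
    return {
        "pipeline": param.get("pipeline"),      # task to run on CRobot
        "input_type": param.get("input_type"),  # assembly or reads
        "params": param.get("params", {}),      # extra task inputs
    }


job = describe_scheme_job(SEROPRED_PARAM)
print(job)
```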

(2) The details in the data_param table look like this

tabname	name	mlst_field	display_order	nested_order	label	datatype
SeroPred	O	log,O	1	0	O Antigen	text
SeroPred	H	log,H	2	0	H Antigen	text

The mlst_field describes how the data is harvested when CRobot returns data: it is the location in the JSON string where the value for the field can be found. For example, the returned output of SeroPred looks like this:

"comment": "null"
"source_ip": "137.205.123.127"
"tag": 2673082
"query_time": "2018-01-06 20:45:22.475990"
"log": {
     "H": "H10"
     "O": "O45"
}
"_cpu": 1

Hence log,O and log,H specify how to retrieve the values for the O and H fields respectively.
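The lookup described above amounts to walking a comma-separated path through the returned JSON. The following is a minimal sketch, assuming the CRobot output has already been parsed into a Python dict. The function name harvest_field is hypothetical, and the dict reproduces only some of the keys from the sample output.

```python
def harvest_field(job_output, mlst_field):
    """Follow a comma-separated path such as 'log,O' through nested dicts.

    Returns None if any step of the path is missing.
    """
    value = job_output
    for key in mlst_field.split(","):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Subset of the SeroPred output shown above, as a parsed dict.
crobot_output = {
    "comment": "null",
    "tag": 2673082,
    "log": {"H": "H10", "O": "O45"},
    "_cpu": 1,
}

print(harvest_field(crobot_output, "log,O"))  # O45
print(harvest_field(crobot_output, "log,H"))  # H10
```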

(3) The Query Method. Because pipeline was specified in the scheme table, each time a new assembly is added a job will be sent to CRobot and the results handled (the values for each field will be put in the other_data column of the assembly_lookup table under the key results). Queries to the scheme can thus be handled with the process_generic_query method, which was specified in the scheme table. This method assumes that all fields are in results in the other_data column of the assembly_lookup table.
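As a rough illustration of the storage layout that generic queries rely on, the sketch below filters in-memory rows by a field inside other_data > results. It does not show how process_generic_query is implemented. The rows, assembly ids and the helper result_matches are made up for the example, and other_data is assumed to be already decoded to a dict.

```python
def result_matches(other_data, field, value):
    """Check whether a field stored under the 'results' key equals value."""
    results = other_data.get("results", {})
    return results.get(field) == value


# Illustrative rows only: ids and the second serotype are invented.
rows = [
    {"assembly_id": 1, "other_data": {"results": {"O": "O45", "H": "H10"}}},
    {"assembly_id": 2, "other_data": {"results": {"O": "O157", "H": "H7"}}},
]

hits = [r["assembly_id"] for r in rows
        if result_matches(r["other_data"], "O", "O45")]
print(hits)
```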

AMR analysis schemes¶

These integrate various antimicrobial resistance (AMR) applications into Enterobase. The challenge with such schemes is that the classes of antibiotics and the identity of resistance-inducing genes can change over time.

The solution was to introduce a new per-species table called amr_analysis that is indexed by assembly identifier and always contains an assembly barcode column. Additional species-specific columns then contain the results for the AMR analysis tools that have been implemented for the species. These can be either simple text columns (such as the AMR finder version) or JSONB columns, for example:

{
  "Phenicol": {
    "catA1": [
      "PHENICOL",
      "CHLORAMPHENICOL",
      "AMR",
      "AMR",
      "EXACTP",
      "100.00",
      "100.00",
      "WP_000412211.1",
      "NODE_46",
      "8787",
      "9446",
      "-"
    ]
  },
  "Beta-lactam": {
    "blaTEM-1": [
      "BETA-LACTAM",
      "BETA-LACTAM",
      "AMR",
      "AMR",
      "ALLELEX",
      "100.00",
      "100.00",
      "WP_000027057.1",
      "NODE_23",
      "11008",
      "11865",
      "-"
    ]
  },

In order to create an AMR_analysis scheme for a species using an AMR tool that is already supported within the AMRanalysis job:

An AMRAnalysis entry must be created in /databases/<species>/models.py that inherits from mod.AMRAnalysis and specifies the additional columns for the particular AMR application being used.
An AMR_params.txt file should be created that describes how the data are returned from the AMRanalysis job and displayed as experimental data in the GUI.
The manage.py add_amr_analysis script should be run to create the new database table, create the entry in the schemes table and set up the DataParam entries.

When the AMR_analysis job runs, it outputs results in JSON format with an entry for each of the database columns associated with the AMR application. For the JSON-formatted columns, the entry contains a dictionary with a full record of the output from the AMR app.

Within enterobase-web the jobs are handled by the jos/jobs.py:AMRanalysisJob class, whose process_job method looks for json entries that match the list of columns listed in AMR_paramMappings. If there is a match then it looks for results that match the list of parameters for the column in AMR_paramMappings, strips off the key and converts the results to a list which is then stored in the database.

The way that the JSONB results are displayed is determined by the data_param entries for the GUI

Heading¶
tabname	name	mlst_field	display_order	label	datatype	group_name
AMR_analysis	Trimethoprim	amrfinder_results,Trimethoprim	13	Trimethoprim	text
AMR_analysis	Quaternary_ammonium	amrfinder_results,Quaternary_ammonium	10	Quaternary ammonium	text	hidden
AMR_analysis	Beta-lactam:BETA-LACTAM	amrfinder_results,Beta-lactam:BETA-LACTAM	8	Penicllin	text
AMR_analysis	amrfinder_results	log,amrfinder_results	10	AMR detailed data	text	hidden
AMR_analysis	amrfinder_db_version	log,amrfinder_db_version	22	AMR database version	text

There should be an entry for each JSON entry passed back from CRobot, where the mlst_field starts with ‘log,’ to indicate that the data is derived from the log section created by the CRobot job. An entry that holds JSON data should be hidden by setting the group_name to ‘hidden’. The mlst_field for GUI columns that are derived from JSON data should start with the name of the JSON column (amrfinder_results in the previous example), followed by the class to be displayed in the column. If a specific class/subclass combination is to be displayed, the class and subclass should be listed, separated by a colon. If multiple subclasses are to be included, they should be added, separated by further colons. If a class has both an entry without any subclasses and a column that is specific to a subclass, the entry without subclasses will list the genes that are associated with the other subclasses.
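The class/subclass rules above can be sketched as follows. This is an illustration only, not the Enterobase implementation. It assumes that the second element of each gene's value list is the subclass (as the JSONB example suggests, e.g. "CHLORAMPHENICOL" for catA1). The function names and the gene "geneX" are invented for the example.

```python
def parse_amr_field(mlst_field):
    """Split 'column,Class:SUB1:SUB2' into (column, class, [subclasses])."""
    column, _, selector = mlst_field.partition(",")
    amr_class, *subclasses = selector.split(":")
    return column, amr_class, subclasses


def genes_for_column(results, amr_class, subclasses, claimed=()):
    """List the genes a GUI column would show for one class.

    With subclasses given, keep genes whose subclass is listed. Without
    subclasses, keep genes whose subclass is not in 'claimed', i.e. not
    already covered by a subclass-specific column.
    """
    selected = []
    for gene, values in results.get(amr_class, {}).items():
        subclass = values[1]  # assumption: index 1 holds the subclass
        if subclasses:
            keep = subclass in subclasses
        else:
            keep = subclass not in claimed
        if keep:
            selected.append(gene)
    return sorted(selected)


# Value lists are shortened to class and subclass; "geneX" is made up.
results = {
    "Beta-lactam": {
        "blaTEM-1": ["BETA-LACTAM", "BETA-LACTAM"],
        "geneX": ["BETA-LACTAM", "CEPHALOSPORIN"],
    }
}

column, amr_class, subclasses = parse_amr_field(
    "amrfinder_results,Beta-lactam:BETA-LACTAM"
)
print(genes_for_column(results, amr_class, subclasses))
print(genes_for_column(results, "Beta-lactam", [], claimed=subclasses))
```

The first call shows what a subclass-specific column would list, and the second shows what a class-only column would list once that subclass is claimed elsewhere.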

Searching for and retrieving AMR results is handled by ExtraFuncs/query_funcs.py/process_amr_analysis_query. The search string returned from the GUI is in standard SQL format, but because the data can be in JSONB format, the code converts the standard SQL into PostgreSQL syntax that is specific to JSONB data. Note that in some cases the query may return some additional entries, because it is not always possible to create JSONB queries that are a perfect match for what was specified in the GUI.
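To give a flavour of what a JSONB-specific condition looks like, the sketch below builds a PostgreSQL expression that tests whether a gene is present under a given class, using the JSONB operators -> (get object field) and ? (key exists). It assumes the class > gene layout shown in the JSONB example and %s-style placeholders as used by drivers such as psycopg2. The function jsonb_gene_condition is hypothetical and is not the translation performed by process_amr_analysis_query.

```python
def jsonb_gene_condition(column, amr_class, gene):
    """Build a parameterised JSONB condition: is 'gene' a key under 'amr_class'?

    The column name is interpolated, so it is validated first. The class
    and gene are passed as query parameters rather than interpolated.
    """
    if not column.isidentifier():
        raise ValueError(f"unexpected column name: {column!r}")
    sql = f"({column} -> %s) ? %s"
    return sql, (amr_class, gene)


sql, params = jsonb_gene_condition("amrfinder_results", "Phenicol", "catA1")
print(sql)
print(params)
```

The returned pair could be appended to a WHERE clause and passed to the database driver together, which keeps user-supplied values out of the SQL text.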

When the ‘eye’ is clicked on the GUI the main/views.py:get_amr_details_by_assembly_id code outputs all of the data in the database, adding labels to the list of values and not using the class/subclass groupings used by the GUI.

The AMR analysis data can also be accessed through API2. In this case all of the columns in the database are output in their raw form, i.e. without post-processing to add labels to the list of parameters for each gene. This minimises CPU and bandwidth usage when outputting data for large numbers of strains.