Implemented Schemes

There are four categories of schemes that can be found on Enterobase:

  • Universal schemes: Common to all species
  • Sequence Type schemes: Support 7-gene MLST, cgMLST, wgMLST and similar schemes. The core implementation is in NServ
  • Generic schemes: Standard implementation that allows new analyses to be supported in a generic way
  • AMR analysis schemes: Integrate antimicrobial resistance (AMR) applications into Enterobase

Universal Schemes

These schemes should be added automatically to each database when it is created with the [add_new_database](Adding a Database) script.

assembly_stats

Displays the assembly information.

snp_calls

This scheme is never displayed, but is used to store any gff files (containing SNP information) for each assembly/reference combination, which prevents duplicate jobs being sent. The information is stored in the assembly lookup table with the assembly_id and scheme_id as normal, but the st_barcode specifies the barcode of the reference assembly. The pointer to the gff file is stored in other_data under ‘file_pointer’; however, due to a bug it may be doubly JSON encoded. The id of the snp_calls scheme may differ in each database, so it has to be retrieved using description=snp_calls.
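Because the stored value may have been JSON encoded twice, code that reads the file pointer has to tolerate both forms. The following is a minimal sketch of one way to do that, not the Enterobase implementation: the helper names and the example path are hypothetical, and it assumes other_data arrives either as a JSON string or as an already decoded dictionary.

```python
import json


def decode_other_data(raw):
    """Decode an other_data value that may have been JSON encoded more than once."""
    value = raw
    # Keep decoding while the value is still a string. The loop is bounded so
    # that unexpected input cannot cause endless decoding.
    for _ in range(3):
        if not isinstance(value, str):
            break
        value = json.loads(value)
    return value


def get_snp_file_pointer(raw_other_data):
    """Return the gff file pointer stored under 'file_pointer', or None."""
    data = decode_other_data(raw_other_data)
    if not isinstance(data, dict):
        return None
    return data.get("file_pointer")


# Hypothetical path, used only to illustrate the two encodings.
single = json.dumps({"file_pointer": "/example/path/snps.gff"})
double = json.dumps(single)
print(get_snp_file_pointer(single))
print(get_snp_file_pointer(double))
```

Both calls print the same path, so callers do not need to know which encoding a particular row uses.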

ref_masker

Again, this scheme is not for display. It stores a gff file for an assembly that describes the repeat regions in the genome, which are masked for SNP calling. The pointer to the gff file is stored in other_data under file_location.

prokka_annotation

This scheme is displayed and allows download of the gff or gbk files that have been generated for a genome. It requires two entries in the data_param table (gff_file and gbk_file). The pointers to the files are stored in the lookup table in other_data > results.

Sequence Type Schemes

Generic Schemes

SeroPred

SeroPred is a generic scheme, so job handling and query processing are already taken care of. All that is needed is to add some parameters to the scheme and data_param tables.

(1) The details in the scheme table look like this:

description: SeroPred
name: Serotype Prediction
param:

{"input_type":"assembly","pipeline":"SeroPred",
 "query_method":"process_generic_query",
 "params":{"taxon":"Escherichia"},"display":"public",
 "js_grid":"BaseExperimentGrid","summary":"O"}

The param entry is slightly more complicated because this scheme specifies a CRobot task that has to be run each time a new assembly is added to the database. The pipeline key names this task, in this case SeroPred. The input_type key specifies that the assembly (not the reads) is sent to CRobot, and “params” states that the input parameter taxon should be Escherichia. The summary key states that the O (O Antigen) field should be used to summarize this scheme, e.g. in the pie chart on the database home page.
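To make the role of each key concrete, the sketch below parses the param value and summarises how it would be interpreted. It is illustrative only: describe_scheme and the keys of the dictionary it returns are hypothetical names, not part of Enterobase.

```python
import json

# The param value from the scheme table, written as one JSON string.
PARAM = (
    '{"input_type":"assembly","pipeline":"SeroPred",'
    '"query_method":"process_generic_query",'
    '"params":{"taxon":"Escherichia"},"display":"public",'
    '"js_grid":"BaseExperimentGrid","summary":"O"}'
)


def describe_scheme(param_json):
    """Summarise how a generic scheme's param entry would be interpreted."""
    param = json.loads(param_json)
    return {
        # A pipeline key means a CRobot task is run for every new assembly.
        "runs_job": "pipeline" in param,
        "pipeline": param.get("pipeline"),
        # What is sent to CRobot: the assembly rather than the reads.
        "input_type": param.get("input_type"),
        # Input parameters passed to the task.
        "job_params": param.get("params", {}),
        "query_method": param.get("query_method"),
        # Field used to summarise the scheme, e.g. in the home page pie chart.
        "summary_field": param.get("summary"),
    }


summary = describe_scheme(PARAM)
print(summary["pipeline"], summary["job_params"])
```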

(2) The details in the data_param table look like this:

tabname   name  mlst_field  display_order  nested_order  label      datatype
SeroPred  O     log,O       1              0             O Antigen  text
SeroPred  H     log,H       2              0             H Antigen  text

The mlst_field describes how the data is harvested when CRobot returns data: it is the location in the JSON string where the value for the field can be found. For example, the returned output of SeroPred looks like this:

{
  "comment": "null",
  "source_ip": "137.205.123.127",
  "tag": 2673082,
  "query_time": "2018-01-06 20:45:22.475990",
  "log": {
    "H": "H10",
    "O": "O45"
  },
  "_cpu": 1
}

Hence log,O and log,H specify how to retrieve the values for the O and H fields respectively.
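The lookup described above amounts to following a comma-separated path through the returned JSON. A minimal sketch, assuming the output has already been parsed into a dictionary (harvest_field is a hypothetical name, not the Enterobase function):

```python
def harvest_field(job_output, mlst_field):
    """Follow a comma-separated mlst_field path through the CRobot output."""
    value = job_output
    for key in mlst_field.split(","):
        # Stop early if the path does not exist in this output.
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Part of the SeroPred output shown above, as a Python dictionary.
output = {
    "comment": "null",
    "tag": 2673082,
    "log": {"H": "H10", "O": "O45"},
    "_cpu": 1,
}

print(harvest_field(output, "log,O"))  # O45
print(harvest_field(output, "log,H"))  # H10
```

A missing key returns None rather than raising, which seems a reasonable choice when a job returns no value for a field.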

(3) The Query Method. Because pipeline was specified in the scheme table, each time a new assembly is added a job is sent to CRobot and the results are handled: the values for each field are put in the other_data column of the assembly_lookup table under the key results. Queries to the scheme can thus be handled with the process_generic_query method, which was specified in the scheme table. This method assumes that all fields are in the results entry of the other_data column of the assembly_lookup table.

AMR analysis schemes

These integrate various antimicrobial resistance (AMR) applications into Enterobase. The challenge with such schemes is that the classes of antibiotics and the identity of resistance-inducing genes can change over time.

The solution was to introduce a new per-species table called amr_analysis that is indexed by assembly identifier and always contains an assembly barcode column. Additional species-specific columns then contain the results for the AMR analysis tools that have been implemented for the species. These can be either simple text columns (such as for the AMR finder version) or JSONB columns, for example:

{
  "Phenicol": {
    "catA1": [
      "PHENICOL",
      "CHLORAMPHENICOL",
      "AMR",
      "AMR",
      "EXACTP",
      "100.00",
      "100.00",
      "WP_000412211.1",
      "NODE_46",
      "8787",
      "9446",
      "-"
    ]
  },
  "Beta-lactam": {
    "blaTEM-1": [
      "BETA-LACTAM",
      "BETA-LACTAM",
      "AMR",
      "AMR",
      "ALLELEX",
      "100.00",
      "100.00",
      "WP_000027057.1",
      "NODE_23",
      "11008",
      "11865",
      "-"
    ]
  },

In order to create an AMR_analysis scheme for a species using an AMR tool that is already supported within the AMRanalysis job:

  • An AMRAnalysis entry must be created in /databases/<species>/models.py that inherits from mod.AMRAnalysis and specifies the additional columns for the particular AMR application being used.
  • An AMR_params.txt file should be created that describes how the data are returned from the AMRanalysis job and displayed as experimental data in the GUI.
  • The manage.py add_amr_analysis script should be run to create the new database table, create the entry in the schemes table and set up the DataParam entries.

When the AMR_analysis job runs, it outputs results in JSON format with an entry for each of the database columns associated with the AMR application. For the JSON formatted columns, the entry contains a dictionary with a full record of the output from the AMR application.

Within enterobase-web the jobs are handled by the jobs/jobs.py:AMRanalysisJob class, whose process_job method looks for JSON entries that match the columns listed in AMR_paramMappings. For each match it looks for results that match the list of parameters given for that column in AMR_paramMappings, strips off the keys and converts the results to a list, which is then stored in the database.
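The processing described above can be sketched as follows. This is an illustration under stated assumptions, not the AMRanalysisJob code: it assumes AMR_paramMappings maps each column name to an ordered list of parameter names, and that the job returns a dictionary of parameters for each gene. The mapping contents, parameter names and function name below are all hypothetical.

```python
# Hypothetical stand-in for AMR_paramMappings: column name -> ordered
# parameter names. An empty list marks a plain text column.
AMR_PARAM_MAPPINGS = {
    "amrfinder_db_version": [],
    "amrfinder_results": ["class", "subclass", "method", "contig"],
}


def build_amr_row(job_log, mappings=AMR_PARAM_MAPPINGS):
    """Convert the job output into the values stored for one assembly."""
    row = {}
    for column, params in mappings.items():
        if column not in job_log:
            continue  # the job returned nothing for this column
        value = job_log[column]
        if not params:
            row[column] = value  # text column, stored as returned
            continue
        # JSON column: keep class and gene names, but replace each gene's
        # parameter dictionary with a list of values in mapping order.
        row[column] = {
            amr_class: {
                gene: [record[p] for p in params if p in record]
                for gene, record in genes.items()
            }
            for amr_class, genes in value.items()
        }
    return row


job_log = {
    "amrfinder_db_version": "example-version",
    "amrfinder_results": {
        "Phenicol": {
            "catA1": {
                "class": "PHENICOL",
                "subclass": "CHLORAMPHENICOL",
                "method": "EXACTP",
                "contig": "NODE_46",
                "ignored": "not in the mapping",
            }
        }
    },
    "unmapped_column": "skipped",
}

row = build_amr_row(job_log)
print(row["amrfinder_results"]["Phenicol"]["catA1"])
```

Storing lists rather than dictionaries keeps each record compact, at the cost of needing the parameter order whenever the values are labelled again.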

The way that the JSONB results are displayed is determined by the data_param entries for the GUI:

tabname       name                     mlst_field                                 display_order  label                 datatype  group_name
AMR_analysis  Trimethoprim             amrfinder_results,Trimethoprim             13             Trimethoprim          text
AMR_analysis  Quaternary_ammonium      amrfinder_results,Quaternary_ammonium      10             Quaternary ammonium   text      hidden
AMR_analysis  Beta-lactam:BETA-LACTAM  amrfinder_results,Beta-lactam:BETA-LACTAM  8              Penicillin            text
AMR_analysis  amrfinder_results        log,amrfinder_results                      10             AMR detailed data     text      hidden
AMR_analysis  amrfinder_db_version     log,amrfinder_db_version                   22             AMR database version  text

There should be an entry for each JSON entry passed back from CRobot, with an mlst_field that starts with ‘log,’ to indicate that the data is derived from the log section created by the CRobot job. Entries that hold JSON data should be hidden by setting the group_name to ‘hidden’. The mlst_field for a GUI column that is derived from JSON data should start with the name of the JSON column (amrfinder_results in the previous example), followed by the class to be displayed in the column. If a specific class/subclass combination is to be displayed, the class and subclass should both be listed, separated by a colon. If multiple subclasses are to be included, they should be added, separated by further colons. If a class has one column without any subclasses and another column that is specific to a subclass, the column without subclasses lists the genes that are associated with the other subclasses.
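The class/subclass rules can be illustrated with a short sketch. It is not the GUI code. It assumes that the subclass is the second value in each gene's list, as the JSONB example earlier in this section suggests. The function names, the shortened gene lists and the second gene entry are hypothetical.

```python
SUBCLASS_INDEX = 1  # assumption: the second value in each gene list is the subclass


def parse_mlst_field(mlst_field):
    """Split 'column,Class:SUB1:SUB2' into (column, class, [subclasses])."""
    column, _, selector = mlst_field.partition(",")
    amr_class, *subclasses = selector.split(":")
    return column, amr_class, subclasses


def genes_for_column(row, mlst_field, sibling_fields=()):
    """List the genes a GUI column would show for one database row."""
    column, amr_class, subclasses = parse_mlst_field(mlst_field)
    genes = row.get(column, {}).get(amr_class, {})
    if subclasses:
        wanted = set(subclasses)
        return sorted(g for g, v in genes.items() if v[SUBCLASS_INDEX] in wanted)
    # No subclass given: leave out subclasses that have their own column.
    claimed = set()
    for other in sibling_fields:
        o_column, o_class, o_subclasses = parse_mlst_field(other)
        if o_column == column and o_class == amr_class:
            claimed.update(o_subclasses)
    return sorted(g for g, v in genes.items() if v[SUBCLASS_INDEX] not in claimed)


# Shortened, illustrative row: only the class and subclass values are kept.
row = {
    "amrfinder_results": {
        "Beta-lactam": {
            "blaTEM-1": ["BETA-LACTAM", "BETA-LACTAM"],
            "blaCTX-M-15": ["BETA-LACTAM", "CEPHALOSPORIN"],
        }
    }
}

specific = "amrfinder_results,Beta-lactam:BETA-LACTAM"
general = "amrfinder_results,Beta-lactam"
print(genes_for_column(row, specific))
print(genes_for_column(row, general, sibling_fields=[specific]))
```

In this sketch the subclass-specific column shows blaTEM-1, and the general Beta-lactam column shows only the remaining gene.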

Searching for and retrieving AMR results is handled by ExtraFuncs/query_funcs.py:process_amr_analysis_query. The search string returned from the GUI is in standard SQL format, but because the data can be in JSONB format the code converts the standard SQL into PostgreSQL syntax that is specific to JSONB data. Note that in some cases the query may return some additional entries, because it is not always possible to create JSONB queries that are a perfect match for what was specified in the GUI.
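As a rough illustration of why a converted query may return extra entries, the sketch below builds two JSONB conditions using standard PostgreSQL operators (-> to select the class object, ? to test for a key, and a ::text cast for pattern matching). It is not the process_amr_analysis_query code: the function name is hypothetical, and the psycopg2-style named placeholders are an assumption.

```python
def jsonb_gene_condition(column, amr_class, gene, exact=True):
    """Return (sql_fragment, parameters) for a search on one AMR class."""
    # Column names cannot be passed as bound parameters, so validate them.
    if not column.isidentifier():
        raise ValueError("unexpected column name: %r" % (column,))
    class_object = "(%s -> %%(amr_class)s)" % column
    if exact:
        # '?' tests whether the gene is a key of the class object.
        return class_object + " ? %(gene)s", {"amr_class": amr_class, "gene": gene}
    # Pattern search on the text form of the class object. This also matches
    # the values stored for each gene, so it can return extra rows.
    return (
        class_object + "::text LIKE %(pattern)s",
        {"amr_class": amr_class, "pattern": "%" + gene + "%"},
    )


sql, params = jsonb_gene_condition("amrfinder_results", "Beta-lactam", "blaTEM-1")
print(sql, params)

loose_sql, loose_params = jsonb_gene_condition(
    "amrfinder_results", "Beta-lactam", "bla", exact=False
)
print(loose_sql, loose_params)
```

The exact form matches only rows where the gene is a key of the class object. The pattern form is broader by construction, which is one way a JSONB query can end up less precise than the original SQL condition.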

When the ‘eye’ icon is clicked in the GUI, the main/views.py:get_amr_details_by_assembly_id code outputs all of the data in the database, adding labels to the list of values and ignoring the class/subclass groupings used by the GUI.

The AMR analysis data can also be accessed through API2. In that case all of the columns in the database are output in their raw form, i.e. without post-processing to add labels to the list of parameters for each gene. This minimises the CPU and bandwidth needed when outputting data for large numbers of strains.