Adding a new Database

Creating the SQLAlchemy Model

In entero/databases, create a new folder (a Python package) with the name of your database. Then add the four files shown below (you can copy them from another database directory):

entero
│
└───databases
    │
    └───<database_name>
    │      data_params.txt
    │      db.py
    │      models.py
    │      __init__.py
    │
    └───senterica
    │
__init__.py
The standard __init__ file for any Python package.

data_params.txt

This is a tab-delimited text file that should contain information about the extra fields you specified in models.py. The fields are described in developer-metadata-validation. Below is an example for an extra field, species, in the strains table. It will try to pull its value from the SRA data, and users are limited to only three values (please note that ‘:’ in vals should be ‘|’, but I could not escape this character in a markdown table).

| tabname | name | sra_field | display_order | nested_order | label | datatype | vals | required |
|---|---|---|---|---|---|---|---|---|
| strains | species | Sample,Metadata,Species | 7 | 0 | Species | combo | Cronobacter sakazakii:Cronobacter dublinensis:Cronobacter universalis | 0 |
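As a rough illustration of how such a file could be read, here is a minimal sketch using Python's csv module. The sample text mirrors the example row above with ‘|’ as the vals separator. The function name read_data_params is hypothetical, and this is not necessarily how the application itself loads the file.

```python
import csv
import io

# Hypothetical sample mirroring the example row; a real file would live in
# entero/databases/<database_name>/ and use '|' between the allowed values.
SAMPLE = (
    "tabname\tname\tsra_field\tdisplay_order\tnested_order\t"
    "label\tdatatype\tvals\trequired\n"
    "strains\tspecies\tSample,Metadata,Species\t7\t0\t"
    "Species\tcombo\t"
    "Cronobacter sakazakii|Cronobacter dublinensis|Cronobacter universalis\t0\n"
)


def read_data_params(handle):
    """Return one dict per row, with 'vals' split on '|' into a list."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # An empty vals column means the field has no restricted values.
        row["vals"] = row["vals"].split("|") if row["vals"] else []
        rows.append(row)
    return rows


params = read_data_params(io.StringIO(SAMPLE))
print(params[0]["name"], params[0]["vals"])
```

To read a real file, pass an open file handle instead of the in-memory sample.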

Adding the database in the config

In config.py, add an item to the ACTIVE_DATABASES dictionary, with the name of the database as the key and a list containing the following:

  • The name of the genus - this is important as it is used to retrieve the appropriate records from the SRA and to check whether assemblies are of the correct taxon
  • The url of the database
  • A boolean showing whether the database is public (True) or private (False)
  • Whether the database is active (1) or not (0). Initially, while we are creating the database, this should be set to 0
  • The three letter code identifying the database

e.g.

'cronobacter': [
    'Cronobacter',
    'postgresql://%s:%s@%s/cronobacter' % (USER, PASS, POSTGRES_SERVER),
    True,
    0,
    'CRO'
]

Also in config.py, the DB_CODES dictionary needs to be updated with the database name as the key and a list consisting of a name and the code, e.g.

'cronobacter' : ['cronobacter','CRO']
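Because the two dictionaries must agree with each other, a quick consistency check can catch mistakes before the database is created. The following is a minimal sketch only: check_database_entry is a hypothetical helper that does not exist in the code base, and the credentials are placeholders.

```python
# Placeholder credentials (assumptions for this sketch only).
USER, PASS, POSTGRES_SERVER = "user", "secret", "localhost"

ACTIVE_DATABASES = {
    "cronobacter": [
        "Cronobacter",
        "postgresql://%s:%s@%s/cronobacter" % (USER, PASS, POSTGRES_SERVER),
        True,
        0,
        "CRO",
    ],
}
DB_CODES = {"cronobacter": ["cronobacter", "CRO"]}


def check_database_entry(name, active_databases, db_codes):
    """Hypothetical helper: return a list of problems found for one database."""
    entry = active_databases.get(name)
    if entry is None:
        return ["'%s' is missing from ACTIVE_DATABASES" % name]
    if len(entry) != 5:
        return ["expected 5 items, found %d" % len(entry)]
    genus, url, is_public, is_active, code = entry
    problems = []
    if not isinstance(is_public, bool):
        problems.append("third item must be True or False")
    if is_active not in (0, 1):
        problems.append("fourth item must be 0 or 1")
    if len(code) != 3:
        problems.append("code must be three letters")
    if name not in db_codes:
        problems.append("'%s' is missing from DB_CODES" % name)
    elif db_codes[name][1] != code:
        problems.append("DB_CODES code does not match ACTIVE_DATABASES")
    return problems


# An empty list means no problems were found.
print(check_database_entry("cronobacter", ACTIVE_DATABASES, DB_CODES))
```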

Creating the Actual Database

First, the database must be physically created. This can be done with a PostgreSQL GUI such as pgAdmin (right-click Databases > New Database..., then type in the name and press OK) or from the PostgreSQL command line:

CREATE DATABASE <database name>
  WITH OWNER = <owner_name>
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'
       CONNECTION LIMIT = -1;

This must be done using the PostgreSQL admin account (e.g. postgres), for example by running psql as follows:

sudo -su postgres psql -p<portno>

Next, it can be populated with tables based on the models, using the script [create_new_database](Maintenance Scripts#markdown-header-create_new_database):

python manage.py create_new_database -d cronobacter

This script also populates the data_param table and adds a few schemes that are generic to all databases:

  • assembly_stats: displays information about the assembly. The fields for this scheme have been added to data_param.
  • ref_masker: not displayed to the user, but contains information about repeats in a genome and is used when calling SNPs.
  • snp_calls: also not displayed; stores information about SNPs called for a strain against a particular reference.
  • prokka_annotation: contains information pointing to the annotation files (GFF and GenBank) for genomes. Again, the fields for this scheme have been added to the data_param table.

N.B. Now change the entry in ACTIVE_DATABASES in config.py to make the database active (change the fourth value to 1)

Populating the Database

Insert a scheme hosted in NServ

You can insert a scheme that has been defined in NServ into the database using the script “manage.py add_nserv_scheme”.

Insert an rMLST scheme for the Cronobacter database:

python manage.py add_nserv_scheme -d cronobacter -s rMLST -r Cronobacter_rMLST

Insert a wgMLST scheme for Cronobacter and grant access to the users specified in the ‘-u’ parameter:

python manage.py add_nserv_scheme -d cronobacter -s wgMLST -r CROwgMLST_wgMLST -u user_name_1,user_name_2

Importing SRA Data

You can populate the database using the script [importSRA](Maintenance Scripts#markdown-header-importsra). To import all SRA records from roughly the last eight years (3000 days):

python manage.py importSRA -d cronobacter -r 3000
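Since the -r value is given in days, a period expressed in years has to be converted first. The sketch below shows the arithmetic; years_to_days and import_sra_command are hypothetical helpers (not part of manage.py), and 365.25 days per year is an assumed average.

```python
import math

# Assumed average year length, including leap years.
DAYS_PER_YEAR = 365.25


def years_to_days(years):
    """Convert a number of years to whole days, rounding up."""
    return math.ceil(years * DAYS_PER_YEAR)


def import_sra_command(database, years):
    """Build the importSRA command line covering the last `years` years."""
    return ["python", "manage.py", "importSRA",
            "-d", database, "-r", str(years_to_days(years))]


# 3000 days corresponds to a little over eight years.
print(round(3000 / DAYS_PER_YEAR, 1))
print(" ".join(import_sra_command("cronobacter", 10)))
```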

Assembling Genomes

To assemble genomes, you can either use the web interface or run the script [update_assemblies](Maintenance Scripts#markdown-header-update_assemblies), which will check for any unassembled strains and send them off for assembly. For example, to send two strains for assembly with high priority:

python manage.py update_assemblies -p -9 -d cronobacter -l 2

Calling Schemes

Schemes also need to be called for each assembly. Again, this can be done through the web interface or using the script [update_all_schemes](Maintenance Scripts#markdown-header-update_all_schemes) (in this case, probably only the Prokka annotation will be called, as it is the only scheme in the database):

python manage.py update_all_schemes -d cronobacter -p 9

To automate the import, assembly and calling of schemes, you can add the name of the database to the shell scripts scripts/daily_update.sh and scripts/daily_import.sh; see [here](Overall Structure#markdown-header-scripts).
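For a one-off run outside those shell scripts, the three steps can also be chained from Python. This is a minimal sketch that reuses only the flags shown in the commands above. The function names, the default values and the dry-run switch are assumptions for illustration, and it does not reproduce what daily_update.sh or daily_import.sh actually do.

```python
import subprocess


def pipeline_commands(database, days, assembly_priority, assembly_limit,
                      scheme_priority):
    """Build the import, assembly and scheme-calling commands in order."""
    manage = ["python", "manage.py"]
    return [
        manage + ["importSRA", "-d", database, "-r", str(days)],
        manage + ["update_assemblies", "-p", str(assembly_priority),
                  "-d", database, "-l", str(assembly_limit)],
        manage + ["update_all_schemes", "-d", database,
                  "-p", str(scheme_priority)],
    ]


def run_pipeline(database, dry_run=True, **options):
    """Print each command (dry run) or execute it, stopping on failure."""
    for cmd in pipeline_commands(database, **options):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Example values taken from the commands earlier on this page.
run_pipeline("cronobacter", days=3000, assembly_priority=-9,
             assembly_limit=2, scheme_priority=9)
```

With dry_run=True nothing is executed, which makes it easy to review the commands before running them for real from the directory containing manage.py.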

Importing Whole Genomes

This imports assembled genomes from GenBank:

python manage.py import_whole_genome -d cronobacter

Only complete genomes will be imported unless you add (-c False) to the command.