Adding a new Database

Creating the SQLAlchemy Model

In entero/databases, create a new folder (a Python package) with the name of your database. Then add the four files shown below (you can copy them from another database directory).

entero
│
└───databases
    │
    └───<database_name>
    │      data_params.txt
    │      db.py
    │      models.py
    │      __init__.py
    │
    └───senterica
    │

\_\_init\_\_.py
The standard __init__ file for any Python package.

data_params.txt

This is a tab-delimited text file that should contain information about the extra fields that you specified in models.py. The fields are described in Configuration info for metadata and results fields. Below is an example for an extra field, species (tabname strains): the value is pulled from the SRA data where possible, and users are limited to three values. (Please note that the ‘:’ separators in vals should be ‘|’; they are shown as ‘:’ here only because ‘|’ could not be escaped in a markdown table.)

tabname	name	sra_field	display_order	nested_order	label	datatype	vals	required
strains	species	Sample,Metadata,Species	7	0	Species	combo	Cronobacter sakazakii:Cronobacter dublinensis:Cronobacter universalis	0
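
As a rough illustration of how a file in this layout could be read, the sketch below parses the example row with Python's csv module. The function name parse_data_params, the in-memory sample, and the choice to split vals on '|' and sra_field on ',' are assumptions made for this example; it is not the loader that entero itself uses.

```python
import csv
import io

# Hypothetical sample mirroring the example row above, with '|' between values.
SAMPLE = (
    "tabname\tname\tsra_field\tdisplay_order\tnested_order\tlabel\tdatatype\tvals\trequired\n"
    "strains\tspecies\tSample,Metadata,Species\t7\t0\tSpecies\tcombo\t"
    "Cronobacter sakazakii|Cronobacter dublinensis|Cronobacter universalis\t0\n"
)


def parse_data_params(handle):
    """Read tab-delimited field definitions into a list of dicts.

    Assumption: 'vals' uses '|' between allowed values and 'sra_field'
    uses ',' between path components, as in the example row.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["display_order"] = int(row["display_order"])
        row["nested_order"] = int(row["nested_order"])
        row["required"] = bool(int(row["required"]))
        row["vals"] = row["vals"].split("|") if row["vals"] else []
        row["sra_field"] = row["sra_field"].split(",") if row["sra_field"] else []
        rows.append(row)
    return rows


fields = parse_data_params(io.StringIO(SAMPLE))
for field in fields:
    print(field["tabname"], field["name"], field["vals"])
```

To check a real file, the same function could be given an open handle on entero/databases/<database_name>/data_params.txt instead of the in-memory sample.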

Adding the database in the config

In entero/__init__.py, add an item to the ACTIVE_DATABASES dictionary, with the name of the database as the key and a list containing the following values, in this order:

The name of the genus. This is important, as it is used to retrieve the appropriate records from the SRA and to check whether assemblies belong to the correct taxon.
The URL of the database
A boolean showing whether the database is public (True) or private (False)
Whether the database is active (1) or not (0). While the database is being created, this should be set to 0.
The three-letter code identifying the database

e.g.

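'cronobacter': [
    'Cronobacter',  # genus name
    'postgresql://%s:%s@%s/cronobacter' % (USER, PASS, POSTGRES_SERVER),  # database URL
    True,           # public (True) or private (False)
    0,              # active (1) or not (0); keep at 0 while creating the database
    'CRO'           # three-letter database code
]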

Also, in config.py, the DB_CODES dictionary needs to be updated with the database name as the key and a list comprising a name and the code, e.g.

'cronobacter' : ['cronobacter','CRO']
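
Because the two dictionaries have to agree, a small consistency check can catch mistakes before the database is switched on. The following is only a sketch: check_database_entry is a hypothetical helper, the USER, PASS and POSTGRES_SERVER values are stand-ins, and the rules it applies (five values, a postgresql:// URL, matching codes) are inferred from the examples above rather than taken from the entero code.

```python
# Stand-in values; in the real configuration these come from the deployment.
USER, PASS, POSTGRES_SERVER = "user", "secret", "localhost:5432"

ACTIVE_DATABASES = {
    "cronobacter": [
        "Cronobacter",
        "postgresql://%s:%s@%s/cronobacter" % (USER, PASS, POSTGRES_SERVER),
        True,
        0,
        "CRO",
    ],
}

DB_CODES = {"cronobacter": ["cronobacter", "CRO"]}


def check_database_entry(name, active_databases, db_codes):
    """Return a list of problems found for one database entry (empty if none)."""
    entry = active_databases.get(name)
    if entry is None:
        return ["%s is missing from ACTIVE_DATABASES" % name]
    if len(entry) != 5:
        return ["expected 5 values for %s, found %d" % (name, len(entry))]
    problems = []
    genus, url, public, active, code = entry
    if not genus:
        problems.append("genus name should not be empty")
    if not url.startswith("postgresql://"):
        problems.append("URL should start with postgresql://")
    if not isinstance(public, bool):
        problems.append("public flag should be True or False")
    if active not in (0, 1):
        problems.append("active flag should be 0 or 1")
    if len(code) != 3:
        problems.append("database code should be three letters")
    if name not in db_codes:
        problems.append("%s is missing from DB_CODES" % name)
    elif db_codes[name][1] != code:
        problems.append("DB_CODES code does not match ACTIVE_DATABASES")
    return problems


print(check_database_entry("cronobacter", ACTIVE_DATABASES, DB_CODES))
```

An empty list means the entry passed every check in this sketch; it says nothing about whether the database itself exists yet.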

Creating the Actual Database

First of all, the database must be physically created. This can be done through the PostgreSQL graphical client (right-click on Databases > New Database.., then type in the name and press OK) or from the PostgreSQL command line:

CREATE DATABASE <database name>
  WITH OWNER = <owner_name>
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'
       CONNECTION LIMIT = -1;
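
If several databases need to be created, the statement can be generated rather than typed by hand. The helper below is a hypothetical sketch, not part of entero: create_database_sql and the db_owner role name are invented for illustration, and the identifier check is deliberately stricter than PostgreSQL's own rules so that the names can be placed in the statement unquoted.

```python
import re

# Conservative pattern for unquoted identifiers (an assumption made for this
# sketch; PostgreSQL itself allows more when names are quoted).
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def create_database_sql(database_name, owner_name):
    """Fill in the CREATE DATABASE template shown above."""
    for value in (database_name, owner_name):
        if not _IDENTIFIER.fullmatch(value):
            raise ValueError("unsafe identifier: %r" % value)
    return (
        "CREATE DATABASE {name}\n"
        "  WITH OWNER = {owner}\n"
        "       ENCODING = 'UTF8'\n"
        "       TABLESPACE = pg_default\n"
        "       LC_COLLATE = 'en_GB.UTF-8'\n"
        "       LC_CTYPE = 'en_GB.UTF-8'\n"
        "       CONNECTION LIMIT = -1;"
    ).format(name=database_name, owner=owner_name)


# 'db_owner' is a placeholder role name.
print(create_database_sql("cronobacter", "db_owner"))
```

The printed statement can then be pasted into a psql session.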

This must be done using the PostgreSQL admin account (e.g. postgres), for example by running psql as follows:

sudo -su postgres psql -p<portno>

Next, it can be populated with tables based on the models, using the script [create_new_database](Maintenance Scripts#markdown-header-create_new_database):

python manage.py create_new_database -d cronobacter

This script also populates the data_param table and adds a few schemes which are generic to all databases:

assembly_stats: Displays information about the assembly. The fields for this scheme have been added to data_param.
ref_masker: This scheme is not displayed to the user, but contains information about repeats in a genome and is used when calling SNPs.
snp_calls: Again, this scheme is not displayed, but it stores information about SNPs called for a strain against a particular reference.
prokka_annotation: Contains information pointing to the annotation files (GFF and GenBank) for genomes. Again, the fields for this scheme have been added to the data_param table.

N.B. Now change the entry in ACTIVE_DATABASES in config.py to make the database active (change the fourth value to 1).

Populating the Database

Insert a scheme hosted in NServ

You can insert a scheme that has been defined in NServ into the database using the script “manage.py add_nserv_scheme”.

Insert an rMLST scheme into the Cronobacter database:

python manage.py add_nserv_scheme -d cronobacter -s rMLST -r Cronobacter_rMLST

Insert a wgMLST scheme for Cronobacter and grant access to the users specified with the ‘-u’ parameter:

python manage.py add_nserv_scheme -d cronobacter -s wgMLST -r CROwgMLST_wgMLST -u user_name_1,user_name_2

Importing SRA Data

You can populate the database using the script [importSRA](Maintenance Scripts#markdown-header-importsra). To import all SRA records from the last 3000 days (roughly eight years):

python manage.py importSRA -d cronobacter -r 3000

Assembling Genomes

To assemble genomes, you can either use the web interface or run the script [update_assemblies](Maintenance Scripts#markdown-header-update_assemblies), which checks for any unassembled strains and sends them off for assembly. For example, to send two strains for assembly with high priority:

python manage.py update_assemblies -p -9 -d cronobacter -l 2

Calling Schemes

Schemes also need to be called for each assembly. Again, this can be done through the web interface or using the script [update_all_schemes](Maintenance Scripts#markdown-header-update_all_schemes) (in this case, probably only the prokka annotation will be called, as it is the only scheme in the database):

python manage.py update_all_schemes -d cronobacter -p 9

To automate the import, assembly and calling of schemes, you can add the name of the database to the shell scripts scripts/daily_update.sh and scripts/daily_import.sh; see [here](Overall Structure#markdown-header-scripts).
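
For a one-off run outside those shell scripts, the three commands from this section can be chained from Python. This is a minimal sketch under stated assumptions: build_update_commands and run_update are hypothetical names, the flags are copied from the examples on this page, and the script does not wait for submitted assemblies to finish before calling schemes.

```python
import subprocess


def build_update_commands(database, days=3000, assembly_limit=2):
    """Return the manage.py invocations from this page, in pipeline order."""
    manage = ["python", "manage.py"]
    return [
        manage + ["importSRA", "-d", database, "-r", str(days)],
        manage + ["update_assemblies", "-p", "-9", "-d", database, "-l", str(assembly_limit)],
        manage + ["update_all_schemes", "-d", database, "-p", "9"],
    ]


def run_update(database, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_update_commands(database):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence as soon as one step fails.
            subprocess.run(command, check=True)


# Dry run: only prints the commands that would be executed.
run_update("cronobacter", dry_run=True)
```

Calling run_update with dry_run=False from the directory containing manage.py would execute the commands in order.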

Import whole genome

Imports assembled genomes from GenBank.

python manage.py import_whole_genome -d cronobacter

Only complete genomes will be imported unless you add (-c False) to the command.