Adding a new Database

Creating the SQLAlchemy Model

In entero/databases, create a new folder (a Python package) named after your database. Then add the four files shown below (you can copy them from another database directory).

entero
│
└───databases
    │
    └───<database_name>
    │      data_params.txt
    │      db.py
    │      models.py
    │      __init__.py
    │
    └───senterica
    │
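The copy step can be sketched in Python as follows. This is only an illustration, not part of the code base: the function name scaffold_database_package is hypothetical, the sketch assumes Python 3, and it assumes senterica is a suitable package to copy from. The copied models.py and data_params.txt still need editing by hand afterwards.

```python
import shutil
from pathlib import Path

# The four files every database package contains (taken from the tree above).
PACKAGE_FILES = ("__init__.py", "data_params.txt", "db.py", "models.py")


def scaffold_database_package(databases_dir, new_name, template_name="senterica"):
    """Create databases_dir/new_name by copying the four files from a template.

    Hypothetical helper: refuses to overwrite an existing package and checks
    that the template really has all four files before copying anything.
    """
    databases_dir = Path(databases_dir)
    template = databases_dir / template_name
    target = databases_dir / new_name

    if target.exists():
        raise FileExistsError("%s already exists" % target)

    missing = [name for name in PACKAGE_FILES if not (template / name).is_file()]
    if missing:
        raise FileNotFoundError("template is missing: %s" % ", ".join(missing))

    target.mkdir()
    for name in PACKAGE_FILES:
        shutil.copy2(template / name, target / name)
    return target
```

For example, scaffold_database_package("entero/databases", "cronobacter") would create entero/databases/cronobacter from the senterica files.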
__init__.py
The standard __init__ file for any Python package.

data_params.txt

This is a tab-delimited text file that describes the extra fields you specified in models.py. The columns are described in Metadata Validation. Below is an example for an extra field, species (tabname strains); its value is pulled from the SRA data where possible, and users are limited to three permitted values. (Note that the ‘:’ separators in vals should be ‘|’, but that character could not be escaped in a markdown table.)

| tabname | name | sra_field | display_order | nested_order | label | datatype | vals | required |
|---|---|---|---|---|---|---|---|---|
| strains | species | Sample,Metadata,Species | 7 | 0 | Species | combo | Cronobacter sakazakii:Cronobacter dublinensis:Cronobacter universalis | 0 |
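To check a data_params.txt file before loading it, the file can be read back with Python's csv module. This is a rough sketch under stated assumptions, not the loader the application uses: it assumes the file starts with the header row shown in the example, that vals is separated by ‘|’, and that required is a 0/1 flag. The function name read_data_params is hypothetical.

```python
import csv
import io

# Sample file content mirroring the example row, with '|' as the vals separator.
SAMPLE = (
    "tabname\tname\tsra_field\tdisplay_order\tnested_order\t"
    "label\tdatatype\tvals\trequired\n"
    "strains\tspecies\tSample,Metadata,Species\t7\t0\tSpecies\tcombo\t"
    "Cronobacter sakazakii|Cronobacter dublinensis|Cronobacter universalis\t0\n"
)


def read_data_params(handle):
    """Parse a tab-delimited data_params file into a list of dicts.

    Assumes a header row is present (an assumption of this sketch).
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Split the permitted values; an empty cell means no restriction.
        row["vals"] = row["vals"].split("|") if row["vals"] else []
        row["display_order"] = int(row["display_order"])
        row["nested_order"] = int(row["nested_order"])
        # Assumed convention: "1" means the field is required.
        row["required"] = row["required"] == "1"
        rows.append(row)
    return rows


rows = read_data_params(io.StringIO(SAMPLE))
for row in rows:
    print(row["tabname"], row["name"], row["vals"])
```

For a real file, pass an open file handle instead of the io.StringIO wrapper, for example open("data_params.txt", newline="").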

Adding the database in the config

In entero/config.py, add an item to the ACTIVE_DATABASES dictionary, with the name of the database as the key and a list containing the following:

  • The name of the genus. This is important, as it is used to retrieve the appropriate records from the SRA and to check whether assemblies are of the correct taxon
  • The url of the database
  • A boolean showing whether the database is public (True) or private (False)
  • Whether the database is active (1) or not (0). While the database is still being created, this should be set to 0
  • The three letter code identifying the database

e.g.

'cronobacter': [
    'Cronobacter',
    'postgresql://%s:%s@%s/cronobacter' % (USER, PASS, POSTGRES_SERVER),
    True,
    0,
    'CRO'
]

Also in config.py, the DB_CODES dictionary needs to be updated, with the database name as the key and a list comprising a name and the code, e.g.

'cronobacter' : ['cronobacter','CRO']
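Because the two dictionaries must agree with each other, a quick consistency check can catch mistakes before the database is created. The sketch below is illustrative only: check_database_entry is a hypothetical helper, and the USER, PASS and POSTGRES_SERVER values are placeholders standing in for whatever config.py really defines.

```python
# Placeholder credentials for the sketch; the real values live in config.py.
USER, PASS, POSTGRES_SERVER = "user", "secret", "localhost"

ACTIVE_DATABASES = {
    "cronobacter": [
        "Cronobacter",                                                    # genus
        "postgresql://%s:%s@%s/cronobacter" % (USER, PASS, POSTGRES_SERVER),  # url
        True,                                                             # public
        0,                                                                # active flag
        "CRO",                                                            # three-letter code
    ],
}

DB_CODES = {"cronobacter": ["cronobacter", "CRO"]}


def check_database_entry(name, active=ACTIVE_DATABASES, codes=DB_CODES):
    """Return a list of problems with the config entries for `name` (empty if none)."""
    entry = active.get(name)
    if entry is None:
        return ["%s is not in ACTIVE_DATABASES" % name]
    if len(entry) != 5:
        return ["%s should have 5 items, found %d" % (name, len(entry))]

    problems = []
    genus, url, public, is_active, code = entry
    if not url.startswith("postgresql://"):
        problems.append("url does not look like a PostgreSQL URL")
    if not isinstance(public, bool):
        problems.append("public flag should be True or False")
    if is_active not in (0, 1):
        problems.append("active flag should be 0 or 1")
    if len(code) != 3:
        problems.append("code should be three letters")
    if name not in codes:
        problems.append("%s is not in DB_CODES" % name)
    elif codes[name][1] != code:
        problems.append("DB_CODES code does not match ACTIVE_DATABASES code")
    return problems


print(check_database_entry("cronobacter"))
```

An empty list means the two entries are consistent with the layout described above.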

Creating the Actual Database

First, the database itself must be created. This can be done in the PostgreSQL graphical client (right-click on Databases > New Database..., then type in the name and press OK) or from the PostgreSQL command line:

CREATE DATABASE <database_name>
  WITH OWNER = <owner_name>
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'
       CONNECTION LIMIT = -1;

Next, it can be populated with tables, based on the models, using the script [create_new_database](Maintenance Scripts#markdown-header-create_new_database):

python manage.py create_new_database -d <database_name>

This script also populates the data_param table and adds a few schemes that are generic to all databases:

  • assembly_stats: displays information about the assembly. The fields for this scheme have been added to data_param
  • ref_masker: not displayed to the user, but contains information about repeats in a genome and is used when calling SNPs
  • snp_calls: also not displayed, but stores information about SNPs called for a strain against a particular reference
  • prokka_annotation: contains information pointing to the annotation files (GFF and GenBank) for genomes. Again, the fields for this scheme have been added to the data_param table

N.B. Now change the entry in ACTIVE_DATABASES in config.py to make the database active (change the fourth value to 1)

Populating the Database

Importing SRA Data

You can populate the database using the script [importSRA](Maintenance Scripts#markdown-header-importsra). To import all SRA records from the last 3000 days (a little over eight years):

python manage.py importSRA -d cronobacter -r 3000
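If you would rather think in terms of a start date than a number of days, the value for -r can be computed. The sketch below assumes, as the example above suggests, that -r is a number of days counted back from today; the helper names days_since and import_sra_command are hypothetical and only build the command line, they do not run it.

```python
from datetime import date


def days_since(start, today=None):
    """Number of days from `start` up to `today` (defaults to the current date)."""
    today = today or date.today()
    if start > today:
        raise ValueError("start date is in the future")
    return (today - start).days


def import_sra_command(database, start, today=None):
    """Build the importSRA command covering everything since `start`.

    Assumes -r takes a number of days counted back from today.
    """
    return [
        "python", "manage.py", "importSRA",
        "-d", database,
        "-r", str(days_since(start, today)),
    ]


# Example: ten calendar years, including the leap days in 2012 and 2016.
print(days_since(date(2010, 1, 1), today=date(2020, 1, 1)))
```

Ten calendar years come to about 3650 days, so a value of 3000 covers a somewhat shorter window.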

Assembling Genomes

To assemble genomes, either use the web interface or run the script [update_assemblies](Maintenance Scripts#markdown-header-update_assemblies), which checks for any unassembled strains and sends them off for assembly. For example, to send two strains for assembly with high priority:

python manage.py update_assemblies -p -9 -d cronobacter -l 2

Calling Schemes

Schemes also need to be called for each assembly. Again, this can be done through the web interface or with the script [update_all_schemes](Maintenance Scripts#markdown-header-update_all_schemes). In this case, probably only the prokka annotation will be called, as it is the only scheme in the database.

python manage.py update_all_schemes -d cronobacter -p 9

To automate the import, assembly and calling of schemes, you can add the name of the database to the shell scripts scripts/daily_update.sh and scripts/daily_import.sh (see [here](Overall Structure#markdown-header-scripts)).
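For a one-off manual run, the three commands from this section can also be chained from Python. This is a sketch only and does not reflect the contents of the daily shell scripts: update_commands and run_update are hypothetical names, the flag values are simply those used in the examples above, and the sketch only launches each command without checking whether queued assembly jobs have finished.

```python
import subprocess


def update_commands(database, days=3000, assembly_limit=2):
    """Return the import, assembly and scheme-calling commands for one database."""
    return [
        ["python", "manage.py", "importSRA", "-d", database, "-r", str(days)],
        ["python", "manage.py", "update_assemblies",
         "-p", "-9", "-d", database, "-l", str(assembly_limit)],
        ["python", "manage.py", "update_all_schemes", "-d", database, "-p", "9"],
    ]


def run_update(database, dry_run=True):
    """Print the commands (dry run) or execute them in order, stopping on failure."""
    for cmd in update_commands(database):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(cmd, check=True)


run_update("cronobacter")  # dry run: only prints the three command lines
```

Calling run_update("cronobacter", dry_run=False) from the directory containing manage.py would execute the steps for real.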