Adding a new Database¶
Creating the SQLAlchemy Model¶
In entero/databases, create a new folder (a Python package) named after your database. Then add the four files shown below (you can copy them from another database's directory).
entero
│
└───databases
    │
    └───<database_name>
    │       data_params.txt
    │       db.py
    │       models.py
    │       __init__.py
    │
    └───senterica
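The layout above can be created by hand or by copying an existing database directory. As a minimal sketch of the same step, the helper below creates the package folder with four empty placeholder files. The function name `scaffold_database_package` is hypothetical and not part of the code base, and the files still need real content afterwards.

```python
from pathlib import Path

# File names taken from the directory tree above; contents are left empty.
PACKAGE_FILES = ("data_params.txt", "db.py", "models.py", "__init__.py")


def scaffold_database_package(databases_dir, database_name):
    """Create <databases_dir>/<database_name> with empty placeholder files.

    Hypothetical helper: existing files are left untouched, so it is safe
    to run more than once. Returns the path of the package directory.
    """
    package_dir = Path(databases_dir) / database_name
    package_dir.mkdir(parents=True, exist_ok=True)
    for file_name in PACKAGE_FILES:
        (package_dir / file_name).touch(exist_ok=True)
    return package_dir
```

For example, `scaffold_database_package("entero/databases", "cronobacter")` would prepare the folder for a Cronobacter database.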
__init__.py¶
The standard __init__.py file required by any Python package.
data_params.txt¶
This is a tab-delimited text file describing the extra fields that you specified in models.py. The fields are described in developer-metadata-validation. Below is an example for an extra field, species (tabname strains), whose value is pulled from the SRA data where possible and for which users are limited to three values. (Note that the ‘:’ separators in vals should be ‘|’; this character could not be escaped in a markdown table.)
| tabname | name | sra_field | display_order | nested_order | label | datatype | vals | required |
|---|---|---|---|---|---|---|---|---|
| strains | species | Sample,Metadata,Species | 7 | 0 | Species | combo | Cronobacter sakazakii:Cronobacter dublinensis:Cronobacter universalis | 0 |
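To check a file before loading it, it can help to parse it the same way a loader would. The sketch below reads the tab-delimited format with Python's `csv` module. It assumes that `vals` is separated by `|` (as the note above says the real file should be) and that `required` is `0` or `1`; the function name `read_data_params` is hypothetical and this is not the project's own loader.

```python
import csv
import io


def read_data_params(handle):
    """Parse a tab-delimited data_params file into a list of dicts.

    Assumptions (not confirmed by the code base): 'vals' holds
    '|'-separated options and 'required' is '0' or '1'.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Drop empty column names (e.g. from a trailing tab) and trim values.
        clean = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key and key.strip()
        }
        clean["vals"] = [v for v in clean.get("vals", "").split("|") if v]
        clean["required"] = clean.get("required") == "1"
        rows.append(clean)
    return rows


SAMPLE = (
    "tabname\tname\tsra_field\tdisplay_order\tnested_order\tlabel\t"
    "datatype\tvals\trequired\n"
    "strains\tspecies\tSample,Metadata,Species\t7\t0\tSpecies\tcombo\t"
    "Cronobacter sakazakii|Cronobacter dublinensis|Cronobacter universalis\t0\n"
)

params = read_data_params(io.StringIO(SAMPLE))
```

Here `params[0]["vals"]` holds the three permitted species names as a list, which makes it easy to spot a wrong separator or a missing column.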
Adding the database in the config¶
In entero/__init__.py, add an item to the ACTIVE_DATABASES dictionary, with the name of the database as the key and a list containing the following:
- The name of the genus. This is important, as it is used to retrieve the appropriate records from the SRA and to check whether assemblies belong to the correct taxon
- The URL of the database
- A boolean showing whether the database is public (True) or private (False)
- Whether the database is active (1) or not (0). While the database is still being created, this should be set to 0
- The three-letter code identifying the database
e.g.
'cronobacter': [
    'Cronobacter',
    'postgresql://%s:%s@%s/cronobacter' % (USER, PASS, POSTGRES_SERVER),
    True,
    0,
    'CRO'
]
The DB_CODES dictionary in config.py also needs to be updated, with the database name as the key and a list comprising a name and the code, e.g.
'cronobacter' : ['cronobacter','CRO']
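Because the same three-letter code appears in two dictionaries, a quick consistency check can catch typos before the application starts. The sketch below validates one entry against the five-item layout described above. The function `check_database_entry` is hypothetical, and the connection URL uses placeholder credentials rather than real settings.

```python
def check_database_entry(name, active_databases, db_codes):
    """Return a list of problems found in the config entries for `name`."""
    entry = active_databases.get(name)
    if entry is None:
        return ["%s is missing from ACTIVE_DATABASES" % name]
    if len(entry) != 5:
        return ["%s should have 5 items, found %d" % (name, len(entry))]

    genus, url, public, active, code = entry
    problems = []
    if not isinstance(public, bool):
        problems.append("public flag must be True or False")
    if active not in (0, 1):
        problems.append("active flag must be 0 or 1")
    if len(code) != 3:
        problems.append("database code must have three letters")

    codes_entry = db_codes.get(name)
    if not codes_entry or len(codes_entry) < 2 or codes_entry[1] != code:
        problems.append("DB_CODES entry does not match code %r" % code)
    return problems


# Example data mirroring the entries above (placeholder credentials).
ACTIVE_DATABASES = {
    "cronobacter": [
        "Cronobacter",
        "postgresql://user:password@localhost/cronobacter",
        True,
        0,
        "CRO",
    ]
}
DB_CODES = {"cronobacter": ["cronobacter", "CRO"]}

problems = check_database_entry("cronobacter", ACTIVE_DATABASES, DB_CODES)
```

An empty `problems` list means the two entries agree with each other and with the expected layout.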
Creating the Actual Database¶
First of all, the database itself must be created. This can be done either in a PostgreSQL administration GUI such as pgAdmin (right-click on Databases > New Database..., then type in the name and press OK) or from the PostgreSQL command line with
CREATE DATABASE <database name>
WITH OWNER = <owner_name>
ENCODING = 'UTF8'
TABLESPACE = pg_default
LC_COLLATE = 'en_GB.UTF-8'
LC_CTYPE = 'en_GB.UTF-8'
CONNECTION LIMIT = -1;
This must be done using the PostgreSQL admin account (e.g. postgres), for example by running psql as follows
sudo -su postgres psql -p<portno>
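If several databases are set up this way, the statement can be generated rather than typed each time. The sketch below fills the database and owner names into the same statement and rejects names that are not plain lowercase identifiers, since they are inserted into SQL unquoted. The function `build_create_database_sql` and the owner name used in the example are hypothetical.

```python
import re

# Same statement as above, with placeholders for the two names.
CREATE_DATABASE_TEMPLATE = """CREATE DATABASE {name}
    WITH OWNER = {owner}
    ENCODING = 'UTF8'
    TABLESPACE = pg_default
    LC_COLLATE = 'en_GB.UTF-8'
    LC_CTYPE = 'en_GB.UTF-8'
    CONNECTION LIMIT = -1;"""

# Only plain lowercase identifiers are accepted, because the names are
# placed into the statement without quoting.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def build_create_database_sql(name, owner):
    """Return the CREATE DATABASE statement for the given names."""
    for label, value in (("database name", name), ("owner", owner)):
        if not _IDENTIFIER.match(value):
            raise ValueError("invalid %s: %r" % (label, value))
    return CREATE_DATABASE_TEMPLATE.format(name=name, owner=owner)


# 'entero_admin' is a placeholder owner, not a real account name.
sql = build_create_database_sql("cronobacter", "entero_admin")
```

The resulting string can be pasted into a psql session started as shown above.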
Next, the database can be populated with tables based on the models, using the script [create_new_database](Maintenance Scripts#markdown-header-create_new_database)
python manage.py create_new_database -d cronobacter
This script also populates the data_param table and adds a few schemes that are generic to all databases:
- assembly_stats: displays information about the assembly. The fields for this scheme have been added to data_param
- ref_masker: not displayed to the user, but contains information about repeats in a genome and is used when calling SNPs
- snp_calls: also not displayed, but stores information about SNPs called for a strain against a particular reference
- prokka_annotation: contains information pointing to the annotation files (GFF and GenBank) for genomes. The fields for this scheme have also been added to the data_param table
N.B. Now change the entry in ACTIVE_DATABASES in config.py to make the database active (change the fourth value from 0 to 1).
Populating the Database¶
Insert a scheme hosted in NServ¶
You can insert a scheme that has been defined in NServ into the database using the script “manage.py add_nserv_scheme”.
To insert an rMLST scheme for the Cronobacter database:
python manage.py add_nserv_scheme -d cronobacter -s rMLST -r Cronobacter_rMLST
To insert a wgMLST scheme for Cronobacter and grant access to the users listed in the ‘-u’ parameter:
python manage.py add_nserv_scheme -d cronobacter -s wgMLST -r CROwgMLST_wgMLST -u user_name_1,user_name_2
Importing SRA Data¶
You can populate the database using the script [importSRA](Maintenance Scripts#markdown-header-importsra). To import all SRA records from the last 3000 days (a little over eight years):
python manage.py importSRA -d cronobacter -r 3000
Assembling Genomes¶
To assemble genomes, you can either use the web interface or run the script [update_assemblies](Maintenance Scripts#markdown-header-update_assemblies), which checks for any unassembled strains and sends them off for assembly. For example, to send two strains for assembly with high priority:
python manage.py update_assemblies -p -9 -d cronobacter -l 2
Calling Schemes¶
Schemes also need to be called for each assembly. Again, this can be done through the web interface or with the script [update_all_schemes](Maintenance Scripts#markdown-header-update_all_schemes). In this case, probably only the prokka annotation will be called, as it is the only scheme in the database.
python manage.py update_all_schemes -d cronobacter -p 9
To automate the import, assembly and calling of schemes, you can add the name of the database to the shell scripts scripts/daily_update.sh and scripts/daily_import.sh; see [here](Overall Structure#markdown-header-scripts)
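The three steps above (import, assembly, scheme calling) always run in the same order, so they can be chained. The sketch below builds the command lines shown earlier in this section and either prints or runs them. It is only an illustration in Python and does not reproduce the contents of the shell scripts; the function names and the default of importing one day of records are assumptions.

```python
import subprocess


def daily_update_commands(database, import_days=1, assembly_limit=2):
    """Build the manage.py command lines for one database.

    The flags are the ones used in the examples above; the default of
    importing a single day of records is an assumption.
    """
    base = ["python", "manage.py"]
    return [
        base + ["importSRA", "-d", database, "-r", str(import_days)],
        base + ["update_assemblies", "-p", "-9", "-d", database,
                "-l", str(assembly_limit)],
        base + ["update_all_schemes", "-d", database, "-p", "9"],
    ]


def run_daily_update(database, dry_run=True):
    """Print the commands, or run them in order when dry_run is False."""
    for command in daily_update_commands(database):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the chain if a step fails.
            subprocess.run(command, check=True)


commands = daily_update_commands("cronobacter", import_days=3000)
```

Calling `run_daily_update("cronobacter")` with the default `dry_run=True` only prints the commands, which is a safe way to review them before running anything.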
Import whole genome¶
This script imports assembled genomes from GenBank:
python manage.py import_whole_genome -d cronobacter
Only complete genomes will be imported unless you add -c False to the command.