Adding a new Database
=====================

Creating the SQLAlchemy Model
-----------------------------

In entero/databases create a new folder (Python package) with the name of your database. Then add the four files shown below (you can copy them from another database directory).

.. code-block:: text

    entero
    │
    └───databases
        │
        └───
        │       data_params.txt
        │       db.py
        │       models.py
        │       __init__.py
        │
        └───senterica
        │

\_\_init\_\_.py
^^^^^^^^^^^^^^^

The standard \_\_init\_\_ file for any Python package.

.. code-block:: python

    from db import *

models.py
^^^^^^^^^

This contains all the table classes, which inherit most members (columns) from the base classes in entero/databases/generic_models.py, although you are free to add extra ones. If you add any new columns then you also need to describe them in data_params.txt. The example below adds a new column 'species'.

.. code-block:: python

    from sqlalchemy import Integer, Column, ForeignKey, Table, String, DateTime, Time, Text, MetaData
    from sqlalchemy.ext.declarative import declarative_base
    from entero.databases import generic_models as mod

    metadata = MetaData()
    Base = declarative_base(metadata=metadata)
    TracesAssembly = mod.getTraceAssemblyTable(metadata)

    class Assemblies(Base, mod.AssembliesRel):
        pass

    class AssembliesArchive(Base, mod.AssembliesArchive):
        pass

    class Traces(Base, mod.Traces):
        pass

    class TracesArchive(Base, mod.TracesArchive):
        pass

    class Schemes(Base, mod.Schemes):
        pass

    class SchemesArchive(Base, mod.SchemesArchive):
        pass

    class Strains(Base, mod.Strains):
        #add any database specific fields here
        species = Column("species", String(200))

    class StrainsArchive(Base, mod.StrainsArchive):
        #duplicate any added strain fields here
        species = Column("species", String(200))

    class DataParam(Base, mod.DataParam):
        pass

    class AssemblyLookup(Base, mod.AssemblyLookup):
        pass

db.py
^^^^^

The database class inherits all functions and members from AbstractDatabase (entero/databases/database.py), and usually no new methods have to be added.

.. code-block:: python

    from ..database import AbstractDatabase
    import datetime
    from models import *
    import os
    from sqlalchemy.sql import func, exists
    from sqlalchemy import and_, select, or_
    from entero import app
    from sqlalchemy.exc import IntegrityError

    class DB(AbstractDatabase):
        #add any database specific methods here
        pass

data_params.txt
^^^^^^^^^^^^^^^

This is a tab delimited text file that should contain information about the extra fields that you specified in models.py. The fields are described in :doc:`developer-metadata-validation`. Below is an example for an extra field, species, in the strains table. The value is pulled from the SRA data where possible, and users are limited to three values (please note that ':' in vals should be '|', but this character could not be escaped in the table).

+-------------+----------+-------------------------+-------------------+------------------+-----------+--------------+-----------------------------------------------------------------------+--------------+
| **tabname** | **name** | **sra_field**           | **display_order** | **nested_order** | **label** | **datatype** | **vals**                                                              | **required** |
+-------------+----------+-------------------------+-------------------+------------------+-----------+--------------+-----------------------------------------------------------------------+--------------+
| strains     | species  | Sample,Metadata,Species | 7                 | 0                | Species   | combo        | Cronobacter sakazakii:Cronobacter dublinensis:Cronobacter universalis | 0            |
+-------------+----------+-------------------------+-------------------+------------------+-----------+--------------+-----------------------------------------------------------------------+--------------+

Adding the database in the config
---------------------------------

In entero/\_\_init\_\_.py add an item to the ACTIVE_DATABASES dictionary, with the name of the database as the key and a list containing the following:

* The name of the genus - this is important as it is used to retrieve the appropriate records from the SRA and to check whether assemblies are of the correct taxon
* The URL of the database
* A boolean showing whether the database is public (True) or private (False)
* Whether the database is active (1) or not (0). Initially, while we are creating the database, this should be set to 0
* The three letter code identifying the database

e.g.

.. code-block:: python

    'cronobacter': [
        'Cronobacter',
        'postgresql://%s:%s@%s/cronobacter'%(USER, PASS, POSTGRES_SERVER),
        True,
        0,
        'CRO'
    ]

In config.py, the DB_CODES dictionary also needs to be updated, with the database name as the key and a list comprising a name and the code, e.g.

.. code-block:: python

    'cronobacter' : ['cronobacter','CRO']

Creating the Actual Database
----------------------------

First, the database itself must be created. This can be done through the PostgreSQL graphical client (right click on Databases > New Database..., then type in the name and press OK) or from the PostgreSQL command line:

.. code-block:: sql

    CREATE DATABASE
      WITH OWNER =
           ENCODING = 'UTF8'
           TABLESPACE = pg_default
           LC_COLLATE = 'en_GB.UTF-8'
           LC_CTYPE = 'en_GB.UTF-8'
           CONNECTION LIMIT = -1;

This must be done using the PostgreSQL admin account (e.g. postgres), for example by running psql as follows:

.. code-block:: bash

    sudo -su postgres psql -p

Next, the database can be populated with tables based on the models, using the script create_new_database (described on the Maintenance Scripts page).

.. code-block:: bash

    python manage.py create_new_database -d cronobacter

This script also populates the data_param table and adds a few schemes which are generic to all databases:

* **assembly_stats** This will display information about the assembly. The fields for this scheme have been added to data_param
* **ref_masker** This scheme is not displayed to the user, but contains information about repeats in a genome and is used when calling SNPs
* **snp_calls** Again, this scheme is not displayed, but will store information about SNPs called for a strain against a particular reference
* **prokka_annotation** Contains information pointing to the annotation files (GFF and GenBank) for genomes. Again, the fields for this scheme have been added to the data_param table

**N.B.** Now change the entry in ACTIVE_DATABASES in config.py to make the database active (change the fourth value to 1).

Populating the Database
-----------------------

Insert a scheme hosted in NServ
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can insert a scheme that has been defined in NServ into the database using the script "manage.py add_nserv_scheme".

Insert an rMLST scheme for the Cronobacter database:

.. code-block:: bash

    python manage.py add_nserv_scheme -d cronobacter -s rMLST -r Cronobacter_rMLST

Insert a wgMLST scheme for Cronobacter, and grant access to the users specified in the '-u' parameter:

.. code-block:: bash

    python manage.py add_nserv_scheme -d cronobacter -s wgMLST -r CROwgMLST_wgMLST -u user_name_1,user_name_2

Importing SRA Data
^^^^^^^^^^^^^^^^^^

You can populate the database using the script importSRA (described on the Maintenance Scripts page). To import all SRA records for the last 3000 days (roughly eight years):

.. code-block:: bash

    python manage.py importSRA -d cronobacter -r 3000

Assembling Genomes
^^^^^^^^^^^^^^^^^^

To assemble genomes you can either use the web interface or run the script update_assemblies (described on the Maintenance Scripts page), which will check for any unassembled strains and send them off for assembly. For example, to send two strains for assembly with high priority:

.. code-block:: bash

    python manage.py update_assemblies -p -9 -d cronobacter -l 2

Calling Schemes
^^^^^^^^^^^^^^^

Schemes also need to be called for each assembly. Again, this can be done through the web interface or using the script update_all_schemes (described on the Maintenance Scripts page). In this case probably only prokka annotation will be called, as it is the only scheme in the database.

.. code-block:: bash

    python manage.py update_all_schemes -d cronobacter -p 9

To automate the import, assembly and calling of schemes, you can add the name of the database to the shell scripts scripts/daily_update.sh and scripts/daily_import.sh (see the scripts section of the Overall Structure page).

Import whole genome
^^^^^^^^^^^^^^^^^^^

Imports assembled genomes from GenBank.

.. code-block:: bash

    python manage.py import_whole_genome -d cronobacter

Only complete genomes will be imported unless you add (-c False) to the command.
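The population steps on this page (SRA import, assembly, scheme calling and whole genome import) can also be chained from Python when trying out a new database by hand. The following is only a minimal sketch: the function names ``build_population_commands`` and ``run_population`` and their parameters are hypothetical and not part of the code base, while the sub-commands and flags are copied from the examples on this page. It assumes it is run from the directory containing manage.py.

```python
import subprocess


def build_population_commands(database, days=3000, assembly_limit=2):
    """Return the manage.py invocations from this page as argument lists.

    Hypothetical helper: only the sub-commands and flags come from the
    examples on this page; the priorities (-9 and 9) are copied verbatim.
    """
    manage = ["python", "manage.py"]
    return [
        # Import SRA records for the last `days` days
        manage + ["importSRA", "-d", database, "-r", str(days)],
        # Send up to `assembly_limit` unassembled strains for assembly
        manage + ["update_assemblies", "-p", "-9", "-d", database,
                  "-l", str(assembly_limit)],
        # Call all schemes for the assemblies
        manage + ["update_all_schemes", "-d", database, "-p", "9"],
        # Import complete assembled genomes from GenBank
        manage + ["import_whole_genome", "-d", database],
    ]


def run_population(database, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for command in build_population_commands(database):
        print(" ".join(command))
        if not dry_run:
            # check_call raises CalledProcessError if a step fails,
            # so later steps are not attempted after an error.
            subprocess.check_call(command)


if __name__ == "__main__":
    # Dry run: only prints the commands, nothing is executed.
    run_population("cronobacter", dry_run=True)
```

With ``dry_run=True`` the sketch only prints the four commands, which makes it easy to check them before anything is sent for assembly. For routine updates the shell scripts scripts/daily_update.sh and scripts/daily_import.sh remain the documented route.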