Maintenance Scripts

These scripts are in manage.py or manage_metadata.py (in the top level directory) and can be run (assuming you are in the EnteroBase virtual environment and in the top level directory) using the following syntax:

python manage.py script_name parameters
or
python manage_metadata.py script_name parameters

e.g.

python manage.py check_queued_jobs -d ecoli -s MLST_Achtman

Database Scripts

create_new_database

Creates a new database and adds the rMLST scheme to it

  • -d --database - The name of the database (required)
  • -c --create_schemes - Whether to create generic schemes (e.g. assembly_stats, snp_calling etc.); default is True
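
e.g. (the database name here is purely illustrative):

python manage.py create_new_database -d ecoli -c True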

backup_database

Backs up the database, in PostgreSQL custom format, in the specified folder under a subfolder named with the current date

  • -f --folder - The folder in which to create the backup (a subfolder named with the current date will be created in this folder)
  • -d --database - The name of the database to back up. By default all active databases are backed up
  • -s --system - By default the system database is backed up. Set this parameter to False if you do not want this behaviour
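
e.g. (the backup folder path is illustrative):

python manage.py backup_database -f /backups/enterobase -d senterica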

Scripts For Jobs

Schemes

update_all_schemes

Checks all complete assemblies that have passed QC and verifies that every scheme has either been called on them or is queued. If not, any outstanding jobs will be sent.

  • -d --dbName - The database (default senterica)
  • -l --limit - The maximum number of jobs to send per scheme (default 100)
  • -f --force - If True, T, t or true, then jobs which have failed more than 5 times will be sent (default false)
  • -q --queue - The queue in which the jobs will be placed (default backend)
  • -p --priority - The priority of the jobs, between 9 (low) and -9 (high) (default 0)
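
e.g. (the limit and priority values are illustrative):

python manage.py update_all_schemes -d senterica -l 200 -p -5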

Currently run every 4 hours for each species as part of the enterobase_daemons cron job

update_scheme

Checks all complete assemblies that have passed QC and verifies that the specified scheme has either been called on them or is queued. If not, any outstanding jobs will be sent.

  • -d --dbName - The database (default senterica)
  • -l --limit - The maximum number of jobs to send per scheme (default 100)
  • -s --scheme - The name of the scheme (default rMLST)
  • -f --force - If True, T, t or true, then jobs which have failed more than 5 times will be sent (default false)
  • -q --queue - The queue in which the jobs will be placed (default backend)
  • -p --priority - The priority of the jobs, between 9 (low) and -9 (high) (default 0)
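
e.g. (using the MLST_Achtman scheme shown earlier; the limit is illustrative):

python manage.py update_scheme -d ecoli -s MLST_Achtman -l 50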

check_queued_jobs

Forces callbacks on all jobs that are currently queued for the scheme specified.

  • -d --database - The database (default senterica)
  • -s --scheme - The name of the scheme (default rMLST)

kill_duplicate_jobs

Tries to find all duplicate nserv jobs, i.e. jobs for the same assembly/scheme combination. Duplicate entries are removed from the assembly lookup and the associated jobs are killed.

  • -d --database - The database (default senterica)
  • -s --scheme - The name of the scheme (default rMLST)
  • -k --kill - If False, then the entry will only be removed from the database; the job will not be killed (default True)
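
e.g. (illustrative; with -k False the duplicate entries are removed but the jobs themselves are left running):

python manage.py kill_duplicate_jobs -d senterica -s rMLST -k False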

Assemblies

update_assemblies

This script will check for any strains in the database which do not have an assembly and are capable of being assembled, e.g. have paired Illumina reads. Assemblies which have failed more than 5 times will not be resent.

  • -d --dbName - The database (default senterica)
  • -l --limit - The maximum number of assemblies to send (default 100)
  • -f --force - The assembly will not be sent if its number of failures is greater than this number (default 5)
  • -q --queue - The queue in which the assemblies will be placed (default backend)
  • -p --priority - The priority of the assemblies, between 9 (low) and -9 (high) (default 0)
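
e.g. (the limit value is illustrative):

python manage.py update_assemblies -d senterica -l 50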

Currently run every 4 hours for each species as part of the enterobase_daemons cron job

check_queued_assemblies

Forces callbacks on all assemblies that are currently queued, to check whether the callback for any that have completed was missed.

  • -d --database - The database (default senterica)
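
e.g.

python manage.py check_queued_assemblies -d senterica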

Was previously run every 4 hours for each species as part of the enterobase_daemons cron job, but the run time was very long, so it is currently commented out.

General

runcelery

Runs Celery

  • -t --threads - The number of threads (default 1)
  • -q --queue - The job queue (default celery)
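
e.g. (the thread count is illustrative):

python manage.py runcelery -t 4 -q backend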

update_job

Forces callbacks on the specified job(s)

  • -j --job - The number of the job, or multiple job numbers separated by commas
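
e.g. (the job numbers are illustrative):

python manage.py update_job -j 10234,10235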

change_job_priority

Changes the priority of the specified job(s)

  • -j --job - The number of the job, or multiple job numbers separated by commas (default 0)
  • -u --user - Change the priority of all jobs submitted by this user (default none)
  • -p --priority - The new priority value (default -9)
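
e.g. (the job number is illustrative):

python manage.py change_job_priority -j 10234 -p -9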

Scripts For Importing Data

import_whole_genome

Imports assembled genomes from GenBank

  • -d --database - The name of the database
  • -t --term - Keyword which will identify the assemblies. Can be specific, e.g. the accession number, or broader, such as the species name (default none - all the assembled genomes for the database's genus will be imported)
  • -c --complete - If set to T, True or true, only complete genomes will be imported - not contigs or scaffolds (default True)
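
e.g. (the search term is illustrative):

python manage.py import_whole_genome -d senterica -t "Salmonella enterica" -c True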

load_user_reads

This script will check which reads from a particular user need uploading, attempt to copy the read files from the specified folder (either local or FTP), and initiate all the analyses if the copy succeeds

  • -d --database - The name of the database
  • -f --folder - The folder, either local or remote, which contains the user's read files
  • -r --remote - If True, then the folder is a remote FTP folder (default False)
  • -s --settings - The details of the FTP site in the format address,user_name,password (default: the maintainer's EBI drop box details)
  • -u --user - The username of the user
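
e.g. (the username and folder are illustrative):

python manage.py load_user_reads -d senterica -u some_user -f /uploads/some_user -r False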

update_sra_fields

Updates all records with metadata taken from the SRA. Can be used to retrospectively add data to a new column.

  • -d --database - The name of the database
  • -f --fields - The field or fields (comma separated) to update
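
e.g. (the field names are illustrative and depend on the database's metadata columns):

python manage.py update_sra_fields -d senterica -f collection_date,country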

import_sra_data

Imports data for a particular sample, accession or project ID

  • -d --database - The name of the database
  • -p --project - The ID of the project - should be the SRA project ID, e.g. ERP020979, not the BioProject ID
  • -s --sample - The ID of the sample (or a comma-delimited list of IDs)
  • -a --accession - The accession (run) ID (or a comma-delimited list of IDs)
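
e.g. (using the SRA project ID mentioned above):

python manage.py import_sra_data -d senterica -p ERP020979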

importSRA

Imports data from the SRA, either from a file or all new entries since a specified date

  • -r --reldate - Integer - import all records from the last X days, e.g. -r 30 will import all short reads that are not already in EnteroBase and were added to the SRA in the last 30 days (default 7)
  • -f --file_loc - If records are to be loaded from a file and not directly from the SRA, then this parameter should specify the location of the file (JSON format)
  • -d --db - The name of the database
  • -l --live - If True then the task will be run via Celery (default False)
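
e.g. (importing the last 30 days of records, as described above):

python manage.py importSRA -d senterica -r 30 -l True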

Currently run every 4 hours for each species as part of the enterobase_daemons cron job

Scripts For JBrowse

make_jbrowse_annotation

Will create all the necessary files and configs for displaying an assembly in JBrowse. For GenBank files, a track for the GenBank annotation will be created. For in-house assemblies, a track showing the quality of each base will be generated (based on the FASTQ file). Tracks for Prokka annotations and for all schemes in the database will also be created, as well as a GC content track.

  • -d --database - The name of the database
  • -b --barcode - The barcode of the assembly to be annotated
  • -f --force - If True then a current annotation will be overwritten. If False and there is already an annotation, nothing will be done (default False)
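
e.g. (the barcode is illustrative):

python manage.py make_jbrowse_annotation -d senterica -b SAL_AA0001AA -f False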

Upload user reads to EBI

create_database_EBI_Project

Creates an EBI project for an EnteroBase database. It can also be used to create a new EBI project if the project title, name and description are provided.

  • -d --database - The name of the database
  • -m --mode - Running mode. Possible values:
    test: no data will be uploaded
    force: upload without confirmation
    prod: ask for confirmation before uploading the data
  • -n --name - EBI project name
  • -t --title - EBI project title
  • -s --sc - EBI project description
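
e.g. (the name, title and description are illustrative; in test mode no data is uploaded):

python manage.py create_database_EBI_Project -d senterica -m test -n my_project -t "My project title" -s "My project description"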

Information about a project created in this way is stored in the broker_ebi_projects.json file, which provides information about the EnteroBase projects to the scripts used to upload reads.

upload_users_reads_to_EBI

Uploads users' reads to EBI. If users_file is not provided, the script uses the default user file, which contains the names of all users who have permitted us to submit their reads to EBI. Otherwise, it uses the user names inside the provided file. If a database is not provided, it submits all the user reads from all public databases.

  • -m --mode - Run mode
  • -u --users_file - Text file containing the names of the users whose reads are to be uploaded to EBI
  • -d --database - Database name
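
e.g. (the users file name is illustrative; users_EBI.txt is the default file mentioned under delete_user_reads below):

python manage.py upload_users_reads_to_EBI -m test -u users_EBI.txt -d senterica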

This script then makes multiple calls to upload_reads_to_EBI_project to upload the reads associated with each user/database combination.

Currently run twice a day as a cron job

upload_reads_to_EBI_project

Uploads strain reads, along with their metadata, to EBI. If the included_strain_barcodes attribute is provided, then the strains with these barcodes will be uploaded. Otherwise, the public user strains will be uploaded. If a project title is provided, the script will get the corresponding EBI EnteroBase project and upload the user reads to that project.

  • -d --database - Database name
  • -u --user_name - User name
  • -m --mode - Run mode
  • -t --title - EBI project title
  • -b --included_strain_barcodes - A string of strain barcodes separated by ','
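
e.g. (the user name and barcodes are illustrative):

python manage.py upload_reads_to_EBI_project -d senterica -u some_user -m test -b SAL_AA0001AA,SAL_AA0002AA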

The read files are checked, modified if necessary to ensure they meet EBI requirements, and then uploaded, together with selected metadata from EnteroBase to provide the metadata for EBI. When this is done, EBI provides the accession numbers (ERR…) to EnteroBase, and EnteroBase updates its strain data with these numbers. This ensures that it does not subsequently download these strains from NCBI/EBI.

Delete user read files

delete_user_reads

Deletes the user read files for assembled strains to free storage. It exempts users whose names are provided inside the users_EBI.txt file, which is read from the EBI_TEMP_FOLDER folder. This file contains the user names of the EnteroBase research group and of the users who have permitted us to upload their reads to EBI. EBI_TEMP_FOLDER is an attribute inside the configuration YML file.

  • -m --mode - Run mode; can be either test or delete
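
e.g.

python manage.py delete_user_reads -m test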

Cache the data

get_databases_info

Queries the databases and saves the results as a JSON string inside a text file. These results (JSON) are needed to load the main web page. It should be run periodically (e.g. every 10 minutes) to update the file contents to reflect the status of the data inside the databases. The content is saved inside a file whose location is determined by the configuration file, i.e. the ENTEROBASE_CACHE_FOLDER attribute.
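
e.g. (a possible crontab entry to run it every 10 minutes, as suggested above; the installation path is illustrative):

*/10 * * * * cd /path/to/enterobase && python manage.py get_databases_info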