Overview

Enterobase consists of five largely independent units:

  • Enterobase-web: The front end, which provides the user-facing functionality in association with the four other modules listed below.
  • CROBOT: The compute job control system. Runs pipeline jobs on calculation nodes to assemble and analyse strains, driven by Enterobase-web. Uses a PostgreSQL database to persist data on every job that has been run, and can be queried via an API (see the sketch after this list).
  • NSERV: The nomenclature server. Maintains the MLST records in a PostgreSQL database.
  • MetaParser: The daily update server. Checks NCBI for the latest records under the control of Enterobase-web, reformats the metadata and sends it back to Enterobase-web.
  • RCatch: Short-read download control. Communicates with CRobot.
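
CRobot's job API is mentioned above but not documented in this section. The following is a minimal sketch of how a client might query it; the base URL, endpoint path and response fields are hypothetical illustrations, not the actual CRobot interface.

    # Minimal sketch of querying CRobot's job API over HTTP.
    # The base URL, endpoint path and response fields are hypothetical
    # illustrations, not the documented CRobot interface.
    import requests

    CROBOT_API = "http://crobot.example.org/api"   # hypothetical base URL

    def get_job_record(job_id: int) -> dict:
        """Fetch the persisted record for a single pipeline job."""
        resp = requests.get(f"{CROBOT_API}/jobs/{job_id}", timeout=30)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        job = get_job_record(12345)
        print(job.get("status"), job.get("pipeline"))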

Enterobase is built around the Python Flask web framework, run as multiple parallel worker processes under the Gunicorn WSGI server. Enterobase can be set up on a single server, or the modules can be spread across multiple servers.
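
Gunicorn is normally configured from the command line or a Python configuration file. The sketch below shows a typical multi-worker setup; the bind address, worker count and timeout are illustrative assumptions, not Enterobase's actual deployment settings.

    # gunicorn.conf.py -- illustrative multi-process configuration.
    # The values below are assumptions for the sake of example, not
    # Enterobase's actual settings.
    import multiprocessing

    bind = "0.0.0.0:8000"                           # listen address for the web front end
    workers = multiprocessing.cpu_count() * 2 + 1   # number of parallel worker processes
    worker_class = "sync"                           # default synchronous workers
    timeout = 120                                   # allow long-running requests to finish

The server would then be started with something like "gunicorn -c gunicorn.conf.py <wsgi_module>:app", where the WSGI module path depends on how Enterobase-web is packaged.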

The modules were originally written in Python 2 but have since been ported to Python 3.8.8.

Dependencies

The modules make extensive use of third-party applications, which either have to be installed manually or are installed automatically by the install scripts. These include:

  • Postgres for various per-module databases
  • pyenv for python virtual environments
  • Python modules such as Celery and Flask (a Celery sketch follows this list)
  • R and Perl for some pipeline jobs run by CRobot
  • EToKi for assembly of reads
  • GrapeTree for visualisation of hierarchical clusters
  • Third party applications for analysis of assemblies
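
Celery is typically used to push long-running work out of the Flask request cycle. The sketch below shows the general pattern only; the broker URL and the task body are assumptions, not Enterobase's actual task definitions.

    # Minimal Celery task sketch. The broker URL and the task itself are
    # illustrative assumptions, not Enterobase's actual job definitions.
    from celery import Celery

    app = Celery("entero_tasks", broker="redis://localhost:6379/0")  # hypothetical broker

    @app.task
    def queue_assembly(accession: str) -> str:
        """Placeholder for queuing an assembly pipeline job for one strain."""
        # In a real deployment this would hand the job to CRobot / EToKi.
        return f"queued assembly for {accession}"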

Storage requirements

Each module requires some local storage, but long-term storage of data associated with strains is on a network drive, currently called /share_space and sized at 60 TB. The types of data stored include:

  • All read assemblies
  • Results from running pipeline jobs. These are indexed by the job number used to create the data (a hypothetical path layout is sketched after this list).
  • Uploaded user reads for assembly.
  • Reformatted assemblies for viewing in JBrowse, generated when required and then persisted for future reuse.
  • Temporary storage of reads that have been downloaded from NCBI/ERA by RCatch for assembly
  • Storage of database backups
  • Cached Enterobase status information used to speed up front-page reloads
  • Storage of large config files required by metaParser
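
Because pipeline results are indexed by job number, the output of a given job can be located directly on the shared drive. The helper below is a hypothetical illustration of such a layout; the real directory structure under /share_space may differ.

    # Hypothetical helper for locating job output on the shared drive.
    # The layout under /share_space is assumed for illustration only.
    from pathlib import Path

    SHARE_SPACE = Path("/share_space")

    def job_output_dir(job_id: int) -> Path:
        """Return the assumed output directory for a given pipeline job."""
        # Sharding by thousands keeps individual directories a manageable size.
        return SHARE_SPACE / "jobs" / str(job_id // 1000) / str(job_id)

    print(job_output_dir(123456))   # -> /share_space/jobs/123/123456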

There is currently no automated system for deleting redundant data, such as user-uploaded reads (particularly where the data was subsequently submitted to ENA) or outputs from pipeline jobs that have been superseded because the job has been rerun.