Frequently Asked Questions¶
Why has my assembly failed?¶
The following link describes Quality Assessment (QA) evaluation and will help you understand why your assemblies have failed: https://enterobase.readthedocs.io/en/latest/pipelines/backend-pipeline-qaevaluation.html
The reasons your assemblies failed QC can be found under “Assembly stats” in the Experimental Data menu. Cells marked red indicate the specific criteria that failed. The cut-offs for the various criteria can be found at https://enterobase.readthedocs.io/en/latest/pipelines/backend-pipeline-qaevaluation.html
Why can I not obtain a sequence type for my failed assembly?¶
Any assembly that fails quality control is not used to call MLST or for other downstream analyses.
How do I get a new allele/ST with Sanger/ABI sequencing traces?¶
The legacy MLST website, http://mlst.warwick.ac.uk, is now closed. You can no longer submit new strains with known ST designations to it, and we no longer support new alleles or new STs based on ABI/Sanger sequencing.
During the lifespan of the legacy MLST website, one third of the new alleles we received were sequencing errors rather than real sequence variations. In addition, we don’t know how many new STs were wrong, because novel combinations of known alleles can be due to mix-ups rather than recombination.
To improve the quality of the database, we require all new alleles/STs to be defined from NGS short reads. If you upload the short reads to http://enterobase.warwick.ac.uk/, we will assemble them and report back all identified allele designations. In addition, you gain access to further information by comparing your genomes with >120,000 existing genomes from all over the world. These genomes were either published in the NCBI Short Read Archive or uploaded by other users.
The average cost of NGS sequencing of a bacterial strain has dropped significantly. If you have difficulty getting NGS sequencing done, we suggest getting in touch with https://microbesng.uk/, which offers low-cost access to NGS. If you still want to stay with ABI sequencing, you can download all the alleles from EnteroBase and build local MLST databases. Alternatively, you can try https://cge.cbs.dtu.dk/services/MLST/, an online tool for allele identification (it will not give you new alleles).
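If you do build a local database from downloaded alleles, the core lookup can be sketched as below. This is only a minimal illustration that assumes the alleles are available in FASTA format and that an exact match (on either strand) is sufficient; the allele names `adk_1` and `adk_2` and all function names are hypothetical and are not part of EnteroBase.

```python
from typing import Dict, Iterable, Optional

# Translation table used to complement A/C/G/T bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {allele name: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the allele name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_allele_index(alleles: Dict[str, str]) -> Dict[str, str]:
    """Map each allele sequence to its name for exact-match lookup."""
    return {seq: name for name, seq in alleles.items()}


def identify_allele(query: str, index: Dict[str, str]) -> Optional[str]:
    """Return the name of the exactly matching allele, or None."""
    query = query.upper()
    return index.get(query) or index.get(reverse_complement(query))


# Hypothetical two-allele database, for illustration only.
example_fasta = [">adk_1", "ACGTAC", ">adk_2", "ACGTAA"]
allele_index = build_allele_index(parse_fasta(example_fasta))
print(identify_allele("ACGTAA", allele_index))  # matches on the forward strand
print(identify_allele("GGGGGG", allele_index))  # no exact match
```

A sequence without an exact match returns `None`. As with the online tool above, such a local lookup can only recognise known alleles and cannot assign new ones.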
How often is EnteroBase updated with reads from the SRA?¶
We scan the NCBI Short Read Archive every day and download newly released Illumina reads as well as complete genomes.
Which genomes/sequencing data are included in EnteroBase?¶
We include all complete genomes, as well as Illumina paired-end reads from NCBI SRA. There are also over 7,000 sets of short reads uploaded by our users.
Which sequencing platforms can be handled by EnteroBase?¶
Only Illumina short reads for the moment. Assemblies from PacBio sequencing can be uploaded by getting in touch with the administrators. If you can suggest a reliable pipeline to assemble reads from other platforms, please get in touch with us as well.
Is the source code available for EnteroBase?¶
How is recombination handled in EnteroBase for SNP- and MLST-based methods?¶
Recombination has little effect on MLST, see this article. We have not implemented any tool to remove recombination in SNP data.
What is a minimum spanning tree? How is this different to a phylogram?¶
In brief, a minimum spanning tree is a maximum parsimony phylogram without hypothetical nodes: every node in the tree is an observed genotype. Read this link for more details.
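To make the idea concrete, here is a minimal sketch that builds a minimum spanning tree from allelic profiles using Kruskal's algorithm, with the number of differing loci as the distance. It is an illustration under stated assumptions and not the algorithm EnteroBase itself runs; the profile names and allele numbers are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]


def allelic_distance(a: Profile, b: Profile) -> int:
    """Number of loci at which two profiles carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(profiles: Dict[str, Profile]) -> List[Tuple[str, str, int]]:
    """Kruskal's algorithm: return the tree as (node, node, distance) edges."""
    parent = {name: name for name in profiles}

    def find(node: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # All pairwise edges, shortest first (ties broken by node name).
    edges = sorted(
        (allelic_distance(profiles[a], profiles[b]), a, b)
        for a, b in combinations(sorted(profiles), 2)
    )
    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree


# Hypothetical 7-locus profiles, for illustration only.
profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 2),
    "ST_C": (1, 1, 1, 1, 1, 2, 2),
    "ST_D": (3, 3, 1, 1, 1, 2, 2),
}
for a, b, dist in minimum_spanning_tree(profiles):
    print(a, b, dist)
```

Every edge in the result links two observed profiles directly, whereas a phylogram would typically place the same genotypes at the tips and infer unobserved ancestors at the internal nodes.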
Why do I have to upload short reads rather than just already assembled contigs?¶
When identical short reads are fed into different genome assembly pipelines, the results usually differ considerably. We use an internal pipeline to keep our assemblies consistent. Users are allowed to upload contigs under some conditions. Please contact us regarding the specifics.
How do I upload my short reads which are in FASTQ format?¶
To upload reads in FASTQ format, you first need to compress them using the correct compression format (i.e. gzip). On Microsoft Windows, a program that can do this is 7-zip, available at http://www.7-zip.org/download.html . (The download page linked above also links to suitable alternative software for the Mac. You should also be able to use the command-line tool “gzip” on either a Mac or a Linux system.)
After installing 7-zip on your Windows system, start 7-zip. The 7-zip user interface shows the folders on your Windows system, starting in its current directory. Use this to navigate to the folder containing your reads in FASTQ format (i.e. the files with the .fastq filename extension). Then, separately for each .fastq file, do the following: select the .fastq file and go to File -> 7-zip -> Add to Archive. Pick “gzip” for “Archive format” in the dialogue that appears and then press the “OK” button. This will create compressed files ending in “.fastq.gz” that you can upload onto the EnteroBase website by following the instructions here.
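If you have Python installed, the same compression can be scripted on any platform using only the standard library. The following is a minimal sketch; the function name `gzip_fastq_files` is our own invention for this example and is not part of EnteroBase.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def gzip_fastq_files(folder: Path) -> List[Path]:
    """Compress every .fastq file in `folder` to .fastq.gz, keeping the originals."""
    created = []
    for fastq in sorted(folder.glob("*.fastq")):
        # sample.fastq -> sample.fastq.gz, written next to the original file
        target = fastq.with_name(fastq.name + ".gz")
        with open(fastq, "rb") as source, gzip.open(target, "wb") as dest:
            shutil.copyfileobj(source, dest)  # stream in chunks, not all in memory
        created.append(target)
    return created
```

For example, calling `gzip_fastq_files(Path("my_reads"))`, where `my_reads` is a hypothetical folder name, returns the list of “.fastq.gz” files it created and leaves the original .fastq files in place.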
How long do analyses take to run in EnteroBase?¶
We currently run analyses on an HPC cluster hosted by the University of Warwick, as well as on two local standalone servers.
The waiting time for a new analysis depends on its priority and on the number of tasks in the waiting list. Once an analysis starts to run, typical run times are approximately:
- Assembly: 0.5 - 3 hours depending on read depths.
- Genotyping: 1 - 2 minutes.
- Minimum spanning tree: 1 - 2 minutes.
- SNP analysis: depends on the number of strains included. Normally 1-2 hours for < 100 genomes.
- Assigning a new ST after uploading data for a new strain: where required, this consists of carrying out an assembly and genotyping and takes about 0.5 - 3 hours.
I’m worried about our data going public (and being scooped). How does EnteroBase protect me?¶
Your short reads will never be published anywhere, unless you ask us to upload them into EBI/NCBI for you. The genomic assemblies of your strains can be kept private for up to 12 months. Other data, including genotyping results and metadata, will be released immediately. This is to facilitate collaborations between our users and early identification of outbreaks.
What do negative ST values mean for MLST?¶
They represent genomes with incomplete genes. If any of the genes used is missing or truncated in the assembly, the genome cannot be given a formal (positive) ST designation. In such cases, we use a temporary (negative) number as its ST designation. This normally indicates a poor genomic assembly, but that is not always the case. (See the question and answer below for rMLST.)
What do negative rST values mean for rMLST?¶
These negative numbers arise in two different cases. They can be either:
- Newly identified rSTs in EnteroBase, which have not yet been manually confirmed by the curator of rmlst.org (the only server that gives out rST numbers), or
- Genomes with incomplete ribosomal genes. If any of the 51 ribosomal genes used in rMLST is missing or truncated in the assembly, rmlst.org is not able to give the genome a formal (positive) rST designation. In such cases, we use a temporary (negative) number as its rST designation. This normally indicates a poor genomic assembly, but that is not always the case.
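When working with downloaded typing results, it can be convenient to separate strains with formal designations from those with temporary ones. The sketch below assumes only the sign convention described above; the strain names, ST values and function names are hypothetical.

```python
from typing import Dict, List


def is_provisional(st: int) -> bool:
    """True for a temporary (negative) ST or rST designation."""
    return st < 0


def split_by_designation(st_by_strain: Dict[str, int]) -> Dict[str, List[str]]:
    """Group strain names by whether their ST/rST is formal or temporary."""
    groups: Dict[str, List[str]] = {"formal": [], "provisional": []}
    for strain, st in sorted(st_by_strain.items()):
        key = "provisional" if is_provisional(st) else "formal"
        groups[key].append(strain)
    return groups


# Hypothetical typing results, for illustration only.
results = {"strain_1": 19, "strain_2": -4021, "strain_3": 34}
print(split_by_designation(results))
```

Strains in the provisional group are worth checking against their assembly statistics, since a negative value often, but not always, points to a poor assembly.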