Quality Assessment evaluation

Overview

The Quality Assessment (QA) evaluation pipeline runs as the final stage of the [QAssembly] pipeline. It evaluates each assembly against several criteria.

Each assembly is assessed on the following metrics:

  • Number of bases (i.e. total length of contigs in the assembly)
  • Number of contigs
  • N50 value
  • Proportion of N’s
  • Correct Species Assignment in Kraken

The QA evaluation script evaluates these quantities in the order given. The percentage of contigs given a correct species assignment is evaluated last, after the script runs [Kraken] (currently version 0.10.5-beta). Any assembly that fails this quality control is not used to call MLST or for other downstream analyses.
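
The first four metrics can be derived from the contig sequences alone. The following is a minimal sketch of how they might be computed, not the pipeline's own code: the function names and the dictionary layout (contig name mapped to sequence) are assumptions made for illustration.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return the N50: walking from the longest contig downwards, the length
    of the contig at which the running total first reaches half of the
    total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def assembly_metrics(contigs: Dict[str, str]) -> Dict[str, float]:
    """Compute the sequence-based metrics for a dict of name -> sequence."""
    lengths = [len(seq) for seq in contigs.values()]
    total = sum(lengths)
    n_count = sum(seq.upper().count("N") for seq in contigs.values())
    return {
        "total_bases": total,
        "contig_count": len(lengths),
        "n50": n50(lengths),
        "proportion_n": n_count / total if total else 0.0,
    }
```

For three toy contigs of 10, 5 and 2 bases, the total is 17 bases and the N50 is 10, because the longest contig alone already covers more than half of the assembly.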

The current version of the QA evaluation pipeline is 2.1.

Quality Control - Remove contigs generated by low level contamination

Low levels of contamination (up to 10% of total reads) are frequently found in Illumina sequencing. They are not a severe problem in reference-based mapping, because only the consensus calls are used in such an analysis. In assembly-based methods, however, the contaminating reads can form their own contigs and so be carried into all downstream analyses.

The QA evaluation pipeline identifies contigs generated by low-level contamination by comparing their read depth with that of normal contigs. Only contigs whose read depth is greater than 20% of the average read depth are kept in the final assembly.
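
A depth filter of this kind can be sketched as below. This is an illustration under stated assumptions rather than the pipeline's implementation: the text does not say how the average read depth is calculated, so the sketch assumes a length-weighted mean across all contigs, and the function name and input layout are hypothetical.

```python
from typing import Dict, Tuple


def filter_low_depth_contigs(
    contigs: Dict[str, Tuple[int, float]],
    min_fraction: float = 0.2,
) -> Dict[str, Tuple[int, float]]:
    """Keep contigs whose read depth exceeds min_fraction of the average.

    `contigs` maps a contig name to (length in bases, mean read depth).
    ASSUMPTION: the average depth is a length-weighted mean over all contigs.
    """
    total_length = sum(length for length, _ in contigs.values())
    if total_length == 0:
        return {}
    mean_depth = (
        sum(length * depth for length, depth in contigs.values()) / total_length
    )
    cutoff = min_fraction * mean_depth
    return {
        name: (length, depth)
        for name, (length, depth) in contigs.items()
        if depth > cutoff
    }
```

With two 1,000-base contigs at depth 50 and one 200-base contig at depth 5, the weighted average depth is about 45.9, the cutoff is about 9.2, and only the short, shallow contig is removed.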

Species quality control taxonomy

Finally, assembled contigs are assigned taxonomic labels using [Kraken] in order to exclude potential contamination from genera other than that of the corresponding database. An assembly is rejected if fewer than 80% of its contigs are assigned to the correct taxon.
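
The check itself reduces to a simple proportion. The sketch below assumes the taxonomic labels have already been extracted from the Kraken results into a dictionary of contig name to genus label; that layout and the function names are hypothetical, and the cutoff is passed in explicitly rather than hard-coded.

```python
from typing import Dict


def fraction_assigned(labels: Dict[str, str], expected_taxon: str) -> float:
    """Fraction of contigs whose label equals the expected taxon.

    `labels` maps contig name -> assigned taxon label (hypothetical layout;
    producing it from raw Kraken output is outside this sketch).
    """
    if not labels:
        return 0.0
    hits = sum(1 for label in labels.values() if label == expected_taxon)
    return hits / len(labels)


def passes_species_check(
    labels: Dict[str, str], expected_taxon: str, min_fraction: float
) -> bool:
    """True when at least min_fraction of contigs carry the expected taxon."""
    return fraction_assigned(labels, expected_taxon) >= min_fraction
```

For example, if three of four contigs are labelled with the expected genus, the fraction is 0.75, which fails a cutoff of 0.8.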

Masking

The QA evaluation pipeline also masks the assembled sequence. Masking replaces every base whose base quality is below a cutoff (10) with the [IUPAC code] for any base (i.e. “N”). The masked sequence for each assembly is made [available for users to download](user_download_assemblies) in FASTA format from the EnteroBase website, or [using the API](api_download_assemblies) in the case of programmers. (Downloadable assemblies may be available even if they failed QC, provided that the earlier assembly stage did not itself fail.)
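
The masking rule can be illustrated in a few lines. This is a sketch only: the function name is hypothetical, and it assumes the per-base qualities are already available as a list of integers aligned with the sequence.

```python
from typing import Sequence


def mask_low_quality(sequence: str, qualities: Sequence[int], cutoff: int = 10) -> str:
    """Replace every base whose quality is below `cutoff` with 'N'."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    return "".join(
        "N" if quality < cutoff else base
        for base, quality in zip(sequence, qualities)
    )


masked = mask_low_quality("ACGT", [30, 9, 10, 2])  # returns "ANGN"
```

A base with quality exactly 10 is kept, because only qualities below the cutoff are masked.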

Quality Control Criteria

The thresholds applied for each criterion vary according to the genus/database:

Assembly criteria for Salmonella

| Metrics | Criteria |
|---|---|
| Number of bases | 4 Mbp – 5.8 Mbp |
| N50 value | > 20 kb |
| Number of contigs | < 600 |
| Proportion of scaffolding placeholders (N’s) | < 3% |
| Species assignment using Kraken | > 70% of contigs are assigned |

Assembly criteria for Escherichia/Shigella

| Metrics | Criteria |
|---|---|
| Number of bases | 3.7 Mbp – 6.4 Mbp |
| N50 value | > 20 kb |
| Number of contigs | <= 800 |
| Proportion of scaffolding placeholders (N’s) | < 3% |
| Species assignment using Kraken | > 70% of contigs are assigned |

Assembly criteria for Yersinia

| Metrics | Criteria |
|---|---|
| Number of bases | 3.7 Mbp – 5.5 Mbp |
| N50 value | > 15 kb |
| Number of contigs | < 600 |
| Proportion of scaffolding placeholders (N’s) | < 3% |
| Species assignment using Kraken | > 65% of contigs are assigned |

Assembly criteria for Moraxella

| Metrics | Criteria |
|---|---|
| Number of bases | 1.8 Mbp – 2.6 Mbp |
| N50 value | > 20 kb |
| Number of contigs | < 600 |
| Proportion of scaffolding placeholders (N’s) | < 3% |
| Species assignment using Kraken | > 65% of contigs are assigned |

Some users have access to additional databases - the criteria for these are [here](EnteroBase%20Backend%20Pipeline%20Optional%20Genera%20Assembly%20Criteria).
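
To make the per-genus thresholds concrete, the sketch below encodes the four sets of criteria listed above and reports which of them an assembly fails, checking them in the order bases, contigs, N50, N's, Kraken. It is an illustration only: the class, dictionary keys and function name are hypothetical, 1 kb is taken as 1,000 bases and 1 Mbp as 1,000,000 bases, and the bounds on the number of bases are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criteria:
    min_bases: int           # lower bound on total bases (assumed inclusive)
    max_bases: int           # upper bound on total bases (assumed inclusive)
    min_n50: int             # N50 must be strictly greater than this
    max_contigs: int         # bound on the number of contigs
    contigs_inclusive: bool  # True where the criterion is "<=" rather than "<"
    max_prop_n: float        # proportion of N's must be strictly below this
    min_kraken: float        # fraction of contigs assigned must exceed this


CRITERIA: Dict[str, Criteria] = {
    "Salmonella": Criteria(4_000_000, 5_800_000, 20_000, 600, False, 0.03, 0.70),
    "Escherichia/Shigella": Criteria(3_700_000, 6_400_000, 20_000, 800, True, 0.03, 0.70),
    "Yersinia": Criteria(3_700_000, 5_500_000, 15_000, 600, False, 0.03, 0.65),
    "Moraxella": Criteria(1_800_000, 2_600_000, 20_000, 600, False, 0.03, 0.65),
}


def failed_criteria(
    genus: str,
    total_bases: int,
    n50: int,
    contig_count: int,
    prop_n: float,
    kraken_fraction: float,
) -> List[str]:
    """Return the names of the criteria that the assembly fails, in order."""
    c = CRITERIA[genus]
    failures: List[str] = []
    if not (c.min_bases <= total_bases <= c.max_bases):
        failures.append("Number of bases")
    too_many = contig_count > c.max_contigs or (
        contig_count == c.max_contigs and not c.contigs_inclusive
    )
    if too_many:
        failures.append("Number of contigs")
    if not n50 > c.min_n50:
        failures.append("N50 value")
    if not prop_n < c.max_prop_n:
        failures.append("Proportion of N's")
    if not kraken_fraction > c.min_kraken:
        failures.append("Species assignment using Kraken")
    return failures
```

An empty list means the assembly meets every criterion for that genus. Returning the names of the failed criteria, rather than a single pass/fail flag, keeps the cause of a failure visible.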

Inputs/Outputs

End users do not invoke the QA evaluation pipeline directly (not even programmers using the EnteroBase API). Details of its inputs and outputs are provided for internal reference, and because they may help in interpreting diagnostic information (e.g. in the jobs table).

Parameters

    {
        "params": {
            "scheme": "Salmonella_UoW"  # Ecoli_UoW, Yersinia_UoW, Mcatarrhalis_UoW or species names
        },
        "inputs": {
            "assembly": "/path/to/folder/filename"
        }
    }

One of the two “read” tags, in either the “reads” bin or the “inputs” bin, is required.

  • “read” in the “reads” bin: the reads are downloaded automatically from the SRA.
  • “read” in the “inputs” bin: must point to user-uploaded files.
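
For internal tooling, a parameter block with the "params" and "inputs" keys shown above could be assembled and serialised as follows. This is a sketch for illustration: the helper name is hypothetical, and since valid JSON cannot carry comments, the alternative scheme names are noted in the Python code instead.

```python
import json


def build_job_parameters(scheme: str, assembly_path: str) -> str:
    """Serialise the QA evaluation parameters as a JSON string.

    `scheme` is a value such as "Salmonella_UoW", "Ecoli_UoW",
    "Yersinia_UoW" or "Mcatarrhalis_UoW"; `assembly_path` is the path
    to the assembly file.
    """
    payload = {
        "params": {"scheme": scheme},
        "inputs": {"assembly": assembly_path},
    }
    return json.dumps(payload, indent=2)


print(build_job_parameters("Salmonella_UoW", "/path/to/folder/filename"))
```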

Outputs

    {
        "log": "…"  # Evaluation will be shown in a JSON string.
    }

History

The criteria for failing assemblies in QC were changed, and different criteria were introduced for different genera/databases.

The allowed proportion of low quality bases was changed from 5% to 3%.

The allowed total number of contigs was changed from 500 to 600.

Started reporting the causes of an assembly's failure in QC.

Kraken analysis is applied to identify inter-species contamination or mix-ups.

  1. The total size of the assembly is between 4MB and 5.5MB.
  2. The [N50] value is not lower than 20KB
  3. The total number of contigs is not greater than 500
  4. If the assembly is in QAFastq format, the proportion of bases with quality lower than 10 is not greater than 3% of total bases; if the assembly is in [FASTA] format, the proportion of N’s is not greater than 3% of total bases.

[QAssembly]: EnteroBase%20Backend%20Pipeline%3A%20QAssembly "external link"
[N50]: http://en.wikipedia.org/wiki/N50_statistic "external link"
[Kraken]: http://ccb.jhu.edu/software/kraken/ "external link"
[IUPAC code]: http://en.wikipedia.org/wiki/Nucleic_acid_notation#IUPAC_notation "external link"
[FASTA]: http://en.wikipedia.org/wiki/FASTA_format "external link"