Quality Assessment evaluation

Overview

The Quality Assessment (QA) evaluation pipeline is at the end of the [QAssembly] pipeline. Assemblies are evaluated according to several criteria.

A good assembly needs to fulfil the following criteria:

Number of bases (i.e. total length of contigs in the assembly)

Number of contigs

N50 value

Proportion of N’s

Correct Species Assignment in Kraken

The QA evaluation pipeline script evaluates the above quantities in the order given; the percentage of contigs given a correct species assignment is evaluated last, after the script runs Kraken. (The version of Kraken used is the current version, 2.1.2, and the Kraken database is minikraken2_v2_8GB_201904.) Any assembly that fails this quality control will not be used to call MLST or for other downstream analyses. The evaluation and subsequent pipeline jobs only make use of fragments that are >300 bp in length, so only these are used in the count of the number of bases and the proportion of Ns.
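As an illustration of the size-based metrics listed above, the following is a minimal sketch, not the pipeline's own code. The function name `assembly_stats` and the dictionary keys are hypothetical, and it assumes that the contig count and N50 are computed on the same >300 bp fragments as the base count and the proportion of Ns.

```python
def assembly_stats(contigs, min_length=300):
    """Summarise an assembly given as a list of contig sequences (strings)."""
    # Only fragments longer than min_length are counted (assumption: this
    # filter also applies to the contig count and N50).
    kept = [c for c in contigs if len(c) > min_length]
    lengths = sorted((len(c) for c in kept), reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running total of the
    # length-sorted contigs first reaches half of the total bases.
    n50, running = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    n_count = sum(c.upper().count("N") for c in kept)
    return {
        "n_bases": total,
        "n_contigs": len(kept),
        "n50": n50,
        "prop_n": n_count / total if total else 0.0,
    }


# Three toy contigs of 1000, 400 and 200 bp; the 200 bp one is ignored.
stats = assembly_stats(["ACGT" * 250, "ACGN" * 100, "ACGT" * 50])
```

With this toy input the 200 bp fragment is excluded before any metric is computed, so it affects neither the total length nor the proportion of Ns.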

The current version of the QA evaluation pipeline is 2.2/3.

Quality Control - Remove contigs generated by low level contamination

Low levels of contamination (up to 10% of total reads) are frequently found in Illumina sequencing. They are not a severe problem in reference-based mapping, because only the consensus calls are used in such an analysis. However, contaminating reads can form their own contigs in assembly-based methods and thus be carried into all the downstream analyses.

The QA evaluation pipeline identifies contigs generated by low-level contamination by comparing their read depth with that of normal contigs. Only contigs whose read depth is >20% of the average read depth are kept in the final assemblies.
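The depth filter can be sketched as follows. This is an illustrative sketch only: the function name `filter_low_depth_contigs` is hypothetical, the overall average is assumed to be weighted by contig length, and a contig sitting exactly at the 20% cutoff is assumed to be kept.

```python
def filter_low_depth_contigs(contig_depths, contig_lengths, min_fraction=0.2):
    """Return the names of contigs whose mean depth is at least
    min_fraction of the assembly-wide average depth."""
    total_length = sum(contig_lengths[name] for name in contig_depths)
    if total_length == 0:
        return []
    # Assumption: the overall average is weighted by contig length.
    average = sum(
        contig_depths[name] * contig_lengths[name] for name in contig_depths
    ) / total_length
    cutoff = min_fraction * average
    return [name for name in contig_depths if contig_depths[name] >= cutoff]


depths = {"c1": 50.0, "c2": 48.0, "c3": 4.0}
lengths = {"c1": 100_000, "c2": 80_000, "c3": 2_000}
kept = filter_low_depth_contigs(depths, lengths)
```

Here the short contig `c3`, with a depth far below a fifth of the average, is the only one dropped.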

Calculation of per nucleotide sequence-quality

After EToki has created an assembly from the reads, it aligns the reads to the assembly using minimap2 and uses ‘samtools depth’ to calculate the depth at each nucleotide position. It then counts the number of times each possible depth is seen across the assembly and uses this to find the 25% and 75% quartiles. The distance between the quartiles is expanded by a factor of three to set the range outside of which the read depth is treated as an outlier. The coverage for each contig is then the average coverage excluding outliers, and this is used to calculate the overall average depth. Contigs whose average coverage is less than 20% of the overall average coverage are removed. Next, EnteroBase performs a ‘samtools mpileup’ to find the number of reads that match the assembly sequence at each nucleotide position in the assembly. It then calculates a quality score based on the number of nucleotides that match the assembly using the formula:

Q = INT(2.55 * N - 7.00 * E + 0.5)

where N is the number of nucleotides in the reads that match the assembly and E is the number of nucleotides that do not match. This is then clipped to a range between 0 and 40, giving a Phred-like quality score. Internally this is persisted in FASTQ files using ASCII values between ! (0) and I (40). When viewing assemblies using JBrowse, the assembly errors track displays 41 - Q values, which is why the minimum value is 1 and the maximum is 41.
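The scoring step can be written out as a short sketch. The function names are hypothetical; INT is assumed to behave like Python's `int()`, and the ASCII encoding is assumed to use the usual offset of 33, which is consistent with ! for 0 and I for 40.

```python
def quality_score(n_match, n_mismatch):
    """Per-base quality from matching (N) and mismatching (E) read bases."""
    q = int(2.55 * n_match - 7.00 * n_mismatch + 0.5)
    # Clip to the Phred-like range 0..40.
    return max(0, min(40, q))


def to_fastq_char(q):
    """Encode a 0-40 score as an ASCII character ('!' for 0, 'I' for 40)."""
    return chr(33 + q)


def jbrowse_value(q):
    """Value shown on the JBrowse assembly errors track (41 - Q)."""
    return 41 - q


q = quality_score(8, 1)           # 2.55 * 8 - 7.00 + 0.5 = 13.9, so Q is 13
encoded = to_fastq_char(q)
track_value = jbrowse_value(q)
```

A position covered by many matching reads and no mismatches saturates at Q = 40, while a position where mismatches dominate is clipped to Q = 0.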

Nucleotides in reads may fail to match the assembly for one of two reasons. The first is poor read quality. The second is that the sample is not a single isolate, so alternative alleles are present within the sample. Either is a reason for excluding a sample from further analysis, as both can give misleading results and pollute the EnteroBase databases with invalid data.

The following table shows the relationship between read depth, quality and the expected number of mismatches to the assembly:
Read depth	Quality	Mismatches to assembly
10	10	1.6
10	20	0.6
10	30	0.0
20	10	4.3
20	20	3.2
20	30	2.2
20	40	1.0
30	10	7.0
30	20	5.9
30	30	4.9
30	40	3.8
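The tabulated values look consistent with solving the quality formula for E while holding N + E equal to the read depth. This is an inference from the numbers rather than a documented derivation, and the hypothetical helper below ignores the rounding and clipping steps, so individual rows may differ slightly in the last digit.

```python
def expected_mismatches(depth, quality):
    """Solve Q = 2.55 * N - 7.00 * E for E, assuming N + E = depth."""
    e = (2.55 * depth - quality) / (2.55 + 7.00)
    # A quality that the depth cannot reach gives a negative E; report 0.
    return max(0.0, e)


for depth, quality in [(10, 10), (10, 30), (30, 40)]:
    print(depth, quality, round(expected_mismatches(depth, quality), 1))
```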

Species quality control taxonomy

Finally, assembled contigs are assigned taxonomic labels using Kraken in order to exclude potential contamination from genera other than the one covered by the corresponding database. Assemblies are failed if <80% of the segments are assigned to the correct taxon. Segments are contigs, unless the contig is greater than 20 kbp in size, in which case the contig is split into 10 kbp segments.
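The segmentation and the pass/fail decision can be sketched as below. Both function names are hypothetical. How a trailing remainder shorter than 10 kbp is handled is not stated, so this sketch simply keeps it as a final shorter segment, and it treats the taxon comparison as a plain label match rather than a walk through the taxonomy.

```python
def split_into_segments(contig_length, max_contig=20_000, segment=10_000):
    """Return (start, end) coordinates of the segments for one contig."""
    if contig_length <= max_contig:
        return [(0, contig_length)]
    # Assumption: a trailing remainder becomes its own, shorter segment.
    return [
        (start, min(start + segment, contig_length))
        for start in range(0, contig_length, segment)
    ]


def passes_species_check(segment_taxa, expected_taxon, min_fraction=0.8):
    """True if enough segments carry the expected taxonomic label."""
    if not segment_taxa:
        return False
    correct = sum(1 for taxon in segment_taxa if taxon == expected_taxon)
    return correct / len(segment_taxa) >= min_fraction


segments = split_into_segments(25_000)
ok = passes_species_check(["Salmonella"] * 9 + ["Escherichia"], "Salmonella")
```

The threshold is left as a parameter because the genus-specific criteria tables quote their own Kraken percentages.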

Masking

The QA evaluation pipeline also processes the assemblies in order to mask the sequence. Masking consists of replacing bases whose base qualities are below a cutoff (10) with the IUPAC code for any base (i.e. “N”). Downloading masked sequence for the assemblies in FASTA format from the website is described here or from the API here. (Downloadable assemblies may be available even if they failed QC, provided that the earlier assembly stage did not itself fail.)
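Masking can be illustrated with a small sketch. The function names are hypothetical, and the quality string is assumed to be encoded with an ASCII offset of 33, as in standard FASTQ.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def mask_low_quality(sequence, qualities, cutoff=10):
    """Replace every base whose quality is below cutoff with 'N'."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    return "".join(
        base if q >= cutoff else "N"
        for base, q in zip(sequence, qualities)
    )


# Qualities decode to 40, 40, 0, 10, 20, 40; only the third base is masked.
masked = mask_low_quality("ACGTAC", decode_phred33("II!+5I"))
```

A base with a quality of exactly 10 is kept, since only qualities below the cutoff are masked.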

Quality Control Criteria

The thresholds applied for the criteria vary according to the genus/database:

Assembly criteria for Salmonella

Metrics	Criteria
Number of bases	4 Mbp – 5.8 Mbp
N50 value	>20kb
Number of contigs	<600
Proportion of scaffolding placeholders (N’s)	<3%
Species assignment using Kraken	>70% contigs are assigned

Assembly criteria for Escherichia/ Shigella

Metrics	Criteria
Number of bases	3.7 Mbp – 6.4 Mbp
N50 value	>20kb
Number of contigs	<=800
Proportion of scaffolding placeholders (N’s)	<3%
Species assignment using Kraken	> 70% contigs are assigned

Assembly criteria for Yersinia

Metrics	Criteria
Number of bases	3.7 Mbp – 5.5 Mbp
N50 value	>15kb
Number of contigs	< 600
Proportion of scaffolding placeholders (N’s)	<3%
Species assignment using Kraken	>65% contigs are assigned

Assembly criteria for Moraxella

Metrics	Criteria
Number of bases	1.8 Mbp – 2.6 Mbp
N50 value	>20kb
Number of contigs	< 600
Proportion of scaffolding placeholders (N’s)	< 3%
Species assignment using Kraken	> 65% contigs are assigned

Some users have access to additional databases - the criteria for these are [here](EnteroBase%20Backend%20Pipeline%20Optional%20Genera%20Assembly%20Criteria).
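The four tables above can be expressed as a lookup table plus a check function. This is a sketch under stated assumptions rather than the pipeline's implementation: the names `CRITERIA` and `evaluate` and the metric keys are hypothetical, the base-count range is assumed to be inclusive, and failures are reported in the evaluation order given in the overview.

```python
CRITERIA = {
    "Salmonella": {
        "bases": (4.0e6, 5.8e6), "min_n50": 20_000,
        "contigs_ok": lambda n: n < 600, "max_prop_n": 0.03, "min_kraken": 0.70,
    },
    "Escherichia/Shigella": {
        "bases": (3.7e6, 6.4e6), "min_n50": 20_000,
        "contigs_ok": lambda n: n <= 800, "max_prop_n": 0.03, "min_kraken": 0.70,
    },
    "Yersinia": {
        "bases": (3.7e6, 5.5e6), "min_n50": 15_000,
        "contigs_ok": lambda n: n < 600, "max_prop_n": 0.03, "min_kraken": 0.65,
    },
    "Moraxella": {
        "bases": (1.8e6, 2.6e6), "min_n50": 20_000,
        "contigs_ok": lambda n: n < 600, "max_prop_n": 0.03, "min_kraken": 0.65,
    },
}


def evaluate(genus, stats):
    """Return the list of failed criteria (an empty list means a pass)."""
    c = CRITERIA[genus]
    failures = []
    low, high = c["bases"]
    if not low <= stats["n_bases"] <= high:   # assumption: inclusive range
        failures.append("n_bases")
    if not c["contigs_ok"](stats["n_contigs"]):
        failures.append("n_contigs")
    if not stats["n50"] > c["min_n50"]:
        failures.append("n50")
    if not stats["prop_n"] < c["max_prop_n"]:
        failures.append("prop_n")
    if not stats["kraken_fraction"] > c["min_kraken"]:
        failures.append("kraken")
    return failures


example = {"n_bases": 4_800_000, "n_contigs": 80, "n50": 150_000,
           "prop_n": 0.001, "kraken_fraction": 0.95}
failed = evaluate("Salmonella", example)
```

Returning the names of the failed criteria, rather than a single boolean, mirrors the idea of reporting why an assembly failed QC.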

Inputs/Outputs

The QA evaluation pipeline is not directly invoked by end users (not even programmers using the EnteroBase API), but details of inputs/outputs are provided for internal reference and possibly to allow use of diagnostic information (e.g. in the jobs table).

Parameters

{
    "params": {
        "scheme": "Salmonella_UoW"  # Ecoli_UoW, Yersinia_UoW, Mcatarrhalis_UoW or species names
    },
    "inputs": {
        "assembly": "/path/to/folder/filename"
    }
}

One of the two “read” tags in the “reads” and “inputs” bins is required.

A “read” in the “reads” bin will be downloaded from the SRA automatically.
A “read” in the “inputs” bin needs to point to user-uploaded files.

Outputs

{
    "log": "…"  # Evaluation will be shown in a JSON string.
}

History

The criteria for assemblies to fail QC were changed, and different criteria for different genera/databases were introduced.

The allowed proportion of low quality bases was changed from 5% to 3%.

The allowed total number of contigs was changed from 500 to 600.

Started reporting the causes of assembly failure in QC.

Kraken analysis is applied to identify inter-species contamination or mix-ups.

The total size of the assembly is between 4 Mbp and 5.5 Mbp.
The [N50] value is not lower than 20 kb.
The total number of contigs is not greater than 500.
If the assembly is in QAFastq format, the proportion of bases with quality lower than 10 is not greater than 3% of total bases. Or if the assembly is in [FASTA] format, the proportion of N’s is not greater than 3% of total bases.

[QAssembly]: EnteroBase%20Backend%20Pipeline%3A%20QAssembly "external link"
[N50]: http://en.wikipedia.org/wiki/N50_statistic "external link"
[Kraken]: http://ccb.jhu.edu/software/kraken/ "external link"
[IUPAC code]: http://en.wikipedia.org/wiki/Nucleic_acid_notation#IUPAC_notation "external link"
[FASTA]: http://en.wikipedia.org/wiki/FASTA_format "external link"