EnteroBase QAssembly
====================

QAssembly (high **Q**\ uality **Assembly**) is an assembly pipeline that aims to generate the best currently achievable assemblies within a reasonable amount of time. The whole pipeline (current version: 3.61) contains several components:

.. image:: https://bitbucket.org/repo/Xyayxn/images/840641445-qassembly_workflows_combo7quasicropresize3.png

Quality Trimming (Sickle) & Barcode trimming
--------------------------------------------

As recommended by `Del Fabbro *et al.*`_, QAssembly runs Sickle_ 1.33 with the argument ``-q 10`` on `FASTQ`_ files to trim low-quality base calls from the ends of short reads, using a quality threshold of 10.

Some very old GAIIx reads, or reads generated for special experiments, contain barcodes at their 5' ends in order to mark multiple biological samples within a common sequencing pool. Such barcodes result in identical runs of 5' nucleotides in >=50% of reads. These runs are identified 2 bp at a time and stripped by a nested Python function called ``read_process()``.

This function also removes reads from SRAs with too many reads, limiting their maximum size to 200X of the genome size. This approach was recommended by Andrew Millard's `blog`_ in order to reduce unnecessary processing time and memory, because the quality of assemblies does not improve once read depths exceed 30X genome coverage.

SPAdes Assembly
---------------

Assembly is performed with `SPAdes`_ 3.9.0 on 7 threads, without pre-correction with BayesHammer. BayesHammer did not provide significant improvements in benchmark comparisons, and doubled the time needed for assemblies. We also do not use post-polishing within `SPAdes`_ because it is inferior to the post-polishing steps in QAssembly (below).

BWA Remapping and base correction
---------------------------------

Raw reads are mapped back onto assembly scaffolds in order to improve the accuracy of consensus base calls.

#. `BWA`_ 0.7.12-r1039 is used to align reads back onto the assemblies.
#. Because BWA was developed for alignments between genetically distinct sequences, it does not necessarily align the entirety of every read against the consensus assembly, especially if the assembly contains local mis-assemblies introduced by SPAdes due to weak connections. Such mis-assemblies are identified by extending the BWA alignments over the entire length of the reads with the help of a Python script (``sam_filter.py``), which marks them for the next step.
#. Consensus base calling, including format transformation, is performed with `SAMtools and BCFtools`_ (both in version 1.2). The prior assemblies are then corrected (polished) by incorporating the differing, most probable consensus bases or indels reported by ``bcftools call``.

BWA base quality and save in FASTQ_ format
-------------------------------------------

Base-specific quality scores are called a second time in order to assess the uncertainty of the consensus calls in the assemblies. `BWA`_ and `SAMtools and BCFtools`_ are used to generate and analyse the remappings. The quality of each consensus base is derived by comparing the read support for the consensus call against the support for all other alternatives.

Conversion of QAFastq format into FASTA_ format
------------------------------------------------

The :doc:`QAtoFasta` pipeline also processes the assemblies in order to mask the sequence. Masking consists of replacing bases whose base qualities are below a cutoff (10) with the `IUPAC code`_ for any base (i.e. "N"). The masked sequence for each assembly is made available in FASTA format for users to download. It may be available even if the assembly failed QC, provided that the earlier assembly stage did not itself fail.

.. _`Del Fabbro *et al.*`: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085024
.. _Sickle: https://github.com/najoshi/sickle
.. _blog: http://blogs.warwick.ac.uk/microbialunderground/entry/what_coverage_is/
.. _SPAdes: http://spades.bioinf.spbau.ru/release3.9.0/
.. _BWA: http://bio-bwa.sourceforge.net/
.. _`SAMtools and BCFtools`: http://www.htslib.org/
.. _Kraken: http://ccb.jhu.edu/software/kraken/
.. _FASTQ: http://en.wikipedia.org/wiki/FASTQ_format
.. _`IUPAC code`: http://en.wikipedia.org/wiki/Nucleic_acid_notation#IUPAC_notation
.. _FASTA: http://en.wikipedia.org/wiki/FASTA_format
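
The masking step described for the conversion into FASTA format can be illustrated with a short Python sketch. This is not the QAtoFasta implementation: the function name ``mask_low_quality``, the assumption of Phred+33 quality encoding, and the example record are all hypothetical and only show the idea of replacing bases below the cutoff (10) with "N".

.. code-block:: python

    def mask_low_quality(sequence, quality, cutoff=10, offset=33):
        """Replace bases whose Phred quality is below ``cutoff`` with 'N'.

        ``quality`` is assumed to be a Phred+33 encoded string of the
        same length as ``sequence`` (an assumption of this sketch).
        """
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality must have equal length")
        return "".join(
            # Keep the base only if its decoded quality reaches the cutoff.
            base if ord(q) - offset >= cutoff else "N"
            for base, q in zip(sequence, quality)
        )


    # Hypothetical record: 'I' decodes to 40, '#' to 2, '+' to 10, '*' to 9.
    masked = mask_low_quality("ACGTAC", "II#I+*")
    print(masked)  # ACNTAN

A base with quality exactly equal to the cutoff is kept, since only qualities below the cutoff are masked.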