Predicted vs. Empirical Quality Scores in Ion Torrent and MiSeq Data
Posted by David Jenkins on Sep 13, 2011
A recent blog post at BioLektures explains why it isn’t really fair to compare MiSeq and Ion Torrent data based on their predicted quality scores: Ion Torrent tends to undercall quality across its reads, while Illumina’s quality score algorithm predicts the actual quality more accurately. In a recent application note, Illumina compares predicted quality scores between their MiSeq technology and Life’s Ion Torrent technology.
Life Technologies has also released a recent application note in which Ion Torrent long read data is compared to MiSeq data. In this application note, Life brings up the distinction between predicted and empirical quality scores and offers a chart comparing MiSeq’s predicted vs. empirical quality:
Illumina MiSeq™ platform predicted accuracy versus measured accuracy. The difference between reported MiSeq™ platform quality scores and the actual measured quality observed when reads are mapped to the E. coli genome as described in the data analysis section. The MiSeq™ data are derived from MiSeq™ run sequencing E. coli MG1655.
The blue line on this plot represents the predicted quality score for the MiSeq run, while the orange line represents the empirical error rate counting only substitution errors. This orange line could be misleading, as explained on page 8 of a recent response from Illumina to the Life application note.
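To make the substitution-only point concrete, here is a minimal sketch of how an empirical Phred score depends on which error types are counted. It uses the standard Phred relation Q = -10 * log10(error rate). The function name `phred_from_counts` and all of the counts are hypothetical and are not taken from either application note or dataset.

```shell
# phred_from_counts ERRORS BASES
# Prints the Phred-scaled quality Q = -10 * log10(ERRORS / BASES), to one decimal.
phred_from_counts() {
    awk -v e="$1" -v n="$2" 'BEGIN {
        if (e <= 0 || n <= 0) { print "NA"; exit }   # log10 is undefined here
        printf "%.1f\n", -10 * log(e / n) / log(10)
    }'
}

# Hypothetical counts for a single read position (illustration only).
bases=100000
substitutions=100
indels=900

# Counting substitutions only gives the higher score.
echo "substitutions only: Q$(phred_from_counts "$substitutions" "$bases")"              # prints Q30.0
echo "all error types:    Q$(phred_from_counts "$((substitutions + indels))" "$bases")" # prints Q20.0
```

With these made-up numbers the same position scores ten Phred units higher when indels are left out, which is why the choice of error types matters when reading an empirical curve.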
Quite a few metrics must be taken into account when comparing these two technologies. Today’s blog post continues our previous Ion Torrent vs. MiSeq discussion and focuses on predicted vs. measured accuracy, attempting to show in a clear, reproducible way how Ion Torrent and MiSeq data compare when analyzed with the same tools under the same settings. Illumina and Life have each released a pair of datasets to the public (listed below: the two MiSeq runs first, then the two Ion Torrent runs), and choosing the right two datasets to compare is important.
- Sequencing and mapping of Escherichia coli str. K-12 substr. DH10B, complete genome
- Sequencing and mapping of Escherichia coli str. K-12 substr. MG1655, complete genome
- Sequencing and mapping of Escherichia coli str. K-12 substr. DH10B, complete genome with 314L long-read technology
- Sequencing and mapping of Escherichia coli str. K-12 substr. DH10B, complete genome with 316 technology
To recreate the figure from Illumina’s application note, the bam file from the MiSeq DH10B run and the bam file from the Ion Torrent DH10B 316 chip run were analyzed with the FastQC program. The bam file from the MiSeq DH10B run combines the results from the “Read 1” and “Read 2” analyses from the Illumina chart above. The resulting per base sequence quality charts were plotted together.
Using the provided alignments for these runs, it is possible to calculate empirical quality by base position using the Broad Institute’s Genome Analysis Toolkit (GATK). The GATK doesn’t currently accept Ion Torrent data as input, but because the error profile of Ion Torrent data is fairly similar to that of 454 sequencing data, the Ion Torrent data was run through the GATK as 454 data. Recalibrated bam files were created with the GATK and those files were then analyzed with FastQC.
Q-Q plots of the DH10B Ion Torrent 316 chip data expected vs empirical quality before recalibration (left) and after recalibration (right). Similar distributions will result in points along the line y=x. Before recalibration, Ion Torrent predicted quality values are being undercalled.
Q-Q plots of DH10B MiSeq data expected vs empirical quality before recalibration (left) and after recalibration (right). Illumina's quality prediction more accurately models empirical quality.
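The expected vs. empirical comparison behind these Q-Q plots can be approximated by hand from a covariate count table. The sketch below is not how AnalyzeCovariates produces its plots. It pools observations and mismatches by reported quality score and converts the pooled error rate to a Phred value. The column layout in the comment is an assumption about the CSV file, and the +1 pseudocount (to avoid taking the log of zero) is an illustrative choice, so check both against your own files.

```shell
# empirical_by_reported_q: reads a covariate CSV on stdin and prints
#   reported_Q <TAB> empirical_Q <TAB> observations
# ASSUMED column layout (verify against your own file):
#   ReadGroup,QualityScore,Cycle,Dinuc,nObservations,nMismatches,Qempirical
empirical_by_reported_q() {
    awk -F, '
        /^#/ || $1 == "ReadGroup" || NF < 6 { next }   # skip comments, header, short lines
        { obs[$2] += $5; mis[$2] += $6 }               # pool counts per reported quality
        END {
            for (q in obs) {
                # +1 pseudocount on both counts avoids log(0); an illustrative choice
                rate = (mis[q] + 1) / (obs[q] + 1)
                printf "%d\t%.1f\t%d\n", q, -10 * log(rate) / log(10), obs[q]
            }
        }
    ' | sort -n
}

# Example call, using the table name from the analysis description:
#   empirical_by_reported_q < table_original.recal_dataout.csv
```

Rows where the second column sits well above the first correspond to points above the line y=x, meaning quality is being undercalled.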
When empirical quality is compared between the two technologies, the differences in the performance become much more subtle. Life is clearly undercalling base quality across Ion Torrent reads. In this particular dataset an unexpected increase in per base quality is seen in the Ion Torrent empirical plot towards the end of the read. When the same analysis was repeated with an internal 316 chip Ion Torrent run, this large increase at the end of the read was not observed.
Q-Q plots of internal EdgeBio DH10B 316 chip data expected vs empirical quality before recalibration (left) and after recalibration (right).
If the same MiSeq data is compared to the Ion Torrent 314 Long Read dataset, a similar gap between Ion Torrent predicted and empirical quality scores is seen. The obvious difference is that the Ion Torrent reads are quite a bit longer than the MiSeq reads. Another distinguishing characteristic is that Illumina quality starts to dip after about 120 bases, and Illumina’s quality scoring algorithm has a hard time accounting for this.
Q-Q plots of the DH10B Ion Torrent 314 long-read chip data expected vs empirical quality before recalibration (left) and after recalibration (right).
These results show that while Illumina’s MiSeq application note accurately compares per base predicted sequence quality scores between Ion Torrent and MiSeq, comparing empirical Q values shows a very different relationship between the two technologies’ quality scores.
Ion Torrent brings up this issue of predicted vs. empirical per base sequence quality in their Ion Torrent long read application note, but they only show predicted vs. empirical accuracy for the MiSeq MG1655 data. Additionally, Ion Torrent plots only the average substitution mismatch errors across the read. A better approach would be to analyze this dataset with the GATK without parsing the output to select specific error types. Using the same default analysis settings recommended in a Broad Institute presentation about quality score recalibration with the GATK, figure 3 from the Ion Torrent application note was recreated.
Q-Q plots of MG1655 MiSeq data expected vs empirical quality before recalibration (left) and after recalibration (right).
This analysis of predicted vs. empirical quality shows that when comparing only predicted per base sequence quality, MiSeq appears to be a clear winner, whereas a comparison of empirical per base sequence quality shows the technologies are more evenly matched, and longer high quality reads may give Ion Torrent an advantage in the long run. It is clear that Ion Torrent’s calling of base quality needs to improve, since these lower quality scores do not give an accurate measure of per base quality to the user or to any software that uses these quality values. 454 had a similar problem when it first released the FLX sequencing platform, and Life has current initiatives to address these issues. Life has set a series of grand challenges for the community to improve Ion Torrent sequencing, one with the goal of doubling the basecalling accuracy.
In the end per base sequence quality is only one aspect of what will be an ongoing comparison between Ion Torrent and MiSeq, but it is an important metric to carefully compare to get an accurate picture of the performance of both sequencing technologies.
It's an exciting time for next generation sequencing and we will continue to provide insight and analysis as we explore different aspects of Ion Torrent sequencing. In our next post we will analyze Ion Torrent's built-in variant calling software and try to get an accurate picture of the errors occurring in Ion Torrent sequencing.
Description of analysis:
Public Ion Torrent data was downloaded from the Ion Torrent community and public MiSeq data was downloaded from the Illumina website. Data from both technologies were analyzed with FastQC before recalibration.
fastqc --nogroup -o <OUTPUT_DIR> <original.bam>
Because Ion Torrent data is not accepted by the GATK, a temporary bam file was created with the datatype set as 454 data.
samtools view -h B14-387.bam | sed 's/IONTORRENT/LS454/g' > temp.sam
samtools view -S -b temp.sam > temp.bam
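The sed substitution above rewrites every occurrence of the string IONTORRENT anywhere in the SAM text. A slightly more conservative variant, sketched here, touches only the platform field of read group header lines. It assumes the platform is recorded as `PL:IONTORRENT` on `@RG` lines, which is worth confirming with `samtools view -H` on your own file. The function name `retag_platform` is made up for this example.

```shell
# retag_platform: reads SAM text on stdin and rewrites PL:IONTORRENT to PL:LS454
# on @RG header lines only; every other line passes through unchanged.
# ASSUMPTION: the platform is stored as the PL field of the @RG header lines.
retag_platform() {
    sed '/^@RG/ s/PL:IONTORRENT/PL:LS454/'
}

# Usage with the file names from this post (requires samtools on the PATH):
#   samtools view -h B14-387.bam | retag_platform > temp.sam
#   samtools view -S -b temp.sam > temp.bam
```

Restricting the edit to `@RG` lines means alignment records can never be altered by accident, at the cost of missing the tag if it is stored somewhere else in the header.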
Using the recommended workflow from the Broad Institute’s GATK website, the following commands were used to recalibrate the bam files.
# count original covariates
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -R <reference.fasta> \
  --run_without_dbsnp_potentially_ruining_quality \
  -I temp.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov CycleCovariate \
  -cov DinucCovariate \
  -recalFile table_original.recal_dataout.csv

# recalibrate table
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -R <reference.fasta> \
  -I temp.bam \
  -T TableRecalibration \
  -recalFile table_original.recal_dataout.csv \
  --out recalibrated_bam.bam

# analyze original covariates
java -Xmx4g -jar AnalyzeCovariates.jar \
  -recalFile table_original.recal_dataout.csv \
  -outputDir <OUTPUT_DIR> \
  -resources <RESOURCES_DIR> \
  -ignoreQ 5
The recalibrated_bam.bam files were then analyzed with FastQC:
fastqc --nogroup -o <OUTPUT_DIR> recalibrated_bam.bam