Variant Calling on Ion Torrent DataPosted by David Jenkins on Oct 25, 2011
Variant calling, the detection of SNPs and INDELs, plays a particularly important role in Ion Torrent data due to its propensity for homopolymer errors. It’s particularly challenging to sort the insertions and deletions that occur through sequencing errors from true differences from the reference sequence. A variant calling plugin is included in the Ion Torrent analysis pipeline that assists in the identification of SNPs and INDELs. Utilizing SAMtools the plugin produces a variant sample report using settings adjusted to match the error model in Ion Torrent data. Using recently sequenced E. coli DH10B data from a 316 chip and an artificially mutated E. coli genome the variant analysis plugin settings were compared to other samtools settings to try to find settings that produce the most true variants while avoiding false positives. This mutated genome and a genome comprised of only homopolymer errors were also used to compare Illumina’s MiSeq technology to Ion Torrent in terms of variant analysis.
The Variant Calling Plugin
The Ion Torrent Variant Calling Plugin (now renamed "Germ Line Variant Calling Plugin") calls SAMtools, a set of tools that process alignments in the SAM/BAM format. First it calls samtools mpileup, which computes the likelihood of the data given specific quality parameters. The Variant Calling Plugin then applies bcftools to use that prior data to call the variants. Finally it calls ‘vcf_filter.pl,’ to filter out some of the data. The resulting variants are printed out in the report. Because the standard SAMtools settings are tuned toward data that doesn’t include as many insertion and deletion sequencing errors as Ion Torrent data, the Variant Calling Plugin reduces the number of INDEL variants that it detects by specifying a minimum quality requirement for each site that is considered, ignoring sites where the reference sequence is ambiguous, ignoring sites with small insertions or deletions with relatively low coverage, lowering the probability that a gap opening is a true variant rather than a sequencing error, lowering the probability that a gap extension is a true variant rather than a sequencing error, and selecting for sites with homopolymer errors.
The Variant Calling Plugin first calls samtools to compute the likelihood of the data and uses the following command.
samtools mpileup -d 10000 -L 1000 -Q 7 -h 50 -o 10 -e 17 -m 4 \ –f [REFERENCE.fasta] -g [ALIGNMENT.bam] -d 10,000 – at a position, read maximally 10,000 reads per input BAM (default 250) -L 1,000 – skip INDEL calling if the average per-sample depth is above 1,000 (default 250) -Q 7 – Minimum base quality of 7 required to be considered (default 13) -h 50 – Coefficient for modeling homopolymer errors. Given an L long homopolymer run, the sequencing error of an indel of size s is modeled as 50*s/L. (default 100) -o 10 – Phred-scaled gap open sequencing error probability. Reducing the number leads to more indel calls. (default 40) -e 17 – Phred-scaled gap extension error probability. Reducing the number leads to longer indels. (default 20) -m 4 – minimum of 4 gapped reads for an indel candidate (default 1) -g – generate a BCF output with computed likelihoods.
After this BCF output is created the Variant Calling Plugin calls bcftools view to do the actual variant calling.
bcftools view -e -g -c -v -N -i -1 [BCF_from_MPILEUP] -e – perform a likelihood based analysis -g – call genotypes at variant sites -c – SNP calling -v – output potential variant sites only -N – skip sites where REFERENCE is not A/C/G/T -i – indel-to-substitution ratio of -1
This produces a list of variants that is further filtered with a perl script called vcf_filter.pl
perl vcf_filter.pl -d 2 -D 10000000 -w 0 -W 0 -s 1 -S 1 -H 2 -d 2 – minimum read depth -D 10,000,000 – maximum read depth -w 0 – SNP within 0 bp around a gap to be filtered -W 0 – window size for filtering adjacent gaps -s 1 – minimum read depth of 1 for SNPs on each strand -S 1 – minimum read depth of 1 for INDELs on each strand -H 2 – minimum read depth of 2 for Homopolymer INDELs on each strand Additional filtering settings: -1 0.0001 – min P-value for strand bias (given PV4) -2 1e-100 – min P-value for baseQ bias -3 0 – min P-value for mapQ bias -4 0.0001 – min P-value for end distance bias
This script filters out the remaining variants to achieve the list of variants reported by the Variant Calling Plugin.
Mapping Ion Torrent Data to the DH10B Reference Genome
DH10B is a well validated and frequently resequenced E. coli genome. Because the sequence is so well validated, any resequencing and mapping of DH10B should have no SNPs or INDELs. Any identified SNPs or INDELs indicate sequencing errors. The following is a chart of SNP and INDEL errors identified with the Variant Analysis plugin in Ion Torrent and MiSeq resequencing projects of DH10B.
|Edge Bio Ion Torrent 316 DH10B||16||0|
|Life Ion Torrent 316 DH10B||18||0|
|Life Ion Torrent 314LR DH10B||24||0|
|Life Ion Torrent 316LR DH10B||120||0|
|Life Ion Torrent 316LR DH10B >99% per base accuracy with long reads.||38||0|
|Life Ion Torrent 318 Chip||105||0|
|Life Ion Torrent 314 100MB||184||0|
|Illumina MiSeq DH10B||1||1|
While the EdgeBio dataset in the table above called the least amount of false positive INDELs, this dataset has fairly low coverage across the genome. Samtools requires a specific mapping quality at a position in order to consider it as a potential variant. This EdgeBio dataset had greater than 300,000 positions with low coverage (raw coverage less than 7) compared to about 300 in the Life Ion Torrent 316 DH10B run. This may explain why the number of false positives in this run is lower than more recent runs with more coverage at each position due to higher throughput. Compare these results to some of the same data analyzed with samtools default settings.
|EdgeBio Ion Torrent 316 DH10B||Life Ion Torrent 316 DH10B||Life Ion Torrent 314LR DH10B||Life Ion Torrent 316LR DH10B||MiSeq DH10B|
These two analyses show that samtools with variant Analysis settings is more effective at avoiding false positives than samtools with default settings in Ion Torrent data. The real question is whether variant Analysis will be able to identify true SNPs and INDELs in a resequencing project with actual SNPs and INDELs to identify while still avoiding false positives. With some true SNPs and INDELs to identify, positive predictive value and sensitivity for variant Analysis can be calculated.
Mutated E. Coli Genome
The DH10B reference was mutated with the maq fakemut utility to introduce some SNP and INDEL errors into the original DH10B sequence. The result was a mutated DH10B genome with 220 artificial SNPs and 239 artificial INDELs of size 1, both randomly inserted throughout the genome. The EdgeBio DH10B dataset was then remapped to the mutated E. Coli reference using tmap. Samtools was then run on the EdgeBio DH10B dataset and the results were compared with the list of artificial variants. Once the number of true positives, false positives, true negatives, and false negatives were counted, PPV and sensitivity were calculated. Below is an example. The EdgeBio Ion Torrent 316 dataset was mapped with tmap to the mutated E. coli reference and analyzed with the variant analysis plugin.
|Fake Inserted Indels|
|Positive||354||16||Positive Predictive Value|
|Negative||105||4,685,662||Negative Predictive Value|
In the table above sensitivity is 77.124% because Ion Torrent missed 105 of the artificial variants. Looking at these variants 71 of the 105 total variants were in locations that had lower coverage than samtools considers for variant analysis. Because it is unfair to count these variants as truly missed by the plugin, 'Corrected Sensitivity' will be considered in the analysis below. Corrected sensitivity is calculated by dividing the number of false variants identified plus the number of variants missed due to low coverage, by the total number of artificial variants. For example in the above table Corrected Sensitivity = (354+71)/459 = 92.59%.
Samtools was run with various settings to make sure the settings chosen for the variant analysis plugin produce the optimal variant finding results. The optimal results would report as many true positive variants as possible without losing positive predictive value by reporting many false positives. Below is a table of the different samtools settings tested with PPV and sensitivity for each run. The values to the left of the PPV and Sensitivity listed below are the samtools settings described earlier modified in each analysis (ex: Q - minimum base quality required to be considered.)
|13||100||40||20||1||1||Default Samtools Settings||6.014%||96.682%||3.203%||100.000%||100.000%||100.000%|
|7||50||10||17||4||2||Variant Calling Setting||95.676%||100.000%||90.533%||90.650%||99.550%||83.260%|
Variant calling settings have higher PPV and sensitivity than default samtools settings and most other samtools settings tested, but at the expense of about 10% of the sensitivity of the samtools default runs. While many people would consider the results with the highest PPV and sensitivity 'best,' it's important to remember that the values that should be used in variant analysis are application specific. If your sequencing project can afford to loose a few true INDELs at the expense of a lower number of false positives in your results, variant analysis settings would work best for you. If, on the other hand, you can't afford to loose any INDELs, it may be better to run the samtools default settings and use other methods to filter out the identified false positives. All of the analysis in this blog post was not filtered in any in order to get a full picture of the variants that would be identified in a standard variant calling run.
Mapping Ion Torrent Public Datasets Against Mutated E. Coli DH10B
A great part of working with Ion Torrent data is the frequent release of new and 'better' datasets through the Ion Community. To see if the newer datasets were more accurately identifying fake mutations, the Ion Torrent public DH10B datasets were mapped to the mutated E. Coli reference. PPV and corrected sensitivity were calculated as above.
|Edge Bio Ion Torrent 316 DH10B||57.68MB||95.68%||100.00%||90.53%||90.65%||99.55%||83.26%|
|Life Ion Torrent 316 DH10B||175.57MB||95.66%||100.00%||91.91%||97.60%||97.73%||97.49%|
|Life Ion Torrent 314LR DH10B||78.07MB||93.56%||100.00%||85.44%||88.67%||96.82%||81.17%|
|Life Ion Torrent 316LR DH10B||664.13MB||74.74%||100.00%||60.38%||96.51%||97.73%||95.40%|
|Life Ion Torrent 316LR DH10B >99% per base accuracy||492.22MB||91.99%||100.00%||85.71%||96.73%||95.91%||97.49%|
|Life Ion Torrent 318LR Chip||1,203.97MB||80.61%||100.00%||67.92%||94.77%||97.73%||92.05%|
|Life Ion Torrent 314 100MB||122.59MB||71.21%||100.00%||55.99%||96.30%||96.36%||96.23%|
|Illumina MiSeq DH10B||2,592.60MB||99.08%||99.52%||98.66%||97.17%||98.18%||96.23%|
This table shows that as Ion Torrent throughput increases INDEL PPV tends to decrease. This makes sense looking at the original mapping to the non-mutated E. Coli reference. Datasets with longer reads appear to be calling more false positives. We know that as reads get longer there is a predisposition to calling homopolymers towards the ends of the reads. With increasing per base accuracy will come a decrease in miscalling INDELs and result in higher PPV. Ion Torrent has recently shown great improvement with the new algorithms used in Torrent Suite 1.5 and coupled with the $1 million accuracy grand challenge the number of miscalled INDELs is expected to decrease over the coming months.
Homopolymer Mutated E. Coli Genome
One limitation of the maq fakemut utility is that the inserted indels are all length one. A perl script was written to introduce random homopolymer mutations into the E. Coli DH10B genome. 294 random homopolymer mutations (159 insertions and 135 deletions) of various sizes was inserted into the genome. The public datasets were mapped to the homopolymer mutated genome and the homopolymer PPV and corrected sensitivity were calculated.
|Edge Bio Ion Torrent 316 DH10B||94.65%||78.23%|
|Life Ion Torrent 316 DH10B||93.59%||96.60%|
|Life Ion Torrent 314LR DH10B||88.14%||71.43%|
|Life Ion Torrent 316LR DH10B||64.55%||94.22%|
|Life Ion Torrent 316LR DH10B >99% per base accuracy with long reads.||87.87%||98.30%|
|Life Ion Torrent 318LR Chip||70.68%||90.14%|
|Life Ion Torrent 314 100MB||60.85%||95.58%|
|Illumina MiSeq DH10B||97.04%||91.16%|
Homopolymer INDEL PPV and Sensitivity are quite similar to the length one INDELs in the previous analysis with slightly decreased sensitivity. Identification of inserted INDELs represents a real challenge for the Ion Torrent platform but in runs like the >99% per base accuracy 316 chip run it is still able to identify 98.3% of the fake homopolymers with high PPV (87.87%).
The variant calling plugin is a powerful tool that is able to detect both SNPs and INDELs at high PPV and sensitivity. Unfortunately it is not perfect and cannot guarantee that its result does not contain false positives and false negatives. As Ion Torrent continues to increase throughput and quality variant analysis will benefit greatly. Nevertheless it is important to remember that results from variant analysis may not represent the entire picture of INDEL errors in the sample and further analysis may be necessary to achieve a more complete picture of INDEL errors in your sample.
MiSeq data has a different error profile than that of Ion Torrent data. Since the main type of errors produced in MiSeq data are SNPs, MiSeq PPV is high for INDEL errors, but lower or SNPs. At default samtools settings MiSeq detected INDELs at a PPV of only 75.627%. Interestingly, using Life’s TMAP mapping software and the samtools settings used for Life’s variant analysis plugin, the PPV and sensitivity of the resulting analysis is almost perfect (99.08% PPV and 96.95% sensitivity). Coverage of the MiSeq data was quite a bit higher than the coverage in most of the Ion Torrent runs used in this analysis so this may have had an effect on the results. As both technologies mature it will become easier to identify true variants but for now it is important to remember that careful application specific selection of variant calling settings is necessary to get a full picture of variants in a sample.
- April 2013 (1)
- February 2013 (1)
- January 2013 (1)
- December 2012 (1)
- November 2012 (7)
- October 2012 (3)
- September 2012 (1)
- August 2012 (3)
- June 2012 (2)
- May 2012 (2)
- April 2012 (6)
- March 2012 (3)
- February 2012 (4)
- January 2012 (4)
- December 2011 (2)
- November 2011 (3)
- October 2011 (3)
- September 2011 (2)
- August 2011 (1)
- June 2011 (4)
- May 2011 (1)
- November 2010 (2)
- October 2010 (1)
- September 2010 (3)
- August 2010 (2)