
How Much Exome Coverage Do I Need?

If I had a nickel for every time we answered the “How much coverage is that?” question in relation to our exome sequencing services, I wouldn’t need to be writing this article.  I’d be on a boat in the Caribbean.  But, I’m not on a boat anywhere, so with all my non-boating free time I thought I would try and collect information that would help people understand which important factors to consider when doing exome sequencing.

First and foremost, you are looking for variants; try never to lose sight of that.  You could cover 100% of the exome at 100X coverage, but if you can’t call SNPs or INDELs with high accuracy, sensitivity and specificity, there really is no point.  Calling variants is far from standardized or commoditized: some do it well, most do not, at least not yet.  You can see how we do it in our CLIA pipeline by looking at our exome white papers.  Below are statistics from our CLIA validation samples: 3 HapMap samples (NA12878, NA19239, and HG02024) and J. Craig Venter (NS12911).

| Sample | SNPs | dbSNP (SNPs) | INDELs | dbSNP (INDELs) |
|---|---|---|---|---|
| HG02024-AGXT2-LAB1368-A | 42,019 | 40,979 (99.92%) | 3,449 | 3,202 (96.28%) |
| HG02024-AGXT2-LAB1368-B | 41,937 | 40,927 (99.94%) | 3,402 | 3,155 (96.32%) |
| HG02024-NGv3-LAB1360-A | 59,112 | 56,671 (99.90%) | 4,002 | 3,685 (97.37%) |
| HG02024-NGv3-LAB1360-B | 59,327 | 56,865 (99.92%) | 4,074 | 3,721 (97.53%) |
| NA12878-AGXT2-LAB1368-A | 42,171 | 41,781 (99.96%) | 3,325 | 3,123 (96.29%) |
| NA12878-AGXT2-LAB1368-B | 42,222 | 41,845 (99.96%) | 3,353 | 3,135 (97.03%) |
| NA12878-NGv3-LAB1364-A | 59,277 | 57,660 (99.95%) | 4,086 | 3,802 (97.19%) |
| NA12878-NGv3-LAB1364-B | 60,304 | 58,406 (99.93%) | 4,126 | 3,824 (97.04%) |
| NA19239-AGXT2-LAB1368-A | 51,629 | 51,147 (99.95%) | 3,877 | 3,504 (95.58%) |
| NA19239-AGXT2-LAB1368-B | 51,560 | 51,094 (99.96%) | 3,904 | 3,538 (95.62%) |
| NA19239-NGv3-LAB1364-A | 71,841 | 69,645 (99.91%) | 4,788 | 4,307 (96.75%) |
| NS12911-AGXT2-LAB1368-A | 42,296 | 41,866 (99.95%) | 3,404 | 3,211 (96.76%) |
| NS12911-AGXT2-LAB1368-B | 42,289 | 41,883 (99.97%) | 3,399 | 3,209 (96.48%) |
| NS12911-NGv3-LAB1364-A | 59,080 | 57,460 (99.94%) | 4,000 | 3,774 (97.51%) |
| NS12911-NGv3-LAB1364-B | 59,102 | 57,451 (99.93%) | 4,000 | 3,760 (97.45%) |

* AGXT2 – Agilent XT2 V4, NGv3 – NimbleGen All Exome V3

Variability

 

It is important to distill the idea of coverage down to the basics beyond just depth: the better the coverage, including the breadth and uniformity of the reads, the better the chance to find de novo variants and call them accurately.  The case I want to make is that our traditional view of coverage, inherited from the human and other genome projects, is antiquated.  First off, exome sequencing is not whole genome sequencing.  You WILL introduce variability, and not all targeted regions will be captured at the same efficiency.  Heck, not every kit even covers the same exome regions.
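To make depth, breadth and uniformity concrete, here is a minimal Python sketch that summarizes a list of per-base depths over a target region. The function name `coverage_summary`, the toy depth lists, and the uniformity definition (fraction of bases at or above 0.2 times the mean depth) are illustrative assumptions, not part of our pipeline.

```python
from statistics import mean


def coverage_summary(depths, thresholds=(1, 10, 20), uniformity_fraction=0.2):
    """Summarize per-base read depths over a target region.

    depths: one integer depth per targeted base (hypothetical input).
    uniformity_fraction: assumed cutoff; a base counts as evenly covered
    when its depth is at least this fraction of the mean depth.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    n = len(depths)
    mean_depth = mean(depths)
    # Breadth: fraction of targeted bases covered at or above each threshold.
    breadth = {t: sum(d >= t for d in depths) / n for t in thresholds}
    # Uniformity: fraction of bases not far below the mean depth.
    uniformity = sum(d >= uniformity_fraction * mean_depth for d in depths) / n
    return {"mean": mean_depth, "breadth": breadth, "uniformity": uniformity}


# Two toy regions with the same mean depth but very different evenness.
even = [50] * 10
uneven = [0, 2, 5, 8, 15, 30, 60, 90, 120, 170]

for label, depths in (("even", even), ("uneven", uneven)):
    print(label, coverage_summary(depths))
```

Both toy regions have the same mean depth, yet the uneven one leaves a large share of bases below 10X, which is why mean depth alone says little about whether a variant can be called at a given position.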

[Figure 1: comparison of exome sequencing kits]

| Target region | NimbleGen | Agilent | TruSeq |
|---|---|---|---|
| Total | 64,190,759 | 51,542,882 | 61,884,224 |
| RefSeq (Coding) | 33,491,892 | 32,326,914 | 31,817,166 |
| RefSeq (UTR) | NA | 3,920,825 | 31,642,004 |
| Ensembl (CDS) | 31,690,383 | 33,472,589 | 31,918,846 |
| Ensembl (Exon) | 33,731,215 | 38,123,201 | 59,275,652 |
| miRBase | 59,996 | 55,249 | 27,963 |

Fig1: Source www.NimbleGen.com, Table1: Source EdgeBio

But what do I mean by “variability”, anyway? Isn’t variability bad?  Well, it CAN be, but given that you can get extremely high coverage of most of the coding (and some non-coding) regions of a genome for 6.5X lower cost, exome sequencing is still a viable alternative to genome sequencing for research and clinical applications. Data from the 1000 Genomes Project (Parla J.S. et al.)[i] showed that reaching 20X coverage across 95% of the consensus coding sequence (CCDS) exons required 200Gb of raw input from 5 whole genomes (YRI and CEU samples). The same percentage of CCDS (~90% of the Agilent capture region and ~85% of the NimbleGen capture region) requires <20Gb of raw input from whole exome sequencing. Even by the most conservative estimates, exome sequencing covers these exons as well as whole genome sequencing does, with 10-20 fold less raw sequence data.

While hybrid capture is capable of targeting megabases of the captured region, the technique can have lower specificity and uneven coverage across targets. Exome capture yields such small amounts of DNA that it is not suited for use in enrichment studies that target less than a few megabases. PCR or rolling circle amplification typically yields higher specificity and deeper coverage than hybrid capture and works well with kilobase size targets, but is ill-suited for sequencing larger regions of DNA such as the exome.  For a more in depth look at different enrichment techniques, see our white paper, Project Design and Enrichment Techniques for Genomic DNA.

So, given what we already know about the strengths and weaknesses of hybrid capture, how can we qualify and quantify coverage in a way that isn’t antiquated and makes sense for the applications it is intended for today?  Let’s dive in.

Coverage is variable.  Period.  Even on the same sample, run by the same tech, from the same kit, run on the same sequencer.  Let’s again take a look at our CLIA validation samples:

| Sample | Total BP | Mean coverage | ≥1X | ≥10X | ≥20X |
|---|---|---|---|---|---|
| HG02024-AGXT2-LAB1368-A | 4,925,364,304 | 96X | 99.80% | 97.60% | 92.20% |
| HG02024-AGXT2-LAB1368-B | 5,011,387,562 | 98X | 99.80% | 97.70% | 92.50% |
| HG02024-NGv3-LAB1360-A | 6,905,186,308 | 108X | 99.50% | 97.00% | 94.20% |
| HG02024-NGv3-LAB1360-B | 6,855,031,231 | 107X | 99.50% | 97.00% | 94.10% |
| NA12878-AGXT2-LAB1368-A | 4,094,098,215 | 80X | 99.80% | 96.90% | 89.70% |
| NA12878-AGXT2-LAB1368-B | 4,162,961,310 | 81X | 99.80% | 97.00% | 90.00% |
| NA12878-NGv3-LAB1364-A | 5,878,647,076 | 92X | 99.20% | 96.50% | 93.40% |
| NA12878-NGv3-LAB1364-B | 5,907,064,617 | 92X | 99.20% | 96.50% | 93.40% |
| NA19239-AGXT2-LAB1368-A | 4,113,202,304 | 80X | 99.90% | 96.90% | 89.40% |
| NA19239-AGXT2-LAB1368-B | 4,186,294,915 | 82X | 99.90% | 97.00% | 89.80% |
| NA19239-NGv3-LAB1364-A | 5,838,313,095 | 91X | 99.30% | 96.90% | 93.40% |
| NS12911-AGXT2-LAB1368-A | 4,912,577,301 | 96X | 99.90% | 97.70% | 92.10% |
| NS12911-AGXT2-LAB1368-B | 5,002,435,524 | 98X | 99.90% | 97.80% | 92.40% |
| NS12911-NGv3-LAB1364-A | 5,038,075,642 | 79X | 99.20% | 96.40% | 91.70% |
| NS12911-NGv3-LAB1364-B | 5,056,850,084 | 79X | 99.20% | 96.40% | 91.70% |

Pattern above for sample name is: DNASource-Kit-Tech-Replicate
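As a rough check on how the mean figures relate to the raw numbers, the sketch below divides a sample's total base pairs by the total target size of its kit. It assumes that "Total BP" counts on-target bases, that AGXT2 corresponds to the Agilent total (51,542,882 bp) and NGv3 to the NimbleGen total (64,190,759 bp) from the kit comparison table, and that the kit code is the second field of the sample name. The helper name `mean_coverage` is hypothetical.

```python
# Kit target sizes in bp, taken from the "Total" row of the kit comparison
# table. Mapping AGXT2 -> Agilent and NGv3 -> NimbleGen is an assumption.
KIT_TARGET_BP = {"AGXT2": 51_542_882, "NGv3": 64_190_759}


def mean_coverage(total_bp, kit):
    """Approximate mean depth as total bases divided by kit target size."""
    return total_bp / KIT_TARGET_BP[kit]


# A few rows from the coverage table above (sample name -> Total BP).
samples = {
    "HG02024-AGXT2-LAB1368-A": 4_925_364_304,
    "HG02024-NGv3-LAB1360-A": 6_905_186_308,
    "NA19239-AGXT2-LAB1368-A": 4_113_202_304,
    "NA19239-NGv3-LAB1364-A": 5_838_313_095,
}

for name, total_bp in samples.items():
    kit = name.split("-")[1]  # DNASource-Kit-Tech-Replicate
    print(f"{name}: {mean_coverage(total_bp, kit):.1f}X")
```

Under these assumptions the results land within about 1X of the mean coverage reported in the table, which appears to be rounded up. The point of the exercise is that the same amount of sequence gives a lower mean depth on a larger kit, so mean coverage should always be read alongside the kit's target size.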

Let’s take NA19239 (a Yoruban male) and NS12911 (a CEU male) for example. Comparing the kits on NA19239, one sees a difference of about 10X in mean coverage (NimbleGen being roughly 10X deeper), and it holds across technical replicates.  But in NS12911, one sees the inverse, with a difference of almost 20X (Agilent being roughly 18X deeper).  Yet for NA19239 there is almost no difference between the 2 kits when looking at how much (96-98%) of the exome is covered at 10X.  Looking back at the variant table above, one can also see equally high concordance rates (99.9+% for SNPs and 95-97% for INDELs) across samples whose mean coverage varies between 79X and 98X. Similar patterns can be seen across lab techs and sequencing runs. Finding de novo mutations is a function of both sample origin (for example, comparing a Yoruba individual against the predominantly CEU-based reference sequence) and the size of the capture kit, which varies between manufacturers.

So, downstream variant calling can be consistent across samples, even if coverage is not entirely uniform.

Why the variability?

 

Regardless of how the enrichment is performed there are a few important considerations[ii] that affect the efficiency of capture:

Quantity and Quality of the Input DNA: A lower quantity or lower quality of DNA is often found to introduce bias in the downstream analysis.

Standard operating procedures (SOPs) are in place for sample submission in which quality requirements, quantity requirements, purity requirement, molecular weight requirements, storage instructions, and shipping instructions are clearly explained for various sample types. We run multiple QC processes at EdgeBio.

Repeat Elements, Tandem repeats and Pseudogenes: These regions tend to cause uneven distribution of coverage.

Extreme GC Content: Regions with high GC content such as the 5’UTR, promoter regions and the first exons of genes can affect enrichment efficiency.

Library Insert Length and its Distribution: Different capture platforms recommend different sets of standard practices for sample library preparation. As a result of these underlying chemistries, each platform has its own range of recommended fragment sizes. Agilent insert size ranges from 100 to 300bp, NimbleGen ranges from 150 to 250bp and TruSeq has the broadest range of 300 to 500bp.

Sequencing: This is actually one of the few areas where we as a lab have almost total control over how much coverage can be generated. Issues such as sample quantity and quality, distributions of capture fragment sizes, etc. lie outside our control.  We almost always exceed the per-run specification set forth by Illumina.
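Of the considerations listed above, extreme GC content is the easiest to screen for ahead of time. The following sketch flags windows of a target sequence whose GC fraction falls outside a chosen range. The window size and the `low`/`high` cutoffs are hypothetical values picked for illustration, not thresholds from any kit manufacturer or from our pipeline.

```python
def gc_fraction(seq):
    """Return the G+C fraction of the called (A/C/G/T) bases, or None if none."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None  # e.g. a window made entirely of N
    return (seq.count("G") + seq.count("C")) / called


def extreme_gc_windows(seq, window=100, low=0.25, high=0.65):
    """List (start, gc) for non-overlapping windows with GC outside [low, high].

    window, low and high are assumed values for illustration only.
    """
    flagged = []
    for start in range(0, len(seq), window):
        gc = gc_fraction(seq[start:start + window])
        if gc is not None and (gc < low or gc > high):
            flagged.append((start, gc))
    return flagged


# Toy target: a GC-rich stretch, an AT-rich stretch, then a balanced stretch.
toy = "GC" * 50 + "AT" * 50 + "ACGT" * 25

for start, gc in extreme_gc_windows(toy):
    print(f"window starting at {start}: GC fraction {gc:.2f}")
```

Windows flagged this way are candidates for lower or uneven capture efficiency, so they are a reasonable place to look first when a region of interest comes back thinly covered.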

Can variability be overcome?

 

Yes. Each of the commercial manufacturers is constantly re-evaluating and re-balancing its exome kits to provide more uniform coverage across areas of interest.  That’s why we are on V3, V4 or V5 of each kit.  Also, as we do more exome sequencing, we can improve variant calls to account for potential variability and bias.  To detect these patterns you just need time, experience and a large sample cohort.  EdgeBio now has all three.

Can you guarantee coverage?

 

Anyone can guarantee coverage; that’s the easy part.  You can keep running a poor quality sample until you get enough data.  But a lab that truly understands exome sequencing and is building a sustainable business around the application probably wouldn’t.  For our clinical exome, typical yields are anywhere from 75-120X average on-target coverage, with 92-98% (mean 95%) of the target exome covered at a minimum of 10X.  For our research use only (RUO) exome, yields are anywhere from 40-80X average on-target coverage, with 88-94% (mean 90%) of the target exome covered at a minimum of 10X.

We do guarantee that we will generate enough sequence data to theoretically hit these minimums based on the size of the capture kit.  This insulates a user from process, reagent or sequencing error/underperformance. 
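The arithmetic behind a guarantee of this kind can be sketched as follows: the bases needed on target are the kit size times the desired mean coverage, scaled up for reads that miss the target or are discarded as duplicates. The function name `required_raw_bases` and the default `on_target_rate` and `duplicate_rate` values are hypothetical placeholders, not our actual specifications.

```python
import math


def required_raw_bases(target_bp, mean_coverage, on_target_rate=0.7,
                       duplicate_rate=0.1):
    """Estimate the raw sequence needed to reach a mean on-target coverage.

    on_target_rate and duplicate_rate are assumed values for illustration;
    real rates depend on the kit, the library and the sample.
    """
    if not 0 < on_target_rate <= 1:
        raise ValueError("on_target_rate must be in (0, 1]")
    if not 0 <= duplicate_rate < 1:
        raise ValueError("duplicate_rate must be in [0, 1)")
    # Fraction of raw bases that end up as usable on-target coverage.
    usable = on_target_rate * (1 - duplicate_rate)
    return math.ceil(target_bp * mean_coverage / usable)


# Kit sizes from the "Total" row of the kit comparison table.
for kit, target_bp in (("Agilent", 51_542_882), ("NimbleGen", 64_190_759)):
    raw = required_raw_bases(target_bp, mean_coverage=75)
    print(f"{kit}: about {raw / 1e9:.1f} Gb raw for 75X mean coverage")
```

This is a theoretical minimum in the same sense as the guarantee: it says how much sequence has to come off the instrument, not how evenly that sequence will be spread across the target.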

But, as you can see from above, your mileage may (and will) vary.  This is biology, not driving your car to the grocery store.

Interested in Exome Sequencing?  Contact Us.

 


[i] Parla et al. A comparative analysis of exome capture. Genome Biology 2011 12:R97.  

[ii] Mertes F, ElSharawy A, et al. Targeted enrichment of genomic DNA regions for next-generation sequencing. Briefings in Functional Genomics doi:10.1093/elr033.