Ion v. MiSeq - Is There a Competition, And If So, Why?

With many data sets from the E. coli outbreak in Germany (EdgeBio analysis, crowdsourcing analysis) and other E. coli runs (EdgeBio runs, Ion Torrent 316 run, MiSeq run) appearing in the public domain from many platforms and providers, the time seems ripe for people to start drawing comparisons between them. In this post I want to quickly summarize and share some thoughts on the new 316 chip from Ion Torrent (Dan Koboldt from MassGenomics has done a first-pass analysis as well), drawing some comparison to the 314 chip data. I also want to discuss the presentation Illumina recently posted on their website, comparing MiSeq with the Ion Torrent 314 chip data from EdgeBio, BGI, and Life Technologies.

Initial Look at the 316 Data

There are 4 areas that everyone is keenly interested in: throughput, quality, read length and cost. In short, how can we get the highest quality per base most cheaply? This is one way to look at it, but I think sometimes we lose sight of the science: what do we want to accomplish, in what time frame, and within what budget?

Throughput

The 316 chip generated 7x more data than the 314 chip (175 Mb compared to an average of 24 Mb on our 314 chips), for an average coverage of 35X for E. coli. That is right in the sweet spot for de novo assembly, or for mapping and variant calling.
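As a rough check on those numbers, average coverage is just sequenced bases divided by genome length. The sketch below assumes a genome length of about 4.7 Mb for E. coli DH10B, which is not a figure from our runs; with that assumption the 316 result lands in the same range as the roughly 35X quoted above, and the exact value depends on the genome length used and on how many bases survive filtering.

```python
def mean_coverage(total_bases: float, genome_size: float) -> float:
    """Average depth of coverage = sequenced bases / genome length."""
    return total_bases / genome_size


# Assumption: approximate E. coli DH10B genome length in bases.
GENOME_SIZE = 4.7e6

cov_316 = mean_coverage(175e6, GENOME_SIZE)  # 175 Mb from the 316 chip
cov_314 = mean_coverage(24e6, GENOME_SIZE)   # 24 Mb average from our 314 chips

print(f"316: {cov_316:.0f}X, 314: {cov_314:.0f}X, "
      f"ratio: {cov_316 / cov_314:.1f}x")
```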

Quality

There are 2 ways to look at quality: per base/read quality and consensus quality.

Read Quality

One immediate observation from the 316 data is that reads stay at higher quality for longer. The running average quality dips below Q20 around base 57 on the 314 chip, but stays at Q20 or above out to about base 67 on the 316 chip.

Fig 1. 314 Chip

Fig 2. 316 Chip

This can obviously use improving, but based on the changes from the first chip to the second in such a short amount of time, I have faith that the quality will continue to improve. It's early.
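For anyone who wants to find the same Q20 crossover point in their own data, here is a minimal sketch. It assumes the quality strings have already been decoded into lists of integer Phred scores, and it uses a plain per-position mean as a stand-in for the running average in the plots; the function names and toy reads are made up for illustration.

```python
from typing import List, Optional


def phred_to_error(q: float) -> float:
    """Phred score to error probability: Q20 -> 0.01, Q30 -> 0.001."""
    return 10 ** (-q / 10)


def mean_quality_by_position(reads: List[List[int]]) -> List[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    longest = max(len(r) for r in reads)
    means = []
    for pos in range(longest):
        column = [r[pos] for r in reads if len(r) > pos]
        means.append(sum(column) / len(column))
    return means


def first_position_below(means: List[float],
                         threshold: float = 20.0) -> Optional[int]:
    """1-based position where the mean first drops below the threshold."""
    for i, m in enumerate(means, start=1):
        if m < threshold:
            return i
    return None


# Toy data only, not real Ion Torrent reads.
toy_reads = [
    [30, 28, 25, 21, 18],
    [32, 30, 24, 20],
    [31, 27, 23, 19, 15, 10],
]
means = mean_quality_by_position(toy_reads)
print(first_position_below(means))
```

Note that averaging Phred scores directly averages a log-scaled value, so it is a convenient summary rather than a true mean error rate.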

Consensus Quality

When looking at the consensus quality after assembly, there is only a 1.5% error rate for both chips. Another good measure of consensus quality: since this is a lab strain, we should find no differences between this consensus and the reference DH10B genome. In our look into the 316 data, using Newbler and CLC (both more geared towards the error profile of Ion) and calling SNPs through our bacterial SNP calling pipeline, I get only 4 differences, and they would be filtered out for having multiple alleles. There are many indels though (~1000K), so this is a problem Ion (and/or the software) will need to address through chemistry and filtering improvements.

Read Length

The sequence length distributions for the 314 and 316 are very similar. My understanding from Ion is that they will focus on one area at a time: first throughput and newer chips, then quality and read lengths. Jonathan Rothberg said in a statement to GenomeWeb that he expects 400 bp reads by year's end. Very lofty, and encouraging if they can hit the mark so soon.

Fig 3. 314 Chip

Fig 4. 316 Chip

316 v. MiSeq

I would like to talk a bit about the report recently put out by Illumina. Up front, I want to make clear that while we own PGMs, our role as a Contract Research Organization is to be unbiased and provide the best platform for our clients' needs. We have no inherent bias against the MiSeq (we don't own one, but then again, nobody does yet...) or towards Ion Torrent (we own two PGMs). I won't regurgitate the specs laid out in the first 8 pages of the report; it's clear that the MiSeq is as robust a platform as the HiSeq. On MiSeq the read quality and read length are higher than the current PGM metrics, and it offers paired-end capability. But does this matter? Well, yes, for now. However, the potential of the PGM is what everyone is looking towards, and it is coming sooner than we thought. But let's just look at the here and now and go through some of the comparisons drawn in the document. On page 11 they compare the 314 runs Edge and other centers did to a single run of the MiSeq. At face value, this is compelling for MiSeq: 1.7 Gb of throughput in a 27 hour period. But comparing MiSeq to the 314 is like comparing a cheetah to me in terms of running ability.

Let’s hypothesize for a moment. In theory, if we could prepare 5 libraries in parallel (very doable off instrument) and sequence them on 316 chips, we could get 1.3 Gb of sequence data in a 24 hour period. With a 318, this would be 5 Gb per day from a single machine. Of course more runs require more hands-on time, but with that comes a bit more flexibility in the design of a project. You could mix in amplicons, metagenomic data, and microbes all in one day. That is dangerous to do in a single barcoded run on a MiSeq.

The next couple of pages of the document show quality plots very similar to those I showed above. There is no doubt that Illumina has higher average per-read accuracy. But how much does this matter...

...Assembly

Well, I took the 316 data and the MiSeq data and did some de novo assemblies to see. With 1.7 Gb of 150x2 PE reads, Illumina did an assembly with Velvet and got a 100K N50 in 125 total contigs, with 311K being the longest. With the 176 Mb of singleton data from the Ion, I ran Newbler and produced an assembly with a 50K N50 in 200 contigs, with 211K being the longest and 99.5% of the genome covered at Q40+ consensus quality. So the question is, are the differences due to Illumina using 10x more data, Ion’s lower quality scores, or the paired-end chemistry? For the first assembly, I subsampled the MiSeq data (preserving the PE nature of the reads) down to 35X coverage and re-ran Velvet with several seed values to produce the best assembly. Results were:

Number of sequences              88
Residue counts:
Total                            4531589
Sequence lengths:
Minimum                          582
Maximum                          325825
Average                          51495.33
N50                              105505

This is very comparable to the assembly in the Illumina presentation on page 8, so the amount of data wasn’t the culprit. I then ran an assembly on only the forward Illumina reads, yielding the following metrics:

Number of sequences             117
Residue counts:
Total                           4546543
Sequence lengths:
Minimum                         524
Maximum                         236274
Average                         38859.34
N50                             94926

Slightly worse, so it looks like 35X coverage is a reasonable amount to use in a de novo assembly of this organism. It also shows that for less complex organisms, you don’t gain much in de novo assembly from the PE chemistry. For more complex organisms, the long mate-pair libraries Ion plans to establish should prove useful. Our assembly of the 316 35X data set yielded:

Number of sequences             173
Residue counts:
Total                           4448382
Sequence lengths:
Minimum                         531
Maximum                         211950
Average                         25713.19
N50                             50520
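For readers less familiar with the N50 figures in these tables, here is a minimal sketch of one common definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. Assembly tools differ slightly in the details, so this is not necessarily the exact calculation behind the numbers above, and the toy contig lengths are made up. The last lines only confirm that the Average rows are the reported totals divided by the contig counts.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length of the contig at which the cumulative sum of contig lengths,
    taken longest first, reaches at least half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("need at least one contig")


# Toy contig lengths for illustration only.
toy_contigs = [100, 80, 50, 30, 20, 10]
print(n50(toy_contigs))

# Sanity check on the tables: average = total residues / number of sequences.
reported = {
    "MiSeq 35X PE": (4531589, 88),
    "MiSeq 35X forward only": (4546543, 117),
    "Ion 316 35X": (4448382, 173),
}
averages = {name: round(total / count, 2)
            for name, (total, count) in reported.items()}
print(averages)
```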

Given the same coverage, the Ion underperforms, which shows that read quality really does matter. In short, MiSeq is currently better in terms of cost per sample with barcoding. Ion could benefit in the near term from partnering with other technologies, like optical mapping company OpGen (we hope to have OpGen run a sample or two from our Ion experiments to test this hypothesis). With longer quality reads and a mate-pair (MP) protocol coming, Ion looks promising in its ability to generate very good, cheap, fast assemblies by year’s end, with quality improving steadily as a focus for Ion.

While the long term outlook for Ion is bright (but unproven), I wonder about the long term enhancements to the MiSeq (a proven system based on the HiSeq). Throughput will no doubt continue to increase, much as it has on the HiSeq, but is the chemistry limited to sub-200 bp read lengths and current quality? I would love to hear from you on this in the comment section.

Genome Coverage

The slide I didn’t really understand was the coverage distribution slide on page 14. Setting the threshold for percent of genome covered at 15X on a 5X run doesn’t make sense to me. Of course the genome won’t be covered at that threshold. In our analysis, 93% of the genome was covered at Q40+. They were also comparing against their own 360X data, which yielded 99% coverage. Subsampled down to 35X coverage, the MiSeq data yields 98% coverage at Q40+. Still impressive, and apples to apples. We have seen in previous 314 experiments that as we add more data, more of the genome is covered, so as new chips come on line this metric will increase significantly for Ion.
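To put a rough number on why a 15X threshold is a poor fit for a 5X run, here is a back-of-the-envelope sketch. It assumes read depth at each base follows a Poisson distribution around the mean coverage, which is an idealization that ignores GC bias and other systematic effects, so treat the output as an illustration rather than a prediction for either platform.

```python
import math


def fraction_at_or_above(mean_depth: float, min_depth: int) -> float:
    """Expected fraction of bases with depth >= min_depth, assuming
    per-base depth is Poisson-distributed with the given mean."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


# Mean coverages discussed above: a 5X 314 run versus a 35X data set.
for mean_depth in (5, 35):
    frac = fraction_at_or_above(mean_depth, 15)
    print(f"{mean_depth}X mean: {frac:.4%} of bases expected at >= 15X")
```

Under that assumption almost nothing in a 5X run reaches 15X, while nearly everything in a 35X run does, which is why the comparison needs matched coverage to be meaningful.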

All in all, I think both systems are solid investments for a lab, and I don’t really buy into either side making this a competition between platforms. They're different. And not in a bad way. And if Ion can reach its projected throughput, quality, and read length goals, I see more of an issue for 454 than for MiSeq. While Ion may lag behind the MiSeq in de novo assembly for now (and let’s remember we have only been running it for a couple of months), there are many applications it is suited for NOW. We are excited to have an alternative chemistry for rapid variant validation in our large exome studies (we couldn’t do this if we had only HiSeq and MiSeq), as well as its potential for metagenomics and resequencing. You can see a detailed analysis of some of its usefulness in resequencing applications by visiting this blog post.

We look forward to continued scrutiny and debate for all of the platforms, because in the end (and while we may lose sight of it sometimes) it is about the science and getting that done, quickly, effectively, and within a realistic budget.