Microbial Annotation in the Cloud: CloVR and Ion Torrent

At EdgeBio we constantly try to extend the applications and services that we can perform using Ion Torrent sequencing.  Recently we have been involved in some projects with microbial annotation.  By annotating the results of de novo sequencing with gene predictions, researchers can examine those results for biological significance.  With Ion Torrent sequencing the focus is on providing these results quickly and efficiently to our customers.

While there are many tools and options for microbial annotation, one particularly useful and interesting option is CloVR.  CloVR packages many commonly used bioinformatics tools into intuitive pipelines.  It is distributed as a virtual machine, allowing users to spin up the environment on a local machine or, even cooler, spin up an instance on Amazon EC2.  Because of its use of cloud computing and its intuitive web interface, CloVR is an attractive tool for many commonly performed bioinformatics tasks.  Pipelines include CloVR-Search, which performs large-scale BLAST searches; CloVR-16S, for ribosomal RNA sequence analysis; CloVR-metagenomics, for taxonomic and functional classification; and CloVR-Microbe, which handles assembly and/or annotation of microbial organisms.

Today I’d like to share our experience using the CloVR-Microbe pipeline to annotate Ion Torrent data.  Ion Torrent data isn’t yet officially supported, but because 454 data can be used as input to the pipeline, with a little effort Ion Torrent data can be successfully annotated as well.

We chose to perform the CloVR analysis on EC2.  Spinning up an instance of CloVR is relatively straightforward and is documented on the CloVR website.  After the instance is running, pipelines can be configured and run via the web interface.  To run a pipeline, first enter the EC2 credentials for the account you are using.

CloVR spins up separate instances for each analysis pipeline and spins these instances down when the analysis completes.   After adding EC2 credentials, upload and tag the data sets to be annotated.

CloVR tends to fail on datasets with a large number of contigs.  While I’m not sure exactly which step is causing the issue, CloVR users have reported in the past that CloVR can fail if it runs out of memory.  By splitting datasets into subsets of a more manageable size (about 50 contigs each), we’ve been able to run the CloVR annotation pipeline successfully, although it does still occasionally fail.  The final step is to configure the annotation pipeline and click submit.
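The contig splitting described above can be scripted.  Here is a minimal Python sketch, not the exact script we used: the function names, the `contigs_partNNN.fasta` output naming, and the assumption that the assembly is a plain multi-FASTA file are all illustrative.  The default of 50 contigs per file simply mirrors the subset size mentioned above.

```python
from itertools import islice
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a multi-FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, out_dir, contigs_per_file=50):
    """Write consecutive batches of contigs to numbered FASTA files.

    Returns the list of files written (hypothetical naming scheme).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = read_fasta(path)
    written = []
    part = 1
    while True:
        batch = list(islice(records, contigs_per_file))
        if not batch:
            break
        out_path = out_dir / f"contigs_part{part:03d}.fasta"
        with open(out_path, "w") as out:
            for header, seq in batch:
                out.write(f">{header}\n{seq}\n")
        written.append(out_path)
        part += 1
    return written
```

Calling `split_fasta("assembly.fasta", "chunks")` would produce one file per 50 contigs, each of which can then be uploaded and tagged as its own dataset.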

After a pipeline has been submitted, its status can be tracked through the web interface.  When the pipeline is complete, CloVR provides the annotated genes in four formats: GenBank, sqn (for submission to GenBank), coding sequence FASTA, and polypeptide FASTA.  These are standard file formats that can be converted and manipulated by many programs for additional downstream analysis.

As a proof of concept we annotated an assembly of an E. coli DH10B dataset freely available on our website (username: edgebio, leave password blank).  The dataset was split into smaller sets of contigs, each set was run on a CloVR instance on Amazon EC2, and the results were combined.
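For the FASTA outputs, combining the per-subset results can be as simple as concatenating the files.  The sketch below is one way to do that, under the assumption that each run produced a plain FASTA file; `merge_fasta` is a hypothetical helper, and the duplicate-ID check is a precaution we added for illustration, since we make no claim here about how CloVR names genes across separate runs.

```python
def merge_fasta(paths, out_path):
    """Concatenate FASTA files into one; return (record_count, duplicate_ids).

    Duplicate IDs are reported rather than rejected, so the caller can
    decide whether records from different runs need renaming.
    """
    seen, duplicates, count = set(), set(), 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        seq_id = fields[0] if fields else ""
                        if seq_id in seen:
                            duplicates.add(seq_id)
                        seen.add(seq_id)
                        count += 1
                    # Guard against a file that lacks a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return count, sorted(duplicates)
```

The GenBank and sqn outputs carry per-file structure, so simple concatenation may not be appropriate for them.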

The resulting output files can be downloaded here.  In addition to the four output files provided by CloVR, a GFF file of the annotations is also available.  To see whether CloVR accurately predicts the genes found in DH10B, we compared the gene predictions from CloVR’s annotation pipeline to the gene annotations in the DH10B GenBank file.  CloVR predicted a total of 4,883 genes in this dataset.  Of those, 4,644 were unique predictions, and 2,644 of the unique predictions had an actual gene ID.  The other 2,000 were predicted hypothetical proteins with no gene ID to cross-reference against the reference GenBank file.  The set of 2,644 genes was then compared to the E. coli DH10B GenBank file, which contains 4,181 annotated genes.  Of the 2,644, 2,262 were annotated in the DH10B reference.  In other words, 54% of the DH10B genes (2,262 of 4,181) were predicted in our annotated assembly, and 85% of the non-hypothetical predictions (2,262 of 2,644) were found in the DH10B GenBank annotation file.
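The comparison above boils down to intersecting two sets of gene IDs and taking two ratios.  The following Python sketch illustrates that arithmetic; `compare_gene_sets` and the toy gene IDs are made up for the example, and it assumes the gene IDs have already been extracted from the CloVR output and the reference file.

```python
def compare_gene_sets(predicted_ids, reference_ids):
    """Summarize the overlap between predicted and reference gene IDs."""
    predicted = set(predicted_ids)
    reference = set(reference_ids)
    shared = predicted & reference
    return {
        "shared": len(shared),
        # Fraction of reference genes that were predicted.
        "reference_recovered": len(shared) / len(reference) if reference else 0.0,
        # Fraction of predictions with a gene ID that match the reference.
        "predictions_confirmed": len(shared) / len(predicted) if predicted else 0.0,
    }


# Toy example with hypothetical gene IDs.
summary = compare_gene_sets({"thrA", "thrB", "yaaX"}, {"thrA", "thrB", "thrC", "talB"})
print(summary["shared"], summary["reference_recovered"])

# The ratios behind the figures reported above, from the counts in the text.
print(f"{2262 / 4181:.1%} of reference genes recovered")
print(f"{2262 / 2644:.1%} of non-hypothetical predictions confirmed")
```

With the counts from the text, the two ratios work out to roughly 54.1% and 85.6%.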

Because of the ease of spinning up CloVR on Amazon EC2, running a pipeline, and getting the analysis in commonly accepted formats, we see a lot of potential for CloVR to become a widely used pipeline for bioinformatics analysis.  CloVR is also a good candidate to be made into a Torrent plugin that would allow anyone with an EC2 account to annotate assembly results and visualize the output directly inside the Torrent Browser.