Wherefore art thou mouse dbSNP VCF file?

Every once in a while I come across a problem that surprises me. Getting a VCF file for the mouse genome (mm9) is one of those problems. We use the GATK extensively internally, and it has standardized on the VCF format (rightfully so), so validating, annotating, and recalibrating variants all require a VCF file.

Up until recently, we had been focusing on human exome resequencing, but we now have several mouse exome projects ongoing here at EdgeBio. So we needed an mm9 dbSNP VCF file to integrate into our exome pipeline using the GATK. Easy, right? Not really. I assumed there was one available from the NCBI dbSNP FTP download site, since there is one for human. Nope. OK, well then, I can just download the txt files from NCBI and convert them to VCF, right? Nope. They don't have any genotype information. One could also download the associated XML files and try to munge the two files into a VCF. Maybe as a last resort. Maybe. I was familiar with VariantsToVCF from the GATK, a walker that converts several file types to VCF. It can convert BED to VCF, but any download from the UCSC Table Browser variant track didn't contain the genotypes either. Not very helpful.

I then came across vcfutils.pl, which is part of the samtools package. It can take the raw txt UCSC database dump and convert it to VCF. Worked like a charm. I assume this will work with any genome at UCSC. So, if you need a VCF file for mouse (mm9), you can:

wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/snp128.txt.gz
gunzip snp128.txt.gz
vcfutils.pl ucscsnp2vcf snp128.txt > snp128.vcf
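
Before feeding the result into a pipeline, it may be worth a quick sanity check that the conversion produced what you expect. The sketch below is one way to do that. It assumes only the standard VCF layout: header lines start with '#', and records are tab-delimited with the chromosome in column 1. The function name vcf_summary is made up for this example and is not part of samtools or the GATK.

```shell
#!/bin/sh
# vcf_summary: hypothetical helper (not part of samtools or the GATK).
# Prints the number of variant records in a VCF file, followed by the
# number of records per chromosome.
vcf_summary() {
    vcf="$1"
    # Count variant records: every line that is not a '#' header line.
    awk '!/^#/ { n++ } END { print "records:", n + 0 }' "$vcf"
    # Count records per chromosome (column 1), sorted by chromosome name.
    awk -F '\t' '!/^#/ { c[$1]++ } END { for (k in c) print k, c[k] }' "$vcf" | sort
}

# Usage, assuming snp128.vcf was produced by the commands above:
#   vcf_summary snp128.vcf
```

If the record count is zero, or chromosomes you expect are missing, something probably went wrong in the download or the conversion.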