VCF, or Variant Call Format, is a spec that came out of the 1000 Genomes Project for describing genotypes in samples. It comes straight out of the ‘make up a file format’ school of bioinformatics, and it is far from perfect.
For starters, it uses two different encodings in a single line. The INFO field uses the format ‘k1=v1;k2=v2’ and the call fields use ‘v1:v2’. In both, the type of a field is defined in the header, so you cannot tell whether a value is a string or a number by looking at it; you need to look at the header. JSON solves this problem by putting quotes round strings. But the opinion in the field seems to be that using a simple, well-defined encoding like JSON is too much effort.
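To make the two encodings concrete, here is a minimal Python sketch of what a parser has to do for each. The function names `parse_info` and `parse_call` and the `types` dictionaries are made up for illustration and are not PyVCF's API. The sketch assumes the types have already been read from the header, and it ignores multi-valued (comma-separated) and missing (‘.’) values.

```python
def parse_info(info, types):
    """Parse an INFO string such as 'DP=14;DB;AF=0.5'.

    `types` maps keys to Python types, as declared in the header
    (hypothetical structure, assumed to be built elsewhere).
    """
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            # A key with no '=' is a flag: present means True.
            parsed[key] = True
            continue
        # Without the header type we cannot know if '14' is text or a number.
        parsed[key] = types.get(key, str)(value)
    return parsed


def parse_call(format_keys, call, types):
    """Parse a call such as '0/1:14' using the keys from the FORMAT column."""
    keys = format_keys.split(":")
    values = call.split(":")
    return {k: types.get(k, str)(v) for k, v in zip(keys, values)}


info = parse_info("DP=14;DB;AF=0.5", {"DP": int, "AF": float})
call = parse_call("GT:DP", "0/1:14", {"DP": int})
```

Note that the call values carry no keys at all: the names come from a separate column, and the types from the header, so three places have to agree before a single number can be read.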
It also sucks to work with when you have any number of samples. If you have, say, more than three samples, then you cannot tell which call is which by eye, if the calls are even on the screen, since huge INFO fields push them over to the right. I much prefer long formats to wide formats when dealing with 100s of observed variables.
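As a rough illustration of what I mean by a long format, the sketch below turns one wide VCF record into one row per sample and field. The function name `wide_to_long` and the choice of output columns are my own for this example; it assumes tab-separated lines with the standard nine fixed columns before the samples.

```python
def wide_to_long(header_line, record_line):
    """Turn one wide VCF record into (chrom, pos, ref, alt, sample, key, value) rows."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = record_line.rstrip("\n").split("\t")

    # Columns 0-8 are CHROM..FORMAT; everything after that is one column per sample.
    samples = columns[9:]
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    format_keys = fields[8].split(":")

    rows = []
    for sample, call in zip(samples, fields[9:]):
        for key, value in zip(format_keys, call.split(":")):
            rows.append((chrom, pos, ref, alt, sample, key, value))
    return rows


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
record = "1\t100\t.\tA\tG\t50\tPASS\tDP=20\tGT:DP\t0/1:12\t1/1:8"
for row in wide_to_long(header, record):
    print("\t".join(str(x) for x in row))
```

Each row now names its sample and its field, so you can read it without counting columns, and it filters and sorts with ordinary tools.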
Finally, it doesn’t really require anything of the generators other than the crazy encoding rules. For example, allelic depths are reported completely differently by freebayes and GATK. So to compare the outputs of these two tools you can use the same parser, yes, but you still end up with tool-specific logic to compute the value of interest. Not much of a win.
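Here is a sketch of the kind of tool-specific logic I mean, computing the fraction of reads supporting the alternate allele. It assumes GATK puts reference and alternate depths in a single comma-separated AD field, while freebayes reports them in separate RO and AO fields; check the header of your own files, as this varies by tool version. The function name `alt_allele_fraction` and the `caller` argument are invented for the example.

```python
def alt_allele_fraction(call, caller):
    """Fraction of reads supporting any alternate allele.

    `call` is a dict of already-split, still-unparsed call fields.
    The field layouts below are assumptions about each tool's output.
    """
    if caller == "gatk":
        # Assumed layout: AD = 'ref_depth,alt1_depth[,alt2_depth...]'
        depths = [int(x) for x in call["AD"].split(",")]
        ref, alt = depths[0], sum(depths[1:])
    elif caller == "freebayes":
        # Assumed layout: RO = 'ref_count', AO = 'alt1_count[,alt2_count...]'
        ref = int(call["RO"])
        alt = sum(int(x) for x in call["AO"].split(","))
    else:
        raise ValueError("unknown caller: %s" % caller)

    total = ref + alt
    return alt / total if total else None


gatk_fraction = alt_allele_fraction({"AD": "10,5"}, "gatk")
freebayes_fraction = alt_allele_fraction({"RO": "10", "AO": "5"}, "freebayes")
```

The parser hands back the same dictionary shape for both tools, but the branch on `caller` is exactly the part the format does nothing to remove.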
For these reasons, I always swore I wouldn’t get involved with a VCF parser, as it might encourage adoption of this format. But I need to parse VCF data all the time, so I have eaten my hat and undertaken to maintain jdoughertyii’s PyVCF at my fork until someone willing and able comes along.
Anyway, Brad Chapman asked if it should go into biopython. This got me thinking, and I’m not sure I have made up my mind. The Bio.* projects should provide a nice home for lots of different code that can be maintained by a community. I developed some modules back in the day for biopython, and at least someone has patched them up since. Ideally, they should provide some focus for development, and support for novices in the packaging and testing process. On the other hand, I’m not sure that in the world of GitHub and pip we really need a single large Bio project, when it is so easy to install and maintain smaller, focused modules. Small modules let you see the activity and participation on a module much more quickly than a monolithic distribution does. You can also ship updates much more quickly. Installation is easier, and when running tests you don’t have to deal with 100s of irrelevant ones. Also, my PEP8 OCD objects to the package names in Biopython (you should ignore this reason).
I also think that simple pure Python modules like this one could be included in biopython using git submodules (git’s equivalent of externals), allowing the best of both worlds. Biopython packagers would then be assembling coherent components rather than dealing with an entire distribution.
So I’m still not sure. What should be done?
