casbon.me

James Casbon's Blog

    • 0
      18 Apr 2012

      Backbone.js handlers for tornado and mongodb

    • 0
      28 Mar 2012

      Notebook.js live demo

      I finally got around to posting a very alpha version of notebook.js.  It is very rough around the edges, but I'm putting it up just so it's there and I can show people what it is about.  It is hosted on heroku, which is easy to deploy but suffers huge response times if the slug is not running.

      I will write more soon on what it is about, but the main points are literate, browser based code notebooks.  Head over there and let me know what you think!

    • 0
      9 Mar 2012

      PyVCF Python Variant Call Format parser 0.4.0 release

      I just pushed the latest version of PyVCF to github.  Check out the documentation.

      There were a few internal changes in this release, but the most significant change is a backwards-incompatible move to return single values for some types of genotype data, rather than lists.

      If you'd like to help out, the next big feature for me is VCF 4.1 and structural variation support.  You can find the details on the issue list.

      This will also hopefully get taken up as part of Biopython's Google Summer of Code work; see the proposal on the wiki.

    • 11
      21 Feb 2012

      What will pypy do for your website? Benchmarking python web servers on pypy and cpython.

      There are currently quite a few different ways of developing a web application in python.  When you add in how you deploy the application as well, there are even more choices.  In terms of application frameworks, you have at least:

      • Django
      • twisted.web
      • flask
      • bottle
      • cyclone
      • tornado
      • pyramid

      Then these can be run using many different servers, including:

      • tornado
      • twisted
      • cyclone
      • wsgiref
      • rocket
      • cherrypy
      • gunicorn
      • fapws
      • google app engine
      • gevent

      And many more.  Typically, these take one of several approaches: asynchronous, either explicit (cyclone, tornado) or via monkey patching and an event loop (gevent); threaded, such as rocket; or written in C to use an event loop.  In addition to this, you now have several different pythons for deployment:

      • cpython
      • jython
      • pypy

      At some point, these servers are generally dealing with asynchronous event loops or using threading.  The two approaches to handling this are either to program in a normal style (gevent) or to explicitly use event based programming (e.g. cyclone).  The rise of javascript and node.js has seen event based programming become more mainstream.  I wanted to find out which of these many combinations would perform best, and in particular what effect using pypy as the interpreter would have on performance.

      The benchmark

      I created a fairly simple benchmark and implemented it across the different application styles.  The benchmark creates one route which renders 'Hello world', a click counter stored in redis that we increment, and finally a static 2,000 character string that we retrieve from redis.  I then ran ab against the application for 10,000 requests, in three replicates, at different levels of concurrency (4, 16, 64, 256 connections).  I recorded the requests per second and also the average, range and standard deviation of the total request time.
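
      To make the shape of the benchmark concrete, here is a minimal sketch of such a route as a bare WSGI application.  It is not the code from the repository: the FakeRedis class is a hypothetical in-memory stand-in for a real redis client, and the key names and the call_app helper are my own assumptions.

```python
from wsgiref.util import setup_testing_defaults


class FakeRedis:
    """In-memory stand-in for a redis client (assumption: the real
    benchmark talks to an actual redis server)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def incr(self, key):
        self._data[key] = int(self._data.get(key, 0)) + 1
        return self._data[key]


store = FakeRedis()
store.set("static", "x" * 2000)  # the static 2,000 character string


def app(environ, start_response):
    """Single route: greeting, incremented click counter, stored string."""
    clicks = store.incr("clicks")
    payload = store.get("static")
    body = "Hello world\n{}\n{}\n".format(clicks, payload).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app():
    """Invoke the app once without a network socket; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")
```

      Served with something like wsgiref.simple_server.make_server("", 8000, app), an app of this shape could then be driven with a command along the lines of `ab -n 10000 -c 64 http://localhost:8000/` (the port and URL are placeholders).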

      These were run on a linux box with the kernel 'Linux 2.6.41.4-1.fc15.x86_64 #1 SMP Tue Nov 29 11:53:48 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux' and 24 core 'Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz' (although only one core will be used for python) and 48GB of RAM.  The pythons used were 'Python 2.7.1' and 'Python 2.7.2, [PyPy 1.8.0 with GCC 4.4.6]'.

      You can run the benchmark by checking out the repository on github.  If you know anything about the application styles, I would encourage you to take a look at the implementations in the servers directory.

      The versions used are:

      • Flask==0.8
      • Paste==1.7.5.1
      • Rocket==1.2.4
      • Twisted==12.0.0
      • WebOb==1.2b3
      • Werkzeug==0.8.3
      • bottle==0.10.9
      • bottle-redis==0.1
      • cyclone==1.0-rc3
      • distribute==0.6.24
      • pyramid==1.3a8
      • redis==2.4.11
      • repoze.lru==0.4
      • tornado==2.2
      • gevent==1.0dev
      • greenlet==0.3.4
      • eventlet==0.9.16
      • CherryPy==3.2.2

      I patched bottle-redis to use a connection queue and bottle to silence the logging when using gevent.  Twisted needs C extensions disabled to install on pypy.

      Results

      JIT effect

      The first thing to notice is that pypy takes a little time to optimize the code.  This plot shows the requests per second for each test in iterations 1, 2 and 3.  Pypy is on the left (True) and cpython on the right (False).  The vertical facet is by the concurrency.

      [Figure: JIT effect]

      You can clearly see the effect of the JIT. Cpython, on the right, has stable performance across all iterations.  Pypy shows a marked improvement from the second iteration.

      Requests per second

      We now look at requests per second against concurrency across all combinations of application and server.  Each box contains pypy and cpython lines, where available.  These are from the third iteration, when the JIT compiler has done its work.

      [Figure: requests per second]

      Looking at this, you can see a few things.  Twisted, cherrypy and paste are not really stable under increasing concurrency, as shown by the lines sloping downwards to the right or by incomplete lines (server failed).  Gevent, tornado and cyclone are stable under load and show fairly equivalent performance under cpython.  Pypy introduces a 1.5-2x performance increase for almost all servers.

      Comparing the microframeworks, flask and bottle, bottle is always faster and really flies under tornado and gevent.  Pyramid does the best and serves over 5,000 requests per second with tornado and pypy.  Cyclone comes second under pypy with over 3,500 requests per second.  However, I found the lack of performance increase over tornado and bottle slightly disappointing, since cyclone is using an async redis connection.

      Response time

      Here we look at the average response time, its standard deviation, and the maximum response time.  A very similar picture for all technologies: under increasing concurrency the response times degrade.

      [Figure: response time]

      There is no clear winner between pypy and cpython.
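
      For reference, the summary statistics used in this section (average, standard deviation and maximum) can be computed from a list of per-request times as in the sketch below.  The sample values are made up for illustration and are not measurements from the benchmark; the function name is my own.

```python
import statistics


def summarise(times_ms):
    """Return mean, standard deviation, min and max of response times (ms)."""
    return {
        "mean": statistics.mean(times_ms),
        "stdev": statistics.stdev(times_ms),  # sample standard deviation
        "min": min(times_ms),
        "max": max(times_ms),
    }


# Made-up values for illustration only, not results from the benchmark.
example = [10.0, 12.0, 14.0, 16.0, 18.0]
summary = summarise(example)
```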

      Conclusions

      • Gevent and tornado have excellent WSGI servers that can serve 1,000s of requests per second.
      • Pypy can provide a 1.5-2x performance increase, and this is available with tornado and cyclone but not gevent
      • Explicit async code in cyclone did not provide a noticeable increase in performance over tornado, pyramid and bottle
      • bottle outperforms flask and really flies with gevent and tornado
      • pyramid is really fast with gevent and tornado
      • threading approaches don't seem to match the other approaches here
      • cherrypy seems to have problems with higher concurrencies

      Overall, I think this shows that if you are really interested in performance you should take a good look at pypy.  I have a lot of respect for the tornado code now, and would seriously consider it for future projects.  Bottle is a very good microframework that outperforms flask.

      Changes

      23/2/12 Added cherrypy and eventlet, created common wsgi server code.  Couldn't get eventlet to run stably under pypy (socket and RPy errors).

    • 2
      16 Feb 2012

      A comparison of openelec.tv, yaVDR, MythTV and a Sony Bravia

      I recently had all my electronic goods stolen. One of the pieces that went was my MythTV box, a little mini PC that recorded TV and played youtube, etc. Fortunately, I was insured so I received a new TV and an empty box. Things have changed a lot since I first started using MythTV over nine years ago, so I thought it would be a good opportunity to try out some of the other contenders for homebrew PVRs. In my list were:

      • openelec.tv, an extremely minimal XBMC frontend with a TVHeadend backend
      • yaVDR: an ubuntu based VDR setup with VDR or XBMC frontends
      • Sony Bravia TV. You could plug an HDD into the replacement TV I got, and it also supported internet video from lovefilm, etc

      I will first run down my experience with each…

      Sony Bravia TV

      The TV has the option to use an external HDD to record via USB. It also supports DLNA (more later) and has some internet TV apps (BBC iplayer, LoveFilm). Upon unboxing, I was surprised to receive a copy of the GPL. It seems some Bravias are running Linux.

      The setup with the HDD was pure Sony crap. The HDD needed to be initialized to work with the TV. This initialization was designed to make the disk unreadable anywhere else and to ensure nothing recorded elsewhere could be played back via USB (despite the fact it supports DLNA to access network shares).

      Setting up a recording involved navigating to the show in the guide and hitting record, or setting a manual timer. There are no search options or series records. No management of expiry on the disk. No watching of things being broadcast on the same multiplex. All in all, this resembled an 80’s video recorder for all the features it had: basically unusable.

      The internet videos are OK. Iplayer and lovefilm work well, but the interfaces are pretty slow. There is a scroll effect in iplayer that takes ages. But the interesting thing, for me, is that here we have LoveFilm working on a linux box despite the fact that it now requires Silverlight. The only reasonable conclusions are that Sony has a Silverlight implementation on linux or that they have negotiated a position to allow non silverlight streaming to their device. Vertical market/DRM integration, FTW!

      openelec.tv

      This is a super minimal install of XBMC’s pvr branch backed by TVHeadend. Installation was easy, boots are super fast. TVHeadend is configured via its web interface and is easy to set up. After install, the box sits with SAMBA and other file sharing set up, so you can copy content straight on or off the box.

      TVHeadend, although easy to set up, is still lacking a lot in terms of the finer points of scheduling. Its duplicate detection just doesn’t really work, especially for channels with many repeats and +1 versions. If they fix this, it will be a killer set up. But as it was, I just couldn’t live with it. It needed too much management to do a series record.

      It is also worth noting that you cannot change anything about the openelec install, even the ssh password. Doing so requires that you recompile the image, which you might find a pain. This means you can’t use many things that you would be used to on a linux install (gcc, perl, python, wget, rsync, etc). However, that is what keeps it lightweight and robust.

      yaVDR

      yaVDR is an ubuntu setup with VDR installed. It claims to be minimal, but actually installs a whole bunch of crap. VDR is the other big linux PVR project besides MythTV, but it has typically helped if you speak German when you want to read the documentation.

      The yaVDR install is pretty much the normal ubuntu one, and startup is similar to a normal ubuntu box. yaVDR has a custom web front end for configuring yaVDR and you can drop into the VDR frontend to control the PVR. I found that the XBMC-pvr branch they were using was too unreliable to use and so I was using the VDR frontend. This was a little tricky as it is designed around a remote (which I didn’t have).

      My experience with VDR was almost as frustrating as with TVHeadend. You can have automated recordings, but for channels with +1s and repeats and different episodes it never seemed to get things correct. VDR is based around plugins so it may have been that I could have found something to rectify that, but I never could get a satisfactory set up.

      Aside: XBMC-pvr, uPNP, DLNA

      While I was trying these setups, I was also testing out whether they could stream direct to the Bravia TV, which supports DLNA. This is a subset of uPNP with certain codecs. TVHeadend did not support DLNA and neither did VDR. There was a plugin for VDR that supports uPNP, but I could not get it to stream DVB broadcasts to the Sony TV.

      I had higher hopes for XBMC. It has a reasonable uPNP server, but you soon find the problems, including unsorted albums (http://forum.xbmc.org/showthread.php?t=105141). More troublesome is the fact that the TV shows in XBMC are not served via uPNP at all.

      Back to Myth

      After all this, I installed a linux distribution and put Myth on top. The Myth setup was pretty painless (channel scan even worked), but partly this is because I know all the problems too well.

      The worst thing about Myth is that the frontend is pretty awful. I did try to go the XBMC-pvr route, but the integration is very alpha quality. However, I was pleased to find that this didn’t matter. The uPNP server from myth is awesome, everything is listed in various categories (date, title, etc) and playback is fine.

      Add in MythWeb and I think you have a perfect PVR. MythWeb is good for searching and scheduling, and the flash based streaming is great.

      Conclusions

      So TL;DR: MythTV is awesome with a DLNA television. Openelec.tv will be awesome when TVHeadend’s scheduling is better.

      Also: there’s nothing on TV so save your money and buy a book.

    • 0
      15 Feb 2012

      Test driven sysadmin with lettuce and fabric

      I have been thinking about test driven sysadmin recently. We wouldn’t want to write code without tests, so we also shouldn’t want to create untestable pieces of infrastructure.

      I think cucumber (or gherkin) could be a nice way of expressing policy:

      Now, the idea is that the scenarios are skipped when the clauses in the ‘given’ part are not met. In this case I have defined two scenarios: a mail server and a production server to relay via a mail host. When I run this against a host, the given clauses work out which policies to test. I needed to teach lettuce to have skippable scenarios. You can find a rough and ready implementation of this on my branch of lettuce. Now all that is needed is to implement the steps to check those features (which is at the bottom of this post).
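
      The skipping behaviour can be sketched in plain python, independently of lettuce.  This is not the lettuce API or the implementation on my branch: the scenario structure, the HOST_FACTS dictionary, the host name mail.example.org and the function names are all hypothetical, and in practice the facts would be gathered by running shell commands on the remote host.

```python
# Hypothetical facts about the host under test; in practice these might be
# gathered by running shell commands on the remote machine.
HOST_FACTS = {
    "role": "production",
    "smtp_port_open": False,
    "relayhost": "mail.example.org",
}

# Each scenario has 'given' preconditions and 'then' checks.
# Both are predicates over the host facts.
SCENARIOS = [
    {
        "name": "Mail server accepts SMTP",
        "given": [lambda f: f["role"] == "mail"],
        "then": [lambda f: f["smtp_port_open"]],
    },
    {
        "name": "Production server relays via the mail host",
        "given": [lambda f: f["role"] == "production"],
        "then": [lambda f: f["relayhost"] == "mail.example.org"],
    },
]


def run_scenario(scenario, facts):
    """Return 'skipped' if any given clause is unmet, else 'passed'/'failed'."""
    if not all(given(facts) for given in scenario["given"]):
        return "skipped"
    if all(check(facts) for check in scenario["then"]):
        return "passed"
    return "failed"


def run_all(scenarios, facts):
    """Map each scenario name to its outcome for one host."""
    return {s["name"]: run_scenario(s, facts) for s in scenarios}


results = run_all(SCENARIOS, HOST_FACTS)
```

      With these example facts the mail server scenario is skipped, because the host is not a mail server, and only the relay policy is actually tested.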

      The nice part about this is that I can write a fair amount of logic in python to test the state of the system, but the steps only call out to simple shell commands via fabric, i.e. I don’t need to install python onto the system I wish to test. Changes in policy can be tested site wide by testing a feature against all hosts. I can also test the external view of the test machine from my box by, say, accessing a port or using an http get.

      This is just the sketch of an idea: I need to write a command line tool to handle the setup of fabric hosts, and to allow multiple hosts to be tested at once. I would also like to add a fix mode, whereby failing steps could call out to the code that corrects a failing policy. Do you think this is a good way to test a policy?

    • 1
      16 Jan 2012

      Seems I'm maintaining a VCF parser; should it go into biopython?

      VCF, or Variant Call Format, is a spec that came out of the 1,000 genomes project for describing genotypes in samples. It comes straight out of the ‘make up a file format’ school of bioinformatics and it is far from perfect.

      For starters, it has two different encodings in a single line. The info field uses the format ‘k1=v1;k2=v2’ and the call fields use ‘v1:v2’. In these fields, the header defines the type of a field, so you cannot tell whether a field is a string or a number by looking at it; you need to look at the header. This problem is solved in JSON by using quotes round strings. But the opinion in the field is that using a well defined simple encoding like JSON is too much effort.
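
      To illustrate the two encodings, here is a minimal sketch of parsing both with types taken from a lookup table.  It is not PyVCF's code: the type tables are hypothetical stand-ins for what a parser would build from the header lines, the function names are my own, and missing values and multi-valued fields are ignored.

```python
# Hypothetical type tables; a real parser would build these from the
# ##INFO and ##FORMAT lines in the file header.
INFO_TYPES = {"DP": int, "AF": float}
FORMAT_TYPES = {"GT": str, "GQ": int, "DP": int}


def parse_info(field, types):
    """Parse a 'k1=v1;k2=v2' INFO string using header-declared types."""
    result = {}
    for entry in field.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            result[key] = True  # a bare key is a flag
        else:
            result[key] = types.get(key, str)(value)
    return result


def parse_call(format_field, call_field, types):
    """Pair the 'GT:GQ:DP' keys with one sample's 'v1:v2:v3' values."""
    keys = format_field.split(":")
    values = call_field.split(":")
    return {k: types.get(k, str)(v) for k, v in zip(keys, values)}


info = parse_info("DP=14;AF=0.5;DB", INFO_TYPES)
call = parse_call("GT:GQ:DP", "0|1:48:8", FORMAT_TYPES)
```

      Note that nothing in the text "14" or "48" says it is a number; without the type tables every value would have to stay a string.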

      It also sucks to work with when you have any number of samples. If you have, say, more than three samples then you cannot tell which call is which by eye, if the calls are even on the screen since huge INFO fields push them over to the right. I much prefer long formats to wide formats when dealing with 100s of observed variables.

      Finally, it doesn’t really require anything of the generators other than the crazy encoding rules. For example, allelic depths are reported completely differently in freebayes and GATK. So to compare outputs from these two tools, yes you can use the same parser, but then you end up with tool specific logic to compute the value of interest. Not much of a win.

      For these reasons, I always swore I wouldn’t get involved with a VCF parser as it might encourage adoption of this format. So I have eaten my hat and undertaken to maintain jdoughertyii’s PyVCF at my fork until someone willing and able comes along, since I need to parse VCF data all the time.

      Anyway, Brad Chapman asked if it should go into biopython. This got me thinking. I’m not sure I have made up my mind. The Bio.* projects should provide a nice home for lots of different code that can be maintained by a community. I developed some modules back in the day for biopython, and at least someone has patched them up since. Ideally, they should provide some focus for the development and support for novices in the packaging and testing process. On the other hand, I’m not sure that in the world of github and pip we really need a single large Bio project, when it is so easy to install and maintain smaller focused modules. This allows you to see the activity and participation on a module much quicker than with a monolithic distribution. You can also ship updates much quicker. Installation is easier, and when running tests you don’t have to deal with 100s of irrelevant tests. Also, my PEP8 OCD objects to the package names in Biopython (you should ignore this reason).

      I also think that for simple pure python modules like this one, it could be possible to include them into biopython using git externals, allowing the best of both worlds. Biopython packagers would then be assembling coherent components rather than dealing with an entire distribution.

      So I’m still not sure. What should be done?

    • 1
      9 Jan 2012

      HTML templating in python using the with statement

      A few weeks back, I hacked up a simple templating language using python’s ‘with’ statement:

      It was inspired by coffeekup and haml, since I find it much easier to code html in those DSLs than with raw HTML (presumably, I need a better editor to help match tags and indent). HTML I write these days tends to be as minimal as possible, and indent based languages help me see the structure much better, which is a product of too long spent with python. See the gist below for the full description.
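
      As a rough illustration of the general idea (this is a fresh sketch, not the code from the gist), nested ‘with’ blocks can open a tag on entry and close it on exit, so the python indentation mirrors the HTML structure.  The Page class and its method names are my own invention.

```python
from contextlib import contextmanager
from html import escape


class Page:
    """Collects indented HTML lines; nesting follows the 'with' blocks."""

    def __init__(self):
        self.lines = []
        self.depth = 0

    def _emit(self, text):
        self.lines.append("  " * self.depth + text)

    @contextmanager
    def tag(self, name, **attrs):
        """Emit <name ...> on entry and </name> on exit."""
        attr_text = "".join(
            ' {}="{}"'.format(key, escape(str(value), quote=True))
            for key, value in sorted(attrs.items())
        )
        self._emit("<{}{}>".format(name, attr_text))
        self.depth += 1
        try:
            yield self
        finally:
            self.depth -= 1
            self._emit("</{}>".format(name))

    def text(self, content):
        """Emit escaped text at the current nesting depth."""
        self._emit(escape(content))

    def render(self):
        return "\n".join(self.lines)


page = Page()
with page.tag("html"):
    with page.tag("body"):
        with page.tag("p", id="intro"):
            page.text("Hello & welcome")
html_out = page.render()
```

      Closing tags can never be forgotten or mismatched here, because the context manager emits them when the block ends.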

      I posted this to Hacker News and it got into the top ten briefly, with a few nice comments, a few why bothers and a lot of prior art. In particular, jperla’s weby was very similar with a more involved syntax. However, getting to the top ten shows how easy it is to have 15 seconds of fame these days. Next time, to number one.

      Anyway, here is the gist:

    • 0
      9 Jan 2012

      New blog

      I’ve been in internet lurker mode for far too long since machine-envy (my last blog) died. Partly that was because I got bored of maintaining wordpress, so I’m hoping this posterous based blog will be maintenance free. I’ve already found a way of posting from vim and including gists is easy. The lack of any javascript is a bit tricky – you cannot include a tweet widget, for example – but that’s presumably why it is maintenance free.

      Unfortunately, since machine envy is gone, you can’t enjoy such gems as ‘Whether Monty Hall applies to Deal or No Deal’. However, there might be more classics in the pipeline, so stick around.
