Density Estimation Blowout!
Over the last two weeks I have added to my Google code repository (accessible from the menu) not one, not two, but three modules dedicated to density estimation. And not just any old density estimation, but Gaussian mixture models! Ok, that last bit is really a disadvantage, as I would only ever choose one of the modules, except when computational limitations force me to do otherwise, but it still means that this code base has a pretty decent selection. Anyway, the three modules are:

kde_inc: The incremental kernel density estimate method of Sillito & Fisher, which uses Gaussian kernels. You do have to provide a kernel size, which is a problem as its main feature is being incremental, and the kernel size should change as the amount of data does, but then this is the most basic method. To help with estimating the kernel size a leave-one-out method is provided. Honestly, I'm not sure what use this is - as an algorithm it only makes sense when speed really matters, but as this implementation is in pure python it's rather slow, and hence somewhat useless. Good for prototyping I guess, or as a reference for implementing something using a lower level language.
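Since it is pure python anyway, here is a minimal NumPy sketch of the same idea: a Gaussian kernel density estimate you add samples to one at a time, plus a leave-one-out log likelihood for choosing the kernel size. The class and function names are mine, it skips the bookkeeping that keeps the real method cheap, and it is not the kde_inc interface - just an illustration.

```python
import numpy as np

class IncrementalKDE:
    """Toy incremental KDE with isotropic Gaussian kernels - illustrative only."""
    def __init__(self, dims, sd):
        self.sd = sd                          # kernel standard deviation, chosen by the user
        self.samples = np.empty((0, dims))

    def add(self, x):
        # Incremental update: just store the new sample.
        self.samples = np.vstack([self.samples, np.asarray(x)[None, :]])

    def prob(self, x):
        # Average of Gaussian kernels centred on the stored samples.
        dims = self.samples.shape[1]
        d2 = ((self.samples - np.asarray(x)[None, :])**2).sum(axis=1)
        norm = (2.0 * np.pi * self.sd**2) ** (-0.5 * dims)
        return norm * np.exp(-0.5 * d2 / self.sd**2).mean()

def loo_log_likelihood(data, sd):
    """Leave-one-out log likelihood for a given kernel size -
    maximise this over sd to pick the bandwidth."""
    n, dims = data.shape
    d2 = ((data[:, None, :] - data[None, :, :])**2).sum(axis=2)
    k = (2.0 * np.pi * sd**2) ** (-0.5 * dims) * np.exp(-0.5 * d2 / sd**2)
    np.fill_diagonal(k, 0.0)                  # leave each point out of its own estimate
    return np.log(k.sum(axis=1) / (n - 1)).sum()

data = np.random.randn(200, 2)
best_sd = max((0.1, 0.2, 0.5, 1.0), key=lambda s: loo_log_likelihood(data, s))
kde = IncrementalKDE(2, best_sd)
for x in data:
    kde.add(x)
print(best_sd, kde.prob(np.zeros(2)))
```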

gmm: A basic Gaussian mixture model using the EM method, with k-means for initialisation. Nothing special, but a good all-rounder I guess. It uses the Bayesian information criterion (BIC) to do cluster count selection. It depends on scipy.weave, so you will need that working to use it.
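To show what BIC-based cluster count selection looks like in principle, here is a tiny 1-D EM fit with BIC used to pick the number of components. It is a sketch of the general technique, not the gmm module's interface, and it uses a crude random initialisation where the real module uses k-means.

```python
import numpy as np

def em_gmm_1d(x, k, iters=100, rng=np.random.default_rng(0)):
    """Minimal 1-D Gaussian mixture fitted with EM - illustrative only."""
    n = x.shape[0]
    mean = rng.choice(x, size=k, replace=False)   # crude init (the real module uses k-means)
    var = np.full(k, x.var())
    weight = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = weight * np.exp(-0.5 * (x[:, None] - mean)**2 / var) / np.sqrt(2.0 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + 1e-9
        weight = nk / n
        mean = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mean)**2).sum(axis=0) / nk + 1e-6
    dens = weight * np.exp(-0.5 * (x[:, None] - mean)**2 / var) / np.sqrt(2.0 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

def bic(log_lik, k, n):
    free_params = 3 * k - 1         # k means, k variances, k-1 independent weights
    return free_params * np.log(n) - 2.0 * log_lik

x = np.concatenate([np.random.randn(300) - 3.0, np.random.randn(300) + 3.0])
scores = {k: bic(em_gmm_1d(x, k), k, x.shape[0]) for k in range(1, 6)}
print(min(scores, key=scores.get))  # lowest BIC wins - usually 2 components here
```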

dpgmm: A Dirichlet process Gaussian mixture model, implemented using the mean field variational technique. Theoretically speaking this is just plain awesome, and is the best general purpose density estimation technique I know. Its flaw is that it is very demanding, in terms of both computation and memory, but if you have them, go for it. Used in the right way you can basically ignore all parameters, though I have it set up by default to be somewhat conservative with the number of clusters, and to produce a very accurate answer. The code is pure python, and has lots of vectorisation going on - it is not much slower than a low level implementation would be, and the gap only gets smaller as the problem grows.
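For some intuition on why the parameters can mostly be ignored: the Dirichlet process puts a prior on the mixture weights via stick breaking, so the effective number of clusters is inferred from the data rather than fixed in advance, and the mean field variational approach works with a truncated version of that representation. Below is a small sketch of a truncated stick-breaking draw, purely to illustrate the prior and the role of the concentration parameter; it says nothing about the actual dpgmm interface.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng=np.random.default_rng(0)):
    """Draw mixture weights from a truncated stick-breaking representation
    of a Dirichlet process - illustrative only."""
    betas = rng.beta(1.0, alpha, size=truncation)   # proportion of the remaining stick taken each step
    betas[-1] = 1.0                                 # truncation: last piece takes all remaining mass
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining                        # weights sum to 1

# Small alpha concentrates mass on a few sticks (few clusters);
# large alpha spreads it over many.
for alpha in (1.0, 10.0):
    w = stick_breaking_weights(alpha, truncation=20)
    print(alpha, np.round(np.sort(w)[::-1][:5], 3))
```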


There is probably going to be a gap in updates now. I definitely have to upload the code for my BMVC paper at the end of August, and there is also the (rather horrific) code for an ICCV paper - that should be early November. Other than that, nothing comes to mind for uploading, except another topic model and code related to my current research. But I'm sure I will think of something - I code somewhat obsessively (I blame this on having unreliable friends who don't take me to the pub often enough, but I digress), and if I choose to expand beyond general purpose machine learning algorithms I am sure I could go nuts.