Random Forest and Gaussian Belief Propagation
Last weekend I pushed two more modules to my Helit GitHub. Both are part of the 'My Text in Your Handwriting' project. Like the mean shift module, they are reasonably generic and can be used to solve many tasks:


Fast Random Forest (frf):

This is the result of my frustration with using the scikit-learn random forest implementation. Whilst the actual random forest is perfectly acceptable, the file I/O is a joke (it's dependent on Python's pickle, which is not appropriate for loading/saving large numpy arrays). It had reached the point where I was spending more time loading the random forest from disk than actually using it, and this was in a GUI, so I could not ignore it. Hence I wrote my third (!) random forest implementation, as the other two did not have the necessary features.

To be clear, the use of 'fast' in the name is a reference to the file I/O more than the actual training/testing time. Whilst training/testing is fast, the code is also very generic, supporting many different scenarios, so there are certainly faster implementations out there. It's actually about the same speed as scikit-learn for training, though considerably more flexible. It's designed so that, with the exception of an index built after loading, the file layout and in-memory layout are identical - loading is therefore seriously fast: two reads for the header, then two reads per tree. It takes two reads because the first gets the size of the object and the second gets the remaining data. Rebuilding the index takes almost no time as well. The file I/O is also exposed such that sending models over the network/between processes is trivial, so you can distribute both training and testing if you want.
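
To make the two-read pattern concrete, here is a minimal Python sketch of size-prefixed loading. The 8-byte size prefix and the function names are illustrative assumptions, not frf's actual on-disk format:

    import struct

    def read_block(f):
        # Read 1: a fixed-size prefix holding the total object size.
        # (An 8-byte little-endian size is assumed for illustration;
        # frf's real header layout will differ.)
        (size,) = struct.unpack('<Q', f.read(8))
        # Read 2: the rest of the object in a single call. Because the
        # file layout matches the in-memory layout this buffer can be
        # used directly - only the index needs rebuilding afterwards.
        return f.read(size - 8)

    def load_forest(path, num_trees):
        # In the real format the tree count would come from the header;
        # it is a parameter here to keep the sketch short.
        with open(path, 'rb') as f:
            header = read_block(f)  # two reads for the header
            trees = [read_block(f) for _ in range(num_trees)]  # two per tree
        return header, trees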

The feature set is currently fairly standard, but it supports multiple output features as well as the usual multiple input features, and any feature can be discrete or continuous. For the output that is the difference between a classification tree and a regression tree, though you can also have mixed mode with multiple outputs. The code is very modular and all in C - being my third go at this, it's disturbingly neat ;-)
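
To illustrate what mixed mode means (this is a conceptual sketch, not the frf API): with mixed outputs a leaf has to summarise each output column differently - a class distribution for a discrete output, a mean and variance for a continuous one:

    import numpy as np

    def leaf_summary(y_discrete, y_continuous):
        # Discrete output -> normalised histogram over the observed
        # classes, as in a classification tree.
        classes, counts = np.unique(y_discrete, return_counts=True)
        dist = dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))
        # Continuous output -> mean and variance, as in a regression tree.
        return dist, (y_continuous.mean(), y_continuous.var())

    # Exemplars that reached one leaf, with one discrete and one
    # continuous output feature each.
    y_d = np.array([0, 0, 1, 0])
    y_c = np.array([2.0, 2.2, 5.0, 1.8])
    print(leaf_summary(y_d, y_c))  # ({0: 0.75, 1: 0.25}, (2.75, 1.7075))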

To figure out how to use it I would focus on the many tests provided, even if some of them are quite silly. The minion type classifier is my favourite, but none of them can be accused of sanity! It's about 8000 additions according to git - I feel sorry for my fingers.


Gaussian Belief Propagation:

I originally used GBP for my paper 'Integrating Stereo with Shape-from-Shading derived Orientation Information', many moons ago, but reimplemented it with a Python interface for the 'My Text in Your Handwriting' paper. Unlike the previous version, it allows for an arbitrary graph - you can use a chain to get Kalman smoothing, or a 2D grid to uncurl a normal map (which is an included demo), for instance. It can also be used to solve sparse linear equations - a demo is included. About 4000 additions according to git - this brings the total for the handwriting project up to 20K, and I have yet to publish anything handwriting-specific!
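
For the linear equations case the algorithm is compact enough to sketch in full. The following is a minimal dense GBP solver built from the standard message equations - it is not the module's interface, and it assumes A is symmetric and diagonally dominant so that convergence is guaranteed:

    import numpy as np

    def gabp_solve(A, b, iters=100):
        # Solve A x = b with Gaussian belief propagation. Each variable
        # is a node; each non-zero off-diagonal entry of A is an edge.
        n = len(b)
        P = np.zeros((n, n))   # P[i, j]: precision of the message i -> j
        mu = np.zeros((n, n))  # mu[i, j]: mean of the message i -> j
        for _ in range(iters):
            for i in range(n):
                for j in range(n):
                    if i == j or A[i, j] == 0.0:
                        continue
                    # Combine node i's own potential with every incoming
                    # message except the one that came from j.
                    p = A[i, i] + P[:, i].sum() - P[j, i]
                    m = b[i] + P[:, i] @ mu[:, i] - P[j, i] * mu[j, i]
                    P[i, j] = -A[i, j] ** 2 / p
                    mu[i, j] = m / A[i, j]
        # The marginal mean of each node is the estimate of x.
        prec = A.diagonal() + P.sum(axis=0)
        return (b + (P * mu).sum(axis=0)) / prec

    # A chain (tridiagonal A) - the same graph structure as the Kalman
    # smoothing case mentioned above:
    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 4.0, 1.0],
                  [0.0, 1.0, 4.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(gabp_solve(A, b))  # matches np.linalg.solve(A, b)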

Whilst coding it I added support for TRW-S, in addition to the usual BP-with-momentum, as an experiment to see what kind of difference it made. GBP is expected to converge to the global solution (if it exists), so you don't expect improved results, but it certainly converges faster. I also tried using it to solve linear equations - it can solve matrices that don't converge with normal BP, which surprised me. This result has only been confirmed experimentally however, and I haven't figured out what the mathematical basis for it might be.
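
For reference, 'momentum' here just means damping: each freshly computed message is blended with its previous value before being stored. A generic sketch (the parameter name is mine, not the module's):

    def damped(new, old, momentum=0.5):
        # momentum = 0.0 recovers plain BP; higher values slow the
        # updates down, which can stabilise oscillating messages.
        return (1.0 - momentum) * new + momentum * old

    # Applied to the solver sketch above, both halves of a message get
    # damped:
    #   P[i, j]  = damped(-A[i, j] ** 2 / p, P[i, j])
    #   mu[i, j] = damped(m / A[i, j], mu[i, j])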