Which programming language for my Disaggregation system? Matlab versus Python; Graphical Models.
Early conclusions regarding implementing PGMs
I think I should go ahead and prototype the backend of my system in Python, even though there aren't many PGM frameworks for Python. Python does have good support for general graphical models. I can always jump over to C++(11) for the PGM stuff. Plus I'm not sure yet that I will be using "textbook" PGM approaches; I might build my own algorithms based on "normal" graphical models. Also of interest, the "Why Matlab" wiki page for the Probabilistic Modeling Toolkit states "In the future we may port to python. This seems to be growing in popularity within the machine learning community."
Also useful: the Thesaurus of Mathematical Languages, which has guides for translating between Matlab, NumPy and R.
OK. I've gone back to prototyping in Matlab. The programming assignments for the Stanford Probabilistic Graphical Models course are done in Matlab (or Octave) and I'm slowly learning to appreciate Matlab. There are some language features which I still find very frustrating (like not having much control over whether objects are passed by reference or value into functions; and no proper list data structure) but it is fast to develop in.
For a bunch of reasons, I'm seriously thinking of moving away from MATLAB to Python + NumPy + matplotlib. Some pros and cons of the main languages I'm considering are below:
Portability
- Python should work "out of the box" on Windows, Linux or Mac
- C++ needs, at the very least, to be re-compiled
- MATLAB should be pretty portable
Speed
- C++ is almost always the fastest, not least because it allows access to SIMD and GPU instructions
- Python can be fast if used with PyPy, NumPy etc (and, of course, can use C++ code)
- MATLAB is fast for a few things but is remarkably slow for loops, recursion, OOP and a bunch of other things
Tinkering with data during development
- This is probably where MATLAB excels, although its graphing system is far from perfect
- Python with matplotlib should allow for easy tinkering
- C++ I'd have to dump data out to gnuplot (as I did for my MSc project) or something similar
Maintaining a large code base
- C++ and Python are both great for building large apps
- MATLAB isn't so good (no namespaces; the editor lacks many features found in Eclipse, like refactoring aids; the editor lacks support for git; writing unit tests for MATLAB code isn't as easy as it should be)
Accessibility for other developers
- If I want hobbyists to be able to use and/or modify the system then MATLAB is not an option
- Python is probably better understood by the "hobbyist" community than C++
- But MATLAB seems to be in vogue for academic NILM research
Building a GUI
- This is probably an area where Python excels. (Yes, you can build GUIs in MATLAB and C++ but it's probably least painful in Python)
- If I did go with a pure C++ implementation then maybe wxWidgets in combination with gpPanel would be a good strategy
See "stackoverflow: Python Graph Library". graph-tool gets some love; it is an efficient Python module based heavily on the Boost Graph Library, hence might be a good bet given my experience with the BGL for my MSc project. There are others which may be more focussed on probabilistic graphical models, like gPy. NetworkX also looks very attractive and was praised at PyCon2012.
Also, this "Could anybody recommend a graphical model implementation in Python" thread is very useful. Amongst other things, it lists PyMC which "is a python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness-of-fit and convergence diagnostics."
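To give a feel for what "Markov chain Monte Carlo" means in that description, here is a minimal random-walk Metropolis sketch in plain Python. It is not PyMC's API and makes no claim about how PyMC works internally. The data (7 "on" readings out of 10), the uniform prior, the step size and the function names are all assumptions made up for illustration.

```python
import math
import random


def log_posterior(theta, heads, flips):
    """Unnormalised log posterior of a probability theta under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # outside the support, so never accepted
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(heads, flips, n_samples=20000, step=0.2, seed=0):
    """Random-walk Metropolis sampler (hypothetical helper, not a library call)."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, flips)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, flips)
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


# Made-up data: an appliance was seen "on" in 7 of 10 readings.
samples = metropolis(heads=7, flips=10)
kept = samples[1000:]  # drop an (assumed) burn-in period
posterior_mean = sum(kept) / len(kept)
print("posterior mean of theta: %.3f" % posterior_mean)
```

With a uniform prior the exact posterior here is Beta(8, 4), whose mean is 8/12, so the printed estimate should land close to 0.67. A library like PyMC adds the model-building, diagnostics and plotting around this kind of loop.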
Also see this blog post "Bayes net by example using Python and Khan Academy Data".
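For a sense of how little code a toy Bayes net needs in Python, here is a rough sketch of exact inference by enumeration. It is not taken from that blog post; the network (two appliances influencing one "high power reading" node) and every probability in it are hypothetical numbers chosen purely for illustration.

```python
from itertools import product

# Hypothetical priors: probability that each appliance is on.
P_KETTLE = {True: 0.1, False: 0.9}
P_TOASTER = {True: 0.2, False: 0.8}

# Hypothetical conditional table: P(high reading | kettle, toaster).
P_HIGH = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.60,
    (False, False): 0.01,
}


def joint(kettle, toaster, high):
    """Joint probability of one full assignment, using the net's factorisation."""
    p_high = P_HIGH[(kettle, toaster)]
    likelihood = p_high if high else 1.0 - p_high
    return P_KETTLE[kettle] * P_TOASTER[toaster] * likelihood


def posterior_kettle_on(high=True):
    """P(kettle on | reading), summing the joint over the hidden toaster state."""
    numerator = sum(joint(True, t, high) for t in (True, False))
    evidence = sum(joint(k, t, high)
                   for k, t in product((True, False), repeat=2))
    return numerator / evidence


print("P(kettle on | high reading) = %.3f" % posterior_kettle_on(True))
```

With these made-up numbers the posterior comes out at about 0.44, up from a prior of 0.1. Enumeration like this blows up exponentially with the number of variables, which is exactly why the proper PGM frameworks exist.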
Probabilistic Graphical Modelling frameworks in C++, Java, R and Matlab
There's an extensive list of PGM frameworks here, written by Kevin P Murphy. He's one of the major committers to the Matlab Bayes Net Toolbox and the Probabilistic Modelling Toolkit, amongst many other things; he is also co-teaching the Stanford Probabilistic Graphical Models course in Spring 2012; and he has a very interesting-looking book called "Machine Learning: a Probabilistic Perspective" coming out in August 2012. The breakdown by language:
- 13 in Java
- 10 in C++
- 5 in Matlab
- 3 in R
- 1 in Python
Of some interest: the FastInf C++ library was developed in the labs of Prof Friedman and Prof Daphne Koller (who together wrote the "Probabilistic Graphical Models" book I'm reading; and Koller is giving a free online course on Probabilistic Graphical Models from Stanford).
My current plan is starting to look something like this:
- prototype in Python with NumPy, matplotlib and wxPython
- re-write slow bits of code in C++ (see this StackOverflow thread discussing tools for writing C++ APIs for Python; looks like I should probably start with manual wrapping)
- I'm really having to fight the urge to write my whole thing in C++.
I've just done a very quick benchmark on a bit of code to identify "steady states" (as defined by Hart 1992). The test was run on a dataset with 139,000 samples. The benchmark included loading the file from disk. The results (in seconds):
- MATLAB 2012a = 0.143
- Python 2.7.3 = 0.140
- PyPy = 0.119
- C++ (g++ 4.6.3, no optimisation) = 0.040
- C++ (g++ 4.6.3, -O3) = 0.030
So MATLAB and Python are neck-and-neck at roughly 3.5 times slower than unoptimised C++ (and about 4.7 times slower than the -O3 build). I'm quite surprised the difference wasn't much larger given that this test runs a for loop, and MATLAB is notoriously slow at for loops.
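For reference, a loop-based steady-state detector of the kind being timed might look roughly like the sketch below. This is not the benchmarked code. It loosely follows the idea of a steady state as a run of samples that stays within a tolerance for a minimum number of samples; the 15 W tolerance, the three-sample minimum, the function name and the synthetic square-wave signal are all assumptions for illustration. Unlike the benchmark above, it does not load anything from disk.

```python
import time


def find_steady_states(samples, tolerance=15.0, min_length=3):
    """Return (start_index, end_index, mean_power) for each steady run.

    A run stays "steady" while every sample is within `tolerance` of the
    run's first sample; runs shorter than `min_length` are discarded.
    """
    states = []
    start = 0
    for i in range(1, len(samples) + 1):
        # Close the run at the end of the data or when a sample jumps away.
        if i == len(samples) or abs(samples[i] - samples[start]) > tolerance:
            if i - start >= min_length:
                run = samples[start:i]
                states.append((start, i - 1, sum(run) / len(run)))
            start = i
    return states


# Synthetic stand-in for real data: 139,000 samples of a square wave.
signal = [100.0 if (i // 1000) % 2 == 0 else 2000.0 for i in range(139000)]

started = time.perf_counter()
states = find_steady_states(signal)
elapsed = time.perf_counter() - started
print("%d steady states found in %.3f s" % (len(states), elapsed))
```

Timings from this sketch are not comparable with the table above, since the data, the algorithm details and the machine all differ. It is only meant to show the shape of the for loop in question.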