Which programming language for my Disaggregation system? Matlab versus Python; Graphical Models.

Over the course of my PhD, I intend to write a smart meter disaggregation system.  Maybe this system will end up as a web service; maybe not.  At the very least, it will need to play nicely with existing web services like Pachube.  I've been wondering which language(s) I should use to build my system.  My current answer to this question is to write a complete prototype of the "backend" in Python, with the front-end written in JavaScript, HTML5 and SVG.  It's likely that parts of the "backend" will run rather slowly in Python; but luckily it's easy to get Python to play well with C++ code, so I'd plan to re-write computationally intensive sections in C++.

My initial plan was to use Matlab.  But after writing several thousand lines of Matlab, I couldn't help but feel uncomfortable with it.  There are some seriously ugly bits of the language; and in general it has a rather "hacked together" feel to it.  It turns out I'm not the only one who feels uncomfortable with Matlab: there's a blog called "Abandon MATLAB" with gems like "[Mathworks] even updated the docs for “getframe” to clarify that you need to turn off the fucking screen saver and walk away from the computer like it’s 1992.".  One especially interesting post in "Abandon MATLAB" links to the results of a survey which compares attitudes to MATLAB to attitudes to Python.  Basically, I feel content that I wasn't completely crazy to abandon Matlab in favor of Python and C++.  I'll admit that I'm struggling a bit to wrap my head around JavaScript but I'm getting there with the help of Douglas Crockford's excellent book "JavaScript: The Good Parts".

 

Early conclusions regarding implementing PGMs

I think I should go ahead and prototype the backend of my system in Python, even though there aren't many PGM frameworks for Python.  Python does have good support for general graphical models.  I can always jump over to C++(11) for the PGM stuff.  Plus I'm not sure yet that I will be using "textbook" PGM approaches; I might build my own algorithms based on "normal" graphical models.  Also of interest, the "Why Matlab" wiki page for the Probabilistic Modeling Toolkit states "In the future we may port to python. This seems to be growing in popularity within the machine learning community."

Also useful: the Thesaurus of Mathematical Languages, which has guides for translating between Matlab, NumPy and R.

Update 1/4/2012

OK.  I've gone back to prototyping in Matlab.  The programming assignments for the Stanford Probabilistic Graphical Models course are done in Matlab (or Octave) and I'm slowly learning to appreciate Matlab.  There are some language features which I still find very frustrating (like not having much control over whether objects are passed by reference or value into functions; and no proper list data structure) but it is fast to develop in.

Update 18/6/2012

For a bunch of reasons, I'm seriously thinking of moving away from MATLAB to Python + NumPy + matplotlib.  Some pros and cons of the main languages I'm considering are below:

Portability

  • Python should work "out of the box" on Windows, Linux or Mac
  • C++ needs, at the very least, to be re-compiled
  • MATLAB should be pretty portable

Processing speed

  • C++ is almost always the fastest, not least because it allows access to SIMD and GPU instructions
  • Python can be fast if used with PyPy, NumPy etc (and, of course, can use C++ code)
  • MATLAB is fast for a few things but is remarkably slow for loops, recursion, OOP and a bunch of other things

Tinkering with data during development

  • This is probably where MATLAB excels, although its graphing system is far from perfect
  • Python with matplotlib should allow for easy tinkering
  • C++ I'd have to dump data out to gnuplot (as I did for my MSc project) or something similar

Maintaining a large code base

  • C++ and Python are both great for building large apps
  • MATLAB isn't so good (no namespaces; the editor lacks many features found in Eclipse, like refactoring aids and git support; writing unit tests for MATLAB code isn't as easy as it should be)

Accessibility for other developers

  • If I want hobbyists to be able to use and/or modify the system then MATLAB is not an option
  • Python is probably better understood by the "hobbyist" community than C++
  • But MATLAB seems to be in vogue for academic NILM research

GUI development

  • This is probably an area where Python excels.  (Yes, you can build GUIs in MATLAB and C++ but it's probably least painful in Python)
  • If I did go with a pure C++ implementation then maybe wxWidgets in combination with gpPanel would be a good strategy

Graphical models

Python

See "stackoverflow: Python Graph Library".  graph-tool gets some love; it is an efficient Python module based heavily on the Boost Graph Library, hence might be a good bet given my experience with the BGL for my MSc project.  There are others which may be more focussed on probabilistic graphical models, like gPy.  NetworkX also looks very attractive and was praised at PyCon2012.

Also, this "Could anybody recommend a graphical model implementation in Python" thread is very useful. Amongst other things, it lists  PyMC which "is a python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness-of-fit and convergence diagnostics."

Also see this blog post "Bayes net by example using Python and Khan Academy Data".
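To make the idea of a Bayes net concrete, here is a minimal sketch in plain Python (no PGM library) of exact inference by enumeration on a tiny network.  Everything in it is an assumption made up for illustration: the two appliances, the "meter reads high" variable, the probabilities and the function names.  None of it comes from the libraries or posts mentioned above.

```python
from itertools import product

# Hypothetical network: kettle -> meter <- heater.
# Each appliance is independently on or off, and the meter reads "high"
# with a probability that depends on both.  All numbers are invented.
P_KETTLE_ON = 0.1
P_HEATER_ON = 0.3
P_HIGH_GIVEN = {  # P(meter = high | kettle, heater)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}


def joint(kettle, heater, high):
    """P(kettle, heater, meter) using the factorisation of the network."""
    p = P_KETTLE_ON if kettle else 1 - P_KETTLE_ON
    p *= P_HEATER_ON if heater else 1 - P_HEATER_ON
    p_high = P_HIGH_GIVEN[(kettle, heater)]
    p *= p_high if high else 1 - p_high
    return p


def posterior_kettle_on(high):
    """P(kettle = on | meter reading), summing out the heater."""
    numerator = sum(joint(True, h, high) for h in (True, False))
    evidence = sum(joint(k, h, high)
                   for k, h in product((True, False), repeat=2))
    return numerator / evidence


if __name__ == "__main__":
    print(posterior_kettle_on(True))
```

Enumeration like this is only practical for a handful of variables, which is where a proper PGM framework (or hand-written C++) would take over.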

Probabilistic Graphical Modelling frameworks in C++, Java, R and Matlab

There's an extensive list of PGM frameworks here written by Kevin P Murphy (who's one of the major committers to the Matlab Bayes Net Toolbox and the Probabilistic Modelling Toolkit amongst many other things; and is also co-teaching the Stanford Probabilistic Graphical Models course in Spring 2012; and he has a very interesting-looking book called "Machine Learning: a Probabilistic Perspective" coming out in August 2012).  The breakdown by language:

  • 13 in Java
  • 10 in C++
  • 5 in Matlab
  • 3 in R
  • 1 in Python

Of some interest: the FastInf C++ library was developed in the labs of Prof Friedman and Prof Daphne Koller (who together wrote the "Probabilistic Graphical Models" book I'm reading; and Koller is giving a free online course on Probabilistic Graphical Models from Stanford).

Update 20/6/2012

My current plan is starting to look something like this:

  • prototype in Python with NumPy, matplotlib and wxPython
  • re-write slow bits of code in C++ (see this StackOverflow thread discussing tools for writing C++ APIs for Python; looks like I should probably start with manual wrapping)
  • I'm really having to fight the urge to write my whole thing in C++.  

I've just done a very quick benchmark on a bit of code to identify "steady states" (as defined by Hart 1992).  The test was run on a dataset with 139,000 samples.  The benchmark included loading the file from disk.  The results (in seconds):

  • MATLAB 2012a = 0.143
  • Python 2.7.3 = 0.140
  • PyPy = 0.119
  • C++ (g++ 4.6.3, no optimisation) = 0.040
  • C++ (g++ 4.6.3, -O3) = 0.030

So MATLAB and Python are neck-and-neck at roughly 3.5 times slower than unoptimised C++ (and a little under 5 times slower than C++ compiled with -O3).  I'm quite surprised the difference wasn't much larger given that this test runs a for loop, and MATLAB is notoriously slow at for loops.
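For readers who want to try something similar, here is a rough Python sketch of a steady-state detector plus a small timing harness.  It is not the code that was benchmarked above.  The detection rule is my simplified reading of the idea (a run of samples where consecutive readings differ by no more than a tolerance, lasting at least a minimum number of samples), and the 15 W tolerance, the minimum length of 3 and the function names are all assumptions.

```python
import timeit


def find_steady_states(samples, tolerance=15.0, min_length=3):
    """Return a list of (start_index, length, mean_power) tuples.

    A run continues while each sample is within `tolerance` watts of the
    previous sample; runs shorter than `min_length` are discarded.
    Both default values are assumptions, not taken from the benchmark.
    """
    states = []
    start = 0
    for i in range(1, len(samples) + 1):
        run_ended = (i == len(samples)
                     or abs(samples[i] - samples[i - 1]) > tolerance)
        if run_ended:
            length = i - start
            if length >= min_length:
                mean = sum(samples[start:i]) / length
                states.append((start, length, mean))
            start = i
    return states


def benchmark(samples, repeats=20):
    """Time `repeats` runs on data already in memory; return the best time.

    Keeping file loading out of the timed region and repeating the run
    separates disk I/O and any JIT warm-up from the loop itself.
    """
    timings = timeit.repeat(lambda: find_steady_states(samples),
                            number=1, repeat=repeats)
    return min(timings)


if __name__ == "__main__":
    readings = [100, 102, 101, 2000, 2003, 1998, 2001, 100, 99]
    print(find_steady_states(readings))
    print(benchmark(readings * 10000))
```

Timing only the in-memory loop, over several repeats, would give a fairer comparison between interpreters than a single run that also includes reading the file.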

Comments

I would use Java and GWT for the dynamic web site. Java is used in so many places, from Tridium Niagara devices to banking systems. Android is effectively Java. It is very scalable and transferable. Also, the development tools are excellent, free and well supported by many companies. The only reason I would use something else is if I was already skilled in another language.

@peter thanks loads for the reply.  Sorry, my sparse notes above didn't explain my thinking very well: most of the serious number-crunching will be done on the server side.  Ultimately all the server-side code will probably be written in C++ but I want to prototype in a higher level language like Python.  I am considering using Java for the backend code.

In terms of the front-end GUI, I'm very eager to build the front-end of the website using modern tools like HTML5, SVG and JavaScript rather than requiring users to download Java applets or something like that.

Another point I failed to make above is that I actually want to learn some new programming languages!

Hi. Just seeing this now via a link on Hacker News. Thanks for a well balanced comparison and for revising over time. As someone involved in PyPy I would be interested in seeing why we did not fare better, perhaps the code did not run long enough to warm up the JIT (possibly true for Matlab2012a as well). Could I somehow add your benchmark to speed.pypy.org or could you run twenty cycles of it after loading from disk?

Hi Mattip,

Thanks loads for your comment. PyPy is an awesome project!

To be honest, that little benchmark I did was so quick-and-dirty that it really doesn't tell us much! And I'm not sure I've got the code any more... it really was just a quick little test with embarrassingly little rigour involved. I could take down my benchmark results if you'd like?

Thanks, and no big deal. I didn't mean to suggest your data was bad or incorrect, just the opposite - we are always looking for new and interesting ways to test PyPy in the wild, real-life benchmarking is almost as hard as disaggregating power readings :).