An Idiosyncratic Analogical Overview of Some Programming Languages from an Evolutionary Biologist's Perspective

This article was originally posted at geodendron.

R

R is like a microwave oven. It is capable of handling a wide range of pre-packaged tasks, but can be frustrating or inappropriate when trying to do even simple things that are outside of its (admittedly vast) library of functions. Ever tried to make toast in a microwave? There has been a push to start using R for simulations and phylogenetic analysis, and I am actually rather ambivalent about this. On the one hand, I would much rather an open source platform like R be used than some proprietary commercial platform such as Mathematica or Matlab. On the other hand, I do not think that R is the most suitable for the full spectrum of these applications. There are some serious limitations on its capability and performance when handling large datasets (mainly for historical design reasons), and, to be frank, I find many aspects of its syntax and idiom quite irritating. I primarily use it for what it was designed for: numerical and statistical computing, as well as data visualization and plotting, and in this context I am quite happy with it. For almost anything else, I look elsewhere. R has an excellent and very friendly community of people actively developing and supporting it, and I might change my view as R evolves. But, as things stand, it simply is not my first choice for the majority of my programming needs.

Python

If R is like a microwave oven, then Python is a full-fledged modern kitchen. You can produce almost anything you want, from toast to multi-course banquets, but you probably need to do some extra work relative to R. With R, if you want an apple pie, all you need to do is pick up a frozen one from a supermarket and heat it up, and you'll be feasting in a matter of minutes. With Python, you will need to knead your own dough, cook your own apple filling, etc., but you will get your apple pie, and, to be honest, programming in Python is such a pleasure that you will probably enjoy the extra work. And what happens if you instead want a pie with strawberries and dark chocolate and a hint of chili in the filling? With R, if you cannot find an appropriate instant pie in your supermarket, you are out of luck, or you might be in for a very painful adventure in trying to mash together some chimeric concoction that will look horrible and taste worse. But with Python, any pie you can dream of is completely within reach, and probably will not take too much more effort than plain old apple pie. From data wrangling and manipulation (prepping data files, converting file formats, relabeling sequences, etc.) to pipelining workflows, and even to many types of analyses and simulations, Python is the ideal choice for the majority of tasks that an evolutionary biologist carries out.

C++

If R is like a microwave, and Python is like a modern kitchen, then C++ is like an antique Victorian kitchen in the countryside, far, far, far away from any supermarket. You want an apple pie? With C++, you can consider yourself lucky if you do not actually have to grow the apples and harvest the wheat yourself. You will certainly need to mill and sift the flour, churn the butter, pluck the apples, grind the spices, etc. And none of this "set the oven to 400°" business: you will need to chop up the wood for the fuel, start the fire and keep tending the heat while it is baking. You will eventually get the apple pie, but you are going to have to work very, very, very hard to get it, and most of it will be tedious and painful work. And you will probably have at least one or two bugs in the pie when all is done. More likely than not, memory bugs ...

Stepping out of the cooking/kitchen analogy, if I had to point out the single aspect of programming in C++ that makes it such a pain, I would say "memory management". Much of the time programming in C++ is spent coding up the overhead involved in managing memory (in the broadest sense, from declaration, definition and initialization of stack variables, to allocation and deallocation of heap variables, to tracking and maintaining both types through their lifecycles), and even more is spent in tracking down bugs caused by memory leaks. The Standard Template Library certainly helps, and I've come to find it indispensable, but it still exacts its own cost in terms of overhead and its own price in terms of chasing down memory leaks and bugs.

For an example of the overhead, compare looping through elements of a container in Python:

for i in data:
    print(i)

vs. C++ using the Standard Template Library:

for (std::vector<long>::const_iterator i = data.begin();
        i != data.end();      
        ++i) {
    std::cout << *i << std::endl;
}

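And for an example of an insidious memory bug even with the Standard Template Library, consider this: what might happen to a pointer that you have been keeping to an element in a vector when some other part of your code appends a new element to that vector? If the append pushes the vector past its current capacity, the vector reallocates its storage and moves its elements, and your pointer is left pointing at freed memory. It can be quite ugly.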

So what does all that extra work and pain get you?

Performance.

When it comes to performance, C++ rocks. My initial attempt at a complex phylogeography simulator was in Python. It took me a week to get it working. I could manage population sizes of about 1000 on a 1 GB machine, and it could complete 10000 generations in about a week. I rewrote it in C++. Instead of a week, it took me two and a half months to get it to the same level of complexity. When completed, however, it could manage population sizes of over 1 million on a 1 GB machine, and run 2.5 million generations in 24 hours.

After that experience, when I am thinking of coding up something that might be computationally intensive or push the memory limits of my machines, the language that comes to mind is C++. More likely than not, however, I would probably still try to code up the initial solution in Python, and only turn to C++ when it becomes clear that Python's performance is not up to the task.

Java

Java, like Python, is a modern kitchen, allowing for a full range of operations with all the sanity-preserving conveniences and facilities (e.g., garbage-collection/memory-management). But it is a large, industrial kitchen, with an executive chef, sous chefs, and a full team of chefs de partie running things. And so, while you can do everything from making toast to multi-course meals, even the simplest tasks take a certain minimal investment of organization and overhead. At the end of the day, for many simpler things, such as scrambled eggs and toast, you would get just as good results a lot quicker using Python.

I think that Java is a really nice language, and I do like its idioms and syntax, which, by design, enforce many good programming practices. It is also probably much more suited for enterprise-level application development than Python. But I find it very top-heavy for the vast majority of things that I do, and the extra investment in programming overhead that it imposes (think: getters and setters) does not buy me any performance benefit at all. As a result, I have not found a need to code a single application in Java since I started using Python several years ago.

21 Comments

Consider PyPy

You might want to look at PyPy ( http://pypy.org/ ). They've made significant progress improving performance in calculation-intense applications. Obviously it won't work for everything, and I suspect performance won't match C or C++ in most cases, but the gains of being able to write your simulation in Python may outweigh the cost.

Haskell?

I wonder what you would make of Haskell. If Python is a modern kitchen, then I guess Haskell is like a Star Trek replicator; you specify what you want, and the replicator takes care of the details. If you want something in its vast library of known dishes then just name the dish. If you want something new then you are going to have to specify it in more detail, but you still don't have to bake it yourself.

Haskell kitchen, first draft:

Haskell kitchen, first draft: your kitchen is an assortment of ultra-safe and very efficient devices. The knives cut through steel but the edge's design makes it impossible for you to cut your fingers. However, learning to use the knife will take a year. The oven's door won't open if you try to place a pie in there with incorrect ingredients.

Sometimes you notice something weird about the dishwasher... seems like it only cleans the pots right before you start needing them.

Your kitchen staff is made entirely of nutrition and biology pregrads, graduates, doctors and professors who know everything about the field and about foodstuffs at the molecular level but have never prepared a single dish.

The kitchen metaphor is

The kitchen metaphor is actually quite apt. One note on your Python/C++ performance issues: it would have been better to profile your Python code to find out which particular functions the code was spending its time in, and then reimplement them in C++. Every time the issue of speed is raised in computing, someone brings up the phrase, "80% of the time is spent in 20% of the code". I don't know how true that is in scientific analysis, but it seems to be true in an awful lot of places.
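
A rough sketch of that profiling step, using the cProfile and pstats modules from the Python standard library. The functions simulate, mutate and record are hypothetical stand-ins made up for illustration (they are not the simulator from the article), and profile_report is just a convenience wrapper invented here.

```python
import cProfile
import io
import pstats


def mutate(population):
    """Hypothetical hot spot: updates every individual in a pure-Python loop."""
    for i in range(len(population)):
        population[i] = (population[i] * 31 + 7) % 1009


def record(generation, log):
    """Hypothetical cheap bookkeeping step."""
    log.append(generation)


def simulate(generations=200, size=2000):
    """Toy stand-in for a simulator's main loop (not the author's code)."""
    population = list(range(size))
    log = []
    for generation in range(generations):
        mutate(population)
        record(generation, log)
    return log


def profile_report(func, top=3):
    """Run func under cProfile and return a text report of the `top`
    functions, ranked by the time spent in their own code."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # "tottime" sorts by time inside each function, excluding its callees.
    stats.sort_stats("tottime").print_stats(top)
    return stream.getvalue()


print(profile_report(simulate))
```

In this toy case the function at the top of the report should be mutate, which would make it the first candidate for a rewrite in C++ while the rest of the program stays in Python.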

Also, could you make your captcha a little more readable? I've had to try 4 times now

Nice analogies :)

Out of curiosity, do you still have your old Python version of that simulator you mention? I'd be *really* interested in knowing how running it with PyPy (http://pypy.org/) performs relative to your C++ version.

Also, you may want to explore Cython (http://cython.org/) as a way to manually speed up selected parts of a Python application without having to resort to recoding *everything* (even the parts which aren't slowing things down) in C++.

PyPy excellent with memory

I can't speak for Cython, but I have found PyPy to be excellent with its memory usage. For example, I have a script that in CPython 2.6 takes ~1.5GB memory, while in PyPy 1.5, it only uses 750 MB. This still could probably be beaten in C++, but I have found the difference to be very significant. The people at PyPy even advertise the lower memory usage on their homepage: http://pypy.org/

Something like that.

Good post!
My work flow is usually Python and R with a bit of bash as duct-tape. Then, if better performance, etc, is needed I might do bits and pieces in C... That seems to work nicely.

I really wish that Python had R/ggplot2's 2D graphics capabilities... matplotlib has come a long way, but it still has a ways to go.

Better multi-processing/threading tools in both Python and R would be great. Some of R Revolution's work is interesting...

Python vs C

Yes.

I did a program to compare text (sort of like word diff, but works) see it here: http://apps.opensourcelaw.biz/compareL/default/index

To first parse the text before a compare I used pyparsing. It could parse a non-trivial text (30-150 pages twice) in a matter of seconds to minutes (on the order of 20s-120s). This was ok for me, but no use if I wanted other people to use it.

So I had to teach myself some C in order to rewrite the parser (using flex). The parsing time is now under half a second for the same docs. It was (comparatively) a lot of effort to get that.

I have done some stuff in Java (for Android) and I find it very tedious compared to Python.

Cheers

C++11

Wrt your example of looping through elements of a container, the new C++ standard specifies a new use of the auto keyword and a new construct (the range-based for loop) that greatly simplify writing loops like these.

for(auto d: data)
  std::cout << d << std::endl;

It still isn't quite like the Python example, but it removes a lot of useless clutter.

To find out more about the new standard: http://en.wikipedia.org/wiki/C%2B%2B0x

Scala might be of interest

Very interesting comparison. It's great to see languages like Python being popular with evolutionary biology.
I wonder if "Creation Scientists" use Visual Basic :-)

If you like Java you might want to have a look at Scala. It allows you to write short and concise code similar to Python or Ruby, gives you the full power of both object oriented and functional programming (without being too difficult - despite what other people will tell you) and you have full access to all the Java libraries out there (I am sure there is great Java stuff for biologists available).

Having a look at Jython (Python on the JVM) might also be a good idea.

Do you have any examples on how you use Python or R for your research?

Markus

You can also do something

You can also do something like this with the new standard:

#include <algorithm>
...
std::for_each(container.begin(), container.end(), [](ElementType e) { // a lambda
    // do something with e
});
