The main purpose of using using higher level programming languages is to be able to use higher level of abstraction in order to make code more readable and reusable. However, added levels of abstraction usually results in loss of efficiency (speed, memory or both). The purpose of this post is to study the expected amount of performance degradation for three different programming languages.

This posting is a small case study on three different programming languages comparing this hit on the running time as the level of abstraction and genericity increases. To highlight the overhead I chose a task which is CPU bound an where the relative overhead of abstraction (when introduced) is as high as possible. So this example can be viewed as a kind of worst case scenario.

The languages in this comparison are:

- C++ (gcc version 4.1.2, flag: -O3)
- OCaml (ocamlopt, version 3.09.3)
- Scala (version 2.7.4 using JVM 1.6)

The task is a simple Mandelbrot computation. The result is not stored, but total number of iterations over each grid point is computed and printed to the console so that the semantic equivalence of the implementations can be readily checked. The programs internally compute the picture of the Mandelbrot set on an 5000X5000 grid. For each point, the maximum number of iterations is set to 100, resulting in around 300 million complex multiplications and more than a billion floating point operations.

Let us start with the lowest level C++ implementation:

#include <stdio.h> #define RESOLUTION 5000 void set_fpu (unsigned int mode) { asm ("fldcw %0" : : "m" (*&mode)); } int iters1(int max_iter,double xc,double yc) { double x = xc; double y = yc; for(int count = 0;count<max_iter;count++) { if( x*x + y*y >= 4.0) { return count; } double tmp = x*x-y*y+xc; y = 2.0 * x * y + yc; x = tmp; } return max_iter; } int main() { set_fpu (0x27F); int max_val = RESOLUTION/2; int min_val = -max_val; double mul = 2.0 / max_val; int count = 0; for(int i=min_val;i<=max_val;i++) { for(int j=min_val;j<=max_val;j++) { count += iters1(100,mul*i,mul*j); } } printf("result: %d\n",count); }

Complex numbers don't show up explicitely in the code. We will compare these with versions that use classes (or records) to model complex numbers.

A minor technical detail is that we enabled 64-bit rounding mode by the `fldcw`

assembler instruction to get the exact same beviour as the other two implementations.

The equivalent OCaml code is given below:

let resolution = 5000 let iters1 max_iter xc yc = let rec aux count x y = if count = max_iter then max_iter else begin if x *. x +. y *. y >= 4.0 then count else aux (count+1) (x *. x -. y *. y +. xc) (2.0 *. x *. y +. yc) end in aux 0 xc yc let _ = let max_val = resolution/2 in let min_val = - max_val in let mul = 2.0 /. float_of_int max_val in let count = ref 0 in for i=min_val to max_val do for j = min_val to max_val do let x = float_of_int i *. mul in let y = float_of_int j *. mul in count := !count + iters1 100 x y; done done; Printf.printf "result: %d\n" !count

And the Scala version is:

object Mandel { val RESOLUTION = 5000 def iters1(max_iter:Int,xc:Double,yc:Double):Int = { def aux(count:Int,x:Double,y:Double):Int = { if(count >= max_iter) { max_iter } else { if( x * x + y * y >= 4.0 ) count else { aux(count+1,x * x - y * y+xc,2.0*x*y+yc) } } } aux(0,xc,yc) } def main(args:Array[String]) { val max_val = RESOLUTION/2 val min_val = -RESOLUTION/2 val mul = 2.0 / max_val var count = 0 var count_tiles = 0 for( i <- min_val to max_val; j <- min_val to max_val) { count += iters3(100,mul*i,mul*j) } println("result: "+count) } }

Running times on a 2Ghz Core 2 duo 32bit mode Linux box:

C++ | 2.3 sec |

OCaml | 8.2 sec |

Scala | 3.6 sec |

We see that Scala (on the current JVM) outperforms the native OCaml code by a large margin. It is not very surprising, since floating point has never been OCaml's main strength.

Now let us turn to the real meat of the test: we will introduce an abstraction for the floating point numbers to perform the addition and multiplication of complex numbers as functions. It should be made clear that it is not possible to create a solution which is perfectly equivalent for all the languages. OCaml, for example has class/object support, however it is based on the structural equivalence and therefore mostly relies on run-time reflexivity. To make the three tests comparable, I decided to use record in the second OCaml implementation to model floating point numbers which has much less perfomance overhead, albeit offering a slightly lower level of abstraction than the Scala and C++ implementations.

For this test we replace `iter1`

by a new `iter2`

function. Let us start with the C++ version again:

class Complex { public: Complex(double xc,double yc) : x(xc), y(yc) { } double norm_square() { return x*x+y*y; } Complex operator*(const Complex &other) { return Complex(x*other.x-y*other.y, x*other.y+y*other.x); } Complex operator+(const Complex &other) { return Complex(x + other.x, y + other.y); } private: double x; double y; }; int iters2(int max_iter,double xc,double yc) { Complex c(xc,yc); Complex z(xc,yc); for(int count = 0;count<max_iter;count++) { if( z.norm_square() >= 4.0 ) { return count; } z = z * z + c; } return max_iter; }

OCaml:

type complex = { x : float; y : float; } let norm_square c = c.x *. c.x +. c.y *. c.y let mul c1 c2 = { x = c1.x *. c2.x -. c1.y *. c2.y; y = c1.x *. c2.y +. c1.y *. c2.x } let add c1 c2 = { x = c1.x +. c2.x; y = c1.y +. c2.y } let iters2 max_iter xc yc = let c = { x = xc; y = yc } in let rec aux count z = if count >= max_iter then max_iter else begin if norm_square z >= 4.0 then count else begin aux (count+1) (add (mul z z) c) end end in aux 0 c

Scala:

final case class Complex(x:Double,y:Double) { def norm_square = x*x + y*y def +(other:Complex) = new Complex(x+other.x,y+other.y) def *(other:Complex) = new Complex(x*other.x-y*other.y, x*other.y+y*other.x) } def iters2(max_iter:Int,xc:Double,yc:Double):Int = { val c = new Complex(xc,yc) def aux(count:Int,z:Complex):Int = { if(count >= max_iter) { max_iter } else { if( z.norm_square >= 4.0 ) count else { aux(count+1,z*z+c) } } } aux(0,c) }

To be fair, we have to mention, that OCaml and Scala code will create objects on the heap, whereas the C++ implementation will work on the stack. Still, these above snippets are the most straightforward implementations for the task at hand, they are relatively comparable.

C++ | 2.4 sec | +4% |

OCaml | 11.6 sec | +41% |

Scala | 10.8 sec | +180% |

We can see that although Scala takes the biggest relative hit, it still perfoms slightly better than the OCaml solution. This is probably due to OCaml's poor general floating point performance. For an integer/pointer heavy problem, probably a quite different picture would emerge.

The last test compares with solutions of higher genericity. In this case the Complex class is implemented in a duck-typing manner. That is, the parameter of the addition and multiplication operation will expect an object that has `x`

and `y`

methods, but its type is not specified precisely.

The OCaml implementation will use a straightforward object-oriented (class based) solution, since OCaml uses structural equivalence for classes anyways.

The C++ and Scala implementations use generic (templatized) methods. Again, although these two implementations look very similar syntactically, they use completely different methods under the hub: where C++ uses static specialization, Scala (more akin to OCaml) uses run-time reflexivity, which explains the benchmark results.

C++ code:

class Complex2 { public: Complex2(double xc,double yc) : x(xc), y(yc) { } double norm_square() { return x*x+y*y; } template<class T> Complex2 operator*(const T &other) { return Complex2(x*other.x-y*other.y, x*other.y+y*other.x); } template<class T> Complex2 operator+(const T &other) { return Complex2(x + other.x, y + other.y); } private: double x; double y; }; int iters3(int max_iter,double xc,double yc) { Complex2 c(xc,yc); Complex2 z(xc,yc); for(int count = 0;count<max_iter;count++) { if( z.norm_square() >= 4.0 ) { return count; } z = z * z + c; } return max_iter; }

OCaml:

class complex_c x_init y_init = object val x = x_init val y = y_init method get_x = x method get_y = y method norm_square = x *. x +. y *. y method add (c:complex_c) = new complex_c (x +. c#get_x) (y +. c#get_y) method mul (c:complex_c) = new complex_c (x *. c#get_x -. y *. c#get_y) (x *. c#get_y +. y *. c#get_x) end let iters3 max_iter xc yc = let c = new complex_c xc yc in let rec aux count z = if count >= max_iter then max_iter else begin if z#norm_square >= 4.0 then count else begin aux (count+1) ((z#mul z)#add c) end end in aux 0 c

Scala:

final case class Complex2(x:Double,y:Double) { def norm_square = x*x + y*y def +[T<:{def x:Double; def y:Double}](other:T) = new Complex2(x+other.x,y+other.y) def *[T<:{def x:Double; def y:Double}](other:T) = new Complex2(x*other.x-y*other.y, x*other.y+y*other.x) } def iters3(max_iter:Int,xc:Double,yc:Double):Int = { val c = new Complex2(xc,yc) def aux(count:Int,z:Complex2):Int = { if(count >= max_iter) { max_iter } else { if( z.norm_square >= 4.0 ) count else { aux(count+1,z*z+c) } } } aux(0,c) }

The running times (relative values to the first implementation):

C++ | 2.4 sec | +4% |

OCaml | 101 sec | +1130% |

Scala | 60 sec | +1566% |

Of course, C++'s template mechanism comes with a catch: a generic function itself may be buggy until it's instantiatied. Worse than that: there are no clear requirements on the template parameters to work correctly. On the other hand, it seems to introduce no performance overhead whatsoever.

The results for the scala and ocaml look poor. Especially the OCaml implementation: now the floating point performance can't be blamed for the huge loss anymore. It is clear that the object system of OCaml is designed in special purpose tasks in mind, its performance is about at the level of dynamically typed scripting languages.

So the final table is:

C++ | 2.3 sec | 2.4 sec (+4%) | 2.4 sec (+4%) |

OCaml | 8.2 sec | 11.6sec (+41%) | 101 sec (+1130%) |

Scala | 3.6 sec | 10.8 sec (+180%) | 60.0 sec (+1566%) |

The overall conclusion is that while C++ encourages the usage of the most sophisticated abstraction available, both Scala and OCaml pay a high price for both abstraction and (especially) for genericity. This is troubling since the strength of these languages is exactly the high level of abstraction achievable. If the programmer is discouraged to use that, then the main advantage of these langugages is at risk.

Of course, one must take the above results with a grain of salt: the problem at hand was artificial and designed with the mindset to highlight the most pessimistic scenario possible. Algorithms with more dynamic allocation and less granular structure would probably pay a much lower price.

Post Comments/Remarks here