The Price of Abstraction

The main purpose of using using higher level programming languages is to be able to use higher level of abstraction in order to make code more readable and reusable. However, added levels of abstraction usually results in loss of efficiency (speed, memory or both). The purpose of this post is to study the expected amount of performance degradation for three different programming languages.

This posting is a small case study on three different programming languages comparing this hit on the running time as the level of abstraction and genericity increases. To highlight the overhead I chose a task which is CPU bound an where the relative overhead of abstraction (when introduced) is as high as possible. So this example can be viewed as a kind of worst case scenario.

The languages in this comparison are:

  1. C++ (gcc version 4.1.2, flag: -O3)
  2. OCaml (ocamlopt, version 3.09.3)
  3. Scala (version 2.7.4 using JVM 1.6)

The task is a simple Mandelbrot computation. The result is not stored, but total number of iterations over each grid point is computed and printed to the console so that the semantic equivalence of the implementations can be readily checked. The programs internally compute the picture of the Mandelbrot set on an 5000X5000 grid. For each point, the maximum number of iterations is set to 100, resulting in around 300 million complex multiplications and more than a billion floating point operations.

Let us start with the lowest level C++ implementation:

#include <stdio.h>
#define RESOLUTION 5000

void 
set_fpu (unsigned int mode) {
  asm ("fldcw %0" : : "m" (*&mode));
}

int iters1(int max_iter,double xc,double yc) {
   double x = xc;
   double y = yc;
   for(int count = 0;count<max_iter;count++) {
      if( x*x + y*y >= 4.0) { return count; }
      double tmp = x*x-y*y+xc;
      y = 2.0 * x * y + yc;
      x = tmp;
   }
   return max_iter;
}

int main() {
   set_fpu (0x27F); 
   int    max_val = RESOLUTION/2;
   int    min_val = -max_val;
   double mul = 2.0 / max_val;
   int    count = 0;
   for(int i=min_val;i<=max_val;i++) {
      for(int j=min_val;j<=max_val;j++) {
         count += iters1(100,mul*i,mul*j);
      }
   }
   printf("result: %d\n",count);
}

Complex numbers don't show up explicitely in the code. We will compare these with versions that use classes (or records) to model complex numbers.

A minor technical detail is that we enabled 64-bit rounding mode by the fldcw assembler instruction to get the exact same beviour as the other two implementations.

The equivalent OCaml code is given below:

let resolution = 5000

let iters1 max_iter xc yc = 
  let rec aux count x y = 
    if count = max_iter then max_iter else begin
      if x *. x  +. y *. y >= 4.0 then count else
      aux (count+1) (x *. x -. y *. y +. xc) (2.0 *. x *. y +. yc)
    end in
  aux 0 xc yc

let _ = 
  let max_val = resolution/2 in
  let min_val = - max_val in
  let mul = 2.0 /. float_of_int max_val in
  let count = ref 0 in
  for i=min_val to max_val do
    for j = min_val to max_val do
      let x = float_of_int i *. mul in
      let y = float_of_int j *. mul in
      count := !count + iters1 100 x y;
    done
  done;
  Printf.printf "result: %d\n" !count

And the Scala version is:

object Mandel {
  val RESOLUTION = 5000

  def iters1(max_iter:Int,xc:Double,yc:Double):Int = {
    def aux(count:Int,x:Double,y:Double):Int = {
      if(count >= max_iter) { max_iter } else {
        if( x * x + y * y >= 4.0 ) count else {
          aux(count+1,x * x - y * y+xc,2.0*x*y+yc)
        }
      }
    }
    aux(0,xc,yc)
  }
  
  def main(args:Array[String]) {
    val max_val = RESOLUTION/2
    val min_val = -RESOLUTION/2
    val mul = 2.0 / max_val
    var count = 0
    var count_tiles = 0
    for( i <- min_val to max_val; j <- min_val to max_val) {
      count += iters3(100,mul*i,mul*j)
    }
    println("result: "+count)
  }  
}

Running times on a 2Ghz Core 2 duo 32bit mode Linux box:

C++ 2.3 sec
OCaml 8.2 sec
Scala 3.6 sec

We see that Scala (on the current JVM) outperforms the native OCaml code by a large margin. It is not very surprising, since floating point has never been OCaml's main strength.

Now let us turn to the real meat of the test: we will introduce an abstraction for the floating point numbers to perform the addition and multiplication of complex numbers as functions. It should be made clear that it is not possible to create a solution which is perfectly equivalent for all the languages. OCaml, for example has class/object support, however it is based on the structural equivalence and therefore mostly relies on run-time reflexivity. To make the three tests comparable, I decided to use record in the second OCaml implementation to model floating point numbers which has much less perfomance overhead, albeit offering a slightly lower level of abstraction than the Scala and C++ implementations.

For this test we replace iter1 by a new iter2 function. Let us start with the C++ version again:

class Complex {
public:
   Complex(double xc,double yc) : x(xc), y(yc) { }

   double norm_square() {
      return x*x+y*y;
   }
  
   Complex operator*(const Complex &other) {
      return Complex(x*other.x-y*other.y,
                     x*other.y+y*other.x);
   }

   Complex operator+(const Complex &other) {
      return Complex(x + other.x, y + other.y);
   }
   
private:
   double x;
   double y;
};

int iters2(int max_iter,double xc,double yc) {
   Complex c(xc,yc);
   Complex z(xc,yc);
   for(int count = 0;count<max_iter;count++) {
      if( z.norm_square() >= 4.0 ) { return count; }
      z = z * z + c;
   }
   
   return max_iter;
}

OCaml:

type complex = {
    x : float;
    y : float;
  }

let norm_square c = 
  c.x *. c.x +. c.y *. c.y
  
let mul c1 c2 = 
  { x = c1.x *. c2.x -. c1.y *. c2.y;
    y = c1.x *. c2.y +. c1.y *. c2.x }

let add c1 c2 = { x = c1.x +. c2.x; y = c1.y +. c2.y }

let iters2 max_iter xc yc = 
  let c = { x = xc; y = yc } in
  let rec aux count z = 
    if count >= max_iter then max_iter else begin
      if norm_square z >= 4.0 then count else begin
        aux (count+1) (add (mul z z) c)
      end
    end in
  aux 0 c

Scala:

  final case class Complex(x:Double,y:Double) {
    def norm_square = x*x + y*y
    
    def +(other:Complex) = new Complex(x+other.x,y+other.y)

    def *(other:Complex) = new Complex(x*other.x-y*other.y,
                                       x*other.y+y*other.x)
  }


  def iters2(max_iter:Int,xc:Double,yc:Double):Int = {
    val c = new Complex(xc,yc)
    def aux(count:Int,z:Complex):Int = {
      if(count >= max_iter) { max_iter } else {
        if( z.norm_square >= 4.0 ) count else {
          aux(count+1,z*z+c)
        }
      }
    }
    aux(0,c)
  }

To be fair, we have to mention, that OCaml and Scala code will create objects on the heap, whereas the C++ implementation will work on the stack. Still, these above snippets are the most straightforward implementations for the task at hand, they are relatively comparable.

C++ 2.4 sec +4%
OCaml 11.6 sec +41%
Scala 10.8 sec +180%

We can see that although Scala takes the biggest relative hit, it still perfoms slightly better than the OCaml solution. This is probably due to OCaml's poor general floating point performance. For an integer/pointer heavy problem, probably a quite different picture would emerge.

The last test compares with solutions of higher genericity. In this case the Complex class is implemented in a duck-typing manner. That is, the parameter of the addition and multiplication operation will expect an object that has x and y methods, but its type is not specified precisely.

The OCaml implementation will use a straightforward object-oriented (class based) solution, since OCaml uses structural equivalence for classes anyways.

The C++ and Scala implementations use generic (templatized) methods. Again, although these two implementations look very similar syntactically, they use completely different methods under the hub: where C++ uses static specialization, Scala (more akin to OCaml) uses run-time reflexivity, which explains the benchmark results.

C++ code:

class Complex2 {
public:
   Complex2(double xc,double yc) : x(xc), y(yc) { }

   double norm_square() {
      return x*x+y*y;
   }
   
   template<class T> 
   Complex2 operator*(const T &other) {
      return Complex2(x*other.x-y*other.y,
                      x*other.y+y*other.x);
   }

   template<class T> 
   Complex2 operator+(const T &other) {
      return Complex2(x + other.x, y + other.y);
   }
   
private:
   double x;
   double y;
};

int iters3(int max_iter,double xc,double yc) {
   Complex2 c(xc,yc);
   Complex2 z(xc,yc);
   for(int count = 0;count<max_iter;count++) {
      if( z.norm_square() >= 4.0 ) { return count; }
      z = z * z + c;
   }
   
   return max_iter;
}

OCaml:

class complex_c x_init y_init = 
  object
    val x = x_init
    val y = y_init
    method get_x = x
    method get_y = y
    method norm_square = x *. x  +. y *. y
    method add (c:complex_c) = new complex_c (x +. c#get_x) (y +. c#get_y)
    method mul (c:complex_c) = new complex_c (x *. c#get_x -. y *. c#get_y) (x *. c#get_y +. y *. c#get_x)
  end

let iters3 max_iter xc yc = 
  let c = new complex_c xc yc in
  let rec aux count z = 
    if count >= max_iter then max_iter else begin
      if z#norm_square >= 4.0 then count else begin
        aux (count+1) ((z#mul z)#add c)
      end
    end in
  aux 0 c

Scala:

  final case class Complex2(x:Double,y:Double) {
    def norm_square = x*x + y*y
    def +[T<:{def x:Double; def y:Double}](other:T) = new Complex2(x+other.x,y+other.y)
    def *[T<:{def x:Double; def y:Double}](other:T) = new Complex2(x*other.x-y*other.y,
                                                                   x*other.y+y*other.x)
  }

  def iters3(max_iter:Int,xc:Double,yc:Double):Int = {
    val c = new Complex2(xc,yc)
    def aux(count:Int,z:Complex2):Int = {
      if(count >= max_iter) { max_iter } else {
        if( z.norm_square >= 4.0 ) count else {
          aux(count+1,z*z+c)
        }
      }
    }
    aux(0,c)
  }

The running times (relative values to the first implementation):

C++ 2.4 sec +4%
OCaml 101 sec +1130%
Scala 60 sec +1566%

Of course, C++'s template mechanism comes with a catch: a generic function itself may be buggy until it's instantiatied. Worse than that: there are no clear requirements on the template parameters to work correctly. On the other hand, it seems to introduce no performance overhead whatsoever.

The results for the scala and ocaml look poor. Especially the OCaml implementation: now the floating point performance can't be blamed for the huge loss anymore. It is clear that the object system of OCaml is designed in special purpose tasks in mind, its performance is about at the level of dynamically typed scripting languages.

So the final table is:

C++ 2.3 sec 2.4 sec (+4%) 2.4 sec (+4%)
OCaml 8.2 sec 11.6sec (+41%) 101 sec (+1130%)
Scala 3.6 sec 10.8 sec (+180%) 60.0 sec (+1566%)

Conclusion

The overall conclusion is that while C++ encourages the usage of the most sophisticated abstraction available, both Scala and OCaml pay a high price for both abstraction and (especially) for genericity. This is troubling since the strength of these languages is exactly the high level of abstraction achievable. If the programmer is discouraged to use that, then the main advantage of these langugages is at risk.

Of course, one must take the above results with a grain of salt: the problem at hand was artificial and designed with the mindset to highlight the most pessimistic scenario possible. Algorithms with more dynamic allocation and less granular structure would probably pay a much lower price.


Source Code:


Post Comments/Remarks here