A performance update
I've continued playing with the LLVM. I discovered that when generating code for the normcdf and Black-Scholes functions I had not told LLVM that the functions being called (exp etc.) are actually pure functions. That meant that the LLVM couldn't perform common subexpression elimination (CSE) properly.
So here are some updated numbers for computing option prices for 10,000,000 options:
- Pure Haskell: 8.7s
- LLVM: 2.0s
    normcdf x = x %< 0 ?? (1 - w, w)
      where w = 1.0 - 1.0 / sqrt (2.0 * pi) * exp(-l*l / 2.0) * poly k
            k = 1.0 / (1.0 + 0.2316419 * l)
            l = abs x
            poly = horner coeff
            coeff = [0.0,0.31938153,-0.356563782,1.781477937,-1.821255978,1.330274429]

A noteworthy thing is that exactly the same code can be used both for the pure Haskell and the LLVM code generation; it's just a matter of overloading.
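To make the overloading point concrete, here is a minimal sketch of how the pure Haskell side could be set up. The `Branch` class, the `Flag` type family, the operator fixities, and this definition of `horner` are assumptions made for illustration; they are not taken from the LLVM bindings, which define their own versions of `%<` and `??`. Only the body of `normcdf` is the code from the post.

```haskell
{-# LANGUAGE TypeFamilies #-}

infix 4 %<
infix 0 ??

-- Hypothetical class standing in for the overloaded comparison and
-- selection operators. The names mirror the post; the definitions are
-- illustrative and do not come from the LLVM bindings.
class Floating a => Branch a where
  type Flag a
  (%<) :: a -> a -> Flag a
  (??) :: Flag a -> (a, a) -> a

-- Plain Haskell interpretation: a Bool and an ordinary if-then-else.
instance Branch Double where
  type Flag Double = Bool
  (%<) = (<)
  c ?? (t, e) = if c then t else e

-- Assumed definition: Horner evaluation, lowest-order coefficient first.
horner :: Num a => [a] -> a -> a
horner cs x = foldr (\c acc -> c + x * acc) 0 cs

-- The function from the post, unchanged apart from the type signature.
normcdf :: Branch a => a -> a
normcdf x = x %< 0 ?? (1 - w, w)
  where w = 1.0 - 1.0 / sqrt (2.0 * pi) * exp(-l*l / 2.0) * poly k
        k = 1.0 / (1.0 + 0.2316419 * l)
        l = abs x
        poly = horner coeff
        coeff = [0.0,0.31938153,-0.356563782,1.781477937,-1.821255978,1.330274429]
```

A code-generating instance would presumably supply a different `Flag` type and build instructions instead of computing values, while `normcdf` itself stays untouched.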
Vectors
A very cool aspect of the LLVM is that it has vector instructions. On the x86 these translate into using the SSE extensions to the processor and can speed up computations by doing things in parallel.
Again, by using overloading, the exact same code can be used to compute over vectors of numbers as with individual numbers.
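As a rough illustration of the idea (not the LLVM bindings' actual vector type), the sketch below defines a hypothetical four-lane `V4` type with lane-wise `Num` and `Fractional` instances. The helper `kOf` is the `k` term from `normcdf`, written once and usable at both `Float` and `V4`.

```haskell
-- Hypothetical four-lane vector of 32-bit floats, for illustration only.
data V4 = V4 Float Float Float Float
  deriving (Eq, Show)

-- Apply a scalar function to every lane.
lift1 :: (Float -> Float) -> V4 -> V4
lift1 f (V4 a b c d) = V4 (f a) (f b) (f c) (f d)

-- Combine two vectors lane by lane.
lift2 :: (Float -> Float -> Float) -> V4 -> V4 -> V4
lift2 f (V4 a b c d) (V4 e g h i) = V4 (f a e) (f b g) (f c h) (f d i)

-- Broadcast one scalar to all four lanes.
splat :: Float -> V4
splat x = V4 x x x x

instance Num V4 where
  (+) = lift2 (+)
  (-) = lift2 (-)
  (*) = lift2 (*)
  abs = lift1 abs
  signum = lift1 signum
  negate = lift1 negate
  fromInteger = splat . fromInteger

instance Fractional V4 where
  (/) = lift2 (/)
  fromRational = splat . fromRational

-- The k term from normcdf, written once for any Fractional type.
kOf :: Fractional a => a -> a
kOf x = 1.0 / (1.0 + 0.2316419 * abs x)
```

Calling `kOf (V4 0 1 2 3)` computes the same four values as calling `kOf` on each `Float` separately; a real vector backend would do the four lanes in one SSE instruction rather than in four scalar steps.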
So what about performance? I used four-element vectors of 32-bit floating point numbers and got these results:
- Pure Haskell: 8.7s
- LLVM, scalar: 2.0s
- LLVM, vector: 1.1s
- C, gcc -O3: 2.5s
- Only on MacOS does the LLVM package give you fast primitive functions, because that's the only platform that seems to have this as a standard.
- The vector version of floating point comparison has not been implemented in the LLVM yet.
- Do not use two-element vectors of 32-bit floats. This will generate code that is wrong on the x86. I sent in a bug report about this, but was told that it is a feature and not a bug. (I kid you not.) To make the code right you have to manually insert EMMS instructions.
- The GHC FFI is broken for all operations that allocate memory for a Storable, e.g., alloca, with, and withArray. These operations do not take the alignment into account when allocating. This means that, e.g., a vector of four floats may end up on 8-byte alignment instead of 16, which causes a segfault.
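For the alignment problem in the last point, one possible workaround is to ask for the alignment explicitly instead of relying on `alloca`. The sketch below assumes a version of the base library that provides `allocaBytesAligned` in `Foreign.Marshal.Alloc`; the helper names `withAlignedFloat4` and `isAligned16` are made up for this example.

```haskell
import Foreign.Marshal.Alloc (allocaBytesAligned)
import Foreign.Ptr (Ptr, ptrToWordPtr)
import Foreign.Storable (peekElemOff, pokeElemOff)

-- Hypothetical helper: temporary storage for four 32-bit floats
-- (16 bytes), explicitly requested on a 16-byte boundary.
withAlignedFloat4 :: (Ptr Float -> IO b) -> IO b
withAlignedFloat4 = allocaBytesAligned 16 16

-- Check whether an address is a multiple of 16.
isAligned16 :: Ptr a -> Bool
isAligned16 p = ptrToWordPtr p `mod` 16 == 0

-- Write four floats into the aligned buffer, read them back, and
-- report whether the buffer really was 16-byte aligned.
roundTrip :: [Float] -> IO (Bool, [Float])
roundTrip xs = withAlignedFloat4 $ \p -> do
  mapM_ (uncurry (pokeElemOff p)) (zip [0 .. 3] xs)
  ys <- mapM (peekElemOff p) [0 .. 3]
  pure (isAligned16 p, ys)
```

Checking the address with something like `isAligned16` before handing the pointer to vector code is a cheap way to find out whether a given allocation routine respects the required alignment.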
[Edit:] Added point about broken FFI.