We explain the optimization techniques used to set three world speed records. Using a combination of code generation and hardware-specific optimizations, we achieved a 20x speedup over hand-tuned assembly. These techniques depend on two things: (1) exploiting domain-specific dependencies that are too specialized for a compiler to detect and too tedious for a programmer to exploit by hand, and (2) knowing how to profile the operations your CPU is actually performing. They can be applied to CPU-bound code in any compiled language, across a wide range of analytics problems.
David’s talk is now available on the Chariot Solutions website.