Description
Simple benchmarking program with fancy graphs. It measures the time an algorithm takes to process some (varying) amount of data, then plots the relationship between time and the amount of data processed.
This project was written for (self-)educational purposes and fun, so some things may look a bit awkward.
Sample graphs
Internally, the amount of data is just a number (let's say N). We can make it behave a bit differently: for example, we can keep the size of the data the same but vary some property of that data according to N:
This plot shows the relationship between the "randomness" of the data and the time taken to sort it. So on the X axis we have the (rough) number of random shuffles applied to a pre-sorted array of 1,000,000 integers.
- When the array is almost sorted, Insertion sort is the fastest. It becomes slower than Timsort at 16 shuffles.
- At 128 shuffles Timsort becomes slower than Introsort (the default std::sort in the STL), and then quickly ends up 1.5 times slower than Introsort.
- Merge sort is largely independent of randomness. Only when the array is almost completely random does it become ~3 times slower than its best case, and even then only ~2 times slower than Introsort.
- Sorting on the GPU (CUDA Thrust library) is independent of array randomness. It is always blazing fast.
Other graphs from various sources:
- home computer (i5-4430, Tesla GT215)
- work computer (i7-4700K)
- some ARM board from work (Freescale i.MX6 Quad ARMv7, 996 MHz, 2 GB RAM)
- aws g2.2xlarge instance (Xeon E5-2670, Kepler GK104)
- aws c3.8xlarge instance (Xeon E5-2680, 32 vCPU)
Interesting results came from the i7 CPU: Timsort performed much better, becoming slower than Introsort only at 4096 shuffles. It also seems that the i7 does not suffer from false sharing (or perhaps the modern GCC 5.1.0 deserves the credit for that).
Usage
Compilation:
git submodule update --init
mkdir build
cd build
cmake ..
make
Running:
Make a separate folder and run bench from there:
mkdir tmp && cd tmp
Listing all tasks and algorithms:
$ ../bench -l
Task 0: popcnt perfomance
32bit SWAR popcnt
Brian Kernighan popcnt
Simple popcnt
generalized 32 bit SWAR popcnt
generalized 64 bit SWAR popcnt
intrinsics _mm_popcnt_u64 manual asm popcnt
intrinsics _mm_popcnt_u64 popcnt
intrinsics _mm_popcnt_u64 unrolled popcnt
table lookup popcnt
thrust popcnt
Task 1: sorting algorithms
Insertion sort n^2
Introsort std::sort n*log(n)
Merge sort n*log(n)
Selection sort n^2
Shell sort n*log^2(n)
swenson binary insertion sort n^2
swenson grail sort
swenson heapsort n*log(n)
swenson mergesort n*log(n)
swenson quiksort n*log(n)
swenson selection sort n^2
swenson sell sort n*log^2(n)
swenson sqrt sort
swenson timsort n*log(n)
thrust::sort
Task 2: Sorting algorithms, partially sorted data, 1000000 elements
Insertion sort n^2
Introsort std::sort n*log(n)
Merge sort n*log(n)
Shell sort n*log^2(n)
swenson timsort n*log(n)
thrust::sort
Task 3: 1000000 vector lengths
handmade unrolling
loop unrolling
template unrolling
Quick run
$ ../bench -t 0.1
Skip some tasks. The final listing will still include the skipped tasks, and previously measured performance data is preserved between runs.
$ ../bench -t 1 -s 2
Each algorithm runs -a (by default 3) times with the same data to minimize the error. It is possible to regenerate the data and run the algorithm -a times again; the number of data-regeneration iterations is specified by -b:
$ ../bench -t 1 -s 0,2 -a 2 -b 3
Viewing results:
For now it is quite inconvenient: you need a web server and a couple of symlinks.
$ ln -s ../../html/d3/main.js main.js
$ ln -s ../../html/d3/main.css main.css
$ ln -s ../../html/d3/index.html index.html
$ python3 -m http.server 8082
$ $BROWSER http://localhost:8082
LICENSE
MIT.