Description
Simple benchmarking program with fancy graphs. It measures the time an algorithm takes to process some (varying) amount of data, then plots the relationship between time and the amount of data processed.
This project was written for (self-)educational purposes and fun, so some things may look a bit awkward.
Sample graphs
Internally, the amount of data is just a number (let's say N). We can make it behave a bit differently: for example, we can keep the size of the data the same but vary some property of that data according to N:
This plot shows the relationship between the "randomness" of the data and the time taken to sort it. So on the X axis we have the (rough) number of random shuffles applied to a pre-sorted array of 1,000,000 integers.
- When the array is almost sorted, Insertion sort is the fastest. It becomes slower than Timsort at 16 shuffles.
- At 128 shuffles Timsort becomes slower than Introsort (the default std::sort in the STL), and then quickly ends up 1.5 times slower than Introsort.
- Merge sort is largely independent of randomness. Only when the array is almost completely random does it become ~3 times slower than its best case, and even then only ~2 times slower than Introsort.
- Sorting on the GPU (CUDA Thrust library) is independent of array randomness. It is always blazing fast.
Other graphs from various sources:
- home computer (i5-4430, Tesla GT215)
- work computer (i7-4700K)
- some ARM board from work (Freescale i.MX6 Quad ARMv7, 996 MHz, 2 GB RAM)
- aws g2.2xlarge instance (Xeon E5-2670, Kepler GK104)
- aws c3.8xlarge instance (Xeon E5-2680, 32 vCPU)
Interesting results came from the i7 CPU: Timsort performed much better, becoming slower than Introsort only at 4096 shuffles. It also seems that the i7 does not suffer from false sharing (or perhaps the modern GCC 5.1.0 deserves the credit for that).
Usage
Compilation:
git submodule update --init
mkdir build
cd build
cmake ..
make
Running:
Make a separate folder and run bench from there:
mkdir tmp && cd tmp
Listing all tasks and algorithms:
$ ../bench -l
Task 0: popcnt perfomance
32bit SWAR popcnt
Brian Kernighan popcnt
Simple popcnt
generalized 32 bit SWAR popcnt
generalized 64 bit SWAR popcnt
intrinsics _mm_popcnt_u64 manual asm popcnt
intrinsics _mm_popcnt_u64 popcnt
intrinsics _mm_popcnt_u64 unrolled popcnt
table lookup popcnt
thrust popcnt
Task 1: sorting algorithms
Insertion sort n^2
Introsort std::sort n*log(n)
Merge sort n*log(n)
Selection sort n^2
Shell sort n*log^2(n)
swenson binary insertion sort n^2
swenson grail sort
swenson heapsort n*log(n)
swenson mergesort n*log(n)
swenson quiksort n*log(n)
swenson selection sort n^2
swenson sell sort n*log^2(n)
swenson sqrt sort
swenson timsort n*log(n)
thrust::sort
Task 2: Sorting algorithms, partially sorted data, 1000000 elements
Insertion sort n^2
Introsort std::sort n*log(n)
Merge sort n*log(n)
Shell sort n*log^2(n)
swenson timsort n*log(n)
thrust::sort
Task 3: 1000000 vector lengths
handmade unrolling
loop unrolling
template unrolling
Quick run
$ ../bench -t 0.1
Skip some tasks. The final listing will still include the skipped tasks, and previously measured performance data is preserved between runs.
$ ../bench -t 1 -s 2
Each algorithm runs -a (by default 3) times with the same data to minimize the error. It is possible to regenerate the data and run the algorithm -a times again; the number of data-regeneration iterations is specified by -b:
$ ../bench -t 1 -s 0,2 -a 2 -b 3
Viewing results:
For now it is quite inconvenient: you need a web server and a couple of symlinks.
$ ln -s ../../html/d3/main.js main.js
$ ln -s ../../html/d3/main.css main.css
$ ln -s ../../html/d3/index.html index.html
$ python3 -m http.server 8082
$ $BROWSER http://localhost:8082
LICENSE
MIT.