# Test Set

I've implemented three different benchmarks, Fast, Slow, and Fluctuating, in several frameworks for comparison.

- Fast: Benchmarks x += x, starting from 1. This is a single instruction, and prone to be optimized away.
- Slow: Benchmarks std::this_thread::sleep_for(10ms). For a microbenchmark this is very slow, and it is interesting to see how each framework's autotuning deals with it.
- Fluctuating: A microbenchmark where each evaluation takes a different time. The randomly fluctuating runtime is achieved by performing between 0 and 255 calls to std::mt19937_64 per evaluation (see the sketch below).
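As a plain C++ sketch, the three benchmark bodies look roughly like this; the framework-specific versions (with doNotOptimizeAway and the like) follow in the sections below.

```cpp
#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

int main() {
    // Fast: a single add, prone to being optimized away
    uint64_t x = 1;
    x += x;

    // Slow: each evaluation sleeps for 10ms
    std::this_thread::sleep_for(std::chrono::milliseconds(10));

    // Fluctuating: each evaluation performs between 0 and 255 rng calls
    std::random_device dev;
    std::mt19937_64 rng(dev());
    auto iterations = rng() & UINT64_C(0xff);
    for (uint64_t i = 0; i < iterations; ++i) {
        (void)rng();
    }
}
```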

All benchmarks are run on an i7-8700 CPU locked at 3.2 GHz, using pyperf system tune.

# Runtime

I wrote a little timing tool that measures exactly how long it takes until each benchmark's output appears on the screen. With this I have measured the runtimes of the major benchmarking frameworks that support automatic tuning of the number of iterations: Google Benchmark, Catch2, nonius, sltbench, and of course nanobench.
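The timing tool itself is not included in this comparison; a minimal sketch of the idea, assuming a POSIX system and a hypothetical popen-based helper, could look like this:

```cpp
// Hypothetical sketch (not the actual tool): launch a benchmark binary and
// measure the wall-clock time until a marker string appears in its output.
#include <chrono>
#include <cstdio>
#include <iostream>
#include <string>

double secondsUntilOutput(std::string const& command, std::string const& marker) {
    auto begin = std::chrono::steady_clock::now();
    FILE* pipe = popen(command.c_str(), "r"); // POSIX
    if (pipe == nullptr) {
        return -1.0;
    }
    char line[4096];
    double seconds = -1.0;
    while (std::fgets(line, sizeof(line), pipe) != nullptr) {
        if (std::string(line).find(marker) != std::string::npos) {
            seconds = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - begin)
                          .count();
            break;
        }
    }
    pclose(pipe);
    return seconds;
}

int main() {
    // e.g. time from process start until nanobench prints its "x += x" result line
    std::cout << secondsUntilOutput("./m", "x += x") << " sec\n";
}
```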

| Benchmarking Framework | Fast (s) | Slow (s) | Fluctuating (s) | Startup (s) | Total (s) |
|:-----------------------|---------:|---------:|----------------:|------------:|----------:|
| Google Benchmark       |    0.367 |   11.259 |           0.825 |       0.000 |    12.451 |
| Catch2                 |    1.004 |    2.074 |           0.966 |       1.737 |     5.782 |
| nonius                 |    0.741 |    1.815 |           0.740 |       1.715 |     5.010 |
| sltbench               |    0.202 |    0.204 |           0.203 |       3.001 |     3.610 |
| nanobench              |    0.079 |    0.112 |           0.000 |       0.001 |     0.192 |

Nanobench is clearly the fastest autotuning benchmarking framework, by an enormous margin.

# Implementations & Output

## nanobench

### Sourcecode

```cpp
// https://github.com/martinus/nanobench
// g++ -O2 -I../../include main.cpp -o m
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

#include <chrono>
#include <random>
#include <thread>

int main(int, char**) {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("x += x", [&]() {
        ankerl::nanobench::doNotOptimizeAway(x += x);
    });

    ankerl::nanobench::Bench().run("sleep 10ms", [&]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });

    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().run("random fluctuations", [&]() {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    });
}
```

### Results

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                0.31 |    3,192,709,967.58 |    0.0% |            1.00 |            1.00 |  0.999 |           0.00 |    0.0% |      0.00 | x += x
|       10,149,086.00 |               98.53 |    0.1% |           45.00 |        2,394.00 |  0.019 |           9.00 |   88.9% |      0.11 | sleep 10ms
|              744.50 |        1,343,183.34 |   11.2% |        2,815.05 |        2,375.86 |  1.185 |         524.73 |   12.5% |      0.00 | :wavy_dash: random fluctuations (Unstable with ~23.3 iters. Increase minEpochIterations to e.g. 233)
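The last line flags the fluctuating benchmark as unstable and suggests increasing minEpochIterations. In nanobench this is a chained setter on the Bench object; as a drop-in replacement for the last run() call from the source above (reusing its rng), that would look roughly like this:

```cpp
// follow the suggestion from the output: at least 233 iterations per epoch
ankerl::nanobench::Bench().minEpochIterations(233).run("random fluctuations", [&]() {
    auto iterations = rng() & UINT64_C(0xff);
    for (uint64_t i = 0; i < iterations; ++i) {
        (void)rng();
    }
});
```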


## Google Benchmark

Very feature rich and battle proven, but a bit aged. Requires Google Test. Get it here: Google Benchmark

### Sourcecode

```cpp
#include "benchmark.h"

#include <chrono>
#include <random>
#include <thread>

// Build instructions: https://github.com/google/benchmark#installation
// curl --output benchmark.h
//   https://raw.githubusercontent.com/google/benchmark/master/include/benchmark/benchmark.h
// g++ -O2 main.cpp -Lgit/benchmark/build/src -lbenchmark -lpthread -o m

void ComparisonFast(benchmark::State& state) {
    uint64_t x = 1;
    for (auto _ : state) {
        x += x;
    }
    benchmark::DoNotOptimize(x);
}
BENCHMARK(ComparisonFast);

void ComparisonSlow(benchmark::State& state) {
    for (auto _ : state) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
BENCHMARK(ComparisonSlow);

void ComparisonFluctuating(benchmark::State& state) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    for (auto _ : state) {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    }
}
BENCHMARK(ComparisonFluctuating);

BENCHMARK_MAIN();
```

### Results

g++ -O2 main.cpp -L/home/martinus/git/benchmark/build/src -lbenchmark -lpthread -o gbench


Executing it gives this result:

2019-10-12 12:03:25
Running ./gbench
Run on (12 X 4600 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 256K (x6)
L3 Unified 12288K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
ComparisonFast             0.313 ns        0.313 ns   1000000000
ComparisonSlow          10137913 ns         3920 ns         1000
ComparisonFluctuating        993 ns          992 ns       706946


Running the tests individually takes 0.365 s, 11.274 s, and 0.828 s.
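For completeness: individual benchmarks can be run one at a time via Google Benchmark's --benchmark_filter option, for example:

./gbench --benchmark_filter=ComparisonFast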

## nonius

It gives lots of statistics, which makes the output a bit hard to read, and it is not as straightforward to use as I'd like. I am not sure if it is still actively maintained; the homepage has been down for a while. Get it here: nonius

### Sourcecode

```cpp
#define NONIUS_RUNNER
#include <nonius/nonius.h++>

// g++ -O2 main.cpp -pthread -I. -o m

#include <chrono>
#include <random>
#include <thread>

NONIUS_PARAM(X, UINT64_C(1))

template <typename Fn>
struct volatilize_fn {
    Fn fn;
    auto operator()() const -> decltype(fn()) {
        volatile auto x = fn();
        return x;
    }
};

template <typename Fn>
auto volatilize(Fn&& fn) -> volatilize_fn<typename std::decay<Fn>::type> {
    return {std::forward<Fn>(fn)};
}

NONIUS_BENCHMARK("x += x", [](nonius::chronometer meter) {
    auto x = meter.param<X>();
    meter.measure(volatilize([&]() { return x += x; }));
})

NONIUS_BENCHMARK("sleep 10ms", [] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
})

NONIUS_BENCHMARK("random fluctuations", [](nonius::chronometer meter) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    meter.measure([&] {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    });
})
```

### Results

clock resolution: mean is 22.0426 ns (20480002 iterations)

new round for parameters
X = 1

benchmarking x += x
collecting 100 samples, 56376 iterations each, in estimated 0 ns
mean: 0.391109 ns, lb 0.391095 ns, ub 0.391135 ns, ci 0.95
std dev: 9.50619e-05 ns, lb 6.25215e-05 ns, ub 0.000167224 ns, ci 0.95
found 4 outliers among 100 samples (4%)
variance is unaffected by outliers

benchmarking sleep 10ms
collecting 100 samples, 1 iterations each, in estimated 1013.66 ms
mean: 10.1258 ms, lb 10.1189 ms, ub 10.1313 ms, ci 0.95
std dev: 31.1777 μs, lb 26.5814 μs, ub 35.4952 μs, ci 0.95
found 13 outliers among 100 samples (13%)
variance is unaffected by outliers

benchmarking random fluctuations
collecting 100 samples, 23 iterations each, in estimated 2.2724 ms
mean: 1016.26 ns, lb 991.161 ns, ub 1041.66 ns, ci 0.95
std dev: 128.963 ns, lb 109.803 ns, ub 159.509 ns, ci 0.95
found 2 outliers among 100 samples (2%)
variance is severely inflated by outliers


The tests individually take 0.713 s, 1.883 s, and 0.819 s, plus a startup overhead of 1.611 s.

## Picobench

It took me a while to figure out that I have to configure the slow test, otherwise it would run for a very long time. The number of iterations is hardcoded; this library seems very basic. Get it here: picobench

### Sourcecode

```cpp
#define PICOBENCH_IMPLEMENT_WITH_MAIN
#include "picobench.hpp"

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

// https://github.com/iboB/picobench
// g++ -O2 picobench.cpp -o pb

PICOBENCH_SUITE("ComparisonFast");
static void ComparisonFast(picobench::state& state) {
    uint64_t x = 1;
    for (auto _ : state) {
        x += x;
    }
    state.set_result(x);
}
PICOBENCH(ComparisonFast);

PICOBENCH_SUITE("ComparisonSlow");
void ComparisonSlow(picobench::state& state) {
    for (auto _ : state) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
PICOBENCH(ComparisonSlow).iterations({1, 2, 5, 10});

PICOBENCH_SUITE("fluctuating");
void ComparisonFluctuating(picobench::state& state) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    for (auto _ : state) {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    }
}
PICOBENCH(ComparisonFluctuating);
```

### Results

ComparisonFast:
===============================================================================
Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
ComparisonFast * |       8 |     0.000 |       6 |      - |156862745.1
ComparisonFast * |      64 |     0.000 |       1 |      - |512000000.0
ComparisonFast * |     512 |     0.000 |       0 |      - |2560000000.0
ComparisonFast * |    4096 |     0.001 |       0 |      - |3110098709.2
ComparisonFast * |    8192 |     0.003 |       0 |      - |3141104294.5
===============================================================================
ComparisonSlow:
===============================================================================
Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
ComparisonSlow * |       1 |    10.056 |10055959 |      - |       99.4
ComparisonSlow * |       2 |    20.178 |10088773 |      - |       99.1
ComparisonSlow * |       5 |    50.570 |10114054 |      - |       98.9
ComparisonSlow * |      10 |   101.136 |10113643 |      - |       98.9
===============================================================================
fluctuating:
===============================================================================
Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
ComparisonFluctuating * |       8 |     0.012 |    1551 |      - |   644485.6
ComparisonFluctuating * |      64 |     0.068 |    1057 |      - |   945584.6
ComparisonFluctuating * |     512 |     0.565 |    1103 |      - |   906222.0
ComparisonFluctuating * |    4096 |     4.469 |    1090 |      - |   916619.4
ComparisonFluctuating * |    8192 |     9.003 |    1098 |      - |   909957.2
===============================================================================


It doesn’t really make sense to provide runtime numbers here, because picobench just executes the given number of iterations, and that’s it. No autotuning.

## Catch2

Catch2 is mostly a unit testing framework, and has recently integrated a benchmarking facility. It is very easy to use, but does not seem very configurable. I find the way it writes the output very confusing. Get it here: Catch2

### Sourcecode

```cpp
// https://github.com/catchorg/Catch2
// g++ -O2 catch.cpp -o c
#define CATCH_CONFIG_ENABLE_BENCHMARKING
#define CATCH_CONFIG_MAIN
#include "catch.hpp"

#include <chrono>
#include <random>
#include <thread>

TEST_CASE("comparison_fast") {
    uint64_t x = 1;
    BENCHMARK("x += x") {
        return x += x;
    };
}

TEST_CASE("comparison_slow") {
    BENCHMARK("sleep 10ms") {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    };
}

TEST_CASE("comparison_fluctuating_v2") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    BENCHMARK("random fluctuations") {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    };
}
```

### Results

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
c is a Catch v2.9.2 host application.
Run with -? for options

-------------------------------------------------------------------------------
comparison_fast
-------------------------------------------------------------------------------
catch.cpp:12
...............................................................................

benchmark name                                  samples       iterations    estimated
mean          low mean      high mean
std dev       low std dev   high std dev
-------------------------------------------------------------------------------
x += x                                                  100        12414    1.2414 ms
1 ns         1 ns         1 ns
0 ns         0 ns         0 ns

-------------------------------------------------------------------------------
comparison_slow
-------------------------------------------------------------------------------
catch.cpp:19
...............................................................................

benchmark name                                  samples       iterations    estimated
mean          low mean      high mean
std dev       low std dev   high std dev
-------------------------------------------------------------------------------
sleep 10ms                                              100            1    1.01319 s
10.1357 ms   10.1302 ms   10.1396 ms
23.539 us    18.061 us    29.575 us

-------------------------------------------------------------------------------
comparison_fluctuating_v2
-------------------------------------------------------------------------------
catch.cpp:25
...............................................................................

benchmark name                                  samples       iterations    estimated
mean          low mean      high mean
std dev       low std dev   high std dev
-------------------------------------------------------------------------------
random fluctuations                                     100           28    2.3324 ms
827 ns       810 ns       844 ns
88 ns        79 ns        99 ns

===============================================================================
test cases: 3 | 3 passed
assertions: - none -


## moodycamel::microbench

A very simple benchmarking tool with an API quite similar to ankerl::nanobench. No autotuning, no doNotOptimize, no output formatting. Get it here: moodycamel::microbench

### Sourcecode

```cpp
#include "microbench.h"

#include <chrono>
#include <iostream>
#include <random>
#include <thread>

// g++ -O2 -c systemtime.cpp
// g++ -O2 -c microbench.cpp
// g++ microbench.o systemtime.o -o mb

int main(int, char**) {
    // something fast
    uint64_t x = 1;
    std::cout << moodycamel::microbench([&]() { x += x; }, 10000000, 51)
              << " sec x += x (x==" << x << ")" << std::endl;

    std::cout << moodycamel::microbench([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }) << " sec sleep 10ms" << std::endl;

    std::random_device dev;
    std::mt19937_64 rng(dev());
    std::cout << moodycamel::microbench(
                     [&] {
                         // each run, perform a random number of
                         // rng calls
                         auto iterations = rng() & UINT64_C(0xff);
                         for (uint64_t i = 0; i < iterations; ++i) {
                             (void)rng();
                         }
                     },
                     1000, 51)
              << " sec random fluctuations" << std::endl;
}
```

### Results

3.12506e-07 sec x += x (x==0)
10.056 sec sleep 10ms
0.000661384 sec random fluctuations


## sltbench

A C++ benchmark framework that seems to have similar intentions to nanobench. It claims to be 4.7 times faster than googlebench. It has to be compiled and linked. I initially got a compile error because of a missing <cstdint> include. After that it compiled fine, and I created an example. I didn't like that I had to use global variables for the state needed in my ComparisonFast and ComparisonFluctuating benchmarks. Get it here: sltbench

### Sourcecode

```cpp
#include <sltbench/Bench.h>

// https://github.com/ivafanas/sltbench

#include <chrono>
#include <random>
#include <thread>

// cmake build as online instructions describes
//
// g++ -O3 -I/home/martinus/git/sltbench/install/include -c main.cpp
// g++ -o m -L/home/martinus/git/sltbench/install/lib main.o -lsltbench

uint64_t x = 1;
void ComparisonFast() {
    sltbench::DoNotOptimize(x += x);
}
SLTBENCH_FUNCTION(ComparisonFast);

void ComparisonSlow() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
SLTBENCH_FUNCTION(ComparisonSlow);

std::random_device dev;
std::mt19937_64 rng(dev());
void ComparisonFluctuating() {
    // each run, perform a random number of rng calls
    auto iterations = rng() & UINT64_C(0xff);
    for (uint64_t i = 0; i < iterations; ++i) {
        (void)rng();
    }
}
SLTBENCH_FUNCTION(ComparisonFluctuating);

SLTBENCH_MAIN();
```

### Results

benchmark                                                   arg                      status               time(ns)
ComparisonFast                                                                       ok                          1
ComparisonFluctuating                                                                ok                         20
ComparisonSlow                                                                       ok                   10055943


Interestingly, the executable takes exactly 3 seconds startup time, then each benchmark runs for about 0.2 seconds.

## Celero

Unfortunately I couldn’t get it working. I only got segmentation faults for my x += x benchmarks. Get it here: celero

## folly Benchmark

Facebook's folly comes with a benchmarking facility. It seems rather basic, but with good DoNotOptimizeAway functionality. Honestly, I was too lazy to get this working; too much installation hassle. Get it here: folly