Test Set

I’ve implemented three benchmarks (Fast, Slow, and Fluctuating) in several frameworks for comparison.

Fast

Benchmarks x += x, starting from 1. This is a single instruction, and it is prone to being optimized away.
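To illustrate why such a benchmark needs protection: without a barrier, an optimizing compiler is free to collapse or delete the loop. The following is a minimal sketch of the kind of compiler barrier that helpers like doNotOptimizeAway typically rely on. It assumes GCC or Clang extended inline assembly; keepAlive and doubleRepeatedly are hypothetical names, and this is not nanobench's actual implementation.

```cpp
#include <cstdint>

// Hypothetical helper (assumes GCC/Clang extended inline asm): the empty asm
// statement claims to read and modify `value`, so the compiler has to compute
// the value before this point and cannot assume it is unchanged afterwards.
template <typename T>
inline void keepAlive(T& value) {
    asm volatile("" : "+r"(value) : : "memory");
}

// Doubles x `n` times, forcing every intermediate result to be materialized.
inline uint64_t doubleRepeatedly(uint64_t n) {
    uint64_t x = 1;
    for (uint64_t i = 0; i < n; ++i) {
        x += x;
        keepAlive(x);
    }
    return x;
}
```

Calling doubleRepeatedly(10) yields 1024. The barrier emits no instructions itself, so the measured loop body is still just the addition.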

Slow

Benchmarks std::this_thread::sleep_for(10ms). For a microbenchmark this is very slow, which makes it a good test of how each framework’s autotuning copes.
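As a rough illustration of what "autotuning" means here, the sketch below doubles the iteration count until one timed batch lasts at least a target duration. This is a hypothetical scheme, not the algorithm of any framework in this comparison; chooseIterations, measure, and target are names made up for the example.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>

using Nanos = std::chrono::nanoseconds;

// Hypothetical autotuning loop (not taken from any framework discussed here):
// keep doubling the iteration count until one timed batch lasts at least
// `target`. `measure(n)` must return how long n iterations took.
inline uint64_t chooseIterations(const std::function<Nanos(uint64_t)>& measure,
                                 Nanos target) {
    uint64_t iters = 1;
    // The upper bound only guards against a measure() that never reaches target.
    while (measure(iters) < target && iters < (UINT64_C(1) << 40)) {
        iters *= 2;
    }
    return iters;
}
```

With an operation that costs 10 ns and a 1000 ns target, this settles on 128 iterations per batch. A 10 ms sleep exceeds that target with a single iteration, so for the Slow benchmark the total runtime depends mostly on how many batches a framework then insists on collecting.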

Fluctuating

A microbenchmark where each evaluation takes a different amount of time. The fluctuation comes from drawing a random count between 0 and 255 and then generating that many random numbers with std::mt19937_64.
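The workload can be isolated into a small function to make the spread visible. This sketch mirrors the loop body used in the implementations further down; the function name fluctuatingStep and the returned count are additions for illustration only.

```cpp
#include <cstdint>
#include <random>

// One evaluation of the fluctuating workload: the low 8 bits of a random
// number choose how many further rng calls to make (0 to 255).
// Returns that count so the spread can be inspected.
inline uint64_t fluctuatingStep(std::mt19937_64& rng) {
    uint64_t iterations = rng() & UINT64_C(0xff);
    for (uint64_t i = 0; i < iterations; ++i) {
        (void)rng();
    }
    return iterations;
}
```

Since the count is uniform over 0 to 255, an evaluation performs 127.5 extra rng calls on average, with individual evaluations ranging from almost nothing to twice that.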

All benchmarks are run on an i7-8700 CPU locked at 3.2GHz, using pyperf system tune.

Runtime

I wrote a little timing tool that measures exactly how long it takes to print benchmark output to the screen. With it I measured the runtimes of the major benchmarking frameworks that support automatic tuning of the number of iterations: Google Benchmark, Catch2, nonius, sltbench, and of course nanobench.

Total Runtimes

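| Benchmarking Framework | Fast (s) | Slow (s) | Fluctuating (s) | Overhead (s) | total (s) |
|:-----------------------|---------:|---------:|----------------:|-------------:|----------:|
| Google Benchmark       |    0.367 |   11.259 |           0.825 |        0.000 |    12.451 |
| Catch2                 |    1.004 |    2.074 |           0.966 |        1.737 |     5.782 |
| nonius                 |    0.741 |    1.815 |           0.740 |        1.715 |     5.010 |
| sltbench               |    0.202 |    0.204 |           0.203 |        3.001 |     3.610 |
| nanobench              |    0.079 |    0.112 |           0.000 |        0.001 |     0.192 |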

nanobench is clearly the fastest autotuning benchmarking framework, by an enormous margin: its total of 0.192 s is almost 19 times shorter than that of the runner-up, sltbench, at 3.610 s.

Implementations & Output

nanobench

Sourcecode

// https://github.com/martinus/nanobench
// g++ -O2 -I../../include main.cpp -o m

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

int main(int, char**) {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("x += x", [&]() {
        ankerl::nanobench::doNotOptimizeAway(x += x);
    });

    ankerl::nanobench::Bench().run("sleep 10ms", [&]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });

    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().run("random fluctuations", [&]() {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    });
}

Results

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                0.31 |    3,192,709,967.58 |    0.0% |            1.00 |            1.00 |  0.999 |           0.00 |    0.0% |      0.00 | `x += x`
|       10,149,086.00 |               98.53 |    0.1% |           45.00 |        2,394.00 |  0.019 |           9.00 |   88.9% |      0.11 | `sleep 10ms`
|              744.50 |        1,343,183.34 |   11.2% |        2,815.05 |        2,375.86 |  1.185 |         524.73 |   12.5% |      0.00 | :wavy_dash: `random fluctuations` (Unstable with ~23.3 iters. Increase `minEpochIterations` to e.g. 233)

Google Benchmark

Very feature-rich and battle-proven, but a bit aged. Requires Google Test. Get it here: Google Benchmark

Sourcecode

#include "benchmark.h"

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

// Build instructions: https://github.com/google/benchmark#installation
// curl --output benchmark.h https://raw.githubusercontent.com/google/benchmark/master/include/benchmark/benchmark.h
// g++ -O2 main.cpp -Lgit/benchmark/build/src -lbenchmark -lpthread -o m
void ComparisonFast(benchmark::State& state) {
    uint64_t x = 1;
    for (auto _ : state) {
        x += x;
    }
    benchmark::DoNotOptimize(x);
}
BENCHMARK(ComparisonFast);

void ComparisonSlow(benchmark::State& state) {
    for (auto _ : state) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
BENCHMARK(ComparisonSlow);

void ComparisonFluctuating(benchmark::State& state) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    for (auto _ : state) {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    }
}
BENCHMARK(ComparisonFluctuating);

BENCHMARK_MAIN();

Results

Compiled & linked with

g++ -O2 main.cpp -L/home/martinus/git/benchmark/build/src -lbenchmark -lpthread -o gbench

executing it gives this result:

2019-10-12 12:03:25
Running ./gbench
Run on (12 X 4600 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 256K (x6)
L3 Unified 12288K (x1)
Load Average: 0.21, 0.55, 0.60
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
ComparisonFast             0.313 ns        0.313 ns   1000000000
ComparisonSlow          10137913 ns         3920 ns         1000
ComparisonFluctuating        993 ns          992 ns       706946

Running the tests individually takes 0.365 s, 11.274 s, and 0.828 s.

nonius

It produces lots of statistics, which makes the output a bit hard to read, and it is not as straightforward to use as I’d like. I am not sure if it is still actively maintained; the homepage has been down for a while. Get it here: nonius

Sourcecode

#define NONIUS_RUNNER
#include <nonius/nonius_single.h++>

// g++ -O2 main.cpp -pthread -I. -o m

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>
#include <type_traits>
#include <utility>

NONIUS_PARAM(X, UINT64_C(1))

template <typename Fn>
struct volatilize_fn {
    Fn fn;
    auto operator()() const -> decltype(fn()) {
        volatile auto x = fn();
        return x;
    }
};

template <typename Fn>
auto volatilize(Fn&& fn) -> volatilize_fn<typename std::decay<Fn>::type> {
    return {std::forward<Fn>(fn)};
}

NONIUS_BENCHMARK("x += x", [](nonius::chronometer meter) {
    auto x = meter.param<X>();
    meter.measure(volatilize([&]() {
        return x += x;
    }));
})

NONIUS_BENCHMARK("sleep 10ms", [] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
})

NONIUS_BENCHMARK("random fluctuations", [](nonius::chronometer meter) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    meter.measure([&] {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    });
})

Results

clock resolution: mean is 22.0426 ns (20480002 iterations)


new round for parameters
X = 1

benchmarking x += x
collecting 100 samples, 56376 iterations each, in estimated 0 ns
mean: 0.391109 ns, lb 0.391095 ns, ub 0.391135 ns, ci 0.95
std dev: 9.50619e-05 ns, lb 6.25215e-05 ns, ub 0.000167224 ns, ci 0.95
found 4 outliers among 100 samples (4%)
variance is unaffected by outliers

benchmarking sleep 10ms
collecting 100 samples, 1 iterations each, in estimated 1013.66 ms
mean: 10.1258 ms, lb 10.1189 ms, ub 10.1313 ms, ci 0.95
std dev: 31.1777 μs, lb 26.5814 μs, ub 35.4952 μs, ci 0.95
found 13 outliers among 100 samples (13%)
variance is unaffected by outliers

benchmarking random fluctuations
collecting 100 samples, 23 iterations each, in estimated 2.2724 ms
mean: 1016.26 ns, lb 991.161 ns, ub 1041.66 ns, ci 0.95
std dev: 128.963 ns, lb 109.803 ns, ub 159.509 ns, ci 0.95
found 2 outliers among 100 samples (2%)
variance is severely inflated by outliers

Individually, the tests take 0.713 s, 1.883 s, and 0.819 s, plus a startup overhead of 1.611 s.

Picobench

It took me a while to figure out that I have to configure the slow test; otherwise it would run for a looong time. The number of iterations is hardcoded, and the library seems very basic. Get it here: picobench

Sourcecode

#define PICOBENCH_IMPLEMENT_WITH_MAIN
#include "picobench.hpp"

#include <chrono>
#include <cstdint>
#include <initializer_list>
#include <random>
#include <thread>

// https://github.com/iboB/picobench
// g++ -O2 picobench.cpp -o pb

PICOBENCH_SUITE("ComparisonFast");
static void ComparisonFast(picobench::state& state) {
    uint64_t x = 1;
    for (auto _ : state) {
        x += x;
    }
    state.set_result(x);
}
PICOBENCH(ComparisonFast);

PICOBENCH_SUITE("ComparisonSlow");
void ComparisonSlow(picobench::state& state) {
    for (auto _ : state) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
PICOBENCH(ComparisonSlow).iterations({1, 2, 5, 10});

PICOBENCH_SUITE("fluctuating");
void ComparisonFluctuating(picobench::state& state) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    for (auto _ : state) {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    }
}
PICOBENCH(ComparisonFluctuating);

Results

ComparisonFast:
===============================================================================
   Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
         ComparisonFast * |       8 |     0.000 |       6 |      - |156862745.1
         ComparisonFast * |      64 |     0.000 |       1 |      - |512000000.0
         ComparisonFast * |     512 |     0.000 |       0 |      - |2560000000.0
         ComparisonFast * |    4096 |     0.001 |       0 |      - |3110098709.2
         ComparisonFast * |    8192 |     0.003 |       0 |      - |3141104294.5
===============================================================================
ComparisonSlow:
===============================================================================
   Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
         ComparisonSlow * |       1 |    10.056 |10055959 |      - |       99.4
         ComparisonSlow * |       2 |    20.178 |10088773 |      - |       99.1
         ComparisonSlow * |       5 |    50.570 |10114054 |      - |       98.9
         ComparisonSlow * |      10 |   101.136 |10113643 |      - |       98.9
===============================================================================
fluctuating:
===============================================================================
   Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
  ComparisonFluctuating * |       8 |     0.012 |    1551 |      - |   644485.6
  ComparisonFluctuating * |      64 |     0.068 |    1057 |      - |   945584.6
  ComparisonFluctuating * |     512 |     0.565 |    1103 |      - |   906222.0
  ComparisonFluctuating * |    4096 |     4.469 |    1090 |      - |   916619.4
  ComparisonFluctuating * |    8192 |     9.003 |    1098 |      - |   909957.2
===============================================================================

It doesn’t really make sense to provide runtime numbers here, because picobench just executes the given number of iterations, and that’s it. No autotuning.

Catch2

Catch2 is mostly a unit testing framework that has recently gained an integrated benchmarking facility. It is very easy to use, but does not seem very configurable. I find the way it writes the output very confusing. Get it here: Catch2

Sourcecode

// https://github.com/catchorg/Catch2
// g++ -O2 catch.cpp -o c

#define CATCH_CONFIG_ENABLE_BENCHMARKING
#define CATCH_CONFIG_MAIN
#include "catch.hpp" // NOLINT

#include <chrono>
#include <random>
#include <thread>

TEST_CASE("comparison_fast"){
    uint64_t x = 1;
    BENCHMARK("x += x") {
        return x += x;
    };
}

TEST_CASE("comparison_slow") {
    BENCHMARK("sleep 10ms") {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    };
}

// NOLINTNEXTLINE(fuchsia-statically-constructed-objects,llvmlibc-implementation-in-namespace)
TEST_CASE("comparison_fluctuating_v2") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    BENCHMARK("random fluctuations") {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    };
}

Results

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
c is a Catch v2.9.2 host application.
Run with -? for options

-------------------------------------------------------------------------------
comparison_fast
-------------------------------------------------------------------------------
catch.cpp:12
...............................................................................

benchmark name                                  samples       iterations    estimated
                                                mean          low mean      high mean
                                                std dev       low std dev   high std dev
-------------------------------------------------------------------------------
x += x                                                  100        12414    1.2414 ms 
                                                       1 ns         1 ns         1 ns 
                                                       0 ns         0 ns         0 ns 
                                                                                      

-------------------------------------------------------------------------------
comparison_slow
-------------------------------------------------------------------------------
catch.cpp:19
...............................................................................

benchmark name                                  samples       iterations    estimated
                                                mean          low mean      high mean
                                                std dev       low std dev   high std dev
-------------------------------------------------------------------------------
sleep 10ms                                              100            1    1.01319 s 
                                                 10.1357 ms   10.1302 ms   10.1396 ms 
                                                  23.539 us    18.061 us    29.575 us 
                                                                                      

-------------------------------------------------------------------------------
comparison_fluctuating_v2
-------------------------------------------------------------------------------
catch.cpp:25
...............................................................................

benchmark name                                  samples       iterations    estimated
                                                mean          low mean      high mean
                                                std dev       low std dev   high std dev
-------------------------------------------------------------------------------
random fluctuations                                     100           28    2.3324 ms 
                                                     827 ns       810 ns       844 ns 
                                                      88 ns        79 ns        99 ns 
                                                                                      

===============================================================================
test cases: 3 | 3 passed
assertions: - none -

moodycamel::microbench

A very simple benchmarking tool with an API that’s very similar to ankerl::nanobench. No autotuning, no doNotOptimize, no output formatting. Get it here: moodycamel::microbench

Sourcecode

#include "microbench.h"

#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <thread>

// g++ -O2 -c systemtime.cpp
// g++ -O2 -c microbench.cpp
// g++ microbench.o systemtime.o -o mb
int main(int, char**) {
    // something fast
    uint64_t x = 1;
    std::cout << moodycamel::microbench(
                     [&]() {
                         x += x;
                     },
                     10000000, 51)
              << " sec x += x (x==" << x << ")" << std::endl;

    std::cout << moodycamel::microbench([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }) << " sec sleep 10ms"
              << std::endl;

    std::random_device dev;
    std::mt19937_64 rng(dev());
    std::cout << moodycamel::microbench(
                     [&] {
                         // each run, perform a random number of rng calls
                         auto iterations = rng() & UINT64_C(0xff);
                         for (uint64_t i = 0; i < iterations; ++i) {
                             (void)rng();
                         }
                     },
                     1000, 51)
              << " sec random fluctuations" << std::endl;
}

Results

3.12506e-07 sec x += x (x==0)
10.056 sec sleep 10ms
0.000661384 sec random fluctuations
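The x==0 in the first output line is expected rather than a sign of a bug: x is a 64-bit unsigned integer that starts at 1 and is doubled far more than 64 times, and unsigned arithmetic wraps modulo 2^64. A small standalone check of that arithmetic (the helper name doubled is made up for this example):

```cpp
#include <cstdint>

// Doubles a 64-bit unsigned value `n` times, starting from 1. Unsigned
// arithmetic wraps modulo 2^64, so the single set bit is shifted out on the
// 64th doubling and the value stays 0 from then on.
inline uint64_t doubled(uint64_t n) {
    uint64_t x = 1;
    for (uint64_t i = 0; i < n; ++i) {
        x += x;
    }
    return x;
}
```

doubled(63) is still 2^63, while doubled(64) and every larger count give 0. The same wraparound happens in all the "x += x" benchmarks in this comparison; it does not affect the timing because the addition is executed either way.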

sltbench

A C++ benchmarking library which seems to have similar intentions to nanobench. It claims to be 4.7 times faster than Google Benchmark. It has to be compiled and linked. I initially got a compile error because of a missing <cstdint> include. After that it compiled fine, and I created an example. I didn’t like that I had to use global variables for the state that I needed in my ComparisonFast and ComparisonFluctuating benchmarks. Get it here: sltbench

Sourcecode

#include <sltbench/Bench.h> // https://github.com/ivafanas/sltbench

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

// cmake build as online instructions describes
//
// g++ -O3 -I/home/martinus/git/sltbench/install/include -c main.cpp
// g++ -o m -L/home/martinus/git/sltbench/install/lib main.o -lsltbench

uint64_t x = 1;
void ComparisonFast() {
    sltbench::DoNotOptimize(x += x);
}

SLTBENCH_FUNCTION(ComparisonFast);

void ComparisonSlow() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
SLTBENCH_FUNCTION(ComparisonSlow);

std::random_device dev;
std::mt19937_64 rng(dev());

void ComparisonFluctuating() {
    // each run, perform a random number of rng calls
    auto iterations = rng() & UINT64_C(0xff);
    for (uint64_t i = 0; i < iterations; ++i) {
        (void)rng();
    }
}
SLTBENCH_FUNCTION(ComparisonFluctuating);

SLTBENCH_MAIN();

Results

benchmark                                                   arg                      status               time(ns)
ComparisonFast                                                                       ok                          1
ComparisonFluctuating                                                                ok                         20
ComparisonSlow                                                                       ok                   10055943

Interestingly, the executable takes almost exactly 3 seconds to start up, then each benchmark runs for about 0.2 seconds.

Celero

Unfortunately I couldn’t get it working. I only got segmentation faults for my x += x benchmarks. Get it here: celero

folly Benchmark

Facebook’s folly comes with a benchmarking facility. It seems rather basic, but has good DoNotOptimizeAway functionality. Honestly, I was too lazy to get this working: too much installation hassle. Get it here: folly