Test Set

I’ve implemented three different benchmarks (Fast, Slow, and Fluctuating) in several frameworks for comparison.

Fast: Benchmarks x += x, starting from 1. This is a single instruction and is prone to being optimized away.
Slow: Benchmarks std::this_thread::sleep_for(10ms). For a microbenchmark this is very slow, and it is interesting to see how each framework’s autotuning deals with it.
Fluctuating: A microbenchmark where each evaluation takes a different time. The fluctuation comes from generating a random count of 0 to 255 random numbers with std::mt19937_64 in each evaluation.
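
To make the Fluctuating workload concrete, here is a minimal standalone sketch of one evaluation, mirroring the loop used in the implementations further down. The function names `fluctuatingWork` and `averageExtraCalls` are chosen for this illustration only. Masking with `0xff` yields a count from 0 to 255, so on average an evaluation makes about 127.5 extra rng calls.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One evaluation of the fluctuating workload. Returns how many extra rng
// calls were made (0 to 255), in addition to the one call that picks the count.
inline std::uint64_t fluctuatingWork(std::mt19937_64& rng) {
    std::uint64_t const iterations = rng() & UINT64_C(0xff);
    for (std::uint64_t i = 0; i < iterations; ++i) {
        (void)rng();
    }
    return iterations;
}

// Illustrative helper: average number of extra rng calls over many evaluations.
inline double averageExtraCalls(std::mt19937_64& rng, int evaluations) {
    assert(evaluations > 0);
    std::uint64_t total = 0;
    for (int i = 0; i < evaluations; ++i) {
        total += fluctuatingWork(rng);
    }
    return static_cast<double>(total) / static_cast<double>(evaluations);
}
```

The runtime of a single evaluation therefore varies by a factor of up to roughly 256, which is what makes this case hard for autotuning.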

All benchmarks are run on an i7-8700 CPU locked at 3.2GHz, using pyperf system tune.

Runtime

I wrote a little timing tool that measures exactly how long it takes to print benchmark output to the screen. With it I measured the runtimes of the major benchmarking frameworks that support automatic tuning of the number of iterations: Google Benchmark, Catch2, nonius, sltbench, and of course nanobench. All times in the table are in seconds.
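
The timing tool itself is not shown here. Purely as an illustration of the idea, the following sketch prefixes every line it reads with the seconds elapsed since it started, so piping a benchmark's output through it shows when each result appeared. The names `formatSeconds` and `stampLines` and the output format are assumptions made for this sketch, not the tool used for the measurements.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Formats a non-negative duration in seconds with millisecond resolution,
// e.g. 1.5 becomes "1.500". A local stream is used so the formatting flags
// of the caller's stream stay untouched.
inline std::string formatSeconds(double seconds) {
    assert(seconds >= 0.0);
    std::ostringstream oss;
    oss << std::fixed << std::setprecision(3) << seconds;
    return oss.str();
}

// Hypothetical helper: copies `in` to `out` line by line, prefixing each line
// with the time elapsed since this function was called.
// Returns the number of lines copied.
inline std::size_t stampLines(std::istream& in, std::ostream& out) {
    using Clock = std::chrono::steady_clock;
    auto const start = Clock::now();
    std::size_t numLines = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::chrono::duration<double> const elapsed = Clock::now() - start;
        out << formatSeconds(elapsed.count()) << "s | " << line << '\n';
        out.flush(); // make the line visible immediately
        ++numLines;
    }
    return numLines;
}
```

Calling `stampLines(std::cin, std::cout)` from a `main` function and piping a benchmark executable into it would give per-line timestamps. One caveat: many programs buffer their output when writing to a pipe, so the timestamps can lag behind the moment a line was actually produced.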

Benchmarking Framework	Fast	Slow	Fluctuating	Overhead	Total
Google Benchmark	0.367	11.259	0.825	0.000	12.451
Catch2	1.004	2.074	0.966	1.737	5.782
nonius	0.741	1.815	0.740	1.715	5.010
sltbench	0.202	0.204	0.203	3.001	3.610
nanobench	0.079	0.112	0.000	0.001	0.192

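Nanobench is clearly the fastest autotuning benchmarking framework, by an enormous margin: its total of 0.192 seconds is almost 19 times shorter than that of sltbench, the next fastest, and about 65 times shorter than that of Google Benchmark.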

Implementations & Output

nanobench

Sourcecode// https://github.com/martinus/nanobench
// g++ -O2 -I../../include main.cpp -o m

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

#include <chrono>
#include <random>
#include <thread>

int main(int, char**) {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("x += x", [&]() {
        ankerl::nanobench::doNotOptimizeAway(x += x);
    });

    ankerl::nanobench::Bench().run("sleep 10ms", [&]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });

    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().run("random fluctuations", [&]() {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    });
}

Results

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                0.31 |    3,192,709,967.58 |    0.0% |            1.00 |            1.00 |  0.999 |           0.00 |    0.0% |      0.00 | `x += x`
|       10,149,086.00 |               98.53 |    0.1% |           45.00 |        2,394.00 |  0.019 |           9.00 |   88.9% |      0.11 | `sleep 10ms`
|              744.50 |        1,343,183.34 |   11.2% |        2,815.05 |        2,375.86 |  1.185 |         524.73 |   12.5% |      0.00 | :wavy_dash: `random fluctuations` (Unstable with ~23.3 iters. Increase `minEpochIterations` to e.g. 233)

Google Benchmark

Very feature-rich and battle-proven, but a bit aged. Requires Google Test. Get it here: Google Benchmark

Sourcecode

#include "benchmark.h"

#include <chrono>
#include <random>
#include <thread>

// Build instructions: https://github.com/google/benchmark#installation
// curl --output benchmark.h https://raw.githubusercontent.com/google/benchmark/master/include/benchmark/benchmark.h
// g++ -O2 main.cpp -Lgit/benchmark/build/src -lbenchmark -lpthread -o m
void ComparisonFast(benchmark::State& state) {
    uint64_t x = 1;
    for (auto _ : state) {
        x += x;
    }
    benchmark::DoNotOptimize(x);
}
BENCHMARK(ComparisonFast);

void ComparisonSlow(benchmark::State& state) {
    for (auto _ : state) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
BENCHMARK(ComparisonSlow);

void ComparisonFluctuating(benchmark::State& state) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    for (auto _ : state) {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    }
}
BENCHMARK(ComparisonFluctuating);

BENCHMARK_MAIN();

Results

Compiled & linked with

g++ -O2 main.cpp -L/home/martinus/git/benchmark/build/src -lbenchmark -lpthread -o gbench

executing it gives this result:

2019-10-12 12:03:25
Running ./gbench
Run on (12 X 4600 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 256K (x6)
L3 Unified 12288K (x1)
Load Average: 0.21, 0.55, 0.60
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
ComparisonFast             0.313 ns        0.313 ns   1000000000
ComparisonSlow          10137913 ns         3920 ns         1000
ComparisonFluctuating        993 ns          992 ns       706946

Running the tests individually takes 0.365 s, 11.274 s, and 0.828 s, respectively.

nonius

It produces lots of statistics, which makes the output a bit hard to read, and it is more complicated and less straightforward than I’d like. I am not sure if it is still actively maintained; the homepage has been down for a while. Get it here: nonius

Sourcecode

#define NONIUS_RUNNER
#include <nonius/nonius_single.h++>

// g++ -O2 main.cpp -pthread -I. -o m

#include <chrono>
#include <random>
#include <thread>

NONIUS_PARAM(X, UINT64_C(1))

template <typename Fn>
struct volatilize_fn {
    Fn fn;
    auto operator()() const -> decltype(fn()) {
        volatile auto x = fn();
        return x;
    }
};

template <typename Fn>
auto volatilize(Fn&& fn) -> volatilize_fn<typename std::decay<Fn>::type> {
    return {std::forward<Fn>(fn)};
}

NONIUS_BENCHMARK("x += x", [](nonius::chronometer meter) {
    auto x = meter.param<X>();
    meter.measure(volatilize([&]() {
        return x += x;
    }));
})

NONIUS_BENCHMARK("sleep 10ms", [] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
})

NONIUS_BENCHMARK("random fluctuations", [](nonius::chronometer meter) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    meter.measure([&] {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    });
})

Results

clock resolution: mean is 22.0426 ns (20480002 iterations)


new round for parameters
X = 1

benchmarking x += x
collecting 100 samples, 56376 iterations each, in estimated 0 ns
mean: 0.391109 ns, lb 0.391095 ns, ub 0.391135 ns, ci 0.95
std dev: 9.50619e-05 ns, lb 6.25215e-05 ns, ub 0.000167224 ns, ci 0.95
found 4 outliers among 100 samples (4%)
variance is unaffected by outliers

benchmarking sleep 10ms
collecting 100 samples, 1 iterations each, in estimated 1013.66 ms
mean: 10.1258 ms, lb 10.1189 ms, ub 10.1313 ms, ci 0.95
std dev: 31.1777 μs, lb 26.5814 μs, ub 35.4952 μs, ci 0.95
found 13 outliers among 100 samples (13%)
variance is unaffected by outliers

benchmarking random fluctuations
collecting 100 samples, 23 iterations each, in estimated 2.2724 ms
mean: 1016.26 ns, lb 991.161 ns, ub 1041.66 ns, ci 0.95
std dev: 128.963 ns, lb 109.803 ns, ub 159.509 ns, ci 0.95
found 2 outliers among 100 samples (2%)
variance is severely inflated by outliers

The tests individually take 0.713 s, 1.883 s, and 0.819 s, plus a startup overhead of 1.611 s.

Picobench

It took me a while to figure out that I had to configure the iterations of the slow test; otherwise it would run for a looong time. The number of iterations is hardcoded, and the library seems very basic. Get it here: picobench

Sourcecode

#define PICOBENCH_IMPLEMENT_WITH_MAIN
#include "picobench.hpp"

#include <chrono>
#include <initializer_list>
#include <random>
#include <thread>


// https://github.com/iboB/picobench
// g++ -O2 picobench.cpp -o pb

PICOBENCH_SUITE("ComparisonFast");
static void ComparisonFast(picobench::state& state) {
    uint64_t x = 1;
    for (auto _ : state) {
        x += x;
    }
    state.set_result(x);
}
PICOBENCH(ComparisonFast);

PICOBENCH_SUITE("ComparisonSlow");
void ComparisonSlow(picobench::state& state) {
    for (auto _ : state) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
PICOBENCH(ComparisonSlow).iterations({1, 2, 5, 10});

PICOBENCH_SUITE("fluctuating");
void ComparisonFluctuating(picobench::state& state) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    for (auto _ : state) {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    }
}
PICOBENCH(ComparisonFluctuating);

Results

ComparisonFast:
===============================================================================
   Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
         ComparisonFast * |       8 |     0.000 |       6 |      - |156862745.1
         ComparisonFast * |      64 |     0.000 |       1 |      - |512000000.0
         ComparisonFast * |     512 |     0.000 |       0 |      - |2560000000.0
         ComparisonFast * |    4096 |     0.001 |       0 |      - |3110098709.2
         ComparisonFast * |    8192 |     0.003 |       0 |      - |3141104294.5
===============================================================================
ComparisonSlow:
===============================================================================
   Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
         ComparisonSlow * |       1 |    10.056 |10055959 |      - |       99.4
         ComparisonSlow * |       2 |    20.178 |10088773 |      - |       99.1
         ComparisonSlow * |       5 |    50.570 |10114054 |      - |       98.9
         ComparisonSlow * |      10 |   101.136 |10113643 |      - |       98.9
===============================================================================
fluctuating:
===============================================================================
   Name (baseline is *)   |   Dim   |  Total ms |  ns/op  |Baseline| Ops/second
===============================================================================
  ComparisonFluctuating * |       8 |     0.012 |    1551 |      - |   644485.6
  ComparisonFluctuating * |      64 |     0.068 |    1057 |      - |   945584.6
  ComparisonFluctuating * |     512 |     0.565 |    1103 |      - |   906222.0
  ComparisonFluctuating * |    4096 |     4.469 |    1090 |      - |   916619.4
  ComparisonFluctuating * |    8192 |     9.003 |    1098 |      - |   909957.2
===============================================================================

It doesn’t really make sense to provide runtime numbers here, because picobench just executes the given number of iterations, and that’s it. No autotuning.
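
For contrast, autotuning in its simplest form means that the framework picks the iteration count itself. The sketch below is a generic illustration and not the algorithm of any framework compared here: it doubles the iteration count until one batch runs for at least a target duration, then reports the average time per iteration. The names `TuneResult`, `autotune`, and `minBatchTime` are made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

struct TuneResult {
    std::uint64_t iterations; // batch size that reached the target duration
    double nsPerIteration;    // average time of one call within that batch
};

// Doubles the batch size until a whole batch takes at least `minBatchTime`.
template <typename Fn>
TuneResult autotune(Fn&& fn, std::chrono::nanoseconds minBatchTime) {
    assert(minBatchTime.count() > 0);
    using Clock = std::chrono::steady_clock;
    auto const maxIterations = std::numeric_limits<std::uint64_t>::max() / 2;

    std::uint64_t iterations = 1;
    for (;;) {
        auto const begin = Clock::now();
        for (std::uint64_t i = 0; i < iterations; ++i) {
            fn();
        }
        auto const elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - begin);

        // Stop once the batch is long enough, or before the counter would overflow.
        if (elapsed >= minBatchTime || iterations > maxIterations) {
            return {iterations,
                    static_cast<double>(elapsed.count()) / static_cast<double>(iterations)};
        }
        iterations *= 2;
    }
}
```

With a scheme like this, a 10 ms sleep reaches a 5 ms target with a single iteration, while something as cheap as x += x needs many doublings before a batch becomes long enough to time reliably. That difference is the kind of behavior the Fast and Slow cases are meant to expose.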

Catch2

Catch2 is mostly a unit testing framework, and has recently integrated a benchmarking facility. It is very easy to use, but does not seem very configurable. I find the way it writes the output very confusing. Get it here: Catch2

Sourcecode

// https://github.com/catchorg/Catch2
// g++ -O2 catch.cpp -o c

#define CATCH_CONFIG_ENABLE_BENCHMARKING
#define CATCH_CONFIG_MAIN
#include "catch.hpp" // NOLINT

#include <chrono>
#include <random>
#include <thread>

TEST_CASE("comparison_fast"){
    uint64_t x = 1;
    BENCHMARK("x += x") {
        return x += x;
    };
}

TEST_CASE("comparison_slow") {
    BENCHMARK("sleep 10ms") {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    };
}

// NOLINTNEXTLINE(fuchsia-statically-constructed-objects,llvmlibc-implementation-in-namespace)
TEST_CASE("comparison_fluctuating_v2") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    BENCHMARK("random fluctuations") {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    };
}

Results

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
c is a Catch v2.9.2 host application.
Run with -? for options

-------------------------------------------------------------------------------
comparison_fast
-------------------------------------------------------------------------------
catch.cpp:12
...............................................................................

benchmark name                                  samples       iterations    estimated
                                                mean          low mean      high mean
                                                std dev       low std dev   high std dev
-------------------------------------------------------------------------------
x += x                                                  100        12414    1.2414 ms 
                                                       1 ns         1 ns         1 ns 
                                                       0 ns         0 ns         0 ns 
                                                                                      

-------------------------------------------------------------------------------
comparison_slow
-------------------------------------------------------------------------------
catch.cpp:19
...............................................................................

benchmark name                                  samples       iterations    estimated
                                                mean          low mean      high mean
                                                std dev       low std dev   high std dev
-------------------------------------------------------------------------------
sleep 10ms                                              100            1    1.01319 s 
                                                 10.1357 ms   10.1302 ms   10.1396 ms 
                                                  23.539 us    18.061 us    29.575 us 
                                                                                      

-------------------------------------------------------------------------------
comparison_fluctuating_v2
-------------------------------------------------------------------------------
catch.cpp:25
...............................................................................

benchmark name                                  samples       iterations    estimated
                                                mean          low mean      high mean
                                                std dev       low std dev   high std dev
-------------------------------------------------------------------------------
random fluctuations                                     100           28    2.3324 ms 
                                                     827 ns       810 ns       844 ns 
                                                      88 ns        79 ns        99 ns 
                                                                                      

===============================================================================
test cases: 3 | 3 passed
assertions: - none -

moodycamel::microbench

A very simple benchmarking tool with an API that’s very similar to ankerl::nanobench. No autotuning, no doNotOptimize, no output formatting. Get it here: moodycamel::microbench

Sourcecode

#include "microbench.h"

#include <chrono>
#include <iostream>
#include <random>
#include <thread>

// g++ -O2 -c systemtime.cpp
// g++ -O2 -c microbench.cpp
// g++ microbench.o systemtime.o -o mb
int main(int, char**) {
    // something fast
    uint64_t x = 1;
    std::cout << moodycamel::microbench(
                     [&]() {
                         x += x;
                     },
                     10000000, 51)
              << " sec x += x (x==" << x << ")" << std::endl;

    std::cout << moodycamel::microbench([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }) << " sec sleep 10ms"
              << std::endl;

    std::random_device dev;
    std::mt19937_64 rng(dev());
    std::cout << moodycamel::microbench(
                     [&] {
                         // each run, perform a random number of rng calls
                         auto iterations = rng() & UINT64_C(0xff);
                         for (uint64_t i = 0; i < iterations; ++i) {
                             (void)rng();
                         }
                     },
                     1000, 51)
              << " sec random fluctuations" << std::endl;
}

Results

3.12506e-07 sec x += x (x==0)
10.056 sec sleep 10ms
0.000661384 sec random fluctuations
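
Since microbench has no doNotOptimize helper, the example above prints x to keep the computation alive. One portable but imperfect alternative is to store the result in a `volatile` variable, which is the same idea as the `volatilize` wrapper in the nonius example. The names `keepAlive` and `doubleRepeatedly` are made up for this sketch; frameworks such as nanobench and Google Benchmark use compiler-specific tricks instead, which usually add less overhead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: forces `value` to be materialized by copying it into
// a volatile object, which the compiler is not allowed to drop.
template <typename T>
inline void keepAlive(T const& value) {
    volatile T sink = value;
    (void)sink;
}

// Doubles x `n` times, keeping every intermediate result observable.
inline std::uint64_t doubleRepeatedly(std::uint64_t x, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        x += x;
        keepAlive(x);
    }
    return x;
}
```

Because unsigned arithmetic wraps around, `doubleRepeatedly(1, 64)` returns 0, which is also why the output above reports `x==0` after millions of doublings.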

sltbench

A C++ benchmarking library that seems to have similar intentions to nanobench. It claims to be 4.7 times faster than Google Benchmark. It has to be compiled and linked. I initially got a compile error because of a missing <cstdint> include. After fixing that it compiled fine, and I created an example. I didn’t like that I had to use global variables for the state that I needed in my ComparisonFast and ComparisonFluctuating benchmarks. Get it here: sltbench

Sourcecode

#include <sltbench/Bench.h> // https://github.com/ivafanas/sltbench

#include <chrono>
#include <random>
#include <thread>

// build with cmake as described in the online instructions
//
// g++ -O3 -I/home/martinus/git/sltbench/install/include -c main.cpp
// g++ -o m -L/home/martinus/git/sltbench/install/lib main.o -lsltbench

uint64_t x = 1;
void ComparisonFast() {
    sltbench::DoNotOptimize(x += x);
}

SLTBENCH_FUNCTION(ComparisonFast);

void ComparisonSlow() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
SLTBENCH_FUNCTION(ComparisonSlow);

std::random_device dev;
std::mt19937_64 rng(dev());

void ComparisonFluctuating() {
    // each run, perform a random number of rng calls
    auto iterations = rng() & UINT64_C(0xff);
    for (uint64_t i = 0; i < iterations; ++i) {
        (void)rng();
    }
}
SLTBENCH_FUNCTION(ComparisonFluctuating);

SLTBENCH_MAIN();

Results

benchmark                                                   arg                      status               time(ns)
ComparisonFast                                                                       ok                          1
ComparisonFluctuating                                                                ok                         20
ComparisonSlow                                                                       ok                   10055943

Interestingly, the executable takes exactly 3 seconds startup time, then each benchmark runs for about 0.2 seconds.

Celero

Unfortunately I couldn’t get it working. I only got segmentation faults for my x += x benchmarks. Get it here: celero

folly Benchmark

Facebook’s folly comes with a benchmarking facility. It seems rather basic, but has good DoNotOptimizeAway functionality. Honestly, I was too lazy to get this working. Too much installation hassle. Get it here: folly