Test Set
I’ve implemented the three different benchmarks (Fast, Slow, and Fluctuating) in several frameworks for comparison.

- **Fast**: Benchmarks `x += x`, starting from 1. This is a single instruction, and prone to be optimized away.
- **Slow**: Benchmarks `std::this_thread::sleep_for(10ms)`. For a microbenchmark this is very slow, and it is interesting to see how each framework’s autotuning deals with it.
- **Fluctuating**: A microbenchmark where each evaluation takes a different time. The randomly fluctuating runtime is achieved by drawing between 0 and 255 random numbers with `std::mt19937_64`.

All benchmarks are run on an i7-8700 CPU locked at 3.2 GHz, tuned with `pyperf system tune`.
Runtime
I wrote a little timing tool that measures exactly how long it takes for a benchmark to run and print its output to the screen. With it I have measured the runtimes of the major benchmarking frameworks that support automatic tuning of the number of iterations: Google Benchmark, Catch2, nonius, sltbench, and of course nanobench.
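The timing tool itself is not shown in this post; a minimal sketch of the idea could look like this (`measure_seconds` is a hypothetical name, not the actual tool used for the numbers below). It wraps a wall clock around the whole process, so startup overhead and output printing are included in the measurement:

```cpp
#include <chrono>
#include <cstdlib>

// Sketch of a minimal timing tool: run a command to completion and return
// wall-clock seconds, so process startup and all terminal output are
// included in the measurement.
double measure_seconds(const char* command) {
    auto begin = std::chrono::steady_clock::now();
    std::system(command); // blocks until the benchmark binary exits
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}
```

Measuring e.g. `measure_seconds("./gbench")` then gives the totals listed in the table.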
| Benchmarking Framework | Fast (s) | Slow (s) | Fluctuating (s) | Overhead (s) | Total (s) |
|---|---:|---:|---:|---:|---:|
| Google Benchmark | 0.367 | 11.259 | 0.825 | 0.000 | 12.451 |
| Catch2 | 1.004 | 2.074 | 0.966 | 1.737 | 5.782 |
| nonius | 0.741 | 1.815 | 0.740 | 1.715 | 5.010 |
| sltbench | 0.202 | 0.204 | 0.203 | 3.001 | 3.610 |
| nanobench | 0.079 | 0.112 | 0.000 | 0.001 | 0.192 |
Nanobench is clearly the fastest autotuning benchmarking framework, by an enormous margin: its total runtime of 0.192 seconds is almost 19 times less than that of sltbench, the next fastest.
Implementations & Output
nanobench
Sourcecode
```cpp
// https://github.com/martinus/nanobench
// g++ -O2 -I../../include main.cpp -o m

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

int main(int, char**) {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("x += x", [&]() {
        ankerl::nanobench::doNotOptimizeAway(x += x);
    });

    ankerl::nanobench::Bench().run("sleep 10ms", [&]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });

    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().run("random fluctuations", [&]() {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    });
}
```
Results
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.31 | 3,192,709,967.58 | 0.0% | 1.00 | 1.00 | 0.999 | 0.00 | 0.0% | 0.00 | `x += x`
| 10,149,086.00 | 98.53 | 0.1% | 45.00 | 2,394.00 | 0.019 | 9.00 | 88.9% | 0.11 | `sleep 10ms`
| 744.50 | 1,343,183.34 | 11.2% | 2,815.05 | 2,375.86 | 1.185 | 524.73 | 12.5% | 0.00 | :wavy_dash: `random fluctuations` (Unstable with ~23.3 iters. Increase `minEpochIterations` to e.g. 233)
Google Benchmark
Very feature-rich and battle-proven, but a bit aged. It requires Google Test. Get it here: Google Benchmark
Sourcecode
```cpp
#include "benchmark.h"

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

// Build instructions: https://github.com/google/benchmark#installation
// curl --output benchmark.h
//     https://raw.githubusercontent.com/google/benchmark/master/include/benchmark/benchmark.h
// g++ -O2 main.cpp -Lgit/benchmark/build/src -lbenchmark -lpthread -o m
void ComparisonFast(benchmark::State& state) {
    uint64_t x = 1;
    for (auto _ : state) {
        x += x;
    }
    benchmark::DoNotOptimize(x);
}
BENCHMARK(ComparisonFast);

void ComparisonSlow(benchmark::State& state) {
    for (auto _ : state) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
BENCHMARK(ComparisonSlow);

void ComparisonFluctuating(benchmark::State& state) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    for (auto _ : state) {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    }
}
BENCHMARK(ComparisonFluctuating);

BENCHMARK_MAIN();
```
Results
Compiled & linked with

```
g++ -O2 main.cpp -L/home/martinus/git/benchmark/build/src -lbenchmark -lpthread -o gbench
```

executing it gives this result:
```
2019-10-12 12:03:25
Running ./gbench
Run on (12 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
Load Average: 0.21, 0.55, 0.60
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------
Benchmark                       Time             CPU     Iterations
----------------------------------------------------------------------
ComparisonFast              0.313 ns        0.313 ns     1000000000
ComparisonSlow           10137913 ns         3920 ns           1000
ComparisonFluctuating         993 ns          992 ns         706946
```
Running the tests individually takes 0.365 s, 11.274 s, and 0.828 s.
nonius
It produces lots of statistics, which makes the output a bit hard to read, and it seems more complicated than I'd like. I am not sure if it is still actively maintained; the homepage has been down for a while. Get it here: nonius
Sourcecode
```cpp
#define NONIUS_RUNNER
#include <nonius/nonius_single.h++>

// g++ -O2 main.cpp -pthread -I. -o m

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>
#include <type_traits>
#include <utility>

NONIUS_PARAM(X, UINT64_C(1))

template <typename Fn>
struct volatilize_fn {
    Fn fn;
    auto operator()() const -> decltype(fn()) {
        volatile auto x = fn();
        return x;
    }
};

template <typename Fn>
auto volatilize(Fn&& fn) -> volatilize_fn<typename std::decay<Fn>::type> {
    return {std::forward<Fn>(fn)};
}

NONIUS_BENCHMARK("x += x", [](nonius::chronometer meter) {
    auto x = meter.param<X>();
    meter.measure(volatilize([&]() {
        return x += x;
    }));
})

NONIUS_BENCHMARK("sleep 10ms", [] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
})

NONIUS_BENCHMARK("random fluctuations", [](nonius::chronometer meter) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    meter.measure([&] {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    });
})
```
Results
```
clock resolution: mean is 22.0426 ns (20480002 iterations)

new round for parameters
  X = 1

benchmarking x += x
collecting 100 samples, 56376 iterations each, in estimated 0 ns
mean: 0.391109 ns, lb 0.391095 ns, ub 0.391135 ns, ci 0.95
std dev: 9.50619e-05 ns, lb 6.25215e-05 ns, ub 0.000167224 ns, ci 0.95
found 4 outliers among 100 samples (4%)
variance is unaffected by outliers

benchmarking sleep 10ms
collecting 100 samples, 1 iterations each, in estimated 1013.66 ms
mean: 10.1258 ms, lb 10.1189 ms, ub 10.1313 ms, ci 0.95
std dev: 31.1777 μs, lb 26.5814 μs, ub 35.4952 μs, ci 0.95
found 13 outliers among 100 samples (13%)
variance is unaffected by outliers

benchmarking random fluctuations
collecting 100 samples, 23 iterations each, in estimated 2.2724 ms
mean: 1016.26 ns, lb 991.161 ns, ub 1041.66 ns, ci 0.95
std dev: 128.963 ns, lb 109.803 ns, ub 159.509 ns, ci 0.95
found 2 outliers among 100 samples (2%)
variance is severely inflated by outliers
```
The tests individually take 0.713 s, 1.883 s, and 0.819 s, plus a startup overhead of 1.611 s.
Picobench
It took me a while to figure out that I have to configure the slow test, otherwise it would run for a very long time. The number of iterations is hardcoded; this library seems very basic. Get it here: picobench
Sourcecode
```cpp
#define PICOBENCH_IMPLEMENT_WITH_MAIN
#include "picobench.hpp"

#include <chrono>
#include <cstdint>
#include <initializer_list>
#include <random>
#include <thread>

// https://github.com/iboB/picobench
// g++ -O2 picobench.cpp -o pb

PICOBENCH_SUITE("ComparisonFast");
static void ComparisonFast(picobench::state& state) {
    uint64_t x = 1;
    for (auto _ : state) {
        x += x;
    }
    state.set_result(x);
}
PICOBENCH(ComparisonFast);

PICOBENCH_SUITE("ComparisonSlow");
void ComparisonSlow(picobench::state& state) {
    for (auto _ : state) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
PICOBENCH(ComparisonSlow).iterations({1, 2, 5, 10});

PICOBENCH_SUITE("fluctuating");
void ComparisonFluctuating(picobench::state& state) {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    for (auto _ : state) {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    }
}
PICOBENCH(ComparisonFluctuating);
```
Results
```
ComparisonFast:
===============================================================================
   Name (baseline is *)   |  Dim  |  Total ms |   ns/op  |Baseline|  Ops/second
===============================================================================
        ComparisonFast * |     8 |     0.000 |        6 |      - | 156862745.1
        ComparisonFast * |    64 |     0.000 |        1 |      - | 512000000.0
        ComparisonFast * |   512 |     0.000 |        0 |      - |2560000000.0
        ComparisonFast * |  4096 |     0.001 |        0 |      - |3110098709.2
        ComparisonFast * |  8192 |     0.003 |        0 |      - |3141104294.5
===============================================================================
ComparisonSlow:
===============================================================================
   Name (baseline is *)   |  Dim  |  Total ms |   ns/op  |Baseline|  Ops/second
===============================================================================
        ComparisonSlow * |     1 |    10.056 | 10055959 |      - |        99.4
        ComparisonSlow * |     2 |    20.178 | 10088773 |      - |        99.1
        ComparisonSlow * |     5 |    50.570 | 10114054 |      - |        98.9
        ComparisonSlow * |    10 |   101.136 | 10113643 |      - |        98.9
===============================================================================
fluctuating:
===============================================================================
   Name (baseline is *)   |  Dim  |  Total ms |   ns/op  |Baseline|  Ops/second
===============================================================================
 ComparisonFluctuating * |     8 |     0.012 |     1551 |      - |    644485.6
 ComparisonFluctuating * |    64 |     0.068 |     1057 |      - |    945584.6
 ComparisonFluctuating * |   512 |     0.565 |     1103 |      - |    906222.0
 ComparisonFluctuating * |  4096 |     4.469 |     1090 |      - |    916619.4
 ComparisonFluctuating * |  8192 |     9.003 |     1098 |      - |    909957.2
===============================================================================
```
It doesn’t really make sense to provide runtime numbers here, because picobench just executes the given number of iterations, and that’s it. No autotuning.
Catch2
Catch2 is primarily a unit testing framework that has recently integrated a benchmarking facility. It is very easy to use, but does not seem very configurable. I find the way it formats the output quite confusing. Get it here: Catch2
Sourcecode
```cpp
// https://github.com/catchorg/Catch2
// g++ -O2 catch.cpp -o c

#define CATCH_CONFIG_ENABLE_BENCHMARKING
#define CATCH_CONFIG_MAIN
#include "catch.hpp" // NOLINT

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

TEST_CASE("comparison_fast") {
    uint64_t x = 1;
    BENCHMARK("x += x") {
        return x += x;
    };
}

TEST_CASE("comparison_slow") {
    BENCHMARK("sleep 10ms") {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    };
}

// NOLINTNEXTLINE(fuchsia-statically-constructed-objects,llvmlibc-implementation-in-namespace)
TEST_CASE("comparison_fluctuating_v2") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    BENCHMARK("random fluctuations") {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            (void)rng();
        }
    };
}
```
Results
```
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
c is a Catch v2.9.2 host application.
Run with -? for options

-------------------------------------------------------------------------------
comparison_fast
-------------------------------------------------------------------------------
catch.cpp:12
...............................................................................

benchmark name           samples       iterations    estimated
                         mean          low mean      high mean
                         std dev       low std dev   high std dev
-------------------------------------------------------------------------------
x += x                   100           12414         1.2414 ms
                         1 ns          1 ns          1 ns
                         0 ns          0 ns          0 ns

-------------------------------------------------------------------------------
comparison_slow
-------------------------------------------------------------------------------
catch.cpp:19
...............................................................................

benchmark name           samples       iterations    estimated
                         mean          low mean      high mean
                         std dev       low std dev   high std dev
-------------------------------------------------------------------------------
sleep 10ms               100           1             1.01319 s
                         10.1357 ms    10.1302 ms    10.1396 ms
                         23.539 us     18.061 us     29.575 us

-------------------------------------------------------------------------------
comparison_fluctuating_v2
-------------------------------------------------------------------------------
catch.cpp:25
...............................................................................

benchmark name           samples       iterations    estimated
                         mean          low mean      high mean
                         std dev       low std dev   high std dev
-------------------------------------------------------------------------------
random fluctuations      100           28            2.3324 ms
                         827 ns        810 ns        844 ns
                         88 ns         79 ns         99 ns

===============================================================================
test cases: 3 | 3 passed
assertions: - none -
```
moodycamel::microbench
A very simple benchmarking tool with an API that's very similar to `ankerl::nanobench`. No autotuning, no `doNotOptimize`, no output formatting. Get it here: moodycamel::microbench
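Since microbench brings no `doNotOptimize`, a hand-rolled compiler barrier has to fill the gap if the compiler folds a benchmark away. This is the usual GCC/Clang inline-asm idiom (a sketch, not part of microbench; the names are mine, and MSVC would need a different mechanism):

```cpp
#include <cstdint>

// Minimal doNotOptimizeAway in the style of the bigger frameworks: the empty
// asm statement tells the optimizer the value is "used", without emitting
// any actual instructions.
template <typename T>
void doNotOptimizeAway(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Example workload: without the barrier, the compiler may compute the whole
// loop at compile time or drop it entirely.
uint64_t shiftWork() {
    uint64_t x = 1;
    for (int i = 0; i < 50; ++i) {
        x += x; // doubles x each iteration
        doNotOptimizeAway(x);
    }
    return x; // 2^50
}
```

Google Benchmark's `DoNotOptimize` and nanobench's `doNotOptimizeAway` are built on the same trick.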
Sourcecode
```cpp
#include "microbench.h"

#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <thread>

// g++ -O2 -c systemtime.cpp
// g++ -O2 -c microbench.cpp
// g++ microbench.o systemtime.o -o mb
int main(int, char**) {
    // something fast
    uint64_t x = 1;
    std::cout << moodycamel::microbench(
                     [&]() {
                         x += x;
                     },
                     10000000, 51)
              << " sec x += x (x==" << x << ")" << std::endl;

    std::cout << moodycamel::microbench([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }) << " sec sleep 10ms"
              << std::endl;

    std::random_device dev;
    std::mt19937_64 rng(dev());
    std::cout << moodycamel::microbench(
                     [&] {
                         // each run, perform a random number of rng calls
                         auto iterations = rng() & UINT64_C(0xff);
                         for (uint64_t i = 0; i < iterations; ++i) {
                             (void)rng();
                         }
                     },
                     1000, 51)
              << " sec random fluctuations" << std::endl;
}
```
Results
```
3.12506e-07 sec x += x (x==0)
10.056 sec sleep 10ms
0.000661384 sec random fluctuations
```
sltbench
A C++ benchmark which seems to have similar intentions to nanobench. It claims to be 4.7 times faster than Google Benchmark. It has to be compiled and linked; I initially got a compile error because of a missing `<cstdint>` include. After that it compiled fine, and I created an example. I didn't like that I had to use global variables for the state that I needed in my `ComparisonFast` and `ComparisonFluctuating` benchmarks. Get it here: sltbench
Sourcecode
```cpp
#include <sltbench/Bench.h> // https://github.com/ivafanas/sltbench

#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

// cmake build as the online instructions describe
//
// g++ -O3 -I/home/martinus/git/sltbench/install/include -c main.cpp
// g++ -o m -L/home/martinus/git/sltbench/install/lib main.o -lsltbench

uint64_t x = 1;
void ComparisonFast() {
    sltbench::DoNotOptimize(x += x);
}
SLTBENCH_FUNCTION(ComparisonFast);

void ComparisonSlow() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
SLTBENCH_FUNCTION(ComparisonSlow);

std::random_device dev;
std::mt19937_64 rng(dev());

void ComparisonFluctuating() {
    // each run, perform a random number of rng calls
    auto iterations = rng() & UINT64_C(0xff);
    for (uint64_t i = 0; i < iterations; ++i) {
        (void)rng();
    }
}
SLTBENCH_FUNCTION(ComparisonFluctuating);

SLTBENCH_MAIN();
```
Results
```
benchmark                  arg   status       time(ns)
ComparisonFast                   ok                  1
ComparisonFluctuating            ok                 20
ComparisonSlow                   ok           10055943
```
Interestingly, the executable takes exactly 3 seconds startup time, then each benchmark runs for about 0.2 seconds.
Celero
Unfortunately I couldn't get it working. I only got segmentation faults for my `x += x` benchmark. Get it here: celero
folly Benchmark
Facebook's folly comes with a benchmarking facility. It seems rather basic, but with good `DoNotOptimizeAway` functionality. Honestly, I was too lazy to get it working; too much installation hassle. Get it here: folly