Installation

Direct Inclusion

Download `nanobench.h` from the release and make it available in your project. Create a .cpp file, e.g. `nanobench.cpp`, where the bulk of nanobench is compiled:

```cpp
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
```

Compile e.g. with `g++ -O3 -I../include -c nanobench.cpp`. This compiles the bulk of nanobench, and took 2.4 seconds on my machine. It needs to be compiled only once, whenever you upgrade nanobench.
CMake Integration

nanobench can be integrated with CMake's FetchContent or as a git submodule. Here is a full example of how this can be done:
```cmake
cmake_minimum_required(VERSION 3.14)
set(CMAKE_CXX_STANDARD 17)

project(
    CMakeNanobenchExample
    VERSION 1.0
    LANGUAGES CXX)

include(FetchContent)

FetchContent_Declare(
    nanobench
    GIT_REPOSITORY https://github.com/martinus/nanobench.git
    GIT_TAG v4.1.0
    GIT_SHALLOW TRUE)

FetchContent_MakeAvailable(nanobench)

add_executable(MyExample my_example.cpp)
target_link_libraries(MyExample PRIVATE nanobench)
```
Usage

Create the actual benchmark code, in `full_example.cpp`:

```cpp
#include <nanobench.h>

#include <atomic>

int main() {
    int y = 0;
    std::atomic<int> x(0);
    ankerl::nanobench::Bench().run("compare_exchange_strong", [&] {
        x.compare_exchange_strong(y, 0);
    });
}
```

The most important entry point is `ankerl::nanobench::Bench`. It creates a benchmarking object, optionally configures it, and then runs the code to benchmark with `run()`.

Compile and link the example with `g++ -O3 -I../include nanobench.o full_example.cpp -o full_example`. This takes just 0.28 seconds on my machine.

Run `./full_example`, which gives an output like this:

```
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 5.63 | 177,595,338.98 | 0.0% | 3.00 | 17.98 | 0.167 | 1.00 | 0.1% | 0.00 | `compare_exchange_strong`
```
Which renders as:

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|------:|-----:|-----:|-------:|-------:|----:|-------:|------:|------:|:----------|
| 5.63 | 177,595,338.98 | 0.0% | 3.00 | 17.98 | 0.167 | 1.00 | 0.1% | 0.00 | `compare_exchange_strong` |
This means that a single `x.compare_exchange_strong(y, 0);` call takes 5.63ns on my machine (wall-clock time), or ~178 million operations per second. Runtime fluctuates by around 0.0%, so the results are very stable. Each call required 3 instructions, which took ~18 CPU cycles. There was a single branch per call, with only 0.1% mispredicted.
Nanobench does not come with a test runner, so you can easily use it with any framework you like. In the remaining examples, I’m using doctest as a unit test framework.
Note

CPU statistics like instructions, cycles, branches, and branch misses are only available on Linux, through perf events. On some systems you might need to change permissions through `perf_event_paranoid`, or use ACLs.
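For example, on Linux you can inspect the current restriction and relax it with sysctl (a sketch; the exact value you want depends on your system's security policy, and the setting resets on reboot unless persisted in /etc/sysctl.conf):

```shell
# Show the current restriction level (lower means more access; -1 disables all checks)
cat /proc/sys/kernel/perf_event_paranoid

# Temporarily allow unprivileged access to CPU event counters
sudo sysctl -w kernel.perf_event_paranoid=1
```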
Examples

Something Fast

Let's benchmark how fast we can do `++x` for a `uint64_t`:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

// NOLINTNEXTLINE
TEST_CASE("tutorial_fast_v1") {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&]() {
        ++x;
    });
}
```
After 0.2ms we get this output:
```
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| - | - | - | - | :boom: `++x` (iterations overflow. Maybe your code got optimized away?)
```
No data there! We only get the :boom: iterations overflow warning. The compiler could optimize `++x` away because we never use the result. Thanks to `doNotOptimizeAway`, this is easy to fix:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

// NOLINTNEXTLINE
TEST_CASE("tutorial_fast_v2") {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&]() {
        ankerl::nanobench::doNotOptimizeAway(x += 1);
    });
}
```
This time the benchmark runs for 2.2ms and we actually get reasonable data:
```
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.31 | 3,192,444,232.50 | 0.0% | 1.00 | 1.00 | 0.998 | 0.00 | 0.0% | 0.00 | `++x`
```
It's a very stable result. In one run I get 3,192 million op/s; the next time I execute it, 3,168 million op/s. It always takes 1.00 instructions per operation on my machine, and can do this in ~1 cycle.
Something Slow
Let’s benchmark if sleeping for 100ms really takes 100ms.
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

// NOLINTNEXTLINE
TEST_CASE("tutorial_slow_v1") {
    ankerl::nanobench::Bench().run("sleep 100ms, auto", [&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}
```
After 1.1 seconds I get:

```
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------------------
| 100,125,753.00 | 9.99 | 0.0% | 51.00 | 7,714.00 | 0.007 | 11.00 | 90.9% | 1.10 | `sleep 100ms, auto`
```
So we actually take 100.125ms instead of 100ms. The next time I run it, I get 100.141ms. Also a very stable result. Interestingly, sleep takes 51 instructions but 7,714 cycles, so we only got 0.007 instructions per cycle. That's extremely low, but expected of sleep. It also required 11 branches, of which 90.9% were mispredicted on average.
If 1.1 seconds is too slow for you, you can manually configure the number of evaluations (epochs):
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

// NOLINTNEXTLINE
TEST_CASE("tutorial_slow_v2") {
    ankerl::nanobench::Bench().epochs(3).run("sleep 100ms", [&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}
```
```
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 100,099,096.00 | 9.99 | 0.0% | 51.00 | 7,182.00 | 0.007 | 11.00 | 90.9% | 0.30 | `sleep 100ms`
```
This time it took only 0.3 seconds, with just 3 evaluations instead of 11. The err% will be less meaningful, but since the benchmark is so stable it doesn't really matter.
Something Unstable

Let's create an extreme artificial test that's hard to benchmark because its runtime fluctuates randomly: each iteration performs between 0 and 255 calls to the random number generator:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

// NOLINTNEXTLINE
TEST_CASE("tutorial_fluctuating_v1") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().run("random fluctuations", [&] {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            ankerl::nanobench::doNotOptimizeAway(rng());
        }
    });
}
```
After 2.3ms, I get this result:
```
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 334.12 | 2,992,911.53 | 6.3% | 3,486.44 | 1,068.67 | 3.262 | 287.86 | 0.7% | 0.00 | :wavy_dash: `random fluctuations` (Unstable with ~56.7 iters. Increase `minEpochIterations` to e.g. 567)
```
So on average each loop takes about 334.12ns, but we get a warning that the results are unstable. The median percentage error of 6.3% is quite high. Let's follow the suggestion and set the minimum number of iterations to 5000, and try again:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

// NOLINTNEXTLINE
TEST_CASE("tutorial_fluctuating_v2") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().minEpochIterations(5000).run(
        "random fluctuations", [&] {
            // each run, perform a random number of rng calls
            auto iterations = rng() & UINT64_C(0xff);
            for (uint64_t i = 0; i < iterations; ++i) {
                ankerl::nanobench::doNotOptimizeAway(rng());
            }
        });
}
```
The fluctuations are much better:
```
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 277.31 | 3,606,106.48 | 0.7% | 3,531.75 | 885.18 | 3.990 | 291.59 | 0.7% | 0.00 | `random fluctuations`
```
The results are more stable, with only 0.7% error.
Comparing Results
To compare results, keep the ankerl::nanobench::Bench object around, enable .relative(true), and .run(…) your benchmarks. All benchmarks will be automatically compared to the first one.
As an example, I have implemented a comparison of multiple random number generators. Here several RNGs are compared to a baseline calculated from std::default_random_engine. I factored out the general benchmarking code so it’s easy to use for each of the random number generators:
```cpp
    }

private:
    static constexpr uint64_t rotl(uint64_t x, unsigned k) noexcept {
        return (x << k) | (x >> (64U - k));
    }

    uint64_t stateA{};
    uint64_t stateB{};
};

namespace {

// Benchmarks how fast we can get 64bit random values from Rng.
template <typename Rng>
void bench(ankerl::nanobench::Bench* bench, char const* name) {
    std::random_device dev;
    Rng rng(dev());

    bench->run(name, [&]() {
        auto r = std::uniform_int_distribution<uint64_t>{}(rng);
        ankerl::nanobench::doNotOptimizeAway(r);
    });
}

} // namespace

// NOLINTNEXTLINE
TEST_CASE("example_random_number_generators") {
    // perform a few warmup calls, and since the runtime is not always stable
    // for each generator, increase the number of epochs to get more accurate
    // numbers.
    ankerl::nanobench::Bench b;
    b.title("Random Number Generators")
        .unit("uint64_t")
        .warmup(100)
        .relative(true);
    b.performanceCounters(true);

    // sets the first one as the baseline
    bench<std::default_random_engine>(&b, "std::default_random_engine");
    bench<std::mt19937>(&b, "std::mt19937");
    bench<std::mt19937_64>(&b, "std::mt19937_64");
    bench<std::ranlux24_base>(&b, "std::ranlux24_base");
    bench<std::ranlux48_base>(&b, "std::ranlux48_base");
    bench<std::ranlux24>(&b, "std::ranlux24");
    bench<std::ranlux48>(&b, "std::ranlux48");
    bench<std::knuth_b>(&b, "std::knuth_b");
    bench<WyRng>(&b, "WyRng");
    bench<NasamRng>(&b, "NasamRng");
    bench<Sfc4>(&b, "Sfc4");
    bench<RomuTrio>(&b, "RomuTrio");
    bench<RomuDuo>(&b, "RomuDuo");
    bench<RomuDuoJr>(&b, "RomuDuoJr");
    bench<Orbit>(&b, "Orbit");
    bench<ankerl::nanobench::Rng>(&b, "ankerl::nanobench::Rng");
}
```
Runs for 60ms and prints this table:
```
| relative | ns/uint64_t | uint64_t/s | err% | ins/uint64_t | cyc/uint64_t | IPC | bra/uint64_t | miss% | total | Random Number Generators
|---------:|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:-------------------------
| 100.0% | 35.87 | 27,881,924.28 | 2.3% | 127.80 | 114.61 | 1.115 | 9.77 | 3.7% | 0.00 | `std::default_random_engine`
| 490.3% | 7.32 | 136,699,693.21 | 0.6% | 89.55 | 23.49 | 3.812 | 9.51 | 0.1% | 0.00 | `std::mt19937`
| 1,767.4% | 2.03 | 492,786,582.33 | 0.6% | 24.38 | 6.48 | 3.761 | 1.26 | 0.6% | 0.00 | `std::mt19937_64`
| 85.2% | 42.08 | 23,764,853.03 | 0.7% | 157.07 | 134.62 | 1.167 | 19.51 | 7.6% | 0.00 | `std::ranlux24_base`
| 121.3% | 29.56 | 33,824,759.51 | 0.5% | 91.03 | 94.35 | 0.965 | 10.00 | 8.1% | 0.00 | `std::ranlux48_base`
| 17.4% | 205.67 | 4,862,080.59 | 1.2% | 709.83 | 657.10 | 1.080 | 101.79 | 16.1% | 0.00 | `std::ranlux24`
| 8.7% | 412.46 | 2,424,497.97 | 1.8% | 1,514.70 | 1,318.43 | 1.149 | 219.09 | 16.7% | 0.00 | `std::ranlux48`
| 59.2% | 60.60 | 16,502,276.18 | 1.9% | 253.77 | 193.39 | 1.312 | 24.93 | 1.5% | 0.00 | `std::knuth_b`
| 5,187.1% | 0.69 | 1,446,254,071.66 | 0.1% | 6.00 | 2.21 | 2.714 | 0.00 | 0.0% | 0.00 | `WyRng`
| 1,431.7% | 2.51 | 399,177,833.54 | 0.0% | 21.00 | 8.01 | 2.621 | 0.00 | 0.0% | 0.00 | `NasamRng`
| 2,629.9% | 1.36 | 733,279,957.30 | 0.1% | 13.00 | 4.36 | 2.982 | 0.00 | 0.0% | 0.00 | `Sfc4`
| 3,815.7% | 0.94 | 1,063,889,655.17 | 0.0% | 11.00 | 3.01 | 3.661 | 0.00 | 0.0% | 0.00 | `RomuTrio`
| 3,529.5% | 1.02 | 984,102,081.37 | 0.3% | 9.00 | 3.25 | 2.768 | 0.00 | 0.0% | 0.00 | `RomuDuo`
| 4,580.4% | 0.78 | 1,277,113,402.06 | 0.0% | 7.00 | 2.50 | 2.797 | 0.00 | 0.0% | 0.00 | `RomuDuoJr`
| 2,291.2% | 1.57 | 638,820,992.09 | 0.0% | 11.00 | 5.00 | 2.200 | 0.00 | 0.0% | 0.00 | `ankerl::nanobench::Rng`
```
It shows that `ankerl::nanobench::Rng` is one of the fastest RNGs, and has the least amount of fluctuation. It takes only 1.57ns to generate a random `uint64_t`, so ~638 million calls per second are possible. The leftmost column shows performance relative to `std::default_random_engine`.
Note
Here pure runtime performance is not necessarily the best benchmark. Especially the fastest RNGs can be inlined and use instruction-level parallelism to their advantage: they immediately return an old state, and while user code can already use that value, the next value is calculated in parallel. See the excellent paper at romu-random for details.
Asymptotic Complexity

It is possible to calculate asymptotic complexity (Big O) from multiple runs of a benchmark. Run the benchmark with different complexity N, and nanobench can calculate the best-fitting curve.

The following example finds the asymptotic complexity of `std::set`'s `find()`:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <iostream>
#include <set>

// NOLINTNEXTLINE
TEST_CASE("tutorial_complexity_set_find") {
    // Create a single benchmark instance that is used in multiple benchmark
    // runs, with different settings for complexityN.
    ankerl::nanobench::Bench bench;

    // a RNG to generate input data
    ankerl::nanobench::Rng rng;

    std::set<uint64_t> set;

    // Running the benchmark multiple times, with different number of elements
    for (auto setSize :
         {10U, 20U, 50U, 100U, 200U, 500U, 1000U, 2000U, 5000U, 10000U}) {

        // fill up the set with random data
        while (set.size() < setSize) {
            set.insert(rng());
        }

        // Run the benchmark, provide setSize as the scaling variable.
        bench.complexityN(set.size()).run("std::set find", [&] {
            ankerl::nanobench::doNotOptimizeAway(set.find(rng()));
        });
    }

    // calculate BigO complexity best fit and print the results
    std::cout << bench.complexityBigO() << std::endl;
}
```
The loop runs the benchmark 10 times, with different set sizes from 10 to 10k.
Note
Each of the 10 benchmark runs automatically scales the number of iterations so results are still fast and accurate. In total the whole test takes about 90ms.
The `Bench` object holds the benchmark results of the 10 runs. Each benchmark is recorded with a different setting for `complexityN`.

After the benchmark prints the benchmark results, we calculate & print the Big O of the most important complexity functions:

```cpp
std::cout << bench.complexityBigO() << std::endl;
```

This prints e.g. this markdown table:
```
| coefficient | err% | complexity
|--------------:|-------:|------------
| 6.66562e-09 | 29.1% | O(log n)
| 1.47588e-11 | 58.3% | O(n)
| 1.10742e-12 | 62.6% | O(n log n)
| 5.15683e-08 | 63.8% | O(1)
| 1.40387e-15 | 78.7% | O(n^2)
| 1.32792e-19 | 85.7% | O(n^3)
```
The table is sorted with the best-fitting complexity function first. So \(\mathcal{O}(\log{}n)\) provides the best approximation for the complexity. Interestingly, the error of the \(\mathcal{O}(n)\) fit is not that much larger, which can be an indication that even though the red-black tree should theoretically have logarithmic complexity, in practice that is not perfectly the case.
Rendering Mustache-like Templates

Nanobench comes with a powerful Mustache-like template mechanism to process the benchmark results into all kinds of formats. You can find a full description of all possible tags at `ankerl::nanobench::render()`.

Several preconfigured formats exist in the namespace `ankerl::nanobench::templates`. Rendering these templates can be done with either `ankerl::nanobench::render()`, or directly with `ankerl::nanobench::Bench::render()`.
The following example shows how to use the CSV (comma-separated values) template while suppressing nanobench's default output:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <atomic>
#include <iostream>

// NOLINTNEXTLINE
TEST_CASE("tutorial_render_simple") {
    std::atomic<int> x(0);

    ankerl::nanobench::Bench()
        .output(nullptr)
        .run("std::vector",
             [&] {
                 ++x;
             })
        .render(ankerl::nanobench::templates::csv(), std::cout);
}
```
We call `Bench::output()` with `nullptr`, thus disabling nanobench's own output. After the benchmark we directly call `Bench::render()` with the CSV template and write the rendered output to `std::cout`. When running, we get just the CSV output on the console, which looks like this:
```
"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
"benchmark";"std::vector";"op";1;6.51982200647249e-09;8.26465858909014e-05;23.0034662045061;5;0.00116867939228672;0.000171959
```
Nanobench comes with a few preconfigured templates, residing in the namespace `ankerl::nanobench::templates`. To demonstrate what these templates can do, here is a simple example that benchmarks the two random generators `std::mt19937_64` and `std::knuth_b`, and writes both the template and the rendered output to files:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <fstream>
#include <random>
#include <string>

namespace {

void gen(std::string const& typeName, char const* mustacheTemplate,
         ankerl::nanobench::Bench const& bench) {

    std::ofstream templateOut("mustache.template." + typeName);
    templateOut << mustacheTemplate;

    std::ofstream renderOut("mustache.render." + typeName);
    ankerl::nanobench::render(mustacheTemplate, bench, renderOut);
}

} // namespace

// NOLINTNEXTLINE
TEST_CASE("tutorial_mustache") {
    ankerl::nanobench::Bench bench;
    bench.title("Benchmarking std::mt19937_64 and std::knuth_b");

    // NOLINTNEXTLINE(cert-msc32-c,cert-msc51-cpp)
    std::mt19937_64 rng1;
    bench.run("std::mt19937_64", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng1());
    });

    // NOLINTNEXTLINE(cert-msc32-c,cert-msc51-cpp)
    std::knuth_b rng2;
    bench.run("std::knuth_b", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng2());
    });

    gen("json", ankerl::nanobench::templates::json(), bench);
    gen("html", ankerl::nanobench::templates::htmlBoxplot(), bench);
    gen("csv", ankerl::nanobench::templates::csv(), bench);
}
```
Nanobench also allows you to specify additional context information, which can be accessed with `{{context(name)}}`, where `name` refers to a variable defined via `Bench::context()`:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <cmath>
#include <iostream>

namespace {

template <typename T>
void fma() {
    T x(1);
    T y(2);
    T z(3);
    z = std::fma(x, y, z);
    ankerl::nanobench::doNotOptimizeAway(z);
}

template <typename T>
void plus_eq() {
    T x(1);
    T y(2);
    T z(3);
    z += x * y;
    ankerl::nanobench::doNotOptimizeAway(z);
}

char const* csv() {
    return R"DELIM("title";"name";"scalar";"foo";"elapsed";"total"
{{#result}}"{{title}}";"{{name}}";"{{context(scalar)}}";"{{context(foo)}}";{{median(elapsed)}};{{sumProduct(iterations, elapsed)}}
{{/result}})DELIM";
}

} // namespace

// NOLINTNEXTLINE
TEST_CASE("tutorial_context") {
    ankerl::nanobench::Bench bench;
    bench.title("Addition").output(nullptr);
    bench.context("scalar", "f32")
        .context("foo", "bar")
        .run("+=", plus_eq<float>)
        .run("fma", fma<float>);
    bench.context("scalar", "f64")
        .context("foo", "baz")
        .run("+=", plus_eq<double>)
        .run("fma", fma<double>);
    bench.render(csv(), std::cout);
    // Changing the title resets the results, but not the context:
    bench.title("New Title");
    bench.run("+=", plus_eq<float>);
    bench.render(csv(), std::cout);
    CHECK_EQ(bench.results().front().context("foo"), "baz"); // != bar
    // The context has to be reset manually, which causes render to fail:
    bench.title("Yet Another Title").clearContext();
    bench.run("+=", plus_eq<float>);

    // NOLINTNEXTLINE(llvm-else-after-return,readability-else-after-return)
    CHECK_THROWS(bench.render(csv(), std::cout));
}
```
CSV - Comma-Separated Values

The function `ankerl::nanobench::templates::csv()` provides this template:

```
"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
{{#result}}"{{title}}";"{{name}}";"{{unit}}";{{batch}};{{median(elapsed)}};{{medianAbsolutePercentError(elapsed)}};{{median(instructions)}};{{median(branchinstructions)}};{{median(branchmisses)}};{{sumProduct(iterations, elapsed)}}
{{/result}}
```
This generates a compact CSV file, where entries are separated by a semicolon `;`. Run with the example above, I get this output:
```
"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
"Benchmarking std::mt19937_64 and std::knuth_b";"std::mt19937_64";"op";1;2.54441805225653e-08;0.0236579384033733;125.989678899083;16.7645714285714;0.564133016627078;0.000218811
"Benchmarking std::mt19937_64 and std::knuth_b";"std::knuth_b";"op";1;3.19013867488444e-08;0.00091350764819687;170.013008130081;28;0.0031104199066874;0.000217248
```
Rendered as CSV table:

| title | name | unit | batch | elapsed | error % | instructions | branches | branch misses | total |
|---|---|---|---|---|---|---|---|---|---|
| Benchmarking std::mt19937_64 and std::knuth_b | std::mt19937_64 | op | 1 | 2.54441805225653e-08 | 0.0236579384033733 | 125.989678899083 | 16.7645714285714 | 0.564133016627078 | 0.000218811 |
| Benchmarking std::mt19937_64 and std::knuth_b | std::knuth_b | op | 1 | 3.19013867488444e-08 | 0.00091350764819687 | 170.013008130081 | 28 | 0.0031104199066874 | 0.000217248 |
Note that the CSV template doesn’t provide all the data that is available.
HTML Box Plots

With the template `ankerl::nanobench::templates::htmlBoxplot()` you get a plotly-based HTML output which generates a box plot of the measured runtimes. The template is rather simple:
```html
<html>

<head>
    <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
</head>

<body>
    <div id="myDiv"></div>
    <script>
        var data = [
            {{#result}}{
                name: '{{name}}',
                y: [{{#measurement}}{{elapsed}}{{^-last}}, {{/-last}}{{/measurement}}],
            },
            {{/result}}
        ];
        var title = '{{title}}';

        data = data.map(a => Object.assign(a, { boxpoints: 'all', pointpos: 0, type: 'box' }));
        var layout = { title: { text: title }, showlegend: false, yaxis: { title: 'time per unit', rangemode: 'tozero', autorange: true } }; Plotly.newPlot('myDiv', data, layout, {responsive: true});
    </script>
</body>

</html>
```
This generates a nice interactive box plot, which gives a good visual impression of the runtime behavior of the evaluated benchmarks. Each epoch is visualized as a dot, and the box plot itself shows median, percentiles, and outliers. You might want to increase the default number of epochs for an even better visualization.
JSON - JavaScript Object Notation

The `ankerl::nanobench::templates::json()` template outputs all the data that is available, from all runs. The template is therefore quite complex:
```
{
 "results": [
{{#result}}  {
   "title": "{{title}}",
   "name": "{{name}}",
   "unit": "{{unit}}",
   "batch": {{batch}},
   "complexityN": {{complexityN}},
   "epochs": {{epochs}},
   "clockResolution": {{clockResolution}},
   "clockResolutionMultiple": {{clockResolutionMultiple}},
   "maxEpochTime": {{maxEpochTime}},
   "minEpochTime": {{minEpochTime}},
   "minEpochIterations": {{minEpochIterations}},
   "epochIterations": {{epochIterations}},
   "warmup": {{warmup}},
   "relative": {{relative}},
   "median(elapsed)": {{median(elapsed)}},
   "medianAbsolutePercentError(elapsed)": {{medianAbsolutePercentError(elapsed)}},
   "median(instructions)": {{median(instructions)}},
   "medianAbsolutePercentError(instructions)": {{medianAbsolutePercentError(instructions)}},
   "median(cpucycles)": {{median(cpucycles)}},
   "median(contextswitches)": {{median(contextswitches)}},
   "median(pagefaults)": {{median(pagefaults)}},
   "median(branchinstructions)": {{median(branchinstructions)}},
   "median(branchmisses)": {{median(branchmisses)}},
   "totalTime": {{sumProduct(iterations, elapsed)}},
   "measurements": [
{{#measurement}}    {
     "iterations": {{iterations}},
     "elapsed": {{elapsed}},
     "pagefaults": {{pagefaults}},
     "cpucycles": {{cpucycles}},
     "contextswitches": {{contextswitches}},
     "instructions": {{instructions}},
     "branchinstructions": {{branchinstructions}},
     "branchmisses": {{branchmisses}}
    }{{^-last}},{{/-last}}
{{/measurement}}   ]
  }{{^-last}},{{/-last}}
{{/result}} ]
}
```
This also gives the data from each separate epoch (see `ankerl::nanobench::Bench::epochs()`), not just the accumulated data as in the CSV template.
```json
{
    "results": [
        {
            "title": "Benchmarking std::mt19937_64 and std::knuth_b",
            "name": "std::mt19937_64",
            "unit": "op",
            "batch": 1,
            "complexityN": -1,
            "epochs": 11,
            "clockResolution": 1.8e-08,
            "clockResolutionMultiple": 1000,
            "maxEpochTime": 0.1,
            "minEpochTime": 0,
            "minEpochIterations": 1,
            "warmup": 0,
            "relative": 0,
            "median(elapsed)": 2.54441805225653e-08,
            "medianAbsolutePercentError(elapsed)": 0.0236579384033733,
            "median(instructions)": 125.989678899083,
            "medianAbsolutePercentError(instructions)": 0.035125448044942,
            "median(cpucycles)": 81.3479809976247,
            "median(contextswitches)": 0,
            "median(pagefaults)": 0,
            "median(branchinstructions)": 16.7645714285714,
            "median(branchmisses)": 0.564133016627078,
            "totalTime": 0.000218811,
            "measurements": [
                {
                    "iterations": 875,
                    "elapsed": 2.54708571428571e-08,
                    "pagefaults": 0,
                    "cpucycles": 81.472,
                    "contextswitches": 0,
                    "instructions": 125.885714285714,
                    "branchinstructions": 16.7645714285714,
                    "branchmisses": 0.574857142857143
                },
                {
                    "iterations": 809,
                    "elapsed": 2.58467243510507e-08,
                    "pagefaults": 0,
                    "cpucycles": 82.5290482076638,
                    "contextswitches": 0,
                    "instructions": 128.771322620519,
                    "branchinstructions": 17.0296662546354,
                    "branchmisses": 0.582200247218789
                },
                {
                    "iterations": 737,
                    "elapsed": 2.24097693351425e-08,
                    "pagefaults": 0,
                    "cpucycles": 71.6431478968792,
                    "contextswitches": 0,
                    "instructions": 118.374491180461,
                    "branchinstructions": 15.9470827679783,
                    "branchmisses": 0.417910447761194
                },
                {
                    "iterations": 872,
                    "elapsed": 2.53405963302752e-08,
                    "pagefaults": 0,
                    "cpucycles": 80.9896788990826,
                    "contextswitches": 0,
                    "instructions": 125.989678899083,
                    "branchinstructions": 16.7580275229358,
                    "branchmisses": 0.563073394495413
                },
                {
                    "iterations": 834,
                    "elapsed": 2.59256594724221e-08,
                    "pagefaults": 0,
                    "cpucycles": 82.7661870503597,
                    "contextswitches": 0,
                    "instructions": 127.635491606715,
                    "branchinstructions": 16.9352517985612,
                    "branchmisses": 0.575539568345324
                },
                {
                    "iterations": 772,
                    "elapsed": 2.25310880829016e-08,
                    "pagefaults": 0,
                    "cpucycles": 72.0129533678757,
                    "contextswitches": 0,
                    "instructions": 117.108808290155,
                    "branchinstructions": 15.8341968911917,
                    "branchmisses": 0.405440414507772
                },
                {
                    "iterations": 842,
                    "elapsed": 2.54441805225653e-08,
                    "pagefaults": 0,
                    "cpucycles": 81.3479809976247,
                    "contextswitches": 0,
                    "instructions": 127.266033254157,
                    "branchinstructions": 16.8859857482185,
                    "branchmisses": 0.564133016627078
                },
                {
                    "iterations": 792,
                    "elapsed": 2.20126262626263e-08,
                    "pagefaults": 0,
                    "cpucycles": 70.3623737373737,
                    "contextswitches": 0,
                    "instructions": 116.420454545455,
                    "branchinstructions": 15.7588383838384,
                    "branchmisses": 0.396464646464646
                },
                {
                    "iterations": 757,
                    "elapsed": 2.63870541611625e-08,
                    "pagefaults": 0,
                    "cpucycles": 84.332892998679,
                    "contextswitches": 0,
                    "instructions": 131.462351387054,
                    "branchinstructions": 17.334214002642,
                    "branchmisses": 0.618229854689564
                },
                {
                    "iterations": 850,
                    "elapsed": 2.23305882352941e-08,
                    "pagefaults": 0,
                    "cpucycles": 71.3505882352941,
                    "contextswitches": 0,
                    "instructions": 114.629411764706,
                    "branchinstructions": 15.5823529411765,
                    "branchmisses": 0.392941176470588
                },
                {
                    "iterations": 774,
                    "elapsed": 2.60607235142119e-08,
                    "pagefaults": 0,
                    "cpucycles": 83.1679586563308,
                    "contextswitches": 0,
                    "instructions": 130.576227390181,
                    "branchinstructions": 17.2635658914729,
                    "branchmisses": 0.590439276485788
                }
            ]
        },
        {
            "title": "Benchmarking std::mt19937_64 and std::knuth_b",
            "name": "std::knuth_b",
            "unit": "op",
            "batch": 1,
            "complexityN": -1,
            "epochs": 11,
            "clockResolution": 1.8e-08,
            "clockResolutionMultiple": 1000,
            "maxEpochTime": 0.1,
            "minEpochTime": 0,
            "minEpochIterations": 1,
            "warmup": 0,
            "relative": 0,
            "median(elapsed)": 3.19013867488444e-08,
            "medianAbsolutePercentError(elapsed)": 0.00091350764819687,
            "median(instructions)": 170.013008130081,
            "medianAbsolutePercentError(instructions)": 4.11992392254248e-06,
            "median(cpucycles)": 101.973254086181,
            "median(contextswitches)": 0,
            "median(pagefaults)": 0,
            "median(branchinstructions)": 28,
            "median(branchmisses)": 0.0031104199066874,
            "totalTime": 0.000217248,
            "measurements": [
                {
                    "iterations": 568,
                    "elapsed": 3.2137323943662e-08,
                    "pagefaults": 0,
                    "cpucycles": 102.55985915493,
                    "contextswitches": 0,
                    "instructions": 170.014084507042,
                    "branchinstructions": 28,
                    "branchmisses": 0.00528169014084507
                },
                {
                    "iterations": 576,
                    "elapsed": 3.19305555555556e-08,
                    "pagefaults": 0,
                    "cpucycles": 102.059027777778,
                    "contextswitches": 0,
                    "instructions": 170.013888888889,
                    "branchinstructions": 28,
                    "branchmisses": 0.00347222222222222
                },
                {
                    "iterations": 643,
                    "elapsed": 3.18973561430793e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.973561430793,
                    "contextswitches": 0,
                    "instructions": 170.012441679627,
                    "branchinstructions": 28,
                    "branchmisses": 0.0031104199066874
                },
                {
                    "iterations": 591,
                    "elapsed": 3.1912013536379e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.944162436548,
                    "contextswitches": 0,
                    "instructions": 170.013536379019,
                    "branchinstructions": 28,
                    "branchmisses": 0.00169204737732657
                },
                {
                    "iterations": 673,
                    "elapsed": 3.19049034175334e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.973254086181,
                    "contextswitches": 0,
                    "instructions": 170.011887072808,
                    "branchinstructions": 28,
                    "branchmisses": 0.00297176820208024
                },
                {
                    "iterations": 649,
                    "elapsed": 3.19013867488444e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.850539291217,
                    "contextswitches": 0,
                    "instructions": 170.012326656394,
                    "branchinstructions": 28,
                    "branchmisses": 0.00308166409861325
                },
                {
                    "iterations": 606,
                    "elapsed": 3.18547854785479e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.83498349835,
                    "contextswitches": 0,
                    "instructions": 170.013201320132,
                    "branchinstructions": 28,
                    "branchmisses": 0.0033003300330033
                },
                {
                    "iterations": 650,
                    "elapsed": 3.18769230769231e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.898461538462,
                    "contextswitches": 0,
                    "instructions": 170.012307692308,
                    "branchinstructions": 28,
                    "branchmisses": 0.00307692307692308
                },
                {
                    "iterations": 615,
                    "elapsed": 3.18520325203252e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.858536585366,
                    "contextswitches": 0,
                    "instructions": 170.013008130081,
                    "branchinstructions": 28,
                    "branchmisses": 0.0032520325203252
                },
                {
                    "iterations": 579,
                    "elapsed": 3.18618307426598e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.989637305699,
                    "contextswitches": 0,
                    "instructions": 170.013816925734,
                    "branchinstructions": 28,
                    "branchmisses": 0.00345423143350604
                },
                {
                    "iterations": 657,
                    "elapsed": 3.19558599695586e-08,
                    "pagefaults": 0,
                    "cpucycles": 102.229832572298,
                    "contextswitches": 0,
                    "instructions": 170.012176560122,
                    "branchinstructions": 28,
                    "branchmisses": 0.0030441400304414
                }
            ]
        }
    ]
}
```
pyperf - Python pyperf module Output
Pyperf is a powerful tool for benchmarking and system tuning, and it can also analyze benchmark results. The pyperf template generates output that can be loaded directly into pyperf for further analysis.
Note
Pyperf supports only a single benchmark result per generated output, so it is best to create a new Bench object for each benchmark.
The template looks like this. Note that it directly makes use of {{#measurement}}, which is only possible when there is a single result in the benchmark.
{
 "benchmarks": [
  {
   "runs": [
    {
     "values": [
{{#measurement}} {{elapsed}}{{^-last}},
{{/-last}}{{/measurement}}
     ]
    }
   ]
  }
 ],
 "metadata": {
  "loops": {{sum(iterations)}},
  "inner_loops": {{batch}},
  "name": "{{title}}",
  "unit": "second"
 },
 "version": "1.0"
}
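To make the structure concrete, here is a small Python sketch (with made-up values, not actual measurements) of what a rendered instance of this template looks like: it is plain JSON, and pyperf finds the per-epoch timings under benchmarks[0].runs[0].values.

```python
import json

# A hypothetical rendered instance of the pyperf template above.
# The three values stand in for three epochs of measured seconds.
rendered = """
{
 "benchmarks": [
  {
   "runs": [
    {
     "values": [
      3.21e-08,
      3.19e-08,
      3.18e-08
     ]
    }
   ]
  }
 ],
 "metadata": {
  "loops": 1787,
  "inner_loops": 1,
  "name": "compare_exchange_strong",
  "unit": "second"
 },
 "version": "1.0"
}
"""

doc = json.loads(rendered)
values = doc["benchmarks"][0]["runs"][0]["values"]
print(doc["metadata"]["name"], values)  # raw per-epoch seconds
```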
Here is an example that generates pyperf-compatible output for a benchmark that shuffles a vector:
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <random>
#include <vector>

// NOLINTNEXTLINE
TEST_CASE("shuffle_pyperf") {
    std::vector<uint64_t> data(500, 0); // input data for shuffling

    // NOLINTNEXTLINE(cert-msc32-c,cert-msc51-cpp)
    std::default_random_engine defaultRng(123);
    std::ofstream fout1("pyperf_shuffle_std.json");
    ankerl::nanobench::Bench()
        .epochs(100)
        .run("std::shuffle with std::default_random_engine",
             [&]() {
                 std::shuffle(data.begin(), data.end(), defaultRng);
             })
        .render(ankerl::nanobench::templates::pyperf(), fout1);

    std::ofstream fout2("pyperf_shuffle_nanobench.json");
    ankerl::nanobench::Rng rng(123);
    ankerl::nanobench::Bench()
        .epochs(100)
        .run("ankerl::nanobench::Rng::shuffle",
             [&]() {
                 rng.shuffle(data);
             })
        .render(ankerl::nanobench::templates::pyperf(), fout2);
}
This benchmark run creates the two files pyperf_shuffle_std.json and pyperf_shuffle_nanobench.json.
Here are some of the analyses you can do:
Show Benchmark Statistics
Output from python3 -m pyperf stats pyperf_shuffle_std.json:
Total duration: 364 ms
Raw value minimum: 3.57 ms
Raw value maximum: 4.21 ms
Number of calibration run: 0
Number of run with values: 1
Total number of run: 1
Number of warmup per run: 0
Number of value per run: 100
Loop iterations per value: 100
Total number of values: 100
Minimum: 35.7 us
Median +- MAD: 36.2 us +- 0.2 us
Mean +- std dev: 36.4 us +- 0.9 us
Maximum: 42.1 us
0th percentile: 35.7 us (-2% of the mean) -- minimum
5th percentile: 35.8 us (-2% of the mean)
25th percentile: 36.1 us (-1% of the mean) -- Q1
50th percentile: 36.2 us (-0% of the mean) -- median
75th percentile: 36.4 us (+0% of the mean) -- Q3
95th percentile: 36.7 us (+1% of the mean)
100th percentile: 42.1 us (+16% of the mean) -- maximum
Number of outlier (out of 35.6 us..36.9 us): 4
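These figures are ordinary descriptive statistics over the values array in the generated JSON, so they can also be recomputed by hand. A minimal sketch using Python's statistics module, with hypothetical values rather than the actual measurements above (pyperf's exact definitions may differ in details, e.g. the MAD used for the median line):

```python
import statistics

# Hypothetical per-epoch timings in seconds (not the measurements above).
values = [35.7e-6, 36.1e-6, 36.2e-6, 36.4e-6, 42.1e-6]

mean = statistics.mean(values)
stdev = statistics.stdev(values)  # sample standard deviation
median = statistics.median(values)

print(f"Mean +- std dev: {mean * 1e6:.1f} us +- {stdev * 1e6:.1f} us")
print(f"Median: {median * 1e6:.1f} us")
print(f"Minimum: {min(values) * 1e6:.1f} us, Maximum: {max(values) * 1e6:.1f} us")
```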
Show a Histogram
It’s often interesting to see a histogram, especially to visually find out if there are outliers involved.
Running python3 -m pyperf hist pyperf_shuffle_std.json produces this output:
35.7 us: 21 ######################################
36.0 us: 33 ############################################################
36.3 us: 37 ###################################################################
36.6 us: 5 #########
36.9 us: 0 |
37.2 us: 1 ##
37.5 us: 0 |
37.8 us: 0 |
38.1 us: 0 |
38.4 us: 0 |
38.7 us: 0 |
39.0 us: 0 |
39.3 us: 0 |
39.6 us: 1 ##
39.9 us: 0 |
40.2 us: 0 |
40.5 us: 1 ##
40.8 us: 0 |
41.1 us: 0 |
41.5 us: 0 |
41.8 us: 0 |
42.1 us: 1 ##
Compare Results
We have generated two results in the above examples, and we can compare them easily with python3 -m pyperf compare_to pyperf_shuffle_std.json pyperf_shuffle_nanobench.json:
+-----------+--------------------+------------------------------+
| Benchmark | pyperf_shuffle_std | pyperf_shuffle_nanobench |
+===========+====================+==============================+
| benchmark | 36.4 us | 11.2 us: 3.24x faster (-69%) |
+-----------+--------------------+------------------------------+
For more information on pyperf's analysis capabilities, please see pyperf - Analyze benchmark results.