Installation

Direct Inclusion

  1. Download nanobench.h from the release and make it available in your project.

  2. Create a .cpp file, e.g. nanobench.cpp, where the bulk of nanobench is compiled.

    nanobench.cpp
    #define ANKERL_NANOBENCH_IMPLEMENT
    #include <nanobench.h>
    
  3. Compile it, e.g. with g++ -O3 -I../include -c nanobench.cpp. This compiles the bulk of nanobench, and took 2.4 seconds on my machine. It only needs to be recompiled when you upgrade nanobench.

CMake Integration

nanobench can be integrated with CMake’s FetchContent or as a git submodule. Here is a full example of how this can be done:

CMakeLists.txt
cmake_minimum_required(VERSION 3.14)
set(CMAKE_CXX_STANDARD 17)

project(
    CMakeNanobenchExample
    VERSION 1.0
    LANGUAGES CXX)

include(FetchContent)

FetchContent_Declare(
    nanobench
    GIT_REPOSITORY https://github.com/martinus/nanobench.git
    GIT_TAG v4.1.0
    GIT_SHALLOW TRUE)

FetchContent_MakeAvailable(nanobench)

add_executable(MyExample my_example.cpp)
target_link_libraries(MyExample PRIVATE nanobench)

Usage

  1. Create the actual benchmark code, in full_example.cpp:

    full_example.cpp
    #include <nanobench.h>

    #include <atomic>

    int main() {
        int y = 0;
        std::atomic<int> x(0);
        ankerl::nanobench::Bench().run("compare_exchange_strong", [&] {
            x.compare_exchange_strong(y, 0);
        });
    }
    

    The most important entry point is ankerl::nanobench::Bench. It creates a benchmarking object, optionally configures it, and then runs the code to benchmark with run().

  2. Compile and link the example with

    g++ -O3 -I../include nanobench.o full_example.cpp -o full_example
    

    This takes just 0.28 seconds on my machine.

  3. Run ./full_example, which gives an output like this:

    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                5.63 |      177,595,338.98 |    0.0% |            3.00 |           17.98 |  0.167 |           1.00 |    0.1% |      0.00 | `compare_exchange_strong`
    

    Which means that one x.compare_exchange_strong(y, 0); call takes 5.63ns on my machine (wall-clock time), or ~178 million operations per second. Runtime fluctuates by around 0.0%, so the results are very stable. Each call required 3 instructions, which took ~18 CPU cycles. There was a single branch per call, with only 0.1% mispredicted.

Nanobench does not come with a test runner, so you can easily use it with any framework you like. In the remaining examples, I’m using doctest as a unit test framework.

Note

CPU statistics like instructions, cycles, branches, branch misses are only available on Linux, through perf events. On some systems you might need to change permissions through perf_event_paranoid or use ACL.

Examples

Something Fast

Let’s benchmark how fast we can do ++x for uint64_t:

tutorial_fast_v1.cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

TEST_CASE("tutorial_fast_v1") {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&]() {
        ++x;
    });
}

After 0.2ms we get this output:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                   - |                   - |       - |         - | :boom: `++x` (iterations overflow. Maybe your code got optimized away?)

No data there! We only get :boom: iterations overflow. The compiler could optimize ++x away because we never used the output. Thanks to doNotOptimizeAway, this is easy to fix:

tutorial_fast_v2.cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

TEST_CASE("tutorial_fast_v2") {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&]() {
        ankerl::nanobench::doNotOptimizeAway(x += 1);
    });
}

This time the benchmark runs for 2.2ms and we actually get reasonable data:

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                0.31 |    3,192,444,232.50 |    0.0% |            1.00 |            1.00 |  0.998 |           0.00 |    0.0% |      0.00 | `++x`

It’s a very stable result. In one run op/s is 3,192 million/sec, and the next time I execute it I get 3,168 million/sec. It always takes 1.00 instructions per operation on my machine, and can do this in ~1 cycle.

Something Slow

Let’s benchmark if sleeping for 100ms really takes 100ms.

tutorial_slow_v1.cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

TEST_CASE("tutorial_slow_v1") {
    ankerl::nanobench::Bench().run("sleep 100ms, auto", [&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}

After 1.1 seconds I get

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------------------
|      100,125,753.00 |                9.99 |    0.0% |           51.00 |        7,714.00 |  0.007 |          11.00 |   90.9% |      1.10 | `sleep 100ms, auto`

So we actually take 100.125ms instead of 100ms. The next time I run it, I get 100.141ms, so this is also a very stable result. Interestingly, sleep takes 51 instructions but 7,714 cycles - so we only got 0.007 instructions per cycle. That’s extremely low, but expected of sleep. It also required 11 branches, of which 90.9% were mispredicted on average.

If 1.1 seconds is too slow for you, you can manually configure the number of evaluations (epochs):

tutorial_slow_v2.cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

TEST_CASE("tutorial_slow_v2") {
    ankerl::nanobench::Bench().epochs(3).run("sleep 100ms", [&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|      100,099,096.00 |                9.99 |    0.0% |           51.00 |        7,182.00 |  0.007 |          11.00 |   90.9% |      0.30 | `sleep 100ms`

This time it took only 0.3 seconds, but with only 3 evaluations instead of 11. The err% will be less meaningful, but since the benchmark is so stable it doesn’t really matter.

Something Unstable

Let’s create an extreme artificial test that’s hard to benchmark because its runtime fluctuates randomly: each iteration skips between 0 and 255 random numbers:

tutorial_fluctuating_v1.cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

TEST_CASE("tutorial_fluctuating_v1") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().run("random fluctuations", [&] {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            ankerl::nanobench::doNotOptimizeAway(rng());
        }
    });
}

After 2.3ms, I get this result:

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|              334.12 |        2,992,911.53 |    6.3% |        3,486.44 |        1,068.67 |  3.262 |         287.86 |    0.7% |      0.00 | :wavy_dash: `random fluctuations` (Unstable with ~56.7 iters. Increase `minEpochIterations` to e.g. 567)

So on average each loop takes about 334.12ns, but we get a warning that the results are unstable. The median percentage error is 6.3%, which is quite high.

Let’s follow the suggestion to increase minEpochIterations, set it to 5000, and try again:

tutorial_fluctuating_v2.cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

TEST_CASE("tutorial_fluctuating_v2") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().minEpochIterations(5000).run(
        "random fluctuations", [&] {
            // each run, perform a random number of rng calls
            auto iterations = rng() & UINT64_C(0xff);
            for (uint64_t i = 0; i < iterations; ++i) {
                ankerl::nanobench::doNotOptimizeAway(rng());
            }
        });
}

The fluctuations are much better:

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|              277.31 |        3,606,106.48 |    0.7% |        3,531.75 |          885.18 |  3.990 |         291.59 |    0.7% |      0.00 | `random fluctuations`

The results are more stable, with only 0.7% error.

Comparing Results

I have implemented a comparison of multiple random number generators. Here several RNGs are compared to a baseline calculated from std::default_random_engine. I factored out the general benchmarking code so it’s easy to use for each of the random number generators:

example_random_number_generators.cpp (excerpt)
    uint64_t stateB;
};

namespace {

// Benchmarks how fast we can get 64bit random values from Rng.
template <typename Rng>
void bench(ankerl::nanobench::Bench* bench, char const* name) {
    std::random_device dev;
    Rng rng(dev());

    bench->run(name, [&]() {
        auto r = std::uniform_int_distribution<uint64_t>{}(rng);
        ankerl::nanobench::doNotOptimizeAway(r);
    });
}

} // namespace

TEST_CASE("example_random_number_generators") {
    // perform a few warmup calls, and since the runtime is not always stable
    // for each generator, increase the number of epochs to get more accurate
    // numbers.
    ankerl::nanobench::Bench b;
    b.title("Random Number Generators")
        .unit("uint64_t")
        .warmup(100)
        .relative(true);
    b.performanceCounters(true);

    // sets the first one as the baseline
    bench<std::default_random_engine>(&b, "std::default_random_engine");
    bench<std::mt19937>(&b, "std::mt19937");
    bench<std::mt19937_64>(&b, "std::mt19937_64");
    bench<std::ranlux24_base>(&b, "std::ranlux24_base");
    bench<std::ranlux48_base>(&b, "std::ranlux48_base");
    bench<std::ranlux24>(&b, "std::ranlux24");
    bench<std::ranlux48>(&b, "std::ranlux48");
    bench<std::knuth_b>(&b, "std::knuth_b");
    bench<WyRng>(&b, "WyRng");
    bench<NasamRng>(&b, "NasamRng");
    bench<Sfc4>(&b, "Sfc4");
    bench<RomuTrio>(&b, "RomuTrio");
    bench<RomuDuo>(&b, "RomuDuo");
    bench<RomuDuoJr>(&b, "RomuDuoJr");
    bench<Orbit>(&b, "Orbit");
    bench<ankerl::nanobench::Rng>(&b, "ankerl::nanobench::Rng");
}

Runs for 60ms and prints this table:

| relative |         ns/uint64_t |          uint64_t/s |    err% |    ins/uint64_t |    cyc/uint64_t |    IPC |   bra/uint64_t |   miss% |     total | Random Number Generators
|---------:|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:-------------------------
|   100.0% |               35.87 |       27,881,924.28 |    2.3% |          127.80 |          114.61 |  1.115 |           9.77 |    3.7% |      0.00 | `std::default_random_engine`
|   490.3% |                7.32 |      136,699,693.21 |    0.6% |           89.55 |           23.49 |  3.812 |           9.51 |    0.1% |      0.00 | `std::mt19937`
| 1,767.4% |                2.03 |      492,786,582.33 |    0.6% |           24.38 |            6.48 |  3.761 |           1.26 |    0.6% |      0.00 | `std::mt19937_64`
|    85.2% |               42.08 |       23,764,853.03 |    0.7% |          157.07 |          134.62 |  1.167 |          19.51 |    7.6% |      0.00 | `std::ranlux24_base`
|   121.3% |               29.56 |       33,824,759.51 |    0.5% |           91.03 |           94.35 |  0.965 |          10.00 |    8.1% |      0.00 | `std::ranlux48_base`
|    17.4% |              205.67 |        4,862,080.59 |    1.2% |          709.83 |          657.10 |  1.080 |         101.79 |   16.1% |      0.00 | `std::ranlux24`
|     8.7% |              412.46 |        2,424,497.97 |    1.8% |        1,514.70 |        1,318.43 |  1.149 |         219.09 |   16.7% |      0.00 | `std::ranlux48`
|    59.2% |               60.60 |       16,502,276.18 |    1.9% |          253.77 |          193.39 |  1.312 |          24.93 |    1.5% |      0.00 | `std::knuth_b`
| 5,187.1% |                0.69 |    1,446,254,071.66 |    0.1% |            6.00 |            2.21 |  2.714 |           0.00 |    0.0% |      0.00 | `WyRng`
| 1,431.7% |                2.51 |      399,177,833.54 |    0.0% |           21.00 |            8.01 |  2.621 |           0.00 |    0.0% |      0.00 | `NasamRng`
| 2,629.9% |                1.36 |      733,279,957.30 |    0.1% |           13.00 |            4.36 |  2.982 |           0.00 |    0.0% |      0.00 | `Sfc4`
| 3,815.7% |                0.94 |    1,063,889,655.17 |    0.0% |           11.00 |            3.01 |  3.661 |           0.00 |    0.0% |      0.00 | `RomuTrio`
| 3,529.5% |                1.02 |      984,102,081.37 |    0.3% |            9.00 |            3.25 |  2.768 |           0.00 |    0.0% |      0.00 | `RomuDuo`
| 4,580.4% |                0.78 |    1,277,113,402.06 |    0.0% |            7.00 |            2.50 |  2.797 |           0.00 |    0.0% |      0.00 | `RomuDuoJr`
| 2,291.2% |                1.57 |      638,820,992.09 |    0.0% |           11.00 |            5.00 |  2.200 |           0.00 |    0.0% |      0.00 | `ankerl::nanobench::Rng`

It shows that ankerl::nanobench::Rng is one of the fastest RNGs, and has the least amount of fluctuation. It takes only 1.57ns to generate a random uint64_t, so ~638 million calls per second are possible. The leftmost column shows the relative performance compared to std::default_random_engine.

Note

Here pure runtime performance is not necessarily the best benchmark. Especially the fastest RNGs can be inlined and use instruction-level parallelism to their advantage: they immediately return an old state, and while user code can already use that value, the next value is calculated in parallel. See the excellent paper at romu-random for details.

Asymptotic Complexity

It is possible to calculate asymptotic complexity (Big O) from multiple runs of a benchmark. Run the benchmark with different complexity N, then nanobench can calculate the best fitting curve.

The following example finds out the asymptotic complexity of std::set’s find().

tutorial_complexity_set.cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <iostream>
#include <set>

TEST_CASE("tutorial_complexity_set_find") {
    // Create a single benchmark instance that is used in multiple benchmark
    // runs, with different settings for complexityN.
    ankerl::nanobench::Bench bench;

    // a RNG to generate input data
    ankerl::nanobench::Rng rng;

    std::set<uint64_t> set;

    // Running the benchmark multiple times, with different number of elements
    for (auto setSize :
         {10U, 20U, 50U, 100U, 200U, 500U, 1000U, 2000U, 5000U, 10000U}) {

        // fill up the set with random data
        while (set.size() < setSize) {
            set.insert(rng());
        }

        // Run the benchmark, provide setSize as the scaling variable.
        bench.complexityN(set.size()).run("std::set find", [&] {
            ankerl::nanobench::doNotOptimizeAway(set.find(rng()));
        });
    }

    // calculate BigO complexity best fit and print the results
    std::cout << bench.complexityBigO() << std::endl;
}

The loop runs the benchmark 10 times, with different set sizes from 10 to 10k.

Note

Each of the 10 benchmark runs automatically scales the number of iterations so results are still fast and accurate. In total the whole test takes about 90ms.

The Bench object holds the benchmark results of the 10 benchmark runs. Each benchmark is recorded with a different setting for complexityN.

After the benchmark prints the benchmark results, we calculate & print the Big O of the most important complexity functions. std::cout << bench.complexityBigO() << std::endl; prints e.g. this markdown table:

|   coefficient |   err% | complexity
|--------------:|-------:|------------
|   6.66562e-09 |  29.1% | O(log n)
|   1.47588e-11 |  58.3% | O(n)
|   1.10742e-12 |  62.6% | O(n log n)
|   5.15683e-08 |  63.8% | O(1)
|   1.40387e-15 |  78.7% | O(n^2)
|   1.32792e-19 |  85.7% | O(n^3)

The table is sorted with the best-fitting complexity function first, so \(\mathcal{O}(\log{}n)\) provides the best approximation for the complexity. Interestingly, in this case the difference in error compared to \(\mathcal{O}(n)\) is not very large, which can be an indication that even though the red-black tree should theoretically have logarithmic complexity, in practice that is not perfectly the case.

Rendering Mustache-like Templates

Nanobench comes with a powerful Mustache-like template mechanism to process the benchmark results into all kinds of formats. You can find a full description of all possible tags at ankerl::nanobench::render().

Several preconfigured formats exist in the namespace ankerl::nanobench::templates. Rendering these templates can be done with either ankerl::nanobench::render(), or directly with ankerl::nanobench::Bench::render().

The following example shows how to use the CSV - Comma-Separated Values template while suppressing the default output to standard output.

tutorial_render_simple.cpp
 1#include <nanobench.h>
 2#include <thirdparty/doctest/doctest.h>
 3
 4#include <atomic>
 5#include <iostream>
 6
 7TEST_CASE("tutorial_render_simple") {
 8    std::atomic<int> x(0);
 9
10    ankerl::nanobench::Bench()
11        .output(nullptr)
12        .run("std::vector",
13             [&] {
14                 ++x;
15             })
16        .render(ankerl::nanobench::templates::csv(), std::cout);
17}

In line 11 we call Bench::output() with nullptr, thus disabling the standard output.

After the benchmark we directly call Bench::render() in line 16. Here we use the CSV template, and write the rendered output to std::cout. When running, we get just the CSV output to the console which looks like this:

"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
"benchmark";"std::vector";"op";1;6.51982200647249e-09;8.26465858909014e-05;23.0034662045061;5;0.00116867939228672;0.000171959

To demonstrate what the preconfigured templates in the namespace ankerl::nanobench::templates can do, here is a simple example that benchmarks two random number generators, std::mt19937_64 and std::knuth_b, and writes both the template and the rendered output to files:

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <fstream>
#include <random>
#include <string>

namespace {

// Writes the template itself and the output rendered from it to two files.
void gen(std::string const& typeName, char const* mustacheTemplate,
         ankerl::nanobench::Bench const& bench) {

    std::ofstream templateOut("mustache.template." + typeName);
    templateOut << mustacheTemplate;

    std::ofstream renderOut("mustache.render." + typeName);
    ankerl::nanobench::render(mustacheTemplate, bench, renderOut);
}

} // namespace

TEST_CASE("tutorial_mustache") {
    ankerl::nanobench::Bench bench;
    bench.title("Benchmarking std::mt19937_64 and std::knuth_b");

    std::mt19937_64 rng1;
    bench.run("std::mt19937_64", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng1());
    });

    std::knuth_b rng2;
    bench.run("std::knuth_b", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng2());
    });

    gen("json", ankerl::nanobench::templates::json(), bench);
    gen("html", ankerl::nanobench::templates::htmlBoxplot(), bench);
    gen("csv", ankerl::nanobench::templates::csv(), bench);
}

CSV - Comma-Separated Values

The function ankerl::nanobench::templates::csv() provides this template:

"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
{{#result}}"{{title}}";"{{name}}";"{{unit}}";{{batch}};{{median(elapsed)}};{{medianAbsolutePercentError(elapsed)}};{{median(instructions)}};{{median(branchinstructions)}};{{median(branchmisses)}};{{sumProduct(iterations, elapsed)}}
{{/result}}

This generates a compact CSV file, where entries are separated by a semicolon ;. Running the example, I get this output:

"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
"Benchmarking std::mt19937_64 and std::knuth_b";"std::mt19937_64";"op";1;2.54441805225653e-08;0.0236579384033733;125.989678899083;16.7645714285714;0.564133016627078;0.000218811
"Benchmarking std::mt19937_64 and std::knuth_b";"std::knuth_b";"op";1;3.19013867488444e-08;0.00091350764819687;170.013008130081;28;0.0031104199066874;0.000217248

Rendered as CSV table:

| title | name | unit | batch | elapsed | error % | instructions | branches | branch misses | total
|:------|:-----|:-----|------:|--------:|--------:|-------------:|---------:|--------------:|------:
| Benchmarking std::mt19937_64 and std::knuth_b | std::mt19937_64 | op | 1 | 2.54441805225653e-08 | 0.0236579384033733 | 125.989678899083 | 16.7645714285714 | 0.564133016627078 | 0.000218811
| Benchmarking std::mt19937_64 and std::knuth_b | std::knuth_b | op | 1 | 3.19013867488444e-08 | 0.00091350764819687 | 170.013008130081 | 28 | 0.0031104199066874 | 0.000217248

Note that the CSV template doesn’t provide all the data that is available.

HTML Box Plots

With the template ankerl::nanobench::templates::htmlBoxplot() you get a Plotly-based HTML output which generates a boxplot of the runtime. The template is rather simple.

<html>

<head>
    <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
</head>

<body>
    <div id="myDiv"></div>
    <script>
        var data = [
            {{#result}}{
                name: '{{name}}',
                y: [{{#measurement}}{{elapsed}}{{^-last}}, {{/last}}{{/measurement}}],
            },
            {{/result}}
        ];
        var title = '{{title}}';

        data = data.map(a => Object.assign(a, { boxpoints: 'all', pointpos: 0, type: 'box' }));
        var layout = { title: { text: title }, showlegend: false, yaxis: { title: 'time per unit', rangemode: 'tozero', autorange: true } }; Plotly.newPlot('myDiv', data, layout, {responsive: true});
    </script>
</body>

</html>

This generates an interactive boxplot, which gives a nice visual showcase of the runtime performance of the evaluated benchmarks. Each epoch is visualized as a dot, and the boxplot itself shows median, percentiles, and outliers. You might want to increase the default number of epochs for an even better visualization result.

JSON - JavaScript Object Notation

The ankerl::nanobench::templates::json() template provides all the data that is available, from all runs. The template is therefore quite complex:

{
    "results": [
{{#result}}        {
            "title": "{{title}}",
            "name": "{{name}}",
            "unit": "{{unit}}",
            "batch": {{batch}},
            "complexityN": {{complexityN}},
            "epochs": {{epochs}},
            "clockResolution": {{clockResolution}},
            "clockResolutionMultiple": {{clockResolutionMultiple}},
            "maxEpochTime": {{maxEpochTime}},
            "minEpochTime": {{minEpochTime}},
            "minEpochIterations": {{minEpochIterations}},
            "epochIterations": {{epochIterations}},
            "warmup": {{warmup}},
            "relative": {{relative}},
            "median(elapsed)": {{median(elapsed)}},
            "medianAbsolutePercentError(elapsed)": {{medianAbsolutePercentError(elapsed)}},
            "median(instructions)": {{median(instructions)}},
            "medianAbsolutePercentError(instructions)": {{medianAbsolutePercentError(instructions)}},
            "median(cpucycles)": {{median(cpucycles)}},
            "median(contextswitches)": {{median(contextswitches)}},
            "median(pagefaults)": {{median(pagefaults)}},
            "median(branchinstructions)": {{median(branchinstructions)}},
            "median(branchmisses)": {{median(branchmisses)}},
            "totalTime": {{sumProduct(iterations, elapsed)}},
            "measurements": [
{{#measurement}}                {
                    "iterations": {{iterations}},
                    "elapsed": {{elapsed}},
                    "pagefaults": {{pagefaults}},
                    "cpucycles": {{cpucycles}},
                    "contextswitches": {{contextswitches}},
                    "instructions": {{instructions}},
                    "branchinstructions": {{branchinstructions}},
                    "branchmisses": {{branchmisses}}
                }{{^-last}},{{/-last}}
{{/measurement}}            ]
        }{{^-last}},{{/-last}}
{{/result}}    ]
}

This also gives the data from each separate ankerl::nanobench::Bench::epochs(), not just the accumulated data as in the CSV template.

{
    "results": [
        {
            "title": "Benchmarking std::mt19937_64 and std::knuth_b",
            "name": "std::mt19937_64",
            "unit": "op",
            "batch": 1,
            "complexityN": -1,
            "epochs": 11,
            "clockResolution": 1.8e-08,
            "clockResolutionMultiple": 1000,
            "maxEpochTime": 0.1,
            "minEpochTime": 0,
            "minEpochIterations": 1,
            "warmup": 0,
            "relative": 0,
            "median(elapsed)": 2.54441805225653e-08,
            "medianAbsolutePercentError(elapsed)": 0.0236579384033733,
            "median(instructions)": 125.989678899083,
            "medianAbsolutePercentError(instructions)": 0.035125448044942,
            "median(cpucycles)": 81.3479809976247,
            "median(contextswitches)": 0,
            "median(pagefaults)": 0,
            "median(branchinstructions)": 16.7645714285714,
            "median(branchmisses)": 0.564133016627078,
            "totalTime": 0.000218811,
            "measurements": [
                {
                    "iterations": 875,
                    "elapsed": 2.54708571428571e-08,
                    "pagefaults": 0,
                    "cpucycles": 81.472,
                    "contextswitches": 0,
                    "instructions": 125.885714285714,
                    "branchinstructions": 16.7645714285714,
                    "branchmisses": 0.574857142857143
                },
                {
                    "iterations": 809,
                    "elapsed": 2.58467243510507e-08,
                    "pagefaults": 0,
                    "cpucycles": 82.5290482076638,
                    "contextswitches": 0,
                    "instructions": 128.771322620519,
                    "branchinstructions": 17.0296662546354,
                    "branchmisses": 0.582200247218789
                },
                {
                    "iterations": 737,
                    "elapsed": 2.24097693351425e-08,
                    "pagefaults": 0,
                    "cpucycles": 71.6431478968792,
                    "contextswitches": 0,
                    "instructions": 118.374491180461,
                    "branchinstructions": 15.9470827679783,
                    "branchmisses": 0.417910447761194
                },
                {
                    "iterations": 872,
                    "elapsed": 2.53405963302752e-08,
                    "pagefaults": 0,
                    "cpucycles": 80.9896788990826,
                    "contextswitches": 0,
                    "instructions": 125.989678899083,
                    "branchinstructions": 16.7580275229358,
                    "branchmisses": 0.563073394495413
                },
                {
                    "iterations": 834,
                    "elapsed": 2.59256594724221e-08,
                    "pagefaults": 0,
                    "cpucycles": 82.7661870503597,
                    "contextswitches": 0,
                    "instructions": 127.635491606715,
                    "branchinstructions": 16.9352517985612,
                    "branchmisses": 0.575539568345324
                },
                {
                    "iterations": 772,
                    "elapsed": 2.25310880829016e-08,
                    "pagefaults": 0,
                    "cpucycles": 72.0129533678757,
                    "contextswitches": 0,
                    "instructions": 117.108808290155,
                    "branchinstructions": 15.8341968911917,
                    "branchmisses": 0.405440414507772
                },
                {
                    "iterations": 842,
                    "elapsed": 2.54441805225653e-08,
                    "pagefaults": 0,
                    "cpucycles": 81.3479809976247,
                    "contextswitches": 0,
                    "instructions": 127.266033254157,
                    "branchinstructions": 16.8859857482185,
                    "branchmisses": 0.564133016627078
                },
                {
                    "iterations": 792,
                    "elapsed": 2.20126262626263e-08,
                    "pagefaults": 0,
                    "cpucycles": 70.3623737373737,
                    "contextswitches": 0,
                    "instructions": 116.420454545455,
                    "branchinstructions": 15.7588383838384,
                    "branchmisses": 0.396464646464646
                },
                {
                    "iterations": 757,
                    "elapsed": 2.63870541611625e-08,
                    "pagefaults": 0,
                    "cpucycles": 84.332892998679,
                    "contextswitches": 0,
                    "instructions": 131.462351387054,
                    "branchinstructions": 17.334214002642,
                    "branchmisses": 0.618229854689564
                },
                {
                    "iterations": 850,
                    "elapsed": 2.23305882352941e-08,
                    "pagefaults": 0,
                    "cpucycles": 71.3505882352941,
                    "contextswitches": 0,
                    "instructions": 114.629411764706,
                    "branchinstructions": 15.5823529411765,
                    "branchmisses": 0.392941176470588
                },
                {
                    "iterations": 774,
                    "elapsed": 2.60607235142119e-08,
                    "pagefaults": 0,
                    "cpucycles": 83.1679586563308,
                    "contextswitches": 0,
                    "instructions": 130.576227390181,
                    "branchinstructions": 17.2635658914729,
                    "branchmisses": 0.590439276485788
                }
            ]
        },
        {
            "title": "Benchmarking std::mt19937_64 and std::knuth_b",
            "name": "std::knuth_b",
            "unit": "op",
            "batch": 1,
            "complexityN": -1,
            "epochs": 11,
            "clockResolution": 1.8e-08,
            "clockResolutionMultiple": 1000,
            "maxEpochTime": 0.1,
            "minEpochTime": 0,
            "minEpochIterations": 1,
            "warmup": 0,
            "relative": 0,
            "median(elapsed)": 3.19013867488444e-08,
            "medianAbsolutePercentError(elapsed)": 0.00091350764819687,
            "median(instructions)": 170.013008130081,
            "medianAbsolutePercentError(instructions)": 4.11992392254248e-06,
            "median(cpucycles)": 101.973254086181,
            "median(contextswitches)": 0,
            "median(pagefaults)": 0,
            "median(branchinstructions)": 28,
            "median(branchmisses)": 0.0031104199066874,
            "totalTime": 0.000217248,
            "measurements": [
                {
                    "iterations": 568,
                    "elapsed": 3.2137323943662e-08,
168                    "pagefaults": 0,
169                    "cpucycles": 102.55985915493,
170                    "contextswitches": 0,
171                    "instructions": 170.014084507042,
172                    "branchinstructions": 28,
173                    "branchmisses": 0.00528169014084507
174                },
175                {
176                    "iterations": 576,
177                    "elapsed": 3.19305555555556e-08,
178                    "pagefaults": 0,
179                    "cpucycles": 102.059027777778,
180                    "contextswitches": 0,
181                    "instructions": 170.013888888889,
182                    "branchinstructions": 28,
183                    "branchmisses": 0.00347222222222222
184                },
185                {
186                    "iterations": 643,
187                    "elapsed": 3.18973561430793e-08,
188                    "pagefaults": 0,
189                    "cpucycles": 101.973561430793,
190                    "contextswitches": 0,
191                    "instructions": 170.012441679627,
192                    "branchinstructions": 28,
193                    "branchmisses": 0.0031104199066874
194                },
195                {
196                    "iterations": 591,
197                    "elapsed": 3.1912013536379e-08,
198                    "pagefaults": 0,
199                    "cpucycles": 101.944162436548,
200                    "contextswitches": 0,
201                    "instructions": 170.013536379019,
202                    "branchinstructions": 28,
203                    "branchmisses": 0.00169204737732657
204                },
205                {
206                    "iterations": 673,
207                    "elapsed": 3.19049034175334e-08,
208                    "pagefaults": 0,
209                    "cpucycles": 101.973254086181,
210                    "contextswitches": 0,
211                    "instructions": 170.011887072808,
212                    "branchinstructions": 28,
213                    "branchmisses": 0.00297176820208024
214                },
215                {
216                    "iterations": 649,
217                    "elapsed": 3.19013867488444e-08,
218                    "pagefaults": 0,
219                    "cpucycles": 101.850539291217,
220                    "contextswitches": 0,
221                    "instructions": 170.012326656394,
222                    "branchinstructions": 28,
223                    "branchmisses": 0.00308166409861325
224                },
225                {
226                    "iterations": 606,
227                    "elapsed": 3.18547854785479e-08,
228                    "pagefaults": 0,
229                    "cpucycles": 101.83498349835,
230                    "contextswitches": 0,
231                    "instructions": 170.013201320132,
232                    "branchinstructions": 28,
233                    "branchmisses": 0.0033003300330033
234                },
235                {
236                    "iterations": 650,
237                    "elapsed": 3.18769230769231e-08,
238                    "pagefaults": 0,
239                    "cpucycles": 101.898461538462,
240                    "contextswitches": 0,
241                    "instructions": 170.012307692308,
242                    "branchinstructions": 28,
243                    "branchmisses": 0.00307692307692308
244                },
245                {
246                    "iterations": 615,
247                    "elapsed": 3.18520325203252e-08,
248                    "pagefaults": 0,
249                    "cpucycles": 101.858536585366,
250                    "contextswitches": 0,
251                    "instructions": 170.013008130081,
252                    "branchinstructions": 28,
253                    "branchmisses": 0.0032520325203252
254                },
255                {
256                    "iterations": 579,
257                    "elapsed": 3.18618307426598e-08,
258                    "pagefaults": 0,
259                    "cpucycles": 101.989637305699,
260                    "contextswitches": 0,
261                    "instructions": 170.013816925734,
262                    "branchinstructions": 28,
263                    "branchmisses": 0.00345423143350604
264                },
265                {
266                    "iterations": 657,
267                    "elapsed": 3.19558599695586e-08,
268                    "pagefaults": 0,
269                    "cpucycles": 102.229832572298,
270                    "contextswitches": 0,
271                    "instructions": 170.012176560122,
272                    "branchinstructions": 28,
273                    "branchmisses": 0.0030441400304414
274                }
275            ]
276        }
277    ]
278}

pyperf - Python pyperf module Output

Pyperf is a powerful tool for benchmarking and system tuning, and it can also analyze benchmark results. This template renders benchmark results in pyperf's JSON format, so they can be analyzed further with pyperf.

Note

Pyperf supports only a single benchmark result per generated output, so it is best to create a new Bench object for each benchmark.

The template looks like this. Note that it directly makes use of {{#measurement}}, which is only possible when there is a single result in the benchmark.

{
    "benchmarks": [
        {
            "runs": [
                {
                    "values": [
{{#measurement}}                        {{elapsed}}{{^-last}},
{{/last}}{{/measurement}}
                    ]
                }
            ]
        }
    ],
    "metadata": {
        "loops": {{sum(iterations)}},
        "inner_loops": {{batch}},
        "name": "{{title}}",
        "unit": "second"
    },
    "version": "1.0"
}
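
To make the shape of the rendered output more concrete, here is a minimal sketch that writes the same JSON structure by hand from a list of per-iteration times. It does not use nanobench at all: the function name renderPyperfJson and its parameters are hypothetical, and the title is assumed to contain no characters that need JSON escaping. The loop mirrors the {{^-last}} section of the template, which emits a comma after every value except the last.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of nanobench: builds the same JSON shape
// that the pyperf template renders, from per-iteration times in seconds.
// Assumption: `title` needs no JSON escaping.
std::string renderPyperfJson(std::string const& title,
                             std::vector<double> const& elapsed,
                             std::uint64_t loops, std::uint64_t innerLoops) {
    assert(!elapsed.empty());
    std::ostringstream out;
    out.precision(15);
    out << "{\n    \"benchmarks\": [{\"runs\": [{\"values\": [\n";
    for (std::size_t i = 0; i < elapsed.size(); ++i) {
        out << "        " << elapsed[i];
        // Mirrors {{^-last}}: a comma after every element except the last
        if (i + 1 != elapsed.size()) {
            out << ",";
        }
        out << "\n";
    }
    out << "    ]}]}],\n"
        << "    \"metadata\": {\"loops\": " << loops
        << ", \"inner_loops\": " << innerLoops
        << ", \"name\": \"" << title << "\", \"unit\": \"second\"},\n"
        << "    \"version\": \"1.0\"\n}\n";
    return out.str();
}
```

Calling renderPyperfJson("demo", {0.5, 0.25}, 100, 1) yields a document with two entries in "values" and a comma only after the first one, which is the property a JSON parser such as pyperf's loader depends on.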

Here is an example that generates pyperf-compatible output for a benchmark that shuffles a vector:

example_pyperf.cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <random>
#include <vector>

TEST_CASE("shuffle_pyperf") {
    std::vector<uint64_t> data(500, 0); // input data for shuffling

    // First result file: std::shuffle with the standard library's default engine
    std::default_random_engine defaultRng(123);
    std::ofstream fout1("pyperf_shuffle_std.json");
    ankerl::nanobench::Bench()
        .epochs(100)
        .run("std::shuffle with std::default_random_engine",
             [&]() {
                 std::shuffle(data.begin(), data.end(), defaultRng);
             })
        .render(ankerl::nanobench::templates::pyperf(), fout1);

    // Second result file: nanobench's own Rng; a separate Bench object is used
    // because pyperf supports only one benchmark result per output
    std::ofstream fout2("pyperf_shuffle_nanobench.json");
    ankerl::nanobench::Rng rng(123);
    ankerl::nanobench::Bench()
        .epochs(100)
        .run("ankerl::nanobench::Rng::shuffle",
             [&]() {
                 rng.shuffle(data);
             })
        .render(ankerl::nanobench::templates::pyperf(), fout2);
}

This benchmark run creates the two files pyperf_shuffle_std.json and pyperf_shuffle_nanobench.json. Here are some of the analyses you can do:

Show Benchmark Statistics

Output from python3 -m pyperf stats pyperf_shuffle_std.json:

Total duration: 364 ms
Raw value minimum: 3.57 ms
Raw value maximum: 4.21 ms

Number of calibration run: 0
Number of run with values: 1
Total number of run: 1

Number of warmup per run: 0
Number of value per run: 100
Loop iterations per value: 100
Total number of values: 100

Minimum:         35.7 us
Median +- MAD:   36.2 us +- 0.2 us
Mean +- std dev: 36.4 us +- 0.9 us
Maximum:         42.1 us

  0th percentile: 35.7 us (-2% of the mean) -- minimum
  5th percentile: 35.8 us (-2% of the mean)
 25th percentile: 36.1 us (-1% of the mean) -- Q1
 50th percentile: 36.2 us (-0% of the mean) -- median
 75th percentile: 36.4 us (+0% of the mean) -- Q3
 95th percentile: 36.7 us (+1% of the mean)
100th percentile: 42.1 us (+16% of the mean) -- maximum

Number of outlier (out of 35.6 us..36.9 us): 4
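
The "Median +- MAD" line reports the median together with the median absolute deviation, a spread measure that is far less sensitive to outliers than the standard deviation. The following is a minimal sketch of how these two numbers can be computed from a list of values. It is an illustration of the statistic, not pyperf's actual implementation, and the function names median and medianAbsDev are chosen here for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Median of the values. The vector is taken by value so the caller's
// data is not reordered by the sort.
double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    auto const mid = v.size() / 2;
    // Odd count: middle element; even count: mean of the two middle elements
    return v.size() % 2 == 1 ? v[mid] : (v[mid - 1] + v[mid]) / 2.0;
}

// Median absolute deviation: the median of |x - median(values)|.
double medianAbsDev(std::vector<double> const& values) {
    double const m = median(values);
    std::vector<double> deviations;
    deviations.reserve(values.size());
    for (double x : values) {
        deviations.push_back(std::fabs(x - m));
    }
    return median(deviations);
}
```

For the values 1, 2, 3, 4 and 100 the median is 3 and the median absolute deviation is 1: the single large outlier barely moves either number, which is why the median line above stays close to the bulk of the measurements while the mean and standard deviation are pulled toward the 42.1 us maximum.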

Show a Histogram

It’s often interesting to see a histogram, especially to check visually whether there are outliers. Running python3 -m pyperf hist pyperf_shuffle_std.json produces this output:

35.7 us: 21 ######################################
36.0 us: 33 ############################################################
36.3 us: 37 ###################################################################
36.6 us:  5 #########
36.9 us:  0 |
37.2 us:  1 ##
37.5 us:  0 |
37.8 us:  0 |
38.1 us:  0 |
38.4 us:  0 |
38.7 us:  0 |
39.0 us:  0 |
39.3 us:  0 |
39.6 us:  1 ##
39.9 us:  0 |
40.2 us:  0 |
40.5 us:  1 ##
40.8 us:  0 |
41.1 us:  0 |
41.5 us:  0 |
41.8 us:  0 |
42.1 us:  1 ##

Compare Results

The example above generated two result files, and they can be compared easily with python3 -m pyperf compare_to a.json b.json, where a.json and b.json stand for the two files:

+-----------+--------------------+------------------------------+
| Benchmark | pyperf_shuffle_std | pyperf_shuffle_nanobench     |
+===========+====================+==============================+
| benchmark | 36.4 us            | 11.2 us: 3.24x faster (-69%) |
+-----------+--------------------+------------------------------+
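
The two figures in the last column can be sanity-checked by hand. The sketch below assumes that the speedup is the ratio of the two mean times and that the percentage is the relative change of the second result against the first; this is an assumption about how the numbers relate, not a copy of pyperf's code, and the names Comparison and compare are made up for the example.

```cpp
#include <cassert>

struct Comparison {
    double speedup;       // reference / changed; above 1 means the changed run is faster
    double percentChange; // relative change of the changed run's time, in percent
};

// Compares two mean times given in the same unit.
Comparison compare(double referenceMean, double changedMean) {
    assert(referenceMean > 0.0 && changedMean > 0.0);
    return {referenceMean / changedMean,
            (changedMean - referenceMean) / referenceMean * 100.0};
}
```

With the rounded values from the table, compare(36.4, 11.2) gives a speedup of about 3.25 and a change of about -69%. The small difference from the reported 3.24x is consistent with the table having been computed from the unrounded means.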

For more information on pyperf's analysis capabilities, see pyperf - Analyze benchmark results.