Installation

Direct Inclusion

Download nanobench.h from the release and make it available in your project.
Create a .cpp file, e.g. nanobench.cpp, where the bulk of nanobench is compiled.
nanobench.cpp
```
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
```
Compile it, e.g. with g++ -O3 -I../include -c nanobench.cpp. This compiles the bulk of nanobench, and took 2.4 seconds on my machine. It needs to be recompiled only when you upgrade nanobench.

CMake Integration

nanobench can be integrated with CMake’s FetchContent or as a git submodule. Here is a full example of how this can be done:

CMakeLists.txt

cmake_minimum_required(VERSION 3.14)
set(CMAKE_CXX_STANDARD 17)

project(
    CMakeNanobenchExample
    VERSION 1.0
    LANGUAGES CXX)

include(FetchContent)

FetchContent_Declare(
    nanobench
    GIT_REPOSITORY https://github.com/martinus/nanobench.git
    GIT_TAG v4.1.0
    GIT_SHALLOW TRUE)

FetchContent_MakeAvailable(nanobench)

add_executable(MyExample my_example.cpp)
target_link_libraries(MyExample PRIVATE nanobench)

Usage

Create the actual benchmark code, in full_example.cpp:

full_example.cpp

#include <nanobench.h>

#include <atomic>

int main() {
    int y = 0;
    std::atomic<int> x(0);
    ankerl::nanobench::Bench().run("compare_exchange_strong", [&] {
        x.compare_exchange_strong(y, 0);
    });
}

The most important entry point is ankerl::nanobench::Bench. It creates a benchmarking object, optionally configures it, and then runs the code to benchmark with run().

Compile and link the example with
```
g++ -O3 -I../include nanobench.o full_example.cpp -o full_example
```
This takes just 0.28 seconds on my machine.
Run ./full_example, which gives an output like this:
```
|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                5.63 |      177,595,338.98 |    0.0% |            3.00 |           17.98 |  0.167 |           1.00 |    0.1% |      0.00 | `compare_exchange_strong`
```
Which renders as

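| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|------:|-----:|-----:|-------:|-------:|----:|-------:|------:|------:|:----------|
| 5.63 | 177,595,338.98 | 0.0% | 3.00 | 17.98 | 0.167 | 1.00 | 0.1% | 0.00 | compare_exchange_strong |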

This means that one x.compare_exchange_strong(y, 0); call takes 5.63ns on my machine (wall-clock time), or ~178 million operations per second. Runtime fluctuates by around 0.0%, so the results are very stable. Each call required 3 instructions, which took ~18 CPU cycles. There was a single branch per call, with only 0.1% mispredicted.
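The columns of the table are related by simple arithmetic. The following sketch shows that relationship; the helper names are made up for illustration and are not part of nanobench.

```cpp
#include <cassert>

// Hypothetical helpers (not part of nanobench) that relate the table columns.

// Operations per second, derived from nanoseconds per operation (ns/op).
inline double opsPerSecond(double nsPerOp) {
    assert(nsPerOp > 0.0);
    return 1e9 / nsPerOp;
}

// Instructions per cycle (IPC), derived from the ins/op and cyc/op columns.
inline double instructionsPerCycle(double insPerOp, double cycPerOp) {
    assert(cycPerOp > 0.0);
    return insPerOp / cycPerOp;
}
```

With the values from the table, opsPerSecond(5.63) gives roughly 177.6 million and instructionsPerCycle(3.00, 17.98) gives roughly 0.167. The printed op/s differs slightly, presumably because the printed ns/op is rounded to two decimals.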

Nanobench does not come with a test runner, so you can easily use it with any framework you like. In the remaining examples, I’m using doctest as a unit test framework.

Note

CPU statistics like instructions, cycles, branches, branch misses are only available on Linux, through perf events. On some systems you might need to change permissions through perf_event_paranoid or use ACL.
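To check the current setting from a program, you can read the kernel setting as a plain integer. This is a small sketch with a made-up helper name; on Linux the file is usually /proc/sys/kernel/perf_event_paranoid, and lower values are generally less restrictive.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper (not part of nanobench): reads an integer setting such
// as /proc/sys/kernel/perf_event_paranoid into `value`. Returns false if the
// file is missing or does not start with an integer (e.g. on non-Linux
// systems), in which case `value` is left untouched.
inline bool readKernelSetting(std::string const& path, int& value) {
    assert(!path.empty());
    std::ifstream in(path);
    int parsed = 0;
    if (!(in >> parsed)) {
        return false;
    }
    value = parsed;
    return true;
}
```

Calling readKernelSetting("/proc/sys/kernel/perf_event_paranoid", level) before running benchmarks gives a hint why the CPU statistics columns might be missing.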

Examples

Something Fast

Let’s benchmark how fast we can do ++x for uint64_t:

tutorial_fast_v1.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

// NOLINTNEXTLINE
TEST_CASE("tutorial_fast_v1") {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&]() {
        ++x;
    });
}

After 0.2ms we get this output:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                   - |                   - |       - |         - | :boom: `++x` (iterations overflow. Maybe your code got optimized away?)

No data there! We only get :boom: iterations overflow. The compiler could optimize ++x away because we never used the output. Thanks to doNotOptimizeAway, this is easy to fix:

tutorial_fast_v2.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

// NOLINTNEXTLINE
TEST_CASE("tutorial_fast_v2") {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&]() {
        ankerl::nanobench::doNotOptimizeAway(x += 1);
    });
}

This time the benchmark runs for 2.2ms and we actually get reasonable data:

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                0.31 |    3,192,444,232.50 |    0.0% |            1.00 |            1.00 |  0.998 |           0.00 |    0.0% |      0.00 | `++x`

It’s a very stable result. In one run the op/s is 3,192 million/sec, the next time I execute it I get 3,168 million/sec. It always takes 1.00 instructions per operation on my machine, and can do this in ~1 cycle.
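To get a feeling for why such a barrier helps, here is a minimal sketch of one portable way to keep a value alive. It is an illustration only: keepAlive and incrementLoop are made-up names, and nanobench’s own doNotOptimizeAway may be implemented differently and more cheaply.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an optimization barrier (not nanobench's code).
template <typename T>
inline void keepAlive(T const& value) {
    // A volatile write is an observable side effect, so the compiler has to
    // compute `value` instead of discarding the work that produced it.
    volatile T sink = value;
    (void)sink;
}

// Runs `iterations` increments, starting at 1, and keeps each result alive.
inline std::uint64_t incrementLoop(std::uint64_t iterations) {
    assert(iterations > 0);
    std::uint64_t x = 1;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        keepAlive(x += 1);
    }
    return x;
}
```

The volatile store costs an extra memory write per call, so this sketch would distort a measurement as small as a single increment. For real benchmarks, stick with ankerl::nanobench::doNotOptimizeAway.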

Something Slow

Let’s benchmark if sleeping for 100ms really takes 100ms.

tutorial_slow_v1.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

// NOLINTNEXTLINE
TEST_CASE("tutorial_slow_v1") {
    ankerl::nanobench::Bench().run("sleep 100ms, auto", [&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}

After 1.1 seconds I get

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------------------
|      100,125,753.00 |                9.99 |    0.0% |           51.00 |        7,714.00 |  0.007 |          11.00 |   90.9% |      1.10 | `sleep 100ms, auto`

So we actually take 100.125ms instead of 100ms. Next time I run it, I get 100.141ms. Also a very stable result. Interestingly, sleep takes 51 instructions but 7,714 cycles - so we only got 0.007 instructions per cycle. That’s extremely low, but expected of sleep. It also required 11 branches, of which 90.9% were mispredicted on average.

If 1.1 seconds is too slow for you, you can manually configure the number of evaluations (epochs):

tutorial_slow_v2.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

// NOLINTNEXTLINE
TEST_CASE("tutorial_slow_v2") {
    ankerl::nanobench::Bench().epochs(3).run("sleep 100ms", [&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|      100,099,096.00 |                9.99 |    0.0% |           51.00 |        7,182.00 |  0.007 |          11.00 |   90.9% |      0.30 | `sleep 100ms`

This time it took only 0.3 seconds, but with only 3 evaluations instead of 11. The err% will be less meaningful, but since the benchmark is so stable it doesn’t really matter.
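The idea of epochs can be imitated with plain std::chrono: measure the same operation several times and report the median. The sketch below is not how nanobench works internally, and medianSleepNs is a made-up name. It only illustrates why the total runtime grows with the number of epochs.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not nanobench's implementation): times `epochs`
// separate sleeps and returns the median duration in nanoseconds.
inline double medianSleepNs(std::size_t epochs, std::chrono::milliseconds duration) {
    assert(epochs > 0);
    std::vector<double> samples;
    samples.reserve(epochs);
    for (std::size_t i = 0; i < epochs; ++i) {
        auto const begin = std::chrono::steady_clock::now();
        std::this_thread::sleep_for(duration);
        auto const end = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::nano>(end - begin).count());
    }
    // Sort and take the middle element (the upper middle for even counts).
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

With 3 epochs of 100ms each, this takes about 0.3 seconds in total, while 11 epochs take about 1.1 seconds.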

Something Unstable

Let’s create an extreme artificial test that’s hard to benchmark because its runtime fluctuates randomly: each iteration skips a random number of random values, between 0 and 255:

tutorial_fluctuating_v1.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

// NOLINTNEXTLINE
TEST_CASE("tutorial_fluctuating_v1") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().run("random fluctuations", [&] {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            ankerl::nanobench::doNotOptimizeAway(rng());
        }
    });
}

After 2.3ms, I get this result:

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|              334.12 |        2,992,911.53 |    6.3% |        3,486.44 |        1,068.67 |  3.262 |         287.86 |    0.7% |      0.00 | :wavy_dash: `random fluctuations` (Unstable with ~56.7 iters. Increase `minEpochIterations` to e.g. 567)

So on average each loop takes about 334.12ns, but we get a warning that the results are unstable. The median percentage error is 6.3%, which is quite high.

Let’s follow the suggestion to increase minEpochIterations. I set the minimum number of iterations to 5000 and try again:

tutorial_fluctuating_v2.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

// NOLINTNEXTLINE
TEST_CASE("tutorial_fluctuating_v2") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().minEpochIterations(5000).run(
        "random fluctuations", [&] {
            // each run, perform a random number of rng calls
            auto iterations = rng() & UINT64_C(0xff);
            for (uint64_t i = 0; i < iterations; ++i) {
                ankerl::nanobench::doNotOptimizeAway(rng());
            }
        });
}

The fluctuations are much smaller:

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|              277.31 |        3,606,106.48 |    0.7% |        3,531.75 |          885.18 |  3.990 |         291.59 |    0.7% |      0.00 | `random fluctuations`

The results are more stable, with only 0.7% error.
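The err% column is a median absolute percentage error over the per-epoch measurements. The sketch below assumes it is the median of |x - median| / x, which is what the template name medianAbsolutePercentError(elapsed) suggests. The function names are mine and not nanobench’s API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Median of a copy of the data; averages the two middle values for even sizes.
inline double medianOf(std::vector<double> data) {
    assert(!data.empty());
    std::sort(data.begin(), data.end());
    auto const mid = data.size() / 2;
    return data.size() % 2 == 1 ? data[mid] : (data[mid - 1] + data[mid]) / 2.0;
}

// Median absolute percentage error: the median of |x - median| / x over all
// epoch measurements x. This definition is an assumption, see the text above.
inline double medianAbsolutePercentError(std::vector<double> data) {
    double const med = medianOf(data);
    for (auto& x : data) {
        x = std::fabs((x - med) / x);
    }
    return medianOf(data);
}
```

For the three measurements {1.0, 2.0, 4.0} the median is 2.0, the relative deviations are 1.0, 0.0 and 0.5, and so the reported error would be 0.5 (50%). Because it is a median, a few outlier epochs barely affect the result.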

Comparing Results

To compare results, keep the ankerl::nanobench::Bench object around, enable .relative(true), and .run(…) your benchmarks. All benchmarks will be automatically compared to the first one.

As an example, I have implemented a comparison of multiple random number generators. Here several RNGs are compared to a baseline calculated from std::default_random_engine. I factored out the general benchmarking code so it’s easy to use for each of the random number generators:

example_random_number_generators.cpp (excerpt)

    }

private:
    static constexpr uint64_t rotl(uint64_t x, unsigned k) noexcept {
        return (x << k) | (x >> (64U - k));
    }

    uint64_t stateA{};
    uint64_t stateB{};
};

namespace {

// Benchmarks how fast we can get 64bit random values from Rng.
template <typename Rng>
void bench(ankerl::nanobench::Bench* bench, char const* name) {
    std::random_device dev;
    Rng rng(dev());

    bench->run(name, [&]() {
        auto r = std::uniform_int_distribution<uint64_t>{}(rng);
        ankerl::nanobench::doNotOptimizeAway(r);
    });
}

} // namespace

// NOLINTNEXTLINE
TEST_CASE("example_random_number_generators") {
    // perform a few warmup calls, and since the runtime is not always stable
    // for each generator, increase the number of epochs to get more accurate
    // numbers.
    ankerl::nanobench::Bench b;
    b.title("Random Number Generators")
        .unit("uint64_t")
        .warmup(100)
        .relative(true);
    b.performanceCounters(true);

    // sets the first one as the baseline
    bench<std::default_random_engine>(&b, "std::default_random_engine");
    bench<std::mt19937>(&b, "std::mt19937");
    bench<std::mt19937_64>(&b, "std::mt19937_64");
    bench<std::ranlux24_base>(&b, "std::ranlux24_base");
    bench<std::ranlux48_base>(&b, "std::ranlux48_base");
    bench<std::ranlux24>(&b, "std::ranlux24");
    bench<std::ranlux48>(&b, "std::ranlux48");
    bench<std::knuth_b>(&b, "std::knuth_b");
    bench<WyRng>(&b, "WyRng");
    bench<NasamRng>(&b, "NasamRng");
    bench<Sfc4>(&b, "Sfc4");
    bench<RomuTrio>(&b, "RomuTrio");
    bench<RomuDuo>(&b, "RomuDuo");
    bench<RomuDuoJr>(&b, "RomuDuoJr");
    bench<Orbit>(&b, "Orbit");
    bench<ankerl::nanobench::Rng>(&b, "ankerl::nanobench::Rng");
}

Runs for 60ms and prints this table:

| relative |         ns/uint64_t |          uint64_t/s |    err% |    ins/uint64_t |    cyc/uint64_t |    IPC |   bra/uint64_t |   miss% |     total | Random Number Generators
|---------:|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:-------------------------
|   100.0% |               35.87 |       27,881,924.28 |    2.3% |          127.80 |          114.61 |  1.115 |           9.77 |    3.7% |      0.00 | `std::default_random_engine`
|   490.3% |                7.32 |      136,699,693.21 |    0.6% |           89.55 |           23.49 |  3.812 |           9.51 |    0.1% |      0.00 | `std::mt19937`
| 1,767.4% |                2.03 |      492,786,582.33 |    0.6% |           24.38 |            6.48 |  3.761 |           1.26 |    0.6% |      0.00 | `std::mt19937_64`
|    85.2% |               42.08 |       23,764,853.03 |    0.7% |          157.07 |          134.62 |  1.167 |          19.51 |    7.6% |      0.00 | `std::ranlux24_base`
|   121.3% |               29.56 |       33,824,759.51 |    0.5% |           91.03 |           94.35 |  0.965 |          10.00 |    8.1% |      0.00 | `std::ranlux48_base`
|    17.4% |              205.67 |        4,862,080.59 |    1.2% |          709.83 |          657.10 |  1.080 |         101.79 |   16.1% |      0.00 | `std::ranlux24`
|     8.7% |              412.46 |        2,424,497.97 |    1.8% |        1,514.70 |        1,318.43 |  1.149 |         219.09 |   16.7% |      0.00 | `std::ranlux48`
|    59.2% |               60.60 |       16,502,276.18 |    1.9% |          253.77 |          193.39 |  1.312 |          24.93 |    1.5% |      0.00 | `std::knuth_b`
| 5,187.1% |                0.69 |    1,446,254,071.66 |    0.1% |            6.00 |            2.21 |  2.714 |           0.00 |    0.0% |      0.00 | `WyRng`
| 1,431.7% |                2.51 |      399,177,833.54 |    0.0% |           21.00 |            8.01 |  2.621 |           0.00 |    0.0% |      0.00 | `NasamRng`
| 2,629.9% |                1.36 |      733,279,957.30 |    0.1% |           13.00 |            4.36 |  2.982 |           0.00 |    0.0% |      0.00 | `Sfc4`
| 3,815.7% |                0.94 |    1,063,889,655.17 |    0.0% |           11.00 |            3.01 |  3.661 |           0.00 |    0.0% |      0.00 | `RomuTrio`
| 3,529.5% |                1.02 |      984,102,081.37 |    0.3% |            9.00 |            3.25 |  2.768 |           0.00 |    0.0% |      0.00 | `RomuDuo`
| 4,580.4% |                0.78 |    1,277,113,402.06 |    0.0% |            7.00 |            2.50 |  2.797 |           0.00 |    0.0% |      0.00 | `RomuDuoJr`
| 2,291.2% |                1.57 |      638,820,992.09 |    0.0% |           11.00 |            5.00 |  2.200 |           0.00 |    0.0% |      0.00 | `ankerl::nanobench::Rng`

It shows that ankerl::nanobench::Rng is one of the fastest RNGs, and is among those with the least fluctuation. It takes only 1.57ns to generate a random uint64_t, so ~639 million calls per second are possible. The leftmost column shows the relative performance compared to std::default_random_engine.
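The relative column can be reproduced from the ns/uint64_t column. The following sketch assumes it is simply the baseline time divided by the candidate time; relativePercent is a made-up name for illustration.

```cpp
#include <cassert>

// Hypothetical helper: speed relative to a baseline, in percent. Values above
// 100 mean the candidate needs less time per operation than the baseline.
inline double relativePercent(double baselineNsPerOp, double candidateNsPerOp) {
    assert(baselineNsPerOp > 0.0 && candidateNsPerOp > 0.0);
    return 100.0 * baselineNsPerOp / candidateNsPerOp;
}
```

For std::mt19937, relativePercent(35.87, 7.32) gives about 490%. The printed 490.3% is slightly different, presumably because the table shows rounded times.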

Note

Here pure runtime performance is not necessarily the best benchmark. Especially the fastest RNGs can be inlined and use instruction level parallelism to their advantage: they immediately return an old state, and while user code can already use that value, the next value is calculated in parallel. See the excellent paper at romu-random for details.

Asymptotic Complexity

It is possible to calculate asymptotic complexity (Big O) from multiple runs of a benchmark. Run the benchmark with different complexity N, then nanobench can calculate the best fitting curve.

The following example finds out the asymptotic complexity of std::set’s find().

tutorial_complexity_set.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <iostream>
#include <set>

// NOLINTNEXTLINE
TEST_CASE("tutorial_complexity_set_find") {
    // Create a single benchmark instance that is used in multiple benchmark
    // runs, with different settings for complexityN.
    ankerl::nanobench::Bench bench;

    // a RNG to generate input data
    ankerl::nanobench::Rng rng;

    std::set<uint64_t> set;

    // Running the benchmark multiple times, with different number of elements
    for (auto setSize :
         {10U, 20U, 50U, 100U, 200U, 500U, 1000U, 2000U, 5000U, 10000U}) {

        // fill up the set with random data
        while (set.size() < setSize) {
            set.insert(rng());
        }

        // Run the benchmark, provide setSize as the scaling variable.
        bench.complexityN(set.size()).run("std::set find", [&] {
            ankerl::nanobench::doNotOptimizeAway(set.find(rng()));
        });
    }

    // calculate BigO complexity best fit and print the results
    std::cout << bench.complexityBigO() << std::endl;
}

The loop runs the benchmark 10 times, with different set sizes from 10 to 10k.

Note

Each of the 10 benchmark runs automatically scales the number of iterations so results are still fast and accurate. In total the whole test takes about 90ms.

The Bench object holds the benchmark results of the 10 benchmark runs. Each benchmark is recorded with a different setting for complexityN.

After the benchmark prints the benchmark results, we calculate & print the Big O of the most important complexity functions. std::cout << bench.complexityBigO() << std::endl; prints e.g. this markdown table:

|   coefficient |   err% | complexity
|--------------:|-------:|------------
|   6.66562e-09 |  29.1% | O(log n)
|   1.47588e-11 |  58.3% | O(n)
|   1.10742e-12 |  62.6% | O(n log n)
|   5.15683e-08 |  63.8% | O(1)
|   1.40387e-15 |  78.7% | O(n^2)
|   1.32792e-19 |  85.7% | O(n^3)

The table is sorted, best fitting complexity function first. So \(\mathcal{O}(\log{}n)\) provides the best approximation for the complexity. Interestingly, the error of the \(\mathcal{O}(n)\) fit is not dramatically larger, which can be an indication that even though the red-black tree should theoretically have logarithmic complexity, in practice that is not perfectly the case.
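To illustrate what a coefficient and an error per complexity function can mean, here is a sketch of a least-squares fit of t ≈ c · f(n) through the origin, with the root mean square error normalized by the mean runtime. This is an assumption about the general approach, not a transcription of nanobench’s code, and all names are made up.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Fit {
    double coefficient; // c in t ~= c * f(n)
    double error;       // root mean square error divided by the mean of t
};

// Candidate complexity functions.
inline double logN(double n) { return std::log2(n); }
inline double linearN(double n) { return n; }

// Least-squares fit of t ~= c * f(n) through the origin.
template <typename F>
Fit fitComplexity(std::vector<double> const& n, std::vector<double> const& t, F f) {
    assert(!n.empty() && n.size() == t.size());
    double sumFT = 0.0;
    double sumFF = 0.0;
    for (std::size_t i = 0; i < n.size(); ++i) {
        sumFT += f(n[i]) * t[i];
        sumFF += f(n[i]) * f(n[i]);
    }
    double const c = sumFT / sumFF;

    // Measure how far the fitted curve is from the measured runtimes.
    double sumSquaredDiff = 0.0;
    double sumT = 0.0;
    for (std::size_t i = 0; i < n.size(); ++i) {
        double const diff = c * f(n[i]) - t[i];
        sumSquaredDiff += diff * diff;
        sumT += t[i];
    }
    double const count = static_cast<double>(n.size());
    return Fit{c, std::sqrt(sumSquaredDiff / count) / (sumT / count)};
}
```

Fitting every candidate function and sorting by error gives a table like the one above. For perfectly logarithmic data, the logN fit has an error of zero and the linearN fit has a clearly larger one.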

Rendering Mustache-like Templates

Nanobench comes with a powerful Mustache-like template mechanism to process the benchmark results into all kinds of formats. You can find a full description of all possible tags at ankerl::nanobench::render().

Several preconfigured formats exist in the namespace ankerl::nanobench::templates. Rendering these templates can be done with either ankerl::nanobench::render(), or directly with ankerl::nanobench::Bench::render().
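If you have never used Mustache-like templates, the basic idea is plain tag substitution. The sketch below is a toy renderer that only handles simple {{name}} tags, with no sections and no functions such as median(). It is not nanobench’s implementation, and renderTags is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical mini-renderer (not nanobench's implementation): replaces each
// {{name}} tag with the matching value and leaves unknown tags untouched.
inline std::string renderTags(std::string const& tmpl,
                              std::map<std::string, std::string> const& values) {
    std::string out;
    std::size_t pos = 0;
    while (pos < tmpl.size()) {
        std::size_t const open = tmpl.find("{{", pos);
        std::size_t const close =
            (open == std::string::npos) ? std::string::npos : tmpl.find("}}", open + 2);
        if (close == std::string::npos) {
            // No complete tag left: copy the rest verbatim.
            out.append(tmpl, pos, std::string::npos);
            break;
        }
        assert(close >= open + 2);
        out.append(tmpl, pos, open - pos); // text before the tag
        std::string const name = tmpl.substr(open + 2, close - open - 2);
        auto const it = values.find(name);
        if (it != values.end()) {
            out += it->second;
        } else {
            out.append(tmpl, open, close + 2 - open); // keep unknown tag as-is
        }
        pos = close + 2;
    }
    return out;
}
```

For example, rendering "{{title}};{{name}}" with title set to Addition and name set to += yields "Addition;+=".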

The following example shows how to use the CSV template (see CSV - Comma-Separated Values) without printing the default output table.

tutorial_render_simple.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <atomic>
#include <iostream>

// NOLINTNEXTLINE
TEST_CASE("tutorial_render_simple") {
    std::atomic<int> x(0);

    ankerl::nanobench::Bench()
        .output(nullptr)
        .run("std::vector",
             [&] {
                 ++x;
             })
        .render(ankerl::nanobench::templates::csv(), std::cout);
}

We call Bench::output() with nullptr, which disables the default output.

After the benchmark we directly call Bench::render(). Here we use the CSV template and write the rendered output to std::cout. When running, we get just the CSV output on the console, which looks like this:

"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
"benchmark";"std::vector";"op";1;6.51982200647249e-09;8.26465858909014e-05;23.0034662045061;5;0.00116867939228672;0.000171959

Nanobench comes with a few preconfigured templates, residing in the namespace ankerl::nanobench::templates. To demonstrate what these templates can do, here is a simple example that benchmarks two random number generators, std::mt19937_64 and std::knuth_b, and prints both the template and the rendered output:

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <fstream>
#include <random>

namespace {

void gen(std::string const& typeName, char const* mustacheTemplate,
         ankerl::nanobench::Bench const& bench) {

    std::ofstream templateOut("mustache.template." + typeName);
    templateOut << mustacheTemplate;

    std::ofstream renderOut("mustache.render." + typeName);
    ankerl::nanobench::render(mustacheTemplate, bench, renderOut);
}

} // namespace

// NOLINTNEXTLINE
TEST_CASE("tutorial_mustache") {
    ankerl::nanobench::Bench bench;
    bench.title("Benchmarking std::mt19937_64 and std::knuth_b");

    // NOLINTNEXTLINE(cert-msc32-c,cert-msc51-cpp)
    std::mt19937_64 rng1;
    bench.run("std::mt19937_64", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng1());
    });

    // NOLINTNEXTLINE(cert-msc32-c,cert-msc51-cpp)
    std::knuth_b rng2;
    bench.run("std::knuth_b", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng2());
    });

    gen("json", ankerl::nanobench::templates::json(), bench);
    gen("html", ankerl::nanobench::templates::htmlBoxplot(), bench);
    gen("csv", ankerl::nanobench::templates::csv(), bench);
}

Nanobench allows you to specify further context information, which can be accessed using {{context(name)}}, where name is a variable defined via Bench::context().

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <cmath>
#include <iostream>

namespace {

template <typename T>
void fma() {
    T x(1);
    T y(2);
    T z(3);
    z = std::fma(x, y, z);
    ankerl::nanobench::doNotOptimizeAway(z);
}

template <typename T>
void plus_eq() {
    T x(1);
    T y(2);
    T z(3);
    z += x * y;
    ankerl::nanobench::doNotOptimizeAway(z);
}

char const* csv() {
    return R"DELIM("title";"name";"scalar";"foo";"elapsed";"total"
{{#result}}"{{title}}";"{{name}}";"{{context(scalar)}}";"{{context(foo)}}";{{median(elapsed)}};{{sumProduct(iterations, elapsed)}}
{{/result}})DELIM";
}

} // namespace

// NOLINTNEXTLINE
TEST_CASE("tutorial_context") {
    ankerl::nanobench::Bench bench;
    bench.title("Addition").output(nullptr);
    bench.context("scalar", "f32")
        .context("foo", "bar")
        .run("+=", plus_eq<float>)
        .run("fma", fma<float>);
    bench.context("scalar", "f64")
        .context("foo", "baz")
        .run("+=", plus_eq<double>)
        .run("fma", fma<double>);
    bench.render(csv(), std::cout);
    // Changing the title resets the results, but not the context:
    bench.title("New Title");
    bench.run("+=", plus_eq<float>);
    bench.render(csv(), std::cout);
    CHECK_EQ(bench.results().front().context("foo"), "baz"); // != bar
    // The context has to be reset manually, which causes render to fail:
    bench.title("Yet Another Title").clearContext();
    bench.run("+=", plus_eq<float>);

    // NOLINTNEXTLINE(llvm-else-after-return,readability-else-after-return)
    CHECK_THROWS(bench.render(csv(), std::cout));
}

CSV - Comma-Separated Values

The function ankerl::nanobench::templates::csv() provides this template:

"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
{{#result}}"{{title}}";"{{name}}";"{{unit}}";{{batch}};{{median(elapsed)}};{{medianAbsolutePercentError(elapsed)}};{{median(instructions)}};{{median(branchinstructions)}};{{median(branchmisses)}};{{sumProduct(iterations, elapsed)}}
{{/result}}

This generates a compact CSV file, where entries are separated by a semicolon ;. Running the example, I get this output:

"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
"Benchmarking std::mt19937_64 and std::knuth_b";"std::mt19937_64";"op";1;2.54441805225653e-08;0.0236579384033733;125.989678899083;16.7645714285714;0.564133016627078;0.000218811
"Benchmarking std::mt19937_64 and std::knuth_b";"std::knuth_b";"op";1;3.19013867488444e-08;0.00091350764819687;170.013008130081;28;0.0031104199066874;0.000217248

Rendered as CSV table:

| title | name | unit | batch | elapsed | error % | instructions | branches | branch misses | total |
|:------|:-----|:-----|------:|--------:|--------:|-------------:|---------:|--------------:|------:|
| Benchmarking std::mt19937_64 and std::knuth_b | std::mt19937_64 | op | 1 | 2.54441805225653e-08 | 0.0236579384033733 | 125.989678899083 | 16.7645714285714 | 0.564133016627078 | 0.000218811 |
| Benchmarking std::mt19937_64 and std::knuth_b | std::knuth_b | op | 1 | 3.19013867488444e-08 | 0.00091350764819687 | 170.013008130081 | 28 | 0.0031104199066874 | 0.000217248 |

Note that the CSV template doesn’t provide all the data that is available.

HTML Box Plots

With the template ankerl::nanobench::templates::htmlBoxplot() you get a Plotly-based HTML output which generates a boxplot of the runtime. The template is rather simple:

<html>

<head>
    <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
</head>

<body>
    <div id="myDiv"></div>
    <script>
        var data = [
            {{#result}}{
                name: '{{name}}',
                y: [{{#measurement}}{{elapsed}}{{^-last}}, {{/last}}{{/measurement}}],
            },
            {{/result}}
        ];
        var title = '{{title}}';

        data = data.map(a => Object.assign(a, { boxpoints: 'all', pointpos: 0, type: 'box' }));
        var layout = { title: { text: title }, showlegend: false, yaxis: { title: 'time per unit', rangemode: 'tozero', autorange: true } }; Plotly.newPlot('myDiv', data, layout, {responsive: true});
    </script>
</body>

</html>

This generates an interactive boxplot, which gives a nice visual showcase of the runtime performance of the evaluated benchmarks. Each epoch is visualized as a dot, and the boxplot itself shows median, percentiles, and outliers. You might want to increase the default number of epochs for an even better visualization result.
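The y: [...] line of the template uses an inverted section on -last to put a separator between the measurements but not after the last one. As far as I understand it, {{^-last}} renders its content for every entry except the last. The same logic looks like this in plain C++; renderJsonArray is a made-up name for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: renders items as a JavaScript/JSON array, writing a
// separator after every item except the last one.
inline std::string renderJsonArray(std::vector<std::string> const& items) {
    std::string out = "[";
    for (std::size_t i = 0; i < items.size(); ++i) {
        bool const isLast = (i + 1 == items.size());
        out += items[i];
        if (!isLast) {
            out += ", "; // corresponds to the content of the inverted section
        }
    }
    out += "]";
    assert(out.size() >= 2); // sanity check: at least the two brackets
    return out;
}
```

Skipping the separator after the last element avoids a trailing comma. Strict JSON parsers reject trailing commas, which is why this pattern is useful for JSON output as well.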

JSON - JavaScript Object Notation

The ankerl::nanobench::templates::json() template gives all data that is available, from all runs. The template is therefore quite complex:

{
    "results": [
{{#result}}        {
            "title": "{{title}}",
            "name": "{{name}}",
            "unit": "{{unit}}",
            "batch": {{batch}},
            "complexityN": {{complexityN}},
            "epochs": {{epochs}},
            "clockResolution": {{clockResolution}},
            "clockResolutionMultiple": {{clockResolutionMultiple}},
            "maxEpochTime": {{maxEpochTime}},
            "minEpochTime": {{minEpochTime}},
            "minEpochIterations": {{minEpochIterations}},
            "epochIterations": {{epochIterations}},
            "warmup": {{warmup}},
            "relative": {{relative}},
            "median(elapsed)": {{median(elapsed)}},
            "medianAbsolutePercentError(elapsed)": {{medianAbsolutePercentError(elapsed)}},
            "median(instructions)": {{median(instructions)}},
            "medianAbsolutePercentError(instructions)": {{medianAbsolutePercentError(instructions)}},
            "median(cpucycles)": {{median(cpucycles)}},
            "median(contextswitches)": {{median(contextswitches)}},
            "median(pagefaults)": {{median(pagefaults)}},
            "median(branchinstructions)": {{median(branchinstructions)}},
            "median(branchmisses)": {{median(branchmisses)}},
            "totalTime": {{sumProduct(iterations, elapsed)}},
            "measurements": [
{{#measurement}}                {
                    "iterations": {{iterations}},
                    "elapsed": {{elapsed}},
                    "pagefaults": {{pagefaults}},
                    "cpucycles": {{cpucycles}},
                    "contextswitches": {{contextswitches}},
                    "instructions": {{instructions}},
                    "branchinstructions": {{branchinstructions}},
                    "branchmisses": {{branchmisses}}
                }{{^-last}},{{/-last}}
{{/measurement}}            ]
        }{{^-last}},{{/-last}}
{{/result}}    ]
}

This also gives the data from each separate epoch (see ankerl::nanobench::Bench::epochs()), not just the accumulated data as in the CSV template.

{
    "results": [
        {
            "title": "Benchmarking std::mt19937_64 and std::knuth_b",
            "name": "std::mt19937_64",
            "unit": "op",
            "batch": 1,
            "complexityN": -1,
            "epochs": 11,
            "clockResolution": 1.8e-08,
            "clockResolutionMultiple": 1000,
            "maxEpochTime": 0.1,
            "minEpochTime": 0,
            "minEpochIterations": 1,
            "warmup": 0,
            "relative": 0,
            "median(elapsed)": 2.54441805225653e-08,
            "medianAbsolutePercentError(elapsed)": 0.0236579384033733,
            "median(instructions)": 125.989678899083,
            "medianAbsolutePercentError(instructions)": 0.035125448044942,
            "median(cpucycles)": 81.3479809976247,
            "median(contextswitches)": 0,
            "median(pagefaults)": 0,
            "median(branchinstructions)": 16.7645714285714,
            "median(branchmisses)": 0.564133016627078,
            "totalTime": 0.000218811,
            "measurements": [
                {
                    "iterations": 875,
                    "elapsed": 2.54708571428571e-08,
                    "pagefaults": 0,
                    "cpucycles": 81.472,
                    "contextswitches": 0,
                    "instructions": 125.885714285714,
                    "branchinstructions": 16.7645714285714,
                    "branchmisses": 0.574857142857143
                },
                {
                    "iterations": 809,
                    "elapsed": 2.58467243510507e-08,
                    "pagefaults": 0,
                    "cpucycles": 82.5290482076638,
                    "contextswitches": 0,
                    "instructions": 128.771322620519,
                    "branchinstructions": 17.0296662546354,
                    "branchmisses": 0.582200247218789
                },
                {
                    "iterations": 737,
                    "elapsed": 2.24097693351425e-08,
                    "pagefaults": 0,
                    "cpucycles": 71.6431478968792,
                    "contextswitches": 0,
                    "instructions": 118.374491180461,
                    "branchinstructions": 15.9470827679783,
                    "branchmisses": 0.417910447761194
                },
                {
                    "iterations": 872,
                    "elapsed": 2.53405963302752e-08,
                    "pagefaults": 0,
                    "cpucycles": 80.9896788990826,
                    "contextswitches": 0,
                    "instructions": 125.989678899083,
                    "branchinstructions": 16.7580275229358,
                    "branchmisses": 0.563073394495413
                },
                {
                    "iterations": 834,
                    "elapsed": 2.59256594724221e-08,
                    "pagefaults": 0,
                    "cpucycles": 82.7661870503597,
                    "contextswitches": 0,
                    "instructions": 127.635491606715,
                    "branchinstructions": 16.9352517985612,
                    "branchmisses": 0.575539568345324
                },
                {
                    "iterations": 772,
                    "elapsed": 2.25310880829016e-08,
                    "pagefaults": 0,
                    "cpucycles": 72.0129533678757,
                    "contextswitches": 0,
                    "instructions": 117.108808290155,
                    "branchinstructions": 15.8341968911917,
                    "branchmisses": 0.405440414507772
                },
                {
                    "iterations": 842,
                    "elapsed": 2.54441805225653e-08,
                    "pagefaults": 0,
                    "cpucycles": 81.3479809976247,
                    "contextswitches": 0,
                    "instructions": 127.266033254157,
                    "branchinstructions": 16.8859857482185,
                    "branchmisses": 0.564133016627078
                },
                {
                    "iterations": 792,
                    "elapsed": 2.20126262626263e-08,
                    "pagefaults": 0,
                    "cpucycles": 70.3623737373737,
                    "contextswitches": 0,
                    "instructions": 116.420454545455,
                    "branchinstructions": 15.7588383838384,
                    "branchmisses": 0.396464646464646
                },
                {
                    "iterations": 757,
                    "elapsed": 2.63870541611625e-08,
                    "pagefaults": 0,
                    "cpucycles": 84.332892998679,
                    "contextswitches": 0,
                    "instructions": 131.462351387054,
                    "branchinstructions": 17.334214002642,
                    "branchmisses": 0.618229854689564
                },
                {
                    "iterations": 850,
                    "elapsed": 2.23305882352941e-08,
                    "pagefaults": 0,
                    "cpucycles": 71.3505882352941,
                    "contextswitches": 0,
                    "instructions": 114.629411764706,
                    "branchinstructions": 15.5823529411765,
                    "branchmisses": 0.392941176470588
                },
                {
                    "iterations": 774,
                    "elapsed": 2.60607235142119e-08,
                    "pagefaults": 0,
                    "cpucycles": 83.1679586563308,
                    "contextswitches": 0,
                    "instructions": 130.576227390181,
                    "branchinstructions": 17.2635658914729,
                    "branchmisses": 0.590439276485788
                }
            ]
        },
        {
            "title": "Benchmarking std::mt19937_64 and std::knuth_b",
            "name": "std::knuth_b",
            "unit": "op",
            "batch": 1,
            "complexityN": -1,
            "epochs": 11,
            "clockResolution": 1.8e-08,
            "clockResolutionMultiple": 1000,
            "maxEpochTime": 0.1,
            "minEpochTime": 0,
            "minEpochIterations": 1,
            "warmup": 0,
            "relative": 0,
            "median(elapsed)": 3.19013867488444e-08,
            "medianAbsolutePercentError(elapsed)": 0.00091350764819687,
            "median(instructions)": 170.013008130081,
            "medianAbsolutePercentError(instructions)": 4.11992392254248e-06,
            "median(cpucycles)": 101.973254086181,
            "median(contextswitches)": 0,
            "median(pagefaults)": 0,
            "median(branchinstructions)": 28,
            "median(branchmisses)": 0.0031104199066874,
            "totalTime": 0.000217248,
            "measurements": [
                {
                    "iterations": 568,
                    "elapsed": 3.2137323943662e-08,
                    "pagefaults": 0,
                    "cpucycles": 102.55985915493,
                    "contextswitches": 0,
                    "instructions": 170.014084507042,
                    "branchinstructions": 28,
                    "branchmisses": 0.00528169014084507
                },
                {
                    "iterations": 576,
                    "elapsed": 3.19305555555556e-08,
                    "pagefaults": 0,
                    "cpucycles": 102.059027777778,
                    "contextswitches": 0,
                    "instructions": 170.013888888889,
                    "branchinstructions": 28,
                    "branchmisses": 0.00347222222222222
                },
                {
                    "iterations": 643,
                    "elapsed": 3.18973561430793e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.973561430793,
                    "contextswitches": 0,
                    "instructions": 170.012441679627,
                    "branchinstructions": 28,
                    "branchmisses": 0.0031104199066874
                },
                {
                    "iterations": 591,
                    "elapsed": 3.1912013536379e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.944162436548,
                    "contextswitches": 0,
                    "instructions": 170.013536379019,
                    "branchinstructions": 28,
                    "branchmisses": 0.00169204737732657
                },
                {
                    "iterations": 673,
                    "elapsed": 3.19049034175334e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.973254086181,
                    "contextswitches": 0,
                    "instructions": 170.011887072808,
                    "branchinstructions": 28,
                    "branchmisses": 0.00297176820208024
                },
                {
                    "iterations": 649,
                    "elapsed": 3.19013867488444e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.850539291217,
                    "contextswitches": 0,
                    "instructions": 170.012326656394,
                    "branchinstructions": 28,
                    "branchmisses": 0.00308166409861325
                },
                {
                    "iterations": 606,
                    "elapsed": 3.18547854785479e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.83498349835,
                    "contextswitches": 0,
                    "instructions": 170.013201320132,
                    "branchinstructions": 28,
                    "branchmisses": 0.0033003300330033
                },
                {
                    "iterations": 650,
                    "elapsed": 3.18769230769231e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.898461538462,
                    "contextswitches": 0,
                    "instructions": 170.012307692308,
                    "branchinstructions": 28,
                    "branchmisses": 0.00307692307692308
                },
                {
                    "iterations": 615,
                    "elapsed": 3.18520325203252e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.858536585366,
                    "contextswitches": 0,
                    "instructions": 170.013008130081,
                    "branchinstructions": 28,
                    "branchmisses": 0.0032520325203252
                },
                {
                    "iterations": 579,
                    "elapsed": 3.18618307426598e-08,
                    "pagefaults": 0,
                    "cpucycles": 101.989637305699,
                    "contextswitches": 0,
                    "instructions": 170.013816925734,
                    "branchinstructions": 28,
                    "branchmisses": 0.00345423143350604
                },
                {
                    "iterations": 657,
                    "elapsed": 3.19558599695586e-08,
                    "pagefaults": 0,
                    "cpucycles": 102.229832572298,
                    "contextswitches": 0,
                    "instructions": 170.012176560122,
                    "branchinstructions": 28,
                    "branchmisses": 0.0030441400304414
                }
            ]
        }
    ]
}
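
The summary fields in this JSON output can be recomputed from the measurements array. The following is a minimal sketch, not nanobench's own code: the function names are chosen here for illustration, and dividing each deviation by the value itself (rather than by the median) is an assumption, picked because it reproduces the medianAbsolutePercentError(elapsed) reported for std::knuth_b above.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Median of a sample. Takes a copy because sorting reorders the data.
// For an even number of values the two middle values are averaged.
double median(std::vector<double> data) {
    if (data.empty()) {
        return 0.0;
    }
    std::sort(data.begin(), data.end());
    auto const mid = data.size() / 2;
    if (data.size() % 2 == 1) {
        return data[mid];
    }
    return (data[mid - 1] + data[mid]) / 2.0;
}

// Median absolute percentage error: the median of |x - median| / |x|.
// Assumption: the divisor is each value x, not the median. Values are
// expected to be non-zero, as elapsed times are.
double medianAbsolutePercentError(std::vector<double> data) {
    double const med = median(data);
    for (auto& x : data) {
        x = std::fabs((x - med) / x);
    }
    return median(data);
}

// The "elapsed" values of the eleven std::knuth_b measurements shown above.
std::vector<double> knuthElapsed() {
    return {3.2137323943662e-08,  3.19305555555556e-08, 3.18973561430793e-08,
            3.1912013536379e-08,  3.19049034175334e-08, 3.19013867488444e-08,
            3.18547854785479e-08, 3.18769230769231e-08, 3.18520325203252e-08,
            3.18618307426598e-08, 3.19558599695586e-08};
}
```

Calling median(knuthElapsed()) gives the sixth-smallest value, which is the reported median(elapsed), and medianAbsolutePercentError(knuthElapsed()) agrees with the reported error to within the precision of the printed inputs.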

pyperf - Python pyperf module Output

Pyperf is a powerful tool for benchmarking and system tuning that can also analyze benchmark results. This template generates output in pyperf's JSON format, so the results can be analyzed further with pyperf.

Note

Pyperf supports only a single benchmark result per generated output, so it is best to create a new Bench object for each benchmark.

The template looks like this. Note that it directly makes use of {{#measurement}}, which is only possible when there is a single result in the benchmark.

{
    "benchmarks": [
        {
            "runs": [
                {
                    "values": [
{{#measurement}}                        {{elapsed}}{{^-last}},
{{/last}}{{/measurement}}
                    ]
                }
            ]
        }
    ],
    "metadata": {
        "loops": {{sum(iterations)}},
        "inner_loops": {{batch}},
        "name": "{{title}}",
        "unit": "second"
    },
    "version": "1.0"
}
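
To make the expansion of the template concrete, here is a hand-written sketch that produces the same layout from a plain vector of values. It is not nanobench's renderer: the function toPyperfJson and its parameters are hypothetical, the loops argument stands in for {{sum(iterations)}}, and the name is assumed to contain no characters that need JSON escaping.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the pyperf template above.
// `elapsed` holds one time per epoch, in seconds per iteration.
std::string toPyperfJson(std::vector<double> const& elapsed, std::uint64_t loops,
                         std::uint64_t innerLoops, std::string const& name) {
    std::ostringstream out;
    out.precision(15);
    out << "{\n    \"benchmarks\": [\n        {\n            \"runs\": [\n"
        << "                {\n                    \"values\": [\n";
    for (std::size_t i = 0; i < elapsed.size(); ++i) {
        out << "                        " << elapsed[i];
        // Same effect as {{^-last}}: a comma after every value except the last.
        if (i + 1 != elapsed.size()) {
            out << ',';
        }
        out << '\n';
    }
    out << "                    ]\n                }\n            ]\n        }\n    ],\n"
        << "    \"metadata\": {\n"
        << "        \"loops\": " << loops << ",\n"
        << "        \"inner_loops\": " << innerLoops << ",\n"
        << "        \"name\": \"" << name << "\",\n" // assumed to need no escaping
        << "        \"unit\": \"second\"\n"
        << "    },\n    \"version\": \"1.0\"\n}\n";
    return out.str();
}
```

The comma handling is the part worth noting: JSON does not allow a trailing comma in an array, which is why the template wraps the separator in an inverted -last section.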

Here is an example that generates pyperf compatible output for a benchmark that shuffles a vector:

example_pyperf.cpp

#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <random>
#include <vector>

// NOLINTNEXTLINE
TEST_CASE("shuffle_pyperf") {
    std::vector<uint64_t> data(500, 0); // input data for shuffling

    // NOLINTNEXTLINE(cert-msc32-c,cert-msc51-cpp)
    std::default_random_engine defaultRng(123);
    std::ofstream fout1("pyperf_shuffle_std.json");
    ankerl::nanobench::Bench()
        .epochs(100)
        .run("std::shuffle with std::default_random_engine",
             [&]() {
                 std::shuffle(data.begin(), data.end(), defaultRng);
             })
        .render(ankerl::nanobench::templates::pyperf(), fout1);

    std::ofstream fout2("pyperf_shuffle_nanobench.json");
    ankerl::nanobench::Rng rng(123);
    ankerl::nanobench::Bench()
        .epochs(100)
        .run("ankerl::nanobench::Rng::shuffle",
             [&]() {
                 rng.shuffle(data);
             })
        .render(ankerl::nanobench::templates::pyperf(), fout2);
}

This benchmark run creates the two files pyperf_shuffle_std.json and pyperf_shuffle_nanobench.json. Here are some of the analyses you can do:

Show Benchmark Statistics

Output from python3 -m pyperf stats pyperf_shuffle_std.json:

Total duration: 364 ms
Raw value minimum: 3.57 ms
Raw value maximum: 4.21 ms

Number of calibration run: 0
Number of run with values: 1
Total number of run: 1

Number of warmup per run: 0
Number of value per run: 100
Loop iterations per value: 100
Total number of values: 100

Minimum:         35.7 us
Median +- MAD:   36.2 us +- 0.2 us
Mean +- std dev: 36.4 us +- 0.9 us
Maximum:         42.1 us

  0th percentile: 35.7 us (-2% of the mean) -- minimum
  5th percentile: 35.8 us (-2% of the mean)
 25th percentile: 36.1 us (-1% of the mean) -- Q1
 50th percentile: 36.2 us (-0% of the mean) -- median
 75th percentile: 36.4 us (+0% of the mean) -- Q3
 95th percentile: 36.7 us (+1% of the mean)
100th percentile: 42.1 us (+16% of the mean) -- maximum

Number of outlier (out of 35.6 us..36.9 us): 4

Show a Histogram

It’s often interesting to see a histogram, especially to spot outliers visually. Running python3 -m pyperf hist pyperf_shuffle_std.json produces this output:

35.7 us: 21 ######################################
36.0 us: 33 ############################################################
36.3 us: 37 ###################################################################
36.6 us:  5 #########
36.9 us:  0 |
37.2 us:  1 ##
37.5 us:  0 |
37.8 us:  0 |
38.1 us:  0 |
38.4 us:  0 |
38.7 us:  0 |
39.0 us:  0 |
39.3 us:  0 |
39.6 us:  1 ##
39.9 us:  0 |
40.2 us:  0 |
40.5 us:  1 ##
40.8 us:  0 |
41.1 us:  0 |
41.5 us:  0 |
41.8 us:  0 |
42.1 us:  1 ##

Compare Results

The example above generated two result files, which can be compared with python3 -m pyperf compare_to --table pyperf_shuffle_std.json pyperf_shuffle_nanobench.json:

+-----------+--------------------+------------------------------+
| Benchmark | pyperf_shuffle_std | pyperf_shuffle_nanobench     |
+===========+====================+==============================+
| benchmark | 36.4 us            | 11.2 us: 3.24x faster (-69%) |
+-----------+--------------------+------------------------------+

For more information on pyperf's analysis capabilities, see pyperf - Analyze benchmark results.