ankerl::nanobench Reference

Bench - Main Entry Point

class ankerl::nanobench::Bench

Main entry point to nanobench’s benchmarking facility.

It holds configuration and results from one or more benchmark runs. Usually it is used in a single line, where the object is constructed, configured, and then a benchmark is run. E.g. like this:

ankerl::nanobench::Bench().unit("byte").batch(1000).run("random fluctuations", [&] {
    // here be the benchmark code
});
In that example Bench() constructs the benchmark, it is then configured with unit() and batch(), and after configuration a benchmark is executed with run(). Once run() has finished, it prints the result to std::cout. It would also store the results in the Bench instance, but in this case the object is immediately destroyed so it’s not available any more.

Public Functions

Bench()

Creates a new benchmark for configuration and running of benchmarks.

template<typename Op>
Bench &run(char const *benchmarkName, Op &&op)

Repeatedly calls op() based on the configuration, and performs measurements.

This call is marked with noinline to prevent the compiler from optimizing across different benchmarks. This can have quite a big effect on benchmark accuracy.

Note

Each call to your lambda must have a side effect that the compiler can’t possibly optimize away. E.g. add a result to an externally defined number (such as a variable x declared outside the lambda), and finally call doNotOptimizeAway on the variables the compiler must not remove. You can also use ankerl::nanobench::doNotOptimizeAway() directly in the lambda, but be aware that this has a small overhead.

Template Parameters

Op – The code to benchmark.

template<typename Op>
Bench &run(Op &&op)

Same as run(char const *benchmarkName, Op &&op), but instead uses the previously set name.

Template Parameters

Op – The code to benchmark.

Bench &title(char const *benchmarkTitle)

Title of the benchmark, will be shown in the table header. Changing the title will start a new markdown table.

Parameters

benchmarkTitle – The title of the benchmark.

Bench &name(char const *benchmarkName)

Name of the benchmark, will be shown in the table row.

template<typename T>
Bench &batch(T b) noexcept

Sets the batch size.

E.g. number of processed bytes, or some other metric for the size of the processed data in each iteration. If you benchmark hashing of a 1000 byte long string and want byte/sec as a result, you can specify 1000 as the batch size.

Template Parameters

T – Any input type is internally cast to double.

Parameters

b – batch size
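
As a rough illustration of how a batch size relates to the reported numbers, the following sketch converts a measured time per iteration into a time per unit and a throughput. The helper names timePerUnit and unitsPerSecond are hypothetical and not part of nanobench; the sketch only assumes that a single iteration processes batch units.

```cpp
// Hypothetical helpers, not part of nanobench: they only illustrate the
// arithmetic behind a batch size.

// Time spent per processed unit (e.g. seconds per byte).
double timePerUnit(double secondsPerIteration, double batch) {
    return secondsPerIteration / batch;
}

// Processed units per second (e.g. byte/sec).
double unitsPerSecond(double secondsPerIteration, double batch) {
    return batch / secondsPerIteration;
}
```

With a batch size of 1000 and an iteration that takes one microsecond, this works out to roughly one nanosecond per byte.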

Bench &unit(char const *unit)

Sets the operation unit.

Defaults to “op”. Could be e.g. “byte” for string processing. This is used for the table header, e.g. to show ns/byte. Use singular (byte, not bytes). A change clears the currently collected results.

Parameters

unit – The unit name.

Bench &timeUnit(std::chrono::duration<double> const &tu, std::string const &tuName)

Sets the time unit to be used for the default output.

Nanobench defaults to using ns (nanoseconds) as output in the markdown. For slower benchmarks this is too fine-grained, so it is possible to configure this. E.g. use timeUnit(1ms, "ms") to show ms/op instead of ns/op.

Parameters
  • tu – Time unit to display the results in, default is 1ns.

  • tuName – Name for the time unit, default is “ns”

Bench &output(std::ostream *outstream) noexcept

Set the output stream where the resulting markdown table will be printed to.

The default is &std::cout. You can disable all output by setting nullptr.

Parameters

outstream – Pointer to output stream, can be nullptr.

Bench &clockResolutionMultiple(size_t multiple) noexcept

Modern processors have a very accurate clock, being able to measure as low as 20 nanoseconds. This is the main trick that allows nanobench to be so fast: we find out how accurate the clock is, then run the benchmark only so often that the clock’s accuracy is good enough for accurate measurements.

The default is to run one epoch for 1000 times the clock resolution. So for 20ns resolution and 11 epochs, this gives a total runtime of

\[ 20ns * 1000 * 11 \approx 0.2ms \]

To be precise, nanobench adds a 0-20% random noise to each evaluation. This is to prevent any aliasing effects, and further improves accuracy.

Total runtime will be higher though: Some initial time is needed to find out the target number of iterations for each epoch, and there is some overhead involved in starting and stopping timers, calculating the resulting statistics, and writing the output.

Parameters

multiple – Target number of times of clock resolution. Usually 1000 is a good compromise between runtime and accuracy.
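
The runtime estimate above can be reproduced with a few lines of std::chrono arithmetic. This is only a sketch: the function name estimatedMeasurementTime is hypothetical, and it ignores the random noise and the setup overhead mentioned above.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper, not part of nanobench: lower bound for the measured
// runtime when each epoch targets `multiple` times the clock resolution.
std::chrono::nanoseconds estimatedMeasurementTime(std::chrono::nanoseconds clockResolution,
                                                  std::uint64_t multiple,
                                                  std::uint64_t epochs) {
    return clockResolution * static_cast<std::chrono::nanoseconds::rep>(multiple * epochs);
}
```

For a 20ns resolution, a multiple of 1000, and 11 epochs this yields 220 microseconds, which matches the approximate 0.2ms given above.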

Bench &epochs(size_t numEpochs) noexcept

Controls number of epochs, the number of measurements to perform.

The reported result will be the median of the evaluations of each epoch. The higher you choose this, the more deterministic the result will be, and outliers will be more easily removed. Also the err% will be more accurate the higher this number is. Note that the err% will not necessarily decrease when the number of epochs is increased. But it will be a more accurate representation of the benchmarked code’s runtime stability.

Choose the value wisely. In practice, 11 has been shown to be a reasonable choice between runtime performance and accuracy. This setting goes hand in hand with minEpochIterations() (or minEpochTime()). If you are more interested in median runtime, you might want to increase epochs(). If you are more interested in mean runtime, you might want to increase minEpochIterations() instead.

Parameters

numEpochs – Number of epochs.

Bench &maxEpochTime(std::chrono::nanoseconds t) noexcept

Upper limit for the runtime of each epoch.

As a safety precaution if the clock is not very accurate, we can set an upper limit for the maximum evaluation time per epoch. Default is 100ms. At least a single evaluation of the benchmark is performed.

See

minEpochTime(), minEpochIterations()

Parameters

t – Maximum target runtime for a single epoch.

Bench &minEpochTime(std::chrono::nanoseconds t) noexcept

Minimum time each epoch should take.

Default is zero, so we are fully relying on clockResolutionMultiple(). In most cases this is exactly what you want. If you see that the evaluation is unreliable with a high err%, you can increase either minEpochTime() or minEpochIterations().

See

maxEpochTime(), minEpochIterations()

Parameters

t – Minimum time each epoch should take.

Bench &minEpochIterations(uint64_t numIters) noexcept

Sets the minimum number of iterations each epoch should take.

Default is 1, and we rely on clockResolutionMultiple(). If the err% is high and you want a smoother result, you might want to increase the minimum number of iterations, or increase the minEpochTime().

See

minEpochTime(), maxEpochTime()

Parameters

numIters – Minimum number of iterations per epoch.

Bench &epochIterations(uint64_t numIters) noexcept

Sets exactly the number of iterations for each epoch. Ignores all other epoch limits. This forces nanobench to use exactly the given number of iterations for each epoch, not more and not less. Default is 0 (disabled).

Parameters

numIters – Exact number of iterations to use. Set to 0 to disable.

Bench &warmup(uint64_t numWarmupIters) noexcept

Sets a number of iterations that are initially performed without any measurements.

Some benchmarks need a few evaluations to warm up caches / database / whatever access. Normally this should not be needed, since we show the median result so initial outliers will be filtered away automatically. If the warmup effect is large though, you might want to set it. Default is 0.

Parameters

numWarmupIters – Number of warmup iterations.

Bench &relative(bool isRelativeEnabled) noexcept

Marks the next run as the baseline.

Call relative(true) to mark the run as the baseline. Successive runs will be compared to this run. It is calculated by

\[ 100\% * \frac{baseline}{runtime} \]

  • 100% means it is exactly as fast as the baseline

  • >100% means it is faster than the baseline. E.g. 200% means the current run is twice as fast as the baseline.

  • <100% means it is slower than the baseline. E.g. 50% means it is twice as slow as the baseline.

See the tutorial section “Comparing Results” for example usage.

Parameters

isRelativeEnabled – True to enable processing
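
The formula above is simple enough to check by hand. The sketch below spells it out; relativePercent is a hypothetical helper name and not part of nanobench.

```cpp
// Hypothetical helper, not part of nanobench: relative speed in percent,
// following the formula 100% * baseline / runtime.
double relativePercent(double baselineSeconds, double runtimeSeconds) {
    return 100.0 * baselineSeconds / runtimeSeconds;
}
```

A run that needs half the time of the baseline gives 200%, and a run that needs twice the time gives 50%.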

Bench &performanceCounters(bool showPerformanceCounters) noexcept

Enables/disables performance counters.

On Linux nanobench has a powerful feature to use performance counters. This enables counting of retired instructions, branches, missed branches, etc. This is enabled by default, but you can disable it if you don’t need that feature.

Parameters

showPerformanceCounters – True to enable, false to disable.

std::vector<Result> const &results() const noexcept

Retrieves all benchmark results collected by the bench object so far.

Each call to run() generates a Result that is stored within the Bench instance. This is mostly for advanced users who want to see all the nitty gritty details.

Returns

All results collected so far.

template<typename Arg>
Bench &doNotOptimizeAway(Arg &&arg)

Convenience shortcut to ankerl::nanobench::doNotOptimizeAway().

template<typename T>
Bench &complexityN(T b) noexcept

Sets N for asymptotic complexity calculation, so it becomes possible to calculate Big O from multiple benchmark evaluations.

Use ankerl::nanobench::Bench::complexityBigO() when the evaluation has finished. See the tutorial Asymptotic Complexity for details.

Template Parameters

T – Any type is cast to double.

Parameters

b – Length of N for the next benchmark run, so it is possible to calculate bigO.

std::vector<BigO> complexityBigO() const

Calculates Big O of the results with all preconfigured complexity functions. Currently these complexity functions are fitted into the benchmark results:

\( \mathcal{O}(1) \), \( \mathcal{O}(n) \), \( \mathcal{O}(\log{}n) \), \( \mathcal{O}(n\log{}n) \), \( \mathcal{O}(n^2) \), \( \mathcal{O}(n^3) \).

If we e.g. evaluate the complexity of std::sort, this is the result of std::cout << bench.complexityBigO():

|   coefficient |   err% | complexity
|--------------:|-------:|------------
|   5.08935e-09 |   2.6% | O(n log n)
|   6.10608e-08 |   8.0% | O(n)
|   1.29307e-11 |  47.2% | O(n^2)
|   2.48677e-15 |  69.6% | O(n^3)
|   9.88133e-06 | 132.3% | O(log n)
|   5.98793e-05 | 162.5% | O(1)

So in this case \( \mathcal{O}(n\log{}n) \) provides the best approximation.

See the tutorial Asymptotic Complexity for details.

Returns

Evaluation results, which can be printed or otherwise inspected.

template<typename Op>
BigO complexityBigO(char const *name, Op op) const

Calculates bigO for a custom function.

E.g. to calculate the mean squared error for \( \mathcal{O}(\log{}\log{}n) \), which is not part of the default set of complexityBigO(), you can do this:

auto logLogN = bench.complexityBigO("O(log log n)", [](double n) {
    return std::log2(std::log2(n));
});

The resulting mean squared error can be printed with std::cout << logLogN. E.g. it prints something like this:

2.46985e-05 * O(log log n), rms=1.48121

Template Parameters

Op – Type of mapping operation.

Parameters
  • name – Name for the function, e.g. “O(log log n)”

  • op – Op’s operator() maps a double with the desired complexity function, e.g. log2(log2(n)).

Returns

BigO Error calculation, which is streamable to std::cout.
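
To make the idea of fitting a complexity function more concrete, here is a minimal sketch of a least-squares fit through the origin. It is an illustration under stated assumptions, not nanobench's implementation: the types Sample and Fit and the function fitComplexity are hypothetical, and the rms value that nanobench prints may be normalized differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One benchmark observation: problem size n and measured time t.
struct Sample {
    double n;
    double t;
};

// Result of fitting t ~= coefficient * f(n).
struct Fit {
    double coefficient;
    double rms; // root mean square of the residuals
};

// Least-squares fit through the origin: minimizes sum((t - c * f(n))^2),
// which gives c = sum(t * f(n)) / sum(f(n)^2).
// Assumes a non-empty sample set where f(n) is not zero everywhere.
template <typename Op>
Fit fitComplexity(std::vector<Sample> const& samples, Op f) {
    double sumTf = 0.0;
    double sumFf = 0.0;
    for (auto const& s : samples) {
        double const fn = f(s.n);
        sumTf += s.t * fn;
        sumFf += fn * fn;
    }
    double const c = sumTf / sumFf;

    double sumSq = 0.0;
    for (auto const& s : samples) {
        double const residual = s.t - c * f(s.n);
        sumSq += residual * residual;
    }
    return Fit{c, std::sqrt(sumSq / static_cast<double>(samples.size()))};
}
```

Running such a fit once per candidate function and comparing the errors is one way to decide which complexity class describes the measurements best.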

Bench &render(char const *templateContent, std::ostream &os)

Convenience shortcut to ankerl::nanobench::render().

Rng - Extremely fast PRNG

class ankerl::nanobench::Rng

An extremely fast random generator. Currently, this implements RomuDuoJr, developed by Mark Overton. Source: http://www.romu-random.org/

RomuDuoJr is extremely fast and provides reasonably good randomness. Not enough for large jobs, but definitely good enough for a benchmarking framework.

  • Estimated capacity: \( 2^{51} \) bytes

  • Register pressure: 4

  • State size: 128 bits

This random generator is a drop-in replacement for the generators supplied by <random>. It is not cryptographically secure. Its intended purpose is to be very fast so that benchmarks that make use of randomness are not distorted too much by the random generator.

Rng also provides a few non-standard helpers, optimized for speed.

Public Types

using result_type = uint64_t

This RNG provides 64bit randomness.

Public Functions

Rng(Rng const&) = delete

As a safety precaution, we don’t allow copying. Copying a PRNG would mean you would have two random generators that produce the same sequence, which is generally not what one wants. Instead create a new rng with the default constructor Rng(), which is automatically seeded from std::random_device. If you really need a copy, use copy().

Rng &operator=(Rng const&) = delete

Same as Rng(Rng const&), we don’t allow assignment. If you need a new Rng create one with the default constructor Rng().

Rng()

Creates a new Random generator with random seed.

Instead of a default seed (as used by the random generators from the standard library), this properly seeds the random generator from std::random_device. It guarantees correct seeding. Note that seeding can be relatively slow, depending on the source of randomness used. So it is best to create a Rng once and use it for all your randomness purposes.

explicit Rng(uint64_t seed) noexcept

Creates a new Rng that is seeded with a specific seed. Each Rng created from the same seed will produce the same randomness sequence. This can be useful for deterministic behavior.

As per the Romu paper, this seeds the Rng with splitMix64 algorithm and performs 10 initial rounds for further mixing up of the internal state.

Note

The random algorithm might change between nanobench releases. Whenever a faster and/or better random generator becomes available, I will switch the implementation.

Parameters

seed – The 64bit seed. All values are allowed, even 0.
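
The seeding scheme described above can be sketched as follows. This is an illustration only: the class name SketchRng is hypothetical, the constants follow the published SplitMix64 and RomuDuoJr descriptions, and the details of how nanobench derives its two state words from the seed are an assumption.

```cpp
#include <cstdint>

// Sketch of a seeded RomuDuoJr generator. Constants follow the published
// SplitMix64 and RomuDuoJr descriptions; this is not nanobench's source code.
class SketchRng {
public:
    explicit SketchRng(std::uint64_t seed) noexcept
        : mX(splitMix64(seed))
        , mY(splitMix64(seed)) {
        // 10 initial rounds to mix up the internal state further.
        for (int i = 0; i < 10; ++i) {
            (*this)();
        }
    }

    // One RomuDuoJr step: returns the old x and updates both state words.
    std::uint64_t operator()() noexcept {
        std::uint64_t const x = mX;
        mX = UINT64_C(15241094284759029579) * mY;
        mY = rotl(mY - x, 27U);
        return x;
    }

private:
    // Advances `state` and returns the next SplitMix64 output.
    static std::uint64_t splitMix64(std::uint64_t& state) noexcept {
        std::uint64_t z = (state += UINT64_C(0x9e3779b97f4a7c15));
        z = (z ^ (z >> 30U)) * UINT64_C(0xbf58476d1ce4e5b9);
        z = (z ^ (z >> 27U)) * UINT64_C(0x94d049bb133111eb);
        return z ^ (z >> 31U);
    }

    static std::uint64_t rotl(std::uint64_t v, unsigned k) noexcept {
        return (v << k) | (v >> (64U - k));
    }

    std::uint64_t mX;
    std::uint64_t mY;
};
```

Two instances constructed from the same seed produce the same sequence, which is the property that makes a seeded generator useful for reproducible benchmarks.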

Rng copy() const noexcept

Creates a copy of the Rng, thus the copy provides exactly the same random sequence as the original.

inline uint64_t operator()() noexcept

Produces a 64bit random value. This should be very fast, thus it is marked as inline. In my benchmark, this is ~46 times faster than std::default_random_engine for producing 64bit random values. It seems that the fastest std contender is std::mt19937_64. Still, this RNG is 2-3 times as fast.

Returns

uint64_t The next 64 bit random value.

inline uint32_t bounded(uint32_t range) noexcept

Generates a random number between 0 and range (excluding range).

The algorithm only produces 32bit numbers, and is slightly biased. The effect is quite small unless your range is close to the maximum value of an integer. It is possible to correct the bias with rejection sampling, but this is most likely irrelevant in practice for the purposes of this Rng.

See Daniel Lemire’s blog post A fast alternative to the modulo reduction

Parameters

range – Upper exclusive range. E.g a value of 3 will generate random numbers 0, 1, 2.

Returns

uint32_t Generated random values in range [0, range).
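
The technique from the referenced blog post replaces the modulo with a multiplication and a shift. A minimal sketch, assuming a 32bit random input, looks like this; reduceToRange is a hypothetical name, and nanobench's own bounded() may differ in detail.

```cpp
#include <cstdint>

// Maps a 32-bit random value into [0, range) with a multiplication and a
// shift instead of a modulo, as described in Lemire's blog post.
// Slightly biased unless range divides 2^32.
std::uint32_t reduceToRange(std::uint32_t random32, std::uint32_t range) noexcept {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(random32) * range) >> 32U);
}
```

The product of a 32bit value and range always fits into 64 bits, and its upper 32 bits are always smaller than range, so no branch or division is needed.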

inline double uniform01() noexcept

Provides a random uniform double value between 0 and 1. This uses the method described in Generating uniform doubles in the unit interval, and is extremely fast.

Returns

double Uniformly distributed double value in range [0,1), excluding 1.
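
One common way to turn 64 random bits into a uniform double is to keep the top 53 bits and scale them by 2^-53. The sketch below shows that approach; toUnitInterval is a hypothetical name, and nanobench may use a different bit-level variant of the same idea.

```cpp
#include <cmath>
#include <cstdint>

// Converts 64 random bits into a double in [0, 1): keep the top 53 bits
// (the precision of a double) and scale by 2^-53.
double toUnitInterval(std::uint64_t random64) noexcept {
    return std::ldexp(static_cast<double>(random64 >> 11U), -53);
}
```

Because at most 53 bits are kept, the conversion to double is exact and the largest possible result stays strictly below 1.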

template<typename Container>
void shuffle(Container &container) noexcept

Shuffles all entries in the given container. Although this has a slight bias due to the implementation of bounded(), this is preferable to std::shuffle because it is over 5 times faster. See Daniel Lemire’s blog post Fast random shuffling.

Parameters

container – The whole container will be shuffled.
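
A shuffle along these lines can be sketched as a Fisher-Yates loop that picks each index with the multiply-shift range reduction. The names sketchShuffle and shuffledRange are hypothetical, and std::mt19937 only serves as a stand-in source of 32 random bits; this is not nanobench's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Fisher-Yates shuffle that picks indices with the multiply-shift range
// reduction instead of a modulo. `next32` is any callable returning 32
// random bits; containers are assumed to hold fewer than 2^32 elements.
template <typename Container, typename Next32>
void sketchShuffle(Container& container, Next32&& next32) {
    for (std::size_t i = container.size(); i > 1; --i) {
        auto const r = static_cast<std::uint64_t>(static_cast<std::uint32_t>(next32()));
        auto const j = static_cast<std::size_t>((r * i) >> 32U); // j is in [0, i)
        using std::swap;
        swap(container[i - 1], container[j]);
    }
}

// Usage: returns the values 0..count-1 in a shuffled order.
std::vector<int> shuffledRange(int count, std::uint32_t seed) {
    std::vector<int> values(static_cast<std::size_t>(count));
    std::iota(values.begin(), values.end(), 0);
    std::mt19937 gen(seed); // stand-in 32-bit source, not nanobench's Rng
    sketchShuffle(values, gen);
    return values;
}
```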

std::vector<uint64_t> state() const

Extracts the full state of the generator, e.g. for serialization. For this RNG this is just 2 values, but to stay API compatible with future implementations that potentially use more state, we use a vector.

Returns

Vector containing the full state.

Result - Benchmark Results

class Result

doNotOptimizeAway()

template<typename Arg>
void ankerl::nanobench::doNotOptimizeAway(Arg &&arg)

Makes sure none of the given arguments are optimized away by the compiler.

Template Parameters

Arg – Type of the argument that shouldn’t be optimized away.

Parameters

arg – The input that we mark as being used, even though we don’t do anything with it.

render() - Mustache-like Templates

void ankerl::nanobench::render(char const *mustacheTemplate, Bench const &bench, std::ostream &out)

Renders output from a mustache-like template and benchmark results.

The templating facility here is heavily inspired by mustache - logic-less templates. It adds a few more features that are necessary to get all of the captured data out of nanobench. Please read the excellent mustache manual to see what this is all about.

nanobench output has two nested layers, result and measurement. Here is a hierarchy of the allowed tags:

  • {{#result}} Marks the beginning of the result layer. Whatever comes after this will be instantiated as often as a benchmark result is available. Within it, you can use these tags:

    • {{title}} See Bench::title().

    • {{name}} Benchmark name, usually directly provided with Bench::run(), but can also be set with Bench::name().

    • {{unit}} Unit, e.g. byte. Defaults to op, see Bench::unit().

    • {{batch}} Batch size, see Bench::batch().

    • {{complexityN}} Value used for asymptotic complexity calculation. See Bench::complexityN().

    • {{epochs}} Number of epochs, see Bench::epochs().

    • {{clockResolution}} Accuracy of the clock, i.e. what’s the smallest time possible to measure with the clock. For modern systems, this can be around 20 ns. This value is automatically determined by nanobench at the first benchmark that is run, and used as a static variable throughout the application’s runtime.

    • {{clockResolutionMultiple}} Configuration multiplier for clockResolution. See Bench::clockResolutionMultiple(). This is the target runtime for each measurement (epoch). That means the more accurate your clock is, the faster the benchmark will be. Basing the measurement’s runtime on the clock resolution is the main reason why nanobench is so fast.

    • {{maxEpochTime}} Configuration for a maximum time each measurement (epoch) is allowed to take. Note that at least a single iteration will be performed, even when that takes longer than maxEpochTime. See Bench::maxEpochTime().

    • {{minEpochTime}} Minimum epoch time, usually not set. See Bench::minEpochTime().

    • {{minEpochIterations}} See Bench::minEpochIterations().

    • {{epochIterations}} See Bench::epochIterations().

    • {{warmup}} Number of iterations used before measuring starts. See Bench::warmup().

    • {{relative}} True or false, depending on the setting you have used. See Bench::relative().

    Apart from these tags, it is also possible to use some mathematical operations on the measurement data. The operations are of the form {{command(name)}}. Currently name can be one of elapsed, iterations. If performance counters are available (currently only on current Linux systems), you also have pagefaults, cpucycles, contextswitches, instructions, branchinstructions, and branchmisses. All the measures (except iterations) are provided for a single iteration (so elapsed is the time a single iteration took). The following tags are available:

    • {{median(<name>)}} Calculate median of a measurement data set, e.g. {{median(elapsed)}}.

    • {{average(<name>)}} Average (mean) calculation.

    • {{medianAbsolutePercentError(<name>)}} Calculates MdAPE, the Median Absolute Percentage Error. The MdAPE is an excellent metric for the variation of measurements. It is more robust to outliers than the Mean absolute percentage error (M-APE).

      \[ \mathrm{MdAPE}(e) = \mathrm{med}\{| \frac{e_i - \mathrm{med}\{e\}}{e_i}| \} \]
      E.g. for elapsed: First, \( \mathrm{med}\{e\} \) calculates the median by sorting and then taking the middle element of all elapsed measurements. This is used to calculate the absolute percentage error to this median for each measurement, as in \( | \frac{e_i - \mathrm{med}\{e\}}{e_i}| \). All these results are sorted, and the middle value is chosen as the median absolute percent error.

      This measurement is a bit hard to interpret, but it is very robust against outliers. E.g. a value of 5% means that half of the measurements deviate less than 5% from the median, and the other half deviate more than 5% from the median.

    • {{sum(<name>)}} Sum of all the measurements. E.g. {{sum(iterations)}} will give you the total number of iterations measured in this benchmark.

    • {{minimum(<name>)}} Minimum of all measurements.

    • {{maximum(<name>)}} Maximum of all measurements.

    • {{sumProduct(<first>, <second>)}} Calculates the sum of the products of corresponding measures:

      \[ \mathrm{sumProduct}(a,b) = \sum_{i=1}^{n}a_i\cdot b_i \]
      E.g. to calculate total runtime of the benchmark, you multiply iterations with elapsed time for each measurement, and sum these results up: {{sumProduct(iterations, elapsed)}}.

    • {{#measurement}} To access individual measurement results, open the begin tag for measurements.

      • {{elapsed}} Average elapsed wall clock time per iteration, in seconds.

      • {{iterations}} Number of iterations in the measurement. The number of iterations will fluctuate due to some applied randomness, to enhance accuracy.

      • {{pagefaults}} Average number of pagefaults per iteration.

      • {{cpucycles}} Average number of CPU cycles processed per iteration.

      • {{contextswitches}} Average number of context switches per iteration.

      • {{instructions}} Average number of retired instructions per iteration.

      • {{branchinstructions}} Average number of branches executed per iteration.

      • {{branchmisses}} Average number of branches that were missed per iteration.

    • {{/measurement}} Ends the measurement tag.

  • {{/result}} Marks the end of the result layer. This is the end marker for the template part that will be instantiated for each benchmark result.

    For the layer tags result and measurement you additionally can use these special markers:

    • {{#-first}} - Begin marker of a template that will be instantiated only for the first entry in the layer. Use is only allowed between the begin and end marker of the layer. So between {{#result}} and {{/result}}, or between {{#measurement}} and {{/measurement}}. Finish the template with {{/-first}}.

    • {{^-first}} - Begin marker of a template that will be instantiated for each entry in the layer except the first. This is basically the inversion of {{#-first}}. Use is only allowed between the begin and end marker of the layer. So between {{#result}} and {{/result}}, or between {{#measurement}} and {{/measurement}}.

    • {{/-first}} - End marker for either {{#-first}} or {{^-first}}.

    • {{#-last}} - Begin marker of a template that will be instantiated only for the last entry in the layer. Use is only allowed between the begin and end marker of the layer. So between {{#result}} and {{/result}}, or between {{#measurement}} and {{/measurement}}. Finish the template with {{/-last}}.

    • {{^-last}} - Begin marker of a template that will be instantiated for each entry in the layer except the last. This is basically the inversion of {{#-last}}. Use is only allowed between the begin and end marker of the layer. So between {{#result}} and {{/result}}, or between {{#measurement}} and {{/measurement}}.

    • {{/-last}} - End marker for either {{#-last}} or {{^-last}}.
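
The statistical operations in the list above can be written down in a few lines. The following sketch mirrors the formulas for the median, MdAPE, and sumProduct; the function names are chosen for illustration and the code is not taken from nanobench. It assumes non-empty inputs, no zero measurements, and equally long inputs for sumProduct.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median by sorting a copy; for an even count the two middle values are
// averaged (one common convention). Assumes a non-empty input.
double median(std::vector<double> values) {
    std::sort(values.begin(), values.end());
    std::size_t const mid = values.size() / 2;
    if (values.size() % 2 == 1) {
        return values[mid];
    }
    return (values[mid - 1] + values[mid]) / 2.0;
}

// MdAPE following the formula med{ |(e_i - med{e}) / e_i| }.
// Returns a fraction, so 0.05 corresponds to 5%. Assumes no e_i is zero.
double medianAbsolutePercentError(std::vector<double> const& e) {
    double const med = median(e);
    std::vector<double> errors;
    errors.reserve(e.size());
    for (double const ei : e) {
        errors.push_back(std::fabs((ei - med) / ei));
    }
    return median(errors);
}

// Sum of products of corresponding measures; both inputs are assumed to
// have the same length.
double sumProduct(std::vector<double> const& a, std::vector<double> const& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

For the measurements 1, 2, and 4 the median is 2, the individual errors are 1, 0, and 0.5, and the resulting MdAPE is 0.5.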

For an overview of all the possible data you can get out of nanobench, please see the tutorial at JSON - JavaScript Object Notation.

The templates that ship with nanobench are templates::csv, templates::htmlBoxplot, templates::json, and templates::pyperf.

Parameters
  • mustacheTemplate – The template.

  • bench – Benchmark, containing all the results.

  • out – Output for the generated output.
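
As a small illustration of the tags described above, a custom template could be kept in a raw string literal. The function summaryTemplate below is hypothetical and does not ship with nanobench; it only uses tags from the list above and emits one line per benchmark result.

```cpp
// Hypothetical custom template (not shipped with nanobench): one line per
// benchmark result with its name, median time, and MdAPE.
char const* summaryTemplate() noexcept {
    return R"TPL({{#result}}{{name}};{{median(elapsed)}};{{medianAbsolutePercentError(elapsed)}}
{{/result}})TPL";
}
```

Such a string would then be passed as the mustacheTemplate argument of render(), together with a Bench that holds results and an output stream such as std::cout.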

templates::csv

char const *ankerl::nanobench::templates::csv() noexcept

CSV data for the benchmark results.

Generates a comma-separated values dataset. First line is the header, each following line is a summary of each benchmark run.

See the tutorial at CSV - Comma-Separated Values for an example.

templates::htmlBoxplot

char const *ankerl::nanobench::templates::htmlBoxplot() noexcept

HTML output that uses plotly to generate an interactive boxplot chart. See the tutorial for an example output.

The output uses only the elapsed wall clock time, and displays each epoch as a single dot.

See the tutorial at HTML Box Plots for an example.

See

ankerl::nanobench::render()

templates::json

char const *ankerl::nanobench::templates::json() noexcept

Template to generate JSON data.

The generated JSON data contains all data that has been generated. All times are as double values, in seconds. The output can get quite large.

See the tutorial at JSON - JavaScript Object Notation for an example.

templates::pyperf

char const *ankerl::nanobench::templates::pyperf() noexcept

Output in pyperf compatible JSON format, which can be used for further analysis.

See the tutorial at pyperf - Python pyperf module Output for an example of how to further analyze the output.

Environment Variables

NANOBENCH_ENDLESS - Run a Specific Test Endlessly

Sometimes it helps to run a benchmark for a very long time, so that it’s possible to attach with a profiler like perf and get meaningful statistics. This can be done with the environment variable NANOBENCH_ENDLESS. E.g. to run the benchmark with the name x += x endlessly, call the app this way:

NANOBENCH_ENDLESS="x += x" ./yourapp

When your app runs, it will run all benchmarks normally, but when it encounters a benchmark named x += x, it will run this one endlessly. It will print in nice friendly letters

NANOBENCH_ENDLESS set: running 'x += x' endlessly

once it reaches that state.

Warning

For optimal profiling with perf, you shouldn’t use pyperf system tune in the endless mode. Tuning the system with pyperf dramatically reduces the number of events that can be captured per second. This is good for getting accurate benchmark numbers from nanobench, but bad when you actually want to use perf to analyze hotspots.

NANOBENCH_SUPPRESS_WARNINGS - No Stability Warnings

In environments where it is clear that the results will not be stable, e.g. in CI where benchmarks are merely run to check that they don’t crash, the environment variable NANOBENCH_SUPPRESS_WARNINGS can be used to suppress any warnings. This includes the header warnings like for frequency scaling, and the :wavy_dash: warnings for the individual tests.

Set NANOBENCH_SUPPRESS_WARNINGS=1 to disable all warnings, or set it to 0 to enable warnings (the default mode).

NANOBENCH_SUPPRESS_WARNINGS=1 ./yourapp