Benchmark Tests

Getting Started with Benchmark Tests

The Cangjie unit test framework provides flexible and powerful support for benchmark tests.

To ensure reliable results, the framework itself handles most of the measurement work, including:

  • Managing warm-up runs.
  • Executing the benchmarked code repeatedly in batches.
  • Reducing noise and outliers caused by garbage collection (GC).
  • Measuring the framework's own overhead so it can be excluded from the results.
  • Providing statistical analysis of the results.

You can write a simple benchmark test with the @Bench macro, as in the following example:

@Test
class Foo {
    var x = 0
    @Bench
    func case() {
        x += 1
    }
}

To run benchmark tests, pass the --bench option to the test executable, or use the cjpm bench command in a cjpm project.

Parameterized Benchmark Tests

The benchmark API follows the same syntax as unit tests and integrates with the existing unit test features. Like the @TestCase macro, the @Bench macro supports parameterized cases and data strategies. In addition, most other test framework macros (such as @BeforeEach, @AfterEach, @BeforeAll, @AfterAll, @Types, and @Configure) work with benchmarks in the same way.

For example, here is a simple benchmark of the default hash method. Note that the data is created outside the benchmarked function, so it is created only once, before the benchmark starts.

@Test
class Foo {
    var hash = 0
    let data = String(Array(1000, repeat: r'a'))

    @Bench
    func hashCode() {
        hash = data.hashCode()
    }
}

The most straightforward way to run this benchmark on different data is as follows:

@Test
class Foo {
    var hash = 0
    let data_1 = String(Array(1000, repeat: r'a'))
    let data_2 = String(Array(10000, repeat: r'a'))

    @Bench[data in [data_1, data_2]]
    func hashCode(data: String) {
        hash = data.hashCode()
    }
}

Although this approach works, the inputs are named data[0] and data[1] in the final report, because the framework cannot know which attribute of an arbitrary input matters most for the benchmark. The approach also duplicates code unnecessarily. A more serious potential issue appears when the data is not a string but something more complex, such as an array of strings: a large number of live objects is allocated, most of which are needed only by one specific benchmark.

@Test
class ArrayBenchmarks {
    var hash = 0
    let data_1 = Array(1000) { i => i.toString() }
    let data_2 = Array(10000) { i => i.toString() }

    @Bench[data in [data_1, data_2]]
    func hashCode(data: Array<String>) {
        var hasher = DefaultHasher()
        for (e in data) {
            hasher.write(e)
        }
        hash = hasher.finish()
    }

    @Bench
    func createArray() {
        Array(10,repeat: 0)
    }
}

When the createArray benchmark runs, GC still traverses the elements of data_1 and data_2 every time it is triggered, even though they matter only to the hashCode benchmark. Especially when many objects are involved, this can make benchmarks unstable and reduce the accuracy of the final results.

These problems, and more complex ones, can be solved by defining dedicated strategies with the @Strategy macro. The macro accepts the same data DSL as the @Bench and @TestCase macros and generates a new strategy that maps over its inputs in a flattened (flat-map) fashion. The example can therefore evolve as follows:

@Test
class ArrayBenchmarks {
    var hash = 0

    @Strategy[len in [1000, 10000]]
    func arrays(len: Int64): Array<String> {
        Array(len) { i => i.toString() }
    }

    @Bench[data in arrays]
    func hashCodeArray(data: Array<String>) {
        var hasher = DefaultHasher()
        for (e in data) {
            hasher.write(e)
        }
        hash = hasher.finish()
    }

    @Strategy[len in [1000, 10000]]
    func strings(len: Int64): String {
        String(Array(len, repeat: r'a'))
    }

    @Bench[data in strings]
    func hashCodeString(data: String) {
        hash = data.hashCode()
    }
}

The output is as follows:

TP: package, time elapsed: 18438985580 ns, RESULT:
    TCS: ArrayBenchmarks, time elapsed: 18438962951 ns, RESULT:
    | Case           | Args   |   Median |         Err |   Err% |     Mean |
    |:---------------|:-------|---------:|------------:|-------:|---------:|
    | hashCodeArray  | 1000   | 10.68 us |  ±0.0832 us |  ±0.8% | 10.57 us |
    | hashCodeArray  | 10000  | 104.3 us |   ±0.504 us |  ±0.5% | 103.8 us |
    |                |        |          |             |        |          |
    | hashCodeString | 1000   | 165.7 ns |   ±0.513 ns |  ±0.3% | 165.6 ns |
    | hashCodeString | 10000  | 1.576 us | ±0.00644 us |  ±0.4% | 1.563 us |
Summary: TOTAL: 2
    PASSED: 2, SKIPPED: 0, ERROR: 0
    FAILED: 0

Here, the length len is the original input parameter from the framework's point of view. The data for a given benchmark is generated just before that benchmark starts, so it does not affect the benchmarks that follow, and code duplication is reduced. The [1000, 10000] array can be moved into a separate variable to reduce redundancy further. Also, because the data is produced inside the framework, the compiler cannot see the exact parameter values and cannot use them for optimization.
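
A minimal sketch of that suggestion, with the shared sizes moved into a single variable (the name sizes is ours; only the string benchmark is repeated here):

let sizes = [1000, 10000]

@Test
class ArrayBenchmarks {
    var hash = 0

    @Strategy[len in sizes]
    func strings(len: Int64): String {
        String(Array(len, repeat: r'a'))
    }

    @Bench[data in strings]
    func hashCodeString(data: String) {
        hash = data.hashCode()
    }
}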

The framework assumes that input data does not change during the benchmark, so that every call of the function receives the same version of the data. Benchmarks that modify their input data are also supported; for details, see Setting Before Each Call.

How to Achieve Ideal Results

The framework's main purpose is to reduce the impact of external factors on the variance of the execution time and to produce reliable, reproducible results. The Err% column of the result table is the main indicator of how reliable a result is. As a rule of thumb, if Err% is below 3% the result is reliable, and if it is above 10% you should find the cause and reduce the variance. This is not a universal criterion, only a first approximation: in some benchmarks the execution time legitimately varies widely, which leads to a larger variance even though the mean remains stable.

However, some external factors cannot be controlled by the framework and have to be managed by the user:

  • Compiler optimization options. In general, most optimizations should be enabled unless you specifically want to measure the effect of a particular optimization. At a minimum, enable -O2 for the Cangjie compiler.
  • Background CPU load. If the operating system switches tasks during a benchmark, the results can be significantly affected, so finish or suspend all CPU-intensive background tasks before benchmarking. Alternatively, set explicit CPU affinity so that the benchmark and other CPU-intensive tasks run on different cores.
  • External I/O. You may accidentally measure the performance and latency of I/O operations rather than of the processing that follows them. It is recommended to benchmark the I/O part and the processing part separately.
  • Unwanted compiler optimizations. When a function is benchmarked with its parameters fixed to specific values, the compiler may treat those values as constants and optimize accordingly. Parameterized benchmarks prevent this; a "black box" facility will be provided in the future to give finer control over such optimizations.
  • Side effects. All analysis performed by the framework assumes that, while the benchmarked function executes, its code path depends only on the input parameters. When writing a benchmark, make sure that side effects (such as modifying global variables or test class fields) do not change how the code executes from one iteration to the next. Note that by default the parameter values are the same on every call of the benchmarked function; if a parameter value is mutated, the next call receives the mutated value. This is not recommended; if the data must change, regenerate the parameter before each call instead.
  • Redundant static allocation. If a large number of objects is allocated before a benchmark (whether statically or in a @Before* method), make sure they are released once the benchmark is finished. Otherwise GC load increases, because GC must still traverse objects that are no longer used but remain reachable, which reduces the accuracy of subsequent benchmarks (see the sketch after this list).
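
To illustrate the last point, here is a minimal sketch (class and member names are ours) that releases large helper data in @AfterAll once the benchmarks of the class have finished, so that GC no longer has to traverse it afterwards:

@Test
class LargeDataBench {
    var sum = 0
    // Large helper data that only the benchmark in this class needs.
    var big = Array(100000) { i => i }

    @Bench
    func sumAll() {
        for (e in big) {
            sum += e
        }
    }

    @AfterAll
    func release() {
        // Replace the large array with an empty one so that later benchmarks
        // no longer pay for GC traversing objects that are no longer needed.
        big = Array<Int64>()
    }
}
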

The framework tries to detect whether such factors are present and affecting the results, and issues warnings accordingly. However, this is only a hint, not a reliable safeguard: the absence of warnings does not guarantee that all such effects have been excluded. In addition, when compiler optimizations are enabled, the framework may not know exactly what the user intends to measure.

How the Framework Measures

The core algorithm of the framework can be represented using the following pseudocode:


// Generated by the @Bench[arg in dataStrategy] macro for a function someFunc(arg: Arg).

let measurement = TimeNow() // Or another Measurement implementation supplied via the @Measure macro.
func measureBatchSomeFunc(
    parameter: ImmutableInputProvider<Arg>, // Or another BenchInputProvider implementation returned by the strategy.
    batchSize: Int64,
    maxBatchSize: Int64,
): Float64 {

    parameter.reset(maxBatchSize)   // If input data needs to be pre-generated in batches, reset BatchInputProvider.

    measurement.setup()
    let start = measurement.measure()

    for(i in 0..maxBatchSize) {     // The loop is always executed to maxBatchSize so that the time required by the loop can be excluded from the final result.
        let arg = parameter.get(i)
        if (i < batchSize) {
            /*
            body of someFunc
            */
        }
    }

    measurement.measure() - start
}
Framework.addBenchmark(dataStrategy, Benchmark(measureBatchSomeFunc))

// Inside the framework.

for (data in dataStrategy) {
    benchmark.runBench(data)
}

doStatisticalAnalysis(benchmark.results)

class Benchmark<T> {
    let results = ArrayList<(BatchSize, BatchResult)>()
    Benchmark(let runBatch: (T, Int64, Int64) -> Float64) {}

    func runBench(data: T) {
        let estimation = warmup(data)

        let batchSizes = calcBatchSizes(estimation)
        for (i in batchSizes) {
            let resultBatch = runBatch(data, i, batchSizes.end)
            results.append((i, resultBatch))

            // Run GC based on the explicitGC configuration parameter.
            explicitGC.invoke()
        }
    }

    func warmup(data: T) {
        // Perform warm-up based on the warmup configuration parameter.
    }

    func calcBatchSizes(estimation: Duration) {
        // Select the most appropriate batch processing volume and size based on the warm-up estimation.
    }

}

The main idea is that the difference in execution time between measureBatchSomeFunc(data, i, maxBatchSize) and measureBatchSomeFunc(data, i + n, maxBatchSize) equals the time needed to execute someFunc n times.

This means that the time of a single someFunc call can be estimated accurately from that difference, and the estimate excludes the measurement overhead, the batch loop itself, and the cost of fetching the next input value.
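
For illustration (the numbers are invented): if measureBatchSomeFunc(data, 50, 100) takes 1.00 ms and measureBatchSomeFunc(data, 60, 100) takes 1.20 ms, the 10 extra real iterations account for the 0.20 ms difference, so one someFunc call is estimated at about 20 µs; the loop and input-fetching costs are identical in both runs and cancel out.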

Advanced Features

Some benchmark tests require special configurations to accurately determine the expected results or fully understand the test results. This framework provides multiple APIs to cover as many complex use cases as possible.

Detailed Reports

When benchmark results are significantly unstable, the aggregated statistics alone are not enough for analysis. All raw measurements can be printed, but that much output is inconvenient to analyze by hand. To solve this, an HTML report is provided that contains charts and tables with the raw measurements and the statistical analysis. To generate it, use the --report-format=html option. The report has a navigation page listing all executed test cases, plus a detailed page for each case showing every parameter and its measurements. For each measurement series, a kernel density estimation plot of the probability distribution and a plot of all raw measurements are also provided. The framework currently uses the gnuplot tool to draw the charts, so gnuplot must be installed.

If you have a statistical background, you can collect and analyze the data yourself: use the --report-format=csv-raw option to export the raw measurements to a CSV file.

Custom Measurements

By default, the framework measures time, which is usually sufficient. In some cases, however, other performance characteristics need to be measured for a more detailed evaluation. The Measurement API is provided for this purpose, and several common advanced measurements are supported out of the box. To enable them, annotate the test class with the @Measure macro and pass the list of measurements:

@Test
@Measure[TimeNow(Nanos), CpuCycles(), Perf(), Perf(HW_CACHE_MISSES)]
class Bench {
    var x = 0
    @Bench
    func case() {
        x += 1
    }
}

The framework provides the following measurements out of the box:

  • TimeNow: measures wall-clock time using DateTime.now. A specific time unit can be configured so that all results are printed in the same unit.
  • CpuCycles: measures CPU cycles using native CPU instructions. It is available only on platforms where such instructions can be executed from user space.
  • Perf: measures various hardware and software CPU counters using the Linux perf_event_open system call.

Setting Before Each Call

Suppose we want to benchmark the Array.sort function. Because this function modifies its input, each call except the first would sort an already sorted array and distort the results. To solve this, the data must be regenerated before each call: return an implementation of the BenchInputProvider interface from the function annotated with @Strategy.

@Test
class Foo {
    @Strategy[len in [1000, 10000]]
    func arrays(len: Int64): GenerateEachInputProvider<Array<Int64>> {
        // Build a fresh, unsorted (descending) array before every call.
        GenerateEachInputProvider { => Array(len) { i => len - i } }
    }

    @Bench[data in arrays]
    func sort(data: Array<Int64>) {
        data.sort()
    }
}

The framework provides four implementations of this interface, which should cover all possible use cases:

  • ImmutableInputProvider: simply copies the original data on each call. This is the default when the strategy returns a regular type, i.e. one that does not implement BenchInputProvider.
  • GenerateEachInputProvider: generates fresh data before every call of the benchmarked function. The generation happens inside the measured interval, and the framework subtracts the generation time from the total. This requires the generator to be as pure and stable as possible, and its execution time should be smaller than the benchmark time itself; otherwise variations in the generator can make the results inaccurate. This should be the preferred way to provide fresh data before each call; if these conditions cannot be met, or the variance remains large, consider one of the other providers.
  • BatchSizeOneInputProvider: also generates data before every call of the benchmarked function, but forces the batch size to 1, so the generation time is never counted in the measurement. In this mode the framework measures every single call separately and loses the advantages of batch execution, so its main drawback is lower precision. Whether that matters depends on the hardware; in general, be careful if the benchmarked time is below 1 microsecond, and avoid this mode if it is below 100 nanoseconds (see the sketch after this list).
  • BatchInputProvider: generates all inputs for a batch into a buffer before the batch starts. In theory this keeps the advantages of batch execution, but it has its own problems. First, it causes redundant allocation; with a large enough batch size this can create significant GC load. Second, measurements may differ slightly: in the other modes the data is passed to the benchmarked function immediately after it is generated and is therefore likely to still be in the CPU cache, whereas in this mode the first generated element may already have been evicted from the cache by the time the whole batch has been generated.
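
If the conditions for GenerateEachInputProvider cannot be met, switching providers only means changing the strategy's return type. A minimal sketch of the sort benchmark above using BatchSizeOneInputProvider instead, assuming it is constructed from a generator closure in the same way as GenerateEachInputProvider:

@Test
class Foo {
    @Strategy[len in [1000, 10000]]
    func arrays(len: Int64): BatchSizeOneInputProvider<Array<Int64>> {
        // Fresh unsorted array for every call; each call is measured on its own,
        // so the generation time never enters the measurement.
        BatchSizeOneInputProvider { => Array(len) { i => len - i } }
    }

    @Bench[data in arrays]
    func sort(data: Array<Int64>) {
        data.sort()
    }
}
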

Advanced Configuration Parameters

Many benchmark-related configuration parameters can be set with the @Configure macro. Currently, the following parameters are supported (a combined example is sketched after this list):

  • batchSize: specifies an exact batch size, for cases where the relationship between batch size and execution time is non-linear. By default, the framework selects the batch size automatically based on the warm-up results.
  • minBatches: specifies the minimum number of batches the workload is split into. By default, the framework chooses a suitable value between 10 and 200 based on the warm-up results. Note that a large value makes the statistical analysis slow.
  • warmup: specifies how long the framework calls the benchmarked function to warm up before measuring.
  • minDuration: specifies the target duration of the benchmark. The framework chooses the batch size and the number of batches so that the total execution time slightly exceeds minDuration.
  • explicitGC: specifies how GC is performed between batches. By default, the framework triggers GC between batches to spread the GC load evenly; otherwise, memory-intensive benchmarks can suffer from unpredictable GC pauses that distort the results. For benchmarks that allocate no memory, however, this behavior can itself make results inaccurate or unstable; set the parameter to Disable to turn it off.
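
Several of these parameters can be combined in a single @Configure macro, for example on the test class. A minimal sketch, assuming the @Configure[key: value] form and purely illustrative values:

@Test
@Configure[warmup: Duration.second, minDuration: Duration.second * 5]
class Foo {
    var x = 0

    @Bench
    func case() {
        x += 1
    }
}
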

In addition, it is sometimes useful to run a benchmark with several values of a configuration parameter to check whether it affects the result. For this, the framework supports a special form of the data DSL: config.<parameter> in <strategy>.

@Test
class Foo {
    var x = 0
    @Bench[config.warmup in [Duration.second*0.1, Duration.second]]
    func case() {
        x += 1
    }
}

Preventing Unnecessary Optimizations

Complex code usually has parameters that affect its behavior, so to make a benchmark deterministic you may need to fix those parameters to specific values. The benchmark might look like this:

@Bench
func foo(): Unit {
    complexCode(param: 1)
}

However, a real program may not call the function this way: in a real program the value of param is computed at run time, whereas here it is a literal that the compiler can exploit during optimization. The problem is that we want complexCode in the benchmark to behave exactly as it would in a real program. The solution is to use a strategy to hide the exact value from the compiler:

@Bench[arg in [1]]
func foo(arg: Int64): Unit {
    complexCode(param: arg)
}

However, one problem remains: the return value of complexCode is unused. If the compiler detects this, it may partially or even completely remove the call. To prevent that, the return value should be passed through a black box. Black-box support is still under development, so for now the workaround is to store the return value in a global variable.
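
A minimal sketch of that workaround; complexCode here is just a stub standing in for the real code under test, and sink is an illustrative global variable:

// Illustrative stand-in for the real code under test.
func complexCode(param!: Int64): Int64 {
    param * 2
}

// Global variable that absorbs the result so the compiler cannot discard the call.
var sink = 0

@Bench[arg in [1]]
func foo(arg: Int64): Unit {
    sink = complexCode(param: arg)
}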