The Art of Realizing One is Wrong
@ReneSchwietzke
@reneschwietzke@foojay.social
#java #qa #test #performance #performancetest #quality
For your own browsing and copying needs
Slides
https://t.ly/NxQiB
capital November, X-ray, capital Quebec, India, capital Bravo
Why you should stick around
This is not a JMH training!
Photos from Pexels, free to use, no attribution needed, CC0 or Public Domain
Slides: https://t.ly/NxQiB
What is my motivation?
If you want to drive fast, you have to understand how a car works. Some physics knowledge won't hurt either.
Slides: https://t.ly/NxQiB
Definitions and Guesses
A definition first
To study (something, such as a competitor's product or business practices) in order to improve the performance of one's own company
A point of reference from which measurements may be made
A standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)
Source: Merriam-Webster
Slides: https://t.ly/NxQiB
What is this micro thing and why is it micro?
A simplified benchmark that tests components in isolation.
It simplifies the process by abstracting the problem, removing complexity, and improving debuggability.
Components are usually quite small. Compare that to the unit test idea!
Source: René Schwietzke
Slides: https://t.ly/NxQiB
What might be our goals?
Slides: https://t.ly/NxQiB
Side effects of simplification
Slides: https://t.ly/NxQiB
Some Experiments First
Always use System.arraycopy
over manual copying
public static void main(String[] args)
{
var length = 255;
var src = new byte[length];
var dest = new byte[length];
var s1 = System.nanoTime();
System.arraycopy(src, 0, dest, 0, length);
var e1 = System.nanoTime();
var s2 = System.nanoTime();
for (int i = 0; i < length; i++)
{
dest[i] = src[i];
}
var e2 = System.nanoTime();
System.out.printf("System Copy: %d ns%n", e1 - s1);
System.out.printf("Manual Copy: %d ns%n", e2 - s2);
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,685 ns
Manual Copy: 5,781 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,565 ns
Manual Copy: 4,609 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,545 ns
Manual Copy: 4,589 ns
Use a sized StringBuilder
all the time
public static void main(String[] args) {
final String a = "This is a test";
final String b = "More fun with a string";
final long c = System.currentTimeMillis(); // e.g. 1684581249117
var l = a.length() + b.length() + 13;
var s3 = System.nanoTime();
{
var r = a + b + c;
}
var e3 = System.nanoTime();
var s1 = System.nanoTime();
{
var sb = new StringBuilder();
sb.append(a); sb.append(b); sb.append(c);
var r = sb.toString();
}
var e1 = System.nanoTime();
var s2 = System.nanoTime();
{
var sb = new StringBuilder(l);
sb.append(a); sb.append(b); sb.append(c);
var r = sb.toString();
}
var e2 = System.nanoTime();
System.out.printf("Pure String usage : %d ns%n", e3 - s3);
System.out.printf("StringBuilder : %d ns%n", e1 - s1);
System.out.printf("StringBuilder size: %d ns%n", e2 - s2);
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 44,441,844 ns
StringBuilder : 93,092 ns
StringBuilder size: 7,123 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 38,471,326 ns
StringBuilder : 74,865 ns
StringBuilder size: 6,076 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 37,914,191 ns
StringBuilder : 72,841 ns
StringBuilder size: 6,146 ns
Streams are slow and parallel()
is always better
public static void main(String[] args)
{
var data = new long[10000];
for (int i = 0; i < data.length; i++) {
data[i] = i;
}
// manual
var s1 = System.nanoTime(); {
var total = 0.0d;
for (int i = 0; i < data.length; i++) {
total += calc(data[i]);
}
var r = total / data.length;
}
var e1 = System.nanoTime();
// stream
var s2 = System.nanoTime(); {
var r = Arrays.stream(data).map(i -> calc(i)).average();
}
var e2 = System.nanoTime();
// stream parallel
var s3 = System.nanoTime(); {
var r = Arrays.stream(data).parallel().map(i -> calc(i)).average();
}
var e3 = System.nanoTime();
...
}
private static long calc(long c) {
var t = c;
for (long i = 100; i > 0; i--) {
t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
}
return t;
}
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 5,832,700 ns
Stream : 23,259,444 ns
Parallel stream: 10,246,827 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 5,834,677 ns
Stream : 20,817,603 ns
Parallel stream: 10,223,933 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 6,054,650 ns
Stream : 20,718,407 ns
Parallel stream: 10,415,500 ns
Forget everything you just saw
What do we really want?
Our true goal: We want to learn and understand by changing details under controlled conditions. Turn a knob a bit and see what happens. We want to derive reasoning or even proof from our changes.
Don't ask what you can do, ask others what they have done (not JFK)!
JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.
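To make this concrete, here is a minimal sketch of what a JMH benchmark looks like (a hypothetical class, not from the talk's repository); the annotations are the same ones used in the examples that follow.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class MyFirstBenchmark {
    @Param({"10", "1000"}) int SIZE;

    int[] data;

    @Setup
    public void setup() {
        data = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            data[i] = i;
        }
    }

    @Benchmark
    public long sum() {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum; // return the result so the JIT cannot drop the work
    }
}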
Some Factors by Example
We don't have the time...
https://reneschwietzke.de/java/benchmarking-a-number-journey-int-to-string.html
Whom do you ask for time?
System.nanoTime()
and System.currentTimeMillis()
We can measure nanoseconds, can't we?
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(2)
@State(Scope.Thread)
public class TimerResolution
{
private long lastValue;
@Benchmark
public long latency_nanotime()
{
return System.nanoTime();
}
public static double precision(long[] t)
{
// store samples
for (int i = 0; i < t.length; i++)
{
t[i] = System.nanoTime();
}
// average distance between two consecutive calls
long total = 0;
for (int i = 1; i < t.length; i++)
{
total += t[i] - t[i - 1];
}
return (double) total / (t.length - 1);
}
}
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 25.519 ± 0.985 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, TSC
372942679845594
372942679845635
372942679845665
372942679845705
372942679845745
372942679845775
Difference between calls 30 to 40 ns
Source: https://github.com/shipilev/timers-bench
Not every clock is created equal
The x86 Time Stamp Counter (TSC) is a low-latency, monotonically increasing clock used to measure event durations
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 1,128.204 ± 32.831 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, HPET
374172181482681
374172181484078
374172181484986
374172181485964
374172181487291
374172181488199
Difference between calls 900 to 1,400 ns
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 25.519 ± 0.985 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, TSC
495719354555374
495719354555394
495719354555424
495719354555444
495719354555464
495719354555484
Difference between calls 20 to 30 ns
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource
Java evolves and so does performance
public class Example1ArrayCopying {
@Param({"1000"}) int SIZE;
byte[] src;
@Setup
public void setup() {
src = new byte[SIZE];
final Random r = new Random(7L);
r.nextBytes(src);
}
@Benchmark
public byte[] systemCopy() {
var dest = new byte[src.length];
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
@Benchmark
public byte[] arrayCopy() {
return Arrays.copyOf(src, src.length);
}
@Benchmark
public byte[] manualCopy() {
var dest = new byte[src.length];
for (int i = 0; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
# JDK 11
Benchmark (SIZE) Mode Cnt Score Error Units
Example1ArrayCopying.manualCopy 1000 avgt 3 152.889 ± 32.305 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 130.963 ± 119.643 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 128.726 ± 83.837 ns/op
# JDK 17
Example1ArrayCopying.manualCopy 1000 avgt 3 153.508 ± 7.987 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 124.021 ± 35.199 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 121.271 ± 35.723 ns/op
# JDK 19
Example1ArrayCopying.manualCopy 1000 avgt 3 149.298 ± 79.705 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 129.439 ± 104.106 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 133.358 ± 40.304 ns/op
# JDK 20
Example1ArrayCopying.manualCopy 1000 avgt 3 152.070 ± 7.096 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 119.613 ± 9.873 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 117.901 ± 8.322 ns/op
# Graal 22.3.r17 - Whoops!
Example1ArrayCopying.manualCopy 1000 avgt 3 335.612 ± 22.000 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 150.035 ± 4.466 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 149.176 ± 4.496 ns/op
When, where, and how often is it called - the JIT at work
public class Streams {
final int SIZE = 1024;
final int[] integers = new int[SIZE];
@Setup
public void setup() {
for (int i = 0; i < integers.length; i++) {
integers[i] = i - SIZE;
}
}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayWarm() {...}
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayHot() {...}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v % 2 == 0) {
sum += v;
}
}
return sum;
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int loopWarm() {...}
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int loopHot() {...}
}
Benchmark Mode Cnt Score Error Units
Streams.lambdaArrayCold avgt 5 688,797 ± 4,665,070 ns/op
Streams.lambdaArrayWarm avgt 5 12,797 ± 59,453 ns/op
Streams.lambdaArrayHot avgt 5 1,113 ± 2,602 ns/op
Streams.loopCold avgt 5 52,936 ± 42,276 ns/op
Streams.loopWarm avgt 5 2,030 ± 2,527 ns/op
Streams.loopHot avgt 5 1,422 ± 733 ns/op
Streams.lambdaArrayCold
Iteration 1: 2,855,843.000 ns/op
Iteration 2: 172,353.000 ns/op
Iteration 3: 139,810.000 ns/op
Iteration 4: 139,670.000 ns/op
Iteration 5: 136,313.000 ns/op
Streams.lambdaArrayWarm
Iteration 1: 40,097.000 ns/op
Iteration 2: 4,712.000 ns/op
Iteration 3: 9,992.667 ns/op
Iteration 4: 4,145.143 ns/op
Iteration 5: 5,039.667 ns/op
Streams.loopCold
Iteration 1: 44,510.286 ns/op
Iteration 2: 61,584.000 ns/op
Iteration 3: 67,831.000 ns/op
Iteration 4: 45,749.000 ns/op
Iteration 5: 45,007.000 ns/op
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class HeavyStreams {
// same fields and setup as before, but integers now holds BigInteger values
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays.stream(integers)
.filter(i -> i.mod(BigInteger.TWO).equals(BigInteger.ZERO))
.map(i -> i.sqrt())
.mapToInt(BigInteger::intValue)
.sum();
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayWarm() { ... }
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayHot() { ... }
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v.mod(BigInteger.TWO).equals(BigInteger.ZERO)) {
sum += v.sqrt().intValue();
}
}
return sum;
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int loopWarm() { ... }
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int loopHot() { ... }
}
# ms now!!!
Benchmark Mode Cnt Score Error Units
HeavyStreams.lambdaArrayCold avgt 5 362 ± 113 ms/op
HeavyStreams.lambdaArrayWarm avgt 5 297 ± 177 ms/op
HeavyStreams.lambdaArrayHot avgt 5 264 ± 12 ms/op
HeavyStreams.loopCold avgt 5 363 ± 100 ms/op
HeavyStreams.loopWarm avgt 5 302 ± 183 ms/op
HeavyStreams.loopHot avgt 5 271 ± 16 ms/op
HeavyStreams.lambdaArrayCold
Iteration 1: 409 ms/op
Iteration 2: 372 ms/op
Iteration 3: 347 ms/op
Iteration 4: 342 ms/op
Iteration 5: 339 ms/op
HeavyStreams.lambdaArrayWarm
Iteration 1: 263 ms/op
Iteration 2: 270 ms/op
Iteration 3: 264 ms/op
Iteration 4: 263 ms/op
Iteration 5: 263 ms/op
HeavyStreams.loopCold
Iteration 1: 401 ms/op
Iteration 2: 376 ms/op
Iteration 3: 352 ms/op
Iteration 4: 342 ms/op
Iteration 5: 340 ms/op
It's dead, Jim!
public class DeadCode {
private double computeSideEffect(int n) {
return BigInteger.valueOf(n).pow(7).sqrt().doubleValue();
}
private double fib(int n) {
int fib = 1;
int prevFib = 1;
for (int i = 2; i < n; i++) {
int temp = fib;
fib += prevFib;
prevFib = temp;
}
return fib;
}
@Benchmark
public double baseline() {
return 5.0d;
}
@Benchmark
public void computeWrongSE() {
computeSideEffect(2000);
}
@Benchmark
public double computeRightSE() {
return computeSideEffect(2000);
}
@Benchmark
public void computeWrong() {
fib(2000);
}
@Benchmark
public double computeRight() {
return fib(2000);
}
}
# JDK 11
Benchmark Mode Cnt Score Error Units
DeadCode.baseline avgt 3 3.4 ± 0.2 ns/op
# JDK 17
Benchmark Mode Cnt Score Error Units
DeadCode.baseline avgt 3 0.4 ± 0.1 ns/op
DeadCode.computeWrong avgt 3 16.1 ± 1.6 ns/op
DeadCode.computeRight avgt 3 479.2 ± 20.2 ns/op
DeadCode.computeWrongSE avgt 3 1616.7 ± 434.4 ns/op
DeadCode.computeRightSE avgt 3 1713.4 ± 876.9 ns/op
computeWrong is not really dropped
computeWrong with fib(1000) yields 8.8 ns - ???
computeWrong with fib(100) yields 0.4 ns - ???
https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_08_DeadCode.java
https://self-learning-java-tutorial.blogspot.com/2022/08/jmh-be-cautious-with-dead-code.html
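Besides returning the result, JMH offers a Blackhole to keep otherwise unused values alive. A minimal sketch (the class name is made up; fib is the same helper as above):
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class DeadCodeBlackhole {
    @Benchmark
    public void computeConsumed(Blackhole bh) {
        // sinking the value into the Blackhole stops the JIT from proving it unused
        bh.consume(fib(2000));
    }

    private double fib(int n) {
        int fib = 1;
        int prevFib = 1;
        for (int i = 2; i < n; i++) {
            int temp = fib;
            fib += prevFib;
            prevFib = temp;
        }
        return fib;
    }
}
Returning the value, as computeRight does, achieves the same effect, because JMH consumes return values implicitly.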
public class CachingEffects {
@Param({"true", "false"})
boolean live;
int SIZE = 10000;
int LOAD = 1000000;
List<String> work;
List<String> load;
@Setup
public void setup() {
load = new ArrayList<>(LOAD);
var s = new StringBuilder("ß09876512[more here]");
s.append(System.currentTimeMillis());
s.append('$');
for (int i = 0; i < LOAD; i++) {
// align the copies next to each other
load.add(s.toString());
}
work = new ArrayList<>(SIZE);
if (live) {
// shuffle the data, with fixed outcome
Collections.shuffle(load, new Random(42L));
}
// copy us the data, when live, we get a random pattern, when
// not live, we get basically data that is laid out next to each other
for (int i = 0; i < SIZE; i++) {
work.add(load.get(i));
}
}
...
}
public class CachingEffects
{
...
/**
* Manually search for a character pos
*/
@Benchmark
public int manual() {
var n = 0;
for (int i = 0; i < work.size(); i++) {
var s = work.get(i);
for (int index = 0; index < s.length(); index++) {
if (s.charAt(index) == '$') {
n += index;
break;
}
}
}
return n;
}
/**
* Search via indexOf for the char position
*/
@Benchmark
public int indexof() {
var n = 0;
for (int i = 0; i < work.size(); i++) {
n += work.get(i).indexOf('$');
}
return n;
}
}
The hidden memory that speeds things up
# JDK 11.0.17 - Aligned Data
Benchmark (live) Mode Cnt Score Error Units Diff
CachingEffects.indexof false avgt 3 435,184 ± 54,444 ns/op
CachingEffects.manual false avgt 3 318,798 ± 166,076 ns/op 73%
# JDK 11.0.17 - More Random Data
Benchmark (live) Mode Cnt Score Error Units Diff
CachingEffects.indexof true avgt 3 1,420,598 ± 549,023 ns/op
CachingEffects.manual true avgt 3 1,389,879 ± 342,328 ns/op 98%
# JDK 11.0.17 - Single Executions
Benchmark (live) Mode Cnt Score Units
CachingEffects.indexof false avgt 1 562,553 ns/op
CachingEffects.indexof false avgt 1 456,493 ns/op
CachingEffects.indexof false avgt 1 469,517 ns/op
CachingEffects.manual false avgt 1 313,673 ns/op
CachingEffects.manual false avgt 1 677,170 ns/op
CachingEffects.manual false avgt 1 309,094 ns/op
CachingEffects.indexof true avgt 1 1,637,546 ns/op
CachingEffects.indexof true avgt 1 2,477,774 ns/op
CachingEffects.indexof true avgt 1 2,604,464 ns/op
CachingEffects.manual true avgt 1 1,560,230 ns/op
CachingEffects.manual true avgt 1 2,067,318 ns/op
CachingEffects.manual true avgt 1 1,888,360 ns/op
CPU design, usage, and speed make things go fuzzy
# Laptop Busy, JDK 11
Benchmark (live) Mode Cnt Score Units Code Cache
CachingEffects.indexof true avgt 3 1,572,759 ns/op
CachingEffects.manual true avgt 3 1,413,919 ns/op 89%
CachingEffects.indexof false avgt 3 377,649 ns/op 24%
CachingEffects.manual false avgt 3 294,783 ns/op 78% 21%
# Laptop Less Busy, JDK 11
CachingEffects.indexof true avgt 3 1,263,913 ns/op
CachingEffects.manual true avgt 3 1,150,578 ns/op 91%
CachingEffects.indexof false avgt 3 422,753 ns/op 33%
CachingEffects.manual false avgt 3 301,250 ns/op 71% 26%
# Turbo off, JDK 11
CachingEffects.indexof true avgt 3 2,191,820 ns/op
CachingEffects.manual true avgt 3 1,928,017 ns/op 88%
CachingEffects.indexof false avgt 3 858,436 ns/op 39%
CachingEffects.manual false avgt 3 687,400 ns/op 80% 36%
# Google c2-standard-8, Intel, JDK 11
CachingEffects.indexof true avgt 3 1,064,037 ns/op
CachingEffects.manual true avgt 3 908,363 ns/op 85%
CachingEffects.indexof false avgt 3 509,849 ns/op 47%
CachingEffects.manual false avgt 3 476,487 ns/op 93% 52%
# Google c2d-standard-8, AMDs, JDK 11
CachingEffects.indexof true avgt 3 707,885 ns/op
CachingEffects.manual true avgt 3 599,987 ns/op 84%
CachingEffects.indexof false avgt 3 375,242 ns/op 53%
CachingEffects.manual false avgt 3 261,686 ns/op 70% 44%
# Google c3-highcpu-8, Intel, JDK 11
CachingEffects.indexof true avgt 3 1,122,492 ns/op
CachingEffects.manual true avgt 3 1,075,496 ns/op 96%
CachingEffects.indexof false avgt 3 357,894 ns/op 32%
CachingEffects.manual false avgt 3 303,531 ns/op 85% 28%
The unknown fortune teller
https://shipilev.net/jvm/anatomy-quarks/28-frequency-based-code-layout/
public class BranchPrediction {
private static final int COUNT = 10000;
private int[] sorted = new int[COUNT];
private int[] unsorted = new int[COUNT];
private int[] pattern2 = new int[COUNT];
private int[] pattern5 = new int[COUNT];
private int[] shuffledPattern = new int[COUNT];
@Setup
public void setup() {
final Random r = new Random(13241L);
// using int here to avoid problems with memory and caches aka
// only inline data and hence same caching behavior
for (int i = 0; i < COUNT; i++) {
final int d = r.nextInt(100_000_000) + 1;
sorted[i] = d; unsorted[i] = d;
pattern2[i] = i % 2 == 0 ? 100_000_000 : r.nextInt(50_000_000);
pattern5[i] = i % 5 == 0 ? 100_000_000 : r.nextInt(50_000_000);
shuffledPattern[i] = pattern2[i];
}
Arrays.sort(sorted);
// make it noisy to break the branch predictor
for (int i = 0; i < shuffledPattern.length; i++) {
// swap each slot with a random partner to destroy the regular pattern
final int j = r.nextInt(shuffledPattern.length);
final int tmp = shuffledPattern[i];
shuffledPattern[i] = shuffledPattern[j];
shuffledPattern[j] = tmp;
}
}
public long doIt(int[] array) {
long sum = 2;
for (int v : array) {
if (v > 50_000_000) {
sum += (v >> 2);
}
else {
sum += (v << 2);
}
}
return sum;
}
...
}
Your data sorting becomes the driver of the results
# JDK 11
Benchmark Mode C Score Units
BranchPrediction.unsorted avgt 4 14,424 ns/op
BranchPrediction.sorted avgt 4 4,709 ns/op
BranchPrediction.pattern2 avgt 4 4,943 ns/op
BranchPrediction.pattern5 avgt 4 4,642 ns/op
BranchPrediction.shuffledPattern avgt 4 14,209 ns/op
# JDK 17
BranchPrediction.unsorted avgt 4 14,285 ns/op
BranchPrediction.sorted avgt 4 4,858 ns/op
BranchPrediction.pattern2 avgt 4 4,353 ns/op
BranchPrediction.pattern5 avgt 4 4,744 ns/op
BranchPrediction.shuffledPattern avgt 4 14,132 ns/op
# JDK 11, -XX:-BlockLayoutByFrequency
Benchmark Mode C Score Units
BranchPrediction.unsorted avgt 4 9,800 ns/op
BranchPrediction.sorted avgt 4 4,791 ns/op
BranchPrediction.pattern2 avgt 4 5,509 ns/op
BranchPrediction.pattern5 avgt 4 5,802 ns/op
BranchPrediction.shuffledPattern avgt 4 9,865 ns/op
# JDK 17, -XX:-BlockLayoutByFrequency
BranchPrediction.unsorted avgt 4 9,304 ns/op
BranchPrediction.sorted avgt 4 4,239 ns/op
BranchPrediction.pattern2 avgt 4 5,530 ns/op
BranchPrediction.pattern5 avgt 4 4,182 ns/op
BranchPrediction.shuffledPattern avgt 4 9,085 ns/op
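How such a JVM flag reaches the benchmark: JMH forks a fresh JVM per @Fork, so the flag can be appended per fork via the annotation or on the command line with -jvmArgsAppend. A sketch reusing the BranchPrediction class from above (the uber-jar path assumes the usual JMH Maven build):
@Fork(value = 2, jvmArgsAppend = {"-XX:-BlockLayoutByFrequency"})
public class BranchPrediction {
    // ... benchmarks as shown above
}
$ java -jar target/benchmarks.jar BranchPrediction -jvmArgsAppend "-XX:-BlockLayoutByFrequency"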
Static and little data is not enough
A String has a length of just 42 characters
com.devoxxpl23.HashingDataDimensions
public class HashingDataDimensions {
@Param({"10", "50", "100", "500", "1000", "2000"}) int CHARARRAY_SIZE;
@Param({"true", "false"}) boolean RANDOM;
final static int MAX = 100_000;
List<char[]> data, holder;
final Random r = new Random(); // index source for the random read pattern
@Setup
public void setup() {
data = new ArrayList<>(MAX);
holder = new ArrayList<>(MAX);
for (int i = 0; i < MAX; i++) {
// the data to work with
var a = random(CHARARRAY_SIZE);
data.add(a);
holder.add(a);
if (!RANDOM) {
// when not RANDOM, every slot points to the same first array, so both cases
// run the exact same code path; only the RANDOM case touches different memory
data.set(data.size() - 1, data.get(0));
}
}
}
@Benchmark
public void traditionalHashcode(Blackhole b) {
for (int i = 0; i < data.size(); i++) {
b.consume(hashCodeClassic(data.get(r.nextInt(data.size()))));
}
}
...
public static int hashCodeClassic(char[] src) {
final int last = src.length;
int h = 0;
for (int i = 0; i < last; i++) {
h = 31 * h + src[i];
}
return h;
}
public static int hashCodeVectoredJDK19(char[] src) {
int h = 0;
int i = 0;
int l, l2;
l = l2 = src.length;
l = l & ~(8 - 1);
for (; i < l; i += 8) {
h = -1807454463 * h +
1742810335 * src[i + 0] +
887503681 * src[i + 1] +
28629151 * src[i + 2] +
923521 * src[i + 3] +
29791 * src[i + 4] +
961 * src[i + 5] +
31 * src[i + 6] +
1 * src[i + 7];
}
for (; i < l2; i++) {
h = 31 * h + src[i];
}
return h;
}
public static int hashCodeNoMul(char[] src) {
final int last = src.length;
int h = 0;
for (int i = 0; i < last; i++) {
h = ((h << 5) - h) + src[i];
}
return h;
}
}
Benchmark (CHARARRAY_SIZE) (RANDOM) Score Units %
traditionalHashcode 10 true 2,354 us/op
traditionalHashcode 10 false 1,424 us/op 60%
traditionalHashcode 50 true 6,905 us/op
traditionalHashcode 50 false 4,255 us/op 61%
traditionalHashcode 100 true 12,590 us/op
traditionalHashcode 100 false 8,651 us/op 68%
traditionalHashcode 500 true 58,176 us/op
traditionalHashcode 500 false 43,322 us/op 74%
traditionalHashcode 1000 true 103,746 us/op
traditionalHashcode 1000 false 86,582 us/op 83%
traditionalHashcode 2000 true 194,827 us/op
traditionalHashcode 2000 false 173,190 us/op 88%
vectoredHashcode 10 true 2,701 us/op
vectoredHashcode 10 false 1,682 us/op 62%
vectoredHashcode 50 true 5,893 us/op
vectoredHashcode 50 false 2,919 us/op 50%
vectoredHashcode 100 true 8,593 us/op
vectoredHashcode 100 false 4,735 us/op 55%
vectoredHashcode 500 true 40,104 us/op
vectoredHashcode 500 false 20,698 us/op 51%
vectoredHashcode 1000 true 63,079 us/op
vectoredHashcode 1000 false 40,718 us/op 64%
vectoredHashcode 2000 true 106,395 us/op
vectoredHashcode 2000 false 80,286 us/op 75%
Benchmark (CHARARRAY_SIZE) (RANDOM) Score Units %
traditionalHashcode 10 true 2,354 us/op
vectoredHashcode 10 true 2,701 us/op 114%
traditionalHashcode 50 true 6,905 us/op
vectoredHashcode 50 true 5,893 us/op 85%
traditionalHashcode 100 true 12,590 us/op
vectoredHashcode 100 true 8,593 us/op 68%
traditionalHashcode 500 true 58,176 us/op
vectoredHashcode 500 true 40,104 us/op 69%
traditionalHashcode 1000 true 103,746 us/op
vectoredHashcode 1000 true 63,079 us/op 61%
traditionalHashcode 2000 true 194,827 us/op
vectoredHashcode 2000 true 106,395 us/op 55%
traditionalHashcode 10 false 1,424 us/op
vectoredHashcode 10 false 1,682 us/op 118%
traditionalHashcode 50 false 4,255 us/op
vectoredHashcode 50 false 2,919 us/op 69%
traditionalHashcode 100 false 8,651 us/op
vectoredHashcode 100 false 4,735 us/op 55%
traditionalHashcode 500 false 43,322 us/op
vectoredHashcode 500 false 20,698 us/op 49%
traditionalHashcode 1000 false 86,582 us/op
vectoredHashcode 1000 false 40,718 us/op 47%
traditionalHashcode 2000 false 173,190 us/op
vectoredHashcode 2000 false 80,286 us/op 46%
Quietly removed the bit-shifting version :)
Let's revisit the JDK version
# JDK 11.0.16, T14s AMD Gen 1
Benchmark (live) Mode Cnt Score Units Code Cache
CachingEffects.indexof true avgt 3 1,263,913 ns/op
CachingEffects.manual true avgt 3 1,150,578 ns/op 91%
CachingEffects.indexof false avgt 3 422,753 ns/op 33%
CachingEffects.manual false avgt 3 301,250 ns/op 71% 26%
# JDK 11.0.16, c3-highcpu-8
CachingEffects.indexof true avgt 3 1,122,492 ns/op
CachingEffects.manual true avgt 3 1,075,496 ns/op 95%
CachingEffects.indexof false avgt 3 357,894 ns/op 32%
CachingEffects.manual false avgt 3 303,531 ns/op 85% 28%
# JDK 17.0.6, T14s AMD Gen 1
CachingEffects.indexof true avgt 3 403,302 ns/op
CachingEffects.manual true avgt 3 1,160,649 ns/op 287%
CachingEffects.indexof false avgt 3 53,984 ns/op 13%
CachingEffects.manual false avgt 3 293,749 ns/op 542% 25%
# JDK 17.0.6, c3-highcpu-8
CachingEffects.indexof true avgt 3 271,683 ns/op
CachingEffects.manual true avgt 3 1,080,750 ns/op 398%
CachingEffects.indexof false avgt 3 58,124 ns/op 21%
CachingEffects.manual false avgt 3 310,592 ns/op 534% 29%
Never underestimate your own bias
If this wasn't challenging enough...
You have just seen a fraction of the factors that might or might not influence the correctness of your microbenchmarks. You don't have to understand them all, but you have to be aware that a microbenchmark is about more than CPU and time measurement.
Let's Tackle the Problem
What is your motivation?
If you are an OpenJDK contributor, feel free to ignore the above ;)
No idea? No benchmark!
Most performance problems are no longer about CPU usage but rather about not using the CPU.
Build your code
Stop thinking about the CPU!
Shape the future outcome
Data creation and mutation can be challenging
Wrong conclusions can be drawn in seconds with this trick
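One JMH knob for the data question, shown here as a hedged sketch (the class is illustrative): the Level of a @Setup method decides how often data is rebuilt, which matters as soon as the benchmarked code mutates its input.
import java.util.Arrays;
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class MutatingDataBenchmark {
    int[] data;

    // Fresh, unsorted data for every single call. With the default Level.Trial the
    // array would already be sorted after the first invocation and we would measure
    // sorting of sorted data. Level.Invocation has its own overhead, so it is only
    // appropriate when the workload is large compared to the setup cost.
    @Setup(Level.Invocation)
    public void setup() {
        data = new Random(42L).ints(10_000).toArray();
    }

    @Benchmark
    public int[] sort() {
        Arrays.sort(data); // mutates its input
        return data;
    }
}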
Functional correctness before performance
No unit tests, no microbenchmark!
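A hedged sketch of what that can mean for the array-copy example from earlier (JUnit 5 assumed, test class made up): assert that all benchmarked variants produce the same result before comparing their speed.
import static org.junit.jupiter.api.Assertions.assertArrayEquals;
import java.util.Arrays;
import java.util.Random;
import org.junit.jupiter.api.Test;

class ArrayCopyCorrectnessTest {
    @Test
    void allCopyVariantsProduceTheSameResult() {
        byte[] src = new byte[1000];
        new Random(7L).nextBytes(src);

        // the three variants from Example1ArrayCopying
        byte[] bySystem = new byte[src.length];
        System.arraycopy(src, 0, bySystem, 0, src.length);
        byte[] byArrays = Arrays.copyOf(src, src.length);
        byte[] byLoop = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            byLoop[i] = src[i];
        }

        assertArrayEquals(src, bySystem);
        assertArrayEquals(src, byArrays);
        assertArrayEquals(src, byLoop);
    }
}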
Never believe in your first results
Vary a little to feel confident
Luke, minimal disturbance to the force you must apply.
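One low-effort way to vary a little without touching the code (assuming the usual JMH uber-jar build): override @Param values and the fork count on the command line.
$ java -jar target/benchmarks.jar Example1ArrayCopying -p SIZE=100,1000,10000 -f 3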
Use the power of JMH
-prof gc
-prof perf/perfnorm
-prof perfasm
-prof comp
-prof cl
-prof jfr
-prof async
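For example (again assuming the uber-jar build; the perf profilers need Linux perf, async needs the async-profiler native library installed):
$ java -jar target/benchmarks.jar Example1ArrayCopying -prof gc
$ java -jar target/benchmarks.jar CachingEffects -prof perfnorm
$ java -jar target/benchmarks.jar BranchPrediction -prof perfasm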
You might have to google a lot.
Ask anyone but ChatGPT
One shall not claim success before someone else has said so.
Remember our experiments and more
A Summary of Things
Microbenchmarking is...
When and when not to do microbenchmarks
The JMH warning
REMEMBER: The numbers are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Just because you have seen it on my slides does not make it right!