Microbenchmarking

The Art of Realizing One is Wrong

View it live

For your own browsing and copying needs

Slides
https://t.ly/VtTPF

capital Victor, tango, capital Tango, capital Papa, capital Foxtrot

https://github.com/Xceptance/jmh-jmm-training

About René Schwietzke

  • Co-Founder and Managing Director
  • Master of Computer Science (in German: Dipl.-Inf.)
  • Programmer since 1989
  • Java since Java 1.0, 1996
  • QA and Testing since 1998
  • Performance Tester since 1999
  • @ReneSchwietzke
  • @reneschwietzke@foojay.social
  • #java #qa #test #performance #loadtest #quality

About

  • Founded 2004
  • Headquarters in Jena, Germany; Subsidiary in Cambridge, MA, USA
  • Specialized in Software Testing and Quality Assurance
  • Functional Testing, Load and Performance Testing, Test Automation
  • Test Management, Training, Process Consulting
  • More Than 80% of the Projects are about Commerce and/or Web
  • Performance Test Tool XLT, Java-based, Apache License 2.0
  • Neodymium for Test Automation, Java/WebDriver-based, MIT

License

I care and share


This presentation is licensed under
Creative Commons Attribution-ShareAlike 4.0 International License.

What's in it for you?

Why you should stick around

  • Get a feel for microbenchmarking
  • Learn about typical mistakes
  • Learn about hardware and JVM insides
  • Take home microbenchmarking awareness
  • See the good and the ugly
  • Get some car and racing analogies

Why this presentation?

What is my motivation?

  • A lot of nonsensical microbenchmark examples have turned up lately
  • Many don't dissect problems properly before going nuts on tuning
  • Microbenchmarking helps me understand the JVM and its hardware interactions better
  • Certain aspects are just extremely fun if you enjoy low-level things

If you want to drive fast, you have to understand how a car works. Some physics knowledge won't hurt either.

Boring Stuff First

Definitions and Guesses

What are benchmarks?

A definition first

To study (something, such as a competitor's product or business practices) in order to improve the performance of one's own company

A point of reference from which measurements may be made

A standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)

  • Racing a car around a track
  • Measuring the time it takes

What is a microbenchmark?

What is this micro thing and why is it micro?

A simplified benchmark that tests components in isolation.

It simplifies the process by abstracting the problem, removing complexity, and improving debuggability.

Components are usually quite small. Compare that to the unit test idea!

  • Just testing the engine or electric motor
  • No need to go out and test drive, just build a test bed and test 24x7

Let's guesstimate goals

What might be our goals?

  • How long does it take?
  • How much memory does it need?
  • Does it scale?
  • How does it work with larger data?
  • Is there any warmup?
  • Can I write that more concisely?
  • Is this really true?
  • Max revolutions
  • Max temperature
  • Power output and torque
  • Warmup time
  • Gas or power consumption
  • Vibrations
  • Exhaust gas composition
  • ...

Drawbacks

Side effects of simplification

  • Environment is different
  • Usage is simulated
  • Factors are removed which might be the cause
  • Unknown factors are eliminated too
  • Less interaction with other components
  • Development of a bias
  • Race track
  • Tires
  • Petrol
  • Driver
  • External forces
  • Vibrations
  • Temperatures

Get us some numbers

Some Experiments First

Let's Measure - Example 1

Always use System.arraycopy over manual copying


public static void main(String[] args)
{
    var length = 255;
    var src = new byte[length];
    var dest = new byte[length];

    var s1 = System.nanoTime();
    System.arraycopy(src, 0, dest, 0, length);
    var e1 = System.nanoTime();

    var s2 = System.nanoTime();
    for (int i = 0; i < length; i++)
    {
        dest[i] = src[i];
    }
    var e2 = System.nanoTime();

    System.out.printf("System Copy: %d ns%n", e1 - s1);
    System.out.printf("Manual Copy: %d ns%n", e2 - s2);
}

$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,685 ns
Manual Copy: 5,781 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,565 ns
Manual Copy: 4,609 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,545 ns
Manual Copy: 4,589 ns
  • All the proof we needed!

Let's Measure - Example 2

Use a sized StringBuilder all the time


public static void main(String[] args) {
    final String a = "This is a test";
    final String b = "More fun with a string";
    final long c = System.currentTimeMillis(); // e.g. 1684581249117
    var l = a.length() + b.length() + 13;

    var s3 = System.nanoTime();
    {
        var r = a + b + c;
    }
    var e3 = System.nanoTime();

    var s1 = System.nanoTime();
    {
        var sb = new StringBuilder();
        sb.append(a); sb.append(b); sb.append(c);
        var r = sb.toString();
    }
    var e1 = System.nanoTime();

    var s2 = System.nanoTime();
    {
        var sb = new StringBuilder(l);
        sb.append(a); sb.append(b); sb.append(c);
        var r = sb.toString();
    }
    var e2 = System.nanoTime();

    System.out.printf("Pure String usage : %d ns%n", e3 - s3);
    System.out.printf("StringBuilder     : %d ns%n", e1 - s1);
    System.out.printf("StringBuilder size: %d ns%n", e2 - s2);
}

$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 44,441,844 ns
StringBuilder     :     93,092 ns
StringBuilder size:      7,123 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 38,471,326 ns
StringBuilder     :     74,865 ns
StringBuilder size:      6,076 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 37,914,191 ns
StringBuilder     :     72,841 ns
StringBuilder size:      6,146 ns
  • Makes sense, matches common Internet knowledge
  • Sized builder is dramatically faster
  • Pure String operations are horrible

Let's Measure - Example 3

Streams are slow and parallel() is always better


public static void main(String[] args)
{
    var data = new long[10000];
    for (int i = 0; i < data.length; i++) {
        data[i] = i;
    }

    // manual
    var s1 = System.nanoTime(); {
        var total = 0.0d;
        for (int i = 0; i < data.length; i++) {
            total += calc(data[i]);
        }
        var r = total / data.length;
    }
    var e1 = System.nanoTime();

    // stream
    var s2 = System.nanoTime(); {
        var r = Arrays.stream(data).map(i -> calc(i)).average();
    }
    var e2 = System.nanoTime();

    // stream parallel
    var s3 = System.nanoTime(); {
        var r = Arrays.stream(data).parallel().map(i -> calc(i)).average();
    }
    var e3 = System.nanoTime();
    ...
    }

    private static long calc(long c) {
        var t = c;
        for (long i = 100; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        return t;
    }
}

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  5,832,700 ns
Stream         : 23,259,444 ns
Parallel stream: 10,246,827 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  5,834,677 ns
Stream         : 20,817,603 ns
Parallel stream: 10,223,933 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  6,054,650 ns
Stream         : 20,718,407 ns
Parallel stream: 10,415,500 ns
  • Streams are bad
  • Parallel streams are obviously better
  • But why are the results not the same all the time?

Thesis - What might be wrong

Forget everything you just saw

  • Of course, all tests before are garbage!
  • But they confirmed the Internet truth!!!
  • So, what could be wrong with these?
  • No warm up
  • Always a Le Mans start
  • One round only?
  • Just time measurement?
  • Make and model
  • Race track
  • Weight of car

Goals

What do we really want?

  • Repeatability - Easily repeat the test
  • Predictability - Same result when nothing has changed
  • Changeability - Easily change things (object under test)
  • Controllability - Control parameters and vary them (environment)
  • Debuggability - Get additional insights

Our true goal: We want to learn and understand by changing details under controlled conditions. Turn a knob a bit and see what happens. We want to derive reasoning or even proof from our changes.


The right tool

Don't ask what you can do, ask others what they have done (not JFK)!

  • You can benchmark things with a runtime of 30 s or more naively and still be mostly OK-ish
  • Don't spend 2 days on writing a framework you will use for 2 hours, rather spend 2 hours on the setup and benchmark for 2 days
  • Don't reinvent the wheel, because... we will soon see why
  • Use a proven tool chain that helps to mitigate a lot of, yet unknown, problems
  • JMH - Java Microbenchmark Harness to the rescue

JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.

  • Built to address shortcomings of naive benchmarks
  • Tested and proven by a large community
  • Used for the JDK
  • Comes with built-in profiling tools
  • Compact
  • Open Source
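
For orientation, a minimal JMH benchmark could look like the following sketch (class name, data, and measurement settings are illustrative, not taken from these slides); the annotations are the same ones used in the examples that follow.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class MyFirstBenchmark
{
    String a, b;

    @Setup
    public void setup()
    {
        // data is prepared once, outside of the measured code path
        a = "This is a test";
        b = "More fun with a string";
    }

    @Benchmark
    public String concat()
    {
        // return the result so the JIT cannot drop the work as dead code
        return a + b;
    }
}

Built with the standard JMH Maven archetype, such a class typically runs via java -jar target/benchmarks.jar MyFirstBenchmark.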

The Challenges

Some Factors by Example

Time Measurement

Whom do you ask for time?

  • JDK, it asks the OS, which asks the hardware
  • APIs: System.nanoTime() and System.currentTimeMillis()
  • Whoops! We cannot measure less than 1 ns?
  • Imagine: 3 GHz CPU - 0.33 ns per clock cycle, 6 instructions per cycle
  • A lot can/could be done in 1 ns!
  • But we cannot measure 1 ns! Why...?
  • Your clock is... code

Time Measurement

We can measure nanoseconds, can't we?


@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(2)
@State(Scope.Thread)
public class TimerResolution
{
    private long lastValue;

    @Benchmark
    public long latency_nanotime()
    {
        return System.nanoTime();
    }

    public static void precision(long[] t)
    {
        // store back-to-back samples to see the effective timer resolution
        for (int i = 0; i < t.length; i++)
        {
            t[i] = System.nanoTime();
        }
    }
}

Benchmark        Cnt  Score   Error  Units
latency_nanotime  10  25.519 ± 0.985  ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, TSC
372942679845594
372942679845635
372942679845665
372942679845705
372942679845745
372942679845775

Difference between calls 30 to 41 ns
  • Solution: Measure less often but execute the code under test often. Calculate the runtime, don't measure it (see the sketch below).
  • Can't measure seconds?
    One lap takes 2 min 15 s; drive 100 laps instead: 225 min total
    Result: 00:02:15 per lap
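
Here is a minimal sketch of that idea outside of any harness (class and method names, iteration count, and the workload are made up): time one large batch and divide, so the per-call time is calculated rather than measured directly.

public class CalculatedTiming
{
    // hypothetical workload standing in for the code under test
    static long someWork(long x)
    {
        return x * 31 + 7;
    }

    public static void main(String[] args)
    {
        final int iterations = 1_000_000;

        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
        {
            sum += someWork(i);
        }
        long end = System.nanoTime();

        // the per-call time is calculated from the batch, not measured per call
        double nsPerCall = (end - start) / (double) iterations;
        System.out.printf("~%.2f ns per call (checksum %d)%n", nsPerCall, sum);
    }
}

This naive batch still suffers from all the other problems discussed later (no warmup, dead-code risks, and so on); JMH does this batching and the division for you.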

OS - Clock Source

Not every clock is created equal

  • RTC Real Time Clock
  • PIT Programmable Interval Timer
  • HPET High Performance Event Timer
  • ACPI PM ACPI Power Management Timer
  • APIC Advanced Programmable Interrupt Controller
  • TSC Time Stamp Counter

The x86 Time Stamp Counter (TSC) is a low-latency, monotonically increasing clock used to measure event durations.


Benchmark         Mode  Cnt      Score    Error  Units
latency_nanotime  avgt   10  1,128.204 ± 32.831  ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, HPET
374172181482681
374172181484078
374172181484986
374172181485964
374172181487291
374172181488199
Difference between calls 900 to 1,400 ns

Benchmark         Mode  Cnt   Score   Error  Units
latency_nanotime  avgt   10  25.519 ± 0.985  ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, TSC
495719354555374
495719354555394
495719354555424
Difference between calls 20 to 30 ns

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc hpet acpi_pm

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

$ echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource

Runtime Environment - JDK

Java evolves and so does performance


public class Example1ArrayCopying {
    @Param({"1000"}) int SIZE;
    byte[] src;

    @Setup
    public void setup() {
        src = new byte[SIZE];
        final Random r = new Random(7L);
        r.nextBytes(src);
    }

    @Benchmark
    public byte[] systemCopy() {
        var dest = new byte[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);

        return dest;
    }

    @Benchmark
    public byte[] arrayCopy() {
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public byte[] manualCopy() {
        var dest = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }

        return dest;
    }
}

# JDK 11
Benchmark                       (SIZE) Mode Cnt    Score     Error Units
Example1ArrayCopying.manualCopy   1000 avgt   3  152.889 ±  32.305 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  130.963 ± 119.643 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  128.726 ±  83.837 ns/op

# JDK 17
Example1ArrayCopying.manualCopy   1000 avgt   3  153.508 ±   7.987 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  124.021 ±  35.199 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  121.271 ±  35.723 ns/op

# JDK 19
Example1ArrayCopying.manualCopy   1000 avgt   3  149.298 ±  79.705 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  129.439 ± 104.106 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  133.358 ±  40.304 ns/op

# JDK 20
Example1ArrayCopying.manualCopy   1000 avgt   3  152.070 ±   7.096 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  119.613 ±   9.873 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  117.901 ±   8.322 ns/op

# Graal 22.3.r17 - Whoops!
Example1ArrayCopying.manualCopy   1000 avgt   3  335.612 ±  22.000 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  150.035 ±   4.466 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  149.176 ±   4.496 ns/op

Lifecycle and Frequency

When, where, and how often is it called - the JIT at work

  • When is the code in question executed?
    • Startup, shutdown, normally, on error
  • How often is the code executed?
    • Once, a few times, like hell
  • How is it called?
    • From a single place, multiple places, in a loop, conditional or unconditional
  • Loading of code
  • JIT compiler and its stages
  • Data cache
  • Instruction cache
  • Inlining
  • Loop unrolling
  • TLB - translation cache of virtual to physical memory

Lifecycle and Frequency


public class Streams {
    final int SIZE = 1024;
    final int[] integers = new int[SIZE];

    @Setup
    public void setup() {
        for (int i = 0; i < integers.length; i++) {
            integers[i] = i - SIZE;
        }
    }

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int lambdaArrayCold() {
        return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
    }
    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayWarm() {...}

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayHot() {...}

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int loopCold() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = integers[i];
            if (v % 2 == 0) {
                sum += v;
            }
        }
        return sum;
    }

    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int loopWarm() {...}

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int loopHot() {...}
}

Benchmark                Mode  Cnt    Score       Error  Units
Streams.lambdaArrayCold  avgt    5  688,797 ± 4,665,070  ns/op
Streams.lambdaArrayWarm  avgt    5   12,797 ±    59,453  ns/op
Streams.lambdaArrayHot   avgt    5    1,113 ±     2,602  ns/op

Streams.loopCold         avgt    5   52,936 ±    42,276  ns/op
Streams.loopWarm         avgt    5    2,030 ±     2,527  ns/op
Streams.loopHot          avgt    5    1,422 ±       733  ns/op

Streams.lambdaArrayCold
Iteration   1: 2,855,843.000 ns/op
Iteration   2:   172,353.000 ns/op
Iteration   3:   139,810.000 ns/op
Iteration   4:   139,670.000 ns/op
Iteration   5:   136,313.000 ns/op

Streams.lambdaArrayWarm
Iteration   1:    40,097.000 ns/op
Iteration   2:     4,712.000 ns/op
Iteration   3:     9,992.667 ns/op
Iteration   4:     4,145.143 ns/op
Iteration   5:     5,039.667 ns/op

Streams.loopCold
Iteration   1:    44,510.286 ns/op
Iteration   2:    61,584.000 ns/op
Iteration   3:    67,831.000 ns/op
Iteration   4:    45,749.000 ns/op
Iteration   5:    45,007.000 ns/op

Lifecycle and Frequency - Real Work


@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class HeavyStreams {
    // as seen before

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int lambdaArrayCold() {
        return Arrays.stream(integers)
            .filter(i -> i.mod(BigInteger.TWO).equals(BigInteger.ZERO))
            .map(i -> i.sqrt())
            .mapToInt(BigInteger::intValue)
            .sum();
    }

    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayWarm() { ... }

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayHot() { ... }

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int loopCold() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = integers[i];

            if (v.mod(BigInteger.TWO).equals(BigInteger.ZERO)) {
                sum += v.sqrt().intValue();
            }
        }

        return sum;
    }

    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int loopWarm()  { ... }

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int loopHot() { ... }
}

# ms now!!!
Benchmark                     Mode  Cnt Score   Error  Units
HeavyStreams.lambdaArrayCold  avgt    5   362 ±   113  ms/op
HeavyStreams.lambdaArrayWarm  avgt    5   297 ±   177  ms/op
HeavyStreams.lambdaArrayHot   avgt    5   264 ±    12  ms/op

HeavyStreams.loopCold         avgt    5   363 ±   100  ms/op
HeavyStreams.loopWarm         avgt    5   302 ±   183  ms/op
HeavyStreams.loopHot          avgt    5   271 ±    16  ms/op

HeavyStreams.lambdaArrayCold
Iteration   1: 409 ms/op
Iteration   2: 372 ms/op
Iteration   3: 347 ms/op
Iteration   4: 342 ms/op
Iteration   5: 339 ms/op

HeavyStreams.lambdaArrayWarm
Iteration   1: 263 ms/op
Iteration   2: 270 ms/op
Iteration   3: 264 ms/op
Iteration   4: 263 ms/op
Iteration   5: 263 ms/op

HeavyStreams.loopCold
Iteration   1: 401 ms/op
Iteration   2: 376 ms/op
Iteration   3: 352 ms/op
Iteration   4: 342 ms/op
Iteration   5: 340 ms/op
  • Overhead almost disappeared

Code Removal

It's dead, Jim!


public class DeadCode {
    private double computeSideEffect(int n) {
        return BigInteger.valueOf(n).pow(7).sqrt().doubleValue();
    }

    private double fib(int n) {
        int fib = 1;
        int prevFib = 1;

        for (int i = 2; i < n; i++) {
            int temp = fib;
            fib += prevFib;
            prevFib = temp;
        }
        return fib;
    }

    @Benchmark
    public double baseline() {
        return 5.0d;
    }

    @Benchmark
    public void computeWrongSE() {
        computeSideEffect(2000);
    }

    @Benchmark
    public double computeRightSE() {
        return computeSideEffect(2000);
    }

    @Benchmark
    public void computeWrong() {
        fib(2000);
    }

    @Benchmark
    public double computeRight() {
        return fib(2000);
    }
}

# JDK 11
Benchmark                Mode  Cnt     Score   Error  Units
DeadCode.baseline        avgt    3       3.4 ±   0.2  ns/op

# JDK 17
Benchmark                Mode  Cnt     Score   Error  Units
DeadCode.baseline        avgt    3       0.4 ±   0.1  ns/op

DeadCode.computeWrong    avgt    3      16.1 ±   1.6  ns/op
DeadCode.computeRight    avgt    3     479.2 ±  20.2  ns/op

DeadCode.computeWrongSE  avgt    3    1616.7 ± 434.4  ns/op
DeadCode.computeRightSE  avgt    3    1713.4 ± 876.9  ns/op
  • Empty method is not even called, it seems
  • computeWrong is not really dropped
  • compute with objects behaves unexpectedly
  • computeWrong fib(1000) yields 8.8 ns - ???
  • computeWrong fib(100) yields 0.4 ns - ???
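
A common remedy, shown here as a hedged sketch only (the class name is made up): always hand results back to JMH, either by returning them, as computeRight does, or by sinking them into a Blackhole.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class DeadCodeFixed
{
    @Benchmark
    public double computeReturned()
    {
        // returning the value lets JMH consume it implicitly
        return fib(2000);
    }

    @Benchmark
    public void computeConsumed(Blackhole bh)
    {
        // explicitly sink the result so it cannot be eliminated as dead code
        bh.consume(fib(2000));
    }

    // same Fibonacci loop as in DeadCode above
    private double fib(int n)
    {
        int fib = 1;
        int prevFib = 1;
        for (int i = 2; i < n; i++)
        {
            int temp = fib;
            fib += prevFib;
            prevFib = temp;
        }
        return fib;
    }
}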

Complex Example


public class CachingEffects {
    @Param({"true", "false"})
    boolean live;

    int SIZE = 10000;
    int LOAD = 1000000;

    List<String> work;
    List<String> load;

    @Setup
    public void setup()     {
        load = new ArrayList<>(LOAD);

        var s = new StringBuilder("ß09876512[more here]");
        s.append(System.currentTimeMillis());
        s.append('$');

        for (int i = 0; i < LOAD; i++) {
            // align the copies next to each other
            load.add(s.toString());
        }

        work = new ArrayList<>(SIZE);
        if (live) {
            // shuffle the data, with fixed outcome
            Collections.shuffle(load, new Random(42L));
        }

        // copy us the data, when live, we get a random pattern, when
        // not live, we get basically data that is laid out next to each other
        for (int i = 0; i < SIZE; i++) {
            work.add(load.get(i));
        }
    }
    ...
}

public class CachingEffects
{
    ...

    /**
     * Manually search for a character pos
     */
    @Benchmark
    public int manual() {
        var n = 0;

        for (int i = 0; i < work.size(); i++) {
            var s = work.get(i);

            for (int index = 0; index < s.length(); index++) {
                if (s.charAt(index) == '$') {
                    n += index;
                    break;
                }
            }
        }
        return n;
    }

    /**
     * Search via indexOf for the char position
     */
    @Benchmark
    public int indexof() {
        var n = 0;

        for (int i = 0; i < work.size(); i++) {
            n += work.get(i).indexOf('$');
        }
        return n;
    }
}

Caches

The hidden memory that speeds things up

  • CPUs feature multiple layers of data and instruction caches
  • Caches are per core, socket, and global
  • You cannot control the caches easily
  • Your benchmark likely betrays you here

# JDK 11.0.17 - Aligned Data
Benchmark              (live) Mode Cnt    Score       Error  Units Diff
CachingEffects.indexof  false avgt   3    435,184 ±  54,444  ns/op
CachingEffects.manual   false avgt   3    318,798 ± 166,076  ns/op  73%

# JDK 11.0.17 - More Random Data
Benchmark              (live) Mode Cnt      Score     Error  Units Diff
CachingEffects.indexof   true avgt   3  1,420,598 ± 549,023  ns/op
CachingEffects.manual    true avgt   3  1,389,879 ± 342,328  ns/op  98%

# JDK 11.0.17 - Single Executions
Benchmark               (live)  Mode  Cnt     Score  Units
CachingEffects.indexof   false  avgt    1    562,553 ns/op
CachingEffects.indexof   false  avgt    1    456,493 ns/op
CachingEffects.indexof   false  avgt    1    469,517 ns/op

CachingEffects.manual    false  avgt    1    313,673 ns/op
CachingEffects.manual    false  avgt    1    677,170 ns/op
CachingEffects.manual    false  avgt    1    309,094 ns/op

CachingEffects.indexof    true  avgt    1  1,637,546 ns/op
CachingEffects.indexof    true  avgt    1  2,477,774 ns/op
CachingEffects.indexof    true  avgt    1  2,604,464 ns/op

CachingEffects.manual     true  avgt    1  1,560,230 ns/op
CachingEffects.manual     true  avgt    1  2,067,318 ns/op
CachingEffects.manual     true  avgt    1  1,888,360 ns/op

Detour: Cost

What it costs to leave the CPU

  Operation            Real       Humanized
  CPU Cycle            0.4 ns     1 s
  L1 Cache             0.9 ns     2 s
  L2 Cache             2.8 ns     7 s
  L3 Cache             28 ns      1 min
  Memory Access        100 ns     4 min
  NVM SSD              25 μs      17 h
  SSD                  50–150 μs  1.5–4 days
  HDD                  1–10 ms    1–9 months
  TCP SF to NY         65 ms      5 years
  TCP SF to Hong Kong  141 ms     11 years
  • It costs 100 to 200 cycles when we have to go to main memory
  • Hence Java and the CPU avoid main memory access at all cost
  • Java does not have the notion of a memory location, hence this is transparent

CPU and Memory

CPU design, usage, and speed make things go fuzzy

  • CPU Frequencies
  • Power management
  • Design (cache size, instruction set, pipeline)
  • Core types (real vs. hyper/virtual)
  • Memory speed
  • CPU and JDK: intrinsics
  • CPU and JDK: defaults
  • ...and much more

# Laptop Busy, JDK 11
Benchmark               (live)  Mode  Cnt      Score Units  Code Cache
CachingEffects.indexof    true  avgt    3  1,572,759 ns/op
CachingEffects.manual     true  avgt    3  1,413,919 ns/op   89%
CachingEffects.indexof   false  avgt    3    377,649 ns/op         24%
CachingEffects.manual    false  avgt    3    294,783 ns/op   78%   21%

# Laptop Less Busy, JDK 11
CachingEffects.indexof    true  avgt    3  1,263,913 ns/op
CachingEffects.manual     true  avgt    3  1,150,578 ns/op   91%
CachingEffects.indexof   false  avgt    3    422,753 ns/op         33%
CachingEffects.manual    false  avgt    3    301,250 ns/op   71%   26%

# Turbo off, JDK 11
CachingEffects.indexof    true  avgt    3  2,191,820 ns/op
CachingEffects.manual     true  avgt    3  1,928,017 ns/op   88%
CachingEffects.indexof   false  avgt    3    858,436 ns/op         39%
CachingEffects.manual    false  avgt    3    687,400 ns/op   80%   36%

# Google c2-standard-8, Intel, JDK 11
CachingEffects.indexof    true  avgt    3  1,064,037 ns/op
CachingEffects.manual     true  avgt    3    908,363 ns/op   85%
CachingEffects.indexof   false  avgt    3    509,849 ns/op         47%
CachingEffects.manual    false  avgt    3    476,487 ns/op   93%   52%

# Google c2d-standard-8, AMDs, JDK 11
CachingEffects.indexof    true  avgt    3    707,885 ns/op
CachingEffects.manual     true  avgt    3    599,987 ns/op   84%
CachingEffects.indexof   false  avgt    3    375,242 ns/op         53%
CachingEffects.manual    false  avgt    3    261,686 ns/op   70%   44%

# Google c3-highcpu-8, Intel, JDK 11
CachingEffects.indexof    true  avgt    3  1,122,492 ns/op
CachingEffects.manual     true  avgt    3  1,075,496 ns/op   96%
CachingEffects.indexof   false  avgt    3    357,894 ns/op         32%
CachingEffects.manual    false  avgt    3    303,531 ns/op   85%   28%

JDK again

Let's revisit the JDK version


# JDK 11.0.16, T14s AMD Gen 1
Benchmark               (live)  Mode Cnt     Score Units Code Cache
CachingEffects.indexof    true  avgt   3 1,263,913 ns/op
CachingEffects.manual     true  avgt   3 1,150,578 ns/op  91%
CachingEffects.indexof   false  avgt   3   422,753 ns/op        33%
CachingEffects.manual    false  avgt   3   301,250 ns/op  71%   26%

# JDK 11.0.16, c3-highcpu-8
CachingEffects.indexof    true  avgt   3 1,122,492 ns/op
CachingEffects.manual     true  avgt   3 1,075,496 ns/op  95%
CachingEffects.indexof   false  avgt   3   357,894 ns/op        32%
CachingEffects.manual    false  avgt   3   303,531 ns/op  85%   28%

# JDK 17.0.6, T14s AMD Gen 1
CachingEffects.indexof    true  avgt   3   403,302 ns/op
CachingEffects.manual     true  avgt   3 1,160,649 ns/op 287%
CachingEffects.indexof   false  avgt   3    53,984 ns/op        13%
CachingEffects.manual    false  avgt   3   293,749 ns/op 542%   25%

# JDK 17.0.6, c3-highcpu-8
CachingEffects.indexof    true  avgt   3   271,683 ns/op
CachingEffects.manual     true  avgt   3 1,080,750 ns/op 398%
CachingEffects.indexof   false  avgt   3    58,124 ns/op        21%
CachingEffects.manual    false  avgt   3   310,592 ns/op 534%   29%
  • Newer versions might surprise you
  • Stay recent and you can roll back some changes
  • Platform changes per JDK are possible too
  • Even when you cannot migrate, try a newer JDK anyway; maybe you can skip a planned change or use the numbers to motivate your boss
  • Remember, Git is your friend and proper commit messages are a lifesaver!

CPU - Branch Prediction

The unknown fortune teller

  • CPU has a pipeline it wants to fill
  • Instructions per cycle: 4-8
  • CPU executes code out of order
  • When you branch, it tries to go the likely route ahead of time
  • When the guess was wrong, it becomes expensive
  • JIT analyses frequency and reorders conditions too

public class BranchPrediction {
    private static final int COUNT = 10000;

    private int[] sorted = new int[COUNT];
    private int[] unsorted = new int[COUNT];
    private int[] pattern2 = new int[COUNT];
    private int[] pattern5 = new int[COUNT];
    private int[] shuffledPattern = new int[COUNT];

    @Setup
    public void setup() {
        final Random r = new Random(13241L);

        // using int here to avoid problems with memory and caches aka
        // only inline data and hence same caching behavior
        for (int i = 0; i < COUNT; i++) {
        	final int d = r.nextInt(100_000_000) + 1;

            sorted[i] = d; unsorted[i] = d;
            pattern2[i] = i % 2 == 0 ? 100_000_000 : r.nextInt(50_000_000);
            pattern5[i] = i % 5 == 0 ? 100_000_000 : r.nextInt(50_000_000);
            shuffledPattern[i] = pattern2[i];
        }
        Arrays.sort(sorted);

        // make it noisy to break the branch predictor
        for (int i = 0; i < shuffledPattern.length; i++) {
            // shuffle here
        }
    }

    public long doIt(int[] array) {
    	long sum = 2;
        for (int v : array) {
            if (v > 50_000_000) {
                sum += (v >> 2);
            }
            else {
                sum += (v << 2);
            }
        }
        return sum;
    }
    ...
}
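
The elided "shuffle here" loop is not shown on the slide; an assumed implementation could be a simple Fisher-Yates style swap with a fixed seed, so the noisy data stays reproducible between runs.

// Assumption: how the elided shuffle might look; not the original code
private static void shuffle(int[] array, long seed)
{
    final Random rnd = new Random(seed);
    for (int i = array.length - 1; i > 0; i--)
    {
        int j = rnd.nextInt(i + 1);
        int tmp = array[i];
        array[i] = array[j];
        array[j] = tmp;
    }
}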

Branch Prediction

Your data sorting becomes the driver of the results


# JDK 11
Benchmark                        Mode C  Score Units
BranchPrediction.unsorted        avgt 4 13,616 ns/op
BranchPrediction.shuffledPattern avgt 4 12,842 ns/op
BranchPrediction.sorted          avgt 4  4,764 ns/op
BranchPrediction.pattern2        avgt 4  3,903 ns/op
BranchPrediction.pattern5        avgt 4  4,317 ns/op

# JDK 17
BranchPrediction.unsorted        avgt 4 13,857 ns/op
BranchPrediction.shuffledPattern avgt 4 12,873 ns/op
BranchPrediction.sorted          avgt 4  3,990 ns/op
BranchPrediction.pattern2        avgt 4  3,908 ns/op
BranchPrediction.pattern5        avgt 4  4,715 ns/op

# JDK 21
BranchPrediction.unsorted        avgt 4 15,330 ns/op
BranchPrediction.shuffledPattern avgt 4 15,003 ns/op
BranchPrediction.sorted          avgt 4  4,826 ns/op
BranchPrediction.pattern2        avgt 4  4,573 ns/op
BranchPrediction.pattern5        avgt 4  4,505 ns/op

# JDK 11, -XX:-BlockLayoutByFrequency
Benchmark                         Mode C  Score Units
BranchPrediction.unsorted        avgt 4  9,145 ns/op
BranchPrediction.shuffledPattern avgt 4  9,086 ns/op
BranchPrediction.sorted          avgt 4  4,226 ns/op
BranchPrediction.pattern2        avgt 4  5,632 ns/op
BranchPrediction.pattern5        avgt 4  4,857 ns/op

# JDK 17, -XX:-BlockLayoutByFrequency
BranchPrediction.unsorted        avgt 4  9,021 ns/op
BranchPrediction.shuffledPattern avgt 4  8,927 ns/op
BranchPrediction.sorted          avgt 4  4,207 ns/op
BranchPrediction.pattern2        avgt 4  5,569 ns/op
BranchPrediction.pattern5        avgt 4  4,669 ns/op

# JDK 21, -XX:-BlockLayoutByFrequency
BranchPrediction.unsorted        avgt 4  8,991 ns/op
BranchPrediction.shuffledPattern avgt 4  9,038 ns/op
BranchPrediction.sorted          avgt 4  4,431 ns/op
BranchPrediction.pattern2        avgt 4  5,537 ns/op
BranchPrediction.pattern5        avgt 4  4,914 ns/op

Data

Static and little data is not enough

  • Data is one key driver of valid benchmarks
  • Not every String has a length of just 42 characters
  • Data usually has these dimensions:
    • Size
    • Content
    • Variance
  • Static data is good for repeatability and predictability
  • Live-like data is great for the final validation

public class HashingDataDimensions {
    @Param({"10", "50", "100", "500", "1000", "2000"}) int CHARARRAY_SIZE;
    @Param({"true", "false"}) boolean RANDOM;
    final static int MAX = 100_000;
    List<char[]> data, holder;
    final Random r = new Random(); // used for random access below; declaration assumed, not shown on the slide

    @Setup
    public void setup() {
        data = new ArrayList<char[]>(MAX);
        holder = new ArrayList<char[]>(MAX);

        for (int i = 0; i < MAX; i++) {
            // the data to work with
            var a = random(CHARARRAY_SIZE);
            data.add(a);
            holder.add(a);

            if (!RANDOM) {
                // point every slot at the same char[]; the access pattern stays the same,
                // so the non-RANDOM test runs the same code path but always hits the same data
                data.set(data.size() - 1, data.get(0));
            }
        }
    }

    @Benchmark
    public void traditionalHashcode(Blackhole b) {
        for (int i = 0; i < data.size(); i++) {
            b.consume(hashCodeClassic(data.get(r.nextInt(data.size()))));
        }
    }
    ...
    public static int hashCodeClassic(char[] src) {
        final int last = src.length;

        int h = 0;
        for (int i = 0; i < last; i++) {
            h = 31 * h + src[i];
        }

        return h;
    }

    public static int hashCodeVectoredJDK19(char[] src) {
        int h = 0;
        int i = 0;
        int l, l2;
        l = l2 = src.length;
        l = l & ~(8 - 1);

        for (; i < l; i += 8) {
            h = -1807454463 * h +
                            1742810335 * src[i + 0] +
                            887503681 * src[i + 1] +
                            28629151 * src[i + 2] +
                            923521 * src[i + 3] +
                            29791 * src[i + 4] +
                            961 * src[i + 5] +
                            31 * src[i + 6] +
                            1 * src[i + 7];
        }

        for (; i < l2; i++) {
            h = 31 * h + src[i];
        }

        return h;
    }


    public static int hashCodeNoMul(char[] src) {
        final int last = src.length;

        int h = 0;
        for (int i = 0; i < last; i++) {
            h = ((h << 5) - h) + src[i];
        }

        return h;
    }
}

Data


Benchmark           (CHARARRAY_SIZE)  (RANDOM)    Score Units    %
traditionalHashcode               10      true    2,354 us/op
traditionalHashcode               10     false    1,424 us/op  60%

traditionalHashcode               50      true    6,905 us/op
traditionalHashcode               50     false    4,255 us/op  61%

traditionalHashcode              100      true   12,590 us/op
traditionalHashcode              100     false    8,651 us/op  68%

traditionalHashcode              500      true   58,176 us/op
traditionalHashcode              500     false   43,322 us/op  74%

traditionalHashcode             1000      true  103,746 us/op
traditionalHashcode             1000     false   86,582 us/op  83%

traditionalHashcode             2000      true  194,827 us/op
traditionalHashcode             2000     false  173,190 us/op  88%

vectoredHashcode                  10      true    2,701 us/op
vectoredHashcode                  10     false    1,682 us/op  62%

vectoredHashcode                  50      true    5,893 us/op
vectoredHashcode                  50     false    2,919 us/op  50%

vectoredHashcode                 100      true    8,593 us/op
vectoredHashcode                 100     false    4,735 us/op  55%

vectoredHashcode                 500      true   40,104 us/op
vectoredHashcode                 500     false   20,698 us/op  51%

vectoredHashcode                1000      true   63,079 us/op
vectoredHashcode                1000     false   40,718 us/op  64%

vectoredHashcode                2000      true  106,395 us/op
vectoredHashcode                2000     false   80,286 us/op  75%

Benchmark           (CHARARRAY_SIZE)  (RANDOM)    Score Units    %
traditionalHashcode               10      true    2,354 us/op
vectoredHashcode                  10      true    2,701 us/op 114%

traditionalHashcode               50      true    6,905 us/op
vectoredHashcode                  50      true    5,893 us/op  85%

traditionalHashcode              100      true   12,590 us/op
vectoredHashcode                 100      true    8,593 us/op  68%

traditionalHashcode              500      true   58,176 us/op
vectoredHashcode                 500      true   40,104 us/op  69%

traditionalHashcode             1000      true  103,746 us/op
vectoredHashcode                1000      true   63,079 us/op  61%

traditionalHashcode             2000      true  194,827 us/op
vectoredHashcode                2000      true  106,395 us/op  55%

traditionalHashcode               10     false    1,424 us/op
vectoredHashcode                  10     false    1,682 us/op 118%

traditionalHashcode               50     false    4,255 us/op
vectoredHashcode                  50     false    2,919 us/op  69%

traditionalHashcode              100     false    8,651 us/op
vectoredHashcode                 100     false    4,735 us/op  55%

traditionalHashcode              500     false   43,322 us/op
vectoredHashcode                 500     false   20,698 us/op  49%

traditionalHashcode             1000     false   86,582 us/op
vectoredHashcode                1000     false   40,718 us/op  47%

traditionalHashcode             2000     false  173,190 us/op
vectoredHashcode                2000     false   80,286 us/op  46%

The Human Factor

Never underestimate your own bias

  • If you want to see something, you will see it.
  • Changes have the desired effect, but the effect has a different reason
  • Common myths often explain it... incorrectly
  • Desperate tuning is bad tuning
  • A self-fulfilling prophecy is a real thing

Bad news

If this wasn't challenging enough...

You just saw a fraction of all factors that might or might not influence the correctness of your microbenchmarks.

You don't have to understand them all, but be aware that a microbenchmark is more than just CPU and time.

Approaching Microbenchmarks

Let's Tackle the Problem

Got a Problem?

What is your motivation?

Real Problem

  • Production problem or oddity
  • You have an idea or suspicion

Made Up Problem

  • Colleague told you
  • Read it on Medium
  • Told to make it scale
  • Told to cut cost in half

Theory

No idea? No benchmark!

  • Develop a sound theory first
  • You might need a benchmark first!
  • Consider:
    • Code - algorithms
    • Memory - usage (allocation, freeing, access)
    • Cache - locality of memory usage
    • I/O - external dependencies
    • Scale - synchronisation and CPU usage

Test Case

Build your code

  • Isolate your problem
  • Avoid rewriting the code, try to use the API
  • Don't remove code only because it doesn't seem important
  • Isolate data preparation from execution
  • JMH prepares data once by default
  • Concurrency tests are hard to write
  • Don't apply your ideas or theories to the test case
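
To illustrate "isolate data preparation from execution": JMH runs @Setup methods outside the measured region, by default once per trial. A minimal sketch with illustrative names and values:

import java.util.Random;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class IsolatedSetup
{
    int[] data;

    // Level.Trial is the default: executed once per fork, never measured
    @Setup(Level.Trial)
    public void prepare()
    {
        data = new Random(42L).ints(10_000).toArray();
    }

    @Benchmark
    public long sum()
    {
        long total = 0;
        for (int v : data)
        {
            total += v;
        }
        return total;
    }
}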

Data

Shape the future outcome

  • Shape your test data according to your production data
  • No production data? Think dimensions:
    • Amount - How many strings?
    • Size - How long is a string?
    • Structure - Where is the challenge in the string?
    • Layout - Holders and copies
    • Variance - Sizes, structures, and layout
  • Your data might be the problem you are looking for
  • Think unit testing and data
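
One way to make these dimensions explicit is to turn them into @Param values, so the benchmark sweeps amount, size, and variance instead of hard-coding a single shape; a sketch with made-up names and values:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class DataShape
{
    @Param({"1000", "100000"}) int AMOUNT;     // how many strings
    @Param({"10", "100", "1000"}) int LENGTH;  // how long a string is
    @Param({"true", "false"}) boolean VARIED;  // fixed vs. varying lengths

    List<String> data;

    @Setup
    public void setup()
    {
        var r = new Random(42L);
        data = new ArrayList<>(AMOUNT);
        for (int i = 0; i < AMOUNT; i++)
        {
            int length = VARIED ? 1 + r.nextInt(LENGTH) : LENGTH;
            data.add("x".repeat(length));
        }
    }

    @Benchmark
    public int totalLength()
    {
        int n = 0;
        for (String s : data)
        {
            n += s.length();
        }
        return n;
    }
}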

Execution Profile

Wrong conclusions can be drawn in seconds with this trick

  • How often is your code executed?
    • Only at startup (how often)
    • Occasionally
    • Always
    • On error or on a rare path
  • Where is your code used?
    • Might get inlined
    • Branch prediction and out of order

Unit Testing

Functional correctness before performance

  • Your first duty is to correctness!
  • Try to reuse your regular tests
  • Validate every change to avoid false positives
  • Don't unit-test memory, cache, or performance things
  • Ensure proper coverage, performance tuning often removes edge-cases
  • Try to use the same data for testing and benchmarking
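
A hedged sketch of such a functional check (JUnit 5 and the same package assumed), reusing the two hash implementations from the hashing benchmark shown earlier:

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Random;

import org.junit.jupiter.api.Test;

public class HashCodeEquivalenceTest
{
    @Test
    public void tunedVersionMatchesClassic()
    {
        var r = new Random(42L);
        for (int length = 0; length < 256; length++)
        {
            var data = new char[length];
            for (int i = 0; i < length; i++)
            {
                data[i] = (char) r.nextInt(Character.MAX_VALUE);
            }

            // the tuned version must produce the same hash as the original
            assertEquals(
                HashingDataDimensions.hashCodeClassic(data),
                HashingDataDimensions.hashCodeVectoredJDK19(data));
        }
    }
}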

Question the Results

Never believe in your first results

  • Is it your expected outcome? Question it.
  • It's not what you expected? Question it.
  • You have not had any expectations? Houston, we have a problem.
  • Start validating by experimenting around
  • Vary the hell out of code and data before any tuning
  • Think of edge-cases

Verification by Variation

Vary a little to feel confident

  • Always keep the original state running
  • Vary a copy of original
  • Vary the tuned version
  • Vary the code
    • Just switch lines
    • Reuse variables or don't
    • Flip branches and conditions
    • Flatten out code or compact it
  • Vary the data
    • Sort or unsort it
    • Make it longer, bigger, and smaller
  • Vary the environment
    • Try another JDK
    • Try another machine
    • Play with the GC

Verification by Tooling

Use the power of JMH

  • GC: -prof gc
  • Low-level stats: -prof perf / -prof perfnorm
  • Generated code: -prof perfasm
  • JIT Compiler: -prof comp
  • Classloader: -prof cl
  • JFR: -prof jfr
  • Async Profiler: -prof async
  • And more!
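
The profilers can be attached on the command line (for example -prof gc) or programmatically via the Runner API; a sketch reusing the array-copy benchmark name from earlier:

import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunWithProfiler
{
    public static void main(String[] args) throws RunnerException
    {
        // equivalent to adding -prof gc on the command line
        Options options = new OptionsBuilder()
                .include("Example1ArrayCopying")
                .addProfiler(GCProfiler.class)
                .build();

        new Runner(options).run();
    }
}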

Verification by Review

Ask anyone but ChatGPT

  • Ask a colleague
  • Don't just send a PR
  • Explain your thought process
  • No need to ask a benchmarking expert
  • Compare with JDK benchmarks and JMH examples
  • Ask the mighty Internet politely

Verification by Knowledge

Remember our experiments and more

  • Care about the platform and environment
  • Remember the CPU, cache, and memory examples
  • Keep Java core concepts in mind (JIT, JMM)
  • Be prepared to start from scratch
  • Always benchmark the outcome in reality
  • NEVER apply your benchmark outcome as a general rule

Microbenchmark when...

A Summary of Things

Conclusion

Microbenchmarking is...

Good

  • For isolating a problem
  • Investigating an isolated problem
  • Tuning an isolated problem
  • Detailed tuning
  • Understanding inner-workings
  • Squeezing the lemon

Bad

  • Drawing general conclusions
  • Black and white comparisons
  • Expecting large scale impacts
  • For the impatient developer

The Conclusion rephrased

When and when not to do microbenchmarks

  • Yes You are a JDK contributor
  • Yes You are an open source contributor
  • Yes Your problem can be isolated...
  • Yes ...or you isolate it for easier tuning
  • Yes You want to debunk your colleague's claims
  • Yes You are just curious
  • No You want to write a Medium post about which method is faster
  • No You want to tune production without a need
  • No You swiped left on all previous slides

One last thing

The JMH warning

REMEMBER: The numbers are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.

Do not assume the numbers tell you what you want them to tell.

Just because you have seen it on my slides does not make it right!

Never believe in any benchmark result unless you doctored it yourself!

René Schwietzke