Microbenchmarking

The Art of Realizing One is Wrong

About René Schwietzke

  • Co-Founder and Managing Director of Xceptance
  • Master of Computer Science (in German: Dipl.-Inf.)
  • Programmer since 1989
  • Java since 1996, Java 1.0
  • QA and Testing since 1998
  • Performance Tester since 1999
  • @ReneSchwietzke
  • @reneschwietzke@foojay.social
  • #java #qa #test #performance #performancetest #quality

About Xceptance

  • Founded 2004, Offices in Jena and Erfurt, Germany
  • Subsidiary in Cambridge, MA, USA
  • 50+ Employees
  • Focused on Software Test and Quality Assurance
  • Performance Testing, Test Automation, Functional Testing, QA, and Test Process Consulting
  • Mostly Active in the Areas of Ecommerce and Internet
  • Open Source Load Test Tool XLT (APL 2.0)
  • Open Source Test Automation Tool Neodymium (MIT)

What's in it for you?

Why you should stick around

  • Get a feel for whether you need microbenchmarking
  • Learn about typical mistakes
  • Learn about hardware and JVM internals
  • Take home microbenchmarking awareness
  • See the good and the ugly

This is not a JMH training!

  • Get some car and racing analogies

Why this presentation?

What is my motivation?

  • A lot of nonsensical microbenchmark examples have turned up lately
  • Teammates wasted weeks for nothing, even though their benchmarks showed double the performance
  • Many don't dissect problems enough before going all in on tuning
  • Microbenchmarking helped me to understand the JVM and hardware interactions better
  • Certain aspects are just extremely fun if you enjoy low-level things

If you want to drive fast, you have to understand how a car works. Some physics knowledge won't hurt either.

Boring Stuff First

Definitions and Guesses

What are benchmarks?

A definition first

To study (something, such as a competitor's product or business practices) in order to improve the performance of one's own company

A point of reference from which measurements may be made

A standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)

  • Racing a car around a track
  • Measuring the time it takes

What is a microbenchmark?

What is this micro thing and why is it micro?

A simplified benchmark that tests components in isolation.

It simplifies the process by abstracting the problem, removing complexity, and improving debuggability.

Components are usually quite small. Compare that to the unit test idea!

  • Just testing the engine or electric motor
  • No need to go out and test drive, just build a test bed and test 24x7

Let's guesstimate goals

What might be our goals?

  • How long does it take?
  • How much memory does it need?
  • Does it scale?
  • How does it work with larger data?
  • Is there any warmup?
  • Can I write that more concisely?
  • Is this really true?
  • Max revolutions
  • Max temperature
  • Power output and torque
  • Warmup time
  • Gas or power consumption
  • Vibrations
  • Exhaust gas composition
  • ...

Drawbacks

Side effects of simplification

  • Environment is different
  • Usage is simulated
  • Removal of factors which might be the cause
  • Unknown factors are eliminated too
  • Less interaction with other components
  • Development of a bias
  • Race track
  • Tires
  • Petrol
  • Driver
  • External forces
  • Vibrations
  • Temperatures

Get us some numbers

Some Experiments First

Let's Measure - Example 1

Always use System.arraycopy over manual copying


public static void main(String[] args)
{
    var length = 255;
    var src = new byte[length];
    var dest = new byte[length];

    var s1 = System.nanoTime();
    System.arraycopy(src, 0, dest, 0, length);
    var e1 = System.nanoTime();

    var s2 = System.nanoTime();
    for (int i = 0; i < length; i++)
    {
        dest[i] = src[i];
    }
    var e2 = System.nanoTime();

    System.out.printf("System Copy: %d ns%n", e1 - s1);
    System.out.printf("Manual Copy: %d ns%n", e2 - s2);
}

$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2685 ns
Manual Copy: 5781 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2565 ns
Manual Copy: 4609 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2545 ns
Manual Copy: 4589 ns
  • All the proof we needed!

Let's Measure - Example 2

Use a sized StringBuilder all the time


public static void main(String[] args)
{
    final String a = "This is a test";
    final String b = "More fun with a string";
    final long c = System.currentTimeMillis(); // e.g. 1684581249117
    var l = a.length() + b.length() + 13;

    var s3 = System.nanoTime();
    {
        var r = a + b + c;
    }
    var e3 = System.nanoTime();

    var s1 = System.nanoTime();
    {
        var sb = new StringBuilder();
        sb.append(a); sb.append(b); sb.append(c);
        var r = sb.toString();
    }
    var e1 = System.nanoTime();

    var s2 = System.nanoTime();
    {
        var sb = new StringBuilder(l);
        sb.append(a); sb.append(b); sb.append(c);
        var r = sb.toString();
    }
    var e2 = System.nanoTime();

    System.out.printf("Pure String usage : %d ns%n", e3 - s3);
    System.out.printf("StringBuilder     : %d ns%n", e1 - s1);
    System.out.printf("StringBuilder size: %d ns%n", e2 - s2);
}

$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 44441844 ns
StringBuilder     : 93092 ns
StringBuilder size: 7123 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 38471326 ns
StringBuilder     : 74865 ns
StringBuilder size: 6076 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 37914191 ns
StringBuilder     : 72841 ns
StringBuilder size: 6146 ns
  • Makes sense, matches common Internet knowledge
  • Sized builder is dramatically faster
  • Pure String operations are horrible

Let's Measure - Example 3

Streams are slow and parallel() is always better


public static void main(String[] args)
{
    var data = new int[100_000];
    for (int i = 0; i < data.length; i++)
    {
        data[i] = i;
    }

    // manual
    var s1 = System.nanoTime();
    {
        var total = 0.0d;
        for (int i = 0; i < data.length; i++)
        {
            total += data[i];
        }
        var r = total / data.length;
    }
    var e1 = System.nanoTime();

    // stream
    var s2 = System.nanoTime();
    {
        var r = Arrays.stream(data).average();
    }
    var e2 = System.nanoTime();

    // stream parallel
    var s3 = System.nanoTime();
    {
        var r = Arrays.stream(data).parallel().average();
    }
    var e3 = System.nanoTime();
    ...
}

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  1,595,527 ns
Stream         : 29,082,919 ns
Parallel stream: 21,997,029 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  1,597,572 ns
Stream         : 29,913,900 ns
Parallel stream: 23,778,407 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  2,008,348 ns
Stream         : 30,491,623 ns
Parallel stream: 22,356,628 ns
  • Streams are bad
  • Parallel streams are obviously better
  • But why are the results not the same all the time?

Thesis - What might be wrong

Forget everything you just saw

  • Of course, all tests before are garbage!
  • But they confirmed the Internet truth!!!
  • So, what could be wrong with these?
  • No warm up
  • Always a Le Mans start
  • One round only?
  • Just time measurement?
  • Make and model
  • Race track
  • Weight of car

Goals

What do we really want?

  • Repeatability - We can easily repeat the test.
  • Predictability - We get the same result when nothing has changed.
  • Changeability - We can easily change things (the object under test).
  • Controllability - We can control parameters and vary them (environment).
  • Debuggability - We can get additional insights.
  • Understandability - Things are so simple that we can understand the influence of changes.

Our true goal: We want to learn and understand by changing details under controlled conditions. Turn a knob a bit and see what happens. We want to derive reasoning or even proof from our changes.


The right tool

Don't ask what you can do, ask others what they have done

  • Things with a runtime of 30 s or more can be benchmarked naively and still come out mostly OK-ish
  • Don't spend 2 days on writing a framework you will use for 2 hours, rather spend 2 hours on the setup and benchmark for 2 days.
  • Don't reinvent the wheel, because... we will soon see why
  • Use a proven tool chain that helps to mitigate a lot of, yet unknown, problems
  • JMH - Java Microbenchmark Harness to the rescue

JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.

  • Built to address shortcomings of naive benchmarks
  • Tested and proven by a large community
  • Used for the JDK
  • Comes with built-in profiling tools
  • Compact
  • Open Source
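
To make the shape of such a benchmark concrete, here is a minimal sketch (class name and values are illustrative, not from the talk); the annotations mirror the ones used in the examples later on:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(2)
public class FirstBenchmark
{
    private byte[] src;

    @Setup
    public void setup()
    {
        // prepare the data outside of the measured code
        src = new byte[1000];
    }

    @Benchmark
    public byte[] copy()
    {
        // return the result so the JIT cannot drop the work as dead code
        return Arrays.copyOf(src, src.length);
    }
}

JMH generates the harness around this class at build time (via the annotation processor or the Maven archetype) and takes care of warmup, forking, and timing.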

The Challenges

Some Factors by Example

Disclaimer

We don't have the time...

  • We cannot research the challenges together, it just takes too much time
  • Please feel guided
  • Most benchmark knowledge is programming knowledge and vice versa
  • Understanding the why will help you to make your code faster and leaner
  • See beyond the pure Java code and meet the machine
  • Main goal - Awareness! You can google the rest.
  • This is rather a 270° view, not the full 360°.

Time Measurement

Your clock is... code

  • Whom do you ask for time?
  • APIs: System.nanoTime() and System.currentTimeMillis()
  • The JDK asks the OS, which asks the hardware
  • Imagine: 3 GHz CPU - 0.33 ns per clock cycle, 6 instructions per cycle
  • A lot can/could be done in 1 ns!
  • Whoops! We cannot measure less than 1 ns!
  • Bad news: Even worse, we cannot measure more precisely than 20 - 30 ns

Time Measurement

We can measure nanoseconds, can't we?


@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(2)
@State(Scope.Thread)
public class TimerResolution
{
    private long lastValue;

    @Benchmark
    public long latency_nanotime()
    {
        return System.nanoTime();
    }

    public static double precision(long[] t)
    {
        // store samples back to back
        for (int i = 0; i < t.length; i++)
        {
            t[i] = System.nanoTime();
        }

        // average difference between consecutive samples = observable resolution
        return (double) (t[t.length - 1] - t[0]) / (t.length - 1);
    }
}

Benchmark                             Mode  Cnt   Score   Error  Units
TimerResolution.latency_nanotime      avgt   10  25.519 ± 0.985  ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, TSC
372942679845594
372942679845635
372942679845665
372942679845705
372942679845745
372942679845775
Difference between calls 30 to 40 ns
  • Measuring takes about 20 to 30 ns
  • Returned value has a resolution of 30 to 40 ns
  • Solution - Measure less often, but execute the code under test often. Calculate the per-call runtime, don't measure it directly (see the sketch below).
  • Racing analogy: one lap takes 2 min 15 s; time 100 laps instead (225 min!) and divide
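
A naive sketch of that idea (illustrative only, JMH does this and much more for you): run the code under test many times, time the whole loop, and divide, instead of timing a single call that is shorter than the timer resolution.

public final class AmortizedTiming
{
    public static void main(String[] args)
    {
        final int iterations = 10_000_000;
        final byte[] src = new byte[255];
        final byte[] dest = new byte[255];

        final long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
        {
            System.arraycopy(src, 0, dest, 0, src.length);
        }
        final long end = System.nanoTime();

        // calculated per-call cost, not a single measured call
        System.out.printf("avg %.2f ns/op%n", (end - start) / (double) iterations);
    }
}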

OS - Clock Source

Not every clock is created equal

  • RTC Real Time Clock
  • PIT Programmable Interval Timer
  • HPET High Performance Event Timer
  • ACPI PM ACPI Power Management Timer
  • APIC Advanced Programmable Interrupt Controller
  • TSC Time Stamp Counter
  • ... and more

The x86 Time Stamp Counter (TSC) is a low-latency, monotonically increasing clock used to measure event durations.


Benchmark                             Mode  Cnt      Score    Error  Units
TimerResolution.latency_nanotime      avgt   10  1,128.204 ± 32.831  ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, HPET
374172181482681
374172181484078
374172181484986
374172181485964
374172181487291
374172181488199
Difference between calls 900 to 1,400 ns

Benchmark                             Mode  Cnt   Score   Error  Units
TimerResolution.latency_nanotime      avgt   10  25.519 ± 0.985  ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, TSC
495719354555374
495719354555394
495719354555424
495719354555444
495719354555464
495719354555484
Difference between calls 20 to 30 ns

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc hpet acpi_pm

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

$ echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource

Runtime Environment - JDK

Java evolves and so does performance


@State(Scope.Benchmark)
public class Example1ArrayCopying {
    @Param({"1000"}) int SIZE;
    byte[] src;

    @Setup
    public void setup() {
        src = new byte[SIZE];
        final Random r = new Random(7L);
        r.nextBytes(src);
    }

    @Benchmark
    public byte[] systemCopy() {
        var dest = new byte[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);

        return dest;
    }

    @Benchmark
    public byte[] arrayCopy() {
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public byte[] manualCopy() {
        var dest = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }

        return dest;
    }
}

# JDK 11
Benchmark                        (SIZE)  Mode  Cnt    Score     Error  Units
Example1ArrayCopying.manualCopy    1000  avgt    3  152.889 ±  32.305  ns/op
Example1ArrayCopying.systemCopy    1000  avgt    3  130.963 ± 119.643  ns/op
Example1ArrayCopying.arrayCopy     1000  avgt    3  128.726 ±  83.837  ns/op

# JDK 17
Example1ArrayCopying.manualCopy    1000  avgt    3  153.508 ±   7.987  ns/op
Example1ArrayCopying.systemCopy    1000  avgt    3  124.021 ±  35.199  ns/op
Example1ArrayCopying.arrayCopy     1000  avgt    3  121.271 ±  35.723  ns/op

# JDK 19
Example1ArrayCopying.manualCopy    1000  avgt    3  149.298 ±  79.705  ns/op
Example1ArrayCopying.systemCopy    1000  avgt    3  129.439 ± 104.106  ns/op
Example1ArrayCopying.arrayCopy     1000  avgt    3  133.358 ±  40.304  ns/op

# JDK 20
Example1ArrayCopying.manualCopy    1000  avgt    3  152.070 ±   7.096  ns/op
Example1ArrayCopying.systemCopy    1000  avgt    3  119.613 ±   9.873  ns/op
Example1ArrayCopying.arrayCopy     1000  avgt    3  117.901 ±   8.322  ns/op

# Graal 22.3.r17 - Whoops!
Example1ArrayCopying.manualCopy    1000  avgt    3  335.612 ±  22.000  ns/op <<<
Example1ArrayCopying.systemCopy    1000  avgt    3  150.035 ±   4.466  ns/op
Example1ArrayCopying.arrayCopy     1000  avgt    3  149.176 ±   4.496  ns/op

Lifecycle and Frequency

When, where, and how often is it called - the JIT at work

  • When is the code in question executed?
    • Startup, shutdown, normally, on error
  • How often is the code executed?
    • Once, a few times, like hell
  • How is it called?
    • From a single place, multiple places, in a loop, conditional or unconditional
  • Loading of code
  • JIT compiler and its stages
  • Data cache
  • Instruction cache
  • Inlining
  • Loop unrolling
  • TLB - the translation lookaside buffer, caching virtual-to-physical memory translations

Lifecycle and Frequency


@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class Streams {
    final int SIZE = 1024;
    final int[] integers = new int[SIZE];

    @Setup
    public void setup() {
        for (int i = 0; i < integers.length; i++) {
            integers[i] = i - SIZE;
        }
    }

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int lambdaArrayCold() {
        return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
    }
    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayWarm() {...}

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayHot() {...}

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int loopCold() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = integers[i];
            if (v % 2 == 0) {
                sum += v;
            }
        }
        return sum;
    }

    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int loopWarm() {...}

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int loopHot() {...}
}

Benchmark                Mode  Cnt    Score       Error  Units
Streams.lambdaArrayCold  avgt    5  688,797 ± 4,665,070  ns/op
Streams.lambdaArrayWarm  avgt    5   12,797 ±    59,453  ns/op
Streams.lambdaArrayHot   avgt    5    1,113 ±     2,602  ns/op

Streams.loopCold         avgt    5   52,936 ±    42,276  ns/op
Streams.loopWarm         avgt    5    2,030 ±     2,527  ns/op
Streams.loopHot          avgt    5    1,422 ±       733  ns/op

Streams.lambdaArrayCold
Iteration   1: 2,855,843.000 ns/op
Iteration   2:   172,353.000 ns/op
Iteration   3:   139,810.000 ns/op
Iteration   4:   139,670.000 ns/op
Iteration   5:   136,313.000 ns/op

Streams.lambdaArrayWarm
Iteration   1:    40,097.000 ns/op
Iteration   2:     4,712.000 ns/op
Iteration   3:     9,992.667 ns/op
Iteration   4:     4,145.143 ns/op
Iteration   5:     5,039.667 ns/op

Streams.loopCold
Iteration   1:    44,510.286 ns/op
Iteration   2:    61,584.000 ns/op
Iteration   3:    67,831.000 ns/op
Iteration   4:    45,749.000 ns/op
Iteration   5:    45,007.000 ns/op

Lifecycle and Frequency - Real Work


@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class HeavyStreams {
    // fields and setup as seen before, but holding BigInteger values this time

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int lambdaArrayCold() {
        return Arrays.stream(integers)
            .filter(i -> i.mod(BigInteger.TWO).equals(BigInteger.ZERO))
            .map(i -> i.sqrt())
            .mapToInt(BigInteger::intValue)
            .sum();
    }

    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayWarm() { ... }

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayHot() { ... }

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int loopCold() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = integers[i];

            if (v.mod(BigInteger.TWO).equals(BigInteger.ZERO)) {
                sum += v.sqrt().intValue();
            }
        }

        return sum;
    }

    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int loopWarm()  { ... }

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int loopHot() { ... }
}

# ms now!!!
Benchmark                     Mode  Cnt Score   Error  Units
HeavyStreams.lambdaArrayCold  avgt    5   362 ±   113  ms/op
HeavyStreams.lambdaArrayWarm  avgt    5   297 ±   177  ms/op
HeavyStreams.lambdaArrayHot   avgt    5   264 ±    12  ms/op

HeavyStreams.loopCold         avgt    5   363 ±   100  ms/op
HeavyStreams.loopWarm         avgt    5   302 ±   183  ms/op
HeavyStreams.loopHot          avgt    5   271 ±    16  ms/op

HeavyStreams.lambdaArrayCold
Iteration   1: 409 ms/op
Iteration   2: 372 ms/op
Iteration   3: 347 ms/op
Iteration   4: 342 ms/op
Iteration   5: 339 ms/op

HeavyStreams.lambdaArrayWarm
Iteration   1: 263 ms/op
Iteration   2: 270 ms/op
Iteration   3: 264 ms/op
Iteration   4: 263 ms/op
Iteration   5: 263 ms/op

HeavyStreams.loopCold
Iteration   1: 401 ms/op
Iteration   2: 376 ms/op
Iteration   3: 352 ms/op
Iteration   4: 342 ms/op
Iteration   5: 340 ms/op
  • Overhead almost disappeared

Code Removal

It's dead, Jim!


public class DeadCode {
    private double computeSideEffect(int n) {
        return BigInteger.valueOf(n).pow(7).sqrt().doubleValue();
    }

    private double fib(int n) {
        int fib = 1;
        int prevFib = 1;

        for (int i = 2; i < n; i++) {
            int temp = fib;
            fib += prevFib;
            prevFib = temp;
        }
        return fib;
    }

    @Benchmark
    public double baseline() {
        return 5.0d;
    }

    @Benchmark
    public void computeWrongSE() {
        computeSideEffect(2000);
    }

    @Benchmark
    public double computeRightSE() {
        return computeSideEffect(2000);
    }

    @Benchmark
    public void computeWrong() {
        fib(2000);
    }

    @Benchmark
    public double computeRight() {
        return fib(2000);
    }
}

Benchmark                Mode  Cnt     Score   Error  Units
DeadCode.baseline        avgt    3       0.4 ±   0.1  ns/op

DeadCode.computeWrong    avgt    3      16.1 ±   1.6  ns/op
DeadCode.computeRight    avgt    3     479.2 ±  20.2  ns/op

DeadCode.computeWrongSE  avgt    3    1616.7 ± 434.4  ns/op
DeadCode.computeRightSE  avgt    3    1713.4 ± 876.9  ns/op
  • The (almost empty) baseline method is seemingly not even called
  • But JDK 11 says 3.7 ns
  • computeWrong is not really dropped
  • The compute variants with objects (computeSideEffect) behave unexpectedly
  • computeWrong changes based on the parameter
  • computeWrong with fib(1000) yields 8.8 ns - ???
  • computeWrong with fib(100) yields 0.4 ns - ???
  • Returning the result keeps the code alive; when you cannot return it, consume it (see the Blackhole sketch below)
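
JMH's Blackhole is the usual remedy when a result cannot be returned: handing the value to the Blackhole keeps it observable, so the JIT cannot eliminate the computation. A minimal sketch (not part of the benchmark above):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class DeadCodeAvoided
{
    @Benchmark
    public void computeConsumed(Blackhole bh)
    {
        // the consume call keeps the result alive for the optimizer
        bh.consume(fib(2000));
    }

    private double fib(int n)
    {
        int fib = 1;
        int prevFib = 1;
        for (int i = 2; i < n; i++)
        {
            int temp = fib;
            fib += prevFib;
            prevFib = temp;
        }
        return fib;
    }
}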

Complex Example


public class CachingEffects {
    @Param({"true", "false"})
    boolean live;

    int SIZE = 10000;
    int LOAD = 1000000;

    List<String> work;
    List<String> load;

    @Setup
    public void setup()     {
        load = new ArrayList<>(LOAD);

        var s = new StringBuilder("ß09876512[more here]");
        s.append(System.currentTimeMillis());
        s.append('$');

        for (int i = 0; i < LOAD; i++) {
            // align the copies next to each other
            load.add(s.toString());
        }

        work = new ArrayList<>(SIZE);
        if (live) {
            // shuffle the data, with fixed outcome
            Collections.shuffle(load, new Random(42L));
        }

        // copy us the data, when live, we get a random pattern, when
        // not live, we get basically data that is laid out next to each other
        for (int i = 0; i < SIZE; i++) {
            work.add(load.get(i));
        }
    }
    ...
}

public class CachingEffects
{
    ...

    /**
     * Manually search for a character pos
     */
    @Benchmark
    public int manual() {
        var n = 0;

        for (int i = 0; i < work.size(); i++) {
            var s = work.get(i);

            for (int index = 0; index < s.length(); index++) {
                if (s.charAt(index) == '$') {
                    n += index;
                    break;
                }
            }
        }
        return n;
    }

    /**
     * Search via indexOf for the char position
     */
    @Benchmark
    public int indexof() {
        var n = 0;

        for (int i = 0; i < work.size(); i++) {
            n += work.get(i).indexOf('$');
        }
        return n;
    }
}

Caches

The hidden memory that speeds things up

  • CPUs feature multiple layers of data and instruction caches
  • Caches are per core, socket, and global
  • You cannot control the caches easily
  • Your benchmark likely betrays you here

# JDK 11.0.17 - Aligned Data
Benchmark               (live)  Mode  Cnt      Score     Error  Units Diff
CachingEffects.indexof   false  avgt    3    435,184 ±  54,444  ns/op
CachingEffects.manual    false  avgt    3    318,798 ± 166,076  ns/op  73%

# JDK 11.0.17 - More Random Data
Benchmark               (live)  Mode  Cnt      Score     Error  Units Diff
CachingEffects.indexof    true  avgt    3  1,420,598 ± 549,023  ns/op
CachingEffects.manual     true  avgt    3  1,389,879 ± 342,328  ns/op  98%

# JDK 11.0.17 - Single Executions - investigated because of the large error values
Benchmark               (live)  Mode  Cnt     Score  Units
CachingEffects.indexof   false  avgt    1    562,553 ns/op
CachingEffects.indexof   false  avgt    1    456,493 ns/op
CachingEffects.indexof   false  avgt    1    469,517 ns/op

CachingEffects.manual    false  avgt    1    313,673 ns/op
CachingEffects.manual    false  avgt    1    677,170 ns/op
CachingEffects.manual    false  avgt    1    309,094 ns/op

CachingEffects.indexof    true  avgt    1  1,637,546 ns/op
CachingEffects.indexof    true  avgt    1  2,477,774 ns/op
CachingEffects.indexof    true  avgt    1  2,604,464 ns/op

CachingEffects.manual     true  avgt    1  1,560,230 ns/op
CachingEffects.manual     true  avgt    1  2,067,318 ns/op
CachingEffects.manual     true  avgt    1  1,888,360 ns/op

CPU and Memory

CPU design, usage, and speed make things go fuzzy

  • CPU Frequencies
  • Power management
  • Design (cache size, instruction set, pipeline)
  • Core types (real vs. hyper/virtual)
  • Memory speed
  • CPU and JDK: intrinsics
  • CPU and JDK: defaults
  • ...and much more

# Laptop Busy, JDK 11
Benchmark               (live)  Mode  Cnt      Score Units  Code Cache
CachingEffects.indexof    true  avgt    3  1,572,759 ns/op
CachingEffects.manual     true  avgt    3  1,413,919 ns/op   89%
CachingEffects.indexof   false  avgt    3    377,649 ns/op         24%
CachingEffects.manual    false  avgt    3    294,783 ns/op   78%   21%

# Laptop Less Busy, JDK 11
CachingEffects.indexof    true  avgt    3  1,263,913 ns/op
CachingEffects.manual     true  avgt    3  1,150,578 ns/op   91%
CachingEffects.indexof   false  avgt    3    422,753 ns/op         33%
CachingEffects.manual    false  avgt    3    301,250 ns/op   71%   26%

# Turbo off, JDK 11
CachingEffects.indexof    true  avgt    3  2,191,820 ns/op
CachingEffects.manual     true  avgt    3  1,928,017 ns/op   88%
CachingEffects.indexof   false  avgt    3    858,436 ns/op         39%
CachingEffects.manual    false  avgt    3    687,400 ns/op   80%   36%

# Google c2-standard-8, Intel, JDK 11
CachingEffects.indexof    true  avgt    3  1,064,037 ns/op
CachingEffects.manual     true  avgt    3    908,363 ns/op   85%
CachingEffects.indexof   false  avgt    3    509,849 ns/op         47%
CachingEffects.manual    false  avgt    3    476,487 ns/op   93%   52%

# Google c2d-standard-8, AMDs, JDK 11
CachingEffects.indexof    true  avgt    3    707,885 ns/op
CachingEffects.manual     true  avgt    3    599,987 ns/op   84%
CachingEffects.indexof   false  avgt    3    375,242 ns/op         53%
CachingEffects.manual    false  avgt    3    261,686 ns/op   70%   44%

# Google c3-highcpu-8, Intel, JDK 11
CachingEffects.indexof    true  avgt    3  1,122,492 ns/op
CachingEffects.manual     true  avgt    3  1,075,496 ns/op   96%
CachingEffects.indexof   false  avgt    3    357,894 ns/op         32%
CachingEffects.manual    false  avgt    3    303,531 ns/op   85%   28%

CPU - Branch Prediction

The unknown fortune teller

  • CPU has a pipeline it wants to fill
  • Instructions per cycle: 4-8
  • CPU executes code out of order
  • When you branch, it tries to go the likely route ahead of time
  • When the guess was wrong, it becomes expensive
  • JIT analyses frequency and reorders conditions too

@State(Scope.Benchmark)
public class BranchPrediction {
    private static final int COUNT = 10000;

    private int[] sorted = new int[COUNT];
    private int[] unsorted = new int[COUNT];
    private int[] pattern2 = new int[COUNT];
    private int[] pattern5 = new int[COUNT];
    private int[] shuffledPattern = new int[COUNT];

    @Setup
    public void setup() {
        final Random r = new Random(13241L);

        // using int here to avoid problems with memory and caches aka
        // only inline data and hence same caching behavior
        for (int i = 0; i < COUNT; i++) {
        	final int d = r.nextInt(100_000_000) + 1;

            sorted[i] = d; unsorted[i] = d;
            pattern2[i] = i % 2 == 0 ? 100_000_000 : r.nextInt(50_000_000);
            pattern5[i] = i % 5 == 0 ? 100_000_000 : r.nextInt(50_000_000);
            shuffledPattern[i] = pattern2[i];
        }
        Arrays.sort(sorted);

        // make it noisy to break the branch predictor
        for (int i = 0; i < shuffledPattern.length; i++) {
            // shuffle here
        }
    }

    public long doIt(int[] array) {
    	long sum = 2;
        for (int v : array) {
            if (v > 50_000_000) {
                sum += (v >> 2);
            }
            else {
                sum += (v << 2);
            }
        }
        return sum;
    }
    ...
}

Branch Prediction

Your data sorting becomes the driver of the results


# JDK 11
Benchmark                        Mode C  Score   Error Units
BranchPrediction.unsorted        avgt 4 14,424 ± 1,235 ns/op
BranchPrediction.sorted          avgt 4  4,709 ±    40 ns/op
BranchPrediction.pattern2        avgt 4  4,943 ±    14 ns/op
BranchPrediction.pattern5        avgt 4  4,642 ±   501 ns/op
BranchPrediction.shuffledPattern avgt 4 14,209 ±   833 ns/op

# JDK 17
BranchPrediction.unsorted        avgt 4 14,285 ± 1,940 ns/op
BranchPrediction.sorted          avgt 4  4,858 ±   722 ns/op
BranchPrediction.pattern2        avgt 4  4,353 ±    51 ns/op
BranchPrediction.pattern5        avgt 4  4,744 ±   347 ns/op
BranchPrediction.shuffledPattern avgt 4 14,132 ± 2,155 ns/op

# JDK 11, -XX:-BlockLayoutByFrequency
Benchmark                         Mode C  Score   Error Units
BranchPrediction.unsorted         avgt 4  9,800 ±   327 ns/op
BranchPrediction.sorted           avgt 4  4,791 ±   341 ns/op
BranchPrediction.pattern2         avgt 4  5,509 ±    57 ns/op
BranchPrediction.pattern5         avgt 4  5,802 ±   454 ns/op
BranchPrediction.shuffledPattern  avgt 4  9,865 ±   292 ns/op

# JDK 17, -XX:-BlockLayoutByFrequency
BranchPrediction.unsorted         avgt 4  9,304 ±   224 ns/op
BranchPrediction.sorted           avgt 4  4,239 ±   227 ns/op
BranchPrediction.pattern2         avgt 4  5,530 ±   187 ns/op
BranchPrediction.pattern5         avgt 4  4,182 ±   816 ns/op
BranchPrediction.shuffledPattern  avgt 4  9,085 ±   473 ns/op

Data

Static and little data is not enough

  • Data is one key driver of valid benchmarks
  • Not every String has a length of just 42 characters
  • Data usually has these dimensions:
    • Size
    • Content
    • Variance
  • Static data is good for repeatability and predictability
  • Live-like data is great for the final validation

@State(Scope.Benchmark)
public class HashingDataDimensions {
    @Param({"10", "50", "100", "500", "1000", "2000"}) int CHARARRAY_SIZE;
    @Param({"true", "false"}) boolean RANDOM;
    final static int MAX = 100_000;

    // used in the benchmark methods to pick entries
    final Random r = new Random();
    List<char[]> data, holder;

    @Setup
    public void setup() {
        data = new ArrayList(MAX);
        holder = new ArrayList(MAX);

        for (int i = 0; i < MAX; i++) {
            // the data to work with; random(n) is an elided helper
            // creating a random char[] of length n
            var a = random(CHARARRAY_SIZE);
            data.add(a);
            holder.add(a);

            if (!RANDOM) {
                // same data in our data array all the time, still a miss access when accessing it
                // so the RANDOM test has the same code path, just ends up on the same data
                data.set(data.size() - 1, data.get(0));
            }
        }
    }

    @Benchmark
    public void traditionalHashcode(Blackhole b) {
        for (int i = 0; i < data.size(); i++) {
            b.consume(hashCodeClassic(data.get(r.nextInt(data.size()))));
        }
    }
    ...
    public static int hashCodeClassic(char[] src) {
        final int last = src.length;

        int h = 0;
        for (int i = 0; i < last; i++) {
            h = 31 * h + src[i];
        }

        return h;
    }

    public static int hashCodeVectoredJDK19(char[] src) {
        int h = 0;
        int i = 0;
        int l, l2;
        l = l2 = src.length;
        l = l & ~(8 - 1);

        for (; i < l; i += 8) {
            h = -1807454463 * h +
                            1742810335 * src[i + 0] +
                            887503681 * src[i + 1] +
                            28629151 * src[i + 2] +
                            923521 * src[i + 3] +
                            29791 * src[i + 4] +
                            961 * src[i + 5] +
                            31 * src[i + 6] +
                            1 * src[i + 7];
        }

        for (; i < l2; i++) {
            h = 31 * h + src[i];
        }

        return h;
    }


    public static int hashCodeNoMul(char[] src) {
        final int last = src.length;

        int h = 0;
        for (int i = 0; i < last; i++) {
            h = ((h << 5) - h) + src[i];
        }

        return h;
    }
}

Data


Benchmark           (CHARARRAY_SIZE)  (RANDOM)    Score   Error  Units    %
traditionalHashcode               10      true    2,354 ±    98  us/op
traditionalHashcode               10     false    1,424 ±    18  us/op  60%

traditionalHashcode               50      true    6,905 ±   155  us/op
traditionalHashcode               50     false    4,255 ±    45  us/op  61%

traditionalHashcode              100      true   12,590 ±  5650  us/op
traditionalHashcode              100     false    8,651 ±   106  us/op  68%

traditionalHashcode              500      true   58,176 ± 14771  us/op
traditionalHashcode              500     false   43,322 ±  1022  us/op  74%

traditionalHashcode             1000      true  103,746 ±  7203  us/op
traditionalHashcode             1000     false   86,582 ±   842  us/op  83%

traditionalHashcode             2000      true  194,827 ± 11203  us/op
traditionalHashcode             2000     false  173,190 ±  1040  us/op  88%

vectoredHashcode                  10      true    2,701 ±    45  us/op
vectoredHashcode                  10     false    1,682 ±    20  us/op  62%

vectoredHashcode                  50      true    5,893 ±   798  us/op
vectoredHashcode                  50     false    2,919 ±   101  us/op  50%

vectoredHashcode                 100      true    8,593 ±  1135  us/op
vectoredHashcode                 100     false    4,735 ±   185  us/op  55%

vectoredHashcode                 500      true   40,104 ± 13728  us/op
vectoredHashcode                 500     false   20,698 ±   157  us/op  51%

vectoredHashcode                1000      true   63,079 ±  6775  us/op
vectoredHashcode                1000     false   40,718 ±   836  us/op  64%

vectoredHashcode                2000      true  106,395 ±  2552  us/op
vectoredHashcode                2000     false   80,286 ±   301  us/op  75%

Benchmark           (CHARARRAY_SIZE)  (RANDOM)    Score   Error  Units    %
traditionalHashcode               10      true    2,354 ±    98  us/op
vectoredHashcode                  10      true    2,701 ±    45  us/op 114%

traditionalHashcode               50      true    6,905 ±   155  us/op
vectoredHashcode                  50      true    5,893 ±   798  us/op  85%

traditionalHashcode              100      true   12,590 ±  5650  us/op
vectoredHashcode                 100      true    8,593 ±  1135  us/op  68%

traditionalHashcode              500      true   58,176 ± 14771  us/op
vectoredHashcode                 500      true   40,104 ± 13728  us/op  69%

traditionalHashcode             1000      true  103,746 ±  7203  us/op
vectoredHashcode                1000      true   63,079 ±  6775  us/op  61%

traditionalHashcode             2000      true  194,827 ± 11203  us/op
vectoredHashcode                2000      true  106,395 ±  2552  us/op  55%

traditionalHashcode               10     false    1,424 ±    18  us/op
vectoredHashcode                  10     false    1,682 ±    20  us/op 118%

traditionalHashcode               50     false    4,255 ±    45  us/op
vectoredHashcode                  50     false    2,919 ±   101  us/op  69%

traditionalHashcode              100     false    8,651 ±   106  us/op
vectoredHashcode                 100     false    4,735 ±   185  us/op  55%

traditionalHashcode              500     false   43,322 ±  1022  us/op
vectoredHashcode                 500     false   20,698 ±   157  us/op  49%

traditionalHashcode             1000     false   86,582 ±   842  us/op
vectoredHashcode                1000     false   40,718 ±   836  us/op  47%

traditionalHashcode             2000     false  173,190 ±  1040  us/op
vectoredHashcode                2000     false   80,286 ±   301  us/op  46%

JDK again

Let's revisit the JDK version


# JDK 11.0.16, T14s AMD Gen 1
Benchmark               (live)  Mode Cnt     Score Units Code Cache
CachingEffects.indexof    true  avgt   3 1,263,913 ns/op
CachingEffects.manual     true  avgt   3 1,150,578 ns/op  91%
CachingEffects.indexof   false  avgt   3   422,753 ns/op        33%
CachingEffects.manual    false  avgt   3   301,250 ns/op  71%   26%

# JDK 11.0.16, c3-highcpu-8
CachingEffects.indexof    true  avgt   3 1,122,492 ns/op
CachingEffects.manual     true  avgt   3 1,075,496 ns/op  95%
CachingEffects.indexof   false  avgt   3   357,894 ns/op        32%
CachingEffects.manual    false  avgt   3   303,531 ns/op  85%   28%

# JDK 17.0.6, T14s AMD Gen 1
CachingEffects.indexof    true  avgt   3   403,302 ns/op
CachingEffects.manual     true  avgt   3 1,160,649 ns/op 287%
CachingEffects.indexof   false  avgt   3    53,984 ns/op        13%
CachingEffects.manual    false  avgt   3   293,749 ns/op 542%   25%

# JDK 17.0.6, c3-highcpu-8
CachingEffects.indexof    true  avgt   3   271,683 ns/op
CachingEffects.manual     true  avgt   3 1,080,750 ns/op 398%
CachingEffects.indexof   false  avgt   3    58,124 ns/op        21%
CachingEffects.manual    false  avgt   3   310,592 ns/op 534%   29%
  • Newer versions might surprise you
  • Stay recent and you can roll back some changes
  • Platform changes per JDK are possible too
  • Even when you cannot migrate yet, try a newer JDK - maybe you can skip a change, or the numbers motivate your boss
  • Remember, Git is your friend and proper commit messages are a lifesaver!

The Human Factor

Never underestimate your own bias

  • If you want to see something, you will see it.
  • When a change has the desired effect but the effect has a different cause, you might not question the change.
  • Common myths often explain it... incorrectly.
  • Desperate tuning is often bad tuning.
  • A self-fulfilling prophecy is a real thing when benchmarking.

Bad news

If this wasn't challenging enough...

You just saw a fraction of all the factors that might or might not influence the correctness of your microbenchmarks. You don't have to understand them all, but you have to be aware that a microbenchmark is not just about CPU and time measurement.

Approaching Microbenchmarks

Let's Tackle the Problem

Got a Problem?

What is your motivation?

Real Problem

  • Production problem or oddity
  • You have an idea or suspicion

Made Up Problem

  • Colleague told you
  • Read it somewhere
  • Told to make it scale
  • Told to cut cost in half
  • Production problem has been profiled, hopefully
  • You have to benchmark before you microbenchmark
  • When an arbitrary goal is given, you have to profile first
  • You might have to instrument first

Theory

No idea? No benchmark!

  • Develop a sound theory why something is slow or hungry
  • You might need a benchmark first!
  • Consider these five areas:
    • Code aka algorithms
    • Memory aka usage (allocation, freeing, access)
    • Cache aka locality of memory usage
    • I/O aka external dependencies
    • Scale aka synchronisation and CPU usage
  • Think about interactions of your problem component with the world

Test Case

Build your code

  • Isolate your problem
  • Avoid rewriting the code, try to use the API
  • Don't remove code only because it is not important for the benchmark
  • Isolate data preparation from code execution
  • Standard JMH prepares data only once per trial, which is difficult when the code mutates the data (see the setup-level sketch below)
  • Concurrency tests are hard to write, see how your production runs it
  • Don't apply your ideas or theory to the test case
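
One way to handle mutated data, sketched here under assumed requirements (class and numbers are illustrative), is JMH's @Setup level: Level.Trial (the default) runs once per fork, Level.Iteration before every iteration, Level.Invocation before every single call.

import java.util.Arrays;
import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class MutatingDataBenchmark
{
    private int[] data;

    // Level.Invocation would keep the data fresh per call, but it adds
    // overhead and can distort very short benchmarks - pick the level
    // that matches how production calls the code.
    @Setup(Level.Iteration)
    public void refreshData()
    {
        data = new Random(42L).ints(10_000).toArray();
    }

    @Benchmark
    public int[] sortInPlace()
    {
        // mutates the input: after the first call within an iteration the
        // array is already sorted, which is exactly the effect to watch for
        Arrays.sort(data);
        return data;
    }
}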

Data

Shape the future outcome

  • Shape your test data according to your production data
  • No production data, because the idea is made up? Think in dimensions:
    • Amount - How many strings?
    • Size - How long is a string?
    • Structure - Where is the challenge in the string?
    • Layout - Holders and copies
    • Variance - How many different sizes, structures, and layouts and how? At once? Isolated?
  • Your data might create or even be the problem you are looking for
  • Think unit tests and the data you might need to cover your code properly (a parameterization sketch follows below)
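
A hedged sketch of spanning those dimensions with @Param (names and values are made up for illustration), so one benchmark covers several amounts, sizes, and variance settings at once:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class ShapedDataBenchmark
{
    @Param({"10", "1000", "100000"})  // amount: how many strings
    int count;

    @Param({"10", "100", "1000"})     // size: length of a single string
    int length;

    @Param({"true", "false"})         // variance: all different vs. all identical
    boolean varied;

    List<String> data;

    @Setup
    public void setup()
    {
        final Random r = new Random(42L);
        data = new ArrayList<>(count);
        for (int i = 0; i < count; i++)
        {
            final StringBuilder sb = new StringBuilder(length);
            for (int j = 0; j < length; j++)
            {
                sb.append((char) ('a' + (varied ? r.nextInt(26) : 0)));
            }
            data.add(sb.toString());
        }
    }

    @Benchmark
    public int findAll()
    {
        int n = 0;
        for (int i = 0; i < data.size(); i++)
        {
            n += data.get(i).indexOf('q');
        }
        return n;
    }
}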

Execution Profile

Wrong conclusions can be drawn in seconds with this trick

  • How often is your code executed?
    • Only at startup (how often)
    • Occasionally
    • Always
    • On error or on a rare path
  • Where is your code used?
    • Inlining
    • Branch prediction and out-of-order execution

Unit Testing

Functional correctness before performance

  • Your first duty is to correctness!
  • Try to reuse your regular tests
  • Validate every change to avoid false hopes
  • Don't unit-test memory, cache or other performance things
  • Ensure proper coverage, because performance tuning often removes edge-cases
  • Try to use the same data for testing and benchmarking (see the sketch below)
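
A small sketch of that idea, with entirely hypothetical names: one data generator feeds both the JUnit test that guards correctness and the benchmark's @Setup, so both run against identical input.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Random;

import org.junit.jupiter.api.Test;

public class SharedDataExample
{
    // the same generator a benchmark's @Setup would call
    static String[] testStrings(int count, long seed)
    {
        final Random r = new Random(seed);
        final String[] result = new String[count];
        for (int i = 0; i < count; i++)
        {
            result[i] = "prefix-" + r.nextInt(1_000_000) + "$suffix";
        }
        return result;
    }

    // the (hypothetical) code under test: a manual indexOf
    static int indexOfDollar(String s)
    {
        for (int i = 0; i < s.length(); i++)
        {
            if (s.charAt(i) == '$')
            {
                return i;
            }
        }
        return -1;
    }

    @Test
    public void matchesStringIndexOf()
    {
        for (String s : testStrings(1_000, 42L))
        {
            assertEquals(s.indexOf('$'), indexOfDollar(s));
        }
    }
}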

Question the Results

Never believe in your first results

  • Is this your expected outcome? Question it.
  • Is this not what you expected? Question it.
  • You did not have any expectations? Well... we have a problem.
  • Start validating what you got by experimenting around it
  • Vary the hell out of code and data before you start tuning!
  • Think of edge-cases

Verification by Variation

Vary a little to feel confident

  • Always keep the original state running
  • Vary the original, a copy of it, of course
  • Vary the tuned version
  • Vary the code
    • Just switch lines
    • Reuse variables or don't
    • Flip branches and conditions
    • Flatten out code or compact it
  • Vary the data
    • Sort or unsort it
    • Make it longer, bigger, and smaller
    • Change patterns
    • Switch out test data for production
  • Vary the environment
    • Change the order
    • Try another JDK
    • Try another machine
    • Play with the GC

Verification by Tooling

Use the power of JMH

  • GC -prof gc
  • Low-level stats -prof perf/perfnorm
  • Generated code -prof perfasm
  • JIT Compiler -prof comp
  • Classloader -prof cl
  • JFR -prof jfr
  • Async Profiler -prof async
  • And more! (a usage sketch follows below)
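
The profilers can also be attached programmatically through the JMH Runner API; this sketch (benchmark name and options are illustrative) is roughly equivalent to passing -prof gc on the command line:

import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ProfiledRun
{
    public static void main(String[] args) throws RunnerException
    {
        Options opts = new OptionsBuilder()
            .include("CachingEffects")    // regex of benchmarks to run
            .addProfiler(GCProfiler.class) // same as -prof gc
            .forks(1)
            .build();

        new Runner(opts).run();
    }
}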

Verification by Review

Ask anyone but ChatGPT

  • Ask a colleague
  • Don't just send a PR
  • Explain your thought process
  • No need to ask a benchmarking expert
  • Compare with JDK benchmarks and JMH examples
  • Ask the mighty Internet politely

Verification by Knowledge

Remember our experiments and more

  • Care about the platform and environment
  • Remember our CPU, cache, and memory examples
  • Keep the Java core concepts in mind (JIT, JMM)
  • Be prepared to start from scratch
  • Always benchmark the outcome in reality
  • NEVER apply your benchmark outcome as a general rule in the future!

Microbenchmark when...

A Summary of Things

Conclusion

Microbenchmarking is...

Good

  • For isolating a problem
  • Investigating an isolated problem
  • Tuning an isolated problem
  • Extremely detailed tuning
  • Understanding the inner workings
  • Squeezing the lemon a little more

Bad

  • Drawing general conclusions
  • Black and white comparisons
  • Expecting large scale impacts
  • The impatient developer

The Conclusion rephrased

When and when not to do microbenchmarks

  • Yes - You are a JDK contributor
  • Yes - You are an open source contributor
  • Yes - Your problem can be isolated...
  • Yes - ...or you isolate it for easier tuning
  • Yes - You want to debunk your colleague's claims
  • Yes - You are just curious
  • No - You want to write a Medium post about which method is faster
  • No - You want to tune production without a need
  • No - You swiped left on all previous slides

One last thing

The JMH warning

REMEMBER: The numbers are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.

Do not assume the numbers tell you what you want them to tell.

Just because you have seen it on my slides does not make it right!