Microbenchmarking
Deep-Dive

The Art of Realizing One is Wrong
While Learning How to Microbenchmark

Bonus Content: JVM Insights, Hardware Details, and Interactions

About René Schwietzke

  • Master of Computer Science (Dipl.-Inf.)
  • Programmer since 1989
  • Java since Java 1.0, 1996
  • QA and Testing since 1998
  • Performance Tester since 1999
  • Co-Founder of
  • @ReneSchwietzke
  • @reneschwietzke@foojay.social

About

  • Founded 2004
  • Headquarters in Jena, Germany; Subsidiary in Cambridge, MA, USA
  • Specialized in Software Testing and Quality Assurance
  • Performance testing since 2004
  • Over 150 performance test projects every year
  • World-wide customer base including APAC and South America
  • Performance Test Tool, Java-based, APL 2.0

Thank you - One learns from others

Always attribute other people's work and state your sources

License

I care and share - Slides: https://t.ly/POQnP


This presentation is licensed under
Creative Commons Attribution 4.0 International License

What's in it for you?

Why you should stick around

  • Get a feel for microbenchmarking
  • Discover when you need it
  • Learn about JMH
  • Learn about typical mistakes
  • Learn about hardware and JVM internals
  • Take home microbenchmarking awareness
  • See the good and the ugly

This is a small JMH training!

We won't cover all aspects.

The JMH-Samples are awesome

Why all this?

What is my motivation?

  • Nonsense microbenchmarks are everywhere
  • Customers have wasted weeks on them
  • Many don't dissect problems enough before going nuts on tuning
  • Microbenchmarking helps to understand the JVM and hardware interactions
  • Certain aspects are extremely fun

If you want to drive fast, you have to understand how a car works.

Boring Stuff First

Definitions, Guesses, Tools

Tools for Later

What you will need to make things fun

  • Java 21
  • Maven
  • GIT
  • A Java-IDE
  • A Commandline
  • (optional) Linux Perf
  • (optional) hsdis for ASM support

What are benchmarks?

A definition first

To study (something, such as a competitor's product or business practices) in order to improve the performance of one's own company

A point of reference from which measurements may be made

A standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)

  • Racing a car around a track
  • Measuring the time it takes

What is a microbenchmark?

What is this micro thing and why is it micro?

A simplified benchmark that tests components in isolation.

It simplifies the process by abstracting the problem, removing complexity, and improving debuggability.

Components are usually quite small. Compare that to the unit test idea!

  • Just testing the engine or electric motor
  • No need to go out and test drive, just build a test bed and test 24x7

Drawbacks

Side effects of simplification

  • Environment is different
  • Usage is simulated
  • Removal of factors which might be the cause
  • Unknown factors are eliminated too
  • Less interaction with other components
  • Development of a bias
  • Race track
  • Tires
  • Petrol
  • Driver
  • External forces
  • Vibrations
  • Temperatures

Get us some numbers

Some Experiments First

Let's Measure - Example 1

Always use System.arraycopy over manual copying


public static void main(String[] args)
{
    var length = 255;
    var src = new byte[length];
    var dest = new byte[length];

    var s1 = System.nanoTime();
    System.arraycopy(src, 0, dest, 0, length);
    var e1 = System.nanoTime();

    var s2 = System.nanoTime();
    for (int i = 0; i < length; i++)
    {
        dest[i] = src[i];
    }
    var e2 = System.nanoTime();

    System.out.printf("System Copy: %,d ns%n", e1 - s1);
    System.out.printf("Manual Copy: %,d ns%n", e2 - s2);
}

$ cd jmh-training
$ java src/main/java/org/xc/jmh/naive/Example1.java
System Copy: 1,913 ns
Manual Copy: 3,135 ns

$ java src/main/java/org/xc/jmh/naive/Example1.java
System Copy: 1,172 ns
Manual Copy: 2,765 ns

$ java src/main/java/org/xc/jmh/naive/Example1.java
System Copy: 1,814 ns
Manual Copy: 3,917 ns
  • All the proof we need!
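Even without JMH, simply repeating the measurement inside one JVM run already undermines that "proof". A minimal stdlib sketch (hypothetical class name, not part of the training repo); later rounds are typically faster because the JIT has warmed up, and the numbers jump around:

```java
import java.util.Arrays;

// Naive in-process repetition of Example 1. Still not a valid benchmark,
// but it shows that a single measurement proves nothing.
public class RepeatedCopy {
    public static void main(String[] args) {
        var length = 255;
        var src = new byte[length];
        var dest = new byte[length];
        Arrays.fill(src, (byte) 42);

        for (int round = 1; round <= 5; round++) {
            var s1 = System.nanoTime();
            System.arraycopy(src, 0, dest, 0, length);
            var e1 = System.nanoTime();

            var s2 = System.nanoTime();
            for (int i = 0; i < length; i++) {
                dest[i] = src[i];
            }
            var e2 = System.nanoTime();

            System.out.printf("Round %d - System: %,d ns, Manual: %,d ns%n",
                    round, e1 - s1, e2 - s2);
        }
        // both approaches must at least produce the same result
        System.out.println(Arrays.equals(src, dest));
    }
}
```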

Let's Measure - Example 2

Use a sized StringBuilder all the time


public static void main(String[] args) {
    final String a = "This is a test";
    final String b = "More fun with a string";
    final long c = System.currentTimeMillis(); // e.g. 1684581249117
    var l = a.length() + b.length() + 13;

    var s1 = System.nanoTime();
    {
        var sb = new StringBuilder();
        sb.append(a); sb.append(b); sb.append(c);
        var r = sb.toString();
    }
    var e1 = System.nanoTime();

    var s2 = System.nanoTime();
    {
        var sb = new StringBuilder(l);
        sb.append(a); sb.append(b); sb.append(c);
        var r = sb.toString();
    }
    var e2 = System.nanoTime();

    var s3 = System.nanoTime();
    {
        var r = a + b + c;
    }
    var e3 = System.nanoTime();

    System.out.printf("Pure String usage  : %d ns%n", e3 - s3);
    System.out.printf("StringBuilder      : %d ns%n", e1 - s1);
    System.out.printf("Sized StringBuilder: %d ns%n", e2 - s2);
}

$ java src/main/java/org/xc/jmh/naive/Example2.java
Pure String usage  : 2,590,902 ns
StringBuilder      :    43,863 ns
Sized StringBuilder:     5,591 ns

$ java src/main/java/org/xc/jmh/naive/Example2.java
Pure String usage  : 2,926,695 ns
StringBuilder      :    35,547 ns
Sized StringBuilder:     7,364 ns

$ java src/main/java/org/xc/jmh/naive/Example2.java
Pure String usage  : 2,659,792 ns
StringBuilder      :    27,241 ns
Sized StringBuilder:     5,030 ns
  • Makes sense, matches common Internet knowledge
  • Sized builder is dramatically faster
  • Pure String operations are horrible
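Before comparing timings, it is worth checking that the variants even produce the same string. A stdlib sketch (hypothetical class name; a fixed timestamp replaces System.currentTimeMillis() for reproducibility):

```java
// Sanity check for Example 2: all three string-building variants
// must agree before any timing comparison is meaningful.
public class StringBuildCheck {
    public static void main(String[] args) {
        final String a = "This is a test";
        final String b = "More fun with a string";
        final long c = 1684581249117L; // fixed value for reproducibility

        var sb1 = new StringBuilder();
        sb1.append(a).append(b).append(c);

        var sb2 = new StringBuilder(a.length() + b.length() + 13);
        sb2.append(a).append(b).append(c);

        var r3 = a + b + c;

        System.out.println(sb1.toString().equals(r3));
        System.out.println(sb2.toString().equals(r3));
    }
}
```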

Let's Measure - Example 3

Streams are slow and parallel() is always better


public static void main(String[] args) {
    var data = new long[10000];
    for (int i = 0; i < data.length; i++) {
        data[i] = i;
    }

    // manual
    var s1 = System.nanoTime(); {
        var total = 0.0d;
        for (int i = 0; i < data.length; i++) {
            total += calc(data[i]);
        }
        var r = total / data.length;
    }
    var e1 = System.nanoTime();

    // stream
    var s2 = System.nanoTime(); {
        var r = Arrays.stream(data).map(i -> calc(i)).average();
    }
    var e2 = System.nanoTime();

    // stream parallel
    var s3 = System.nanoTime(); {
        var r = Arrays.stream(data).parallel().map(i -> calc(i)).average();
    }
    var e3 = System.nanoTime();
    ...
    }

    private static long calc(long c) {
        var t = c;
        for (long i = 100; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        return t;
    }
}

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  5,832,700 ns
Stream         : 23,259,444 ns
Parallel stream: 10,246,827 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  5,834,677 ns
Stream         : 20,817,603 ns
Parallel stream: 10,223,933 ns

$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual         :  6,054,650 ns
Stream         : 20,718,407 ns
Parallel stream: 10,415,500 ns
  • Streams are bad
  • Parallel streams are obviously better
  • But why are the results not the same all the time?
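A sanity check the example skips: do the three variants compute the same average at all? A stdlib sketch (hypothetical class name); note the manual loop here accumulates in a long rather than the slide's double, so the comparison can be exact, since long addition wraps the same way regardless of order:

```java
import java.util.Arrays;

public class StreamCheck {
    // same calculation as in Example 3
    static long calc(long c) {
        var t = c;
        for (long i = 100; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & 0xFFFFFFFFFFFFL;
        }
        return t;
    }

    public static void main(String[] args) {
        var data = new long[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }

        // manual loop, accumulating in a long (the slide used a double,
        // which rounds for large values and would not compare exactly)
        var total = 0L;
        for (int i = 0; i < data.length; i++) {
            total += calc(data[i]);
        }
        var manual = (double) total / data.length;

        var sequential = Arrays.stream(data).map(StreamCheck::calc).average().orElseThrow();
        var parallel = Arrays.stream(data).parallel().map(StreamCheck::calc).average().orElseThrow();

        System.out.println(manual == sequential && sequential == parallel);
    }
}
```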

Thesis - What might be wrong

Forget everything you just saw

  • Of course, all tests before are garbage!
  • But they confirm the Internet truth?!
  • They are just nice and simple Java
  • So, what could be wrong with these?
  • Theorize or show off... please!

Goals

What do we really want?

  • Repeatability: Easily repeat tests
  • Predictability: Get the same result when nothing has changed
  • Changeability: Easily change things (object under test)
  • Controllability: Control parameters and vary them (environment)
  • Debuggability: Get additional insights
  • Understandability: Things are simple, so one can understand the influence of changes

Our true goal: We want to learn and understand by changing details under controlled conditions. Turn a knob a bit and see what happens. We want to derive reasoning or even proof from our changes.

The right tool

Don't ask what you can do, ask others what they have done

  • Runtimes > 30 s are mostly OK to measure naively
  • Don't reinvent the wheel
  • Don't spend 2 days on writing a framework using your assumptions
  • Use a proven tool chain that helps to mitigate a lot of, yet unknown, problems
  • JMH - Java Microbenchmark Harness to the rescue

JMH is a Java harness for building, running, and analysing benchmarks written in Java and other languages targeting the JVM.

  • Built to address shortcomings of naive benchmarks
  • Tested and proven by a large community
  • Used for the JDK, Open Source
  • Comes with built-in profiling tools
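A new JMH project is usually bootstrapped with the official Maven archetype; the groupId, artifactId, and version below are placeholders from the JMH documentation:

```shell
# generate a new JMH benchmark project from the official archetype
mvn archetype:generate \
  -DinteractiveMode=false \
  -DarchetypeGroupId=org.openjdk.jmh \
  -DarchetypeArtifactId=jmh-java-benchmark-archetype \
  -DgroupId=org.sample \
  -DartifactId=test \
  -Dversion=1.0
```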

Disclaimer

This is not an extensive JMH training.

It is about the mistakes you can make when microbenchmarking, even with the best tool available.

"A fool with a tool is still a fool."

Grady Booch

A First Benchmark

Learning by Example

As Simple as Possible

Just naive and simple again


@Warmup(iterations = 5, time = 2,
        timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1,
        timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class A01 {
    long time;

    @Setup
    public void setup() {
        time = System.currentTimeMillis();
    }

    @Benchmark
	public void add() {
		var x =
            (time * 0x5DEECE66DL + 0xBL + time) /
            (time * time * 0.42d);
 	}
}

# ensure a full and clean build
$ mvn clean verify

# Only execute our example, not everything!
$ java -jar target/benchmarks.jar "A01"

# Warmup Iteration   1: 0.621 ns/op
# Warmup Iteration   2: 0.381 ns/op
# Warmup Iteration   3: 0.378 ns/op
# Warmup Iteration   4: 0.798 ns/op
# Warmup Iteration   5: 1.094 ns/op
Iteration   1: 1.090 ns/op
Iteration   2: 1.087 ns/op
Iteration   3: 1.085 ns/op
Iteration   4: 1.084 ns/op
Iteration   5: 1.095 ns/op
Iteration   6: 1.080 ns/op
Iteration   7: 1.079 ns/op
Iteration   8: 1.077 ns/op
Iteration   9: 1.089 ns/op
Iteration  10: 1.006 ns/op

Benchmark  Mode  Cnt  Score   Error  Units
A01.add    avgt   10  1.077 ± 0.039  ns/op

How good is it?

Can we trust it?


# T14s Gen1, Linux 5.15
# Warmup Iteration   1: 0.621 ns/op
# Warmup Iteration   2: 0.381 ns/op
# Warmup Iteration   3: 0.378 ns/op
# Warmup Iteration   4: 0.798 ns/op
# Warmup Iteration   5: 1.094 ns/op
Iteration   1: 1.090 ns/op
Iteration   2: 1.087 ns/op
Iteration   3: 1.085 ns/op
Iteration   4: 1.084 ns/op
Iteration   5: 1.095 ns/op
Iteration   6: 1.080 ns/op
Iteration   7: 1.079 ns/op
Iteration   8: 1.077 ns/op
Iteration   9: 1.089 ns/op
Iteration  10: 1.006 ns/op

Benchmark  Mode  Cnt  Score   Error  Units
A01.add    avgt   10  1.077 ± 0.039  ns/op

# T14s Gen1, Linux 5.15
# Warmup Iteration   1: 1.105 ns/op
# Warmup Iteration   2: 1.075 ns/op
# Warmup Iteration   3: 1.082 ns/op
# Warmup Iteration   4: 1.076 ns/op
# Warmup Iteration   5: 1.070 ns/op
Iteration   1: 1.112 ns/op
Iteration   2: 0.370 ns/op
Iteration   3: 0.377 ns/op
Iteration   4: 0.767 ns/op
Iteration   5: 1.097 ns/op
Iteration   6: 1.094 ns/op
Iteration   7: 0.383 ns/op
Iteration   8: 0.384 ns/op
Iteration   9: 0.374 ns/op
Iteration  10: 0.375 ns/op

Benchmark  Mode  Cnt  Score   Error  Units
A01.add    avgt   10  0.633 ± 0.521  ns/op

# T14s Gen1, Linux 5.15
# Warmup Iteration   1: 0.939 ns/op
# Warmup Iteration   2: 1.061 ns/op
# Warmup Iteration   3: 1.137 ns/op
# Warmup Iteration   4: 1.093 ns/op
# Warmup Iteration   5: 1.071 ns/op
Iteration   1: 1.071 ns/op
Iteration   2: 1.071 ns/op
Iteration   3: 1.070 ns/op
Iteration   4: 1.070 ns/op
Iteration   5: 0.378 ns/op
Iteration   6: 0.376 ns/op
Iteration   7: 0.375 ns/op
Iteration   8: 0.383 ns/op
Iteration   9: 0.374 ns/op
Iteration  10: 0.376 ns/op

Benchmark  Mode  Cnt  Score   Error  Units
A01.add    avgt   10  0.654 ± 0.541  ns/op

How good is it?

Can we trust it?


# Hetzner, AMD DEDICATED, Cloud, HEL
# Warmup Iteration   1: 0.292 ns/op
# Warmup Iteration   2: 0.289 ns/op
# Warmup Iteration   3: 0.299 ns/op
# Warmup Iteration   4: 0.294 ns/op
# Warmup Iteration   5: 0.288 ns/op
Iteration   1: 0.291 ns/op
Iteration   2: 0.287 ns/op
Iteration   3: 0.296 ns/op
Iteration   4: 0.287 ns/op
Iteration   5: 0.286 ns/op
Iteration   6: 0.287 ns/op
Iteration   7: 0.297 ns/op
Iteration   8: 0.291 ns/op
Iteration   9: 0.289 ns/op
Iteration  10: 0.289 ns/op

Benchmark  Mode  Cnt  Score   Error  Units
A01.add    avgt   10  0.290 ± 0.005  ns/op

# Hetzner, AMD DEDICATED, Cloud, HEL
# Warmup Iteration   1: 0.288 ns/op
# Warmup Iteration   2: 0.302 ns/op
# Warmup Iteration   3: 0.286 ns/op
# Warmup Iteration   4: 0.287 ns/op
# Warmup Iteration   5: 0.295 ns/op
Iteration   1: 0.287 ns/op
Iteration   2: 0.282 ns/op
Iteration   3: 0.286 ns/op
Iteration   4: 0.287 ns/op
Iteration   5: 0.293 ns/op
Iteration   6: 0.282 ns/op
Iteration   7: 0.284 ns/op
Iteration   8: 0.288 ns/op
Iteration   9: 0.292 ns/op
Iteration  10: 0.286 ns/op

Benchmark  Mode  Cnt  Score   Error  Units
A01.add    avgt   10  0.287 ± 0.006  ns/op

# Hetzner, AMD DEDICATED, Cloud, HEL
# Warmup Iteration   1: 0.287 ns/op
# Warmup Iteration   2: 0.306 ns/op
# Warmup Iteration   3: 0.286 ns/op
# Warmup Iteration   4: 0.286 ns/op
# Warmup Iteration   5: 0.293 ns/op
Iteration   1: 0.287 ns/op
Iteration   2: 0.286 ns/op
Iteration   3: 0.288 ns/op
Iteration   4: 0.288 ns/op
Iteration   5: 0.295 ns/op
Iteration   6: 0.286 ns/op
Iteration   7: 0.287 ns/op
Iteration   8: 0.295 ns/op
Iteration   9: 0.295 ns/op
Iteration  10: 0.286 ns/op

Benchmark  Mode  Cnt  Score   Error  Units
A01.add    avgt   10  0.289 ± 0.006  ns/op

It is YOU now!

Execute and report.

First Learnings

First mistakes have been made

  • Never trust the first result
  • Always run the test multiple times
  • Get yourself a second machine
  • Maybe even the future target system
  • Question: Does it make sense to run 1 ns tests?
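The "run multiple times" lesson can be applied from the command line without touching the annotations; JMH's CLI flags override the annotated forks and iteration counts (benchmark name as in the examples above):

```shell
# rerun the same benchmark with 3 fresh JVM forks,
# 5 warmup and 10 measurement iterations each
java -jar target/benchmarks.jar "A01" -f 3 -wi 5 -i 10

# list all available command line options
java -jar target/benchmarks.jar -h
```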

A main method for debugging

Adding a main for debugging purposes


public class A02
{
    ...

    public static void main(String[] args)
        throws RunnerException
    {
        var opt = new OptionsBuilder()
                // important, otherwise we will run
                // all tests!
                .include(A02.class.getSimpleName())
                // 0 is needed for debugging,
                // not for running
                .forks(0)
                .build();

        new Runner(opt).run();
    }
}
  • Perfect for a quick execution
  • Good for debugging
  • Limited validity for serious measurements
  • Requires a full Maven build or enabled annotation processing

Is the result right?

0.3 ns for an add?


public class A03a
{
	long time;

	@Setup
	public void setup()
	{
		time = System.currentTimeMillis();
	}

	@Benchmark
	public void addCheap()
	{
		var x = 1 + 1;
	}

	@Benchmark
	public void addExpensive()
	{
		var x =
            (time * 0x5DEECE66DL + 0xBL + time)  /
            (time * time * 0.42d);
	}
}

# ensure a full and clean build
$ mvn clean verify

# Only execute our example, not everything!
$ java -jar target/benchmarks.jar "A03a"

Benchmark          Mode  Cnt  Score   Error  Units
A03a.addCheap      avgt    5  0.288 ± 0.009  ns/op
A03a.addExpensive  avgt    5  0.290 ± 0.019  ns/op

# Disable the JIT
$ java -jar target/benchmarks.jar "A03a" -jvmArgs "-Xint"

Benchmark          Mode  Cnt   Score   Error  Units
A03a.addCheap      avgt    5  22.542 ± 0.456  ns/op
A03a.addExpensive  avgt    5  45.016 ± 2.680  ns/op

The Clock

Can we measure time?

Time Measurement

Whom do you ask for time?

  • API >> JVM >> OS >> hardware
  • APIs: System.nanoTime() and System.currentTimeMillis()
  • Whoops! We cannot measure less than 1 ns?
  • 3 GHz CPU - 0.33 ns per clock cycle, 6 instructions per cycle
  • A lot can/could be done in 1 ns (about 3 to 18 instructions)
  • Even worse, we cannot measure more precisely than 20 - 30 ns
  • Your clock is... code
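You can probe the effective granularity yourself by taking back-to-back System.nanoTime() samples and looking at the smallest positive delta (a stdlib sketch, hypothetical class name):

```java
// Sample System.nanoTime() back-to-back and report the smallest
// positive difference between consecutive readings - a rough estimate
// of the clock's usable resolution on this machine.
public class NanoTimeProbe {
    public static void main(String[] args) {
        var samples = new long[1000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = System.nanoTime();
        }

        long min = Long.MAX_VALUE;
        for (int i = 1; i < samples.length; i++) {
            long delta = samples[i] - samples[i - 1];
            if (delta > 0 && delta < min) {
                min = delta;
            }
        }
        System.out.println("Smallest observed positive delta: " + min + " ns");
    }
}
```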

Time Measurement

We can measure nanoseconds, can't we?


@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1000,
    timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 1000,
    timeUnit = TimeUnit.MILLISECONDS)
@Fork(1)
@State(Scope.Thread)
public class TimerResolution
{
    private long lastValue;

    @Benchmark
    public long latency_nanotime()
    {
        return System.nanoTime();
    }

    public static double precision(long[] t)
    {
        // take back-to-back samples
        for (int i = 0; i < t.length; i++)
        {
            t[i] = System.nanoTime();
        }
        // average distance between two consecutive samples
        return (double) (t[t.length - 1] - t[0]) / (t.length - 1);
    }
}

Benchmark                         Mode  Cnt   Score   Error  Units
TimerResolution.latency_nanotime  avgt    5  22.244 ± 0.474  ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, TSC
$ java -cp target/classes/ org.xc.jmh.TimerResolution
Precision 1 - 21.66
Precision 2 - 23.80
Latency     - 24.27

112232081516561
112232081516581 - 20 ns
112232081516601 - 20 ns
112232081516621 - 20 ns
112232081516651 - 30 ns
112232081516671 - 20 ns
...
112232081516951
112232081516972 - 21 ns
112232081516992 - 20 ns
...
  • Solution: Calculate the runtime, don't measure it.

It is YOU now!

Execute and report.

Don't measure!

Math is your friend

  • A stop watch that measures only minutes... sigh
  • How do you get something timed down to the second?
  • Let your driver go for 100 laps and not one.
  • 100 laps = 209 minutes
  • 1 lap = 00:02:05.400
  • And we can resolve down to milliseconds
  • This is how JMH measures things (mostly)!
  • Important: Measuring this way does not automatically make the result correct.
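The "100 laps" idea in code: time many invocations in a single measurement and divide, so the 20 to 30 ns clock granularity amortizes away (a stdlib sketch; class and workload function are hypothetical):

```java
// Batch timing: measure one block of many invocations instead of one
// invocation, then divide. This is roughly how JMH derives ns/op.
public class BatchTiming {
    // hypothetical cheap workload
    static long work(long x) {
        return (x * 0x5DEECE66DL + 0xBL) & 0xFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        final int iterations = 10_000_000;
        long sink = 0; // consume results so the work cannot be dropped

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += work(i);
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("%.3f ns/op (sink=%d)%n",
                (double) elapsed / iterations, sink);
    }
}
```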

OS - Clock Source

Not every clock is created equal

  • RTC Real Time Clock
  • PIT Programmable Interval Timer
  • HPET High Performance Event Timer
  • ACPI PM ACPI Power Management Timer
  • APIC Advanced Programmable Interrupt Controller
  • TSC Time Stamp Counter

The x86 Time Stamp Counter (TSC) is a low-latency, monotonically increasing clock, typically used to measure event durations.


Benchmark                             Mode  Cnt      Score  Units
TimerResolution.latency_nanotime      avgt   10  1,128.204  ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, HPET
374172181482681
374172181484078
374172181484986
374172181485964
374172181487291
374172181488199
Difference between calls 900 to 1,400 ns

Benchmark                             Mode  Cnt   Score   Units
TimerResolution.latency_nanotime      avgt   10  25.519   ns/op

# Samples Lenovo T14s, AMD, Linux 5.15, TSC
495719354555374
495719354555394
495719354555424
495719354555444
495719354555464
495719354555484
Difference between calls 20 to 30 ns

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc hpet acpi_pm

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

$ echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource

Back to 1 + 1

We are being outsmarted

Debug into it

Setup theories


public class A03b {
    long time;
	Long lTime;

	@Setup
	public void setup() {
		time = System.currentTimeMillis();
		lTime = Long.valueOf(time);
	}

	@Benchmark
	public void addCheap1()	{
		var x = 1 + 1727610292048L;
	}

    @Benchmark
	public void addCheap2()	{
		var x = 1 + time;
	}

	@Benchmark
	public void addCheap3() {
		var x = 1 + lTime.longValue();
	}

	@Benchmark
	public void addExpensive() {
		var x =
            (time * 0x5DEECE66DL + 0xBL + time) /
            (time * time * 0.42d);
	}
}

Benchmark          Mode  Cnt   Score   Error  Units
A03b.addCheap1     avgt    5   0.289 ± 0.014  ns/op
A03b.addCheap2     avgt    5   0.291 ± 0.015  ns/op
A03b.addCheap3     avgt    5   0.390 ± 0.025  ns/op
A03b.addExpensive  avgt    5   0.288 ± 0.024  ns/op

# -jvmArgs "-Xint"
A03b.addCheap1     avgt    5  16.056 ± 1.308  ns/op
A03b.addCheap2     avgt    5  20.431 ± 3.972  ns/op
A03b.addCheap3     avgt    5  28.280 ± 1.847  ns/op
A03b.addExpensive  avgt    5  39.193 ± 3.921  ns/op
  • No return value? No call. Dead code. The JIT removes it!
  • Not when we might have side effects, though.

Popping the hood open

Don't get scared, won't happen that often


[Entry Point]
  # {method} {0x00007f89d0475488} 'addCheap2' '()V' in 'org/xc/jmh/A03b'
  #           [sp+0x20]  (sp of caller)
  0x00007f8a24476f80:   mov    0x8(%rsi),%r10d
  0x00007f8a24476f84:   movabs $0x7f89db000000,%r11
  0x00007f8a24476f8e:   add    %r11,%r10
  0x00007f8a24476f91:   cmp    %r10,%rax
  0x00007f8a24476f94:   jne    0x00007f8a23d53d80           ;   {runtime_call ic_miss_stub}
  0x00007f8a24476f9a:   xchg   %ax,%ax
  0x00007f8a24476f9c:   nopl   0x0(%rax)
[Verified Entry Point]
  0x00007f8a24476fa0:   sub    $0x18,%rsp
  0x00007f8a24476fa7:   mov    %rbp,0x10(%rsp)
  0x00007f8a24476fac:   cmpl   $0x0,0x20(%r15)
  0x00007f8a24476fb4:   jne    0x00007f8a24476fe3
  0x00007f8a24476fba:   add    $0x10,%rsp
  0x00007f8a24476fbe:   pop    %rbp
  0x00007f8a24476fbf:   cmp    0x450(%r15),%rsp     ;   {poll_return}
  0x00007f8a24476fc6:   ja     0x00007f8a24476fcd
  0x00007f8a24476fcc:   ret
  0x00007f8a24476fcd:   movabs $0x7f8a24476fbf,%r10 ; {internal_word}
  0x00007f8a24476fd7:   mov    %r10,0x468(%r15)
  0x00007f8a24476fde:   jmp    0x00007f8a23d5a400   ; {runtime_call SafepointBlob}
  0x00007f8a24476fe3:   call   Stub::nmethod_entry_barrier ; {runtime_call StubRoutines (final stubs)}
  0x00007f8a24476fe8:   jmp    0x00007f8a24476fba
  0x00007f8a24476fed:   hlt
  0x00007f8a24476fee:   hlt
  0x00007f8a24476fef:   hlt
[Exception Handler]
  0x00007f8a24476ff0:   jmp    0x00007f8a23e0a600 ; {no_reloc}
[Deopt Handler Code]
  0x00007f8a24476ff5:   call   0x00007f8a24476ffa
  0x00007f8a24476ffa:   subq   $0x5,(%rsp)
  0x00007f8a24476fff:   jmp    0x00007f8a23d596a0 ; {runtime_call DeoptimizationBlob}
  0x00007f8a24477004:   hlt
  0x00007f8a24477005:   hlt
  0x00007f8a24477006:   hlt
  0x00007f8a24477007:   hlt

[Entry Point]
  # {method} {0x00007f0014475560} 'addCheap3' '()V' in 'org/xc/jmh/A03b'
  #           [sp+0x20]  (sp of caller)
  0x00007f0068475b80:   mov    0x8(%rsi),%r10d
  0x00007f0068475b84:   movabs $0x7f001f000000,%r11
  0x00007f0068475b8e:   add    %r11,%r10
  0x00007f0068475b91:   cmp    %r10,%rax
  0x00007f0068475b94:   jne    0x00007f0067d53d80           ;   {runtime_call ic_miss_stub}
  0x00007f0068475b9a:   xchg   %ax,%ax
  0x00007f0068475b9c:   nopl   0x0(%rax)
[Verified Entry Point]
  0x00007f0068475ba0:   mov    %eax,-0x14000(%rsp)
  0x00007f0068475ba7:   push   %rbp
  0x00007f0068475ba8:   sub    $0x10,%rsp
  0x00007f0068475bac:   cmpl   $0x0,0x20(%r15)
  0x00007f0068475bb4:   jne    0x00007f0068475bfe
  0x00007f0068475bba:   cmp    0xc(%rsi),%r12d
  0x00007f0068475bbe:   je     0x00007f0068475bd3
  0x00007f0068475bc0:   add    $0x10,%rsp
  0x00007f0068475bc4:   pop    %rbp
  0x00007f0068475bc5:   cmp    0x450(%r15),%rsp     ;   {poll_return}
  0x00007f0068475bcc:   ja     0x00007f0068475be8
  0x00007f0068475bd2:   ret
  0x00007f0068475bd3:   mov    $0xfffffff6,%esi
  0x00007f0068475bd8:   data16 xchg %ax,%ax
  0x00007f0068475bdb:   call   0x00007f0067d59300   ; ImmutableOopMap {}
                                                    ; *invokevirtual longValue
                                                    ; - org.xc.jmh.A03b::addCheap3@5 (line 53)
                                                    ; {runtime_call UncommonTrapBlob}
  0x00007f0068475be0:   nopl   0x1d0(%rax,%rax,1)   ;   {other}
  0x00007f0068475be8:   movabs $0x7f0068475bc5,%r10 ;   {internal_word}
  0x00007f0068475bf2:   mov    %r10,0x468(%r15)
  0x00007f0068475bf9:   jmp    0x00007f0067d5a400   ; {runtime_call SafepointBlob}
  0x00007f0068475bfe:   call   Stub::nmethod_entry_barrier ; {runtime_call StubRoutines (final stubs)}
  0x00007f0068475c03:   jmp    0x00007f0068475bba
[Exception Handler]
  0x00007f0068475c08:   jmp    0x00007f0067e08100           ;   {no_reloc}
[Deopt Handler Code]
  0x00007f0068475c0d:   call   0x00007f0068475c12
  0x00007f0068475c12:   subq   $0x5,(%rsp)
  0x00007f0068475c17:   jmp    0x00007f0067d596a0           ;   {runtime_call DeoptimizationBlob}
  0x00007f0068475c1c:   hlt
  0x00007f0068475c1d:   hlt
  0x00007f0068475c1e:   hlt
  0x00007f0068475c1f:   hlt

Make us do stuff

Avoid dropping our code by returning it


public class A04a {
    ...
    @Benchmark
	public long addCheap1()	{
		return 1 + 1727610292048L;
	}

    @Benchmark
	public long addCheap2()	{
		return 1 + time;
	}

	@Benchmark
	public long addCheap3()	{
		return 1 + lTime.longValue();
	}

	@Benchmark
	public double addExpensive() {
		return
            (time * 0x5DEECE66DL + 0xBL + time) /
            (time * time * 0.42d);
	}
    ...
}

Benchmark          Mode  Cnt  Score   Error  Units
A04a.addCheap1     avgt    3  0.287 ± 0.010  ns/op
A04a.addCheap2     avgt    3  0.374 ± 0.087  ns/op
A04a.addCheap3     avgt    3  0.555 ± 0.216  ns/op
A04a.addExpensive  avgt    3  4.436 ± 0.008  ns/op
  • long-add is still cheap
  • Long is really expensive
  • Real math is really that expensive?!

Popping the hood open - AGAIN

Sorry...


[Entry Point]
  # {method} {0x00007f5c2c4754c0} 'addCheap2' '()J' in 'org/xc/jmh/A04a'
  #           [sp+0x20]  (sp of caller)
  0x00007f5c80475a80:   mov    0x8(%rsi),%r10d
  0x00007f5c80475a84:   movabs $0x7f5c37000000,%r11
  0x00007f5c80475a8e:   add    %r11,%r10
  0x00007f5c80475a91:   cmp    %r10,%rax
  0x00007f5c80475a94:   jne    0x00007f5c7fd53d80           ;   {runtime_call ic_miss_stub}
  0x00007f5c80475a9a:   xchg   %ax,%ax
  0x00007f5c80475a9c:   nopl   0x0(%rax)
[Verified Entry Point]
  0x00007f5c80475aa0:   sub    $0x18,%rsp
  0x00007f5c80475aa7:   mov    %rbp,0x10(%rsp)
  0x00007f5c80475aac:   cmpl   $0x1,0x20(%r15)
  0x00007f5c80475ab4:   jne    0x00007f5c80475aec
  0x00007f5c80475aba:   mov    $0x1,%eax
  0x00007f5c80475abf:   add    0x10(%rsi),%rax
  0x00007f5c80475ac3:   add    $0x10,%rsp
  0x00007f5c80475ac7:   pop    %rbp
  0x00007f5c80475ac8:   cmp    0x450(%r15),%rsp             ;   {poll_return}
  0x00007f5c80475acf:   ja     0x00007f5c80475ad6
  0x00007f5c80475ad5:   ret
  0x00007f5c80475ad6:   movabs $0x7f5c80475ac8,%r10         ;   {internal_word}
  0x00007f5c80475ae0:   mov    %r10,0x468(%r15)
  0x00007f5c80475ae7:   jmp    0x00007f5c7fd5a400           ;   {runtime_call SafepointBlob}
  0x00007f5c80475aec:   call   Stub::nmethod_entry_barrier  ;   {runtime_call StubRoutines (final stubs)}
  0x00007f5c80475af1:   jmp    0x00007f5c80475aba
  0x00007f5c80475af6:   hlt
  0x00007f5c80475af7:   hlt
[Exception Handler]
  0x00007f5c80475af8:   jmp    0x00007f5c7fe08200           ;   {no_reloc}
[Deopt Handler Code]
  0x00007f5c80475afd:   call   0x00007f5c80475b02
  0x00007f5c80475b02:   subq   $0x5,(%rsp)
  0x00007f5c80475b07:   jmp    0x00007f5c7fd596a0           ;   {runtime_call DeoptimizationBlob}
  0x00007f5c80475b0c:   hlt
  0x00007f5c80475b0d:   hlt
  0x00007f5c80475b0e:   hlt
  0x00007f5c80475b0f:   hlt

[Entry Point]
  # {method} {0x00007faf64475588} 'addCheap3' '()J' in 'org/xc/jmh/A04a'
  #           [sp+0x20]  (sp of caller)
  0x00007fafb8475a00:   mov    0x8(%rsi),%r10d
  0x00007fafb8475a04:   movabs $0x7faf6f000000,%r11
  0x00007fafb8475a0e:   add    %r11,%r10
  0x00007fafb8475a11:   cmp    %r10,%rax
  0x00007fafb8475a14:   jne    0x00007fafb7d53d80           ;   {runtime_call ic_miss_stub}
  0x00007fafb8475a1a:   xchg   %ax,%ax
  0x00007fafb8475a1c:   nopl   0x0(%rax)
[Verified Entry Point]
  0x00007fafb8475a20:   mov    %eax,-0x14000(%rsp)
  0x00007fafb8475a27:   push   %rbp
  0x00007fafb8475a28:   sub    $0x10,%rsp
  0x00007fafb8475a2c:   cmpl   $0x0,0x20(%r15)
  0x00007fafb8475a34:   jne    0x00007fafb8475a8a
  0x00007fafb8475a3a:   mov    0xc(%rsi),%r11d
  0x00007fafb8475a3e:   test   %r11d,%r11d
  0x00007fafb8475a41:   je     0x00007fafb8475a60
  0x00007fafb8475a43:   mov    $0x1,%eax
  0x00007fafb8475a48:   add    0x10(%r12,%r11,8),%rax
  0x00007fafb8475a4d:   add    $0x10,%rsp
  0x00007fafb8475a51:   pop    %rbp
  0x00007fafb8475a52:   cmp    0x450(%r15),%rsp             ;   {poll_return}
  0x00007fafb8475a59:   ja     0x00007fafb8475a74
  0x00007fafb8475a5f:   ret
  0x00007fafb8475a60:   mov    $0xfffffff6,%esi
  0x00007fafb8475a65:   xchg   %ax,%ax
  0x00007fafb8475a67:   call   0x00007fafb7d59300 ; ImmutableOopMap {}
                                                  ; *invokevirtual longValue {reexecute=0 rethrow=0 return_oop=0}
                                                  ; - org.xc.jmh.A04a::addCheap3@5 (line 53)
                                                  ;   {runtime_call UncommonTrapBlob}
  0x00007fafb8475a6c:   nopl   0x1dc(%rax,%rax,1)           ;   {other}
  0x00007fafb8475a74:   movabs $0x7fafb8475a52,%r10         ;   {internal_word}
  0x00007fafb8475a7e:   mov    %r10,0x468(%r15)
  0x00007fafb8475a85:   jmp    0x00007fafb7d5a400           ;   {runtime_call SafepointBlob}
  0x00007fafb8475a8a:   call   Stub::nmethod_entry_barrier  ;   {runtime_call StubRoutines (final stubs)}
  0x00007fafb8475a8f:   jmp    0x00007fafb8475a3a
  0x00007fafb8475a94:   hlt
  0x00007fafb8475a95:   hlt
  0x00007fafb8475a96:   hlt
  0x00007fafb8475a97:   hlt
[Exception Handler]
  0x00007fafb8475a98:   jmp    0x00007fafb7e07100           ;   {no_reloc}
[Deopt Handler Code]
  0x00007fafb8475a9d:   call   0x00007fafb8475aa2
  0x00007fafb8475aa2:   subq   $0x5,(%rsp)
  0x00007fafb8475aa7:   jmp    0x00007fafb7d596a0           ;   {runtime_call DeoptimizationBlob}
  0x00007fafb8475aac:   hlt
  0x00007fafb8475aad:   hlt
  0x00007fafb8475aae:   hlt
  0x00007fafb8475aaf:   hlt

Make us do stuff

Avoid dropping our code by blackholing it


public class A04d {
    ...
    @Benchmark
	public void addCheap1(Blackhole b)
	{
		b.consume(1 + 1727610292048L);
	}

	@Benchmark
	public void addCheap2(Blackhole b)
	{
		b.consume(1 + time);
	}

	@Benchmark
	public void addCheap3(Blackhole b)
	{
		b.consume( 1 + lTime.longValue());
	}

	@Benchmark
	public void addExpensive(Blackhole b)
	{
		b.consume((time * 0x5DEECE66DL + 0xBL + time) / (time * time * 0.42d));
	}
    ...
}

Benchmark          Mode  Cnt  Score   Error  Units
A04d.addCheap1     avgt    3  0.288 ± 0.021  ns/op
A04d.addCheap2     avgt    3  0.548 ± 0.034  ns/op
A04d.addCheap3     avgt    3  0.549 ± 0.040  ns/op
A04d.addExpensive  avgt    3  4.439 ± 0.010  ns/op
  • Blackholes will be handed to you

You are curious, aren't you?

One last time, promised


@Benchmark
public double addExpensive() {
	return
        (time * 0x5DEECE66DL + 0xBL + time) /
        (time * time * 0.42d);
}

# C1-Compiler
0x00007f8b009a5122:   mov    0x10(%rsi),%rdi
0x00007f8b009a5126:   movabs $0x5deece66d,%rax
0x00007f8b009a5130:   mov    %rax,%rsi
0x00007f8b009a5133:   mov    %rdi,%rax
0x00007f8b009a5136:   imul   %rsi,%rax
0x00007f8b009a513a:   movabs $0xb,%r10
0x00007f8b009a5144:   add    %r10,%rax
0x00007f8b009a5147:   add    %rdi,%rax
0x00007f8b009a514a:   vcvtsi2sd %rax,%xmm0,%xmm0
0x00007f8b009a514f:   mov    %rdi,%rsi
0x00007f8b009a5152:   mov    %rdi,%rax
0x00007f8b009a5155:   imul   %rsi,%rax
0x00007f8b009a5159:   vcvtsi2sd %rax,%xmm1,%xmm1
0x00007f8b009a515e:   vmulsd -0xc6(%rip),%xmm1,%xmm1
0x00007f8b009a5166:   vdivsd %xmm1,%xmm0,%xmm0

@Benchmark
public double addExpensive() {
	return
        (time * 0x5DEECE66DL + 0xBL + time) /
        (time * time * 0.42d);
}

# C2-Compiler
0x00007f8b08474bda:   mov    0x10(%rsi),%r10
0x00007f8b08474bde:   mov    %r10,%r11
0x00007f8b08474be1:   imul   %r10,%r11
0x00007f8b08474be5:   vcvtsi2sd %r11,%xmm0,%xmm0
0x00007f8b08474bea:   movabs $0x5deece66d,%r11
0x00007f8b08474bf4:   mov    %r10,%r8
0x00007f8b08474bf7:   imul   %r11,%r8
0x00007f8b08474bfb:   lea    0xb(%r8,%r10,1),%r10
0x00007f8b08474c00:   vcvtsi2sd %r10,%xmm1,%xmm1
0x00007f8b08474c05:   vmulsd -0x8d(%rip),%xmm0,%xmm0
0x00007f8b08474c0d:   vdivsd %xmm0,%xmm1,%xmm0

Warning - Still dangerous territory


public class A11 {
	double x1, x2;

	@Setup
	public void setup() {
		x1 = System.currentTimeMillis() / 42.1801;
		x2 = System.currentTimeMillis() / 42.1802;
	}

	private double compute(double d) {
		for (int c = 0; c < 10; c++) {
			d = d * d / Math.PI;
		}
		return d;
	}

	@Benchmark
	public double baseline() {
		return compute(x2);
	}

	@Benchmark
	public double measureWrong() {
		compute(x1);
		return compute(x2);
	}

	@Benchmark
	public void measureRight_Blackholes(Blackhole bh) {
		bh.consume(compute(x1));
		bh.consume(compute(x2));
	}

	@Benchmark
	public void measureWrong_Blackholes(Blackhole bh) {
		compute(x1);
		bh.consume(compute(x2));
	}
}

Benchmark                     Mode  Cnt   Score   Error  Units
A11.baseline                  avgt    3  12.338 ± 0.292  ns/op

A11.measureWrong              avgt    3  12.468 ± 3.478  ns/op
A11.measureWrong_Blackholes   avgt    3  12.362 ± 0.155  ns/op

A11.measureRight_Return       avgt    3  24.828 ± 0.265  ns/op
A11.measureRight_Blackhole    avgt    3  24.826 ± 0.187  ns/op
A11.measureRight_2Blackholes  avgt    3  25.602 ± 0.367  ns/op
  • Return and Blackhole are both fine
  • Deal with all results!
  • Blackholes have an overhead

Probe the JIT

Shine different light


// warm it up longer
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
// measure longer and more often
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class A04b {
    ...
    Same benchmark methods as before!
    ...
}

# Benchmark: org.xc.jmh.A04b.addCheap2

# Warmup Iteration   1: 0.372 ns/op
# Warmup Iteration   2: 0.373 ns/op
# Warmup Iteration   3: 0.549 ns/op
# Warmup Iteration   4: 0.549 ns/op
# Warmup Iteration   5: 0.553 ns/op
Iteration   1: 0.549 ns/op
Iteration   2: 0.548 ns/op
Iteration   3: 0.548 ns/op
Iteration   4: 0.569 ns/op
Iteration   5: 0.548 ns/op
Iteration   6: 0.549 ns/op
Iteration   7: 0.547 ns/op
Iteration   8: 0.547 ns/op
Iteration   9: 0.548 ns/op
Iteration  10: 0.547 ns/op
  • What the heck?
  • addCheap2 is more expensive now
  • Something changed during warmup

Let's watch the magic

Observing the JIT actions


/*
 * Each method warms and runs differently
 */
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class A04c {
    ...
    @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
    @Benchmark
    public long addShort() {
    	return 1 + time;
    }

    @Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
    @Benchmark
    public long addLong() {
    	return 1 + time;
    }
    ...
}

# java -jar target/benchmarks.jar "A04c.addShort"
Benchmark      Mode  Cnt  Score   Error  Units
A04c.addShort  avgt    3  0.371 ± 0.051  ns/op

# java -jar target/benchmarks.jar "A04c.addLong"
Benchmark      Mode  Cnt  Score   Error  Units
A04c.addLong  avgt   10  0.551 ± 0.009  ns/op

# Warmup Iteration   1: 0.379 ns/op
# Warmup Iteration   2: 0.379 ns/op
Iteration   1: 0.547 ns/op
Iteration   2: 0.564 ns/op
Iteration   3: 0.550 ns/op

# java -jar target/benchmarks.jar "A04c"
Benchmark      Mode  Cnt  Score   Error  Units
A04c.addShort  avgt    3  0.548 ± 0.010  ns/op
A04c.addLong   avgt   10  0.553 ± 0.014  ns/op
  • We get different results!
  • JMH forks a new process for every run, yet we still get this mess
  • Danger: this might or might not happen to you!

# java -jar target/benchmarks.jar "A04c.addShort" \
    -jvmArgsAppend "-XX:+PrintCompilation"

    259  554       3       org.xc.jmh.A04c::addShort (7 bytes)
    259  558       4       org.xc.jmh.A04c::addShort (7 bytes)
    259  554       3       org.xc.jmh.A04c::addShort (7 bytes)   made not entrant
    261  560 %     3       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
    262  561       3       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)
# Warmup Iteration   1:
    263  562 %     4       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
    265  560 %     3       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)   made not entrant
   1260  562 %     4       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)   made not entrant
   1272  591 %     4       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
   1274  592       4       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)
   1276  561       3       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)   made not entrant
0.372 ns/op
# Warmup Iteration   2: 0.371 ns/op
Iteration   1: 0.369 ns/op
Iteration   2: 0.377 ns/op
Iteration   3: 0.370 ns/op

# Warmup Iteration
    277  505       3       org.xc.jmh.A04c::addShort (7 bytes)
    280  510 %     3       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
    280  511       3       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)
    281  512 %     4       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
    285  510 %     3       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)   made not entrant
    285  508       4       org.xc.jmh.A04c::addShort (7 bytes)
    286  505       3       org.xc.jmh.A04c::addShort (7 bytes)   made not entrant
   1278  512 %     4       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)   made not entrant
   1290  553 %     4       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
   1295  554       4       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)
   1299  511       3       org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)   made not entrant
0.388 ns/op
# Warmup Iteration: 0.377 ns/op
Iteration   1: 0.547 ns/op
Iteration   2: 0.547 ns/op
Iteration   3: 0.547 ns/op

Tame the JIT

The JIT runs concurrently, which varies results


# Run it normally
# java -jar target/benchmarks.jar "A04c"




# Benchmark: org.xc.jmh.A04c.addLong
# Warmup Iteration   1: 0.381 ns/op
# Warmup Iteration   2: 0.378 ns/op
# Warmup Iteration   3: 0.547 ns/op
# Warmup Iteration   4: 0.547 ns/op
# Warmup Iteration   5: 0.547 ns/op
Iteration   1: 0.547 ns/op
Iteration   2: 0.550 ns/op
Iteration   3: 0.547 ns/op
Iteration   4: 0.547 ns/op
Iteration   5: 0.547 ns/op
Iteration   6: 0.548 ns/op
Iteration   7: 0.547 ns/op
Iteration   8: 0.547 ns/op
Iteration   9: 0.549 ns/op
Iteration  10: 0.548 ns/op

# java -jar target/benchmarks.jar "A04c" \
#      -jvmArgs "-XX:-TieredCompilation -XX:CICompilerCount=1 -Xbatch"
# -XX:-TieredCompilation : Go to C2 directly
# -XX:CICompilerCount=1  : Run only one compiler thread (-TieredC only)
# -Xbatch                : Wait for the C2 result

# Benchmark: org.xc.jmh.A04c.addLong
# Warmup Iteration   1: 0.385 ns/op
# Warmup Iteration   2: 0.547 ns/op
# Warmup Iteration   3: 0.375 ns/op
# Warmup Iteration   4: 0.378 ns/op
# Warmup Iteration   5: 0.377 ns/op
Iteration   1: 0.374 ns/op
Iteration   2: 0.371 ns/op
Iteration   3: 0.374 ns/op
Iteration   4: 0.370 ns/op
Iteration   5: 0.370 ns/op
Iteration   6: 0.371 ns/op
Iteration   7: 0.380 ns/op
Iteration   8: 0.369 ns/op
Iteration   9: 0.377 ns/op
Iteration  10: 0.402 ns/op
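The same JVM flags can also be applied programmatically via JMH's Runner API. A minimal sketch, assuming the benchmark class A04c is on the classpath (the class name RunA04c is made up for illustration):

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunA04c {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
            .include("A04c")            // same regex as on the command line
            .jvmArgsAppend(             // tame the JIT, as above
                "-XX:-TieredCompilation",
                "-XX:CICompilerCount=1",
                "-Xbatch")
            .build();
        new Runner(opt).run();
    }
}
```

Useful when benchmarks are launched from an IDE rather than the benchmarks.jar.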

Avoid inlining

Enforce a method call


public class A05 {
    ...
    @Benchmark
	@CompilerControl(CompilerControl.Mode.DONT_INLINE)
	public long addCheap1() {
		return 1 + 1727610292048L;
	}

	@Benchmark
	@CompilerControl(CompilerControl.Mode.DONT_INLINE)
	public long addCheap2()	{
		return 1 + time;
	}

	@Benchmark
	@CompilerControl(CompilerControl.Mode.DONT_INLINE)
	public long addCheap3()	{
		return 1 + lTime.longValue();
	}

	@Benchmark
	@CompilerControl(CompilerControl.Mode.DONT_INLINE)
	public double addExpensive() {
		return
            (time * 0x5DEECE66DL + 0xBL + time) /
            (time * time * 0.42d);
	}
    ...
}

# -XX:+Inline
Benchmark          Mode  Cnt  Score   Error  Units
A04b.addCheap1     avgt   10  0.292 ± 0.009  ns/op
A04b.addCheap2     avgt   10  0.548 ± 0.001  ns/op
A04b.addCheap3     avgt   10  0.550 ± 0.008  ns/op
A04b.addExpensive  avgt   10  4.120 ± 1.530  ns/op

# -XX:-Inline
A04b.addCheap1     avgt   10  2.507 ± 0.132  ns/op
A04b.addCheap2     avgt   10  2.735 ± 0.006  ns/op
A04b.addCheap3     avgt   10  2.745 ± 0.029  ns/op
A04b.addExpensive  avgt   10  5.223 ± 1.168  ns/op

# CompilerControl
A05.addCheap1      avgt   10  1.644 ± 0.004  ns/op
A05.addCheap2      avgt   10  1.653 ± 0.020  ns/op
A05.addCheap3      avgt   10  1.646 ± 0.006  ns/op
A05.addExpensive   avgt   10  5.067 ± 1.673  ns/op
  • The difference between the results is telling
  • Targeted disabling (@CompilerControl) yields different results than disabling inlining globally (-XX:-Inline)

Learnings

More mistakes have been made

  • Still never trust the first result
  • javac folds constant expressions at compile time
  • The JIT drops code
  • The JIT is not fully predictable
  • Always return all results(!) or blackhole them
  • Inlining makes a difference
  • Warming and measurement define outcomes

Detour: Setup

When you need some state

State for the Test

Prepare data for later measurement - Fixtures


public class B01
{
	final int SIZE = 1000;
	int[] data;

	@Setup(Level.Iteration)
	public void setup()
	{
		var r = new Random(42L);
		data = new int[SIZE];
		for (int i = 0; i < SIZE; i++)
		{
			data[i] = r.nextInt();
		}
	}

	@Benchmark
	public long sumClassic()
	{
		long result = 0;
		for (int i = 0; i < data.length; i++)
		{
			result += data[i];
		}

		return result;
	}
}
  • @Setup runs per trial (benchmark), per iteration, or per invocation
  • Level.Trial is the default!!!
  • Setup methods belong to a state object
  • State is scoped per benchmark or per thread
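As a sketch of how scope and setup level combine (hypothetical class names, standard JMH annotations assumed):

```java
import org.openjdk.jmh.annotations.*;

public class StateScopes {
    // One instance shared by all benchmark threads
    @State(Scope.Benchmark)
    public static class Shared {
        int[] data;

        @Setup(Level.Trial)          // Level.Trial is the default
        public void setup() { data = new int[1000]; }
    }

    // One instance per benchmark thread - no sharing, no contention
    @State(Scope.Thread)
    public static class PerThread {
        int counter;

        @Setup(Level.Iteration)      // re-runs before every iteration
        public void setup() { counter = 0; }
    }

    @Benchmark
    public int read(Shared shared, PerThread local) {
        return shared.data[local.counter++ % shared.data.length];
    }
}
```

JMH injects state objects as benchmark method parameters, so one method can mix both scopes.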

# Warmup Iteration   1: 74.855 ns/op
# Warmup Iteration   2: 73.813 ns/op
Iteration   1: 74.167 ns/op
Iteration   2: 74.652 ns/op
Iteration   3: 76.764 ns/op
Iteration   4: 76.902 ns/op
Iteration   5: 75.612 ns/op

Benchmark       Mode  Cnt   Score   Error  Units
B01.sumClassic  avgt    5  75.620 ± 4.716  ns/op

Invocation Danger

Don't run Level.Invocation unless....


# Trial - Per Benchmark
$ java -jar target/benchmarks.jar "B02"

# Warmup Iteration   1: 75.102 ns/op
# Warmup Iteration   2: 73.226 ns/op
Iteration   1: 72.837 ns/op
Iteration   2: 76.176 ns/op
Iteration   3: 77.893 ns/op
Iteration   4: 74.138 ns/op
Iteration   5: 74.346 ns/op

Benchmark       Mode  Cnt   Score   Error  Units
B02.sumClassic  avgt    5  75.078 ± 7.598  ns/op

# Iteration - Per Warmup and Measurement Round
$ java -jar target/benchmarks.jar "B01"

# Warmup Iteration   1: 75.288 ns/op
# Warmup Iteration   2: 73.892 ns/op
Iteration   1: 76.258 ns/op
Iteration   2: 73.315 ns/op
Iteration   3: 77.246 ns/op
Iteration   4: 76.957 ns/op
Iteration   5: 72.868 ns/op

Benchmark       Mode  Cnt   Score   Error  Units
B01.sumClassic  avgt    5  75.329 ± 8.008  ns/op

# Invocation - Before each(!!) call to the method
$ mvn clean verify
$ java -jar target/benchmarks.jar "B03"

# Warmup Iteration   1: 157.255 ns/op
# Warmup Iteration   2: 110.940 ns/op
Iteration   1: 110.901 ns/op
Iteration   2: 105.556 ns/op
Iteration   3: 105.466 ns/op
Iteration   4: 111.826 ns/op
Iteration   5: 111.462 ns/op

Benchmark       Mode  Cnt    Score    Error  Units
B03.sumClassic  avgt    5  109.042 ± 12.478  ns/op
  • Now we measure EACH invocation
  • Remember our clock issue!
  • Need enough work to suppress overhead
  • Expensive setup might skew the result
  • More warnings in the JavaDoc!
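The clock issue can be probed directly: measure the smallest observable distance between two System.nanoTime() calls. If the benchmarked work is not much larger than this, per-invocation timing mostly measures the clock. A standalone sketch:

```java
public class ClockProbe {
    // Estimate the granularity/overhead of System.nanoTime() itself.
    // With Level.Invocation, work near or below this size is drowned
    // out by the timing calls around it.
    static long smallestTick() {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < 1_000_000; i++) {
            long t0 = System.nanoTime();
            long t1 = System.nanoTime();
            if (t1 > t0) {
                best = Math.min(best, t1 - t0);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("smallest observable tick: " + smallestTick() + " ns");
    }
}
```

Typical values are in the tens of nanoseconds, which is why JMH batches many invocations per timing sample by default.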

Vary the State

Fixtures can be varied automatically


@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Fork(value = 1, jvmArgsAppend = {"-Xms4g", "-Xmx4g", "-XX:+AlwaysPreTouch"})
public class B04a {
	@Param({"1", "10", "100", "1000", "10000", "100000",
		"1000000", "10000000", "100000000", "1000000000"})
	int SIZE;
	int[] data;

	@Setup
	public void setup()	{
		var r = new Random(42L);
		data = new int[SIZE];
		for (int i = 0; i < SIZE; i++) {
			data[i] = r.nextInt();
		}
	}

	@Benchmark
	public long sumClassic() {
		long result = 0;
		for (int i = 0; i < data.length; i++) {
			result += data[i];
		}

		return result;
	}

	@Benchmark
	public long sumStream()	{
		return Arrays.stream(data).sum();
	}
}
  • @Param takes a list of Strings
  • JMH automatically converts the data
  • JMH automatically duplicates the test runs

Benchmark            (SIZE) Mode Cnt            Score        Error Units
B04a.sumClassic           1 avgt   5            1.376 ±       0.01 ns/op
B04a.sumClassic          10 avgt   5            3.553 ±       0.00 ns/op
B04a.sumClassic         100 avgt   5            7.698 ±       0.08 ns/op
B04a.sumClassic        1000 avgt   5           71.799 ±       0.90 ns/op
B04a.sumClassic       10000 avgt   5          764.769 ±      13.41 ns/op
B04a.sumClassic      100000 avgt   5        7,679.096 ±      38.78 ns/op
B04a.sumClassic     1000000 avgt   5       82,114.438 ±    4660.59 ns/op
B04a.sumClassic    10000000 avgt   5    1,037,525.538 ±   19741.51 ns/op
B04a.sumClassic   100000000 avgt   5   11,867,029.385 ±  130115.32 ns/op
B04a.sumClassic  1000000000 avgt   5  118,987,484.190 ± 3224256.55 ns/op

B04a.sumStream            1 avgt   5           21.295 ±       0.53 ns/op
B04a.sumStream           10 avgt   5           19.088 ±       1.43 ns/op
B04a.sumStream          100 avgt   5           38.379 ±       3.85 ns/op
B04a.sumStream         1000 avgt   5          284.431 ±       7.83 ns/op
B04a.sumStream        10000 avgt   5        2,763.850 ±     100.13 ns/op
B04a.sumStream       100000 avgt   5       27,620.697 ±     914.37 ns/op
B04a.sumStream      1000000 avgt   5      275,711.637 ±     507.95 ns/op
B04a.sumStream     10000000 avgt   5    2,775,992.907 ±    7054.71 ns/op
B04a.sumStream    100000000 avgt   5   27,783,853.383 ±   70817.23 ns/op
B04a.sumStream   1000000000 avgt   5  277,770,888.000 ±  427443.91 ns/op
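Outside of JMH, a quick sanity check confirms both implementations agree - with one subtlety: Arrays.stream(int[]).sum() accumulates in an int, so on data that overflows it can diverge from the long-accumulating loop. The sketch below widens per element (an adjustment for the check, not B04a's exact code):

```java
import java.util.Arrays;
import java.util.Random;

public class SumCheck {
    // Same loop as B04a.sumClassic
    static long sumClassic(int[] data) {
        long result = 0;
        for (int i = 0; i < data.length; i++) {
            result += data[i];
        }
        return result;
    }

    // Widened stream sum - avoids int overflow in the accumulator
    static long sumStream(int[] data) {
        return Arrays.stream(data).asLongStream().sum();
    }

    public static void main(String[] args) {
        var r = new Random(42L);          // same seed as the benchmark fixture
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = r.nextInt();
        }
        System.out.println(sumClassic(data) == sumStream(data)); // prints "true"
    }
}
```

With identical results, the score differences above are purely about execution, not semantics.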

It is YOU now!

Execute and report - B04a

Vary the State

Things done horribly wrong - What is wrong here?


@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class B08
{
	@Param({"1", "10", "100", "1000"})
	int SIZE;
	List<String> data = new ArrayList<>(SIZE);

	@Setup
	public void setup()
	{
		var r = new Random(42);
		for (int i = 0; i < SIZE; i++)
		{
			data.add(String.valueOf(r.nextInt()));
		}
	}

	@Benchmark
    public List<String> sort()
	{
		Collections.sort(data);
		return data;
	}
}

Benchmark  (SIZE)  Mode  Cnt     Score     Error  Units
B08.sort        1  avgt    5     1.349 ±   0.080  ns/op
B08.sort       10  avgt    5    34.319 ±   0.151  ns/op
B08.sort      100  avgt    5   382.332 ±  20.398  ns/op
B08.sort     1000  avgt    5  4104.559 ± 207.218  ns/op
  • We sort unsorted data only once - after the first call, the list is already sorted!
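One hedged way to fix B08 without a Level.Invocation setup: sort a fresh copy of the fixture inside the benchmark. The copy cost then becomes part of the score, so a copy-only baseline should be subtracted. A standalone sketch with a hypothetical helper:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SortOnce {
    // Sort a fresh copy - the shared fixture stays unsorted,
    // so every call does real sorting work.
    static List<String> sortCopy(List<String> template) {
        List<String> copy = new ArrayList<>(template);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        var r = new Random(42);
        List<String> template = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            template.add(String.valueOf(r.nextInt()));
        }

        List<String> sorted = sortCopy(template);
        // The template itself stays untouched (and unsorted)...
        System.out.println(template.equals(sorted));            // prints "false"
        // ...while every call produces the same, properly ordered result.
        System.out.println(sortCopy(template).equals(sorted));  // prints "true"
    }
}
```

B08, by contrast, sorts the shared list in place, so from invocation two onward it measures sorting already-sorted data.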

Impacting Things

Things hitting us unexpectedly

Runtime Environment - JDK

Java evolves and so does performance


public class Example1ArrayCopying {
    @Param({"1000"}) int SIZE;
    byte[] src;

    @Setup
    public void setup() {
        src = new byte[SIZE];
        final Random r = new Random(7L);
        r.nextBytes(src);
    }

    @Benchmark
    public byte[] systemCopy() {
        var dest = new byte[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);

        return dest;
    }

    @Benchmark
    public byte[] arrayCopy() {
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public byte[] manualCopy() {
        var dest = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }

        return dest;
    }
}

# JDK 11
Benchmark                       (SIZE) Mode Cnt    Score     Error Units
Example1ArrayCopying.manualCopy   1000 avgt   3  152.889 ±  32.305 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  130.963 ± 119.643 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  128.726 ±  83.837 ns/op

# JDK 17
Example1ArrayCopying.manualCopy   1000 avgt   3  153.508 ±   7.987 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  124.021 ±  35.199 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  121.271 ±  35.723 ns/op

# JDK 19
Example1ArrayCopying.manualCopy   1000 avgt   3  149.298 ±  79.705 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  129.439 ± 104.106 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  133.358 ±  40.304 ns/op

# JDK 20
Example1ArrayCopying.manualCopy   1000 avgt   3  152.070 ±   7.096 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  119.613 ±   9.873 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  117.901 ±   8.322 ns/op

# JDK 21
Example1ArrayCopying.manualCopy   1000 avgt   3  146.508 ±  47.293 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  131.452 ± 107.300 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  132.797 ± 124.440 ns/op


# Graal 22.3.r17 - Whoops!
Example1ArrayCopying.manualCopy   1000 avgt   3  335.612 ±  22.000 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  150.035 ±   4.466 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  149.176 ±   4.496 ns/op
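All three variants produce identical copies, so the Graal manualCopy outlier must come from the generated code, not from different semantics. A quick standalone check using the same seed as the benchmark:

```java
import java.util.Arrays;
import java.util.Random;

public class CopyCheck {
    static byte[] systemCopy(byte[] src) {
        var dest = new byte[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);
        return dest;
    }

    static byte[] manualCopy(byte[] src) {
        var dest = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }
        return dest;
    }

    public static void main(String[] args) {
        byte[] src = new byte[1000];
        new Random(7L).nextBytes(src);        // same fixture as the benchmark
        boolean equal = Arrays.equals(systemCopy(src), manualCopy(src))
                     && Arrays.equals(Arrays.copyOf(src, src.length), src);
        System.out.println(equal);            // prints "true"
    }
}
```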

Environment - Hardware

Machines are never alike


public class Example1ArrayCopying {
    @Param({"1000"}) int SIZE;
    byte[] src;

    @Setup
    public void setup() {
        src = new byte[SIZE];
        final Random r = new Random(7L);
        r.nextBytes(src);
    }

    @Benchmark
    public byte[] systemCopy() {
        var dest = new byte[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);

        return dest;
    }

    @Benchmark
    public byte[] arrayCopy() {
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public byte[] manualCopy() {
        var dest = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }

        return dest;
    }
}

# JDK 21 - T14s Gen 1 AMD
Example1ArrayCopying.manualCopy   1000 avgt   3  146.508 ±  47.293 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  131.452 ± 107.300 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  132.797 ± 124.440 ns/op

# JDK 21 - Hetzner AMD EPYC
Example1ArrayCopying.manualCopy   1000 avgt   3   47.118 ±  12.747 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3   59.655 ±  13.721 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3   42.315 ±  17.464 ns/op

Example1ArrayCopying.manualCopy   1000 avgt   3   48.347 ±  10.000 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3   49.044 ±   3.226 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3   43.766 ±  15.874 ns/op

Example1ArrayCopying.manualCopy   1000 avgt   3   51.993 ±  25.652 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3   47.272 ±   5.970 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3   48.202 ±  19.657 ns/op

Example1ArrayCopying.manualCopy   1000 avgt   3   49.232 ±  29.510 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3   42.732 ±   9.204 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3   43.663 ±  13.449 ns/op

Measurement - Measure More

Things are jittery


@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
public class Example2ArrayCopying {
    ...

    @Benchmark
    public byte[] systemCopy() {
        var dest = new byte[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);

        return dest;
    }

    @Benchmark
    public byte[] arrayCopy() {
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public byte[] manualCopy() {
        var dest = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }

        return dest;
    }
}

# JDK 21 - T14s Gen 1 AMD - W3-3/M3-3/F1
Example1ArrayCopying.manualCopy   1000 avgt   3  146.508 ±  47.293 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3  131.452 ± 107.300 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3  132.797 ± 124.440 ns/op

# JDK 21 - Hetzner AMD EPYC - W3-3/M3-3/F1
Example1ArrayCopying.manualCopy   1000 avgt   3   47.118 ±  12.747 ns/op
Example1ArrayCopying.systemCopy   1000 avgt   3   59.655 ±  13.721 ns/op
Example1ArrayCopying.arrayCopy    1000 avgt   3   42.315 ±  17.464 ns/op

# JDK 21 - T14s Gen 1 AMD - W5-1/M10-1/F3
Example2ArrayCopying.manualCopy    1000  avgt   30  175.009 ± 2.882  ns/op
Example2ArrayCopying.systemCopy    1000  avgt   30  142.290 ± 5.045  ns/op
Example2ArrayCopying.arrayCopy     1000  avgt   30  143.374 ± 2.372  ns/op

Example2ArrayCopying.manualCopy    1000  avgt   30  171.585 ± 1.051  ns/op
Example2ArrayCopying.systemCopy    1000  avgt   30  143.686 ± 1.997  ns/op
Example2ArrayCopying.arrayCopy     1000  avgt   30  143.701 ± 1.464  ns/op

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F3
Example2ArrayCopying.manualCopy    1000  avgt   30  48.971 ± 1.260  ns/op
Example2ArrayCopying.systemCopy    1000  avgt   30  54.044 ± 5.644  ns/op
Example2ArrayCopying.arrayCopy     1000  avgt   30  49.097 ± 4.838  ns/op

Example2ArrayCopying.manualCopy    1000  avgt   30  48.743 ± 1.472  ns/op
Example2ArrayCopying.systemCopy    1000  avgt   30  44.689 ± 2.206  ns/op
Example2ArrayCopying.arrayCopy     1000  avgt   30  49.527 ± 5.736  ns/op

Data - Change Data Dimensions

Still no clear lead


@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class Example3ArrayCopying {
    @Param({"10", "1000", "100000"})
    int SIZE;
    ...
}

# JDK 21 - T14s Gen 1 AMD - W5-1/M10-1/F1
Benchmark                        (SIZE)  Mode Cnt      Score      Error  Units
Example3ArrayCopying.manualCopy      10  avgt  10     19.199 ±    0.060  ns/op
Example3ArrayCopying.systemCopy      10  avgt  10     11.842 ±    0.524  ns/op
Example3ArrayCopying.arrayCopy       10  avgt  10     12.055 ±    0.157  ns/op
Example3ArrayCopying.manualCopy    1000  avgt  10    179.273 ±    2.051  ns/op
Example3ArrayCopying.systemCopy    1000  avgt  10    145.791 ±    1.194  ns/op
Example3ArrayCopying.arrayCopy     1000  avgt  10    146.662 ±    1.988  ns/op
Example3ArrayCopying.manualCopy  100000  avgt  10  18492.201 ± 2739.328  ns/op
Example3ArrayCopying.systemCopy  100000  avgt  10  13468.580 ±  121.943  ns/op
Example3ArrayCopying.arrayCopy   100000  avgt  10  13919.631 ±  113.306  ns/op

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark                        (SIZE)  Mode Cnt      Score      Error  Units
Example3ArrayCopying.manualCopy      10  avgt  10      6.182 ±    0.554  ns/op
Example3ArrayCopying.systemCopy      10  avgt  10      3.957 ±    0.089  ns/op
Example3ArrayCopying.arrayCopy       10  avgt  10      4.073 ±    0.329  ns/op
Example3ArrayCopying.manualCopy    1000  avgt  10     51.011 ±    3.748  ns/op
Example3ArrayCopying.systemCopy    1000  avgt  10     61.058 ±    1.217  ns/op
Example3ArrayCopying.arrayCopy     1000  avgt  10     58.612 ±    1.315  ns/op
Example3ArrayCopying.manualCopy  100000  avgt  10   6033.414 ±  323.343  ns/op
Example3ArrayCopying.systemCopy  100000  avgt  10   6227.169 ±  156.952  ns/op
Example3ArrayCopying.arrayCopy   100000  avgt  10   6390.100 ±  194.210  ns/op

Data - Change Data Type

Maybe more is less? From byte to long


@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class Example4ArrayCopying {
    @Param({"10", "1000", "100000"})
    int SIZE;
    long[] src;

    @Setup
    public void setup()
    {
        src = new long[SIZE];
        final Random r = new Random(7L);
        for (int i = 0; i < src.length; i++)
        {
            src[i] = r.nextLong();
        }
    }
    ...
}

# JDK 21 - T14s Gen 1 AMD - W5-1/M10-1/F1
Benchmark                        (SIZE) Mode Cnt      Score      Error Units
Example4ArrayCopying.manualCopy      10 avgt  10     19.836 ±     0.44 ns/op
Example4ArrayCopying.systemCopy      10 avgt  10     15.233 ±     0.72 ns/op
Example4ArrayCopying.arrayCopy       10 avgt  10     16.243 ±     0.43 ns/op
Example4ArrayCopying.manualCopy    1000 avgt  10   1170.518 ±    40.69 ns/op
Example4ArrayCopying.systemCopy    1000 avgt  10   1000.050 ±    19.52 ns/op
Example4ArrayCopying.arrayCopy     1000 avgt  10   1018.666 ±    20.75 ns/op
Example4ArrayCopying.manualCopy  100000 avgt  10 132694.706 ± 10647.86 ns/op
Example4ArrayCopying.systemCopy  100000 avgt  10 127729.915 ± 19014.11 ns/op
Example4ArrayCopying.arrayCopy   100000 avgt  10 113348.850 ±  4124.81 ns/op

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark                        (SIZE) Mode Cnt     Score       Error Units
Example4ArrayCopying.manualCopy      10 avgt  10      6.640 ±     0.59 ns/op
Example4ArrayCopying.systemCopy      10 avgt  10      5.219 ±     0.28 ns/op
Example4ArrayCopying.arrayCopy       10 avgt  10      5.025 ±     0.25 ns/op
Example4ArrayCopying.manualCopy    1000 avgt  10    350.139 ±    11.62 ns/op
Example4ArrayCopying.systemCopy    1000 avgt  10    424.347 ±     6.27 ns/op
Example4ArrayCopying.arrayCopy     1000 avgt  10    424.027 ±     8.20 ns/op
Example4ArrayCopying.manualCopy  100000 avgt  10  48080.023 ±  1714.36 ns/op
Example4ArrayCopying.systemCopy  100000 avgt  10  39513.986 ±  3052.52 ns/op
Example4ArrayCopying.arrayCopy   100000 avgt  10  40345.696 ±  1136.56 ns/op

GC - Evaluate Demand

Just blame the usual suspect


# Enable GC profiling by JMH
java -jar target/benchmarks.jar "Example5ArrayCopying" -prof gc

 

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark                      (SIZE)  Mode  Cnt       Score      Error   Units
manualCopy                       1000  avgt   10     358.236 ±    9.899   ns/op
manualCopy:gc.alloc.rate         1000  avgt   10   21340.350 ±  594.948  MB/sec
manualCopy:gc.alloc.rate.norm    1000  avgt   10    8016.002 ±    0.001    B/op
manualCopy:gc.count              1000  avgt   10     181.000             counts
manualCopy:gc.time               1000  avgt   10     234.000                 ms

systemCopy                       1000  avgt   10     462.448 ±    9.151   ns/op
systemCopy:gc.alloc.rate         1000  avgt   10   16528.255 ±  327.421  MB/sec
systemCopy:gc.alloc.rate.norm    1000  avgt   10    8016.003 ±    0.001    B/op
systemCopy:gc.count              1000  avgt   10     151.000             counts
systemCopy:gc.time               1000  avgt   10     189.000                 ms

arrayCopy                        1000  avgt   10     496.126 ±   10.055   ns/op
arrayCopy:gc.alloc.rate          1000  avgt   10   15406.729 ±  310.263  MB/sec
arrayCopy:gc.alloc.rate.norm     1000  avgt   10    8016.003 ±    0.001    B/op
arrayCopy:gc.count               1000  avgt   10     131.000             counts
arrayCopy:gc.time                1000  avgt   10     171.000                 ms

# Give it more memory
java -jar target/benchmarks.jar "Example5ArrayCopying" \
    -jvmArgsAppend "-Xms16g -Xmx16g" \
    -prof gc

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark                      (SIZE)  Mode  Cnt       Score      Error   Units
manualCopy                       1000  avgt   10     344.894 ±    3.923   ns/op
manualCopy:gc.alloc.rate         1000  avgt   10   22160.016 ±  251.956  MB/sec
manualCopy:gc.alloc.rate.norm    1000  avgt   10    8016.002 ±    0.001    B/op
manualCopy:gc.count              1000  avgt   10      23.000             counts
manualCopy:gc.time               1000  avgt   10      70.000                 ms

systemCopy                       1000  avgt   10     398.919 ±   12.741   ns/op
systemCopy:gc.alloc.rate         1000  avgt   10   19165.850 ±  628.072  MB/sec
systemCopy:gc.alloc.rate.norm    1000  avgt   10    8016.003 ±    0.001    B/op
systemCopy:gc.count              1000  avgt   10      19.000             counts
systemCopy:gc.time               1000  avgt   10      56.000                 ms

arrayCopy                        1000  avgt   10     378.813 ±   26.941   ns/op
arrayCopy:gc.alloc.rate          1000  avgt   10   20214.726 ± 1412.521  MB/sec
arrayCopy:gc.alloc.rate.norm     1000  avgt   10    8016.003 ±    0.001    B/op
arrayCopy:gc.count               1000  avgt   10      21.000             counts
arrayCopy:gc.time                1000  avgt   10      65.000                 ms
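The gc.alloc.rate.norm of 8016 B/op matches what a fresh long[1000] should cost: 8000 bytes of payload plus roughly 16 bytes of array header (assuming 64-bit HotSpot with compressed oops). A tiny sanity calculation:

```java
public class AllocNorm {
    // Rough per-operation allocation for a new long[size]:
    // size * 8 bytes payload + ~16 bytes array header/length field
    // (header size is a HotSpot/64-bit assumption, not a JVM guarantee).
    static long expectedBytes(int size) {
        return size * 8L + 16L;
    }

    public static void main(String[] args) {
        System.out.println(expectedBytes(1000)); // prints 8016
    }
}
```

When the measured norm deviates from such a back-of-the-envelope number, something else is allocating inside the benchmark.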

GC - Compare GCs

Try a different one, as everyone keeps telling you


# Try the Serial GC
java -jar target/benchmarks.jar "Example5ArrayCopying" \
    -jvmArgsAppend "-Xms16g -Xmx16g -XX:+UseSerialGC" \
    -prof gc

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark                      (SIZE)  Mode  Cnt       Score      Error   Units
manualCopy                       1000  avgt   10     340.026 ±    2.053   ns/op
manualCopy:gc.alloc.rate         1000  avgt   10   22477.566 ±  135.713  MB/sec
manualCopy:gc.alloc.rate.norm    1000  avgt   10    8016.002 ±    0.001    B/op
manualCopy:gc.count              1000  avgt   10      51.000             counts
manualCopy:gc.time               1000  avgt   10      16.000                 ms

systemCopy                       1000  avgt   10     384.150 ±   17.802   ns/op
systemCopy:gc.alloc.rate         1000  avgt   10   19912.093 ±  939.526  MB/sec
systemCopy:gc.alloc.rate.norm    1000  avgt   10    8016.002 ±    0.001    B/op
systemCopy:gc.count              1000  avgt   10      46.000             counts
systemCopy:gc.time               1000  avgt   10      14.000                 ms

arrayCopy                        1000  avgt   10     396.502 ±    6.364   ns/op
arrayCopy:gc.alloc.rate          1000  avgt   10   19275.432 ±  315.711  MB/sec
arrayCopy:gc.alloc.rate.norm     1000  avgt   10    8016.002 ±    0.001    B/op
arrayCopy:gc.count               1000  avgt   10      44.000             counts
arrayCopy:gc.time                1000  avgt   10       9.000                 ms

# Try ZGC instead
java -jar target/benchmarks.jar "Example5ArrayCopying" \
    -jvmArgsAppend "-Xms16g -Xmx16g -XX:+UseZGC" \
    -prof gc

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark                      (SIZE)  Mode  Cnt       Score       Error   Units
manualCopy                       1000  avgt   10     429.413 ±    76.723   ns/op
manualCopy:gc.alloc.rate         1000  avgt   10   18004.823 ±  3022.606  MB/sec
manualCopy:gc.alloc.rate.norm    1000  avgt   10    8016.003 ±     0.001    B/op
manualCopy:gc.count              1000  avgt   10      56.000              counts
manualCopy:gc.time               1000  avgt   10     214.000                  ms

systemCopy                       1000  avgt   10     470.485 ±   112.292   ns/op
systemCopy:gc.alloc.rate         1000  avgt   10   16610.545 ±  3953.947  MB/sec
systemCopy:gc.alloc.rate.norm    1000  avgt   10    8016.004 ±     0.001    B/op
systemCopy:gc.count              1000  avgt   10      52.000              counts
systemCopy:gc.time               1000  avgt   10     175.000                  ms

arrayCopy                        1000  avgt   10     442.024 ±   127.831   ns/op
arrayCopy:gc.alloc.rate          1000  avgt   10   17806.595 ±  4581.743  MB/sec
arrayCopy:gc.alloc.rate.norm     1000  avgt   10    8016.003 ±     0.001    B/op
arrayCopy:gc.count               1000  avgt   10      56.000              counts
arrayCopy:gc.time                1000  avgt   10     198.000                  ms

Memory - Don't Churn

Focus on the action, avoid overhead


public class Example6ArrayCopying {
    @Param({"1000"})
    int SIZE;
    long[] src, dest;

    @Setup
    public void setup() {
        src = new long[SIZE];
        dest = new long[SIZE];

        final Random r = new Random(7L);
        for (int i = 0; i < src.length; i++) {
        	src[i] = r.nextLong();
        }
    }

    @Benchmark
    public long[] systemCopy() {
        System.arraycopy(src, 0, dest, 0, src.length);
        return dest;
    }

    @Benchmark
    public long[] manualCopy() {
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }

        return dest;
    }
}

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark                               (SIZE)  Mode  Cnt      Score     Error   Units
Example4...manualCopy                     1000  avgt   10    358.236 ±   9.899   ns/op
Example4...manualCopy:gc.alloc.rate       1000  avgt   10  21340.350 ± 594.948  MB/sec
Example4...manualCopy:gc.alloc.rate.norm  1000  avgt   10   8016.002 ±   0.001    B/op

Example4...systemCopy                     1000  avgt   10    462.448 ±   9.151   ns/op
Example4...systemCopy:gc.alloc.rate       1000  avgt   10  16528.255 ± 327.421  MB/sec
Example4...systemCopy:gc.alloc.rate.norm  1000  avgt   10   8016.003 ±   0.001    B/op

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark                               (SIZE)  Mode  Cnt      Score     Error   Units
Example6...manualCopy                     1000  avgt   10     68.884 ±   0.244   ns/op
Example6...manualCopy:gc.alloc.rate       1000  avgt   10      0.007 ±   0.001  MB/sec
Example6...manualCopy:gc.alloc.rate.norm  1000  avgt   10     ≈ 10⁻³             B/op

Example6...systemCopy                     1000  avgt   10     87.878 ±   0.748   ns/op
Example6...systemCopy:gc.alloc.rate       1000  avgt   10      0.007 ±   0.001  MB/sec
Example6...systemCopy:gc.alloc.rate.norm  1000  avgt   10      0.001 ±   0.001    B/op
--
Example6...manualCopy                     1000  avgt   10     68.745 ±   0.166   ns/op
Example6...manualCopy:gc.alloc.rate       1000  avgt   10      0.007 ±   0.001  MB/sec
Example6...manualCopy:gc.alloc.rate.norm  1000  avgt   10     ≈ 10⁻³              B/op

Example6...systemCopy                     1000  avgt   10     91.892 ±   0.466   ns/op
Example6...systemCopy:gc.alloc.rate       1000  avgt   10      0.007 ±   0.001  MB/sec
Example6...systemCopy:gc.alloc.rate.norm  1000  avgt   10      0.001 ±   0.001    B/op

Pattern - Any Difference When Using the Same Array?

Just copy within the array


public class Example7ArrayCopying {
    @Param({"1000"})
    int SIZE;
    int pos;
    long[] src;

    @Setup
    public void setup() {
        ...
        pos = SIZE / 2;
        ...
    }

    @Benchmark
    public long[] manualCopy1() {
        for (int i = 0; i < src.length - pos; i++) {
            src[i] = src[pos + i];
        }
        return src;
    }

    @Benchmark
    public long[] manualCopy2() {
        var to = src.length - pos;
        for (int i = 0; i < to; i++) {
            src[i] = src[pos + i];
        }
        return src;
    }

    @Benchmark
    public long[] manualCopy3() {
        var p = pos;
        var to = src.length - p;
        for (int i = 0; i < to; i++, p++) {
            src[i] = src[p];
        }
        return src;
    }
}

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark                         (SIZE)  Mode  Cnt    Score   Error  Units
Example7ArrayCopying.manualCopy1    1000  avgt   10  127.035 ± 4.312  ns/op
Example7ArrayCopying.manualCopy2    1000  avgt   10  121.472 ± 0.108  ns/op
Example7ArrayCopying.manualCopy3    1000  avgt   10  122.662 ± 2.137  ns/op
Example7ArrayCopying.systemCopy     1000  avgt   10   44.243 ± 0.209  ns/op
--
Example7ArrayCopying.manualCopy1    1000  avgt   10  125.406 ± 0.050  ns/op
Example7ArrayCopying.manualCopy2    1000  avgt   10  123.669 ± 2.829  ns/op
Example7ArrayCopying.manualCopy3    1000  avgt   10  146.735 ± 1.185  ns/op
Example7ArrayCopying.systemCopy     1000  avgt   10   45.049 ± 0.176  ns/op
--
Example7ArrayCopying.manualCopy1    1000  avgt   10  128.646 ± 7.547  ns/op
Example7ArrayCopying.manualCopy2    1000  avgt   10  122.786 ± 2.348  ns/op
Example7ArrayCopying.manualCopy3    1000  avgt   10  131.138 ± 2.348  ns/op
Example7ArrayCopying.systemCopy     1000  avgt   10   47.014 ± 0.201  ns/op
--
Example7ArrayCopying.manualCopy1    1000  avgt   10  126.050 ±  1.729  ns/op
Example7ArrayCopying.manualCopy2    1000  avgt   10  126.528 ± 14.605  ns/op
Example7ArrayCopying.manualCopy3    1000  avgt   10  123.898 ±  2.667  ns/op
Example7ArrayCopying.systemCopy     1000  avgt   10   44.879 ±  0.163  ns/op
--
Example7ArrayCopying.manualCopy1    1000  avgt   10  125.773 ± 0.628  ns/op
Example7ArrayCopying.manualCopy2    1000  avgt   10  123.595 ± 1.524  ns/op
Example7ArrayCopying.manualCopy3    1000  avgt   10  123.349 ± 3.968  ns/op
Example7ArrayCopying.systemCopy     1000  avgt   10   44.673 ± 0.044  ns/op
--
Example7ArrayCopying.manualCopy1    1000  avgt   10  126.058 ± 0.769  ns/op
Example7ArrayCopying.manualCopy2    1000  avgt   10  189.097 ± 2.793  ns/op
Example7ArrayCopying.manualCopy3    1000  avgt   10  122.751 ± 1.444  ns/op
Example7ArrayCopying.systemCopy     1000  avgt   10   44.674 ± 0.038  ns/op

JIT - Break Optimization

Loop unrolling is king


public class Example8ArrayCopying {
    @Setup
    public void setup() {
        ...
        final Random r = new Random(7L);
        increment = r.nextInt(1) + 1; // always 1, but the JIT cannot prove it
    }

    @Benchmark
    public long[] manualCopyAntiUnroll1() {
        for (long l = 0; l < src.length; l++) {
        	final var i = (int)l;
            dest[i] = src[i];
        }
        return src;
    }

    @Benchmark
    public long[] manualCopyAntiUnroll2() {
        for (int i = 0; i < src.length; i = i + increment) {
            dest[i] = src[i];
        }
        return src;
    }

    @Benchmark
    public long[] manualCopyAntiUnroll3() {
        final var inc = increment;
        for (int i = 0; i < src.length; i = i + inc) {
            dest[i] = src[i];
        }
        return src;
    }
}

java -jar target/benchmarks.jar "Example8ArrayCopying"

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark                                   (SIZE)  Mode  Cnt     Score    Error  Units
Example8ArrayCopying.manualCopy1              1000  avgt   10    68.692 ±  0.066  ns/op
Example8ArrayCopying.manualCopy2              1000  avgt   10    70.277 ±  1.390  ns/op
Example8ArrayCopying.manualCopyAntiUnroll1    1000  avgt   10   385.301 ±  3.335  ns/op
Example8ArrayCopying.manualCopyAntiUnroll2    1000  avgt   10   617.899 ± 28.190  ns/op
Example8ArrayCopying.manualCopyAntiUnroll3    1000  avgt   10  1640.345 ±  0.966  ns/op
Example8ArrayCopying.systemCopy               1000  avgt   10   100.947 ±  1.753  ns/op
---
Benchmark                                   (SIZE)  Mode  Cnt     Score    Error  Units
Example8ArrayCopying.manualCopy1              1000  avgt   10    68.818 ±  0.893  ns/op
Example8ArrayCopying.manualCopy2              1000  avgt   10    70.744 ±  0.439  ns/op
Example8ArrayCopying.manualCopyAntiUnroll1    1000  avgt   10   385.096 ±  4.509  ns/op
Example8ArrayCopying.manualCopyAntiUnroll2    1000  avgt   10   655.082 ±  1.224  ns/op
Example8ArrayCopying.manualCopyAntiUnroll3    1000  avgt   10  1645.394 ± 24.600  ns/op
Example8ArrayCopying.systemCopy               1000  avgt   10    74.698 ±  1.357  ns/op
--
# Benchmark: org.xc.jmh.Example8ArrayCopying.manualCopyAntiUnroll3
# Warmup Iteration   1: 579.444 ns/op
# Warmup Iteration   2: 421.095 ns/op
# Warmup Iteration   3: 1640.320 ns/op
# Warmup Iteration   4: 1639.955 ns/op
# Warmup Iteration   5: 1639.914 ns/op
Iteration   1: 1640.164 ns/op
...
Iteration  10: 1641.931 ns/op

JIT - Unroll it manually

Just try to prove a point


public class Example9ArrayCopying {
    @Benchmark
    public long[] manualCopyAntiUnroll() {
        for (int i = 0; i < src.length; i = i + increment) {
            dest[i] = src[i];
        }

        return dest;
    }

    @Benchmark
    public long[] manualCopyUnroll() {
    	var inc = increment * 4;
    	var i = 0;
        for (; i < src.length; i = i + inc) {
            dest[i] = src[i];
            dest[i + 1] = src[i + 1];
            dest[i + 2] = src[i + 2];
            dest[i + 3] = src[i + 3];
        }
        for (; i < src.length; i++) {
            dest[i] = src[i];
        }

        return dest;
    }
}

java -jar target/benchmarks.jar "Example9ArrayCopying"

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark                                 (SIZE) Mode Cnt   Score    Error Units
Example9ArrayCopying.manualCopyAntiUnroll   1000 avgt  10 715.286 ± 132.30 ns/op
Example9ArrayCopying.manualCopyUnroll       1000 avgt  10 480.738 ±   3.93 ns/op
Example9ArrayCopying.systemCopy             1000 avgt  10  97.871 ±   0.08 ns/op
--
Example9ArrayCopying.manualCopyAntiUnroll   1000 avgt  10 597.785 ±   9.35 ns/op
Example9ArrayCopying.manualCopyUnroll       1000 avgt  10 480.876 ±   2.13 ns/op
Example9ArrayCopying.systemCopy             1000 avgt  10  97.437 ±   0.08 ns/op
--
# Benchmark: org.xc.jmh.Example9ArrayCopying.manualCopyUnroll
# Warmup Iteration   1: 283.584 ns/op
# Warmup Iteration   2: 283.552 ns/op
# Warmup Iteration   3: 480.231 ns/op
# Warmup Iteration   4: 481.449 ns/op
# Warmup Iteration   5: 479.380 ns/op
Iteration   1: 480.518 ns/op
...
Iteration  10: 479.786 ns/op

JIT - Limit its Reach

Investigating JIT things, small scale


java -jar target/benchmarks.jar "Example8ArrayCopying"

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark                                   (SIZE)  Mode  Cnt     Score    Error  Units
Example8ArrayCopying.manualCopy1              1000  avgt   10    68.692 ±  0.066  ns/op
Example8ArrayCopying.manualCopy2              1000  avgt   10    70.277 ±  1.390  ns/op
Example8ArrayCopying.manualCopyAntiUnroll1    1000  avgt   10   385.301 ±  3.335  ns/op
Example8ArrayCopying.manualCopyAntiUnroll2    1000  avgt   10   617.899 ± 28.190  ns/op
Example8ArrayCopying.manualCopyAntiUnroll3    1000  avgt   10  1640.345 ±  0.966  ns/op
Example8ArrayCopying.systemCopy               1000  avgt   10   100.947 ±  1.753  ns/op
---
Benchmark                                   (SIZE)  Mode  Cnt     Score    Error  Units
Example8ArrayCopying.manualCopy1              1000  avgt   10    68.818 ±  0.893  ns/op
Example8ArrayCopying.manualCopy2              1000  avgt   10    70.744 ±  0.439  ns/op
Example8ArrayCopying.manualCopyAntiUnroll1    1000  avgt   10   385.096 ±  4.509  ns/op
Example8ArrayCopying.manualCopyAntiUnroll2    1000  avgt   10   655.082 ±  1.224  ns/op
Example8ArrayCopying.manualCopyAntiUnroll3    1000  avgt   10  1645.394 ± 24.600  ns/op
Example8ArrayCopying.systemCopy               1000  avgt   10    74.698 ±  1.357  ns/op
--
# Benchmark: org.xc.jmh.Example8ArrayCopying.manualCopyAntiUnroll3
# Warmup Iteration   1: 579.444 ns/op
# Warmup Iteration   2: 421.095 ns/op
# Warmup Iteration   3: 1640.320 ns/op
# Warmup Iteration   4: 1639.955 ns/op
# Warmup Iteration   5: 1639.914 ns/op
Iteration   1: 1640.164 ns/op
...
Iteration  10: 1641.931 ns/op

java -jar target/benchmarks.jar "Example8ArrayCopying" \
    -jvmArgsAppend "-Xbatch -XX:TieredStopAtLevel=1"

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark                                   (SIZE)  Mode  Cnt    Score    Error  Units
Example8ArrayCopying.manualCopy1              1000  avgt   10  565.111 ± 20.447  ns/op
Example8ArrayCopying.manualCopy2              1000  avgt   10  825.489 ±  0.782  ns/op
Example8ArrayCopying.manualCopyAntiUnroll1    1000  avgt   10  768.211 ± 18.383  ns/op
Example8ArrayCopying.manualCopyAntiUnroll2    1000  avgt   10  694.116 ±  8.131  ns/op
Example8ArrayCopying.manualCopyAntiUnroll3    1000  avgt   10  694.561 ±  3.570  ns/op
Example8ArrayCopying.systemCopy               1000  avgt   10  117.650 ±  2.397  ns/op
--
Example8ArrayCopying.manualCopy1              1000  avgt   10  555.663 ±  8.745  ns/op
Example8ArrayCopying.manualCopy2              1000  avgt   10  830.459 ± 11.616  ns/op
Example8ArrayCopying.manualCopyAntiUnroll1    1000  avgt   10  763.762 ±  4.898  ns/op
Example8ArrayCopying.manualCopyAntiUnroll2    1000  avgt   10  692.760 ±  1.160  ns/op
Example8ArrayCopying.manualCopyAntiUnroll3    1000  avgt   10  694.167 ±  9.906  ns/op
Example8ArrayCopying.systemCopy               1000  avgt   10   74.923 ±  0.273  ns/op
--
# Benchmark: org.xc.jmh.Example8ArrayCopying.manualCopyAntiUnroll3
# Warmup Iteration   1: 717.091 ns/op
# Warmup Iteration   2: 689.629 ns/op
# Warmup Iteration   3: 691.484 ns/op
# Warmup Iteration   4: 690.789 ns/op
# Warmup Iteration   5: 698.366 ns/op
Iteration   1: 692.108 ns/op
..
Iteration  10: 690.475 ns/op
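The same limits can also be declared in code rather than on the command line. A sketch (not part of the example project) using JMH's own annotations; the flags mirror the `-jvmArgsAppend` call above:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Fork;

public class JitControlSketch {
    // Pin the JVM flags to this benchmark instead of passing them per run
    @Fork(jvmArgsAppend = {"-Xbatch", "-XX:TieredStopAtLevel=1"})
    @Benchmark
    public int c1Only() {
        return 42;
    }

    // Or keep C2 enabled but exclude a single method from compilation
    @CompilerControl(CompilerControl.Mode.EXCLUDE)
    @Benchmark
    public int neverCompiled() {
        return 42;
    }
}
```

`-jvmArgsAppend` wins for quick experiments; the annotations win when the restriction is part of the benchmark's design and should travel with the source.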

Memory and Caches

Betray yourself quickly and efficiently

A fun one first

Do less work but don't gain speed proportionally


public class Cache03 {
    final int SIZE = 100_000;
    final int[] src = new int[SIZE];

    @Benchmark
    public int step1() {
        int sum = 0;
        for (int i = 0; i < SIZE; i++) {
            sum += src[i];
        }
        return sum;
    }

    @Benchmark
    public int step20() {
        int sum = 0;
        for (int i = 0; i < SIZE; i = i + 20) {
            sum += src[i];
        }
        return sum;
    }

    int step(int strideSize) {
        int sum = 0;
        for (int i = 0; i < SIZE; i = i + strideSize) {
            sum += src[i];
        }
        return sum;
    }

    @Benchmark
    public int stepUnrolled1() {return step(1);}

    @Benchmark
    public int stepUnrolled20() {return step(20);}
}


Benchmark                   Mode  Cnt       Score     Error  Units
Cache03.step1               avgt    3  27,341.974 ±   26.97  ns/op
Cache03.step2               avgt    3  13,677.435 ±   64.85  ns/op
Cache03.step20              avgt    3   2,740.613 ±    5.87  ns/op
Cache03.stepUnrolled1       avgt    3  27,355.850 ±   36.46  ns/op
Cache03.stepUnrolled2       avgt    3  13,674.883 ±   86.00  ns/op
Cache03.stepUnrolled20      avgt    3   2,739.449 ±   11.39  ns/op

# Benchmark: org.xc.jmh.Cache03.stepUnrolled1
# Warmup Iteration   1: 32292.624 ns/op
# Warmup Iteration   2: 29584.658 ns/op
# Warmup Iteration   3: 49148.501 ns/op
# Warmup Iteration   4: 50692.600 ns/op
# Warmup Iteration   5: 30464.858 ns/op
# Warmup Iteration   6: 27348.027 ns/op
Iteration   1: 27354.870 ns/op
Iteration   2: 27348.734 ns/op
Iteration   3: 27343.787 ns/op

Cache03.step1Reversed       avgt    3  27,341.654 ±   67.54  ns/op
Cache03.stepUnrolled1Long   avgt    3  41,423.789 ± 8710.51  ns/op
Cache03.stepUnrolled2Long   avgt    3  86,654.227 ±  494.13  ns/op
Cache03.stepUnrolled20Long  avgt    3   2,954.895 ±  892.58  ns/op

Why are things so complicated?

A look behind the scenes


# java -jar target/benchmarks.jar "Cache03.step(1|20)" -prof perfnorm

Benchmark                             Mode Cnt      Score      Units
Cache03.step1                         avgt   3  31676.192      ns/op

Cache03.step1:IPC                     avgt          1.227  insns/clk

Cache03.step1:L1-dcache-load-misses   avgt       5337.646       #/op
Cache03.step1:L1-dcache-loads         avgt      83939.072       #/op
Cache03.step1:L1-icache-load-misses   avgt          4.004       #/op
Cache03.step1:L1-icache-loads         avgt        165.415       #/op

Cache03.step1:branch-misses           avgt         13.406       #/op
Cache03.step1:branches                avgt       5143.071       #/op
Cache03.step1:cycles                  avgt      81345.466       #/op
Cache03.step1:instructions            avgt      99844.986       #/op

Cache03.step1:dTLB-load-misses        avgt          0.194       #/op
Cache03.step1:dTLB-loads              avgt         90.337       #/op
Cache03.step1:iTLB-load-misses        avgt          0.112       #/op
Cache03.step1:iTLB-loads              avgt          0.534       #/op

Cache03.step1:stalled-cycles-frontend avgt        534.119       #/op

# java -jar target/benchmarks.jar "Cache03.step(1|20)" -prof perfnorm

Benchmark                              Mode Cnt      Score      Units
Cache03.step20                         avgt   3   3164.759      ns/op

Cache03.step20:IPC                     avgt          0.621  insns/clk

Cache03.step20:L1-dcache-load-misses   avgt       4249.552       #/op
Cache03.step20:L1-dcache-loads         avgt       4208.701       #/op
Cache03.step20:L1-icache-load-misses   avgt          0.564       #/op
Cache03.step20:L1-icache-loads         avgt         16.031       #/op

Cache03.step20:branch-misses           avgt          0.857       #/op
Cache03.step20:branches                avgt        269.741       #/op
Cache03.step20:cycles                  avgt       8164.074       #/op
Cache03.step20:instructions            avgt       5072.712       #/op

Cache03.step20:dTLB-load-misses        avgt          0.015       #/op
Cache03.step20:dTLB-loads              avgt         89.055       #/op
Cache03.step20:iTLB-load-misses        avgt          0.004       #/op
Cache03.step20:iTLB-loads              avgt          0.017       #/op

Cache03.step20:stalled-cycles-frontend avgt         41.663       #/op

Caches

The hidden memory that speeds things up

  • CPUs feature multiple layers of data and instruction caches
  • Caches exist per core, per socket, and shared across the system
  • You cannot control the caches
  • Your benchmark might betray you here
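To see these bullet points in action, a classic traversal sketch (hypothetical, not part of the example project): summing a 2D array row by row walks memory sequentially and rides the cache lines, while column by column jumps to a different row's backing array on every access. Both methods compute the same sum; only the access pattern differs.

```java
import java.util.Random;

public class CacheLines {
    static long[][] makeMatrix(int rows, int cols) {
        final Random r = new Random(7L); // fixed seed, like the deck's setups
        final long[][] m = new long[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i][j] = r.nextLong();
        return m;
    }

    // cache-friendly: consecutive elements of a row are adjacent in memory
    static long rowMajorSum(long[][] m) {
        long sum = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                sum += m[i][j];
        return sum;
    }

    // cache-hostile: every access lands in a different row's array
    static long columnMajorSum(long[][] m) {
        long sum = 0;
        for (int j = 0; j < m[0].length; j++)
            for (int i = 0; i < m.length; i++)
                sum += m[i][j];
        return sum;
    }
}
```

Wrapped into `@Benchmark` methods, the column-major version typically comes out several times slower once the matrix no longer fits the upper cache levels; the exact numbers depend entirely on your cache sizes.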

Data - Work Against the Caches

Caches are misleading and production is different


public class Example10ArrayCopying {
    @Param({"1000"})
    int SIZE;
    @Param({"1", "10", "1000", "10000"})
    int COUNT;

    final List<long[]> srcArrays = new ArrayList<>();
    final List<long[]> destArrays = new ArrayList<>();
    final Random r = new Random(7L);

    @Setup
    public void setup() {
    	srcArrays.clear();
    	destArrays.clear();
    	var a = new long[SIZE];

        for (int i = 0; i < a.length; i++) {
        	a[i] = r.nextLong();
        }

        // get us enough arrays
        for (int i = 0; i < COUNT; i++) {
        	srcArrays.add(Arrays.copyOf(a, a.length));
        	destArrays.add(Arrays.copyOf(a, a.length));
        }
    }

    @Benchmark
    public long[] systemCopy() {
    	var src = srcArrays.get(r.nextInt(srcArrays.size()));
    	var dest = destArrays.get(r.nextInt(destArrays.size()));

        System.arraycopy(src, 0, dest, 0, src.length);
        return dest;
    }
    ...
}

java -jar target/benchmarks.jar "Example10ArrayCopying"

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark                         (COUNT)  (SIZE)  Mode  Cnt     Score    Error  Units
Example10ArrayCopying.manualCopy        1    1000  avgt   10    80.914 ±  1.215  ns/op
Example10ArrayCopying.manualCopy       10    1000  avgt   10   137.502 ±  0.495  ns/op
Example10ArrayCopying.manualCopy     1000    1000  avgt   10   244.279 ±  5.420  ns/op
Example10ArrayCopying.manualCopy    10000    1000  avgt   10   986.221 ±  9.278  ns/op
Example10ArrayCopying.systemCopy        1    1000  avgt   10    93.923 ±  0.517  ns/op
Example10ArrayCopying.systemCopy       10    1000  avgt   10   137.737 ±  0.538  ns/op
Example10ArrayCopying.systemCopy     1000    1000  avgt   10   245.400 ±  3.961  ns/op
Example10ArrayCopying.systemCopy    10000    1000  avgt   10  1051.927 ± 12.971  ns/op
--
Example10ArrayCopying.manualCopy        1    1000  avgt   10    80.677 ±  0.066  ns/op
Example10ArrayCopying.manualCopy       10    1000  avgt   10   133.435 ±  0.465  ns/op
Example10ArrayCopying.manualCopy     1000    1000  avgt   10   247.355 ±  4.421  ns/op
Example10ArrayCopying.manualCopy    10000    1000  avgt   10  1005.709 ± 13.220  ns/op
Example10ArrayCopying.systemCopy        1    1000  avgt   10    94.961 ±  2.043  ns/op
Example10ArrayCopying.systemCopy       10    1000  avgt   10   139.995 ±  3.271  ns/op
Example10ArrayCopying.systemCopy     1000    1000  avgt   10   240.796 ±  1.283  ns/op
Example10ArrayCopying.systemCopy    10000    1000  avgt   10  1066.152 ± 15.434  ns/op

# JDK 21 - T14s - W5-1/M10-1/F1 - Default Mem
Example10ArrayCopying.manualCopy        1    1000  avgt   10   220.405 ±  0.841  ns/op
Example10ArrayCopying.manualCopy       10    1000  avgt   10   398.055 ±  5.960  ns/op
Example10ArrayCopying.manualCopy     1000    1000  avgt   10  1205.517 ±  4.025  ns/op
Example10ArrayCopying.manualCopy    10000    1000  avgt   10  1593.276 ± 72.917  ns/op
Example10ArrayCopying.systemCopy        1    1000  avgt   10   400.866 ± 10.155  ns/op
Example10ArrayCopying.systemCopy       10    1000  avgt   10   464.524 ±  7.419  ns/op
Example10ArrayCopying.systemCopy     1000    1000  avgt   10  1246.732 ±  3.929  ns/op
Example10ArrayCopying.systemCopy    10000    1000  avgt   10  1649.171 ± 40.588  ns/op

Data - Caches

Try harder to make the point


public class Cache01 {
    @Param({"10", "1000", "10000"}) int SIZE;

    @Param({"true", "false"}) boolean RANDOM;
    @Param({"true", "false"}) boolean MOREGARBAGE;

    final List<String> data = new ArrayList<>(SIZE);
    final List<byte[]> garbage = new ArrayList<>(SIZE);

    String[] CITIES = {"Berlin;", "Hannover;", "Prag;", "Rio;", "Hamburg;", "Paris;", "Rotterdam;"};
    String[] TEMPERATURES = {"-99.9", "99.9", "3.4", "-12.2", "22.2", "26.8", "31.1", "11.0", "-5.6"};

    @Setup(Level.Trial)
    public void setup() {
    	data.clear();
    	garbage.clear();
        final var r = new Random(7L);

        for (int i = 0; i < SIZE; i++) {
        	if (MOREGARBAGE) {
        		garbage.add(new byte[r.nextInt(5000)]);
        	}

        	data.add(
                CITIES[r.nextInt(CITIES.length)] +
                TEMPERATURES[r.nextInt(TEMPERATURES.length)]);
        }

        if (RANDOM) {
        	Collections.shuffle(data, r);
        }
    }

    @Benchmark
    public double parse() {
    	var total = 0d;
    	for (int i = 0; i < data.size(); i++) {
    		var s = data.get(i);
    		var pos = s.indexOf(';');
    		total += pos + s.length(); // .parseDouble(s.substring(pos + 1));
        }

        return total;
    }
}

java -jar target/benchmarks.jar "Cache01"

# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark      (MOREGARBAGE)  (RANDOM)  (SIZE)  Mode  Cnt      Score       Error  Units
Cache01.parse           true      true      10  avgt   10     47.995 ±     0.396  ns/op
Cache01.parse           true      true    1000  avgt   10   6484.647 ±    69.412  ns/op
Cache01.parse           true      true   10000  avgt   10  81505.961 ± 13035.464  ns/op
Cache01.parse           true     false      10  avgt   10     47.452 ±     0.864  ns/op
Cache01.parse           true     false    1000  avgt   10   6138.623 ±    29.380  ns/op
Cache01.parse           true     false   10000  avgt   10  70276.588 ± 13068.121  ns/op
Cache01.parse          false      true      10  avgt   10     44.323 ±     0.249  ns/op
Cache01.parse          false      true    1000  avgt   10   4910.066 ±    61.657  ns/op
Cache01.parse          false      true   10000  avgt   10  63560.366 ±  4251.100  ns/op
Cache01.parse          false     false      10  avgt   10     44.133 ±     0.713  ns/op
Cache01.parse          false     false    1000  avgt   10   4556.780 ±    59.546  ns/op
Cache01.parse          false     false   10000  avgt   10  57926.441 ±  3680.559  ns/op

Cache01.parse           true      true      10  avgt   10     47.586 ±     0.509  ns/op
Cache01.parse           true      true    1000  avgt   10   6469.714 ±     2.503  ns/op
Cache01.parse           true      true   10000  avgt   10  81499.896 ± 12107.251  ns/op
Cache01.parse           true     false      10  avgt   10     48.136 ±     0.378  ns/op
Cache01.parse           true     false    1000  avgt   10   6156.876 ±    26.356  ns/op
Cache01.parse           true     false   10000  avgt   10  69994.126 ± 13665.307  ns/op
Cache01.parse          false      true      10  avgt   10     44.743 ±     0.497  ns/op
Cache01.parse          false      true    1000  avgt   10   4895.309 ±    11.820  ns/op
Cache01.parse          false      true   10000  avgt   10  72291.021 ± 13616.639  ns/op
Cache01.parse          false     false      10  avgt   10     44.206 ±     1.026  ns/op
Cache01.parse          false     false    1000  avgt   10   4539.098 ±    91.328  ns/op
Cache01.parse          false     false   10000  avgt   10  59608.531 ±  8642.156  ns/op

Data - Caches

Check the details to see what is going on


# Parameters: (MOREGARBAGE = false, RANDOM = false, SIZE = 1000)
# Warmup Iteration   1: 4741.602 ns/op
# Warmup Iteration   2: 4620.027 ns/op
# Warmup Iteration   3: 4616.325 ns/op
# Warmup Iteration   4: 4573.271 ns/op
# Warmup Iteration   5: 4446.747 ns/op
Iteration   1: 4574.989 ns/op
Iteration   2: 4517.727 ns/op
Iteration   3: 4575.883 ns/op
Iteration   4: 4576.167 ns/op
Iteration   5: 4574.460 ns/op
Iteration   6: 4573.074 ns/op
Iteration   7: 4387.671 ns/op
Iteration   8: 4492.154 ns/op
Iteration   9: 4557.968 ns/op
Iteration  10: 4560.888 ns/op

# Parameters: (MOREGARBAGE = false, RANDOM = false, SIZE = 10000)
# Warmup Iteration   1: 45189.567 ns/op
# Warmup Iteration   2: 43549.135 ns/op
# Warmup Iteration   3: 43981.266 ns/op
# Warmup Iteration   4: 44022.431 ns/op
# Warmup Iteration   5: 43775.808 ns/op
Iteration   1: 57472.806 ns/op
Iteration   2: 57244.411 ns/op
Iteration   3: 55563.821 ns/op
Iteration   4: 61645.149 ns/op
Iteration   5: 56521.307 ns/op
Iteration   6: 56301.759 ns/op
Iteration   7: 56916.481 ns/op
Iteration   8: 55442.452 ns/op
Iteration   9: 66088.367 ns/op
Iteration  10: 72888.761 ns/op

# Parameters: (MOREGARBAGE = false, RANDOM = true, SIZE = 1000)
# Warmup Iteration   1: 4997.930 ns/op
# Warmup Iteration   2: 4906.259 ns/op
# Warmup Iteration   3: 4895.008 ns/op
# Warmup Iteration   4: 4893.813 ns/op
# Warmup Iteration   5: 4891.924 ns/op
Iteration   1: 4892.487 ns/op
Iteration   2: 4891.916 ns/op
Iteration   3: 4890.972 ns/op
Iteration   4: 4891.562 ns/op
Iteration   5: 4890.869 ns/op
Iteration   6: 4904.517 ns/op
Iteration   7: 4892.786 ns/op
Iteration   8: 4890.875 ns/op
Iteration   9: 4892.796 ns/op
Iteration  10: 4914.307 ns/op

# Parameters: (MOREGARBAGE = false, RANDOM = true, SIZE = 10000)
# Warmup Iteration   1: 62002.559 ns/op
# Warmup Iteration   2: 60539.963 ns/op
# Warmup Iteration   3: 61673.083 ns/op
# Warmup Iteration   4: 60632.315 ns/op
# Warmup Iteration   5: 60558.396 ns/op
Iteration   1: 60353.912 ns/op
Iteration   2: 60794.953 ns/op
Iteration   3: 82814.928 ns/op
Iteration   4: 73443.366 ns/op
Iteration   5: 71391.310 ns/op
Iteration   6: 70919.199 ns/op
Iteration   7: 79608.340 ns/op
Iteration   8: 81793.408 ns/op
Iteration   9: 61074.410 ns/op
Iteration  10: 80716.386 ns/op

Data - Final Thoughts

Same data for every run, only touch it differently


public class Cache02
{
	@Param({"10", "1000", "10000"}) int SIZE;

	int totalData = 10_000;
	int baseSize = 10_000;
	int counter = 0;

	final List<List<String>> data = new ArrayList<>(SIZE);

	String[] CITIES = {"Berlin;", "Hannover;", "Prag;", "Rio;", "Hamburg;", "Paris;", "Rotterdam;"};
	String[] TEMPERATURES = {"-99.9", "99.9", "3.4", "-12.2", "22.2", "26.8", "31.1", "11.0", "-5.6"};

	@Setup(Level.Trial)
	public void setupTrial()
	{
		data.clear();
		final var r = new Random(7L);

		for (int j = 0; j < totalData; j++)
		{
            final List<String> d = new ArrayList<>(SIZE);
			for (int i = 0; i < baseSize; i++)
			{
				d.add(CITIES[r.nextInt(CITIES.length)] + TEMPERATURES[r.nextInt(TEMPERATURES.length)]);
			}
			data.add(d);
		}
	}

	@Setup(Level.Iteration)
	public void setupIteration()
	{
		counter = 0;
	}

    @Benchmark
    public double cacheHit()
    {
        var d = data.get(0);

        var total = 0d;
        for (int i = 0; i < SIZE; i++)
        {
            var s = d.get(i);
            var pos = s.indexOf(';');
            total += pos + s.length(); // .parseDouble(s.substring(pos + 1));
        }

        return total;
    }

    @Benchmark
    public double cacheMiss()
    {
        var d = data.get(counter++ % totalData);

        var total = 0d;
        for (int i = 0; i < SIZE; i++)
        {
            var s = d.get(i);
            var pos = s.indexOf(';');
            total += pos + s.length(); // .parseDouble(s.substring(pos + 1));
        }

        return total;
    }
}

Data - Caches

Is it more correct now?


Benchmark          (SIZE)  Mode  Cnt       Score       Error  Units
Cache02.cacheHit       10  avgt   10      43.010 ±     0.184  ns/op
Cache02.cacheHit     1000  avgt   10    4778.031 ±    60.261  ns/op
Cache02.cacheHit    10000  avgt   10   64683.226 ± 14788.267  ns/op
Cache02.cacheMiss      10  avgt   10     336.260 ±    45.515  ns/op
Cache02.cacheMiss    1000  avgt   10   17669.625 ±    41.837  ns/op
Cache02.cacheMiss   10000  avgt   10  174635.237 ±  2976.802  ns/op

# Benchmark: org.xc.jmh.Cache02.cacheMiss
# Parameters: (SIZE = 10000)
# Warmup Iteration   1: 177439.666 ns/op
# Warmup Iteration   2: 174301.683 ns/op
# Warmup Iteration   3: 174319.358 ns/op
# Warmup Iteration   4: 175194.210 ns/op
# Warmup Iteration   5: 193149.452 ns/op
Iteration   1: 176797.338 ns/op
Iteration   2: 179366.082 ns/op
Iteration   3: 174686.931 ns/op
Iteration   4: 173912.824 ns/op
Iteration   5: 174186.462 ns/op
Iteration   6: 173900.033 ns/op
Iteration   7: 173583.888 ns/op
Iteration   8: 173610.283 ns/op
Iteration   9: 173037.852 ns/op
Iteration  10: 173270.673 ns/op

Benchmark          (SIZE)  Mode  Cnt       Score      Error  Units
Cache02.cacheHit       10  avgt   10      42.954 ±    0.304  ns/op
Cache02.cacheHit     1000  avgt   10    4802.211 ±   75.155  ns/op
Cache02.cacheHit    10000  avgt   10   72177.415 ± 6813.819  ns/op
Cache02.cacheMiss      10  avgt   10     415.397 ±  101.234  ns/op
Cache02.cacheMiss    1000  avgt   10   17653.832 ±   67.704  ns/op
Cache02.cacheMiss   10000  avgt   10  174592.955 ±  487.884  ns/op

# Benchmark: org.xc.jmh.Cache02.cacheHit
# Parameters: (SIZE = 10000)
# Warmup Iteration   1: 43386.186 ns/op
# Warmup Iteration   2: 47619.523 ns/op
# Warmup Iteration   3: 46837.359 ns/op
# Warmup Iteration   4: 44352.910 ns/op
# Warmup Iteration   5: 45543.546 ns/op
Iteration   1: 76304.197 ns/op
Iteration   2: 73811.738 ns/op
Iteration   3: 74944.359 ns/op
Iteration   4: 71217.526 ns/op
Iteration   5: 70320.534 ns/op
Iteration   6: 70690.886 ns/op
Iteration   7: 73316.245 ns/op
Iteration   8: 75228.869 ns/op
Iteration   9: 60798.436 ns/op
Iteration  10: 75141.359 ns/op

Data - Caches

Ask someone who knows things


# Perf - requires Linux kernel tooling
java -jar target/benchmarks.jar "Cache02.cacheMiss" -prof perfnorm

cacheMiss                         10000 avgt 202,126.64     ns/op
cacheMiss:IPC                     10000 avgt       1.22 insns/clk
cacheMiss:L1-dcache-load-misses   10000 avgt   7,949.14      #/op
cacheMiss:L1-dcache-loads         10000 avgt 125,343.85      #/op

cacheMiss:branch-misses           10000 avgt   7,672.78      #/op
cacheMiss:branches                10000 avgt 211,824.84      #/op
cacheMiss:cycles                  10000 avgt 516,010.56      #/op
cacheMiss:dTLB-load-misses        10000 avgt     124.12      #/op
cacheMiss:dTLB-loads              10000 avgt     152.94      #/op

cacheMiss:instructions            10000 avgt 634,205.58      #/op
cacheMiss:stalled-cycles-frontend 10000 avgt  54,376.13      #/op

# Benchmark: org.xc.jmh.Cache02.cacheMiss
# Parameters: (SIZE = 10000)
# Warmup Iteration   1: 177439.666 ns/op
# Warmup Iteration   2: 174301.683 ns/op
# Warmup Iteration   3: 174319.358 ns/op
# Warmup Iteration   4: 175194.210 ns/op
# Warmup Iteration   5: 193149.452 ns/op
Iteration   1: 176797.338 ns/op
Iteration   2: 179366.082 ns/op
Iteration   6: 173900.033 ns/op
Iteration   7: 173583.888 ns/op
Iteration   9: 173037.852 ns/op

# Might also just work on bare-metal or some clouds
java -jar target/benchmarks.jar "Cache02.cacheHit" -prof perfnorm

cacheHit                         10000 avgt  67,556.39     ns/op
cacheHit:IPC                     10000 avgt       3.87 insns/clk
cacheHit:L1-dcache-load-misses   10000 avgt   8,090.11      #/op
cacheHit:L1-dcache-loads         10000 avgt 102,131.89      #/op

cacheHit:branch-misses           10000 avgt     282.59      #/op
cacheHit:branches                10000 avgt 213,493.76      #/op
cacheHit:cycles                  10000 avgt 172,101.15      #/op
cacheHit:dTLB-load-misses        10000 avgt       0.35      #/op
cacheHit:dTLB-loads              10000 avgt     159.13      #/op

cacheHit:instructions            10000 avgt 667,271.07      #/op
cacheHit:stalled-cycles-frontend 10000 avgt   7,618.23      #/op
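The IPC line reported by perfnorm can be cross-checked by hand: it is simply instructions divided by cycles. A minimal sketch, with the per-op values hard-coded from the two runs above (class name is illustrative):

```java
public class IpcCheck {
    // IPC = retired instructions per CPU cycle
    static double ipc(double instructions, double cycles) {
        return instructions / cycles;
    }

    public static void main(String[] args) {
        // cacheMiss: stalls dominate, the pipeline mostly waits for memory
        System.out.printf("cacheMiss IPC: %.2f%n", ipc(634_205.58, 516_010.56));
        // cacheHit: nearly the same instruction count streams through at ~4 insns/clk
        System.out.printf("cacheHit  IPC: %.2f%n", ipc(667_271.07, 172_101.15));
    }
}
```

Both benchmarks execute roughly the same number of instructions; only the cycles differ, which is exactly what a cache effect looks like.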

# Benchmark: org.xc.jmh.Cache02.cacheHit
# Parameters: (SIZE = 10000)
# Warmup Iteration   1: 43386.186 ns/op
# Warmup Iteration   2: 47619.523 ns/op
# Warmup Iteration   3: 46837.359 ns/op
# Warmup Iteration   4: 44352.910 ns/op
# Warmup Iteration   5: 45543.546 ns/op
Iteration   1: 76304.197 ns/op
Iteration   2: 73811.738 ns/op
Iteration   6: 70690.886 ns/op
Iteration   7: 73316.245 ns/op
Iteration   9: 60798.436 ns/op

Detour: Cost

What Does it Cost to Leave the CPU?

                     Real       Humanized
CPU Cycle            0.4 ns     1 s
L1 Cache             0.9 ns     2 s
L2 Cache             2.8 ns     7 s
L3 Cache             28 ns      1 min
Memory Access        100 ns     4 min
NVM SSD              25 μs      17 h
SSD                  50–150 μs  1.5–4 days
HDD                  1–10 ms    1–9 months
TCP SF to NY         65 ms      5 years
TCP SF to Hong Kong  141 ms     11 years
  • Pay 100 to 200 cycles for every main memory access
  • Java and the CPU avoid main memory access at all cost
  • Java has no notion of a memory location, so all of this is transparent to you... but you can still work with it :)
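The "Humanized" column is just a rescaling: pick a factor so that one 0.4 ns CPU cycle becomes one second, then apply it everywhere. A small sketch of that conversion (class name is illustrative; the factor 2.5e9 is derived from the table, not taken from it):

```java
public class HumanizedLatency {
    // chosen so that one CPU cycle (0.4 ns) maps to exactly 1 second
    static final double SCALE = 2.5e9;

    // nanoseconds of real latency -> seconds on the humanized scale
    static double humanizedSeconds(double nanos) {
        return nanos * 1e-9 * SCALE;
    }

    public static void main(String[] args) {
        System.out.println(humanizedSeconds(0.4));  // CPU cycle -> 1 s
        System.out.println(humanizedSeconds(100));  // memory access -> 250 s, ~4 min
        System.out.println(humanizedSeconds(65e6)); // 65 ms TCP -> ~1.6e8 s, ~5 years
    }
}
```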

Detour: The modern CPU

Everything happens in parallel

  • Current CPUs can execute 5 to 7* instructions per cycle
  • Multiple execution units per core
  • Parallel execution despite code not being parallel
  • CPU reorders to fill the execution units
  • CPU tries things ahead of the result (prediction)
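One way to see those parallel execution units from Java is to break a dependency chain. A minimal sketch, assuming a hypothetical class name: with a single accumulator each add must wait for the previous result, while four independent accumulators let the CPU fill several units at once. Both return the same sum; the timing difference only becomes visible under a proper harness like JMH.

```java
public class Accumulators {
    // one long dependency chain: each add depends on the previous sum
    static long sumSingle(long[] data) {
        long sum = 0;
        for (long v : data) {
            sum += v;
        }
        return sum;
    }

    // four independent chains: the CPU can execute the adds in parallel
    static long sumQuad(long[] data) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < data.length; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        // pick up the remainder when the length is not a multiple of 4
        for (; i < data.length; i++) {
            s0 += data[i];
        }
        return s0 + s1 + s2 + s3;
    }
}
```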

Learnings

What have we learned so far?

  • GCs yield different performance
  • You might measure allocation speed
  • Some loops are unrolled
  • The JIT is unpredictable
  • Getting data arranged is hard
  • Memory access is expensive
  • Caches are friend and foe
  • CPUs are concurrent

Warmup and Measurement

Let's get into more details

Lifecycle and Frequency

The when, where, and how often

  • When is the code executed? - Startup, shutdown, normally, on error
  • How often is the code executed? - Once, a few times, like hell
  • How is it called? - From a single place, multiple places, in a loop, conditional or unconditional

Understand Warmup and Measurement

Just some basics first

  • Warmup is designed to bring the JVM into a steady state
  • Memory usage, class loading, JIT compilation, IO, and more
  • Measurement data is stored and evaluated
  • Iteration count, iteration size in time or calls
  • Applicable globally or per benchmark method

@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class C01 {
    final int SIZE = 100_000;
    final byte[] src = new byte[SIZE];
    final byte[] dest = new byte[SIZE];

    ...

    @Benchmark
    public byte[] systemCopy() {
        System.arraycopy(src, 0, dest, 0, src.length);
        return dest;
    }

    @Benchmark
    public byte[] manualCopy() {
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }
        return dest;
    }
}

# Hetzner
Benchmark       Mode  Cnt     Score      Error  Units
C01.manualCopy  avgt    3  1751.934 ±    8.782  ns/op
C01.systemCopy  avgt    3  1805.559 ±   60.820  ns/op

# T14s
Benchmark       Mode  Cnt     Score      Error  Units
C01.manualCopy  avgt    3  4881.513 ± 1209.687  ns/op
C01.systemCopy  avgt    3  4834.282 ± 1066.856  ns/op

Try different warmup

Vary warmup, keep measurement the same


@Measurement(iterations = 1, time=10, timeUnit = TimeUnit.MILLISECONDS)
...
public class C02 {
    @Benchmark
    @Warmup(iterations = 0)
    public byte[] systemCopyCold()

    @Benchmark
    @Warmup(iterations = 1, time = 10, timeUnit = TimeUnit.MILLISECONDS)
    public byte[] systemCopyWarm()

    @Benchmark
    @Warmup(iterations = 1, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
    public byte[] systemCopyHot()

    @Benchmark
    @Warmup(iterations = 0)
    public byte[] manualCopyCold()

    @Benchmark
    @Warmup(iterations = 1, time = 10, timeUnit = TimeUnit.MILLISECONDS)
    public byte[] manualCopyWarm()

    @Benchmark
    @Warmup(iterations = 1, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
    public byte[] manualCopyHot()
}

# Hetzner
Benchmark           Mode  Cnt       Score   Error  Units
C02.manualCopyCold  avgt       241472.047          ns/op
C02.manualCopyHot   avgt         1773.845          ns/op
C02.manualCopyWarm  avgt         4604.600          ns/op
C02.systemCopyCold  avgt         1938.973          ns/op
C02.systemCopyHot   avgt         3349.491          ns/op
C02.systemCopyWarm  avgt         1909.157          ns/op
---
C02.manualCopyCold  avgt       652502.813          ns/op
C02.manualCopyHot   avgt         1785.308          ns/op
C02.manualCopyWarm  avgt         1830.773          ns/op
C02.systemCopyCold  avgt         1954.546          ns/op
C02.systemCopyHot   avgt         1854.376          ns/op
C02.systemCopyWarm  avgt         3774.145          ns/op

# T14s
Benchmark           Mode  Cnt       Score   Error  Units
C02.manualCopyCold  avgt       866723.538          ns/op
C02.manualCopyHot   avgt         4951.803          ns/op
C02.manualCopyWarm  avgt       107999.305          ns/op
C02.systemCopyCold  avgt         5731.386          ns/op
C02.systemCopyHot   avgt         5007.324          ns/op
C02.systemCopyWarm  avgt         5213.909          ns/op

How can that be?

Look under the hood again


java -jar target/benchmarks.jar "C02" -jvmArgs "-XX:+PrintCompilation"
# Benchmark: org.xc.jmh.C02.systemCopyHot
# Fork: 1 of 1
...
# Warmup Iteration   1:
...
  175  506     3  org.xc.jmh.C02::systemCopyHot (23 bytes)
  175  507     3  org.openjdk.jmh.infra.Blackhole::consume (19 bytes)
  175  508     3  org.openjdk.jmh.infra.Blackhole::consumeCompiler (1 bytes)
  185  509     4  org.xc.jmh.C02::systemCopyHot (23 bytes)
# 186  506     3  org.xc.jmh.C02::systemCopyHot (23 bytes)   made not entrant
  186  510     4  org.openjdk.jmh.infra.Blackhole::consume (19 bytes)
# 186  507     3  org.openjdk.jmh.infra.Blackhole::consume (19 bytes)   made not entrant
  287  511 %   3  org.xc...jmhTest::systemCopyHot_avgt_jmhStub @ 13 (57 bytes)
  289  512     3  org.xc...jmhTest::systemCopyHot_avgt_jmhStub (57 bytes)
  361  513 %   4  org.xc...jmhTest::systemCopyHot_avgt_jmhStub @ 13 (57 bytes)
# 368  511 %   3  org.xc...jmhTest::systemCopyHot_avgt_jmhStub @ 13 (57 bytes)   made not entrant
#1175  513 %   4  org.xc...jmhTest::systemCopyHot_avgt_jmhStub @ 13 (57 bytes)   made not entrant

1820.859 ns/op
Iteration   1: 4797.916 ns/op

java -jar target/benchmarks.jar "C02" -jvmArgs "-XX:+PrintCompilation"
# Benchmark: org.xc.jmh.C02.systemCopyWarm
# Fork: 1 of 1
...
# Warmup Iteration   1:
...
 210  539     3  org.xc.jmh.C02::systemCopyWarm (23 bytes)
 211  540     3  org.openjdk.jmh.infra.Blackhole::consume (19 bytes)
 220  541     4  org.xc.jmh.C02::systemCopyWarm (23 bytes)
#221  539     3  org.xc.jmh.C02::systemCopyWarm (23 bytes)   made not entrant
 221  542     4  org.openjdk.jmh.infra.Blackhole::consume (19 bytes)
#221  540     3  org.openjdk.jmh.infra.Blackhole::consume (19 bytes)   made not entrant

1923.731 ns/op
Iteration   1: 1859.749 ns/op

How can that be?

Limit the JIT's powers


# Only compile quick and easy, never leave that again
java -jar target/benchmarks.jar "C02" -jvmArgs "-XX:TieredStopAtLevel=1"


Benchmark           Mode  Cnt       Score   Error  Units
C02.manualCopyCold  avgt       197266.887          ns/op
C02.manualCopyHot   avgt       146286.457          ns/op
C02.manualCopyWarm  avgt       147154.377          ns/op
C02.systemCopyCold  avgt         1943.922          ns/op
C02.systemCopyHot   avgt         1775.785          ns/op
C02.systemCopyWarm  avgt         3744.589          ns/op
--
C02.manualCopyCold  avgt       144304.352          ns/op
C02.manualCopyHot   avgt        57894.731          ns/op
C02.manualCopyWarm  avgt       115281.330          ns/op
C02.systemCopyCold  avgt         1940.215          ns/op
C02.systemCopyHot   avgt         1878.354          ns/op
C02.systemCopyWarm  avgt         4590.958          ns/op
--
C02.manualCopyCold  avgt       295074.343          ns/op
C02.manualCopyHot   avgt        58333.523          ns/op
C02.manualCopyWarm  avgt        57356.597          ns/op
C02.systemCopyCold  avgt         1943.131          ns/op
C02.systemCopyHot   avgt         1856.031          ns/op
C02.systemCopyWarm  avgt         4736.612          ns/op

# Don't background compile and don't switch between tiers
java -jar target/benchmarks.jar "C02" \
    -jvmArgs "-XX:-TieredCompilation -Xbatch"

Benchmark           Mode  Cnt         Score   Error  Units
C02.manualCopyCold  avgt       26368588.000          ns/op
C02.manualCopyHot   avgt           1758.489          ns/op
C02.manualCopyWarm  avgt       11940991.000          ns/op
C02.systemCopyCold  avgt           7170.455          ns/op
C02.systemCopyHot   avgt           1852.513          ns/op
C02.systemCopyWarm  avgt           7268.862          ns/op
--
C02.manualCopyCold  avgt       26817602.000          ns/op
C02.manualCopyHot   avgt           1787.099          ns/op
C02.manualCopyWarm  avgt       11927175.000          ns/op
C02.systemCopyCold  avgt           7158.861          ns/op
C02.systemCopyHot   avgt           1847.176          ns/op
C02.systemCopyWarm  avgt           7608.851          ns/op
--
C02.manualCopyCold  avgt       24100619.000          ns/op
C02.manualCopyHot   avgt           1760.130          ns/op
C02.manualCopyWarm  avgt       13403527.500          ns/op
C02.systemCopyCold  avgt           7284.120          ns/op
C02.systemCopyHot   avgt           4610.946          ns/op
C02.systemCopyWarm  avgt           7716.843          ns/op

Lifecycle and Frequency

Never do streams and lambdas. Period... or?


public class Streams {
    final int SIZE = 1024;
    final int[] integers = new int[SIZE];
    ...

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int lambdaArrayCold() {
        return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
    }
    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayWarm() {...}

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int lambdaArrayHot() {...}

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int loopCold() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = integers[i];
            if (v % 2 == 0) {
                sum += v;
            }
        }
        return sum;
    }

    @Benchmark
    @Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
    public int loopWarm() {...}

    @Benchmark
    @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
    public int loopHot() {...}
}

Benchmark                Mode Cnt      Score      Error  Units
Streams.lambdaArrayCold  avgt   5 574,212.40 ± 3,911,792 ns/op
Streams.lambdaArrayWarm  avgt   5  27,398.98 ±   181,998 ns/op
Streams.lambdaArrayHot   avgt   5     860.11 ±     2,943 ns/op

Streams.loopCold         avgt   5  26,345.93 ±    12,867 ns/op
Streams.loopWarm         avgt   5     605.36 ±       225 ns/op
Streams.loopHot          avgt   5   1,568.21 ±      1882 ns/op

Streams.lambdaArrayCold
Iteration   1: 2391282.000 ns/op
Iteration   2: 144067.000 ns/op
Iteration   3: 119609.000 ns/op
Iteration   4: 109935.000 ns/op
Iteration   5: 106169.000 ns/op

Streams.lambdaArrayWarm
Iteration   1: 111857.000 ns/op
Iteration   2: 4564.900 ns/op
Iteration   3: 4609.857 ns/op
Iteration   4: 6015.000 ns/op
Iteration   5: 9948.167 ns/op

Streams.loopCold
Iteration   1: 22230.655 ns/op
Iteration   2: 25418.000 ns/op
Iteration   3: 28242.000 ns/op
Iteration   4: 24892.500 ns/op
Iteration   5: 30946.500 ns/op
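Before comparing speeds, it pays to verify that both variants actually compute the same thing; otherwise the comparison is meaningless. A minimal standalone sketch of the two bodies from the Streams benchmark (class name is illustrative):

```java
import java.util.Arrays;

public class StreamsCheck {
    // stream variant: filter even values, sum them
    static int lambdaSum(int[] integers) {
        return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
    }

    // loop variant: same logic, hand-written
    static int loopSum(int[] integers) {
        int sum = 0;
        for (int v : integers) {
            if (v % 2 == 0) {
                sum += v;
            }
        }
        return sum;
    }
}
```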

Lifecycle and Frequency - Real Work

You shall not care too much


public class HeavyStreams {
    // same setup as before, but integers is a BigInteger[] here

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int lambdaArrayCold() {
        return Arrays.stream(integers)
            .filter(i -> i.mod(BigInteger.TWO).equals(BigInteger.ZERO))
            .map(i -> i.sqrt())
            .mapToInt(BigInteger::intValue)
            .sum();
    }

    ...

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int loopCold() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = integers[i];

            if (v.mod(BigInteger.TWO).equals(BigInteger.ZERO)) {
                sum += v.sqrt().intValue();
            }
        }

        return sum;
    }
}

# Milliseconds now!!!
Benchmark                     Mode Cnt   Score    Error Units
HeavyStreams.lambdaArrayCold  avgt   5  283.99 ± 220.86 ms/op
HeavyStreams.lambdaArrayWarm  avgt   5  254.92 ±  66.34 ms/op
HeavyStreams.lambdaArrayHot   avgt   5  247.91 ±  53.26 ms/op

HeavyStreams.loopCold         avgt   5  297.98 ± 165.77 ms/op
HeavyStreams.loopWarm         avgt   5  248.82 ±  51.59 ms/op
HeavyStreams.loopHot          avgt   5  253.76 ±  68.80 ms/op

HeavyStreams.lambdaArrayCold
Iteration   1: 359.072 ms/op
Iteration   2: 332.922 ms/op
Iteration   3: 242.727 ms/op
Iteration   4: 243.821 ms/op
Iteration   5: 241.427 ms/op

HeavyStreams.lambdaArrayWarm
Iteration   1: 241.479 ms/op
Iteration   2: 272.908 ms/op
Iteration   3: 242.048 ms/op
Iteration   4: 274.634 ms/op
Iteration   5: 243.568 ms/op

HeavyStreams.loopCold
Iteration   1: 349.515 ms/op
Iteration   2: 326.304 ms/op
Iteration   3: 301.334 ms/op
Iteration   4: 271.928 ms/op
Iteration   5: 240.832 ms/op
  • Overhead almost disappeared

Lifecycle and Frequency - Proper Real Work

Blackhole.consumeCPU to the rescue


@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(iterations = 5, batchSize = 1, time=1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class HeavyStreamsConsume {
    final int SIZE = 1024;
    final int[] integers = new int[SIZE];

    static int calcAndBurn(int a) {
        Blackhole.consumeCPU(10011);
        return a % 2;
    }

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int lambdaArrayCold() {
        return Arrays
        		.stream(integers)
        		.filter(i -> calcAndBurn(i) == 0).sum();
    }

    @Benchmark
    @Warmup(iterations = 0, batchSize = 1)
    public int loopCold() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = integers[i];
            if (calcAndBurn(v) == 0) {
                sum += v;
            }
        }

        return sum;
    }
}

# Microseconds now!!!
Benchmark                           Mode Cnt     Score      Error Units
HeavyStreamsConsume.lambdaArrayCold avgt   5  21,165.6 ± 12,498.1 us/op
HeavyStreamsConsume.lambdaArrayHot  avgt   5  19,653.0 ±     84.0 us/op
HeavyStreamsConsume.lambdaArrayWarm avgt   5  19,693.3 ±    170.3 us/op
HeavyStreamsConsume.loopCold        avgt   5  20,642.6 ±  8,376.7 us/op
HeavyStreamsConsume.loopHot         avgt   5  19,622.6 ±     88.5 us/op
HeavyStreamsConsume.loopWarm        avgt   5  19,656.9 ±     38.7 us/op

HeavyStreams.lambdaArrayCold
Iteration   1: 26971.615 us/op
Iteration   2: 19761.449 us/op
Iteration   3: 19707.116 us/op
Iteration   4: 19702.641 us/op
Iteration   5: 19685.665 us/op

HeavyStreams.lambdaArrayWarm
Iteration   1: 19771.183 us/op
Iteration   2: 19678.924 us/op
Iteration   3: 19682.219 us/op
Iteration   4: 19672.906 us/op
Iteration   5: 19661.459 us/op

HeavyStreams.loopCold
Iteration   1: 24534.031 us/op
Iteration   2: 19690.451 us/op
Iteration   3: 19663.822 us/op
Iteration   4: 19651.893 us/op
Iteration   5: 19672.885 us/op

Measurement Types

How you can measure

Benchmark Modes

Measurement approaches


@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Thread)
public class M01
{
    ...
	@Benchmark
	@BenchmarkMode(Mode.AverageTime)
	@OutputTimeUnit(TimeUnit.NANOSECONDS)
	public long sum_AverageTime()
    {
		long result = 0;
		for (int i = 0; i < data.length; i++)
        {
			result += data[i];
			Blackhole.consumeCPU(10);
		}

		return result;
	}

	@Benchmark
	@BenchmarkMode(Mode.Throughput)
	@OutputTimeUnit(TimeUnit.MILLISECONDS)
	public long sum_Throughput() { ... }

	@Benchmark
	@BenchmarkMode(Mode.SingleShotTime)
	@OutputTimeUnit(TimeUnit.NANOSECONDS)
	public long sum_SingleShotTime() { ... }

	@Benchmark
	@BenchmarkMode(Mode.SampleTime)
	@OutputTimeUnit(TimeUnit.NANOSECONDS)
	public long sum_SampleTime() { ... }
}

Benchmark                    Mode   Cnt       Score        Error  Units
M01.sum_Throughput          thrpt     3      74.127 ±      5.157 ops/ms
M01.sum_AverageTime          avgt     3   13357.290 ±   1096.865  ns/op
M01.sum_SingleShotTime         ss     3  147726.000 ± 486517.167  ns/op

M01.sum_SampleTime         sample 56633   13305.622 ±     21.038  ns/op
M01.sum_SampleTime:p0.00   sample         12640.000               ns/op
M01.sum_SampleTime:p0.50   sample         13184.000               ns/op
M01.sum_SampleTime:p0.90   sample         13280.000               ns/op
M01.sum_SampleTime:p0.95   sample         13344.000               ns/op
M01.sum_SampleTime:p0.99   sample         17888.000               ns/op
M01.sum_SampleTime:p0.999  sample         33472.000               ns/op
M01.sum_SampleTime:p0.9999 sample         45036.966               ns/op
M01.sum_SampleTime:p1.00   sample         51008.000               ns/op
  • SingleShotTime affects WarmUp as well
  • SingleShotTime ignores defined runtimes
  • You can adjust the OutputTimeUnit
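Throughput and AverageTime are two views of the same measurement, so the scores above should roughly agree after conversion. A small sketch (class name is illustrative; 74.127 ops/ms is the Throughput score from the table, and 1e6 / 74.127 ≈ 13,490 ns/op sits right next to the 13,357 ns/op AverageTime score):

```java
public class ModeConversion {
    // ops/ms -> ns/op: one millisecond is 1,000,000 nanoseconds
    static double opsPerMsToNsPerOp(double opsPerMs) {
        return 1_000_000.0 / opsPerMs;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f ns/op%n", opsPerMsToNsPerOp(74.127));
    }
}
```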

An Exception

Flow changes are ... well

One exception only

When things go wrong extremely rarely


@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
public class Error01 {
	int iterationCount = 0;
    final String[] integers = new String[1000];

    @Setup(Level.Iteration)
    public void setup(BenchmarkParams params)
    {
    	iterationCount++;

        for (int i = 0; i < integers.length; i++) {
            integers[i] = new String(String.valueOf(i));
        }
        // inject a failure in the 6th measurement iteration
        if (iterationCount == params.getWarmup().getCount() + 6) {
            integers[integers.length / 2] = "any";
        }
        else {
            integers[integers.length / 2] = "42";
        }
    }

    @Benchmark
    public long parse() {
        long total = 0;
        for (int i = 0; i < integers.length; i++) {
            try {
                total += Integer.valueOf(integers[i]);
            }
            catch(NumberFormatException e) {
                total = 0;
            }
        }

        return total;
    }
}

# Benchmark: org.xc.jmh.Error01.parse

# Warmup Iteration   1: 5886.587 ns/op
# Warmup Iteration   2: 5945.850 ns/op
# Warmup Iteration   3: 5723.013 ns/op
Iteration   1: 5726.163 ns/op
Iteration   2: 5721.427 ns/op
Iteration   3: 5724.125 ns/op
Iteration   4: 5725.052 ns/op
Iteration   5: 5726.114 ns/op
Iteration   6: 7242.212 ns/op
Iteration   7: 5955.518 ns/op
Iteration   8: 5942.675 ns/op
Iteration   9: 5937.309 ns/op
Iteration  10: 5933.929 ns/op
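The spike in iteration 6 (7242 vs. ~5726 ns/op) is the injected "any": one NumberFormatException per benchmark call, plus the deoptimized path it drags in. The parse logic can be checked standalone (class name is illustrative):

```java
public class ParseSum {
    // mirrors the benchmark body: a bad value resets the running total
    static long parse(String[] integers) {
        long total = 0;
        for (String s : integers) {
            try {
                total += Integer.valueOf(s);
            } catch (NumberFormatException e) {
                total = 0;
            }
        }
        return total;
    }
}
```

Note the semantics: a failure does not skip the element, it wipes everything summed so far, so the result depends on where the bad value sits.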

One exception only

When things go wrong all the time


@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Fork(value = 1, jvmArgsAppend = {"-Xms4g", "-Xmx4g", "-XX:+AlwaysPreTouch"})
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class Error02 {
	int iterationCount = 0;
    final String[] integers = new String[1000];

    @Setup(Level.Iteration)
    public void setup(BenchmarkParams params) {
    	iterationCount++;

        for (int i = 0; i < integers.length; i++) {
            integers[i] = new String(String.valueOf(i));
        }
        // unlike Error01, the "normal" value is also non-numeric, so every iteration fails
        if (iterationCount == params.getWarmup().getCount() + 6) {
        	integers[integers.length / 2] = "any";
        }
        else {
            integers[integers.length / 2] = "yas";
        }
    }

    @Benchmark
    public long parse() {
    	long total = 0;
    	for (int i = 0; i < integers.length; i++) {
    		try {
    			total += Integer.valueOf(integers[i]);
    		}
    		catch(NumberFormatException e) {
    			total = 0;
    		}
    	}

    	return total;
    }
}

# Benchmark: org.xc.jmh.Error02.parse

# Warmup Iteration   1: 7003.412 ns/op
# Warmup Iteration   2: 7318.162 ns/op
# Warmup Iteration   3: 6910.102 ns/op
Iteration   1: 6962.584 ns/op
Iteration   2: 6912.097 ns/op
Iteration   3: 6974.506 ns/op
Iteration   4: 6919.982 ns/op
Iteration   5: 6918.328 ns/op
Iteration   6: 7044.448 ns/op
Iteration   7: 6898.633 ns/op
Iteration   8: 6897.084 ns/op
Iteration   9: 6908.213 ns/op
Iteration  10: 6900.028 ns/op

Things go the other way

I know what you might evaluate to


public class BranchPrediction {
    private static final int COUNT = 10000;

    // sorted ints
    private int[] sorted = new int[COUNT];
    // fully unsorted
    private int[] unsorted = new int[COUNT];
    // alternating branch usage
    private int[] pattern2 = new int[COUNT];
    // 80/20 branch usage
    private int[] pattern5 = new int[COUNT];
    // data from pattern2, but shuffled
    private int[] shuffledPattern = new int[COUNT];

    @Setup
    public void setup() {
        final Random r = new Random(13241L);

        for (int i = 0; i < COUNT; i++) {
        	final int d = r.nextInt(100_000_000) + 1;
            sorted[i] = d;
            unsorted[i] = d;
            pattern2[i] = i % 2 == 0 ? 100_000_000 : r.nextInt(50_000_000);
            pattern5[i] = i % 5 == 0 ? 100_000_000 : r.nextInt(50_000_000);
            shuffledPattern[i] = pattern2[i];
        }
        Arrays.sort(sorted);

        // make it noisy to break the branch predictor
        for (int i = 0; i < shuffledPattern.length; i++) {
            int j = r.nextInt(shuffledPattern.length);
            var t = shuffledPattern[j];
            shuffledPattern[j] = shuffledPattern[i];
            shuffledPattern[i] = t;
        }
    }

    public long doIt(int[] array) {
        long sum = 2;
        for (int v : array) {
            if (v > 50_000_000) {
                sum += (v >> 2);
            }
            else {
                sum += (v << 2);
            }
        }
        return sum;
    }


    @Benchmark
    public long sorted() {
        return doIt(sorted);
    }
    ...
}

# Hetzner
Benchmark          Mode  Cnt      Score       Error  Units
Branch01.pattern2  avgt    5   4157.021 ±   231.482  ns/op
Branch01.pattern5  avgt    5   4154.633 ±   226.007  ns/op
Branch01.random    avgt    5   8155.178 ±     2.132  ns/op
Branch01.sorted    avgt    5   4196.762 ±   202.215  ns/op
Branch01.unsorted  avgt    5   8131.623 ±   856.262  ns/op

# T14s
Branch01.pattern2  avgt    5  12285.168 ±  1418.264  ns/op
Branch01.pattern5  avgt    5  10644.531 ±   418.107  ns/op
Branch01.random    avgt    5  40382.077 ±  9874.619  ns/op
Branch01.sorted    avgt    5  10577.268 ±  2053.626  ns/op
Branch01.unsorted  avgt    5  40491.795 ± 15274.222  ns/op
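Sorting the data changes only the branch pattern, never the result, which is what makes this a fair comparison. A standalone sketch of the same kernel (class name is illustrative; the timing gap itself needs JMH to show up reliably):

```java
import java.util.Arrays;
import java.util.Random;

public class BranchCheck {
    // same kernel as the benchmark: branch on the value, shift accordingly
    static long doIt(int[] array) {
        long sum = 2;
        for (int v : array) {
            if (v > 50_000_000) {
                sum += (v >> 2);
            } else {
                sum += (v << 2);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Random r = new Random(13241L);
        int[] unsorted = r.ints(10_000, 1, 100_000_001).toArray();
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted);
        // same multiset of values, same sum -- only the branch pattern differs
        System.out.println(doIt(sorted) == doIt(unsorted)); // true
    }
}
```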

Things go the other way

We love statistics - AMD EPYC


Branch01.pattern2                          avgt    5   4763.210 ±  12.230      ns/op
Branch01.pattern2:IPC                      avgt           4.740            insns/clk
Branch01.pattern2:L1-dcache-load-misses    avgt         535.867                 #/op
Branch01.pattern2:L1-dcache-loads          avgt        8442.117                 #/op
Branch01.pattern2:branch-misses            avgt           4.468                 #/op
Branch01.pattern2:branches                 avgt       14288.724                 #/op
Branch01.pattern2:cycles                   avgt       12237.059                 #/op
Branch01.pattern2:dTLB-load-misses         avgt           0.023                 #/op
Branch01.pattern2:dTLB-loads               avgt           0.318                 #/op
Branch01.pattern2:instructions             avgt       58002.780                 #/op
Branch01.pattern2:stalled-cycles-frontend  avgt         100.277                 #/op

Branch01.pattern5                          avgt    5   4192.828 ± 359.636      ns/op
Branch01.pattern5:IPC                      avgt           5.220            insns/clk
Branch01.pattern5:L1-dcache-load-misses    avgt         535.540                 #/op
Branch01.pattern5:L1-dcache-loads          avgt        8428.333                 #/op
Branch01.pattern5:branch-misses            avgt           3.536                 #/op
Branch01.pattern5:branches                 avgt       11897.675                 #/op
Branch01.pattern5:cycles                   avgt       10714.181                 #/op
Branch01.pattern5:dTLB-load-misses         avgt           0.021                 #/op
Branch01.pattern5:dTLB-loads               avgt           0.296                 #/op
Branch01.pattern5:instructions             avgt       55924.576                 #/op
Branch01.pattern5:stalled-cycles-frontend  avgt         129.842                 #/op

Branch01.random                            avgt    5   7763.375 ± 483.591      ns/op
Branch01.random:IPC                        avgt           2.935            insns/clk
Branch01.random:L1-dcache-load-misses      avgt         534.159                 #/op
Branch01.random:L1-dcache-loads            avgt        8454.813                 #/op
Branch01.random:branch-misses              avgt           2.361                 #/op
Branch01.random:branches                   avgt       14225.074                 #/op
Branch01.random:cycles                     avgt       19737.984                 #/op
Branch01.random:dTLB-load-misses           avgt           0.029                 #/op
Branch01.random:dTLB-loads                 avgt           0.528                 #/op
Branch01.random:instructions               avgt       57938.467                 #/op
Branch01.random:stalled-cycles-frontend    avgt        6542.892                 #/op

Branch01.sorted                            avgt    5   4773.460 ±  13.965      ns/op
Branch01.sorted:IPC                        avgt           4.740            insns/clk
Branch01.sorted:L1-dcache-load-misses      avgt         534.771                 #/op
Branch01.sorted:L1-dcache-loads            avgt        8431.196                 #/op
Branch01.sorted:branch-misses              avgt           5.447                 #/op
Branch01.sorted:branches                   avgt       14266.975                 #/op
Branch01.sorted:cycles                     avgt       12229.820                 #/op
Branch01.sorted:dTLB-load-misses           avgt           0.023                 #/op
Branch01.sorted:dTLB-loads                 avgt           0.299                 #/op
Branch01.sorted:instructions               avgt       57969.496                 #/op
Branch01.sorted:stalled-cycles-frontend    avgt          98.264                 #/op

Branch01.unsorted                          avgt    5   8255.555 ±  18.594      ns/op
Branch01.unsorted:IPC                      avgt           2.733            insns/clk
Branch01.unsorted:L1-dcache-load-misses    avgt         532.575                 #/op
Branch01.unsorted:L1-dcache-loads          avgt        8399.977                 #/op
Branch01.unsorted:branch-misses            avgt           2.683                 #/op
Branch01.unsorted:branches                 avgt       14196.794                 #/op
Branch01.unsorted:cycles                   avgt       21289.886                 #/op
Branch01.unsorted:dTLB-load-misses         avgt           0.021                 #/op
Branch01.unsorted:dTLB-loads               avgt           0.434                 #/op
Branch01.unsorted:instructions             avgt       58183.757                 #/op
Branch01.unsorted:stalled-cycles-frontend  avgt        8768.434                 #/op

Things go the other way

We love statistics - T14s AMD Ryzen 7 PRO 4750U


Branch01.pattern2                          avgt    5  12644.421 ±    32.790      ns/op
Branch01.pattern2:IPC                      avgt           4.101              insns/clk
Branch01.pattern2:L1-dcache-load-misses    avgt         635.322                   #/op
Branch01.pattern2:L1-dcache-loads          avgt       10065.464                   #/op
Branch01.pattern2:branch-misses            avgt           5.700                   #/op
Branch01.pattern2:branches                 avgt       17610.050                   #/op
Branch01.pattern2:cycles                   avgt       17783.624                   #/op
Branch01.pattern2:instructions             avgt       72937.027                   #/op
Branch01.pattern2:stalled-cycles-backend   avgt        2556.326                   #/op
Branch01.pattern2:stalled-cycles-frontend  avgt          14.046                   #/op

Branch01.pattern5                          avgt    5  11561.519 ±    91.598      ns/op
Branch01.pattern5:IPC                      avgt           4.301              insns/clk
Branch01.pattern5:L1-dcache-load-misses    avgt         634.284                   #/op
Branch01.pattern5:L1-dcache-loads          avgt       10058.859                   #/op
Branch01.pattern5:branch-misses            avgt           4.610                   #/op
Branch01.pattern5:branches                 avgt       14569.519                   #/op
Branch01.pattern5:cycles                   avgt       16247.216                   #/op
Branch01.pattern5:instructions             avgt       69877.428                   #/op
Branch01.pattern5:stalled-cycles-backend   avgt        3526.451                   #/op
Branch01.pattern5:stalled-cycles-frontend  avgt          19.762                   #/op

Branch01.random                            avgt    5  39071.484 ± 11327.252      ns/op
Branch01.random:IPC                        avgt           1.285              insns/clk
Branch01.random:L1-dcache-load-misses      avgt         650.766                   #/op
Branch01.random:L1-dcache-loads            avgt       11321.971                   #/op
Branch01.random:branch-misses              avgt         975.152                   #/op
Branch01.random:branches                   avgt       17595.904                   #/op
Branch01.random:cycles                     avgt       56724.714                   #/op
Branch01.random:instructions               avgt       72887.307                   #/op
Branch01.random:stalled-cycles-backend     avgt         270.540                   #/op
Branch01.random:stalled-cycles-frontend    avgt        1286.695                   #/op

Branch01.sorted                            avgt    5  11606.551 ±    47.793      ns/op
Branch01.sorted:IPC                        avgt           4.464              insns/clk
Branch01.sorted:L1-dcache-load-misses      avgt         634.906                   #/op
Branch01.sorted:L1-dcache-loads            avgt       10074.952                   #/op
Branch01.sorted:branch-misses              avgt           6.583                   #/op
Branch01.sorted:branches                   avgt       17603.203                   #/op
Branch01.sorted:cycles                     avgt       16285.361                   #/op
Branch01.sorted:instructions               avgt       72693.454                   #/op
Branch01.sorted:stalled-cycles-backend     avgt        3483.882                   #/op
Branch01.sorted:stalled-cycles-frontend    avgt          31.488                   #/op

Branch01.unsorted                          avgt    5  37313.611 ± 14404.163      ns/op
Branch01.unsorted:IPC                      avgt           1.400              insns/clk
Branch01.unsorted:L1-dcache-load-misses    avgt         661.293                   #/op
Branch01.unsorted:L1-dcache-loads          avgt       11423.488                   #/op
Branch01.unsorted:branch-misses            avgt         793.127                   #/op
Branch01.unsorted:branches                 avgt       17588.644                   #/op
Branch01.unsorted:cycles                   avgt       52257.003                   #/op
Branch01.unsorted:instructions             avgt       73173.150                   #/op
Branch01.unsorted:stalled-cycles-backend   avgt         205.963                   #/op
Branch01.unsorted:stalled-cycles-frontend  avgt        1083.194                   #/op
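The sorted vs. random effect above can be reproduced in plain Java. The sketch below (class and method names are made up for this demo) only proves that both variants do the same work; the timing and branch-miss measurement itself belongs in JMH with -prof perfnorm.

```java
import java.util.Arrays;
import java.util.Random;

// Same count over the same data, once shuffled and once sorted. The branch
// on v >= threshold is taken ~50% of the time either way, but on sorted
// data the pattern becomes trivially predictable for the branch predictor.
public class BranchDemo {
    static long countAbove(int[] data, int threshold) {
        long count = 0;
        for (int v : data) {
            if (v >= threshold) {  // the predictable/unpredictable branch
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] data = new Random(42L).ints(1_000_000, 0, 256).toArray();
        long random = countAbove(data, 128);
        Arrays.sort(data);
        long sorted = countAbove(data, 128);
        // identical result and instruction count - only the
        // branch-miss rate differs between the two runs
        System.out.println(random == sorted);  // prints true
    }
}
```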

Consuming Time

Making things realistic

Streams or Loops

Shall we use streams or loops?


@State(Scope.Benchmark)
public class E01
{
    final int[] integers = new int[1000];

    @Setup
    public void setup() {
        for (int i = 0; i < integers.length; i++) {
            integers[i] = i;
        }
    }

    @Benchmark
    public int lambda() {
        return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
    }

    @Benchmark
    public int loop() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = integers[i];
            if (v % 2 == 0) {
                sum += v;
            }
        }

        return sum;
    }
}

# No CPU burning, JDK 17, T14s
Benchmark   Mode  Cnt    Score    Error  Units
E01.lambda  avgt    3  614.137 ± 27.369  ns/op
E01.loop    avgt    3  537.428 ± 49.316  ns/op

# Check GC -prof gc
Benchmark                      Mode  Cnt    Score     Error   Units
E01.lambda                     avgt    3  646.987 ± 105.953   ns/op
E01.lambda:gc.alloc.rate.norm  avgt    3  240.000 ±   0.001    B/op

E01.loop                       avgt    3  534.282 ±  37.081   ns/op
E01.loop:gc.alloc.rate.norm    avgt    3   ≈ 10⁻⁴              B/op

# SURPRISE - HETZNER AMD EPYC
Benchmark   Mode  Cnt    Score   Error  Units
E01.lambda  avgt    3  466.097 ± 6.798  ns/op
E01.loop    avgt    3  548.451 ± 1.450  ns/op

# SURPRISE - JDK 21 T14s
Benchmark   Mode  Cnt    Score      Error  Units
E01.lambda  avgt    3  600.270 ±  275.706  ns/op
E01.loop    avgt    3  661.141 ± 3868.363  ns/op
  • Ok, never use streams... or should we?

Increase the Workload

Burn CPU without consuming memory


public class E02
{
    ...
    private static int calc(final int a) {
        // consume time and cpu, but don't waste memory
        // trick the compiler too and avoid optimization
        // by creating a side-effect
    	Blackhole.consumeCPU(100);

    	return a;
    }

    @Benchmark
    public int lambda() {
        return Arrays.stream(integers)
            .filter(i -> calc(i) % 2 == 0)
            .sum();
    }

    @Benchmark
    public int loop() {
        int sum = 0;
        for (int i = 0; i < integers.length; i++) {
            var v = calc(integers[i]);
            if (v % 2 == 0) {
                sum += v;
            }
        }

        return sum;
    }
}

# T14s, JDK 17
Benchmark   Mode  Cnt    Score    Error  Units
E01.lambda  avgt    3  614.137 ± 27.369  ns/op
E01.loop    avgt    3  537.428 ± 49.316  ns/op

Benchmark                      Mode  Cnt       Score      Error   Units
E02.lambda                     avgt    3  160445.533 ± 8399.802   ns/op
E02.lambda:gc.alloc.rate.norm  avgt    3     240.086 ±    0.058    B/op

E02.loop                       avgt    3  161183.136 ± 6899.931   ns/op
E02.loop:gc.alloc.rate.norm    avgt    3       0.086 ±    0.057    B/op

# Hetzner JDK 21
Benchmark   (WORK)  Mode  Cnt       Score      Error  Units
E02.lambda       1  avgt    3    3915.734 ± 1001.819  ns/op
E02.lambda      10  avgt    3   11209.627 ±   18.601  ns/op
E02.lambda     100  avgt    3  175428.897 ± 3424.067  ns/op
E02.loop         1  avgt    3    4116.134 ±   14.891  ns/op
E02.loop        10  avgt    3   10768.445 ±  141.270  ns/op
E02.loop       100  avgt    3  173909.239 ± 3531.735  ns/op
  • It all depends...
  • Still needs a little memory
  • But the runtime difference is gone
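Blackhole.consumeCPU(tokens) burns a deterministic amount of CPU work without allocating; conceptually it loops over a value backed by a volatile field so the JIT cannot remove the work. A rough plain-Java analogue (BusyWork and its loop body are made up here, not JMH's actual implementation):

```java
public class BusyWork {
    // volatile sink: writing the result back is a side effect the JIT
    // must preserve, so the loop cannot be optimized away
    static volatile long sink;

    // rough analogue of Blackhole.consumeCPU(tokens):
    // burn CPU proportional to tokens, allocate nothing
    static void consumeCpu(long tokens) {
        long t = sink;  // seed from the sink, so t is not a compile-time constant
        for (long i = tokens; i > 0; i--) {
            t += (t << 1) + i;
        }
        sink = t;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        consumeCpu(10_000_000L);
        // work was done, memory was not
        System.out.println(System.nanoTime() - start > 0);  // prints true
    }
}
```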

Learnings

The learning goes on

  • When code runs is important
  • Simulating warm code is hard
  • JIT is unpredictable
  • You can measure differently
  • The program flow is key
  • Branch prediction changes everything
  • perf is nice but sometimes hard to interpret
  • Burning CPU is legit and needed

Threads

How to run multi-threaded

Run things in parallel

Independent workloads


...
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class T01
{
    final int SIZE = 10000;
    List<String> csv = new ArrayList<>(SIZE);

    @Setup(Level.Trial)
    public void setup()	{
    	var r = new FastRandom(41L);
    	for (int i = 0; i < SIZE; i++) {
    		var sb = new StringJoiner(",");
    		for (int j = 0; j < 5; j++) {
    			sb.add(RandomUtils.randomString(r, 15));
    		}
    		csv.add(sb.toString());
    	}
    }

    @Benchmark
    @Threads(1)
    public long split_1() {
    	long count = 0;
    	for (var s : csv) {
    		count += s.split(",").length;
    	}
    	return count;
    }

    @Benchmark
    @Threads(2)
    public long split_2() { ... }
}
  • @State: Per benchmark
  • @OutputTimeUnit: In microseconds
  • @Setup(Level.Trial): Setup once only
  • Threading adds synchronization overhead, hence longer runtimes per invocation are needed
  • JMH attempts to start all threads at the same time
  • @Threads(2): How many threads should run that
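Conceptually, @Threads(n) runs n copies of the benchmark method concurrently over the shared @State. A plain-Java sketch of that idea (ParallelSplit is a made-up name; JMH additionally synchronizes iteration start and does the timing, which this sketch omits):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSplit {
    // every thread runs the same independent workload on shared read-only data
    static long run(List<String> csv, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> futures = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            futures.add(pool.submit(() -> {
                long count = 0;
                for (String s : csv) {
                    count += s.split(",").length;
                }
                return count;
            }));
        }
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        List<String> csv = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            csv.add("aaaaa,bbbbb,ccccc,ddddd,eeeee");
        }
        // 2 threads x 10,000 lines x 5 fields each
        System.out.println(run(csv, 2));  // prints 100000
    }
}
```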

Thread Scale via JMH

Let's see how it performs


# T14s Gen 1, AMD 8+8 Core, 32 GB memory
# Mode powersave (battery)
Benchmark    Mode  Cnt     Score     Error  Units
T01.split_1  avgt    3  1111.237 ±  76.534  us/op
T01.split_2  avgt    3  1247.724 ± 149.086  us/op
T01.split_4  avgt    3  1387.988 ± 255.601  us/op
T01.split_6  avgt    3  1659.864 ± 226.979  us/op
T01.split_8  avgt    3  2243.382 ± 207.846  us/op

# T14s Gen 1, AMD 8+8 Core, 32 GB memory
# Mode performance (plugged in)
Benchmark    Mode  Cnt     Score      Error  Units
T01.split_1  avgt    3  1035.574 ±  139.435  us/op
T01.split_2  avgt    3  1167.293 ±  102.166  us/op
T01.split_4  avgt    3  1480.129 ±  563.533  us/op
T01.split_6  avgt    3  1933.609 ± 1095.241  us/op
T01.split_8  avgt    3  2291.194 ±  321.369  us/op
  • It does not scale!
  • Why?
  • We only share read-only data

# T14s Gen 1, AMD 8+8 Core, 32 GB memory
# SMT off, Turbo off, Mode powersave
Benchmark    Mode  Cnt     Score      Error  Units
T01.split_1  avgt    3  2447.491 ±  246.614  us/op
T01.split_2  avgt    3  2451.685 ±  120.794  us/op
T01.split_4  avgt    3  2669.147 ± 1487.202  us/op
T01.split_6  avgt    3  3114.441 ± 2684.979  us/op
T01.split_8  avgt    3  3292.129 ±  139.653  us/op

# T14s Gen 1, AMD 8+8 Core, 32 GB memory
# SMT off, Turbo off, Mode performance
Benchmark    Mode  Cnt     Score      Error  Units
T01.split_1  avgt    3  2497.786 ±  230.361  us/op
T01.split_2  avgt    3  2539.525 ±  338.014  us/op
T01.split_4  avgt    3  2539.477 ±   53.316  us/op
T01.split_6  avgt    3  2644.285 ±  239.924  us/op
T01.split_8  avgt    3  2767.139 ± 1536.015  us/op
  • It still does not scale well!
  • Why?

Thread Scale via JMH

Diagnose scaling limits


# -prof gc, mode performance
Benchmark                       Mode  Cnt        Score      Error   Units
T01.split_1                     avgt    3     1318.253 ±  670.168   us/op
T01.split_1:gc.alloc.rate       avgt    3     2894.884 ± 1494.720  MB/sec
T01.split_1:gc.alloc.rate.norm  avgt    3  4000000.389 ±    0.874    B/op
T01.split_1:gc.count            avgt    3        7.000             counts
T01.split_1:gc.time             avgt    3       19.000                 ms
T01.split_2                     avgt    3     1177.087 ±  293.667   us/op
T01.split_2:gc.alloc.rate       avgt    3     6479.263 ± 1621.990  MB/sec
T01.split_2:gc.alloc.rate.norm  avgt    3  4000000.350 ±    0.558    B/op
T01.split_2:gc.count            avgt    3       16.000             counts
T01.split_2:gc.time             avgt    3       27.000                 ms
T01.split_4                     avgt    3     1483.509 ±  553.779   us/op
T01.split_4:gc.alloc.rate       avgt    3    10279.160 ± 3869.941  MB/sec
T01.split_4:gc.alloc.rate.norm  avgt    3  4000001.102 ±   21.109    B/op
T01.split_4:gc.count            avgt    3       26.000             counts
T01.split_4:gc.time             avgt    3       35.000                 ms
T01.split_6                     avgt    3     1992.545 ±  327.078   us/op
T01.split_6:gc.alloc.rate       avgt    3    11473.173 ± 1753.518  MB/sec
T01.split_6:gc.alloc.rate.norm  avgt    3  4000001.164 ±   18.958    B/op
T01.split_6:gc.count            avgt    3       28.000             counts
T01.split_6:gc.time             avgt    3       34.000                 ms
T01.split_8                     avgt    3     2314.763 ±  646.143   us/op
T01.split_8:gc.alloc.rate       avgt    3    13216.420 ± 3163.362  MB/sec
T01.split_8:gc.alloc.rate.norm  avgt    3  4000000.627 ±    0.149    B/op
T01.split_8:gc.count            avgt    3       32.000             counts
T01.split_8:gc.time             avgt    3       36.000                 ms
  • It does not scale!

# -prof gc, no SMT and turbo, mode performance
Benchmark                       Mode  Cnt        Score     Error   Units
T01.split_1                     avgt    3     2841.890 ± 219.913   us/op
T01.split_1:gc.alloc.rate       avgt    3     1342.067 ± 104.672  MB/sec
T01.split_1:gc.alloc.rate.norm  avgt    3  4000000.803 ±   0.460    B/op
T01.split_1:gc.count            avgt    3        3.000            counts
T01.split_1:gc.time             avgt    3       10.000                ms
T01.split_2                     avgt    3     2506.612 ±  42.312   us/op
T01.split_2:gc.alloc.rate       avgt    3     3041.421 ±  50.525  MB/sec
T01.split_2:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.949    B/op
T01.split_2:gc.count            avgt    3        7.000            counts
T01.split_2:gc.time             avgt    3       22.000                ms
T01.split_4                     avgt    3     2519.438 ± 251.261   us/op
T01.split_4:gc.alloc.rate       avgt    3     6049.532 ± 620.042  MB/sec
T01.split_4:gc.alloc.rate.norm  avgt    3  4000001.828 ±  34.873    B/op
T01.split_4:gc.count            avgt    3       15.000            counts
T01.split_4:gc.time             avgt    3       38.000                ms
T01.split_6                     avgt    3     2567.769 ± 106.240   us/op
T01.split_6:gc.alloc.rate       avgt    3     8889.642 ± 340.287  MB/sec
T01.split_6:gc.alloc.rate.norm  avgt    3  4000000.707 ±   0.080    B/op
T01.split_6:gc.count            avgt    3       22.000            counts
T01.split_6:gc.time             avgt    3       44.000                ms
T01.split_8                     avgt    3     2682.544 ± 246.422   us/op
T01.split_8:gc.alloc.rate       avgt    3    11352.457 ± 942.768  MB/sec
T01.split_8:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.277    B/op
T01.split_8:gc.count            avgt    3       28.000            counts
T01.split_8:gc.time             avgt    3       51.000                ms

Thread Scale via JMH

Change GC, SerialGC


# -prof gc, no SMT and turbo, mode performance, -XX:+UseG1GC
Benchmark                       Mode  Cnt        Score     Error   Units
T01.split_1                     avgt    3     2841.890 ± 219.913   us/op
T01.split_1:gc.alloc.rate       avgt    3     1342.067 ± 104.672  MB/sec
T01.split_1:gc.alloc.rate.norm  avgt    3  4000000.803 ±   0.460    B/op
T01.split_1:gc.count            avgt    3        3.000            counts
T01.split_1:gc.time             avgt    3       10.000                ms
T01.split_2                     avgt    3     2506.612 ±  42.312   us/op
T01.split_2:gc.alloc.rate       avgt    3     3041.421 ±  50.525  MB/sec
T01.split_2:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.949    B/op
T01.split_2:gc.count            avgt    3        7.000            counts
T01.split_2:gc.time             avgt    3       22.000                ms
T01.split_4                     avgt    3     2519.438 ± 251.261   us/op
T01.split_4:gc.alloc.rate       avgt    3     6049.532 ± 620.042  MB/sec
T01.split_4:gc.alloc.rate.norm  avgt    3  4000001.828 ±  34.873    B/op
T01.split_4:gc.count            avgt    3       15.000            counts
T01.split_4:gc.time             avgt    3       38.000                ms
T01.split_6                     avgt    3     2567.769 ± 106.240   us/op
T01.split_6:gc.alloc.rate       avgt    3     8889.642 ± 340.287  MB/sec
T01.split_6:gc.alloc.rate.norm  avgt    3  4000000.707 ±   0.080    B/op
T01.split_6:gc.count            avgt    3       22.000            counts
T01.split_6:gc.time             avgt    3       44.000                ms
T01.split_8                     avgt    3     2682.544 ± 246.422   us/op
T01.split_8:gc.alloc.rate       avgt    3    11352.457 ± 942.768  MB/sec
T01.split_8:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.277    B/op
T01.split_8:gc.count            avgt    3       28.000            counts
T01.split_8:gc.time             avgt    3       51.000                ms

# -prof gc, no SMT and turbo, mode performance, -XX:+UseSerialGC
Benchmark                       Mode  Cnt        Score      Error   Units
T01.split_1                     avgt    3     2302.586 ±   76.403   us/op
T01.split_1:gc.alloc.rate       avgt    3     1656.370 ±   55.984  MB/sec
T01.split_1:gc.alloc.rate.norm  avgt    3  4000000.626 ±    0.671    B/op
T01.split_1:gc.count            avgt    3        9.000             counts
T01.split_1:gc.time             avgt    3       50.000                 ms
T01.split_2                     avgt    3     2581.650 ± 3736.753   us/op
T01.split_2:gc.alloc.rate       avgt    3     2965.881 ± 4496.335  MB/sec
T01.split_2:gc.alloc.rate.norm  avgt    3  4000000.765 ±    0.180    B/op
T01.split_2:gc.count            avgt    3       16.000             counts
T01.split_2:gc.time             avgt    3       58.000                 ms
T01.split_4                     avgt    3     2287.881 ±   42.820   us/op
T01.split_4:gc.alloc.rate       avgt    3     6662.656 ±  115.427  MB/sec
T01.split_4:gc.alloc.rate.norm  avgt    3  4000001.664 ±   31.651    B/op
T01.split_4:gc.count            avgt    3       36.000             counts
T01.split_4:gc.time             avgt    3       17.000                 ms
T01.split_6                     avgt    3     2661.442 ±   48.027   us/op
T01.split_6:gc.alloc.rate       avgt    3     8584.677 ±  141.610  MB/sec
T01.split_6:gc.alloc.rate.norm  avgt    3  4000000.727 ±    0.016    B/op
T01.split_6:gc.count            avgt    3       48.000             counts
T01.split_6:gc.time             avgt    3       19.000                 ms
T01.split_8                     avgt    3     2357.560 ±  118.595   us/op
T01.split_8:gc.alloc.rate       avgt    3    12905.849 ±  879.622  MB/sec
T01.split_8:gc.alloc.rate.norm  avgt    3  4000000.651 ±    0.250    B/op
T01.split_8:gc.count            avgt    3       72.000             counts
T01.split_8:gc.time             avgt    3       25.000                 ms

Thread Scale via JMH

Change GC, ZGC


# -prof gc, no SMT and turbo, mode performance, -XX:+UseG1GC
Benchmark                       Mode  Cnt        Score     Error   Units
T01.split_1                     avgt    3     2841.890 ± 219.913   us/op
T01.split_1:gc.alloc.rate       avgt    3     1342.067 ± 104.672  MB/sec
T01.split_1:gc.alloc.rate.norm  avgt    3  4000000.803 ±   0.460    B/op
T01.split_1:gc.count            avgt    3        3.000            counts
T01.split_1:gc.time             avgt    3       10.000                ms
T01.split_2                     avgt    3     2506.612 ±  42.312   us/op
T01.split_2:gc.alloc.rate       avgt    3     3041.421 ±  50.525  MB/sec
T01.split_2:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.949    B/op
T01.split_2:gc.count            avgt    3        7.000            counts
T01.split_2:gc.time             avgt    3       22.000                ms
T01.split_4                     avgt    3     2519.438 ± 251.261   us/op
T01.split_4:gc.alloc.rate       avgt    3     6049.532 ± 620.042  MB/sec
T01.split_4:gc.alloc.rate.norm  avgt    3  4000001.828 ±  34.873    B/op
T01.split_4:gc.count            avgt    3       15.000            counts
T01.split_4:gc.time             avgt    3       38.000                ms
T01.split_6                     avgt    3     2567.769 ± 106.240   us/op
T01.split_6:gc.alloc.rate       avgt    3     8889.642 ± 340.287  MB/sec
T01.split_6:gc.alloc.rate.norm  avgt    3  4000000.707 ±   0.080    B/op
T01.split_6:gc.count            avgt    3       22.000            counts
T01.split_6:gc.time             avgt    3       44.000                ms
T01.split_8                     avgt    3     2682.544 ± 246.422   us/op
T01.split_8:gc.alloc.rate       avgt    3    11352.457 ± 942.768  MB/sec
T01.split_8:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.277    B/op
T01.split_8:gc.count            avgt    3       28.000            counts
T01.split_8:gc.time             avgt    3       51.000                ms

# -prof gc, no SMT and turbo, mode performance, -XX:+UseZGC
Benchmark                       Mode  Cnt        Score      Error   Units
T01.split_1                     avgt    3     2286.191 ±   13.489   us/op
T01.split_1:gc.alloc.rate       avgt    3     1968.161 ±   10.459  MB/sec
T01.split_1:gc.alloc.rate.norm  avgt    3  4720000.813 ±    0.010    B/op
T01.split_1:gc.count            avgt    3       28.000             counts
T01.split_1:gc.time             avgt    3       92.000                 ms
T01.split_2                     avgt    3     2294.309 ±   87.284   us/op
T01.split_2:gc.alloc.rate       avgt    3     3920.457 ±  151.112  MB/sec
T01.split_2:gc.alloc.rate.norm  avgt    3  4720000.910 ±    1.376    B/op
T01.split_2:gc.count            avgt    3       24.000             counts
T01.split_2:gc.time             avgt    3      101.000                 ms
T01.split_4                     avgt    3     2340.600 ±  164.406   us/op
T01.split_4:gc.alloc.rate       avgt    3     7678.836 ±  633.377  MB/sec
T01.split_4:gc.alloc.rate.norm  avgt    3  4720002.021 ±   35.588    B/op
T01.split_4:gc.count            avgt    3       52.000             counts
T01.split_4:gc.time             avgt    3      276.000                 ms
T01.split_6                     avgt    3     2898.646 ±   51.351   us/op
T01.split_6:gc.alloc.rate       avgt    3     9295.258 ±  118.284  MB/sec
T01.split_6:gc.alloc.rate.norm  avgt    3  4720001.071 ±    0.240    B/op
T01.split_6:gc.count            avgt    3       64.000             counts
T01.split_6:gc.time             avgt    3      359.000                 ms
T01.split_8                     avgt    3     2677.623 ±  994.806   us/op
T01.split_8:gc.alloc.rate       avgt    3    13431.418 ± 4719.314  MB/sec
T01.split_8:gc.alloc.rate.norm  avgt    3  4720000.959 ±    0.214    B/op
T01.split_8:gc.count            avgt    3      124.000             counts
T01.split_8:gc.time             avgt    3      912.000                 ms

Thread Scale via JMH

Change GC, ShenandoahGC


# -prof gc, no SMT and turbo, mode performance, -XX:+UseG1GC
Benchmark                       Mode  Cnt        Score     Error   Units
T01.split_1                     avgt    3     2841.890 ± 219.913   us/op
T01.split_1:gc.alloc.rate       avgt    3     1342.067 ± 104.672  MB/sec
T01.split_1:gc.alloc.rate.norm  avgt    3  4000000.803 ±   0.460    B/op
T01.split_1:gc.count            avgt    3        3.000            counts
T01.split_1:gc.time             avgt    3       10.000                ms
T01.split_2                     avgt    3     2506.612 ±  42.312   us/op
T01.split_2:gc.alloc.rate       avgt    3     3041.421 ±  50.525  MB/sec
T01.split_2:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.949    B/op
T01.split_2:gc.count            avgt    3        7.000            counts
T01.split_2:gc.time             avgt    3       22.000                ms
T01.split_4                     avgt    3     2519.438 ± 251.261   us/op
T01.split_4:gc.alloc.rate       avgt    3     6049.532 ± 620.042  MB/sec
T01.split_4:gc.alloc.rate.norm  avgt    3  4000001.828 ±  34.873    B/op
T01.split_4:gc.count            avgt    3       15.000            counts
T01.split_4:gc.time             avgt    3       38.000                ms
T01.split_6                     avgt    3     2567.769 ± 106.240   us/op
T01.split_6:gc.alloc.rate       avgt    3     8889.642 ± 340.287  MB/sec
T01.split_6:gc.alloc.rate.norm  avgt    3  4000000.707 ±   0.080    B/op
T01.split_6:gc.count            avgt    3       22.000            counts
T01.split_6:gc.time             avgt    3       44.000                ms
T01.split_8                     avgt    3     2682.544 ± 246.422   us/op
T01.split_8:gc.alloc.rate       avgt    3    11352.457 ± 942.768  MB/sec
T01.split_8:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.277    B/op
T01.split_8:gc.count            avgt    3       28.000            counts
T01.split_8:gc.time             avgt    3       51.000                ms

# -prof gc, no SMT and turbo, mode performance, -XX:+UseShenandoahGC
Benchmark                       Mode  Cnt        Score     Error   Units
T01.split_1                     avgt    3     2744.327 ± 547.522   us/op
T01.split_1:gc.alloc.rate       avgt    3     1389.809 ± 274.955  MB/sec
T01.split_1:gc.alloc.rate.norm  avgt    3  4000000.746 ±   0.152    B/op
T01.split_1:gc.count            avgt    3       12.000            counts
T01.split_1:gc.time             avgt    3       25.000                ms
T01.split_2                     avgt    3     2557.375 ± 691.621   us/op
T01.split_2:gc.alloc.rate       avgt    3     2980.591 ± 806.456  MB/sec
T01.split_2:gc.alloc.rate.norm  avgt    3  4000000.761 ±   0.878    B/op
T01.split_2:gc.count            avgt    3       20.000            counts
T01.split_2:gc.time             avgt    3       45.000                ms
T01.split_4                     avgt    3     2490.815 ± 227.054   us/op
T01.split_4:gc.alloc.rate       avgt    3     6116.189 ± 530.842  MB/sec
T01.split_4:gc.alloc.rate.norm  avgt    3  4000001.807 ±  34.415    B/op
T01.split_4:gc.count            avgt    3       44.000            counts
T01.split_4:gc.time             avgt    3      123.000                ms
T01.split_6                     avgt    3     2540.445 ± 199.454   us/op
T01.split_6:gc.alloc.rate       avgt    3     8988.221 ± 879.298  MB/sec
T01.split_6:gc.alloc.rate.norm  avgt    3  4000000.704 ±   0.186    B/op
T01.split_6:gc.count            avgt    3       66.000            counts
T01.split_6:gc.time             avgt    3      176.000                ms
T01.split_8                     avgt    3     2621.635 ± 223.107   us/op
T01.split_8:gc.alloc.rate       avgt    3    11598.049 ± 728.324  MB/sec
T01.split_8:gc.alloc.rate.norm  avgt    3  4000000.721 ±   0.101    B/op
T01.split_8:gc.count            avgt    3       80.000            counts
T01.split_8:gc.time             avgt    3      235.000                ms

Don't allocate

Fix the problem correctly


public long split()
{
	long count = 0;
	for (var s : csv)
	{
		var st = new StringTokenizer(s, ",");
		count += st.countTokens();
	}
	return count;
}

# Default, mode performance, -XX:+UseG1GC
Benchmark    Mode  Cnt    Score     Error  Units  GHz
T02.split_1  avgt    3  433.597 ±  88.340  us/op 4.16
T02.split_2  avgt    3  436.323 ±  37.467  us/op 3.70
T02.split_4  avgt    3  509.709 ±  92.259  us/op 3.10
T02.split_6  avgt    3  594.650 ± 113.787  us/op 2.90
T02.split_8  avgt    3  666.666 ±  25.594  us/op 2.60

# No SMT, no turbo, mode performance, -XX:+UseG1GC
Benchmark    Mode  Cnt     Score    Error  Units  GHz
T02.split_1  avgt    3  1025.864 ± 11.773  us/op 1.70
T02.split_2  avgt    3  1024.357 ± 56.503  us/op 1.70
T02.split_4  avgt    3  1015.246 ± 26.788  us/op 1.70
T02.split_6  avgt    3  1014.575 ± 65.084  us/op 1.70
T02.split_8  avgt    3  1036.651 ± 23.975  us/op 1.70

Don't allocate

But what about StringTokenizer


Benchmark                       Mode  Cnt    Score    Error   Units
T02.split_1                     avgt    3  438.617 ± 52.709   us/op
T02.split_1:gc.alloc.rate       avgt    3   ≈ 10⁻⁴           MB/sec
T02.split_1:gc.alloc.rate.norm  avgt    3    0.117 ±  0.080    B/op
T02.split_1:gc.count            avgt    3      ≈ 0           counts
T02.split_2                     avgt    3  440.256 ± 84.787   us/op
T02.split_2:gc.alloc.rate       avgt    3    0.001 ±  0.001  MB/sec
T02.split_2:gc.alloc.rate.norm  avgt    3    0.127 ±  0.036    B/op
T02.split_2:gc.count            avgt    3      ≈ 0           counts
T02.split_4                     avgt    3  504.023 ± 29.002   us/op
T02.split_4:gc.alloc.rate       avgt    3    0.003 ±  0.053  MB/sec
T02.split_4:gc.alloc.rate.norm  avgt    3    0.366 ±  7.046    B/op
T02.split_4:gc.count            avgt    3      ≈ 0           counts
T02.split_6                     avgt    3  598.367 ± 32.953   us/op
T02.split_6:gc.alloc.rate       avgt    3    0.002 ±  0.001  MB/sec
T02.split_6:gc.alloc.rate.norm  avgt    3    0.166 ±  0.018    B/op
T02.split_6:gc.count            avgt    3      ≈ 0           counts
T02.split_8                     avgt    3  643.071 ± 20.133   us/op
T02.split_8:gc.alloc.rate       avgt    3    0.002 ±  0.001  MB/sec
T02.split_8:gc.alloc.rate.norm  avgt    3    0.178 ±  0.044    B/op
T02.split_8:gc.count            avgt    3      ≈ 0           counts
  • We don't burn any memory
  • But why? We do allocate a StringTokenizer
  • The JIT runs an escape analysis
  • The StringTokenizer does not make it beyond the method, so the allocation is eliminated
  • But... the result of split() stays local as well?!
  • Yes, but its generated data spans at least two methods, so escape analysis cannot help

Learnings

No CPU is alike

  • CPUs are all different
  • Laptops are not your benchmark friends
  • Hyperthreading sucks when benchmarking
  • Core power limits can hit you
  • There are memory allocation limits!

Wait, there is one more thing

We have not checked on our Strings


@State(Scope.Benchmark)
public class Strings1 {
    private static final String PREFIX = "a";
    private String foo = null;
    private int LENGTH;

    @Setup
    public void setup() {
        //                        13 chars                  18 chars
        foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
        LENGTH = plain().length();
    }

    @Benchmark
    public String plain() {
        return PREFIX + foo;
    }

    @Benchmark
    public String concat() {
        return PREFIX.concat(foo);
    }

    @Benchmark
    public String builder() {
        return new StringBuilder().append(PREFIX).append(foo).toString();
    }

    @Benchmark
    public String builderSized() {
        return new StringBuilder(LENGTH).append(PREFIX).append(foo).toString();
    }

    @Benchmark
    public String builderNonFluid() {
        final StringBuilder sb = new StringBuilder();
        sb.append(PREFIX);
        sb.append(foo);
        return sb.toString();
    }

    @Benchmark
    public String builderNonFluidSized() {
        final StringBuilder sb = new StringBuilder(LENGTH);
        sb.append(PREFIX);
        sb.append(foo);
        return sb.toString();
    }
}

# java -jar target/benchmarks.jar "Strings1" -prof gc

Benchmark                               Mode Cnt   Score    Error Units
builder                                 avgt   3   8.354 ±  2.698 ns/op
builder:gc.alloc.rate.norm              avgt   3  72.000 ±  0.001  B/op

builderNonFluid                         avgt   3  21.585 ± 21.029 ns/op
builderNonFluid:gc.alloc.rate.norm      avgt   3 160.000 ±  0.001  B/op

builderNonFluidSized                    avgt   3  16.595 ± 11.674 ns/op
builderNonFluidSized:gc.alloc.rate.norm avgt   3 144.000 ±  0.001  B/op

builderSized                            avgt   3   8.711 ±  5.998 ns/op
builderSized:gc.alloc.rate.norm         avgt   3  72.000 ±  0.001  B/op

concat                                  avgt   3   9.710 ±  6.780 ns/op
concat:gc.alloc.rate.norm               avgt   3  72.000 ±  0.001  B/op

plain                                   avgt   3   9.619 ±  3.065 ns/op
plain:gc.alloc.rate.norm                avgt   3  72.000 ±  0.001  B/op

Wait, there is one more thing again

Data, it is about data!!!


@State(Scope.Benchmark)
public class Strings2 {
    private String[] foo = new String[5];
    private Random r = new Random(1L);

    @Setup
    public void setup() {
        foo[0] = String.valueOf(r.nextInt()) + "ashdfhgas dfhsa df";
        foo[1] = String.valueOf(r.nextInt()) + "hsa df";
        foo[2] = String.valueOf(r.nextInt()) + "fhsa df";
        foo[3] = String.valueOf(r.nextInt()) + "as dfhsa df";
        foo[4] = String.valueOf(r.nextInt())
                        + "ashdfhgas dfhsa hsajdhf kjahskjdh fkjhsa dhfkjskhj hdf"
                        + "ashdfhgas dfhsa hsajdhf kjahs as87837 837 987kjskhj hdf";
    }

    public String foo() {
        return foo[r.nextInt(5)];
    }

    @Benchmark
    public String plain() {
        return "a" + foo();
    }

    @Benchmark
    public String concat() {
        return "a".concat(foo());
    }

    @Benchmark
    public String builder() {
        return new StringBuilder().append("a").append(foo()).toString();
    }
    ...
}

# java -jar target/benchmarks.jar "Strings2" -prof gc

Benchmark                               Mode Cnt   Score Error Units
builder                                 avgt   2  39.149       ns/op
builder:gc.alloc.rate.norm              avgt   2 176.010        B/op

builderNonFluid                         avgt   2  37.518       ns/op
builderNonFluid:gc.alloc.rate.norm      avgt   2 176.010        B/op

builderNonFluidSized                    avgt   2  29.788       ns/op
builderNonFluidSized:gc.alloc.rate.norm avgt   2 227.205        B/op

builderSized                            avgt   2  32.169       ns/op
builderSized:gc.alloc.rate.norm         avgt   2 227.204        B/op

concat                                  avgt   2  24.675       ns/op
concat:gc.alloc.rate.norm               avgt   2  83.202        B/op

plain                                   avgt   2  24.212       ns/op
plain:gc.alloc.rate.norm                avgt   2  83.201        B/op

Approaching Microbenchmarks

Let's Tackle the Problem

Got a Problem?

What is your motivation?

Real Problem

  • Production problem or oddity
  • You have an idea or suspicion

Made Up Problem

  • Colleague told you
  • Read it somewhere
  • Told to make it scale
  • Told to cut cost in half
  • Production problem has been profiled, hopefully
  • You have to benchmark before you microbenchmark
  • When an arbitrary goal is given, you have to profile first
  • You might have to instrument first

Theory

No idea? No benchmark!

  • Develop a sound theory why something is slow or hungry
  • You might need a benchmark first!
  • Consider these five areas:
    • Code aka algorithms
    • Memory aka usage (allocation, freeing, access)
    • Cache aka locality of memory usage
    • I/O aka external dependencies
    • Scale aka synchronisation and CPU usage
  • Think about interactions of your problem component with the world

Test Case

Build your code

  • Isolate your problem
  • Avoid rewriting the code, try to use the API
  • Don't remove code only because it is not important for the benchmark
  • Isolate data preparation from code execution
  • By default, JMH prepares data only once per trial, which gets difficult when benchmarks mutate data
  • Concurrency tests are hard to write; mirror how your production runs the code
  • Don't apply your ideas or theory to the test case

Data

Shape the future outcome

  • Shape your test data according to your production data
  • No production data, because the idea is made up? Think in dimensions:
    • Amount - How many strings?
    • Size - How long is a string?
    • Structure - Where is the challenge in the string?
    • Layout - Holders and copies
    • Variance - How many different sizes, structures, and layouts? Mixed together or isolated?
  • Your data might create or even be the problem you are looking for
  • Think unit tests and the data you might need to cover your code properly

Execution Profile

Wrong conclusions can be drawn in seconds with this trick

  • How often is your code executed?
    • Only at startup (how often)
    • Occasionally
    • Always
    • On error or on a rare path
  • Where is your code used?
    • Inlining
    • Branch prediction and out-of-order execution

Unit Testing

Functional correctness before performance

  • Your first duty is to correctness!
  • Try to reuse your regular tests
  • Validate every change to avoid false hopes
  • Don't unit-test memory, cache or other performance things
  • Ensure proper coverage, because performance tuning often breaks edge-case handling
  • Try to use the same data for testing and benchmarking
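A cheap way to "validate every change": keep the original implementation around as a reference oracle and assert that the tuned version agrees on the shared test data. Both implementations below are placeholders for your real code:

```java
import java.util.Random;

public class TunedVsReference {

    // original, known-good implementation: the oracle
    static int referenceCount(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) n++;
        }
        return n;
    }

    // "tuned" variant - it must behave identically before its speed matters
    static int tunedCount(String s, char c) {
        int n = 0;
        for (char ch : s.toCharArray()) {
            n += (ch == c) ? 1 : 0;
        }
        return n;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7); // fixed seed, reproducible inputs
        for (int run = 0; run < 1_000; run++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < rnd.nextInt(50); i++) {
                sb.append((char) ('a' + rnd.nextInt(4)));
            }
            String s = sb.toString();
            if (referenceCount(s, 'a') != tunedCount(s, 'a')) {
                throw new AssertionError("mismatch on: " + s);
            }
        }
        System.out.println("tuned matches reference");
    }
}
```

Feeding the same generated inputs into both the unit test and the JMH benchmark keeps correctness and performance claims about the same workload.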

Question the Results

Never believe in your first results

  • Is this your expected outcome? Question it.
  • Is this not what you expected? Question it.
  • You had no expectations at all? Well... we got a problem.
  • Start validating what you got by experimenting
  • Vary the hell out of code and data before you start tuning!
  • Think of edge-cases

Verification by Variation

Vary a little to feel confident

  • Always keep the original state running
  • Vary a copy of the original
  • Vary the tuned version
  • Vary the code
    • Just switch lines
    • Reuse variables or don't
    • Flip branches and conditions
    • Flatten out code or compact it
  • Vary the data
    • Sort or shuffle it
    • Make it longer, bigger, and smaller
  • Vary the environment
    • Try another JDK
    • Try another machine
    • Play with the GC
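"Sort or shuffle it" in practice: run the same benchmark body on a sorted and a shuffled copy of the same data. The results must be functionally identical, while the timings may differ wildly because of branch prediction. A sketch with made-up sizes:

```java
import java.util.Arrays;
import java.util.Random;

public class VaryTheData {

    // branchy workload: sums only values above a threshold
    static long sumAbove(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            if (v > threshold) sum += v; // taken ~50% of the time on random data
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[] shuffled = rnd.ints(10_000, 0, 100).toArray();
        int[] sorted = shuffled.clone();
        Arrays.sort(sorted);
        // same data, same result - only the order (and thus the branch
        // predictor's success rate) differs between the two variants
        System.out.println(sumAbove(shuffled, 50) == sumAbove(sorted, 50)); // prints true
    }
}
```

If your tuned version only wins on sorted input, you have learned something about your data, not about your code.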

Verification by Tooling

Use the power of JMH

  • GC: -prof gc
  • Low-level stats: -prof perf / -prof perfnorm
  • Generated code: -prof perfasm
  • JIT Compiler: -prof comp
  • Classloader: -prof cl
  • JFR: -prof jfr
  • Async Profiler: -prof async
  • And more!
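These flags go straight to the JMH runner jar. A few example invocations, assuming your benchmarks are packaged as the usual `target/benchmarks.jar` uber-jar (jar path and benchmark name are assumptions):

```shell
# List the profilers available on this machine
java -jar target/benchmarks.jar -lprof

# Allocation rates and GC counts, normalized per operation
java -jar target/benchmarks.jar -prof gc MyBenchmark

# Hardware counters per operation (Linux only, needs perf installed)
java -jar target/benchmarks.jar -prof perfnorm MyBenchmark

# Hot generated assembly (needs the hsdis library next to the JVM)
java -jar target/benchmarks.jar -prof perfasm MyBenchmark
```

Not every profiler works on every platform; `-lprof` tells you which ones JMH could actually initialize.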

Verification by Review

Ask someone other than ChatGPT

  • Ask a colleague
  • Don't just send a PR
  • Explain your thought process
  • No need to ask a benchmarking expert
  • Compare with JDK benchmarks and JMH examples
  • Ask the mighty Internet politely

Verification by Knowledge

Remember our experiments and more

  • Care about the platform and environment
  • Remember our CPU, cache, and memory examples
  • Keep the Java core concepts in mind (JIT, JMM)
  • Be prepared to start from scratch
  • Always benchmark the outcome in reality
  • NEVER apply your benchmark outcome as a general rule in the future!

Microbenchmark when...

A Summary of Things

Conclusion

Microbenchmarking is...

Good

  • For isolating a problem
  • Investigating an isolated problem
  • Tuning an isolated problem
  • Extremely detailed tuning
  • Understanding the inner-workings
  • Squeezing the lemon a little more

Bad

  • Drawing general conclusions
  • Black and white comparisons
  • Expecting large scale impacts
  • The impatient developer

The Conclusion rephrased

When and when not to do microbenchmarks

  • Yes You are a JDK contributor
  • Yes You are an open source contributor
  • Yes Your problem can be isolated...
  • Yes ...or you isolate it for easier tuning
  • Yes You want to debunk your colleague's claims
  • Yes You are just curious
  • No You want to write a Medium post about which method is faster
  • No You want to tune production without a need
  • No You swiped left on all previous slides

One last thing

The JMH warning

REMEMBER: The numbers are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.

Do not assume the numbers tell you what you want them to tell.

Just because you have seen it on my slides does not make it right!