The Art of Realizing One is Wrong
While Learning How to Microbenchmark
Bonus Content: JVM Insights, Hardware Details, and Interactions
@ReneSchwietzke
@reneschwietzke@foojay.social
#java #performance #tuning #qa #test #performancetest #loadtest
Slides: https://t.ly/POQnP
Always attribute other people's work and state your sources
[1] https://www.youtube.com/watch?v=ADxUsCkWdbE
[2] https://shipilev.net/
I care and share - Slides: https://t.ly/POQnP
This presentation is licensed under
Creative Commons Attribution 4.0 International License
All source code is licensed under the Apache License 2.0.
Third-party content could use a different license.
Why you should stick around
This is a small JMH training!
We won't cover all aspects.
The JMH-Samples are awesome
Photos from Pexels, free to use, no attribution needed, CC0 or Public Domain
Slides: https://t.ly/POQnP
Repo: https://github.com/Xceptance/jmh-training.git
What is my motivation?
If you want to drive fast, you have to understand how a car works.
Definitions, Guesses, Tools
What you will need to make things fun
A definition first
To study (something, such as a competitor's product or business practices) in order to improve the performance of one's own company
A point of reference from which measurements may be made
A standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)
What is this micro thing and why is it micro?
A simplified benchmark that tests components in isolation.
It simplifies the process by abstracting the problem, removing complexity, and improving debuggability.
Components are usually quite small. Compare that to the unit test idea!
Side effects of simplification
Some Experiments First
Always use System.arraycopy over manual copying
public static void main(String[] args)
{
var length = 255;
var src = new byte[length];
var dest = new byte[length];
var s1 = System.nanoTime();
System.arraycopy(src, 0, dest, 0, length);
var e1 = System.nanoTime();
var s2 = System.nanoTime();
for (int i = 0; i < length; i++)
{
dest[i] = src[i];
}
var e2 = System.nanoTime();
System.out.printf("System Copy: %d ns%n", e1 - s1);
System.out.printf("Manual Copy: %d ns%n", e2 - s2);
}
$ cd jmh-training
$ java src/main/java/org/xc/jmh/naive/Example1.java
System Copy: 1,913 ns
Manual Copy: 3,135 ns
$ java src/main/java/org/xc/jmh/naive/Example1.java
System Copy: 1,172 ns
Manual Copy: 2,765 ns
$ java src/main/java/org/xc/jmh/naive/Example1.java
System Copy: 1,814 ns
Manual Copy: 3,917 ns
git clone https://github.com/Xceptance/jmh-training.git
Use a sized StringBuilder all the time
public static void main(String[] args) {
final String a = "This is a test";
final String b = "More fun with a string";
final long c = System.currentTimeMillis(); // e.g. 1684581249117
var l = a.length() + b.length() + 13;
var s1 = System.nanoTime();
{
var sb = new StringBuilder();
sb.append(a); sb.append(b); sb.append(c);
var r = sb.toString();
}
var e1 = System.nanoTime();
var s2 = System.nanoTime();
{
var sb = new StringBuilder(l);
sb.append(a); sb.append(b); sb.append(c);
var r = sb.toString();
}
var e2 = System.nanoTime();
var s3 = System.nanoTime();
{
var r = a + b + c;
}
var e3 = System.nanoTime();
System.out.printf("Pure String usage : %d ns%n", e3 - s3);
System.out.printf("StringBuilder : %d ns%n", e1 - s1);
System.out.printf("Sized StringBuilder: %d ns%n", e2 - s2);
}
$ java src/main/java/org/xc/jmh/naive/Example2.java
Pure String usage : 2,590,902 ns
StringBuilder : 43,863 ns
Sized StringBuilder: 5,591 ns
$ java src/main/java/org/xc/jmh/naive/Example2.java
Pure String usage : 2,926,695 ns
StringBuilder : 35,547 ns
Sized StringBuilder: 7,364 ns
$ java src/main/java/org/xc/jmh/naive/Example2.java
Pure String usage : 2,659,792 ns
StringBuilder : 27,241 ns
Sized StringBuilder: 5,030 ns
Streams are slow and parallel() is always better
public static void main(String[] args) {
var data = new long[10000];
for (int i = 0; i < data.length; i++) {
data[i] = i;
}
// manual
var s1 = System.nanoTime(); {
var total = 0.0d;
for (int i = 0; i < data.length; i++) {
total += calc(data[i]);
}
var r = total / data.length;
}
var e1 = System.nanoTime();
// stream
var s2 = System.nanoTime(); {
var r = Arrays.stream(data).map(i -> calc(i)).average();
}
var e2 = System.nanoTime();
// stream parallel
var s3 = System.nanoTime(); {
var r = Arrays.stream(data).parallel().map(i -> calc(i)).average();
}
var e3 = System.nanoTime();
...
}
private static long calc(long c) {
var t = c;
for (long i = 100; i > 0; i--) {
t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
}
return t;
}
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 5,832,700 ns
Stream : 23,259,444 ns
Parallel stream: 10,246,827 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 5,834,677 ns
Stream : 20,817,603 ns
Parallel stream: 10,223,933 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 6,054,650 ns
Stream : 20,718,407 ns
Parallel stream: 10,415,500 ns
Forget everything you just saw
What do we really want?
Our true goal: We want to learn and understand by changing details under controlled conditions. Turn a knob a bit and see what happens. We want to derive reasoning or even proof from our changes.
Don't ask what you can do, ask others what they have done
JMH is a Java harness for building, running, and analysing benchmarks written in Java and other languages targeting the JVM.
This is not an extensive JMH training.
It is about the mistakes you can make when microbenchmarking, even with the best tool available.
"A fool with a tool is still a fool."
Grady Booch
Learning by Example
Just naive and simple again
@Warmup(iterations = 5, time = 2,
timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1,
timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class A01 {
long time;
@Setup
public void setup() {
time = System.currentTimeMillis();
}
@Benchmark
public void add() {
var x =
(time * 0x5DEECE66DL + 0xBL + time) /
(time * time * 0.42d);
}
}
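For readers new to JMH, a quick summary of what those annotations do (see the JMH javadocs for details):
@Warmup - iterations run before measuring; their results are discarded
@Measurement - iterations whose results make up the reported score
@Fork - how many fresh JVMs the benchmark is executed in
@BenchmarkMode - what to report; Mode.AverageTime is average time per operation
@OutputTimeUnit - the unit of the report, here nanoseconds
@State - lifecycle and sharing of the fields; Scope.Thread means one instance per thread
@Setup - fixture method, by default run once per trial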
# ensure a full and clean build
$ mvn clean verify
# Only execute our example, not everything!
$ java -jar target/benchmarks.jar "A01"
# Warmup Iteration 1: 0.621 ns/op
# Warmup Iteration 2: 0.381 ns/op
# Warmup Iteration 3: 0.378 ns/op
# Warmup Iteration 4: 0.798 ns/op
# Warmup Iteration 5: 1.094 ns/op
Iteration 1: 1.090 ns/op
Iteration 2: 1.087 ns/op
Iteration 3: 1.085 ns/op
Iteration 4: 1.084 ns/op
Iteration 5: 1.095 ns/op
Iteration 6: 1.080 ns/op
Iteration 7: 1.079 ns/op
Iteration 8: 1.077 ns/op
Iteration 9: 1.089 ns/op
Iteration 10: 1.006 ns/op
Benchmark Mode Cnt Score Error Units
A01.add avgt 10 1.077 ± 0.039 ns/op
org.xc.jmh.A01
We won't spend time on the setup, see: https://github.com/openjdk/jmh
Can we trust it?
# T14s Gen1, Linux 5.15
# Warmup Iteration 1: 0.621 ns/op
# Warmup Iteration 2: 0.381 ns/op
# Warmup Iteration 3: 0.378 ns/op
# Warmup Iteration 4: 0.798 ns/op
# Warmup Iteration 5: 1.094 ns/op
Iteration 1: 1.090 ns/op
Iteration 2: 1.087 ns/op
Iteration 3: 1.085 ns/op
Iteration 4: 1.084 ns/op
Iteration 5: 1.095 ns/op
Iteration 6: 1.080 ns/op
Iteration 7: 1.079 ns/op
Iteration 8: 1.077 ns/op
Iteration 9: 1.089 ns/op
Iteration 10: 1.006 ns/op
Benchmark Mode Cnt Score Error Units
A01.add avgt 10 1.077 ± 0.039 ns/op
# T14s Gen1, Linux 5.15
# Warmup Iteration 1: 1.105 ns/op
# Warmup Iteration 2: 1.075 ns/op
# Warmup Iteration 3: 1.082 ns/op
# Warmup Iteration 4: 1.076 ns/op
# Warmup Iteration 5: 1.070 ns/op
Iteration 1: 1.112 ns/op
Iteration 2: 0.370 ns/op
Iteration 3: 0.377 ns/op
Iteration 4: 0.767 ns/op
Iteration 5: 1.097 ns/op
Iteration 6: 1.094 ns/op
Iteration 7: 0.383 ns/op
Iteration 8: 0.384 ns/op
Iteration 9: 0.374 ns/op
Iteration 10: 0.375 ns/op
Benchmark Mode Cnt Score Error Units
A01.add avgt 10 0.633 ± 0.521 ns/op
# T14s Gen1, Linux 5.15
# Warmup Iteration 1: 0.939 ns/op
# Warmup Iteration 2: 1.061 ns/op
# Warmup Iteration 3: 1.137 ns/op
# Warmup Iteration 4: 1.093 ns/op
# Warmup Iteration 5: 1.071 ns/op
Iteration 1: 1.071 ns/op
Iteration 2: 1.071 ns/op
Iteration 3: 1.070 ns/op
Iteration 4: 1.070 ns/op
Iteration 5: 0.378 ns/op
Iteration 6: 0.376 ns/op
Iteration 7: 0.375 ns/op
Iteration 8: 0.383 ns/op
Iteration 9: 0.374 ns/op
Iteration 10: 0.376 ns/op
Benchmark Mode Cnt Score Error Units
A01.add avgt 10 0.654 ± 0.541 ns/op
org.xc.jmh.A01
Can we trust it?
# Hetzner, AMD DEDICATED, Cloud, HEL
# Warmup Iteration 1: 0.292 ns/op
# Warmup Iteration 2: 0.289 ns/op
# Warmup Iteration 3: 0.299 ns/op
# Warmup Iteration 4: 0.294 ns/op
# Warmup Iteration 5: 0.288 ns/op
Iteration 1: 0.291 ns/op
Iteration 2: 0.287 ns/op
Iteration 3: 0.296 ns/op
Iteration 4: 0.287 ns/op
Iteration 5: 0.286 ns/op
Iteration 6: 0.287 ns/op
Iteration 7: 0.297 ns/op
Iteration 8: 0.291 ns/op
Iteration 9: 0.289 ns/op
Iteration 10: 0.289 ns/op
Benchmark Mode Cnt Score Error Units
A01.add avgt 10 0.290 ± 0.005 ns/op
# Hetzner, AMD DEDICATED, Cloud, HEL
# Warmup Iteration 1: 0.288 ns/op
# Warmup Iteration 2: 0.302 ns/op
# Warmup Iteration 3: 0.286 ns/op
# Warmup Iteration 4: 0.287 ns/op
# Warmup Iteration 5: 0.295 ns/op
Iteration 1: 0.287 ns/op
Iteration 2: 0.282 ns/op
Iteration 3: 0.286 ns/op
Iteration 4: 0.287 ns/op
Iteration 5: 0.293 ns/op
Iteration 6: 0.282 ns/op
Iteration 7: 0.284 ns/op
Iteration 8: 0.288 ns/op
Iteration 9: 0.292 ns/op
Iteration 10: 0.286 ns/op
Benchmark Mode Cnt Score Error Units
A01.add avgt 10 0.287 ± 0.006 ns/op
# Hetzner, AMD DEDICATED, Cloud, HEL
# Warmup Iteration 1: 0.287 ns/op
# Warmup Iteration 2: 0.306 ns/op
# Warmup Iteration 3: 0.286 ns/op
# Warmup Iteration 4: 0.286 ns/op
# Warmup Iteration 5: 0.293 ns/op
Iteration 1: 0.287 ns/op
Iteration 2: 0.286 ns/op
Iteration 3: 0.288 ns/op
Iteration 4: 0.288 ns/op
Iteration 5: 0.295 ns/op
Iteration 6: 0.286 ns/op
Iteration 7: 0.287 ns/op
Iteration 8: 0.295 ns/op
Iteration 9: 0.295 ns/op
Iteration 10: 0.286 ns/op
Benchmark Mode Cnt Score Error Units
A01.add avgt 10 0.289 ± 0.006 ns/op
org.xc.jmh.A01
Execute and report.
First mistakes have been made
Adding a main for debugging purposes
public class A02
{
...
public static void main(String[] args)
throws RunnerException
{
var opt = new OptionsBuilder()
// important, otherwise we will run
// all tests!
.include(A02.class.getSimpleName())
// 0 is needed for debugging,
// not for running
.forks(0)
.build();
new Runner(opt).run();
}
}
org.xc.jmh.A02
0.3 ns for an add?
public class A03a
{
long time;
@Setup
public void setup()
{
time = System.currentTimeMillis();
}
@Benchmark
public void addCheap()
{
var x = 1 + 1;
}
@Benchmark
public void addExpensive()
{
var x =
(time * 0x5DEECE66DL + 0xBL + time) /
(time * time * 0.42d);
}
}
# ensure a full and clean build
$ mvn clean verify
# Only execute our example, not everything!
$ java -jar target/benchmarks.jar "A03a"
Benchmark Mode Cnt Score Error Units
A03a.addCheap avgt 5 0.288 ± 0.009 ns/op
A03a.addExpensive avgt 5 0.290 ± 0.019 ns/op
# Disable the JIT
$ java -jar target/benchmarks.jar "A03a" -jvmArgs "-Xint"
Benchmark Mode Cnt Score Error Units
A03a.addCheap avgt 5 22.542 ± 0.456 ns/op
A03a.addExpensive avgt 5 45.016 ± 2.680 ns/op
org.xc.jmh.A03a
Can we measure time?
Whom do you ask for time?
System.nanoTime()
and System.currentTimeMillis()
We can measure nanoseconds, can't we?
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1000,
timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 1000,
timeUnit = TimeUnit.MILLISECONDS)
@Fork(1)
@State(Scope.Thread)
public class TimerResolution
{
private long lastValue;
@Benchmark
public long latency_nanotime()
{
return System.nanoTime();
}
public static double precision(long[] t)
{
// store samples
for (int i = 0; i < t.length; i++)
{
t[i] = System.nanoTime();
}
// average gap between consecutive samples
return (double) (t[t.length - 1] - t[0]) / (t.length - 1);
}
}
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 5 22.244 ± 0.474 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, TSC
$ java -cp target/classes/ org.xc.jmh.TimerResolution
Precision 1 - 21.66
Precision 2 - 23.80
Latency - 24.27
112232081516561
112232081516581 - 20 ns
112232081516601 - 20 ns
112232081516621 - 20 ns
112232081516651 - 30 ns
112232081516671 - 20 ns
...
112232081516951
112232081516972 - 21 ns
112232081516992 - 20 ns
...
org.xc.jmh.TimerResolution
Source: https://github.com/shipilev/timers-bench
Execute and report.
Math is your friend
Not every clock is created equal
The x86 Time Stamp Counter (TSC) is a low-latency, monotonically increasing clock used to measure event durations
Benchmark Mode Cnt Score Units
TimerResolution.latency_nanotime avgt 10 1,128.204 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, HPET
374172181482681
374172181484078
374172181484986
374172181485964
374172181487291
374172181488199
Difference between calls 900 to 1,400 ns
Benchmark Mode Cnt Score Units
TimerResolution.latency_nanotime avgt 10 25.519 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, TSC
495719354555374
495719354555394
495719354555424
495719354555444
495719354555464
495719354555484
Difference between calls 20 to 30 ns
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource
Resource: TSC Frequency For All: Better Profiling and Benchmarking
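Given the numbers above (nanoTime latency and granularity both in the 20-25 ns range), a single 0.3 ns operation cannot be timed call by call. The way out, and what JMH's generated stubs do internally, is to time a large batch and divide. A minimal hand-rolled sketch of that idea (not JMH code, class name is made up):
public class BatchTimingSketch
{
    public static void main(String[] args)
    {
        final long n = 100_000_000L;
        double sink = 0;                       // keep the work observable
        final long start = System.nanoTime();  // ~25 ns overhead, paid once
        for (long i = 0; i < n; i++)
        {
            sink += i * 0.42d;                 // stand-in for the operation under test
        }
        final long end = System.nanoTime();
        // per-operation cost: timer overhead and granularity are now negligible
        System.out.printf("%.3f ns/op (sink=%.1f)%n", (end - start) / (double) n, sink);
    }
}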
We are being outsmarted
Setup theories
public class A03b {
long time;
Long lTime;
@Setup
public void setup() {
time = System.currentTimeMillis();
lTime = Long.valueOf(time);
}
@Benchmark
public void addCheap1() {
var x = 1 + 1727610292048L;
}
@Benchmark
public void addCheap2() {
var x = 1 + time;
}
@Benchmark
public void addCheap3() {
var x = 1 + lTime.longValue();
}
@Benchmark
public void addExpensive() {
var x =
(time * 0x5DEECE66DL + 0xBL + time) /
(time * time * 0.42d);
}
}
Benchmark Mode Cnt Score Error Units
A03b.addCheap1 avgt 5 0.289 ± 0.014 ns/op
A03b.addCheap2 avgt 5 0.291 ± 0.015 ns/op
A03b.addCheap3 avgt 5 0.390 ± 0.025 ns/op
A03b.addExpensive avgt 5 0.288 ± 0.024 ns/op
# -jvmArgs "-Xint"
A03b.addCheap1 avgt 5 16.056 ± 1.308 ns/op
A03b.addCheap2 avgt 5 20.431 ± 3.972 ns/op
A03b.addCheap3 avgt 5 28.280 ± 1.847 ns/op
A03b.addExpensive avgt 5 39.193 ± 3.921 ns/op
org.xc.jmh.A03b
Don't get scared, this won't happen that often. In the listings below, addCheap2's compiled body contains no add at all (the dead computation was removed), while addCheap3 only keeps the null check on lTime.
[Entry Point]
# {method} {0x00007f89d0475488} 'addCheap2' '()V' in 'org/xc/jmh/A03b'
# [sp+0x20] (sp of caller)
0x00007f8a24476f80: mov 0x8(%rsi),%r10d
0x00007f8a24476f84: movabs $0x7f89db000000,%r11
0x00007f8a24476f8e: add %r11,%r10
0x00007f8a24476f91: cmp %r10,%rax
0x00007f8a24476f94: jne 0x00007f8a23d53d80 ; {runtime_call ic_miss_stub}
0x00007f8a24476f9a: xchg %ax,%ax
0x00007f8a24476f9c: nopl 0x0(%rax)
[Verified Entry Point]
0x00007f8a24476fa0: sub $0x18,%rsp
0x00007f8a24476fa7: mov %rbp,0x10(%rsp)
0x00007f8a24476fac: cmpl $0x0,0x20(%r15)
0x00007f8a24476fb4: jne 0x00007f8a24476fe3
0x00007f8a24476fba: add $0x10,%rsp
0x00007f8a24476fbe: pop %rbp
0x00007f8a24476fbf: cmp 0x450(%r15),%rsp ; {poll_return}
0x00007f8a24476fc6: ja 0x00007f8a24476fcd
0x00007f8a24476fcc: ret
0x00007f8a24476fcd: movabs $0x7f8a24476fbf,%r10 ; {internal_word}
0x00007f8a24476fd7: mov %r10,0x468(%r15)
0x00007f8a24476fde: jmp 0x00007f8a23d5a400 ; {runtime_call SafepointBlob}
0x00007f8a24476fe3: call Stub::nmethod_entry_barrier ; {runtime_call StubRoutines (final stubs)}
0x00007f8a24476fe8: jmp 0x00007f8a24476fba
0x00007f8a24476fed: hlt
0x00007f8a24476fee: hlt
0x00007f8a24476fef: hlt
[Exception Handler]
0x00007f8a24476ff0: jmp 0x00007f8a23e0a600 ; {no_reloc}
[Deopt Handler Code]
0x00007f8a24476ff5: call 0x00007f8a24476ffa
0x00007f8a24476ffa: subq $0x5,(%rsp)
0x00007f8a24476fff: jmp 0x00007f8a23d596a0 ; {runtime_call DeoptimizationBlob}
0x00007f8a24477004: hlt
0x00007f8a24477005: hlt
0x00007f8a24477006: hlt
0x00007f8a24477007: hlt
[Entry Point]
# {method} {0x00007f0014475560} 'addCheap3' '()V' in 'org/xc/jmh/A03b'
# [sp+0x20] (sp of caller)
0x00007f0068475b80: mov 0x8(%rsi),%r10d
0x00007f0068475b84: movabs $0x7f001f000000,%r11
0x00007f0068475b8e: add %r11,%r10
0x00007f0068475b91: cmp %r10,%rax
0x00007f0068475b94: jne 0x00007f0067d53d80 ; {runtime_call ic_miss_stub}
0x00007f0068475b9a: xchg %ax,%ax
0x00007f0068475b9c: nopl 0x0(%rax)
[Verified Entry Point]
0x00007f0068475ba0: mov %eax,-0x14000(%rsp)
0x00007f0068475ba7: push %rbp
0x00007f0068475ba8: sub $0x10,%rsp
0x00007f0068475bac: cmpl $0x0,0x20(%r15)
0x00007f0068475bb4: jne 0x00007f0068475bfe
0x00007f0068475bba: cmp 0xc(%rsi),%r12d
0x00007f0068475bbe: je 0x00007f0068475bd3
0x00007f0068475bc0: add $0x10,%rsp
0x00007f0068475bc4: pop %rbp
0x00007f0068475bc5: cmp 0x450(%r15),%rsp ; {poll_return}
0x00007f0068475bcc: ja 0x00007f0068475be8
0x00007f0068475bd2: ret
0x00007f0068475bd3: mov $0xfffffff6,%esi
0x00007f0068475bd8: data16 xchg %ax,%ax
0x00007f0068475bdb: call 0x00007f0067d59300 ; ImmutableOopMap {}
; *invokevirtual longValue
; - org.xc.jmh.A03b::addCheap3@5 (line 53)
; {runtime_call UncommonTrapBlob}
0x00007f0068475be0: nopl 0x1d0(%rax,%rax,1) ; {other}
0x00007f0068475be8: movabs $0x7f0068475bc5,%r10 ; {internal_word}
0x00007f0068475bf2: mov %r10,0x468(%r15)
0x00007f0068475bf9: jmp 0x00007f0067d5a400 ; {runtime_call SafepointBlob}
0x00007f0068475bfe: call Stub::nmethod_entry_barrier ; {runtime_call StubRoutines (final stubs)}
0x00007f0068475c03: jmp 0x00007f0068475bba
[Exception Handler]
0x00007f0068475c08: jmp 0x00007f0067e08100 ; {no_reloc}
[Deopt Handler Code]
0x00007f0068475c0d: call 0x00007f0068475c12
0x00007f0068475c12: subq $0x5,(%rsp)
0x00007f0068475c17: jmp 0x00007f0067d596a0 ; {runtime_call DeoptimizationBlob}
0x00007f0068475c1c: hlt
0x00007f0068475c1d: hlt
0x00007f0068475c1e: hlt
0x00007f0068475c1f: hlt
org.xc.jmh.A03b
Avoid dropping our code by returning it
public class A04a {
...
@Benchmark
public long addCheap1() {
return 1 + 1727610292048L;
}
@Benchmark
public long addCheap2() {
return 1 + time;
}
@Benchmark
public long addCheap3() {
return 1 + lTime.longValue();
}
@Benchmark
public double addExpensive() {
return
(time * 0x5DEECE66DL + 0xBL + time) /
(time * time * 0.42d);
}
...
}
Benchmark Mode Cnt Score Error Units
A04a.addCheap1 avgt 3 0.287 ± 0.010 ns/op
A04a.addCheap2 avgt 3 0.374 ± 0.087 ns/op
A04a.addCheap3 avgt 3 0.555 ± 0.216 ns/op
A04a.addExpensive avgt 3 4.436 ± 0.008 ns/op
long-add is still cheap
Long is really expensive
org.xc.jmh.A04a
Sorry...
[Entry Point]
# {method} {0x00007f5c2c4754c0} 'addCheap2' '()J' in 'org/xc/jmh/A04a'
# [sp+0x20] (sp of caller)
0x00007f5c80475a80: mov 0x8(%rsi),%r10d
0x00007f5c80475a84: movabs $0x7f5c37000000,%r11
0x00007f5c80475a8e: add %r11,%r10
0x00007f5c80475a91: cmp %r10,%rax
0x00007f5c80475a94: jne 0x00007f5c7fd53d80 ; {runtime_call ic_miss_stub}
0x00007f5c80475a9a: xchg %ax,%ax
0x00007f5c80475a9c: nopl 0x0(%rax)
[Verified Entry Point]
0x00007f5c80475aa0: sub $0x18,%rsp
0x00007f5c80475aa7: mov %rbp,0x10(%rsp)
0x00007f5c80475aac: cmpl $0x1,0x20(%r15)
0x00007f5c80475ab4: jne 0x00007f5c80475aec
0x00007f5c80475aba: mov $0x1,%eax
0x00007f5c80475abf: add 0x10(%rsi),%rax
0x00007f5c80475ac3: add $0x10,%rsp
0x00007f5c80475ac7: pop %rbp
0x00007f5c80475ac8: cmp 0x450(%r15),%rsp ; {poll_return}
0x00007f5c80475acf: ja 0x00007f5c80475ad6
0x00007f5c80475ad5: ret
0x00007f5c80475ad6: movabs $0x7f5c80475ac8,%r10 ; {internal_word}
0x00007f5c80475ae0: mov %r10,0x468(%r15)
0x00007f5c80475ae7: jmp 0x00007f5c7fd5a400 ; {runtime_call SafepointBlob}
0x00007f5c80475aec: call Stub::nmethod_entry_barrier ; {runtime_call StubRoutines (final stubs)}
0x00007f5c80475af1: jmp 0x00007f5c80475aba
0x00007f5c80475af6: hlt
0x00007f5c80475af7: hlt
[Exception Handler]
0x00007f5c80475af8: jmp 0x00007f5c7fe08200 ; {no_reloc}
[Deopt Handler Code]
0x00007f5c80475afd: call 0x00007f5c80475b02
0x00007f5c80475b02: subq $0x5,(%rsp)
0x00007f5c80475b07: jmp 0x00007f5c7fd596a0 ; {runtime_call DeoptimizationBlob}
0x00007f5c80475b0c: hlt
0x00007f5c80475b0d: hlt
0x00007f5c80475b0e: hlt
0x00007f5c80475b0f: hlt
[Entry Point]
# {method} {0x00007faf64475588} 'addCheap3' '()J' in 'org/xc/jmh/A04a'
# [sp+0x20] (sp of caller)
0x00007fafb8475a00: mov 0x8(%rsi),%r10d
0x00007fafb8475a04: movabs $0x7faf6f000000,%r11
0x00007fafb8475a0e: add %r11,%r10
0x00007fafb8475a11: cmp %r10,%rax
0x00007fafb8475a14: jne 0x00007fafb7d53d80 ; {runtime_call ic_miss_stub}
0x00007fafb8475a1a: xchg %ax,%ax
0x00007fafb8475a1c: nopl 0x0(%rax)
[Verified Entry Point]
0x00007fafb8475a20: mov %eax,-0x14000(%rsp)
0x00007fafb8475a27: push %rbp
0x00007fafb8475a28: sub $0x10,%rsp
0x00007fafb8475a2c: cmpl $0x0,0x20(%r15)
0x00007fafb8475a34: jne 0x00007fafb8475a8a
0x00007fafb8475a3a: mov 0xc(%rsi),%r11d
0x00007fafb8475a3e: test %r11d,%r11d
0x00007fafb8475a41: je 0x00007fafb8475a60
0x00007fafb8475a43: mov $0x1,%eax
0x00007fafb8475a48: add 0x10(%r12,%r11,8),%rax
0x00007fafb8475a4d: add $0x10,%rsp
0x00007fafb8475a51: pop %rbp
0x00007fafb8475a52: cmp 0x450(%r15),%rsp ; {poll_return}
0x00007fafb8475a59: ja 0x00007fafb8475a74
0x00007fafb8475a5f: ret
0x00007fafb8475a60: mov $0xfffffff6,%esi
0x00007fafb8475a65: xchg %ax,%ax
0x00007fafb8475a67: call 0x00007fafb7d59300 ; ImmutableOopMap {}
; *invokevirtual longValue {reexecute=0 rethrow=0 return_oop=0}
; - org.xc.jmh.A04a::addCheap3@5 (line 53)
; {runtime_call UncommonTrapBlob}
0x00007fafb8475a6c: nopl 0x1dc(%rax,%rax,1) ; {other}
0x00007fafb8475a74: movabs $0x7fafb8475a52,%r10 ; {internal_word}
0x00007fafb8475a7e: mov %r10,0x468(%r15)
0x00007fafb8475a85: jmp 0x00007fafb7d5a400 ; {runtime_call SafepointBlob}
0x00007fafb8475a8a: call Stub::nmethod_entry_barrier ; {runtime_call StubRoutines (final stubs)}
0x00007fafb8475a8f: jmp 0x00007fafb8475a3a
0x00007fafb8475a94: hlt
0x00007fafb8475a95: hlt
0x00007fafb8475a96: hlt
0x00007fafb8475a97: hlt
[Exception Handler]
0x00007fafb8475a98: jmp 0x00007fafb7e07100 ; {no_reloc}
[Deopt Handler Code]
0x00007fafb8475a9d: call 0x00007fafb8475aa2
0x00007fafb8475aa2: subq $0x5,(%rsp)
0x00007fafb8475aa7: jmp 0x00007fafb7d596a0 ; {runtime_call DeoptimizationBlob}
0x00007fafb8475aac: hlt
0x00007fafb8475aad: hlt
0x00007fafb8475aae: hlt
0x00007fafb8475aaf: hlt
org.xc.jmh.A04a
Avoid dropping our code by blackholing it
public class A04d {
...
@Benchmark
public void addCheap1(Blackhole b)
{
b.consume(1 + 1727610292048L);
}
@Benchmark
public void addCheap2(Blackhole b)
{
b.consume(1 + time);
}
@Benchmark
public void addCheap3(Blackhole b)
{
b.consume( 1 + lTime.longValue());
}
@Benchmark
public void addExpensive(Blackhole b)
{
b.consume((time * 0x5DEECE66DL + 0xBL + time) / (time * time * 0.42d));
}
...
}
Benchmark Mode Cnt Score Error Units
A04d.addCheap1 avgt 3 0.288 ± 0.021 ns/op
A04d.addCheap2 avgt 3 0.548 ± 0.034 ns/op
A04d.addCheap3 avgt 3 0.549 ± 0.040 ns/op
A04d.addExpensive avgt 3 4.439 ± 0.010 ns/op
org.xc.jmh.A04d
One last time, promised
@Benchmark
public double addExpensive() {
return
(time * 0x5DEECE66DL + 0xBL + time) /
(time * time * 0.42d);
}
# C1-Compiler
0x00007f8b009a5122: mov 0x10(%rsi),%rdi
0x00007f8b009a5126: movabs $0x5deece66d,%rax
0x00007f8b009a5130: mov %rax,%rsi
0x00007f8b009a5133: mov %rdi,%rax
0x00007f8b009a5136: imul %rsi,%rax
0x00007f8b009a513a: movabs $0xb,%r10
0x00007f8b009a5144: add %r10,%rax
0x00007f8b009a5147: add %rdi,%rax
0x00007f8b009a514a: vcvtsi2sd %rax,%xmm0,%xmm0
0x00007f8b009a514f: mov %rdi,%rsi
0x00007f8b009a5152: mov %rdi,%rax
0x00007f8b009a5155: imul %rsi,%rax
0x00007f8b009a5159: vcvtsi2sd %rax,%xmm1,%xmm1
0x00007f8b009a515e: vmulsd -0xc6(%rip),%xmm1,%xmm1
0x00007f8b009a5166: vdivsd %xmm1,%xmm0,%xmm0
@Benchmark
public double addExpensive() {
return
(time * 0x5DEECE66DL + 0xBL + time) /
(time * time * 0.42d);
}
# C2-Compiler
0x00007f8b08474bda: mov 0x10(%rsi),%r10
0x00007f8b08474bde: mov %r10,%r11
0x00007f8b08474be1: imul %r10,%r11
0x00007f8b08474be5: vcvtsi2sd %r11,%xmm0,%xmm0
0x00007f8b08474bea: movabs $0x5deece66d,%r11
0x00007f8b08474bf4: mov %r10,%r8
0x00007f8b08474bf7: imul %r11,%r8
0x00007f8b08474bfb: lea 0xb(%r8,%r10,1),%r10
0x00007f8b08474c00: vcvtsi2sd %r10,%xmm1,%xmm1
0x00007f8b08474c05: vmulsd -0x8d(%rip),%xmm0,%xmm0
0x00007f8b08474c0d: vdivsd %xmm0,%xmm1,%xmm0
org.xc.jmh.A04a
public class A11 {
double x1, x2;
@Setup
public void setup() {
x1 = System.currentTimeMillis() / 42.1801;
x2 = System.currentTimeMillis() / 42.1802;
}
private double compute(double d) {
for (int c = 0; c < 10; c++) {
d = d * d / Math.PI;
}
return d;
}
@Benchmark
public double baseline() {
return compute(x2);
}
@Benchmark
public double measureWrong() {
compute(x1);
return compute(x2);
}
@Benchmark
public void measureRight_Blackholes(Blackhole bh) {
bh.consume(compute(x1));
bh.consume(compute(x2));
}
@Benchmark
public void measureWrong_Blackholes(Blackhole bh) {
compute(x1);
bh.consume(compute(x2));
}
}
Benchmark Mode Cnt Score Error Units
A11.baseline avgt 3 12.338 ± 0.292 ns/op
A11.measureWrong avgt 3 12.468 ± 3.478 ns/op
A11.measureWrong_Blackholes avgt 3 12.362 ± 0.155 ns/op
A11.measureRight_Return avgt 3 24.828 ± 0.265 ns/op
A11.measureRight_Blackhole avgt 3 24.826 ± 0.187 ns/op
A11.measureRight_2Blackholes avgt 3 25.602 ± 0.367 ns/op
org.xc.jmh.A11
Shine a different light
// make it warm up longer
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
// measure longer and more often
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class A04b {
...
// the same benchmark methods as before
...
}
# Benchmark: org.xc.jmh.A04b.addCheap2
# Warmup Iteration 1: 0.372 ns/op
# Warmup Iteration 2: 0.373 ns/op
# Warmup Iteration 3: 0.549 ns/op
# Warmup Iteration 4: 0.549 ns/op
# Warmup Iteration 5: 0.553 ns/op
Iteration 1: 0.549 ns/op
Iteration 2: 0.548 ns/op
Iteration 3: 0.548 ns/op
Iteration 4: 0.569 ns/op
Iteration 5: 0.548 ns/op
Iteration 6: 0.549 ns/op
Iteration 7: 0.547 ns/op
Iteration 8: 0.547 ns/op
Iteration 9: 0.548 ns/op
Iteration 10: 0.547 ns/op
addCheap2 is more expensive now
org.xc.jmh.A04b
Observing the JIT actions
/*
* Each method warms and runs differently
*/
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class A04c {
...
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Benchmark
public long addShort() {
return 1 + time;
}
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Benchmark
public long addLong() {
return 1 + time;
}
...
}
# java -jar target/benchmarks.jar "A04c.addShort"
Benchmark Mode Cnt Score Error Units
A04c.addShort avgt 3 0.371 ± 0.051 ns/op
# java -jar target/benchmarks.jar "A04c.addLong"
Benchmark Mode Cnt Score Error Units
A04c.addLong avgt 10 0.551 ± 0.009 ns/op
# Warmup Iteration 1: 0.379 ns/op
# Warmup Iteration 2: 0.379 ns/op
Iteration 1: 0.547 ns/op
Iteration 2: 0.564 ns/op
Iteration 3: 0.550 ns/op
# java -jar target/benchmarks.jar "A04c"
Benchmark Mode Cnt Score Error Units
A04c.addShort avgt 3 0.548 ± 0.010 ns/op
A04c.addLong avgt 10 0.553 ± 0.014 ns/op
org.xc.jmh.A04c
# java -jar target/benchmarks.jar "A04c.addShort" \
-jvmArgsAppend "-XX:+PrintCompilation"
259 554 3 org.xc.jmh.A04c::addShort (7 bytes)
259 558 4 org.xc.jmh.A04c::addShort (7 bytes)
259 554 3 org.xc.jmh.A04c::addShort (7 bytes) made not entrant
261 560 % 3 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
262 561 3 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)
# Warmup Iteration 1:
263 562 % 4 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
265 560 % 3 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes) made not entrant
1260 562 % 4 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes) made not entrant
1272 591 % 4 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
1274 592 4 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)
1276 561 3 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes) made not entrant
0.372 ns/op
# Warmup Iteration 2: 0.371 ns/op
Iteration 1: 0.369 ns/op
Iteration 2: 0.377 ns/op
Iteration 3: 0.370 ns/op
# Warmup Iteration
277 505 3 org.xc.jmh.A04c::addShort (7 bytes)
280 510 % 3 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
280 511 3 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)
281 512 % 4 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
285 510 % 3 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes) made not entrant
285 508 4 org.xc.jmh.A04c::addShort (7 bytes)
286 505 3 org.xc.jmh.A04c::addShort (7 bytes) made not entrant
1278 512 % 4 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes) made not entrant
1290 553 % 4 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub @ 13 (57 bytes)
1295 554 4 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes)
1299 511 3 org.xc.jmh.jmh_generated.A04c_addShort_jmhTest::addShort_avgt_jmhStub (57 bytes) made not entrant
0.388 ns/op
# Warmup Iteration: 0.377 ns/op
Iteration 1: 0.547 ns/op
Iteration 2: 0.547 ns/op
Iteration 3: 0.547 ns/op
org.xc.jmh.A04c
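Reading the -XX:+PrintCompilation output above (standard HotSpot format):
columns: time since VM start (ms), compile id, flags, tier, method (bytecode size)
% = on-stack-replacement compilation (a running loop was compiled)
tier 3 = C1 with profiling, tier 4 = C2
"made not entrant" = that compiled version has been retired in favour of a newer one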
The JIT is concurrent, which makes results vary
# Run it normally
# java -jar target/benchmarks.jar "A04c"
# Benchmark: org.xc.jmh.A04c.addLong
# Warmup Iteration 1: 0.381 ns/op
# Warmup Iteration 2: 0.378 ns/op
# Warmup Iteration 3: 0.547 ns/op
# Warmup Iteration 4: 0.547 ns/op
# Warmup Iteration 5: 0.547 ns/op
Iteration 1: 0.547 ns/op
Iteration 2: 0.550 ns/op
Iteration 3: 0.547 ns/op
Iteration 4: 0.547 ns/op
Iteration 5: 0.547 ns/op
Iteration 6: 0.548 ns/op
Iteration 7: 0.547 ns/op
Iteration 8: 0.547 ns/op
Iteration 9: 0.549 ns/op
Iteration 10: 0.548 ns/op
# java -jar target/benchmarks.jar "A04c" \
# -jvmArgs "-XX:-TieredCompilation -XX:CICompilerCount=1 -Xbatch"
# -XX:-TieredCompilation : Go to C2 directly
# -XX:CICompilerCount=1 : Run only one compiler thread (-TieredC only)
# -Xbatch : Wait for the C2 result
# Benchmark: org.xc.jmh.A04c.addLong
# Warmup Iteration 1: 0.385 ns/op
# Warmup Iteration 2: 0.547 ns/op
# Warmup Iteration 3: 0.375 ns/op
# Warmup Iteration 4: 0.378 ns/op
# Warmup Iteration 5: 0.377 ns/op
Iteration 1: 0.374 ns/op
Iteration 2: 0.371 ns/op
Iteration 3: 0.374 ns/op
Iteration 4: 0.370 ns/op
Iteration 5: 0.370 ns/op
Iteration 6: 0.371 ns/op
Iteration 7: 0.380 ns/op
Iteration 8: 0.369 ns/op
Iteration 9: 0.377 ns/op
Iteration 10: 0.402 ns/op
org.xc.jmh.A04c
Enforce a method call
public class A05 {
...
@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public long addCheap1() {
return 1 + 1727610292048L;
}
@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public long addCheap2() {
return 1 + time;
}
@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public long addCheap3() {
return 1 + lTime.longValue();
}
@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public double addExpensive() {
return
(time * 0x5DEECE66DL + 0xBL + time) /
(time * time * 0.42d);
}
...
}
# -XX:+Inline
Benchmark Mode Cnt Score Error Units
A04b.addCheap1 avgt 10 0.292 ± 0.009 ns/op
A04b.addCheap2 avgt 10 0.548 ± 0.001 ns/op
A04b.addCheap3 avgt 10 0.550 ± 0.008 ns/op
A04b.addExpensive avgt 10 4.120 ± 1.530 ns/op
# -XX:-Inline
A04b.addCheap1 avgt 10 2.507 ± 0.132 ns/op
A04b.addCheap2 avgt 10 2.735 ± 0.006 ns/op
A04b.addCheap3 avgt 10 2.745 ± 0.029 ns/op
A04b.addExpensive avgt 10 5.223 ± 1.168 ns/op
# CompilerControl
A05.addCheap1 avgt 10 1.644 ± 0.004 ns/op
A05.addCheap2 avgt 10 1.653 ± 0.020 ns/op
A05.addCheap3 avgt 10 1.646 ± 0.006 ns/op
A05.addExpensive avgt 10 5.067 ± 1.673 ns/op
(compare with -XX:-Inline)
org.xc.jmh.A05
More mistakes have been made
javac evaluates constants
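The addCheap1 variants never compute anything at run time: both operands are compile-time constants, so javac folds the expression while compiling. A quick way to see it (my sketch, hypothetical class name):
public class ConstantFold
{
    public static long addCheap1()
    {
        // javac folds this to the single literal 1727610292049L
        return 1 + 1727610292048L;
    }
}
// $ javac ConstantFold.java && javap -c ConstantFold
// the body is roughly: ldc2_w <1727610292049L>, lreturn - no add left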
When you need some state
Prepare data for later measurement - Fixtures
public class B01
{
final int SIZE = 1000;
int[] data;
@Setup(Level.Iteration)
public void setup()
{
var r = new Random(42L);
data = new int[SIZE];
for (int i = 0; i < SIZE; i++)
{
data[i] = r.nextInt();
}
}
@Benchmark
public long sumClassic()
{
long result = 0;
for (int i = 0; i < data.length; i++)
{
result += data[i];
}
return result;
}
}
Setup method will be per benchmark, iteration, or invocation
Level.Trial is default!!!
Setup method will be per state
State is per benchmark or thread
# Warmup Iteration 1: 74.855 ns/op
# Warmup Iteration 2: 73.813 ns/op
Iteration 1: 74.167 ns/op
Iteration 2: 74.652 ns/op
Iteration 3: 76.764 ns/op
Iteration 4: 76.902 ns/op
Iteration 5: 75.612 ns/op
Benchmark Mode Cnt Score Error Units
B01.sumClassic avgt 5 75.620 ± 4.716 ns/op
org.xc.jmh.B01
Don't run Level.Invocation unless....
# Trial - Per Benchmark
$ java -jar target/benchmarks.jar "B02"
# Warmup Iteration 1: 75.102 ns/op
# Warmup Iteration 2: 73.226 ns/op
Iteration 1: 72.837 ns/op
Iteration 2: 76.176 ns/op
Iteration 3: 77.893 ns/op
Iteration 4: 74.138 ns/op
Iteration 5: 74.346 ns/op
Benchmark Mode Cnt Score Error Units
B02.sumClassic avgt 5 75.078 ± 7.598 ns/op
# Iteration - Per Warmup and Measurement Round
$ java -jar target/benchmarks.jar "B01"
# Warmup Iteration 1: 75.288 ns/op
# Warmup Iteration 2: 73.892 ns/op
Iteration 1: 76.258 ns/op
Iteration 2: 73.315 ns/op
Iteration 3: 77.246 ns/op
Iteration 4: 76.957 ns/op
Iteration 5: 72.868 ns/op
Benchmark Mode Cnt Score Error Units
B01.sumClassic avgt 5 75.329 ± 8.008 ns/op
# Invocation - Before each(!!) call to the method
$ mvn clean verify
$ java -jar target/benchmarks.jar "B03"
# Warmup Iteration 1: 157.255 ns/op
# Warmup Iteration 2: 110.940 ns/op
Iteration 1: 110.901 ns/op
Iteration 2: 105.556 ns/op
Iteration 3: 105.466 ns/op
Iteration 4: 111.826 ns/op
Iteration 5: 111.462 ns/op
Benchmark Mode Cnt Score Error Units
B03.sumClassic avgt 5 109.042 ± 12.478 ns/op
org.xc.jmh.B0[1-3]
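B02 and B03 are not shown here; judging from the results they differ from B01 only in the Level passed to @Setup: Level.Trial (the default) for B02, Level.Invocation for B03. A sketch of what B03 presumably looks like (check the repo for the real code):
public class B03
{
    final int SIZE = 1000;
    int[] data;

    // the only change versus B01: build the fixture before every single invocation
    @Setup(Level.Invocation)
    public void setup()
    {
        var r = new Random(42L);
        data = new int[SIZE];
        for (int i = 0; i < SIZE; i++)
        {
            data[i] = r.nextInt();
        }
    }

    // sumClassic() exactly as in B01
}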
Fixtures can be varied automatically
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Fork(value = 1, jvmArgsAppend = {"-Xms4g", "-Xmx4g", "-XX:+AlwaysPreTouch"})
public class B04a {
@Param({"1", "10", "100", "1000", "10000", "100000",
"1000000", "10000000", "100000000", "1000000000"})
int SIZE;
int[] data;
@Setup
public void setup() {
var r = new Random(42L);
data = new int[SIZE];
for (int i = 0; i < SIZE; i++) {
data[i] = r.nextInt();
}
}
@Benchmark
public long sumClassic() {
long result = 0;
for (int i = 0; i < data.length; i++) {
result += data[i];
}
return result;
}
@Benchmark
public long sumStream() {
return Arrays.stream(data).sum();
}
}
@Param takes a list of Strings
Benchmark (SIZE) Mode Cnt Score Error Units
B04a.sumClassic 1 avgt 5 1.376 ± 0.01 ns/op
B04a.sumClassic 10 avgt 5 3.553 ± 0.00 ns/op
B04a.sumClassic 100 avgt 5 7.698 ± 0.08 ns/op
B04a.sumClassic 1000 avgt 5 71.799 ± 0.90 ns/op
B04a.sumClassic 10000 avgt 5 764.769 ± 13.41 ns/op
B04a.sumClassic 100000 avgt 5 7,679.096 ± 38.78 ns/op
B04a.sumClassic 1000000 avgt 5 82,114.438 ± 4660.59 ns/op
B04a.sumClassic 10000000 avgt 5 1,037,525.538 ± 19741.51 ns/op
B04a.sumClassic 100000000 avgt 5 11,867,029.385 ± 130115.32 ns/op
B04a.sumClassic 1000000000 avgt 5 118,987,484.190 ± 3224256.55 ns/op
B04a.sumStream 1 avgt 5 21.295 ± 0.53 ns/op
B04a.sumStream 10 avgt 5 19.088 ± 1.43 ns/op
B04a.sumStream 100 avgt 5 38.379 ± 3.85 ns/op
B04a.sumStream 1000 avgt 5 284.431 ± 7.83 ns/op
B04a.sumStream 10000 avgt 5 2,763.850 ± 100.13 ns/op
B04a.sumStream 100000 avgt 5 27,620.697 ± 914.37 ns/op
B04a.sumStream 1000000 avgt 5 275,711.637 ± 507.95 ns/op
B04a.sumStream 10000000 avgt 5 2,775,992.907 ± 7054.71 ns/op
B04a.sumStream 100000000 avgt 5 27,783,853.383 ± 70817.23 ns/op
B04a.sumStream 1000000000 avgt 5 277,770,888.000 ± 427443.91 ns/op
org.xc.jmh.B04a
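A back-of-the-envelope reading of the table (my arithmetic, not on the original slide):
sumClassic at 10^9 elements: 118,987,484 ns / 10^9 ≈ 0.12 ns per element
sumStream at 10^9 elements: 277,770,888 ns / 10^9 ≈ 0.28 ns per element
sumStream at SIZE = 1: ~21 ns, almost entirely fixed stream setup cost
The interesting part is the crossover, not a single headline number.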
Execute and report - B04a
Things done horribly wrong - What is wrong here?
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class B08
{
@Param({"1", "10", "100", "1000"})
int SIZE;
	List<String> data = new ArrayList<>(SIZE);
@Setup
public void setup()
{
var r = new Random(42);
for (int i = 0; i < SIZE; i++)
{
data.add(String.valueOf(r.nextInt()));
}
}
@Benchmark
public List<String> sort()
{
Collections.sort(data);
return data;
}
}
Benchmark (SIZE) Mode Cnt Score Error Units
B08.sort 1 avgt 5 1.349 ± 0.080 ns/op
B08.sort 10 avgt 5 34.319 ± 0.151 ns/op
B08.sort 100 avgt 5 382.332 ± 20.398 ns/op
B08.sort 1000 avgt 5 4104.559 ± 207.218 ns/op
org.xc.jmh.B08
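One likely answer to "what is wrong here" (not spelled out on this slide): Collections.sort mutates data in place and @Setup defaults to Level.Trial, so every invocation after the first sorts an already sorted list. A minimal sketch of one way around it, at the price of also measuring the copy (my code, not from the repo):
@Benchmark
public List<String> sort()
{
    // copy first so every invocation sorts unsorted data
    var copy = new ArrayList<>(data);
    Collections.sort(copy);
    return copy;
}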
Things hitting us unexpectedly
Java evolves and so does performance
public class Example1ArrayCopying {
@Param({"1000"}) int SIZE;
byte[] src;
@Setup
public void setup() {
src = new byte[SIZE];
final Random r = new Random(7L);
r.nextBytes(src);
}
@Benchmark
public byte[] systemCopy() {
var dest = new byte[src.length];
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
@Benchmark
public byte[] arrayCopy() {
return Arrays.copyOf(src, src.length);
}
@Benchmark
public byte[] manualCopy() {
var dest = new byte[src.length];
for (int i = 0; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
# JDK 11
Benchmark (SIZE) Mode Cnt Score Error Units
Example1ArrayCopying.manualCopy 1000 avgt 3 152.889 ± 32.305 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 130.963 ± 119.643 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 128.726 ± 83.837 ns/op
# JDK 17
Example1ArrayCopying.manualCopy 1000 avgt 3 153.508 ± 7.987 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 124.021 ± 35.199 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 121.271 ± 35.723 ns/op
# JDK 19
Example1ArrayCopying.manualCopy 1000 avgt 3 149.298 ± 79.705 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 129.439 ± 104.106 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 133.358 ± 40.304 ns/op
# JDK 20
Example1ArrayCopying.manualCopy 1000 avgt 3 152.070 ± 7.096 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 119.613 ± 9.873 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 117.901 ± 8.322 ns/op
# JDK 21
Example1ArrayCopying.manualCopy 1000 avgt 3 146.508 ± 47.293 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 131.452 ± 107.300 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 132.797 ± 124.440 ns/op
# Graal 22.3.r17 - Whoops!
Example1ArrayCopying.manualCopy 1000 avgt 3 335.612 ± 22.000 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 150.035 ± 4.466 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 149.176 ± 4.496 ns/op
org.xc.jmh.Example1ArrayCopying
Machines are never alike
public class Example1ArrayCopying {
@Param({"1000"}) int SIZE;
byte[] src;
@Setup
public void setup() {
src = new byte[SIZE];
final Random r = new Random(7L);
r.nextBytes(src);
}
@Benchmark
public byte[] systemCopy() {
var dest = new byte[src.length];
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
@Benchmark
public byte[] arrayCopy() {
return Arrays.copyOf(src, src.length);
}
@Benchmark
public byte[] manualCopy() {
var dest = new byte[src.length];
for (int i = 0; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
# JDK 21 - T14s Gen 1 AMD
Example1ArrayCopying.manualCopy 1000 avgt 3 146.508 ± 47.293 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 131.452 ± 107.300 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 132.797 ± 124.440 ns/op
# JDK 21 - Hetzner AMD EPYC
Example1ArrayCopying.manualCopy 1000 avgt 3 47.118 ± 12.747 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 59.655 ± 13.721 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 42.315 ± 17.464 ns/op
Example1ArrayCopying.manualCopy 1000 avgt 3 48.347 ± 10.000 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 49.044 ± 3.226 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 43.766 ± 15.874 ns/op
Example1ArrayCopying.manualCopy 1000 avgt 3 51.993 ± 25.652 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 47.272 ± 5.970 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 48.202 ± 19.657 ns/op
Example1ArrayCopying.manualCopy 1000 avgt 3 49.232 ± 29.510 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 42.732 ± 9.204 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 43.663 ± 13.449 ns/op
org.xc.jmh.Example1ArrayCopying
Things are jittery
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
public class Example2ArrayCopying {
...
@Benchmark
public byte[] systemCopy() {
var dest = new byte[src.length];
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
@Benchmark
public byte[] arrayCopy() {
return Arrays.copyOf(src, src.length);
}
@Benchmark
public byte[] manualCopy() {
var dest = new byte[src.length];
for (int i = 0; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
# JDK 21 - T14s Gen 1 AMD - W3-3/M3-3/F1
Example1ArrayCopying.manualCopy 1000 avgt 3 146.508 ± 47.293 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 131.452 ± 107.300 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 132.797 ± 124.440 ns/op
# JDK 21 - Hetzner AMD EPYC - W3-3/M3-3/F1
Example1ArrayCopying.manualCopy 1000 avgt 3 47.118 ± 12.747 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 59.655 ± 13.721 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 42.315 ± 17.464 ns/op
# JDK 21 - T14s Gen 1 AMD - W5-1/M10-1/F3
Example2ArrayCopying.manualCopy 1000 avgt 30 175.009 ± 2.882 ns/op
Example2ArrayCopying.systemCopy 1000 avgt 30 142.290 ± 5.045 ns/op
Example2ArrayCopying.arrayCopy 1000 avgt 30 143.374 ± 2.372 ns/op
Example2ArrayCopying.manualCopy 1000 avgt 30 171.585 ± 1.051 ns/op
Example2ArrayCopying.systemCopy 1000 avgt 30 143.686 ± 1.997 ns/op
Example2ArrayCopying.arrayCopy 1000 avgt 30 143.701 ± 1.464 ns/op
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F3
Example2ArrayCopying.manualCopy 1000 avgt 30 48.971 ± 1.260 ns/op
Example2ArrayCopying.systemCopy 1000 avgt 30 54.044 ± 5.644 ns/op
Example2ArrayCopying.arrayCopy 1000 avgt 30 49.097 ± 4.838 ns/op
Example2ArrayCopying.manualCopy 1000 avgt 30 48.743 ± 1.472 ns/op
Example2ArrayCopying.systemCopy 1000 avgt 30 44.689 ± 2.206 ns/op
Example2ArrayCopying.arrayCopy 1000 avgt 30 49.527 ± 5.736 ns/op
org.xc.jmh.Example2ArrayCopying
Still no clear lead
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class Example3ArrayCopying {
@Param({"10", "1000", "100000"})
int SIZE;
...
}
# JDK 21 - T14s Gen 1 AMD - W5-1/M10-1/F1
Benchmark (SIZE) Mode Cnt Score Error Units
Example3ArrayCopying.manualCopy 10 avgt 10 19.199 ± 0.060 ns/op
Example3ArrayCopying.systemCopy 10 avgt 10 11.842 ± 0.524 ns/op
Example3ArrayCopying.arrayCopy 10 avgt 10 12.055 ± 0.157 ns/op
Example3ArrayCopying.manualCopy 1000 avgt 10 179.273 ± 2.051 ns/op
Example3ArrayCopying.systemCopy 1000 avgt 10 145.791 ± 1.194 ns/op
Example3ArrayCopying.arrayCopy 1000 avgt 10 146.662 ± 1.988 ns/op
Example3ArrayCopying.manualCopy 100000 avgt 10 18492.201 ± 2739.328 ns/op
Example3ArrayCopying.systemCopy 100000 avgt 10 13468.580 ± 121.943 ns/op
Example3ArrayCopying.arrayCopy 100000 avgt 10 13919.631 ± 113.306 ns/op
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark (SIZE) Mode Cnt Score Error Units
Example3ArrayCopying.manualCopy 10 avgt 10 6.182 ± 0.554 ns/op
Example3ArrayCopying.systemCopy 10 avgt 10 3.957 ± 0.089 ns/op
Example3ArrayCopying.arrayCopy 10 avgt 10 4.073 ± 0.329 ns/op
Example3ArrayCopying.manualCopy 1000 avgt 10 51.011 ± 3.748 ns/op
Example3ArrayCopying.systemCopy 1000 avgt 10 61.058 ± 1.217 ns/op
Example3ArrayCopying.arrayCopy 1000 avgt 10 58.612 ± 1.315 ns/op
Example3ArrayCopying.manualCopy 100000 avgt 10 6033.414 ± 323.343 ns/op
Example3ArrayCopying.systemCopy 100000 avgt 10 6227.169 ± 156.952 ns/op
Example3ArrayCopying.arrayCopy 100000 avgt 10 6390.100 ± 194.210 ns/op
org.xc.jmh.Example3ArrayCopying
Maybe more is less? From byte to long
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class Example4ArrayCopying {
@Param({"10", "1000", "100000"})
int SIZE;
long[] src;
@Setup
public void setup()
{
src = new long[SIZE];
final Random r = new Random(7L);
for (int i = 0; i < src.length; i++)
{
src[i] = r.nextLong();
}
}
...
}
# JDK 21 - T14s Gen 1 AMD - W5-1/M10-1/F1
Benchmark (SIZE) Mode Cnt Score Error Units
Example4ArrayCopying.manualCopy 10 avgt 10 19.836 ± 0.44 ns/op
Example4ArrayCopying.systemCopy 10 avgt 10 15.233 ± 0.72 ns/op
Example4ArrayCopying.arrayCopy 10 avgt 10 16.243 ± 0.43 ns/op
Example4ArrayCopying.manualCopy 1000 avgt 10 1170.518 ± 40.69 ns/op
Example4ArrayCopying.systemCopy 1000 avgt 10 1000.050 ± 19.52 ns/op
Example4ArrayCopying.arrayCopy 1000 avgt 10 1018.666 ± 20.75 ns/op
Example4ArrayCopying.manualCopy 100000 avgt 10 132694.706 ± 10647.86 ns/op
Example4ArrayCopying.systemCopy 100000 avgt 10 127729.915 ± 19014.11 ns/op
Example4ArrayCopying.arrayCopy 100000 avgt 10 113348.850 ± 4124.81 ns/op
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark (SIZE) Mode Cnt Score Error Units
Example4ArrayCopying.manualCopy 10 avgt 10 6.640 ± 0.59 ns/op
Example4ArrayCopying.systemCopy 10 avgt 10 5.219 ± 0.28 ns/op
Example4ArrayCopying.arrayCopy 10 avgt 10 5.025 ± 0.25 ns/op
Example4ArrayCopying.manualCopy 1000 avgt 10 350.139 ± 11.62 ns/op
Example4ArrayCopying.systemCopy 1000 avgt 10 424.347 ± 6.27 ns/op
Example4ArrayCopying.arrayCopy 1000 avgt 10 424.027 ± 8.20 ns/op
Example4ArrayCopying.manualCopy 100000 avgt 10 48080.023 ± 1714.36 ns/op
Example4ArrayCopying.systemCopy 100000 avgt 10 39513.986 ± 3052.52 ns/op
Example4ArrayCopying.arrayCopy 100000 avgt 10 40345.696 ± 1136.56 ns/op
org.xc.jmh.Example4ArrayCopying
Just blame the usual suspect
# Enable GC profiling by JMH
java -jar target/benchmarks.jar "Example5ArrayCopying" -prof gc
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark (SIZE) Mode Cnt Score Error Units
manualCopy 1000 avgt 10 358.236 ± 9.899 ns/op
manualCopy:gc.alloc.rate 1000 avgt 10 21340.350 ± 594.948 MB/sec
manualCopy:gc.alloc.rate.norm 1000 avgt 10 8016.002 ± 0.001 B/op
manualCopy:gc.count 1000 avgt 10 181.000 counts
manualCopy:gc.time 1000 avgt 10 234.000 ms
systemCopy 1000 avgt 10 462.448 ± 9.151 ns/op
systemCopy:gc.alloc.rate 1000 avgt 10 16528.255 ± 327.421 MB/sec
systemCopy:gc.alloc.rate.norm 1000 avgt 10 8016.003 ± 0.001 B/op
systemCopy:gc.count 1000 avgt 10 151.000 counts
systemCopy:gc.time 1000 avgt 10 189.000 ms
arrayCopy 1000 avgt 10 496.126 ± 10.055 ns/op
arrayCopy:gc.alloc.rate 1000 avgt 10 15406.729 ± 310.263 MB/sec
arrayCopy:gc.alloc.rate.norm 1000 avgt 10 8016.003 ± 0.001 B/op
arrayCopy:gc.count 1000 avgt 10 131.000 counts
arrayCopy:gc.time 1000 avgt 10 171.000 ms
# Give it more memory
java -jar target/benchmarks.jar "Example5ArrayCopying" \
-jvmArgsAppend "-Xms16g -Xmx16g" \
-prof gc
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark (SIZE) Mode Cnt Score Error Units
manualCopy 1000 avgt 10 344.894 ± 3.923 ns/op
manualCopy:gc.alloc.rate 1000 avgt 10 22160.016 ± 251.956 MB/sec
manualCopy:gc.alloc.rate.norm 1000 avgt 10 8016.002 ± 0.001 B/op
manualCopy:gc.count 1000 avgt 10 23.000 counts
manualCopy:gc.time 1000 avgt 10 70.000 ms
systemCopy 1000 avgt 10 398.919 ± 12.741 ns/op
systemCopy:gc.alloc.rate 1000 avgt 10 19165.850 ± 628.072 MB/sec
systemCopy:gc.alloc.rate.norm 1000 avgt 10 8016.003 ± 0.001 B/op
systemCopy:gc.count 1000 avgt 10 19.000 counts
systemCopy:gc.time 1000 avgt 10 56.000 ms
arrayCopy 1000 avgt 10 378.813 ± 26.941 ns/op
arrayCopy:gc.alloc.rate 1000 avgt 10 20214.726 ± 1412.521 MB/sec
arrayCopy:gc.alloc.rate.norm 1000 avgt 10 8016.003 ± 0.001 B/op
arrayCopy:gc.count 1000 avgt 10 21.000 counts
arrayCopy:gc.time 1000 avgt 10 65.000 ms
org.xc.jmh.Example5ArrayCopying
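A quick sanity check on the gc.alloc.rate.norm column (my arithmetic): each invocation allocates exactly one long[1000]:
1000 × 8 B data + 16 B array header ≈ 8016 B/op
which is what all three variants report, independent of the GC.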
Try a different GC, as one is told all the time
# Try the Serial GC
java -jar target/benchmarks.jar "Example5ArrayCopying" \
-jvmArgsAppend "-Xms16g -Xmx16g -XX:+UseSerialGC" \
-prof gc
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark (SIZE) Mode Cnt Score Error Units
manualCopy 1000 avgt 10 340.026 ± 2.053 ns/op
manualCopy:gc.alloc.rate 1000 avgt 10 22477.566 ± 135.713 MB/sec
manualCopy:gc.alloc.rate.norm 1000 avgt 10 8016.002 ± 0.001 B/op
manualCopy:gc.count 1000 avgt 10 51.000 counts
manualCopy:gc.time 1000 avgt 10 16.000 ms
systemCopy 1000 avgt 10 384.150 ± 17.802 ns/op
systemCopy:gc.alloc.rate 1000 avgt 10 19912.093 ± 939.526 MB/sec
systemCopy:gc.alloc.rate.norm 1000 avgt 10 8016.002 ± 0.001 B/op
systemCopy:gc.count 1000 avgt 10 46.000 counts
systemCopy:gc.time 1000 avgt 10 14.000 ms
arrayCopy 1000 avgt 10 396.502 ± 6.364 ns/op
arrayCopy:gc.alloc.rate 1000 avgt 10 19275.432 ± 315.711 MB/sec
arrayCopy:gc.alloc.rate.norm 1000 avgt 10 8016.002 ± 0.001 B/op
arrayCopy:gc.count 1000 avgt 10 44.000 counts
arrayCopy:gc.time 1000 avgt 10 9.000 ms
# Try ZGC
java -jar target/benchmarks.jar "Example5ArrayCopying" \
-jvmArgsAppend "-Xms16g -Xmx16g -XX:+UseZGC" \
-prof gc
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1
Benchmark (SIZE) Mode Cnt Score Error Units
manualCopy 1000 avgt 10 429.413 ± 76.723 ns/op
manualCopy:gc.alloc.rate 1000 avgt 10 18004.823 ± 3022.606 MB/sec
manualCopy:gc.alloc.rate.norm 1000 avgt 10 8016.003 ± 0.001 B/op
manualCopy:gc.count 1000 avgt 10 56.000 counts
manualCopy:gc.time 1000 avgt 10 214.000 ms
systemCopy 1000 avgt 10 470.485 ± 112.292 ns/op
systemCopy:gc.alloc.rate 1000 avgt 10 16610.545 ± 3953.947 MB/sec
systemCopy:gc.alloc.rate.norm 1000 avgt 10 8016.004 ± 0.001 B/op
systemCopy:gc.count 1000 avgt 10 52.000 counts
systemCopy:gc.time 1000 avgt 10 175.000 ms
arrayCopy 1000 avgt 10 442.024 ± 127.831 ns/op
arrayCopy:gc.alloc.rate 1000 avgt 10 17806.595 ± 4581.743 MB/sec
arrayCopy:gc.alloc.rate.norm 1000 avgt 10 8016.003 ± 0.001 B/op
arrayCopy:gc.count 1000 avgt 10 56.000 counts
arrayCopy:gc.time 1000 avgt 10 198.000 ms
org.xc.jmh.Example5ArrayCopying
Focus on the action, avoid overhead
public class Example6ArrayCopying {
@Param({"1000"})
int SIZE;
long[] src, dest;
@Setup
public void setup() {
src = new long[SIZE];
dest = new long[SIZE];
final Random r = new Random(7L);
for (int i = 0; i < src.length; i++) {
src[i] = r.nextLong();
}
}
@Benchmark
public long[] systemCopy() {
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
@Benchmark
public long[] manualCopy() {
for (int i = 0; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (SIZE) Mode Cnt Score Error Units
Example4...manualCopy 1000 avgt 10 358.236 ± 9.899 ns/op
Example4...manualCopy:gc.alloc.rate 1000 avgt 10 21340.350 ± 594.948 MB/sec
Example4...manualCopy:gc.alloc.rate.norm 1000 avgt 10 8016.002 ± 0.001 B/op
Example4...systemCopy 1000 avgt 10 462.448 ± 9.151 ns/op
Example4...systemCopy:gc.alloc.rate 1000 avgt 10 16528.255 ± 327.421 MB/sec
Example4...systemCopy:gc.alloc.rate.norm 1000 avgt 10 8016.003 ± 0.001 B/op
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (SIZE) Mode Cnt Score Error Units
Example6...manualCopy 1000 avgt 10 68.884 ± 0.244 ns/op
Example6...manualCopy:gc.alloc.rate 1000 avgt 10 0.007 ± 0.001 MB/sec
Example6...manualCopy:gc.alloc.rate.norm 1000 avgt 10 ≈ 10⁻³ B/op
Example6...systemCopy 1000 avgt 10 87.878 ± 0.748 ns/op
Example6...systemCopy:gc.alloc.rate 1000 avgt 10 0.007 ± 0.001 MB/sec
Example6...systemCopy:gc.alloc.rate.norm 1000 avgt 10 0.001 ± 0.001 B/op
--
Example6...manualCopy 1000 avgt 10 68.745 ± 0.166 ns/op
Example6...manualCopy:gc.alloc.rate 1000 avgt 10 0.007 ± 0.001 MB/sec
Example6...manualCopy:gc.alloc.rate.norm 1000 avgt 10 ≈ 10⁻³ B/op
Example6...systemCopy 1000 avgt 10 91.892 ± 0.466 ns/op
Example6...systemCopy:gc.alloc.rate 1000 avgt 10 0.007 ± 0.001 MB/sec
Example6...systemCopy:gc.alloc.rate.norm 1000 avgt 10 0.001 ± 0.001 B/op
org.xc.jmh.Example6ArrayCopying
Just copy within the array
public class Example7ArrayCopying {
@Param({"1000"})
int SIZE;
int pos;
long[] src;
@Setup
public void setup() {
...
pos = SIZE / 2;
...
}
@Benchmark
public long[] manualCopy1() {
for (int i = 0; i < src.length - pos; i++) {
src[i] = src[pos + i];
}
return src;
}
@Benchmark
public long[] manualCopy2() {
var to = src.length - pos;
for (int i = 0; i < to; i++) {
src[i] = src[pos + i];
}
return src;
}
@Benchmark
public long[] manualCopy3() {
var p = pos;
var to = src.length - p;
for (int i = 0; i < to; i++, p++) {
src[i] = src[p];
}
return src;
}
}
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (SIZE) Mode Cnt Score Error Units
Example7ArrayCopying.manualCopy1 1000 avgt 10 127.035 ± 4.312 ns/op
Example7ArrayCopying.manualCopy2 1000 avgt 10 121.472 ± 0.108 ns/op
Example7ArrayCopying.manualCopy3 1000 avgt 10 122.662 ± 2.137 ns/op
Example7ArrayCopying.systemCopy 1000 avgt 10 44.243 ± 0.209 ns/op
--
Example7ArrayCopying.manualCopy1 1000 avgt 10 125.406 ± 0.050 ns/op
Example7ArrayCopying.manualCopy2 1000 avgt 10 123.669 ± 2.829 ns/op
Example7ArrayCopying.manualCopy3 1000 avgt 10 146.735 ± 1.185 ns/op
Example7ArrayCopying.systemCopy 1000 avgt 10 45.049 ± 0.176 ns/op
--
Example7ArrayCopying.manualCopy1 1000 avgt 10 128.646 ± 7.547 ns/op
Example7ArrayCopying.manualCopy2 1000 avgt 10 122.786 ± 2.348 ns/op
Example7ArrayCopying.manualCopy3 1000 avgt 10 131.138 ± 2.348 ns/op
Example7ArrayCopying.systemCopy 1000 avgt 10 47.014 ± 0.201 ns/op
--
Example7ArrayCopying.manualCopy1 1000 avgt 10 126.050 ± 1.729 ns/op
Example7ArrayCopying.manualCopy2 1000 avgt 10 126.528 ± 14.605 ns/op
Example7ArrayCopying.manualCopy3 1000 avgt 10 123.898 ± 2.667 ns/op
Example7ArrayCopying.systemCopy 1000 avgt 10 44.879 ± 0.163 ns/op
--
Example7ArrayCopying.manualCopy1 1000 avgt 10 125.773 ± 0.628 ns/op
Example7ArrayCopying.manualCopy2 1000 avgt 10 123.595 ± 1.524 ns/op
Example7ArrayCopying.manualCopy3 1000 avgt 10 123.349 ± 3.968 ns/op
Example7ArrayCopying.systemCopy 1000 avgt 10 44.673 ± 0.044 ns/op
--
Example7ArrayCopying.manualCopy1 1000 avgt 10 126.058 ± 0.769 ns/op
Example7ArrayCopying.manualCopy2 1000 avgt 10 189.097 ± 2.793 ns/op
Example7ArrayCopying.manualCopy3 1000 avgt 10 122.751 ± 1.444 ns/op
Example7ArrayCopying.systemCopy 1000 avgt 10 44.674 ± 0.038 ns/op
org.xc.jmh.Example7ArrayCopying
Loop unrolling is king
public class Example8ArrayCopying {
@Setup
public void setup() {
...
final Random r = new Random(7L);
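// nextInt(1) is always 0, so increment is always 1 - but the JIT cannot prove that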
increment = r.nextInt(1) + 1;
}
@Benchmark
public long[] manualCopyAntiUnroll1() {
for (long l = 0; l < src.length; l++) {
final var i = (int)l;
dest[i] = src[i];
}
return src;
}
@Benchmark
public long[] manualCopyAntiUnroll2() {
for (int i = 0; i < src.length; i = i + increment) {
dest[i] = src[i];
}
return src;
}
@Benchmark
public long[] manualCopyAntiUnroll3() {
final var inc = increment;
for (int i = 0; i < src.length; i = i + inc) {
dest[i] = src[i];
}
return src;
}
}
java -jar target/benchmarks.jar "Example8ArrayCopying"
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (SIZE) Mode Cnt Score Error Units
Example8ArrayCopying.manualCopy1 1000 avgt 10 68.692 ± 0.066 ns/op
Example8ArrayCopying.manualCopy2 1000 avgt 10 70.277 ± 1.390 ns/op
Example8ArrayCopying.manualCopyAntiUnroll1 1000 avgt 10 385.301 ± 3.335 ns/op
Example8ArrayCopying.manualCopyAntiUnroll2 1000 avgt 10 617.899 ± 28.190 ns/op
Example8ArrayCopying.manualCopyAntiUnroll3 1000 avgt 10 1640.345 ± 0.966 ns/op
Example8ArrayCopying.systemCopy 1000 avgt 10 100.947 ± 1.753 ns/op
---
Benchmark (SIZE) Mode Cnt Score Error Units
Example8ArrayCopying.manualCopy1 1000 avgt 10 68.818 ± 0.893 ns/op
Example8ArrayCopying.manualCopy2 1000 avgt 10 70.744 ± 0.439 ns/op
Example8ArrayCopying.manualCopyAntiUnroll1 1000 avgt 10 385.096 ± 4.509 ns/op
Example8ArrayCopying.manualCopyAntiUnroll2 1000 avgt 10 655.082 ± 1.224 ns/op
Example8ArrayCopying.manualCopyAntiUnroll3 1000 avgt 10 1645.394 ± 24.600 ns/op
Example8ArrayCopying.systemCopy 1000 avgt 10 74.698 ± 1.357 ns/op
--
# Benchmark: org.xc.jmh.Example8ArrayCopying.manualCopyAntiUnroll3
# Warmup Iteration 1: 579.444 ns/op
# Warmup Iteration 2: 421.095 ns/op
# Warmup Iteration 3: 1640.320 ns/op
# Warmup Iteration 4: 1639.955 ns/op
# Warmup Iteration 5: 1639.914 ns/op
Iteration 1: 1640.164 ns/op
...
Iteration 10: 1641.931 ns/op
org.xc.jmh.Example8ArrayCopying
Just try to prove a point
public class Example9ArrayCopying {
@Benchmark
public long[] manualCopyAntiUnroll() {
for (int i = 0; i < src.length; i = i + increment) {
dest[i] = src[i];
}
return dest;
}
@Benchmark
public long[] manualCopyUnroll() {
var inc = increment * 4;
var i = 0;
// copy four elements per round; stop early enough to stay in bounds,
// the tail loop below picks up any remainder
for (; i < src.length - 3; i = i + inc) {
dest[i] = src[i];
dest[i + 1] = src[i + 1];
dest[i + 2] = src[i + 2];
dest[i + 3] = src[i + 3];
}
for (; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
java -jar target/benchmarks.jar "Example9ArrayCopying"
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (SIZE) Mode Cnt Score Error Units
Example9ArrayCopying.manualCopyAntiUnroll 1000 avgt 10 715.286 ± 132.30 ns/op
Example9ArrayCopying.manualCopyUnroll 1000 avgt 10 480.738 ± 3.93 ns/op
Example9ArrayCopying.systemCopy 1000 avgt 10 97.871 ± 0.08 ns/op
--
Example9ArrayCopying.manualCopyAntiUnroll 1000 avgt 10 597.785 ± 9.35 ns/op
Example9ArrayCopying.manualCopyUnroll 1000 avgt 10 480.876 ± 2.13 ns/op
Example9ArrayCopying.systemCopy 1000 avgt 10 97.437 ± 0.08 ns/op
--
# Benchmark: org.xc.jmh.Example9ArrayCopying.manualCopyUnroll
# Warmup Iteration 1: 283.584 ns/op
# Warmup Iteration 2: 283.552 ns/op
# Warmup Iteration 3: 480.231 ns/op
# Warmup Iteration 4: 481.449 ns/op
# Warmup Iteration 5: 479.380 ns/op
Iteration 1: 480.518 ns/op
...
Iteration 10: 479.786 ns/op
org.xc.jmh.Example9ArrayCopying
Investigating JIT things, small scale
java -jar target/benchmarks.jar "Example8ArrayCopying"
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (SIZE) Mode Cnt Score Error Units
Example8ArrayCopying.manualCopy1 1000 avgt 10 68.692 ± 0.066 ns/op
Example8ArrayCopying.manualCopy2 1000 avgt 10 70.277 ± 1.390 ns/op
Example8ArrayCopying.manualCopyAntiUnroll1 1000 avgt 10 385.301 ± 3.335 ns/op
Example8ArrayCopying.manualCopyAntiUnroll2 1000 avgt 10 617.899 ± 28.190 ns/op
Example8ArrayCopying.manualCopyAntiUnroll3 1000 avgt 10 1640.345 ± 0.966 ns/op
Example8ArrayCopying.systemCopy 1000 avgt 10 100.947 ± 1.753 ns/op
---
Benchmark (SIZE) Mode Cnt Score Error Units
Example8ArrayCopying.manualCopy1 1000 avgt 10 68.818 ± 0.893 ns/op
Example8ArrayCopying.manualCopy2 1000 avgt 10 70.744 ± 0.439 ns/op
Example8ArrayCopying.manualCopyAntiUnroll1 1000 avgt 10 385.096 ± 4.509 ns/op
Example8ArrayCopying.manualCopyAntiUnroll2 1000 avgt 10 655.082 ± 1.224 ns/op
Example8ArrayCopying.manualCopyAntiUnroll3 1000 avgt 10 1645.394 ± 24.600 ns/op
Example8ArrayCopying.systemCopy 1000 avgt 10 74.698 ± 1.357 ns/op
--
# Benchmark: org.xc.jmh.Example8ArrayCopying.manualCopyAntiUnroll3
# Warmup Iteration 1: 579.444 ns/op
# Warmup Iteration 2: 421.095 ns/op
# Warmup Iteration 3: 1640.320 ns/op
# Warmup Iteration 4: 1639.955 ns/op
# Warmup Iteration 5: 1639.914 ns/op
Iteration 1: 1640.164 ns/op
...
Iteration 10: 1641.931 ns/op
java -jar target/benchmarks.jar "Example8ArrayCopying" \
-jvmArgsAppend "-Xbatch -XX:TieredStopAtLevel=1"
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (SIZE) Mode Cnt Score Error Units
Example8ArrayCopying.manualCopy1 1000 avgt 10 565.111 ± 20.447 ns/op
Example8ArrayCopying.manualCopy2 1000 avgt 10 825.489 ± 0.782 ns/op
Example8ArrayCopying.manualCopyAntiUnroll1 1000 avgt 10 768.211 ± 18.383 ns/op
Example8ArrayCopying.manualCopyAntiUnroll2 1000 avgt 10 694.116 ± 8.131 ns/op
Example8ArrayCopying.manualCopyAntiUnroll3 1000 avgt 10 694.561 ± 3.570 ns/op
Example8ArrayCopying.systemCopy 1000 avgt 10 117.650 ± 2.397 ns/op
--
Example8ArrayCopying.manualCopy1 1000 avgt 10 555.663 ± 8.745 ns/op
Example8ArrayCopying.manualCopy2 1000 avgt 10 830.459 ± 11.616 ns/op
Example8ArrayCopying.manualCopyAntiUnroll1 1000 avgt 10 763.762 ± 4.898 ns/op
Example8ArrayCopying.manualCopyAntiUnroll2 1000 avgt 10 692.760 ± 1.160 ns/op
Example8ArrayCopying.manualCopyAntiUnroll3 1000 avgt 10 694.167 ± 9.906 ns/op
Example8ArrayCopying.systemCopy 1000 avgt 10 74.923 ± 0.273 ns/op
--
# Benchmark: org.xc.jmh.Example8ArrayCopying.manualCopyAntiUnroll3
# Warmup Iteration 1: 717.091 ns/op
# Warmup Iteration 2: 689.629 ns/op
# Warmup Iteration 3: 691.484 ns/op
# Warmup Iteration 4: 690.789 ns/op
# Warmup Iteration 5: 698.366 ns/op
Iteration 1: 692.108 ns/op
..
Iteration 10: 690.475 ns/op
org.xc.jmh.Example8ArrayCopying
Betray yourself quickly and efficiently
Do less but don't gain enough speed
public class Cache03 {
final int SIZE = 100_000;
final int[] src = new int[SIZE];
@Benchmark
public int step1() {
int sum = 0;
for (int i = 0; i < SIZE; i++) {
sum += src[i];
}
return sum;
}
@Benchmark
public int step20() {
int sum = 0;
for (int i = 0; i < SIZE; i = i + 20) {
sum += src[i];
}
return sum;
}
int step(int strideSize) {
int sum = 0;
for (int i = 0; i < SIZE; i = i + strideSize) {
sum += src[i];
}
return sum;
}
@Benchmark
public int stepUnrolled1() {return step(1);}
@Benchmark
public int stepUnrolled20() {return step(20);}
}
Benchmark Mode Cnt Score Error Units
Cache03.step1 avgt 3 27,341.974 ± 26.97 ns/op
Cache03.step2 avgt 3 13,677.435 ± 64.85 ns/op
Cache03.step20 avgt 3 2,740.613 ± 5.87 ns/op
Cache03.stepUnrolled1 avgt 3 27,355.850 ± 36.46 ns/op
Cache03.stepUnrolled2 avgt 3 13,674.883 ± 86.00 ns/op
Cache03.stepUnrolled20 avgt 3 2,739.449 ± 11.39 ns/op
# Benchmark: org.xc.jmh.Cache03.stepUnrolled1
# Warmup Iteration 1: 32292.624 ns/op
# Warmup Iteration 2: 29584.658 ns/op
# Warmup Iteration 3: 49148.501 ns/op
# Warmup Iteration 4: 50692.600 ns/op
# Warmup Iteration 5: 30464.858 ns/op
# Warmup Iteration 6: 27348.027 ns/op
Iteration 1: 27354.870 ns/op
Iteration 2: 27348.734 ns/op
Iteration 3: 27343.787 ns/op
Cache03.step1Reversed avgt 3 27,341.654 ± 67.54 ns/op
Cache03.stepUnrolled1Long avgt 3 41,423.789 ± 8710.51 ns/op
Cache03.stepUnrolled2Long avgt 3 86,654.227 ± 494.13 ns/op
Cache03.stepUnrolled20Long avgt 3 2,954.895 ± 892.58 ns/op
org.xc.jmh.Cache03
A look behind the scenes
# java -jar target/benchmarks.jar "Cache03.step(1|20)" -prof perfnorm
Benchmark Mode Cnt Score Units
Cache03.step1 avgt 3 31676.192 ns/op
Cache03.step1:IPC avgt 1.227 insns/clk
Cache03.step1:L1-dcache-load-misses avgt 5337.646 #/op
Cache03.step1:L1-dcache-loads avgt 83939.072 #/op
Cache03.step1:L1-icache-load-misses avgt 4.004 #/op
Cache03.step1:L1-icache-loads avgt 165.415 #/op
Cache03.step1:branch-misses avgt 13.406 #/op
Cache03.step1:branches avgt 5143.071 #/op
Cache03.step1:cycles avgt 81345.466 #/op
Cache03.step1:instructions avgt 99844.986 #/op
Cache03.step1:dTLB-load-misses avgt 0.194 #/op
Cache03.step1:dTLB-loads avgt 90.337 #/op
Cache03.step1:iTLB-load-misses avgt 0.112 #/op
Cache03.step1:iTLB-loads avgt 0.534 #/op
Cache03.step1:stalled-cycles-frontend avgt 534.119 #/op
# java -jar target/benchmarks.jar "Cache03.step(1|20)" -prof perfnorm
Benchmark Mode Cnt Score Units
Cache03.step20 avgt 3 3164.759 ns/op
Cache03.step20:IPC avgt 0.621 insns/clk
Cache03.step20:L1-dcache-load-misses avgt 4249.552 #/op
Cache03.step20:L1-dcache-loads avgt 4208.701 #/op
Cache03.step20:L1-icache-load-misses avgt 0.564 #/op
Cache03.step20:L1-icache-loads avgt 16.031 #/op
Cache03.step20:branch-misses avgt 0.857 #/op
Cache03.step20:branches avgt 269.741 #/op
Cache03.step20:cycles avgt 8164.074 #/op
Cache03.step20:instructions avgt 5072.712 #/op
Cache03.step20:dTLB-load-misses avgt 0.015 #/op
Cache03.step20:dTLB-loads avgt 89.055 #/op
Cache03.step20:iTLB-load-misses avgt 0.004 #/op
Cache03.step20:iTLB-loads avgt 0.017 #/op
Cache03.step20:stalled-cycles-frontend avgt 41.663 #/op
org.xc.jmh.Cache03
The hidden memory that speeds things up
Caches are misleading and production is different
public class Example10ArrayCopying {
@Param({"1000"})
int SIZE;
@Param({"1", "10", "1000", "10000"})
int COUNT;
final List<long[]> srcArrays = new ArrayList<>();
final List<long[]> destArrays = new ArrayList<>();
final Random r = new Random(7L);
@Setup
public void setup() {
srcArrays.clear();
destArrays.clear();
var a = new long[SIZE];
for (int i = 0; i < a.length; i++) {
a[i] = r.nextLong();
}
// get us enough arrays
for (int i = 0; i < COUNT; i++) {
srcArrays.add(Arrays.copyOf(a, a.length));
destArrays.add(Arrays.copyOf(a, a.length));
}
}
@Benchmark
public long[] systemCopy() {
var src = srcArrays.get(r.nextInt(srcArrays.size()));
var dest = destArrays.get(r.nextInt(destArrays.size()));
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
...
}
java -jar target/benchmarks.jar "Example10ArrayCopying"
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (COUNT) (SIZE) Mode Cnt Score Error Units
Example10ArrayCopying.manualCopy 1 1000 avgt 10 80.914 ± 1.215 ns/op
Example10ArrayCopying.manualCopy 10 1000 avgt 10 137.502 ± 0.495 ns/op
Example10ArrayCopying.manualCopy 1000 1000 avgt 10 244.279 ± 5.420 ns/op
Example10ArrayCopying.manualCopy 10000 1000 avgt 10 986.221 ± 9.278 ns/op
Example10ArrayCopying.systemCopy 1 1000 avgt 10 93.923 ± 0.517 ns/op
Example10ArrayCopying.systemCopy 10 1000 avgt 10 137.737 ± 0.538 ns/op
Example10ArrayCopying.systemCopy 1000 1000 avgt 10 245.400 ± 3.961 ns/op
Example10ArrayCopying.systemCopy 10000 1000 avgt 10 1051.927 ± 12.971 ns/op
--
Example10ArrayCopying.manualCopy 1 1000 avgt 10 80.677 ± 0.066 ns/op
Example10ArrayCopying.manualCopy 10 1000 avgt 10 133.435 ± 0.465 ns/op
Example10ArrayCopying.manualCopy 1000 1000 avgt 10 247.355 ± 4.421 ns/op
Example10ArrayCopying.manualCopy 10000 1000 avgt 10 1005.709 ± 13.220 ns/op
Example10ArrayCopying.systemCopy 1 1000 avgt 10 94.961 ± 2.043 ns/op
Example10ArrayCopying.systemCopy 10 1000 avgt 10 139.995 ± 3.271 ns/op
Example10ArrayCopying.systemCopy 1000 1000 avgt 10 240.796 ± 1.283 ns/op
Example10ArrayCopying.systemCopy 10000 1000 avgt 10 1066.152 ± 15.434 ns/op
# JDK 21 - T14s - W5-1/M10-1/F1 - Default Mem
Example10ArrayCopying.manualCopy 1 1000 avgt 10 220.405 ± 0.841 ns/op
Example10ArrayCopying.manualCopy 10 1000 avgt 10 398.055 ± 5.960 ns/op
Example10ArrayCopying.manualCopy 1000 1000 avgt 10 1205.517 ± 4.025 ns/op
Example10ArrayCopying.manualCopy 10000 1000 avgt 10 1593.276 ± 72.917 ns/op
Example10ArrayCopying.systemCopy 1 1000 avgt 10 400.866 ± 10.155 ns/op
Example10ArrayCopying.systemCopy 10 1000 avgt 10 464.524 ± 7.419 ns/op
Example10ArrayCopying.systemCopy 1000 1000 avgt 10 1246.732 ± 3.929 ns/op
Example10ArrayCopying.systemCopy 10000 1000 avgt 10 1649.171 ± 40.588 ns/op
org.xc.jmh.Example10ArrayCopying
Try harder to make the point
public class Cache01 {
@Param({"10", "1000", "10000"}) int SIZE = 1000;
@Param({"true", "false"}) boolean RANDOM = false;
@Param({"true", "false"}) boolean MOREGARBAGE = false;
final List<String> data = new ArrayList<>(SIZE);
final List<byte[]> garbage = new ArrayList<>(SIZE);
String[] CITIES = {"Berlin;", "Hannover;", "Prag;", "Rio;", "Hamburg;", "Paris;", "Rotterdam;"};
String[] TEMPERATURES = {"-99.9", "99.9", "3.4", "-12.2", "22.2", "26.8", "31.1", "11.0", "-5.6"};
@Setup(Level.Trial)
public void setup() {
data.clear();
garbage.clear();
final var r = new Random(7L);
for (int i = 0; i < SIZE; i++) {
if (MOREGARBAGE) {
garbage.add(new byte[r.nextInt(5000)]);
}
data.add(
CITIES[r.nextInt(CITIES.length)] +
TEMPERATURES[r.nextInt(TEMPERATURES.length)]);
}
if (RANDOM) {
Collections.shuffle(data, r);
}
}
@Benchmark
public double parse() {
var total = 0d;
for (int i = 0; i < data.size(); i++) {
var s = data.get(i);
var pos = s.indexOf(';');
total += pos + s.length(); // .parseDouble(s.substring(pos + 1));
}
return total;
}
}
java -jar target/benchmarks.jar "Cache01"
# JDK 21 - Hetzner AMD EPYC - W5-1/M10-1/F1 - Default Mem
Benchmark (MOREGARBAGE) (RANDOM) (SIZE) Mode Cnt Score Error Units
Cache01.parse true true 10 avgt 10 47.995 ± 0.396 ns/op
Cache01.parse true true 1000 avgt 10 6484.647 ± 69.412 ns/op
Cache01.parse true true 10000 avgt 10 81505.961 ± 13035.464 ns/op
Cache01.parse true false 10 avgt 10 47.452 ± 0.864 ns/op
Cache01.parse true false 1000 avgt 10 6138.623 ± 29.380 ns/op
Cache01.parse true false 10000 avgt 10 70276.588 ± 13068.121 ns/op
Cache01.parse false true 10 avgt 10 44.323 ± 0.249 ns/op
Cache01.parse false true 1000 avgt 10 4910.066 ± 61.657 ns/op
Cache01.parse false true 10000 avgt 10 63560.366 ± 4251.100 ns/op
Cache01.parse false false 10 avgt 10 44.133 ± 0.713 ns/op
Cache01.parse false false 1000 avgt 10 4556.780 ± 59.546 ns/op
Cache01.parse false false 10000 avgt 10 57926.441 ± 3680.559 ns/op
Cache01.parse true true 10 avgt 10 47.586 ± 0.509 ns/op
Cache01.parse true true 1000 avgt 10 6469.714 ± 2.503 ns/op
Cache01.parse true true 10000 avgt 10 81499.896 ± 12107.251 ns/op
Cache01.parse true false 10 avgt 10 48.136 ± 0.378 ns/op
Cache01.parse true false 1000 avgt 10 6156.876 ± 26.356 ns/op
Cache01.parse true false 10000 avgt 10 69994.126 ± 13665.307 ns/op
Cache01.parse false true 10 avgt 10 44.743 ± 0.497 ns/op
Cache01.parse false true 1000 avgt 10 4895.309 ± 11.820 ns/op
Cache01.parse false true 10000 avgt 10 72291.021 ± 13616.639 ns/op
Cache01.parse false false 10 avgt 10 44.206 ± 1.026 ns/op
Cache01.parse false false 1000 avgt 10 4539.098 ± 91.328 ns/op
Cache01.parse false false 10000 avgt 10 59608.531 ± 8642.156 ns/op
org.xc.jmh.Cache01
Check the details to see what is going on
# Parameters: (MOREGARBAGE = false, RANDOM = false, SIZE = 1000)
# Warmup Iteration 1: 4741.602 ns/op
# Warmup Iteration 2: 4620.027 ns/op
# Warmup Iteration 3: 4616.325 ns/op
# Warmup Iteration 4: 4573.271 ns/op
# Warmup Iteration 5: 4446.747 ns/op
Iteration 1: 4574.989 ns/op
Iteration 2: 4517.727 ns/op
Iteration 3: 4575.883 ns/op
Iteration 4: 4576.167 ns/op
Iteration 5: 4574.460 ns/op
Iteration 6: 4573.074 ns/op
Iteration 7: 4387.671 ns/op
Iteration 8: 4492.154 ns/op
Iteration 9: 4557.968 ns/op
Iteration 10: 4560.888 ns/op
# Parameters: (MOREGARBAGE = false, RANDOM = false, SIZE = 10000)
# Warmup Iteration 1: 45189.567 ns/op
# Warmup Iteration 2: 43549.135 ns/op
# Warmup Iteration 3: 43981.266 ns/op
# Warmup Iteration 4: 44022.431 ns/op
# Warmup Iteration 5: 43775.808 ns/op
Iteration 1: 57472.806 ns/op
Iteration 2: 57244.411 ns/op
Iteration 3: 55563.821 ns/op
Iteration 4: 61645.149 ns/op
Iteration 5: 56521.307 ns/op
Iteration 6: 56301.759 ns/op
Iteration 7: 56916.481 ns/op
Iteration 8: 55442.452 ns/op
Iteration 9: 66088.367 ns/op
Iteration 10: 72888.761 ns/op
# Parameters: (MOREGARBAGE = false, RANDOM = true, SIZE = 1000)
# Warmup Iteration 1: 4997.930 ns/op
# Warmup Iteration 2: 4906.259 ns/op
# Warmup Iteration 3: 4895.008 ns/op
# Warmup Iteration 4: 4893.813 ns/op
# Warmup Iteration 5: 4891.924 ns/op
Iteration 1: 4892.487 ns/op
Iteration 2: 4891.916 ns/op
Iteration 3: 4890.972 ns/op
Iteration 4: 4891.562 ns/op
Iteration 5: 4890.869 ns/op
Iteration 6: 4904.517 ns/op
Iteration 7: 4892.786 ns/op
Iteration 8: 4890.875 ns/op
Iteration 9: 4892.796 ns/op
Iteration 10: 4914.307 ns/op
# Parameters: (MOREGARBAGE = false, RANDOM = true, SIZE = 10000)
# Warmup Iteration 1: 62002.559 ns/op
# Warmup Iteration 2: 60539.963 ns/op
# Warmup Iteration 3: 61673.083 ns/op
# Warmup Iteration 4: 60632.315 ns/op
# Warmup Iteration 5: 60558.396 ns/op
Iteration 1: 60353.912 ns/op
Iteration 2: 60794.953 ns/op
Iteration 3: 82814.928 ns/op
Iteration 4: 73443.366 ns/op
Iteration 5: 71391.310 ns/op
Iteration 6: 70919.199 ns/op
Iteration 7: 79608.340 ns/op
Iteration 8: 81793.408 ns/op
Iteration 9: 61074.410 ns/op
Iteration 10: 80716.386 ns/op
org.xc.jmh.Cache01
Same data for every run, only touch it differently
public class Cache02
{
@Param({"10", "1000", "10000"}) int SIZE;
int totalData = 10_000;
int baseSize = 10_000;
int counter = 0;
	final List<List<String>> data = new ArrayList<>(SIZE);
String[] CITIES = {"Berlin;", "Hannover;", "Prag;", "Rio;", "Hamburg;", "Paris;", "Rotterdam;"};
String[] TEMPERATURES = {"-99.9", "99.9", "3.4", "-12.2", "22.2", "26.8", "31.1", "11.0", "-5.6"};
@Setup(Level.Trial)
public void setupTrial()
{
data.clear();
final var r = new Random(7L);
for (int j = 0; j < totalData; j++)
{
final List<String> d = new ArrayList<>(SIZE);
for (int i = 0; i < baseSize; i++)
{
d.add(CITIES[r.nextInt(CITIES.length)] + TEMPERATURES[r.nextInt(TEMPERATURES.length)]);
}
data.add(d);
}
}
@Setup(Level.Iteration)
public void setupIteration()
{
counter = 0;
}
@Benchmark
public double cacheHit()
{
var d = data.get(0);
var total = 0d;
for (int i = 0; i < SIZE; i++)
{
var s = d.get(i);
var pos = s.indexOf(';');
total += pos + s.length(); // .parseDouble(s.substring(pos + 1));
}
return total;
}
@Benchmark
public double cacheMiss()
{
var d = data.get(counter++ % totalData);
var total = 0d;
for (int i = 0; i < SIZE; i++)
{
var s = d.get(i);
var pos = s.indexOf(';');
total += pos + s.length(); // .parseDouble(s.substring(pos + 1));
}
return total;
}
}
org.xc.jmh.Cache02
Is it more correct now?
Benchmark (SIZE) Mode Cnt Score Error Units
Cache02.cacheHit 10 avgt 10 43.010 ± 0.184 ns/op
Cache02.cacheHit 1000 avgt 10 4778.031 ± 60.261 ns/op
Cache02.cacheHit 10000 avgt 10 64683.226 ± 14788.267 ns/op
Cache02.cacheMiss 10 avgt 10 336.260 ± 45.515 ns/op
Cache02.cacheMiss 1000 avgt 10 17669.625 ± 41.837 ns/op
Cache02.cacheMiss 10000 avgt 10 174635.237 ± 2976.802 ns/op
# Benchmark: org.xc.jmh.Cache02.cacheMiss
# Parameters: (SIZE = 10000)
# Warmup Iteration 1: 177439.666 ns/op
# Warmup Iteration 2: 174301.683 ns/op
# Warmup Iteration 3: 174319.358 ns/op
# Warmup Iteration 4: 175194.210 ns/op
# Warmup Iteration 5: 193149.452 ns/op
Iteration 1: 176797.338 ns/op
Iteration 2: 179366.082 ns/op
Iteration 3: 174686.931 ns/op
Iteration 4: 173912.824 ns/op
Iteration 5: 174186.462 ns/op
Iteration 6: 173900.033 ns/op
Iteration 7: 173583.888 ns/op
Iteration 8: 173610.283 ns/op
Iteration 9: 173037.852 ns/op
Iteration 10: 173270.673 ns/op
Benchmark (SIZE) Mode Cnt Score Error Units
Cache02.cacheHit 10 avgt 10 42.954 ± 0.304 ns/op
Cache02.cacheHit 1000 avgt 10 4802.211 ± 75.155 ns/op
Cache02.cacheHit 10000 avgt 10 72177.415 ± 6813.819 ns/op
Cache02.cacheMiss 10 avgt 10 415.397 ± 101.234 ns/op
Cache02.cacheMiss 1000 avgt 10 17653.832 ± 67.704 ns/op
Cache02.cacheMiss 10000 avgt 10 174592.955 ± 487.884 ns/op
# Benchmark: org.xc.jmh.Cache02.cacheHit
# Parameters: (SIZE = 10000)
# Warmup Iteration 1: 43386.186 ns/op
# Warmup Iteration 2: 47619.523 ns/op
# Warmup Iteration 3: 46837.359 ns/op
# Warmup Iteration 4: 44352.910 ns/op
# Warmup Iteration 5: 45543.546 ns/op
Iteration 1: 76304.197 ns/op
Iteration 2: 73811.738 ns/op
Iteration 3: 74944.359 ns/op
Iteration 4: 71217.526 ns/op
Iteration 5: 70320.534 ns/op
Iteration 6: 70690.886 ns/op
Iteration 7: 73316.245 ns/op
Iteration 8: 75228.869 ns/op
Iteration 9: 60798.436 ns/op
Iteration 10: 75141.359 ns/op
org.xc.jmh.Cache02
Ask someone who knows things
# Perf - requires Linux kernel tooling
java -jar target/benchmarks.jar "Cache02.cacheMiss" -prof perfnorm
cacheMiss 10000 avgt 202,126.64 ns/op
cacheMiss:IPC 10000 avgt 1.22 insns/clk
cacheMiss:L1-dcache-load-misses 10000 avgt 7,949.14 #/op
cacheMiss:L1-dcache-loads 10000 avgt 125,343.85 #/op
cacheMiss:branch-misses 10000 avgt 7,672.78 #/op
cacheMiss:branches 10000 avgt 211,824.84 #/op
cacheMiss:cycles 10000 avgt 516,010.56 #/op
cacheMiss:dTLB-load-misses 10000 avgt 124.12 #/op
cacheMiss:dTLB-loads 10000 avgt 152.94 #/op
cacheMiss:instructions 10000 avgt 634,205.58 #/op
cacheMiss:stalled-cycles-frontend 10000 avgt 54,376.13 #/op
# Benchmark: org.xc.jmh.Cache02.cacheMiss
# Parameters: (SIZE = 10000)
# Warmup Iteration 1: 177439.666 ns/op
# Warmup Iteration 2: 174301.683 ns/op
# Warmup Iteration 3: 174319.358 ns/op
# Warmup Iteration 4: 175194.210 ns/op
# Warmup Iteration 5: 193149.452 ns/op
Iteration 1: 176797.338 ns/op
Iteration 2: 179366.082 ns/op
Iteration 6: 173900.033 ns/op
Iteration 7: 173583.888 ns/op
Iteration 9: 173037.852 ns/op
# Might also just work on bare-metal or some clouds
java -jar target/benchmarks.jar "Cache02.cacheHit" -prof perfnorm
cacheHit 10000 avgt 67,556.39 ns/op
cacheHit:IPC 10000 avgt 3.87 insns/clk
cacheHit:L1-dcache-load-misses 10000 avgt 8,090.11 #/op
cacheHit:L1-dcache-loads 10000 avgt 102,131.89 #/op
cacheHit:branch-misses 10000 avgt 282.59 #/op
cacheHit:branches 10000 avgt 213,493.76 #/op
cacheHit:cycles 10000 avgt 172,101.15 #/op
cacheHit:dTLB-load-misses 10000 avgt 0.35 #/op
cacheHit:dTLB-loads 10000 avgt 159.13 #/op
cacheHit:instructions 10000 avgt 667,271.07 #/op
cacheHit:stalled-cycles-frontend 10000 avgt 7,618.23 #/op
# Benchmark: org.xc.jmh.Cache02.cacheHit
# Parameters: (SIZE = 10000)
# Warmup Iteration 1: 43386.186 ns/op
# Warmup Iteration 2: 47619.523 ns/op
# Warmup Iteration 3: 46837.359 ns/op
# Warmup Iteration 4: 44352.910 ns/op
# Warmup Iteration 5: 45543.546 ns/op
Iteration 1: 76304.197 ns/op
Iteration 2: 73811.738 ns/op
Iteration 6: 70690.886 ns/op
Iteration 7: 73316.245 ns/op
Iteration 9: 60798.436 ns/op
org.xc.jmh.Cache02
What Does it Cost to Leave the CPU?
Operation | Real | Humanized |
---|---|---|
CPU Cycle | 0.4 ns | 1 s |
L1 Cache | 0.9 ns | 2 s |
L2 Cache | 2.8 ns | 7 s |
L3 Cache | 28 ns | 1 min |
Memory Access | 100 ns | 4 min |
NVM SSD | 25 μs | 17 h |
SSD | 50–150 μs | 1.5-4 days |
HDD | 1–10 ms | 1-9 months |
TCP SF to NY | 65 ms | 5 years |
TCP SF to Hong Kong | 141 ms | 11 years |
http://www.prowesscorp.com/computer-latency-at-a-human-scale/
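A quick napkin calculation based on the table above (illustrative round numbers only; the class name is made up for this example):
public class LatencyNapkin
{
    public static void main(String[] args)
    {
        final int loads = 10_000;
        final double l1Ns  = loads * 0.9;   // every load hits L1:      ~9,000 ns
        final double memNs = loads * 100.0; // every load goes to RAM: ~1,000,000 ns
        System.out.printf("L1: %.0f ns, memory: %.0f ns, ~%.0fx slower%n",
            l1Ns, memNs, memNs / l1Ns);
    }
}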
Everything happens in parallel
https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake
* depends on the type and generation
What have we learned so far?
Let's get into more details
The when, where, and how often
Just some basics first
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class C01 {
final int SIZE = 100_000;
final byte[] src = new byte[SIZE];
final byte[] dest = new byte[SIZE];
...
@Benchmark
public byte[] systemCopy() {
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
@Benchmark
public byte[] manualCopy() {
for (int i = 0; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
# Hetzner
Benchmark Mode Cnt Score Error Units
C01.manualCopy avgt 3 1751.934 ± 8.782 ns/op
C01.systemCopy avgt 3 1805.559 ± 60.820 ns/op
# T14s
Benchmark Mode Cnt Score Error Units
C01.manualCopy avgt 3 4881.513 ± 1209.687 ns/op
C01.systemCopy avgt 3 4834.282 ± 1066.856 ns/op
org.xc.jmh.C01
Vary warmup, keep measurement the same
@Measurement(iterations = 1, time=10, timeUnit = TimeUnit.MILLISECONDS)
...
public class C02
{
@Benchmark
@Warmup(iterations = 0)
public byte[] systemCopyCold()
@Benchmark
@Warmup(iterations = 1, time=10, timeUnit = TimeUnit.MILLISECONDS)
public byte[] systemCopyWarm()
@Benchmark
@Warmup(iterations = 1, time=1000, timeUnit = TimeUnit.MILLISECONDS)
public byte[] systemCopyHot()
@Benchmark
@Warmup(iterations = 0)
public byte[] manualCopyCold()
@Benchmark
@Warmup(iterations = 1, time=10, timeUnit = TimeUnit.MILLISECONDS)
public byte[] manualCopyWarm()
@Benchmark
@Warmup(iterations = 1, time=1000, timeUnit = TimeUnit.MILLISECONDS)
public byte[] manualCopyHot()
}
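The method bodies are elided above; presumably they are the same copy loops as in C01 and only the warmup settings differ, e.g. for the cold system copy:
@Benchmark
@Warmup(iterations = 0)
public byte[] systemCopyCold()
{
    System.arraycopy(src, 0, dest, 0, src.length);
    return dest;
}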
# Hetzner
Benchmark Mode Cnt Score Error Units
C02.manualCopyCold avgt 241472.047 ns/op
C02.manualCopyHot avgt 1773.845 ns/op
C02.manualCopyWarm avgt 4604.600 ns/op
C02.systemCopyCold avgt 1938.973 ns/op
C02.systemCopyHot avgt 3349.491 ns/op
C02.systemCopyWarm avgt 1909.157 ns/op
---
C02.manualCopyCold avgt 652502.813 ns/op
C02.manualCopyHot avgt 1785.308 ns/op
C02.manualCopyWarm avgt 1830.773 ns/op
C02.systemCopyCold avgt 1954.546 ns/op
C02.systemCopyHot avgt 1854.376 ns/op
C02.systemCopyWarm avgt 3774.145 ns/op
# T14s
Benchmark Mode Cnt Score Error Units
C02.manualCopyCold avgt 866723.538 ns/op
C02.manualCopyHot avgt 4951.803 ns/op
C02.manualCopyWarm avgt 107999.305 ns/op
C02.systemCopyCold avgt 5731.386 ns/op
C02.systemCopyHot avgt 5007.324 ns/op
C02.systemCopyWarm avgt 5213.909 ns/op
org.xc.jmh.C02
Look under the hood again
java -jar target/benchmarks.jar "C02" -jvmArgs "-XX:+PrintCompilation"
# Benchmark: org.xc.jmh.C02.systemCopyHot
# Fork: 1 of 1
...
# Warmup Iteration 1:
...
175 506 3 org.xc.jmh.C02::systemCopyHot (23 bytes)
175 507 3 org.openjdk.jmh.infra.Blackhole::consume (19 bytes)
175 508 3 org.openjdk.jmh.infra.Blackhole::consumeCompiler (1 bytes)
185 509 4 org.xc.jmh.C02::systemCopyHot (23 bytes)
# 186 506 3 org.xc.jmh.C02::systemCopyHot (23 bytes) made not entrant
186 510 4 org.openjdk.jmh.infra.Blackhole::consume (19 bytes)
# 186 507 3 org.openjdk.jmh.infra.Blackhole::consume (19 bytes) made not entrant
287 511 % 3 org.xc...jmhTest::systemCopyHot_avgt_jmhStub @ 13 (57 bytes)
289 512 3 org.xc...jmhTest::systemCopyHot_avgt_jmhStub (57 bytes)
361 513 % 4 org.xc...jmhTest::systemCopyHot_avgt_jmhStub @ 13 (57 bytes)
# 368 511 % 3 org.xc...jmhTest::systemCopyHot_avgt_jmhStub @ 13 (57 bytes) made not entrant
#1175 513 % 4 org.xc...jmhTest::systemCopyHot_avgt_jmhStub @ 13 (57 bytes) made not entrant
1820.859 ns/op
Iteration 1: 4797.916 ns/op
java -jar target/benchmarks.jar "C02" -jvmArgs "-XX:+PrintCompilation"
# Benchmark: org.xc.jmh.C02.systemCopyWarm
# Fork: 1 of 1
...
# Warmup Iteration 1:
...
210 539 3 org.xc.jmh.C02::systemCopyWarm (23 bytes)
211 540 3 org.openjdk.jmh.infra.Blackhole::consume (19 bytes)
220 541 4 org.xc.jmh.C02::systemCopyWarm (23 bytes)
#221 539 3 org.xc.jmh.C02::systemCopyWarm (23 bytes) made not entrant
221 542 4 org.openjdk.jmh.infra.Blackhole::consume (19 bytes)
#221 540 3 org.openjdk.jmh.infra.Blackhole::consume (19 bytes) made not entrant
1923.731 ns/op
Iteration 1: 1859.749 ns/op
org.xc.jmh.C02
Limit the JIT's powers
# Only compile quick and easy, never leave that again
java -jar target/benchmarks.jar "C02" -jvmArgs "-XX:TieredStopAtLevel=1"
Benchmark Mode Cnt Score Error Units
C02.manualCopyCold avgt 197266.887 ns/op
C02.manualCopyHot avgt 146286.457 ns/op
C02.manualCopyWarm avgt 147154.377 ns/op
C02.systemCopyCold avgt 1943.922 ns/op
C02.systemCopyHot avgt 1775.785 ns/op
C02.systemCopyWarm avgt 3744.589 ns/op
--
C02.manualCopyCold avgt 144304.352 ns/op
C02.manualCopyHot avgt 57894.731 ns/op
C02.manualCopyWarm avgt 115281.330 ns/op
C02.systemCopyCold avgt 1940.215 ns/op
C02.systemCopyHot avgt 1878.354 ns/op
C02.systemCopyWarm avgt 4590.958 ns/op
--
C02.manualCopyCold avgt 295074.343 ns/op
C02.manualCopyHot avgt 58333.523 ns/op
C02.manualCopyWarm avgt 57356.597 ns/op
C02.systemCopyCold avgt 1943.131 ns/op
C02.systemCopyHot avgt 1856.031 ns/op
C02.systemCopyWarm avgt 4736.612 ns/op
# Don't background compile and don't switch between tiers
java -jar target/benchmarks.jar "C02" \
-jvmArgs "-XX:-TieredCompilation -Xbatch"
Benchmark Mode Cnt Score Error Units
C02.manualCopyCold avgt 26368588.000 ns/op
C02.manualCopyHot avgt 1758.489 ns/op
C02.manualCopyWarm avgt 11940991.000 ns/op
C02.systemCopyCold avgt 7170.455 ns/op
C02.systemCopyHot avgt 1852.513 ns/op
C02.systemCopyWarm avgt 7268.862 ns/op
--
C02.manualCopyCold avgt 26817602.000 ns/op
C02.manualCopyHot avgt 1787.099 ns/op
C02.manualCopyWarm avgt 11927175.000 ns/op
C02.systemCopyCold avgt 7158.861 ns/op
C02.systemCopyHot avgt 1847.176 ns/op
C02.systemCopyWarm avgt 7608.851 ns/op
--
C02.manualCopyCold avgt 24100619.000 ns/op
C02.manualCopyHot avgt 1760.130 ns/op
C02.manualCopyWarm avgt 13403527.500 ns/op
C02.systemCopyCold avgt 7284.120 ns/op
C02.systemCopyHot avgt 4610.946 ns/op
C02.systemCopyWarm avgt 7716.843 ns/op
org.xc.jmh.C02
Never do streams and lambdas. Period... or?
public class Streams {
final int SIZE = 1024;
final int[] integers = new int[SIZE];
...
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayWarm() {...}
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayHot() {...}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v % 2 == 0) {
sum += v;
}
}
return sum;
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int loopWarm() {...}
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int loopHot() {...}
}
Benchmark Mode Cnt Score Error Units
Streams.lambdaArrayCold avgt 5 574,212.40 ± 3,911,792 ns/op
Streams.lambdaArrayWarm avgt 5 27,398.98 ± 181,998 ns/op
Streams.lambdaArrayHot avgt 5 860.11 ± 2,943 ns/op
Streams.loopCold avgt 5 26,345.93 ± 12,867 ns/op
Streams.loopWarm avgt 5 605.36 ± 225 ns/op
Streams.loopHot avgt 5 1,568.21 ± 1882 ns/op
Streams.lambdaArrayCold
Iteration 1: 2391282.000 ns/op
Iteration 2: 144067.000 ns/op
Iteration 3: 119609.000 ns/op
Iteration 4: 109935.000 ns/op
Iteration 5: 106169.000 ns/op
Streams.lambdaArrayWarm
Iteration 1: 111857.000 ns/op
Iteration 2: 4564.900 ns/op
Iteration 3: 4609.857 ns/op
Iteration 4: 6015.000 ns/op
Iteration 5: 9948.167 ns/op
Streams.loopCold
Iteration 1: 22230.655 ns/op
Iteration 2: 25418.000 ns/op
Iteration 3: 28242.000 ns/op
Iteration 4: 24892.500 ns/op
Iteration 5: 30946.500 ns/op
You shall not care too much
public class HeavyStreams {
// setup as seen before, but integers is a BigInteger[] here
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays.stream(integers)
.filter(i -> i.mod(BigInteger.TWO).equals(BigInteger.ZERO))
.map(i -> i.sqrt())
.mapToInt(BigInteger::intValue)
.sum();
}
...
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v.mod(BigInteger.TWO).equals(BigInteger.ZERO)) {
sum += v.sqrt().intValue();
}
}
return sum;
}
}
# Milliseconds now!!!
Benchmark Mode Cnt Score Error Units
HeavyStreams.lambdaArrayCold avgt 5 283.99 ± 220.86 ms/op
HeavyStreams.lambdaArrayWarm avgt 5 254.92 ± 66.34 ms/op
HeavyStreams.lambdaArrayHot avgt 5 247.91 ± 53.26 ms/op
HeavyStreams.loopCold avgt 5 297.98 ± 165.77 ms/op
HeavyStreams.loopWarm avgt 5 248.82 ± 51.59 ms/op
HeavyStreams.loopHot avgt 5 253.76 ± 68.80 ms/op
HeavyStreams.lambdaArrayCold
Iteration 1: 359.072 ms/op
Iteration 2: 332.922 ms/op
Iteration 3: 242.727 ms/op
Iteration 4: 243.821 ms/op
Iteration 5: 241.427 ms/op
HeavyStreams.lambdaArrayWarm
Iteration 1: 241.479 ms/op
Iteration 2: 272.908 ms/op
Iteration 3: 242.048 ms/op
Iteration 4: 274.634 ms/op
Iteration 5: 243.568 ms/op
HeavyStreams.loopCold
Iteration 1: 349.515 ms/op
Iteration 2: 326.304 ms/op
Iteration 3: 301.334 ms/op
Iteration 4: 271.928 ms/op
Iteration 5: 240.832 ms/op
consume to the rescue
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(iterations = 5, batchSize = 1, time=1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class HeavyStreamsConsume {
final int SIZE = 1024;
final int[] integers = new int[SIZE];
static int calcAndBurn(int a) {
Blackhole.consumeCPU(10011);
return a % 2;
}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays
.stream(integers)
.filter(i -> calcAndBurn(i) == 0).sum();
}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (calcAndBurn(v) == 0) {
sum += v;
}
}
return sum;
}
}
# Microseconds now!!!
Benchmark Mode Cnt Score Error Units
HeavyStreamsConsume.lambdaArrayCold avgt 5 21,165.6 ± 12,498.1 us/op
HeavyStreamsConsume.lambdaArrayHot avgt 5 19,653.0 ± 84.0 us/op
HeavyStreamsConsume.lambdaArrayWarm avgt 5 19,693.3 ± 170.3 us/op
HeavyStreamsConsume.loopCold avgt 5 20,642.6 ± 8,376.7 us/op
HeavyStreamsConsume.loopHot avgt 5 19,622.6 ± 88.5 us/op
HeavyStreamsConsume.loopWarm avgt 5 19,656.9 ± 38.7 us/op
HeavyStreamsConsume.lambdaArrayCold
Iteration 1: 26971.615 us/op
Iteration 2: 19761.449 us/op
Iteration 3: 19707.116 us/op
Iteration 4: 19702.641 us/op
Iteration 5: 19685.665 us/op
HeavyStreamsConsume.lambdaArrayWarm
Iteration 1: 19771.183 us/op
Iteration 2: 19678.924 us/op
Iteration 3: 19682.219 us/op
Iteration 4: 19672.906 us/op
Iteration 5: 19661.459 us/op
HeavyStreamsConsume.loopCold
Iteration 1: 24534.031 us/op
Iteration 2: 19690.451 us/op
Iteration 3: 19663.822 us/op
Iteration 4: 19651.893 us/op
Iteration 5: 19672.885 us/op
How you can measure
Measurement approaches
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Thread)
public class M01
{
...
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public long sum_AverageTime()
{
long result = 0;
for (int i = 0; i < data.length; i++)
{
result += data[i];
Blackhole.consumeCPU(10);
}
return result;
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public long sum_Throughput() { ... }
@Benchmark
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public long sum_SingleShotTime() { ... }
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public long sum_SampleTime() { ... }
}
Benchmark Mode Cnt Score Error Units
M01.sum_Throughput thrpt 3 74.127 ± 5.157 ops/ms
M01.sum_AverageTime avgt 3 13357.290 ± 1096.865 ns/op
M01.sum_SingleShotTime ss 3 147726.000 ± 486517.167 ns/op
M01.sum_SampleTime sample 56633 13305.622 ± 21.038 ns/op
M01.sum_SampleTime:p0.00 sample 12640.000 ns/op
M01.sum_SampleTime:p0.50 sample 13184.000 ns/op
M01.sum_SampleTime:p0.90 sample 13280.000 ns/op
M01.sum_SampleTime:p0.95 sample 13344.000 ns/op
M01.sum_SampleTime:p0.99 sample 17888.000 ns/op
M01.sum_SampleTime:p0.999 sample 33472.000 ns/op
M01.sum_SampleTime:p0.9999 sample 45036.966 ns/op
M01.sum_SampleTime:p1.00 sample 51008.000 ns/op
org.xc.jmh.M01
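Not shown on the slide: instead of annotating every method, the mode can also be forced from the JMH command line, which overrides the annotations for the selected benchmarks.
java -jar target/benchmarks.jar "M01" -bm avgt
java -jar target/benchmarks.jar "M01" -bm sample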
Flow changes are ... well
When things go wrong extremely rarely
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
public class Error01 {
int iterationCount = 0;
final String[] integers = new String[1000];
@Setup(Level.Iteration)
public void setup(BenchmarkParams params)
{
iterationCount++;
for (int i = 0; i < integers.length; i++) {
integers[i] = new String(String.valueOf(i));
}
// inject a value that does not parse, but only in the sixth measurement iteration
if (iterationCount == params.getWarmup().getCount() + 6) {
integers[integers.length / 2] = "any";
}
else {
integers[integers.length / 2] = "42";
}
}
@Benchmark
public long parse() {
long total = 0;
for (int i = 0; i < integers.length; i++) {
try {
total += Integer.valueOf(integers[i]);
}
catch(NumberFormatException e) {
total = 0;
}
}
return total;
}
}
# Benchmark: org.xc.jmh.Error01.parse
# Warmup Iteration 1: 5886.587 ns/op
# Warmup Iteration 2: 5945.850 ns/op
# Warmup Iteration 3: 5723.013 ns/op
Iteration 1: 5726.163 ns/op
Iteration 2: 5721.427 ns/op
Iteration 3: 5724.125 ns/op
Iteration 4: 5725.052 ns/op
Iteration 5: 5726.114 ns/op
Iteration 6: 7242.212 ns/op
Iteration 7: 5955.518 ns/op
Iteration 8: 5942.675 ns/op
Iteration 9: 5937.309 ns/op
Iteration 10: 5933.929 ns/op
org.xc.jmh.Error01
When things go wrong all the time
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Fork(value = 1, jvmArgsAppend = {"-Xms4g", "-Xmx4g", "-XX:+AlwaysPreTouch"})
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class Error02 {
int iterationCount = 0;
final String[] integers = new String[1000];
@Setup(Level.Iteration)
public void setup(BenchmarkParams params) {
iterationCount++;
for (int i = 0; i < integers.length; i++) {
integers[i] = new String(String.valueOf(i));
}
// kept from Error01, but here both branches insert a non-numeric value, so parsing fails in every iteration
if (iterationCount == params.getWarmup().getCount() + 6) {
integers[integers.length / 2] = "any";
}
else {
integers[integers.length / 2] = "yas";
}
}
@Benchmark
public long parse() {
long total = 0;
for (int i = 0; i < integers.length; i++) {
try {
total += Integer.valueOf(integers[i]);
}
catch(NumberFormatException e) {
total = 0;
}
}
return total;
}
}
# Benchmark: org.xc.jmh.Error02.parse
# Warmup Iteration 1: 7003.412 ns/op
# Warmup Iteration 2: 7318.162 ns/op
# Warmup Iteration 3: 6910.102 ns/op
Iteration 1: 6962.584 ns/op
Iteration 2: 6912.097 ns/op
Iteration 3: 6974.506 ns/op
Iteration 4: 6919.982 ns/op
Iteration 5: 6918.328 ns/op
Iteration 6: 7044.448 ns/op
Iteration 7: 6898.633 ns/op
Iteration 8: 6897.084 ns/op
Iteration 9: 6908.213 ns/op
Iteration 10: 6900.028 ns/op
org.xc.jmh.Error02
I know what you might evaluate to
public class BranchPrediction {
private static final int COUNT = 10000;
// sorted ints
private int[] sorted = new int[COUNT];
// fully unsorted
private int[] unsorted = new int[COUNT];
// alternating branch usage
private int[] pattern2 = new int[COUNT];
// 80/20 branch usage
private int[] pattern5 = new int[COUNT];
// data from pattern2, but shuffled
private int[] shuffledPattern = new int[COUNT];
@Setup
public void setup() {
final Random r = new Random(13241L);
for (int i = 0; i < COUNT; i++) {
final int d = r.nextInt(100_000_000) + 1;
sorted[i] = d;
unsorted[i] = d;
pattern2[i] = i % 2 == 0 ? 100_000_000 : r.nextInt(50_000_000);
pattern5[i] = i % 5 == 0 ? 100_000_000 : r.nextInt(50_000_000);
shuffledPattern[i] = pattern2[i];
}
Arrays.sort(sorted);
// make it noisy to break the branch predictor
for (int i = 0; i < shuffledPattern.length; i++) {
int j = r.nextInt(shuffledPattern.length);
var t = shuffledPattern[j];
shuffledPattern[j] = shuffledPattern[i];
shuffledPattern[i] = t;
}
}
public long doIt(int[] array) {
long sum = 2;
for (int v : array) {
if (v > 50_000_000) {
sum += (v >> 2);
}
else {
sum += (v << 2);
}
}
return sum;
}
@Benchmark
public long sorted() {
return doIt(sorted);
}
...
}
# Hetzner
Benchmark Mode Cnt Score Error Units
Branch01.pattern2 avgt 5 4157.021 ± 231.482 ns/op
Branch01.pattern5 avgt 5 4154.633 ± 226.007 ns/op
Branch01.random avgt 5 8155.178 ± 2.132 ns/op
Branch01.sorted avgt 5 4196.762 ± 202.215 ns/op
Branch01.unsorted avgt 5 8131.623 ± 856.262 ns/op
# T14s
Branch01.pattern2 avgt 5 12285.168 ± 1418.264 ns/op
Branch01.pattern5 avgt 5 10644.531 ± 418.107 ns/op
Branch01.random avgt 5 40382.077 ± 9874.619 ns/op
Branch01.sorted avgt 5 10577.268 ± 2053.626 ns/op
Branch01.unsorted avgt 5 40491.795 ± 15274.222 ns/op
org.xc.jmh.Branch01
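The per-counter statistics on the next slides were presumably gathered with the perfnorm profiler again, along the lines of:
java -jar target/benchmarks.jar "Branch01" -prof perfnorm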
We love statistics - AMD EPYC
Branch01.pattern2 avgt 5 4763.210 ± 12.230 ns/op
Branch01.pattern2:IPC avgt 4.740 insns/clk
Branch01.pattern2:L1-dcache-load-misses avgt 535.867 #/op
Branch01.pattern2:L1-dcache-loads avgt 8442.117 #/op
Branch01.pattern2:branch-misses avgt 4.468 #/op
Branch01.pattern2:branches avgt 14288.724 #/op
Branch01.pattern2:cycles avgt 12237.059 #/op
Branch01.pattern2:dTLB-load-misses avgt 0.023 #/op
Branch01.pattern2:dTLB-loads avgt 0.318 #/op
Branch01.pattern2:instructions avgt 58002.780 #/op
Branch01.pattern2:stalled-cycles-frontend avgt 100.277 #/op
Branch01.pattern5 avgt 5 4192.828 ± 359.636 ns/op
Branch01.pattern5:IPC avgt 5.220 insns/clk
Branch01.pattern5:L1-dcache-load-misses avgt 535.540 #/op
Branch01.pattern5:L1-dcache-loads avgt 8428.333 #/op
Branch01.pattern5:branch-misses avgt 3.536 #/op
Branch01.pattern5:branches avgt 11897.675 #/op
Branch01.pattern5:cycles avgt 10714.181 #/op
Branch01.pattern5:dTLB-load-misses avgt 0.021 #/op
Branch01.pattern5:dTLB-loads avgt 0.296 #/op
Branch01.pattern5:instructions avgt 55924.576 #/op
Branch01.pattern5:stalled-cycles-frontend avgt 129.842 #/op
Branch01.random avgt 5 7763.375 ± 483.591 ns/op
Branch01.random:IPC avgt 2.935 insns/clk
Branch01.random:L1-dcache-load-misses avgt 534.159 #/op
Branch01.random:L1-dcache-loads avgt 8454.813 #/op
Branch01.random:branch-misses avgt 2.361 #/op
Branch01.random:branches avgt 14225.074 #/op
Branch01.random:cycles avgt 19737.984 #/op
Branch01.random:dTLB-load-misses avgt 0.029 #/op
Branch01.random:dTLB-loads avgt 0.528 #/op
Branch01.random:instructions avgt 57938.467 #/op
Branch01.random:stalled-cycles-frontend avgt 6542.892 #/op
Branch01.sorted avgt 5 4773.460 ± 13.965 ns/op
Branch01.sorted:IPC avgt 4.740 insns/clk
Branch01.sorted:L1-dcache-load-misses avgt 534.771 #/op
Branch01.sorted:L1-dcache-loads avgt 8431.196 #/op
Branch01.sorted:branch-misses avgt 5.447 #/op
Branch01.sorted:branches avgt 14266.975 #/op
Branch01.sorted:cycles avgt 12229.820 #/op
Branch01.sorted:dTLB-load-misses avgt 0.023 #/op
Branch01.sorted:dTLB-loads avgt 0.299 #/op
Branch01.sorted:instructions avgt 57969.496 #/op
Branch01.sorted:stalled-cycles-frontend avgt 98.264 #/op
Branch01.unsorted avgt 5 8255.555 ± 18.594 ns/op
Branch01.unsorted:IPC avgt 2.733 insns/clk
Branch01.unsorted:L1-dcache-load-misses avgt 532.575 #/op
Branch01.unsorted:L1-dcache-loads avgt 8399.977 #/op
Branch01.unsorted:branch-misses avgt 2.683 #/op
Branch01.unsorted:branches avgt 14196.794 #/op
Branch01.unsorted:cycles avgt 21289.886 #/op
Branch01.unsorted:dTLB-load-misses avgt 0.021 #/op
Branch01.unsorted:dTLB-loads avgt 0.434 #/op
Branch01.unsorted:instructions avgt 58183.757 #/op
Branch01.unsorted:stalled-cycles-frontend avgt 8768.434 #/op
org.xc.jmh.Branch01
We love statistics - T14s AMD Ryzen 7 PRO 4750U
Branch01.pattern2 avgt 5 12644.421 ± 32.790 ns/op
Branch01.pattern2:IPC avgt 4.101 insns/clk
Branch01.pattern2:L1-dcache-load-misses avgt 635.322 #/op
Branch01.pattern2:L1-dcache-loads avgt 10065.464 #/op
Branch01.pattern2:branch-misses avgt 5.700 #/op
Branch01.pattern2:branches avgt 17610.050 #/op
Branch01.pattern2:cycles avgt 17783.624 #/op
Branch01.pattern2:instructions avgt 72937.027 #/op
Branch01.pattern2:stalled-cycles-backend avgt 2556.326 #/op
Branch01.pattern2:stalled-cycles-frontend avgt 14.046 #/op
Branch01.pattern5 avgt 5 11561.519 ± 91.598 ns/op
Branch01.pattern5:IPC avgt 4.301 insns/clk
Branch01.pattern5:L1-dcache-load-misses avgt 634.284 #/op
Branch01.pattern5:L1-dcache-loads avgt 10058.859 #/op
Branch01.pattern5:branch-misses avgt 4.610 #/op
Branch01.pattern5:branches avgt 14569.519 #/op
Branch01.pattern5:cycles avgt 16247.216 #/op
Branch01.pattern5:instructions avgt 69877.428 #/op
Branch01.pattern5:stalled-cycles-backend avgt 3526.451 #/op
Branch01.pattern5:stalled-cycles-frontend avgt 19.762 #/op
Branch01.random avgt 5 39071.484 ± 11327.252 ns/op
Branch01.random:IPC avgt 1.285 insns/clk
Branch01.random:L1-dcache-load-misses avgt 650.766 #/op
Branch01.random:L1-dcache-loads avgt 11321.971 #/op
Branch01.random:branch-misses avgt 975.152 #/op
Branch01.random:branches avgt 17595.904 #/op
Branch01.random:cycles avgt 56724.714 #/op
Branch01.random:instructions avgt 72887.307 #/op
Branch01.random:stalled-cycles-backend avgt 270.540 #/op
Branch01.random:stalled-cycles-frontend avgt 1286.695 #/op
Branch01.sorted avgt 5 11606.551 ± 47.793 ns/op
Branch01.sorted:IPC avgt 4.464 insns/clk
Branch01.sorted:L1-dcache-load-misses avgt 634.906 #/op
Branch01.sorted:L1-dcache-loads avgt 10074.952 #/op
Branch01.sorted:branch-misses avgt 6.583 #/op
Branch01.sorted:branches avgt 17603.203 #/op
Branch01.sorted:cycles avgt 16285.361 #/op
Branch01.sorted:instructions avgt 72693.454 #/op
Branch01.sorted:stalled-cycles-backend avgt 3483.882 #/op
Branch01.sorted:stalled-cycles-frontend avgt 31.488 #/op
Branch01.unsorted avgt 5 37313.611 ± 14404.163 ns/op
Branch01.unsorted:IPC avgt 1.400 insns/clk
Branch01.unsorted:L1-dcache-load-misses avgt 661.293 #/op
Branch01.unsorted:L1-dcache-loads avgt 11423.488 #/op
Branch01.unsorted:branch-misses avgt 793.127 #/op
Branch01.unsorted:branches avgt 17588.644 #/op
Branch01.unsorted:cycles avgt 52257.003 #/op
Branch01.unsorted:instructions avgt 73173.150 #/op
Branch01.unsorted:stalled-cycles-backend avgt 205.963 #/op
Branch01.unsorted:stalled-cycles-frontend avgt 1083.194 #/op
org.xc.jmh.Branch01
Making things realistic
Shall we use streams or loops?
public class E01
{
final int[] integers = new int[1000];
@Setup
public void setup() {
for (int i = 0; i < integers.length; i++) {
integers[i] = i;
}
}
@Benchmark
public int lambda() {
return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
}
@Benchmark
public int loop() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v % 2 == 0) {
sum += v;
}
}
return sum;
}
}
# No CPU burning, JDK 17, T14s
Benchmark Mode Cnt Score Error Units
E01.lambda avgt 3 614.137 ± 27.369 ns/op
E01.loop avgt 3 537.428 ± 49.316 ns/op
# Check GC -prof gc
Benchmark Mode Cnt Score Error Units
E01.lambda avgt 3 646.987 ± 105.953 ns/op
E01.lambda:gc.alloc.rate.norm avgt 3 240.000 ± 0.001 B/op
E01.loop avgt 3 534.282 ± 37.081 ns/op
E01.loop:gc.alloc.rate.norm avgt 3 ≈ 10⁻⁴ B/op
# SURPRISE - HETZNER AMD EPYC
Benchmark Mode Cnt Score Error Units
E01.lambda avgt 3 466.097 ± 6.798 ns/op
E01.loop avgt 3 548.451 ± 1.450 ns/op
# SURPRISE - JDK 21 T14s
Benchmark Mode Cnt Score Error Units
E01.lambda avgt 3 600.270 ± 275.706 ns/op
E01.loop avgt 3 661.141 ± 3868.363 ns/op
org.xc.jmh.E01
Burn CPU without consuming memory
public class E02
{
...
private static int calc(final int a) {
// consume time and cpu, but don't waste memory
// trick the compiler too and avoid optimization
// by creating a side-effect
Blackhole.consumeCPU(100);
return a;
}
@Benchmark
public int lambda() {
return Arrays.stream(integers)
.filter(i -> calc(i) % 2 == 0)
.sum();
}
@Benchmark
public int loop() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = calc(integers[i]);
if (v % 2 == 0) {
sum += v;
}
}
return sum;
}
}
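The Hetzner results below vary a WORK parameter that is not visible in this excerpt; presumably the consumeCPU tokens are parameterized, roughly like this hypothetical sketch:
@Param({"1", "10", "100"})
int WORK;
private int calc(final int a)
{
    // burn CPU proportional to WORK without allocating
    Blackhole.consumeCPU(WORK);
    return a;
}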
# T14s, JDK 17
Benchmark Mode Cnt Score Error Units
E01.lambda avgt 3 614.137 ± 27.369 ns/op
E01.loop avgt 3 537.428 ± 49.316 ns/op
Benchmark Mode Cnt Score Error Units
E02.lambda avgt 3 160445.533 ± 8399.802 ns/op
E02.lambda:gc.alloc.rate.norm avgt 3 240.086 ± 0.058 B/op
E02.loop avgt 3 161183.136 ± 6899.931 ns/op
E02.loop:gc.alloc.rate.norm avgt 3 0.086 ± 0.057 B/op
# Hetzner JDK 21
Benchmark (WORK) Mode Cnt Score Error Units
E02.lambda 1 avgt 3 3915.734 ± 1001.819 ns/op
E02.lambda 10 avgt 3 11209.627 ± 18.601 ns/op
E02.lambda 100 avgt 3 175428.897 ± 3424.067 ns/op
E02.loop 1 avgt 3 4116.134 ± 14.891 ns/op
E02.loop 10 avgt 3 10768.445 ± 141.270 ns/op
E02.loop 100 avgt 3 173909.239 ± 3531.735 ns/op
org.xc.jmh.E02
The learning goes on
perf is nice but not clear sometimes
How to run multi-threaded
Independent workloads
...
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class T01
{
final int SIZE = 10000;
List<String> csv = new ArrayList<>(SIZE);
@Setup(Level.Trial)
public void setup() {
var r = new FastRandom(41L);
for (int i = 0; i < SIZE; i++) {
var sb = new StringJoiner(",");
for (int j = 0; j < 5; j++) {
sb.add(RandomUtils.randomString(r, 15));
}
csv.add(sb.toString());
}
}
@Benchmark
@Threads(1)
public long split_1() {
long count = 0;
for (var s : csv) {
count += s.split(",").length;
}
return count;
}
@Benchmark
@Threads(2)
public long split_2() { ... }
}
@State: Per benchmark
@OutputTimeUnit: In microseconds
@Setup(Level.Trial): Setup once only
@Threads(2): How many threads should run that
org.xc.jmh.T01
Let's see how it performs
# T14s Gen 1, AMD 8+8 Core, 32 GB memory
# Mode powersave (battery)
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 1111.237 ± 76.534 us/op
T01.split_2 avgt 3 1247.724 ± 149.086 us/op
T01.split_4 avgt 3 1387.988 ± 255.601 us/op
T01.split_6 avgt 3 1659.864 ± 226.979 us/op
T01.split_8 avgt 3 2243.382 ± 207.846 us/op
# T14s Gen 1, AMD 8+8 Core, 32 GB memory
# Mode performance (plugged in)
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 1035.574 ± 139.435 us/op
T01.split_2 avgt 3 1167.293 ± 102.166 us/op
T01.split_4 avgt 3 1480.129 ± 563.533 us/op
T01.split_6 avgt 3 1933.609 ± 1095.241 us/op
T01.split_8 avgt 3 2291.194 ± 321.369 us/op
# T14s Gen 1, AMD 8+8 Core, 32 GB memory
# SMT off, Turbo off, Mode powersave
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2447.491 ± 246.614 us/op
T01.split_2 avgt 3 2451.685 ± 120.794 us/op
T01.split_4 avgt 3 2669.147 ± 1487.202 us/op
T01.split_6 avgt 3 3114.441 ± 2684.979 us/op
T01.split_8 avgt 3 3292.129 ± 139.653 us/op
# T14s Gen 1, AMD 8+8 Core, 32 GB memory
# SMT off, Turbo off, Mode performance
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2497.786 ± 230.361 us/op
T01.split_2 avgt 3 2539.525 ± 338.014 us/op
T01.split_4 avgt 3 2539.477 ± 53.316 us/op
T01.split_6 avgt 3 2644.285 ± 239.924 us/op
T01.split_8 avgt 3 2767.139 ± 1536.015 us/op
org.xc.jmh.T01
Diagnose scaling limits
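The allocation data below comes from JMH's GC profiler; the runs were presumably started along these lines:
java -jar target/benchmarks.jar "T01" -prof gc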
# -prof gc, mode performance
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 1318.253 ± 670.168 us/op
T01.split_1:gc.alloc.rate avgt 3 2894.884 ± 1494.720 MB/sec
T01.split_1:gc.alloc.rate.norm avgt 3 4000000.389 ± 0.874 B/op
T01.split_1:gc.count avgt 3 7.000 counts
T01.split_1:gc.time avgt 3 19.000 ms
T01.split_2 avgt 3 1177.087 ± 293.667 us/op
T01.split_2:gc.alloc.rate avgt 3 6479.263 ± 1621.990 MB/sec
T01.split_2:gc.alloc.rate.norm avgt 3 4000000.350 ± 0.558 B/op
T01.split_2:gc.count avgt 3 16.000 counts
T01.split_2:gc.time avgt 3 27.000 ms
T01.split_4 avgt 3 1483.509 ± 553.779 us/op
T01.split_4:gc.alloc.rate avgt 3 10279.160 ± 3869.941 MB/sec
T01.split_4:gc.alloc.rate.norm avgt 3 4000001.102 ± 21.109 B/op
T01.split_4:gc.count avgt 3 26.000 counts
T01.split_4:gc.time avgt 3 35.000 ms
T01.split_6 avgt 3 1992.545 ± 327.078 us/op
T01.split_6:gc.alloc.rate avgt 3 11473.173 ± 1753.518 MB/sec
T01.split_6:gc.alloc.rate.norm avgt 3 4000001.164 ± 18.958 B/op
T01.split_6:gc.count avgt 3 28.000 counts
T01.split_6:gc.time avgt 3 34.000 ms
T01.split_8 avgt 3 2314.763 ± 646.143 us/op
T01.split_8:gc.alloc.rate avgt 3 13216.420 ± 3163.362 MB/sec
T01.split_8:gc.alloc.rate.norm avgt 3 4000000.627 ± 0.149 B/op
T01.split_8:gc.count avgt 3 32.000 counts
T01.split_8:gc.time avgt 3 36.000 ms
# -prof gc, no SMT and turbo, mode performance
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2841.890 ± 219.913 us/op
T01.split_1:gc.alloc.rate avgt 3 1342.067 ± 104.672 MB/sec
T01.split_1:gc.alloc.rate.norm avgt 3 4000000.803 ± 0.460 B/op
T01.split_1:gc.count avgt 3 3.000 counts
T01.split_1:gc.time avgt 3 10.000 ms
T01.split_2 avgt 3 2506.612 ± 42.312 us/op
T01.split_2:gc.alloc.rate avgt 3 3041.421 ± 50.525 MB/sec
T01.split_2:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.949 B/op
T01.split_2:gc.count avgt 3 7.000 counts
T01.split_2:gc.time avgt 3 22.000 ms
T01.split_4 avgt 3 2519.438 ± 251.261 us/op
T01.split_4:gc.alloc.rate avgt 3 6049.532 ± 620.042 MB/sec
T01.split_4:gc.alloc.rate.norm avgt 3 4000001.828 ± 34.873 B/op
T01.split_4:gc.count avgt 3 15.000 counts
T01.split_4:gc.time avgt 3 38.000 ms
T01.split_6 avgt 3 2567.769 ± 106.240 us/op
T01.split_6:gc.alloc.rate avgt 3 8889.642 ± 340.287 MB/sec
T01.split_6:gc.alloc.rate.norm avgt 3 4000000.707 ± 0.080 B/op
T01.split_6:gc.count avgt 3 22.000 counts
T01.split_6:gc.time avgt 3 44.000 ms
T01.split_8 avgt 3 2682.544 ± 246.422 us/op
T01.split_8:gc.alloc.rate avgt 3 11352.457 ± 942.768 MB/sec
T01.split_8:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.277 B/op
T01.split_8:gc.count avgt 3 28.000 counts
T01.split_8:gc.time avgt 3 51.000 ms
org.xc.jmh.T01
Change GC, SerialGC
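Not spelled out on the slides: the collector is presumably switched per run via appended JVM flags, for example:
java -jar target/benchmarks.jar "T01" -prof gc -jvmArgsAppend "-XX:+UseSerialGC"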
# -prof gc, no SMT and turbo, mode performance, -XX:+UseG1GC
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2841.890 ± 219.913 us/op
T01.split_1:gc.alloc.rate avgt 3 1342.067 ± 104.672 MB/sec
T01.split_1:gc.alloc.rate.norm avgt 3 4000000.803 ± 0.460 B/op
T01.split_1:gc.count avgt 3 3.000 counts
T01.split_1:gc.time avgt 3 10.000 ms
T01.split_2 avgt 3 2506.612 ± 42.312 us/op
T01.split_2:gc.alloc.rate avgt 3 3041.421 ± 50.525 MB/sec
T01.split_2:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.949 B/op
T01.split_2:gc.count avgt 3 7.000 counts
T01.split_2:gc.time avgt 3 22.000 ms
T01.split_4 avgt 3 2519.438 ± 251.261 us/op
T01.split_4:gc.alloc.rate avgt 3 6049.532 ± 620.042 MB/sec
T01.split_4:gc.alloc.rate.norm avgt 3 4000001.828 ± 34.873 B/op
T01.split_4:gc.count avgt 3 15.000 counts
T01.split_4:gc.time avgt 3 38.000 ms
T01.split_6 avgt 3 2567.769 ± 106.240 us/op
T01.split_6:gc.alloc.rate avgt 3 8889.642 ± 340.287 MB/sec
T01.split_6:gc.alloc.rate.norm avgt 3 4000000.707 ± 0.080 B/op
T01.split_6:gc.count avgt 3 22.000 counts
T01.split_6:gc.time avgt 3 44.000 ms
T01.split_8 avgt 3 2682.544 ± 246.422 us/op
T01.split_8:gc.alloc.rate avgt 3 11352.457 ± 942.768 MB/sec
T01.split_8:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.277 B/op
T01.split_8:gc.count avgt 3 28.000 counts
T01.split_8:gc.time avgt 3 51.000 ms
# -prof gc, no SMT and turbo, mode performance, -XX:+UseSerialGC
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2302.586 ± 76.403 us/op
T01.split_1:gc.alloc.rate avgt 3 1656.370 ± 55.984 MB/sec
T01.split_1:gc.alloc.rate.norm avgt 3 4000000.626 ± 0.671 B/op
T01.split_1:gc.count avgt 3 9.000 counts
T01.split_1:gc.time avgt 3 50.000 ms
T01.split_2 avgt 3 2581.650 ± 3736.753 us/op
T01.split_2:gc.alloc.rate avgt 3 2965.881 ± 4496.335 MB/sec
T01.split_2:gc.alloc.rate.norm avgt 3 4000000.765 ± 0.180 B/op
T01.split_2:gc.count avgt 3 16.000 counts
T01.split_2:gc.time avgt 3 58.000 ms
T01.split_4 avgt 3 2287.881 ± 42.820 us/op
T01.split_4:gc.alloc.rate avgt 3 6662.656 ± 115.427 MB/sec
T01.split_4:gc.alloc.rate.norm avgt 3 4000001.664 ± 31.651 B/op
T01.split_4:gc.count avgt 3 36.000 counts
T01.split_4:gc.time avgt 3 17.000 ms
T01.split_6 avgt 3 2661.442 ± 48.027 us/op
T01.split_6:gc.alloc.rate avgt 3 8584.677 ± 141.610 MB/sec
T01.split_6:gc.alloc.rate.norm avgt 3 4000000.727 ± 0.016 B/op
T01.split_6:gc.count avgt 3 48.000 counts
T01.split_6:gc.time avgt 3 19.000 ms
T01.split_8 avgt 3 2357.560 ± 118.595 us/op
T01.split_8:gc.alloc.rate avgt 3 12905.849 ± 879.622 MB/sec
T01.split_8:gc.alloc.rate.norm avgt 3 4000000.651 ± 0.250 B/op
T01.split_8:gc.count avgt 3 72.000 counts
T01.split_8:gc.time avgt 3 25.000 ms
org.xc.jmh.T01
Change GC, ZGC
# -prof gc, no SMT and turbo, mode performance, -XX:+UseG1GC
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2841.890 ± 219.913 us/op
T01.split_1:gc.alloc.rate avgt 3 1342.067 ± 104.672 MB/sec
T01.split_1:gc.alloc.rate.norm avgt 3 4000000.803 ± 0.460 B/op
T01.split_1:gc.count avgt 3 3.000 counts
T01.split_1:gc.time avgt 3 10.000 ms
T01.split_2 avgt 3 2506.612 ± 42.312 us/op
T01.split_2:gc.alloc.rate avgt 3 3041.421 ± 50.525 MB/sec
T01.split_2:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.949 B/op
T01.split_2:gc.count avgt 3 7.000 counts
T01.split_2:gc.time avgt 3 22.000 ms
T01.split_4 avgt 3 2519.438 ± 251.261 us/op
T01.split_4:gc.alloc.rate avgt 3 6049.532 ± 620.042 MB/sec
T01.split_4:gc.alloc.rate.norm avgt 3 4000001.828 ± 34.873 B/op
T01.split_4:gc.count avgt 3 15.000 counts
T01.split_4:gc.time avgt 3 38.000 ms
T01.split_6 avgt 3 2567.769 ± 106.240 us/op
T01.split_6:gc.alloc.rate avgt 3 8889.642 ± 340.287 MB/sec
T01.split_6:gc.alloc.rate.norm avgt 3 4000000.707 ± 0.080 B/op
T01.split_6:gc.count avgt 3 22.000 counts
T01.split_6:gc.time avgt 3 44.000 ms
T01.split_8 avgt 3 2682.544 ± 246.422 us/op
T01.split_8:gc.alloc.rate avgt 3 11352.457 ± 942.768 MB/sec
T01.split_8:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.277 B/op
T01.split_8:gc.count avgt 3 28.000 counts
T01.split_8:gc.time avgt 3 51.000 ms
# -prof gc, no SMT and turbo, mode performance, -XX:+UseZGC
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2286.191 ± 13.489 us/op
T01.split_1:gc.alloc.rate avgt 3 1968.161 ± 10.459 MB/sec
T01.split_1:gc.alloc.rate.norm avgt 3 4720000.813 ± 0.010 B/op
T01.split_1:gc.count avgt 3 28.000 counts
T01.split_1:gc.time avgt 3 92.000 ms
T01.split_2 avgt 3 2294.309 ± 87.284 us/op
T01.split_2:gc.alloc.rate avgt 3 3920.457 ± 151.112 MB/sec
T01.split_2:gc.alloc.rate.norm avgt 3 4720000.910 ± 1.376 B/op
T01.split_2:gc.count avgt 3 24.000 counts
T01.split_2:gc.time avgt 3 101.000 ms
T01.split_4 avgt 3 2340.600 ± 164.406 us/op
T01.split_4:gc.alloc.rate avgt 3 7678.836 ± 633.377 MB/sec
T01.split_4:gc.alloc.rate.norm avgt 3 4720002.021 ± 35.588 B/op
T01.split_4:gc.count avgt 3 52.000 counts
T01.split_4:gc.time avgt 3 276.000 ms
T01.split_6 avgt 3 2898.646 ± 51.351 us/op
T01.split_6:gc.alloc.rate avgt 3 9295.258 ± 118.284 MB/sec
T01.split_6:gc.alloc.rate.norm avgt 3 4720001.071 ± 0.240 B/op
T01.split_6:gc.count avgt 3 64.000 counts
T01.split_6:gc.time avgt 3 359.000 ms
T01.split_8 avgt 3 2677.623 ± 994.806 us/op
T01.split_8:gc.alloc.rate avgt 3 13431.418 ± 4719.314 MB/sec
T01.split_8:gc.alloc.rate.norm avgt 3 4720000.959 ± 0.214 B/op
T01.split_8:gc.count avgt 3 124.000 counts
T01.split_8:gc.time avgt 3 912.000 ms
org.xc.jmh.T01
Change GC, ShenandoahGC
# -prof gc, no SMT and turbo, mode performance, -XX:+UseG1GC
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2841.890 ± 219.913 us/op
T01.split_1:gc.alloc.rate avgt 3 1342.067 ± 104.672 MB/sec
T01.split_1:gc.alloc.rate.norm avgt 3 4000000.803 ± 0.460 B/op
T01.split_1:gc.count avgt 3 3.000 counts
T01.split_1:gc.time avgt 3 10.000 ms
T01.split_2 avgt 3 2506.612 ± 42.312 us/op
T01.split_2:gc.alloc.rate avgt 3 3041.421 ± 50.525 MB/sec
T01.split_2:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.949 B/op
T01.split_2:gc.count avgt 3 7.000 counts
T01.split_2:gc.time avgt 3 22.000 ms
T01.split_4 avgt 3 2519.438 ± 251.261 us/op
T01.split_4:gc.alloc.rate avgt 3 6049.532 ± 620.042 MB/sec
T01.split_4:gc.alloc.rate.norm avgt 3 4000001.828 ± 34.873 B/op
T01.split_4:gc.count avgt 3 15.000 counts
T01.split_4:gc.time avgt 3 38.000 ms
T01.split_6 avgt 3 2567.769 ± 106.240 us/op
T01.split_6:gc.alloc.rate avgt 3 8889.642 ± 340.287 MB/sec
T01.split_6:gc.alloc.rate.norm avgt 3 4000000.707 ± 0.080 B/op
T01.split_6:gc.count avgt 3 22.000 counts
T01.split_6:gc.time avgt 3 44.000 ms
T01.split_8 avgt 3 2682.544 ± 246.422 us/op
T01.split_8:gc.alloc.rate avgt 3 11352.457 ± 942.768 MB/sec
T01.split_8:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.277 B/op
T01.split_8:gc.count avgt 3 28.000 counts
T01.split_8:gc.time avgt 3 51.000 ms
# -prof gc, no SMT and turbo, mode performance, -XX:+UseShenandoahGC
Benchmark Mode Cnt Score Error Units
T01.split_1 avgt 3 2744.327 ± 547.522 us/op
T01.split_1:gc.alloc.rate avgt 3 1389.809 ± 274.955 MB/sec
T01.split_1:gc.alloc.rate.norm avgt 3 4000000.746 ± 0.152 B/op
T01.split_1:gc.count avgt 3 12.000 counts
T01.split_1:gc.time avgt 3 25.000 ms
T01.split_2 avgt 3 2557.375 ± 691.621 us/op
T01.split_2:gc.alloc.rate avgt 3 2980.591 ± 806.456 MB/sec
T01.split_2:gc.alloc.rate.norm avgt 3 4000000.761 ± 0.878 B/op
T01.split_2:gc.count avgt 3 20.000 counts
T01.split_2:gc.time avgt 3 45.000 ms
T01.split_4 avgt 3 2490.815 ± 227.054 us/op
T01.split_4:gc.alloc.rate avgt 3 6116.189 ± 530.842 MB/sec
T01.split_4:gc.alloc.rate.norm avgt 3 4000001.807 ± 34.415 B/op
T01.split_4:gc.count avgt 3 44.000 counts
T01.split_4:gc.time avgt 3 123.000 ms
T01.split_6 avgt 3 2540.445 ± 199.454 us/op
T01.split_6:gc.alloc.rate avgt 3 8988.221 ± 879.298 MB/sec
T01.split_6:gc.alloc.rate.norm avgt 3 4000000.704 ± 0.186 B/op
T01.split_6:gc.count avgt 3 66.000 counts
T01.split_6:gc.time avgt 3 176.000 ms
T01.split_8 avgt 3 2621.635 ± 223.107 us/op
T01.split_8:gc.alloc.rate avgt 3 11598.049 ± 728.324 MB/sec
T01.split_8:gc.alloc.rate.norm avgt 3 4000000.721 ± 0.101 B/op
T01.split_8:gc.count avgt 3 80.000 counts
T01.split_8:gc.time avgt 3 235.000 ms
org.xc.jmh.T01
Fix the problem correctly
public long split()
{
    long count = 0;
    for (var s : csv)
    {
        var st = new StringTokenizer(s, ",");
        count += st.countTokens();
    }
    return count;
}
# Default, mode performance, -XX:+UseG1GC
Benchmark Mode Cnt Score Error Units GHz
T02.split_1 avgt 3 433.597 ± 88.340 us/op 4.16
T02.split_2 avgt 3 436.323 ± 37.467 us/op 3.70
T02.split_4 avgt 3 509.709 ± 92.259 us/op 3.10
T02.split_6 avgt 3 594.650 ± 113.787 us/op 2.90
T02.split_8 avgt 3 666.666 ± 25.594 us/op 2.60
# No SMT, no turbo, mode performance, -XX:+UseG1GC
Benchmark Mode Cnt Score Error Units GHz
T02.split_1 avgt 3 1025.864 ± 11.773 us/op 1.70
T02.split_2 avgt 3 1024.357 ± 56.503 us/op 1.70
T02.split_4 avgt 3 1015.246 ± 26.788 us/op 1.70
T02.split_6 avgt 3 1014.575 ± 65.084 us/op 1.70
T02.split_8 avgt 3 1036.651 ± 23.975 us/op 1.70
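The GHz column explains the flat second table: the governor runs in performance mode in both cases, but with SMT and turbo boost additionally switched off, every thread count executes at the same base frequency of 1.70 GHz. On a Linux machine using intel_pstate, a setup roughly like this is typically applied (assumed commands, root required, paths differ per distribution and CPU):
$ echo off | sudo tee /sys/devices/system/cpu/smt/control
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
$ sudo cpupower frequency-set -g performance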
org.xc.jmh.T02
But what about StringTokenizer?
Benchmark Mode Cnt Score Error Units
T02.split_1 avgt 3 438.617 ± 52.709 us/op
T02.split_1:gc.alloc.rate avgt 3 ≈ 10⁻⁴ MB/sec
T02.split_1:gc.alloc.rate.norm avgt 3 0.117 ± 0.080 B/op
T02.split_1:gc.count avgt 3 ≈ 0 counts
T02.split_2 avgt 3 440.256 ± 84.787 us/op
T02.split_2:gc.alloc.rate avgt 3 0.001 ± 0.001 MB/sec
T02.split_2:gc.alloc.rate.norm avgt 3 0.127 ± 0.036 B/op
T02.split_2:gc.count avgt 3 ≈ 0 counts
T02.split_4 avgt 3 504.023 ± 29.002 us/op
T02.split_4:gc.alloc.rate avgt 3 0.003 ± 0.053 MB/sec
T02.split_4:gc.alloc.rate.norm avgt 3 0.366 ± 7.046 B/op
T02.split_4:gc.count avgt 3 ≈ 0 counts
T02.split_6 avgt 3 598.367 ± 32.953 us/op
T02.split_6:gc.alloc.rate avgt 3 0.002 ± 0.001 MB/sec
T02.split_6:gc.alloc.rate.norm avgt 3 0.166 ± 0.018 B/op
T02.split_6:gc.count avgt 3 ≈ 0 counts
T02.split_8 avgt 3 643.071 ± 20.133 us/op
T02.split_8:gc.alloc.rate avgt 3 0.002 ± 0.001 MB/sec
T02.split_8:gc.alloc.rate.norm avgt 3 0.178 ± 0.044 B/op
T02.split_8:gc.count avgt 3 ≈ 0 counts
StringTokenizer does not make it beyond the method split()
StringTokenizer stays local as well?!
org.xc.jmh.T02
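The near-zero allocation rate points at escape analysis: the StringTokenizer never leaves split(), so the JIT can scalar-replace it. One way to test that theory (a sketch, not part of the original benchmark class) is to add a second fork of the same method with escape analysis disabled and watch -prof gc report the allocation again:
// hypothetical verification variant: identical body to split(), but the
// fork runs with escape analysis off, so the StringTokenizer allocation
// should reappear in gc.alloc.rate.norm
@Benchmark
@Fork(jvmArgsAppend = "-XX:-DoEscapeAnalysis")
public long splitNoEscapeAnalysis()
{
    long count = 0;
    for (var s : csv)
    {
        var st = new StringTokenizer(s, ",");
        count += st.countTokens();
    }
    return count;
}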
No two CPUs are alike
We have not checked our Strings yet
public class Strings1 {
    private static final String PREFIX = "a";
    private String foo = null;
    private int LENGTH;

    @Setup
    public void setup() {
        //    13 chars                                      18 chars
        foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
        LENGTH = plain().length();
    }

    @Benchmark
    public String plain() {
        return PREFIX + foo;
    }

    @Benchmark
    public String concat() {
        return PREFIX.concat(foo);
    }

    @Benchmark
    public String builder() {
        return new StringBuilder().append(PREFIX).append(foo).toString();
    }

    @Benchmark
    public String builderSized() {
        return new StringBuilder(LENGTH).append(PREFIX).append(foo).toString();
    }

    @Benchmark
    public String builderNonFluid() {
        final StringBuilder sb = new StringBuilder();
        sb.append(PREFIX);
        sb.append(foo);
        return sb.toString();
    }

    @Benchmark
    public String builderNonFluidSized() {
        final StringBuilder sb = new StringBuilder(LENGTH);
        sb.append(PREFIX);
        sb.append(foo);
        return sb.toString();
    }
}
# java -jar target/benchmarks.jar "Strings1" -prof gc
Benchmark Mode Cnt Score Error Units
builder avgt 3 8.354 ± 2.698 ns/op
builder:gc.alloc.rate.norm avgt 3 72.000 ± 0.001 B/op
builderNonFluid avgt 3 21.585 ± 21.029 ns/op
builderNonFluid:gc.alloc.rate.norm avgt 3 160.000 ± 0.001 B/op
builderNonFluidSized avgt 3 16.595 ± 11.674 ns/op
builderNonFluidSized:gc.alloc.rate.norm avgt 3 144.000 ± 0.001 B/op
builderSized avgt 3 8.711 ± 5.998 ns/op
builderSized:gc.alloc.rate.norm avgt 3 72.000 ± 0.001 B/op
concat avgt 3 9.710 ± 6.780 ns/op
concat:gc.alloc.rate.norm avgt 3 72.000 ± 0.001 B/op
plain avgt 3 9.619 ± 3.065 ns/op
plain:gc.alloc.rate.norm avgt 3 72.000 ± 0.001 B/op
org.xc.jmh.Strings1
Data, it is about data!!!
public class Strings2 {
    private String[] foo = new String[5];
    private Random r = new Random(1L);

    @Setup
    public void setup() {
        foo[0] = String.valueOf(r.nextInt()) + "ashdfhgas dfhsa df";
        foo[1] = String.valueOf(r.nextInt()) + "hsa df";
        foo[2] = String.valueOf(r.nextInt()) + "fhsa df";
        foo[3] = String.valueOf(r.nextInt()) + "as dfhsa df";
        foo[4] = String.valueOf(r.nextInt())
            + "ashdfhgas dfhsa hsajdhf kjahskjdh fkjhsa dhfkjskhj hdf"
            + "ashdfhgas dfhsa hsajdhf kjahs as87837 837 987kjskhj hdf";
    }

    public String foo() {
        return foo[r.nextInt(5)];
    }

    @Benchmark
    public String plain() {
        return "a" + foo();
    }

    @Benchmark
    public String concat() {
        return "a".concat(foo());
    }

    @Benchmark
    public String builder() {
        return new StringBuilder().append("a").append(foo()).toString();
    }

    ...
}
# java -jar target/benchmarks.jar "Strings2" -prof gc
Benchmark Mode Cnt Score Error Units
builder avgt 2 39.149 ns/op
builder:gc.alloc.rate.norm avgt 2 176.010 B/op
builderNonFluid avgt 2 37.518 ns/op
builderNonFluid:gc.alloc.rate.norm avgt 2 176.010 B/op
builderNonFluidSized avgt 2 29.788 ns/op
builderNonFluidSized:gc.alloc.rate.norm avgt 2 227.205 B/op
builderSized avgt 2 32.169 ns/op
builderSized:gc.alloc.rate.norm avgt 2 227.204 B/op
concat avgt 2 24.675 ns/op
concat:gc.alloc.rate.norm avgt 2 83.202 B/op
plain avgt 2 24.212 ns/op
plain:gc.alloc.rate.norm avgt 2 83.201 B/op
org.xc.jmh.Strings2
Let's Tackle the Problem
What is your motivation?
If you are an OpenJDK contributor, feel free to ignore the above ;)
No idea? No benchmark!
Most performance problems are no longer about CPU usage, but rather about not using the CPU.
Build your code
Stop thinking about the CPU!
Shape the future outcome
Data creation and mutation can be challenging
Wrong conclusions can be drawn in seconds with this trick
Functional correctness before performance
No unit tests, no microbenchmark!
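For the StringTokenizer rewrite shown earlier this is more than a formality: String.split and StringTokenizer disagree as soon as empty tokens appear, so the expected behavior should be pinned down in a test before any timing starts. A minimal sketch (hypothetical JUnit 5 test, not part of the repo):
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.StringTokenizer;

import org.junit.jupiter.api.Test;

class SplitEquivalenceTest
{
    @Test
    void tokenizerCountMatchesSplitForDenseCsv()
    {
        // holds only for lines without empty fields;
        // "a,,b" would give 3 (split) vs 2 (tokenizer)
        var line = "a,b,c,dd,eee";
        assertEquals(line.split(",").length,
                     new StringTokenizer(line, ",").countTokens());
    }
}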
Never believe in your first results
Vary a little to feel confident (see the @Param sketch below)
Luke, minimal disturbance to the force you must apply.
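Varying does not require writing new benchmark classes: a @Param on the state lets JMH rerun the same method over several inputs. A minimal sketch (class name, sizes, and data are assumptions, not from the repo):
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class VaryInput
{
    // JMH runs the benchmark once per parameter value
    @Param({"10", "100", "1000"})
    int lines;

    String[] csv;

    @Setup
    public void setup()
    {
        var r = new Random(42L);
        csv = new String[lines];
        for (int i = 0; i < lines; i++)
        {
            csv[i] = r.nextInt() + "," + r.nextInt() + "," + r.nextInt();
        }
    }

    @Benchmark
    public long split()
    {
        long count = 0;
        for (var s : csv)
        {
            count += s.split(",").length;
        }
        return count;
    }
}
If a conclusion only holds for one of the sizes, it was never a conclusion.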
Use the power of JMH
-prof gc
-prof perf/perfnorm
-prof perfasm
-prof comp
-prof cl
-prof jfr
-prof async
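Profilers are switched on from the command line and can be combined; invocations in the style of the earlier runs (benchmark name assumed) look like this:
$ java -jar target/benchmarks.jar "T02" -prof gc -prof perfnorm
$ java -jar target/benchmarks.jar "T02" -prof perfasm -f 1
Note that -prof perfasm additionally needs the hsdis disassembler on the JVM's library path, and the perf-based profilers only work on Linux.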
You might have to google a lot.
Ask someone, but not ChatGPT
One shall not claim success before someone else has said so.
Remember our experiments and more
A Summary of Things
Microbenchmarking is...
When and when not to do microbenchmarks
The JMH warning
REMEMBER: The numbers are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Just because you have seen it on my slides does not make it right!