The Art of Realizing One is Wrong
@ReneSchwietzke
@reneschwietzke@foojay.social
#java #qa #test #performance #performancetest #quality
Why you should stick around
This is not a JMH training!
Photos from Pexel, free to use, no attribution needed, CC0 or Public Domain
What is my motivation?
If you want to drive fast, you have to understand how a car works. Some physics knowledge won't hurt either.
Definitions and Guesses
A definition first
To study (something, such as a competitor's product or business practices) in order to improve the performance of one's own company
A point of reference from which measurements may be made
A standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)
Source: Mirriam Webster
What is this micro thing and why is it micro?
A simplified benchmark that tests components in isolation.
It simplifies the process by abstracting the problem, removing complexity, and improving debuggability.
Components are usually quite small. Compare that to the unit test idea!
Source: René Schwietzke
What might be our goals?
Side effects of simplification
Some Experiments First
Always use System.arrayCopy
over manual copying
public static void main(String[] args)
{
var length = 255;
var src = new byte[length];
var dest = new byte[length];
var s1 = System.nanoTime();
System.arraycopy(src, 0, dest, 0, length);
var e1 = System.nanoTime();
var s2 = System.nanoTime();
for (int i = 0; i < length; i++)
{
dest[i] = src[i];
}
var e2 = System.nanoTime();
System.out.printf("System Copy: %d ns%n", e1 - s1);
System.out.printf("Manual Copy: %d ns%n", e2 - s2);
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2685 ns
Manual Copy: 5781 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2565 ns
Manual Copy: 4609 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2545 ns
Manual Copy: 4589 ns
Use a sized StringBuilder
all the time
public static void main(String[] args)
{
final String a = "This is a test";
final String b = "More fun with a string";
final long c = System.currentTimeMillis(); // e.g. 1684581249117
var l = a.length() + b.length() + 13;
var s3 = System.nanoTime();
{
var r = a + b + c;
}
var e3 = System.nanoTime();
var s1 = System.nanoTime();
{
var sb = new StringBuilder();
sb.append(a); sb.append(b); sb.append(c);
var r = sb.toString();
}
var e1 = System.nanoTime();
var s2 = System.nanoTime();
{
var sb = new StringBuilder(l);
sb.append(a); sb.append(b); sb.append(c);
var r = sb.toString();
}
var e2 = System.nanoTime();
System.out.printf("Pure String usage : %d ns%n", e3 - s3);
System.out.printf("StringBuilder : %d ns%n", e1 - s1);
System.out.printf("StringBuilder size: %d ns%n", e2 - s2);
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 44441844 ns
StringBuilder : 93092 ns
StringBuilder size: 7123 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 38471326 ns
StringBuilder : 74865 ns
StringBuilder size: 6076 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 37914191 ns
StringBuilder : 72841 ns
StringBuilder size: 6146 ns
Streams are slow and parallel()
is always better
public static void main(String[] args)
{
var data = new int[100_000];
for (int i = 0; i < data.length; i++)
{
data[i] = i;
}
// manual
var s1 = System.nanoTime();
{
var total = 0.0d;
for (int i = 0; i < data.length; i++)
{
total += data[i];
}
var r = total / data.length;
}
var e1 = System.nanoTime();
// stream
var s2 = System.nanoTime();
{
var r = Arrays.stream(data).average();
}
var e2 = System.nanoTime();
// stream parallel
var s3 = System.nanoTime();
{
var r = Arrays.stream(data).parallel().average();
}
var e3 = System.nanoTime();
...
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 1,595,527 ns
Stream : 29,082,919 ns
Parallel stream: 21,997,029 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 1,597,572 ns
Stream : 29,913,900 ns
Parallel stream: 23,778,407 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 2,008,348 ns
Stream : 30,491,623 ns
Parallel stream: 22,356,628 ns
Forget everything you just saw
What do we really want?
Our true goal: We want to learn and understand by changing details under controlled conditions. Turn a knob a bit and see what happens. We want to derive reasoning or even proof from our changes.
Don't ask what you can do, ask others what they have done
JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.
Some Factors by Example
We don't have the time...
https://reneschwietzke.de/java/benchmarking-a-number-journey-int-to-string.html
Your clock is... code
System.nanoTime()
and System.currentTimeMillis()
We can measure nanoseconds, can't we?
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(2)
@State(Scope.Thread)
public class TimerResolution
{
private long lastValue;
@Benchmark
public long latency_nanotime()
{
return System.nanoTime();
}
public static double precision(long[] t)
{
// store samples
for (int i = 0; i < t.length; i++)
{
t[i] = System.nanoTime();
}
}
}
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 25.519 ± 0.985 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, TSC
372942679845594
372942679845635
372942679845665
372942679845705
372942679845745
372942679845775
Difference between calls 30 to 40 ns
Source: https://github.com/shipilev/timers-bench
Not every clock is created equal
x86 Time Stamp Counter (TSC), a low-latency monotonically increasing clock, to measure event duration
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 1,128.204 ± 32.831 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, HPET
374172181482681
374172181484078
374172181484986
374172181485964
374172181487291
374172181488199
Difference between calls 900 to 1,400 ns
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 25.519 ± 0.985 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, TSC
495719354555374
495719354555394
495719354555424
495719354555444
495719354555464
495719354555484
Difference between calls 20 to 30 ns
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource
Java evolves and so does performance
public class Example1ArrayCopying {
@Param({"1000"}) int SIZE;
byte[] src;
@Setup
public void setup() {
src = new byte[SIZE];
final Random r = new Random(7L);
r.nextBytes(src);
}
@Benchmark
public byte[] systemCopy() {
var dest = new byte[src.length];
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
@Benchmark
public byte[] arrayCopy() {
return Arrays.copyOf(src, src.length);
}
@Benchmark
public byte[] manualCopy() {
var dest = new byte[src.length];
for (int i = 0; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
# JDK 11
Benchmark (SIZE) Mode Cnt Score Error Units
Example1ArrayCopying.manualCopy 1000 avgt 3 152.889 ± 32.305 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 130.963 ± 119.643 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 128.726 ± 83.837 ns/op
# JDK 17
Example1ArrayCopying.manualCopy 1000 avgt 3 153.508 ± 7.987 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 124.021 ± 35.199 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 121.271 ± 35.723 ns/op
# JDK 19
Example1ArrayCopying.manualCopy 1000 avgt 3 149.298 ± 79.705 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 129.439 ± 104.106 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 133.358 ± 40.304 ns/op
# JDK 20
Example1ArrayCopying.manualCopy 1000 avgt 3 152.070 ± 7.096 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 119.613 ± 9.873 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 117.901 ± 8.322 ns/op
# Graal 22.3.r17 - Whoops!
Example1ArrayCopying.manualCopy 1000 avgt 3 335.612 ± 22.000 ns/op <<<
Example1ArrayCopying.systemCopy 1000 avgt 3 150.035 ± 4.466 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 149.176 ± 4.496 ns/op
When, where, and how often is it called - the JIT at work
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class Streams {
final int SIZE = 1024;
final int[] integers = new int[SIZE];
@Setup
public void setup() {
for (int i = 0; i < integers.length; i++) {
integers[i] = i - SIZE;
}
}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayWarm() {...}
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayHot() {...}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v % 2 == 0) {
sum += v;
}
}
return sum;
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int loopWarm() {...}
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int loopHot() {...}
}
Benchmark Mode Cnt Score Error Units
Streams.lambdaArrayCold avgt 5 688,797 ± 4,665,070 ns/op
Streams.lambdaArrayWarm avgt 5 12,797 ± 59,453 ns/op
Streams.lambdaArrayHot avgt 5 1,113 ± 2,602 ns/op
Streams.loopCold avgt 5 52,936 ± 42,276 ns/op
Streams.loopWarm avgt 5 2,030 ± 2,527 ns/op
Streams.loopHot avgt 5 1,422 ± 733 ns/op
Streams.lambdaArrayCold
Iteration 1: 2,855,843.000 ns/op
Iteration 2: 172,353.000 ns/op
Iteration 3: 139,810.000 ns/op
Iteration 4: 139,670.000 ns/op
Iteration 5: 136,313.000 ns/op
Streams.lambdaArrayWarm
Iteration 1: 40,097.000 ns/op
Iteration 2: 4,712.000 ns/op
Iteration 3: 9,992.667 ns/op
Iteration 4: 4,145.143 ns/op
Iteration 5: 5,039.667 ns/op
Streams.loopCold
Iteration 1: 44,510.286 ns/op
Iteration 2: 61,584.000 ns/op
Iteration 3: 67,831.000 ns/op
Iteration 4: 45,749.000 ns/op
Iteration 5: 45,007.000 ns/op
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class HeavyStreams {
// as seen before
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays.stream(integers)
.filter(i -> i.mod(BigInteger.TWO).equals(BigInteger.ZERO))
.map(i -> i.sqrt())
.mapToInt(BigInteger::intValue)
.sum();
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayWarm() { ... }
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayHot() { ... }
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v.mod(BigInteger.TWO).equals(BigInteger.ZERO)) {
sum += v.sqrt().intValue();
}
}
return sum;
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int loopWarm() { ... }
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int loopHot() { ... }
}
# ms now!!!
Benchmark Mode Cnt Score Error Units
HeavyStreams.lambdaArrayCold avgt 5 362 ± 113 ms/op
HeavyStreams.lambdaArrayWarm avgt 5 297 ± 177 ms/op
HeavyStreams.lambdaArrayHot avgt 5 264 ± 12 ms/op
HeavyStreams.loopCold avgt 5 363 ± 100 ms/op
HeavyStreams.loopWarm avgt 5 302 ± 183 ms/op
HeavyStreams.loopHot avgt 5 271 ± 16 ms/op
HeavyStreams.lambdaArrayCold
Iteration 1: 409 ms/op
Iteration 2: 372 ms/op
Iteration 3: 347 ms/op
Iteration 4: 342 ms/op
Iteration 5: 339 ms/op
HeavyStreams.lambdaArrayWarm
Iteration 1: 263 ms/op
Iteration 2: 270 ms/op
Iteration 3: 264 ms/op
Iteration 4: 263 ms/op
Iteration 5: 263 ms/op
HeavyStreams.loopCold
Iteration 1: 401 ms/op
Iteration 2: 376 ms/op
Iteration 3: 352 ms/op
Iteration 4: 342 ms/op
Iteration 5: 340 ms/op
It's dead, Jim!
public class DeadCode {
private double computeSideEffect(int n) {
return BigInteger.valueOf(n).pow(7).sqrt().doubleValue();
}
private double fib(int n) {
int fib = 1;
int prevFib = 1;
for (int i = 2; i < n; i++) {
int temp = fib;
fib += prevFib;
prevFib = temp;
}
return fib;
}
@Benchmark
public double baseline() {
return 5.0d;
}
@Benchmark
public void computeWrongSE() {
computeSideEffect(2000);
}
@Benchmark
public double computeRightSE() {
return computeSideEffect(2000);
}
@Benchmark
public void computeWrong() {
fib(2000);
}
@Benchmark
public double computeRight() {
return fib(2000);
}
}
Benchmark Mode Cnt Score Error Units
DeadCode.baseline avgt 3 0.4 ± 0.1 ns/op
DeadCode.computeWrong avgt 3 16.1 ± 1.6 ns/op
DeadCode.computeRight avgt 3 479.2 ± 20.2 ns/op
DeadCode.computeWrongSE avgt 3 1616.7 ± 434.4 ns/op
DeadCode.computeRightSE avgt 3 1713.4 ± 876.9 ns/op
computeWrong
is not really droppedcomputeWrong
is changing based on the parametercomputeWrong fib(1000)
yields 8.8 ns - ???computeWrong fib(100)
yields 0.4 ns - ???
https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_08_DeadCode.java
https://self-learning-java-tutorial.blogspot.com/2022/08/jmh-be-cautious-with-dead-code.html
public class CachingEffects {
@Param({"true", "false"})
boolean live;
int SIZE = 10000;
int LOAD = 1000000;
List<String> work;
List<String> load;
@Setup
public void setup() {
load = new ArrayList<>(LOAD);
var s = new StringBuilder("ß09876512[more here]");
s.append(System.currentTimeMillis());
s.append('$');
for (int i = 0; i < LOAD; i++) {
// align the copies next to each other
load.add(s.toString());
}
work = new ArrayList<>(SIZE);
if (live) {
// shuffle the data, with fixed outcome
Collections.shuffle(load, new Random(42L));
}
// copy us the data, when live, we get a random pattern, when
// not live, we get basically data that is laid out next to each other
for (int i = 0; i < SIZE; i++) {
work.add(load.get(i));
}
}
...
}
public class CachingEffects
{
...
/**
* Manually search for a character pos
*/
@Benchmark
public int manual() {
var n = 0;
for (int i = 0; i < work.size(); i++) {
var s = work.get(i);
for (int index = 0; index < s.length(); index++) {
if (s.charAt(index) == '$') {
n += index;
break;
}
}
}
return n;
}
/**
* Search via indexOf for the char position
*/
@Benchmark
public int indexof() {
var n = 0;
for (int i = 0; i < work.size(); i++) {
n += work.get(i).indexOf('$');
}
return n;
}
}
The hidden memory that speeds things up
# JDK 11.0.17 - Aligned Data
Benchmark (live) Mode Cnt Score Error Units Diff
CachingEffects.indexof false avgt 3 435,184 ± 54,444 ns/op
CachingEffects.manual false avgt 3 318,798 ± 166,076 ns/op 73%
# JDK 11.0.17 - More Random Data
Benchmark (live) Mode Cnt Score Error Units Diff
CachingEffects.indexof true avgt 3 1,420,598 ± 549,023 ns/op
CachingEffects.manual true avgt 3 1,389,879 ± 342,328 ns/op 98%
# JDK 11.0.17 - Single Executions - TRIGGERED by ERROR VALUE
Benchmark (live) Mode Cnt Score Units
CachingEffects.indexof false avgt 1 562,553 ns/op
CachingEffects.indexof false avgt 1 456,493 ns/op
CachingEffects.indexof false avgt 1 469,517 ns/op
CachingEffects.manual false avgt 1 313,673 ns/op
CachingEffects.manual false avgt 1 677,170 ns/op
CachingEffects.manual false avgt 1 309,094 ns/op
CachingEffects.indexof true avgt 1 1,637,546 ns/op
CachingEffects.indexof true avgt 1 2,477,774 ns/op
CachingEffects.indexof true avgt 1 2,604,464 ns/op
CachingEffects.manual true avgt 1 1,560,230 ns/op
CachingEffects.manual true avgt 1 2,067,318 ns/op
CachingEffects.manual true avgt 1 1,888,360 ns/op
CPU design, usage, and speed make things go fuzzy
# Laptop Busy, JDK 11
Benchmark (live) Mode Cnt Score Units Code Cache
CachingEffects.indexof true avgt 3 1,572,759 ns/op
CachingEffects.manual true avgt 3 1,413,919 ns/op 89%
CachingEffects.indexof false avgt 3 377,649 ns/op 24%
CachingEffects.manual false avgt 3 294,783 ns/op 78% 21%
# Laptop Less Busy, JDK 11
CachingEffects.indexof true avgt 3 1,263,913 ns/op
CachingEffects.manual true avgt 3 1,150,578 ns/op 91%
CachingEffects.indexof false avgt 3 422,753 ns/op 33%
CachingEffects.manual false avgt 3 301,250 ns/op 71% 26%
# Turbo off, JDK 11
CachingEffects.indexof true avgt 3 2,191,820 ns/op
CachingEffects.manual true avgt 3 1,928,017 ns/op 88%
CachingEffects.indexof false avgt 3 858,436 ns/op 39%
CachingEffects.manual false avgt 3 687,400 ns/op 80% 36%
# Google c2-standard-8, Intel, JDK 11
CachingEffects.indexof true avgt 3 1,064,037 ns/op
CachingEffects.manual true avgt 3 908,363 ns/op 85%
CachingEffects.indexof false avgt 3 509,849 ns/op 47%
CachingEffects.manual false avgt 3 476,487 ns/op 93% 52%
# Google c2d-standard-8, AMDs, JDK 11
CachingEffects.indexof true avgt 3 707,885 ns/op
CachingEffects.manual true avgt 3 599,987 ns/op 84%
CachingEffects.indexof false avgt 3 375,242 ns/op 53%
CachingEffects.manual false avgt 3 261,686 ns/op 70% 44%
# Google c3-highcpu-8, Intel, JDK 11
CachingEffects.indexof true avgt 3 1,122,492 ns/op
CachingEffects.manual true avgt 3 1,075,496 ns/op 96%
CachingEffects.indexof false avgt 3 357,894 ns/op 32%
CachingEffects.manual false avgt 3 303,531 ns/op 85% 28%
The unknown fortune teller
https://shipilev.net/jvm/anatomy-quarks/28-frequency-based-code-layout/
public class BranchPrediction {
private static final int COUNT = 10000;
private int[] sorted = new int[COUNT];
private int[] unsorted = new int[COUNT];
private int[] pattern2 = new int[COUNT];
private int[] pattern5 = new int[COUNT];
private int[] shuffledPattern = new int[COUNT];
@Setup
public void setup() {
final Random r = new Random(13241L);
// using int here to avoid problems with memory and caches aka
// only inline data and hence same caching behavior
for (int i = 0; i < COUNT; i++) {
final int d = r.nextInt(100_000_000) + 1;
sorted[i] = d; unsorted[i] = d;
pattern2[i] = i % 2 == 0 ? 100_000_000 : r.nextInt(50_000_000);
pattern5[i] = i % 5 == 0 ? 100_000_000 : r.nextInt(50_000_000);
shuffledPattern[i] = pattern2[i];
}
Arrays.sort(sorted);
// make it noisy to break the branch predictor
for (int i = 0; i < shuffledPattern.length; i++) {
// shuffle here
}
}
public long doIt(int[] array) {
long sum = 2;
for (int v : array) {
if (v > 50_000_000) {
sum += (v >> 2);
}
else {
sum += (v << 2);
}
}
return sum;
}
...
}
Your data sorting becomes the driver of the results
# JDK 11
Benchmark Mode C Score Error Units
BranchPrediction.unsorted avgt 4 14,424 ± 1,235 ns/op
BranchPrediction.sorted avgt 4 4,709 ± 40 ns/op
BranchPrediction.pattern2 avgt 4 4,943 ± 14 ns/op
BranchPrediction.pattern5 avgt 4 4,642 ± 501 ns/op
BranchPrediction.shuffledPattern avgt 4 14,209 ± 833 ns/op
# JDK 17
BranchPrediction.unsorted avgt 4 14,285 ± 1,940 ns/op
BranchPrediction.sorted avgt 4 4,858 ± 722 ns/op
BranchPrediction.pattern2 avgt 4 4,353 ± 51 ns/op
BranchPrediction.pattern5 avgt 4 4,744 ± 347 ns/op
BranchPrediction.shuffledPattern avgt 4 14,132 ± 2,155 ns/op
# JDK 11, -XX:-BlockLayoutByFrequency
Benchmark Mode C Score Error Units
BranchPrediction.unsorted avgt 4 9,800 ± 327 ns/op
BranchPrediction.sorted avgt 4 4,791 ± 341 ns/op
BranchPrediction.pattern2 avgt 4 5,509 ± 57 ns/op
BranchPrediction.pattern5 avgt 4 5,802 ± 454 ns/op
BranchPrediction.shuffledPattern avgt 4 9,865 ± 292 ns/op
# JDK 17, -XX:-BlockLayoutByFrequency
BranchPrediction.unsorted avgt 4 9,304 ± 224 ns/op
BranchPrediction.sorted avgt 4 4,239 ± 227 ns/op
BranchPrediction.pattern2 avgt 4 5,530 ± 187 ns/op
BranchPrediction.pattern5 avgt 4 4,182 ± 816 ns/op
BranchPrediction.shuffledPattern avgt 4 9,085 ± 473 ns/op
Static and little data is not enough
String
has a length of just 42 characterscom.devoxxpl23.HashingDataDimensions
public class HashingDataDimensions {
@Param({"10", "50", "100", "500", "1000", "2000"}) int CHARARRAY_SIZE;
@Param({"true", "false"}) boolean RANDOM;
final static int MAX = 100_000;
List data, holder;
@Setup
public void setup() {
data = new ArrayList(MAX);
holder = new ArrayList(MAX);
for (int i = 0; i < MAX; i++) {
// the data to work with
var a = random(CHARARRAY_SIZE);
data.add(a);
holder.add(a);
if (!RANDOM) {
// same data in our data array all the time, still a miss access when accessing it
// so the RANDOM test has the same code path, just ends up on the same data
data.set(data.size() - 1, data.get(0));
}
}
}
@Benchmark
public void traditionalHashcode(Blackhole b) {
for (int i = 0; i < data.size(); i++) {
b.consume(hashCodeClassic(data.get(r.nextInt(data.size()))));
}
}
...
public static int hashCodeClassic(char[] src) {
final int last = src.length;
int h = 0;
for (int i = 0; i < last; i++) {
h = 31 * h + src[i];
}
return h;
}
public static int hashCodeVectoredJDK19(char[] src) {
int h = 0;
int i = 0;
int l, l2;
l = l2 = src.length;
l = l & ~(8 - 1);
for (; i < l; i += 8) {
h = -1807454463 * h +
1742810335 * src[i + 0] +
887503681 * src[i + 1] +
28629151 * src[i + 2] +
923521 * src[i + 3] +
29791 * src[i + 4] +
961 * src[i + 5] +
31 * src[i + 6] +
1 * src[i + 7];
}
for (; i < l2; i++) {
h = 31 * h + src[i];
}
return h;
}
public static int hashCodeNoMul(char[] src) {
final int last = src.length;
int h = 0;
for (int i = 0; i < last; i++) {
h = ((h << 5) - h) + src[i];
}
return h;
}
}
Benchmark (CHARARRAY_SIZE) (RANDOM) Score Error Units %
traditionalHashcode 10 true 2,354 ± 98 us/op
traditionalHashcode 10 false 1,424 ± 18 us/op 60%
traditionalHashcode 50 true 6,905 ± 155 us/op
traditionalHashcode 50 false 4,255 ± 45 us/op 61%
traditionalHashcode 100 true 12,590 ± 5650 us/op
traditionalHashcode 100 false 8,651 ± 106 us/op 68%
traditionalHashcode 500 true 58,176 ± 14771 us/op
traditionalHashcode 500 false 43,322 ± 1022 us/op 74%
traditionalHashcode 1000 true 103,746 ± 7203 us/op
traditionalHashcode 1000 false 86,582 ± 842 us/op 83%
traditionalHashcode 2000 true 194,827 ± 11203 us/op
traditionalHashcode 2000 false 173,190 ± 1040 us/op 88%
vectoredHashcode 10 true 2,701 ± 45 us/op
vectoredHashcode 10 false 1,682 ± 20 us/op 62%
vectoredHashcode 50 true 5,893 ± 798 us/op
vectoredHashcode 50 false 2,919 ± 101 us/op 50%
vectoredHashcode 100 true 8,593 ± 1135 us/op
vectoredHashcode 100 false 4,735 ± 185 us/op 55%
vectoredHashcode 500 true 40,104 ± 13728 us/op
vectoredHashcode 500 false 20,698 ± 157 us/op 51%
vectoredHashcode 1000 true 63,079 ± 6775 us/op
vectoredHashcode 1000 false 40,718 ± 836 us/op 64%
vectoredHashcode 2000 true 106,395 ± 2552 us/op
vectoredHashcode 2000 false 80,286 ± 301 us/op 75%
Benchmark (CHARARRAY_SIZE) (RANDOM) Score Error Units %
traditionalHashcode 10 true 2,354 ± 98 us/op
vectoredHashcode 10 true 2,701 ± 45 us/op 114%
traditionalHashcode 50 true 6,905 ± 155 us/op
vectoredHashcode 50 true 5,893 ± 798 us/op 85%
traditionalHashcode 100 true 12,590 ± 5650 us/op
vectoredHashcode 100 true 8,593 ± 1135 us/op 68%
traditionalHashcode 500 true 58,176 ± 14771 us/op
vectoredHashcode 500 true 40,104 ± 13728 us/op 69%
traditionalHashcode 1000 true 103,746 ± 7203 us/op
vectoredHashcode 1000 true 63,079 ± 6775 us/op 61%
traditionalHashcode 2000 true 194,827 ± 11203 us/op
vectoredHashcode 2000 true 106,395 ± 2552 us/op 55%
traditionalHashcode 10 false 1,424 ± 18 us/op
vectoredHashcode 10 false 1,682 ± 20 us/op 118%
traditionalHashcode 50 false 4,255 ± 45 us/op
vectoredHashcode 50 false 2,919 ± 101 us/op 69%
traditionalHashcode 100 false 8,651 ± 106 us/op
vectoredHashcode 100 false 4,735 ± 185 us/op 55%
traditionalHashcode 500 false 43,322 ± 1022 us/op
vectoredHashcode 500 false 20,698 ± 157 us/op 49%
traditionalHashcode 1000 false 86,582 ± 842 us/op
vectoredHashcode 1000 false 40,718 ± 836 us/op 47%
traditionalHashcode 2000 false 173,190 ± 1040 us/op
vectoredHashcode 2000 false 80,286 ± 301 us/op 46%
Quietly removed the bit-shifting version :)
Let's revisit the JDK version
# JDK 11.0.16, T14s AMD Gen 1
Benchmark (live) Mode Cnt Score Units Code Cache
CachingEffects.indexof true avgt 3 1,263,913 ns/op
CachingEffects.manual true avgt 3 1,150,578 ns/op 91%
CachingEffects.indexof false avgt 3 422,753 ns/op 33%
CachingEffects.manual false avgt 3 301,250 ns/op 71% 26%
# JDK 11.0.16, c3-highcpu-8
CachingEffects.indexof true avgt 3 1,122,492 ns/op
CachingEffects.manual true avgt 3 1,075,496 ns/op 95%
CachingEffects.indexof false avgt 3 357,894 ns/op 32%
CachingEffects.manual false avgt 3 303,531 ns/op 85% 28%
# JDK 17.0.6, T14s AMD Gen 1
CachingEffects.indexof true avgt 3 403,302 ns/op
CachingEffects.manual true avgt 3 1,160,649 ns/op 287%
CachingEffects.indexof false avgt 3 53,984 ns/op 13%
CachingEffects.manual false avgt 3 293,749 ns/op 542% 25%
# JDK 17.0.6, c3-highcpu-8
CachingEffects.indexof true avgt 3 271,683 ns/op
CachingEffects.manual true avgt 3 1,080,750 ns/op 398%
CachingEffects.indexof false avgt 3 58,124 ns/op 21%
CachingEffects.manual false avgt 3 310,592 ns/op 534% 29%
Never underestimate your own bias
If this wasn't challenging enough...
You just saw a fraction of all factors that might or might not influence the correctness of your microbenchmarks. You don't have to understand them all, but you have to be aware that a microbenchmark is not just about cpu and time measurement.
Let's Tackle the Problem
What is your motivation?
If you are an OpenJDK contributor, feel free to ignore the above ;)
No idea? No benchmark!
Most performance problems are not longer about CPU usage rather about not using the CPU.
Build your code
Stop thinking about the CPU!
Shape the future outcome
Data creation and mutation can be challenging
Wrong conclusions can be drawn in seconds with this trick
Functional correctness before performance
No unit tests, no microbenchmark!
Never believe in your first results
Vary a little to feel confident
Luke, minimal disturbance to the force you must apply.
Use the power of JMH
-prof gc
-prof perf/perfnorm
-prof perfasm
-prof comp
-prof cl
-prof jfr
-prof async
You might have to google a lot.
Ask someone but ChatGPT
One shall not claim success before someone else said so.
Remember our experiments and more
See my talk "Java - Parallel Programming is Hard", Friday - 14:30 Room 2
A Summary of Things
Microbenchmarking is...
When and when not to do microbenchmarks
The JMH warning
REMEMBER: The numbers are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Because you have seen it on my slides, does not make it right!