The Art of Realizing One is Wrong
@ReneSchwietzke
@reneschwietzke@foojay.social
#java #qa #test #performance #performancetest #quality
For your own browsing and copying needs
Slides
https://t.ly/NxQiB
capital November, X-ray, capital Quebec, India, capital Bravo
Why you should stick around
This is not a JMH training!
Photos from Pexels, free to use, no attribution needed, CC0 or Public Domain
Slides: https://t.ly/NxQiB
What is my motivation?
If you want to drive fast, you have to understand how a car works. Some physics knowledge won't hurt either.
Slides: https://t.ly/NxQiB
Definitions and Guesses
A definition first
To study (something, such as a competitor's product or business practices) in order to improve the performance of one's own company
A point of reference from which measurements may be made
A standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)
Source: Merriam-Webster
Slides: https://t.ly/NxQiB
What is this micro thing and why is it micro?
A simplified benchmark that tests components in isolation.
It simplifies the process by abstracting the problem, removing complexity, and improving debuggability.
Components are usually quite small. Compare that to the unit test idea!
Source: René Schwietzke
Slides: https://t.ly/NxQiB
What might be our goals?
Slides: https://t.ly/NxQiB
Side effects of simplification
Slides: https://t.ly/NxQiB
Some Experiments First
Always use System.arraycopy
over manual copying
public static void main(String[] args)
{
var length = 255;
var src = new byte[length];
var dest = new byte[length];
var s1 = System.nanoTime();
System.arraycopy(src, 0, dest, 0, length);
var e1 = System.nanoTime();
var s2 = System.nanoTime();
for (int i = 0; i < length; i++)
{
dest[i] = src[i];
}
var e2 = System.nanoTime();
System.out.printf("System Copy: %d ns%n", e1 - s1);
System.out.printf("Manual Copy: %d ns%n", e2 - s2);
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,685 ns
Manual Copy: 5,781 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,565 ns
Manual Copy: 4,609 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example1
System Copy: 2,545 ns
Manual Copy: 4,589 ns
Use a sized StringBuilder
all the time
public static void main(String[] args) {
final String a = "This is a test";
final String b = "More fun with a string";
final long c = System.currentTimeMillis(); // e.g. 1684581249117
var l = a.length() + b.length() + 13;
var s3 = System.nanoTime();
{
var r = a + b + c;
}
var e3 = System.nanoTime();
var s1 = System.nanoTime();
{
var sb = new StringBuilder();
sb.append(a); sb.append(b); sb.append(c);
var r = sb.toString();
}
var e1 = System.nanoTime();
var s2 = System.nanoTime();
{
var sb = new StringBuilder(l);
sb.append(a); sb.append(b); sb.append(c);
var r = sb.toString();
}
var e2 = System.nanoTime();
System.out.printf("Pure String usage : %d ns%n", e3 - s3);
System.out.printf("StringBuilder : %d ns%n", e1 - s1);
System.out.printf("StringBuilder size: %d ns%n", e2 - s2);
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 44,441,844 ns
StringBuilder : 93,092 ns
StringBuilder size: 7,123 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 38,471,326 ns
StringBuilder : 74,865 ns
StringBuilder size: 6,076 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example2
Pure String usage : 37,914,191 ns
StringBuilder : 72,841 ns
StringBuilder size: 6,146 ns
Streams are slow and parallel()
is always better
public static void main(String[] args)
{
var data = new long[10000];
for (int i = 0; i < data.length; i++) {
data[i] = i;
}
// manual
var s1 = System.nanoTime(); {
var total = 0.0d;
for (int i = 0; i < data.length; i++) {
total += calc(data[i]);
}
var r = total / data.length;
}
var e1 = System.nanoTime();
// stream
var s2 = System.nanoTime(); {
var r = Arrays.stream(data).map(i -> calc(i)).average();
}
var e2 = System.nanoTime();
// stream parallel
var s3 = System.nanoTime(); {
var r = Arrays.stream(data).parallel().map(i -> calc(i)).average();
}
var e3 = System.nanoTime();
...
}
private static long calc(long c) {
var t = c;
for (long i = 100; i > 0; i--) {
t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
}
return t;
}
}
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 5,832,700 ns
Stream : 23,259,444 ns
Parallel stream: 10,246,827 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 5,834,677 ns
Stream : 20,817,603 ns
Parallel stream: 10,223,933 ns
$ java -cp target/classes/ org.devoxxpl23.naive.Example3
Manual : 6,054,650 ns
Stream : 20,718,407 ns
Parallel stream: 10,415,500 ns
Forget everything you just saw
What do we really want?
Our true goal: We want to learn and understand by changing details under controlled conditions. Turn a knob a bit and see what happens. We want to derive reasoning or even proof from our changes.
Don't ask what you can do, ask others what they have done (not JFK)!
JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.
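To make this concrete, here is a minimal sketch of what a JMH benchmark looks like (a hypothetical class, not from the talk's repository); the annotations are the same ones used in the examples that follow.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(2)
@State(Scope.Thread)
public class MyFirstBenchmark {
    @Param({"10", "1000"}) int SIZE;

    int[] data;

    @Setup
    public void setup() {
        data = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            data[i] = i;
        }
    }

    @Benchmark
    public long sum() {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum; // return the result so the JIT cannot drop the work
    }
}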
Some Factors by Example
We don't have the time...
https://reneschwietzke.de/java/benchmarking-a-number-journey-int-to-string.html
Whom do you ask for time?
System.nanoTime()
and System.currentTimeMillis()
We can measure nanoseconds, can't we?
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(2)
@State(Scope.Thread)
public class TimerResolution
{
private long lastValue;
@Benchmark
public long latency_nanotime()
{
return System.nanoTime();
}
public static double precision(long[] t)
{
// store samples
for (int i = 0; i < t.length; i++)
{
t[i] = System.nanoTime();
}
// average distance between two consecutive calls
long total = 0;
for (int i = 1; i < t.length; i++)
{
total += t[i] - t[i - 1];
}
return (double) total / (t.length - 1);
}
}
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 25.519 ± 0.985 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, TSC
372942679845594
372942679845635
372942679845665
372942679845705
372942679845745
372942679845775
Difference between calls 30 to 40 ns
Source: https://github.com/shipilev/timers-bench
Not every clock is created equal
The x86 Time Stamp Counter (TSC) is a low-latency, monotonically increasing clock used to measure event durations
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 1,128.204 ± 32.831 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, HPET
374172181482681
374172181484078
374172181484986
374172181485964
374172181487291
374172181488199
Difference between calls 900 to 1,400 ns
Benchmark Mode Cnt Score Error Units
TimerResolution.latency_nanotime avgt 10 25.519 ± 0.985 ns/op
# Samples Lenovo T14s, AMD, Linux 5.15, TSC
495719354555374
495719354555394
495719354555424
495719354555444
495719354555464
495719354555484
Difference between calls 20 to 30 ns
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource
Java evolves and so does performance
public class Example1ArrayCopying {
@Param({"1000"}) int SIZE;
byte[] src;
@Setup
public void setup() {
src = new byte[SIZE];
final Random r = new Random(7L);
r.nextBytes(src);
}
@Benchmark
public byte[] systemCopy() {
var dest = new byte[src.length];
System.arraycopy(src, 0, dest, 0, src.length);
return dest;
}
@Benchmark
public byte[] arrayCopy() {
return Arrays.copyOf(src, src.length);
}
@Benchmark
public byte[] manualCopy() {
var dest = new byte[src.length];
for (int i = 0; i < src.length; i++) {
dest[i] = src[i];
}
return dest;
}
}
# JDK 11
Benchmark (SIZE) Mode Cnt Score Error Units
Example1ArrayCopying.manualCopy 1000 avgt 3 152.889 ± 32.305 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 130.963 ± 119.643 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 128.726 ± 83.837 ns/op
# JDK 17
Example1ArrayCopying.manualCopy 1000 avgt 3 153.508 ± 7.987 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 124.021 ± 35.199 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 121.271 ± 35.723 ns/op
# JDK 19
Example1ArrayCopying.manualCopy 1000 avgt 3 149.298 ± 79.705 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 129.439 ± 104.106 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 133.358 ± 40.304 ns/op
# JDK 20
Example1ArrayCopying.manualCopy 1000 avgt 3 152.070 ± 7.096 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 119.613 ± 9.873 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 117.901 ± 8.322 ns/op
# Graal 22.3.r17 - Whoops!
Example1ArrayCopying.manualCopy 1000 avgt 3 335.612 ± 22.000 ns/op
Example1ArrayCopying.systemCopy 1000 avgt 3 150.035 ± 4.466 ns/op
Example1ArrayCopying.arrayCopy 1000 avgt 3 149.176 ± 4.496 ns/op
When, where, and how often is it called - the JIT at work
public class Streams {
final int SIZE = 1024;
final int[] integers = new int[SIZE];
@Setup
public void setup() {
for (int i = 0; i < integers.length; i++) {
integers[i] = i - SIZE;
}
}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayWarm() {...}
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayHot() {...}
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v % 2 == 0) {
sum += v;
}
}
return sum;
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int loopWarm() {...}
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int loopHot() {...}
}
Benchmark Mode Cnt Score Error Units
Streams.lambdaArrayCold avgt 5 688,797 ± 4,665,070 ns/op
Streams.lambdaArrayWarm avgt 5 12,797 ± 59,453 ns/op
Streams.lambdaArrayHot avgt 5 1,113 ± 2,602 ns/op
Streams.loopCold avgt 5 52,936 ± 42,276 ns/op
Streams.loopWarm avgt 5 2,030 ± 2,527 ns/op
Streams.loopHot avgt 5 1,422 ± 733 ns/op
Streams.lambdaArrayCold
Iteration 1: 2,855,843.000 ns/op
Iteration 2: 172,353.000 ns/op
Iteration 3: 139,810.000 ns/op
Iteration 4: 139,670.000 ns/op
Iteration 5: 136,313.000 ns/op
Streams.lambdaArrayWarm
Iteration 1: 40,097.000 ns/op
Iteration 2: 4,712.000 ns/op
Iteration 3: 9,992.667 ns/op
Iteration 4: 4,145.143 ns/op
Iteration 5: 5,039.667 ns/op
Streams.loopCold
Iteration 1: 44,510.286 ns/op
Iteration 2: 61,584.000 ns/op
Iteration 3: 67,831.000 ns/op
Iteration 4: 45,749.000 ns/op
Iteration 5: 45,007.000 ns/op
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 1, timeUnit = TimeUnit.NANOSECONDS)
@Fork(1)
public class HeavyStreams {
// same fields and setup as before, but integers now holds BigInteger values
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int lambdaArrayCold() {
return Arrays.stream(integers)
.filter(i -> i.mod(BigInteger.TWO).equals(BigInteger.ZERO))
.map(i -> i.sqrt())
.mapToInt(BigInteger::intValue)
.sum();
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayWarm() { ... }
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int lambdaArrayHot() { ... }
@Benchmark
@Warmup(iterations = 0, batchSize = 1)
public int loopCold() {
int sum = 0;
for (int i = 0; i < integers.length; i++) {
var v = integers[i];
if (v.mod(BigInteger.TWO).equals(BigInteger.ZERO)) {
sum += v.sqrt().intValue();
}
}
return sum;
}
@Benchmark
@Warmup(iterations = 1, time = 1, timeUnit = TimeUnit.SECONDS)
public int loopWarm() { ... }
@Benchmark
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public int loopHot() { ... }
}
# ms now!!!
Benchmark Mode Cnt Score Error Units
HeavyStreams.lambdaArrayCold avgt 5 362 ± 113 ms/op
HeavyStreams.lambdaArrayWarm avgt 5 297 ± 177 ms/op
HeavyStreams.lambdaArrayHot avgt 5 264 ± 12 ms/op
HeavyStreams.loopCold avgt 5 363 ± 100 ms/op
HeavyStreams.loopWarm avgt 5 302 ± 183 ms/op
HeavyStreams.loopHot avgt 5 271 ± 16 ms/op
HeavyStreams.lambdaArrayCold
Iteration 1: 409 ms/op
Iteration 2: 372 ms/op
Iteration 3: 347 ms/op
Iteration 4: 342 ms/op
Iteration 5: 339 ms/op
HeavyStreams.lambdaArrayWarm
Iteration 1: 263 ms/op
Iteration 2: 270 ms/op
Iteration 3: 264 ms/op
Iteration 4: 263 ms/op
Iteration 5: 263 ms/op
HeavyStreams.loopCold
Iteration 1: 401 ms/op
Iteration 2: 376 ms/op
Iteration 3: 352 ms/op
Iteration 4: 342 ms/op
Iteration 5: 340 ms/op
It's dead, Jim!
public class DeadCode {
private double computeSideEffect(int n) {
return BigInteger.valueOf(n).pow(7).sqrt().doubleValue();
}
private double fib(int n) {
int fib = 1;
int prevFib = 1;
for (int i = 2; i < n; i++) {
int temp = fib;
fib += prevFib;
prevFib = temp;
}
return fib;
}
@Benchmark
public double baseline() {
return 5.0d;
}
@Benchmark
public void computeWrongSE() {
computeSideEffect(2000);
}
@Benchmark
public double computeRightSE() {
return computeSideEffect(2000);
}
@Benchmark
public void computeWrong() {
fib(2000);
}
@Benchmark
public double computeRight() {
return fib(2000);
}
}
# JDK 11
Benchmark Mode Cnt Score Error Units
DeadCode.baseline avgt 3 3.4 ± 0.2 ns/op
# JDK 17
Benchmark Mode Cnt Score Error Units
DeadCode.baseline avgt 3 0.4 ± 0.1 ns/op
DeadCode.computeWrong avgt 3 16.1 ± 1.6 ns/op
DeadCode.computeRight avgt 3 479.2 ± 20.2 ns/op
DeadCode.computeWrongSE avgt 3 1616.7 ± 434.4 ns/op
DeadCode.computeRightSE avgt 3 1713.4 ± 876.9 ns/op
computeWrong is not really dropped
computeWrong with fib(1000) yields 8.8 ns - ???
computeWrong with fib(100) yields 0.4 ns - ???
https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_08_DeadCode.java
https://self-learning-java-tutorial.blogspot.com/2022/08/jmh-be-cautious-with-dead-code.html
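Besides returning the result, JMH offers a Blackhole to keep otherwise unused values alive. A minimal sketch (the class name is made up; fib is the same helper as above):
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class DeadCodeBlackhole {
    @Benchmark
    public void computeConsumed(Blackhole bh) {
        // sinking the value into the Blackhole stops the JIT from proving it unused
        bh.consume(fib(2000));
    }

    private double fib(int n) {
        int fib = 1;
        int prevFib = 1;
        for (int i = 2; i < n; i++) {
            int temp = fib;
            fib += prevFib;
            prevFib = temp;
        }
        return fib;
    }
}
Returning the value, as computeRight does, achieves the same effect, because JMH consumes return values implicitly.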
public class CachingEffects {
@Param({"true", "false"})
boolean live;
int SIZE = 10000;
int LOAD = 1000000;
List<String> work;
List<String> load;
@Setup
public void setup() {
load = new ArrayList<>(LOAD);
var s = new StringBuilder("ß09876512[more here]");
s.append(System.currentTimeMillis());
s.append('$');
for (int i = 0; i < LOAD; i++) {
// align the copies next to each other
load.add(s.toString());
}
work = new ArrayList<>(SIZE);
if (live) {
// shuffle the data, with fixed outcome
Collections.shuffle(load, new Random(42L));
}
// copy us the data, when live, we get a random pattern, when
// not live, we get basically data that is laid out next to each other
for (int i = 0; i < SIZE; i++) {
work.add(load.get(i));
}
}
...
}
public class CachingEffects
{
...
/**
* Manually search for a character pos
*/
@Benchmark
public int manual() {
var n = 0;
for (int i = 0; i < work.size(); i++) {
var s = work.get(i);
for (int index = 0; index < s.length(); index++) {
if (s.charAt(index) == '$') {
n += index;
break;
}
}
}
return n;
}
/**
* Search via indexOf for the char position
*/
@Benchmark
public int indexof() {
var n = 0;
for (int i = 0; i < work.size(); i++) {
n += work.get(i).indexOf('$');
}
return n;
}
}
The hidden memory that speeds things up
# JDK 11.0.17 - Aligned Data
Benchmark (live) Mode Cnt Score Error Units Diff
CachingEffects.indexof false avgt 3 435,184 ± 54,444 ns/op
CachingEffects.manual false avgt 3 318,798 ± 166,076 ns/op 73%
# JDK 11.0.17 - More Random Data
Benchmark (live) Mode Cnt Score Error Units Diff
CachingEffects.indexof true avgt 3 1,420,598 ± 549,023 ns/op
CachingEffects.manual true avgt 3 1,389,879 ± 342,328 ns/op 98%
# JDK 11.0.17 - Single Executions
Benchmark (live) Mode Cnt Score Units
CachingEffects.indexof false avgt 1 562,553 ns/op
CachingEffects.indexof false avgt 1 456,493 ns/op
CachingEffects.indexof false avgt 1 469,517 ns/op
CachingEffects.manual false avgt 1 313,673 ns/op
CachingEffects.manual false avgt 1 677,170 ns/op
CachingEffects.manual false avgt 1 309,094 ns/op
CachingEffects.indexof true avgt 1 1,637,546 ns/op
CachingEffects.indexof true avgt 1 2,477,774 ns/op
CachingEffects.indexof true avgt 1 2,604,464 ns/op
CachingEffects.manual true avgt 1 1,560,230 ns/op
CachingEffects.manual true avgt 1 2,067,318 ns/op
CachingEffects.manual true avgt 1 1,888,360 ns/op
CPU design, usage, and speed make things go fuzzy
# Laptop Busy, JDK 11
Benchmark (live) Mode Cnt Score Units Code Cache
CachingEffects.indexof true avgt 3 1,572,759 ns/op
CachingEffects.manual true avgt 3 1,413,919 ns/op 89%
CachingEffects.indexof false avgt 3 377,649 ns/op 24%
CachingEffects.manual false avgt 3 294,783 ns/op 78% 21%
# Laptop Less Busy, JDK 11
CachingEffects.indexof true avgt 3 1,263,913 ns/op
CachingEffects.manual true avgt 3 1,150,578 ns/op 91%
CachingEffects.indexof false avgt 3 422,753 ns/op 33%
CachingEffects.manual false avgt 3 301,250 ns/op 71% 26%
# Turbo off, JDK 11
CachingEffects.indexof true avgt 3 2,191,820 ns/op
CachingEffects.manual true avgt 3 1,928,017 ns/op 88%
CachingEffects.indexof false avgt 3 858,436 ns/op 39%
CachingEffects.manual false avgt 3 687,400 ns/op 80% 36%
# Google c2-standard-8, Intel, JDK 11
CachingEffects.indexof true avgt 3 1,064,037 ns/op
CachingEffects.manual true avgt 3 908,363 ns/op 85%
CachingEffects.indexof false avgt 3 509,849 ns/op 47%
CachingEffects.manual false avgt 3 476,487 ns/op 93% 52%
# Google c2d-standard-8, AMDs, JDK 11
CachingEffects.indexof true avgt 3 707,885 ns/op
CachingEffects.manual true avgt 3 599,987 ns/op 84%
CachingEffects.indexof false avgt 3 375,242 ns/op 53%
CachingEffects.manual false avgt 3 261,686 ns/op 70% 44%
# Google c3-highcpu-8, Intel, JDK 11
CachingEffects.indexof true avgt 3 1,122,492 ns/op
CachingEffects.manual true avgt 3 1,075,496 ns/op 96%
CachingEffects.indexof false avgt 3 357,894 ns/op 32%
CachingEffects.manual false avgt 3 303,531 ns/op 85% 28%
The unknown fortune teller
https://shipilev.net/jvm/anatomy-quarks/28-frequency-based-code-layout/
public class BranchPrediction {
private static final int COUNT = 10000;
private int[] sorted = new int[COUNT];
private int[] unsorted = new int[COUNT];
private int[] pattern2 = new int[COUNT];
private int[] pattern5 = new int[COUNT];
private int[] shuffledPattern = new int[COUNT];
@Setup
public void setup() {
final Random r = new Random(13241L);
// using int here to avoid problems with memory and caches aka
// only inline data and hence same caching behavior
for (int i = 0; i < COUNT; i++) {
final int d = r.nextInt(100_000_000) + 1;
sorted[i] = d; unsorted[i] = d;
pattern2[i] = i % 2 == 0 ? 100_000_000 : r.nextInt(50_000_000);
pattern5[i] = i % 5 == 0 ? 100_000_000 : r.nextInt(50_000_000);
shuffledPattern[i] = pattern2[i];
}
Arrays.sort(sorted);
// make it noisy to break the branch predictor
for (int i = 0; i < shuffledPattern.length; i++) {
// swap each slot with a random partner to destroy the regular pattern
final int j = r.nextInt(shuffledPattern.length);
final int tmp = shuffledPattern[i];
shuffledPattern[i] = shuffledPattern[j];
shuffledPattern[j] = tmp;
}
}
public long doIt(int[] array) {
long sum = 2;
for (int v : array) {
if (v > 50_000_000) {
sum += (v >> 2);
}
else {
sum += (v << 2);
}
}
return sum;
}
...
}
Your data sorting becomes the driver of the results
# JDK 11
Benchmark Mode C Score Units
BranchPrediction.unsorted avgt 4 14,424 ns/op
BranchPrediction.sorted avgt 4 4,709 ns/op
BranchPrediction.pattern2 avgt 4 4,943 ns/op
BranchPrediction.pattern5 avgt 4 4,642 ns/op
BranchPrediction.shuffledPattern avgt 4 14,209 ns/op
# JDK 17
BranchPrediction.unsorted avgt 4 14,285 ns/op
BranchPrediction.sorted avgt 4 4,858 ns/op
BranchPrediction.pattern2 avgt 4 4,353 ns/op
BranchPrediction.pattern5 avgt 4 4,744 ns/op
BranchPrediction.shuffledPattern avgt 4 14,132 ns/op
# JDK 11, -XX:-BlockLayoutByFrequency
Benchmark Mode C Score Units
BranchPrediction.unsorted avgt 4 9,800 ns/op
BranchPrediction.sorted avgt 4 4,791 ns/op
BranchPrediction.pattern2 avgt 4 5,509 ns/op
BranchPrediction.pattern5 avgt 4 5,802 ns/op
BranchPrediction.shuffledPattern avgt 4 9,865 ns/op
# JDK 17, -XX:-BlockLayoutByFrequency
BranchPrediction.unsorted avgt 4 9,304 ns/op
BranchPrediction.sorted avgt 4 4,239 ns/op
BranchPrediction.pattern2 avgt 4 5,530 ns/op
BranchPrediction.pattern5 avgt 4 4,182 ns/op
BranchPrediction.shuffledPattern avgt 4 9,085 ns/op
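How such a JVM flag reaches the benchmark: JMH forks a fresh JVM per @Fork, so the flag can be appended per fork via the annotation or on the command line with -jvmArgsAppend. A sketch reusing the BranchPrediction class from above (the uber-jar path assumes the usual JMH Maven build):
@Fork(value = 2, jvmArgsAppend = {"-XX:-BlockLayoutByFrequency"})
public class BranchPrediction {
    // ... benchmarks as shown above
}
$ java -jar target/benchmarks.jar BranchPrediction -jvmArgsAppend "-XX:-BlockLayoutByFrequency"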
Static and little data is not enough
A String has a length of just 42 characters
com.devoxxpl23.HashingDataDimensions
public class HashingDataDimensions {
@Param({"10", "50", "100", "500", "1000", "2000"}) int CHARARRAY_SIZE;
@Param({"true", "false"}) boolean RANDOM;
final static int MAX = 100_000;
List<char[]> data, holder;
final Random r = new Random(); // index source for the random read pattern
@Setup
public void setup() {
data = new ArrayList<>(MAX);
holder = new ArrayList<>(MAX);
for (int i = 0; i < MAX; i++) {
// the data to work with
var a = random(CHARARRAY_SIZE);
data.add(a);
holder.add(a);
if (!RANDOM) {
// when not RANDOM, every slot points to the same first array, so both cases
// run the exact same code path; only the RANDOM case touches different memory
data.set(data.size() - 1, data.get(0));
}
}
}
@Benchmark
public void traditionalHashcode(Blackhole b) {
for (int i = 0; i < data.size(); i++) {
b.consume(hashCodeClassic(data.get(r.nextInt(data.size()))));
}
}
...
public static int hashCodeClassic(char[] src) {
final int last = src.length;
int h = 0;
for (int i = 0; i < last; i++) {
h = 31 * h + src[i];
}
return h;
}
public static int hashCodeVectoredJDK19(char[] src) {
int h = 0;
int i = 0;
int l, l2;
l = l2 = src.length;
l = l & ~(8 - 1);
for (; i < l; i += 8) {
h = -1807454463 * h +
1742810335 * src[i + 0] +
887503681 * src[i + 1] +
28629151 * src[i + 2] +
923521 * src[i + 3] +
29791 * src[i + 4] +
961 * src[i + 5] +
31 * src[i + 6] +
1 * src[i + 7];
}
for (; i < l2; i++) {
h = 31 * h + src[i];
}
return h;
}
public static int hashCodeNoMul(char[] src) {
final int last = src.length;
int h = 0;
for (int i = 0; i < last; i++) {
h = ((h << 5) - h) + src[i];
}
return h;
}
}
Benchmark (CHARARRAY_SIZE) (RANDOM) Score Units %
traditionalHashcode 10 true 2,354 us/op
traditionalHashcode 10 false 1,424 us/op 60%
traditionalHashcode 50 true 6,905 us/op
traditionalHashcode 50 false 4,255 us/op 61%
traditionalHashcode 100 true 12,590 us/op
traditionalHashcode 100 false 8,651 us/op 68%
traditionalHashcode 500 true 58,176 us/op
traditionalHashcode 500 false 43,322 us/op 74%
traditionalHashcode 1000 true 103,746 us/op
traditionalHashcode 1000 false 86,582 us/op 83%
traditionalHashcode 2000 true 194,827 us/op
traditionalHashcode 2000 false 173,190 us/op 88%
vectoredHashcode 10 true 2,701 us/op
vectoredHashcode 10 false 1,682 us/op 62%
vectoredHashcode 50 true 5,893 us/op
vectoredHashcode 50 false 2,919 us/op 50%
vectoredHashcode 100 true 8,593 us/op
vectoredHashcode 100 false 4,735 us/op 55%
vectoredHashcode 500 true 40,104 us/op
vectoredHashcode 500 false 20,698 us/op 51%
vectoredHashcode 1000 true 63,079 us/op
vectoredHashcode 1000 false 40,718 us/op 64%
vectoredHashcode 2000 true 106,395 us/op
vectoredHashcode 2000 false 80,286 us/op 75%
Benchmark (CHARARRAY_SIZE) (RANDOM) Score Units %
traditionalHashcode 10 true 2,354 us/op
vectoredHashcode 10 true 2,701 us/op 114%
traditionalHashcode 50 true 6,905 us/op
vectoredHashcode 50 true 5,893 us/op 85%
traditionalHashcode 100 true 12,590 us/op
vectoredHashcode 100 true 8,593 us/op 68%
traditionalHashcode 500 true 58,176 us/op
vectoredHashcode 500 true 40,104 us/op 69%
traditionalHashcode 1000 true 103,746 us/op
vectoredHashcode 1000 true 63,079 us/op 61%
traditionalHashcode 2000 true 194,827 us/op
vectoredHashcode 2000 true 106,395 us/op 55%
traditionalHashcode 10 false 1,424 us/op
vectoredHashcode 10 false 1,682 us/op 118%
traditionalHashcode 50 false 4,255 us/op
vectoredHashcode 50 false 2,919 us/op 69%
traditionalHashcode 100 false 8,651 us/op
vectoredHashcode 100 false 4,735 us/op 55%
traditionalHashcode 500 false 43,322 us/op
vectoredHashcode 500 false 20,698 us/op 49%
traditionalHashcode 1000 false 86,582 us/op
vectoredHashcode 1000 false 40,718 us/op 47%
traditionalHashcode 2000 false 173,190 us/op
vectoredHashcode 2000 false 80,286 us/op 46%
Quietly removed the bit-shifting version :)
Let's revisit the JDK version
# JDK 11.0.16, T14s AMD Gen 1
Benchmark (live) Mode Cnt Score Units Code Cache
CachingEffects.indexof true avgt 3 1,263,913 ns/op
CachingEffects.manual true avgt 3 1,150,578 ns/op 91%
CachingEffects.indexof false avgt 3 422,753 ns/op 33%
CachingEffects.manual false avgt 3 301,250 ns/op 71% 26%
# JDK 11.0.16, c3-highcpu-8
CachingEffects.indexof true avgt 3 1,122,492 ns/op
CachingEffects.manual true avgt 3 1,075,496 ns/op 95%
CachingEffects.indexof false avgt 3 357,894 ns/op 32%
CachingEffects.manual false avgt 3 303,531 ns/op 85% 28%
# JDK 17.0.6, T14s AMD Gen 1
CachingEffects.indexof true avgt 3 403,302 ns/op
CachingEffects.manual true avgt 3 1,160,649 ns/op 287%
CachingEffects.indexof false avgt 3 53,984 ns/op 13%
CachingEffects.manual false avgt 3 293,749 ns/op 542% 25%
# JDK 17.0.6, c3-highcpu-8
CachingEffects.indexof true avgt 3 271,683 ns/op
CachingEffects.manual true avgt 3 1,080,750 ns/op 398%
CachingEffects.indexof false avgt 3 58,124 ns/op 21%
CachingEffects.manual false avgt 3 310,592 ns/op 534% 29%
Never underestimate your own bias
If this wasn't challenging enough...
You have just seen a fraction of the factors that might or might not influence the correctness of your microbenchmarks. You don't have to understand them all, but you have to be aware that a microbenchmark is about more than CPU and time measurement.
Let's Tackle the Problem
What is your motivation?
If you are an OpenJDK contributor, feel free to ignore the above ;)
No idea? No benchmark!
Most performance problems are no longer about CPU usage but rather about not using the CPU.
Build your code
Stop thinking about the CPU!
Shape the future outcome
Data creation and mutation can be challenging
Wrong conclusions can be drawn in seconds with this trick
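One JMH knob for the data question, shown here as a hedged sketch (the class is illustrative): the Level of a @Setup method decides how often data is rebuilt, which matters as soon as the benchmarked code mutates its input.
import java.util.Arrays;
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class MutatingDataBenchmark {
    int[] data;

    // Fresh, unsorted data for every single call. With the default Level.Trial the
    // array would already be sorted after the first invocation and we would measure
    // sorting of sorted data. Level.Invocation has its own overhead, so it is only
    // appropriate when the workload is large compared to the setup cost.
    @Setup(Level.Invocation)
    public void setup() {
        data = new Random(42L).ints(10_000).toArray();
    }

    @Benchmark
    public int[] sort() {
        Arrays.sort(data); // mutates its input
        return data;
    }
}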
Functional correctness before performance
No unit tests, no microbenchmark!
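A hedged sketch of what that can mean for the array-copy example from earlier (JUnit 5 assumed, test class made up): assert that all benchmarked variants produce the same result before comparing their speed.
import static org.junit.jupiter.api.Assertions.assertArrayEquals;
import java.util.Arrays;
import java.util.Random;
import org.junit.jupiter.api.Test;

class ArrayCopyCorrectnessTest {
    @Test
    void allCopyVariantsProduceTheSameResult() {
        byte[] src = new byte[1000];
        new Random(7L).nextBytes(src);

        // the three variants from Example1ArrayCopying
        byte[] bySystem = new byte[src.length];
        System.arraycopy(src, 0, bySystem, 0, src.length);
        byte[] byArrays = Arrays.copyOf(src, src.length);
        byte[] byLoop = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            byLoop[i] = src[i];
        }

        assertArrayEquals(src, bySystem);
        assertArrayEquals(src, byArrays);
        assertArrayEquals(src, byLoop);
    }
}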
Never believe in your first results
Vary a little to feel confident
Luke, minimal disturbance to the force you must apply.
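One low-effort way to vary a little without touching the code (assuming the usual JMH uber-jar build): override @Param values and the fork count on the command line.
$ java -jar target/benchmarks.jar Example1ArrayCopying -p SIZE=100,1000,10000 -f 3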
Use the power of JMH
-prof gc
-prof perf/perfnorm
-prof perfasm
-prof comp
-prof cl
-prof jfr
-prof async
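For example (again assuming the uber-jar build; the perf profilers need Linux perf, async needs the async-profiler native library installed):
$ java -jar target/benchmarks.jar Example1ArrayCopying -prof gc
$ java -jar target/benchmarks.jar CachingEffects -prof perfnorm
$ java -jar target/benchmarks.jar BranchPrediction -prof perfasm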
You might have to google a lot.
Ask anyone but ChatGPT
One shall not claim success before someone else has said so.
Remember our experiments and more
A Summary of Things
Microbenchmarking is...
When and when not to do microbenchmarks
The JMH warning
REMEMBER: The numbers are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Just because you have seen it on my slides does not make it right!