About the Smart Java Internals
René Schwietzke, Xceptance GmbH, Jena, Germany
@ReneSchwietzke
www.xceptance.com
Always attribute other people's work
Other information is taken from JDK sources, tweets, and related talks and videos
Java is much more than what most people think. Spoilers!
The times when Java was a bytecode interpreter, running slow and sluggish, are long gone. But most people don't know the magic that is going on to speed up Java to native code speed and beyond.
Actually, most programmers don't even know how their code gets executed or how a computer works... sadly.
A lot of the material applies to other languages, even C, as well.
Optimization and tuning is fun... but...
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
Donald Knuth, Turing Award Lecture 1974
StringBuilder
instead of... this talk is not for youJava is efficient but the why has to be explained
Source: https://sites.google.com/view/energy-efficiency-languages/home
Get Some Measurements Going
public class PoorMansBenchmark
{
private static String append(final int i)
{
final StringBuilder sb = new StringBuilder();
sb.append(System.currentTimeMillis());
sb.append("Foo");
sb.append(i);
sb.append('c');
sb.append("lkJAHJ AHSDJHF KJASHF HSAKJFD");
final String s = sb.toString();
char[] ch = s.toCharArray();
Arrays.sort(ch);
return new String(ch);
}
public static void main (final String[] args)
{
final int SIZE = 100_000;
final long[] result = new long[SIZE];
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
final Timer t = Timer.startTimer();
sum += append(i).length();
result[i] = t.stop().runtimeNanos();
}
// pretend we need it but we really don't
// just to avoid optimizations (spoilers, duh!)
if (sum < 0)
{
System.out.println(sum);
}
}
}
Just benchmark something...
StringBuilder
and String
https://github.com/Xceptance/jmh-jmm-training: org/sample/PoorMansbenchmark.java
# SIZE = 10
1, 617053
2, 38628
3, 24712
4, 24459
5, 33943
6, 23639
7, 24312
8, 25948
9, 23379
10, 23003
# SIZE = 50
1, 616953
2, 37628
3, 27712
4, 25459
5, 36943
6, 24639
7, 27312
8, 23948
9, 24379
10, 24003
20, 23770
30, 25595
40, 23631
41, 23744
42, 25004
43, 26035
44, 23782
45, 22973
46, 25261
47, 23215
48, 24135
49, 23312
50, 22908
# SIZE = 200
1, 560064
2, 33833
3, 25935
4, 24224
5, 58761
6, 25303
7, 23832
8, 23243
9, 24405
10, 23159
20, 24005
30, 23097
40, 24002
50, 23575
60, 26380
70, 24612
80, 26099
90, 23815
100, 33719
110, 24405
120, 25568
130, 27355
140, 23190
150, 12568
160, 22774
170, 21934
180, 11408
190, 12476
200, 10366
# SIZE = 2000
1, 582933
2, 37514
3, 26746
4, 24760
5, 40041
6, 23869
7, 26384
8, 23183
9, 22801
10, 24316
20, 28037
30, 47624
40, 37747
50, 37559
60, 47502
70, 35459
80, 37795
90, 36272
100, 36344
200, 14478
300, 11769
400, 11557
500, 10088
600, 9965
700, 9387
800, 8322
900, 7305
1000, 6849
2000, 3347
# SIZE = 100,000
1, 592763
10, 24077
100, 24995
200, 10734
300, 19562
400, 8904
500, 6226
600, 6723
700, 6981
800, 5283
900, 4896
1000, 4883
2000, 3367
3000, 4947
4000, 5179
5000, 4648
6000, 4113
7000, 4283
8000, 4654
9000, 4054
10000, 4300
20000, 19314
30000, 671
40000, 694
50000, 686
60000, 663
70000, 717
80000, 697
90000, 686
100000, 686
All in nanoseconds (ns)
That must be Garbage Collection
-Xms5g -Xmx5g
-XX:+UnlockExperimentalVMOptions
-XX:+UseEpsilonGC
-XX:+AlwaysPreTouch
# No GC
# SIZE = 100,000
1, 549924
10, 23824
100, 34017
200, 16395
300, 12858
400, 11460
500, 10504
600, 9787
700, 9935
800, 11464
900, 7369
1000, 7365
2000, 2771
3000, 2763
4000, 2753
5000, 2762
6000, 2846
7000, 2703
8000, 2737
9000, 2726
10000, 2738
20000, 2770
30000, 2706
40000, 2764
50000, 2657
60000, 2476
70000, 1040
80000, 1002
90000, 982
100000, 990
# G1 GC
# SIZE = 100,000
1, 595433
10, 23628
100, 40456
200, 17898
300, 11618
400, 13881
500, 9707
600, 9335
700, 9153
800, 8872
900, 7258
1000, 6953
2000, 2882
3000, 2737
4000, 2748
5000, 2756
6000, 2724
7000, 2686
8000, 2758
9000, 2751
10000, 2743
20000, 4219
30000, 5016
40000, 4437
50000, 1488
60000, 999
70000, 1056
80000, 4011
90000, 1221
100000, 755
Java 9 or higher
Java is slow
native-image
# G1 GC
# SIZE = 100,000
1, 595433
10, 23628
100, 40456
200, 17898
300, 11618
400, 13881
500, 9707
600, 9335
700, 9153
800, 8872
900, 7258
1000, 6953
2000, 2882
3000, 2737
4000, 2748
5000, 2756
6000, 2724
7000, 2686
8000, 2758
9000, 2751
10000, 2743
20000, 4219
30000, 5016
40000, 4437
50000, 1488
60000, 999
70000, 1056
80000, 4011
90000, 1221
100000, 755
# GraalVM Native
# SIZE = 100,000
1, 8791
10, 4736
100, 1989
200, 1830
300, 5040
400, 1870
500, 2267
600, 1903
700, 2279
800, 3889
900, 1771
1000, 1620
2000, 2415
3000, 3758
4000, 1290
5000, 1300
6000, 1249
7000, 2636
8000, 2160
9000, 1254
10000, 1309
20000, 1730
30000, 2446
40000, 1320
50000, 1332
60000, 1363
70000, 1385
80000, 2737
90000, 1373
100000, 1278
# G1 GC
# SIZE = 1,000,000
1, 563883
10, 23505
100, 23311
1000, 6727
2000, 5174
3000, 5444
4000, 5160
5000, 4815
6000, 5130
7000, 5138
8000, 5162
9000, 5032
10000, 5230
20000, 3992
30000, 3673
40000, 1909
50000, 1908
60000, 1106
70000, 963
80000, 966
90000, 981
100000, 958
200000, 683
300000, 693
400000, 694
500000, 691
600000, 699
700000, 697
800000, 1057
900000, 710
1000000, 703
The Data is so inconsistent
Greetings from the Java Internals
javac
Just to get a feel for the complexity
[1] src/share/vm/runtime/advancedThresholdPolicy.hpp
[2] Douglas Hawkins, JVM Mechanics, https://www.youtube.com/watch?v=a-U0so9FfqQ&t=2036s
-XX:-TieredCompilation
, -Xint
, -Xcomp
Compile when Needed vs. Right from the Start
-Xcomp
forces Java to always compile to native code# defaults
1, 560,960
2, 34,772
3, 26,167
4, 24,237
5, 35,926
6, 24,116
7, 23,163
8, 22,895
9, 24,142
10, 23,174
100, 23,596
200, 12,973
300, 12,361
400, 9,852
500, 9,596
600, 8,774
700, 8,956
800, 7,280
900, 7,076
1000, 6,089
10000, 5,071
20000, 2,683
30000, 2,703
40000, 1,532
50000, 1,121
60000, 1,012
70000, 975
80000, 979
90000, 987
100000, 976
900000, 700
1000000, 708
# -Xcomp
1, 40,041,490
2, 51,254
3, 17,217
4, 2,077
5, 1,714
6, 1,607
7, 1,610
8, 1,536
9, 1,576
10, 1,536
100, 1,376
200, 2,481
300, 1,284
400, 1,307
500, 1,150
600, 1,177
700, 1,173
800, 1,163
900, 1,174
1000, 1,546
10000, 1,096
20000, 962
30000, 999
40000, 1,093
50000, 891
60000, 883
70000, 885
80000, 862
90000, 894
100000, 869
900000, 871
1000000, 870
# defaults, T14s AMD
1, 1307425
2, 51264
3, 29962
4, 29054
5, 36178
6, 28634
7, 28635
8, 28565
9, 29053
10, 29613
20, 28495
30, 28565
40, 39111
50, 29473
60, 31009
70, 31010
80, 29543
90, 27098
100, 33734
200, 10546
300, 9428
400, 8032
500, 7543
600, 6704
700, 5866
800, 6076
900, 5238
1000, 5168
100000, 3701
200000, 3352
300000, 2934
400000, 2863
500000, 2933
600000, 2864
700000, 2933
800000, 2794
900000, 2933
1000000, 2794
# -Xint
1, 1329704
2, 55733
3, 32406
4, 30940
5, 30101
6, 30939
7, 30939
8, 30520
9, 30031
10, 30800
20, 30521
30, 31429
40, 30520
50, 29822
60, 29612
70, 30451
80, 30939
90, 30730
100, 33314
200, 32337
300, 31289
400, 31009
500, 31429
600, 31219
700, 30940
800, 30940
900, 30940
1000, 31498
100000, 33175
200000, 23257
300000, 21022
400000, 20743
500000, 20533
600000, 22070
700000, 20463
800000, 21022
900000, 20882
1000000, 20114
# JIT 1
1, 1180942
2, 36736
3, 30031
4, 29543
5, 35130
6, 28286
7, 28565
8, 28635
9, 29054
10, 29543
20, 28705
30, 30031
40, 27657
50, 28146
60, 31499
70, 31289
80, 27936
90, 28635
100, 31918
200, 11873
300, 8591
400, 7752
500, 6146
600, 6216
700, 6215
800, 4749
900, 4190
1000, 3771
100000, 3772
200000, 3841
300000, 3422
400000, 3352
500000, 3422
600000, 3212
700000, 3422
800000, 3772
900000, 3283
1000000, 3212
# JIT 1-2
1, 1196501
2, 37854
3, 30031
4, 29124
5, 32895
6, 29053
7, 28634
8, 28076
9, 27656
10, 29333
20, 29054
30, 33942
40, 33314
50, 28564
60, 30101
70, 28844
80, 28145
90, 25771
100, 25771
200, 10407
300, 8520
400, 6705
500, 5796
600, 4679
700, 4330
800, 4260
900, 4190
1000, 4260
100000, 4190
200000, 4400
300000, 4261
400000, 4819
500000, 4260
600000, 6705
700000, 4330
800000, 3841
900000, 3422
1000000, 3352
# JIT 1-3
1, 1474763
2, 83600
3, 31009
4, 28704
5, 35689
6, 29193
7, 28565
8, 28635
9, 29193
10, 29822
20, 28635
30, 28565
40, 29054
50, 29123
60, 29403
70, 28146
80, 27238
90, 27168
100, 27238
200, 11384
300, 10337
400, 12292
500, 7194
600, 5867
700, 6705
800, 5308
900, 5099
1000, 5657
100000, 4819
200000, 4749
300000, 4750
400000, 4679
500000, 4749
600000, 5867
700000, 5238
800000, 5238
900000, 4330
1000000, 3841
-XX:TieredStopAtLevel=X
1157 104 % 3 org.sample.poor.PoorMansBenchmark::main @ 17 (470 bytes)
1162 105 3 org.sample.poor.PoorMansBenchmark::main (470 bytes)
1328 135 3 org.sample.poor.PoorMansBenchmark::append (66 bytes)
1454 153 4 org.sample.poor.PoorMansBenchmark::append (66 bytes)
1561 178 % 4 org.sample.poor.PoorMansBenchmark::main @ 55 (470 bytes)
1566 135 3 org.sample.poor.PoorMansBenchmark::append (66 bytes) made not entrant
6002 178 % 4 org.sample.poor.PoorMansBenchmark::main @ 55 (470 bytes) made not entrant
6069 367 % 3 org.sample.poor.PoorMansBenchmark::main @ 100 (470 bytes)
6082 407 % 4 org.sample.poor.PoorMansBenchmark::main @ 100 (470 bytes)
-XX:PrintCompilation
, log format can change
Things to Know when Executing or Profiling Code
You trained it well
private final static int SIZE = 1000;
public static void main(String[] args)
{
Object trap = null;
Object o = null;
for (int i = 0; i < 1000; i++)
{
final Timer t = Timer.startTimer();
for (int j = 0; j < SIZE; j++)
{
// burn time and train that null is normal
o = new Object();
if (trap != null)
{
System.out.println("Got you." + o);
trap = null;
}
}
// Give me a Non-Null, Vasily.
// One Non-Null only, please.
if (i == 400)
{
trap = new Object();
}
System.out.println(
MessageFormat.format("{1} {0, number, #}",
t.stop().runtimeNanos(), i));
}
}
Example by Douglas Hawkins: org/sample/AllocationTrap.java
WTH?
If I don't need it, I remove it
null
or one exception can turn your performance upside-downYou can horribly shoot yourself in the foot
/**
* JHM Benchmark
* Vary SIZE from 1 to 10000
* @param args
*/
public int nonSense()
{
Object trap = null;
Object o = null;
for (int i = 0; i < OUTER; i++)
{
for (int j = 0; j < SIZE; j++)
{
o = new Object();
if (trap != null)
{
o = new Object();
trap = null;
}
}
if (i == 500)
{
trap = new Object();
}
}
return trap == null ? 1 : 0;
}
# One Shoot Execution aka measure one execution and no warm-up
Benchmark (SIZE) Mode Cnt Score Error Units
RemoveNonSenseJMH.nonSenseSingle 1 ss 85,298 ns/op
RemoveNonSenseJMH.nonSenseSingle 10 ss 513,886 ns/op
RemoveNonSenseJMH.nonSenseSingle 100 ss 3,603,910 ns/op
RemoveNonSenseJMH.nonSenseSingle 1000 ss 9,465,274 ns/op
RemoveNonSenseJMH.nonSenseSingle 10000 ss 17,528,758 ns/op
# Repeated Execution, No Warm up, 100 ms duration
Benchmark (SIZE) Mode Cnt Score Error Units
RemoveNonSenseJMH.nonSenseNoWarmup 1 avgt 963 ns/op
RemoveNonSenseJMH.nonSenseNoWarmup 10 avgt 600 ns/op
RemoveNonSenseJMH.nonSenseNoWarmup 100 avgt 597 ns/op
RemoveNonSenseJMH.nonSenseNoWarmup 1000 avgt 697 ns/op
RemoveNonSenseJMH.nonSenseNoWarmup 10000 avgt 2,258 ns/op
# Warm it Up First, 100 ms duration
Benchmark (SIZE) Mode Cnt Score Error Units
RemoveNonSenseJMH.nonSenseWarmup 1 avgt 660 ns/op
RemoveNonSenseJMH.nonSenseWarmup 10 avgt 609 ns/op
RemoveNonSenseJMH.nonSenseWarmup 100 avgt 409 ns/op
RemoveNonSenseJMH.nonSenseWarmup 1000 avgt 439 ns/op
RemoveNonSenseJMH.nonSenseWarmup 10000 avgt 410 ns/op
# -Xint to verify
Benchmark (SIZE) Mode Cnt Score Error Units
RemoveNonSenseJMH.nonSenseWarmup 1 avgt 113,536 ns/op
RemoveNonSenseJMH.nonSenseWarmup 10 avgt 1,110,772 ns/op
RemoveNonSenseJMH.nonSenseWarmup 100 avgt 11,449,399 ns/op
RemoveNonSenseJMH.nonSenseWarmup 1000 avgt 112,158,390 ns/op
RemoveNonSenseJMH.nonSenseWarmup 10000 avgt 1,104,793,429 ns/op
org.sample.RemoveNonSenseJMH
Now We Can Really Talk About Performance
Just one statement more...
@Benchmark
public long add230()
{
int sum = 0;
int i = 0;
// now 23 of these
sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
a[++i] + a[++i] + a[++i] + a[++i] + a[++i];
return sum;
}
@Benchmark
public long add231()
{
int sum = 0;
int i = 0;
// now 23 of these
sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
a[++i] + a[++i] + a[++i] + a[++i] + a[++i];
// and this
sum += a[++i];
return sum;
}
Benchmark Mode Cnt Score Units
ExpensiveLastAdd.add230 avgt 2 41.159 ns/op
ExpensiveLastAdd.add231 avgt 2 122.162 ns/op
803 434 4 org.sample.ExpensiveLastAdd::add230 (2251 bytes)
---
797 434 3 org.sample.ExpensiveLastAdd::add231 (2263 bytes)
799 436 4 org.sample.ExpensiveLastAdd::add231 (2263 bytes)
811 436 4 org.sample.ExpensiveLastAdd::add231 (2263 bytes)
COMPILE SKIPPED: Out of stack space (not retryable)
Douglas Q Hawkins: org/sample/ExpensiveLastAdd.java
Flatten out loops
private int next() {
return r.nextInt(1) + 1;
}
@Benchmark
public int classic() {
int sum = 0;
int step = next();
for (int i = 0; i < ints.length; i = i + 1)
{
sum += ints[i];
step = next();
}
return sum + step;
}
@Benchmark
public int variable() {
int sum = 0;
int step = next();
for (int i = 0; i < ints.length; i = i + step)
{
sum += ints[i];
step = next();
}
return sum + step;
}
org/sample/LoopUnroll.java
Benchmark (size) Mode Cnt Score Units
LoopUnroll.classic 10000 avgt 2 24,416 ns/op
LoopUnroll.variable 10000 avgt 2 35,011 ns/op
# -Xint
LoopUnroll.classic 10000 avgt 2 2,650,812 ns/op
LoopUnroll.variable 10000 avgt 2 2,449,845 ns/op
# classic()
Benchmark Mode Score Units
time avgt 25,920.0 ns/op
CPI avgt 0.572 #/op # Cycles per Instruction
IPC avgt 1.480 #/op # Instructions per Cycle
L1-dcache-load-misses avgt 648.0 #/op
L1-dcache-loads avgt 10,242.0 #/op
L1-dcache-stores avgt 294.0 #/op
branch-misses avgt 4.8 #/op
branches avgt 2,694.0 #/op
cycles avgt 61,092.0 #/op
instructions avgt 106,718.0 #/op
# variable()
Benchmark Mode Score Units
time avgt 35,020.0 ns/op
CPI avgt 0.316 #/op
IPC avgt 3.165 #/op
L1-dcache-load-misses avgt 677.0 #/op
L1-dcache-loads avgt 30,203.0 #/op
L1-dcache-stores avgt 10,293.0 #/op
branch-misses avgt 8.3 #/op
branches avgt 29,832.0 #/op
cycles avgt 87,294.0 #/op
instructions avgt 276,379.0 #/op
Everything happens in parallel
https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake
* depends on the type and generation
Make it less expensive
@Benchmark
public long array64()
{
int[] a = new int[64];
a[0] = r.nextInt();
a[1] = r.nextInt();
return a[0] + a[1];
}
@Benchmark
public long array65()
{
int[] a = new int[65];
a[0] = r.nextInt();
a[1] = r.nextInt();
return a[0] + a[1];
}
EscapeAnalysis.array64 avgt 2 22.804 ns/op
EscapeAnalysis.array65 avgt 2 41.890 ns/op
# -prof gc
EscapeAnalysis.array64 avgt 2 22.804 ns/op
EscapeAnalysis.array64:·gc.alloc.rate avgt 2 ≈ 10⁻⁴ MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm avgt 2 ≈ 10⁻⁶ B/op
EscapeAnalysis.array64:·gc.count avgt 2 ≈ 0 counts
EscapeAnalysis.array65 avgt 2 41.890 ns/op
EscapeAnalysis.array65:·gc.alloc.rate avgt 2 5795.571 MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm avgt 2 280.000 B/op
EscapeAnalysis.array65:·gc.count avgt 2 93.000 counts
# -XX:EliminateAllocationArraySizeLimit=70
EscapeAnalysis.array65 avgt 2 22.279 ns/op
EscapeAnalysis.array65:·gc.alloc.rate avgt 2 ≈ 10⁻⁴ MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm avgt 2 ≈ 10⁻⁶ B/op
EscapeAnalysis.array65:·gc.count avgt 2 ≈ 0 counts
org/sample/EscapeAnalysis.java
Native code to the rescue
final int SIZE = 100_000;
final int[] src = new int[SIZE];
public int[] manualArrayCopy()
{
int[] target = new int[SIZE];
for (int i = 0; i < SIZE; i++)
{
target[i] = src[i];
}
return target;
}
public int[] systemArrayCopy()
{
int[] target = new int[SIZE];
System.arraycopy(src, 0, target, 0, SIZE);
return target;
}
public int[] arraysCopy()
{
return Arrays.copyOf(src, src.length);
}
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 48,177 ns/op
IntrinsicsArrayCopy.system avgt 2 48,678 ns/op
Iteration Manual System
===============================================
Iteration 1: 1,431,758 ns/op 129,551 ns/op
Iteration 2: 714,942 ns/op 81,926 ns/op
Iteration 3: 188,272 ns/op 77,098 ns/op
Iteration 4: 70,491 ns/op 72,010 ns/op
Iteration 5: 82,862 ns/op 73,495 ns/op
Iteration 6: 66,534 ns/op 75,625 ns/op
Iteration 7: 61,277 ns/op 74,382 ns/op
org/sample/IntrinsicsArrayCopy.java
Native Code vs. Intrinsics
# -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
@ 54 java.lang.Math::min (11 bytes) (intrinsic)
@ 57 java.lang.System::arraycopy (0 bytes) (intrinsic)
public static native void arraycopy(...)
can be a native call or an intrinsicjava.lang.Math
http://pzemtsov.github.io/2015/11/21/crc32-intrinsic-java-or-c-with-jni.html
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/classfile/vmSymbols.hpp
https://www.lmax.com/blog/staff-blogs/2016/03/30/notes-hotspot-compiler-flags
# Lenovo T530, JDK 8
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 48,177.054 ns/op
IntrinsicsArrayCopy.system avgt 2 48,678.219 ns/op
# Lenovo T450s, JDK 8, BAT
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 66,560.515 ns/op
IntrinsicsArrayCopy.system avgt 2 67,051.983 ns/op
IntrinsicsArrayCopy.arrays avgt 2 60,674.969 ns/op
# Lenovo T450s, JDK 11, BAT
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 67,476.175 ns/op
IntrinsicsArrayCopy.system avgt 2 58,952.894 ns/op
IntrinsicsArrayCopy.arrays avgt 2 58,771.408 ns/op
# Lenovo T450s, GraalVM EE 1.0.0.14, BAT
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 94,315.864 ns/op
IntrinsicsArrayCopy.system avgt 2 70,201.933 ns/op
IntrinsicsArrayCopy.arrays avgt 2 71,605.877 ns/op
# Lenovo T14s, AMD, JDK 11, BAT Performance
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.arrays avgt 2 44,172.944 ns/op
IntrinsicsArrayCopy.manual avgt 2 46,073.504 ns/op
IntrinsicsArrayCopy.system avgt 2 43,730.640 ns/op
# Lenovo T450s, JDK 11, AC
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 56,108.312 ns/op
IntrinsicsArrayCopy.system avgt 2 55,748.067 ns/op
IntrinsicsArrayCopy.arrays avgt 2 53,769.876 ns/op
# AWS c5.xlarge, JDK 11
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 83,066.975 ns/op
IntrinsicsArrayCopy.system avgt 2 63,422.029 ns/op
IntrinsicsArrayCopy.arrays avgt 2 64,748.500 ns/op
# GCP n1-standard-4, JDK 11
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 146,932.588 ns/op
IntrinsicsArrayCopy.system avgt 2 101,970.577 ns/op
IntrinsicsArrayCopy.arrays avgt 2 116,233.650 ns/op
# GCP n1-highcpu-8, JDK 11
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 82,865.258 ns/op
IntrinsicsArrayCopy.system avgt 2 62,810.793 ns/op
IntrinsicsArrayCopy.arrays avgt 2 61,237.489 ns/op
# Lenovo T450s, JDK 11, BAT, Min Power
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 215,360.090 ns/op
IntrinsicsArrayCopy.system avgt 2 210,971.627 ns/op
IntrinsicsArrayCopy.arrays avgt 2 214,870.207 ns/op
org.sample.IntrinsicsArrayCopy_AverageTime
@Setup
public void setup()
{
foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
}
public String plain()
{
return "a" + foo;
}
public String concat()
{
return "a".concat(foo);
}
public String builder()
{
return new StringBuilder().append("a").append(foo).toString();
}
public String builderSized()
{
return new StringBuilder(32).append("a").append(foo).toString();
}
public String builderNonFluid()
{
final StringBuilder sb = new StringBuilder();
sb.append("a");
sb.append(foo);
return sb.toString();
}
public String builderNonFluidSized()
{
final StringBuilder sb = new StringBuilder(32);
sb.append("a");
sb.append(foo);
return sb.toString();
}
Let's make some Strings
Benchmark Mode Cnt Score Units
Strings1.builderSized avgt 2 12.643 ns/op
Strings1.builder avgt 2 12.655 ns/op
Strings1.plain avgt 2 13.087 ns/op
Strings1.concat avgt 2 15.919 ns/op
Strings1.builderNonFluidSized avgt 2 17.599 ns/op
Strings1.builderNonFluid avgt 2 24.259 ns/op
# -Xint
Strings1.concat avgt 2 875.504 ns/op
Strings1.builderSized avgt 2 2395.736 ns/op
Strings1.builderNonFluidSized avgt 2 2439.351 ns/op
Strings1.builder avgt 2 2770.349 ns/op
Strings1.builderNonFluid avgt 2 2746.212 ns/op
Strings1.plain avgt 2 3014.032 ns/op
Inspired by Douglas Q Hawkins: org.sample.Strings1, JDK 8
Hardware does not matter... you wish!
final int SIZE = 1_000_000;
final int[] src = new int[SIZE];
@Benchmark
public int step1()
{
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
sum += src[i];
}
return sum;
}
@Benchmark
public int step20()
{
int sum = 0;
for (int i = 0; i < SIZE; i = i + 20)
{
sum += src[i];
}
return sum;
}
size20
must be faster, about 20x
Benchmark Mode Cnt Score Error Units
ArraysAndHardware.step1 avgt 2 325,159 ns/op
ArraysAndHardware.step20 avgt 2 97,402 ns/op
org.sample.ArraysAndHardware.java
Also Java has to live by hardware laws
Linux Perf for step1
==========================================================================================
4485.209219 task-clock (msec) # 0,290 CPUs utilized
11,127,672,435 cycles # 2,481 GHz (38,32%)
11,620,571,786 instructions # 1,04 insn per cycle (46,29%)
665,870,415 branches # 148,459 M/sec (46,36%)
1,066,446 branch-misses # 0,16% of all branches (46,40%)
9,482,675,316 L1-dcache-loads # 2114,210 M/sec (46,50%)
593,181,935 L1-dcache-load-misses # 6,26% of all L1-dcache hits (46,59%)
21,448,491 LLC-loads # 4,782 M/sec (30,64%)
5,856,980 LLC-load-misses # 27,31% of all LL-cache hits (30,67%)
Linux Perf for step20
==========================================================================================
4455.946190 task-clock (msec) # 0,286 CPUs utilized
11,046,881,256 cycles # 2,479 GHz (38,45%)
3,836,171,205 instructions # 0,35 insn per cycle (46,27%)
927,360,764 branches # 208,118 M/sec (46,40%)
873,526 branch-misses # 0,09% of all branches (46,49%)
979,546,079 L1-dcache-loads # 219,829 M/sec (46,57%)
860,310,077 L1-dcache-load-misses # 87,83% of all L1-dcache hits (46,66%)
746,566,165 LLC-loads # 167,544 M/sec (30,76%)
330,798,682 LLC-load-misses # 44,31% of all LL-cache hits (30,64%)
-prof perf
requires a special kernel tooling
What it costs to leave the CPU
Real | Humanized | |
---|---|---|
CPU Cycle | 0.4 ns | 1 s |
L1 Cache | 0.9 ns | 2 s |
L2 Cache | 2.8 ns | 7 s |
L3 Cache | 28 ns | 1 min |
Memory Access | 100 ns | 4 min |
NVM SSD | 25 μs | 17 h |
SSD | 50–150 μs | 1.5-4 days |
HDD | 1–10 ms | 1-9 months |
TCP SF to NY | 65 ms | 5 years |
TCP SF to Hong Kong | 141 ms | 11 years |
Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Picard: How long a time?
Data: 0.68 seconds, sir. For an android, that is nearly an eternity.
http://www.prowesscorp.com/computer-latency-at-a-human-scale/
final int[][] src = new int[1000][1000];
@Benchmark
public int horizontal() {
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
sum += src[i][j];
}
}
return sum;
}
@Benchmark
public int vertical() {
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
sum += src[j][i];
}
}
return sum;
}
@Benchmark
public int horizontalBackwards() {
int sum = 0;
for (int i = SIZE - 1; i >=0; i--)
{
for (int j = SIZE - 1; j >= 0; j--)
{
sum += src[i][j];
}
}
return sum;
}
What is real and what is not
# Lenovo T530, Intel i7
Benchmark Mode Score Error Units
horizontal avgt 470,009.870 ns/op
horizontalBackwards avgt 672,821.809 ns/op
vertical avgt 3,209,562.255 ns/op
# Lenovo T14s, AMD Ryzen 7
Benchmark Mode Score Error Units
horizontal avgt 266,470.908 ns/op
horizontalBackwards avgt 270,099.755 ns/op
vertical avgt 1,197,509.843 ns/op
org.sample.Matrix
with -prof perfnorm
final int[][] src = new int[1000][1000];
@Benchmark
public int horizontal() {
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
sum += src[i][j];
}
}
return sum;
}
@Benchmark
public int vertical() {
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
sum += src[j][i];
}
}
return sum;
}
@Benchmark
public int horizontalBackwards() {
int sum = 0;
for (int i = SIZE - 1; i >=0; i--)
{
for (int j = SIZE - 1; j >= 0; j--)
{
sum += src[i][j];
}
}
return sum;
}
# Lenovo T530
Benchmark Mode Score Units
Matrix.horizontal avgt 470,009 ns/op
Matrix.horizontal:L1-dcache-load-misses avgt 63,645 #/op
Matrix.horizontal:L1-dcache-loads avgt 1,014,532 #/op
Matrix.horizontal:LLC-load-misses avgt 429 #/op
Matrix.horizontal:LLC-loads avgt 1,523 #/op
Matrix.horizontalBackwards avgt 672,821 ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses avgt 63,053 #/op
Matrix.horizontalBackwards:L1-dcache-loads avgt 1,007,381 #/op
Matrix.horizontalBackwards:LLC-load-misses avgt 854 #/op
Matrix.horizontalBackwards:LLC-loads avgt 2,764 #/op
Matrix.vertical avgt 3,209,562 ns/op
Matrix.vertical:L1-dcache-load-misses avgt 2,060,998 #/op
Matrix.vertical:L1-dcache-loads avgt 3,020,796 #/op
Matrix.vertical:LLC-load-misses avgt 15,223 #/op
Matrix.vertical:LLC-loads avgt 694,397 #/op
# Lenovo T14s
Benchmark Mode Score Units
Matrix.horizontal avgt 286,633 ns/op
Matrix.horizontal:L1-dcache-load-misses avgt 63,902 #/op
Matrix.horizontal:L1-dcache-loads avgt 1,031,015 #/op
Matrix.horizontal:stalled-cycles-backend avgt 1,015,230 #/op
Matrix.horizontal:stalled-cycles-frontend avgt 4,848 #/op
Matrix.horizontalBackwards avgt 278,992 ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses avgt 64,526 #/op
Matrix.horizontalBackwards:L1-dcache-loads avgt 1,033,686 #/op
Matrix.horizontalBackwards:stalled-cycles-backend avgt 870,881 #/op
Matrix.horizontalBackwards:stalled-cycles-frontend avgt 5,029 #/op
Matrix.vertical avgt 1,484,345 ns/op
Matrix.vertical:L1-dcache-load-misses avgt 2,114,064 #/op
Matrix.vertical:L1-dcache-loads avgt 3,108,037 #/op
Matrix.vertical:stalled-cycles-backend avgt 5,081,442 #/op
Matrix.vertical:stalled-cycles-frontend avgt 16,206 #/op
org.sample.Matrix
with -prof perfnorm
Influence of the size on the result
Benchmark (SIZE) Mode Score Factor Units
horizontal 1 avgt 5.3 1.0 ns/op
horizontal 10 avgt 45.2 1.0 ns/op
horizontal 20 avgt 141.3 1.0 ns/op
horizontal 100 avgt 3,388.0 1.0 ns/op
horizontal 1000 avgt 373,619.3 1.0 ns/op
horizontal 10000 avgt 41,701,496.2 1.0 ns/op
horizontalBackwards 1 avgt 5.3 1.0 ns/op
horizontalBackwards 10 avgt 43.2 1.0 ns/op
horizontalBackwards 20 avgt 140.1 1.0 ns/op
horizontalBackwards 100 avgt 3,261.3 1.0 ns/op
horizontalBackwards 1000 avgt 511,246.1 1.4 ns/op
horizontalBackwards 10000 avgt 46,908,230.0 1.1 ns/op
vertical 1 avgt 5.3 1.0 ns/op
vertical 10 avgt 77.0 1.7 ns/op
vertical 20 avgt 275.8 2.0 ns/op
vertical 100 avgt 6,226.3 1.8 ns/op
vertical 1000 avgt 2,719,478.0 7.2 ns/op
vertical 10000 avgt 1,683,041,892.2 40.4 ns/op
Factor shows the change to the horizontal baseline number.
Testing with caches is not easy and depends on the overall machine workload.
Understand your hardware correctly
# RUN 1 - T14s, AMD
Benchmark Mode Cnt Score Diff Units
threads1 avgt 2 250.514 ms/op
threads2 avgt 2 257.087 +3% ms/op
threads4 avgt 2 284.657 +13% ms/op
threads6 avgt 2 330.602 +32% ms/op
threads8 avgt 2 375.904 +50% ms/op
# RUN 2 - T14s, AMD, turbo mode off for verification
Benchmark Mode Cnt Score Diff Units
threads1 avgt 2 611.487 ms/op
threads2 avgt 2 612.605 +0% ms/op
threads4 avgt 2 611.488 +0% ms/op
threads6 avgt 2 616.488 +1% ms/op
threads8 avgt 2 624.876 +2% ms/op
private final int SIZE = 100_000_000;
private final int[] SOURCE = new int[SIZE];
double doCPUBoundStuff(int[] data) {
int numDataPoints = data.length;
int i = 0;
double avg = 0;
double var = 0;
data[0] = 0;
data[1] = 1;
for (i = 2; i < numDataPoints; i++) {
data[i] = data[i-1] + data[i-2];
}
for (avg = 0, i = 0; i < numDataPoints; i++) {
avg += (double)numDataPoints;
}
avg /= (double) numDataPoints;
for (var = 0, i = 0; i < numDataPoints; i++) {
double diff = (double) data[i] - avg;
var += diff*diff;
}
var /= (double) (numDataPoints - 1);
return avg;
}
@Setup
public void setup() {
for (int i = 0; i < SIZE; i++) {
SOURCE[i] = i;
}
}
@Benchmark
@Threads(1..8)
public double t1() {
return doCPUBoundStuff(SOURCE);
}
Intel: echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
AMD: echo "0" | sudo tee /sys/devices/system/cpu/cpufreq/boost
Modern fortune tellers
private static final int COUNT = 10_000;
// Contain random numbers from -50 to 50
private int[] sorted;
private int[] unsorted;
private int[] reversed;
public void doIt(
int[] array,
Blackhole bh1,
Blackhole bh2)
{
for (int v : array)
{
if (v > 0)
{
bh1.consume(v);
}
else
{
bh2.consume(v);
}
}
}
Benchmark Mode Cnt Score Error Units
reversed avgt 2 39570.696 ns/op
sorted avgt 2 39859.027 ns/op
unsorted avgt 2 66043.605 ns/op
org.sample.BranchPredict.java
Modern CPU marvels: I know what you might need next
Benchmark Mode Score Error Units
=================================================================================
BranchPrediction.reversed avgt 35,182.532 ns/op
BranchPrediction.reversed:IPC avgt 3.257 #/op
BranchPrediction.reversed:branch-misses avgt 13.767 #/op
BranchPrediction.reversed:branches avgt 55,339.912 #/op
BranchPrediction.reversed:cycles avgt 110,993.223 #/op
BranchPrediction.reversed:instructions avgt 361,508.018 #/op
BranchPrediction.sorted avgt 35,183.238 ns/op
BranchPrediction.sorted:IPC avgt 3.257 #/op
BranchPrediction.sorted:branch-misses avgt 14.738 #/op
BranchPrediction.sorted:branches avgt 55,884.946 #/op
BranchPrediction.sorted:cycles avgt 112,912.763 #/op
BranchPrediction.sorted:instructions avgt 367,820.131 #/op
BranchPrediction.unsorted avgt 65,770.667 ns/op
BranchPrediction.unsorted:IPC avgt 1.577 #/op
BranchPrediction.unsorted:branch-misses avgt 3,795.737 #/op
BranchPrediction.unsorted:branches avgt 56,008.188 #/op
BranchPrediction.unsorted:cycles avgt 211,089.268 #/op
BranchPrediction.unsorted:instructions avgt 333,105.231 #/op
IPC: Instructions per Cycle
How small changes can mess up things big time
final int d = i % 2 == 0 ? -1 : 1;
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v > 0)
{
sum -= v;
}
else
{
sum += v;
}
}
return sum;
}
Benchmark Mode Score Units
===================================
sorted avgt 5,147.764 ns/op
unsorted avgt 3,317.656 ns/op
reversed avgt 4,738.261 ns/op
final int d = i % 2 == 0
? -r.nextInt(5) : r.nextInt(5);
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v > 0)
{
sum -= v;
}
else
{
sum += v;
}
}
return sum;
}
Benchmark Mode Score Units
===================================
sorted avgt 3,588.073 ns/op
unsorted avgt 14,282.879 ns/op
reversed avgt 5,046.499 ns/op
final int d = i % 2 == 0
? -r.nextInt(5) : r.nextInt(5);
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v >= 0)
{
sum -= -v;
}
else
{
sum += +v;
}
}
return sum;
}
Benchmark Mode Score Units
===================================
sorted avgt 3,226.433 ns/op
unsorted avgt 3,261.952 ns/op
reversed avgt 3,234.786 ns/op
final int d = i % 2 == 0
? -r.nextInt(5) : r.nextInt(5);
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v >= 0)
{
sum -= v;
}
else
{
sum += v;
}
}
return sum;
}
Benchmark Mode Score Units
===================================
sorted avgt 5,044.253 ns/op
unsorted avgt 11,669.557 ns/op
reversed avgt 5,390.210 ns/op
Being alone and still have to sync
private final int SIZE = 1024;
private final Map<String, String> syncMap =
Collections.synchronizedMap(new HashMap<>(SIZE));
private final Map<String, String> unsyncMap =
new HashMap<>(SIZE);
private final Map<String, String> conMap =
new ConcurrentHashMap<>(SIZE);
@Benchmark
public String syncAccess()
{
return syncMap.get("1");
}
@Benchmark
public String unsyncAccess()
{
return unsyncMap.get("1");
}
@Benchmark
public String conAccess()
{
return conMap.get("1");
}
Benchmark Mode Cnt Score Units
SynchronizedAlone.conAccess avgt 2 9.152 ns/op
SynchronizedAlone.syncAccess avgt 2 23.462 ns/op
SynchronizedAlone.unsyncAccess avgt 2 9.366 ns/op
ConcurrentHashMap
does wellorg/sample/SynchronizedAlone.java
Let's try more threads
2 Threads
====================================================================
Benchmark Mode Cnt Score Error Units
SynchronizedThreads4.conAccess avgt 2 9.721 ns/op
SynchronizedThreads4.syncAccess avgt 2 183.086 ns/op
SynchronizedThreads4.unsyncAccess avgt 2 9.312 ns/op
4 Threads
====================================================================
Benchmark Mode Cnt Score Error Units
SynchronizedThreads4.conAccess avgt 2 19.661 ns/op
SynchronizedThreads4.syncAccess avgt 2 216.041 ns/op
SynchronizedThreads4.unsyncAccess avgt 2 19.631 ns/op
8 Threads
====================================================================
Benchmark Mode Cnt Score Error Units
SynchronizedThreads4.conAccess avgt 2 44.055 ns/op
SynchronizedThreads4.syncAccess avgt 2 377.180 ns/op
SynchronizedThreads4.unsyncAccess avgt 2 38.562 ns/op
Unsynced access is of course mostly garbage for shared structures
Loops compared
Benchmark (length) Mode Cnt Score Error Units
ForEachSimple.arrayFor 1 avgt 2 3.8 ns/op
ForEachSimple.arrayFor 10 avgt 2 9.8 ns/op
ForEachSimple.arrayFor 100 avgt 2 64.4 ns/op
ForEachSimple.arrayFor 1000 avgt 2 1,446.1 ns/op
ForEachSimple.arrayEnhanced 1 avgt 2 3.5 ns/op
ForEachSimple.arrayEnhanced 10 avgt 2 9.5 ns/op
ForEachSimple.arrayEnhanced 100 avgt 2 63.2 ns/op
ForEachSimple.arrayEnhanced 1000 avgt 2 1,199.3 ns/op
ForEachSimple.classicFor 1 avgt 2 5.7 ns/op
ForEachSimple.classicFor 10 avgt 2 15.5 ns/op
ForEachSimple.classicFor 100 avgt 2 103.0 ns/op
ForEachSimple.classicFor 1000 avgt 2 1,388.8 ns/op
ForEachSimple.enhancedFor 1 avgt 2 6.1 ns/op
ForEachSimple.enhancedFor 10 avgt 2 17.0 ns/op
ForEachSimple.enhancedFor 100 avgt 2 118.4 ns/op
ForEachSimple.enhancedFor 1000 avgt 2 1,578.6 ns/op
ForEachSimple.lambdaStream 1 avgt 2 53.7 ns/op
ForEachSimple.lambdaStream 10 avgt 2 65.5 ns/op
ForEachSimple.lambdaStream 100 avgt 2 153.0 ns/op
ForEachSimple.lambdaStream 1000 avgt 2 1,326.0 ns/op
ForEachSimple.parallelLambdaStream 1 avgt 2 94.2 ns/op
ForEachSimple.parallelLambdaStream 10 avgt 2 6,283.2 ns/op
ForEachSimple.parallelLambdaStream 100 avgt 2 8,443.4 ns/op
ForEachSimple.parallelLambdaStream 1000 avgt 2 9,367.6 ns/op
org/sample/ForEachSimple.java
// just some strings
final List<String> list = new ArrayList<>();
String[] array = null;
@Param({"1", "10", "100", "1000"})
int length;
@Setup
public void setup() {.. // uses length ..}
@Benchmark
public void arrayFor(Blackhole bh)
{
int result = 0;
for (int i = 0; i < array.length; i++)
{
result += array[i].length();
}
bh.consume(result);
}
// classicFor
for (int i = 0; i < list.size(); i++)
{
result += list.get(i).length();
}
// enhancedFor
for (String s : list)
{
result += s.length();
}
// lambdaStream
int result = list.stream()
.mapToInt(s -> s.length())
.sum();
//parallelLambdaStream
int result = list.parallelStream()
.mapToInt(s -> s.length())
.sum();
Not so empty loops compared
Benchmark (length) Mode Cnt Score Error Units
arrayFor 1 avgt 2 238.3 ns/op
arrayFor 10 avgt 2 2,105.6 ns/op
arrayFor 100 avgt 2 21,957.8 ns/op
arrayFor 1000 avgt 2 253,262.0 ns/op
arrayEnhanced 1 avgt 2 234.1 ns/op
arrayEnhanced 10 avgt 2 2,173.1 ns/op
arrayEnhanced 100 avgt 2 22,080.1 ns/op
arrayEnhanced 1000 avgt 2 250,038.0 ns/op
classicFor 1 avgt 2 238.1 ns/op
classicFor 10 avgt 2 2,177.0 ns/op
classicFor 100 avgt 2 22,318.0 ns/op
classicFor 1000 avgt 2 249,914.1 ns/op
enhancedFor 1 avgt 2 254.8 ns/op
enhancedFor 10 avgt 2 2,185.9 ns/op
enhancedFor 100 avgt 2 23,011.1 ns/op
enhancedFor 1000 avgt 2 262,605.5 ns/op
lambdaStream 1 avgt 2 395.7 ns/op
lambdaStream 10 avgt 2 2,829.0 ns/op
lambdaStream 100 avgt 2 27,771.4 ns/op
lambdaStream 1000 avgt 2 304,831.6 ns/op
# 2+2 CPU
parallelLambdaStream 1 avgt 2 453.5 ns/op
parallelLambdaStream 10 avgt 2 11,686.4 ns/op
parallelLambdaStream 100 avgt 2 26,798.6 ns/op
parallelLambdaStream 1000 avgt 2 189,304.3 ns/op
final List<String> list = new ArrayList<>();
@Param({"1", "10", "100", "1000"})
int length;
@Setup
public void setup()
{
for (int i = 0; i < length; i++)
{
list.add(
MessageFormat.format(
"{0},{0},{0},{0},{0},{0}", String.valueOf(i)));
}
}
@Benchmark
public void classicFor(Blackhole bh)
{
int result = 0;
for (int i = 0; i < list.size(); i++)
{
final String s = list.get(i);
if (s.startsWith("5"))
{
continue;
}
final String[] a = s.split(",");
for (int j = 0; j < a.length; j++)
{
result += Integer.valueOf(a[j]);
}
}
bh.consume(result);
}
org/sample/ForEachExpensive.java
T450s vs. GCP N1-highcpu-16core
# Lenovo T450s
Benchmark (length) Mode Cnt Score Error Units
arrayFor 1 avgt 2 238.3 ns/op
arrayFor 10 avgt 2 2,105.6 ns/op
arrayFor 100 avgt 2 21,957.8 ns/op
arrayFor 1000 avgt 2 253,262.0 ns/op
arrayFor 10000 avgt 2 2,810,799.6 ns/op
arrayFor 100000 avgt 2 31,116,545.4 ns/op
classicFor 1 avgt 2 238.1 ns/op
classicFor 10 avgt 2 2,177.0 ns/op
classicFor 100 avgt 2 22,318.0 ns/op
classicFor 1000 avgt 2 257,362.6 ns/op
classicFor 10000 avgt 2 2,809,832.8 ns/op
classicFor 100000 avgt 2 32,017,487.3 ns/op
enhancedFor 1 avgt 2 254.8 ns/op
enhancedFor 10 avgt 2 2,185.9 ns/op
enhancedFor 100 avgt 2 23,011.1 ns/op
enhancedFor 1000 avgt 2 254,691.8 ns/op
enhancedFor 10000 avgt 2 3,180,447.3 ns/op
enhancedFor 100000 avgt 2 31,479,657.4 ns/op
lambdaStream 1 avgt 2 395.7 ns/op
lambdaStream 10 avgt 2 2,829.0 ns/op
lambdaStream 100 avgt 2 27,771.4 ns/op
lambdaStream 1000 avgt 2 299,971.9 ns/op
lambdaStream 10000 avgt 2 3,181,573.5 ns/op
lambdaStream 100000 avgt 2 36,055,386.8 ns/op
parallelLambdaStream 1 avgt 2 453.5 ns/op
parallelLambdaStream 10 avgt 2 11,686.4 ns/op
parallelLambdaStream 100 avgt 2 26,798.6 ns/op
parallelLambdaStream 1000 avgt 2 193,232.6 ns/op
parallelLambdaStream 10000 avgt 2 1,956,289.8 ns/op
parallelLambdaStream 100000 avgt 2 20,628,366.2 ns/op
# Google n1-highcpu-16core
Benchmark (length) Mode Cnt Score Error Units
arrayFor 1 avgt 2 263.4 ns/op
arrayFor 10 avgt 2 2,390.4 ns/op
arrayFor 100 avgt 2 24,871.1 ns/op
arrayFor 1000 avgt 2 266,237.0 ns/op
arrayFor 10000 avgt 2 3,020,851.2 ns/op
arrayFor 100000 avgt 2 30,428,126.5 ns/op
classicFor 1 avgt 2 264.0 ns/op
classicFor 10 avgt 2 2,490.3 ns/op
classicFor 100 avgt 2 25,186.0 ns/op
classicFor 1000 avgt 2 281,254.1 ns/op
classicFor 10000 avgt 2 2882,459.1 ns/op
classicFor 100000 avgt 2 34,657,730.5 ns/op
enhancedFor 1 avgt 2 263.5 ns/op
enhancedFor 10 avgt 2 2,329.8 ns/op
enhancedFor 100 avgt 2 24,811.0 ns/op
enhancedFor 1000 avgt 2 291,544.4 ns/op
enhancedFor 10000 avgt 2 3,185,459.2 ns/op
enhancedFor 100000 avgt 2 33,898,708.6 ns/op
lambdaStream 1 avgt 2 547.9 ns/op
lambdaStream 10 avgt 2 3,841.7 ns/op
lambdaStream 100 avgt 2 28,808.7 ns/op
lambdaStream 1000 avgt 2 329,148.1 ns/op
lambdaStream 10000 avgt 2 3,684,920.1 ns/op
lambdaStream 100000 avgt 2 36,935,499.3 ns/op
parallelLambdaStream 1 avgt 2 521.7 ns/op
parallelLambdaStream 10 avgt 2 43,527.8 ns/op
parallelLambdaStream 100 avgt 2 62,834.3 ns/op
parallelLambdaStream 1000 avgt 2 110,602.3 ns/op
parallelLambdaStream 10000 avgt 2 558,666.5 ns/op
parallelLambdaStream 100000 avgt 2 5,183,311.8 ns/op
Method Calls, Inheritance, and Compiler Magic
@CompilerControl(CompilerControl.Mode.INLINE)
private static String uppercase(String s)
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static String uppercaseNoInline(String s)
@CompilerControl(CompilerControl.Mode.INLINE)
private static String lowercase(String s)
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static String lowercaseNoInline(String s)
@CompilerControl(CompilerControl.Mode.INLINE)
private static int sumOfChars(String s)
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static int sumOfCharsNoInline(String s)
@Benchmark
public void largeMethod(Blackhole bh)
{
// all the code of the methods copied here
bh.consume(sum);
}
@Benchmark
public void smallAndCallsInline(Blackhole bh)
{
final String s = strings[0];
final String s2 = uppercase(s);
final String s3 = lowercase(s2);
int sum = sumOfChars(s3);
bh.consume(sum);
}
@Benchmark
public void smallAndCallsNoInline(Blackhole bh)
{
final String s = strings[0];
final String s2 = uppercaseNoInline(s);
final String s3 = lowercaseNoInline(s2);
int sum = sumOfCharsNoInline(s3);
bh.consume(sum);
}
Calling a method is expensive
# BAT JDK 11
Benchmark Mode Cnt Score Error Units
smallAndCallsInline avgt 5 4860.736 ± 229.153 ns/op
smallAndCallsNoInline avgt 5 5672.435 ± 219.164 ns/op
largeMethod avgt 5 5294.679 ± 218.967 ns/op
# AC JDK 11
Benchmark Mode Cnt Score Error Units
smallAndCallsInline avgt 5 4501.134 ± 217.389 ns/op
smallAndCallsNoInline avgt 5 4460.731 ± 132.889 ns/op
largeMethod avgt 5 4244.479 ± 209.325 ns/op
# BAT JDK 11 again
Benchmark Mode Cnt Score Error Units
smallAndCallsInline avgt 5 5661.628 ± 268.654 ns/op
smallAndCallsNoInline avgt 5 5634.992 ± 323.852 ns/op
largeMethod avgt 5 5300.749 ± 267.469 ns/op
# AC JDK 8
Benchmark Mode Cnt Score Error Units
smallAndCallsInline avgt 5 4839.311 ± 158.454 ns/op
smallAndCallsNoInline avgt 5 4923.197 ± 93.549 ns/op
largeMethod avgt 5 4919.511 ± 223.358 ns/op
How Expensive are Virtual Method Calls?
public class MegaMorphicSimple
{
static interface C { public int apply(int x); }
static class C1 implements C
{
@Override
public int apply(int x) { return x * x; }
}
@Benchmark
public int _1_mono()
{
C c = null;
int r = random.nextInt(4);
if (r == 0) { c = new C1(); }
else if (r == 1) { c = new C1(); }
else if (r == 2) { c = new C1(); }
else if (r == 3) { c = new C1(); }
return c.apply(r);
}
@Benchmark
public int _2_bi()
{
C c = null;
int r = random.nextInt(4);
if (r == 0) { c = new C1(); }
else if (r == 1) { c = new C2(); }
else if (r == 2) { c = new C1(); }
else if (r == 3) { c = new C2(); }
return c.apply(r);
}
@Benchmark
public int _5_manualDispatch()
{
...
if (c instanceof C1)
{
return ((C1) c).apply(r);
}
else if (c instanceof C2)
{
return ((C2) c).apply(r);
}
else if (c instanceof C3)
{
return ((C3) c).apply(r);
}
else
{
return ((C4) c).apply(r);
}
}
}
Inspired by Douglas Hawkins: org.sample.MegaMorphic[Simple|Default]
How Expensive are Virtual Method Calls?
# identical method per implementing class
Benchmark Mode Cnt Score Error Units
MegaMorphicSimple._1_mono avgt 5 19.703 ± 2.177 ns/op
MegaMorphicSimple._2_bi avgt 5 21.917 ± 3.759 ns/op
MegaMorphicSimple._4_mega avgt 5 26.778 ± 4.411 ns/op
MegaMorphicSimple._5_manualDispatch avgt 5 21.861 ± 2.860 ns/op
static interface C { default int apply(int x){ return x * x; }};
static class C1 implements C {};
static class C2 implements C {};
static class C3 implements C {};
static class C4 implements C {};
# default method at interface
Benchmark Mode Cnt Score Error Units
MegaMorphicDefault._1_mono avgt 5 18.867 ± 1.325 ns/op
MegaMorphicDefault._2_bi avgt 5 21.817 ± 2.613 ns/op
MegaMorphicDefault._4_mega avgt 5 26.355 ± 2.591 ns/op
MegaMorphicDefault._5_manualDispatch avgt 5 21.763 ± 3.479 ns/op
# method got bigger and we made that non-default
Benchmark Mode Cnt Score Error Units
MegaMorphicLargeMethod._1_mono avgt 5 158.775 ± 4.971 ns/op
MegaMorphicLargeMethod._2_bi avgt 5 157.100 ± 6.153 ns/op
MegaMorphicLargeMethod._4_mega avgt 5 165.580 ± 8.669 ns/op
MegaMorphicLargeMethod._5_manualDispatch avgt 5 162.729 ± 3.795 ns/op
org.sample.MegaMorphic[Simple|Default]
So we don't have to think about virtual calls at all
public class MegaMorphicArray
{
private C[] array = new C[20_000];
@Setup
public void setup()
{
final Random r = new Random(42L);
for (int x = 0; x < 20_000; x++)
{
int i = r.nextInt(JHM_PARAM);
if (i == 0) array[x] = new C1()
else if (i == 1) array[x] = new C2();
else if (i == 2) array[x] = new C3();
else if (i == 3) array[x] = new C4();
}
}
}
static interface C { public int apply(int x); }
static class C1 implements C {
@Override public int apply(int x) { return x * x; }
} ...
@Benchmark
public int morphic()
{
int sum = 0;
for (int i = 0; i < 20_000; i++)
{
sum += array[i].apply(i);
}
return sum;
}
@Benchmark
public int peeled()
{
int sum = 0;
for (int i = 0; i < 20_000; i++) {
final C c = array[i];
if (c instanceof C1) {
sum += ((C1) c).apply(i);
}
else if (c instanceof C2) {
sum += ((C2) c).apply(i);
}
else if (c instanceof C3) {
sum += ((C3) c).apply(i);
}
else {
sum += ((C4) c).apply(i);
}
}
return sum;
}
org.sample.MegaMorphicArray
The Results are in
# JDK 11
Benchmark (params) Mode Cnt Score Units
morphic 1 avgt 5 17,705.6 ns/op
morphic 2 avgt 5 98,269.4 ns/op
morphic 3 avgt 5 334,745.9 ns/op
morphic 4 avgt 5 364,485.9 ns/op
peeled 1 avgt 5 16,990.4 ns/op
peeled 2 avgt 5 94,084.2 ns/op
peeled 3 avgt 5 123,590.3 ns/op
peeled 4 avgt 5 147,067.2 ns/op
# JDK 11 -Xint
Benchmark (params) Mode Cnt Score Units
morphic 1 avgt 5 1,682,550.7 ns/op
morphic 2 avgt 5 1,695,875.2 ns/op
morphic 3 avgt 5 1,679,643.2 ns/op
morphic 4 avgt 5 1,675,298.1 ns/op
peeled 1 avgt 5 1,764,953.0 ns/op
peeled 2 avgt 5 1,821,662.6 ns/op
peeled 3 avgt 5 1,885,678.8 ns/op
peeled 4 avgt 5 1,936,054.8 ns/op
org.sample.MegaMorphicArray
But why do we see this performance?
Benchmark (params) Mode Score Error Units
morphic 1 avgt 16,829.7 ± 1165.264 ns/op
morphic:IPC 1 avgt 3.2 #/op
morphic:L1-dcache-load-misses 1 avgt 6,263.3 #/op
morphic:L1-dcache-loads 1 avgt 40,296.8 #/op
morphic:branch-misses 1 avgt 9.1 #/op
morphic:branches 1 avgt 24,984.2 #/op
morphic:dTLB-load-misses 1 avgt 0.2 #/op
morphic:dTLB-loads 1 avgt 40,401.5 #/op
morphic 2 avgt 97,456.9 ± 1177.830 ns/op
morphic:IPC 2 avgt 0.7 #/op
morphic:L1-dcache-load-misses 2 avgt 6,472.7 #/op
morphic:L1-dcache-loads 2 avgt 43,597.9 #/op
morphic:branch-misses 2 avgt 9,850.6 #/op
morphic:branches 2 avgt 41,935.3 #/op
morphic:dTLB-load-misses 2 avgt 2.0 #/op
morphic:dTLB-loads 2 avgt 45,509.8 #/op
morphic 3 avgt 334,117.8 ± 3334.059 ns/op
morphic:IPC 3 avgt 1.1 #/op
morphic:L1-dcache-load-misses 3 avgt 6,840.3 #/op
morphic:L1-dcache-loads 3 avgt 407,654.7 #/op
morphic:branch-misses 3 avgt 13,247.0 #/op
morphic:branches 3 avgt 146,105.8 #/op
morphic:dTLB-load-misses 3 avgt 13.5 #/op
morphic:dTLB-loads 3 avgt 410,043.5 #/op
morphic 4 avgt 390,601.4 ± 40515.922 ns/op
morphic:IPC 4 avgt 0.9 #/op
morphic:L1-dcache-load-misses 4 avgt 6,740.5 #/op
morphic:L1-dcache-loads 4 avgt 408,814.3 #/op
morphic:branch-misses 4 avgt 15,099.0 #/op
morphic:branches 4 avgt 145,118.2 #/op
morphic:dTLB-load-misses 4 avgt 18.9 #/op
morphic:dTLB-loads 4 avgt 410,767.5 #/op
Benchmark (params) Mode Score Error Units
peeled 1 avgt 16,831.1 ± 816.491 ns/op
peeled:IPC 1 avgt 3.2 #/op
peeled:L1-dcache-load-misses 1 avgt 6,218.4 #/op
peeled:L1-dcache-loads 1 avgt 39,797.1 #/op
peeled:branch-misses 1 avgt 9.2 #/op
peeled:branches 1 avgt 24,854.5 #/op
peeled:dTLB-load-misses 1 avgt 0.2 #/op
peeled:dTLB-loads 1 avgt 40,133.1 #/op
peeled 2 avgt 93,347.0 ± 3607.879 ns/op
peeled:IPC 2 avgt 0.7 #/op
peeled:L1-dcache-load-misses 2 avgt 6,456.1 #/op
peeled:L1-dcache-loads 2 avgt 42,060.7 #/op
peeled:branch-misses 2 avgt 9,252.3 #/op
peeled:branches 2 avgt 41,153.9 #/op
peeled:dTLB-load-misses 2 avgt 6.0 #/op
peeled:dTLB-loads 2 avgt 42,195.1 #/op
peeled 3 avgt 123,075.7 ± 5181.245 ns/op
peeled:IPC 3 avgt 0.6 #/op
peeled:L1-dcache-load-misses 3 avgt 6,472.7 #/op
peeled:L1-dcache-loads 3 avgt 42,059.2 #/op
peeled:branch-misses 3 avgt 13,440.9 #/op
peeled:branches 3 avgt 51,950.4 #/op
peeled:dTLB-load-misses 3 avgt 5.7 #/op
peeled:dTLB-loads 3 avgt 43,097.4 #/op
peeled 4 avgt 136,765.3 ± 2792.939 ns/op
peeled:IPC 4 avgt 0.6 #/op
peeled:L1-dcache-load-misses 4 avgt 6,460.5 #/op
peeled:L1-dcache-loads 4 avgt 43,312.1 #/op
peeled:branch-misses 4 avgt 14,818.3 #/op
peeled:branches 4 avgt 62,214.6 #/op
peeled:dTLB-load-misses 4 avgt 8.0 #/op
peeled:dTLB-loads 4 avgt 42,775.6 #/op
Credits: Aleksey Shipilёv, RedHat, https://shipilev.net/jvm/anatomy-quarks/16-megamorphic-virtual-calls/
The CPU understood
[1]: https://en.wikipedia.org/wiki/Translation_lookaside_buffer
Let's try some streams first
final int SIZE = 10240;
final int[] integers = new int[SIZE];
public int lambdaArray[Cold|Warm]()
{
return Arrays.stream(integers)
.filter(i -> i % 2 == 0).sum();
}
public int lambdaIntStream[Cold|Warm]()
{
return IntStream.range(0, SIZE)
.filter(i -> i % 2 == 0).sum();
}
public int loop[Cold|Warm]()
{
int sum = 0;
for (int i = 0; i < integers.length; i++)
{
if (i % 2 == 0)
{
sum += integers[i];
}
}
return sum;
}
Benchmark Mode Cnt Score Units
Lambda01.lambdaArrayHot avgt 5 13,968.8 ns/op
Lambda01.lambdaIntStreamHot avgt 5 13,156.2 ns/op
Lambda01.loopHot avgt 5 10,146.6 ns/op
# Cold does not have JMH warmup
Benchmark Mode Cnt Score Units
Lambda01.lambdaArrayCold avgt 5 251,372.4 ns/op
Lambda01.lambdaArrayHot avgt 5 13,968.8 ns/op
Lambda01.lambdaIntStreamCold avgt 5 794,422.3 ns/op
Lambda01.lambdaIntStreamHot avgt 5 13,156.2 ns/op
Lambda01.loopCold avgt 5 15,066.7 ns/op
Lambda01.loopHot avgt 5 10,146.6 ns/op
# -Xcomp
Benchmark Mode Cnt Score Units
Lambda01.lambdaArrayCold avgt 5 24,809,394.5 ns/op
Lambda01.lambdaIntStreamCold avgt 5 23,398,581.4 ns/op
Lambda01.loopCold avgt 5 11,787.0 ns/op
Lambda01.lambdaArrayHot avgt 5 143,678.1 ns/op
Lambda01.lambdaIntStreamHot avgt 5 144,445.3 ns/op
Lambda01.loopHot avgt 5 10,109.5 ns/op
# -Xint
Benchmark Mode Cnt Score Units
Lambda01.lambdaArrayCold avgt 5 5,578,040.0 ns/op
Lambda01.lambdaArrayHot avgt 5 5,473,135.6 ns/op
Lambda01.lambdaIntStreamCold avgt 5 5,394,464.0 ns/op
Lambda01.lambdaIntStreamHot avgt 5 5,317,269.0 ns/op
Lambda01.loopCold avgt 5 229,327.9 ns/op
Lambda01.loopHot avgt 5 229,369.6 ns/op
org/sample/Lambda01.java
Compare Predicate, Pseudo-Predicate, and Direct Implementation
public int forLambda(final Predicate<Integer> p) {
int sum = 0;
for (int i = 0; i < array.length; i++) {
if (p.test(array[i])) {
sum++;
}
}
return sum;
}
public int forAnonymous(final PseudoLambda<Integer> p) {
int sum = 0;
for (int i = 0; i < array.length; i++) {
if (p.test(array[i])) {
sum++;
}
}
return sum;
}
public int simple() {
int sum = 0;
for (int i = 0; i < array.length; i++) {
if (array[i] % 2 == 0) {
sum++;
}
}
return sum;
}
@Benchmark
public void forSimple(Blackhole bh)
{
bh.consume(simple());
}
@Benchmark
public void forLambdaPredicate(Blackhole bh)
{
bh.consume(forLambda(i -> i % 2 == 0));
}
interface PseudoLambda<T>
{
boolean test(T t);
}
@Benchmark
public void forAnonymousClass(Blackhole bh)
{
bh.consume(
forAnonymous(new PseudoLambda<Integer>()
{
@Override
public boolean test(Integer i)
{
return i % 2 == 0;
}
})
);
}
org/sample/Lambda02.java
Benchmark Mode Cnt Score Units
forAnonymousClass avgt 2 152.9 ns/op
forLambdaPredicate avgt 2 128.8 ns/op
forLambdaPredicatePredefined avgt 2 127.9 ns/op
forSimple avgt 2 128.1 ns/op
# -Xint
Benchmark Mode Cnt Score Units
forAnonymousClass avgt 2 21,312.5 ns/op
forLambdaPredicate avgt 2 20,154.8 ns/op
forLambdaPredicatePredefined avgt 2 19,871.6 ns/op
forSimple avgt 2 3,471.7 ns/op
# No warmup and just 5x measurements
Benchmark Mode Cnt Score Units
forAnonymousClass ss 5 92,787.6 ns/op
forLambdaPredicate ss 5 122,377.4 ns/op
forLambdaPredicatePredefined ss 5 50,435.6 ns/op
forSimple ss 5 16,268.6 ns/op
org/sample/Lambda02.java
public class GCAndAccessSpeed
{
private final static int SIZE = 100_000;
private final List<String> STRINGS = new ArrayList<>();
private final List<String> ordered = new ArrayList<>();
private final List<String> nonOrdered = new ArrayList<>();
@Param({"false", "true"}) boolean gc;
@Param({"1", "10"}) int COUNT;
@Param({"false", "true"}) boolean drop;
@Setup
public void setup() throws InterruptedException
{
final FastRandom r = new FastRandom(7);
for (int i = 0; i < COUNT * SIZE; i++)
{
STRINGS.add(RandomUtils.randomString(r, 1, 20));
}
for (int i = 0; i < SIZE; i++)
{
ordered.add(STRINGS.get(i * COUNT));
}
nonOrdered.addAll(ordered);
Collections.shuffle(nonOrdered, new Random(r.nextInt()));
if (drop)
{
STRINGS.clear();
}
if (gc)
{
for (int c = 0; c < 5; c++)
{
System.gc();
TimeUnit.SECONDS.sleep(2);
}
}
}
@Benchmark
public int walk[Ordered|NonOrdered]()
{
int sum = 0;
for (int i = 0; i < [ordered|nonOrdered].size(); i++)
{
sum += [ordered|nonOrdered].get(i).length();
}
return sum;
}
}
Applied Memory Knowledge
# G1 GC
Benchmark (COUNT) (drop) (gc) Mode Cnt Score Units
walkNonOrdered 1 false false avgt 5 1,596,315.2 ns/op
walkOrdered 1 false false avgt 5 611,137.7 ns/op
walkNonOrdered 1 false true avgt 5 1,172,951.3 ns/op
walkOrdered 1 false true avgt 5 431,143.5 ns/op
walkNonOrdered 1 true false avgt 5 1,562,844.1 ns/op
walkOrdered 1 true false avgt 5 605,119.4 ns/op
walkNonOrdered 1 true true avgt 5 1,243,973.9 ns/op
walkOrdered 1 true true avgt 5 400,721.9 ns/op
walkNonOrdered 10 false false avgt 5 1,903,731.9 ns/op
walkOrdered 10 false false avgt 5 1,229,945.1 ns/op
walkNonOrdered 10 false true avgt 5 2,026,861.7 ns/op
walkOrdered 10 false true avgt 5 1,809,961.9 ns/op
walkNonOrdered 10 true false avgt 5 1,920,658.4 ns/op
walkOrdered 10 true false avgt 5 1,239,658.5 ns/op
walkNonOrdered 10 true true avgt 5 1,160,229.2 ns/op
walkOrdered 10 true true avgt 5 403,949.5 ns/op
# -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC
Benchmark (COUNT) (drop) (gc) Mode Cnt Score Units
walkNonOrdered 1 false false avgt 5 1,667,611.7 ns/op
walkNonOrdered 1 false true avgt 5 1,820,968.2 ns/op
walkNonOrdered 1 true false avgt 5 1,928,822.7 ns/op
walkNonOrdered 1 true true avgt 5 1,777,251.4 ns/op
walkOrdered 1 false false avgt 5 931,728.5 ns/op
walkOrdered 1 false true avgt 5 902,433.7 ns/op
walkOrdered 1 true false avgt 5 930,294.3 ns/op
walkOrdered 1 true true avgt 5 907,886.5 ns/op
org.sample.GCAndAccessSpeed
Inspired by: Aleksey Shipilёv, Moving GC and Locality
There are so many more things to discover
null
checksRule 1
Don't worry too much about performance. Java is very good in taking care of it.
Rule 2
Don't optimize prematurely and don't micro-tune.
Rule 3
Simply write clean and good code. You won't see most of the issues ever. Think before you code. Java can make code fast, but bad algorithms stay bad algorithms, bad designs stays bad designs.
Rule 4
When you worry, worry about the hottest code. Measure carefully and apply what you learned. The compiler and the CPU are your friends but make also your profiling life harder.
Rule 5
Understand, measure, review, test, and measure again.
Rule 6
The code you write is not the code that is executed. Measure under production conditions to understand and avoid optimizing non-compiled and non-optimized code.
If your response to problems normally is: "That must be garbage collection."... get a different job.
When you want to play with the code
https://github.com/Xceptance/jmh-jmm-training