About the Smart Java Internals
René Schwietzke, Xceptance GmbH, Jena, Germany
@ReneSchwietzke
www.xceptance.com
Always attribute other people's work
More was taken from JDK sources, tweets, and related talks and videos
Java is much more than what most people think. Spoilers!
The times when Java was a bytecode interpreter, running slow and sluggish, are long gone. But most people don't know the magic that is going on to speed up Java to native code speed and beyond.
Actually, most programmers don't even know how their code gets executed or how a computer works... sadly.
A lot of the material applies to other languages, even C, as well.
Optimization and tuning is fun... but...
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
Donald Knuth, Turing Award Lecture 1974
StringBuilder
instead of... this talk is not for you.
Java is efficient, but the why has to be explained.
Source: https://sites.google.com/view/energy-efficiency-languages/home
Get Some Measurements Going
public class PoorMansBenchmark
{
private static String append(final int i)
{
final StringBuilder sb = new StringBuilder();
sb.append(System.currentTimeMillis());
sb.append("Foo");
sb.append(i);
sb.append('c');
sb.append("lkJAHJ AHSDJHF KJASHF HSAKJFD");
final String s = sb.toString();
char[] ch = s.toCharArray();
Arrays.sort(ch);
return new String(ch);
}
public static void main (final String[] args)
{
final int SIZE = 100_000;
final long[] result = new long[SIZE];
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
final Timer t = Timer.startTimer();
sum += append(i).length();
result[i] = t.stop().runtimeNanos();
}
// pretend we need it but we really don't
// just to avoid optimizations (spoilers, duh!)
if (sum < 0)
{
System.out.println(sum);
}
}
}
Just benchmark something...
StringBuilder
and String
https://github.com/Xceptance/jmh-jmm-training: org/sample/PoorMansBenchmark.java
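The Timer helper used in the benchmark is not part of the JDK; a minimal sketch of what it might look like, with the class and method names assumed from the usage above (the real helper in the repo may differ):

```java
// Minimal Timer sketch based on how PoorMansBenchmark uses it:
// Timer.startTimer() ... t.stop().runtimeNanos()
class Timer
{
    private long start;
    private long runtime;

    public static Timer startTimer()
    {
        final Timer t = new Timer();
        t.start = System.nanoTime();
        return t;
    }

    public Timer stop()
    {
        // elapsed wall-clock time since startTimer()
        runtime = System.nanoTime() - start;
        return this;
    }

    public long runtimeNanos()
    {
        return runtime;
    }
}
```

Note that `System.nanoTime()` gives elapsed time with nanosecond resolution but not necessarily nanosecond accuracy; for anything serious, use JMH instead of a poor man's timer like this.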
# Size: 10
1, 437,607
2, 22,873
3, 18,180
4, 17,597
5, 17,698
6, 18,582
7, 17,607
8, 18,200
9, 19,798
10, 19,224
# Size: 100
1, 516,854
2, 32,120
3, 18,300
4, 17,633
5, 17,126
6, 20,441
7, 17,503
8, 18,101
9, 17,483
10, 17,085
20, 18,877
30, 18,818
40, 18,301
50, 23,070
60, 18,888
70, 21,198
80, 12,854
90, 11,948
100, 11,879
# Size: 1,000
1, 402,289
2, 22,654
3, 18,931
4, 18,435
5, 17,494
6, 19,335
7, 17,798
8, 18,991
9, 18,162
10, 17,625
20, 19,639
30, 19,123
40, 19,255
50, 20,874
60, 18,435
70, 18,070
80, 13,012
90, 12,678
100, 12,566
200, 10,230
300, 7,902
400, 5,919
500, 4,765
600, 4,451
700, 4,199
800, 3,865
900, 3,956
1,000, 3,339
# Size: 10,000
1, 714,799
2, 22,859
3, 18,531
4, 17,673
5, 17,444
6, 18,501
7, 17,245
8, 18,362
9, 17,224
10, 17,244
100, 12,138
200, 12,647
300, 7,421
400, 6,503
500, 4,678
600, 4,259
700, 4,099
800, 3,960
900, 3,571
1,000, 3,620
2,000, 2,224
3,000, 1,666
4,000, 1,616
5,000, 1,645
6,000, 1,167
7,000, 1,207
8,000, 1,137
9,000, 1,098
10,000, 987
# Size: 1,000,000
1, 719,086
2, 36,834
3, 26,186
4, 25,529
5, 25,329
6, 27,658
7, 25,488
8, 26,424
9, 26,683
10, 26,623
100, 17,835
1,000, 4,877
10,000, 1,125
20,000, 1,035
30,000, 766
40,000, 657
50,000, 667
60,000, 627
70,000, 497
80,000, 518
90,000, 498
100,000, 1,184
200,000, 437
300,000, 428
400,000, 448
500,000, 418
600,000, 537
700,000, 548
800,000, 438
900,000, 478
1,000,000, 438
All in nanoseconds (ns) on a T14s with AMD Ryzen 7 PRO 4750U, BAT, Performance Governor
That must be Garbage Collection
-XX:+UseEpsilonGC
# G1 GC
# Size: 10,000
1, 615,818
2, 24,265
3, 18,772
4, 17,692
5, 17,712
6, 19,408
7, 17,884
8, 17,873
9, 17,793
10, 17,601
100, 17,671
200, 8,785
300, 6,210
400, 5,110
500, 5,302
600, 3,636
700, 4,040
800, 2,908
900, 2,464
1,000, 1,969
2,000, 1,474
3,000, 1,576
4,000, 1,606
5,000, 1,505
6,000, 1,515
7,000, 1,495
8,000, 1,514
9,000, 1,384
10,000, 858
# No GC
# Size: 10,000
1, 847,466
2, 65,884
3, 37,987
4, 36,217
5, 40,327
6, 36,357
7, 35,896
8, 36,267
9, 35,967
10, 36,467
100, 19,918
200, 11,039
300, 3,970
400, 3,620
500, 3,669
600, 3,600
700, 3,370
800, 3,039
900, 2,840
1,000, 2,739
2,000, 1,650
3,000, 1,410
4,000, 1,430
5,000, 1,399
6,000, 1,040
7,000, 1,029
8,000, 1,020
9,000, 1,040
10,000, 1,050
# ShenandoahGC
# Size: 10,000
1, 599,049
2, 42,120
3, 17,912
4, 16,799
5, 16,263
6, 18,672
7, 17,164
8, 16,435
9, 16,577
10, 16,658
100, 12,195
200, 4,706
300, 4,220
400, 7,661
500, 6,355
600, 6,690
700, 4,513
800, 5,515
900, 5,283
1,000, 3,167
2,000, 3,350
3,000, 3,835
4,000, 3,299
5,000, 1,468
6,000, 1,457
7,000, 1,001
8,000, 1,012
9,000, 1,305
10,000, 921
Java 17
Java is slow
native-image
# Java 17
# Size: 100,000
1, 465,864
2, 21,818
3, 18,809
4, 17,809
5, 18,052
6, 22,413
7, 17,547
8, 19,112
9, 17,466
10, 17,456
100, 12,419
1,000, 3,736
2,000, 2,312
3,000, 1,464
4,000, 1,454
5,000, 1,081
6,000, 1,091
7,000, 1,080
8,000, 1,080
9,000, 989
10,000, 1,009
20,000, 1,757
30,000, 646
40,000, 565
50,000, 596
60,000, 556
70,000, 495
80,000, 494
90,000, 484
100,000, 414
# Graal 22 / Native
# Size: 100,000
1, 11,251
2, 2,798
3, 1,933
4, 1,771
5, 1,630
6, 1,570
7, 1,590
8, 1,540
9, 3,935
10, 1,600
100, 1,670
1,000, 2,023
2,000, 1,681
3,000, 1,660
4,000, 1,670
5,000, 1,429
6,000, 1,369
7,000, 3,733
8,000, 1,610
9,000, 1,399
10,000, 1,419
20,000, 574
30,000, 1,389
40,000, 574
50,000, 563
60,000, 724
70,000, 564
80,000, 554
90,000, 554
100,000, 564
The Data is so Inconsistent
6 hours lost due to a BIOS bug
Issues with the clocksource on T14s BIOS 1.30: tsc
was not stable and got disabled
Greetings from the Java Internals
javac
Just to get a feel for the complexity
[1] src/share/vm/runtime/advancedThresholdPolicy.hpp
[2] Douglas Hawkins, JVM Mechanics, https://www.youtube.com/watch?v=a-U0so9FfqQ&t=2036s
-XX:-TieredCompilation
, -Xint
, -Xcomp
Compile when Needed vs. Right from the Start
-Xcomp
forces Java to always compile to native code
# defaults
1, 560,960
2, 34,772
3, 26,167
4, 24,237
5, 35,926
6, 24,116
7, 23,163
8, 22,895
9, 24,142
10, 23,174
100, 23,596
200, 12,973
300, 12,361
400, 9,852
500, 9,596
600, 8,774
700, 8,956
800, 7,280
900, 7,076
1000, 6,089
10000, 5,071
20000, 2,683
30000, 2,703
40000, 1,532
50000, 1,121
60000, 1,012
70000, 975
80000, 979
90000, 987
100000, 976
900000, 700
1000000, 708
# -Xcomp
1, 40,041,490
2, 51,254
3, 17,217
4, 2,077
5, 1,714
6, 1,607
7, 1,610
8, 1,536
9, 1,576
10, 1,536
100, 1,376
200, 2,481
300, 1,284
400, 1,307
500, 1,150
600, 1,177
700, 1,173
800, 1,163
900, 1,174
1000, 1,546
10000, 1,096
20000, 962
30000, 999
40000, 1,093
50000, 891
60000, 883
70000, 885
80000, 862
90000, 894
100000, 869
900000, 871
1000000, 870
# -Xint
1, 443,537
2, 20,145
3, 18,862
4, 18,324
5, 16,484
6, 16,345
7, 16,494
8, 16,186
9, 16,563
10, 16,513
100, 17,340
200, 17,091
300, 16,782
400, 17,111
500, 17,240
600, 17,608
700, 17,012
800, 18,832
900, 16,763
1,000, 16,902
2,000, 16,942
3,000, 17,609
4,000, 16,822
5,000, 17,042
6,000, 16,991
7,000, 17,021
8,000, 16,683
9,000, 16,653
10,000, 18,891
100,000, 17,558
200,000, 20,303
300,000, 21,516
400,000, 19,050
500,000, 19,477
600,000, 19,676
700,000, 19,775
800,000, 19,517
900,000, 19,665
# JIT 1
1, 439,989
2, 20,364
3, 16,531
4, 15,973
5, 15,811
6, 16,490
7, 15,841
8, 16,095
9, 15,619
10, 15,172
100, 10,213
200, 10,892
300, 8,286
400, 4,888
500, 3,682
600, 2,647
700, 2,292
800, 1,917
900, 1,714
1,000, 974
2,000, 700
3,000, 689
4,000, 700
5,000, 699
6,000, 689
7,000, 700
8,000, 689
9,000, 680
10,000, 690
100,000, 629
200,000, 761
300,000, 852
400,000, 771
500,000, 832
600,000, 761
700,000, 830
800,000, 829
900,000, 809
# JIT 1-2
1, 469,922
2, 21,275
3, 18,368
4, 17,044
5, 17,681
6, 18,457
7, 17,532
8, 17,940
9, 17,143
10, 17,472
100, 11,498
200, 11,717
300, 7,069
400, 4,321
500, 3,175
600, 3,325
700, 2,121
800, 1,603
900, 1,503
1,000, 1,264
2,000, 995
3,000, 1,005
4,000, 995
5,000, 1,005
6,000, 1,005
7,000, 1,005
8,000, 996
9,000, 996
10,000, 1,005
100,000, 976
200,000, 1,185
300,000, 1,185
400,000, 1,204
500,000, 1,264
600,000, 1,235
700,000, 1,234
800,000, 1,294
900,000, 1,284
# JIT 1-3
1, 378,533
2, 21,978
3, 18,809
4, 17,991
5, 17,364
6, 18,451
7, 17,962
8, 17,752
9, 17,324
10, 17,544
100, 11,692
200, 12,050
300, 10,347
400, 6,070
500, 4,366
600, 3,947
700, 3,339
800, 2,791
900, 2,403
1,000, 2,193
2,000, 1,425
3,000, 1,445
4,000, 1,435
5,000, 1,436
6,000, 1,445
7,000, 1,425
8,000, 1,415
9,000, 1,425
10,000, 1,426
100,000, 1,425
200,000, 1,735
300,000, 1,764
400,000, 1,924
500,000, 1,753
600,000, 1,724
700,000, 1,704
800,000, 1,724
900,000, 1,784
# JIT 1-4
1, 489,002
2, 30,301
3, 18,479
4, 17,639
5, 17,659
6, 19,248
7, 17,509
8, 18,838
9, 17,109
10, 17,309
100, 11,933
200, 12,173
300, 7,425
400, 6,346
500, 5,366
600, 4,547
700, 4,148
800, 3,867
900, 3,527
1,000, 3,547
2,000, 2,229
3,000, 1,409
4,000, 2,119
5,000, 1,429
6,000, 1,049
7,000, 1,039
8,000, 1,039
9,000, 1,059
10,000, 1,039
100,000, 379
200,000, 460
300,000, 459
400,000, 519
500,000, 529
600,000, 470
700,000, 470
800,000, 469
900,000, 499
# No Tiered
1, 772,154
2, 50,772
3, 33,622
4, 29,284
5, 29,294
6, 30,371
7, 29,155
8, 29,405
9, 28,786
10, 29,075
100, 27,131
200, 26,174
300, 26,871
400, 27,300
500, 28,986
600, 35,606
700, 23,900
800, 27,799
900, 15,485
1,000, 14,826
2,000, 7,626
3,000, 5,093
4,000, 4,007
5,000, 3,409
6,000, 3,559
7,000, 3,199
8,000, 2,931
9,000, 2,323
10,000, 2,123
100,000, 399
200,000, 558
300,000, 539
400,000, 528
500,000, 548
600,000, 528
700,000, 528
800,000, 578
900,000, 548
-XX:TieredStopAtLevel=X
1157 104 % 3 org.sample.poor.PoorMansBenchmark::main @ 17 (470 bytes)
1162 105 3 org.sample.poor.PoorMansBenchmark::main (470 bytes)
1328 135 3 org.sample.poor.PoorMansBenchmark::append (66 bytes)
1454 153 4 org.sample.poor.PoorMansBenchmark::append (66 bytes)
1561 178 % 4 org.sample.poor.PoorMansBenchmark::main @ 55 (470 bytes)
1566 135 3 org.sample.poor.PoorMansBenchmark::append (66 bytes) made not entrant
6002 178 % 4 org.sample.poor.PoorMansBenchmark::main @ 55 (470 bytes) made not entrant
6069 367 % 3 org.sample.poor.PoorMansBenchmark::main @ 100 (470 bytes)
6082 407 % 4 org.sample.poor.PoorMansBenchmark::main @ 100 (470 bytes)
-XX:+PrintCompilation
, log format can change
Things to Know when Executing or Profiling Code
You trained it well
private final static int SIZE = 1000;
public static void main(String[] args)
{
Object trap = null;
Object o = null;
for (int i = 0; i < 4000; i++)
{
final Timer t = Timer.startTimer();
for (int j = 0; j < SIZE; j++)
{
// burn time and train that null is normal
o = new Object();
if (trap != null)
{
System.out.println("Got you." + o);
trap = null;
}
}
// Give me a Non-Null, Vasily.
// One Non-Null only, please.
if (i == 2000)
{
trap = new Object();
}
System.out.println(
MessageFormat.format("{1} {0, number, #}",
t.stop().runtimeNanos(), i));
}
}
Example by Douglas Hawkins: org/sample/AllocationTrap.java
If I don't need it, I remove it
One null or one exception can turn your performance upside down.
You can horribly shoot yourself in the foot.
/**
* JMH Benchmark
* Vary SIZE from 1 to 10000
*/
public int nonSense()
{
Object trap = null;
Object o = null;
for (int i = 0; i < OUTER; i++)
{
for (int j = 0; j < SIZE; j++)
{
o = new Object();
if (trap != null)
{
o = new Object();
trap = null;
}
}
if (i == 500)
{
trap = new Object();
}
}
return trap == null ? 1 : 0;
}
# One-Shot Execution aka measure one execution, no warm-up
Benchmark (SIZE) Mode Cnt Score Error Units
RemoveNonSenseJMH.nonSenseSingle 1 ss 85,298 ns/op
RemoveNonSenseJMH.nonSenseSingle 10 ss 513,886 ns/op
RemoveNonSenseJMH.nonSenseSingle 100 ss 3,603,910 ns/op
RemoveNonSenseJMH.nonSenseSingle 1000 ss 9,465,274 ns/op
RemoveNonSenseJMH.nonSenseSingle 10000 ss 17,528,758 ns/op
# Repeated Execution, No Warm up, 100 ms duration
Benchmark (SIZE) Mode Cnt Score Error Units
RemoveNonSenseJMH.nonSenseNoWarmup 1 avgt 963 ns/op
RemoveNonSenseJMH.nonSenseNoWarmup 10 avgt 600 ns/op
RemoveNonSenseJMH.nonSenseNoWarmup 100 avgt 597 ns/op
RemoveNonSenseJMH.nonSenseNoWarmup 1000 avgt 697 ns/op
RemoveNonSenseJMH.nonSenseNoWarmup 10000 avgt 2,258 ns/op
# Warm it Up First, 100 ms duration
Benchmark (SIZE) Mode Cnt Score Error Units
RemoveNonSenseJMH.nonSenseWarmup 1 avgt 660 ns/op
RemoveNonSenseJMH.nonSenseWarmup 10 avgt 609 ns/op
RemoveNonSenseJMH.nonSenseWarmup 100 avgt 409 ns/op
RemoveNonSenseJMH.nonSenseWarmup 1000 avgt 439 ns/op
RemoveNonSenseJMH.nonSenseWarmup 10000 avgt 410 ns/op
# -Xint to verify
Benchmark (SIZE) Mode Cnt Score Error Units
RemoveNonSenseJMH.nonSenseWarmup 1 avgt 113,536 ns/op
RemoveNonSenseJMH.nonSenseWarmup 10 avgt 1,110,772 ns/op
RemoveNonSenseJMH.nonSenseWarmup 100 avgt 11,449,399 ns/op
RemoveNonSenseJMH.nonSenseWarmup 1000 avgt 112,158,390 ns/op
RemoveNonSenseJMH.nonSenseWarmup 10000 avgt 1,104,793,429 ns/op
org.sample.RemoveNonSenseJMH
Now We Can Really Talk About Performance
private int next(final int n) {
// returns 1 for n == 1, but that is not a compile-time constant
return r.nextInt(n) + 1;
}
@Benchmark
public int classic() {
int s1 = 1;
final int step = next(1);
for (int i = 0; i < SIZE; i = i + 1) {
s1 += i;
}
return s1;
}
@Benchmark
public int variable() {
int s1 = 1;
final int step = next(1);
for (int i = 0; i < SIZE; i = i + step) {
s1 += i;
}
return s1;
}
@Benchmark
public int manualUnroll() {
int s1 = 1;
final int step = next(1) + 9;
for (int i = 0; i < SIZE; i = i + step) {
s1 += i;
s1 += i + 1;
...
s1 += i + 9;
}
return s1;
}
Flatten out loops
# Benchmark (size) Mode Cnt Score Units
classic 10000 avgt 4 2,563.905 ns/op
variable 10000 avgt 4 5,125.580 ns/op
manualUnroll 10000 avgt 4 2,646.231 ns/op
# -Xint
classic 10000 avgt 4 89,497.225 ns/op
variable 10000 avgt 4 88,081.687 ns/op
manualUnroll 10000 avgt 4 50,411.112 ns/op
classic 10000 avgt 2,593 ns/op
classic:branches 10000 avgt 706 #/op
classic:L1-dcache-loads 10000 avgt 184 #/op
classic:instructions 10000 avgt 17,452 #/op
classic:cycles 10000 avgt 10,860 #/op
classic:IPC 10000 avgt 1.607 #/op
manualUnroll 10000 avgt 2,660 ns/op
manualUnroll:branches 10000 avgt 1,061 #/op
manualUnroll:L1-dcache-loads 10000 avgt 2,189 #/op
manualUnroll:instructions 10000 avgt 22,585 #/op
manualUnroll:cycles 10000 avgt 11,321 #/op
manualUnroll:IPC 10000 avgt 1.972 #/op
variable 10000 avgt 5,263 ns/op
variable:branches 10000 avgt 10,059 #/op
variable:L1-dcache-loads 10000 avgt 20,144 #/op
variable:instructions 10000 avgt 90,128 #/op
variable:cycles 10000 avgt 21,595 #/op
variable:IPC 10000 avgt 4.167 #/op
org/sample/LoopUnroll.java
Everything happens in parallel
https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake
* depends on the type and generation
Make it less expensive
@Benchmark
public long array64()
{
int[] a = new int[64];
a[0] = r.nextInt();
a[1] = r.nextInt();
return a[0] + a[1];
}
@Benchmark
public long array65()
{
int[] a = new int[65];
a[0] = r.nextInt();
a[1] = r.nextInt();
return a[0] + a[1];
}
array64 avgt 2 10.598 ns/op
array65 avgt 2 31.383 ns/op
# -prof gc
array64 avgt 2 10.598 ns/op
array64:·gc.alloc.rate avgt 2 ≈ 10⁻⁴ MB/sec
array64:·gc.alloc.rate.norm avgt 2 ≈ 10⁻⁶ B/op
array65 avgt 2 31.383 ns/op
array65:·gc.alloc.rate avgt 2 7732.432 MB/sec
array65:·gc.alloc.rate.norm avgt 2 280.000 B/op
array64Returned avgt 2 30.605 ns/op
array64Returned:·gc.alloc.rate avgt 2 7703.857 MB/sec
array64Returned:·gc.alloc.rate.norm avgt 2 272.000 B/op
# -XX:EliminateAllocationArraySizeLimit=128
array65 avgt 2 10.569 ns/op
array65:·gc.alloc.rate avgt 2 ≈ 10⁻⁴ MB/sec
array65:·gc.alloc.rate.norm avgt 2 ≈ 10⁻⁶ B/op
array65:·gc.count avgt 2 ≈ 0 counts
array64 avgt 2 10.474 ns/op
array64:IPC avgt 1.555 #/op
array64:L1-dcache-load-misses avgt 0.005 #/op
array64:L1-dcache-loads avgt 26.915 #/op
array64:branches avgt 9.137 #/op
array64:cycles avgt 44.428 #/op
array64:instructions avgt 69.077 #/op
array65 avgt 2 30.526 ns/op
array65:IPC avgt 0.968 #/op
array65:L1-dcache-load-misses avgt 4.427 #/op
array65:L1-dcache-loads avgt 47.829 #/op
array65:branches avgt 19.767 #/op
array65:cycles avgt 130.389 #/op
array65:instructions avgt 126.335 #/op
org/sample/EscapeAnalysis.java
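The array64Returned case shows up in the -prof gc table above without its source; a plausible sketch (the exact body in the repo may differ) is the 64-element variant where the array escapes by being returned, which defeats scalar replacement and makes the ~272 B/op allocation visible again:

```java
import java.util.Random;

class EscapeSketch
{
    private final Random r = new Random();

    // Same size as array64, but returning the array lets it
    // escape the method, so escape analysis cannot eliminate
    // the allocation even though the size is below the limit.
    public int[] array64Returned()
    {
        final int[] a = new int[64];
        a[0] = r.nextInt();
        a[1] = r.nextInt();
        return a;
    }
}
```

Escape analysis only removes an allocation when the object provably never leaves the compiled scope; size (EliminateAllocationArraySizeLimit) and escape are independent conditions, and both have to hold.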
Native code to the rescue
final int SIZE = 100_000;
final int[] src = new int[SIZE];
public int[] manualArrayCopy()
{
final int[] target = new int[SIZE];
for (int i = 0; i < SIZE; i++)
{
target[i] = src[i];
}
return target;
}
public int[] systemArrayCopy()
{
final int[] target = new int[SIZE];
System.arraycopy(src, 0, target, 0, SIZE);
return target;
}
public int[] arraysCopy()
{
return Arrays.copyOf(src, src.length);
}
# Benchmark Mode Cnt Score Error Units
Manual avgt 2 44,112 ns/op
System avgt 2 44,893 ns/op
Arrays avgt 2 45,300 ns/op
Iteration Manual System
===============================================
Iteration 1: 2,061,452 ns/op 77,621 ns/op
Iteration 2: 592,751 ns/op 120,478 ns/op
Iteration 3: 477,828 ns/op 103,568 ns/op
Iteration 4: 479,590 ns/op 102,592 ns/op
Iteration 5: 482,536 ns/op 151,151 ns/op
Iteration 6: 482,785 ns/op 87,365 ns/op
Iteration 7: 710,620 ns/op 187,857 ns/op
...
Iteration 80: 46,009 ns/op 66,394 ns/op
org/sample/IntrinsicsArrayCopy.java
Native Code vs. Intrinsics
# -XX:+PrintCompilation -XX:+PrintInlining
@ 54 java.lang.Math::min (11 bytes) (intrinsic)
@ 57 java.lang.System::arraycopy (0 bytes) (intrinsic)
public static native void arraycopy(...)
can be a native call or an intrinsic
java.lang.Math
@HotSpotIntrinsicCandidate
http://pzemtsov.github.io/2015/11/21/crc32-intrinsic-java-or-c-with-jni.html
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/classfile/vmSymbols.hpp
https://www.lmax.com/blog/staff-blogs/2016/03/30/notes-hotspot-compiler-flags
# Lenovo T530, JDK 8
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 48,177 ns/op
IntrinsicsArrayCopy.system avgt 2 48,678 ns/op
# Lenovo T450s, JDK 8, BAT
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 66,560 ns/op
IntrinsicsArrayCopy.system avgt 2 67,051 ns/op
IntrinsicsArrayCopy.arrays avgt 2 60,674 ns/op
# Lenovo T450s, JDK 11, BAT
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 67,476 ns/op
IntrinsicsArrayCopy.system avgt 2 58,952 ns/op
IntrinsicsArrayCopy.arrays avgt 2 58,771 ns/op
# Lenovo T450s, JDK 11, AC
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 56,108 ns/op
IntrinsicsArrayCopy.system avgt 2 55,748 ns/op
IntrinsicsArrayCopy.arrays avgt 2 53,769 ns/op
# Lenovo T450s, JDK 11, BAT, Min Power
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 215,360 ns/op
IntrinsicsArrayCopy.system avgt 2 210,971 ns/op
IntrinsicsArrayCopy.arrays avgt 2 214,870 ns/op
# Lenovo T14s, AMD, JDK 11, AC
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.arrays avgt 2 45,295 ns/op
IntrinsicsArrayCopy.manual avgt 2 44,141 ns/op
IntrinsicsArrayCopy.system avgt 2 44,861 ns/op
# Lenovo T14s, AMD, JDK 11, AC, SMT off, Boost off
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.arrays avgt 2 52,025 ns/op
IntrinsicsArrayCopy.manual avgt 2 59,333 ns/op
IntrinsicsArrayCopy.system avgt 2 51,319 ns/op
# AWS c5.xlarge, JDK 11
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 83,066 ns/op
IntrinsicsArrayCopy.system avgt 2 63,422 ns/op
IntrinsicsArrayCopy.arrays avgt 2 64,748 ns/op
# GCP n1-standard-4, JDK 11
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 146,932 ns/op
IntrinsicsArrayCopy.system avgt 2 101,970 ns/op
IntrinsicsArrayCopy.arrays avgt 2 116,233 ns/op
# GCP n1-highcpu-8, JDK 11
Benchmark Mode Cnt Score Units
IntrinsicsArrayCopy.manual avgt 2 82,865 ns/op
IntrinsicsArrayCopy.system avgt 2 62,810 ns/op
IntrinsicsArrayCopy.arrays avgt 2 61,237 ns/op
org.sample.IntrinsicsArrayCopy_AverageTime
Let's create some Strings
@Setup
public void setup() {
foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
}
public String plain() {
return "a" + foo;
}
public String concat() {
return "a".concat(foo);
}
public String builder() {
return new StringBuilder().append("a").append(foo).toString();
}
public String builderSized() {
return new StringBuilder(70).append("a").append(foo).toString();
}
public String builderNonFluid() {
final StringBuilder sb = new StringBuilder();
sb.append("a");
sb.append(foo);
return sb.toString();
}
public String builderNonFluidSized() {
final StringBuilder sb = new StringBuilder(70);
sb.append("a");
sb.append(foo);
return sb.toString();
}
# JDK 11
# Benchmark Mode Cnt Score Units
Strings1.builder avgt 2 12.822 ns/op
Strings1.builderSized avgt 2 13.055 ns/op
Strings1.plain avgt 2 13.203 ns/op
Strings1.concat avgt 2 16.199 ns/op
Strings1.builderNonFluidSized avgt 2 22.070 ns/op
Strings1.builderNonFluid avgt 2 24.335 ns/op
# JDK 17
Strings1.builderSized avgt 2 10.635 ns/op
Strings1.plain avgt 2 10.979 ns/op
Strings1.concat avgt 2 11.121 ns/op
Strings1.builder avgt 2 12.929 ns/op
Strings1.builderNonFluidSized avgt 2 19.207 ns/op
Strings1.builderNonFluid avgt 2 24.035 ns/op
# -Xint
Benchmark Mode Cnt Score Units
Strings1.concat avgt 2 809.838 ns/op
Strings1.builderNonFluidSized avgt 2 2,575.899 ns/op
Strings1.builderSized avgt 2 2,738.723 ns/op
Strings1.builderNonFluid avgt 2 2,969.153 ns/op
Strings1.plain avgt 2 2,982.004 ns/op
Strings1.builder avgt 2 3,115.392 ns/op
Inspired by Douglas Q Hawkins: org.sample.Strings1
Hardware does not matter... you wish!
final int SIZE = 1_000_000;
final int[] src = new int[100_000];
@Benchmark
public int step1()
{
int sum = 0;
for (int i = 0; i < 100_000; i++)
{
sum += src[i];
}
return sum;
}
@Benchmark
public int step20()
{
int sum = 0;
for (int i = 0; i < 100_000; i = i + 20)
{
sum += src[i];
}
return sum;
}
step20
must be faster, about 20x
Benchmark Mode Cnt Score Units
ArraysAndHardware.step1 avgt 2 24,526.814 ns/op
ArraysAndHardware.step20 avgt 2 2,609.979 ns/op
org.sample.ArraysAndHardware.java
Java, too, has to live by hardware laws
Benchmark Mode Cnt Score Error Units
step1 avgt 2 24,562.946 ns/op
step1:IPC avgt 1.173 #/op
step1:CPI avgt 0.852 #/op
step1:L1-dcache-load-misses avgt 6,324.421 #/op
step1:L1-dcache-loads avgt 101,184.005 #/op
step1:branch-misses avgt 9.841 #/op
step1:branches avgt 6,417.104 #/op
step1:cycles avgt 103,033.534 #/op
step1:instructions avgt 120,883.682 #/op
step1:stalled-cycles-backend avgt 82,888.746 #/op
step1:stalled-cycles-frontend avgt 96.538 #/op
18.593.601.312 L1-dcache-loads # 3,995 G/sec
1.140.028.415 L1-dcache-load-misses # 6,13% of all L1-dcache accesses
Benchmark Mode Cnt Score Error Units
step20 avgt 2 2,632.399 ns/op
step20:IPC avgt 1.855 #/op
step20:CPI avgt 0.539 #/op
step20:L1-dcache-load-misses avgt 5,293.077 #/op
step20:L1-dcache-loads avgt 5,210.832 #/op
step20:branch-misses avgt 7.153 #/op
step20:branches avgt 5,062.571 #/op
step20:cycles avgt 11,036.292 #/op
step20:instructions avgt 20,462.526 #/op
step20:stalled-cycles-backend avgt 5,496.080 #/op
step20:stalled-cycles-frontend avgt 57.150 #/op
9.963.540.045 L1-dcache-loads # 2,010 G/sec
9.753.693.673 L1-dcache-load-misses # 97,89% of all L1-dcache accesses
-prof perfnorm
requires special kernel tooling (Linux perf)
What it costs to leave the CPU
Access | Real | Humanized |
---|---|---|
CPU Cycle | 0.4 ns | 1 s |
L1 Cache | 0.9 ns | 2 s |
L2 Cache | 2.8 ns | 7 s |
L3 Cache | 28 ns | 1 min |
Memory Access | 100 ns | 4 min |
NVM SSD | 25 μs | 17 h |
SSD | 50–150 μs | 1.5-4 days |
HDD | 1–10 ms | 1-9 months |
TCP SF to NY | 65 ms | 5 years |
TCP SF to Hong Kong | 141 ms | 11 years |
Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Picard: How long a time?
Data: 0.68 seconds, sir. For an android, that is nearly an eternity.
http://www.prowesscorp.com/computer-latency-at-a-human-scale/
final int[][] src = new int[1000][1000];
@Benchmark
public int horizontal() {
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
sum += src[i][j];
}
}
return sum;
}
@Benchmark
public int vertical() {
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
sum += src[j][i];
}
}
return sum;
}
@Benchmark
public int horizontalBackwards() {
int sum = 0;
for (int i = SIZE - 1; i >=0; i--)
{
for (int j = SIZE - 1; j >= 0; j--)
{
sum += src[i][j];
}
}
return sum;
}
What is real and what is not
# Lenovo T530, Intel i7
# Benchmark Mode Score Error Units
horizontal avgt 470,009 ns/op
horizontalBackwards avgt 672,821 ns/op
vertical avgt 3,209,562 ns/op
# Lenovo T14s, AMD Ryzen 7
# Benchmark (SIZE) Mode Cnt Score Error Units
horizontal 1000 avgt 2 314,558 ns/op
horizontalBackwards 1000 avgt 2 291,384 ns/op
vertical 1000 avgt 2 1,280,180 ns/op
org.sample.Matrix
with -prof perfnorm
final int[][] src = new int[1000][1000];
@Benchmark
public int horizontal() {
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
sum += src[i][j];
}
}
return sum;
}
@Benchmark
public int vertical() {
int sum = 0;
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
sum += src[j][i];
}
}
return sum;
}
@Benchmark
public int horizontalBackwards() {
int sum = 0;
for (int i = SIZE - 1; i >=0; i--)
{
for (int j = SIZE - 1; j >= 0; j--)
{
sum += src[i][j];
}
}
return sum;
}
# Lenovo T530
Benchmark Mode Score Units
Matrix.horizontal avgt 470,009 ns/op
Matrix.horizontal:L1-dcache-load-misses avgt 63,645 #/op
Matrix.horizontal:L1-dcache-loads avgt 1,014,532 #/op
Matrix.horizontal:LLC-load-misses avgt 429 #/op
Matrix.horizontal:LLC-loads avgt 1,523 #/op
Matrix.horizontalBackwards avgt 672,821 ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses avgt 63,053 #/op
Matrix.horizontalBackwards:L1-dcache-loads avgt 1,007,381 #/op
Matrix.horizontalBackwards:LLC-load-misses avgt 854 #/op
Matrix.horizontalBackwards:LLC-loads avgt 2,764 #/op
Matrix.vertical avgt 3,209,562 ns/op
Matrix.vertical:L1-dcache-load-misses avgt 2,060,998 #/op
Matrix.vertical:L1-dcache-loads avgt 3,020,796 #/op
Matrix.vertical:LLC-load-misses avgt 15,223 #/op
Matrix.vertical:LLC-loads avgt 694,397 #/op
# Lenovo T14s
Benchmark Mode Score Units
Matrix.horizontal avgt 286,633 ns/op
Matrix.horizontal:L1-dcache-load-misses avgt 63,902 #/op
Matrix.horizontal:L1-dcache-loads avgt 1,031,015 #/op
Matrix.horizontal:stalled-cycles-backend avgt 1,015,230 #/op
Matrix.horizontal:stalled-cycles-frontend avgt 4,848 #/op
Matrix.horizontalBackwards avgt 278,992 ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses avgt 64,526 #/op
Matrix.horizontalBackwards:L1-dcache-loads avgt 1,033,686 #/op
Matrix.horizontalBackwards:stalled-cycles-backend avgt 870,881 #/op
Matrix.horizontalBackwards:stalled-cycles-frontend avgt 5,029 #/op
Matrix.vertical avgt 1,484,345 ns/op
Matrix.vertical:L1-dcache-load-misses avgt 2,114,064 #/op
Matrix.vertical:L1-dcache-loads avgt 3,108,037 #/op
Matrix.vertical:stalled-cycles-backend avgt 5,081,442 #/op
Matrix.vertical:stalled-cycles-frontend avgt 16,206 #/op
org.sample.Matrix
with -prof perfnorm
Influence of the Size on the Results
Benchmark (SIZE) Mode Cnt Score Factor Units
horizontal 1 avgt 2 5 ns/op
horizontal 10 avgt 2 35 ns/op
horizontal 20 avgt 2 107 ns/op
horizontal 100 avgt 2 2,546 ns/op
horizontal 1000 avgt 2 306,505 ns/op
horizontal 5000 avgt 2 9,901,156 ns/op
horizontal 10000 avgt 2 39,579,171 ns/op
vertical 1 avgt 2 5 1.0 ns/op
vertical 10 avgt 2 101 2.9 ns/op
vertical 20 avgt 2 292 2.7 ns/op
vertical 100 avgt 2 5,921 2.3 ns/op
vertical 1000 avgt 2 1,748,734 5.7 ns/op
vertical 5000 avgt 2 272,269,384 27.5 ns/op
vertical 10000 avgt 2 1,884,958,277 47.6 ns/op
horizontalBackwards 1 avgt 2 5 ns/op
horizontalBackwards 10 avgt 2 36 ns/op
horizontalBackwards 20 avgt 2 107 ns/op
horizontalBackwards 100 avgt 2 2,544 ns/op
horizontalBackwards 1000 avgt 2 390,983 ns/op
horizontalBackwards 5000 avgt 2 10,800,172 ns/op
horizontalBackwards 10000 avgt 2 40,368,771 ns/op
Factor shows the change to the horizontal baseline number.
Testing with caches is not easy and depends on the overall machine workload.
Understand Your Hardware
# RUN 1 - T14s, AMD
Benchmark Mode Cnt Score Diff Units
threads1 avgt 2 250.5 ms/op
threads2 avgt 2 257.0 +3% ms/op
threads4 avgt 2 284.6 +13% ms/op
threads6 avgt 2 330.6 +32% ms/op
threads8 avgt 2 375.9 +50% ms/op
# RUN 2 - T14s, AMD, turbo mode off for verification
Benchmark Mode Cnt Score Diff Units
threads1 avgt 2 611.4 ms/op
threads2 avgt 2 612.6 +0% ms/op
threads4 avgt 2 611.4 +0% ms/op
threads6 avgt 2 616.4 +1% ms/op
threads8 avgt 2 624.8 +2% ms/op
private final int SIZE = 100_000_000;
private final int[] SOURCE = new int[SIZE];
double doCPUBoundStuff(int[] data) {
int numDataPoints = data.length;
int i = 0;
double avg = 0;
double var = 0;
data[0] = 0;
data[1] = 1;
for (i = 2; i < numDataPoints; i++) {
data[i] = data[i-1] + data[i-2];
}
for (avg = 0, i = 0; i < numDataPoints; i++) {
avg += (double) data[i];
}
avg /= (double) numDataPoints;
for (var = 0, i = 0; i < numDataPoints; i++) {
double diff = (double) data[i] - avg;
var += diff*diff;
}
var /= (double) (numDataPoints - 1);
return avg;
}
@Setup
public void setup() {
for (int i = 0; i < SIZE; i++) {
SOURCE[i] = i;
}
}
@Benchmark
@Threads(N) // N varied from 1 to 8
public double t1() {
return doCPUBoundStuff(SOURCE);
}
org.sample.scale.Powersave
Intel: echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
AMD: echo "0" | sudo tee /sys/devices/system/cpu/cpufreq/boost
Modern Fortune Teller
private static final int COUNT = 10_000;
// Contain random numbers from -50 to 50
private int[] sorted;
private int[] unsorted;
private int[] reversed;
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v > 0)
{
sum = sum + 2;
}
else
{
sum = sum + 1;
}
}
return sum;
}
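The setup that fills sorted, unsorted, and reversed is not shown on the slide; a sketch of the presumed data generation, with COUNT and the value range taken from the comments above (the fixed seed is an assumption for reproducibility):

```java
import java.util.Arrays;
import java.util.Random;

class BranchSetupSketch
{
    private static final int COUNT = 10_000;
    final int[] sorted = new int[COUNT];
    final int[] unsorted = new int[COUNT];
    final int[] reversed = new int[COUNT];

    public void setup()
    {
        final Random r = new Random(42L);
        for (int i = 0; i < COUNT; i++)
        {
            // random numbers from -50 to 50
            unsorted[i] = r.nextInt(101) - 50;
        }
        // same values, ascending order
        System.arraycopy(unsorted, 0, sorted, 0, COUNT);
        Arrays.sort(sorted);
        // same values, descending order
        for (int i = 0; i < COUNT; i++)
        {
            reversed[i] = sorted[COUNT - 1 - i];
        }
    }
}
```

All three arrays hold identical values; only the order differs, so any runtime difference in doIt comes from how well the branch predictor handles the `v > 0` pattern, not from the data itself.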
# T14s
# Benchmark Mode Cnt Score Error Units
sorted avgt 2 3,423.415 ns/op
reversed avgt 2 3,467.932 ns/op
unsorted avgt 2 10,600.557 ns/op
org.sample.BranchPrediction1
Modern CPU marvels: I know what you might need next
# Benchmark Mode Cnt Score Error Units
BranchPrediction1.sorted avgt 2 3329.529 ns/op
BranchPrediction1.sorted:IPC avgt 3.812 insns/clk
BranchPrediction1.sorted:L1-dcache-load-misses avgt 643.914 #/op
BranchPrediction1.sorted:L1-dcache-loads avgt 10287.609 #/op
BranchPrediction1.sorted:L1-icache-load-misses avgt 0.201 #/op
BranchPrediction1.sorted:L1-icache-loads avgt 241.148 #/op
BranchPrediction1.sorted:branch-misses avgt 5.800 #/op
BranchPrediction1.sorted:branches avgt 17527.359 #/op
BranchPrediction1.sorted:cycles avgt 13969.192 #/op
BranchPrediction1.sorted:instructions avgt 53254.559 #/op
BranchPrediction1.sorted:stalled-cycles-backend avgt 155.405 #/op
BranchPrediction1.sorted:stalled-cycles-frontend avgt 16.920 #/op
BranchPrediction1.unsorted avgt 2 14399.107 ns/op
BranchPrediction1.unsorted:IPC avgt 0.901 insns/clk
BranchPrediction1.unsorted:L1-dcache-load-misses avgt 653.839 #/op
BranchPrediction1.unsorted:L1-dcache-loads avgt 11429.217 #/op
BranchPrediction1.unsorted:L1-icache-load-misses avgt 1.365 #/op
BranchPrediction1.unsorted:L1-icache-loads avgt 21877.925 #/op
BranchPrediction1.unsorted:branch-misses avgt 956.056 #/op
BranchPrediction1.unsorted:branches avgt 17693.113 #/op
BranchPrediction1.unsorted:cycles avgt 59200.151 #/op
BranchPrediction1.unsorted:instructions avgt 53320.560 #/op
BranchPrediction1.unsorted:stalled-cycles-backend avgt 225.695 #/op
BranchPrediction1.unsorted:stalled-cycles-frontend avgt 1366.556 #/op
IPC: Instructions per Cycle
How small changes can mess up things big time
final int d = i % 2 == 0 ?
-10 : 10;
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v > 0)
{
sum += 2;
}
else
{
sum += 1;
}
}
return sum;
}
Benchmark Mode Score Units
===============================
sorted avgt 3,177 ns/op
unsorted avgt 3,709 ns/op
final int d = i % 2 == 0
? -r.nextInt(5) : r.nextInt(5);
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v > 0)
{
sum += 2;
}
else
{
sum += 1;
}
}
return sum;
}
Benchmark Mode Score Units
===============================
sorted avgt 3,231 ns/op
unsorted avgt 5,175 ns/op
final int d = i % 2 == 0
? -r.nextInt(5) - 1 :
r.nextInt(5) + 1;
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v > 0)
{
sum += 2;
}
else
{
sum += 1;
}
}
return sum;
}
Benchmark Mode Score Units
===============================
sorted avgt 3,369 ns/op
unsorted avgt 3,671 ns/op
final int d = r.nextInt();
public int doIt(int[] array)
{
int sum = 0;
for (int v : array)
{
if (v > 0)
{
sum += 2;
}
else
{
sum += 1;
}
}
return sum;
}
Benchmark Mode Score Units
===============================
sorted avgt 3,240 ns/op
unsorted avgt 15,440 ns/op
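One way to dodge misprediction entirely is to go branchless. A hedged sketch (not from the talk) that replaces the if/else with bit arithmetic; the JIT often performs a similar transformation itself via conditional moves, which is part of why some of the variants above barely differ between sorted and unsorted:

```java
public class BranchlessSketch
{
    public static int doIt(final int[] array)
    {
        int sum = 0;
        for (int v : array)
        {
            // (-v) >>> 31 is 1 exactly when v > 0 (fine for the small values
            // used here; it would be wrong for Integer.MIN_VALUE)
            sum += 1 + ((-v) >>> 31);
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        // -10 -> +1, 10 -> +2, 0 -> +1, 5 -> +2
        System.out.println(doIt(new int[] { -10, 10, 0, 5 })); // prints 6
    }
}
```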
Being alone and still having to sync
private final int SIZE = 1024;
private final Map<String, String> syncMap =
Collections.synchronizedMap(new HashMap<>(SIZE));
private final Map<String, String> unsyncMap =
new HashMap<>(SIZE);
private final Map<String, String> conMap =
new ConcurrentHashMap<>(SIZE);
@Benchmark
public String syncAccess()
{
return syncMap.get("1");
}
@Benchmark
public String unsyncAccess()
{
return unsyncMap.get("1");
}
@Benchmark
public String conAccess()
{
return conMap.get("1");
}
Benchmark Mode Cnt Score Units
SynchronizedAlone.conAccess avgt 2 9.152 ns/op
SynchronizedAlone.syncAccess avgt 2 23.462 ns/op
SynchronizedAlone.unsyncAccess avgt 2 9.366 ns/op
ConcurrentHashMap does well
org/sample/SynchronizedAlone.java
Let's try more threads
2 Threads
====================================================================
Benchmark Mode Cnt Score Error Units
SynchronizedThreads4.conAccess avgt 2 9.721 ns/op
SynchronizedThreads4.syncAccess avgt 2 183.086 ns/op
SynchronizedThreads4.unsyncAccess avgt 2 9.312 ns/op
4 Threads
====================================================================
Benchmark Mode Cnt Score Error Units
SynchronizedThreads4.conAccess avgt 2 19.661 ns/op
SynchronizedThreads4.syncAccess avgt 2 216.041 ns/op
SynchronizedThreads4.unsyncAccess avgt 2 19.631 ns/op
8 Threads
====================================================================
Benchmark Mode Cnt Score Error Units
SynchronizedThreads4.conAccess avgt 2 44.055 ns/op
SynchronizedThreads4.syncAccess avgt 2 377.180 ns/op
SynchronizedThreads4.unsyncAccess avgt 2 38.562 ns/op
Unsynchronized access to shared structures of course mostly produces garbage results
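The unsynchronized HashMap only gets away with it here because the benchmark never mutates the map. Once threads write, ConcurrentHashMap is the usual answer. A minimal sketch (illustrative, not from the talk's sources) showing concurrent writes without external locking:

```java
// Many threads write distinct keys concurrently; ConcurrentHashMap needs
// no synchronized wrapper and loses no updates.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentMapSketch
{
    public static Map<String, String> fillConcurrently(final int threads, final int perThread)
    {
        final Map<String, String> map = new ConcurrentHashMap<>();
        final Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            final int id = t;
            workers[t] = new Thread(() ->
            {
                for (int i = 0; i < perThread; i++)
                {
                    map.put(id + "-" + i, "v"); // safe without external locking
                }
            });
            workers[t].start();
        }
        for (Thread w : workers)
        {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return map;
    }

    public static void main(final String[] args)
    {
        // 4 threads x 10_000 distinct keys -> 40_000 entries, none lost
        System.out.println(fillConcurrently(4, 10_000).size()); // prints 40000
    }
}
```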
Loops compared
Benchmark (length) Mode Cnt Score Error Units
ForEachSimple.arrayFor 1 avgt 2 3.8 ns/op
ForEachSimple.arrayFor 10 avgt 2 9.8 ns/op
ForEachSimple.arrayFor 100 avgt 2 64.4 ns/op
ForEachSimple.arrayFor 1000 avgt 2 1,446.1 ns/op
ForEachSimple.arrayEnhanced 1 avgt 2 3.5 ns/op
ForEachSimple.arrayEnhanced 10 avgt 2 9.5 ns/op
ForEachSimple.arrayEnhanced 100 avgt 2 63.2 ns/op
ForEachSimple.arrayEnhanced 1000 avgt 2 1,199.3 ns/op
ForEachSimple.classicFor 1 avgt 2 5.7 ns/op
ForEachSimple.classicFor 10 avgt 2 15.5 ns/op
ForEachSimple.classicFor 100 avgt 2 103.0 ns/op
ForEachSimple.classicFor 1000 avgt 2 1,388.8 ns/op
ForEachSimple.enhancedFor 1 avgt 2 6.1 ns/op
ForEachSimple.enhancedFor 10 avgt 2 17.0 ns/op
ForEachSimple.enhancedFor 100 avgt 2 118.4 ns/op
ForEachSimple.enhancedFor 1000 avgt 2 1,578.6 ns/op
ForEachSimple.lambdaStream 1 avgt 2 53.7 ns/op
ForEachSimple.lambdaStream 10 avgt 2 65.5 ns/op
ForEachSimple.lambdaStream 100 avgt 2 153.0 ns/op
ForEachSimple.lambdaStream 1000 avgt 2 1,326.0 ns/op
ForEachSimple.parallelLambdaStream 1 avgt 2 94.2 ns/op
ForEachSimple.parallelLambdaStream 10 avgt 2 6,283.2 ns/op
ForEachSimple.parallelLambdaStream 100 avgt 2 8,443.4 ns/op
ForEachSimple.parallelLambdaStream 1000 avgt 2 9,367.6 ns/op
org/sample/ForEachSimple.java
// just some strings
final List<String> list = new ArrayList<>();
String[] array = null;
@Param({"1", "10", "100", "1000"})
int length;
@Setup
public void setup() {.. // uses length ..}
@Benchmark
public void arrayFor(Blackhole bh) {
int result = 0;
for (int i = 0; i < array.length; i++) {
result += array[i].length();
}
bh.consume(result);
}
// arrayEnhanced
for (String s : array) {
result += s.length();
}
// classicFor
for (int i = 0; i < list.size(); i++) {
result += list.get(i).length();
}
// enhancedFor
for (String s : list) {
result += s.length();
}
// lambdaStream
int result = list.stream()
.mapToInt(s -> s.length())
.sum();
//parallelLambdaStream
int result = list.parallelStream()
.mapToInt(s -> s.length())
.sum();
Not-so-empty loops compared
Benchmark (length) Mode Cnt Score Error Units
arrayFor 1 avgt 2 238.3 ns/op
arrayFor 10 avgt 2 2,105.6 ns/op
arrayFor 100 avgt 2 21,957.8 ns/op
arrayFor 1000 avgt 2 253,262.0 ns/op
arrayEnhanced 1 avgt 2 234.1 ns/op
arrayEnhanced 10 avgt 2 2,173.1 ns/op
arrayEnhanced 100 avgt 2 22,080.1 ns/op
arrayEnhanced 1000 avgt 2 250,038.0 ns/op
classicFor 1 avgt 2 238.1 ns/op
classicFor 10 avgt 2 2,177.0 ns/op
classicFor 100 avgt 2 22,318.0 ns/op
classicFor 1000 avgt 2 249,914.1 ns/op
enhancedFor 1 avgt 2 254.8 ns/op
enhancedFor 10 avgt 2 2,185.9 ns/op
enhancedFor 100 avgt 2 23,011.1 ns/op
enhancedFor 1000 avgt 2 262,605.5 ns/op
lambdaStream 1 avgt 2 395.7 ns/op
lambdaStream 10 avgt 2 2,829.0 ns/op
lambdaStream 100 avgt 2 27,771.4 ns/op
lambdaStream 1000 avgt 2 304,831.6 ns/op
# 2+2 CPU (2 cores + 2 hyper-threads)
parallelLambdaStream 1 avgt 2 453.5 ns/op
parallelLambdaStream 10 avgt 2 11,686.4 ns/op
parallelLambdaStream 100 avgt 2 26,798.6 ns/op
parallelLambdaStream 1000 avgt 2 189,304.3 ns/op
final List<String> list = new ArrayList<>();
@Param({"1", "10", "100", "1000"})
int length;
@Setup
public void setup()
{
for (int i = 0; i < length; i++)
{
list.add(
MessageFormat.format(
"{0},{0},{0},{0},{0},{0}", String.valueOf(i)));
}
}
@Benchmark
public void classicFor(Blackhole bh)
{
int result = 0;
for (int i = 0; i < list.size(); i++)
{
final String s = list.get(i);
if (s.startsWith("5"))
{
continue;
}
final String[] a = s.split(",");
for (int j = 0; j < a.length; j++)
{
result += Integer.valueOf(a[j]);
}
}
bh.consume(result);
}
org/sample/ForEachExpensive.java
T450s vs. GCP N1-highcpu-16core
# Lenovo T450s
Benchmark (length) Mode Cnt Score Error Units
arrayFor 1 avgt 2 238.3 ns/op
arrayFor 10 avgt 2 2,105.6 ns/op
arrayFor 100 avgt 2 21,957.8 ns/op
arrayFor 1000 avgt 2 253,262.0 ns/op
arrayFor 10000 avgt 2 2,810,799.6 ns/op
arrayFor 100000 avgt 2 31,116,545.4 ns/op
classicFor 1 avgt 2 238.1 ns/op
classicFor 10 avgt 2 2,177.0 ns/op
classicFor 100 avgt 2 22,318.0 ns/op
classicFor 1000 avgt 2 257,362.6 ns/op
classicFor 10000 avgt 2 2,809,832.8 ns/op
classicFor 100000 avgt 2 32,017,487.3 ns/op
enhancedFor 1 avgt 2 254.8 ns/op
enhancedFor 10 avgt 2 2,185.9 ns/op
enhancedFor 100 avgt 2 23,011.1 ns/op
enhancedFor 1000 avgt 2 254,691.8 ns/op
enhancedFor 10000 avgt 2 3,180,447.3 ns/op
enhancedFor 100000 avgt 2 31,479,657.4 ns/op
lambdaStream 1 avgt 2 395.7 ns/op
lambdaStream 10 avgt 2 2,829.0 ns/op
lambdaStream 100 avgt 2 27,771.4 ns/op
lambdaStream 1000 avgt 2 299,971.9 ns/op
lambdaStream 10000 avgt 2 3,181,573.5 ns/op
lambdaStream 100000 avgt 2 36,055,386.8 ns/op
parallelLambdaStream 1 avgt 2 453.5 ns/op
parallelLambdaStream 10 avgt 2 11,686.4 ns/op
parallelLambdaStream 100 avgt 2 26,798.6 ns/op
parallelLambdaStream 1000 avgt 2 193,232.6 ns/op
parallelLambdaStream 10000 avgt 2 1,956,289.8 ns/op
parallelLambdaStream 100000 avgt 2 20,628,366.2 ns/op
# Google n1-highcpu-16core
Benchmark (length) Mode Cnt Score Error Units
arrayFor 1 avgt 2 263.4 ns/op
arrayFor 10 avgt 2 2,390.4 ns/op
arrayFor 100 avgt 2 24,871.1 ns/op
arrayFor 1000 avgt 2 266,237.0 ns/op
arrayFor 10000 avgt 2 3,020,851.2 ns/op
arrayFor 100000 avgt 2 30,428,126.5 ns/op
classicFor 1 avgt 2 264.0 ns/op
classicFor 10 avgt 2 2,490.3 ns/op
classicFor 100 avgt 2 25,186.0 ns/op
classicFor 1000 avgt 2 281,254.1 ns/op
classicFor 10000 avgt 2 2,882,459.1 ns/op
classicFor 100000 avgt 2 34,657,730.5 ns/op
enhancedFor 1 avgt 2 263.5 ns/op
enhancedFor 10 avgt 2 2,329.8 ns/op
enhancedFor 100 avgt 2 24,811.0 ns/op
enhancedFor 1000 avgt 2 291,544.4 ns/op
enhancedFor 10000 avgt 2 3,185,459.2 ns/op
enhancedFor 100000 avgt 2 33,898,708.6 ns/op
lambdaStream 1 avgt 2 547.9 ns/op
lambdaStream 10 avgt 2 3,841.7 ns/op
lambdaStream 100 avgt 2 28,808.7 ns/op
lambdaStream 1000 avgt 2 329,148.1 ns/op
lambdaStream 10000 avgt 2 3,684,920.1 ns/op
lambdaStream 100000 avgt 2 36,935,499.3 ns/op
parallelLambdaStream 1 avgt 2 521.7 ns/op
parallelLambdaStream 10 avgt 2 43,527.8 ns/op
parallelLambdaStream 100 avgt 2 62,834.3 ns/op
parallelLambdaStream 1000 avgt 2 110,602.3 ns/op
parallelLambdaStream 10000 avgt 2 558,666.5 ns/op
parallelLambdaStream 100000 avgt 2 5,183,311.8 ns/op
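parallelStream() runs on the common ForkJoinPool, which is sized from the number of available cores. That is why the 16-core GCP box pulls ahead so clearly at the large sizes while the laptop does not. A small sketch for inspecting the pool (the system property is standard JDK, the class name is illustrative):

```java
// Parallel streams share the common ForkJoinPool; its default size is
// availableProcessors() - 1 and can be capped with
// -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class ParallelismSketch
{
    public static void main(final String[] args)
    {
        System.out.println("cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool: " + ForkJoinPool.commonPool().getParallelism());

        // the result is of course independent of the pool size
        final long sum = IntStream.rangeClosed(1, 1_000).parallel().sum();
        System.out.println(sum); // prints 500500
    }
}
```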
Method Calls, Inheritance, and Compiler Magic
How Expensive are Virtual Method Calls?
public class MegaMorphicSimple
{
static interface C { public int apply(int x); }
static class C1 implements C
{
@Override
public int apply(int x) { return x * x; }
}
@Benchmark
public int _1_mono()
{
C c = null;
int r = random.nextInt(4);
if (r == 0) { c = new C1(); }
else if (r == 1) { c = new C1(); }
else if (r == 2) { c = new C1(); }
else if (r == 3) { c = new C1(); }
return c.apply(r);
}
@Benchmark
public int _2_bi()
{
C c = null;
int r = random.nextInt(4);
if (r == 0) { c = new C1(); }
else if (r == 1) { c = new C2(); }
else if (r == 2) { c = new C1(); }
else if (r == 3) { c = new C2(); }
return c.apply(r);
}
@Benchmark
public int _5_manualDispatch()
{
...
if (c instanceof C1)
{
return ((C1) c).apply(r);
}
else if (c instanceof C2)
{
return ((C2) c).apply(r);
}
else if (c instanceof C3)
{
return ((C3) c).apply(r);
}
else
{
return ((C4) c).apply(r);
}
}
}
Inspired by Douglas Hawkins: org.sample.MegaMorphic[Simple|Default]
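The result tables also list a _4_mega case whose source is elided on this slide. A self-contained sketch of its presumable shape (names are reconstructed, not taken from the original sources): the single call site c.apply(r) can observe four different receiver classes, which is exactly what makes it megamorphic.

```java
import java.util.Random;

public class MegaSketch
{
    interface C { int apply(int x); }
    static class C1 implements C { public int apply(int x) { return x * x; } }
    static class C2 implements C { public int apply(int x) { return x * x; } }
    static class C3 implements C { public int apply(int x) { return x * x; } }
    static class C4 implements C { public int apply(int x) { return x * x; } }

    static final Random random = new Random(42L);

    static int mega()
    {
        C c;
        int r = random.nextInt(4);
        if (r == 0)      { c = new C1(); }
        else if (r == 1) { c = new C2(); }
        else if (r == 2) { c = new C3(); }
        else             { c = new C4(); }
        return c.apply(r); // up to four receiver types at this one call site
    }

    public static void main(final String[] args)
    {
        // every call returns r * r for some r in 0..3
        int sum = 0;
        for (int i = 0; i < 1_000; i++) { sum += mega(); }
        System.out.println(sum > 0);
    }
}
```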
How Expensive are Virtual Method Calls?
# identical method per implementing class
Benchmark Mode Cnt Score Error Units
MegaMorphicSimple._1_mono avgt 5 19.703 ± 2.177 ns/op
MegaMorphicSimple._2_bi avgt 5 21.917 ± 3.759 ns/op
MegaMorphicSimple._4_mega avgt 5 26.778 ± 4.411 ns/op
MegaMorphicSimple._5_manualDispatch avgt 5 21.861 ± 2.860 ns/op
static interface C { default int apply(int x){ return x * x; }};
static class C1 implements C {};
static class C2 implements C {};
static class C3 implements C {};
static class C4 implements C {};
# default method at interface
Benchmark Mode Cnt Score Error Units
MegaMorphicDefault._1_mono avgt 5 18.867 ± 1.325 ns/op
MegaMorphicDefault._2_bi avgt 5 21.817 ± 2.613 ns/op
MegaMorphicDefault._4_mega avgt 5 26.355 ± 2.591 ns/op
MegaMorphicDefault._5_manualDispatch avgt 5 21.763 ± 3.479 ns/op
# method got bigger and we made it non-default
Benchmark Mode Cnt Score Error Units
MegaMorphicLargeMethod._1_mono avgt 5 158.775 ± 4.971 ns/op
MegaMorphicLargeMethod._2_bi avgt 5 157.100 ± 6.153 ns/op
MegaMorphicLargeMethod._4_mega avgt 5 165.580 ± 8.669 ns/op
MegaMorphicLargeMethod._5_manualDispatch avgt 5 162.729 ± 3.795 ns/op
org.sample.MegaMorphic[Simple|Default]
So we don't have to think about virtual calls at all?
public class MegaMorphicArray
{
private C[] array = new C[20_000];
@Setup
public void setup()
{
final Random r = new Random(42L);
for (int x = 0; x < 20_000; x++)
{
int i = r.nextInt(JMH_PARAM);
if (i == 0) array[x] = new C1();
else if (i == 1) array[x] = new C2();
else if (i == 2) array[x] = new C3();
else if (i == 3) array[x] = new C4();
}
}
}
static interface C { public int apply(int x); }
static class C1 implements C {
@Override public int apply(int x) { return x * x; }
} ...
@Benchmark
public int morphic()
{
int sum = 0;
for (int i = 0; i < 20_000; i++)
{
sum += array[i].apply(i);
}
return sum;
}
@Benchmark
public int peeled()
{
int sum = 0;
for (int i = 0; i < 20_000; i++) {
final C c = array[i];
if (c instanceof C1) {
sum += ((C1) c).apply(i);
}
else if (c instanceof C2) {
sum += ((C2) c).apply(i);
}
else if (c instanceof C3) {
sum += ((C3) c).apply(i);
}
else {
sum += ((C4) c).apply(i);
}
}
return sum;
}
org.sample.MegaMorphicArray
The Results are in
# JDK 11
Benchmark (params) Mode Cnt Score Units
morphic 1 avgt 5 17,705.6 ns/op
morphic 2 avgt 5 98,269.4 ns/op
morphic 3 avgt 5 334,745.9 ns/op
morphic 4 avgt 5 364,485.9 ns/op
peeled 1 avgt 5 16,990.4 ns/op
peeled 2 avgt 5 94,084.2 ns/op
peeled 3 avgt 5 123,590.3 ns/op
peeled 4 avgt 5 147,067.2 ns/op
# JDK 11 -Xint
Benchmark (params) Mode Cnt Score Units
morphic 1 avgt 5 1,682,550.7 ns/op
morphic 2 avgt 5 1,695,875.2 ns/op
morphic 3 avgt 5 1,679,643.2 ns/op
morphic 4 avgt 5 1,675,298.1 ns/op
peeled 1 avgt 5 1,764,953.0 ns/op
peeled 2 avgt 5 1,821,662.6 ns/op
peeled 3 avgt 5 1,885,678.8 ns/op
peeled 4 avgt 5 1,936,054.8 ns/op
org.sample.MegaMorphicArray
But why do we see this performance?
Benchmark (params) Mode Score Error Units
morphic 1 avgt 16,829.7 ± 1165.264 ns/op
morphic:IPC 1 avgt 3.2 #/op
morphic:L1-dcache-load-misses 1 avgt 6,263.3 #/op
morphic:L1-dcache-loads 1 avgt 40,296.8 #/op
morphic:branch-misses 1 avgt 9.1 #/op
morphic:branches 1 avgt 24,984.2 #/op
morphic:dTLB-load-misses 1 avgt 0.2 #/op
morphic:dTLB-loads 1 avgt 40,401.5 #/op
morphic 2 avgt 97,456.9 ± 1177.830 ns/op
morphic:IPC 2 avgt 0.7 #/op
morphic:L1-dcache-load-misses 2 avgt 6,472.7 #/op
morphic:L1-dcache-loads 2 avgt 43,597.9 #/op
morphic:branch-misses 2 avgt 9,850.6 #/op
morphic:branches 2 avgt 41,935.3 #/op
morphic:dTLB-load-misses 2 avgt 2.0 #/op
morphic:dTLB-loads 2 avgt 45,509.8 #/op
morphic 3 avgt 334,117.8 ± 3334.059 ns/op
morphic:IPC 3 avgt 1.1 #/op
morphic:L1-dcache-load-misses 3 avgt 6,840.3 #/op
morphic:L1-dcache-loads 3 avgt 407,654.7 #/op
morphic:branch-misses 3 avgt 13,247.0 #/op
morphic:branches 3 avgt 146,105.8 #/op
morphic:dTLB-load-misses 3 avgt 13.5 #/op
morphic:dTLB-loads 3 avgt 410,043.5 #/op
morphic 4 avgt 390,601.4 ± 40515.922 ns/op
morphic:IPC 4 avgt 0.9 #/op
morphic:L1-dcache-load-misses 4 avgt 6,740.5 #/op
morphic:L1-dcache-loads 4 avgt 408,814.3 #/op
morphic:branch-misses 4 avgt 15,099.0 #/op
morphic:branches 4 avgt 145,118.2 #/op
morphic:dTLB-load-misses 4 avgt 18.9 #/op
morphic:dTLB-loads 4 avgt 410,767.5 #/op
Benchmark (params) Mode Score Error Units
peeled 1 avgt 16,831.1 ± 816.491 ns/op
peeled:IPC 1 avgt 3.2 #/op
peeled:L1-dcache-load-misses 1 avgt 6,218.4 #/op
peeled:L1-dcache-loads 1 avgt 39,797.1 #/op
peeled:branch-misses 1 avgt 9.2 #/op
peeled:branches 1 avgt 24,854.5 #/op
peeled:dTLB-load-misses 1 avgt 0.2 #/op
peeled:dTLB-loads 1 avgt 40,133.1 #/op
peeled 2 avgt 93,347.0 ± 3607.879 ns/op
peeled:IPC 2 avgt 0.7 #/op
peeled:L1-dcache-load-misses 2 avgt 6,456.1 #/op
peeled:L1-dcache-loads 2 avgt 42,060.7 #/op
peeled:branch-misses 2 avgt 9,252.3 #/op
peeled:branches 2 avgt 41,153.9 #/op
peeled:dTLB-load-misses 2 avgt 6.0 #/op
peeled:dTLB-loads 2 avgt 42,195.1 #/op
peeled 3 avgt 123,075.7 ± 5181.245 ns/op
peeled:IPC 3 avgt 0.6 #/op
peeled:L1-dcache-load-misses 3 avgt 6,472.7 #/op
peeled:L1-dcache-loads 3 avgt 42,059.2 #/op
peeled:branch-misses 3 avgt 13,440.9 #/op
peeled:branches 3 avgt 51,950.4 #/op
peeled:dTLB-load-misses 3 avgt 5.7 #/op
peeled:dTLB-loads 3 avgt 43,097.4 #/op
peeled 4 avgt 136,765.3 ± 2792.939 ns/op
peeled:IPC 4 avgt 0.6 #/op
peeled:L1-dcache-load-misses 4 avgt 6,460.5 #/op
peeled:L1-dcache-loads 4 avgt 43,312.1 #/op
peeled:branch-misses 4 avgt 14,818.3 #/op
peeled:branches 4 avgt 62,214.6 #/op
peeled:dTLB-load-misses 4 avgt 8.0 #/op
peeled:dTLB-loads 4 avgt 42,775.6 #/op
Credits: Aleksey Shipilёv, RedHat, https://shipilev.net/jvm/anatomy-quarks/16-megamorphic-virtual-calls/
The CPU understood
[1]: https://en.wikipedia.org/wiki/Translation_lookaside_buffer
Let's try some streams first
final int SIZE = 10240;
final int[] integers = new int[SIZE];
public int lambdaArray[Cold|Hot]()
{
return Arrays.stream(integers)
.filter(i -> i % 2 == 0).sum();
}
public int lambdaIntStream[Cold|Hot]()
{
return IntStream.range(0, SIZE)
.filter(i -> i % 2 == 0).sum();
}
public int loop[Cold|Hot]()
{
int sum = 0;
for (int i = 0; i < integers.length; i++)
{
if (i % 2 == 0)
{
sum += integers[i];
}
}
return sum;
}
Benchmark Mode Cnt Score Units
Lambda01.lambdaArrayHot avgt 5 13,968.8 ns/op
Lambda01.lambdaIntStreamHot avgt 5 13,156.2 ns/op
Lambda01.loopHot avgt 5 10,146.6 ns/op
# Cold does not have JMH warmup
Benchmark Mode Cnt Score Units
Lambda01.lambdaArrayCold avgt 5 251,372.4 ns/op
Lambda01.lambdaArrayHot avgt 5 13,968.8 ns/op
Lambda01.lambdaIntStreamCold avgt 5 794,422.3 ns/op
Lambda01.lambdaIntStreamHot avgt 5 13,156.2 ns/op
Lambda01.loopCold avgt 5 15,066.7 ns/op
Lambda01.loopHot avgt 5 10,146.6 ns/op
# -Xcomp
Benchmark Mode Cnt Score Units
Lambda01.lambdaArrayCold avgt 5 24,809,394.5 ns/op
Lambda01.lambdaIntStreamCold avgt 5 23,398,581.4 ns/op
Lambda01.loopCold avgt 5 11,787.0 ns/op
Lambda01.lambdaArrayHot avgt 5 143,678.1 ns/op
Lambda01.lambdaIntStreamHot avgt 5 144,445.3 ns/op
Lambda01.loopHot avgt 5 10,109.5 ns/op
# -Xint
Benchmark Mode Cnt Score Units
Lambda01.lambdaArrayCold avgt 5 5,578,040.0 ns/op
Lambda01.lambdaArrayHot avgt 5 5,473,135.6 ns/op
Lambda01.lambdaIntStreamCold avgt 5 5,394,464.0 ns/op
Lambda01.lambdaIntStreamHot avgt 5 5,317,269.0 ns/op
Lambda01.loopCold avgt 5 229,327.9 ns/op
Lambda01.loopHot avgt 5 229,369.6 ns/op
org/sample/Lambda01.java
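For reference, the three variants assembled into a runnable, self-contained form (the fill pattern is an assumption; the original setup is not shown). All three compute the same sum here, so the score differences above are purely in how the work is expressed and warmed up:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class Lambda01Sketch
{
    static final int SIZE = 10_240;
    static final int[] integers = new int[SIZE];
    static { Arrays.setAll(integers, i -> i); } // value == index

    static int lambdaArray()
    {
        // filters on the VALUE; equal to the others only because value == index
        return Arrays.stream(integers).filter(i -> i % 2 == 0).sum();
    }

    static int lambdaIntStream()
    {
        return IntStream.range(0, SIZE).filter(i -> i % 2 == 0).sum();
    }

    static int loop()
    {
        int sum = 0;
        for (int i = 0; i < integers.length; i++)
        {
            if (i % 2 == 0) { sum += integers[i]; } // filters on the INDEX
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        System.out.println(lambdaArray() == loop()
            && lambdaIntStream() == loop()); // prints true
    }
}
```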
Compare Predicate, Pseudo-Predicate, and Direct Implementation
public int forLambda(final Predicate<Integer> p) {
int sum = 0;
for (int i = 0; i < array.length; i++) {
if (p.test(array[i])) {
sum++;
}
}
return sum;
}
public int forAnonymous(final PseudoLambda<Integer> p) {
int sum = 0;
for (int i = 0; i < array.length; i++) {
if (p.test(array[i])) {
sum++;
}
}
return sum;
}
public int simple() {
int sum = 0;
for (int i = 0; i < array.length; i++) {
if (array[i] % 2 == 0) {
sum++;
}
}
return sum;
}
@Benchmark
public void forSimple(Blackhole bh)
{
bh.consume(simple());
}
@Benchmark
public void forLambdaPredicate(Blackhole bh)
{
bh.consume(forLambda(i -> i % 2 == 0));
}
interface PseudoLambda<T>
{
boolean test(T t);
}
@Benchmark
public void forAnonymousClass(Blackhole bh)
{
bh.consume(
forAnonymous(new PseudoLambda<Integer>()
{
@Override
public boolean test(Integer i)
{
return i % 2 == 0;
}
})
);
}
org/sample/Lambda02.java
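forLambdaPredicatePredefined shows up in the results but its source is elided here; presumably the lambda is hoisted into a constant, so the Predicate instance is created once instead of appearing at the call site on every invocation. A self-contained sketch of that idea (names are illustrative, not from the original sources):

```java
import java.util.function.Predicate;

public class PredefinedPredicateSketch
{
    // hoisted once: a single lambda object for all calls
    static final Predicate<Integer> EVEN = i -> i % 2 == 0;

    static int forLambda(final int[] array, final Predicate<Integer> p)
    {
        int sum = 0;
        for (int v : array)
        {
            if (p.test(v)) // note: boxes v to Integer on every test
            {
                sum++;
            }
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        final int[] data = { 0, 1, 2, 3, 4, 5 };
        System.out.println(forLambda(data, EVEN));            // prints 3
        System.out.println(forLambda(data, i -> i % 2 == 0)); // prints 3
    }
}
```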
Benchmark Mode Cnt Score Units
forAnonymousClass avgt 2 152.9 ns/op
forLambdaPredicate avgt 2 128.8 ns/op
forLambdaPredicatePredefined avgt 2 127.9 ns/op
forSimple avgt 2 128.1 ns/op
# -Xint
Benchmark Mode Cnt Score Units
forAnonymousClass avgt 2 21,312.5 ns/op
forLambdaPredicate avgt 2 20,154.8 ns/op
forLambdaPredicatePredefined avgt 2 19,871.6 ns/op
forSimple avgt 2 3,471.7 ns/op
# No warmup and just 5x measurements
Benchmark Mode Cnt Score Units
forAnonymousClass ss 5 92,787.6 ns/op
forLambdaPredicate ss 5 122,377.4 ns/op
forLambdaPredicatePredefined ss 5 50,435.6 ns/op
forSimple ss 5 16,268.6 ns/op
org/sample/Lambda02.java
public class GCAndAccessSpeed
{
private final static int SIZE = 100_000;
private final List<String> STRINGS = new ArrayList<>();
private final List<String> ordered = new ArrayList<>();
private final List<String> nonOrdered = new ArrayList<>();
@Param({"false", "true"}) boolean gc;
@Param({"1", "10"}) int COUNT;
@Param({"false", "true"}) boolean drop;
@Setup
public void setup() throws InterruptedException
{
final FastRandom r = new FastRandom(7);
for (int i = 0; i < COUNT * SIZE; i++)
{
STRINGS.add(RandomUtils.randomString(r, 1, 20));
}
for (int i = 0; i < SIZE; i++)
{
ordered.add(STRINGS.get(i * COUNT));
}
nonOrdered.addAll(ordered);
Collections.shuffle(nonOrdered, new Random(r.nextInt()));
if (drop)
{
STRINGS.clear();
}
if (gc)
{
for (int c = 0; c < 5; c++)
{
System.gc();
TimeUnit.SECONDS.sleep(2);
}
}
}
@Benchmark
public int walk[Ordered|NonOrdered]()
{
int sum = 0;
for (int i = 0; i < [ordered|nonOrdered].size(); i++)
{
sum += [ordered|nonOrdered].get(i).length();
}
return sum;
}
}
Applied Memory Knowledge
# G1 GC
Benchmark (COUNT) (drop) (gc) Mode Cnt Score Units
walkNonOrdered 1 false false avgt 5 1,596,315.2 ns/op
walkOrdered 1 false false avgt 5 611,137.7 ns/op
walkNonOrdered 1 false true avgt 5 1,172,951.3 ns/op
walkOrdered 1 false true avgt 5 431,143.5 ns/op
walkNonOrdered 1 true false avgt 5 1,562,844.1 ns/op
walkOrdered 1 true false avgt 5 605,119.4 ns/op
walkNonOrdered 1 true true avgt 5 1,243,973.9 ns/op
walkOrdered 1 true true avgt 5 400,721.9 ns/op
walkNonOrdered 10 false false avgt 5 1,903,731.9 ns/op
walkOrdered 10 false false avgt 5 1,229,945.1 ns/op
walkNonOrdered 10 false true avgt 5 2,026,861.7 ns/op
walkOrdered 10 false true avgt 5 1,809,961.9 ns/op
walkNonOrdered 10 true false avgt 5 1,920,658.4 ns/op
walkOrdered 10 true false avgt 5 1,239,658.5 ns/op
walkNonOrdered 10 true true avgt 5 1,160,229.2 ns/op
walkOrdered 10 true true avgt 5 403,949.5 ns/op
# -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC
Benchmark (COUNT) (drop) (gc) Mode Cnt Score Units
walkNonOrdered 1 false false avgt 5 1,667,611.7 ns/op
walkNonOrdered 1 false true avgt 5 1,820,968.2 ns/op
walkNonOrdered 1 true false avgt 5 1,928,822.7 ns/op
walkNonOrdered 1 true true avgt 5 1,777,251.4 ns/op
walkOrdered 1 false false avgt 5 931,728.5 ns/op
walkOrdered 1 false true avgt 5 902,433.7 ns/op
walkOrdered 1 true false avgt 5 930,294.3 ns/op
walkOrdered 1 true true avgt 5 907,886.5 ns/op
org.sample.GCAndAccessSpeed
Inspired by: Aleksey Shipilёv, Moving GC and Locality
Rule 1
Don't worry too much about performance. Java is very good at taking care of it.
Rule 2
Don't optimize prematurely and don't micro-tune.
Rule 3
Simply write clean and good code. You won't ever see most of these issues. Think before you code. Java can make code fast, but bad algorithms stay bad algorithms, and bad designs stay bad designs.
Rule 4
When you worry, worry about the hottest code. Measure carefully and apply what you learned. The compiler and the CPU are your friends, but they also make your profiling life harder.
Rule 5
Understand, measure, review, test, and measure again.
Rule 6
The code you write is not the code that is executed. Measure under production conditions to understand and avoid optimizing non-compiled and non-optimized code.
If your default response to problems is "That must be garbage collection."... get a different job.
Applying Knowledge About CPUs, Caches, and More to the XLT Reporting Engine
# OLD
4,449,972.01 msec task-clock # 13.341 CPUs utilized
3,692,466 context-switches # 829.773 /sec
677,485 cpu-migrations # 152.245 /sec
1,148,505 page-faults # 258.093 /sec
10,607,013,813,174 cycles # 2.384 GHz (50.00%)
355,223,105,079 stalled-cycles-frontend # 3.35% frontend cycles idle (50.01%)
2,005,906,915,627 stalled-cycles-backend # 18.91% backend cycles idle (50.02%)
21,127,403,990,596 instructions # 1.99 insn per cycle
# 0.09 stalled cycles per insn (50.02%)
4,456,128,125,803 branches # 1.001 G/sec (50.02%)
16,654,985,143 branch-misses # 0.37% of all branches (50.01%)
8,252,058,721,619 L1-dcache-loads # 1.854 G/sec (50.00%)
132,098,926,744 L1-dcache-load-misses # 1.60% of all L1-dcache accesses (50.00%)
333.5 seconds time elapsed
4158.9 seconds user
295.2 seconds sys
# NEW
2,740,643.95 msec task-clock # 14.051 CPUs utilized
8,421,379 context-switches # 3.073 K/sec
203,008 cpu-migrations # 74.073 /sec
1,152,818 page-faults # 420.638 /sec
6,490,638,190,051 cycles # 2.368 GHz (49.99%)
235,691,588,371 stalled-cycles-frontend # 3.63% frontend cycles idle (50.00%)
881,992,481,414 stalled-cycles-backend # 13.59% backend cycles idle (50.01%)
10,681,912,161,893 instructions # 1.65 insn per cycle
# 0.08 stalled cycles per insn (50.01%)
1,873,443,357,019 branches # 683.578 M/sec (50.02%)
13,816,096,139 branch-misses # 0.74% of all branches (50.01%)
4,424,610,353,285 L1-dcache-loads # 1.614 G/sec (50.00%)
103,982,020,029 L1-dcache-load-misses # 2.35% of all L1-dcache accesses (50.00%)
195.0 seconds time elapsed
2507.4 seconds user
232.7 seconds sys