High Performance Java

About the Smart Java Internals

René Schwietzke, Xceptance GmbH, Jena, Germany

About

Xceptance

  • Focused on Software Test and Quality Assurance
  • Performance Testing, Test Automation, Functional Testing, QA and Test Process Consulting
  • Founded 2004
  • Headquarters in Jena, Germany
  • Subsidiary in Cambridge, MA, USA
  • Performance Test Tool XLT, Java-based, Apache License 2.0

René Schwietzke

  • Co-Founder and Managing Director, Xceptance
  • Working with Java since version 1.0
  • Performance Tester since 1999
  • @ReneSchwietzke

Sources

Java and Performance

Java is much more than what most people think. Spoilers!

The times when Java was a bytecode interpreter, running slow and sluggish, are long gone. But most people don't know the magic that is going on to speed up Java to native code speed and beyond.

Actually, most programmers don't even know how their code gets executed or how a computer works... sadly.

  • How speed is achieved and the mysterious things behind that
  • Hardware matters
  • Avoid common mistakes but besides that, trust Java
  • The code you write is not the code that runs
  • Squeeze the lemon even more

An Early Warning or The Disclaimer

Optimization and tuning is fun... but...

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth, Turing Award Lecture 1974

  • This talk is NOT about what your code should look like
  • If you expect tips like use StringBuilder instead of... this talk is not for you
  • This talk is not about the fastest code (bummer!)
  • It is about understanding the JVM and getting measurements right
  • It is about avoiding working against Java
  • It is also about the right approach to performance tuning

Another Motivation or Early Proof

Java is efficient but the why has to be explained

Let's Benchmark

Get Some Measurements Going

Poor Man's Benchmark 1

import java.util.Arrays;

public class PoorMansBenchmark 
{   
    private static String append(final int i) 
    {
        final StringBuilder sb = new StringBuilder();
        sb.append(System.currentTimeMillis());
        sb.append("Foo"); 
        sb.append(i); 
        sb.append('c');
        sb.append("lkJAHJ AHSDJHF KJASHF HSAKJFD");
        
        final String s = sb.toString();
        char[] ch = s.toCharArray();
        Arrays.sort(ch);
        
        return new String(ch);
    }

    public static void main(final String[] args) 
    {
        final int SIZE = 100_000;
        final long[] result = new long[SIZE];
        
        int sum = 0;
        for (int i = 0; i < SIZE; i++)
        {
            final Timer t = Timer.startTimer();       
            
            sum += append(i).length();
            
            result[i] = t.stop().runtimeNanos();
        }  
        
        // pretend we need it but we really don't
        // just to avoid optimizations (spoilers, duh!)
        if (sum < 0)
        {
            System.out.println(sum);
        }
    }        
}

Just benchmark something...

  • Some nonsense with StringBuilder and String
  • We call our method in a loop
  • A helper tracks the time
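
The Timer helper is not shown on the slide; here is a minimal sketch of what it is assumed to look like, reconstructed from its usage above (startTimer(), stop(), runtimeNanos()) and built on System.nanoTime():

```java
// Minimal sketch of the Timer helper (assumed implementation,
// reconstructed from its usage in the benchmark above)
public final class Timer
{
    private final long start;
    private long end;

    private Timer(final long start)
    {
        this.start = start;
    }

    // capture the current nanosecond timestamp
    public static Timer startTimer()
    {
        return new Timer(System.nanoTime());
    }

    // capture the end timestamp; returns this for chaining
    public Timer stop()
    {
        end = System.nanoTime();
        return this;
    }

    // elapsed wall-clock time in nanoseconds
    public long runtimeNanos()
    {
        return end - start;
    }
}
```

Note that System.nanoTime() itself has overhead and limited resolution, which is one more reason single-invocation timings are noisy.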

Let's Run the Poor Man's Benchmark

# SIZE = 10
 1, 617053
 2,  38628
 3,  24712
 4,  24459
 5,  33943
 6,  23639
 7,  24312
 8,  25948
 9,  23379
10,  23003
# SIZE = 50 
  1, 616953
  2,  37628
  3,  27712
  4,  25459
  5,  36943
  6,  24639
  7,  27312
  8,  23948
  9,  24379
 10,  24003
 20,  23770
 30,  25595
 40,  23631
 41,  23744
 42,  25004
 43,  26035
 44,  23782
 45,  22973
 46,  25261
 47,  23215
 48,  24135
 49,  23312
 50,  22908







# SIZE = 200
  1, 560064
  2,  33833
  3,  25935
  4,  24224
  5,  58761
  6,  25303
  7,  23832
  8,  23243
  9,  24405
 10,  23159
 20,  24005
 30,  23097
 40,  24002
 50,  23575
 60,  26380
 70,  24612
 80,  26099
 90,  23815
100,  33719
110,  24405
120,  25568
130,  27355
140,  23190
150,  12568
160,  22774
170,  21934
180,  11408
190,  12476
200,  10366

# SIZE = 2000
   1, 582933
   2,  37514
   3,  26746
   4,  24760
   5,  40041
   6,  23869
   7,  26384
   8,  23183
   9,  22801
  10,  24316
  20,  28037
  30,  47624
  40,  37747
  50,  37559
  60,  47502
  70,  35459
  80,  37795
  90,  36272
 100,  36344
 200,  14478
 300,  11769
 400,  11557
 500,  10088
 600,   9965
 700,   9387
 800,   8322
 900,   7305
1000,   6849
2000,   3347

# SIZE = 100,000
     1, 592763
    10,  24077
   100,  24995
   200,  10734
   300,  19562
   400,   8904
   500,   6226
   600,   6723
   700,   6981
   800,   5283
   900,   4896
  1000,   4883
  2000,   3367
  3000,   4947
  4000,   5179
  5000,   4648
  6000,   4113
  7000,   4283
  8000,   4654
  9000,   4054
 10000,   4300
 20000,  19314
 30000,    671
 40000,    694
 50000,    686
 60000,    663
 70000,    717
 80000,    697
 90000,    686
100000,    686

The Java Hater's Response

That must be Garbage Collection

  • Let's disable GC
  • Epsilon GC to the rescue
  • A no-op garbage collector
  • It allocates memory but never frees it
  • -Xms5g -Xmx5g
    -XX:+UnlockExperimentalVMOptions
    -XX:+UseEpsilonGC
    -XX:+AlwaysPreTouch
# No GC
# SIZE = 100,000
     1, 549924
    10,  23824
   100,  34017
   200,  16395
   300,  12858
   400,  11460
   500,  10504
   600,   9787
   700,   9935
   800,  11464
   900,   7369
  1000,   7365
  2000,   2771
  3000,   2763
  4000,   2753
  5000,   2762
  6000,   2846
  7000,   2703
  8000,   2737
  9000,   2726
 10000,   2738
 20000,   2770
 30000,   2706
 40000,   2764
 50000,   2657
 60000,   2476
 70000,   1040
 80000,   1002
 90000,    982
100000,    990
# G1 GC
# SIZE = 100,000
     1, 595433
    10,  23628
   100,  40456
   200,  17898
   300,  11618
   400,  13881
   500,   9707
   600,   9335
   700,   9153
   800,   8872
   900,   7258
  1000,   6953
  2000,   2882
  3000,   2737
  4000,   2748
  5000,   2756
  6000,   2724
  7000,   2686
  8000,   2758
  9000,   2751
 10000,   2743
 20000,   4219
 30000,   5016
 40000,   4437
 50000,   1488
 60000,    999
 70000,   1056
 80000,   4011
 90000,   1221
100000,    755

The Java Hater's Response II

Java is slow

  • GraalVM to the rescue
  • Let's compile to machine code
  • native-image
# G1 GC
# SIZE = 100,000
     1, 595433
    10,  23628
   100,  40456
   200,  17898
   300,  11618
   400,  13881
   500,   9707
   600,   9335
   700,   9153
   800,   8872
   900,   7258
  1000,   6953
  2000,   2882
  3000,   2737
  4000,   2748
  5000,   2756
  6000,   2724
  7000,   2686
  8000,   2758
  9000,   2751
 10000,   2743
 20000,   4219
 30000,   5016
 40000,   4437
 50000,   1488
 60000,    999
 70000,   1056
 80000,   4011
 90000,   1221
100000,    755

# GraalVM Native
# SIZE = 100,000
     1,   8791
    10,   4736
   100,   1989
   200,   1830
   300,   5040
   400,   1870
   500,   2267
   600,   1903
   700,   2279
   800,   3889
   900,   1771
  1000,   1620
  2000,   2415
  3000,   3758
  4000,   1290
  5000,   1300
  6000,   1249
  7000,   2636
  8000,   2160
  9000,   1254
 10000,   1309
 20000,   1730
 30000,   2446
 40000,   1320
 50000,   1332
 60000,   1363
 70000,   1385
 80000,   2737
 90000,   1373
100000,   1278

# G1 GC
# SIZE = 1,000,000
      1, 563883
     10,  23505
    100,  23311
   1000,   6727
   2000,   5174
   3000,   5444
   4000,   5160
   5000,   4815
   6000,   5130
   7000,   5138
   8000,   5162
   9000,   5032
  10000,   5230
  20000,   3992
  30000,   3673
  40000,   1909
  50000,   1908
  60000,   1106
  70000,    963
  80000,    966
  90000,    981
 100000,    958
 200000,    683
 300000,    693
 400000,    694
 500000,    691
 600000,    699
 700000,    697
 800000,   1057
 900000,    710
1000000,    703

That is not helpful

The data is so inconsistent

Let's Analyze

Greetings from the Java Internals

  • Java has about three compilers
  • You interact only with javac
  • To make your code fly, another compiler compiles the bytecode on-demand
  • Classic: Hotspot Compiler (our focus right now)
  • New Kid: Graal Compiler
  • Tiered Compiler: C1 and C2 with four modes
  • Profiles and optimizes continuously(!)
  • Can inline, unroll loops, run escape analysis, predict branches, use intrinsics, do lock tricks, and a few more things (yes, we will get to that later)
  • Compiler decides what code is hot
  • Compilation is per method
  • In addition, loops are compiled on their own, because the method might be cold while the loop is hot

Some more information about Hotspot

Just to get a feel for the complexity

Compiler Stages [1]

  • Level 0: Interpreter only
  • Level 1: C1, full optimization, no profiling
  • Level 2: C1 with invocation and backedge counters
  • Level 3: C1 with full profiling
  • Level 4: C2, fully optimized
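
The individual tiers can be observed and constrained with standard HotSpot flags; a few useful combinations (MyBenchmark stands in for the actual main class):

```shell
# Level 0 only: pure interpreter
java -Xint MyBenchmark

# Stop tiered compilation at C1 (levels 0-1)
java -XX:TieredStopAtLevel=1 MyBenchmark

# Disable tiering: interpreter plus C2 only (classic server mode)
java -XX:-TieredCompilation MyBenchmark

# Log every compilation with its level
java -XX:+PrintCompilation MyBenchmark
```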

Compiler Flow [2]

Compare Interpreter vs. Compiler

Hotspot

No Tiered Compiler

Interpreter Only

Compile Everything

Compile Or Not to Compile

Compile when Needed vs. Right from the Start

  • -Xcomp forces Java to always compile to native code
  • The JIT may still change its mind and recompile later
  • 70x slower in the beginning to be faster later
  • Effect 1: We compile things we might never need again
  • Effect 2: We might run out of code cache
# defaults
      1, 560,960
      2,  34,772
      3,  26,167
      4,  24,237
      5,  35,926
      6,  24,116
      7,  23,163
      8,  22,895
      9,  24,142
     10,  23,174
    100,  23,596
    200,  12,973
    300,  12,361
    400,   9,852
    500,   9,596
    600,   8,774
    700,   8,956
    800,   7,280
    900,   7,076
   1000,   6,089
  10000,   5,071
  20000,   2,683
  30000,   2,703
  40000,   1,532
  50000,   1,121
  60000,   1,012
  70000,     975
  80000,     979
  90000,     987
 100000,     976
 900000,     700
1000000,     708
# -Xcomp
      1, 40,041,490
      2,     51,254
      3,     17,217
      4,      2,077
      5,      1,714
      6,      1,607
      7,      1,610
      8,      1,536
      9,      1,576
     10,      1,536
    100,      1,376
    200,      2,481
    300,      1,284
    400,      1,307
    500,      1,150
    600,      1,177
    700,      1,173
    800,      1,163
    900,      1,174
   1000,      1,546
  10000,      1,096
  20000,        962
  30000,        999
  40000,      1,093
  50000,        891
  60000,        883
  70000,        885
  80000,        862
  90000,        894
 100000,        869
 900000,        871
1000000,        870

Compile Or Not to Compile


# defaults, T14s AMD
      1, 1307425
      2,  51264
      3,  29962
      4,  29054
      5,  36178
      6,  28634
      7,  28635
      8,  28565
      9,  29053
     10,  29613
     20,  28495
     30,  28565
     40,  39111
     50,  29473
     60,  31009
     70,  31010
     80,  29543
     90,  27098
    100,  33734
    200,  10546
    300,   9428
    400,   8032
    500,   7543
    600,   6704
    700,   5866
    800,   6076
    900,   5238
   1000,   5168
 100000,   3701
 200000,   3352
 300000,   2934
 400000,   2863
 500000,   2933
 600000,   2864
 700000,   2933
 800000,   2794
 900000,   2933
1000000,   2794  

# -Xint
      1, 1329704
      2,  55733
      3,  32406
      4,  30940
      5,  30101
      6,  30939
      7,  30939
      8,  30520
      9,  30031
     10,  30800
     20,  30521
     30,  31429
     40,  30520
     50,  29822
     60,  29612
     70,  30451
     80,  30939
     90,  30730
    100,  33314
    200,  32337
    300,  31289
    400,  31009
    500,  31429
    600,  31219
    700,  30940
    800,  30940
    900,  30940
   1000,  31498
 100000,  33175
 200000,  23257
 300000,  21022
 400000,  20743
 500000,  20533
 600000,  22070
 700000,  20463
 800000,  21022
 900000,  20882
1000000,  20114

# JIT 1
      1, 1180942
      2,  36736
      3,  30031
      4,  29543
      5,  35130
      6,  28286
      7,  28565
      8,  28635
      9,  29054
     10,  29543
     20,  28705
     30,  30031
     40,  27657
     50,  28146
     60,  31499
     70,  31289
     80,  27936
     90,  28635
    100,  31918
    200,  11873
    300,   8591
    400,   7752
    500,   6146
    600,   6216
    700,   6215
    800,   4749
    900,   4190
   1000,   3771
 100000,   3772
 200000,   3841
 300000,   3422
 400000,   3352
 500000,   3422
 600000,   3212
 700000,   3422
 800000,   3772
 900000,   3283
1000000,   3212

# JIT 1-2
      1, 1196501
      2,  37854
      3,  30031
      4,  29124
      5,  32895
      6,  29053
      7,  28634
      8,  28076
      9,  27656
     10,  29333
     20,  29054
     30,  33942
     40,  33314
     50,  28564
     60,  30101
     70,  28844
     80,  28145
     90,  25771
    100,  25771
    200,  10407
    300,   8520
    400,   6705
    500,   5796
    600,   4679
    700,   4330
    800,   4260
    900,   4190
   1000,   4260
 100000,   4190
 200000,   4400
 300000,   4261
 400000,   4819
 500000,   4260
 600000,   6705
 700000,   4330
 800000,   3841
 900000,   3422
1000000,   3352

# JIT 1-3
      1, 1474763
      2,  83600
      3,  31009
      4,  28704
      5,  35689
      6,  29193
      7,  28565
      8,  28635
      9,  29193
     10,  29822
     20,  28635
     30,  28565
     40,  29054
     50,  29123
     60,  29403
     70,  28146
     80,  27238
     90,  27168
    100,  27238
    200,  11384
    300,  10337
    400,  12292
    500,   7194
    600,   5867
    700,   6705
    800,   5308
    900,   5099
   1000,   5657
 100000,   4819
 200000,   4749
 300000,   4750
 400000,   4679
 500000,   4749
 600000,   5867
 700000,   5238
 800000,   5238
 900000,   4330
1000000,   3841

Watch the Compiler


1157  104 %     3       org.sample.poor.PoorMansBenchmark::main @ 17 (470 bytes)
1162  105       3       org.sample.poor.PoorMansBenchmark::main (470 bytes)
1328  135       3       org.sample.poor.PoorMansBenchmark::append (66 bytes)
1454  153       4       org.sample.poor.PoorMansBenchmark::append (66 bytes)
1561  178 %     4       org.sample.poor.PoorMansBenchmark::main @ 55 (470 bytes)
1566  135       3       org.sample.poor.PoorMansBenchmark::append (66 bytes)   made not entrant
6002  178 %     4       org.sample.poor.PoorMansBenchmark::main @ 55 (470 bytes)   made not entrant
6069  367 %     3       org.sample.poor.PoorMansBenchmark::main @ 100 (470 bytes)
6082  407 %     4       org.sample.poor.PoorMansBenchmark::main @ 100 (470 bytes)

  • Time Stamp since start
  • Compilation order
  • Compiler level
  • Method and size of the bytecode
  • n: native
  • s: synchronized
  • !: has exception handler
  • %: On-stack replacement (OSR)
  • made not entrant: no longer to be used for new calls
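
A log like the one above can be produced with these flags (MyBenchmark is a placeholder for the actual main class):

```shell
# print each method as it is compiled (timestamp, id, level, method)
java -XX:+PrintCompilation MyBenchmark

# additionally show inlining decisions
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyBenchmark
```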

Java Performance Fun

Things to Know when Executing or Profiling Code

Ohhhh Trap

You trained it well

private final static int SIZE = 1000;

public static void main(String[] args)
{
    Object trap = null;
    Object o = null;
    
    for (int i = 0; i < 1000; i++)
    {
        final Timer t = Timer.startTimer();
        
        for (int j = 0; j < SIZE; j++)
        {
            // burn time and train that null is normal
            o = new Object();
            
            if (trap != null)
            {
                System.out.println("Got you." + o);
                trap = null;
            }
        }
        
        // Give me a Non-Null, Vasily. 
        // One Non-Null only, please.  
        if (i == 400)
        {
            trap = new Object();
        }

        System.out.println(
            MessageFormat.format("{1} {0, number, #}", 
            t.stop().runtimeNanos(), i));
    }        
}

Ohhhh...

WTH?

  • Don't trust your old measurements

Learning: Unused Code

If I don't need it, I remove it

  • Java freely removes code
    • Results are not needed
    • Code is probably not executed
    • Even loops might die and not just stay empty
  • Rare conditions can completely change the runtime behavior forever
  • Danger: That can kill all your benchmark results
  • Danger: One null or one exception can turn your performance upside-down
  • Important: If your code is too complicated, Java cannot make that call
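
The elimination is easy to reproduce with plain JDK code; a minimal sketch (class and method names are made up for illustration). In JMH, the idiomatic fix is to return the value or pass it to a Blackhole:

```java
// Sketch: the same loop with and without an observable result.
// After warm-up the JIT may remove the body of unused() entirely,
// while used() has to keep its work. (Timings vary by JVM/hardware.)
public class DeadCodeDemo
{
    static long unused()
    {
        long x = 0;
        for (int i = 0; i < 1_000_000; i++)
        {
            x += i;       // result never escapes the method...
        }
        return 0;         // ...so the whole loop is dead code
    }

    static long used()
    {
        long x = 0;
        for (int i = 0; i < 1_000_000; i++)
        {
            x += i;
        }
        return x;         // result escapes, the loop must stay
    }

    public static void main(final String[] args)
    {
        long sink = 0;
        for (int i = 0; i < 1_000; i++)
        {
            sink += unused() + used();
        }
        // consume the sink so the outer loop is not dead as well
        System.out.println(sink);
    }
}
```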

More is Less

You can horribly shoot yourself in the foot

/**
 * JMH Benchmark 
 * Vary SIZE from 1 to 10000    
 * @param args
 */
public int nonSense()
{
    Object trap = null;
    Object o = null;
    
    for (int i = 0; i < OUTER; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            o = new Object();
            
            if (trap != null)
            {
                o = new Object();
                trap = null;
            }
        }
            
        if (i == 500)
        {
            trap = new Object();
        }
    }        

    return trap == null ? 1 : 0;
}
  • We vary the inner loop length
# One-Shot Execution, i.e. measure one execution with no warm-up
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseSingle         1    ss         85,298          ns/op
RemoveNonSenseJMH.nonSenseSingle        10    ss        513,886          ns/op
RemoveNonSenseJMH.nonSenseSingle       100    ss      3,603,910          ns/op
RemoveNonSenseJMH.nonSenseSingle      1000    ss      9,465,274          ns/op
RemoveNonSenseJMH.nonSenseSingle     10000    ss     17,528,758          ns/op
# Repeated Execution, No Warm up, 100 ms duration
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseNoWarmup       1  avgt            963          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup      10  avgt            600          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup     100  avgt            597          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup    1000  avgt            697          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup   10000  avgt          2,258          ns/op
# Warm it Up First, 100 ms duration
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseWarmup         1  avgt            660          ns/op
RemoveNonSenseJMH.nonSenseWarmup        10  avgt            609          ns/op
RemoveNonSenseJMH.nonSenseWarmup       100  avgt            409          ns/op
RemoveNonSenseJMH.nonSenseWarmup      1000  avgt            439          ns/op
RemoveNonSenseJMH.nonSenseWarmup     10000  avgt            410          ns/op
# -Xint to verify
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseWarmup         1  avgt        113,536          ns/op
RemoveNonSenseJMH.nonSenseWarmup        10  avgt      1,110,772          ns/op
RemoveNonSenseJMH.nonSenseWarmup       100  avgt     11,449,399          ns/op
RemoveNonSenseJMH.nonSenseWarmup      1000  avgt    112,158,390          ns/op
RemoveNonSenseJMH.nonSenseWarmup     10000  avgt  1,104,793,429          ns/op

Understand Runtime Internals

Now We Can Really Talk About Performance

Expensive last statement

Just one statement more...

@Benchmark
public long add230()
{
    int sum = 0;
    int i = 0;
    
    // now 23 of these
    sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
           a[++i] + a[++i] + a[++i] + a[++i] + a[++i];

    return sum;
}

@Benchmark
public long add231()
{
    int sum = 0;
    int i = 0;
    
    // now 23 of these
    sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
           a[++i] + a[++i] + a[++i] + a[++i] + a[++i];
    // and this
    sum += a[++i];

    return sum;
}
Benchmark                      Mode  Cnt    Score   Units
ExpensiveLastAdd.add230  avgt    2   41.159   ns/op
ExpensiveLastAdd.add231  avgt    2  122.162   ns/op
  • Darn... the last add was really expensive

803  434       4  org.sample.ExpensiveLastAdd::add230 (2251 bytes)
---    
797  434       3  org.sample.ExpensiveLastAdd::add231 (2263 bytes)
799  436       4  org.sample.ExpensiveLastAdd::add231 (2263 bytes)
811  436       4  org.sample.ExpensiveLastAdd::add231 (2263 bytes)   
                    COMPILE SKIPPED: Out of stack space (not retryable)
  • We cannot C2 compile it, code is too "complex"
  • Stuck with the C1 result and full profiling turned on
  • Learning: Overly large methods are a problem

Your loop is easy

Flatten out loops


private int next() {
    return r.nextInt(1) + 1;
}

@Benchmark
public int classic() {
    int sum = 0;
    int step = next();
    for (int i = 0; i < ints.length; i = i + 1)
    {
        sum += ints[i];
        step = next();
    }
    return sum + step;
}

@Benchmark
public int variable() {
    int sum = 0;
    int step = next();
    for (int i = 0; i < ints.length; i = i + step) 
    {
        sum += ints[i];
        step = next();
    }
    return sum + step;
}

Benchmark            (size)  Mode  Cnt     Score Units
LoopUnroll.classic    10000  avgt    2    24,416 ns/op
LoopUnroll.variable   10000  avgt    2    35,011 ns/op

# -Xint
LoopUnroll.classic    10000  avgt    2 2,650,812 ns/op
LoopUnroll.variable   10000  avgt    2 2,449,845 ns/op

# classic() 
Benchmark              Mode    Score     Units
time                   avgt   25,920.0   ns/op
CPI                    avgt        0.572  #/op # Cycles per Instruction
IPC                    avgt        1.747  #/op # Instructions per Cycle
L1-dcache-load-misses  avgt      648.0    #/op
L1-dcache-loads        avgt   10,242.0    #/op
L1-dcache-stores       avgt      294.0    #/op
branch-misses          avgt        4.8    #/op
branches               avgt    2,694.0    #/op
cycles                 avgt   61,092.0    #/op
instructions           avgt  106,718.0    #/op

# variable()
Benchmark              Mode    Score     Units
time                   avgt   35,020.0   ns/op
CPI                    avgt        0.316  #/op
IPC                    avgt        3.165  #/op
L1-dcache-load-misses  avgt      677.0    #/op
L1-dcache-loads        avgt   30,203.0    #/op
L1-dcache-stores       avgt   10,293.0    #/op
branch-misses          avgt        8.3    #/op
branches               avgt   29,832.0    #/op
cycles                 avgt   87,294.0    #/op
instructions           avgt  276,379.0    #/op
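
What the JIT does to classic() can be approximated by hand: with a provably constant step it can widen the loop body and cut the loop overhead. A hand-written illustration, not the actual generated code; the next() calls are left out for clarity:

```java
// Roughly what an unroll-by-4 of the summation loop looks like
public class LoopUnrollSketch
{
    static int unrolledSum(final int[] ints)
    {
        int sum = 0;
        int i = 0;

        // main loop: four elements per iteration, less loop overhead
        final int limit = ints.length - 3;
        for (; i < limit; i += 4)
        {
            sum += ints[i] + ints[i + 1] + ints[i + 2] + ints[i + 3];
        }

        // tail loop: the remaining 0-3 elements
        for (; i < ints.length; i++)
        {
            sum += ints[i];
        }
        return sum;
    }
}
```

With a variable step the JIT cannot prove the iteration pattern, so it keeps the naive one-element-per-iteration shape.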

Detour: The modern CPU

Everything happens in parallel

  • Current Intel CPUs can do 3 to 4 instructions per cycle
  • One cycle is one tick of the CPU clock: at 2.5 GHz, a cycle takes 0.4 ns
  • Parallel execution units per core
  • Parallel execution of code despite code not being programmed that way
  • CPU reorders when permitted to fill the execution units
  • Tries to do things ahead of the result (prediction, speculation)
  • When units aren't filled, performance is wasted
  • Branches/jumps are expensive

Escape Analysis

Make it less expensive

@Benchmark
public long array64()
{
    int[] a = new int[64];
    
    a[0] = r.nextInt();
    a[1] = r.nextInt();
    
    return a[0] + a[1];
}

@Benchmark
public long array65()
{
    int[] a = new int[65];
    
    a[0] = r.nextInt();
    a[1] = r.nextInt();
    
    return a[0] + a[1];
}
EscapeAnalysis.array64                           avgt  2    22.804           ns/op
EscapeAnalysis.array65                           avgt  2    41.890           ns/op

# -prof gc
EscapeAnalysis.array64                           avgt  2    22.804           ns/op
EscapeAnalysis.array64:·gc.alloc.rate            avgt  2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm       avgt  2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array64:·gc.count                 avgt  2       ≈ 0          counts

EscapeAnalysis.array65                           avgt  2    41.890           ns/op
EscapeAnalysis.array65:·gc.alloc.rate            avgt  2  5795.571          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm       avgt  2   280.000            B/op
EscapeAnalysis.array65:·gc.count                 avgt  2    93.000          counts
# -XX:EliminateAllocationArraySizeLimit=70
EscapeAnalysis.array65                           avgt  2    22.279           ns/op
EscapeAnalysis.array65:·gc.alloc.rate            avgt  2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm       avgt  2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array65:·gc.count                 avgt  2       ≈ 0          counts
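
These optimizations can be toggled to verify what is happening; the relevant HotSpot C2 flags (the array-size limit defaults to 64, which explains the 64 vs. 65 cliff above):

```shell
# turn escape analysis off entirely: array64 should allocate as well
java -XX:-DoEscapeAnalysis MyBenchmark

# keep escape analysis but disable scalar replacement of allocations
java -XX:-EliminateAllocations MyBenchmark

# raise the array-length limit for scalar replacement (default: 64)
java -XX:EliminateAllocationArraySizeLimit=70 MyBenchmark
```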

Intrinsics or Native

Native code to the rescue

final int SIZE = 100_000;
final int[] src = new int[SIZE];
    
public int[] manualArrayCopy()
{
    int[] target = new int[SIZE];
    for (int i = 0; i < SIZE; i++)
    {
        target[i] = src[i];
    }
    
    return target;
}

public int[] systemArrayCopy()
{
    int[] target = new int[SIZE];
    System.arraycopy(src, 0, target, 0, SIZE);
    
    return target;
}

public int[] arraysCopy()
{
    return Arrays.copyOf(src, src.length);
}

Benchmark                   Mode  Cnt   Score    Units
IntrinsicsArrayCopy.manual  avgt    2  48,177    ns/op
IntrinsicsArrayCopy.system  avgt    2  48,678    ns/op
  • Uh... native is not faster?
  • Well, once Java has compiled our code, it is native too

Iteration         Manual           System
===============================================
Iteration   1: 1,431,758 ns/op    129,551 ns/op
Iteration   2:   714,942 ns/op     81,926 ns/op
Iteration   3:   188,272 ns/op     77,098 ns/op
Iteration   4:    70,491 ns/op     72,010 ns/op
Iteration   5:    82,862 ns/op     73,495 ns/op
Iteration   6:    66,534 ns/op     75,625 ns/op
Iteration   7:    61,277 ns/op     74,382 ns/op
  • Native is quickly outperformed by Hotspot
  • BUT: That depends on your hardware
  • Hotspot can vary its code and intrinsics selection based on CPU instructions available

Wait... What are Intrinsics?

Native Code vs. Intrinsics

  • Native: A JNI based method call that calls any kind of provided native platform code
  • Intrinsics: Native platform code, but not called as method but directly placed in the Hotspot compiled code
# -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
@ 54   java.lang.Math::min (11 bytes)   (intrinsic)
@ 57   java.lang.System::arraycopy (0 bytes)   (intrinsic)
  • public static native void arraycopy(...) can be a native call or an intrinsic
  • Java makes that decision when the VM is started based on CPU features
  • Even for methods fully visible in the source code of the JDK as Java, Java can and will use intrinsics instead, see java.lang.Math
  • Instead of compiling to native step-by-step, it will insert pre-determined native code based on the platform it runs on
  • Important: Java will change the code generated by the compiler based on the platform and CPU it runs on!
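
Integer.bitCount illustrates this nicely: the JDK ships a plain Java implementation, but on x86 CPUs with POPCNT support HotSpot replaces the call with that single instruction. A manual equivalent for comparison (a sketch using the classic clear-lowest-bit trick):

```java
public class PopCount
{
    // plain-Java population count; Integer.bitCount computes the same
    // value, but HotSpot usually emits a single POPCNT instruction for it
    static int manualBitCount(int i)
    {
        int count = 0;
        while (i != 0)
        {
            i &= i - 1;   // clear the lowest set bit
            count++;
        }
        return count;
    }
}
```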

Intrinsics or Native II

# Lenovo T530, JDK 8
Benchmark                   Mode  Cnt      Score  Units
IntrinsicsArrayCopy.manual  avgt    2 48,177.054  ns/op
IntrinsicsArrayCopy.system  avgt    2 48,678.219  ns/op

# Lenovo T450s, JDK 8, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  66,560.515  ns/op
IntrinsicsArrayCopy.system  avgt    2  67,051.983  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  60,674.969  ns/op
# Lenovo T450s, JDK 11, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  67,476.175  ns/op
IntrinsicsArrayCopy.system  avgt    2  58,952.894  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  58,771.408  ns/op
# Lenovo T450s, GraalVM EE 1.0.0.14, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  94,315.864  ns/op
IntrinsicsArrayCopy.system  avgt    2  70,201.933  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  71,605.877  ns/op
# Lenovo T14s, AMD, JDK 11, BAT Performance
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.arrays  avgt    2  44,172.944  ns/op
IntrinsicsArrayCopy.manual  avgt    2  46,073.504  ns/op
IntrinsicsArrayCopy.system  avgt    2  43,730.640  ns/op
# Lenovo T450s, JDK 11, AC
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  56,108.312  ns/op
IntrinsicsArrayCopy.system  avgt    2  55,748.067  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  53,769.876  ns/op
# AWS c5.xlarge, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  83,066.975  ns/op
IntrinsicsArrayCopy.system  avgt    2  63,422.029  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  64,748.500  ns/op
# GCP n1-standard-4, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2 146,932.588  ns/op
IntrinsicsArrayCopy.system  avgt    2 101,970.577  ns/op
IntrinsicsArrayCopy.arrays  avgt    2 116,233.650  ns/op
# GCP n1-highcpu-8, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  82,865.258  ns/op
IntrinsicsArrayCopy.system  avgt    2  62,810.793  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  61,237.489  ns/op
# Lenovo T450s, JDK 11, BAT, Min Power
Benchmark                   Mode  Cnt        Score  Units
IntrinsicsArrayCopy.manual  avgt    2  215,360.090  ns/op
IntrinsicsArrayCopy.system  avgt    2  210,971.627  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  214,870.207  ns/op

Stupid Compiler


@Setup
public void setup()
{
    foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
}

public String plain()
{    
    return "a" + foo;
}

public String concat()
{    
    return "a".concat(foo);
}

public String builder()
{    
    return new StringBuilder().append("a").append(foo).toString();
}

public String builderSized()
{    
    return new StringBuilder(32).append("a").append(foo).toString();
}

public String builderNonFluid()
{    
     final StringBuilder sb = new StringBuilder();
     sb.append("a");
     sb.append(foo);
     return sb.toString();
}

public String builderNonFluidSized()
{    
    final StringBuilder sb = new StringBuilder(32);
    sb.append("a");
    sb.append(foo);
    return sb.toString();
}

Let's make some Strings


Benchmark                      Mode  Cnt   Score  Units
Strings1.builderSized          avgt    2  12.643  ns/op
Strings1.builder               avgt    2  12.655  ns/op
Strings1.plain                 avgt    2  13.087  ns/op
Strings1.concat                avgt    2  15.919  ns/op
Strings1.builderNonFluidSized  avgt    2  17.599  ns/op
Strings1.builderNonFluid       avgt    2  24.259  ns/op
  • A sized builder is not faster?
  • A builder that is not fluid is slower?
  • Java JIT has optimizations patterns it follows
  • Yes, javac is the first to change the code

# -Xint
Strings1.concat                avgt    2   875.504  ns/op
Strings1.builderSized          avgt    2  2395.736  ns/op
Strings1.builderNonFluidSized  avgt    2  2439.351  ns/op
Strings1.builder               avgt    2  2770.349  ns/op
Strings1.builderNonFluid       avgt    2  2746.212  ns/op
Strings1.plain                 avgt    2  3014.032  ns/op
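
That javac rewrites string concatenation can be seen directly in the bytecode (assuming the benchmark class compiles standalone): on JDK 9+ "a" + foo becomes a single invokedynamic bound to StringConcatFactory, on JDK 8 an inlined StringBuilder chain:

```shell
javac Strings1.java
javap -c Strings1
```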

Let's talk hardware

Hardware does not matter... you wish!

final int SIZE = 1_000_000;
final int[] src = new int[SIZE];
    
@Benchmark
public int step1()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        sum += src[i];
    }
    
    return sum;
}

@Benchmark
public int step20()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i = i + 20)
    {
        sum += src[i];
    }
    
    return sum;
}
  • Theory: step20 must be faster, about 20x

Benchmark                 Mode  Cnt    Score  Error  Units
ArraysAndHardware.step1   avgt    2  325,159         ns/op
ArraysAndHardware.step20  avgt    2   97,402         ns/op
  • We expected 16,300 ns/op
  • We got only 3.3x more speed
  • What? Why?
  • P.S. step20 results will vary a lot when running repeatedly

Hardware Greetings

Also Java has to live by hardware laws

Linux Perf for step1
==========================================================================================
       4485.209219      task-clock (msec)         #    0,290 CPUs utilized          
    11,127,672,435      cycles                    #    2,481 GHz                      (38,32%)
    11,620,571,786      instructions              #    1,04  insn per cycle           (46,29%)
       665,870,415      branches                  #  148,459 M/sec                    (46,36%)
         1,066,446      branch-misses             #    0,16% of all branches          (46,40%)
     9,482,675,316      L1-dcache-loads           # 2114,210 M/sec                    (46,50%)
       593,181,935      L1-dcache-load-misses     #    6,26% of all L1-dcache hits    (46,59%)
        21,448,491      LLC-loads                 #    4,782 M/sec                    (30,64%)
         5,856,980      LLC-load-misses           #   27,31% of all LL-cache hits     (30,67%)
Linux Perf for step20
==========================================================================================
       4455.946190      task-clock (msec)         #    0,286 CPUs utilized          
    11,046,881,256      cycles                    #    2,479 GHz                      (38,45%)
     3,836,171,205      instructions              #    0,35  insn per cycle           (46,27%)
       927,360,764      branches                  #  208,118 M/sec                    (46,40%)
           873,526      branch-misses             #    0,09% of all branches          (46,49%)
       979,546,079      L1-dcache-loads           #  219,829 M/sec                    (46,57%)
       860,310,077      L1-dcache-load-misses     #   87,83% of all L1-dcache hits    (46,66%)
       746,566,165      LLC-loads                 #  167,544 M/sec                    (30,76%)
       330,798,682      LLC-load-misses           #   44,31% of all LL-cache hits     (30,64%)

Detour: Cost

What it costs to leave the CPU

                      Real        Humanized
CPU Cycle             0.4 ns      1 s
L1 Cache              0.9 ns      2 s
L2 Cache              2.8 ns      7 s
L3 Cache              28 ns       1 min
Memory Access         100 ns      4 min
NVM SSD               25 μs       17 h
SSD                   50–150 μs   1.5–4 days
HDD                   1–10 ms     1–9 months
TCP SF to NY          65 ms       5 years
TCP SF to Hong Kong   141 ms      11 years
  • It costs 100 to 200 cycles when we have to go to main memory
  • Hence Java and the CPU avoid main memory access at all cost
  • Java does not have the notion of a memory location, hence this is transparent
  • Things will change with project Valhalla and its "inline classes", which bring primitive data type performance to "objects"
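The "Humanized" column simply rescales time so that one 0.4 ns CPU cycle becomes one second; a quick sanity check of the memory-access row:

```java
public class HumanizedLatency {
    // Scale factor: 0.4 ns (one CPU cycle) becomes 1 s
    static double humanize(double seconds) {
        return seconds / 0.4e-9;
    }

    public static void main(String[] args) {
        // A 100 ns main-memory access becomes 250 "humanized" seconds,
        // i.e. roughly 4 minutes - matching the table above
        System.out.println(humanize(100e-9) / 60.0);
    }
}
```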

Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Picard: How long a time?
Data: 0.68 seconds, sir. For an android, that is nearly an eternity.

The Matrix


final int[][] src = new int[1000][1000];

@Benchmark
public int horizontal() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

@Benchmark
public int vertical() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[j][i];
        }
    }
    return sum;
}

@Benchmark
public int horizontalBackwards() {
    int sum = 0;
    for (int i = SIZE - 1; i >=0; i--)
    {
        for (int j = SIZE - 1; j >= 0; j--)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

What is real and what is not

  • All calculations should have the same speed, shouldn't they?

# Lenovo T530, Intel i7
Benchmark                                Mode           Score   Error  Units
horizontal                               avgt     470,009.870          ns/op
horizontalBackwards                      avgt     672,821.809          ns/op
vertical                                 avgt   3,209,562.255          ns/op
  • Horizontal access is about 7x faster than vertical; forward is about 30% faster than backwards!

# Lenovo T14s, AMD Ryzen 7
Benchmark                                Mode           Score   Error  Units
horizontal                               avgt     266,470.908          ns/op
horizontalBackwards                      avgt     270,099.755          ns/op
vertical                                 avgt   1,197,509.843          ns/op
  • Horizontal access is about 4.5x faster; forward vs. backwards shows almost no difference anymore
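The root cause: Java stores an `int[1000][1000]` as 1000 separate row arrays, so `src[i][j]` with fixed `i` walks one row sequentially, while `src[j][i]` hops between 1000 different objects. A common workaround (a sketch, not from the talk) is to flatten the matrix into a single contiguous array:

```java
public class FlatMatrix {
    final int n;
    final int[] data; // row-major layout: element (i, j) lives at i * n + j

    FlatMatrix(int n) {
        this.n = n;
        this.data = new int[n * n];
    }

    int get(int i, int j) {
        return data[i * n + j];
    }

    // The sum stays stride-1 in memory regardless of loop nesting,
    // because the whole matrix is one contiguous array.
    int sum() {
        int sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        FlatMatrix m = new FlatMatrix(3);
        for (int i = 0; i < m.data.length; i++) m.data[i] = i;
        System.out.println(m.get(1, 2)); // 5
        System.out.println(m.sum());     // 36
    }
}
```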

The Matrix Reloaded


final int[][] src = new int[1000][1000];

@Benchmark
public int horizontal() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

@Benchmark
public int vertical() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[j][i];
        }
    }
    return sum;
}

@Benchmark
public int horizontalBackwards() {
    int sum = 0;
    for (int i = SIZE - 1; i >=0; i--)
    {
        for (int j = SIZE - 1; j >= 0; j--)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

# Lenovo T530
Benchmark                                           Mode       Score    Units
Matrix.horizontal                                   avgt     470,009     ns/op
Matrix.horizontal:L1-dcache-load-misses             avgt      63,645      #/op
Matrix.horizontal:L1-dcache-loads                   avgt   1,014,532      #/op
Matrix.horizontal:LLC-load-misses                   avgt         429      #/op
Matrix.horizontal:LLC-loads                         avgt       1,523      #/op

Matrix.horizontalBackwards                          avgt     672,821     ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses    avgt      63,053      #/op
Matrix.horizontalBackwards:L1-dcache-loads          avgt   1,007,381      #/op
Matrix.horizontalBackwards:LLC-load-misses          avgt         854      #/op
Matrix.horizontalBackwards:LLC-loads                avgt       2,764      #/op

Matrix.vertical                                     avgt   3,209,562     ns/op
Matrix.vertical:L1-dcache-load-misses               avgt   2,060,998      #/op
Matrix.vertical:L1-dcache-loads                     avgt   3,020,796      #/op
Matrix.vertical:LLC-load-misses                     avgt      15,223      #/op
Matrix.vertical:LLC-loads                           avgt     694,397      #/op

# Lenovo T14s
Benchmark                                           Mode       Score     Units
Matrix.horizontal                                   avgt     286,633     ns/op
Matrix.horizontal:L1-dcache-load-misses             avgt      63,902      #/op
Matrix.horizontal:L1-dcache-loads                   avgt   1,031,015      #/op
Matrix.horizontal:stalled-cycles-backend            avgt   1,015,230      #/op
Matrix.horizontal:stalled-cycles-frontend           avgt       4,848      #/op

Matrix.horizontalBackwards                          avgt     278,992     ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses    avgt      64,526      #/op
Matrix.horizontalBackwards:L1-dcache-loads          avgt   1,033,686      #/op
Matrix.horizontalBackwards:stalled-cycles-backend   avgt     870,881      #/op
Matrix.horizontalBackwards:stalled-cycles-frontend  avgt       5,029      #/op

Matrix.vertical                                     avgt   1,484,345     ns/op
Matrix.vertical:L1-dcache-load-misses               avgt   2,114,064      #/op
Matrix.vertical:L1-dcache-loads                     avgt   3,108,037      #/op
Matrix.vertical:stalled-cycles-backend              avgt   5,081,442      #/op
Matrix.vertical:stalled-cycles-frontend             avgt      16,206      #/op

Matrix - You probably did it wrong

Influence of the size on the result


Benchmark             (SIZE) Mode            Score Factor  Units
horizontal                1  avgt              5.3    1.0  ns/op
horizontal               10  avgt             45.2    1.0  ns/op
horizontal               20  avgt            141.3    1.0  ns/op
horizontal              100  avgt          3,388.0    1.0  ns/op
horizontal             1000  avgt        373,619.3    1.0  ns/op
horizontal            10000  avgt     41,701,496.2    1.0  ns/op

horizontalBackwards       1  avgt              5.3    1.0  ns/op
horizontalBackwards      10  avgt             43.2    1.0  ns/op
horizontalBackwards      20  avgt            140.1    1.0  ns/op
horizontalBackwards     100  avgt          3,261.3    1.0  ns/op
horizontalBackwards    1000  avgt        511,246.1    1.4  ns/op
horizontalBackwards   10000  avgt     46,908,230.0    1.1  ns/op

vertical                  1  avgt              5.3    1.0  ns/op
vertical                 10  avgt             77.0    1.7  ns/op
vertical                 20  avgt            275.8    2.0  ns/op
vertical                100  avgt          6,226.3    1.8  ns/op
vertical               1000  avgt      2,719,478.0    7.2  ns/op
vertical              10000  avgt  1,683,041,892.2   40.4  ns/op
  • Understand your future data set
  • Use production hardware
  • Verify theories!

Wrong again?

Understand your hardware correctly

  • Executed with independent thread counts from 1 to 8
  • What might be the difference?

# RUN 1 - T14s, AMD
Benchmark     Mode  Cnt    Score   Diff   Units
threads1      avgt    2  250.514          ms/op
threads2      avgt    2  257.087    +3%   ms/op
threads4      avgt    2  284.657   +13%   ms/op
threads6      avgt    2  330.602   +32%   ms/op
threads8      avgt    2  375.904   +50%   ms/op
  • Only one or two active cores run turbo mode
  • More than three active cores and the clock is lowered to keep the system cool

# RUN 2 - T14s, AMD, turbo mode off for verification
Benchmark     Mode  Cnt    Score   Diff  Units
threads1      avgt    2  611.487         ms/op
threads2      avgt    2  612.605    +0%  ms/op
threads4      avgt    2  611.488    +0%  ms/op
threads6      avgt    2  616.488    +1%  ms/op
threads8      avgt    2  624.876    +2%  ms/op

private final int SIZE = 100_000_000;
private final int[] SOURCE = new int[SIZE];

double doCPUBoundStuff(int[] data) {
    int numDataPoints = data.length;
    int i = 0;
    double avg = 0;
    double var = 0;

    data[0] = 0;
    data[1] = 1;
    
    for (i = 2; i < numDataPoints; i++) {
        data[i] = data[i-1] + data[i-2];
    }

    for (avg = 0, i = 0; i < numDataPoints; i++) {
        avg += (double) data[i];
    }
    avg /= (double) numDataPoints;

    for (var = 0, i = 0; i < numDataPoints; i++) {
        double diff = (double) data[i] - avg;
        var += diff*diff;
    }
    var /= (double) (numDataPoints - 1);
    
    return avg;
}

@Setup
public void setup() {
    for (int i = 0; i < SIZE; i++) {
        SOURCE[i] = i;
    }
}

@Benchmark
@Threads(1) // repeated with 2, 4, 6, and 8 threads
public double t1() {
    return doCPUBoundStuff(SOURCE);
}

I know what you will do next

Modern fortune tellers


private static final int COUNT = 10_000;

// Arrays contain random numbers from -50 to 50
private int[] sorted;
private int[] unsorted;
private int[] reversed;

public void doIt(
            int[] array, 
            Blackhole bh1, 
            Blackhole bh2)
{
    for (int v : array)
    {
        if (v > 0)
        {
            bh1.consume(v);
        }
        else
        {
            bh2.consume(v);
        }
    }
}
  • Will there be any difference when passing sorted or unsorted arrays?

Benchmark  Mode  Cnt      Score   Error  Units
reversed   avgt    2  39570.696          ns/op
sorted     avgt    2  39859.027          ns/op
unsorted   avgt    2  66043.605          ns/op
  • Sorted is 65% faster. Why?
  • Let's ask the CPU

Branch Prediction

Modern CPU marvels: I know what you might need next


Benchmark                                        Mode        Score   Error  Units
=================================================================================
BranchPrediction.reversed                        avgt    35,182.532         ns/op
BranchPrediction.reversed:IPC                    avgt         3.257          #/op
BranchPrediction.reversed:branch-misses          avgt        13.767          #/op
BranchPrediction.reversed:branches               avgt    55,339.912          #/op
BranchPrediction.reversed:cycles                 avgt   110,993.223          #/op
BranchPrediction.reversed:instructions           avgt   361,508.018          #/op

BranchPrediction.sorted                          avgt    35,183.238         ns/op
BranchPrediction.sorted:IPC                      avgt         3.257          #/op
BranchPrediction.sorted:branch-misses            avgt        14.738          #/op
BranchPrediction.sorted:branches                 avgt    55,884.946          #/op
BranchPrediction.sorted:cycles                   avgt   112,912.763          #/op
BranchPrediction.sorted:instructions             avgt   367,820.131          #/op

BranchPrediction.unsorted                        avgt    65,770.667         ns/op
BranchPrediction.unsorted:IPC                    avgt         1.577          #/op
BranchPrediction.unsorted:branch-misses          avgt     3,795.737          #/op
BranchPrediction.unsorted:branches               avgt    56,008.188          #/op
BranchPrediction.unsorted:cycles                 avgt   211,089.268          #/op
BranchPrediction.unsorted:instructions           avgt   333,105.231          #/op
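The counters above can be mimicked with a toy model (purely illustrative, far simpler than a real predictor): simulate a 1-bit branch predictor that always predicts "same outcome as last time" and count its mispredictions on a sorted vs. an unsorted sign pattern.

```java
import java.util.Arrays;
import java.util.Random;

public class PredictorToy {
    // 1-bit "predict the last outcome" model; returns the miss count
    static int mispredictions(int[] values) {
        int misses = 0;
        boolean lastTaken = false;
        for (int v : values) {
            boolean taken = v > 0;
            if (taken != lastTaken) {
                misses++; // outcome flipped: the predictor was wrong
            }
            lastTaken = taken;
        }
        return misses;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        int[] data = r.ints(10_000, -50, 51).toArray();
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        // Sorted data flips from "not taken" to "taken" at most once;
        // unsorted data flips roughly every other element
        System.out.println("sorted:   " + mispredictions(sorted));
        System.out.println("unsorted: " + mispredictions(data));
    }
}
```

On sorted input the branch outcome changes only once, so virtually every prediction is correct; on unsorted input roughly half of all branches flip, which is exactly the ~3,800 branch-misses per op that perf reports.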

Branches: Know what you test

How small changes can mess up things big time


final int d = i % 2 == 0 ? -1 : 1; 


public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   5,147.764   ns/op
unsorted   avgt   3,317.656   ns/op
reversed   avgt   4,738.261   ns/op

final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5);

public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   3,588.073   ns/op
unsorted   avgt  14,282.879   ns/op
reversed   avgt   5,046.499   ns/op

final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5); 

public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v >= 0)
        {
            sum -= -v;
        }
        else
        {
            sum += +v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   3,226.433   ns/op
unsorted   avgt   3,261.952   ns/op
reversed   avgt   3,234.786   ns/op

final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5);
        
public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v >= 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   5,044.253   ns/op
unsorted   avgt  11,669.557   ns/op
reversed   avgt   5,390.210   ns/op
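For this particular if/else the data-dependent branch can be removed entirely: both arms end up subtracting the magnitude of v (positive v is subtracted, negative v is added), so the loop body collapses to `sum -= Math.abs(v)`. A hedged sketch of that rewrite (not from the slides; `Math.abs` typically compiles to branch-free code on common CPUs):

```java
public class BranchFree {
    // Branchy version, as on the slide
    static int branchy(int[] array) {
        int sum = 0;
        for (int v : array) {
            if (v >= 0) {
                sum -= v;
            } else {
                sum += v;
            }
        }
        return sum;
    }

    // Both arms subtract |v|, so the branch can go away entirely
    static int branchless(int[] array) {
        int sum = 0;
        for (int v : array) {
            sum -= Math.abs(v);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = { 3, -4, 0, 7, -1 };
        System.out.println(branchy(data));    // -15
        System.out.println(branchless(data)); // -15
    }
}
```

This is also roughly what the JIT did on its own in the uniform-timing run above: it recognized the pattern and emitted conditional moves instead of a branch.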

Short Summary

Measurement

  • Naive benchmarks are wrong
  • No JMH, no fun
  • Even JMH measurements can be wrong
  • How and where you measure makes a difference

JDK

  • Forget your Java optimization knowledge
  • Java now is not what most people think Java is

JVM

  • Tiered runtime compilation
  • JIT removes code when not needed
  • JIT tries to keep data local
  • Large methods kill the C2 optimization
  • Loops are optimized
  • Intrinsics are used dynamically

Hardware

  • Memory access is expensive
  • Caches are your friend
  • Branches are expensive
  • The CPU speculates

Synchronization

Being alone and still having to sync

private final int SIZE = 1024;
private final Map<String, String> syncMap = 
            Collections.synchronizedMap(new HashMap<>(SIZE));
private final Map<String, String> unsyncMap = 
            new HashMap<>(SIZE);
private final Map<String, String> conMap = 
            new ConcurrentHashMap<>(SIZE);

@Benchmark
public String syncAccess() 
{
    return syncMap.get("1");
}
@Benchmark
public String unsyncAccess() 
{
    return unsyncMap.get("1");
}
@Benchmark
public String conAccess() 
{
    return conMap.get("1");
}
  • Which one is slowest and by how much?
  • Remember, just one thread, so no contention
Benchmark                       Mode  Cnt   Score   Units
SynchronizedAlone.conAccess     avgt    2   9.152   ns/op
SynchronizedAlone.syncAccess    avgt    2  23.462   ns/op
SynchronizedAlone.unsyncAccess  avgt    2   9.366   ns/op
  • Synced is about 2.5x slower
  • ConcurrentHashMap does well
  • Moving the maps to local scope totally changes the result - lock elision
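The "local scope" observation in the last bullet can be sketched like this (a hypothetical variant, not from the slides): when escape analysis proves the synchronized wrapper never leaves the method, the JIT can drop the lock altogether.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class LockElisionSketch {
    // The map never escapes this method, so after inlining the JIT can
    // prove the monitor is thread-local and elide the synchronization -
    // making this as fast as the plain HashMap version.
    static String localSyncAccess() {
        Map<String, String> map =
                Collections.synchronizedMap(new HashMap<>());
        map.put("1", "one");
        return map.get("1");
    }

    public static void main(String[] args) {
        System.out.println(localSyncAccess()); // one
    }
}
```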

With Contention

Let's try more threads

2 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2    9.721          ns/op
SynchronizedThreads4.syncAccess    avgt    2  183.086          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2    9.312          ns/op
4 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   19.661          ns/op
SynchronizedThreads4.syncAccess    avgt    2  216.041          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   19.631          ns/op
8 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   44.055          ns/op
SynchronizedThreads4.syncAccess    avgt    2  377.180          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   38.562          ns/op

Looping

Loops compared


Benchmark                           (length)  Mode  Cnt   Score   Error  Units
ForEachSimple.arrayFor                     1  avgt    2     3.8          ns/op
ForEachSimple.arrayFor                    10  avgt    2     9.8          ns/op
ForEachSimple.arrayFor                   100  avgt    2    64.4          ns/op
ForEachSimple.arrayFor                  1000  avgt    2 1,446.1          ns/op

ForEachSimple.arrayEnhanced                1  avgt    2     3.5          ns/op
ForEachSimple.arrayEnhanced               10  avgt    2     9.5          ns/op
ForEachSimple.arrayEnhanced              100  avgt    2    63.2          ns/op
ForEachSimple.arrayEnhanced             1000  avgt    2 1,199.3          ns/op

ForEachSimple.classicFor                   1  avgt    2     5.7          ns/op
ForEachSimple.classicFor                  10  avgt    2    15.5          ns/op
ForEachSimple.classicFor                 100  avgt    2   103.0          ns/op
ForEachSimple.classicFor                1000  avgt    2 1,388.8          ns/op

ForEachSimple.enhancedFor                  1  avgt    2     6.1          ns/op
ForEachSimple.enhancedFor                 10  avgt    2    17.0          ns/op
ForEachSimple.enhancedFor                100  avgt    2   118.4          ns/op
ForEachSimple.enhancedFor               1000  avgt    2 1,578.6          ns/op

ForEachSimple.lambdaStream                 1  avgt    2    53.7          ns/op
ForEachSimple.lambdaStream                10  avgt    2    65.5          ns/op
ForEachSimple.lambdaStream               100  avgt    2   153.0          ns/op
ForEachSimple.lambdaStream              1000  avgt    2 1,326.0          ns/op

ForEachSimple.parallelLambdaStream         1  avgt    2    94.2          ns/op
ForEachSimple.parallelLambdaStream        10  avgt    2 6,283.2          ns/op
ForEachSimple.parallelLambdaStream       100  avgt    2 8,443.4          ns/op
ForEachSimple.parallelLambdaStream      1000  avgt    2 9,367.6          ns/op

// just some strings
final List<String> list = new ArrayList<>();
String[] array = null;

@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup() {.. // uses length ..}

@Benchmark
public void arrayFor(Blackhole bh) 
{
    int result = 0;
    for (int i = 0; i < array.length; i++) 
    {
        result += array[i].length();
    }
    
    bh.consume(result);
}

// classicFor
for (int i = 0; i < list.size(); i++) 
{
    result += list.get(i).length();
}


// enhancedFor
for (String s : list) 
{
    result += s.length();
}

// lambdaStream
int result = list.stream()
    .mapToInt(s -> s.length())
    .sum();

//parallelLambdaStream
int result = list.parallelStream()
    .mapToInt(s -> s.length())
    .sum();

Looping

Not so empty loops compared


Benchmark             (length)  Mode  Cnt     Score   Error  Units
arrayFor                     1  avgt    2      238.3         ns/op
arrayFor                    10  avgt    2    2,105.6         ns/op
arrayFor                   100  avgt    2   21,957.8         ns/op
arrayFor                  1000  avgt    2  253,262.0         ns/op

arrayEnhanced                1  avgt    2      234.1         ns/op
arrayEnhanced               10  avgt    2    2,173.1         ns/op
arrayEnhanced              100  avgt    2   22,080.1         ns/op
arrayEnhanced             1000  avgt    2  250,038.0         ns/op

classicFor                   1  avgt    2      238.1         ns/op
classicFor                  10  avgt    2    2,177.0         ns/op
classicFor                 100  avgt    2   22,318.0         ns/op
classicFor                1000  avgt    2  249,914.1         ns/op

enhancedFor                  1  avgt    2      254.8         ns/op
enhancedFor                 10  avgt    2    2,185.9         ns/op
enhancedFor                100  avgt    2   23,011.1         ns/op
enhancedFor               1000  avgt    2  262,605.5         ns/op

lambdaStream                 1  avgt    2      395.7         ns/op
lambdaStream                10  avgt    2    2,829.0         ns/op
lambdaStream               100  avgt    2   27,771.4         ns/op
lambdaStream              1000  avgt    2  304,831.6         ns/op

# 2+2 CPU
parallelLambdaStream         1  avgt    2      453.5         ns/op
parallelLambdaStream        10  avgt    2   11,686.4         ns/op
parallelLambdaStream       100  avgt    2   26,798.6         ns/op
parallelLambdaStream      1000  avgt    2  189,304.3         ns/op
final List<String> list = new ArrayList<>();
    
@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup()
{
    for (int i = 0; i < length; i++)
    {
        list.add(
            MessageFormat.format(
            "{0},{0},{0},{0},{0},{0}", String.valueOf(i)));
    }
}

@Benchmark
public void classicFor(Blackhole bh)
{
    int result = 0;
    for (int i = 0; i < list.size(); i++)
    {
        final String s = list.get(i);
        if (s.startsWith("5"))
        {
            continue;
        }
        
        final String[] a = s.split(",");
        for (int j = 0; j < a.length; j++)
        {
            result += Integer.valueOf(a[j]);
        }
    }
    
    bh.consume(result);
}

Looping - Bonus Content

T450s vs. GCP N1-highcpu-16core


# Lenovo T450s
Benchmark             (length)  Mode  Cnt        Score  Error  Units
arrayFor                     1  avgt    2        238.3         ns/op
arrayFor                    10  avgt    2      2,105.6         ns/op
arrayFor                   100  avgt    2     21,957.8         ns/op
arrayFor                  1000  avgt    2    253,262.0         ns/op
arrayFor                 10000  avgt    2  2,810,799.6         ns/op
arrayFor                100000  avgt    2 31,116,545.4         ns/op

classicFor                   1  avgt    2        238.1         ns/op
classicFor                  10  avgt    2      2,177.0         ns/op
classicFor                 100  avgt    2     22,318.0         ns/op
classicFor                1000  avgt    2    257,362.6         ns/op
classicFor               10000  avgt    2  2,809,832.8         ns/op
classicFor              100000  avgt    2 32,017,487.3         ns/op

enhancedFor                  1  avgt    2        254.8         ns/op
enhancedFor                 10  avgt    2      2,185.9         ns/op
enhancedFor                100  avgt    2     23,011.1         ns/op
enhancedFor               1000  avgt    2    254,691.8         ns/op
enhancedFor              10000  avgt    2  3,180,447.3         ns/op
enhancedFor             100000  avgt    2 31,479,657.4         ns/op

lambdaStream                 1  avgt    2        395.7         ns/op
lambdaStream                10  avgt    2      2,829.0         ns/op
lambdaStream               100  avgt    2     27,771.4         ns/op
lambdaStream              1000  avgt    2    299,971.9         ns/op
lambdaStream             10000  avgt    2  3,181,573.5         ns/op
lambdaStream            100000  avgt    2 36,055,386.8         ns/op

parallelLambdaStream         1  avgt    2        453.5         ns/op
parallelLambdaStream        10  avgt    2     11,686.4         ns/op
parallelLambdaStream       100  avgt    2     26,798.6         ns/op
parallelLambdaStream      1000  avgt    2    193,232.6         ns/op
parallelLambdaStream     10000  avgt    2  1,956,289.8         ns/op
parallelLambdaStream    100000  avgt    2 20,628,366.2         ns/op

# Google n1-highcpu-16core
Benchmark             (length)  Mode  Cnt       Score   Error  Units
arrayFor                     1  avgt   2        263.4          ns/op
arrayFor                    10  avgt   2      2,390.4          ns/op
arrayFor                   100  avgt   2     24,871.1          ns/op
arrayFor                  1000  avgt   2    266,237.0          ns/op
arrayFor                 10000  avgt   2  3,020,851.2          ns/op
arrayFor                100000  avgt   2 30,428,126.5          ns/op

classicFor                   1  avgt   2        264.0          ns/op
classicFor                  10  avgt   2      2,490.3          ns/op
classicFor                 100  avgt   2     25,186.0          ns/op
classicFor                1000  avgt   2    281,254.1          ns/op
classicFor               10000  avgt   2  2,882,459.1          ns/op
classicFor              100000  avgt   2 34,657,730.5          ns/op

enhancedFor                  1  avgt   2        263.5          ns/op
enhancedFor                 10  avgt   2      2,329.8          ns/op
enhancedFor                100  avgt   2     24,811.0          ns/op
enhancedFor               1000  avgt   2    291,544.4          ns/op
enhancedFor              10000  avgt   2  3,185,459.2          ns/op
enhancedFor             100000  avgt   2 33,898,708.6          ns/op

lambdaStream                 1  avgt   2        547.9          ns/op
lambdaStream                10  avgt   2      3,841.7          ns/op
lambdaStream               100  avgt   2     28,808.7          ns/op
lambdaStream              1000  avgt   2    329,148.1          ns/op
lambdaStream             10000  avgt   2  3,684,920.1          ns/op
lambdaStream            100000  avgt   2 36,935,499.3          ns/op

parallelLambdaStream         1  avgt   2        521.7          ns/op
parallelLambdaStream        10  avgt   2     43,527.8          ns/op
parallelLambdaStream       100  avgt   2     62,834.3          ns/op
parallelLambdaStream      1000  avgt   2    110,602.3          ns/op
parallelLambdaStream     10000  avgt   2    558,666.5          ns/op
parallelLambdaStream    100000  avgt   2  5,183,311.8          ns/op

The Cost of Calling

Method Calls, Inheritance, and Compiler Magic

Inlining Methods


@CompilerControl(CompilerControl.Mode.INLINE)
private static String uppercase(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static String uppercaseNoInline(String s)

@CompilerControl(CompilerControl.Mode.INLINE)
private static String lowercase(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static String lowercaseNoInline(String s)

@CompilerControl(CompilerControl.Mode.INLINE)
private static int sumOfChars(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static int sumOfCharsNoInline(String s)

@Benchmark
public void largeMethod(Blackhole bh)
{ 
    // all the code of the methods copied here
    
    bh.consume(sum);
}

@Benchmark
public void smallAndCallsInline(Blackhole bh)
{    
    final String s = strings[0];
    
    final String s2 = uppercase(s);
    final String s3 = lowercase(s2);
    
    int sum = sumOfChars(s3);
    
    bh.consume(sum);
}

@Benchmark
public void smallAndCallsNoInline(Blackhole bh)
{    
    final String s = strings[0];
    
    final String s2 = uppercaseNoInline(s);
    final String s3 = lowercaseNoInline(s2);
    
    int sum = sumOfCharsNoInline(s3);
    
    bh.consume(sum);
}

Calling a method is expensive


# BAT JDK 11
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4860.736 ± 229.153  ns/op
smallAndCallsNoInline  avgt    5  5672.435 ± 219.164  ns/op
largeMethod            avgt    5  5294.679 ± 218.967  ns/op

# AC JDK 11
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4501.134 ± 217.389  ns/op
smallAndCallsNoInline  avgt    5  4460.731 ± 132.889  ns/op
largeMethod            avgt    5  4244.479 ± 209.325  ns/op

# BAT JDK 11 again
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  5661.628 ± 268.654  ns/op
smallAndCallsNoInline  avgt    5  5634.992 ± 323.852  ns/op
largeMethod            avgt    5  5300.749 ± 267.469  ns/op

# AC JDK 8
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4839.311 ± 158.454  ns/op
smallAndCallsNoInline  avgt    5  4923.197 ±  93.549  ns/op
largeMethod            avgt    5  4919.511 ± 223.358  ns/op
  • Benchmarking is hard!
  • No idea what happened here... sorry!

The Cost of Calling

How Expensive are Virtual Method Calls?

  • Check if there is an overhead to find the right method
  • This is about virtual method calls, i.e. the lookup of the call target, not so much about the call itself (inlining, see before)

public class MegaMorphicSimple
{
    static interface C { public int apply(int x); }
    
    static class C1 implements C 
    { 
        @Override
        public int apply(int x) { return x * x; }
    }
    
    @Benchmark
    public int _1_mono()
    {
        C c = null;
        int r = random.nextInt(4);
        
        if (r == 0) { c = new C1(); } 
        else if (r == 1) { c = new C1(); } 
        else if (r == 2) { c = new C1(); }
        else if (r == 3) { c = new C1(); }
         
        return c.apply(r);
    }

    @Benchmark
    public int _2_bi()
    {
        C c = null;
        int r = random.nextInt(4);
        
        if (r == 0) { c = new C1(); } 
        else if (r == 1) { c = new C2(); } 
        else if (r == 2) { c = new C1(); }
        else if (r == 3) { c = new C2(); }
         
        return c.apply(r);
    }

    @Benchmark
    public int _5_manualDispatch()
    {
        ... 
        if (c instanceof C1)
        {
            return ((C1) c).apply(r);
        }
        else if (c instanceof C2)
        {
            return ((C2) c).apply(r);            
        }
        else if (c instanceof C3)
        {
            return ((C3) c).apply(r);            
        }
        else
        {
            return ((C4) c).apply(r);            
        }
    }
}

The Cost of Calling

How Expensive are Virtual Method Calls?


# identical method per implementing class    
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicSimple._1_mono             avgt    5  19.703 ± 2.177  ns/op
MegaMorphicSimple._2_bi               avgt    5  21.917 ± 3.759  ns/op
MegaMorphicSimple._4_mega             avgt    5  26.778 ± 4.411  ns/op
MegaMorphicSimple._5_manualDispatch   avgt    5  21.861 ± 2.860  ns/op

static interface C { default int apply(int x){ return x * x; }};

static class C1 implements C {};
static class C2 implements C {};
static class C3 implements C {};
static class C4 implements C {};

# default method at interface    
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicDefault._1_mono            avgt    5  18.867 ± 1.325  ns/op
MegaMorphicDefault._2_bi              avgt    5  21.817 ± 2.613  ns/op
MegaMorphicDefault._4_mega            avgt    5  26.355 ± 2.591  ns/op
MegaMorphicDefault._5_manualDispatch  avgt    5  21.763 ± 3.479  ns/op
  • That looks expensive
  • Interestingly, we can help the runtime by hand
  • Shall we write it the ugly way from now on?
  • ...but these are nanoseconds, and the benchmark methods do almost nothing

# the method got bigger and we made it non-default
Benchmark                                 Mode  Cnt    Score   Error  Units
MegaMorphicLargeMethod._1_mono            avgt    5  158.775 ± 4.971  ns/op
MegaMorphicLargeMethod._2_bi              avgt    5  157.100 ± 6.153  ns/op
MegaMorphicLargeMethod._4_mega            avgt    5  165.580 ± 8.669  ns/op
MegaMorphicLargeMethod._5_manualDispatch  avgt    5  162.729 ± 3.795  ns/op

Really calling virtually

So we don't have to think about virtual calls at all

  • Let's have an array with different objects of a type C
  • C has four possible implementations
  • We either let Java make the call or we untangle it ourselves

public class MegaMorphicArray 
{
    private C[] array = new C[20_000];
    
    @Setup
    public void setup() 
    {
        final Random r = new Random(42L);
        
        for (int x = 0; x < 20_000; x++) 
        {
            int i = r.nextInt(JMH_PARAM);
            
            if (i == 0) array[x] = new C1();
            else if (i == 1) array[x] = new C2();
            else if (i == 2) array[x] = new C3();
            else if (i == 3) array[x] = new C4();
        }        
    }
}

static interface C { public int apply(int x); }
    
static class C1 implements C { 
    @Override public int apply(int x) { return x * x; }
}   ...
    
@Benchmark
public int morphic() 
{
    int sum = 0; 

    for (int i = 0; i < 20_000; i++) 
    {
        sum += array[i].apply(i);
    }

    return sum;
}    
    
@Benchmark
public int peeled()
{
    int sum = 0;
    for (int i = 0; i < 20_000; i++) {
        final C c = array[i];
        if (c instanceof C1) {
            sum += ((C1) c).apply(i);
        }
        else if (c instanceof C2) {
            sum += ((C2) c).apply(i);
        }
        else if (c instanceof C3) {
            sum += ((C3) c).apply(i);
        }
        else {
            sum += ((C4) c).apply(i);
        }
    }
    
    return sum;
}
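Outside JMH, the equivalence of the two variants can be checked with a small standalone sketch (class names mirror the slides; the harness itself is hypothetical). Only the JIT-compiled shape of the two loops differs, never the result:

```java
import java.util.Random;

public class PeelDemo {
    interface C { int apply(int x); }
    static class C1 implements C { public int apply(int x) { return x * x; } }
    static class C2 implements C { public int apply(int x) { return x * x; } }
    static class C3 implements C { public int apply(int x) { return x * x; } }
    static class C4 implements C { public int apply(int x) { return x * x; } }

    // Virtual dispatch with up to four receiver types at one call site
    static int morphic(C[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i].apply(i);
        return sum;
    }

    // Manual dispatch: every apply() call site sees exactly one type
    static int peeled(C[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            C c = a[i];
            if (c instanceof C1)      sum += ((C1) c).apply(i);
            else if (c instanceof C2) sum += ((C2) c).apply(i);
            else if (c instanceof C3) sum += ((C3) c).apply(i);
            else                      sum += ((C4) c).apply(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        Random r = new Random(42L);
        C[] a = new C[20_000];
        for (int i = 0; i < a.length; i++) {
            switch (r.nextInt(4)) {
                case 0:  a[i] = new C1(); break;
                case 1:  a[i] = new C2(); break;
                case 2:  a[i] = new C3(); break;
                default: a[i] = new C4(); break;
            }
        }
        // Both variants must agree
        System.out.println(morphic(a) == peeled(a));
    }
}
```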

Really calling virtually

The Results are in


# JDK 11
Benchmark (params)  Mode  Cnt     Score  Units
morphic         1  avgt    5   17,705.6  ns/op
morphic         2  avgt    5   98,269.4  ns/op
morphic         3  avgt    5  334,745.9  ns/op
morphic         4  avgt    5  364,485.9  ns/op

peeled          1  avgt    5   16,990.4  ns/op
peeled          2  avgt    5   94,084.2  ns/op
peeled          3  avgt    5  123,590.3  ns/op
peeled          4  avgt    5  147,067.2  ns/op

# JDK 11 -Xint
Benchmark (params) Mode  Cnt        Score  Units
morphic         1  avgt    5  1,682,550.7  ns/op
morphic         2  avgt    5  1,695,875.2  ns/op
morphic         3  avgt    5  1,679,643.2  ns/op
morphic         4  avgt    5  1,675,298.1  ns/op

peeled          1  avgt    5  1,764,953.0  ns/op
peeled          2  avgt    5  1,821,662.6  ns/op
peeled          3  avgt    5  1,885,678.8  ns/op
peeled          4  avgt    5  1,936,054.8  ns/op

Megamorphic Internals

But why do we see this performance?


Benchmark                      (params)  Mode       Score       Error  Units
morphic                               1  avgt    16,829.7 ±  1165.264  ns/op
morphic:IPC                           1  avgt         3.2               #/op
morphic:L1-dcache-load-misses         1  avgt     6,263.3               #/op
morphic:L1-dcache-loads               1  avgt    40,296.8               #/op
morphic:branch-misses                 1  avgt         9.1               #/op
morphic:branches                      1  avgt    24,984.2               #/op
morphic:dTLB-load-misses              1  avgt         0.2               #/op
morphic:dTLB-loads                    1  avgt    40,401.5               #/op

morphic                               2  avgt    97,456.9 ±  1177.830  ns/op
morphic:IPC                           2  avgt         0.7               #/op
morphic:L1-dcache-load-misses         2  avgt     6,472.7               #/op
morphic:L1-dcache-loads               2  avgt    43,597.9               #/op
morphic:branch-misses                 2  avgt     9,850.6               #/op
morphic:branches                      2  avgt    41,935.3               #/op
morphic:dTLB-load-misses              2  avgt         2.0               #/op
morphic:dTLB-loads                    2  avgt    45,509.8               #/op

morphic                               3  avgt   334,117.8 ±  3334.059  ns/op
morphic:IPC                           3  avgt         1.1               #/op
morphic:L1-dcache-load-misses         3  avgt     6,840.3               #/op
morphic:L1-dcache-loads               3  avgt   407,654.7               #/op
morphic:branch-misses                 3  avgt    13,247.0               #/op
morphic:branches                      3  avgt   146,105.8               #/op
morphic:dTLB-load-misses              3  avgt        13.5               #/op
morphic:dTLB-loads                    3  avgt   410,043.5               #/op

morphic                               4  avgt   390,601.4 ± 40515.922  ns/op
morphic:IPC                           4  avgt         0.9               #/op
morphic:L1-dcache-load-misses         4  avgt     6,740.5               #/op
morphic:L1-dcache-loads               4  avgt   408,814.3               #/op
morphic:branch-misses                 4  avgt    15,099.0               #/op
morphic:branches                      4  avgt   145,118.2               #/op
morphic:dTLB-load-misses              4  avgt        18.9               #/op
morphic:dTLB-loads                    4  avgt   410,767.5               #/op

Benchmark                      (params)  Mode        Score       Error  Units
peeled                                1  avgt     16,831.1 ±   816.491  ns/op
peeled:IPC                            1  avgt          3.2               #/op
peeled:L1-dcache-load-misses          1  avgt      6,218.4               #/op
peeled:L1-dcache-loads                1  avgt     39,797.1               #/op
peeled:branch-misses                  1  avgt          9.2               #/op
peeled:branches                       1  avgt     24,854.5               #/op
peeled:dTLB-load-misses               1  avgt          0.2               #/op
peeled:dTLB-loads                     1  avgt     40,133.1               #/op

peeled                                2  avgt     93,347.0 ±  3607.879  ns/op
peeled:IPC                            2  avgt          0.7               #/op
peeled:L1-dcache-load-misses          2  avgt      6,456.1               #/op
peeled:L1-dcache-loads                2  avgt     42,060.7               #/op
peeled:branch-misses                  2  avgt      9,252.3               #/op
peeled:branches                       2  avgt     41,153.9               #/op
peeled:dTLB-load-misses               2  avgt          6.0               #/op
peeled:dTLB-loads                     2  avgt     42,195.1               #/op

peeled                                3  avgt    123,075.7 ±  5181.245  ns/op
peeled:IPC                            3  avgt          0.6               #/op
peeled:L1-dcache-load-misses          3  avgt      6,472.7               #/op
peeled:L1-dcache-loads                3  avgt     42,059.2               #/op
peeled:branch-misses                  3  avgt     13,440.9               #/op
peeled:branches                       3  avgt     51,950.4               #/op
peeled:dTLB-load-misses               3  avgt          5.7               #/op
peeled:dTLB-loads                     3  avgt     43,097.4               #/op

peeled                                4  avgt    136,765.3 ±  2792.939  ns/op
peeled:IPC                            4  avgt          0.6               #/op
peeled:L1-dcache-load-misses          4  avgt      6,460.5               #/op
peeled:L1-dcache-loads                4  avgt     43,312.1               #/op
peeled:branch-misses                  4  avgt     14,818.3               #/op
peeled:branches                       4  avgt     62,214.6               #/op
peeled:dTLB-load-misses               4  avgt          8.0               #/op
peeled:dTLB-loads                     4  avgt     42,775.6               #/op

TLB - The Unknown Buffer

The CPU understood

  • The previous data shows that the dTLB load counts grow sharply
  • VMT: Virtual Method Table
  • Java looks up the right method in that table and then calls that code
  • TLB: Translation Lookaside Buffer
  • The TLB stores recent translations of virtual memory to physical memory and can be seen as an address-translation cache [1]
  • The VMT call target has to be resolved via a TLB lookup
  • Java can inline one or two targets per call site, but not more
  • One target: inlined or directly called
  • Two targets: inlined or directly called, guarded by a type check (if-else)
  • Three or more targets: the VMT has to be used
  • This behavior can change in future Java versions
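You can observe these inlining decisions yourself: HotSpot has diagnostic flags that report, per call site, whether a method was inlined and why. A minimal invocation sketch (the class name is a placeholder for your own benchmark):

```shell
# Ask HotSpot to print its inlining decisions (diagnostic, HotSpot-specific).
# Non-inlined call sites are reported together with a reason.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MegaMorphicSimple
```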

Lambdas are slow

Let's try some streams first

final int SIZE = 10240;
final int[] integers = new int[SIZE];

public int lambdaArray[Cold|Hot]()
{
    return Arrays.stream(integers)
        .filter(i -> i % 2 == 0).sum();
}

public int lambdaIntStream[Cold|Hot]()
{
    return IntStream.range(0, SIZE)
        .filter(i -> i % 2 == 0).sum();
}

public int loop[Cold|Hot]()
{
    int sum = 0;
    for (int i = 0; i < integers.length; i++)
    {
        if (i % 2 == 0)
        {
            sum += integers[i];
        }
    }
    
    return sum;
}
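Before comparing speed, a quick standalone sanity check (outside JMH, a hypothetical sketch) shows that a stream pipeline and a plain index loop agree on the result, here summing all even numbers below SIZE:

```java
import java.util.stream.IntStream;

public class StreamVsLoopDemo {
    public static void main(String[] args) {
        final int SIZE = 10_240;

        // Stream pipeline: filter even numbers, sum them
        int streamSum = IntStream.range(0, SIZE).filter(i -> i % 2 == 0).sum();

        // Equivalent plain loop
        int loopSum = 0;
        for (int i = 0; i < SIZE; i++) {
            if (i % 2 == 0) loopSum += i;
        }

        System.out.println(streamSum == loopSum);
        System.out.println(streamSum);
    }
}
```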

Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayHot       avgt    5      13,968.8   ns/op
Lambda01.lambdaIntStreamHot   avgt    5      13,156.2   ns/op
Lambda01.loopHot              avgt    5      10,146.6   ns/op

# Cold does not have JMH warmup
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5     251,372.4   ns/op
Lambda01.lambdaArrayHot       avgt    5      13,968.8   ns/op

Lambda01.lambdaIntStreamCold  avgt    5     794,422.3   ns/op
Lambda01.lambdaIntStreamHot   avgt    5      13,156.2   ns/op

Lambda01.loopCold             avgt    5      15,066.7   ns/op
Lambda01.loopHot              avgt    5      10,146.6   ns/op

# -Xcomp
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5  24,809,394.5   ns/op
Lambda01.lambdaIntStreamCold  avgt    5  23,398,581.4   ns/op
Lambda01.loopCold             avgt    5      11,787.0   ns/op

Lambda01.lambdaArrayHot       avgt    5     143,678.1   ns/op
Lambda01.lambdaIntStreamHot   avgt    5     144,445.3   ns/op
Lambda01.loopHot              avgt    5      10,109.5   ns/op

# -Xint
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5   5,578,040.0   ns/op
Lambda01.lambdaArrayHot       avgt    5   5,473,135.6   ns/op
Lambda01.lambdaIntStreamCold  avgt    5   5,394,464.0   ns/op
Lambda01.lambdaIntStreamHot   avgt    5   5,317,269.0   ns/op
Lambda01.loopCold             avgt    5     229,327.9   ns/op
Lambda01.loopHot              avgt    5     229,369.6   ns/op

Lambda Functions

Compare Predicate, Pseudo-Predicate, and Direct Implementation

public int forLambda(final Predicate<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int forAnonymous(final PseudoLambda<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int simple() {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (array[i] % 2 == 0) {
            sum++;
        }
    }
    return sum;
}

@Benchmark
public void forSimple(Blackhole bh)
{
    bh.consume(simple());
}
    
@Benchmark
public void forLambdaPredicate(Blackhole bh)
{
    bh.consume(forLambda(i -> i % 2 == 0));
}

interface PseudoLambda<T>
{
    boolean test(T t);
}

@Benchmark
public void forAnonymousClass(Blackhole bh)
{
    bh.consume(
        forAnonymous(new PseudoLambda<Integer>()
        {
            @Override
            public boolean test(Integer i)
            {
                return i % 2 == 0;
            }
        })
    );
}
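The results below also list forLambdaPredicatePredefined, whose code is not shown; a plausible reconstruction keeps the lambda in a field so it is created once instead of at every call. A minimal standalone sketch of that idea (names hypothetical):

```java
import java.util.function.Predicate;

public class PredefinedPredicateDemo {
    // The predicate is created once and reused; nothing is captured per call
    static final Predicate<Integer> EVEN = i -> i % 2 == 0;

    static int count(int[] array, Predicate<Integer> p) {
        int sum = 0;
        for (int v : array) {
            if (p.test(v)) sum++;   // int is autoboxed to Integer here
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] array = {0, 1, 2, 3, 4, 5};
        System.out.println(count(array, EVEN));            // predefined field
        System.out.println(count(array, i -> i % 2 == 0)); // inline lambda
    }
}
```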

Lambda Functions


Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass             avgt    2     152.9  ns/op
forLambdaPredicate            avgt    2     128.8  ns/op
forLambdaPredicatePredefined  avgt    2     127.9  ns/op
forSimple                     avgt    2     128.1  ns/op

# -Xint
Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass             avgt    2  21,312.5  ns/op
forLambdaPredicate            avgt    2  20,154.8  ns/op
forLambdaPredicatePredefined  avgt    2  19,871.6  ns/op
forSimple                     avgt    2   3,471.7  ns/op

# No warmup and just 5x measurements    
Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass               ss    5  92,787.6  ns/op
forLambdaPredicate              ss    5 122,377.4  ns/op
forLambdaPredicatePredefined    ss    5  50,435.6  ns/op
forSimple                       ss    5  16,268.6  ns/op
  • Basic lambdas are great when hot
  • When cold, performance is questionable
  • Not the right thing for server startup or rare execution
  • BUT it all depends on the total performance debt accumulated

GC can be your friend


public class GCAndAccessSpeed 
{
    private final static int SIZE  = 100_000;
    private final List<String> STRINGS = new ArrayList<>();
    private final List<String> ordered = new ArrayList<>();
    private final List<String> nonOrdered = new ArrayList<>();
    
    @Param({"false", "true"}) boolean gc;
    @Param({"1", "10"})       int COUNT;
    @Param({"false", "true"}) boolean drop;
    
    @Setup
    public void setup() throws InterruptedException
    {
        final FastRandom r = new FastRandom(7);
        for (int i = 0; i < COUNT * SIZE; i++)
        {
            STRINGS.add(RandomUtils.randomString(r, 1, 20));
        }
        
        for (int i = 0; i < SIZE; i++)
        {
            ordered.add(STRINGS.get(i * COUNT));
        }
        nonOrdered.addAll(ordered);
        Collections.shuffle(nonOrdered, new Random(r.nextInt()));
        
        if (drop)
        {
            STRINGS.clear();
        }
        
        if (gc) 
        {
            for (int c = 0; c < 5; c++) 
            {
                System.gc();
                TimeUnit.SECONDS.sleep(2);
            }
        }
    }
    
    @Benchmark
    public int walk[Ordered|NonOrdered]()
    {
        int sum = 0;
        for (int i = 0; i < [ordered|nonOrdered].size(); i++)
        {
            sum += [ordered|nonOrdered].get(i).length();
        }
        return sum;
    }
}
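A tiny standalone check (simplified, hypothetical names) confirms that shuffling only changes the order in which memory is walked, not the data itself, so both benchmark variants must compute the same sum and any speed difference comes from memory layout alone:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class LocalityDemo {
    public static void main(String[] args) {
        // Build a list of strings, then a shuffled copy of the same references
        List<String> ordered = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) ordered.add("s" + i);

        List<String> nonOrdered = new ArrayList<>(ordered);
        Collections.shuffle(nonOrdered, new Random(42L));

        int a = 0, b = 0;
        for (String s : ordered)    a += s.length();
        for (String s : nonOrdered) b += s.length();

        // Same content, different walk order
        System.out.println(a == b);
    }
}
```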

Applied Memory Knowledge

  • The theory: data that sits closer together in memory is faster to walk
  • G1 is a compacting collector, hence it moves live objects closer together
  • When walking a structure whose objects are more "local", the cache is hotter
  • Verify with a no-GC setup, aka Epsilon GC

# G1 GC
Benchmark        (COUNT)  (drop)   (gc) Mode  Cnt        Score   Units
walkNonOrdered        1   false  false  avgt    5  1,596,315.2   ns/op
walkOrdered           1   false  false  avgt    5    611,137.7   ns/op

walkNonOrdered        1   false   true  avgt    5  1,172,951.3   ns/op
walkOrdered           1   false   true  avgt    5    431,143.5   ns/op

walkNonOrdered        1    true  false  avgt    5  1,562,844.1   ns/op
walkOrdered           1    true  false  avgt    5    605,119.4   ns/op

walkNonOrdered        1    true   true  avgt    5  1,243,973.9   ns/op
walkOrdered           1    true   true  avgt    5    400,721.9   ns/op

walkNonOrdered       10   false  false  avgt    5  1,903,731.9   ns/op
walkOrdered          10   false  false  avgt    5  1,229,945.1   ns/op

walkNonOrdered       10   false   true  avgt    5  2,026,861.7   ns/op
walkOrdered          10   false   true  avgt    5  1,809,961.9   ns/op

walkNonOrdered       10    true  false  avgt    5  1,920,658.4   ns/op
walkOrdered          10    true  false  avgt    5  1,239,658.5   ns/op

walkNonOrdered       10    true   true  avgt    5  1,160,229.2   ns/op
walkOrdered          10    true   true  avgt    5    403,949.5   ns/op

# -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC
Benchmark        (COUNT)  (drop)   (gc) Mode  Cnt        Score   Units
walkNonOrdered        1   false  false  avgt    5  1,667,611.7   ns/op
walkNonOrdered        1   false   true  avgt    5  1,820,968.2   ns/op
walkNonOrdered        1    true  false  avgt    5  1,928,822.7   ns/op
walkNonOrdered        1    true   true  avgt    5  1,777,251.4   ns/op

walkOrdered           1   false  false  avgt    5    931,728.5   ns/op
walkOrdered           1   false   true  avgt    5    902,433.7   ns/op
walkOrdered           1    true  false  avgt    5    930,294.3   ns/op
walkOrdered           1    true   true  avgt    5    907,886.5   ns/op

What we might not have talked about

There are so many more things to discover

  • Memory layout and performance
  • No null checks
  • Half-compiled if/else
  • The cost of leaving the CPU

Conclusion

Rule 1

Don't worry too much about performance. Java is very good at taking care of it.

Rule 2

Don't optimize prematurely and don't micro-tune.

Rule 3

Simply write clean and good code. You will likely never see most of these issues. Think before you code. Java can make code fast, but bad algorithms stay bad algorithms, and bad designs stay bad designs.



Rule 4

When you worry, worry about the hottest code. Measure carefully and apply what you learned. The compiler and the CPU are your friends, but they also make your profiling life harder.

Rule 5

Understand, measure, review, test, and measure again.

Rule 6

The code you write is not the code that is executed. Measure under production-like conditions, so you understand what actually runs and avoid optimizing non-compiled, non-optimized code.

Valhalla

Loom

The code

When you want to play with the code

  • All examples of today and more
  • Do whatever you like with it
  • JDK 11 or higher and Maven needed
  • Uses the Java Microbenchmark Harness from OpenJDK
  • Optionally uses kernel routines to profile low level

https://github.com/Xceptance/jmh-jmm-training

Questions and Answers