High Performance Java

About Java's Smart Internals

...and how to use them to your advantage.

Disclaimer

Java is more than what most people think

The times when Java was a slow, sluggish bytecode interpreter are long gone. But most people don't know the magic that goes on to bring Java to native code speed and beyond.

  • This is not science, this is applied knowledge
  • Java got closer to the hardware
  • CPUs got better, but memory stayed slow; hence CPUs do whatever it takes to avoid main memory, and that is what Java tries to do too
  • This is not about the fastest code; we want to understand how speed is achieved and draw the right conclusions
  • We will learn how to avoid mistakes, but beyond that: trust Java
This training is partially based on material from Java Performance Puzzlers by Douglas Hawkins and Optimizing Java (O'Reilly Media, 2018).

An early warning

Optimization and tuning is fun... but...

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth, Turing Award Lecture 1974

  • This is not about how your code should look
  • It is about understanding the JVM and getting measurements right
  • It is about the right approach to performance tuning
  • If you expect tips like "use ArrayList instead of...", this talk might not be for you

Let's Benchmark

Get Some Measurements Going

Poor Man's Benchmark 1

import java.util.Arrays;

public class PoorMansBenchmark
{
    private static String append(final int i) 
    {
        final StringBuilder sb = new StringBuilder();
        sb.append(System.currentTimeMillis());
        sb.append("Foo"); 
        sb.append(i); 
        sb.append('c');
        sb.append("lkJAHJ AHSDJHF KJASHF HSAKJFD");
        
        final String s = sb.toString();
        char[] ch = s.toCharArray();
        Arrays.sort(ch);
        
        return new String(ch);
    }

    public static void main (final String[] args) 
    {
        final int SIZE = 100_000;
        final long[] result = new long[SIZE];
        
        int sum = 0;
        for (int i = 0; i < SIZE; i++)
        {
            final Timer t = Timer.startTimer();       
            
            sum += append(i).length();
            
            result[i] = t.stop().runtimeNanos();
        }  
        
        // pretend you need it but you don't need it
        // just to avoid optimizations
        if (sum < 0)
        {
            System.out.println(sum);
        }
    }        
}

Just benchmark something...

  • Some nonsense with StringBuilder and String
  • Within a loop we call our method
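The Timer used here is not a JDK class. A minimal sketch of such a helper, matching the calls above and based on System.nanoTime() (the actual training code may differ), could look like this:

```java
// Hypothetical Timer helper matching the usage in PoorMansBenchmark;
// the real training code may be implemented differently.
public class Timer
{
    private final long start;
    private long end;

    private Timer(final long start)
    {
        this.start = start;
    }

    // capture the start time
    public static Timer startTimer()
    {
        return new Timer(System.nanoTime());
    }

    // capture the end time, return this for fluent use
    public Timer stop()
    {
        this.end = System.nanoTime();
        return this;
    }

    // elapsed wall-clock time in nanoseconds
    public long runtimeNanos()
    {
        return end - start;
    }
}
```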

Let's Run PoorMansBenchmark

# SIZE = 10
 1, 616953
 2,  37628
 3,  27712
 4,  25459
 5,  36943
 6,  24639
 7,  27312
 8,  23948
 9,  24379
10,  24003
# SIZE = 50 
  1, 616953
  2,  37628
  3,  27712
  4,  25459
  5,  36943
  6,  24639
  7,  27312
  8,  23948
  9,  24379
 10,  24003
 20,  23770
 30,  25595
 40,  23631
 41,  23744
 42,  25004
 43,  26035
 44,  23782
 45,  22973
 46,  25261
 47,  23215
 48,  24135
 49,  23312
 50,  22908
# SIZE = 200
  1, 560064
  2,  33833
  3,  25935
  4,  24224
  5,  58761
  6,  25303
  7,  23832
  8,  23243
  9,  24405
 10,  23159
 20,  24005
 30,  23097
 40,  24002
 50,  23575
 60,  26380
 70,  24612
 80,  26099
 90,  23815
100,  33719
110,  24405
120,  25568
130,  27355
140,  23190
150,  12568
160,  22774
170,  21934
180,  11408
190,  12476
200,  10366
# SIZE = 2000
   1, 582933
   2,  37514
   3,  26746
   4,  24760
   5,  40041
   6,  23869
   7,  26384
   8,  23183
   9,  22801
  10,  24316
  20,  28037
  30,  47624
  40,  37747
  50,  37559
  60,  47502
  70,  35459
  80,  37795
  90,  36272
 100,  36344
 200,  14478
 300,  11769
 400,  11557
 500,  10088
 600,   9965
 700,   9387
 800,   8322
 900,   7305
1000,   6849
2000,   3347
# SIZE = 100,000
     1, 592763
    10,  24077
   100,  24995
   200,  10734
   300,  19562
   400,   8904
   500,   6226
   600,   6723
   700,   6981
   800,   5283
   900,   4896
  1000,   4883
  2000,   3367
  3000,   4947
  4000,   5179
  5000,   4648
  6000,   4113
  7000,   4283
  8000,   4654
  9000,   4054
 10000,   4300
 20000,  19314
 30000,    671
 40000,    694
 50000,    686
 60000,    663
 70000,    717
 80000,    697
 90000,    686
100000,    686

The Java Hater's Response

That must be the Garbage Collector

  • Let's disable GC
  • Epsilon GC to the rescue
  • It allocates memory but never frees it
  • -Xms5g -Xmx5g
  • -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC
  • -XX:+AlwaysPreTouch
# No GC
# SIZE = 100,000
     1, 549924
    10,  23824
   100,  34017
   200,  16395
   300,  12858
   400,  11460
   500,  10504
   600,   9787
   700,   9935
   800,  11464
   900,   7369
  1000,   7365
  2000,   2771
  3000,   2763
  4000,   2753
  5000,   2762
  6000,   2846
  7000,   2703
  8000,   2737
  9000,   2726
 10000,   2738
 20000,   2770
 30000,   2706
 40000,   2764
 50000,   2657
 60000,   2476
 70000,   1040
 80000,   1002
 90000,    982
100000,    990
# G1 GC
# SIZE = 100,000
     1, 595433
    10,  23628
   100,  40456
   200,  17898
   300,  11618
   400,  13881
   500,   9707
   600,   9335
   700,   9153
   800,   8872
   900,   7258
  1000,   6953
  2000,   2882
  3000,   2737
  4000,   2748
  5000,   2756
  6000,   2724
  7000,   2686
  8000,   2758
  9000,   2751
 10000,   2743
 20000,   4219
 30000,   5016
 40000,   4437
 50000,   1488
 60000,    999
 70000,   1056
 80000,   4011
 90000,   1221
100000,    755

The Java Hater's Response II

Java is slow

  • GraalVM to the rescue
  • Let's compile it
  • native-image
# G1 GC
# SIZE = 100,000
     1, 595433
    10,  23628
   100,  40456
   200,  17898
   300,  11618
   400,  13881
   500,   9707
   600,   9335
   700,   9153
   800,   8872
   900,   7258
  1000,   6953
  2000,   2882
  3000,   2737
  4000,   2748
  5000,   2756
  6000,   2724
  7000,   2686
  8000,   2758
  9000,   2751
 10000,   2743
 20000,   4219
 30000,   5016
 40000,   4437
 50000,   1488
 60000,    999
 70000,   1056
 80000,   4011
 90000,   1221
100000,    755
# GraalVM Native
# SIZE = 100,000
     1,   8791
    10,   4736
   100,   1989
   200,   1830
   300,   5040
   400,   1870
   500,   2267
   600,   1903
   700,   2279
   800,   3889
   900,   1771
  1000,   1620
  2000,   2415
  3000,   3758
  4000,   1290
  5000,   1300
  6000,   1249
  7000,   2636
  8000,   2160
  9000,   1254
 10000,   1309
 20000,   1730
 30000,   2446
 40000,   1320
 50000,   1332
 60000,   1363
 70000,   1385
 80000,   2737
 90000,   1373
100000,   1278
# G1 GC
# SIZE = 1,000,000
      1, 563883
     10,  23505
    100,  23311
   1000,   6727
   2000,   5174
   3000,   5444
   4000,   5160
   5000,   4815
   6000,   5130
   7000,   5138
   8000,   5162
   9000,   5032
  10000,   5230
  20000,   3992
  30000,   3673
  40000,   1909
  50000,   1908
  60000,   1106
  70000,    963
  80000,    966
  90000,    981
 100000,    958
 200000,    683
 300000,    693
 400000,    694
 500000,    691
 600000,    699
 700000,    697
 800000,   1057
 900000,    710
1000000,    703

That is not helpful

The Data is so Inconsistent

Let's Analyze

Greetings from the Hotspot Compiler

  • Java has about three compilers
  • You interact only with javac
  • To make your code fly, another compiler compiles the bytecode on demand
  • Classic: Hotspot Compiler (our focus right now)
  • New Kid: Graal Compiler
  • Tiered compilation: C1 and C2 with four compiled levels
  • Profiles the results and optimizes continuously
  • Can inline, loop-unroll, run escape analysis, predict branches, use intrinsics, do lock tricks, and a few more things
  • The compiler decides what code is hot
  • Compilation happens per method
  • In addition, loops are compiled (on-stack replacement), because the method might be cold but the loop hot

Compare Interpreter vs. Compiler

Hotspot

No Tiered Compiler

Interpreter Only

Compile Everything

Some more information about Hotspot

Just to get a feel for the complexity

Compiler Stages [1]

  • Level 0: Interpreter only
  • Level 1: C1, full optimization, no profiling
  • Level 2: C1 with invocation and backedge counters
  • Level 3: C1 with full profiling
  • Level 4: C2, fully optimized

Compiler Flow [2]

To Compile or Not to Compile

Compile when Needed vs. Right from the Start

  • -Xcomp forces Java to always compile to native code
  • It still permits a later change of opinion
# defaults
      1,  560960
      2,   34772
      3,   26167
      4,   24237
      5,   35926
      6,   24116
      7,   23163
      8,   22895
      9,   24142
     10,   23174
    100,   23596
    200,   12973
    300,   12361
    400,    9852
    500,    9596
    600,    8774
    700,    8956
    800,    7280
    900,    7076
   1000,    6089
  10000,    5071
  20000,    2683
  30000,    2703
  40000,    1532
  50000,    1121
  60000,    1012
  70000,     975
  80000,     979
  90000,     987
 100000,     976
 900000,     700
1000000,     708
# -Xcomp
      1,   40041490
      2,      51254
      3,      17217
      4,       2077
      5,       1714
      6,       1607
      7,       1610
      8,       1536
      9,       1576
     10,       1536
    100,       1376
    200,       2481
    300,       1284
    400,       1307
    500,       1150
    600,       1177
    700,       1173
    800,       1163
    900,       1174
   1000,       1546
  10000,       1096
  20000,        962
  30000,        999
  40000,       1093
  50000,        891
  60000,        883
  70000,        885
  80000,        862
  90000,        894
 100000,        869
 900000,        871
1000000,        870

Watch the Compiler (-XX:+PrintCompilation)

48    1       3       java.lang.String::equals (81 bytes)
49    2       3       java.lang.String::hashCode (55 bytes)
49    3       3       java.lang.String::charAt (29 bytes)
49    4       3       java.lang.String::length (6 bytes)
51    5       3       java.lang.Object::<init> (1 bytes)
51    6       3       java.lang.AbstractStringBuilder::ensureCapacityInternal (27 bytes)
51    7     n 0       java.lang.System::arraycopy (native)   (static)
51    8       3       java.lang.String::indexOf (70 bytes)
52   10       3       java.lang.Math::min (11 bytes)
52    9       3       java.util.Arrays::copyOfRange (63 bytes)
53   11       1       java.lang.ref.Reference::get (5 bytes)
53   12       3       java.lang.String::getChars (62 bytes)
53   13       1       java.lang.ThreadLocal::access$400 (5 bytes)
57   14       3       java.util.DualPivotQuicksort::sort (1195 bytes)
62   20       1       java.lang.Object::<init> (1 bytes)
62    5       3       java.lang.Object::<init> (1 bytes)   made not entrant
  • Timestamp in ms since start
  • Compilation order
  • Compiler level
  • Method and size
  • n: native
  • s: synchronized
  • !: has exception handler
  • %: on-stack replacement (OSR)
  • made not entrant: no longer used

JMH to the rescue

Java Microbenchmark Harness

  • JMH is developed by OpenJDK
  • Delivers consistent results
  • Is able to control the compiler
  • Suppresses optimizations such as dead-code elimination
  • CLI tool, built with Maven
  • Use the IDE to write, but not to run
  • Knows different JVMs and their tricks
  • Contains profilers (JMX, Hotspot, or OS based)
  • Also great for multi-threaded testing
  • Read the examples carefully before assuming your benchmark is legit
  • Keep in mind: most code is not hot

mvn clean install; java -Xms2g -Xmx2g -XX:+AlwaysPreTouch -jar target/benchmarks.jar YourClass
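A minimal JMH benchmark class for the append logic above might be sketched like this; the class name and iteration counts are illustrative assumptions, and the JMH dependency from the Maven archetype is required:

```java
// Sketch of a JMH benchmark; annotation values are illustrative.
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class AppendBenchmark
{
    private String foo;

    @Setup
    public void setup()
    {
        foo = String.valueOf(System.currentTimeMillis());
    }

    @Benchmark
    public String append()
    {
        // JMH consumes the return value, so the work cannot be removed
        return new StringBuilder().append("a").append(foo).toString();
    }
}
```

JMH forks the JVM, runs warmup iterations until the compiler settles, and only then measures, which is exactly what the hand-rolled loop above failed to do.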

Learning: Understand the Runtime

If you don't understand it, you are just measuring random numbers

  • Understand the application of your code
  • Measure at the scale your code runs
    • Hot, Warm, or Cold
  • Use proper tooling aka JMH
  • Understand the Java Runtime
  • Think production!

Java Performance Fun

Things to Know when Executing or Profiling Code

Ohhhh Trap

You trained it well

private final static int SIZE = 1000;

public static void main(String[] args)
{
    Object trap = null;
    Object o = null;
    
    for (int i = 0; i < 1000; i++)
    {
        final Timer t = Timer.startTimer();
        
        for (int j = 0; j < SIZE; j++)
        {
            // burn time and train that null is normal
            o = new Object();
            
            if (trap != null)
            {
                System.out.println("Got you." + o);
                trap = null;
            }
        }
        
        // Give me a Non-Null, Vasily. 
        // One Non-Null only, please.  
        if (i == 400)
        {
            trap = new Object();
        }

        System.out.println(
            MessageFormat.format("{1} {0, number, #}", 
            t.stop().runtimeNanos(), i));
    }        
}

Bonus Comparison

What If we had no if

Learning: Unused Code

If I don't need it, I remove it

  • Java freely removes code
    • Results are not needed
    • Code is probably not executed
    • Loops are empty, i.e. they only spin
  • Rare conditions can completely change the runtime behavior forever
  • Danger: this can kill all your benchmark results
  • If your code is complicated, Java cannot make that call
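The sink trick from PoorMansBenchmark can be sketched in isolation: keep a data dependency on the result plus a cheap, practically never-taken branch, so the JIT cannot prove the work unused (JMH's Blackhole does the same thing more rigorously). Class and method names here are illustrative:

```java
// Sketch: keeping results alive so the JIT cannot discard the work.
public class DeadCodeSink
{
    // The result is returned, so the computation cannot be removed
    static long compute(final int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
        {
            sum += (long) i * i;
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        long sink = 0;
        for (int i = 0; i < 1_000; i++)
        {
            sink += compute(1_000);
        }
        // cheap branch that still depends on the result;
        // the same trick as "if (sum < 0)" in PoorMansBenchmark
        if (sink == 42)
        {
            System.out.println(sink);
        }
    }
}
```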

More is Less

You Can Horribly Shoot Yourself in the Foot

/**
 * JMH benchmark.
 * Vary SIZE from 1 to 10000.
 */
public int nonSense()
{
    Object trap = null;
    Object o = null;
    
    for (int i = 0; i < OUTER; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            o = new Object();
            
            if (trap != null)
            {
                o = new Object();
                trap = null;
            }
        }
            
        if (i == 500)
        {
            trap = new Object();
        }
    }        

    return trap == null ? 1 : 0;
}
  • We vary the inner loop length
  • Use different ways to measure things
# One Shot Execution
Benchmark                           (SIZE)  Mode  Cnt         Score   Error  Units
RemoveNonSenseJMH.nonSenseSingle         1    ss         85,298.000          ns/op
RemoveNonSenseJMH.nonSenseSingle        10    ss        513,886.000          ns/op
RemoveNonSenseJMH.nonSenseSingle       100    ss      3,603,910.000          ns/op
RemoveNonSenseJMH.nonSenseSingle      1000    ss      9,465,274.000          ns/op
RemoveNonSenseJMH.nonSenseSingle     10000    ss     17,528,758.000          ns/op
# Repeated Execution, No Warmup
Benchmark                           (SIZE)  Mode  Cnt         Score   Error  Units
RemoveNonSenseJMH.nonSenseNoWarmup       1  avgt            963.733          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup      10  avgt            600.373          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup     100  avgt            597.396          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup    1000  avgt            697.616          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup   10000  avgt          2,258.298          ns/op
# Warm it Up First
Benchmark                           (SIZE)  Mode  Cnt         Score   Error  Units
RemoveNonSenseJMH.nonSenseWarmup         1  avgt            660.373          ns/op
RemoveNonSenseJMH.nonSenseWarmup        10  avgt            609.107          ns/op
RemoveNonSenseJMH.nonSenseWarmup       100  avgt            409.340          ns/op
RemoveNonSenseJMH.nonSenseWarmup      1000  avgt            439.348          ns/op
RemoveNonSenseJMH.nonSenseWarmup     10000  avgt            410.092          ns/op
# -Xint to verify
Benchmark                           (SIZE)  Mode  Cnt         Score   Error  Units
RemoveNonSenseJMH.nonSenseWarmup         1  avgt        113,536.227          ns/op
RemoveNonSenseJMH.nonSenseWarmup        10  avgt      1,110,772.473          ns/op
RemoveNonSenseJMH.nonSenseWarmup       100  avgt     11,449,399.556          ns/op
RemoveNonSenseJMH.nonSenseWarmup      1000  avgt    112,158,390.000          ns/op
RemoveNonSenseJMH.nonSenseWarmup     10000  avgt  1,104,793,429.000          ns/op

Understand the Insides of the Runtime

Ok, Now We Can Talk about Performance

Escape Analysis

Use the stack instead of the heap

@Benchmark
public long array64()
{
    int[] a = new int[64];
    
    a[0] = r.nextInt();
    a[1] = r.nextInt();
    
    return a[0] + a[1];
}

@Benchmark
public long array65()
{
    int[] a = new int[65];
    
    a[0] = r.nextInt();
    a[1] = r.nextInt();
    
    return a[0] + a[1];
}
EscapeAnalysis.array64                           avgt  2    22.804           ns/op
EscapeAnalysis.array65                           avgt  2    41.890           ns/op
EscapeAnalysis.array64                           avgt  2    22.804           ns/op
EscapeAnalysis.array64:·gc.alloc.rate            avgt  2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm       avgt  2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array64:·gc.count                 avgt  2       ≈ 0          counts

EscapeAnalysis.array65                           avgt  2    41.890           ns/op
EscapeAnalysis.array65:·gc.alloc.rate            avgt  2  5795.571          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm       avgt  2   280.000            B/op
EscapeAnalysis.array65:·gc.count                 avgt  2    93.000          counts
# -XX:EliminateAllocationArraySizeLimit=70
EscapeAnalysis.array65                           avgt  2    22.279           ns/op
EscapeAnalysis.array65:·gc.alloc.rate            avgt  2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm       avgt  2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array65:·gc.count                 avgt  2       ≈ 0          counts

Expensive Last Statement

Just one statement more...

@Benchmark
public long add230()
{
    int sum = 0;
    int i = 0;
    
    // now 23 of these
    sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
           a[++i] + a[++i] + a[++i] + a[++i] + a[++i];

    return sum;
}

@Benchmark
public long add231()
{
    int sum = 0;
    int i = 0;
    
    // now 23 of these
    sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
           a[++i] + a[++i] + a[++i] + a[++i] + a[++i];
    // and this
    sum += a[++i];

    return sum;
}
Benchmark                      Mode  Cnt    Score   Units
ExpensiveLastAdd.add230  avgt    2   41.159   ns/op
ExpensiveLastAdd.add231  avgt    2  122.162   ns/op
  • Darn... the last add is really expensive

Expensive Last Statement

Let's check the compiler

803  434       4  org.sample.ExpensiveLastAdd::add230 (2251 bytes)
    
797  434       3  org.sample.ExpensiveLastAdd::add231 (2263 bytes)
799  436       4  org.sample.ExpensiveLastAdd::add231 (2263 bytes)
811  436       4  org.sample.ExpensiveLastAdd::add231 (2263 bytes)   COMPILE SKIPPED: Out of stack space (not retryable)
  • We cannot C2-compile it
  • Stuck with C1 and full profiling turned on

Another Surprise

Just add integers together

@Benchmark
public int add1396()
{
    int sum = 0;
    
    sum += 1; ... sum += 1396;

    return sum;
}

@Benchmark
public int add1397()
{
    int sum = 0;
    
    sum += 1; ... sum += 1396;
    sum += 1397;

    return sum;
}
Benchmark                Mode  Cnt     Score   Error  Units
ExpensiveIntAdd.add1396  avgt    2     3.567          ns/op
ExpensiveIntAdd.add1397  avgt    2  2688.218          ns/op
  • That was really expensive, but why?
  • add1396 is constant-folded into a pre-evaluated return statement, hence it is really fast
  • add1397 hits the huge method limit
  • Methods above 8000 bytes of bytecode are not compiled (-XX:-DontCompileHugeMethods lifts that limit)
  • So it stays interpreted bytecode
# Executed with -Xint
Benchmark                Mode  Cnt     Score   Error  Units
ExpensiveIntAdd.add1396  avgt    2  2660.065          ns/op
ExpensiveIntAdd.add1397  avgt    2  2644.399          ns/op

Your loop is easy

Flatten out the loops

private int next()
{
    int i = r.nextInt(1) + 1;
    return i;
}

@Benchmark
public int classic()
{
    int sum = 0;
    int step = next();
    for (int i = 0; i < ints.length; i = i + 1)
    {
        sum += ints[i];
        step = next();
    }
    return sum + step;
}

@Benchmark
public int variable()
{
    int sum = 0;
    int step = next();
    for (int i = 0; i < ints.length; i = i + step)
    {
        sum += ints[i];
        step = next();
    }
    return sum + step;
}
Benchmark            (size)  Mode  Cnt      Score   Units
LoopUnroll.classic    10000  avgt    2  18,871.574  ns/op
LoopUnroll.variable   10000  avgt    2  27,433.414  ns/op
  • Our loop got unrolled
  • Fewer range checks, fewer jumps
# -Xint
Benchmark            (size)  Mode Cnt        Score  Units
LoopUnroll.classic    10000  avgt   2 2,650,812.533 ns/op
LoopUnroll.variable   10000  avgt   2 2,449,845.956 ns/op
# classic()
 8.995.890.331  cycles:u                  #    3,292 GHz                      
15.443.666.755  instructions:u            #    1,72  insn per cycle         
                                          #    0,32  stalled cycles per insn  
   368.352.858  branches:u                #  134,800 M/sec                    
     1.714.520  branch-misses:u           #    0,47% of all branches         
 1.383.613.083  L1-dcache-loads:u         #  506,340 M/sec                    
    43.244.875  L1-dcache-load-misses:u   #    3,13% of all L1-dcache hits    

# variable()     
 8.597.780.881  cycles:u                  #    3,235 GHz                      
 2.828.996.294  stalled-cycles-frontend:u #   32,90% frontend cycles idle    
23.722.750.782  instructions:u            #    2,76  insn per cycle         
                                          #    0,12  stalled cycles per insn 
 2.156.234.427  branches:u                #  811,237 M/sec                   
     1.189.662  branch-misses:u           #    0,06% of all branches          
 1.826.787.276  L1-dcache-loads:u         #  687,289 M/sec                   
    29.874.307  L1-dcache-load-misses:u   #    1,64% of all L1-dcache hits

Intrinsics or Native

Native code to the rescue

final int SIZE = 100_000;
final int[] src = new int[SIZE];
    
public int[] manualArrayCopy()
{
    int[] target = new int[SIZE];
    for (int i = 0; i < SIZE; i++)
    {
        target[i] = src[i];
    }
    
    return target;
}

public int[] systemArrayCopy()
{
    int[] target = new int[SIZE];
    System.arraycopy(
        src, 0, target, 0, SIZE);
    
    return target;
}

public int[] arraysCopy()
{
    return Arrays.copyOf(src, src.length);
}
Benchmark                   Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy.manual  avgt    2  48177.054          ns/op
IntrinsicsArrayCopy.system  avgt    2  48678.219          ns/op
  • Uh... native is not faster?
  • Well, once Java has compiled its code, that code is native too
                    Manual               System
=======================================================
Iteration   1: 1,431,758.000 ns/op    129,551.000 ns/op
Iteration   2:   714,942.000 ns/op     81,926.000 ns/op
Iteration   3:   188,272.000 ns/op     77,098.000 ns/op
Iteration   4:    70,491.000 ns/op     72,010.000 ns/op
Iteration   5:    82,862.000 ns/op     73,495.000 ns/op
Iteration   6:    66,534.000 ns/op     75,625.000 ns/op
Iteration   7:    61,277.000 ns/op     74,382.000 ns/op
  • Native is quickly outperformed by Hotspot
  • BUT: That depends on your hardware
  • Hotspot can vary its code and intrinsics selection based on CPU instructions available

Intrinsics or Native II

# Lenovo T530, JDK 8
Benchmark                   Mode  Cnt      Score  Units
IntrinsicsArrayCopy.manual  avgt    2 48,177.054  ns/op
IntrinsicsArrayCopy.system  avgt    2 48,678.219  ns/op

# Lenovo T450s, JDK 8, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  66,560.515  ns/op
IntrinsicsArrayCopy.system  avgt    2  67,051.983  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  60,674.969  ns/op
# Lenovo T450s, JDK 11, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  67,476.175  ns/op
IntrinsicsArrayCopy.system  avgt    2  58,952.894  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  58,771.408  ns/op
# Lenovo T450s, GraalVM EE 1.0.0.14, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  94,315.864  ns/op
IntrinsicsArrayCopy.system  avgt    2  70,201.933  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  71,605.877  ns/op
# Lenovo T450s, JDK 11, AC
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  56,108.312  ns/op
IntrinsicsArrayCopy.system  avgt    2  55,748.067  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  53,769.876  ns/op
# AWS c5.xlarge, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  83,066.975  ns/op
IntrinsicsArrayCopy.system  avgt    2  63,422.029  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  64,748.500  ns/op
# GCP n1-standard-4, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2 146,932.588  ns/op
IntrinsicsArrayCopy.system  avgt    2 101,970.577  ns/op
IntrinsicsArrayCopy.arrays  avgt    2 116,233.650  ns/op
# GCP n1-highcpu-8, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  82,865.258  ns/op
IntrinsicsArrayCopy.system  avgt    2  62,810.793  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  61,237.489  ns/op
# Lenovo T450s, JDK 11, BAT, Min Power
Benchmark                   Mode  Cnt        Score  Units
IntrinsicsArrayCopy.manual  avgt    2  215,360.090  ns/op
IntrinsicsArrayCopy.system  avgt    2  210,971.627  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  214,870.207  ns/op

Wait... What are Intrinsics?

Native Code vs. Intrinsics

  • Native: a JNI-based method call into any kind of provided native platform code
  • Intrinsics: native platform code, but not called as a method; instead placed directly in the Hotspot-compiled code
  • public static native void arraycopy(...) can be a native call or an intrinsic
  • Java makes that decision when the VM is started, based on CPU features
  • Even for methods fully visible as Java in the JDK source code, Java can and will use intrinsics instead, see java.lang.Math
# -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
@ 54   java.lang.Math::min (11 bytes)   (intrinsic)
@ 57   java.lang.System::arraycopy (0 bytes)   (intrinsic)

Stupid Compiler

Let's Concat some Strings


@Setup
public void setup()
{
    foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
}

public String plain()
{    
    return "a" + foo;
}

public String concat()
{    
    return "a".concat(foo);
}

public String builder()
{    
    return new StringBuilder().append("a").append(foo).toString();
}

public String builderSized()
{    
    return new StringBuilder(32).append("a").append(foo).toString();
}

public String builderNonFluid()
{    
     final StringBuilder sb = new StringBuilder();
     sb.append("a");
     sb.append(foo);
     return sb.toString();
}

public String builderNonFluidSized()
{    
    final StringBuilder sb = new StringBuilder(32);
    sb.append("a");
    sb.append(foo);
    return sb.toString();
}

Benchmark             Mode  Cnt   Score   Error  Units
builderSized          avgt    2  22.170          ns/op
plain                 avgt    2  22.213          ns/op
builder               avgt    2  22.409          ns/op
builderNonFluidSized  avgt    2  32.230          ns/op
concat                avgt    2  33.061          ns/op
builderNonFluid       avgt    2  49.581          ns/op
  • A sized builder is not faster?
  • A builder that is not fluid is slower?
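Part of the answer: javac desugars string concatenation itself. On JDK 8 it emits exactly the fluent StringBuilder chain (JDK 9+ uses an invokedynamic call to StringConcatFactory instead), so plain and builder compile to essentially the same shape. A sketch of that equivalence, with illustrative names (this shows equal results, not equal speed):

```java
// What "a" + foo desugars to: JDK 8 javac emits the fluent
// StringBuilder chain; JDK 9+ uses invokedynamic/StringConcatFactory,
// producing the same result.
public class ConcatDesugar
{
    static String plain(final String foo)
    {
        return "a" + foo;
    }

    static String desugared(final String foo)
    {
        // the shape JDK 8 javac generates for the method above
        return new StringBuilder().append("a").append(foo).toString();
    }
}
```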

Stupid Compiler II

Don't Trust Your First Benchmark


private String[] foo = new String[5];
private FastRandom r = new FastRandom(1L);

@Setup
public void setup()
{
    foo[0] = String.valueOf(r.nextInt()) + "ashdfhgas dfhsa df";
    foo[1] = String.valueOf(r.nextInt()) + "hsa df";
    foo[2] = String.valueOf(r.nextInt()) + "fhsa df";
    foo[3] = String.valueOf(r.nextInt()) + "as dfhsa df";
    foo[4] = String.valueOf(r.nextInt()) 
                    + "ashdfhgas dfhsa hsajdhf kjahskjdh fkjhsa dhfkjskhj hdf"
                    + "ashdfhgas dfhsa hsajdhf kjahs as87837 837 987kjskhj hdf";
}

public String foo()
{
    return foo[r.nextInt(5)];
}

@Benchmark
public String plain()
{    
    return "a" + foo();
}

@Benchmark
public String concat()
{    
    return "a".concat(foo());
}

// more here

Benchmark             Mode  Cnt   Score   Error  Units
concat                avgt    2  51.746          ns/op
plain                 avgt    2  66.310          ns/op
builder               avgt    2  66.313          ns/op
builderNonFluid       avgt    2  67.179          ns/op
builderSized          avgt    2  67.762          ns/op
builderNonFluidSized  avgt    2  67.860          ns/op

Benchmark             Mode  Cnt     Score   Error   Units
concat                avgt    2    52.437           ns/op
 gc.alloc.rate        avgt    2  1210.133          MB/sec

plain                 avgt    2    66.375           ns/op
 gc.alloc.rate        avgt    2  2022.217          MB/sec

builder               avgt    2    66.243           ns/op
 gc.alloc.rate        avgt    2  2025.848          MB/sec

builderNonFluid       avgt    2    72.185           ns/op
 gc.alloc.rate        avgt    2  1870.958          MB/sec

builderNonFluidSized  avgt    2    73.735           ns/op
 gc.alloc.rate        avgt    2  2354.521          MB/sec

builderSized          avgt    2    76.134           ns/op
 gc.alloc.rate        avgt    2  2276.236          MB/sec
  • Concat is memory efficient... interesting
  • Never assume that your artificial example tells it all

Let's talk hardware

Hardware does not matter... you wish!

final int SIZE = 1_000_000;
final int[] src = new int[SIZE];
    
@Benchmark
public int step1()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        sum += src[i];
    }
    
    return sum;
}

@Benchmark
public int step20()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i = i + 20)
    {
        sum += src[i];
    }
    
    return sum;
}
  • step20 must be faster, about 20x
Benchmark                      Mode  Cnt       Score   Error  Units
ArraysAndHardware.step1        avgt    2  325,159.158          ns/op
ArraysAndHardware.step20       avgt    2   97,402.981          ns/op
  • We expected about 16,300 ns/op
  • We got only 3.3x more speed
  • What? Why?
  • P.S. The step20 results will vary a lot when running repeatedly

Hardware Greetings

Java, too, has to live by the laws of the hardware

Linux Perf for step1
==========================================================================================
   5174,723228      task-clock:u (msec)       #    0,320 CPUs utilized          
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
           148      page-faults:u             #    0,029 K/sec                  
17,608,341.100      cycles:u                  #    3,403 GHz                      (35,81%)
 9,565,555.351      stalled-cycles-frontend:u #   54,32% frontend cycles idle     (35,94%)
17,694,322.391      instructions:u            #     1,00 insn per cycle         
                                              #     0,54 stalled cycles per insn  (43,14%)
   938,066.934      branches:u                #  181,279 M/sec                    (43,10%)
       659.800      branch-misses:u           #    0,07% of all branches          (43,08%)
15,030,118.557      L1-dcache-loads:u         # 2904,526 M/sec                    (42,20%)
Linux Perf for step20
==========================================================================================
   5211,208535      task-clock:u (msec)       #    0,321 CPUs utilized          
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
           155      page-faults:u             #    0,030 K/sec                  
16,720,652.877      cycles:u                  #    3,209 GHz                      (35,98%)
14,817,482.507      stalled-cycles-frontend:u #   88,62% frontend cycles idle     (36,13%)
 4,354,757.726      instructions:u            #     0,26 insn per cycle         
                                              #     3,40 stalled cycles per insn  (43,39%)
 1,081,134.885      branches:u                #  207,463 M/sec                    (43,47%)
       527.695      branch-misses:u           #    0,05% of all branches          (43,44%)
 1,073,841.169      L1-dcache-loads:u         #  206,064 M/sec                    (42,18%)
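The perf counters point at the answer: step20 executes four times fewer instructions but stalls 88% of the time, because both loops are bound by memory traffic, and memory moves in 64-byte cache lines (16 ints each). A stride of 20 ints (80 bytes) starts a new line on almost every access, so step20 has to fetch nearly as many lines as step1. A quick sketch to verify the line counts (the benchmark's SIZE is an assumption here):

```java
public class CacheLines
{
    static final int LINE_BYTES = 64;                            // typical x86 cache line
    static final int INTS_PER_LINE = LINE_BYTES / Integer.BYTES; // 16 ints per line

    // counts the distinct 64-byte lines a forward walk over `size` ints
    // with the given stride has to fetch from the memory hierarchy
    static long linesTouched(int size, int stride)
    {
        long lines = 0;
        int lastLine = -1;
        for (int i = 0; i < size; i += stride)
        {
            final int line = i / INTS_PER_LINE;
            if (line != lastLine)
            {
                lines++;
                lastLine = line;
            }
        }
        return lines;
    }

    public static void main(String[] args)
    {
        final int SIZE = 1_000_000; // assumption: the SIZE used by the benchmark

        // step1 reads every int, but memory still delivers whole lines
        System.out.println("step1:  " + linesTouched(SIZE, 1) + " lines");  // 62,500
        // step20 reads every 20th int, yet lands on a new line each time
        System.out.println("step20: " + linesTouched(SIZE, 20) + " lines"); // 50,000
    }
}
```

step20 still pulls 80% of step1's cache lines, which is why it is only about 3x faster, not 20x.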

Detour: The modern CPU

Everything happens in parallel

  • Current Intel CPUs can execute 3 to 4 instructions per cycle
  • At 3 GHz, a single cycle takes about 0.3 ns
  • Code is executed in parallel even though it was not written that way
  • Each core has multiple parallel execution units
  • The CPU reorders instructions when permitted, to fill the execution units
  • It tries to run ahead of results (prediction, speculation)
  • When the execution units are not filled, performance is wasted
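The point about parallel execution units can be made concrete: a sum with one accumulator is a single dependency chain, while splitting it into independent accumulators lets the out-of-order core overlap the additions. A minimal sketch (the JIT may perform similar unrolling on its own, so always measure before relying on this):

```java
public class UnrolledSum
{
    // one long dependency chain: each add must wait for the previous result
    static int singleChain(int[] src)
    {
        int sum = 0;
        for (int i = 0; i < src.length; i++)
        {
            sum += src[i];
        }
        return sum;
    }

    // four independent chains: up to four adds can be in flight per cycle
    static int fourChains(int[] src)
    {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < src.length; i += 4)
        {
            s0 += src[i];
            s1 += src[i + 1];
            s2 += src[i + 2];
            s3 += src[i + 3];
        }
        for (; i < src.length; i++) // leftover elements
        {
            s0 += src[i];
        }
        return s0 + s1 + s2 + s3;
    }
}
```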

The Matrix


final int[][] src = new int[1000][1000];

@Benchmark
public int horizontal() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

@Benchmark
public int vertical() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[j][i];
        }
    }
    return sum;
}

@Benchmark
public int horizontalBackwards() {
    int sum = 0;
    for (int i = SIZE - 1; i >=0; i--)
    {
        for (int j = SIZE - 1; j >= 0; j--)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

What is real and what is not

  • All calculations should have the same speed, shouldn't they?

Benchmark                                Mode           Score   Error  Units
horizontal                               avgt     470,009.870          ns/op
horizontalBackwards                      avgt     672,821.809          ns/op
vertical                                 avgt   3,209,562.255          ns/op
  • Horizontal access is about 7x faster than vertical, and forward is about 30% faster than backwards!

Benchmark                                         Mode           Score   Error  Units
Matrix.horizontal                                 avgt     470,009.870          ns/op
Matrix.horizontal:L1-dcache-load-misses           avgt      63,645.114           #/op
Matrix.horizontal:L1-dcache-loads                 avgt   1,014,532.653           #/op
Matrix.horizontal:LLC-load-misses                 avgt         429.477           #/op
Matrix.horizontal:LLC-loads                       avgt       1,523.997           #/op

Matrix.horizontalBackwards                        avgt     672,821.809          ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses  avgt      63,053.428           #/op
Matrix.horizontalBackwards:L1-dcache-loads        avgt   1,007,381.378           #/op
Matrix.horizontalBackwards:LLC-load-misses        avgt         854.986           #/op
Matrix.horizontalBackwards:LLC-loads              avgt       2,764.802           #/op

Matrix.vertical                                   avgt   3,209,562.255          ns/op
Matrix.vertical:L1-dcache-load-misses             avgt   2,060,998.962           #/op
Matrix.vertical:L1-dcache-loads                   avgt   3,020,796.962           #/op
Matrix.vertical:LLC-load-misses                   avgt      15,223.808           #/op
Matrix.vertical:LLC-loads                         avgt     694,397.646           #/op
  • Being friends with the caches pays off

Matrix - Yes, you probably did it wrong

Influence of the Size on the Result


Benchmark             (SIZE) Mode            Score Factor  Units
horizontal                1  avgt              5.3    1.0  ns/op
horizontal               10  avgt             45.2    1.0  ns/op
horizontal               20  avgt            141.3    1.0  ns/op
horizontal              100  avgt          3,388.0    1.0  ns/op
horizontal             1000  avgt        373,619.3    1.0  ns/op
horizontal            10000  avgt     41,701,496.2    1.0  ns/op

horizontalBackwards       1  avgt              5.3    1.0  ns/op
horizontalBackwards      10  avgt             43.2    1.0  ns/op
horizontalBackwards      20  avgt            140.1    1.0  ns/op
horizontalBackwards     100  avgt          3,261.3    1.0  ns/op
horizontalBackwards    1000  avgt        511,246.1    1.4  ns/op
horizontalBackwards   10000  avgt     46,908,230.0    1.1  ns/op

vertical                  1  avgt              5.3    1.0  ns/op
vertical                 10  avgt             77.0    1.7  ns/op
vertical                 20  avgt            275.8    2.0  ns/op
vertical                100  avgt          6,226.3    1.8  ns/op
vertical               1000  avgt      2,719,478.0    7.2  ns/op
vertical              10000  avgt  1,683,041,892.2   40.4  ns/op
  • Understand your future data set
  • Don't limit to your own machine
  • Verify theories!
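If the algorithm genuinely needs column-wise results, the fix is usually loop interchange: walk the matrix in the fast, row-major order and keep one running sum per column. A sketch (our own helper, not from the benchmark code):

```java
public class ColumnSums
{
    // Cache-friendly column sums: stream each contiguous row once instead of
    // jumping down a column and missing the cache on every access.
    static int[] columnSums(int[][] src)
    {
        final int[] sums = new int[src[0].length];
        for (final int[] row : src)        // rows are contiguous in memory
        {
            for (int j = 0; j < row.length; j++)
            {
                sums[j] += row[j];         // sequential access in both arrays
            }
        }
        return sums;
    }
}
```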

Detour: Cost

What it costs to leave the CPU

                     Real       Humanized
CPU Cycle            0.4 ns     1 s
L1 Cache             0.9 ns     2 s
L2 Cache             2.8 ns     7 s
L3 Cache             28 ns      1 min
Memory Access        100 ns     4 min
NVM SSD              25 μs      17 h
SSD                  50–150 μs  1.5–4 days
HDD                  1–10 ms    1–9 months
TCP SF to NY         65 ms      5 years
TCP SF to Hong Kong  141 ms     11 years
  • It costs 100 to 200 cycles when we have to go to main memory
  • Hence Java and the CPU avoid main memory access at all cost
  • Java has no notion of memory locations, hence these costs stay invisible to the programmer

I know what you will do next

Modern Fortune Tellers


private static final int COUNT = 10_000;

// Contains random numbers from -50 to 50
private int[] sorted;
private int[] unsorted;
private int[] reversed;

public void doIt(
            int[] array, 
            Blackhole bh1, 
            Blackhole bh2)
{
    for (int v : array)
    {
        if (v > 0)
        {
            bh1.consume(v);
        }
        else
        {
            bh2.consume(v);
        }
    }
}
  • Will there be any difference when passing sorted or unsorted arrays?
Benchmark                  Mode  Cnt      Score   Error  Units
BranchPrediction.reversed  avgt    2  39570.696          ns/op
BranchPrediction.sorted    avgt    2  39859.027          ns/op
BranchPrediction.unsorted  avgt    2  66043.605          ns/op
  • Sorted is 65% faster. Why?
  • Let's ask the CPU

Branch Prediction

Modern CPU marvels: I know what you might need next


Benchmark                                        Mode        Score   Error  Units
=================================================================================
BranchPrediction.reversed                        avgt    35,182.532         ns/op
BranchPrediction.reversed:IPC                    avgt         3.257          #/op
BranchPrediction.reversed:branch-misses          avgt        13.767          #/op
BranchPrediction.reversed:branches               avgt    55,339.912          #/op
BranchPrediction.reversed:cycles                 avgt   110,993.223          #/op
BranchPrediction.reversed:instructions           avgt   361,508.018          #/op

BranchPrediction.sorted                          avgt    35,183.238         ns/op
BranchPrediction.sorted:IPC                      avgt         3.257          #/op
BranchPrediction.sorted:branch-misses            avgt        14.738          #/op
BranchPrediction.sorted:branches                 avgt    55,884.946          #/op
BranchPrediction.sorted:cycles                   avgt   112,912.763          #/op
BranchPrediction.sorted:instructions             avgt   367,820.131          #/op

BranchPrediction.unsorted                        avgt    65,770.667         ns/op
BranchPrediction.unsorted:IPC                    avgt         1.577          #/op
BranchPrediction.unsorted:branch-misses          avgt      3795.737          #/op
BranchPrediction.unsorted:branches               avgt    56,008.188          #/op
BranchPrediction.unsorted:cycles                 avgt   211,089.268          #/op
BranchPrediction.unsorted:instructions           avgt   333,105.231          #/op

Branches: Know what you test

How small changes can mess up things big time


final int d = i % 2 == 0 ? -1 : 1; 


public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   5,147.764   ns/op
unsorted   avgt   3,317.656   ns/op
reversed   avgt   4,738.261   ns/op

final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5);

public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   3,588.073   ns/op
unsorted   avgt  14,282.879   ns/op
reversed   avgt   5,046.499   ns/op

final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5); 

public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v >= 0)
        {
            sum -= -v;
        }
        else
        {
            sum += +v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   3,226.433   ns/op
unsorted   avgt   3,261.952   ns/op
reversed   avgt   3,234.786   ns/op

final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5);
        
public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v >= 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   5,044.253   ns/op
unsorted   avgt  11,669.557   ns/op
reversed   avgt   5,390.210   ns/op
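The last two variants hint at the real lesson: both arms of the loop compute sum - |v|, because for v > 0 we subtract v, and otherwise we add a non-positive v, which also subtracts its magnitude. So the data-dependent branch can be removed entirely, leaving the branch predictor nothing to miss. A hedged sketch of that rewrite (our own method name):

```java
public class BranchFree
{
    // equivalent to: if (v > 0) sum -= v; else sum += v;
    // both arms reduce to sum -= |v|, so no conditional is needed
    public static int doItBranchFree(int[] array)
    {
        int sum = 0;
        for (int v : array)
        {
            sum -= Math.abs(v);
        }
        return sum;
    }
}
```

Whether the JIT emits a conditional move for the original loop is its decision; writing the intent without a branch takes that variable out of the experiment.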

Short Summary

Measurement

  • Naive benchmarks are wrong
  • No JMH, no fun
  • Even JMH measurements can be wrong
  • How and where you measure makes a difference

JDK

  • Forget your Java optimization knowledge
  • Java now is not what most people think Java is

JVM

  • Tiered runtime compilation
  • JIT removes code when not needed
  • JIT tries to keep data local
  • Large methods kill the C2 optimization
  • Loops are optimized
  • Intrinsics are used dynamically

Hardware

  • Memory access is expensive
  • Caches are your friend
  • Branches are expensive
  • The CPU speculates

Synchronization

Being alone and still having to sync

private final int SIZE = 1024;
private final Map<String, String> syncMap = 
            Collections.synchronizedMap(new HashMap<>(SIZE));
private final Map<String, String> unsyncMap = 
            new HashMap<>(SIZE);
private final Map<String, String> conMap = 
            new ConcurrentHashMap<>(SIZE);

@Benchmark
public String syncAccess() 
{
    return syncMap.get("1");
}
@Benchmark
public String unsyncAccess() 
{
    return unsyncMap.get("1");
}
@Benchmark
public String conAccess() 
{
    return conMap.get("1");
}
  • Which one is slowest and by how much?
  • Remember, just one thread, so no contention
Benchmark                       Mode  Cnt   Score   Units
SynchronizedAlone.conAccess     avgt    2   9.152   ns/op
SynchronizedAlone.syncAccess    avgt    2  23.462   ns/op
SynchronizedAlone.unsyncAccess  avgt    2   9.366   ns/op
  • Only about 40% of the performance remains
  • ConcurrentHashMap does well

With Contention

Let's try more threads

2 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2    9.721          ns/op
SynchronizedThreads4.syncAccess    avgt    2  183.086          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2    9.312          ns/op
4 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   19.661          ns/op
SynchronizedThreads4.syncAccess    avgt    2  216.041          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   19.631          ns/op
8 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   44.055          ns/op
SynchronizedThreads4.syncAccess    avgt    2  377.180          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   38.562          ns/op
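In JMH the thread count comes from @Threads or the -t option. To reason about what those runs do, here is a hand-rolled sketch of the contended access pattern (class and method names are ours; this is for understanding, not for measuring): with Collections.synchronizedMap every get serializes on a single monitor, while ConcurrentHashMap reads without locking, so its threads barely interact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MapContention
{
    // lets `threads` threads hammer map.get("1"); returns the total number of hits
    static long hammer(final Map<String, String> map, int threads, int callsPerThread)
        throws InterruptedException, ExecutionException
    {
        final ExecutorService pool = Executors.newFixedThreadPool(threads);
        final List<Future<Long>> futures = new ArrayList<>();
        for (int t = 0; t < threads; t++)
        {
            futures.add(pool.submit(() -> {
                long hits = 0;
                for (int i = 0; i < callsPerThread; i++)
                {
                    if (map.get("1") != null)
                    {
                        hits++;
                    }
                }
                return hits;
            }));
        }

        long total = 0;
        for (final Future<Long> f : futures)
        {
            total += f.get();
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception
    {
        final Map<String, String> conMap = new ConcurrentHashMap<>();
        conMap.put("1", "one");
        System.out.println(hammer(conMap, 8, 100_000) + " hits");
    }
}
```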

Looping

Loops compared


Benchmark                           (length)  Mode  Cnt     Score   Error  Units
ForEachSimple.arrayFor                     1  avgt    2     3.803          ns/op
ForEachSimple.arrayFor                    10  avgt    2     9.884          ns/op
ForEachSimple.arrayFor                   100  avgt    2    64.417          ns/op
ForEachSimple.arrayFor                  1000  avgt    2  1446.119          ns/op

ForEachSimple.arrayEnhanced                1  avgt    2     3.594          ns/op
ForEachSimple.arrayEnhanced               10  avgt    2     9.520          ns/op
ForEachSimple.arrayEnhanced              100  avgt    2    63.290          ns/op
ForEachSimple.arrayEnhanced             1000  avgt    2  1199.368          ns/op

ForEachSimple.classicFor                   1  avgt    2     5.771          ns/op
ForEachSimple.classicFor                  10  avgt    2    15.552          ns/op
ForEachSimple.classicFor                 100  avgt    2   103.035          ns/op
ForEachSimple.classicFor                1000  avgt    2  1388.853          ns/op

ForEachSimple.enhancedFor                  1  avgt    2     6.170          ns/op
ForEachSimple.enhancedFor                 10  avgt    2    17.025          ns/op
ForEachSimple.enhancedFor                100  avgt    2   118.442          ns/op
ForEachSimple.enhancedFor               1000  avgt    2  1578.651          ns/op

ForEachSimple.lambdaStream                 1  avgt    2    53.758          ns/op
ForEachSimple.lambdaStream                10  avgt    2    65.572          ns/op
ForEachSimple.lambdaStream               100  avgt    2   153.083          ns/op
ForEachSimple.lambdaStream              1000  avgt    2  1326.048          ns/op

ForEachSimple.parallelLambdaStream         1  avgt    2    94.212          ns/op
ForEachSimple.parallelLambdaStream        10  avgt    2  6283.293          ns/op
ForEachSimple.parallelLambdaStream       100  avgt    2  8443.437          ns/op
ForEachSimple.parallelLambdaStream      1000  avgt    2  9367.695          ns/op

// just some strings
final List<String> list = new ArrayList<>();
String[] array = null;

@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup() { /* uses length */ }

@Benchmark
public void arrayFor(Blackhole bh) 
{
    int result = 0;
    for (int i = 0; i < array.length; i++) 
    {
        result += array[i].length();
    }
    
    bh.consume(result);
}

// classicFor
for (int i = 0; i < list.size(); i++) 
{
    result += list.get(i).length();
}


// enhancedFor
for (String s : list) 
{
    result += s.length();
}

// lambdaStream
int result = list.stream()
    .mapToInt(s -> s.length())
    .sum();

//parallelLambdaStream
int result = list.parallelStream()
    .mapToInt(s -> s.length())
    .sum();

Looping

Not so empty loops compared


Benchmark             (length)  Mode  Cnt       Score   Error  Units
arrayFor                     1  avgt    2     238.355          ns/op
arrayFor                    10  avgt    2    2105.676          ns/op
arrayFor                   100  avgt    2   21957.823          ns/op
arrayFor                  1000  avgt    2  253262.024          ns/op

arrayEnhanced                1  avgt    2     234.163          ns/op
arrayEnhanced               10  avgt    2    2173.127          ns/op
arrayEnhanced              100  avgt    2   22080.198          ns/op
arrayEnhanced             1000  avgt    2  250038.041          ns/op

classicFor                   1  avgt    2     238.146          ns/op
classicFor                  10  avgt    2    2177.025          ns/op
classicFor                 100  avgt    2   22318.031          ns/op
classicFor                1000  avgt    2  249914.164          ns/op

enhancedFor                  1  avgt    2     254.852          ns/op
enhancedFor                 10  avgt    2    2185.979          ns/op
enhancedFor                100  avgt    2   23011.174          ns/op
enhancedFor               1000  avgt    2  262605.553          ns/op

lambdaStream                 1  avgt    2     395.788          ns/op
lambdaStream                10  avgt    2    2829.036          ns/op
lambdaStream               100  avgt    2   27771.462          ns/op
lambdaStream              1000  avgt    2  304831.611          ns/op

# 2+2 CPU
parallelLambdaStream         1  avgt    2     453.566          ns/op
parallelLambdaStream        10  avgt    2   11686.427          ns/op
parallelLambdaStream       100  avgt    2   26798.653          ns/op
parallelLambdaStream      1000  avgt    2  189304.394          ns/op
  • Gave the thing something to do
final List<String> list = new ArrayList<>();
    
@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup()
{
    for (int i = 0; i < length; i++)
    {
        list.add(
            MessageFormat.format(
            "{0},{0},{0},{0},{0},{0}", String.valueOf(i)));
    }
}

@Benchmark
public void classicFor(Blackhole bh)
{
    int result = 0;
    for (int i = 0; i < list.size(); i++)
    {
        final String s = list.get(i);
        if (s.startsWith("5"))
        {
            continue;
        }
        
        final String[] a = s.split(",");
        for (int j = 0; j < a.length; j++)
        {
            result += Integer.valueOf(a[j]);
        }
    }
    
    bh.consume(result);
}

Looping - Bonus Content

T450s vs. GCP N1-highcpu-16core


# Lenovo T450s
Benchmark             (length)  Mode  Cnt        Score  Error  Units
arrayFor                     1  avgt    2        238.3         ns/op
arrayFor                    10  avgt    2      2,105.6         ns/op
arrayFor                   100  avgt    2     21,957.8         ns/op
arrayFor                  1000  avgt    2    253,262.0         ns/op
arrayFor                 10000  avgt    2  2,810,799.6         ns/op
arrayFor                100000  avgt    2 31,116,545.4         ns/op

classicFor                   1  avgt    2        238.1         ns/op
classicFor                  10  avgt    2      2,177.0         ns/op
classicFor                 100  avgt    2     22,318.0         ns/op
classicFor                1000  avgt    2    257,362.6         ns/op
classicFor               10000  avgt    2  2,809,832.8         ns/op
classicFor              100000  avgt    2 32,017,487.3         ns/op

enhancedFor                  1  avgt    2        254.8         ns/op
enhancedFor                 10  avgt    2      2,185.9         ns/op
enhancedFor                100  avgt    2     23,011.1         ns/op
enhancedFor               1000  avgt    2    254,691.8         ns/op
enhancedFor              10000  avgt    2  3,180,447.3         ns/op
enhancedFor             100000  avgt    2 31,479,657.4         ns/op

lambdaStream                 1  avgt    2        395.7         ns/op
lambdaStream                10  avgt    2      2,829.0         ns/op
lambdaStream               100  avgt    2     27,771.4         ns/op
lambdaStream              1000  avgt    2    299,971.9         ns/op
lambdaStream             10000  avgt    2  3,181,573.5         ns/op
lambdaStream            100000  avgt    2 36,055,386.8         ns/op

parallelLambdaStream         1  avgt    2        453.5         ns/op
parallelLambdaStream        10  avgt    2     11,686.4         ns/op
parallelLambdaStream       100  avgt    2     26,798.6         ns/op
parallelLambdaStream      1000  avgt    2    193,232.6         ns/op
parallelLambdaStream     10000  avgt    2  1,956,289.8         ns/op
parallelLambdaStream    100000  avgt    2 20,628,366.2         ns/op

# Google n1-highcpu-16core
Benchmark             (length)  Mode  Cnt       Score   Error  Units
arrayFor                     1  avgt   2        263.4          ns/op
arrayFor                    10  avgt   2      2,390.4          ns/op
arrayFor                   100  avgt   2     24,871.1          ns/op
arrayFor                  1000  avgt   2    266,237.0          ns/op
arrayFor                 10000  avgt   2  3,020,851.2          ns/op
arrayFor                100000  avgt   2 30,428,126.5          ns/op

classicFor                   1  avgt   2        264.0          ns/op
classicFor                  10  avgt   2      2,490.3          ns/op
classicFor                 100  avgt   2     25,186.0          ns/op
classicFor                1000  avgt   2    281,254.1          ns/op
classicFor               10000  avgt   2  2,882,459.1          ns/op
classicFor              100000  avgt   2 34,657,730.5          ns/op

enhancedFor                  1  avgt   2        263.5          ns/op
enhancedFor                 10  avgt   2      2,329.8          ns/op
enhancedFor                100  avgt   2     24,811.0          ns/op
enhancedFor               1000  avgt   2    291,544.4          ns/op
enhancedFor              10000  avgt   2  3,185,459.2          ns/op
enhancedFor             100000  avgt   2 33,898,708.6          ns/op

lambdaStream                 1  avgt   2        547.9          ns/op
lambdaStream                10  avgt   2      3,841.7          ns/op
lambdaStream               100  avgt   2     28,808.7          ns/op
lambdaStream              1000  avgt   2    329,148.1          ns/op
lambdaStream             10000  avgt   2  3,684,920.1          ns/op
lambdaStream            100000  avgt   2 36,935,499.3          ns/op

parallelLambdaStream         1  avgt   2        521.7          ns/op
parallelLambdaStream        10  avgt   2     43,527.8          ns/op
parallelLambdaStream       100  avgt   2     62,834.3          ns/op
parallelLambdaStream      1000  avgt   2    110,602.3          ns/op
parallelLambdaStream     10000  avgt   2    558,666.5          ns/op
parallelLambdaStream    100000  avgt   2  5,183,311.8          ns/op

The Cost of Calling

Method Calls, Inheritance, and Compiler Magic

Inlining Methods


@CompilerControl(CompilerControl.Mode.INLINE)
private static String uppercase(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static String uppercaseNoInline(String s)

@CompilerControl(CompilerControl.Mode.INLINE)
private static String lowercase(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static String lowercaseNoInline(String s)

@CompilerControl(CompilerControl.Mode.INLINE)
private static int sumOfChars(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static int sumOfCharsNoInline(String s)

@Benchmark
public void largeMethod(Blackhole bh)
{ 
    // all the code of the methods copied here
    
    bh.consume(sum);
}

@Benchmark
public void smallAndCallsInline(Blackhole bh)
{    
    final String s = strings[0];
    
    final String s2 = uppercase(s);
    final String s3 = lowercase(s2);
    
    int sum = sumOfChars(s3);
    
    bh.consume(sum);
}

@Benchmark
public void smallAndCallsNoInline(Blackhole bh)
{    
    final String s = strings[0];
    
    final String s2 = uppercaseNoInline(s);
    final String s3 = lowercaseNoInline(s2);
    
    int sum = sumOfCharsNoInline(s3);
    
    bh.consume(sum);
}

Calling a method is expensive


# BAT JDK 11
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4860.736 ± 229.153  ns/op
smallAndCallsNoInline  avgt    5  5672.435 ± 219.164  ns/op
largeMethod            avgt    5  5294.679 ± 218.967  ns/op

# AC JDK 11
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4501.134 ± 217.389  ns/op
smallAndCallsNoInline  avgt    5  4460.731 ± 132.889  ns/op
largeMethod            avgt    5  4244.479 ± 209.325  ns/op

# BAT JDK 11 again
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  5661.628 ± 268.654  ns/op
smallAndCallsNoInline  avgt    5  5634.992 ± 323.852  ns/op
largeMethod            avgt    5  5300.749 ± 267.469  ns/op

# AC JDK 8
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4839.311 ± 158.454  ns/op
smallAndCallsNoInline  avgt    5  4923.197 ±  93.549  ns/op
largeMethod            avgt    5  4919.511 ± 223.358  ns/op
  • Benchmarking is hard!

The Cost of Calling

How expensive are virtual method calls

  • Check if there is an overhead to find the right method
  • This is about virtual method calls

public class MegaMorphicSimple
{
    static interface C { public int apply(int x); }
    
    static class C1 implements C 
    { 
        @Override
        public int apply(int x) { return x * x; }
    }
    
    @Benchmark
    public int _1_mono()
    {
        C c = null;
        int r = random.nextInt(4);
        
        if (r == 0) { c = new C1(); } 
        else if (r == 1) { c = new C1(); } 
        else if (r == 2) { c = new C1(); }
        else if (r == 3) { c = new C1(); }
         
        return c.apply(r);
    }

    @Benchmark
    public int _2_bi()
    {
        C c = null;
        int r = random.nextInt(4);
        
        if (r == 0) { c = new C1(); } 
        else if (r == 1) { c = new C2(); } 
        else if (r == 2) { c = new C1(); }
        else if (r == 3) { c = new C2(); }
         
        return c.apply(r);
    }

    @Benchmark
    public int _5_manualDispatch()
    {
        ... 
        if (c instanceof C1)
        {
            return ((C1) c).apply(r);
        }
        else if (c instanceof C2)
        {
            return ((C2) c).apply(r);            
        }
        else if (c instanceof C3)
        {
            return ((C3) c).apply(r);            
        }
        else
        {
            return ((C4) c).apply(r);            
        }
    }
}

The Cost of Calling

How expensive are virtual method calls?


# identical method per implementing class    
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicSimple._1_mono             avgt    5  19.703 ± 2.177  ns/op
MegaMorphicSimple._2_bi               avgt    5  21.917 ± 3.759  ns/op
MegaMorphicSimple._4_mega             avgt    5  26.778 ± 4.411  ns/op
MegaMorphicSimple._5_manualDispatch   avgt    5  21.861 ± 2.860  ns/op

static interface C { default int apply(int x){ return x * x; }};

static class C1 implements C {};
static class C2 implements C {};
static class C3 implements C {};
static class C4 implements C {};

# default method at interface    
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicDefault._1_mono            avgt    5  18.867 ± 1.325  ns/op
MegaMorphicDefault._2_bi              avgt    5  21.817 ± 2.613  ns/op
MegaMorphicDefault._4_mega            avgt    5  26.355 ± 2.591  ns/op
MegaMorphicDefault._5_manualDispatch  avgt    5  21.763 ± 3.479  ns/op
  • That looks expensive
  • Fancy that we can help the runtime
  • Shall we do that from now on in that ugly way?
  • ...but these are nanoseconds and the methods are useless.

# method got bigger and we made that non-default
Benchmark                                 Mode  Cnt    Score   Error  Units
MegaMorphicLargeMethod._1_mono            avgt    5  158.775 ± 4.971  ns/op
MegaMorphicLargeMethod._2_bi              avgt    5  157.100 ± 6.153  ns/op
MegaMorphicLargeMethod._4_mega            avgt    5  165.580 ± 8.669  ns/op
MegaMorphicLargeMethod._5_manualDispatch  avgt    5  162.729 ± 3.795  ns/op

Really calling virtually

So we don't have to think about virtual calls at all

  • Let's have an array with different objects of a type C
  • C has four possible implementations
  • We either let Java make the call or untangle it ourselves

public class MegaMorphicArray 
{
    private C[] array = new C[20_000];
    
    @Setup
    public void setup() 
    {
        final Random r = new Random(42L);
        
        for (int x = 0; x < 20_000; x++) 
        {
            int i = r.nextInt(JHM_PARAM);
            
            if (i == 0) array[x] = new C1();
            else if (i == 1) array[x] = new C2();
            else if (i == 2) array[x] = new C3();
            else if (i == 3) array[x] = new C4();
        }        
    }
}

@Benchmark
public int morphic() 
{
    int sum = 0; 

    for (int i = 0; i < 20_000; i++) 
    {
        sum += array[i].apply(i);
    }

    return sum;
}    
    
@Benchmark
public int peeled()
{
    int sum = 0;
    for (int i = 0; i < 20_000; i++)
    {
        final C c = array[i];
        if (c instanceof C1)
        {
            sum += ((C1) c).apply(i);
        }
        else if (c instanceof C2)
        {
            sum += ((C2) c).apply(i);
        }
        else if (c instanceof C3)
        {
            sum += ((C3) c).apply(i);
        }
        else
        {
            sum += ((C4) c).apply(i);
        }
    }
    
    return sum;
}

Really calling virtually

The Results are in


# JDK 11
Benchmark (params)  Mode  Cnt     Score  Units
morphic         1  avgt    5   17,705.6  ns/op
morphic         2  avgt    5   98,269.4  ns/op
morphic         3  avgt    5  334,745.9  ns/op
morphic         4  avgt    5  364,485.9  ns/op

peeled          1  avgt    5   16,990.4  ns/op
peeled          2  avgt    5   94,084.2  ns/op
peeled          3  avgt    5  123,590.3  ns/op
peeled          4  avgt    5  147,067.2  ns/op

# JDK 11 -Xint
Benchmark (params) Mode  Cnt        Score  Units
morphic         1  avgt    5  1,682,550.7  ns/op
morphic         2  avgt    5  1,695,875.2  ns/op
morphic         3  avgt    5  1,679,643.2  ns/op
morphic         4  avgt    5  1,675,298.1  ns/op

peeled          1  avgt    5  1,764,953.0  ns/op
peeled          2  avgt    5  1,821,662.6  ns/op
peeled          3  avgt    5  1,885,678.8  ns/op
peeled          4  avgt    5  1,936,054.8  ns/op

Megamorphic Internals

But why do we see this performance?


Benchmark                      (params)  Mode       Score       Error  Units
morphic                               1  avgt    16,829.7 ±  1165.264  ns/op
morphic:IPC                           1  avgt         3.2               #/op
morphic:L1-dcache-load-misses         1  avgt     6,263.3               #/op
morphic:L1-dcache-loads               1  avgt    40,296.8               #/op
morphic:branch-misses                 1  avgt         9.1               #/op
morphic:branches                      1  avgt    24,984.2               #/op
morphic:dTLB-load-misses              1  avgt         0.2               #/op
morphic:dTLB-loads                    1  avgt    40,401.5               #/op

morphic                               2  avgt    97,456.9 ±  1177.830  ns/op
morphic:IPC                           2  avgt         0.7               #/op
morphic:L1-dcache-load-misses         2  avgt     6,472.7               #/op
morphic:L1-dcache-loads               2  avgt    43,597.9               #/op
morphic:branch-misses                 2  avgt     9,850.6               #/op
morphic:branches                      2  avgt    41,935.3               #/op
morphic:dTLB-load-misses              2  avgt         2.0               #/op
morphic:dTLB-loads                    2  avgt    45,509.8               #/op

morphic                               3  avgt   334,117.8 ±  3334.059  ns/op
morphic:IPC                           3  avgt         1.1               #/op
morphic:L1-dcache-load-misses         3  avgt     6,840.3               #/op
morphic:L1-dcache-loads               3  avgt   407,654.7               #/op
morphic:branch-misses                 3  avgt    13,247.0               #/op
morphic:branches                      3  avgt   146,105.8               #/op
morphic:dTLB-load-misses              3  avgt        13.5               #/op
morphic:dTLB-loads                    3  avgt   410,043.5               #/op

morphic                               4  avgt   390,601.4 ± 40515.922  ns/op
morphic:IPC                           4  avgt         0.9               #/op
morphic:L1-dcache-load-misses         4  avgt     6,740.5               #/op
morphic:L1-dcache-loads               4  avgt   408,814.3               #/op
morphic:branch-misses                 4  avgt    15,099.0               #/op
morphic:branches                      4  avgt   145,118.2               #/op
morphic:dTLB-load-misses              4  avgt        18.9               #/op
morphic:dTLB-loads                    4  avgt   410,767.5               #/op

Benchmark                      (params)  Mode        Score       Error  Units
peeled                                1  avgt     16,831.1 ±   816.491  ns/op
peeled:IPC                            1  avgt          3.2               #/op
peeled:L1-dcache-load-misses          1  avgt      6,218.4               #/op
peeled:L1-dcache-loads                1  avgt     39,797.1               #/op
peeled:branch-misses                  1  avgt          9.2               #/op
peeled:branches                       1  avgt     24,854.5               #/op
peeled:dTLB-load-misses               1  avgt          0.2               #/op
peeled:dTLB-loads                     1  avgt     40,133.1               #/op

peeled                                2  avgt     93,347.0 ±  3607.879  ns/op
peeled:IPC                            2  avgt          0.7               #/op
peeled:L1-dcache-load-misses          2  avgt      6,456.1               #/op
peeled:L1-dcache-loads                2  avgt     42,060.7               #/op
peeled:branch-misses                  2  avgt      9,252.3               #/op
peeled:branches                       2  avgt     41,153.9               #/op
peeled:dTLB-load-misses               2  avgt          6.0               #/op
peeled:dTLB-loads                     2  avgt     42,195.1               #/op

peeled                                3  avgt    123,075.7 ±  5181.245  ns/op
peeled:IPC                            3  avgt          0.6               #/op
peeled:L1-dcache-load-misses          3  avgt      6,472.7               #/op
peeled:L1-dcache-loads                3  avgt     42,059.2               #/op
peeled:branch-misses                  3  avgt     13,440.9               #/op
peeled:branches                       3  avgt     51,950.4               #/op
peeled:dTLB-load-misses               3  avgt          5.7               #/op
peeled:dTLB-loads                     3  avgt     43,097.4               #/op

peeled                                4  avgt    136,765.3 ±  2792.939  ns/op
peeled:IPC                            4  avgt          0.6               #/op
peeled:L1-dcache-load-misses          4  avgt      6,460.5               #/op
peeled:L1-dcache-loads                4  avgt     43,312.1               #/op
peeled:branch-misses                  4  avgt     14,818.3               #/op
peeled:branches                       4  avgt     62,214.6               #/op
peeled:dTLB-load-misses               4  avgt          8.0               #/op
peeled:dTLB-loads                     4  avgt     42,775.6               #/op

TLB - The Unknown Buffer

The CPU, understood

  • The perf counters above show high TLB activity
  • VMT: Virtual Method Table
  • Java looks up the right method in that table and then calls the resolved code
  • TLB: Translation Lookaside Buffer
  • The TLB stores recent translations of virtual memory addresses to physical addresses; it can be seen as an address-translation cache [1]
  • The VMT call target has to be resolved, which involves a TLB lookup
  • Java can inline one or two call targets at a call site, but not more
  • One target: inlined or called directly
  • Two targets: inlined or called directly, guarded by a type-check if-else
  • Three or more targets: the VMT has to be used
  • This behavior can change in future Java versions
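As a minimal sketch of these three cases (not the training's benchmark code; the names `Op` and `MorphismDemo` are made up for illustration), the same call site goes from monomorphic to megamorphic as the JIT observes more receiver types there:

```java
// Sketch only: a virtual call site becomes mono-, bi-, or megamorphic
// depending on how many receiver types the JIT has observed at it.
public class MorphismDemo {
    interface Op { int apply(int x); }

    // The loop body contains the hot, virtual call site.
    static int run(Op op, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            // 1 type seen: inlined; 2 types: guarded if-else; 3+: VMT dispatch
            sum += op.apply(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        Op inc = x -> x + 1;
        Op dbl = x -> x * 2;
        Op neg = x -> -x;
        // After all three implementations have been passed in,
        // the call site inside run() is megamorphic.
        System.out.println(run(inc, 10) + " " + run(dbl, 10) + " " + run(neg, 10));
    }
}
```

This mirrors why the morphic benchmark above degrades sharply at three implementations while one or two stay cheap.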

Lambdas are slow

Let's try some streams first

final int SIZE = 10240;
final int[] integers = new int[SIZE];

public int lambdaArrayCold()
{
    return Arrays.stream(integers)
        .filter(i -> i % 2 == 0).sum();
}

public int lambdaIntStreamCold()
{
    return IntStream.range(0, SIZE)
        .filter(i -> i % 2 == 0).sum();
}

public int loopCold()
{
    int sum = 0;
    for (int i = 0; i < integers.length; i++)
    {
        if (i % 2 == 0)
        {
            sum += integers[i];
        }
    }
    
    return sum;
}
Default
Benchmark                     Mode  Cnt        Score Units
lambda01.lambdaArrayCold      avgt    5     50,673.7 ns/op
lambda01.lambdaArrayHot       avgt    5     11,084.6 ns/op
lambda01.lambdaIntStreamCold  avgt    5     76,926.1 ns/op
lambda01.lambdaIntStreamHot   avgt    5     56,211.7 ns/op
lambda01.loopCold             avgt    5      7,355.3 ns/op
lambda01.loopHot              avgt    5      8,420.8 ns/op

-Xint
Benchmark                     Mode  Cnt        Score Units
lambda01.lambdaArrayCold      avgt    5  3,674,930.4 ns/op
lambda01.lambdaArrayHot       avgt    5  3,199,644.7 ns/op
lambda01.lambdaIntStreamCold  avgt    5  3,498,449.7 ns/op
lambda01.lambdaIntStreamHot   avgt    5  3,270,098.0 ns/op
lambda01.loopCold             avgt    5    166,555.5 ns/op
lambda01.loopHot              avgt    5    166,784.6 ns/op

-Xcomp
Benchmark                     Mode  Cnt        Score Units
lambda01.lambdaArrayCold      avgt    5    128,531.1 ns/op
lambda01.lambdaArrayHot       avgt    5     97,316.8 ns/op
lambda01.lambdaIntStreamCold  avgt    5    128,284.2 ns/op
lambda01.lambdaIntStreamHot   avgt    5    153,985.8 ns/op
lambda01.loopCold             avgt    5      9,520.1 ns/op
lambda01.loopHot              avgt    5     10,760.0 ns/op

Lambda Functions

public int forLambda(final Predicate<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int forAnonymous(final PseudoLambda<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int simple() {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (array[i] % 2 == 0) {
            sum++;
        }
    }
    return sum;
}
  • Compare Predicate, Pseudo-Predicate, and direct implementation
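The `PseudoLambda` type used above is not shown on the slide; a minimal stand-in consistent with its usage (same shape as `java.util.function.Predicate`, name and nesting assumed) could look like this:

```java
import java.util.function.Predicate;

public class PseudoLambdaDemo {
    // Assumed shape of the slide's PseudoLambda: a hand-rolled functional
    // interface mirroring java.util.function.Predicate.
    interface PseudoLambda<T> {
        boolean test(T t);
    }

    public static void main(String[] args) {
        // Anonymous class, as passed to forAnonymous...
        PseudoLambda<Integer> evenAnon = new PseudoLambda<Integer>() {
            @Override
            public boolean test(final Integer i) { return i % 2 == 0; }
        };
        // ...versus the real Predicate passed to forLambda.
        Predicate<Integer> evenLambda = i -> i % 2 == 0;
        System.out.println(evenAnon.test(4) + " " + evenLambda.test(5));
    }
}
```

Both shapes compile to a virtual call through an interface, which is why the JIT-compiled numbers below are indistinguishable.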

Benchmark                     Mode  Cnt     Score   Units
forAnonymousClass             avgt    2     136.3   ns/op
forLambdaPredicate            avgt    2     135.1   ns/op
forLambdaPredicatePredefined  avgt    2     135.2   ns/op
simple                        avgt    2     135.3   ns/op

# -Xint
Benchmark                     Mode  Cnt     Score   Units
forAnonymousClass             avgt    2  31,419.8   ns/op
forLambdaPredicate            avgt    2  30,531.0   ns/op
forLambdaPredicatePredefined  avgt    2  30,824.6   ns/op
simple                        avgt    2  10,327.4   ns/op

GC can be your friend


public class GCAndAccessSpeed 
{
    private final static int SIZE  = 100_000;
    private final List<String> STRINGS = new ArrayList<>();
    private final List<String> ordered = new ArrayList<>();
    private final List<String> nonOrdered = new ArrayList<>();
    
    @Param({"false", "true"}) boolean gc;
    @Param({"1", "10"})       int COUNT;
    @Param({"false", "true"}) boolean drop;
    
    @Setup
    public void setup() throws InterruptedException
    {
        final FastRandom r = new FastRandom(7);
        for (int i = 0; i < COUNT * SIZE; i++)
        {
            STRINGS.add(RandomUtils.randomString(r, 1, 20));
        }
        
        for (int i = 0; i < SIZE; i++)
        {
            ordered.add(STRINGS.get(i * COUNT));
        }
        nonOrdered.addAll(ordered);
        Collections.shuffle(nonOrdered, new Random(r.nextInt()));
        
        if (drop)
        {
            STRINGS.clear();
        }
        
        if (gc) 
        {
            for (int c = 0; c < 5; c++) 
            {
                System.gc();
                TimeUnit.SECONDS.sleep(2);
            }
        }
    }
    
    @Benchmark
    public int walkOrdered()
    {
        int sum = 0;
        for (int i = 0; i < ordered.size(); i++)
        {
            sum += ordered.get(i).length();
        }
        return sum;
    }

    @Benchmark
    public int walkNonOrdered()
    {
        int sum = 0;
        for (int i = 0; i < nonOrdered.size(); i++)
        {
            sum += nonOrdered.get(i).length();
        }
        return sum;
    }
}

Applied Memory Knowledge

  • The theory: objects that sit closer together in memory are better for the caches
  • G1 is a compacting collector, hence it moves live objects closer together
  • When walking a structure whose objects are more "local", the caches stay hotter
  • Verify with a no-GC setup, aka EpsilonGC
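A typical way to run that comparison from the command line (assuming the standard JMH uber-jar at `target/benchmarks.jar`; the heap size is an assumption, the Epsilon flags are the ones shown with the results below):

```shell
# Baseline: default G1, which compacts the heap and can improve locality
java -jar target/benchmarks.jar GCAndAccessSpeed

# Control: Epsilon never collects or compacts, so objects stay where they
# were allocated and the GC locality effect disappears
java -jar target/benchmarks.jar GCAndAccessSpeed \
     -jvmArgs "-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx8g"
```

Note that Epsilon needs a heap large enough for the whole run, since nothing is ever reclaimed.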

# G1 GC
Benchmark        (COUNT)  (drop)   (gc) Mode  Cnt      Score   Units
walkNonOrdered        1   false  false  avgt    5  1,596,315.2   ns/op
walkOrdered           1   false  false  avgt    5    611,137.7   ns/op

walkNonOrdered        1   false   true  avgt    5  1,172,951.3   ns/op
walkOrdered           1   false   true  avgt    5    431,143.5   ns/op

walkNonOrdered        1    true  false  avgt    5  1,562,844.1   ns/op
walkOrdered           1    true  false  avgt    5    605,119.4   ns/op

walkNonOrdered        1    true   true  avgt    5  1,243,973.9   ns/op
walkOrdered           1    true   true  avgt    5    400,721.9   ns/op

walkNonOrdered       10   false  false  avgt    5  1,903,731.9   ns/op
walkOrdered          10   false  false  avgt    5  1,229,945.1   ns/op

walkNonOrdered       10   false   true  avgt    5  2,026,861.7   ns/op
walkOrdered          10   false   true  avgt    5  1,809,961.9   ns/op

walkNonOrdered       10    true  false  avgt    5  1,920,658.4   ns/op
walkOrdered          10    true  false  avgt    5  1,239,658.5   ns/op

walkNonOrdered       10    true   true  avgt    5  1,160,229.2   ns/op
walkOrdered          10    true   true  avgt    5    403,949.5   ns/op

# -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC
Benchmark        (COUNT)  (drop)   (gc) Mode  Cnt      Score   Units
walkNonOrdered        1   false  false  avgt    5  1,667,611.7   ns/op
walkNonOrdered        1   false   true  avgt    5  1,820,968.2   ns/op
walkNonOrdered        1    true  false  avgt    5  1,928,822.7   ns/op
walkNonOrdered        1    true   true  avgt    5  1,777,251.4   ns/op

walkOrdered           1   false  false  avgt    5    931,728.5   ns/op
walkOrdered           1   false   true  avgt    5    902,433.7   ns/op
walkOrdered           1    true  false  avgt    5    930,294.3   ns/op
walkOrdered           1    true   true  avgt    5    907,886.5   ns/op

Conclusion

Rule 1

Don't worry too much about performance. Java is very good at taking care of it.

Rule 2

Don't optimize prematurely.

Rule 3

Simply write clean, good code. You won't ever see most of these issues. Think before you code. Java can make code fast, but bad algorithms stay bad algorithms, and bad designs stay bad designs.



Rule 4

When you worry, worry about the hottest code. Measure carefully and apply what you have learned. The compiler and the CPU are your friends, but they also make your profiling life harder.

Rule 5

Understand, measure, review, test, and measure again.

Rule 6

The code you write is not the code that is executed. Measure under production conditions so you understand the real behavior and don't end up optimizing non-compiled, non-optimized code.

Valhalla

Loom

The code

When you want to play with the code

  • All of today's examples and more
  • Do whatever you like with it
  • JDK 8 or higher and Maven required
  • Uses the Java Microbenchmark Harness (JMH) from OpenJDK
  • Optionally uses Linux perf kernel counters for low-level profiling

https://github.com/Xceptance/jmh-jmm-training

Questions and Answers