High Performance Java

About the Smart Java Internals

René Schwietzke, Xceptance GmbH, Jena, Germany

About

Xceptance

  • Focused on Software Test and Quality Assurance
  • Performance Testing, Test Automation, Functional Testing, QA and Test Process Consulting
  • Founded 2004
  • Headquarters in Jena, Germany
  • Subsidiary in Cambridge, MA, USA
  • Own Performance Test Tool XLT, Java-based, free

René Schwietzke

  • Co-Founder and Managing Director, Xceptance
  • Working with Java since version 1.0
  • Performance Tester since 1999
  • @ReneFoobarJ

Sources

Always state your sources and attribute other people's work

Java and Performance

Java is more than what most people think (Spoilers!)

The times when Java was a bytecode interpreter, running slow and sluggish, are long gone. But most people don't know the magic that is going on to speed up Java to native code speed and beyond.

Actually, most programmers don't even know how their code gets executed or how a computer works... sadly.

  • This talk is about how speed is achieved and the mysterious things behind that
  • Hardware is no longer irrelevant to Java
  • You will learn how to avoid common mistakes, but beyond that: just trust Java
  • You will learn that the code you write is not the code that runs

An Early Warning

Optimization and tuning is fun... but...

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth, Turing Award Lecture 1974

  • This talk is NOT about what your code should look like
  • If you expect tips like "use StringBuilder instead of..." this talk is not for you
  • This talk is not about the fastest code (bummer!)
  • It is about understanding the JVM and getting measurements right
  • It is about not working against Java
  • It is also about the right approach to performance tuning

Another Motivation, or an Early Proof

Java is Efficient, just the Why has to be Solved

Let's Benchmark

Get Some Measurements Going

Poor Man's Benchmark 1

import java.util.Arrays;

public class PoorMansBenchmark 
{   
    private static String append(final int i) 
    {
        final StringBuilder sb = new StringBuilder();
        sb.append(System.currentTimeMillis());
        sb.append("Foo"); 
        sb.append(i); 
        sb.append('c');
        sb.append("lkJAHJ AHSDJHF KJASHF HSAKJFD");
        
        final String s = sb.toString();
        char[] ch = s.toCharArray();
        Arrays.sort(ch);
        
        return new String(ch);
    }

    public static void main (final String[] args) 
    {
        final int SIZE = 100_000;
        final long[] result = new long[SIZE];
        
        int sum = 0;
        for (int i = 0; i < SIZE; i++)
        {
            final Timer t = Timer.startTimer();       
            
            sum += append(i).length();
            
            result[i] = t.stop().runtimeNanos();
        }  
        
        // pretend we need it but we really don't
        // just to avoid optimizations (spoilers, duh!)
        if (sum < 0)
        {
            System.out.println(sum);
        }
    }        
}

Just benchmark something...

  • Some nonsense with StringBuilder and String
  • We call our method in a loop
  • A helper tracks the time
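
The Timer used above is not a JDK class. A minimal sketch of such a helper, assuming a plain nanosecond stopwatch is all we need, could look like this:

```java
// Hypothetical Timer helper matching the usage above; not part of the JDK.
final class Timer
{
    private final long start;
    private long end;

    private Timer(final long start)
    {
        this.start = start;
    }

    public static Timer startTimer()
    {
        return new Timer(System.nanoTime());
    }

    public Timer stop()
    {
        this.end = System.nanoTime();
        return this;
    }

    public long runtimeNanos()
    {
        return end - start;
    }
}
```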

Let's Run the Poor Man's Benchmark

# SIZE = 10
 1, 617053
 2,  38628
 3,  24712
 4,  24459
 5,  33943
 6,  23639
 7,  24312
 8,  25948
 9,  23379
10,  23003
# SIZE = 50 
  1, 616953
  2,  37628
  3,  27712
  4,  25459
  5,  36943
  6,  24639
  7,  27312
  8,  23948
  9,  24379
 10,  24003
 20,  23770
 30,  25595
 40,  23631
 41,  23744
 42,  25004
 43,  26035
 44,  23782
 45,  22973
 46,  25261
 47,  23215
 48,  24135
 49,  23312
 50,  22908







# SIZE = 200
  1, 560064
  2,  33833
  3,  25935
  4,  24224
  5,  58761
  6,  25303
  7,  23832
  8,  23243
  9,  24405
 10,  23159
 20,  24005
 30,  23097
 40,  24002
 50,  23575
 60,  26380
 70,  24612
 80,  26099
 90,  23815
100,  33719
110,  24405
120,  25568
130,  27355
140,  23190
150,  12568
160,  22774
170,  21934
180,  11408
190,  12476
200,  10366

# SIZE = 2000
   1, 582933
   2,  37514
   3,  26746
   4,  24760
   5,  40041
   6,  23869
   7,  26384
   8,  23183
   9,  22801
  10,  24316
  20,  28037
  30,  47624
  40,  37747
  50,  37559
  60,  47502
  70,  35459
  80,  37795
  90,  36272
 100,  36344
 200,  14478
 300,  11769
 400,  11557
 500,  10088
 600,   9965
 700,   9387
 800,   8322
 900,   7305
1000,   6849
2000,   3347

# SIZE = 100,000
     1, 592763
    10,  24077
   100,  24995
   200,  10734
   300,  19562
   400,   8904
   500,   6226
   600,   6723
   700,   6981
   800,   5283
   900,   4896
  1000,   4883
  2000,   3367
  3000,   4947
  4000,   5179
  5000,   4648
  6000,   4113
  7000,   4283
  8000,   4654
  9000,   4054
 10000,   4300
 20000,  19314
 30000,    671
 40000,    694
 50000,    686
 60000,    663
 70000,    717
 80000,    697
 90000,    686
100000,    686

The Java Hater's Response

That must be Garbage Collection

  • Let's disable GC
  • Epsilon GC to the rescue
  • A no-op garbage collector
  • It allocates memory but never frees it
  • -Xms5g -Xmx5g
    -XX:+UnlockExperimentalVMOptions
    -XX:+UseEpsilonGC
    -XX:+AlwaysPreTouch
# No GC
# SIZE = 100,000
     1, 549924
    10,  23824
   100,  34017
   200,  16395
   300,  12858
   400,  11460
   500,  10504
   600,   9787
   700,   9935
   800,  11464
   900,   7369
  1000,   7365
  2000,   2771
  3000,   2763
  4000,   2753
  5000,   2762
  6000,   2846
  7000,   2703
  8000,   2737
  9000,   2726
 10000,   2738
 20000,   2770
 30000,   2706
 40000,   2764
 50000,   2657
 60000,   2476
 70000,   1040
 80000,   1002
 90000,    982
100000,    990
# G1 GC
# SIZE = 100,000
     1, 595433
    10,  23628
   100,  40456
   200,  17898
   300,  11618
   400,  13881
   500,   9707
   600,   9335
   700,   9153
   800,   8872
   900,   7258
  1000,   6953
  2000,   2882
  3000,   2737
  4000,   2748
  5000,   2756
  6000,   2724
  7000,   2686
  8000,   2758
  9000,   2751
 10000,   2743
 20000,   4219
 30000,   5016
 40000,   4437
 50000,   1488
 60000,    999
 70000,   1056
 80000,   4011
 90000,   1221
100000,    755

The Java Hater's Response II

Java is slow

  • GraalVM to the rescue
  • Let's compile to machine code
  • native-image
# G1 GC
# SIZE = 100,000
     1, 595433
    10,  23628
   100,  40456
   200,  17898
   300,  11618
   400,  13881
   500,   9707
   600,   9335
   700,   9153
   800,   8872
   900,   7258
  1000,   6953
  2000,   2882
  3000,   2737
  4000,   2748
  5000,   2756
  6000,   2724
  7000,   2686
  8000,   2758
  9000,   2751
 10000,   2743
 20000,   4219
 30000,   5016
 40000,   4437
 50000,   1488
 60000,    999
 70000,   1056
 80000,   4011
 90000,   1221
100000,    755

# GraalVM Native
# SIZE = 100,000
     1,   8791
    10,   4736
   100,   1989
   200,   1830
   300,   5040
   400,   1870
   500,   2267
   600,   1903
   700,   2279
   800,   3889
   900,   1771
  1000,   1620
  2000,   2415
  3000,   3758
  4000,   1290
  5000,   1300
  6000,   1249
  7000,   2636
  8000,   2160
  9000,   1254
 10000,   1309
 20000,   1730
 30000,   2446
 40000,   1320
 50000,   1332
 60000,   1363
 70000,   1385
 80000,   2737
 90000,   1373
100000,   1278

# G1 GC
# SIZE = 1,000,000
      1, 563883
     10,  23505
    100,  23311
   1000,   6727
   2000,   5174
   3000,   5444
   4000,   5160
   5000,   4815
   6000,   5130
   7000,   5138
   8000,   5162
   9000,   5032
  10000,   5230
  20000,   3992
  30000,   3673
  40000,   1909
  50000,   1908
  60000,   1106
  70000,    963
  80000,    966
  90000,    981
 100000,    958
 200000,    683
 300000,    693
 400000,    694
 500000,    691
 600000,    699
 700000,    697
 800000,   1057
 900000,    710
1000000,    703

That is not helpful

The Data is so Inconsistent

Let's Analyze

Greetings from the Java Internals

  • Java has about three compilers
  • You interact only with javac
  • To make your code fly, another compiler compiles the bytecode on-demand
  • Classic: Hotspot Compiler (our focus right now)
  • New Kid: Graal Compiler
  • Tiered Compiler: C1 and C2 with four modes
  • Profiles and optimizes continuously(!)
  • Can inline, loop-unroll, run escape analysis, branch-predict, use intrinsics, do lock tricks, and more (yes, we will get to that later)
  • The compiler decides what code is hot
  • Compilation is per method
  • In addition, loops are compiled on their own (on-stack replacement), because the method might be cold while the loop is hot

Compare Interpreter vs. Compiler

Hotspot

No Tiered Compiler

Interpreter Only

Compile Everything

Some more information about Hotspot

Just to get a feel for the complexity

Compiler Stages [1]

  • Level 0: Interpreter only
  • Level 1: C1, full optimization, no profiling
  • Level 2: C1 with invocation and backedge counters
  • Level 3: C1 with full profiling
  • Level 4: C2, fully optimized

Compiler Flow [2]

To Compile or Not to Compile

Compile when Needed vs. Right from the Start

  • -Xcomp forces Java to always compile to native code
  • Java may still change its opinion and recompile later
  • 70x slower in the beginning to be faster later
  • Effect 1: We compile things we might never need again
  • Effect 2: We might run out of compiler cache
# defaults
      1, 560,960
      2,  34,772
      3,  26,167
      4,  24,237
      5,  35,926
      6,  24,116
      7,  23,163
      8,  22,895
      9,  24,142
     10,  23,174
    100,  23,596
    200,  12,973
    300,  12,361
    400,   9,852
    500,   9,596
    600,   8,774
    700,   8,956
    800,   7,280
    900,   7,076
   1000,   6,089
  10000,   5,071
  20000,   2,683
  30000,   2,703
  40000,   1,532
  50000,   1,121
  60000,   1,012
  70000,     975
  80000,     979
  90000,     987
 100000,     976
 900000,     700
1000000,     708
# -Xcomp
      1, 40,041,490
      2,     51,254
      3,     17,217
      4,      2,077
      5,      1,714
      6,      1,607
      7,      1,610
      8,      1,536
      9,      1,576
     10,      1,536
    100,      1,376
    200,      2,481
    300,      1,284
    400,      1,307
    500,      1,150
    600,      1,177
    700,      1,173
    800,      1,163
    900,      1,174
   1000,      1,546
  10000,      1,096
  20000,        962
  30000,        999
  40000,      1,093
  50000,        891
  60000,        883
  70000,        885
  80000,        862
  90000,        894
 100000,        869
 900000,        871
1000000,        870

Watch the Compiler

48    1       3       java.lang.String::equals (81 bytes)
49    2       3       java.lang.String::hashCode (55 bytes)
49    3       3       java.lang.String::charAt (29 bytes)
49    4       3       java.lang.String::length (6 bytes)
51    5       3       java.lang.Object::<init> (1 bytes)
51    6       3       java.lang.AbstractStringBuilder::ensureCapacityInternal (27 bytes)
51    7     n 0       java.lang.System::arraycopy (native)   (static)
51    8       3       java.lang.String::indexOf (70 bytes)
52   10       3       java.lang.Math::min (11 bytes)
52    9       3       java.util.Arrays::copyOfRange (63 bytes)
53   11       1       java.lang.ref.Reference::get (5 bytes)
53   12       3       java.lang.String::getChars (62 bytes)
53   13       1       java.lang.ThreadLocal::access$400 (5 bytes)
57   14       3       java.util.DualPivotQuicksort::sort (1195 bytes)
62   20       1       java.lang.Object::<init> (1 bytes)
62    5       3       java.lang.Object::<init> (1 bytes)   made not entrant
  • Time Stamp since start
  • Compilation order
  • Compiler level
  • Method and size of the bytecode
  • n: native
  • s: synchronized
  • !: has exception handler
  • %: On-stack replacement (OSR)
  • made not entrant: no longer to be used for new calls

JMH to the rescue

Java Microbenchmark Harness

  • Developed by the OpenJDK project
  • Delivers consistent results
  • Lets you control the compiler
  • Suppresses optimizations such as dead-code elimination
  • CLI tool using Maven
  • Use IDE to write, but not to run
  • Knows different JVMs and their tricks
  • Contains profilers (JMX, Hotspot or OS based)
  • Also great for multi-threaded testing
  • Read the examples carefully before assuming your benchmark is legit
  • Keep in mind, most code is not hot

mvn clean install; java -Xms2g -Xmx2g -XX:+AlwaysPreTouch -jar target/benchmarks.jar YourClass
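
For reference, a minimal JMH benchmark class, a sketch with made-up class and method names, usually looks like this; the annotations control warm-up, measurement, and forking:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(1)
public class AppendBenchmark
{
    private String data;

    @Setup
    public void setup()
    {
        // a runtime value, so the compiler cannot constant-fold the result
        data = String.valueOf(System.currentTimeMillis());
    }

    @Benchmark
    public String append()
    {
        // returning the result lets JMH consume it,
        // which prevents dead-code elimination
        return new StringBuilder().append("Foo").append(data).toString();
    }
}
```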

Learning: Understand the Runtime

If you don't understand it, you just measure random numbers

  • Understand the usage patterns of your code
  • Measure at the scale your code runs
    • Hot, Warm, or Cold
  • Use proper tooling aka JMH
  • Understand the Java Runtime
  • Think production!

Java Performance Fun

Things to Know when Executing or Profiling Code

Ohhhh Trap

You trained it well

private final static int SIZE = 1000;

public static void main(String[] args)
{
    Object trap = null;
    Object o = null;
    
    for (int i = 0; i < 1000; i++)
    {
        final Timer t = Timer.startTimer();
        
        for (int j = 0; j < SIZE; j++)
        {
            // burn time and train that null is normal
            o = new Object();
            
            if (trap != null)
            {
                System.out.println("Got you." + o);
                trap = null;
            }
        }
        
        // Give me a Non-Null, Vasily. 
        // One Non-Null only, please.  
        if (i == 400)
        {
            trap = new Object();
        }

        System.out.println(
            MessageFormat.format("{1} {0, number, #}", 
            t.stop().runtimeNanos(), i));
    }        
}

Bonus Comparison

What if we had no if

Learning: Unused Code

If I don't need it, I remove it

  • Java freely removes code
    • Results are not needed
    • Code is probably not executed
    • Even loops might die and not just stay empty
  • Rare conditions can completely change the runtime behavior forever
  • Danger: That can kill all your benchmark results
  • Fire: One null or one exception can turn your performance upside-down
  • Important: If your code is too complicated, Java cannot make that call
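
The effect can be provoked without JMH: a loop whose result is provably unused is a removal candidate, while a loop that feeds an observable side effect is not. A minimal stdlib-only sketch (class name is made up):

```java
// Demonstrates why results must be consumed: the JIT may remove
// computations whose results are provably unused.
final class DeadCodeDemo
{
    static long sink; // writing to a static field keeps the result "alive"

    static void unused()
    {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++)
        {
            sum += i; // result never read: the JIT may delete the whole loop
        }
    }

    static void used()
    {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++)
        {
            sum += i;
        }
        sink = sum; // observable side effect, cannot be removed
    }

    public static void main(String[] args)
    {
        // warm both methods so they get compiled
        for (int i = 0; i < 20_000; i++)
        {
            unused();
            used();
        }
        System.out.println("sink = " + sink);
    }
}
```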

More is Less

You Can Horribly Shoot Yourself in the Foot

/**
 * JMH Benchmark 
 * Vary SIZE from 1 to 10000    
 */
public int nonSense()
{
    Object trap = null;
    Object o = null;
    
    for (int i = 0; i < OUTER; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            o = new Object();
            
            if (trap != null)
            {
                o = new Object();
                trap = null;
            }
        }
            
        if (i == 500)
        {
            trap = new Object();
        }
    }        

    return trap == null ? 1 : 0;
}
  • We vary the inner loop length
# One-Shot Execution aka measure one execution with no warm-up
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseSingle         1    ss         85,298          ns/op
RemoveNonSenseJMH.nonSenseSingle        10    ss        513,886          ns/op
RemoveNonSenseJMH.nonSenseSingle       100    ss      3,603,910          ns/op
RemoveNonSenseJMH.nonSenseSingle      1000    ss      9,465,274          ns/op
RemoveNonSenseJMH.nonSenseSingle     10000    ss     17,528,758          ns/op
# Repeated Execution, No Warm up, 100 ms duration
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseNoWarmup       1  avgt            963          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup      10  avgt            600          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup     100  avgt            597          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup    1000  avgt            697          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup   10000  avgt          2,258          ns/op
# Warm it Up First, 100 ms duration
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseWarmup         1  avgt            660          ns/op
RemoveNonSenseJMH.nonSenseWarmup        10  avgt            609          ns/op
RemoveNonSenseJMH.nonSenseWarmup       100  avgt            409          ns/op
RemoveNonSenseJMH.nonSenseWarmup      1000  avgt            439          ns/op
RemoveNonSenseJMH.nonSenseWarmup     10000  avgt            410          ns/op
# -Xint to verify
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseWarmup         1  avgt        113,536          ns/op
RemoveNonSenseJMH.nonSenseWarmup        10  avgt      1,110,772          ns/op
RemoveNonSenseJMH.nonSenseWarmup       100  avgt     11,449,399          ns/op
RemoveNonSenseJMH.nonSenseWarmup      1000  avgt    112,158,390          ns/op
RemoveNonSenseJMH.nonSenseWarmup     10000  avgt  1,104,793,429          ns/op

Understand Runtime Internals

Now We Can Really Talk About Performance

Expensive last statement

Just one statement more...

@Benchmark
public long add230()
{
    int sum = 0;
    int i = 0;
    
    // now 23 of these
    sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
           a[++i] + a[++i] + a[++i] + a[++i] + a[++i];

    return sum;
}

@Benchmark
public long add231()
{
    int sum = 0;
    int i = 0;
    
    // now 23 of these
    sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
           a[++i] + a[++i] + a[++i] + a[++i] + a[++i];
    // and this
    sum += a[++i];

    return sum;
}
Benchmark                      Mode  Cnt    Score   Units
ExpensiveLastAdd.add230  avgt    2   41.159   ns/op
ExpensiveLastAdd.add231  avgt    2  122.162   ns/op
  • Darn... the last add is really expensive

Expensive Last Statement

Let's check the compiler


803  434       4  org.sample.ExpensiveLastAdd::add230 (2251 bytes)
---    
797  434       3  org.sample.ExpensiveLastAdd::add231 (2263 bytes)
799  436       4  org.sample.ExpensiveLastAdd::add231 (2263 bytes)
811  436       4  org.sample.ExpensiveLastAdd::add231 (2263 bytes)   COMPILE SKIPPED: Out of stack space (not retryable)
  • C2 cannot compile it; the code is too "complex"
  • We are stuck with the C1 result, with full profiling still turned on

Another Surprise

Just add integers together

@Benchmark
public int add1396()
{
    int sum = 0;
    
    sum += 1; ... sum += 1396;

    return sum;
}

@Benchmark
public int add1397()
{
    int sum = 0;
    
    sum += 1; ... sum += 1396;
    sum += 1397;

    return sum;
}
Benchmark                Mode  Cnt     Score   Error  Units
ExpensiveIntAdd.add1396  avgt    2     3.567          ns/op
ExpensiveIntAdd.add1397  avgt    2  2688.218          ns/op
  • That was really expensive, but why?
  • add1396 is compiled into a pre-evaluated return statement, hence it is really fast
  • add1397 hits the huge method limit
  • The method bytecode limit is 8000 bytes
  • So it stays interpreted bytecode
# Executed with -Xint
Benchmark                Mode  Cnt     Score   Error  Units
ExpensiveIntAdd.add1396  avgt    2  2660.065          ns/op
ExpensiveIntAdd.add1397  avgt    2  2644.399          ns/op

Your loop is easy

Flatten out the loops

private int next()
{
    int i = r.nextInt(1) + 1;
    return i;
}

@Benchmark
public int classic()
{
    int sum = 0;
    int step = next();
    for (int i = 0; i < ints.length; i = i + 1)
    {
        sum += ints[i];
        step = next();
    }
    return sum + step;
}

@Benchmark
public int variable()
{
    int sum = 0;
    int step = next();
    for (int i = 0; i < ints.length; i = i + step)
    {
        sum += ints[i];
        step = next();
    }
    return sum + step;
}
Benchmark            (size)  Mode  Cnt      Score   Units
LoopUnroll.classic    10000  avgt    2  18,871.574  ns/op
LoopUnroll.variable   10000  avgt    2  27,433.414  ns/op
  • Our loop got unrolled
  • Fewer range checks, fewer jumps
# -Xint
Benchmark            (size)  Mode Cnt        Score  Units
LoopUnroll.classic    10000  avgt   2 2,650,812.533 ns/op
LoopUnroll.variable   10000  avgt   2 2,449,845.956 ns/op
# classic()
 8.995.890.331  cycles:u                  #    3,292 GHz                      
15.443.666.755  instructions:u            #    1,72  insn per cycle         
                                          #    0,32  stalled cycles per insn  
   368.352.858  branches:u                #  134,800 M/sec                    
     1.714.520  branch-misses:u           #    0,47% of all branches         
 1.383.613.083  L1-dcache-loads:u         #  506,340 M/sec                    
    43.244.875  L1-dcache-load-misses:u   #    3,13% of all L1-dcache hits    

# variable()     
 8.597.780.881  cycles:u                  #    3,235 GHz                      
 2.828.996.294  stalled-cycles-frontend:u #   32,90% frontend cycles idle    
23.722.750.782  instructions:u            #    2,76  insn per cycle         
                                          #    0,12  stalled cycles per insn 
 2.156.234.427  branches:u                #  811,237 M/sec                   
     1.189.662  branch-misses:u           #    0,06% of all branches          
 1.826.787.276  L1-dcache-loads:u         #  687,289 M/sec                   
    29.874.307  L1-dcache-load-misses:u   #    1,64% of all L1-dcache hits
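
Conceptually, what the JIT does to classic() resembles the following hand-unrolled version. This is only a sketch: the real generated code unrolls by a machine-dependent factor and may use SIMD, and it can only do this because the step is a compile-time constant:

```java
// Hand-unrolled equivalent of "sum += ints[i]" with a fixed step of 1.
final class UnrollSketch
{
    static int sumUnrolled(final int[] ints)
    {
        int sum = 0;
        int i = 0;
        final int limit = ints.length - 3;

        // main loop: four additions per iteration, fewer jumps and checks
        for (; i < limit; i += 4)
        {
            sum += ints[i] + ints[i + 1] + ints[i + 2] + ints[i + 3];
        }
        // tail loop picks up the remaining 0..3 elements
        for (; i < ints.length; i++)
        {
            sum += ints[i];
        }
        return sum;
    }
}
```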

Escape Analysis

Make it less expensive

@Benchmark
public long array64()
{
    int[] a = new int[64];
    
    a[0] = r.nextInt();
    a[1] = r.nextInt();
    
    return a[0] + a[1];
}

@Benchmark
public long array65()
{
    int[] a = new int[65];
    
    a[0] = r.nextInt();
    a[1] = r.nextInt();
    
    return a[0] + a[1];
}
EscapeAnalysis.array64                           avgt  2    22.804           ns/op
EscapeAnalysis.array65                           avgt  2    41.890           ns/op

# -prof gc
EscapeAnalysis.array64                           avgt  2    22.804           ns/op
EscapeAnalysis.array64:·gc.alloc.rate            avgt  2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm       avgt  2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array64:·gc.count                 avgt  2       ≈ 0          counts

EscapeAnalysis.array65                           avgt  2    41.890           ns/op
EscapeAnalysis.array65:·gc.alloc.rate            avgt  2  5795.571          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm       avgt  2   280.000            B/op
EscapeAnalysis.array65:·gc.count                 avgt  2    93.000          counts
# -XX:EliminateAllocationArraySizeLimit=70
EscapeAnalysis.array65                           avgt  2    22.279           ns/op
EscapeAnalysis.array65:·gc.alloc.rate            avgt  2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm       avgt  2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array65:·gc.count                 avgt  2       ≈ 0          counts
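
The same effect applies to small objects, not just arrays: if an allocation provably never leaves the method, C2 can replace it with plain local variables (scalar replacement). A sketch of the pattern escape analysis targets, with hypothetical class names:

```java
// A short-lived object that never escapes the method: a candidate
// for scalar replacement, i.e. potentially no heap allocation at all.
final class EscapeSketch
{
    private static final class Point
    {
        final int x;
        final int y;

        Point(final int x, final int y)
        {
            this.x = x;
            this.y = y;
        }
    }

    static int distanceSquared(final int x, final int y)
    {
        // after escape analysis this allocation can disappear;
        // C2 may keep x and y in registers instead
        final Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }
}
```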

Intrinsics or Native

Native code to the rescue

final int SIZE = 100_000;
final int[] src = new int[SIZE];
    
public int[] manualArrayCopy()
{
    int[] target = new int[SIZE];
    for (int i = 0; i < SIZE; i++)
    {
        target[i] = src[i];
    }
    
    return target;
}

public int[] systemArrayCopy()
{
    int[] target = new int[SIZE];
    System.arraycopy(src, 0, target, 0, SIZE);
    
    return target;
}

public int[] arraysCopy()
{
    return Arrays.copyOf(src, src.length);
}

Benchmark                   Mode  Cnt   Score    Units
IntrinsicsArrayCopy.manual  avgt    2  48,177    ns/op
IntrinsicsArrayCopy.system  avgt    2  48,678    ns/op
  • Uh... native is not faster?
  • Well, once Java has compiled our code, it is native too

Iteration         Manual           System
===============================================
Iteration   1: 1,431,758 ns/op    129,551 ns/op
Iteration   2:   714,942 ns/op     81,926 ns/op
Iteration   3:   188,272 ns/op     77,098 ns/op
Iteration   4:    70,491 ns/op     72,010 ns/op
Iteration   5:    82,862 ns/op     73,495 ns/op
Iteration   6:    66,534 ns/op     75,625 ns/op
Iteration   7:    61,277 ns/op     74,382 ns/op
  • Native is quickly outperformed by Hotspot
  • BUT: That depends on your hardware
  • Hotspot can vary its code and intrinsics selection based on CPU instructions available

Wait... What are Intrinsics?

Native Code vs. Intrinsics

  • Native: A JNI based method call that calls any kind of provided native platform code
  • Intrinsics: Native platform code, but not called as a method; instead placed directly into the Hotspot-compiled code
# -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
@ 54   java.lang.Math::min (11 bytes)   (intrinsic)
@ 57   java.lang.System::arraycopy (0 bytes)   (intrinsic)
  • public static native void arraycopy(...) can be a native call or an intrinsic
  • Java makes that decision when the VM is started based on CPU features
  • Even for methods fully visible as Java source in the JDK, Java can and will use intrinsics instead, see java.lang.Math
  • Instead of compiling to native step-by-step, it will insert pre-determined native code based on the platform it runs on
  • Important: Java will change the code generated by the compiler based on the platform and CPU it runs on!
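
Math.min from the compiler log earlier is a good example: its JDK source is ordinary Java, yet Hotspot replaces the call with a single machine instruction where possible. Integer.bitCount behaves the same way; a hand-written copy (sketch below) produces identical results but is not intrinsified, so it cannot be swapped for the POPCNT instruction:

```java
final class IntrinsicSketch
{
    // functionally identical to Integer.bitCount, but Hotspot only
    // intrinsifies the JDK method, not our hand-written copy
    static int manualBitCount(int i)
    {
        int count = 0;
        while (i != 0)
        {
            i &= i - 1; // clear the lowest set bit
            count++;
        }
        return count;
    }

    public static void main(String[] args)
    {
        // same result, potentially very different machine code
        System.out.println(Integer.bitCount(0b1011));  // 3
        System.out.println(manualBitCount(0b1011));    // 3
    }
}
```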

Intrinsics or Native II

# Lenovo T530, JDK 8
Benchmark                   Mode  Cnt      Score  Units
IntrinsicsArrayCopy.manual  avgt    2 48,177.054  ns/op
IntrinsicsArrayCopy.system  avgt    2 48,678.219  ns/op

# Lenovo T450s, JDK 8, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  66,560.515  ns/op
IntrinsicsArrayCopy.system  avgt    2  67,051.983  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  60,674.969  ns/op
# Lenovo T450s, JDK 11, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  67,476.175  ns/op
IntrinsicsArrayCopy.system  avgt    2  58,952.894  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  58,771.408  ns/op
# Lenovo T450s, GraalVM EE 1.0.0.14, BAT
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  94,315.864  ns/op
IntrinsicsArrayCopy.system  avgt    2  70,201.933  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  71,605.877  ns/op
# Lenovo T450s, JDK 11, AC
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  56,108.312  ns/op
IntrinsicsArrayCopy.system  avgt    2  55,748.067  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  53,769.876  ns/op
# AWS c5.xlarge, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  83,066.975  ns/op
IntrinsicsArrayCopy.system  avgt    2  63,422.029  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  64,748.500  ns/op
# GCP n1-standard-4, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2 146,932.588  ns/op
IntrinsicsArrayCopy.system  avgt    2 101,970.577  ns/op
IntrinsicsArrayCopy.arrays  avgt    2 116,233.650  ns/op
# GCP n1-highcpu-8, JDK 11
Benchmark                   Mode  Cnt       Score  Units
IntrinsicsArrayCopy.manual  avgt    2  82,865.258  ns/op
IntrinsicsArrayCopy.system  avgt    2  62,810.793  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  61,237.489  ns/op
# Lenovo T450s, JDK 11, BAT, Min Power
Benchmark                   Mode  Cnt        Score  Units
IntrinsicsArrayCopy.manual  avgt    2  215,360.090  ns/op
IntrinsicsArrayCopy.system  avgt    2  210,971.627  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  214,870.207  ns/op

Stupid Compiler


@Setup
public void setup()
{
    foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
}

public String plain()
{    
    return "a" + foo;
}

public String concat()
{    
    return "a".concat(foo);
}

public String builder()
{    
    return new StringBuilder().append("a").append(foo).toString();
}

public String builderSized()
{    
    return new StringBuilder(32).append("a").append(foo).toString();
}

public String builderNonFluid()
{    
     final StringBuilder sb = new StringBuilder();
     sb.append("a");
     sb.append(foo);
     return sb.toString();
}

public String builderNonFluidSized()
{    
    final StringBuilder sb = new StringBuilder(32);
    sb.append("a");
    sb.append(foo);
    return sb.toString();
}

Let's make some Strings


Benchmark             Mode  Cnt   Score   Error  Units
builderSized          avgt    2  22.170          ns/op
plain                 avgt    2  22.213          ns/op
builder               avgt    2  22.409          ns/op
builderNonFluidSized  avgt    2  32.230          ns/op
concat                avgt    2  33.061          ns/op
builderNonFluid       avgt    2  49.581          ns/op
  • A sized builder is not faster?
  • A builder that is not fluid is slower?
  • P.S. We need that foo setup to avoid compile-time optimization

Stupid Compiler II

Don't Trust Your First Benchmark


private String[] foo = new String[5];
private FastRandom r = new FastRandom(1L);

@Setup
public void setup()
{
    foo[0] = String.valueOf(r.nextInt()) + "ashdfhgas dfhsa df";
    foo[1] = String.valueOf(r.nextInt()) + "hsa df";
    foo[2] = String.valueOf(r.nextInt()) + "fhsa df";
    foo[3] = String.valueOf(r.nextInt()) + "as dfhsa df";
    foo[4] = String.valueOf(r.nextInt()) 
                    + "ashdfhgas dfhsa hsajdhf kjahskjdh fkjhsa dhfkjskhj hdf"
                    + "ashdfhgas dfhsa hsajdhf kjahs as87837 837 987kjskhj hdf";
}

public String foo()
{
    return foo[r.nextInt(5)];
}

@Benchmark
public String plain()
{    
    return "a" + foo();
}

@Benchmark
public String concat()
{    
    return "a".concat(foo());
}

// more here

# AC
Benchmark             Mode  Cnt   Score   Error  Units
concat                avgt    2  51.746          ns/op
plain                 avgt    2  66.310          ns/op
builder               avgt    2  66.313          ns/op
builderNonFluid       avgt    2  67.179          ns/op
builderSized          avgt    2  67.762          ns/op
builderNonFluidSized  avgt    2  67.860          ns/op

# BAT
Benchmark             Mode  Cnt   Score   Error  Units
concat                avgt    2  62.868          ns/op
plain                 avgt    2  76.616          ns/op
builder               avgt    2  76.888          ns/op
builderNonFluid       avgt    2  78.363          ns/op
builderNonFluidSized  avgt    2  81.314          ns/op
builderSized          avgt    2  82.788          ns/op

# -prof gc
Benchmark                   Mode  Cnt     Score   Units
concat                      avgt    2    65.788   ns/op
·gc.alloc.rate.norm         avgt    2   120.014    B/op

builderNonFluid             avgt    2    77.814   ns/op
·gc.alloc.rate.norm         avgt    2   272.017    B/op

builder                     avgt    2    78.407   ns/op
·gc.alloc.rate.norm         avgt    2   272.012    B/op

plain                       avgt    2    79.952   ns/op
·gc.alloc.rate.norm         avgt    2   272.007    B/op

builderSized                avgt    2    81.799   ns/op
·gc.alloc.rate.norm         avgt    2   391.999    B/op

builderNonFluidSized        avgt    2    81.151   ns/op
·gc.alloc.rate.norm         avgt    2   392.001    B/op
  • Concat is memory efficient... interesting
  • Never assume that your artificial example tells it all

Let's talk hardware

Hardware does not matter... you wish!

final int SIZE = 1_000_000;
final int[] src = new int[SIZE];
    
@Benchmark
public int step1()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        sum += src[i];
    }
    
    return sum;
}

@Benchmark
public int step20()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i = i + 20)
    {
        sum += src[i];
    }
    
    return sum;
}
  • Theory: step20 must be faster, about 20x

Benchmark                 Mode  Cnt    Score  Error  Units
ArraysAndHardware.step1   avgt    2  325,159         ns/op
ArraysAndHardware.step20  avgt    2   97,402         ns/op
  • We expected 16,300 ns/op
  • We got only 3.3x more speed
  • What? Why?
  • P.S. step20 results will vary a lot when running repeatedly

Hardware Greetings

Java, too, has to live by hardware laws

Linux Perf for step1
==========================================================================================
       4485.209219      task-clock (msec)         #    0,290 CPUs utilized          
    11,127,672,435      cycles                    #    2,481 GHz                      (38,32%)
    11,620,571,786      instructions              #    1,04  insn per cycle           (46,29%)
       665,870,415      branches                  #  148,459 M/sec                    (46,36%)
         1,066,446      branch-misses             #    0,16% of all branches          (46,40%)
     9,482,675,316      L1-dcache-loads           # 2114,210 M/sec                    (46,50%)
       593,181,935      L1-dcache-load-misses     #    6,26% of all L1-dcache hits    (46,59%)
        21,448,491      LLC-loads                 #    4,782 M/sec                    (30,64%)
         5,856,980      LLC-load-misses           #   27,31% of all LL-cache hits     (30,67%)
Linux Perf for step20
==========================================================================================
       4455.946190      task-clock (msec)         #    0,286 CPUs utilized          
    11,046,881,256      cycles                    #    2,479 GHz                      (38,45%)
     3,836,171,205      instructions              #    0,35  insn per cycle           (46,27%)
       927,360,764      branches                  #  208,118 M/sec                    (46,40%)
           873,526      branch-misses             #    0,09% of all branches          (46,49%)
       979,546,079      L1-dcache-loads           #  219,829 M/sec                    (46,57%)
       860,310,077      L1-dcache-load-misses     #   87,83% of all L1-dcache hits    (46,66%)
       746,566,165      LLC-loads                 #  167,544 M/sec                    (30,76%)
       330,798,682      LLC-load-misses           #   44,31% of all LL-cache hits     (30,64%)
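The counters tell the story: step20 retires only a third of the instructions (3.8 vs. 11.6 billion) but misses L1 on about 88% of its loads. A back-of-the-envelope sketch (assuming 64-byte cache lines, i.e. 16 ints per line, which is typical but hardware-dependent) shows why the stride barely reduces memory traffic:

```java
public class CacheLineMath
{
    // Assumption: 64-byte cache lines -> 16 four-byte ints per line
    static final int SIZE = 1_000_000;
    static final int INTS_PER_LINE = 64 / Integer.BYTES;

    // How many distinct cache lines a strided pass over the array touches
    static long linesTouched(int stride)
    {
        if (stride >= INTS_PER_LINE)
        {
            // every access lands on a fresh cache line
            return (SIZE + stride - 1) / stride;
        }
        // consecutive accesses share lines
        return (SIZE + INTS_PER_LINE - 1) / INTS_PER_LINE;
    }
}
```

linesTouched(1) is 62,500 and linesTouched(20) is 50,000: step20 does 1/20th of the additions but still pulls in 80% of the cache lines, so 3.3x instead of 20x is no surprise.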

Detour: The modern CPU

Everything happens in parallel

  • Current Intel CPUs can execute 3 to 4 instructions per cycle
  • A cycle is one tick of the CPU clock; a 2.5 GHz CPU ticks 2.5 billion times per second
  • Code executes in parallel even though it was not written that way
  • Parallel execution units per core
  • CPU reorders when permitted to fill the execution units
  • Tries to do things ahead of the result (prediction, speculation)
  • When units aren't filled, performance is wasted
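This can be illustrated from plain Java. The sketch below (an illustration of the idea, not a JMH benchmark) splits a sum into two independent accumulators; since the two additions have no data dependency on each other, the parallel execution units can process both chains at once:

```java
public class Accumulators
{
    // One dependency chain: every add has to wait for the previous one
    static long sumSingle(int[] data)
    {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
        {
            sum += data[i];
        }
        return sum;
    }

    // Two independent chains: the CPU can run both adds in the same cycle
    static long sumDual(int[] data)
    {
        long a = 0, b = 0;
        for (int i = 0; i < data.length - 1; i += 2)
        {
            a += data[i];
            b += data[i + 1];
        }
        // pick up a possible odd trailing element
        if ((data.length & 1) == 1)
        {
            a += data[data.length - 1];
        }
        return a + b;
    }
}
```

Both return the same result; any speedup comes purely from the hardware. (Modern JITs may unroll and vectorize the simple loop anyway, so measure before relying on this.)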

The Matrix


final int SIZE = 1000;
final int[][] src = new int[SIZE][SIZE];

@Benchmark
public int horizontal() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

@Benchmark
public int vertical() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[j][i];
        }
    }
    return sum;
}

@Benchmark
public int horizontalBackwards() {
    int sum = 0;
    for (int i = SIZE - 1; i >= 0; i--)
    {
        for (int j = SIZE - 1; j >= 0; j--)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

What is real and what is not

  • All calculations should have the same speed, shouldn't they?

Benchmark                                Mode           Score   Error  Units
horizontal                               avgt     470,009.870          ns/op
horizontalBackwards                      avgt     672,821.809          ns/op
vertical                                 avgt   3,209,562.255          ns/op
  • Horizontal access is about 7x faster than vertical, and forward is about 30% faster than backwards!

Benchmark                                         Mode       Score   Error  Units
Matrix.horizontal                                 avgt     470,009          ns/op
Matrix.horizontal:L1-dcache-load-misses           avgt      63,645           #/op
Matrix.horizontal:L1-dcache-loads                 avgt   1,014,532           #/op
Matrix.horizontal:LLC-load-misses                 avgt         429           #/op
Matrix.horizontal:LLC-loads                       avgt       1,523           #/op

Matrix.horizontalBackwards                        avgt     672,821          ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses  avgt      63,053           #/op
Matrix.horizontalBackwards:L1-dcache-loads        avgt   1,007,381           #/op
Matrix.horizontalBackwards:LLC-load-misses        avgt         854           #/op
Matrix.horizontalBackwards:LLC-loads              avgt       2,764           #/op

Matrix.vertical                                   avgt   3,209,562          ns/op
Matrix.vertical:L1-dcache-load-misses             avgt   2,060,998           #/op
Matrix.vertical:L1-dcache-loads                   avgt   3,020,796           #/op
Matrix.vertical:LLC-load-misses                   avgt      15,223           #/op
Matrix.vertical:LLC-loads                         avgt     694,397           #/op

Matrix - Yes, you probably did it wrong

Influence of the Size on the Result


Benchmark             (SIZE) Mode            Score Factor  Units
horizontal                1  avgt              5.3    1.0  ns/op
horizontal               10  avgt             45.2    1.0  ns/op
horizontal               20  avgt            141.3    1.0  ns/op
horizontal              100  avgt          3,388.0    1.0  ns/op
horizontal             1000  avgt        373,619.3    1.0  ns/op
horizontal            10000  avgt     41,701,496.2    1.0  ns/op

horizontalBackwards       1  avgt              5.3    1.0  ns/op
horizontalBackwards      10  avgt             43.2    1.0  ns/op
horizontalBackwards      20  avgt            140.1    1.0  ns/op
horizontalBackwards     100  avgt          3,261.3    1.0  ns/op
horizontalBackwards    1000  avgt        511,246.1    1.4  ns/op
horizontalBackwards   10000  avgt     46,908,230.0    1.1  ns/op

vertical                  1  avgt              5.3    1.0  ns/op
vertical                 10  avgt             77.0    1.7  ns/op
vertical                 20  avgt            275.8    2.0  ns/op
vertical                100  avgt          6,226.3    1.8  ns/op
vertical               1000  avgt      2,719,478.0    7.2  ns/op
vertical              10000  avgt  1,683,041,892.2   40.4  ns/op
  • Understand your future data set
  • Don't limit to your own machine
  • Verify theories!
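If the result you need is column-oriented, the access pattern still doesn't have to be. A common loop-interchange fix (a sketch, not part of the talk's benchmark code) traverses the matrix row-major and accumulates one sum per column:

```java
public class ColumnSums
{
    // Naive: walks the matrix column by column -> cache-hostile
    static long[] columnSumsNaive(int[][] m)
    {
        final long[] sums = new long[m[0].length];
        for (int col = 0; col < m[0].length; col++)
        {
            for (int row = 0; row < m.length; row++)
            {
                sums[col] += m[row][col];
            }
        }
        return sums;
    }

    // Cache-friendly: walks row by row, still yields per-column sums
    static long[] columnSumsRowMajor(int[][] m)
    {
        final long[] sums = new long[m[0].length];
        for (int[] row : m)
        {
            for (int col = 0; col < row.length; col++)
            {
                sums[col] += row[col];
            }
        }
        return sums;
    }
}
```

Same results, but the second version reads memory the way the horizontal benchmark does.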

Detour: Cost

What it costs to leave the CPU

                     Real       Humanized
CPU Cycle            0.4 ns     1 s
L1 Cache             0.9 ns     2 s
L2 Cache             2.8 ns     7 s
L3 Cache             28 ns      1 min
Memory Access        100 ns     4 min
NVM SSD              25 μs      17 h
SSD                  50–150 μs  1.5–4 days
HDD                  1–10 ms    1–9 months
TCP SF to NY         65 ms      5 years
TCP SF to Hong Kong  141 ms     11 years
  • It costs 100 to 200 cycles when we have to go to main memory
  • Hence Java and the CPU avoid main memory access at all cost
  • Java does not have the notion of a memory location, hence this is transparent
  • Things will change with Project Valhalla, whose value types bring primitive data type performance to "objects"
  • Microservices: always keep this overhead in mind when a local call becomes a network call
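The same table explains why data layout beats data structure elegance: a LinkedList node is a separate heap object, so every step of the iteration may be a trip toward main memory, while an int[] streams linearly through the caches. A minimal sketch of the two access patterns (both compute the same sum):

```java
import java.util.List;

public class Locality
{
    // Pointer chasing: each element lives in its own node object,
    // and each hop can miss the caches (plus unboxing of Integer)
    static long sumList(List<Integer> list)
    {
        long sum = 0;
        for (int v : list)
        {
            sum += v;
        }
        return sum;
    }

    // Contiguous memory: the hardware prefetcher streams it in
    static long sumArray(int[] data)
    {
        long sum = 0;
        for (int v : data)
        {
            sum += v;
        }
        return sum;
    }
}
```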

Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Picard: How long a time?
Data: 0.68 seconds, sir. For an android, that is nearly an eternity.

I know what you will do next

Modern Fortune Tellers


private static final int COUNT = 10_000;

// Contain random numbers from -50 to 50
private int[] sorted;
private int[] unsorted;
private int[] reversed;

public void doIt(
            int[] array, 
            Blackhole bh1, 
            Blackhole bh2)
{
    for (int v : array)
    {
        if (v > 0)
        {
            bh1.consume(v);
        }
        else
        {
            bh2.consume(v);
        }
    }
}
  • Will there be any difference when passing sorted or unsorted arrays?

Benchmark  Mode  Cnt      Score   Error  Units
reversed   avgt    2  39570.696          ns/op
sorted     avgt    2  39859.027          ns/op
unsorted   avgt    2  66043.605          ns/op
  • Sorted is 65% faster. Why?
  • Let's ask the CPU

Branch Prediction

Modern CPU marvels: I know what you might need next


Benchmark                                        Mode        Score   Error  Units
=================================================================================
BranchPrediction.reversed                        avgt    35,182.532         ns/op
BranchPrediction.reversed:IPC                    avgt         3.257          #/op
BranchPrediction.reversed:branch-misses          avgt        13.767          #/op
BranchPrediction.reversed:branches               avgt    55,339.912          #/op
BranchPrediction.reversed:cycles                 avgt   110,993.223          #/op
BranchPrediction.reversed:instructions           avgt   361,508.018          #/op

BranchPrediction.sorted                          avgt    35,183.238         ns/op
BranchPrediction.sorted:IPC                      avgt         3.257          #/op
BranchPrediction.sorted:branch-misses            avgt        14.738          #/op
BranchPrediction.sorted:branches                 avgt    55,884.946          #/op
BranchPrediction.sorted:cycles                   avgt   112,912.763          #/op
BranchPrediction.sorted:instructions             avgt   367,820.131          #/op

BranchPrediction.unsorted                        avgt    65,770.667         ns/op
BranchPrediction.unsorted:IPC                    avgt         1.577          #/op
BranchPrediction.unsorted:branch-misses          avgt      3795.737          #/op
BranchPrediction.unsorted:branches               avgt    56,008.188          #/op
BranchPrediction.unsorted:cycles                 avgt   211,089.268          #/op
BranchPrediction.unsorted:instructions           avgt   333,105.231          #/op

Branches: Know what you test

How small changes can mess up things big time


// fill: values strictly alternate between -1 and 1 (perfectly predictable)
final int d = i % 2 == 0 ? -1 : 1; 


public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   5,147.764   ns/op
unsorted   avgt   3,317.656   ns/op
reversed   avgt   4,738.261   ns/op

// fill: random magnitude 0 to 4, alternating sign
final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5);

public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   3,588.073   ns/op
unsorted   avgt  14,282.879   ns/op
reversed   avgt   5,046.499   ns/op

final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5); 

public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        // careful: -(-v) == +v, so BOTH arms now compute sum += v;
        // the branch is dead and the JIT can remove it entirely
        if (v >= 0)
        {
            sum -= -v;
        }
        else
        {
            sum += +v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   3,226.433   ns/op
unsorted   avgt   3,261.952   ns/op
reversed   avgt   3,234.786   ns/op

final int d = i % 2 == 0 
    ? -r.nextInt(5) : r.nextInt(5);
        
public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v >= 0)
        {
            sum -= v;
        }
        else
        {
            sum += v;
        }
    }
    
    return sum;
}

Benchmark  Mode       Score   Units
===================================
sorted     avgt   5,044.253   ns/op
unsorted   avgt  11,669.557   ns/op
reversed   avgt   5,390.210   ns/op
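Note what happened in the third variant: `-(-v)` and `+(+v)` both mean `sum += v`, so the two arms are identical, the branch is dead, and the benchmark measured no branch at all. Where the branch is real (the first and last variants), the loop computes `sum = -(sum of |v|)`, and that can be written branch-free by hand (a sketch, not from the talk; measure whether it actually beats the predictor on your data):

```java
public class Branchless
{
    // Equivalent to the branchy loops: sum becomes -(sum of |v|)
    static int doIt(int[] array)
    {
        int sum = 0;
        for (int v : array)
        {
            // branch-free absolute value: mask is 0 for v >= 0, -1 for v < 0
            final int mask = v >> 31;
            sum -= (v ^ mask) - mask;
        }
        return sum;
    }
}
```

No branch means nothing to mispredict; sorted, unsorted, and reversed input should now perform alike.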

Short Summary

Measurement

  • Naive benchmarks are wrong
  • No JMH, no fun
  • Even JMH measurements can be wrong
  • How and where you measure makes a difference

JDK

  • Forget your Java optimization knowledge
  • Java now is not what most people think Java is

JVM

  • Tiered runtime compilation
  • JIT removes code when not needed
  • JIT tries to keep data local
  • Large methods kill the C2 optimization
  • Loops are optimized
  • Intrinsics are used dynamically

Hardware

  • Memory access is expensive
  • Caches are your friend
  • Branches are expensive
  • The CPU speculates

Synchronization

Being alone and still having to sync

private final int SIZE = 1024;
private final Map<String, String> syncMap = 
            Collections.synchronizedMap(new HashMap<>(SIZE));
private final Map<String, String> unsyncMap = 
            new HashMap<>(SIZE);
private final Map<String, String> conMap = 
            new ConcurrentHashMap<>(SIZE);

@Benchmark
public String syncAccess() 
{
    return syncMap.get("1");
}
@Benchmark
public String unsyncAccess() 
{
    return unsyncMap.get("1");
}
@Benchmark
public String conAccess() 
{
    return conMap.get("1");
}
  • Which one is slowest and by how much?
  • Remember, just one thread, so no contention
Benchmark                       Mode  Cnt   Score   Units
SynchronizedAlone.conAccess     avgt    2   9.152   ns/op
SynchronizedAlone.syncAccess    avgt    2  23.462   ns/op
SynchronizedAlone.unsyncAccess  avgt    2   9.366   ns/op
  • Synced is 2.6x slower
  • ConcurrentHashMap does well
  • Moving the maps to local scope totally changes the result - lock elision
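The last bullet refers to lock elision via escape analysis: when the JIT can prove a lock object never leaves the method, no other thread can ever contend on it, and the synchronization can be dropped. A minimal sketch of the pattern (the elision happens inside the JIT; the source code does not change):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class LockElision
{
    // The synchronized wrapper never escapes this method, so after
    // escape analysis the JIT may elide its monitor operations and
    // the lookup becomes effectively as cheap as on an unsynced map
    static String localLookup(String key)
    {
        final Map<String, String> map =
            Collections.synchronizedMap(new HashMap<>());
        map.put("1", "one");
        return map.get(key);
    }
}
```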

With Contention

Let's try more threads

2 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2    9.721          ns/op
SynchronizedThreads4.syncAccess    avgt    2  183.086          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2    9.312          ns/op
4 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   19.661          ns/op
SynchronizedThreads4.syncAccess    avgt    2  216.041          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   19.631          ns/op
8 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   44.055          ns/op
SynchronizedThreads4.syncAccess    avgt    2  377.180          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   38.562          ns/op

Looping

Loops compared


Benchmark                           (length)  Mode  Cnt   Score   Error  Units
ForEachSimple.arrayFor                     1  avgt    2     3.8          ns/op
ForEachSimple.arrayFor                    10  avgt    2     9.8          ns/op
ForEachSimple.arrayFor                   100  avgt    2    64.4          ns/op
ForEachSimple.arrayFor                  1000  avgt    2  1446.1          ns/op

ForEachSimple.arrayEnhanced                1  avgt    2     3.5          ns/op
ForEachSimple.arrayEnhanced               10  avgt    2     9.5          ns/op
ForEachSimple.arrayEnhanced              100  avgt    2    63.2          ns/op
ForEachSimple.arrayEnhanced             1000  avgt    2  1199.3          ns/op

ForEachSimple.classicFor                   1  avgt    2     5.7          ns/op
ForEachSimple.classicFor                  10  avgt    2    15.5          ns/op
ForEachSimple.classicFor                 100  avgt    2   103.0          ns/op
ForEachSimple.classicFor                1000  avgt    2  1388.8          ns/op

ForEachSimple.enhancedFor                  1  avgt    2     6.1          ns/op
ForEachSimple.enhancedFor                 10  avgt    2    17.0          ns/op
ForEachSimple.enhancedFor                100  avgt    2   118.4          ns/op
ForEachSimple.enhancedFor               1000  avgt    2  1578.6          ns/op

ForEachSimple.lambdaStream                 1  avgt    2    53.7          ns/op
ForEachSimple.lambdaStream                10  avgt    2    65.5          ns/op
ForEachSimple.lambdaStream               100  avgt    2   153.0          ns/op
ForEachSimple.lambdaStream              1000  avgt    2  1326.0          ns/op

ForEachSimple.parallelLambdaStream         1  avgt    2    94.2          ns/op
ForEachSimple.parallelLambdaStream        10  avgt    2  6283.2          ns/op
ForEachSimple.parallelLambdaStream       100  avgt    2  8443.4          ns/op
ForEachSimple.parallelLambdaStream      1000  avgt    2  9367.6          ns/op

// just some strings
final List<String> list = new ArrayList<>();
String[] array = null;

@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup() {.. // uses length ..}

@Benchmark
public void arrayFor(Blackhole bh) 
{
    int result = 0;
    for (int i = 0; i < array.length; i++) 
    {
        result += array[i].length();
    }
    
    bh.consume(result);
}

// classicFor
for (int i = 0; i < list.size(); i++) 
{
    result += list.get(i).length();
}


// enhancedFor
for (String s : list) 
{
    result += s.length();
}

// lambdaStream
int result = list.stream()
    .mapToInt(s -> s.length())
    .sum();

//parallelLambdaStream
int result = list.parallelStream()
    .mapToInt(s -> s.length())
    .sum();

Looping

Not so empty loops compared


Benchmark             (length)  Mode  Cnt     Score   Error  Units
arrayFor                     1  avgt    2     238.3          ns/op
arrayFor                    10  avgt    2    2105.6          ns/op
arrayFor                   100  avgt    2   21957.8          ns/op
arrayFor                  1000  avgt    2  253262.0          ns/op

arrayEnhanced                1  avgt    2     234.1          ns/op
arrayEnhanced               10  avgt    2    2173.1          ns/op
arrayEnhanced              100  avgt    2   22080.1          ns/op
arrayEnhanced             1000  avgt    2  250038.0          ns/op

classicFor                   1  avgt    2     238.1          ns/op
classicFor                  10  avgt    2    2177.0          ns/op
classicFor                 100  avgt    2   22318.0          ns/op
classicFor                1000  avgt    2  249914.1          ns/op

enhancedFor                  1  avgt    2     254.8          ns/op
enhancedFor                 10  avgt    2    2185.9          ns/op
enhancedFor                100  avgt    2   23011.1          ns/op
enhancedFor               1000  avgt    2  262605.5          ns/op

lambdaStream                 1  avgt    2     395.7          ns/op
lambdaStream                10  avgt    2    2829.0          ns/op
lambdaStream               100  avgt    2   27771.4          ns/op
lambdaStream              1000  avgt    2  304831.6          ns/op

# 2+2 CPU
parallelLambdaStream         1  avgt    2     453.5          ns/op
parallelLambdaStream        10  avgt    2   11686.4          ns/op
parallelLambdaStream       100  avgt    2   26798.6          ns/op
parallelLambdaStream      1000  avgt    2  189304.3          ns/op
final List<String> list = new ArrayList<>();
    
@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup()
{
    for (int i = 0; i < length; i++)
    {
        list.add(
            MessageFormat.format(
            "{0},{0},{0},{0},{0},{0}", String.valueOf(i)));
    }
}

@Benchmark
public void classicFor(Blackhole bh)
{
    int result = 0;
    for (int i = 0; i < list.size(); i++)
    {
        final String s = list.get(i);
        if (s.startsWith("5"))
        {
            continue;
        }
        
        final String[] a = s.split(",");
        for (int j = 0; j < a.length; j++)
        {
            result += Integer.valueOf(a[j]);
        }
    }
    
    bh.consume(result);
}
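Why are all loop styles suddenly within a few percent of each other? Because the body dwarfs the loop construct: `split` allocates a String[] plus six substrings, and every part is parsed. A sketch of that per-element cost (hypothetical helper, not from the benchmark code; `parseInt` is used here instead of the benchmark's `Integer.valueOf` to skip the boxing):

```java
public class BodyCost
{
    // Roughly what each non-skipped element costs in the benchmark:
    // one split (allocations!) plus one parse per comma-separated part.
    // Next to this, the choice of loop construct is noise.
    static int processLine(String s)
    {
        int result = 0;
        for (String part : s.split(","))
        {
            result += Integer.parseInt(part);
        }
        return result;
    }
}
```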

Looping - Bonus Content

T450s vs. GCP N1-highcpu-16core


# Lenovo T450s
Benchmark             (length)  Mode  Cnt        Score  Error  Units
arrayFor                     1  avgt    2        238.3         ns/op
arrayFor                    10  avgt    2      2,105.6         ns/op
arrayFor                   100  avgt    2     21,957.8         ns/op
arrayFor                  1000  avgt    2    253,262.0         ns/op
arrayFor                 10000  avgt    2  2,810,799.6         ns/op
arrayFor                100000  avgt    2 31,116,545.4         ns/op

classicFor                   1  avgt    2        238.1         ns/op
classicFor                  10  avgt    2      2,177.0         ns/op
classicFor                 100  avgt    2     22,318.0         ns/op
classicFor                1000  avgt    2    257,362.6         ns/op
classicFor               10000  avgt    2  2,809,832.8         ns/op
classicFor              100000  avgt    2 32,017,487.3         ns/op

enhancedFor                  1  avgt    2        254.8         ns/op
enhancedFor                 10  avgt    2      2,185.9         ns/op
enhancedFor                100  avgt    2     23,011.1         ns/op
enhancedFor               1000  avgt    2    254,691.8         ns/op
enhancedFor              10000  avgt    2  3,180,447.3         ns/op
enhancedFor             100000  avgt    2 31,479,657.4         ns/op

lambdaStream                 1  avgt    2        395.7         ns/op
lambdaStream                10  avgt    2      2,829.0         ns/op
lambdaStream               100  avgt    2     27,771.4         ns/op
lambdaStream              1000  avgt    2    299,971.9         ns/op
lambdaStream             10000  avgt    2  3,181,573.5         ns/op
lambdaStream            100000  avgt    2 36,055,386.8         ns/op

parallelLambdaStream         1  avgt    2        453.5         ns/op
parallelLambdaStream        10  avgt    2     11,686.4         ns/op
parallelLambdaStream       100  avgt    2     26,798.6         ns/op
parallelLambdaStream      1000  avgt    2    193,232.6         ns/op
parallelLambdaStream     10000  avgt    2  1,956,289.8         ns/op
parallelLambdaStream    100000  avgt    2 20,628,366.2         ns/op

# Google n1-highcpu-16core
Benchmark             (length)  Mode  Cnt       Score   Error  Units
arrayFor                     1  avgt   2        263.4          ns/op
arrayFor                    10  avgt   2      2,390.4          ns/op
arrayFor                   100  avgt   2     24,871.1          ns/op
arrayFor                  1000  avgt   2    266,237.0          ns/op
arrayFor                 10000  avgt   2  3,020,851.2          ns/op
arrayFor                100000  avgt   2 30,428,126.5          ns/op

classicFor                   1  avgt   2        264.0          ns/op
classicFor                  10  avgt   2      2,490.3          ns/op
classicFor                 100  avgt   2     25,186.0          ns/op
classicFor                1000  avgt   2    281,254.1          ns/op
classicFor               10000  avgt   2  2,882,459.1          ns/op
classicFor              100000  avgt   2 34,657,730.5          ns/op

enhancedFor                  1  avgt   2        263.5          ns/op
enhancedFor                 10  avgt   2      2,329.8          ns/op
enhancedFor                100  avgt   2     24,811.0          ns/op
enhancedFor               1000  avgt   2    291,544.4          ns/op
enhancedFor              10000  avgt   2  3,185,459.2          ns/op
enhancedFor             100000  avgt   2 33,898,708.6          ns/op

lambdaStream                 1  avgt   2        547.9          ns/op
lambdaStream                10  avgt   2      3,841.7          ns/op
lambdaStream               100  avgt   2     28,808.7          ns/op
lambdaStream              1000  avgt   2    329,148.1          ns/op
lambdaStream             10000  avgt   2  3,684,920.1          ns/op
lambdaStream            100000  avgt   2 36,935,499.3          ns/op

parallelLambdaStream         1  avgt   2        521.7          ns/op
parallelLambdaStream        10  avgt   2     43,527.8          ns/op
parallelLambdaStream       100  avgt   2     62,834.3          ns/op
parallelLambdaStream      1000  avgt   2    110,602.3          ns/op
parallelLambdaStream     10000  avgt   2    558,666.5          ns/op
parallelLambdaStream    100000  avgt   2  5,183,311.8          ns/op

The Cost of Calling

Method Calls, Inheritance, and Compiler Magic

Inlining Methods


@CompilerControl(CompilerControl.Mode.INLINE)
private static String uppercase(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static String uppercaseNoInline(String s)

@CompilerControl(CompilerControl.Mode.INLINE)
private static String lowercase(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static String lowercaseNoInline(String s)

@CompilerControl(CompilerControl.Mode.INLINE)
private static int sumOfChars(String s)

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
private static int sumOfCharsNoInline(String s)

@Benchmark
public void largeMethod(Blackhole bh)
{ 
    // all the code of the methods copied here
    
    bh.consume(sum);
}

@Benchmark
public void smallAndCallsInline(Blackhole bh)
{    
    final String s = strings[0];
    
    final String s2 = uppercase(s);
    final String s3 = lowercase(s2);
    
    int sum = sumOfChars(s3);
    
    bh.consume(sum);
}

@Benchmark
public void smallAndCallsNoInline(Blackhole bh)
{    
    final String s = strings[0];
    
    final String s2 = uppercaseNoInline(s);
    final String s3 = lowercaseNoInline(s2);
    
    int sum = sumOfCharsNoInline(s3);
    
    bh.consume(sum);
}

Calling a method is expensive


# BAT JDK 11
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4860.736 ± 229.153  ns/op
smallAndCallsNoInline  avgt    5  5672.435 ± 219.164  ns/op
largeMethod            avgt    5  5294.679 ± 218.967  ns/op

# AC JDK 11
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4501.134 ± 217.389  ns/op
smallAndCallsNoInline  avgt    5  4460.731 ± 132.889  ns/op
largeMethod            avgt    5  4244.479 ± 209.325  ns/op

# BAT JDK 11 again
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  5661.628 ± 268.654  ns/op
smallAndCallsNoInline  avgt    5  5634.992 ± 323.852  ns/op
largeMethod            avgt    5  5300.749 ± 267.469  ns/op

# AC JDK 8
Benchmark              Mode  Cnt     Score     Error  Units
smallAndCallsInline    avgt    5  4839.311 ± 158.454  ns/op
smallAndCallsNoInline  avgt    5  4923.197 ±  93.549  ns/op
largeMethod            avgt    5  4919.511 ± 223.358  ns/op
  • Benchmarking is hard!
  • No idea what happened here... sorry!

The Cost of Calling

How Expensive are Virtual Method Calls?

  • Check if there is an overhead to find the right method
  • This is about virtual method calls, i.e. the lookup of the call target, not so much about the call itself (inlining, see before)

public class MegaMorphicSimple
{
    static interface C { public int apply(int x); }
    
    static class C1 implements C 
    { 
        @Override
        public int apply(int x) { return x * x; }
    }
    
    @Benchmark
    public int _1_mono()
    {
        C c = null;
        int r = random.nextInt(4);
        
        if (r == 0) { c = new C1(); } 
        else if (r == 1) { c = new C1(); } 
        else if (r == 2) { c = new C1(); }
        else if (r == 3) { c = new C1(); }
         
        return c.apply(r);
    }

    @Benchmark
    public int _2_bi()
    {
        C c = null;
        int r = random.nextInt(4);
        
        if (r == 0) { c = new C1(); } 
        else if (r == 1) { c = new C2(); } 
        else if (r == 2) { c = new C1(); }
        else if (r == 3) { c = new C2(); }
         
        return c.apply(r);
    }

    @Benchmark
    public int _5_manualDispatch()
    {
        ... 
        if (c instanceof C1)
        {
            return ((C1) c).apply(r);
        }
        else if (c instanceof C2)
        {
            return ((C2) c).apply(r);            
        }
        else if (c instanceof C3)
        {
            return ((C3) c).apply(r);            
        }
        else
        {
            return ((C4) c).apply(r);            
        }
    }
}

The Cost of Calling

How Expensive are Virtual Method Calls?


# identical method per implementing class    
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicSimple._1_mono             avgt    5  19.703 ± 2.177  ns/op
MegaMorphicSimple._2_bi               avgt    5  21.917 ± 3.759  ns/op
MegaMorphicSimple._4_mega             avgt    5  26.778 ± 4.411  ns/op
MegaMorphicSimple._5_manualDispatch   avgt    5  21.861 ± 2.860  ns/op

static interface C { default int apply(int x){ return x * x; }};

static class C1 implements C {};
static class C2 implements C {};
static class C3 implements C {};
static class C4 implements C {};

# default method at interface    
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicDefault._1_mono            avgt    5  18.867 ± 1.325  ns/op
MegaMorphicDefault._2_bi              avgt    5  21.817 ± 2.613  ns/op
MegaMorphicDefault._4_mega            avgt    5  26.355 ± 2.591  ns/op
MegaMorphicDefault._5_manualDispatch  avgt    5  21.763 ± 3.479  ns/op
  • That looks expensive
  • Interesting that we can help the runtime
  • Should we write the ugly manual dispatch from now on?
  • ...but these are nanoseconds, and the methods do almost nothing

# method got bigger and we made that non-default
Benchmark                                 Mode  Cnt    Score   Error  Units
MegaMorphicLargeMethod._1_mono            avgt    5  158.775 ± 4.971  ns/op
MegaMorphicLargeMethod._2_bi              avgt    5  157.100 ± 6.153  ns/op
MegaMorphicLargeMethod._4_mega            avgt    5  165.580 ± 8.669  ns/op
MegaMorphicLargeMethod._5_manualDispatch  avgt    5  162.729 ± 3.795  ns/op

Really calling virtually

So we don't have to think about virtual calls at all?

  • Let's have an array with different objects of a type C
  • C has four possible implementations
  • We either let Java make the call or we untangle it ourselves

public class MegaMorphicArray 
{
    private C[] array = new C[20_000];
    
    @Setup
    public void setup() 
    {
        final Random r = new Random(42L);
        
        for (int x = 0; x < 20_000; x++) 
        {
            int i = r.nextInt(JMH_PARAM);
            
            if (i == 0) array[x] = new C1();
            else if (i == 1) array[x] = new C2();
            else if (i == 2) array[x] = new C3();
            else if (i == 3) array[x] = new C4();
        }        
    }
}

static interface C { public int apply(int x); }
    
static class C1 implements C { 
    @Override public int apply(int x) { return x * x; }
}   ...
    
@Benchmark
public int morphic() 
{
    int sum = 0; 

    for (int i = 0; i < 20_000; i++) 
    {
        sum += array[i].apply(i);
    }

    return sum;
}    
    
@Benchmark
public int peeled()
{
    int sum = 0;
    for (int i = 0; i < 20_000; i++) {
        final C c = array[i];
        if (c instanceof C1) {
            sum += ((C1) c).apply(i);
        }
        else if (c instanceof C2) {
            sum += ((C2) c).apply(i);
        }
        else if (c instanceof C3) {
            sum += ((C3) c).apply(i);
        }
        else {
            sum += ((C4) c).apply(i);
        }
    }
    
    return sum;
}

Really calling virtually

The Results are in


# JDK 11
Benchmark (params)  Mode  Cnt     Score  Units
morphic         1  avgt    5   17,705.6  ns/op
morphic         2  avgt    5   98,269.4  ns/op
morphic         3  avgt    5  334,745.9  ns/op
morphic         4  avgt    5  364,485.9  ns/op

peeled          1  avgt    5   16,990.4  ns/op
peeled          2  avgt    5   94,084.2  ns/op
peeled          3  avgt    5  123,590.3  ns/op
peeled          4  avgt    5  147,067.2  ns/op

# JDK 11 -Xint
Benchmark (params) Mode  Cnt        Score  Units
morphic         1  avgt    5  1,682,550.7  ns/op
morphic         2  avgt    5  1,695,875.2  ns/op
morphic         3  avgt    5  1,679,643.2  ns/op
morphic         4  avgt    5  1,675,298.1  ns/op

peeled          1  avgt    5  1,764,953.0  ns/op
peeled          2  avgt    5  1,821,662.6  ns/op
peeled          3  avgt    5  1,885,678.8  ns/op
peeled          4  avgt    5  1,936,054.8  ns/op

Megamorphic Internals

But why do we see this performance?


Benchmark                      (params)  Mode       Score       Error  Units
morphic                               1  avgt    16,829.7 ±  1165.264  ns/op
morphic:IPC                           1  avgt         3.2               #/op
morphic:L1-dcache-load-misses         1  avgt     6,263.3               #/op
morphic:L1-dcache-loads               1  avgt    40,296.8               #/op
morphic:branch-misses                 1  avgt         9.1               #/op
morphic:branches                      1  avgt    24,984.2               #/op
morphic:dTLB-load-misses              1  avgt         0.2               #/op
morphic:dTLB-loads                    1  avgt    40,401.5               #/op

morphic                               2  avgt    97,456.9 ±  1177.830  ns/op
morphic:IPC                           2  avgt         0.7               #/op
morphic:L1-dcache-load-misses         2  avgt     6,472.7               #/op
morphic:L1-dcache-loads               2  avgt    43,597.9               #/op
morphic:branch-misses                 2  avgt     9,850.6               #/op
morphic:branches                      2  avgt    41,935.3               #/op
morphic:dTLB-load-misses              2  avgt         2.0               #/op
morphic:dTLB-loads                    2  avgt    45,509.8               #/op

morphic                               3  avgt   334,117.8 ±  3334.059  ns/op
morphic:IPC                           3  avgt         1.1               #/op
morphic:L1-dcache-load-misses         3  avgt     6,840.3               #/op
morphic:L1-dcache-loads               3  avgt   407,654.7               #/op
morphic:branch-misses                 3  avgt    13,247.0               #/op
morphic:branches                      3  avgt   146,105.8               #/op
morphic:dTLB-load-misses              3  avgt        13.5               #/op
morphic:dTLB-loads                    3  avgt   410,043.5               #/op

morphic                               4  avgt   390,601.4 ± 40515.922  ns/op
morphic:IPC                           4  avgt         0.9               #/op
morphic:L1-dcache-load-misses         4  avgt     6,740.5               #/op
morphic:L1-dcache-loads               4  avgt   408,814.3               #/op
morphic:branch-misses                 4  avgt    15,099.0               #/op
morphic:branches                      4  avgt   145,118.2               #/op
morphic:dTLB-load-misses              4  avgt        18.9               #/op
morphic:dTLB-loads                    4  avgt   410,767.5               #/op

Benchmark                      (params)  Mode        Score       Error  Units
peeled                                1  avgt     16,831.1 ±   816.491  ns/op
peeled:IPC                            1  avgt          3.2               #/op
peeled:L1-dcache-load-misses          1  avgt      6,218.4               #/op
peeled:L1-dcache-loads                1  avgt     39,797.1               #/op
peeled:branch-misses                  1  avgt          9.2               #/op
peeled:branches                       1  avgt     24,854.5               #/op
peeled:dTLB-load-misses               1  avgt          0.2               #/op
peeled:dTLB-loads                     1  avgt     40,133.1               #/op

peeled                                2  avgt     93,347.0 ±  3607.879  ns/op
peeled:IPC                            2  avgt          0.7               #/op
peeled:L1-dcache-load-misses          2  avgt      6,456.1               #/op
peeled:L1-dcache-loads                2  avgt     42,060.7               #/op
peeled:branch-misses                  2  avgt      9,252.3               #/op
peeled:branches                       2  avgt     41,153.9               #/op
peeled:dTLB-load-misses               2  avgt          6.0               #/op
peeled:dTLB-loads                     2  avgt     42,195.1               #/op

peeled                                3  avgt    123,075.7 ±  5181.245  ns/op
peeled:IPC                            3  avgt          0.6               #/op
peeled:L1-dcache-load-misses          3  avgt      6,472.7               #/op
peeled:L1-dcache-loads                3  avgt     42,059.2               #/op
peeled:branch-misses                  3  avgt     13,440.9               #/op
peeled:branches                       3  avgt     51,950.4               #/op
peeled:dTLB-load-misses               3  avgt          5.7               #/op
peeled:dTLB-loads                     3  avgt     43,097.4               #/op

peeled                                4  avgt    136,765.3 ±  2792.939  ns/op
peeled:IPC                            4  avgt          0.6               #/op
peeled:L1-dcache-load-misses          4  avgt      6,460.5               #/op
peeled:L1-dcache-loads                4  avgt     43,312.1               #/op
peeled:branch-misses                  4  avgt     14,818.3               #/op
peeled:branches                       4  avgt     62,214.6               #/op
peeled:dTLB-load-misses               4  avgt          8.0               #/op
peeled:dTLB-loads                     4  avgt     42,775.6               #/op

TLB - The Unknown Buffer

The CPU, understood

  • The previous data shows high TLB load numbers
  • VMT: Virtual Method Table
  • Java looks up the right method in that table and then calls that code
  • TLB: Translation Lookaside Buffer
  • The TLB stores recent translations of virtual memory to physical memory and can be seen as an address-translation cache [1]
  • The VMT call target has to be resolved via a TLB lookup
  • Java can inline code for one or two call targets per call site, but not for more
  • One method: inlined or called directly
  • Two methods: inlined or called directly, guarded by a type-checked if-else
  • Three or more methods: the VMT has to be used
  • This behavior can change in future Java versions
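The mono- and bimorphic cases above can be pictured as the JIT replacing the VMT lookup with cheap type checks and direct, inlinable calls. The following is a hand-written illustration of that idea (guarded devirtualization), not actual JIT output; the class names echo the benchmark above:

```java
// Sketch of guarded devirtualization: what the JIT conceptually emits
// for a call site that has only ever seen the receiver types C1 and C2.
public class GuardedDevirtualization
{
    interface C { int apply(int x); }
    static class C1 implements C { public int apply(int x) { return x * x; } }
    static class C2 implements C { public int apply(int x) { return x + x; } }

    // The virtual call as written in source code: resolved via the VMT.
    static int virtualCall(C c, int x)
    {
        return c.apply(x);
    }

    // The bimorphic form: cheap type checks replace the VMT lookup,
    // and both branches become direct calls the compiler can inline.
    static int guardedCall(C c, int x)
    {
        if (c instanceof C1) return ((C1) c).apply(x); // direct, inlinable
        if (c instanceof C2) return ((C2) c).apply(x); // direct, inlinable
        return c.apply(x); // fallback: real virtual dispatch
    }

    public static void main(String[] args)
    {
        C[] cs = { new C1(), new C2() };
        int virtualSum = 0, guardedSum = 0;
        for (int i = 0; i < 10; i++)
        {
            C c = cs[i % 2];
            virtualSum += virtualCall(c, i);
            guardedSum += guardedCall(c, i);
        }
        // Both forms compute the same result; only the dispatch differs.
        System.out.println(virtualSum + " " + guardedSum);
    }
}
```

This is also why the hand-peeled benchmark above wins for three and four types: it restores the guarded, inlinable shape that the JIT no longer produces for megamorphic call sites.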

Lambdas are slow

Let's try some streams first

final int SIZE = 10240;
final int[] integers = new int[SIZE];

public int lambdaArray[Cold|Hot]()
{
    return Arrays.stream(integers)
        .filter(i -> i % 2 == 0).sum();
}

public int lambdaIntStream[Cold|Hot]()
{
    return IntStream.range(0, SIZE)
        .filter(i -> i % 2 == 0).sum();
}

public int loop[Cold|Hot]()
{
    int sum = 0;
    for (int i = 0; i < integers.length; i++)
    {
        if (i % 2 == 0)
        {
            sum += integers[i];
        }
    }
    
    return sum;
}

Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayHot       avgt    5      13,968.8   ns/op
Lambda01.lambdaIntStreamHot   avgt    5      13,156.2   ns/op
Lambda01.loopHot              avgt    5      10,146.6   ns/op

# Cold does not have JMH warmup
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5     251,372.4   ns/op
Lambda01.lambdaArrayHot       avgt    5      13,968.8   ns/op

Lambda01.lambdaIntStreamCold  avgt    5     794,422.3   ns/op
Lambda01.lambdaIntStreamHot   avgt    5      13,156.2   ns/op

Lambda01.loopCold             avgt    5      15,066.7   ns/op
Lambda01.loopHot              avgt    5      10,146.6   ns/op

# -Xcomp
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5  24,809,394.5   ns/op
Lambda01.lambdaIntStreamCold  avgt    5  23,398,581.4   ns/op
Lambda01.loopCold             avgt    5      11,787.0   ns/op

Lambda01.lambdaArrayHot       avgt    5     143,678.1   ns/op
Lambda01.lambdaIntStreamHot   avgt    5     144,445.3   ns/op
Lambda01.loopHot              avgt    5      10,109.5   ns/op

# -Xint
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5   5,578,040.0   ns/op
Lambda01.lambdaArrayHot       avgt    5   5,473,135.6   ns/op
Lambda01.lambdaIntStreamCold  avgt    5   5,394,464.0   ns/op
Lambda01.lambdaIntStreamHot   avgt    5   5,317,269.0   ns/op
Lambda01.loopCold             avgt    5     229,327.9   ns/op
Lambda01.loopHot              avgt    5     229,369.6   ns/op

Lambda Functions

Compare Predicate, Pseudo-Predicate, and Direct Implementation

public int forLambda(final Predicate<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int forAnonymous(final PseudoLambda<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int simple() {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (array[i] % 2 == 0) {
            sum++;
        }
    }
    return sum;
}

@Benchmark
public void forSimple(Blackhole bh)
{
    bh.consume(simple());
}
    
@Benchmark
public void forLambdaPredicate(Blackhole bh)
{
    bh.consume(forLambda(i -> i % 2 == 0));
}

interface PseudoLambda<T>
{
    boolean test(T t);
}

@Benchmark
public void forAnonymousClass(Blackhole bh)
{
    bh.consume(
        forAnonymous(new PseudoLambda<Integer>()
        {
            @Override
            public boolean test(Integer i)
            {
                return i % 2 == 0;
            }
        })
    );
}

Lambda Functions


Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass             avgt    2     152.9  ns/op
forLambdaPredicate            avgt    2     128.8  ns/op
forLambdaPredicatePredefined  avgt    2     127.9  ns/op
forSimple                     avgt    2     128.1  ns/op

# -Xint
Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass             avgt    2  21,312.5  ns/op
forLambdaPredicate            avgt    2  20,154.8  ns/op
forLambdaPredicatePredefined  avgt    2  19,871.6  ns/op
forSimple                     avgt    2   3,471.7  ns/op

# No warmup and just 5x measurements    
Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass               ss    5  92,787.6  ns/op
forLambdaPredicate              ss    5 122,377.4  ns/op
forLambdaPredicatePredefined    ss    5  50,435.6  ns/op
forSimple                       ss    5  16,268.6  ns/op
  • Basic lambdas are great when hot
  • When cold, their performance is questionable
  • Not the right choice for server startup or rarely executed code
  • BUT it all depends on the total performance debt collected
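The `forLambdaPredicatePredefined` variant in the results is not shown in the code above; a plausible sketch is that it simply reuses one Predicate instance instead of spelling the lambda out at the call site, so the lambda object is created (and its call site bootstrapped) only once (hypothetical reconstruction, not the benchmark's actual code):

```java
import java.util.function.Predicate;

public class PredefinedPredicate
{
    // One shared instance: created once, reused for every invocation,
    // instead of writing "i -> i % 2 == 0" at each call site.
    static final Predicate<Integer> EVEN = i -> i % 2 == 0;

    static int count(final int[] array, final Predicate<Integer> p)
    {
        int sum = 0;
        for (int i = 0; i < array.length; i++)
        {
            if (p.test(array[i])) // note: boxes each int to Integer
            {
                sum++;
            }
        }
        return sum;
    }

    public static void main(String[] args)
    {
        final int[] array = { 1, 2, 3, 4, 5 };
        System.out.println(count(array, EVEN)); // two even values: 2 and 4
    }
}
```

That would explain why the predefined variant is the fastest lambda flavor in the no-warmup single-shot run above.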

GC can be your friend


public class GCAndAccessSpeed 
{
    private final static int SIZE  = 100_000;
    private final List<String> STRINGS = new ArrayList<>();
    private final List<String> ordered = new ArrayList<>();
    private final List<String> nonOrdered = new ArrayList<>();
    
    @Param({"false", "true"}) boolean gc;
    @Param({"1", "10"})       int COUNT;
    @Param({"false", "true"}) boolean drop;
    
    @Setup
    public void setup() throws InterruptedException
    {
        final FastRandom r = new FastRandom(7);
        for (int i = 0; i < COUNT * SIZE; i++)
        {
            STRINGS.add(RandomUtils.randomString(r, 1, 20));
        }
        
        for (int i = 0; i < SIZE; i++)
        {
            ordered.add(STRINGS.get(i * COUNT));
        }
        nonOrdered.addAll(ordered);
        Collections.shuffle(nonOrdered, new Random(r.nextInt()));
        
        if (drop)
        {
            STRINGS.clear();
        }
        
        if (gc) 
        {
            for (int c = 0; c < 5; c++) 
            {
                System.gc();
                TimeUnit.SECONDS.sleep(2);
            }
        }
    }
    
    @Benchmark
    public int walk[Ordered|NonOrdered]()
    {
        int sum = 0;
        for (int i = 0; i < [ordered|nonOrdered].size(); i++)
        {
            sum += [ordered|nonOrdered].get(i).length();
        }
        return sum;
    }
}

Applied Memory Knowledge

  • The theory: memory that sits closer together is good (locality of reference)
  • G1 is a compacting collector, hence it moves objects closer together
  • When walking a structure whose objects are more "local", the cache is hotter
  • Verify with a no-op GC setup, aka EpsilonGC
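A standalone sketch of the benchmark's core idea (assumed setup, not the measured code): both lists hold the very same String references, but one traverses them in allocation order and the other in shuffled order, so only the memory access pattern, and thus cache locality, differs:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class LocalityWalk
{
    // The walk from the benchmark: sum up the lengths of all strings.
    static int walk(final List<String> list)
    {
        int sum = 0;
        for (int i = 0; i < list.size(); i++)
        {
            sum += list.get(i).length();
        }
        return sum;
    }

    public static void main(String[] args)
    {
        // Allocate strings back to back: allocation order roughly
        // matches memory order, especially after a compacting GC ran.
        final List<String> ordered = new ArrayList<>();
        for (int i = 0; i < 100_000; i++)
        {
            ordered.add(Integer.toString(i));
        }

        // Same references, randomized traversal order: worse locality.
        final List<String> nonOrdered = new ArrayList<>(ordered);
        Collections.shuffle(nonOrdered, new Random(42L));

        // Both walks compute the same sum; timing them is what differs.
        System.out.println(walk(ordered) == walk(nonOrdered)); // true
    }
}
```

The results below show exactly this: the ordered walk wins, the gap widens when the heap is fragmented (COUNT = 10), and a compacting GC run (gc = true, drop = true) largely restores the ordered-walk speed.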

# G1 GC
Benchmark        (COUNT)  (drop)   (gc) Mode  Cnt      Score   Units
walkNonOrdered        1   false  false  avgt    5  1,596,315.2   ns/op
walkOrdered           1   false  false  avgt    5    611,137.7   ns/op

walkNonOrdered        1   false   true  avgt    5  1,172,951.3   ns/op
walkOrdered           1   false   true  avgt    5    431,143.5   ns/op

walkNonOrdered        1    true  false  avgt    5  1,562,844.1   ns/op
walkOrdered           1    true  false  avgt    5    605,119.4   ns/op

walkNonOrdered        1    true   true  avgt    5  1,243,973.9   ns/op
walkOrdered           1    true   true  avgt    5    400,721.9   ns/op

walkNonOrdered       10   false  false  avgt    5  1,903,731.9   ns/op
walkOrdered          10   false  false  avgt    5  1,229,945.1   ns/op

walkNonOrdered       10   false   true  avgt    5  2,026,861.7   ns/op
walkOrdered          10   false   true  avgt    5  1,809,961.9   ns/op

walkNonOrdered       10    true  false  avgt    5  1,920,658.4   ns/op
walkOrdered          10    true  false  avgt    5  1,239,658.5   ns/op

walkNonOrdered       10    true   true  avgt    5  1,160,229.2   ns/op
walkOrdered          10    true   true  avgt    5    403,949.5   ns/op

# -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC
Benchmark        (COUNT)  (drop)   (gc) Mode  Cnt      Score   Units
walkNonOrdered        1   false  false  avgt    5  1,667,611.7   ns/op
walkNonOrdered        1   false   true  avgt    5  1,820,968.2   ns/op
walkNonOrdered        1    true  false  avgt    5  1,928,822.7   ns/op
walkNonOrdered        1    true   true  avgt    5  1,777,251.4   ns/op

walkOrdered           1   false  false  avgt    5    931,728.5   ns/op
walkOrdered           1   false   true  avgt    5    902,433.7   ns/op
walkOrdered           1    true  false  avgt    5    930,294.3   ns/op
walkOrdered           1    true   true  avgt    5    907,886.5   ns/op

Conclusion

Rule 1

Don't worry too much about performance. Java is very good at taking care of it.

Rule 2

Don't optimize prematurely and don't micro-tune.

Rule 3

Simply write clean, good code. You will likely never see most of these issues. Think before you code. Java can make code fast, but bad algorithms stay bad algorithms, and bad designs stay bad designs.



Rule 4

When you worry, worry about the hottest code. Measure carefully and apply what you have learned. The compiler and the CPU are your friends, but they also make your profiling life harder.

Rule 5

Understand, measure, review, test, and measure again.

Rule 6

The code you write is not the code that is executed. Measure under production conditions to understand it, and avoid optimizing non-compiled, non-optimized code.

Valhalla

Loom

The code

When you want to play with the code

  • All examples from today and more
  • Do whatever you like with it
  • Requires JDK 8 or higher and Maven
  • Uses the Java Microbenchmark Harness (JMH) from OpenJDK
  • Optionally uses kernel routines for low-level profiling

https://github.com/Xceptance/jmh-jmm-training

Questions and Answers