High Performance Java

About the smart Java internals

Disclaimer

Java is more than what most people think

The times when Java was a bytecode interpreter, running slow and sluggish, are long gone. But most people don't know the magic that is going on to speed up Java to native code speed and beyond.

  • This is not science, this is applied knowledge
  • Java got closer to the hardware
  • CPUs got better, but memory stayed slow; hence CPUs do whatever it takes to avoid main memory, and that is what Java tries to do too
  • This is not about the fastest code, this is to understand the speed and draw conclusions
  • We will learn how to avoid mistakes, but beyond that, trust Java
This training is partially based on material from Java Performance Puzzlers by Douglas Hawkins and Optimizing Java (O'Reilly Media, 2018).

An early warning

Optimization and tuning is fun... but...

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth, Turing Award Lecture 1974

  • We might talk about 10% of the code today
  • When carefully crafted, a 10 to 50% speed-up is doable
  • Most of the time, a new architecture and a clearer design give you more than a 100% speed-up
  • The rest of the time, it is just good to know what Java does to get us the speed we experience...
  • ...and to understand why often tuning attempts or benchmarks fail

Let's Benchmark

Get some measurements going

The code

When you want to play with the code

  • All examples of today and more
  • Do whatever you like with it
  • JDK 8 or higher and Maven needed
  • Uses the Java Microbenchmark Harness from OpenJDK
  • Optionally uses kernel routines to profile at a low level

https://github.com/Xceptance/jmh-jmm-training

Poor Man's Benchmark 1

Just try to benchmark something

import java.util.Arrays;

public class PoorMansBenchmark
{   
    private static String append(int i)
    {
        final StringBuilder sb = new StringBuilder();
        sb.append(System.currentTimeMillis());
        sb.append("Foo");
        sb.append(i);
        sb.append('c');
        sb.append("lkJAHJ AHSDJHF KJASHF HSAKJFD");
        
        final String s = sb.toString();
        char[] ch = s.toCharArray();
        Arrays.sort(ch);
        
        return new String(ch);
    }

    // the real code is longer and contains measurement, 
    // see your disk!
    public static void main (String[] args)
    {
        int SIZE = 10;
       
        int sum = 0;
        for (int i = 0; i < SIZE; i++)
        {
            sum += append(i).length();
        } 
    }
}
  • Some nonsense with StringBuilder and String
  • Within a loop we call our method
  • Measurement around the call (see repo)

Let's Run PoorMansBenchmark

# SIZE = 10
 1,215996
 2,18298
 3,15279
 4,14614
 5,24858
 6,14660
 7,14418
 8,14319
 9,14285
10,15174
# SIZE = 50 
 1,201377
 2,17068
 3,14505
 4,13963
 5,23990
 6,14178
 7,13813
 8,13802
 9,13657
10,14729
20,13982
30,13840
40,21632
41,5975085
42,66191
43,13048
44,17985
45,10986
46,10674
47,16902
48,30752
49,10097
50,10312
# SIZE = 200
  1,372817
  2,19545
  3,16347
  4,16073
  5,27529
  6,18329
  7,15736
  8,15539
  9,15440
 10,16634
 20,15990
 30,15975
 40,16335
 50,19240
100,15945
110,17394
120,18280
130,18231
140,17730
150,15949
160,20445
165,24192
170,9069
180,8109
190,7797
200,7568
# SIZE = 1000
   1,400721
   2,35590
   3,15275
   4,14360
   5,25368
   6,14292
   7,16662
   8,14041
   9,13680
  10,14805
  90,14909
 100,15138
 160,14973
 170,10130
 180,7961
 300,6507
 400,5636
 500,6504
 640,27575
 650,3233
 800,3362
 900,3146
 910,4010
 920,2799
 930,2800
 940,2885
 980,4141
 990,2751
1000,2946

That is not helpful

The data is so inconsistent

What happened

Greetings from the Hotspot compiler

  • To make your code fly, a compiler compiles the bytecode on-demand
  • Hotspot compiler
  • Tiered compilation: C1 and C2 with four compiled levels
  • Profiles result and optimizes continuously
  • Can inline, loop-unroll, run escape analysis, branch predict, use intrinsics, do lock tricks, and a few more things
  • Compiler decides what code is hot
  • Compilation is per method
  • In addition, loops are compiled on their own (on-stack replacement), because the method might be cold but the loop hot

Compare interpreter vs. compiler

[Charts comparing runtimes per configuration: Hotspot (tiered), No Tiered Compiler, Interpreter Only, Compile Everything]

Some more information about Hotspot

Just to get a feel for the complexity

Compiler Stages [1]

  • Level 0: Interpreter only
  • Level 1: C1, full optimization, no profiling
  • Level 2: C1 with invocation and backedge counters
  • Level 3: C1 with full profiling
  • Level 4: C2, fully optimized

Compiler Flow [2]

Watch the Compiler

48    1       3       java.lang.String::equals (81 bytes)
49    2       3       java.lang.String::hashCode (55 bytes)
49    3       3       java.lang.String::charAt (29 bytes)
49    4       3       java.lang.String::length (6 bytes)
51    5       3       java.lang.Object::<init> (1 bytes)
51    6       3       java.lang.AbstractStringBuilder::ensureCapacityInternal (27 bytes)
51    7     n 0       java.lang.System::arraycopy (native)   (static)
51    8       3       java.lang.String::indexOf (70 bytes)
52   10       3       java.lang.Math::min (11 bytes)
52    9       3       java.util.Arrays::copyOfRange (63 bytes)
53   11       1       java.lang.ref.Reference::get (5 bytes)
53   12       3       java.lang.String::getChars (62 bytes)
53   13       1       java.lang.ThreadLocal::access$400 (5 bytes)
57   14       3       java.util.DualPivotQuicksort::sort (1195 bytes)
62   20       1       java.lang.Object::<init> (1 bytes)
62    5       3       java.lang.Object::<init> (1 bytes)   made not entrant
  • Timestamp in ms since VM start
  • Compilation order
  • Compiler level
  • Method and size
  • n: native
  • s: synchronized
  • !: has exception handler
  • %: On-stack replacement (OSR)
  • made not entrant: no longer used for new calls
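
Output like the above can be produced with the diagnostic flag below (assumed invocation; the class name is a placeholder):

java -XX:+PrintCompilation PoorMansBenchmark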

JMH to the rescue

Java Microbenchmark Harness

  • JMH is developed by the OpenJDK project
  • Delivers consistent results
  • Lets you control the compiler
  • Helps suppress unwanted optimizations such as dead-code elimination
  • CLI tool, built and run via Maven
  • Use your IDE to write benchmarks (a minimal skeleton follows after this list), but not to run them
  • Knows different JVMs and their tricks
  • Contains profilers, either JMX, Hotspot, or OS based
  • Also great for multi-threaded testing
  • Read the examples carefully before assuming your benchmark is legit
  • Keep in mind, most code is not hot
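
A minimal benchmark skeleton looks like this (a sketch with placeholder names; imports from org.openjdk.jmh.annotations and java.util are omitted as in the other snippets):

@State(Scope.Thread)
public class MyBenchmark
{
    private int[] data;

    @Setup
    public void setup()
    {
        // fixed seed so every fork sees the same data
        data = new Random(42).ints(1024).toArray();
    }

    @Benchmark
    public int sum()
    {
        int sum = 0;
        for (int i = 0; i < data.length; i++)
        {
            sum += data[i];
        }
        // returning the result keeps it from being dead-code eliminated
        return sum;
    }
}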

mvn clean install; java -Xms2g -Xmx2g -XX:+AlwaysPreTouch -jar target/benchmarks.jar YourClass

Java Performance Fun

Ohhhh Trap

You trained it well

private final static int SIZE = 1000;

public static void main(String[] args)
{
    Object trap = null;
    Object o = null;
    
    for (int i = 0; i < 1000; i++)
    {
        final Timer t = Timer.startTimer();
        
        for (int j = 0; j < SIZE; j++)
        {
            // burn time and train that null is normal
            o = new Object();
            
            if (trap != null)
            {
                System.out.println("Got you." + o);
                trap = null;
            }
        }
        
        // Give me a Null, Vasily. One Null only, please.  
        if (i == 400)
        {
            trap = new Object();
        }

        System.out.println(
                        MessageFormat.format("{1} {0, number, #}", 
                        t.stop().runtimeNanos(), i));
    }        
}
  • Line 15 (o = new Object()): with or without the assignment for the second chart
  • Example by Douglas Hawkins: org.sample.AllocationTrap.java

Escape Analysis

Use the stack instead of the heap

@Benchmark
public long array64()
{
    int[] a = new int[64];
    
    a[0] = r.nextInt();
    a[1] = r.nextInt();
    
    return a[0] + a[1];
}

@Benchmark
public long array65()
{
    int[] a = new int[65];
    
    a[0] = r.nextInt();
    a[1] = r.nextInt();
    
    return a[0] + a[1];
}
EscapeAnalysis.array64                           avgt  2    22.804           ns/op
EscapeAnalysis.array64:·gc.alloc.rate            avgt  2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm       avgt  2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array64:·gc.count                 avgt  2       ≈ 0          counts

EscapeAnalysis.array65                           avgt  2    41.890           ns/op
EscapeAnalysis.array65:·gc.alloc.rate            avgt  2  5795.571          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm       avgt  2   280.000            B/op
EscapeAnalysis.array65:·gc.count                 avgt  2    93.000          counts
# -XX:EliminateAllocationArraySizeLimit=70
EscapeAnalysis.array65                           avgt  2    22.279           ns/op
EscapeAnalysis.array65:·gc.alloc.rate            avgt  2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm       avgt  2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array65:·gc.count                 avgt  2       ≈ 0          counts
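
Escape analysis is not limited to small arrays: an object that never leaves a method can be scalar-replaced, while one stored into a field has escaped and must live on the heap. A rough sketch to illustrate the idea (hypothetical code, not part of the repo; r is the same Random as above):

static final class Point
{
    int x, y;
}

private Point escaped; // writing to this field makes the object escape

@Benchmark
public int noEscape()
{
    Point p = new Point();  // never leaves the method,
    p.x = r.nextInt();      // so it is a candidate for scalar replacement
    p.y = r.nextInt();
    return p.x + p.y;
}

@Benchmark
public int escapes()
{
    Point p = new Point();
    p.x = r.nextInt();
    p.y = r.nextInt();
    escaped = p;            // escapes via the field, must be allocated on the heap
    return p.x + p.y;
}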

Expensive Last Statement

Just one statement more...

@Benchmark
public long add230()
{
    int sum = 0;
    int i = 0;
    
    // now 23 of these
    sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
           a[++i] + a[++i] + a[++i] + a[++i] + a[++i];

    return sum;
}

@Benchmark
public long add231()
{
    int sum = 0;
    int i = 0;
    
    // now 23 of these
    sum += a[++i] + a[++i] + a[++i] + a[++i] + a[++i] +
           a[++i] + a[++i] + a[++i] + a[++i] + a[++i];
    // and this
    sum += a[++i];

    return sum;
}
Benchmark                      Mode  Cnt    Score   Units
ExpensiveLastAddSmiley.add230  avgt    2   41.159   ns/op
ExpensiveLastAddSmiley.add231  avgt    2  122.162   ns/op
  • Darn... the last add is really expensive

Expensive Last Statement

Let's check the compiler

803  434       4  org.sample.ExpensiveLastAdd::add230 (2251 bytes)
    
797  434       3  org.sample.ExpensiveLastAdd::add231 (2263 bytes)
799  436       4  org.sample.ExpensiveLastAdd::add231 (2263 bytes)
811  436       4  org.sample.ExpensiveLastAdd::add231 (2263 bytes)   COMPILE SKIPPED: Out of stack space (not retryable)
  • The C2 compile of add231 fails, so it cannot be fully optimized
  • We are stuck with C1 at level 3, full profiling turned on

Another Surprise

Just add ints together

@Benchmark
public int add1396()
{
    int sum = 0;
    
    sum += 1; ... sum += 1396;

    return sum;
}

@Benchmark
public int add1397()
{
    int sum = 0;
    
    sum += 1; ... sum += 1396;
    sum += 1397;

    return sum;
}
Benchmark                Mode  Cnt     Score   Error  Units
ExpensiveIntAdd.add1396  avgt    2     3.567          ns/op
ExpensiveIntAdd.add1397  avgt    2  2688.218          ns/op
  • That was really expensive, but why?
  • add1396 is compiled down to a precomputed constant return, hence it is really fast
  • add1397 runs into the huge-method limit
  • Methods larger than 8000 bytes of bytecode are not compiled at all
  • So it stays interpreted bytecode
# Executed with -Xint
Benchmark                Mode  Cnt     Score   Error  Units
ExpensiveIntAdd.add1396  avgt    2  2660.065          ns/op
ExpensiveIntAdd.add1397  avgt    2  2644.399          ns/op
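
If you want to see the huge method compiled anyway, the cutoff can be disabled (assumed invocation; results not measured here):

# compile even huge methods
java -XX:-DontCompileHugeMethods -jar target/benchmarks.jar ExpensiveIntAdd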

Your loop is easy

Flatten out the loops

private int next()
{
    int i = r.nextInt(1) + 1;
    return i;
}

@Benchmark
public int classic()
{
    int sum = 0;
    int step = next();
    for (int i = 0; i < ints.length; i = i + 1)
    {
        sum += ints[i];
        step = next();
    }
    return sum + step;
}

@Benchmark
public int variable()
{
    int sum = 0;
    int step = next();
    for (int i = 0; i < ints.length; i = i + step)
    {
        sum += ints[i];
        step = next();
    }
    return sum + step;
}
  • Which loop will win?
Benchmark            (size)  Mode  Cnt      Score   Units
LoopUnroll.classic    10000  avgt    2  18,871.574  ns/op
LoopUnroll.variable   10000  avgt    2  27,433.414  ns/op
  • Our loop got unrolled (a hand-written sketch of what that means follows after the perf numbers)
  • Fewer range checks, fewer jumps
# -Xint
Benchmark            (size)  Mode Cnt        Score  Units
LoopUnroll.classic    10000  avgt   2 2,650,812.533 ns/op
LoopUnroll.variable   10000  avgt   2 2,449,845.956 ns/op
# classic()
 8.995.890.331  cycles:u                  #    3,292 GHz                      
15.443.666.755  instructions:u            #    1,72  insn per cycle         
                                          #    0,32  stalled cycles per insn  
   368.352.858  branches:u                #  134,800 M/sec                    
     1.714.520  branch-misses:u           #    0,47% of all branches         
 1.383.613.083  L1-dcache-loads:u         #  506,340 M/sec                    
    43.244.875  L1-dcache-load-misses:u   #    3,13% of all L1-dcache hits    

# variable()     
 8.597.780.881  cycles:u                  #    3,235 GHz                      
 2.828.996.294  stalled-cycles-frontend:u #   32,90% frontend cycles idle    
23.722.750.782  instructions:u            #    2,76  insn per cycle         
                                          #    0,12  stalled cycles per insn 
 2.156.234.427  branches:u                #  811,237 M/sec                   
     1.189.662  branch-misses:u           #    0,06% of all branches          
 1.826.787.276  L1-dcache-loads:u         #  687,289 M/sec                   
    29.874.307  L1-dcache-load-misses:u   #    1,64% of all L1-dcache hits
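
Roughly speaking, the JIT rewrites the fixed-stride loop into something like the hand-written sketch below (illustration only; the real compiler works on its own intermediate representation and picks its own unroll factor, and the next() call is left out for clarity):

// classic(), unrolled by hand with a factor of 4
int sum = 0;
int i = 0;
for (; i < ints.length - 3; i += 4)
{
    sum += ints[i] + ints[i + 1] + ints[i + 2] + ints[i + 3];
}
// tail loop for the remaining elements
for (; i < ints.length; i++)
{
    sum += ints[i];
}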

Intrinsics or Native

Native code to the rescue

final int SIZE = 100_000;
final int[] src = new int[SIZE];
    
public int[] manualArrayCopy()
{
    int[] target = new int[SIZE];
    for (int i = 0; i < SIZE; i++)
    {
        target[i] = src[i];
    }
    
    return target;
}

public int[] systemArrayCopy()
{
    int[] target = new int[SIZE];
    System.arraycopy(
        src, 0, target, 0, SIZE);
    
    return target;
}

@Benchmark
public int[] manual()
{
    return manualArrayCopy();
}

@Benchmark
public int[] system()
{
    return systemArrayCopy();
}
Benchmark                   Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy.manual  avgt    2  48177.054          ns/op
IntrinsicsArrayCopy.system  avgt    2  48678.219          ns/op
  • Uh... native is not faster?
  • Well, once Hotspot has compiled the Java loop, it is native code too
                    Manual               System
=======================================================
Iteration   1: 1,431,758.000 ns/op    129,551.000 ns/op
Iteration   2:   714,942.000 ns/op     81,926.000 ns/op
Iteration   3:   188,272.000 ns/op     77,098.000 ns/op
Iteration   4:    70,491.000 ns/op     72,010.000 ns/op
Iteration   5:    82,862.000 ns/op     73,495.000 ns/op
Iteration   6:    66,534.000 ns/op     75,625.000 ns/op
Iteration   7:    61,277.000 ns/op     74,382.000 ns/op
  • Native is quickly outperformed by Hotspot
  • BUT: That depends on your hardware
  • Hotspot can vary its code and intrinsics selection based on CPU instructions available

Wait... What are Intrinsics?

Native code vs. Intrinsics

  • Native: A JNI based method call that calls any kind of provided native platform code
  • Intrinsics: native platform code that is not called as a method, but placed directly into the Hotspot-compiled code
  • public static native void arraycopy(...) can be a native call or an intrinsic
  • Java makes that decision when the VM is started based on CPU features
  • Even for methods fully visible as Java source in the JDK, Hotspot can and will use intrinsics instead, see java.lang.Math
# -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
                @ 54   java.lang.Math::min (11 bytes)   (intrinsic)
                @ 57   java.lang.System::arraycopy (0 bytes)   (intrinsic)
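
To see an intrinsic at work yourself, a minimal sketch (hypothetical benchmark, not part of the repo; a and b are int fields set in @Setup) could compare the JDK method with a hand-written copy and run with the flags above; the Math.min call then shows up as (intrinsic) in the inlining log:

@Benchmark
public int jdkMin()
{
    // typically replaced by the Hotspot intrinsic
    return Math.min(a, b);
}

@Benchmark
public int handMin()
{
    // plain Java, compiled like any other method
    return a < b ? a : b;
}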

Stupid Compiler

Let's concat some Strings

private String foo = null;

@Setup
public void setup() {
    foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
}

@Benchmark
public String plain() {    
    return "a" + foo;
}

@Benchmark
public String concat() {    
    return "a".concat(foo);
}

@Benchmark
public String builder() {    
    return new StringBuilder().append("a").append(foo).toString();
}

@Benchmark
public String builderFull() {    
     final StringBuilder sb = new StringBuilder();
     sb.append("a");
     sb.append(foo);
     return sb.toString();
}

@Benchmark
public String builderSized() {    
    return new StringBuilder(32).append("a").append(foo).toString();
}

@Benchmark
public String builderFullSized() {    
    final StringBuilder sb = new StringBuilder(32);
    sb.append("a");
    sb.append(foo);
    return sb.toString();
}
Benchmark                  Mode  Cnt   Score   Error  Units
Strings1.builderSized      avgt    2  15.737          ns/op
Strings1.plain             avgt    2  16.179          ns/op
Strings1.builder           avgt    2  16.075          ns/op
Strings1.builderFullSized  avgt    2  33.930          ns/op
Strings1.concat            avgt    2  36.296          ns/op
Strings1.builderFull       avgt    2  49.676          ns/op
  • builderFull falls behind despite logically equivalent code
  • The compiler is trained on the chained builder pattern (which is what javac emits for +, as sketched below) and does worse otherwise
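
For reference, plain() keeps up with builder() because on JDK 8 javac already desugars the + into the same chained-builder form (JDK 9+ emits an invokedynamic call to StringConcatFactory instead); roughly:

// what javac (JDK 8) generates for:  return "a" + foo;
return new StringBuilder().append("a").append(foo).toString();
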
# Let's try with more data
Benchmark                         Mode  Cnt     Score Units
Strings2.concat                   avgt    2    48.120 ns/op
Strings2.builderFullSized         avgt    2    59.898 ns/op
Strings2.builderSized             avgt    2    59.959 ns/op
Strings2.plain                    avgt    2    60.265 ns/op
Strings2.builder                  avgt    2    64.167 ns/op
Strings2.builderFull              avgt    2    64.351 ns/op

Strings2.concat:·gc.alloc.rate.norm            120.013  B/op
Strings2.builder:·gc.alloc.rate.norm           272.013  B/op
Strings2.plain:·gc.alloc.rate.norm             272.018  B/op
Strings2.builderFull:·gc.alloc.rate.norm       272.020  B/op
Strings2.builderFullSized:·gc.alloc.rate.norm  392.010  B/op
Strings2.builderSized:·gc.alloc.rate.norm      392.010  B/op
  • Welcome to benchmark land
  • Don't assume that your artificial example tells the whole story

Let's talk hardware

Hardware does not matter... you wish!

final int SIZE = 1_000_000;
final int[] src = new int[SIZE];
    
@Benchmark
public int step1()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        sum += src[i];
    }
    
    return sum;
}

@Benchmark
public int step20()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i = i + 20)
    {
        sum += src[i];
    }
    
    return sum;
}
  • step20 should obviously be about 20x faster
Benchmark                      Mode  Cnt       Score   Error  Units
ArraysAndHardware.step1        avgt    2  325,159.158          ns/op
ArraysAndHardware.step20       avgt    2   97,402.981          ns/op
  • We expected about 16,300 ns/op
  • We got only a 3.3x speed-up
  • What? Why? See the back-of-the-envelope below
  • P.S. The step20 results vary a lot across repeated runs
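
A rough back-of-the-envelope, assuming 64-byte cache lines and 4-byte ints (order-of-magnitude only, not measured):

# step1 : 1,000,000 loads x 4 B, 16 ints per cache line        -> ~62,500 cache lines fetched
# step20:    50,000 loads, stride 80 B, a new line almost every time -> ~50,000 cache lines fetched

Both variants pull nearly the same amount of data from memory; step20 just throws most of every cache line away. The speed-up is bounded by memory traffic, not by the instruction count.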

Hardware Greetings

Also Java has to live by hardware laws

Linux Perf for step1
==========================================================================================
   5174,723228      task-clock:u (msec)       #    0,320 CPUs utilized          
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
           148      page-faults:u             #    0,029 K/sec                  
17.608.341.100      cycles:u                  #    3,403 GHz                      (35,81%)
 9.565.555.351      stalled-cycles-frontend:u #   54,32% frontend cycles idle     (35,94%)
17.694.322.391      instructions:u            #    1,00  insn per cycle         
                                              #    0,54  stalled cycles per insn  (43,14%)
   938.066.934      branches:u                #  181,279 M/sec                    (43,10%)
       659.800      branch-misses:u           #    0,07% of all branches          (43,08%)
15.030.118.557      L1-dcache-loads:u         # 2904,526 M/sec                    (42,20%)
Linux Perf for step20
==========================================================================================
   5211,208535      task-clock:u (msec)       #    0,321 CPUs utilized          
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
           155      page-faults:u             #    0,030 K/sec                  
16.720.652.877      cycles:u                  #    3,209 GHz                      (35,98%)
14.817.482.507      stalled-cycles-frontend:u #   88,62% frontend cycles idle     (36,13%)
 4.354.757.726      instructions:u            #    0,26  insn per cycle         
                                              #    3,40  stalled cycles per insn  (43,39%)
 1.081.134.885      branches:u                #  207,463 M/sec                    (43,47%)
       527.695      branch-misses:u           #    0,05% of all branches          (43,44%)
 1.073.841.169      L1-dcache-loads:u         #  206,064 M/sec                    (42,18%)

The Matrix

What is real and what is not

final int SIZE = 1000;
final int[][] src = new int[SIZE][SIZE];

@Benchmark
public int horizontal()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

@Benchmark
public int vertical()
{
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[j][i];
        }
    }
    return sum;
}
  • Both calculations should have the same speed, shouldn't they?
  • Ok... which one is faster... by how much?
Benchmark          Mode  Cnt        Score    Error   Units
Matrix.horizontal  avgt    2    325,245.614          ns/op
Matrix.vertical    avgt    2  6,885,314.222          ns/op
  • Horizontal access is about 20x faster!
  • A Java 2D array is an array of row arrays: src[i][j] walks one row contiguously, while src[j][i] jumps to a different row array on every step
  • Being friends with the L1 cache pays off

I know what you will do next

Modern fortune tellers

private static final int COUNT = 10_000;

// Contain random numbers from -50 to 50
private int[] sorted;
private int[] unsorted;
private int[] reversed;

public void doIt(
            int[] array, 
            Blackhole bh1, 
            Blackhole bh2)
{
    for (int v : array)
    {
        if (v > 0)
        {
            bh1.consume(v);
        }
        else
        {
            bh2.consume(v);
        }
    }
}
  • Will there be any difference when passing sorted or unsorted arrays?
Benchmark                  Mode  Cnt      Score   Error  Units
BranchPrediction.reversed  avgt    2  39570.696          ns/op
BranchPrediction.sorted    avgt    2  39859.027          ns/op
BranchPrediction.unsorted  avgt    2  66043.605          ns/op
  • The unsorted run takes about 65% longer. Why?
  • Let's ask the CPU

Branch Prediction

Modern CPU marvels: I know what you might need next

sorted()
==========================================================================================        
   3944,036489      task-clock:u (msec)       #    0,441 CPUs utilized          
13.422.117.531      cycles:u                  #    3,403 GHz                      (35,75%)
 4.110.325.580      stalled-cycles-frontend:u #   30,62% frontend cycles idle     (35,88%)
37.645.201.882      instructions:u            #    2,80  insn per cycle         
                                              #    0,11  stalled cycles per insn  (43,10%)
 6.007.718.894      branches:u                # 1523,241 M/sec                    (43,21%)
       815.228      branch-misses:u           #    0,01% of all branches          (43,21%)
unsorted()
==========================================================================================        
   3993,633024      task-clock:u (msec)       #    0,445 CPUs utilized          
13.395.176.813      cycles:u                  #    3,354 GHz                      (35,96%)
 3.984.918.339      stalled-cycles-frontend:u #   29,75% frontend cycles idle     (36,15%)
17.512.270.313      instructions:u            #    1,31  insn per cycle         
                                              #    0,23  stalled cycles per insn  (43,41%)
 3.629.203.191      branches:u                #  908,747 M/sec                    (43,52%)
   251.849.716      branch-misses:u           #    6,94% of all branches          (43,62%)

Synchronization

Being alone and still have to sync

private final int SIZE = 1024;
private final Map<String, String> syncMap = 
            Collections.synchronizedMap(new HashMap<>(SIZE));
private final Map<String, String> unsyncMap = 
            new HashMap<>(SIZE);
private final Map<String, String> conMap = 
            new ConcurrentHashMap<>(SIZE);

@Benchmark
public String syncAccess() 
{
    return syncMap.get("1");
}
@Benchmark
public String unsyncAccess() 
{
    return unsyncMap.get("1");
}
@Benchmark
public String conAccess() 
{
    return conMap.get("1");
}
  • Which one is slowest and by how much?
  • Remember, just one thread, so no contention
Benchmark                       Mode  Cnt   Score   Units
SynchronizedAlone.conAccess     avgt    2   9.152   ns/op
SynchronizedAlone.syncAccess    avgt    2  23.462   ns/op
SynchronizedAlone.unsyncAccess  avgt    2   9.366   ns/op
  • Only about 40% of the performance remains with the synchronized map
  • ConcurrentHashMap does well

With Contention

Let's try more threads

2 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2    9.721          ns/op
SynchronizedThreads4.syncAccess    avgt    2  183.086          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2    9.312          ns/op
4 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   19.661          ns/op
SynchronizedThreads4.syncAccess    avgt    2  216.041          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   19.631          ns/op
8 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   44.055          ns/op
SynchronizedThreads4.syncAccess    avgt    2  377.180          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   38.562          ns/op
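
The thread count for runs like these is controlled by JMH, either per benchmark with the @Threads annotation or for the whole run with the -t option (a sketch; the repo may configure it differently):

@Threads(4)
@Benchmark
public String syncAccess()
{
    return syncMap.get("1");
}

# or on the command line
java -jar target/benchmarks.jar SynchronizedThreads -t 4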

Looping

Loops compared

Benchmark                           (length)  Mode  Cnt      Score   Error  Units
ForEachSimple.arrayFor                     1  avgt    2      4.766          ns/op
ForEachSimple.arrayFor                    10  avgt    2     12.217          ns/op
ForEachSimple.arrayFor                   100  avgt    2     81.079          ns/op
ForEachSimple.arrayFor                  1000  avgt    2   1440.692          ns/op

ForEachSimple.classicFor                   1  avgt    2      7.408          ns/op
ForEachSimple.classicFor                  10  avgt    2     21.839          ns/op
ForEachSimple.classicFor                 100  avgt    2    131.976          ns/op
ForEachSimple.classicFor                1000  avgt    2   1771.723          ns/op

ForEachSimple.enhancedFor                  1  avgt    2      7.791          ns/op
ForEachSimple.enhancedFor                 10  avgt    2     21.825          ns/op
ForEachSimple.enhancedFor                100  avgt    2    149.064          ns/op
ForEachSimple.enhancedFor               1000  avgt    2   1794.154          ns/op

ForEachSimple.lambdaStream                 1  avgt    2     66.357          ns/op
ForEachSimple.lambdaStream                10  avgt    2     68.765          ns/op
ForEachSimple.lambdaStream               100  avgt    2    233.032          ns/op
ForEachSimple.lambdaStream              1000  avgt    2   1842.485          ns/op

# 1 CPU
ForEachSimple.parallelLambdaStream         1  avgt    2    101.870          ns/op
ForEachSimple.parallelLambdaStream        10  avgt    2   2915.802          ns/op
ForEachSimple.parallelLambdaStream       100  avgt    2   2472.963          ns/op
ForEachSimple.parallelLambdaStream      1000  avgt    2  15094.696          ns/op

# 2 CPU
ForEachSimple.parallelLambdaStream         1  avgt    2    102.620          ns/op
ForEachSimple.parallelLambdaStream        10  avgt    2   2233.131          ns/op
ForEachSimple.parallelLambdaStream       100  avgt    2   1927.355          ns/op
ForEachSimple.parallelLambdaStream      1000  avgt    2   9494.494          ns/op
// just some strings
final List<String> list = new ArrayList<>();
String[] array = null;

@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup() {.. // uses length ..}

@Benchmark
public void arrayFor(Blackhole bh) 
{
    int result = 0;
    for (int i = 0; i < array.length; i++) 
    {
        result += array[i].length();
    }
    
    bh.consume(result);
}

// classicFor
for (int i = 0; i < list.size(); i++) 
{
    result += list.get(i).length();
}


// enhancedFor
for (String s : list) 
{
    result += s.length();
}

// lambdaStream
int result = list.stream()
    .mapToInt(s -> s.length())
    .sum();

//parallelLambdaStream
int result = list.parallelStream()
    .mapToInt(s -> s.length())
    .sum();

Looping

Not so empty loops compared

Benchmark                              (length)  Mode  Cnt       Score   Error  Units
ForEachExpensive.arrayFor                     1  avgt    2     284.117          ns/op
ForEachExpensive.arrayFor                    10  avgt    2    2732.129          ns/op
ForEachExpensive.arrayFor                   100  avgt    2   27239.061          ns/op
ForEachExpensive.arrayFor                  1000  avgt    2  303093.889          ns/op

ForEachExpensive.classicFor                   1  avgt    2     297.321          ns/op
ForEachExpensive.classicFor                  10  avgt    2    2685.728          ns/op
ForEachExpensive.classicFor                 100  avgt    2   26594.440          ns/op
ForEachExpensive.classicFor                1000  avgt    2  305164.516          ns/op

ForEachExpensive.enhancedFor                  1  avgt    2     293.539          ns/op
ForEachExpensive.enhancedFor                 10  avgt    2    2715.344          ns/op
ForEachExpensive.enhancedFor                100  avgt    2   27004.904          ns/op
ForEachExpensive.enhancedFor               1000  avgt    2  305563.946          ns/op

ForEachExpensive.lambdaStream                 1  avgt    2     452.960          ns/op
ForEachExpensive.lambdaStream                10  avgt    2    3345.737          ns/op
ForEachExpensive.lambdaStream               100  avgt    2   33530.716          ns/op
ForEachExpensive.lambdaStream              1000  avgt    2  351836.808          ns/op

# 1 CPU
ForEachExpensive.parallelLambdaStream         1  avgt    2     488.579          ns/op
ForEachExpensive.parallelLambdaStream        10  avgt    2    9477.440          ns/op
ForEachExpensive.parallelLambdaStream       100  avgt    2   42575.863          ns/op
ForEachExpensive.parallelLambdaStream      1000  avgt    2  478289.549          ns/op

# 2 CPU
ForEachExpensive.parallelLambdaStream         1  avgt    2     506.472          ns/op
ForEachExpensive.parallelLambdaStream        10  avgt    2   11439.329          ns/op
ForEachExpensive.parallelLambdaStream       100  avgt    2   30330.986          ns/op
ForEachExpensive.parallelLambdaStream      1000  avgt    2  219663.144          ns/op
  • We gave the loop body something real to do
final List<String> list = new ArrayList<>();
    
@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup()
{
    for (int i = 0; i < length; i++)
    {
        list.add(
            MessageFormat.format(
            "{0},{0},{0},{0},{0},{0}", String.valueOf(i)));
    }
}

@Benchmark
public void classicFor(Blackhole bh)
{
    int result = 0;
    for (int i = 0; i < list.size(); i++)
    {
        final String s = list.get(i);
        if (s.startsWith("5"))
        {
            continue;
        }
        
        final String[] a = s.split(",");
        for (int j = 0; j < a.length; j++)
        {
            result += Integer.valueOf(a[j]);
        }
    }
    
    bh.consume(result);
}

Remember where to go

Jumping around costs. The overhead of megamorphic calls

  • Check if there is an overhead to find the right method
  • This is about virtual method calls
  • See MegaMorphicSimple
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicSimple._1_mono             avgt    2   15.714          ns/op
MegaMorphicSimple._2_bi               avgt    2   16.935          ns/op
MegaMorphicSimple._4_mega             avgt    2   21.242          ns/op
MegaMorphicSimple._5_manuellDispatch  avgt    2   17.126          ns/op

-Xint
Benchmark                             Mode  Cnt    Score   Error  Units
MegaMorphicSimple._1_mono             avgt    2  711.119          ns/op
MegaMorphicSimple._2_bi               avgt    2  738.386          ns/op
MegaMorphicSimple._4_mega             avgt    2  713.038          ns/op
MegaMorphicSimple._5_manuellDispatch  avgt    2  741.916          ns/op
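
For orientation, a rough sketch of what mono-, bi-, and megamorphic call sites look like (hypothetical types, not the repo's exact code; shapes is an array filled with one, two, or more implementation types):

interface Shape { int area(); }
class Square   implements Shape { public int area() { return 4; } }
class Circle   implements Shape { public int area() { return 3; } }
class Triangle implements Shape { public int area() { return 2; } }

int sum = 0;
for (Shape s : shapes)
{
    // monomorphic: only Square ever seen here   -> call can be inlined
    // bimorphic:   Square and Circle seen       -> guarded inline of both
    // megamorphic: three or more receiver types -> real virtual dispatch
    sum += s.area();
}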

Really calling virtually

Taking a loop into account

  • See MegaMorphicLoop
Benchmark                (params)  Mode  Cnt       Score   Error  Units
MegaMorphicLoop.morphic         1  avgt    2   12266.342          ns/op
MegaMorphicLoop.morphic         2  avgt    2   86931.668          ns/op
MegaMorphicLoop.morphic         3  avgt    2  286773.072          ns/op
MegaMorphicLoop.morphic         4  avgt    2  314543.873          ns/op
MegaMorphicLoop.peeled          1  avgt    2   12272.004          ns/op
MegaMorphicLoop.peeled          2  avgt    2   79541.629          ns/op
MegaMorphicLoop.peeled          3  avgt    2   99999.983          ns/op
MegaMorphicLoop.peeled          4  avgt    2  113230.267          ns/op

-Xint
Benchmark                (params)  Mode  Cnt        Score   Error  Units
MegaMorphicLoop.morphic         1  avgt    2  1415381.501          ns/op
MegaMorphicLoop.morphic         2  avgt    2  1568624.611          ns/op
MegaMorphicLoop.morphic         3  avgt    2  1431867.799          ns/op
MegaMorphicLoop.morphic         4  avgt    2  1428699.067          ns/op
MegaMorphicLoop.peeled          1  avgt    2  1430297.965          ns/op
MegaMorphicLoop.peeled          2  avgt    2  1551667.590          ns/op
MegaMorphicLoop.peeled          3  avgt    2  1772520.711          ns/op
MegaMorphicLoop.peeled          4  avgt    2  1713067.861          ns/op

Lambdas are slow

Let's try some streams first

final int SIZE = 10240;
final int[] integers = new int[SIZE];

public int lambdaArrayCold()
{
    return Arrays.stream(integers)
        .filter(i -> i % 2 == 0).sum();
}

public int lambdaIntStreamCold()
{
    return IntStream.range(0, SIZE)
        .filter(i -> i % 2 == 0).sum();
}

public int loopCold()
{
    int sum = 0;
    for (int i = 0; i < integers.length; i++)
    {
        if (i % 2 == 0)
        {
            sum += integers[i];
        }
    }
    
    return sum;
}
Default
Benchmark                   Mode  Cnt        Score  Units
lambda01.lambdaArrayCold      avgt    5     50,673.799 ns/op
lambda01.lambdaArrayHot       avgt    5     11,084.661 ns/op
lambda01.lambdaIntStreamCold  avgt    5     76,926.132 ns/op
lambda01.lambdaIntStreamHot   avgt    5     56,211.713 ns/op
lambda01.loopCold             avgt    5      7,355.324 ns/op
lambda01.loopHot              avgt    5      8,420.860 ns/op

-Xint
Benchmark                   Mode  Cnt        Score   Units
lambda01.lambdaArrayCold      avgt    5  3,674,930.417 ns/op
lambda01.lambdaArrayHot       avgt    5  3,199,644.700 ns/op
lambda01.lambdaIntStreamCold  avgt    5  3,498,449.750 ns/op
lambda01.lambdaIntStreamHot   avgt    5  3,270,098.067 ns/op
lambda01.loopCold             avgt    5    166,555.552 ns/op
lambda01.loopHot              avgt    5    166,784.633 ns/op

-Xcomp
Benchmark                   Mode  Cnt       Score  Units
lambda01.lambdaArrayCold      avgt    5    128,531.100 ns/op
lambda01.lambdaArrayHot       avgt    5     97,316.828 ns/op
lambda01.lambdaIntStreamCold  avgt    5    128,284.283 ns/op
lambda01.lambdaIntStreamHot   avgt    5    153,985.807 ns/op
lambda01.loopCold             avgt    5      9,520.145 ns/op
lambda01.loopHot              avgt    5     10,760.031 ns/op

Lambda Functions

public int forLambda(final Predicate<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int forAnonymous(final PseudoLambda<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int simple() {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (array[i] % 2 == 0) {
            sum++;
        }
    }
    return sum;
}
  • Compare Predicate, a hand-rolled Pseudo-Predicate (a guess at it follows below), and a direct implementation
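
PseudoLambda itself is not shown here; presumably it is just a hand-rolled functional interface used with an anonymous class instead of a lambda, something like:

interface PseudoLambda<T>
{
    boolean test(T value);
}

// usage with an anonymous class
int even = forAnonymous(new PseudoLambda<Integer>()
{
    @Override
    public boolean test(Integer i)
    {
        return i % 2 == 0;
    }
});
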
Benchmark                      Mode Cnt     Score    Units
forAnonymousClass             thrpt   2  9149.416   ops/ms
forLambdaPredicate            thrpt   2  9456.789   ops/ms
forLambdaPredicatePredefined  thrpt   2  9137.711   ops/ms
simple                        thrpt   2  9037.135   ops/ms

-Xint
Benchmark                      Mode  Cnt     Score  Units
forAnonymousClass             thrpt    2    67.478 ops/ms
forLambdaPredicate            thrpt    2    60.300 ops/ms
forLambdaPredicatePredefined  thrpt    2    72.308 ops/ms
simple                        thrpt    2   411.272 ops/ms

!! Throughput not runtime !!

HashMap

The default HashMap is supposedly bad... so let's get a new one

  • Four different HashMaps
  • javolution.util.FastMap
  • gnu.trove...THashMap
  • fastutil...Object2ObjectOpenHashMap
  • java.util.HashMap
  • Write or read test
Benchmark                      Mode Cnt      Score   Units
HashMapTest.readJDKMap         avgt  2   90,056.856  ns/op
HashMapTest.readJLMap          avgt  2   98,586.577  ns/op
HashMapTest.readFastUtilsMap   avgt  2  148,826.875  ns/op
HashMapTest.readTrove4jMap     avgt  2  161,337.933  ns/op

HashMapTest.writeJDKMap        avgt  2  372,811.896  ns/op
HashMapTest.writeFastUtilsMap  avgt  2  523,318.130  ns/op
HashMapTest.writeJLMap         avgt  2  575,261.486  ns/op
HashMapTest.writeTrove4JMap    avgt  2  693,145.167  ns/op
Benchmark                              Mode       Score Units
writeFastUtilsMap:·gc.alloc.rate.norm  avgt 262,432.112  B/op
writeTrove4JMap:·gc.alloc.rate.norm    avgt 411,288.161  B/op
writeJDKMap:·gc.alloc.rate.norm        avgt 458,880.090  B/op
writeJLMap:·gc.alloc.rate.norm         avgt 852,248.116  B/op
  • Consider me unimpressed!

HashMaps again

Benchmarks can be misleading

  • We learned before that benchmarking is hard
  • So try a different setup
  • Read and write one after the other, and also hit some misses
Benchmark                    (SIZE)  Mode  Cnt       Score   Error   Units
HashMapMixTest.jlMap             10  avgt    2      322.634          ns/op
HashMapMixTest.jlMap            100  avgt    2    3,241.468          ns/op
HashMapMixTest.jlMap           1000  avgt    2   39,945.111          ns/op
HashMapMixTest.jlMap          10000  avgt    2  587,367.589          ns/op
 
HashMapMixTest.jdkMap            10  avgt    2      417.161          ns/op
HashMapMixTest.jdkMap           100  avgt    2    3,872.609          ns/op
HashMapMixTest.jdkMap          1000  avgt    2   43,488.347          ns/op
HashMapMixTest.jdkMap         10000  avgt    2  649,102.621          ns/op

HashMapMixTest.fastUtilsMap      10  avgt    2      273.397          ns/op
HashMapMixTest.fastUtilsMap     100  avgt    2    3,169.923          ns/op
HashMapMixTest.fastUtilsMap    1000  avgt    2   44,502.276          ns/op
HashMapMixTest.fastUtilsMap   10000  avgt    2  838,007.539          ns/op

HashMapMixTest.troveMap          10  avgt    2      434.216          ns/op
HashMapMixTest.troveMap         100  avgt    2    4,763.608          ns/op
HashMapMixTest.troveMap        1000  avgt    2   60,432.867          ns/op
HashMapMixTest.troveMap       10000  avgt    2  978,056.680          ns/op

HashMaps and Memory

Let's throw in some memory

  • Our example from before
  • Write and remove
  • Had to shift it towards more writes to see any difference
Benchmark                                              (SIZE)  Mode  Cnt        Score   Error   Units
HashMapMemTest.fastUtilsMap                                10  avgt    2       310.796           ns/op
HashMapMemTest.fastUtilsMap                               100  avgt    2     7,142.161           ns/op
HashMapMemTest.fastUtilsMap                              1000  avgt    2   108,633.371           ns/op
HashMapMemTest.fastUtilsMap                             10000  avgt    2   892,746.165           ns/op

HashMapMemTest.jdkMap                                      10  avgt    2       599.702           ns/op
HashMapMemTest.jdkMap                                     100  avgt    2     5,512.704           ns/op
HashMapMemTest.jdkMap                                    1000  avgt    2    66,108.162           ns/op
HashMapMemTest.jdkMap                                   10000  avgt    2   826,412.256           ns/op

HashMapMemTest.jlMap                                       10  avgt    2       823.435           ns/op
HashMapMemTest.jlMap                                      100  avgt    2    23,207.931           ns/op
HashMapMemTest.jlMap                                     1000  avgt    2    62,190.147           ns/op
HashMapMemTest.jlMap                                    10000  avgt    2 1,024,182.562           ns/op

HashMapMemTest.troveMap                                    10  avgt    2     1,667.100           ns/op
HashMapMemTest.troveMap                                   100  avgt    2    20,846.917           ns/op
HashMapMemTest.troveMap                                  1000  avgt    2   199,411.343           ns/op
HashMapMemTest.troveMap                                 10000  avgt    2 2,348,563.339           ns/op

HashMapMemTest.fastUtilsMap:·gc.alloc.rate.norm            10  avgt    2        ≈ 10⁻⁴            B/op
HashMapMemTest.fastUtilsMap:·gc.alloc.rate.norm           100  avgt    2         0.002            B/op
HashMapMemTest.fastUtilsMap:·gc.alloc.rate.norm          1000  avgt    2         0.024            B/op
HashMapMemTest.fastUtilsMap:·gc.alloc.rate.norm         10000  avgt    2         0.246            B/op

HashMapMemTest.jdkMap:·gc.alloc.rate.norm                  10  avgt    2       226.072            B/op
HashMapMemTest.jdkMap:·gc.alloc.rate.norm                 100  avgt    2     2,048.536            B/op
HashMapMemTest.jdkMap:·gc.alloc.rate.norm                1000  avgt    2    20,253.878            B/op
HashMapMemTest.jdkMap:·gc.alloc.rate.norm               10000  avgt    2   202,265.849            B/op

HashMapMemTest.jlMap:·gc.alloc.rate.norm                   10  avgt    2       492.173            B/op
HashMapMemTest.jlMap:·gc.alloc.rate.norm                  100  avgt    2    26,723.155            B/op
HashMapMemTest.jlMap:·gc.alloc.rate.norm                 1000  avgt    2    20,253.185            B/op
HashMapMemTest.jlMap:·gc.alloc.rate.norm                10000  avgt    2   202,270.452            B/op

HashMapMemTest.troveMap:·gc.alloc.rate.norm                10  avgt    2       567.145            B/op
HashMapMemTest.troveMap:·gc.alloc.rate.norm               100  avgt    2     3,459.933            B/op
HashMapMemTest.troveMap:·gc.alloc.rate.norm              1000  avgt    2    40,321.631            B/op
HashMapMemTest.troveMap:·gc.alloc.rate.norm             10000  avgt    2   208,018.280            B/op

Conclusion

Performance and measuring it can be tricky

Rule 1

Don't worry too much about performance. Java is very good at taking care of it.

Rule 2

Don't optimize prematurely.

Rule 3

Simply write clean and good code. You won't see most of the issues ever. Think before you code. Java can make code fast, but bad algorithms stay bad algorithms, and bad designs stay bad designs.

Rule 4

When you worry, worry about the hottest code. Measure carefully and apply what you learned. The compiler and the CPU are your friends, but they also make your profiling life harder.

Rule 5

Understand, measure, review, test, and measure again.

Rule 6

The code you write is not the code that is executed. Measure under production conditions to understand and avoid optimizing non-compiled and non-optimized code.

Bonus Content

Comparing JDK 8, JDK 11, and GraalVM

Bonus: Escape Analysis

Use the stack instead of the heap

### OpenJDK 1.8.0_201-b09
Benchmark               Mode  Cnt   Score  Units
EscapeAnalysis.array63  avgt    2  30.731  ns/op
EscapeAnalysis.array64  avgt    2  30.660  ns/op
EscapeAnalysis.array65  avgt    2  64.117  ns/op

### OpenJDK 11.0.2+9
Benchmark               Mode  Cnt   Score  Units
EscapeAnalysis.array63  avgt    2  30.581  ns/op
EscapeAnalysis.array64  avgt    2  30.898  ns/op
EscapeAnalysis.array65  avgt    2  60.778  ns/op
    
### GraalVM 1.0.0-rc12 
Benchmark               Mode  Cnt   Score  Units
EscapeAnalysis.array63  avgt    2  53.834  ns/op
EscapeAnalysis.array64  avgt    2  54.201  ns/op
EscapeAnalysis.array65  avgt    2  55.582  ns/op

### GraalVM 1.0.0-rc12 with disabled Graal JIT
Benchmark               Mode  Cnt   Score  Units
EscapeAnalysis.array63  avgt    2  30.570  ns/op
EscapeAnalysis.array64  avgt    2  30.586  ns/op
EscapeAnalysis.array65  avgt    2  62.307  ns/op
### GraalVM 1.0.0-rc12 with disabled Graal JIT
Benchmark                                                Mode  Cnt     Score   Error   Units
EscapeAnalysis.array64                                   avgt    2    30.902           ns/op
EscapeAnalysis.array64:·gc.alloc.rate                    avgt    2    ≈ 10⁻⁴          MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm               avgt    2    ≈ 10⁻⁶            B/op
EscapeAnalysis.array64:·gc.count                         avgt    2       ≈ 0          counts
EscapeAnalysis.array65                                   avgt    2    67.923           ns/op
EscapeAnalysis.array65:·gc.alloc.rate                    avgt    2  3573.491          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm               avgt    2   280.000            B/op
EscapeAnalysis.array65:·gc.count                         avgt    2    58.000          counts

### GraalVM 1.0.0-rc12
Benchmark                                                Mode  Cnt     Score   Error   Units
EscapeAnalysis.array64                                   avgt    2    53.766           ns/op
EscapeAnalysis.array64:·gc.alloc.rate                    avgt    2  4390.046          MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm               avgt    2   272.258            B/op
EscapeAnalysis.array64:·gc.count                         avgt    2    71.000          counts
EscapeAnalysis.array65                                   avgt    2    56.409           ns/op
EscapeAnalysis.array65:·gc.alloc.rate                    avgt    2  4306.519          MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm               avgt    2   280.290            B/op
EscapeAnalysis.array65:·gc.count                         avgt    2    70.000          counts

Bonus: Loop Optimization

Flatten out the Loops

### OpenJDK 1.8.0_201-b09
Benchmark            (size)  Mode  Cnt      Score   Error  Units
LoopUnroll.classic    10000  avgt    2  24127.350          ns/op
LoopUnroll.variable   10000  avgt    2  34738.485          ns/op

### OpenJDK 11.0.2+9
Benchmark            (size)  Mode  Cnt      Score   Error  Units
LoopUnroll.classic    10000  avgt    2  24130.359          ns/op
LoopUnroll.variable   10000  avgt    2  36324.211          ns/op
    
### GraalVM 1.0.0-rc12 
Benchmark            (size)  Mode  Cnt      Score   Error  Units
LoopUnroll.classic    10000  avgt    2  25042.661          ns/op
LoopUnroll.variable   10000  avgt    2  25709.628          ns/op

### GraalVM 1.0.0-rc12 with disabled Graal JIT
Benchmark            (size)  Mode  Cnt      Score   Error  Units
LoopUnroll.classic    10000  avgt    2  24207.002          ns/op
LoopUnroll.variable   10000  avgt    2  36822.005          ns/op

Bonus: Intrinsics or Native

Native code to the rescue

              OpenJDK 1.8.0_201-b09        OpenJDK 11.0.2+9
                  Manual     System       Manual     System
===========================================================
Iteration   1: 1,848,588    141,586    2,273,238    146,863
Iteration   2:   892,577    121,540      863,448    148,519
Iteration   3:   398,532    105,858      806,948    137,621
Iteration   4:   372,315     94,825      711,051    145,200
Iteration   5: 1,026,252     86,088      400,594     99,668
Iteration   6:   285,201    113,580      343,561    156,568
Iteration   7:   239,481    100,616      346,154    138,162
Iteration  20:    92,447    114,357      105,415     95,984

                  GraalVM 1.0.0-rc12      GraalVM 1.0.0-rc12 Classic C2 
                  Manual     System       Manual    System
===========================================================
Iteration   1: 2,749,350    247,863    1,748,441    115,802
Iteration   2:   709,289    127,796      640,278    192,041
Iteration   3:   585,415    118,018      619,790    100,181
Iteration   4:   597,294    110,933      623,115     90,116
Iteration   5:   588,853    119,067      251,239     88,891
Iteration   6:   549,106    118,401      227,545     91,431
Iteration   7:   619,271    124,600      165,236     85,492
Iteration  20:   602,838    125,727       87,943     90,784
Iteration 100:   480,763    125,155      119,415     97,711

Questions and Answers