High Performance Java

About the Smart Java Internals

René Schwietzke, Xceptance GmbH, Jena, Germany

About

Xceptance

  • Focused on Software Test and Quality Assurance
  • Performance Testing, Test Automation, Functional Testing, QA, and Test Process Consulting
  • Founded 2004
  • Headquarters in Jena, Germany
  • Subsidiary in Cambridge, MA, USA
  • Performance Test Tool XLT, Java-based, Apache License 2.0

René Schwietzke

  • Co-Founder and Managing Director of Xceptance
  • Working with Java since version 1.0
  • Performance Tester since 1999
  • @ReneSchwietzke

Sources

Java and Performance

Java is much more than what most people think. Spoilers!

The times when Java was a bytecode interpreter, running slow and sluggish, are long gone. But most people don't know the magic that is going on to speed up Java to native code speed and beyond.

Actually, most programmers don't even know how their code gets executed or how a computer works... sadly.

  • How speed is achieved and the mysterious things behind that
  • Hardware matters
  • Avoid common mistakes, but beyond that, trust Java
  • The code you write is not the code that runs

An Early Warning or The Disclaimer

Optimization and tuning is fun... but...

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth, Turing Award Lecture 1974

  • This talk is NOT about what your code should look like
  • If you expect tips like use StringBuilder instead of... this talk is not for you
  • This talk is not about the fastest code (bummer!)
  • It is about understanding the JVM and getting measurements right
  • It is about avoiding working against Java
  • It is also about the right approach to performance tuning

Another Motivation or Early Proof

Java is efficient but the why has to be explained

Let's Benchmark

Get Some Measurements Going

Poor Man's Benchmark 1

import java.util.Arrays;

// Timer is a small stopwatch helper class (not shown here)
public class PoorMansBenchmark
{
    private static String append(final int i)
    {
        final StringBuilder sb = new StringBuilder();
        sb.append(System.currentTimeMillis());
        sb.append("Foo");
        sb.append(i);
        sb.append('c');
        sb.append("lkJAHJ AHSDJHF KJASHF HSAKJFD");

        final String s = sb.toString();
        char[] ch = s.toCharArray();
        Arrays.sort(ch);

        return new String(ch);
    }

    public static void main (final String[] args)
    {
        final int SIZE = 100_000;
        final long[] result = new long[SIZE];

        int sum = 0;
        for (int i = 0; i < SIZE; i++)
        {
            final Timer t = Timer.startTimer();

            sum += append(i).length();

            result[i] = t.stop().runtimeNanos();
        }

        // pretend we need it but we really don't
        // just to avoid optimizations (spoilers, duh!)
        if (sum < 0)
        {
            System.out.println(sum);
        }
    }
}

Just benchmark something...

  • Some nonsense with StringBuilder and String
  • We call our method in a loop
  • A helper tracks the time

Let's Run the Poor Man's Benchmark


# Size: 10
  1,  437,607
  2,   22,873
  3,   18,180
  4,   17,597
  5,   17,698
  6,   18,582
  7,   17,607
  8,   18,200
  9,   19,798
 10,   19,224

# Size: 100
   1,  516,854
   2,   32,120
   3,   18,300
   4,   17,633
   5,   17,126
   6,   20,441
   7,   17,503
   8,   18,101
   9,   17,483
  10,   17,085
  20,   18,877
  30,   18,818
  40,   18,301
  50,   23,070
  60,   18,888
  70,   21,198
  80,   12,854
  90,   11,948
 100,   11,879

# Size: 1,000
    1,  402,289
    2,   22,654
    3,   18,931
    4,   18,435
    5,   17,494
    6,   19,335
    7,   17,798
    8,   18,991
    9,   18,162
   10,   17,625
   20,   19,639
   30,   19,123
   40,   19,255
   50,   20,874
   60,   18,435
   70,   18,070
   80,   13,012
   90,   12,678
  100,   12,566
  200,   10,230
  300,    7,902
  400,    5,919
  500,    4,765
  600,    4,451
  700,    4,199
  800,    3,865
  900,    3,956
1,000,    3,339

# Size: 10,000
     1,  714,799
     2,   22,859
     3,   18,531
     4,   17,673
     5,   17,444
     6,   18,501
     7,   17,245
     8,   18,362
     9,   17,224
    10,   17,244
   100,   12,138
   200,   12,647
   300,    7,421
   400,    6,503
   500,    4,678
   600,    4,259
   700,    4,099
   800,    3,960
   900,    3,571
 1,000,    3,620
 2,000,    2,224
 3,000,    1,666
 4,000,    1,616
 5,000,    1,645
 6,000,    1,167
 7,000,    1,207
 8,000,    1,137
 9,000,    1,098
10,000,      987

# Size: 1,000,000
        1,  719,086
        2,   36,834
        3,   26,186
        4,   25,529
        5,   25,329
        6,   27,658
        7,   25,488
        8,   26,424
        9,   26,683
       10,   26,623
      100,   17,835
    1,000,    4,877
   10,000,    1,125
   20,000,    1,035
   30,000,      766
   40,000,      657
   50,000,      667
   60,000,      627
   70,000,      497
   80,000,      518
   90,000,      498
  100,000,    1,184
  200,000,      437
  300,000,      428
  400,000,      448
  500,000,      418
  600,000,      537
  700,000,      548
  800,000,      438
  900,000,      478
1,000,000,      438

The Java Hater's Response

That must be Garbage Collection

  • Let's disable GC
  • Epsilon GC to the rescue
  • A no-op garbage collector
  • It allocates memory but never reclaims it
  • -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC
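Epsilon is an experimental collector, so it has to be unlocked first. An invocation sketch (the heap size is an arbitrary choice so the run does not die of OutOfMemoryError too early):

```shell
# No-GC run: Epsilon hands out memory but never reclaims it
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx4g PoorMansBenchmark
```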

# G1 GC
# Size: 10,000
     1,  615,818
     2,   24,265
     3,   18,772
     4,   17,692
     5,   17,712
     6,   19,408
     7,   17,884
     8,   17,873
     9,   17,793
    10,   17,601
   100,   17,671
   200,    8,785
   300,    6,210
   400,    5,110
   500,    5,302
   600,    3,636
   700,    4,040
   800,    2,908
   900,    2,464
 1,000,    1,969
 2,000,    1,474
 3,000,    1,576
 4,000,    1,606
 5,000,    1,505
 6,000,    1,515
 7,000,    1,495
 8,000,    1,514
 9,000,    1,384
10,000,      858

# No GC
# Size: 10,000
     1,  847,466
     2,   65,884
     3,   37,987
     4,   36,217
     5,   40,327
     6,   36,357
     7,   35,896
     8,   36,267
     9,   35,967
    10,   36,467
   100,   19,918
   200,   11,039
   300,    3,970
   400,    3,620
   500,    3,669
   600,    3,600
   700,    3,370
   800,    3,039
   900,    2,840
 1,000,    2,739
 2,000,    1,650
 3,000,    1,410
 4,000,    1,430
 5,000,    1,399
 6,000,    1,040
 7,000,    1,029
 8,000,    1,020
 9,000,    1,040
10,000,    1,050

# ShenandoahGC
# Size: 10,000
     1,  599,049
     2,   42,120
     3,   17,912
     4,   16,799
     5,   16,263
     6,   18,672
     7,   17,164
     8,   16,435
     9,   16,577
    10,   16,658
   100,   12,195
   200,    4,706
   300,    4,220
   400,    7,661
   500,    6,355
   600,    6,690
   700,    4,513
   800,    5,515
   900,    5,283
 1,000,    3,167
 2,000,    3,350
 3,000,    3,835
 4,000,    3,299
 5,000,    1,468
 6,000,    1,457
 7,000,    1,001
 8,000,    1,012
 9,000,    1,305
10,000,      921

The Java Hater's Response II

Java is slow

  • GraalVM to the rescue
  • Let's compile to machine code
  • native-image

# Java 17
# Size: 100,000
      1,  465,864
      2,   21,818
      3,   18,809
      4,   17,809
      5,   18,052
      6,   22,413
      7,   17,547
      8,   19,112
      9,   17,466
     10,   17,456
    100,   12,419
  1,000,    3,736
  2,000,    2,312
  3,000,    1,464
  4,000,    1,454
  5,000,    1,081
  6,000,    1,091
  7,000,    1,080
  8,000,    1,080
  9,000,      989
 10,000,    1,009
 20,000,    1,757
 30,000,      646
 40,000,      565
 50,000,      596
 60,000,      556
 70,000,      495
 80,000,      494
 90,000,      484
100,000,      414

# Graal 22 / Native
# Size: 100,000
      1,   11,251
      2,    2,798
      3,    1,933
      4,    1,771
      5,    1,630
      6,    1,570
      7,    1,590
      8,    1,540
      9,    3,935
     10,    1,600
    100,    1,670
  1,000,    2,023
  2,000,    1,681
  3,000,    1,660
  4,000,    1,670
  5,000,    1,429
  6,000,    1,369
  7,000,    3,733
  8,000,    1,610
  9,000,    1,399
 10,000,    1,419
 20,000,      574
 30,000,    1,389
 40,000,      574
 50,000,      563
 60,000,      724
 70,000,      564
 80,000,      554
 90,000,      554
100,000,      564

That is not helpful

The Data is so Inconsistent

Nightmares!

6 hours lost due to a BIOS bug

Let's Analyze

Greetings from the Java Internals

  • Java has three(!) compilers
  • You interact only with javac
  • To make your code fly, bytecode is compiled to machine code (JITed)
  • Classic: Hotspot Compiler (our focus right now)
  • New Kid: Graal Compiler
  • Tiered Compiler: C1 and C2 with four modes
  • Profiles and optimizes continuously(!)
  • Can inline, loop-unroll, run escape analysis, branch predict, use intrinsics, do lock tricks, and a few more things (yes, we will get to that later)
  • Compiler decides what code is hot
  • Compilation is per method
  • In addition, loops are compiled, because the method might be cold, but the loop is hot

Some more information about Hotspot

Just to get a feel for the complexity

Compiler Stages [1]

  • Level 0: Interpreter only
  • Level 1: C1, full optimization, no profiling
  • Level 2: C1 with invocation and backedge counters
  • Level 3: C1 with full profiling
  • Level 4: C2, fully optimized
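The levels map to JVM flags, which is how runs like the later `-Xint`, `JIT 1` … `No Tiered` measurements can be reproduced. An invocation sketch, using the benchmark class from earlier as a stand-in:

```shell
java -Xint                    PoorMansBenchmark   # level 0: interpreter only
java -XX:TieredStopAtLevel=1  PoorMansBenchmark   # C1 only (JIT 1)
java -XX:TieredStopAtLevel=2  PoorMansBenchmark   # JIT 1-2
java -XX:TieredStopAtLevel=3  PoorMansBenchmark   # JIT 1-3
java -XX:TieredStopAtLevel=4  PoorMansBenchmark   # JIT 1-4 (the default)
java -XX:-TieredCompilation   PoorMansBenchmark   # no tiers: C2 only
```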

Compiler Flow [2]

Compare Interpreter vs. Compiler

Hotspot

No Tiered Compiler

Interpreter Only

Compile Everything

Compile Or Not to Compile

Compile when Needed vs. Right from the Start

  • -Xcomp forces Java to always compile to native code
  • The JVM can still change its mind and recompile later
  • 70x slower in the beginning to be faster later
  • Effect 1: We compile things we might never need again
  • Effect 2: We might run out of code cache
# defaults
      1, 560,960
      2,  34,772
      3,  26,167
      4,  24,237
      5,  35,926
      6,  24,116
      7,  23,163
      8,  22,895
      9,  24,142
     10,  23,174
    100,  23,596
    200,  12,973
    300,  12,361
    400,   9,852
    500,   9,596
    600,   8,774
    700,   8,956
    800,   7,280
    900,   7,076
   1000,   6,089
  10000,   5,071
  20000,   2,683
  30000,   2,703
  40000,   1,532
  50000,   1,121
  60000,   1,012
  70000,     975
  80000,     979
  90000,     987
 100000,     976
 900000,     700
1000000,     708
# -Xcomp
      1, 40,041,490
      2,     51,254
      3,     17,217
      4,      2,077
      5,      1,714
      6,      1,607
      7,      1,610
      8,      1,536
      9,      1,576
     10,      1,536
    100,      1,376
    200,      2,481
    300,      1,284
    400,      1,307
    500,      1,150
    600,      1,177
    700,      1,173
    800,      1,163
    900,      1,174
   1000,      1,546
  10000,      1,096
  20000,        962
  30000,        999
  40000,      1,093
  50000,        891
  60000,        883
  70000,        885
  80000,        862
  90000,        894
 100000,        869
 900000,        871
1000000,        870

Compile Or Not to Compile


# -Xint
      1,   443,537
      2,    20,145
      3,    18,862
      4,    18,324
      5,    16,484
      6,    16,345
      7,    16,494
      8,    16,186
      9,    16,563
     10,    16,513
    100,    17,340
    200,    17,091
    300,    16,782
    400,    17,111
    500,    17,240
    600,    17,608
    700,    17,012
    800,    18,832
    900,    16,763
  1,000,    16,902
  2,000,    16,942
  3,000,    17,609
  4,000,    16,822
  5,000,    17,042
  6,000,    16,991
  7,000,    17,021
  8,000,    16,683
  9,000,    16,653
 10,000,    18,891
100,000,    17,558
200,000,    20,303
300,000,    21,516
400,000,    19,050
500,000,    19,477
600,000,    19,676
700,000,    19,775
800,000,    19,517
900,000,    19,665

# JIT 1
      1,   439,989
      2,    20,364
      3,    16,531
      4,    15,973
      5,    15,811
      6,    16,490
      7,    15,841
      8,    16,095
      9,    15,619
     10,    15,172
    100,    10,213
    200,    10,892
    300,     8,286
    400,     4,888
    500,     3,682
    600,     2,647
    700,     2,292
    800,     1,917
    900,     1,714
  1,000,       974
  2,000,       700
  3,000,       689
  4,000,       700
  5,000,       699
  6,000,       689
  7,000,       700
  8,000,       689
  9,000,       680
 10,000,       690
100,000,       629
200,000,       761
300,000,       852
400,000,       771
500,000,       832
600,000,       761
700,000,       830
800,000,       829
900,000,       809

# JIT 1-2
      1,   469,922
      2,    21,275
      3,    18,368
      4,    17,044
      5,    17,681
      6,    18,457
      7,    17,532
      8,    17,940
      9,    17,143
     10,    17,472
    100,    11,498
    200,    11,717
    300,     7,069
    400,     4,321
    500,     3,175
    600,     3,325
    700,     2,121
    800,     1,603
    900,     1,503
  1,000,     1,264
  2,000,       995
  3,000,     1,005
  4,000,       995
  5,000,     1,005
  6,000,     1,005
  7,000,     1,005
  8,000,       996
  9,000,       996
 10,000,     1,005
100,000,       976
200,000,     1,185
300,000,     1,185
400,000,     1,204
500,000,     1,264
600,000,     1,235
700,000,     1,234
800,000,     1,294
900,000,     1,284

# JIT 1-3
      1,   378,533
      2,    21,978
      3,    18,809
      4,    17,991
      5,    17,364
      6,    18,451
      7,    17,962
      8,    17,752
      9,    17,324
     10,    17,544
    100,    11,692
    200,    12,050
    300,    10,347
    400,     6,070
    500,     4,366
    600,     3,947
    700,     3,339
    800,     2,791
    900,     2,403
  1,000,     2,193
  2,000,     1,425
  3,000,     1,445
  4,000,     1,435
  5,000,     1,436
  6,000,     1,445
  7,000,     1,425
  8,000,     1,415
  9,000,     1,425
 10,000,     1,426
100,000,     1,425
200,000,     1,735
300,000,     1,764
400,000,     1,924
500,000,     1,753
600,000,     1,724
700,000,     1,704
800,000,     1,724
900,000,     1,784

# JIT 1-4
      1,   489,002
      2,    30,301
      3,    18,479
      4,    17,639
      5,    17,659
      6,    19,248
      7,    17,509
      8,    18,838
      9,    17,109
     10,    17,309
    100,    11,933
    200,    12,173
    300,     7,425
    400,     6,346
    500,     5,366
    600,     4,547
    700,     4,148
    800,     3,867
    900,     3,527
  1,000,     3,547
  2,000,     2,229
  3,000,     1,409
  4,000,     2,119
  5,000,     1,429
  6,000,     1,049
  7,000,     1,039
  8,000,     1,039
  9,000,     1,059
 10,000,     1,039
100,000,       379
200,000,       460
300,000,       459
400,000,       519
500,000,       529
600,000,       470
700,000,       470
800,000,       469
900,000,       499

# No Tiered
      1,   772,154
      2,    50,772
      3,    33,622
      4,    29,284
      5,    29,294
      6,    30,371
      7,    29,155
      8,    29,405
      9,    28,786
     10,    29,075
    100,    27,131
    200,    26,174
    300,    26,871
    400,    27,300
    500,    28,986
    600,    35,606
    700,    23,900
    800,    27,799
    900,    15,485
  1,000,    14,826
  2,000,     7,626
  3,000,     5,093
  4,000,     4,007
  5,000,     3,409
  6,000,     3,559
  7,000,     3,199
  8,000,     2,931
  9,000,     2,323
 10,000,     2,123
100,000,       399
200,000,       558
300,000,       539
400,000,       528
500,000,       548
600,000,       528
700,000,       528
800,000,       578
900,000,       548

Watch the Compiler


1157  104 %     3       org.sample.poor.PoorMansBenchmark::main @ 17 (470 bytes)
1162  105       3       org.sample.poor.PoorMansBenchmark::main (470 bytes)
1328  135       3       org.sample.poor.PoorMansBenchmark::append (66 bytes)
1454  153       4       org.sample.poor.PoorMansBenchmark::append (66 bytes)
1561  178 %     4       org.sample.poor.PoorMansBenchmark::main @ 55 (470 bytes)
1566  135       3       org.sample.poor.PoorMansBenchmark::append (66 bytes)   made not entrant
6002  178 %     4       org.sample.poor.PoorMansBenchmark::main @ 55 (470 bytes)   made not entrant
6069  367 %     3       org.sample.poor.PoorMansBenchmark::main @ 100 (470 bytes)
6082  407 %     4       org.sample.poor.PoorMansBenchmark::main @ 100 (470 bytes)

  • Time Stamp since start
  • Compilation order
  • Compiler level
  • Method and size of the bytecode
  • n: native
  • s: synchronized
  • !: has exception handler
  • %: On-stack replacement (OSR)
  • made not entrant: no longer to be used
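A log like the one above can be produced with a single diagnostic flag (sketch; the class name is taken from the log itself):

```shell
# Print every JIT compilation event while the benchmark runs
java -XX:+PrintCompilation org.sample.poor.PoorMansBenchmark
```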

Java Performance Fun

Things to Know when Executing or Profiling Code

Ohhhh Trap

You trained it well

private final static int SIZE = 1000;

public static void main(String[] args)
{
    Object trap = null;
    Object o = null;

    for (int i = 0; i < 4000; i++)
    {
        final Timer t = Timer.startTimer();

        for (int j = 0; j < SIZE; j++)
        {
            // burn time and train that null is normal
            o = new Object();

            if (trap != null)
            {
                System.out.println("Got you." + o);
                trap = null;
            }
        }

        // Give me a Non-Null, Vasily.
        // One Non-Null only, please.
        if (i == 2000)
        {
            trap = new Object();
        }

        System.out.println(
            MessageFormat.format("{1} {0, number, #}",
            t.stop().runtimeNanos(), i));
    }
}

Learning: Unused Code

If I don't need it, I remove it

  • Java freely removes code
    • Results are not needed
    • Code is probably not executed
    • Even loops might die and not just stay empty
  • Rare conditions can completely change the runtime behavior forever
  • Danger: That can kill all your benchmark results
  • Danger: One null or one exception can turn your performance upside-down
  • Important: If your code is too complicated, Java cannot make that call
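The benchmark earlier used the `if (sum < 0)` print as a sink. A sketch of the same idea as a standalone class (class and field names are illustrative; in JMH you would hand results to its Blackhole instead):

```java
// Sketch of a "sink" that keeps a computed value alive so the JIT
// cannot eliminate the work as dead code
public class KeepAlive
{
    // volatile writes cannot be proven dead, so the computation
    // feeding the sink has to survive optimization
    static volatile int SINK;

    static int work(final int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
        {
            sum += i;
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        for (int i = 0; i < 10_000; i++)
        {
            SINK = work(1_000); // consumed, hence not removable
        }
        System.out.println(SINK); // 499500 = 0 + 1 + ... + 999
    }
}
```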

More is Less

You can horribly shoot yourself in the foot

/**
 * JMH Benchmark
 * Vary SIZE from 1 to 10000
 */
public int nonSense()
{
    Object trap = null;
    Object o = null;

    for (int i = 0; i < OUTER; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            o = new Object();

            if (trap != null)
            {
                o = new Object();
                trap = null;
            }
        }

        if (i == 500)
        {
            trap = new Object();
        }
    }

    return trap == null ? 1 : 0;
}
  • We vary the inner loop length
# One Shot Execution, aka measure one execution and no warm-up
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseSingle         1    ss         85,298          ns/op
RemoveNonSenseJMH.nonSenseSingle        10    ss        513,886          ns/op
RemoveNonSenseJMH.nonSenseSingle       100    ss      3,603,910          ns/op
RemoveNonSenseJMH.nonSenseSingle      1000    ss      9,465,274          ns/op
RemoveNonSenseJMH.nonSenseSingle     10000    ss     17,528,758          ns/op
# Repeated Execution, No Warm up, 100 ms duration
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseNoWarmup       1  avgt            963          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup      10  avgt            600          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup     100  avgt            597          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup    1000  avgt            697          ns/op
RemoveNonSenseJMH.nonSenseNoWarmup   10000  avgt          2,258          ns/op
# Warm it Up First, 100 ms duration
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseWarmup         1  avgt            660          ns/op
RemoveNonSenseJMH.nonSenseWarmup        10  avgt            609          ns/op
RemoveNonSenseJMH.nonSenseWarmup       100  avgt            409          ns/op
RemoveNonSenseJMH.nonSenseWarmup      1000  avgt            439          ns/op
RemoveNonSenseJMH.nonSenseWarmup     10000  avgt            410          ns/op
# -Xint to verify
Benchmark                           (SIZE)  Mode  Cnt     Score   Error  Units
RemoveNonSenseJMH.nonSenseWarmup         1  avgt        113,536          ns/op
RemoveNonSenseJMH.nonSenseWarmup        10  avgt      1,110,772          ns/op
RemoveNonSenseJMH.nonSenseWarmup       100  avgt     11,449,399          ns/op
RemoveNonSenseJMH.nonSenseWarmup      1000  avgt    112,158,390          ns/op
RemoveNonSenseJMH.nonSenseWarmup     10000  avgt  1,104,793,429          ns/op

Understand Runtime Internals

Now We Can Really Talk About Performance

Your loop is easy


// r is a java.util.Random instance field (not shown)
private int next(final int bound) {
    return r.nextInt(bound) + 1;
}

@Benchmark
public int classic() {
    int s1 = 1;
    final int step = next(1);

    for (int i = 0; i < SIZE; i = i + 1) {
        s1 += i;
    }

    return s1;
}

@Benchmark
public int variable() {
    int s1 = 1;
    final int step = next(1);

    for (int i = 0; i < SIZE; i = i + step) {
        s1 += i;
    }

    return s1;
}

@Benchmark
public int manualUnroll() {
    int s1 = 1;
    final int step = next(1) + 9;

    for (int i = 0; i < SIZE; i = i + step) {
        s1 += i;
        s1 += i + 1;
        ...
        s1 += i + 9;
    }

    return s1;
}

Flatten out loops


# Benchmark    (size)  Mode  Cnt       Score Units
classic        10000   avgt    4   2,563.905 ns/op
variable       10000   avgt    4   5,125.580 ns/op
manualUnroll   10000   avgt    4   2,646.231 ns/op

# -Xint
classic        10000   avgt    4  89,497.225 ns/op
variable       10000   avgt    4  88,081.687 ns/op
manualUnroll   10000   avgt    4  50,411.112 ns/op
  • The CPU hates to jump!

classic                                10000  avgt       2,593 ns/op
classic:branches                       10000  avgt         706  #/op
classic:L1-dcache-loads                10000  avgt         184  #/op
classic:instructions                   10000  avgt      17,452  #/op
classic:cycles                         10000  avgt      10,860  #/op
classic:IPC                            10000  avgt       1.607  #/op

manualUnroll                           10000  avgt       2,660 ns/op
manualUnroll:branches                  10000  avgt       1,061  #/op
manualUnroll:L1-dcache-loads           10000  avgt       2,189  #/op
manualUnroll:instructions              10000  avgt      22,585  #/op
manualUnroll:cycles                    10000  avgt      11,321  #/op
manualUnroll:IPC                       10000  avgt       1.972  #/op

variable                               10000  avgt       5,263 ns/op
variable:branches                      10000  avgt      10,059  #/op
variable:L1-dcache-loads               10000  avgt      20,144  #/op
variable:instructions                  10000  avgt      90,128  #/op
variable:cycles                        10000  avgt      21,595  #/op
variable:IPC                           10000  avgt       4.167  #/op

Detour: The modern CPU

Everything happens in parallel

  • 1 Hz is one cycle per second; a 4 GHz CPU runs about 4 billion cycles per second
  • Parallel execution units per core
  • Parallel execution of code despite code not being programmed that way
  • CPU reorders when permitted to fill the execution units
  • Tries to do things ahead of the result (prediction, speculation)
  • When units aren't filled, performance is wasted
  • Branches/jumps are expensive
  • Current Intels can do 3 to 4* instructions per cycle
  • Current AMDs can do 5 to 6* instructions per cycle
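An illustrative sketch, not from the talk: the classic sorted-vs-unsorted demo of why branches hurt. The work is identical; only the order of the data changes, and with it the branch predictor's hit rate. (All names are made up for this example.)

```java
import java.util.Arrays;
import java.util.Random;

public class BranchDemo
{
    static long sumAboveThreshold(final int[] data)
    {
        long sum = 0;
        for (final int v : data)
        {
            if (v >= 128) // hard to predict for random data
            {
                sum += v;
            }
        }
        return sum;
    }

    static long time(final int[] data)
    {
        final long start = System.nanoTime();
        final long s = sumAboveThreshold(data);
        final long runtime = System.nanoTime() - start;
        if (s < 0) // keep the result alive, same trick as before
        {
            System.out.println(s);
        }
        return runtime;
    }

    public static void main(final String[] args)
    {
        final int[] data = new Random(42).ints(1_000_000, 0, 256).toArray();
        final long unsorted = time(data);
        Arrays.sort(data);
        final long sorted = time(data);
        System.out.println("unsorted: " + unsorted + " ns, sorted: " + sorted + " ns");
        // On many CPUs the sorted run is noticeably faster; if the JIT
        // turns the branch into a conditional move, both end up similar.
    }
}
```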

Escape Analysis

Make it less expensive

@Benchmark
public long array64()
{
    int[] a = new int[64];

    a[0] = r.nextInt();
    a[1] = r.nextInt();

    return a[0] + a[1];
}

@Benchmark
public long array65()
{
    int[] a = new int[65];

    a[0] = r.nextInt();
    a[1] = r.nextInt();

    return a[0] + a[1];
}

array64                                       avgt    2    10.598           ns/op
array65                                       avgt    2    31.383           ns/op

# -prof gc
array64                                       avgt    2    10.598           ns/op
array64:·gc.alloc.rate                        avgt    2    ≈ 10⁻⁴          MB/sec
array64:·gc.alloc.rate.norm                   avgt    2    ≈ 10⁻⁶            B/op

array65                                       avgt    2    31.383           ns/op
array65:·gc.alloc.rate                        avgt    2  7732.432          MB/sec
array65:·gc.alloc.rate.norm                   avgt    2   280.000            B/op

array64Returned                               avgt    2    30.605           ns/op
array64Returned:·gc.alloc.rate                avgt    2  7703.857          MB/sec
array64Returned:·gc.alloc.rate.norm           avgt    2   272.000            B/op

# -XX:EliminateAllocationArraySizeLimit=128
array65                                       avgt    2    10.569           ns/op
array65:·gc.alloc.rate                        avgt    2    ≈ 10⁻⁴          MB/sec
array65:·gc.alloc.rate.norm                   avgt    2    ≈ 10⁻⁶            B/op
array65:·gc.count                             avgt    2       ≈ 0          counts

array64                                  avgt    2   10.474          ns/op
array64:IPC                              avgt         1.555           #/op
array64:L1-dcache-load-misses            avgt         0.005           #/op
array64:L1-dcache-loads                  avgt        26.915           #/op
array64:branches                         avgt         9.137           #/op
array64:cycles                           avgt        44.428           #/op
array64:instructions                     avgt        69.077           #/op

array65                                  avgt    2   30.526          ns/op
array65:IPC                              avgt         0.968           #/op
array65:L1-dcache-load-misses            avgt         4.427           #/op
array65:L1-dcache-loads                  avgt        47.829           #/op
array65:branches                         avgt        19.767           #/op
array65:cycles                           avgt       130.389           #/op
array65:instructions                     avgt       126.335           #/op
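To verify it really is escape analysis (scalar replacement) at work, HotSpot lets you switch it off. A sketch; the jar and benchmark names are hypothetical, and the flag goes to JMH's forked JVM via -jvmArgs:

```shell
# With escape analysis off, array64 allocates on the heap too and
# -prof gc reports a non-zero allocation rate for it as well
java -jar benchmarks.jar EscapeAnalysisBenchmark -prof gc \
     -jvmArgs "-XX:-DoEscapeAnalysis"
```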

Intrinsics or Native

Native code to the rescue


final int SIZE = 100_000;
final int[] src = new int[SIZE];

public int[] manualArrayCopy()
{
    final int[] target = new int[SIZE];
    for (int i = 0; i < SIZE; i++)
    {
        target[i] = src[i];
    }

    return target;
}

public int[] systemArrayCopy()
{
    final int[] target = new int[SIZE];
    System.arraycopy(src, 0, target, 0, SIZE);

    return target;
}

public int[] arraysCopy()
{
    return Arrays.copyOf(src, src.length);
}

# Benchmark  Mode  Cnt    Score    Error  Units
Manual       avgt    2   44,112           ns/op
System       avgt    2   44,893           ns/op
Arrays       avgt    2   45,300           ns/op
  • Uh... native is not faster?
  • Well, once Java has compiled its code, it is native too

Iteration         Manual         System
===============================================
Iteration   1: 2,061,452 ns/op   77,621 ns/op
Iteration   2:   592,751 ns/op  120,478 ns/op
Iteration   3:   477,828 ns/op  103,568 ns/op
Iteration   4:   479,590 ns/op  102,592 ns/op
Iteration   5:   482,536 ns/op  151,151 ns/op
Iteration   6:   482,785 ns/op   87,365 ns/op
Iteration   7:   710,620 ns/op  187,857 ns/op
...
Iteration  80:    46,009 ns/op   66,394 ns/op
  • Depends on your hardware
  • Code execution count matters!
  • Hotspot can vary its code and intrinsics selection based on CPU instructions available

Wait... What are Intrinsics?

Native Code vs. Intrinsics

  • Native: A JNI based method call that calls any kind of provided native platform code
  • Intrinsics: Native platform code, but not called as a method; instead placed directly into the Hotspot-compiled code
  • Project Panama will make JNI nicer and faster

# -XX:+PrintCompilation -XX:+PrintInlining
@ 54   java.lang.Math::min (11 bytes)   (intrinsic)
@ 57   java.lang.System::arraycopy (0 bytes)   (intrinsic)
  • public static native void arraycopy(...) can be a native call or an intrinsic
  • Java makes that decision when the VM is started based on CPU features
  • Even for methods written in pure Java, the JVM can and will use intrinsics instead, see java.lang.Math
  • Instead of compiling to native step-by-step, it will insert pre-determined native code
  • @HotSpotIntrinsicCandidate
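The full invocation for an inlining log like the one above; PrintInlining is a diagnostic flag and has to be unlocked first (the class name is assumed for the sketch):

```shell
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining \
     IntrinsicsArrayCopy
```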

Intrinsics or Native II


# Lenovo T530, JDK 8
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.manual  avgt    2   48,177  ns/op
IntrinsicsArrayCopy.system  avgt    2   48,678  ns/op
 

# Lenovo T450s, JDK 8, BAT
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.manual  avgt    2   66,560  ns/op
IntrinsicsArrayCopy.system  avgt    2   67,051  ns/op
IntrinsicsArrayCopy.arrays  avgt    2   60,674  ns/op

# Lenovo T450s, JDK 11, BAT
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.manual  avgt    2   67,476  ns/op
IntrinsicsArrayCopy.system  avgt    2   58,952  ns/op
IntrinsicsArrayCopy.arrays  avgt    2   58,771  ns/op

# Lenovo T450s, JDK 11, AC
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.manual  avgt    2   56,108  ns/op
IntrinsicsArrayCopy.system  avgt    2   55,748  ns/op
IntrinsicsArrayCopy.arrays  avgt    2   53,769  ns/op

# Lenovo T450s, JDK 11, BAT, Min Power
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.manual  avgt    2  215,360  ns/op
IntrinsicsArrayCopy.system  avgt    2  210,971  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  214,870  ns/op

# Lenovo T14s, AMD, JDK 11, AC
Benchmark                   Mode  Cn     Score  Units
IntrinsicsArrayCopy.arrays  avgt    2   45,295  ns/op
IntrinsicsArrayCopy.manual  avgt    2   44,141  ns/op
IntrinsicsArrayCopy.system  avgt    2   44,861  ns/op

# Lenovo T14s, AMD, JDK 11, AC, SMT off, Boost off
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.arrays  avgt    2   52,025  ns/op
IntrinsicsArrayCopy.manual  avgt    2   59,333  ns/op
IntrinsicsArrayCopy.system  avgt    2   51,319  ns/op

# AWS c5.xlarge, JDK 11
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.manual  avgt    2   83,066  ns/op
IntrinsicsArrayCopy.system  avgt    2   63,422  ns/op
IntrinsicsArrayCopy.arrays  avgt    2   64,748  ns/op

# GCP n1-standard-4, JDK 11
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.manual  avgt    2  146,932  ns/op
IntrinsicsArrayCopy.system  avgt    2  101,970  ns/op
IntrinsicsArrayCopy.arrays  avgt    2  116,233  ns/op

# GCP n1-highcpu-8, JDK 11
Benchmark                   Mode  Cnt    Score  Units
IntrinsicsArrayCopy.manual  avgt    2   82,865  ns/op
IntrinsicsArrayCopy.system  avgt    2   62,810  ns/op
IntrinsicsArrayCopy.arrays  avgt    2   61,237  ns/op

Stupid Compiler

Let's create some Strings


@Setup
public void setup() {
    foo = String.valueOf(System.currentTimeMillis()) + "ashdfhgas dfhsa df";
}

public String plain() {
    return "a" + foo;
}

public String concat() {
    return "a".concat(foo);
}

public String builder() {
    return new StringBuilder().append("a").append(foo).toString();
}

public String builderSized() {
    return new StringBuilder(70).append("a").append(foo).toString();
}

public String builderNonFluid() {
     final StringBuilder sb = new StringBuilder();
     sb.append("a");
     sb.append(foo);
     return sb.toString();
}

public String builderNonFluidSized() {
    final StringBuilder sb = new StringBuilder(70);
    sb.append("a");
    sb.append(foo);
    return sb.toString();
}

# JDK 11
# Benchmark                      Mode  Cnt   Score  Units
Strings1.builder               avgt    2  12.822    ns/op
Strings1.builderSized          avgt    2  13.055    ns/op
Strings1.plain                 avgt    2  13.203    ns/op
Strings1.concat                avgt    2  16.199    ns/op
Strings1.builderNonFluidSized  avgt    2  22.070    ns/op
Strings1.builderNonFluid       avgt    2  24.335    ns/op

# JDK 17
Strings1.builderSized          avgt    2  10.635    ns/op
Strings1.plain                 avgt    2  10.979    ns/op
Strings1.concat                avgt    2  11.121    ns/op
Strings1.builder               avgt    2  12.929    ns/op
Strings1.builderNonFluidSized  avgt    2  19.207    ns/op
Strings1.builderNonFluid       avgt    2  24.035    ns/op
  • A sized builder is not faster?
  • A builder that is not fluent is slower?
  • The Java JIT follows known optimization patterns
  • Yes, javac is the first to transform the code

# -Xint
Benchmark                      Mode  Cnt     Score   Units
Strings1.concat                avgt    2    809.838  ns/op
Strings1.builderNonFluidSized  avgt    2  2,575.899  ns/op
Strings1.builderSized          avgt    2  2,738.723  ns/op
Strings1.builderNonFluid       avgt    2  2,969.153  ns/op
Strings1.plain                 avgt    2  2,982.004  ns/op
Strings1.builder               avgt    2  3,115.392  ns/op
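That javac already rewrites string code can be seen without a benchmark: constant expressions are folded and interned at compile time, while anything involving a runtime value is not. A minimal sketch (class and method names are made up):

```java
public class JavacFolding {
    static final String SUFFIX = "b"; // a compile-time constant

    // javac folds "a" + SUFFIX into the single literal "ab",
    // which is interned - so it is the very same object as "ab".
    static boolean constantIsFolded() {
        String s = "a" + SUFFIX;
        return s == "ab";
    }

    // With a runtime value, javac must emit real concatenation code
    // (invokedynamic on JDK 9+, StringBuilder before that),
    // producing a fresh String object.
    static boolean runtimeIsNotFolded(String foo) {
        String s = "a" + foo;
        return s != "ab";
    }

    public static void main(String[] args) {
        System.out.println(constantIsFolded());      // true
        System.out.println(runtimeIsNotFolded("b")); // true
    }
}
```

`javap -c` on the compiled class shows the folded literal directly in the constant pool.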

Let's talk hardware

Hardware does not matter... you wish!

final int SIZE = 100_000;
final int[] src = new int[SIZE];

@Benchmark
public int step1()
{
    int sum = 0;
    for (int i = 0; i < 100_000; i++)
    {
        sum += src[i];
    }

    return sum;
}

@Benchmark
public int step20()
{
    int sum = 0;
    for (int i = 0; i < 100_000; i = i + 20)
    {
        sum += src[i];
    }

    return sum;
}
  • Theory: step20 must be about 20x faster

Benchmark                      Mode  Cnt       Score   Units
ArraysAndHardware.step1        avgt    2  24,526.814   ns/op
ArraysAndHardware.step20       avgt    2   2,609.979   ns/op
  • We expected about 1,226 ns/op (1/20th)
  • We got only a 9.4x speedup
  • What? Why?
  • P.S. step20 results vary a lot across repeated runs
  • On a T450s, the speedup was only 3.3x!
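The missing factor is the cache line. A 64-byte line holds 16 ints, so step1 gets 16 values per memory fetch, while step20's accesses are 80 bytes apart and land on a fresh line almost every time. A small sketch (assuming 64-byte lines and a contiguous 4-bytes-per-int layout) counts the lines each variant touches:

```java
import java.util.HashSet;
import java.util.Set;

public class CacheLineCount {
    static final int LINE_BYTES = 64; // typical x86 cache line
    static final int INT_BYTES = 4;

    // Counts how many distinct cache lines a strided walk over
    // 'elements' ints touches, assuming a contiguous array.
    static long linesTouched(int elements, int stride) {
        Set<Long> lines = new HashSet<>();
        for (long i = 0; i < elements; i += stride) {
            lines.add(i * INT_BYTES / LINE_BYTES);
        }
        return lines.size();
    }

    public static void main(String[] args) {
        System.out.println(linesTouched(100_000, 1));  // step1:  6250 lines
        System.out.println(linesTouched(100_000, 20)); // step20: 5000 lines
    }
}
```

step20 does 1/20th of the work but still pulls 5,000 of the 6,250 cache lines from memory, which is why the speedup is nowhere near 20x.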

Hardware Greetings

Also Java has to live by hardware laws


Benchmark                       Mode  Cnt       Score   Error  Units
step1                           avgt    2  24,562.946          ns/op
step1:IPC                       avgt            1.173           #/op
step1:CPI                       avgt            0.852           #/op
step1:L1-dcache-load-misses     avgt        6,324.421           #/op
step1:L1-dcache-loads           avgt      101,184.005           #/op
step1:branch-misses             avgt            9.841           #/op
step1:branches                  avgt        6,417.104           #/op
step1:cycles                    avgt      103,033.534           #/op
step1:instructions              avgt      120,883.682           #/op
step1:stalled-cycles-backend    avgt       82,888.746           #/op
step1:stalled-cycles-frontend   avgt           96.538           #/op

18,593,601,312      L1-dcache-loads           #    3.995 G/sec
 1,140,028,415      L1-dcache-load-misses     #    6.13% of all L1-dcache accesses


Benchmark                       Mode  Cnt       Score   Error  Units
step20                          avgt    2   2,632.399          ns/op
step20:IPC                      avgt            1.855           #/op
step20:CPI                      avgt            0.539           #/op
step20:L1-dcache-load-misses    avgt        5,293.077           #/op
step20:L1-dcache-loads          avgt        5,210.832           #/op
step20:branch-misses            avgt            7.153           #/op
step20:branches                 avgt        5,062.571           #/op
step20:cycles                   avgt       11,036.292           #/op
step20:instructions             avgt       20,462.526           #/op
step20:stalled-cycles-backend   avgt        5,496.080           #/op
step20:stalled-cycles-frontend  avgt           57.150           #/op

 9,963,540,045      L1-dcache-loads           #    2.010 G/sec
 9,753,693,673      L1-dcache-load-misses     #   97.89% of all L1-dcache accesses

Detour: Cost

What it costs to leave the CPU

                     Real       Humanized
CPU Cycle            0.4 ns     1 s
L1 Cache             0.9 ns     2 s
L2 Cache             2.8 ns     7 s
L3 Cache             28 ns      1 min
Memory Access        100 ns     4 min
NVMe SSD             25 µs      17 h
SSD                  50–150 µs  1.5–4 days
HDD                  1–10 ms    1–9 months
TCP SF to NY         65 ms      5 years
TCP SF to Hong Kong  141 ms     11 years
  • It costs 100 to 200 cycles when we have to go to main memory
  • Hence Java and the CPU avoid main memory access at all cost
  • Java does not have the notion of a memory location, hence this is transparent
  • Things will change with Project Valhalla and its "inline classes", which bring primitive data type performance to "objects"
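The humanized column is just the real latency rescaled so that one 0.4 ns CPU cycle becomes one second, i.e. a factor of 2.5 billion. A throwaway helper reproduces the table:

```java
public class HumanizedLatency {
    // One 0.4 ns cycle maps to 1 s: scale factor 1 / 0.4e-9 = 2.5e9.
    static final double SCALE = 2.5e9;

    // Converts a real latency in nanoseconds to humanized seconds.
    static double humanizedSeconds(double nanos) {
        return nanos * 1e-9 * SCALE;
    }

    public static void main(String[] args) {
        System.out.println(humanizedSeconds(0.4));    // CPU cycle: 1 s
        System.out.println(humanizedSeconds(100));    // memory: 250 s, about 4 min
        System.out.println(humanizedSeconds(25_000)); // NVMe SSD: 62,500 s, about 17 h
    }
}
```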

Data: She brought me closer to humanity than I ever thought possible, and for a time...I was tempted by her offer.
Picard: How long a time?
Data: 0.68 seconds, sir. For an android, that is nearly an eternity.

The Matrix


final int[][] src = new int[1000][1000];

@Benchmark
public int horizontal() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

@Benchmark
public int vertical() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[j][i];
        }
    }
    return sum;
}

@Benchmark
public int horizontalBackwards() {
    int sum = 0;
    for (int i = SIZE - 1; i >= 0; i--)
    {
        for (int j = SIZE - 1; j >= 0; j--)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

What is real and what is not

  • All calculations should have the same speed, shouldn't they?

# Lenovo T530, Intel i7
# Benchmark                              Mode       Score   Error  Units
horizontal                               avgt     470,009          ns/op
horizontalBackwards                      avgt     672,821          ns/op
vertical                                 avgt   3,209,562          ns/op
  • Horizontal access is about 7x faster; forward is about 30% faster than backwards!

# Lenovo T14s, AMD Ryzen 7
# Benchmark                    (SIZE)  Mode  Cnt     Score  Error  Units
horizontal                       1000  avgt    2   314,558         ns/op
horizontalBackwards              1000  avgt    2   291,384         ns/op
vertical                         1000  avgt    2 1,280,180         ns/op
  • Horizontal access is about 4x faster; forward vs. backwards makes almost no difference anymore
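Assuming the 1000x1000 matrix ends up laid out row after row in memory (an idealization - a Java int[][] is really an array of separate row objects) and 64-byte cache lines, we can count how often consecutive accesses cross into a new line:

```java
public class MatrixLineChanges {
    static final int N = 1000;
    static final int LINE_BYTES = 64;

    // Counts cache-line transitions for both traversal orders,
    // modelling the matrix as one contiguous block of N*N ints.
    static long lineChanges(boolean horizontal) {
        long changes = 0;
        long prevLine = -1;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                long index = horizontal ? (long) i * N + j : (long) j * N + i;
                long line = index * 4 / LINE_BYTES;
                if (line != prevLine) {
                    changes++;
                    prevLine = line;
                }
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        System.out.println(lineChanges(true));  // horizontal:    62,500
        System.out.println(lineChanges(false)); // vertical:   1,000,000
    }
}
```

62,500 is remarkably close to the ~63,600 L1-dcache-load-misses per op measured above; vertically, every single access lands on a new line.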

The Matrix Reloaded


final int[][] src = new int[1000][1000];

@Benchmark
public int horizontal() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

@Benchmark
public int vertical() {
    int sum = 0;
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            sum += src[j][i];
        }
    }
    return sum;
}

@Benchmark
public int horizontalBackwards() {
    int sum = 0;
    for (int i = SIZE - 1; i >= 0; i--)
    {
        for (int j = SIZE - 1; j >= 0; j--)
        {
            sum += src[i][j];
        }
    }
    return sum;
}

# Lenovo T530
Benchmark                                           Mode       Score    Units
Matrix.horizontal                                   avgt     470,009     ns/op
Matrix.horizontal:L1-dcache-load-misses             avgt      63,645      #/op
Matrix.horizontal:L1-dcache-loads                   avgt   1,014,532      #/op
Matrix.horizontal:LLC-load-misses                   avgt         429      #/op
Matrix.horizontal:LLC-loads                         avgt       1,523      #/op

Matrix.horizontalBackwards                          avgt     672,821     ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses    avgt      63,053      #/op
Matrix.horizontalBackwards:L1-dcache-loads          avgt   1,007,381      #/op
Matrix.horizontalBackwards:LLC-load-misses          avgt         854      #/op
Matrix.horizontalBackwards:LLC-loads                avgt       2,764      #/op

Matrix.vertical                                     avgt   3,209,562     ns/op
Matrix.vertical:L1-dcache-load-misses               avgt   2,060,998      #/op
Matrix.vertical:L1-dcache-loads                     avgt   3,020,796      #/op
Matrix.vertical:LLC-load-misses                     avgt      15,223      #/op
Matrix.vertical:LLC-loads                           avgt     694,397      #/op

# Lenovo T14s
Benchmark                                           Mode       Score     Units
Matrix.horizontal                                   avgt     286,633     ns/op
Matrix.horizontal:L1-dcache-load-misses             avgt      63,902      #/op
Matrix.horizontal:L1-dcache-loads                   avgt   1,031,015      #/op
Matrix.horizontal:stalled-cycles-backend            avgt   1,015,230      #/op
Matrix.horizontal:stalled-cycles-frontend           avgt       4,848      #/op

Matrix.horizontalBackwards                          avgt     278,992     ns/op
Matrix.horizontalBackwards:L1-dcache-load-misses    avgt      64,526      #/op
Matrix.horizontalBackwards:L1-dcache-loads          avgt   1,033,686      #/op
Matrix.horizontalBackwards:stalled-cycles-backend   avgt     870,881      #/op
Matrix.horizontalBackwards:stalled-cycles-frontend  avgt       5,029      #/op

Matrix.vertical                                     avgt   1,484,345     ns/op
Matrix.vertical:L1-dcache-load-misses               avgt   2,114,064      #/op
Matrix.vertical:L1-dcache-loads                     avgt   3,108,037      #/op
Matrix.vertical:stalled-cycles-backend              avgt   5,081,442      #/op
Matrix.vertical:stalled-cycles-frontend             avgt      16,206      #/op

Matrix - You probably did it wrong

Influence of the Size on the Results


Benchmark            (SIZE)  Mode  Cnt         Score  Factor  Units
horizontal                1  avgt    2             5          ns/op
horizontal               10  avgt    2            35          ns/op
horizontal               20  avgt    2           107          ns/op
horizontal              100  avgt    2         2,546          ns/op
horizontal             1000  avgt    2       306,505          ns/op
horizontal             5000  avgt    2     9,901,156          ns/op
horizontal            10000  avgt    2    39,579,171          ns/op

vertical                  1  avgt    2             5    1.0   ns/op
vertical                 10  avgt    2           101    2.9   ns/op
vertical                 20  avgt    2           292    2.7   ns/op
vertical                100  avgt    2         5,921    2.3   ns/op
vertical               1000  avgt    2     1,748,734    5.7   ns/op
vertical               5000  avgt    2   272,269,384   27.5   ns/op
vertical              10000  avgt    2 1,884,958,277   47.6   ns/op

horizontalBackwards       1  avgt    2             5          ns/op
horizontalBackwards      10  avgt    2            36          ns/op
horizontalBackwards      20  avgt    2           107          ns/op
horizontalBackwards     100  avgt    2         2,544          ns/op
horizontalBackwards    1000  avgt    2       390,983          ns/op
horizontalBackwards    5000  avgt    2    10,800,172          ns/op
horizontalBackwards   10000  avgt    2    40,368,771          ns/op
  • Understand your future data set
  • Use production hardware
  • Verify theories!

Wrong again?

Understand Your Hardware

  • Executed by 1 to 8 independent threads
  • What might be the difference?

# RUN 1 - T14s, AMD
Benchmark     Mode  Cnt  Score   Diff   Units
threads1      avgt    2  250.5          ms/op
threads2      avgt    2  257.0    +3%   ms/op
threads4      avgt    2  284.6   +13%   ms/op
threads6      avgt    2  330.6   +32%   ms/op
threads8      avgt    2  375.9   +50%   ms/op
  • Only one or two active cores run in turbo mode
  • With more than three, the clock is throttled to keep the system cool

# RUN 2 - T14s, AMD, turbo mode off for verification
Benchmark     Mode  Cnt  Score   Diff  Units
threads1      avgt    2  611.4         ms/op
threads2      avgt    2  612.6    +0%  ms/op
threads4      avgt    2  611.4    +0%  ms/op
threads6      avgt    2  616.4    +1%  ms/op
threads8      avgt    2  624.8    +2%  ms/op

private final int SIZE = 100_000_000;
private final int[] SOURCE = new int[SIZE];

double doCPUBoundStuff(int[] data) {
    int numDataPoints = data.length;
    int i = 0;
    double avg = 0;
    double var = 0;

    data[0] = 0;
    data[1] = 1;

    for (i = 2; i < numDataPoints; i++) {
        data[i] = data[i-1] + data[i-2];
    }

    for (avg = 0, i = 0; i < numDataPoints; i++) {
        avg += (double) data[i];
    }
    avg /= (double) numDataPoints;

    for (var = 0, i = 0; i < numDataPoints; i++) {
        double diff = (double) data[i] - avg;
        var += diff*diff;
    }
    var /= (double) (numDataPoints - 1);

    return avg;
}

@Setup
public void setup() {
    for (int i = 0; i < SIZE; i++) {
        SOURCE[i] = i;
    }
}

@Benchmark
@Threads(1) // repeated with 2, 4, 6, and 8 threads
public double t1() {
    return doCPUBoundStuff(SOURCE);
}

I know what you will do next

Modern Fortune Teller


private static final int COUNT = 10_000;

// Each contains random numbers from -50 to 50
private int[] sorted;
private int[] unsorted;
private int[] reversed;

public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum = sum + 2;
        }
        else
        {
            sum = sum + 1;
        }
    }

    return sum;
}
  • Will there be any difference when passing sorted or unsorted arrays?

# T14s
# Benchmark  Mode  Cnt       Score   Error  Units
sorted       avgt    2   3,423.415          ns/op
reversed     avgt    2   3,467.932          ns/op
unsorted     avgt    2  10,600.557          ns/op
  • Sorted is about 3x faster. Why?
  • Let's ask the CPU
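The predictor only struggles when the outcome of `v > 0` keeps flipping. A simple proxy (our own helper, not part of the benchmark) counts how often the branch direction changes from one element to the next:

```java
import java.util.Arrays;
import java.util.Random;

public class BranchFlips {
    // Counts how often the v > 0 outcome differs from the previous
    // element - a rough proxy for how hard the branch is to predict.
    static int flips(int[] a) {
        int count = 0;
        for (int i = 1; i < a.length; i++) {
            if ((a[i] > 0) != (a[i - 1] > 0)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Random r = new Random(42L);
        int[] data = new int[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = r.nextInt(101) - 50; // -50..50, like the benchmark
        }

        int[] sorted = data.clone();
        Arrays.sort(sorted);

        System.out.println(flips(sorted)); // a single flip: trivially predictable
        System.out.println(flips(data));   // thousands of flips
    }
}
```

A sorted array flips exactly once at the zero crossing; a shuffled one flips roughly every second element, and each mispredicted flip costs a pipeline flush.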

Branch Prediction

Modern CPU marvels: I know what you might need next


# Benchmark                                         Mode  Cnt      Score   Error      Units
BranchPrediction1.sorted                            avgt    2   3329.529              ns/op
BranchPrediction1.sorted:IPC                        avgt           3.812          insns/clk
BranchPrediction1.sorted:L1-dcache-load-misses      avgt         643.914               #/op
BranchPrediction1.sorted:L1-dcache-loads            avgt       10287.609               #/op
BranchPrediction1.sorted:L1-icache-load-misses      avgt           0.201               #/op
BranchPrediction1.sorted:L1-icache-loads            avgt         241.148               #/op
BranchPrediction1.sorted:branch-misses              avgt           5.800               #/op
BranchPrediction1.sorted:branches                   avgt       17527.359               #/op
BranchPrediction1.sorted:cycles                     avgt       13969.192               #/op
BranchPrediction1.sorted:instructions               avgt       53254.559               #/op
BranchPrediction1.sorted:stalled-cycles-backend     avgt         155.405               #/op
BranchPrediction1.sorted:stalled-cycles-frontend    avgt          16.920               #/op

BranchPrediction1.unsorted                          avgt    2  14399.107              ns/op
BranchPrediction1.unsorted:IPC                      avgt           0.901          insns/clk
BranchPrediction1.unsorted:L1-dcache-load-misses    avgt         653.839               #/op
BranchPrediction1.unsorted:L1-dcache-loads          avgt       11429.217               #/op
BranchPrediction1.unsorted:L1-icache-load-misses    avgt           1.365               #/op
BranchPrediction1.unsorted:L1-icache-loads          avgt       21877.925               #/op
BranchPrediction1.unsorted:branch-misses            avgt         956.056               #/op
BranchPrediction1.unsorted:branches                 avgt       17693.113               #/op
BranchPrediction1.unsorted:cycles                   avgt       59200.151               #/op
BranchPrediction1.unsorted:instructions             avgt       53320.560               #/op
BranchPrediction1.unsorted:stalled-cycles-backend   avgt         225.695               #/op
BranchPrediction1.unsorted:stalled-cycles-frontend  avgt        1366.556               #/op

Branches: Know what you test

How small changes can mess up things big time


final int d = i % 2 == 0 ?
    -10 : 10;


public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum += 2;
        }
        else
        {
            sum += 1;
        }
    }

    return sum;
}

Benchmark  Mode   Score   Units
===============================
sorted     avgt   3,177   ns/op
unsorted   avgt   3,709   ns/op

final int d = i % 2 == 0
    ? -r.nextInt(5) : r.nextInt(5);


public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum += 2;
        }
        else
        {
            sum += 1;
        }
    }

    return sum;
}

Benchmark  Mode   Score   Units
===============================
sorted     avgt   3,231   ns/op
unsorted   avgt   5,175   ns/op

final int d = i % 2 == 0
    ? -r.nextInt(5) - 1 :
       r.nextInt(5) + 1;

public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum += 2;
        }
        else
        {
            sum += 1;
        }
    }

    return sum;
}

Benchmark  Mode   Score   Units
===============================
sorted     avgt   3,369   ns/op
unsorted   avgt   3,671   ns/op

final int d = r.nextInt();



public int doIt(int[] array)
{
    int sum = 0;
    for (int v : array)
    {
        if (v > 0)
        {
            sum += 2;
        }
        else
        {
            sum += 1;
        }
    }

    return sum;
}

Benchmark  Mode   Score   Units
===============================
sorted     avgt   3,240   ns/op
unsorted   avgt  15,440   ns/op
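If the data really is unpredictable, one escape hatch (our own variation, not from the benchmarks above) is to remove the branch entirely and let the sign bit do the work:

```java
public class BranchlessSum {
    // Branchy original: +2 for v > 0, +1 otherwise.
    static int doIt(int[] array) {
        int sum = 0;
        for (int v : array) {
            sum += (v > 0) ? 2 : 1;
        }
        return sum;
    }

    // Branchless: for v > 0, -v has its sign bit set, so (-v) >>> 31 is 1.
    // Caveat: wrong for v == Integer.MIN_VALUE, where -v overflows.
    static int doItBranchless(int[] array) {
        int sum = 0;
        for (int v : array) {
            sum += 1 + ((-v) >>> 31);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = { -50, -1, 0, 1, 2, 50 };
        System.out.println(doIt(data));           // 9
        System.out.println(doItBranchless(data)); // 9
    }
}
```

Whether this actually wins must be measured: the JIT may already emit a conditional move for the ternary, in which case there is no branch to mispredict.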

Short Summary

Measurement

  • Naive benchmarks are wrong
  • No JMH, no fun
  • Even JMH measurements can be wrong
  • How and where you measure makes a difference

JDK

  • Forget your Java optimization knowledge
  • Java now is not what most people think Java is

JVM

  • Tiered runtime compilation
  • JIT removes code when not needed
  • JIT tries to keep data local
  • Loops are optimized
  • Intrinsics are used dynamically

Hardware

  • Memory access is expensive
  • Caches are your friend
  • Branches are expensive
  • The CPU speculates

Synchronization

Being alone and still having to sync

private final int SIZE = 1024;
private final Map<String, String> syncMap =
            Collections.synchronizedMap(new HashMap<>(SIZE));
private final Map<String, String> unsyncMap =
            new HashMap<>(SIZE);
private final Map<String, String> conMap =
            new ConcurrentHashMap<>(SIZE);

@Benchmark
public String syncAccess()
{
    return syncMap.get("1");
}
@Benchmark
public String unsyncAccess()
{
    return unsyncMap.get("1");
}
@Benchmark
public String conAccess()
{
    return conMap.get("1");
}
  • Which one is slowest, and by how much?
  • Remember: just one thread, so no contention

Benchmark                       Mode  Cnt   Score   Units
SynchronizedAlone.conAccess     avgt    2   9.152   ns/op
SynchronizedAlone.syncAccess    avgt    2  23.462   ns/op
SynchronizedAlone.unsyncAccess  avgt    2   9.366   ns/op
  • Synced is 2.6x slower
  • ConcurrentHashMap does well
  • Moving the maps to local scope totally changes the result - lock elision
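Functionally the three flavors are interchangeable; they only differ in how (and whether) each call takes a lock. A plain sketch, including the iteration caveat documented in the Collections.synchronizedMap javadoc:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapFlavors {
    // Same put/get behavior regardless of the wrapper around the map.
    static String readFrom(Map<String, String> m) {
        m.put("1", "one");
        return m.get("1");
    }

    public static void main(String[] args) {
        System.out.println(readFrom(new HashMap<>()));                            // one
        System.out.println(readFrom(Collections.synchronizedMap(new HashMap<>()))); // one
        System.out.println(readFrom(new ConcurrentHashMap<>()));                  // one

        // Per the Collections.synchronizedMap javadoc, iteration over the
        // wrapper still needs manual synchronization on the wrapper object:
        Map<String, String> synced = Collections.synchronizedMap(new HashMap<>());
        synced.put("1", "one");
        synchronized (synced) {
            for (String key : synced.keySet()) {
                System.out.println(key); // 1
            }
        }
    }
}
```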

With Contention

Let's try more threads

2 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2    9.721          ns/op
SynchronizedThreads4.syncAccess    avgt    2  183.086          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2    9.312          ns/op
4 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   19.661          ns/op
SynchronizedThreads4.syncAccess    avgt    2  216.041          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   19.631          ns/op
8 Threads
====================================================================
Benchmark                          Mode  Cnt    Score   Error  Units
SynchronizedThreads4.conAccess     avgt    2   44.055          ns/op
SynchronizedThreads4.syncAccess    avgt    2  377.180          ns/op
SynchronizedThreads4.unsyncAccess  avgt    2   38.562          ns/op

Looping

Loops compared


Benchmark                           (length)  Mode  Cnt   Score   Error  Units
ForEachSimple.arrayFor                     1  avgt    2     3.8          ns/op
ForEachSimple.arrayFor                    10  avgt    2     9.8          ns/op
ForEachSimple.arrayFor                   100  avgt    2    64.4          ns/op
ForEachSimple.arrayFor                  1000  avgt    2 1,446.1          ns/op

ForEachSimple.arrayEnhanced                1  avgt    2     3.5          ns/op
ForEachSimple.arrayEnhanced               10  avgt    2     9.5          ns/op
ForEachSimple.arrayEnhanced              100  avgt    2    63.2          ns/op
ForEachSimple.arrayEnhanced             1000  avgt    2 1,199.3          ns/op

ForEachSimple.classicFor                   1  avgt    2     5.7          ns/op
ForEachSimple.classicFor                  10  avgt    2    15.5          ns/op
ForEachSimple.classicFor                 100  avgt    2   103.0          ns/op
ForEachSimple.classicFor                1000  avgt    2 1,388.8          ns/op

ForEachSimple.enhancedFor                  1  avgt    2     6.1          ns/op
ForEachSimple.enhancedFor                 10  avgt    2    17.0          ns/op
ForEachSimple.enhancedFor                100  avgt    2   118.4          ns/op
ForEachSimple.enhancedFor               1000  avgt    2 1,578.6          ns/op

ForEachSimple.lambdaStream                 1  avgt    2    53.7          ns/op
ForEachSimple.lambdaStream                10  avgt    2    65.5          ns/op
ForEachSimple.lambdaStream               100  avgt    2   153.0          ns/op
ForEachSimple.lambdaStream              1000  avgt    2 1,326.0          ns/op

ForEachSimple.parallelLambdaStream         1  avgt    2    94.2          ns/op
ForEachSimple.parallelLambdaStream        10  avgt    2 6,283.2          ns/op
ForEachSimple.parallelLambdaStream       100  avgt    2 8,443.4          ns/op
ForEachSimple.parallelLambdaStream      1000  avgt    2 9,367.6          ns/op

// just some strings
final List<String> list = new ArrayList<>();
String[] array = null;

@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup() { … /* uses length */ }

@Benchmark
public void arrayFor(Blackhole bh) {
    int result = 0;
    for (int i = 0; i < array.length; i++) {
        result += array[i].length();
    }

    bh.consume(result);
}

// arrayEnhanced
for (String s : array) {
    result += s.length();
}

// classicFor
for (int i = 0; i < list.size(); i++) {
    result += list.get(i).length();
}

// enhancedFor
for (String s : list) {
    result += s.length();
}

// lambdaStream
int result = list.stream()
    .mapToInt(s -> s.length())
    .sum();

//parallelLambdaStream
int result = list.parallelStream()
    .mapToInt(s -> s.length())
    .sum();

Looping

Not so empty loops compared


Benchmark             (length)  Mode  Cnt     Score   Error  Units
arrayFor                     1  avgt    2      238.3         ns/op
arrayFor                    10  avgt    2    2,105.6         ns/op
arrayFor                   100  avgt    2   21,957.8         ns/op
arrayFor                  1000  avgt    2  253,262.0         ns/op

arrayEnhanced                1  avgt    2      234.1         ns/op
arrayEnhanced               10  avgt    2    2,173.1         ns/op
arrayEnhanced              100  avgt    2   22,080.1         ns/op
arrayEnhanced             1000  avgt    2  250,038.0         ns/op

classicFor                   1  avgt    2      238.1         ns/op
classicFor                  10  avgt    2    2,177.0         ns/op
classicFor                 100  avgt    2   22,318.0         ns/op
classicFor                1000  avgt    2  249,914.1         ns/op

enhancedFor                  1  avgt    2      254.8         ns/op
enhancedFor                 10  avgt    2    2,185.9         ns/op
enhancedFor                100  avgt    2   23,011.1         ns/op
enhancedFor               1000  avgt    2  262,605.5         ns/op

lambdaStream                 1  avgt    2      395.7         ns/op
lambdaStream                10  avgt    2    2,829.0         ns/op
lambdaStream               100  avgt    2   27,771.4         ns/op
lambdaStream              1000  avgt    2  304,831.6         ns/op

# 2+2 CPU
parallelLambdaStream         1  avgt    2      453.5         ns/op
parallelLambdaStream        10  avgt    2   11,686.4         ns/op
parallelLambdaStream       100  avgt    2   26,798.6         ns/op
parallelLambdaStream      1000  avgt    2  189,304.3         ns/op
final List<String> list = new ArrayList<>();

@Param({"1", "10", "100", "1000"})
int length;

@Setup
public void setup()
{
    for (int i = 0; i < length; i++)
    {
        list.add(
            MessageFormat.format(
            "{0},{0},{0},{0},{0},{0}", String.valueOf(i)));
    }
}

@Benchmark
public void classicFor(Blackhole bh)
{
    int result = 0;
    for (int i = 0; i < list.size(); i++)
    {
        final String s = list.get(i);
        if (s.startsWith("5"))
        {
            continue;
        }

        final String[] a = s.split(",");
        for (int j = 0; j < a.length; j++)
        {
            result += Integer.valueOf(a[j]);
        }
    }

    bh.consume(result);
}

Looping - Bonus Content

T450s vs. GCP N1-highcpu-16core


# Lenovo T450s
Benchmark             (length)  Mode  Cnt        Score  Error  Units
arrayFor                     1  avgt    2        238.3         ns/op
arrayFor                    10  avgt    2      2,105.6         ns/op
arrayFor                   100  avgt    2     21,957.8         ns/op
arrayFor                  1000  avgt    2    253,262.0         ns/op
arrayFor                 10000  avgt    2  2,810,799.6         ns/op
arrayFor                100000  avgt    2 31,116,545.4         ns/op

classicFor                   1  avgt    2        238.1         ns/op
classicFor                  10  avgt    2      2,177.0         ns/op
classicFor                 100  avgt    2     22,318.0         ns/op
classicFor                1000  avgt    2    257,362.6         ns/op
classicFor               10000  avgt    2  2,809,832.8         ns/op
classicFor              100000  avgt    2 32,017,487.3         ns/op

enhancedFor                  1  avgt    2        254.8         ns/op
enhancedFor                 10  avgt    2      2,185.9         ns/op
enhancedFor                100  avgt    2     23,011.1         ns/op
enhancedFor               1000  avgt    2    254,691.8         ns/op
enhancedFor              10000  avgt    2  3,180,447.3         ns/op
enhancedFor             100000  avgt    2 31,479,657.4         ns/op

lambdaStream                 1  avgt    2        395.7         ns/op
lambdaStream                10  avgt    2      2,829.0         ns/op
lambdaStream               100  avgt    2     27,771.4         ns/op
lambdaStream              1000  avgt    2    299,971.9         ns/op
lambdaStream             10000  avgt    2  3,181,573.5         ns/op
lambdaStream            100000  avgt    2 36,055,386.8         ns/op

parallelLambdaStream         1  avgt    2        453.5         ns/op
parallelLambdaStream        10  avgt    2     11,686.4         ns/op
parallelLambdaStream       100  avgt    2     26,798.6         ns/op
parallelLambdaStream      1000  avgt    2    193,232.6         ns/op
parallelLambdaStream     10000  avgt    2  1,956,289.8         ns/op
parallelLambdaStream    100000  avgt    2 20,628,366.2         ns/op

# Google n1-highcpu-16core
Benchmark             (length)  Mode  Cnt       Score   Error  Units
arrayFor                     1  avgt   2        263.4          ns/op
arrayFor                    10  avgt   2      2,390.4          ns/op
arrayFor                   100  avgt   2     24,871.1          ns/op
arrayFor                  1000  avgt   2    266,237.0          ns/op
arrayFor                 10000  avgt   2  3,020,851.2          ns/op
arrayFor                100000  avgt   2 30,428,126.5          ns/op

classicFor                   1  avgt   2        264.0          ns/op
classicFor                  10  avgt   2      2,490.3          ns/op
classicFor                 100  avgt   2     25,186.0          ns/op
classicFor                1000  avgt   2    281,254.1          ns/op
classicFor               10000  avgt   2  2,882,459.1          ns/op
classicFor              100000  avgt   2 34,657,730.5          ns/op

enhancedFor                  1  avgt   2        263.5          ns/op
enhancedFor                 10  avgt   2      2,329.8          ns/op
enhancedFor                100  avgt   2     24,811.0          ns/op
enhancedFor               1000  avgt   2    291,544.4          ns/op
enhancedFor              10000  avgt   2  3,185,459.2          ns/op
enhancedFor             100000  avgt   2 33,898,708.6          ns/op

lambdaStream                 1  avgt   2        547.9          ns/op
lambdaStream                10  avgt   2      3,841.7          ns/op
lambdaStream               100  avgt   2     28,808.7          ns/op
lambdaStream              1000  avgt   2    329,148.1          ns/op
lambdaStream             10000  avgt   2  3,684,920.1          ns/op
lambdaStream            100000  avgt   2 36,935,499.3          ns/op

parallelLambdaStream         1  avgt   2        521.7          ns/op
parallelLambdaStream        10  avgt   2     43,527.8          ns/op
parallelLambdaStream       100  avgt   2     62,834.3          ns/op
parallelLambdaStream      1000  avgt   2    110,602.3          ns/op
parallelLambdaStream     10000  avgt   2    558,666.5          ns/op
parallelLambdaStream    100000  avgt   2  5,183,311.8          ns/op

The Cost of Calling

Method Calls, Inheritance, and Compiler Magic

The Cost of Calling

How Expensive are Virtual Method Calls?

  • The target is not known upfront due to inheritance and interfaces
  • Check if there is an overhead to find the right method
  • This is about virtual dispatch, i.e. the lookup of the target method, not so much about the call itself

public class MegaMorphicSimple
{
    static interface C { public int apply(int x); }

    static class C1 implements C
    {
        @Override
        public int apply(int x) { return x * x; }
    }

    @Benchmark
    public int _1_mono()
    {
        C c = null;
        int r = random.nextInt(4);

        if (r == 0) { c = new C1(); }
        else if (r == 1) { c = new C1(); }
        else if (r == 2) { c = new C1(); }
        else if (r == 3) { c = new C1(); }

        return c.apply(r);
    }

    @Benchmark
    public int _2_bi()
    {
        C c = null;
        int r = random.nextInt(4);

        if (r == 0) { c = new C1(); }
        else if (r == 1) { c = new C2(); }
        else if (r == 2) { c = new C1(); }
        else if (r == 3) { c = new C2(); }

        return c.apply(r);
    }

    @Benchmark
    public int _5_manualDispatch()
    {
        ...
        if (c instanceof C1)
        {
            return ((C1) c).apply(r);
        }
        else if (c instanceof C2)
        {
            return ((C2) c).apply(r);
        }
        else if (c instanceof C3)
        {
            return ((C3) c).apply(r);
        }
        else
        {
            return ((C4) c).apply(r);
        }
    }
}

The Cost of Calling

How Expensive are Virtual Method Calls?


# identical method per implementing class
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicSimple._1_mono             avgt    5  19.703 ± 2.177  ns/op
MegaMorphicSimple._2_bi               avgt    5  21.917 ± 3.759  ns/op
MegaMorphicSimple._4_mega             avgt    5  26.778 ± 4.411  ns/op
MegaMorphicSimple._5_manualDispatch   avgt    5  21.861 ± 2.860  ns/op

static interface C { default int apply(int x){ return x * x; }};

static class C1 implements C {};
static class C2 implements C {};
static class C3 implements C {};
static class C4 implements C {};

# default method at interface
Benchmark                             Mode  Cnt   Score   Error  Units
MegaMorphicDefault._1_mono            avgt    5  18.867 ± 1.325  ns/op
MegaMorphicDefault._2_bi              avgt    5  21.817 ± 2.613  ns/op
MegaMorphicDefault._4_mega            avgt    5  26.355 ± 2.591  ns/op
MegaMorphicDefault._5_manualDispatch  avgt    5  21.763 ± 3.479  ns/op

  • That looks expensive
  • Interestingly, we can help the runtime with manual dispatch
  • Should we write it the ugly way from now on?
  • ...but these are nanoseconds, and the benchmark methods do almost no work

# method got bigger and we made that non-default
Benchmark                                 Mode  Cnt    Score   Error  Units
MegaMorphicLargeMethod._1_mono            avgt    5  158.775 ± 4.971  ns/op
MegaMorphicLargeMethod._2_bi              avgt    5  157.100 ± 6.153  ns/op
MegaMorphicLargeMethod._4_mega            avgt    5  165.580 ± 8.669  ns/op
MegaMorphicLargeMethod._5_manualDispatch  avgt    5  162.729 ± 3.795  ns/op

Really calling virtually

So we don't have to think about virtual calls at all?

  • Let's have an array with different objects of a type C
  • C has four possible implementations
  • We either let Java make the call or we untangle it ourselves

public class MegaMorphicArray
{
    private C[] array = new C[20_000];

    @Setup
    public void setup()
    {
        final Random r = new Random(42L);

        for (int x = 0; x < 20_000; x++)
        {
            int i = r.nextInt(JMH_PARAM);

            if (i == 0) array[x] = new C1();
            else if (i == 1) array[x] = new C2();
            else if (i == 2) array[x] = new C3();
            else if (i == 3) array[x] = new C4();
        }
    }
}

static interface C { public int apply(int x); }

static class C1 implements C {
    @Override public int apply(int x) { return x * x; }
}   ...

@Benchmark
public int morphic()
{
    int sum = 0;

    for (int i = 0; i < 20_000; i++)
    {
        sum += array[i].apply(i);
    }

    return sum;
}

@Benchmark
public int peeled()
{
    int sum = 0;
    for (int i = 0; i < 20_000; i++) {
        final C c = array[i];
        if (c instanceof C1) {
            sum += ((C1) c).apply(i);
        }
        else if (c instanceof C2) {
            sum += ((C2) c).apply(i);
        }
        else if (c instanceof C3) {
            sum += ((C3) c).apply(i);
        }
        else {
            sum += ((C4) c).apply(i);
        }
    }

    return sum;
}

Really calling virtually

The Results are in


# JDK 11
Benchmark (params)  Mode  Cnt     Score  Units
morphic         1  avgt    5   17,705.6  ns/op
morphic         2  avgt    5   98,269.4  ns/op
morphic         3  avgt    5  334,745.9  ns/op
morphic         4  avgt    5  364,485.9  ns/op

peeled          1  avgt    5   16,990.4  ns/op
peeled          2  avgt    5   94,084.2  ns/op
peeled          3  avgt    5  123,590.3  ns/op
peeled          4  avgt    5  147,067.2  ns/op

# JDK 11 -Xint
Benchmark (params) Mode  Cnt        Score  Units
morphic         1  avgt    5  1,682,550.7  ns/op
morphic         2  avgt    5  1,695,875.2  ns/op
morphic         3  avgt    5  1,679,643.2  ns/op
morphic         4  avgt    5  1,675,298.1  ns/op

peeled          1  avgt    5  1,764,953.0  ns/op
peeled          2  avgt    5  1,821,662.6  ns/op
peeled          3  avgt    5  1,885,678.8  ns/op
peeled          4  avgt    5  1,936,054.8  ns/op

Megamorphic Internals

But why do we see this performance?


Benchmark                      (params)  Mode       Score       Error  Units
morphic                               1  avgt    16,829.7 ±  1165.264  ns/op
morphic:IPC                           1  avgt         3.2               #/op
morphic:L1-dcache-load-misses         1  avgt     6,263.3               #/op
morphic:L1-dcache-loads               1  avgt    40,296.8               #/op
morphic:branch-misses                 1  avgt         9.1               #/op
morphic:branches                      1  avgt    24,984.2               #/op
morphic:dTLB-load-misses              1  avgt         0.2               #/op
morphic:dTLB-loads                    1  avgt    40,401.5               #/op

morphic                               2  avgt    97,456.9 ±  1177.830  ns/op
morphic:IPC                           2  avgt         0.7               #/op
morphic:L1-dcache-load-misses         2  avgt     6,472.7               #/op
morphic:L1-dcache-loads               2  avgt    43,597.9               #/op
morphic:branch-misses                 2  avgt     9,850.6               #/op
morphic:branches                      2  avgt    41,935.3               #/op
morphic:dTLB-load-misses              2  avgt         2.0               #/op
morphic:dTLB-loads                    2  avgt    45,509.8               #/op

morphic                               3  avgt   334,117.8 ±  3334.059  ns/op
morphic:IPC                           3  avgt         1.1               #/op
morphic:L1-dcache-load-misses         3  avgt     6,840.3               #/op
morphic:L1-dcache-loads               3  avgt   407,654.7               #/op
morphic:branch-misses                 3  avgt    13,247.0               #/op
morphic:branches                      3  avgt   146,105.8               #/op
morphic:dTLB-load-misses              3  avgt        13.5               #/op
morphic:dTLB-loads                    3  avgt   410,043.5               #/op

morphic                               4  avgt   390,601.4 ± 40515.922  ns/op
morphic:IPC                           4  avgt         0.9               #/op
morphic:L1-dcache-load-misses         4  avgt     6,740.5               #/op
morphic:L1-dcache-loads               4  avgt   408,814.3               #/op
morphic:branch-misses                 4  avgt    15,099.0               #/op
morphic:branches                      4  avgt   145,118.2               #/op
morphic:dTLB-load-misses              4  avgt        18.9               #/op
morphic:dTLB-loads                    4  avgt   410,767.5               #/op

Benchmark                      (params)  Mode        Score       Error  Units
peeled                                1  avgt     16,831.1 ±   816.491  ns/op
peeled:IPC                            1  avgt          3.2               #/op
peeled:L1-dcache-load-misses          1  avgt      6,218.4               #/op
peeled:L1-dcache-loads                1  avgt     39,797.1               #/op
peeled:branch-misses                  1  avgt          9.2               #/op
peeled:branches                       1  avgt     24,854.5               #/op
peeled:dTLB-load-misses               1  avgt          0.2               #/op
peeled:dTLB-loads                     1  avgt     40,133.1               #/op

peeled                                2  avgt     93,347.0 ±  3607.879  ns/op
peeled:IPC                            2  avgt          0.7               #/op
peeled:L1-dcache-load-misses          2  avgt      6,456.1               #/op
peeled:L1-dcache-loads                2  avgt     42,060.7               #/op
peeled:branch-misses                  2  avgt      9,252.3               #/op
peeled:branches                       2  avgt     41,153.9               #/op
peeled:dTLB-load-misses               2  avgt          6.0               #/op
peeled:dTLB-loads                     2  avgt     42,195.1               #/op

peeled                                3  avgt    123,075.7 ±  5181.245  ns/op
peeled:IPC                            3  avgt          0.6               #/op
peeled:L1-dcache-load-misses          3  avgt      6,472.7               #/op
peeled:L1-dcache-loads                3  avgt     42,059.2               #/op
peeled:branch-misses                  3  avgt     13,440.9               #/op
peeled:branches                       3  avgt     51,950.4               #/op
peeled:dTLB-load-misses               3  avgt          5.7               #/op
peeled:dTLB-loads                     3  avgt     43,097.4               #/op

peeled                                4  avgt    136,765.3 ±  2792.939  ns/op
peeled:IPC                            4  avgt          0.6               #/op
peeled:L1-dcache-load-misses          4  avgt      6,460.5               #/op
peeled:L1-dcache-loads                4  avgt     43,312.1               #/op
peeled:branch-misses                  4  avgt     14,818.3               #/op
peeled:branches                       4  avgt     62,214.6               #/op
peeled:dTLB-load-misses               4  avgt          8.0               #/op
peeled:dTLB-loads                     4  avgt     42,775.6               #/op

TLB - The Unknown Buffer

The CPU understood

  • The previous data shows high TLB numbers
  • VMT: Virtual Method Table
  • Java looks up the right method in that table and then calls that code
  • TLB: Translation Lookaside Buffer
  • The TLB stores recent translations of virtual memory to physical memory and can be seen as an address-translation cache [1]
  • The VMT call target has to be resolved via a TLB lookup
  • Java can inline code for one or two call sites, but not for more
  • One method: Inlined or directly called
  • Two methods: Inlined or directly called, guarded by a type check
  • Three or more methods: We have to use the VMT
  • This behavior can change in future Java versions
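The mono- vs. megamorphic effect can be reproduced outside of JMH with a crude standalone sketch (all class and method names here are made up for illustration, and a hand-rolled nanoTime loop is no substitute for a proper harness):

```java
import java.util.Random;

// Crude standalone sketch of mono- vs. megamorphic call sites.
// All implementations do the same math; only the number of receiver
// types seen at the call site differs.
public class CallSiteSketch {
    interface C { int apply(int x); }
    static class C1 implements C { public int apply(int x) { return x * x; } }
    static class C2 implements C { public int apply(int x) { return x * x; } }
    static class C3 implements C { public int apply(int x) { return x * x; } }
    static class C4 implements C { public int apply(int x) { return x * x; } }

    static long sumOver(C[] array) {
        long sum = 0;
        for (int i = 0; i < array.length; i++) {
            sum += array[i].apply(i); // the virtual call site under test
        }
        return sum;
    }

    public static void main(String[] args) {
        Random r = new Random(42L);
        C[] mono = new C[20_000];
        C[] mega = new C[20_000];
        for (int i = 0; i < mono.length; i++) {
            mono[i] = new C1();         // one receiver type: monomorphic
            switch (r.nextInt(4)) {     // four receiver types: megamorphic
                case 0: mega[i] = new C1(); break;
                case 1: mega[i] = new C2(); break;
                case 2: mega[i] = new C3(); break;
                default: mega[i] = new C4(); break;
            }
        }
        for (int i = 0; i < 2_000; i++) { sumOver(mono); sumOver(mega); } // warmup
        long t0 = System.nanoTime(); long s1 = sumOver(mono);
        long t1 = System.nanoTime(); long s2 = sumOver(mega);
        long t2 = System.nanoTime();
        // identical math => identical sums; mega is usually noticeably slower
        System.out.println("mono " + (t1 - t0) + " ns, mega " + (t2 - t1)
            + " ns, sums equal: " + (s1 == s2));
    }
}
```

Expect the megamorphic loop to trail the monomorphic one once warm; the exact gap depends on JDK version and hardware.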

Lambdas are slow

Let's try some streams first

final int SIZE = 10240;
final int[] integers = new int[SIZE];

public int lambdaArray[Cold|Hot]()
{
    return Arrays.stream(integers)
        .filter(i -> i % 2 == 0).sum();
}

public int lambdaIntStream[Cold|Hot]()
{
    return IntStream.range(0, SIZE)
        .filter(i -> i % 2 == 0).sum();
}

public int loop[Cold|Hot]()
{
    int sum = 0;
    for (int i = 0; i < integers.length; i++)
    {
        if (i % 2 == 0)
        {
            sum += integers[i];
        }
    }

    return sum;
}

Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayHot       avgt    5      13,968.8   ns/op
Lambda01.lambdaIntStreamHot   avgt    5      13,156.2   ns/op
Lambda01.loopHot              avgt    5      10,146.6   ns/op

# Cold does not have JMH warmup
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5     251,372.4   ns/op
Lambda01.lambdaArrayHot       avgt    5      13,968.8   ns/op

Lambda01.lambdaIntStreamCold  avgt    5     794,422.3   ns/op
Lambda01.lambdaIntStreamHot   avgt    5      13,156.2   ns/op

Lambda01.loopCold             avgt    5      15,066.7   ns/op
Lambda01.loopHot              avgt    5      10,146.6   ns/op

# -Xcomp
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5  24,809,394.5   ns/op
Lambda01.lambdaIntStreamCold  avgt    5  23,398,581.4   ns/op
Lambda01.loopCold             avgt    5      11,787.0   ns/op

Lambda01.lambdaArrayHot       avgt    5     143,678.1   ns/op
Lambda01.lambdaIntStreamHot   avgt    5     144,445.3   ns/op
Lambda01.loopHot              avgt    5      10,109.5   ns/op

# -Xint
Benchmark                     Mode  Cnt         Score   Units
Lambda01.lambdaArrayCold      avgt    5   5,578,040.0   ns/op
Lambda01.lambdaArrayHot       avgt    5   5,473,135.6   ns/op
Lambda01.lambdaIntStreamCold  avgt    5   5,394,464.0   ns/op
Lambda01.lambdaIntStreamHot   avgt    5   5,317,269.0   ns/op
Lambda01.loopCold             avgt    5     229,327.9   ns/op
Lambda01.loopHot              avgt    5     229,369.6   ns/op

Lambda Functions

Compare Predicate, Pseudo-Predicate, and Direct Implementation

public int forLambda(final Predicate<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int forAnonymous(final PseudoLambda<Integer> p) {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (p.test(array[i])) {
            sum++;
        }
    }
    return sum;
}

public int simple() {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        if (array[i] % 2 == 0) {
            sum++;
        }
    }
    return sum;
}



@Benchmark
public void forSimple(Blackhole bh)
{
    bh.consume(simple());
}

@Benchmark
public void forLambdaPredicate(Blackhole bh)
{
    bh.consume(forLambda(i -> i % 2 == 0));
}

interface PseudoLambda<T>
{
    boolean test(T t);
}

@Benchmark
public void forAnonymousClass(Blackhole bh)
{
    bh.consume(
        forAnonymous(new PseudoLambda<Integer>()
        {
            @Override
            public boolean test(Integer i)
            {
                return i % 2 == 0;
            }
        })
    );
}
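
The result tables also list a forLambdaPredicatePredefined benchmark whose source is not shown. A plausible reconstruction of the idea, as a standalone sketch (hypothetical names; the point is that the Predicate instance is created once and reused, so single-shot runs do not pay the lambda setup inside the measurement):

```java
import java.util.function.Predicate;

// Hypothetical, standalone reconstruction of the "predefined" variant:
// the Predicate is created once and reused instead of being written
// inline at every call site. (Non-capturing lambdas are cached by the
// JVM anyway; the difference matters mostly for cold, single-shot runs
// where the one-time lambda linkage would otherwise be measured.)
public class PredefinedPredicateSketch {
    static final Predicate<Integer> EVEN = i -> i % 2 == 0; // built once

    static int count(int[] array, Predicate<Integer> p) {
        int sum = 0;
        for (int v : array) {
            if (p.test(v)) sum++;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = { 0, 1, 2, 3, 4, 5 };
        System.out.println(count(data, EVEN));            // reused instance
        System.out.println(count(data, i -> i % 2 == 0)); // inline lambda
    }
}
```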

Lambda Functions


Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass             avgt    2     152.9  ns/op
forLambdaPredicate            avgt    2     128.8  ns/op
forLambdaPredicatePredefined  avgt    2     127.9  ns/op
forSimple                     avgt    2     128.1  ns/op

# -Xint
Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass             avgt    2  21,312.5  ns/op
forLambdaPredicate            avgt    2  20,154.8  ns/op
forLambdaPredicatePredefined  avgt    2  19,871.6  ns/op
forSimple                     avgt    2   3,471.7  ns/op

# No warmup and just 5x measurements
Benchmark                     Mode  Cnt     Score  Units
forAnonymousClass               ss    5  92,787.6  ns/op
forLambdaPredicate              ss    5 122,377.4  ns/op
forLambdaPredicatePredefined    ss    5  50,435.6  ns/op
forSimple                       ss    5  16,268.6  ns/op

  • Basic lambdas are great when hot
  • When cold, their performance is questionable
  • Not the right choice for server startup code or rarely executed paths
  • BUT it all depends on the total performance debt you accumulate
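The cold-start effect can be eyeballed without JMH, too; a rough standalone sketch (names are made up, absolute numbers vary wildly, only the trend is meaningful):

```java
import java.util.stream.IntStream;

// Rough standalone sketch of cold vs. hot stream cost: the very first
// pipeline pays for class loading and lambda linkage; later runs do not,
// and eventually get JIT-compiled.
public class ColdStreamSketch {
    static int evenSum(int bound) {
        return IntStream.range(0, bound).filter(i -> i % 2 == 0).sum();
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int first = evenSum(10_240);          // cold: linkage + interpreter
        long cold = System.nanoTime() - t0;

        for (int i = 0; i < 10_000; i++) {
            evenSum(10_240);                  // warm up
        }

        long t1 = System.nanoTime();
        int later = evenSum(10_240);          // hot: linked, likely compiled
        long warm = System.nanoTime() - t1;

        System.out.println("cold " + cold + " ns, warm " + warm
            + " ns, same result: " + (first == later));
    }
}
```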

GC can be your friend


public class GCAndAccessSpeed
{
    private final static int SIZE  = 100_000;
    private final List<String> STRINGS = new ArrayList<>();
    private final List<String> ordered = new ArrayList<>();
    private final List<String> nonOrdered = new ArrayList<>();

    @Param({"false", "true"}) boolean gc;
    @Param({"1", "10"})       int COUNT;
    @Param({"false", "true"}) boolean drop;

    @Setup
    public void setup() throws InterruptedException
    {
        final FastRandom r = new FastRandom(7);
        for (int i = 0; i < COUNT * SIZE; i++)
        {
            STRINGS.add(RandomUtils.randomString(r, 1, 20));
        }

        for (int i = 0; i < SIZE; i++)
        {
            ordered.add(STRINGS.get(i * COUNT));
        }
        nonOrdered.addAll(ordered);
        Collections.shuffle(nonOrdered, new Random(r.nextInt()));

        if (drop)
        {
            STRINGS.clear();
        }

        if (gc)
        {
            for (int c = 0; c < 5; c++)
            {
                System.gc();
                TimeUnit.SECONDS.sleep(2);
            }
        }
    }

    @Benchmark
    public int walk[Ordered|NonOrdered]()
    {
        int sum = 0;
        for (int i = 0; i < [ordered|nonOrdered].size(); i++)
        {
            sum += [ordered|nonOrdered].get(i).length();
        }
        return sum;
    }
}

Applied Memory Knowledge

  • The theory: objects that live closer together in memory are faster to access
  • G1 is a compacting collector, hence it moves live objects closer together
  • When walking a structure whose objects are more "local", the caches are hotter
  • Verify with a non-GC setup, i.e. EpsilonGC
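The locality effect can also be shown in a crude standalone form (hypothetical names; timings depend heavily on heap layout and machine): both walks touch exactly the same String objects, only the access order differs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Crude standalone sketch of the locality effect: shuffling the list does
// not move the objects on the heap, it only changes the order in which we
// visit them. Same objects, same sum, different cache behavior.
public class LocalitySketch {
    static int walk(List<String> list) {
        int sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i).length();
        }
        return sum;
    }

    public static void main(String[] args) {
        Random r = new Random(7);
        List<String> ordered = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            ordered.add(Integer.toString(r.nextInt(1_000_000)));
        }
        List<String> shuffled = new ArrayList<>(ordered);
        Collections.shuffle(shuffled, new Random(42));

        for (int i = 0; i < 200; i++) { walk(ordered); walk(shuffled); } // warmup
        long t0 = System.nanoTime(); int a = walk(ordered);
        long t1 = System.nanoTime(); int b = walk(shuffled);
        long t2 = System.nanoTime();
        System.out.println("ordered " + (t1 - t0) + " ns, shuffled "
            + (t2 - t1) + " ns, sums equal: " + (a == b));
    }
}
```

On most machines the walk in allocation order comes out ahead, mirroring the walkOrdered vs. walkNonOrdered results below.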

# G1 GC
Benchmark        (COUNT)  (drop)   (gc) Mode  Cnt        Score   Units
walkNonOrdered        1   false  false  avgt    5  1,596,315.2   ns/op
walkOrdered           1   false  false  avgt    5    611,137.7   ns/op

walkNonOrdered        1   false   true  avgt    5  1,172,951.3   ns/op
walkOrdered           1   false   true  avgt    5    431,143.5   ns/op

walkNonOrdered        1    true  false  avgt    5  1,562,844.1   ns/op
walkOrdered           1    true  false  avgt    5    605,119.4   ns/op

walkNonOrdered        1    true   true  avgt    5  1,243,973.9   ns/op
walkOrdered           1    true   true  avgt    5    400,721.9   ns/op

walkNonOrdered       10   false  false  avgt    5  1,903,731.9   ns/op
walkOrdered          10   false  false  avgt    5  1,229,945.1   ns/op

walkNonOrdered       10   false   true  avgt    5  2,026,861.7   ns/op
walkOrdered          10   false   true  avgt    5  1,809,961.9   ns/op

walkNonOrdered       10    true  false  avgt    5  1,920,658.4   ns/op
walkOrdered          10    true  false  avgt    5  1,239,658.5   ns/op

walkNonOrdered       10    true   true  avgt    5  1,160,229.2   ns/op
walkOrdered          10    true   true  avgt    5    403,949.5   ns/op

# -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC
Benchmark        (COUNT)  (drop)   (gc) Mode  Cnt        Score   Units
walkNonOrdered        1   false  false  avgt    5  1,667,611.7   ns/op
walkNonOrdered        1   false   true  avgt    5  1,820,968.2   ns/op
walkNonOrdered        1    true  false  avgt    5  1,928,822.7   ns/op
walkNonOrdered        1    true   true  avgt    5  1,777,251.4   ns/op

walkOrdered           1   false  false  avgt    5    931,728.5   ns/op
walkOrdered           1   false   true  avgt    5    902,433.7   ns/op
walkOrdered           1    true  false  avgt    5    930,294.3   ns/op
walkOrdered           1    true   true  avgt    5    907,886.5   ns/op

Conclusion

Rule 1

Don't worry too much about performance. Java is very good at taking care of it.

Rule 2

Don't optimize prematurely and don't micro-tune.

Rule 3

Simply write clean and good code. You won't ever see most of these issues. Think before you code. Java can make code fast, but bad algorithms stay bad algorithms, and bad designs stay bad designs.



Rule 4

When you worry, worry about the hottest code. Measure carefully and apply what you learned. The compiler and the CPU are your friends, but they also make your profiling life harder.

Rule 5

Understand, measure, review, test, and measure again.

Rule 6

The code you write is not the code that is executed. Measure under production conditions to understand your system and to avoid optimizing non-compiled, non-optimized code.

Valhalla

Panama Vector-API

XLT Report Generator

Knowledge about the CPU, caches, and more, applied to the XLT Reporting Engine

# OLD
      4,449,972.01 msec task-clock                #   13.341 CPUs utilized
         3,692,466      context-switches          #  829.773 /sec
           677,485      cpu-migrations            #  152.245 /sec
         1,148,505      page-faults               #  258.093 /sec
10,607,013,813,174      cycles                    #    2.384 GHz                      (50.00%)
   355,223,105,079      stalled-cycles-frontend   #    3.35% frontend cycles idle     (50.01%)
 2,005,906,915,627      stalled-cycles-backend    #   18.91% backend cycles idle      (50.02%)
21,127,403,990,596      instructions              #    1.99  insn per cycle
                                                  #    0.09  stalled cycles per insn  (50.02%)
 4,456,128,125,803      branches                  #    1.001 G/sec                    (50.02%)
    16,654,985,143      branch-misses             #    0.37% of all branches          (50.01%)
 8,252,058,721,619      L1-dcache-loads           #    1.854 G/sec                    (50.00%)
   132,098,926,744      L1-dcache-load-misses     #    1.60% of all L1-dcache accesses  (50.00%)

             333.5 seconds time elapsed

            4158.9 seconds user
             295.2 seconds sys
       
# NEW
      2,740,643.95 msec task-clock                #   14.051 CPUs utilized
         8,421,379      context-switches          #    3.073 K/sec
           203,008      cpu-migrations            #   74.073 /sec
         1,152,818      page-faults               #  420.638 /sec
 6,490,638,190,051      cycles                    #    2.368 GHz                      (49.99%)
   235,691,588,371      stalled-cycles-frontend   #    3.63% frontend cycles idle     (50.00%)
   881,992,481,414      stalled-cycles-backend    #   13.59% backend cycles idle      (50.01%)
10,681,912,161,893      instructions              #    1.65  insn per cycle
                                                  #    0.08  stalled cycles per insn  (50.01%)
 1,873,443,357,019      branches                  #  683.578 M/sec                    (50.02%)
    13,816,096,139      branch-misses             #    0.74% of all branches          (50.01%)
 4,424,610,353,285      L1-dcache-loads           #    1.614 G/sec                    (50.00%)
   103,982,020,029      L1-dcache-load-misses     #    2.35% of all L1-dcache accesses  (50.00%)

             195.0 seconds time elapsed

            2507.4 seconds user
             232.7 seconds sys
   

Questions and Answers