Parallel Programming is Hard

Why you really need synchronization - and why it is not what you think

What's in it for you?

Why you should stick around

  • Learn some computer and compiler concepts that make concurrency hard
  • Basic language and hardware concepts to explain why you have to care
  • Create awareness for common programming mistakes

I want to clear up common misconceptions when programming with more than one thread. I am not a JDK developer, just a power user, so some details might be overly simplified or slightly incorrect.

You will see code... even assembler code

If this doesn't scare you away, what will?

About René Schwietzke

  • Co-Founder and Managing Director, Xceptance
  • Master of Computer Science (in German: Dipl.-Inf.)
  • Programmer since 1989
  • Java since 1996, Java 1.0
  • QA and Testing since 1998
  • Performance Tester since 1999
  • @ReneSchwietzke
  • @reneschwietzke@foojay.social
  • #java #qa #test #performance #performancetest #quality

Motivation I

Java is highly efficient - you want to make your programs efficient too

Motivation II

Write correct code, make it scale, keep it efficient

Facts

  • Multi-cores are standard
  • Multi-threading is standard
  • Efficiency is a must nowadays
  • Concurrency cannot be easily tested
  • Java is an older language but still evolving
  • Java has the ability to grant you full powers

Misconceptions

  • The code you see is the code that runs
  • Synchronization is needed for "transactional" sections which can be hit by more than one thread at the same time
  • Modern CPUs, more speed for free
  • I can debug and profile everything
  • Memory is plenty and affordable
  • Thread contention is the biggest concurrency problem

You learn from others

Always attribute other people's work and state your sources

Some of the content has been inspired by or borrowed and modified from Concurrency Concepts in Java by Douglas Hawkins [1]

The Three-Slide Presentation

All You Have to Know in Three Slides or Less

Why Java works the way it works

The spec, no fun

  • Did anyone read the JLS - Java Language Spec?
  • Chapter 17 - Threads and Locks
  • It defines the Java Memory Model - JMM
  • Explains what compiler and JVM builders have to obey
  • Java was the first language that tried such a model

The Java Memory Model specifies how the Java virtual machine works with the computer's memory (RAM)...

It is very important to understand the Java memory model if you want to design correctly behaving concurrent programs. ...specifies how and when threads can see values written to shared variables by other threads, and how to synchronize access to shared variables...

Jakob Jenkov

TL;DR

The one rule that tells Java to optimize the hell out of your code without breaking it

As if there's one thread, unless you say otherwise.

  • In English: Java will optimize as much as possible at its own discretion, as long as the outcome stays the same - Single Thread Consistency
  • Sadly, the compiler does not prevent us from messing things up.
  • Using threads doesn't tell Java a thing.

Once upon a time

When we blew it all up

When Java adjusted the JLS and made the JVM more performant with Java 1.5, a lot of programs stopped behaving correctly.

Not because the JVM was broken, but because many programs had broken synchronization and had just worked fine up to that point.

And Now... The Conclusion

Everything from Before in a Few More Words

Memory

Memory is Everything

JMM vs the Computer

Our World (JVM) vs. Real World (Hardware and OS)

Where do we live?


public class Test
{
    // we all live on the heap
    private Map<String, String> aMap;
    private int counter;
    private static final Any<Integer> ANY = new Any<>();

    /**
     * @param a - stack only (passed by value)
     * @param b - the reference is passed by value on the stack,
     *            but the object itself?
     */
    public void foo(int a, Any<?> b)
    {
        int c = a++; // register or stack or both

        var d = b; // ref d on stack or in register,
                   // but b might be anywhere

        // reference on stack/in register, but the instance?
        // when small and it does not escape: maybe on the stack
        // when larger: always on the heap
        int[] e = new int[a];

        // reference stack, instance on heap
        List<Foo> l = new ArrayList<>();
    }
}
  • The type does not indicate storage or sharing
  • Local does not mean local storage, just local scope
  • Simple rule: If you can synchronize on it, it can be shared and cause trouble!
  • Valhalla will bring something new: Value Objects (codes like a class, works like an int); you cannot synchronize on them - see the simple rule :-)
  • Programming hint: Immutability rulez!
  • If you cannot change it, you cannot screw it up (almost)!
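The immutability hint above can be sketched as a hypothetical class (the name `ImmutablePoint` and the `withX` method are illustrative, not from the talk): all fields final, no setters, and "changes" produce a new instance.

```java
// Hypothetical sketch: an immutable point. All fields are final,
// the class is final, and "modification" returns a new instance.
public final class ImmutablePoint {
    private final int x;
    private final int y;

    public ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int x() { return x; }
    public int y() { return y; }

    // Instead of mutating, derive a new instance
    public ImmutablePoint withX(int newX) {
        return new ImmutablePoint(newX, y);
    }

    public static void main(String[] args) {
        ImmutablePoint p = new ImmutablePoint(1, 2);
        ImmutablePoint q = p.withX(5);
        // p is untouched; q is a new object
        System.out.println(p.x() + " " + q.x()); // 1 5
    }
}
```

Because no thread can ever observe a change, instances can be shared freely without synchronization.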

Sharing and Data Race

Forgot to define important terms

Sharing means that another thread can see and/or change data that someone else created or owns.

The data access does not have to be at the same time to create a potential data race.

A data race occurs when: two or more threads in a single process access the same memory location concurrently, and at least one of the accesses is for writing, and the threads are not using any exclusive locks to control their accesses to that memory [1].

Concurrency

It is in the word, isn't it?

Concurrency means that data is accessed by two or more threads,
and at least one of them is writing.

There is no time component in that definition!
Concurrency must be interpreted more liberally nowadays.

  • Concurrent does not mean "at the same time" anymore, contrary to popular belief.
  • Data that is accessed exclusively by the same thread all the time is, by definition, not shared even though it is technically "shared" aka open and accessible.

Let's Approach our Topic

Let's look behind the scenes and bust some myths

Little Code is Safe?

One-liners are always safe, aren't they?

How small?

Is this small enough for an atomic (indivisible) operation?


int shared = 0; // field of an instance
=================================================
shared++;

getfield        #2   // Field shared:I
iconst_1
iadd
putfield        #2   // Field shared:I

local = shared;
local = local + 1;
shared = local;

0x00007f2af910f20c: mov 0xc(%rsi),%edi  ; getfield shared
0x00007f2af910f20f: inc %edi
0x00007f2af910f211: mov %edi,0xc(%rsi)  ; putfield shared

# also possible
0x00007f73c03a18cc: incl 0xc(%rsi) ; increment at address*

int local = 0; // stack aka method scope
=================================================
local += 2;

IINC 1 2

# "local" already lives in %rax
0x00007ff49cfde350: movabs $0x2,%r10
0x00007ff49cfde35a: add    %r10,%rax
  • Called Read - Modify - Write operations
  • One Java command != one CPU command
  • Spoiler: That applies to almost everything!
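A minimal, runnable sketch of the lost-update problem hidden inside `shared++` (class name and iteration counts are illustrative; the final value varies per run):

```java
// Demonstrates that shared++ is read-modify-write, not atomic.
// Two threads increment an unsynchronized counter; updates can be lost.
public class LostUpdates {
    static int shared; // plain field, no synchronization

    public static int run(int perThread) throws InterruptedException {
        shared = 0;
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                shared++; // getfield, iadd, putfield - divisible!
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join(); // join makes the final value safe to read
        return shared;
    }

    public static void main(String[] args) throws InterruptedException {
        // Often less than 200000 because increments were lost;
        // never more, and the exact value differs per run.
        System.out.println("expected 200000, got " + run(100_000));
    }
}
```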

Is new atomic?

Is a new atomic?

class Point
{
    private int x;
    private int y;

    public Point(int x, int y)
    {
        this.x = x;
        this.y = y;
    }
}

Point shared = new Point(x, y);
// pseudo-code
local = calloc(sizeof(Point));
local.<init>(x, y);
    Object.<init>();
    this.x = x;
    this.y = y;
shared = local;
  • Looks indivisible but it is not
  • In about 20 slides, we will know why.

Visibility

When do others see my changes?

No garbage ever

Clean hands all the time guaranteed by the spec

No out-of-thin-air reads: Each read of a variable must see a value written by a write to that variable.

This hurts big time (it blocks cutting-edge performance optimizations), because every single bit is written to first (initialized) before it is made available. That's why reuse might still be a good thing in Java... but then it is not immutable... darn...

#cache #bandwidth

  • Java will never present us garbage
  • Everything is initialized
  • That differs from the underlying memory management of the OS, which does not zero recently allocated memory up front and may not even back it with real pages in the first place.
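The guarantee is easy to observe; a small sketch (class and field names are illustrative):

```java
// Java never shows you garbage: every allocation is default-initialized
// (0, 0.0, false, null) before you can observe it.
public class Defaults {
    static int uninitializedInt;    // defaults to 0
    static Object uninitializedRef; // defaults to null

    public static int[] freshArray(int n) {
        return new int[n]; // guaranteed all zeros, unlike C's malloc
    }

    public static void main(String[] args) {
        for (int v : freshArray(4)) {
            if (v != 0) throw new AssertionError("garbage!");
        }
        System.out.println(uninitializedInt + " " + uninitializedRef); // 0 null
    }
}
```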

Not everything will be visible

shared = 20;
shared = 40;

print(shared);
  • Java builds a data dependency graph
  • print(shared) depends on shared = 40, but nothing depends on shared = 20
  • Safe to drop the first write (a dead store)
  • Single thread consistency
// remove dead store
shared = 40;

print(shared);
// even fancier
print(40);
shared = 40;
  • This is totally legal!
  • Will it be done? Maybe, maybe not.

Not everything will be stored

shared = 0;
for (int x : array)
{
    shared += x;
}
  • Inefficient code, because we write to memory all the time
  • JIT can safely optimize it, unless you say otherwise
local = 0; // likely a register
for (int x : array)
{
    local += x;
}

shared = local;
  • Single thread consistency
  • Speed is king and memory is slow

More reading

// Can X be 30?
x = shared * shared;

// Compiler might do that
local1 = shared;
local2 = shared;
x = local1 * local2;

// In reality, this happens...
local = shared;
x = local * local;

0x00007f7740ab2ccc: mov  0xc(%rsi),%eax ;*getfield shared
0x00007f7740ab2ccf: imul %eax,%eax      ;*imul

thread 1
==========================

local1 = shared; // 5

local2 = shared; // 6
x = local1 * local2; // 30

thread 2
==========================
shared = 5;

shared = 6;
 
 
  • Removing the redundant load is legal and gives us nice squares by accident
  • We will see this code again!!

And there is Reordering

Data and Control Dependence


// our code
sharedX = 2;
sharedY = 3;

if (sharedX > 0)
{
    print(sharedX);
}

// legal versions by the compiler
sharedX = 2;

if (sharedX > 0)
{
    print(sharedX);
}
sharedY = 3;

print(2);
sharedX = 2;
sharedY = 3;

sharedY = 3;
sharedX = 2;

if (2 > 0)
{
    print(2);
}

And the JLS states...

To some programmers, this behavior may seem "broken". However, it should be noted that this code is improperly synchronized.

The semantics of the Java programming language allow compilers and microprocessors to perform optimizations that can interact with incorrectly synchronized code in ways that can produce behaviors that seem paradoxical.

Modern Machines

A Quick Detour

Detour: Cost

What Does it Cost to Leave the CPU?

                     Real       Humanized
CPU Cycle            0.4 ns     1 s
L1 Cache             0.9 ns     2 s
L2 Cache             2.8 ns     7 s
L3 Cache             28 ns      1 min
Memory Access        100 ns     4 min
NVM SSD              25 μs      17 h
SSD                  50–150 μs  1.5–4 days
HDD                  1–10 ms    1–9 months
TCP SF to NY         65 ms      5 years
TCP SF to Hong Kong  141 ms     11 years
  • It costs 100 to 200 cycles when we have to go to main memory
  • Hence Java and the CPU avoid main memory access at all cost
  • Java has no notion of where data lives, hence this is fully transparent to you... but you can still work with it :)

The modern world

Everything happens in parallel

  • To fully utilize modern computer architectures, the compilers and the processors are free to do whatever it takes...
  • ...as long as your defined semantics stay the same
  • Javac, JIT, and the CPU reorder and optimize
  • Current Intels/AMDs can do 4 to 8 instructions per cycle

Reordering

Implicit concurrent execution of your code


public class Foo
{
    private double total;
    private double fees;

    public void transact(double m)
    {
        total += m;
        fees += Math.abs(m * 0.01);
    }
}

# rM is m in a register
load  total, rTotal
add   rM, rTotal
store rTotal, total

load  0.01, rT1
mul   rM, rT1
abs   rT1
load  fees, rFees
add   rT1, rFees
store rFees, fees

# Execution Unit 1
load  total, rTotal
add   rM, rTotal
store rTotal, total
 
 

# Execution Unit 2
load  0.01, rT1
mul   rM, rT1
abs   rT1
 
 

# Execution Unit 3
load  fees, rFees
 
 
 
 

# Execution Unit 4
 
 
 
add   rT1, rFees
store rFees, fees
  • We ran 5 cycles with 9 instructions, 1.8 IPC (instructions per cycle)

Fortune Teller

Predict and speculate to be ahead of the crowd


private static final int COUNT = 10_000;

// Contains random numbers from -50 to 50
private int[] sorted;
private int[] unsorted;

public int doIt(int[] array) {
    int sum = 0;
    for (int v : array) {
        if (v > 0) {
            sum = sum + 2;
        }
        else {
            sum = sum + 1;
        }
    }
    return sum;
}
  • Will there be any difference when passing sorted or unsorted arrays?

# T14s
# Benchmark  Mode  Cnt       Score   Error  Units
sorted       avgt    2       3,200          ns/op
unsorted     avgt    2      14,927          ns/op
  • Sorted is 4.7x faster. Why?

# Benchmark                                    Mode Cnt      Score       Units
BranchPrediction1.sorted                       avgt   2   3,200.529      ns/op
BranchPrediction1.sorted:IPC                   avgt           3.912  insns/clk
BranchPrediction1.sorted:branch-misses         avgt           5.800       #/op
BranchPrediction1.sorted:branches              avgt      17,827.359       #/op
BranchPrediction1.sorted:cycles                avgt      13,969.192       #/op
BranchPrediction1.sorted:instructions          avgt      53,254.559       #/op

BranchPrediction1.unsorted                     avgt   2  14,927.107      ns/op
BranchPrediction1.unsorted:IPC                 avgt           0.875  insns/clk
BranchPrediction1.unsorted:branch-misses       avgt       1,072.056       #/op
BranchPrediction1.unsorted:branches            avgt      17,693.113       #/op
BranchPrediction1.unsorted:cycles              avgt      60,857.134       #/op
BranchPrediction1.unsorted:instructions        avgt      53,320.560       #/op
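The experiment can be sketched without JMH (seed, sizes, and helper names are assumptions; plain `nanoTime` timing is too noisy to assert, so only the result equality is shown - use JMH for real measurements):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the branch prediction experiment: sorted data makes the
// v > 0 branch predictable; shuffled data defeats the predictor.
public class BranchDemo {
    public static int sumWithBranch(int[] array) {
        int sum = 0;
        for (int v : array) {
            if (v > 0) { sum = sum + 2; } else { sum = sum + 1; }
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] unsorted = rnd.ints(10_000, -50, 51).toArray();
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        // Same elements, same result - only the branch pattern differs
        System.out.println(sumWithBranch(sorted) == sumWithBranch(unsorted)); // true

        // Timing differences only show up with proper warmup and many
        // iterations; the perf counters above tell the real story.
    }
}
```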

Learnings Summarized

What did we learn on the show tonight, Craig... er, René?

  • The code you wrote is not the code that runs.
  • As long as the result is the same:
    • Compilers can modify the order and flow
    • Remove or add code
    • CPUs can reorder and speculate
    • CPUs can run instructions in parallel
  • You likely still think like a 1990 compiler.
  • In your head, your program runs like on a 1980 computer.

Loop unrolling, inlining, reordering, dead code removal, branch prediction, caching, speculative execution, pipelining, memory access optimization, location aware memory, register machine, ahead of time evaluation, thread scheduling and moving, on-the-fly code replacement ...

Shared Access is not timed Concurrency

What we have to do to get things working

Concurrency and visibility are not about modifying things at the same time; they are all about reading data that was modified by another thread. These operations could be seconds or even (theoretically) minutes apart.

  • Establishing an order is important
  • Only read after a corresponding write
  • Every read after a write sees the written value
  • We have a "happens-before" relation
  • By means of synchronization, we ensure that this order is maintained
  • Synchronization introduces barriers/fences into the code
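Locks are not the only way to establish such an order: `Thread.start()` and `Thread.join()` are synchronization actions too. A minimal sketch (class name is illustrative):

```java
// Thread.start() and Thread.join() establish happens-before edges:
// everything the worker wrote is visible after join() returns,
// even though sharedData is a plain, non-volatile field.
public class JoinVisibility {
    static int sharedData; // plain field

    public static int compute() throws InterruptedException {
        Thread worker = new Thread(() -> sharedData = 42);
        worker.start(); // happens-before: main's state -> worker
        worker.join();  // happens-before: worker's writes -> main
        return sharedData; // guaranteed to see 42
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(compute()); // 42
    }
}
```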

Back to NEW

See Where it Fails and Why

New at a New level

What all the previous things will do to NEW


// Am I immutable?
class Point {
    private int x;
    private int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }
}
  • Not immutable by the strict definition (fields are not final)
// pseudo-code
local = calloc(sizeof(Point));
local.<init>(x, y);
    Object.<init>();
    this.x = x;
    this.y = y;
shared = local;


Reordered New

But why can I see the 0 values?

// pseudo-code
local = calloc(sizeof(Point));
local.<init>(x, y);
    Object.<init>();
    this.x = x;
    this.y = y;
shared = local;


// a possible optimization
local = calloc(sizeof(Point));
shared = local;
local.<init>(x, y);
    Object.<init>();
    this.x = x;
    this.y = y;
// this permits better CPU utilization and
// starts the store to memory earlier
  • Single thread consistency
  • The compilers and the CPU change your code

Double-Checked Locking

Now you might understand why double-checked locking is broken

// what you wrote
public static Singleton getInstance()
{
    if (instance == null)
    {
        synchronized (Singleton.class)
        {
            if (instance == null)
            {
                instance = new Singleton();
            }
        }
    }

    return instance;
}


// what you get
public static Singleton getInstance()
{
    if (instance == null)
    {
        synchronized (Singleton.class)
        {
            if (instance == null)
            {
                local = calloc(sizeof(Singleton));
                instance = local;
                local.<init>;
            }
        }
    }

    return instance;
}

Fences to protect us

Silence! Or I will synchronize you.

No Fences

Let the CPU and compiler optimize the hell out of your code

==== Producer       ==== Consumer
...                 ...
sharedData = ...;   while (!sharedDone)
sharedDone = true;  {
...                     ...
                    }
                    print(sharedData);

==== Producer       ==== Consumer
...                 ...
sharedDone = true;  local = !sharedDone;
                    while (local)
sharedData = ...;   {
...                     ...
                    }
                    print(sharedData);

Fences

Make clear what the rules are

==== Producer               ==== Consumer
...                         ...
sharedData = ...;           while (!sharedDone)
--- store_store_fence();    {
sharedDone = true;              --- load_load_fence();
...                             ...
                            }
                            --- load_load_fence();
                            print(sharedData);
  • Prevent reordering of instructions
  • Instruct where to read from and write to
  • Ensure consistent state
  • What happens above the fence, stays above the fence. What happens below, always happens below.

Synchronization Actions

But there are no fences in the Java language, mum!?

  • volatile
  • synchronized
  • final
  • Atomics
  • varhandles
  • Each "method" has different properties such as speed, scope, and hardware support
  • All establish a happens-before relation
  • All prevent reordering
  • All will ensure consistent writes before reads

Synchronization Actions

What can they do for us?

  • final: You see only one state
  • volatile: Ensures consistent reads and writes - main memory instead of registers
  • synchronized: Ensures consistent reads and writes and prevents multiple threads from running the same code at the same time, establishes single-thread semantics
  • Atomics: Ensure consistent reads and writes, encapsulate read-modify-write operations
  • varhandles: Atomic, synchronized, and final as code (extremely simplified definition)!
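The producer/consumer fence pattern from before can be expressed with volatile; a minimal sketch (class and field names are illustrative), where the volatile write/read pair plays the role of the store/load fences:

```java
// Producer/consumer handoff via a volatile flag: the volatile write
// publishes everything written before it, the volatile read sees it.
public class Handoff {
    static int sharedData;              // plain field
    static volatile boolean sharedDone; // the "fence" variable

    public static int handOff(int value) throws InterruptedException {
        sharedData = 0;
        sharedDone = false;
        Thread producer = new Thread(() -> {
            sharedData = value; // 1. write the data
            sharedDone = true;  // 2. volatile write publishes it
        });
        producer.start();
        while (!sharedDone) {   // volatile read; cannot be hoisted
            Thread.onSpinWait();
        }
        return sharedData;      // guaranteed to see 'value'
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(handOff(42)); // 42
    }
}
```

Without `volatile` on `sharedDone`, the JIT would be free to hoist the read out of the loop and spin forever, exactly as shown in the "No Fences" slide.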

Our Toolbox discussed

What Each "Consistency Tool" Offers

Final for Immutability

Final has some unknown characteristics

class Point
{
    private final int x;
    private final int y;

    public Point(int x, int y)
    {
        this.x = x;
        this.y = y;
    }
}

Point shared = new Point(x, y);
// pseudo-code with final
local = calloc(sizeof(Point));
local.<init>(x, y);
    Object.<init>();
    this.x = x;
    this.y = y;
--- freeze(); ---
shared = local;
// pseudo-code no final
local = calloc(sizeof(Point));
shared = local;
local.<init>(x, y);
    Object.<init>();
    this.x = x;
    this.y = y;

  • Final takes care of reordering
  • Makes sure immutable is immutable

Mutability - Volatile

Make mutability safe

The Java volatile keyword is used to mark a Java variable as being stored in main memory. More precisely that means, that every read of a volatile variable will be read from the computer's main memory, and not from the CPU cache, and that every write to a volatile variable will be written to main memory, and not just to the CPU cache.

http://tutorials.jenkov.com/java-concurrency/volatile.html

// that now works
private volatile Singleton instance;

public static Singleton getInstance()
{
    // volatile ensures that we read from main memory now
    if (instance == null) // a fence
    {
        synchronized (Singleton.class) // another fence
        {
            if (instance == null) // another fence
            {
                local = calloc(sizeof(Singleton));
                local.<init>;
                --- fence();
                instance = local; // write to main memory
            }
        }
    }

    return instance; // yeah... read main memory again too
}

Volatile at work

Read-Modify-Write Problem

// what you wrote
private volatile int shared;

public void inc()
{
    shared += 1;
}
// what the machines does
private volatile int shared;

public void inc()
{
    local = shared; // read from main memory
    --- fence();
    local = local + 1;
    shared = local; // flush to main memory
    --- fence();
}

0xc4c2fcc: mov    0xc(%rsi),%r11d
0xc4c2fd0: inc    %r11d
0xc4c2fd3: mov    %r11d,0xc(%rsi)      ;*putfield shared
0xc4c2fd7: lock addl $0x0,-0x40(%rsp)  ;*prevent reordering

Our Square Problem Again


private int shared = 0;

public int square()
{
    int x = shared * shared;
    return x;
}

0xcab294c: mov  0xc(%rsi),%eax ;*getfield shared
0xcab294f: imul %eax,%eax      ;*imul

private volatile int shared = 0;

0xcab294c: mov    0xc(%rsi),%eax  ;*getfield shared
0xcab294f: mov    0xc(%rsi),%r10d ;*getfield shared
0xcab2953: imul   %r10d,%eax      ;*imul

The famous one - synchronized

Create bigger regions

A Java synchronized block marks a method or a block of code as synchronized. Java synchronized blocks can be used to avoid race conditions.

http://tutorials.jenkov.com/java-concurrency/synchronized.html

  • Only synchronizes on objects
  • Java does not synchronize threads, threads synchronize on memory!
private int shared;

public synchronized void inc()
{
    shared += 1;
}
private int shared;

public void inc()
{
    synchronized (this)
    {
        shared += 1;
    }
}
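A quick runnable sketch (class name and counts are illustrative) showing that the synchronized read-modify-write never loses updates, unlike the plain `shared++` from before:

```java
// With synchronized, the read-modify-write becomes one indivisible
// region: two threads x 100,000 increments always end at 200,000.
public class SyncCounter {
    private int shared;

    public synchronized void inc() {
        shared += 1;
    }

    public synchronized int get() {
        return shared;
    }

    public static int run(int perThread) throws InterruptedException {
        SyncCounter c = new SyncCounter();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) c.inc();
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        return c.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(100_000)); // 200000
    }
}
```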

Costly it is

Nothing in life is free, especially not synchronization


0x00007f80cc3a18c0: sub $0x18,%rsp
0x00007f80cc3a18c7: mov %rbp,0x10(%rsp)  ;*synchronization entry
                                         ; - org.sample.Test3::doIt2@-1 (line 17)
0x00007f80cc3a18cc: incl 0xc(%rsi)  ;*putfield shared
                                    ; - org.sample.Test3::doIt2@7 (line 17)
0x00007f80cc3a18cf: add $0x10,%rsp
0x00007f80cc3a18d3: pop %rbp
0x00007f80cc3a18d4: mov 0x108(%r15),%r10
0x00007f80cc3a18db: test %eax,(%r10)  ;   {poll_return} *** SAFEPOINT POLL ***
0x00007f80cc3a18de: ret

0x00007f80cc3a1440: mov %eax,-0x14000(%rsp)
0x00007f80cc3a1447: push %rbp
0x00007f80cc3a1448: sub $0x20,%rsp
0x00007f80cc3a144c: mov %rsi,%rbp
0x00007f80cc3a144f: mov (%rsi),%rax
0x00007f80cc3a1452: mov %rax,%r10
0x00007f80cc3a1455: and $0x7,%r10
0x00007f80cc3a1459: cmp $0x5,%r10
0x00007f80cc3a145d: jne L0003
0x00007f80cc3a145f: mov $0x60040,%r11d  ;   {metadata('org/sample/Test3')}
0x00007f80cc3a1465: movabs $0x800000000,%r10
0x00007f80cc3a146f: add %r11,%r10
0x00007f80cc3a1472: mov 0xb8(%r10),%r10
0x00007f80cc3a1479: mov %r10,%r11
0x00007f80cc3a147c: or %r15,%r11
0x00007f80cc3a147f: mov %r11,%r8
0x00007f80cc3a1482: xor %rax,%r8
0x00007f80cc3a1485: test $0xffffffffffffff87,%r8
0x00007f80cc3a148c: jne L000d  ;*synchronization entry
                               ; - org.sample.Test3::doIt@-1 (line 13)
             L0000: incl 0xc(%rbp)
0x00007f80cc3a1495: mov $0x7,%r10d
0x00007f80cc3a149b: and 0x0(%rbp),%r10
0x00007f80cc3a149f: cmp $0x5,%r10
0x00007f80cc3a14a3: jne L0007  ;*return {reexecute=0 rethrow=0 return_oop=0}
                               ; - org.sample.Test3::doIt@10 (line 14)
             L0001: add $0x20,%rsp
0x00007f80cc3a14a9: pop %rbp
0x00007f80cc3a14aa: mov 0x108(%r15),%r10
0x00007f80cc3a14b1: test %eax,(%r10)  ;   {poll_return} *** SAFEPOINT POLL ***
0x00007f80cc3a14b4: ret
             L0002: lock cmpxchg %r10,(%rsi)
             L0003: lea 0x10(%rsp),%rbx
0x00007f80cc3a14bf: mov (%rsi),%rax
0x00007f80cc3a14c2: test $0x2,%rax
0x00007f80cc3a14c8: jne L0004
0x00007f80cc3a14ca: or $0x1,%rax
0x00007f80cc3a14ce: mov %rax,(%rbx)
0x00007f80cc3a14d1: lock cmpxchg %rbx,(%rsi)
0x00007f80cc3a14d6: je L0005
0x00007f80cc3a14dc: sub %rsp,%rax
0x00007f80cc3a14df: and $0xfffffffffffff007,%rax
0x00007f80cc3a14e6: mov %rax,(%rbx)
0x00007f80cc3a14e9: jmp L0005
             L0004: mov %rax,%r10
0x00007f80cc3a14f1: xor %rax,%rax
0x00007f80cc3a14f4: lock cmpxchg %r15,0x7e(%r10)
0x00007f80cc3a14fa: movq $0x3,(%rbx)
             L0005: je L0000
             L0006: lea 0x10(%rsp),%rdx
0x00007f80cc3a1508: data16 xchg %ax,%ax
0x00007f80cc3a150b: call 0x00007f80c48f5a00  ; ImmutableOopMap{rbp=Oop }
                                             ;*synchronization entry
                                             ; - org.sample.Test3::doIt@-1 (line 13)
                                             ;   {runtime_call _complete_monitor_locking_Java}
0x00007f80cc3a1510: jmp L0000
             L0007: lea 0x10(%rsp),%rax
0x00007f80cc3a151a: cmpq $0x0,(%rax)
0x00007f80cc3a1521: je L000c
0x00007f80cc3a1527: mov 0x0(%rbp),%r10
0x00007f80cc3a152b: test $0x2,%r10
0x00007f80cc3a1532: je L000b
0x00007f80cc3a1534: xor %rax,%rax
0x00007f80cc3a1537: or 0x8e(%r10),%rax
0x00007f80cc3a153e: jne L000c
0x00007f80cc3a1540: mov 0x9e(%r10),%rax
0x00007f80cc3a1547: or 0x96(%r10),%rax
0x00007f80cc3a154e: jne L0008
0x00007f80cc3a1550: movq $0x0,0x7e(%r10)
0x00007f80cc3a1558: jmp L000c
             L0008: cmpq $0x0,0xa6(%r10)
0x00007f80cc3a1565: je L0009
0x00007f80cc3a1567: xor %rax,%rax
0x00007f80cc3a156a: movq $0x0,0x7e(%r10)
0x00007f80cc3a1572: lock addl $0x0,(%rsp)
0x00007f80cc3a1577: cmpq $0x0,0xa6(%r10)
0x00007f80cc3a1582: jne L000a
0x00007f80cc3a1584: lock cmpxchg %r15,0x7e(%r10)
0x00007f80cc3a158a: jne L000a
             L0009: or $0x1,%eax
0x00007f80cc3a158f: jmp L000c
             L000a: test $0x0,%eax
0x00007f80cc3a1596: jmp L000c
             L000b: mov (%rax),%r10
0x00007f80cc3a159b: lock cmpxchg %r10,0x0(%rbp)
             L000c: je L0001
0x00007f80cc3a15a7: mov %rbp,%rdi
0x00007f80cc3a15aa: lea 0x10(%rsp),%rsi  ;*synchronization entry
                                         ; - org.sample.Test3::doIt@-1 (line 13)
0x00007f80cc3a15af: mov %r15,%rdx
0x00007f80cc3a15b2: movabs $0x7f80e35353b0,%r10
0x00007f80cc3a15bc: call *%r10  ;*return {reexecute=0 rethrow=0 return_oop=0}
                                ; - org.sample.Test3::doIt@10 (line 14)
0x00007f80cc3a15bf: jmp L0001
             L000d: test $0x7,%r8
0x00007f80cc3a15cb: jne L0002
0x00007f80cc3a15d1: test $0x300,%r8
0x00007f80cc3a15d8: jne L000e
0x00007f80cc3a15da: and $0x37f,%rax
0x00007f80cc3a15e1: mov %rax,%r11
0x00007f80cc3a15e4: or %r15,%r11
             L000e: lock cmpxchg %r11,(%rsi)
0x00007f80cc3a15ec: jne L0006
0x00007f80cc3a15f2: jmp L0000

Lock Coarsening and Widening

Make things even more efficient

synchronized (buffer)
{
    buffer.add(x);
}

foo = bar;

synchronized (buffer)
{
    buffer.add(y);
}
synchronized (buffer)
{
    buffer.add(x);
    foo = bar;
    buffer.add(y);
}
synchronized (buffer)
{
    buffer.add(x);
}
foo = bar;

synchronized (buffer)
{
    buffer.add(x);
    foo = bar;
}

Atomics

Lock free updates... what?

private AtomicInteger atomic = new AtomicInteger(0);

public void inc()
{
    atomic.getAndIncrement();
}
// under the hood
boolean applied = false;
do
{
    int value = atomic.get();
    applied = atomic.compareAndSet(value, value + 1);
}
while (!applied);
  • Uses hardware support (CAS - Compare and Swap)
  • Version controlled updates
  • Update if nobody else updated in between; otherwise read again and try again
  • Waits can be long, hence it is lock free but not wait free
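The CAS retry loop generalizes beyond increment; a sketch with a hypothetical "clamped add" (class, method, and the MAX constant are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// The CAS pattern: read, compute, apply-only-if-unchanged, retry.
// Here: add a delta but never let the counter exceed a maximum.
public class CasLoop {
    static final int MAX = 100;
    static final AtomicInteger shared = new AtomicInteger(0);

    public static int addClamped(int delta) {
        while (true) {
            int current = shared.get();
            int next = Math.min(current + delta, MAX);
            // Only apply if nobody updated in between; otherwise retry
            if (shared.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(addClamped(60)); // 60
        System.out.println(addClamped(60)); // 100, clamped
    }
}
```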

Varhandles

We skip this today, because that is a talk on its own

  • Nothing to see here, please move on!

Summary

49 min in 30 seconds

  • You HAVE TO tell Java that something is shared
  • Otherwise, Java will build and execute very optimized code
  • The CPU goes nuts too... unless Java stops it from doing so
  • Java doesn't manage the cache, the hardware does
  • Use well-known libraries
  • Shared data is more expensive but might be worth it
  • Immutable data is great but also not free
  • Virtual Threads (Loom) are just threads, all rules apply
  • Scoped Values and Structured Concurrency will take weight off your shoulders
  • Testing is hard, reviewing is easier
  • The End - Thank you!