#1BRC - The First 80%

How to Read
One Billion Rows
Fast Enough

René Schwietzke, Xceptance GmbH

About René Schwietzke

  • Master of Computer Science (Dipl.-Inf.)
  • Programming since 1987
  • Java since Java 1.0.2, 1996
  • QA and Testing since 1998
  • Performance Tester since 1999
  • Co-Founder of
  • @reneschwietzke@foojay.social
  • @reneschwietzke.bsky.social
  • Blog reneschwietzke.de
  •    @ReneSchwietzke

About

  • Founded 2004
  • Headquarters in Jena, Germany; Subsidiary in Cambridge, MA, USA
  • Specialized in Software Testing and Quality Assurance
  • Performance testing since 2004
  • Over 150 performance test projects every year
  • World-wide customer base including APAC and South America
  • Performance Test Tool , Java-based, APL 2.0
  • SaaS Performance Testing Offering

License

I care and share

c b

This presentation is licensed under
Creative Commons Attribution 4.0 International License

#1BRC - Introduction

What is 1BRC and Why the Fuzz?

The Origin

A busy January 2024 for Java Devs

  • Gunnar Morling got bored during the holidays
  • His wife suggested to do something nice and fun
  • So he asked the community the #1brc question

... write a Java program for retrieving temperature measurement values from a text file and calculate the min, mean, and max temperatures per weather station.
... one caveat, the file has 1,000,000,000 rows!

Data

Input


Zagreb;19.7
Havana;36.2
Mexicali;27.3
Marseille;15.4
Banjul;24.2
Canberra;0.8
Mandalay;19.3
Bridgetown;29.6
Chihuahua;10.9
Honiara;35.3
Gangtok;-1.7
Houston;32.4
Zürich;13.8
Kolkata;20.6
Tripoli;8.9
Karachi;28.3
Zagreb;13.2
Palm Springs;20.3
Lhasa;6.2
Sacramento;28.3
Edmonton;5.6
Toluca;14.0
El Paso;24.0
Antsiranana;24.9
Tromsø;-3.9
        

Output


# This is TreeMap::toString
# Data shown here is formatted for screen
{
   Abha=5.0/18.0/27.4, Abidjan=15.7/26.0/34.1,
   Abéché=12.1/29.4/35.6, Accra=14.7/26.4/33.1,
   Addis Ababa=2.1/16.0/24.3, Adelaide=4.1/17.3/29.7,
   ...
   Willemstad=-7.9,28.2,57.7, Winnipeg=-28.1,3.0,39.4,
   Wrocław=-27.5,9.6,50.2, Xi'an=-22.5,14.0,49.1,
   Yakutsk=-42.2,-8.8,26.4, Yangon=-2.6,27.6,59.0,
   Yaoundé=-10.1,23.6,58.3, Yellowknife=-38.4,-4.0,40.0,
   Yerevan=-18.4,12.4,49.6, Yinchuan=-25.4,9.5,42.0,
   Zagreb=-25.3,10.6,48.4, Zanzibar City=-5.0,26.1,61.2,
   Zürich=-20.9,9.4,43.5, Ürümqi=-26.8,7.4,43.7,
   İzmir=-15.8,18.0,50.1
}
    
  • 1 billion rows
  • Size 13.8 GB

Rules

Less Questions, More Fair Doing

  • Java Only
  • JDKs via SDKMAN only, any version!
  • Any CLI parameters permitted
  • No external dependencies
  • Results measured on bare metal
  • File in RAM (RAMdisk or file cache)
  • No preprocessing of the data permitted
  • Five measurements, middle three averaged
  • Max 10k stations - initial version had 413
  • Only valid data in the file, UTF-8
  • Line ending is \n
  • City names max 100 bytes (unicode) long
  • Temperatures -99.9 to 99.9 (e.g. 5.0, 0.0)
  • IEEE 754 rounding-direction "roundTowardPositive"

The Results

What do you think? Guess some numbers.

TestCoresTime
Digital Ocean Intel
Time
Hetzner AMD
Baseline1247.6 s152.1 s
Baseline8111.3 s81.4 s
Thomas Wuerthinger18.9 s7.3 s
Thomas Wuerthinger81.2 s1.6 s
Today's 20% Goal149.5 s30.4 s

Let's Challenge Ourselves

What We Will Do and What We Won't

Some Updated Rules

Just to Make Things Easier

  • Limit measurements to code in question
  • Exclude startup and shutdown of JVM
  • One Simple Benchmark Framework - 1SBF
  • Supports warmup and measurements rounds
  • Calculates mean of measurements
  • Report Berlin=23615/-0.4/6.0/15.2
  • New instance of the test class per round
  • SHA-512 of result to discover differences
  • Supports batchmode

Limitations

Some limitations apply today

  • Talk time is not enough for all details
  • 80% don't require all tricks
  • No Über-Code today!
  • Only things you reuse too!
  • Real world gains vary a lot

Goal: All code is still easily understandable.

  • No Unsafe
  • No Vector API
  • No Graal
  • Almost No Bit-Magic
  • No Branchless Code
  • No Memory-Mapped Files
  • No JNI/Panama

The Baseline

The Original Code


public String run(final String fileName) throws IOException {
    Collector<Measurement, MeasurementAggregator, ResultRow> collector = Collector.of(MeasurementAggregator::new,
            (agg, m) -> {
                agg.min = Math.min(agg.min, m.value);
                agg.max = Math.max(agg.max, m.value);
                agg.sum += m.value;
                agg.count++;
            }, (agg1, agg2) -> {
                var res = new MeasurementAggregator();
                res.min = Math.min(agg1.min, agg2.min);
                res.max = Math.max(agg1.max, agg2.max);
                res.sum = agg1.sum + agg2.sum;
                res.count = agg1.count + agg2.count;

                return res;
            }, agg -> {
                return new ResultRow(agg.min, (Math.round(agg.sum * 10.0) / 10.0) / agg.count, agg.max);
            });

    // Measurement is a record
    var result = Files.lines(Paths.get(fileName)).parallel() // <-- optional
            .map(l -> l.split(";"))
            .map(l -> new Measurement(l)) // Measurement is a record
            .collect(groupingBy(m -> m.station(), collector));

    return new TreeMap<>(result).toString();
}
    

The Results

TestCoresTime DOTime HE
Baseline 1 247.6 s 152.1 s
Baseline 8 111.3 s 81.4 s
Gemini 1 144.8 s
Claude 1 Fail OOM
Claude2 1 122.4 s

Flamegraph

run78.4 %
readLine17.7 %
split31.7 %
parseDouble11.7 %
store/reduce14.4 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC01_BaselineST measurements.txt 2 2 \

The Second Baseline

Stream Tuning is Too Complicated


/**
 * A more classic version
 */
public String run(final String fileName) throws IOException
{
    final Map<String, Temperatures> cities = new HashMap<>();

    try (var reader = Files.newBufferedReader(Paths.get(fileName)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            // split the line
            final String[] cols = line.split(";");

            // get us the data to store
            final String city = cols[0];
            final double temperature = Double.parseDouble(cols[1]);

            // store and sum up the data
            // Map::merge(key, value, remappingFunction)
            cities.merge(city, new Temperatures(temperature), (t1, t2) -> t1.merge(t2));
        }
    }

    // ok, we got everything, now we need to order it
    return new TreeMap<String, Temperatures>(cities).toString();
}
    

/**
 * The temperature holder
 */
private static class Temperatures {
	private final double min;
	private final double max;
	private final double total;
	private final int count;

	public Temperatures(double value) {
		this.min = value;
		this.max = value;
		this.total = value;
		this.count = 1;
	}

	private Temperatures(double min, double max, double total, int count) {
		this.min = min;
		this.max = max;
		this.total = total;
		this.count = count;
	}

	public Temperatures merge(final Temperatures other) {
		return new Temperatures(
                Math.min(min, other.min),
                Math.max(max, other.max),
                total + other.total,
                count + other.count);
	}
    ...
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%

Flamegraph

Area 01 03
run 78.4 % 80.4 %
readLine 17.7 % 17.9 %
split 31.7 % 26.8 %
parseDouble 11.7 % 12.2 %
merge 14.4 % 21.0 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC03_NoStreamST measurements.txt 2 2 \

Split is Slow

Get Rid Off split


public String run(final String fileName) throws IOException
{
    // our cities with temperatures
    final Map<String, Temperatures> cities = new HashMap<>();

    try (var reader = Files.newBufferedReader(Paths.get(fileName)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            // split the line
            final int pos = line.indexOf(';');

            // get us the city
            final String city = line.substring(0, pos);
            final String temperatureAsString = line.substring(pos + 1);

            // parse our temperature
            final double temperature = Double.parseDouble(temperatureAsString);

            // merge the data into the captured data
            cities.merge(city, new Temperatures(temperature), (t1, t2) -> t1.merge(t2));
        }
    }

    // ok, we got everything, now we need to order it and print it
    return new TreeMap<String, Temperatures>(cities).toString();
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%

But doesn't split shortcut?


private String[] split(String regex, int limit, boolean withDelimiters) {
    /* fastpath if the regex is a
     * (1) one-char String and this character is not one of the
     *     RegEx's meta characters ".$|()[{^?*+\\", or
     * (2) two-char String and the first char is the backslash and
     *     the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.length() == 1 &&
            ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
            (regex.length() == 2 &&
                    regex.charAt(0) == '\\' &&
                    (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
                    ((ch-'a')|('z'-ch)) < 0 &&
                    ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
                    ch > Character.MAX_LOW_SURROGATE))
    {
        // All the checks above can potentially be constant folded by
        // a JIT/AOT compiler when the regex is a constant string.
        // ...
        // This split now creates objects
        return split(ch, limit, withDelimiters);
    }
    ...
}
    

Flamegraph

Area 01 03 05
run 78.4 % 80.4 % 81.3 %
readLine 17.7 % 17.9 % 31.6 %
split 31.7 % 26.8 % n/a
indexOf n/a n/a 2.2 %
subString n/a 9.2 % 7.6 %
parseDouble 11.7 % 12.2 % 14.8 %
merge 14.4 % 21.0 % 23.9 %

Double Parsing

Make double Parsing Less Generic

The Problem in 224 Lines


static ASCIIToBinaryConverter readJavaFormatString( String in ) throws NumberFormatException {
    boolean isNegative = false;
    boolean signSeen   = false;
    int     decExp;
    char    c;

parseNumber:
    try{
        in = in.trim(); // don't fool around with white space.
                        // throws NullPointerException if null
        int len = in.length();
        if ( len == 0 ) {
            throw new NumberFormatException("empty String");
        }
        int i = 0;
        switch (in.charAt(i)){
        case '-':
            isNegative = true;
            //FALLTHROUGH
        case '+':
            i++;
            signSeen = true;
        }
        c = in.charAt(i);
        if(c == 'N') { // Check for NaN
            if((len-i)==NAN_LENGTH && in.indexOf(NAN_REP,i)==i) {
                return A2BC_NOT_A_NUMBER;
            }
            // something went wrong, throw exception
            break parseNumber;
        } else if(c == 'I') { // Check for Infinity strings
            if((len-i)==INFINITY_LENGTH && in.indexOf(INFINITY_REP,i)==i) {
                return isNegative? A2BC_NEGATIVE_INFINITY : A2BC_POSITIVE_INFINITY;
            }
            // something went wrong, throw exception
            break parseNumber;
        } else if (c == '0')  { // check for hexadecimal floating-point number
            if (len > i+1 ) {
                char ch = in.charAt(i+1);
                if (ch == 'x' || ch == 'X' ) { // possible hex string
                    return parseHexString(in);
                }
            }
        }  // look for and process decimal floating-point string

        char[] digits = new char[ len ];
        boolean decSeen = false;
        int nDigits = 0;
        int decPt = 0;
        int nLeadZero = 0;
        int nTrailZero = 0;

    skipLeadingZerosLoop:
        while (i < len) {
            c = in.charAt(i);
            if (c == '0') {
                nLeadZero++;
            } else if (c == '.') {
                if (decSeen) {
                    // already saw one ., this is the 2nd.
                    throw new NumberFormatException("multiple points");
                }
                decPt = i;
                if (signSeen) {
                    decPt -= 1;
                }
                decSeen = true;
            } else {
                break skipLeadingZerosLoop;
            }
            i++;
        }
    digitLoop:
        while (i < len) {
            c = in.charAt(i);
            if (c >= '1' && c <= '9') {
                digits[nDigits++] = c;
                nTrailZero = 0;
            } else if (c == '0') {
                digits[nDigits++] = c;
                nTrailZero++;
            } else if (c == '.') {
                if (decSeen) {
                    // already saw one ., this is the 2nd.
                    throw new NumberFormatException("multiple points");
                }
                decPt = i;
                if (signSeen) {
                    decPt -= 1;
                }
                decSeen = true;
            } else {
                break digitLoop;
            }
            i++;
        }
        nDigits -=nTrailZero;
        //
        // At this point, we've scanned all the digits and decimal
        // point we're going to see. Trim off leading and trailing
        // zeros, which will just confuse us later, and adjust
        // our initial decimal exponent accordingly.
        // To review:
        // we have seen i total characters.
        // nLeadZero of them were zeros before any other digits.
        // nTrailZero of them were zeros after any other digits.
        // if ( decSeen ), then a . was seen after decPt characters
        // ( including leading zeros which have been discarded )
        // nDigits characters were neither lead nor trailing
        // zeros, nor point
        //
        //
        // special hack: if we saw no non-zero digits, then the
        // answer is zero!
        // Unfortunately, we feel honor-bound to keep parsing!
        //
        boolean isZero = (nDigits == 0);
        if ( isZero &&  nLeadZero == 0 ){
            // we saw NO DIGITS AT ALL,
            // not even a crummy 0!
            // this is not allowed.
            break parseNumber; // go throw exception
        }
        //
        // Our initial exponent is decPt, adjusted by the number of
        // discarded zeros. Or, if there was no decPt,
        // then its just nDigits adjusted by discarded trailing zeros.
        //
        if ( decSeen ){
            decExp = decPt - nLeadZero;
        } else {
            decExp = nDigits + nTrailZero;
        }

        //
        // Look for 'e' or 'E' and an optionally signed integer.
        //
        if ( (i < len) &&  (((c = in.charAt(i) )=='e') || (c == 'E') ) ){
            int expSign = 1;
            int expVal  = 0;
            int reallyBig = Integer.MAX_VALUE / 10;
            boolean expOverflow = false;
            switch( in.charAt(++i) ){
            case '-':
                expSign = -1;
                //FALLTHROUGH
            case '+':
                i++;
            }
            int expAt = i;
        expLoop:
            while ( i < len  ){
                if ( expVal >= reallyBig ){
                    // the next character will cause integer
                    // overflow.
                    expOverflow = true;
                }
                c = in.charAt(i++);
                if(c>='0' && c<='9') {
                    expVal = expVal*10 + ( (int)c - (int)'0' );
                } else {
                    i--;           // back up.
                    break expLoop; // stop parsing exponent.
                }
            }
            int expLimit = BIG_DECIMAL_EXPONENT + nDigits + nTrailZero;
            if (expOverflow || (expVal > expLimit)) {
                // There is still a chance that the exponent will be safe to
                // use: if it would eventually decrease due to a negative
                // decExp, and that number is below the limit.  We check for
                // that here.
                if (!expOverflow && (expSign == 1 && decExp < 0)
                        && (expVal + decExp) < expLimit) {
                    // Cannot overflow: adding a positive and negative number.
                    decExp += expVal;
                } else {
                    //
                    // The intent here is to end up with
                    // infinity or zero, as appropriate.
                    // The reason for yielding such a small decExponent,
                    // rather than something intuitive such as
                    // expSign*Integer.MAX_VALUE, is that this value
                    // is subject to further manipulation in
                    // doubleValue() and floatValue(), and I don't want
                    // it to be able to cause overflow there!
                    // (The only way we can get into trouble here is for
                    // really outrageous nDigits+nTrailZero, such as 2
                    // billion.)
                    //
                    decExp = expSign * expLimit;
                }
            } else {
                // this should not overflow, since we tested
                // for expVal > (MAX+N), where N >= abs(decExp)
                decExp = decExp + expSign*expVal;
            }

            // if we saw something not a digit ( or end of string )
            // after the [Ee][+-], without seeing any digits at all
            // this is certainly an error. If we saw some digits,
            // but then some trailing garbage, that might be ok.
            // so we just fall through in that case.
            // HUMBUG
            if ( i == expAt ) {
                break parseNumber; // certainly bad
            }
        }
        //
        // We parsed everything we could.
        // If there are leftovers, then this is not good input!
        //
        if ( i < len &&
            ((i != len - 1) ||
            (in.charAt(i) != 'f' &&
             in.charAt(i) != 'F' &&
             in.charAt(i) != 'd' &&
             in.charAt(i) != 'D'))) {
            break parseNumber; // go throw exception
        }
        if(isZero) {
            return isNegative ? A2BC_NEGATIVE_ZERO : A2BC_POSITIVE_ZERO;
        }
        return new ASCIIToBinaryBuffer(isNegative, decExp, digits, nDigits);
    } catch ( StringIndexOutOfBoundsException e ){ }
    throw new NumberFormatException("For input string: \"" + in + "\"");
}

    

public static double parseDouble(final String s) {
    return parseDouble(s, 0, s.length() - 1);
}

private static final double[] multipliers = {
    1, 1, 0.1, 0.01, 0.001, 0.000_1, 0.000_01, 0.000_001, 0.000_000_1, 0.000_000_01,
    0.000_000_001, 0.000_000_000_1, 0.000_000_000_01, 0.000_000_000_001, 0.000_000_000_000_1,
    0.000_000_000_000_01, 0.000_000_000_000_001, 0.000_000_000_000_000_1, 0.000_000_000_000_000_01};

public static double parseDouble(final String s, final int offset, final int end) {
    final int negative = s.charAt(offset) == '-' ? offset + 1 : offset;

    long value = 0;
    int decimalPos = end;

    for (int i = negative; i <= end; i++) {
        final int d = s.charAt(i);
        if (d == '.') {
            decimalPos = i;
            continue;
        }
        final int v = d - DIGITOFFSET;
        value = ((value << 3) + (value << 1));
        value += v;
    }

    // adjust the decimal places
    value = negative != offset ? -value : value;
    return value * multipliers[end - decimalPos + 1];
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%

Flamegraph

Area 05 06
run 81.3 % 76.2 %
readLine 31.6 % 32.8 %
indexOf 2.2 % 2.2 %
subString 7.6 % 9.9 %
parseDouble 14.8 % 4.4 %
merge 23.9 % 25.8 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC06_NewDoubleParsingST measurements.txt 2 2 \

Substring

One Less String


while ((line = reader.readLine()) != null)
{
    // split the line
    final int pos = line.indexOf(';');

    // get us the city
    final String city = line.substring(0, pos);
    // OLD: final String temperatureAsString = line.substring(pos + 1);

    // parse our temperature inline without an instance of a string for temperature
    // OLD: final double temperature = Double.parseDouble(temperatureAsString);
    final double temperature = ParseDouble.parseDouble(line, pos + 1, line.length() - 1);

    // merge the data into the captured data
    cities.merge(city, new Temperatures(temperature), (t1, t2) -> t1.merge(t2));
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%

Flamegraph

Area 05 06 07
run 81.3 % 76.2 % 76.4 %
readLine 31.6 % 32.8 % 29.6 %
indexOf 2.2 % 2.2 % 2.6 %
subString 7.6 % 9.9 % 5.8 %
parseDouble 14.8 % 4.4 % 6.0 %
merge 23.9 % 25.8 % 28.7 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC07_NoCopyForDoubleST measurements.txt 2 2 \

Double is Double the Work

Ignore double as longint as Possible


/**
 * Parses a double but ends up with an int, only because we know
 * the format of the results -99.9 to 99.9
 */
public static int parseInteger(final String s, final int offset, final int end)
{
    final int negative = s.charAt(offset) == '-' ? offset + 1 : offset;

    int value = 0;

    for (int i = negative; i <= end; i++)
    {
        final int d = s.charAt(i);
        if (d == '.')
        {
            continue;
        }
        final int v = d - DIGITOFFSET;
        // multiply by 10 = multiply by 8 + multiply by 2
        value = ((value << 3) + (value << 1));
        value += v;
    }

    value = negative != offset ? -value : value;
    return value;
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%
BRC08_GoIntST124.0 s50.1%87.3 s57.4%

Flamegraph

Area 05 06 07 08
run 81.3 % 76.2 % 76.4 % 79.3 %
readLine 31.6 % 32.8 % 29.6 % 34.4 %
indexOf 2.2 % 2.2 % 2.6 % 2.5 %
subString 7.6 % 9.9 % 5.8 % 6.0 %
parseDouble 14.8 % 4.4 % 6.0 % 8.2 %
merge 23.9 % 25.8 % 28.7 % 24.2 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC08_GoIntST measurements.txt 2 2 \

Remove Lambda

Explicit Merging


while ((line = reader.readLine()) != null)
{
    // split the line
    final int pos = line.indexOf(';');

    // get us the city
    final String city = line.substring(0, pos);

    // parse our temperature inline without an instance of a string for temperature
    final int temperature = ParseDouble.parseInteger(line, pos + 1, line.length() - 1);

    // get city and update
    // OLD: cities.merge(city, new Temperatures(temperature), (t1, t2) -> t1.merge(t2));
    final var v = cities.get(city);
    final var t = new Temperatures(temperature);
    cities.put(city, v != null ? v.merge(t) : t);
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC01_BaselineMT111.2 s44.9%81.3 s53.5%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%
BRC08_GoIntST124.0 s50.1%87.3 s57.4%
BRC09_NoMergeST131.1 s53.0%95.4 s62.7%

Flamegraph

run 80.1 %
readLine 39.1 %
indexOf 2.8 %
substring 7.9 %
parseInteger6.3 %
get 10.9 %
merge 2.3 %
put 5.2 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC09_NoMergeST measurements.txt 2 2 \

Mutablility to the Rescue

Less Memory Churn


while ((line = reader.readLine()) != null)
{
    ...
    // get city
    final Temperatures v = cities.get(city);
    if (v != null)
    {
        // know it, put both together
        v.add(temperature);
    }
    else
    {
        // we have not seen that city yet, create a container and store it
        cities.put(city, new Temperatures(temperature));
    }
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%
BRC08_GoIntST124.0 s50.1%87.3 s57.4%
BRC09_NoMergeST131.1 s53.0%95.4 s62.7%
BRC10_MutateST109.8 s44.4%84.9 s55.9%

Flamegraph

Area 09 10
run 80.3 % 71.3 %
readLine 39.1 % 38.1 %
indexOf 2.8 % 2.4 %
substring 7.9 % 6.2 %
parseInteger 6.3 % 6.6 %
get 10.9 % 15.7 %
merge 2.3 % 0.0 %
put 5.2 % 0.0 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC10_MutateST measurements.txt 2 2 \

Open Hashing

Simplifying the Map


// https://github.com/mikvor/hashmapTest/blob/master/src/main/java/map/objobj/ObjObjMap.java
public class FastHashMap<K, V> {
    private static final Object FREE_KEY = new Object();
    private Object[] m_data;
...
    public V get( final K key ) {
        int ptr = (key.hashCode() & m_mask) << 1;
        Object k = m_data[ ptr ];

        if ( k == FREE_KEY ) {
            return null;  //end of chain already
        }

        if ( k.hashCode() == key.hashCode() && k.equals( key ) ) {
            return (V) m_data[ ptr + 1 ];
        }

        while ( true ) {
            ptr = (ptr + 2) & m_mask2; //that's next index
            k = m_data[ ptr ];
            if ( k == FREE_KEY ) {
                return null;
            }
            if (k.equals( key )) {
                return (V) m_data[ ptr + 1 ];
            }
        }
    }
...
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%
BRC08_GoIntST124.0 s50.1%87.3 s57.4%
BRC09_NoMergeST131.1 s53.0%95.4 s62.7%
BRC10_MutateST109.8 s44.4%84.9 s55.9%
BRC12_NewMapST109.6 s44.3%80.2 s52.8%

Flamegraph

Area 12
run74.1 %
readLine 37.9 %
indexOf 2.0 %
substring 7.7 %
parseInteger8.3 %
get 15.4 %
merge 0.0 %
put 0.0 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC12_NewMapST measurements.txt 2 2 \

Set instead of Map

Eliminate Slots and Hashcode Calculations


private Temperatures[] m_data;

public void getPutOrUpdate( final String city, int value ) {
    final int hash = city.hashCode();
    int ptr = hash & m_mask;
    Temperatures k = m_data[ ptr ];

    if ( k == FREE_KEY ) {
        put(new Temperatures(city, value));
        return;
    }

    if ( k.equals( city ) ) {
        k.add(value);
        return;
    }

    while ( true ) {
        ptr = (ptr + 1) & m_mask; //that's next index
        k = m_data[ ptr ];
        if ( k == FREE_KEY ) {
            put(new Temperatures(city, value));
            return;
        }
        if ( k.equals( city ) ) {
            k.add(value);
            return;
        }
    }
}
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%
BRC08_GoIntST124.0 s50.1%87.3 s57.4%
BRC09_NoMergeST131.1 s53.0%95.4 s62.7%
BRC10_MutateST109.8 s44.4%84.9 s55.9%
BRC12_NewMapST109.6 s44.3%80.2 s52.8%
BRC13_HardcodedSetST105.3 s42.6%76.5 s50.3%

Flamegraph

run 72.1 %
readLine 38.2 %
indexOf 3.2 %
substring 4.5 %
parseInteger6.7 %
putOrUpdate16.1 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC13_HardcodedSetST measurements.txt 2 2 \

Heavy Lifting

Approach IO Differently

Think Harder

Sit back, relax, and enjoy some music

  • Think about the data
  • String;Double\n
  • Bytes{1-100};-?[0-9]{1-2}.[0-9]\n
  • Bytes;Bytes\n
  • Bytes;Bytes.Byte\n
  • Bytes{6,107}
  • Forget String
  • Forget char
  • Delay Unicode processing
  • Forget error handling
  • Read and process just bytes

try (var raf = new RandomAccessFile(fileName, "r");
     var channel = raf.getChannel();)
{
    final Line line = new Line(channel);

    while (true)
    {
        line.readFromChannel();

        if (line.hasNewLine)
        {
            // parse our temperature inline without an instance of a string for temperature
            final int temperature = ParseDouble.parseInteger(line.data,
                line.semicolonPos + 1, line.newlinePos - 1);

            // find and update
            cities.getPutOrUpdate(line, temperature);
        }
        if (line.EOF)
        {
            break;
        }
    }
}
    

private ByteBuffer buffer = ByteBuffer.allocate(MIN_BUFFERSIZE);
private byte[] data = buffer.array();

private void readFromChannel()
{
    hashCode = -1;
    hasNewLine = false;

    try {
        // do we near the end of the buffer?
        if (end - pos < REMAINING_MIN_BUFFERSIZE)
        {
            // we move the buffer indirectly, because
            // the ByteBuffer just wraps our array,
            // nothing for the tenderhearted
            System.arraycopy(data, pos,
                data, 0, data.length - pos);
            end = end - pos;
            pos = 0;
            buffer.position(end);

            // fill the buffer up
            final int readBytes = channel.read(buffer);
            if (readBytes == -1) {
                EOF = true;
            }

            end = buffer.position();
        }
    }
    catch (IOException e) {
        ...
    }
    
    lineStartPos = pos;

    // look for semicolon and new line
    int i = pos;
    for (; i < end; i++)
    {
    	final byte b = data[i];
    	if (b == ';')
    	{
    		semicolonPos = i;
    		break;
    	}
    }

    i++;
    for (; i < end; i++)
    {
        final byte b = data[i];
    	if (b == '\n')
    	{
    		newlinePos = i;
    		pos = i + 1;
    		hasNewLine = true;
    		return;
    	}
    }
}





    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%
BRC08_GoIntST124.0 s50.1%87.3 s57.4%
BRC09_NoMergeST131.1 s53.0%95.4 s62.7%
BRC10_MutateST109.8 s44.4%84.9 s55.9%
BRC12_NewMapST109.6 s44.3%80.2 s52.8%
BRC13_HardcodedSetST105.3 s42.6%76.5 s50.3%
BRC14b_ReadBytesFixed57.3 s23.2%46.1 s30.4%
BRC14c_ReadBytesObjects46.0 s30.3%
BRC14b_ReadBytesFixedImproved43.0 s28.3%

Flamegraph

Area 13 14
run72.1 %80.2 %
readLine 38.2 %
indexOf 3.2 %
substring 4.5 %
read 2.1 %
parse line 18.4 %
parseInteger 6.7 %8.3 %
putOrUpdate 16.1 %48.6 %
hashCode 5.1 %26.6 %
equals 7.5 %14.4 %
put/update 2.8 %7.6 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC14_ReadBytesST measurements.txt 2 2 \

A Surprise

Enjoy a Treat

Some things in life come free

A benefit we have not asked for


# Run with max heap 4 GB for 10 M lines
java -Xlog:gc*=debug \
     -Xms4g -Xmx4g \
     -XX:+UnlockExperimentalVMOptions \
     -XX:+UseEpsilonGC \
     -cp target/classes/ \
     org.rschwietzke.devoxxpl24.BRC01_BaselineST \
     data-10m.txt 0 1

[3.726s][info][gc] Heap: 4096M reserved,
                   3204M (78.22%) committed,
                   3199M (78.11%) used
        

1,000 M lines = 100 x 3.2 GB = 320 GB


#
java ... \
     org.rschwietzke.devoxxpl24.BRC13_HardcodedSetST \
     data-10m.txt 0 1

[2,035s][info][gc] Heap: 4096M reserved,
                   1129M (27,57%) committed,
                   1037M (25,33%) used
        

1,000 M lines = 100 x 1.04 GB = 104 GB


# Run with max heap 4 MB for 1000 M lines
java -Xlog:gc*=debug \
     -Xms4m -Xmx4m \
     -XX:+UnlockExperimentalVMOptions \
     -XX:+UseEpsilonGC \
     -cp target/classes/ \
     org.rschwietzke.devoxxpl24.BRC14_ReadBytesST \
     data-1000m.txt 0 1

[55.401s][info][gc] Heap: 4096K reserved,
                    4096K (100.0%) committed,
                    3634K (88.73%) used
            

1,000 M lines = 3.7 MB


# Run with max heap 4 GB for 10 M lines again !!
java ... \
     org.rschwietzke.devoxxpl24.BRC14c_ReadBytesObjects\
     data-10m.txt 0 1

[1.945s][info][gc] Heap: 4096M reserved,
                   4096M (100.00%) committed,
                   208M (5.10%) used
        

1,000 M lines = 100 x 208M = 21 GB

Some Cheapish Tuning

Attempting to Save on Number Parsing


/**
 * Parses a double but ends up with an int, only because we know
 * the format of the results -99.9 to 99.9
 */
public static int parseIntegerFixed(final byte[] b, final int offset, final int end) {
    final int length = end - offset; // one is missing, we care for that later

    // we know the first three pieces already 9.9
    int p0 = b[end];
    int p1 = b[end - 2] * 10;
    int value = p0 + p1 - (DIGITOFFSET + DIGITOFFSET * 10);

    // we are 9.9
    if (length == 2) {
    	return value;
    }

    // ok, we are either -9.9 or 99.9 or -99.9
    if (b[offset] != (byte)'-') {
    	// we are 99.9
    	value += b[end - 3] * 100 - DIGITOFFSET * 100;
    	return value;
    }

    // we are either -99.9 or -9.9
    if (length == 3) {
    	// -9.9
    	return -value;
    }

    // -99.9
    value += b[end - 3] * 100 - DIGITOFFSET * 100;
    return -value;
}

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%
BRC08_GoIntST124.0 s50.1%87.3 s57.4%
BRC09_NoMergeST131.1 s53.0%95.4 s62.7%
BRC10_MutateST109.8 s44.4%84.9 s55.9%
BRC12_NewMapST109.6 s44.3%80.2 s52.8%
BRC13_HardcodedSetST105.3 s42.6%76.5 s50.3%
BRC14b_ReadBytesFixed57.3 s23.2%46.1 s30.4%
BRC15_ParseDoubleFixedST55.0 s22.2%42.7 s28.1%

Flamegraph

Area 14 15
run 80.2 % 78.0 %
read 2.1 % 3.9 %
parse line 18.4 % 18.3 %
parseInteger 8.3 % 9.0 %
putOrUpdate 48.6 % 44.3 %
hashCode 26.6 % 26.6 %
equals 14.4 % 14.4 %
put/update 7.6 % 7.6 %

java \
    -agentpath:async-profiler/lib/libasyncProfiler.so=start,event=cpu,sig,flamegraph,file=$1-cpu.html \
    -cp target/classes/ \
    org.rschwietzke.devoxxpl24.BRC15_ParseDoubleFixedST measurements.txt 2 2 \

Changing Gears

Tuning is Not That Simple Anymore

More Ideas

Hot but not really hot, because we need that code

  • hashCode still on top
  • equals for the byte array
  • parsing the line is still expensive
  • Think harder first
  • Resort to other metrics for further tuning

Simpler Min/Max


public void add(final int value)
{
    this.min = Math.min(this.min, value);
    this.max = Math.max(this.max, value);
    this.total += value;
    this.count++;
}





 
    

public void add(final int value)
{
    if (value < this.min)
    {
        this.min = value;
    }
    else if (value > this.max)
    {
        this.max = value;
    }
    this.total += value;
    this.count++;
}
    

HashCode


// OLD: look for semicolon and new line
int i = pos;
for (; i < end; i++)
{
    final byte b = data[i];
    if (b == ';')
    {
        semicolonPos = i;
        break;
    }
}

// Later
public int ByteBuffer::hashCode() {
    int h = 1;
    int p = position();
    for (int i = limit() - 1; i >= p; i--)
        h = 31 * h + (int)get(i);

    return h;
}

// We loop twice, when searching the ; and when
// calculating the hashcode later.
// Do it in one go, will safe also a comparison

// NEW: look for semicolon and new line
// when checking for semicolon, we do the
// hashcode right away
int h = 1;
int i = pos;
for (; i < end; i++)
{
    final byte b = data[i];
    if (b == ';')
    {
        semicolonPos = i;
        break;
    }
    h = 31 * h + b;
}
this.hashCode = h;
 
        

HashCode


// we loop twice, when searching the ; and when calculating the hashcode later
// do it in one go, will safe also a comparison

// look for semicolon and new line
// when checking for semicolon, we do the hashcode right away
int h = 1;
int i = pos;
for (; i < end; i++)
{
    final byte b = data[i];
    if (b == ';')
    {
        semicolonPos = i;
        break;
    }
    h = 31 * h + b;
}
this.hashCode = h;
    

// remove the multiplication and
// use bit shift and subtraction

// look for semicolon and new line
// when checking for semicolon, we do the hashcode right away
int h = 1;
int i = pos;
for (; i < end; i++)
{
    final byte b = data[i];
    if (b == ';')
    {
        semicolonPos = i;
        break;
    }
    h = (h << 5) - h + b;
}
this.hashCode = h;
    

The Results

TestTime DO%Time HE%
BRC01_BaselineST247.6 s100.0%152.0 s100.0%
BRC01_BaselineMT111.2 s44.9%81.3 s53.5%
BRC03_NoStreamST232.1 s93.7%138.9 s91.4%
BRC05_ReplaceSplitST175.9 s71.0%109.7 s72.2%
BRC06_NewDoubleParsingST141.2 s57.0%99.0 s65.2%
BRC07_NoCopyForDoubleST127.7 s51.6%88.0 s57.9%
BRC08_GoIntST124.0 s50.1%87.3 s57.4%
BRC09_NoMergeST131.1 s53.0%95.4 s62.7%
BRC10_MutateST109.8 s44.4%84.9 s55.9%
BRC12_NewMapST109.6 s44.3%80.2 s52.8%
BRC13_HardcodedSetST105.3 s42.6%76.5 s50.3%
BRC14b_ReadBytesFixed57.3 s23.2%46.1 s30.4%
BRC15_ParseDoubleFixedST55.0 s22.2%42.7 s28.1%
BRC20_UseArrayNoBufferST52.5 s21.2%42.9 s28.3%
BRC21_ManualMinMaxST51.6 s20.9%42.7 s28.1%
BRC22_EarlyHashCodeST39.1 s15.8%36.4 s24.0%
BRC23a_NoMulSplitST38.7 s15.7%33.7 s22.2%
BRC23_NoMulST39.9 s16.1%36.3 s23.9%
BRC24_DifferentBranchInHashCodeST40.0 s16.2%35.9 s23.6%
BRC25_SmallAddReordingST39.0 s15.8%36.2 s23.8%
BRC26_MoreMapSpaceST39.3 s15.9%33.0 s21.7%

More Fiddling

TestTime DO%Time HE%
BRC27_SmallPutST40.8 s16.5%33.0 s21.7%
BRC28_FineTuningST42.9 s17.4%31.9 s21.0%
BRC29a_ParseDoubleTuningST40.7 s16.4%31.7 s20.9%
BRC29c_ArrayCopyInMethod35.8 s14.5%30.7 s20.3%
BRC29d_EqualsNotBoolean37.0 s15.0%31.4 s20.7%
BRC29e_EarlyIntResolution36.8 s14.9%31.6 s20.8%
BRC29f_LessDataForTempResolution36.4 s14.7%31.4 s20.7%
BRC29g_FixedIntParsing38.9 s15.7%31.9 s21.0%
BRC30_DangerNoEqualsST28.4 s11.5%23.1 s15.2%
BRC40a_NoChannel38.6 s15.6%31.0 s20.4%
BRC40b_ReturnInstead35.3 s14.3%31.5 s20.7%
BRC40c_UnrollTempParsing32.4 s13.1%28.8 s19.0%
BRC40d_LongLoop33.2 s13.4%28.3 s18.7%
BRC40e_NoReloadSub31.7 s12.8%28.1 s18.5%
BRC40f_DoWhile31.8 s12.9%28.1 s18.5%
BRC40g_Put31.9 s12.9%28.1 s18.5%
BRC40h_ManualMismatch33.6 s13.6%27.3 s18.0%
BRC40i_SmallerSemicolonLoop32.4 s13.1%27.0 s17.8%
BRC40j_LessStateInSetEquals32.1 s13.0%27.2 s17.9%
BRC40_NoChannel35.3 s14.3%31.6 s20.8%
BRC41a_FixedFastHashSet31.8 s12.9%27.3 s18.0%
BRC41b_ReorderedLineFields31.8 s12.9%27.4 s18.0%
BRC41c_LargerBuffer31.8 s12.9%27.2 s18.0%

More Fiddling

TestTime DO%Time HE%
BRC42a_WhileTrue31.6 s12.8%27.2 s17.9%
BRC42b_NoReturnBranch32.4 s13.1%28.6 s18.8%
BRC43_NoSubClass33.2 s13.4%27.6 s18.2%
BRC45_DoubleTheSetSize32.5 s13.2%27.4 s18.0%
BRC46_TunedHashSet32.5 s13.1%28.2 s18.6%
BRC47_LeanerPut32.5 s13.1%28.0 s18.5%
BRC48_FixedFactor34.0 s13.7%28.2 s18.6%
BRC49_OffsetSubtraction32.5 s13.1%27.8 s18.3%
BRC50_Short32.2 s13.0%27.7 s18.3%
BRC51_TempParsingLessBranches32.9 s13.3%28.3 s18.6%
BRC52_TempParsingBitSubtraction31.8 s12.9%27.7 s18.3%
BRC53_SetEqualsNoLocalAssignment31.6 s12.8%27.2 s17.9%
BRC54_LoopVariableBackInLoop31.8 s12.9%27.3 s18.0%
BRC55_SimplerPutCall31.7 s12.8%27.2 s17.9%
BRC56_MainLoopAsWhile31.6 s12.8%26.5 s17.5%
BRC57_SimplerHashing_VOID32.0 s12.9%25.9 s17.1%
BRC58_UnnoticedCharSkipping33.5 s13.6%27.4 s18.0%
BRC60_FeedCPUUnrollSemicolonLoop31.3 s12.7%26.1 s17.2%
BRC61_BitShiftMul10Void31.7 s12.8%26.3 s17.3%
BRC62_IsEqualsWithADifferentBranchApproach30.8 s12.5%26.7 s17.6%
BRC63b_Equals_MainLoop30.6 s12.4%26.3 s17.3%
BRC63_Equals31.3 s12.7%26.6 s17.5%

More Fiddling

TestTime DO%Time HE%
BRC64_CombinePuts32.1 s13.0%27.1 s17.9%
BRC65_OneMainLoopMethod31.8 s12.9%29.3 s19.3%
BRC66_HashCode31.0 s12.5%26.2 s17.2%
BRC67_StoreArrayLength30.9 s12.5%26.2 s17.3%
BRC68_RemoveNewLinePos31.3 s12.7%27.5 s18.1%
BRC69_RemovedExtraAdd30.9 s12.5%26.2 s17.3%
BRC70_RemovedPutReturn31.3 s12.7%26.2 s17.3%

Final Thoughts and Numbers

A Few Last Data Tables

Ideas we utilized

  • Replace standard JDK functionality
  • Low or no allocation of memory
  • Use the set boundaries to your advantage (data format)
  • Wrappers are bad
  • Immutability is bad
  • Utilize memory caching and keeping data together
  • Less branches (if, loops) and no branch misses
  • CPU pipelining, do more than one thing at a time
  • Utilize inlining

Behind the Scenes


# org.rschwietzke.devoxxpl24.BRC01_BaselineST
# Hetzner, AMD, 21.0.4-tem
# Measurement Runtime: 150,932 ms 100%

       160,915.21 ms task-clock              #   1.066 CPUs utilized
           37,195    context-switches        # 231.147 /sec
            6,232    cpu-migrations          #  38.728 /sec
          236,617    page-faults             #   1.470 K/sec
  539,415,718,644    cycles                  #   3.352 GHz
   41,233,849,888    stalled-cycles-frontend #   7.64% frontend cycles idle
2,258,010,166,520    instructions            #    4.19 insn per cycle
                                             #    0.02 stalled cycles per insn
  522,584,792,160    branches                #   3.248 G/sec
    2,479,441,210    branch-misses           #   0.47% of all branches

    150.932228741 seconds time elapsed

    145.057695000 seconds user
     11.626114000 seconds sys

# 2258 instructions per CSV line

# org.rschwietzke.devoxxpl24.BRC45_DoubleTheSetSize
# Hetzner, AMD, 21.0.4-tem
# Mean Measurement Runtime: 27,061 ms 17.9% (to baseline)

       27,538.78 ms task-clock              #   1.018 CPUs utilized
           1,029    context-switches        #  37.365 /sec
              29    cpu-migrations          #   1.053 /sec
           7,885    page-faults             # 286.324 /sec
  98,276,507,946    cycles                  #   3.569 GHz
  10,431,399,857    stalled-cycles-frontend #  10.61% frontend cycles idle
 303,659,521,159    instructions            #    3.09 insn per cycle
                                            #    0.03 stalled cycles per insn
  55,868,360,010    branches                #   2.029 G/sec
   1,540,980,291    branch-misses           #   2.76% of all branches

    27.061235813 seconds time elapsed

    26.366348000 seconds user
     1.062075000 seconds sys

# 303 instructions per CSV line

What is left?

There is still room

  • byte limits, reads only 8 bit at a time but could do 64 bit instead (factor 8!)
  • int or long would be better (need for UNSAFE or different ByteBuffer usage)
  • mmap a file instead of filling a buffer
  • Reduce branching further, especially misses
  • Branchless coding
  • Tuning for inlining and CPU pipelining still possible

Behind the Scenes


# org.rschwietzke.devoxxpl24.BRC45_DoubleTheSetSize
# Hetzner, AMD, 21.0.4-tem
# Mean Measurement Runtime: 27,061 ms 17.9% (to baseline)


      27,538.78 msec task-clock              #   1.018 CPUs utilized
          1,029      context-switches        #  37.365 /sec
             29      cpu-migrations          #   1.053 /sec
          7,885      page-faults             # 286.324 /sec
 98,276,507,946      cycles                  #   3.569 GHz
 10,431,399,857      stalled-cycles-frontend #  10.61% frontend cycles idle
303,659,521,159      instructions            #    3.09 insn per cycle
                                             #    0.03 stalled cycles per insn
 55,868,360,010      branches                #   2.029 G/sec
  1,540,980,291      branch-misses           #   2.76% of all branches

   27.061235813 seconds time elapsed

   26.366348000 seconds user
    1.062075000 seconds sys

# 303 instructions per CSV line

# Thomas Wuerthinger, 1 core
# Hetzner, AMD, 21.0.4-tem
# Mean Measurement Runtime: 11,281 ms 7.4% (to baseline),
# 41.7% of BRC45_DoubleTheSetSize

      11,269.99 msec task-clock              #   0.999 CPUs utilized
          1,748      context-switches        # 155.102 /sec
              1      cpu-migrations          #   0.089 /sec
        223,183      page-faults             #  19.803 K/sec
 39,555,927,795      cycles                  #   3.510 GHz
  2,945,609,903      stalled-cycles-frontend #   7.45% frontend cycles idle
164,487,950,474      instructions            #    4.16 insn per cycle
                                             #    0.02 stalled cycles per insn
 15,668,783,505      branches                #   1.390 G/sec
    149,371,238      branch-misses           #   0.95% of all branches

   11.281867347 seconds time elapsed

    0.076590000 seconds user
    0.130600000 seconds sys

# 164 instructions per CSV line

Warning, Warning

Your are about to enter an unstable area!

Different Machines

TestWhatMachine%
BRC01_BaselineST Thinkpad T14s218.8 s100.0 %
BRC01_BaselineST GCP-C2-4c-32g278.4 s100.0 %
BRC01_BaselineST GCP-C2D-4c-32g172.4 s100.0 %
BRC01_BaselineST DO Intel Standard 16c-32g 247.6 s 100.0 %
BRC01_BaselineST DO Intel Premium 16c-32g 154.2 s 100.0 %
BRC01_BaselineST Hetzner AMD 8c-32g 146.7 s 100.0 %
BRC26_MoreMapSpaceST Thinkpad T14s35.3 s 16.1 %
BRC26_MoreMapSpaceST GCP-C2-4c-32g43.4 s 15.6 %
BRC26_MoreMapSpaceST GCP-C2D-4c-32g34.8 s 20.2 %
BRC26_MoreMapSpaceS DO Intel Standard 16c-32g 39.0 s 15.8 %
BRC26_MoreMapSpaceS DO Intel Premium 16c-32g 31.1 s 19.8 %
BRC26_MoreMapSpaceST Hetzner AMD 8c-32g 33.5 s 22.8 %

Different JDKs

TestWhatTime DO%
BRC01_BaselineST 21.0.3-tem215.5 s100.0 %
BRC26_MoreMapSpaceST 21.0.3-tem38.4 s 17.8 %
BRC01_BaselineST 21.0.2-graal212.7 s100.0 %
BRC26_MoreMapSpaceST 21.0.2-graal39.8 s 17.6 %
BRC26_MoreMapSpaceST graal-native-o341.4 s  
BRC26_MoreMapSpaceST graal-native-pgo44.7 s  
TestWhatTime HE
BRC69_RemovedExtraAdd21.0.1-tem, Epsilon25.7 s
BRC69_RemovedExtraAdd21.0.6-tem, Epsilon26.1 s
BRC69_RemovedExtraAdd21.0.6-tem, G126.1 s
BRC69_RemovedExtraAdd21.0.6-graal, G124.9 s
BRC69_RemovedExtraAdd24-ea.34, Epsilon26.7 s
BRC69_RemovedExtraAdd24-ea.34, Epsilon, Lilliput26.3 s
BRC69_RemovedExtraAdd25.lilliput.100, Epsilon, Lilliput26.3 s
BRC69_RemovedExtraAdd24.valhalla, Epsilon26.5 s

Disclaimer and Warning

The cloud sucks when benchmarking

  • Shared hardware is bad for small numbers
  • Burst capacity might be a problem
  • Other users running stuff on that host
  • Some tools fail aka perf

# BRC01_BaselineST - DO CPU 16c
Warmup Runtime:      219,665 ms
Warmup Runtime:      220,341 ms
Warmup Runtime:      219,312 ms
Measurement Runtime: 219,020 ms
Measurement Runtime: 223,180 ms
Measurement Runtime: 226,418 ms

# BRC27_SmallPutST - DO CPU 16c
Warmup Runtime:      38,792 ms
Warmup Runtime:      37,929 ms
Warmup Runtime:      39,136 ms
Measurement Runtime: 39,197 ms
Measurement Runtime: 39,103 ms
Measurement Runtime: 39,106 ms

Disclaimer and Warning

The cloud sucks when benchmarking


# Google Cloud C2 - 4c - 32 GB
# BRC01_BaselineST - DO CPU 16c
Measurement Runtime: 281,448 ms
Measurement Runtime: 278,622 ms
Measurement Runtime: 281,130 ms
Measurement Runtime: 280,942 ms
Measurement Runtime: 281,287 ms

# BRC26_MoreMapSpaceST - DO CPU 16c
Measurement Runtime:  43,390 ms
Measurement Runtime:  44,003 ms
Measurement Runtime:  43,634 ms
Measurement Runtime:  43,682 ms
Measurement Runtime:  44,305 ms

# Google Cloud C2D - 4c - 32 GB
# BRC01_BaselineST - DO CPU 16c
Measurement Runtime: 172,498 ms
Measurement Runtime: 197,795 ms
Measurement Runtime: 195,888 ms
Measurement Runtime: 195,546 ms
Measurement Runtime: 195,896 ms

# BRC26_MoreMapSpaceST - DO CPU 16c
Measurement Runtime:  38,891 ms
Measurement Runtime:  34,864 ms
Measurement Runtime:  35,682 ms
Measurement Runtime:  35,691 ms
Measurement Runtime:  35,712 ms

The first 80% are not too hard.

Most of the concepts seen here are applicable to other problems.

Change one thing at a time and measure.

Most expensive first, unless something is simpler to fix.

Q&A and Rating

Thank you and please rate!

https://www.jfokus.se/rate/2178