How To Sell Performance Test Results

René Schwietzke, Xceptance GmbH

License

I care and share


This presentation is licensed under
Creative Commons Attribution 4.0 International License.

About René Schwietzke

  • Master of Computer Science (Dipl.-Inf.)
  • Programmer since 1989
  • QA and Testing since 1998
  • Performance Tester since 1999
  • Co-Founder of Xceptance
  • @ReneSchwietzke
  • @reneschwietzke@foojay.social

About Xceptance

  • Founded 2004
  • Headquarters in Jena, Germany; Subsidiary in Cambridge, MA, USA
  • Specialized in Software Testing and Quality Assurance
  • Performance testing since 2004
  • Over 150 performance test projects every year
  • World-wide customer base including APAC and South America
  • Performance Test Tool, Java-based, APL 2.0

Limitations

Things we won't talk about today

  • What is performance testing?
  • Why do we have to do it?
  • What is the right approach?
  • What is the right tool?
  • What are the right goals?
  • Many things are oversimplified

"Our" Performance Testing

Our Daily Bread

Sliced in 600 Sec or Less

Customer Segments

Whom we work for and with

  • Engineering: Product development departments (10%)
  • Businesses: Non-Commerce Businesses (10%)
  • Implementers: Commerce Service Implementation Partners (20%)
  • Merchants: Direct Commerce Customers using [PS]aaS with optional proprietary components (60%)

100 to 150 load test projects annually.

Knowledge

How much do they know about performance testing?

  • No Idea: Someone told us it is required, but we don't see a need.
  • It is important: We know it makes sense, but we don't know how to do it.
  • Done that before: We tested on-premise, but this is now SaaS.
  • Tested before: We tested before with you, let's do it again.
  • Got an idea: We understand the concept, but don't have a good approach.
  • Well educated: Already plenty of knowledge and experience.

Optionally to be garnished with: Houston, we have a problem!

Misunderstood Things

Clarifying direction and basic testing concepts

  • Server-Side Performance Test: Dynamic communication between client and server; excludes static content; skips client-side-only third parties; think request-based testing
  • Client-Side Performance Test: Rendering performance; includes all content; single-user traffic; think Core Web Vitals
  • Perceived Performance: End user performance impression; includes interaction responsiveness; think visual complete

People think perceived performance but mean client-side rendering while looking for server-side testing.

Many and Too Much

They: Many opinions, ideas, approaches, and goals.

We: Got too much data.

Mass Production

Make load testing efficient, comparable, and scalable.

Photo by Roger Wollstadt, Wolfsburg - Volkswagen Assembly Line, CC-BY-SA 2.0

Throw Names at Me

Standardize naming and terminology

  • Load Test
  • Performance Test
  • Volume Test
  • Stress Test
  • Endurance Test
  • Spike Test
  • Soak Test
  • Breakpoint Test
  • All replaced by load and performance testing
  • Types of tests replaced by load profiles, such as average day, hype sale, Black Friday
  • Volume is X% of the target load
  • Length is expressed as time, e.g. an 8-hour test

The measured performance should, under most load conditions, not imply that the system is under stress.

In English: A single user sees the same performance under heavy load as if they were the only user.

A Typical Project

What Does Our Daily Work Look Like?

The naive expectation

Performance testing is simple, isn't it?

Requirement: 500 ms or less

Result: 321 ms

A Typical Project

How to get from A to B? Standardize it!

Discovery

  • Motivation
  • Scenarios
  • Load Profiles
  • Data Dependencies
  • Third Parties
  • Timeline
  • The Spiel

Execution

  • Scripting Phase
  • Dry Runs
  • Test the Waters (25%, 50%)
  • 100% for 1 Hour
    • Rinse and Repeat If Needed
  • 100% for 8 Hours, Including Cache Invalidation
    • Rinse and Repeat If Needed
  • Final Reporting

Requirements

What we will ask for

  • Scenarios
  • Load Profiles
    • Visits / hour
    • Page Views / hour
    • Orders / hour
  • Runtime Expectations
  • Human vs. Bots
  • Cart Size Distribution
  • Conversion Rates
  • Search Count
  • Account Creation
  • Guest vs. Registered Checkout
  • Product Heatmap
  • Coupons/Discounts
  • Payment

Requirements

What we have to work with at the end

Visits / h 100,000
Page Interactions / h 1,000,000
Orders / h 3,000
Conversion Rate* 3%
Bot Visits / h* 1,000
Searches / h* 7,000
Cart Size Distribution* 1/10%, 2/20%, 3/50%, 4/20%

Detour: Concurrent Users

The Nonsense Metric That Sticks

Concurrent Users is a metric that lacks both a time component and a definition of what these users do.

Concurrent users is more of a result metric: only once you apply activities and time do you get a number that actually makes sense.
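As an illustration (not part of the slides), Little's Law turns the requirement numbers into a concurrent-user figure once you assume an average visit duration:

concurrent users = arrival rate × average visit duration
                 = (100,000 visits/h ÷ 3,600 s/h) × 300 s (assumed 5-minute visit)
                 ≈ 8,333 concurrent users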

Result

What we will be asked for at the end

  • Achieved Visits
  • Achieved Orders
  • Achieved Page Views*
  • How Many Orders Failed*
  • Sadly, this is really all!

Let's Measure

What Do We Measure?

Data Capturing

What data can we and will we capture

  • Start Time
  • DNS Resolution
  • Connect Time
  • Send/Busy/Receive Time
  • Bytes Sent/Received
  • TTFB (indirect)
  • TTLB (indirect)
  • URL
  • Method
  • Parameters
  • Response Code
  • Evaluated State*

Beyond HTTP

  • Scenario Sending It*
  • Action Triggering It*
  • Agent Executing It*
  • Semantic IDs (e.g. request-id)*
  • Error Trace*

R,QuickView.1,1571927593069,112,false,1593,6096,200,https://host/842177173640.html?cgid=sales,text/html,0,0,111,0,111,111,,,,,0,,
R,QuickView.2,1571927593184,79,false,1639,592,200,https://host/Wishlist?productID=842177173640,application/json,0,0,79,0,79,79,,,,,0,,
A,QuickView,1571927593064,199,false
R,AddToCart.1,1571927597981,263,false,1727,3889,200,https://host/Cart-AddProduct?format=ajax,text/html,0,0,260,1,260,261,,,,,0,,
A,AddToCart,1571927597981,264,false
T,TOrder,1571927533453,100982,false,,
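The raw lines above share a common prefix. A minimal sketch that reads just those leading fields, under the assumption that they are record type, name, start time (epoch ms), runtime (ms), and a failed flag; file name and class names are illustrative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: read the shared leading fields of the timer lines shown above.
public class TimerLineReader
{
    record Timer(String type, String name, long start, long runtime, boolean failed) {}

    public static void main(String[] args) throws IOException
    {
        List<Timer> timers;
        try (var lines = Files.lines(Path.of("timers.csv")))
        {
            timers = lines.map(l -> l.split(",", -1))
                          .filter(f -> f.length >= 5)
                          .map(f -> new Timer(f[0], f[1], Long.parseLong(f[2]),
                                              Long.parseLong(f[3]), Boolean.parseBoolean(f[4])))
                          .toList();
        }

        // e.g. count failed request records ("R")
        long failedRequests = timers.stream()
                                    .filter(t -> t.type().equals("R") && t.failed())
                                    .count();
        System.out.println("Failed requests: " + failedRequests);
    }
}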
        

Roll it out across time

Moving average of the last 1%

Test time: 75 min - Total: 348,798 - Mean: 576 ms
P50: 576 ms - P95: 1,420 ms - P99: 2,850 ms - P99.9: 6,430 ms
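A minimal sketch of such a smoothing pass, with the window sized at 1% of the total sample count; an illustration, not the tool's actual implementation:

// Sketch: moving average over a window covering the last 1% of all samples.
public class MovingAverage
{
    static double[] smooth(long[] runtimes)
    {
        int window = Math.max(1, runtimes.length / 100); // last 1%
        double[] out = new double[runtimes.length];
        long sum = 0;
        for (int i = 0; i < runtimes.length; i++)
        {
            sum += runtimes[i];
            if (i >= window)
            {
                sum -= runtimes[i - window];
            }
            out[i] = sum / (double) Math.min(i + 1, window);
        }
        return out;
    }
}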

Bring back details

Too much was removed before

Test time: 75 min - Total: 348,798 - Mean: 576 ms
P50: 576 ms - P95: 1,420 ms - P99: 2,850 ms - P99.9: 6,430 ms

Let the Data Flow

Just collect

How Much Data is That?

A standard load test result of a large US customer

Business Perspective

Runtime 3 hours
User Scenarios 17
Visits 5,266,130
Page Interactions 55,462,101
Total Requests 122,185,828
Orders 677,606
Errors 70,491
Datacenters 7
Load Generators 50 / 800 Cores / 1.6 TB RAM

Tool Perspective

 
Test Cases 17
Transactions 5,266,130
Actions 55,925,554
Requests 122,185,828
Events 124,519
Custom 5,232,721
Agent 53,409
Data Lines 189,751,960

How many data points?

How many points of data are captured?

For Transactions 47,395,170
For Actions 279,627,770
For Requests 2,810,274,044
For Custom Data 622,595
For Event Data 26,163,605
For Agent Data 1,228,407
Total 3,165,311,591
Uncompressed Data 48.72 GB
Compressed Data 4.10 GB
Lines per Second 17,569
Datapoints per Second 293,084
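For orientation, these rates follow directly from the figures above and the 3-hour runtime:

189,751,960 data lines / (3 × 3,600 s) ≈ 17,569 lines per second
3,165,311,591 data points / 10,800 s ≈ 293,084 data points per second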

Rating Criteria

Let's Sell the Results

Concept

The Basic Idea

  • As Simple as School Grades
  • A, B, C, D, F (American)
  • A+ for Overachieving
  • The Average is B
  • This is Based on Three Sub-Criteria!
  • Response Time, Errors, Predictability
  • The Worst Grade Rules it All!
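A minimal sketch of the "worst grade rules it all" aggregation; the enum and method names are illustrative, not the actual tooling:

// Sketch: the overall grade is the worst of the three sub-criteria grades.
public class Grading
{
    // Ordered from best to worst.
    enum Grade { A_PLUS, A, B, C, D, F }

    static Grade overall(Grade responseTimes, Grade errors, Grade predictability)
    {
        // Higher ordinal = worse grade, so the maximum ordinal wins.
        Grade worst = responseTimes;
        for (Grade g : new Grade[] { errors, predictability })
        {
            if (g.ordinal() > worst.ordinal())
            {
                worst = g;
            }
        }
        return worst;
    }
}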

Criteria 1 - Response Times

The Most Interesting Data

  • Split into groups
  • B is the average
  • Worst group determines the total
  • P95 is the driving number* (see the sketch after this list)
  • Adjustable if needed
  • Based on years of measurements and achievability
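A sketch of a P95-driven grade lookup; the millisecond thresholds below are hypothetical placeholders, since the real boundaries are calibrated per request group from years of measurements:

import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: map a request group's P95 to a grade.
// The thresholds are hypothetical, not the calibrated production values.
public class ResponseTimeGrade
{
    private static final NavigableMap<Long, String> THRESHOLDS = new TreeMap<>();
    static
    {
        THRESHOLDS.put(250L, "A");   // P95 <= 250 ms
        THRESHOLDS.put(500L, "B");   // P95 <= 500 ms
        THRESHOLDS.put(1000L, "C");
        THRESHOLDS.put(2000L, "D");
    }

    static String grade(long p95Millis)
    {
        var entry = THRESHOLDS.ceilingEntry(p95Millis);
        return entry != null ? entry.getValue() : "F";
    }
}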

Criteria 1 - Response Times

Criteria 2 - Errors

Something Always Goes Wrong

  • Barely any test run is error free
  • Convey the significance of errors
  • Technical Errors: Response code 0 or >= 500 (classification sketched after this list)
  • Functional Errors: Application caused
  • Number of visits affected
  • Feature importance is used as a factor
  • Visible patterns of failure
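A minimal sketch of that split; the validation flag for functional errors is an assumption about application-level checks, not something the slides specify:

// Sketch: classify a single result into technical vs. functional errors.
public class ErrorClassifier
{
    enum ErrorKind { NONE, TECHNICAL, FUNCTIONAL }

    // responseCode 0 stands for "no response at all", e.g. a connection problem.
    static ErrorKind classify(int responseCode, boolean validationFailed)
    {
        if (responseCode == 0 || responseCode >= 500)
        {
            return ErrorKind.TECHNICAL;
        }
        if (validationFailed)
        {
            return ErrorKind.FUNCTIONAL; // application-caused, e.g. a failed content check
        }
        return ErrorKind.NONE;
    }
}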

Criteria 3 - Predictability

Simplify the View on Long-Term Behavior

  • Turn it into a single business value
  • Strategic motivation
  • Not scientifically proven
  • 10 sec threshold based on the user perception model behind Google's RAIL model
  • Applied to requests, not page loads, hence it is a weaker signal

Example for Predictability

  • ev = 1,522
  • rv = 739
  • tv = 1,120,892
  • Business Impact Value = 0.20 %
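The slide does not spell out the formula, but the numbers fit a simple ratio, reading ev as error count, rv as runtime violations (requests slower than the 10-second threshold), and tv as the total request count:

Business Impact Value = (ev + rv) / tv = (1,522 + 739) / 1,120,892 ≈ 0.20 %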

Predictability Mapped

Mapping the Business Impact Value

  • BIV as input
  • Combines errors and response times
  • Adds response time patterns
  • Pattern view is subjective
  • Demonstrates the end-user's pain

Patterns

Bonus Motivation

Apply a monetary value to enhance visibility

Reporting

The Final Summary

  • A management view
  • A one line verdict
  • A grade that pops
  • A monetary value for the business
  • Some numbers for engineers
  • A detailed report is referenced
  • A compromise to cater to all groups
  • Yes, this is oversimplification

Summary

The good, the bad, and everything in between

Good

  • Easier result communication
  • Less time needed to explain things
  • Results are comparable

Bad

  • Some customers start to haggle
  • Error patterns become invisible
  • Errors become an accepted fact

Misc

  • Not fully automated yet
  • Risk of working strictly by the book
  • Subjective pieces cause discussions

The Summary Summarized

Four lines, not more

  • Communication culture improved
  • Achieved quicker turnarounds
  • Mostly gets all parties on the same page
  • Created a de-facto standard for our commerce testing

Download and Share

Never believe in any benchmark result unless you doctored it yourself!

René Schwietzke

Let's Do Lunch Break Q&A!

Performance Testing - The Numbers

What are we looking for?

Given Requirements

  • Visits/h - 100,000
  • Page Views/h - 1 million
  • Orders/h - 3,500
  • Runtime Average - 250 ms
  • Runtime P99.9 - 500 ms
  • Runtime Max - 3,000 ms
  • Lots of vague ideas

Results

  • Achieved Visits
  • Achieved Page Views
  • Achieved Orders
  • Runtimes and if they match expectations
  • Errors count and error picture
  • Stability and development over time

Process and Sell the Data

Our Result and Communication Challenge

Quick Example

A data example

Test time: 1 h - Total: 14,645 - Mean: 174 ms - P95: 210 ms - P99: 250 ms - Max: 1,169 ms - P99.9: 480 ms

The Average is my friend

Is the average good enough?

Test time: 8 h - Total: 6,796 - Mean: 156 ms - P95: 730 ms - P99: 1,610 ms - Max: 6,592 ms - P99.9: 4,930 ms

Which PXX

Which PXX might be a good vehicle for the message?

Test time: 3 h 30 min - Total: 114,386 - Mean: 504 ms - P95: 550 ms - P99: 2,660 ms - Max: 6,169 ms - P99.9: 4,280 ms

What do we use

What is our final communication data set?

Numbers

  • P95: The number to rule it all, grouped by type such as catalog, cart, homepage, or checkout
  • Avg: To get a feel for the deviation and what could be delivered under normal circumstances
  • Error Count: To make sure the numbers are not void
  • Max: To see the worst case
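For reference, a minimal sketch of how these summary numbers can be derived from raw runtimes, using a nearest-rank percentile; an illustration, not the tool's exact method:

import java.util.Arrays;

// Sketch: summary numbers from raw runtimes.
public class Summary
{
    static long percentile(long[] sortedRuntimes, double p)
    {
        int rank = (int) Math.ceil(p / 100.0 * sortedRuntimes.length);
        return sortedRuntimes[Math.max(0, rank - 1)];
    }

    public static void main(String[] args)
    {
        long[] runtimes = { 120, 150, 180, 200, 250, 300, 900, 1200 };
        Arrays.sort(runtimes);

        double avg = Arrays.stream(runtimes).average().orElse(0);
        long p95 = percentile(runtimes, 95);
        long max = runtimes[runtimes.length - 1];

        System.out.printf("Avg %.0f ms, P95 %d ms, Max %d ms%n", avg, p95, max);
    }
}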

Time Dimension

  • There is no simple way to display development over time
  • Resorting to a visual chart for the moment
  • What to draw? All points, some points, or averages?
  • Displaying all points turns the chart into a mess
  • The moving average either hides details or runs behind
  • Settling for a one-second resolution, displaying min and max for that second plus all points in small buckets (sketched after this list)
  • Purpose: See patterns and events, enhancing trust in the single number that forms the communication
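A minimal sketch of that one-second bucketing, assuming each sample carries a start timestamp and a runtime in milliseconds:

import java.util.Map;
import java.util.TreeMap;

// Sketch: fold samples into one-second buckets, keeping min and max per bucket.
public class SecondBuckets
{
    record Bucket(long min, long max) {}

    // samples: pairs of [timestampMs, runtimeMs]
    static Map<Long, Bucket> bucketize(long[][] samples)
    {
        Map<Long, Bucket> buckets = new TreeMap<>();
        for (long[] s : samples)
        {
            long second = s[0] / 1000;
            long runtime = s[1];
            buckets.merge(second, new Bucket(runtime, runtime),
                (a, b) -> new Bucket(Math.min(a.min(), b.min()),
                                     Math.max(a.max(), b.max())));
        }
        return buckets;
    }
}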

Chart Example

Test time: 1 h - Total: 112,695 - Mean: 501 ms - P50: 250 ms - P95: 1,950 ms - P99: 5,170 ms - Max: 16,689 ms - P99.9: 7,830 ms

Chart Example - More Drastic

Test time: 3 h - Total: 3,145,233 - Mean: 669 ms - P50: 250 ms - P95: 2,760 ms - P99: 8,740 ms - Max: 35,945 ms - P99.9: 30,020 ms

Chart Example - Caches

The Data Table

Why all that data?

Does it make sense to collect 10x more data than needed?

  • A load test can be expensive
  • The execution window might have passed
  • The next run might not yield the same results
  • The environment might have changed (software, hardware, data)
  • Results are often the trigger for a more detailed analysis, and getting the same results again is mostly impossible
  • Problems are often time- and event-based, so you would have to know the question in advance, yet all you have is the recorded response
  • Therefore, every report is produced offline and can be "reshaped" if needed
  • Raw data is available to look for details beyond a regular report

About the Data Processing

Some More Technical Stuff

The Data Crunching

How we crunch the data

  • Multi-threaded offline process
  • Converts data from CSV to XML/HTML/charts
  • Multi-threaded parsing
  • Multi-threaded filtering and merging
  • Single threaded summation and timeline building
  • Multi-threaded chart creation
  • Process is repeatable and movable
  • Takes 6 min on an 8-core machine (our 3.2-billion-point set)
  • Tuned version needs 3 min
  • Processing 1,189,712 lines/s on a Thinkpad T14s (8+8 cores)
  • Filtering and merging is most expensive
  • Summation cannot be scaled at the moment
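A minimal sketch of that parallel-parse / single-threaded-aggregate shape; file name, field positions, and class names are assumptions for illustration:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: parse CSV lines in parallel, then aggregate on a single thread.
public class CrunchSketch
{
    record Sample(String name, long runtime) {}

    public static void main(String[] args) throws IOException
    {
        var parsed = new ConcurrentLinkedQueue<Sample>();

        // Multi-threaded parsing and filtering
        try (var lines = Files.lines(Path.of("timers.csv")))
        {
            lines.parallel()
                 .map(l -> l.split(",", -1))
                 .filter(f -> f.length > 3 && f[0].equals("R")) // requests only
                 .forEach(f -> parsed.add(new Sample(f[1], Long.parseLong(f[3]))));
        }

        // Single-threaded summation / timeline building
        long total = 0;
        for (Sample s : parsed)
        {
            total += s.runtime();
        }
        System.out.printf("Requests: %d, mean runtime: %.1f ms%n",
            parsed.size(), parsed.isEmpty() ? 0.0 : (double) total / parsed.size());
    }
}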