04 Performance Overview
Keywords: Java Performance, Latency, Throughput, Tail Latency, CPU Utilization, Memory Pressure, GC, Allocation Rate, JIT, Benchmarking, Profiling, Amdahl’s Law, Little’s Law, Cache Locality, False Sharing, Concurrency, Backpressure, Event Loops, Hot Paths, Cold Paths, Optimization, JVM Tuning, Bottlenecks, Scalability, Tiered Compilation, G1GC, ZGC, Shenandoah, Safepoints, Escape Analysis, Inlining, JFR, Async Profiler
Performance is one of the most misunderstood topics in software engineering.
Many developers think performance means:
- “make it faster”
- “use more threads”
- “add caching”
- “optimize the code”
But real performance engineering is not about random optimization.
It is about understanding:
- where time is spent
- where memory is consumed
- where contention appears
- where latency is created
- where throughput collapses
- where the system behaves differently under load
A Java system is not fast because it has clever code.
It is fast because its architecture respects:
- CPU limits
- memory limits
- I/O limits
- synchronization costs
- garbage collection behavior
- queue dynamics
- cache locality
- runtime optimization behavior
This page provides the system-level foundation for the entire 04 section.
It is the bridge between:
Knowing Java APIs
and
Engineering High-Performance Java Systems
Performance is not a single number.
It is a set of related properties.
| Metric | Meaning |
|---|---|
| Latency | Time for one request or task |
| Throughput | Number of tasks completed per unit time |
| Tail Latency | Worst-case or p95/p99 response times |
| CPU Utilization | How effectively CPU cores are used |
| Memory Footprint | How much memory the system consumes |
| Allocation Rate | How fast the runtime creates objects |
| GC Overhead | Time spent reclaiming memory |
| Contention | How much threads fight over shared resources |
| Scalability | How performance changes as load increases |
A system may have high throughput but terrible latency.
It may have low average latency but terrible p99 latency.
It may appear fast in testing but collapse under production load.
Real performance analysis must look at the whole system.
This is one of the most important distinctions in performance engineering.
Latency is the time to complete one operation.
Example:
User sends request
↓
Server processes request
↓
Response returned
If this takes 80 milliseconds, latency is 80 ms.
Latency matters in:
- user experience
- APIs
- interactive systems
- trading systems
- real-time systems
Throughput is how many operations a system can complete over time.
Example:
1000 requests per second
Throughput matters in:
- batch systems
- servers
- pipelines
- message processing
- data ingestion
A system can optimize for one and harm the other.
Examples:
- batching improves throughput but may increase latency
- tiny queues reduce latency but may reduce throughput
- aggressive parallelism may increase throughput but hurt tail latency
Good engineering balances both.
Most performance problems come from a small set of causes:
- too much blocking
- too many allocations
- poor cache usage
- excessive synchronization
- unbounded queues
- thread oversubscription
- inefficient I/O patterns
- slow external dependencies
- GC pressure
- hidden contention
- excessive object churn
The root cause is usually not the algorithm alone.
It is the interaction between:
- code
- runtime
- JVM
- operating system
- hardware
- workload shape
Java performance must be understood across layers.
Application Logic
↓
Concurrency Model
↓
Runtime / JVM
↓
Operating System
↓
CPU / Memory / Disk / Network
Each layer can become the bottleneck.
The application logic layer includes:
- business logic
- parsing
- transformation
- validation
- request handling
Mistakes here include:
- unnecessary loops
- duplicated work
- expensive data structures
- redundant computation
The concurrency model layer includes:
- threads
- executors
- queues
- locks
- event loops
- async pipelines
Mistakes here include:
- too many threads
- blocking in the wrong place
- lock contention
- bad queue sizing
The runtime/JVM layer includes:
- JIT compilation
- garbage collection
- class loading
- reflection
- profiling
- memory management
Mistakes here include:
- excessive allocation
- megamorphic call sites
- reflective hot paths
- poor object layout
The operating system layer includes:
- scheduling
- virtual memory
- file system caching
- TCP stack behavior
- kernel wakeups
- context switching
Mistakes here include:
- thread explosion
- blocking I/O misuse
- poor descriptor management
- heavy syscall overhead
The hardware layer includes:
- CPU caches
- branch prediction
- memory bandwidth
- NUMA effects
- disk latency
- network latency
Mistakes here include:
- cache-thrashing data layouts
- false sharing
- too much memory traffic
- poor locality
Performance is usually determined by the bottleneck.
Not every part of the system matters equally.
If the database takes 80% of the time, optimizing local Java code may not help much.
If the CPU is idle but the network is slow, adding more CPU does not solve the problem.
A good performance engineer always asks:
- What is actually slow?
- Where is the wait happening?
- What resource is saturated?
- What changes under load?
Amdahl’s Law explains the ceiling on speedup.
If part of the system is sequential, that sequential part limits the overall speedup.
Meaning:
No amount of parallelism can eliminate a serial bottleneck
Example:
- 90% of work can parallelize
- 10% remains serial
Even with infinite processors, the serial section limits total improvement.
This is why:
- a single lock can cap scalability
- a single database call can dominate latency
- a sequential pipeline step can block parallel gains
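The ceiling can be made concrete with a few lines of arithmetic. This sketch evaluates the standard formula S(n) = 1 / ((1 − p) + p/n), where p is the fraction of work that parallelizes and n is the processor count:

```java
public class AmdahlDemo {
    // Amdahl's Law: maximum speedup with n processors when fraction p parallelizes
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // 90% parallel work: the serial 10% caps the speedup at 10x, no matter how many cores
        System.out.printf("n=8:    %.2fx%n", speedup(0.9, 8));     // ~4.71x
        System.out.printf("n=1024: %.2fx%n", speedup(0.9, 1024)); // ~9.91x
    }
}
```

Even with 1024 processors, the 10% serial fraction keeps the speedup just under 10x, which is exactly why a single lock or sequential pipeline step caps scalability.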
Little’s Law connects:
- work in progress
- throughput
- response time
In plain terms:
More queued work increases response time.
This is essential for understanding:
- request queues
- thread pools
- event loops
- backpressure
- overloaded systems
If your queue grows, latency grows too.
If arrivals exceed processing capacity, the system accumulates work and becomes slow.
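As a minimal illustration, Little's Law (L = λW) can be rearranged to estimate response time from the amount of work in the system and the arrival rate; the numbers below are hypothetical:

```java
public class LittlesLawDemo {
    // Little's Law: L = lambda * W, so W = L / lambda
    // L = average items in the system, lambda = arrival rate, W = average time in system
    static double avgResponseTimeSeconds(double avgInSystem, double arrivalsPerSecond) {
        return avgInSystem / arrivalsPerSecond;
    }

    public static void main(String[] args) {
        // 200 requests queued or in flight at 1000 req/s => 200 ms average response time
        double w = avgResponseTimeSeconds(200, 1000);
        System.out.printf("W = %.0f ms%n", w * 1000); // W = 200 ms
    }
}
```

Doubling the queue depth at the same arrival rate doubles the average response time, which is why bounded queues matter.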
Average response time can lie.
A system might have:
- average latency of 20 ms
- p99 latency of 2 seconds
From a user perspective, the system feels broken.
Tail latency matters because:
- users notice the slowest cases
- distributed systems amplify slow tail events
- retries and timeouts turn latency spikes into failures
Good performance engineering always tracks:
- p50
- p95
- p99
- p99.9
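A short sketch shows how a handful of slow outliers move the tail while barely moving the average (this uses the nearest-rank percentile, one common convention; the sample values are invented):

```java
import java.util.Arrays;

public class PercentileDemo {
    // Nearest-rank percentile over a sample of latencies
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 98 fast requests and two 2-second outliers: the mean is ~60 ms, but p99 is 2 s
        long[] samples = new long[100];
        Arrays.fill(samples, 20);
        samples[98] = 2000;
        samples[99] = 2000;
        System.out.println("p50 = " + percentile(samples, 50)); // 20
        System.out.println("p99 = " + percentile(samples, 99)); // 2000
    }
}
```

The average (about 60 ms) looks healthy; p99 reveals that one request in a hundred takes two seconds.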
CPU efficiency means using CPU cycles for useful work instead of overhead.
Waste happens through:
- context switching
- lock contention
- cache misses
- branch misprediction
- busy waiting
- excessive syscall overhead
Efficient systems keep CPUs busy doing useful tasks, not managing chaos.
Modern CPUs are fast because of caches.
If data is near where it is used, performance improves.
If data is scattered, performance worsens.
Good locality means:
- sequential access
- compact data structures
- reuse of recently accessed values
- minimal cache misses
Bad locality means:
- random access
- scattered objects
- pointer-heavy graphs
- large working sets
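The difference is easy to see with a 2D array: Java stores each row contiguously, so row-major traversal walks memory sequentially while column-major traversal jumps between rows. A rough sketch (timings here are illustrative only; use JMH for real measurements):

```java
public class LocalityDemo {
    static final int N = 2048;

    // Row-major traversal: walks memory sequentially, cache-friendly
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        return sum;
    }

    // Column-major traversal: jumps between rows, poor spatial locality
    static long sumColMajor(int[][] m) {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        return sum;
    }

    public static void main(String[] args) {
        int[][] m = new int[N][N];
        for (int[] row : m) java.util.Arrays.fill(row, 1);
        long t0 = System.nanoTime();
        long a = sumRowMajor(m);
        long t1 = System.nanoTime();
        long b = sumColMajor(m);
        long t2 = System.nanoTime();
        System.out.printf("row-major: %d in %d us, col-major: %d in %d us%n",
                a, (t1 - t0) / 1000, b, (t2 - t1) / 1000);
    }
}
```

Both loops do identical arithmetic; only the access order differs, yet the column-major version is typically noticeably slower on large arrays.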
False sharing occurs when different threads modify different variables that happen to live on the same CPU cache line.
Even though the variables are logically independent, the CPU treats them as related because they share cache-line storage.
This causes:
- cache invalidation storms
- wasted coherence traffic
- severe throughput degradation
False sharing is a classic hidden performance killer in concurrent systems.
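One mitigation is to pad hot fields so each lands on its own cache line (64 bytes on most current CPUs). This is a best-effort sketch: the JVM controls field layout, so manual padding is not guaranteed, and the JDK's internal `@Contended` annotation (which requires `-XX:-RestrictContended` outside the JDK) is the more reliable route:

```java
public class FalseSharingSketch {
    // Padding a hot counter so it is likely to occupy its own cache line.
    // Without padding, two counters allocated side by side can share a line,
    // and two threads updating them would invalidate each other's cache.
    static class PaddedCounter {
        long p1, p2, p3, p4, p5, p6, p7; // padding before the hot field
        volatile long value;
        long q1, q2, q3, q4, q5, q6, q7; // padding after the hot field
    }

    static final PaddedCounter a = new PaddedCounter();
    static final PaddedCounter b = new PaddedCounter();

    public static void main(String[] args) throws InterruptedException {
        // Each counter has exactly one writer, so value++ is safe here
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) a.value++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) b.value++; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(a.value + " " + b.value); // 1000000 1000000
    }
}
```

The results are identical with or without padding; what changes is the cache-coherence traffic between cores, which only shows up in throughput measurements.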
In Java, allocation is cheap, but not free.
High allocation rates can cause:
- frequent young generation GC
- promotion pressure
- memory churn
- tail latency spikes
Good performance design often tries to:
- reuse objects where practical
- avoid unnecessary allocations
- reduce temporary object creation
- pool expensive buffers carefully
But object pooling is not always the answer.
The goal is not “zero allocation at all costs.”
The goal is:
Allocation behavior that matches the workload
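As a small illustration of reducing temporary object creation, compare string concatenation in a loop (which allocates a fresh builder and string on every iteration) with a single pre-sized builder:

```java
public class AllocationDemo {
    // Allocation-heavy: each += creates a new StringBuilder and a new String,
    // producing O(n^2) copying and a stream of short-lived garbage
    static String joinConcat(String[] parts) {
        String out = "";
        for (String p : parts) out += p + ",";
        return out;
    }

    // Allocation-aware: one builder, sized up front, used for the whole loop
    static String joinBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder(parts.length * 8);
        for (String p : parts) sb.append(p).append(',');
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinBuilder(parts)); // a,b,c,
    }
}
```

On a hot path processing thousands of items per request, the first version generates allocation pressure the GC must absorb; the second does the same work with a handful of objects.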
Garbage collection is one of the most important runtime performance factors in Java.
GC can help with:
- memory safety
- automatic cleanup
- simpler development
But GC can also introduce:
- pauses
- throughput reduction
- memory pressure
- unpredictable latency if badly tuned
Different collectors are optimized for different goals:
- throughput
- low pause time
- balanced performance
- large heaps
- server workloads
Good performance engineering requires awareness of GC behavior.
Visual 1.2: Tiered Compilation Pipeline - The journey from interpreted bytecode to optimized native code.
JIT Pipeline ASCII: [Interpreter] --> (Profiling) --> [C1 Compiler] --> (Hot Methods) --> [C2 Optimizer]
Java performance improves significantly through runtime optimization.
The JVM can:
- profile hot methods
- inline methods
- remove dead code
- eliminate allocations
- optimize locks
- specialize call sites
- deoptimize when assumptions change
This means Java often gets faster as it runs.
A benchmark run once may not reflect steady-state performance.
A realistic performance assessment must consider:
- warmup
- JIT compilation
- profiling thresholds
- steady-state execution
A hot path is code executed frequently.
A cold path is code executed rarely.
Optimization effort should focus on hot paths.
Examples of hot paths:
- request dispatch
- serialization
- parsing
- inner loops
- contention points
Examples of cold paths:
- startup configuration
- error reporting
- rare administrative actions
Do not over-optimize code that rarely runs.
Performance claims without proper measurement are usually wrong.
Benchmarking must account for:
- JIT warmup
- GC effects
- dead code elimination
- constant folding
- operating system noise
- background load
- CPU scaling behavior
A benchmark should measure:
- one thing at a time
- a realistic workload
- multiple repetitions
- steady-state behavior
In Java, microbenchmarking is usually done with JMH, the Java Microbenchmark Harness.
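JMH handles warmup, forking, and dead-code elimination for you. A naive hand-rolled harness like the sketch below (the workload and iteration counts are arbitrary) at least illustrates why single-shot timings mislead: early rounds run interpreted or C1-compiled, later rounds reflect C2 steady state:

```java
public class WarmupDemo {
    // A deliberately simple workload; real benchmarks belong in JMH
    static long work() {
        long sum = 0;
        for (int i = 0; i < 100_000; i++) sum += Integer.bitCount(i);
        return sum;
    }

    public static void main(String[] args) {
        long sink = 0; // consume results so the JIT cannot discard the loop as dead code
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            for (int i = 0; i < 200; i++) sink += work();
            long elapsedUs = (System.nanoTime() - t0) / 1_000;
            // Early rounds are typically slower than later, fully compiled rounds
            System.out.println("round " + round + ": " + elapsedUs + " us");
        }
        if (sink == 42) System.out.println(); // keep sink observably live
    }
}
```

If round 0 is much slower than round 4, you have seen warmup in action; reporting only round 0 would misrepresent steady-state performance.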
Benchmarking and profiling are not the same.
Benchmarking answers:
How fast is it?
It is used to compare versions or approaches.
Profiling answers:
Where is the time going?
It is used to discover hotspots and bottlenecks.
Performance work usually needs both.
You cannot optimize what you cannot see.
Important signals include:
- CPU usage
- memory usage
- allocation rate
- GC pauses
- queue depth
- lock contention
- thread state distribution
- I/O wait times
- tail latency
Visibility is the beginning of performance tuning.
Useful tools include:
| Tool | Purpose |
|---|---|
| JFR | Low-overhead runtime profiling |
| async-profiler | CPU/lock profiling and flame graphs |
| jcmd | JVM diagnostics |
| jstat | GC and runtime statistics |
| jmap | Heap inspection |
| JMH | Microbenchmarking |
| VisualVM | High-level monitoring |
Flame graphs provide a visual 2D representation of where CPU time is spent. The width of each bar represents the total percentage of CPU cycles consumed by that method and its children.
Visual 1.4: A sample Flame Graph. Wide bars at the top indicate significant bottlenecks in specific method calls.
Concurrent systems create special performance problems.
Examples:
- too many threads
- lock contention
- queue saturation
- blocking in the wrong place
- poor work distribution
Concurrency improves scalability only when it is designed carefully.
More threads do not automatically improve performance.
Sometimes they make it worse.
Event-driven systems often outperform thread-per-request systems at scale because they:
- reduce thread count
- reduce context switching
- concentrate work on fewer, busier threads
- handle many idle connections efficiently
But event-driven systems require:
- careful event loop design
- bounded work in the loop
- fast dispatch
- proper backpressure
This is why 04-Event-Loop-Design matters.
Backpressure is a performance control strategy.
It ensures the system does not accept more work than it can safely process.
Without backpressure:
- queues grow without limit
- latency climbs
- memory grows
- failures cascade
With backpressure:
- overload is contained
- the system stays stable
- producers slow down
- response quality remains controlled
This is why 04-Backpressure-Strategies is central to performance engineering.
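A bounded queue is the simplest form of backpressure. In this sketch (the capacity and timeout values are arbitrary), `offer` with a timeout refuses work once the system is saturated instead of queueing it forever:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {
    // Bounded capacity: when the queue is full, offer(...) fails
    // instead of letting work (and latency) grow without limit
    static final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

    // Returns false when saturated; the caller can shed load,
    // retry later, or slow the producer down
    static boolean submit(String task) throws InterruptedException {
        return queue.offer(task, 50, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 1500; i++) {
            if (!submit("task-" + i)) {
                System.out.println("rejected at task " + i); // rejected at task 1000
                break;
            }
        }
    }
}
```

The rejection signal is the point: producers find out the system is overloaded immediately, rather than discovering it later as memory exhaustion and latency collapse.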
CPU-bound systems show these symptoms:
- high CPU usage
- slow computation
- low headroom
- heat and throttling
Causes:
- heavy computation
- tight loops
- excessive parsing
- inefficient algorithms
- locking overhead
- busy waiting
Memory-bound systems show these symptoms:
- frequent GC
- high heap usage
- OutOfMemoryError
- allocation spikes
Causes:
- excessive object creation
- large caches
- memory leaks
- large collections
- buffer churn
I/O-bound systems show these symptoms:
- threads waiting
- low CPU usage
- high latency
- blocked requests
Causes:
- slow disk
- slow network
- database stalls
- external API delay
- blocking I/O under load
Contention-bound systems show these symptoms:
- throughput collapse
- high lock wait time
- uneven latency
- thread pileups
Causes:
- synchronized hotspots
- shared mutable state
- queue contention
- global locks
- false sharing
A high-performance system minimizes waiting.
Work can be spent on:
- computation
- memory access
- synchronization
- I/O wait
- scheduling wait
- GC pause time
The wrong optimization target is often the visible one.
The right optimization target is usually the one causing wait time.
A strong performance workflow usually looks like this:
- Measure the system.
- Identify the bottleneck.
- Determine whether the bottleneck is CPU, memory, I/O, contention, or architecture.
- Fix the dominant bottleneck.
- Measure again.
- Repeat.
This cycle is better than random optimization.
At the code level, optimize:
- algorithms
- data structures
- parsing
- repeated computation
At the concurrency level, optimize:
- thread count
- queue sizes
- lock usage
- event-loop behavior
At the JVM level, optimize:
- allocation patterns
- GC behavior
- reflection
- JIT friendliness
At the operating system level, optimize:
- blocking behavior
- file/network I/O
- thread scheduling
- resource limits
At the hardware level, optimize:
- cache locality
- memory bandwidth
- branch predictability
- NUMA placement
- Scenario: A producer-consumer system uses a `LinkedBlockingQueue` with no capacity limit (note that an `ArrayBlockingQueue` is always bounded). Result: Thread Pool Collapse. Under a 10k/sec request spike, the queue consumes all available heap memory, leading to continuous Full GC cycles and an eventual `OutOfMemoryError`.
- Scenario: A shared `HashMap` is protected by a single global lock in a high-concurrency logging system. Result: Throughput Ceiling. As CPU cores increase, performance flatlines because threads spend 90% of their time waiting for the lock rather than doing actual work (Amdahl's Law in action).
- Scenario: A microservice parses 50 MB XML files by creating millions of temporary `String` objects per second. Result: The GC Tax. Even with ZGC, the sheer volume of allocations triggers allocation stalls, where application threads are paused until the GC frees enough space.
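The logging scenario is usually fixed by removing the global lock rather than tuning it. A sketch using `ConcurrentHashMap` and `LongAdder`, both of which stripe their internal synchronization so writers rarely contend (the event names here are invented):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class ContentionSketch {
    // No global lock: ConcurrentHashMap synchronizes per bin, and LongAdder
    // spreads increments across cells, so threads touching different keys
    // (or even the same hot counter) rarely block each other
    static final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    static void record(String event) {
        // computeIfAbsent is atomic per key; the increment itself is lock-free
        counts.computeIfAbsent(event, k -> new LongAdder()).increment();
    }

    public static void main(String[] args) {
        record("login");
        record("login");
        record("logout");
        System.out.println(counts.get("login").sum());  // 2
        System.out.println(counts.get("logout").sum()); // 1
    }
}
```

Compared with `synchronized (map) { ... }` around a plain `HashMap`, this design removes the single serialization point that caps scalability as core counts grow.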
High-performance Java systems are not just “fast code.”
They are systems that align well with:
- JVM execution behavior
- workload shape
- external dependencies
- concurrency model
- memory layout
- runtime adaptation
This is why the best systems often feel simple from the outside, but are carefully engineered under the hood.
Scenario: A high-throughput API was experiencing random 500ms latency spikes.
Investigation: Using JFR, CPU usage was low, but GC activity was extremely high.
Root Cause: A JSON serialization library was creating massive amounts of temporary `byte[]` arrays for every HTTP response. The Eden space filled up quickly, triggering thousands of Minor GCs per minute.
The Fix: We did not change the GC. Instead, we reduced temporary allocations, reused buffers where practical, and switched to a streaming JSON parser.
Result: Allocation rate dropped by 90%. Minor GC frequency dropped. p99 latency stabilized at 15ms.
Lesson: The best way to optimize Garbage Collection is to generate less garbage.
The JVM is not passive.
It actively optimizes your application through:
- interpretation
- profiling
- JIT compilation
- inlining
- escape analysis
- dead code elimination
- lock optimizations
- deoptimization
- garbage collection coordination
This is why Java often gets faster after warmup.
A clean, mechanically sympathetic codebase allows the runtime to do its job.
Java is not simply interpreted.
It uses tiered compilation:
- the interpreter starts execution quickly
- C1 performs fast lightweight compilation
- C2 performs aggressive optimization for hot code
This balances startup cost and peak throughput.
Typical pipeline:
Interpreter
↓
C1
↓
C2
↓
Optimized Native Code
A method call is replaced by the body of the method.
Benefits:
- lower call overhead
- more optimization opportunities
- better instruction locality
The JVM determines whether an object escapes the method or thread.
If not, it may:
- eliminate the allocation entirely
- scalar-replace object fields
- keep values in registers or on the stack
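A sketch of code that can benefit: the `Point` objects below never leave `distance`, so after inlining, escape analysis may scalar-replace them and allocate nothing on the heap. Whether that actually happens depends on JIT decisions, so treat this as illustrative rather than guaranteed:

```java
public class EscapeAnalysisDemo {
    // A small value-like class; instances created in distance() never escape it
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // If the JIT proves a and b are method-local, it can keep their fields
    // in registers and skip the heap allocations entirely
    static double distance(double x1, double y1, double x2, double y2) {
        Point a = new Point(x1, y1);
        Point b = new Point(x2, y2);
        double dx = a.x - b.x, dy = a.y - b.y;
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        System.out.println(distance(0, 0, 3, 4)); // 5.0
    }
}
```

This is one reason small, short-lived wrapper objects on hot paths are often cheaper than their source code suggests.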
The JVM can rewrite loops to reduce branch overhead.
Unreachable or useless operations can be removed.
If the JVM proves synchronization is unnecessary, it may remove the lock.
The JVM may assume stable runtime behavior and optimize aggressively.
If the assumption breaks, it can deoptimize safely.
Some runtime operations require all Java threads to reach a safepoint.
Examples:
- garbage collection
- deoptimization
- class redefinition
- certain profiling operations
If a thread is stuck in a long uninterruptible loop, it can delay the safepoint and cause latency spikes.
GC pause time is not just collection time.
It also includes the time needed to bring threads to safepoints.
Visual 1.3: Physical memory layout separating Managed Heap, Thread Stacks, and Native Metaspace.
Understanding performance requires understanding memory layout.
Application
↓
Heap / Metaspace / Stack / Code Cache / Off-Heap
Stores objects and arrays.
Managed by GC.
Java manages object lifecycles using a generational approach. Understanding the flow from Eden to Old Generation is key to tuning GC performance.
Visual 1.5: The flow of objects through Eden, Survivor (S0/S1), and Old Generation, including the relationship with Metaspace and Code Cache.
Key Components from the Model:
- Young Generation (Eden + Survivor): Most objects are allocated here and collected quickly.
- Promotion: Surviving objects move from Survivor spaces to the Old Generation (Tenured).
- Non-Heap Areas: Metaspace (Class metadata), Code Cache (JIT compiled code), and Stack (Thread locals) operate outside the main GC heap.
Each thread has its own stack.
Contains:
- method frames
- local variables
- return information
Stores class metadata.
Uses native memory.
Stores JIT-compiled machine code.
If full, compilation effectiveness suffers.
Used for:
- direct buffers
- native interop
- high-performance I/O structures
This is important in NIO and low-latency systems.
Different collectors trade throughput, latency, and footprint differently.
| GC Name | Focus | Best Use Case | Notes |
|---|---|---|---|
| Parallel GC | Max throughput | Batch jobs | Longer pauses, high CPU use |
| G1GC | Balanced | General backend services | Region-based, predictable pauses |
| ZGC | Ultra-low latency | Trading, real-time systems | Very short pauses, large heaps |
| Shenandoah | Ultra-low latency | Low-pause workloads | Concurrent compaction |
Architectural insight:
If you are building a system with NIO and event loops for low latency, pairing it with a low-latency collector usually makes sense.
Using a throughput-only collector in a latency-sensitive architecture can undermine the whole design.
Use tuning flags deliberately, not blindly.
| Flag | Purpose |
|---|---|
| `-Xms` / `-Xmx` | Set initial and maximum heap size |
| `-XX:+UseG1GC` | Enable G1GC |
| `-XX:+UseZGC` | Enable ZGC |
| `-XX:MaxGCPauseMillis=200` | Ask G1GC to target a pause budget |
| `-XX:+HeapDumpOnOutOfMemoryError` | Generate a heap dump on OOM |
| `-XX:+AlwaysPreTouch` | Pre-touch memory pages at startup |
| `-XX:+UseStringDeduplication` | Reduce duplicate string memory |
Best practice:
- profile first
- tune second
- validate third
A sample GC monitor view (for example, the column output of `jstat -gcutil`) might include:
| Metric | Meaning |
|---|---|
| S0 / S1 | Survivor space usage |
| E | Eden space usage |
| O | Old generation usage |
| M | Metaspace usage |
| YGC | Young GC count |
| YGCT | Young GC time |
| FGC | Full GC count |
| FGCT | Full GC time |
| GCT | Total GC time |
These numbers help answer whether the bottleneck is:
- allocation rate
- promotion pressure
- old generation growth
- excessive full collections
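These columns correspond to the output of `jstat -gcutil`, mentioned in the tools table above. A typical invocation looks like this (the PID and interval are placeholders; use `jcmd` or `jps` to find the target process):

```shell
# Print GC utilization columns (S0, S1, E, O, M, YGC, YGCT, FGC, FGCT, GCT)
# for JVM process 12345, sampling every 1000 ms
jstat -gcutil 12345 1000
```

Watching these columns over time shows whether Eden fills too quickly (allocation rate), whether the old generation grows steadily (promotion pressure or a leak), and whether full collections dominate total GC time.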
Continue exploring:
- 04-Event-Loop-Design
- 04-Backpressure-Strategies
- 01-NIO-Selector-Architecture
- 02-Thread-Pool-Mechanics
- 02-Java-Memory-Model
- 03-Runtime-Overview
- 01-NIO-Channel-Buffer-Model
- 03-ClassLoader-Architecture