feat: add Celeris framework benchmarks #189
Open
FumingPower3925 wants to merge 17 commits into main from
Conversation
…protocol combinations
Adds servers/celeris/ package wrapping the Celeris HTTP engine (v0.3.0)
for benchmarking. Supports 27 server variants: 3 engines (iouring, epoll,
adaptive) × 3 objectives (latency, throughput, balanced) × 3 protocols
(h1, h2, hybrid). The std engine is available for local dev but excluded
from benchmark runs.
Server naming: celeris-{engine}-{objective}-{protocol}
Ref: #188
…rd, and CI

- cmd/server: add celeris dispatch files (Linux + stub) with prefix-based routing via ParseServerType()
- cmd/bench: add 27-entry celerisServers list and "celeris" benchmark mode
- internal/dashboard: add "celeris" category classification, update parseFramework to strip protocol suffix (works for all naming depths)
- .github/workflows/benchmark.yml: update jq classify_category and parse_framework for celeris server names

Ref: #188
The io_uring/epoll event loops call the handler synchronously on the event loop thread. Per-request json.Marshal allocations cause GC pressure that stalls the locked OS thread, leading to hangs under benchmark load.

Changes:
- Pre-compute JSON body at construction time (like the theoretical servers)
- Pre-compute header slices and static response bodies as package vars
- Add defer s.Cancel() to clean up stream context after each request
- Inline all WriteResponse calls to eliminate method overhead
v0.3.1 fixes a bug in the io_uring engine's send queue / buffer management that caused hangs when handling responses under sustained benchmark load. Validated with 48/48 tests passing across all engine/objective/protocol combinations.
Two issues caused CI to appear stuck for hours when a server stops responding (e.g., Celeris io_uring engine deadlock under sustained load):

1. Run() had no safeguard timeout — workers used the parent context, which is never cancelled by the benchmark duration timer. If the HTTP client timeout fails to fire (e.g., the server keeps TCP alive but stops processing), wg.Wait() blocks indefinitely while the heartbeat goroutine keeps the C2 from detecting the issue.
   Fix: create a scoped context (Duration + 60s) for workers and cancel it when the benchmark period ends, ensuring all in-flight HTTP requests are cancelled promptly.
2. No failed-server detection — if a server produced 0 RPS (clearly broken), the runner still attempted all 5 benchmark types, wasting up to 50 minutes per server in retry loops.
   Fix: skip remaining benchmark types if RPS < 10 (server unhealthy) or after 2 consecutive failures.
…loop deadlock)

v0.3.2 fixes the three compounding issues that caused the io_uring engine to deadlock under sustained high-concurrency load:
- sync.Pool buffers (Data/OutboundBuffer) now returned after the H1 handler
- sendQueue capped to prevent unbounded growth under back-pressure
- Reduced allocation pressure in the hot path
With 50 servers × 5 benchmark types = 250 total benchmarks (up from 115 before celeris), the previous timeouts were too tight:
- BenchmarkTimeout: 3h → 5h (metal mode needs ~150 min + retries)
- Cleanup watchdog: 2h → 6h (was killing runs before BenchmarkTimeout)
- Workflow job timeout: 4h → 7h
- Workflow MAX_WAIT: 4h → 6h
When spot capacity is unavailable, the orchestrator wastes 10 minutes per AZ waiting for workers that will never register. Instance boot + binary download + registration takes 2-3 minutes in practice, so 4 minutes is plenty. This cuts worst-case AZ cycling from 60+ minutes to ~24 minutes before on-demand fallback.
…egister

When waitForWorkers times out, check which role (server/client) actually registered. If one succeeded but the other didn't get spot capacity, immediately deploy the missing role as on-demand in the same AZ instead of failing the entire architecture and cycling to the next AZ.
When a spot instance is terminated mid-benchmark, instead of bubbling the error up to runArchitectureWithRetry (which restarts from scratch in a new AZ), recover in place:

1. Detect interruption faster by monitoring BOTH server and client heartbeats (previously only the client was checked, so server-only interruptions took ~15 min to detect)
2. Delete the terminated spot stacks and clear stale registrations
3. Deploy both roles as on-demand in the same AZ (in parallel)
4. The new client resumes from checkpoint via completed_benchmarks in the assignment API — no work is repeated
5. If in-place recovery fails, fall through to the existing AZ-cycling retry in runArchitectureWithRetry
…r count

When the benchmark duration expires, in-flight HTTP requests are cancelled via context, which was incorrectly counted as errors (1-32 per benchmark). Now only actual server errors during the benchmark window are counted.
Without this, the last periodic heartbeat could show stale progress (e.g. 243/245) if the final benchmarks completed between ticks, making it look like benchmarks were missed.
Summary
- servers/celeris/ package wrapping the Celeris HTTP engine (v0.3.0) for benchmarking
- celeris benchmark mode (-mode celeris), also included in the all mode
- "celeris" category
- ParseServerType() — no giant switch statements

Server naming convention

celeris-{engine}-{objective}-{protocol}

Examples: celeris-iouring-throughput-h1, celeris-adaptive-balanced-hybrid

Engines: iouring, epoll, adaptive

Objectives: latency, throughput, balanced

Closes #188
Test plan
- go build ./cmd/server/ and go build ./cmd/bench/ pass (macOS, non-Linux stub path)
- go test ./internal/dashboard/ passes with new classify/framework test cases
- golangci-lint run clean on all changed packages
- GOOS=linux go build ./cmd/server/
- celeris-std-throughput-h1 run locally to validate handler responses