CADC-15011: Kueue testing updates for lifecycle management and control-plane observability #334

Merged
shinybrar merged 10 commits into main from CADC-15011/kueue-testing
Mar 12, 2026


@shinybrar shinybrar commented Mar 9, 2026

Overview

This PR overhauls the kueuer testing framework into a production-ready Kueue validation system with full lifecycle management, control-plane observability, and automated analysis.

Key Improvements

Lifecycle

  • New lifecycle module: Complete workflow automation from preflight checks to teardown
  • Scenario support: control and backlog scenarios for testing under different queue pressures
  • Queue validation: Automated Kueue health checks and queue readiness verification
  • Artifact consolidation: All outputs (benchmarks, plots, observations, reports) organized under unified artifacts/<run_id>/ structure

Observability

  • Real-time observe module to gather control-plane metrics during benchmark runs
  • Tracks Kueue controller resources, API server latency, and queue pressure
  • Visualization of controller memory/CPU, API server p95 latency, and queue wait times
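For readers unfamiliar with the p95 figure mentioned above, here is a minimal sketch of a nearest-rank percentile over latency samples. The function name `percentile` and the sample values are hypothetical, and the observe module may compute p95 differently (for example from Prometheus histogram buckets), so treat this as an illustration of the statistic rather than of the implementation.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile requires at least one sample")
    if not 0 < q <= 100:
        raise ValueError("q must be in the range (0, 100]")
    ordered = sorted(samples)
    # Rank is 1-based; ceil keeps the result on an observed sample.
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical API server request latencies in milliseconds.
latencies_ms = [12.0, 15.0, 11.0, 240.0, 14.0, 13.0, 16.0, 12.5, 13.5, 18.0]
p95_ms = percentile(latencies_ms, 95)
p50_ms = percentile(latencies_ms, 50)
```

With only ten samples the single slow request dominates p95 while the median stays low, which is why a tail percentile is a useful signal for control-plane pressure.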

Metrics

  • Revised metrics that emphasize decision-critical signals (throughput, completion ratio, tail turnaround, overhead)
  • Single-view summaries for performance, evictions, and observations
  • Clear metric definitions and interpretation guidelines
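As a rough illustration of how signals like these can be derived from per-job timestamps, the sketch below computes throughput, completion ratio, and tail turnaround. The `JobRecord` type, the `summarize` function, and the exact definitions (p95 for the tail, wall-clock span for throughput) are assumptions made for this example; the authoritative definitions are the ones in the PR's metric documentation. Overhead is left out because its definition is not stated here; mean queue wait is shown instead as a related quantity.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    """Hypothetical per-job timestamps, in seconds since the run started."""
    submitted: float
    started: Optional[float] = None
    finished: Optional[float] = None


def summarize(jobs):
    """Compute illustrative summary metrics (assumed definitions)."""
    if not jobs:
        raise ValueError("no jobs to summarize")
    completed = [j for j in jobs if j.finished is not None]
    summary = {"completion_ratio": len(completed) / len(jobs)}
    if not completed:
        return summary

    # Throughput: completed jobs per second over the observed wall-clock span.
    span = max(j.finished for j in completed) - min(j.submitted for j in jobs)
    summary["throughput"] = len(completed) / span if span > 0 else 0.0

    # Tail turnaround: nearest-rank p95 of submit-to-finish time.
    turnarounds = sorted(j.finished - j.submitted for j in completed)
    rank = math.ceil(95 * len(turnarounds) / 100)
    summary["tail_turnaround"] = turnarounds[rank - 1]

    # Mean queue wait: average submit-to-start time of completed jobs.
    waits = [j.started - j.submitted for j in completed if j.started is not None]
    if waits:
        summary["mean_queue_wait"] = sum(waits) / len(waits)
    return summary


jobs = [
    JobRecord(submitted=0, started=1, finished=5),
    JobRecord(submitted=0, started=2, finished=6),
    JobRecord(submitted=1, started=3, finished=11),
    JobRecord(submitted=2),  # never admitted, so never finished
]
summary = summarize(jobs)
```

Keeping the ratio and the tail side by side matters: a run can show high throughput while still leaving jobs unfinished or a few jobs waiting far longer than the rest.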

Streamlined CLI

  • Unified under kr benchmark e2e workflow
  • Added basic profiles, local-safe and cluster-scale, for different testing scenarios
  • Consistent CLI patterns across all commands
  • Removed duplicated code and simplified the command structure

Documentation

  • High-level introduction to concepts and architecture
  • Complete command documentation with all options

What This Enables

The tool now answers two critical production questions:

  1. Scale handling: Can Kueue admit, queue, preempt, and complete large numbers of jobs correctly?
  2. Control-plane load: Does Kueue add unacceptable pressure to the Kubernetes control plane?

Example Usage

# Quick end-to-end validation
RUN_ID="$(date -u +%Y%m%d-%H%M%S)"
uv run kr preflight --run-id "$RUN_ID"
uv run kr benchmark e2e \
  --run-id "$RUN_ID" \
  --profile local-safe \
  --counts 2,4,8,16,32,64

# Inspect results
find "artifacts/$RUN_ID" -type f | sort
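
For scripting the same inspection step from Python, a minimal sketch follows. It mirrors the `date -u +%Y%m%d-%H%M%S` run ID format and the `find ... -type f | sort` listing from the shell example; the helper names `make_run_id` and `list_artifacts` are hypothetical and not part of the kr CLI.

```python
from datetime import datetime, timezone
from pathlib import Path


def make_run_id(now=None):
    """Build a run ID in the same UTC format as `date -u +%Y%m%d-%H%M%S`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")


def list_artifacts(run_id, root="artifacts"):
    """Return every regular file under <root>/<run_id>, sorted by path."""
    base = Path(root) / run_id
    return sorted(str(p) for p in base.rglob("*") if p.is_file())


run_id = make_run_id()
```

Calling `list_artifacts(run_id)` after a run should list the same files as the `find` command, though the sort order can differ slightly from the shell's locale-aware `sort`.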


github-actions Bot commented Mar 9, 2026

✅ All pre-commit checks passed

Thanks for keeping the repo tidy! ✨

…moving code duplication

- Move  and  to top-level CLI for simpler access (was )
- Consolidate  as the primary end-to-end workflow command, replacing separate  and  commands
- Extract shared constants to  and centralized Kubernetes config to  to eliminate magic numbers and reduce duplication
- Remove experimental  module (unused code)
- Refactor internal suite orchestration to accept resolved performance/eviction options instead of raw profile names, enabling flexible parameter overrides
- Update documentation to reflect simplified CLI surface and new command structure
- Remove standalone  CLI commands (observation functionality now accessed through  and )
- Update all tests to use new command structure and internal APIs
Refactor codebase for improved maintainability and CLI UX
@shinybrar shinybrar changed the title CADC-15011: Kueue Testing Improvements feat(kueuer): Comprehensive Kueue Testing Framework with Lifecycle Management and Control-Plane Observability Mar 9, 2026
@shinybrar shinybrar changed the title feat(kueuer): Comprehensive Kueue Testing Framework with Lifecycle Management and Control-Plane Observability feat(kueuer): Kueue Testing Framework with Lifecycle Management and Control-Plane Observability Mar 9, 2026
@shinybrar shinybrar requested a review from SharonGoliath March 10, 2026 16:22
@shinybrar shinybrar changed the title feat(kueuer): Kueue Testing Framework with Lifecycle Management and Control-Plane Observability CADC-15011: Kueue testing updates for lifecycle management and control-plane observability Mar 10, 2026

@SharonGoliath SharonGoliath left a comment

  1. It's too big to read individually, so I downloaded and installed it, and did a pylint (default configuration) and pytest --cov on it.
    pylint output that I think is important enough to raise:
    1. Bug: access before definition (src/kueuer/utils/k8s_config.py)
      Pylint: E0203: Access to member '_initialized' before its definition line 42 at line 39.
    2. Wrong number of arguments (src/kueuer/benchmarks/track.py)
      Pylint: E1121: Too many positional arguments for method call at lines 293, 326, 340 (e.g. status(item, "Complete") / status(item, "Failed")).
    3. Cyclic import (pylint R0401)
      Chain: kueuer.benchmarks.benchmark → kueuer.lifecycle.commands → kueuer.lifecycle.suite (and back).
  2. There's no login host, so I don't know how this will change execution instructions. Shaun has set up capsule for multi-tenant (CADC and RCS) user management. Use an OIDC claim for kubectl.
  3. Comments on the documentation:
    1. benchmark-walkthrough.md - line 68 mentions "manifest state". A "manifest" shows up in a lot of places in the code, but this is the only mention in the docs. Either remove the single mention, or explain it if it's important enough.
    2. metrics-semantics.md - lines 90 - 94 read more like they belong in a change log
    3. the docs do not mention the kueue occupation restrictions. For my own knowledge, do these restrictions affect this testing?
    4. What would happen if testing was done with milli-cores, for example?
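
To make the E0203 report on k8s_config.py concrete, here is a minimal sketch of the kind of singleton pattern that commonly triggers it, together with one possible fix. The class `ClusterConfig` and its `loads` attribute are hypothetical and are not the actual code under review. The assumption is that `__init__` reads `self._initialized` a few lines before assigning it, which matches the line 39 / line 42 ordering in the message; giving the attribute a class-level default defines it before that first read.

```python
class ClusterConfig:
    """Hypothetical singleton, used only to illustrate the E0203 pattern."""

    _instance = None
    # Class-level default: the attribute now exists before __init__ reads it.
    # Without this line, the read below precedes the only assignment.
    _initialized = False

    def __new__(cls):
        # Reuse the single shared instance on every construction.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        if self._initialized:  # read (the "line 39" position)
            return
        self.loads = 0
        self.loads += 1  # stand-in for the one-time configuration load
        self._initialized = True  # assignment (the "line 42" position)


first = ClusterConfig()
second = ClusterConfig()
```

Both names refer to the same object and the load runs once. Whether the real module has a runtime bug or only a lint warning depends on whether `_initialized` is set elsewhere before `__init__` runs, which the pylint output alone does not show.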

@shinybrar shinybrar merged commit e2ba92f into main Mar 12, 2026
1 check passed

2 participants