fix: stabilize CALDERA live stack trace and log verification tests (refs #31)#43
fix: stabilize CALDERA live stack trace and log verification tests (refs #31)#43JJcyborg wants to merge 1 commit into
Conversation
JJcyborg
left a comment
There was a problem hiding this comment.
Review: fix: stabilize CALDERA live stack trace and log verification tests
I've done a thorough review of this PR across all three modified files. Overall this is a solid, well-structured stability fix. The core changes — exponential backoff, configurable timeouts, retry logic, schema validation, and better diagnostics — directly address the flaky-test symptoms described in #31. Below are my detailed findings.
1. Environment Variable Configuration ✅ (with minor suggestion)
The timeout/interval/retry values are all configurable via CALDERA_* env vars with sensible defaults:
| Variable | Default | Previous | Assessment |
|---|---|---|---|
CALDERA_TRACE_VERIFY_TIMEOUT |
90s | 60s | Good — more headroom for slow CI |
CALDERA_LOG_VERIFY_TIMEOUT |
90s | 60s | Good |
CALDERA_TRACE_VERIFY_INTERVAL |
2s | 3s | Fine — faster initial polling + backoff makes up for it |
CALDERA_LOG_VERIFY_INTERVAL |
2s | 3s | Fine |
CALDERA_ABILITY_TIMEOUT |
30s | N/A | Reasonable |
CALDERA_MAX_RETRIES |
3 | N/A | Reasonable for integration tests |
CALDERA_POST_FLUSH_SLEEP |
1.0s | 0.5s | Justified for stability |
Minor concern: Both tests/test_caldera_detection_coverage.py and tools/caldera_otel_harness.py read the same env vars (e.g., CALDERA_TRACE_VERIFY_TIMEOUT). The test passes its values as CLI args to the harness, so the harness-level env vars are effectively redundant when invoked from the test. This isn't a bug — the harness can still be used standalone — but it's worth a brief comment clarifying the precedence for future maintainers.
Suggestion (non-blocking): Wrap the int(os.environ.get(...)) calls in try/except with a fallback and a warning log. Currently a malformed env var like CALDERA_TRACE_VERIFY_TIMEOUT=abc will crash at import time with an unhelpful ValueError. A defensive fallback:
def _env_int(key: str, default: int) -> int:
raw = os.environ.get(key)
if raw is None:
return default
try:
return int(raw)
except ValueError:
print(f"[CALDERA] Warning: invalid value for {key}={raw!r}, using default {default}", file=sys.stderr)
return defaultThis would make CI failures much easier to diagnose.
2. Exponential Backoff Implementation ✅
backoff = min(interval_seconds * (1.5 ** min(attempts - 1, 4)), interval_seconds * 4)With default interval_seconds=2:
- Attempt 1: 2.0s → Attempt 2: 3.0s → Attempt 3: 4.5s → Attempt 4: 6.75s → Attempt 5+: 8.0s (capped)
The exponential growth stops at 1.5^4 ≈ 5.06x, capped at 4x the interval. Maximum individual wait is 8 seconds with defaults. This is sensible — aggressive enough early, patient later, and never excessive.
One thing to verify: If someone sets a large interval (e.g., CALDERA_TRACE_VERIFY_INTERVAL=30), the max backoff becomes 30 * 4 = 120s, meaning each poll takes 2 minutes. With a 90s timeout, only one poll would happen. This edge case is unlikely but worth documenting that intervals above ~22s effectively disable backoff within the default timeout.
3. Retry Logic ✅ (well-structured)
The _run_ability retry loop handles three categories of failure:
- Non-zero exit code → retry with 1s sleep
- Trace not verified → retry with 2s sleep
- Correlation not verified → retry with 3s sleep
- Subprocess exceptions (CalledProcessError, TimeoutExpired, JSONDecodeError) → retry with 2s sleep
The design correctly falls through to returning the last payload on the final attempt, letting downstream assertions produce clear errors. The _diagnostic_log helper prints [CALDERA-RETRY] prefixed messages — excellent for CI debuggability.
Worst-case wall time estimate: 3 retries × (30 + 90 + 90 + 30)s = ~720s per ability. With 4 parametrized safe abilities plus the avoid-logs and coverage tests, total suite time could reach ~1 hour in the worst case. This is acceptable for integration tests but should be documented in a CONTRIBUTING or CI readme.
Minor style note: The three consecutive if blocks (not elif) in the retry loop are correct — we want to return the payload regardless of which condition flagged, and the continue skips to the next attempt. But the structure would be clearer as a single check:
should_retry = False
if payload["exit_code"] != 0:
should_retry = True; reason = f"non-zero exit_code={payload['exit_code']}"
elif not payload["trace"].get("verified"):
should_retry = True; reason = "trace not verified"
elif correlation.get("expected_count", 0) > 0 and not correlation.get("verified"):
should_retry = True; reason = "correlation not verified"
if should_retry and attempt < retries:
_diagnostic_log(..., reason)
time.sleep(...)
continueThis is purely stylistic — the current code is correct.
4. Schema Validation ✅
validate_stream_schema is clean and well-designed:
- Returns empty list on success (all required fields present) — easy to assert with
assert not missing - Returns all required fields when
hitsis empty — correctly surfaces "no data at all" as "everything missing" REQUIRED_STREAM_SCHEMA_FIELDSmaps stream names to field lists — extensible
The new test test_telemetry_stream_schema_fields_are_populated validates both the default (trace) and falco stream schemas, but only conditionally checks falco (if falco_hits). This is appropriate — the test only checks falco when that ability actually produces falco data.
Suggestion (non-blocking): Consider adding a 5-line unit test for validate_stream_schema() directly (e.g., in a test_caldera_coverage_support.py). It's simple enough that it might not seem necessary, but a quick unit test would catch regressions in the helper itself:
def test_validate_stream_schema_empty_hits():
assert validate_stream_schema("default", [], [...]) == [...]
def test_validate_stream_schema_all_present():
assert validate_stream_schema("default", [{"service_name": "x", ...}], [...]) == []5. Diagnostics and Error Messages ✅ (excellent improvement)
This is the strongest part of the PR. Every assertion now includes:
- The ability_id being tested
- The actual vs expected values
- Relevant context (trace_id, hit counts, stderr excerpts)
- JSON dumps of raw data (truncated to 500-2000 chars to avoid log bloat)
The harness-side [CALDERA-TRACE] and [CALDERA-LOGS] diagnostic messages are well-structured:
- Log attempt counts (
attempt N/M) - Identify which streams are/aren't found yet
- Timeout messages include final state summary
- Throttled logging (first 3 attempts + every 5th) avoids log spam
6. Backward Compatibility ✅
_run()adds an optionaltimeoutkwarg (None default) — backward compatible_run_ability()changes from positional to keyword-onlyretriesparameter — this is a breaking API change, but it's test-internal, so acceptable- Harness output adds
sqlandstream_not_foundfields to correlation — additive, backward compatible - Post-flush sleep increase from 0.5s to 1.0s is a minor slowdown (~0.5s per ability) but justified for stability
7. Duplicate Section Header 🔶 (trivial)
In test_caldera_detection_coverage.py, there are two adjacent section separators:
# ---------------------------------------------------------------------------
# Telemetry stream schema validation helpers
# ---------------------------------------------------------------------------
# ---------------------------------------------------------------------------
# Enhanced assertion helpers
# ---------------------------------------------------------------------------Should be one separator with a blank line, not two adjacent ones. Trivial style nit.
Summary
| Area | Verdict |
|---|---|
| Timeout defaults | ✅ Sensible |
| Exponential backoff | ✅ Correct, well-capped |
| Env var overrides | ✅ Good with minor improvement opportunity |
| Retry logic | ✅ Sound |
| Schema validation | ✅ Clean, extensible |
| Error messages | ✅ Excellent — huge improvement |
| Backward compatibility | ✅ Safe |
| Test coverage | ✅ Adequate (new schema test) |
| Style | ✅ Clean (one trivial duplicate header) |
Bottom line: This is a well-crafted stability fix that directly addresses the flaky-test symptoms. The suggestions above are all non-blocking — they're improvements for maintainability and edge-case robustness, not correctness issues. I'm approving this with the minor suggestions left as optional follow-ups.
Improves test reliability for CALDERA detection coverage by addressing async log buffering, Falco capture gaps, and telemetry schema validation:
Files modified:
tools/caldera_otel_harness.py— env-var timeouts, exponential backoff, diagnostic loggingtests/test_caldera_detection_coverage.py— retry logic, schema validation test, better assertionstests/caldera_coverage_support.py— schema field mapping, validate_stream_schema() helperCloses #31