Skip to content

fix: stabilize CALDERA live stack trace and log verification tests (refs #31)#43

Draft
JJcyborg wants to merge 1 commit into
mainfrom
fix/issue-31-caldera-stability
Draft

fix: stabilize CALDERA live stack trace and log verification tests (refs #31)#43
JJcyborg wants to merge 1 commit into
mainfrom
fix/issue-31-caldera-stability

Conversation

@JJcyborg

@JJcyborg JJcyborg commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Improves test reliability for CALDERA detection coverage by addressing async log buffering, Falco capture gaps, and telemetry schema validation:

  1. Environment-variable configurable timeouts (CALDERA_* env vars)
  2. Exponential backoff polling replacing fixed-interval sleeps
  3. Retry logic for ability execution (up to CALDERA_MAX_RETRIES)
  4. Telemetry stream schema field validation (REQUIRED_STREAM_SCHEMA_FIELDS)
  5. Detailed error messages on timeout showing expected vs found data

Files modified:

  • tools/caldera_otel_harness.py — env-var timeouts, exponential backoff, diagnostic logging
  • tests/test_caldera_detection_coverage.py — retry logic, schema validation test, better assertions
  • tests/caldera_coverage_support.py — schema field mapping, validate_stream_schema() helper

Closes #31

@JJcyborg JJcyborg left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: fix: stabilize CALDERA live stack trace and log verification tests

I've done a thorough review of this PR across all three modified files. Overall this is a solid, well-structured stability fix. The core changes — exponential backoff, configurable timeouts, retry logic, schema validation, and better diagnostics — directly address the flaky-test symptoms described in #31. Below are my detailed findings.


1. Environment Variable Configuration ✅ (with minor suggestion)

The timeout/interval/retry values are all configurable via CALDERA_* env vars with sensible defaults:

Variable Default Previous Assessment
CALDERA_TRACE_VERIFY_TIMEOUT 90s 60s Good — more headroom for slow CI
CALDERA_LOG_VERIFY_TIMEOUT 90s 60s Good
CALDERA_TRACE_VERIFY_INTERVAL 2s 3s Fine — faster initial polling + backoff makes up for it
CALDERA_LOG_VERIFY_INTERVAL 2s 3s Fine
CALDERA_ABILITY_TIMEOUT 30s N/A Reasonable
CALDERA_MAX_RETRIES 3 N/A Reasonable for integration tests
CALDERA_POST_FLUSH_SLEEP 1.0s 0.5s Justified for stability

Minor concern: Both tests/test_caldera_detection_coverage.py and tools/caldera_otel_harness.py read the same env vars (e.g., CALDERA_TRACE_VERIFY_TIMEOUT). The test passes its values as CLI args to the harness, so the harness-level env vars are effectively redundant when invoked from the test. This isn't a bug — the harness can still be used standalone — but it's worth a brief comment clarifying the precedence for future maintainers.

Suggestion (non-blocking): Wrap the int(os.environ.get(...)) calls in try/except with a fallback and a warning log. Currently a malformed env var like CALDERA_TRACE_VERIFY_TIMEOUT=abc will crash at import time with an unhelpful ValueError. A defensive fallback:

def _env_int(key: str, default: int) -> int:
    raw = os.environ.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        print(f"[CALDERA] Warning: invalid value for {key}={raw!r}, using default {default}", file=sys.stderr)
        return default

This would make CI failures much easier to diagnose.


2. Exponential Backoff Implementation ✅

backoff = min(interval_seconds * (1.5 ** min(attempts - 1, 4)), interval_seconds * 4)

With default interval_seconds=2:

  • Attempt 1: 2.0s → Attempt 2: 3.0s → Attempt 3: 4.5s → Attempt 4: 6.75s → Attempt 5+: 8.0s (capped)

The exponential growth stops at 1.5^4 ≈ 5.06x, capped at 4x the interval. Maximum individual wait is 8 seconds with defaults. This is sensible — aggressive enough early, patient later, and never excessive.

One thing to verify: If someone sets a large interval (e.g., CALDERA_TRACE_VERIFY_INTERVAL=30), the max backoff becomes 30 * 4 = 120s, meaning each poll takes 2 minutes. With a 90s timeout, only one poll would happen. This edge case is unlikely but worth documenting that intervals above ~22s effectively disable backoff within the default timeout.


3. Retry Logic ✅ (well-structured)

The _run_ability retry loop handles three categories of failure:

  1. Non-zero exit code → retry with 1s sleep
  2. Trace not verified → retry with 2s sleep
  3. Correlation not verified → retry with 3s sleep
  4. Subprocess exceptions (CalledProcessError, TimeoutExpired, JSONDecodeError) → retry with 2s sleep

The design correctly falls through to returning the last payload on the final attempt, letting downstream assertions produce clear errors. The _diagnostic_log helper prints [CALDERA-RETRY] prefixed messages — excellent for CI debuggability.

Worst-case wall time estimate: 3 retries × (30 + 90 + 90 + 30)s = ~720s per ability. With 4 parametrized safe abilities plus the avoid-logs and coverage tests, total suite time could reach ~1 hour in the worst case. This is acceptable for integration tests but should be documented in a CONTRIBUTING or CI readme.

Minor style note: The three consecutive if blocks (not elif) in the retry loop are correct — we want to return the payload regardless of which condition flagged, and the continue skips to the next attempt. But the structure would be clearer as a single check:

should_retry = False
if payload["exit_code"] != 0:
    should_retry = True; reason = f"non-zero exit_code={payload['exit_code']}"
elif not payload["trace"].get("verified"):
    should_retry = True; reason = "trace not verified"
elif correlation.get("expected_count", 0) > 0 and not correlation.get("verified"):
    should_retry = True; reason = "correlation not verified"
if should_retry and attempt < retries:
    _diagnostic_log(..., reason)
    time.sleep(...)
    continue

This is purely stylistic — the current code is correct.


4. Schema Validation ✅

validate_stream_schema is clean and well-designed:

  • Returns empty list on success (all required fields present) — easy to assert with assert not missing
  • Returns all required fields when hits is empty — correctly surfaces "no data at all" as "everything missing"
  • REQUIRED_STREAM_SCHEMA_FIELDS maps stream names to field lists — extensible

The new test test_telemetry_stream_schema_fields_are_populated validates both the default (trace) and falco stream schemas, but only conditionally checks falco (if falco_hits). This is appropriate — the test only checks falco when that ability actually produces falco data.

Suggestion (non-blocking): Consider adding a 5-line unit test for validate_stream_schema() directly (e.g., in a test_caldera_coverage_support.py). It's simple enough that it might not seem necessary, but a quick unit test would catch regressions in the helper itself:

def test_validate_stream_schema_empty_hits():
    assert validate_stream_schema("default", [], [...]) == [...]
def test_validate_stream_schema_all_present():
    assert validate_stream_schema("default", [{"service_name": "x", ...}], [...]) == []

5. Diagnostics and Error Messages ✅ (excellent improvement)

This is the strongest part of the PR. Every assertion now includes:

  • The ability_id being tested
  • The actual vs expected values
  • Relevant context (trace_id, hit counts, stderr excerpts)
  • JSON dumps of raw data (truncated to 500-2000 chars to avoid log bloat)

The harness-side [CALDERA-TRACE] and [CALDERA-LOGS] diagnostic messages are well-structured:

  • Log attempt counts (attempt N/M)
  • Identify which streams are/aren't found yet
  • Timeout messages include final state summary
  • Throttled logging (first 3 attempts + every 5th) avoids log spam

6. Backward Compatibility ✅

  • _run() adds an optional timeout kwarg (None default) — backward compatible
  • _run_ability() changes from positional to keyword-only retries parameter — this is a breaking API change, but it's test-internal, so acceptable
  • Harness output adds sql and stream_not_found fields to correlation — additive, backward compatible
  • Post-flush sleep increase from 0.5s to 1.0s is a minor slowdown (~0.5s per ability) but justified for stability

7. Duplicate Section Header 🔶 (trivial)

In test_caldera_detection_coverage.py, there are two adjacent section separators:

# ---------------------------------------------------------------------------
# Telemetry stream schema validation helpers
# ---------------------------------------------------------------------------
# ---------------------------------------------------------------------------
# Enhanced assertion helpers
# ---------------------------------------------------------------------------

Should be one separator with a blank line, not two adjacent ones. Trivial style nit.


Summary

Area Verdict
Timeout defaults ✅ Sensible
Exponential backoff ✅ Correct, well-capped
Env var overrides ✅ Good with minor improvement opportunity
Retry logic ✅ Sound
Schema validation ✅ Clean, extensible
Error messages ✅ Excellent — huge improvement
Backward compatibility ✅ Safe
Test coverage ✅ Adequate (new schema test)
Style ✅ Clean (one trivial duplicate header)

Bottom line: This is a well-crafted stability fix that directly addresses the flaky-test symptoms. The suggestions above are all non-blocking — they're improvements for maintainability and edge-case robustness, not correctness issues. I'm approving this with the minor suggestions left as optional follow-ups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Efficacy: Stabilize Live Stack Trace and Log Integration Verification for CALDERA

1 participant