fix: stabilize CALDERA live stack trace and log verification tests (refs #31) by JJcyborg · Pull Request #43 · JJediny/LocalObserve

JJcyborg · 2026-06-09T02:13:45Z

Improves test reliability for CALDERA detection coverage by addressing async log buffering, Falco capture gaps, and telemetry schema validation:

Environment-variable configurable timeouts (CALDERA_* env vars)
Exponential backoff polling replacing fixed-interval sleeps
Retry logic for ability execution (up to CALDERA_MAX_RETRIES)
Telemetry stream schema field validation (REQUIRED_STREAM_SCHEMA_FIELDS)
Detailed error messages on timeout showing expected vs found data

Files modified:

tools/caldera_otel_harness.py — env-var timeouts, exponential backoff, diagnostic logging
tests/test_caldera_detection_coverage.py — retry logic, schema validation test, better assertions
tests/caldera_coverage_support.py — schema field mapping, validate_stream_schema() helper

Closes #31

…efs #31)

JJcyborg

Review: fix: stabilize CALDERA live stack trace and log verification tests

I've done a thorough review of this PR across all three modified files. Overall this is a solid, well-structured stability fix. The core changes — exponential backoff, configurable timeouts, retry logic, schema validation, and better diagnostics — directly address the flaky-test symptoms described in #31. Below are my detailed findings.

1. Environment Variable Configuration ✅ (with minor suggestion)

The timeout/interval/retry values are all configurable via CALDERA_* env vars with sensible defaults:

Variable	Default	Previous	Assessment
`CALDERA_TRACE_VERIFY_TIMEOUT`	90s	60s	Good — more headroom for slow CI
`CALDERA_LOG_VERIFY_TIMEOUT`	90s	60s	Good
`CALDERA_TRACE_VERIFY_INTERVAL`	2s	3s	Fine — faster initial polling + backoff makes up for it
`CALDERA_LOG_VERIFY_INTERVAL`	2s	3s	Fine
`CALDERA_ABILITY_TIMEOUT`	30s	N/A	Reasonable
`CALDERA_MAX_RETRIES`	3	N/A	Reasonable for integration tests
`CALDERA_POST_FLUSH_SLEEP`	1.0s	0.5s	Justified for stability

Minor concern: Both tests/test_caldera_detection_coverage.py and tools/caldera_otel_harness.py read the same env vars (e.g., CALDERA_TRACE_VERIFY_TIMEOUT). The test passes its values as CLI args to the harness, so the harness-level env vars are effectively redundant when invoked from the test. This isn't a bug — the harness can still be used standalone — but it's worth a brief comment clarifying the precedence for future maintainers.

Suggestion (non-blocking): Wrap the int(os.environ.get(...)) calls in try/except with a fallback and a warning log. Currently a malformed env var like CALDERA_TRACE_VERIFY_TIMEOUT=abc will crash at import time with an unhelpful ValueError. A defensive fallback:

def _env_int(key: str, default: int) -> int:
    raw = os.environ.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        print(f"[CALDERA] Warning: invalid value for {key}={raw!r}, using default {default}", file=sys.stderr)
        return default

This would make CI failures much easier to diagnose.

2. Exponential Backoff Implementation ✅

backoff = min(interval_seconds * (1.5 ** min(attempts - 1, 4)), interval_seconds * 4)

With default interval_seconds=2:

Attempt 1: 2.0s → Attempt 2: 3.0s → Attempt 3: 4.5s → Attempt 4: 6.75s → Attempt 5+: 8.0s (capped)

The exponential growth stops at 1.5^4 ≈ 5.06x, capped at 4x the interval. Maximum individual wait is 8 seconds with defaults. This is sensible — aggressive enough early, patient later, and never excessive.

One thing to verify: If someone sets a large interval (e.g., CALDERA_TRACE_VERIFY_INTERVAL=30), the max backoff becomes 30 * 4 = 120s, meaning each poll takes 2 minutes. With a 90s timeout, only one poll would happen. This edge case is unlikely but worth documenting that intervals above ~22s effectively disable backoff within the default timeout.

3. Retry Logic ✅ (well-structured)

The _run_ability retry loop handles three categories of failure:

Non-zero exit code → retry with 1s sleep
Trace not verified → retry with 2s sleep
Correlation not verified → retry with 3s sleep
Subprocess exceptions (CalledProcessError, TimeoutExpired, JSONDecodeError) → retry with 2s sleep

The design correctly falls through to returning the last payload on the final attempt, letting downstream assertions produce clear errors. The _diagnostic_log helper prints [CALDERA-RETRY] prefixed messages — excellent for CI debuggability.

Worst-case wall time estimate: 3 retries × (30 + 90 + 90 + 30)s = ~720s per ability. With 4 parametrized safe abilities plus the avoid-logs and coverage tests, total suite time could reach ~1 hour in the worst case. This is acceptable for integration tests but should be documented in a CONTRIBUTING or CI readme.

Minor style note: The three consecutive if blocks (not elif) in the retry loop are correct — we want to return the payload regardless of which condition flagged, and the continue skips to the next attempt. But the structure would be clearer as a single check:

should_retry = False
if payload["exit_code"] != 0:
    should_retry = True; reason = f"non-zero exit_code={payload['exit_code']}"
elif not payload["trace"].get("verified"):
    should_retry = True; reason = "trace not verified"
elif correlation.get("expected_count", 0) > 0 and not correlation.get("verified"):
    should_retry = True; reason = "correlation not verified"
if should_retry and attempt < retries:
    _diagnostic_log(..., reason)
    time.sleep(...)
    continue

This is purely stylistic — the current code is correct.

4. Schema Validation ✅

validate_stream_schema is clean and well-designed:

Returns empty list on success (all required fields present) — easy to assert with assert not missing
Returns all required fields when hits is empty — correctly surfaces "no data at all" as "everything missing"
REQUIRED_STREAM_SCHEMA_FIELDS maps stream names to field lists — extensible

The new test test_telemetry_stream_schema_fields_are_populated validates both the default (trace) and falco stream schemas, but only conditionally checks falco (if falco_hits). This is appropriate — the test only checks falco when that ability actually produces falco data.

Suggestion (non-blocking): Consider adding a 5-line unit test for validate_stream_schema() directly (e.g., in a test_caldera_coverage_support.py). It's simple enough that it might not seem necessary, but a quick unit test would catch regressions in the helper itself:

def test_validate_stream_schema_empty_hits():
    assert validate_stream_schema("default", [], [...]) == [...]
def test_validate_stream_schema_all_present():
    assert validate_stream_schema("default", [{"service_name": "x", ...}], [...]) == []

5. Diagnostics and Error Messages ✅ (excellent improvement)

This is the strongest part of the PR. Every assertion now includes:

The ability_id being tested
The actual vs expected values
Relevant context (trace_id, hit counts, stderr excerpts)
JSON dumps of raw data (truncated to 500-2000 chars to avoid log bloat)

The harness-side [CALDERA-TRACE] and [CALDERA-LOGS] diagnostic messages are well-structured:

Log attempt counts (attempt N/M)
Identify which streams are/aren't found yet
Timeout messages include final state summary
Throttled logging (first 3 attempts + every 5th) avoids log spam

6. Backward Compatibility ✅

_run() adds an optional timeout kwarg (None default) — backward compatible
_run_ability() changes from positional to keyword-only retries parameter — this is a breaking API change, but it's test-internal, so acceptable
Harness output adds sql and stream_not_found fields to correlation — additive, backward compatible
Post-flush sleep increase from 0.5s to 1.0s is a minor slowdown (~0.5s per ability) but justified for stability

7. Duplicate Section Header 🔶 (trivial)

In test_caldera_detection_coverage.py, there are two adjacent section separators:

# ---------------------------------------------------------------------------
# Telemetry stream schema validation helpers
# ---------------------------------------------------------------------------
# ---------------------------------------------------------------------------
# Enhanced assertion helpers
# ---------------------------------------------------------------------------

Should be one separator with a blank line, not two adjacent ones. Trivial style nit.

Summary

Area	Verdict
Timeout defaults	✅ Sensible
Exponential backoff	✅ Correct, well-capped
Env var overrides	✅ Good with minor improvement opportunity
Retry logic	✅ Sound
Schema validation	✅ Clean, extensible
Error messages	✅ Excellent — huge improvement
Backward compatibility	✅ Safe
Test coverage	✅ Adequate (new schema test)
Style	✅ Clean (one trivial duplicate header)

Bottom line: This is a well-crafted stability fix that directly addresses the flaky-test symptoms. The suggestions above are all non-blocking — they're improvements for maintainability and edge-case robustness, not correctness issues. I'm approving this with the minor suggestions left as optional follow-ups.

fix: stabilize CALDERA live stack trace and log verification tests (r…

55bf57f

…efs #31)

JJcyborg commented Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: stabilize CALDERA live stack trace and log verification tests (refs #31)#43

fix: stabilize CALDERA live stack trace and log verification tests (refs #31)#43
JJcyborg wants to merge 1 commit into
mainfrom
fix/issue-31-caldera-stability

JJcyborg commented Jun 9, 2026

Uh oh!

JJcyborg left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JJcyborg commented Jun 9, 2026

Uh oh!

JJcyborg left a comment

Choose a reason for hiding this comment

Review: fix: stabilize CALDERA live stack trace and log verification tests

1. Environment Variable Configuration ✅ (with minor suggestion)

2. Exponential Backoff Implementation ✅

3. Retry Logic ✅ (well-structured)

4. Schema Validation ✅

5. Diagnostics and Error Messages ✅ (excellent improvement)

6. Backward Compatibility ✅

7. Duplicate Section Header 🔶 (trivial)

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant