Skip to content

fix(cli): reduce avoidable errors in panda CLI investigations#130

Open
qu0b wants to merge 7 commits into
masterfrom
qu0b/cli-error-ergonomics
Open

fix(cli): reduce avoidable errors in panda CLI investigations#130
qu0b wants to merge 7 commits into
masterfrom
qu0b/cli-error-ergonomics

Conversation

@qu0b

@qu0b qu0b commented Jun 6, 2026

Copy link
Copy Markdown
Member

Reduces avoidable errors hit during raw panda CLI investigations of hosted devnets. Each fix was reproduced and verified against a live devnet.

Background: where OTel logs live

otel_logs/otel_traces exist in different ClickHouse databases per deployment:

  • Hosted multi-VM devnets → external.otel_logs (clickhouse-raw) — the Loki replacement
  • Local Kurtosis devnets → otel.otel_logs (local-kurtosis, default db otel)
  • Team-internal platform → internal.otel_logs

otel_traces sits alongside otel_logs in the same db (join on TraceId); it's tracing, not the logs replacement.

Changes

  • clickhouse alias (operations_clickhouse.go): one server-side chokepoint (CLI + sandbox) auto-qualifies a bare otel_logs/otel_traces to the right db per datasource (external hosted, otel local), so FROM otel_logs works in both modes. Qualified refs and lookalikes (my_otel_logs) are untouched. 10 unit tests.
  • error hints (server_helpers.go): ClickHouse UNKNOWN_TABLE → points at panda schema + names the real otel_logs locations (was a misleading "missing datasource" hint); new 502 hint explains an ethnode instance may not exist and how to find real names.
  • loki redirect (loki_cmd.go): failed loki commands print a runnable panda clickhouse query example against the OTel tables.
  • ethnode (ethnode.go): --help documents that instances can't be enumerated and the client-pair name isn't guessable, with discovery methods; adds the list-networks CLI command (the server op + Python fn already existed — only the wrapper was missing).
  • dora epoch --help (dora.go): query head_epoch - 1; fields nest under data and use lowercase names (globalparticipationrate, attestationscount).

Verification (live)

  • external.otel_logs (what the alias emits) returns rows on the cluster; bare otel_logs fails today
  • ethnode host.name discovery query returns the real nodes (teku-erigon-1/ethrex-1/nethermind-1), confirming the guessed teku-geth-1 never existed
  • panda ethnode list-networks returns live networks
  • error-hint A/B confirmed with old vs new CLI; go build/vet/gofmt/tests clean

Deliberately out of scope

  • No ethnode list-instances — the proxy holds no enumerable instance list by design; node names must come from Dora (/v1/clients/consensus) or otel_logs host.name.
  • panda schema still can't see external.*/internal.* (discovery only enumerates a non-empty default db); the alias + hints route around it.

golangci-lint not run locally: pre-existing env mismatch (binary go1.25 vs repo target 1.26.1). CI runs it with the correct toolchain.

qu0b added 2 commits June 6, 2026 12:58
Hosted-devnet CLI sessions repeatedly hit avoidable errors: querying the
disabled Loki module, guessing non-existent ethnode instance names (bare
502s), parsing null head-epoch fields, and using a bare `otel_logs` that
only resolves on local clusters. Each failure left the agent with a generic
or absent hint that did not point at the right discovery command.

- clickhouse: auto-qualify bare otel_logs/otel_traces to the database that
  owns them per datasource (external on hosted, otel on local-kurtosis), so
  the same query works in both deployment modes. Already-qualified refs and
  lookalike identifiers are preserved.
- cli: message-aware error hints — ClickHouse UNKNOWN_TABLE/UNKNOWN_DATABASE
  now points at `panda schema` and names the real otel_logs locations; a 502
  hint explains an ethnode instance may not exist and how to discover names.
- loki: redirect failed commands to the ClickHouse OTel log tables with a
  ready-to-run query example.
- ethnode: document that instances cannot be enumerated and the client-pair
  convention is not guessable; show how to discover real names.
- dora: document that head/in-progress epochs return null fields (use
  head_epoch-1) and that summary fields nest under the `data` key.
The ethnode.list_networks server operation and Python function already
existed and worked; only the CLI wrapper was missing. Add it, mirroring
list-datasources and reusing printListing.

Note this lists NETWORKS, not node instances — instance names still cannot
be enumerated (the proxy holds no enumerable instance list), so the help
continues to point at Dora / otel_logs host.name for node discovery.
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Eval Smoke (opencode-go/deepseek-v4-flash)

2 tests  ±0   2 ✅ ±0   59s ⏱️ ±0s
1 suites ±0   0 💤 ±0 
1 files   ±0   0 ❌ ±0 

Results for commit 82d2e57. ± Comparison against base commit 5bad70f.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

🔭 Langfuse

View this run’s session (2 traces)

📊 Dataset run 252bef22 on panda-eval-smoke@qu0b-cli-error-ergonomics

Live verification against glamsterdam-devnet-5 showed the Dora epoch fields
are lowercase (globalparticipationrate, attestationscount), not camelCase,
and the head epoch reports partial/artificially-low values rather than strict
null (null only at the boundary / for future epochs). Correct the help so it
does not point users at field names that are always None.
Comment thread pkg/server/operations_clickhouse.go Outdated
// (or start-of-string) so qualified references like `external.otel_logs`,
// `otel.otel_logs`, or a column named `my_otel_logs` are left untouched —
// Go's RE2 has no lookbehind, so the boundary is captured and re-emitted.
var unqualifiedOtelTable = regexp.MustCompile(`(?i)(^|[^.\w])(otel_logs|otel_traces)\b`)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer we leave this out for now I think. We can improve the docs and discovery to the point where the agent should know to provide the db in the query. We can come back to this if we can't fix it up the chain (fairly sure we can though)

Comment thread pkg/cli/server_helpers.go Outdated
return "the requested module, operation, datasource, or resource is not available on this server; check 'panda datasources' and 'panda resources list'"
case http.StatusServiceUnavailable:
return "the server is running but a required service (e.g. sandbox) is not available — check server logs with 'docker compose logs server'"
case http.StatusBadGateway:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This datasource-specific content should be closer to the datasource imo

qu0b added 4 commits June 9, 2026 15:24
- Remove server-side otel table auto-qualification (qualifyOtelTables):
  fix this upstream in docs/discovery instead so the agent generates
  fully-qualified queries from the start
- Move ClickHouse unknown-table error hint from the generic
  serverErrorHint helper to wrapClickHouseError in clickhouse_cmd.go,
  closer to the datasource that produces the error
- Remove StatusBadGateway case with datasource-specific content from
  the generic helper; ethnode command Long description already covers
  instance-name discovery
- Delete the now-orphaned qualifyOtelTablesSQL test file
Two follow-ups exposed by removing server-side otel table
auto-qualification in the previous commit:

1. The ethnode help text and loki redirect synopsis still showed bare
   `otel_logs` against clickhouse-raw (a hosted cluster whose default db
   is `default`) and claimed it was "auto-resolved". With qualification
   gone those examples now fail, so qualify them to `external.otel_logs`
   and drop the stale auto-resolve claim.

2. wrapClickHouseError appended its hint on top of the error returned by
   decodeAPIError. A ClickHouse unknown-table/database error comes back
   as HTTP 404, so the generic status hint ("datasource ... not
   available") fired too — producing two hints, the first misleading.
   decodeAPIError now returns a structured *apiError (status + message);
   wrapClickHouseError matches it via errors.As and rebuilds the error
   with only the table-specific hint, while genuine route 404s (unknown
   datasource) keep the generic hint. Verified live against the local
   server. Adds a unit test covering both branches.
Mirror the ClickHouse unknown-table fix for the Loki redirect. A
disabled Loki module returns HTTP 404, so redirectLokiError was stacking
its synopsis on top of the generic "check 'panda datasources'" hint —
redundant once we point straight at the replacement datasource.

- redirectLokiError now matches the structured *apiError via errors.As
  and rebuilds the error with only the synopsis. Matching on status 404
  / "not enabled" / "not available" is more precise than the old
  substring scan of the formatted string, and a genuine query failure
  against a live Loki (e.g. a 400 parse error) is no longer redirected.
- Align the synopsis example's severity regex with the canonical runbook
  pattern (crit|err|error|fatal) and note that severity lives in Body.
- Add a unit test covering redirect vs non-redirect branches.
@qu0b-reviewer

qu0b-reviewer Bot commented Jun 9, 2026

Copy link
Copy Markdown

🤖 qu0b-reviewer

The PR is clean across all its pieces. No grounded issues.

Summary

The PR makes three ergonomic improvements to the CLI error path:

  1. server_helpers.godecodeAPIError now returns a structured *apiError instead of a plain error, so individual commands can match it via errors.As and attach datasource-specific hints without stacking redundant generic hints.

  2. clickhouse_cmd.gowrapClickHouseError matches *apiError for unknown-table/database messages and replaces the misleading generic 404 hint ("datasource not available") with a table-specific one pointing at panda schema and the correct OTel log table qualification. A unit test covers the two branches plus a non-apiError passthrough.

  3. loki_cmd.goredirectLokiError mirrors the same pattern: matches *apiError for Loki unavailability (404, "not enabled", "not available") and replaces the generic hint with the full ClickHouse redirect synopsis. A 400 parse error against a live Loki is correctly not redirected. A unit test covers four branches.

Secondary changes: help text improvements in dora.go (documenting epoch state and field nesting), ethnode.go (explaining instance names are not enumerable and showing discovery paths + the new list-networks subcommand), and minor copy tweaks.

Issues

  • None — the *apiError type is confined to pkg/cli, all callers use errors.As correctly, tests verify both redirect and passthrough paths, and the wrapClickHouseError / redirectLokiError logic each returns the input error unchanged on non-matching paths.

Suggestions

  • None.

Reviewed @ 82d2e578
"Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live." — John Woods

@qu0b

qu0b commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

@samcm updated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants