fix(cli): reduce avoidable errors in panda CLI investigations#130
Conversation
Hosted-devnet CLI sessions repeatedly hit avoidable errors: querying the disabled Loki module, guessing non-existent ethnode instance names (bare 502s), parsing null head-epoch fields, and using a bare `otel_logs` that only resolves on local clusters. Each failure left the agent with a generic or absent hint that did not point at the right discovery command. - clickhouse: auto-qualify bare otel_logs/otel_traces to the database that owns them per datasource (external on hosted, otel on local-kurtosis), so the same query works in both deployment modes. Already-qualified refs and lookalike identifiers are preserved. - cli: message-aware error hints — ClickHouse UNKNOWN_TABLE/UNKNOWN_DATABASE now points at `panda schema` and names the real otel_logs locations; a 502 hint explains an ethnode instance may not exist and how to discover names. - loki: redirect failed commands to the ClickHouse OTel log tables with a ready-to-run query example. - ethnode: document that instances cannot be enumerated and the client-pair convention is not guessable; show how to discover real names. - dora: document that head/in-progress epochs return null fields (use head_epoch-1) and that summary fields nest under the `data` key.
The ethnode.list_networks server operation and Python function already existed and worked; only the CLI wrapper was missing. Add it, mirroring list-datasources and reusing printListing. Note this lists NETWORKS, not node instances — instance names still cannot be enumerated (the proxy holds no enumerable instance list), so the help continues to point at Dora / otel_logs host.name for node discovery.
🔭 LangfuseView this run’s session (2 traces) 📊 Dataset run |
Live verification against glamsterdam-devnet-5 showed the Dora epoch fields are lowercase (globalparticipationrate, attestationscount), not camelCase, and the head epoch reports partial/artificially-low values rather than strict null (null only at the boundary / for future epochs). Correct the help so it does not point users at field names that are always None.
| // (or start-of-string) so qualified references like `external.otel_logs`, | ||
| // `otel.otel_logs`, or a column named `my_otel_logs` are left untouched — | ||
| // Go's RE2 has no lookbehind, so the boundary is captured and re-emitted. | ||
| var unqualifiedOtelTable = regexp.MustCompile(`(?i)(^|[^.\w])(otel_logs|otel_traces)\b`) |
There was a problem hiding this comment.
I'd prefer we leave this out for now I think. We can improve the docs and discovery to the point where the agent should know to provide the db in the query. We can come back to this if we can't fix it up the chain (fairly sure we can though)
| return "the requested module, operation, datasource, or resource is not available on this server; check 'panda datasources' and 'panda resources list'" | ||
| case http.StatusServiceUnavailable: | ||
| return "the server is running but a required service (e.g. sandbox) is not available — check server logs with 'docker compose logs server'" | ||
| case http.StatusBadGateway: |
There was a problem hiding this comment.
This datasource-specific content should be closer to the datasource imo
- Remove server-side otel table auto-qualification (qualifyOtelTables): fix this upstream in docs/discovery instead so the agent generates fully-qualified queries from the start - Move ClickHouse unknown-table error hint from the generic serverErrorHint helper to wrapClickHouseError in clickhouse_cmd.go, closer to the datasource that produces the error - Remove StatusBadGateway case with datasource-specific content from the generic helper; ethnode command Long description already covers instance-name discovery - Delete the now-orphaned qualifyOtelTablesSQL test file
Two follow-ups exposed by removing server-side otel table
auto-qualification in the previous commit:
1. The ethnode help text and loki redirect synopsis still showed bare
`otel_logs` against clickhouse-raw (a hosted cluster whose default db
is `default`) and claimed it was "auto-resolved". With qualification
gone those examples now fail, so qualify them to `external.otel_logs`
and drop the stale auto-resolve claim.
2. wrapClickHouseError appended its hint on top of the error returned by
decodeAPIError. A ClickHouse unknown-table/database error comes back
as HTTP 404, so the generic status hint ("datasource ... not
available") fired too — producing two hints, the first misleading.
decodeAPIError now returns a structured *apiError (status + message);
wrapClickHouseError matches it via errors.As and rebuilds the error
with only the table-specific hint, while genuine route 404s (unknown
datasource) keep the generic hint. Verified live against the local
server. Adds a unit test covering both branches.
Mirror the ClickHouse unknown-table fix for the Loki redirect. A disabled Loki module returns HTTP 404, so redirectLokiError was stacking its synopsis on top of the generic "check 'panda datasources'" hint — redundant once we point straight at the replacement datasource. - redirectLokiError now matches the structured *apiError via errors.As and rebuilds the error with only the synopsis. Matching on status 404 / "not enabled" / "not available" is more precise than the old substring scan of the formatted string, and a genuine query failure against a live Loki (e.g. a 400 parse error) is no longer redirected. - Align the synopsis example's severity regex with the canonical runbook pattern (crit|err|error|fatal) and note that severity lives in Body. - Add a unit test covering redirect vs non-redirect branches.
…nomics # Conflicts: # pkg/cli/loki_cmd.go
🤖 qu0b-reviewerThe PR is clean across all its pieces. No grounded issues. SummaryThe PR makes three ergonomic improvements to the CLI error path:
Secondary changes: help text improvements in Issues
Suggestions
Reviewed @ |
|
@samcm updated |
Reduces avoidable errors hit during raw
pandaCLI investigations of hosted devnets. Each fix was reproduced and verified against a live devnet.Background: where OTel logs live
otel_logs/otel_tracesexist in different ClickHouse databases per deployment:external.otel_logs(clickhouse-raw) — the Loki replacementotel.otel_logs(local-kurtosis, default dbotel)internal.otel_logsotel_tracessits alongsideotel_logsin the same db (join onTraceId); it's tracing, not the logs replacement.Changes
operations_clickhouse.go): one server-side chokepoint (CLI + sandbox) auto-qualifies a bareotel_logs/otel_tracesto the right db per datasource (externalhosted,otellocal), soFROM otel_logsworks in both modes. Qualified refs and lookalikes (my_otel_logs) are untouched. 10 unit tests.server_helpers.go): ClickHouseUNKNOWN_TABLE→ points atpanda schema+ names the realotel_logslocations (was a misleading "missing datasource" hint); new 502 hint explains an ethnode instance may not exist and how to find real names.loki_cmd.go): failed loki commands print a runnablepanda clickhouse queryexample against the OTel tables.ethnode.go):--helpdocuments that instances can't be enumerated and the client-pair name isn't guessable, with discovery methods; adds thelist-networksCLI command (the server op + Python fn already existed — only the wrapper was missing).dora.go): queryhead_epoch - 1; fields nest underdataand use lowercase names (globalparticipationrate,attestationscount).Verification (live)
external.otel_logs(what the alias emits) returns rows on the cluster; bareotel_logsfails todayhost.namediscovery query returns the real nodes (teku-erigon-1/ethrex-1/nethermind-1), confirming the guessedteku-geth-1never existedpanda ethnode list-networksreturns live networksgo build/vet/gofmt/tests cleanDeliberately out of scope
ethnode list-instances— the proxy holds no enumerable instance list by design; node names must come from Dora (/v1/clients/consensus) orotel_logshost.name.panda schemastill can't seeexternal.*/internal.*(discovery only enumerates a non-empty default db); the alias + hints route around it.