ci(e2e): remove --llm-api-key and Databricks credential setup by TomeHirata · Pull Request #802 · omnigent-ai/omnigent

TomeHirata · 2026-06-19T08:11:27Z

Summary

All e2e tests now use the in-process mock LLM server — no real LLM credentials needed for the standard PR gate. Tests that require real credentials (prompt policy classifier) skip cleanly.

Changes

Remove --llm-api-key, --profile default, --harness databricks from pytest invocation
Remove "Set LLM credentials" and "Write gateway profile (~/.databrickscfg)" steps
Remove OMNIGENT_TEST_MODEL_SPREAD / OMNIGENT_TEST_MODEL_POOL_GPT env vars (only needed for load-balancing real gateway calls)
Update workflow description to reflect mock LLM usage

Test plan

CI passes without credentials (mock LLM handles all tests)
2 prompt-policy tests show as SKIPPED (expected — they check DATABRICKS_TOKEN)

This pull request and its description were written by Isaac.

All e2e tests now use the in-process mock LLM server by default. Tests that require real credentials (prompt policy classifier) skip cleanly via @pytest.mark.skipif(not DATABRICKS_TOKEN, ...). Removes: - --llm-api-key, --profile, --harness flags from pytest invocation - "Set LLM credentials" and "Write gateway profile" steps - OMNIGENT_TEST_MODEL_SPREAD / OMNIGENT_TEST_MODEL_POOL_GPT env vars (only needed for load-balancing real gateway calls) Co-authored-by: Isaac

omnigent-ci · 2026-06-19T08:13:09Z

Polly AI Review

PR Review: `ci(e2e): remove --llm-api-key and Databricks credential setup`

1. Blocking Issues

None.

The removed pytest flags (--llm-api-key, --profile default, --harness databricks) and env vars (OMNIGENT_TEST_MODEL_SPREAD, OMNIGENT_TEST_MODEL_POOL_GPT) are cleanly excised with no dangling references in the remaining workflow. The heredoc credential write and $GITHUB_ENV exports are fully removed.

2. Security Analysis

This diff is a net security improvement:

The old Set LLM credentials step injected LLM_API_KEY into $GITHUB_ENV, making the secret visible to every subsequent step in the job — a well-known GitHub Actions secret-leakage vector. Removing it is strictly correct.
The Write gateway profile step wrote the token in plaintext to ~/.databrickscfg on the runner. Removing it eliminates a credential artifact that could be exfiltrated by a compromised action or a log-scraping attack on the runner's filesystem.
DATABRICKS_BEARER=$LLM_API_KEY passthrough is gone — no longer a lateral-movement path if a sub-process were compromised.

No new attack surface is introduced. The mock-LLM path carries no credentials by definition.

3. Non-Blocking Suggestions

a. Scheduled run now uses mock LLM too.
The workflow fires on schedule (09:00 UTC daily). The credential setup is removed unconditionally, so nightly runs also exercise only the mock path. If nightly.yml already owns real-credential integration coverage this is fine, but it is worth a one-line comment confirming that intent, so a future reader doesn't add credentials back "to fix" the nightly.

b. Test plan checkboxes are unchecked.
Both items in the PR description are [ ]. Recommend confirming CI green and the expected 2 SKIPPED prompt-policy tests before merge, and checking these off (or noting them in the merge commit).

c. tests/_model_pools.py may be dead code.
The removed comment explicitly references tests/_model_pools.py as the consumer of OMNIGENT_TEST_MODEL_SPREAD. With that env var gone, the module is presumably unused in CI. Worth a follow-up to remove or document it so it doesn't confuse contributors.

4. Summary

A clean, well-scoped CI hygiene change. Removing real credentials from the PR gate is the right call once the mock-LLM path covers the full test surface, and the diff actually improves the security posture by eliminating secret-to-env and secret-to-disk patterns. The only risk worth watching is that the scheduled run has silently lost real-gateway coverage — confirming nightly.yml or another job still exercises the Databricks path would close that question entirely. No correctness or security blockers; safe to merge once the test plan is verified green.

_{Automated review by Polly · workflow run}

Removing the credential steps broke tests that use databricks_workspace or omnigent_credentials_env fixtures — they read ~/.databrickscfg at collection time and raise pytest.UsageError when the [default] profile is missing. Write a stub profile using secrets when available, falling back to placeholder values so the file always exists. Tests that need real LLM calls skip via their own guards (skipif(not DATABRICKS_TOKEN)). Co-authored-by: Isaac

Replace pytest.UsageError with pytest.skip in the databricks_workspace fixture so tests requiring real Databricks credentials skip cleanly when ~/.databrickscfg is absent. This removes the need to write a stub profile in e2e.yml — the fixture gates itself, no workaround needed. Co-authored-by: Isaac

databricks_workspace, omnigent_credentials_env, and patched_databrickscfg are no longer used by any e2e test — all tests migrated to mock_credentials_env. Also removes now-unused imports (configparser, shutil, FileLock, lookup_databricks_host) and related constants (_DEFAULT_PROFILE, _DATABRICKSCFG_PATH, _DATABRICKSCFG_LOCK_PATH). Co-authored-by: Isaac

…eway creds test_run_omnigent_example_agents: add --harness openai-agents --model mock-model to agent_with_tools_calculate and coding_supervisor_with_forks cases so the mock LLM handles all turns instead of the YAML's claude-sdk executor (which requires Databricks gateway credentials not available in CI). test_example_coding_supervisor_with_forks: inject ANTHROPIC_BASE_URL, ANTHROPIC_API_KEY, and HARNESS_CLAUDE_SDK_API_KEY_HELPER into the env for the claude-sdk parametrize case so it routes to the mock server. Co-authored-by: Isaac

ClaudeSDKExecutor(gateway=True) reads ~/.databrickscfg before invoking the claude binary. Without the file (e.g. CI without real credentials), it errors before any LLM mock can intercept. Skip rather than fail. Co-authored-by: Isaac

- test_coding_supervisor_with_forks: add skip guard for codex harness when ~/.databrickscfg is absent (same as claude-sdk — CodexExecutor with gateway=True requires Databricks credentials before the binary runs) - test_prompt_policy_allow_path_reaches_llm: re-seed mock-model queue immediately before send_user_message_to_session to shrink the window where a parallel test's reset_mock_llm can clear it; add @pytest.mark.flaky with 2 reruns as a safety net for the remaining race Co-authored-by: Isaac

The server's policy-classifier LLM uses the "mock-model" key on the shared mock server. Per-test reset_mock_llm calls from parallel xdist workers were clearing this queue between configure and the actual classifier call, causing "Policy classifier error (fail-closed)". Fix: add POST /mock/pin endpoint to mock_llm_server.py — pinned queues survive POST /mock/reset. The live_server fixture pins "mock-model" immediately after startup so the policy-classifier queue is safe from parallel resets for the entire session. Co-authored-by: Isaac

…classifier" This reverts commit de66950.

…lures test_policies_e2e.py: fix ruff format (parenthesised assert collapsed). test_prompt_policy_allow_path_reaches_llm is added to known_failures (mode: skip) while the proper fix (pinned mock-model queue surviving parallel reset_mock_llm calls) is tracked separately — the mock server pinning approach needs further debugging before landing. Co-authored-by: Isaac

…et queue The switch and fork+switch paths pass the prior transcript as context directly to the first real LLM call (the recall turn) — no separate replay request is issued. The two-entry queue `[{"text": "OK"}, {"text": marker}]` caused the recall turn to consume "OK" (index 0) while the actual marker was never reached, breaking both test_switch_agent_in_place_carries_history and test_fork_with_agent_switch_carries_history. Note: poll_session_until_terminal returns ALL non-user session items (not just the current turn's), so body_2 in the switch test legitimately includes "ACK" from turn 1 — that is expected behavior, not a bug. Co-authored-by: Isaac

test_parallel_named_sub_agents_e2e consistently flakes across many PRs due to sub-agent auto-wake timing (240s window). Not related to any recent code changes. Adding to known_failures to unblock PR #802. Co-authored-by: Isaac

This reverts commit 34c66f0.

The prompt_policy classifier uses the server-level LLM ("mock-model"). Per-test reset_mock_llm calls from parallel xdist workers cleared the regular queue between configure and the classifier call, causing "Policy classifier error (fail-closed)". Fix: add a non-resettable fallback response to _ResponseQueue. Unlike regular entries, the fallback survives POST /mock/reset — it is used when the regular queue is exhausted. live_server sets "mock-model"'s fallback to {"action": "allow", "reason": ""} so the classifier always returns ALLOW regardless of parallel resets. Integration tests are unaffected: their configured responses take priority over the fallback; the fallback only fires on unexpected extra calls (harmless since client-side tool tests don't make second calls). Also removes the @pytest.mark.flaky workaround and the now-unnecessary re-seed in test_prompt_policy_allow_path_reaches_llm, and removes the known_failures skip entry. Co-authored-by: Isaac

Instead of skipping when ~/.databrickscfg is absent, override the parametrized model to a non-databricks name (e.g. "claude-mock") so ClaudeSDKExecutor/CodexExecutor route through ANTHROPIC_BASE_URL / OPENAI_BASE_URL with gateway=False — no credential file needed. Co-authored-by: Isaac

…roach main already uses del model + mock_model = f"mock-coding-supervisor-{harness}" which keeps all harnesses in mock mode (avoids gateway routing for databricks-* model names). Our model.startswith() check conflicted with the del model line on merge, causing F821. Use main's cleaner version. Co-authored-by: Isaac

MockState.reset() called self.queues.clear() which deleted ALL queue objects including ones with a fallback set via POST /mock/set_fallback. The next resolve_queue() call created a fresh _ResponseQueue without the fallback, so the policy classifier still got no response. Fix: iterate over queues and only delete those without a fallback. Queues with a fallback have their responses/index reset (cleared) but keep the fallback, so the classifier always gets ALLOW even after per-test resets. Co-authored-by: Isaac

…el collision Integration tests configure the "default" queue and use model="mock-model" for agent LLM calls. With the fallback preserved on "mock-model", those calls were hitting the ALLOW fallback instead of the configured responses. Fix: change the server's llm.model to "_policy_llm_" (a key no test uses) and set the ALLOW fallback on that key. Integration tests continue to configure "default" and LLM calls with model="mock-model" fall through to "default" (correct). Policy classifier calls with model="_policy_llm_" get the ALLOW fallback (correct). Co-authored-by: Isaac

github-actions Bot added the size/S Pull request size: S label Jun 19, 2026

TomeHirata added 4 commits June 19, 2026 17:26

Merge branch 'main' into ci/remove-llm-api-key-from-e2e

ab49fe7

github-actions Bot added size/L Pull request size: L and removed size/S Pull request size: S labels Jun 19, 2026

TomeHirata added 8 commits June 19, 2026 17:46

Revert "fix(ci): pin mock-model queue so parallel resets don't clear …

a431cb6

…classifier" This reverts commit de66950.

fix(ci): add parallel_named_sub_agents to known_failures

34c66f0

test_parallel_named_sub_agents_e2e consistently flakes across many PRs due to sub-agent auto-wake timing (240s window). Not related to any recent code changes. Adding to known_failures to unblock PR #802. Co-authored-by: Isaac

TomeHirata added the automerge Automatically Run Merge CI label Jun 19, 2026

TomeHirata added 6 commits June 19, 2026 21:35

Revert "fix(ci): add parallel_named_sub_agents to known_failures"

1fa0842

This reverts commit 34c66f0.

TomeHirata merged commit 3009016 into main Jun 19, 2026
66 of 67 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci(e2e): remove --llm-api-key and Databricks credential setup#802

ci(e2e): remove --llm-api-key and Databricks credential setup#802
TomeHirata merged 19 commits into
mainfrom
ci/remove-llm-api-key-from-e2e

TomeHirata commented Jun 19, 2026

Uh oh!

omnigent-ci Bot commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TomeHirata commented Jun 19, 2026

Summary

Changes

Test plan

Uh oh!

omnigent-ci Bot commented Jun 19, 2026

Polly AI Review

PR Review: ci(e2e): remove --llm-api-key and Databricks credential setup

1. Blocking Issues

2. Security Analysis

3. Non-Blocking Suggestions

4. Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

PR Review: `ci(e2e): remove --llm-api-key and Databricks credential setup`