feat: in-process LLM credential hot-reload by lalan7 · Pull Request #2955 · openshift/lightspeed-service

lalan7 · 2026-06-16T19:15:23Z

Summary

When credentialsSecretRef is updated by a CronJob or external rotation, the lightspeed-operator triggers a rolling restart of lightspeed-app-server. For customers using short-lived LLM tokens (1-hour validity), this causes hourly pod restarts and temporary capacity loss — including reloading RAG vector indexes and embedding models.

This PR makes ProviderConfig re-read the credential file on every LLM request via a new get_credentials() method, so rotated secrets are picked up without a pod restart. This follows the Prometheus credentials_file pattern: re-read on every request, no caching, no edge cases with K8s symlink swaps.

Changes

ols/utils/checks.py — New read_secret_from_path() helper for repeated file reads
ols/app/models/config.py — Store _credentials_path in ProviderConfig.__init__(), add get_credentials() method that re-reads the file when a path is configured
All LLM providers (openai, azure_openai, watsonx, rhoai_vllm, rhelai_vllm, google_vertex) — Use get_credentials() instead of .credentials
Unit tests — Tests for the helper, ProviderConfig.get_credentials() (cached, re-read, fallback, directory), and OpenAI provider credential rotation
docs/credential-hot-reload.md — Design rationale and test plan

Why this approach

Concern	Answer
K8s symlink safety	`kubelet` uses atomic symlink swaps (`..data`); `os.stat()` mtime can be unreliable after swaps. Re-reading avoids the edge case.
Precedent	Prometheus `credentials_file`, Envoy, controller-runtime `CertWatcher`, kube-proxy all re-read files.
Dependencies	None — no watchdog, fsnotify, or background threads.
Performance	One `open()+read()` of a <100 byte file per LLM request is negligible vs LLM call latency (seconds).
Thread safety	Each request reads independently; no shared mutable state.

Scope

Hot-reload applies to top-level credentials_path, which is the path used by operator-managed deployments (credentialsSecretRef). Provider-specific config blocks (openai_config.credentials_path, azure_openai_config.credentials_path, etc.) continue to cache credentials at startup — this matches the existing operator behavior, which does not generate these blocks.

Azure AD credentials (tenant_id, client_id, client_secret) are also read once at startup. These can be addressed in a follow-up if needed.

Test plan

read_secret_from_path() — file, directory, nonexistent, rotation
ProviderConfig.get_credentials() — cached fallback, disk re-read, read failure fallback, directory path
OpenAI provider — verifies rotated credential is picked up across two load_llm() calls
Full suite: 1067 tests pass, ruff check clean, mypy clean

Ref

RFE-9380

When credentialsSecretRef is updated by a CronJob or external rotation, the lightspeed-operator currently triggers a rolling restart. For short-lived tokens (1-hour rotation) this causes hourly capacity loss. This change makes ProviderConfig re-read the credential file on every LLM request via a new get_credentials() method, following the same Prometheus-style pattern (credentials_file re-read on every scrape). This is safe with Kubernetes atomic symlink swaps, requires no new dependencies, and adds negligible overhead vs the LLM call latency. Changes: - ols/utils/checks.py: add read_secret_from_path() helper - ols/app/models/config.py: store _credentials_path, add get_credentials() - All LLM providers: use get_credentials() instead of .credentials - Unit tests for helper, ProviderConfig, and OpenAI provider rotation - docs/credential-hot-reload.md: design rationale and test plan Ref: https://issues.redhat.com/browse/RFE-9380 Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai · 2026-06-16T19:15:40Z

📝 Walkthrough

Walkthrough

Adds in-process LLM credential hot-reload (RFE-9380). A new read_secret_from_path utility reads a credential file on each call. ProviderConfig gains a _credentials_path attribute and a get_credentials() method that delegates to this utility. All six LLM providers are updated to call get_credentials() instead of reading the static credentials field directly. Tests and documentation are included.

Changes

Credential Hot-Reload

Layer / File(s)	Summary
`read_secret_from_path` helper and tests `ols/utils/checks.py`, `tests/unit/utils/test_checks.py`	`read_secret_from_path(path, default_filename)` is added to resolve a path as a file or directory, read and strip contents, and return `None` with a warning on `OSError`. Tests cover file, directory, missing, and rotation cases.
`ProviderConfig._credentials_path` and `get_credentials()` `ols/app/models/config.py`, `tests/unit/app/models/test_config.py`	`ProviderConfig` gains `_credentials_path` stored at init and a `get_credentials()` method that calls `read_secret_from_path` per invocation, falling back to cached `credentials` when the path is absent or unreadable. Four unit tests validate all branches.
Provider callsite updates and rotation test `ols/src/llms/providers/openai.py`, `azure_openai.py`, `google_vertex.py`, `rhoai_vllm.py`, `rhelai_vllm.py`, `watsonx.py`, `tests/unit/llms/providers/test_openai.py`	All six providers switch from `provider_config.credentials` to `provider_config.get_credentials()`. GoogleVertex variants add a `ValueError` when the result is `None`. An end-to-end pytest confirms `openai_api_key` updates when the backing file is rotated.
Hot-reload design documentation `docs/credential-hot-reload.md`	New doc covers the rolling-restart problem, per-request re-read rationale, changed components, Podman and OpenShift validation steps, a verification checklist, and operator PR coordination notes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title clearly and specifically summarizes the main change: implementing in-process LLM credential hot-reload, which is the primary objective of the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-16T19:15:54Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tisnik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/unit/app/models/test_config.py (1)

4357-4443: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add type hints to the new test signatures.

The new tests in this block omit parameter/return annotations, which conflicts with strict typing
requirements.

Suggested fix

+from pathlib import Path
@@
-def test_provider_config_get_credentials_returns_cached_when_no_path():
+def test_provider_config_get_credentials_returns_cached_when_no_path() -> None:
@@
-def test_provider_config_get_credentials_rereads_from_disk(tmp_path):
+def test_provider_config_get_credentials_rereads_from_disk(tmp_path: Path) -> None:
@@
-def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path):
+def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path: Path) -> None:
@@
-def test_provider_config_get_credentials_with_directory(tmp_path):
+def test_provider_config_get_credentials_with_directory(tmp_path: Path) -> None:

As per coding guidelines, "All function signatures must include type hints".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/app/models/test_config.py` around lines 4357 - 4443, Add type
hints to all four test function signatures in this block. For
test_provider_config_get_credentials_returns_cached_when_no_path, add return
type hint -> None. For test_provider_config_get_credentials_rereads_from_disk,
test_provider_config_get_credentials_falls_back_on_read_failure, and
test_provider_config_get_credentials_with_directory, add the parameter type hint
for tmp_path as pathlib.Path and return type hint as -> None to each function
signature.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/credential-hot-reload.md`:
- Around line 52-57: The fenced code block showing the Request flow diagram is
missing a language identifier, which triggers markdownlint rule MD040. Add the
language identifier "text" to the opening fence of the code block (the three
backticks before the Request line) so it reads ```text instead of just ```. This
will satisfy the markdown linting requirement while properly labeling the
content type.

In `@ols/app/models/config.py`:
- Around line 612-618: The get_credentials() method in ols/app/models/config.py
returns the fresh credential value without persisting it to the cached
self.credentials instance variable. When fresh credentials are successfully read
from the path (when fresh is not None), update self.credentials with this fresh
value before returning it. This ensures that subsequent calls to
get_credentials() will return the last known-good rotated key rather than
falling back to a stale startup value if a later read fails.

In `@ols/src/llms/providers/google_vertex.py`:
- Around line 35-38: Replace the `ValueError` exception with
`InvalidConfigurationError` in the credential validation logic of the Google
Vertex provider. Import `InvalidConfigurationError` from `ols.utils.checks` at
the top of the file if not already present. Then update the exception raised
when `creds_value is None` (at the location around lines 35-38) and apply the
same change to the second occurrence mentioned at lines 74-79 where similar
credential validation occurs, ensuring both locations use the domain-specific
exception type for consistent configuration error handling across the provider.

In `@tests/unit/llms/providers/test_openai.py`:
- Around line 275-302: The function `test_openai_picks_up_rotated_credentials`
is missing type hints on its parameters and return type, violating the
repository's typed-signature requirement. Add type annotations to both the
`tmp_path` parameter (use the appropriate Path type from pathlib) and the
`fake_certifi_store` parameter (determine the correct fixture type), and add a
return type hint of None for the test function.

---

Outside diff comments:
In `@tests/unit/app/models/test_config.py`:
- Around line 4357-4443: Add type hints to all four test function signatures in
this block. For
test_provider_config_get_credentials_returns_cached_when_no_path, add return
type hint -> None. For test_provider_config_get_credentials_rereads_from_disk,
test_provider_config_get_credentials_falls_back_on_read_failure, and
test_provider_config_get_credentials_with_directory, add the parameter type hint
for tmp_path as pathlib.Path and return type hint as -> None to each function
signature.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 64086aa0-5d90-4440-9524-4d7a4b23a9e5

📥 Commits

Reviewing files that changed from the base of the PR and between a8aa7a8 and bc5ee67.

📒 Files selected for processing (12)

docs/credential-hot-reload.md
ols/app/models/config.py
ols/src/llms/providers/azure_openai.py
ols/src/llms/providers/google_vertex.py
ols/src/llms/providers/openai.py
ols/src/llms/providers/rhelai_vllm.py
ols/src/llms/providers/rhoai_vllm.py
ols/src/llms/providers/watsonx.py
ols/utils/checks.py
tests/unit/app/models/test_config.py
tests/unit/llms/providers/test_openai.py
tests/unit/utils/test_checks.py

coderabbitai · 2026-06-16T19:22:56Z

+```
+Request → load_llm() → LLMProvider.__init__()
+                          → provider_config.get_credentials()
+                              → open(credentials_path) + read()
+                              → return fresh value
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language identifier to the fenced code block.

This block is missing a language tag and triggers markdownlint MD040.

Suggested fix

-``` +```text Request → load_llm() → LLMProvider.__init__() → provider_config.get_credentials() → open(credentials_path) + read() → return fresh value -``` +```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```

Request → load_llm() → LLMProvider.__init__()

→ provider_config.get_credentials()

→ open(credentials_path) + read()

→ return fresh value

```

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 52-52: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/credential-hot-reload.md` around lines 52 - 57, The fenced code block showing the Request flow diagram is missing a language identifier, which triggers markdownlint rule MD040. Add the language identifier "text" to the opening fence of the code block (the three backticks before the Request line) so it reads ```text instead of just ```. This will satisfy the markdown linting requirement while properly labeling the content type.

Source: Linters/SAST tools

coderabbitai · 2026-06-16T19:22:56Z

+        if self._credentials_path is not None:
+            fresh = checks.read_secret_from_path(
+                self._credentials_path, constants.API_TOKEN_FILENAME
+            )
+            if fresh is not None:
+                return fresh
+        return self.credentials


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist fresh credentials before returning them.

On Line 617, get_credentials() returns the fresh value but does not update the cached
self.credentials. If a later read fails, fallback returns a stale startup key rather than the
last known-good rotated key.

Suggested fix

if self._credentials_path is not None: fresh = checks.read_secret_from_path( self._credentials_path, constants.API_TOKEN_FILENAME ) if fresh is not None: - return fresh + self.credentials = fresh + return fresh return self.credentials

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if self._credentials_path is not None:

fresh = checks.read_secret_from_path(

self._credentials_path, constants.API_TOKEN_FILENAME

)

if fresh is not None:

return fresh

return self.credentials

if self._credentials_path is not None:

fresh = checks.read_secret_from_path(

self._credentials_path, constants.API_TOKEN_FILENAME

)

if fresh is not None:

self.credentials = fresh

return fresh

return self.credentials

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/app/models/config.py` around lines 612 - 618, The get_credentials() method in ols/app/models/config.py returns the fresh credential value without persisting it to the cached self.credentials instance variable. When fresh credentials are successfully read from the path (when fresh is not None), update self.credentials with this fresh value before returning it. This ensures that subsequent calls to get_credentials() will return the last known-good rotated key rather than falling back to a stale startup value if a later read fails.

coderabbitai · 2026-06-16T19:22:56Z

+        creds_value = self.provider_config.get_credentials()
+        if creds_value is None:
            raise ValueError("credentials are required for Google Vertex provider")
-        self.credentials = load_vertex_credentials(self.provider_config.credentials)
+        self.credentials = load_vertex_credentials(creds_value)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use InvalidConfigurationError instead of ValueError for missing credentials.

Both new guard clauses raise generic ValueError; per repository policy these configuration failures should use
the domain-specific exception type for consistent handling.

Proposed fix

+from ols.utils.checks import InvalidConfigurationError @@ - if creds_value is None: - raise ValueError("credentials are required for Google Vertex provider") + if creds_value is None: + raise InvalidConfigurationError( + "credentials are required for Google Vertex provider" + ) @@ - if creds_value is None: - raise ValueError( + if creds_value is None: + raise InvalidConfigurationError( "credentials are required for Google Vertex Anthropic provider" )

As per coding guidelines, "Use custom exceptions from their respective domain modules ... InvalidConfigurationError from ols.utils.checks".

Also applies to: 74-79

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/src/llms/providers/google_vertex.py` around lines 35 - 38, Replace the `ValueError` exception with `InvalidConfigurationError` in the credential validation logic of the Google Vertex provider. Import `InvalidConfigurationError` from `ols.utils.checks` at the top of the file if not already present. Then update the exception raised when `creds_value is None` (at the location around lines 35-38) and apply the same change to the second occurrence mentioned at lines 74-79 where similar credential validation occurs, ensuring both locations use the domain-specific exception type for consistent configuration error handling across the provider.

Source: Coding guidelines

coderabbitai · 2026-06-16T19:22:56Z

+def test_openai_picks_up_rotated_credentials(tmp_path, fake_certifi_store):
+    """Test that OpenAI provider re-reads credentials on each load."""
+    secret_file = tmp_path / "apitoken"
+    secret_file.write_text("initial-key")
+
+    config = ProviderConfig(
+        {
+            "name": "test_provider",
+            "type": "openai",
+            "url": "http://test.url",
+            "credentials_path": str(secret_file),
+            "models": [
+                {
+                    "name": "test_model",
+                    "url": "http://test.url/",
+                    "credentials_path": str(secret_file),
+                }
+            ],
+        }
+    )
+
+    openai_1 = OpenAI(model="test_model", provider_config=config)
+    assert openai_1.default_params["openai_api_key"] == "initial-key"
+
+    secret_file.write_text("rotated-key")
+
+    openai_2 = OpenAI(model="test_model", provider_config=config)
+    assert openai_2.default_params["openai_api_key"] == "rotated-key"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add type hints to the new test function signature.

The new test function omits parameter/return annotations, which conflicts with the repository’s typed-signature rule.

Proposed fix

+from pathlib import Path @@ -def test_openai_picks_up_rotated_credentials(tmp_path, fake_certifi_store): +def test_openai_picks_up_rotated_credentials( + tmp_path: Path, fake_certifi_store: str +) -> None:

As per coding guidelines, "All function signatures must include type hints".

🧰 Tools

🪛 ast-grep (0.43.0)

[warning] 283-283: Do not make http calls without encryption
Context: "http://test.url"
Note: [CWE-319].

(requests-http)

[warning] 288-288: Do not make http calls without encryption
Context: "http://test.url/"
Note: [CWE-319].

(requests-http)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/llms/providers/test_openai.py` around lines 275 - 302, The function `test_openai_picks_up_rotated_credentials` is missing type hints on its parameters and return type, violating the repository's typed-signature requirement. Add type annotations to both the `tmp_path` parameter (use the appropriate Path type from pathlib) and the `fake_certifi_store` parameter (determine the correct fixture type), and add a return type hint of None for the test function.

Source: Coding guidelines

openshift-ci · 2026-06-16T20:46:54Z

@lalan7: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from raptorsun and tisnik June 16, 2026 19:15

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: in-process LLM credential hot-reload#2955

feat: in-process LLM credential hot-reload#2955
lalan7 wants to merge 1 commit into
openshift:mainfrom
lalan7:feat/credential-hot-reload

lalan7 commented Jun 16, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 16, 2026

Uh oh!

coderabbitai Bot Jun 16, 2026

Uh oh!

coderabbitai Bot Jun 16, 2026

Uh oh!

coderabbitai Bot Jun 16, 2026

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lalan7 commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Why this approach

Scope

Test plan

Ref

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lalan7 commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading