Skip to content

feat: in-process LLM credential hot-reload#2955

Open
lalan7 wants to merge 1 commit into
openshift:mainfrom
lalan7:feat/credential-hot-reload
Open

feat: in-process LLM credential hot-reload#2955
lalan7 wants to merge 1 commit into
openshift:mainfrom
lalan7:feat/credential-hot-reload

Conversation

@lalan7

@lalan7 lalan7 commented Jun 16, 2026

Copy link
Copy Markdown

Summary

When credentialsSecretRef is updated by a CronJob or external rotation, the lightspeed-operator triggers a rolling restart of lightspeed-app-server. For customers using short-lived LLM tokens (1-hour validity), this causes hourly pod restarts and temporary capacity loss — including reloading RAG vector indexes and embedding models.

This PR makes ProviderConfig re-read the credential file on every LLM request via a new get_credentials() method, so rotated secrets are picked up without a pod restart. This follows the Prometheus credentials_file pattern: re-read on every request, no caching, no edge cases with K8s symlink swaps.

Changes

  • ols/utils/checks.py — New read_secret_from_path() helper for repeated file reads
  • ols/app/models/config.py — Store _credentials_path in ProviderConfig.__init__(), add get_credentials() method that re-reads the file when a path is configured
  • All LLM providers (openai, azure_openai, watsonx, rhoai_vllm, rhelai_vllm, google_vertex) — Use get_credentials() instead of .credentials
  • Unit tests — Tests for the helper, ProviderConfig.get_credentials() (cached, re-read, fallback, directory), and OpenAI provider credential rotation
  • docs/credential-hot-reload.md — Design rationale and test plan

Why this approach

Concern Answer
K8s symlink safety kubelet uses atomic symlink swaps (..data); os.stat() mtime can be unreliable after swaps. Re-reading avoids the edge case.
Precedent Prometheus credentials_file, Envoy, controller-runtime CertWatcher, kube-proxy all re-read files.
Dependencies None — no watchdog, fsnotify, or background threads.
Performance One open()+read() of a <100 byte file per LLM request is negligible vs LLM call latency (seconds).
Thread safety Each request reads independently; no shared mutable state.

Scope

Hot-reload applies to top-level credentials_path, which is the path used by operator-managed deployments (credentialsSecretRef). Provider-specific config blocks (openai_config.credentials_path, azure_openai_config.credentials_path, etc.) continue to cache credentials at startup — this matches the existing operator behavior, which does not generate these blocks.

Azure AD credentials (tenant_id, client_id, client_secret) are also read once at startup. These can be addressed in a follow-up if needed.

Test plan

  • read_secret_from_path() — file, directory, nonexistent, rotation
  • ProviderConfig.get_credentials() — cached fallback, disk re-read, read failure fallback, directory path
  • OpenAI provider — verifies rotated credential is picked up across two load_llm() calls
  • Full suite: 1067 tests pass, ruff check clean, mypy clean

Ref

When credentialsSecretRef is updated by a CronJob or external rotation,
the lightspeed-operator currently triggers a rolling restart.  For
short-lived tokens (1-hour rotation) this causes hourly capacity loss.

This change makes ProviderConfig re-read the credential file on every
LLM request via a new get_credentials() method, following the same
Prometheus-style pattern (credentials_file re-read on every scrape).
This is safe with Kubernetes atomic symlink swaps, requires no new
dependencies, and adds negligible overhead vs the LLM call latency.

Changes:
- ols/utils/checks.py: add read_secret_from_path() helper
- ols/app/models/config.py: store _credentials_path, add get_credentials()
- All LLM providers: use get_credentials() instead of .credentials
- Unit tests for helper, ProviderConfig, and OpenAI provider rotation
- docs/credential-hot-reload.md: design rationale and test plan

Ref: https://issues.redhat.com/browse/RFE-9380
Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Adds in-process LLM credential hot-reload (RFE-9380). A new read_secret_from_path utility reads a credential file on each call. ProviderConfig gains a _credentials_path attribute and a get_credentials() method that delegates to this utility. All six LLM providers are updated to call get_credentials() instead of reading the static credentials field directly. Tests and documentation are included.

Changes

Credential Hot-Reload

Layer / File(s) Summary
read_secret_from_path helper and tests
ols/utils/checks.py, tests/unit/utils/test_checks.py
read_secret_from_path(path, default_filename) is added to resolve a path as a file or directory, read and strip contents, and return None with a warning on OSError. Tests cover file, directory, missing, and rotation cases.
ProviderConfig._credentials_path and get_credentials()
ols/app/models/config.py, tests/unit/app/models/test_config.py
ProviderConfig gains _credentials_path stored at init and a get_credentials() method that calls read_secret_from_path per invocation, falling back to cached credentials when the path is absent or unreadable. Four unit tests validate all branches.
Provider callsite updates and rotation test
ols/src/llms/providers/openai.py, azure_openai.py, google_vertex.py, rhoai_vllm.py, rhelai_vllm.py, watsonx.py, tests/unit/llms/providers/test_openai.py
All six providers switch from provider_config.credentials to provider_config.get_credentials(). GoogleVertex variants add a ValueError when the result is None. An end-to-end pytest confirms openai_api_key updates when the backing file is rotated.
Hot-reload design documentation
docs/credential-hot-reload.md
New doc covers the rolling-restart problem, per-request re-read rationale, changed components, Podman and OpenShift validation steps, a verification checklist, and operator PR coordination notes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly and specifically summarizes the main change: implementing in-process LLM credential hot-reload, which is the primary objective of the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from raptorsun and tisnik June 16, 2026 19:15
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tisnik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/app/models/test_config.py (1)

4357-4443: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add type hints to the new test signatures.

The new tests in this block omit parameter/return annotations, which conflicts with strict typing
requirements.

Suggested fix
+from pathlib import Path
@@
-def test_provider_config_get_credentials_returns_cached_when_no_path():
+def test_provider_config_get_credentials_returns_cached_when_no_path() -> None:
@@
-def test_provider_config_get_credentials_rereads_from_disk(tmp_path):
+def test_provider_config_get_credentials_rereads_from_disk(tmp_path: Path) -> None:
@@
-def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path):
+def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path: Path) -> None:
@@
-def test_provider_config_get_credentials_with_directory(tmp_path):
+def test_provider_config_get_credentials_with_directory(tmp_path: Path) -> None:

As per coding guidelines, "All function signatures must include type hints".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/app/models/test_config.py` around lines 4357 - 4443, Add type
hints to all four test function signatures in this block. For
test_provider_config_get_credentials_returns_cached_when_no_path, add return
type hint -> None. For test_provider_config_get_credentials_rereads_from_disk,
test_provider_config_get_credentials_falls_back_on_read_failure, and
test_provider_config_get_credentials_with_directory, add the parameter type hint
for tmp_path as pathlib.Path and return type hint as -> None to each function
signature.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/credential-hot-reload.md`:
- Around line 52-57: The fenced code block showing the Request flow diagram is
missing a language identifier, which triggers markdownlint rule MD040. Add the
language identifier "text" to the opening fence of the code block (the three
backticks before the Request line) so it reads ```text instead of just ```. This
will satisfy the markdown linting requirement while properly labeling the
content type.

In `@ols/app/models/config.py`:
- Around line 612-618: The get_credentials() method in ols/app/models/config.py
returns the fresh credential value without persisting it to the cached
self.credentials instance variable. When fresh credentials are successfully read
from the path (when fresh is not None), update self.credentials with this fresh
value before returning it. This ensures that subsequent calls to
get_credentials() will return the last known-good rotated key rather than
falling back to a stale startup value if a later read fails.

In `@ols/src/llms/providers/google_vertex.py`:
- Around line 35-38: Replace the `ValueError` exception with
`InvalidConfigurationError` in the credential validation logic of the Google
Vertex provider. Import `InvalidConfigurationError` from `ols.utils.checks` at
the top of the file if not already present. Then update the exception raised
when `creds_value is None` (at the location around lines 35-38) and apply the
same change to the second occurrence mentioned at lines 74-79 where similar
credential validation occurs, ensuring both locations use the domain-specific
exception type for consistent configuration error handling across the provider.

In `@tests/unit/llms/providers/test_openai.py`:
- Around line 275-302: The function `test_openai_picks_up_rotated_credentials`
is missing type hints on its parameters and return type, violating the
repository's typed-signature requirement. Add type annotations to both the
`tmp_path` parameter (use the appropriate Path type from pathlib) and the
`fake_certifi_store` parameter (determine the correct fixture type), and add a
return type hint of None for the test function.

---

Outside diff comments:
In `@tests/unit/app/models/test_config.py`:
- Around line 4357-4443: Add type hints to all four test function signatures in
this block. For
test_provider_config_get_credentials_returns_cached_when_no_path, add return
type hint -> None. For test_provider_config_get_credentials_rereads_from_disk,
test_provider_config_get_credentials_falls_back_on_read_failure, and
test_provider_config_get_credentials_with_directory, add the parameter type hint
for tmp_path as pathlib.Path and return type hint as -> None to each function
signature.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 64086aa0-5d90-4440-9524-4d7a4b23a9e5

📥 Commits

Reviewing files that changed from the base of the PR and between a8aa7a8 and bc5ee67.

📒 Files selected for processing (12)
  • docs/credential-hot-reload.md
  • ols/app/models/config.py
  • ols/src/llms/providers/azure_openai.py
  • ols/src/llms/providers/google_vertex.py
  • ols/src/llms/providers/openai.py
  • ols/src/llms/providers/rhelai_vllm.py
  • ols/src/llms/providers/rhoai_vllm.py
  • ols/src/llms/providers/watsonx.py
  • ols/utils/checks.py
  • tests/unit/app/models/test_config.py
  • tests/unit/llms/providers/test_openai.py
  • tests/unit/utils/test_checks.py

Comment on lines +52 to +57
```
Request → load_llm() → LLMProvider.__init__()
→ provider_config.get_credentials()
→ open(credentials_path) + read()
→ return fresh value
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language identifier to the fenced code block.

This block is missing a language tag and triggers markdownlint MD040.

Suggested fix
-```
+```text
 Request → load_llm() → LLMProvider.__init__()
                           → provider_config.get_credentials()
                               → open(credentials_path) + read()
                               → return fresh value
-```
+```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
Request → load_llm() → LLMProvider.__init__()
→ provider_config.get_credentials()
→ open(credentials_path) + read()
→ return fresh value
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 52-52: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/credential-hot-reload.md` around lines 52 - 57, The fenced code block
showing the Request flow diagram is missing a language identifier, which
triggers markdownlint rule MD040. Add the language identifier "text" to the
opening fence of the code block (the three backticks before the Request line) so
it reads ```text instead of just ```. This will satisfy the markdown linting
requirement while properly labeling the content type.

Source: Linters/SAST tools

Comment thread ols/app/models/config.py
Comment on lines +612 to +618
if self._credentials_path is not None:
fresh = checks.read_secret_from_path(
self._credentials_path, constants.API_TOKEN_FILENAME
)
if fresh is not None:
return fresh
return self.credentials

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist fresh credentials before returning them.

On Line 617, get_credentials() returns the fresh value but does not update the cached
self.credentials. If a later read fails, fallback returns a stale startup key rather than the
last known-good rotated key.

Suggested fix
         if self._credentials_path is not None:
             fresh = checks.read_secret_from_path(
                 self._credentials_path, constants.API_TOKEN_FILENAME
             )
             if fresh is not None:
-                return fresh
+                self.credentials = fresh
+                return fresh
         return self.credentials
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if self._credentials_path is not None:
fresh = checks.read_secret_from_path(
self._credentials_path, constants.API_TOKEN_FILENAME
)
if fresh is not None:
return fresh
return self.credentials
if self._credentials_path is not None:
fresh = checks.read_secret_from_path(
self._credentials_path, constants.API_TOKEN_FILENAME
)
if fresh is not None:
self.credentials = fresh
return fresh
return self.credentials
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/app/models/config.py` around lines 612 - 618, The get_credentials()
method in ols/app/models/config.py returns the fresh credential value without
persisting it to the cached self.credentials instance variable. When fresh
credentials are successfully read from the path (when fresh is not None), update
self.credentials with this fresh value before returning it. This ensures that
subsequent calls to get_credentials() will return the last known-good rotated
key rather than falling back to a stale startup value if a later read fails.

Comment on lines +35 to +38
creds_value = self.provider_config.get_credentials()
if creds_value is None:
raise ValueError("credentials are required for Google Vertex provider")
self.credentials = load_vertex_credentials(self.provider_config.credentials)
self.credentials = load_vertex_credentials(creds_value)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use InvalidConfigurationError instead of ValueError for missing credentials.

Both new guard clauses raise generic ValueError; per repository policy these configuration failures should use
the domain-specific exception type for consistent handling.

Proposed fix
+from ols.utils.checks import InvalidConfigurationError
@@
-        if creds_value is None:
-            raise ValueError("credentials are required for Google Vertex provider")
+        if creds_value is None:
+            raise InvalidConfigurationError(
+                "credentials are required for Google Vertex provider"
+            )
@@
-        if creds_value is None:
-            raise ValueError(
+        if creds_value is None:
+            raise InvalidConfigurationError(
                 "credentials are required for Google Vertex Anthropic provider"
             )

As per coding guidelines, "Use custom exceptions from their respective domain modules ... InvalidConfigurationError from ols.utils.checks".

Also applies to: 74-79

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/src/llms/providers/google_vertex.py` around lines 35 - 38, Replace the
`ValueError` exception with `InvalidConfigurationError` in the credential
validation logic of the Google Vertex provider. Import
`InvalidConfigurationError` from `ols.utils.checks` at the top of the file if
not already present. Then update the exception raised when `creds_value is None`
(at the location around lines 35-38) and apply the same change to the second
occurrence mentioned at lines 74-79 where similar credential validation occurs,
ensuring both locations use the domain-specific exception type for consistent
configuration error handling across the provider.

Source: Coding guidelines

Comment on lines +275 to +302
def test_openai_picks_up_rotated_credentials(tmp_path, fake_certifi_store):
"""Test that OpenAI provider re-reads credentials on each load."""
secret_file = tmp_path / "apitoken"
secret_file.write_text("initial-key")

config = ProviderConfig(
{
"name": "test_provider",
"type": "openai",
"url": "http://test.url",
"credentials_path": str(secret_file),
"models": [
{
"name": "test_model",
"url": "http://test.url/",
"credentials_path": str(secret_file),
}
],
}
)

openai_1 = OpenAI(model="test_model", provider_config=config)
assert openai_1.default_params["openai_api_key"] == "initial-key"

secret_file.write_text("rotated-key")

openai_2 = OpenAI(model="test_model", provider_config=config)
assert openai_2.default_params["openai_api_key"] == "rotated-key"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add type hints to the new test function signature.

The new test function omits parameter/return annotations, which conflicts with the repository’s typed-signature rule.

Proposed fix
+from pathlib import Path
@@
-def test_openai_picks_up_rotated_credentials(tmp_path, fake_certifi_store):
+def test_openai_picks_up_rotated_credentials(
+    tmp_path: Path, fake_certifi_store: str
+) -> None:

As per coding guidelines, "All function signatures must include type hints".

🧰 Tools
🪛 ast-grep (0.43.0)

[warning] 283-283: Do not make http calls without encryption
Context: "http://test.url"
Note: [CWE-319].

(requests-http)


[warning] 288-288: Do not make http calls without encryption
Context: "http://test.url/"
Note: [CWE-319].

(requests-http)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/llms/providers/test_openai.py` around lines 275 - 302, The
function `test_openai_picks_up_rotated_credentials` is missing type hints on its
parameters and return type, violating the repository's typed-signature
requirement. Add type annotations to both the `tmp_path` parameter (use the
appropriate Path type from pathlib) and the `fake_certifi_store` parameter
(determine the correct fixture type), and add a return type hint of None for the
test function.

Source: Coding guidelines

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown

@lalan7: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant