feat: in-process LLM credential hot-reload#2955
Conversation
When credentialsSecretRef is updated by a CronJob or external rotation, the lightspeed-operator currently triggers a rolling restart. For short-lived tokens (1-hour rotation) this causes hourly capacity loss. This change makes ProviderConfig re-read the credential file on every LLM request via a new get_credentials() method, following the same Prometheus-style pattern (credentials_file re-read on every scrape). This is safe with Kubernetes atomic symlink swaps, requires no new dependencies, and adds negligible overhead vs the LLM call latency. Changes: - ols/utils/checks.py: add read_secret_from_path() helper - ols/app/models/config.py: store _credentials_path, add get_credentials() - All LLM providers: use get_credentials() instead of .credentials - Unit tests for helper, ProviderConfig, and OpenAI provider rotation - docs/credential-hot-reload.md: design rationale and test plan Ref: https://issues.redhat.com/browse/RFE-9380 Co-authored-by: Cursor <cursoragent@cursor.com>
📝 WalkthroughWalkthroughAdds in-process LLM credential hot-reload (RFE-9380). A new ChangesCredential Hot-Reload
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/app/models/test_config.py (1)
4357-4443:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd type hints to the new test signatures.
The new tests in this block omit parameter/return annotations, which conflicts with strict typing
requirements.Suggested fix
+from pathlib import Path @@ -def test_provider_config_get_credentials_returns_cached_when_no_path(): +def test_provider_config_get_credentials_returns_cached_when_no_path() -> None: @@ -def test_provider_config_get_credentials_rereads_from_disk(tmp_path): +def test_provider_config_get_credentials_rereads_from_disk(tmp_path: Path) -> None: @@ -def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path): +def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path: Path) -> None: @@ -def test_provider_config_get_credentials_with_directory(tmp_path): +def test_provider_config_get_credentials_with_directory(tmp_path: Path) -> None:As per coding guidelines, "All function signatures must include type hints".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/app/models/test_config.py` around lines 4357 - 4443, Add type hints to all four test function signatures in this block. For test_provider_config_get_credentials_returns_cached_when_no_path, add return type hint -> None. For test_provider_config_get_credentials_rereads_from_disk, test_provider_config_get_credentials_falls_back_on_read_failure, and test_provider_config_get_credentials_with_directory, add the parameter type hint for tmp_path as pathlib.Path and return type hint as -> None to each function signature.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/credential-hot-reload.md`:
- Around line 52-57: The fenced code block showing the Request flow diagram is
missing a language identifier, which triggers markdownlint rule MD040. Add the
language identifier "text" to the opening fence of the code block (the three
backticks before the Request line) so it reads ```text instead of just ```. This
will satisfy the markdown linting requirement while properly labeling the
content type.
In `@ols/app/models/config.py`:
- Around line 612-618: The get_credentials() method in ols/app/models/config.py
returns the fresh credential value without persisting it to the cached
self.credentials instance variable. When fresh credentials are successfully read
from the path (when fresh is not None), update self.credentials with this fresh
value before returning it. This ensures that subsequent calls to
get_credentials() will return the last known-good rotated key rather than
falling back to a stale startup value if a later read fails.
In `@ols/src/llms/providers/google_vertex.py`:
- Around line 35-38: Replace the `ValueError` exception with
`InvalidConfigurationError` in the credential validation logic of the Google
Vertex provider. Import `InvalidConfigurationError` from `ols.utils.checks` at
the top of the file if not already present. Then update the exception raised
when `creds_value is None` (at the location around lines 35-38) and apply the
same change to the second occurrence mentioned at lines 74-79 where similar
credential validation occurs, ensuring both locations use the domain-specific
exception type for consistent configuration error handling across the provider.
In `@tests/unit/llms/providers/test_openai.py`:
- Around line 275-302: The function `test_openai_picks_up_rotated_credentials`
is missing type hints on its parameters and return type, violating the
repository's typed-signature requirement. Add type annotations to both the
`tmp_path` parameter (use the appropriate Path type from pathlib) and the
`fake_certifi_store` parameter (determine the correct fixture type), and add a
return type hint of None for the test function.
---
Outside diff comments:
In `@tests/unit/app/models/test_config.py`:
- Around line 4357-4443: Add type hints to all four test function signatures in
this block. For
test_provider_config_get_credentials_returns_cached_when_no_path, add return
type hint -> None. For test_provider_config_get_credentials_rereads_from_disk,
test_provider_config_get_credentials_falls_back_on_read_failure, and
test_provider_config_get_credentials_with_directory, add the parameter type hint
for tmp_path as pathlib.Path and return type hint as -> None to each function
signature.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 64086aa0-5d90-4440-9524-4d7a4b23a9e5
📒 Files selected for processing (12)
docs/credential-hot-reload.mdols/app/models/config.pyols/src/llms/providers/azure_openai.pyols/src/llms/providers/google_vertex.pyols/src/llms/providers/openai.pyols/src/llms/providers/rhelai_vllm.pyols/src/llms/providers/rhoai_vllm.pyols/src/llms/providers/watsonx.pyols/utils/checks.pytests/unit/app/models/test_config.pytests/unit/llms/providers/test_openai.pytests/unit/utils/test_checks.py
| ``` | ||
| Request → load_llm() → LLMProvider.__init__() | ||
| → provider_config.get_credentials() | ||
| → open(credentials_path) + read() | ||
| → return fresh value | ||
| ``` |
There was a problem hiding this comment.
Add a language identifier to the fenced code block.
This block is missing a language tag and triggers markdownlint MD040.
Suggested fix
-```
+```text
Request → load_llm() → LLMProvider.__init__()
→ provider_config.get_credentials()
→ open(credentials_path) + read()
→ return fresh value
-```
+```📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ``` | |
| Request → load_llm() → LLMProvider.__init__() | |
| → provider_config.get_credentials() | |
| → open(credentials_path) + read() | |
| → return fresh value | |
| ``` |
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 52-52: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/credential-hot-reload.md` around lines 52 - 57, The fenced code block
showing the Request flow diagram is missing a language identifier, which
triggers markdownlint rule MD040. Add the language identifier "text" to the
opening fence of the code block (the three backticks before the Request line) so
it reads ```text instead of just ```. This will satisfy the markdown linting
requirement while properly labeling the content type.
Source: Linters/SAST tools
| if self._credentials_path is not None: | ||
| fresh = checks.read_secret_from_path( | ||
| self._credentials_path, constants.API_TOKEN_FILENAME | ||
| ) | ||
| if fresh is not None: | ||
| return fresh | ||
| return self.credentials |
There was a problem hiding this comment.
Persist fresh credentials before returning them.
On Line 617, get_credentials() returns the fresh value but does not update the cached
self.credentials. If a later read fails, fallback returns a stale startup key rather than the
last known-good rotated key.
Suggested fix
if self._credentials_path is not None:
fresh = checks.read_secret_from_path(
self._credentials_path, constants.API_TOKEN_FILENAME
)
if fresh is not None:
- return fresh
+ self.credentials = fresh
+ return fresh
return self.credentials📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if self._credentials_path is not None: | |
| fresh = checks.read_secret_from_path( | |
| self._credentials_path, constants.API_TOKEN_FILENAME | |
| ) | |
| if fresh is not None: | |
| return fresh | |
| return self.credentials | |
| if self._credentials_path is not None: | |
| fresh = checks.read_secret_from_path( | |
| self._credentials_path, constants.API_TOKEN_FILENAME | |
| ) | |
| if fresh is not None: | |
| self.credentials = fresh | |
| return fresh | |
| return self.credentials |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/app/models/config.py` around lines 612 - 618, The get_credentials()
method in ols/app/models/config.py returns the fresh credential value without
persisting it to the cached self.credentials instance variable. When fresh
credentials are successfully read from the path (when fresh is not None), update
self.credentials with this fresh value before returning it. This ensures that
subsequent calls to get_credentials() will return the last known-good rotated
key rather than falling back to a stale startup value if a later read fails.
| creds_value = self.provider_config.get_credentials() | ||
| if creds_value is None: | ||
| raise ValueError("credentials are required for Google Vertex provider") | ||
| self.credentials = load_vertex_credentials(self.provider_config.credentials) | ||
| self.credentials = load_vertex_credentials(creds_value) |
There was a problem hiding this comment.
Use InvalidConfigurationError instead of ValueError for missing credentials.
Both new guard clauses raise generic ValueError; per repository policy these configuration failures should use
the domain-specific exception type for consistent handling.
Proposed fix
+from ols.utils.checks import InvalidConfigurationError
@@
- if creds_value is None:
- raise ValueError("credentials are required for Google Vertex provider")
+ if creds_value is None:
+ raise InvalidConfigurationError(
+ "credentials are required for Google Vertex provider"
+ )
@@
- if creds_value is None:
- raise ValueError(
+ if creds_value is None:
+ raise InvalidConfigurationError(
"credentials are required for Google Vertex Anthropic provider"
)As per coding guidelines, "Use custom exceptions from their respective domain modules ... InvalidConfigurationError from ols.utils.checks".
Also applies to: 74-79
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/src/llms/providers/google_vertex.py` around lines 35 - 38, Replace the
`ValueError` exception with `InvalidConfigurationError` in the credential
validation logic of the Google Vertex provider. Import
`InvalidConfigurationError` from `ols.utils.checks` at the top of the file if
not already present. Then update the exception raised when `creds_value is None`
(at the location around lines 35-38) and apply the same change to the second
occurrence mentioned at lines 74-79 where similar credential validation occurs,
ensuring both locations use the domain-specific exception type for consistent
configuration error handling across the provider.
Source: Coding guidelines
| def test_openai_picks_up_rotated_credentials(tmp_path, fake_certifi_store): | ||
| """Test that OpenAI provider re-reads credentials on each load.""" | ||
| secret_file = tmp_path / "apitoken" | ||
| secret_file.write_text("initial-key") | ||
|
|
||
| config = ProviderConfig( | ||
| { | ||
| "name": "test_provider", | ||
| "type": "openai", | ||
| "url": "http://test.url", | ||
| "credentials_path": str(secret_file), | ||
| "models": [ | ||
| { | ||
| "name": "test_model", | ||
| "url": "http://test.url/", | ||
| "credentials_path": str(secret_file), | ||
| } | ||
| ], | ||
| } | ||
| ) | ||
|
|
||
| openai_1 = OpenAI(model="test_model", provider_config=config) | ||
| assert openai_1.default_params["openai_api_key"] == "initial-key" | ||
|
|
||
| secret_file.write_text("rotated-key") | ||
|
|
||
| openai_2 = OpenAI(model="test_model", provider_config=config) | ||
| assert openai_2.default_params["openai_api_key"] == "rotated-key" |
There was a problem hiding this comment.
Add type hints to the new test function signature.
The new test function omits parameter/return annotations, which conflicts with the repository’s typed-signature rule.
Proposed fix
+from pathlib import Path
@@
-def test_openai_picks_up_rotated_credentials(tmp_path, fake_certifi_store):
+def test_openai_picks_up_rotated_credentials(
+ tmp_path: Path, fake_certifi_store: str
+) -> None:As per coding guidelines, "All function signatures must include type hints".
🧰 Tools
🪛 ast-grep (0.43.0)
[warning] 283-283: Do not make http calls without encryption
Context: "http://test.url"
Note: [CWE-319].
(requests-http)
[warning] 288-288: Do not make http calls without encryption
Context: "http://test.url/"
Note: [CWE-319].
(requests-http)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/llms/providers/test_openai.py` around lines 275 - 302, The
function `test_openai_picks_up_rotated_credentials` is missing type hints on its
parameters and return type, violating the repository's typed-signature
requirement. Add type annotations to both the `tmp_path` parameter (use the
appropriate Path type from pathlib) and the `fake_certifi_store` parameter
(determine the correct fixture type), and add a return type hint of None for the
test function.
Source: Coding guidelines
|
@lalan7: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
When
credentialsSecretRefis updated by a CronJob or external rotation, the lightspeed-operator triggers a rolling restart oflightspeed-app-server. For customers using short-lived LLM tokens (1-hour validity), this causes hourly pod restarts and temporary capacity loss — including reloading RAG vector indexes and embedding models.This PR makes
ProviderConfigre-read the credential file on every LLM request via a newget_credentials()method, so rotated secrets are picked up without a pod restart. This follows the Prometheuscredentials_filepattern: re-read on every request, no caching, no edge cases with K8s symlink swaps.Changes
ols/utils/checks.py— Newread_secret_from_path()helper for repeated file readsols/app/models/config.py— Store_credentials_pathinProviderConfig.__init__(), addget_credentials()method that re-reads the file when a path is configuredopenai,azure_openai,watsonx,rhoai_vllm,rhelai_vllm,google_vertex) — Useget_credentials()instead of.credentialsProviderConfig.get_credentials()(cached, re-read, fallback, directory), and OpenAI provider credential rotationdocs/credential-hot-reload.md— Design rationale and test planWhy this approach
kubeletuses atomic symlink swaps (..data);os.stat()mtime can be unreliable after swaps. Re-reading avoids the edge case.credentials_file, Envoy, controller-runtimeCertWatcher, kube-proxy all re-read files.open()+read()of a <100 byte file per LLM request is negligible vs LLM call latency (seconds).Scope
Hot-reload applies to top-level
credentials_path, which is the path used by operator-managed deployments (credentialsSecretRef). Provider-specific config blocks (openai_config.credentials_path,azure_openai_config.credentials_path, etc.) continue to cache credentials at startup — this matches the existing operator behavior, which does not generate these blocks.Azure AD credentials (
tenant_id,client_id,client_secret) are also read once at startup. These can be addressed in a follow-up if needed.Test plan
read_secret_from_path()— file, directory, nonexistent, rotationProviderConfig.get_credentials()— cached fallback, disk re-read, read failure fallback, directory pathload_llm()callsruff checkclean,mypycleanRef