RFC 0005: Platform-managed Kubernetes sandboxes #1680
| ### Request shape | ||
|
|
||
| The exact API can be protobuf-native, driver-specific configuration, or a |
Does this mean that we would add support through the proposal in #1589? (Or at least partially).
Yes, I think this should partially align with #1589.
The Kubernetes-specific request fields in this RFC, such as requested namespace, allocation mode, warm-pool reference, RuntimeClass, node selector, tolerations, and possibly service account selection, seem like good candidates for the driver-owned config shape proposed in #1589.
Kubernetes placement fields can use driver_config, but credentials and authorization should be handled by OpenShell’s main credential and authorization systems, because they affect the whole control plane, not just the Kubernetes driver.
I can update this RFC to reference #1589 as the likely mechanism for the Kubernetes-specific configuration surface.
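As a rough illustration of that split, the sketch below separates Kubernetes placement fields, which could travel in driver_config, from credential fields, which would be routed through OpenShell's main credential and authorization systems. Every key name here (`namespace`, `allocation_mode`, `warm_pool_ref`, `runtime_class`, `node_selector`, `tolerations`, `credentials`, `secret_ref`) and the helper `split_driver_config` are hypothetical; neither this RFC nor #1589 fixes them. Service account selection is left out on purpose, since the thread leaves open where it belongs.

```python
# Hypothetical key names for illustration only; not defined by this RFC or #1589.
PLACEMENT_KEYS = {
    "namespace", "allocation_mode", "warm_pool_ref",
    "runtime_class", "node_selector", "tolerations",
}
# Keys that would go through OpenShell's credential/authorization systems instead.
CONTROL_PLANE_KEYS = {"credentials", "secret_ref"}


def split_driver_config(request: dict) -> tuple[dict, dict]:
    """Split a request into (driver_config, control_plane_fields).

    Unknown keys are rejected rather than silently passed to the driver.
    """
    driver_config, control_plane = {}, {}
    for key, value in request.items():
        if key in PLACEMENT_KEYS:
            driver_config[key] = value
        elif key in CONTROL_PLANE_KEYS:
            control_plane[key] = value
        else:
            raise ValueError(f"unsupported field: {key}")
    return driver_config, control_plane
```

Rejecting unknown keys is a design choice in this sketch, not something the RFC prescribes; it keeps the driver-owned surface explicit while the two proposals are being aligned.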
It sounds like a lot of the k8s-specific implementation fields, like the agent-sandbox configuration, could use driver_config. It would be nice if the two proposals were aligned 👍
+1, we should use the driver_config detailed in #1589.
| its configured default namespace. When the request namespace is omitted, the | ||
| driver keeps existing behavior and provisions into the configured namespace. | ||
| When the request namespace is present, the driver uses it as the namespace for |
This means it is still up to the user to specify the namespace, correct? How is access control to the namespace handled?
Correct, this needs to be clearer.
The intent is not that an end user can choose an arbitrary namespace. In this model, the caller is a trusted Kubernetes platform controller. That controller resolves the tenant, selects the namespace, reconciles the namespace controls, and then calls OpenShell.
OpenShell still needs an authorization check on its side.
The Kubernetes platform controller should authenticate to OpenShell Gateway with a control-plane identity, for example a Kubernetes ServiceAccount/OIDC token, mTLS client identity, or another configured gateway identity.
OpenShell would then authorize that caller identity against allowed operations and resource scopes. For example, the controller identity may be allowed to request only specific namespace patterns, approved Secret namespaces, approved service accounts, approved allocation modes, and approved Kubernetes driver config.
The namespace override is therefore trusted only when it is requested by an authenticated and authorized platform-controller identity.
I can update the RFC to make this explicit and describe namespace override as a trusted control-plane field, not a general user-supplied field.
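A minimal sketch of the namespace part of that check follows, assuming the gateway holds a per-identity allow-list of namespace patterns. `CallerPolicy`, `resolve_namespace`, and the glob-style pattern syntax are assumptions made for illustration; the RFC does not define them. The sketch covers only the override path: when the request namespace is omitted it returns the configured default, matching the existing behavior described in the RFC text.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class CallerPolicy:
    """Hypothetical allow-list attached to an authenticated caller identity."""
    namespace_patterns: tuple = ()


def resolve_namespace(policies: dict, caller: str,
                      requested: Optional[str], default_namespace: str) -> str:
    """Return the namespace to provision into, or raise PermissionError."""
    if requested is None:
        # No override requested: keep the driver's configured namespace.
        return default_namespace
    policy = policies.get(caller)
    if policy is None:
        raise PermissionError(f"{caller} may not override the namespace")
    # Glob-style matching is an assumption; the RFC only says "namespace patterns".
    if not any(fnmatchcase(requested, p) for p in policy.namespace_patterns):
        raise PermissionError(f"{caller} may not target namespace {requested}")
    return requested
```

The same shape could extend to Secret namespaces, service accounts, and allocation modes, each as its own allow-list on the caller policy.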
derekwaynecarr
left a comment
I think we need to finish authorization semantics on the OpenShell control plane itself before proceeding too far on this. In particular, this design would need to describe how the OpenShell control plane should be secured to support this operational pattern.
| This RFC proposes a platform-managed Kubernetes sandbox provisioning model for | ||
| OpenShell. In this model, a Kubernetes platform owns tenant onboarding, | ||
| namespace creation, quotas, network policy, secret synchronization, policy | ||
| compilation and scheduling. OpenShell remains the sandbox execution plane: the |
I think we will need to support two types of usage scenarios:
- users of OpenShell that have no underlying knowledge of the control surface that runs their sandboxes (sandbox as a service)
- users of existing platforms that want to delegate to OpenShell as their sandbox execution plane on their existing platform (it's more of an operator in an existing platform)
I think we need to finish up authorization of the core OpenShell control plane, and firm up its associated data model a bit more, and then map how usage pattern (2) would fit into that authorization model before opening up too many prescriptive knobs. Is the identity/authz model of OpenShell and the k8s control plane common in this proposal, or different? How would we control who can connect to which sandboxes, etc.?
Agreed. This RFC is targeting the second usage model: an existing Kubernetes platform wants to delegate sandbox execution to OpenShell.
It is not trying to define the direct sandbox-as-a-service model where end users call OpenShell without knowing the underlying control plane.
In the current proposal, the Kubernetes platform identity/authz model and the OpenShell identity/authz model are different but connected.
The Kubernetes platform remains responsible for tenant onboarding, namespace creation, namespace RBAC, quota, NetworkPolicy, ESO Secret creation, and deciding which tenant/workload is allowed to request a sandbox.
OpenShell still needs its own control-plane identity and authorization model. The platform controller would authenticate to OpenShell as a trusted control-plane caller, for example using a Kubernetes ServiceAccount/OIDC token, mTLS identity, or another configured gateway identity. OpenShell would then authorize that caller to request only specific namespaces or namespace patterns, Secret refs, service accounts, allocation modes, and driver config.
For sandbox access, OpenShell should not rely only on Kubernetes namespace RBAC. OpenShell should have an ownership/access model for sandbox objects. A sandbox should carry owner/tenant/project metadata, and OpenShell should use that metadata to decide who can connect to the sandbox, stream logs, exec, delete it, or attach credentials. In the platform-delegated model, the platform controller may be the only direct OpenShell caller, and end-user access is mediated through the platform. In the sandbox-as-a-service model, OpenShell would need to authorize end users directly.
So the two authz systems are not the same, but they need to be mapped:
Kubernetes authz protects Kubernetes resources and namespaces.
OpenShell authz protects OpenShell API operations, sandbox ownership, credential use, and sandbox connection rights.
I can update the RFC to describe this explicitly and frame this RFC as usage pattern (2): an existing Kubernetes platform delegates sandbox execution to OpenShell.
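To make the OpenShell side of that mapping concrete, here is a deliberately small sketch of an ownership check on sandbox objects. The `Sandbox` fields, the operation names, and the owner-within-tenant rule are all assumptions for illustration; the actual ownership model is still an open question in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sandbox:
    """Hypothetical sandbox record carrying owner/tenant metadata."""
    name: str
    owner: str
    tenant: str


# Operation names are illustrative, taken from the list in the comment above.
OPERATIONS = {"connect", "logs", "exec", "delete", "attach_credentials"}


def can_operate(sandbox: Sandbox, caller: str, caller_tenant: str, operation: str) -> bool:
    """Allow an operation only for the sandbox owner inside the same tenant."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    return sandbox.tenant == caller_tenant and sandbox.owner == caller
```

In the platform-delegated model the owner would typically be the platform controller identity, with end-user access mediated by the platform; in the sandbox-as-a-service model the owner would be an end user that OpenShell authorizes directly.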
| Kubernetes Secret-backed credentials, placement metadata, and optional Agent | ||
| Sandbox allocation settings. | ||
|
|
||
| The proposal adds first-class support for: |
I worry about the security implications for this until we can connect it to a proposed authorization model for OpenShell in a bit more detail. Being able to target a specific namespace, reference secrets, and presumably control what service account the pod is running under needs scrutiny.
I agree that we will want to support fast sandbox creation (so claim/warm-pool is useful), but we will also want to support checkpoint/restore and/or scale-to-zero semantics. I think we should figure out the proxy -> sandbox split first and then see where we stand after that point.
Agreed. The RFC should not move forward as an accepted API shape until the OpenShell control-plane authorization model and proxy/sandbox split are clearer.
I can move this back to draft and revise it to focus on the requirements and security boundaries: caller identity, namespace authorization, Secret reference authorization, service account authorization, driver config authorization, sandbox ownership, and who can connect to or operate a sandbox.
|
|
||
| Today, the required integration hooks are not all available: | ||
|
|
||
| - The Kubernetes driver is configured with one sandbox namespace and creates |
I agree that we will want to run sandboxes in separate namespaces, and definitely keep them out of the namespace that may also be running the OpenShell control plane itself. One thing I have been waiting to see is whether we introduce a domain object above Sandbox in the OpenShell domain model.
That makes sense.
The requirement I am trying to capture is that platform-managed sandboxes need namespace separation and platform-owned metadata, but I agree the RFC should not assume all of that belongs directly on SandboxSpec.
A higher-level OpenShell domain object above Sandbox may be the right place to represent tenant/session/workload intent, with the Kubernetes driver mapping that intent to Sandbox, SandboxClaim, or warm-pool-backed allocation.
I can update the RFC to describe the requirement and leave the exact domain object/API shape open.
| audit metadata and cleanup lifecycle. | ||
| - The sandbox create path does not expose a trusted platform-selected target | ||
| namespace. | ||
| - OpenShell provider credentials are stored in OpenShell provider records; |
I had imagined we would explore this via a Credential proto plugin design.
Agreed. Kubernetes Secrets should probably be treated as one credential source backend, not as a Kubernetes-driver-only feature.
The RFC can be updated to align with a broader Credential proto/plugin model. In that model, OpenShell defines a common credential-source interface, and Kubernetes Secret is one implementation of that interface. The gateway resolves the approved credential source, while the existing provider/supervisor flow remains responsible for controlled credential injection into sandbox traffic.
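A minimal sketch of what such a common credential-source interface could look like, with Kubernetes Secret as one backend among others. The names `CredentialSource`, `ProviderRecordSource`, `KubernetesSecretSource`, the `namespace/name/key` reference format, and the injected `read_secret` callable are all hypothetical; no Credential proto exists in this thread yet, and a real backend would read the Secret through the Kubernetes API rather than a stub.

```python
from typing import Callable, Protocol


class CredentialSource(Protocol):
    """Hypothetical common interface the gateway would resolve against."""
    def resolve(self, ref: str) -> bytes: ...


class ProviderRecordSource:
    """Stand-in for credentials stored in OpenShell provider records."""
    def __init__(self, records: dict):
        self._records = records

    def resolve(self, ref: str) -> bytes:
        return self._records[ref]


class KubernetesSecretSource:
    """Stub backend: resolves 'namespace/name/key' references to Secret data."""
    def __init__(self, read_secret: Callable[[str, str, str], bytes],
                 allowed_namespaces: set):
        self._read_secret = read_secret  # injected reader, assumed for the sketch
        self._allowed = allowed_namespaces

    def resolve(self, ref: str) -> bytes:
        namespace, name, key = ref.split("/")
        if namespace not in self._allowed:
            raise PermissionError(f"Secret namespace not approved: {namespace}")
        return self._read_secret(namespace, name, key)
```

Both backends satisfy the same `resolve` contract, so the existing provider/supervisor flow could stay unaware of where the credential came from.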
| - OpenShell provider credentials are stored in OpenShell provider records; | ||
| provider/provider-v2 attachment does not accept Kubernetes `Secret` | ||
| references as the credential source. | ||
| - The Kubernetes driver does not expose `SandboxClaim` and `SandboxWarmPool` as |
We should determine whether OpenShell itself should write SandboxWarmPool, so that a user of the OpenShell control plane could define the behavior they desire for particular sandbox templates.
Good point. The RFC currently assumes OpenShell can select an existing SandboxWarmPool, but it does not clearly answer who owns creation and reconciliation of warm pools.
There seem to be two possible models: the platform pre-creates warm pools and OpenShell references them, or OpenShell owns warm-pool creation/reconciliation based on sandbox templates and desired profiles.
I can move this into open questions and avoid prescribing the ownership model until we decide how warm-pool lifecycle should fit into OpenShell.
| - The Kubernetes driver does not expose `SandboxClaim` and `SandboxWarmPool` as | ||
| the platform allocation path for warm pools or template-backed placement. | ||
|
|
||
| Without these hooks, a Kubernetes platform would need workarounds such as one |
I agree we need to solve all these issues, but I am not sure if this particular workaround is right until we get authorization on the OpenShell control plane completed. Maybe we can keep this RFC in a draft state until that is in place? In particular, it would be good to explore which gaps in the OpenShell authorization surface need to be closed to make this pattern safe. /cc @mrunalp
That sounds reasonable.
The goals still seem valid, but I agree the RFC should not move toward acceptance until the OpenShell control-plane authorization model is clear enough to make namespace selection, Secret references, service account selection, and sandbox access safe.
I can move the RFC back to draft and revise it to focus on requirements, authorization gaps, and integration points rather than presenting the API shape as ready for acceptance.
|
I agree that we will eventually need to support a single OpenShell gateway running Sandboxes in multiple kube namespaces that are preconfigured by the platform. We will need an authz system before we can do that, though. Authz systems are hard to get right, and kube already has an RBAC system. I'm trying to think about how we can leverage it when we are running on kube, or whether we just need to roll our own for a uniform UX across OpenShell deployment environments. There are many needs for a sandbox-as-a-service use case that do not apply in single-player use cases and podman/docker on a local workstation, where the authn/authz is very basic. The authz and sandbox namespacing is a big enough topic on its own. Storing creds in Secret and using other resources from the Sandbox API could each be their own RFC. I'm not saying to create those at this time as, IMHO, they would be too forward-looking to be actionable right now.
Signed-off-by: Rohan Kumar <rohank@nvidia.com>
| - https://github.com/NVIDIA/OpenShell/pull/1680 | ||
| --- | ||
|
|
||
| # RFC 0005 - Platform-Managed Kubernetes Sandboxes |
If I had to summarize this RFC, would it be correct to say you are asking for:
- Sandbox configuration hooks so that Sandboxes can be launched with specific Kubernetes properties (e.g. namespaces, warm pools, etc.).
- A Kubernetes Secret backend for Providers.
- An event stream that publishes OpenShell events (e.g. sandbox created, sandbox deleted, etc.).
Am I missing anything?
Summary
Adds RFC 0005 for platform-managed Kubernetes sandbox provisioning.
This RFC proposes support for a trusted Kubernetes platform control plane to call OpenShell Gateway with a platform-selected namespace, supplied sandbox policy, approved Kubernetes Secret-backed provider credentials, runtime placement metadata, and optional Agent Sandbox allocation through SandboxClaim and SandboxWarmPool.
Related issue: #1678
Notes