diff --git a/rfcs/0006-pluggable-auth/0006-pluggable-auth.md b/rfcs/0006-pluggable-auth/0006-pluggable-auth.md
new file mode 100644
index 0000000..4f8da5a
--- /dev/null
+++ b/rfcs/0006-pluggable-auth/0006-pluggable-auth.md
@@ -0,0 +1,823 @@
+---
+start_date: 2026-05-27
+mlflow_issue: https://github.com/mlflow/mlflow/issues/21240
+rfc_pr:
+---
+
+
+
+| Author(s) | [Patrick Koss](https://github.com/PatrickKoss) |
+| :--------------------- | :------------------------------------------------- |
+| **Date Last Modified** | 2026-05-27 |
+
+
+
+# Summary: Pluggable Enterprise Authentication and Authorization
+
+MLflow's only built-in auth surface is a single configurable
+`authorization_function` (`mlflow/server/auth/config.py`). That one hook
+conflates two distinct concerns — *who are you* (authentication) and *what may
+you do* (authorization) — into one function that returns a
+`werkzeug.datastructures.Authorization` carrying nothing but a username. It is
+too thin for bearer tokens, OIDC claims, group membership, or just-in-time user
+provisioning, and the FastAPI request path silently refuses any non-default
+function at all (`mlflow/server/auth/__init__.py:4141`). Worse, the knowledge of
+*which permission a given route requires* is fused inside ~200 validator
+functions spread across six dispatch structures. Any external authorization
+system — Kubernetes `SubjectAccessReview`, OPA, a corporate policy engine — has
+to rediscover and duplicate that entire mapping, and re-sync it every time
+MLflow adds a route.
+
+This RFC proposes splitting auth into **two small plugin contracts** while
+keeping the expensive, churning knowledge in core:
+
+- **`AuthenticationProvider`** — turns an inbound request (headers, cookies,
+  token) into a rich `Identity`. Reference adapters: OAuth/OIDC bearer tokens,
+  Kubernetes `TokenReview`, and upstream proxy identity headers.
+- **`AuthorizationBackend`** — owns the allow/deny decision given an `Identity`
+  and a **normalized** `AuthorizationRequirement`. Reference adapters: the
+  default database-backed RBAC resolver, Kubernetes `SubjectAccessReview`, and
+  OPA.
+
+The load-bearing design rule: **core retains sole ownership of the
+route → requirement mapping**, expressed as a single authoritative
+`OPERATION_REGISTRY`. Plugins never see a route, a protobuf class, or a GraphQL
+field — only the tuple `(resource_type, resource_id, action, workspace)`. A CI
+guard fails the build if any route ships without declaring a requirement.
+
+This is the extension RFC that **RFC 0005** ("Role-Based Access Control for
+MLflow OSS") explicitly flagged as future work in its *"Extension point:
+resolver interface"* section. It builds on 0005's role model, its
+`get_role_permission_for_resource(...)` / `list_accessible_workspace_names(...)`
+resolver surface, and its `READ/USE/EDIT/MANAGE` permission levels. It does not
+change 0005's role storage. The default plugins reproduce today's behavior
+byte-for-byte.
+
+# Basic example
+
+The operator-facing change is a small config edit. The shape of the config
+mirrors the existing basic-auth INI (`mlflow/server/auth/basic_auth.ini`).
+
+**Default — identical to today's behavior.** An operator who upgrades and
+changes nothing gets exactly the current basic-auth + database RBAC:
+
+```ini
+[mlflow]
+default_permission = READ
+database_uri = sqlite:///auth.db
+admin_username = admin
+admin_password = password
+
+# ordered chain; first to authenticate or challenge wins
+authn_providers = basic-auth
+# the RFC 0005 role resolver, wrapped as a backend
+authz_backend = database
+```
+
+**Kubernetes deployment — OIDC tokens, decisions delegated to the API server:**
+
+```ini
+[mlflow]
+default_permission = NO_PERMISSIONS
+database_uri = sqlite:///auth.db
+admin_username = admin
+admin_password = password
+
+authn_providers = oidc, basic-auth
+authz_backend = k8s-sar
+
+[authn.oidc]
+issuer = https://idp.example.com
+audience = mlflow
+group_claim = groups
+provision = true
+
+[authz.k8s-sar]
+api_server = https://kubernetes.default.svc
+on_error = deny
+cache_ttl_seconds = 30
+```
+
+**The plugin author's whole job.** An authorization plugin implements one
+method, and it never learns MLflow's routing:
+
+```python
+class SarBackend:
+    name = "k8s-sar"
+
+    def authorize(self, query: AuthorizationQuery) -> Decision:
+        req = query.requirement  # ("experiment", "42", "update", "ml-research")
+        # ...POST a SubjectAccessReview, read status.allowed...
+        return Decision(allowed=..., reason=...)
+```
+
+The plugin sees `("experiment", "42", "update", "ml-research")`. It never sees
+`POST /api/2.0/mlflow/runs/log-metric`, never imports `LogMetric`, never learns
+that logging a metric requires `update` on the run's *parent experiment*. That
+derivation stays in core.
+
+## Motivation
+
+The legacy auth surface has three structural problems. They mirror the
+three-problem framing of RFC 0005, one layer up.
+
+**First, authentication and authorization are a single string.** The only hook
+is `auth_config.authorization_function`, a dotted path resolved through
+`importlib` (`mlflow/server/auth/__init__.py:2467`). The default,
+`authenticate_request_basic_auth` (`:2486`), both reads the `Authorization`
+header *and* decides the user is who they claim. There is no way to express the
+common enterprise shape "authenticate with OIDC, but delegate the *decision* to
+Kubernetes RBAC or an OPA policy." The two concerns are welded together.
+
+**Second, the identity is too thin.** `authenticate_request()` returns a
+`werkzeug Authorization`, whose only useful attribute is `.username`. Real
+enterprise deployments need to validate a JWT against an IdP's JWKS, carry the
+user's group membership for group→permission mapping, provision a user row on
+first login, and link an external identity to an existing local user. None of
+that fits through a username string. Today operators bolt an OAuth proxy
+(oauth2-proxy, Authelia, Pomerium) in front of MLflow; the proxy authenticates,
+but MLflow stays blind to who the user is and cannot use IdP groups in its own
+RBAC.
+
+**Third, and most damaging for maintainability, the route → requirement mapping
+is fused inside the validators and scattered across six structures.** Each entry
+in `BEFORE_REQUEST_HANDLERS` (`:2148`) maps a protobuf request class to a
+validator that *both* extracts the resource *and* decides
+(`validate_can_read_run` pulls the run id, resolves its parent experiment and
+workspace, queries grants, and returns a boolean). That knowledge is replicated
+across `BEFORE_REQUEST_VALIDATORS` (`:2280`), three parameterized regex dicts
+(`TRACE_*`, `LOGGED_MODEL_*`, `WEBHOOK_*`), the FastAPI dispatcher
+`_find_fastapi_validator` (`:4079`), and the hardcoded GraphQL `PROTECTED_FIELDS`
+set (`:3779`). An external authorization integration — for example Kubeflow's
+`mlflow-integration`, which the issue thread cites — cannot consume any of this.
+It must rediscover that "GetRun means read on the parent experiment," redo it for
+every one of ~200 operations, and re-sync on every MLflow release that adds a
+route. That duplication is the single hardest thing to maintain in a plugin
+approach, and it is exactly what this RFC is designed to prevent.
+
+The demand is concrete and named in the issue: Kubernetes `TokenReview` +
+`SubjectAccessReview`, OPA, OAuth/OIDC, Kerberos, and proxy-provided identity
+headers. RFC 0005 deferred the "pluggable authorization resolver" and flagged the
+small resolver interface as a known hook-in point for "a future extension RFC."
+This is that RFC.
+
+### Out of scope
+
+- **A full OIDC/SAML login experience.** Login pages, server-side sessions,
+  cookie lifecycle, single logout, and SDK token-acquisition flows are the
+  concern of a *specific* authentication provider plugin, not this framework
+  RFC. We define the `AuthenticationProvider` interface and show a bearer-token
+  OIDC adapter as an illustration; a full browser-SSO provider is a separate
+  plugin (and possibly its own RFC). This matches the maintainer's stated
+  preference: make MLflow pluggable, let the community maintain the plugins, with
+  some donated back.
+- **Changing RFC 0005's role storage.** The `roles` / `role_permissions` /
+  `user_role_assignments` tables and the role API are a prerequisite, not a
+  subject, of this RFC.
+- **Building an identity provider.** MLflow delegates authentication to external
+  systems.
+- **The group → MLflow-role mapping *policy*.** *Where* IdP groups get mapped to
+  MLflow permissions (in the provider, in a core mapping table, or entirely
+  inside the authz backend) is a real decision but is flagged as an open question
+  rather than settled here.
+
+## Detailed design
+
+### The three layers
+
+The design is three layers with a hard boundary between them. The boundary is
+the whole point: it is what keeps route knowledge out of plugins.
+
+| Layer | Owner | Sees | Produces |
+| :--- | :--- | :--- | :--- |
+| **Dispatch** (route → requirement) | **MLflow core**, never a plugin | Flask `Request`, Starlette `Request`, GraphQL `info`, protobuf class, gateway path/body | an `AuthorizationRequirement` |
+| **Authentication** | `AuthenticationProvider` plugin(s) | the raw request (headers, cookies, body) | an `Identity` (or skip / challenge) |
+| **Authorization** | `AuthorizationBackend` plugin (one) | `Identity` + `AuthorizationRequirement` + context | a `Decision` |
+
+```mermaid
+flowchart TD
+    req["Inbound request<br/>(Flask / FastAPI / GraphQL / gateway)"] --> dispatch
+    dispatch["CORE dispatch<br/>OPERATION_REGISTRY"] -->|"AuthRequest"| authn
+    authn["AuthenticationProvider chain<br/>(plugin)"] -->|"Identity"| chokepoint
+    dispatch -->|"AuthorizationRequirement<br/>(type, id, action, workspace)"| chokepoint
+    chokepoint["CORE chokepoint"] -->|"AuthorizationQuery"| authz
+    authz["AuthorizationBackend<br/>(plugin)"] -->|"Decision"| chokepoint
+    chokepoint -->|allow| handler["route handler"]
+    chokepoint -->|deny| forbidden["403 / 401 / redirect"]
+```
+
+A plugin author never learns what a "trace tag PATCH" is, never parses a
+protobuf, never re-derives that `LogMetric` requires `update` on the parent
+experiment. Core resolves every surface into the same tuple and hands that tuple
+to the backend.
+
+### Identity
+
+The authn output and the authz subject. It replaces the thin `Authorization`.
+
+```python
+@dataclass(frozen=True)
+class Identity:
+    # Stable principal key. The authz subject, and the link key for a
+    # JIT-provisioned local user row. MUST be stable across logins.
+    username: str
+
+    # Richer attributes; None when the provider does not supply them.
+    email: str | None = None
+    display_name: str | None = None
+    groups: tuple[str, ...] = ()  # IdP groups/roles, consumed by group→permission mapping
+    is_admin: bool = False  # provider may assert super-admin (e.g. an IdP claim)
+
+    # Provenance + raw material for backends (OPA input, audit, SAR extras).
+    provider: str = "basic-auth"  # entry-point name of the authenticating provider
+    claims: Mapping[str, Any] = field(default_factory=dict)  # raw JWT claims / SAML attrs / headers
+
+    # JIT provisioning signal. When True, core ensures a local user row exists
+    # (with email/display_name/groups) before authorization runs.
+    provision_if_absent: bool = False
+```
+
+Each field earns its place against a real requirement: `claims` carries JWT/SAML
+material for token validation and OPA policies; `groups` feeds group→permission
+mapping; `provision_if_absent` plus the attributes drive JIT provisioning;
+`is_admin` lets an IdP assert super-admin via a claim. `frozen=True` makes the
+identity safe to use as part of a cache key.
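
As a minimal sketch of the cache-key point, the trimmed `Identity` below keeps only three fields of the dataclass above. One caveat worth checking in an implementation: the generated `__hash__` of a frozen dataclass hashes every field, so an instance whose `claims` is a plain `dict` is immutable but not hashable. The sketch therefore builds the key from scalar fields, in the same shape as the decision-cache key described under "Caching, error handling, fail-closed". `decision_cache_key` is a hypothetical helper name, not part of the proposed contract.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class Identity:
    # Trimmed copy of the RFC's dataclass, enough to show the hashing behavior.
    username: str
    groups: tuple[str, ...] = ()
    claims: Mapping[str, Any] = field(default_factory=dict)


def decision_cache_key(identity, resource_type, resource_id, action, workspace):
    """Hypothetical helper: build a hashable key from scalar fields only."""
    return (identity.username, resource_type, resource_id, action, workspace)


alice = Identity("alice", groups=("ml-research",), claims={"iss": "https://idp.example.com"})

# frozen=True blocks attribute assignment after construction.
try:
    alice.username = "mallory"
    mutated = True
except FrozenInstanceError:
    mutated = False

# Hashing the whole instance fails because the dict-valued field is unhashable.
try:
    hash(alice)
    identity_is_hashable = True
except TypeError:
    identity_is_hashable = False

key = decision_cache_key(alice, "experiment", "42", "update", "ml-research")
cache = {key: True}
```

If a backend needs group membership in the key as well, `identity.groups` is already a tuple and can be appended to the key without further conversion.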
+
+### AuthenticationProvider
+
+```python
+@dataclass
+class AuthChallenge:
+    status_code: int = 401
+    headers: Mapping[str, str] = field(default_factory=dict)  # WWW-Authenticate, Location, Set-Cookie
+    body: str = ""
+    # When True, core emits a 302 redirect for HTML clients and a 401 for
+    # API clients (decided by content negotiation, in core).
+    is_redirect: bool = False
+
+
+class AuthenticationResult:
+    """Exactly one outcome per provider call."""
+    @staticmethod
+    def authenticated(identity: Identity) -> "AuthenticationResult": ...
+    @staticmethod
+    def skip() -> "AuthenticationResult": ...  # "not my credential type" — try next provider
+    @staticmethod
+    def challenge(response: AuthChallenge) -> "AuthenticationResult": ...
+
+
+class AuthenticationProvider(Protocol):
+    name: str  # matches the entry-point name; stamped into Identity.provider
+    def authenticate(self, request: "AuthRequest") -> AuthenticationResult: ...
+```
+
+Three decisions:
+
+- **`skip()` vs `challenge()`.** `skip()` means "this credential isn't mine"
+  (no `Authorization: Bearer` header, no session cookie) — core moves to the
+  next provider in the chain. `challenge()` means "this *is* mine but it's
+  absent or invalid; here is how the client should authenticate." Only after
+  *every* provider skips does core emit the default challenge. This is what lets
+  a chain coexist — bearer token, then session cookie, then basic auth — without
+  each provider guessing about the others.
+- **Browser-redirect vs API-401 is decided by core.** The provider only
+  declares *intent* via `AuthChallenge.is_redirect`; core performs content
+  negotiation (an `Accept: text/html` browser gets a 302 to `Location`; an API
+  client gets a 401). Providers never touch a Flask `Response` or a Starlette
+  `RedirectResponse`. Contrast `authenticate_request_basic_auth()` today
+  (`:2486`), which returns a raw Flask `Response` and therefore cannot work on
+  the FastAPI path.
+- **JIT provisioning flows through core, not the plugin.** When a provider
+  returns `authenticated(Identity(provision_if_absent=True, ...))`, the core
+  chokepoint calls a small `IdentityStore.ensure_user(identity)` that creates the
+  local user row keyed by `username`, populating email/display_name and
+  (optionally) syncing group→role assignments, *before* authorization runs.
+  Providers never write to the auth database.
+
+### AuthRequest: one interface, both frameworks
+
+MLflow runs both Flask (WSGI, `_before_request` at `:2552`) and FastAPI/Starlette
+(`_find_fastapi_validator` at `:4079`). These are two separate auth code paths
+today, and the FastAPI path *rejects any non-default `authorization_function`*
+(`:4141`) — so custom auth doesn't even work for gateway routes right now. We
+must not perpetuate that split.
+
+Both Flask and Starlette requests are wrapped into one read-only adapter so a
+provider is written once:
+
+```python
+class AuthRequest(Protocol):
+    @property
+    def method(self) -> str: ...
+    @property
+    def path(self) -> str: ...
+    @property
+    def headers(self) -> Mapping[str, str]: ...  # case-insensitive
+    @property
+    def cookies(self) -> Mapping[str, str]: ...
+    @property
+    def query(self) -> Mapping[str, str]: ...
+    def body_json(self) -> dict | None: ...  # cached parse, shared with dispatch + handler
+    @property
+    def framework(self) -> Literal["flask", "starlette"]: ...
+```
+
+The one subtle point is body reads. A Starlette body is single-read and async
+(MLflow already caches it in `request.state.cached_body`, see
+`mlflow/server/auth/__init__.py:4031`). The adapter caches the parsed body so the
+provider, the requirement extractor, and the route handler all share one parse;
+on the FastAPI side the async middleware pre-reads it so the synchronous provider
+stays synchronous. The adapter is ~40 lines and lives in core.
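
The shared-parse behavior can be illustrated with a small, framework-neutral sketch. `CachedBodyRequest`, its `header()` helper, and the `parse_count` counter are hypothetical names used only for illustration. A real adapter would wrap a Flask or Starlette request and expose the full `AuthRequest` protocol above. The sketch assumes the raw body bytes have already been read, as the async middleware would do on the FastAPI side.

```python
from __future__ import annotations

import json
from typing import Any, Mapping

_UNPARSED = object()  # sentinel: distinguishes "not parsed yet" from "parsed to None"


class CachedBodyRequest:
    """Hypothetical read-only request wrapper with a one-shot JSON body parse."""

    def __init__(self, method: str, path: str, headers: Mapping[str, str], raw_body: bytes):
        self.method = method
        self.path = path
        # Lower-case the keys once so later lookups are case-insensitive.
        self._headers = {name.lower(): value for name, value in headers.items()}
        self._raw_body = raw_body
        self._parsed: Any = _UNPARSED
        self.parse_count = 0  # exposed only so the sketch can show the cache working

    def header(self, name: str) -> str | None:
        return self._headers.get(name.lower())

    def body_json(self) -> dict | None:
        # Parse at most once; provider, extractor, and handler share the result.
        if self._parsed is _UNPARSED:
            self.parse_count += 1
            try:
                value = json.loads(self._raw_body) if self._raw_body else None
            except ValueError:  # covers malformed JSON and undecodable bytes
                value = None
            self._parsed = value if isinstance(value, dict) else None
        return self._parsed


request = CachedBodyRequest(
    "POST",
    "/api/2.0/mlflow/runs/log-metric",
    {"Content-Type": "application/json"},
    b'{"run_id": "abc123", "key": "loss", "value": 0.1}',
)
first = request.body_json()   # e.g. the requirement extractor reads run_id
second = request.body_json()  # e.g. the route handler reads the same parse
```

Returning `None` for a malformed or non-object body is a choice made for this sketch; a real adapter might instead surface a 400 before authentication runs.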
+
+Writing the provider once matters: an OIDC/Bearer/Kerberos provider's logic —
+validate the token against JWKS, map the claims — is identical regardless of
+framework. Two entry points would double the surface that can drift.
+
+Reference authentication adapters:
+
+| Adapter | How it authenticates | Identity it emits |
+| :--- | :--- | :--- |
+| `basic-auth` (default) | reads the `Authorization: Basic` header, `store.authenticate_user` | `Identity(username=..., provider="basic-auth")` |
+| `oidc` (bearer) | validates a JWT against the IdP JWKS, checks `iss`/`aud`/`exp` | `Identity(username, email, groups, claims=, provision_if_absent=True)` |
+| `k8s-tokenreview` | POSTs a `TokenReview` to the API server; reads `status.user` | `Identity(username, groups=status.user.groups)` |
+| `proxy-header` | trusts `X-Forwarded-User` / `X-Forwarded-Groups` from a vetted upstream proxy | `Identity(username, groups)` |
+
+### AuthorizationBackend: the permission store that owns the decision
+
+```python
+@dataclass(frozen=True)
+class AuthorizationRequirement:
+    resource_type: str  # permissions.VALID_RESOURCE_TYPES, plus "workspace" / "system"
+    resource_id: str | None  # concrete id / name / pattern; None for create-in-workspace and pure workspace ops
+    action: str  # "read" | "use" | "update" | "delete" | "manage" | "create"
+    workspace: str | None  # resolved workspace name, or None when workspaces disabled
+
+
+@dataclass(frozen=True)
+class AuthorizationQuery:
+    subject: Identity
+    requirement: AuthorizationRequirement
+    context: "RequestContext"  # method, path, request_id, claims passthrough for OPA / SAR
+
+
+@dataclass(frozen=True)
+class Decision:
+    allowed: bool
+    effective_permission: str | None = None  # READ/USE/EDIT/MANAGE/NO_PERMISSIONS; None if the backend can't express a level
+    is_admin: bool = False  # backend may assert the subject is super-admin
+    reason: str | None = None  # surfaced in the 403 body and the audit log
+    cache_ttl_seconds: int | None = None  # backend's cache hint; None => use the configured default
+
+
+class AuthorizationBackend(Protocol):
+    name: str
+    def authorize(self, query: AuthorizationQuery) -> Decision: ...
+    # Batch entry point for list/search filtering (see "search filtering" below).
+    def authorize_batch(self, queries: Sequence[AuthorizationQuery]) -> Sequence[Decision]: ...
+    # Optional fast path for the read-predicate. The DB backend overrides it with a single
+    # grant query; remote backends fall back to authorize_batch.
+    def list_readable(
+        self, subject: Identity, resource_type: str, workspace: str | None,
+        candidate_ids: Sequence[str],
+    ) -> set[str]: ...
+```
+
+The **six action verbs** map onto the existing `Permission` booleans
+(`mlflow/server/auth/permissions.py`): `read→can_read`, `use→can_use`,
+`update→can_update`, `delete→can_delete`, `manage→can_manage`, plus `create`,
+which today is a workspace-level gate (`_user_can_create_in_workspace()` at
+`:548`, a workspace-wide `can_use` check). These six cover every existing
+validator.
+
+#### The default DB backend reproduces today's behavior exactly
+
+`DefaultDbAuthorizationBackend.authorize(query)`:
+
+1. If `query.subject.is_admin` or the local user row is admin, return
+   `Decision(allowed=True, is_admin=True)` — reproduces the `sender_is_admin()`
+   short-circuit at `:2568`.
+2. For `create`, reproduce `_user_can_create_in_workspace()` (`:548`).
+3. Otherwise call the *existing* resolver:
+   `perm = store.get_role_permission_for_resource(user.id, requirement.resource_type,
+   requirement.resource_id, requirement.workspace)`, fold against
+   `default_permission` exactly as `_get_role_permission_or_default()` does
+   (`:524`), then `allowed = getattr(perm, f"can_{action}")`. Return
+   `Decision(allowed, effective_permission=perm.name)`.
+
+This is a mechanical extraction of the body already inside every `validate_can_*`
+function — the same `_role_permission_for()` / `_get_role_permission_or_default()`
+chain (`:640`, `:524`). The default backend is byte-for-byte today's behavior,
+which is the backward-compatibility requirement.
+
+#### Kubernetes SubjectAccessReview adapter
+
+`authorize(query)` builds a `SubjectAccessReview` and POSTs it to the API server:
+
+```json
+{ "apiVersion": "authorization.k8s.io/v1", "kind": "SubjectAccessReview",
+  "spec": { "user": "", "groups": ["<...subject.groups>"],
+    "resourceAttributes": { "verb": "", "group": "mlflow.org",
+      "resource": "", "name": "",
+      "namespace": "" } } }
+```
+
+The requirement tuple was deliberately shaped to match SAR's
+`resourceAttributes`:
+
+| Requirement field | SAR field | Mapping |
+| :--- | :--- | :--- |
+| `action` | `verb` | read→get, use→use, update→update, delete→delete, manage→`*`, create→create |
+| `resource_type` | `resource` | direct |
+| `resource_id` | `name` | direct |
+| `workspace` | `namespace` | direct |
+
+The response `status.allowed` → `Decision.allowed`, `status.reason` →
+`Decision.reason`. `effective_permission` is `None` — SAR answers a boolean per
+`(verb, resource)`, and core only needs the boolean for a single check.
+`SelfSubjectAccessReview` is the variant when the provider passed the user's own
+bearer token through (the API server fills in the user). This is the donated
+Kubeflow integration's natural home.
+
+#### OPA adapter
+
+`authorize(query)` POSTs JSON `input` to `POST /v1/data/`:
+
+```json
+{ "input": {
+    "subject": { "user": "...", "groups": ["..."], "is_admin": false },
+    "resource": { "type": "experiment", "id": "42", "workspace": "ml-research" },
+    "action": "update",
+    "claims": { "...": "subject.claims passthrough" } } }
+```
+
+The response `{"result": {"allow": true, "level": "EDIT", "reason": "..."}}` maps
+to `Decision(allowed=result.allow, effective_permission=result.get("level"),
+reason=result.get("reason"))`. The full `claims` passthrough is why
+`Identity.claims` exists — OPA policies can reason over arbitrary IdP attributes.
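
A hedged sketch of the response mapping follows, assuming the policy returns the `allow` / `level` / `reason` shape shown above. `Decision` is trimmed to three fields and `decision_from_opa` is a hypothetical helper name. OPA's Data API omits `result` when the queried document is undefined; treating that case as a deny is an assumption chosen here to match the RFC's fail-closed posture.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Decision:
    # Trimmed copy of the RFC's Decision; only the fields OPA can populate here.
    allowed: bool
    effective_permission: str | None = None
    reason: str | None = None


def decision_from_opa(payload: Mapping[str, Any]) -> Decision:
    """Hypothetical helper: map a parsed OPA Data API response to a Decision."""
    result = payload.get("result")
    if not isinstance(result, Mapping):
        # Undefined policy document or unexpected shape: deny (assumption).
        return Decision(allowed=False, reason="policy returned no result")
    return Decision(
        # Only a literal JSON true allows; truthy strings or numbers do not.
        allowed=result.get("allow") is True,
        effective_permission=result.get("level"),
        reason=result.get("reason"),
    )


granted = decision_from_opa({"result": {"allow": True, "level": "EDIT", "reason": "group grant"}})
undefined = decision_from_opa({})
malformed = decision_from_opa({"result": {"allow": "yes"}})
```

Requiring a literal `true` keeps a policy bug that emits a non-boolean value from being read as an allow.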
+
+#### What each backend can express
+
+| Capability | DB (default) | K8s SAR | OPA |
+| :--- | :--- | :--- | :--- |
+| `allowed` | yes | yes | yes |
+| `effective_permission` level | yes | no (boolean only) | optional |
+| `is_admin` assertion | yes (user row) | via policy | via policy |
+| cheap `list_readable` | yes (one grant query) | no (N calls or batch) | partial-eval, else N |
+| per-check cost | in-process | one network call | one network call |
+
+### Core keeps owning route → requirement (the centerpiece)
+
+#### Today's shape and why it traps plugins
+
+Each `BEFORE_REQUEST_HANDLERS` entry (`:2148`) maps a protobuf class to a
+validator that *both* resolves the requirement *and* decides inline. `GetRun →
+validate_can_read_run` extracts the run id, looks up the experiment, resolves the
+workspace, queries grants, and returns a boolean — all fused. A remote backend
+cannot reuse any of it without re-deriving "GetRun means read on the parent
+experiment." That re-derivation, multiplied across ~200 operations and re-synced
+every release, is the maintenance trap.
+
+#### The refactor: validators become requirement descriptors
+
+Split each validator into two halves:
+
+- **Requirement resolver** (stays in core, one per operation): a pure extraction
+  `resolve(request) -> [AuthorizationRequirement]`. It pulls `run_id` from the
+  body, resolves its experiment and workspace, and returns
+  `AuthorizationRequirement("experiment", experiment_id, "read", workspace)`. No
+  decision.
+- **The decision** moves to a single chokepoint that calls `backend.authorize`.
+
+The dispatch tables change from `class → validator()` to
+`class → RequirementDescriptor`:
+
+```python
+@dataclass(frozen=True)
+class RequirementDescriptor:
+    # Pure extraction. May return several requirements (e.g. a bulk metric-history
+    # read across N runs checks read on each).
+    resolve: Callable[["AuthRequest"], list[AuthorizationRequirement]]
+
+# BEFORE_REQUEST_HANDLERS becomes, e.g.:
+GetRun: RequirementDescriptor(resolve=lambda r: [_require_run(r, "read")]),
+LogMetric: RequirementDescriptor(resolve=lambda r: [_require_run(r, "update")]),
+CreateExperiment: RequirementDescriptor(resolve=lambda r: [_require_workspace_create(r)]),
+```
+
+`_require_run(r, action)` is the extraction half of today's
+`_get_permission_from_run_id`; it returns a requirement instead of a `Permission`.
+The single chokepoint — replacing the validator call in `_before_request`
+(`:2572`) and the FastAPI validator call inside the middleware (`:4108`+) — is:
+
+```python
+def _authorize(req: AuthRequest, identity: Identity, descriptor) -> Response | None:
+    if descriptor is None:
+        return _handle_unmapped_route(req)  # fail-closed; see the registry section
+    for requirement in descriptor.resolve(req):
+        decision = backend.authorize(AuthorizationQuery(identity, requirement, _ctx(req)))
+        if not decision.allowed:
+            return make_forbidden_response(decision.reason)
+    return None
+```
+
+The dispatch functions (`_find_validator` at `:2505`, `_find_fastapi_validator`
+at `:4079`) keep their structure — the same regex/exact-match ordering
+(logged-models → webhooks → exact `(path, method)` → traces regex → gateway →
+otel) — they just return a `RequirementDescriptor` instead of a
+`Callable[[], bool]`. The fail-closed trace default (`lambda: False` at `:2547`)
+becomes a descriptor that always denies.
+
+The plugin never sees `_find_validator`, the protobuf classes, the gateway path
+regexes, or the GraphQL field names. It only ever receives an
+`AuthorizationQuery`. That is the property the issue's point about core owning the
+mapping demands.
+
+#### Special cases that already exist and survive cleanly
+
+- **`sender_is_admin` validators** (webhooks at `:2412`, budget policies):
+  become a descriptor producing `("system", None, "manage", None)`. The backend's
+  admin short-circuit (or core's super-admin gate — see Open questions) handles
+  it.
+- **`lambda: True` "authenticated but unrestricted" routes** (e.g.
+  `GET_CURRENT_USER`, jobs, assistant): become a `REQUIRE_AUTHENTICATED` sentinel
+  — identity required, no backend call. This is distinct from *public* (no
+  identity needed at all).
+- **After-request grant/filter handlers** (`AFTER_REQUEST_PATH_HANDLERS` at
+  `:3227`, e.g. `set_can_manage_experiment_permission`, search-result filtering)
+  are a separate concern, addressed under "search filtering" and "Drawbacks."
+  They stay in core and are not part of the plugin contract.
+
+### The authoritative OPERATION_REGISTRY and the CI guard
+
+#### One source of truth
+
+Today the mapping is spread across at least six structures:
+`BEFORE_REQUEST_HANDLERS` (`:2148`), `BEFORE_REQUEST_VALIDATORS` plus three
+`.update()` blocks (`:2280`+), `TRACE_PARAMETERIZED_*`, `LOGGED_MODEL_*`,
+`WEBHOOK_*`, `_find_fastapi_validator` (`:4079`), and
+`GraphQLAuthorizationMiddleware.PROTECTED_FIELDS` (`:3779`). A new route can ship
+unprotected: the FastAPI middleware literally returns `await call_next(request)`
+when no validator matches (`:4138`) — silent fail-open.
+
+Consolidate into one declarative registry, keyed by a stable operation id, with
+an explicit protection classification:
+
+```python
+class Protection(Enum):
+    PUBLIC = "public"  # no auth at all (health, static, landing page)
+    AUTHENTICATED = "authenticated"  # identity required, no authz check
+    AUTHORIZED = "authorized"  # identity + backend.authorize(requirement)
+
+
+@dataclass(frozen=True)
+class OperationSpec:
+    operation: str  # "GetRun", "graphql.mlflowSearchRuns", "gateway.invoke"
+    protection: Protection
+    descriptor: RequirementDescriptor | None  # required iff AUTHORIZED
+
+
+OPERATION_REGISTRY: dict[str, OperationSpec] = { ... }
+```
+
+The existing dispatch tables become *derived* from this registry (so the lookup
+structures keep their current shape and performance — no behavior change), but
+the registry is the authority and the thing reviewers edit.
+
+#### The CI guard
+
+A test enumerates every registered operation across all four surfaces:
+
+- **Flask/protobuf:** walk `get_endpoints(...)` (the generator already used at
+  `:2282`) → every `(path, method)`.
+- **FastAPI:** introspect the FastAPI route table for `/gateway/*`,
+  `/v1/traces`, `/ajax-api/3.0/*`.
+- **GraphQL:** every field on the schema's query and mutation types.
+
+For each, assert it appears in `OPERATION_REGISTRY` with an explicit
+`Protection`. A new route with no entry fails CI. No more silent fail-open. This
+test is the most valuable maintainability artifact in the proposal: it forces the
+route → requirement knowledge to stay complete and in one place.
+
+#### GraphQL: always-on, registry-driven
+
+Today `PROTECTED_FIELDS` (`:3779`) is a hardcoded set of seven fields, gated
+behind `MLFLOW_SERVER_ENABLE_GRAPHQL_AUTH` (off by default, `:3902`); any field
+not in the set is unprotected. Under the plugin model the GraphQL field →
+requirement map joins `OPERATION_REGISTRY` (operation ids like
+`graphql.mlflowGetRun`) and becomes **always-on** when the auth server is
+enabled — the flag is dropped. The middleware (`:3789`) resolves each field to a
+requirement and calls the same `backend.authorize` chokepoint as REST; its
+existing two-phase pattern (pre-resolve check + post-resolve filter at `:3878`)
+maps onto `authorize` (pre) and `list_readable` (post). The CI guard enforces
+that *every* query/mutation field is classified: read-only metadata fields may be
+`AUTHENTICATED`; data-bearing fields must be `AUTHORIZED`. This is a deliberate
+fail-closed correction — see Adoption strategy for the behavior-change note.
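
The core check of the CI guard can be sketched independently of how each surface is enumerated. In the sketch below, `find_registry_gaps`, the `has_descriptor` flag (a stand-in for the RFC's `descriptor` field), and the sample operation ids are all assumptions for illustration. In a real test the `discovered` set would come from the Flask, FastAPI, and GraphQL introspection listed under "The CI guard".

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Protection(Enum):
    PUBLIC = "public"
    AUTHENTICATED = "authenticated"
    AUTHORIZED = "authorized"


@dataclass(frozen=True)
class OperationSpec:
    operation: str
    protection: Protection
    has_descriptor: bool = False  # stand-in for `descriptor is not None`


def find_registry_gaps(discovered: set[str], registry: dict[str, OperationSpec]) -> list[str]:
    """Hypothetical guard: list every discovered operation that is not fully classified."""
    problems = []
    for operation in sorted(discovered):
        spec = registry.get(operation)
        if spec is None:
            problems.append(f"{operation}: missing from OPERATION_REGISTRY")
        elif spec.protection is Protection.AUTHORIZED and not spec.has_descriptor:
            problems.append(f"{operation}: AUTHORIZED but has no descriptor")
    return problems


# Illustrative data only; real ids come from route and schema introspection.
REGISTRY = {
    "GetRun": OperationSpec("GetRun", Protection.AUTHORIZED, has_descriptor=True),
    "LogMetric": OperationSpec("LogMetric", Protection.AUTHORIZED),
    "health": OperationSpec("health", Protection.PUBLIC),
}
DISCOVERED = {"GetRun", "LogMetric", "health", "graphql.newField"}

gaps = find_registry_gaps(DISCOVERED, REGISTRY)
```

A CI test would then assert that `gaps` is empty and print the list on failure, so the author of a new route sees exactly which entry to add.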
+
+#### Gateway granularity
+
+Every gateway invocation today checks `USE` on the `gateway_endpoint` resource
+(`_validate_gateway_use_permission`), with the endpoint name extracted from path
+or body (`_extract_gateway_endpoint_name`). The assessment:
+
+- `gateway_endpoint` `USE` for invocation is the right primary granularity —
+  keep it. It's the resource a caller "uses."
+- `gateway_secret` and `gateway_model_definition` have CRUD validators in
+  `BEFORE_REQUEST_HANDLERS`, but the *invocation* path checks only the endpoint,
+  not the underlying secret or model definition. For most deployments that is
+  correct: the secret is an admin artifact bound to the endpoint. **Decision:
+  invocation authorizes on the endpoint only; secret and model definition are
+  management-time resources.** A backend that wants finer control can inspect
+  `query.context` (which carries the resolved endpoint→secret binding), but
+  core will not issue a second mandatory check.
+
+The gateway extractor stays in core (it is route knowledge); the requirement it
+emits is `("gateway_endpoint", endpoint_id, "use", workspace)`. The current
+allow-all proxy passthrough becomes an explicit `AUTHENTICATED` (or `PUBLIC`)
+registry entry instead of a silently-true validator, making the intentional
+openness auditable.
+
+### Configuration
+
+Keep the existing INI `[mlflow]` section (`mlflow/server/auth/config.py`) and the
+entry-point plugin pattern (`get_entry_points` at
+`mlflow/server/__init__.py:219`). Use entry-point groups for the
+*implementations*; INI keys *select* them by name. A single dotted-path string
+(today's `authorization_function`) doesn't scale to "a chain of three authn
+providers, each with its own config block," and entry points give third-party
+packages a clean install-and-declare story consistent with `mlflow.app`.
+
+New entry-point groups:
+
+```toml
+[project.entry-points."mlflow.auth.authn_provider"]
+basic-auth = "mlflow.server.auth.providers:BasicAuthProvider"
+oidc = "mlflow_oidc_plugin:OidcProvider"
+
+[project.entry-points."mlflow.auth.authz_backend"]
+database = "mlflow.server.auth.backends:DefaultDbAuthorizationBackend"
+k8s-sar = "mlflow_k8s_auth:SarBackend"
+opa = "mlflow_opa_auth:OpaBackend"
+```
+
+`AuthConfig` gains `authn_providers: list[str]`, `authz_backend: str`, and a
+`plugin_configs: dict[str, dict]` populated from `[authn.]` /
+`[authz.]` sections, which are handed to each plugin's factory.
+
+**Backward-compatibility shim:** if the legacy `authorization_function` key is
+present and the new keys are absent, core synthesizes
+`authn_providers = ` and
+`authz_backend = database`. Existing configs keep working unchanged.
+
+### Caching, error handling, fail-closed
+
+Remote backends (SAR/OPA) cost a network round-trip per check, and a single page
+load fans out to many checks. Three mechanisms.
+
+**Decision cache.** Core wraps the configured backend in a
+`CachingAuthorizationBackend` keyed by `(subject.username, resource_type,
+resource_id, action, workspace)`. TTL precedence:
+`Decision.cache_ttl_seconds` (the backend's hint) → `[authz.]
+cache_ttl_seconds` → global default. This reuses the TTL-cache machinery already
+in `AuthConfig` (`auth_cache_ttl_seconds`, `workspace_cache_ttl_seconds`), with
+the same per-worker staleness caveat documented there (revocation lag is bounded
+by the TTL). Default TTL is short for remote backends (e.g. 30s); for the
+in-process DB backend, caching is optional.
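
The TTL precedence and the bounded staleness window can be shown with a small in-memory sketch. `resolve_ttl` and `TtlDecisionCache` are hypothetical names, and the injected clock exists only to make expiry observable. The actual proposal reuses the TTL-cache machinery already in `AuthConfig`, which this sketch does not model.

```python
from __future__ import annotations

import time
from typing import Callable, Hashable


def resolve_ttl(decision_ttl: int | None, backend_ttl: int | None, global_default: int) -> int:
    """Precedence from the text: decision hint, then per-backend config, then global."""
    for ttl in (decision_ttl, backend_ttl):
        if ttl is not None:
            return ttl
    return global_default


class TtlDecisionCache:
    """Hypothetical per-worker cache of allow/deny results with per-entry expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._entries: dict[Hashable, tuple[bool, float]] = {}

    def get(self, key: Hashable) -> bool | None:
        entry = self._entries.get(key)
        if entry is None:
            return None  # miss: caller must ask the backend
        allowed, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: revocations become visible here
            return None
        return allowed

    def put(self, key: Hashable, allowed: bool, ttl_seconds: int) -> None:
        if ttl_seconds > 0:  # a TTL of 0 disables caching for this entry
            self._entries[key] = (allowed, self._clock() + ttl_seconds)


# Fake clock so the example is deterministic.
now = [1000.0]
cache = TtlDecisionCache(clock=lambda: now[0])
key = ("alice", "experiment", "42", "update", "ml-research")

cache.put(key, True, resolve_ttl(None, 30, 300))  # no hint, backend config says 30s
hit = cache.get(key)
now[0] += 31
after_expiry = cache.get(key)
```

Because a hit is served without consulting the backend, a revoked grant stays effective for at most the resolved TTL on each worker, which is the revocation lag the paragraph above describes.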
+
+**`on_error` policy** (per backend, config-driven), applied when `authorize`
+raises or times out:
+
+| `on_error` | Behavior | When to use |
+| :--- | :--- | :--- |
+| `deny` (default) | treat as `Decision(allowed=False)` | production; fail-closed |
+| `fallback` | consult the default DB backend | hybrid: remote authoritative, DB as degraded mode |
+| `allow` | `Decision(allowed=True)` | dev only; logged loudly |
+
+Fail-closed is the default everywhere, matching the existing posture: unknown
+trace paths deny (`:2547`), workspace-lookup failure denies, and RFC 0005's
+`check_user_permission` denies unknown resources. The unmapped-route handler
+denies; a timeout is an error subject to `on_error`.
+
+## Drawbacks
+
+- **Per-request remote latency.** Every protected request becomes a network call
+  for SAR/OPA backends, mitigated but not eliminated by the decision cache. A
+  busy UI page issuing many REST/GraphQL calls multiplies this.
+- **Two caches, two staleness windows.** The decision cache plus the existing
+  workspace cache mean revocation isn't instant; it's bounded by TTL. This is
+  already true of today's auth cache, but the proposal adds a second window.
+- **SAR loses the effective permission level.** SAR answers booleans, so
+  `Decision.effective_permission` is `None`. As a result, RFC 0005's admin-UI
+  "what is Bob's level on experiment 42?" query (`check_user_permission`)
+  degrades to allow/deny under a SAR deployment. Acceptable, but it must be
+  documented.
+- **Grant authoring is DB-backend-specific.** The after-request handler
+  `set_can_manage_experiment_permission` (`:2581`) writes a grant on resource
+  creation, which is meaningless for SAR/OPA, where grants live externally. Core
+  must make these handlers no-ops when the configured backend is not the DB backend.
+  The `AuthorizationBackend` protocol is deliberately *decision-only*; grant
+  *authoring* (RFC 0005's role API, the after-create MANAGE grant) is
+  DB-backend-specific and is skipped under an external backend that owns its own
+  grants. This is real coupling and is surfaced rather than hidden.
+- **Implementation cost.** Splitting ~200 fused validators into extraction +
+  descriptor, then routing every surface through one chokepoint, is a sizeable
+  refactor of `mlflow/server/auth/__init__.py`. The risk is mitigated by the
+  default backend reproducing exact behavior and by a characterization test that
+  runs the existing auth integration suite against the new chokepoint.
+
+## Alternatives
+
+**Keep the single `authorization_function` string; don't split authn/authz.**
+Rejected. It conflates the two concerns the issue explicitly asks to separate, it
+can't express "OIDC authn + SAR authz," and the FastAPI path already demonstrates
+the single-function model breaking down (it rejects custom functions outright at
+`:4141`).
+
+**Let plugins own routing: each plugin re-derives route → requirement.**
+Rejected. This is exactly the maintainability trap the issue's point about core
+owning the mapping calls out. Every plugin would duplicate the ~200-entry mapping
+and drift on every new MLflow route. The whole design exists to prevent this.
+
+**Two authentication entry points (one Flask, one FastAPI) instead of
+`AuthRequest`.** Rejected. It doubles the provider implementation surface for
+identical token-validation logic, and a doubled surface drifts.
+
+**Per-resource backend methods** (`can_read_experiment`, `can_use_gateway`, …)
+mirroring the pre-RBAC layout. Rejected for the same reason RFC 0005 rejected
+per-table storage: it preserves fan-out. One `authorize(query)` is the
+consolidation.
+
+**Impact of not doing this.** Enterprises keep running an OAuth proxy in front of
+MLflow (extra infrastructure, MLflow blind to identity) or fork the auth server.
+External integrations like Kubeflow's `mlflow-integration` keep duplicating the
+route → permission map and re-syncing it every release, which is the precise
+cost the issue raises.
+
+# Adoption strategy
+
+The change is largely additive, with a few deliberate behavior corrections.
+
+**basic-auth is unchanged.** It becomes `BasicAuthProvider` (the authn half of
+today's `authenticate_request_basic_auth`) plus `DefaultDbAuthorizationBackend`
+(the authz half, i.e. the RFC 0005 resolver). The default config selects exactly
+these, so an operator who upgrades and changes nothing sees identical behavior.
+This is asserted by the CI guard plus a characterization test that runs the
+existing auth integration suite against the new chokepoint.
+
+**Additive (non-breaking):** the `Identity` / `Decision` types, the entry-point
+groups, the `OPERATION_REGISTRY`, the `authn_providers` / `authz_backend` config
+keys, and the FastAPI path *gaining* custom-auth support (today it rejects it at
+`:4141`, so this is a strict improvement).
+
+**Behavior changes, called out explicitly:**
+
+- **GraphQL auth becomes always-on** when the auth server is enabled (the
+  `MLFLOW_SERVER_ENABLE_GRAPHQL_AUTH` flag is dropped). Anyone who relied on
+  GraphQL being unauthorized by default is affected. This is a deliberate
+  fail-closed correction, scoped to the auth-enabled server.
+- **The FastAPI unmapped-route path flips from fail-open to fail-closed** via the
+  registry. A gateway-adjacent route not in the registry starts returning 403.
+  The CI guard makes this discoverable before release; operators who mount custom
+  FastAPI routes on the auth app must register them.
+- **`Identity` replaces the `Authorization` return contract** of
+  `authenticate_request()`. A third party that wrote a custom
+  `authorization_function` returning a `werkzeug.datastructures.Authorization`
+  is shimmed (the returned object is adapted to `Identity(username=...)`), so it
+  keeps working but is soft-deprecated.
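
The legacy-return shim in the last bullet could be sketched as follows. This is an illustrative assumption, not the proposed code: `Identity` is reduced to two fields, `adapt_legacy_result` is a hypothetical helper name, and the legacy object is duck-typed on a `username` attribute so the sketch does not need Werkzeug installed.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class Identity:  # simplified stand-in for the RFC's Identity type
    username: str
    groups: Tuple[str, ...] = ()


def adapt_legacy_result(result: Any) -> Optional[Identity]:
    """Adapt whatever a legacy authorization_function returned to an Identity.

    Returns None when no username can be recovered, so the caller can treat
    the request as unauthenticated (fail-closed).
    """
    if isinstance(result, Identity):
        return result  # already the new contract; pass through unchanged
    # Duck-typed: any object exposing a non-empty `username` is accepted.
    username = getattr(result, "username", None)
    if not username:
        return None
    # A legacy result carries nothing but a username, so groups stay empty.
    return Identity(username=username)
```

Returning `None` instead of raising keeps the decision about how to respond to an unauthenticated request in one place in core.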
+
+**Sequencing.** Ship the provider/backend split with the legacy
+`authorization_function` shim live; document the entry-point migration; deprecate
+the shim one minor release later. RFC 0005's role model and resolver interface are
+a prerequisite; this RFC assumes 0005 has landed.
+
+# Open questions
+
+- **Where does group → MLflow-role mapping live?** In the authn provider (it
+  emits MLflow roles directly), in core (a configurable group→role table), or
+  entirely in the authz backend (OPA reasons over `Identity.groups`)? The
+  cleanest default is probably a core mapping table consulted during JIT
+  provisioning, but a policy-engine deployment may want it all in the backend.
+- **Are `system` / super-admin operations backend-gated or core-only?** Truly
+  global operations (create user, delete workspace) use `sender_is_admin` today.
+  Do we model them as `("system", None, "manage")` and let a backend authorize
+  them, or keep super-admin a core-only gate that a backend can only *assert*
+  (via `Decision.is_admin`) but not *grant*? Leaning toward core-only for
+  `system` ops, backend-assertable everywhere else.
+- **Is `authorize_batch` mandatory or default-looped?** Mandatory pushes
+  complexity onto every backend author; a default loop over `authorize` is
+  simpler but slow for remote backends. Leaning toward default-looped with a
+  strong documentation nudge.
+- **The N+1 search-filtering problem (the genuinely hard part).**
+  `search_experiments`, `search_logged_models`, `search_registered_models`,
+  `search_model_versions`, and the GraphQL model-version filter must drop results
+  the caller can't read. Today this is one grant query feeding an in-memory
+  predicate, which is cheap because the DB backend can enumerate grants. A remote
+  SAR backend cannot enumerate; filtering 500 results naively is 500 SAR calls.
+  The `list_readable` hook concentrates the problem behind the interface (the DB
+  backend keeps its single query; OPA can use partial evaluation to return the
+  allowed set in one query; SAR batches client-side with bounded concurrency),
+  but a SAR backend filtering a large unfiltered search *will* be slow. The
+  honest mitigation is to narrow candidate sets before the store query (the
+  existing `filter_experiment_ids` pattern) and document the cost. Whether to
+  push the predicate down into the tracking-store query for backends that can
+  express grants as a filter is left to a follow-up.
+- **Per-request workspace-resolution cost under remote backends.** The workspace
+  lookup happens in core, *before* the backend call, to build the requirement. It
+  stays in core and keeps its existing cache, but it adds a store round-trip per
+  request that a remote backend can't avoid.
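
Two of the options discussed above, a default-looped `authorize_batch` and client-side filtering with bounded concurrency, can be sketched as below. This is a rough illustration only: the function names, the `backend.authorize(identity, requirement)` signature, and the `to_requirement` callback are assumptions, and a thread pool is just one way to bound concurrency.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List


def authorize_batch_default(backend, identity, requirements: Iterable[Any]) -> List[Any]:
    """Default-looped batch: one authorize call per requirement, in input order."""
    return [backend.authorize(identity, req) for req in requirements]


def filter_readable(
    backend,
    identity,
    candidates: Iterable[Any],
    to_requirement: Callable[[Any], Any],
    max_workers: int = 8,
) -> List[Any]:
    """Keep only the candidates the caller may read.

    At most `max_workers` authorize calls are in flight at once. Exceptions
    from the backend propagate to the caller, which is where an on_error
    policy would be applied.
    """
    items = list(candidates)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip stays aligned.
        decisions = list(
            pool.map(lambda item: backend.authorize(identity, to_requirement(item)), items)
        )
    return [item for item, decision in zip(items, decisions) if decision.allowed]
```

Even with bounded concurrency, filtering 500 candidates still costs 500 backend calls; the sketch only limits how many run at once, which is why narrowing the candidate set first remains the main mitigation.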