From 3368d8e9636f72afc059978f5cdd714e63ef049b Mon Sep 17 00:00:00 2001 From: Nehanth Date: Tue, 28 Apr 2026 09:49:54 -0500 Subject: [PATCH 01/12] Add RFC 0007: Scorer presets for common evaluation patterns Proposes a Preset class that packages a named collection of scorers for common evaluation patterns (RAG, agent, conversational-agent, safety, quality). Presets can be passed directly in the scorers list alongside individual scorers, with automatic deduplication. Based on mlflow/mlflow#21445. Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 652 ++++++++++++++++++ 1 file changed, 652 insertions(+) create mode 100644 rfcs/0007-scorer-presets/0007-scorer-presets.md diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md new file mode 100644 index 0000000..2d2c49b --- /dev/null +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -0,0 +1,652 @@ +--- + +## start_date: 2026-04-23 + +mlflow_issue: [https://github.com/mlflow/mlflow/issues/21445](https://github.com/mlflow/mlflow/issues/21445) +rfc_pr: + +# Scorer Presets for Common Evaluation Patterns + + +| Author(s) | Nehanth | +| ---------------------- | ----------- | +| **Date Last Modified** | 2026-04-28 | +| **AI Assistant(s)** | Claude Code | + + +# Summary + +> **Note:** This RFC is based on [mlflow/mlflow#21445](https://github.com/mlflow/mlflow/issues/21445). The motivation, proposed presets, and API examples are derived from that issue, with additional design details and implementation specifics added here. + +MLflow provides 21 built-in scorers for evaluating GenAI outputs, but users have no way to select a coherent subset for a specific evaluation pattern. Today, evaluating an agent requires importing and instantiating 9+ individual scorer classes -- boilerplate that gets copy-pasted across teams and templates. + +This RFC proposes a `Preset` class that packages a named collection of scorers. 
MLflow ships built-in presets for common evaluation patterns (`RAG`, `AGENT`, `CONVERSATIONAL_AGENT`, `SAFETY`, `QUALITY`), and users can define their own. Presets can be passed directly in the `scorers` list alongside individual scorers, with automatic deduplication when presets overlap. + +# Basic Example + +```python +import mlflow +from mlflow.genai.scorers import AGENT + +# Use a built-in preset directly +result = mlflow.genai.evaluate( + data=eval_dataset, + predict_fn=predict_fn, + scorers=[AGENT], +) +``` + +```python +# Mix presets and individual scorers +from mlflow.genai.scorers import AGENT, Guidelines + +result = mlflow.genai.evaluate( + data=eval_dataset, + predict_fn=predict_fn, + scorers=[AGENT, Guidelines(name="tone", guidelines=["Respond professionally"])], +) +``` + +```python +# Combine presets -- duplicates are resolved automatically +from mlflow.genai.scorers import AGENT, SAFETY + +# Both contain Safety(); it runs once, not twice +result = mlflow.genai.evaluate( + data=eval_dataset, + scorers=[AGENT, SAFETY], +) +``` + +```python +# Define a custom preset +from mlflow.genai.scorers import Preset, Safety, Fluency + +my_preset = Preset("my_team_eval", scorers=[Safety(), Fluency(), my_custom_scorer]) + +result = mlflow.genai.evaluate( + data=eval_dataset, + scorers=[my_preset, another_scorer], +) +``` + +```python +# Discover available built-in presets +from mlflow.genai.scorers import list_presets + +for name, scorer_names in list_presets().items(): + print(f"{name}: {', '.join(scorer_names)}") +``` + +## Motivation + +### The Problem + +As described in [the original issue](https://github.com/mlflow/mlflow/issues/21445), the Databricks agent app template [evaluate_agent.py](https://github.com/databricks/app-templates/blob/main/agent-openai-agents-sdk/agent_server/evaluate_agent.py) imports and instantiates 9 separate scorers to evaluate a conversational agent: + +```python +from mlflow.genai.scorers import ( + Completeness, + 
ConversationalSafety, + ConversationCompleteness, + Fluency, + KnowledgeRetention, + RelevanceToQuery, + Safety, + ToolCallCorrectness, + UserFrustration, +) + +mlflow.genai.evaluate( + data=simulator, + predict_fn=predict_fn, + scorers=[ + Completeness(), + ConversationCompleteness(), + ConversationalSafety(), + KnowledgeRetention(), + UserFrustration(), + Fluency(), + RelevanceToQuery(), + Safety(), + ToolCallCorrectness(), + ], +) +``` + +Every team building agent evaluation follows this same pattern. This creates three problems (from the [original issue](https://github.com/mlflow/mlflow/issues/21445)): + +1. **No built-in grouping.** `get_all_scorers()` returns all 19 default-constructible scorers. Users evaluating a RAG pipeline get `ToolCallCorrectness`; users evaluating an agent get `RetrievalGroundedness`. Each unnecessary scorer wastes an LLM API call. +2. **21 scorers to choose from.** Users must read documentation for each scorer to determine relevance. Session-level scorers (e.g., `KnowledgeRetention`) silently produce no results when passed to single-turn evaluation. +3. **Copy-paste problem.** The same scorer lists get duplicated across templates, notebooks, and tutorials. When new scorers are added, existing lists don't pick them up. + +### Who Benefits + +- **New users** get a curated starting point without reading all 21 scorer docs +- **Teams** can define and share custom presets, ensuring consistent evaluation across projects +- **Template authors** replace hardcoded scorer lists with a single preset +- **MLflow maintainers** gain a single place to update when new scorers are added + +### Out of Scope + +- **Parameterized presets.** Passing `model` or `inference_params` to all scorers in a preset. Users can iterate over the preset's scorers instead. +- **Third-party scorer presets.** Integrating presets for DeepEval, RAGAS, or TruLens scorers. +- **Preset registration/storage in the tracking server.** Presets are code-side only. 
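
The workaround noted under parameterized presets, iterating over a preset's scorers, might look like the sketch below. `FakeScorer`, its `model` field, the `with_model` helper, and the trimmed `Preset` stand-in are all assumptions made so the snippet runs without MLflow; they are not the real scorer classes or their constructor signatures.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FakeScorer:
    """Hypothetical stand-in for a built-in scorer; not an MLflow class."""

    name: str
    model: Optional[str] = None  # assumed judge-model field, for illustration only


class Preset:
    """Minimal stand-in for the container proposed in this RFC."""

    def __init__(self, name, scorers):
        self._name = name
        self._scorers = tuple(scorers)

    def __iter__(self):
        return iter(self._scorers)


def with_model(preset, model):
    """Return scorer copies with `model` set; the preset itself is left untouched."""
    return [replace(scorer, model=model) for scorer in preset]


agent = Preset("agent", [FakeScorer("safety"), FakeScorer("completeness")])
configured = with_model(agent, "openai:/gpt-4o")
```

The result is a plain list, so under these assumptions it could be passed as `scorers=configured` without any preset-level parameter support.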
+ +## Detailed Design + +### The `Preset` Class + +A `Preset` is a named, iterable container of scorers. It is **not** a `Scorer` subclass -- it is a grouping mechanism that gets flattened into individual scorers at validation time. + +```python +class Preset: + """A named, immutable collection of scorers for a common evaluation pattern. + + Presets can be passed in the ``scorers`` list alongside individual + scorers. They are flattened and deduplicated during validation, + so the evaluation loop only ever sees individual ``Scorer`` instances. + + Args: + name: A descriptive name for this preset. + scorers: The list of scorer instances in this preset. + """ + + def __init__(self, name: str, scorers: list[Scorer]): + self._name = name + self._scorers = tuple(scorers) + + @property + def name(self) -> str: + return self._name + + @property + def scorers(self) -> tuple: + return self._scorers + + def __iter__(self): + return iter(self._scorers) + + def __len__(self): + return len(self._scorers) + + def __add__(self, other): + if isinstance(other, (Preset, list)): + return list(self) + list(other) + return NotImplemented + + def __radd__(self, other): + if isinstance(other, list): + return other + list(self) + return NotImplemented + + def __repr__(self): + scorer_names = [type(s).__name__ for s in self._scorers] + return f"Preset('{self._name}', [{', '.join(scorer_names)}])" +``` + +**Key design decisions:** + +- **Immutable.** Scorers are stored as a tuple and exposed via a read-only property. Built-in presets are module-level constants and must not be mutated. +- **Not a `Scorer` subclass.** A preset doesn't produce feedback -- it's a container. The evaluation loop assumes one scorer = one result column. Making `Preset` a scorer would require changes throughout the pipeline (aggregation, telemetry, serialization). 
+- **Iterable.** Supports `__iter__`, `__len__`, and `__add__`/`__radd__` so it composes naturally: `AGENT + [my_scorer]`, `[my_scorer] + AGENT`, or `AGENT + SAFETY`. +- **Stores instances, not classes.** Users pass already-configured scorer instances. + +### Deduplication + +When multiple presets are combined, the same scorer type can appear more than once. For example, `AGENT` and `SAFETY` both contain `Safety()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. + +`validate_scorers()` deduplicates by scorer type after flattening: + +```python +def _deduplicate_scorers(scorers: list[Scorer]) -> list[Scorer]: + seen = set() + result = [] + for scorer in scorers: + scorer_type = type(scorer) + if scorer_type not in seen: + seen.add(scorer_type) + result.append(scorer) + return result +``` + +This uses first-occurrence-wins: if `AGENT` appears before `SAFETY` in the list, the `Safety()` instance from `AGENT` is kept and the one from `SAFETY` is dropped. For built-in scorers with default constructors, the instances are interchangeable, so the choice is arbitrary. + +Custom scorers with the same type but different configurations (e.g., two `Guidelines` instances with different `guidelines` args) should **not** be deduplicated, since they produce different results. 
The deduplication uses `type(scorer)` as the key, but scorers with different `name` attributes are kept: + +```python +def _deduplicate_scorers(scorers: list[Scorer]) -> list[Scorer]: + seen = set() + result = [] + for scorer in scorers: + key = (type(scorer), scorer.name) + if key not in seen: + seen.add(key) + result.append(scorer) + return result +``` + +### How `evaluate()` Handles Presets + +Presets are flattened and deduplicated in `validate_scorers()`, which already validates the `scorers` list before evaluation begins: + +```python +def validate_scorers(scorers: list[Any]) -> list[Scorer]: + if not isinstance(scorers, list): + raise MlflowException.invalid_parameter_value( + "The `scorers` argument must be a list of scorers or presets. " + "You can use a built-in preset like `scorers=[AGENT]`, or " + "`scorers=get_all_scorers()` for all available built-in scorers." + ) + + from mlflow.genai.scorers.presets import Preset + + # 1. Flatten presets into individual scorers + flat = [] + for item in scorers: + if isinstance(item, Preset): + flat.extend(item) + else: + flat.append(item) + + # 2. Deduplicate by (type, name) + flat = _deduplicate_scorers(flat) + + # 3. Existing validation on the flattened list + valid_scorers = [] + for scorer in flat: + if isinstance(scorer, Scorer): + valid_scorers.append(scorer) + else: + # existing error handling... +``` + +`evaluate()` itself does not change. By the time scorers reach the evaluation loop, they are all individual `Scorer` instances. + +### Built-in Presets + +MLflow ships five built-in presets as module-level constants. All contained scorers use default constructors. + +> **Note:** `**TaskSuccess`** is a new scorer proposed in [mlflow/mlflow#22972](https://github.com/mlflow/mlflow/issues/22972). It evaluates whether an agent successfully accomplished the user's task without requiring ground truth data — unlike `Correctness`, which requires an `expectations` column. 
This scorer would be added to the `AGENT`, `CONVERSATIONAL_AGENT`, and `QUALITY` presets. This work can be part of this RFC or be a future addition after this RFC is completed. + + +| Preset | Scorers | Use Case | +| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- | +| `RAG` | RetrievalRelevance, RetrievalSufficiency, RetrievalGroundedness, RelevanceToQuery, Safety, Completeness | Retrieval-augmented generation pipelines | +| `AGENT` | ToolCallCorrectness, ToolCallEfficiency, RelevanceToQuery, Safety, Completeness, **TaskSuccess** | Single-turn tool-calling agents | +| `CONVERSATIONAL_AGENT` | All of `AGENT` + UserFrustration, ConversationCompleteness, ConversationalSafety, ConversationalToolCallEfficiency, KnowledgeRetention | Multi-turn conversational agents | +| `SAFETY` | Safety, ConversationalSafety | Safety-focused evaluation (composable with other presets) | +| `QUALITY` | RelevanceToQuery, Fluency, Completeness, **TaskSuccess** | Architecture-independent output quality | + + +#### Design Rationale + +- **Safety is in `RAG` and `AGENT`** because these presets aim to be complete starting points. Most users want safety checks without composing two presets. +- **Fluency is excluded from `AGENT`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `AGENT + [Fluency()]`. +- **`CONVERSATIONAL_AGENT` excludes `ConversationalRoleAdherence`** because it requires a defined persona in the system prompt, which not all agents have. +- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `QUALITY + [Correctness()]`. +- **`Guidelines` and `ConversationalGuidelines` are excluded from all presets** because both require a `guidelines` constructor argument. 
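
The compositions suggested in the rationale above (`AGENT + [Fluency()]`, `QUALITY + [Correctness()]`) return plain lists, which then go through the same flatten-and-deduplicate step as presets. The following is a minimal sketch of that behavior. The scorer classes are hypothetical stand-ins rather than the real MLflow scorers, and `flatten_and_dedupe` is an illustrative name that merges the flattening and `(type, name)` deduplication steps described earlier.

```python
class Scorer:
    """Hypothetical stand-in for mlflow.genai.scorers.base.Scorer."""

    def __init__(self, name=None):
        # Default the name to the class name, purely for this sketch.
        self.name = name or type(self).__name__


class Safety(Scorer): ...
class Completeness(Scorer): ...
class Fluency(Scorer): ...
class Guidelines(Scorer): ...


class Preset:
    """Trimmed copy of the container described in this RFC."""

    def __init__(self, name, scorers):
        self._name = name
        self._scorers = tuple(scorers)

    def __iter__(self):
        return iter(self._scorers)

    def __add__(self, other):
        if isinstance(other, (Preset, list)):
            return list(self) + list(other)
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, list):
            return other + list(self)
        return NotImplemented


def flatten_and_dedupe(items):
    """Expand presets, then keep the first scorer seen per (type, name) key."""
    flat = []
    for item in items:
        if isinstance(item, Preset):
            flat.extend(item)
        else:
            flat.append(item)
    seen, result = set(), []
    for scorer in flat:
        key = (type(scorer), scorer.name)
        if key not in seen:
            seen.add(key)
            result.append(scorer)
    return result


agent = Preset("agent", [Safety(), Completeness()])
safety = Preset("safety", [Safety()])

combined = agent + [Fluency()]  # plain list of three scorers
deduped = flatten_and_dedupe(
    [agent, safety, Guidelines("tone"), Guidelines("brevity")]
)
names = [s.name for s in deduped]
```

In this sketch the second `Safety` instance is dropped because it shares a `(type, name)` key with the first, while the two `Guidelines` instances survive because their names differ.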
+ +### `list_presets()` + +A companion function for discovering available built-in presets: + +```python +def list_presets() -> dict[str, list[str]]: + """Return a mapping of built-in preset names to their scorer class names.""" +``` + +### Implementation + +#### New file: `mlflow/genai/scorers/presets.py` + +```python +from mlflow.genai.scorers.base import Scorer +from mlflow.genai.scorers.builtin_scorers import ( + Completeness, + ConversationalSafety, + ConversationalToolCallEfficiency, + ConversationCompleteness, + Correctness, + Fluency, + KnowledgeRetention, + RelevanceToQuery, + RetrievalGroundedness, + RetrievalRelevance, + RetrievalSufficiency, + Safety, + ToolCallCorrectness, + ToolCallEfficiency, + UserFrustration, +) + + +class Preset: + def __init__(self, name: str, scorers: list[Scorer]): + self._name = name + self._scorers = tuple(scorers) + + @property + def name(self) -> str: + return self._name + + @property + def scorers(self) -> tuple: + return self._scorers + + def __iter__(self): + return iter(self._scorers) + + def __len__(self): + return len(self._scorers) + + def __add__(self, other): + if isinstance(other, (Preset, list)): + return list(self) + list(other) + return NotImplemented + + def __radd__(self, other): + if isinstance(other, list): + return other + list(self) + return NotImplemented + + def __repr__(self): + scorer_names = [type(s).__name__ for s in self._scorers] + return f"Preset('{self._name}', [{', '.join(scorer_names)}])" + + +RAG = Preset("rag", [ + RetrievalRelevance(), + RetrievalSufficiency(), + RetrievalGroundedness(), + RelevanceToQuery(), + Safety(), + Completeness(), +]) + +AGENT = Preset("agent", [ + ToolCallCorrectness(), + ToolCallEfficiency(), + RelevanceToQuery(), + Safety(), + Completeness(), +]) + +CONVERSATIONAL_AGENT = Preset("conversational-agent", [ + ToolCallCorrectness(), + ToolCallEfficiency(), + RelevanceToQuery(), + Safety(), + Completeness(), + UserFrustration(), + ConversationCompleteness(), + 
ConversationalSafety(), + ConversationalToolCallEfficiency(), + KnowledgeRetention(), +]) + +SAFETY = Preset("safety", [ + Safety(), + ConversationalSafety(), +]) + +QUALITY = Preset("quality", [ + RelevanceToQuery(), + Fluency(), + Completeness(), + Correctness(), +]) + +_BUILTIN_PRESETS = { + "rag": RAG, + "agent": AGENT, + "conversational-agent": CONVERSATIONAL_AGENT, + "safety": SAFETY, + "quality": QUALITY, +} + + +def list_presets() -> dict[str, list[str]]: + return { + name: [type(s).__name__ for s in preset] + for name, preset in _BUILTIN_PRESETS.items() + } +``` + +No circular dependency risk: `presets.py` imports from `builtin_scorers.py`, and nothing in the existing chain imports from `presets.py`. + +#### Updated: `mlflow/genai/scorers/__init__.py` + +Add `Preset`, the built-in preset constants, and `list_presets` to `_LAZY_IMPORTS`, `__all__`, and the `TYPE_CHECKING` block. The `__getattr__` function dispatches to the `presets` module: + +```python +_LAZY_IMPORTS_PRESETS = { + "Preset", "RAG", "AGENT", "CONVERSATIONAL_AGENT", + "SAFETY", "QUALITY", "list_presets", +} + +def __getattr__(name): + if name in _LAZY_IMPORTS: + from mlflow.genai.scorers import builtin_scorers + return getattr(builtin_scorers, name) + if name in _LAZY_IMPORTS_PRESETS: + from mlflow.genai.scorers import presets + return getattr(presets, name) + raise AttributeError(f"module {__name__!r} has no attribute {name!r}") +``` + +#### Updated: `mlflow/genai/scorers/validation.py` + +Flatten presets and deduplicate before validating individual scorers: + +```python +def validate_scorers(scorers: list[Any]) -> list[Scorer]: + if not isinstance(scorers, list): + raise MlflowException.invalid_parameter_value( + "The `scorers` argument must be a list of scorers or presets. " + "You can use a built-in preset like `scorers=[AGENT]`, or " + "`scorers=get_all_scorers()` for all available built-in scorers." 
+ ) + + from mlflow.genai.scorers.presets import Preset + + flat = [] + for item in scorers: + if isinstance(item, Preset): + flat.extend(item) + else: + flat.append(item) + + flat = _deduplicate_scorers(flat) + # ... existing validation on the flattened list +``` + +#### Updated: `mlflow/genai/__init__.py` + +Re-export for convenience: + +```python +from mlflow.genai.scorers import Preset, list_presets +``` + +### Testing Plan + +New file: `tests/genai/scorers/test_presets.py` + + +| Test | Verifies | +| ---------------------------------------- | --------------------------------------------------------- | +| `test_builtin_preset_{rag,agent,...}` | Exact scorer types in each built-in preset | +| `test_custom_preset` | Users can create a `Preset` with arbitrary scorers | +| `test_preset_in_validate_scorers` | `validate_scorers([AGENT, my_scorer])` flattens correctly | +| `test_preset_deduplication` | `[AGENT, SAFETY]` deduplicates shared `Safety()` | +| `test_dedup_preserves_different_names` | Two `Guidelines` with different names are both kept | +| `test_preset_add_list` | `AGENT + [Fluency()]` returns a combined list | +| `test_list_add_preset` | `[Fluency()] + AGENT` returns a combined list | +| `test_preset_add_preset` | `AGENT + SAFETY` returns a combined list | +| `test_preset_iter_and_len` | `list(AGENT)` and `len(AGENT)` work correctly | +| `test_preset_invalid_scorer_in_validate` | A preset containing a non-scorer raises `MlflowException` | +| `test_list_presets` | Returns correct dict with correct class names | +| `test_preset_repr` | `repr(AGENT)` shows name and scorer class names | + + +```python +@pytest.mark.parametrize("preset", [RAG, AGENT, CONVERSATIONAL_AGENT, SAFETY, QUALITY]) +def test_builtin_preset_contains_valid_scorers(preset): + assert len(preset) > 0 + assert all(isinstance(s, BuiltInScorer) for s in preset) + assert len(list(preset)) == len(set(type(s) for s in preset)) # no duplicates +``` + +### Files Changed + + +| File | Change | +| 
------------------------------------- | ---------------------------------------------------------------- | +| `mlflow/genai/scorers/presets.py` | **New.** `Preset` class, built-in presets, `list_presets()`. | +| `mlflow/genai/scorers/__init__.py` | Add lazy imports for `Preset`, built-in presets, `list_presets`. | +| `mlflow/genai/__init__.py` | Re-export `Preset`, `list_presets`. | +| `mlflow/genai/scorers/validation.py` | Flatten presets and deduplicate in `validate_scorers()`. | +| `tests/genai/scorers/test_presets.py` | **New.** Tests for `Preset` class and built-in presets. | + + +## Drawbacks + +1. **New class in the API.** Adds `Preset` to the public surface. Mitigation: it's a simple container with no complex behavior. +2. **Opinionated defaults.** Not everyone will agree on which scorers belong in which preset. Mitigation: presets are extensible via `+`, and users can define their own. +3. **Implicit behavior changes on upgrade.** A new scorer added to a built-in preset means different evaluation results after upgrading. Consistent with how `get_all_scorers()` already behaves. + +# Alternatives + +### 1. 
`get_preset()` function (no class) + +Instead of a `Preset` class, provide a simple function that returns a plain list: + +```python +from typing import Literal + +from mlflow.exceptions import MlflowException +from mlflow.genai.scorers.builtin_scorers import ( + Completeness, + ConversationalSafety, + ConversationalToolCallEfficiency, + ConversationCompleteness, + Correctness, + Fluency, + KnowledgeRetention, + RelevanceToQuery, + RetrievalGroundedness, + RetrievalRelevance, + RetrievalSufficiency, + Safety, + ToolCallCorrectness, + ToolCallEfficiency, + UserFrustration, +) + +_PRESETS: dict[str, list[type]] = { + "rag": [ + RetrievalRelevance, + RetrievalSufficiency, + RetrievalGroundedness, + RelevanceToQuery, + Safety, + Completeness, + ], + "agent": [ + ToolCallCorrectness, + ToolCallEfficiency, + RelevanceToQuery, + Safety, + Completeness, + ], + "conversational-agent": [ + ToolCallCorrectness, + ToolCallEfficiency, + RelevanceToQuery, + Safety, + Completeness, + UserFrustration, + ConversationCompleteness, + ConversationalSafety, + ConversationalToolCallEfficiency, + KnowledgeRetention, + ], + "safety": [ + Safety, + ConversationalSafety, + ], + "quality": [ + RelevanceToQuery, + Fluency, + Completeness, + Correctness, + ], +} + +_VALID_PRESET_NAMES = ", ".join(sorted(_PRESETS.keys())) +PresetName = Literal["rag", "agent", "conversational-agent", "safety", "quality"] + + +def get_preset(name: PresetName) -> list: + if name not in _PRESETS: + raise MlflowException.invalid_parameter_value( + f"Unknown preset '{name}'. 
Valid presets are: {_VALID_PRESET_NAMES}" + ) + return [scorer_class() for scorer_class in _PRESETS[name]] + + +def list_presets() -> dict[str, list[str]]: + return { + name: [cls.__name__ for cls in classes] + for name, classes in _PRESETS.items() + } +``` + +Usage: + +```python +from mlflow.genai.scorers import get_preset + +# Simple usage +result = mlflow.genai.evaluate(scorers=get_preset("agent")) + +# Extending a preset +scorers = get_preset("agent") + [Guidelines(name="tone", guidelines=["Be professional"])] +result = mlflow.genai.evaluate(scorers=scorers) +``` + +**Pros:** Simpler (~30 lines). No validation changes needed. Returns fresh instances each call (no mutable singleton concern). `Literal` type gives IDE autocompletion. Going from function to class later is non-breaking. + +**Cons:** No user-defined presets. Composition requires `+` with list concatenation. The preset concept disappears immediately -- it's just a list. No deduplication when combining presets. + +This is a viable first step if the class approach is deemed too heavy. The class can be added later as a non-breaking extension. + +### 2. Tag-based filtering + +Add `categories` to each scorer class and provide `get_scorers(categories=["rag"])`. More flexible but over-engineered for 21 scorers and requires modifying every existing class. + +### 3. Enum-based API + +`ScorerPreset.RAG.get_scorers()`. Type-safe but heavier API surface. The `Literal` type on a function already provides IDE autocompletion. + +### 4. Do nothing + +Users keep copy-pasting scorer lists. Does not scale as the scorer count grows. + +# Adoption Strategy + +This is an **additive, non-breaking change**. Existing code continues to work unchanged. + +- Update documentation and templates to show `Preset` usage alongside the manual import pattern. +- Update the `validate_scorers()` error message to mention presets for discoverability. 
+- Databricks agent templates can simplify from 9 imports + 9 instantiations to `scorers=[CONVERSATIONAL_AGENT]`. + +# Open Questions + +1. **Should `ConversationalRoleAdherence` be in `CONVERSATIONAL_AGENT`?** Currently excluded because it requires a defined persona. **Open for discussion.** +2. **Should `Correctness` be in `AGENT` or `RAG`?** Currently only in `QUALITY` because it requires `expectations` data. **Open for discussion.** +3. **Should there be an `ALL` preset?** `get_all_scorers()` already serves this role. **Recommendation:** Do not add. +4. **Deduplication key.** Should deduplication use `type(scorer)` alone, or `(type(scorer), scorer.name)`? The latter preserves multiple instances of the same class with different names (e.g., two `Guidelines` with different rules). +5. **Future: parameterized presets?** e.g., `AGENT.with_model("openai:/gpt-4o")` returning a new preset with the model set on all scorers. Deferred to keep the initial API simple. + From 38d98ef6498baa47c09d0a893edbb8a88dbe8e23 Mon Sep 17 00:00:00 2001 From: Nehanth Date: Tue, 28 Apr 2026 11:04:16 -0500 Subject: [PATCH 02/12] Update RFC: built-in presets as subclasses of Preset Each built-in preset is now a subclass (Agent, Rag, ConversationalAgent, SafetyPreset, Quality) that creates fresh scorer instances on each call. Eliminates shared mutable state and enables future preset-specific configuration. Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 234 +++++++++++------- 1 file changed, 141 insertions(+), 93 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 2d2c49b..10959bf 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -26,35 +26,35 @@ This RFC proposes a `Preset` class that packages a named collection of scorers. 
```python import mlflow -from mlflow.genai.scorers import AGENT +from mlflow.genai.scorers import Agent -# Use a built-in preset directly +# Use a built-in preset directly -- each call creates fresh scorer instances result = mlflow.genai.evaluate( data=eval_dataset, predict_fn=predict_fn, - scorers=[AGENT], + scorers=[Agent()], ) ``` ```python # Mix presets and individual scorers -from mlflow.genai.scorers import AGENT, Guidelines +from mlflow.genai.scorers import Agent, Guidelines result = mlflow.genai.evaluate( data=eval_dataset, predict_fn=predict_fn, - scorers=[AGENT, Guidelines(name="tone", guidelines=["Respond professionally"])], + scorers=[Agent(), Guidelines(name="tone", guidelines=["Respond professionally"])], ) ``` ```python # Combine presets -- duplicates are resolved automatically -from mlflow.genai.scorers import AGENT, SAFETY +from mlflow.genai.scorers import Agent, SafetyPreset # Both contain Safety(); it runs once, not twice result = mlflow.genai.evaluate( data=eval_dataset, - scorers=[AGENT, SAFETY], + scorers=[Agent(), SafetyPreset()], ) ``` @@ -187,11 +187,75 @@ class Preset: **Key design decisions:** -- **Immutable.** Scorers are stored as a tuple and exposed via a read-only property. Built-in presets are module-level constants and must not be mutated. +- **Immutable.** Scorers are stored as a tuple and exposed via a read-only property. - **Not a `Scorer` subclass.** A preset doesn't produce feedback -- it's a container. The evaluation loop assumes one scorer = one result column. Making `Preset` a scorer would require changes throughout the pipeline (aggregation, telemetry, serialization). -- **Iterable.** Supports `__iter__`, `__len__`, and `__add__`/`__radd__` so it composes naturally: `AGENT + [my_scorer]`, `[my_scorer] + AGENT`, or `AGENT + SAFETY`. +- **Iterable.** Supports `__iter__`, `__len__`, and `__add__`/`__radd__` so it composes naturally: `Agent() + [my_scorer]`, `[my_scorer] + Agent()`, or `Agent() + SafetyPreset()`. 
- **Stores instances, not classes.** Users pass already-configured scorer instances. +### Built-in Presets as Subclasses + +Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future. + +```python +class Agent(Preset): + def __init__(self): + super().__init__("agent", [ + ToolCallCorrectness(), + ToolCallEfficiency(), + RelevanceToQuery(), + Safety(), + Completeness(), + ]) + +class Rag(Preset): + def __init__(self): + super().__init__("rag", [ + RetrievalRelevance(), + RetrievalSufficiency(), + RetrievalGroundedness(), + RelevanceToQuery(), + Safety(), + Completeness(), + ]) + +class ConversationalAgent(Preset): + def __init__(self): + super().__init__("conversational-agent", [ + ToolCallCorrectness(), + ToolCallEfficiency(), + RelevanceToQuery(), + Safety(), + Completeness(), + UserFrustration(), + ConversationCompleteness(), + ConversationalSafety(), + ConversationalToolCallEfficiency(), + KnowledgeRetention(), + ]) + +class SafetyPreset(Preset): + def __init__(self): + super().__init__("safety", [ + Safety(), + ConversationalSafety(), + ]) + +class Quality(Preset): + def __init__(self): + super().__init__("quality", [ + RelevanceToQuery(), + Fluency(), + Completeness(), + ]) +``` + +**Why subclasses over instances:** + +- **Fresh instances every time.** `Agent()` creates new scorer instances on each call. No shared mutable state — the singleton problem is eliminated entirely. +- **Preset-specific configuration.** Each preset can accept its own parameters in the future (e.g., `Agent(model="openai:/gpt-4o")` to set the judge model for all scorers). +- **Type checking.** `isinstance(preset, Agent)` works — code can distinguish which preset is being used. +- **Custom control flow.** Each preset can override methods for preset-specific validation or behavior. 
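
The first and third points above can be illustrated with a small sketch. `Scorer`, `Safety`, and `Completeness` are hypothetical stand-ins so the snippet runs without MLflow, and the `AGENT` constant is included only for contrast with the earlier module-level design; only the `Preset`/`Agent` shape follows the RFC.

```python
class Scorer:
    """Hypothetical stand-in for an MLflow scorer."""


class Safety(Scorer): ...
class Completeness(Scorer): ...


class Preset:
    def __init__(self, name, scorers):
        self._name = name
        self._scorers = tuple(scorers)

    @property
    def scorers(self):
        return self._scorers


class Agent(Preset):
    """Subclass style: scorers are constructed inside __init__."""

    def __init__(self):
        super().__init__("agent", [Safety(), Completeness()])


# Module-level constant style: one set of scorer objects for the whole process.
AGENT = Preset("agent", [Safety(), Completeness()])

first, second = Agent(), Agent()

# Subclass style: each Agent() call builds its own scorer objects.
shares_scorers = first.scorers[0] is second.scorers[0]

# Constant style: every user of AGENT sees the same scorer objects.
constant_shares = AGENT.scorers[0] is AGENT.scorers[0]

# A built-in preset is identifiable by type and is still a Preset.
is_agent = isinstance(first, Agent) and isinstance(first, Preset)
```

Under these assumptions, changing state on a scorer obtained from one `Agent()` would not be visible through another `Agent()`, whereas every importer of a module-level constant would observe the change.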
+ ### Deduplication When multiple presets are combined, the same scorer type can appear more than once. For example, `AGENT` and `SAFETY` both contain `Safety()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. @@ -235,7 +299,7 @@ def validate_scorers(scorers: list[Any]) -> list[Scorer]: if not isinstance(scorers, list): raise MlflowException.invalid_parameter_value( "The `scorers` argument must be a list of scorers or presets. " - "You can use a built-in preset like `scorers=[AGENT]`, or " + "You can use a built-in preset like `scorers=[Agent()]`, or " "`scorers=get_all_scorers()` for all available built-in scorers." ) @@ -263,20 +327,19 @@ def validate_scorers(scorers: list[Any]) -> list[Scorer]: `evaluate()` itself does not change. By the time scorers reach the evaluation loop, they are all individual `Scorer` instances. -### Built-in Presets - -MLflow ships five built-in presets as module-level constants. All contained scorers use default constructors. +### Built-in Preset Summary -> **Note:** `**TaskSuccess`** is a new scorer proposed in [mlflow/mlflow#22972](https://github.com/mlflow/mlflow/issues/22972). It evaluates whether an agent successfully accomplished the user's task without requiring ground truth data — unlike `Correctness`, which requires an `expectations` column. This scorer would be added to the `AGENT`, `CONVERSATIONAL_AGENT`, and `QUALITY` presets. This work can be part of this RFC or be a future addition after this RFC is completed. +MLflow ships five built-in preset subclasses. Each call creates fresh scorer instances. +> **Note:** **`TaskSuccess`** is a new scorer proposed in [mlflow/mlflow#22972](https://github.com/mlflow/mlflow/issues/22972). It evaluates whether an agent successfully accomplished the user's task without requiring ground truth data — unlike `Correctness`, which requires an `expectations` column. This scorer would be added to `Agent`, `ConversationalAgent`, and `Quality`. 
This work can be part of this RFC or be a future addition after this RFC is completed. -| Preset | Scorers | Use Case | -| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- | -| `RAG` | RetrievalRelevance, RetrievalSufficiency, RetrievalGroundedness, RelevanceToQuery, Safety, Completeness | Retrieval-augmented generation pipelines | -| `AGENT` | ToolCallCorrectness, ToolCallEfficiency, RelevanceToQuery, Safety, Completeness, **TaskSuccess** | Single-turn tool-calling agents | -| `CONVERSATIONAL_AGENT` | All of `AGENT` + UserFrustration, ConversationCompleteness, ConversationalSafety, ConversationalToolCallEfficiency, KnowledgeRetention | Multi-turn conversational agents | -| `SAFETY` | Safety, ConversationalSafety | Safety-focused evaluation (composable with other presets) | -| `QUALITY` | RelevanceToQuery, Fluency, Completeness, **TaskSuccess** | Architecture-independent output quality | +| Preset | Scorers | Use Case | +| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- | +| `Rag()` | RetrievalRelevance, RetrievalSufficiency, RetrievalGroundedness, RelevanceToQuery, Safety, Completeness | Retrieval-augmented generation pipelines | +| `Agent()` | ToolCallCorrectness, ToolCallEfficiency, RelevanceToQuery, Safety, Completeness, **TaskSuccess** | Single-turn tool-calling agents | +| `ConversationalAgent()`| All of `Agent` + UserFrustration, ConversationCompleteness, ConversationalSafety, ConversationalToolCallEfficiency, KnowledgeRetention | Multi-turn conversational agents | +| `SafetyPreset()` | Safety, ConversationalSafety | Safety-focused evaluation (composable with other presets) | +| `Quality()` | RelevanceToQuery, Fluency, 
Completeness, **TaskSuccess** | Architecture-independent output quality | #### Design Rationale @@ -355,62 +418,39 @@ class Preset: return f"Preset('{self._name}', [{', '.join(scorer_names)}])" -RAG = Preset("rag", [ - RetrievalRelevance(), - RetrievalSufficiency(), - RetrievalGroundedness(), - RelevanceToQuery(), - Safety(), - Completeness(), -]) - -AGENT = Preset("agent", [ - ToolCallCorrectness(), - ToolCallEfficiency(), - RelevanceToQuery(), - Safety(), - Completeness(), -]) - -CONVERSATIONAL_AGENT = Preset("conversational-agent", [ - ToolCallCorrectness(), - ToolCallEfficiency(), - RelevanceToQuery(), - Safety(), - Completeness(), - UserFrustration(), - ConversationCompleteness(), - ConversationalSafety(), - ConversationalToolCallEfficiency(), - KnowledgeRetention(), -]) - -SAFETY = Preset("safety", [ - Safety(), - ConversationalSafety(), -]) - -QUALITY = Preset("quality", [ - RelevanceToQuery(), - Fluency(), - Completeness(), - Correctness(), -]) - -_BUILTIN_PRESETS = { - "rag": RAG, - "agent": AGENT, - "conversational-agent": CONVERSATIONAL_AGENT, - "safety": SAFETY, - "quality": QUALITY, -} - - -def list_presets() -> dict[str, list[str]]: - return { - name: [type(s).__name__ for s in preset] - for name, preset in _BUILTIN_PRESETS.items() - } +class Rag(Preset): + def __init__(self): + super().__init__("rag", [ + RetrievalRelevance(), RetrievalSufficiency(), RetrievalGroundedness(), + RelevanceToQuery(), Safety(), Completeness(), + ]) + +class Agent(Preset): + def __init__(self): + super().__init__("agent", [ + ToolCallCorrectness(), ToolCallEfficiency(), + RelevanceToQuery(), Safety(), Completeness(), + ]) + +class ConversationalAgent(Preset): + def __init__(self): + super().__init__("conversational-agent", [ + ToolCallCorrectness(), ToolCallEfficiency(), + RelevanceToQuery(), Safety(), Completeness(), + UserFrustration(), ConversationCompleteness(), + ConversationalSafety(), ConversationalToolCallEfficiency(), + KnowledgeRetention(), + ]) + +class 
SafetyPreset(Preset): + def __init__(self): + super().__init__("safety", [Safety(), ConversationalSafety()]) + +class Quality(Preset): + def __init__(self): + super().__init__("quality", [ + RelevanceToQuery(), Fluency(), Completeness(), + ]) ``` No circular dependency risk: `presets.py` imports from `builtin_scorers.py`, and nothing in the existing chain imports from `presets.py`. @@ -421,8 +461,8 @@ Add `Preset`, the built-in preset constants, and `list_presets` to `_LAZY_IMPORT ```python _LAZY_IMPORTS_PRESETS = { - "Preset", "RAG", "AGENT", "CONVERSATIONAL_AGENT", - "SAFETY", "QUALITY", "list_presets", + "Preset", "Rag", "Agent", "ConversationalAgent", + "SafetyPreset", "Quality", "list_presets", } def __getattr__(name): @@ -444,7 +484,7 @@ def validate_scorers(scorers: list[Any]) -> list[Scorer]: if not isinstance(scorers, list): raise MlflowException.invalid_parameter_value( "The `scorers` argument must be a list of scorers or presets. " - "You can use a built-in preset like `scorers=[AGENT]`, or " + "You can use a built-in preset like `scorers=[Agent()]`, or " "`scorers=get_all_scorers()` for all available built-in scorers." 
) @@ -478,24 +518,32 @@ New file: `tests/genai/scorers/test_presets.py` | ---------------------------------------- | --------------------------------------------------------- | | `test_builtin_preset_{rag,agent,...}` | Exact scorer types in each built-in preset | | `test_custom_preset` | Users can create a `Preset` with arbitrary scorers | -| `test_preset_in_validate_scorers` | `validate_scorers([AGENT, my_scorer])` flattens correctly | -| `test_preset_deduplication` | `[AGENT, SAFETY]` deduplicates shared `Safety()` | +| `test_preset_in_validate_scorers` | `validate_scorers([Agent(), my_scorer])` flattens correctly | +| `test_preset_deduplication` | `[Agent(), SafetyPreset()]` deduplicates shared `Safety()` | | `test_dedup_preserves_different_names` | Two `Guidelines` with different names are both kept | -| `test_preset_add_list` | `AGENT + [Fluency()]` returns a combined list | -| `test_list_add_preset` | `[Fluency()] + AGENT` returns a combined list | -| `test_preset_add_preset` | `AGENT + SAFETY` returns a combined list | -| `test_preset_iter_and_len` | `list(AGENT)` and `len(AGENT)` work correctly | +| `test_preset_add_list` | `Agent() + [Fluency()]` returns a combined list | +| `test_list_add_preset` | `[Fluency()] + Agent()` returns a combined list | +| `test_preset_add_preset` | `Agent() + SafetyPreset()` returns a combined list | +| `test_preset_iter_and_len` | `list(Agent())` and `len(Agent())` work correctly | | `test_preset_invalid_scorer_in_validate` | A preset containing a non-scorer raises `MlflowException` | | `test_list_presets` | Returns correct dict with correct class names | -| `test_preset_repr` | `repr(AGENT)` shows name and scorer class names | +| `test_preset_repr` | `repr(Agent())` shows name and scorer class names | +| `test_preset_fresh_instances` | `Agent()` creates new scorer instances each time | ```python -@pytest.mark.parametrize("preset", [RAG, AGENT, CONVERSATIONAL_AGENT, SAFETY, QUALITY]) -def 
test_builtin_preset_contains_valid_scorers(preset): +@pytest.mark.parametrize("preset_cls", [Rag, Agent, ConversationalAgent, SafetyPreset, Quality]) +def test_builtin_preset_contains_valid_scorers(preset_cls): + preset = preset_cls() assert len(preset) > 0 assert all(isinstance(s, BuiltInScorer) for s in preset) assert len(list(preset)) == len(set(type(s) for s in preset)) # no duplicates + +def test_preset_fresh_instances(): + a1 = Agent() + a2 = Agent() + # Each call creates new scorer instances + assert a1.scorers[0] is not a2.scorers[0] ``` ### Files Changed @@ -640,13 +688,13 @@ This is an **additive, non-breaking change**. Existing code continues to work un - Update documentation and templates to show `Preset` usage alongside the manual import pattern. - Update the `validate_scorers()` error message to mention presets for discoverability. -- Databricks agent templates can simplify from 9 imports + 9 instantiations to `scorers=[CONVERSATIONAL_AGENT]`. +- Databricks agent templates can simplify from 9 imports + 9 instantiations to `scorers=[ConversationalAgent()]`. # Open Questions 1. **Should `ConversationalRoleAdherence` be in `CONVERSATIONAL_AGENT`?** Currently excluded because it requires a defined persona. **Open for discussion.** -2. **Should `Correctness` be in `AGENT` or `RAG`?** Currently only in `QUALITY` because it requires `expectations` data. **Open for discussion.** -3. **Should there be an `ALL` preset?** `get_all_scorers()` already serves this role. **Recommendation:** Do not add. +2. **Should `Correctness` be in `Agent` or `Rag`?** Currently excluded from all presets because it requires `expectations` data. **Open for discussion.** +3. **Should there be an `All` preset?** `get_all_scorers()` already serves this role. **Recommendation:** Do not add. 4. **Deduplication key.** Should deduplication use `type(scorer)` alone, or `(type(scorer), scorer.name)`? 
The latter preserves multiple instances of the same class with different names (e.g., two `Guidelines` with different rules). -5. **Future: parameterized presets?** e.g., `AGENT.with_model("openai:/gpt-4o")` returning a new preset with the model set on all scorers. Deferred to keep the initial API simple. +5. **Future: parameterized presets?** e.g., `Agent(model="openai:/gpt-4o")` to set the judge model for all scorers in the preset. Can be a future addition. From 00db80f709212f33ed61da8676ee46f9caf6f7d5 Mon Sep 17 00:00:00 2001 From: Nehanth Date: Tue, 28 Apr 2026 11:08:37 -0500 Subject: [PATCH 03/12] Fix naming consistency throughout RFC Co-Authored-By: Claude Signed-off-by: Nehanth --- rfcs/0007-scorer-presets/0007-scorer-presets.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 10959bf..5ca362c 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -20,7 +20,7 @@ rfc_pr: MLflow provides 21 built-in scorers for evaluating GenAI outputs, but users have no way to select a coherent subset for a specific evaluation pattern. Today, evaluating an agent requires importing and instantiating 9+ individual scorer classes -- boilerplate that gets copy-pasted across teams and templates. -This RFC proposes a `Preset` class that packages a named collection of scorers. MLflow ships built-in presets for common evaluation patterns (`RAG`, `AGENT`, `CONVERSATIONAL_AGENT`, `SAFETY`, `QUALITY`), and users can define their own. Presets can be passed directly in the `scorers` list alongside individual scorers, with automatic deduplication when presets overlap. +This RFC proposes a `Preset` class that packages a named collection of scorers. 
MLflow ships built-in preset subclasses for common evaluation patterns (`Rag`, `Agent`, `ConversationalAgent`, `SafetyPreset`, `Quality`), and users can define their own. Presets can be passed directly in the `scorers` list alongside individual scorers, with automatic deduplication when presets overlap. # Basic Example @@ -258,7 +258,7 @@ class Quality(Preset): ### Deduplication -When multiple presets are combined, the same scorer type can appear more than once. For example, `AGENT` and `SAFETY` both contain `Safety()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. +When multiple presets are combined, the same scorer type can appear more than once. For example, `Agent()` and `SafetyPreset()` both contain `Safety()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. `validate_scorers()` deduplicates by scorer type after flattening: @@ -274,7 +274,7 @@ def _deduplicate_scorers(scorers: list[Scorer]) -> list[Scorer]: return result ``` -This uses first-occurrence-wins: if `AGENT` appears before `SAFETY` in the list, the `Safety()` instance from `AGENT` is kept and the one from `SAFETY` is dropped. For built-in scorers with default constructors, the instances are interchangeable, so the choice is arbitrary. +This uses first-occurrence-wins: if `Agent()` appears before `SafetyPreset()` in the list, the `Safety()` instance from `Agent()` is kept and the one from `SafetyPreset()` is dropped. For built-in scorers with default constructors, the instances are interchangeable, so the choice is arbitrary. Custom scorers with the same type but different configurations (e.g., two `Guidelines` instances with different `guidelines` args) should **not** be deduplicated, since they produce different results. The deduplication uses `type(scorer)` as the key, but scorers with different `name` attributes are kept: @@ -344,10 +344,10 @@ MLflow ships five built-in preset subclasses. 
Each call creates fresh scorer ins #### Design Rationale -- **Safety is in `RAG` and `AGENT`** because these presets aim to be complete starting points. Most users want safety checks without composing two presets. -- **Fluency is excluded from `AGENT`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `AGENT + [Fluency()]`. -- **`CONVERSATIONAL_AGENT` excludes `ConversationalRoleAdherence`** because it requires a defined persona in the system prompt, which not all agents have. -- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `QUALITY + [Correctness()]`. +- **Safety is in `Rag` and `Agent`** because these presets aim to be complete starting points. Most users want safety checks without composing two presets. +- **Fluency is excluded from `Agent`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `Agent() + [Fluency()]`. +- **`ConversationalAgent` excludes `ConversationalRoleAdherence`** because it requires a defined persona in the system prompt, which not all agents have. +- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `Quality() + [Correctness()]`. - **`Guidelines` and `ConversationalGuidelines` are excluded from all presets** because both require a `guidelines` constructor argument. ### `list_presets()` @@ -692,7 +692,7 @@ This is an **additive, non-breaking change**. Existing code continues to work un # Open Questions -1. **Should `ConversationalRoleAdherence` be in `CONVERSATIONAL_AGENT`?** Currently excluded because it requires a defined persona. **Open for discussion.** +1. **Should `ConversationalRoleAdherence` be in `ConversationalAgent`?** Currently excluded because it requires a defined persona. **Open for discussion.** 2. 
**Should `Correctness` be in `Agent` or `Rag`?** Currently excluded from all presets because it requires `expectations` data. **Open for discussion.** 3. **Should there be an `All` preset?** `get_all_scorers()` already serves this role. **Recommendation:** Do not add. 4. **Deduplication key.** Should deduplication use `type(scorer)` alone, or `(type(scorer), scorer.name)`? The latter preserves multiple instances of the same class with different names (e.g., two `Guidelines` with different rules). From f42a950c727ca36d4f64939ac90a4a1e6de04ad2 Mon Sep 17 00:00:00 2001 From: Nehanth Date: Tue, 28 Apr 2026 11:10:57 -0500 Subject: [PATCH 04/12] Remove _BUILTIN_PRESETS and list_presets from main proposal Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 30 ++++--------------- 1 file changed, 6 insertions(+), 24 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 5ca362c..0b7ee1c 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -70,14 +70,6 @@ result = mlflow.genai.evaluate( ) ``` -```python -# Discover available built-in presets -from mlflow.genai.scorers import list_presets - -for name, scorer_names in list_presets().items(): - print(f"{name}: {', '.join(scorer_names)}") -``` - ## Motivation ### The Problem @@ -350,15 +342,6 @@ MLflow ships five built-in preset subclasses. Each call creates fresh scorer ins - **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `Quality() + [Correctness()]`. - **`Guidelines` and `ConversationalGuidelines` are excluded from all presets** because both require a `guidelines` constructor argument. 
-### `list_presets()` - -A companion function for discovering available built-in presets: - -```python -def list_presets() -> dict[str, list[str]]: - """Return a mapping of built-in preset names to their scorer class names.""" -``` - ### Implementation #### New file: `mlflow/genai/scorers/presets.py` @@ -457,12 +440,12 @@ No circular dependency risk: `presets.py` imports from `builtin_scorers.py`, and #### Updated: `mlflow/genai/scorers/__init__.py` -Add `Preset`, the built-in preset constants, and `list_presets` to `_LAZY_IMPORTS`, `__all__`, and the `TYPE_CHECKING` block. The `__getattr__` function dispatches to the `presets` module: +Add `Preset` and the built-in preset subclasses to `_LAZY_IMPORTS`, `__all__`, and the `TYPE_CHECKING` block. The `__getattr__` function dispatches to the `presets` module: ```python _LAZY_IMPORTS_PRESETS = { "Preset", "Rag", "Agent", "ConversationalAgent", - "SafetyPreset", "Quality", "list_presets", + "SafetyPreset", "Quality", } def __getattr__(name): @@ -506,7 +489,7 @@ def validate_scorers(scorers: list[Any]) -> list[Scorer]: Re-export for convenience: ```python -from mlflow.genai.scorers import Preset, list_presets +from mlflow.genai.scorers import Preset ``` ### Testing Plan @@ -526,7 +509,6 @@ New file: `tests/genai/scorers/test_presets.py` | `test_preset_add_preset` | `Agent() + SafetyPreset()` returns a combined list | | `test_preset_iter_and_len` | `list(Agent())` and `len(Agent())` work correctly | | `test_preset_invalid_scorer_in_validate` | A preset containing a non-scorer raises `MlflowException` | -| `test_list_presets` | Returns correct dict with correct class names | | `test_preset_repr` | `repr(Agent())` shows name and scorer class names | | `test_preset_fresh_instances` | `Agent()` creates new scorer instances each time | @@ -551,9 +533,9 @@ def test_preset_fresh_instances(): | File | Change | | ------------------------------------- | ---------------------------------------------------------------- | -| 
`mlflow/genai/scorers/presets.py` | **New.** `Preset` class, built-in presets, `list_presets()`. | -| `mlflow/genai/scorers/__init__.py` | Add lazy imports for `Preset`, built-in presets, `list_presets`. | -| `mlflow/genai/__init__.py` | Re-export `Preset`, `list_presets`. | +| `mlflow/genai/scorers/presets.py` | **New.** `Preset` class and built-in preset subclasses. | +| `mlflow/genai/scorers/__init__.py` | Add lazy imports for `Preset` and built-in presets. | +| `mlflow/genai/__init__.py` | Re-export `Preset`. | | `mlflow/genai/scorers/validation.py` | Flatten presets and deduplicate in `validate_scorers()`. | | `tests/genai/scorers/test_presets.py` | **New.** Tests for `Preset` class and built-in presets. | From 4171a01d17381c58f95248421cdfc5861d97c0d0 Mon Sep 17 00:00:00 2001 From: Nehanth Date: Wed, 6 May 2026 16:48:20 -0400 Subject: [PATCH 05/12] Update RFC: address review feedback - Add deduplication to Preset __init__ and __add__ - Remove TaskSuccess from presets (out of scope) - Remove RetrievalSufficiency from Rag preset (requires ground truth) - Trim implementation details per reviewer feedback Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 295 ++---------------- 1 file changed, 25 insertions(+), 270 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 0b7ee1c..c0b27f8 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -146,7 +146,18 @@ class Preset: def __init__(self, name: str, scorers: list[Scorer]): self._name = name - self._scorers = tuple(scorers) + self._scorers = tuple(self._deduplicate(scorers)) + + @staticmethod + def _deduplicate(scorers): + seen = set() + result = [] + for scorer in scorers: + key = (type(scorer), scorer.name) + if key not in seen: + seen.add(key) + result.append(scorer) + return result @property def name(self) -> str: @@ -164,12 +175,14 @@ class 
Preset: def __add__(self, other): if isinstance(other, (Preset, list)): - return list(self) + list(other) + combined = list(self) + list(other) + return self._deduplicate(combined) return NotImplemented def __radd__(self, other): if isinstance(other, list): - return other + list(self) + combined = other + list(self) + return self._deduplicate(combined) return NotImplemented def __repr__(self): @@ -179,7 +192,7 @@ class Preset: **Key design decisions:** -- **Immutable.** Scorers are stored as a tuple and exposed via a read-only property. +- **Immutable and deduplicated.** Scorers are stored as a tuple and exposed via a read-only property. Deduplication happens in `__init__` and `__add__` using `(type, name)` as the key, so scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). - **Not a `Scorer` subclass.** A preset doesn't produce feedback -- it's a container. The evaluation loop assumes one scorer = one result column. Making `Preset` a scorer would require changes throughout the pipeline (aggregation, telemetry, serialization). - **Iterable.** Supports `__iter__`, `__len__`, and `__add__`/`__radd__` so it composes naturally: `Agent() + [my_scorer]`, `[my_scorer] + Agent()`, or `Agent() + SafetyPreset()`. - **Stores instances, not classes.** Users pass already-configured scorer instances. @@ -203,7 +216,6 @@ class Rag(Preset): def __init__(self): super().__init__("rag", [ RetrievalRelevance(), - RetrievalSufficiency(), RetrievalGroundedness(), RelevanceToQuery(), Safety(), @@ -252,70 +264,12 @@ class Quality(Preset): When multiple presets are combined, the same scorer type can appear more than once. For example, `Agent()` and `SafetyPreset()` both contain `Safety()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. 
-`validate_scorers()` deduplicates by scorer type after flattening: +Deduplication happens in two places: -```python -def _deduplicate_scorers(scorers: list[Scorer]) -> list[Scorer]: - seen = set() - result = [] - for scorer in scorers: - scorer_type = type(scorer) - if scorer_type not in seen: - seen.add(scorer_type) - result.append(scorer) - return result -``` - -This uses first-occurrence-wins: if `Agent()` appears before `SafetyPreset()` in the list, the `Safety()` instance from `Agent()` is kept and the one from `SafetyPreset()` is dropped. For built-in scorers with default constructors, the instances are interchangeable, so the choice is arbitrary. - -Custom scorers with the same type but different configurations (e.g., two `Guidelines` instances with different `guidelines` args) should **not** be deduplicated, since they produce different results. The deduplication uses `type(scorer)` as the key, but scorers with different `name` attributes are kept: - -```python -def _deduplicate_scorers(scorers: list[Scorer]) -> list[Scorer]: - seen = set() - result = [] - for scorer in scorers: - key = (type(scorer), scorer.name) - if key not in seen: - seen.add(key) - result.append(scorer) - return result -``` - -### How `evaluate()` Handles Presets +- **In the `Preset` class** — both `__init__` and `__add__` deduplicate using `(type(scorer), scorer.name)` as the key, so the preset is always clean whenever scorers are added or combined. +- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), SafetyPreset()]`) without using `+`, `__add__` is never called. `validate_scorers()` flattens and deduplicates as a safety net. 
-Presets are flattened and deduplicated in `validate_scorers()`, which already validates the `scorers` list before evaluation begins: - -```python -def validate_scorers(scorers: list[Any]) -> list[Scorer]: - if not isinstance(scorers, list): - raise MlflowException.invalid_parameter_value( - "The `scorers` argument must be a list of scorers or presets. " - "You can use a built-in preset like `scorers=[Agent()]`, or " - "`scorers=get_all_scorers()` for all available built-in scorers." - ) - - from mlflow.genai.scorers.presets import Preset - - # 1. Flatten presets into individual scorers - flat = [] - for item in scorers: - if isinstance(item, Preset): - flat.extend(item) - else: - flat.append(item) - - # 2. Deduplicate by (type, name) - flat = _deduplicate_scorers(flat) - - # 3. Existing validation on the flattened list - valid_scorers = [] - for scorer in flat: - if isinstance(scorer, Scorer): - valid_scorers.append(scorer) - else: - # existing error handling... -``` +Scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). Only true duplicates — same class and same name — are removed. `evaluate()` itself does not change. By the time scorers reach the evaluation loop, they are all individual `Scorer` instances. @@ -323,15 +277,13 @@ def validate_scorers(scorers: list[Any]) -> list[Scorer]: MLflow ships five built-in preset subclasses. Each call creates fresh scorer instances. -> **Note:** **`TaskSuccess`** is a new scorer proposed in [mlflow/mlflow#22972](https://github.com/mlflow/mlflow/issues/22972). It evaluates whether an agent successfully accomplished the user's task without requiring ground truth data — unlike `Correctness`, which requires an `expectations` column. This scorer would be added to `Agent`, `ConversationalAgent`, and `Quality`. This work can be part of this RFC or be a future addition after this RFC is completed. 
- | Preset | Scorers | Use Case | | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- | -| `Rag()` | RetrievalRelevance, RetrievalSufficiency, RetrievalGroundedness, RelevanceToQuery, Safety, Completeness | Retrieval-augmented generation pipelines | -| `Agent()` | ToolCallCorrectness, ToolCallEfficiency, RelevanceToQuery, Safety, Completeness, **TaskSuccess** | Single-turn tool-calling agents | +| `Rag()` | RetrievalRelevance, RetrievalGroundedness, RelevanceToQuery, Safety, Completeness | Retrieval-augmented generation pipelines | +| `Agent()` | ToolCallCorrectness, ToolCallEfficiency, RelevanceToQuery, Safety, Completeness | Single-turn tool-calling agents | | `ConversationalAgent()`| All of `Agent` + UserFrustration, ConversationCompleteness, ConversationalSafety, ConversationalToolCallEfficiency, KnowledgeRetention | Multi-turn conversational agents | | `SafetyPreset()` | Safety, ConversationalSafety | Safety-focused evaluation (composable with other presets) | -| `Quality()` | RelevanceToQuery, Fluency, Completeness, **TaskSuccess** | Architecture-independent output quality | +| `Quality()` | RelevanceToQuery, Fluency, Completeness | Architecture-independent output quality | #### Design Rationale @@ -339,207 +291,10 @@ MLflow ships five built-in preset subclasses. Each call creates fresh scorer ins - **Safety is in `Rag` and `Agent`** because these presets aim to be complete starting points. Most users want safety checks without composing two presets. - **Fluency is excluded from `Agent`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `Agent() + [Fluency()]`. - **`ConversationalAgent` excludes `ConversationalRoleAdherence`** because it requires a defined persona in the system prompt, which not all agents have. 
+- **`RetrievalSufficiency` is excluded from `Rag`** because it requires `expected_response` or `expected_facts` (ground truth). Users who have expectations data can add it manually: `Rag() + [RetrievalSufficiency()]`. - **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `Quality() + [Correctness()]`. - **`Guidelines` and `ConversationalGuidelines` are excluded from all presets** because both require a `guidelines` constructor argument. -### Implementation - -#### New file: `mlflow/genai/scorers/presets.py` - -```python -from mlflow.genai.scorers.base import Scorer -from mlflow.genai.scorers.builtin_scorers import ( - Completeness, - ConversationalSafety, - ConversationalToolCallEfficiency, - ConversationCompleteness, - Correctness, - Fluency, - KnowledgeRetention, - RelevanceToQuery, - RetrievalGroundedness, - RetrievalRelevance, - RetrievalSufficiency, - Safety, - ToolCallCorrectness, - ToolCallEfficiency, - UserFrustration, -) - - -class Preset: - def __init__(self, name: str, scorers: list[Scorer]): - self._name = name - self._scorers = tuple(scorers) - - @property - def name(self) -> str: - return self._name - - @property - def scorers(self) -> tuple: - return self._scorers - - def __iter__(self): - return iter(self._scorers) - - def __len__(self): - return len(self._scorers) - - def __add__(self, other): - if isinstance(other, (Preset, list)): - return list(self) + list(other) - return NotImplemented - - def __radd__(self, other): - if isinstance(other, list): - return other + list(self) - return NotImplemented - - def __repr__(self): - scorer_names = [type(s).__name__ for s in self._scorers] - return f"Preset('{self._name}', [{', '.join(scorer_names)}])" - - -class Rag(Preset): - def __init__(self): - super().__init__("rag", [ - RetrievalRelevance(), RetrievalSufficiency(), RetrievalGroundedness(), - RelevanceToQuery(), Safety(), Completeness(), - ]) - -class 
Agent(Preset): - def __init__(self): - super().__init__("agent", [ - ToolCallCorrectness(), ToolCallEfficiency(), - RelevanceToQuery(), Safety(), Completeness(), - ]) - -class ConversationalAgent(Preset): - def __init__(self): - super().__init__("conversational-agent", [ - ToolCallCorrectness(), ToolCallEfficiency(), - RelevanceToQuery(), Safety(), Completeness(), - UserFrustration(), ConversationCompleteness(), - ConversationalSafety(), ConversationalToolCallEfficiency(), - KnowledgeRetention(), - ]) - -class SafetyPreset(Preset): - def __init__(self): - super().__init__("safety", [Safety(), ConversationalSafety()]) - -class Quality(Preset): - def __init__(self): - super().__init__("quality", [ - RelevanceToQuery(), Fluency(), Completeness(), - ]) -``` - -No circular dependency risk: `presets.py` imports from `builtin_scorers.py`, and nothing in the existing chain imports from `presets.py`. - -#### Updated: `mlflow/genai/scorers/__init__.py` - -Add `Preset` and the built-in preset subclasses to `_LAZY_IMPORTS`, `__all__`, and the `TYPE_CHECKING` block. The `__getattr__` function dispatches to the `presets` module: - -```python -_LAZY_IMPORTS_PRESETS = { - "Preset", "Rag", "Agent", "ConversationalAgent", - "SafetyPreset", "Quality", -} - -def __getattr__(name): - if name in _LAZY_IMPORTS: - from mlflow.genai.scorers import builtin_scorers - return getattr(builtin_scorers, name) - if name in _LAZY_IMPORTS_PRESETS: - from mlflow.genai.scorers import presets - return getattr(presets, name) - raise AttributeError(f"module {__name__!r} has no attribute {name!r}") -``` - -#### Updated: `mlflow/genai/scorers/validation.py` - -Flatten presets and deduplicate before validating individual scorers: - -```python -def validate_scorers(scorers: list[Any]) -> list[Scorer]: - if not isinstance(scorers, list): - raise MlflowException.invalid_parameter_value( - "The `scorers` argument must be a list of scorers or presets. 
" - "You can use a built-in preset like `scorers=[Agent()]`, or " - "`scorers=get_all_scorers()` for all available built-in scorers." - ) - - from mlflow.genai.scorers.presets import Preset - - flat = [] - for item in scorers: - if isinstance(item, Preset): - flat.extend(item) - else: - flat.append(item) - - flat = _deduplicate_scorers(flat) - # ... existing validation on the flattened list -``` - -#### Updated: `mlflow/genai/__init__.py` - -Re-export for convenience: - -```python -from mlflow.genai.scorers import Preset -``` - -### Testing Plan - -New file: `tests/genai/scorers/test_presets.py` - - -| Test | Verifies | -| ---------------------------------------- | --------------------------------------------------------- | -| `test_builtin_preset_{rag,agent,...}` | Exact scorer types in each built-in preset | -| `test_custom_preset` | Users can create a `Preset` with arbitrary scorers | -| `test_preset_in_validate_scorers` | `validate_scorers([Agent(), my_scorer])` flattens correctly | -| `test_preset_deduplication` | `[Agent(), SafetyPreset()]` deduplicates shared `Safety()` | -| `test_dedup_preserves_different_names` | Two `Guidelines` with different names are both kept | -| `test_preset_add_list` | `Agent() + [Fluency()]` returns a combined list | -| `test_list_add_preset` | `[Fluency()] + Agent()` returns a combined list | -| `test_preset_add_preset` | `Agent() + SafetyPreset()` returns a combined list | -| `test_preset_iter_and_len` | `list(Agent())` and `len(Agent())` work correctly | -| `test_preset_invalid_scorer_in_validate` | A preset containing a non-scorer raises `MlflowException` | -| `test_preset_repr` | `repr(Agent())` shows name and scorer class names | -| `test_preset_fresh_instances` | `Agent()` creates new scorer instances each time | - - -```python -@pytest.mark.parametrize("preset_cls", [Rag, Agent, ConversationalAgent, SafetyPreset, Quality]) -def test_builtin_preset_contains_valid_scorers(preset_cls): - preset = preset_cls() - assert 
len(preset) > 0 - assert all(isinstance(s, BuiltInScorer) for s in preset) - assert len(list(preset)) == len(set(type(s) for s in preset)) # no duplicates - -def test_preset_fresh_instances(): - a1 = Agent() - a2 = Agent() - # Each call creates new scorer instances - assert a1.scorers[0] is not a2.scorers[0] -``` - -### Files Changed - - -| File | Change | -| ------------------------------------- | ---------------------------------------------------------------- | -| `mlflow/genai/scorers/presets.py` | **New.** `Preset` class and built-in preset subclasses. | -| `mlflow/genai/scorers/__init__.py` | Add lazy imports for `Preset` and built-in presets. | -| `mlflow/genai/__init__.py` | Re-export `Preset`. | -| `mlflow/genai/scorers/validation.py` | Flatten presets and deduplicate in `validate_scorers()`. | -| `tests/genai/scorers/test_presets.py` | **New.** Tests for `Preset` class and built-in presets. | - - ## Drawbacks 1. **New class in the API.** Adds `Preset` to the public surface. Mitigation: it's a simple container with no complex behavior. From 22f64f8443775ac5ec880df4d3dd6e04c7587cab Mon Sep 17 00:00:00 2001 From: Nehanth Date: Wed, 6 May 2026 16:51:20 -0400 Subject: [PATCH 06/12] Add back validate_scorers code block for review context Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 21 ++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index c0b27f8..04a9c83 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -267,7 +267,26 @@ When multiple presets are combined, the same scorer type can appear more than on Deduplication happens in two places: - **In the `Preset` class** — both `__init__` and `__add__` deduplicate using `(type(scorer), scorer.name)` as the key, so the preset is always clean whenever scorers are added or combined. 
-- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), SafetyPreset()]`) without using `+`, `__add__` is never called. `validate_scorers()` flattens and deduplicates as a safety net. +- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), SafetyPreset()]`) without using `+`, `__add__` is never called. `validate_scorers()` flattens and deduplicates as a safety net: + +```python +def validate_scorers(scorers: list[Any]) -> list[Scorer]: + from mlflow.genai.scorers.presets import Preset + + # 1. Flatten presets into individual scorers + flat = [] + for item in scorers: + if isinstance(item, Preset): + flat.extend(item) + else: + flat.append(item) + + # 2. Deduplicate by (type, name) + flat = Preset._deduplicate(flat) + + # 3. Existing validation on the flattened list + ... +``` Scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). Only true duplicates — same class and same name — are removed. 
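The `(type, name)` rule described above can be sketched in isolation. This is a minimal illustration, not MLflow's implementation: `FakeScorer`, the two subclasses, and the free function `deduplicate` are hypothetical stand-ins, and the only assumption carried over from the RFC is that each scorer exposes a `name` attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeScorer:
    """Hypothetical stand-in for an MLflow scorer; only `name` matters here."""

    name: str


class Safety(FakeScorer):
    pass


class Guidelines(FakeScorer):
    pass


def deduplicate(scorers):
    """Keep the first scorer seen for each (type, name) key, preserving order."""
    seen = set()
    unique = []
    for scorer in scorers:
        key = (type(scorer), scorer.name)
        if key not in seen:
            seen.add(key)
            unique.append(scorer)
    return unique


flat = [
    Safety(name="safety"),
    Guidelines(name="tone"),
    Safety(name="safety"),       # true duplicate: same class, same name
    Guidelines(name="brevity"),  # same class, different name: kept
]
result = deduplicate(flat)
kept = [(type(s).__name__, s.name) for s in result]
```

Keeping the first occurrence preserves the order in which the caller listed the scorers, so result columns would stay in a predictable order under this sketch.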
From 287c3ccecd129302bb2e1ed017359a7a4f209d51 Mon Sep 17 00:00:00 2001 From: Nehanth Date: Mon, 1 Jun 2026 10:30:34 -0400 Subject: [PATCH 07/12] Switch from __add__ to __or__ for set union semantics Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 23 ++++++++++--------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 04a9c83..5629de8 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -48,13 +48,14 @@ result = mlflow.genai.evaluate( ``` ```python -# Combine presets -- duplicates are resolved automatically +# Combine presets using | -- duplicates are resolved automatically from mlflow.genai.scorers import Agent, SafetyPreset # Both contain Safety(); it runs once, not twice +scorers = Agent() | SafetyPreset() result = mlflow.genai.evaluate( data=eval_dataset, - scorers=[Agent(), SafetyPreset()], + scorers=scorers, ) ``` @@ -173,13 +174,13 @@ class Preset: def __len__(self): return len(self._scorers) - def __add__(self, other): + def __or__(self, other): if isinstance(other, (Preset, list)): combined = list(self) + list(other) return self._deduplicate(combined) return NotImplemented - def __radd__(self, other): + def __ror__(self, other): if isinstance(other, list): combined = other + list(self) return self._deduplicate(combined) @@ -192,9 +193,9 @@ class Preset: **Key design decisions:** -- **Immutable and deduplicated.** Scorers are stored as a tuple and exposed via a read-only property. Deduplication happens in `__init__` and `__add__` using `(type, name)` as the key, so scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). +- **Immutable and deduplicated.** Scorers are stored as a tuple and exposed via a read-only property. 
Deduplication happens in `__init__` and `__or__` using `(type, name)` as the key, so scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). - **Not a `Scorer` subclass.** A preset doesn't produce feedback -- it's a container. The evaluation loop assumes one scorer = one result column. Making `Preset` a scorer would require changes throughout the pipeline (aggregation, telemetry, serialization). -- **Iterable.** Supports `__iter__`, `__len__`, and `__add__`/`__radd__` so it composes naturally: `Agent() + [my_scorer]`, `[my_scorer] + Agent()`, or `Agent() + SafetyPreset()`. +- **Set union via `|`.** Supports `__or__`/`__ror__` for combining presets with deduplication: `Agent() | [my_scorer]`, `[my_scorer] | Agent()`, or `Agent() | SafetyPreset()`. Uses `|` instead of `+` because the deduplication behavior matches set union semantics. - **Stores instances, not classes.** Users pass already-configured scorer instances. ### Built-in Presets as Subclasses @@ -266,8 +267,8 @@ When multiple presets are combined, the same scorer type can appear more than on Deduplication happens in two places: -- **In the `Preset` class** — both `__init__` and `__add__` deduplicate using `(type(scorer), scorer.name)` as the key, so the preset is always clean whenever scorers are added or combined. -- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), SafetyPreset()]`) without using `+`, `__add__` is never called. `validate_scorers()` flattens and deduplicates as a safety net: +- **In the `Preset` class** — both `__init__` and `__or__` deduplicate using `(type(scorer), scorer.name)` as the key, so the preset is always clean whenever scorers are added or combined. +- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), SafetyPreset()]`) without using `|`, `__or__` is never called. 
`validate_scorers()` flattens and deduplicates as a safety net: ```python def validate_scorers(scorers: list[Any]) -> list[Scorer]: @@ -308,10 +309,10 @@ MLflow ships five built-in preset subclasses. Each call creates fresh scorer ins #### Design Rationale - **Safety is in `Rag` and `Agent`** because these presets aim to be complete starting points. Most users want safety checks without composing two presets. -- **Fluency is excluded from `Agent`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `Agent() + [Fluency()]`. +- **Fluency is excluded from `Agent`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `Agent() | [Fluency()]`. - **`ConversationalAgent` excludes `ConversationalRoleAdherence`** because it requires a defined persona in the system prompt, which not all agents have. -- **`RetrievalSufficiency` is excluded from `Rag`** because it requires `expected_response` or `expected_facts` (ground truth). Users who have expectations data can add it manually: `Rag() + [RetrievalSufficiency()]`. -- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `Quality() + [Correctness()]`. +- **`RetrievalSufficiency` is excluded from `Rag`** because it requires `expected_response` or `expected_facts` (ground truth). Users who have expectations data can add it manually: `Rag() | [RetrievalSufficiency()]`. +- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `Quality() | [Correctness()]`. - **`Guidelines` and `ConversationalGuidelines` are excluded from all presets** because both require a `guidelines` constructor argument. 
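The `|` compositions mentioned in the rationale above depend on Python's reflected-operator fallback: `list` defines no `|`, so `[scorer] | preset` is dispatched to the preset's `__ror__`. The sketch below illustrates this with a hypothetical `MiniPreset` class and plain strings standing in for scorer instances. It is not the RFC's `Preset` class, and it assumes, as the RFC's `__or__` appears to, that the union comes back as a plain list.

```python
class MiniPreset:
    """Hypothetical, simplified container; not the RFC's actual Preset class."""

    def __init__(self, name, scorers):
        self.name = name
        self._scorers = tuple(scorers)

    def __iter__(self):
        return iter(self._scorers)

    @staticmethod
    def _union(items):
        # Plain strings stand in for scorers, so the string itself is the key.
        seen = set()
        out = []
        for item in items:
            if item not in seen:
                seen.add(item)
                out.append(item)
        return out

    def __or__(self, other):
        if isinstance(other, (MiniPreset, list)):
            return self._union(list(self) + list(other))
        return NotImplemented

    def __ror__(self, other):
        # Reached for `list | MiniPreset`, because list defines no `|` operator.
        if isinstance(other, list):
            return self._union(other + list(self))
        return NotImplemented


agent = MiniPreset("agent", ["ToolCallCorrectness", "Safety", "Completeness"])
rag = MiniPreset("rag", ["RetrievalGroundedness", "Safety", "Completeness"])

extended = agent | ["Fluency"]   # preset on the left: __or__
prefixed = ["Fluency"] | agent   # list on the left: falls back to __ror__
combined = agent | rag           # overlapping entries appear once
```

In this sketch the left operand's order wins, so `combined` lists the agent entries first and appends only the entries from `rag` that were not already present.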
## Drawbacks From 9710c792b4c593f0280e68e3cd18e669b6eb1574 Mon Sep 17 00:00:00 2001 From: Nehanth Date: Mon, 1 Jun 2026 10:56:07 -0400 Subject: [PATCH 08/12] Update RFC: add persistence, customization, drop Safety/Quality presets Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 267 +++++++++--------- 1 file changed, 129 insertions(+), 138 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 5629de8..06a2a45 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -10,7 +10,7 @@ rfc_pr: | Author(s) | Nehanth | | ---------------------- | ----------- | -| **Date Last Modified** | 2026-04-28 | +| **Date Last Modified** | 2026-05-26 | | **AI Assistant(s)** | Claude Code | @@ -20,7 +20,7 @@ rfc_pr: MLflow provides 21 built-in scorers for evaluating GenAI outputs, but users have no way to select a coherent subset for a specific evaluation pattern. Today, evaluating an agent requires importing and instantiating 9+ individual scorer classes -- boilerplate that gets copy-pasted across teams and templates. -This RFC proposes a `Preset` class that packages a named collection of scorers. MLflow ships built-in preset subclasses for common evaluation patterns (`Rag`, `Agent`, `ConversationalAgent`, `SafetyPreset`, `Quality`), and users can define their own. Presets can be passed directly in the `scorers` list alongside individual scorers, with automatic deduplication when presets overlap. +This RFC proposes a `Preset` class that packages a named collection of scorers with support for **customization** and **persistence**. MLflow ships three built-in preset subclasses (`Rag`, `Agent`, `ConversationalAgent`) as starting points. Users can create custom presets, persist them to the MLflow server, and share them across teams and sessions. 
Presets can be passed directly in the `scorers` list alongside individual scorers, with automatic deduplication when presets overlap. # Basic Example @@ -28,7 +28,7 @@ This RFC proposes a `Preset` class that packages a named collection of scorers. import mlflow from mlflow.genai.scorers import Agent -# Use a built-in preset directly -- each call creates fresh scorer instances +# Use a built-in preset directly result = mlflow.genai.evaluate( data=eval_dataset, predict_fn=predict_fn, @@ -49,10 +49,10 @@ result = mlflow.genai.evaluate( ```python # Combine presets using | -- duplicates are resolved automatically -from mlflow.genai.scorers import Agent, SafetyPreset +from mlflow.genai.scorers import Agent, Rag -# Both contain Safety(); it runs once, not twice -scorers = Agent() | SafetyPreset() +# Overlapping scorers (e.g. Safety, RelevanceToQuery) run once, not twice +scorers = Agent() | Rag() result = mlflow.genai.evaluate( data=eval_dataset, scorers=scorers, @@ -60,15 +60,19 @@ result = mlflow.genai.evaluate( ``` ```python -# Define a custom preset +# Define a custom preset and persist it for team sharing from mlflow.genai.scorers import Preset, Safety, Fluency my_preset = Preset("my_team_eval", scorers=[Safety(), Fluency(), my_custom_scorer]) -result = mlflow.genai.evaluate( - data=eval_dataset, - scorers=[my_preset, another_scorer], -) +# Register to MLflow server so the team can reuse it +my_preset.register() + +# Later, another team member loads it +from mlflow.genai.scorers import get_preset + +preset = get_preset(name="my_team_eval") +result = mlflow.genai.evaluate(data=eval_dataset, scorers=[preset]) ``` ## Motivation @@ -112,19 +116,18 @@ Every team building agent evaluation follows this same pattern. This creates thr 1. **No built-in grouping.** `get_all_scorers()` returns all 19 default-constructible scorers. Users evaluating a RAG pipeline get `ToolCallCorrectness`; users evaluating an agent get `RetrievalGroundedness`. 
Each unnecessary scorer wastes an LLM API call. 2. **21 scorers to choose from.** Users must read documentation for each scorer to determine relevance. Session-level scorers (e.g., `KnowledgeRetention`) silently produce no results when passed to single-turn evaluation. 3. **Copy-paste problem.** The same scorer lists get duplicated across templates, notebooks, and tutorials. When new scorers are added, existing lists don't pick them up. +4. **No persistence or sharing.** Teams cannot save and share a curated set of scorers. Each team member independently assembles their own list, leading to drift across projects. ### Who Benefits - **New users** get a curated starting point without reading all 21 scorer docs -- **Teams** can define and share custom presets, ensuring consistent evaluation across projects +- **Teams** can define, persist, and share custom presets across sessions and team members - **Template authors** replace hardcoded scorer lists with a single preset - **MLflow maintainers** gain a single place to update when new scorers are added ### Out of Scope -- **Parameterized presets.** Passing `model` or `inference_params` to all scorers in a preset. Users can iterate over the preset's scorers instead. - **Third-party scorer presets.** Integrating presets for DeepEval, RAGAS, or TruLens scorers. -- **Preset registration/storage in the tracking server.** Presets are code-side only. ## Detailed Design @@ -137,8 +140,8 @@ class Preset: """A named, immutable collection of scorers for a common evaluation pattern. Presets can be passed in the ``scorers`` list alongside individual - scorers. They are flattened and deduplicated during validation, - so the evaluation loop only ever sees individual ``Scorer`` instances. + scorers. They are flattened during validation, so the evaluation + loop only ever sees individual ``Scorer`` instances. Args: name: A descriptive name for this preset. 
@@ -147,7 +150,20 @@ class Preset: def __init__(self, name: str, scorers: list[Scorer]): self._name = name - self._scorers = tuple(self._deduplicate(scorers)) + self._validate_no_duplicates(scorers) + self._scorers = tuple(scorers) + + @staticmethod + def _validate_no_duplicates(scorers): + seen = set() + for scorer in scorers: + key = (type(scorer), scorer.name) + if key in seen: + raise MlflowException.invalid_parameter_value( + f"Duplicate scorer: {type(scorer).__name__} with name '{scorer.name}'. " + "Use different names for scorers of the same type." + ) + seen.add(key) @staticmethod def _deduplicate(scorers): @@ -186,6 +202,10 @@ class Preset: return self._deduplicate(combined) return NotImplemented + def register(self, *, experiment_id: str | None = None): + """Register this preset to the MLflow server for team sharing.""" + ... + def __repr__(self): scorer_names = [type(s).__name__ for s in self._scorers] return f"Preset('{self._name}', [{', '.join(scorer_names)}])" @@ -193,14 +213,15 @@ class Preset: **Key design decisions:** -- **Immutable and deduplicated.** Scorers are stored as a tuple and exposed via a read-only property. Deduplication happens in `__init__` and `__or__` using `(type, name)` as the key, so scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). +- **Immutable.** Scorers are stored as a tuple and exposed via a read-only property. +- **Blocks duplicates on construction.** `__init__` raises an error if duplicate scorers (same type and name) are passed. This is explicit — users know immediately if they have a conflict, rather than duplicates being silently removed. +- **Set union via `|`.** Supports `__or__`/`__ror__` for combining presets with deduplication: `Agent() | [my_scorer]` or `Agent() | Rag()`. Uses `|` instead of `+` because the deduplication behavior matches set union semantics. 
Deduplication on `|` is silent because combining presets with overlapping scorers is expected usage. - **Not a `Scorer` subclass.** A preset doesn't produce feedback -- it's a container. The evaluation loop assumes one scorer = one result column. Making `Preset` a scorer would require changes throughout the pipeline (aggregation, telemetry, serialization). -- **Set union via `|`.** Supports `__or__`/`__ror__` for combining presets with deduplication: `Agent() | [my_scorer]`, `[my_scorer] | Agent()`, or `Agent() | SafetyPreset()`. Uses `|` instead of `+` because the deduplication behavior matches set union semantics. - **Stores instances, not classes.** Users pass already-configured scorer instances. ### Built-in Presets as Subclasses -Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future. +Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and supports preset-specific customization. ```python class Agent(Preset): @@ -237,38 +258,93 @@ class ConversationalAgent(Preset): ConversationalToolCallEfficiency(), KnowledgeRetention(), ]) - -class SafetyPreset(Preset): - def __init__(self): - super().__init__("safety", [ - Safety(), - ConversationalSafety(), - ]) - -class Quality(Preset): - def __init__(self): - super().__init__("quality", [ - RelevanceToQuery(), - Fluency(), - Completeness(), - ]) ``` **Why subclasses over instances:** -- **Fresh instances every time.** `Agent()` creates new scorer instances on each call. No shared mutable state — the singleton problem is eliminated entirely. -- **Preset-specific configuration.** Each preset can accept its own parameters in the future (e.g., `Agent(model="openai:/gpt-4o")` to set the judge model for all scorers). 
+- **Fresh instances every time.** `Agent()` creates new scorer instances on each call. No shared mutable state. +- **Preset-specific customization.** Each preset can accept its own parameters (e.g., `Agent(model="openai:/gpt-4o")` to set the judge model for all scorers). - **Type checking.** `isinstance(preset, Agent)` works — code can distinguish which preset is being used. - **Custom control flow.** Each preset can override methods for preset-specific validation or behavior. +### Customization + +Users can customize presets in several ways: + +**Combine with additional scorers using `|`:** + +```python +scorers = Agent() | [Fluency(), Guidelines(name="tone", guidelines=["Be professional"])] +``` + +**Create a custom preset from scratch:** + +```python +my_preset = Preset("my_eval", scorers=[ + ToolCallCorrectness(), + Safety(), + my_custom_scorer, +]) +``` + +**Subclass a built-in preset to add defaults:** + +```python +class MyAgent(Agent): + def __init__(self): + super().__init__() + # Add team-specific scorers + self._scorers = self._scorers + (Fluency(), my_compliance_scorer) +``` + +### Persistence + +Presets can be registered to the MLflow server so teams can share them across sessions. This leverages the existing scorer registration infrastructure. 
+ +**Register a preset:** + +```python +my_preset = Preset("my_team_agent", scorers=[ + ToolCallCorrectness(), + Safety(), + Fluency(), +]) + +# Register to the active experiment +my_preset.register() + +# Or register to a specific experiment +my_preset.register(experiment_id="123") +``` + +**Load a persisted preset:** + +```python +from mlflow.genai.scorers import get_preset + +# Load from the active experiment +preset = get_preset(name="my_team_agent") + +# Load from a specific experiment +preset = get_preset(name="my_team_agent", experiment_id="123") + +result = mlflow.genai.evaluate(data=eval_dataset, scorers=[preset]) +``` + +**Why persistence matters:** + +- **Version stability.** Persisted presets are snapshots — they don't change when MLflow upgrades. Built-in presets serve as starting points; teams persist their own versions for stability. +- **Team sharing.** A persisted preset is available to any team member with access to the experiment. +- **Customization without code.** Teams can customize and persist presets without modifying source code or templates. + ### Deduplication -When multiple presets are combined, the same scorer type can appear more than once. For example, `Agent()` and `SafetyPreset()` both contain `Safety()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. +When presets are combined using `|`, the same scorer type can appear more than once. For example, `Agent()` and `Rag()` both contain `Safety()` and `RelevanceToQuery()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. Deduplication happens in two places: -- **In the `Preset` class** — both `__init__` and `__or__` deduplicate using `(type(scorer), scorer.name)` as the key, so the preset is always clean whenever scorers are added or combined. 
-- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), SafetyPreset()]`) without using `|`, `__or__` is never called. `validate_scorers()` flattens and deduplicates as a safety net: +- **In `__or__`** — when presets are combined using `|`, duplicates are removed using `(type(scorer), scorer.name)` as the key. This is expected behavior when combining presets with overlapping scorers. +- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), Rag()]`) without using `|`, `__or__` is never called. `validate_scorers()` flattens and deduplicates as a safety net: ```python def validate_scorers(scorers: list[Any]) -> list[Scorer]: @@ -295,15 +371,13 @@ Scorers of the same class with different names are preserved (e.g., two `Guideli ### Built-in Preset Summary -MLflow ships five built-in preset subclasses. Each call creates fresh scorer instances. +MLflow ships three built-in preset subclasses as starting points. Each call creates fresh scorer instances. Users can customize and persist their own presets. 
| Preset | Scorers | Use Case | | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- | | `Rag()` | RetrievalRelevance, RetrievalGroundedness, RelevanceToQuery, Safety, Completeness | Retrieval-augmented generation pipelines | | `Agent()` | ToolCallCorrectness, ToolCallEfficiency, RelevanceToQuery, Safety, Completeness | Single-turn tool-calling agents | | `ConversationalAgent()`| All of `Agent` + UserFrustration, ConversationCompleteness, ConversationalSafety, ConversationalToolCallEfficiency, KnowledgeRetention | Multi-turn conversational agents | -| `SafetyPreset()` | Safety, ConversationalSafety | Safety-focused evaluation (composable with other presets) | -| `Quality()` | RelevanceToQuery, Fluency, Completeness | Architecture-independent output quality | #### Design Rationale @@ -312,101 +386,21 @@ MLflow ships five built-in preset subclasses. Each call creates fresh scorer ins - **Fluency is excluded from `Agent`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `Agent() | [Fluency()]`. - **`ConversationalAgent` excludes `ConversationalRoleAdherence`** because it requires a defined persona in the system prompt, which not all agents have. - **`RetrievalSufficiency` is excluded from `Rag`** because it requires `expected_response` or `expected_facts` (ground truth). Users who have expectations data can add it manually: `Rag() | [RetrievalSufficiency()]`. -- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `Quality() | [Correctness()]`. +- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. 
- **`Guidelines` and `ConversationalGuidelines` are excluded from all presets** because both require a `guidelines` constructor argument. +- **Only three built-in presets** (Rag, Agent, ConversationalAgent) — these represent clear, distinct evaluation patterns. Other groupings (e.g., safety, quality) are too vague or too small to justify a built-in preset. Users can create and persist their own groupings for their specific needs. ## Drawbacks -1. **New class in the API.** Adds `Preset` to the public surface. Mitigation: it's a simple container with no complex behavior. -2. **Opinionated defaults.** Not everyone will agree on which scorers belong in which preset. Mitigation: presets are extensible via `+`, and users can define their own. -3. **Implicit behavior changes on upgrade.** A new scorer added to a built-in preset means different evaluation results after upgrading. Consistent with how `get_all_scorers()` already behaves. +1. **New class in the API.** Adds `Preset` to the public surface. Mitigation: it's a simple container with persistence support. +2. **Opinionated defaults.** Not everyone will agree on which scorers belong in which preset. Mitigation: presets are extensible via `|`, and users can create and persist their own. +3. **Persistence adds scope.** Supporting preset registration and retrieval increases implementation complexity. Mitigation: leverages the existing scorer registration infrastructure. # Alternatives ### 1. 
`get_preset()` function (no class) -Instead of a `Preset` class, provide a simple function that returns a plain list: - -```python -from typing import Literal - -from mlflow.exceptions import MlflowException -from mlflow.genai.scorers.builtin_scorers import ( - Completeness, - ConversationalSafety, - ConversationalToolCallEfficiency, - ConversationCompleteness, - Correctness, - Fluency, - KnowledgeRetention, - RelevanceToQuery, - RetrievalGroundedness, - RetrievalRelevance, - RetrievalSufficiency, - Safety, - ToolCallCorrectness, - ToolCallEfficiency, - UserFrustration, -) - -_PRESETS: dict[str, list[type]] = { - "rag": [ - RetrievalRelevance, - RetrievalSufficiency, - RetrievalGroundedness, - RelevanceToQuery, - Safety, - Completeness, - ], - "agent": [ - ToolCallCorrectness, - ToolCallEfficiency, - RelevanceToQuery, - Safety, - Completeness, - ], - "conversational-agent": [ - ToolCallCorrectness, - ToolCallEfficiency, - RelevanceToQuery, - Safety, - Completeness, - UserFrustration, - ConversationCompleteness, - ConversationalSafety, - ConversationalToolCallEfficiency, - KnowledgeRetention, - ], - "safety": [ - Safety, - ConversationalSafety, - ], - "quality": [ - RelevanceToQuery, - Fluency, - Completeness, - Correctness, - ], -} - -_VALID_PRESET_NAMES = ", ".join(sorted(_PRESETS.keys())) -PresetName = Literal["rag", "agent", "conversational-agent", "safety", "quality"] - - -def get_preset(name: PresetName) -> list: - if name not in _PRESETS: - raise MlflowException.invalid_parameter_value( - f"Unknown preset '{name}'. Valid presets are: {_VALID_PRESET_NAMES}" - ) - return [scorer_class() for scorer_class in _PRESETS[name]] - - -def list_presets() -> dict[str, list[str]]: - return { - name: [cls.__name__ for cls in classes] - for name, classes in _PRESETS.items() - } -``` +Instead of a `Preset` class, provide a simple function that returns a plain list. This approach is simpler and also supports persistence and customization. 
Usage: @@ -421,11 +415,9 @@ scorers = get_preset("agent") + [Guidelines(name="tone", guidelines=["Be profess result = mlflow.genai.evaluate(scorers=scorers) ``` -**Pros:** Simpler (~30 lines). No validation changes needed. Returns fresh instances each call (no mutable singleton concern). `Literal` type gives IDE autocompletion. Going from function to class later is non-breaking. +**Pros:** Simpler. No validation changes needed. Returns fresh instances each call. `Literal` type gives IDE autocompletion. Going from function to class later is non-breaking. Can also support persistence by registering and loading presets by name. -**Cons:** No user-defined presets. Composition requires `+` with list concatenation. The preset concept disappears immediately -- it's just a list. No deduplication when combining presets. - -This is a viable first step if the class approach is deemed too heavy. The class can be added later as a non-breaking extension. +**Cons:** No user-defined preset objects. Composition requires `+` with list concatenation. The preset concept disappears immediately -- it's just a list. No deduplication when combining presets. If this approach is preferred, the RFC can be updated to use it. ### 2. Tag-based filtering @@ -433,7 +425,7 @@ Add `categories` to each scorer class and provide `get_scorers(categories=["rag" ### 3. Enum-based API -`ScorerPreset.RAG.get_scorers()`. Type-safe but heavier API surface. The `Literal` type on a function already provides IDE autocompletion. +`ScorerPreset.RAG.get_scorers()`. Type-safe but heavier API surface. ### 4. Do nothing @@ -446,12 +438,11 @@ This is an **additive, non-breaking change**. Existing code continues to work un - Update documentation and templates to show `Preset` usage alongside the manual import pattern. - Update the `validate_scorers()` error message to mention presets for discoverability. - Databricks agent templates can simplify from 9 imports + 9 instantiations to `scorers=[ConversationalAgent()]`. 
+- Teams can persist their customized presets and share them across projects. # Open Questions 1. **Should `ConversationalRoleAdherence` be in `ConversationalAgent`?** Currently excluded because it requires a defined persona. **Open for discussion.** 2. **Should `Correctness` be in `Agent` or `Rag`?** Currently excluded from all presets because it requires `expectations` data. **Open for discussion.** -3. **Should there be an `All` preset?** `get_all_scorers()` already serves this role. **Recommendation:** Do not add. -4. **Deduplication key.** Should deduplication use `type(scorer)` alone, or `(type(scorer), scorer.name)`? The latter preserves multiple instances of the same class with different names (e.g., two `Guidelines` with different rules). -5. **Future: parameterized presets?** e.g., `Agent(model="openai:/gpt-4o")` to set the judge model for all scorers in the preset. Can be a future addition. - +3. **Deduplication key.** Should deduplication use `type(scorer)` alone, or `(type(scorer), scorer.name)`? The latter preserves multiple instances of the same class with different names (e.g., two `Guidelines` with different rules). +4. **Class vs function for persistence.** The class-based approach is more ergonomic, while the function-based approach may be more flexible for persistence. Both support customization and persistence. The class approach is proposed as the primary design, with the function approach as a viable alternative. 
From 040ecb785f0a5225b86ec328306fa36cb91d1798 Mon Sep 17 00:00:00 2001 From: Nehanth Date: Mon, 1 Jun 2026 11:08:54 -0400 Subject: [PATCH 09/12] Clean up alternatives and open questions Co-Authored-By: Claude Signed-off-by: Nehanth --- rfcs/0007-scorer-presets/0007-scorer-presets.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 06a2a45..e5d274a 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -413,11 +413,18 @@ result = mlflow.genai.evaluate(scorers=get_preset("agent")) # Extending a preset scorers = get_preset("agent") + [Guidelines(name="tone", guidelines=["Be professional"])] result = mlflow.genai.evaluate(scorers=scorers) + +# Create a custom preset and persist it +register_preset("my_team_agent", scorers=[Safety(), Fluency(), my_custom_scorer]) + +# Load it later +scorers = get_preset("my_team_agent") +result = mlflow.genai.evaluate(scorers=scorers) ``` **Pros:** Simpler. No validation changes needed. Returns fresh instances each call. `Literal` type gives IDE autocompletion. Going from function to class later is non-breaking. Can also support persistence by registering and loading presets by name. -**Cons:** No user-defined preset objects. Composition requires `+` with list concatenation. The preset concept disappears immediately -- it's just a list. No deduplication when combining presets. If this approach is preferred, the RFC can be updated to use it. +**Cons:** No user-defined preset objects. Composition requires `+` with list concatenation. The preset concept disappears immediately -- it's just a list. No deduplication when combining presets. If this approach is preferred, the RFC can be updated to use it. This is a viable alternative if the class approach is deemed too heavy. ### 2. 
Tag-based filtering @@ -444,5 +451,3 @@ This is an **additive, non-breaking change**. Existing code continues to work un 1. **Should `ConversationalRoleAdherence` be in `ConversationalAgent`?** Currently excluded because it requires a defined persona. **Open for discussion.** 2. **Should `Correctness` be in `Agent` or `Rag`?** Currently excluded from all presets because it requires `expectations` data. **Open for discussion.** -3. **Deduplication key.** Should deduplication use `type(scorer)` alone, or `(type(scorer), scorer.name)`? The latter preserves multiple instances of the same class with different names (e.g., two `Guidelines` with different rules). -4. **Class vs function for persistence.** The class-based approach is more ergonomic, while the function-based approach may be more flexible for persistence. Both support customization and persistence. The class approach is proposed as the primary design, with the function approach as a viable alternative. From 4c6388cb99625236216f8478c9bc209a6fbcebfc Mon Sep 17 00:00:00 2001 From: Nehanth Date: Mon, 1 Jun 2026 11:20:58 -0400 Subject: [PATCH 10/12] Update date, simplify open questions Co-Authored-By: Claude Signed-off-by: Nehanth --- rfcs/0007-scorer-presets/0007-scorer-presets.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index e5d274a..9c584b0 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -10,7 +10,7 @@ rfc_pr: | Author(s) | Nehanth | | ---------------------- | ----------- | -| **Date Last Modified** | 2026-05-26 | +| **Date Last Modified** | 2026-06-01 | | **AI Assistant(s)** | Claude Code | @@ -449,5 +449,4 @@ This is an **additive, non-breaking change**. Existing code continues to work un # Open Questions -1. 
**Should `ConversationalRoleAdherence` be in `ConversationalAgent`?** Currently excluded because it requires a defined persona. **Open for discussion.** -2. **Should `Correctness` be in `Agent` or `Rag`?** Currently excluded from all presets because it requires `expectations` data. **Open for discussion.** +1. **Class-based vs function-based approach.** The class-based approach is proposed as the primary design for its ergonomics and customization support. The function-based approach is a viable alternative that may be more flexible for persistence. Both approaches were discussed during review. From 6ab572eacf20e6633f6bde40cd59c2d2379bbaab Mon Sep 17 00:00:00 2001 From: Nehanth Date: Mon, 1 Jun 2026 11:41:52 -0400 Subject: [PATCH 11/12] Trim RFC: simplify design rationale and alternatives Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 34 ++----------------- 1 file changed, 2 insertions(+), 32 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 9c584b0..8262074 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -382,13 +382,7 @@ MLflow ships three built-in preset subclasses as starting points. Each call crea #### Design Rationale -- **Safety is in `Rag` and `Agent`** because these presets aim to be complete starting points. Most users want safety checks without composing two presets. -- **Fluency is excluded from `Agent`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `Agent() | [Fluency()]`. -- **`ConversationalAgent` excludes `ConversationalRoleAdherence`** because it requires a defined persona in the system prompt, which not all agents have. -- **`RetrievalSufficiency` is excluded from `Rag`** because it requires `expected_response` or `expected_facts` (ground truth). 
Users who have expectations data can add it manually: `Rag() | [RetrievalSufficiency()]`. -- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. -- **`Guidelines` and `ConversationalGuidelines` are excluded from all presets** because both require a `guidelines` constructor argument. -- **Only three built-in presets** (Rag, Agent, ConversationalAgent) — these represent clear, distinct evaluation patterns. Other groupings (e.g., safety, quality) are too vague or too small to justify a built-in preset. Users can create and persist their own groupings for their specific needs. +- **Only three built-in presets** (Rag, Agent, ConversationalAgent) — these represent clear, distinct evaluation patterns. Users can create and persist their own groupings for specific needs. ## Drawbacks @@ -400,31 +394,7 @@ MLflow ships three built-in preset subclasses as starting points. Each call crea ### 1. `get_preset()` function (no class) -Instead of a `Preset` class, provide a simple function that returns a plain list. This approach is simpler and also supports persistence and customization. - -Usage: - -```python -from mlflow.genai.scorers import get_preset - -# Simple usage -result = mlflow.genai.evaluate(scorers=get_preset("agent")) - -# Extending a preset -scorers = get_preset("agent") + [Guidelines(name="tone", guidelines=["Be professional"])] -result = mlflow.genai.evaluate(scorers=scorers) - -# Create a custom preset and persist it -register_preset("my_team_agent", scorers=[Safety(), Fluency(), my_custom_scorer]) - -# Load it later -scorers = get_preset("my_team_agent") -result = mlflow.genai.evaluate(scorers=scorers) -``` - -**Pros:** Simpler. No validation changes needed. Returns fresh instances each call. `Literal` type gives IDE autocompletion. Going from function to class later is non-breaking. Can also support persistence by registering and loading presets by name. - -**Cons:** No user-defined preset objects. 
Composition requires `+` with list concatenation. The preset concept disappears immediately -- it's just a list. No deduplication when combining presets. If this approach is preferred, the RFC can be updated to use it. This is a viable alternative if the class approach is deemed too heavy. +Instead of a `Preset` class, provide a simple function that returns a plain list. Simpler to implement and can also support persistence via `register_preset()` / `get_preset()`. ### 2. Tag-based filtering From 5591ffaa8cecc72d5eb2da9a9f1bf274a6133876 Mon Sep 17 00:00:00 2001 From: Nehanth Date: Tue, 16 Jun 2026 11:37:06 -0400 Subject: [PATCH 12/12] Address Bill's review: persistence UX, trim implementation details Co-Authored-By: Claude Signed-off-by: Nehanth --- .../0007-scorer-presets.md | 159 +++--------------- 1 file changed, 19 insertions(+), 140 deletions(-) diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md index 8262074..ba5cc43 100644 --- a/rfcs/0007-scorer-presets/0007-scorer-presets.md +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -10,7 +10,7 @@ rfc_pr: | Author(s) | Nehanth | | ---------------------- | ----------- | -| **Date Last Modified** | 2026-06-01 | +| **Date Last Modified** | 2026-06-16 | | **AI Assistant(s)** | Claude Code | @@ -137,128 +137,30 @@ A `Preset` is a named, iterable container of scorers. It is **not** a `Scorer` s ```python class Preset: - """A named, immutable collection of scorers for a common evaluation pattern. - - Presets can be passed in the ``scorers`` list alongside individual - scorers. They are flattened during validation, so the evaluation - loop only ever sees individual ``Scorer`` instances. - - Args: - name: A descriptive name for this preset. - scorers: The list of scorer instances in this preset. 
- """ - - def __init__(self, name: str, scorers: list[Scorer]): - self._name = name - self._validate_no_duplicates(scorers) - self._scorers = tuple(scorers) - - @staticmethod - def _validate_no_duplicates(scorers): - seen = set() - for scorer in scorers: - key = (type(scorer), scorer.name) - if key in seen: - raise MlflowException.invalid_parameter_value( - f"Duplicate scorer: {type(scorer).__name__} with name '{scorer.name}'. " - "Use different names for scorers of the same type." - ) - seen.add(key) - - @staticmethod - def _deduplicate(scorers): - seen = set() - result = [] - for scorer in scorers: - key = (type(scorer), scorer.name) - if key not in seen: - seen.add(key) - result.append(scorer) - return result - + def __init__(self, name: str, scorers: list[Scorer]): ... + def __or__(self, other) -> "Preset": ... # set union with deduplication + def __ror__(self, other) -> "Preset": ... + def register(self, *, experiment_id: str | None = None): ... @property - def name(self) -> str: - return self._name - + def name(self) -> str: ... @property - def scorers(self) -> tuple: - return self._scorers - - def __iter__(self): - return iter(self._scorers) - - def __len__(self): - return len(self._scorers) - - def __or__(self, other): - if isinstance(other, (Preset, list)): - combined = list(self) + list(other) - return self._deduplicate(combined) - return NotImplemented - - def __ror__(self, other): - if isinstance(other, list): - combined = other + list(self) - return self._deduplicate(combined) - return NotImplemented - - def register(self, *, experiment_id: str | None = None): - """Register this preset to the MLflow server for team sharing.""" - ... - - def __repr__(self): - scorer_names = [type(s).__name__ for s in self._scorers] - return f"Preset('{self._name}', [{', '.join(scorer_names)}])" + def scorers(self) -> tuple: ... + def __iter__(self): ... + def __len__(self): ... + def __repr__(self): ... 
``` **Key design decisions:** - **Immutable.** Scorers are stored as a tuple and exposed via a read-only property. - **Blocks duplicates on construction.** `__init__` raises an error if duplicate scorers (same type and name) are passed. This is explicit — users know immediately if they have a conflict, rather than duplicates being silently removed. -- **Set union via `|`.** Supports `__or__`/`__ror__` for combining presets with deduplication: `Agent() | [my_scorer]` or `Agent() | Rag()`. Uses `|` instead of `+` because the deduplication behavior matches set union semantics. Deduplication on `|` is silent because combining presets with overlapping scorers is expected usage. +- **Set union via `|`.** Combines presets with deduplication and returns a new `Preset`: `Agent() | [my_scorer]` or `Agent() | Rag()`. Results can be chained and registered. Uses `|` instead of `+` because the deduplication behavior matches set union semantics. - **Not a `Scorer` subclass.** A preset doesn't produce feedback -- it's a container. The evaluation loop assumes one scorer = one result column. Making `Preset` a scorer would require changes throughout the pipeline (aggregation, telemetry, serialization). - **Stores instances, not classes.** Users pass already-configured scorer instances. ### Built-in Presets as Subclasses -Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and supports preset-specific customization. 
- -```python -class Agent(Preset): - def __init__(self): - super().__init__("agent", [ - ToolCallCorrectness(), - ToolCallEfficiency(), - RelevanceToQuery(), - Safety(), - Completeness(), - ]) - -class Rag(Preset): - def __init__(self): - super().__init__("rag", [ - RetrievalRelevance(), - RetrievalGroundedness(), - RelevanceToQuery(), - Safety(), - Completeness(), - ]) - -class ConversationalAgent(Preset): - def __init__(self): - super().__init__("conversational-agent", [ - ToolCallCorrectness(), - ToolCallEfficiency(), - RelevanceToQuery(), - Safety(), - Completeness(), - UserFrustration(), - ConversationCompleteness(), - ConversationalSafety(), - ConversationalToolCallEfficiency(), - KnowledgeRetention(), - ]) -``` +Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. Each call creates **fresh scorer instances** (no shared mutable singletons) and supports preset-specific customization. See the Built-in Preset Summary table below for the scorers in each preset. **Why subclasses over instances:** @@ -287,16 +189,6 @@ my_preset = Preset("my_eval", scorers=[ ]) ``` -**Subclass a built-in preset to add defaults:** - -```python -class MyAgent(Agent): - def __init__(self): - super().__init__() - # Add team-specific scorers - self._scorers = self._scorers + (Fluency(), my_compliance_scorer) -``` - ### Persistence Presets can be registered to the MLflow server so teams can share them across sessions. This leverages the existing scorer registration infrastructure. @@ -337,6 +229,12 @@ result = mlflow.genai.evaluate(data=eval_dataset, scorers=[preset]) - **Team sharing.** A persisted preset is available to any team member with access to the experiment. - **Customization without code.** Teams can customize and persist presets without modifying source code or templates. +**Persistence behavior:** + +- **Scope.** Presets are scoped to experiments, consistent with how scorer registration already works in MLflow. 
This prevents name collisions across teams and ensures presets are organized alongside the experiments they evaluate. If no `experiment_id` is provided, the active experiment is used. +- **Custom scorer portability.** If a preset contains custom scorers, those scorers must be registered first. When a teammate loads the preset, the custom scorers are resolved from the registry. If a custom scorer is not registered, `preset.register()` will raise an error. +- **Discovery.** `list_presets()` returns all registered presets for the current experiment, allowing teams to discover what presets are available. This follows the same pattern as `list_scorers()`. + ### Deduplication When presets are combined using `|`, the same scorer type can appear more than once. For example, `Agent()` and `Rag()` both contain `Safety()` and `RelevanceToQuery()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. @@ -344,26 +242,7 @@ When presets are combined using `|`, the same scorer type can appear more than o Deduplication happens in two places: - **In `__or__`** — when presets are combined using `|`, duplicates are removed using `(type(scorer), scorer.name)` as the key. This is expected behavior when combining presets with overlapping scorers. -- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), Rag()]`) without using `|`, `__or__` is never called. `validate_scorers()` flattens and deduplicates as a safety net: - -```python -def validate_scorers(scorers: list[Any]) -> list[Scorer]: - from mlflow.genai.scorers.presets import Preset - - # 1. Flatten presets into individual scorers - flat = [] - for item in scorers: - if isinstance(item, Preset): - flat.extend(item) - else: - flat.append(item) - - # 2. Deduplicate by (type, name) - flat = Preset._deduplicate(flat) - - # 3. Existing validation on the flattened list - ... 
-``` +- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), Rag()]`) without using `|`, `__or__` is never called. `validate_scorers()` flattens presets into individual scorers and deduplicates as a safety net. Scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). Only true duplicates — same class and same name — are removed.
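
The flatten-then-deduplicate behavior described above can be sketched as follows. This is a minimal, self-contained illustration and not MLflow's implementation: `StubScorer`, `StubPreset`, and `flatten_and_dedupe` are hypothetical stand-ins introduced here, and the stub classes named `Safety`, `RelevanceToQuery`, and `Guidelines` only borrow the names of the real scorers. The one detail taken from the RFC is the deduplication key, `(type(scorer), scorer.name)`.

```python
class StubScorer:
    """Hypothetical stand-in for a scorer; only the class and `name` matter here."""

    def __init__(self, name=None):
        # Assumption: a scorer falls back to a class-derived default name.
        self.name = name or type(self).__name__.lower()


class Safety(StubScorer):
    pass


class RelevanceToQuery(StubScorer):
    pass


class Guidelines(StubScorer):
    pass


class StubPreset:
    """Hypothetical stand-in for Preset: a named, iterable container of scorers."""

    def __init__(self, name, scorers):
        self.name = name
        self._scorers = tuple(scorers)

    def __iter__(self):
        return iter(self._scorers)


def flatten_and_dedupe(items):
    """Expand presets into their scorers, keeping the first of each (type, name)."""
    seen = set()
    result = []
    for item in items:
        # A preset contributes all of its scorers; a bare scorer contributes itself.
        members = item if isinstance(item, StubPreset) else [item]
        for scorer in members:
            key = (type(scorer), scorer.name)
            if key not in seen:
                seen.add(key)
                result.append(scorer)
    return result


agent = StubPreset("agent", [RelevanceToQuery(), Safety()])
rag = StubPreset("rag", [Safety(), RelevanceToQuery()])
tone = Guidelines(name="tone")
brevity = Guidelines(name="brevity")

# Two presets passed side by side in a list, plus two same-class scorers
# that differ only by name.
flat = flatten_and_dedupe([agent, rag, tone, brevity])
names = [(type(s).__name__, s.name) for s in flat]
print(names)
```

In this sketch the overlapping `Safety` and `RelevanceToQuery` entries from the second preset are dropped, while both `Guidelines` instances survive because their names differ. First-seen order is kept, so result columns would follow the order in which scorers were listed.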