99 changes: 99 additions & 0 deletions superset/mcp_service/dataset/schemas.py
@@ -93,6 +93,29 @@ class TableColumnInfo(BaseModel):
    filterable: bool | None = Field(None, description="Is filterable")
    description: str | None = Field(None, description="Column description")

    @model_serializer(mode="wrap")
    def _filter_column_fields_by_context(
        self, serializer: Any, info: Any
    ) -> Dict[str, Any]:
        """Filter column fields based on serialization context.

        If context contains 'column_fields', only include those fields.
        Otherwise, include all fields. This trims wide datasets so a
        50-column dataset doesn't ship 50 long descriptions when the
        caller only needs column_name + type.
        """
        data = serializer(self)

        if info.context and isinstance(info.context, dict):
            column_fields = info.context.get("column_fields")
            if column_fields:
                requested = set(column_fields)
                # Always preserve column_name as the only required field
                requested.add("column_name")
                return {k: v for k, v in data.items() if k in requested}
Comment on lines +109 to +115

Contributor
Suggestion: The column filtering check treats an explicitly provided empty list as "no filter" and returns all column fields. This is a logic bug because callers can pass column_fields=[] (or values that parse to an empty list) and unexpectedly get verbose fields like description for every column, which defeats the payload-size reduction and can reintroduce oversized responses/timeouts. Handle empty lists as a valid filter input (e.g., still enforce the minimal required field set) instead of falling back to full serialization. [logic error]

Severity Level: Critical 🚨
- ❌ MCP `get_dataset_info` cannot honor explicit empty column_fields.
- ⚠️ Wide datasets may still return verbose per-column descriptions.
Steps of Reproduction ✅
1. In `superset/tests/unit_tests/mcp_service/dataset/tool/test_dataset_tools.py:18-31`, copy the pattern of `test_get_dataset_info_respects_column_fields` but change the request payload to use an empty list for `column_fields`:

   `{"request": {"identifier": 3, "select_columns": ["id", "columns"], "column_fields": []}}`.

2. This request is validated into `GetDatasetInfoRequest` in
`superset/mcp_service/dataset/schemas.py:172-221`; the `@field_validator("column_fields")`
calls `parse_json_or_list` (see `schema_utils.py:111-151`), which returns `[]` unchanged
for a Python list, so `request.column_fields` is an empty list, not `None`.

3. The MCP tool handler `get_dataset_info` in
`superset/mcp_service/dataset/tool/get_dataset_info.py:21-27` fetches a `DatasetInfo`
instance, then at lines 119-126 calls `result.model_dump(..., context={"select_columns":
request.select_columns, "column_fields": request.column_fields})`, so
`info.context["column_fields"]` is `[]` for this call.

4. During serialization, each `TableColumnInfo` is processed by
`_filter_column_fields_by_context` in `superset/mcp_service/dataset/schemas.py:27-48`;
`info.context` is a dict and `column_fields` is `[]`, so the `if column_fields:` check at
lines 40-42 evaluates false and the method returns `data` unfiltered at line 48, including
verbose fields like `description`, `groupby`, `filterable`, etc. This contradicts the
request's explicit `column_fields=[]` and re-expands column payloads, undermining the PR's
goal of trimming oversized responses.


Contributor Author

@aminghadersohi aminghadersohi May 5, 2026

Good catch. Fixed in b888574 — both select_columns=[] and column_fields=[] now coerce to the lean default in the field validators rather than falling through to "no filter". Added a regression test (test_get_dataset_info_empty_lists_fall_back_to_defaults) that passes an empty list and asserts the lean defaults are still applied.


        return data
Comment on lines +109 to +117

Contributor
Incorrect Filtering Logic

The serializer skips filtering when column_fields is an empty list, returning all fields instead of just column_name. This can lead to unexpectedly large responses if users pass [] intending minimal output. Update the logic to always filter when the key is present, defaulting to only column_name for empty lists.

Code suggestion
Check the AI-generated fix before applying

Suggested change:

-        if info.context and isinstance(info.context, dict):
-            column_fields = info.context.get("column_fields")
-            if column_fields:
-                requested = set(column_fields)
-                # Always preserve column_name as the only required field
-                requested.add("column_name")
-                return {k: v for k, v in data.items() if k in requested}
-        return data
+        if info.context and isinstance(info.context, dict):
+            column_fields = info.context.get("column_fields")
+            if "column_fields" in info.context:
+                requested = set(column_fields or [])
+                # Always preserve column_name as the only required field
+                requested.add("column_name")
+                return {k: v for k, v in data.items() if k in requested}
+        return data
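The behavioral difference between the truthiness check and the membership check this suggestion proposes can be verified in isolation with plain dicts (illustrative names, no Pydantic needed):

```python
# Truthiness vs membership: why `if column_fields:` and
# `if "column_fields" in context:` diverge for an explicit empty list.
def keep_fields(data: dict, context: dict) -> dict:
    column_fields = context.get("column_fields")
    if "column_fields" in context:  # membership: [] still counts as a filter
        requested = set(column_fields or [])
        requested.add("column_name")
        return {k: v for k, v in data.items() if k in requested}
    return data


row = {"column_name": "ds", "type": "TIMESTAMP", "description": "long text"}

# An empty list now collapses to the minimal required field instead of
# silently returning everything:
assert keep_fields(row, {"column_fields": []}) == {"column_name": "ds"}

# No key at all still means "no filter":
assert keep_fields(row, {}) == row

# A non-empty list behaves as before:
assert keep_fields(row, {"column_fields": ["type"]}) == {
    "column_name": "ds",
    "type": "TIMESTAMP",
}
```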

Code Review Run #f4f326


Contributor Author

Already addressed via a different layer in b888574 — the _parse_column_fields field_validator on GetDatasetInfoRequest coerces an empty list (and None) back to DEFAULT_GET_DATASET_INFO_COLUMN_FIELDS, so the serializer never sees an empty column_fields at runtime. Same outcome you wanted (no caller silently re-enables verbose per-column fields), but resolved before the model is constructed rather than inside the serializer. Regression test in test_get_dataset_info_empty_lists_fall_back_to_defaults and test_get_dataset_info_request_empty_lists_use_defaults.
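The validator-layer fix the author describes can be sketched standalone. The names below (`DEFAULT_COLUMN_FIELDS`, `GetInfoRequest`, `_coerce_empty`) are illustrative stand-ins, and the sketch omits the `parse_json_or_list` step the real validator routes string input through:

```python
from typing import Any, List

from pydantic import BaseModel, Field, field_validator

DEFAULT_COLUMN_FIELDS: List[str] = ["column_name", "type", "is_dttm"]


class GetInfoRequest(BaseModel):
    column_fields: List[str] = Field(
        default_factory=lambda: list(DEFAULT_COLUMN_FIELDS)
    )

    @field_validator("column_fields", mode="before")
    @classmethod
    def _coerce_empty(cls, value: Any) -> Any:
        # Both None and [] mean "use the lean default"; only an explicit
        # non-empty list overrides it, so [] can never disable filtering
        # downstream in the serializer.
        if not value:
            return list(DEFAULT_COLUMN_FIELDS)
        return value


assert GetInfoRequest().column_fields == DEFAULT_COLUMN_FIELDS
assert GetInfoRequest(column_fields=[]).column_fields == DEFAULT_COLUMN_FIELDS
assert GetInfoRequest(column_fields=["column_name"]).column_fields == ["column_name"]
```

Coercing at the validator keeps the serializer simple: by the time the model is constructed, an empty filter no longer exists as a state it has to handle.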



class SqlMetricInfo(BaseModel):
    metric_name: str = Field(

@@ -311,13 +334,89 @@ def create(cls, error: str, error_type: str) -> "DatasetError":
    )


DEFAULT_GET_DATASET_INFO_COLUMNS: List[str] = [
    "id",
    "table_name",
    "schema",
    "database_name",
    "database_id",
    "uuid",
    "is_virtual",
    "description",
    "main_dttm_col",
    "sql",
    "url",
    "columns",
    "metrics",
]

DEFAULT_GET_DATASET_INFO_COLUMN_FIELDS: List[str] = [
    "column_name",
    "type",
    "is_dttm",
]


class GetDatasetInfoRequest(MetadataCacheControl):
    """Request schema for get_dataset_info with support for ID or UUID."""

    identifier: Annotated[
        int | str,
        Field(description="Dataset identifier - can be numeric ID or UUID string"),
    ]
    select_columns: Annotated[
        List[str],
        Field(
            default_factory=lambda: list(DEFAULT_GET_DATASET_INFO_COLUMNS),
            description=(
                "Top-level fields to include in the response. Defaults to a lean "
                "set that excludes verbose fields like params, template_params, "
                "extra, tags, certification_details. Pass an explicit list to "
                "override (e.g. ['id','table_name','columns'] for minimal output)."
            ),
        ),
    ]
    column_fields: Annotated[
        List[str],
        Field(
            default_factory=lambda: list(DEFAULT_GET_DATASET_INFO_COLUMN_FIELDS),
            description=(
                "Per-column fields to include for entries in 'columns'. Defaults "
                "to ['column_name','type','is_dttm']. Pass a wider list to "
                "include 'verbose_name','groupby','filterable','description' "
                "when needed. Trimming per-column fields keeps responses small "
                "for wide datasets."
            ),
        ),
    ]

    @field_validator("select_columns", mode="before")
    @classmethod
    def _parse_select_columns(cls, value: Any) -> Any:
        from superset.mcp_service.utils.schema_utils import parse_json_or_list

        if value is None:
            return list(DEFAULT_GET_DATASET_INFO_COLUMNS)
        parsed = parse_json_or_list(value, "select_columns")
        # Treat empty list as "use defaults" so callers cannot accidentally
        # opt out of size reduction by passing []. Without this, an empty
        # list disables filtering downstream and reintroduces oversized
        # responses.
        if not parsed:
            return list(DEFAULT_GET_DATASET_INFO_COLUMNS)
        return parsed

    @field_validator("column_fields", mode="before")
    @classmethod
    def _parse_column_fields(cls, value: Any) -> Any:
        from superset.mcp_service.utils.schema_utils import parse_json_or_list

        if value is None:
            return list(DEFAULT_GET_DATASET_INFO_COLUMN_FIELDS)
        parsed = parse_json_or_list(value, "column_fields")
        if not parsed:
            return list(DEFAULT_GET_DATASET_INFO_COLUMN_FIELDS)
        return parsed


class CreateVirtualDatasetRequest(BaseModel):
29 changes: 24 additions & 5 deletions superset/mcp_service/dataset/tool/get_dataset_info.py
@@ -24,6 +24,7 @@

 import logging
 from datetime import datetime, timezone
+from typing import Any

 from fastmcp import Context
 from sqlalchemy.orm import joinedload, subqueryload
@@ -58,7 +59,7 @@
 @requires_data_model_metadata_access
 async def get_dataset_info(
     request: GetDatasetInfoRequest, ctx: Context
-) -> DatasetInfo | DatasetError:
+) -> dict[str, Any] | DatasetError:
     """Get dataset metadata by ID or UUID.

     Returns columns, metrics, and schema details.
@@ -68,6 +69,12 @@ async def get_dataset_info(
     - DO NOT use schema.table_name format (e.g., "public.customers")
     - To find a dataset ID, use the list_datasets tool first

+    Response size control (use these to keep responses small):
+    - select_columns: top-level fields to include (default: lean set)
+    - column_fields: per-column fields for entries in 'columns' (default:
+      column_name, type, is_dttm). Pass a wider list to opt in to
+      verbose_name, groupby, filterable, description.
+
     IMPORTANT - Saved Metrics vs Columns:
     The response includes both 'columns' (raw database columns) and 'metrics'
     (pre-defined saved metrics). When building chart configs, use saved_metric=true
@@ -144,12 +151,24 @@ async def get_dataset_info(
                     len(result.metrics) if result.metrics else 0,
                 )
             )
-        else:
-            await ctx.warning(
-                "Dataset retrieval failed: error_type=%s, error=%s"
-                % (result.error_type, result.error)
+            await ctx.debug(
+                "Filtering response: select_columns=%s, column_fields=%s"
+                % (request.select_columns, request.column_fields)
             )
+            with event_logger.log_context(action="mcp.get_dataset_info.serialization"):
+                return result.model_dump(
+                    mode="json",
+                    by_alias=True,
+                    context={
+                        "select_columns": request.select_columns,
+                        "column_fields": request.column_fields,
+                    },
+                )
+
+        await ctx.warning(
+            "Dataset retrieval failed: error_type=%s, error=%s"
+            % (result.error_type, result.error)
+        )
         return result

     except Exception as e: