
fix(mcp): trim get_dataset_info response to prevent oversized payloads#39898

Open
aminghadersohi wants to merge 3 commits into apache:master from aminghadersohi:aminghadersohi/fix-get-dataset-info-payload

Conversation

@aminghadersohi
Contributor

SUMMARY

get_dataset_info could return ~80KB+ payloads for wide datasets, causing clients to truncate the response and LLM agents to time out trying to recover. Two issues:

  1. The response always included verbose top-level fields (params, template_params, extra, certified_by, certification_details, tags, schema_perm) regardless of caller need.
  2. Each TableColumnInfo serialized all 7 fields including long description text, so a 50-column dataset with verbose descriptions alone could exceed 30KB.

This change adds two new request parameters to GetDatasetInfoRequest:

  • select_columns — top-level fields to include. Defaults to a lean set (id, table_name, schema, database_name, database_id, uuid, is_virtual, description, main_dttm_col, sql, url, columns, metrics).
  • column_fields — per-column fields to include in columns entries. Defaults to ["column_name", "type", "is_dttm"]. Wider lists let callers opt in to verbose_name, groupby, filterable, description.

TableColumnInfo and DatasetInfo already had a model_serializer(mode="wrap") that reads select_columns from the Pydantic serialization context. The tool now passes both select_columns and column_fields through model_dump(context=...) so filtering applies during serialization rather than after, mirroring the pattern already in list_datasets and list_databases.

The default response shrinks from ~80KB to a few KB for typical wide datasets while existing callers that pass explicit select_columns continue to work unchanged.

BEFORE/AFTER SCREENSHOTS

N/A — backend change. Behavior change is observable via response payload size on tools/call for get_dataset_info.

TESTING INSTRUCTIONS

pytest tests/unit_tests/mcp_service/dataset/ -v

New tests verify:

  • Default response excludes verbose top-level and per-column fields.
  • Explicit select_columns trims the response to requested fields only.
  • Explicit column_fields opts in to verbose per-column fields.
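As a rough sketch of the kind of assertion the first bullet makes (hypothetical helper; the real tests drive the FastMCP tool and live in tests/unit_tests/mcp_service/dataset/):

```python
# A plain dict stands in for the tool response here.
LEAN_COLUMN_FIELDS = {"column_name", "type", "is_dttm"}


def assert_lean_columns(response: dict) -> None:
    """Fail if any column entry carries fields beyond the lean default set."""
    for col in response["columns"]:
        extra = set(col) - LEAN_COLUMN_FIELDS
        assert not extra, f"unexpected verbose per-column fields: {extra}"


response = {
    "columns": [
        {"column_name": "ds", "type": "DATE", "is_dttm": True},
        {"column_name": "num_rows", "type": "BIGINT", "is_dttm": False},
    ]
}
assert_lean_columns(response)  # passes: only the lean default fields present
```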

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@dosubot Bot added the data:dataset (Related to dataset configurations) label May 5, 2026
Comment on lines +109 to +115
if info.context and isinstance(info.context, dict):
    column_fields = info.context.get("column_fields")
    if column_fields:
        requested = set(column_fields)
        # Always preserve column_name as the only required field
        requested.add("column_name")
        return {k: v for k, v in data.items() if k in requested}
Contributor


Suggestion: The column filtering check treats an explicitly provided empty list as "no filter" and returns all column fields. This is a logic bug because callers can pass column_fields=[] (or values that parse to an empty list) and unexpectedly get verbose fields like description for every column, which defeats the payload-size reduction and can reintroduce oversized responses/timeouts. Handle empty lists as a valid filter input (e.g., still enforce the minimal required field set) instead of falling back to full serialization. [logic error]

Severity Level: Critical 🚨
- ❌ MCP `get_dataset_info` cannot honor explicit empty column_fields.
- ⚠️ Wide datasets may still return verbose per-column descriptions.
Steps of Reproduction ✅
1. In `superset/tests/unit_tests/mcp_service/dataset/tool/test_dataset_tools.py:18-31`,
copy the pattern of `test_get_dataset_info_respects_column_fields` but change the request
payload to use an empty list for `column_fields`:

   `{"request": {"identifier": 3, "select_columns": ["id", "columns"], "column_fields": []}}`.

2. This request is validated into `GetDatasetInfoRequest` in
`superset/mcp_service/dataset/schemas.py:172-221`; the `@field_validator("column_fields")`
calls `parse_json_or_list` (see `schema_utils.py:111-151`), which returns `[]` unchanged
for a Python list, so `request.column_fields` is an empty list, not `None`.

3. The MCP tool handler `get_dataset_info` in
`superset/mcp_service/dataset/tool/get_dataset_info.py:21-27` fetches a `DatasetInfo`
instance, then at lines 119-126 calls `result.model_dump(..., context={"select_columns":
request.select_columns, "column_fields": request.column_fields})`, so
`info.context["column_fields"]` is `[]` for this call.

4. During serialization, each `TableColumnInfo` is processed by
`_filter_column_fields_by_context` in `superset/mcp_service/dataset/schemas.py:27-48`;
`info.context` is a dict and `column_fields` is `[]`, so the `if column_fields:` check at
lines 40-42 evaluates false and the method returns `data` unfiltered at line 48, including
verbose fields like `description`, `groupby`, `filterable`, etc. This contradicts the
request's explicit `column_fields=[]` and re-expands column payloads, undermining the PR's
goal of trimming oversized responses.


Contributor Author

@aminghadersohi commented May 5, 2026


Good catch. Fixed in b888574 — both select_columns=[] and column_fields=[] now coerce to the lean default in the field validators rather than falling through to "no filter". Added a regression test (test_get_dataset_info_empty_lists_fall_back_to_defaults) that passes an empty list and asserts the lean defaults are still applied.


codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.87%. Comparing base (4a21a53) to head (b297e39).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
superset/mcp_service/dataset/schemas.py 35.29% 22 Missing ⚠️
...erset/mcp_service/dataset/tool/get_dataset_info.py 20.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #39898      +/-   ##
==========================================
- Coverage   63.88%   63.87%   -0.01%     
==========================================
  Files        2583     2583              
  Lines      136592   136630      +38     
  Branches    31502    31508       +6     
==========================================
+ Hits        87265    87278      +13     
- Misses      47814    47839      +25     
  Partials     1513     1513              
Flag Coverage Δ
hive 39.39% <33.33%> (-0.01%) ⬇️
mysql 59.03% <33.33%> (-0.02%) ⬇️
postgres 59.11% <33.33%> (-0.02%) ⬇️
presto 41.08% <33.33%> (-0.01%) ⬇️
python 60.55% <33.33%> (-0.02%) ⬇️
sqlite 58.74% <33.33%> (-0.02%) ⬇️
unit 100.00% <ø> (ø)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Address codeant-ai review on apache#39898: an explicit empty
list (e.g. column_fields=[]) parsed through the field validator yielded
an empty list, then the model_serializer's `if column_fields:` check
treated it as falsy and fell through to returning all fields,
re-enabling verbose per-column descriptions for wide datasets.

Field validators now coerce both an empty list and None to the lean
default, so select_columns=[] and column_fields=[] are equivalent to
omitting them. Added a regression test for the empty-list case.
Contributor

@bito-code-review Bot left a comment


Code Review Agent Run #f4f326

Actionable Suggestions - 1
  • superset/mcp_service/dataset/schemas.py - 1
Filtered by Review Rules

Bito filtered these suggestions based on rules created automatically for your feedback. Manage rules.

  • superset/mcp_service/dataset/tool/get_dataset_info.py - 1
    • Inconsistent return type serialization · Line 62-62
Review Details
  • Files reviewed - 3 · Commit Range: b5ed6b6..b5ed6b6
    • superset/mcp_service/dataset/schemas.py
    • superset/mcp_service/dataset/tool/get_dataset_info.py
    • tests/unit_tests/mcp_service/dataset/tool/test_dataset_tools.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

  • /review - Manually triggers a full AI review.

  • /pause - Pauses automatic reviews on this pull request.

  • /resume - Resumes automatic reviews.

  • /resolve - Marks all Bito-posted review comments as resolved.

  • /abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset. You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by Bito Logo

Comment on lines +109 to +117
if info.context and isinstance(info.context, dict):
    column_fields = info.context.get("column_fields")
    if column_fields:
        requested = set(column_fields)
        # Always preserve column_name as the only required field
        requested.add("column_name")
        return {k: v for k, v in data.items() if k in requested}

return data
Contributor


Incorrect Filtering Logic

The serializer skips filtering when column_fields is an empty list, returning all fields instead of just column_name. This can lead to unexpectedly large responses if users pass [] intending minimal output. Update the logic to always filter when the key is present, defaulting to only column_name for empty lists.

Code suggestion
Check the AI-generated fix before applying
Suggested change

Current:

    if info.context and isinstance(info.context, dict):
        column_fields = info.context.get("column_fields")
        if column_fields:
            requested = set(column_fields)
            # Always preserve column_name as the only required field
            requested.add("column_name")
            return {k: v for k, v in data.items() if k in requested}
    return data

Suggested:

    if info.context and isinstance(info.context, dict):
        column_fields = info.context.get("column_fields")
        if "column_fields" in info.context:
            requested = set(column_fields or [])
            # Always preserve column_name as the only required field
            requested.add("column_name")
            return {k: v for k, v in data.items() if k in requested}
    return data

Code Review Run #f4f326



Contributor Author


Already addressed via a different layer in b888574 — the _parse_column_fields field_validator on GetDatasetInfoRequest coerces an empty list (and None) back to DEFAULT_GET_DATASET_INFO_COLUMN_FIELDS, so the serializer never sees an empty column_fields at runtime. Same outcome you wanted (no caller silently re-enables verbose per-column fields), but resolved before the model is constructed rather than inside the serializer. Regression test in test_get_dataset_info_empty_lists_fall_back_to_defaults and test_get_dataset_info_request_empty_lists_use_defaults.

Adds direct unit tests for TableColumnInfo serializer (without/with
context) and GetDatasetInfoRequest field validators (defaults, explicit
overrides, empty lists, JSON-string parsing) to bring patch coverage on
schemas.py up to where the model_serializer and field_validator branches
are exercised independently of the FastMCP client path.

bito-code-review Bot commented May 6, 2026

Code Review Agent Run #94571b

Actionable Suggestions - 0
Review Details
  • Files reviewed - 2 · Commit Range: b5ed6b6..b297e39
    • superset/mcp_service/dataset/schemas.py
    • tests/unit_tests/mcp_service/dataset/tool/test_dataset_tools.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful



Labels

data:dataset Related to dataset configurations size/L
