[SPARK-56975][SS] Reject user-specified schema in DataStreamReader.table() #56017

Closed
PorridgeSwim wants to merge 1 commit into apache:master from PorridgeSwim:forbidSpecifySchemaForTable
@PorridgeSwim PorridgeSwim commented May 20, 2026

What changes were proposed in this pull request?

Make DataStreamReader.table() reject user-specified schemas by calling assertNoSpecifiedSchema("table"), mirroring DataStreamReader.changes().
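The guard can be pictured with a small, self-contained model. This is a sketch only and not Spark's source: `ReaderModel`, `Field`, and `SchemaNotSupported` are hypothetical stand-ins, and the error text assumes `table` reuses the same message template quoted for `changes` in this description.

```scala
// Hypothetical, simplified stand-in for DataStreamReader; not Spark's actual code.
final case class Field(name: String, dataType: String)

// Assumed message shape, modeled on the text quoted for changes().
final class SchemaNotSupported(operation: String)
  extends IllegalArgumentException(
    s"User specified schema not supported with `$operation`.")

final class ReaderModel {
  private var userSpecifiedSchema: Option[Seq[Field]] = None

  // Records the schema, like DataStreamReader.schema(...).
  def schema(fields: Field*): ReaderModel = {
    userSpecifiedSchema = Some(fields)
    this
  }

  // Fails fast if a schema was set before an operation that cannot use one.
  private def assertNoSpecifiedSchema(operation: String): Unit =
    if (userSpecifiedSchema.nonEmpty) throw new SchemaNotSupported(operation)

  // With the guard in place, a schema followed by table(...) is rejected
  // instead of being silently dropped.
  def table(name: String): String = {
    assertNoSpecifiedSchema("table")
    s"stream over $name using the catalog schema"
  }
}

object GuardDemo {
  def main(args: Array[String]): Unit = {
    // No schema set: the read proceeds.
    println(new ReaderModel().table("some_table"))
    // Schema set: the call fails instead of ignoring the schema.
    val rejected =
      scala.util.Try(new ReaderModel().schema(Field("a", "int")).table("some_table"))
    println(rejected.isFailure)
  }
}
```

In this model the check runs before any table lookup, so the misconfiguration is reported regardless of whether the table exists; whether the real method orders things the same way is an assumption.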

Why are the changes needed?

DataStreamReader.table() accepts a user-specified schema without complaint and then silently ignores it:

spark.readStream
  .schema(new StructType().add("a", IntegerType))
  .table("some_table")     // no error; the schema has no effect

A user-specified schema is not a meaningful input to .table(): catalog tables declare their own schema, and TableCatalog.loadTable(Identifier) has no parameter to receive a user schema, so Spark could not forward one even if it wanted to. The user's .schema(...) call is therefore always a misconfiguration.

The rest of DataStreamReader already surfaces this kind of misconfiguration as a clear error:

  • .load() goes through DataSourceV2Utils.getTableFromProvider, which throws _LEGACY_ERROR_TEMP_2242 ("<provider> source does not support user-specified schema") when a schema is supplied and the provider's supportsExternalMetadata() returns false (the default).
  • .changes() explicitly calls assertNoSpecifiedSchema("changes") and throws _LEGACY_ERROR_TEMP_1189 ("User specified schema not supported with changes.").
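The .load() decision described in the first bullet can be sketched as a three-way match. This is a simplified model under stated assumptions, not the body of `DataSourceV2Utils.getTableFromProvider`: `ProviderModel` and `LoadPathModel` are hypothetical names, and schemas are reduced to column-name lists.

```scala
// Hypothetical, simplified model of the schema check on the .load() path.
trait ProviderModel {
  def name: String
  // Providers must opt in to accepting a user-supplied schema.
  def supportsExternalMetadata: Boolean = false
}

object LoadPathModel {
  // Left carries the error text; Right describes the resolved table.
  def resolve(
      provider: ProviderModel,
      userSchema: Option[Seq[String]]): Either[String, String] =
    userSchema match {
      case Some(cols) if provider.supportsExternalMetadata =>
        Right(s"${provider.name} table with user schema (${cols.mkString(", ")})")
      case Some(_) =>
        // The provider cannot take external metadata, so the schema is rejected.
        Left(s"${provider.name} source does not support user-specified schema")
      case None =>
        Right(s"${provider.name} table with inferred schema")
    }
}

object LoadPathDemo {
  def main(args: Array[String]): Unit = {
    val strict = new ProviderModel { val name = "strict" }
    val flexible = new ProviderModel {
      val name = "flexible"
      override def supportsExternalMetadata: Boolean = true
    }
    println(LoadPathModel.resolve(strict, Some(Seq("a"))))
    println(LoadPathModel.resolve(flexible, Some(Seq("a"))))
    println(LoadPathModel.resolve(strict, None))
  }
}
```

The point of the comparison is that .load() has a branch in which a user schema is legitimately consumed, whereas .table() has none, which is why an unconditional rejection fits there.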

.table() is the odd one out: same invalid configuration, no error. Users can write readStream.schema(s).table(name), see a working query, and reasonably assume s had an effect — when in fact the resulting stream uses the catalog schema and s was dropped. Surfacing this as a clear error aligns .table() with the existing behavior of .load() and .changes().

Does this PR introduce any user-facing change?

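Yes. readStream.schema(...).table(...) previously ran and silently ignored the schema; with this change it fails with an error.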

How was this patch tested?

Added DataStreamTableAPISuite test "read: user-specified schema is not allowed with table API".

Was this patch authored or co-authored using generative AI tooling?

No

@PorridgeSwim PorridgeSwim changed the title Forbid user specified schema for Table Reject user-specified schema in DataStreamReader.table() May 20, 2026
@PorridgeSwim PorridgeSwim changed the title Reject user-specified schema in DataStreamReader.table() [SPARK-56975][SS]Reject user-specified schema in DataStreamReader.table() May 20, 2026
@PorridgeSwim PorridgeSwim changed the title [SPARK-56975][SS]Reject user-specified schema in DataStreamReader.table() [SPARK-56975][SS] Reject user-specified schema in DataStreamReader.table() May 20, 2026

@HeartSaVioR HeartSaVioR left a comment

+1

@HeartSaVioR

Thanks! Merging to master/4.x.

HeartSaVioR pushed a commit that referenced this pull request May 21, 2026
…ble()

Closes #56017 from PorridgeSwim/forbidSpecifySchemaForTable.

Lead-authored-by: You Zhou <98635051+PorridgeSwim@users.noreply.github.com>
Co-authored-by: You Zhou <you.zhou@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 05b4d81)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>