Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/source/contributor-guide/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ Benchmarking Guide <benchmarking>
Adding a New Operator <adding_a_new_operator>
Adding a New Expression <adding_a_new_expression>
Supported Spark Expressions <spark_expressions_support>
Supported Spark Configurations <spark_configs_support>
Tracing <tracing>
Profiling <profiling>
Comet SQL Tests <sql-file-tests.md>
Expand Down
160 changes: 160 additions & 0 deletions docs/source/contributor-guide/spark_configs_support.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Supported Spark Configurations

This document tracks Spark SQL configurations that affect Comet's behavior. For each
configuration we record which Comet expressions or operators are influenced, what
verification has been performed, and any known gaps.

## How to Read This Document

The status column uses these values:

- **Supported** -- Comet runs the affected expressions natively under every value of
the config, and produces results matching Spark.
- **Partial** -- Comet runs natively for some values of the config but falls back to
Spark for others, or runs natively but with documented incompatibilities.
- **Falls back** -- Comet does not run the affected expressions natively under this
config and always defers to Spark.
- **Unaudited** -- the config's interaction with Comet has not yet been verified.

## Audited Configurations

- `spark.sql.legacy.timeParserPolicy`
- Default: `EXCEPTION`
- Status: Falls back (see notes)
- Affected expressions: `date_format`, `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp`, `to_timestamp_ntz`, `to_date`, `try_to_timestamp` (Spark 4+)
- Spark versions checked: 3.4.3, 3.5.8, 4.0.1
- Date: 2026-05-02
- `spark.sql.optimizer.nestedSchemaPruning.enabled`
- Default: `true`
- Status: Supported
- Affected components: catalyst optimizer rules `SchemaPruning` and `NestedColumnAliasing`, datasource V2 push-down (`PushDownUtils`), Parquet readers (`ParquetReadSupport`, `ParquetFileFormat`, `ParquetScan`)
- Spark versions checked: 3.4.3, 3.5.8, 4.0.1
- Date: 2026-05-02

## Audit Notes

### `spark.sql.legacy.timeParserPolicy`

**Source.** `SQLConf.LEGACY_TIME_PARSER_POLICY` selects the formatter used by
`TimestampFormatter` and `DateFormatter`:

- `LEGACY` -- `java.text.SimpleDateFormat` / `FastDateFormat`. Lenient parsing.
- `CORRECTED` -- `java.time.DateTimeFormatter` via `Iso8601TimestampFormatter`. Strict.
- `EXCEPTION` (default) -- same parser as `CORRECTED`, plus
`DateTimeFormatterHelper.checkParsedDiff` raises `SparkUpgradeException`
(`INCONSISTENT_BEHAVIOR_CROSS_VERSION`) when the new parser fails on input that the
legacy parser would have accepted. Pattern validation also raises
`SparkUpgradeException` when a pattern is recognized only by the legacy formatter
(this check applies under both `CORRECTED` and `EXCEPTION`).

**Affected expressions.** Determined by tracing `TimestampFormatterHelper`,
`TimestampFormatter(...)`, and `DateFormatter(...)` usage in
`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala`
across Spark 3.4, 3.5, 4.0, and 4.1. Three expression classes mix in
`TimestampFormatterHelper`:

- `DateFormatClass` -- `date_format`
- `FromUnixTime` -- `from_unixtime`
- `ToTimestamp` (abstract) -- `UnixTimestamp` (`unix_timestamp`),
`ToUnixTimestamp` (`to_unix_timestamp`), `GetTimestamp` (used by
`ParseToTimestamp` for `to_timestamp` / `to_timestamp_ntz`, `ParseToDate` for
`to_date`, and Spark 4's `try_to_timestamp`)

`Cast` between strings and date / timestamp also reads the policy via the default
formatters but is tested separately by `CometCastSuite` and is out of scope here.

**Comet status.** None of the listed expressions consult `legacyTimeParserPolicy` in
their Comet serde. The native implementations of `date_format`, `from_unixtime`, and
`unix_timestamp` use a fixed strftime-style mapping that does not vary with policy;
the remaining four (`to_unix_timestamp`, `to_timestamp`, `to_date`,
`try_to_timestamp`) have no native implementation and fall back to Spark. Today this
works because:

- `date_format` is `Compatible` only for a small whitelist of formats under UTC; the
whitelisted formats happen to produce identical output under all three policies.
- `from_unixtime` is marked `Incompatible` and falls back unless
`spark.comet.expression.FromUnixTime.allowIncompatible=true` is set.
- `unix_timestamp(<timestamp_or_date>)` does not call the formatter at all; the
string-input overload falls back.

If a Comet contributor adds native string-format parsing or extends the date_format
whitelist, this audit should be revisited and the policy must be honored explicitly.

**Test coverage.** `spark/src/test/resources/sql-tests/expressions/datetime/`:

- One ConfigMatrix file per expression covering convergent inputs under
`LEGACY,CORRECTED,EXCEPTION` (`*_time_parser_policy.sql`).
- Per-policy files locking in divergent behavior:
- `_legacy.sql` -- lenient inputs (single-digit fields, out-of-range values,
trailing characters) and legacy-only pattern tokens (`'aaaa'`).
- `_corrected.sql` -- the same inputs return null; legacy-only tokens raise
`INCONSISTENT_BEHAVIOR_CROSS_VERSION.DATETIME_PATTERN_RECOGNITION` at formatter
creation.
- `_exception.sql` -- the same inputs raise
`INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER` at parse time.

**Findings.** All 42 generated test cases pass on Spark 3.4.3, 3.5.8, and 4.0.1. No
Comet bugs were uncovered by the audit. The tests use `query spark_answer_only` so
that result-correctness is enforced regardless of whether Comet runs the expression
natively or falls back.

### `spark.sql.optimizer.nestedSchemaPruning.enabled`

**Source.** When `true`, the catalyst optimizer rewrites projections of nested
fields so columnar readers fetch only the requested leaves of a struct, array, or
map column. Read sites verified on Spark 3.4.3, 3.5.8, 4.0.1:

- `org.apache.spark.sql.catalyst.optimizer.SchemaPruning` and
`org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing` -- gated by
`nestedSchemaPruningEnabled`; rewrite the project list to expose only the leaves
that downstream operators consume.
- `org.apache.spark.sql.execution.datasources.v2.PushDownUtils.pruneColumns` --
pushes the pruned schema into V2 scans only when the flag is `true`.
- `org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport`,
`ParquetFileFormat`, and `ParquetScan` -- propagate the flag into the Parquet
reader's requested schema.

**Comet status.** `CometParquetFileFormat.populateConf` propagates the SQL conf
into the Hadoop conf so the Parquet read path honors it. Comet has no other
special handling -- native scans inherit Spark's already-pruned `requiredSchema`
and pass it through to the native reader. The separate
`spark.sql.optimizer.serializer.nestedSchemaPruning.enabled` (Encoder-level) is
out of scope.

**Test coverage.** `spark/src/test/scala/org/apache/comet/parquet/CometNestedSchemaPruningSuite.scala`:

- One Scala test per scenario, run across both `SCAN_NATIVE_DATAFUSION` and
`SCAN_NATIVE_ICEBERG_COMPAT` under the V1 Parquet path. Plain Parquet V2 is not
Comet-accelerated (Comet's V2 scan rule covers only CSV and Iceberg) so it is
excluded from the matrix.
- Each scenario inspects the executed plan via a small helper that walks Comet
scan execs (`CometScanExec`, `CometNativeScanExec`) and asserts the
`requiredSchema` matches the expected pruned (or unpruned) shape, then compares
results against Spark via `checkSparkAnswer`.
- Scenarios: top-level struct field, field inside array of struct, field inside
map value, doubly-nested struct field, projection plus filter on nested field,
null at intermediate struct level. Each scenario asserts both the
pruning-enabled and pruning-disabled behavior, except the null-intermediate
case which only varies the pruning-on path.

**Findings.** All 12 generated test cases pass on Spark 3.4.3, 3.5.8, and 4.0.1.
No Comet bugs were uncovered.
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- Convergent date_format() behavior across all three timeParserPolicy values.
-- Patterns here produce identical output under LEGACY, CORRECTED, and EXCEPTION.
-- ConfigMatrix: spark.sql.legacy.timeParserPolicy=LEGACY,CORRECTED,EXCEPTION
-- Config: spark.sql.session.timeZone=UTC

statement
CREATE TABLE test_date_format_policy(ts timestamp) USING parquet

statement
INSERT INTO test_date_format_policy VALUES (timestamp('2024-06-15 10:30:45')), (timestamp('1970-01-01 00:00:00')), (NULL)

query spark_answer_only
SELECT date_format(ts, 'yyyy-MM-dd') FROM test_date_format_policy

query spark_answer_only
SELECT date_format(ts, 'yyyy-MM-dd HH:mm:ss') FROM test_date_format_policy

query spark_answer_only
SELECT date_format(ts, 'HH:mm:ss') FROM test_date_format_policy

query spark_answer_only
SELECT date_format(ts, 'yyyyMMdd') FROM test_date_format_policy

query spark_answer_only
SELECT date_format(ts, 'yyyyMM') FROM test_date_format_policy

-- literal arguments
query spark_answer_only
SELECT date_format(timestamp('2024-06-15 10:30:45'), 'yyyy-MM-dd')

query spark_answer_only
SELECT date_format(NULL, 'yyyy-MM-dd')
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- date_format() under CORRECTED timeParserPolicy.
-- Patterns recognized only by the legacy formatter raise SparkUpgradeException at
-- formatter creation, even under CORRECTED, because validatePatternString is called
-- with checkLegacy=true.
-- Config: spark.sql.legacy.timeParserPolicy=CORRECTED
-- Config: spark.sql.session.timeZone=UTC

statement
CREATE TABLE test_date_format_corrected(ts timestamp) USING parquet

statement
INSERT INTO test_date_format_corrected VALUES (timestamp('2024-06-15 10:30:45'))

-- 4-char am/pm marker: legacy accepts, new rejects, validation throws SparkUpgradeException.
query expect_error(INCONSISTENT_BEHAVIOR_CROSS_VERSION)
SELECT date_format(ts, 'yyyy-MM-dd aaaa') FROM test_date_format_corrected
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- date_format() under EXCEPTION timeParserPolicy (the default).
-- Patterns rejected by the new formatter but accepted by legacy raise
-- SparkUpgradeException at formatter creation.
-- Config: spark.sql.legacy.timeParserPolicy=EXCEPTION
-- Config: spark.sql.session.timeZone=UTC

statement
CREATE TABLE test_date_format_exception(ts timestamp) USING parquet

statement
INSERT INTO test_date_format_exception VALUES (timestamp('2024-06-15 10:30:45'))

query expect_error(INCONSISTENT_BEHAVIOR_CROSS_VERSION)
SELECT date_format(ts, 'yyyy-MM-dd aaaa') FROM test_date_format_exception
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- date_format() under LEGACY timeParserPolicy.
-- Legacy SimpleDateFormat accepts patterns that the new java.time formatter rejects.
-- Config: spark.sql.legacy.timeParserPolicy=LEGACY
-- Config: spark.sql.session.timeZone=UTC

statement
CREATE TABLE test_date_format_legacy(ts timestamp) USING parquet

statement
INSERT INTO test_date_format_legacy VALUES (timestamp('2024-06-15 10:30:45')), (timestamp('1970-01-01 00:00:00')), (NULL)

-- Legacy-only token: 4-char am/pm marker is invalid in the new formatter.
query spark_answer_only
SELECT date_format(ts, 'yyyy-MM-dd aaaa') FROM test_date_format_legacy
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- Convergent from_unixtime() behavior across all three timeParserPolicy values.
-- Patterns here produce identical output under LEGACY, CORRECTED, and EXCEPTION.
-- ConfigMatrix: spark.sql.legacy.timeParserPolicy=LEGACY,CORRECTED,EXCEPTION
-- Config: spark.sql.session.timeZone=UTC

statement
CREATE TABLE test_from_unix_time_policy(t long) USING parquet

statement
INSERT INTO test_from_unix_time_policy VALUES (0), (1718451045), (-1), (NULL), (2147483647)

query spark_answer_only
SELECT from_unixtime(t) FROM test_from_unix_time_policy

query spark_answer_only
SELECT from_unixtime(t, 'yyyy-MM-dd') FROM test_from_unix_time_policy

query spark_answer_only
SELECT from_unixtime(t, 'yyyy-MM-dd HH:mm:ss') FROM test_from_unix_time_policy

query spark_answer_only
SELECT from_unixtime(t, 'HH:mm:ss') FROM test_from_unix_time_policy

-- literal arguments
query spark_answer_only
SELECT from_unixtime(0)

query spark_answer_only
SELECT from_unixtime(1718451045, 'yyyy-MM-dd')

query spark_answer_only
SELECT from_unixtime(NULL, 'yyyy-MM-dd')
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- from_unixtime() under CORRECTED timeParserPolicy.
-- Patterns recognized only by the legacy formatter raise SparkUpgradeException
-- at formatter creation, even under CORRECTED.
-- Config: spark.sql.legacy.timeParserPolicy=CORRECTED
-- Config: spark.sql.session.timeZone=UTC

statement
CREATE TABLE test_from_unix_time_corrected(t long) USING parquet

statement
INSERT INTO test_from_unix_time_corrected VALUES (1718451045)

query expect_error(INCONSISTENT_BEHAVIOR_CROSS_VERSION)
SELECT from_unixtime(t, 'yyyy-MM-dd aaaa') FROM test_from_unix_time_corrected
Loading
Loading