Preserve dictionary encoding through native expressions where possible #4228

@andygrove

Description

What is the problem the feature request solves?

Comet flattens Parquet dictionary-encoded string columns at the read boundary in native/core/src/parquet/parquet_support.rs:170 (the take(values, keys, None) branch in parquet_convert_array). Because QueryPlanSerde serializes Spark StringType as Arrow Utf8, the requested to_type is never Dictionary, so this branch always fires. Every native expression sees a flat Utf8 array, even when the source column has very low cardinality.

This forfeits the dictionary advantage. For low-cardinality columns (URL hosts, country codes, status flags, enum-shaped strings) an expression like unhex, length, lower, upper, regexp_replace, etc. evaluates per-row instead of per-unique-value, multiplying the work by a factor of N / M, where N is the row count and M is the dictionary size.
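A simplified sketch of the two evaluation paths (toy types only; the real code operates on arrow-rs `DictionaryArray` and the `take` kernel, and the function names here are hypothetical):

```rust
// Toy model of a dictionary-encoded string column: M unique values plus
// one key per row. Flattening materializes N strings; the dict path
// evaluates the kernel only M times.

fn flatten(values: &[&str], keys: &[usize]) -> Vec<String> {
    // What the take(values, keys, None) branch effectively does today:
    // one materialized String per row (flat Utf8 layout).
    keys.iter().map(|&k| values[k].to_string()).collect()
}

fn lower_per_row(flat: &[String]) -> Vec<String> {
    // Post-flatten: the kernel runs N times, once per row.
    flat.iter().map(|s| s.to_lowercase()).collect()
}

fn lower_per_unique(values: &[&str], keys: &[usize]) -> Vec<String> {
    // Dict-preserving path: the kernel runs M times, once per unique
    // value, then the result is re-expanded through the keys.
    let lowered: Vec<String> = values.iter().map(|v| v.to_lowercase()).collect();
    keys.iter().map(|&k| lowered[k].clone()).collect()
}

fn main() {
    let values = ["US", "DE", "FR"]; // M = 3 unique country codes
    let keys = [0, 2, 2, 1, 0, 0];   // N = 6 rows
    let flat = flatten(&values, &keys);
    // Both paths produce identical output; only the work differs.
    assert_eq!(lower_per_row(&flat), lower_per_unique(&values, &keys));
}
```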

Describe the potential solution

Three layers, ordered by feasibility:

  1. Per-expression dict-aware paths. Many expressions (substring, rlike, cast, the temporal kernels) already operate on dictionary values and reassemble. Audit the rest and apply the pattern from string_funcs/substring.rs:54-60. Document this in adding_a_new_expression.md.

  2. Plan-aware read boundary. Make parquet_convert_array's flatten-vs-preserve a function of whether the immediate downstream expression accepts dict input, not just the requested to_type. Needs a small planner pass to thread that hint back to the scan.

  3. Spark-boundary materialization stays. Anywhere a batch leaves native execution (shuffle, fallback, scan output to JVM) flatten as today. Spark's row layout has no dict shape.

Until (2) lands, per-expression work in (1) only helps when a dict is produced mid-plan, which is rare. Both are needed for measurable wins.
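The per-expression pattern in (1) can be sketched as: apply the kernel to the dictionary values only and reassemble with the original keys, flattening only at the Spark boundary. This is a std-only toy model (`DictColumn` is a hypothetical stand-in; the real implementations in string_funcs/substring.rs operate on arrow-rs `DictionaryArray`):

```rust
// Hypothetical stand-in for a dictionary-encoded string column.
struct DictColumn {
    values: Vec<String>, // unique strings, length M
    keys: Vec<usize>,    // one entry per row, length N
}

impl DictColumn {
    // Apply a string kernel to the M dictionary values; the keys are
    // untouched, so the result stays dictionary-encoded.
    fn map_values(&self, f: impl Fn(&str) -> String) -> DictColumn {
        DictColumn {
            values: self.values.iter().map(|v| f(v)).collect(),
            keys: self.keys.clone(),
        }
    }

    // Flatten only where a batch leaves native execution
    // (shuffle, fallback, scan output to the JVM).
    fn materialize(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}

fn main() {
    let col = DictColumn {
        values: vec!["GET".into(), "POST".into()],
        keys: vec![0, 0, 1, 0],
    };
    let lowered = col.map_values(|s| s.to_lowercase());
    assert_eq!(lowered.materialize(), vec!["get", "get", "post", "get"]);
}
```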

Additional context

Concrete starting point: pick one common expression with large per-unique-value savings (e.g. length or lower), add the dict path, and microbenchmark on a 1M-row column with ~100 unique values. If the speedup is in the 5-15% range expected for string-heavy TPC-DS-style queries, that is enough signal to justify the planner work in (2).
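The suggested microbenchmark can be roughed out like this (std-only, illustrative timings; a real benchmark would use criterion against the actual arrow-rs kernels):

```rust
use std::time::Instant;

// Flat path: N evaluations of to_lowercase over the materialized column.
fn lower_flat(values: &[String], keys: &[usize]) -> Vec<String> {
    let flat: Vec<String> = keys.iter().map(|&k| values[k].clone()).collect();
    flat.iter().map(|s| s.to_lowercase()).collect()
}

// Dict path: M evaluations, then re-expansion through the keys.
fn lower_dict(values: &[String], keys: &[usize]) -> Vec<String> {
    let lowered: Vec<String> = values.iter().map(|v| v.to_lowercase()).collect();
    keys.iter().map(|&k| lowered[k].clone()).collect()
}

fn main() {
    let m = 100;        // ~100 unique values
    let n = 1_000_000;  // 1M rows
    let values: Vec<String> = (0..m).map(|i| format!("VALUE_{i}")).collect();
    let keys: Vec<usize> = (0..n).map(|i| i % m).collect();

    let t = Instant::now();
    let per_row = lower_flat(&values, &keys);
    let flat_time = t.elapsed();

    let t = Instant::now();
    let per_unique = lower_dict(&values, &keys);
    let dict_time = t.elapsed();

    assert_eq!(per_row, per_unique);
    println!("flat: {flat_time:?}, dict: {dict_time:?}");
}
```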

Came up while reviewing #4222.
