Preserve dictionary encoding through native expressions where possible #4228

@andygrove

Description

What is the problem the feature request solves?

Comet flattens Parquet dictionary-encoded string columns at the read boundary in native/core/src/parquet/parquet_support.rs:170 (the take(values, keys, None) branch in parquet_convert_array). Because QueryPlanSerde serializes Spark StringType as Arrow Utf8, the requested to_type is never Dictionary, so this branch always fires. Every native expression sees a flat Utf8 array, even when the source column has very low cardinality.

This forfeits the dictionary advantage. For low-cardinality columns (URL hosts, country codes, status flags, enum-shaped strings) an expression like unhex, length, lower, upper, regexp_replace, etc. evaluates per-row instead of per-unique-value, multiplying the work by a factor of N / M, where N is the row count and M is the dictionary size.
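A simplified sketch of the two evaluation paths (toy types only; the real code operates on arrow-rs `DictionaryArray` and the `take` kernel, and the function names here are hypothetical):

```rust
// Toy model of a dictionary-encoded string column: M unique values plus
// one key per row. Flattening materializes N strings; the dict path
// evaluates the kernel only M times.

fn flatten(values: &[&str], keys: &[usize]) -> Vec<String> {
    // What the take(values, keys, None) branch effectively does today:
    // one materialized String per row (flat Utf8 layout).
    keys.iter().map(|&k| values[k].to_string()).collect()
}

fn lower_per_row(flat: &[String]) -> Vec<String> {
    // Post-flatten: the kernel runs N times, once per row.
    flat.iter().map(|s| s.to_lowercase()).collect()
}

fn lower_per_unique(values: &[&str], keys: &[usize]) -> Vec<String> {
    // Dict-preserving path: the kernel runs M times, once per unique
    // value, then the result is re-expanded through the keys.
    let lowered: Vec<String> = values.iter().map(|v| v.to_lowercase()).collect();
    keys.iter().map(|&k| lowered[k].clone()).collect()
}

fn main() {
    let values = ["US", "DE", "FR"]; // M = 3 unique country codes
    let keys = [0, 2, 2, 1, 0, 0];   // N = 6 rows
    let flat = flatten(&values, &keys);
    // Both paths produce identical output; only the work differs.
    assert_eq!(lower_per_row(&flat), lower_per_unique(&values, &keys));
}
```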

Describe the potential solution

Three layers, ordered by feasibility:

  1. Per-expression dict-aware paths. Many expressions (substring, rlike, cast, the temporal kernels) already operate on dictionary values and reassemble. Audit the rest and apply the pattern from string_funcs/substring.rs:54-60. Document this in adding_a_new_expression.md.

  2. Plan-aware read boundary. Make parquet_convert_array's flatten-vs-preserve a function of whether the immediate downstream expression accepts dict input, not just the requested to_type. Needs a small planner pass to thread that hint back to the scan.

  3. Spark-boundary materialization stays. Anywhere a batch leaves native execution (shuffle, fallback, scan output to JVM) flatten as today. Spark's row layout has no dict shape.

Until (2) lands, per-expression work in (1) only helps when a dict is produced mid-plan, which is rare. Both are needed for measurable wins.
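The per-expression pattern in (1) can be sketched as: apply the kernel to the dictionary values only and reassemble with the original keys, flattening only at the Spark boundary. This is a std-only toy model (`DictColumn` is a hypothetical stand-in; the real implementations in string_funcs/substring.rs operate on arrow-rs `DictionaryArray`):

```rust
// Hypothetical stand-in for a dictionary-encoded string column.
struct DictColumn {
    values: Vec<String>, // unique strings, length M
    keys: Vec<usize>,    // one entry per row, length N
}

impl DictColumn {
    // Apply a string kernel to the M dictionary values; the keys are
    // untouched, so the result stays dictionary-encoded.
    fn map_values(&self, f: impl Fn(&str) -> String) -> DictColumn {
        DictColumn {
            values: self.values.iter().map(|v| f(v)).collect(),
            keys: self.keys.clone(),
        }
    }

    // Flatten only where a batch leaves native execution
    // (shuffle, fallback, scan output to the JVM).
    fn materialize(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}

fn main() {
    let col = DictColumn {
        values: vec!["GET".into(), "POST".into()],
        keys: vec![0, 0, 1, 0],
    };
    let lowered = col.map_values(|s| s.to_lowercase());
    assert_eq!(lowered.materialize(), vec!["get", "get", "post", "get"]);
}
```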

Additional context

Concrete starting point: pick one common expression with large per-unique-value savings (e.g. length or lower), add the dict path, and microbenchmark on a 1M-row column with ~100 unique values. If the speedup is in the 5-15% range expected for string-heavy TPC-DS-style queries, that is enough signal to justify the planner work in (2).
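The suggested microbenchmark can be roughed out like this (std-only, illustrative timings; a real benchmark would use criterion against the actual arrow-rs kernels):

```rust
use std::time::Instant;

// Flat path: N evaluations of to_lowercase over the materialized column.
fn lower_flat(values: &[String], keys: &[usize]) -> Vec<String> {
    let flat: Vec<String> = keys.iter().map(|&k| values[k].clone()).collect();
    flat.iter().map(|s| s.to_lowercase()).collect()
}

// Dict path: M evaluations, then re-expansion through the keys.
fn lower_dict(values: &[String], keys: &[usize]) -> Vec<String> {
    let lowered: Vec<String> = values.iter().map(|v| v.to_lowercase()).collect();
    keys.iter().map(|&k| lowered[k].clone()).collect()
}

fn main() {
    let m = 100;        // ~100 unique values
    let n = 1_000_000;  // 1M rows
    let values: Vec<String> = (0..m).map(|i| format!("VALUE_{i}")).collect();
    let keys: Vec<usize> = (0..n).map(|i| i % m).collect();

    let t = Instant::now();
    let per_row = lower_flat(&values, &keys);
    let flat_time = t.elapsed();

    let t = Instant::now();
    let per_unique = lower_dict(&values, &keys);
    let dict_time = t.elapsed();

    assert_eq!(per_row, per_unique);
    println!("flat: {flat_time:?}, dict: {dict_time:?}");
}
```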

Came up while reviewing #4222.
