From 69bcb096ea5db347112fd5946ecf1991cad125f1 Mon Sep 17 00:00:00 2001 From: Nicholas Gates Date: Sun, 3 May 2026 22:37:53 -0400 Subject: [PATCH 1/4] Support for list scalars Signed-off-by: Nicholas Gates --- rfcs/0028-scalar-values.md | 524 +++++++++++++++++++++++++++++++++++++ 1 file changed, 524 insertions(+) create mode 100644 rfcs/0028-scalar-values.md diff --git a/rfcs/0028-scalar-values.md b/rfcs/0028-scalar-values.md new file mode 100644 index 0000000..89e69ee --- /dev/null +++ b/rfcs/0028-scalar-values.md @@ -0,0 +1,524 @@ +- Start Date: 2026-05-04 +- Authors: @ngates +- RFC PR: [vortex-data/rfcs#28](https://github.com/vortex-data/rfcs/pull/28) + +# Scalar Values and Complex Constants + +## Summary + +Vortex should keep `Scalar` as a small, host-resident, context-free value representation, and stop +using it as the primary execution representation for complex values. Complex constants should be +represented in the array world as length-1 arrays wrapped by `ConstantArray`, and complex expression +literals should serialize as singleton arrays instead of recursively nested scalar values. + +This proposal introduces a scalar-or-row-backed constant representation: + +```rust +pub enum ConstantValue { + Scalar(Scalar), + Row(ArrayRef), // invariant: len == 1, dtype == the ConstantArray dtype +} +``` + +Scalar-backed constants remain the fast path for nulls, booleans, primitives, decimals, UTF-8, and +binary values. Row-backed constants become the representation for non-null list, fixed-size-list, +struct, variant, and other complex values where nested scalar materialization is expensive or +requires array-level storage. + +## Motivation + +Vortex currently uses `ScalarValue::Tuple(Vec>)` for list, fixed-size-list, and +struct scalars. That is convenient for simple expression literals, but it is a poor representation +for execution and serialization of list-like values. 
+ +The main problems are: + +- Nested scalar values duplicate structure that Vortex already represents efficiently as arrays. +- Constructing a list scalar from an array row requires recursively calling `execute_scalar` for + every nested element. +- Serializing a complex literal as protobuf recursively expands the value tree instead of preserving + buffers, offsets, validity, and existing encodings. +- `ConstantArray` currently stores only a `Scalar`, so array kernels that detect constants often + accidentally force complex data back through scalar form. +- Moving scalar values toward array-backed storage would make `Scalar::try_new` depend on execution + context, buffer residency, and possibly device synchronization, which would make `Scalar` much less + useful as a simple host literal. + +This has shown up most clearly in list-oriented expressions such as `list_contains`, where a constant +list literal should be cheap to carry around, serialize, and compare against, but currently becomes an +expensive nested scalar object. + +## Goals + +- Keep `Scalar` simple, host-resident, context-free, and cheap for primitive-style values. +- Let execution represent complex constants as arrays, preserving array buffers and encodings. +- Let expression literals carry complex values without recursively serializing nested scalar trees. +- Avoid requiring an `ExecutionCtx` to construct or validate a `Scalar`. +- Allow in-memory complex constants to hold device-resident array buffers without copying them into + host scalar values. +- Preserve compatibility with existing scalar literal and constant-array encodings. + +## Non-Goals + +- This RFC does not remove `ScalarValue::Tuple` immediately. +- This RFC does not require every scalar-like API to move to arrays in one change. +- This RFC does not define a device-resident `Scalar`. +- This RFC does not require canonicalizing complex constants during expression deserialization. 
+ +## Design + +### Scalar remains a host literal + +`Scalar` should remain: + +```rust +pub struct Scalar { + dtype: DType, + value: Option<ScalarValue>, +} +``` + +Its contract should be narrowed and documented: + +- `Scalar` is a host value. +- `Scalar::try_new` validates only dtype/value compatibility. +- `Scalar::try_new` never needs an `ExecutionCtx`. +- `Scalar` never stores `ArrayRef`, `BufferHandle`, or device buffers. +- `Scalar` is appropriate for expression literals, stats, FFI values, Python values, display, tests, + and scalar-at results where the caller explicitly asked for a scalar. + +`ScalarValue::Tuple` remains valid for compatibility and for small host values. It should no longer +be the default representation for complex constants inside array execution. + +### ConstantArray stores either a Scalar or a singleton row + +`ConstantData` should be changed from: + +```rust +pub struct ConstantData { + scalar: Scalar, +} +``` + +to: + +```rust +pub struct ConstantData { + value: ConstantValue, +} + +pub enum ConstantValue { + Scalar(Scalar), + Row(ArrayRef), +} +``` + +The row variant has these invariants: + +- `row.len() == 1` +- `row.dtype() == constant_array.dtype()` +- the outer `ConstantArray` length is independent of the row length + +Recommended constructors: + +```rust +impl ConstantArray { + pub fn new<S: Into<Scalar>>(scalar: S, len: usize) -> Self; + + pub fn try_new_value(value: ConstantValue, len: usize) -> VortexResult<Self>; + + pub fn try_new_row(row: ArrayRef, len: usize) -> VortexResult<Self>; + + pub fn constant_value(&self) -> ConstantValueRef<'_>; + + pub fn scalar(&self) -> Option<&Scalar>; + + pub fn row(&self) -> Option<&ArrayRef>; +} +``` + +The exact public API names can be adjusted, but the important distinction is that callers must be +able to ask whether a constant is scalar-backed or row-backed without forcing materialization. + +`ArrayRef::as_constant()` should keep its current semantics as a scalar-only helper. 
It should return +`Some(Scalar)` only for scalar-backed constants. New helpers should be added for callers that can +handle complex constants: + +```rust +impl ArrayRef { + pub fn as_constant_scalar(&self) -> Option<Scalar>; + pub fn as_constant_row(&self) -> Option<ArrayRef>; + pub fn as_constant_value(&self) -> Option<ConstantValueRef<'_>>; +} +``` + +This is intentionally conservative. Existing kernels that call `as_constant()` generally expect a +`Scalar` and should not silently get a value that may require execution, allocation, or device reads. + +### Choosing the representation + +New constants should use scalar-backed representation when the value is naturally scalar: + +- `Null` +- `Bool` +- `Primitive` +- `Decimal` +- `Utf8` +- `Binary` + +New constants should use row-backed representation when the value is complex and non-null: + +- `List` +- `FixedSizeList` +- `Struct` +- `Variant` +- `Extension` values whose storage dtype is complex + +Null complex values may remain scalar-backed. A null complex scalar is compact and does not contain +nested values, so there is no benefit to constructing a singleton array just to represent absence. + +The boundary should be pragmatic rather than philosophical. If a future scalar representation becomes +bad for large binary values, a threshold can move those to row-backed constants as well. + +### Scalar extraction + +`ConstantArray::execute_scalar(index, ctx)` should behave as follows: + +- scalar-backed: clone and return the scalar +- row-backed: return `row.execute_scalar(0, ctx)` + +This preserves the public scalar-at contract, but makes scalar materialization explicit and lazy. Code +that only needs to move, serialize, compare, or execute a complex constant can stay in array form. 
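The scalar-extraction rule above can be sketched with toy types. This is a minimal illustration, not the Vortex API: `Scalar`, `ListArray`, `ConstantValue`, and `ConstantArray` here are simplified stand-ins, the `ctx` argument is omitted, and the row is hard-coded to a list of `i64`. The point it tries to show is that the nested scalar is built only when `execute_scalar` is called, and always from row 0.

```rust
// Toy stand-ins for illustration only; these are not the real Vortex types.
#[derive(Clone, Debug, PartialEq)]
enum Scalar {
    Int(i64),
    List(Vec<Scalar>),
}

/// A toy list array: `offsets` delimit each row's slice of `elements`.
#[derive(Clone, Debug)]
struct ListArray {
    offsets: Vec<usize>,
    elements: Vec<i64>,
}

impl ListArray {
    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Materialize one row as a nested host scalar (the expensive path).
    fn execute_scalar(&self, index: usize) -> Scalar {
        let (start, end) = (self.offsets[index], self.offsets[index + 1]);
        Scalar::List(self.elements[start..end].iter().map(|v| Scalar::Int(*v)).collect())
    }
}

enum ConstantValue {
    Scalar(Scalar),
    Row(ListArray), // invariant: len == 1
}

struct ConstantArray {
    value: ConstantValue,
    len: usize, // outer broadcast length, independent of the row length
}

impl ConstantArray {
    fn try_new_row(row: ListArray, len: usize) -> Result<Self, String> {
        if row.len() != 1 {
            return Err(format!("row must have length 1, got {}", row.len()));
        }
        Ok(Self { value: ConstantValue::Row(row), len })
    }

    /// Scalar-backed: clone. Row-backed: materialize row 0 on demand.
    fn execute_scalar(&self, index: usize) -> Result<Scalar, String> {
        if index >= self.len {
            return Err(format!("index {index} out of bounds for length {}", self.len));
        }
        Ok(match &self.value {
            ConstantValue::Scalar(s) => s.clone(),
            ConstantValue::Row(row) => row.execute_scalar(0),
        })
    }
}
```

Every logical index maps to row 0, so a caller that never asks for a scalar never pays for the nested allocation.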
+ +### Validity + +For scalar-backed constants, validity is unchanged: + +- non-nullable dtype: `Validity::NonNullable` +- nullable and non-null scalar: `Validity::AllValid` +- nullable and null scalar: `Validity::AllInvalid` + +For row-backed constants, validity should broadcast the singleton row validity: + +- if the row is non-nullable: `Validity::NonNullable` +- if row 0 is valid: `Validity::AllValid` +- if row 0 is invalid: `Validity::AllInvalid` +- if determining row validity requires an array value, represent the validity as a constant boolean + array instead of converting the row to a scalar + +The last case is important for device and deferred execution. It should not force a host scalar read +just to determine whether a row-backed constant is valid. + +### Canonicalization and execution + +Scalar-backed constants keep the existing optimized canonicalization paths. + +Row-backed constants should canonicalize by broadcasting the singleton row structurally: + +- Struct constants produce a `StructArray` whose fields are `ConstantArray`s wrapping the singleton + field rows. +- List constants produce a list-view-style canonical array whose offsets and sizes are constant and + whose elements are the singleton row's element slice. +- Fixed-size-list constants may need to materialize repeated elements when a canonical + `FixedSizeListArray` is requested, because that canonical layout requires `len * list_size` + elements. This is acceptable because canonicalization is an explicit execution boundary. +- Primitive, decimal, UTF-8, binary, and bool row-backed constants are allowed but should usually be + normalized to scalar-backed constants. + +The key rule is that row-backed constants should not be converted into recursive `ScalarValue::Tuple` +except when an API explicitly asks for a `Scalar`. 
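As a rough sketch of the struct case described above, the following toy code broadcasts a singleton struct row by wrapping each field row in a constant, without copying any per-row data. The `Array` enum and `broadcast_struct_row` are hypothetical names invented for this illustration; the real canonicalization would work on `StructArray` and `ConstantArray`.

```rust
// Toy array model for illustration; not the Vortex `ArrayRef` hierarchy.
#[derive(Clone, Debug, PartialEq)]
enum Array {
    Ints(Vec<i64>),
    Struct { names: Vec<String>, fields: Vec<Array>, len: usize },
    Constant { row: Box<Array>, len: usize },
}

impl Array {
    fn len(&self) -> usize {
        match self {
            Array::Ints(v) => v.len(),
            Array::Struct { len, .. } | Array::Constant { len, .. } => *len,
        }
    }
}

/// Broadcast a length-1 struct row to `len` rows structurally:
/// the result is a struct whose fields are constants over the singleton field rows.
fn broadcast_struct_row(row: &Array, len: usize) -> Result<Array, String> {
    match row {
        Array::Struct { names, fields, len: 1 } => Ok(Array::Struct {
            names: names.clone(),
            fields: fields
                .iter()
                .map(|field| Array::Constant { row: Box::new(field.clone()), len })
                .collect(),
            len,
        }),
        other => Err(format!(
            "expected a length-1 struct row, got an array of length {}",
            other.len()
        )),
    }
}
```

The work done is proportional to the number of fields rather than to the broadcast length, which is the property the structural rule is after.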
+ +### Serialization of ConstantArray + +The existing serialized form for `vortex.constant` should remain readable: + +- metadata: empty +- buffers: one protobuf-encoded `ScalarValue` +- children: none + +Add a row-backed serialized form: + +- metadata: empty, or a small version/kind marker if desired +- buffers: none +- children: one array node, decoded with the same dtype and length `1` + +Deserialization should accept both forms: + +```text +buffers.len() == 1 && children.len() == 0 => Scalar-backed legacy constant +buffers.len() == 0 && children.len() == 1 => Row-backed constant +otherwise => error +``` + +This reuses the existing array serialization machinery, including buffers, offsets, validity, and +encoding trees. It also means complex constants can preserve specialized encodings instead of being +flattened into nested protobuf scalar values. + +New writers should use the row-backed form for complex non-null constants. Writers that need to +target older readers can keep an option to force the legacy scalar form or canonicalize complex +constants before writing. + +### Expression literals + +Expression literals should no longer be restricted to `Scalar`. + +Introduce: + +```rust +pub enum LiteralValue { + Scalar(Scalar), + Row(ArrayRef), // len == 1 +} +``` + +`Literal` then becomes: + +```rust +impl ScalarFnVTable for Literal { + type Options = LiteralValue; +} +``` + +Execution is straightforward: + +```rust +match literal { + LiteralValue::Scalar(s) => ConstantArray::new(s.clone(), row_count), + LiteralValue::Row(row) => ConstantArray::try_new_row(row.clone(), row_count)?, +} +``` + +Recommended expression constructors: + +```rust +pub fn lit(value: impl Into<Scalar>) -> Expression; + +pub fn lit_row(row: ArrayRef) -> VortexResult<Expression>; + +pub fn lit_value(value: LiteralValue) -> VortexResult<Expression>; +``` + +`lit(value: impl Into<Scalar>)` remains for compatibility and simple values. 
Integrations that need +to pass list-like or struct-like values into expressions should use `lit_row`. + +In a later migration, `lit(Scalar::list(...))` may choose to build a singleton array internally, but +that is not required by this RFC. + +### Expression literal serialization + +The current protobuf literal options are: + +```proto +message LiteralOpts { + vortex.scalar.Scalar value = 1; +} +``` + +Replace this with a oneof: + +```proto +message LiteralOpts { + oneof value { + vortex.scalar.Scalar scalar = 1; + ArrayLiteral array = 2; + } +} + +message ArrayLiteral { + repeated string encoding_ids = 1; + vortex.dtype.DType dtype = 2; + uint64 len = 3; + bytes serialized_array = 4; +} +``` + +For this RFC, `ArrayLiteral.len` must be `1`. It is included so the serialized array can be decoded +using the existing array deserialization API, which needs dtype and length from the parent context. + +The `encoding_ids` field carries the array serialization context. Existing array serialization stores +encoding IDs as indices in the flatbuffer; expression literals do not have the file footer's array +context, so they must carry their own context. + +This requires adding a session-aware expression serialization path: + +```rust +pub struct ExprSerializeOptions<'a> { + pub array_ctx: &'a ArrayContext, + pub session: &'a VortexSession, +} + +pub trait ExprSerializeProtoExt { + fn serialize_proto(&self) -> VortexResult; + fn serialize_proto_with_options(&self, options: &ExprSerializeOptions<'_>) -> VortexResult; +} +``` + +The old `serialize_proto` can continue to work for scalar-only expressions. It should return an error +if asked to serialize a row-backed literal without the context required to serialize arrays. + +Deserialization already receives a `VortexSession`, so it can decode `ArrayLiteral` by constructing a +`ReadContext` from `encoding_ids`, decoding `serialized_array` with `dtype` and `len`, and validating +that `len == 1`. 
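The serialization rules in this section can be sketched with toy types. This is a sketch under stated assumptions: `ToyArray`, the `serialize` and `deserialize` functions, and the use of an optional slice of encoding IDs in place of `ExprSerializeOptions` are all hypothetical. It illustrates three behaviors from the text: scalar literals serialize without any array context, row-backed literals fail without one, and decoding rejects an `ArrayLiteral` whose `len` is not `1`.

```rust
// Hypothetical stand-ins; the real types are `Scalar`, `ArrayRef`, and protobuf messages.
#[derive(Clone, Debug, PartialEq)]
struct ToyArray {
    len: u64,
    bytes: Vec<u8>,
}

#[derive(Clone, Debug, PartialEq)]
enum LiteralValue {
    Scalar(i64),
    Row(ToyArray),
}

/// Mirrors the shape of the `LiteralOpts` oneof: a scalar or an array literal.
#[derive(Clone, Debug, PartialEq)]
enum LiteralOpts {
    Scalar(i64),
    Array { encoding_ids: Vec<String>, len: u64, serialized_array: Vec<u8> },
}

/// `encoding_ids` plays the role of the array serialization context.
fn serialize(lit: &LiteralValue, encoding_ids: Option<&[String]>) -> Result<LiteralOpts, String> {
    match (lit, encoding_ids) {
        // Scalar literals never need an array context.
        (LiteralValue::Scalar(v), _) => Ok(LiteralOpts::Scalar(*v)),
        (LiteralValue::Row(row), Some(ids)) => Ok(LiteralOpts::Array {
            encoding_ids: ids.to_vec(),
            len: row.len,
            serialized_array: row.bytes.clone(),
        }),
        // The context-free path must refuse row-backed literals.
        (LiteralValue::Row(_), None) => {
            Err("row-backed literal needs an array context".to_string())
        }
    }
}

fn deserialize(opts: LiteralOpts) -> Result<LiteralValue, String> {
    match opts {
        LiteralOpts::Scalar(v) => Ok(LiteralValue::Scalar(v)),
        LiteralOpts::Array { len, serialized_array, .. } => {
            if len != 1 {
                return Err(format!("array literal must have len 1, got {len}"));
            }
            Ok(LiteralValue::Row(ToyArray { len, bytes: serialized_array }))
        }
    }
}
```

Keeping the failure on the context-free path explicit means existing scalar-only callers keep working, while a row-backed literal cannot be silently dropped or flattened.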
+ +### DType validation + +`Scalar::try_new` should continue to run dtype validation for scalar values. This validation is +purely structural and should not require an `ExecutionCtx`. + +Row-backed constants and row-backed literals validate through array invariants: + +- the singleton row array must be a valid `ArrayRef` +- its dtype must match the literal or constant dtype +- its length must be 1 + +This cleanly separates scalar validation from array validation. There is no need for an execution +context to construct a scalar, and there is no need for scalar validation to understand array buffers +or device residency. + +### Device buffers + +Device buffers should not be allowed inside `Scalar`. + +Device buffers should be allowed inside row-backed constants and row-backed literals because those +are arrays. In-memory execution can preserve device residency. Scalar extraction from a row-backed +constant may require execution or host transfer depending on the underlying array and execution +context, but that cost is paid only when the caller explicitly requests a scalar. + +Portable expression serialization should copy buffers to host in the same way array serialization +does today. A future device-aware expression transport can carry `BufferHandle`s or external device +segments, but that is out of scope for this RFC. + +### Statistics + +For scalar-backed constants, statistics remain unchanged. + +For row-backed constants: + +- `Stat::IsConstant` is exactly true. +- `Stat::NullCount` can be derived from row validity and outer length. +- `Stat::Min` and `Stat::Max` may be derived lazily by extracting row 0 as a scalar when the dtype + supports scalar ordering and the value is non-null. +- If extracting a scalar would require undesirable execution, implementations may leave min/max + absent unless a compute path explicitly requests them. + +This preserves correctness while avoiding accidental scalarization during cheap metadata operations. 
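A small sketch of the statistics rules for row-backed constants follows. It is illustrative only: `ConstantStats` and `row_backed_stats` are invented names, the singleton row's value is modeled as an `Option<i64>` (with `None` meaning row 0 is null), and `allow_extraction` stands in for the decision of whether scalar extraction is acceptable at that point.

```rust
// Toy model of the statistics a row-backed constant could report.
#[derive(Debug, PartialEq)]
struct ConstantStats {
    is_constant: bool,
    null_count: usize,
    min_max: Option<(i64, i64)>,
}

/// `row0` is the singleton row's value, or `None` when row 0 is null.
/// `allow_extraction` models whether extracting a scalar is acceptable here.
fn row_backed_stats(row0: Option<i64>, len: usize, allow_extraction: bool) -> ConstantStats {
    ConstantStats {
        // A constant is constant by construction.
        is_constant: true,
        // Row validity broadcasts: either no row is null or every row is.
        null_count: if row0.is_some() { 0 } else { len },
        // Min and max coincide with the single value, but only when we may extract it.
        min_max: if allow_extraction { row0.map(|v| (v, v)) } else { None },
    }
}
```

Leaving `min_max` absent when extraction is not allowed is still correct, since an absent statistic makes no claim about the data.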
+ +## Compatibility + +Existing scalar-backed constants and scalar literals remain readable. + +Existing readers will not understand the new row-backed `vortex.constant` serialized form if it is +encoded under the same encoding ID. They will fail because the constant encoding has zero buffers and +one child instead of one scalar buffer. This is a forward-compatibility limitation, not silent data +corruption. + +Writers should expose a compatibility option: + +- modern mode: write row-backed complex constants +- legacy mode: write scalar-backed complex constants, or canonicalize complex constants before writing + +The expression protobuf change is backward-compatible for readers that accept both `scalar` and +`array` literal variants. Older readers will not understand `array` literals. + +Public Rust API compatibility should be managed in phases: + +1. Add new `ConstantValue` and literal APIs. +2. Keep existing scalar-only helpers for existing callers. +3. Migrate internal kernels that can benefit from row-backed constants. +4. Deprecate ambiguous APIs such as `ConstantArray::scalar()` if they cannot represent row-backed + constants safely. +5. Consider breaking API cleanup only after downstream integrations have a migration path. + +## Drawbacks + +This adds a second representation for constants, and kernels must be explicit about whether they +need scalar constants or can operate on row-backed constants. + +Expression serialization becomes more complex because array literals need an array serialization +context. The existing scalar-only expression serialization path is simpler, but it is also the source +of the current inefficiency for complex values. + +Some code that currently assumes every constant has a `Scalar` will need to be audited. The upside is +that this audit makes accidental scalarization visible instead of hiding it behind `as_constant()`. + +Row-backed constants do not remove all materialization costs. 
If a caller asks for canonical +fixed-size-list arrays, scalar extraction, or legacy serialization, Vortex may still need to build +repeated values. The important change is that these costs move to explicit boundaries. + +## Alternatives + +### Make Scalar array-backed + +We could add `ScalarValue::Array(ArrayRef)` and represent complex scalars directly as singleton +arrays. + +This is rejected because it makes `Scalar` no longer context-free. Scalar equality, hashing, +display, validation, and serialization would all need to handle arrays, and arrays may require +execution or device-to-host transfer. That would make `Scalar::try_new` and scalar literals depend on +execution context, which is exactly the direction we want to avoid. + +### Replace Scalar with length-1 arrays everywhere + +This is conceptually clean, but too disruptive. `Scalar` is still useful for stats, display, FFI, +Python interop, expression literals, and primitive constants. Replacing it everywhere would force +array execution into places that need a cheap value object. + +This RFC takes the smaller step: arrays are used for complex execution constants, while `Scalar` +remains the host literal representation. + +### Always canonicalize complex literals at deserialization + +This avoids carrying arbitrary encodings inside expression literals, but it throws away information +and can eagerly allocate. If a literal was serialized as a specialized array, deserialization should +not immediately flatten it unless execution demands that. + +### Add a separate `vortex.constant.row` encoding + +A new encoding ID would make forward incompatibility clearer for old readers. It would also avoid +changing the shape of `vortex.constant`. + +This is a reasonable fallback, but the in-memory model should still be a single logical +`ConstantArray` with scalar-backed and row-backed variants. Most users and kernels should not care +which wire-level encoding was used. 
+ +### Keep nested scalar values and optimize hot paths + +We could add specialized list-scalar and struct-scalar storage to reduce allocation. That may help +some cases, but it still duplicates the array system and still does not solve device residency or +array literal serialization. + +## Prior Art + +Apache Arrow distinguishes between scalars and arrays, but computation generally operates over +arrays and treats scalar inputs as broadcast values. That is the model this RFC follows: scalar values +remain useful, but execution should be able to represent a broadcast value as array data. + +Database vector engines commonly distinguish constant vectors from flat vectors. A constant vector +does not necessarily mean "store a recursive scalar object"; it means "the logical row value is the +same for every row." For nested values, the payload can still be represented by child vectors. + +Vortex already has the pieces of this model: `ArrayRef`, `ConstantArray`, array serialization, +validity, and singleton rows. The missing piece is allowing constants and literals to carry singleton +arrays directly. + +## Unresolved Questions + +- Should row-backed constants use the existing `vortex.constant` encoding ID with a new child-based + shape, or should the wire format use a separate `vortex.constant.row` encoding ID? +- Should `lit(Scalar::list(...))` continue to produce a scalar-backed literal, or should it eagerly + build a singleton array for complex scalar values? +- What is the exact public API migration for `ConstantArray::scalar()` and `ArrayRef::as_constant()`? +- Should large UTF-8 or binary values ever become row-backed constants based on size? +- How much min/max statistic support should row-backed constants provide without explicit execution? + +## Future Possibilities + +Once row-backed constants exist, Vortex can add more array-native literal construction APIs in Python, +Java, C++, DuckDB, and DataFusion integrations. 
+ +Expression serialization could eventually use a general "literal payload" abstraction that supports +host arrays, device buffers, and external buffer references. That would allow complex literals to be +transported without copying through a monolithic protobuf payload. + +The same singleton-row mechanism may also help with dictionary values, sparse fill values, and other +places where Vortex currently stores a complex repeated value as a scalar. From 59dd543abb578783c29de45e4dd6404d0d7c1729 Mon Sep 17 00:00:00 2001 From: Nicholas Gates Date: Mon, 4 May 2026 09:23:33 -0400 Subject: [PATCH 2/4] Scalar Values Signed-off-by: Nicholas Gates --- rfcs/0028-scalar-values.md | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/rfcs/0028-scalar-values.md b/rfcs/0028-scalar-values.md index 89e69ee..b8bcc9f 100644 --- a/rfcs/0028-scalar-values.md +++ b/rfcs/0028-scalar-values.md @@ -25,6 +25,11 @@ binary values. Row-backed constants become the representation for non-null list, struct, variant, and other complex values where nested scalar materialization is expensive or requires array-level storage. +Scalar functions already operate over `ArrayRef` inputs. This RFC does not change that calling +convention. It only requires literals and constants to enter scalar-function execution as +`ConstantArray(len = row_count, value = ...)`, with complex values represented by row-backed +constants. + ## Motivation Vortex currently uses `ScalarValue::Tuple(Vec>)` for list, fixed-size-list, and @@ -56,6 +61,8 @@ expensive nested scalar object. - Avoid requiring an `ExecutionCtx` to construct or validate a `Scalar`. - Allow in-memory complex constants to hold device-resident array buffers without copying them into host scalar values. +- Define a scalar-function broadcasting model where every input has logical length `row_count`, and + constants are represented as `ConstantArray`s. 
- Preserve compatibility with existing scalar literal and constant-array encodings. ## Non-Goals @@ -64,6 +71,7 @@ expensive nested scalar object. - This RFC does not require every scalar-like API to move to arrays in one change. - This RFC does not define a device-resident `Scalar`. - This RFC does not require canonicalizing complex constants during expression deserialization. +- This RFC does not redesign scalar function child execution or `Columnar`. ## Design @@ -228,6 +236,28 @@ Row-backed constants should canonicalize by broadcasting the singleton row struc The key rule is that row-backed constants should not be converted into recursive `ScalarValue::Tuple` except when an API explicitly asks for a `Scalar`. +### Scalar functions and broadcasting + +Scalar functions already take `ArrayRef` inputs. This RFC keeps that interface unchanged. + +The required calling convention is: + +- every input array has logical length `args.row_count()` +- scalar broadcasting is represented by `ConstantArray(len = args.row_count(), value = ...)` +- row-backed broadcasting is represented by + `ConstantArray(len = args.row_count(), value = Row(singleton_array))` +- naked length-1 arrays are not normal scalar function inputs, except for private sub-executions + such as evaluating an all-constant expression once + +This means scalar functions do support broadcasting, but broadcasting is encoded in the array +representation rather than in per-kernel length checks. A scalar function should not need to handle +both `len == 1` and `len == row_count` inputs. Its inputs are always length `row_count`; some are +physically constant. + +Scalar functions may still use constant fast paths by checking whether an input is a `ConstantArray`. +This RFC only broadens what a constant can contain. It does not require changing the scalar-function +execution API, adding argument materialization helpers, or changing `Columnar`. 
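The calling convention above can be illustrated with a toy scalar function. This is a sketch, not the Vortex scalar-function API: `Input` and `add` are hypothetical, and values are plain `i64`. It shows that every input is checked against the logical `row_count`, that a naked length-1 input is rejected rather than implicitly broadcast, and that a constant fast path falls out of inspecting the representation.

```rust
// Toy inputs: either a flat column or a broadcast constant of logical length `len`.
#[derive(Clone, Debug, PartialEq)]
enum Input {
    Flat(Vec<i64>),
    Constant { value: i64, len: usize },
}

impl Input {
    fn len(&self) -> usize {
        match self {
            Input::Flat(values) => values.len(),
            Input::Constant { len, .. } => *len,
        }
    }

    fn get(&self, i: usize) -> i64 {
        match self {
            Input::Flat(values) => values[i],
            Input::Constant { value, .. } => *value,
        }
    }
}

fn add(lhs: &Input, rhs: &Input, row_count: usize) -> Result<Input, String> {
    // No per-kernel "len == 1 or len == row_count" handling: lengths must match exactly.
    if lhs.len() != row_count || rhs.len() != row_count {
        return Err(format!("every input must have logical length {row_count}"));
    }
    match (lhs, rhs) {
        // Constant fast path: evaluate once, keep the result constant.
        (Input::Constant { value: a, .. }, Input::Constant { value: b, .. }) => {
            Ok(Input::Constant { value: a + b, len: row_count })
        }
        _ => Ok(Input::Flat((0..row_count).map(|i| lhs.get(i) + rhs.get(i)).collect())),
    }
}
```

Because broadcasting lives in the input representation, the kernel body has a single length to reason about.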
+ ### Serialization of ConstantArray The existing serialized form for `vortex.constant` should remain readable: From 197fbce85fa49413af67d31259b126a829c2bdac Mon Sep 17 00:00:00 2001 From: Nicholas Gates Date: Mon, 4 May 2026 09:28:10 -0400 Subject: [PATCH 3/4] Extension Types Signed-off-by: Nicholas Gates --- rfcs/{0028-scalar-values.md => 0056-scalar-values.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename rfcs/{0028-scalar-values.md => 0056-scalar-values.md} (100%) diff --git a/rfcs/0028-scalar-values.md b/rfcs/0056-scalar-values.md similarity index 100% rename from rfcs/0028-scalar-values.md rename to rfcs/0056-scalar-values.md From ffe633839eeec54b09ee66acad684d7cd17cee7e Mon Sep 17 00:00:00 2001 From: Nicholas Gates Date: Mon, 4 May 2026 11:01:04 -0400 Subject: [PATCH 4/4] Update author Signed-off-by: Nicholas Gates --- rfcs/0056-scalar-values.md | 831 +++++++++++++++++++++---------------- 1 file changed, 480 insertions(+), 351 deletions(-) diff --git a/rfcs/0056-scalar-values.md b/rfcs/0056-scalar-values.md index b8bcc9f..c62a5b6 100644 --- a/rfcs/0056-scalar-values.md +++ b/rfcs/0056-scalar-values.md @@ -1,359 +1,404 @@ - Start Date: 2026-05-04 -- Authors: @ngates -- RFC PR: [vortex-data/rfcs#28](https://github.com/vortex-data/rfcs/pull/28) +- Authors: @gatesn +- RFC PR: [vortex-data/rfcs#56](https://github.com/vortex-data/rfcs/pull/56) # Scalar Values and Complex Constants ## Summary -Vortex should keep `Scalar` as a small, host-resident, context-free value representation, and stop -using it as the primary execution representation for complex values. Complex constants should be -represented in the array world as length-1 arrays wrapped by `ConstantArray`, and complex expression -literals should serialize as singleton arrays instead of recursively nested scalar values. +Vortex should stop using recursive scalar objects as the row representation for arrays. 
-This proposal introduces a scalar-or-row-backed constant representation: +The new split is: -```rust -pub enum ConstantValue { - Scalar(Scalar), - Row(ArrayRef), // invariant: len == 1, dtype == the ConstantArray dtype -} -``` +- `Scalar` remains a small host value for scalar-compatible values: null, bool, primitive, decimal, + UTF-8, binary, and optionally small structs whose fields are scalar-compatible. +- Array row access becomes array-shaped. A row is a length-1 `ArrayRef`, not a type-erased `Scalar`. +- `ConstantArray` becomes the only execution representation for broadcast values of any dtype. It is + always a wrapper around a length-1 child array. +- Complex expression literals are represented by a new `ArrayLiteral` expression that holds an + `ArrayRef` directly. -Scalar-backed constants remain the fast path for nulls, booleans, primitives, decimals, UTF-8, and -binary values. Row-backed constants become the representation for non-null list, fixed-size-list, -struct, variant, and other complex values where nested scalar materialization is expensive or -requires array-level storage. - -Scalar functions already operate over `ArrayRef` inputs. This RFC does not change that calling -convention. It only requires literals and constants to enter scalar-function execution as -`ConstantArray(len = row_count, value = ...)`, with complex values represented by row-backed -constants. +There is no scalar-backed vs row-backed `ConstantArray` variant. The bet in this RFC is that the +uniform representation is worth paying for primitive constants too: even `ConstantArray(42, N)` is +represented as a length-1 primitive array wrapped by `ConstantArray`. ## Motivation -Vortex currently uses `ScalarValue::Tuple(Vec>)` for list, fixed-size-list, and -struct scalars. That is convenient for simple expression literals, but it is a poor representation -for execution and serialization of list-like values. 
+Today's `Scalar` has two jobs: + +- it is a compact host value for simple values such as `i32`, `bool`, and `utf8` +- it is also a recursive container for nested values such as lists and structs + +The first job is useful. Small struct-like host records can also be useful, for example expression +literals or metadata records. The second job duplicates the array system and makes list-like values +expensive. + +Problems with the current design: -The main problems are: +- List-like scalars recursively allocate and walk `ScalarValue::Tuple`. +- Extracting a nested scalar from an array row requires recursive `execute_scalar` calls. +- Complex expression literals serialize as recursive protobuf scalar values instead of preserving + buffers, offsets, validity, and encodings. +- `ConstantArray` stores a `Scalar`, so complex constants are forced through nested scalar form even + though execution already has an array representation. +- There is no clean public way to embed an `ArrayRef` in an expression. The codebase already has a + private `ArrayExpr` for validity expressions, backed by fake equality and hashing. That is a sign + the expression model is missing an explicit array literal node. -- Nested scalar values duplicate structure that Vortex already represents efficiently as arrays. -- Constructing a list scalar from an array row requires recursively calling `execute_scalar` for - every nested element. -- Serializing a complex literal as protobuf recursively expands the value tree instead of preserving - buffers, offsets, validity, and existing encodings. -- `ConstantArray` currently stores only a `Scalar`, so array kernels that detect constants often - accidentally force complex data back through scalar form. -- Moving scalar values toward array-backed storage would make `Scalar::try_new` depend on execution - context, buffer residency, and possibly device synchronization, which would make `Scalar` much less - useful as a simple host literal. 
+The desired model is simpler: -This has shown up most clearly in list-oriented expressions such as `list_contains`, where a constant -list literal should be cheap to carry around, serialize, and compare against, but currently becomes an -expensive nested scalar object. +- scalar-compatible host values use `Scalar` +- array rows use arrays +- all broadcast execution uses `ConstantArray` +- expressions can carry either a host scalar literal or an array literal ## Goals -- Keep `Scalar` simple, host-resident, context-free, and cheap for primitive-style values. -- Let execution represent complex constants as arrays, preserving array buffers and encodings. -- Let expression literals carry complex values without recursively serializing nested scalar trees. -- Avoid requiring an `ExecutionCtx` to construct or validate a `Scalar`. -- Allow in-memory complex constants to hold device-resident array buffers without copying them into - host scalar values. -- Define a scalar-function broadcasting model where every input has logical length `row_count`, and - constants are represented as `ConstantArray`s. -- Preserve compatibility with existing scalar literal and constant-array encodings. +- Remove recursive scalar values from array row access, constants, and aggregate partial transport. +- Keep `Scalar` small, host-resident, and context-free. +- Allow restricted struct-like host scalars when every field is scalar-compatible. +- Remove type-erased `scalar_at` / `execute_scalar` as the universal array row API. +- Represent every `ConstantArray` as a broadcast of a length-1 child array. +- Add a first-class `ArrayLiteral` expression for embedding `ArrayRef` values in expressions. +- Avoid requiring content equality or hashing for embedded array literals. +- Preserve scalar function execution over `ArrayRef` inputs. +- Preserve a staged migration path from today's scalar APIs. ## Non-Goals -- This RFC does not remove `ScalarValue::Tuple` immediately. 
-- This RFC does not require every scalar-like API to move to arrays in one change. +- This RFC does not require renaming every public API in one PR. +- This RFC does not require changing scalar function input semantics. - This RFC does not define a device-resident `Scalar`. -- This RFC does not require canonicalizing complex constants during expression deserialization. -- This RFC does not redesign scalar function child execution or `Columnar`. +- This RFC does not make `ArrayLiteral` content-equal. Array literal equality is pointer identity by + design. ## Design -### Scalar remains a host literal +### Scalar means host value -`Scalar` should remain: - -```rust -pub struct Scalar { - dtype: DType, - value: Option, -} -``` - -Its contract should be narrowed and documented: +The intended `Scalar` contract is: - `Scalar` is a host value. -- `Scalar::try_new` validates only dtype/value compatibility. -- `Scalar::try_new` never needs an `ExecutionCtx`. +- `Scalar::try_new` validates dtype/value compatibility without an `ExecutionCtx`. - `Scalar` never stores `ArrayRef`, `BufferHandle`, or device buffers. -- `Scalar` is appropriate for expression literals, stats, FFI values, Python values, display, tests, - and scalar-at results where the caller explicitly asked for a scalar. +- `Scalar` is valid for dtypes whose value fits in a direct host payload: null, bool, primitive, + decimal, UTF-8, and binary. +- `Scalar` may also support a restricted struct payload, preferably named `ScalarValue::Struct` + rather than generic `Tuple`. Every field dtype must itself be scalar-compatible. This gives Vortex + a cheap host record without preserving list-like scalar trees. +- `Scalar` is appropriate for expression literals, scalar-compatible statistics, FFI values, Python + values, display, tests, and APIs where the caller explicitly asks for a host value. -`ScalarValue::Tuple` remains valid for compatibility and for small host values. 
It should no longer -be the default representation for complex constants inside array execution. +These dtypes do not have scalar values in the long-term model: -### ConstantArray stores either a Scalar or a singleton row +- list +- fixed-size-list +- variant +- extension values whose storage dtype is nested +- structs containing any field that is not scalar-compatible -`ConstantData` should be changed from: +Those values are represented as arrays. A single such value is a length-1 array. -```rust -pub struct ConstantData { - scalar: Scalar, -} -``` +The restricted struct scalar is not an array row representation. It is a host value for places where +building arrays would be needless overhead. Array rows, constants, builders, and aggregate partial +exchange still use arrays. -to: +### ConstantArray is always a broadcast child + +`ConstantArray` should store its value as a length-1 child array: ```rust -pub struct ConstantData { - value: ConstantValue, -} +pub struct ConstantData; -pub enum ConstantValue { - Scalar(Scalar), - Row(ArrayRef), +impl ConstantArray { + pub fn with_child(child: ArrayRef, broadcast_len: usize) -> VortexResult; } ``` -The row variant has these invariants: +`ConstantArray::with_child` has these invariants: -- `row.len() == 1` -- `row.dtype() == constant_array.dtype()` -- the outer `ConstantArray` length is independent of the row length +- `child.len() == 1` +- `child.dtype() == constant.dtype()` +- the outer constant length is `broadcast_len` +- the outer constant has no separate validity +- if `child` is itself a `ConstantArray`, construction normalizes by unwrapping to the inner child -Recommended constructors: +The value of `ConstantArray` at any logical row `i` is `child[0]`. -```rust -impl ConstantArray { - pub fn new>(scalar: S, len: usize) -> Self; +This is the only in-memory representation. There is no `ConstantValue::Scalar` variant and no +`ConstantValue::Row` variant. 
The child slot is the source of truth for all dtypes, including +primitive dtypes. - pub fn try_new_value(value: ConstantValue, len: usize) -> VortexResult; +### Primitive constants pay the wrapper cost - pub fn try_new_row(row: ArrayRef, len: usize) -> VortexResult; +This RFC intentionally chooses the uniform representation even for primitive constants. - pub fn constant_value(&self) -> ConstantValueRef<'_>; +For example: - pub fn scalar(&self) -> Option<&Scalar>; - - pub fn row(&self) -> Option<&ArrayRef>; -} +```rust +ConstantArray::new(42i32, 1_000_000) ``` -The exact public API names can be adjusted, but the important distinction is that callers must be -able to ask whether a constant is scalar-backed or row-backed without forcing materialization. - -`ArrayRef::as_constant()` should keep its current semantics as a scalar-only helper. It should return -`Some(Scalar)` only for scalar-backed constants. New helpers should be added for callers that can -handle complex constants: +is represented as: -```rust -impl ArrayRef { - pub fn as_constant_scalar(&self) -> Option; - pub fn as_constant_row(&self) -> Option; - pub fn as_constant_value(&self) -> Option>; +```text +ConstantArray { + len: 1_000_000, + child: PrimitiveArray([42]), } ``` -This is intentionally conservative. Existing kernels that call `as_constant()` generally expect a -`Scalar` and should not silently get a value that may require execution, allocation, or device reads. 
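The `with_child` invariants can be sketched with toy types. This is a minimal model under stated assumptions, not Vortex code: `ToyArray`, `ToyConstant`, and the `i32`-only payload are hypothetical stand-ins for `ArrayRef`, `ConstantArray`, and real dtypes. It shows only the length-1 check, the unwrapping of nested constants, and the rule that every logical row reads `child[0]`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an array: plain nullable values or a broadcast constant.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyArray {
    Values(Vec<Option<i32>>),
    Constant(ToyConstant),
}

/// Hypothetical stand-in for `ConstantArray`: a length-1 child plus a broadcast length.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyConstant {
    child: Arc<ToyArray>,
    len: usize,
}

impl ToyArray {
    pub fn len(&self) -> usize {
        match self {
            ToyArray::Values(values) => values.len(),
            ToyArray::Constant(constant) => constant.len,
        }
    }
}

impl ToyConstant {
    /// Rejects children whose length is not 1 and unwraps nested constants.
    pub fn with_child(child: ToyArray, broadcast_len: usize) -> Result<Self, String> {
        if child.len() != 1 {
            return Err(format!("constant child must have length 1, got {}", child.len()));
        }
        // Normalize: a constant of a constant stores the inner child directly.
        let child = match child {
            ToyArray::Constant(inner) => inner.child,
            other => Arc::new(other),
        };
        Ok(Self { child, len: broadcast_len })
    }

    pub fn child(&self) -> &ToyArray {
        &self.child
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Returns `None` when out of bounds; otherwise the value of `child[0]`.
    pub fn get(&self, index: usize) -> Option<Option<i32>> {
        if index >= self.len {
            return None;
        }
        match self.child.as_ref() {
            ToyArray::Values(values) => Some(values[0]),
            ToyArray::Constant(inner) => inner.get(0),
        }
    }
}
```

In this sketch, `ToyConstant::with_child(ToyArray::Values(vec![Some(42)]), 1_000_000)` allocates one child element regardless of the broadcast length, which mirrors the `ConstantArray::new(42i32, 1_000_000)` example.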
+The expected costs are: -### Choosing the representation +- one length-1 child array allocation for primitive constants +- one child-slot hop in constant fast paths +- updated constant kernels that read from the child instead of an inline scalar -New constants should use scalar-backed representation when the value is naturally scalar: +The expected benefits are: -- `Null` -- `Bool` -- `Primitive` -- `Decimal` -- `Utf8` -- `Binary` +- one constant representation for every dtype +- no recursive scalar values for nested constants +- constant serialization reuses array serialization uniformly +- validity, stats, hashing, equality, slicing, and casting all flow through the array machinery -New constants should use row-backed representation when the value is complex and non-null: +Hot kernels should still have ergonomic helpers, but those helpers must not introduce a second +storage representation. For example, an accessor like this is acceptable: -- `List` -- `FixedSizeList` -- `Struct` -- `Variant` -- `Extension` values whose storage dtype is complex - -Null complex values may remain scalar-backed. A null complex scalar is compact and does not contain -nested values, so there is no benefit to constructing a singleton array just to represent absence. - -The boundary should be pragmatic rather than philosophical. If a future scalar representation becomes -bad for large binary values, a threshold can move those to row-backed constants as well. - -### Scalar extraction +```rust +impl ConstantArray { + pub fn scalar_value(&self, ctx: &mut ExecutionCtx) -> VortexResult>; +} +``` -`ConstantArray::execute_scalar(index, ctx)` should behave as follows: +or, for scalar-compatible fast paths: -- scalar-backed: clone and return the scalar -- row-backed: return `row.execute_scalar(0, ctx)` +```rust +impl ConstantArray { + pub fn primitive_value(&self) -> Option; +} +``` -This preserves the public scalar-at contract, but makes scalar materialization explicit and lazy. 
Code -that only needs to move, serialize, compare, or execute a complex constant can stay in array form. +These are read helpers over the length-1 child. They are not alternate storage. -### Validity +### Equality, hashing, validity, and stats -For scalar-backed constants, validity is unchanged: +`ConstantArray` equality and hashing should be content-based through the child slot. Two constants +with independently constructed but equal length-1 children should compare equal and hash equal. -- non-nullable dtype: `Validity::NonNullable` -- nullable and non-null scalar: `Validity::AllValid` -- nullable and null scalar: `Validity::AllInvalid` +Validity is owned by the child: -For row-backed constants, validity should broadcast the singleton row validity: +- if the child row is valid, the constant is all-valid +- if the child row is invalid, the constant is all-invalid +- if the child dtype is non-nullable, the constant is non-nullable -- if the row is non-nullable: `Validity::NonNullable` -- if row 0 is valid: `Validity::AllValid` -- if row 0 is invalid: `Validity::AllInvalid` -- if determining row validity requires an array value, represent the validity as a constant boolean - array instead of converting the row to a scalar +Statistics derive from the child: -The last case is important for device and deferred execution. It should not force a host scalar read -just to determine whether a row-backed constant is valid. +- rank-style stats such as min, max, is_constant, and is_sorted pass through when available +- count-style stats such as null_count and true_count scale by `broadcast_len` +- nested stats should avoid constructing recursive scalar values ### Canonicalization and execution -Scalar-backed constants keep the existing optimized canonicalization paths. 
- -Row-backed constants should canonicalize by broadcasting the singleton row structurally: +Canonicalization should broadcast the length-1 child structurally: -- Struct constants produce a `StructArray` whose fields are `ConstantArray`s wrapping the singleton - field rows. -- List constants produce a list-view-style canonical array whose offsets and sizes are constant and - whose elements are the singleton row's element slice. -- Fixed-size-list constants may need to materialize repeated elements when a canonical - `FixedSizeListArray` is requested, because that canonical layout requires `len * list_size` - elements. This is acceptable because canonicalization is an explicit execution boundary. -- Primitive, decimal, UTF-8, binary, and bool row-backed constants are allowed but should usually be - normalized to scalar-backed constants. +- primitive, bool, decimal, UTF-8, and binary constants can materialize repeated canonical values +- struct constants can produce a `StructArray` whose fields are constant arrays over each child field + row +- list constants can produce a list-view-style canonical array whose offsets and sizes repeat the + singleton row shape +- fixed-size-list constants may need to materialize repeated elements when a canonical + fixed-size-list array is requested -The key rule is that row-backed constants should not be converted into recursive `ScalarValue::Tuple` -except when an API explicitly asks for a `Scalar`. +The key rule is that canonicalization may allocate at an explicit boundary, but normal constant +transport should not convert nested values into recursive scalar values. ### Scalar functions and broadcasting Scalar functions already take `ArrayRef` inputs. This RFC keeps that interface unchanged. 
-The required calling convention is: +The calling convention remains: - every input array has logical length `args.row_count()` -- scalar broadcasting is represented by `ConstantArray(len = args.row_count(), value = ...)` -- row-backed broadcasting is represented by - `ConstantArray(len = args.row_count(), value = Row(singleton_array))` +- broadcasting is represented by `ConstantArray(len = args.row_count(), child = length_1_array)` - naked length-1 arrays are not normal scalar function inputs, except for private sub-executions such as evaluating an all-constant expression once -This means scalar functions do support broadcasting, but broadcasting is encoded in the array -representation rather than in per-kernel length checks. A scalar function should not need to handle -both `len == 1` and `len == row_count` inputs. Its inputs are always length `row_count`; some are -physically constant. - Scalar functions may still use constant fast paths by checking whether an input is a `ConstantArray`. -This RFC only broadens what a constant can contain. It does not require changing the scalar-function +This RFC only changes what a constant contains. It does not require changing the scalar-function execution API, adding argument materialization helpers, or changing `Columnar`. 
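The calling convention can be illustrated with a toy kernel. This is a sketch under assumptions: `ToyColumn` and `add` are hypothetical names, and the payload is limited to `i32`. Every input must already have logical length `row_count`; a naked length-1 array is rejected, and the constant fast path reads the length-1 children once and re-broadcasts the result.

```rust
/// Hypothetical stand-in for an `ArrayRef` input to a scalar function.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyColumn {
    Values(Vec<i32>),
    /// Broadcast of a length-1 child to `len` logical rows.
    Constant { child: Vec<i32>, len: usize },
}

impl ToyColumn {
    pub fn constant(value: i32, len: usize) -> Self {
        ToyColumn::Constant { child: vec![value], len }
    }

    pub fn len(&self) -> usize {
        match self {
            ToyColumn::Values(values) => values.len(),
            ToyColumn::Constant { len, .. } => *len,
        }
    }

    fn get(&self, index: usize) -> i32 {
        match self {
            ToyColumn::Values(values) => values[index],
            ToyColumn::Constant { child, .. } => child[0],
        }
    }
}

/// Element-wise addition where both inputs have logical length `row_count`.
pub fn add(lhs: &ToyColumn, rhs: &ToyColumn, row_count: usize) -> Result<ToyColumn, String> {
    for input in [lhs, rhs] {
        if input.len() != row_count {
            return Err(format!("input has length {}, expected {}", input.len(), row_count));
        }
    }
    // Fast path: both inputs are constant, so compute once and re-broadcast.
    if let (ToyColumn::Constant { child: l, .. }, ToyColumn::Constant { child: r, .. }) = (lhs, rhs) {
        return Ok(ToyColumn::constant(l[0] + r[0], row_count));
    }
    // General path: no per-kernel handling of `len == 1` inputs is needed.
    Ok(ToyColumn::Values(
        (0..row_count).map(|i| lhs.get(i) + rhs.get(i)).collect(),
    ))
}
```

The kernel never branches on `len == 1`; broadcasting is visible only through the `Constant` variant.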
-### Serialization of ConstantArray - -The existing serialized form for `vortex.constant` should remain readable: +### ArrayLiteral expression -- metadata: empty -- buffers: one protobuf-encoded `ScalarValue` -- children: none +Expression literals should split into two expression nodes: -Add a row-backed serialized form: +- scalar literal expression: stores a host `Scalar` +- array literal expression: stores an `ArrayRef` -- metadata: empty, or a small version/kind marker if desired -- buffers: none -- children: one array node, decoded with the same dtype and length `1` +Add a first-class `ArrayLiteral` expression: -Deserialization should accept both forms: +```rust +pub struct ArrayLiteral; -```text -buffers.len() == 1 && children.len() == 0 => Scalar-backed legacy constant -buffers.len() == 0 && children.len() == 1 => Row-backed constant -otherwise => error +pub struct ArrayLiteralOptions { + array: ArrayRef, +} ``` -This reuses the existing array serialization machinery, including buffers, offsets, validity, and -encoding trees. It also means complex constants can preserve specialized encodings instead of being -flattened into nested protobuf scalar values. +`ArrayLiteralOptions` implements `PartialEq`, `Eq`, and `Hash` by array pointer identity, not by +array contents. + +That means: -New writers should use the row-backed form for complex non-null constants. Writers that need to -target older readers can keep an option to force the legacy scalar form or canonicalize complex -constants before writing. +- two `ArrayLiteral`s wrapping the same `ArrayRef` allocation compare equal +- two independently constructed arrays with equal contents compare unequal +- hashing does not walk buffers, execute arrays, inspect device memory, or depend on array contents -### Expression literals +This is the right tradeoff for expression identity. Embedded arrays can be large, compressed, +deferred, or device-resident. 
Expression equality should not accidentally become array equality. -Expression literals should no longer be restricted to `Scalar`. +`ArrayLiteral` has arity 0. Its return dtype is `array.dtype()`. -Introduce: +Execution: ```rust -pub enum LiteralValue { - Scalar(Scalar), - Row(ArrayRef), // len == 1 +impl ScalarFnVTable for ArrayLiteral { + type Options = ArrayLiteralOptions; + + fn execute( + &self, + options: &ArrayLiteralOptions, + args: &dyn ExecutionArgs, + ctx: &mut ExecutionCtx, + ) -> VortexResult<ArrayRef> { + let array = options.array.clone(); + + if array.len() == args.row_count() { + return array.execute(ctx); + } + + if array.len() == 1 { + return Ok(ConstantArray::with_child(array, args.row_count())?.into_array()); + } + + vortex_bail!( + "ArrayLiteral length {} cannot be used in execution scope of length {}", + array.len(), + args.row_count(), + ); + } } ``` -`Literal` then becomes: +Nested expression literal helpers should normally construct length-1 arrays and wrap them in +`ArrayLiteral`. Supporting `len == row_count` is useful for internal expression construction and can +replace the private `ArrayExpr` currently used by scalar-function validity expressions. + +Recommended constructors: ```rust -impl ScalarFnVTable for Literal { - type Options = LiteralValue; -} +pub fn lit(value: impl Into<Scalar>) -> Expression; + +pub fn lit_array(array: ArrayRef) -> Expression; + +pub fn lit_list(elements: ArrayRef) -> VortexResult<Expression>; + +pub fn lit_fixed_size_list(elements: ArrayRef) -> VortexResult<Expression>; + +pub fn lit_struct( + dtype: StructDType, + fields: impl IntoIterator, +) -> VortexResult<Expression>; + +pub fn lit_variant(value: ArrayRef) -> VortexResult<Expression>; ``` -Execution is straightforward: +The nested helpers build a length-1 array of the requested dtype and return `lit_array(row)`. +Scalar-compatible struct records may also use `lit(Scalar::struct_(...))`; execution still +materializes them through a child-backed `ConstantArray`. 
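Pointer-identity equality and hashing for `ArrayLiteralOptions` can be sketched with the standard library alone. The names below are hypothetical: `ToyArrayRef` stands in for `ArrayRef` and `ToyArrayLiteralOptions` for the real options type. The sketch assumes the array handle is an `Arc`, so `Arc::ptr_eq` and the allocation address supply identity without reading any buffer contents.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical stand-in for `ArrayRef`: a shared, immutable buffer.
pub type ToyArrayRef = Arc<Vec<i64>>;

/// Options for a toy array literal, compared and hashed by allocation address.
#[derive(Debug, Clone)]
pub struct ToyArrayLiteralOptions {
    pub array: ToyArrayRef,
}

impl PartialEq for ToyArrayLiteralOptions {
    fn eq(&self, other: &Self) -> bool {
        // Same allocation means equal; contents are never inspected.
        Arc::ptr_eq(&self.array, &other.array)
    }
}

impl Eq for ToyArrayLiteralOptions {}

impl Hash for ToyArrayLiteralOptions {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the pointer value only, which keeps hashing O(1).
        std::ptr::hash(Arc::as_ptr(&self.array), state);
    }
}

/// Convenience helper for demonstrating that equal options hash equally.
pub fn hash_of(options: &ToyArrayLiteralOptions) -> u64 {
    let mut hasher = DefaultHasher::new();
    options.hash(&mut hasher);
    hasher.finish()
}
```

Two options built from clones of the same `Arc` compare equal and hash equal, while two options built from separately allocated vectors with identical contents compare unequal. That is the behavior this section asks for.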
+ +### Per-row access + +Per-row access is fundamentally an array operation. The universal API should return a one-row array: ```rust -match literal { - LiteralValue::Scalar(s) => ConstantArray::new(s.clone(), row_count), - LiteralValue::Row(row) => ConstantArray::try_new_row(row.clone(), row_count)?, +impl ArrayRef { + pub fn row(&self, index: usize) -> VortexResult { + self.slice(index..index + 1) + } } ``` -Recommended expression constructors: +The semantic fallback for host value extraction starts from that row array. The caller canonicalizes +the length-1 array and uses the typed canonical API to read the value: ```rust -pub fn lit(value: impl Into) -> Expression; +let row = array.row(index)?; +let canonical = row.to_canonical(ctx)?; +let value = canonical.as_primitive::()?.value(0); +``` + +That fallback is correct, but it is not the required fast path. Many callers need repeated point +reads from scalar-compatible arrays, and forcing every read through singleton materialization would +regress compressed and wrapper encodings. + +Add a fallible, stateful probe API for those callers: -pub fn lit_row(row: ArrayRef) -> VortexResult; +```rust +pub struct ProbeOptions { + // Exact fields intentionally left unspecified by this RFC. +} + +pub trait ScalarProbe { + fn dtype(&self) -> &DType; + fn get(&mut self, index: usize) -> VortexResult; +} -pub fn lit_value(value: LiteralValue) -> VortexResult; +impl ArrayRef { + pub fn scalar_probe( + &self, + options: ProbeOptions, + ctx: &mut ExecutionCtx, + ) -> VortexResult>; +} ``` -`lit(value: impl Into)` remains for compatibility and simple values. Integrations that need -to pass list-like or struct-like values into expressions should use `lit_row`. +Probe construction can fail for dtypes that are not scalar-compatible. It can also fail when an +encoding chooses not to support scalar probing directly, in which case the caller can use the +semantic row/canonical fallback if that cost is acceptable. 
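To show why the probe is stateful, here is a toy probe over a run-end encoded column. Everything in it is an assumption made for illustration: `ToyRunEnd`, `ToyProbe`, and `RunEndProbe` are hypothetical, the values are plain `i32` rather than `Scalar`, and `ends` is assumed to be strictly increasing. The cursor makes monotonic reads cheap, and a backward seek falls back to binary search.

```rust
/// Hypothetical run-end encoded column: `ends[i]` is the exclusive end row of run `i`.
pub struct ToyRunEnd {
    pub ends: Vec<usize>,
    pub values: Vec<i32>,
}

/// Stateful, fallible point reader modeled loosely on `ScalarProbe`.
pub trait ToyProbe {
    fn get(&mut self, index: usize) -> Result<i32, String>;
}

pub struct RunEndProbe<'a> {
    array: &'a ToyRunEnd,
    /// Index of the run that served the previous read.
    run: usize,
}

impl ToyRunEnd {
    pub fn len(&self) -> usize {
        self.ends.last().copied().unwrap_or(0)
    }

    /// Probe construction is fallible, here for malformed inputs.
    pub fn probe(&self) -> Result<RunEndProbe<'_>, String> {
        if self.ends.len() != self.values.len() {
            return Err("ends and values must have the same length".to_string());
        }
        Ok(RunEndProbe { array: self, run: 0 })
    }
}

impl ToyProbe for RunEndProbe<'_> {
    fn get(&mut self, index: usize) -> Result<i32, String> {
        if index >= self.array.len() {
            return Err(format!("index {} out of bounds", index));
        }
        let ends = &self.array.ends;
        let run_start = if self.run == 0 { 0 } else { ends[self.run - 1] };
        if index < run_start {
            // Backward seek: locate the run with a binary search.
            self.run = ends.partition_point(|&end| end <= index);
        } else {
            // Forward or repeated read: advance the cursor from where it was.
            while ends[self.run] <= index {
                self.run += 1;
            }
        }
        Ok(self.array.values[self.run])
    }
}
```

A caller that reads indices in increasing order pays amortized constant time per read in this sketch, which is the kind of access-shape benefit a probe can capture.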
-In a later migration, `lit(Scalar::list(...))` may choose to build a singleton array internally, but -that is not required by this RFC. +`ProbeOptions` is deliberately left undefined here. It should describe the caller's expected access +shape, not prescribe an implementation. For example, it may include an estimate of how many values +will be read, whether access is single, random, monotonic, or dense, and an approximate cache budget. +Encodings and wrapper arrays can use those hints to choose between lazy point access, page/chunk +caching, dictionary caching, cursor state, or canonicalizing a larger region up front. -### Expression literal serialization +The important distinction is that scalar probing is an optional scalar-compatible capability, not +the array row protocol. It preserves the current fast path where it matters while preventing +list-like and nested rows from re-entering the system as recursive scalar values. -The current protobuf literal options are: +### Serialization -```proto -message LiteralOpts { - vortex.scalar.Scalar value = 1; -} +`ConstantArray` serialization should serialize the length-1 child as a normal child array: + +```text +new: buffers = [], children = [length-1 array] ``` -Replace this with a oneof: +New readers should continue to accept the legacy scalar-backed form during migration: -```proto -message LiteralOpts { - oneof value { - vortex.scalar.Scalar scalar = 1; - ArrayLiteral array = 2; - } -} +```text +legacy: buffers = [scalar_value], children = [] +``` + +Deserializing the legacy form constructs a length-1 child array and then constructs the +`ConstantArray` wrapper. New writers should prefer the child-backed form for all constants, +including primitive constants. Compatibility writers may continue emitting the legacy scalar form for +old readers. + +`ArrayLiteral` serialization requires array serialization. 
Portable expression serialization should +add an array-literal payload: +```proto message ArrayLiteral { repeated string encoding_ids = 1; vortex.dtype.DType dtype = 2; @@ -362,193 +407,277 @@ message ArrayLiteral { } ``` -For this RFC, `ArrayLiteral.len` must be `1`. It is included so the serialized array can be decoded -using the existing array deserialization API, which needs dtype and length from the parent context. +On deserialization, Vortex decodes the array and constructs a new pointer-backed `ArrayLiteral`. +Pointer identity is process-local, so a deserialized expression is not pointer-equal to the original +in-memory expression even if it has the same contents. -The `encoding_ids` field carries the array serialization context. Existing array serialization stores -encoding IDs as indices in the flatbuffer; expression literals do not have the file footer's array -context, so they must carry their own context. +### Device buffers -This requires adding a session-aware expression serialization path: +Device buffers should not be allowed inside `Scalar`. -```rust -pub struct ExprSerializeOptions<'a> { - pub array_ctx: &'a ArrayContext, - pub session: &'a VortexSession, -} +Device buffers are allowed inside `ConstantArray` children and `ArrayLiteral` payloads because those +are arrays. In-memory execution can preserve device residency. Portable serialization copies buffers +to host in the same way array serialization does today. + +### Impact on existing code + +The current codebase uses `Scalar` for several different jobs. This RFC keeps the cheap host-value +jobs and removes the array-row jobs. + +#### Row access and assertions + +`OperationsVTable::scalar_at`, `ArrayRef::execute_scalar`, and Python/FFI `scalar_at` are the main +APIs that accidentally turn array rows into nested scalars. They should be deprecated and removed as +type-erased array operations. 
+ +Callers split into two groups: + +- Callers that need a row use `row(index)` or `slice(index..index + 1)` and keep an `ArrayRef`. + This includes constant compression, scalar-fn row execution, `case_when`, nested display, + parquet-variant rows, nested tests, and fuzz/conformance row oracles. +- Callers that need repeated point reads from scalar-compatible arrays construct a `ScalarProbe`. + This includes run-end ends and offsets, validity booleans, patch indices, encoded primitive + values, decimal byte-part values, FSST codes, search/sort paths, and scalar-compatible test values. + For one-off or unsupported reads, callers can fall back to length-1 row canonicalization and typed + canonical extraction. -pub trait ExprSerializeProtoExt { - fn serialize_proto(&self) -> VortexResult; - fn serialize_proto_with_options(&self, options: &ExprSerializeOptions<'_>) -> VortexResult; +Tests comparing nested rows should use array equality on length-1 arrays. Tests comparing +scalar-compatible host values can compare values read through a probe or typed canonical arrays. + +#### Constants and literals + +`ConstantArray::new(value, len)` remains as an ergonomic constructor for host `Scalar` values, but it +immediately builds a length-1 child array and calls `ConstantArray::with_child`. + +`ConstantArray::scalar()` and `ArrayRef::as_constant()` should not be the generic constant API. +Replacement APIs are: + +- `ConstantArray::child()` for the source-of-truth length-1 value. +- typed helpers such as `primitive_value()` for hot scalar-compatible kernels. +- array helpers for kernels that can operate on the length-1 child directly. + +Expression literals split the same way: + +- `lit(value)` stores a host `Scalar`, including restricted scalar-compatible struct literals. +- `lit_array(array)` stores an `ArrayRef` by pointer identity. +- list, fixed-size-list, variant, and nested struct literals use `ArrayLiteral`. 
+ +#### Scalar functions and compute kernels + +Scalar functions continue to receive `ArrayRef` inputs of logical length `args.row_count()`. +Broadcasting is represented by `ConstantArray`, not by passing naked length-1 arrays into kernels. + +Kernels that currently branch on `as_constant()` should inspect the constant child or use typed +helpers over that child. List-oriented functions such as `list_contains` should accept constant list +values through `ArrayLiteral` / `ConstantArray` and avoid list scalar extraction. + +Missing intermediate constant folding is a separate optimization. The scalar-function API does not +need to become iterator-based to remove nested scalars. + +#### Type-erased builders + +Type-erased builders should be array-oriented: + +```rust +trait ArrayBuilder { + fn dtype(&self) -> &DType; + fn extend(&mut self, values: ArrayRef) -> VortexResult<()>; + fn append_nulls(&mut self, len: usize) -> VortexResult<()>; + fn finish(self) -> VortexResult; } ``` -The old `serialize_proto` can continue to work for scalar-only expressions. It should return an error -if asked to serialize a row-backed literal without the context required to serialize arrays. +`append_row(row)` is `extend(row)` with `row.len() == 1`. Typed builders may keep host-value +conveniences such as `append_i32`, `append_utf8`, or `append_scalar(&Scalar)` for scalar-compatible +values, including restricted struct scalars where useful. The erased nested-builder protocol should +not depend on nested scalar values. -Deserialization already receives a `VortexSession`, so it can decode `ArrayLiteral` by constructing a -`ReadContext` from `encoding_ids`, decoding `serialized_array` with `dtype` and `len`, and validating -that `len == 1`. +#### Aggregate partials -### DType validation +Aggregate partials should not be represented as one length-1 array per group. That would preserve +the array model but make hot aggregation state unnecessarily expensive. 
-`Scalar::try_new` should continue to run dtype validation for scalar values. This validation is -purely structural and should not require an `ExecutionCtx`. +Instead: -Row-backed constants and row-backed literals validate through array invariants: +- in-memory aggregate partials are native Rust structs such as `MeanPartial { sum, count }` or + `MinMaxPartial { min, max }`. +- batched partial exchange and serialization are array-shaped, for example a `StructArray` with one + row per partial. +- combining partials reads typed child arrays from the partial array, not nested scalar rows. +- final aggregate output is an array; a one-group result is a length-1 array. Host scalar extraction + from that result is a caller convenience, not the aggregate protocol. -- the singleton row array must be a valid `ArrayRef` -- its dtype must match the literal or constant dtype -- its length must be 1 +Small struct `Scalar` values remain useful for expression literals, metadata, and compatibility, but +aggregate partial transport should not use `Scalar::struct_`. -This cleanly separates scalar validation from array validation. There is no need for an execution -context to construct a scalar, and there is no need for scalar validation to understand array buffers -or device residency. +#### Stats, encodings, and bindings -### Device buffers +Scalar-valued stats remain `Scalar` when the stat value is scalar-compatible: min, max, sum, +counts, sortedness, and constantness. Nested stats should use child-array stats or array metadata, +not recursive tuple scalars. -Device buffers should not be allowed inside `Scalar`. +Encoding metadata can keep `Scalar` when the metadata dtype is scalar-compatible, for example +FastLanes frame-of-reference references, decimal-byte-parts compare values, datetime-parts scalar +values, and FSST string/binary compare values. Encodings that carry nested fill values or nested row +values store one-row arrays. 
-Device buffers should be allowed inside row-backed constants and row-backed literals because those -are arrays. In-memory execution can preserve device residency. Scalar extraction from a row-backed -constant may require execution or host transfer depending on the underlying array and execution -context, but that cost is paid only when the caller explicitly requests a scalar. +FFI and Python scalar APIs remain for scalar-compatible values. Nested Python lists/dicts should +lower to arrays or `ArrayLiteral`, and nested row access should return a one-row array object. -Portable expression serialization should copy buffers to host in the same way array serialization -does today. A future device-aware expression transport can carry `BufferHandle`s or external device -segments, but that is out of scope for this RFC. +## Migration Plan -### Statistics +### Phase 1: Add the new shapes -For scalar-backed constants, statistics remain unchanged. +- Add `ConstantArray::with_child`. +- Add `ArrayLiteral` with pointer equality and hashing. +- Add `lit_array` and nested literal helpers. +- Add `ArrayRef::row(index)` as the universal row-access helper. +- Keep existing scalar APIs as compatibility wrappers where needed. -For row-backed constants: +### Phase 2: Move expression literals -- `Stat::IsConstant` is exactly true. -- `Stat::NullCount` can be derived from row validity and outer length. -- `Stat::Min` and `Stat::Max` may be derived lazily by extracting row 0 as a scalar when the dtype - supports scalar ordering and the value is non-null. -- If extracting a scalar would require undesirable execution, implementations may leave min/max - absent unless a compute path explicitly requests them. +- Keep `lit(value: impl Into)` for scalar-compatible host values. +- Move list, fixed-size-list, variant, nested struct, and nested extension literals to + `ArrayLiteral`. +- Replace private `ArrayExpr` with public `ArrayLiteral`. +- Add expression protobuf support for array literals. 
-This preserves correctness while avoiding accidental scalarization during cheap metadata operations. +### Phase 3: Move ConstantArray storage + +- Change `ConstantArray` to store one length-1 child slot. +- Read both legacy scalar-backed and new child-backed constant encodings. +- Write child-backed constants by default. +- Audit constant compute kernels and preserve hot-path complexity. + +### Phase 4: Remove nested scalar dependencies + +- Deprecate and remove type-erased array `scalar_at` / `execute_scalar`. +- Move row call sites to one-row arrays. +- Move scalar-compatible point-read call sites to `ScalarProbe`, with typed canonical extraction as + the fallback. +- Move nested stats away from recursive scalar values. +- Replace nested builder scalar append with array-oriented `extend`. +- Move aggregate partial exchange to arrays. +- Remove list, fixed-size-list, variant, and nested-extension scalar constructors. ## Compatibility -Existing scalar-backed constants and scalar literals remain readable. +Existing scalar-backed constants and scalar literals remain readable during migration. -Existing readers will not understand the new row-backed `vortex.constant` serialized form if it is +Existing readers will not understand the new child-backed `vortex.constant` serialized form if it is encoded under the same encoding ID. They will fail because the constant encoding has zero buffers and one child instead of one scalar buffer. This is a forward-compatibility limitation, not silent data corruption. 
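A reader that accepts both forms during migration could tell them apart by shape alone. The sketch below is hypothetical: `ToySerializedConstant`, `ToySerializedArray`, and `classify` are illustrative names, and real deserialization would go on to decode the scalar buffer or the child array rather than only classify the node.

```rust
/// Hypothetical, simplified view of a serialized child array.
pub struct ToySerializedArray {
    pub len: usize,
}

/// Hypothetical, simplified view of a serialized `vortex.constant` node.
pub struct ToySerializedConstant {
    pub buffers: Vec<Vec<u8>>,
    pub children: Vec<ToySerializedArray>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ConstantForm {
    /// Legacy: one protobuf-encoded scalar buffer and no children.
    LegacyScalar,
    /// New: no buffers and one length-1 child array.
    ChildBacked,
}

/// Classifies a constant node, rejecting any shape that matches neither form.
pub fn classify(node: &ToySerializedConstant) -> Result<ConstantForm, String> {
    match (node.buffers.len(), node.children.len()) {
        (1, 0) => Ok(ConstantForm::LegacyScalar),
        (0, 1) if node.children[0].len == 1 => Ok(ConstantForm::ChildBacked),
        (0, 1) => Err(format!(
            "constant child must have length 1, got {}",
            node.children[0].len
        )),
        (buffers, children) => Err(format!(
            "unexpected constant shape: {} buffers, {} children",
            buffers, children
        )),
    }
}
```

An older reader has only the `(1, 0)` arm, so a child-backed constant produces an error instead of a misread value.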

 Writers should expose a compatibility option:

-- modern mode: write row-backed complex constants
-- legacy mode: write scalar-backed complex constants, or canonicalize complex constants before writing
+- modern mode: write child-backed constants for every dtype
+- legacy mode: write scalar-backed constants where possible, or canonicalize unsupported constants

-The expression protobuf change is backward-compatible for readers that accept both `scalar` and
-`array` literal variants. Older readers will not understand `array` literals.
+Expression array literals require new expression protobuf support. Older readers will not understand
+array literals.

 Public Rust API compatibility should be managed in phases:

-1. Add new `ConstantValue` and literal APIs.
-2. Keep existing scalar-only helpers for existing callers.
-3. Migrate internal kernels that can benefit from row-backed constants.
-4. Deprecate ambiguous APIs such as `ConstantArray::scalar()` if they cannot represent row-backed
-   constants safely.
-5. Consider breaking API cleanup only after downstream integrations have a migration path.
+1. Add child-backed `ConstantArray` construction and `ArrayLiteral`.
+2. Keep scalar-only helpers for existing callers.
+3. Migrate internal kernels and expression constructors.
+4. Deprecate ambiguous APIs such as `ConstantArray::scalar()`.
+5. Remove list-like scalar constructors after downstream integrations have a migration path.

 ## Drawbacks

-This adds a second representation for constants, and kernels must be explicit about whether they
-need scalar constants or can operate on row-backed constants.
-
-Expression serialization becomes more complex because array literals need an array serialization
-context. The existing scalar-only expression serialization path is simpler, but it is also the source
-of the current inefficiency for complex values.
+Primitive constants become slightly heavier. They now carry a length-1 child array instead of an
+inline scalar. Hot constant kernels must be audited and benchmarked before this lands.

-Some code that currently assumes every constant has a `Scalar` will need to be audited. The upside is
-that this audit makes accidental scalarization visible instead of hiding it behind `as_constant()`.
+Pointer-based `ArrayLiteral` equality is intentionally weaker than value equality. It avoids
+accidental expensive comparisons, but optimizers cannot assume two independently constructed
+content-equal array literals are the same expression.

-Row-backed constants do not remove all materialization costs. If a caller asks for canonical
-fixed-size-list arrays, scalar extraction, or legacy serialization, Vortex may still need to build
-repeated values. The important change is that these costs move to explicit boundaries.
+Removing type-erased array `scalar_at` requires a replacement fast path for primitive random-access
+workloads. `ScalarProbe` fills that role, but it adds another capability API that encodings and
+wrappers must implement or delegate carefully.

 ## Alternatives

+### Scalar-or-row ConstantArray
+
+We could represent constants as:
+
+```rust
+enum ConstantValue {
+    Scalar(Scalar),
+    Row(ArrayRef),
+}
+```
+
+This reduces the cost of primitive constants, but it keeps two constant representations and preserves
+branching pressure in kernels. This RFC chooses a single representation instead.
+
 ### Make Scalar array-backed

 We could add `ScalarValue::Array(ArrayRef)` and represent complex scalars directly as singleton
 arrays.

-This is rejected because it makes `Scalar` no longer context-free. Scalar equality, hashing,
-display, validation, and serialization would all need to handle arrays, and arrays may require
-execution or device-to-host transfer. That would make `Scalar::try_new` and scalar literals depend on
-execution context, which is exactly the direction we want to avoid.
-
-### Replace Scalar with length-1 arrays everywhere
-
-This is conceptually clean, but too disruptive. `Scalar` is still useful for stats, display, FFI,
-Python interop, expression literals, and primitive constants. Replacing it everywhere would force
-array execution into places that need a cheap value object.
+This makes `Scalar` no longer context-free. Scalar equality, hashing, display, validation, and
+serialization would all need to understand arrays, and arrays may require execution or
+device-to-host transfer.

-This RFC takes the smaller step: arrays are used for complex execution constants, while `Scalar`
-remains the host literal representation.
+### Remove struct scalars too

-### Always canonicalize complex literals at deserialization
+We could make `Scalar` valid only for null, bool, primitive, decimal, UTF-8, and binary values.

-This avoids carrying arbitrary encodings inside expression literals, but it throws away information
-and can eagerly allocate. If a literal was serialized as a specialized array, deserialization should
-not immediately flatten it unless execution demands that.
+This is cleaner, but it makes small host records more expensive than necessary. Struct-like values
+show up in expression literals, metadata, and compatibility APIs where a handful of host fields are
+cheaper and clearer than a one-row `StructArray`. This RFC keeps restricted struct scalars while
+forbidding their use as array rows, constant storage, erased builder input, or aggregate partial
+transport.

-### Add a separate `vortex.constant.row` encoding
+### Only add ArrayLiteral

-A new encoding ID would make forward incompatibility clearer for old readers. It would also avoid
-changing the shape of `vortex.constant`.
+Adding `ArrayLiteral` alone would fix expression embedding, but `ConstantArray` would still force
+nested values through recursive scalar representation. This would leave the execution problem mostly
+unsolved.
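
The pointer-based `ArrayLiteral` equality discussed under Drawbacks could look roughly like the sketch below. It is an illustration under stated assumptions, not the RFC's implementation: `Payload` is a hypothetical stand-in for `ArrayRef`, and the struct layout is invented for the example.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an array payload; the real type would be `ArrayRef`.
type Payload = Arc<Vec<i64>>;

/// Hypothetical array literal whose equality is pointer identity, not content.
#[derive(Clone, Debug)]
pub struct ArrayLiteral {
    payload: Payload,
}

impl ArrayLiteral {
    pub fn new(payload: Payload) -> Self {
        Self { payload }
    }
}

impl PartialEq for ArrayLiteral {
    fn eq(&self, other: &Self) -> bool {
        // O(1): compares allocation addresses and never reads element data.
        Arc::ptr_eq(&self.payload, &other.payload)
    }
}

impl Eq for ArrayLiteral {}
```

Under this definition a cloned literal compares equal to its source, while two literals built independently from identical contents do not, which is the optimizer limitation the Drawbacks section describes.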

-This is a reasonable fallback, but the in-memory model should still be a single logical
-`ConstantArray` with scalar-backed and row-backed variants. Most users and kernels should not care
-which wire-level encoding was used.
+### Keep unrestricted nested scalar values and optimize hot paths

-### Keep nested scalar values and optimize hot paths
+We could add specialized list-scalar and struct-scalar storage to reduce allocation.

-We could add specialized list-scalar and struct-scalar storage to reduce allocation. That may help
-some cases, but it still duplicates the array system and still does not solve device residency or
-array literal serialization.
+This may help some local cases, but it keeps recursive array-row materialization alive and still does
+not solve device residency or array literal serialization. The useful subset is restricted struct
+host records; list-like and variant values should use arrays.

 ## Prior Art

-Apache Arrow distinguishes between scalars and arrays, but computation generally operates over
-arrays and treats scalar inputs as broadcast values. That is the model this RFC follows: scalar values
-remain useful, but execution should be able to represent a broadcast value as array data.
+Apache Arrow distinguishes between scalars and arrays, and compute generally operates over arrays.
+Broadcast scalar inputs are an execution concern, not a reason to represent nested arrays as
+recursive scalar objects.

-Database vector engines commonly distinguish constant vectors from flat vectors. A constant vector
-does not necessarily mean "store a recursive scalar object"; it means "the logical row value is the
-same for every row." For nested values, the payload can still be represented by child vectors.
+Database vector engines commonly represent constants as constant vectors. For nested values, a
+constant vector can still point at child vectors rather than storing a recursive scalar tree.
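
To make the constant-vector idea concrete, here is a minimal sketch of a constant over a list-typed value that keeps a single child row and broadcasts it to a logical length. All names (`ListRows`, `ConstantRows`) are hypothetical stand-ins for illustration; real Vortex arrays carry dtypes, validity, and encodings that are omitted here.

```rust
use std::sync::Arc;

/// Hypothetical list-typed array: flat values plus row offsets.
#[derive(Debug)]
pub struct ListRows {
    values: Vec<i64>,
    offsets: Vec<usize>, // offsets.len() == row count + 1
}

impl ListRows {
    pub fn new(values: Vec<i64>, offsets: Vec<usize>) -> Self {
        Self { values, offsets }
    }

    /// Number of rows described by the offsets.
    pub fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Borrow the elements of row `i` (panics if `i` or the offsets are out of range).
    pub fn row(&self, i: usize) -> &[i64] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Hypothetical constant vector: one child row broadcast to `len` logical rows.
#[derive(Debug)]
pub struct ConstantRows {
    child: Arc<ListRows>, // invariant: child.len() == 1
    len: usize,
}

impl ConstantRows {
    pub fn try_new(child: Arc<ListRows>, len: usize) -> Result<Self, String> {
        if child.len() != 1 {
            return Err(format!(
                "constant child must have exactly 1 row, got {}",
                child.len()
            ));
        }
        Ok(Self { child, len })
    }

    /// Every in-bounds logical row borrows the same child row; nothing is copied.
    pub fn row(&self, i: usize) -> Option<&[i64]> {
        (i < self.len).then(|| self.child.row(0))
    }
}
```

The nested payload stays in array form (values and offsets) regardless of the logical length, so no recursive scalar tree is ever built.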

-Vortex already has the pieces of this model: `ArrayRef`, `ConstantArray`, array serialization,
-validity, and singleton rows. The missing piece is allowing constants and literals to carry singleton
-arrays directly.
+Vortex already has most of this model: `ArrayRef`, `ConstantArray`, array serialization, validity,
+and expression trees. The missing pieces are a child-only constant representation and a first-class
+array literal expression.

 ## Unresolved Questions

-- Should row-backed constants use the existing `vortex.constant` encoding ID with a new child-based
-  shape, or should the wire format use a separate `vortex.constant.row` encoding ID?
-- Should `lit(Scalar::list(...))` continue to produce a scalar-backed literal, or should it eagerly
-  build a singleton array for complex scalar values?
+- What benchmark thresholds should gate the primitive-constant wrapper cost?
+- Should public `ArrayLiteral` construction allow row-count-length arrays, or should that be an
+  internal-only constructor?
+- How long should legacy nested `Scalar` constructors remain available?
+- Should extension values with scalar-compatible storage be allowed as `Scalar`, or should all
+  extension values stay array-backed for consistency?
 - What is the exact public API migration for `ConstantArray::scalar()` and `ArrayRef::as_constant()`?
-- Should large UTF-8 or binary values ever become row-backed constants based on size?
-- How much min/max statistic support should row-backed constants provide without explicit execution?
+- What API should replace Python/FFI `scalar_at` for nested rows?

 ## Future Possibilities

-Once row-backed constants exist, Vortex can add more array-native literal construction APIs in Python,
-Java, C++, DuckDB, and DataFusion integrations.
+Once nested values are array-backed everywhere, Vortex can add richer nested literal constructors in
+Python, Java, C++, DuckDB, and DataFusion without routing through recursive scalar values.

-Expression serialization could eventually use a general "literal payload" abstraction that supports
-host arrays, device buffers, and external buffer references. That would allow complex literals to be
-transported without copying through a monolithic protobuf payload.
+`ArrayLiteral` can become the common mechanism for expression-local array payloads such as validity
+arrays, dictionary values, compact lookup tables, and literal nested values.

-The same singleton-row mechanism may also help with dictionary values, sparse fill values, and other
-places where Vortex currently stores a complex repeated value as a scalar.
+A future device-aware expression transport could serialize `ArrayLiteral` payloads through external
+buffer references instead of copying them into a portable byte payload.