Conversation
c41bb3c to
09a29d1
Compare
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
09a29d1 to
26f1e73
Compare
|
|
||
| A new VariantGet expression is required, the expression has two inputs: | ||
|
|
||
| 1. Path to the required child - similar to JSONPath, but a much stricter subset. Just a combination of names and indexes. |
There was a problem hiding this comment.
the path can be empty right?
| At this point, we have 3 possible cases: | ||
|
|
||
| 1. Perfectly shredded - there's a fully shredded child at this path. If it matches the expected type or can be casted into it, we can just return it. Note that this child might actually be a Variant array with its own shredded children, this just means that we've reached a position where all data is contained within this child, with no relevant data in the "core storage" child. | ||
| 2. Partially shredded - data for this path exists in both the shredded child AND in some unshredded values, which we can merge according to the expected type. | ||
| 3. Unshredded - No shredded child at this path, we try and extract the relevant value from the unshredded values which are unchanged from the original array. |
There was a problem hiding this comment.
I thought about instead of cases partitioning the path+type into parts that apply to shredded and un-shredded types? I think its a similar framing just maybe slightly less states?
| | typed child: i64 values for $.a.b | | ||
| | built from shredded data, raw data, or a merge of both | | ||
| +--------------------------------------------------------------+ |
There was a problem hiding this comment.
is there any shredded data in this field+type or just the unshredded part?
| filtered variants: `VariantGet(Filter(v, m), "$.a", i64)` sees only the selected rows, and any | ||
| shredded child used for `$.a` has been filtered with the same mask. |
There was a problem hiding this comment.
is this easy to do for variant? do you store the mask + core_storage and apply on read. Or do you apply the mask eagily?
There was a problem hiding this comment.
depending on the underlying encoding, but for something like parquet-variant its basically a filter on a struct with a couple of binary fields.
| This makes canonical variants more complex than a single raw child. Any code that transforms a | ||
| canonical `VariantArray` must preserve both `core_storage` and the optional `shredded` child, and | ||
| must keep them row-aligned through filter, slice, take and mask operations. | ||
|
|
There was a problem hiding this comment.
what is another approach that is less complex?
| We can make the dtype parameter required, but I do think that the optional one keeps execution more flexible and opens up | ||
| opportunities for different usage, which is useful for compute engines that have more flexible type systems or that might want | ||
| to process the raw byte data themselves. |
There was a problem hiding this comment.
parameterised by what? we can export these to the engine with a encoding tree walk.
There was a problem hiding this comment.
Not sure I understand this comment.
| - What casts are allowed when `as_dtype` is provided? Numeric widening seems reasonable, but string | ||
| parsing, lossy casts and timestamp/decimal coercions should be decided explicitly. |
There was a problem hiding this comment.
Do we want it to be extensible or fixed?
Do we need to extend via a cast like system?
There was a problem hiding this comment.
Not sure, not sure I fully understand how we currently think about casting.
| whether negative indexes or wildcards are out of scope. | ||
| - What casts are allowed when `as_dtype` is provided? Numeric widening seems reasonable, but string | ||
| parsing, lossy casts and timestamp/decimal coercions should be decided explicitly. | ||
| - What are the exact null semantics for outer nulls, missing paths, `variantnull` values and type |
There was a problem hiding this comment.
is variant null like undefined in js? This is only stored in the un-shredded variant
How do I examine this?
There was a problem hiding this comment.
I can't claim to understands js's nullability. VariantNull is the case where the validity says the value is valid, but it contains a Null. There was a long conversation about the concept here.
| - How should implementations validate consistency between the shredded child and raw | ||
| `core_storage`? This may be a construction-time invariant, a debug assertion or a checked error | ||
| path when merging partial shredding. |
There was a problem hiding this comment.
likely debug on construction? I guess both in debug, seems slow
| - How should implementations validate consistency between the shredded child and raw | ||
| `core_storage`? This may be a construction-time invariant, a debug assertion or a checked error | ||
| path when merging partial shredding. | ||
| - What shape should the shredded tree use for list indexes and nested variants? Struct fields cover |
There was a problem hiding this comment.
I think the path is just [idx] thing, but I'm not sure how to store a shredded a array for that
|
I think that I am a bit dumb when reading this - what would be advantage of skipping shredded values in the canonical array? |
|
That's the current thing, it makes the canonical array a really weird thing that is basically a pass-through to a bunch of things, with delicate rules around it to make sure everything is pushdown down. |
|
|
||
| A new VariantGet expression is required, the expression has two inputs: | ||
|
|
||
| 1. Path to the required child - similar to JSONPath, but a much stricter subset. Just a combination of names and indexes. |
There was a problem hiding this comment.
Indexes being list offsets? Can we just stick to field paths for now?
There was a problem hiding this comment.
Would like to keep the flexibility, especially as variant both in Parquet and DuckDB supports that.
| A new VariantGet expression is required, the expression has two inputs: | ||
|
|
||
| 1. Path to the required child - similar to JSONPath, but a much stricter subset. Just a combination of names and indexes. | ||
| 2. Optional dtype, if None - the return type is `None`, the expression's return type is `Variant`. |
There was a problem hiding this comment.
Why optional? We should assume that Vortex expressions are fully-typed by some surrounding engine. So presumably the output has been coerced into something
There was a problem hiding this comment.
I say that at some other point in the doc, we can make it stricter for now.
| `core_storage`, and its rows must stay aligned with the raw variant rows. | ||
|
|
||
| Nested shredded paths can be represented by nesting typed arrays inside struct arrays. For example, | ||
| if `$.a.b` is shredded but `$.a.c` is not, the shredded child may contain a field for `a`, whose |
There was a problem hiding this comment.
What if there is $.a.b as a both an int64 and a float64 array?
There was a problem hiding this comment.
in cases of these conflicts, Variant semantics are row-wise. DuckDB and arrow have some casting semantics/options around that.
|
|
||
| The canonical Variant array will add an additional child, representing optional shredded data, it will now have: | ||
|
|
||
| 1. Validity |
There was a problem hiding this comment.
From your description of execution, it sounds like you want a non-optional "shredded" child that can be a struct array with no fields. That gives you a sensible place for validity.
There was a problem hiding this comment.
Not sure I follow. Why can't the validity be "inside" the child?
| VariantGet("$.a.b", i64) | unchanged | ||
| | | ||
| +----------------------------------------------------------|---+ | ||
| | validity for rows where $.a.b can be read as i64 | | |
There was a problem hiding this comment.
I don't quite get this? You're returning another VariantArray from the execution of VariantGet?
There was a problem hiding this comment.
I intended this to be the model of execution, not what you actually return. I'll try and make it clearer.
|
|
||
| The canonical `VariantArray` is the stable execution boundary, but it should not force | ||
| `VariantGet` to materialize the whole variant value. When `VariantGet` sees a canonical variant, it | ||
| first uses the explicit `shredded` child when that child contains the requested path. If the path is |
There was a problem hiding this comment.
What does this mean?
If the path is not fully represented by the shredded child
There was a problem hiding this comment.
the shredded children tree ends before the path.
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
RFC for the
VariantGetexpression, with lessons and thoughts learned through vortex-data/vortex#7494.