Skip to content

Support Array and Tuple column types in dynamic write tools #123

@BorisTyshkevich

Description

@BorisTyshkevich

Background

Dynamic write tools currently skip columns of type Array(...), Tuple(...), Dynamic, and JSON with a Warn log:

dynamic write tool: skipping column with unsupported type (no JSON Schema equivalent)

Dynamic and bare JSON are genuinely unrepresentable (schema changes per row). But Array and Tuple have fully static, introspectable schemas and can be mapped to JSON Schema.

Proposed approach: use DESCRIBE TABLE with subcolumns

Instead of writing a recursive parser for CH type strings, delegate decomposition to ClickHouse:

DESCRIBE TABLE db.my_table
SETTINGS describe_include_subcolumns = 1

This returns one row per column and one row per subcolumn, with an is_subcolumn column (0 or 1):

name type default_kind comment is_subcolumn
id UInt64 0
user Tuple(name String, age UInt8) 0
user.name String 1
user.age UInt8 1
tags Array(String) 0

For named Tuples, the subcolumn rows give us the leaf field names and simple types — no recursive parsing needed. For Arrays of named Tuples, CH surfaces subcolumns as Array(leaf_type) (e.g., items.id → Array(UInt32)), giving us the element schema directly.

Implementation sketch

  1. Replace the current system.columns query in getTableColumnsForMode with DESCRIBE TABLE db.table SETTINGS describe_include_subcolumns=1.
  2. Group rows by top-level column name (split on first .).
  3. For each top-level column (is_subcolumn=0):
    • Skip MATERIALIZED/ALIAS as before.
    • Keep skipping Dynamic and bare JSON.
    • For Array(primitive) or other simple types: existing mapCHType path.
    • For Tuple(...) with named fields (subcolumn rows present): build {"type": "object", "properties": {...}, "required": [...]} from subcolumn rows.
    • For Tuple(...) with unnamed fields (no subcolumn rows): {"type": "array", "prefixItems": [...], "minItems": N, "maxItems": N} — requires one level of type-string parsing for the inner primitive types only.
    • For Array(Tuple(...)) with named fields: {"type": "array", "items": <object schema from subcolumns>}.

JSON Schema mapping

CH type JSON Schema
Array(String) {type: array, items: {type: string}}
Array(UInt32) {type: array, items: {type: integer, format: int64}}
Tuple(name String, age UInt8) (named) {type: object, properties: {name: {type: string}, age: {type: integer}}, required: [name, age]}
Tuple(UInt32, String) (unnamed) {type: array, prefixItems: [{type: integer}, {type: string}], minItems: 2, maxItems: 2}
Array(Tuple(id UInt32, val String)) {type: array, items: {type: object, properties: {id: ..., val: ...}}}

SQL literal encoding for INSERT

The LLM sends values as JSON; we need to encode them into SQL literals in buildInsertQuery:

JSON Schema type LLM input example SQL literal
array of primitives ["a","b","c"] ['a','b','c']
object (named Tuple) {"name":"Alice","age":30} ('Alice',30) — fields ordered by Tuple definition
array of items (unnamed Tuple) [1,"hello"] (1,'hello')
array of objects (Array(Tuple)) [{"id":1,"val":"x"},{"id":2,"val":"y"}] [('x',1),('y',2)]

sqlLiteralChecked needs new cases: array (renders [...]) and object (renders (...) with values ordered by field list from dynamicToolParam).

What stays unsupported

  • Dynamic — schema is truly per-row, no static representation
  • Bare JSON / JSON(...) — same reason
  • Nested(...) — syntactic sugar for Array(Tuple(...)), could follow once Array(Tuple) works

Caveats

  • describe_include_subcolumns was added in ClickHouse 22.4. Antalya 26.1.x supports it; verify presence in any older Altinity builds still under test.
  • Unnamed Tuple subcolumn names in CH are positional (1, 2, 3…) — DESCRIBE does not emit dotted subcolumn rows for them, so the unnamed-Tuple path still requires one level of type-string parsing (inner types are always primitives in the common case).
  • Nullable(Array(...)) and LowCardinality(...) wrappers around complex types: strip outer wrapper, apply inner type mapping.

Acceptance criteria

  • Table columns of type Array(primitive) appear in the dynamic write tool schema as JSON array
  • Named Tuple columns appear as JSON object with correct property names and types
  • Unnamed Tuple columns appear as JSON array with prefixItems
  • Array(Tuple(...)) with named fields produces {type: array, items: {type: object, ...}}
  • Dynamic and bare JSON columns continue to be skipped with a Warn log
  • buildInsertQuery correctly encodes array and tuple values as SQL literals
  • Existing tests pass; new tests cover at least Array(String), named Tuple, Array(named Tuple)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions