Summary
When generating structured output against a schema containing an unbounded array (an array with no minItems/maxItems), the MLX constrained decoder emits a fixed, pre-computed number of items instead of letting the model decide the array length. For an unbounded array the forced count is max(1, min(16, maximumResponseTokens / 32)), and the model is made to generate exactly that many items.
Two consequences:
- Fabrication / padding: the array always contains the forced count of items regardless of the input, so the model invents or repeats entries to fill the slots (and it can never produce an empty array).
- Token-budget exhaustion: a schema with several unbounded arrays forces up to 16 items in each, which can run the total token budget to zero before the JSON can close, throwing ConstrainedGenerationError.tokenBudgetExceeded. Raising maximumResponseTokens does not help (see below).
Where (source)
Sources/AnyLanguageModel/Shared/StructuredGeneration.swift — generateArray(...) (≈ lines 445–480):
    // arrayDefaultCountDivisor = 32, arrayDefaultCountMax = 16
    let budgetBasedCount = backend.totalTokenBudget / Self.arrayDefaultCountDivisor
    let defaultCount = max(1, min(Self.arrayDefaultCountMax, budgetBasedCount))
    // ...
    // when the schema has no minItems/maxItems:
    count = defaultCount
    // ...
    for index in 0 ..< count {  // exactly `count` items, always
        output += try await generateNode(node.items)
        if index < count - 1 { output += try await emit(",") }
    }
There is no path for the model to terminate the array early (emit ]) or to produce 0 items. The count is fixed up front. minItems/maxItems only change which fixed number is forced — they don't enable content-driven, variable length.
Related: each free-string field is capped at totalTokenBudget / 16 (freeStringTokenBudgetDivisor = 16), so a larger budget also lets each forced-but-contentless string ramble further before being cut off.
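To make the arithmetic concrete, here is a minimal standalone sketch of the default-count formula quoted above. The helper name forcedArrayCount is hypothetical (it is not part of AnyLanguageModel), and it assumes totalTokenBudget equals the maximumResponseTokens passed in GenerationOptions.

```swift
// Constants copied from the values quoted from StructuredGeneration.swift.
let arrayDefaultCountDivisor = 32
let arrayDefaultCountMax = 16

/// Hypothetical helper (not library API): the item count forced for an
/// array whose schema has no minItems/maxItems.
/// Assumes totalTokenBudget == maximumResponseTokens.
func forcedArrayCount(totalTokenBudget: Int) -> Int {
    let budgetBasedCount = totalTokenBudget / arrayDefaultCountDivisor
    return max(1, min(arrayDefaultCountMax, budgetBasedCount))
}

// The count never drops below 1 and saturates at 16 from a budget of 512 upward.
for budget in [16, 256, 512, 4096, 32768] {
    print("budget \(budget): \(forcedArrayCount(totalTokenBudget: budget)) forced items")
}
```

Under these assumptions, no budget value yields 0 items, which matches the observation that an empty array is unreachable.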
Minimal reproduction
A schema that is just an object with one unbounded string array, and a prompt whose input only supports ~1 item:
// Schema: { "type": "object",
// "properties": { "keywords": { "type": "array", "items": { "type": "string" } } },
// "required": ["keywords"] }
// (built as a DynamicGenerationSchema with a single array property, no minItems/maxItems)
let response = try await session.respond(
to: "Extract the keywords mentioned in: 'The quick brown fox.'",
generating: /* Generable bound to the schema above */ .self,
options: GenerationOptions(maximumResponseTokens: 4096)
)
print(response.content) // jsonString
Observed (reproduced with an MLX 4B-class instruct model): the keywords array contains exactly 16 items for a one-keyword input — a relevant value or two, then repeated/padded copies, with the first free string often rambling up to its per-field cap because the model has nothing left to emit. Shape:
{ "keywords": ["fox", "fox", "fox", "fox", ... ] } // 16 forced items for a ~1-item input
Impact / why bumping the budget doesn't help
- The per-array count caps at 16 once maximumResponseTokens ≥ 512 (512/32 = 16), so raising the budget does not shorten the arrays.
- It only raises the per-field free-string cap (budget/16), letting each contentless field ramble longer, so larger budgets consume more tokens, not fewer.
- Empirically, a schema with ~19 unbounded arrays throws tokenBudgetExceeded at both maximumResponseTokens = 8192 (~10 min) and 32768 (~57 min); the larger budget just churns much longer before the same failure.
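The points above can be sketched as a rough upper bound. The function below is hypothetical and assumes the worst case: every unbounded array holds free-string items and every string runs to its per-field cap; structural JSON tokens are ignored. It is an illustration of the scaling, not a measurement.

```swift
// Constants copied from the values quoted from StructuredGeneration.swift.
let arrayDefaultCountDivisor = 32
let arrayDefaultCountMax = 16
let freeStringTokenBudgetDivisor = 16

/// Hypothetical worst-case estimate (not library API): tokens demanded by
/// forced string items if each one runs to its per-field cap.
func worstCaseStringTokens(unboundedArrays: Int, totalTokenBudget: Int) -> Int {
    let count = max(1, min(arrayDefaultCountMax, totalTokenBudget / arrayDefaultCountDivisor))
    let perStringCap = totalTokenBudget / freeStringTokenBudgetDivisor
    return unboundedArrays * count * perStringCap
}

for budget in [8192, 32768] {
    let demand = worstCaseStringTokens(unboundedArrays: 19, totalTokenBudget: budget)
    print("budget \(budget): worst-case demand \(demand) tokens (\(demand / budget)x the budget)")
}
```

Because the item count is pinned at 16 and the string cap is budget/16, each array can demand up to the entire budget in this worst case, so 19 arrays can demand about 19 times the budget no matter how large the budget is.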
Expected behavior
The model should control array length: emit a content-driven number of items (including 0), terminating the array (]) when appropriate, rather than being forced to a fixed min(16, budget/32). For unbounded arrays, the decoder should allow a model-emitted stop and an empty array, instead of padding to a fixed count.
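One possible shape for this, as a toy sketch only: after "[" and after each item, the decoder offers a choice between starting another item and closing the array. Everything here is hypothetical (ArrayStep, generateArraySketch, and the next closure that stands in for the model's constrained choice); a real decoder would make this choice by masking logits, not by calling a closure.

```swift
/// Hypothetical stand-in for the model's decision at each array position.
enum ArrayStep {
    case item(String)  // the model starts another item
    case close         // the model emits "]"
}

/// Toy sketch (not library API): builds a JSON string array whose length is
/// decided by `next`, bounded above by `maxItems` as a safety limit.
/// String values are not JSON-escaped, to keep the sketch short.
func generateArraySketch(maxItems: Int, next: (Int) -> ArrayStep) -> String {
    var items: [String] = []
    loop: while items.count < maxItems {
        switch next(items.count) {
        case .item(let value):
            items.append("\"\(value)\"")
        case .close:
            break loop  // model-emitted stop: end the array early
        }
    }
    return "[" + items.joined(separator: ",") + "]"
}

// A one-keyword input produces a one-item array instead of 16 padded items.
let keywords = ["fox"]
let oneItem = generateArraySketch(maxItems: 16) { index in
    index < keywords.count ? ArrayStep.item(keywords[index]) : ArrayStep.close
}
print(oneItem)

// Closing immediately produces an empty array.
let empty = generateArraySketch(maxItems: 16) { _ in ArrayStep.close }
print(empty)
```

In this sketch the fixed count survives only as an upper bound, so a budget-derived limit could still protect against runaway arrays without forcing padding.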
Environment
- AnyLanguageModel main (revision 701d7e61…), MLX backend, Apple Silicon (macOS / iOS), 4B-class 4-bit instruct model.