Skip to content

RFC: Collapse the four CsvReader row classes into one JIT-devirtualized CsvLine<TPolicy> #132

Description

@stevehansen

Problem

Sequel to #118. That RFC unified the four read loops behind one Enumerate<TSource, TFactory, TRow>; the TRow row types those factories produce are still four near-duplicate classes:

  • ReadLineCsvReader.cs:448 (~98 LOC, string-backed, ICsvLine)
  • ReadLineSpanCsvReader.cs:549 (~140 LOC, string-backed, ICsvLineSpan)
  • ReadLineSpanOptimizedCsvReader.cs:690 (~143 LOC, ReadOnlyMemory<char>-backed + CsvMemoryOptions ArrayPool, ICsvLineSpan)
  • ReadLineFromMemoryCsvReader.FromMemory.cs:22 (~86 LOC, ReadOnlyMemory<char>-backed, ICsvLineFromMemory)

All four own the same concept: lazy RawFields = Split(raw) → validate column count → lazy ParsedValues = Trim(rawFields) → access by name/index under the ReturnEmptyForMissingColumn policy. They differ only in (a) backing store (string vs ReadOnlyMemory<char>) and (b) which split/trim they call (Split/Trim vs SplitLineOptimized/TrimOptimized). ReadLineSpan and ReadLineSpanOptimized are ~140 lines each and differ essentially only in storage representation.

The duplication is concrete and drift-prone — three error-message string literals are copy-pasted 3–4× each:

  • "Expected {Headers.Length}, got {raw.Count} columns." (:506, :601, :745, FromMemory :68)
  • "Header '{name}' does not exist. Expected one of …" (:525, :621, :765, FromMemory :87)
  • "Invalid row, missing {name} header, expected … columns, got … columns." (:534, :630, :774, FromMemory :96)

A fix to the missing-column message or the count-validation logic must land in four places. This is exactly the erosion #118 warned about ("a fifth row type would force the maintainer to copy the loop a fifth time and re-prove all the invariants").

It is a deepening opportunity in Ousterhout's sense: four shallow modules where one deep module belongs. The interface ("a parsed row: headers, values, access by name/index") is small; the implementation (lazy split → validate → trim → lookup-with-policy → identical errors) is large; the only thing that should vary is the storage/strategy seam.

Test cost today: EngineUnificationTests.cs (675 lines, 20 tests) and IssuesTests #114 exist largely to assert that all five read paths emit identical Headers/Values/Raw. That cross-path identity is currently a test-enforced hope; one row type makes it a structural guarantee.

Proposed Interface

Internal-only. The three public interfaces (ICsvLine, ICsvLineSpan, ICsvLineFromMemory) and every CsvReader/CsvWriter static method are unchanged. This is the hybrid chosen after a parallel design exploration of four divergent shapes (see docs/Architecture-Deepening-Candidates.md): a single MemoryText-backed generic row (the "zero-alloc-first" spine) hardened with the allocation/empty-value discipline that the "common-caller-first" two-tier design got right.

// The ONLY axis that genuinely varies on the memory side: pooled vs plain split/trim.
internal interface ITrimSplit
{
    IList<MemoryText> Split(MemoryText raw, CsvOptions options, int? capacity);
    MemoryText[]      Trim(IList<MemoryText> fields, CsvOptions options);
}

internal readonly struct DefaultTrimSplit : ITrimSplit            // ns2.0 + the net8 string/memory paths
{
    public IList<MemoryText> Split(MemoryText raw, CsvOptions o, int? cap) => CsvReader.SplitLine(raw, o, cap);
    public MemoryText[]      Trim (IList<MemoryText> f, CsvOptions o)       => CsvReader.Trim(f, o);
}

#if NET8_0_OR_GREATER
internal readonly struct OptimizedTrimSplit : ITrimSplit          // the ArrayPool / CsvMemoryOptions path
{
    private readonly CsvMemoryOptions mem;
    public OptimizedTrimSplit(CsvMemoryOptions mem) => this.mem = mem;
    public IList<MemoryText> Split(MemoryText raw, CsvOptions o, int? cap) => CsvReader.SplitLineOptimized(raw, o, mem, cap);
    public MemoryText[]      Trim (IList<MemoryText> f, CsvOptions o)       => CsvReader.TrimOptimized(f, o, mem);
}
#endif

// The ONE row. Backing store is always MemoryText (= ReadOnlyMemory<char> on net8, = string on ns2.0
// via the existing alias). On net8 it carries the UNION of all three frozen interfaces; the two
// this[string]/this[int] indexers differ only by return type, disambiguated by EXPLICIT interface
// implementation — legal C#, and the load-bearing trick that makes one type serve all paths.
internal sealed class CsvLine<TPolicy> :
        ICsvLine
#if NET8_0_OR_GREATER
        , ICsvLineSpan, ICsvLineFromMemory
#endif
    where TPolicy : struct, ITrimSplit
{
    private readonly Dictionary<string, int> headerLookup;
    private readonly CsvOptions options;
    private readonly TPolicy policy;                 // struct field -> JIT-specialized, no virtual dispatch
    private readonly MemoryText[] headers;
    private readonly MemoryText rawMemory;           // the single backing field
    private readonly string? rawString;              // FIX 3: original string when the source has one
    internal IList<MemoryText>? rawFields;           // settable by the engine (multiline pre-split)
    internal MemoryText[]? parsedValues;             // FIX 2: stays INTERNAL so GetBlock can pre-seed

    private static readonly MemoryText Empty = "".AsMemory();   // FIX 1: "" on ns2.0, ROM.Empty on net8

    // string Raw returns the cached original on the TextReader paths (zero-alloc), materializes only
    // on the ROM-backed optimized path — matching today's per-path behavior exactly.
    public string Raw => rawString ?? rawMemory.ToString();

    // The single copy of the by-name lookup + missing-column policy + the three error strings.
    private MemoryText Get(string name)
    {
        if (!headerLookup.TryGetValue(name, out var i))
        {
            if (options.ReturnEmptyForMissingColumn) return Empty;          // FIX 1: not `default!`
            throw new ArgumentOutOfRangeException(nameof(name), name,
                $"Header '{name}' does not exist. Expected one of {string.Join("; ", StringHeaders())}");
        }
        try { return ParsedValues[i]; }
        catch (IndexOutOfRangeException)
        {
            throw new InvalidOperationException(
                $"Invalid row, missing {name} header, expected {headers.Length} columns, got {ParsedValues.Length} columns.");
        }
    }
    // ICsvLine string projections call Get(name).AsString(); ICsvLineSpan returns Get(name)/.Span;
    // ICsvLineFromMemory (explicit) returns Get(name) — one resolver, three faces.
}

Usage — factories retarget; public bodies unchanged

internal readonly struct StringRowFactory : IRowFactory<CsvLine<DefaultTrimSplit>> { /* string source: rawString = the line */ }
#if NET8_0_OR_GREATER
internal readonly struct SpanRowFactory      : IRowFactory<CsvLine<DefaultTrimSplit>> { /* string source -> ICsvLineSpan   */ }
internal readonly struct MemoryRowFactory    : IRowFactory<CsvLine<DefaultTrimSplit>> { /* ROM source   -> ICsvLineFromMemory */ }
internal readonly struct OptimizedRowFactory : IRowFactory<CsvLine<OptimizedTrimSplit>> { /* ROM verbatim, NO ToString */ }
#endif

// Enumerate returns IEnumerable<CsvLine<...>>; IEnumerable<out T> covariance assigns it to
// IEnumerable<ICsvLine> / <ICsvLineSpan> / <ICsvLineFromMemory> with no wrapper and no Cast,
// because CsvLine<...> implements all three. Public Read* method bodies are byte-for-byte unchanged.

// GetBlock (CsvReader.cs:445) — the one internal construction site beyond the factories:
//   new CsvLine<DefaultTrimSplit>(headers, map, index, raw, default, options) { parsedValues = values }

The three corrections (each pins a defect the adversarial review found in the raw design)

  1. Empty value (ns2.0 correctness). Replace return default! with static readonly MemoryText Empty = "".AsMemory(). default! is string.Empty on net8 but null on ns2.0 (where MemoryText = string), silently changing ReturnEmptyForMissingColumn from "" to null. "".AsMemory() is correct on both targets. (Tests run net9-only, so this would otherwise ship undetected.)
  2. GetBlock. Keep parsedValues internal and preserve the { parsedValues = values } initializer at CsvReader.cs:445; otherwise the pre-extracted sub-line values are discarded and the row re-splits a synthesized Raw, producing wrong columns.
  3. Raw identity / allocation parity. Carry the engine-supplied original line string in rawString; Raw => rawString ?? rawMemory.ToString(). Without this, a uniformly MemoryText-backed row re-materializes Raw via new string(span) on the Read/ReadAsSpan TextReader paths, regressing today's zero-alloc Raw. The optimized factory passes no rawString, so that path stays ROM-verbatim.

Dependency Strategy

In-process — merged directly; no ports, adapters, or stand-ins. The split/trim variation (SplitLine/Trim vs SplitLineOptimized/TrimOptimized) and its CsvMemoryOptions are wired via the injected value-type strategy TPolicy : struct, ITrimSplit — not a runtime branch and not a captured delegate. CsvMemoryOptions lives only inside OptimizedTrimSplit's field, so non-pooled rows carry zero extra state (parity with today, where only ReadLineSpanOptimized holds it). The existing free helpers SplitLine/Trim/SplitLineOptimized/TrimOptimized/TrimMemory keep their signatures; the strategies just call them.

Testing Strategy

  • New boundary tests to add:
    • A ns2.0-target assertion that ReturnEmptyForMissingColumn returns "" (not null) — closes the coverage gap that would hide FIX 1. (Either add ns2.0 to the test TFM, or a targeted unit test.)
    • A GetBlock test asserting pre-seeded sub-line values survive and are not re-split — guards FIX 2.
    • An allocation tripwire that line.Raw on Read/ReadAsSpan over a TextReader does not allocate — guards FIX 3 (extends the existing PerformanceTests allocation-comparison style).
  • Existing tests to keep (now load-bearing differently): EngineUnificationTests.cs already asserts identical Headers/Values/Raw across all five paths at the public boundary — it stays, but its cross-path duplication now verifies the unification rather than papering over four hand-written classes. No test needs deletion; none construct a row class directly.
  • Test environment: none — pure in-process computation.

Implementation Recommendations

  • Owns: the lazy split → validate → trim pipeline; the missing-column / short-row policy; the three error strings (now single-copy); the storage representation.
  • Hides: whether the raw line is a cached string, a string-view, or a never-materialized ReadOnlyMemory<char> slice; the pooled-vs-plain split/trim choice (behind ITrimSplit); the dual-interface explicit implementation.
  • Exposes: the three frozen interfaces, unchanged.
  • Migration: internal-only. Delete ReadLineSpan, ReadLineSpanOptimized, ReadLineFromMemory; rename ReadLineCsvLine<DefaultTrimSplit>; retarget the four IRowFactory structs to the closed generic; update the single GetBlock call site. No public consumer changes. Expected ≈465 → ≈210 internal LOC.
  • Performance / AOT: no virtual dispatch added — TPolicy : struct, ITrimSplit is JIT-specialized and inlined exactly like the existing IRowFactory structs, and Enumerate's constraints are unchanged. No reflection; the two closed instantiations (CsvLine<DefaultTrimSplit>, CsvLine<OptimizedTrimSplit>) are trim-rooted by the public Read* methods. The optimized path keeps its ROM-verbatim backing, SplitLineOptimized/TrimOptimized, and one-heap-object-per-row profile.
  • Accepted behavioral note: on net8, a Read()/ReadAsSpan() row now also satisfies ICsvLineFromMemory at runtime (one type implements all three). Internal-only; the declared public return types are unchanged. A consumer relying on a negative cast would observe a change — documented as accepted, consistent with RFC: Unify the four CsvReader read loops behind one JIT-devirtualized engine #118's "same concrete type per path" consequence.
  • Comment warranted: explicit-interface implementation is load-bearing here (two this[string] differing only by return type on one class) — one short comment should explain why both exist.

🤖 Generated with Claude Code — produced by an /improve-codebase-architecture exploration (Candidate 1 of docs/Architecture-Deepening-Candidates.md), with the interface chosen via a 4-design parallel exploration + adversarial judging.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions