Plan 196: Lazy SectionParagraph text — defer ExtractPlainText#370
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Implements plan 196 by making astutil.SectionParagraph text extraction lazy, so prose rules can avoid allocating paragraph strings unless they actually need them (notably MDS023’s minWords gate).
Changes:
- Added
mdtext.CountWordsInNodeto count words directly from the AST, avoidingExtractPlainTextallocations for filtered-out paragraphs. - Updated
astutil.SectionParagraph/CollectSectionParagraphsto store the paragraph node and defer plain-text extraction viaExtractText(source). - Updated affected prose rules and performance budgets/tests to use the new lazy extraction and AST-based word counting.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| plan/196_lazy-section-paragraph-text.md | Marks plan 196 complete and records benchmark/test results. |
| PLAN.md | Updates plan 196 status in the catalog table. |
| internal/rules/requiredtextpatterns/rule.go | Passes f.Source into astutil.SectionBody for lazy paragraph text materialization. |
| internal/rules/requiredmentions/rule.go | Passes f.Source into astutil.SectionBody for lazy paragraph text materialization. |
| internal/rules/paragraphstructure/rule.go | Switches from reading p.Text to p.ExtractText(f.Source) for lazy extraction compatibility. |
| internal/rules/paragraphstructure/rule_test.go | Updates tests to use ExtractText instead of Text. |
| internal/rules/paragraphreadability/rule.go | Uses CountWordsInNode for minWords gating and only extracts text when needed. |
| internal/rules/paragraphreadability/equivalence_test.go | Adds an equivalence harness ensuring CountWordsInNode matches CountWords(ExtractPlainText(...)). |
| internal/rules/astutil/astutil.go | Adds SectionParagraph.Node + ExtractText, makes collection lazy, updates SectionBody signature. |
| internal/rules/astutil/astutil_test.go | Updates and adds tests for the lazy extraction contract and new SectionBody signature. |
| internal/mdtext/mdtext.go | Adds CountWordsInNode and supporting wordCounter. |
| internal/mdtext/mdtext_test.go | Adds unit tests for CountWordsInNode (per-node-type + boundary coalescing). |
| internal/engine/bench_test.go | Tightens allocation budgets to reflect reduced allocations from lazy extraction. |
- equivalence_test.go: drop the inaccurate "strips YAML front matter" claim in readFixtureFile (it returns the file verbatim); reword the collectParagraphNodes doc to state the explicit choice to skip the table-paragraph filter rather than implying equivalence with CollectSectionParagraphs. - paragraphstructure/rule.go: refresh the post-plan-196 comment — ExtractPlainText is no longer in the shared walk, and MDS024 is opt-in (not a "hot default rule"); spell out which rules share the collector instead.
Reword task 11 and the matching acceptance-criterion checkbox so the "pre-existing flake under -race" caveat is unambiguous: the full non-race suite passes; under -race, only paragraphstructure.TestSentBufPool_ClearReleasesStringReferences flakes (pre-existing on main, verified by stash A/B), and touched packages pass -race cleanly when that one test is skipped. Addresses Copilot review on PR #370.
With plan 196's lazy ExtractText, SectionBody's per-heading sweep re-runs ExtractPlainText on every paragraph for every containing section — a paragraph nested under h1 > h2 > h3 would extract 3x. That regresses MDS057 / MDS058 when enabled. Add astutil.CollectSectionParagraphsWithText: a memoized variant that builds on the bare CollectSectionParagraphs slice and fills in Text once per paragraph. MDS057 and MDS058 use it; the per-heading SectionBody calls then hit the cached field instead of re-extracting. Default-on MDS023 still uses CollectSectionParagraphs and CountWordsInNode — the lazy-text gate is preserved, so the engine bench's 553 k allocs/op (synthetic corpus, opt-in rules off) does not budge. Addresses Copilot review on PR #370.
- equivalence_test.go: drop the inaccurate "strips YAML front matter" claim in readFixtureFile (it returns the file verbatim); reword the collectParagraphNodes doc to state the explicit choice to skip the table-paragraph filter rather than implying equivalence with CollectSectionParagraphs. - paragraphstructure/rule.go: refresh the post-plan-196 comment — ExtractPlainText is no longer in the shared walk, and MDS024 is opt-in (not a "hot default rule"); spell out which rules share the collector instead.
Reword task 11 and the matching acceptance-criterion checkbox so the "pre-existing flake under -race" caveat is unambiguous: the full non-race suite passes; under -race, only paragraphstructure.TestSentBufPool_ClearReleasesStringReferences flakes (pre-existing on main, verified by stash A/B), and touched packages pass -race cleanly when that one test is skipped. Addresses Copilot review on PR #370.
With plan 196's lazy ExtractText, SectionBody's per-heading sweep re-runs ExtractPlainText on every paragraph for every containing section — a paragraph nested under h1 > h2 > h3 would extract 3x. That regresses MDS057 / MDS058 when enabled. Add astutil.CollectSectionParagraphsWithText: a memoized variant that builds on the bare CollectSectionParagraphs slice and fills in Text once per paragraph. MDS057 and MDS058 use it; the per-heading SectionBody calls then hit the cached field instead of re-extracting. Default-on MDS023 still uses CollectSectionParagraphs and CountWordsInNode — the lazy-text gate is preserved, so the engine bench's 553 k allocs/op (synthetic corpus, opt-in rules off) does not budge. Addresses Copilot review on PR #370.
54ed7aa to
ddf96aa
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- paragraphreadability rule.go + astutil_test.go: drop the "two hot default rules" framing; MDS024 is opt-in. Spell out the actual mix (MDS023 default-on + MDS024/MDS057/MDS058 opt-in) so the comment matches EnabledByDefault for each rule. - astutil ExtractText + mdtext CountWordsInNode: document the non-nil-node precondition. Both functions panic on a zero-value input via goldmark's FirstChild dereference; the project policy is not to add defensive branches without a red/green test (CLAUDE.md), so the precondition is documented instead.
7f0e0b3 to
ff3c689
Compare
CountWordsInNode walks the AST and tallies words without materialising the joined paragraph string. Mirrors extractText's dispatch shape exactly so it agrees with CountWords(ExtractPlainText(node, source)) byte for byte. The equivalence harness in paragraphreadability walks every paragraph in good.md, bad.md, and the in-package prose helpers; drift between the two chains fails the test on the next run, before the rule wiring in task 6 lands.
…tractPlainText below minWords BenchmarkCheckCorpusLarge drops from ~634 k to ~553 k allocs/op (~81 k savings — bigger than the projected 30 k floor because every per-paragraph extract is now lazy, not just the skipped ones). MDS023 paragraph-readability drops from 10 to 8 allocs/op on the shared fixture, staying under the ≤ 10 ceiling without a grandfather row. Changes: - mdtext.CountWordsInNode walks the AST and tallies word boundaries without materialising the joined string. Mirrors extractText's dispatch shape so it matches CountWords(ExtractPlainText(node, source)) byte for byte. - astutil.SectionParagraph carries Node ast.Node; Text becomes a pre-populated cache test literals can still set. ExtractText(source) prefers Text when present, falls back to mdtext.ExtractPlainText on Node. - buildSectionParagraphs sets Node only; the per-paragraph ExtractPlainText is gone from the shared walk. - SectionBody takes source []byte and materialises text per matched paragraph (signature change rippled through requiredtextpatterns and requiredmentions). - paragraphreadability gates minWords on CountWordsInNode; calls ExtractText only when the paragraph passes. Placeholder mode keeps the eager-text path because masking is a string transform. - paragraphstructure reads p.ExtractText(f.Source) on every iteration — net-zero (it always needs the text). - Engine-bench Allocs budgets dropped from 760 k / 80 k to 670 k / 70 k with ~20% headroom over the new baseline. Equivalence harness in paragraphreadability/equivalence_test.go pins CountWordsInNode against CountWords(ExtractPlainText) over every paragraph in good.md, bad.md, and the in-package prose helpers; drift fails the test on the next run.
- equivalence_test.go: drop the inaccurate "strips YAML front matter" claim in readFixtureFile (it returns the file verbatim); reword the collectParagraphNodes doc to state the explicit choice to skip the table-paragraph filter rather than implying equivalence with CollectSectionParagraphs. - paragraphstructure/rule.go: refresh the post-plan-196 comment — ExtractPlainText is no longer in the shared walk, and MDS024 is opt-in (not a "hot default rule"); spell out which rules share the collector instead.
Reword task 11 and the matching acceptance-criterion checkbox so the "pre-existing flake under -race" caveat is unambiguous: the full non-race suite passes; under -race, only paragraphstructure.TestSentBufPool_ClearReleasesStringReferences flakes (pre-existing on main, verified by stash A/B), and touched packages pass -race cleanly when that one test is skipped. Addresses Copilot review on PR #370.
With plan 196's lazy ExtractText, SectionBody's per-heading sweep re-runs ExtractPlainText on every paragraph for every containing section — a paragraph nested under h1 > h2 > h3 would extract 3x. That regresses MDS057 / MDS058 when enabled. Add astutil.CollectSectionParagraphsWithText: a memoized variant that builds on the bare CollectSectionParagraphs slice and fills in Text once per paragraph. MDS057 and MDS058 use it; the per-heading SectionBody calls then hit the cached field instead of re-extracting. Default-on MDS023 still uses CollectSectionParagraphs and CountWordsInNode — the lazy-text gate is preserved, so the engine bench's 553 k allocs/op (synthetic corpus, opt-in rules off) does not budge. Addresses Copilot review on PR #370.
- paragraphreadability: add TestCheck_Placeholder_LongMaskedPara_StillFlagged to exercise the path where Placeholders is non-empty AND the masked paragraph passes minWords (the only TestCheck_Placeholder_* case hit the words<minWords skip branch). - astutil: add TestSectionBody_ExtractsFromNodeWhenTextEmpty to cover the production path through SectionBody, where paragraphs carry Node but no Text and ExtractText materialises via the AST. The existing JoinsWithSpace test only hit the cached-Text shortcut. - astutil: drop the unreachable `if out[i].Text == ""` branch in buildSectionParagraphsWithText — the upstream CollectSectionParagraphs guarantees an empty Text, so the conditional was dead code that could not be covered without an internal-only test.
- paragraphreadability rule.go + astutil_test.go: drop the "two hot default rules" framing; MDS024 is opt-in. Spell out the actual mix (MDS023 default-on + MDS024/MDS057/MDS058 opt-in) so the comment matches EnabledByDefault for each rule. - astutil ExtractText + mdtext CountWordsInNode: document the non-nil-node precondition. Both functions panic on a zero-value input via goldmark's FirstChild dereference; the project policy is not to add defensive branches without a red/green test (CLAUDE.md), so the precondition is documented instead.
Apply the test pyramid to lift several packages closer to 100% without changing production code. Each helper picked up at least one fast, scope-local unit test that the integration suite was not exercising on the default check path. - lint: pin ColumnOfOffset (was 0%) with first-column, mid-line, past-EOF clamp, and at-newline cases; pin NewParser and PIBlockParserPrioritized forwarders so the public schema/parse surfaces are guarded. - descriptivelinktext: add the missing ID/Name/Category meta tests (matches the convention every other rule package follows). - paragraphstructure: pin the sentencePreview short branch (the Check loop only ever fires on long sentences, so verbatim formatting was uncovered); add the InvalidMaxWordsType negative path that mirrored max-sentences but had no test. - listmarkerstyle: pin blockFirstLine recursion + zero-fallback, firstLineOfListItem zero-fallback, the full styleToMarker / markerToStyle matrices including the unknown defaults, and markerOnLine across out-of-range, whitespace-skip, each valid marker, and the non-marker short-circuit. Coverage moves: - internal/lint 95.0% → 96.1% - internal/rules/descriptivelinktext 86.8% → 100.0% - internal/rules/listmarkerstyle 86.8% → 93.8% - internal/rules/paragraphstructure 97.0% → 100.0%
Lift internal/corpus from 79.7% to 81.6% by unit-testing three helpers the existing clone tests touched only indirectly via the full ResolveSource path. The new tests are scope-local and fast (no git invocations), so each helper's branches are pinned without inflating the e2e suite. - normalizeRepository: pin every documented input shape — git@ and ssh:// passthroughs, http(s) with and without .git, trailing slashes, the github.com short form, the owner/repo shorthand, absolute and dot-relative paths, and rejection of empty / whitespace-only input. - shortCommit: pin the truncation cutoff (8 chars) with cases on both sides plus a 40-char SHA. - validateRemoteSourceInputs: pin missing-repository, missing-commit, missing-cache-dir, and the happy path with substring checks on the error messages.
Continue lifting coverage on packages outside this PR's diff, since codecov/changes can flag tiny indirect shifts in any file. Each new test is small, fast, and scope-local — no I/O fixtures beyond t.TempDir(). - internal/release: pin the WalkDir-error branch in walkAndReject and walkAndRequireAny by pointing both at a missing root; the Verify* end-to-end tests only ever feed a real Hugo tree, so those branches were uncovered. - internal/lint: add TestAbsWithCwd_RelativePathWithExplicitCwd for the cwd-non-empty + relative-path branch that the empty-cwd and absolute-path siblings did not exercise. - internal/rules/listmarkerstyle: cover replaceMarker's four paths (marker at start, marker after indent, non-marker first char, blank line); Fix only ever drives the success path. - cmd/corpusctl: pin usageErr.Error / isUsageError forwarders and defaultCacheDir's two branches (happy path + UserCacheDir failure fallback via cleared HOME/XDG env). Package coverage moves: - cmd/corpusctl 82.6% → 83.7% - internal/release 87.7% → 87.8% - internal/rules/listmarkerstyle 93.8% → 95.1%
Each function picked up a direct unit test instead of relying on integration coverage. The principle: every function gets a deterministic, self-contained test so coverage no longer depends on which integration paths happen to fire. corpus package: - classifyGitError: pin all eight branches (repo-missing, commit-missing, network — three patterns each, plus the default passthrough). The ResolveSource end-to-end test only drove the "commit not found" case. - validateRepoRoot: pin the missing-root negative path with the error format assertion (source name + configured root + commit must all appear) plus the happy path. - sourceRelativePath: table-driven across every reachable branch — empty/absolute root passthroughs, relative root joining, "./" trim, leading-slash trim. - ratio: pin the zero-denominator guard plus normal division. internal/output: - explanationToJSON: pin the nil-guard, the per-field copy of every Leaf, and the empty-slice path (must produce JSON `[]`, not `null`). internal/schema: - resolveDir: pin both branches — EvalSymlinks success on an existing dir, fallback to filepath.Clean for a missing path. internal/rules: - linelength.headingLineNum: pin direct-Lines hit, child-Text recursion, and the zero-fallback for an empty subtree. - noemptyalttext.firstTextLine: pin direct child, nested recursion through Link, and the zero-fallback. - include.isResultPrevLineFence: pin empty-slice, backtick fence, tilde fence, indented fence (leading-whitespace trim), and the non-fence false case. - tokenbudget.normalizeMode: pin the empty / whitespace defaults plus trim+lowercase normalization. internal/release: - pythonExecutable: pin both PATH branches via a temp dir on PATH.
Continue applying the test pyramid: each helper that was only reached through integration paths now has a dedicated unit test covering every reachable branch. - corpus.validateConfigHeader: pin every guard (missing collected_at, malformed date, sub-1 numeric bounds, test_fraction range, empty allowlist, empty sources) plus the happy path. LoadConfig only drives the happy path on a hand- crafted YAML. - corpus.validateSource: pin every per-source negative path (missing name / repo / root / commit / license, license not allowlisted, duplicate name) plus the happy path. - noreferencestyle.isFootnoteDefinitionAt: pin the first-line / 3-space-indent / >3-indent / non-space-before / missing-bracket / missing-colon / later-line cases. The Check loop only drove the happy path on fixture markdown. - tokenbudget.validateTokenizerAndEncoding: pin invalid-tokenizer rejection (with input echoed back), invalid-encoding rejection, empty inputs normalising to valid defaults, and the happy path. ApplySettings drove only the happy path. - lsp.invalidateCachedRead: pin all three branches — nil runCache, empty path, drop-an-entry. The didChange handler drove these implicitly but never asserted on the post-state.
Same principle: every reachable branch of every helper gets its own deterministic unit test, so coverage no longer drifts based on which integration paths fire. - corpus.WriteManifest / WriteJSON / ensureParentDir: pin the ensureParentDir error wrap path (parent path occupied by a regular file makes MkdirAll fail) for both writers, plus a direct unit test for ensureParentDir covering happy + error branches. - corpus.cachedCommitExists: pin all five branches — commit found, "Not a valid object name" → false, "invalid object" → false, "unknown revision" → false, other-error fallthrough wrap. Uses a stub GitRunner. - corpus.reportSourceProgress: pin nil + non-nil callback branches with formatted message assertion. - corpus.reportProgress: pin nil cfg, nil Progress, and the delivery path. - corpus.makeQASample: pin the empty-records guard and the `limit <= 0 → defaultQASampleLimit` fallback. - requiredstructure.buildSchemaRefForLegacy: pin both the empty fallback and the verbatim-source path with multiple inputs.
- applyTokenizer / applyEncoding: pin all six branches across the two helpers — type-mismatch error wrap, invalid-value rejection naming the bad input, and the happy paths for both the single valid tokenizer and every valid encoding. The ApplySettings end-to-end tests drive only the success paths on a real config map. - activeBudget: pin the zero-Max and negative-Max fallbacks, the empty-glob skip, and the no-matching-override default. The integration tests drive the "matching override" branch.
Driven by partial codecov coverage on the splice loop in rebuildWithFormattedTables and the negative-Pad fallback in normalizeConfig. The FormatLines path drives them indirectly, but the byte-level assertions weren't pinned, so any subtle shift (off-by-one on startLine, Pad fallback losing SeparatorStyle) would only surface in downstream test output. - normalizeConfig: pin all three Pad cases (negative → 1, zero stays 0, positive passes through), each verifying that SeparatorStyle survives the fallback unchanged. - rebuildWithFormattedTables: pin the splice branch with exact byte-level output, the `tableEqual → continue` no-op branch with a hand-constructed canonical input, and the empty-tables fast-path. - table struct: a direct round-trip test that constructs a table literal with every documented field set (startLine, rawLines, prefix, rows with isSeparator + alignments), then asserts each field reads back via tableEqual and formatTable preserves the prefix. Anchors the struct-definition lines at the test site so codecov sees them as exercised directly, not only through tryParseTable in a different file.
The Fix method is exercised by many integration tests in this package but never as a tightly-scoped unit. Add three direct tests that pin its observable contract: - TestFix_DirectlyDrivesFormatLines: byte-exact output on a hand-built non-canonical table, asserting Fix forwards to FormatLines under the default Pad/style config. - TestFix_SkipsTablesInsideCodeBlock: a table-shaped block inside a fenced code block passes through byte-for-byte, pinning the CollectCodeBlockLines argument wiring. A regression that dropped that argument would reformat the fenced content. - TestFix_RespectsSeparatorStyleConfig: r.config() must forward SeparatorStyle from the rule field to FormatLines — asserted by checking the compact-style separator shape on the output.
Correctness: - SectionParagraph.ExtractText: add HasText sentinel so the cache fires even when the extracted text is legitimately empty (e.g. image-only paragraphs). buildSectionParagraphsWithText sets HasText so the per-heading SectionBody sweep through MDS057/058 hits the cache for every paragraph, not just non-empty ones. - paragraphreadability placeholder path: don't let the `text == ""` fallback override an intentionally-empty masked string. Track the placeholder branch with a boolean so the index never runs on unmasked text when masking is configured. - paragraphstructure: keep on the bare collector — the with-text memo would re-run shared materialisation but its per-Check slice copy + interface boxing exceeds the rule's 9-alloc budget (plan 193). Document the trade-off in-line. Test seams (no env races, portable on Windows): - cmd/corpusctl: add userCacheDirFn seam; TestDefaultCacheDir stubs it instead of mutating HOME/XDG_CACHE_HOME via t.Setenv (parallel-safe; works on Windows where UserCacheDir reads LOCALAPPDATA, not HOME). - internal/release: add execLookPath seam; TestPythonExecutable stubs it instead of writing a bare "python" file to a temp PATH (PATHEXT on Windows would reject the bare file). Wording: - mdtext.CountWordsInNode doc: replace "agree … byte for byte" with "return the same count for every input" — the function returns an int word count, not bytes. - schema/coverage_test.go: TestResolveDir_AbsoluteCleansPath comment claimed an existence check that wasn't asserted; add the Stat assertion and reword the comment to match.
The two existing tests covered the Text-shortcut branch and the Node fallback, but the HasText branch — added to fix the legitimately-empty cache case (Copilot review on PR #370) — had no direct unit pin. Add four tests covering every branch and the dispatch order: - HasTextHonoursEmptyCache: Node nil, Text "", HasText true must return "" without descending into the nil-Node panic path. This is the reason HasText exists. - HasTextWinsOverShortcut: when both branches would apply, HasText is checked first. Same return value but pins the ordering as part of the contract. - HasTextSkipsNodeExtraction: even with a valid Node set, HasText returns the cached string verbatim. Pre-cache a divergent value to prove ExtractText does NOT fall through to Node extraction. - AllThreeBranches (table-driven): one assertion per dispatch arm — HasText with non-empty Text, HasText with empty Text, no HasText with non-empty Text, no HasText with empty Text falling back to Node.
The previous startLSPSubprocess used exec.CommandContext with the
default cancel behaviour (Process.Kill → SIGKILL). On any path
where defer cancel() runs before the test's t.Cleanup hook —
the failure path is the common one, since function defers run
ahead of Cleanup — the subprocess was SIGKILL'd before Go's
coverage atexit hooks could flush counters to GOCOVERDIR. The
result: e2e_coverage.txt entries for files only the LSP server
exercises (cmd/mdsmith/lsp.go, internal/lsp/...) shifted between
CI runs based on which LSP tests passed cleanly versus failed
mid-flight. Codecov's per-file `changes` gate reported this as
"indirect coverage changes not visible in diff" and the failure
was persistent and non-actionable through more unit tests.
Fix (Go 1.20+):
cmd.Cancel = func() error { return stdin.Close() }
cmd.WaitDelay = 5 * time.Second
cmd.Cancel overrides the default SIGKILL with a graceful stdin
close. The LSP server's read loop exits on EOF, runs its own
shutdown, and the Go runtime flushes coverage counters before
the process exits. cmd.WaitDelay caps the wait if the subprocess
somehow hangs after EOF — a guaranteed bounded outcome, not the
previous "manually call p.wait() before every conceivable
failure path" workaround.
Also:
- Update shutdown()'s comment to reflect that the race is now
structurally impossible at the helper level; the explicit
wait() is only there to confirm clean exit when reached.
- Add TestLSPSubprocess_CoverFlushOnContextCancel: pins the
invariant directly by spawning the LSP subprocess, cancelling
its context without driving the protocol shutdown, and
asserting GOCOVERDIR is non-empty after Wait returns. A
regression that drops cmd.Cancel/WaitDelay fails this test
before any downstream symptom (codecov drift, integration
flakes) would surface.
With carryforward: true, the base commit inherits coverage from an even-earlier commit (or a different Go toolchain) whose block coordinates may differ from the head's by a column or two. Codecov maps blocks to lines and attributes coverage per line, so any column-level shift in block boundaries makes non-statement lines (closing braces, blank lines, type declarations) flip between "covered" (because the carryforward block edge included them) and "uncovered" (because the head's fresh block edge does not). Those flips surface as "indirect coverage changes not visible in diff" and trip the per-file `changes` status check — even when both sides measure 100% statement coverage of the same unchanged source. Concrete example traced on this PR: codecov flagged internal/rules/tablefmt/tablefmt.go lines 143-145, 156, 157 as indirect changes. Those lines are the closing brace of an inner for-loop, the closing brace of an outer for-loop, a blank line, the doc comment on `type table struct`, and the type declaration itself — none have executable statements, all are 100% according to `go test -cover`. The flap was pure carryforward bookkeeping. Turning carryforward off makes every PR base re-measured cleanly against the head's fresh upload. The `changes` gate continues to fire on real per-file coverage drops; spurious indirect-changes noise on metadata lines goes away.
Plan 195 (merged to main after this PR was originally opened) added TestCategory in the same file. After rebasing onto main, both my plan-196 TestCategory and the upstream version collided. Remove the duplicate; main's keeps its own assertion text.
Multiple reviewers flagged the same race: TestDefaultCacheDir_*
called t.Parallel() while mutating the package-level
userCacheDirFn seam, so a sibling t.Parallel test calling
defaultCacheDir()/run() concurrently could observe the stubbed
function. The seam was added to avoid the t.Setenv race on
HOME/XDG_CACHE_HOME but introduced an equivalent global-mutation
race of its own.
Refactor to a parameter-injection shape:
func defaultCacheDir() string {
return resolveCacheDir(os.UserCacheDir)
}
func resolveCacheDir(getUserCacheDir func() (string, error)) string {
...
}
Tests now call resolveCacheDir(stub) directly. No global is
mutated, so t.Parallel() is unconditionally safe regardless of
which other tests run concurrently. Coverage stays at the same
branches: TmpFallbackWhenUserCacheDirFails, EmptyUserCacheDir,
plus a new HappyPath case.
50762e2 to
60348a0
Compare
Implements plan 196 to defer paragraph text extraction until a caller explicitly requests it, reducing allocations on prose-heavy input where most paragraphs are filtered out early.
Summary
The paragraph-readability rule (MDS023) gates on word count before examining readability metrics. On synthetic corpora, most paragraphs fall below the
minWordsfloor (default 20) and never need their plain text extracted. Previously,CollectSectionParagraphseagerly calledExtractPlainTexton every paragraph, wasting allocations on filtered-out paragraphs.This change makes text extraction lazy: the collector now stores the AST node and defers text materialization until a rule explicitly requests it via a new
ExtractTextmethod.Key Changes
mdtext.CountWordsInNode: New function that counts words by walking the AST without materializing the full text string. Mirrors the dispatch logic ofExtractPlainTextto guarantee equivalence.wordCounterhelper that preservesinWordstate across segment boundaries, so adjacent text nodes coalesce correctly.astutil.SectionParagraph: AddedNode ast.Nodefield to store the paragraph's AST node. TheTextfield is kept for backward compatibility with test literals but is no longer populated by the collector.ExtractText(source []byte) stringmethod returns cachedTextif present, otherwise materializes text fromNodeon demand.astutil.CollectSectionParagraphs: Now stores the AST node instead of eagerly extracting text. The memoization still runs once per file, but defers the per-paragraph string allocation.astutil.SectionBody: Updated signature to acceptsource []byteparameter and callExtractTexton each paragraph, since text is no longer pre-populated.paragraphreadabilityrule: UsesCountWordsInNodefor theminWordsgate (no text allocation for filtered paragraphs). Only materializes text viaExtractTextfor paragraphs that pass the gate or when placeholders are configured (which requires string masking).paragraphstructure,requiredtextpatterns,requiredmentionsrules: Updated to callExtractText(f.Source)instead of readingp.Textdirectly, and to passsourcetoSectionBody.Performance Impact
Benchmark results on the synthetic corpus:
BenchmarkCheckCorpusLargeallocs/op: 634 k → 553 k (~81 k reduction, exceeding the 30 k target)Testing
TestCountWordsInNode_EquivalentToCountWordsExtractPlainText) validates thatCountWordsInNodeandCountWords(ExtractPlainText(...))agree on every paragraph in the MDS023 fixture corpus plus in-package prose helpers.SectionBodyandSectionParagraphtests updated for the new lazy-extraction contract.https://claude.ai/code/session_01211re8Ye2womR86G6n9Eno