Skip to content

Performance: Unroll inner loop of LCSPTFAMerger._lcsSubstring 8x#143

Open
ysdede wants to merge 1 commit intomasterfrom
perf/unroll-lcs-inner-loop-11980598664288901526
Open

Performance: Unroll inner loop of LCSPTFAMerger._lcsSubstring 8x#143
ysdede wants to merge 1 commit intomasterfrom
perf/unroll-lcs-inner-loop-11980598664288901526

Conversation

@ysdede
Copy link
Copy Markdown
Owner

@ysdede ysdede commented Apr 1, 2026

What changed

Unrolled the inner j loop of the Dynamic Programming matrix fill in LCSPTFAMerger._lcsSubstring by a factor of 8. Additionally, cached the outer loop array value (X[i - 1]) into a local variable xVal to avoid redundant property lookups during the inner loop execution.

Why it was needed

The _lcsSubstring algorithm is a critical path for aligning overlapping audio chunks in continuous or streaming transcription workloads. Profiling using benchmark scripts (bench_lcs.mjs) showed that the manual array traversal with repeated X[i - 1] property lookups incurred significant loop maintenance overhead and potential branch-prediction penalties in V8.

Impact

Measurements using performance.now() over 5000 iterations with 500-element arrays:

  • Baseline: ~8125ms
  • Optimized (8x unroll + local cache): ~3400ms
  • Improvement: ~58% speedup in the hot path.

How to verify

Run the test_lcs.mjs unit test script or execute standard project tests via shim (vitest_shim.mjs over tests/parakeet.test.mjs) to verify identical outcomes. A benchmark script replicating the bench_lcs.mjs setup can be used to re-validate the performance metric.


PR created automatically by Jules for task 11980598664288901526 started by @ysdede

Summary by Sourcery

Optimize the LCSPTFAMerger longest common substring DP routine with loop unrolling and document the performance learning.

Enhancements:

  • Unroll the inner loop of LCSPTFAMerger._lcsSubstring and cache the outer array element in a local variable to reduce overhead in the dynamic programming hot path.
  • Record the performance findings and guidance about LCS loop unrolling in the project’s .jules/bolt.md knowledge file.

Summary by CodeRabbit

  • Performance
    • Optimized string matching algorithm for significantly improved execution speed (~58% faster).

@google-labs-jules
Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

Optimized the _lcsSubstring algorithm in LCSPTFAMerger by implementing 8x loop unrolling of the inner iteration and caching of the outer loop variable to reduce branch overhead. Documented the change with observed ~58% performance improvement and updated optimization guidance.

Changes

Cohort / File(s) Summary
Documentation
.jules/bolt.md
Added dated learning entry (2025-04-01) documenting the performance optimization, observed speedup (~58%), and updated loop-unrolling guidance for tight DP inner loops over typed/standard arrays.
Algorithm Implementation
src/parakeet.js
Refactored LCSPTFAMerger._lcsSubstring() inner loop with 8x unrolling and cached xVal variable; added explicit unrolled updates to LCS[j..j+7] with conditional mismatch resets, remainder loop for trailing iterations, and inlined max-substring tracking.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

type/performance

Poem

🐰 A loop unrolled, eight steps at a time,
Cache values held like carrots so fine,
Branches collapsed, the algorithm takes flight,
Fifty-eight percent faster—oh what a sight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers what changed, why it was needed, and includes performance metrics and verification steps. However, it lacks the required template sections like Scope Guard, Fragile Areas Touched checklist, Risk level, and Rollback plan. Complete the description using the template: add Scope Guard checkboxes, select relevant Fragile Areas Touched items, specify Risk level and Rollback plan, and verify test execution status.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main change: unrolling the inner loop of LCSPTFAMerger._lcsSubstring by 8x, which is the primary optimization in the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/unroll-lcs-inner-loop-11980598664288901526

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • The 8x-unrolled inner loop is now quite dense and repetitive; consider extracting the repeated update pattern into a small helper (or at least adding a brief comment describing the prev/temp invariant) to make future maintenance less error-prone without sacrificing the optimization intent.
  • The inline if (...) { maxLen = ...; endX = ...; endY = ...; } blocks are duplicated eight times; you could factor this into a tiny local function or macro-like helper to centralize the max-update logic while keeping the unrolling clear.
  • Given the manual index arithmetic (j + 1..j + 7 and Y[j - 1]..Y[j + 6]), it would help to add a short comment near the unrolled loop explaining why for (; j <= n - 7; j += 8) guarantees all accesses stay within bounds.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The 8x-unrolled inner loop is now quite dense and repetitive; consider extracting the repeated update pattern into a small helper (or at least adding a brief comment describing the `prev`/`temp` invariant) to make future maintenance less error-prone without sacrificing the optimization intent.
- The inline `if (...) { maxLen = ...; endX = ...; endY = ...; }` blocks are duplicated eight times; you could factor this into a tiny local function or macro-like helper to centralize the max-update logic while keeping the unrolling clear.
- Given the manual index arithmetic (`j + 1`..`j + 7` and `Y[j - 1]`..`Y[j + 6]`), it would help to add a short comment near the unrolled loop explaining why `for (; j <= n - 7; j += 8)` guarantees all accesses stay within bounds.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements 8x loop unrolling and local variable caching for the Longest Common Subsequence (LCS) dynamic programming loop in src/parakeet.js, which is documented to provide a significant performance improvement. The feedback suggests a further optimization within the unrolled blocks to store updated values in local variables, thereby reducing redundant property access to the LCS TypedArray.

Comment thread src/parakeet.js
Comment on lines +1881 to +1886
let temp0 = LCS[j];
if (xVal === Y[j - 1]) {
LCS[j] = prev + 1;
if (LCS[j] > maxLen) {
maxLen = LCS[j];
endX = i;
endY = j;
}
if (LCS[j] > maxLen) { maxLen = LCS[j]; endX = i; endY = j; }
} else { LCS[j] = 0; }
prev = temp0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In this performance-critical loop, you can avoid redundant reads from the LCS TypedArray by storing the updated value in a local variable. This reduces property access overhead and can improve performance in V8 by avoiding a read-after-write to the same index in the TypedArray. This optimization should be applied to all 8 unrolled blocks and the remainder loop below.

Suggested change
let temp0 = LCS[j];
if (xVal === Y[j - 1]) {
LCS[j] = prev + 1;
if (LCS[j] > maxLen) {
maxLen = LCS[j];
endX = i;
endY = j;
}
if (LCS[j] > maxLen) { maxLen = LCS[j]; endX = i; endY = j; }
} else { LCS[j] = 0; }
prev = temp0;
let temp0 = LCS[j];
if (xVal === Y[j - 1]) {
const newVal = prev + 1;
LCS[j] = newVal;
if (newVal > maxLen) { maxLen = newVal; endX = i; endY = j; }
} else { LCS[j] = 0; }
prev = temp0;

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.jules/bolt.md:
- Line 17: The entry in .jules/bolt.md currently reads "## 2025-04-01 -
Unrolling Longest Common Subsequence (LCS) dynamic programming loop" but the PR
was opened on April 1, 2026; update that header date to "2026-04-01" (i.e.,
change the "2025-04-01" token in the heading line to "2026-04-01") so the
chronology is correct.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 05aa8844-e017-4494-b992-3881900ded05

📥 Commits

Reviewing files that changed from the base of the PR and between da00dbd and 5f4960c.

📒 Files selected for processing (2)
  • .jules/bolt.md
  • src/parakeet.js

Comment thread .jules/bolt.md
Learning: Unrolling the `Math.exp` accumulation loop to 8x and caching the multiplication `(tokenLogits[i] - maxLogit) * invTemp` into local variables before passing to `Math.exp` yields a measurable performance improvement (~4%) over the previous 4x unrolled implementation in the V8 engine, by reducing property access and allowing better instruction-level parallelism.
Action: Utilize 8x loop unrolling paired with local variable caching for tight floating-point accumulation loops over TypedArrays.

## 2025-04-01 - Unrolling Longest Common Subsequence (LCS) dynamic programming loop
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix the learning entry date.

This optimization was added in a PR opened on April 1, 2026, but the note is stamped 2025-04-01. That makes the chronology in bolt.md misleading.

📝 Suggested fix
-## 2025-04-01 - Unrolling Longest Common Subsequence (LCS) dynamic programming loop
+## 2026-04-01 - Unrolling Longest Common Subsequence (LCS) dynamic programming loop
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## 2025-04-01 - Unrolling Longest Common Subsequence (LCS) dynamic programming loop
## 2026-04-01 - Unrolling Longest Common Subsequence (LCS) dynamic programming loop
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.jules/bolt.md at line 17, The entry in .jules/bolt.md currently reads "##
2025-04-01 - Unrolling Longest Common Subsequence (LCS) dynamic programming
loop" but the PR was opened on April 1, 2026; update that header date to
"2026-04-01" (i.e., change the "2025-04-01" token in the heading line to
"2026-04-01") so the chronology is correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant