perf: O(1) index maps for redundancy exact/normalized lookups #13

Merged: pythondatascrape merged 1 commit into main from perf/redundancy-index on May 6, 2026

@pythondatascrape
Owner

Summary

  • Adds exactIdx map[string]bool and normalizedIdx map[string]bool to Checker for O(1) exact and normalized duplicate detection
  • Record populates both index maps alongside the existing slices (the slices are still needed for the Jaccard similarity scan)
  • CheckWithThreshold checks maps first before falling through to the linear similarity scan
  • Adds ExactIndex() and NormalizedIndex() accessors for test verification
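The bullets above can be sketched as a small, self-contained program. This is only an illustration under stated assumptions, not the PR's actual code: `NewChecker`, the bodies of `normalize` and `jaccard`, the `"similar"` kind, and the `>=` threshold comparison are all hypothetical stand-ins, since the page shows only fragments of the real implementation.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Result mirrors the fields visible in the PR's review snippets.
type Result struct {
	IsRedundant bool
	Kind        string
	Similarity  float64
}

// Checker keeps slices for the similarity scan and maps for O(1) lookups.
type Checker struct {
	mu            sync.Mutex
	raw           []string
	normalized    []string
	exactIdx      map[string]bool
	normalizedIdx map[string]bool
}

// NewChecker is a hypothetical constructor that initializes both index maps.
func NewChecker() *Checker {
	return &Checker{exactIdx: map[string]bool{}, normalizedIdx: map[string]bool{}}
}

// normalize is a stand-in (assumption): lowercase and collapse whitespace.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// jaccard is a stand-in (assumption): intersection over union of token sets.
func jaccard(a, b []string) float64 {
	seen := make(map[string]int)
	for _, t := range a {
		seen[t] |= 1 // bit 0: token occurs in a
	}
	for _, t := range b {
		seen[t] |= 2 // bit 1: token occurs in b
	}
	inter := 0
	for _, bits := range seen {
		if bits == 3 {
			inter++
		}
	}
	if len(seen) == 0 {
		return 0
	}
	return float64(inter) / float64(len(seen))
}

// Record appends to the slices and marks both indexes.
func (c *Checker) Record(content string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	norm := normalize(content)
	c.raw = append(c.raw, content)
	c.normalized = append(c.normalized, norm)
	c.exactIdx[content] = true
	c.normalizedIdx[norm] = true
}

// CheckWithThreshold tries the two O(1) map lookups, then the linear scan.
// The "similar" kind and the >= comparison are assumptions of this sketch.
func (c *Checker) CheckWithThreshold(content string, threshold float64) Result {
	c.mu.Lock()
	defer c.mu.Unlock()
	norm := normalize(content)
	if c.exactIdx[content] {
		return Result{IsRedundant: true, Kind: "exact", Similarity: 1.0}
	}
	if c.normalizedIdx[norm] {
		return Result{IsRedundant: true, Kind: "normalized", Similarity: 1.0}
	}
	normTokens := strings.Fields(norm)
	for _, prev := range c.normalized {
		if sim := jaccard(normTokens, strings.Fields(prev)); sim >= threshold {
			return Result{IsRedundant: true, Kind: "similar", Similarity: sim}
		}
	}
	return Result{}
}

func main() {
	c := NewChecker()
	c.Record("Hello World")
	fmt.Println(c.CheckWithThreshold("Hello World", 0.8).Kind)           // exact
	fmt.Println(c.CheckWithThreshold("hello   world", 0.8).Kind)         // normalized
	fmt.Println(c.CheckWithThreshold("unrelated text", 0.8).IsRedundant) // false
}
```

Only inputs that miss both maps pay for the linear scan, which is why the slices stay alongside the indexes.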

Test plan

  • TestExactLookupUsesIndex — verifies index is populated and used for exact matches
  • TestNormalizedLookupUsesIndex — verifies normalized index populated correctly
  • TestLargeExactIndex — 10k entries, exact lookup completes in <5ms
  • TestLargeNormalizedIndex — 10k entries, normalized lookup completes in <5ms
  • All existing redundancy tests continue to pass

Closes #8

🤖 Generated with Claude Code

…lookups

Closes #8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 14:20

Copilot AI left a comment

Pull request overview

This PR improves redundancy detection performance by adding O(1) exact and normalized lookup indexes to internal/redundancy.Checker, allowing CheckWithThreshold to short-circuit on common duplicate cases before falling back to the similarity scan.

Changes:

  • Add exactIdx / normalizedIdx maps to Checker and populate them in Record.
  • Update CheckWithThreshold to consult the maps before scanning c.normalized for Jaccard similarity.
  • Add tests and new accessor methods (ExactIndex(), NormalizedIndex()) to support index verification.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| internal/redundancy/redundancy.go | Adds exact/normalized index maps and uses them for O(1) redundancy checks; introduces index accessors. |
| internal/redundancy/redundancy_test.go | Adds tests intended to validate index population and large-history behavior. |

Comment on lines 54 to +57
// Check tests content against all previously recorded strings.
// It does NOT record the content — call Record separately if desired.
func (c *Checker) ExactIndex() map[string]bool {
	c.mu.Lock()
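The snippet above is cut off right after the lock is taken, so the accessor's body is not visible on this page. One possible shape, offered purely as an assumption, is to return a copy of the map so that test code cannot read or mutate the live index without holding the mutex. The trimmed-down `Checker` here is hypothetical and carries only the fields the accessor needs.

```go
package main

import (
	"fmt"
	"sync"
)

// Checker is reduced to the fields needed for this illustration.
type Checker struct {
	mu       sync.Mutex
	exactIdx map[string]bool
}

// ExactIndex returns a snapshot of the exact-match index.
// Copying under the lock is an assumption of this sketch, not
// necessarily what the merged code does.
func (c *Checker) ExactIndex() map[string]bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]bool, len(c.exactIdx))
	for k, v := range c.exactIdx {
		out[k] = v
	}
	return out
}

func main() {
	c := &Checker{exactIdx: map[string]bool{"a": true}}
	snap := c.ExactIndex()
	snap["b"] = true                 // mutates only the copy
	fmt.Println(len(c.ExactIndex())) // 1
}
```

The copy costs O(n) per call, which is usually acceptable for a test-only accessor but would be a poor fit for a hot path.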
Comment on lines +71 to +72
// TestLargeExactIndex ensures exact match stays fast with many entries (index must not degrade).
func TestLargeExactIndex(t *testing.T) {
Comment on lines +89 to +94
		c.Record(strings.Repeat("y", i+1))
	}
	// Check reverse-order of a recorded entry — same tokens, different order (single token so same).
	target := strings.Repeat("y", n/2)
	result := c.Check(target)
	require.True(t, result.IsRedundant)
Comment on lines +84 to +85
// TestLargeNormalizedIndex ensures normalized match stays fast with many entries.
func TestLargeNormalizedIndex(t *testing.T) {
Comment on lines 82 to +92
func (c *Checker) CheckWithThreshold(content string, threshold float64) Result {
	c.mu.Lock()
	defer c.mu.Unlock()

	norm := normalize(content)
	normTokens := strings.Fields(norm)

	for i, r := range c.raw {
		if r == content {
			return Result{IsRedundant: true, Kind: "exact", Similarity: 1.0}
		}
		if c.normalized[i] == norm {
			return Result{IsRedundant: true, Kind: "normalized", Similarity: 1.0}
		}
		sim := jaccard(normTokens, strings.Fields(c.normalized[i]))
	if c.exactIdx[content] {
		return Result{IsRedundant: true, Kind: "exact", Similarity: 1.0}
	}
	if c.normalizedIdx[norm] {
		return Result{IsRedundant: true, Kind: "normalized", Similarity: 1.0}
pythondatascrape merged commit 7094d69 into main on May 6, 2026
7 of 8 checks passed
Development

Successfully merging this pull request may close these issues.

perf: redundancy Checker uses O(n) linear scan — index for sub-linear lookup
