fix: deduplicate contained search blocks and add is_test to JSON output #554

Merged
buger merged 1 commit into main from fix/search-dedup-553 on Apr 6, 2026

Conversation

@buger
Collaborator

@buger buger commented Apr 6, 2026

Summary

Fixes #553 — two improvements to probe search --allow-tests --exact --no-merge -o json for requirement-annotation indexing:

  • Deduplicate contained blocks: When the parser emits both a comment node (with attached declaration) and the declaration itself as separate overlapping results, the contained block is now removed. This runs unconditionally regardless of --no-merge, reducing 4 results to 2 in the reported repro case. Matched keywords from removed blocks are preserved on the surviving block.
  • Add is_test field to JSON output: Computed at serialization time from file naming conventions (is_test_file()) and code content patterns (is_test_code_block()). Only present when true (omitted for non-test blocks). Structural declarations (type_declaration, struct_item) are excluded from test detection, letting consumers distinguish top-level Verifies: comments on implementation code from actual test/witness blocks.

Test plan

  • 9 unit tests for deduplicate_contained_blocks() (containment, partial overlap, different files, keyword merging, same-range, edge cases)
  • 7 unit tests for is_test_code_block() (Rust, Python, JS/TS, Go, comment-with-function, type declarations, plain comments)
  • Manual verification with exact repro cases from issue:
    • Repro 1: 2 results instead of 4 (duplicates eliminated)
    • Repro 2: Go type_declaration has no is_test, Python/TS test blocks have "is_test": true
  • All 337 unit tests pass, 11 integration tests pass, 21 CLI tests pass

🤖 Generated with Claude Code

…ut (#553)

- Add deduplicate_contained_blocks() that removes blocks fully contained
  within another block from the same file, preserving matched keywords.
  Runs unconditionally regardless of --no-merge flag.
- Add is_test field to JSON output (only present when true) computed from
  file naming conventions and code content patterns across languages.
- Structural declarations (type_declaration, struct_item) are excluded
  from test detection to distinguish implementation-level Verifies
  comments from actual test/witness blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@buger buger merged commit cff9cff into main Apr 6, 2026
15 checks passed
@probelabs
Contributor

probelabs bot commented Apr 6, 2026

Overview

This PR fixes issue #553 by introducing two improvements to probe search --allow-tests --exact --no-merge -o json for requirement-annotation indexing:

  1. Deduplicates contained search blocks: Removes overlapping results where one block fully contains another (e.g., comment node with attached declaration, plus the declaration itself). This runs unconditionally regardless of --no-merge flag, reducing duplicate results from 4 to 2 in the reported repro case. Matched keywords from removed blocks are preserved on the surviving block.

  2. Adds is_test field to JSON output: Computed at serialization time from file naming conventions (is_test_file()) and code content patterns (is_test_code_block()). Only present when true (omitted for non-test blocks). Structural declarations (type_declaration, struct_item) are excluded from test detection.

Files Changed

  • src/search/block_merging.rs (+253 lines): New deduplicate_contained_blocks() function with 9 unit tests
  • src/search/search_output.rs (+170, -14 lines): Added is_test field to JSON output and is_test_code_block() helper with 7 unit tests
  • src/search/search_runner.rs (+17 lines): Integration point calling deduplication before optional merging

Architecture & Impact

Component Flow

graph TD
    A[Search Results] --> B[deduplicate_contained_blocks]
    B --> C[Optional Block Merging]
    C --> D[JSON Serialization]
    D --> E[is_test_file + is_test_code_block]
    E --> F[JSON Output with is_test field]
    
    B --> B1[Group by file]
    B1 --> B2[Sort by start line, span size]
    B2 --> B3[Mark contained blocks]
    B3 --> B4[Merge keywords]
    B4 --> B5[Return deduped results]

Key Technical Changes

Deduplication Logic (block_merging.rs):

  • Groups results by file path
  • Sorts by start line, then by span size descending (larger blocks first)
  • Detects full containment: inner_start >= outer_start && inner_end <= outer_end
  • Merges matched_keywords from contained blocks into container
  • Handles edge cases: same range, partial overlap, different files
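
The steps above can be sketched as a minimal standalone pass. This is an illustrative reconstruction from the bullet points, not the PR's actual code: the `Block` struct stands in for the crate's `SearchResult`, which has more fields.

```rust
use std::collections::HashMap;

// Minimal stand-in for the crate's SearchResult; only the fields the
// deduplication pass needs are modeled here.
#[derive(Debug, Clone)]
struct Block {
    file: String,
    lines: (usize, usize), // inclusive (start, end) line range
    matched_keywords: Vec<String>,
}

fn deduplicate_contained_blocks(results: Vec<Block>) -> Vec<Block> {
    // Group results by file path: containment is only meaningful within one file.
    let mut by_file: HashMap<String, Vec<Block>> = HashMap::new();
    for r in results {
        by_file.entry(r.file.clone()).or_default().push(r);
    }

    let mut out = Vec::new();
    for (_, mut blocks) in by_file {
        // Sort by start line, then span size descending, so the largest
        // block at each start position comes first.
        blocks.sort_by(|a, b| {
            a.lines.0.cmp(&b.lines.0)
                .then((b.lines.1 - b.lines.0).cmp(&(a.lines.1 - a.lines.0)))
        });
        let mut removed = vec![false; blocks.len()];
        for i in 0..blocks.len() {
            if removed[i] { continue; }
            let (outer_start, outer_end) = blocks[i].lines;
            for j in (i + 1)..blocks.len() {
                if removed[j] { continue; }
                let (inner_start, inner_end) = blocks[j].lines;
                // Full containment: the inner range lies within the outer range.
                if inner_start >= outer_start && inner_end <= outer_end {
                    // Preserve matched keywords from the block being removed.
                    let kws = blocks[j].matched_keywords.clone();
                    for kw in kws {
                        if !blocks[i].matched_keywords.contains(&kw) {
                            blocks[i].matched_keywords.push(kw);
                        }
                    }
                    removed[j] = true;
                }
            }
        }
        for (i, b) in blocks.into_iter().enumerate() {
            if !removed[i] { out.push(b); }
        }
    }
    out
}
```

On the repro from #553, a comment-plus-declaration block spanning lines 1-10 would absorb a declaration-only block at 2-5, keeping both blocks' requirement keywords on the survivor.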

Test Detection (search_output.rs):

  • File-level: Uses existing is_test_file() from probe_code::language
  • Code-level: is_test_code_block() checks first 10 lines for patterns:
    • Rust: #[test], #[cfg(test)]
    • Python: def test_
    • JS/TS: test(, describe(, it(
    • Go: func Test
    • Java/C#: @Test
  • Excludes structural types: type_declaration, struct_item, package_clause, import_declaration

Affected Components

  • Search pipeline: Deduplication now runs before optional merging in search_runner.rs
  • JSON output: New optional is_test field (only serialized when true)
  • Requirement annotation indexing: Cleaner results with duplicates removed and test identification
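
The "only serialized when true" convention can be illustrated with a hand-rolled fragment (a hypothetical sketch; the PR's actual serialization code and field layout are not shown here). Because the field is appended only for test blocks, existing consumers of the JSON schema see no change for non-test results.

```rust
// Build a minimal JSON object for one result block; `is_test` is emitted
// only when true, mirroring the optional-field behavior described above.
fn block_json(file: &str, start: usize, end: usize, is_test: bool) -> String {
    let mut json = format!("{{\"file\":\"{}\",\"lines\":[{},{}]", file, start, end);
    if is_test {
        json.push_str(",\"is_test\":true");
    }
    json.push('}');
    json
}
```

With serde, the same effect is typically achieved with a `skip_serializing_if` attribute on the field.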

Scope Discovery

Related Files (Inferred)

  • probe_code::language module: Contains is_test_file() function used for file-level test detection
  • Test files in src/search/: Integration tests for search functionality
  • CLI tests: Verify --allow-tests --exact --no-merge -o json behavior

Review Notes

Focus areas:

  1. Deduplication algorithm correctness - verify containment logic handles all edge cases
  2. Test detection heuristics - ensure false positives are minimized for structural declarations
  3. Keyword merging behavior - confirm no keywords are lost during deduplication
  4. Performance impact - deduplication runs on every search with results

Potential concerns:

  • Deduplication runs unconditionally even with --no-merge (by design, as it's deduplication not merging)
  • Test detection is heuristic-based and may miss language-specific patterns
  • JSON schema change (new optional field) may affect consumers
Metadata
  • Review Effort: 3 / 5
  • Primary Label: bug

Powered by Visor from Probelabs

Last updated: 2026-04-06T08:40:43.298Z | Triggered by: pr_opened | Commit: 2a897ca

💡 TIP: You can chat with Visor using /visor ask <your question>

@probelabs
Contributor

probelabs bot commented Apr 6, 2026

Security Issues (1)

Severity Location Issue
🟠 Error contract:0
Output schema validation failed: must have required property 'issues'

Architecture Issues (8)
Severity Location Issue
🟠 Error src/search/block_merging.rs:7-119
The deduplicate_contained_blocks() function duplicates keyword merging logic from merge_matched_keywords() (lines 536-561). Both functions implement identical keyword deduplication using HashSet with sorting. This violates DRY principles and creates maintenance burden - any changes to keyword merging logic must be updated in two places.
💡 SuggestionExtract the keyword merging logic into a shared helper function and call it from both deduplicate_contained_blocks() and merge_ranked_blocks(). Create a function like merge_keyword_sets(kw1: Option<&Vec<String>>, kw2: Option<&Vec<String>>) -> Option<Vec<String>> that both functions can reuse.
🔧 Suggested Fix
Extract common logic:

fn merge_keyword_sets(
    kw1: Option<&Vec<String>>,
    kw2: Option<&Vec<String>>,
) -> Option<Vec<String>> {
    let mut keywords = std::collections::HashSet::new();
    if let Some(kw) = kw1 {
        for keyword in kw {
            keywords.insert(keyword.clone());
        }
    }
    if let Some(kw) = kw2 {
        for keyword in kw {
            keywords.insert(keyword.clone());
        }
    }
    if keywords.is_empty() {
        None
    } else {
        let mut keyword_vec: Vec<String> = keywords.into_iter().collect();
        keyword_vec.sort();
        Some(keyword_vec)
    }
}

Then in deduplicate_contained_blocks(), replace lines 68-81 with:
blocks[i].matched_keywords = merge_keyword_sets(
    blocks[i].matched_keywords.as_ref(),
    blocks[j].matched_keywords.as_ref(),
);

🟠 Error src/search/search_output.rs:702-760
The is_test_code_block() function implements language-specific heuristics for detecting test code by pattern matching on code content. This creates a fragile abstraction that will break with new languages, testing frameworks, or coding patterns. The function mixes concerns of file-level detection (is_test_file) with block-level detection, creating an inconsistent interface.
💡 SuggestionRefactor into a more extensible architecture: 1) Create a TestDetection trait with methods for different languages, 2) Implement language-specific detectors that can be registered, 3) Use a strategy pattern instead of hardcoded string matching. This allows adding new languages without modifying core logic and makes the detection rules explicit and testable.
🔧 Suggested Fix
Consider a trait-based approach:

trait TestDetector {
    fn is_test_block(&self, code: &str, node_type: &str) -> bool;
}

struct RustTestDetector;
impl TestDetector for RustTestDetector {
    fn is_test_block(&self, code: &str, _node_type: &str) -> bool {
        // Rust-specific logic, e.g.:
        code.contains("#[test]") || code.contains("#[cfg(test)]")
    }
}

// Then in is_test_code_block:
fn is_test_code_block(code: &str, node_type: &str, language: &str) -> bool {
    match language {
        "rust" => RustTestDetector.is_test_block(code, node_type),
        "python" => PythonTestDetector.is_test_block(code, node_type),
        // ... other languages
        _ => false,
    }
}

🟠 Error src/search/search_output.rs:599-625
The is_test computation is embedded in the JSON serialization logic (format_and_print_json_results). This violates separation of concerns - output formatting code should not contain business logic about what constitutes a test. The is_test field should be computed earlier in the pipeline when SearchResult objects are created.
💡 SuggestionMove the is_test computation to the result building stage (in file_processing.rs where SearchResult objects are created). Add an is_test field to the SearchResult struct and compute it when the result is first created. This makes the data flow clearer and allows is_test to be used in other output formats (not just JSON).
🟡 Warning src/search/block_merging.rs:7-119
The deduplicate_contained_blocks() function implements O(n²) nested loop algorithm with complex bidirectional containment checking. For typical search result sets (often <100 items per file), this complexity is unnecessary. The algorithm also handles edge cases (same start line, different spans) that could be simplified with a clearer approach.
💡 SuggestionSimplify the algorithm by: 1) Sorting blocks by start line, then by span size descending (already done), 2) Single pass through sorted blocks tracking the 'current largest' block, 3) Only checking if subsequent blocks are contained within the current largest. This eliminates the bidirectional check and the 'removed[i] = true' case.
🔧 Suggested Fix
Simplified approach:

for i in 0..len {
    if removed[i] { continue; }

    // Since sorted by (start, -span), blocks[i] is the largest at this start position
    let (outer_start, outer_end) = blocks[i].lines;

    for j in (i + 1)..len {
        if removed[j] { continue; }
        let (inner_start, inner_end) = blocks[j].lines;

        // Stop if we've moved past outer's range
        if inner_start > outer_end { break; }

        // Check containment
        if inner_start >= outer_start && inner_end <= outer_end {
            blocks[i].matched_keywords = merge_keyword_sets(
                blocks[i].matched_keywords.as_ref(),
                blocks[j].matched_keywords.as_ref(),
            );
            removed[j] = true;
        }
    }
}

🟡 Warning src/search/block_merging.rs:44-50
The sorting logic uses a complex comparator that sorts by start line, then by span size descending. This is a special-case optimization for the deduplication algorithm that makes the code harder to understand and maintain. The dependency between sorting order and algorithm correctness is fragile.
💡 SuggestionEither: 1) Document the sorting invariant clearly with comments explaining why this specific order is required, or 2) Refactor to not depend on sorting order for correctness. Option 1 is simpler - add a comment explaining that blocks must be sorted with larger spans first at each start position for the containment check to work correctly.
🟡 Warning src/search/search_output.rs:702-760
The is_test_code_block() function hardcodes the number of lines to check (10) with no justification. This magic number represents a special case - 'check first 10 lines' - without explaining why 10 is the right threshold or making it configurable.
💡 SuggestionEither: 1) Make the line limit a constant with documentation explaining the choice (e.g., const TEST_PATTERN_SCAN_LINES: usize = 10; // Most test decorators appear within first 10 lines), or 2) Make it configurable via a parameter. The constant approach is simpler and sufficient for this use case.
🟡 Warning src/search/search_output.rs:708-717
The is_structural_only check creates a special case for certain node types that are excluded from test detection. This hardcoded list of node types is a special case that doesn't follow a general principle - it's an ad-hoc exclusion list that will need manual updates as new node types are added.
💡 SuggestionReplace the hardcoded list with a more general principle: 'node types that represent pure declarations without executable code'. Document this principle and consider whether the exclusion should be based on node characteristics (e.g., 'has no executable statements') rather than a type whitelist.
🟡 Warning src/search/search_runner.rs:1527-1544
The deduplicate_contained_blocks() call is embedded directly in perform_probe() alongside block merging logic. This mixes concerns of result processing (deduplication) with pipeline orchestration (perform_probe). The deduplication step should be clearly separated as a distinct processing stage.
💡 SuggestionExtract the deduplication step into a separate function that clearly defines the processing pipeline stages. For example, create a function apply_post_ranking_deduplication(results: Vec<SearchResult>) -> Vec<SearchResult> that encapsulates this step. This makes the pipeline structure clearer and allows for easier testing and modification of individual stages.

Quality Issues (16)

Severity Location Issue
🟠 Error src/search/block_merging.rs:7
No input validation for edge cases in SearchResult. The function assumes lines.0 <= lines.1 but doesn't validate this. Invalid line ranges could cause incorrect containment detection.
💡 SuggestionAdd validation at the start of the function to ensure line ranges are valid: assert!(result.lines.0 <= result.lines.1, "Invalid line range"). Handle or log invalid ranges gracefully.
🟠 Error src/search/search_output.rs:702
The is_test_code_block function uses fragile string-based pattern matching on code content. This approach is prone to false positives and false negatives. For example, it would match 'def test_' even in non-test contexts.
💡 SuggestionConsider using more robust detection: 1) Check if the file is a test file first using is_test_file(), 2) Use AST-based detection instead of string matching, 3) Add more specific patterns (e.g., require test imports or test class inheritance). Document the limitations of the current approach.
🟡 Warning src/search/block_merging.rs:7
The deduplicate_contained_blocks function has high cyclomatic complexity with nested loops and multiple conditional branches. The nested loop structure (O(n²)) with inner containment checks makes the logic difficult to follow and maintain.
💡 SuggestionConsider refactoring into smaller helper functions to reduce complexity. Extract the containment check logic and keyword merging into separate functions. Consider using a more efficient data structure or algorithm if performance becomes an issue.
🟡 Warning src/search/block_merging.rs:44
Complex sorting logic with multiple criteria (start line, then span size descending) is embedded in the sort closure. This makes the sorting intent unclear and the code harder to maintain.
💡 SuggestionExtract the sorting logic into a named helper function or use a more explicit approach. Consider adding a comment explaining why larger blocks are sorted first (to ensure containment checks work correctly).
🟡 Warning src/search/block_merging.rs:56
O(n²) nested loop algorithm for containment detection. For each block, it compares against all subsequent blocks. This could be inefficient for files with many search results.
💡 SuggestionConsider using a more efficient algorithm like interval trees or sweep line algorithm for large numbers of blocks. For typical use cases this may be acceptable, but document the performance characteristics.
🟡 Warning src/search/block_merging.rs:67
Duplicate keyword merging logic in two branches (lines 67-76 and 84-93). The same pattern of cloning, checking, adding, and sorting keywords is repeated.
💡 SuggestionExtract the keyword merging logic into a helper function: fn merge_keywords_into(target: &mut SearchResult, source: &SearchResult). This reduces duplication and makes the code more maintainable.
🟡 Warning src/search/search_output.rs:702
Magic number 10 for line limit in is_test_code_block function. The function only checks the first 10 lines of code for test patterns, which is arbitrary and not documented.
💡 SuggestionExtract to a named constant with documentation: const TEST_PATTERN_SCAN_LINES: usize = 10;. Add a comment explaining why this limit was chosen (e.g., test decorators typically appear near the function start).
🟡 Warning src/search/search_output.rs:702
Hardcoded language-specific test patterns without extensibility. Adding support for new languages requires modifying this function with more if/else branches.
💡 SuggestionConsider using a strategy pattern or trait-based approach where each language implementation provides its own test detection logic. This would be more extensible and maintainable.
🟡 Warning src/search/search_output.rs:715
Test pattern detection is incomplete and may miss valid test cases. For example, it only checks first 10 lines, so test patterns appearing later in the code would be missed. Also doesn't handle all test frameworks (e.g., RSpec for Ruby, pytest fixtures).
💡 SuggestionEither scan the entire code block or document the limitation clearly. Consider adding more comprehensive test pattern coverage for common frameworks.
🟡 Warning src/search/block_merging.rs:678
Test helper function make_result uses magic numbers for line ranges (1, 4, 2, 4, 10, 15) without semantic meaning. These numbers are arbitrary and make tests harder to understand.
💡 SuggestionUse named constants or helper functions with meaningful names: const COMMENT_BLOCK_RANGE: (usize, usize) = (1, 4); const FUNCTION_BLOCK_RANGE: (usize, usize) = (2, 4);. This makes test intent clearer.
🟡 Warning src/search/block_merging.rs:695
Tests use magic strings like 'REQ-001', 'REQ-002' for matched keywords without explanation. These appear to be arbitrary requirement IDs.
💡 SuggestionUse named constants or add comments explaining what these represent: const REQUIREMENT_1: &str = 'REQ-001';. Or use more descriptive test data that explains the scenario.
🟡 Warning src/search/block_merging.rs:695
Missing test cases for important edge cases: 1) Empty matched_keywords on both blocks, 2) Three or more blocks with containment relationships, 3) Blocks with same start line but different end lines, 4) Invalid line ranges (start > end).
💡 SuggestionAdd comprehensive edge case tests: test_dedup_empty_keywords_both_blocks(), test_dedup_three_nested_blocks(), test_dedup_same_start_different_ends(), test_dedup_invalid_line_ranges_handled().
🟡 Warning src/search/block_merging.rs:695
No negative test cases for error conditions or invalid inputs. The tests only cover happy paths.
💡 SuggestionAdd negative tests: test_dedup_handles_empty_vector(), test_dedup_handles_single_block(), test_dedup_handles_blocks_with_invalid_ranges(). Verify the function handles edge cases gracefully.
🟡 Warning src/search/search_output.rs:3090
Test suite for is_test_code_block lacks negative test cases. Tests verify positive cases but don't verify that non-test code is correctly rejected.
💡 SuggestionAdd negative tests: test_is_test_code_block_rejects_production_code(), test_is_test_code_block_rejects_false_positives(), test_is_test_code_block_handles_empty_code(). Verify the function doesn't incorrectly flag production code as tests.
🟡 Warning src/search/search_output.rs:3090
Test coverage for is_test_code_block is incomplete. Missing tests for: 1) Code with test patterns but in structural node types, 2) Edge case where test patterns appear after line 10, 3) Mixed language files, 4) Code with test-like strings that aren't actually tests.
💡 SuggestionAdd comprehensive tests covering all language patterns and edge cases. Test that structural node types are excluded even with test patterns. Test that patterns beyond line 10 are not detected.
🟡 Warning src/search/search_runner.rs:1530
The deduplicate_contained_blocks function is called but its return value is not checked for errors. While it returns Vec<SearchResult> (not Result), there's no validation that the deduplication succeeded or produced valid results.
💡 SuggestionAdd validation to ensure the deduplication produced valid results: assert!(deduped.len() <= limited.results.len(), "Deduplication should not increase result count"). Consider logging when deduplication removes results.

Powered by Visor from Probelabs

Last updated: 2026-04-06T08:39:54.150Z | Triggered by: pr_opened | Commit: 2a897ca

💡 TIP: You can chat with Visor using /visor ask <your question>



Development

Successfully merging this pull request may close these issues.

search JSON returns duplicate witness hits and top-level Verifies blocks for exact requirement annotation search

1 participant