fix: deduplicate contained search blocks and add is_test to JSON output #554

Merged
buger merged 1 commit into main from fix/search-dedup-553 on Apr 6, 2026

Conversation

@buger
Collaborator

@buger buger commented Apr 6, 2026

Summary

Fixes #553 — two improvements to probe search --allow-tests --exact --no-merge -o json for requirement-annotation indexing:

  • Deduplicate contained blocks: When the parser emits both a comment node (with attached declaration) and the declaration itself as separate overlapping results, the contained block is now removed. This runs unconditionally regardless of --no-merge, reducing 4 results to 2 in the reported repro case. Matched keywords from removed blocks are preserved on the surviving block.
  • Add is_test field to JSON output: Computed at serialization time from file naming conventions (is_test_file()) and code content patterns (is_test_code_block()). Only present when true (omitted for non-test blocks). Structural declarations (type_declaration, struct_item) are excluded from test detection, letting consumers distinguish top-level Verifies: comments on implementation code from actual test/witness blocks.

Test plan

  • 9 unit tests for deduplicate_contained_blocks() (containment, partial overlap, different files, keyword merging, same-range, edge cases)
  • 7 unit tests for is_test_code_block() (Rust, Python, JS/TS, Go, comment-with-function, type declarations, plain comments)
  • Manual verification with exact repro cases from issue:
    • Repro 1: 2 results instead of 4 (duplicates eliminated)
    • Repro 2: Go type_declaration has no is_test, Python/TS test blocks have "is_test": true
  • All 337 unit tests pass, 11 integration tests pass, 21 CLI tests pass

🤖 Generated with Claude Code

…ut (#553)

- Add deduplicate_contained_blocks() that removes blocks fully contained
  within another block from the same file, preserving matched keywords.
  Runs unconditionally regardless of --no-merge flag.
- Add is_test field to JSON output (only present when true) computed from
  file naming conventions and code content patterns across languages.
- Structural declarations (type_declaration, struct_item) are excluded
  from test detection to distinguish implementation-level Verifies
  comments from actual test/witness blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@buger buger merged commit cff9cff into main Apr 6, 2026
15 checks passed
@probelabs
Contributor

probelabs bot commented Apr 6, 2026

Overview

This PR fixes issue #553 by introducing two improvements to probe search --allow-tests --exact --no-merge -o json for requirement-annotation indexing:

  1. Deduplicates contained search blocks: Removes overlapping results where one block fully contains another (e.g., comment node with attached declaration, plus the declaration itself). This runs unconditionally regardless of --no-merge flag, reducing duplicate results from 4 to 2 in the reported repro case. Matched keywords from removed blocks are preserved on the surviving block.

  2. Adds is_test field to JSON output: Computed at serialization time from file naming conventions (is_test_file()) and code content patterns (is_test_code_block()). Only present when true (omitted for non-test blocks). Structural declarations (type_declaration, struct_item) are excluded from test detection.

Files Changed

  • src/search/block_merging.rs (+253 lines): New deduplicate_contained_blocks() function with 9 unit tests
  • src/search/search_output.rs (+170, -14 lines): Added is_test field to JSON output and is_test_code_block() helper with 7 unit tests
  • src/search/search_runner.rs (+17 lines): Integration point calling deduplication before optional merging

Architecture & Impact

Component Flow

graph TD
    A[Search Results] --> B[deduplicate_contained_blocks]
    B --> C[Optional Block Merging]
    C --> D[JSON Serialization]
    D --> E[is_test_file + is_test_code_block]
    E --> F[JSON Output with is_test field]
    
    B --> B1[Group by file]
    B1 --> B2[Sort by start line, span size]
    B2 --> B3[Mark contained blocks]
    B3 --> B4[Merge keywords]
    B4 --> B5[Return deduped results]

Key Technical Changes

Deduplication Logic (block_merging.rs):

  • Groups results by file path
  • Sorts by start line, then by span size descending (larger blocks first)
  • Detects full containment: inner_start >= outer_start && inner_end <= outer_end
  • Merges matched_keywords from contained blocks into container
  • Handles edge cases: same range, partial overlap, different files
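
The steps above can be sketched as a minimal standalone pass. This is an illustrative reconstruction from the bullet points, not the PR's actual code: the `Block` struct stands in for the crate's `SearchResult`, which has more fields.

```rust
use std::collections::HashMap;

// Minimal stand-in for the crate's SearchResult; only the fields the
// deduplication pass needs are modeled here.
#[derive(Debug, Clone)]
struct Block {
    file: String,
    lines: (usize, usize), // inclusive (start, end) line range
    matched_keywords: Vec<String>,
}

fn deduplicate_contained_blocks(results: Vec<Block>) -> Vec<Block> {
    // Group results by file path: containment is only meaningful within one file.
    let mut by_file: HashMap<String, Vec<Block>> = HashMap::new();
    for r in results {
        by_file.entry(r.file.clone()).or_default().push(r);
    }

    let mut out = Vec::new();
    for (_, mut blocks) in by_file {
        // Sort by start line, then span size descending, so the largest
        // block at each start position comes first.
        blocks.sort_by(|a, b| {
            a.lines.0.cmp(&b.lines.0)
                .then((b.lines.1 - b.lines.0).cmp(&(a.lines.1 - a.lines.0)))
        });
        let mut removed = vec![false; blocks.len()];
        for i in 0..blocks.len() {
            if removed[i] { continue; }
            let (outer_start, outer_end) = blocks[i].lines;
            for j in (i + 1)..blocks.len() {
                if removed[j] { continue; }
                let (inner_start, inner_end) = blocks[j].lines;
                // Full containment: the inner range lies within the outer range.
                if inner_start >= outer_start && inner_end <= outer_end {
                    // Preserve matched keywords from the block being removed.
                    let kws = blocks[j].matched_keywords.clone();
                    for kw in kws {
                        if !blocks[i].matched_keywords.contains(&kw) {
                            blocks[i].matched_keywords.push(kw);
                        }
                    }
                    removed[j] = true;
                }
            }
        }
        for (i, b) in blocks.into_iter().enumerate() {
            if !removed[i] { out.push(b); }
        }
    }
    out
}
```

On the repro from #553, a comment-plus-declaration block spanning lines 1-10 would absorb a declaration-only block at 2-5, keeping both blocks' requirement keywords on the survivor.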

Test Detection (search_output.rs):

  • File-level: Uses existing is_test_file() from probe_code::language
  • Code-level: is_test_code_block() checks first 10 lines for patterns:
    • Rust: #[test], #[cfg(test)]
    • Python: def test_
    • JS/TS: test(, describe(, it(
    • Go: func Test
    • Java/C#: @Test
  • Excludes structural types: type_declaration, struct_item, package_clause, import_declaration

Affected Components

  • Search pipeline: Deduplication now runs before optional merging in search_runner.rs
  • JSON output: New optional is_test field (only serialized when true)
  • Requirement annotation indexing: Cleaner results with duplicates removed and test identification
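
The "only serialized when true" convention can be illustrated with a hand-rolled fragment (a hypothetical sketch; the PR's actual serialization code and field layout are not shown here). Because the field is appended only for test blocks, existing consumers of the JSON schema see no change for non-test results.

```rust
// Build a minimal JSON object for one result block; `is_test` is emitted
// only when true, mirroring the optional-field behavior described above.
fn block_json(file: &str, start: usize, end: usize, is_test: bool) -> String {
    let mut json = format!("{{\"file\":\"{}\",\"lines\":[{},{}]", file, start, end);
    if is_test {
        json.push_str(",\"is_test\":true");
    }
    json.push('}');
    json
}
```

With serde, the same effect is typically achieved with a `skip_serializing_if` attribute on the field.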

Scope Discovery

Related Files (Inferred)

  • probe_code::language module: Contains is_test_file() function used for file-level test detection
  • Test files in src/search/: Integration tests for search functionality
  • CLI tests: Verify --allow-tests --exact --no-merge -o json behavior

Review Notes

Focus areas:

  1. Deduplication algorithm correctness - verify containment logic handles all edge cases
  2. Test detection heuristics - ensure false positives are minimized for structural declarations
  3. Keyword merging behavior - confirm no keywords are lost during deduplication
  4. Performance impact - deduplication runs on every search with results

Potential concerns:

  • Deduplication runs unconditionally even with --no-merge (by design, as it's deduplication not merging)
  • Test detection is heuristic-based and may miss language-specific patterns
  • JSON schema change (new optional field) may affect consumers
Metadata
  • Review Effort: 3 / 5
  • Primary Label: bug

Powered by Visor from Probelabs

Last updated: 2026-04-06T08:40:43.298Z | Triggered by: pr_opened | Commit: 2a897ca

💡 TIP: You can chat with Visor using /visor ask <your question>

@probelabs
Contributor

probelabs bot commented Apr 6, 2026

Security Issues (1)

Severity Location Issue
🟠 Error contract:0
Output schema validation failed: must have required property 'issues'

Architecture Issues (8)
Severity Location Issue
🟠 Error src/search/block_merging.rs:7-119
The deduplicate_contained_blocks() function duplicates keyword merging logic from merge_matched_keywords() (lines 536-561). Both functions implement identical keyword deduplication using HashSet with sorting. This violates DRY principles and creates maintenance burden - any changes to keyword merging logic must be updated in two places.
💡 SuggestionExtract the keyword merging logic into a shared helper function and call it from both deduplicate_contained_blocks() and merge_ranked_blocks(). Create a function like merge_keyword_sets(kw1: Option<&Vec<String>>, kw2: Option<&Vec<String>>) -> Option<Vec<String>> that both functions can reuse.
🔧 Suggested Fix
Extract common logic:

fn merge_keyword_sets(
    kw1: Option<&Vec<String>>,
    kw2: Option<&Vec<String>>,
) -> Option<Vec<String>> {
    let mut keywords = std::collections::HashSet::new();
    if let Some(kw) = kw1 {
        for keyword in kw {
            keywords.insert(keyword.clone());
        }
    }
    if let Some(kw) = kw2 {
        for keyword in kw {
            keywords.insert(keyword.clone());
        }
    }
    if keywords.is_empty() {
        None
    } else {
        let mut keyword_vec: Vec<String> = keywords.into_iter().collect();
        keyword_vec.sort();
        Some(keyword_vec)
    }
}

Then in deduplicate_contained_blocks(), replace lines 68-81 with:
blocks[i].matched_keywords = merge_keyword_sets(
    blocks[i].matched_keywords.as_ref(),
    blocks[j].matched_keywords.as_ref(),
);

🟠 Error src/search/search_output.rs:702-760
The is_test_code_block() function implements language-specific heuristics for detecting test code by pattern matching on code content. This creates a fragile abstraction that will break with new languages, testing frameworks, or coding patterns. The function mixes concerns of file-level detection (is_test_file) with block-level detection, creating an inconsistent interface.
💡 SuggestionRefactor into a more extensible architecture: 1) Create a TestDetection trait with methods for different languages, 2) Implement language-specific detectors that can be registered, 3) Use a strategy pattern instead of hardcoded string matching. This allows adding new languages without modifying core logic and makes the detection rules explicit and testable.
🔧 Suggested Fix
Consider a trait-based approach:

trait TestDetector {
    fn is_test_block(&self, code: &str, node_type: &str) -> bool;
}

struct RustTestDetector;
impl TestDetector for RustTestDetector {
    fn is_test_block(&self, code: &str, _node_type: &str) -> bool {
        // Rust-specific logic, e.g.:
        code.contains("#[test]") || code.contains("#[cfg(test)]")
    }
}

// Then in is_test_code_block:
fn is_test_code_block(code: &str, node_type: &str, language: &str) -> bool {
    match language {
        "rust" => RustTestDetector.is_test_block(code, node_type),
        "python" => PythonTestDetector.is_test_block(code, node_type),
        // ... other languages
        _ => false,
    }
}

🟠 Error src/search/search_output.rs:599-625
The is_test computation is embedded in the JSON serialization logic (format_and_print_json_results). This violates separation of concerns - output formatting code should not contain business logic about what constitutes a test. The is_test field should be computed earlier in the pipeline when SearchResult objects are created.
💡 SuggestionMove the is_test computation to the result building stage (in file_processing.rs where SearchResult objects are created). Add an is_test field to the SearchResult struct and compute it when the result is first created. This makes the data flow clearer and allows is_test to be used in other output formats (not just JSON).
🟡 Warning src/search/block_merging.rs:7-119
The deduplicate_contained_blocks() function implements O(n²) nested loop algorithm with complex bidirectional containment checking. For typical search result sets (often <100 items per file), this complexity is unnecessary. The algorithm also handles edge cases (same start line, different spans) that could be simplified with a clearer approach.
💡 SuggestionSimplify the algorithm by: 1) Sorting blocks by start line, then by span size descending (already done), 2) Single pass through sorted blocks tracking the 'current largest' block, 3) Only checking if subsequent blocks are contained within the current largest. This eliminates the bidirectional check and the 'removed[i] = true' case.
🔧 Suggested Fix
Simplified approach:

for i in 0..len {
    if removed[i] { continue; }

    // Since sorted by (start, -span), blocks[i] is the largest at this start position
    let (outer_start, outer_end) = blocks[i].lines;

    for j in (i + 1)..len {
        if removed[j] { continue; }
        let (inner_start, inner_end) = blocks[j].lines;

        // Stop if we've moved past outer's range
        if inner_start > outer_end { break; }

        // Check containment
        if inner_start >= outer_start && inner_end <= outer_end {
            blocks[i].matched_keywords = merge_keyword_sets(
                blocks[i].matched_keywords.as_ref(),
                blocks[j].matched_keywords.as_ref(),
            );
            removed[j] = true;
        }
    }
}

🟡 Warning src/search/block_merging.rs:44-50
The sorting logic uses a complex comparator that sorts by start line, then by span size descending. This is a special-case optimization for the deduplication algorithm that makes the code harder to understand and maintain. The dependency between sorting order and algorithm correctness is fragile.
💡 SuggestionEither: 1) Document the sorting invariant clearly with comments explaining why this specific order is required, or 2) Refactor to not depend on sorting order for correctness. Option 1 is simpler - add a comment explaining that blocks must be sorted with larger spans first at each start position for the containment check to work correctly.
🟡 Warning src/search/search_output.rs:702-760
The is_test_code_block() function hardcodes the number of lines to check (10) with no justification. This magic number represents a special case - 'check first 10 lines' - without explaining why 10 is the right threshold or making it configurable.
💡 SuggestionEither: 1) Make the line limit a constant with documentation explaining the choice (e.g., const TEST_PATTERN_SCAN_LINES: usize = 10; // Most test decorators appear within first 10 lines), or 2) Make it configurable via a parameter. The constant approach is simpler and sufficient for this use case.
🟡 Warning src/search/search_output.rs:708-717
The is_structural_only check creates a special case for certain node types that are excluded from test detection. This hardcoded list of node types is a special case that doesn't follow a general principle - it's an ad-hoc exclusion list that will need manual updates as new node types are added.
💡 SuggestionReplace the hardcoded list with a more general principle: 'node types that represent pure declarations without executable code'. Document this principle and consider whether the exclusion should be based on node characteristics (e.g., 'has no executable statements') rather than a type whitelist.
🟡 Warning src/search/search_runner.rs:1527-1544
The deduplicate_contained_blocks() call is embedded directly in perform_probe() alongside block merging logic. This mixes concerns of result processing (deduplication) with pipeline orchestration (perform_probe). The deduplication step should be clearly separated as a distinct processing stage.
💡 SuggestionExtract the deduplication step into a separate function that clearly defines the processing pipeline stages. For example, create a function apply_post_ranking_deduplication(results: Vec<SearchResult>) -> Vec<SearchResult> that encapsulates this step. This makes the pipeline structure clearer and allows for easier testing and modification of individual stages.

Quality Issues (16)

Severity Location Issue
🟠 Error src/search/block_merging.rs:7
No input validation for edge cases in SearchResult. The function assumes lines.0 <= lines.1 but doesn't validate this. Invalid line ranges could cause incorrect containment detection.
💡 SuggestionAdd validation at the start of the function to ensure line ranges are valid: assert!(result.lines.0 <= result.lines.1, "Invalid line range"). Handle or log invalid ranges gracefully.
🟠 Error src/search/search_output.rs:702
The is_test_code_block function uses fragile string-based pattern matching on code content. This approach is prone to false positives and false negatives. For example, it would match 'def test_' even in non-test contexts.
💡 SuggestionConsider using more robust detection: 1) Check if the file is a test file first using is_test_file(), 2) Use AST-based detection instead of string matching, 3) Add more specific patterns (e.g., require test imports or test class inheritance). Document the limitations of the current approach.
🟡 Warning src/search/block_merging.rs:7
The deduplicate_contained_blocks function has high cyclomatic complexity with nested loops and multiple conditional branches. The nested loop structure (O(n²)) with inner containment checks makes the logic difficult to follow and maintain.
💡 SuggestionConsider refactoring into smaller helper functions to reduce complexity. Extract the containment check logic and keyword merging into separate functions. Consider using a more efficient data structure or algorithm if performance becomes an issue.
🟡 Warning src/search/block_merging.rs:44
Complex sorting logic with multiple criteria (start line, then span size descending) is embedded in the sort closure. This makes the sorting intent unclear and the code harder to maintain.
💡 SuggestionExtract the sorting logic into a named helper function or use a more explicit approach. Consider adding a comment explaining why larger blocks are sorted first (to ensure containment checks work correctly).
🟡 Warning src/search/block_merging.rs:56
O(n²) nested loop algorithm for containment detection. For each block, it compares against all subsequent blocks. This could be inefficient for files with many search results.
💡 SuggestionConsider using a more efficient algorithm like interval trees or sweep line algorithm for large numbers of blocks. For typical use cases this may be acceptable, but document the performance characteristics.
🟡 Warning src/search/block_merging.rs:67
Duplicate keyword merging logic in two branches (lines 67-76 and 84-93). The same pattern of cloning, checking, adding, and sorting keywords is repeated.
💡 SuggestionExtract the keyword merging logic into a helper function: fn merge_keywords_into(target: &mut SearchResult, source: &SearchResult). This reduces duplication and makes the code more maintainable.
🟡 Warning src/search/search_output.rs:702
Magic number 10 for line limit in is_test_code_block function. The function only checks the first 10 lines of code for test patterns, which is arbitrary and not documented.
💡 SuggestionExtract to a named constant with documentation: const TEST_PATTERN_SCAN_LINES: usize = 10;. Add a comment explaining why this limit was chosen (e.g., test decorators typically appear near the function start).
🟡 Warning src/search/search_output.rs:702
Hardcoded language-specific test patterns without extensibility. Adding support for new languages requires modifying this function with more if/else branches.
💡 SuggestionConsider using a strategy pattern or trait-based approach where each language implementation provides its own test detection logic. This would be more extensible and maintainable.
🟡 Warning src/search/search_output.rs:715
Test pattern detection is incomplete and may miss valid test cases. For example, it only checks first 10 lines, so test patterns appearing later in the code would be missed. Also doesn't handle all test frameworks (e.g., RSpec for Ruby, pytest fixtures).
💡 SuggestionEither scan the entire code block or document the limitation clearly. Consider adding more comprehensive test pattern coverage for common frameworks.
🟡 Warning src/search/block_merging.rs:678
Test helper function make_result uses magic numbers for line ranges (1, 4, 2, 4, 10, 15) without semantic meaning. These numbers are arbitrary and make tests harder to understand.
💡 SuggestionUse named constants or helper functions with meaningful names: const COMMENT_BLOCK_RANGE: (usize, usize) = (1, 4); const FUNCTION_BLOCK_RANGE: (usize, usize) = (2, 4);. This makes test intent clearer.
🟡 Warning src/search/block_merging.rs:695
Tests use magic strings like 'REQ-001', 'REQ-002' for matched keywords without explanation. These appear to be arbitrary requirement IDs.
💡 SuggestionUse named constants or add comments explaining what these represent: const REQUIREMENT_1: &str = 'REQ-001';. Or use more descriptive test data that explains the scenario.
🟡 Warning src/search/block_merging.rs:695
Missing test cases for important edge cases: 1) Empty matched_keywords on both blocks, 2) Three or more blocks with containment relationships, 3) Blocks with same start line but different end lines, 4) Invalid line ranges (start > end).
💡 SuggestionAdd comprehensive edge case tests: test_dedup_empty_keywords_both_blocks(), test_dedup_three_nested_blocks(), test_dedup_same_start_different_ends(), test_dedup_invalid_line_ranges_handled().
🟡 Warning src/search/block_merging.rs:695
No negative test cases for error conditions or invalid inputs. The tests only cover happy paths.
💡 SuggestionAdd negative tests: test_dedup_handles_empty_vector(), test_dedup_handles_single_block(), test_dedup_handles_blocks_with_invalid_ranges(). Verify the function handles edge cases gracefully.
🟡 Warning src/search/search_output.rs:3090
Test suite for is_test_code_block lacks negative test cases. Tests verify positive cases but don't verify that non-test code is correctly rejected.
💡 SuggestionAdd negative tests: test_is_test_code_block_rejects_production_code(), test_is_test_code_block_rejects_false_positives(), test_is_test_code_block_handles_empty_code(). Verify the function doesn't incorrectly flag production code as tests.
🟡 Warning src/search/search_output.rs:3090
Test coverage for is_test_code_block is incomplete. Missing tests for: 1) Code with test patterns but in structural node types, 2) Edge case where test patterns appear after line 10, 3) Mixed language files, 4) Code with test-like strings that aren't actually tests.
💡 SuggestionAdd comprehensive tests covering all language patterns and edge cases. Test that structural node types are excluded even with test patterns. Test that patterns beyond line 10 are not detected.
🟡 Warning src/search/search_runner.rs:1530
The deduplicate_contained_blocks function is called but its return value is not checked for errors. While it returns Vec<SearchResult> (not Result), there's no validation that the deduplication succeeded or produced valid results.
💡 SuggestionAdd validation to ensure the deduplication produced valid results: assert!(deduped.len() <= limited.results.len(), "Deduplication should not increase result count"). Consider logging when deduplication removes results.

Powered by Visor from Probelabs

Last updated: 2026-04-06T08:39:54.150Z | Triggered by: pr_opened | Commit: 2a897ca

💡 TIP: You can chat with Visor using /visor ask <your question>



Development

Successfully merging this pull request may close these issues.

search JSON returns duplicate witness hits and top-level Verifies blocks for exact requirement annotation search

1 participant