
refactor(extraction): split oversized modules and wire up CLI pipeline#135

Merged
mergify[bot] merged 22 commits into main from fix_dist_issues on Feb 18, 2026

Conversation


@unclesp1d3r unclesp1d3r commented Feb 17, 2026

Summary

  • CI: Use just recipes in CI workflows instead of raw cargo commands; fix the cargo-dist invocation (run `dist plan` via mise instead of the nonexistent `cargo dist check` subcommand)
  • Extraction refactor: Split 6 oversized files (all >500 lines) in src/extraction/ into focused module directories -- total lines unchanged, all non-test files now under 500 lines
  • CLI: Wire up the extraction pipeline in main.rs with format selection (--format table|json|yara) and minimum length filtering (--min-length)
  • Design debt cleanup: Resolve 8 pre-existing design issues surfaced during bot review

Details

CI changes

  • ci.yml: test, test-cross-platform, and coverage jobs now use just test-ci, just build-release, just coverage
  • security.yml: uses just deny, just outdated, just dist-check (fixes cargo-dist not found)
  • docs.yml: uses just docs-build for combined mdbook + rustdoc build
  • justfile: updated recipes with --all-features, added an outdated recipe, fixed dist-check to use dist plan (see the sketch below)
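
For reference, the fix amounts to invoking cargo-dist's `dist` binary directly through mise; the recipe wording below is illustrative, not the committed justfile:

```bash
# cargo-dist 0.30.3 ships no `check` subcommand, so the recipe runs `dist plan`
# through mise, which provides the pinned `dist` binary on PATH.
mise exec -- dist plan
```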

Extraction module splits

| Original file | Lines | Split into |
| --- | --- | --- |
| extraction/mod.rs | 1,576 | mod.rs (154) + traits.rs + helpers.rs + basic_extractor.rs + tests.rs |
| extraction/filters.rs | 702 | filters/mod.rs + filters/implementations.rs + filters/tests.rs |
| extraction/ascii.rs | 832 | ascii/mod.rs + ascii/extraction.rs + ascii/tests.rs |
| extraction/utf16.rs | 1,273 | utf16/mod.rs + utf16/config.rs + utf16/validation.rs + utf16/confidence.rs + utf16/extraction.rs + utf16/tests.rs |
| extraction/dedup.rs | 838 | dedup/mod.rs + dedup/scoring.rs + dedup/tests.rs |
| extraction/pe_resources.rs | 1,449 | pe_resources/mod.rs + pe_resources/detection.rs + pe_resources/version_info.rs + pe_resources/string_tables.rs + pe_resources/manifests.rs + pe_resources/tests.rs |
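
Each split keeps the directory's mod.rs as the public surface so downstream imports are unchanged. A hypothetical skeleton for the utf16 split (item names taken from the table above; exact placement is illustrative):

```rust
// src/extraction/utf16/mod.rs -- hypothetical skeleton, not the committed code.
mod confidence; // scoring heuristics
mod config;     // Utf16ExtractionConfig
mod extraction; // core extraction functions
mod validation; // sequence validation, printability checks

#[cfg(test)]
mod tests;

pub use config::Utf16ExtractionConfig;
pub use extraction::extract_utf16_strings;

/// Byte-order handling for UTF-16 decoding (variant names illustrative).
pub enum ByteOrder {
    Le,
    Be,
    Auto,
}
```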

CLI pipeline

  • main.rs: reads the binary, detects the format, parses the container, extracts strings, and formats output (example invocation below)
  • 4 new integration tests in tests/integration_cli.rs
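
A hypothetical invocation of the wired-up pipeline (fixture path for illustration; the flags are the ones added in this PR):

```bash
# Table output with the default minimum length
stringy ./tests/fixtures/test_binary_elf

# JSON output, keeping only strings of at least 8 characters
stringy ./tests/fixtures/test_binary_elf --format json --min-length 8
```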

Design debt cleanup (8 items)

  1. Remove dual encoding fields: Deleted encodings from ExtractionConfig, kept only enabled_encodings. Eliminated confusing OR-checks in basic_extractor.rs.
  2. Extract shared logic: Factored duplicated section-sorting/filtering/symbol-inclusion pipeline from extract() and extract_canonical() into collect_all_strings() helper.
  3. Fix O(n^2) dedup in Auto mode: Replaced strings.iter_mut().find() with a HashMap-based O(1) lookup in utf16/mod.rs (see the sketch after this list).
  4. Remove unused parameter: Removed _preserve_all_occurrences from deduplicate() signature, ExtractionConfig, and all callsites.
  5. Remove broken byte order heuristic: check_byte_order_consistency() operated on decoded u16 values (identical regardless of byte order), making LE always score 1.0 and BE always 0.0. Removed and redistributed weight.
  6. Fix char_count semantics: Changed char_count += 1 to += consumed_units in UTF-16 extraction so min/max length checks count code units, not scalar values.
  7. Replace Result<_, ()>: Changed 3 UTF-16 decode functions to return StringyError::ParseError, removed #[allow(clippy::result_unit_err)].
  8. Add surrogate noncharacter validation: Reject U+xFFFE/U+xFFFF in supplementary planes (planes 1-16), matching existing BMP noncharacter check.
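
For item 3, a minimal sketch of the HashMap-based lookup, assuming FoundString carries offset, encoding, and text fields and that Encoding is a small Copy + Hash + Eq enum (as the review thread suggests):

```rust
use std::collections::HashMap;

// Map (offset, encoding, text) -> index into `strings`, so each candidate is
// checked in O(1) instead of rescanning the whole vector per insertion.
let mut seen: HashMap<(u64, Encoding, String), usize> = HashMap::new();
let mut add_with_dedup = |strings: &mut Vec<FoundString>, s: FoundString| {
    let key = (s.offset, s.encoding, s.text.clone());
    match seen.get(&key) {
        // Keep the existing entry when it is at least as confident.
        Some(&i) if strings[i].confidence >= s.confidence => {}
        // Otherwise replace it with the higher-confidence duplicate.
        Some(&i) => strings[i] = s,
        None => {
            seen.insert(key, strings.len());
            strings.push(s);
        }
    }
};
```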

Test plan

  • just ci-check passes locally (478 tests, zero clippy warnings)
  • CI passes on all platforms (ubuntu, macos, windows)
  • CodeRabbit/Copilot review feedback addressed
  • Pre-existing design debt resolved (8 items)

Generated with Claude Code

Centralize CI commands in justfile as single source of truth:
- Replace raw cargo commands with just recipes in test,
  test-cross-platform, coverage, security, and docs workflows
- Fix cargo-dist issue in security workflow by using `dist plan`
  via mise exec instead of `cargo dist check` (subcommand not found)
- Add --all-features to test-ci, build-release, and coverage recipes
- Add outdated recipe for cargo-outdated checks
- Fix dist-check recipe to use `dist plan` (check subcommand
  does not exist in cargo-dist 0.30.3)

Quality and MSRV jobs unchanged (use dtolnay Rust, not mise).

Signed-off-by: Kevin Melton <kevin@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Move the ~400-line `impl StringExtractor for BasicExtractor` block from
mod.rs into its own file, reducing mod.rs from 1038 to 630 lines. The
new basic_extractor.rs (421 lines) contains the core extraction engine
that orchestrates ASCII, UTF-8, and UTF-16 string extraction across
binary sections.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert the 702-line filters.rs into a module directory with three
focused files:
- mod.rs (193 lines): FilterContext, NoiseFilter trait, CompositeNoiseFilter
- implementations.rs (394 lines): all 6 filter structs and NoiseFilter impls
- tests.rs (131 lines): all test code

All public re-exports remain unchanged so the API is not affected.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert src/extraction/ascii.rs (832 lines) into a module directory
with three focused files:

- mod.rs (147 lines): AsciiExtractionConfig, is_printable_ascii(), re-exports
- extraction.rs (316 lines): extract_ascii_strings(), extract_from_section()
- tests.rs (378 lines): all unit tests

All files are under the 500-line limit. All 473 tests pass.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert the 1,273-line src/extraction/utf16.rs into a module directory
with six focused files, each under 500 lines:

- mod.rs (122 lines): ByteOrder enum, extract_utf16_strings orchestrator, re-exports
- config.rs (97 lines): Utf16ExtractionConfig struct and constructors
- validation.rs (281 lines): UTF-16 sequence validation, printability checks, null patterns
- confidence.rs (201 lines): Confidence scoring heuristics (ASCII ratio, printable ratio, etc.)
- extraction.rs (380 lines): Core extraction functions and section-aware extraction
- tests.rs (238 lines): All unit tests

All public exports remain identical. No functional changes.

Signed-off-by: Kyle Melton <kyle@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert the 838-line dedup.rs into a module directory with three files:
- mod.rs (237 lines): CanonicalString, StringOccurrence structs, deduplicate(),
  merge_tags(), found_string_to_occurrence(), CanonicalString::to_found_string()
- scoring.rs (56 lines): calculate_combined_score()
- tests.rs (555 lines): All test code

Public API and re-exports remain unchanged.

Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert the largest file (1,449 lines) into a module directory with
six focused files, each under 500 lines:

- mod.rs (230): constants, entry points, orchestrator, decode_utf16le
- detection.rs (208): detect_version_info, detect_string_tables, detect_manifests
- version_info.rs (76): extract_version_info_strings
- string_tables.rs (200): parse_string_table_block, extract_string_table_strings
- manifests.rs (201): detect_manifest_encoding, decode_manifest, extract_manifest_strings
- tests.rs (594): all test code

All 473 tests pass. Public API unchanged.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Document the CI workflow patterns (dtolnay vs mise, just recipes as
single source of truth) and Mergify behavior in AGENTS.md. Add the
extraction refactor and CLI implementation plan for project history.

Signed-off-by: Keith Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copilot AI review requested due to automatic review settings February 17, 2026 15:11

coderabbitai bot commented Feb 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This pull request implements a comprehensive refactor of the extraction module architecture, restructuring flat extraction modules (ASCII, UTF-16, PE resources, deduplication) into modular subdirectories with trait-based abstractions. It introduces a BasicExtractor implementation, noise filtering framework, semantic enrichment helpers, and CLI wiring for binary analysis with multiple output formats (Table, JSON, YARA), complemented by CI/workflow automation updates and integration test coverage.

Changes

  • CI & Workflow Automation (.github/workflows/ci.yml, .github/workflows/docs.yml, .github/workflows/security.yml, justfile): Unified CI commands to use just recipes (e.g., just test-ci, just build-release, just coverage) as the single source of truth; added --all-features flags to test/build targets and a new outdated target.
  • Documentation & Planning (AGENTS.md, docs/plans/2026-02-16-extraction-refactor-and-cli.md): Added CI Architecture documentation describing justfile-driven workflows and merge protections; comprehensive multi-phase refactor plan detailing module modularization (Phase 1), cross-cutting extraction (Phase 2), and CLI wiring (Phase 3) with commit sequences and verification steps.
  • Extraction Traits & Architecture (src/extraction/traits.rs, src/extraction/mod.rs): Introduced the StringExtractor trait with extract(), extract_from_section(), and extract_canonical() methods; ExtractionConfig with validation; exposed traits via public re-export; reorganized the top-level module with new traits and basic_extractor submodules.
  • ASCII Extraction Modularization (src/extraction/ascii/mod.rs, src/extraction/ascii/extraction.rs, src/extraction/ascii/tests.rs): Restructured the flat ascii.rs (832 lines removed) into a modular hierarchy with a configuration struct, extraction functions, section-aware processing, a printable-ASCII utility, and comprehensive unit/edge-case tests (380 lines).
  • UTF-16 Extraction Modularization (src/extraction/utf16/mod.rs, src/extraction/utf16/config.rs, src/extraction/utf16/extraction.rs, src/extraction/utf16/validation.rs, src/extraction/utf16/confidence.rs, src/extraction/utf16/tests.rs): Replaced the flat utf16.rs (1,273 lines removed) with a modular architecture: ByteOrder enum, configuration, extraction with section metadata, UTF-16 validation/surrogate handling, confidence scoring heuristics, and comprehensive tests (238 lines) covering LE/BE/Auto detection and alignment scanning.
  • Deduplication Modularization (src/extraction/dedup/mod.rs, src/extraction/dedup/scoring.rs, src/extraction/dedup/tests/mod.rs, src/extraction/dedup/tests/basic.rs, src/extraction/dedup/tests/scoring.rs): Restructured the flat dedup.rs (838 lines removed) into a modular structure with CanonicalString/StringOccurrence types, deduplication logic with threshold-based grouping, a scoring function (base + bonuses for occurrences/cross-section/multi-source/confidence), and extensive test coverage (520+ lines) for edge cases, bonuses, and threshold behavior.
  • Noise Filtering Framework (src/extraction/filters/mod.rs, src/extraction/filters/implementations.rs, src/extraction/filters/tests.rs): Introduced the FilterContext struct carrying section metadata, the NoiseFilter trait with calculate_confidence(), and CompositeNoiseFilter aggregating six sub-filters (Entropy, CharDistribution, Linguistic, Length, Repetition, Context) with configurable weights and enabling flags; includes 131 lines of filter integration tests.
  • PE Resource Extraction Modularization (src/extraction/pe_resources/mod.rs, src/extraction/pe_resources/detection.rs, src/extraction/pe_resources/manifests.rs, src/extraction/pe_resources/string_tables.rs, src/extraction/pe_resources/version_info.rs, src/extraction/pe_resources/tests/...): Replaced the flat pe_resources.rs (1,449 lines removed) with a modular structure: resource enumeration (Phase 1 detection), string extraction (Phase 2) for VERSIONINFO/STRINGTABLE/MANIFEST via pelite, manifest encoding detection, UTF-16LE decoding, and a comprehensive test hierarchy (640+ lines) with fixture-driven detection/extraction validation.
  • BasicExtractor Implementation (src/extraction/basic_extractor.rs, src/extraction/tests.rs): Implemented the StringExtractor trait with multi-encoding orchestration (ASCII/UTF-8/UTF-16LE/UTF-16BE per section), section prioritization, symbol inclusion (ImportName/ExportName), optional noise filtering/deduplication, semantic enrichment, and canonicalization support; includes 475 lines of integration tests validating extraction across encodings, filtering, symbols, and boundaries.
  • Semantic Enrichment & Helpers (src/extraction/helpers.rs): Added a semantic enrichment function integrating SemanticClassifier and SymbolDemangler with StringContext; UTF-8/ASCII extraction helpers with min/max length filtering and printable text validation; supports noise-filter based confidence scoring (273 lines).
  • CLI Implementation & Integration (src/main.rs, tests/integration_cli.rs): Implemented the CLI with output format selection (Table/Json/Yara), min_length configuration, extraction pipeline orchestration via BasicExtractor, binary format detection, and output formatting; added four CLI integration tests validating file handling, JSON output validity, error cases, and min_length filtering behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: This refactor spans 50+ files with heterogeneous changes: comprehensive module restructuring (ASCII/UTF-16/PE resources/dedup), trait-based architecture design, complex extraction pipelines with multi-encoding support and configurable filtering, new noise filtering framework with six sub-filters and scoring heuristics, PE resource modularization requiring Pelite integration understanding, and CLI wiring. The density of interconnected logic—section-aware metadata population, confidence computation across multiple heuristics, threshold-based deduplication with bonus scoring, surrogate-pair handling in UTF-16—requires careful verification of correctness, false-positive reduction in classification, and proper error handling in parsing/decoding paths.

Poem

🧬 Strings extracted from binaries deep,
Modules modularized, secrets to keep,
UTF-16 dances, ASCII in line,
Traits and filters make extraction shine,
From PE resources to CLI—one design.

🚥 Pre-merge checks | ✅ 6 | ❌ 2

❌ Failed checks (2 warnings)

  • Error Handling (⚠️ Warning): The PR violates error handling conventions by returning Vec instead of Result for PE resource extraction functions, silently suppressing parsing errors from pelite. Resolution: Convert all resource extraction functions to return Result<Vec, StringyError> and replace error suppression patterns with proper ? operator propagation.
  • Ascii Only (⚠️ Warning): The pull request introduces Unicode smart quotes (U+201D, U+201C) and a construction emoji (🚧), violating the ASCII-only requirement across multiple source files and documentation. Resolution: Replace smart quotes with ASCII equivalents ("), remove the emoji, and enforce ASCII-only strings via the project linter before merge.
✅ Passed checks (6 passed)
  • Title check (✅ Passed): Title follows the Conventional Commits format with a valid type (refactor) and scope (extraction), and accurately describes the main changes: module splitting and CLI wiring.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 70.00%.
  • No Unsafe Code (✅ Passed): The crate enforces safety through forbid(unsafe_code) declarations in src/lib.rs and src/main.rs, preventing compilation of any unsafe constructs.
  • File Size Limit (✅ Passed): All extraction module files comply with the 500-line limit; the largest is 497 lines. No blanket #[allow] attributes detected. The refactor achieves focused, maintainable module organization.
  • Section Weight Consistency (✅ Passed): The PR does not modify container parser files; the conditional requirement for weight consistency verification is not triggered.
  • Description check (✅ Passed): The PR description comprehensively documents all major changes: CI workflow updates to use justfile recipes, systematic refactoring of six oversized extraction modules into focused subdirectories, CLI pipeline wiring with format and length options, and resolution of eight design debt items.



mergify bot commented Feb 17, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:

🟢 CI must pass

Wonderful, this rule succeeded.

All CI checks must pass. This protection prevents manual merges that bypass the merge queue.

  • check-success = coverage
  • check-success = msrv (stable minus 1 releases)
  • check-success = msrv (stable minus 2 releases)
  • check-success = msrv (stable minus 3 releases)
  • check-success = msrv (stable minus 4 releases)
  • check-success = msrv (stable)
  • check-success = quality
  • check-success = test
  • check-success = test-cross-platform (macos-latest, macOS)
  • check-success = test-cross-platform (ubuntu-latest, Linux)
  • check-success = test-cross-platform (windows-latest, Windows)

🟢 Do not merge outdated PRs

Wonderful, this rule succeeded.

Make sure PRs are within 10 commits of the base branch before merging

  • #commits-behind <= 10

@coderabbitai coderabbitai bot added documentation Improvements or additions to documentation enhancement New feature or request labels Feb 17, 2026

mergify bot commented Feb 17, 2026

🧪 CI Insights

Here's what we observed from your CI run for 9c5da41.

🟢 All jobs passed!

But CI Insights is watching 👀


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 33

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/extraction/filters/implementations.rs (1)

297-302: 🧹 Nitpick | 🔵 Trivial

Redundant cast - total is already f32.

Line 299 has (total as usize) as f32 but total is already declared as f32 on line 278. This double-cast is unnecessary.

Suggested fix:

```diff
-            let min_pattern_len_for_3_reps = ((total as usize) as f32 / 3.0).ceil() as usize;
+            let min_pattern_len_for_3_reps = (total / 3.0).ceil() as usize;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/extraction/filters/implementations.rs` around lines 297 - 302, The
expression computing min_pattern_len_for_3_reps contains redundant casts of
total; since total is already an f32, replace the double-cast `((total as usize)
as f32 / 3.0).ceil() as usize` with a direct floating calculation like `(total /
3.0).ceil() as usize` in the block that uses variables total and max_pattern_len
so min_pattern_len_for_3_reps is computed without needless casting.
src/extraction/mod.rs (1)

141-151: ⚠️ Potential issue | 🟠 Major

Address re-export naming inconsistency for extract_from_section.

The re-export in mod.rs aliases UTF-16's extract_from_section to extract_utf16_from_section (line 149) but leaves ASCII's version unaliased (line 141). However, test code imports directly from submodules (e.g., test_utf16_extraction.rs imports extract_from_section from the UTF-16 module), bypassing the top-level aliases entirely. This defeats the aliasing intent and violates the guideline requiring imports from stringy::extraction, not nested paths.

Ensure consistent re-export naming: either alias both functions (ASCII as extract_ascii_from_section, UTF-16 as extract_utf16_from_section), or export both as extract_from_section with qualified module paths. Update test imports to use top-level re-exports via stringy::extraction to align with project conventions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/extraction/mod.rs` around lines 141 - 151, The top-level re-export naming
is inconsistent: ascii::extract_from_section is exported as extract_from_section
while utf16::extract_from_section is aliased to extract_utf16_from_section;
update mod.rs to make these symmetric and adjust tests to import from
stringy::extraction. Either (recommended) alias both exports to explicit names
(extract_ascii_from_section and extract_utf16_from_section) by changing the
ascii re-export to export extract_from_section as extract_ascii_from_section and
keep the utf16 alias extract_utf16_from_section, or export both as
extract_from_section within their modules and update tests to import via the
top-level path (stringy::extraction::{extract_ascii_from_section,
extract_utf16_from_section} or stringy::extraction::extract_from_section if you
choose unambiguous names); ensure tests like test_utf16_extraction.rs import
from stringy::extraction instead of nested module paths.
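
Concretely, the symmetric form this reviewer recommends would look something like the following (names taken from the comment above; hypothetical, not the merged code):

```rust
// src/extraction/mod.rs -- hypothetical symmetric aliases:
pub use ascii::extract_from_section as extract_ascii_from_section;
pub use utf16::extract_from_section as extract_utf16_from_section;
```
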
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AGENTS.md`:
- Around line 70-77: Update the first bullet under "CI Architecture" to state
that CI workflows use `just` recipes as the single source of truth except for
Quality/MSRV jobs; explicitly mention that Quality/MSRV jobs use
`dtolnay/rust-toolchain` and must not use `just` recipes because they rely on
`mise exec --` which conflicts, so readers immediately see the exception before
the separate bullet about Quality/MSRV.

In `@docs/plans/2026-02-16-extraction-refactor-and-cli.md`:
- Around line 59-63: Several fenced code blocks (including the one containing
the git add/git commit snippet) lack language specifiers; update those fenced
blocks to include an appropriate shell language tag (e.g., ```bash or ```sh) so
tooling and syntax highlighting work correctly, ensuring you add the specifier
to both opening fences and any other command/code blocks referenced in this
document.

In `@src/extraction/ascii/extraction.rs`:
- Around line 120-169: The end-of-buffer branch duplicates FoundString
construction; extract a helper function or closure (e.g., make_found_string)
that takes the taken bytes, start, len and returns a FoundString, then replace
the duplicated blocks in the max_length branch and the else branch (and any
other places in this loop) with strings.push(make_found_string(bytes, start,
len)); ensure the helper sets the same fields used now (text from
String::from_utf8, Encoding::Ascii, offset from current_string_start, length,
source StringSource::SectionData, confidence 1.0, etc.) and use
std::mem::take(&mut current_string_bytes) only once per invocation.

In `@src/extraction/ascii/tests.rs`:
- Around line 277-293: In test_extract_from_section_rva_calculation replace the
conditional guard so the test fails if no strings were extracted: after calling
create_test_section, AsciiExtractionConfig::default, and extract_from_section,
assert that strings is non-empty (e.g. assert!(!strings.is_empty()) or
assert_eq!(strings.len(), 1)) and then perform the existing assertions on
strings[0].offset and strings[0].rva; this removes the silent pass and ensures
extract_from_section actually produced the expected string.

In `@src/extraction/basic_extractor.rs`:
- Around line 19-125: The extract() and extract_canonical() implementations in
BasicExtractor duplicate section sorting, filtering, symbol inclusion and
semantic enrichment; extract that shared logic into a private helper fn (e.g.,
extract_all_strings(&self, data: &[u8], container_info: &ContainerInfo, config:
&ExtractionConfig) -> Result<Vec<FoundString>>) which performs the section sort
(same key logic), applies the Debug/code filters, calls
self.extract_from_section for each section, includes import/export symbols when
config.include_symbols is true, and calls apply_semantic_enrichment; then have
extract() call this helper and only run deduplication via
deduplicate(...)/to_found_string conversion when config.enable_deduplication,
and have extract_canonical() call the same helper but return the
canonical/deduped output as needed.
- Around line 277-278: The code is checking both config.encodings and
config.enabled_encodings (e.g., in the ascii_enabled and utf8_enabled
computations using Encoding::Ascii/Encoding::Utf8), indicating legacy dual-field
handling; consolidate to a single source of truth by standardizing all checks to
one field (pick config.encodings or rename to config.enabled_encodings
consistently), update occurrences where ascii_enabled and utf8_enabled are
computed to reference only the canonical field, and add a brief inline comment
like "// TODO: deprecate enabled_encodings" next to the kept field to signal
migration; ensure any helper functions or downstream uses (the variables
ascii_enabled, utf8_enabled) are updated to the single-field logic so behavior
is unchanged.
- Around line 261-264: The current overflow handling for computing end_offset
uses checked_add(...).unwrap_or(data.len()).min(data.len()) which is defensive
but can be clearer; change the computation in the end_offset assignment to use
saturating_add on section_offset (i.e.,
section_offset.saturating_add(section_size)) and then .min(data.len()) to
express explicit saturation semantics for end_offset (referencing end_offset,
section_offset, section_size, and data.len()).

In `@src/extraction/dedup/mod.rs`:
- Around line 212-236: to_found_string currently index-accesses
self.occurrences[0] which panics if CanonicalString.occurrences is empty; change
to use self.occurrences.first() and make the function return Option<FoundString>
(or Result) so it can return None when there are no occurrences, compute
max_confidence by iterating over occurrences (or return None early), and build
the FoundString from the first occurrence when present (preserving fields like
text, original_text, encoding, offset, rva, section, length, tags = merged_tags,
score = combined_score, source, confidence = max_confidence); update callers of
to_found_string accordingly.
- Around line 228-237: The code unnecessarily clones `fs` before calling
`found_string_to_occurrence(fs.clone())` inside the iterator consumer; since
`fs` is already moved by `into_iter()` you should pass `fs` directly to
`found_string_to_occurrence` to avoid an extra clone—locate the iterator that
calls `into_iter()` and replace `found_string_to_occurrence(fs.clone())` with
`found_string_to_occurrence(fs)` (the surrounding context includes identifiers
like found_string_to_occurrence, into_iter, fs, merged_tags, and
combined_score).

In `@src/extraction/dedup/tests.rs`:
- Around line 17-27: The UTF-16 byte-length computation in the match for
Encoding (where length is set from text.chars().count() * 2) undercounts
characters that need surrogate pairs; either update the calculation to use
text.encode_utf16().count() * 2 to count UTF-16 code units correctly, or keep
the current code but change the comment to explicitly state it assumes BMP-only
characters for these tests; modify the match arm and the inline comment near the
Encoding handling (the length variable and the text.* call) accordingly.
- Line 6: Add an inline justification comment for the
#[allow(clippy::too_many_arguments)] attribute on the test helper so it
documents why many parameters are necessary (e.g., "helper constructs test data
with full metadata; multiple args required for clarity and coverage"). Place the
comment immediately after the attribute on the same line or directly above it,
referencing the existing attribute #[allow(clippy::too_many_arguments)] so
reviewers can see the rationale for the exception.

In `@src/extraction/helpers.rs`:
- Around line 124-125: Add an inline justification comment for the
#[allow(clippy::collapsible_if)] attribute where it appears (the conditional
checking current_string_bytes.len() against min_length and max_length and the
similar block later) by appending a short rationale on the same line or
immediately after the attribute explaining why collapsing the ifs would break
the borrow/ownership intent (e.g., "allow because collapsing the ifs would
change borrow lifetimes and cause borrow-checker errors"); update both
occurrences to include this inline justification.

In `@src/extraction/pe_resources/manifests.rs`:
- Around line 68-98: The inline UTF-16BE decoding in decode_manifest should be
extracted into a shared helper (e.g., pub(crate) fn decode_utf16be(bytes: &[u8])
-> Result<String>) alongside the existing decode_utf16le in the parent
pe_resources module; implement the same logic currently in the match arm
(chunks_exact(2) -> u16::from_be_bytes -> String::from_utf16 -> trim nulls ->
map Err to StringyError::EncodingError) and then replace the UTF-16BE arm in
decode_manifest to call decode_utf16be(data) for consistency and reuse.

In `@src/extraction/pe_resources/tests.rs`:
- Around line 107-113: The test function
test_detect_version_info_empty_directory is empty and provides no coverage;
either remove it or implement it by creating a minimal fixture (an empty
RT_VERSION directory) and asserting the behavior of the code path (e.g., call
extract_resources(...) or the detect_version_info(...) helper used by
pe_resources and assert it returns no entries/does not panic). Update the test
to set up the empty directory, invoke extract_resources/detect_version_info, and
add an assertion on the returned result (empty list/Ok value) or delete the
empty test; apply the same fix to the other empty tests in tests.rs.
- Around line 1-594: The test file is getting large and should be split for
maintainability; extract resource-detection tests into detection_tests.rs and
string-extraction tests into extraction_tests.rs. Move tests that primarily
exercise extract_resources, VERSIONINFO/STRINGTABLE/MANIFEST detection and
fixture-based resource metadata checks into detection_tests.rs (any tests
referencing extract_resources, ResourceType, ResourceMetadata assertions), and
move Phase 2 string-focused tests (decode_utf16le, detect_manifest_encoding,
extract_version_info_strings, extract_string_table_strings,
extract_manifest_strings, extract_resource_strings and their fixture-based
variants) into extraction_tests.rs; keep or move the shared get_fixture_path
helper into a small tests/mod.rs or a shared utils file and update the tests
module declarations to pub use/declare those helpers so both new files can
access it. Finally, update the tests mod (or Cargo test configuration) to
include the two new files so cargo test still discovers all tests.
- Around line 99-105: The test computes `_has_version` from
`resources.iter().any(|r| matches!(r.resource_type, ResourceType::VersionInfo))`
but never asserts it; change this to an assertion that verifies absence (e.g.,
`assert!(!resources.iter().any(|r| matches!(r.resource_type,
ResourceType::VersionInfo)))`) or assign to `has_version` and
`assert_eq!(has_version, false)` so the test explicitly checks for no
VERSIONINFO resources; update the assertion in the test function that uses
`resources` to reflect this.
- Around line 363-370: The assertion is tautological because
`!resources.is_empty()` makes the `has_version_info`/`has_string_table` checks
redundant; change the check to explicitly detect other resource types instead.
Replace the `assert!(has_version_info || has_string_table ||
!resources.is_empty(), ...)` with a check that uses `resources.iter().any(|r|
!matches!(r.resource_type, ResourceType::VersionInfo |
ResourceType::StringTable))`, i.e. assert `has_version_info || has_string_table
|| resources.iter().any(|r| !matches!(r.resource_type, ResourceType::VersionInfo
| ResourceType::StringTable))` so you only pass when either known types are
present or some other resource type exists (referencing `has_version_info`,
`has_string_table`, `resources` and
`ResourceType::VersionInfo/ResourceType::StringTable`).

In `@src/extraction/pe_resources/version_info.rs`:
- Around line 21-35: In extract_version_info_strings, currently all Err branches
for PeFile::from_bytes, pe.resources(), and resources.version_info() silently
return an empty Vec; change these to log the error (at debug/trace) including
the returned Err value so parsing failures are visible while still returning an
empty Vec for the caller; specifically update the match arms around
PeFile::from_bytes, pe.resources(), and resources.version_info() to capture the
error (e.g., err) and call the project logger (or tracing::debug!) with context
like "failed to parse PE", "failed to get resources", or "failed to get
version_info" plus the err, keeping the final behavior of
extract_version_info_strings unchanged.

In `@src/extraction/tests.rs`:
- Around line 199-236: The ExtractionConfig API is redundant: remove the
obsolete encodings field and update all uses to rely solely on
enabled_encodings; update the ExtractionConfig struct and its Default impl to
drop encodings, change BasicExtractor's checks (e.g. the
ascii_enabled/utf8_enabled/utf16le_enabled/utf16be_enabled logic in
basic_extractor.rs) to reference only config.enabled_encodings.contains(...),
and update tests (e.g. test_basic_extractor_encoding_filtering in tests.rs) to
construct ExtractionConfig with enabled_encodings only. Ensure any
serialization/constructor code and pattern matches expecting encodings are
adjusted or migrated to enabled_encodings to avoid compilation errors.

In `@src/extraction/traits.rs`:
- Around line 43-58: The struct currently has two redundant fields, encodings
and enabled_encodings (both Vec<Encoding> with the same defaults); remove one to
eliminate ambiguity (prefer keeping the more clearly named enabled_encodings) or
consolidate them by renaming/merging encodings into enabled_encodings and update
all usages accordingly (search for references to encodings and
enabled_encodings, e.g., in parsing, defaults, and tests), preserve the original
default values and behavior when merging, and update the doc comment for the
remaining field to explain its purpose and remove the duplicate doc entry.
- Around line 119-156: ExtractionConfig::validate currently misses checks for
inverted/zero bounds and dedup settings; update validate() to ensure max_length
> 0, max_length >= min_length (or strictly > if you prefer to forbid equality),
and if dedup_threshold is Some(t) then t > 0 (reject zero). Specifically, inside
the validate method add checks referencing self.max_length (ensure > 0 and >=
self.min_length or > self.min_length) and self.dedup_threshold (use map or match
to assert t > 0 and return StringyError::ConfigError with a clear message on
failure) so invalid configs are rejected.

In `@src/extraction/utf16/confidence.rs`:
- Around line 90-94: The check for chars.is_empty() is redundant because the
function already returns for empty input earlier (lines ~68-70); remove the
if/else and simply return printable_count as f32 / chars.len() as f32 (using the
existing chars and printable_count identifiers) so the unreachable branch is
eliminated and the calculation is simplified.

In `@src/extraction/utf16/extraction.rs`:
- Around line 95-98: Add an inline justification comment for the
#[allow(clippy::result_unit_err)] on the decode_utf16le_bytes function: explain
that the unit error type is intentional because decode_utf16le returns no
actionable error information and callers only need success/failure, referencing
decode_utf16le and decode_utf16le_bytes to locate the rationale; keep the
comment brief and on the same line or immediately above the attribute.

In `@src/extraction/utf16/mod.rs`:
- Around line 86-98: The current add_with_dedup closure does O(n²) work by
scanning strings with strings.iter_mut().find(); replace it with a HashMap-based
dedup: create a HashMap keyed by (FoundString.offset, FoundString.encoding,
FoundString.text) mapping to the index in strings, update that map as you push
or replace entries, and when a duplicate is seen compare confidence to
strings[index].confidence and either replace strings[index] or skip; update or
insert the map entry accordingly (use the existing FoundString fields offset,
encoding, text, confidence and the add_with_dedup closure to locate where to
change the logic).

In `@src/extraction/utf16/tests.rs`:
- Around line 218-224: Update the test_utf16_odd_length_data test to assert a
concrete expected outcome instead of only ensuring no panic: call
extract_utf16_strings(data, &Utf16ExtractionConfig::default()) and assert that
the returned Vec either is empty or contains the partial decoded string "He"
(since input pairs decode to 'H','e' and the final 0x6C is an incomplete pair),
e.g. assert!(strings.is_empty() || strings.contains(&"He".to_string()));
reference the test function name, Utf16ExtractionConfig, and
extract_utf16_strings to locate and update the test.
- Around line 132-140: The test currently asserts be_string.is_some() and then
enters an if let Some(s) to re-check s.text and s.encoding redundantly; replace
the assert! + if-let pattern with a single unwrap (or use expect) on the result
of strings.iter().find(...) and then assert_eq! directly on the unwrapped value
to remove duplicated checks — locate the be_string variable created from
strings.iter().find(...), the Encoding::Utf16Be reference, and the subsequent
assert_eq! calls to update accordingly.

In `@src/extraction/utf16/validation.rs`:
- Around line 133-134: The #[allow(dead_code)] on is_valid_utf16_sequence lacks
the required justification; add a brief comment above the attribute explaining
why the function is intentionally kept (for example: "kept for future use",
"used by tests/benchmarks", or "exported for FFI but referenced conditionally")
so the attribute is justified and complies with guidelines; update the comment
to reference the function name is_valid_utf16_sequence to make the intent clear.
- Around line 116-120: The legacy wrapper is_printable_utf16le_char currently
passes None for next_code_unit so it cannot validate UTF-16 surrogate pairs
(high surrogates 0xD800–0xDBFF will always return false); update the function's
docstring to explicitly state this limitation and that callers needing
surrogate-pair validation should call is_printable_code_unit_or_pair (or use a
variant that accepts a next_code_unit) instead of relying on this
backward-compatibility wrapper. Mention the behavior for high and low surrogates
and the recommended alternative function name (is_printable_code_unit_or_pair)
in the docs.
- Around line 225-229: The branch checking chars.is_empty() at the end is
unreachable because an early return already handles the empty case; remove the
if/else and replace it with a single return of valid_count as f32 / chars.len()
as f32 (using the existing chars and valid_count variables) so the function
simply computes and returns the ratio without the redundant empty check.

In `@src/main.rs`:
- Around line 47-49: Change the short flag for the min_length CLI option to
avoid ambiguity: update the #[arg(...)] attribute on the min_length field
(currently #[arg(long, short = 'l', default_value_t = 4)]) to use a clearer
short flag such as 'n' or 'm' (e.g., short = 'n'), and update any related
usages, help text, README examples, and tests that reference the old -l flag to
the new short form to keep CLI docs and behavior consistent.
- Around line 74-75: OutputMetadata::new is being called with strings.len() for
both total_count and filtered_count, causing identical values; either document
this is intentional or pass a pre-filter total count instead. If there truly is
no post-extraction filtering, add a brief comment near the call (referencing
OutputMetadata::new, binary_name, output_format, and strings.len()) stating
total and filtered counts are intentionally identical; otherwise compute and use
the pre-filter total (capture the length of the raw/extracted collection before
any filtering into a variable like total_count) and pass that as the first
argument while passing strings.len() as filtered_count.

In `@tests/integration_cli.rs`:
- Around line 33-41: The test cli_invalid_file currently only checks the exit
status; update it to also validate stderr from
Command::new(env!("CARGO_BIN_EXE_stringy")) by converting output.stderr (bytes)
to a UTF-8 string and asserting it contains a helpful error message (e.g.,
non-empty or matches an expected substring such as "No such file" or "error");
modify the test to read let stderr = String::from_utf8_lossy(&output.stderr) and
add an assertion like assert!(!stderr.trim().is_empty(), ...) or
assert!(stderr.contains("expected substring"), ...) to ensure user-facing
diagnostics are present.
- Around line 43-51: The cli_min_length_flag test currently only asserts
success; update the test (cli_min_length_flag) to also read output.stdout (from
Command::new(...).output()), convert it to a UTF-8 string, split into lines (or
parse the expected output format), and assert that every returned string/item
has length >= 20 (or at least one string meets that criteria depending on
expected behavior). Use the existing Command invocation and output variable
names (output) and add assertions on String::from_utf8_lossy(&output.stdout) (or
equivalent) to verify the min-length flag actually filtered/affected results.

---

Outside diff comments:
In `@src/extraction/filters/implementations.rs`:
- Around line 297-302: The expression computing min_pattern_len_for_3_reps
contains redundant casts of total; since total is already an f32, replace the
double-cast `((total as usize) as f32 / 3.0).ceil() as usize` with a direct
floating calculation like `(total / 3.0).ceil() as usize` in the block that uses
variables total and max_pattern_len so min_pattern_len_for_3_reps is computed
without needless casting.

In `@src/extraction/mod.rs`:
- Around line 141-151: The top-level re-export naming is inconsistent:
ascii::extract_from_section is exported as extract_from_section while
utf16::extract_from_section is aliased to extract_utf16_from_section; update
mod.rs to make these symmetric and adjust tests to import from
stringy::extraction. Either (recommended) alias both exports to explicit names
(extract_ascii_from_section and extract_utf16_from_section) by changing the
ascii re-export to export extract_from_section as extract_ascii_from_section and
keep the utf16 alias extract_utf16_from_section, or export both as
extract_from_section within their modules and update tests to import via the
top-level path (stringy::extraction::{extract_ascii_from_section,
extract_utf16_from_section} or stringy::extraction::extract_from_section if you
choose unambiguous names); ensure tests like test_utf16_extraction.rs import
from stringy::extraction instead of nested module paths.

Comment on lines +59 to +63
```
git add src/extraction/traits.rs src/extraction/mod.rs
git commit -s -m "refactor(extraction): extract traits and config to traits.rs"
```


🧹 Nitpick | 🔵 Trivial

Add language specifiers to fenced code blocks.

Multiple code blocks throughout the document lack language specifiers (e.g., lines 63, 103, 142, etc.). Adding bash or shell for command blocks improves syntax highlighting and tooling support.

````diff
-```
+```bash
 git add src/extraction/traits.rs src/extraction/mod.rs
 git commit -s -m "refactor(extraction): extract traits and config to traits.rs"
````

🧰 Tools: 🪛 markdownlint-cli2 (0.20.0)

[warning] 59-59: Fenced code blocks should have a language specified (MD040, fenced-code-language)

[warning] 63-63: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @docs/plans/2026-02-16-extraction-refactor-and-cli.md around lines 59 - 63, several fenced code blocks (including the one containing the git add/git commit snippet) lack language specifiers; update those fenced blocks to include an appropriate shell language tag (e.g., bash or sh) so tooling and syntax highlighting work correctly, ensuring you add the specifier to both opening fences and any other command/code blocks referenced in this document.


Comment on lines +120 to +169
```rust
    // Handle string at end of buffer
    if let Some(start) = current_string_start {
        let len = current_string_bytes.len();
        if len >= config.min_length {
            // Check maximum length if configured
            if let Some(max_len) = config.max_length {
                if len > max_len {
                    // Skip this string
                } else {
                    let bytes = std::mem::take(&mut current_string_bytes);
                    let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8");
                    strings.push(FoundString {
                        text,
                        original_text: None,
                        encoding: Encoding::Ascii,
                        offset: start as u64,
                        rva: None,
                        section: None,
                        length: len as u32,
                        tags: Vec::new(),
                        score: 0,
                        section_weight: None,
                        semantic_boost: None,
                        noise_penalty: None,
                        source: StringSource::SectionData,
                        confidence: 1.0,
                    });
                }
            } else {
                let bytes = std::mem::take(&mut current_string_bytes);
                let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8");
                strings.push(FoundString {
                    text,
                    original_text: None,
                    encoding: Encoding::Ascii,
                    offset: start as u64,
                    rva: None,
                    section: None,
                    length: len as u32,
                    tags: Vec::new(),
                    score: 0,
                    section_weight: None,
                    semantic_boost: None,
                    noise_penalty: None,
                    source: StringSource::SectionData,
                    confidence: 1.0,
                });
            }
        }
    }
```

🧹 Nitpick | 🔵 Trivial

End-of-buffer handling duplicates FoundString construction.

The FoundString creation is duplicated across the max_length check branches and the main loop. Consider extracting a helper closure or function to reduce duplication.

♻️ Suggested refactor to reduce duplication:

```diff
+    // Helper to create FoundString from accumulated bytes
+    let make_found_string = |bytes: Vec<u8>, start: usize, len: usize| -> FoundString {
+        let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8");
+        FoundString {
+            text,
+            original_text: None,
+            encoding: Encoding::Ascii,
+            offset: start as u64,
+            rva: None,
+            section: None,
+            length: len as u32,
+            tags: Vec::new(),
+            score: 0,
+            section_weight: None,
+            semantic_boost: None,
+            noise_penalty: None,
+            source: StringSource::SectionData,
+            confidence: 1.0,
+        }
+    };
```

Then use strings.push(make_found_string(bytes, start, len)); in both locations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/extraction/ascii/extraction.rs` around lines 120 - 169, The end-of-buffer
branch duplicates FoundString construction; extract a helper function or closure
(e.g., make_found_string) that takes the taken bytes, start, len and returns a
FoundString, then replace the duplicated blocks in the max_length branch and the
else branch (and any other places in this loop) with
strings.push(make_found_string(bytes, start, len)); ensure the helper sets the
same fields used now (text from String::from_utf8, Encoding::Ascii, offset from
current_string_start, length, source StringSource::SectionData, confidence 1.0,
etc.) and use std::mem::take(&mut current_string_bytes) only once per
invocation.

Comment on lines 43 to 51
```rust
#[test]
fn cli_min_length_flag() {
    let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
        .args(["tests/fixtures/test_binary_elf", "-l", "20"])
        .output()
        .expect("Failed to execute stringy");

    assert!(output.status.success(), "Exit code: {}", output.status);
}
```

🧹 Nitpick | 🔵 Trivial

min_length test doesn't verify the flag's effect.

The test only checks exit success but doesn't verify that output strings are actually >= 20 characters. Without this, the test wouldn't catch if the flag was silently ignored.

Suggested improvement:

```diff
 #[test]
 fn cli_min_length_flag() {
     let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
         .args(["tests/fixtures/test_binary_elf", "-l", "20"])
         .output()
         .expect("Failed to execute stringy");

     assert!(output.status.success(), "Exit code: {}", output.status);
+
+    // Verify all output strings meet minimum length
+    let stdout = String::from_utf8_lossy(&output.stdout);
+    for line in stdout.lines().filter(|l| !l.is_empty()) {
+        // In table format, the string is typically the first column
+        // This is a basic sanity check - adjust based on actual output format
+        assert!(
+            line.len() >= 20 || line.starts_with("Offset") || line.contains("---"),
+            "Output line should be >= 20 chars or be a header: {}",
+            line
+        );
+    }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration_cli.rs` around lines 43 - 51, The cli_min_length_flag test
currently only asserts success; update the test (cli_min_length_flag) to also
read output.stdout (from Command::new(...).output()), convert it to a UTF-8
string, split into lines (or parse the expected output format), and assert that
every returned string/item has length >= 20 (or at least one string meets that
criteria depending on expected behavior). Use the existing Command invocation
and output variable names (output) and add assertions on
String::from_utf8_lossy(&output.stdout) (or equivalent) to verify the min-length
flag actually filtered/affected results.

Copilot AI left a comment

Pull request overview

This PR refactors the extraction module to comply with the 500-line file size limit and wires up the CLI to use the full extraction pipeline. All 6 oversized extraction files (totaling ~6,670 lines) have been split into focused module directories without changing total line count. The CLI now reads binaries, detects format, parses containers, extracts strings, and formats output.

Changes:

  • Split 6 oversized extraction module files into focused subdirectories (mod.rs + specialized files + tests.rs)
  • Wired up main.rs to implement the full extraction pipeline with format selection (table/json/yara) and min-length filtering
  • Updated CI workflows and justfile to use just recipes as single source of truth, fixed cargo-dist invocation

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/main.rs | Implements CLI pipeline: read binary → detect format → parse → extract → format output |
| tests/integration_cli.rs | New CLI integration tests for file handling, JSON output, error cases, and min-length flag |
| src/extraction/traits.rs | Extracted ExtractionConfig, StringExtractor trait, and BasicExtractor struct from mod.rs |
| src/extraction/helpers.rs | Extracted helper functions (semantic enrichment, printable checks, ASCII/UTF-8 extraction) |
| src/extraction/basic_extractor.rs | Extracted BasicExtractor StringExtractor impl (extract, extract_canonical, extract_from_section) |
| src/extraction/tests.rs | Extracted 570 lines of tests from mod.rs |
| src/extraction/filters/* | Split filters.rs into mod.rs + implementations.rs + tests.rs |
| src/extraction/ascii/* | Split ascii.rs into mod.rs + extraction.rs + tests.rs |
| src/extraction/utf16/* | Split utf16.rs into 6 files: mod.rs + config.rs + validation.rs + confidence.rs + extraction.rs + tests.rs |
| src/extraction/dedup/* | Split dedup.rs into mod.rs + scoring.rs + tests.rs |
| src/extraction/pe_resources/* | Split pe_resources.rs into 6 files: mod.rs + detection.rs + version_info.rs + string_tables.rs + manifests.rs + tests.rs |
| justfile | Added --all-features to build-release/test-ci/coverage, added outdated recipe, fixed dist-check to use dist plan |
| .github/workflows/*.yml | Updated CI workflows to use just recipes instead of raw cargo commands |
| AGENTS.md | Documented CI architecture: use just recipes in CI, cargo-dist invocation pattern |
| docs/plans/*.md | Implementation plan documenting the refactoring process |

```rust
        .unwrap_or_else(|| cli.input.display().to_string());

    let output_format = cli.format.to_output_format();
    let metadata = OutputMetadata::new(binary_name, output_format, strings.len(), strings.len());
```
Copilot AI Feb 17, 2026

The OutputMetadata constructor is called with strings.len() for both total_strings and filtered_strings parameters. This is incorrect - all strings are deduplicated/filtered by this point, so there's no distinction between total and filtered. Either pass different values to show actual filtering stats, or if the distinction doesn't apply here, the API should be clarified.

- Fix panic in CanonicalString::to_found_string() on empty occurrences
  (now returns Option<FoundString>)
- Add #[allow] justification comments per project rules (4 files)
- Remove redundant empty checks in confidence.rs and validation.rs
- Strengthen test assertions: replace conditional passes with asserts,
  fix unused result, fix tautological assertion, implement empty test
- Add max_length validation to ExtractionConfig::validate()
- Improve CLI tests: check stderr on error, verify min_length filtering
- Clarify AGENTS.md CI Architecture wording

Signed-off-by: Keith Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
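
A sketch of what the non-panicking conversion described in that commit could look like, assuming StringOccurrence exposes the location fields the earlier review comment lists (field names otherwise hypothetical):

```rust
impl CanonicalString {
    /// Returns None instead of panicking when `occurrences` is empty.
    pub fn to_found_string(&self) -> Option<FoundString> {
        let first = self.occurrences.first()?; // was `self.occurrences[0]`
        let max_confidence = self
            .occurrences
            .iter()
            .map(|o| o.confidence)
            .fold(0.0_f32, f32::max);
        Some(FoundString {
            text: first.text.clone(),
            original_text: first.original_text.clone(),
            encoding: first.encoding,
            offset: first.offset,
            rva: first.rva,
            section: first.section.clone(),
            length: first.length,
            tags: self.merged_tags.clone(),
            score: self.combined_score,
            section_weight: None,
            semantic_boost: None,
            noise_penalty: None,
            source: first.source,
            confidence: max_confidence,
        })
    }
}
```
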
- Fix UTF-16 length calculation in version_info.rs and string_tables.rs:
  use encode_utf16().count() * 2 instead of text.len() for source bytes
- Fix missing trailing newline in CLI output for table/JSON formatters
- Remove unnecessary clone in basic_extractor.rs canonical string path
- Use encoding from map key in dedup instead of re-indexing found_strings
- Implement empty test bodies for string_tables and manifests detection
- Assert unused results in PE resource tests instead of discarding
- Add max_length vs min_ascii_length/min_wide_length validation

Signed-off-by: Keith Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copilot AI review requested due to automatic review settings February 18, 2026 00:45

Copilot AI left a comment


Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated 6 comments.

Comment on lines +57 to +60
let default_output = Command::new(env!("CARGO_BIN_EXE_stringy"))
    .arg("tests/fixtures/test_binary_elf")
    .output()
    .expect("Failed to execute stringy");

Copilot AI Feb 18, 2026


cli_min_length_flag compares line counts against default_output, but never asserts that the default invocation succeeded. If the default run fails and returns empty stdout, this assertion can give misleading results. Add an assert!(default_output.status.success(), ...) before comparing outputs.
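A sketch of the suggested guard (the stderr formatting is illustrative):

```rust
// Fail fast if the baseline run failed, so the later line-count
// comparison cannot pass vacuously on empty stdout.
assert!(
    default_output.status.success(),
    "default invocation failed: {}",
    String::from_utf8_lossy(&default_output.stderr)
);
```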

Comment on lines 195 to 199
    is_printable_code_unit_or_pair(code_unit, next_code_unit);

if is_printable {
    char_count += 1;
    i += consumed_units * 2;

Copilot AI Feb 18, 2026


char_count is incremented by 1 even when consumed_units is 2 for surrogate pairs. This makes the length checks (min_length/max_length) effectively count Unicode scalar values, but Utf16ExtractionConfig documents min/max as UTF-16 code units. Either increment by consumed_units (so counts match the config semantics) or update the config docs and related logic to consistently use scalar-count semantics.
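A minimal sketch of the code-unit variant, reusing the variables from the snippet above (this matches the fix described in the PR summary):

```rust
if is_printable {
    // Count UTF-16 code units, not scalar values: a surrogate pair
    // contributes 2, so min/max length checks match the config docs.
    char_count += consumed_units;
    i += consumed_units * 2; // advance by bytes consumed
}
```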

Comment on lines 59 to 63
let config = ExtractionConfig {
    min_length: cli.min_length,
    min_ascii_length: cli.min_length,
    ..ExtractionConfig::default()
};

Copilot AI Feb 18, 2026


The CLI --min-length currently only updates min_length/min_ascii_length. If UTF-16 extraction is enabled via config, min_wide_length will remain at its default, so the flag will not consistently apply across encodings. Also consider calling config.validate() (or enforcing a non-zero constraint at the clap level) so invalid values fail fast with a clear error.
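A sketch of the follow-up applied in the next commit: propagate the flag to min_wide_length and validate before extracting (field names taken from the snippet above):

```rust
let config = ExtractionConfig {
    min_length: cli.min_length,
    min_ascii_length: cli.min_length,
    min_wide_length: cli.min_length, // keep UTF-16 filtering consistent
    ..ExtractionConfig::default()
};
config.validate()?; // fail fast on invalid values before extraction
```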

- Validate min_length at CLI level (clap range 1..) and call
  config.validate() before extraction
- Set min_wide_length from CLI --min-length flag for consistent
  filtering across all encodings
- Assert default CLI run succeeds before comparing output in tests

Signed-off-by: Keith Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/extraction/dedup/tests.rs`:
- Around line 1-560: The test file src/extraction/dedup/tests.rs is too large
(560 lines); split it into smaller focused test modules (for example: basic.rs
for basic_deduplication, encoding.rs for test_encoding_separation and
test_length_preservation, metadata.rs for test_metadata_preservation and
test_tag_merging, scoring.rs for
test_score_calculation/test_cross_section_bonus/test_multi_source_bonus/test_sorting/test_single_occurrence/test_to_found_string,
threshold.rs for test_dedup_threshold, and edge_cases.rs for
test_empty_input/test_edge_case_empty_string) placed under
src/extraction/dedup/tests/; in each new file keep the create_test_string helper
either by moving it into a shared mod (e.g., mod common;) or duplicating minimal
helper code, add appropriate mod declarations in
src/extraction/dedup/tests/mod.rs (pub mod basic; pub mod encoding; etc.), and
ensure each test file imports the crate items with use super::* or use
crate::extraction::dedup::* so tests compile and cargo test still discovers
them.

In `@src/extraction/pe_resources/string_tables.rs`:
- Around line 68-80: The public function extract_string_table_strings lacks a
rustdoc example or justification for being exported; either add a short rustdoc
example showing a minimal input byte slice and expected FoundString output to
demonstrate usage, or if it is internal-only change its visibility to pub(crate)
by updating the function signature; update the doc comment above
extract_string_table_strings accordingly to include the example or note the
reduced visibility.
- Around line 86-100: The current code swallows all resource parsing failures
and returns Vec::new() at pe.resources(), resources.root(), and
root.get_dir(Name::Id(RT_STRING)), losing context; change the function signature
to return Result<Vec<...>, StringyError> and replace each early Vec::new()
return with Err(...) that maps the underlying error into an appropriate
StringyError variant from types.rs (include context: resource type RT_STRING,
section and offset info where available), or propagate the original error using
? after converting it into StringyError; update callers to handle the Result.
Ensure you reference and use pe.resources(), resources.root(), and
root.get_dir(Name::Id(RT_STRING)) when building the StringyError contextual
message/fields.
- Around line 149-163: The FoundString constructed for resource strings
currently leaves offset=0, rva unset, and section_weight=None which prevents
provenance and section-aware scoring; update the code that builds FoundString
(the FoundString literal in src/extraction/pe_resources/string_tables.rs with
source StringSource::ResourceString) to compute the file offset and RVA from the
pelite resource entry or by mapping the resource's virtual address through the
section headers, populate FoundString.offset with the computed file offset, set
FoundString.rva to the computed RVA, set FoundString.section to
Some(".rsrc".to_string()) (keep as-is) and set FoundString.section_weight to
Some(9.0) to match PE .rsrc weighting; also populate any permissions/provenance
field available on FoundString (e.g., a permissions or provenance field if
present) with the section flags derived from the PE section header so downstream
scoring can use the data.
- Around line 173-175: Rename the test module currently declared as #[cfg(test)]
mod internal_tests to #[cfg(test)] mod tests to follow project test conventions;
update the module declaration (internal_tests → tests) and any internal
references to that module within this file (e.g., imports or uses that mention
internal_tests) so compilation and test discovery continue to work.
- Around line 47-63: The parser currently continues when a declared string
length overruns the remaining bytes, which leaves offset at the next length
field and can corrupt subsequent reads; update the loop in string_tables.rs to
stop parsing on this truncated-string case: when checking if offset +
byte_length > bytes.len(), instead of pushing a single None and continue, push
None for the current entry and then pad the rest of the expected entries with
None (or if no total-count variable exists, immediately break the loop) so no
further bytes are interpreted as lengths; use the existing symbols offset,
byte_length, strings, bytes and the decode_utf16le call to locate and modify the
logic.

In `@src/extraction/pe_resources/tests.rs`:
- Around line 155-171: The test name test_detect_version_info_no_translation is
misleading because it uses the existing fixture
get_fixture_path("test_binary_with_resources.exe") which likely includes
translations; either provide a dedicated fixture that truly lacks VERSIONINFO
translations and update the test to load that fixture (create a new fixture file
and change the fixture path in test_detect_version_info_no_translation), or
rename the test to test_detect_version_info_does_not_panic and leave the
existing fixture usage; update the test function name and its ignore/comment
accordingly so the test name matches the actual behavior.
- Around line 1-11: The test file src/extraction/pe_resources/tests.rs is too
large; split it into two focused modules: create detection_tests.rs containing
tests around extract_resources, resource type detection, and fixture metadata
validation, and extraction_tests.rs containing unit tests for decode_utf16le,
extract_version_info_strings, extract_string_table_strings, and
extract_manifest_strings; move the shared helper get_fixture_path into a common
tests helper module (e.g., tests/util.rs or src/extraction/tests_common.rs) that
both new files import, update module declarations and use/import paths
accordingly, and ensure any shared use statements (crate::types::{Encoding,
ResourceType, StringSource, Tag} and super::*) are adjusted in each new file.
- Around line 422-440: The match over resource.resource_type in the test does no
validation; replace it with an actual assertion that resource.resource_type is
one of the allowed variants (ResourceType::VersionInfo,
ResourceType::StringTable, ResourceType::Manifest) — e.g., assert that
matches!(resource.resource_type, ResourceType::VersionInfo |
ResourceType::StringTable | ResourceType::Manifest) and include the unexpected
resource.resource_type in the assertion message so the test fails with useful
info; this change should be applied inside the loop that iterates over resources
in the test function that currently contains the match.

In `@src/extraction/utf16/extraction.rs`:
- Around line 16-99: Change the helper functions and public API to return
crate::types::Result instead of Result<_, ()>, mapping String::from_utf16 errors
into the crate error type (StringyError) so callers get meaningful errors: in
decode_utf16le and decode_utf16be map the FromUtf16Error into StringyError
(e.g., StringyError::Utf16Decode or appropriate variant) and return
crate::types::Result<(String, Vec<u16>)>, then update decode_utf16le_bytes to
return crate::types::Result<String> (remove the
#[allow(clippy::result_unit_err)] afterwards). Keep the same u16 conversion
logic but replace map_err(|_| ()) with a conversion that constructs the
StringyError via From/Into so error context is preserved.

In `@src/extraction/utf16/validation.rs`:
- Around line 28-55: After decoding the surrogate pair in the surrogate-pair
branch (where code_point is computed), add an explicit noncharacter rejection
check similar to the single-unit path: if code_point is in 0xFDD0..=0xFDEF or if
(code_point & 0xFFFF) == 0xFFFE or 0xFFFF then treat it as non-printable and
return (false, 2); keep this check before the whitespace/control/printable
checks so noncharacters are not allowed as printable.

In `@src/main.rs`:
- Around line 59-66: After constructing the ExtractionConfig instance, call its
validate() method and propagate or handle any error before proceeding to create
BasicExtractor and call extractor.extract; specifically, after creating
ExtractionConfig (symbol ExtractionConfig) invoke config.validate() (method
validate()) and return or use the ? operator on its Result so invalid configs
are caught early and do not reach BasicExtractor::new or extractor.extract.

In `@tests/integration_cli.rs`:
- Around line 19-31: The test cli_json_output currently only checks that each
non-empty stdout line parses as JSON but doesn't assert any JSON was produced;
update the test (cli_json_output) to collect parsed JSON values (or count
non-empty lines) from stdout.lines().filter(|l| !l.is_empty()) and add an
assertion that this collection/count is greater than zero (e.g.,
assert!(!values.is_empty(), "Expected at least one JSON output line")). Ensure
the new assertion runs after parsing so both validity and non-empty output are
enforced.

---

Duplicate comments:
In `@src/extraction/basic_extractor.rs`:
- Around line 279-383: Replace the repeated dual-field checks for encodings
(config.encodings.contains(&...), config.enabled_encodings.contains(&...)) with
a single source-of-truth helper method (e.g., Config::is_encoding_enabled(&self,
encoding) or a free function encoding_enabled(config, encoding)) and call that
helper wherever ascii_enabled, utf8_enabled, utf16le_enabled, utf16be_enabled
(and any other encoding checks) are computed; implement the helper to
encapsulate the current backward-compatible logic (check both fields once) so
migration is centralized and update all references in basic_extractor.rs to use
that helper and remove the duplicated contains(...) checks from the individual
blocks.
- Around line 19-244: Both extract and extract_canonical in BasicExtractor
duplicate the same orchestration (section sorting, section filters, extracting
via extract_from_section, adding import/export symbols, and calling
apply_semantic_enrichment); factor this into a private helper (e.g.,
collect_found_strings or gather_all_found_strings) that returns
Vec<FoundString>. Have BasicExtractor::extract call that helper, then run
deduplicate (using deduplicate, config.dedup_threshold,
config.preserve_all_occurrences) and convert canonical results back to
FoundString when needed; have BasicExtractor::extract_canonical call the same
helper and either return deduplicate(...) or map each FoundString to a
CanonicalString (using found_string_to_occurrence, CanonicalString fields) when
deduplication is disabled. Ensure the helper reuses extract_from_section and
honors config.include_symbols and config.include_debug, and keep unique
conversion/dedup logic only in the two public methods.

In `@src/extraction/dedup/tests.rs`:
- Around line 19-28: The UTF-16 length calculation in the test uses
text.chars().count() * 2 which undercounts for non-BMP characters; update the
calculation inside the match for Encoding::Utf16Le | Encoding::Utf16Be to use
text.encode_utf16().count() * 2 (maintaining the cast to u32) so surrogate-pair
characters are accounted for, or alternatively add a clear comment next to this
block stating the test assumes BMP-only input; locate the calculation by the
variables Encoding and length in the test.

In `@src/extraction/pe_resources/version_info.rs`:
- Around line 21-35: The function extract_version_info_strings currently
swallows all errors and returns an empty Vec; update each error arm
(PeFile::from_bytes, pe.resources(), resources.version_info()) to log the error
at debug level with context and the error value (e.g., using log::debug! or your
crate's logger) before returning an empty Vec so callers still get the same
return shape but failures are distinguishable in logs; include the function name
and the failing call (PeFile::from_bytes, pe.resources(),
resources.version_info()) and format the error (e.g., "{:?}") in the message to
aid triage of parsing vs. missing version info while keeping the existing return
behavior.

In `@src/extraction/traits.rs`:
- Around line 43-58: The struct currently exposes two identical fields,
encodings and enabled_encodings, causing ambiguity; consolidate to a single
canonical field (pick one name, e.g., enabled_encodings) and update all usage
sites such as BasicExtractor::extract_from_section where checks like
config.encodings.contains(&Encoding::Ascii) ||
config.enabled_encodings.contains(&Encoding::Ascii) occur to reference only the
chosen field (e.g., config.enabled_encodings.contains(...)); if you must keep
the old name for compatibility, make the old field a thin deprecated alias that
forwards to the canonical Vec<Encoding> (with clear doc comment) and remove
duplicate defaults so only one source of truth remains.

In `@src/extraction/utf16/validation.rs`:
- Around line 102-120: The doc for is_printable_utf16le_char must explicitly
state its surrogate-pair limitation: because it always calls
is_printable_code_unit_or_pair(code_unit, None), it cannot detect or treat
high-surrogate + next low-surrogate pairs and will return false for standalone
high surrogates; update the function doc comment to call out that this is a
legacy wrapper, that it always passes None for next_code_unit, and that callers
needing proper surrogate-pair handling should call
is_printable_code_unit_or_pair with the following code unit supplied.

Comment on lines +68 to +80
/// Extract strings from STRINGTABLE resources
///
/// Parses RT_STRING resources (type 6) containing localized UI strings.
/// Handles block structure: strings grouped in blocks of 16.
///
/// # Arguments
///
/// * `data` - Raw PE binary data
///
/// # Returns
///
/// Vector of FoundString entries from string tables
pub fn extract_string_table_strings(data: &[u8]) -> Vec<FoundString> {
Contributor


⚠️ Potential issue | 🟡 Minor

Add a rustdoc example or reduce visibility for the public API.
extract_string_table_strings is public but the rustdoc has no example. Add a short example or make the function pub(crate) if it is internal only.
As per coding guidelines, "Exported APIs require rustdoc comments with examples where appropriate".
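For illustration, a rustdoc example along the requested lines might look like this (the module path and the no_run attribute are assumptions):

````rust
/// # Examples
///
/// ```no_run
/// use stringy::extraction::pe_resources::extract_string_table_strings;
///
/// let data = std::fs::read("app.exe").expect("read PE file");
/// for found in extract_string_table_strings(&data) {
///     println!("{}", found.text);
/// }
/// ```
pub fn extract_string_table_strings(data: &[u8]) -> Vec<FoundString> {
````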


Comment on lines +86 to +100
let resources = match pe.resources() {
    Ok(resources) => resources,
    Err(_) => return Vec::new(),
};

let root = match resources.root() {
    Ok(root) => root,
    Err(_) => return Vec::new(),
};

let string_type_name = Name::Id(RT_STRING);
let string_type_dir = match root.get_dir(string_type_name) {
    Ok(dir) => dir,
    Err(_) => return Vec::new(),
};
Contributor


⚠️ Potential issue | 🟠 Major

Surface resource parsing failures via StringyError instead of returning an empty Vec.
Collapsing all failures to Vec::new() makes "no strings found" indistinguishable from "parse failed" and drops context. Return a Result and map failures to StringyError with resource context (type, section, offsets), or propagate the error via the caller.
As per coding guidelines, "Use comprehensive error handling with a StringyError enum across the codebase" and "Use StringyError variants from types.rs for error handling, never raw strings".
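A sketch of the Result-based shape, assuming the ParseError variant mentioned in the PR summary wraps a message (the exact payload is an assumption):

```rust
// Hypothetical sketch: propagate failures with context instead of
// collapsing them into an empty Vec.
pub fn extract_string_table_strings(data: &[u8]) -> Result<Vec<FoundString>, StringyError> {
    let pe = PeFile::from_bytes(data)
        .map_err(|e| StringyError::ParseError(format!("PE parse failed: {e}")))?;
    let resources = pe
        .resources()
        .map_err(|e| StringyError::ParseError(format!("resource directory: {e}")))?;
    // ...continue with resources.root() and root.get_dir(Name::Id(RT_STRING)),
    // mapping each failure the same way.
    todo!("walk RT_STRING entries")
}
```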


Comment on lines +149 to +163
let found_string = FoundString {
    text,
    original_text: None,
    encoding: Encoding::Utf16Le,
    offset: 0, // File offset not easily available from pelite DataEntry
    rva,
    section: Some(".rsrc".to_string()),
    length: text_len,
    tags: vec![Tag::Resource],
    score: 0,
    section_weight: None,
    semantic_boost: None,
    noise_penalty: None,
    source: StringSource::ResourceString,
    confidence: 1.0,
Contributor


⚠️ Potential issue | 🟠 Major

Populate provenance and PE section weighting for resource strings.
offset: 0, rva: None, and section_weight: None mean downstream tools cannot trace or rank resource strings, and offset 0 is misleading. Derive RVA and file offset from the resource entry or section headers, set permissions, and apply the PE .rsrc weight so the section-aware scoring works.
As per coding guidelines, "Track provenance information including offset, section, RVA, and permissions for all extracted strings" and "Implement weight-based section prioritization for ELF (.rodata 10.0, .comment 9.0, .data.rel.ro 7.0), PE (.rdata 10.0, .rsrc 9.0, read-only .data 7.0), and Mach-O (__TEXT,__cstring 10.0, __TEXT,__const 9.0, __DATA_CONST 7.0)" and "Implement section-aware scoring algorithm to rank and prioritize the most relevant strings".
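A generic sketch of the RVA-to-file-offset mapping the comment asks for; the tuple layout is hypothetical and stands in for whatever section-header type the code has access to:

```rust
// Hypothetical helper: translate an RVA into a file offset by walking
// section headers, given as (virtual_address, virtual_size, raw_data_offset).
fn rva_to_file_offset(sections: &[(u32, u32, u32)], rva: u32) -> Option<u32> {
    sections.iter().find_map(|&(va, vsize, raw)| {
        (rva >= va && rva < va + vsize).then(|| rva - va + raw)
    })
}
```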


Comment on lines 16 to 99
/// Decode UTF-16LE byte sequence to UTF-8 String and return u16 vector
///
/// Converts a UTF-16LE byte sequence to a UTF-8 String using `u16::from_le_bytes`
/// and `String::from_utf16`. Also returns the u16 vector for confidence scoring.
/// Handles odd-length inputs gracefully by truncating the last byte.
///
/// # Arguments
///
/// * `bytes` - UTF-16LE encoded byte slice
///
/// # Returns
///
/// `Result<(String, Vec<u16>)>` - Decoded UTF-8 string and u16 vector, or error if decoding fails
fn decode_utf16le(bytes: &[u8]) -> Result<(String, Vec<u16>), ()> {
    // Handle odd-length input by truncating last byte
    let even_bytes = if bytes.len() % 2 == 1 {
        &bytes[..bytes.len() - 1]
    } else {
        bytes
    };

    // Convert to u16 slice
    let u16_slice: Vec<u16> = even_bytes
        .chunks_exact(2)
        .map(|chunk| u16::from_le_bytes([chunk[0], chunk[1]]))
        .collect();

    // Decode UTF-16 to String
    let decoded = String::from_utf16(&u16_slice).map_err(|_| ())?;

    Ok((decoded, u16_slice))
}

/// Decode UTF-16BE byte sequence to UTF-8 String and return u16 vector
///
/// Converts a UTF-16BE byte sequence to a UTF-8 String using `u16::from_be_bytes`
/// and `String::from_utf16`. Also returns the u16 vector for confidence scoring.
/// Handles odd-length inputs gracefully by truncating the last byte.
///
/// # Arguments
///
/// * `bytes` - UTF-16BE encoded byte slice
///
/// # Returns
///
/// `Result<(String, Vec<u16>)>` - Decoded UTF-8 string and u16 vector, or error if decoding fails
fn decode_utf16be(bytes: &[u8]) -> Result<(String, Vec<u16>), ()> {
    // Handle odd-length input by truncating last byte
    let even_bytes = if bytes.len() % 2 == 1 {
        &bytes[..bytes.len() - 1]
    } else {
        bytes
    };

    // Convert to u16 slice
    let u16_slice: Vec<u16> = even_bytes
        .chunks_exact(2)
        .map(|chunk| u16::from_be_bytes([chunk[0], chunk[1]]))
        .collect();

    // Decode UTF-16 to String
    let decoded = String::from_utf16(&u16_slice).map_err(|_| ())?;

    Ok((decoded, u16_slice))
}

/// Decode UTF-16LE byte sequence to UTF-8 String (public API)
///
/// Converts a UTF-16LE byte sequence to a UTF-8 String using `u16::from_le_bytes`
/// and `String::from_utf16`. Handles odd-length inputs gracefully by truncating
/// the last byte.
///
/// # Arguments
///
/// * `bytes` - UTF-16LE encoded byte slice
///
/// # Returns
///
/// Decoded UTF-8 string, or error if decoding fails
// Unit error is intentional: callers only need success/failure, no actionable error detail
#[allow(clippy::result_unit_err)]
pub fn decode_utf16le_bytes(bytes: &[u8]) -> Result<String, ()> {
    decode_utf16le(bytes).map(|(s, _)| s)
}
Contributor


⚠️ Potential issue | 🟠 Major

Use StringyError for UTF-16 decode errors.

Returning Result<_, ()> from the decode helpers and the public decode_utf16le_bytes drops error context and violates the project error-handling contract. Switch these helpers to crate::types::Result and map FromUtf16Error into StringyError; then the clippy allow is unnecessary.

Proposed fix
-use crate::types::{Encoding, FoundString, SectionInfo, StringSource};
+use crate::types::{Encoding, FoundString, Result, SectionInfo, StringSource};

-fn decode_utf16le(bytes: &[u8]) -> Result<(String, Vec<u16>), ()> {
+fn decode_utf16le(bytes: &[u8]) -> Result<(String, Vec<u16>)> {
@@
-    let decoded = String::from_utf16(&u16_slice).map_err(|_| ())?;
+    let decoded = String::from_utf16(&u16_slice)?;
@@
-fn decode_utf16be(bytes: &[u8]) -> Result<(String, Vec<u16>), ()> {
+fn decode_utf16be(bytes: &[u8]) -> Result<(String, Vec<u16>)> {
@@
-    let decoded = String::from_utf16(&u16_slice).map_err(|_| ())?;
+    let decoded = String::from_utf16(&u16_slice)?;
@@
-// Unit error is intentional: callers only need success/failure, no actionable error detail
-#[allow(clippy::result_unit_err)]
-pub fn decode_utf16le_bytes(bytes: &[u8]) -> Result<String, ()> {
+pub fn decode_utf16le_bytes(bytes: &[u8]) -> Result<String> {
     decode_utf16le(bytes).map(|(s, _)| s)
 }
As per coding guidelines, "Use StringyError variants from types.rs for error handling, never raw strings".

Comment on lines 28 to 55
// Check for high surrogate (0xD800..0xDBFF)
if (0xD800..=0xDBFF).contains(&code_unit) {
    // Need next code unit to form a valid pair
    if let Some(low) = next_code_unit {
        // Check if it's a valid low surrogate (0xDC00..0xDFFF)
        if (0xDC00..=0xDFFF).contains(&low) {
            // Decode surrogate pair to code point
            let high_bits = (code_unit as u32 & 0x3FF) << 10;
            let low_bits = low as u32 & 0x3FF;
            let code_point = 0x10000 + high_bits + low_bits;

            // Check if the decoded character is printable
            if let Some(ch) = char::from_u32(code_point) {
                // Allow whitespace characters
                if ch.is_whitespace() {
                    return (true, 2);
                }

                // Exclude control characters
                if ch.is_control() {
                    return (false, 2);
                }

                // For code points >= 0x20, exclude 0x7F..0x9F control range
                if code_point >= 0x20 && !(0x7F..=0x9F).contains(&code_point) {
                    return (true, 2);
                }
            }
Contributor


⚠️ Potential issue | 🟡 Minor

Exclude noncharacters in surrogate-pair path.

The single-unit path rejects noncharacters, but the surrogate-pair branch does not, so some noncharacters can be treated as printable. Add an explicit check after decoding to align with the doc intent and reduce false positives.

Proposed fix
-                let code_point = 0x10000 + high_bits + low_bits;
+                let code_point = 0x10000 + high_bits + low_bits;
+
+                // Exclude noncharacters (U+FFFE or U+FFFF in any plane)
+                if (code_point & 0xFFFF) == 0xFFFE || (code_point & 0xFFFF) == 0xFFFF {
+                    return (false, 2);
+                }

Comment on lines +19 to +31
#[test]
fn cli_json_output() {
    let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
        .args(["tests/fixtures/test_binary_elf", "--format", "json"])
        .output()
        .expect("Failed to execute stringy");

    assert!(output.status.success(), "Exit code: {}", output.status);
    let stdout = String::from_utf8_lossy(&output.stdout);
    for line in stdout.lines().filter(|l| !l.is_empty()) {
        serde_json::from_str::<serde_json::Value>(line).expect("Each line should be valid JSON");
    }
}
Contributor


🧹 Nitpick | 🔵 Trivial

Consider asserting non-empty JSON output.

The test validates JSON syntax but doesn't verify that at least one JSON object was produced. If the binary produces no strings, the test would pass vacuously.

Add non-empty assertion
     let stdout = String::from_utf8_lossy(&output.stdout);
+    let lines: Vec<_> = stdout.lines().filter(|l| !l.is_empty()).collect();
+    assert!(!lines.is_empty(), "Should produce at least one JSON object");
-    for line in stdout.lines().filter(|l| !l.is_empty()) {
+    for line in lines {
         serde_json::from_str::<serde_json::Value>(line).expect("Each line should be valid JSON");
     }

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main.rs`:
- Around line 47-63: The CLI's min_length field is declared as u64 and later
cast to usize which can silently truncate; change the Cli field declaration for
min_length to type usize and update its clap value parser to
clap::value_parser!(usize) (keep default_value_t = 4 and the range check). Then
update usages: in run(), remove the cast "as usize" and use cli.min_length
directly (e.g., let min_len = cli.min_length) so ExtractionConfig { min_length,
min_ascii_length, min_wide_length } receives usize values without truncation.

In `@tests/integration_cli.rs`:
- Around line 3-31: The two integration tests cli_accepts_binary_file and
cli_json_output should be converted to use insta snapshot assertions instead of
raw asserts: capture the process output into a normalized stdout string (e.g.,
trim trailing whitespace, remove timestamps or machine-specific paths if
present) and call insta::assert_snapshot!(normalized_stdout) for the table/text
case in cli_accepts_binary_file; for cli_json_output, parse/normalize each JSON
line (e.g., serde_json::from_str -> serde_json::to_string_pretty or canonicalize
keys) and then assert_snapshot!(normalized_joined_json) so future output changes
are detected by the insta framework.
- Around line 3-49: Rename the four test functions to follow the
test_<function>_<scenario> pattern by prefixing their names with "test_": change
cli_accepts_binary_file to test_cli_accepts_binary_file, cli_json_output to
test_cli_json_output, cli_invalid_file to test_cli_invalid_file, and
cli_min_length_flag to test_cli_min_length_flag so the #[test] functions match
the project's naming convention and are easily discoverable.

Comment on lines +3 to +31
#[test]
fn cli_accepts_binary_file() {
    let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
        .arg("tests/fixtures/test_binary_elf")
        .output()
        .expect("Failed to execute stringy");

    assert!(output.status.success(), "Exit code: {}", output.status);
    let stdout = String::from_utf8_lossy(&output.stdout);
    assert!(
        !stdout.contains("coming soon"),
        "CLI still shows stub message"
    );
    assert!(!stdout.is_empty(), "No output produced");
}

#[test]
fn cli_json_output() {
    let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
        .args(["tests/fixtures/test_binary_elf", "--format", "json"])
        .output()
        .expect("Failed to execute stringy");

    assert!(output.status.success(), "Exit code: {}", output.status);
    let stdout = String::from_utf8_lossy(&output.stdout);
    for line in stdout.lines().filter(|l| !l.is_empty()) {
        serde_json::from_str::<serde_json::Value>(line).expect("Each line should be valid JSON");
    }
}
Contributor


⚠️ Potential issue | 🟠 Major

Integration tests must use insta snapshots for CLI output

These tests validate CLI behavior but do not use insta snapshots, which is required for integration tests and helps catch output regressions. Snapshot stdout (after any normalization if needed), especially for table and JSON formats.

Example direction
+use insta::assert_snapshot;
@@
     let stdout = String::from_utf8_lossy(&output.stdout);
+    assert_snapshot!("cli_json_output", stdout);

As per coding guidelines, "tests/integration_*.rs: Use insta crate for snapshot testing in integration tests".


Comment on lines +3 to +49
#[test]
fn cli_accepts_binary_file() {
    let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
        .arg("tests/fixtures/test_binary_elf")
        .output()
        .expect("Failed to execute stringy");

    assert!(output.status.success(), "Exit code: {}", output.status);
    let stdout = String::from_utf8_lossy(&output.stdout);
    assert!(
        !stdout.contains("coming soon"),
        "CLI still shows stub message"
    );
    assert!(!stdout.is_empty(), "No output produced");
}

#[test]
fn cli_json_output() {
    let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
        .args(["tests/fixtures/test_binary_elf", "--format", "json"])
        .output()
        .expect("Failed to execute stringy");

    assert!(output.status.success(), "Exit code: {}", output.status);
    let stdout = String::from_utf8_lossy(&output.stdout);
    for line in stdout.lines().filter(|l| !l.is_empty()) {
        serde_json::from_str::<serde_json::Value>(line).expect("Each line should be valid JSON");
    }
}

#[test]
fn cli_invalid_file() {
    let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
        .arg("nonexistent_file")
        .output()
        .expect("Failed to execute stringy");

    assert!(!output.status.success(), "Should fail for missing file");
    let stderr = String::from_utf8_lossy(&output.stderr);
    assert!(
        !stderr.trim().is_empty(),
        "Should produce an error message on stderr"
    );
}

#[test]
fn cli_min_length_flag() {
Contributor


⚠️ Potential issue | 🟡 Minor

Rename tests to the test_<function>_<scenario> pattern

Current test names do not match the required naming convention. Please rename all four to the required pattern.

Example renames
-fn cli_accepts_binary_file() {
+fn test_cli_accepts_binary_file() {

-fn cli_json_output() {
+fn test_cli_json_output() {

-fn cli_invalid_file() {
+fn test_cli_invalid_file() {

-fn cli_min_length_flag() {
+fn test_cli_min_length_flag() {

As per coding guidelines, "tests/**/*.rs: Test names follow test_<function>_<scenario> pattern".


Use usize directly for min_length CLI field instead of u64 with
cast. config.validate() handles the >= 1 constraint.

Signed-off-by: Keith Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copilot AI review requested due to automatic review settings February 18, 2026 01:02
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main.rs`:
- Around line 47-49: Add a clap-level range validator to the min_length argument
so invalid values (like 0) are rejected at parse time rather than later in
config.validate(); update the struct field attribute for min_length (the
#[arg(...)] on the min_length: usize field) to include a value_parser using
clap::value_parser!(usize).range(1..) so clap fails fast with a clear message.

Comment on lines +47 to +49
/// Minimum string length in bytes (must be >= 1)
#[arg(long, short = 'l', default_value_t = 4)]
min_length: usize,
Contributor


🧹 Nitpick | 🔵 Trivial

Consider adding value_parser range constraint for better UX.

Currently min_length accepts any usize, including 0, and relies on config.validate() to reject invalid values. While this works, clap's range parser provides immediate feedback with a clearer error message:

#[arg(long, short = 'l', default_value_t = 4, value_parser = clap::value_parser!(usize).range(1..))]
min_length: usize,

This fails fast at argument parsing rather than mid-pipeline. The current approach is functional but shifts validation deeper into the call stack.



Copilot AI left a comment


Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated 5 comments.

Comment on lines +59 to +63
let config = ExtractionConfig {
    min_length: cli.min_length,
    min_ascii_length: cli.min_length,
    min_wide_length: cli.min_length,
    ..ExtractionConfig::default()

Copilot AI Feb 18, 2026


--min-length is documented as a byte length, but the CLI maps it directly into min_wide_length, which the UTF-16 extractor consumes as a UTF-16 code-unit length. This makes UTF-16 filtering stricter than intended (e.g., min-length=4 bytes corresponds to 2 UTF-16 code units). Consider either (a) keeping separate CLI flags for byte vs UTF-16 code-unit lengths, or (b) converting cli.min_length to min_wide_length = max(1, (cli.min_length + 1) / 2) and documenting the behavior.
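Option (b) from the comment is a one-line conversion; a sketch:

```rust
// Convert a byte length to UTF-16 code units, rounding up and
// clamping to at least one unit (min-length=4 bytes -> 2 code units).
let min_wide_length = std::cmp::max(1, (cli.min_length + 1) / 2);
```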

Comment on lines 88 to 92
pub fn deduplicate(
    strings: Vec<FoundString>,
    dedup_threshold: Option<usize>,
    _preserve_all_occurrences: bool,
) -> Vec<CanonicalString> {

Copilot AI Feb 18, 2026


The preserve_all_occurrences argument is currently unused (named _preserve_all_occurrences) but the docstring promises different behavior when it is false ("only store occurrence count instead of full metadata"). This is misleading for callers. Either implement the reduced-metadata mode, or remove the parameter and update the docs/API accordingly.
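If the removal option is taken (as the PR summary's design-debt item describes), the signature shrinks accordingly; callsites would need the same update:

```rust
// Sketch of the removal option: no preserve flag, full occurrence
// metadata is always kept.
pub fn deduplicate(
    strings: Vec<FoundString>,
    dedup_threshold: Option<usize>,
) -> Vec<CanonicalString> {
    // ...body unchanged
```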

@unclesp1d3r unclesp1d3r self-assigned this Feb 18, 2026
@unclesp1d3r unclesp1d3r changed the title refactor: split oversized extraction modules and wire up CLI pipeline refactor(extraction): split oversized modules and wire up CLI pipeline Feb 18, 2026
- Fix STRINGTABLE truncated entry bug: break instead of continue when
  byte_length exceeds remaining buffer, preventing misaligned reads
- Fix stale doc paths after refactoring (ascii/mod.rs, pe_resources/mod.rs)
- Rename test module internal_tests -> tests (project convention)
- Rename misleading test (no_translation -> fallback_language_handling)
- Replace no-op match with actual type assertion in metadata validation
- Split pe_resources/tests.rs (632 lines) into tests/ directory module
- Split dedup/tests.rs (560 lines) into tests/ directory module
- Add test for STRINGTABLE truncated entry parsing behavior

Signed-off-by: Keith Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/extraction/ascii/mod.rs`:
- Around line 39-108: The AsciiExtractionConfig currently allows invalid values
(min_length == 0 or max_length < min_length); add validation to prevent these
invariants by implementing either a checked constructor and/or a validate()
method: update AsciiExtractionConfig::new (and/or implement TryFrom/try_new) to
return a Result<AsciiExtractionConfig, StringyError> that rejects min_length < 1
and rejects Some(max) where max < min_length, and update Default (or keep it) to
produce valid defaults; ensure any callers that construct AsciiExtractionConfig
(including use of Default or new) call the fallible constructor or validate() so
invalid configs are rejected upstream.

In `@src/extraction/dedup/tests/basic.rs`:
- Around line 102-112: The test relies on insertion order of occurrences; before
asserting positions, sort the occurrences vector by offset to make the test
deterministic: after obtaining canonical = deduplicate(strings, None, true) and
let occ = &canonical[0].occurrences, sort occ in-place by the offset field
(e.g., using sort_by_key or sort_unstable_by_key on occ.offset) and then perform
the existing assertions on occ[0] and occ[1]; this keeps the change localized to
the test (basic.rs) and references the canonical variable and deduplicate call.

In `@src/extraction/dedup/tests/scoring.rs`:
- Around line 201-250: Update the inline comment in the test_length_preservation
unit test to avoid ambiguity about .text.len(): either reword the comment on the
assertion from "4 characters" to "4 bytes (ASCII)" to be precise about
.text.len() returning byte length for ASCII, or change the assertion check to
use text.chars().count() if you intend to assert character count; reference the
test function name test_length_preservation and the calls to
found.text.len()/.length to locate where to adjust the comment or the check.
- Around line 31-35: The test uses magic numbers for scoring; define and export
module-level constants (e.g., OCCURRENCE_BONUS = 5, CROSS_SECTION_BONUS = 10,
MULTI_SOURCE_BONUS = 15) in the scoring module where calculate_combined_score()
is implemented, then import those constants into the test and replace the
hardcoded 5/10/15 in the deduplicate() test assertion (which checks
canonical[0].combined_score) with the named constants so the test reads base +
OCCURRENCE_BONUS + (confidence calculation) and stays in sync if weights change.

In `@src/extraction/pe_resources/mod.rs`:
- Around line 149-170: extract_resources currently swallows PE parse and
resource access errors and returns an empty Vec; change its signature to return
Result<Vec<ResourceMetadata>, StringyError>, map PeFile::from_bytes errors and
pe.resources() errors into appropriate StringyError variants from types.rs
(e.g., a parse/pe variant and a resource-access variant) with contextual
messages (do not use raw strings), and propagate enumerate_resources results via
? or map_err to a StringyError rather than unwrap_or_default; also update
callers to handle the Result return so they can distinguish "no resources" from
"parse failed".

In `@src/extraction/pe_resources/tests/mod.rs`:
- Around line 14-20: The helper function get_fixture_path is used by multiple
test submodules but is currently private; change its visibility to pub(crate) so
other tests in the crate can access it. Locate the function named
get_fixture_path in src/extraction/pe_resources/tests/mod.rs and add the
pub(crate) visibility modifier to its declaration, keeping the implementation
and signature otherwise unchanged so it remains a crate-scoped test helper.

---

Duplicate comments:
In `@src/extraction/pe_resources/string_tables.rs`:
- Around line 81-100: Change this function to return Result<Vec<FoundString>,
StringyError> instead of Vec so parse failures are not swallowed; replace each
early return like from PeFile::from_bytes, pe.resources(), resources.root(), and
root.get_dir(Name::Id(RT_STRING)) with propagation of a StringyError variant
(use the project's StringyError enum from types.rs) that includes context about
which resource parse step failed; ensure downstream callers of this function are
updated to handle the Result and preserve error context rather than mapping
failures to an empty Vec.
- Around line 68-80: The public function extract_string_table_strings currently
lacks a rustdoc example; either add a short usage example showing calling
extract_string_table_strings(data: &[u8]) and what it returns (Vec<FoundString>)
or, if this is internal-only, reduce its visibility to pub(crate) by changing
the signature from pub fn extract_string_table_strings(...) to pub(crate) fn
extract_string_table_strings(...); ensure any added example references the
FoundString type and a minimal input byte slice to satisfy the exported-API
documentation requirement.
- Around line 149-163: The FoundString construction in string_tables.rs leaves
provenance blank (offset=0, rva=None, section_weight=None); update the code that
builds FoundString so it derives the file offset and RVA from the pelite
resource entry or the containing section headers (use the resource entry's data
pointer/base and section header virtual_address to compute offset and rva),
populate original_text if available, set section to ".rsrc" and set
section_weight to 9.0 for resource strings, and keep source as
StringSource::ResourceString so downstream section-aware scoring can use these
fields (update the FoundString fields offset, rva, section_weight accordingly).

Comment on lines +39 to +108
/// Configuration for ASCII string extraction
///
/// Controls minimum and maximum string length filtering. This structure serves as the
/// foundation for future configuration expansion, including encoding preferences and
/// tag filters as mentioned in the issue.
///
/// # Default Values
///
/// - `min_length`: 4 (standard minimum to reduce noise)
/// - `max_length`: None (no upper limit by default)
///
/// # Examples
///
/// ```rust
/// use stringy::extraction::ascii::AsciiExtractionConfig;
///
/// // Use default configuration
/// let config = AsciiExtractionConfig::default();
///
/// // Custom minimum length
/// let config = AsciiExtractionConfig::new(8);
///
/// // Custom minimum and maximum length
/// let mut config = AsciiExtractionConfig::default();
/// config.max_length = Some(256);
/// ```
#[derive(Debug, Clone)]
pub struct AsciiExtractionConfig {
    /// Minimum string length in bytes (default: 4)
    pub min_length: usize,
    /// Maximum string length in bytes (default: None, no limit)
    pub max_length: Option<usize>,
}

impl Default for AsciiExtractionConfig {
    fn default() -> Self {
        Self {
            min_length: 4,
            max_length: None,
        }
    }
}

impl AsciiExtractionConfig {
    /// Create a new AsciiExtractionConfig with custom minimum length
    ///
    /// # Arguments
    ///
    /// * `min_length` - Minimum string length in bytes
    ///
    /// # Returns
    ///
    /// New AsciiExtractionConfig with specified minimum length and default max_length (None)
    ///
    /// # Example
    ///
    /// ```rust
    /// use stringy::extraction::ascii::AsciiExtractionConfig;
    ///
    /// let config = AsciiExtractionConfig::new(8);
    /// assert_eq!(config.min_length, 8);
    /// assert_eq!(config.max_length, None);
    /// ```
    pub fn new(min_length: usize) -> Self {
        Self {
            min_length,
            max_length: None,
        }
    }
}
Contributor


⚠️ Potential issue | 🟡 Minor

Validate AsciiExtractionConfig invariants.
min_length can be 0 and max_length can be set below min_length, which can cause excessive output and false positives. Consider a validated constructor (or a validate() method) that returns a StringyError when min_length < 1 or max_length < min_length, or enforce these checks at the CLI boundary.
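A sketch of a validate() method along those lines; the StringyError variant used here is illustrative, not necessarily the project's actual one:

```rust
impl AsciiExtractionConfig {
    /// Reject configurations that would be invalid at extraction time.
    pub fn validate(&self) -> Result<(), StringyError> {
        if self.min_length < 1 {
            return Err(StringyError::ParseError("min_length must be >= 1".into()));
        }
        if matches!(self.max_length, Some(max) if max < self.min_length) {
            return Err(StringyError::ParseError(
                "max_length must be >= min_length".into(),
            ));
        }
        Ok(())
    }
}
```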


Comment on lines 201 to 250
#[test]
fn test_length_preservation() {
    // Test that length is preserved correctly for UTF-16 strings
    let strings = vec![
        FoundString {
            text: "Test".to_string(),
            original_text: None,
            encoding: Encoding::Utf16Le,
            offset: 0x100,
            rva: Some(0x1000),
            section: None,
            length: 8, // 4 characters * 2 bytes = 8 bytes
            tags: vec![],
            score: 10,
            section_weight: None,
            semantic_boost: None,
            noise_penalty: None,
            source: StringSource::SectionData,
            confidence: 0.8,
        },
        FoundString {
            text: "Test".to_string(),
            original_text: None,
            encoding: Encoding::Utf16Le,
            offset: 0x200,
            rva: Some(0x2000),
            section: None,
            length: 8,
            tags: vec![],
            score: 15,
            section_weight: None,
            semantic_boost: None,
            noise_penalty: None,
            source: StringSource::SectionData,
            confidence: 0.9,
        },
    ];

    let canonical = deduplicate(strings, None, true);
    assert_eq!(canonical.len(), 1);
    assert_eq!(canonical[0].occurrences[0].length, 8);
    assert_eq!(canonical[0].occurrences[1].length, 8);

    // Verify to_found_string() uses stored length, not text.len()
    let found = canonical[0]
        .to_found_string()
        .expect("canonical string with occurrences should convert");
    assert_eq!(found.length, 8); // Should be 8 bytes, not 4 (text.len())
    assert_eq!(found.text.len(), 4); // But text is still 4 characters
}
Contributor


🧹 Nitpick | 🔵 Trivial

LGTM - length preservation test, minor comment clarification.

Good test for UTF-16 byte-length preservation through the dedup pipeline.

Line 249: The comment says "4 characters" but text.len() returns bytes. For ASCII "Test", these happen to be equal. Consider rewording to "4 bytes (ASCII)" for precision, or use text.chars().count() if you intend to count characters.

Clarify comment
-    assert_eq!(found.text.len(), 4); // But text is still 4 characters
+    assert_eq!(found.text.len(), 4); // text.len() returns bytes; "Test" is 4 ASCII bytes

Comment on lines +149 to +170
pub fn extract_resources(data: &[u8]) -> Vec<ResourceMetadata> {
    // Attempt to parse PE using pelite
    let pe = match PeFile::from_bytes(data) {
        Ok(pe) => pe,
        Err(_) => {
            // Graceful degradation: return empty vec on parse error
            // This allows PE parsing to succeed even if resource extraction fails
            return Vec::new();
        }
    };

    // Get resource directory
    let resources = match pe.resources() {
        Ok(resources) => resources,
        Err(_) => {
            // No resource directory or error accessing it - not an error condition
            return Vec::new();
        }
    };

    // Enumerate all resources - handle errors gracefully
    enumerate_resources(&resources).unwrap_or_default()

⚠️ Potential issue | 🟠 Major

Return errors instead of empty Vec for resource parsing failures.
extract_resources discards parse failures and returns an empty vector, which hides errors from callers. Please return Result<Vec<ResourceMetadata>, StringyError> and propagate context so callers can distinguish "no resources" from "parse failed".

As per coding guidelines, "Use comprehensive error handling with a StringyError enum across the codebase" and "Use StringyError variants from types.rs for error handling, never raw strings".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/extraction/pe_resources/mod.rs` around lines 149 - 170, extract_resources
currently swallows PE parse and resource access errors and returns an empty Vec;
change its signature to return Result<Vec<ResourceMetadata>, StringyError>, map
PeFile::from_bytes errors and pe.resources() errors into appropriate
StringyError variants from types.rs (e.g., a parse/pe variant and a
resource-access variant) with contextual messages (do not use raw strings), and
propagate enumerate_resources results via ? or map_err to a StringyError rather
than unwrap_or_default; also update callers to handle the Result return so they
can distinguish "no resources" from "parse failed".
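
A hedged sketch of that shape; which pelite error marks a missing resource directory, and the Display-based error mapping, are assumptions rather than the project's actual variants:

pub fn extract_resources(data: &[u8]) -> Result<Vec<ResourceMetadata>, StringyError> {
    let pe = PeFile::from_bytes(data)
        .map_err(|e| StringyError::ParseError(format!("PE parse failed: {e}")))?;

    let resources = match pe.resources() {
        Ok(resources) => resources,
        // pelite reports an absent resource directory as Error::Null (assumption);
        // only that case is the benign "no resources" outcome.
        Err(pelite::Error::Null) => return Ok(Vec::new()),
        Err(e) => {
            return Err(StringyError::ParseError(format!(
                "failed to read PE resource directory: {e}"
            )));
        }
    };

    enumerate_resources(&resources)
        .map_err(|e| StringyError::ParseError(format!("resource enumeration failed: {e}")))
}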

Comment on lines +14 to +20
/// Helper to get fixture path
fn get_fixture_path(name: &str) -> std::path::PathBuf {
    Path::new(env!("CARGO_MANIFEST_DIR"))
        .join("tests")
        .join("fixtures")
        .join(name)
}

⚠️ Potential issue | 🟡 Minor

Use pub(crate) for shared test helpers.
get_fixture_path is used across test submodules; mark it pub(crate) to align with the test helper visibility rule.

Proposed change
-fn get_fixture_path(name: &str) -> std::path::PathBuf {
+pub(crate) fn get_fixture_path(name: &str) -> std::path::PathBuf {
As per coding guidelines, "Use pub(crate) visibility for internal test helpers".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/extraction/pe_resources/tests/mod.rs` around lines 14 - 20, The helper
function get_fixture_path is used by multiple test submodules but is currently
private; change its visibility to pub(crate) so other tests in the crate can
access it. Locate the function named get_fixture_path in
src/extraction/pe_resources/tests/mod.rs and add the pub(crate) visibility
modifier to its declaration, keeping the implementation and signature otherwise
unchanged so it remains a crate-scoped test helper.

Address 8 pre-existing design issues surfaced during bot review of PR #135:

1. Remove dual encoding fields: delete `encodings` from ExtractionConfig,
   keep only `enabled_encodings`. Eliminates confusing OR-checks throughout
   basic_extractor.rs.

2. Extract shared logic from extract/extract_canonical: factor out the
   duplicated section-sorting, filtering, symbol-inclusion, and semantic
   enrichment pipeline into a private `collect_all_strings()` helper.

3. Fix O(n^2) dedup in Auto byte order mode: replace
   `strings.iter_mut().find()` with HashMap-based O(1) lookup in
   utf16/mod.rs.

4. Remove unused `_preserve_all_occurrences` parameter from
   `deduplicate()` signature, ExtractionConfig struct, and all callsites.

5. Remove broken byte order consistency heuristic: the
   `check_byte_order_consistency()` function operated on decoded u16
   values which are identical regardless of original byte order, making
   LE always score 1.0 and BE always 0.0. Redistribute the 0.1 weight
   to valid_unicode (0.35) and printable (0.45) heuristics.

6. Fix char_count semantics for surrogate pairs: change `char_count += 1`
   to `char_count += consumed_units` in utf16/extraction.rs so min/max
   length checks count UTF-16 code units, not scalar values.

7. Replace `Result<_, ()>` with `StringyError` in UTF-16 decode functions:
   use `StringyError::ParseError` for the 3 decode functions in
   utf16/extraction.rs, removing `#[allow(clippy::result_unit_err)]`.

8. Add noncharacter validation in surrogate-pair path: reject U+xFFFE
   and U+xFFFF in supplementary planes (planes 1-16) in
   utf16/validation.rs, matching the BMP noncharacter check (see the
   sketch after this commit message).

Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
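
For item 8, a minimal standalone sketch of the rule being enforced; the function name is hypothetical, but the rule itself is Unicode's definition of noncharacters (U+FDD0..=U+FDEF plus the last two code points of every plane):

/// True when `cp` is a Unicode noncharacter: U+FDD0..=U+FDEF, or any code
/// point whose low 16 bits are 0xFFFE or 0xFFFF (covers the BMP and all
/// supplementary planes 1-16 reached via surrogate pairs).
fn is_noncharacter(cp: u32) -> bool {
    matches!(cp, 0xFDD0..=0xFDEF) || (cp & 0xFFFE) == 0xFFFE
}

fn main() {
    assert!(is_noncharacter(0xFFFE)); // BMP case, already rejected before this change
    assert!(is_noncharacter(0x1FFFE)); // plane 1, the case item 8 adds
    assert!(!is_noncharacter(0x1F600)); // ordinary supplementary-plane scalar value
}
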
Copilot AI review requested due to automatic review settings February 18, 2026 02:08

Copilot AI left a comment

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 2 comments.

Comment on lines +51 to +55
    /// Minimum length for ASCII strings (default: 4, same as min_length)
    pub min_ascii_length: usize,
    /// Minimum length for UTF-16 strings (default: 3, for future use)
    pub min_wide_length: usize,
    /// Which encodings to extract (default: ASCII, UTF-8)
Copilot AI Feb 18, 2026

min_wide_length is documented as a general "minimum length" but is currently used as Utf16ExtractionConfig::min_length (UTF-16 code units) in basic_extractor.rs. The current validate() logic also compares max_length (bytes) against min_wide_length directly, which is inconsistent if min_wide_length is in UTF-16 code units. Please clarify the unit for min_wide_length in the doc comment and adjust the validation/conversion accordingly (e.g., validate against max_length / 2 if it is UTF-16 code units, or convert it to code units at the use site if you want to keep it byte-based).

Comment on lines +148 to +152
        if self.max_length < self.min_wide_length {
            return Err(StringyError::ConfigError(format!(
                "max_length ({}) must be >= min_wide_length ({})",
                self.max_length, self.min_wide_length
            )));
Copilot AI Feb 18, 2026

validate() currently checks self.max_length < self.min_wide_length, but max_length is in bytes while min_wide_length is used as the UTF-16 min_length (code units). This can both allow invalid UTF-16 configs (if min_wide_length is in code units) and reject valid ones (if callers treat it as bytes). Align the units here (e.g., compare max_length / 2 against min_wide_length if min_wide_length is in UTF-16 code units).
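
A minimal sketch of the unit-aligned comparison, assuming these fields live on ExtractionConfig, that min_wide_length counts UTF-16 code units, and that max_length counts bytes (2 bytes per code unit); the helper name is hypothetical:

impl ExtractionConfig {
    /// Unit-aligned replacement for the check above: convert max_length to
    /// UTF-16 code units before comparing against min_wide_length.
    fn validate_wide_length(&self) -> Result<(), StringyError> {
        let max_wide_units = self.max_length / 2;
        if max_wide_units < self.min_wide_length {
            return Err(StringyError::ConfigError(format!(
                "max_length ({} bytes = {} UTF-16 code units) must be >= min_wide_length ({} code units)",
                self.max_length, max_wide_units, self.min_wide_length
            )));
        }
        Ok(())
    }
}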

Add .github/actionlint.yml config to suppress lint errors from
release.yml, which is auto-generated by cargo-dist and cannot be
hand-edited.

Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
The cspell tool was configured but never installed in mise, so
lint-spell always failed. The project's domain-heavy jargon
(binary analysis terms, hex patterns) required 70+ exception words,
making the maintenance cost outweigh the marginal value. Typos in
identifiers are caught at compile time, and documentation quality
is covered by CodeRabbit reviews.

Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copilot AI review requested due to automatic review settings February 18, 2026 02:37

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 46 changed files in this pull request and generated 1 comment.

@unclesp1d3r unclesp1d3r added the lgtm This PR has been approved by a maintainer label Feb 18, 2026
@mergify mergify bot added the queued label Feb 18, 2026
mergify bot added a commit that referenced this pull request Feb 18, 2026
@mergify
Contributor

mergify bot commented Feb 18, 2026

Merge Queue Status

Rule: default


This pull request spent 4 minutes 52 seconds in the queue, including 4 minutes 42 seconds running CI.

Required conditions to merge
  • check-success = coverage
  • check-success = msrv (stable minus 1 releases)
  • check-success = msrv (stable minus 2 releases)
  • check-success = msrv (stable minus 3 releases)
  • check-success = msrv (stable minus 4 releases)
  • check-success = msrv (stable)
  • check-success = quality
  • check-success = test
  • check-success = test-cross-platform (macos-latest, macOS)
  • check-success = test-cross-platform (ubuntu-latest, Linux)
  • check-success = test-cross-platform (windows-latest, Windows)
  • all of [🛡 Merge Protections rule CI must pass]:
    • check-success = coverage
    • check-success = msrv (stable minus 1 releases)
    • check-success = msrv (stable minus 2 releases)
    • check-success = msrv (stable minus 3 releases)
    • check-success = msrv (stable minus 4 releases)
    • check-success = msrv (stable)
    • check-success = quality
    • check-success = test
    • check-success = test-cross-platform (macos-latest, macOS)
    • check-success = test-cross-platform (ubuntu-latest, Linux)
    • check-success = test-cross-platform (windows-latest, Windows)
  • all of [🛡 Merge Protections rule Do not merge outdated PRs]:
  • all of [🛡 Merge Protections rule Enforce conventional commit]:
  • any of [🛡 GitHub repository ruleset rule main]:
    • check-neutral = Mergify Merge Protections
    • check-skipped = Mergify Merge Protections
    • check-success = Mergify Merge Protections

@mergify mergify bot merged commit f976cf4 into main Feb 18, 2026
33 of 34 checks passed
@mergify mergify bot deleted the fix_dist_issues branch February 18, 2026 02:50
@mergify mergify bot removed the queued label Feb 18, 2026

Labels

documentation (Improvements or additions to documentation) · enhancement (New feature or request) · lgtm (This PR has been approved by a maintainer)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant
