
feat(classification): add symbol demangling for Rust #124

Merged

unclesp1d3r merged 7 commits into main from symantic_classification_completion
Jan 18, 2026

Conversation

@unclesp1d3r
Member

This pull request introduces a new modular pattern classification system for detecting various data formats, including IP addresses, GUIDs, emails, Base64 data, format strings, and user agent strings. The changes add new pattern classification modules, refactor existing capabilities to use these modules, and update dependencies for improved performance and maintainability.

New Pattern Classification Modules

  • Added a new patterns module in src/classification, which contains submodules for different types of pattern classification: ip, network, paths, and data. Each submodule provides functions and regex patterns for detecting specific formats such as IP addresses, URLs, file paths, GUIDs, emails, Base64, format strings, and user agents.
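For orientation, a minimal sketch of how the new module tree might be wired together, assuming the re-exports mirror the classifier names listed later in this PR; the exact layout and visibility choices are assumptions, not the crate's code.

// src/classification/patterns/mod.rs (sketch)
pub mod data;
pub mod ip;
pub mod network;
pub mod paths;

// Re-export the classifier entry points for callers such as SemanticClassifier.
pub use data::{classify_base64, classify_email, classify_format_string, classify_guid, classify_user_agent};
pub use ip::classify_ip_addresses;
pub use network::{classify_domain, classify_url};
pub use paths::{classify_posix_path, classify_registry_path, classify_unc_path, classify_windows_path};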

Expanded Data Format Detection

  • Implemented new classifiers for GUID/UUID, email addresses, Base64-encoded data, printf-style format strings, and user agent strings in src/classification/patterns/data.rs, with comprehensive regex-based detection and thorough unit tests.
  • Updated documentation in src/classification/mod.rs to reflect the expanded capabilities, including symbol demangling and new data format detection.
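To illustrate the shape of these classifiers, here is a minimal sketch of a GUID detector in the style described above; the Tag::Guid variant name and the exact pattern are assumptions rather than the crate's actual code.

use once_cell::sync::Lazy;
use regex::Regex;

use crate::types::Tag; // Tag::Guid is an assumed variant name

static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
    // Bare or brace-wrapped GUIDs, e.g. 550e8400-e29b-41d4-a716-446655440000
    Regex::new(r"^\{?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}?$")
        .unwrap()
});

/// Returns a GUID tag when the entire string is a GUID/UUID.
pub fn classify_guid(text: &str) -> Option<Tag> {
    GUID_REGEX.is_match(text).then_some(Tag::Guid)
}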

Improved IP Address Detection

  • Added robust IPv4 and IPv6 address detection in src/classification/patterns/ip.rs, including support for port suffixes and bracketed IPv6 addresses, with validation using both regex and standard library types.

Refactoring and API Changes

  • Refactored src/classification/mod.rs to use the new patterns module, and re-exported key classification functions for easier access and future extensibility.

Dependency Updates

  • Replaced lazy_static with once_cell for more efficient static initialization, added rustc-demangle for symbol demangling, and updated several dependencies for improved stability and performance in Cargo.toml.
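The practical difference is small but worth seeing: once_cell::sync::Lazy gives the same compile-once semantics without a macro. A before/after sketch, with an illustrative pattern rather than the crate's actual regex:

use once_cell::sync::Lazy;
use regex::Regex;

// Before: lazy_static! { static ref URL_REGEX: Regex = Regex::new(...).unwrap(); }
// After: a plain static whose initializer closure runs once, on first access.
static URL_REGEX: Lazy<Regex> = Lazy::new(|| Regex::new(r"https?://\S+").unwrap());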

- Implemented `SymbolDemangler` to detect and demangle mangled Rust symbols.
- Added support for Rust legacy and v0 mangling formats.
- Updated `FoundString` to store original mangled text and tags for demangled symbols.
- Enhanced `Tag` enum with `DemangledSymbol` variant.
- Added comprehensive tests for demangling functionality.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
File extensions such as exe, dll, sys, bin, dat, log, tmp, and bak are no longer treated as valid TLDs in domain classification. Added tests to ensure these extensions are not classified as domains, and updated documentation to reflect Rust-only symbol demangling.
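A hedged usage sketch of the SymbolDemangler API described above; the import path and the demangled output shown in the comment are assumptions.

use stringy::classification::SymbolDemangler; // import path assumed

fn main() {
    let demangler = SymbolDemangler::new();

    // Rust legacy (_ZN...) and v0 (_R...) manglings are both detected.
    let mangled = "_ZN4core3fmt5write17h0123456789abcdefE";
    assert!(demangler.is_mangled(mangled));

    // try_demangle returns None for strings that are not mangled symbols.
    if let Some(readable) = demangler.try_demangle(mangled) {
        println!("{readable}"); // e.g. core::fmt::write::h0123456789abcdef
    }
}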
Copilot AI review requested due to automatic review settings January 18, 2026 05:12
@coderabbitai
Contributor

coderabbitai bot commented Jan 18, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Architectural Refactoring: Modular Pattern Classification and Symbol Demangling (Rust + C++)

This PR refactors the classification subsystem into a modular pattern library, substantially expands pattern coverage (network, paths, data, IP), and integrates symbol demangling (Rust + C++) into the extraction/enrichment pipeline.

Architectural Changes

  • New modular classification root: src/classification/patterns with focused submodules:
    • ip.rs — IPv4/IPv6 detection with port/bracket handling, regex pre-filter + stdlib validation.
    • network.rs — URL and domain detection with TLD validation and edge-case handling.
    • paths.rs — POSIX, Windows, UNC, and registry path detection plus suspicious-persistence heuristics.
    • data.rs — GUID/UUID, email, Base64, printf-style format strings, and user-agent detection.
  • Centralized, cached regexes via once_cell::sync::Lazy (lazy_static removed). Several regex constants are re-exported pub(crate) to support tests and reuse.
  • Classification logic moved from module-local state to standalone pattern functions; SemanticClassifier delegates to these functions (function-based classification).

Symbol Demangling & Extraction Integration

  • New SymbolDemangler (src/classification/symbols.rs):
    • Detects Rust v0 (_R) and legacy Itanium-style (_ZN) mangled Rust symbols, as well as C++ Itanium (_Z) symbols.
    • Uses rustc-demangle for Rust and cpp_demangle for C++ demangling.
    • API: SymbolDemangler::new(), is_mangled(&str), try_demangle(&str) -> Option<String>, demangle(&mut FoundString).
    • demangle(&mut FoundString) is idempotent, preserves FoundString.original_text (stores original mangled form), replaces FoundString.text with demangled text, and appends Tag::DemangledSymbol once while preserving existing tags.
  • Extraction pipeline update (src/extraction/mod.rs): apply_semantic_enrichment runs demangler.demangle(...) first, then SemanticClassifier.classify(...) for each FoundString before deduplication/canonicalization so outputs include demangled text and associated tags.
  • New optional original_text field added to StringOccurrence and propagated through deduplication and CanonicalString::to_found_string, enabling preservation/serialization of original mangled strings for forensic fidelity.
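A sketch of the occurrence shape this implies; the field names follow the PR text, while the serde attributes are an assumption about how the optional field stays out of pre-existing serialized output.

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StringOccurrence {
    pub offset: u64,
    /// Raw mangled form, populated only when demangling rewrote the text.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub original_text: Option<String>,
    // ...remaining occurrence metadata elided
}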

API & Public Surface

  • Large expansion of public pattern functions and helpers (re-exported via patterns:: and made available through SemanticClassifier):
    • classify_{url,domain,ip_addresses,posix_path,windows_path,unc_path,registry_path,guid,email,base64,format_string,user_agent}
    • has_valid_tld, is_ipv4_address, is_ipv6_address, the is_valid_* / is_suspicious_* helper families, strip_port, strip_ipv6_brackets
  • New public type: SymbolDemangler with methods new(), is_mangled(), demangle(), try_demangle().
  • Tag enum extended with DemangledSymbol (serde rename "demangled").
  • SemanticClassifier API surface widened:
    • classify(&FoundString) -> Vec<Tag> (preserves multi-tagging).
    • new() annotated #[must_use].
    • Adds regex_cache_addresses() helper (testing/debugging: exposes addresses of cached regex patterns).

Binary Format Handling & Data Structures

  • original_text preservation improves forensic fidelity: demangled outputs keep the raw/mangled form alongside the demangled text in occurrences and serialized output.
  • pelite and other format-related crates were added to dependencies (present in Cargo.toml) enabling future binary/format interrogation enhancements.
  • Demangling is applied broadly during enrichment (imports, sections, strings) prior to deduplication to ensure canonical outputs contain human-readable symbol names while retaining originals.

Classification Accuracy & Behavior Improvements

  • IP detection: two-stage approach (regex pre-filter + stdlib parse) reduces false positives; supports port suffixes and bracketed IPv6.
  • Domain/URL detection: domain regex + known-TLD validation; classify_domain avoids double-tagging URLs; trailing-dot TLDs rejected to reduce false positives.
  • Path classifiers: stricter validation for POSIX/Windows/UNC/registry paths, plus heuristics to flag suspicious persistence-related locations.
  • Data classifiers: GUID, email, Base64, printf-format, and user-agent detectors tuned with unit tests; Base64 explicitly treated as broad/ambiguous to reduce false positives.
  • Demangler behavior: idempotent, preserves original_text, appends a single DemangledSymbol tag, and preserves pre-existing tags (e.g., Import), improving downstream semantic accuracy and readability.
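Returning to the first bullet above, the two-stage IP check can be sketched as follows; the pre-filter pattern is an assumption, and the point is that the stdlib parser performs the authoritative validation.

use std::net::Ipv4Addr;

use once_cell::sync::Lazy;
use regex::Regex;

// Stage 1: cheap shape check that rejects most non-addresses.
static IPV4_REGEX: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$").unwrap());

// Stage 2: the stdlib parse enforces octet ranges and leading-zero rules.
pub fn is_ipv4_address(text: &str) -> bool {
    IPV4_REGEX.is_match(text) && text.parse::<Ipv4Addr>().is_ok()
}

fn main() {
    assert!(is_ipv4_address("192.168.0.1"));
    assert!(!is_ipv4_address("999.1.1.1")); // passes the regex, fails the parse
}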

Dependencies & Implementation Notes

  • Replaced lazy_static with once_cell::sync::Lazy; added rustc-demangle and cpp_demangle for demangling; pelite present for PE/format work; minor bumps to serde_json and insta.
  • Many new unit tests added across patterns, demangling, and enrichment logic.
  • Public API expansion and added serialization of original_text increase surface area and review scope (notable churn in classification/semantic modules and extraction/deduplication).

Impact Summary

  • Significant expansion of classification capabilities and public API surface; moderate-to-high review effort expected (especially patterns, SemanticClassifier, extraction/dedup).
  • Backward-compatible on most core types but introduces Tag::DemangledSymbol and the optional original_text field in StringOccurrence serialization.
  • Improves semantic detection accuracy and forensic output quality by emitting demangled text while preserving original mangled data.

Walkthrough

Adds a centralized pattern-classification subsystem (data, ip, network, paths), a SymbolDemangler, expands SemanticClassifier's public surface and integrates demangling + pattern classification into extraction before deduplication; introduces Tag::DemangledSymbol and dependency updates for demangling and lazy init.

Changes

Cohort / File(s) / Summary

  • Dependency Management (Cargo.toml): Added cpp_demangle, once_cell, pelite, rustc-demangle; bumped serde_json and insta. Attention: native demangler crates and feature pins.
  • Top-level classification (src/classification/mod.rs): Added mod patterns; and pub mod symbols;, and re-exported SymbolDemangler.
  • Patterns index (src/classification/patterns/mod.rs): New aggregator module exposing the data, ip, network, and paths classifiers and selected internal regexes.
  • Data-format patterns (src/classification/patterns/data.rs): New classifiers classify_guid, classify_email, classify_base64, classify_format_string, and classify_user_agent; lazy regexes and tests. Review regex strictness and false-positive handling.
  • IP patterns & helpers (src/classification/patterns/ip.rs): IPv4/IPv6 detection, strip_port, strip_ipv6_brackets, validation helpers, and classify_ip_addresses. Careful: port/bracket edge cases and leading-zero rules.
  • Network patterns (src/classification/patterns/network.rs): classify_url, classify_domain, and has_valid_tld with URL_REGEX/DOMAIN_REGEX and a TLD list. Attention: TLD list maintenance and URL/domain boundary rules.
  • Path patterns (src/classification/patterns/paths.rs): POSIX/Windows/UNC/registry validators and classifiers plus suspicious-path detection; many regexes and heuristics. Review for OS-specific edge cases and permission/persistence indicators.
  • Semantic classifier (src/classification/semantic.rs): classify delegates to pattern functions; many helpers re-exported publicly; new() annotated #[must_use]; added the regex_cache_addresses test API. Public surface expanded significantly.
  • Symbol demangler (src/classification/symbols.rs): New SymbolDemangler with is_mangled, try_demangle, and demangle (in-place FoundString demangling); preserves original_text and appends Tag::DemangledSymbol; comprehensive tests. Review safety of demangler fallbacks and memory/ownership handling.
  • Types (src/types.rs): Added Tag::DemangledSymbol (serde rename = "demangled").
  • Extraction / dedup (src/extraction/mod.rs, src/extraction/dedup.rs): StringOccurrence now holds original_text; semantic enrichment (demangle + classify) is applied before deduplication; canonicalization flows original_text back into FoundString. Verify serialization impact and backwards compatibility.
  • Docs (docs/src/classification.md): Doc rework covering once_cell lazy init, expanded pattern descriptions, and tag specificity notes; removed low-level snippets.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Extract as "extract()/extract_from_section()"
  participant Found as "FoundString"
  participant Symbol as "SymbolDemangler"
  participant Semantic as "SemanticClassifier"
  participant Patterns as "patterns::*"
  Note over Extract,Found: collect found strings
  Extract->>Found: iterate found strings
  loop per found string
    Extract->>Symbol: demangle(&mut FoundString)
    alt mangled symbol detected
      Symbol-->>Found: mutate text + set original_text
      Symbol-->>Extract: add Tag::DemangledSymbol
    end
    Extract->>Semantic: classify(&FoundString)
    Semantic->>Patterns: classify_url/classify_domain/classify_ip/... 
    Patterns-->>Semantic: Option<Tag>/Vec<Tag>
    Semantic-->>Extract: aggregated tags
    Extract->>Found: merge tags into FoundString
  end
  Extract->>Extract: deduplicate/canonicalize (preserve original_text)
  Extract-->>Client: final results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • Copilot

Poem

Demangle the knots, restore the name,
IPs and paths put under frame,
Regexes idle, then spring to test,
Original text kept safe and blessed,
Tags align — classifiers proclaim.

🚥 Pre-merge checks | ✅ 4 | ❌ 4
❌ Failed checks (4 warnings)
  • Title check ⚠️ Warning: Title follows Conventional Commits format with correct type and scope, but is misleading, since symbol demangling is only one of many changes. Resolution: revise the title to reflect the primary change, e.g. feat(classification): refactor pattern detection into modular submodules or feat(classification): add modular pattern classification system.
  • No Unsafe Code ⚠️ Warning: The crate lacks the #![forbid(unsafe_code)] directive in src/lib.rs, which is required to prevent unsafe code introduction across the entire crate. Resolution: add #![forbid(unsafe_code)] as the first non-documentation line in src/lib.rs before module declarations and verify compilation and tests pass.
  • Error Handling ⚠️ Warning: The PR introduces extensive .unwrap() calls on Regex::new() and silently suppresses external crate errors from cpp_demangle and rustc_demangle without mapping to StringyError, violating established project error handling conventions. Resolution: add StringyError variants for demangling/pattern errors, replace .unwrap() with proper error handling, map external crate errors using #[from], and change classification signatures from Option<Tag> to Result<Tag, StringyError>.
  • File Size Limit ⚠️ Warning: src/classification/symbols.rs exceeds the 500-line limit at 508 lines, overshooting the new file size constraint by 8 lines. Resolution: refactor by extracting helper functions or test logic into separate modules to bring the file below 500 lines.
✅ Passed checks (4 passed)
  • Description check ✅ Passed: Description accurately covers pattern classification modules, data format detection, IP detection, refactoring, and dependency updates with specific file references.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 70.00%.
  • Ascii Only ✅ Passed: No new Unicode punctuation introduced in the PR. Pre-existing Chinese characters in test strings and dashes in concept.md are not newly added code artifacts.
  • Section Weight Consistency ✅ Passed: The PR makes no modifications to container parser files (src/container/*.rs), so section weight logic remains unchanged and consistent.



@coderabbitai coderabbitai bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Jan 18, 2026
Copilot AI left a comment

Pull request overview

This pull request introduces Rust symbol demangling capabilities and significantly expands the pattern classification system with new modules for detecting various data formats. The changes refactor existing classification code into a modular structure and replace lazy_static with once_cell for improved performance.

Changes:

  • Added symbol demangling for Rust using rustc-demangle with support for legacy (_ZN) and v0 (_R) mangling schemes
  • Introduced new patterns module with submodules for IP addresses, network indicators, file paths, and data formats (GUIDs, emails, Base64, format strings, user agents)
  • Refactored SemanticClassifier to delegate to pattern classification functions while maintaining backward compatibility

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Summary per file

  • Cargo.toml: Replaced lazy_static with once_cell, added the rustc-demangle dependency, and updated minor versions of serde_json and insta
  • src/types.rs: Added the DemangledSymbol tag variant to the Tag enum
  • src/classification/mod.rs: Updated module documentation to reflect new capabilities; added patterns and symbols submodules with public exports
  • src/classification/symbols.rs: New module implementing Rust symbol demangling with detection, demangling, and comprehensive tests
  • src/classification/semantic.rs: Refactored to delegate classification to pattern functions while maintaining public API compatibility
  • src/classification/patterns/mod.rs: New module organizing pattern classification into submodules with public re-exports
  • src/classification/patterns/ip.rs: Extracted IPv4 and IPv6 address detection with regex patterns and validation using the standard library
  • src/classification/patterns/network.rs: Extracted URL and domain detection with TLD validation using a comprehensive list of common TLDs
  • src/classification/patterns/paths.rs: Extracted file and registry path detection for POSIX, Windows, UNC, and registry paths
  • src/classification/patterns/data.rs: New module for detecting GUIDs, emails, Base64, format strings, and user agents
  • tests/snapshots/classification_integration__classification_snapshots.snap: Updated snapshot showing a Windows path now classified with both Domain and FilePath tags

Comment on lines 111 to 113
    let haystack_lower = haystack.to_ascii_lowercase();
    let needle_lower = needle.to_ascii_lowercase();
    haystack_lower.contains(&needle_lower)
Copilot AI Jan 18, 2026
The function allocates two new strings by converting both haystack and needle to lowercase. This is less efficient than the byte-wise comparison approach that was used in the original semantic.rs module. Consider using a more efficient implementation that avoids allocations, such as iterating through bytes and comparing them case-insensitively using eq_ignore_ascii_case on byte slices or using a windowing approach.

Suggested change
-    let haystack_lower = haystack.to_ascii_lowercase();
-    let needle_lower = needle.to_ascii_lowercase();
-    haystack_lower.contains(&needle_lower)
+    // Match the behavior of `str::contains`: empty needle always matches.
+    if needle.is_empty() {
+        return true;
+    }
+    let haystack_bytes = haystack.as_bytes();
+    let needle_bytes = needle.as_bytes();
+    if needle_bytes.len() > haystack_bytes.len() {
+        return false;
+    }
+    let last_start = haystack_bytes.len() - needle_bytes.len();
+    for start in 0..=last_start {
+        let window = &haystack_bytes[start..start + needle_bytes.len()];
+        if window.eq_ignore_ascii_case(needle_bytes) {
+            return true;
+        }
+    }
+    false

Comment on lines +111 to +114
    let has_upper = text.chars().any(|c| c.is_ascii_uppercase());
    let has_lower = text.chars().any(|c| c.is_ascii_lowercase());
    let has_digit = text.chars().any(|c| c.is_ascii_digit());

Copilot AI Jan 18, 2026
The Base64 classification function iterates through the text multiple times: once to check all characters are valid Base64 (line 87), once to count padding (line 95), and three separate times to check for uppercase, lowercase, and digits (lines 111-113). This could be optimized by combining these checks into a single pass through the string to improve performance, especially for longer strings.

Suggested change
-    let has_upper = text.chars().any(|c| c.is_ascii_uppercase());
-    let has_lower = text.chars().any(|c| c.is_ascii_lowercase());
-    let has_digit = text.chars().any(|c| c.is_ascii_digit());
+    let mut has_upper = false;
+    let mut has_lower = false;
+    let mut has_digit = false;
+    for c in text.chars() {
+        if c.is_ascii_uppercase() {
+            has_upper = true;
+        } else if c.is_ascii_lowercase() {
+            has_lower = true;
+        } else if c.is_ascii_digit() {
+            has_digit = true;
+        }
+        if has_upper && has_lower && has_digit {
+            break;
+        }
+    }

@codecov

codecov bot commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 92.54658% with 60 lines in your changes missing coverage. Please review.

Files with missing lines (patch % / lines missing):

  • src/classification/semantic.rs: 65.74%, 37 missing ⚠️
  • src/classification/patterns/paths.rs: 93.16%, 16 missing ⚠️
  • src/classification/patterns/data.rs: 96.87%, 3 missing ⚠️
  • src/classification/symbols.rs: 98.56%, 3 missing ⚠️
  • src/classification/patterns/network.rs: 98.27%, 1 missing ⚠️


Integrates the cpp_demangle crate and extends SymbolDemangler to handle C++ Itanium ABI mangled symbols (e.g., _Z, _ZN). Updates documentation, detection logic, and adds tests for C++ symbol demangling alongside Rust symbol support.
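A minimal sketch of the Rust-first, C++-fallback flow this commit describes, using the public APIs of the rustc-demangle and cpp_demangle crates; the exact detection order inside SymbolDemangler is an assumption.

// Try Rust demangling first, then fall back to C++ Itanium.
fn try_demangle_symbol(mangled: &str) -> Option<String> {
    // rustc-demangle validates and demangles Rust v0 (_R) and legacy (_ZN...) symbols.
    if let Ok(demangled) = rustc_demangle::try_demangle(mangled) {
        return Some(demangled.to_string());
    }
    // Remaining _Z-prefixed symbols are tried as C++ Itanium ABI.
    if mangled.starts_with("_Z") {
        if let Ok(symbol) = cpp_demangle::Symbol::new(mangled) {
            return Some(symbol.to_string());
        }
    }
    None
}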
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 102-108: The current length check using (content_len +
padding_count).is_multiple_of(4) wrongly rejects valid unpadded Base64; update
the logic around variables content_len and padding_count to permit total lengths
whose length mod 4 is 0, 2, or 3 and only reject when the remainder is 1.
Concretely, compute the remainder of (content_len + padding_count) mod 4 and
return None only if that remainder equals 1 (instead of requiring an exact
multiple-of-4), so JWT-style unpadded segments (mod 2 or 3) are accepted.

In `@src/classification/patterns/network.rs`:
- Around line 26-33: COMMON_TLDS currently contains file extensions that are not
real TLDs and cause false positives; remove entries like "exe", "dll", "sys",
"bin", "dat", "log", "tmp", "bak" from the COMMON_TLDS constant in network.rs
(the const COMMON_TLDS: &[&str] array) so it only lists actual TLDs, and if
needed implement separate filename/extension checks inside the path classifier
(or a new extension heuristic) rather than conflating them with domain TLD
recognition.

In `@src/classification/patterns/paths.rs`:
- Around line 302-307: The current is_suspicious_windows_path function uses
contains_ascii_case_insensitive which does substring matching and can
false-positive on embedded paths; change it to perform a case-insensitive prefix
check instead (e.g., for each pattern in SUSPICIOUS_WINDOWS_PATHS compare the
lowercase pattern to text.lowercase().starts_with(&pattern_lower) or call an
existing starts_with_ascii_case_insensitive helper), updating references to
contains_ascii_case_insensitive accordingly in is_suspicious_windows_path; if
substring behavior is intentional, add a clear doc comment above
is_suspicious_windows_path explaining that embedded path matches are desired.
- Around line 30-45: Replace the verbose repeated insert() calls when building
the static SUSPICIOUS_POSIX_PATHS with a more idiomatic construction: build the
set from an array or iterator (e.g. using HashSet::from_iter or
vec![...].into_iter().collect()) inside the Lazy::new closure so the same string
literals populate the HashSet in one expression; apply the same pattern to other
HashSet statics in the file (referencing SUSPICIOUS_POSIX_PATHS and any other
*_PATHS or similar HashSet statics).
- Around line 182-187: The loop over KNOWN_WINDOWS_PREFIXES incorrectly uses
contains_ascii_case_insensitive to check prefixes, which allows matches in the
middle of the string; change the logic to perform a case-insensitive starts_with
check instead (e.g., implement or use a case-insensitive starts_with helper and
call it on text with each prefix) so the prefix match in the for loop only
returns true when text begins with the given prefix; update the invocation where
contains_ascii_case_insensitive(text, prefix) is used (in the function that
iterates KNOWN_WINDOWS_PREFIXES) to call the new/updated starts-with helper.
- Around line 109-114: contains_ascii_case_insensitive currently allocates two
Strings via to_ascii_lowercase on every call; replace it with an allocation-free
ASCII case-insensitive substring search by iterating over haystack.as_bytes()
and comparing each candidate window to needle.as_bytes() using byte-wise
to_ascii_lowercase() on u8 (or similar inline lowercasing) rather than building
new Strings; update the function contains_ascii_case_insensitive to scan
haystack bytes, for each start index compare needle length bytes with on-the-fly
lowercasing, and return true on a match (use needle.is_empty() and length checks
as needed).

@unclesp1d3r unclesp1d3r requested a review from Copilot January 18, 2026 05:21
@unclesp1d3r unclesp1d3r self-assigned this Jan 18, 2026
@unclesp1d3r unclesp1d3r added this to the v0.1 milestone Jan 18, 2026
Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comment on lines +213 to +222
    // Also accept paths that use forward slashes (some tools do this)
    if upper_text.starts_with("HKEY_")
        && text.contains('/')
        && let Some(slash_pos) = text.find('/')
    {
        let root = &upper_text[..slash_pos];
        if VALID_REGISTRY_ROOTS.contains(root) {
            return true;
        }
    }
Copilot AI Jan 18, 2026
The logic for handling forward slashes in registry paths may be overly permissive. At line 214-215, the code checks upper_text.starts_with("HKEY_") && text.contains('/'), which means any occurrence of a forward slash anywhere in the path will trigger this branch. This could match invalid paths like "HKEY_LOCAL_MACHINE\SOFTWARE/mixed" or paths with forward slashes in unexpected places. Consider checking if the path uses forward slashes consistently (e.g., text.find('\\').is_none()) or at least verify the forward slash is used as the first separator after the root key.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/classification/patterns/network.rs`:
- Around line 43-51: The TLD lookup in has_valid_tld currently does an O(n) scan
over COMMON_TLDS; change COMMON_TLDS from a Vec<&'static str> (or slice) to a
HashSet<&'static str> (or a &'static HashSet) and initialize it once (e.g., via
lazy_static or once_cell::sync::Lazy) so has_valid_tld can call
COMMON_TLDS.contains(&tld_lower.as_str()) with O(1) expected lookup; update the
COMMON_TLDS definition and any imports (HashSet, Lazy) and ensure has_valid_tld
still lowercases the TLD before lookup.

In `@src/classification/symbols.rs`:
- Around line 165-167: The demangling code currently unconditionally overwrites
string.original_text which can erase prior provenance; change the logic in the
demangling routine (the block that sets string.original_text and string.text) so
that you only assign string.original_text = Some(string.text.clone()) when
string.original_text.is_none(), then always update string.text = demangled_str;
this preserves any existing original_text while still replacing text with the
demangled form.
- Around line 52-59: The public struct SymbolDemangler must be marked
non-exhaustive to allow future fields without breaking semver; add the attribute
#[non_exhaustive] immediately before the derive and the pub struct declaration
for SymbolDemangler (leave the existing pub struct SymbolDemangler and its
impl/new() intact) so the attribute applies to that type.
♻️ Duplicate comments (1)
src/classification/patterns/network.rs (1)

26-34: Previous issue resolved - file extensions removed from TLD list.

The problematic entries (exe, dll, sys, bin, dat, log, tmp, bak) have been removed. The test suite at lines 170-181 now explicitly validates that file extensions are not misclassified as domains. This directly addresses the false positive reduction goal for binary analysis.
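Combining the suggested O(1) lookup with the trailing-dot rejection mentioned in the walkthrough, a sketch of the lookup path; the set contents here are illustrative, not the crate's full TLD list.

use std::collections::HashSet;

use once_cell::sync::Lazy;

static COMMON_TLDS: Lazy<HashSet<&'static str>> =
    Lazy::new(|| ["com", "net", "org", "io", "dev"].into_iter().collect());

pub fn has_valid_tld(domain: &str) -> bool {
    // A trailing dot would otherwise yield an empty TLD; reject it explicitly.
    if domain.ends_with('.') {
        return false;
    }
    match domain.rfind('.') {
        Some(pos) => COMMON_TLDS.contains(domain[pos + 1..].to_ascii_lowercase().as_str()),
        None => false,
    }
}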

Comment on lines +52 to +59
#[derive(Debug, Default, Clone)]
pub struct SymbolDemangler;

impl SymbolDemangler {
    /// Create a new instance of the symbol demangler
    #[must_use]
    pub fn new() -> Self {
        Self

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read the relevant lines
if [ -f "src/classification/symbols.rs" ]; then
  echo "=== File exists, reading lines 50-65 ==="
  sed -n '50,65p' src/classification/symbols.rs | cat -n
else
  echo "File not found at src/classification/symbols.rs"
  find . -name "symbols.rs" -type f 2>/dev/null
fi

Repository: EvilBit-Labs/Stringy

Length of output: 767


🏁 Script executed:

# Also check the broader context to understand the struct definition
if [ -f "src/classification/symbols.rs" ]; then
  echo "=== Checking for struct SymbolDemangler definition ==="
  rg -n "struct SymbolDemangler" src/classification/symbols.rs -A 5 -B 2
fi

Repository: EvilBit-Labs/Stringy

Length of output: 372


🏁 Script executed:

# Verify if the struct is public and check for #[non_exhaustive]
if [ -f "src/classification/symbols.rs" ]; then
  echo "=== Checking full context around SymbolDemangler ==="
  rg -n "SymbolDemangler" src/classification/symbols.rs | head -20
fi

Repository: EvilBit-Labs/Stringy

Length of output: 1121


Add #[non_exhaustive] to the public struct.

The coding guidelines require #[non_exhaustive] on all public structs to allow future extension without semver breaks. The struct already has an explicit constructor, so add the attribute before the derive macro.

Proposed fix
+#[non_exhaustive]
 #[derive(Debug, Default, Clone)]
 pub struct SymbolDemangler;

Comment on lines +165 to +167
        // Store original mangled form and replace with demangled
        string.original_text = Some(string.text.clone());
        string.text = demangled_str;
⚠️ Potential issue | 🟠 Major

Do not overwrite existing original_text.

demangle unconditionally replaces original_text, which can erase prior provenance if another transform already populated it. Only set original_text when it is None.

Proposed fix
-        string.original_text = Some(string.text.clone());
+        if string.original_text.is_none() {
+            string.original_text = Some(string.text.clone());
+        }
         string.text = demangled_str;

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/classification/patterns/paths.rs`:
- Around line 1-4: The paths.rs module exceeds the 500-line guideline and should
be split into smaller focused modules; create separate modules (e.g., posix.rs,
windows.rs, unc.rs, registry.rs) and move the corresponding detection
functions/structs (the POSIX path classifier, Windows path classifier, UNC path
classifier, and registry path classifier) from paths.rs into these new files,
export them from a new mod.rs or from a restructured patterns::paths mod that
re-exports pub use posix::*, pub use windows::*, etc., and update any
references/imports in the codebase to use patterns::paths::{PosixClassifier,
WindowsClassifier, UncClassifier, RegistryClassifier} (or the actual
function/struct names used) so behavior is unchanged but each file stays under
the line limit.
♻️ Duplicate comments (2)
src/classification/patterns/paths.rs (1)

235-244: Avoid accepting mixed separators in registry paths.

Line 235 allows forward-slash registry paths even when backslashes are present, so mixed separators like "HKEY_LOCAL_MACHINE\SOFTWARE/mixed" are treated as valid and increase false positives. Consider requiring forward slashes only when no backslashes exist.

Proposed fix
-    if upper_text.starts_with("HKEY_")
-        && text.contains('/')
-        && let Some(slash_pos) = text.find('/')
-    {
+    if upper_text.starts_with("HKEY_")
+        && text.contains('/')
+        && !text.contains('\\')
+        && let Some(slash_pos) = text.find('/')
+    {
src/classification/patterns/data.rs (1)

79-133: Reject padded Base64 strings unless the total length is a multiple of 4.

Line 105 only rejects remainder 1, so padded strings with length mod 4 of 2 or 3 can be incorrectly accepted and increase false positives.

Proposed fix
-    // Strip padding for length check
-    let content_len = text.len() - padding_count;
-
-    // Valid Base64 content length should avoid mod 4 remainder of 1
-    let remainder = (content_len + padding_count) % 4;
-    if remainder == 1 {
-        return None;
-    }
+    let total_len = text.len();
+    let len_mod = total_len % 4;
+    if padding_count > 0 {
+        // Padded Base64 must be a multiple of 4
+        if len_mod != 0 {
+            return None;
+        }
+    } else if len_mod == 1 {
+        // Unpadded Base64 cannot have remainder 1
+        return None;
+    }

Comment on lines +1 to +4
//! File and registry path classification patterns
//!
//! This module provides POSIX, Windows, UNC, and registry path detection.


🛠️ Refactor suggestion | 🟠 Major

Split this module to stay under the 500-line limit.

This file is 504 lines, which is above the 500-line guideline. As per coding guidelines, please split the path classifiers into smaller modules.


Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/extraction/dedup.rs (1)

311-350: Test helper doesn't exercise original_text field.

The create_test_string helper always sets original_text: None. Consider adding a test case that verifies original_text propagation through deduplication, particularly for the to_found_string() round-trip.

Suggested test case
#[test]
fn test_original_text_preservation() {
    let mut fs = create_test_string(
        "demangled_name",
        Encoding::Utf8,
        0x100,
        None,
        StringSource::SectionData,
        vec![],
        10,
        0.8,
    );
    fs.original_text = Some("_ZN7mangled4nameE".to_string());
    
    let strings = vec![fs];
    let canonical = deduplicate(strings, None, true);
    
    assert_eq!(canonical[0].occurrences[0].original_text, 
               Some("_ZN7mangled4nameE".to_string()));
    
    let found = canonical[0].to_found_string();
    assert_eq!(found.original_text, Some("_ZN7mangled4nameE".to_string()));
}
docs/src/classification.md (2)

158-160: Documentation inconsistency: lazy_static vs once_cell.

Line 119 and the code example at 122-126 correctly show once_cell::sync::Lazy, but line 160 still references lazy_static!. Update for consistency with the actual implementation.

Suggested fix
-The classifier relies on `lazy_static!` to compile regex patterns once and reuse them across classification calls. Helper methods validate strings before assigning tags.
+The classifier relies on `once_cell::sync::Lazy` to compile regex patterns once and reuse them across classification calls. Helper methods validate strings before assigning tags.

180-192: Example code missing original_text field.

The FoundString struct now has an original_text: Option<String> field (per the types.rs changes). Update the example to include it for accuracy.

Suggested fix
 let found_string = FoundString {
     text: "C:\\Windows\\System32\\cmd.exe".to_string(),
+    original_text: None,
     encoding: Encoding::Ascii,
     offset: 0,
🤖 Fix all issues with AI agents
In `@docs/src/classification.md`:
- Around line 113-119: In the "Tag Specificity" section update the phrasing for
clarity: change "high confidence matches" to "high-confidence matches"
(hyphenate when used as a compound adjective) and tighten "and should be treated
as ambiguous" to "which should be treated as ambiguous" so the sentence reads
smoothly; ensure the rest of the paragraph (mentioning examples like URL,
domain, IP, file path, GUID, email, format string, and user agent) remains
unchanged.

In `@src/extraction/mod.rs`:
- Around line 525-536: The demangling + classification loop duplicated in the
extraction flow should be extracted into a private helper to avoid maintenance
drift: create a function (e.g., fn apply_classification(all_strings: &mut
[YourStringType]) { ... }) that instantiates SemanticClassifier::new() and
SymbolDemangler::new() and runs the same demangler.demangle(string) and
classifier.classify(string) logic while merging tags into string.tags; then
replace the duplicated blocks in the current function and in extract_canonical()
with a call to apply_classification(&mut all_strings) so both call sites share
the same implementation and behavior.
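A sketch of the shared helper this asks for; the type names follow the PR description (SymbolDemangler, SemanticClassifier, FoundString), while the tag-merge details and imports are assumptions.

fn apply_semantic_enrichment(all_strings: &mut [FoundString]) {
    let demangler = SymbolDemangler::new();
    let classifier = SemanticClassifier::new();
    for string in all_strings.iter_mut() {
        // Demangle first so classification sees the human-readable text.
        demangler.demangle(string);
        for tag in classifier.classify(string) {
            if !string.tags.contains(&tag) {
                string.tags.push(tag);
            }
        }
    }
}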

Comment on lines +113 to +119
## Tag Specificity

The semantic classifier uses cached regex patterns via `lazy_static!` and applies validation checks to reduce false positives.
Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.

## Pattern Matching Engine

The semantic classifier uses cached regex patterns via `once_cell::sync::Lazy` and applies validation checks to reduce false positives.
⚠️ Potential issue | 🟡 Minor

Minor grammar and clarity improvements.

Line 115: "high confidence" should be hyphenated as "high-confidence" when used as a compound adjective before a noun. Also, "and should be treated as" could be tightened to "which should be treated as" for flow.

Suggested fix
-Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.
+Tags are treated as either specific or broad. Specific tags indicate high-confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag, which should be treated as ambiguous due to higher false-positive risk.

Test modules for classification patterns have been refactored to combine similar test cases, reducing redundancy and improving maintainability. Semantic enrichment logic in extraction has been moved to a dedicated helper function. Unused deduplication parameter has been removed for clarity.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/extraction/dedup.rs (1)

64-86: Remove dead parameter _preserve_all_occurrences — implementation ignores documented behavior.

The docstring claims this parameter controls whether to "store occurrence count instead of full metadata," but the function always preserves full metadata via found_string_to_occurrence(), making the parameter dead code. The underscore prefix confirms it's intentionally unused.

Removing this parameter requires updating 26 call sites (2 in src/extraction/mod.rs + tests), but it eliminates the API inconsistency and stale documentation. Alternatively, implement the documented behavior if occurrence-count-only mode is actually needed.

Suggested fix: remove unused parameter
-/// * `preserve_all_occurrences` - If false, only store occurrence count instead of full metadata
-///
 /// # Returns
 ///
 /// Vector of canonical strings sorted by combined_score (descending)
@@ pub fn deduplicate(
     strings: Vec<FoundString>,
     dedup_threshold: Option<usize>,
-    _preserve_all_occurrences: bool,
 ) -> Vec<CanonicalString> {

Update both calls in src/extraction/mod.rs (lines 544-548, 651-655) and all test invocations.

🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 20-21: Add a doc comment for the static EMAIL_REGEX (the
Lazy<Regex> named EMAIL_REGEX) explaining that this pattern is intentionally
simplified for binary/string extraction: it will match short forms like "a@b.cc"
and common unquoted local-parts but does not support quoted local-parts, some
valid edge cases (e.g., certain plus/escape forms and RFC 5322 full syntax), nor
internationalized domain names; state the tradeoff (reduced false positives vs.
not fully RFC-compliant) so future readers know the limitation and why this
simplified regex was chosen.

In `@src/classification/patterns/ip.rs`:
- Around line 117-145: The function is_ipv6_address currently never uses
IPV6_REGEX; update it to apply the IPV6_REGEX pre-filter after handling
bracketed addresses (after strip_port and strip_ipv6_brackets) before the
character checks and Ipv6Addr::from_str validation, similar to is_ipv4_address,
so inputs failing the regex are rejected early; alternatively, if the regex is
unused elsewhere, remove the unused IPV6_REGEX definition—refer to
is_ipv6_address, IPV6_REGEX, strip_port, strip_ipv6_brackets, and
Ipv6Addr::from_str when making the change.
- Around line 24-26: The IPV6_REGEX constant currently holds a massive,
hard-to-maintain pattern; replace it with a much simpler pre-filter regex (e.g.,
allow only hex digits, colons, and optional IPv4-mapped suffix) and then
validate canonical correctness with Ipv6Addr::from_str inside the code paths
that use IPV6_REGEX, or if you must keep the complex regex for performance, add
a comment above pub(crate) static IPV6_REGEX describing the performance tradeoff
and why the full pattern is required; update usages to fall back to
Ipv6Addr::from_str for authoritative validation.
- Around line 41-42: The IPV6_BRACKETS_REGEX is currently greedy and can capture
trailing text; update its pattern to only capture the content inside the
brackets and anchor the match so it doesn't include characters after the closing
bracket (e.g. use a pattern like r"^\[([^\]]+)\]$" or r"^\[([^\]]+)\]" to
capture only what's between '[' and ']' in the static IPV6_BRACKETS_REGEX
Lazy<Regex> declaration).

In `@src/classification/patterns/network.rs`:
- Around line 46-54: The has_valid_tld function currently treats a domain ending
with '.' as having an empty TLD (which falls through to false); explicitly
handle trailing dots by either returning false when domain.ends_with('.') or by
trimming a single trailing '.' before computing tld (update has_valid_tld to
check for domain.ends_with('.') and return false, or strip the trailing dot then
use rfind and COMMON_TLDS lookup) so the behavior is explicit and documented in
the function comment.

In `@src/classification/patterns/paths.rs`:
- Around line 329-341: The two functions use different match semantics which is
intentional but not documented: is_suspicious_windows_path uses prefix
(starts_with) matching while is_suspicious_registry_path uses substring
(contains) matching; add a brief doc comment above each function
(is_suspicious_windows_path and is_suspicious_registry_path) clarifying the
rationale (windows paths should be matched by prefix like "C:\Windows\…" whereas
registry keys may appear anywhere in the path and thus use substring matching)
so future readers understand the asymmetry.
- Around line 291-301: Add a unit test that ensures classify_unc_path returns
None for a UNC root with a trailing backslash (e.g., "\\server\\") to cover the
case where UNC_PATH_REGEX matches but the share name is missing; create a test
function (e.g., test_classify_unc_path_missing_share) that calls
classify_unc_path with that string and asserts the result is None, referencing
classify_unc_path and UNC_PATH_REGEX as the functions/regex under test.
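A sketch of the regression test requested in the last item above; the raw-string literal keeps the backslashes literal.

#[test]
fn test_classify_unc_path_missing_share() {
    // A UNC root with a trailing backslash has no share name and
    // should not be classified as a file path.
    assert!(classify_unc_path(r"\\server\").is_none());
}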
♻️ Duplicate comments (3)
docs/src/classification.md (1)

115-115: Hyphenate compound adjective and tighten clause.
This matches a previously noted wording fix. Please hyphenate "high-confidence" and tighten "and should be treated as" to "which should be treated as".

src/classification/patterns/data.rs (1)

111-130: Multiple iterations over text for character class checks.

Lines 112-114 iterate the string three separate times. For hot paths with long strings, a single-pass approach is more efficient.

Single-pass character diversity check
-    // Check for character diversity typical of Base64
-    let has_upper = text.chars().any(|c| c.is_ascii_uppercase());
-    let has_lower = text.chars().any(|c| c.is_ascii_lowercase());
-    let has_digit = text.chars().any(|c| c.is_ascii_digit());
+    // Check for character diversity typical of Base64 (single pass)
+    let (mut has_upper, mut has_lower, mut has_digit) = (false, false, false);
+    for c in text.chars() {
+        if c.is_ascii_uppercase() {
+            has_upper = true;
+        } else if c.is_ascii_lowercase() {
+            has_lower = true;
+        } else if c.is_ascii_digit() {
+            has_digit = true;
+        }
+        // Early exit if all flags are set
+        if has_upper && has_lower && has_digit {
+            break;
+        }
+    }
src/classification/patterns/paths.rs (1)

235-244: Forward-slash registry path logic is overly permissive.

The condition upper_text.starts_with("HKEY_") && text.contains('/') matches if a forward slash appears anywhere, including in value names. For example, HKEY_LOCAL_MACHINE\SOFTWARE\Path/Name would incorrectly trigger this branch and extract HKEY_LOCAL_MACHINE\SOFTWARE\Path as the root (wrong separator position).

Stricter forward-slash handling
     // Also accept paths that use forward slashes (some tools do this)
-    if upper_text.starts_with("HKEY_")
-        && text.contains('/')
-        && let Some(slash_pos) = text.find('/')
-    {
+    // Only if the path uses forward slashes consistently (no backslashes before first slash)
+    if upper_text.starts_with("HKEY_") && let Some(slash_pos) = text.find('/') {
+        // Ensure no backslash appears before the first forward slash
+        if text[..slash_pos].contains('\\') {
+            return false;
+        }
         let root = &upper_text[..slash_pos];
         if VALID_REGISTRY_ROOTS.contains(root) {
             return true;
         }
     }

- Updated email regex to simplify matching for binary and string extraction.
- Refined IPv6 regex to allow only valid characters and improved validation checks.
- Added validation for trailing dots in TLD checks.
- Enhanced suspicious path checks with clearer matching strategies for Windows and registry paths.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
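A sketch of what the documented, intentionally simplified email pattern might look like after this commit; the exact regex is an assumption.

use once_cell::sync::Lazy;
use regex::Regex;

/// Intentionally simplified email pattern for binary/string extraction.
///
/// Matches common unquoted forms (including short hits like `a@b.cc`) but
/// not quoted local-parts, full RFC 5322 syntax, or internationalized
/// domain names. The tradeoff favors fewer false positives over completeness.
static EMAIL_REGEX: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$").unwrap());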
@unclesp1d3r unclesp1d3r enabled auto-merge (squash) January 18, 2026 06:12
@unclesp1d3r unclesp1d3r merged commit c3b90ed into main Jan 18, 2026
19 of 20 checks passed
@unclesp1d3r unclesp1d3r deleted the symantic_classification_completion branch January 18, 2026 06:13
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 101-116: The current padding logic only counts trailing '=' but
allows '=' in the middle and doesn't require overall length divisibility when
padding is present; update the check around variables padding_count, content_len
and remainder so that you first verify there are no '=' characters in the
non-padded prefix (i.e., ensure text[..content_len] contains no '='), and then
if padding_count > 0 enforce that text.len() % 4 == 0 (instead of the loose
remainder check); keep the existing padding_count <= 2 constraint and maintain
the remainder == 1 early-out for the no-padding case.

In `@src/classification/patterns/paths.rs`:
- Around line 215-244: The is_valid_registry_path function uses
text.to_uppercase(), which can expand Unicode and invalidate byte indices when
later slicing with positions from the original text; change both calls that
create upper_text to text.to_ascii_uppercase() so byte lengths remain stable
(Windows registry roots are ASCII), ensuring the subsequent find('\\') and
find('/') positions from the original text remain valid when you slice
upper_text[..slash_pos] and check VALID_REGISTRY_ROOTS.

Comment on lines +101 to +116
    // Count padding characters
    let padding_count = text.chars().rev().take_while(|&c| c == '=').count();

    // Padding should be at most 2 characters
    if padding_count > 2 {
        return None;
    }

    // Strip padding for length check
    let content_len = text.len() - padding_count;

    // Valid Base64 content length should avoid mod 4 remainder of 1
    let remainder = (content_len + padding_count) % 4;
    if remainder == 1 {
        return None;
    }
⚠️ Potential issue | 🟠 Major

Padding validation can accept invalid Base64 shapes.

Only trailing padding is counted. Strings with '=' in the middle or padded strings whose total length is not a multiple of 4 can slip through, increasing false positives. Add a check that padding is only at the end and enforce len%4==0 when padding exists.

Proposed fix
     // Padding should be at most 2 characters
     if padding_count > 2 {
         return None;
     }

-    // Strip padding for length check
-    let content_len = text.len() - padding_count;
-
-    // Valid Base64 content length should avoid mod 4 remainder of 1
-    let remainder = (content_len + padding_count) % 4;
-    if remainder == 1 {
-        return None;
-    }
+    // Padding must only appear at the end
+    let padding_start = text.len() - padding_count;
+    if text[..padding_start].contains('=') {
+        return None;
+    }
+
+    // Padded Base64 must have total length multiple of 4
+    let remainder = text.len() % 4;
+    if padding_count > 0 {
+        if remainder != 0 {
+            return None;
+        }
+    } else if remainder == 1 {
+        return None;
+    }
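
For reference, a self-contained version of the tightened check. The standalone name has_valid_base64_shape and the bool return are illustrative (the PR's classifier returns an Option and bails with None); the body mirrors the proposed fix:

fn has_valid_base64_shape(text: &str) -> bool {
    // Count trailing '=' characters; Base64 allows at most two.
    let padding_count = text.chars().rev().take_while(|&c| c == '=').count();
    if padding_count > 2 {
        return false;
    }

    // Padding must only appear at the end.
    let padding_start = text.len() - padding_count;
    if text[..padding_start].contains('=') {
        return false;
    }

    // Padded input must have a total length that is a multiple of 4;
    // unpadded input may be any length except remainder 1, which no
    // encoding produces.
    let remainder = text.len() % 4;
    if padding_count > 0 { remainder == 0 } else { remainder != 1 }
}

fn main() {
    assert!(has_valid_base64_shape("aGVsbG8="));    // trailing pad, len % 4 == 0
    assert!(has_valid_base64_shape("aGVsbG8gd28")); // unpadded, len % 4 == 3
    assert!(!has_valid_base64_shape("aGVs=bG8"));   // '=' in the middle
    assert!(!has_valid_base64_shape("aGVsbG8=="));  // padded but len % 4 == 1
}

This rejects both failure modes called out above: interior '=' characters and padded strings whose total length is not a multiple of 4.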

Comment on lines +215 to +244
/// Validates a registry path
pub fn is_valid_registry_path(text: &str) -> bool {
    let upper_text = text.to_uppercase();

    // Check for full registry root
    if upper_text.starts_with("HKEY_") {
        // Extract root key
        if let Some(slash_pos) = text.find('\\') {
            let root = &upper_text[..slash_pos];
            if VALID_REGISTRY_ROOTS.contains(root) {
                return true;
            }
        }
    }

    // Check for abbreviated forms (case-insensitive)
    if REGISTRY_ABBREV_REGEX.is_match(text) {
        return true;
    }

    // Also accept paths that use forward slashes (some tools do this)
    if upper_text.starts_with("HKEY_")
        && text.contains('/')
        && let Some(slash_pos) = text.find('/')
    {
        let root = &upper_text[..slash_pos];
        if VALID_REGISTRY_ROOTS.contains(root) {
            return true;
        }
    }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find the file and examine the full registry validation function
fd -t f "paths.rs" src/classification/patterns/

# Get the file size to see if we can read it entirely
wc -l src/classification/patterns/paths.rs

Repository: EvilBit-Labs/Stringy

Length of output: 141


🏁 Script executed:

# Look at the specific function and surrounding context
cat -n src/classification/patterns/paths.rs | sed -n '200,250p'

Repository: EvilBit-Labs/Stringy

Length of output: 1736


🏁 Script executed:

# Search for all uses of to_uppercase in the codebase
rg "to_uppercase" src/

Repository: EvilBit-Labs/Stringy

Length of output: 142


🏁 Script executed:

# Check if there are any tests for the registry validation function
rg "is_valid_registry_path" --type rust

Repository: EvilBit-Labs/Stringy

Length of output: 742


🏁 Script executed:

# Let's verify the behavior of to_uppercase vs to_ascii_uppercase with a simple test
python3 << 'EOF'
# Test character expansion with to_uppercase() in Unicode
import unicodedata

# Test ASCII characters
ascii_test = "HKEY_LOCAL_MACHINE"
print(f"ASCII string: '{ascii_test}'")
print(f"Length: {len(ascii_test)}")

# Test if certain characters expand when uppercased
test_chars = ['a', 'ß', 'ñ', '\\', '/']
for char in test_chars:
    try:
        upper = char.upper()
        print(f"'{char}' -> '{upper}' (was {len(char)} byte(s), now {len(upper.encode('utf-8'))} byte(s))")
    except Exception as e:
        print(f"Error with '{char}': {e}")

# Key insight: bytes vs characters
test_str = "HKEY_ß\\"  # Contains a character that expands
print(f"\nTest string with ß: '{test_str}'")
print(f"UTF-8 byte length: {len(test_str.encode('utf-8'))}")
pos = test_str.find('\\')
print(f"Position of first \\: {pos}")
print(f"Upper version byte length: {len(test_str.upper().encode('utf-8'))}")
EOF

Repository: EvilBit-Labs/Stringy

Length of output: 150


🏁 Script executed:

# Check if the VALID_REGISTRY_ROOTS constant is defined and what it contains
rg "VALID_REGISTRY_ROOTS" src/classification/patterns/paths.rs -A 10 -B 2

Repository: EvilBit-Labs/Stringy

Length of output: 962


🏁 Script executed:

# Check the REGISTRY_ABBREV_REGEX to understand what patterns are validated
rg "REGISTRY_ABBREV_REGEX" src/classification/patterns/paths.rs -A 5 -B 2

Repository: EvilBit-Labs/Stringy

Length of output: 669


🏁 Script executed:

# Let's check how this function is called - is there any input validation upstream?
rg "is_valid_registry_path" src/ -B 5 -A 2

Repository: EvilBit-Labs/Stringy

Length of output: 3141


🏁 Script executed:

# Let's check the actual byte/character expansion risk by examining valid registry root strings
cat > /tmp/test_uppercase.rs << 'EOF'
fn main() {
    let test = "HKEY_LOCAL_MACHINE\\subkey";
    let upper = test.to_uppercase();
    let ascii_upper = test.to_ascii_uppercase();
    
    println!("Original: {} bytes", test.len());
    println!("to_uppercase: {} bytes", upper.len());
    println!("to_ascii_uppercase: {} bytes", ascii_upper.len());
    
    // Now test with potential non-ASCII before slash
    let test_unicode = "HKEY_ß\\subkey";
    let upper_uni = test_unicode.to_uppercase();
    let ascii_upper_uni = test_unicode.to_ascii_uppercase();
    
    println!("\nWith ß character:");
    println!("Original: {} bytes", test_unicode.len());
    println!("to_uppercase: {} bytes", upper_uni.len());
    println!("to_ascii_uppercase: {} bytes", ascii_upper_uni.len());
    
    // Find positions
    if let Some(pos) = test_unicode.find('\\') {
        println!("Position of \\: {}", pos);
        println!("to_uppercase slicing [..{}]: OK? {}", pos, pos <= upper_uni.len());
        println!("to_ascii_uppercase slicing [..{}]: OK? {}", pos, pos <= ascii_upper_uni.len());
    }
}
EOF
rustc /tmp/test_uppercase.rs -o /tmp/test_uppercase && /tmp/test_uppercase

Repository: EvilBit-Labs/Stringy

Length of output: 110


🏁 Script executed:

# Check what types of strings are passed to this function from semantic classification
rg "is_valid_registry_path\|classify_registry_path" src/ -B 3 -A 3

Repository: EvilBit-Labs/Stringy

Length of output: 46


🏁 Script executed:

# Check if there are any existing tests that demonstrate the behavior
fd -t f "test" src/classification/ -o rg "registry" tests/ src/ --type rust

Repository: EvilBit-Labs/Stringy

Length of output: 187


🏁 Script executed:

# Let's look more carefully at what types of strings are being processed
# Check the extraction context to see what kinds of inputs reach this function
rg "RegistryPath|registry" src/ -B 2 -A 2 | head -50

Repository: EvilBit-Labs/Stringy

Length of output: 2995


🏁 Script executed:

# Let's examine if there's any validation or filtering before these strings are classified
rg "classify_registry_path|is_valid_registry_path" src/classification/ -B 5 -A 5

Repository: EvilBit-Labs/Stringy

Length of output: 7669


🏁 Script executed:

# Check if the issue exists with a simple logic trace
cat > /tmp/analyze.py << 'EOF'
# Simulate the potential issue

# Test case 1: ASCII only (normal case)
test1 = "HKEY_LOCAL_MACHINE\\Software"
pos1 = test1.find('\\')
upper1 = test1.upper()
upper1_ascii = test1.upper()  # for ASCII it's the same

print(f"Test 1 (ASCII): '{test1}'")
print(f"  Position of \\: {pos1}")
print(f"  Length original: {len(test1)}, uppercase: {len(upper1)}")
print(f"  Slice [..{pos1}] in uppercase: '{upper1[:pos1]}'")
print(f"  Match: {upper1[:pos1] == 'HKEY_LOCAL_MACHINE'}")

# Test case 2: Non-ASCII before delimiter (edge case)
test2 = "HKEYß\\Software"
pos2 = test2.find('\\')
upper2 = test2.upper()

print(f"\nTest 2 (with ß): '{test2}'")
print(f"  Bytes in original: {[c for c in test2]}")
print(f"  Position of \\ (in original): {pos2}")
print(f"  Length original: {len(test2)}, uppercase: {len(upper2)}")
print(f"  Upper bytes: {[c for c in upper2]}")
try:
    print(f"  Slice [..{pos2}] in uppercase: '{upper2[:pos2]}'")
    print(f"  Compare to HKEY_: {upper2[:pos2].startswith('HKEY_')}")
except Exception as e:
    print(f"  ERROR: {e}")

# Test case 3: Using ASCII uppercase on non-ASCII
print(f"\nTest 3 (ASCII uppercase behavior):")
test3 = "HKEYß\\Software"
pos3 = test3.find('\\')
# In Rust, to_ascii_uppercase() only uppercases ASCII
ascii_upper = ''.join(c.upper() if c.isascii() else c for c in test3)
print(f"  ASCII upper of '{test3}': '{ascii_upper}'")
print(f"  Slice [..{pos3}]: '{ascii_upper[:pos3]}'")
EOF
python3 /tmp/analyze.py

Repository: EvilBit-Labs/Stringy

Length of output: 693


🏁 Script executed:

# Let's look at the actual registry path test to see if non-ASCII is expected
rg "test_registry_path|test_classify_registry_path" src/ -A 10

Repository: EvilBit-Labs/Stringy

Length of output: 1554


🏁 Script executed:

# Check if there are any other slicing operations with position indices in the paths module
rg "find\(" src/classification/patterns/paths.rs -B 2 -A 3

Repository: EvilBit-Labs/Stringy

Length of output: 534


🏁 Script executed:

# Let's verify the Unicode expansion concern with a real example
cat > /tmp/verify_panic.rs << 'EOF'
// Simulate what happens with the current code
fn main() {
    // Example of character that could cause issue
    // Looking for strings that expand when uppercased
    
    // Test: some Unicode characters do expand
    // For instance, certain ligatures or special characters
    let test = "HKEY\u{00DF}\\Software";  // ß character
    println!("Original string: {} bytes", test.len());
    println!("Bytes: {:?}", test.as_bytes());
    
    let upper = test.to_uppercase();
    println!("Uppercase string: {} bytes", upper.len());
    println!("Bytes: {:?}", upper.as_bytes());
    
    let pos = test.find('\\').unwrap();
    println!("Position of \\: {} (in original)", pos);
    
    // This is what the code does - use original position on uppercase string
    println!("Slicing uppercase[..{}]: '{}'", pos, &upper[..pos]);
    
    // With to_ascii_uppercase
    let ascii_upper = test.to_ascii_uppercase();
    println!("\nASCII uppercase: {} bytes", ascii_upper.len());
    println!("Slicing ascii_upper[..{}]: '{}'", pos, &ascii_upper[..pos]);
}
EOF
# Try to compile it to check for actual panic risk
cat /tmp/verify_panic.rs

Repository: EvilBit-Labs/Stringy

Length of output: 1121


🏁 Script executed:

# Confirm the current implementation one more time
sed -n '215,247p' src/classification/patterns/paths.rs

Repository: EvilBit-Labs/Stringy

Length of output: 976


Fix Unicode byte-expansion issue in registry path validation.

Line 217 uses to_uppercase(), which can change a string's byte length for certain Unicode characters. Since the code then slices upper_text[..slash_pos] with a byte index computed on the original string, a non-ASCII character before the delimiter can shift that index onto the wrong bytes, or off a character boundary entirely, which panics.

Use to_ascii_uppercase() instead: it leaves non-ASCII bytes untouched, so indices from the original string stay valid. This is semantically correct since Windows registry root keys are strictly ASCII.

Proposed fix
-    let upper_text = text.to_uppercase();
+    let upper_text = text.to_ascii_uppercase();
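
As a standalone illustration (not code from the PR): Turkish dotless i (U+0131) occupies 2 bytes in UTF-8 but uppercases to the 1-byte ASCII 'I', so full Unicode casing changes the byte length and a position computed on the original string goes stale; with other characters the stale index can land inside a multibyte character and panic the slice:

fn main() {
    let text = "HKı\\Software"; // 'ı' = U+0131, 2 bytes in UTF-8
    let pos = text.find('\\').unwrap(); // byte index in the original string
    assert_eq!(pos, 4); // H(1) + K(1) + ı(2)

    let unicode_upper = text.to_uppercase(); // "HKI\SOFTWARE", one byte shorter
    assert_eq!(unicode_upper.len(), text.len() - 1);

    let ascii_upper = text.to_ascii_uppercase(); // non-ASCII bytes untouched
    assert_eq!(ascii_upper.len(), text.len()); // byte-for-byte aligned
    assert_eq!(&ascii_upper[..pos], "HKı"); // safe to reuse `pos`
}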


Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)
