
feat(classification): add symbol demangling for Rust #124

Merged

unclesp1d3r merged 7 commits into main from symantic_classification_completion
Jan 18, 2026

Conversation

@unclesp1d3r
Member

This pull request introduces a new modular pattern classification system for detecting various data formats, including IP addresses, GUIDs, emails, Base64 data, format strings, and user agent strings. The changes add new pattern classification modules, refactor existing capabilities to use these modules, and update dependencies for improved performance and maintainability.

New Pattern Classification Modules

  • Added a new patterns module in src/classification, which contains submodules for different types of pattern classification: ip, network, paths, and data. Each submodule provides functions and regex patterns for detecting specific formats such as IP addresses, URLs, file paths, GUIDs, emails, Base64, format strings, and user agents.
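For orientation, a minimal sketch of how the new module tree might be wired together, assuming the re-exports mirror the classifier names listed later in this PR; the exact layout and visibility choices are assumptions, not the crate's code.

// src/classification/patterns/mod.rs (sketch)
pub mod data;
pub mod ip;
pub mod network;
pub mod paths;

// Re-export the classifier entry points for callers such as SemanticClassifier.
pub use data::{classify_base64, classify_email, classify_format_string, classify_guid, classify_user_agent};
pub use ip::classify_ip_addresses;
pub use network::{classify_domain, classify_url};
pub use paths::{classify_posix_path, classify_registry_path, classify_unc_path, classify_windows_path};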

Expanded Data Format Detection

  • Implemented new classifiers for GUID/UUID, email addresses, Base64-encoded data, printf-style format strings, and user agent strings in src/classification/patterns/data.rs, with comprehensive regex-based detection and thorough unit tests.
  • Updated documentation in src/classification/mod.rs to reflect the expanded capabilities, including symbol demangling and new data format detection.
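To illustrate the shape of these classifiers, here is a minimal sketch of a GUID detector in the style described above; the Tag::Guid variant name and the exact pattern are assumptions rather than the crate's actual code.

use once_cell::sync::Lazy;
use regex::Regex;

use crate::types::Tag; // Tag::Guid is an assumed variant name

static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
    // Bare or brace-wrapped GUIDs, e.g. 550e8400-e29b-41d4-a716-446655440000
    Regex::new(r"^\{?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}?$")
        .unwrap()
});

/// Returns a GUID tag when the entire string is a GUID/UUID.
pub fn classify_guid(text: &str) -> Option<Tag> {
    GUID_REGEX.is_match(text).then_some(Tag::Guid)
}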

Improved IP Address Detection

  • Added robust IPv4 and IPv6 address detection in src/classification/patterns/ip.rs, including support for port suffixes and bracketed IPv6 addresses, with validation using both regex and standard library types.

Refactoring and API Changes

  • Refactored src/classification/mod.rs to use the new patterns module, and re-exported key classification functions for easier access and future extensibility.

Dependency Updates

  • Replaced lazy_static with once_cell for more efficient static initialization, added rustc-demangle for symbol demangling, and updated several dependencies for improved stability and performance in Cargo.toml.
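The practical difference is small but worth seeing: once_cell::sync::Lazy gives the same compile-once semantics without a macro. A before/after sketch, with an illustrative pattern rather than the crate's actual regex:

use once_cell::sync::Lazy;
use regex::Regex;

// Before: lazy_static! { static ref URL_REGEX: Regex = Regex::new(...).unwrap(); }
// After: a plain static whose initializer closure runs once, on first access.
static URL_REGEX: Lazy<Regex> = Lazy::new(|| Regex::new(r"https?://\S+").unwrap());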

- Implemented `SymbolDemangler` to detect and demangle mangled Rust symbols.
- Added support for Rust legacy and v0 mangling formats.
- Updated `FoundString` to store original mangled text and tags for demangled symbols.
- Enhanced `Tag` enum with `DemangledSymbol` variant.
- Added comprehensive tests for demangling functionality.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
File extensions such as exe, dll, sys, bin, dat, log, tmp, and bak are no longer treated as valid TLDs in domain classification. Added tests to ensure these extensions are not classified as domains, and updated documentation to reflect Rust-only symbol demangling.
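A hedged usage sketch of the SymbolDemangler API described above; the import path and the demangled output shown in the comment are assumptions.

use stringy::classification::SymbolDemangler; // import path assumed

fn main() {
    let demangler = SymbolDemangler::new();

    // Rust legacy (_ZN...) and v0 (_R...) manglings are both detected.
    let mangled = "_ZN4core3fmt5write17h0123456789abcdefE";
    assert!(demangler.is_mangled(mangled));

    // try_demangle returns None for strings that are not mangled symbols.
    if let Some(readable) = demangler.try_demangle(mangled) {
        println!("{readable}"); // e.g. core::fmt::write::h0123456789abcdef
    }
}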
Copilot AI review requested due to automatic review settings January 18, 2026 05:12
@coderabbitai
Contributor

coderabbitai bot commented Jan 18, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Architectural Refactoring: Modular Pattern Classification and Symbol Demangling (Rust + C++)

This PR refactors the classification subsystem into a modular pattern library, substantially expands pattern coverage (network, paths, data, IP), and integrates symbol demangling (Rust + C++) into the extraction/enrichment pipeline.

Architectural Changes

  • New modular classification root: src/classification/patterns with focused submodules:
    • ip.rs — IPv4/IPv6 detection with port/bracket handling, regex pre-filter + stdlib validation.
    • network.rs — URL and domain detection with TLD validation and edge-case handling.
    • paths.rs — POSIX, Windows, UNC, and registry path detection plus suspicious-persistence heuristics.
    • data.rs — GUID/UUID, email, Base64, printf-style format strings, and user-agent detection.
  • Centralized, cached regexes via once_cell::sync::Lazy (lazy_static removed). Several regex constants are re-exported pub(crate) to support tests and reuse.
  • Classification logic moved from module-local state to standalone pattern functions; SemanticClassifier delegates to these functions (function-based classification).

Symbol Demangling & Extraction Integration

  • New SymbolDemangler (src/classification/symbols.rs):
    • Detects Rust v0 (_R) and legacy Itanium-style (_ZN) mangled Rust symbols, as well as C++ Itanium (_Z) symbols.
    • Uses rustc-demangle for Rust and cpp_demangle for C++ demangling.
    • API: SymbolDemangler::new(), is_mangled(&str), try_demangle(&str) -> Option<String>, demangle(&mut FoundString).
    • demangle(&mut FoundString) is idempotent, preserves FoundString.original_text (stores original mangled form), replaces FoundString.text with demangled text, and appends Tag::DemangledSymbol once while preserving existing tags.
  • Extraction pipeline update (src/extraction/mod.rs): apply_semantic_enrichment runs demangler.demangle(...) first, then SemanticClassifier.classify(...) for each FoundString before deduplication/canonicalization so outputs include demangled text and associated tags.
  • New optional original_text field added to StringOccurrence and propagated through deduplication and CanonicalString::to_found_string, enabling preservation/serialization of original mangled strings for forensic fidelity.
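A sketch of the occurrence shape this implies; the field names follow the PR text, while the serde attributes are an assumption about how the optional field stays out of pre-existing serialized output.

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StringOccurrence {
    pub offset: u64,
    /// Raw mangled form, populated only when demangling rewrote the text.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub original_text: Option<String>,
    // ...remaining occurrence metadata elided
}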

API & Public Surface

  • Large expansion of public pattern functions and helpers (re-exported via patterns:: and made available through SemanticClassifier):
    • classify_{url,domain,ip_addresses,posix_path,windows_path,unc_path,registry_path,guid,email,base64,format_string,user_agent}
    • has_valid_tld, is_ipv4_address, is_ipv6_address, the is_valid_* / is_suspicious_* helper families, strip_port, strip_ipv6_brackets
  • New public type: SymbolDemangler with methods new(), is_mangled(), demangle(), try_demangle().
  • Tag enum extended with DemangledSymbol (serde rename "demangled").
  • SemanticClassifier API surface widened:
    • classify(&FoundString) -> Vec<Tag> (preserves multi-tagging).
    • new() annotated #[must_use].
    • Adds regex_cache_addresses() helper (testing/debugging: exposes addresses of cached regex patterns).

Binary Format Handling & Data Structures

  • original_text preservation improves forensic fidelity: demangled outputs keep the raw/mangled form alongside the demangled text in occurrences and serialized output.
  • pelite and other format-related crates were added to dependencies (present in Cargo.toml) enabling future binary/format interrogation enhancements.
  • Demangling is applied broadly during enrichment (imports, sections, strings) prior to deduplication to ensure canonical outputs contain human-readable symbol names while retaining originals.

Classification Accuracy & Behavior Improvements

  • IP detection: two-stage approach (regex pre-filter + stdlib parse) reduces false positives; supports port suffixes and bracketed IPv6.
  • Domain/URL detection: domain regex + known-TLD validation; classify_domain avoids double-tagging URLs; trailing-dot TLDs rejected to reduce false positives.
  • Path classifiers: stricter validation for POSIX/Windows/UNC/registry paths, plus heuristics to flag suspicious persistence-related locations.
  • Data classifiers: GUID, email, Base64, printf-format, and user-agent detectors tuned with unit tests; Base64 explicitly treated as broad/ambiguous to reduce false positives.
  • Demangler behavior: idempotent, preserves original_text, appends a single DemangledSymbol tag, and preserves pre-existing tags (e.g., Import), improving downstream semantic accuracy and readability.
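Returning to the first bullet above, the two-stage IP check can be sketched as follows; the pre-filter pattern is an assumption, and the point is that the stdlib parser performs the authoritative validation.

use std::net::Ipv4Addr;

use once_cell::sync::Lazy;
use regex::Regex;

// Stage 1: cheap shape check that rejects most non-addresses.
static IPV4_REGEX: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$").unwrap());

// Stage 2: the stdlib parse enforces octet ranges and leading-zero rules.
pub fn is_ipv4_address(text: &str) -> bool {
    IPV4_REGEX.is_match(text) && text.parse::<Ipv4Addr>().is_ok()
}

fn main() {
    assert!(is_ipv4_address("192.168.0.1"));
    assert!(!is_ipv4_address("999.1.1.1")); // passes the regex, fails the parse
}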

Dependencies & Implementation Notes

  • Replaced lazy_static with once_cell::sync::Lazy; added rustc-demangle and cpp_demangle for demangling; pelite present for PE/format work; minor bumps to serde_json and insta.
  • Many new unit tests added across patterns, demangling, and enrichment logic.
  • Public API expansion and added serialization of original_text increase surface area and review scope (notable churn in classification/semantic modules and extraction/deduplication).

Impact Summary

  • Significant expansion of classification capabilities and public API surface; moderate-to-high review effort expected (especially patterns, SemanticClassifier, extraction/dedup).
  • Backward-compatible on most core types but introduces Tag::DemangledSymbol and the optional original_text field in StringOccurrence serialization.
  • Improves semantic detection accuracy and forensic output quality by emitting demangled text while preserving original mangled data.

Walkthrough

Adds a centralized pattern-classification subsystem (data, ip, network, paths), a SymbolDemangler, expands SemanticClassifier's public surface and integrates demangling + pattern classification into extraction before deduplication; introduces Tag::DemangledSymbol and dependency updates for demangling and lazy init.

Changes

Cohort / File(s) / Summary

  • Dependency Management (Cargo.toml): Added cpp_demangle, once_cell, pelite, rustc-demangle; bumped serde_json and insta. Attention: native demangler crates and feature pins.
  • Top-level classification (src/classification/mod.rs): Added mod patterns; and pub mod symbols;, and re-exported SymbolDemangler.
  • Patterns index (src/classification/patterns/mod.rs): New aggregator module exposing the data, ip, network, and paths classifiers and selected internal regexes.
  • Data-format patterns (src/classification/patterns/data.rs): New classifiers classify_guid, classify_email, classify_base64, classify_format_string, and classify_user_agent; lazy regexes and tests. Review regex strictness and false-positive handling.
  • IP patterns & helpers (src/classification/patterns/ip.rs): IPv4/IPv6 detection, strip_port, strip_ipv6_brackets, validation helpers, and classify_ip_addresses. Careful: port/bracket edge cases and leading-zero rules.
  • Network patterns (src/classification/patterns/network.rs): classify_url, classify_domain, and has_valid_tld with URL_REGEX/DOMAIN_REGEX and a TLD list. Attention: TLD list maintenance and URL/domain boundary rules.
  • Path patterns (src/classification/patterns/paths.rs): POSIX/Windows/UNC/registry validators and classifiers plus suspicious-path detection; many regexes and heuristics. Review for OS-specific edge cases and permission/persistence indicators.
  • Semantic classifier (src/classification/semantic.rs): classify delegates to pattern functions; many helpers re-exported publicly; new() annotated #[must_use]; added the regex_cache_addresses test API. Public surface expanded significantly.
  • Symbol demangler (src/classification/symbols.rs): New SymbolDemangler with is_mangled, try_demangle, and demangle (in-place FoundString demangling); preserves original_text and appends Tag::DemangledSymbol; comprehensive tests. Review safety of demangler fallbacks and memory/ownership handling.
  • Types (src/types.rs): Added Tag::DemangledSymbol (serde rename = "demangled").
  • Extraction / dedup (src/extraction/mod.rs, src/extraction/dedup.rs): StringOccurrence now holds original_text; semantic enrichment (demangle + classify) is applied before deduplication; canonicalization flows original_text back into FoundString. Verify serialization impact and backwards compatibility.
  • Docs (docs/src/classification.md): Doc rework covering once_cell lazy init, expanded pattern descriptions, and tag specificity notes; removed low-level snippets.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Extract as "extract()/extract_from_section()"
  participant Found as "FoundString"
  participant Symbol as "SymbolDemangler"
  participant Semantic as "SemanticClassifier"
  participant Patterns as "patterns::*"
  Note over Extract,Found: collect found strings
  Extract->>Found: iterate found strings
  loop per found string
    Extract->>Symbol: demangle(&mut FoundString)
    alt mangled symbol detected
      Symbol-->>Found: mutate text + set original_text
      Symbol-->>Extract: add Tag::DemangledSymbol
    end
    Extract->>Semantic: classify(&FoundString)
    Semantic->>Patterns: classify_url/classify_domain/classify_ip/... 
    Patterns-->>Semantic: Option<Tag>/Vec<Tag>
    Semantic-->>Extract: aggregated tags
    Extract->>Found: merge tags into FoundString
  end
  Extract->>Extract: deduplicate/canonicalize (preserve original_text)
  Extract-->>Client: final results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • Copilot

Poem

Demangle the knots, restore the name,
IPs and paths put under frame,
Regexes idle, then spring to test,
Original text kept safe and blessed,
Tags align — classifiers proclaim.

🚥 Pre-merge checks | ✅ 4 | ❌ 4
❌ Failed checks (4 warnings)
  • Title check ⚠️ Warning: Title follows Conventional Commits format with correct type and scope, but is misleading, since symbol demangling is only one of many changes. Resolution: revise the title to reflect the primary change, e.g. feat(classification): refactor pattern detection into modular submodules or feat(classification): add modular pattern classification system.
  • No Unsafe Code ⚠️ Warning: The crate lacks the #![forbid(unsafe_code)] directive in src/lib.rs, which is required to prevent unsafe code introduction across the entire crate. Resolution: add #![forbid(unsafe_code)] as the first non-documentation line in src/lib.rs before module declarations and verify compilation and tests pass.
  • Error Handling ⚠️ Warning: The PR introduces extensive .unwrap() calls on Regex::new() and silently suppresses external crate errors from cpp_demangle and rustc_demangle without mapping to StringyError, violating established project error handling conventions. Resolution: add StringyError variants for demangling/pattern errors, replace .unwrap() with proper error handling, map external crate errors using #[from], and change classification signatures from Option<Tag> to Result<Tag, StringyError>.
  • File Size Limit ⚠️ Warning: src/classification/symbols.rs exceeds the 500-line limit at 508 lines, overshooting the new file size constraint by 8 lines. Resolution: refactor by extracting helper functions or test logic into separate modules to bring the file below 500 lines.
✅ Passed checks (4 passed)
  • Description check ✅ Passed: Description accurately covers pattern classification modules, data format detection, IP detection, refactoring, and dependency updates with specific file references.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 70.00%.
  • Ascii Only ✅ Passed: No new Unicode punctuation introduced in the PR. Pre-existing Chinese characters in test strings and dashes in concept.md are not newly added code artifacts.
  • Section Weight Consistency ✅ Passed: The PR makes no modifications to container parser files (src/container/*.rs), so section weight logic remains unchanged and consistent.



@coderabbitai coderabbitai bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Jan 18, 2026
Copilot AI left a comment

Pull request overview

This pull request introduces Rust symbol demangling capabilities and significantly expands the pattern classification system with new modules for detecting various data formats. The changes refactor existing classification code into a modular structure and replace lazy_static with once_cell for improved performance.

Changes:

  • Added symbol demangling for Rust using rustc-demangle with support for legacy (_ZN) and v0 (_R) mangling schemes
  • Introduced new patterns module with submodules for IP addresses, network indicators, file paths, and data formats (GUIDs, emails, Base64, format strings, user agents)
  • Refactored SemanticClassifier to delegate to pattern classification functions while maintaining backward compatibility

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Summary per file

  • Cargo.toml: Replaced lazy_static with once_cell, added the rustc-demangle dependency, and updated minor versions of serde_json and insta
  • src/types.rs: Added the DemangledSymbol tag variant to the Tag enum
  • src/classification/mod.rs: Updated module documentation to reflect new capabilities; added patterns and symbols submodules with public exports
  • src/classification/symbols.rs: New module implementing Rust symbol demangling with detection, demangling, and comprehensive tests
  • src/classification/semantic.rs: Refactored to delegate classification to pattern functions while maintaining public API compatibility
  • src/classification/patterns/mod.rs: New module organizing pattern classification into submodules with public re-exports
  • src/classification/patterns/ip.rs: Extracted IPv4 and IPv6 address detection with regex patterns and validation using the standard library
  • src/classification/patterns/network.rs: Extracted URL and domain detection with TLD validation using a comprehensive list of common TLDs
  • src/classification/patterns/paths.rs: Extracted file and registry path detection for POSIX, Windows, UNC, and registry paths
  • src/classification/patterns/data.rs: New module for detecting GUIDs, emails, Base64, format strings, and user agents
  • tests/snapshots/classification_integration__classification_snapshots.snap: Updated snapshot showing a Windows path now classified with both Domain and FilePath tags

Comment on lines 111 to 113
    let haystack_lower = haystack.to_ascii_lowercase();
    let needle_lower = needle.to_ascii_lowercase();
    haystack_lower.contains(&needle_lower)
Copilot AI Jan 18, 2026
The function allocates two new strings by converting both haystack and needle to lowercase. This is less efficient than the byte-wise comparison approach that was used in the original semantic.rs module. Consider using a more efficient implementation that avoids allocations, such as iterating through bytes and comparing them case-insensitively using eq_ignore_ascii_case on byte slices or using a windowing approach.

Suggested change
-    let haystack_lower = haystack.to_ascii_lowercase();
-    let needle_lower = needle.to_ascii_lowercase();
-    haystack_lower.contains(&needle_lower)
+    // Match the behavior of `str::contains`: empty needle always matches.
+    if needle.is_empty() {
+        return true;
+    }
+    let haystack_bytes = haystack.as_bytes();
+    let needle_bytes = needle.as_bytes();
+    if needle_bytes.len() > haystack_bytes.len() {
+        return false;
+    }
+    let last_start = haystack_bytes.len() - needle_bytes.len();
+    for start in 0..=last_start {
+        let window = &haystack_bytes[start..start + needle_bytes.len()];
+        if window.eq_ignore_ascii_case(needle_bytes) {
+            return true;
+        }
+    }
+    false

Comment on lines +111 to +114
    let has_upper = text.chars().any(|c| c.is_ascii_uppercase());
    let has_lower = text.chars().any(|c| c.is_ascii_lowercase());
    let has_digit = text.chars().any(|c| c.is_ascii_digit());

Copilot AI Jan 18, 2026
The Base64 classification function iterates through the text multiple times: once to check all characters are valid Base64 (line 87), once to count padding (line 95), and three separate times to check for uppercase, lowercase, and digits (lines 111-113). This could be optimized by combining these checks into a single pass through the string to improve performance, especially for longer strings.

Suggested change
-    let has_upper = text.chars().any(|c| c.is_ascii_uppercase());
-    let has_lower = text.chars().any(|c| c.is_ascii_lowercase());
-    let has_digit = text.chars().any(|c| c.is_ascii_digit());
+    let mut has_upper = false;
+    let mut has_lower = false;
+    let mut has_digit = false;
+    for c in text.chars() {
+        if c.is_ascii_uppercase() {
+            has_upper = true;
+        } else if c.is_ascii_lowercase() {
+            has_lower = true;
+        } else if c.is_ascii_digit() {
+            has_digit = true;
+        }
+        if has_upper && has_lower && has_digit {
+            break;
+        }
+    }

@codecov

codecov bot commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 92.54658% with 60 lines in your changes missing coverage. Please review.

Files with missing lines (patch % / lines missing):

  • src/classification/semantic.rs: 65.74%, 37 missing ⚠️
  • src/classification/patterns/paths.rs: 93.16%, 16 missing ⚠️
  • src/classification/patterns/data.rs: 96.87%, 3 missing ⚠️
  • src/classification/symbols.rs: 98.56%, 3 missing ⚠️
  • src/classification/patterns/network.rs: 98.27%, 1 missing ⚠️


Integrates the cpp_demangle crate and extends SymbolDemangler to handle C++ Itanium ABI mangled symbols (e.g., _Z, _ZN). Updates documentation, detection logic, and adds tests for C++ symbol demangling alongside Rust symbol support.
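A minimal sketch of the Rust-first, C++-fallback flow this commit describes, using the public APIs of the rustc-demangle and cpp_demangle crates; the exact detection order inside SymbolDemangler is an assumption.

// Try Rust demangling first, then fall back to C++ Itanium.
fn try_demangle_symbol(mangled: &str) -> Option<String> {
    // rustc-demangle validates and demangles Rust v0 (_R) and legacy (_ZN...) symbols.
    if let Ok(demangled) = rustc_demangle::try_demangle(mangled) {
        return Some(demangled.to_string());
    }
    // Remaining _Z-prefixed symbols are tried as C++ Itanium ABI.
    if mangled.starts_with("_Z") {
        if let Ok(symbol) = cpp_demangle::Symbol::new(mangled) {
            return Some(symbol.to_string());
        }
    }
    None
}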
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 102-108: The current length check using (content_len +
padding_count).is_multiple_of(4) wrongly rejects valid unpadded Base64; update
the logic around variables content_len and padding_count to permit total lengths
whose length mod 4 is 0, 2, or 3 and only reject when the remainder is 1.
Concretely, compute the remainder of (content_len + padding_count) mod 4 and
return None only if that remainder equals 1 (instead of requiring an exact
multiple-of-4), so JWT-style unpadded segments (mod 2 or 3) are accepted.

In `@src/classification/patterns/network.rs`:
- Around line 26-33: COMMON_TLDS currently contains file extensions that are not
real TLDs and cause false positives; remove entries like "exe", "dll", "sys",
"bin", "dat", "log", "tmp", "bak" from the COMMON_TLDS constant in network.rs
(the const COMMON_TLDS: &[&str] array) so it only lists actual TLDs, and if
needed implement separate filename/extension checks inside the path classifier
(or a new extension heuristic) rather than conflating them with domain TLD
recognition.

In `@src/classification/patterns/paths.rs`:
- Around line 302-307: The current is_suspicious_windows_path function uses
contains_ascii_case_insensitive which does substring matching and can
false-positive on embedded paths; change it to perform a case-insensitive prefix
check instead (e.g., for each pattern in SUSPICIOUS_WINDOWS_PATHS compare the
lowercase pattern to text.lowercase().starts_with(&pattern_lower) or call an
existing starts_with_ascii_case_insensitive helper), updating references to
contains_ascii_case_insensitive accordingly in is_suspicious_windows_path; if
substring behavior is intentional, add a clear doc comment above
is_suspicious_windows_path explaining that embedded path matches are desired.
- Around line 30-45: Replace the verbose repeated insert() calls when building
the static SUSPICIOUS_POSIX_PATHS with a more idiomatic construction: build the
set from an array or iterator (e.g. using HashSet::from_iter or
vec![...].into_iter().collect()) inside the Lazy::new closure so the same string
literals populate the HashSet in one expression; apply the same pattern to other
HashSet statics in the file (referencing SUSPICIOUS_POSIX_PATHS and any other
*_PATHS or similar HashSet statics).
- Around line 182-187: The loop over KNOWN_WINDOWS_PREFIXES incorrectly uses
contains_ascii_case_insensitive to check prefixes, which allows matches in the
middle of the string; change the logic to perform a case-insensitive starts_with
check instead (e.g., implement or use a case-insensitive starts_with helper and
call it on text with each prefix) so the prefix match in the for loop only
returns true when text begins with the given prefix; update the invocation where
contains_ascii_case_insensitive(text, prefix) is used (in the function that
iterates KNOWN_WINDOWS_PREFIXES) to call the new/updated starts-with helper.
- Around line 109-114: contains_ascii_case_insensitive currently allocates two
Strings via to_ascii_lowercase on every call; replace it with an allocation-free
ASCII case-insensitive substring search by iterating over haystack.as_bytes()
and comparing each candidate window to needle.as_bytes() using byte-wise
to_ascii_lowercase() on u8 (or similar inline lowercasing) rather than building
new Strings; update the function contains_ascii_case_insensitive to scan
haystack bytes, for each start index compare needle length bytes with on-the-fly
lowercasing, and return true on a match (use needle.is_empty() and length checks
as needed).

@unclesp1d3r unclesp1d3r requested a review from Copilot January 18, 2026 05:21
@unclesp1d3r unclesp1d3r self-assigned this Jan 18, 2026
@unclesp1d3r unclesp1d3r added this to the v0.1 milestone Jan 18, 2026
Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comment on lines +213 to +222
    // Also accept paths that use forward slashes (some tools do this)
    if upper_text.starts_with("HKEY_")
        && text.contains('/')
        && let Some(slash_pos) = text.find('/')
    {
        let root = &upper_text[..slash_pos];
        if VALID_REGISTRY_ROOTS.contains(root) {
            return true;
        }
    }
Copilot AI Jan 18, 2026
The logic for handling forward slashes in registry paths may be overly permissive. At line 214-215, the code checks upper_text.starts_with("HKEY_") && text.contains('/'), which means any occurrence of a forward slash anywhere in the path will trigger this branch. This could match invalid paths like "HKEY_LOCAL_MACHINE\SOFTWARE/mixed" or paths with forward slashes in unexpected places. Consider checking if the path uses forward slashes consistently (e.g., text.find('\\').is_none()) or at least verify the forward slash is used as the first separator after the root key.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/classification/patterns/network.rs`:
- Around line 43-51: The TLD lookup in has_valid_tld currently does an O(n) scan
over COMMON_TLDS; change COMMON_TLDS from a Vec<&'static str> (or slice) to a
HashSet<&'static str> (or a &'static HashSet) and initialize it once (e.g., via
lazy_static or once_cell::sync::Lazy) so has_valid_tld can call
COMMON_TLDS.contains(&tld_lower.as_str()) with O(1) expected lookup; update the
COMMON_TLDS definition and any imports (HashSet, Lazy) and ensure has_valid_tld
still lowercases the TLD before lookup.

In `@src/classification/symbols.rs`:
- Around line 165-167: The demangling code currently unconditionally overwrites
string.original_text which can erase prior provenance; change the logic in the
demangling routine (the block that sets string.original_text and string.text) so
that you only assign string.original_text = Some(string.text.clone()) when
string.original_text.is_none(), then always update string.text = demangled_str;
this preserves any existing original_text while still replacing text with the
demangled form.
- Around line 52-59: The public struct SymbolDemangler must be marked
non-exhaustive to allow future fields without breaking semver; add the attribute
#[non_exhaustive] immediately before the derive and the pub struct declaration
for SymbolDemangler (leave the existing pub struct SymbolDemangler and its
impl/new() intact) so the attribute applies to that type.
♻️ Duplicate comments (1)
src/classification/patterns/network.rs (1)

26-34: Previous issue resolved - file extensions removed from TLD list.

The problematic entries (exe, dll, sys, bin, dat, log, tmp, bak) have been removed. The test suite at lines 170-181 now explicitly validates that file extensions are not misclassified as domains. This directly addresses the false positive reduction goal for binary analysis.
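Combining the suggested O(1) lookup with the trailing-dot rejection mentioned in the walkthrough, a sketch of the lookup path; the set contents here are illustrative, not the crate's full TLD list.

use std::collections::HashSet;

use once_cell::sync::Lazy;

static COMMON_TLDS: Lazy<HashSet<&'static str>> =
    Lazy::new(|| ["com", "net", "org", "io", "dev"].into_iter().collect());

pub fn has_valid_tld(domain: &str) -> bool {
    // A trailing dot would otherwise yield an empty TLD; reject it explicitly.
    if domain.ends_with('.') {
        return false;
    }
    match domain.rfind('.') {
        Some(pos) => COMMON_TLDS.contains(domain[pos + 1..].to_ascii_lowercase().as_str()),
        None => false,
    }
}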

Comment on lines +52 to +59
#[derive(Debug, Default, Clone)]
pub struct SymbolDemangler;

impl SymbolDemangler {
    /// Create a new instance of the symbol demangler
    #[must_use]
    pub fn new() -> Self {
        Self

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read the relevant lines
if [ -f "src/classification/symbols.rs" ]; then
  echo "=== File exists, reading lines 50-65 ==="
  sed -n '50,65p' src/classification/symbols.rs | cat -n
else
  echo "File not found at src/classification/symbols.rs"
  find . -name "symbols.rs" -type f 2>/dev/null
fi

Repository: EvilBit-Labs/Stringy

Length of output: 767


🏁 Script executed:

# Also check the broader context to understand the struct definition
if [ -f "src/classification/symbols.rs" ]; then
  echo "=== Checking for struct SymbolDemangler definition ==="
  rg -n "struct SymbolDemangler" src/classification/symbols.rs -A 5 -B 2
fi

Repository: EvilBit-Labs/Stringy

Length of output: 372


🏁 Script executed:

# Verify if the struct is public and check for #[non_exhaustive]
if [ -f "src/classification/symbols.rs" ]; then
  echo "=== Checking full context around SymbolDemangler ==="
  rg -n "SymbolDemangler" src/classification/symbols.rs | head -20
fi

Repository: EvilBit-Labs/Stringy

Length of output: 1121


Add #[non_exhaustive] to the public struct.

The coding guidelines require #[non_exhaustive] on all public structs to allow future extension without semver breaks. The struct already has an explicit constructor, so add the attribute before the derive macro.

Proposed fix
+#[non_exhaustive]
 #[derive(Debug, Default, Clone)]
 pub struct SymbolDemangler;

Comment on lines +165 to +167
        // Store original mangled form and replace with demangled
        string.original_text = Some(string.text.clone());
        string.text = demangled_str;
⚠️ Potential issue | 🟠 Major

Do not overwrite existing original_text.

demangle unconditionally replaces original_text, which can erase prior provenance if another transform already populated it. Only set original_text when it is None.

Proposed fix
-        string.original_text = Some(string.text.clone());
+        if string.original_text.is_none() {
+            string.original_text = Some(string.text.clone());
+        }
         string.text = demangled_str;

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/classification/patterns/paths.rs`:
- Around line 1-4: The paths.rs module exceeds the 500-line guideline and should
be split into smaller focused modules; create separate modules (e.g., posix.rs,
windows.rs, unc.rs, registry.rs) and move the corresponding detection
functions/structs (the POSIX path classifier, Windows path classifier, UNC path
classifier, and registry path classifier) from paths.rs into these new files,
export them from a new mod.rs or from a restructured patterns::paths mod that
re-exports pub use posix::*, pub use windows::*, etc., and update any
references/imports in the codebase to use patterns::paths::{PosixClassifier,
WindowsClassifier, UncClassifier, RegistryClassifier} (or the actual
function/struct names used) so behavior is unchanged but each file stays under
the line limit.
♻️ Duplicate comments (2)
src/classification/patterns/paths.rs (1)

235-244: Avoid accepting mixed separators in registry paths.

Line 235 allows forward-slash registry paths even when backslashes are present, so mixed separators like "HKEY_LOCAL_MACHINE\SOFTWARE/mixed" are treated as valid and increase false positives. Consider requiring forward slashes only when no backslashes exist.

Proposed fix
-    if upper_text.starts_with("HKEY_")
-        && text.contains('/')
-        && let Some(slash_pos) = text.find('/')
-    {
+    if upper_text.starts_with("HKEY_")
+        && text.contains('/')
+        && !text.contains('\\')
+        && let Some(slash_pos) = text.find('/')
+    {
src/classification/patterns/data.rs (1)

79-133: Reject padded Base64 strings unless the total length is a multiple of 4.

Line 105 only rejects remainder 1, so padded strings with length mod 4 of 2 or 3 can be incorrectly accepted and increase false positives.

Proposed fix
-    // Strip padding for length check
-    let content_len = text.len() - padding_count;
-
-    // Valid Base64 content length should avoid mod 4 remainder of 1
-    let remainder = (content_len + padding_count) % 4;
-    if remainder == 1 {
-        return None;
-    }
+    let total_len = text.len();
+    let len_mod = total_len % 4;
+    if padding_count > 0 {
+        // Padded Base64 must be a multiple of 4
+        if len_mod != 0 {
+            return None;
+        }
+    } else if len_mod == 1 {
+        // Unpadded Base64 cannot have remainder 1
+        return None;
+    }

Comment on lines +1 to +4
//! File and registry path classification patterns
//!
//! This module provides POSIX, Windows, UNC, and registry path detection.


🛠️ Refactor suggestion | 🟠 Major

Split this module to stay under the 500-line limit.

This file is 504 lines, which is above the 500-line guideline. As per coding guidelines, please split the path classifiers into smaller modules.


Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/extraction/dedup.rs (1)

311-350: Test helper doesn't exercise original_text field.

The create_test_string helper always sets original_text: None. Consider adding a test case that verifies original_text propagation through deduplication, particularly for the to_found_string() round-trip.

Suggested test case
#[test]
fn test_original_text_preservation() {
    let mut fs = create_test_string(
        "demangled_name",
        Encoding::Utf8,
        0x100,
        None,
        StringSource::SectionData,
        vec![],
        10,
        0.8,
    );
    fs.original_text = Some("_ZN7mangled4nameE".to_string());
    
    let strings = vec![fs];
    let canonical = deduplicate(strings, None, true);
    
    assert_eq!(canonical[0].occurrences[0].original_text, 
               Some("_ZN7mangled4nameE".to_string()));
    
    let found = canonical[0].to_found_string();
    assert_eq!(found.original_text, Some("_ZN7mangled4nameE".to_string()));
}
docs/src/classification.md (2)

158-160: Documentation inconsistency: lazy_static vs once_cell.

Line 119 and the code example at 122-126 correctly show once_cell::sync::Lazy, but line 160 still references lazy_static!. Update for consistency with the actual implementation.

Suggested fix
-The classifier relies on `lazy_static!` to compile regex patterns once and reuse them across classification calls. Helper methods validate strings before assigning tags.
+The classifier relies on `once_cell::sync::Lazy` to compile regex patterns once and reuse them across classification calls. Helper methods validate strings before assigning tags.

180-192: Example code missing original_text field.

The FoundString struct now has an original_text: Option<String> field (per the types.rs changes). Update the example to include it for accuracy.

Suggested fix
 let found_string = FoundString {
     text: "C:\\Windows\\System32\\cmd.exe".to_string(),
+    original_text: None,
     encoding: Encoding::Ascii,
     offset: 0,
🤖 Fix all issues with AI agents
In `@docs/src/classification.md`:
- Around line 113-119: In the "Tag Specificity" section update the phrasing for
clarity: change "high confidence matches" to "high-confidence matches"
(hyphenate when used as a compound adjective) and tighten "and should be treated
as ambiguous" to "which should be treated as ambiguous" so the sentence reads
smoothly; ensure the rest of the paragraph (mentioning examples like URL,
domain, IP, file path, GUID, email, format string, and user agent) remains
unchanged.

In `@src/extraction/mod.rs`:
- Around line 525-536: The demangling + classification loop duplicated in the
extraction flow should be extracted into a private helper to avoid maintenance
drift: create a function (e.g., fn apply_classification(all_strings: &mut
[YourStringType]) { ... }) that instantiates SemanticClassifier::new() and
SymbolDemangler::new() and runs the same demangler.demangle(string) and
classifier.classify(string) logic while merging tags into string.tags; then
replace the duplicated blocks in the current function and in extract_canonical()
with a call to apply_classification(&mut all_strings) so both call sites share
the same implementation and behavior.
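A sketch of the shared helper this asks for; the type names follow the PR description (SymbolDemangler, SemanticClassifier, FoundString), while the tag-merge details and imports are assumptions.

fn apply_semantic_enrichment(all_strings: &mut [FoundString]) {
    let demangler = SymbolDemangler::new();
    let classifier = SemanticClassifier::new();
    for string in all_strings.iter_mut() {
        // Demangle first so classification sees the human-readable text.
        demangler.demangle(string);
        for tag in classifier.classify(string) {
            if !string.tags.contains(&tag) {
                string.tags.push(tag);
            }
        }
    }
}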

Comment on lines +113 to +119
## Tag Specificity

The semantic classifier uses cached regex patterns via `lazy_static!` and applies validation checks to reduce false positives.
Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.

## Pattern Matching Engine

The semantic classifier uses cached regex patterns via `once_cell::sync::Lazy` and applies validation checks to reduce false positives.
⚠️ Potential issue | 🟡 Minor

Minor grammar and clarity improvements.

Line 115: "high confidence" should be hyphenated as "high-confidence" when used as a compound adjective before a noun. Also, "and should be treated as" could be tightened to "which should be treated as" for flow.

Suggested fix
-Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.
+Tags are treated as either specific or broad. Specific tags indicate high-confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag, which should be treated as ambiguous due to higher false-positive risk.

Test modules for classification patterns have been refactored to combine similar test cases, reducing redundancy and improving maintainability. Semantic enrichment logic in extraction has been moved to a dedicated helper function. Unused deduplication parameter has been removed for clarity.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/extraction/dedup.rs (1)

64-86: Remove dead parameter _preserve_all_occurrences — implementation ignores documented behavior.

The docstring claims this parameter controls whether to "store occurrence count instead of full metadata," but the function always preserves full metadata via found_string_to_occurrence(), making the parameter dead code. The underscore prefix confirms it's intentionally unused.

Removing this parameter requires updating 26 call sites (2 in src/extraction/mod.rs + tests), but it eliminates the API inconsistency and stale documentation. Alternatively, implement the documented behavior if occurrence-count-only mode is actually needed.

Suggested fix: remove unused parameter
-/// * `preserve_all_occurrences` - If false, only store occurrence count instead of full metadata
-///
 /// # Returns
 ///
 /// Vector of canonical strings sorted by combined_score (descending)
@@ pub fn deduplicate(
     strings: Vec<FoundString>,
     dedup_threshold: Option<usize>,
-    _preserve_all_occurrences: bool,
 ) -> Vec<CanonicalString> {

Update both calls in src/extraction/mod.rs (lines 544-548, 651-655) and all test invocations.

🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 20-21: Add a doc comment for the static EMAIL_REGEX (the
Lazy<Regex> named EMAIL_REGEX) explaining that this pattern is intentionally
simplified for binary/string extraction: it will match short forms like "a@b.cc"
and common unquoted local-parts but does not support quoted local-parts, some
valid edge cases (e.g., certain plus/escape forms and RFC 5322 full syntax), nor
internationalized domain names; state the tradeoff (reduced false positives vs.
not fully RFC-compliant) so future readers know the limitation and why this
simplified regex was chosen.

In `@src/classification/patterns/ip.rs`:
- Around line 117-145: The function is_ipv6_address currently never uses
IPV6_REGEX; update it to apply the IPV6_REGEX pre-filter after handling
bracketed addresses (after strip_port and strip_ipv6_brackets) before the
character checks and Ipv6Addr::from_str validation, similar to is_ipv4_address,
so inputs failing the regex are rejected early; alternatively, if the regex is
unused elsewhere, remove the unused IPV6_REGEX definition—refer to
is_ipv6_address, IPV6_REGEX, strip_port, strip_ipv6_brackets, and
Ipv6Addr::from_str when making the change.
- Around line 24-26: The IPV6_REGEX constant currently holds a massive,
hard-to-maintain pattern; replace it with a much simpler pre-filter regex (e.g.,
allow only hex digits, colons, and optional IPv4-mapped suffix) and then
validate canonical correctness with Ipv6Addr::from_str inside the code paths
that use IPV6_REGEX, or if you must keep the complex regex for performance, add
a comment above pub(crate) static IPV6_REGEX describing the performance tradeoff
and why the full pattern is required; update usages to fall back to
Ipv6Addr::from_str for authoritative validation.
- Around line 41-42: The IPV6_BRACKETS_REGEX is currently greedy and can capture
trailing text; update its pattern to only capture the content inside the
brackets and anchor the match so it doesn't include characters after the closing
bracket (e.g. use a pattern like r"^\[([^\]]+)\]$" or r"^\[([^\]]+)\]" to
capture only what's between '[' and ']' in the static IPV6_BRACKETS_REGEX
Lazy<Regex> declaration).

In `@src/classification/patterns/network.rs`:
- Around line 46-54: The has_valid_tld function currently treats a domain ending
with '.' as having an empty TLD (which falls through to false); explicitly
handle trailing dots by either returning false when domain.ends_with('.') or by
trimming a single trailing '.' before computing tld (update has_valid_tld to
check for domain.ends_with('.') and return false, or strip the trailing dot then
use rfind and COMMON_TLDS lookup) so the behavior is explicit and documented in
the function comment.

In `@src/classification/patterns/paths.rs`:
- Around line 329-341: The two functions use different match semantics which is
intentional but not documented: is_suspicious_windows_path uses prefix
(starts_with) matching while is_suspicious_registry_path uses substring
(contains) matching; add a brief doc comment above each function
(is_suspicious_windows_path and is_suspicious_registry_path) clarifying the
rationale (windows paths should be matched by prefix like "C:\Windows\…" whereas
registry keys may appear anywhere in the path and thus use substring matching)
so future readers understand the asymmetry.
- Around line 291-301: Add a unit test that ensures classify_unc_path returns
None for a UNC root with a trailing backslash (e.g., "\\server\\") to cover the
case where UNC_PATH_REGEX matches but the share name is missing; create a test
function (e.g., test_classify_unc_path_missing_share) that calls
classify_unc_path with that string and asserts the result is None, referencing
classify_unc_path and UNC_PATH_REGEX as the functions/regex under test.
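A sketch of the regression test requested in the last item above; the raw-string literal keeps the backslashes literal.

#[test]
fn test_classify_unc_path_missing_share() {
    // A UNC root with a trailing backslash has no share name and
    // should not be classified as a file path.
    assert!(classify_unc_path(r"\\server\").is_none());
}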
♻️ Duplicate comments (3)
docs/src/classification.md (1)

115-115: Hyphenate compound adjective and tighten clause.
This matches a previously noted wording fix. Please hyphenate "high-confidence" and tighten "and should be treated as" to "which should be treated as".

src/classification/patterns/data.rs (1)

111-130: Multiple iterations over text for character class checks.

Lines 112-114 iterate the string three separate times. For hot paths with long strings, a single-pass approach is more efficient.

Single-pass character diversity check
-    // Check for character diversity typical of Base64
-    let has_upper = text.chars().any(|c| c.is_ascii_uppercase());
-    let has_lower = text.chars().any(|c| c.is_ascii_lowercase());
-    let has_digit = text.chars().any(|c| c.is_ascii_digit());
+    // Check for character diversity typical of Base64 (single pass)
+    let (mut has_upper, mut has_lower, mut has_digit) = (false, false, false);
+    for c in text.chars() {
+        if c.is_ascii_uppercase() {
+            has_upper = true;
+        } else if c.is_ascii_lowercase() {
+            has_lower = true;
+        } else if c.is_ascii_digit() {
+            has_digit = true;
+        }
+        // Early exit if all flags are set
+        if has_upper && has_lower && has_digit {
+            break;
+        }
+    }
src/classification/patterns/paths.rs (1)

235-244: Forward-slash registry path logic is overly permissive.

The condition upper_text.starts_with("HKEY_") && text.contains('/') matches if a forward slash appears anywhere, including in value names. For example, HKEY_LOCAL_MACHINE\SOFTWARE\Path/Name would incorrectly trigger this branch and extract HKEY_LOCAL_MACHINE\SOFTWARE\Path as the root (wrong separator position).

Stricter forward-slash handling
     // Also accept paths that use forward slashes (some tools do this)
-    if upper_text.starts_with("HKEY_")
-        && text.contains('/')
-        && let Some(slash_pos) = text.find('/')
-    {
+    // Only if the path uses forward slashes consistently (no backslashes before first slash)
+    if upper_text.starts_with("HKEY_") && let Some(slash_pos) = text.find('/') {
+        // Ensure no backslash appears before the first forward slash
+        if text[..slash_pos].contains('\\') {
+            return false;
+        }
         let root = &upper_text[..slash_pos];
         if VALID_REGISTRY_ROOTS.contains(root) {
             return true;
         }
     }

- Updated email regex to simplify matching for binary and string extraction.
- Refined IPv6 regex to allow only valid characters and improved validation checks.
- Added validation for trailing dots in TLD checks.
- Enhanced suspicious path checks with clearer matching strategies for Windows and registry paths.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
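A sketch of what the documented, intentionally simplified email pattern might look like after this commit; the exact regex is an assumption.

use once_cell::sync::Lazy;
use regex::Regex;

/// Intentionally simplified email pattern for binary/string extraction.
///
/// Matches common unquoted forms (including short hits like `a@b.cc`) but
/// not quoted local-parts, full RFC 5322 syntax, or internationalized
/// domain names. The tradeoff favors fewer false positives over completeness.
static EMAIL_REGEX: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$").unwrap());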
@unclesp1d3r unclesp1d3r enabled auto-merge (squash) January 18, 2026 06:12
@unclesp1d3r unclesp1d3r merged commit c3b90ed into main Jan 18, 2026
19 of 20 checks passed
@unclesp1d3r unclesp1d3r deleted the symantic_classification_completion branch January 18, 2026 06:13
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 101-116: The current padding logic only counts trailing '=' but
allows '=' in the middle and doesn't require overall length divisibility when
padding is present; update the check around variables padding_count, content_len
and remainder so that you first verify there are no '=' characters in the
non-padded prefix (i.e., ensure text[..content_len] contains no '='), and then
if padding_count > 0 enforce that text.len() % 4 == 0 (instead of the loose
remainder check); keep the existing padding_count <= 2 constraint and maintain
the remainder == 1 early-out for the no-padding case.

In `@src/classification/patterns/paths.rs`:
- Around line 215-244: The is_valid_registry_path function uses
text.to_uppercase(), which can expand Unicode and invalidate byte indices when
later slicing with positions from the original text; change both calls that
create upper_text to text.to_ascii_uppercase() so byte lengths remain stable
(Windows registry roots are ASCII), ensuring the subsequent find('\\') and
find('/') positions from the original text remain valid when you slice
upper_text[..slash_pos] and check VALID_REGISTRY_ROOTS.

Comment on lines +101 to +116
    // Count padding characters
    let padding_count = text.chars().rev().take_while(|&c| c == '=').count();

    // Padding should be at most 2 characters
    if padding_count > 2 {
        return None;
    }

    // Strip padding for length check
    let content_len = text.len() - padding_count;

    // Valid Base64 content length should avoid mod 4 remainder of 1
    let remainder = (content_len + padding_count) % 4;
    if remainder == 1 {
        return None;
    }
⚠️ Potential issue | 🟠 Major

Padding validation can accept invalid Base64 shapes.

Only trailing padding is counted. Strings with '=' in the middle or padded strings whose total length is not a multiple of 4 can slip through, increasing false positives. Add a check that padding is only at the end and enforce len%4==0 when padding exists.

Proposed fix
     // Padding should be at most 2 characters
     if padding_count > 2 {
         return None;
     }

-    // Strip padding for length check
-    let content_len = text.len() - padding_count;
-
-    // Valid Base64 content length should avoid mod 4 remainder of 1
-    let remainder = (content_len + padding_count) % 4;
-    if remainder == 1 {
-        return None;
-    }
+    // Padding must only appear at the end
+    let padding_start = text.len() - padding_count;
+    if text[..padding_start].contains('=') {
+        return None;
+    }
+
+    // Padded Base64 must have total length multiple of 4
+    let remainder = text.len() % 4;
+    if padding_count > 0 {
+        if remainder != 0 {
+            return None;
+        }
+    } else if remainder == 1 {
+        return None;
+    }
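
For reference, a self-contained version of the tightened check. The standalone name has_valid_base64_shape and the bool return are illustrative (the PR's classifier returns an Option and bails with None); the body mirrors the proposed fix:

fn has_valid_base64_shape(text: &str) -> bool {
    // Count trailing '=' characters; Base64 allows at most two.
    let padding_count = text.chars().rev().take_while(|&c| c == '=').count();
    if padding_count > 2 {
        return false;
    }

    // Padding must only appear at the end.
    let padding_start = text.len() - padding_count;
    if text[..padding_start].contains('=') {
        return false;
    }

    // Padded input must have a total length that is a multiple of 4;
    // unpadded input may be any length except remainder 1, which no
    // encoding produces.
    let remainder = text.len() % 4;
    if padding_count > 0 { remainder == 0 } else { remainder != 1 }
}

fn main() {
    assert!(has_valid_base64_shape("aGVsbG8="));    // trailing pad, len % 4 == 0
    assert!(has_valid_base64_shape("aGVsbG8gd28")); // unpadded, len % 4 == 3
    assert!(!has_valid_base64_shape("aGVs=bG8"));   // '=' in the middle
    assert!(!has_valid_base64_shape("aGVsbG8=="));  // padded but len % 4 == 1
}

This rejects both failure modes called out above: interior '=' characters and padded strings whose total length is not a multiple of 4.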

Comment on lines +215 to +244
/// Validates a registry path
pub fn is_valid_registry_path(text: &str) -> bool {
    let upper_text = text.to_uppercase();

    // Check for full registry root
    if upper_text.starts_with("HKEY_") {
        // Extract root key
        if let Some(slash_pos) = text.find('\\') {
            let root = &upper_text[..slash_pos];
            if VALID_REGISTRY_ROOTS.contains(root) {
                return true;
            }
        }
    }

    // Check for abbreviated forms (case-insensitive)
    if REGISTRY_ABBREV_REGEX.is_match(text) {
        return true;
    }

    // Also accept paths that use forward slashes (some tools do this)
    if upper_text.starts_with("HKEY_")
        && text.contains('/')
        && let Some(slash_pos) = text.find('/')
    {
        let root = &upper_text[..slash_pos];
        if VALID_REGISTRY_ROOTS.contains(root) {
            return true;
        }
    }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find the file and examine the full registry validation function
fd -t f "paths.rs" src/classification/patterns/

# Get the file size to see if we can read it entirely
wc -l src/classification/patterns/paths.rs

Repository: EvilBit-Labs/Stringy

Length of output: 141


🏁 Script executed:

# Look at the specific function and surrounding context
cat -n src/classification/patterns/paths.rs | sed -n '200,250p'

Repository: EvilBit-Labs/Stringy

Length of output: 1736


🏁 Script executed:

# Search for all uses of to_uppercase in the codebase
rg "to_uppercase" src/

Repository: EvilBit-Labs/Stringy

Length of output: 142


🏁 Script executed:

# Check if there are any tests for the registry validation function
rg "is_valid_registry_path" --type rust

Repository: EvilBit-Labs/Stringy

Length of output: 742


🏁 Script executed:

# Let's verify the behavior of to_uppercase vs to_ascii_uppercase with a simple test
python3 << 'EOF'
# Test character expansion with to_uppercase() in Unicode
import unicodedata

# Test ASCII characters
ascii_test = "HKEY_LOCAL_MACHINE"
print(f"ASCII string: '{ascii_test}'")
print(f"Length: {len(ascii_test)}")

# Test if certain characters expand when uppercased
test_chars = ['a', 'ß', 'ñ', '\\', '/']
for char in test_chars:
    try:
        upper = char.upper()
        print(f"'{char}' -> '{upper}' (was {len(char)} byte(s), now {len(upper.encode('utf-8'))} byte(s))")
    except Exception as e:
        print(f"Error with '{char}': {e}")

# Key insight: bytes vs characters
test_str = "HKEY_ß\\"  # Contains a character that expands
print(f"\nTest string with ß: '{test_str}'")
print(f"UTF-8 byte length: {len(test_str.encode('utf-8'))}")
pos = test_str.find('\\')
print(f"Position of first \\: {pos}")
print(f"Upper version byte length: {len(test_str.upper().encode('utf-8'))}")
EOF

Repository: EvilBit-Labs/Stringy

Length of output: 150


🏁 Script executed:

# Check if the VALID_REGISTRY_ROOTS constant is defined and what it contains
rg "VALID_REGISTRY_ROOTS" src/classification/patterns/paths.rs -A 10 -B 2

Repository: EvilBit-Labs/Stringy

Length of output: 962


🏁 Script executed:

# Check the REGISTRY_ABBREV_REGEX to understand what patterns are validated
rg "REGISTRY_ABBREV_REGEX" src/classification/patterns/paths.rs -A 5 -B 2

Repository: EvilBit-Labs/Stringy

Length of output: 669


🏁 Script executed:

# Let's check how this function is called - is there any input validation upstream?
rg "is_valid_registry_path" src/ -B 5 -A 2

Repository: EvilBit-Labs/Stringy

Length of output: 3141


🏁 Script executed:

# Let's check the actual byte/character expansion risk by examining valid registry root strings
cat > /tmp/test_uppercase.rs << 'EOF'
fn main() {
    let test = "HKEY_LOCAL_MACHINE\\subkey";
    let upper = test.to_uppercase();
    let ascii_upper = test.to_ascii_uppercase();
    
    println!("Original: {} bytes", test.len());
    println!("to_uppercase: {} bytes", upper.len());
    println!("to_ascii_uppercase: {} bytes", ascii_upper.len());
    
    // Now test with potential non-ASCII before slash
    let test_unicode = "HKEY_ß\\subkey";
    let upper_uni = test_unicode.to_uppercase();
    let ascii_upper_uni = test_unicode.to_ascii_uppercase();
    
    println!("\nWith ß character:");
    println!("Original: {} bytes", test_unicode.len());
    println!("to_uppercase: {} bytes", upper_uni.len());
    println!("to_ascii_uppercase: {} bytes", ascii_upper_uni.len());
    
    // Find positions
    if let Some(pos) = test_unicode.find('\\') {
        println!("Position of \\: {}", pos);
        println!("to_uppercase slicing [..{}]: OK? {}", pos, pos <= upper_uni.len());
        println!("to_ascii_uppercase slicing [..{}]: OK? {}", pos, pos <= ascii_upper_uni.len());
    }
}
EOF
rustc /tmp/test_uppercase.rs -o /tmp/test_uppercase && /tmp/test_uppercase

Repository: EvilBit-Labs/Stringy

Length of output: 110


🏁 Script executed:

# Check what types of strings are passed to this function from semantic classification
rg "is_valid_registry_path\|classify_registry_path" src/ -B 3 -A 3

Repository: EvilBit-Labs/Stringy

Length of output: 46


🏁 Script executed:

# Check if there are any existing tests that demonstrate the behavior
fd -t f "test" src/classification/ -o rg "registry" tests/ src/ --type rust

Repository: EvilBit-Labs/Stringy

Length of output: 187


🏁 Script executed:

# Let's look more carefully at what types of strings are being processed
# Check the extraction context to see what kinds of inputs reach this function
rg "RegistryPath|registry" src/ -B 2 -A 2 | head -50

Repository: EvilBit-Labs/Stringy

Length of output: 2995


🏁 Script executed:

# Let's examine if there's any validation or filtering before these strings are classified
rg "classify_registry_path|is_valid_registry_path" src/classification/ -B 5 -A 5

Repository: EvilBit-Labs/Stringy

Length of output: 7669


🏁 Script executed:

# Check if the issue exists with a simple logic trace
cat > /tmp/analyze.py << 'EOF'
# Simulate the potential issue

# Test case 1: ASCII only (normal case)
test1 = "HKEY_LOCAL_MACHINE\\Software"
pos1 = test1.find('\\')
upper1 = test1.upper()
upper1_ascii = test1.upper()  # for ASCII it's the same

print(f"Test 1 (ASCII): '{test1}'")
print(f"  Position of \\: {pos1}")
print(f"  Length original: {len(test1)}, uppercase: {len(upper1)}")
print(f"  Slice [..{pos1}] in uppercase: '{upper1[:pos1]}'")
print(f"  Match: {upper1[:pos1] == 'HKEY_LOCAL_MACHINE'}")

# Test case 2: Non-ASCII before delimiter (edge case)
test2 = "HKEYß\\Software"
pos2 = test2.find('\\')
upper2 = test2.upper()

print(f"\nTest 2 (with ß): '{test2}'")
print(f"  Bytes in original: {[c for c in test2]}")
print(f"  Position of \\ (in original): {pos2}")
print(f"  Length original: {len(test2)}, uppercase: {len(upper2)}")
print(f"  Upper bytes: {[c for c in upper2]}")
try:
    print(f"  Slice [..{pos2}] in uppercase: '{upper2[:pos2]}'")
    print(f"  Compare to HKEY_: {upper2[:pos2].startswith('HKEY_')}")
except Exception as e:
    print(f"  ERROR: {e}")

# Test case 3: Using ASCII uppercase on non-ASCII
print(f"\nTest 3 (ASCII uppercase behavior):")
test3 = "HKEYß\\Software"
pos3 = test3.find('\\')
# In Rust, to_ascii_uppercase() only uppercases ASCII
ascii_upper = ''.join(c.upper() if c.isascii() else c for c in test3)
print(f"  ASCII upper of '{test3}': '{ascii_upper}'")
print(f"  Slice [..{pos3}]: '{ascii_upper[:pos3]}'")
EOF
python3 /tmp/analyze.py

Repository: EvilBit-Labs/Stringy

Length of output: 693


🏁 Script executed:

# Let's look at the actual registry path test to see if non-ASCII is expected
rg "test_registry_path|test_classify_registry_path" src/ -A 10

Repository: EvilBit-Labs/Stringy

Length of output: 1554


🏁 Script executed:

# Check if there are any other slicing operations with position indices in the paths module
rg "find\(" src/classification/patterns/paths.rs -B 2 -A 3

Repository: EvilBit-Labs/Stringy

Length of output: 534


🏁 Script executed:

# Let's verify the Unicode expansion concern with a real example
cat > /tmp/verify_panic.rs << 'EOF'
// Simulate what happens with the current code
fn main() {
    // Example of character that could cause issue
    // Looking for strings that expand when uppercased
    
    // Test: some Unicode characters do expand
    // For instance, certain ligatures or special characters
    let test = "HKEY\u{00DF}\\Software";  // ß character
    println!("Original string: {} bytes", test.len());
    println!("Bytes: {:?}", test.as_bytes());
    
    let upper = test.to_uppercase();
    println!("Uppercase string: {} bytes", upper.len());
    println!("Bytes: {:?}", upper.as_bytes());
    
    let pos = test.find('\\').unwrap();
    println!("Position of \\: {} (in original)", pos);
    
    // This is what the code does - use original position on uppercase string
    println!("Slicing uppercase[..{}]: '{}'", pos, &upper[..pos]);
    
    // With to_ascii_uppercase
    let ascii_upper = test.to_ascii_uppercase();
    println!("\nASCII uppercase: {} bytes", ascii_upper.len());
    println!("Slicing ascii_upper[..{}]: '{}'", pos, &ascii_upper[..pos]);
}
EOF
# Try to compile it to check for actual panic risk
cat /tmp/verify_panic.rs

Repository: EvilBit-Labs/Stringy

Length of output: 1121


🏁 Script executed:

# Confirm the current implementation one more time
sed -n '215,247p' src/classification/patterns/paths.rs

Repository: EvilBit-Labs/Stringy

Length of output: 976


Fix Unicode byte-expansion issue in registry path validation.

Line 217 uses to_uppercase(), which can change a string's byte length for certain Unicode characters. Since the code then slices upper_text[..slash_pos] with a byte index computed on the original string, a non-ASCII character before the delimiter can shift that index onto the wrong bytes, or off a character boundary entirely, which panics.

Use to_ascii_uppercase() instead: it leaves non-ASCII bytes untouched, so indices from the original string stay valid. This is semantically correct since Windows registry root keys are strictly ASCII.

Proposed fix
-    let upper_text = text.to_uppercase();
+    let upper_text = text.to_ascii_uppercase();
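
As a standalone illustration (not code from the PR): Turkish dotless i (U+0131) occupies 2 bytes in UTF-8 but uppercases to the 1-byte ASCII 'I', so full Unicode casing changes the byte length and a position computed on the original string goes stale; with other characters the stale index can land inside a multibyte character and panic the slice:

fn main() {
    let text = "HKı\\Software"; // 'ı' = U+0131, 2 bytes in UTF-8
    let pos = text.find('\\').unwrap(); // byte index in the original string
    assert_eq!(pos, 4); // H(1) + K(1) + ı(2)

    let unicode_upper = text.to_uppercase(); // "HKI\SOFTWARE", one byte shorter
    assert_eq!(unicode_upper.len(), text.len() - 1);

    let ascii_upper = text.to_ascii_uppercase(); // non-ASCII bytes untouched
    assert_eq!(ascii_upper.len(), text.len()); // byte-for-byte aligned
    assert_eq!(&ascii_upper[..pos], "HKı"); // safe to reuse `pos`
}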


Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)
