feat(classification): add symbol demangling for Rust #124
Conversation
- Implemented `SymbolDemangler` to detect and demangle mangled Rust symbols.
- Added support for Rust legacy and v0 mangling formats.
- Updated `FoundString` to store original mangled text and tags for demangled symbols.
- Enhanced `Tag` enum with `DemangledSymbol` variant.
- Added comprehensive tests for demangling functionality.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
File extensions such as exe, dll, sys, bin, dat, log, tmp, and bak are no longer treated as valid TLDs in domain classification. Added tests to ensure these extensions are not classified as domains, and updated documentation to reflect Rust-only symbol demangling.
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Architectural Refactoring: Modular Pattern Classification and Symbol Demangling (Rust + C++)

This PR refactors the classification subsystem into a modular pattern library, substantially expands pattern coverage (network, paths, data, IP), and integrates symbol demangling (Rust + C++) into the extraction/enrichment pipeline.

Architectural Changes
Symbol Demangling & Extraction Integration
API & Public Surface
Binary Format Handling & Data Structures
Classification Accuracy & Behavior Improvements
Dependencies & Implementation Notes
Impact Summary
Walkthrough

Adds a centralized pattern-classification subsystem (data, ip, network, paths) and a SymbolDemangler, expands SemanticClassifier's public surface, and integrates demangling + pattern classification into extraction before deduplication; introduces Tag::DemangledSymbol and dependency updates for demangling and lazy init.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant Extract as "extract()/extract_from_section()"
participant Found as "FoundString"
participant Symbol as "SymbolDemangler"
participant Semantic as "SemanticClassifier"
participant Patterns as "patterns::*"
Note over Extract,Found: collect found strings
Extract->>Found: iterate found strings
loop per found string
Extract->>Symbol: demangle(&mut FoundString)
alt mangled symbol detected
Symbol-->>Found: mutate text + set original_text
Symbol-->>Extract: add Tag::DemangledSymbol
end
Extract->>Semantic: classify(&FoundString)
Semantic->>Patterns: classify_url/classify_domain/classify_ip/...
Patterns-->>Semantic: Option<Tag>/Vec<Tag>
Semantic-->>Extract: aggregated tags
Extract->>Found: merge tags into FoundString
end
Extract->>Extract: deduplicate/canonicalize (preserve original_text)
    Extract-->>Client: final results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 4

❌ Failed checks (4 warnings)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
Pull request overview
This pull request introduces Rust symbol demangling capabilities and significantly expands the pattern classification system with new modules for detecting various data formats. The changes refactor existing classification code into a modular structure and replace lazy_static with once_cell for improved performance.
Changes:
- Added symbol demangling for Rust using `rustc-demangle`, with support for legacy (`_ZN`) and v0 (`_R`) mangling schemes (see the sketch after this list)
- Introduced new `patterns` module with submodules for IP addresses, network indicators, file paths, and data formats (GUIDs, emails, Base64, format strings, user agents)
- Refactored `SemanticClassifier` to delegate to pattern classification functions while maintaining backward compatibility
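The demangling piece builds on the `rustc-demangle` crate. A minimal sketch of that crate's API, independent of this PR's `SymbolDemangler` wrapper (the helper name below is hypothetical):

```rust
// Illustrative use of rustc-demangle for legacy (_ZN...) and v0 (_R...) symbols;
// `demangle_rust` is a hypothetical helper, not code from this PR.
fn demangle_rust(sym: &str) -> Option<String> {
    // try_demangle succeeds only for strings that look like Rust symbols.
    rustc_demangle::try_demangle(sym)
        .ok()
        // The alternate form ("{:#}") drops the trailing hash segment, e.g. "::h1234...".
        .map(|d| format!("{d:#}"))
}

fn main() {
    let sym = "_ZN4core3fmt9Formatter3pad17h1234567890abcdefE";
    // Prints "core::fmt::Formatter::pad"
    println!("{}", demangle_rust(sym).unwrap_or_default());
}
```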
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| Cargo.toml | Replaced lazy_static with once_cell, added rustc-demangle dependency, and updated minor versions of serde_json and insta |
| src/types.rs | Added DemangledSymbol tag variant to the Tag enum |
| src/classification/mod.rs | Updated module documentation to reflect new capabilities, added patterns and symbols submodules with public exports |
| src/classification/symbols.rs | New module implementing Rust symbol demangling with detection, demangling, and comprehensive tests |
| src/classification/semantic.rs | Refactored to delegate classification to pattern functions while maintaining public API compatibility |
| src/classification/patterns/mod.rs | New module organizing pattern classification into submodules with public re-exports |
| src/classification/patterns/ip.rs | Extracted IPv4 and IPv6 address detection with regex patterns and validation using std library |
| src/classification/patterns/network.rs | Extracted URL and domain detection with TLD validation using a comprehensive list of common TLDs |
| src/classification/patterns/paths.rs | Extracted file and registry path detection for POSIX, Windows, UNC, and registry paths |
| src/classification/patterns/data.rs | New module for detecting GUIDs, emails, Base64, format strings, and user agents |
| tests/snapshots/classification_integration__classification_snapshots.snap | Updated snapshot showing Windows path now classified with both Domain and FilePath tags |
src/classification/patterns/paths.rs
Outdated
```rust
let haystack_lower = haystack.to_ascii_lowercase();
let needle_lower = needle.to_ascii_lowercase();
haystack_lower.contains(&needle_lower)
```
The function allocates two new strings by converting both haystack and needle to lowercase. This is less efficient than the byte-wise comparison approach that was used in the original semantic.rs module. Consider using a more efficient implementation that avoids allocations, such as iterating through bytes and comparing them case-insensitively using eq_ignore_ascii_case on byte slices or using a windowing approach.
```diff
-let haystack_lower = haystack.to_ascii_lowercase();
-let needle_lower = needle.to_ascii_lowercase();
-haystack_lower.contains(&needle_lower)
+// Match the behavior of `str::contains`: empty needle always matches.
+if needle.is_empty() {
+    return true;
+}
+let haystack_bytes = haystack.as_bytes();
+let needle_bytes = needle.as_bytes();
+if needle_bytes.len() > haystack_bytes.len() {
+    return false;
+}
+let last_start = haystack_bytes.len() - needle_bytes.len();
+for start in 0..=last_start {
+    let window = &haystack_bytes[start..start + needle_bytes.len()];
+    if window.eq_ignore_ascii_case(needle_bytes) {
+        return true;
+    }
+}
+false
```
```rust
let has_upper = text.chars().any(|c| c.is_ascii_uppercase());
let has_lower = text.chars().any(|c| c.is_ascii_lowercase());
let has_digit = text.chars().any(|c| c.is_ascii_digit());
```
The Base64 classification function iterates through the text multiple times: once to check all characters are valid Base64 (line 87), once to count padding (line 95), and three separate times to check for uppercase, lowercase, and digits (lines 111-113). This could be optimized by combining these checks into a single pass through the string to improve performance, especially for longer strings.
```diff
-let has_upper = text.chars().any(|c| c.is_ascii_uppercase());
-let has_lower = text.chars().any(|c| c.is_ascii_lowercase());
-let has_digit = text.chars().any(|c| c.is_ascii_digit());
+let mut has_upper = false;
+let mut has_lower = false;
+let mut has_digit = false;
+for c in text.chars() {
+    if c.is_ascii_uppercase() {
+        has_upper = true;
+    } else if c.is_ascii_lowercase() {
+        has_lower = true;
+    } else if c.is_ascii_digit() {
+        has_digit = true;
+    }
+    if has_upper && has_lower && has_digit {
+        break;
+    }
+}
```
Codecov Report

❌ Patch coverage is …

📢 Thoughts on this report? Let us know!
Integrates the cpp_demangle crate and extends SymbolDemangler to handle C++ Itanium ABI mangled symbols (e.g., _Z, _ZN). Updates documentation, detection logic, and adds tests for C++ symbol demangling alongside Rust symbol support.
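For reference, the `cpp_demangle` crate can be driven roughly like this (a hedged sketch, not the exact `SymbolDemangler` integration):

```rust
// Illustrative C++ Itanium ABI demangling via cpp_demangle; the helper name
// is hypothetical and error handling is simplified.
fn demangle_cpp(sym: &str) -> Option<String> {
    // Symbol::new parses the mangled name; Display renders the demangled form
    // with default options.
    cpp_demangle::Symbol::new(sym).ok().map(|s| s.to_string())
}

fn main() {
    // "_Z3foov" mangles "foo()".
    println!("{:?}", demangle_cpp("_Z3foov")); // Some("foo()")
}
```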
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 102-108: The current length check using (content_len +
padding_count).is_multiple_of(4) wrongly rejects valid unpadded Base64; update
the logic around variables content_len and padding_count to permit total lengths
whose length mod 4 is 0, 2, or 3 and only reject when the remainder is 1.
Concretely, compute the remainder of (content_len + padding_count) mod 4 and
return None only if that remainder equals 1 (instead of requiring an exact
multiple-of-4), so JWT-style unpadded segments (mod 2 or 3) are accepted.
In `@src/classification/patterns/network.rs`:
- Around line 26-33: COMMON_TLDS currently contains file extensions that are not
real TLDs and cause false positives; remove entries like "exe", "dll", "sys",
"bin", "dat", "log", "tmp", "bak" from the COMMON_TLDS constant in network.rs
(the const COMMON_TLDS: &[&str] array) so it only lists actual TLDs, and if
needed implement separate filename/extension checks inside the path classifier
(or a new extension heuristic) rather than conflating them with domain TLD
recognition.
In `@src/classification/patterns/paths.rs`:
- Around line 302-307: The current is_suspicious_windows_path function uses
contains_ascii_case_insensitive which does substring matching and can
false-positive on embedded paths; change it to perform a case-insensitive prefix
check instead (e.g., for each pattern in SUSPICIOUS_WINDOWS_PATHS compare the
lowercase pattern to text.lowercase().starts_with(&pattern_lower) or call an
existing starts_with_ascii_case_insensitive helper), updating references to
contains_ascii_case_insensitive accordingly in is_suspicious_windows_path; if
substring behavior is intentional, add a clear doc comment above
is_suspicious_windows_path explaining that embedded path matches are desired.
- Around line 30-45: Replace the verbose repeated insert() calls when building
the static SUSPICIOUS_POSIX_PATHS with a more idiomatic construction: build the
set from an array or iterator (e.g. using HashSet::from_iter or
vec![...].into_iter().collect()) inside the Lazy::new closure so the same string
literals populate the HashSet in one expression; apply the same pattern to other
HashSet statics in the file (referencing SUSPICIOUS_POSIX_PATHS and any other
*_PATHS or similar HashSet statics).
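One way the suggested construction can look (a sketch with placeholder entries, not the crate's real list):

```rust
use once_cell::sync::Lazy;
use std::collections::HashSet;

// Placeholder entries; the real SUSPICIOUS_POSIX_PATHS list lives in paths.rs.
static SUSPICIOUS_POSIX_PATHS: Lazy<HashSet<&'static str>> =
    Lazy::new(|| HashSet::from(["/etc/passwd", "/etc/shadow", "/dev/shm/", "/tmp/"]));

fn main() {
    assert!(SUSPICIOUS_POSIX_PATHS.contains("/etc/shadow"));
}
```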
- Around line 182-187: The loop over KNOWN_WINDOWS_PREFIXES incorrectly uses
contains_ascii_case_insensitive to check prefixes, which allows matches in the
middle of the string; change the logic to perform a case-insensitive starts_with
check instead (e.g., implement or use a case-insensitive starts_with helper and
call it on text with each prefix) so the prefix match in the for loop only
returns true when text begins with the given prefix; update the invocation where
contains_ascii_case_insensitive(text, prefix) is used (in the function that
iterates KNOWN_WINDOWS_PREFIXES) to call the new/updated starts-with helper.
- Around line 109-114: contains_ascii_case_insensitive currently allocates two
Strings via to_ascii_lowercase on every call; replace it with an allocation-free
ASCII case-insensitive substring search by iterating over haystack.as_bytes()
and comparing each candidate window to needle.as_bytes() using byte-wise
to_ascii_lowercase() on u8 (or similar inline lowercasing) rather than building
new Strings; update the function contains_ascii_case_insensitive to scan
haystack bytes, for each start index compare needle length bytes with on-the-fly
lowercasing, and return true on a match (use needle.is_empty() and length checks
as needed).
```rust
// Also accept paths that use forward slashes (some tools do this)
if upper_text.starts_with("HKEY_")
    && text.contains('/')
    && let Some(slash_pos) = text.find('/')
{
    let root = &upper_text[..slash_pos];
    if VALID_REGISTRY_ROOTS.contains(root) {
        return true;
    }
}
```
The logic for handling forward slashes in registry paths may be overly permissive. At line 214-215, the code checks upper_text.starts_with("HKEY_") && text.contains('/'), which means any occurrence of a forward slash anywhere in the path will trigger this branch. This could match invalid paths like "HKEY_LOCAL_MACHINE\SOFTWARE/mixed" or paths with forward slashes in unexpected places. Consider checking if the path uses forward slashes consistently (e.g., text.find('\\').is_none()) or at least verify the forward slash is used as the first separator after the root key.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/classification/patterns/network.rs`:
- Around line 43-51: The TLD lookup in has_valid_tld currently does an O(n) scan
over COMMON_TLDS; change COMMON_TLDS from a Vec<&'static str> (or slice) to a
HashSet<&'static str> (or a &'static HashSet) and initialize it once (e.g., via
lazy_static or once_cell::sync::Lazy) so has_valid_tld can call
COMMON_TLDS.contains(&tld_lower.as_str()) with O(1) expected lookup; update the
COMMON_TLDS definition and any imports (HashSet, Lazy) and ensure has_valid_tld
still lowercases the TLD before lookup.
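A sketch of the O(1) lookup being asked for (the TLD list here is truncated and illustrative):

```rust
use once_cell::sync::Lazy;
use std::collections::HashSet;

// Truncated, illustrative TLD set; the real COMMON_TLDS list is much longer.
static COMMON_TLDS: Lazy<HashSet<&'static str>> =
    Lazy::new(|| ["com", "net", "org", "io", "dev"].into_iter().collect());

fn has_valid_tld(domain: &str) -> bool {
    let Some(dot) = domain.rfind('.') else {
        return false;
    };
    // Lowercase the candidate TLD before the hash lookup.
    let tld = domain[dot + 1..].to_ascii_lowercase();
    COMMON_TLDS.contains(tld.as_str())
}

fn main() {
    assert!(has_valid_tld("evilbitlabs.io"));
    assert!(!has_valid_tld("program.exe"));
}
```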
In `@src/classification/symbols.rs`:
- Around line 165-167: The demangling code currently unconditionally overwrites
string.original_text which can erase prior provenance; change the logic in the
demangling routine (the block that sets string.original_text and string.text) so
that you only assign string.original_text = Some(string.text.clone()) when
string.original_text.is_none(), then always update string.text = demangled_str;
this preserves any existing original_text while still replacing text with the
demangled form.
- Around line 52-59: The public struct SymbolDemangler must be marked
non-exhaustive to allow future fields without breaking semver; add the attribute
#[non_exhaustive] immediately before the derive and the pub struct declaration
for SymbolDemangler (leave the existing pub struct SymbolDemangler and its
impl/new() intact) so the attribute applies to that type.
♻️ Duplicate comments (1)
src/classification/patterns/network.rs (1)
26-34: Previous issue resolved - file extensions removed from TLD list. The problematic entries (exe, dll, sys, bin, dat, log, tmp, bak) have been removed. The test suite at lines 170-181 now explicitly validates that file extensions are not misclassified as domains. This directly addresses the false positive reduction goal for binary analysis.
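Those regression tests presumably look something like this sketch (`classify_domain` follows the pattern functions named elsewhere in this review; the test name and return type are assumptions):

```rust
// Sketch of a regression test ensuring file extensions are not treated as TLDs;
// classify_domain is the patterns::network function referenced in this review.
#[test]
fn file_extensions_are_not_classified_as_domains() {
    for candidate in ["setup.exe", "kernel32.dll", "driver.sys", "core.dat"] {
        assert!(
            classify_domain(candidate).is_none(),
            "{candidate} should not be tagged as a domain"
        );
    }
}
```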
```rust
#[derive(Debug, Default, Clone)]
pub struct SymbolDemangler;

impl SymbolDemangler {
    /// Create a new instance of the symbol demangler
    #[must_use]
    pub fn new() -> Self {
        Self
```
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and read the relevant lines
if [ -f "src/classification/symbols.rs" ]; then
echo "=== File exists, reading lines 50-65 ==="
sed -n '50,65p' src/classification/symbols.rs | cat -n
else
echo "File not found at src/classification/symbols.rs"
find . -name "symbols.rs" -type f 2>/dev/null
fi

Repository: EvilBit-Labs/Stringy
Length of output: 767
🏁 Script executed:
# Also check the broader context to understand the struct definition
if [ -f "src/classification/symbols.rs" ]; then
echo "=== Checking for struct SymbolDemangler definition ==="
rg -n "struct SymbolDemangler" src/classification/symbols.rs -A 5 -B 2
fi

Repository: EvilBit-Labs/Stringy
Length of output: 372
🏁 Script executed:
# Verify if the struct is public and check for #[non_exhaustive]
if [ -f "src/classification/symbols.rs" ]; then
echo "=== Checking full context around SymbolDemangler ==="
rg -n "SymbolDemangler" src/classification/symbols.rs | head -20
fi

Repository: EvilBit-Labs/Stringy
Length of output: 1121
Add #[non_exhaustive] to the public struct.
The coding guidelines require #[non_exhaustive] on all public structs to allow future extension without semver breaks. The struct already has an explicit constructor, so add the attribute before the derive macro.
Proposed fix
```diff
+#[non_exhaustive]
 #[derive(Debug, Default, Clone)]
 pub struct SymbolDemangler;
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
+#[non_exhaustive]
 #[derive(Debug, Default, Clone)]
 pub struct SymbolDemangler;

 impl SymbolDemangler {
     /// Create a new instance of the symbol demangler
     #[must_use]
     pub fn new() -> Self {
         Self
```
🤖 Prompt for AI Agents
In `@src/classification/symbols.rs` around lines 52 - 59, The public struct
SymbolDemangler must be marked non-exhaustive to allow future fields without
breaking semver; add the attribute #[non_exhaustive] immediately before the
derive and the pub struct declaration for SymbolDemangler (leave the existing
pub struct SymbolDemangler and its impl/new() intact) so the attribute applies
to that type.
```rust
// Store original mangled form and replace with demangled
string.original_text = Some(string.text.clone());
string.text = demangled_str;
```
Do not overwrite existing original_text.
demangle unconditionally replaces original_text, which can erase prior provenance if another transform already populated it. Only set original_text when it is None.
Proposed fix
```diff
-    string.original_text = Some(string.text.clone());
+    if string.original_text.is_none() {
+        string.original_text = Some(string.text.clone());
+    }
     string.text = demangled_str;
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 // Store original mangled form and replace with demangled
-string.original_text = Some(string.text.clone());
+if string.original_text.is_none() {
+    string.original_text = Some(string.text.clone());
+}
 string.text = demangled_str;
```
🤖 Prompt for AI Agents
In `@src/classification/symbols.rs` around lines 165 - 167, The demangling code
currently unconditionally overwrites string.original_text which can erase prior
provenance; change the logic in the demangling routine (the block that sets
string.original_text and string.text) so that you only assign
string.original_text = Some(string.text.clone()) when
string.original_text.is_none(), then always update string.text = demangled_str;
this preserves any existing original_text while still replacing text with the
demangled form.
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/classification/patterns/paths.rs`:
- Around line 1-4: The paths.rs module exceeds the 500-line guideline and should
be split into smaller focused modules; create separate modules (e.g., posix.rs,
windows.rs, unc.rs, registry.rs) and move the corresponding detection
functions/structs (the POSIX path classifier, Windows path classifier, UNC path
classifier, and registry path classifier) from paths.rs into these new files,
export them from a new mod.rs or from a restructured patterns::paths mod that
re-exports pub use posix::*, pub use windows::*, etc., and update any
references/imports in the codebase to use patterns::paths::{PosixClassifier,
WindowsClassifier, UncClassifier, RegistryClassifier} (or the actual
function/struct names used) so behavior is unchanged but each file stays under
the line limit.
♻️ Duplicate comments (2)
src/classification/patterns/paths.rs (1)
235-244: Avoid accepting mixed separators in registry paths. Line 235 allows forward-slash registry paths even when backslashes are present, so mixed separators like "HKEY_LOCAL_MACHINE\SOFTWARE/mixed" are treated as valid and increase false positives. Consider requiring forward slashes only when no backslashes exist.
Proposed fix
```diff
-    if upper_text.starts_with("HKEY_")
-        && text.contains('/')
-        && let Some(slash_pos) = text.find('/')
-    {
+    if upper_text.starts_with("HKEY_")
+        && text.contains('/')
+        && !text.contains('\\')
+        && let Some(slash_pos) = text.find('/')
+    {
```

src/classification/patterns/data.rs (1)
79-133: Reject padded Base64 strings unless the total length is a multiple of 4. Line 105 only rejects remainder 1, so padded strings with length mod 4 of 2 or 3 can be incorrectly accepted and increase false positives.
Proposed fix
```diff
-    // Strip padding for length check
-    let content_len = text.len() - padding_count;
-
-    // Valid Base64 content length should avoid mod 4 remainder of 1
-    let remainder = (content_len + padding_count) % 4;
-    if remainder == 1 {
-        return None;
-    }
+    let total_len = text.len();
+    let len_mod = total_len % 4;
+    if padding_count > 0 {
+        // Padded Base64 must be a multiple of 4
+        if len_mod != 0 {
+            return None;
+        }
+    } else if len_mod == 1 {
+        // Unpadded Base64 cannot have remainder 1
+        return None;
+    }
```
```rust
//! File and registry path classification patterns
//!
//! This module provides POSIX, Windows, UNC, and registry path detection.
```
🛠️ Refactor suggestion | 🟠 Major
Split this module to stay under the 500-line limit.
This file is 504 lines, which is above the 500-line guideline. As per coding guidelines, please split the path classifiers into smaller modules.
🤖 Prompt for AI Agents
In `@src/classification/patterns/paths.rs` around lines 1 - 4, The paths.rs module
exceeds the 500-line guideline and should be split into smaller focused modules;
create separate modules (e.g., posix.rs, windows.rs, unc.rs, registry.rs) and
move the corresponding detection functions/structs (the POSIX path classifier,
Windows path classifier, UNC path classifier, and registry path classifier) from
paths.rs into these new files, export them from a new mod.rs or from a
restructured patterns::paths mod that re-exports pub use posix::*, pub use
windows::*, etc., and update any references/imports in the codebase to use
patterns::paths::{PosixClassifier, WindowsClassifier, UncClassifier,
RegistryClassifier} (or the actual function/struct names used) so behavior is
unchanged but each file stays under the line limit.
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/extraction/dedup.rs (1)
311-350: Test helper doesn't exercise the original_text field. The `create_test_string` helper always sets `original_text: None`. Consider adding a test case that verifies `original_text` propagation through deduplication, particularly for the `to_found_string()` round-trip.

Suggested test case
```rust
#[test]
fn test_original_text_preservation() {
    let mut fs = create_test_string(
        "demangled_name",
        Encoding::Utf8,
        0x100,
        None,
        StringSource::SectionData,
        vec![],
        10,
        0.8,
    );
    fs.original_text = Some("_ZN7mangled4nameE".to_string());
    let strings = vec![fs];
    let canonical = deduplicate(strings, None, true);
    assert_eq!(
        canonical[0].occurrences[0].original_text,
        Some("_ZN7mangled4nameE".to_string())
    );
    let found = canonical[0].to_found_string();
    assert_eq!(found.original_text, Some("_ZN7mangled4nameE".to_string()));
}
```

docs/src/classification.md (2)
158-160: Documentation inconsistency: lazy_static vs once_cell. Line 119 and the code example at 122-126 correctly show `once_cell::sync::Lazy`, but line 160 still references `lazy_static!`. Update for consistency with the actual implementation.

Suggested fix
```diff
-The classifier relies on `lazy_static!` to compile regex patterns once and reuse them across classification calls. Helper methods validate strings before assigning tags.
+The classifier relies on `once_cell::sync::Lazy` to compile regex patterns once and reuse them across classification calls. Helper methods validate strings before assigning tags.
```
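For context, the `once_cell` pattern the corrected sentence describes looks roughly like this (illustrative regex, not the crate's exact one):

```rust
use once_cell::sync::Lazy;
use regex::Regex;

// Compiled once on first access and reused across classification calls.
static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$")
        .expect("GUID regex is valid")
});

fn main() {
    assert!(GUID_REGEX.is_match("6ba7b810-9dad-11d1-80b4-00c04fd430c8"));
}
```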
180-192: Example code missing the `original_text` field. The `FoundString` struct now has an `original_text: Option<String>` field (per the types.rs changes). Update the example to include it for accuracy.

Suggested fix
```diff
 let found_string = FoundString {
     text: "C:\\Windows\\System32\\cmd.exe".to_string(),
+    original_text: None,
     encoding: Encoding::Ascii,
     offset: 0,
```
🤖 Fix all issues with AI agents
In `@docs/src/classification.md`:
- Around line 113-119: In the "Tag Specificity" section update the phrasing for
clarity: change "high confidence matches" to "high-confidence matches"
(hyphenate when used as a compound adjective) and tighten "and should be treated
as ambiguous" to "which should be treated as ambiguous" so the sentence reads
smoothly; ensure the rest of the paragraph (mentioning examples like URL,
domain, IP, file path, GUID, email, format string, and user agent) remains
unchanged.
In `@src/extraction/mod.rs`:
- Around line 525-536: The demangling + classification loop duplicated in the
extraction flow should be extracted into a private helper to avoid maintenance
drift: create a function (e.g., fn apply_classification(all_strings: &mut
[YourStringType]) { ... }) that instantiates SemanticClassifier::new() and
SymbolDemangler::new() and runs the same demangler.demangle(string) and
classifier.classify(string) logic while merging tags into string.tags; then
replace the duplicated blocks in the current function and in extract_canonical()
with a call to apply_classification(&mut all_strings) so both call sites share
the same implementation and behavior.
```markdown
## Tag Specificity

The semantic classifier uses cached regex patterns via `lazy_static!` and applies validation checks to reduce false positives.
Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.

## Pattern Matching Engine

The semantic classifier uses cached regex patterns via `once_cell::sync::Lazy` and applies validation checks to reduce false positives.
```
Minor grammar and clarity improvements.
Line 115: "high confidence" should be hyphenated as "high-confidence" when used as a compound adjective before a noun. Also, "and should be treated as" could be tightened to "which should be treated as" for flow.
Suggested fix
```diff
-Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.
+Tags are treated as either specific or broad. Specific tags indicate high-confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag, which should be treated as ambiguous due to higher false-positive risk.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 ## Tag Specificity

-The semantic classifier uses cached regex patterns via `lazy_static!` and applies validation checks to reduce false positives.
-Tags are treated as either specific or broad. Specific tags indicate high confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag and should be treated as ambiguous due to higher false positive risk.
+Tags are treated as either specific or broad. Specific tags indicate high-confidence matches (for example URL, domain, IP, file path, GUID, email, format string, and user agent). Base64 is a broad tag, which should be treated as ambiguous due to higher false-positive risk.

 ## Pattern Matching Engine

 The semantic classifier uses cached regex patterns via `once_cell::sync::Lazy` and applies validation checks to reduce false positives.
```
🧰 Tools
🪛 LanguageTool
[grammar] ~115-~115: Use a hyphen to join words.
Context: ...ic or broad. Specific tags indicate high confidence matches (for example URL, dom...
(QB_NEW_EN_HYPHEN)
[style] ~115-~115: Consider replacing this word to strengthen your wording.
Context: ... and user agent). Base64 is a broad tag and should be treated as ambiguous due to h...
(AND_THAT)
🤖 Prompt for AI Agents
In `@docs/src/classification.md` around lines 113 - 119, In the "Tag Specificity"
section update the phrasing for clarity: change "high confidence matches" to
"high-confidence matches" (hyphenate when used as a compound adjective) and
tighten "and should be treated as ambiguous" to "which should be treated as
ambiguous" so the sentence reads smoothly; ensure the rest of the paragraph
(mentioning examples like URL, domain, IP, file path, GUID, email, format
string, and user agent) remains unchanged.
Test modules for classification patterns have been refactored to combine similar test cases, reducing redundancy and improving maintainability. Semantic enrichment logic in extraction has been moved to a dedicated helper function. Unused deduplication parameter has been removed for clarity.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/extraction/dedup.rs (1)
64-86: Remove dead parameter `_preserve_all_occurrences` — implementation ignores documented behavior. The docstring claims this parameter controls whether to "store occurrence count instead of full metadata," but the function always preserves full metadata via `found_string_to_occurrence()`, making the parameter dead code. The underscore prefix confirms it's intentionally unused.

Removing this parameter requires updating 26 call sites (2 in src/extraction/mod.rs + tests), but it eliminates the API inconsistency and stale documentation. Alternatively, implement the documented behavior if occurrence-count-only mode is actually needed.
Suggested fix: remove unused parameter
```diff
-/// * `preserve_all_occurrences` - If false, only store occurrence count instead of full metadata
-///
 /// # Returns
 ///
 /// Vector of canonical strings sorted by combined_score (descending)
 @@
 pub fn deduplicate(
     strings: Vec<FoundString>,
     dedup_threshold: Option<usize>,
-    _preserve_all_occurrences: bool,
 ) -> Vec<CanonicalString> {
```

Update both calls in src/extraction/mod.rs (lines 544-548, 651-655) and all test invocations.
🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 20-21: Add a doc comment for the static EMAIL_REGEX (the
Lazy<Regex> named EMAIL_REGEX) explaining that this pattern is intentionally
simplified for binary/string extraction: it will match short forms like "a@b.cc"
and common unquoted local-parts but does not support quoted local-parts, some
valid edge cases (e.g., certain plus/escape forms and RFC 5322 full syntax), nor
internationalized domain names; state the tradeoff (reduced false positives vs.
not fully RFC-compliant) so future readers know the limitation and why this
simplified regex was chosen.
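In practice the requested doc comment might read like this (the pattern shown is illustrative, not the module's actual regex):

```rust
use once_cell::sync::Lazy;
use regex::Regex;

/// Intentionally simplified email pattern for binary/string extraction.
///
/// Matches short forms like "a@b.cc" and common unquoted local parts, but does
/// not support quoted local parts, full RFC 5322 syntax, or internationalized
/// domain names. The tradeoff favors fewer false positives over completeness.
static EMAIL_REGEX: Lazy<Regex> = Lazy::new(|| {
    // Illustrative pattern only.
    Regex::new(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$").expect("valid regex")
});

fn main() {
    assert!(EMAIL_REGEX.is_match("a@b.cc"));
}
```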
In `@src/classification/patterns/ip.rs`:
- Around line 117-145: The function is_ipv6_address currently never uses
IPV6_REGEX; update it to apply the IPV6_REGEX pre-filter after handling
bracketed addresses (after strip_port and strip_ipv6_brackets) before the
character checks and Ipv6Addr::from_str validation, similar to is_ipv4_address,
so inputs failing the regex are rejected early; alternatively, if the regex is
unused elsewhere, remove the unused IPV6_REGEX definition—refer to
is_ipv6_address, IPV6_REGEX, strip_port, strip_ipv6_brackets, and
Ipv6Addr::from_str when making the change.
- Around line 24-26: The IPV6_REGEX constant currently holds a massive,
hard-to-maintain pattern; replace it with a much simpler pre-filter regex (e.g.,
allow only hex digits, colons, and optional IPv4-mapped suffix) and then
validate canonical correctness with Ipv6Addr::from_str inside the code paths
that use IPV6_REGEX, or if you must keep the complex regex for performance, add
a comment above pub(crate) static IPV6_REGEX describing the performance tradeoff
and why the full pattern is required; update usages to fall back to
Ipv6Addr::from_str for authoritative validation.
- Around line 41-42: The IPV6_BRACKETS_REGEX is currently greedy and can capture
trailing text; update its pattern to only capture the content inside the
brackets and anchor the match so it doesn't include characters after the closing
bracket (e.g. use a pattern like r"^\[([^\]]+)\]$" or r"^\[([^\]]+)\]" to
capture only what's between '[' and ']' in the static IPV6_BRACKETS_REGEX
Lazy<Regex> declaration).
In `@src/classification/patterns/network.rs`:
- Around line 46-54: The has_valid_tld function currently treats a domain ending
with '.' as having an empty TLD (which falls through to false); explicitly
handle trailing dots by either returning false when domain.ends_with('.') or by
trimming a single trailing '.' before computing tld (update has_valid_tld to
check for domain.ends_with('.') and return false, or strip the trailing dot then
use rfind and COMMON_TLDS lookup) so the behavior is explicit and documented in
the function comment.
In `@src/classification/patterns/paths.rs`:
- Around line 329-341: The two functions use different match semantics which is
intentional but not documented: is_suspicious_windows_path uses prefix
(starts_with) matching while is_suspicious_registry_path uses substring
(contains) matching; add a brief doc comment above each function
(is_suspicious_windows_path and is_suspicious_registry_path) clarifying the
rationale (windows paths should be matched by prefix like "C:\Windows\…" whereas
registry keys may appear anywhere in the path and thus use substring matching)
so future readers understand the asymmetry.
- Around line 291-301: Add a unit test that ensures classify_unc_path returns
None for a UNC root with a trailing backslash (e.g., "\\server\\") to cover the
case where UNC_PATH_REGEX matches but the share name is missing; create a test
function (e.g., test_classify_unc_path_missing_share) that calls
classify_unc_path with that string and asserts the result is None, referencing
classify_unc_path and UNC_PATH_REGEX as the functions/regex under test.
♻️ Duplicate comments (3)
docs/src/classification.md (1)
115-115: Hyphenate compound adjective and tighten clause.
This matches a previously noted wording fix. Please hyphenate "high-confidence" and tighten "and should be treated as" to "which should be treated as".src/classification/patterns/data.rs (1)
111-130: Multiple iterations over text for character class checks.Lines 112-114 iterate the string three separate times. For hot paths with long strings, a single-pass approach is more efficient.
Single-pass character diversity check
- // Check for character diversity typical of Base64 - let has_upper = text.chars().any(|c| c.is_ascii_uppercase()); - let has_lower = text.chars().any(|c| c.is_ascii_lowercase()); - let has_digit = text.chars().any(|c| c.is_ascii_digit()); + // Check for character diversity typical of Base64 (single pass) + let (mut has_upper, mut has_lower, mut has_digit) = (false, false, false); + for c in text.chars() { + if c.is_ascii_uppercase() { + has_upper = true; + } else if c.is_ascii_lowercase() { + has_lower = true; + } else if c.is_ascii_digit() { + has_digit = true; + } + // Early exit if all flags are set + if has_upper && has_lower && has_digit { + break; + } + }src/classification/patterns/paths.rs (1)
235-244: Forward-slash registry path logic is overly permissive.The condition
upper_text.starts_with("HKEY_") && text.contains('/')matches if a forward slash appears anywhere, including in value names. For example,HKEY_LOCAL_MACHINE\SOFTWARE\Path/Namewould incorrectly trigger this branch and extractHKEY_LOCAL_MACHINE\SOFTWARE\Pathas the root (wrong separator position).Stricter forward-slash handling
// Also accept paths that use forward slashes (some tools do this) - if upper_text.starts_with("HKEY_") - && text.contains('/') - && let Some(slash_pos) = text.find('/') - { + // Only if the path uses forward slashes consistently (no backslashes before first slash) + if upper_text.starts_with("HKEY_") && let Some(slash_pos) = text.find('/') { + // Ensure no backslash appears before the first forward slash + if text[..slash_pos].contains('\\') { + return false; + } let root = &upper_text[..slash_pos]; if VALID_REGISTRY_ROOTS.contains(root) { return true; } }
- Updated email regex to simplify matching for binary and string extraction.
- Refined IPv6 regex to allow only valid characters and improved validation checks.
- Added validation for trailing dots in TLD checks.
- Enhanced suspicious path checks with clearer matching strategies for Windows and registry paths.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/classification/patterns/data.rs`:
- Around line 101-116: The current padding logic only counts trailing '=' but
allows '=' in the middle and doesn't require overall length divisibility when
padding is present; update the check around variables padding_count, content_len
and remainder so that you first verify there are no '=' characters in the
non-padded prefix (i.e., ensure text[..content_len] contains no '='), and then
if padding_count > 0 enforce that text.len() % 4 == 0 (instead of the loose
remainder check); keep the existing padding_count <= 2 constraint and maintain
the remainder == 1 early-out for the no-padding case.
In `@src/classification/patterns/paths.rs`:
- Around line 215-244: The is_valid_registry_path function uses
text.to_uppercase(), which can expand Unicode and invalidate byte indices when
later slicing with positions from the original text; change both calls that
create upper_text to text.to_ascii_uppercase() so byte lengths remain stable
(Windows registry roots are ASCII), ensuring the subsequent find('\\') and
find('/') positions from the original text remain valid when you slice
upper_text[..slash_pos] and check VALID_REGISTRY_ROOTS.
```rust
// Count padding characters
let padding_count = text.chars().rev().take_while(|&c| c == '=').count();

// Padding should be at most 2 characters
if padding_count > 2 {
    return None;
}

// Strip padding for length check
let content_len = text.len() - padding_count;

// Valid Base64 content length should avoid mod 4 remainder of 1
let remainder = (content_len + padding_count) % 4;
if remainder == 1 {
    return None;
}
```
Padding validation can accept invalid Base64 shapes.
Only trailing padding is counted. Strings with '=' in the middle or padded strings whose total length is not a multiple of 4 can slip through, increasing false positives. Add a check that padding is only at the end and enforce len%4==0 when padding exists.
Proposed fix
```diff
 // Padding should be at most 2 characters
 if padding_count > 2 {
     return None;
 }

-// Strip padding for length check
-let content_len = text.len() - padding_count;
-
-// Valid Base64 content length should avoid mod 4 remainder of 1
-let remainder = (content_len + padding_count) % 4;
-if remainder == 1 {
-    return None;
-}
+// Padding must only appear at the end
+let padding_start = text.len() - padding_count;
+if text[..padding_start].contains('=') {
+    return None;
+}
+
+// Padded Base64 must have total length multiple of 4
+let remainder = text.len() % 4;
+if padding_count > 0 {
+    if remainder != 0 {
+        return None;
+    }
+} else if remainder == 1 {
+    return None;
+}
```

🤖 Prompt for AI Agents
In `@src/classification/patterns/data.rs` around lines 101 - 116, The current
padding logic only counts trailing '=' but allows '=' in the middle and doesn't
require overall length divisibility when padding is present; update the check
around variables padding_count, content_len and remainder so that you first
verify there are no '=' characters in the non-padded prefix (i.e., ensure
text[..content_len] contains no '='), and then if padding_count > 0 enforce that
text.len() % 4 == 0 (instead of the loose remainder check); keep the existing
padding_count <= 2 constraint and maintain the remainder == 1 early-out for the
no-padding case.
```rust
/// Validates a registry path
pub fn is_valid_registry_path(text: &str) -> bool {
    let upper_text = text.to_uppercase();

    // Check for full registry root
    if upper_text.starts_with("HKEY_") {
        // Extract root key
        if let Some(slash_pos) = text.find('\\') {
            let root = &upper_text[..slash_pos];
            if VALID_REGISTRY_ROOTS.contains(root) {
                return true;
            }
        }
    }

    // Check for abbreviated forms (case-insensitive)
    if REGISTRY_ABBREV_REGEX.is_match(text) {
        return true;
    }

    // Also accept paths that use forward slashes (some tools do this)
    if upper_text.starts_with("HKEY_")
        && text.contains('/')
        && let Some(slash_pos) = text.find('/')
    {
        let root = &upper_text[..slash_pos];
        if VALID_REGISTRY_ROOTS.contains(root) {
            return true;
        }
    }
```
🧩 Analysis chain
🏁 Script executed:
# First, let's find the file and examine the full registry validation function
fd -t f "paths.rs" src/classification/patterns/
# Get the file size to see if we can read it entirely
wc -l src/classification/patterns/paths.rs

Repository: EvilBit-Labs/Stringy
Length of output: 141
🏁 Script executed:
# Look at the specific function and surrounding context
cat -n src/classification/patterns/paths.rs | sed -n '200,250p'

Repository: EvilBit-Labs/Stringy
Length of output: 1736
🏁 Script executed:
# Search for all uses of to_uppercase in the codebase
rg "to_uppercase" src/Repository: EvilBit-Labs/Stringy
Length of output: 142
🏁 Script executed:
# Check if there are any tests for the registry validation function
rg "is_valid_registry_path" --type rustRepository: EvilBit-Labs/Stringy
Length of output: 742
🏁 Script executed:
# Let's verify the behavior of to_uppercase vs to_ascii_uppercase with a simple test
python3 << 'EOF'
# Test character expansion with to_uppercase() in Unicode
import unicodedata
# Test ASCII characters
ascii_test = "HKEY_LOCAL_MACHINE"
print(f"ASCII string: '{ascii_test}'")
print(f"Length: {len(ascii_test)}")
# Test if certain characters expand when uppercased
test_chars = ['a', 'ß', 'ñ', '\\', '/']
for char in test_chars:
try:
upper = char.upper()
print(f"'{char}' -> '{upper}' (was {len(char)} byte(s), now {len(upper.encode('utf-8'))} byte(s))")
except Exception as e:
print(f"Error with '{char}': {e}")
# Key insight: bytes vs characters
test_str = "HKEY_ß\\" # Contains a character that expands
print(f"\nTest string with ß: '{test_str}'")
print(f"UTF-8 byte length: {len(test_str.encode('utf-8'))}")
pos = test_str.find('\\')
print(f"Position of first \\: {pos}")
print(f"Upper version byte length: {len(test_str.upper().encode('utf-8'))}")
EOF

Repository: EvilBit-Labs/Stringy
Length of output: 150
🏁 Script executed:
# Check if the VALID_REGISTRY_ROOTS constant is defined and what it contains
rg "VALID_REGISTRY_ROOTS" src/classification/patterns/paths.rs -A 10 -B 2Repository: EvilBit-Labs/Stringy
Length of output: 962
🏁 Script executed:
# Check the REGISTRY_ABBREV_REGEX to understand what patterns are validated
rg "REGISTRY_ABBREV_REGEX" src/classification/patterns/paths.rs -A 5 -B 2Repository: EvilBit-Labs/Stringy
Length of output: 669
🏁 Script executed:
# Let's check how this function is called - is there any input validation upstream?
rg "is_valid_registry_path" src/ -B 5 -A 2Repository: EvilBit-Labs/Stringy
Length of output: 3141
🏁 Script executed:
# Let's check the actual byte/character expansion risk by examining valid registry root strings
cat > /tmp/test_uppercase.rs << 'EOF'
fn main() {
let test = "HKEY_LOCAL_MACHINE\\subkey";
let upper = test.to_uppercase();
let ascii_upper = test.to_ascii_uppercase();
println!("Original: {} bytes", test.len());
println!("to_uppercase: {} bytes", upper.len());
println!("to_ascii_uppercase: {} bytes", ascii_upper.len());
// Now test with potential non-ASCII before slash
let test_unicode = "HKEY_ß\\subkey";
let upper_uni = test_unicode.to_uppercase();
let ascii_upper_uni = test_unicode.to_ascii_uppercase();
println!("\nWith ß character:");
println!("Original: {} bytes", test_unicode.len());
println!("to_uppercase: {} bytes", upper_uni.len());
println!("to_ascii_uppercase: {} bytes", ascii_upper_uni.len());
// Find positions
if let Some(pos) = test_unicode.find('\\') {
println!("Position of \\: {}", pos);
println!("to_uppercase slicing [..{}]: OK? {}", pos, pos <= upper_uni.len());
println!("to_ascii_uppercase slicing [..{}]: OK? {}", pos, pos <= ascii_upper_uni.len());
}
}
EOF
rustc /tmp/test_uppercase.rs -o /tmp/test_uppercase && /tmp/test_uppercase

Repository: EvilBit-Labs/Stringy
Length of output: 110
🏁 Script executed:
# Check what types of strings are passed to this function from semantic classification
rg "is_valid_registry_path\|classify_registry_path" src/ -B 3 -A 3Repository: EvilBit-Labs/Stringy
Length of output: 46
🏁 Script executed:
# Check if there are any existing tests that demonstrate the behavior
fd -t f "test" src/classification/ -o rg "registry" tests/ src/ --type rustRepository: EvilBit-Labs/Stringy
Length of output: 187
🏁 Script executed:
# Let's look more carefully at what types of strings are being processed
# Check the extraction context to see what kinds of inputs reach this function
rg "RegistryPath|registry" src/ -B 2 -A 2 | head -50Repository: EvilBit-Labs/Stringy
Length of output: 2995
🏁 Script executed:
# Let's examine if there's any validation or filtering before these strings are classified
rg "classify_registry_path|is_valid_registry_path" src/classification/ -B 5 -A 5Repository: EvilBit-Labs/Stringy
Length of output: 7669
🏁 Script executed:
# Check if the issue exists with a simple logic trace
cat > /tmp/analyze.py << 'EOF'
# Simulate the potential issue
# Test case 1: ASCII only (normal case)
test1 = "HKEY_LOCAL_MACHINE\\Software"
pos1 = test1.find('\\')
upper1 = test1.upper()
upper1_ascii = test1.upper() # for ASCII it's the same
print(f"Test 1 (ASCII): '{test1}'")
print(f" Position of \\: {pos1}")
print(f" Length original: {len(test1)}, uppercase: {len(upper1)}")
print(f" Slice [..{pos1}] in uppercase: '{upper1[:pos1]}'")
print(f" Match: {upper1[:pos1] == 'HKEY_LOCAL_MACHINE'}")
# Test case 2: Non-ASCII before delimiter (edge case)
test2 = "HKEYß\\Software"
pos2 = test2.find('\\')
upper2 = test2.upper()
print(f"\nTest 2 (with ß): '{test2}'")
print(f" Bytes in original: {[c for c in test2]}")
print(f" Position of \\ (in original): {pos2}")
print(f" Length original: {len(test2)}, uppercase: {len(upper2)}")
print(f" Upper bytes: {[c for c in upper2]}")
try:
print(f" Slice [..{pos2}] in uppercase: '{upper2[:pos2]}'")
print(f" Compare to HKEY_: {upper2[:pos2].startswith('HKEY_')}")
except Exception as e:
print(f" ERROR: {e}")
# Test case 3: Using ASCII uppercase on non-ASCII
print(f"\nTest 3 (ASCII uppercase behavior):")
test3 = "HKEYß\\Software"
pos3 = test3.find('\\')
# In Rust, to_ascii_uppercase() only uppercases ASCII
ascii_upper = ''.join(c.upper() if c.isascii() else c for c in test3)
print(f" ASCII upper of '{test3}': '{ascii_upper}'")
print(f" Slice [..{pos3}]: '{ascii_upper[:pos3]}'")
EOF
python3 /tmp/analyze.py

Repository: EvilBit-Labs/Stringy
Length of output: 693
🏁 Script executed:
# Let's look at the actual registry path test to see if non-ASCII is expected
rg "test_registry_path|test_classify_registry_path" src/ -A 10Repository: EvilBit-Labs/Stringy
Length of output: 1554
🏁 Script executed:
# Check if there are any other slicing operations with position indices in the paths module
rg "find\(" src/classification/patterns/paths.rs -B 2 -A 3Repository: EvilBit-Labs/Stringy
Length of output: 534
🏁 Script executed:
# Let's verify the Unicode expansion concern with a real example
cat > /tmp/verify_panic.rs << 'EOF'
// Simulate what happens with the current code
fn main() {
// Example of character that could cause issue
// Looking for strings that expand when uppercased
// Test: some Unicode characters do expand
// For instance, certain ligatures or special characters
let test = "HKEY\u{00DF}\\Software"; // ß character
println!("Original string: {} bytes", test.len());
println!("Bytes: {:?}", test.as_bytes());
let upper = test.to_uppercase();
println!("Uppercase string: {} bytes", upper.len());
println!("Bytes: {:?}", upper.as_bytes());
let pos = test.find('\\').unwrap();
println!("Position of \\: {} (in original)", pos);
// This is what the code does - use original position on uppercase string
println!("Slicing uppercase[..{}]: '{}'", pos, &upper[..pos]);
// With to_ascii_uppercase
let ascii_upper = test.to_ascii_uppercase();
println!("\nASCII uppercase: {} bytes", ascii_upper.len());
println!("Slicing ascii_upper[..{}]: '{}'", pos, &ascii_upper[..pos]);
}
EOF
# Try to compile it to check for actual panic risk
cat /tmp/verify_panic.rs

Repository: EvilBit-Labs/Stringy
Length of output: 1121
🏁 Script executed:
# Confirm the current implementation one more time
sed -n '215,247p' src/classification/patterns/paths.rs

Repository: EvilBit-Labs/Stringy
Length of output: 976
Fix Unicode byte expansion issue in registry path validation.
Line 217 uses to_uppercase(), which can expand certain Unicode characters and change byte length. Since the code then slices upper_text[..slash_pos] with indices from the original string, this can cause a panic if non-ASCII characters precede the delimiter.
Use to_ascii_uppercase() instead, which preserves byte positions for ASCII characters. This is semantically correct since Windows registry root keys are strictly ASCII.
Proposed fix
```diff
-    let upper_text = text.to_uppercase();
+    let upper_text = text.to_ascii_uppercase();
```

🤖 Prompt for AI Agents
In `@src/classification/patterns/paths.rs` around lines 215 - 244, The
is_valid_registry_path function uses text.to_uppercase(), which can expand
Unicode and invalidate byte indices when later slicing with positions from the
original text; change both calls that create upper_text to
text.to_ascii_uppercase() so byte lengths remain stable (Windows registry roots
are ASCII), ensuring the subsequent find('\\') and find('/') positions from the
original text remain valid when you slice upper_text[..slash_pos] and check
VALID_REGISTRY_ROOTS.
This pull request introduces a new modular pattern classification system for detecting various data formats, including IP addresses, GUIDs, emails, Base64 data, format strings, and user agent strings. The changes add new pattern classification modules, refactor existing capabilities to use these modules, and update dependencies for improved performance and maintainability.
New Pattern Classification Modules
- A new `patterns` module in `src/classification` contains submodules for different types of pattern classification: `ip`, `network`, `paths`, and `data`. Each submodule provides functions and regex patterns for detecting specific formats such as IP addresses, URLs, file paths, GUIDs, emails, Base64, format strings, and user agents. [1] [2] [3]
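A plausible shape for `src/classification/patterns/mod.rs` under that layout (module names follow this summary; the exact re-export list is an assumption):

```rust
// Hypothetical sketch of patterns/mod.rs: submodules plus flat re-exports.
pub mod data;
pub mod ip;
pub mod network;
pub mod paths;

// Re-export the classifier entry points so callers can write
// `patterns::classify_url(..)` instead of reaching into submodules.
pub use data::{classify_base64, classify_email, classify_guid};
pub use ip::{classify_ipv4, classify_ipv6};
pub use network::{classify_domain, classify_url};
pub use paths::{classify_file_path, classify_registry_path};
```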
Expanded Data Format Detection

- Detection of GUIDs, emails, Base64, format strings, and user agents lives in `src/classification/patterns/data.rs`, with comprehensive regex-based detection and thorough unit tests.
- Module documentation in `src/classification/mod.rs` was updated to reflect the expanded capabilities, including symbol demangling and new data format detection.

Improved IP Address Detection

- IPv4 and IPv6 detection is implemented in `src/classification/patterns/ip.rs`, including support for port suffixes and bracketed IPv6 addresses, with validation using both regex and standard library types.

Refactoring and API Changes

- Refactored `src/classification/mod.rs` to use the new `patterns` module, and re-exported key classification functions for easier access and future extensibility. [1] [2]

Dependency Updates

- Replaced `lazy_static` with `once_cell` for more efficient static initialization, added `rustc-demangle` for symbol demangling, and updated several dependencies for improved stability and performance in `Cargo.toml`.