filters: fix escaped brackets inside *[...] character classes #440
Merged
The escape branch in strjoker probed joker[i+2] instead of the current
char, so a backslash escape only worked as the first class member:
'*[\[\]]' (documented as "the [ or ] character") matched only ']', and
'*[a,\[]' dropped the 'a'. The loop also treated any ']' as the class
terminator, so an escaped ']' could never be a member.
Decode the escape first in the loop body: a backslash takes the next char
as the literal member (only that char, not also the backslash the old code
added), and an escaped ']' is consumed before the terminator check. So
'*[\[\]]' now matches both brackets, and escape precedes the range/size
checks ('\-' '\,' '\<' become literal members). The self-test previously
pinned the buggy output as expected; it now asserts the documented
behavior and fails against the old matcher.
Closes #148
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
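For readers without the diff at hand, the escape-first ordering described above can be sketched roughly as follows. This is an illustration only, not the actual strjoker code: `class_has` is a hypothetical helper that scans a class body (the text just after the `[`) and handles only literal members, comma separators and escapes, leaving out ranges and size checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (NOT the real strjoker): report whether `c` is a
   member of the class whose body starts at `cls`, just after the '['. */
bool class_has(const char *cls, char c) {
  bool found = false;
  size_t i = 0;

  while (cls[i] != '\0') {
    if (cls[i] == '\\' && cls[i + 1] != '\0') {
      /* Escape is decoded first: the next char is a literal member, even
         when it is ']', '-' or ','. The backslash itself is not added. */
      if (cls[i + 1] == c)
        found = true;
      i += 2;
    } else if (cls[i] == ']') {
      break; /* only an unescaped ']' terminates the class */
    } else if (cls[i] == ',') {
      i++; /* member separator */
    } else {
      if (cls[i] == c)
        found = true;
      i++;
    }
  }
  return found;
}
```

Under these assumptions, `class_has("\\[\\]]", '[')` and `class_has("\\[\\]]", ']')` are both true, and `class_has("a,\\[]", 'a')` keeps the `a`, because the terminator check is reached only for a `]` that no backslash has already consumed.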
The *[...] class parser's range arm does i += 3 unconditionally, so a
pattern ending in a dangling '-' (e.g. '*[a-') read one byte past the NUL:
joker[i+2] is the NUL, i jumps to len+1, and the separator skip and loop
guard then read joker[len+1]. Guard the range arm on joker[i+2] != '\0' so
a truncated range falls through to the literal-member path instead of
overshooting.

The filter self-test now copies the pattern and string into exact-size heap
buffers so a sanitizer traps such over-reads; the pattern previously came
straight from argv (no redzone), which is why this stayed invisible. A
'*[a-' test case exercises it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
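A rough sketch of the two ideas in this commit, under stated assumptions: `class_has_range` is a hypothetical, simplified class scan (not the real strjoker, and without escape handling) that shows the guarded range arm, and `exact_copy` is a hypothetical helper showing the exact-size heap copy that lets a tool such as AddressSanitizer place a redzone directly after the NUL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified class scan (NOT the real strjoker): `cls` is the
   class body just after the '['. Escapes are left out to keep it short. */
bool class_has_range(const char *cls, char c) {
  const unsigned char uc = (unsigned char) c;
  bool found = false;
  size_t i = 0;

  while (cls[i] != '\0' && cls[i] != ']') {
    const unsigned char lo = (unsigned char) cls[i];

    if (cls[i] == ',') { /* member separator */
      i++;
      continue;
    }
    /* Range arm, guarded: take it only when a third char exists. cls[i + 2]
       is read only after cls[i + 1] is known to be '-', so never past NUL. */
    if (cls[i + 1] == '-' && cls[i + 2] != '\0') {
      const unsigned char hi = (unsigned char) cls[i + 2];
      if (lo <= uc && uc <= hi)
        found = true;
      i += 3; /* lands at most on the NUL, never beyond it */
    } else {
      /* Truncated range such as "a-" falls through to a literal member. */
      if (lo == uc)
        found = true;
      i++;
    }
  }
  return found;
}

/* Copy `s` into a heap buffer of exactly strlen(s) + 1 bytes, so that any
   read past the NUL hits a sanitizer redzone. The caller frees the result. */
char *exact_copy(const char *s) {
  const size_t n = strlen(s) + 1;
  char *p = malloc(n);
  if (p != NULL)
    memcpy(p, s, n);
  return p;
}
```

With the guard in place, scanning an `exact_copy("a-")` buffer treats `a` and `-` as two literal members and stops on the NUL; without it, the same scan would step to index 3 of a 3-byte allocation, which is the over-read a sanitizer build is meant to trap.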
Escaping a bracket inside a `*[...]` filter class was broken: the matcher's escape branch read two chars ahead of the current position, so a backslash only took effect on the first class member. The documented `*[\[\]]` ("the [ or ] character") matched only `]`, `*[a,\[]` silently dropped the `a`, and because the loop stopped at the first `]` even when escaped, an escaped `]` could never be a member. The fix decodes the escape first in the loop body, so a backslash takes the next char as a literal member, an escaped `]` is consumed before the terminator check, and escaping runs ahead of the range and size checks (`\-`, `\,`, `\<` are literal). `*[\[\]]` now matches both brackets as the guide claims.

A self-test already exercised this corner, but its assertions pinned the buggy output as expected (it even flagged #148 as a known quirk). They now assert the documented behavior and fail against the old matcher; the guide example moves to the comma form it documents.

Reviewing this with review-recipe surfaced a separate, pre-existing 1-byte heap over-read in the same loop: a truncated range like `*[a-` ran `i += 3` off the end and then read past the NUL. The second commit guards the range arm on a non-NUL third char, and reworks the filter self-test to copy patterns and strings into exact-size heap buffers so a sanitizer catches that class of over-read (it was invisible before because the pattern came straight from argv, which has no redzone). A `*[a-` case exercises it.

Closes #148