V8 regexp.c: add support for character classes [:NAME:]#732
V8 regexp.c: add support for character classes [:NAME:]#732gwsw merged 5 commits intogwsw:masterfrom
Conversation
|
About case-sensitivity with As the last commit message says, there's a known issue with This can be fixed in one of few ways:
diff --git a/regexp.c b/regexp.c
index 8a44b5f..9a35caa 100644
--- a/regexp.c
+++ b/regexp.c
@@ -542,6 +542,8 @@ cclass_idx2(constant char *s2)
return -1; /* not special as far as we can tell */
}
+extern int is_caseless;
+
/*
* called with regparse pointing to right after an opening bracket '[' .
* emits ANYOF/ANYBUT node with a set of chars and (on success) final '\0'.
@@ -586,7 +588,7 @@ regbracket()
} else { /* idx is a valid index, add class chars */
constant char *c = cclasses[idx].chars;
for (; *c; ++c)
- regc(*c);
+ regc(is_caseless ? tolower(*c) : *c);
regparse += cclasses[idx].len;
if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')(or a more correct but slightly bigger patch which doesn't evaluate This works for Note that even if it would recompile the pattern unconditionally when chagning However, this So the patch above or similar, possibly with unconditional regcomp when changing |
Note that this issue is not limited to So |
However, this should be reasonably simple to handle, simply by making all the chars added within brackets take For instance, instead of the patch above which only covers diff --git a/regexp.c b/regexp.c
index 8a44b5f..47a4350 100644
--- a/regexp.c
+++ b/regexp.c
@@ -542,6 +542,28 @@ cclass_idx2(constant char *s2)
return -1; /* not special as far as we can tell */
}
+/* when less is case-insensitive, and the RE engine doesn't support REG_ICASE
+ * or equivalent (like this engine), then it converts the pattern to lowercase
+ * before compiling it, and the text to lowercase before matching it.
+ * this works in general, and even with some ranges like [A-Z], but doesn't
+ * work if the range matches upper case chars, but itself doesn't have any
+ * upper case, like [@-_] or [[:upper:]].
+ * so bracket handling uses this regc_bracket which does take the case
+ * sensitivity of "less" into account in brackets, even when the pattern is
+ * without upper case chars but still matches upper case chars.
+ *
+ * TODO: we could _probably_ do that also for non-bracket parts of the pattern,
+ * and if we do, then "less" can avoid converting the pattern to lower
+ * case (but it would still need to convert the text before match).
+ */
+extern int is_caseless; /* case-insensitivity state in "less" */
+
+static inline void
+regc_brckt(char c)
+{
+ regc(is_caseless ? tolower(c) : c);
+}
+
/*
* called with regparse pointing to right after an opening bracket '[' .
* emits ANYOF/ANYBUT node with a set of chars and (on success) final '\0'.
@@ -576,17 +598,17 @@ regbracket()
if (clss > classend+1)
FAIL("invalid [] range");
for (; clss <= classend; clss++)
- regc(clss);
+ regc_brckt(clss);
regparse++;
}
} else if ((idx = CCLASS_IDX(regparse)) == -1) {
- regc(*regparse++);
+ regc_brckt(*regparse++);
} else if (idx == -2) {
FAIL("unknown [:CLASS:] name");
} else { /* idx is a valid index, add class chars */
constant char *c = cclasses[idx].chars;
for (; *c; ++c)
- regc(*c);
+ regc_brckt(*c);
regparse += cclasses[idx].len;
if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')I didn't check how less converts the the text/pattern to lower, and if it also uses However, for the It needs some thought, but largely the case sensitivity subject is orthogonal to this PR, and if we decide to improve it, the it should not be limited to the |
|
I haven't fully digested this yet, but I will mention that the recompilation of a pattern after changing -i/-I happens in chg_caseless() in search.c. If you want to unconditionally recompile after changing the option, you can just delete lines 2037-2039. |
|
Thanks. Indeed I now saw where it skips the recompilation, and also now noticed the case macros in But the bottom line is still that the case-related issues are interesting and probably addressable at least to some degree, but are actually off topic for this PR, and also can probably be considered uncommon edge cases. Both issues relate to patterns which match upper case letters, but do not themselves contain uppercase letters, like The 1st issue is that when "less" wants to do caseless search, then such patterns are broken with RE engines which don't support REG_ICASE, like V8, because converting this pattern to lowercase has no effect on the chars which it matches. This doesn't have a good general solution, but probably can have a good solution with the local V8 specifically (but not in this PR, it's a new subject). The 2nd issue with such patterns affects all RE engines, and it's that with "smart case" ( Granted, it doesn't actually contradict the docs, which say explicitly that However, I don't think there's a good solution for this, so it's more of just realizing that this issue exists. |
|
Force-pushed without any code changes, but with commit messages changes: For the "fix warning" commit, it now says why this solution is correct. The fact that I missed when I originally wrote this message is that For the "support character classes" commit, it now also says that the |
|
If you intend to merge it, and don't intend to rebase it on top of master yourself, then let me know and I'll rebase it before you merge. |
|
If you want, I have a different variant of the classes support, which saves few hundreds bytes of data by reusing the bracket parsing code for the classes data - instead of each class holding a copy of all its chars, which looks like this: static constant cclass cclasses[] = {
/* first two chars of id are ignored, and expected to be "[:" */
{9, "[:alnum:]", "0-9A-Za-z"},
{9, "[:alpha:]", "A-Za-z"},
{9, "[:blank:]", "\t "},
{9, "[:cntrl:]", "\x01-\x1f\x7f"},
{9, "[:digit:]", "0-9"},
{9, "[:graph:]", "\x21-\x7e"},
{9, "[:lower:]", "a-z"},
{9, "[:print:]", "\x20-\x7e"},
{9, "[:punct:]", "!-/:-@[-`{|}~"},
{9, "[:space:]", "\t\n\v\f\r "},
{9, "[:upper:]", "A-Z"},
{10, "[:xdigit:]", "0-9A-Fa-f"},
};To reuse the code cleanly, instead of extracting the whole static constant char *
regbracket(constant char *regparse)
{
register int clss;
register int classend;
int idx;
if (*regparse == ']' || *regparse == '-')
regc(*regparse++);
while (*regparse != '\0' && *regparse != ']') {
/* ... identical code as in this PR */
} else { /* idx is a valid index, add class chars */
if (!regbracket(cclasses[idx].chars))
FAIL("internal error");
regparse += cclasses[idx].len;
if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')
FAIL("invalid [] range start"); /* [:NAME:]-X */
}
}
return regparse;
}And the remaining case '[':
if (*regparse == '^') { /* Complement of range. */
ret = regnode(ANYBUT);
regparse++;
} else
ret = regnode(ANYOF);
regparse = regbracket(regparse);
if (regparse == NULL)
return(NULL);
regc('\0');
if (*regparse != ']')
FAIL("unmatched []");
regparse++;
*flagp |= HASWIDTH|SIMPLE;
break;( Let me know if you have a preference between the two approaches. |
|
I don't think I care much about a couple of hundred bytes of data, but performance would be more of a concern. Do you think there is a significant performance difference between the two approaches that you mention? If there's no performance difference either, the second approach (using ranges instead of the full list of chars) seems more readable. |
|
Performance difference seems to be within the noise level (~ 5%). If anything, the recursing variant is a bit faster, maybe because it does less memory access? (it iterates ranges instead of fetching all chars from the data). I tested it in isolation, with modified #ifdef ALLCHARS
const char *c = cclasses[idx].chars;
for (; *c; ++c)
regc(*c);
#else
if (!regbracket(cclasses[idx].chars))
FAIL("internal error");
#endif(and obviously the cclasses array also differs depending on ALLCHARS) I tested with all the classes in one bracket string (confirming that both forms do register all the chars correctly - this is 459 chars - the v8 regcomp doesn't remove duplicates in brackets): regbracket("[:alnum:][:alpha:][:blank:][:cntrl:]"
"[:digit:][:graph:][:lower:][:print:]"
"[:punct:][:space:][:upper:][:xdigit:]");Calling this million times in a loop and measuring the runtime. Both are the same duration - about 1s. Regardless, even if there was a difference, it's only at So it would have been negligible even if it was 10x slower, but the 2nd (recursive) variant actually turned out to be very slightly faster according to my tests (both with -O2 and -O0). EDIT: with Shall I update the PR to do the recursive variants? |
Yes, please do that. |
This is a followup to commit c8c01e3, and does the same (change K&R style to C89/99) also for the functions regprop and regdump - which are only compiled when DEBUG is defined. Also, regprop was declared twice, so remove one of them. Also, the reprop implementation returns const char*, but the declaration returned plain char*, so fix the declaration. Also, regdump is not static and is intended for extrenal use (it's not used in regexp.c even when DEBUG is defined), so move its declaration to regexp.h - still inside #ifdef DEBUG . Finally, add a changelog note, with the date of commit c8c01e3 .
When compiling with DEBUG defined, two ptrdiff_t values were printed as "%d" format, which resulted in a warning in clang (but not gcc). This was an issue in 64 bit builds, as a pointer (diff) is 64 bit while int is typically 32. The most correct solution would be to use %td (ptrdiff_t format), which is not supported in c89, but we have another correct solution which does work in c89 or later: Use %ld and cast the value to long - which is what this commit does. This works great for us, because eventhough int and long are typically the same size (32 bit), long in c89 and later guarantees 32 bit, while int in c89 or later only guarantees 16 bits. The actual value in practice is currently limited to 32K (regcomp aborts if regsize >= 32767), so %ld with cast to long covers our current needs, and still allows this limit to be raised up to 2M without breaking the correctness of this code in c89 or later. For what it's worth, regdump works and seems useful.
|
Rebased and updated to the recursive form. Also removed the macro which tested And finally, a small unrelated commit (just like the first two commits) which replaces NIH If you want me to move commits which are not related to brackets (1st, 2nd, 5th commits) to a new PR, or delete some or all of them, let me know. |
regcomp calls regatom, which is the main regexp parser, and handling a bracket expression was a big part of it. This made it hard to follow the overall flow of regatom. Also, this code started at indent level 3 and reached indent level 7, which made it hard to add more code. So just extract it to its own function regbracket(), which makes regatom much more readable, and allows extending the brackets code (which the next commit will do). The change moves the actual character-handling part from regatom to the new regbracket, while the '[' code at regatom remains only the outer skeleton of setting up the node and finally terminating it. The code is unmodified, but the regbracket code doesn't work on the the global regparse anymore, and instead takes a pointer "in" to the chars, and returns pointer to after the processed chars, so all the instances of "regparse" at the new regbracket were renamed to "in".
This is not critically useful, because most of the classes can be replaced trivially with normal ranges, like [a-z] for [[:lower:]], but it's still convenient, like with [:xdigit:], or even more so with [:punct:] which is otherwise hard to specify ([^ 0-9A-Za-z] comes close but inconvenient, and also include control chars). Character classes are also POSIX, so overall it's nice to have. The implementation is fairly simple. We add an array of the chars per class (only POSIX/C locale values), and augment regbracket() to identify and process them during regcomp. The code reuses the regbracket() parsing to specify the class chars as ranges where appropriate, which is more readable and less data. Known issue: case-insensitivity for [:upper:] doesn't work in "less". E.g.: Matches all upper/lower alpha: less -I MYFILE +/"[A-Z]" Doesn't match anything: less -I MYFILE +/"[[:upper:]]" However, this is not a new issue with V8, and it also happens with the patterns "[@-_]" or "[H-a]" etc, regardless of this commit. A solution to this would be to support REG_ICASE or equivalent at regexp.c, instead of "less" implemeting ICASE by doing tolower on the pattern (regcomp) and text (regexec2) - which works for [A-Z] but not for [[:upper:]] or [@-_] . Other classes are unaffected because either they don't contain alpha chars (digit, punct, cntrl), or they contain both the upper and lower case of their alpha values (alnum, xdigit, etc) and the lower-case chars are already included anyway. So "upper" is the exception.
prog->regmust, if not null, is a normal null-terminated string which must appear at the input. Use the standard libc strstr instead of the NIH code at regexec2, because libc can probably do that better than us. Also get rid of prog->regmlen which is regmust's len, and which was only used by the NIH strstr.
|
Not sure why it got closed. I just fixed a tiny style issue at the "V8 regexp.c: move bracket handling to a function (no-op)" commit (moved the open brace in a new line at the Luckily "reopen pull request" seems to have worked... |
|
There seems to be something wrong. If I run I get Running in gdb, it appears that regatom is calling |
This is correct behavior. You should try The |
|
I see. It's not a feature I typically use, since most of the time the explicit range (like |
Yeah, it's not a critical feature. I think the most useful of those are probably Anyway, thanks for the merge. |
The main change is adding support for
[:alnum:],[:punct:], etc.And before that it has few minor fixups when DEBUG is defined.