Skip to content

V8 regexp.c: add support for character classes [:NAME:]#732

Merged
gwsw merged 5 commits intogwsw:masterfrom
avih:v8-cclasses
Mar 1, 2026
Merged

V8 regexp.c: add support for character classes [:NAME:]#732
gwsw merged 5 commits intogwsw:masterfrom
avih:v8-cclasses

Conversation

@avih
Copy link
Copy Markdown
Contributor

@avih avih commented Feb 14, 2026

The main change is adding support for [:alnum:], [:punct:], etc.

And before that it has few minor fixups when DEBUG is defined.

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 14, 2026

About case-sensitivity with [:NAME:] classes.

As the last commit message says, there's a known issue with [:upper:] and case-INsensitivity in "less", because V8 doesn't support REG_ICASE or equivalent, and without it, "less" converts both the pattern and the text to lower-case, and while it works with [A-Z], it has no effect on [[:upper:]] (which still tries to match capital A-Z, while the text is now lower-case).

This can be fixed in one of few ways:

  1. Make V8 support REG_ICASE or equivalent, and adjust the call-sites as well. This might be great, but might be much effort, and probably risk too, which is definitely not worth it if the sole goal is fixing [:upper:] when less is case-insensitive.
  2. Make the v8 regcomp take an additional flag/bool which indicates that character classes should be added with tolower values (when is_caseless is true). However, this requires changing the semantics of the API, and possibly the signature too, and it's a very awkward semantics to put in an API, because the pattern and strings should still be converted to lower case as well.
  3. Same as 2, but instead of changing the API, use the public is_caseless, maybe with a patch similar to below.
diff --git a/regexp.c b/regexp.c
index 8a44b5f..9a35caa 100644
--- a/regexp.c
+++ b/regexp.c
@@ -542,6 +542,8 @@ cclass_idx2(constant char *s2)
 	return -1;  /* not special as far as we can tell */
 }
 
+extern int is_caseless;
+
 /*
  * called with regparse pointing to right after an opening bracket '[' .
  * emits ANYOF/ANYBUT node with a set of chars and (on success) final '\0'.
@@ -586,7 +588,7 @@ regbracket()
 		} else {  /* idx is a valid index, add class chars */
 			constant char *c = cclasses[idx].chars;
 			for (; *c; ++c)
-				regc(*c);
+				regc(is_caseless ? tolower(*c) : *c);
 			regparse += cclasses[idx].len;
 
 			if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')

(or a more correct but slightly bigger patch which doesn't evaluate is_caseless in each iteration unnecessarily)

This works for less -I ... +/"[[:upper:]]", but for some reason it doesn't work when changing -I at runtime. Not sure why, I'm guessing because the pattern is not recompiled if it doesn't contain upper case chars? However, performing a new search with the same pattern does recopmpile it, and does respect the new case-sensitivity with this patch.

Note that even if it would recompile the pattern unconditionally when chagning -I (capital i) at runtime, there still remains a related issue with -i (lower-case i), because "smart case" only makes it case-sensitive if the pattern contains upper case, but [[:upper:]] does not, so it becomes INsensitive eventhough the pattern does match upper-case chars.

However, this -i issue with [[:upper:]] is not unique to V8 in less, and also happen with POSIX REGEX which does support REG_ICASE (and probably PCRE too etc), because "less" sets REG_ICASE for this pattern - because there's no uppercase chars in it.

So the patch above or similar, possibly with unconditional regcomp when changing -I/-i at runtime, would bring the V8 behavior to par with other regexp engines which do support REG_ICASE, and all of them would still have an issue with -i if the pattern includes [:upper:] and contains no upper-case chars.

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 14, 2026

"less" converts both the pattern and the text to lower-case, and while it works with [A-Z], it has no effect on [[:upper:]] (which still tries to match capital A-Z, while the text is now lower-case).

Note that this issue is not limited to [:upper:]. It also happens in master without [:NAME:] support, with any pattern which matches upper case chars, but the pattern itself doesn't include any, for instance [@-_] is a range which includes upper A-Z in it (but not lower a-z), but without upper case in the pattern itself.

So less -I ... +/"[@-_]" does not match any alpha chars at all with V8 in master, eventhough it should (ideally) match all upper and lower case chars.

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 14, 2026

Note that this issue is not limited to [:upper:]. It also happens in master without [:NAME:] support,

However, this should be reasonably simple to handle, simply by making all the chars added within brackets take is_caseless into account, and assuming that less would recompile the pattern unconditionally when -i or -I changes in runtime even for engines where re_handles_caseless is false.

For instance, instead of the patch above which only covers [:NAME:], this would cover all the cases with brackets, including [@-_]:

diff --git a/regexp.c b/regexp.c
index 8a44b5f..47a4350 100644
--- a/regexp.c
+++ b/regexp.c
@@ -542,6 +542,28 @@ cclass_idx2(constant char *s2)
 	return -1;  /* not special as far as we can tell */
 }
 
+/* when less is case-insensitive, and the RE engine doesn't support REG_ICASE
+ * or equivalent (like this engine), then it converts the pattern to lowercase
+ * before compiling it, and the text to lowercase before matching it.
+ * this works in general, and even with some ranges like [A-Z], but doesn't
+ * work if the range matches upper case chars, but itself doesn't have any
+ * upper case, like [@-_] or [[:upper:]].
+ * so bracket handling uses this regc_bracket which does take the case
+ * sensitivity of "less" into account in brackets, even when the pattern is
+ * without upper case chars but still matches upper case chars.
+ *
+ * TODO: we could _probably_ do that also for non-bracket parts of the pattern,
+ *       and if we do, then "less" can avoid converting the pattern to lower
+ *       case (but it would still need to convert the text before match).
+ */
+extern int is_caseless;  /* case-insensitivity state in "less" */
+
+static inline void
+regc_brckt(char c)
+{
+	regc(is_caseless ? tolower(c) : c);
+}
+
 /*
  * called with regparse pointing to right after an opening bracket '[' .
  * emits ANYOF/ANYBUT node with a set of chars and (on success) final '\0'.
@@ -576,17 +598,17 @@ regbracket()
 				if (clss > classend+1)
 					FAIL("invalid [] range");
 				for (; clss <= classend; clss++)
-					regc(clss);
+					regc_brckt(clss);
 				regparse++;
 			}
 		} else if ((idx = CCLASS_IDX(regparse)) == -1) {
-			regc(*regparse++);
+			regc_brckt(*regparse++);
 		} else if (idx == -2) {
 			FAIL("unknown [:CLASS:] name");
 		} else {  /* idx is a valid index, add class chars */
 			constant char *c = cclasses[idx].chars;
 			for (; *c; ++c)
-				regc(*c);
+				regc_brckt(*c);
 			regparse += cclasses[idx].len;
 
 			if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')

I didn't check how less converts the the text/pattern to lower, and if it also uses tolower like this patch does, then it's fine, but if it does some more complex conversion, maybe depending on locale or something else, then it might become more complex.

However, for the [:NAME:] conversion, tolower would always work, because these classes only contain ASCII values, all smaller than 128.

It needs some thought, but largely the case sensitivity subject is orthogonal to this PR, and if we decide to improve it, the it should not be limited to the [:NAME:] support which this PR adds.

@gwsw
Copy link
Copy Markdown
Owner

gwsw commented Feb 15, 2026

I haven't fully digested this yet, but I will mention that the recompilation of a pattern after changing -i/-I happens in chg_caseless() in search.c. If you want to unconditionally recompile after changing the option, you can just delete lines 2037-2039.

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 15, 2026

Thanks. Indeed I now saw where it skips the recompilation, and also now noticed the case macros in less.h, where indeed it only converts ASCII values under 128, but otherwise it's still on a per-byte basis and doesn't change the string length. So this is good.

But the bottom line is still that the case-related issues are interesting and probably addressable at least to some degree, but are actually off topic for this PR, and also can probably be considered uncommon edge cases.

Both issues relate to patterns which match upper case letters, but do not themselves contain uppercase letters, like [@-_] or [:upper:].

The 1st issue is that when "less" wants to do caseless search, then such patterns are broken with RE engines which don't support REG_ICASE, like V8, because converting this pattern to lowercase has no effect on the chars which it matches.

This doesn't have a good general solution, but probably can have a good solution with the local V8 specifically (but not in this PR, it's a new subject).

The 2nd issue with such patterns affects all RE engines, and it's that with "smart case" (-i), such patterns cause the search to be caseless - eventhough they do match upper case letters.

Granted, it doesn't actually contradict the docs, which say explicitly that -i is only ignored if the pattern contains uppercase, but I think it does break user expectation, that something like [:upper:] which explicitly wants upper case, actually result in REG_ICASE or equivalent set (in current master, for instance with POSIX REGEX).

However, I don't think there's a good solution for this, so it's more of just realizing that this issue exists.

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 15, 2026

Force-pushed without any code changes, but with commit messages changes:

For the "fix warning" commit, it now says why this solution is correct. The fact that I missed when I originally wrote this message is that long is guaranteed 32 bit in C89 or later.

For the "support character classes" commit, it now also says that the [[:upper:]] case-sensitivity issue is not new, and existed also before this commit, for instance with the pattern [@-_].

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 15, 2026

If you intend to merge it, and don't intend to rebase it on top of master yourself, then let me know and I'll rebase it before you merge.

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 16, 2026

If you want, I have a different variant of the classes support, which saves few hundreds bytes of data by reusing the bracket parsing code for the classes data - instead of each class holding a copy of all its chars, which looks like this:

static constant cclass cclasses[] = {
	/* first two chars of id are ignored, and expected to be "[:" */
	{9,  "[:alnum:]",  "0-9A-Za-z"},
	{9,  "[:alpha:]",  "A-Za-z"},
	{9,  "[:blank:]",  "\t "},
	{9,  "[:cntrl:]",  "\x01-\x1f\x7f"},
	{9,  "[:digit:]",  "0-9"},
	{9,  "[:graph:]",  "\x21-\x7e"},
	{9,  "[:lower:]",  "a-z"},
	{9,  "[:print:]",  "\x20-\x7e"},
	{9,  "[:punct:]",  "!-/:-@[-`{|}~"},
	{9,  "[:space:]",  "\t\n\v\f\r "},
	{9,  "[:upper:]",  "A-Z"},
	{10, "[:xdigit:]", "0-9A-Fa-f"},
};

To reuse the code cleanly, instead of extracting the whole [ block from regatom (which adds a node and terminating 0), we can extract roughly the while loop so the classes code can recurse, while keeping the surrounding code at regatom, like this (and doesn't need a local ret anymore):

static constant char *
regbracket(constant char *regparse)
{
	register int clss;
	register int classend;
	int idx;

	if (*regparse == ']' || *regparse == '-')
		regc(*regparse++);
	while (*regparse != '\0' && *regparse != ']') {

		/* ... identical code as in this PR */

		} else {  /* idx is a valid index, add class chars */
			if (!regbracket(cclasses[idx].chars))
				FAIL("internal error");
			regparse += cclasses[idx].len;

			if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')
				FAIL("invalid [] range start"); /* [:NAME:]-X */
		}
	}
	return regparse;
}

And the remaining [ code at regatom looks like this:

	case '[':
		if (*regparse == '^') {	/* Complement of range. */
			ret = regnode(ANYBUT);
			regparse++;
		} else
			ret = regnode(ANYOF);

		regparse = regbracket(regparse);
		if (regparse == NULL)
			return(NULL);

		regc('\0');
		if (*regparse != ']')
			FAIL("unmatched []");
		regparse++;
		*flagp |= HASWIDTH|SIMPLE;
		break;

(CCLASS_IDX and cclass_idx2 remain identical to the code in the PR)

Let me know if you have a preference between the two approaches.

@gwsw
Copy link
Copy Markdown
Owner

gwsw commented Feb 27, 2026

I don't think I care much about a couple of hundred bytes of data, but performance would be more of a concern. Do you think there is a significant performance difference between the two approaches that you mention? If there's no performance difference either, the second approach (using ranges instead of the full list of chars) seems more readable.

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 27, 2026

Performance difference seems to be within the noise level (~ 5%).

If anything, the recursing variant is a bit faster, maybe because it does less memory access? (it iterates ranges instead of fetching all chars from the data).

I tested it in isolation, with modified regbracket function posted above (the recursion variant) to support both the recursive and flat variants using these chunks:

#ifdef ALLCHARS
			const char *c = cclasses[idx].chars;
			for (; *c; ++c)
				regc(*c);
#else
			if (!regbracket(cclasses[idx].chars))
				FAIL("internal error");
#endif

(and obviously the cclasses array also differs depending on ALLCHARS)

I tested with all the classes in one bracket string (confirming that both forms do register all the chars correctly - this is 459 chars - the v8 regcomp doesn't remove duplicates in brackets):

        regbracket("[:alnum:][:alpha:][:blank:][:cntrl:]"
                   "[:digit:][:graph:][:lower:][:print:]"
                   "[:punct:][:space:][:upper:][:xdigit:]");

Calling this million times in a loop and measuring the runtime. Both are the same duration - about 1s.

Regardless, even if there was a difference, it's only at regcomp, which is used extremely infrequently (at most at typing rate if we have --incsearch), compared to regexec which can execute thousands of times per second.

So it would have been negligible even if it was 10x slower, but the 2nd (recursive) variant actually turned out to be very slightly faster according to my tests (both with -O2 and -O0).

EDIT: with gcc -O0, the recursive variant is ~ 10% faster (3.2s vs 3.5s), while with TinyC (tcc - non-optimizing), the recursive variant is ~ 15% slower (3.9s vs 3.4s), and with gcc -O2 the diff is barely measureable and both are ~ 1.6s.

Shall I update the PR to do the recursive variants?

@gwsw
Copy link
Copy Markdown
Owner

gwsw commented Feb 28, 2026

Shall I update the PR to do the recursive variants?

Yes, please do that.

avih added 2 commits February 28, 2026 22:10
This is a followup to commit c8c01e3, and does the same (change
K&R style to C89/99) also for the functions regprop and regdump - which
are only compiled when DEBUG is defined.

Also, regprop was declared twice, so remove one of them.

Also, the reprop implementation returns const char*, but the
declaration returned plain char*, so fix the declaration.

Also, regdump is not static and is intended for extrenal use (it's
not used in regexp.c even when DEBUG is defined), so move its
declaration to regexp.h - still inside #ifdef DEBUG .

Finally, add a changelog note, with the date of commit c8c01e3 .
When compiling with DEBUG defined, two ptrdiff_t values were printed
as "%d" format, which resulted in a warning in clang (but not gcc).
This was an issue in 64 bit builds, as a pointer (diff) is 64 bit
while int is typically 32.

The most correct solution would be to use %td (ptrdiff_t format),
which is not supported in c89, but we have another correct solution
which does work in c89 or later:

Use %ld and cast the value to long - which is what this commit does.

This works great for us, because eventhough int and long are typically
the same size (32 bit), long in c89 and later guarantees 32 bit, while
int in c89 or later only guarantees 16 bits.

The actual value in practice is currently limited to 32K (regcomp
aborts if regsize >= 32767), so %ld with cast to long covers our
current needs, and still allows this limit to be raised up to 2M
without breaking the correctness of this code in c89 or later.

For what it's worth, regdump works and seems useful.
@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 28, 2026

Rebased and updated to the recursive form.

Also removed the macro which tested [: before calling cclass_idx because this is not hot code, so simpler code is better (the macro did multiple evaluation of the argument, which was fine here, but it can bite if someone touches the code and doesn't notice it).

And finally, a small unrelated commit (just like the first two commits) which replaces NIH strstr with standard libc strstr, because libc probably does it better,

If you want me to move commits which are not related to brackets (1st, 2nd, 5th commits) to a new PR, or delete some or all of them, let me know.

avih added 3 commits March 1, 2026 00:22
regcomp calls regatom, which is the main regexp parser, and handling
a bracket expression was a big part of it. This made it hard to follow
the overall flow of regatom. Also, this code started at indent level 3
and reached indent level 7, which made it hard to add more code.

So just extract it to its own function regbracket(), which makes
regatom much more readable, and allows extending the brackets code
(which the next commit will do).

The change moves the actual character-handling part from regatom
to the new regbracket, while the '[' code at regatom remains only the
outer skeleton of setting up the node and finally terminating it.

The code is unmodified, but the regbracket code doesn't work on the
the global regparse anymore, and instead takes a pointer "in" to the
chars, and returns pointer to after the processed chars, so all the
instances of "regparse" at the new regbracket were renamed to "in".
This is not critically useful, because most of the classes can be
replaced trivially with normal ranges, like [a-z] for [[:lower:]],
but it's still convenient, like with [:xdigit:], or even more so
with [:punct:] which is otherwise hard to specify ([^ 0-9A-Za-z]
comes close but inconvenient, and also include control chars).

Character classes are also POSIX, so overall it's nice to have.

The implementation is fairly simple. We add an array of the chars
per class (only POSIX/C locale values), and augment regbracket()
to identify and process them during regcomp.

The code reuses the regbracket() parsing to specify the class chars
as ranges where appropriate, which is more readable and less data.

Known issue: case-insensitivity for [:upper:] doesn't work in "less".
E.g.:
  Matches all upper/lower alpha: less -I MYFILE +/"[A-Z]"
  Doesn't match anything:        less -I MYFILE +/"[[:upper:]]"

However, this is not a new issue with V8, and it also happens with
the patterns "[@-_]" or "[H-a]" etc, regardless of this commit.

A solution to this would be to support REG_ICASE or equivalent at
regexp.c, instead of "less" implemeting ICASE by doing tolower on
the pattern (regcomp) and text (regexec2) - which works for [A-Z]
but not for [[:upper:]] or [@-_] .

Other classes are unaffected because either they don't contain alpha
chars (digit, punct, cntrl), or they contain both the upper and lower
case of their alpha values (alnum, xdigit, etc) and the lower-case
chars are already included anyway. So "upper" is the exception.
prog->regmust, if not null, is a normal null-terminated string which
must appear at the input.

Use the standard libc strstr instead of the NIH code at regexec2,
because libc can probably do that better than us.

Also get rid of prog->regmlen which is regmust's len, and which was
only used by the NIH strstr.
@avih avih closed this Feb 28, 2026
@avih avih reopened this Feb 28, 2026
@avih
Copy link
Copy Markdown
Contributor Author

avih commented Feb 28, 2026

Not sure why it got closed. I just fixed a tiny style issue at the "V8 regexp.c: move bracket handling to a function (no-op)" commit (moved the open brace in a new line at the regbracket(...) function definition), and the issue got closed... whatever.

Luckily "reopen pull request" seems to have worked...

@gwsw
Copy link
Copy Markdown
Owner

gwsw commented Mar 1, 2026

There seems to be something wrong. If I run

seq 100 | less '+/[:digit:]'

I get Pattern not found: [:digit:].

Running in gdb, it appears that regatom is calling regparse = regbracket(regparse); at line 613, but at that point regparse has been incremented in the switch statement, so regbracket gets ":digit:]" (no initial bracket). Then when regbracket calls cclass_idx at line 563, cclass_idx immediately returns -1 because its parameter is missing the initial left bracket.

@avih
Copy link
Copy Markdown
Contributor Author

avih commented Mar 1, 2026

There seems to be something wrong. If I run

seq 100 | less '+/[:digit:]'

I get Pattern not found: [:digit:].

This is correct behavior. You should try seq 100 | grep '[:digit:]' too ;)

The [:NAME:] thing replaces a single character is considered a single element inside brackets, so to match any digit the pattern should be [[:digit:]], while [:digit:] matches : or d or i or g or t - just like without this patch.

@gwsw
Copy link
Copy Markdown
Owner

gwsw commented Mar 1, 2026

I see. It's not a feature I typically use, since most of the time the explicit range (like [a-z] or [a-zA-Z]) is shorter than the symbolic name ([[:lower:]] or [[:alpha:]]).

@gwsw gwsw merged commit 3a20b03 into gwsw:master Mar 1, 2026
@avih
Copy link
Copy Markdown
Contributor Author

avih commented Mar 1, 2026

most of the time the explicit range (like [a-z] or [a-zA-Z]) is shorter than the symbolic name

Yeah, it's not a critical feature. I think the most useful of those are probably punct and alnum, because otherwise they're long to specify (the alternatives are [^ 0-9A-Za-z] for punct, but it still includes control, and [0-9A-Za-z]).

Anyway, thanks for the merge.

@avih avih deleted the v8-cclasses branch March 1, 2026 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants