V8 regexp.c: add support for character classes [:NAME:] by avih · Pull Request #732 · gwsw/less

avih · 2026-02-14T21:10:46Z

The main change is adding support for [:alnum:], [:punct:], etc.

And before that it has few minor fixups when DEBUG is defined.

avih · 2026-02-14T21:43:01Z

About case-sensitivity with [:NAME:] classes.

As the last commit message says, there's a known issue with [:upper:] and case-INsensitivity in "less", because V8 doesn't support REG_ICASE or equivalent, and without it, "less" converts both the pattern and the text to lower-case, and while it works with [A-Z], it has no effect on [[:upper:]] (which still tries to match capital A-Z, while the text is now lower-case).

This can be fixed in one of few ways:

Make V8 support REG_ICASE or equivalent, and adjust the call-sites as well. This might be great, but might be much effort, and probably risk too, which is definitely not worth it if the sole goal is fixing [:upper:] when less is case-insensitive.
Make the v8 regcomp take an additional flag/bool which indicates that character classes should be added with tolower values (when is_caseless is true). However, this requires changing the semantics of the API, and possibly the signature too, and it's a very awkward semantics to put in an API, because the pattern and strings should still be converted to lower case as well.
Same as 2, but instead of changing the API, use the public is_caseless, maybe with a patch similar to below.

diff --git a/regexp.c b/regexp.c
index 8a44b5f..9a35caa 100644
--- a/regexp.c
+++ b/regexp.c
@@ -542,6 +542,8 @@ cclass_idx2(constant char *s2)
 	return -1;  /* not special as far as we can tell */
 }
 
+extern int is_caseless;
+
 /*
  * called with regparse pointing to right after an opening bracket '[' .
  * emits ANYOF/ANYBUT node with a set of chars and (on success) final '\0'.
@@ -586,7 +588,7 @@ regbracket()
 		} else {  /* idx is a valid index, add class chars */
 			constant char *c = cclasses[idx].chars;
 			for (; *c; ++c)
-				regc(*c);
+				regc(is_caseless ? tolower(*c) : *c);
 			regparse += cclasses[idx].len;
 
 			if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')

(or a more correct but slightly bigger patch which doesn't evaluate is_caseless in each iteration unnecessarily)

This works for less -I ... +/"[[:upper:]]", but for some reason it doesn't work when changing -I at runtime. Not sure why, I'm guessing because the pattern is not recompiled if it doesn't contain upper case chars? However, performing a new search with the same pattern does recopmpile it, and does respect the new case-sensitivity with this patch.

Note that even if it would recompile the pattern unconditionally when chagning -I (capital i) at runtime, there still remains a related issue with -i (lower-case i), because "smart case" only makes it case-sensitive if the pattern contains upper case, but [[:upper:]] does not, so it becomes INsensitive eventhough the pattern does match upper-case chars.

However, this -i issue with [[:upper:]] is not unique to V8 in less, and also happen with POSIX REGEX which does support REG_ICASE (and probably PCRE too etc), because "less" sets REG_ICASE for this pattern - because there's no uppercase chars in it.

So the patch above or similar, possibly with unconditional regcomp when changing -I/-i at runtime, would bring the V8 behavior to par with other regexp engines which do support REG_ICASE, and all of them would still have an issue with -i if the pattern includes [:upper:] and contains no upper-case chars.

avih · 2026-02-14T23:14:00Z

"less" converts both the pattern and the text to lower-case, and while it works with [A-Z], it has no effect on [[:upper:]] (which still tries to match capital A-Z, while the text is now lower-case).

Note that this issue is not limited to [:upper:]. It also happens in master without [:NAME:] support, with any pattern which matches upper case chars, but the pattern itself doesn't include any, for instance [@-_] is a range which includes upper A-Z in it (but not lower a-z), but without upper case in the pattern itself.

So less -I ... +/"[@-_]" does not match any alpha chars at all with V8 in master, eventhough it should (ideally) match all upper and lower case chars.

avih · 2026-02-14T23:53:23Z

Note that this issue is not limited to [:upper:]. It also happens in master without [:NAME:] support,

However, this should be reasonably simple to handle, simply by making all the chars added within brackets take is_caseless into account, and assuming that less would recompile the pattern unconditionally when -i or -I changes in runtime even for engines where re_handles_caseless is false.

For instance, instead of the patch above which only covers [:NAME:], this would cover all the cases with brackets, including [@-_]:

diff --git a/regexp.c b/regexp.c
index 8a44b5f..47a4350 100644
--- a/regexp.c
+++ b/regexp.c
@@ -542,6 +542,28 @@ cclass_idx2(constant char *s2)
 	return -1;  /* not special as far as we can tell */
 }
 
+/* when less is case-insensitive, and the RE engine doesn't support REG_ICASE
+ * or equivalent (like this engine), then it converts the pattern to lowercase
+ * before compiling it, and the text to lowercase before matching it.
+ * this works in general, and even with some ranges like [A-Z], but doesn't
+ * work if the range matches upper case chars, but itself doesn't have any
+ * upper case, like [@-_] or [[:upper:]].
+ * so bracket handling uses this regc_bracket which does take the case
+ * sensitivity of "less" into account in brackets, even when the pattern is
+ * without upper case chars but still matches upper case chars.
+ *
+ * TODO: we could _probably_ do that also for non-bracket parts of the pattern,
+ *       and if we do, then "less" can avoid converting the pattern to lower
+ *       case (but it would still need to convert the text before match).
+ */
+extern int is_caseless;  /* case-insensitivity state in "less" */
+
+static inline void
+regc_brckt(char c)
+{
+	regc(is_caseless ? tolower(c) : c);
+}
+
 /*
  * called with regparse pointing to right after an opening bracket '[' .
  * emits ANYOF/ANYBUT node with a set of chars and (on success) final '\0'.
@@ -576,17 +598,17 @@ regbracket()
 				if (clss > classend+1)
 					FAIL("invalid [] range");
 				for (; clss <= classend; clss++)
-					regc(clss);
+					regc_brckt(clss);
 				regparse++;
 			}
 		} else if ((idx = CCLASS_IDX(regparse)) == -1) {
-			regc(*regparse++);
+			regc_brckt(*regparse++);
 		} else if (idx == -2) {
 			FAIL("unknown [:CLASS:] name");
 		} else {  /* idx is a valid index, add class chars */
 			constant char *c = cclasses[idx].chars;
 			for (; *c; ++c)
-				regc(*c);
+				regc_brckt(*c);
 			regparse += cclasses[idx].len;
 
 			if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')

I didn't check how less converts the the text/pattern to lower, and if it also uses tolower like this patch does, then it's fine, but if it does some more complex conversion, maybe depending on locale or something else, then it might become more complex.

However, for the [:NAME:] conversion, tolower would always work, because these classes only contain ASCII values, all smaller than 128.

It needs some thought, but largely the case sensitivity subject is orthogonal to this PR, and if we decide to improve it, the it should not be limited to the [:NAME:] support which this PR adds.

gwsw · 2026-02-15T01:14:00Z

I haven't fully digested this yet, but I will mention that the recompilation of a pattern after changing -i/-I happens in chg_caseless() in search.c. If you want to unconditionally recompile after changing the option, you can just delete lines 2037-2039.

avih · 2026-02-15T08:07:28Z

Thanks. Indeed I now saw where it skips the recompilation, and also now noticed the case macros in less.h, where indeed it only converts ASCII values under 128, but otherwise it's still on a per-byte basis and doesn't change the string length. So this is good.

But the bottom line is still that the case-related issues are interesting and probably addressable at least to some degree, but are actually off topic for this PR, and also can probably be considered uncommon edge cases.

Both issues relate to patterns which match upper case letters, but do not themselves contain uppercase letters, like [@-_] or [:upper:].

The 1st issue is that when "less" wants to do caseless search, then such patterns are broken with RE engines which don't support REG_ICASE, like V8, because converting this pattern to lowercase has no effect on the chars which it matches.

This doesn't have a good general solution, but probably can have a good solution with the local V8 specifically (but not in this PR, it's a new subject).

The 2nd issue with such patterns affects all RE engines, and it's that with "smart case" (-i), such patterns cause the search to be caseless - eventhough they do match upper case letters.

Granted, it doesn't actually contradict the docs, which say explicitly that -i is only ignored if the pattern contains uppercase, but I think it does break user expectation, that something like [:upper:] which explicitly wants upper case, actually result in REG_ICASE or equivalent set (in current master, for instance with POSIX REGEX).

However, I don't think there's a good solution for this, so it's more of just realizing that this issue exists.

avih · 2026-02-15T09:53:04Z

Force-pushed without any code changes, but with commit messages changes:

For the "fix warning" commit, it now says why this solution is correct. The fact that I missed when I originally wrote this message is that long is guaranteed 32 bit in C89 or later.

For the "support character classes" commit, it now also says that the [[:upper:]] case-sensitivity issue is not new, and existed also before this commit, for instance with the pattern [@-_].

avih · 2026-02-15T19:03:52Z

If you intend to merge it, and don't intend to rebase it on top of master yourself, then let me know and I'll rebase it before you merge.

avih · 2026-02-16T03:58:05Z

If you want, I have a different variant of the classes support, which saves few hundreds bytes of data by reusing the bracket parsing code for the classes data - instead of each class holding a copy of all its chars, which looks like this:

static constant cclass cclasses[] = {
	/* first two chars of id are ignored, and expected to be "[:" */
	{9,  "[:alnum:]",  "0-9A-Za-z"},
	{9,  "[:alpha:]",  "A-Za-z"},
	{9,  "[:blank:]",  "\t "},
	{9,  "[:cntrl:]",  "\x01-\x1f\x7f"},
	{9,  "[:digit:]",  "0-9"},
	{9,  "[:graph:]",  "\x21-\x7e"},
	{9,  "[:lower:]",  "a-z"},
	{9,  "[:print:]",  "\x20-\x7e"},
	{9,  "[:punct:]",  "!-/:-@[-`{|}~"},
	{9,  "[:space:]",  "\t\n\v\f\r "},
	{9,  "[:upper:]",  "A-Z"},
	{10, "[:xdigit:]", "0-9A-Fa-f"},
};

To reuse the code cleanly, instead of extracting the whole [ block from regatom (which adds a node and terminating 0), we can extract roughly the while loop so the classes code can recurse, while keeping the surrounding code at regatom, like this (and doesn't need a local ret anymore):

static constant char *
regbracket(constant char *regparse)
{
	register int clss;
	register int classend;
	int idx;

	if (*regparse == ']' || *regparse == '-')
		regc(*regparse++);
	while (*regparse != '\0' && *regparse != ']') {

		/* ... identical code as in this PR */

		} else {  /* idx is a valid index, add class chars */
			if (!regbracket(cclasses[idx].chars))
				FAIL("internal error");
			regparse += cclasses[idx].len;

			if (*regparse == '-' && regparse[1] != '\0' && regparse[1] != ']')
				FAIL("invalid [] range start"); /* [:NAME:]-X */
		}
	}
	return regparse;
}

And the remaining [ code at regatom looks like this:

	case '[':
		if (*regparse == '^') {	/* Complement of range. */
			ret = regnode(ANYBUT);
			regparse++;
		} else
			ret = regnode(ANYOF);

		regparse = regbracket(regparse);
		if (regparse == NULL)
			return(NULL);

		regc('\0');
		if (*regparse != ']')
			FAIL("unmatched []");
		regparse++;
		*flagp |= HASWIDTH|SIMPLE;
		break;

(CCLASS_IDX and cclass_idx2 remain identical to the code in the PR)

Let me know if you have a preference between the two approaches.

gwsw · 2026-02-27T18:07:03Z

I don't think I care much about a couple of hundred bytes of data, but performance would be more of a concern. Do you think there is a significant performance difference between the two approaches that you mention? If there's no performance difference either, the second approach (using ranges instead of the full list of chars) seems more readable.

avih · 2026-02-27T19:54:08Z

Performance difference seems to be within the noise level (~ 5%).

If anything, the recursing variant is a bit faster, maybe because it does less memory access? (it iterates ranges instead of fetching all chars from the data).

I tested it in isolation, with modified regbracket function posted above (the recursion variant) to support both the recursive and flat variants using these chunks:

#ifdef ALLCHARS
			const char *c = cclasses[idx].chars;
			for (; *c; ++c)
				regc(*c);
#else
			if (!regbracket(cclasses[idx].chars))
				FAIL("internal error");
#endif

(and obviously the cclasses array also differs depending on ALLCHARS)

I tested with all the classes in one bracket string (confirming that both forms do register all the chars correctly - this is 459 chars - the v8 regcomp doesn't remove duplicates in brackets):

        regbracket("[:alnum:][:alpha:][:blank:][:cntrl:]"
                   "[:digit:][:graph:][:lower:][:print:]"
                   "[:punct:][:space:][:upper:][:xdigit:]");

Calling this million times in a loop and measuring the runtime. Both are the same duration - about 1s.

Regardless, even if there was a difference, it's only at regcomp, which is used extremely infrequently (at most at typing rate if we have --incsearch), compared to regexec which can execute thousands of times per second.

So it would have been negligible even if it was 10x slower, but the 2nd (recursive) variant actually turned out to be very slightly faster according to my tests (both with -O2 and -O0).

EDIT: with gcc -O0, the recursive variant is ~ 10% faster (3.2s vs 3.5s), while with TinyC (tcc - non-optimizing), the recursive variant is ~ 15% slower (3.9s vs 3.4s), and with gcc -O2 the diff is barely measureable and both are ~ 1.6s.

Shall I update the PR to do the recursive variants?

gwsw · 2026-02-28T01:33:06Z

Shall I update the PR to do the recursive variants?

Yes, please do that.

This is a followup to commit c8c01e3, and does the same (change K&R style to C89/99) also for the functions regprop and regdump - which are only compiled when DEBUG is defined. Also, regprop was declared twice, so remove one of them. Also, the reprop implementation returns const char*, but the declaration returned plain char*, so fix the declaration. Also, regdump is not static and is intended for extrenal use (it's not used in regexp.c even when DEBUG is defined), so move its declaration to regexp.h - still inside #ifdef DEBUG . Finally, add a changelog note, with the date of commit c8c01e3 .

When compiling with DEBUG defined, two ptrdiff_t values were printed as "%d" format, which resulted in a warning in clang (but not gcc). This was an issue in 64 bit builds, as a pointer (diff) is 64 bit while int is typically 32. The most correct solution would be to use %td (ptrdiff_t format), which is not supported in c89, but we have another correct solution which does work in c89 or later: Use %ld and cast the value to long - which is what this commit does. This works great for us, because eventhough int and long are typically the same size (32 bit), long in c89 and later guarantees 32 bit, while int in c89 or later only guarantees 16 bits. The actual value in practice is currently limited to 32K (regcomp aborts if regsize >= 32767), so %ld with cast to long covers our current needs, and still allows this limit to be raised up to 2M without breaking the correctness of this code in c89 or later. For what it's worth, regdump works and seems useful.

avih · 2026-02-28T22:11:15Z

Rebased and updated to the recursive form.

Also removed the macro which tested [: before calling cclass_idx because this is not hot code, so simpler code is better (the macro did multiple evaluation of the argument, which was fine here, but it can bite if someone touches the code and doesn't notice it).

And finally, a small unrelated commit (just like the first two commits) which replaces NIH strstr with standard libc strstr, because libc probably does it better,

If you want me to move commits which are not related to brackets (1st, 2nd, 5th commits) to a new PR, or delete some or all of them, let me know.

regcomp calls regatom, which is the main regexp parser, and handling a bracket expression was a big part of it. This made it hard to follow the overall flow of regatom. Also, this code started at indent level 3 and reached indent level 7, which made it hard to add more code. So just extract it to its own function regbracket(), which makes regatom much more readable, and allows extending the brackets code (which the next commit will do). The change moves the actual character-handling part from regatom to the new regbracket, while the '[' code at regatom remains only the outer skeleton of setting up the node and finally terminating it. The code is unmodified, but the regbracket code doesn't work on the the global regparse anymore, and instead takes a pointer "in" to the chars, and returns pointer to after the processed chars, so all the instances of "regparse" at the new regbracket were renamed to "in".

This is not critically useful, because most of the classes can be replaced trivially with normal ranges, like [a-z] for [[:lower:]], but it's still convenient, like with [:xdigit:], or even more so with [:punct:] which is otherwise hard to specify ([^ 0-9A-Za-z] comes close but inconvenient, and also include control chars). Character classes are also POSIX, so overall it's nice to have. The implementation is fairly simple. We add an array of the chars per class (only POSIX/C locale values), and augment regbracket() to identify and process them during regcomp. The code reuses the regbracket() parsing to specify the class chars as ranges where appropriate, which is more readable and less data. Known issue: case-insensitivity for [:upper:] doesn't work in "less". E.g.: Matches all upper/lower alpha: less -I MYFILE +/"[A-Z]" Doesn't match anything: less -I MYFILE +/"[[:upper:]]" However, this is not a new issue with V8, and it also happens with the patterns "[@-_]" or "[H-a]" etc, regardless of this commit. A solution to this would be to support REG_ICASE or equivalent at regexp.c, instead of "less" implemeting ICASE by doing tolower on the pattern (regcomp) and text (regexec2) - which works for [A-Z] but not for [[:upper:]] or [@-_] . Other classes are unaffected because either they don't contain alpha chars (digit, punct, cntrl), or they contain both the upper and lower case of their alpha values (alnum, xdigit, etc) and the lower-case chars are already included anyway. So "upper" is the exception.

prog->regmust, if not null, is a normal null-terminated string which must appear at the input. Use the standard libc strstr instead of the NIH code at regexec2, because libc can probably do that better than us. Also get rid of prog->regmlen which is regmust's len, and which was only used by the NIH strstr.

avih · 2026-02-28T22:28:26Z

Not sure why it got closed. I just fixed a tiny style issue at the "V8 regexp.c: move bracket handling to a function (no-op)" commit (moved the open brace in a new line at the regbracket(...) function definition), and the issue got closed... whatever.

Luckily "reopen pull request" seems to have worked...

gwsw · 2026-03-01T02:09:18Z

There seems to be something wrong. If I run

seq 100 | less '+/[:digit:]'

I get Pattern not found: [:digit:].

Running in gdb, it appears that regatom is calling regparse = regbracket(regparse); at line 613, but at that point regparse has been incremented in the switch statement, so regbracket gets ":digit:]" (no initial bracket). Then when regbracket calls cclass_idx at line 563, cclass_idx immediately returns -1 because its parameter is missing the initial left bracket.

avih · 2026-03-01T02:28:42Z

There seems to be something wrong. If I run
seq 100 | less '+/[:digit:]'
I get Pattern not found: [:digit:].

This is correct behavior. You should try seq 100 | grep '[:digit:]' too ;)

The [:NAME:] thing ~~replaces a single character~~ is considered a single element inside brackets, so to match any digit the pattern should be [[:digit:]], while [:digit:] matches : or d or i or g or t - just like without this patch.

gwsw · 2026-03-01T02:53:02Z

I see. It's not a feature I typically use, since most of the time the explicit range (like [a-z] or [a-zA-Z]) is shorter than the symbolic name ([[:lower:]] or [[:alpha:]]).

avih · 2026-03-01T03:10:08Z

most of the time the explicit range (like [a-z] or [a-zA-Z]) is shorter than the symbolic name

Yeah, it's not a critical feature. I think the most useful of those are probably punct and alnum, because otherwise they're long to specify (the alternatives are [^ 0-9A-Za-z] for punct, but it still includes control, and [0-9A-Za-z]).

Anyway, thanks for the merge.

avih force-pushed the v8-cclasses branch from f155279 to b034458 Compare February 15, 2026 09:49

avih mentioned this pull request Feb 17, 2026

local V8 regexp.c: improve case-INsensitivity support #738

Closed

avih added 2 commits February 28, 2026 22:10

avih force-pushed the v8-cclasses branch from b034458 to 114e4ce Compare February 28, 2026 22:04

avih added 3 commits March 1, 2026 00:22

avih closed this Feb 28, 2026

avih force-pushed the v8-cclasses branch from 114e4ce to b41604b Compare February 28, 2026 22:23

avih reopened this Feb 28, 2026

gwsw merged commit 3a20b03 into gwsw:master Mar 1, 2026

avih deleted the v8-cclasses branch March 1, 2026 11:48

Conversation

avih commented Feb 14, 2026

Uh oh!

avih commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

avih commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

avih commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gwsw commented Feb 15, 2026

Uh oh!

avih commented Feb 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

avih commented Feb 15, 2026

Uh oh!

avih commented Feb 15, 2026

Uh oh!

avih commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gwsw commented Feb 27, 2026

Uh oh!

avih commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gwsw commented Feb 28, 2026

Uh oh!

avih commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

avih commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gwsw commented Mar 1, 2026

Uh oh!

avih commented Mar 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gwsw commented Mar 1, 2026

Uh oh!

avih commented Mar 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

avih commented Feb 14, 2026 •

edited

Loading

avih commented Feb 14, 2026 •

edited

Loading

avih commented Feb 14, 2026 •

edited

Loading

avih commented Feb 15, 2026 •

edited

Loading

avih commented Feb 16, 2026 •

edited

Loading

avih commented Feb 27, 2026 •

edited

Loading

avih commented Feb 28, 2026 •

edited

Loading

avih commented Feb 28, 2026 •

edited

Loading

avih commented Mar 1, 2026 •

edited

Loading