Skip to content

Conversation

@sirreal
Copy link
Member

@sirreal sirreal commented Dec 15, 2025

The HTML API currently rejects script tag contents that may be dangerous. This is a proposal to detect JavaScript and JSON script tags and automatically escape contents when necessary.

  • JSON and JavaScript script tags may be detected according to the HTML standard.
  • Script tag contents are escaped only when <script or </script (case-insensitive) is found.

In JSON

< is replaced with \u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.

This is proposed as a simple character replacement with strtr. This should be highly performant. A less invasive replacement could be done to only replace < in <script or </script where it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require either a regular expression with case-insensitive matching (see JavaScript example).

In JavaScript

<script and </script (followed by a necessary tag termination character \t\n\r\f/>) the s is replaced with its Unicode escape. This should remain valid in all contexts where the text can appear and maintain identical behavior in all except a few edge cases (see ticket or quoted section below for full explanation and caveats).

From the ticket:

The HTML API prevents setting SCRIPT tag that could modify the tree either by closing the SCRIPT element prematurely, or by preventing the SCRIPT element from closing at the expected close tag.

This is handled by rejecting any script tag contents that are potentially dangerous and is safe. There are some improvements that could be made.

If the contents are found to be unsafe and the type of the script tag is JSON or JavaScript (this is well specified in the HTML standard), it should be possible to apply a syntactic transformation to the contents in such a way that the script contents become safe without semantically altering the script.

If the HTML API can safely and automatically escape the majority of SCRIPT tag contents, it can then be used to for SCRIPT tag creation and has the potential to eliminate the class of problem from #40737, #62797, and #63851. It also has the potential to address part of #51159 where SCRIPT tag escaping becomes less of an issue.

JSON

In JSON SCRIPT tags, the transformation is a simple replacement of < with its Unicode escape sequence \u003C. This can be applied to the entire contents of the script or specifically in case-insensitive matches for <script and </script.

JavaScript

JavaScript SCRIPT tags are more difficult because the language has vastly more syntax. Fortunately, there is prior art described in this 2022 blog post (external) from React team member Sophie Alpert. It's the same the JavaScript SCRIPT tag contents escaping strategy that React continues to employ today. In summary, the problematic text <script and </script syntactically appear in places where Unicode escape sequences can be used in the script part (Strings, Identifiers, and RegExp literals). React takes the approach of replacing the s character, resulting in <\u0073cript or </\u0073cript, completely safe in a Script tag.

There are a few notable exceptions where the transformed JavaScript has observably different runtime behavior. These are the only examples I'm aware of. They're more esoteric parts of the language and the likelihood of them being used in inline JavaScript with the problematic text sequences seems an acceptable tradeoff to me to enable cheap, automatic JavaScript escaping.

String.raw does not process escape sequences.

'<script>' === '<\u0073cript>'; // true
String.raw`<script>` === String.raw`<\u0073cript>`; // false

Tagged templates can also access the raw strings, again a form without processing escape sequences.

function taggedCooked( strings ) {
    return strings[0];
}
taggedCooked`<script>` === taggedCooked`<\u0073cript>`; // true

function taggedRaw( strings ) {
    return strings.raw[0];
}
taggedRaw`<script>` === taggedRaw`<\u0073cript>`; // false

The source property of RegExp contains a string representation of the pattern. JavaScript RegExp support Unicode escape sequences, but the Unicode escape sequence is not transformed in the source.

const rPlain = /<script>/;
const rEscaped = /<\u0073cript>/

rPlain.test('<script>'); // true
rEscaped.test('<script>'); // true

rPlain.source === rEscaped.source; // false
rPlain.source; // '<script>'
rEscaped.source; // '<\\u0073cript>'

Any better JavaScript escaping would likely require a complete JavaScript parser and much more invasive changes. It would be much more costly to perform. Even then, I'm not sure that the escaping could be done faithfully.

String.raw() could be split and joined:

String.raw`<script>` === String.raw`<s` + String.raw`cript>`; true 

Tagged template raw and RegExp source seem much more challenging.

Trac ticket: https://core.trac.wordpress.org/ticket/64419


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@github-actions
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

* > then let the script block's type string for this script element be "text/javascript".
*/
$type_attr = $this->get_attribute( 'type' );
$language_attr = $this->get_attribute( 'language' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like this has the potential to be clearer if we did one of two things:

  • early-abort when both the type and language attributes are missing.
  • null-coalesce to some value like '' which would be semantically the same in these checks as null but allow us to treat the values as strings.

or even something like this…

$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ?: '';        // combine `true` and `null` into ''.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.

Comment on lines 4042 to 4049
/*
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for
* > the string "module", then set el's type to "module".
*
* A module is evaluated as JavaScript.
*/
case 'module':
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PhpStorm complains about having a separate case here:

Image

It suggests moving the module case up above with the others.

This is purely stylistic, I acknowledge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this as-is, it allows quotes referencing different specifications to remain separate.

sirreal added a commit to sirreal/wordpress-develop that referenced this pull request Dec 29, 2025
@sirreal
Copy link
Member Author

sirreal commented Dec 29, 2025

I've created #10668 to align STYLE tag handling with this approach.

@dmsnell
Copy link
Member

dmsnell commented Dec 31, 2025

@sirreal after my changes the tests could be merged, and in fact, can/should be refactored a fair amount. I think he same might be true of the long comment in the JavaScript escaping function. for now, I moved the graphviz source code into a separate file in the test directory to get it out of the class. at least I didn’t like it at the bottom since it was so disconnected from the code, and I think you didn’t like it inline because of how noisy it was already. I think in the end we could link to your blog post or to the SPEC and just let it be a little reference instead of a full explanation.

at one point I added an optional parameter which was $script_type and when passed, would be set to things like classic, module, etc… but then I took it out because we didn’t have a clear need for it, plus I figured any determination of it could be done quickly after calling get_script_content_type().

if you look at the commits, you will realize I learned that script is six letters and not five 🤦‍♂️

Comment on lines 4029 to 4036
* This text may appear in the JavaScript in limited ways, all of which support
* the use of Unicode escape sequences on the "s" character. The escaping is safe
* to perform in all JavaScript and the modified JavaScript maintains identical
* behavior with a few exceptions:
*
* - Comments.
* - Tagged templates like `String.raw()` that access “raw” strings.
* - The `source` property of a RegExp object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these cases, especially in the second two, wouldn't it be preferable to have set_modifiable_text return false so as to not corrupt the JS? Otherwise, in order to faithfully escape all JS correctly, I suppose this would require a full JS tokenizer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is going to be determining if we are in these cases without a JS tokenizer.

The chance of corrupting the JavaScript should be extremely minimal, as I think one of a very rare cases would need to be in play:

  • Another JavaScript script would need to be reading the .textContent of the modified SCRIPT element and be analyzing the comments in the source code (or some other aspect of the source code) and these source-code-only changes would be meaningful.
  • The same JavaScript is inspecting the “source code value” of the template literal or RegExp source and doing something meaningful with it.

My experience is telling me that these are two chained rare events, the combination of which should be almost unseen.


In the cases where this would corrupt the JS, I don’t think there’s a way to represent the code otherwise. Would it be better to reject it entirely or risk breaking these edge cases? Would that breakage likely be meaningful? (e.g. some JS analyzer shows the Unicode escape instead of the decoded character but the runtime behaviors are all otherwise the same…)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add or re-iterate:

  • These are unusual states to be in. The vast majority of inline JavaScript can be safely escaped.
  • The tag contents were already broken. It seems preferable to potentially modify the behavior of JavaScript edge cases instead of outputting broken HTML (which could result in completely broken script tags and a fatal error making the script not work at all).
  • As you say, doing any better would require processing the JavaScript as JavaScript which we're not able to do at this time. That processing would also come at a much higher cost, and even then the potential for automatic fixes is limited and unclear. In particular, the source property of RegExp could occur elsewhere and it's impossible to locally know whether a given usage would be impacted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of outputting broken HTML

This was also not the previous state: previously, if the contents contained <script or </script then the HTML API simply rejected the update.

I have tried to clarify the expected cases of breakage in an update to the DocBlock comment. I have tried to emphasize how unlikely it is that this will break code as well as illustrate realistic scenarios where it might. Thanks @sirreal for being so comprehensive in this comment already.

* This works because of how parsing changes after encountering an opening SCRIPT
* tag. The actual parsing comprises a complicated state machine, the result of
* legacy behaviors and diverse browser support. However, without these two strings
* in the script contents, the machine collapses to a single state. A JavaScript engine
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely happy with this phrasing.

It's still many states. In the HTML standard, it's a great many states, but even in the simplified state chart it moves between script data and escaped.

Both of those states are fine because they're predictable. That is,</script> will close the script tag from either of them.

The doubled escaped state is eliminated from the simplified state chart, which is the important thing.

I've tried to rephrase this a few times and haven't been happy with the result 😕

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I meant was that there’s only one meaningful transition left, and even though there are distinct states that we move through, they are equivalent under inspection based on the transitions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your updates, but will fiddle a bit more with them.

Comment on lines 4047 to 4049
* While it may seem tempting to replace the `<` characters instead, doing so would
* only work within certain parts of JavaScript: identifiers, strings, etc… This
* would accidentally turn comparison operators into syntax errors!
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads awkwardly to me, it makes it sound like < can be part of an identifier (which I don't believe is correct).

The problem with replacing < is that it's part of JavaScript syntax, so things like the "less than" comparison operator would break if we replaced it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was trying to convey the syntax part by noting the comparison operator. I couldn’t think of any other syntax involving <

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants