-
Notifications
You must be signed in to change notification settings - Fork 3.2k
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
Conversation
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
| * > then let the script block's type string for this script element be "text/javascript". | ||
| */ | ||
| $type_attr = $this->get_attribute( 'type' ); | ||
| $language_attr = $this->get_attribute( 'language' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like this has the potential to be clearer if we did one of two things:
- early-abort when both the type and language attributes are missing.
- null-coalesce to some value like
''which would be semantically the same in these checks asnullbut allow us to treat the values as strings.
or even something like this…
$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ?: ''; // combine `true` and `null` into ''.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.
| /* | ||
| * > Otherwise, if the script block's type string is an ASCII case-insensitive match for | ||
| * > the string "module", then set el's type to "module". | ||
| * | ||
| * A module is evaluated as JavaScript. | ||
| */ | ||
| case 'module': | ||
| return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer this as-is, it allows quotes referencing different specifications to remain separate.
Co-authored-by: Weston Ruter <westonruter@gmail.com>
|
I've created #10668 to align STYLE tag handling with this approach. |
|
@sirreal after my changes the tests could be merged, and in fact, can/should be refactored a fair amount. I think he same might be true of the long comment in the JavaScript escaping function. for now, I moved the at one point I added an optional parameter which was if you look at the commits, you will realize I learned that |
| * This text may appear in the JavaScript in limited ways, all of which support | ||
| * the use of Unicode escape sequences on the "s" character. The escaping is safe | ||
| * to perform in all JavaScript and the modified JavaScript maintains identical | ||
| * behavior with a few exceptions: | ||
| * | ||
| * - Comments. | ||
| * - Tagged templates like `String.raw()` that access “raw” strings. | ||
| * - The `source` property of a RegExp object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In these cases, especially in the second two, wouldn't it be preferable to have set_modifiable_text return false so as to not corrupt the JS? Otherwise, in order to faithfully escape all JS correctly, I suppose this would require a full JS tokenizer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is going to be determining if we are in these cases without a JS tokenizer.
The chance of corrupting the JavaScript should be extremely minimal, as I think one of a very rare cases would need to be in play:
- Another JavaScript script would need to be reading the
.textContentof the modifiedSCRIPTelement and be analyzing the comments in the source code (or some other aspect of the source code) and these source-code-only changes would be meaningful. - The same JavaScript is inspecting the “source code value” of the template literal or RegExp source and doing something meaningful with it.
My experience is telling me that these are two chained rare events, the combination of which should be almost unseen.
In the cases where this would corrupt the JS, I don’t think there’s a way to represent the code otherwise. Would it be better to reject it entirely or risk breaking these edge cases? Would that breakage likely be meaningful? (e.g. some JS analyzer shows the Unicode escape instead of the decoded character but the runtime behaviors are all otherwise the same…)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add or re-iterate:
- These are unusual states to be in. The vast majority of inline JavaScript can be safely escaped.
- The tag contents were already broken. It seems preferable to potentially modify the behavior of JavaScript edge cases instead of outputting broken HTML (which could result in completely broken script tags and a fatal error making the script not work at all).
- As you say, doing any better would require processing the JavaScript as JavaScript which we're not able to do at this time. That processing would also come at a much higher cost, and even then the potential for automatic fixes is limited and unclear. In particular, the
sourceproperty ofRegExpcould occur elsewhere and it's impossible to locally know whether a given usage would be impacted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of outputting broken HTML
This was also not the previous state: previously, if the contents contained <script or </script then the HTML API simply rejected the update.
I have tried to clarify the expected cases of breakage in an update to the DocBlock comment. I have tried to emphasize how unlikely it is that this will break code as well as illustrate realistic scenarios where it might. Thanks @sirreal for being so comprehensive in this comment already.
Avoid naming unescaped plaintext variable as "escaped."
| * This works because of how parsing changes after encountering an opening SCRIPT | ||
| * tag. The actual parsing comprises a complicated state machine, the result of | ||
| * legacy behaviors and diverse browser support. However, without these two strings | ||
| * in the script contents, the machine collapses to a single state. A JavaScript engine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely happy with this phrasing.
It's still many states. In the HTML standard, it's a great many states, but even in the simplified state chart it moves between script data and escaped.
Both of those states are fine because they're predictable. That is,</script> will close the script tag from either of them.
The doubled escaped state is eliminated from the simplified state chart, which is the important thing.
I've tried to rephrase this a few times and haven't been happy with the result 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what I meant was that there’s only one meaningful transition left, and even though there are distinct states that we move through, they are equivalent under inspection based on the transitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your updates, but will fiddle a bit more with them.
| * While it may seem tempting to replace the `<` characters instead, doing so would | ||
| * only work within certain parts of JavaScript: identifiers, strings, etc… This | ||
| * would accidentally turn comparison operators into syntax errors! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reads awkwardly to me, it makes it sound like < can be part of an identifier (which I don't believe is correct).
The problem with replacing < is that it's part of JavaScript syntax, so things like the "less than" comparison operator would break if we replaced it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I was trying to convey the syntax part by noting the comparison operator. I couldn’t think of any other syntax involving <

The HTML API currently rejects script tag contents that may be dangerous. This is a proposal to detect JavaScript and JSON script tags and automatically escape contents when necessary.
<scriptor</script(case-insensitive) is found.In JSON
<is replaced with\u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.This is proposed as a simple character replacement with
strtr. This should be highly performant. A less invasive replacement could be done to only replace<in<scriptor</scriptwhere it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require either a regular expression with case-insensitive matching (see JavaScript example).In JavaScript
<scriptand</script(followed by a necessary tag termination character\t\n\r\f/>) thesis replaced with its Unicode escape. This should remain valid in all contexts where the text can appear and maintain identical behavior in all except a few edge cases (see ticket or quoted section below for full explanation and caveats).From the ticket:
Trac ticket: https://core.trac.wordpress.org/ticket/64419
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.