Add an automated auditor#8
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b7450d99d7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| diagArea.innerHTML = `<span class="font-semibold uppercase tracking-wider text-[10px] ${ | ||
| outcome === 'pass' ? 'text-emerald-700' : outcome === 'fail' ? 'text-red-700' : 'text-amber-700' | ||
| }">Diagnostic Output:</span> <span class="text-ink-700">${matchingResult.message}</span>`; |
There was a problem hiding this comment.
Escape diagnostic messages before inserting them
When a user audits a hostile site, the backend reflects site-controlled values such as <title>, meta descriptions, canonical URLs, and response headers into matchingResult.message, and this line inserts that string with innerHTML. A scanned domain can therefore return a title/header containing markup like an onerror handler and execute script on specification.website for anyone who runs the audit; render the message as text or sanitize it before composing the diagnostic markup.
Useful? React with 👍 / 👎.
| // Clean the domain input | ||
| domain = domain.replace(/^(https?:\/\/)?(www\.)?/, ''); | ||
| // Remove any trailing path/slashes | ||
| domain = domain.split('/')[0]; |
There was a problem hiding this comment.
Preserve explicit www hosts during scans
For inputs like https://www.example.com, this normalization silently scans example.com instead of the host the user entered. Many production sites serve different content or security headers on the apex versus www, and some only work on www, so the auditor can report failures or passes for the wrong site; only strip the scheme/path and keep the hostname intact.
Useful? React with 👍 / 👎.
| const checkMapping: Record<string, string> = { | ||
| // Foundations | ||
| 'canonical-url': 'canonical-urls', | ||
| 'meta-charset': 'character-encoding', | ||
| 'meta-viewport': 'viewport', | ||
| 'meta-description': 'meta-descriptions', | ||
| 'title': 'title-tags', | ||
| 'favicons': 'favicons', | ||
| 'open-graph': 'open-graph', |
There was a problem hiding this comment.
Map the automated doctype and html-lang checks
The API already returns results for doctype and html-lang, and the content collection has matching spec pages, but these slugs are omitted from checkMapping, so those two automated checks are rendered as manual-only and their backend results are ignored on every audit. Add mappings for these slugs so the UI reflects the automated results instead of asking users to verify them manually.
Useful? React with 👍 / 👎.
|
This looks amazing. Let me test it a bit more - my initial thought is "why doesn't this automatically check things like my html doctype? |
jdevalk
left a comment
There was a problem hiding this comment.
Thanks for the work on this — the breadth of checks is great. A few issues to flag before this can land.
The biggest one: the verification plan in the PR description never exercises the real API. astro dev doesn't run Cloudflare Pages Functions, so /api/check 404s, the catch block fires, and generateMockResults() returns hardcoded strings that ignore the domain entirely. Auditing joost.blog (or anything else) at localhost:4321 reports a passing doctype because the string 'Modern HTML5 doctype declaration (<!doctype html>) matches.' is baked into the mock at line 679, not because the page was ever fetched. The amber 'Local Simulation Mode' banner is the only signal that nothing real happened. Two consequences: (a) reviewers following the verification plan can't catch bugs in check.ts; (b) once the API does run, several checks have substantive correctness issues — see inline comments. I confirmed the live preview at https://specification-website-2bg.pages.dev/api/check?domain=joost.blog does return real results, so the wiring works; the dev story just needs wrangler pages dev instead of astro dev (or the mock removed so failures are visible).
Other PR-level items:
- The PR branches from
DhrRob/mainrather than a feature branch — any future commits to yourmainwill land in this PR. functions/api/check.tsis one 733-line function. Splitting by category (security headers / HTML foundations / well-known fetches / DNS) would make this far easier to review and test.- No auth, no per-IP rate limit, and ~14 outbound fetches per submission. A script POSTing in a loop costs real money on the Pages Functions budget. Worth adding at least IP-based rate limiting via Workers KV before this is linked from the home page.
- The audit's
data-check-keyis matched to API result slugs at audit.astro:477. Mismatches silently render 'Check skipped or not reported by backend API.' Worth a one-pass cross-check of every emitted slug againstsrc/content/spec/.
| mockNotice?.classList.remove('hidden'); | ||
| // Wait 1.5s to simulate scan progress | ||
| await new Promise(r => setTimeout(r, 1500)); | ||
| apiResults = generateMockResults(currentDomain); |
There was a problem hiding this comment.
This fallback is why the audit appears to work locally but isn't actually testing the site. When astro dev serves this page, /api/check 404s, the catch fires, and generateMockResults() returns ~50 hardcoded results that ignore currentDomain apart from string interpolation. Reviewers following the PR's verification plan see a pass on doctype, lang, OG, etc. for any input, including domains that don't exist.
Two fixes worth considering:
- Drop the mock entirely and surface the error — a failed
/api/checkshould look like a failure, not a success. - If you want a local-dev story, document
wrangler pages devinstead ofastro devin the PR description and CLAUDE.md.
The baked strings in generateMockResults (line 666+) should be deleted regardless — they will rot independently of the real check.ts and mislead reviewers.
| } | ||
|
|
||
| // Clean the domain input | ||
| domain = domain.replace(/^(https?:\/\/)?(www\.)?/, ''); |
There was a problem hiding this comment.
Stripping www. unconditionally is wrong for sites that only respond on the www host. The HTTPS+HSTS path uses redirect: 'follow' so it survives, but the HTTP→HTTPS redirect check at line 82 uses redirect: 'manual' and only tries http://${domain} — so if http://example.com doesn't redirect but http://www.example.com does, this falsely reports HTTPS as failing.
Suggest: keep the user's input host, or try both apex and www and pass if either redirects.
| }); | ||
|
|
||
| // skip-links | ||
| const hasSkipLink = /href\s*=\s*["']#[^"']*["']/i.test(mainPageHtml) && /skip/i.test(mainPageHtml); |
There was a problem hiding this comment.
False positives. This passes if the page contains any href="#..." and the substring skip appears anywhere — a blog post titled 'Skip the upsell' with any in-page anchor will pass. Should require something like an <a href="#main" class="…skip…"> or check the anchor text/aria-label.
|
|
||
| // PERFORMANCE | ||
| // compression | ||
| const contentEncoding = mainHeaders.get('content-encoding') || mainHeaders.get('x-encoded-content-encoding'); |
There was a problem hiding this comment.
This check is effectively dead under Workers fetch(). The Workers runtime transparently decodes Content-Encoding and strips the header from the response. So mainHeaders.get('content-encoding') returns null for almost every site, including the heavily-Brotli'd joost.blog — which is why the live API reports warning there. Either drop this check, or use a separate HEAD with cf: { cacheTtl: 0 } plus an Accept-Encoding hint, or read response.cf.encoding if present.
| results.push({ | ||
| slug: 'http3', | ||
| result: supportsH3 ? 'pass' : 'warning', | ||
| message: supportsH3 ? `HTTP/3 supported (Alt-Svc: ${altSvc.substring(0, 50)}...).` : 'Alt-Svc header for HTTP/3 not returned in standard GET headers.', |
There was a problem hiding this comment.
Same problem as the compression check: Alt-Svc is unreliable from Workers — many CF-fronted sites won't expose it here. All Cloudflare-fronted sites support H3, so this will false-warning on a huge fraction of real sites. Consider reading response.cf.httpProtocol instead (it's 'HTTP/3' when the origin fetch used H3) or just drop the check from the automated pass and mark it manual.
| }); | ||
|
|
||
| // speculation-rules | ||
| const hasSpecRules = /<script\s+[^>]*type=["']speculationrules["']/i.test(mainPageHtml); |
There was a problem hiding this comment.
The Speculation Rules spec also supports an HTTP Speculation-Rules header pointing at an external JSON file — currently this only checks the inline <script type="speculationrules"> form. joost.blog happens to use the script form (which is why it passed) but the corresponding spec page documents both.
| // PRIVACY | ||
| // privacy-policy | ||
| const privacyRegex = /href\s*=\s*["'][^"']*privacy[^"']*["']/i; | ||
| const hasPrivacyLink = privacyRegex.test(mainPageHtml) || /privacy/i.test(mainPageHtml.replace(/<[^>]+>/g, '')); |
There was a problem hiding this comment.
Strong false-positive risk. /privacy/i.test(mainPageHtml.replace(/<[^>]+>/g, '')) matches the literal word 'privacy' anywhere in visible text — 'for privacy reasons', 'we respect your privacy', etc. A site with no privacy policy will routinely pass this. Restrict to the <a href> test only (line 453), or require the anchor text/title to contain 'privacy'.
There was a problem hiding this comment.
My knowledge here is not 100%, but would a rel="privacy-policy" suffice, or perhaps this is not as standard as I think it is? There's a fine line between keeping things simple and working on this one.
| if (src.startsWith('//')) return true; | ||
| if (src.startsWith('http://') || src.startsWith('https://')) { | ||
| const parsed = new URL(src); | ||
| return !parsed.hostname.endsWith(domain); |
There was a problem hiding this comment.
Unsound same-origin check. parsed.hostname.endsWith(domain) does a substring match — for domain = 'joost.blog', evil-joost.blog would be treated as same-origin. Use a .${domain} boundary or proper public-suffix-aware comparison.
const h = parsed.hostname;
return h !== domain && !h.endsWith(`.${domain}`);| { slug: 'change-password', url: `${rootUrl}/.well-known/change-password`, manualRedirect: true }, | ||
| { slug: 'llms-txt', url: `${rootUrl}/llms.txt` }, | ||
| { slug: 'apple-app-site-association', url: `${rootUrl}/.well-known/apple-app-site-association` }, | ||
| { slug: 'assetlinks-json', url: `${rootUrl}//.well-known/assetlinks.json` }, |
There was a problem hiding this comment.
Double slash: ${rootUrl}//.well-known/assetlinks.json → https://example.com//.well-known/.... Most servers tolerate it but some normalisation layers and CDNs will 404. Should be ${rootUrl}/.well-known/assetlinks.json.
|
|
||
| const USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 SpecAuditor/1.0'; | ||
|
|
||
| export const onRequest: PagesFunction = async (context) => { |
There was a problem hiding this comment.
Top-of-function concerns (no inline-friendly location, putting them here):
-
Abuse vector. This endpoint accepts arbitrary user-supplied hostnames, fans out 14+ outbound requests per submission, and is unauthenticated and unmetered. A script POSTing in a loop will cost real money on the Pages Functions budget. Add IP-based rate limiting (Workers KV with a 60s window is enough) before linking this from the home page.
-
Function length. 733 lines in one handler is hard to review and impossible to unit-test piecewise. Splitting by concern (security-headers.ts, html-foundations.ts, well-known.ts, dns.ts) and composing them in the handler would also make it easier to add new checks as the spec catalog grows.
-
Slug alignment.
audit.astromatches API slugs todata-check-keyattributes silently — a mismatch renders 'Check skipped or not reported by backend API.' Worth grepping the slugs emitted here againstsrc/content/spec/<cat>/<slug>.mdso the audit and the catalog stay in lockstep.
|
Cheers for the great feedback @jdevalk, I'll make some changes in the coming days! I'm not sure if I will be able to fix them all, considering my lack of experience with Astro, but I will try! |
What this changes
Adds a new primary call to action to the home page, a main navigation item and a footer item linking to
/audit/. From here, users can simply enter their URL and the website will check 56 of the 108 checklist items automatically. The other items will still require manual auditing, as they cannot be tested reliably without AI.It consists off a serverless Cloudflare Pages function (
/api/check)Known issues
To do
Verification Plan
Live demo running at https://specification-website-2bg.pages.dev/ or alternatively: