fix: crawl sitemap index to resolve actual page URLs#1031

Merged
dslovinsky merged 1 commit into main from ds/sitemap-crawl-metadata
Feb 13, 2026

Conversation

@dslovinsky
Collaborator

Summary

  • Updates generate-metadata to recursively crawl the sitemap index at sitemap.xml instead of reading it as a flat sitemap
  • The new docs site serves a <sitemapindex> that references child sitemaps (sitemap-0.xml, etc.); the script now follows those references to collect all actual page URLs
  • Handles future growth if additional sitemap files are added (e.g. sitemap-1.xml)
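
The recursive crawl described above can be sketched roughly as follows. This is a minimal illustration, not the actual generate-metadata code: `collectPageUrls`, `locValues`, and the injected `fetchText` loader are hypothetical names, and the regex-based parsing assumes plain `<loc>` elements with no CDATA sections or entity decoding.

```typescript
// Hypothetical sketch; names and parsing approach are assumptions.
type FetchText = (url: string) => Promise<string>;

// Collect the trimmed text of every <loc> element in a sitemap document.
function locValues(xml: string): string[] {
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const locs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Resolve a sitemap URL to the page URLs it ultimately lists.
async function collectPageUrls(
  sitemapUrl: string,
  fetchText: FetchText,
  seen: Set<string> = new Set<string>(),
): Promise<string[]> {
  // Skip sitemaps already visited so a cyclic index cannot loop forever.
  if (seen.has(sitemapUrl)) return [];
  seen.add(sitemapUrl);

  const xml = await fetchText(sitemapUrl);
  const locs = locValues(xml);

  // A flat sitemap (<urlset>) lists page URLs directly.
  if (!/<sitemapindex[\s>]/.test(xml)) return locs;

  // A sitemap index lists child sitemaps: follow each one in order.
  const pages: string[] = [];
  for (const child of locs) {
    const childPages = await collectPageUrls(child, fetchText, seen);
    pages.push(...childPages);
  }
  return pages;
}
```

Injecting the loader keeps the sketch testable without network access. On Node 18 or later, the built-in `fetch` could be passed as `(url) => fetch(url).then((res) => res.text())`. Because every `<loc>` in an index is followed, a later sitemap-1.xml would be picked up without further changes.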

Closes DX-2215

Test plan

  • Ran pnpm run generate:metadata: collected 4,781 page URLs (previously it returned only the sitemap-0.xml URL)
  • Verify metadata.json is consumed correctly downstream

The new docs site uses a sitemapindex at sitemap.xml that references
child sitemaps (sitemap-0.xml, etc.) instead of listing pages directly.
This recursively follows the index to collect all page URLs.

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

github-actions bot commented Feb 13, 2026

🌿 Documentation Preview

| Name | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| Alchemy Docs | ✅ Ready | 🔗 Visit Preview | Feb 13, 2026, 10:49 PM |

@github-actions github-actions bot temporarily deployed to docs-preview February 13, 2026 22:47 Destroyed
@dslovinsky dslovinsky marked this pull request as ready for review February 13, 2026 22:48
@dslovinsky dslovinsky requested a review from a team as a code owner February 13, 2026 22:48
Copilot AI review requested due to automatic review settings February 13, 2026 22:48
@dslovinsky dslovinsky self-assigned this Feb 13, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Updates the metadata generation script to correctly collect page URLs from docs sites that publish a sitemap index (<sitemapindex>) pointing to multiple child sitemaps, instead of treating sitemap.xml as a flat sitemap.

Changes:

  • Adds helpers to extract <loc> values and detect sitemap indexes.
  • Recursively fetches child sitemaps when sitemap.xml is a sitemap index.
  • Logs how many URLs were collected and writes them into metadata.json alongside spec file URLs.
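
For illustration, the two kinds of helper mentioned in the first bullet might look like the sketch below. The names `extractLocs` and `isSitemapIndex` and the regex-based approach are assumptions for this example, not the code from the PR, and the example.test URLs are placeholders.

```typescript
// Hypothetical helpers; names and regexes are assumptions for illustration.

// Return the trimmed text of every <loc> element in the document.
function extractLocs(xml: string): string[] {
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const locs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// True when the document contains a <sitemapindex> element,
// i.e. its <loc> values point at child sitemaps rather than pages.
function isSitemapIndex(xml: string): boolean {
  return /<sitemapindex[\s>]/.test(xml);
}

// A sitemap index: each <loc> names another sitemap file.
const sampleIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.test/sitemap-0.xml</loc></sitemap>
</sitemapindex>`;

// A flat sitemap: each <loc> names an actual page.
const sampleUrlset = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.test/docs/intro</loc></url>
</urlset>`;

console.log(isSitemapIndex(sampleIndex), extractLocs(sampleIndex));
console.log(isSitemapIndex(sampleUrlset), extractLocs(sampleUrlset));
```

Treating the index as a flat sitemap would yield only the child sitemap URL from `sampleIndex`, which matches the symptom noted in the test plan.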

@dslovinsky dslovinsky merged commit 5639932 into main Feb 13, 2026
14 of 15 checks passed
@dslovinsky dslovinsky deleted the ds/sitemap-crawl-metadata branch February 13, 2026 23:08