feat: stream MkDocs sites by URL #4

Merged
QueryPlanner merged 2 commits into main from codex/url-driven-mkdocs
Jun 14, 2026

@QueryPlanner

What

Make Binge Docs stream documentation from any deployed MkDocs root URL.

Why

Requiring each documentation site to be registered in source code limits the tool to a curated list. URL-driven discovery lets users listen to compatible MkDocs sites immediately.

How

  • Replace provider-specific commands with binge-docs listen MKDOCS_URL.
  • Discover canonical pages from the site sitemap.xml.
  • Derive page labels and sections from published URL paths.
  • Support Material, standard MkDocs, and ReadTheDocs content roots.
  • Restrict HTTPS fetching and redirects to the supplied origin and path.
  • Add bounded downloads, versioned caching, and stale-cache fallback.
  • Add deterministic offline tests for CLI, fetching, caching, parsing, narration, models, and playback.
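
The sitemap discovery step listed above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from this PR: the function name `discover_page_urls`, the sample sitemap, and the `docs.example.com` host are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def discover_page_urls(sitemap_bytes: bytes, base_url: str) -> list[str]:
    """Return unique <loc> URLs that live under base_url, in sitemap order."""
    prefix = base_url.rstrip("/") + "/"
    # Parsing bytes lets the XML parser honor the declared encoding.
    root = ET.fromstring(sitemap_bytes)
    seen: set[str] = set()
    pages: list[str] = []
    for location in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (location.text or "").strip()
        if url.startswith(prefix) and url not in seen:
            seen.add(url)
            pages.append(url)
    return pages


# Hypothetical sitemap with one external entry and one duplicate.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/guide/</loc></url>
  <url><loc>https://docs.example.com/guide/install/</loc></url>
  <url><loc>https://other.example.com/page/</loc></url>
  <url><loc>https://docs.example.com/guide/</loc></url>
</urlset>"""

pages = discover_page_urls(SAMPLE, "https://docs.example.com")
```

In this sketch the external entry and the duplicate are dropped, so only the two pages under the supplied root remain.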

Tests

  • uv lock --check
  • uv run ruff check .
  • uv run pytest --cov=binge_docs --cov-branch --cov-fail-under=100
  • Smoke-test Agent Foundation, SQLModel, and MkDocs.org discovery and narration

Breaking Changes

  • Remove binge-docs fastapi, binge-docs typer, and the source registry interface.
  • Require a deployed MkDocs root URL as the first argument to listen.

Related Issues

No related GitHub issue exists.

- Accept deployed MkDocs root URLs in the listen command
- Discover canonical pages from sitemap.xml
- Enforce bounded HTTPS fetching and versioned caching
- Add offline tests with complete line and branch coverage
@github-actions

PR Review: feat: stream MkDocs sites by URL

Summary

This is a substantial, well-engineered refactor that replaces hardcoded provider-specific commands (binge-docs fastapi, binge-docs typer) with a generic binge-docs listen MKDOCS_URL approach, discovering pages from the site's sitemap.xml.


What changed

| Area | Details |
| --- | --- |
| CLI | Removed fastapi/typer commands; new listen command accepts any MkDocs root URL |
| New: documentation_fetching.py | URLPolicy (origin/path validation), SecureWebFetcher (manual redirect following with policy checks), WebCache (TTL + stale fallback), path normalization that blocks traversal |
| documentation_sources.py | Replaced MkDocsSourceDefinition + DocumentationSource protocol with MkDocsSource that discovers pages from sitemap.xml; derives labels/sections from URLs; supports Material, standard MkDocs, and ReadTheDocs content roots |
| errors.py | Added SourceError for invalid base URLs |
| speech.py | Minor: added # pragma: no cover to __name__ == "__main__" guard |
| Tests | 6 new test files: 100% branch coverage, fully deterministic (no network) |
| README | Rewritten examples showing URL-driven usage |

Strengths

  1. Strong security posture. URLPolicy.validate_url enforces HTTPS-only, rejects credentials/query/fragment/custom port, and normalize_path blocks traversal segments (..). Redirect targets are re-validated against the policy on each hop.

  2. Clean extraction of fetching into its own module. The new WebCache is decoupled from MkDocsSource and accepts a fetcher callback, making it testable in isolation.

  3. 100% branch coverage with fully offline, deterministic tests. No test makes real network calls — all external dependencies are stubbed.

  4. Stale-cache fallback. Cached catalogs and pages serve as offline backups with visible warnings to the user (_warn_if_stale).

  5. Versioned catalog snapshots (CATALOG_SNAPSHOT_VERSION = 4 in cache keys) ensure cache invalidation on format changes.

  6. Multi-theme support via ARTICLE_SELECTORS = ("article.md-content__inner", '[role="main"]', "main", "article").
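
The URL policy described in strength 1 could look something like the sketch below. It is an assumption-laden illustration built on `urllib.parse`, not the PR's `URLPolicy`: the function name, error messages, and the choice to reject every explicit port are all hypothetical simplifications.

```python
from urllib.parse import urlsplit


def validate_url(url: str, base_url: str) -> str:
    """Return url unchanged if it satisfies the policy, else raise ValueError."""
    parts = urlsplit(url)
    base = urlsplit(base_url)
    if parts.scheme != "https":
        raise ValueError("only https URLs are allowed")
    if parts.username or parts.password:
        raise ValueError("credentials are not allowed")
    if parts.query or parts.fragment:
        raise ValueError("query strings and fragments are not allowed")
    if parts.port is not None:
        # Simplification: any explicit port counts as a custom port here.
        raise ValueError("explicit ports are not allowed")
    if parts.hostname != base.hostname:
        raise ValueError("URL leaves the supplied origin")
    segments = parts.path.split("/")
    if "\\" in parts.path or any(segment in ("..", ".") for segment in segments):
        raise ValueError("path contains traversal segments or backslashes")
    base_path = base.path.rstrip("/") + "/"
    if not (parts.path.rstrip("/") + "/").startswith(base_path):
        raise ValueError("URL leaves the supplied path")
    return url


# Hypothetical documentation root used only for this example.
BASE = "https://docs.example.com/en/latest/"
ok = validate_url("https://docs.example.com/en/latest/guide/", BASE)
```

Calling the same function on every redirect target is one way to get the per-hop re-validation the review mentions.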


Issues

1. Loss of diagnostic detail in fetch errors

documentation_fetching.py:119-120: httpx.HTTPError is caught and re-raised as OSError with only the URL. The original status code and response details are dropped:

except httpx.HTTPError as error:
    raise OSError(f"Could not download {current_url}") from error

This makes debugging harder (for example, a 403 and a 500 produce the same message). Consider including the status code in the message.
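
One possible shape for that suggestion is sketched below. It assumes the caught exception may carry a `response` object with a `status_code`, as `httpx.HTTPStatusError` does; the helper name `describe_download_failure` and the stand-in classes are hypothetical and exist only so the example runs without a network or third-party packages.

```python
def describe_download_failure(url: str, error: Exception) -> str:
    """Build an error message that keeps the HTTP status code when one exists."""
    # Status errors expose the failed response; transport-level errors
    # (timeouts, DNS failures) have none, so fall back to the exception type.
    response = getattr(error, "response", None)
    status = getattr(response, "status_code", None)
    if status is not None:
        return f"Could not download {url} (HTTP {status})"
    return f"Could not download {url} ({type(error).__name__})"


class FakeResponse:
    """Stand-in for a response object; only status_code is needed here."""

    status_code = 403


class FakeStatusError(Exception):
    """Stand-in for an HTTP status error that carries its response."""

    def __init__(self, response: FakeResponse) -> None:
        super().__init__("status error")
        self.response = response


message = describe_download_failure(
    "https://docs.example.com/", FakeStatusError(FakeResponse())
)
```

The resulting string could then be passed to `OSError(...)` in place of the URL-only message, still using `from error` to keep the exception chain.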

2. Silent discard of unrecognized sitemap URLs

documentation_sources.py:244-246 — URLs that fail policy.validate_url are silently skipped:

try:
    page_url = policy.validate_url(location.text.strip())
except ValueError:
    continue

This is intentional (external links) but could hide broken internal links in a sitemap. Not a blocker, but worth noting.
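
To make skipped entries visible without changing the behavior, the loop could separate external links from rejected same-host entries. The sketch below is only an illustration of that idea; `partition_locations` and its simplified acceptance rule (HTTPS plus a path prefix) are assumptions, not the policy implemented in this PR.

```python
from urllib.parse import urlsplit


def partition_locations(
    locations: list[str], base_url: str
) -> tuple[list[str], list[str], list[str]]:
    """Split sitemap locations into accepted, rejected same-host, and external."""
    base = urlsplit(base_url)
    prefix = base.path.rstrip("/") + "/"
    accepted: list[str] = []
    rejected_internal: list[str] = []
    external: list[str] = []
    for raw in locations:
        parts = urlsplit(raw.strip())
        if parts.hostname != base.hostname:
            external.append(raw)  # expected: links to other sites
        elif parts.scheme == "https" and parts.path.startswith(prefix):
            accepted.append(raw)
        else:
            rejected_internal.append(raw)  # worth a warning: same host, but rejected
    return accepted, rejected_internal, external


# Hypothetical sitemap entries for a site rooted at /guide/.
accepted, rejected_internal, external = partition_locations(
    [
        "https://docs.example.com/guide/",
        "http://docs.example.com/guide/old/",
        "https://cdn.example.net/asset/",
    ],
    "https://docs.example.com/guide/",
)
```

A caller could log or count `rejected_internal` while continuing to ignore `external` silently.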

3. Section derivation is only one level deep

documentation_sources.py:285: section = segments[0] for tutorial/advanced/testing yields section "tutorial", losing the "advanced" grouping. Fine for the current use case but limits section granularity on deeply nested sites.
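
If deeper grouping were ever wanted, a depth parameter is one option. The following is a hypothetical sketch, not the PR's `page_from_url` logic: `derive_section`, its `depth` argument, and the treatment of the last segment as the page name are assumptions.

```python
def derive_section(relative_path: str, depth: int = 1) -> str:
    """Join up to `depth` leading directory segments into a section label."""
    segments = [segment for segment in relative_path.strip("/").split("/") if segment]
    directories = segments[:-1]  # assume the last segment names the page itself
    if not directories:
        return ""  # top-level pages have no section in this sketch
    return "/".join(directories[:depth])


one_level = derive_section("tutorial/advanced/testing")
two_levels = derive_section("tutorial/advanced/testing", depth=2)
```

With the default depth the result matches the one-level behavior described above ("tutorial"); with `depth=2` the "advanced" grouping is kept as "tutorial/advanced".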

4. No sitemap.xml fallback

If a MkDocs site deploys without a sitemap.xml, the tool fails with a CatalogError. This is a design constraint — reasonable since MkDocs always generates one — but it should be documented as a requirement.

5. Minor: duplicate ARTICLE_SELECTOR in narration.py

narration.py:11 defines ARTICLE_SELECTOR = "article.md-content__inner" while documentation_sources.py:19-24 defines ARTICLE_SELECTORS (plural, as a tuple). The narration module's constant is no longer referenced by callers — it's just a default parameter. Not a bug, just a slight duplication that could be cleaned up.


Verdict

Approved with minor notes. The PR is production-quality: excellent security, thorough tests, clean architecture. The issues above are observational and non-blocking.

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the application to support streaming documentation from any deployed MkDocs site dynamically via sitemaps, replacing the previous hardcoded support for FastAPI and Typer. It introduces secure web fetching and caching mechanisms (SecureWebFetcher, URLPolicy, WebCache) to validate URLs and prevent unsafe redirects, alongside comprehensive test suites. The review feedback focuses on improving robustness and security, suggesting rejecting backslashes in paths to prevent path traversal, falling back to stale cached pages if sitemap parsing fails, passing raw bytes directly to ElementTree to handle encoding automatically, defensively handling unexpected URLs in sitemaps, and stripping leading slashes from relative page references to ensure correct URL resolution.

Comment thread binge_docs/documentation_fetching.py
Comment thread binge_docs/documentation_sources.py
Comment thread binge_docs/documentation_sources.py Outdated
Comment thread binge_docs/documentation_sources.py Outdated
Comment thread binge_docs/documentation_sources.py Outdated
Comment thread tests/test_documentation_sources.py Outdated
- Reject backslashes and unsafe protocol-relative references
- Preserve stale catalogs when fresh sitemaps cannot be parsed
- Parse XML using declared encodings and skip mismatched paths
- Add regression tests for reviewed edge cases
@github-actions

Here's my review of the full PR:

PR Review: feat: stream MkDocs sites by URL

I've read all 14 changed files. This is a well-executed refactor. Below is my assessment.


Architecture & Design

  • Clean separation of concerns: documentation_fetching.py (fetching, caching, URL validation), documentation_sources.py (MkDocs discovery/parsing), cli.py (UI). Each module is independently testable.
  • Security-first fetching: URLPolicy enforces HTTPS-only, blocks credentials/query/fragment/custom ports, validates redirect targets on every hop. normalize_path blocks traversal (.., .) and backslashes (\). Redirects are followed manually with per-hop policy re-validation — unlike httpx's built-in redirects which could silently leave the origin.
  • Stale-cache fallback: Both catalog and page loads fall back to cached content with visible warnings. Great for spotty connections.
  • Versioned catalog snapshots: CATALOG_SNAPSHOT_VERSION = 4 in cache keys prevents stale-format cache corruption.
  • Multi-theme support: ARTICLE_SELECTORS tries the Material (article.md-content__inner), standard MkDocs ([role="main"]), and ReadTheDocs (main, article) selectors in turn.
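
The manual redirect handling described in the list above can be illustrated with a small sketch. It is not the PR's `SecureWebFetcher`: the `send` and `validate` callables, the tuple-based response, and the route table are hypothetical stand-ins so the example runs offline.

```python
from typing import Callable
from urllib.parse import urljoin

# (status_code, headers, body): a simplified stand-in for an HTTP response.
Response = tuple[int, dict[str, str], bytes]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_following_redirects(
    url: str,
    send: Callable[[str], Response],
    validate: Callable[[str], str],
    max_redirects: int = 5,
) -> bytes:
    """Follow redirects by hand, re-validating the target URL on every hop."""
    current_url = validate(url)
    for _ in range(max_redirects + 1):
        status, headers, body = send(current_url)
        if status in REDIRECT_CODES:
            location = headers.get("location")
            if not location:
                raise OSError(f"Redirect from {current_url} has no Location header")
            # Resolve relative targets, then check the policy before following.
            current_url = validate(urljoin(current_url, location))
            continue
        if status != 200:
            raise OSError(f"Could not download {current_url} (HTTP {status})")
        return body
    raise OSError(f"Too many redirects starting at {url}")


def allow_docs_only(candidate: str) -> str:
    """Toy policy: only URLs on the hypothetical docs origin are allowed."""
    if not candidate.startswith("https://docs.example.com/"):
        raise ValueError(f"URL leaves the allowed origin: {candidate}")
    return candidate


ROUTES: dict[str, Response] = {
    "https://docs.example.com/old/": (301, {"location": "/new/"}, b""),
    "https://docs.example.com/new/": (200, {}, b"<html>ok</html>"),
    "https://docs.example.com/leak/": (302, {"location": "https://evil.example.net/"}, b""),
}

body = fetch_following_redirects(
    "https://docs.example.com/old/", ROUTES.__getitem__, allow_docs_only
)
```

Fetching the `/leak/` route with the same policy raises `ValueError` before any request is sent to the off-origin target, which is the property the review highlights.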

Test Quality

100% branch coverage, fully deterministic (no network), all external dependencies stubbed. Edge cases covered: invalid UTF-8, encoding declarations, redirect loops, missing Location headers, duplicate sitemap URLs, empty sitemaps, stale cache, catalog snapshot corruption, interactive page selection, playback controls.

Items from Existing Reviews

The gemini-code-assist review flagged several items that are already addressed in the submitted code:

| Suggestion | Status |
| --- | --- |
| Reject backslashes in paths | Already at documentation_fetching.py:199 |
| Sitemap parse error → stale fallback | Already at documentation_sources.py:97-99 |
| Pass raw bytes to ElementTree | Already at documentation_sources.py:238 |
| Defensive handling of relative_page_path | Already at documentation_sources.py:252-255 |
| Strip leading slashes in resolve_page_slug | Already at documentation_sources.py:321 |

Minor Observations

  1. SecureWebFetcher.fetch_bytes flattens HTTP errors (lines 119-120): httpx.HTTPError becomes a bare OSError("Could not download <url>"). The __cause__ chain preserves the original exception for debugging, so this is fine for CLI UX. Not blocking.

  2. Duplicate ARTICLE_SELECTOR in narration.py:11 vs ARTICLE_SELECTORS in documentation_sources.py:19. They serve different purposes (default parameter vs iterative selector probing), but removing the narration.py constant and importing from documentation_sources would reduce duplication. Minor.

  3. Section depth: page_from_url at line 290 uses segments[0] — one level deep. Fine for current use, but multilayered sites lose grouping granularity.

Verdict

Approved. The PR is production-quality: strong security posture, thorough tests, clean modularity. No blocking issues.

@QueryPlanner QueryPlanner merged commit 59cba06 into main Jun 14, 2026
7 checks passed