feat: stream MkDocs sites by URL by QueryPlanner · Pull Request #4 · QueryPlanner/Binge-Docs

QueryPlanner · 2026-06-14T07:45:12Z

What

Make Binge Docs stream documentation from any deployed MkDocs root URL.

Why

Requiring each documentation site to be registered in source code limits the tool to a curated list. URL-driven discovery lets users listen to compatible MkDocs sites immediately.

How

Replace provider-specific commands with binge-docs listen MKDOCS_URL.
Discover canonical pages from the site sitemap.xml.
Derive page labels and sections from published URL paths.
Support Material, standard MkDocs, and ReadTheDocs content roots.
Restrict HTTPS fetching and redirects to the supplied origin and path.
Add bounded downloads, versioned caching, and stale-cache fallback.
Add deterministic offline tests for CLI, fetching, caching, parsing, narration, models, and playback.
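
The versioned caching and stale-cache fallback listed above can be pictured with a minimal in-memory sketch. This is not the PR's `WebCache`: the class name `StaleFallbackCache`, the `CACHE_VERSION` constant, and the injectable `clock` are hypothetical, and bounded downloads are left out.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

CACHE_VERSION = 1  # hypothetical; bumping it makes older entries unreachable


@dataclass
class StaleFallbackCache:
    """In-memory cache that serves an expired entry when a refresh fails."""

    ttl_seconds: float
    clock: Callable[[], float] = time.monotonic
    _entries: dict[str, tuple[float, bytes]] = field(default_factory=dict)

    def get(self, url: str, fetch: Callable[[str], bytes]) -> tuple[bytes, bool]:
        """Return (content, is_stale) for url, calling fetch only when needed."""
        key = f"v{CACHE_VERSION}:{url}"  # the version is part of the cache key
        cached = self._entries.get(key)
        now = self.clock()
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1], False  # fresh hit, no network access
        try:
            content = fetch(url)
        except OSError:
            if cached is None:
                raise  # nothing to fall back to
            return cached[1], True  # expired entry served as a stale backup
        self._entries[key] = (now, content)
        return content, False
```

The `is_stale` flag lets the caller decide how to warn the user, which keeps the cache itself free of UI concerns.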

Tests

uv lock --check
uv run ruff check .
uv run pytest --cov=binge_docs --cov-branch --cov-fail-under=100
Smoke-test Agent Foundation, SQLModel, and MkDocs.org discovery and narration

Breaking Changes

Remove binge-docs fastapi, binge-docs typer, and the source registry interface.
Require a deployed MkDocs root URL as the first argument to listen.

Related Issues

No related GitHub issue exists.


github-actions · 2026-06-14T07:46:43Z

PR Review: feat: stream MkDocs sites by URL

Summary

This is a substantial, well-engineered refactor that replaces hardcoded provider-specific commands (binge-docs fastapi, binge-docs typer) with a generic binge-docs listen MKDOCS_URL approach, discovering pages from the site's sitemap.xml.

What changed

| Area | Details |
| --- | --- |
| CLI | Removed `fastapi`/`typer` commands; new `listen` command accepts any MkDocs root URL |
| New: `documentation_fetching.py` | `URLPolicy` (origin/path validation), `SecureWebFetcher` (manual redirect following with policy checks), `WebCache` (TTL + stale fallback), path normalization that blocks traversal |
| `documentation_sources.py` | Replaced `MkDocsSourceDefinition` + `DocumentationSource` protocol with `MkDocsSource` that discovers pages from `sitemap.xml`; derives labels/sections from URLs; supports Material, standard MkDocs, and ReadTheDocs content roots |
| `errors.py` | Added `SourceError` for invalid base URLs |
| `speech.py` | Minor: added `# pragma: no cover` to `__name__ == "__main__"` guard |
| Tests | 6 new test files: 100% branch coverage, fully deterministic (no network) |
| README | Rewritten examples showing URL-driven usage |

Strengths

Strong security posture. URLPolicy.validate_url enforces HTTPS-only, rejects credentials/query/fragment/custom port, and normalize_path blocks traversal segments (..). Redirect targets are re-validated against the policy on each hop.
Clean extraction of fetching into its own module. The new WebCache is decoupled from MkDocsSource and accepts a fetcher callback, making it testable in isolation.
100% branch coverage with fully offline, deterministic tests. No test makes real network calls — all external dependencies are stubbed.
Stale-cache fallback. Cached catalogs and pages serve as offline backups with visible warnings to the user (_warn_if_stale).
Versioned catalog snapshots (CATALOG_SNAPSHOT_VERSION = 4 in cache keys) ensure cache invalidation on format changes.
Multi-theme support via ARTICLE_SELECTORS = ("article.md-content__inner", '[role="main"]', "main", "article").
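
As a rough illustration of the URL checks described in the first strength above, here is a minimal sketch using only the standard library. It is not the PR's `URLPolicy.validate_url`; the function signature and the error messages are assumptions, and the real implementation may check more or order the checks differently.

```python
from urllib.parse import urlsplit


def validate_url(url: str, base_url: str) -> str:
    """Return url unchanged if it stays under base_url, else raise ValueError."""
    base = urlsplit(base_url)
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("only https URLs are allowed")
    if parts.username or parts.password:
        raise ValueError("credentials are not allowed")
    if parts.query or parts.fragment:
        raise ValueError("query strings and fragments are not allowed")
    if parts.port not in (None, 443):
        raise ValueError("custom ports are not allowed")
    if parts.hostname != base.hostname:
        raise ValueError("URL leaves the supplied origin")
    path = parts.path or "/"
    if "\\" in path:
        raise ValueError("backslashes are not allowed")
    if any(segment in ("..", ".") for segment in path.split("/")):
        raise ValueError("traversal segments are not allowed")
    # Compare against the base path with a trailing slash so that
    # "/docs/" does not accidentally admit "/docsx/".
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    if not path.startswith(base_path):
        raise ValueError("URL leaves the supplied path")
    return url
```

Rejecting `..` segments before any normalization is a deliberate choice in this sketch: it avoids having to reason about what a server would do with a partially resolved path.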

Issues

1. Loss of diagnostic detail in fetch errors

documentation_fetching.py:119-120 — httpx.HTTPError is caught and re-raised as OSError with only the URL. The original status code and response details are dropped:

except httpx.HTTPError as error:
    raise OSError(f"Could not download {current_url}") from error

This makes debugging harder (e.g., a 403 and a 500 look identical). Consider including the status code in the message.
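
One possible shape for that change, as a sketch rather than the project's code: `httpx.HTTPStatusError` carries a `response` attribute with a `status_code`, so the message can include it when present. The helper name `describe_fetch_failure` is hypothetical, and the lookup is duck-typed so the snippet runs without httpx installed.

```python
def describe_fetch_failure(url: str, error: Exception) -> str:
    """Build an error message that keeps the HTTP status code when one exists."""
    # Only status errors carry a response; transport errors such as timeouts
    # do not, so fall back to the exception type name in that case.
    response = getattr(error, "response", None)
    status = getattr(response, "status_code", None)
    if status is not None:
        return f"Could not download {url} (HTTP {status})"
    return f"Could not download {url} ({type(error).__name__})"
```

The result would be passed to `OSError(...)` in the existing `except` clause, still using `from error` to preserve the cause chain.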

2. Silent discard of unrecognized sitemap URLs

documentation_sources.py:244-246 — URLs that fail policy.validate_url are silently skipped:

try:
    page_url = policy.validate_url(location.text.strip())
except ValueError:
    continue

This is intentional (external links) but could hide broken internal links in a sitemap. Not a blocker, but worth noting.
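
A sketch of how skipped entries could be split into expected and suspicious ones, assuming the standard sitemap namespace. The function name `partition_sitemap` and the simplified acceptance rule (HTTPS plus path prefix) are hypothetical stand-ins for the PR's `policy.validate_url`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def partition_sitemap(xml_bytes: bytes, base_url: str) -> tuple[list[str], list[str]]:
    """Split sitemap <loc> entries into (accepted, suspicious) URL lists."""
    base = urlsplit(base_url)
    accepted: list[str] = []
    suspicious: list[str] = []
    # Passing bytes lets the XML parser honor the declared encoding.
    root = ET.fromstring(xml_bytes)
    for location in root.iter(f"{SITEMAP_NS}loc"):
        url = (location.text or "").strip()
        parts = urlsplit(url)
        if parts.hostname != base.hostname:
            continue  # external link: skipping silently is expected
        if parts.scheme == "https" and parts.path.startswith(base.path):
            accepted.append(url)
        else:
            suspicious.append(url)  # same host but rejected: worth a warning
    return accepted, suspicious
```

The caller could log the `suspicious` list once per catalog refresh, which surfaces broken internal entries without making external links noisy.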

3. Section derivation is only one level deep

documentation_sources.py:285 — section = segments[0] for tutorial/advanced/testing yields section "tutorial", losing the "advanced" grouping. Fine for the current use case but limits section granularity on deeply nested sites.
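
If deeper grouping is ever wanted, a configurable depth is one option. The sketch below is hypothetical (`section_for` and its `depth` parameter are not in the PR) and assumes the page URL starts with the base URL.

```python
from urllib.parse import urlsplit


def section_for(url: str, base_url: str, depth: int = 1) -> str:
    """Return up to `depth` leading directory segments of a page URL."""
    base_path = urlsplit(base_url).path
    relative = urlsplit(url).path[len(base_path):]
    segments = [segment for segment in relative.split("/") if segment]
    parents = segments[:-1]  # the last segment is the page itself
    return "/".join(parents[:depth])
```

With `depth=1` this matches the one-level behavior described above for nested pages; `depth=2` would keep `tutorial/advanced` for `tutorial/advanced/testing`.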

4. No `sitemap.xml` fallback

If a MkDocs site deploys without a sitemap.xml, the tool fails with a CatalogError. This is a design constraint — reasonable since MkDocs always generates one — but it should be documented as a requirement.

5. Minor: duplicate `ARTICLE_SELECTOR` in narration.py

narration.py:11 defines ARTICLE_SELECTOR = "article.md-content__inner" while documentation_sources.py:19-24 defines ARTICLE_SELECTORS (plural, as a tuple). The narration module's constant is no longer referenced by callers — it's just a default parameter. Not a bug, just a slight duplication that could be cleaned up.

Verdict

Approved with minor notes. The PR is production-quality: excellent security, thorough tests, clean architecture. The issues above are observational and non-blocking.


gemini-code-assist

Code Review

This pull request refactors the application to support streaming documentation from any deployed MkDocs site dynamically via sitemaps, replacing the previous hardcoded support for FastAPI and Typer. It introduces secure web fetching and caching mechanisms (SecureWebFetcher, URLPolicy, WebCache) to validate URLs and prevent unsafe redirects, alongside comprehensive test suites. The review feedback focuses on improving robustness and security, suggesting rejecting backslashes in paths to prevent path traversal, falling back to stale cached pages if sitemap parsing fails, passing raw bytes directly to ElementTree to handle encoding automatically, defensively handling unexpected URLs in sitemaps, and stripping leading slashes from relative page references to ensure correct URL resolution.



github-actions · 2026-06-14T08:00:53Z

Here's my review of the full PR:

PR Review: feat: stream MkDocs sites by URL

I've read all 14 changed files. This is a well-executed refactor. Below is my assessment.

Architecture & Design

Clean separation of concerns: documentation_fetching.py (fetching, caching, URL validation), documentation_sources.py (MkDocs discovery/parsing), cli.py (UI). Each module is independently testable.
Security-first fetching: URLPolicy enforces HTTPS-only, blocks credentials/query/fragment/custom ports, validates redirect targets on every hop. normalize_path blocks traversal (.., .) and backslashes (\). Redirects are followed manually with per-hop policy re-validation — unlike httpx's built-in redirects which could silently leave the origin.
Stale-cache fallback: Both catalog and page loads fall back to cached content with visible warnings. Great for spotty connections.
Versioned catalog snapshots: CATALOG_SNAPSHOT_VERSION = 4 in cache keys prevents stale-format cache corruption.
Multi-theme support: ARTICLE_SELECTORS tries Material (article.md-content__inner), standard MkDocs ([role="main"]), and ReadTheDocs (main, article) content roots.

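
The manual redirect handling praised above can be sketched as follows. This is not `SecureWebFetcher`: the `send` transport callable, the `Response` tuple, and the `max_redirects` default are hypothetical, chosen so the example runs without a network or an HTTP library.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# (status_code, location_header, body) as returned by a hypothetical transport.
Response = tuple[int, Optional[str], bytes]

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def fetch_with_checked_redirects(
    url: str,
    send: Callable[[str], Response],
    validate: Callable[[str], str],
    max_redirects: int = 5,
) -> bytes:
    """Follow redirects by hand, re-validating the target on every hop."""
    current_url = validate(url)
    seen = {current_url}
    for _ in range(max_redirects + 1):
        status, location, body = send(current_url)
        if status not in REDIRECT_STATUSES:
            if status != 200:
                raise OSError(f"Could not download {current_url} (HTTP {status})")
            return body
        if not location:
            raise OSError(f"Redirect from {current_url} has no Location header")
        # Resolve relative targets, then apply the same policy as the first URL.
        current_url = validate(urljoin(current_url, location))
        if current_url in seen:
            raise OSError(f"Redirect loop at {current_url}")
        seen.add(current_url)
    raise OSError(f"Too many redirects for {url}")
```

Because `validate` runs before each request, a redirect that points off-origin fails before any bytes are fetched from the new host.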
Test Quality

100% branch coverage, fully deterministic (no network), all external dependencies stubbed. Edge cases covered: invalid UTF-8, encoding declarations, redirect loops, missing Location headers, duplicate sitemap URLs, empty sitemaps, stale cache, catalog snapshot corruption, interactive page selection, playback controls.

Items from Existing Reviews

The gemini-code-assist review flagged several items that are already addressed in the submitted code:

| Suggestion | Status |
| --- | --- |
| Reject backslashes in paths | Already at `documentation_fetching.py:199` |
| Sitemap parse error → stale fallback | Already at `documentation_sources.py:97-99` |
| Pass raw bytes to `ElementTree` | Already at `documentation_sources.py:238` |
| Defensive handling of `relative_page_path` | Already at `documentation_sources.py:252-255` |
| Strip leading slashes in `resolve_page_slug` | Already at `documentation_sources.py:321` |

Minor Observations

SecureWebFetcher.fetch_bytes flattens HTTP errors (lines 119-120): httpx.HTTPError becomes a bare OSError("Could not download <url>"). The __cause__ chain preserves the original exception for debugging, so this is fine for CLI UX. Not blocking.
Duplicate ARTICLE_SELECTOR in narration.py:11 vs ARTICLE_SELECTORS in documentation_sources.py:19. They serve different purposes (default parameter vs iterative selector probing), but removing the narration.py constant and importing from documentation_sources would reduce duplication. Minor.
Section depth: page_from_url at line 290 uses segments[0] — one level deep. Fine for current use, but multilayered sites lose grouping granularity.

Verdict

Approved. The PR is production-quality: strong security posture, thorough tests, clean modularity. No blocking issues.


feat: stream MkDocs sites by URL

6acfa43

- Accept deployed MkDocs root URLs in the listen command
- Discover canonical pages from sitemap.xml
- Enforce bounded HTTPS fetching and versioned caching
- Add offline tests with complete line and branch coverage

gemini-code-assist Bot reviewed Jun 14, 2026


fix: harden MkDocs URL and sitemap handling

26ec0bf

- Reject backslashes and unsafe protocol-relative references
- Preserve stale catalogs when fresh sitemaps cannot be parsed
- Parse XML using declared encodings and skip mismatched paths
- Add regression tests for reviewed edge cases

QueryPlanner merged commit 59cba06 into main Jun 14, 2026
7 checks passed

QueryPlanner merged 2 commits into main from codex/url-driven-mkdocs
