feat(fetchers): WikipediaFetcher — clean article extraction via MediaWiki API #55

@chaliy

What

Add a WikipediaFetcher that matches https://{lang}.wikipedia.org/wiki/{title} URLs (en.wikipedia.org and every other language edition) and returns clean article content via the MediaWiki API.

Why

Agents doing research and fact-checking frequently land on Wikipedia. The current DefaultFetcher returns the full page, including edit links, reference sections, navigation boxes, and other wiki-specific chrome. The MediaWiki API provides clean extract text and structured metadata instead.

Requirements

  • Match: https://{lang}.wikipedia.org/wiki/{title} (all language editions)
  • Fetch via API: https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title} for summary
  • Optionally fetch full content via: https://{lang}.wikipedia.org/api/rest_v1/page/html/{title}
  • Return: title, extract/summary, infobox data (if parseable), key sections, categories
  • Strip: edit links, reference numbers, navigation boxes, disambiguation notices
  • Format field: "wikipedia"
  • Support redirect resolution
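The matching and endpoint rules above can be sketched as follows. This is a minimal illustration in Python, not the project's implementation: `WikiTarget`, `match_wikipedia_url`, and `api_url` are hypothetical names, and excluding the `www` portal host is an assumption about what "all language editions" should cover.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import quote, unquote, urlsplit

# Assumption: a language edition is any subdomain label such as "en",
# "simple", or "zh-min-nan"; the "www" portal is rejected explicitly below.
_HOST_RE = re.compile(r"^(?P<lang>[a-z]{2,}(?:-[a-z]+)*)\.wikipedia\.org$")
_WIKI_PREFIX = "/wiki/"


class WikiTarget(NamedTuple):
    lang: str
    title: str


def match_wikipedia_url(url: str) -> Optional[WikiTarget]:
    """Return (lang, title) for an article URL, or None if it does not match."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or parts.hostname is None:
        return None
    host = _HOST_RE.match(parts.hostname)
    if host is None or host.group("lang") == "www":
        return None
    if not parts.path.startswith(_WIKI_PREFIX):
        return None
    # Decode percent-escapes and normalise spaces to underscores.
    title = unquote(parts.path[len(_WIKI_PREFIX):]).replace(" ", "_")
    if not title:
        return None
    return WikiTarget(host.group("lang"), title)


def api_url(target: WikiTarget, kind: str = "summary") -> str:
    """Build the rest_v1 URL for the summary or full-HTML endpoint."""
    if kind not in ("summary", "html"):
        raise ValueError(f"unsupported endpoint kind: {kind}")
    # safe="" also encodes "/", so a title like "AC/DC" stays one path segment.
    encoded = quote(target.title, safe="")
    return f"https://{target.lang}.wikipedia.org/api/rest_v1/page/{kind}/{encoded}"
```

For example, `api_url(match_wikipedia_url("https://de.wikipedia.org/wiki/AC/DC"), "html")` yields a URL ending in `/page/html/AC%2FDC`. URL fragments such as `#History` are dropped by `urlsplit` before the title is extracted.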

Design Notes

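  • The Wikimedia REST API (the rest_v1 endpoints above) is well documented, and its rate limits are generous for typical agent traffic; requests should send a descriptive User-Agent header, per Wikimedia policy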
  • Summary endpoint returns a concise extract — often sufficient for agent needs
  • Full HTML endpoint can be converted via existing html_to_markdown with wiki-specific cleanup
  • Infobox extraction is complex — could be a stretch goal
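The wiki-specific cleanup mentioned above could run as a pre-pass before html_to_markdown. The sketch below uses only the Python standard library and is one possible approach, not the planned implementation. The class names in `DROP_CLASSES` are assumptions about the markup Wikipedia typically emits and would need checking against real responses. The depth counter also assumes well-formed input in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Assumed class names for edit links, reference markers, navigation boxes,
# and disambiguation hatnotes; verify against actual API output.
DROP_CLASSES = {"mw-editsection", "reference", "navbox", "hatnote"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class WikiChromeStripper(HTMLParser):
    """Re-emit HTML while dropping elements that carry a chrome class."""

    def __init__(self):
        # Keep character references raw so they can be passed through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _dropped(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return bool(DROP_CLASSES.intersection(classes))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Void elements have no end tag, so they never change the depth.
            if self.skip_depth == 0 and not self._dropped(attrs):
                self.out.append(self.get_starttag_text())
            return
        if self.skip_depth or self._dropped(attrs):
            self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit or drop, depth unchanged.
        if self.skip_depth == 0 and not self._dropped(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_wiki_chrome(html_text: str) -> str:
    parser = WikiChromeStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Comments and doctype declarations are silently discarded because their handlers are not overridden. Infobox extraction is left out here, in line with the stretch-goal note above.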

Tier

2 — High-frequency agent need

Labels

enhancement (New feature or request)
