newsdom-api is a small service-oriented Python application with a
thin FastAPI entrypoint and explicit separation between request
orchestration, MinerU process execution, and DOM normalization.

- `src/newsdom_api/main.py` exposes `/health` and `/parse` through FastAPI.
- `src/newsdom_api/service.py` orchestrates PDF parsing, temporary files, and response construction.
- `src/newsdom_api/mineru_runner.py` shells out to the MinerU CLI, collects JSON outputs, and translates runtime or incomplete-output failures into typed, sanitized exceptions.
- `src/newsdom_api/dom_builder.py` converts MinerU `content_list` blocks plus page model metadata into the canonical NewsDOM response model.
- `src/newsdom_api/schemas.py` defines the public response schema.
- `src/newsdom_api/synthetic.py` and `src/newsdom_api/equivalence.py` support synthetic fixture generation and structural comparisons.
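
The runner's responsibilities listed above (resolve the executable, collect JSON outputs, raise typed sanitized errors) might look roughly like the sketch below. It is illustrative only: the exception names, function names, and the choice to glob every `*.json` file are assumptions, not the actual contents of `mineru_runner.py`, and the MinerU invocation itself is left out.

```python
import json
import shutil
from pathlib import Path


class MinerURuntimeUnavailable(RuntimeError):
    """Hypothetical error: the MinerU executable could not be found."""


class MinerUIncompleteOutput(RuntimeError):
    """Hypothetical error: MinerU ran but usable JSON artifacts are missing."""


def resolve_executable(name: str) -> str:
    # shutil.which returns None when the name is not on PATH.
    path = shutil.which(name)
    if path is None:
        # Sanitized message: no filesystem paths or environment details leak out.
        raise MinerURuntimeUnavailable("MinerU runtime is not available")
    return path


def load_json_outputs(output_dir: Path) -> dict[str, object]:
    # Collect every JSON artifact in the output directory, keyed by file stem.
    # Which files MinerU actually writes is an assumption left open here.
    outputs: dict[str, object] = {}
    for artifact in sorted(output_dir.glob("*.json")):
        try:
            outputs[artifact.stem] = json.loads(artifact.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            raise MinerUIncompleteOutput("MinerU produced unreadable output") from None
    if not outputs:
        raise MinerUIncompleteOutput("MinerU produced no JSON output")
    return outputs
```

Keeping the error messages free of paths and command lines is what makes the exceptions safe to surface to API clients.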

- `src/newsdom_api/main.py` receives an uploaded PDF.
- `src/newsdom_api/service.py` writes the upload to a temporary workspace and calls MinerU.
- `src/newsdom_api/mineru_runner.py` resolves the executable, runs the OCR pipeline, loads generated JSON artifacts, and raises typed, sanitized errors for runtime-unavailable or incomplete-output cases.
- `src/newsdom_api/dom_builder.py` normalizes OCR blocks into the canonical response while preserving page-aware structure from MinerU model metadata.
- FastAPI returns typed JSON from `src/newsdom_api/schemas.py` and maps MinerU runtime failures to 503 and incomplete output to 502.
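
As a rough illustration of this flow, the sketch below writes an upload to a temporary workspace, calls a runner, and maps the two failure classes to 503 and 502. It uses only the standard library so it runs without FastAPI; `handle_parse`, the exception names, the `upload.pdf` file name, and the `detail` payloads are all hypothetical and do not come from the project's code.

```python
import tempfile
from pathlib import Path
from typing import Callable


class MinerURuntimeUnavailable(RuntimeError):
    """Hypothetical error: MinerU could not be started."""


class MinerUIncompleteOutput(RuntimeError):
    """Hypothetical error: MinerU finished without the expected artifacts."""


def handle_parse(
    pdf_bytes: bytes, run_mineru: Callable[[Path], dict]
) -> tuple[int, dict]:
    # The temporary workspace is deleted when the with-block exits,
    # including when the runner raises.
    with tempfile.TemporaryDirectory() as workspace:
        pdf_path = Path(workspace) / "upload.pdf"
        pdf_path.write_bytes(pdf_bytes)
        try:
            result = run_mineru(pdf_path)
        except MinerURuntimeUnavailable:
            # The dependency is down: service unavailable.
            return 503, {"detail": "MinerU runtime unavailable"}
        except MinerUIncompleteOutput:
            # The dependency answered badly: bad gateway.
            return 502, {"detail": "MinerU returned incomplete output"}
    return 200, result
```

In a real FastAPI application the same mapping would more likely live in exception handlers, so that the route function can simply return the response model.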

- `tests/fixtures` holds synthetic PDFs, JSON baselines, and provenance notes; private reference inputs stay out of git.
- `manual/` is the published user manual rendered by MkDocs.
- `.github/workflows/` encodes CI, security scanning, Pages, release, and image-delivery policy.
- `scripts/release/` builds release manifests and exports GitHub attestation bundles.

- `develop` is the integration line for normal feature, fix, and chore work.
- `main` is the stable release line that receives tagged releases.
- The service is production-grade only when code, docs, workflows, and release evidence agree.