English | 简体中文
PDF/PPT/Image to Markdown OCR Converter
Batch convert documents to clean Markdown using ZhipuAI GLM-OCR API — the #1 OCR model on OmniDocBench v1.5 as of early 2026. Zero GPU required, LaTeX math support, concurrent processing, resume from breakpoint.
- Features
- Use Cases
- Why GLM?
- Quick Start
- Studio Prototype
- Output Structure
- Verification and Failure Semantics
- Common Failure Modes and Recommended Handling
- Recommended Workflow for Personal Knowledge Pipelines
- Reference-book Metadata and Downstream Note Workflow
- Clean Junk Images
- Configuration
- Requirements
- Known Limitations
- License
- Appendix: Detailed Benchmark Test Results
- Appendix: API Pricing Details
- PDF / PPT / PPTX -> Markdown: segment-based OCR with failure-safe fallback (PDF upload -> per-page image -> native PDF text fallback)
- Handwriting recognition: `--handwrite` mode uses the ZhipuAI handwriting OCR API for handwritten content
- Concurrent processing: when a file has multiple segments, processes 2 segments in parallel (GLM-OCR API max concurrency is 2)
- Long screenshot support: auto-splits tall images (e.g. chat screenshots) into overlapping segments (text OCR only, no image extraction)
- Resume from breakpoint: already-completed segments are skipped on re-run
- Image extraction: embedded images are saved to an `images/` subfolder
- Explicit failure reporting: segments that still fail after fallback are recorded in `_failed_segments/*.failed.json` and are never silently treated as complete
- Strict validation: `verify_ocr.py` checks directory/md/page coverage/failure reports, and `audit_ocr_integrity.py` catches legacy mixed-output problems
- Clean output tree: converted PPT PDFs are cached under `_cache/ppt_pdf/` instead of being mixed into `output/`
- Markdown cleanup at the OCR boundary: normalizes OCR math delimiters so `$ 2x+1 $` becomes `$2x+1$` before downstream note generation
- Suspicious math-command audit: can emit a report of suspicious commands still left after cleanup, so the remaining OCR edge cases are explicit
- Visual-block suppression: filters header/footer strips, page-number corners, logos, watermarks, and glyph-fragment crops before they enter `images/`
- Reference-book metadata: generates `目录页.md`, page-offset notes, QR resource aggregation, and textbook lookup helpers
- Manual image review UI: local reviewer for duplicate/similar OCR images, blacklist learning, and group-by-group deletion
- Junk image audit tools: exact-duplicate audit, grayscale similarity search, orphan-image maintenance, and targeted watermark purging
- No more fitz in the active PDF backend: the main pipeline now uses `pypdf` + `pypdfium2`, with `pypdf` for page structure/text and `pypdfium2` for rendering and bbox crops
- Math formula support: LaTeX output (`$...$` inline, `$$...$$` block)
- AI coding assistant support: includes `CLAUDE.md` and `AGENTS.md` for Claude Code, Codex, OpenCode, and OpenClaw
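The long-screenshot splitting above reduces to range arithmetic: cut the image into fixed-height strips and step back by the overlap so no text line is cut in half. A minimal sketch, using the `IMAGE_SEGMENT_HEIGHT` / `IMAGE_OVERLAP` defaults from the configuration table; `split_ranges` is a hypothetical helper name, not the actual function in `ocr.py`.

```python
def split_ranges(height, seg_height=4000, overlap=200):
    """Return (top, bottom) pixel ranges covering a tall image,
    with each adjacent pair overlapping by `overlap` pixels."""
    if height <= seg_height:
        return [(0, height)]
    ranges, top = [], 0
    while True:
        bottom = min(top + seg_height, height)
        ranges.append((top, bottom))
        if bottom == height:
            return ranges
        top = bottom - overlap  # step back to create the overlap

print(split_ranges(9000))  # [(0, 4000), (3800, 7800), (7600, 9000)]
```

The 200px overlap means a line of text split by one cut still appears whole in the neighboring segment; deduplicating the overlap text is then a post-processing concern.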
- Convert lecture slides / textbooks to Markdown for note-taking and RAG knowledge bases
- WeChat chat export via screenshots: taking long screenshots is a safer alternative to third-party export tools (no API abuse, no risk of ToS violations). OCR the screenshots into text, then feed to AI for personal assistant training
- Batch digitize scanned documents
- Extract text from web page screenshots
The OmniDocBench v1.5 benchmark (GitHub) is still one of the best document-parsing references, but there is an important 2026 caveat: not every “high score” is reported on the same scale.
- GLM-OCR 94.62: an official raw OmniDocBench v1.5 score
- Qianfan-OCR 93.12: an official model-card self-reported OmniDocBench v1.5 score
- PaddleOCR-VL-1.5 94.5: a paper-reported OmniDocBench v1.5 score
- dots.mocr 1124.7: an Elo aggregate score across multiple benchmarks from the project README
- dots.mocr-svg 0.931 / 0.905: SVG-specific task scores on UniSVG / ChartMimic
That means 94.62 and 1124.7 are not directly comparable. The first is a raw benchmark score; the second is an author-built Elo aggregate.
This is also why this software does not chase a single “one model does everything” answer:
- GLM-OCR remains the default daily document OCR path
- Qianfan-OCR is a strong cloud structured-OCR backend
- dots.mocr-svg is the most relevant path for turning charts, UI, and logos into editable SVG assets
Also, many people remember the older PaddleOCR toolkit line. The models compared here, PaddleOCR-VL and PaddleOCR-VL-1.5, belong to the newer VL document parsing branch that formed between 2025-10 and 2026-01, and should not be confused with the older product line.
Via ZhipuAI special deals, a 50-million-token package (enough to OCR ~60 textbooks at 300 pages each) is available at:
| Tier | Price | ~ USD | ~ EUR |
|---|---|---|---|
| Flash sale (1× per account) | ¥2.9 | $0.40 | €0.37 |
| Developer / Education (3× per account) | ¥8 | $1.10 | €1.02 |
| Standard (no package) | ¥10 | $1.38 | €1.28 |
Standard per-token rate: ¥0.2/M tokens (~$0.03 / €0.03), roughly 1/100 the cost of GPT-4o Vision.
Full pricing breakdown → Appendix: API Pricing
| Solution | Key Score / Metric | Score Type | Best For | Weaknesses | Deployment |
|---|---|---|---|---|---|
| GLM-OCR (ZhipuAI / 智谱) | 94.62 | Raw OmniDocBench v1.5 score (official) | Textbooks, lecture notes, formula-heavy documents, and the default path of this software | Does not directly produce SVG; weak editable-asset story | Cloud API / vLLM local |
| Qianfan-OCR (Baidu Qianfan / 百度千帆) | 93.12 | OmniDocBench v1.5 self-reported score (official model card) | Markdown plus JSON structured output and cloud parsing of complex layouts | More focused on structured text than SVG assets | Cloud API |
| PaddleOCR-VL-1.5 (Baidu / 百度) | 94.5 | OmniDocBench v1.5 self-reported score (paper) | Offline deployment, privacy-sensitive docs, strong structured parsing | Higher deployment cost; not ideal as a lightweight daily default | Local GPU / service deployment |
| dots.ocr (rednote-hilab) | 88.41 | Raw OmniDocBench v1.5 score | Markdown plus JSON plus bbox structure | Complex tables and math are not its strongest area | Local / vLLM |
| dots.mocr (rednote-hilab) | 1124.7 | Multi-benchmark Elo aggregate (official README) | Charts, UI, visual understanding, multimodal structured assets | Elo scores should not be mixed directly with raw OmniDocBench scores | Local / vLLM |
| dots.mocr-svg (rednote-hilab) | UniSVG 0.931 / ChartMimic 0.905 | SVG-specific task scores | Charts, UI, logos, and editable visual assets | Not a primary whole-document OCR model; better as a visual-asset pipeline stage | Local / vLLM / API wrapper |
For this product, the current best division of labor is:
- GLM-OCR as the default document OCR pipeline
- Qianfan-OCR as the cloud structured parsing backend
- dots.mocr-svg as the editable SVG asset pipeline
This project is designed for individual users (students, researchers) who don't need to process thousands of documents. Cloud API means zero GPU requirements, no CUDA setup, no model downloads — just pip install and go. Local deployment only makes sense for enterprises with dedicated GPU servers.
The script architecture is model-agnostic — swapping to a different API (DeepSeek, Unisound U1, etc.) only requires changing the API client and model name in ocr.py.
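Since the README claims the swap point is just the API client and model name, the idea can be sketched as a small backend table. All names and URLs below are placeholders invented for illustration; the real `ocr.py` internals may look different.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackendConfig:
    """Everything the pipeline needs to know about one OCR backend."""
    base_url: str   # API endpoint (placeholder values below, not real URLs)
    model: str      # provider's OCR model name (placeholder)
    env_key: str    # name of the env var holding the API key

BACKENDS = {
    "glm":   BackendConfig("https://example.invalid/glm/v4",   "glm-ocr-model",   "GLM_API_KEY"),
    "other": BackendConfig("https://example.invalid/other/v1", "other-ocr-model", "OTHER_API_KEY"),
}

def pick_backend(name: str) -> BackendConfig:
    """Swapping providers means editing the table, not the pipeline."""
    return BACKENDS[name]
```

The design point is that segmenting, retries, and verification stay provider-agnostic; only this table changes per provider.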
Register at ZhipuAI Open Platform and create an API key.
```bash
pip install -r requirements.txt
```

Create a `.env` file in the project root:

```
GLM_API_KEY=your_api_key_here
```
```bash
# Put PDF/PPT/images in the input/ directory
python ocr.py             # Standard OCR (printed text, formulas, etc.)
python ocr.py --handwrite # Handwriting recognition mode
# Markdown output lands in output/
```

Then validate the batch:

```bash
python verify_ocr.py
python audit_ocr_integrity.py
```

Only treat an OCR batch as complete when both checks are clean and the relevant `output/<file>/` directory has no `_failed_segments/*.failed.json`.
Detailed audit convention for non-GLM page supplementation: OCR_AUDIT_POLICY.md
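The acceptance rule above (clean checks plus no failure reports) can be spot-checked in a few lines. A minimal sketch assuming the output layout described later in this README; it is not a substitute for `verify_ocr.py`.

```python
from pathlib import Path

def batch_is_complete(out_dir):
    """True when the per-file output directory has at least one ranged
    .md segment and no _failed_segments/*.failed.json reports."""
    out = Path(out_dir)
    has_md = any(out.glob("*.md"))
    failed = list(out.glob("_failed_segments/*.failed.json"))
    return has_md and not failed
```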
The repository now includes an early desktop-shell prototype named GLM OCR Studio.
It is meant to address two practical problems first:
- OCR provider differences are hard to remember
- advanced manual tools are scattered across scripts
The current prototype already provides:
- bilingual UI switching
- a provider capability board for GLM-OCR, Qianfan-OCR, Qianfan-OCR-Fast, GLM handwriting, dots.ocr, dots.mocr, dots.mocr-svg, PaddleOCR-VL 1.5, and PP-StructureV3
- a visual memory board for the existing manual workflow and script entry points
- a staged refactor plan
Run it with:
```bash
python ocr_studio.py
```

Refactor plan documents:
- English: docs/STUDIO_REFACTOR_PLAN.md
- 简体中文: docs/STUDIO_REFACTOR_PLAN_CN.md
```
output/
└── filename/
    ├── filename_0001-0020.md   # Pages 1-20
    ├── filename_0021-0040.md   # Pages 21-40
    ├── ...
    ├── _failed_segments/       # Present only when some pages still failed after all fallbacks
    │   └── filename_0021-0040.failed.json
    └── images/                 # Extracted bbox images
        ├── p0003_fig0001.png
        └── ...
```
Converted PPT/PPTX PDFs are cached separately under `_cache/ppt_pdf/<filename>/<filename>.pdf` so `output/` stays focused on OCR artifacts only.
For textbooks and reference books, the OCR folder may additionally contain:
```
目录页.md
周边资源-全部.md         # only when QR resources are numerous
_front_assets/qrcodes/   # extracted QR code snapshots
```
```bash
python verify_ocr.py
python audit_ocr_integrity.py
```

- `verify_ocr.py` is the acceptance gate. It checks output directories, Markdown presence, page-range continuity, failed placeholders, and `_failed_segments/*.failed.json`.
- `audit_ocr_integrity.py` is the deeper audit. It highlights high-risk states such as legacy mixed output, coverage gaps, partial failed md files, and explicit failed-segment reports.
- `ocr.py` now treats "empty/failed markdown" as a real failure. If segment-level PDF upload is blocked or returns no meaningful content, it falls back to per-page images; if a page still fails, it tries native PDF text extraction; if that still fails, it writes a `.failed.json` report instead of pretending success.
- A visible failure report is intentional. It means the pipeline surfaced a real problem instead of silently producing incomplete OCR.
- Provider content filter (`1301 contentFilter`): some pages about security, attacks, or other sensitive topics may be blocked by ZhipuAI. Recommended handling:
  - Split the failed segment so unaffected pages can still be saved.
  - Retry the blocked page as a single-page segment.
  - If it still fails, use a secondary path such as Mathpix, another vision model, or an explicitly documented AI visual transcription from the rendered page image.
  - Mark the replacement clearly as non-GLM-OCR output.
- Legacy mixed output (`segment_*.md`, old page image dumps): older runs may leave mixed naming schemes or image-only remnants in `output/`. Recommended handling:
  - Run `audit_ocr_integrity.py`.
  - Do not delete old segments until the new ranged `.md` files are confirmed to cover the same pages.
  - Prefer rerunning only the affected range instead of rerunning the whole book.
- Garbled filenames from archive extraction / Windows encoding: ZIP/RAR extraction may produce mojibake file names. Recommended handling:
  - Rename the source files first.
  - Keep `input/`, `output/`, and downstream library names synchronized.
  - Save the rename mapping for later traceability.
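One common mojibake pattern (GBK filenames inside a ZIP decoded as CP437 by the extraction tool) can often be reversed mechanically. This is a hedged sketch of that single pattern only; the rename-first policy above remains the actual recommendation, since other encoding mismatches exist.

```python
def undo_cp437_gbk(name):
    """Reverse the classic mojibake where a GBK-encoded filename was
    decoded as CP437 by the archive tool. Returns None if the bytes
    do not round-trip, so callers can fall back to manual renaming."""
    try:
        return name.encode("cp437").decode("gbk")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
```

Plain-ASCII names pass through unchanged, so the helper is safe to run over a whole directory before deciding which files genuinely need manual renaming.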
- PPT conversion artifacts: PPT/PPTX must be converted to PDF before OCR. Recommended handling:
  - Keep converted PDFs in `_cache/ppt_pdf/`.
  - Do not treat cached PPT PDFs as OCR output.
  - Do not import cache files into downstream knowledge libraries.
- Editable-text fallback is not guaranteed: some scanned or image-only PDFs return no native text at all, so the PDF text fallback may be empty. Recommended handling:
  - Expect this on older scans and textbook page images.
  - Switch to image OCR or another vision path rather than assuming native text extraction will save the page.
- Keep a strict three-layer workflow:
  - source library
  - OCR intermediate library (paged md + images)
  - final knowledge base
- Do not move OCR output into the final knowledge base before `verify_ocr.py` and `audit_ocr_integrity.py` are both clean.
- For reference books, keep a directory page / page-offset note so future lookups can map book page numbers back to PDF page numbers.
- When one blocked page would invalidate a whole segment, split the segment first so recoverable pages are not lost.
- If you must replace one page through a non-GLM path, document it inside the target `.md` header so later auditing stays honest.
Full source-library -> OCR intermediate library -> Obsidian note workflow:

- `reference_book_metadata.py`: generate `目录页.md`, printed-page offsets, and QR resource aggregation from the original PDF
- `backfill_reference_book_directory_pages.py`: batch refresh textbook metadata across the intermediate library
- `rerun_pdf_segments.py`: rerun only failed or suspicious page ranges
- `markdown_cleanup.py` / `repair_math_delimiters.py`: fix OCR math delimiters and generate suspicious-command audit reports before the notes consume the output
- `duplicate_image_reviewer.py`: local UI for duplicate/similar image review, blacklist learning, and manual deletion
- `middle_library_image_maintenance.py`: audit orphan images and keep the OCR intermediate library tidy
This repo now supports not only OCR itself, but also the textbook metadata and handoff steps needed before writing structured course notes.
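The `$ 2x+1 $` -> `$2x+1$` normalization mentioned in the features boils down to trimming the whitespace just inside single-`$` delimiters. This regex sketch is an assumption about the approach, not the actual code in `repair_math_delimiters.py`; requiring padding on both sides keeps currency text like `$5 and $7` untouched.

```python
import re

# Match a single-$ inline span padded with whitespace inside the delimiters.
# The lookarounds ensure we never touch $$...$$ display blocks.
_PADDED_INLINE = re.compile(r"(?<!\$)\$\s+([^$]+?)\s+\$(?!\$)")

def tighten_inline_math(text):
    """Collapse '$ 2x+1 $' to '$2x+1$' while leaving $$...$$ alone."""
    return _PADDED_INLINE.sub(lambda m: f"${m.group(1)}$", text)

print(tighten_inline_math("so $ 2x+1 $ stays inline"))  # so $2x+1$ stays inline
```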
```bash
python clean_junk_images.py audit --root output --min-count 4 --copy-samples
python clean_junk_images.py similar --root output --reference path/to/sample.png --threshold 8
python clean_junk_images.py purge --root output --manifest confirmed_junk_manifest.txt
```

Use `python duplicate_image_reviewer.py --root output` for a local browser UI similar to phone-gallery duplicate-photo cleanup.
The old size-based cleaner still exists as an explicit compatibility mode:

```bash
python clean_junk_images.py legacy-size-clean --root output
```

Recommended order is:
- audit duplicate families
- review them manually in the UI
- learn or purge confirmed junk families
- keep ordinary unreferenced images under audit instead of blindly deleting them
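The grayscale similarity search with `--threshold 8` suggests a perceptual hash compared by Hamming distance. Whether `clean_junk_images.py` actually uses an average hash is an assumption; this sketch only illustrates the general technique on a pre-downscaled 8x8 grayscale grid.

```python
def average_hash(pixels):
    """64-bit average hash of an 8x8 grid of grayscale values (0-255):
    each bit records whether a cell is brighter than the grid mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (p > mean)
    return bits

def hamming(a, b):
    """Number of differing bits; images count as similar when this is
    at or below the CLI threshold (8 in the example command above)."""
    return bin(a ^ b).count("1")
```

Small visual changes (compression artifacts, a logo shifted a pixel) flip few bits, while unrelated images differ in roughly half of the 64 bits, which is why a cutoff near 8 separates near-duplicates from distinct images.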
| Parameter | Default | Description |
|---|---|---|
| `PAGES_PER_MD_PDF` | 20 | Pages per segment for PDF |
| `PAGES_PER_MD_PPT` | 50 | Pages per segment for PPT |
| `MAX_WORKERS` | 2 | Max concurrent API calls (GLM-OCR API limit is 2) |
| `IMAGE_SEGMENT_HEIGHT` | 4000px | Max height per segment for long images |
| `IMAGE_OVERLAP` | 200px | Overlap between adjacent image segments |
- Python 3.8+
- ZhipuAI API key
- PowerPoint or WPS Office (only for PPT/PPTX conversion, via COM automation)
- Hallucination: GLM may guess plausible values for blurry text instead of returning errors — avoid for financial/medical documents requiring exact precision
- Repetition bug: Occasionally loops on dense Excel screenshots, repeating the same row
- Provider content filters: some pages (for example security-related textbook content) may trigger ZhipuAI safety filtering. These pages now stay visible as failed reports instead of being silently counted as finished OCR.
- Handwriting: standard mode is weak on handwritten content; use `--handwrite` mode for handwritten text (or Claude/GPT for best quality). Handwriting mode outputs plain text only (no Markdown formatting), supports images only (PDFs are auto-converted to images), and costs ¥0.01/page
- PPT/PPTX files are converted to PDF in a background thread (parallel with OCR), using a single reusable PowerPoint COM instance
- Source files in `input/` are preserved after processing
MIT
Source: 5-model OCR benchmark (2026.02) by @AI创客空间
Test 1 — Math-heavy PDF (formulas & equations):
| Model | Result |
|---|---|
| GLM-OCR | Formula hierarchy perfectly restored, complete layout with chapter titles |
| PaddleOCR VL 1.5 | Zero errors, equivalent LaTeX notation |
| MinerU | Zero text errors, complete LaTeX structure |
| DeepSeek OCR2 | Formula symbols missing, content loss |
Test 2 — Complex magazine (images, blurry fonts, mixed layout):
| Model | Result |
|---|---|
| GLM-OCR | Only model to correctly identify all biological terms (e.g. "hemocyanin", "copper ions") |
| PaddleOCR VL 1.5 | Close but misread specialized terminology |
| MinerU | Many character errors on domain-specific terms |
| DeepSeek OCR2 | Text mostly correct but images discarded, page numbers lost |
Test 3 — Handwritten vertical Chinese calligraphy:
| Model | Result |
|---|---|
| PaddleOCR VL | Zero errors, all 10 lines perfectly recognized |
| GLM-OCR | Correct reading order, mostly accurate, but missed one character |
| MinerU | Correct order but weak on calligraphic forms |
| DeepSeek OCR2 | Completely wrong reading order |
Test 4 — Complex handwritten table (checkboxes, handwritten numbers):
| Model | Result |
|---|---|
| PaddleOCR VL 1.5 | Best overall — handwritten digits correct, checkboxes detected, clean structure |
| GLM-OCR | Handwritten digits all correct, table format correct, but header info lost |
| MinerU | Table recognition completely wrong |
| DeepSeek OCR2 | Zero info loss but table separated from header |
Source: ZhipuAI Special Deals · Standard Pricing (as of March 2026)
| Tier | Package | Price | Per M tokens | Limit |
|---|---|---|---|---|
| Flash sale | 50M tokens / 3 months | ¥2.9 ($0.40 / €0.37) | ¥0.058 | 1× per account |
| Developer | 50M tokens / 3 months | ¥8 ($1.10 / €1.02) | ¥0.16 | 3× per account |
| Education | 50M tokens / 3 months | ¥8 ($1.10 / €1.02) | ¥0.16 | 3× per account |
| Enterprise | 10B tokens / 4 months | ¥1,600 ($221 / €204) | ¥0.16 | 3× per account |
| Standard (no package) | Pay-as-you-go | — | ¥0.2 ($0.03 / €0.03) | Unlimited |
Cost estimate: ~2,500 tokens per page (image input + markdown output). A 300-page textbook ≈ 750K tokens ≈ ¥0.15 at standard rate.
Exchange rates: 1 USD ≈ 7.25 CNY, 1 EUR ≈ 7.85 CNY (approximate, March 2026)
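The per-book estimate above is simple arithmetic and can be reproduced in a few lines; the ~2,500 tokens/page figure and the standard ¥0.2/M rate are taken from this appendix.

```python
def ocr_cost_cny(pages, tokens_per_page=2500, cny_per_m_tokens=0.2):
    """Estimated pay-as-you-go OCR cost in CNY at the standard rate."""
    return pages * tokens_per_page / 1_000_000 * cny_per_m_tokens

# A 300-page textbook: 750K tokens, about ¥0.15 at the standard rate.
```

Swap in the flash-sale effective rate (¥0.058/M) via `cny_per_m_tokens` to see why the discounted packages make whole-library OCR runs essentially free for individual users.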