Improve PDF text extraction in get_md #2

@g-eoj

Problem

get_md in _tools.py relies on a hack to force pdf.js to render every page: it presses 'n' (next page) a fixed number of times. This has two weaknesses:

  1. By the time all pages have been stepped through, pdf.js may have virtualized out the early ones (removed them from the DOM), so the #viewer HTML read at the end may contain only the last few pages.
  2. The current simple fix caps navigation at 50 pages and reads the page count from PDFViewerApplication.pagesCount, but it still relies on DOM rendering.

Better Fix

Use pdf.js's JavaScript API to extract text directly from the PDF data layer, bypassing DOM rendering entirely:

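# Raw string, so that '\n\n' reaches the browser as a JavaScript escape
# instead of as literal line breaks inside a JS string literal (a syntax error).
text_content = await frame.page.evaluate(r"""
    async () => {
        // Read text from the PDF data layer; no page has to be rendered.
        const pdf = PDFViewerApplication.pdfDocument;
        const pages = pdf.numPages;
        let text = '';
        for (let i = 1; i <= pages; i++) {  // pdf.js page numbers start at 1
            const page = await pdf.getPage(i);
            const content = await page.getTextContent();
            text += content.items.map(item => item.str).join(' ') + '\n\n';
        }
        return text;
    }
""")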

This sidesteps the virtualization problem and should be faster and more reliable, since no page has to be rendered.
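
The join(' ') step above flattens each page into a single line. If line structure matters, one option is to return the raw items from the page and join them in Python instead. This is a minimal sketch, assuming each item comes back as a dict carrying the pdf.js TextItem fields str and hasEOL; join_text_items is a hypothetical helper name, not something that exists in _tools.py.

```python
def join_text_items(items):
    """Join pdf.js text items, breaking the line after items flagged hasEOL.

    Assumption: `items` is a list of dicts such as
    {"str": "Hello", "hasEOL": False}, as returned from the browser.
    """
    parts = []
    for item in items:
        parts.append(item.get("str", ""))
        # hasEOL marks the end of a visual line; otherwise separate with a space.
        parts.append("\n" if item.get("hasEOL") else " ")
    return "".join(parts).strip()
```

On the JavaScript side this would mean returning something like content.items.map(i => ({str: i.str, hasEOL: i.hasEOL})) for each page rather than a pre-joined string.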

Notes

  • Handle non-PDF pages gracefully (fall back to the current body extraction).
  • Wait for PDFViewerApplication.initialized before calling.
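
The two notes above could be combined roughly as follows. This is a minimal sketch, assuming a Playwright-style async page object. get_text, IS_PDF_JS, PDF_READY_JS, PDF_TEXT_JS, and the timeout value are hypothetical, and inner_text("body") only stands in for the existing body extraction in get_md, whose actual code is not shown in this issue.

```python
# Hypothetical sketch: the names and the fallback are assumptions, not code from _tools.py.

# True only when the page exposes the pdf.js viewer application object.
IS_PDF_JS = "() => typeof window.PDFViewerApplication !== 'undefined'"

# Truthy once the viewer is initialized and a document has been loaded.
PDF_READY_JS = (
    "() => PDFViewerApplication.initialized"
    " && !!PDFViewerApplication.pdfDocument"
)

# Raw string so the JavaScript '\n\n' escape is passed through unchanged.
PDF_TEXT_JS = r"""
async () => {
    const pdf = PDFViewerApplication.pdfDocument;
    const parts = [];
    for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const content = await page.getTextContent();
        parts.push(content.items.map(item => item.str).join(' '));
    }
    return parts.join('\n\n');
}
"""


async def get_text(page, timeout_ms=10_000):
    """Return PDF text via pdf.js, or the body text when the page is not a PDF."""
    if not await page.evaluate(IS_PDF_JS):
        # Stand-in for the current (non-PDF) body extraction.
        return await page.inner_text("body")
    # Block until the viewer reports it is ready, then read the data layer.
    await page.wait_for_function(PDF_READY_JS, timeout=timeout_ms)
    return await page.evaluate(PDF_TEXT_JS)
```

Inside get_md this might be called as text_content = await get_text(frame.page). Checking for PDFViewerApplication first keeps ordinary HTML pages on the existing path, so the pdf.js call is never attempted where the object does not exist.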

Labels: enhancement