Problem
get_md in _tools.py uses a hack to force pdf.js to render all pages: it presses 'n' (next page) a fixed number of times. This has two weaknesses:
- After navigating through all pages, pdf.js may have virtualized out the early pages (removed them from the DOM), so the #viewer HTML at the end may contain only the last few pages.
- The current simple fix caps navigation at 50 pages and reads the page count from PDFViewerApplication.pagesCount, but it still relies on DOM rendering.
Better Fix
Use pdf.js's JavaScript API to extract text directly from the PDF data layer, bypassing DOM rendering entirely:
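# Raw string: keeps '\n\n' as a JavaScript escape. In a normal Python string it
# would become a literal line break inside the JS string literal (a syntax error).
text_content = await frame.page.evaluate(r"""
    async () => {
        // Read text from the parsed PDF document, not from the rendered DOM.
        const pdf = PDFViewerApplication.pdfDocument;
        const pages = pdf.numPages;
        let text = '';
        // pdf.js page numbers are 1-based.
        for (let i = 1; i <= pages; i++) {
            const page = await pdf.getPage(i);
            const content = await page.getTextContent();
            // Join the text items of one page; separate pages with a blank line.
            text += content.items.map(item => item.str).join(' ') + '\n\n';
        }
        return text;
    }
""")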
Because the text comes from the document object rather than the DOM, page virtualization no longer matters, and extraction should be faster and more reliable than keypress navigation.
Notes
- Need to handle non-PDF pages gracefully (fall back to current body extraction)
- Should wait for PDFViewerApplication.initialized before calling
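
The two notes above could be combined into one helper along these lines. This is a minimal sketch, not the code in _tools.py: extract_text, the fallback callback, the 10-second cap and the 100 ms polling interval are all assumptions made for illustration, and page is assumed to expose an async evaluate method, as frame.page does in the snippet above. The script returns null when PDFViewerApplication is missing or never becomes ready, so the caller can fall back to the existing body extraction.

```python
# Hypothetical helper; names and timeouts are illustrative assumptions.
# Raw string so that '\n\n' reaches the browser as a JavaScript escape.
EXTRACT_PDF_TEXT_JS = r"""
async () => {
    // Not a pdf.js viewer page: signal the caller to fall back.
    if (typeof PDFViewerApplication === 'undefined') return null;

    // Poll until the viewer is initialized and a document is loaded
    // (assumed cap of 10 seconds, checked every 100 ms).
    const deadline = Date.now() + 10000;
    while (!PDFViewerApplication.initialized || !PDFViewerApplication.pdfDocument) {
        if (Date.now() > deadline) return null;
        await new Promise(resolve => setTimeout(resolve, 100));
    }

    const pdf = PDFViewerApplication.pdfDocument;
    const parts = [];
    for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const content = await page.getTextContent();
        parts.push(content.items.map(item => item.str).join(' '));
    }
    return parts.join('\n\n');
}
"""


async def extract_text(page, fallback):
    """Return PDF text via pdf.js, or the result of `fallback(page)` otherwise.

    `page` must provide an async `evaluate(script)` method; `fallback` is an
    async callable standing in for the current body extraction.
    """
    try:
        text = await page.evaluate(EXTRACT_PDF_TEXT_JS)
    except Exception:
        # Treat any evaluation failure like a non-PDF page.
        text = None
    if text is None:
        return await fallback(page)
    return text
```

Returning null from the script, rather than throwing, keeps the non-PDF case a normal code path; the try/except only covers unexpected evaluation errors.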