Improve PDF text extraction in get_md #2

## Problem

`get_md` in `_tools.py` uses a hack to force pdf.js to render all pages: it presses 'n' (next page) a fixed number of times. This has two weaknesses:

1. By the time navigation reaches the last page, pdf.js may have virtualized the early pages (removed their content from the DOM), so the `#viewer` HTML may contain only the last few pages.
2. The current simple fix reads the page count from `PDFViewerApplication.pagesCount` and caps it at 50 pages, but it still relies on DOM rendering.

## Better Fix

Use pdf.js's JavaScript API to extract text directly from the PDF data layer, bypassing DOM rendering entirely:

```python
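# Raw string (r"""): keeps '\n\n' as a JS escape sequence. In a plain Python
# string it would become a literal line break inside the JS string literal,
# which is a JS syntax error.
text_content = await frame.page.evaluate(r"""
    async () => {
        // Read from the parsed document, not the rendered DOM.
        const pdf = PDFViewerApplication.pdfDocument;
        const pages = pdf.numPages;
        let text = '';
        for (let i = 1; i <= pages; i++) {  // pdf.js pages are 1-indexed
            const page = await pdf.getPage(i);
            const content = await page.getTextContent();
            text += content.items.map(item => item.str).join(' ') + '\n\n';
        }
        return text;
    }
""")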
```

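Because the text comes from the parsed document rather than the DOM, page virtualization no longer matters, and extraction should be faster and more reliable than simulating key presses.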

## Notes
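- Need to handle non-PDF pages gracefully (fall back to the current body extraction)
- Should wait for `PDFViewerApplication.initialized` before calling; `PDFViewerApplication.pdfDocument` is also null until a document has actually loaded, so check that too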
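
A rough sketch of how the two notes above could fit together, assuming `page` is a Playwright-style object with async `wait_for_function` and `evaluate` methods. The names `extract_pdf_text_or_fallback`, `fallback`, and `timeout_ms` are hypothetical and not taken from `_tools.py`; `fallback` stands in for whatever body extraction `get_md` does today.

```python
import asyncio

# JS predicate: true once the pdf.js viewer exists, is initialized,
# and has a document loaded.
PDF_READY_JS = """
() => typeof PDFViewerApplication !== 'undefined'
      && PDFViewerApplication.initialized
      && !!PDFViewerApplication.pdfDocument
"""

# Raw string so that '\n\n' reaches the browser as a JS escape sequence.
EXTRACT_JS = r"""
async () => {
    const pdf = PDFViewerApplication.pdfDocument;
    let text = '';
    for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const content = await page.getTextContent();
        text += content.items.map(item => item.str).join(' ') + '\n\n';
    }
    return text;
}
"""


async def extract_pdf_text_or_fallback(page, fallback, timeout_ms=5000):
    """Return PDF text via the pdf.js API, or fallback(page) for non-PDF pages.

    `fallback` is a hypothetical async callable wrapping the existing
    body extraction.
    """
    try:
        # Wait until the viewer is ready; on a non-PDF page this times out.
        await page.wait_for_function(PDF_READY_JS, timeout=timeout_ms)
    except Exception:
        # Assumption: any failure here means "not a pdf.js viewer".
        # A real implementation would likely catch only the timeout error.
        return await fallback(page)
    return await page.evaluate(EXTRACT_JS)
```

The timeout value is a guess. A short timeout keeps non-PDF pages fast but may misclassify a slow-loading PDF, so it would need tuning against real pages.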
