feat: add structured archive export by chboishabba · Pull Request #17 · simwai/perplexity-ai-export

chboishabba · 2026-05-15T13:17:06Z

Summary

This PR adds a structured archive layer for Perplexity exports before Markdown/vector indexing, and improves long-thread extraction beyond the initial captured API page.

The exporter now writes itir.perplexity.thread.v1 JSON with normalized messages, stable source thread/message IDs, thread metadata, and captured API provenance. Markdown export remains available as an optional sidecar, and the existing vector/RAG flow can still be enabled when Markdown sidecars are present.
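As a rough illustration of what one exported record might look like, here is a minimal TypeScript sketch. Only the schema identifier itir.perplexity.thread.v1 comes from this description; every field name, the buildThread helper, and the sample values are assumptions for illustration and may not match the actual export.

```typescript
// Hypothetical shape: only the schema identifier comes from the PR description;
// every field name below is an assumption made for illustration.
interface StructuredMessage {
  id: string; // stable source message ID
  role: "user" | "assistant";
  text: string;
}

interface StructuredThread {
  schema: "itir.perplexity.thread.v1";
  threadId: string; // stable source thread ID
  title: string | null; // thread metadata
  messages: StructuredMessage[];
  provenance: {
    capturedAt: string; // ISO 8601 timestamp supplied by the caller
    rawEntries: unknown[]; // raw API entries kept for later auditing
  };
}

// Hypothetical helper that assembles one thread record.
function buildThread(
  threadId: string,
  title: string | null,
  messages: StructuredMessage[],
  rawEntries: unknown[],
  capturedAt: string,
): StructuredThread {
  return {
    schema: "itir.perplexity.thread.v1",
    threadId,
    title,
    messages,
    provenance: { capturedAt, rawEntries },
  };
}

const example = buildThread(
  "thread-123",
  "Example thread",
  [
    { id: "m1", role: "user", text: "What is RAG?" },
    { id: "m2", role: "assistant", text: "Retrieval-augmented generation." },
  ],
  [{ raw: "entry" }],
  "2026-05-15T13:17:06Z",
);

console.log(JSON.stringify(example, null, 2));
```

Keeping the raw entries next to the normalized messages lets a downstream tool re-derive or audit the normalized view without re-scraping.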

Why

The current Markdown-first flow is useful for reading and local vector search, but it makes dedupe, reimport, pagination validation, and downstream archive integrations harder than they need to be. A canonical thread/message export gives SQLite/MyChatArchive/ITIR-style tools a stable source of truth, while Markdown and vector indexes can be regenerated from that canonical layer.

A major practical issue is long Perplexity threads. The first browser-captured /rest/thread/<id> response can contain only the first page of entries. Without structured IDs and pagination checks, it is hard to know whether a run captured the whole thread, only page one, or duplicated page one repeatedly.

Pagination / long-thread behavior

This PR changes thread extraction so it does not stop at the first captured thread API response when Perplexity reports more pages:

- captures the exact /rest/thread/<thread-id> response for the conversation being exported
- follows next_cursor from page context with authenticated browser cookies
- appends additional entries when Perplexity returns genuinely new pages
- tracks entry identities across pages
- stops pagination if Perplexity replays only duplicate entries, instead of inflating the archive with repeated page-one content
- keeps the raw API response/entries in the structured export so downstream tools can audit what was captured

That last point is important: this does not pretend to bypass Perplexity/Cloudflare or guarantee all private webapp pagination works forever. It makes successful pagination useful, and failed/replayed pagination detectable and safe.
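The pagination behavior listed above can be sketched roughly as follows. This is not the code from the PR: the Entry and Page shapes, the fetchPage callback, and the maxPages cap are all assumptions, and the real response fields may differ. The sketch only shows the idea of tracking entry identities and stopping when a later page adds nothing new.

```typescript
// Assumed shapes: the real Perplexity response fields are not documented here.
interface Entry {
  id: string; // hypothetical stable entry identity
}

interface Page {
  entries: Entry[];
  nextCursor: string | null; // stands in for next_cursor
}

// The caller supplies the authenticated fetch; this sketch does no network I/O.
type FetchPage = (cursor: string) => Promise<Page>;

interface CollectResult {
  entries: Entry[];
  stoppedOnReplay: boolean;
}

async function collectEntries(
  firstPage: Page,
  fetchPage: FetchPage,
  maxPages = 100, // hypothetical safety cap against endless cursors
): Promise<CollectResult> {
  const seen = new Set<string>();
  const entries: Entry[] = [];
  let page = firstPage;

  for (let pageIndex = 0; pageIndex < maxPages; pageIndex++) {
    let added = 0;
    for (const entry of page.entries) {
      if (seen.has(entry.id)) continue; // already captured on an earlier page
      seen.add(entry.id);
      entries.push(entry);
      added++;
    }
    // A later page that adds nothing new is treated as a replay.
    if (pageIndex > 0 && added === 0) {
      return { entries, stoppedOnReplay: true };
    }
    if (page.nextCursor === null) break;
    page = await fetchPage(page.nextCursor);
  }
  return { entries, stoppedOnReplay: false };
}
```

Returning stoppedOnReplay alongside the entries lets the caller record that a run ended on a replayed page rather than on a clean final page.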

For cases where Perplexity's private API still only yields the first page but the UI download contains more content, this PR also adds a downloaded Markdown recovery path. npm run bundle:perplexity-downloads converts downloaded Markdown chunks into the same structured JSON shape so the recovered turns can attach to the same canonical thread downstream.
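How recovered turns are matched to an existing thread is not spelled out here, so the following is only one possible approach, sketched under stated assumptions: messages are compared by role plus whitespace- and case-normalized text, and anything not already present is appended. The Message shape and the attachRecovered, messageKey, and normalize helpers are hypothetical names.

```typescript
// Hypothetical message shape shared by API-captured and Markdown-recovered turns.
interface Message {
  id: string;
  role: "user" | "assistant";
  text: string;
}

// Collapse whitespace and case so trivial formatting differences do not defeat matching.
function normalize(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

function messageKey(message: Message): string {
  return `${message.role}:${normalize(message.text)}`;
}

// Append recovered messages that are not already present in the captured thread.
function attachRecovered(existing: Message[], recovered: Message[]): Message[] {
  const known = new Set(existing.map(messageKey));
  const merged = [...existing];
  for (const message of recovered) {
    const key = messageKey(message);
    if (known.has(key)) continue; // same turn already captured via the API
    known.add(key);
    merged.push(message);
  }
  return merged;
}
```

One limitation of matching on text alone: a user who sends the same short prompt twice in one thread would have the second copy dropped, so a real implementation would likely need position or ID information as well.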

Changes

- Adds structured JSON export by default via EXPORT_STRUCTURED_JSON=true.
- Makes Markdown export optional via EXPORT_MARKDOWN=true.
- Adds normalized user/assistant message extraction from Perplexity API entries.
- Adds authenticated cursor pagination for thread detail responses.
- Adds duplicate-page detection so replayed first pages do not create fake extra messages.
- Adds bundle:perplexity-downloads for converting downloaded Markdown chunks into the same structured JSON shape.
- Defaults to headful browser mode because Cloudflare/Turnstile makes headless unreliable for this flow.
- Adds unit coverage for config, structured file writes, pagination, duplicate-page replay, filename truncation, and downloaded Markdown bundling.
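To make the configuration defaults above concrete, here is a minimal sketch of how such flags could be read. The names EXPORT_STRUCTURED_JSON and EXPORT_MARKDOWN come from the list above; the HEADLESS variable name, the Markdown default of false, the accepted truthy values, and the loadConfig and parseBool helpers are assumptions, not the PR's actual code.

```typescript
interface ExportConfig {
  structuredJson: boolean;
  markdown: boolean;
  headless: boolean;
}

// Treat "true" and "1" as enabled; missing or blank values fall back to the default.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") return fallback;
  const normalized = value.trim().toLowerCase();
  return normalized === "true" || normalized === "1";
}

function loadConfig(env: Record<string, string | undefined>): ExportConfig {
  return {
    structuredJson: parseBool(env.EXPORT_STRUCTURED_JSON, true), // on by default
    markdown: parseBool(env.EXPORT_MARKDOWN, false), // default of false is assumed
    headless: parseBool(env.HEADLESS, false), // HEADLESS is a hypothetical name; headful by default
  };
}
```

In a Node.js script this could be called as loadConfig(process.env). Taking the environment as a parameter keeps the function easy to unit test.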

Validation

npm run type-check
npm run test:unit (25 tests passed)

simwai · 2026-05-27T08:52:37Z

Hello, thanks for the PR. 🙌

I like some of the ideas, but I don't really understand why the ITIR and structured JSON export is needed or what its purpose is. Maybe you want to give more insight there.

chboishabba · 2026-05-29T01:42:55Z

Hi mate, thanks for making the repo.

Firstly I had a few issues around sign-in detection (I forget exactly which now - I think around session detection/redirects), and then I also have some stupidly long threads which the scraper reported as having pulled completely, when this was not the case.

Regarding the JSON export, I figured it was better to just provide a generic interface. However, the canonical use case within my repos is storing the export in SQLite (see https://github.com/1ch1n/mychatarchive).

I don't think I PR'd any ITIR integration per se. It's more that, again, ITIR canonically operates over that db (arbitrary text is also fine; dedupe is the priority re SQL)...

Please let me know if this helps :)

feat: add structured archive export

89fffa0

chboishabba wants to merge 1 commit into simwai:master from chboishabba:codex/structured-sqlite-export
