feat: add structured archive export #17
Hello, thanks for the PR. 🙌 I like some of the ideas, but I don't really understand why the ITIR and structured JSON export is needed or what its purpose is. Maybe you want to give more insight there.
Hi mate, thanks for making the repo. Firstly, I had a few issues around sign-in detection (I forget exactly which now; I think around session detection/redirects), and I also have some stupidly long threads that the scraper reported as pulled completely when that was not the case. Regarding the JSON export, I figured it was better to just provide a generic interface; however, the canonical use case within my repos is to store the data in SQLite (see https://github.com/1ch1n/mychatarchive). I don't think I PR'd any ITIR integration per se; it's more that, again, ITIR canonically operates over that db (arbitrary text is also fine; dedupe is the priority re SQL)... Please let me know if this helps :)
Summary
This PR adds a structured archive layer for Perplexity exports before Markdown/vector indexing, and improves long-thread extraction beyond the initial captured API page.
The exporter now writes `itir.perplexity.thread.v1` JSON with normalized messages, stable source thread/message IDs, thread metadata, and captured API provenance. Markdown export remains available as an optional sidecar, and the existing vector/RAG flow can still be enabled when Markdown sidecars are present.
Why
The current Markdown-first flow is useful for reading and local vector search, but it makes dedupe, reimport, pagination validation, and downstream archive integrations harder than they need to be. A canonical thread/message export gives SQLite/MyChatArchive/ITIR-style tools a stable source of truth, while Markdown and vector indexes can be regenerated from that canonical layer.
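As a rough illustration of why stable IDs make dedupe and reimport easier, here is a minimal sketch of a canonical thread document and an idempotent merge. Only the schema name `itir.perplexity.thread.v1` comes from this PR; every field name and the `mergeThreads` helper are hypothetical and may not match the actual export.

```typescript
// Hypothetical shape for an `itir.perplexity.thread.v1` document.
// Field names are illustrative assumptions, not the PR's actual schema.
interface ArchivedMessage {
  sourceMessageId: string; // stable ID taken from the source thread
  role: "user" | "assistant";
  text: string;
}

interface ArchivedThread {
  schema: "itir.perplexity.thread.v1";
  sourceThreadId: string;
  title: string;
  messages: ArchivedMessage[];
  provenance: { capturedFrom: string; capturedAt: string };
}

// Merge a re-imported copy of a thread into an existing one,
// keeping each message exactly once, keyed by its stable source ID.
function mergeThreads(existing: ArchivedThread, incoming: ArchivedThread): ArchivedThread {
  if (existing.sourceThreadId !== incoming.sourceThreadId) {
    throw new Error("cannot merge different threads");
  }
  const seen = new Set(existing.messages.map((m) => m.sourceMessageId));
  const added = incoming.messages.filter((m) => {
    if (seen.has(m.sourceMessageId)) return false; // duplicate, skip
    seen.add(m.sourceMessageId);
    return true;
  });
  return { ...existing, messages: [...existing.messages, ...added] };
}
```

Because the merge is keyed on IDs rather than on rendered text, running the same import twice leaves the thread unchanged, which is much harder to guarantee when Markdown is the source of truth.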
A major practical issue is long Perplexity threads. The first browser-captured `/rest/thread/<id>` response can contain only the first page of entries. Without structured IDs and pagination checks, it is hard to know whether a run captured the whole thread, only page one, or duplicated page one repeatedly.
Pagination / long-thread behavior
This PR changes thread extraction so it does not stop at the first captured thread API response when Perplexity reports more pages:
- `/rest/thread/<thread-id>` response for the conversation being exported
- `next_cursor` from page context with authenticated browser cookies
That last point is important: this does not pretend to bypass Perplexity/Cloudflare or guarantee all private webapp pagination works forever. It makes successful pagination useful, and failed/replayed pagination detectable and safe.
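One way to picture "detectable and safe" pagination is the sketch below. It is not the PR's implementation: the page shape, the `collectThread` name, the status labels, and the injected `fetchPage` callback are all assumptions, with `next_cursor` being the only name taken from the description. The fetcher is passed in so the loop itself makes no claims about how the authenticated request is issued.

```typescript
// Hypothetical page shape; only `next_cursor` is named in the PR text.
interface ThreadEntry {
  id: string;
  text: string;
}

interface ThreadPage {
  entries: ThreadEntry[];
  next_cursor: string | null;
}

// Caller supplies the fetcher (for example, a request made from page context).
type FetchPage = (cursor: string | null) => Promise<ThreadPage>;

interface PaginationResult {
  entries: ThreadEntry[];
  status: "complete" | "replayed" | "incomplete";
  pagesFetched: number;
}

async function collectThread(fetchPage: FetchPage, maxPages = 50): Promise<PaginationResult> {
  const seenIds = new Set<string>();
  const seenCursors = new Set<string>();
  const entries: ThreadEntry[] = [];
  let cursor: string | null = null;
  let pagesFetched = 0;

  while (pagesFetched < maxPages) {
    // A rejected fetch ends the run but keeps what was already collected.
    const page = await fetchPage(cursor).catch(() => null);
    if (page === null) return { entries, status: "incomplete", pagesFetched };
    pagesFetched++;

    // A non-empty page with no unseen IDs means the server replayed old data.
    const fresh = page.entries.filter((e) => !seenIds.has(e.id));
    if (page.entries.length > 0 && fresh.length === 0) {
      return { entries, status: "replayed", pagesFetched };
    }
    for (const e of fresh) {
      seenIds.add(e.id);
      entries.push(e);
    }

    if (page.next_cursor === null) return { entries, status: "complete", pagesFetched };
    if (seenCursors.has(page.next_cursor)) return { entries, status: "replayed", pagesFetched };
    seenCursors.add(page.next_cursor);
    cursor = page.next_cursor;
  }
  // Page cap reached without seeing the end of the thread.
  return { entries, status: "incomplete", pagesFetched };
}
```

The point of returning a status alongside the entries is that a partial or replayed run is recorded as such instead of being reported as a full capture.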
For cases where Perplexity's private API still only yields the first page but the UI download contains more content, this PR also adds a downloaded Markdown recovery path. `npm run bundle:perplexity-downloads` converts downloaded Markdown chunks into the same structured JSON shape so the recovered turns can attach to the same canonical thread downstream.
Changes
- `EXPORT_STRUCTURED_JSON=true`.
- `EXPORT_MARKDOWN=true`.
- `bundle:perplexity-downloads` for converting downloaded Markdown chunks into the same structured JSON shape.
Validation
- `npm run type-check`
- `npm run test:unit` (25 tests passed)
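For readers wondering how the two export flags listed under Changes might interact, here is a small sketch. The flag names `EXPORT_STRUCTURED_JSON` and `EXPORT_MARKDOWN` come from this PR; the `resolveExportOptions` helper, the defaults, and the `wantVectorIndex` parameter are assumptions made for illustration. It encodes the rule from the summary that the vector/RAG flow only runs when Markdown sidecars are present.

```typescript
interface ExportOptions {
  structuredJson: boolean;
  markdown: boolean;
  vectorIndex: boolean;
}

// Defaults are assumptions for illustration, not the project's documented behavior.
function resolveExportOptions(
  env: Record<string, string | undefined>,
  wantVectorIndex: boolean, // hypothetical caller preference
): ExportOptions {
  const flag = (name: string, fallback: boolean): boolean => {
    const raw = env[name];
    return raw === undefined ? fallback : raw.trim().toLowerCase() === "true";
  };
  const markdown = flag("EXPORT_MARKDOWN", false);
  return {
    structuredJson: flag("EXPORT_STRUCTURED_JSON", true),
    markdown,
    // Vector indexing needs Markdown sidecars, so it is gated on that flag.
    vectorIndex: wantVectorIndex && markdown,
  };
}
```

In a Node script this could be called as `resolveExportOptions(process.env, true)`.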