Open
36 commits
1415037
feat: integrate patchright and discovery rate limiting for cloudflare…
google-labs-jules[bot] Mar 15, 2026
57db19d
feat: integrate patchright and discovery rate limiting for cloudflare…
google-labs-jules[bot] Mar 15, 2026
9ffbb26
feat: advanced cloudflare bypass with patchright and strategy pattern
google-labs-jules[bot] Mar 15, 2026
3a711e2
refactor: ran formatter
simwai Mar 15, 2026
4c27f44
feat: advanced stealth scraping with strategy pattern and cloudflare …
google-labs-jules[bot] Mar 15, 2026
8bd918c
feat: robust cloudflare bypass with strategy-based scraping
google-labs-jules[bot] Mar 15, 2026
560a89e
feat: advanced cloudflare bypass with strategy-based scraping
google-labs-jules[bot] Mar 15, 2026
bdf66a7
feat: ultimate stealth and behavioral cloudflare bypass
google-labs-jules[bot] Mar 15, 2026
af2df46
feat: ultimate vision-based cloudflare bypass and strategy pattern
google-labs-jules[bot] Mar 15, 2026
0ddbc1c
feat: dual-model vision bypass and system requirement checks
google-labs-jules[bot] Mar 15, 2026
091cb5b
feat: ultimate stealth scraping with vision-based bypass and system h…
google-labs-jules[bot] Mar 15, 2026
c80877b
fix: browser initialization and vision model integration
google-labs-jules[bot] Mar 15, 2026
16db160
fix: self-recovering vision bypass and fast-fail logic
google-labs-jules[bot] Mar 15, 2026
5053ef6
feat: automatic model pulling via ollama CLI
google-labs-jules[bot] Mar 15, 2026
2c86635
feat: Switched to got-scraping and improved AI-driven bypass logic
simwai Mar 15, 2026
179daff
feat: dual AI providers and distinct reasoning models
google-labs-jules[bot] Mar 15, 2026
0a81882
feat: vision-bypass precision tuning and AI stability
google-labs-jules[bot] Mar 15, 2026
53df955
feat: multi-provider stealth scraping with vision bypass
google-labs-jules[bot] Mar 15, 2026
956fae6
feat: update default port and headless settings
google-labs-jules[bot] Mar 15, 2026
8c308c1
feat: enhanced captcha bypass sequence logging
google-labs-jules[bot] Mar 15, 2026
5825313
feat: primary structural turnstile bypass with vision fallback
google-labs-jules[bot] Mar 16, 2026
ef6429b
feat: update default LLM models for RAG and vision
google-labs-jules[bot] Mar 16, 2026
53fca26
feat: ultimate stealth scraping with structural turnstile bypass and …
google-labs-jules[bot] Mar 16, 2026
2d445d6
feat: ultimate stealth scraping with structural bypass and ghost-cursor
google-labs-jules[bot] Mar 16, 2026
7dfcfcf
fix: turnstile bypass robustness and vision coordinate extraction
google-labs-jules[bot] Mar 16, 2026
571f4ad
docs: cleanup contributing and readme to remove redundancy
google-labs-jules[bot] Mar 16, 2026
18f1156
fix: openrouter vision fallback via inline encoding
google-labs-jules[bot] Mar 16, 2026
7100b75
fix: openrouter vision API compliance and response handling
google-labs-jules[bot] Mar 16, 2026
0faefce
fix: OpenRouter API reliability and vision payload optimization
google-labs-jules[bot] Mar 16, 2026
258845d
feat: optimize vision payload via image downsampling
google-labs-jules[bot] Mar 16, 2026
de00262
feat: advanced prompt engineering and XML-tagging
google-labs-jules[bot] Mar 17, 2026
9c5f37d
feat: ultimate stealth scraping and turnstile bypass with style resto…
google-labs-jules[bot] Mar 17, 2026
34308fb
feat: implement visual action logging for captcha bypass
google-labs-jules[bot] Mar 18, 2026
554847b
fix: increase Turnstile strategy wait pacing to 6 seconds
google-labs-jules[bot] Mar 18, 2026
8b00165
fix: dynamic Turnstile interaction timeouts with random noise
google-labs-jules[bot] Mar 18, 2026
e4aa97c
fix: extend Turnstile interaction wait to 14 seconds
google-labs-jules[bot] Mar 19, 2026
27 changes: 20 additions & 7 deletions .env.example
@@ -2,19 +2,32 @@
AUTH_STORAGE_PATH=.storage/auth.json

# Scraping behavior
WAIT_MODE=fixed
RATE_LIMIT_MS=3000
PARALLEL_WORKERS=2
# DISCOVERY_MODE: api (fast), scroll (stealth), interaction (direct), ai (smart)
DISCOVERY_MODE=api
# EXTRACTION_MODE: api (fast), dom (classic), native (interaction-export), ai (smart-dom)
EXTRACTION_MODE=api
WAIT_MODE=dynamic
RATE_LIMIT_MS=1000
PARALLEL_WORKERS=5
CHECKPOINT_SAVE_INTERVAL=10

# Vector search
ENABLE_VECTOR_SEARCH=true

# AI services
GEMINI_API_KEY=
# LLM_SOURCE: 'ollama' or 'openrouter'
LLM_SOURCE=ollama
# LLM_RAG_MODEL: Model for text reasoning and RAG
LLM_RAG_MODEL=deepseek-r1:7b
# LLM_VISION_MODEL: Model for vision tasks and captcha bypass
LLM_VISION_MODEL=qwen3.5:4b
LLM_EMBED_MODEL=nomic-embed-text

# Ollama Specific
OLLAMA_URL=http://localhost:11435
OLLAMA_MODEL=deepseek-r1
OLLAMA_EMBED_MODEL=nomic-embed-text

# OpenRouter Specific
OPENROUTER_API_KEY=

# Paths
EXPORT_DIR=exports
@@ -23,4 +36,4 @@ VECTOR_INDEX_PATH=.storage/vector-index

# Browser behavior
# HEADLESS can be 'true', 'false', or 'new'
HEADLESS=true
HEADLESS=false
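
The new variables above could be validated at startup along these lines. This is a minimal sketch, not the project's loader: the variable names and defaults come from `.env.example`, while `ScraperConfig`, `parseConfig`, and the assumption that `WAIT_MODE` accepts only `fixed` and `dynamic` are hypothetical.

```typescript
// Hypothetical config loader. Variable names and defaults mirror .env.example;
// the ScraperConfig shape and the validation rules are assumptions.
type WaitMode = "fixed" | "dynamic";
type LlmSource = "ollama" | "openrouter";
type Env = Record<string, string | undefined>;

interface ScraperConfig {
  waitMode: WaitMode;
  rateLimitMs: number;
  parallelWorkers: number;
  llmSource: LlmSource;
  headless: boolean | "new";
}

// Accept `value` only if it is one of the allowed literals.
function oneOf<T extends string>(name: string, value: string, allowed: readonly T[]): T {
  if ((allowed as readonly string[]).includes(value)) return value as T;
  throw new Error(`${name} must be one of ${allowed.join(", ")}; got "${value}"`);
}

// Parse a strictly positive integer, rejecting values such as "abc" or "0".
function positiveInt(name: string, value: string): number {
  const n = Number(value);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer; got "${value}"`);
  }
  return n;
}

function parseConfig(env: Env): ScraperConfig {
  const headlessRaw = oneOf("HEADLESS", env.HEADLESS ?? "false", ["true", "false", "new"] as const);
  return {
    waitMode: oneOf("WAIT_MODE", env.WAIT_MODE ?? "dynamic", ["fixed", "dynamic"] as const),
    rateLimitMs: positiveInt("RATE_LIMIT_MS", env.RATE_LIMIT_MS ?? "1000"),
    parallelWorkers: positiveInt("PARALLEL_WORKERS", env.PARALLEL_WORKERS ?? "5"),
    llmSource: oneOf("LLM_SOURCE", env.LLM_SOURCE ?? "ollama", ["ollama", "openrouter"] as const),
    headless: headlessRaw === "new" ? "new" : headlessRaw === "true",
  };
}
```

Taking the environment as a plain record rather than reading `process.env` directly keeps the function easy to test; a caller would pass `process.env` in practice.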
147 changes: 65 additions & 82 deletions CONTRIBUTING.md
@@ -1,89 +1,72 @@
# Contributing to the Evolution of Perplexity History Export
# Contributing to Perplexity History Export

Welcome, seeker of organized intelligence. We are delighted that you've chosen to contribute your cognitive energy to this system. By refining this tool, we collectively enhance our ability to synthesize knowledge from our digital interactions.
We welcome contributions! To ensure a smooth development process and maintain high code quality, please follow these guidelines.

This project is a manifestation of structured data extraction and semantic synthesis. To maintain the integrity of its cognitive architecture, we follow a specific workflow.
## Development Environment Setup

---

## Prerequisites for Co-Creation

To effectively interact with the codebase, your local environment must support the following substrates:

- **Node.js 20+**: The fundamental runtime for our operations.
- **Ollama**: Essential for local embedding generation and RAG-based reasoning.
1. **Install Node.js**: Ensure you have Node.js 20+ installed.
2. **Install Ollama**:
- Download and install [Ollama](https://ollama.ai/).
- `ollama pull nomic-embed-text` (for semantic vectors)
- `ollama pull deepseek-r1` (for generative synthesis)
- **Playwright**: Our interface for navigating the complexities of the web.

---

## The Developmental Lifecycle

### 1. Initialization

Clone the repository and instantiate the dependencies:

```bash
npm install
npx playwright install chromium
```

### 2. Environment Configuration

Establish your local parameters:

```bash
cp .env.example .env
# Refine the variables to align with your local Ollama setup.
```

### 3. Iterative Development

Launch the interactive environment to observe the system in action:

```bash
npm run dev
```

### 4. Integrity Verification (Testing)

We adhere to a "Testing Trophy" philosophy, prioritizing integration tests that verify the emergent behavior of system components.

- **Unit Tests**: `npm run test:unit`
- **Integration Tests**: `npm run test:integration` (Uses MSW to simulate Ollama interactions)
- **End-to-End**: `npm run test:e2e`

Always ensure the full suite passes before proposing a merger:

```bash
npm run test
```

### 5. Syntactic Harmony (Formatting)

We utilize `oxlint` and `oxfmt` for rapid, high-performance code analysis and formatting. Maintain the aesthetic and structural consistency of the codebase:
- `ollama pull deepseek-r1:7b` (for generative synthesis)
- `ollama pull qwen3.5:4b` (for vision-based bypass)
3. **Install Dependencies**:
```bash
npm install
```
4. **Prepare Environment Variables**:
```bash
cp .env.example .env
```
5. **Install Playwright Browsers**:
```bash
npx playwright install chromium
```

## Development Workflow

- **Start in Dev Mode**:
```bash
# start dev
npm run dev
```
- **Type Checking**:
```bash
npm run type-check
```
- **Formatting & Linting**:
```bash
npm run format
```

## Commit Guidelines

We use [Conventional Commits](https://www.conventionalcommits.org/).

- `feat:` for new features.
- `fix:` for bug fixes.
- `docs:` for documentation changes.
- `chore:` for maintenance tasks.
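
A commit header can be checked against the four prefixes listed above with a small helper. This is a sketch under assumptions: `isConventionalHeader` and `ALLOWED_TYPES` are hypothetical names, and the check accepts only the types this guide lists, although the Conventional Commits specification itself permits others (for example `refactor`).

```typescript
// Hypothetical helper: accepts only the four prefixes listed in this guide.
const ALLOWED_TYPES = ["feat", "fix", "docs", "chore"] as const;

// type, optional "(scope)", optional "!", then ": " and a non-empty description.
const HEADER = /^([a-z]+)(\([\w.-]+\))?(!)?: (\S.*)$/;

function isConventionalHeader(header: string): boolean {
  const match = HEADER.exec(header);
  if (match === null) return false;
  return (ALLOWED_TYPES as readonly string[]).includes(match[1] ?? "");
}
```

Such a check could run in a `commit-msg` hook, passing the first line of the commit message to the function.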

## Testing Strategy

- **Unit Tests**: Place in `test/unit/`.
- **Integration Tests**: Place in `test/integration/`.
- **Run all tests**:
```bash
npm test
```

## Pull Request Process

1. Create a feature branch.
2. Ensure all tests pass.
3. Submit the PR with a clear description of the changes.

## Build Single Executable (SEA)

To build the standalone executable for your platform:

```bash
npm run format
npm run build:exe
```

---

## Proposing Cognitive Enhancements (PR Process)

1. **Fork and Branch**: Create a branch with a descriptive prefix:
- `feat/` for novel capabilities.
- `fix/` for rectifying systemic discrepancies (bugs).
- `docs/` for enhancing the conceptual clarity of our documentation.
2. **Commit with Intent**: Write clear, descriptive commit messages.
3. **Synergize**: Open a Pull Request. Provide a concise summary of the changes and how they contribute to the system's overall utility.

---

## Ethical and Intellectual Standards

- **Clarity over Complexity**: While our goals are ambitious, our code should remain a model of lucidity.
- **Robustness**: Build for resilience against the unpredictable nature of web interfaces and AI model outputs.

Together, we are building a more coherent interface between human inquiry and machine intelligence.
42 changes: 26 additions & 16 deletions README.md
@@ -6,7 +6,7 @@
<img src="https://img.shields.io/badge/Node.js-4c1d95?style=flat&logo=node.js&logoColor=white" alt="Node.js" />
<img src="https://img.shields.io/badge/TypeScript-5b21b6?style=flat&logo=typescript&logoColor=white" alt="TypeScript" />
<img src="https://img.shields.io/badge/Ollama-6d28d9?style=flat&logo=ollama&logoColor=white" alt="Ollama" />
<img src="https://img.shields.io/badge/Playwright-7c3aed?style=flat&logo=playwright&logoColor=white" alt="Playwright" />
<img src="https://img.shields.io/badge/Patchright-7c3aed?style=flat&logo=playwright&logoColor=white" alt="Patchright" />
<img src="https://img.shields.io/badge/Vitest-8b5cf6?style=flat&logo=vitest&logoColor=white" alt="Vitest" />
</p>

@@ -16,6 +16,7 @@

- [Introduction](#introduction)
- [Key Features](#key-features)
- [Stealth & Behavioral Resilience](#stealth--behavioral-resilience)
- [Environment Setup Guide](#environment-setup-guide)
* [1. Install Node.js (The Engine)](#1-install-nodejs-the-engine)
* [2. Install Ollama (The AI Intelligence)](#2-install-ollama-the-ai-intelligence)
@@ -39,13 +40,22 @@ This tool is designed to externalize your Perplexity.ai conversation history int

## Key Features

- **Parallelized Extraction**: Leverages Playwright to extract multiple conversation threads simultaneously for high-velocity data retrieval.
- **Parallelized Extraction**: Leverages worker pools to extract multiple conversation threads simultaneously for high-velocity data retrieval.
- **Architectural Resilience**: Automatically restores browser contexts and retries operations, ensuring continuity amidst environmental instability.
- **Advanced RAG (Retrieval-Augmented Generation)**: Engage in a cognitive dialogue with your history. The system employs intent analysis to synthesize broad summaries or pinpoint specific technical insights.
- **Semantic Vector Search**: Move beyond keyword matching. Locate information based on conceptual depth and semantic relevance.
- **Persistent State Tracking**: Frequent checkpoints allow the system to resume progress after any interruption.
- **Interactive Synthesis (REPL)**: A streamlined command-line interface for human-system synergy.
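
The worker-pool idea behind the parallelized extraction bullet can be sketched as follows. This is an illustration only: `runPool` is a hypothetical name, `concurrency` and `minGapMs` are assumed to play the roles of `PARALLEL_WORKERS` and `RATE_LIMIT_MS`, and the project's actual pool may be structured differently.

```typescript
// Hypothetical pool: `concurrency` and `minGapMs` stand in for PARALLEL_WORKERS
// and RATE_LIMIT_MS; the project's real implementation may differ.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function runPool<T, R>(
  items: readonly T[],
  concurrency: number,
  minGapMs: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two workers share an item
      results[index] = await task(items[index] as T);
      await sleep(minGapMs); // per-worker pause before claiming the next item
    }
  }

  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`, regardless of completion order
}
```

Results are written by index, so the output order matches the input order even when tasks finish out of sequence. A rejected task rejects the whole pool in this sketch; retry handling is left out.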

## Stealth & Behavioral Resilience

The scraper employs advanced behavioral modeling to achieve 1:1 parity with natural browsing, bypassing Cloudflare and Turnstile challenges:

- **Structural Interaction**: Targets the internal Turnstile widget structure directly, monitoring response tokens to ensure bypass integrity.
- **Vision-Based Fallback**: Captures snapshots and leverages AI reasoning to identify exact interaction coordinates if structural methods fail.
- **Ghost-Cursor Integration**: Utilizes `ghost-cursor` to generate authentic, non-linear mouse paths, making detection statistically improbable.
- **Session Reputation**: Establishes browser trust through "Session Warming" (visiting the home page and simulating browsing) before sensitive data access.

## Environment Setup Guide

If you are new to development or don't have the necessary tools installed, follow these steps to set up your environment.
@@ -72,10 +82,11 @@ We recommend using a version manager to install Node.js. This allows you to easi
### 2. Install Ollama (The AI Intelligence)

1. Download and install Ollama from [ollama.ai](https://ollama.ai).
2. Open your terminal and pull the required models:
2. The system will automatically pull the required models on first run, but you can also pull them manually:
```bash
ollama pull nomic-embed-text
ollama pull deepseek-r1
ollama pull deepseek-r1:7b
ollama pull qwen3.5:4b
```

### 3. Download and Prepare the Project
@@ -99,28 +110,27 @@ cp .env.example .env

### Key Environment Variables

- **OLLAMA_URL**: Access point for your local AI engine (default: http://localhost:11434).
- **OLLAMA_MODEL**: Cognitive model for RAG synthesis (e.g., deepseek-r1).
- **OLLAMA_EMBED_MODEL**: Model for generating vector representations (e.g., nomic-embed-text).
- **LLM_SOURCE**: Set to `ollama` (local) or `openrouter` (cloud).
- **LLM_RAG_MODEL**: Cognitive model for RAG synthesis (default: `deepseek-r1:7b`).
- **LLM_VISION_MODEL**: Model for vision-based security bypass (default: `qwen3.5:4b`).
- **ENABLE_VECTOR_SEARCH**: Set to `true` to activate semantic and RAG layers.
- **DISCOVERY_MODE** & **EXTRACTION_MODE**: Choose between `api`, `scroll`, `interaction`, and `ai`.
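
How `LLM_SOURCE` might select a provider can be sketched as below. `resolveProvider` and `ProviderSettings` are hypothetical, the Ollama default URL is taken from `.env.example`, and the OpenRouter base URL is the publicly documented one rather than something confirmed by this diff. Model identifiers generally differ between providers, so the Ollama-style defaults are unlikely to work unchanged with `openrouter`.

```typescript
// Hypothetical provider resolution; names and structure are assumptions.
type Env = Record<string, string | undefined>;

interface ProviderSettings {
  source: "ollama" | "openrouter";
  baseUrl: string;
  apiKey?: string;
  ragModel: string;
  visionModel: string;
}

function resolveProvider(env: Env): ProviderSettings {
  // Defaults mirror .env.example.
  const ragModel = env.LLM_RAG_MODEL ?? "deepseek-r1:7b";
  const visionModel = env.LLM_VISION_MODEL ?? "qwen3.5:4b";
  const source = env.LLM_SOURCE ?? "ollama";

  if (source === "ollama") {
    return { source, baseUrl: env.OLLAMA_URL ?? "http://localhost:11435", ragModel, visionModel };
  }
  if (source === "openrouter") {
    const apiKey = env.OPENROUTER_API_KEY;
    if (!apiKey) throw new Error("OPENROUTER_API_KEY is required when LLM_SOURCE=openrouter");
    return { source, baseUrl: "https://openrouter.ai/api/v1", apiKey, ragModel, visionModel };
  }
  throw new Error(`Unknown LLM_SOURCE "${source}"`);
}
```

Failing fast on a missing key surfaces a misconfigured `.env` at startup instead of at the first model call.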

## Usage Guide

Launch the system:

```bash
# Start the development environment
# Start system
npm run dev
```

**Note**: The system requires at least **10GB of free disk space** to operate safely with local AI models.
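
A pre-flight check for the disk-space requirement could look like the sketch below. It assumes that "10GB" means 10 GiB and uses Node's `statfs` from `node:fs/promises`; `hasEnoughSpace` and `checkDiskSpace` are hypothetical names, not functions from this project.

```typescript
import { statfs } from "node:fs/promises";

// Assumption: "10GB" is read as 10 GiB; the project may use another unit or threshold.
const REQUIRED_BYTES = 10 * 1024 ** 3;

// Pure comparison, kept separate so it can be tested without touching the filesystem.
function hasEnoughSpace(
  availableBlocks: number,
  blockSize: number,
  required: number = REQUIRED_BYTES,
): boolean {
  return availableBlocks * blockSize >= required;
}

// Queries the filesystem that contains `path` for blocks available to unprivileged users.
async function checkDiskSpace(path: string): Promise<boolean> {
  const stats = await statfs(path);
  return hasEnoughSpace(stats.bavail, stats.bsize);
}
```

A caller might run `await checkDiskSpace(".")` before pulling models and stop with a clear message when it returns `false`.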

### Operational Directives

- **Start scraper (Library)**: Initiates extraction. Authenticate manually if required.
- **Search conversations**: Interface with your history using various modes:
- **Auto**: Heuristic selection between semantic and exact search.
- **Semantic**: Fuzzy matching via high-dimensional vector space.
- **RAG**: Direct inquiry—e.g., "What did I learn about emergent intelligence?"
- **Exact**: Rapid string matching via ripgrep (bundled).
- **Search conversations**: Interface with your history using various modes (Auto, Semantic, RAG, Exact).
- **Build vector index**: Processes Markdown exports into a local vector store.
- **Reset all data**: Purges checkpoints, authentication data, and the vector index.
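
For the Semantic search mode, ranking by vector similarity can be illustrated with cosine similarity over embedded chunks. This is a generic sketch, not the project's Vectra-backed implementation; `IndexedChunk`, `cosineSimilarity`, and `topK` are hypothetical names.

```typescript
// Generic illustration of similarity ranking; not the project's actual search code.
interface IndexedChunk {
  id: string;
  vector: number[];
}

function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] as number;
    const y = b[i] as number;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

// Score every chunk against the query embedding and keep the k best matches.
function topK(
  query: readonly number[],
  chunks: readonly IndexedChunk[],
  k: number,
): { id: string; score: number }[] {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.vector) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k);
}
```

A linear scan like this is adequate for a personal history of a few thousand chunks; larger indexes usually rely on an approximate nearest-neighbour structure instead.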

@@ -140,11 +150,11 @@ For a detailed look at our RAG implementation, hybrid search strategy, and theor

### Project Structure

- **src/ai/**: Ollama interaction and advanced RAG orchestration layers.
- **src/scraper/**: Playwright-based extraction logic and parallel worker pool management.
- **src/ai/**: Provider management and advanced RAG orchestration layers.
- **src/scraper/**: Patchright-based extraction logic and parallel worker pool management.
- **src/search/**: Vector storage (Vectra) and ripgrep search implementation.
- **src/repl/**: Interactive CLI components.
- **src/utils/**: Shared utility functions for data chunking and logging.
- **src/utils/**: Shared utility functions for behavioral navigation and logging.

## Testing
