diff --git a/CRITIQUE.MD b/CRITIQUE.MD new file mode 100644 index 0000000..4de8637 --- /dev/null +++ b/CRITIQUE.MD @@ -0,0 +1,814 @@ +# Architectural Review: src/search/search-orchestrator.ts + +The `SearchOrchestrator` serves as the cognitive nexus for query resolution, balancing the brute force of Ripgrep with the semantic nuance of vector embeddings. While its structure is largely sound, we observe a slight entanglement of concerns where presentation logic bleeds into the orchestration layer. An architecturally pure implementation would treat the logger as a passive observer rather than a canvas for terminal aesthetics. + +### Persona Perspective Review + +- **Martin Fowler**: The code exhibits clear intent, but the decision-making logic in `executeAutoSearch` borders on a violation of the *Tell-Don't-Ask* principle. The orchestrator is interrogating the query to decide the strategy, whereas the strategy itself could potentially self-select based on the input complexity. +- **Uncle Bob**: The class structure is clean, and methods are generally focused. However, nesting error classes within the main class is a non-standard pattern that slightly complicates the external API ergonomics. +- **Kent C. Dodds**: The inclusion of an `auto` mode is a brilliant "pit of success" for the user. However, the lack of configurability for the word-count threshold might lead to frustrating "Why did it use Ripgrep here?" moments for power users. +- **Sindre Sorhus**: The strict adherence to ESM is appreciated. The dependency on `chalk` within a logic-heavy orchestrator is a minor blemish; logic modules should ideally be agnostic of the display medium. +- **Anthony Fu**: The TypeScript implementation is functional but could be more "vibey" with better metadata typing. The manual casting of `meta` properties hints at an opportunity for a more robust generic interface. + +### Line-by-Line Observations + +- **Error Nesting**: + ```typescript + static readonly SearchOrchestratorError = class extends Error { ... } + ``` + Critique: While syntactically valid, this pattern makes it harder to catch specific errors without referencing the class itself. Fowler would suggest top-level exports for better discoverability and cleaner `instanceof` checks. + +- **Heuristic Logic**: + ```typescript + const queryWordCount = query.trim().split(/\s+/).length + const isLongQuery = queryWordCount > LONG_QUERY_WORD_COUNT_THRESHOLD + ``` + Critique: This heuristic for "semantic depth" is rudimentary. Uncle Bob might appreciate the extraction of the threshold, but Kent C. Dodds would argue for making this threshold configurable via the `Config` object to support different user environments. + +- **Presentation Leakage**: + ```typescript + const spaceNameDisplay = chalk.green(meta['spaceName'] as string) + const titleDisplay = chalk.cyan(meta['title'] as string) + ``` + Critique: Sorhus and Fu would likely align on the fact that the orchestrator is doing too much "painting." Decoupling the result retrieval from the formatting would allow for easier integration with other frontends. The manual casting with `as string` suggests the metadata types in the `VectorStore` could be more strictly defined. + +--- +# Architectural Review: src/search/vector-store.ts + +The `VectorStore` encapsulates the complexity of local vector indexing using `vectra`. It demonstrates a solid understanding of batch processing and chunking, which are essential for performance and context preservation. However, the reliance on synchronous file system operations and regex-based metadata extraction introduces fragility. + +### Persona Perspective Review + +- **Martin Fowler**: The `extractContentAndMetadata` method is a prime candidate for the *Extract Method* refactoring. It mixes file reading, metadata parsing via regex, and chunking. Decoupling the parser into a separate class or function would improve testability and allow for more robust parsing logic. +- **Uncle Bob**: The class follows the *Single Responsibility Principle* reasonably well, but the batching logic in `processMarkdownFilesByBatches` is a bit dense. The nested loops and state management for `pendingTextsToEmbed` could be simplified using more modern functional patterns or a dedicated batching utility. +- **Kent C. Dodds**: The "Aha!" moment for the developer is the automatic chunking of large markdown files. However, the hardcoded `CHUNK_SIZE_CHARS` and `CHUNK_OVERLAP_CHARS` should be part of the `Config` to allow tuning without code changes. +- **Sindre Sorhus**: The use of `readFileSync` and `readdirSync` is a "small module" anti-pattern in modern Node.js. For a tool that might process thousands of files, asynchronous I/O would prevent blocking the event loop and improve responsiveness. +- **Anthony Fu**: The metadata handling is a bit loose. Using `Record` and casting with `as any` or `as Record` is a missed opportunity for TypeScript's type safety. A Zod schema for metadata would provide both validation and type inference. + +### Line-by-Line Observations + +- **Synchronous I/O**: + ```typescript + const directoryEntries = readdirSync(directoryPath) + ... + const fileContent = readFileSync(filePath, 'utf-8') + ``` + Critique: Sorhus would insist on `fs.promises` or `globby`. Blocking the thread during a large rebuild is suboptimal. + +- **Regex Fragility**: + ```typescript + const titleMatch = fileContent.match(/^# (.+)$/m) + const spaceMatch = fileContent.match(/^\*\*Space:\*\* (.+?)\s{2,}$/m) + ``` + Critique: Relying on exact string matches for metadata in Markdown is risky. If the export format changes slightly (e.g., extra whitespace or different bolding), the indexing fails silently or produces "Untitled" results. Fowler would suggest a more resilient parsing strategy. + +- **Type Safety**: + ```typescript + return rawResults.map((result) => ({ + meta: result.item.metadata as VectorDocMeta, + score: result.score, + })) + ``` + Critique: Anthony Fu would prefer a properly typed generic for the `LocalIndex` if `vectra` supports it, or at least a validation step to ensure the metadata matches the expected `VectorDocMeta` structure. + +--- +# Architectural Review: src/search/rg-search.ts + +The `RgSearch` class provides a robust bridge to the `ripgrep` binary, leveraging the high-performance search capabilities of the Rust-based tool. It correctly handles JSON streaming and process management. However, the implementation mixes concerns between "searching for display" and "searching for data," leading to some logic duplication in argument construction. + +### Persona Perspective Review + +- **Martin Fowler**: The class uses *Method Extraction* to isolate argument construction, but the logic within `captureSearchMatches` to manipulate those arguments (`filter` and `concat`) is a bit brittle. A more elegant approach would be to have a central `getCommonArguments` method and specialize it for JSON or human-readable output. +- **Uncle Bob**: The use of `spawn` and manual `Promise` wrapping is handled well, adhering to the *Single Responsibility Principle* by keeping the process management contained. However, the `MAX_MATCHES_PER_QUERY` and `SEARCH_TIMEOUT_MS` constants are "magic numbers" that Uncle Bob would prefer to see elevated to the top of the file or the `Config`. +- **Kent C. Dodds**: The inclusion of a timeout and match limit is excellent for operational stability. It prevents a broad query from hanging the application—a great developer-centric "safe-by-default" design. +- **Sindre Sorhus**: The use of `@vscode/ripgrep` ensures cross-platform compatibility without external dependencies, which aligns perfectly with Sindre's philosophy of self-contained, reliable modules. The manual piping of `stdout` to `process.stdout` is a bit raw; a more modern stream-based approach would be cleaner. +- **Anthony Fu**: The use of `spawn` with `rgPath` is clean. The interface `RgMatch` is clear, but could be enhanced with better documentation of what the `text` field contains (is it the whole line or just the match?). + +### Line-by-Line Observations + +- **Argument Mutation**: + ```typescript + const jsonOutputArguments = baseArguments + .filter((arg) => arg !== '--color=always') + .concat(['--color=never', '--json', '--max-filesize', '1M', '--no-binary']) + ``` + Critique: This is a bit "hacky." Fowler would suggest a more structured way to build the argument list based on the desired output format, rather than filtering a base list. + +- **Manual Stream Handling**: + ```typescript + ripgrepProcess.stdout.on('data', (data) => { + hasFoundMatches = true + process.stdout.write(data) + }) + ``` + Critique: Uncle Bob might point out that writing directly to `process.stdout` makes this method hard to test in isolation. It would be better to return a stream or use a callback/logger to decouple the search from the terminal. + +- **Silencing Stderr**: + ```typescript + ripgrepProcess.stderr.on('data', () => { + // Silently consume stderr to avoid buffer filling up + }) + ``` + Critique: While preventing buffer overflow is important, silent consumption makes debugging difficult. Anthony Fu would likely suggest logging these to a debug log file or the `errorBus` if they represent actual errors rather than just "not found" warnings. + +--- +# Architectural Review: src/export/file-writer.ts + +The `FileWriter` class is a straightforward implementation of persistence logic. It correctly handles directory creation and file naming. However, its heavy reliance on synchronous I/O and the concatenation-based Markdown generation are areas where the "persona" might suggest improvements. + +### Persona Perspective Review + +- **Martin Fowler**: The `formatConversationAsMarkdown` method is a simple example of the *Composed Method* pattern. It's readable, but as the metadata block grows, it might benefit from a dedicated template or a more structured builder to ensure consistent formatting (e.g., ensuring double spaces for Markdown line breaks). +- **Uncle Bob**: The class is small and focused, adhering to the *Single Responsibility Principle*. However, the constructor performing a side effect (`ensureRootExportDirectoryExists`) is something Uncle Bob often advises against. It's better to have a dedicated `init()` or `setup()` method, or have the caller ensure the environment is ready. +- **Kent C. Dodds**: The inclusion of the `conversation.id` in the filename is a great "UX for developers" move. It prevents collisions and makes it easy to map files back to the source data. +- **Sindre Sorhus**: Again, the use of `writeFileSync` and `mkdirSync` is a blemish in an otherwise modern TypeScript project. Sindre would prefer the asynchronous versions from `node:fs/promises` to keep the application responsive, especially if multiple files are being written in parallel. +- **Anthony Fu**: The use of `toISOString()` for the date is standard, but the manual string concatenation for Markdown is a bit "old school." Using a template literal for the entire block would be more "vibey" and easier to read. + +### Line-by-Line Observations + +- **Constructor Side Effects**: + ```typescript + constructor(private readonly config: Config) { + this.ensureRootExportDirectoryExists() + } + ``` + Critique: Uncle Bob would prefer this to be explicit. If the directory cannot be created, the object creation fails, which might be confusing in some contexts. + +- **Markdown Generation**: + ```typescript + const metadataBlock = + `**Space:** ${conversation.spaceName} \n` + + `**ID:** ${conversation.id} \n` + + `**Date:** ${conversation.timestamp.toISOString()} \n\n` + ``` + Critique: Anthony Fu would likely prefer a template literal for the entire file content, which makes the structure of the output much more obvious at a glance. + +- **Sync I/O**: + ```typescript + writeFileSync(destinationFilePath, markdownContent, 'utf-8') + ``` + Critique: Sorhus would insist on `await writeFile(...)`. Even if this is a CLI tool, async-by-default is a better practice for scalability. + +--- +# Architectural Review: src/export/sanitizer.ts + +A concise utility module. While small, it plays a critical role in ensuring filesystem compatibility. The persona review highlights the tension between simplicity and robustness. + +### Persona Perspective Review + +- **Martin Fowler**: The `sanitizeSpaceName` function is a simple *Delegation*. It's fine for now, but Fowler would keep an eye on it—if space names ever need different rules than filenames, the delegation should be broken. +- **Uncle Bob**: The functions are small and focused. The extraction of `ILLEGAL_CHARACTER_REPLACEMENT` and `MAXIMUM_FILENAME_LENGTH` as local constants within the function is good, though Uncle Bob might argue for moving them to the top of the file if they are used by multiple functions. +- **Kent C. Dodds**: The truncation to `MAXIMUM_FILENAME_LENGTH` is a great safety feature to avoid "filename too long" errors on certain filesystems. It's a defensive coding practice that improves the user experience. +- **Sindre Sorhus**: Sindre is the king of small modules. He would appreciate the use of the `sanitize-filename` package instead of reinventing the wheel. However, he might find the `sanitizeMarkdownContent` function slightly redundant—if it does nothing but return the input, why does it exist? +- **Anthony Fu**: The module is clean and functional. The use of ESM is perfect. + +### Line-by-Line Observations + +- **Redundant Regex?**: + ```typescript + return safeFilename + .replace(/\s+/g, ILLEGAL_CHARACTER_REPLACEMENT) + ``` + Critique: The `sanitize` package already handles illegal characters. Replacing spaces with underscores is a stylistic choice. Anthony Fu might suggest making this replacement optional or configurable, as some users might prefer spaces in their filenames. + +- **Stub Function**: + ```typescript + export function sanitizeMarkdownContent(rawMarkdown: string): string { + return rawMarkdown || '' + } + ``` + Critique: Sindre Sorhus would likely call this a "ghost function." If there's no sanitization logic for the content yet, it's better to omit it until it's needed, or at least document why it's a stub (e.g., "Placeholder for future XSS/cleaning logic"). + +--- +# Architectural Review: src/utils/api-diagnostics.ts + +The `ApiDiagnosticsWriter` is a focused utility for observability. Its use of JSONL (JSON Lines) is a professional choice for logging, as it remains append-only and easy to parse. + +### Persona Perspective Review + +- **Martin Fowler**: The use of `Omit` in the method signature is a clean way to ensure the caller doesn't have to worry about generating timestamps. This is a good application of *Interface Segregation* or at least clean API design. +- **Uncle Bob**: The class is small, follows the SRP, and handles its own errors gracefully. However, the hardcoded `DEBUG_DIRECTORY` and `DIAGNOSTICS_FILENAME` as private constants are slightly restrictive. Uncle Bob might suggest making these configurable through the constructor or the `Config` object. +- **Kent C. Dodds**: This is a great "debug-ability" feature. When an API changes unexpectedly, having a machine-readable log of the failures is a huge time-saver for developers. This is a very Kent-esque "pit of success" for troubleshooting. +- **Sindre Sorhus**: Sindre would be pleased to see `node:fs/promises` being used here. It's clean, modern, and non-blocking. He might suggest extracting this into a tiny standalone `jsonl-logger` module if it were to be reused elsewhere. +- **Anthony Fu**: The TypeScript usage is elegant. The check `if (!this.config.debug) return` ensures zero overhead in production-like runs, which is a nice performance consideration. + +### Line-by-Line Observations + +- **Hardcoded Paths**: + ```typescript + private readonly DEBUG_DIRECTORY = 'debug' + private readonly DIAGNOSTICS_FILENAME = 'api-diagnostics.jsonl' + ``` + Critique: While acceptable for a utility, Fowler would point out that these are "Hidden Dependencies" on the file system structure. If the tool is run from a different directory, the `debug/` folder might end up in an unexpected location. + +- **Timestamp Generation**: + ```typescript + timestamp: new Date().toISOString(), + ``` + Critique: Anthony Fu might suggest using a more modern date library or at least a utility function if the project grows, but for a diagnostic log, `toISOString()` is perfectly appropriate and "standard." + +--- +# Architectural Review: src/utils/error-bus.ts + +The `ErrorBus` implements a centralized error handling pattern, decoupling error generation from error reporting. This is a powerful pattern for maintainability. + +### Persona Perspective Review + +- **Martin Fowler**: This is a classic implementation of the *Event Aggregator* pattern. It allows various parts of the system to report errors without needing to know how those errors are logged or reported to external services. +- **Uncle Bob**: While the pattern is good, the `ErrorBus` class itself is doing two things: aggregating errors and logging them (in the constructor). Uncle Bob would suggest moving the logging logic to a separate "ErrorLogger" that subscribes to the `ErrorBus`. +- **Kent C. Dodds**: The inclusion of a `context` object is vital for production debugging. It allows developers to see exactly what state the system was in when the error occurred. +- **Sindre Sorhus**: Sindre might find the reliance on `process.env` inside the class a bit "messy." He would likely prefer the `debug` flag to be passed in or handled by a dedicated configuration module. +- **Anthony Fu**: The use of `node:events` is solid. The type definition for `AppError` is clear. He would probably prefer `logger.debug` instead of `console.error` for the error stack trace to keep the output consistent. + +### Line-by-Line Observations + +- **In-constructor Side Effect**: + ```typescript + this.on('error', (appError: AppError) => { ... }) + ``` + Critique: Uncle Bob would argue that the bus should be a "passive" carrier of information. Having it "listen to itself" to log errors makes it harder to test or to swap out the logging implementation. + +- **Environment Variable Check**: + ```typescript + const isDebugEnabled = process.env['DEBUG'] === 'true' || process.env['DEBUG_MODE'] === 'true' + ``` + Critique: This contradicts the memory about `DEBUG=true` being the "strict" activation variable and legacy variables being removed. Sindre Sorhus and Anthony Fu would both prefer this logic to reside in the `Config` class and be injected. + +- **Type Safety**: + ```typescript + error?: unknown + ``` + Critique: `unknown` is the correct type here for an error, but Anthony Fu might suggest a utility to safely extract the stack trace or message from it. + +--- +# Architectural Review: src/utils/wait-strategy.ts + +The `WaitStrategy` module is a textbook example of the *Strategy Pattern*. It allows the application to toggle between performance-oriented (dynamic) and stealth-oriented (static) waiting behaviors. + +### Persona Perspective Review + +- **Martin Fowler**: Fowler would love the use of the *Strategy Pattern* here. It cleanly separates the "what" (waiting for an action to finish) from the "how" (network idle vs. fixed delay). It makes the scraper's behavior highly configurable and testable. +- **Uncle Bob**: The interface `WaitStrategy` is clean and follows the *Interface Segregation Principle*. The implementations are small and focused. One minor point: the use of `Page` from `@playwright/test` in a project that supposedly uses `patchright` (as per memory) might be a subtle dependency leak. +- **Kent C. Dodds**: The `StaticWaitStrategy` with its `randomPause` and jitter is a great "real-world" solution for bypassing bot detection. It mimics human behavior in a way that fixed delays cannot. +- **Sindre Sorhus**: Sindre might find the class-based approach for `DynamicWaitStrategy` a bit heavyweight for what could be simple functions, but given the need for an interface, it's a reasonable choice. +- **Anthony Fu**: The factory function `waitStrategy` is a clean way to instantiate the correct strategy. He might suggest using a more descriptive name for the factory, like `createWaitStrategy`. + +### Line-by-Line Observations + +- **Dependency Inconsistency**: + ```typescript + import type { Page } from '@playwright/test' + ``` + Critique: The memory states the project uses `patchright` as a drop-in replacement. While `patchright` is compatible with Playwright types, importing from `@playwright/test` might bring in unnecessary dependencies or cause confusion. Sindre would prefer consistent importing from the core automation package. + +- **Swallowing Errors**: + ```typescript + .catch(() => {}) + ``` + Critique: Uncle Bob would caution against empty catch blocks. Even if the timeout is expected, it should be explicitly handled or logged at a debug level so that developers know why a wait was cut short. + +- **Magic Numbers**: + ```typescript + private static readonly NETWORK_IDLE_TIMEOUT_MS = 2000 + ``` + Critique: While these are constants, making them part of the `Config` would satisfy Kent C. Dodds' desire for user-tunable performance. + +--- +# Architectural Review: src/utils/config.ts + +The `config.ts` module uses Zod for schema validation, which is a gold standard in modern TypeScript development. It provides strong type safety and runtime validation for environment variables. + +### Persona Perspective Review + +- **Martin Fowler**: The use of *Schema Validation* for configuration is a great way to ensure the system fails fast with clear errors. The `camelToSnakeCase` utility is a nice touch to map internal properties back to their environment variable equivalents in error messages. +- **Uncle Bob**: The module has side effects at the top level (`loadEnv()`, `parseEnvConfig()`, and directory creation). Uncle Bob would prefer these to be encapsulated in a class or a controlled initialization sequence. The directory creation especially should happen within the relevant service's initialization, not in the configuration loader. +- **Kent C. Dodds**: This is a masterclass in "pit of success" for the user. If they misconfigure an environment variable, the app tells them exactly what's wrong and how to fix it before they even start. +- **Sindre Sorhus**: Sindre would appreciate the use of Zod and the strict ESM. He might suggest extracting the `camelToSnakeCase` and `ensureDirectoryExistsForFile` into a small utility module if they are used elsewhere, or keeping them local if they are strictly for config. +- **Anthony Fu**: The use of `z.infer` to generate the `Config` type is exactly what Anthony would do. It keeps the source of truth in one place. One minor detail: the manual parsing of `parseInt` and `process.env` before Zod validation is a bit redundant; Zod could handle the string-to-number transformation itself. + +### Line-by-Line Observations + +- **Redundant Parsing**: + ```typescript + parallelWorkers: parseInt(process.env['PARALLEL_WORKERS'] ?? DEFAULT_PARALLEL_WORKERS, 10), + ``` + Critique: Anthony Fu would suggest using `z.coerce.number()` in the schema. This would allow Zod to handle the parsing from string to number, reducing the amount of manual boilerplate in `rawConfig`. + +- **Side Effects in Config**: + ```typescript + ensureDirectoryExistsForFile(config.authStoragePath) + ... + mkdirSync(config.exportDir, { recursive: true }) + ``` + Critique: Uncle Bob and Fowler would both flag this. A configuration module should be "pure"—it should return the configuration, not modify the filesystem. These side effects make the module harder to test and create "hidden" setup logic. + +- **Process Exit**: + ```typescript + process.exit(1) + ``` + Critique: While acceptable for a CLI tool, Sindre would prefer the function to `throw` and let the entry point handle the exit. This makes the code more reusable in other contexts (e.g., as a library). + +--- +# Architectural Review: src/utils/chunking.ts + +The `chunkMarkdown` utility is a critical piece of the RAG pipeline, ensuring that long documents are split into manageable pieces while preserving semantic context via headers and overlap. + +### Persona Perspective Review + +- **Martin Fowler**: The function uses a *Recursive Split* approach followed by a fallback for oversized sections. It's a pragmatic solution to a complex problem. However, the logic for handling overlap (`currentChunk.slice(-overlapChars)`) is a bit "magical." A more explicit windowing approach would be easier to follow. +- **Uncle Bob**: The function is quite long and contains several nested logic branches. Uncle Bob would suggest breaking it down into smaller, more descriptive functions: `splitByMarkers`, `accumulateSections`, and `handleOversizedChunks`. +- **Kent C. Dodds**: The inclusion of an `overlapChars` parameter is a great feature for preserving context between chunks, which is essential for vector search accuracy. +- **Sindre Sorhus**: Sindre would love the single-purpose nature of this module. He might suggest renaming the file to `markdown-chunker.ts` to be more descriptive. +- **Anthony Fu**: The regex `HEADER_OR_RULE_REGEX` is clever but might be fragile. Anthony would probably suggest using a proper Markdown parser (like `unified` or `remark`) to extract sections more reliably, though for a lightweight tool, the regex is understandable. + +### Line-by-Line Observations + +- **Overlap Logic**: + ```typescript + const overlapText = currentChunk.slice(-overlapChars).replace(/^---\s*/, '') + ``` + Critique: This line is doing a lot of heavy lifting. It's trying to preserve the end of the previous chunk but also cleaning up horizontal rules. Fowler would suggest extracting this into a named function like `getOverlapContext`. + +- **Fallback Chunking**: + ```typescript + return chunks.flatMap((chunk) => { + if (chunk.length <= MAX_CHUNK_THRESHOLD) { ... } + ... + }) + ``` + Critique: The `flatMap` at the end is a safety net for huge sections that don't have headers. It's a good "fail-safe," but Uncle Bob would point out that it's a separate concern from the header-based chunking and should be isolated. + +- **Magic Numbers**: + ```typescript + const MAX_CHUNK_THRESHOLD = maxChars + 500 + ``` + Critique: The `500` is a magic constant. Why 500? Anthony Fu would prefer this to be a percentage or a named constant at the top of the file. + +--- +# Architectural Review: src/utils/http-logger.ts + +The `http-logger.ts` is a critical tool for debugging bot-detection and API interactions. Its focus on privacy (redaction) is commendable and necessary for a tool that handles sensitive user data. + +### Persona Perspective Review + +- **Martin Fowler**: The redaction logic is a good use of the *Encapsulate Variable* and *Extract Method* patterns. It ensures that sensitivity rules are applied consistently across both requests and responses. +- **Uncle Bob**: The module has many top-level functions and side effects (file creation, directory creation). Uncle Bob would likely prefer an `HttpLogger` class that takes a `Config` and an `FsProvider` to make it easier to mock for testing. The use of `appendFileSync` is again a synchronous blemish. +- **Kent C. Dodds**: The "Privacy by Default" approach (`isPromptRequest`) is a fantastic developer experience choice. It allows users to share debug logs without accidentally leaking their private prompts or conversation history. +- **Sindre Sorhus**: Sindre would point out the inconsistency in file I/O: some parts of the project use `fs/promises`, while this one uses `appendFileSync`. He would also suggest moving the `LOGS_DIRECTORY` and filenames into a configuration or utility. +- **Anthony Fu**: The identification of `SENSITIVE_HEADERS` and `PROMPT_KEYWORDS` is very pragmatic. Anthony would likely suggest a more robust way to handle JSON bodies that doesn't rely on string matching if possible, but given the varied nature of API responses, this "fuzzy" approach is understandable. + +### Line-by-Line Observations + +- **Synchronous File Operations**: + ```typescript + appendFileSync(HTTP_LOG_PATH, logEntry + '\n') + ``` + Critique: This blocks the event loop for every single HTTP interaction. In a scraper making dozens of requests, this could significantly slow down the process. Sindre Sorhus would insist on an asynchronous approach. + +- **Dependency on Playwright Types**: + ```typescript + import type { Request, Response } from '@playwright/test' + ``` + Critique: As noted in `wait-strategy.ts`, this project should ideally be importing from `patchright` to maintain consistency with its core automation engine. + +- **Magic Strings**: + ```typescript + const LOGS_DIRECTORY = 'logs' + ``` + Critique: Uncle Bob would prefer this to be part of the configuration so that log location can be easily changed for CI/CD or different deployment environments. + +--- +# Architectural Review: src/utils/logger.ts + +The `logger.ts` is the primary interface for feedback to the user and for background logging. It correctly implements color-coding and file persistence for debugging. + +### Persona Perspective Review + +- **Martin Fowler**: The logger provides a consistent interface for the rest of the application. The removal of ANSI escape codes for file logging is a great example of *Transforming Data* for a specific sink. +- **Uncle Bob**: Similar to other utilities, this module has side effects in its "private" helper `writeToLogFile`. Uncle Bob would prefer the logger to be an interface that can have multiple "Sinks" (Console, File, API) rather than being hardcoded to both `console` and `appendFileSync`. +- **Kent C. Dodds**: The visual cues (icons like ℹ, ✓, ⚠) are excellent for developer experience. They make the terminal output much easier to scan at a glance. +- **Sindre Sorhus**: Sindre would likely point out the use of legacy env vars `DEBUG_MODE` and `DIAGNOSIS_MODE`, which contradicts the project memory about `DEBUG=true` being the sole trigger. He would also prefer the use of a more robust logging package like `pino` or `winston` if the project needs file logging, or a simpler one-line implementation if not. +- **Anthony Fu**: The use of `chalk` is a staple of Anthony's tools. He would probably suggest using `consola` or a similar modern logger that handles a lot of this boilerplate (including the file logging and environment checks) more elegantly. + +### Line-by-Line Observations + +- **Legacy Environment Variables**: + ```typescript + const IS_DEBUG_MODE = + process.env['DEBUG_MODE'] === 'true' || process.env['DIAGNOSIS_MODE'] === 'true' + ``` + Critique: The project memory states that support for these legacy variables was removed. This module needs to be updated to align with the source of truth in the memory. + +- **Synchronous and Direct I/O**: + ```typescript + appendFileSync(MAIN_LOG_PATH, `[${logTimestamp}] ${plainTextLines}\n`) + ``` + Critique: As with the HTTP logger, blocking I/O is a performance bottleneck. Sindre Sorhus would emphasize that logging should not block the main execution flow of a scraper. + +- **Ansi Strip Regex**: + ```typescript + const ANSI_ESCAPE_REGEX = /\x1b\[[0-9;]*m/g + ``` + Critique: Sindre Sorhus has a package for this (`strip-ansi`). Using a well-tested library is safer than a manual regex, which might miss certain edge cases in terminal codes. + +--- +# Architectural Review: src/index.ts + +The application entry point is refreshingly minimal, adhering to the principle of "separation of concerns" by delegating all logic to the `Repl` and `Config` modules. + +### Persona Perspective Review + +- **Martin Fowler**: This is a clean *Application Entry Point*. It's decoupled from the actual logic, making the system easier to test by simply instantiating the `Repl` or other components directly in test suites. +- **Uncle Bob**: The use of a `bootstrapApplication` function is good practice, but Uncle Bob might find the JSDoc comment ("Entry point for...") a bit redundant given the filename and the function name. +- **Kent C. Dodds**: The simplicity here is key. A developer can quickly see how the app starts up without wading through configuration or initialization logic. +- **Sindre Sorhus**: Sindre would appreciate the brevity. He might suggest making the script "safe" for importing by checking if it's the main module (e.g., `if (import.meta.url === ...)`) although in many modern TS projects, the entry point is strictly an entry point. +- **Anthony Fu**: The use of ESM top-level `await` via the `bootstrapApplication` wrapper is the modern way to handle this. + +### Line-by-Line Observations + +- **Error Handling**: + ```typescript + } catch (initializationError) { + errorBus.emitError('Application failed to start', initializationError) + } + ``` + Critique: While the error is emitted to the bus, does the application actually exit? Anthony Fu would note that if the REPL fails to start, the process might just hang or end silently. Sindre Sorhus would suggest a `process.exit(1)` here after logging to ensure the shell knows the command failed. + +--- +# Architectural Review: src/repl/commands.ts + +The `CommandHandler` is the "Controller" in the MVC-like structure of the REPL. It orchestrates the flow between the user's input, the scraper, and the search engine. + +### Persona Perspective Review + +- **Martin Fowler**: The class is a classic *Service Layer* that coordinates the application's high-level tasks. However, it's becoming a "God Object" or at least a very heavy one. It manages discovery, extraction, search, and storage wiping. Fowler would suggest splitting this into specialized handlers: `ExportHandler`, `SearchHandler`, and `MaintenanceHandler`. +- **Uncle Bob**: The `executeFullScrapingFlow` method is quite long and mixes high-level flow control with low-level details like browser management. Uncle Bob would prefer to see the browser management extracted into a *Resource Wrapper* or a *Context Manager*. +- **Kent C. Dodds**: The "Scraper Wizard" with its resume/sync/restart options is a peak DX (Developer Experience) feature. It treats the user as an intelligent operator and provides clear, actionable choices for common scenarios. +- **Sindre Sorhus**: Sindre would point out the mix of imports and the manual `rmSync`. He would prefer the use of modern packages for UI prompts (though `@inquirer/prompts` is used, which is good) and asynchronous file operations. +- **Anthony Fu**: The TypeScript usage for the prompt return values (`as 'auto' | 'vector' | 'rg' | 'rag'`) is a bit loose. Using a Zod schema or a discriminated union with proper inference from the prompt library would be safer. + +### Line-by-Line Observations + +- **Error Nesting (Again)**: + ```typescript + static readonly ScraperError = class extends Error { ... } + ``` + Critique: This pattern persists. As mentioned in the search orchestrator review, this makes the errors less ergonomic for external consumers. + +- **Direct Inquirer Calls**: + The class directly uses `confirm`, `select`, and `input`. + Critique: This makes the class hard to test without an interactive terminal. Fowler would suggest injecting an `UIProvider` or `PromptService` into the `CommandHandler` to decouple the logic from the terminal interface. + +- **Mixing Abstractions**: + ```typescript + await this.runDiscoveryPhase(activePage) + ... + const workerPool = new WorkerPool(this.config, this.checkpointManager, activeBrowser) + ``` + Critique: In `runDiscoveryPhase`, the logic is hidden in a method, but in `runExtractionPhase`, the `WorkerPool` is instantiated and managed directly. Uncle Bob would insist on consistent abstraction levels. + +--- +# Architectural Review: src/repl/index.ts + +The `Repl` class provides the main loop for the interactive CLI. It's a clean implementation of a command dispatcher. + +### Persona Perspective Review + +- **Martin Fowler**: The `dispatchCommand` method is a straightforward *Switch-on-Type* (or in this case, switch-on-string). It's simple and effective for this scale. If the number of commands were to grow significantly, Fowler might suggest a *Command Pattern* where each command is an object with an `execute` method. +- **Uncle Bob**: The loop is clean and the `try-catch` handles user interruptions (`ExitPromptError`) gracefully. However, the direct call to `process.exit(0)` inside the class is something Uncle Bob would generally avoid in favor of returning from the `start` method. +- **Kent C. Dodds**: The inclusion of a friendly "Goodbye!" and clear instructions on how to exit (Ctrl+C) is great UX for a CLI tool. +- **Sindre Sorhus**: Sindre would appreciate the use of `@inquirer/prompts`. He might suggest making the class more generic so it could be reused for other REPL-like interfaces. +- **Anthony Fu**: The use of `chalk.bold.cyan` for the header is a classic Anthony Fu touch. It gives the tool a professional and modern "vibe." + +### Line-by-Line Observations + +- **Process Exit**: + ```typescript + private terminate(): void { + ... + process.exit(0) + } + ``` + Critique: Sindre and Uncle Bob would both prefer the `start` loop to naturally finish, allowing the caller (in `index.ts`) to decide whether to exit the process. + +- **Type Safety on Action Value**: + ```typescript + private async dispatchCommand(actionValue: string): Promise + ``` + Critique: Anthony Fu would suggest using a union type for `actionValue` based on the `choices` array to ensure that the `switch` statement is exhaustive and type-safe. + +--- +# Architectural Review: src/repl/help.ts + +A pure presentation module. It's clean and serves its purpose well. + +### Persona Perspective Review + +- **Martin Fowler**: The `logAction` helper is a good example of *Internal Domain Language* for this specific module. It makes the help text easy to maintain and consistent. +- **Uncle Bob**: The function is a bit long due to the strings, but it follows the SRP perfectly. +- **Kent C. Dodds**: The inclusion of tips for RAG and search is brilliant. It educates the user on how to get the most out of the tool, reducing frustration and "Why did I get this result?" support questions. +- **Sindre Sorhus**: Sindre would appreciate the clean, minimalist look. He might suggest using a package like `boxen` to make the headers pop even more, or keeping it simple as it is. +- **Anthony Fu**: The use of emojis and `chalk` styling makes the help text readable and modern. + +### Line-by-Line Observations + +- **Hardcoded Strings**: + Critique: While acceptable here, if the application were to be internationalized, these strings would need to be moved to a translation file. For a tool of this scope, it's a non-issue. + +- **Manual Helper**: + ```typescript + const logAction = (actionName: string, actionDescription: string) => { ... } + ``` + Critique: Anthony Fu would appreciate the local scope of this helper, keeping it from leaking into the global namespace. + +--- +# Architectural Review: src/scraper/conversation-extractor.ts + +The `ConversationExtractor` is a high-complexity module that bridges the gap between raw browser events and structured Markdown data. It features sophisticated retry logic, pagination handling, and data normalization. + +### Persona Perspective Review + +- **Martin Fowler**: This class is a candidate for substantial refactoring. It violates the *Single Responsibility Principle* by handling browser navigation, API interception, Zod validation, SHA-256 hashing, and Markdown conversion. Fowler would suggest extracting these into a `NavigationService`, `ApiInterceptor`, `DataHasher`, and `MarkdownFormatter`. +- **Uncle Bob**: The `captureConversationApiResponse` method is a very large and complex closure-based promise. Uncle Bob would argue for breaking this down into smaller, named methods to manage the event listeners and timeouts more cleanly. +- **Kent C. Dodds**: The use of stable JSON serialization for hashing (`hashEntries`) is a brilliant "pit of success" for incremental updates. It ensures that content-based skipping is reliable even if the API returns keys in a different order. +- **Sindre Sorhus**: Sindre would appreciate the strict Zod schemas for the API response. However, he would likely find the internal state management (`currentTimeoutMs`) within the class a bit untidy and would prefer a more functional approach to timeout adjustments. +- **Anthony Fu**: The TypeScript usage is quite comprehensive, though the frequent use of `any` in `parseConversationData` and `convertEntriesToMarkdown` is a blemish. Anthony would suggest defining more exhaustive interfaces for the Perplexity API objects to eliminate the `any` casts. + +### Line-by-Line Observations + +- **Massive Error List**: + The class defines over 7 different static error classes. Uncle Bob would suggest consolidating these or moving them to an external `errors.ts` file to keep the main logic file clean. + +- **Manual JSON Sorting**: + ```typescript + const stableJsonString = JSON.stringify(rawEntries, (_key, value) => { ... }) + ``` + Critique: Sindre Sorhus would recommend using a dedicated package like `fast-json-stable-stringify` instead of a manual implementation of sorted keys, as it's more robust and handles edge cases better. + +- **Markdown Logic in Extractor**: + ```typescript + private convertEntriesToMarkdown(entries: unknown[], threadTitle: string): string { ... } + ``` + Critique: This is a clear case of mixed concerns. The extractor should return structured data; the formatting into Markdown should be the responsibility of a different component (e.g., the `FileWriter` or a dedicated `MarkdownService`). + +- **Zod Error Reporting**: + ```typescript + this.diagnostics.writeFailure({ ... }) + ``` + Critique: This is an excellent observability feature. It allows the tool to evolve alongside the API it scrapes by logging exactly where the data schema has drifted. + +--- +# Architectural Review: src/scraper/worker-pool.ts + +The `WorkerPool` is the engine of the extraction phase, providing parallelization and fault tolerance. Its ability to manage multiple "workers" while handling context loss and retries is a sophisticated implementation of the *Work Queue* pattern. + +### Persona Perspective Review + +- **Martin Fowler**: The class is a good example of the *Parallel Pipeline* pattern. It separates the orchestration of tasks (the loop) from the execution (the worker). The retry logic is cleanly integrated into the failure handler. +- **Uncle Bob**: The class follows the SRP reasonably well, but the `processConversations` method is a bit "busy." The manual `while` loop with a polling interval is a bit primitive for modern Node.js. Uncle Bob would prefer to see this implemented using a more robust concurrency primitive like a `Semaphore` or a dedicated library like `p-limit`. +- **Kent C. Dodds**: The intelligent skipping based on content hashes is a great "developer-centric" optimization. It saves both time and browser resources by avoiding unnecessary re-processing of unchanged threads. +- **Sindre Sorhus**: Sindre is the creator of `p-limit` and `p-queue`. He would strongly suggest replacing the manual `POLLING_INTERVAL_MS` loop with one of his micro-libraries to achieve much cleaner and more efficient concurrency management. +- **Anthony Fu**: The TypeScript usage is clean. The `ExtractionWorker` and `QueueItem` interfaces provide good internal structure. The use of `ReturnType` to get the result type from the extractor is a nice touch. + +### Line-by-Line Observations + +- **Manual Concurrency Loop**: + ```typescript + while (queue.length > 0 || activeTasks.length > 0) { ... } + ``` + Critique: This is a classic "Old School" Node.js pattern. Sindre Sorhus would point out that it's inefficient because it relies on a timer (`POLLING_INTERVAL_MS`). A promise-based queue would be more reactive and less resource-heavy. + +- **Content Hash Check**: + ```typescript + if (existingHash && existingHash === result.contentHash) { ... } + ``` + Critique: This is the "Magic" that makes the sync feature fast. Fowler would appreciate the clear business logic here. + +- **Context Refreshing**: + ```typescript + if (isContextLost) { + logger.warn('Browser context lost. Refreshing worker context...') + await this.refreshContext() + } + ``` + Critique: This is a very robust error-handling pattern. It acknowledges that browser contexts are fragile and provides a self-healing mechanism. Uncle Bob would approve of this defensive programming. + +--- +# Architectural Review: src/scraper/checkpoint-manager.ts + +The `CheckpointManager` provides the persistence layer for the scraping process, ensuring that the tool can resume after failures and efficiently sync new content. It acts as a *Repository* for the processing state. + +### Persona Perspective Review + +- **Martin Fowler**: The preservation of content hashes in `setDiscoveredConversations` is a good example of *Merging State*. It shows a thoughtful approach to data continuity. However, the class mixes the logic of "what" to save with the "how" of saving it. Fowler would suggest separating the `CheckpointPersistence` (file I/O) from the `CheckpointState` (logic). +- **Uncle Bob**: The class is quite clean, but the methods `saveCheckpoint` and `loadCheckpoint` are synchronous. Uncle Bob would argue that as the number of conversations grows, the checkpoint file could become quite large, and blocking the main thread for JSON serialization and I/O could cause stuttering in the UI or delays in the worker pool. +- **Kent C. Dodds**: The `prepareForUpdateRun` method is the technical backbone of the "Sync" feature. It's a great example of an "Aha!" developer moment—realizing that you only need to reset the `processedIds` to re-trigger the extraction logic while keeping the hashes for skipping. +- **Sindre Sorhus**: Sindre would point out the use of `readFileSync` and `writeFileSync`. He would prefer an asynchronous, atomic write (like `write-file-atomic`) to prevent data corruption if the process is killed during a save operation. +- **Anthony Fu**: The TypeScript interfaces for the checkpoint data are clear and helpful. He might suggest using a more modern storage solution like a local SQLite database (via `better-sqlite3` or `drizzle`) if the data grows beyond what's comfortable for a single JSON file. + +### Line-by-Line Observations + +- **Synchronous File I/O**: + ```typescript + const rawCheckpointData = readFileSync(this.checkpointFilePath, 'utf-8') + ... + writeFileSync(this.checkpointFilePath, serializedState) + ``` + Critique: As noted, Sindre Sorhus and Uncle Bob would both prefer asynchronous I/O to avoid blocking and potential data loss. + +- **Inefficient Search**: + ```typescript + return this.currentState.discoveredConversations.filter( + (conversation) => !this.currentState.processedIds.includes(conversation.id) + ) + ``` + Critique: Uncle Bob would point out that `includes` inside a `filter` is O(N*M). For a large library, this could become slow. Converting `processedIds` to a `Set` for O(1) lookups would be a simple but effective optimization. + +- **Direct Mutation of State**: + The class directly modifies `this.currentState`. Fowler might suggest using an immutable approach or a more formal "State Change" pattern to ensure consistency, though for this scale, direct mutation is pragmatic. + +--- +# Architectural Review: src/scraper/library-discovery.ts + +The `LibraryDiscovery` module is a masterpiece of pragmatic web automation. It bypasses the DOM entirely, opting to execute authenticated `fetch` calls within the browser context to retrieve the user's thread list. This is significantly faster and more reliable than UI-based scraping. + +### Persona Perspective Review + +- **Martin Fowler**: The code uses a mix of top-level functions and a class. Fowler would suggest moving the helper functions (`detectApiVersion`, `fetchThreadBatch`, etc.) into the `LibraryDiscovery` class or a companion service to keep the namespace clean and improve the cohesiveness of the module. +- **Uncle Bob**: The use of `page.evaluate` to run code in the browser is handled well, but the `fetchThreadBatch` function is doing a lot: constructing the URL, evaluating in the browser, parsing the response, and handling errors. Uncle Bob would prefer to see the browser-side code extracted into a clean, reusable string or a separate file. +- **Kent C. Dodds**: The "Pit of Success" here is the automatic version detection. By watching the network traffic for `/rest/` endpoints, the tool automatically adapts to Perplexity's API versioning without requiring user intervention. This is a top-tier developer experience choice. +- **Sindre Sorhus**: Sindre would appreciate the randomized delay between batches (`800 + Math.random() * 700`) to avoid bot detection. It's a simple but effective technique. However, he would find the type casting `parsed as RawThread[]` a bit risky and would suggest a Zod schema to validate the API response. +- **Anthony Fu**: The module is very modern. The use of `page.waitForResponse` and `page.evaluate` shows a deep understanding of Playwright/Patchright capabilities. The `RawThread` interface is comprehensive, though it could be shared with other modules to avoid duplication. + +### Line-by-Line Observations + +- **Mixed Exports and Functions**: + Critique: Fowler would suggest moving the helper functions inside the class or into a `private` section to clarify what is part of the public API versus internal implementation details. + +- **Wait Logic**: + ```typescript + const delay = 800 + Math.random() * 700 + await page.waitForTimeout(delay) + ``` + Critique: Kent C. Dodds and Sindre Sorhus would both applaud this. It's a "human-like" behavior that makes the scraper much stealthier. + +- **Lack of Validation**: + ```typescript + const threads = parsed as RawThread[] + ``` + Critique: Anthony Fu and Sindre Sorhus would both insist on a Zod validation step here. If the API format changes (e.g., `threads` becomes an object instead of an array), the script will crash with a cryptic "cannot read property total_threads of undefined" instead of a clear validation error. + +- **Version Fallback**: + ```typescript + const version = extractVersionFromUrl(response.url()) ?? '2.18' + ``` + Critique: While pragmatic, Uncle Bob would prefer the fallback version to be a named constant in the `Config` or at the top of the file to avoid "magic strings." + +--- +# Architectural Review: src/scraper/browser.ts + +The `BrowserManager` is a sophisticated wrapper around Playwright/Patchright that handles the lifecycle of the browser and user authentication. Its multi-stage launch process (trying saved auth, falling back to headful login, then switching back to headless) is a great example of user-centric automation. + +### Persona Perspective Review + +- **Martin Fowler**: The `launch` method is quite long and complex, using the *Long Method* smell. Fowler would suggest breaking it down into smaller, well-named methods like `tryRestoreSession` and `performManualLogin`. The use of *Replace Exception with Guard Clause* would also make the flow easier to follow. +- **Uncle Bob**: The class violates the SRP by handling both browser lifecycle management and the business logic of authentication verification. Uncle Bob would recommend extracting an `AuthService` to handle the `verifyLoginStatus` and `persistAuthenticationState` logic. +- **Kent C. Dodds**: The "Automatic Session Recovery" is a peak DX feature. Saving and loading the `storageState` (cookies and localStorage) saves the user from having to log in manually every time they run the tool. +- **Sindre Sorhus**: Sindre would point out that despite the project memory saying `patchright` is used, the code imports from `@playwright/test`. He would also prefer asynchronicity throughout, avoiding `statSync` and `writeFileSync`. +- **Anthony Fu**: The logic to detect if a session is "fresh" (older than 24 hours) is very pragmatic. Anthony would probably suggest making this duration configurable or at least a named constant at the top of the file. + +### Line-by-Line Observations + +- **Dependency Inconsistency (Again)**: + ```typescript + import { chromium, ... } from '@playwright/test' + ``` + Critique: This is a major point of confusion. If the project's USP is using `patchright` for stealth, why is it importing standard `chromium` from `@playwright/test`? This likely negates the stealth benefits unless `patchright` is somehow monkey-patching the playwright package globally. + +- **State Persistence**: + ```typescript + writeFileSync(this.config.authStoragePath, serializedState) + ``` + Critique: Sindre Sorhus would suggest using a more robust way to write this file, perhaps with atomic writes, to ensure that a crash during the save doesn't leave the user with a corrupted (and thus unrecoverable) session. + +- **Complex Conditional in Launch**: + ```typescript + if (isSavedAuthValid) { + ... + if (isLoggedIn) { ... } + ... + await this.close() + } + ``` + Critique: This nested logic is hard to read. Fowler would suggest using a *State Machine* or at least flatter logic with early returns to clarify the "Happy Path" versus the "Recovery Path." + +- **Automation Controlled Flag**: + ```typescript + args: ['--disable-blink-features=AutomationControlled'], + ``` + Critique: The memory specifically mentions that this flag is handled internally by `patchright` and should be omitted. Its presence here suggests a drift between the code and the architectural instructions. + +--- +# Architectural Review: src/benchmark.ts + +The `benchmark.ts` script is a valuable tool for measuring the performance and reliability of the RAG pipeline. It demonstrates a commitment to performance monitoring and operational excellence. + +### Persona Perspective Review + +- **Martin Fowler**: The benchmark is a form of *Self-Checking Test*. It ensures that the system not only works correctly but also performs within acceptable bounds. Fowler would suggest parameterizing the queries so they can be passed in via a file or CLI arguments rather than being hardcoded. +- **Uncle Bob**: The main `runBenchmark` function is a bit long and mixes concerns: initialization, execution, and reporting. Uncle Bob would prefer to see these stages separated into distinct functions. +- **Kent C. Dodds**: This is a great developer utility. It allows anyone working on the RAG logic to immediately see the impact of their changes on latency. +- **Sindre Sorhus**: Sindre would likely suggest turning this into a more formal CLI tool using `meow` or a similar package, allowing for more flexible output (like JSON for CI integration). +- **Anthony Fu**: The use of `performance.now()` is the standard way to measure high-resolution timing in Node.js. The summary report at the end is clear and visually helpful. + +### Line-by-Line Observations + +- **Hardcoded Queries**: + ```typescript + const BENCHMARK_QUERIES = [ ... ] + ``` + Critique: While useful for a quick check, this makes the benchmark less flexible. Sindre Sorhus would suggest allowing the user to provide their own queries. + +- **Mixing Logic and Report**: + ```typescript + logger.info('--- Benchmark Results ---') + benchmarkResults.forEach((result, index) => { ... }) + ``` + Critique: Uncle Bob would suggest extracting the reporting logic into a separate function, perhaps even a separate `BenchmarkReporter` class, to allow for different output formats. + +- **Exit on Error**: + ```typescript + process.exit(1) + ``` + Critique: As mentioned before, Sindre and Uncle Bob prefer throwing errors and letting the top-level handler decide whether to exit the process. + +--- +# Architectural Review: src/ai/rag-orchestrator.ts + +The `RagOrchestrator` is the crown jewel of the application's intelligence. It implements a sophisticated Multi-Agent RAG pipeline involving planning (Research Plan), retrieval (Adaptive Hybrid Search with HyDE and RRF), reranking (Cross-Encoders), and synthesis (Map-Reduce extraction followed by a final Narrator). + +### Persona Perspective Review + +- **Martin Fowler**: This class is a "God Object" in the AI domain. It manages the entire lifecycle of a complex RAG pipeline. Fowler would recommend breaking this down into smaller, specialized services: `RAGPlanner`, `HybridRetriever`, `FactExtractor`, and `ResponseSynthesizer`. The *Template Method* pattern could also be used to define the RAG workflow while allowing different implementations for each step. +- **Uncle Bob**: The methods are well-named and reflect the "Ubiquitous Language" of RAG. However, the use of large string templates for prompts within the methods makes the code hard to read and maintain. Uncle Bob would suggest moving these prompts to external template files or a dedicated `PromptRegistry`. +- **Kent C. Dodds**: The "Adaptive" nature of the search (HyDE, Keyword, Semantic) is a masterclass in providing high-quality results for varied user queries. It's a "Mightiest" implementation indeed. +- **Sindre Sorhus**: Sindre would point out the dependency on `@huggingface/transformers` for reranking and the optionality handling (`getCrossEncoder`). He would prefer a more explicit dependency management strategy rather than checking if a module is available at runtime. +- **Anthony Fu**: The use of RRF (Reciprocal Rank Fusion) and Cross-Encoders shows a commitment to state-of-the-art AI patterns. Anthony would likely suggest using a more robust JSON parsing library for LLM outputs, as the regex-based `parseJsonFromResponse` is prone to failure if the LLM adds extra text or markdown formatting. + +### Line-by-Line Observations + +- **JSON Parsing Fragility**: + ```typescript + const jsonMatch = response.match(/(\{[\s\S]*\}|\[[\s\S]*\])/) + ``` + Critique: Relying on regex to extract JSON from LLM responses is a common but risky pattern. Anthony Fu would suggest using a library like `zod-to-json-schema` to define the expected output and then using a library that handles "dirty" JSON parsing (like `jsonic` or `dirty-json`) or using Ollama's structured output features if available. + +- **Prompt Entanglement**: + The class is riddled with large multi-line strings for prompts. + Critique: This makes the logic of the pipeline (e.g., how many batches to process) hard to distinguish from the instructions given to the LLM. Separation of concerns would dictate moving prompts elsewhere. + +- **Manual Map-Reduce**: + ```typescript + for (let batchStartIndex = 0, batchNumber = 1; ... ) { ... } + ``` + Critique: This manual batching and processing is efficient for local LLMs (avoiding context window overflow), but the implementation is quite verbose. Uncle Bob would suggest extracting the Map-Reduce logic into a generic utility. + +- **Cross-Encoder Optionality**: + ```typescript + const crossEncoder = await getCrossEncoder() + if (!crossEncoder) { ... } + ``` + Critique: While pragmatic, this "silent fallback" can lead to confusing performance differences between environments. Kent C. Dodds would suggest being more vocal about *why* the reranking is skipped (e.g., a "Warning: Reranking disabled due to missing dependency" message). + +--- +# Architectural Review: src/ai/ollama-client.ts + +The `OllamaClient` is a focused service for interacting with the Ollama API. It correctly handles both the modern OpenAI-compatible embeddings endpoint and the legacy Ollama-specific format. + +### Persona Perspective Review + +- **Martin Fowler**: The `parseEmbeddingsFromResponse` method is a good example of the *Special Case* or *Adapter* pattern. It allows the rest of the application to ignore the version-specific details of the Ollama API. +- **Uncle Bob**: The class follows the SRP by focusing strictly on communication with Ollama. The use of Zod for response validation is excellent, ensuring that the application doesn't try to process malformed or unexpected data. +- **Kent C. Dodds**: The `validate` method is another great DX feature. It provides an early warning if Ollama isn't running or the model isn't downloaded, rather than failing deep inside a complex RAG process. +- **Sindre Sorhus**: Sindre would appreciate the use of the native `fetch` API. He might suggest adding a timeout to the fetch requests, as LLM generation can sometimes hang or take a very long time, and a default timeout would improve the tool's robustness. +- **Anthony Fu**: The class is clean and modern. He would likely suggest adding support for streaming responses, as that's a common feature in AI tools that significantly improves perceived performance for the user. + +### Line-by-Line Observations + +- **Manual Fetch**: + ```typescript + const httpResponse = await fetch(fullRequestUrl, { ... }) + ``` + Critique: While `fetch` is standard, for a specialized client, Uncle Bob might suggest using a library like `got` or `axios` that has built-in retry logic and better timeout handling, though for a local API, `fetch` is often sufficient. + +- **Error Nesting (The Final Boss)**: + ```typescript + static readonly OllamaError = class extends Error { ... } + ``` + Critique: This pattern has been consistent throughout the codebase. While it keeps errors scoped to their modules, Fowler and Uncle Bob would both recommend a more standard approach to error hierarchy. + +- **Zod Schema Locality**: + The Zod schemas are defined at the top of the file. + Critique: Anthony Fu would approve of this, as it keeps the validation logic close to the code that uses it. + +--- diff --git a/package.json b/package.json index 26c589c..698cc3b 100644 --- a/package.json +++ b/package.json @@ -33,23 +33,31 @@ "chalk": "^5.6.2", "chromium-bidi": "^15.0.0", "dotenv": "^17.2.4", + "fast-json-stable-stringify": "^2.1.0", "inquirer": "^13.2.2", + "jsonic": "^2.28.0", + "p-limit": "^7.3.0", + "patchright": "^1.60.1", "playwright-core": "^1.58.2", "sanitize-filename": "^1.6.3", "vectra": "^0.12.3", - "zod": "^4.3.6" + "write-file-atomic": "^8.0.0", + "zod": "^4.3.6", + "zod-to-json-schema": "^3.25.2" }, "devDependencies": { - "concurrently": "^10.0.0", "@commitlint/cli": "^20.4.4", "@commitlint/config-conventional": "^20.4.4", "@playwright/test": "^1.58.2", "@release-it/conventional-changelog": "^10.0.6", "@types/inquirer": "^9.0.9", + "@types/jsonic": "^2.15.1", "@types/node": "^25.2.3", "@types/sanitize-filename": "^1.1.28", + "@types/write-file-atomic": "^4.0.3", "@vitest/coverage-v8": "^4.0.18", "@vitest/ui": "^4.0.18", + "concurrently": "^10.0.0", "esbuild": "^0.27.4", "husky": "^9.1.7", "lint-staged": "^17.0.5", diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 6632056..e4e7705 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -26,9 +26,21 @@ importers: dotenv: specifier: ^17.2.4 version: 17.4.2 + fast-json-stable-stringify: + specifier: ^2.1.0 + version: 2.1.0 inquirer: specifier: ^13.2.2 version: 13.4.3(@types/node@25.9.1) + jsonic: + specifier: ^2.28.0 + version: 2.28.0 + p-limit: + specifier: ^7.3.0 + version: 7.3.0 + patchright: + specifier: ^1.60.1 + version: 1.60.1 playwright-core: specifier: ^1.58.2 version: 1.60.0 @@ -38,9 +50,15 @@ importers: vectra: specifier: ^0.12.3 version: 0.12.3(zod@4.4.3) + write-file-atomic: + specifier: ^8.0.0 + version: 8.0.0 zod: specifier: ^4.3.6 version: 4.4.3 + zod-to-json-schema: + specifier: ^3.25.2 + version: 3.25.2(zod@4.4.3) devDependencies: '@commitlint/cli': specifier: ^20.4.4 @@ -57,12 +75,18 @@ importers: '@types/inquirer': specifier: ^9.0.9 version: 9.0.9 + '@types/jsonic': + specifier: ^2.15.1 + version: 2.15.1 '@types/node': specifier: ^25.2.3 version: 25.9.1 '@types/sanitize-filename': specifier: ^1.1.28 version: 1.6.3 + '@types/write-file-atomic': + specifier: ^4.0.3 + version: 4.0.3 '@vitest/coverage-v8': specifier: ^4.0.18 version: 4.1.7(vitest@4.1.7) @@ -1486,6 +1510,10 @@ packages: '@types/inquirer@9.0.9': resolution: {integrity: sha512-/mWx5136gts2Z2e5izdoRCo46lPp5TMs9R15GTSsgg/XnZyxDWVqoVU3R9lWnccKpqwsJLvRoxbCjoJtZB7DSw==} + '@types/jsonic@2.15.1': + resolution: {integrity: sha512-ue8FLe7W9bfsGJ0FU6m+j+LnQphGkQvT5e2HfTlAkJ+9K9dX2FZRcCF7C5KXW1e06Y0clCiZZLxAMBB/KuvYcg==} + deprecated: This is a stub types definition. jsonic provides its own type definitions, so you do not need this installed. + '@types/node-fetch@2.6.13': resolution: {integrity: sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw==} @@ -1515,6 +1543,9 @@ packages: '@types/through@0.0.33': resolution: {integrity: sha512-HsJ+z3QuETzP3cswwtzt2vEIiHBk/dCcHGhbmG5X3ecnwFD/lPrMpliGXxSCg03L9AhrdwA4Oz/qfspkDW+xGQ==} + '@types/write-file-atomic@4.0.3': + resolution: {integrity: sha512-qdo+vZRchyJIHNeuI1nrpsLw+hnkgqP/8mlaN6Wle/NKhydHmUN9l4p3ZE8yP90AJNJW4uB8HQhedb4f1vNayQ==} + '@vitest/coverage-v8@4.1.7': resolution: {integrity: sha512-qsYPeXc5Q9dFLd1i8Ap+Bx8sQgcp+rFVQo4R0dDsWNBzl26ldVF1qOO+RL24K7FDrR6pA+50XedRLSoSG24bVQ==} peerDependencies: @@ -2152,6 +2183,9 @@ packages: fast-deep-equal@3.1.3: resolution: {integrity: sha512-f3qQ9oQy9j2AhBe/H9VC91wLmKBCCU/gDOnKNAYG5hswO7BLKj09Hc5HYNz9cGI++xlpDCIgDaitVs03ATR84Q==} + fast-json-stable-stringify@2.1.0: + resolution: {integrity: sha512-lhd/wF+Lk98HZoTCtlVraHtfh5XYijIjalXck7saUtuanSDyLMxnHhSXEDJqHxD7msR8D0uCmqlkwjCV8xvwHw==} + fast-string-truncated-width@3.0.3: resolution: {integrity: sha512-0jjjIEL6+0jag3l2XWWizO64/aZVtpiGE3t0Zgqxv0DPuxiMjvB3M24fCyhZUO4KomJQPj3LTSUnDP3GpdwC0g==} @@ -2545,6 +2579,10 @@ packages: json-with-bigint@3.5.8: resolution: {integrity: sha512-eq/4KP6K34kwa7TcFdtvnftvHCD9KvHOGGICWwMFc4dOOKF5t4iYqnfLK8otCRCRv06FXOzGGyqE8h8ElMvvdw==} + jsonic@2.28.0: + resolution: {integrity: sha512-lo+Yo4ShvJot3A3G6kZOlphRx+GfhR4uK3/O0u9Wg12fggOuvK7TxtteoRtBGeGaqnbHqEHqkj3Uft8obfvP+g==} + hasBin: true + kind-of@3.2.2: resolution: {integrity: sha512-NOW9QQXMoZGg/oqnVNoNTTIFEIid1627WCffUBJEdMxYApq7mNE7CpzucIPc+ZQg25Phej7IJSmX3hO+oblOtQ==} engines: {node: '>=0.10.0'} @@ -2937,6 +2975,10 @@ packages: vite-plus: optional: true + p-limit@7.3.0: + resolution: {integrity: sha512-7cIXg/Z0M5WZRblrsOla88S4wAK+zOQQWeBYfV3qJuJXMr+LnbYjaadrFaS0JILfEDPVqHyKnZ1Z/1d6J9VVUw==} + engines: {node: '>=20'} + pac-proxy-agent@7.2.0: resolution: {integrity: sha512-TEB8ESquiLMc0lV8vcd5Ql/JAKAoyzHFXaStwjkzpOpC5Yv+pIzLfHvjTSdf3vpa2bMiUQrg9i6276yn8666aA==} engines: {node: '>= 14'} @@ -2969,6 +3011,16 @@ packages: parse5@7.3.0: resolution: {integrity: sha512-IInvU7fabl34qmi9gY8XOVxhYyMyuH2xUNpb2q8/Y+7552KlejkRvqvD19nMoUW/uQGGbqNpA6Tufu5FL5BZgw==} + patchright-core@1.60.1: + resolution: {integrity: sha512-CJ6ulXQkis3MfrA379i6Nw8Xf7BPj8/WbQbk7imcZN7T0NOazWieagPUhC2s6MsUVKpvVSVRHCS52PrX3wDtUg==} + engines: {node: '>=18'} + hasBin: true + + patchright@1.60.1: + resolution: {integrity: sha512-9T4bWxQ9gwduSPflE5eAAiGHxjMRio8189oxxGfW3OFxhSGOcvLP8xGrsW2ZHkn8aSUqEb33cr74rV12pvwNgg==} + engines: {node: '>=18'} + hasBin: true + path-key@3.1.1: resolution: {integrity: sha512-ojmeN0qd+y0jszEtoY48r0Peq5dwMEkIlCOu6Q5f41lfkswXuKtYrhgoTpLnyIcHm24Uhqx+5Tqm2InSwLhE6Q==} engines: {node: '>=8'} @@ -3616,6 +3668,10 @@ packages: resolution: {integrity: sha512-42AtmgqjV+X1VpdOfyTGOYRi0/zsoLqtXQckTmqTeybT+BDIbM/Guxo7x3pE2vtpr1ok6xRqM9OpBe+Jyoqyww==} engines: {node: '>=18'} + write-file-atomic@8.0.0: + resolution: {integrity: sha512-dYwyZredl67GyLLIHJnRM3h2PcOmN5SkcgC7eM5DPDEOEl6dLFqVrMg3F1Ea32usj4VSVZtd2H4MtKTNOf6nPg==} + engines: {node: ^22.22.2 || ^24.15.0 || >=26.0.0} + wsl-utils@0.1.0: resolution: {integrity: sha512-h3Fbisa2nKGPxCpm89Hk33lBLsnaGBvctQopaBSOW/uIs6FTe1ATyAnKFJrzVs9vpGdsTe73WF3V4lIsk4Gacw==} engines: {node: '>=18'} @@ -3653,6 +3709,10 @@ packages: resolution: {integrity: sha512-4UEqdc2RYGHZc7Doyqkrqiln3p9X2DZVxaGbwhn2pi7MrRagKaOcIKe8L3OxYcbhXLgLFUS3zAYuQjKBQgmuNg==} engines: {node: ^20.19.0 || ^22.12.0 || >=23} + yocto-queue@1.2.2: + resolution: {integrity: sha512-4LCcse/U2MHZ63HAJVE+v71o7yOdIe4cZ70Wpf8D/IyjDKYQLV5GD46B+hSTjJsvV5PztjvHoU580EftxjDZFQ==} + engines: {node: '>=12.20'} + yoctocolors-cjs@2.1.3: resolution: {integrity: sha512-U/PBtDf35ff0D8X8D0jfdzHYEPFxAI7jJlxZXwCSez5M3190m+QobIfh+sWDWSHMCWWJN2AWamkegn6vr6YBTw==} engines: {node: '>=18'} @@ -3661,6 +3721,11 @@ packages: resolution: {integrity: sha512-CzhO+pFNo8ajLM2d2IW/R93ipy99LWjtwblvC1RsoSUMZgyLbYFr221TnSNT7GjGdYui6P459mw9JH/g/zW2ug==} engines: {node: '>=18'} + zod-to-json-schema@3.25.2: + resolution: {integrity: sha512-O/PgfnpT1xKSDeQYSCfRI5Gy3hPf91mKVDuYLUHZJMiDFptvP41MSnWofm8dnCm0256ZNfZIM7DSzuSMAFnjHA==} + peerDependencies: + zod: ^3.25.28 || ^4 + zod@3.25.76: resolution: {integrity: sha512-gzUt/qt81nXsFGKIFcC3YnfEAx5NkunCfnDlvuBSSFS02bcXu4Lmea0AFIUwbLWxWPx3d9p8S5QoaujKcNQxcQ==} @@ -4681,6 +4746,10 @@ snapshots: '@types/through': 0.0.33 rxjs: 7.8.2 + '@types/jsonic@2.15.1': + dependencies: + jsonic: 2.28.0 + '@types/node-fetch@2.6.13': dependencies: '@types/node': 25.9.1 @@ -4714,6 +4783,10 @@ snapshots: dependencies: '@types/node': 25.9.1 + '@types/write-file-atomic@4.0.3': + dependencies: + '@types/node': 25.9.1 + '@vitest/coverage-v8@4.1.7(vitest@4.1.7)': dependencies: '@bcoe/v8-coverage': 1.0.2 @@ -5403,6 +5476,8 @@ snapshots: fast-deep-equal@3.1.3: {} + fast-json-stable-stringify@2.1.0: {} + fast-string-truncated-width@3.0.3: {} fast-string-width@3.0.2: @@ -5791,6 +5866,8 @@ snapshots: json-with-bigint@3.5.8: {} + jsonic@2.28.0: {} + kind-of@3.2.2: dependencies: is-buffer: 1.1.6 @@ -6192,6 +6269,10 @@ snapshots: '@oxlint/binding-win32-ia32-msvc': 1.67.0 '@oxlint/binding-win32-x64-msvc': 1.67.0 + p-limit@7.3.0: + dependencies: + yocto-queue: 1.2.2 + pac-proxy-agent@7.2.0: dependencies: '@tootallnate/quickjs-emscripten': 0.23.0 @@ -6243,6 +6324,14 @@ snapshots: dependencies: entities: 6.0.1 + patchright-core@1.60.1: {} + + patchright@1.60.1: + dependencies: + patchright-core: 1.60.1 + optionalDependencies: + fsevents: 2.3.2 + path-key@3.1.1: {} path-key@4.0.0: {} @@ -6891,6 +6980,10 @@ snapshots: string-width: 7.2.0 strip-ansi: 7.2.0 + write-file-atomic@8.0.0: + dependencies: + signal-exit: 4.1.0 + wsl-utils@0.1.0: dependencies: is-wsl: 3.1.1 @@ -6927,10 +7020,16 @@ snapshots: y18n: 5.0.8 yargs-parser: 22.0.0 + yocto-queue@1.2.2: {} + yoctocolors-cjs@2.1.3: {} yoctocolors@2.1.2: {} + zod-to-json-schema@3.25.2(zod@4.4.3): + dependencies: + zod: 4.4.3 + zod@3.25.76: {} zod@4.4.3: {} diff --git a/src/ai/cross-encoder.ts b/src/ai/cross-encoder.ts new file mode 100644 index 0000000..1bf19e6 --- /dev/null +++ b/src/ai/cross-encoder.ts @@ -0,0 +1,25 @@ +import { pipeline } from '@huggingface/transformers' + +let crossEncoderInstance: any = null +let isEncoderInitializing = false + +export async function getCrossEncoder() { + if (crossEncoderInstance) return crossEncoderInstance + if (isEncoderInitializing) return null + + isEncoderInitializing = true + try { + const CROSS_ENCODER_MODEL_IDENTIFIER = 'Xenova/ms-marco-MiniLM-L-6-v2' + const transformerPipeline = await (pipeline as any)('feature-extraction', CROSS_ENCODER_MODEL_IDENTIFIER) + + crossEncoderInstance = { + tokenizer: transformerPipeline.tokenizer, + model: transformerPipeline.model, + } + return crossEncoderInstance + } catch { + return null + } finally { + isEncoderInitializing = false + } +} diff --git a/src/ai/ollama-client.ts b/src/ai/ollama-client.ts index 9358ee8..d12619f 100644 --- a/src/ai/ollama-client.ts +++ b/src/ai/ollama-client.ts @@ -15,38 +15,27 @@ const generationResponseSchema = z.object({ }) export class OllamaClient { - static readonly OllamaError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'OllamaError' - } - } - - constructor(private readonly config: Config) {} + constructor(private readonly applicationConfig: Config) {} - async embed(inputTexts: string[]): Promise { - const isInputEmpty = inputTexts.length === 0 + async embed(inputTextsToEmbed: string[]): Promise { + const isInputEmpty = inputTextsToEmbed.length === 0 if (isInputEmpty) return [] - const requestBody = { - model: this.config.ollamaEmbedModel, - input: inputTexts, - } - - const responseData = await this.performOllamaHttpRequest('/v1/embeddings', requestBody) + const responseData = await this.performPostRequest('/v1/embeddings', { + model: this.applicationConfig.ollamaEmbedModel, + input: inputTextsToEmbed, + }) return this.parseEmbeddingsFromResponse(responseData) } - async generate(promptText: string, modelOverride?: string): Promise { - const requestBody = { - model: modelOverride ?? this.config.ollamaModel, + async generate(promptText: string, modelNameOverride?: string): Promise { + const responseData = await this.performPostRequest('/api/generate', { + model: modelNameOverride ?? this.applicationConfig.ollamaModel, prompt: promptText, stream: false, - } - - const responseData = await this.performOllamaHttpRequest('/api/generate', requestBody) - const validatedData = generationResponseSchema.parse(responseData) - return validatedData.response + }) + const validatedGenerationData = generationResponseSchema.parse(responseData) + return validatedGenerationData.response } async validate(): Promise { @@ -54,18 +43,13 @@ export class OllamaClient { try { await this.embed(['ping']) logger.success('Ollama embeddings look good.') - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new OllamaClient.OllamaError(`Ollama validation failed: ${errorMessage}`) + } catch (validationError) { + errorBus.raiseError(`Ollama validation failed`, validationError) } } - private async performOllamaHttpRequest( - apiEndpoint: string, - requestBody: object - ): Promise { - const fullRequestUrl = `${this.config.ollamaUrl}${apiEndpoint}` - + private async performPostRequest(apiEndpoint: string, requestBody: object): Promise { + const fullRequestUrl = `${this.applicationConfig.ollamaUrl}${apiEndpoint}` try { const httpResponse = await fetch(fullRequestUrl, { method: 'POST', @@ -73,47 +57,41 @@ export class OllamaClient { body: JSON.stringify(requestBody), }) - const isResponseSuccessful = httpResponse.ok - if (!isResponseSuccessful) { - let rawErrorBody = '' + const isRequestSuccessful = httpResponse.ok + if (!isRequestSuccessful) { + let errorBodyText = '' try { - rawErrorBody = await httpResponse.text() - } catch (_ignored) { - // Fallback to empty string if body reading fails + errorBodyText = await httpResponse.text() + } catch { + // Ignore body reading errors } - errorBus.emitError(`Ollama HTTP ${httpResponse.status}`, undefined, { + errorBus.raiseError(`Ollama request failed with status ${httpResponse.status}`, undefined, { body: requestBody, - errorBody: rawErrorBody.slice(0, 500), + errorBody: errorBodyText.slice(0, 500), }) - - const errorExcerpt = rawErrorBody.slice(0, 200) - throw new OllamaClient.OllamaError( - `Ollama request failed with status ${httpResponse.status} – ${errorExcerpt}` - ) } - return await httpResponse.json() - } catch (error) { - const isOllamaError = error instanceof OllamaClient.OllamaError - if (isOllamaError) throw error - - const errorMessage = error instanceof Error ? error.message : String(error) - throw new OllamaClient.OllamaError(`Network error while calling Ollama: ${errorMessage}`) + } catch (requestError) { + const isOllamaSpecificError = requestError instanceof Error && requestError.message.includes('Ollama request failed') + if (isOllamaSpecificError) { + throw requestError + } + errorBus.raiseError(`Network error while calling Ollama`, requestError) } } private parseEmbeddingsFromResponse(responseData: unknown): number[][] { - const openAiParseResult = openAiFormatSchema.safeParse(responseData) - if (openAiParseResult.success) { - return openAiParseResult.data.data.map((item) => item.embedding) + const openAiFormatResult = openAiFormatSchema.safeParse(responseData) + if (openAiFormatResult.success) { + return openAiFormatResult.data.data.map((item) => item.embedding) } - const legacyParseResult = legacyFormatSchema.safeParse(responseData) - if (legacyParseResult.success) { - return [legacyParseResult.data.embedding] + const legacyFormatResult = legacyFormatSchema.safeParse(responseData) + if (legacyFormatResult.success) { + return [legacyFormatResult.data.embedding] } - throw new OllamaClient.OllamaError('Unexpected response format from Ollama embeddings endpoint') + return errorBus.raiseError('Unexpected response format from Ollama embeddings endpoint') } } diff --git a/src/ai/rag-orchestrator.ts b/src/ai/rag-orchestrator.ts index 80333d8..3e29c6f 100644 --- a/src/ai/rag-orchestrator.ts +++ b/src/ai/rag-orchestrator.ts @@ -1,392 +1,133 @@ -import { errorBus } from '../utils/error-bus.js' -import { VectorStore, type VectorSearchResult } from '../search/vector-store.js' +import { type VectorSearchResult, VectorStore } from '../search/vector-store.js' +import { type Config } from '../utils/config.js' +import { logger } from '../utils/logger.js' import { OllamaClient } from './ollama-client.js' +import { errorBus } from '../utils/error-bus.js' import { RgSearch } from '../search/rg-search.js' -import { logger } from '../utils/logger.js' import chalk from 'chalk' -import { join } from 'node:path' -import { type Config } from '../utils/config.js' +import { getCrossEncoder } from './cross-encoder.js' -let crossEncoderTokenizer: any = null -let crossEncoderModel: any = null - -async function getCrossEncoder() { - const isAlreadyLoaded = crossEncoderTokenizer && crossEncoderModel - if (isAlreadyLoaded) { - return { tokenizer: crossEncoderTokenizer, model: crossEncoderModel } - } - - const transformers = await import('@huggingface/transformers').catch(() => null) - const isTransformersInstalled = - transformers && transformers.AutoTokenizer && transformers.AutoModelForSequenceClassification - - if (!isTransformersInstalled) { - return null - } - - const { AutoTokenizer, AutoModelForSequenceClassification } = transformers - - crossEncoderTokenizer = await AutoTokenizer.from_pretrained('Xenova/ms-marco-MiniLM-L-6-v2') - crossEncoderModel = await AutoModelForSequenceClassification.from_pretrained( - 'Xenova/ms-marco-MiniLM-L-6-v2', - { dtype: 'int8' } - ) - - return { tokenizer: crossEncoderTokenizer, model: crossEncoderModel } -} - -interface ResearchPlan { - strategy: 'precise' | 'exhaustive' - queries: string[] - hardKeywords: string[] - hydePassage: string - filters: Record -} - -interface ExtractedFact { - fact: string - source_title: string - thread: string -} +import { RAGPlanner } from './rag/planner.js' +import { HybridRetriever } from './rag/retriever.js' +import { FactExtractor } from './rag/extractor.js' +import { ResponseSynthesizer } from './rag/synthesizer.js' +import { type ExtractedFact } from './rag/types.js' export class RagOrchestrator { - private readonly vectorStore: VectorStore + private readonly researchPlanner: RAGPlanner + private readonly hybridRetriever: HybridRetriever + private readonly factExtractor: FactExtractor + private readonly responseSynthesizer: ResponseSynthesizer private readonly ollamaClient: OllamaClient - private readonly ripgrep: RgSearch - constructor(private readonly config: Config) { - this.vectorStore = new VectorStore(config) - this.ollamaClient = new OllamaClient(config) - this.ripgrep = new RgSearch(config) - } + constructor(applicationConfig: Config) { + this.ollamaClient = new OllamaClient(applicationConfig) + const conversationVectorStore = new VectorStore(applicationConfig) + const ripgrepSearchEngine = new RgSearch(applicationConfig) - async answerQuestion(question: string): Promise { - logger.info(`Mightiest Adaptive RAG processing: "${question}"`) + this.researchPlanner = new RAGPlanner(this.ollamaClient) + this.hybridRetriever = new HybridRetriever(applicationConfig, conversationVectorStore, ripgrepSearchEngine) + this.factExtractor = new FactExtractor(this.ollamaClient) + this.responseSynthesizer = new ResponseSynthesizer(this.ollamaClient) + } + async answerQuestion(userQuestion: string): Promise { try { - const researchPlan = await this.developResearchPlan(question) - const isExhaustiveMode = researchPlan.strategy === 'exhaustive' + logger.info(chalk.bold.cyan(`\nQuestion: ${userQuestion}`)) - logger.info(`Plan: ${chalk.bold.yellow(researchPlan.strategy.toUpperCase())}`) - if (isExhaustiveMode) { - logger.warn( - `Exhaustive mode enabled. This may take a while as I'll be doing a deep dive into your history.` - ) - } + logger.info('Developing research plan...') + const researchPlan = await this.researchPlanner.developPlan(userQuestion) - const hasHardKeywords = researchPlan.hardKeywords?.length > 0 - if (hasHardKeywords) { - logger.info(`Hard Keywords detected: ${chalk.gray(researchPlan.hardKeywords.join(', '))}`) - } + logger.info(`Strategy identified: ${chalk.yellow(researchPlan.researchStrategy)}`) - if (researchPlan.hydePassage) { - logger.debug(`HyDE passage generated: "${researchPlan.hydePassage.slice(0, 80)}..."`) - } + logger.info('Executing hybrid search...') + const rawSearchResults = await this.hybridRetriever.retrieve(researchPlan) - const searchResults = await this.executeAdaptiveHybridSearch(researchPlan) - const rerankedResults = await this.crossEncoderRerank(question, searchResults) - const contextFacts = await this.extractFactsWithGranularMapReduce( - question, - rerankedResults, - isExhaustiveMode - ) + const rerankedSearchResults = await this.executeCrossEncoderReranking(userQuestion, rawSearchResults) - logger.info(`Synthesizing final answer from ${contextFacts.length} verified facts...`) - const finalAnswer = await this.generateMightiestResponse( - question, - contextFacts, - researchPlan.strategy + const extractedResearchFacts = await this.factExtractor.extractFacts( + userQuestion, + rerankedSearchResults, + researchPlan.researchStrategy === 'exhaustive' ) - console.log(`\n${chalk.bold.green('Mightiest AI Response:')}\n`) - console.log(finalAnswer) - - this.displaySourceProvenance(contextFacts) - - const feedback = await this.verifyAnswerQuality(question, finalAnswer) - const isImprovementSuggested = feedback.status === 'improvement-needed' - if (isImprovementSuggested) { - logger.warn(`Self-Correction: ${chalk.gray(feedback.suggestion)}`) - } - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - errorBus.emitError(`Mightiest RAG failed: ${errorMessage}`) - } - } - - private async developResearchPlan(originalQuestion: string): Promise { - const plannerPrompt = ` -Analyze: "${originalQuestion}" -1. Strategy: "precise" (specific facts) or "exhaustive" (broad summary/entity history). -2. Variations: 3 semantic search phrases. -3. Hard Keywords: Identify any names, IDs, or unique technical terms for exact matching. -4. HyDE: Write 1-2 sentences that would plausibly appear in a saved answer to this question. Write as if it's content already stored, not as a reply. -Return JSON: {"strategy": "...", "queries": [], "hardKeywords": [], "hydePassage": "...", "filters": {}} -` - try { - const response = await this.ollamaClient.generate(plannerPrompt) - const planJson = this.parseJsonFromResponse(response, {}) - - return { - strategy: planJson.strategy || 'precise', - queries: planJson.queries || [originalQuestion], - hardKeywords: planJson.hardKeywords || [], - hydePassage: planJson.hydePassage || '', - filters: planJson.filters || {}, - } - } catch (error) { - return { - strategy: 'precise', - queries: [originalQuestion], - hardKeywords: [], - hydePassage: '', - filters: {}, - } - } - } + logger.info('Synthesizing mightiest response...') + const finalGeneratedAnswer = await this.responseSynthesizer.synthesize( + userQuestion, + extractedResearchFacts, + researchPlan.researchStrategy + ) - private async executeAdaptiveHybridSearch(plan: ResearchPlan): Promise { - const searchPools: VectorSearchResult[][] = [] + console.log(`\n${chalk.white(finalGeneratedAnswer)}\n`) - for (let i = 0; i < (plan.queries || []).length; i++) { - const searchQuery = plan.queries[i]! - logger.debug(`Executing semantic search [${i + 1}/${plan.queries.length}]: "${searchQuery}"`) - const vectorResults = await this.vectorStore.search(searchQuery, 40) - searchPools.push(vectorResults) - } + this.displaySourceProvenance(extractedResearchFacts) - if (plan.hydePassage) { - logger.debug(`Executing HyDE search: "${plan.hydePassage.slice(0, 60)}..."`) - const hydeResults = await this.vectorStore.search(plan.hydePassage, 40) - searchPools.push(hydeResults) - } + const verificationFeedback = await this.responseSynthesizer.verifyQuality(userQuestion, finalGeneratedAnswer) + const isImprovementSuggested = verificationFeedback.verificationStatus === 'missed-info' - const keywordMatchPool: VectorSearchResult[] = [] - for (const hardKeyword of plan.hardKeywords || []) { - logger.debug(`Executing keyword search: "${hardKeyword}"`) - try { - const matches = await this.ripgrep.captureSearchMatches({ pattern: hardKeyword }) - const convertedMatches: VectorSearchResult[] = matches.map((match) => ({ - meta: { - path: join(this.config.exportDir, match.path), - snippet: match.text, - title: match.path.split('/').pop() || 'Untitled', - id: match.path + match.line, - }, - score: 1.0, - })) - keywordMatchPool.push(...convertedMatches) - } catch (error) { - // Silently skip failed keyword searches + if (isImprovementSuggested) { + logger.warn(`Self-Correction: ${chalk.gray(verificationFeedback.improvementSuggestion)}`) } + } catch (orchestrationError) { + errorBus.emitError('Mightiest RAG pipeline failed', orchestrationError, { question: userQuestion }) } - - const hasKeywordResults = keywordMatchPool.length > 0 - if (hasKeywordResults) { - searchPools.push(keywordMatchPool) - } - - return this.mergeAndFusionRank(searchPools) } - private mergeAndFusionRank(pools: VectorSearchResult[][]): VectorSearchResult[] { - const fusionScores = new Map() - - pools.forEach((pool) => { - pool.forEach((result, rank) => { - const path = result.meta['path'] || 'unknown' - const snippet = result.meta['snippet'] || '' - const uniqueId = result.meta['id'] || `${path}:${snippet}` - - const rankScore = 1 / (60 + rank) - const existingEntry = fusionScores.get(uniqueId) - - if (existingEntry) { - existingEntry.totalScore += rankScore - } else { - fusionScores.set(uniqueId, { result, totalScore: rankScore }) - } - }) - }) - - return Array.from(fusionScores.values()) - .sort((a, b) => b.totalScore - a.totalScore) - .map((entry) => entry.result) - } - - private async crossEncoderRerank( - question: string, - results: VectorSearchResult[] + private async executeCrossEncoderReranking( + userQuestion: string, + searchResults: VectorSearchResult[] ): Promise { - const isResultsEmpty = results.length === 0 - if (isResultsEmpty) return results + if (searchResults.length === 0) return searchResults - const crossEncoder = await getCrossEncoder() - if (!crossEncoder) { - logger.debug( - 'Cross-encoder not available (run: npm install @huggingface/transformers). Skipping rerank.' - ) - return results + const crossEncoderInstance = await getCrossEncoder() + if (!crossEncoderInstance) { + logger.debug('Cross-encoder not available. Skipping rerank.') + return searchResults } - const { tokenizer, model } = crossEncoder - logger.info(`Cross-encoder reranking ${results.length} candidates...`) + try { + const { tokenizer, model } = crossEncoderInstance + logger.info(`Cross-encoder reranking ${searchResults.length} candidates...`) - const RERANK_BATCH_SIZE = 64 - const rerankScores: number[] = new Array(results.length).fill(0) + const RERANKING_BATCH_SIZE = 64 + const calculatedRerankScores: number[] = new Array(searchResults.length).fill(0) - for (let i = 0; i < results.length; i += RERANK_BATCH_SIZE) { - const currentBatch = results.slice(i, i + RERANK_BATCH_SIZE) - const inputPairs = currentBatch.map((res) => [ - question, - (res.meta['snippet'] as string) || '', - ]) + for (let currentBatchOffset = 0; currentBatchOffset < searchResults.length; currentBatchOffset += RERANKING_BATCH_SIZE) { + const currentBatch = searchResults.slice(currentBatchOffset, currentBatchOffset + RERANKING_BATCH_SIZE) + const inputPairsForEncoder = currentBatch.map((result) => [userQuestion, (result.meta['snippet'] as string) || '']) - const tokenizedInputs = await tokenizer( - inputPairs.map((pair) => pair[0]), - { - text_pair: inputPairs.map((pair) => pair[1]), + const tokenizedEncoderInputs = await tokenizer(inputPairsForEncoder.map((pair) => pair[0]), { + text_pair: inputPairsForEncoder.map((pair) => pair[1]), padding: true, truncation: true, - } - ) - - const modelOutput = await model(tokenizedInputs) - const batchLogits: number[] = Array.from(modelOutput.logits.data as Float32Array) - - batchLogits.forEach((logit, offset) => { - rerankScores[i + offset] = logit - }) - } - - return results - .map((result, index) => ({ result, rerankScore: rerankScores[index]! })) - .sort((a, b) => b.rerankScore - a.rerankScore) - .map((entry) => entry.result) - } - - private async extractFactsWithGranularMapReduce( - question: string, - results: VectorSearchResult[], - isExhaustive: boolean - ): Promise { - const POOL_LIMIT_EXHAUSTIVE = 60 - const POOL_LIMIT_PRECISE = 35 - const poolLimit = isExhaustive ? POOL_LIMIT_EXHAUSTIVE : POOL_LIMIT_PRECISE - - const processingPool = results.slice(0, poolLimit) - const isPoolEmpty = processingPool.length === 0 - if (isPoolEmpty) return [] - - const extractedFindings: ExtractedFact[] = [] - const ANALYSIS_BATCH_SIZE = 10 - const totalBatches = Math.ceil(processingPool.length / ANALYSIS_BATCH_SIZE) - - for ( - let batchStartIndex = 0, batchNumber = 1; - batchStartIndex < processingPool.length; - batchStartIndex += ANALYSIS_BATCH_SIZE, batchNumber++ - ) { - const currentBatch = processingPool.slice( - batchStartIndex, - batchStartIndex + ANALYSIS_BATCH_SIZE - ) - logger.info(`Analyzing history snippets... batch ${batchNumber} of ${totalBatches}`) - - const researchPrompt = ` -You are the Researcher. Analyze these snippets from the user's history for the question: "${question}" -Context: -${currentBatch.map((res, index) => `[Node ${batchStartIndex + index}] ${res.meta['title']}: ${res.meta['snippet']}`).join('\n\n')} + }) -Extract every specific fact, mention, date, or piece of code. -Return JSON array: [{"fact": "...", "node_id": N, "thread": "..."}] -` - try { - const response = await this.ollamaClient.generate(researchPrompt) - const extractedFacts = this.parseJsonFromResponse(response, []) + const modelOutputPrediction = await model(tokenizedEncoderInputs) + const batchLogitScores: number[] = Array.from(modelOutputPrediction.logits.data as Float32Array) - extractedFacts.forEach((factEntry: any) => { - const originalSnippet = processingPool[factEntry.node_id] - extractedFindings.push({ - fact: factEntry.fact, - source_title: originalSnippet?.meta['title'] || factEntry.thread || 'Unknown', - thread: factEntry.thread || originalSnippet?.meta['title'] || 'Unknown', - }) - }) - } catch (error) { - currentBatch.forEach((res) => { - extractedFindings.push({ - fact: res.meta['snippet'] as string, - source_title: res.meta['title'] as string, - thread: res.meta['title'] as string, - }) + batchLogitScores.forEach((logit, indexWithinBatch) => { + calculatedRerankScores[currentBatchOffset + indexWithinBatch] = logit }) } - } - return extractedFindings - } - - private async generateMightiestResponse( - question: string, - extractedFacts: ExtractedFact[], - strategy: string - ): Promise { - const synthesisPrompt = ` -You are the Narrator. Synthesize these research findings into a cohesive, mightiest answer for: "${question}" -Strategy: ${strategy} -Findings: -${extractedFacts.map((fact, index) => `[Find ${index}] (${fact.source_title}): ${fact.fact}`).join('\n')} - -INSTRUCTIONS: -1. Provide a comprehensive, authoritative response. -2. If "exhaustive", list ALL relevant conversations and what they contributed. -3. Be specific with names and technical details. -4. Cite everything with [Find N]. - -ANSWER: -` - return this.ollamaClient.generate(synthesisPrompt) - } - - private displaySourceProvenance(extractedFacts: ExtractedFact[]): void { - const uniqueSourceTitles = new Set(extractedFacts.map((fact) => fact.source_title)) - const hasSources = uniqueSourceTitles.size > 0 - - if (hasSources) { - console.log(`\n${chalk.bold.cyan('History Sources Explored:')}`) - uniqueSourceTitles.forEach((title) => console.log(` - ${title}`)) - } - } - - private async verifyAnswerQuality( - question: string, - answer: string - ): Promise<{ status: string; suggestion?: string }> { - const verificationPrompt = ` -Verify the answer. -Question: "${question}" -Answer: "${answer.slice(0, 500)}..." -Did I miss anything important? -Return JSON: {"status": "ok" | "missed-info", "suggestion": "..."} -` - try { - const verificationResponse = await this.ollamaClient.generate(verificationPrompt) - return this.parseJsonFromResponse(verificationResponse, { status: 'ok' }) - } catch (error) { - return { status: 'ok' } + return searchResults + .map((result, resultIndex) => ({ + result, + rerankScore: calculatedRerankScores[resultIndex]! + })) + .sort((a, b) => b.rerankScore - a.rerankScore) + .map((scoredEntry) => scoredEntry.result) + } catch (rerankingError) { + errorBus.emitError('Cross-encoder reranking failed', rerankingError) + return searchResults } } - private parseJsonFromResponse(response: string, defaultValue: any): any { - const jsonMatch = response.match(/(\{[\s\S]*\}|\[[\s\S]*\])/) - if (jsonMatch?.[0]) { - try { - return JSON.parse(jsonMatch[0]) - } catch (error) { - return defaultValue - } + private displaySourceProvenance(extractedResearchFacts: ExtractedFact[]): void { + const uniqueSourceDocumentTitles = new Set(extractedResearchFacts.map((fact) => fact.sourceDocumentTitle)) + if (uniqueSourceDocumentTitles.size > 0) { + console.log(`\n${chalk.bold.cyan('History Sources Explored:')}`) + uniqueSourceDocumentTitles.forEach((documentTitle) => console.log(` - ${documentTitle}`)) } - return defaultValue } } diff --git a/src/ai/rag/extractor.ts b/src/ai/rag/extractor.ts new file mode 100644 index 0000000..dd3065f --- /dev/null +++ b/src/ai/rag/extractor.ts @@ -0,0 +1,82 @@ +import { type OllamaClient } from '../ollama-client.js' +import { type VectorSearchResult } from '../../search/vector-store.js' +import { type ExtractedFact } from './types.js' +import { RAG_PROMPTS } from './prompts.js' +import { logger } from '../../utils/logger.js' +import { errorBus } from '../../utils/error-bus.js' +import jsonic from 'jsonic' + +export class FactExtractor { + constructor(private readonly ollamaClient: OllamaClient) {} + + async extractFacts( + userQuestion: string, + searchResults: VectorSearchResult[], + isExhaustiveStrategy: boolean + ): Promise { + const MAXIMUM_NODES_FOR_EXHAUSTIVE = 60 + const MAXIMUM_NODES_FOR_PRECISE = 35 + const poolSizeLimit = isExhaustiveStrategy ? MAXIMUM_NODES_FOR_EXHAUSTIVE : MAXIMUM_NODES_FOR_PRECISE + + const candidateNodesPool = searchResults.slice(0, poolSizeLimit) + if (candidateNodesPool.length === 0) return [] + + const extractedResearchFindings: ExtractedFact[] = [] + const ANALYSIS_BATCH_SIZE = 10 + const totalBatchesToProcess = Math.ceil(candidateNodesPool.length / ANALYSIS_BATCH_SIZE) + + for (let currentBatchOffset = 0; currentBatchOffset < candidateNodesPool.length; currentBatchOffset += ANALYSIS_BATCH_SIZE) { + const currentBatchNumber = Math.floor(currentBatchOffset / ANALYSIS_BATCH_SIZE) + 1 + const currentResultsBatch = candidateNodesPool.slice(currentBatchOffset, currentBatchOffset + ANALYSIS_BATCH_SIZE) + + logger.info(`Analyzing history snippets... batch ${currentBatchNumber} of ${totalBatchesToProcess}`) + + const batchContextText = currentResultsBatch + .map((result, indexWithinBatch) => `[Node ${currentBatchOffset + indexWithinBatch}] ${result.meta['title']}: ${result.meta['snippet']}`) + .join('\n\n') + + const researcherPrompt = RAG_PROMPTS.informationResearcher(userQuestion, batchContextText) + + try { + const ollamaResponseText = await this.ollamaClient.generate(researcherPrompt) + const extractedFactsList = this.extractJsonArrayFromResponse(ollamaResponseText) + + for (const factEntry of extractedFactsList) { + const originalSourceNode = candidateNodesPool[factEntry.node_id] + extractedResearchFindings.push({ + factContent: factEntry.fact, + sourceDocumentTitle: originalSourceNode?.meta['title'] || factEntry.thread || 'Unknown', + conversationThreadTitle: factEntry.thread || originalSourceNode?.meta['title'] || 'Unknown', + }) + } + } catch (extractionError) { + errorBus.emitError(`Fact extraction batch ${currentBatchNumber} failed`, extractionError) + for (const fallbackResult of currentResultsBatch) { + extractedResearchFindings.push({ + factContent: fallbackResult.meta['snippet'] as string, + sourceDocumentTitle: fallbackResult.meta['title'] as string, + conversationThreadTitle: fallbackResult.meta['title'] as string, + }) + } + } + } + + return extractedResearchFindings + } + + private extractJsonArrayFromResponse(responseText: string): any[] { + const jsonBlockRegex = /(\{[\s\S]*\}|\[[\s\S]*\])/ + const regexMatchResult = responseText.match(jsonBlockRegex) + + if (regexMatchResult?.[0]) { + try { + const parsedData = jsonic(regexMatchResult[0]) + return Array.isArray(parsedData) ? parsedData : [] + } catch (parsingError) { + errorBus.emitError('Failed to parse researcher JSON', parsingError, { response: responseText }) + return [] + } + } + return [] + } +} diff --git a/src/ai/rag/planner.ts b/src/ai/rag/planner.ts new file mode 100644 index 0000000..5f36e95 --- /dev/null +++ b/src/ai/rag/planner.ts @@ -0,0 +1,49 @@ +import { type OllamaClient } from '../ollama-client.js' +import { type ResearchPlan } from './types.js' +import { RAG_PROMPTS } from './prompts.js' +import { errorBus } from '../../utils/error-bus.js' +import jsonic from 'jsonic' + +export class RAGPlanner { + constructor(private readonly ollamaClient: OllamaClient) {} + + async developPlan(userQuestion: string): Promise { + const researchPlannerPrompt = RAG_PROMPTS.researchPlanner(userQuestion) + try { + const ollamaResponseText = await this.ollamaClient.generate(researchPlannerPrompt) + const parsedPlanJson = this.extractJsonFromResponse(ollamaResponseText) + + return { + researchStrategy: parsedPlanJson.strategy || 'precise', + searchQueries: parsedPlanJson.queries || [userQuestion], + hardKeywordsForExactMatch: parsedPlanJson.hardKeywords || [], + hypotheticalDocumentEmbeddingsPassage: parsedPlanJson.hydePassage || '', + metadataFilters: parsedPlanJson.filters || {}, + } + } catch (planningError) { + errorBus.emitError('Research planner fallback triggered', planningError) + return { + researchStrategy: 'precise', + searchQueries: [userQuestion], + hardKeywordsForExactMatch: [], + hypotheticalDocumentEmbeddingsPassage: '', + metadataFilters: {}, + } + } + } + + private extractJsonFromResponse(responseText: string): any { + const jsonBlockRegex = /(\{[\s\S]*\}|\[[\s\S]*\])/ + const regexMatchResult = responseText.match(jsonBlockRegex) + + if (regexMatchResult?.[0]) { + try { + return jsonic(regexMatchResult[0]) + } catch (parsingError) { + errorBus.emitError('Failed to parse planner JSON', parsingError, { response: responseText }) + return {} + } + } + return {} + } +} diff --git a/src/ai/rag/prompts.ts b/src/ai/rag/prompts.ts new file mode 100644 index 0000000..54e2f5f --- /dev/null +++ b/src/ai/rag/prompts.ts @@ -0,0 +1,42 @@ +export const RAG_PROMPTS = { + researchPlanner: (userQuestion: string) => ` +Analyze: "${userQuestion}" +1. Strategy: "precise" (specific facts) or "exhaustive" (broad summary/entity history). +2. Variations: 3 semantic search phrases. +3. Hard Keywords: Identify any names, IDs, or unique technical terms for exact matching. +4. HyDE: Write 1-2 sentences that would plausibly appear in a saved answer to this question. Write as if it's content already stored, not as a reply. +Return JSON: {"strategy": "...", "queries": [], "hardKeywords": [], "hydePassage": "...", "filters": {}} +`, + + informationResearcher: (userQuestion: string, researchContextSnippets: string) => ` +You are the Researcher. Analyze these snippets from the user's history for the question: "${userQuestion}" +Context: +${researchContextSnippets} + +Extract every specific fact, mention, date, or piece of code. +Return JSON array: [{"fact": "...", "node_id": N, "thread": "..."}] +`, + + answerNarrator: (userQuestion: string, researchStrategy: string, researchFindings: string) => ` +You are the Narrator. Synthesize these research findings into a cohesive, mightiest answer for: "${userQuestion}" +Strategy: ${researchStrategy} +Findings: +${researchFindings} + +INSTRUCTIONS: +1. Provide a comprehensive, authoritative response. +2. If "exhaustive", list ALL relevant conversations and what they contributed. +3. Be specific with names and technical details. +4. Cite everything with [Find N]. + +ANSWER: +`, + + answerVerifier: (userQuestion: string, generatedAnswer: string) => ` +Verify the answer. +Question: "${userQuestion}" +Answer: "${generatedAnswer.slice(0, 500)}..." +Did I miss anything important? +Return JSON: {"status": "ok" | "missed-info", "suggestion": "..."} +` +} as const diff --git a/src/ai/rag/retriever.ts b/src/ai/rag/retriever.ts new file mode 100644 index 0000000..e6c25f6 --- /dev/null +++ b/src/ai/rag/retriever.ts @@ -0,0 +1,85 @@ +import { type VectorStore, type VectorSearchResult } from '../../search/vector-store.js' +import { type RgSearch } from '../../search/rg-search.js' +import { type ResearchPlan } from './types.js' +import { logger } from '../../utils/logger.js' +import { join } from 'node:path' +import { type Config } from '../../utils/config.js' + +export class HybridRetriever { + constructor( + private readonly applicationConfig: Config, + private readonly conversationVectorStore: VectorStore, + private readonly ripgrepSearchEngine: RgSearch + ) {} + + async retrieve(researchPlan: ResearchPlan): Promise { + const searchResultPools: VectorSearchResult[][] = [] + + for (const [queryIndex, searchQuery] of researchPlan.searchQueries.entries()) { + logger.debug(`Executing semantic search [${queryIndex + 1}/${researchPlan.searchQueries.length}]: "${searchQuery}"`) + const semanticVectorResults = await this.conversationVectorStore.search(searchQuery, 40) + searchResultPools.push(semanticVectorResults) + } + + if (researchPlan.hypotheticalDocumentEmbeddingsPassage) { + logger.debug(`Executing HyDE search: "${researchPlan.hypotheticalDocumentEmbeddingsPassage.slice(0, 60)}..."`) + const hydeVectorResults = await this.conversationVectorStore.search(researchPlan.hypotheticalDocumentEmbeddingsPassage, 40) + searchResultPools.push(hydeVectorResults) + } + + const keywordMatchPool: VectorSearchResult[] = [] + for (const hardKeyword of researchPlan.hardKeywordsForExactMatch) { + logger.debug(`Executing keyword search: "${hardKeyword}"`) + try { + const ripgrepMatches = await this.ripgrepSearchEngine.captureSearchMatches({ pattern: hardKeyword }) + const convertedMatches: VectorSearchResult[] = ripgrepMatches.map((match) => ({ + meta: { + path: join(this.applicationConfig.exportDir, match.path), + snippet: match.text, + title: match.path.split('/').pop() || 'Untitled', + id: match.path + match.line, + }, + score: 1.0, + })) + keywordMatchPool.push(...convertedMatches) + } catch (ripgrepError) { + // Skip failed keyword searches silently as per design + } + } + + if (keywordMatchPool.length > 0) { + searchResultPools.push(keywordMatchPool) + } + + return this.executeReciprocalRankFusion(searchResultPools) + } + + private executeReciprocalRankFusion(resultPools: VectorSearchResult[][]): VectorSearchResult[] { + const fusionRankScores = new Map() + + for (const individualPool of resultPools) { + for (const [itemRank, searchResult] of individualPool.entries()) { + const filePath = searchResult.meta['path'] || 'unknown' + const textSnippet = searchResult.meta['snippet'] || '' + const uniqueResultIdentifier = searchResult.meta['id'] || `${filePath}:${textSnippet}` + + const rankConstant = 60 + const itemRankScore = 1 / (rankConstant + itemRank) + const existingFusionEntry = fusionRankScores.get(uniqueResultIdentifier) + + if (existingFusionEntry) { + existingFusionEntry.cumulativeFusionScore += itemRankScore + } else { + fusionRankScores.set(uniqueResultIdentifier, { + searchResult, + cumulativeFusionScore: itemRankScore + }) + } + } + } + + return Array.from(fusionRankScores.values()) + .sort((firstEntry, secondEntry) => secondEntry.cumulativeFusionScore - firstEntry.cumulativeFusionScore) + .map((fusionEntry) => fusionEntry.searchResult) + } +} diff --git a/src/ai/rag/synthesizer.ts b/src/ai/rag/synthesizer.ts new file mode 100644 index 0000000..8df12bb --- /dev/null +++ b/src/ai/rag/synthesizer.ts @@ -0,0 +1,52 @@ +import { type OllamaClient } from '../ollama-client.js' +import { type ExtractedFact } from './types.js' +import { RAG_PROMPTS } from './prompts.js' +import { errorBus } from '../../utils/error-bus.js' +import jsonic from 'jsonic' + +export class ResponseSynthesizer { + constructor(private readonly ollamaClient: OllamaClient) {} + + async synthesize(userQuestion: string, researchFacts: ExtractedFact[], researchStrategy: string): Promise { + try { + const researchFindingsSummaryText = researchFacts + .map((fact, index) => `[Find ${index}] (${fact.sourceDocumentTitle}): ${fact.factContent}`) + .join('\n') + + const narratorPrompt = RAG_PROMPTS.answerNarrator(userQuestion, researchStrategy, researchFindingsSummaryText) + return await this.ollamaClient.generate(narratorPrompt) + } catch (synthesisError) { + return errorBus.raiseError('Response synthesis failed', synthesisError) + } + } + + async verifyQuality(userQuestion: string, generatedAnswer: string): Promise<{ verificationStatus: string; improvementSuggestion?: string }> { + const verifierPrompt = RAG_PROMPTS.answerVerifier(userQuestion, generatedAnswer) + try { + const ollamaResponseText = await this.ollamaClient.generate(verifierPrompt) + const parsedVerificationJson = this.extractJsonFromResponse(ollamaResponseText) + return { + verificationStatus: parsedVerificationJson.status || 'ok', + improvementSuggestion: parsedVerificationJson.suggestion + } + } catch (verificationError) { + errorBus.emitError('Answer verification failed', verificationError) + return { verificationStatus: 'ok' } + } + } + + private extractJsonFromResponse(responseText: string): any { + const jsonBlockRegex = /(\{[\s\S]*\}|\[[\s\S]*\])/ + const regexMatchResult = responseText.match(jsonBlockRegex) + + if (regexMatchResult?.[0]) { + try { + return jsonic(regexMatchResult[0]) + } catch (parsingError) { + errorBus.emitError('Failed to parse verifier JSON', parsingError, { response: responseText }) + return {} + } + } + return {} + } +} diff --git a/src/ai/rag/types.ts b/src/ai/rag/types.ts new file mode 100644 index 0000000..f3be609 --- /dev/null +++ b/src/ai/rag/types.ts @@ -0,0 +1,23 @@ +import { type VectorSearchResult } from '../../search/vector-store.js' + +export interface ResearchPlan { + researchStrategy: 'precise' | 'exhaustive' + searchQueries: string[] + hardKeywordsForExactMatch: string[] + hypotheticalDocumentEmbeddingsPassage: string + metadataFilters: Record +} + +export interface ExtractedFact { + factContent: string + sourceDocumentTitle: string + conversationThreadTitle: string +} + +export interface RagProcessingStepState { + originalUserQuestion: string + developedResearchPlan?: ResearchPlan + retrievedSearchResults?: VectorSearchResult[] + extractedResearchFacts?: ExtractedFact[] + generatedFinalAnswer?: string +} diff --git a/src/benchmark.ts b/src/benchmark.ts index c213d27..170c4d6 100644 --- a/src/benchmark.ts +++ b/src/benchmark.ts @@ -1,13 +1,13 @@ import { performance } from 'node:perf_hooks' import { existsSync } from 'node:fs' import { join } from 'node:path' -import { config } from './utils/config.js' +import { config as applicationConfiguration } from './utils/config.js' import { errorBus } from './utils/error-bus.js' import { logger } from './utils/logger.js' import { VectorStore } from './search/vector-store.js' import { RagOrchestrator } from './ai/rag-orchestrator.js' -const BENCHMARK_QUERIES = [ +const TEST_BENCHMARK_QUERIES = [ 'What TypeScript patterns have I used in past projects?', 'Which npm packages have I discussed installing?', 'What errors or bugs did I troubleshoot recently?', @@ -15,74 +15,76 @@ const BENCHMARK_QUERIES = [ 'What architecture decisions did I make?', ] -async function runBenchmark(): Promise { - const indexJsonPath = join(config.vectorIndexPath, 'index.json') - const isIndexPresent = existsSync(indexJsonPath) - if (!isIndexPresent) { - logger.error('No vector index found. Build the index first via the main menu.') - process.exit(1) +async function runPerformanceBenchmark(): Promise { + const vectorIndexMetadataFilePath = join(applicationConfiguration.vectorIndexPath, 'index.json') + + if (!existsSync(vectorIndexMetadataFilePath)) { + errorBus.raiseError('No vector index found. Build the index first via the main menu.') } - logger.info(`Starting benchmark with ${BENCHMARK_QUERIES.length} queries...`) + logger.info(`Starting benchmark with ${TEST_BENCHMARK_QUERIES.length} queries...`) - const benchmarkVectorStore = new VectorStore(config) + const benchmarkVectorStore = new VectorStore(applicationConfiguration) await benchmarkVectorStore.validate() - const ragOrchestrator = new RagOrchestrator(config) - const benchmarkResults: { query: string; durationMs: number; isFailure: boolean }[] = [] + const benchmarkRagOrchestrator = new RagOrchestrator(applicationConfiguration) + const benchmarkRunResults: { queryText: string; durationMilliseconds: number; isFailure: boolean }[] = [] - for (let queryIndex = 0; queryIndex < BENCHMARK_QUERIES.length; queryIndex++) { - const currentQuery = BENCHMARK_QUERIES[queryIndex]! - logger.info(`[${queryIndex + 1}/${BENCHMARK_QUERIES.length}] "${currentQuery}"`) + for (let queryIndex = 0; queryIndex < TEST_BENCHMARK_QUERIES.length; queryIndex++) { + const currentBenchmarkQuery = TEST_BENCHMARK_QUERIES[queryIndex]! + logger.info(`[${queryIndex + 1}/${TEST_BENCHMARK_QUERIES.length}] "${currentBenchmarkQuery}"`) - const startTime = performance.now() - let isFailure = false + const startTimeStamp = performance.now() + let isQuerySuccessful = true try { - await ragOrchestrator.answerQuestion(currentQuery) - } catch (error) { - isFailure = true - errorBus.emitError('Benchmark query failed', error, { query: currentQuery }) + await benchmarkRagOrchestrator.answerQuestion(currentBenchmarkQuery) + } catch (queryExecutionError) { + isQuerySuccessful = false + errorBus.emitError('Benchmark query failed', queryExecutionError, { query: currentBenchmarkQuery }) } - const durationMs = Math.round(performance.now() - startTime) - benchmarkResults.push({ query: currentQuery, durationMs, isFailure }) + const durationMilliseconds = Math.round(performance.now() - startTimeStamp) + benchmarkRunResults.push({ + queryText: currentBenchmarkQuery, + durationMilliseconds, + isFailure: !isQuerySuccessful + }) - if (isFailure) { - logger.warn(`Query failed after ${durationMs}ms`) + if (!isQuerySuccessful) { + logger.warn(`Query failed after ${durationMilliseconds}ms`) } else { - logger.success(`Done in ${durationMs}ms`) + logger.success(`Done in ${durationMilliseconds}ms`) } } - const successfulResults = benchmarkResults.filter((result) => !result.isFailure) - const failedResults = benchmarkResults.filter((result) => result.isFailure) + const successfulBenchmarkRuns = benchmarkRunResults.filter((result) => !result.isFailure) - const totalSuccessfulDuration = successfulResults.reduce( - (accumulator, result) => accumulator + result.durationMs, + const totalSuccessfulDurationMilliseconds = successfulBenchmarkRuns.reduce( + (cumulativeDuration, runResult) => cumulativeDuration + runResult.durationMilliseconds, 0 ) - const averageLatencyMs = - successfulResults.length > 0 - ? Math.round(totalSuccessfulDuration / successfulResults.length) - : 0 + + const averageLatencyMilliseconds = successfulBenchmarkRuns.length > 0 + ? Math.round(totalSuccessfulDurationMilliseconds / successfulBenchmarkRuns.length) + : 0 logger.info('--- Benchmark Results ---') - benchmarkResults.forEach((result, index) => { - const statusSymbol = result.isFailure ? '✗' : '✓' - logger.info(` ${statusSymbol} [${index + 1}] ${result.durationMs}ms — ${result.query}`) + benchmarkRunResults.forEach((runResult, index) => { + const statusSuccessSymbol = runResult.isFailure ? '✗' : '✓' + logger.info(` ${statusSuccessSymbol} [${index + 1}] ${runResult.durationMilliseconds}ms — ${runResult.queryText}`) }) - logger.info(`Successful: ${successfulResults.length}/${benchmarkResults.length}`) - logger.info(`Average latency: ${averageLatencyMs}ms`) + logger.info(`Successful: ${successfulBenchmarkRuns.length}/${benchmarkRunResults.length}`) + logger.info(`Average latency: ${averageLatencyMilliseconds}ms`) - const hasFailures = failedResults.length > 0 - if (hasFailures) { - logger.warn(`${failedResults.length} queries failed — run with DEBUG=true for details`) + const hasAnyFailuresOccurred = benchmarkRunResults.some(result => result.isFailure) + if (hasAnyFailuresOccurred) { + logger.warn(`Some queries failed — run with DEBUG=true for details`) } } -runBenchmark().catch((error) => { - errorBus.emitError('Benchmark execution failed', error) +runPerformanceBenchmark().catch((executionError) => { + errorBus.emitError('Benchmark execution failed', executionError) process.exit(1) }) diff --git a/src/export/file-writer.ts b/src/export/file-writer.ts index b3ac772..815487c 100644 --- a/src/export/file-writer.ts +++ b/src/export/file-writer.ts @@ -1,64 +1,40 @@ -import { join } from 'node:path' -import { writeFileSync, existsSync, mkdirSync } from 'node:fs' +import { join, dirname } from 'node:path' +import fileSystem from 'node:fs/promises' +import writeFileAtomic from 'write-file-atomic' import { type Config } from '../utils/config.js' -import type { ExtractedConversation } from '../scraper/conversation-extractor.js' +import { type ExtractedConversation } from '../scraper/extractor/types.js' import { sanitizeFilename, sanitizeSpaceName } from './sanitizer.js' +import { errorBus } from '../utils/error-bus.js' export class FileWriter { - static readonly WriteError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'FileWriteError' - } - } - - constructor(private readonly config: Config) { - this.ensureRootExportDirectoryExists() - } + constructor(private readonly applicationConfig: Config) {} - write(conversation: ExtractedConversation): string { + async write(extractedConversation: ExtractedConversation): Promise { try { - const destinationFilePath = this.constructDestinationFilePath(conversation) - const markdownContent = this.formatConversationAsMarkdown(conversation) - - this.ensureSpaceDirectoryExists(conversation.spaceName) + const destinationFilePath = this.constructDestinationPath(extractedConversation) + const formattedMarkdownContent = this.formatAsMarkdown(extractedConversation) - writeFileSync(destinationFilePath, markdownContent, 'utf-8') + await fileSystem.mkdir(dirname(destinationFilePath), { recursive: true }) + await (writeFileAtomic as any)(destinationFilePath, formattedMarkdownContent, 'utf8') return destinationFilePath - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new FileWriter.WriteError( - `Failed to write conversation ${conversation.id}: ${errorMessage}` - ) - } - } - - private ensureRootExportDirectoryExists(): void { - if (!existsSync(this.config.exportDir)) { - mkdirSync(this.config.exportDir, { recursive: true }) - } - } - - private ensureSpaceDirectoryExists(spaceName: string): void { - const spaceSpecificDirectory = join(this.config.exportDir, sanitizeSpaceName(spaceName)) - if (!existsSync(spaceSpecificDirectory)) { - mkdirSync(spaceSpecificDirectory, { recursive: true }) + } catch (writeError) { + return errorBus.raiseError(`Failed to write conversation ${extractedConversation.conversationId}`, writeError) } } - private constructDestinationFilePath(conversation: ExtractedConversation): string { - const safeSpaceName = sanitizeSpaceName(conversation.spaceName) - const safeFileTitle = sanitizeFilename(conversation.title) - const fileNameWithIdSuffix = `${safeFileTitle} (${conversation.id}).md` - return join(this.config.exportDir, safeSpaceName, fileNameWithIdSuffix) + private constructDestinationPath(extractedConversation: ExtractedConversation): string { + const safeSpaceName = sanitizeSpaceName(extractedConversation.conversationSpaceName) + const safeTitle = sanitizeFilename(extractedConversation.conversationTitle) + const filenameWithIdSuffix = `${safeTitle} (${extractedConversation.conversationId}).md` + return join(this.applicationConfig.exportDir, safeSpaceName, filenameWithIdSuffix) } - private formatConversationAsMarkdown(conversation: ExtractedConversation): string { - const headerTitle = `# ${conversation.title}\n\n` + private formatAsMarkdown(extractedConversation: ExtractedConversation): string { + const headerTitle = `# ${extractedConversation.conversationTitle}\n\n` const metadataBlock = - `**Space:** ${conversation.spaceName} \n` + - `**ID:** ${conversation.id} \n` + - `**Date:** ${conversation.timestamp.toISOString()} \n\n` - return headerTitle + metadataBlock + conversation.content + `**Space:** ${extractedConversation.conversationSpaceName} \n` + + `**ID:** ${extractedConversation.conversationId} \n` + + `**Date:** ${extractedConversation.extractionTimestamp.toISOString()} \n\n` + return headerTitle + metadataBlock + extractedConversation.formattedMarkdownContent } } diff --git a/src/export/sanitizer.ts b/src/export/sanitizer.ts index 5c72e15..64fac0c 100644 --- a/src/export/sanitizer.ts +++ b/src/export/sanitizer.ts @@ -1,10 +1,10 @@ import sanitize from 'sanitize-filename' -export function sanitizeFilename(filename: string): string { +export function sanitizeFilename(rawFilename: string): string { const ILLEGAL_CHARACTER_REPLACEMENT = '_' const MAXIMUM_FILENAME_LENGTH = 100 - const safeFilename = sanitize(filename, { + const safeFilename = sanitize(rawFilename, { replacement: ILLEGAL_CHARACTER_REPLACEMENT, }) @@ -13,10 +13,10 @@ export function sanitizeFilename(filename: string): string { .substring(0, MAXIMUM_FILENAME_LENGTH) } -export function sanitizeSpaceName(spaceName: string): string { - return sanitizeFilename(spaceName) +export function sanitizeSpaceName(rawSpaceName: string): string { + return sanitizeFilename(rawSpaceName) } -export function sanitizeMarkdownContent(rawMarkdown: string): string { - return rawMarkdown || '' +export function sanitizeMarkdownContent(rawMarkdownText: string): string { + return rawMarkdownText || '' } diff --git a/src/index.ts b/src/index.ts index fe84b34..23bd081 100644 --- a/src/index.ts +++ b/src/index.ts @@ -1,17 +1,15 @@ import { errorBus } from './utils/error-bus.js' import { Repl } from './repl/index.js' -import { config } from './utils/config.js' +import { config as applicationConfiguration } from './utils/config.js' -/** - * Entry point for the Perplexity History Export application. - */ -async function bootstrapApplication(): Promise { +async function bootstrapApplicationEntryPoint(): Promise { try { - const interactiveRepl = new Repl(config) - await interactiveRepl.start() + const interactiveApplicationRepl = new Repl(applicationConfiguration) + await interactiveApplicationRepl.start() } catch (initializationError) { errorBus.emitError('Application failed to start', initializationError) + process.exit(1) } } -bootstrapApplication() +bootstrapApplicationEntryPoint() diff --git a/src/repl/commands.ts b/src/repl/commands.ts index 4d5bf5e..96d66e7 100644 --- a/src/repl/commands.ts +++ b/src/repl/commands.ts @@ -1,316 +1,42 @@ -import { type Page } from '@playwright/test' -import { errorBus } from '../utils/error-bus.js' -import { input, select, confirm } from '@inquirer/prompts' -import { rmSync } from 'node:fs' -import { sep } from 'node:path' -import { BrowserManager } from '../scraper/browser.js' import { CheckpointManager } from '../scraper/checkpoint-manager.js' -import { WorkerPool } from '../scraper/worker-pool.js' import { SearchOrchestrator } from '../search/search-orchestrator.js' -import { logger } from '../utils/logger.js' import { showHelp } from './help.js' -import { LibraryDiscovery } from '../scraper/library-discovery.js' import { type Config } from '../utils/config.js' +import { ExportHandler } from './handlers/export.js' +import { SearchHandler } from './handlers/search.js' +import { MaintenanceHandler } from './handlers/maintenance.js' export class CommandHandler { - static readonly ScraperError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'ScraperError' - } - } - - static readonly SearchError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'SearchError' - } - } - - static readonly VectorizeError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'VectorizeError' - } - } - - static readonly ValidationError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'ValidationError' - } - } + private readonly libraryExportHandler: ExportHandler + private readonly conversationSearchHandler: SearchHandler + private readonly systemMaintenanceHandler: MaintenanceHandler - static readonly ResetError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'ResetError' - } - } - - private readonly checkpointManager: CheckpointManager - private readonly searchOrchestrator: SearchOrchestrator - - constructor(private readonly config: Config) { - this.checkpointManager = new CheckpointManager(config) - this.searchOrchestrator = new SearchOrchestrator(config) - } + constructor(applicationConfig: Config) { + const activeCheckpointManager = new CheckpointManager(applicationConfig) + const activeSearchOrchestrator = new SearchOrchestrator(applicationConfig) - async handleStartLibraryExport(): Promise { - try { - await this.executeFullScrapingFlow() - } catch (error) { - errorBus.emitError('Scraper failed', error) - logger.info( - '\nNote: Check "debug/api-diagnostics.jsonl" for details if the failure is related to API response changes.' - ) - } + this.libraryExportHandler = new ExportHandler(applicationConfig, activeCheckpointManager, activeSearchOrchestrator) + this.conversationSearchHandler = new SearchHandler(applicationConfig, activeCheckpointManager, activeSearchOrchestrator) + this.systemMaintenanceHandler = new MaintenanceHandler(applicationConfig, activeCheckpointManager, activeSearchOrchestrator) } async handleScraperWizard(): Promise { - const progress = this.checkpointManager.getProcessingProgress() - const hasExistingProgress = progress.total > 0 - - if (hasExistingProgress) { - await this.promptUserForCheckpointAction() - } - - await this.executeFullScrapingFlow() + await this.libraryExportHandler.handleScraperWizard() } async handleSearchWizard(): Promise { - const query = await this.promptForSearchQuery() - let mode = (await this.promptForSearchMode()) as 'auto' | 'vector' | 'rg' | 'rag' - - const ripgrepOptions = { - pattern: query, - caseSensitive: false, - wholeWord: false, - regex: false, - } - - try { - const isSemanticMode = mode === 'auto' || mode === 'vector' || mode === 'rag' - if (isSemanticMode) { - try { - await this.searchOrchestrator.validateVectorSearch() - } catch (error) { - if (mode === 'auto') { - logger.warn( - 'Ollama is not available (required for semantic features). Falling back to Exact Text search (ripgrep).' - ) - mode = 'rg' - } else { - const errorMessage = error instanceof Error ? error.message : String(error) - errorBus.emitError(errorMessage) - logger.info('Start Ollama with the embedding model, then run "vectorize".') - return - } - } - } - - logger.info(`Searching for: "${query}" (mode: ${mode})\n`) - await this.searchOrchestrator.search(query, mode, ripgrepOptions) - } catch (error) { - if (error instanceof Error) { - errorBus.emitError(error.message, error) - } - } + await this.conversationSearchHandler.handleSearchWizard() } async handleVectorizeWizard(): Promise { - const shouldRebuildIndex = await confirm({ - message: 'Rebuild the vector index from exports now?', - default: true, - }) - - if (!shouldRebuildIndex) { - logger.info('Vectorization cancelled.') - return - } - - try { - await this.searchOrchestrator.validateVectorSearch() - } catch (error) { - await this.handleVectorSearchValidationRetry(error) - return - } - - await this.searchOrchestrator.vectorizeNow() + await this.conversationSearchHandler.handleVectorizeWizard() } async handleDataReset(): Promise { - const isCertainOfReset = await confirm({ - message: - '⚠️ This will delete all stored checkpoints, authentication data, and vector index. Are you sure?', - default: false, - }) - - if (!isCertainOfReset) { - logger.info('Reset cancelled.') - return - } - - try { - this.wipeStorageDirectory() - this.checkpointManager.resetCheckpoint() - logger.success('✅ Storage folder deleted. All progress has been reset.') - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new CommandHandler.ResetError(`Failed to reset: ${errorMessage}`) - } + await this.systemMaintenanceHandler.handleDataReset() } handleShowHelp(): void { showHelp() } - - private async executeFullScrapingFlow(): Promise { - const browserManager = new BrowserManager(this.config) - - try { - const activePage = await browserManager.launch() - - const isDiscoveryRequired = !this.checkpointManager.isDiscoveryPhaseComplete() - if (isDiscoveryRequired) { - await this.runDiscoveryPhase(activePage) - } - - const pendingConversations = this.checkpointManager.getPendingConversations() - const hasPendingConversations = pendingConversations.length > 0 - - if (!hasPendingConversations) { - logger.success('All conversations already processed!') - return - } - - await this.runExtractionPhase(browserManager, pendingConversations) - - logger.success('\n✨ Export complete!') - logger.info( - '\nNote: If some conversations were missed or the format looks wrong, please check "debug/api-diagnostics.jsonl" and consider opening a GitHub issue with that file attached.' - ) - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new CommandHandler.ScraperError(`Scraping failed: ${errorMessage}`) - } finally { - await browserManager.close() - } - } - - private async runDiscoveryPhase(page: Page): Promise { - logger.info('\n=== Phase 1: Library Discovery ===\n') - const discoveryTool = new LibraryDiscovery() - const discoveredConversations = await discoveryTool.discoverAllConversationsFromLibrary(page) - this.checkpointManager.setDiscoveredConversations(discoveredConversations) - } - - private async runExtractionPhase( - browserManager: BrowserManager, - pendingConversations: any[] - ): Promise { - logger.info(`\n=== Phase 2: Parallel Extraction (${pendingConversations.length} pending) ===\n`) - - const activeBrowser = browserManager.browserInstance - if (!activeBrowser) { - throw new CommandHandler.ScraperError('Browser was not initialized') - } - - const workerPool = new WorkerPool(this.config, this.checkpointManager, activeBrowser) - await workerPool.initialize() - await workerPool.processConversations(pendingConversations) - await workerPool.close() - } - - private async promptUserForCheckpointAction(): Promise { - const currentProgress = this.checkpointManager.getProcessingProgress() - - const actionChoices = [ - { name: 'Resume (Continue processing known threads)', value: 'resume' }, - { name: 'Sync (Re-scan library for new threads and updates)', value: 'update' }, - { name: 'Start Over (Re-scan and re-process everything)', value: 'restart' }, - { name: 'Cancel', value: 'cancel' }, - ] - - const selectedAction = await select({ - message: `Found checkpoint (${currentProgress.processed}/${currentProgress.total} processed). What do you want to do?`, - choices: actionChoices, - }) - - if (selectedAction === 'cancel') { - logger.info('Start cancelled.') - process.exit(0) - } - - if (selectedAction === 'restart') { - this.checkpointManager.resetCheckpoint() - } else if (selectedAction === 'update') { - this.checkpointManager.prepareForUpdateRun() - } - } - - private async promptForSearchQuery(): Promise { - return input({ - message: 'Search query:', - validate: (value: string) => (value.trim().length === 0 ? 'Please enter a query.' : true), - }) - } - - private async promptForSearchMode(): Promise { - return select({ - message: 'Search mode:', - choices: [ - { name: 'Auto (semantic for long queries, exact for short)', value: 'auto' }, - { name: 'Semantic (Ollama + Vectra)', value: 'vector' }, - { name: 'RAG (Ask history with Ollama)', value: 'rag' }, - { name: 'Exact text (ripgrep)', value: 'rg' }, - ], - default: 'auto', - }) - } - - private async handleVectorSearchValidationRetry(validationError: unknown): Promise { - const validationErrorMessage = - validationError instanceof Error ? validationError.message : String(validationError) - errorBus.emitError(validationErrorMessage) - - const shouldRetryAfterStartingOllama = await confirm({ - message: - 'Ollama validation failed. Start Ollama (with the embedding model) and retry vectorization?', - default: false, - }) - - if (!shouldRetryAfterStartingOllama) { - return - } - - try { - await this.searchOrchestrator.validateVectorSearch() - } catch (retryError) { - const retryErrorMessage = - retryError instanceof Error ? retryError.message : String(retryError) - errorBus.emitError(retryErrorMessage) - return - } - - await this.searchOrchestrator.vectorizeNow() - } - - private wipeStorageDirectory(): void { - const authStoragePath = this.config.authStoragePath - const storageRootDir = authStoragePath ? authStoragePath.split(sep)[0] : '.storage' - - try { - const isDirectorySpecified = !!storageRootDir - if (isDirectorySpecified) { - rmSync(storageRootDir, { recursive: true, force: true }) - logger.debug(`Deleted storage folder: ${storageRootDir}`) - } - } catch (error) { - const isNotFoundError = (error as NodeJS.ErrnoException).code === 'ENOENT' - if (!isNotFoundError) { - throw error - } - } - } } diff --git a/src/repl/handlers/base.ts b/src/repl/handlers/base.ts new file mode 100644 index 0000000..545acd7 --- /dev/null +++ b/src/repl/handlers/base.ts @@ -0,0 +1,11 @@ +import { type Config } from '../../utils/config.js' +import { type CheckpointManager } from '../../scraper/checkpoint-manager.js' +import { type SearchOrchestrator } from '../../search/search-orchestrator.js' + +export abstract class BaseHandler { + constructor( + protected readonly applicationConfig: Config, + protected readonly checkpointManager: CheckpointManager, + protected readonly searchOrchestrator: SearchOrchestrator + ) {} +} diff --git a/src/repl/handlers/export.ts b/src/repl/handlers/export.ts new file mode 100644 index 0000000..47c6c50 --- /dev/null +++ b/src/repl/handlers/export.ts @@ -0,0 +1,80 @@ +import { BaseHandler } from './base.js' +import { BrowserManager } from '../../scraper/browser.js' +import { LibraryDiscovery } from '../../scraper/library-discovery.js' +import { WorkerPool } from '../../scraper/worker-pool.js' +import { logger } from '../../utils/logger.js' +import { errorBus } from '../../utils/error-bus.js' +import { select } from '@inquirer/prompts' + +export class ExportHandler extends BaseHandler { + async handleScraperWizard(): Promise { + const processingProgress = this.checkpointManager.getProcessingProgress() + const hasExistingProgress = processingProgress.total > 0 + + if (hasExistingProgress) { + await this.promptUserForCheckpointAction() + } + await this.handleStartLibraryExport() + } + + async handleStartLibraryExport(): Promise { + const browserManager = new BrowserManager(this.applicationConfig) + try { + const activeBrowserPage = await browserManager.launch() + + const isDiscoveryPhaseRequired = !this.checkpointManager.isDiscoveryPhaseComplete() + if (isDiscoveryPhaseRequired) { + logger.info('\n=== Phase 1: Library Discovery ===\n') + const libraryDiscoveryTool = new LibraryDiscovery() + const discoveredConversations = await libraryDiscoveryTool.discoverAllConversationsFromLibrary(activeBrowserPage) + this.checkpointManager.setDiscoveredConversations(discoveredConversations) + } + + const pendingConversationsToExtract = this.checkpointManager.getPendingConversations() + const hasPendingConversations = pendingConversationsToExtract.length > 0 + + if (!hasPendingConversations) { + logger.success('All conversations already processed!') + return + } + + logger.info(`\n=== Phase 2: Parallel Extraction (${pendingConversationsToExtract.length} pending) ===\n`) + const launchedBrowserInstance = browserManager.browserInstance! + const extractionWorkerPool = new WorkerPool(this.applicationConfig, this.checkpointManager, launchedBrowserInstance) + + await extractionWorkerPool.initialize() + await extractionWorkerPool.processConversations(pendingConversationsToExtract) + await extractionWorkerPool.close() + + logger.success('\n✨ Export complete!') + } catch (scrapingError) { + errorBus.emitError('Scraper failed', scrapingError) + } finally { + await browserManager.close() + } + } + + private async promptUserForCheckpointAction(): Promise { + const currentProgress = this.checkpointManager.getProcessingProgress() + const userSelectedAction = await select({ + message: `Found checkpoint (${currentProgress.processed}/${currentProgress.total} processed). What do you want to do?`, + choices: [ + { name: 'Resume (Continue processing known threads)', value: 'resume' }, + { name: 'Sync (Re-scan library for new threads and updates)', value: 'update' }, + { name: 'Start Over (Re-scan and re-process everything)', value: 'restart' }, + { name: 'Cancel', value: 'cancel' }, + ], + }) + + if (userSelectedAction === 'cancel') { + logger.info('Start cancelled.') + process.exit(0) + } + + if (userSelectedAction === 'restart') { + this.checkpointManager.resetCheckpoint() + } else if (userSelectedAction === 'update') { + this.checkpointManager.prepareForUpdateRun() + } + } +} diff --git a/src/repl/handlers/maintenance.ts b/src/repl/handlers/maintenance.ts new file mode 100644 index 0000000..f347385 --- /dev/null +++ b/src/repl/handlers/maintenance.ts @@ -0,0 +1,40 @@ +import { BaseHandler } from './base.js' +import { confirm } from '@inquirer/prompts' +import { logger } from '../../utils/logger.js' +import { errorBus } from '../../utils/error-bus.js' +import { rmSync, existsSync } from 'node:fs' +import { sep } from 'node:path' + +export class MaintenanceHandler extends BaseHandler { + async handleDataReset(): Promise { + try { + const isUserCertainOfReset = await confirm({ + message: '⚠️ This will delete all stored checkpoints, authentication data, and vector index. Are you sure?', + default: false + }) + + if (!isUserCertainOfReset) { + return + } + + this.wipeStorageDirectory() + this.checkpointManager.resetCheckpoint() + logger.success('✅ Storage folder deleted. All progress has been reset.') + } catch (resetError) { + errorBus.emitError('Reset failed', resetError) + } + } + + private wipeStorageDirectory(): void { + const authenticationStoragePath = this.applicationConfig.authStoragePath + const storageRootDirectory = authenticationStoragePath ? authenticationStoragePath.split(sep)[0] : '.storage' + + if (storageRootDirectory && existsSync(storageRootDirectory)) { + try { + rmSync(storageRootDirectory, { recursive: true, force: true }) + } catch (deletionError) { + errorBus.raiseError(`Failed to delete storage directory: ${storageRootDirectory}`, deletionError) + } + } + } +} diff --git a/src/repl/handlers/search.ts b/src/repl/handlers/search.ts new file mode 100644 index 0000000..b912d8c --- /dev/null +++ b/src/repl/handlers/search.ts @@ -0,0 +1,62 @@ +import { BaseHandler } from './base.js' +import { input, select, confirm } from '@inquirer/prompts' +import { logger } from '../../utils/logger.js' +import { errorBus } from '../../utils/error-bus.js' + +export class SearchHandler extends BaseHandler { + async handleSearchWizard(): Promise { + try { + const userSearchQuery = await input({ + message: 'Search query:', + validate: (inputValue) => inputValue.trim().length > 0 || 'Please enter a query.', + }) + + let selectedSearchMode = await select({ + message: 'Search mode:', + choices: [ + { name: 'Auto (semantic for long queries, exact for short)', value: 'auto' }, + { name: 'Semantic (Ollama + Vectra)', value: 'vector' }, + { name: 'RAG (Ask history with Ollama)', value: 'rag' }, + { name: 'Exact text (ripgrep)', value: 'rg' }, + ], + default: 'auto', + }) as any + + const isSemanticSearchRequested = selectedSearchMode !== 'rg' + if (isSemanticSearchRequested) { + try { + await this.searchOrchestrator.validateVectorSearch() + } catch (validationError) { + if (selectedSearchMode === 'auto') { + logger.warn('Ollama not available. Falling back to Exact Text search.') + selectedSearchMode = 'rg' + } else { + return + } + } + } + + await this.searchOrchestrator.search(userSearchQuery, selectedSearchMode, { pattern: userSearchQuery }) + } catch (wizardError) { + errorBus.emitError('Search wizard failed', wizardError) + } + } + + async handleVectorizeWizard(): Promise { + try { + const shouldRebuildIndexNow = await confirm({ + message: 'Rebuild the vector index from exports now?', + default: true + }) + + if (!shouldRebuildIndexNow) { + return + } + + await this.searchOrchestrator.validateVectorSearch() + await this.searchOrchestrator.vectorizeNow() + } catch (vectorizationError) { + errorBus.emitError('Vectorization wizard failed', vectorizationError) + } + } +} diff --git a/src/repl/help.ts b/src/repl/help.ts index ce622c6..0f9a870 100644 --- a/src/repl/help.ts +++ b/src/repl/help.ts @@ -2,47 +2,43 @@ import chalk from 'chalk' import { logger } from '../utils/logger.js' export function showHelp(): void { - const logAction = (actionName: string, actionDescription: string) => { - logger.info(chalk.cyan(` ${actionName}`)) - logger.info(` ${actionDescription}\n`) + const logHelpAction = (actionLabel: string, actionDescriptionText: string) => { + logger.info(chalk.cyan(` ${actionLabel}`)) + logger.info(` ${actionDescriptionText}\n`) } logger.info(chalk.bold('\n📚 Available Actions:\n')) - logAction( + logHelpAction( 'Start scraper (Library)', 'Run the scraper to export your Perplexity history. If a checkpoint exists, you can resume or restart.' ) - logAction( + logHelpAction( 'Search conversations', 'Search through exported conversations using various modes: auto, semantic, RAG, or exact text.' ) - logAction( + logHelpAction( 'Build vector index', 'Build or update the local vector index from your exports to enable semantic search and RAG.' ) - logAction( + logHelpAction( 'Reset all data', 'Remove all stored checkpoints, authentication data, and the vector index to start fresh.' ) - logAction('Help', 'Display this help overview.') + logHelpAction('Help', 'Display this help overview.') - logAction('Exit', 'Close the application.') + logHelpAction('Exit', 'Close the application.') logger.info(chalk.bold('💡 Search & RAG Tips:\n')) logger.info( ' • RAG: Ask history with Ollama. Combines vector retrieval with AI generation for comprehensive answers.' ) - logger.info( - ' The pipeline now includes HyDE (generates a hypothetical answer passage before searching)' - ) - logger.info( - ' and cross-encoder reranking (rescores top candidates for higher precision) automatically.' - ) + logger.info(' The pipeline now includes HyDE (hypothetical embeddings) and cross-encoder reranking.') + logger.info( ' • Auto Search: Intelligently switches between semantic and exact search based on query length.' ) @@ -55,7 +51,4 @@ export function showHelp(): void { logger.info( ' Run npm run benchmark to measure RAG pipeline latency across a set of test queries.' ) - logger.info( - ' Requires a built vector index. Edit BENCHMARK_QUERIES in src/benchmark.ts to tailor to your history.\n' - ) } diff --git a/src/repl/index.ts b/src/repl/index.ts index 975fe7a..446ada8 100644 --- a/src/repl/index.ts +++ b/src/repl/index.ts @@ -6,20 +6,20 @@ import { CommandHandler } from './commands.js' import { type Config } from '../utils/config.js' export class Repl { - private readonly commandHandler: CommandHandler - private isRunning = true + private readonly applicationCommandHandler: CommandHandler + private isApplicationRunning = true - constructor(config: Config) { - this.commandHandler = new CommandHandler(config) + constructor(applicationConfig: Config) { + this.applicationCommandHandler = new CommandHandler(applicationConfig) } async start(): Promise { logger.info(chalk.bold.cyan('\n🔮 Perplexity History Export Tool\n')) logger.info('Select commands to execute. Press Ctrl+C to exit.\n') - while (this.isRunning) { + while (this.isApplicationRunning) { try { - const selectedAction = await select({ + const userSelectedAction = await select({ message: 'perplexity>', choices: [ { name: 'Start scraper (Library)', value: 'start-library' }, @@ -31,47 +31,49 @@ export class Repl { ], }) - await this.dispatchCommand(selectedAction) - } catch (error) { - const isUserExit = error instanceof Error && error.name === 'ExitPromptError' - if (isUserExit) { - this.terminate() + await this.dispatchSelectedCommand(userSelectedAction) + } catch (interactionError) { + const isUserIntentionalExit = interactionError instanceof Error && interactionError.name === 'ExitPromptError' + if (isUserIntentionalExit) { + this.terminateApplication() } else { - throw error + throw interactionError } } } } - private async dispatchCommand(actionValue: string): Promise { - switch (actionValue) { + private async dispatchSelectedCommand(commandValue: string): Promise { + switch (commandValue) { case 'start-library': - await this.commandHandler.handleScraperWizard() + await this.applicationCommandHandler.handleScraperWizard() break case 'search': - await this.commandHandler.handleSearchWizard() + await this.applicationCommandHandler.handleSearchWizard() break case 'vectorize': - await this.commandHandler.handleVectorizeWizard() + await this.applicationCommandHandler.handleVectorizeWizard() break case 'reset': - await this.commandHandler.handleDataReset() + await this.applicationCommandHandler.handleDataReset() break case 'help': - this.commandHandler.handleShowHelp() + this.applicationCommandHandler.handleShowHelp() break case 'exit': - this.terminate() + this.terminateApplication() break default: - errorBus.emitError(`Unknown action: ${actionValue}`) - this.commandHandler.handleShowHelp() + errorBus.emitError(`Unknown command action: ${commandValue}`) + this.applicationCommandHandler.handleShowHelp() } } - private terminate(): void { - if (!this.isRunning) return - this.isRunning = false + private terminateApplication(): void { + if (!this.isApplicationRunning) { + return + } + this.isApplicationRunning = false logger.info(chalk.cyan('\n👋 Goodbye!\n')) process.exit(0) } diff --git a/src/scraper/browser.ts b/src/scraper/browser.ts index 39e1699..ef26106 100644 --- a/src/scraper/browser.ts +++ b/src/scraper/browser.ts @@ -1,281 +1,148 @@ -import { chromium, type Browser, type BrowserContext, type Page } from '@playwright/test' -import { readFileSync, writeFileSync, existsSync, statSync } from 'node:fs' +import { chromium, type Browser, type BrowserContext, type Page } from 'patchright' +import { readFileSync, existsSync, statSync } from 'node:fs' +import writeFileAtomic from 'write-file-atomic' import { type Config } from '../utils/config.js' import { logger } from '../utils/logger.js' +import { errorBus } from '../utils/error-bus.js' import { confirm } from '@inquirer/prompts' import { logHttpRequest, logHttpResponse } from '../utils/http-logger.js' export class BrowserManager { - static readonly BrowserLaunchError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'BrowserLaunchError' - } - } - - static readonly AuthError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'AuthError' - } - } - - static readonly ContextError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'ContextError' - } - } - - static readonly NavigationError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'NavigationError' - } - } - public browserInstance: Browser | null = null - private activeContext: BrowserContext | null = null + private browserContext: BrowserContext | null = null private activePage: Page | null = null - constructor(private readonly config: Config) {} + constructor(private readonly applicationConfig: Config) {} async launch(): Promise { try { - const isSavedAuthValid = this.isSavedAuthenticationFresh(this.config.authStoragePath) + const isSavedSessionStillFresh = this.isAuthenticationStateFresh(this.applicationConfig.authStoragePath) - if (isSavedAuthValid) { - await this.launchBrowser(this.config.headless) - await this.initializeBrowserContext() - await this.navigateToSettingsPage() - - const isLoggedIn = await this.verifyLoginStatus(this.getActivePage()) - if (isLoggedIn) { + if (isSavedSessionStillFresh) { + await this.initializeBrowserComponents(this.applicationConfig.headless) + if (await this.isUserAuthenticated()) { logger.success('Already logged in!') - return this.getActivePage() + return this.activePage! } - - logger.warn( - 'Saved authentication expired or invalid. Restarting in headful mode for login...' - ) + logger.warn('Session invalid. Restarting for login...') await this.close() } - // Need manual login: launch headful - await this.launchBrowser(false) - await this.initializeBrowserContext() - await this.navigateToSettingsPage() - await this.ensureUserIsAuthenticated() + const headfulModeEnabled = false + await this.initializeBrowserComponents(headfulModeEnabled) + await this.ensureUserIsAuthenticatedInBrowser() - const shouldRestartInHeadless = this.config.headless !== false - if (shouldRestartInHeadless) { - logger.info('Authentication successful. Restarting in headless mode...') + const shouldRestartInHeadlessMode = this.applicationConfig.headless !== false + if (shouldRestartInHeadlessMode) { + logger.info('Auth successful. Restarting in headless...') await this.close() - await this.launchBrowser(this.config.headless) - await this.initializeBrowserContext() - await this.navigateToSettingsPage() + await this.initializeBrowserComponents(this.applicationConfig.headless) } - - return this.getActivePage() - } catch (error) { - if (error instanceof Error) throw error - throw new BrowserManager.BrowserLaunchError(`Unexpected error: ${String(error)}`) + return this.activePage! + } catch (launchError) { + return errorBus.raiseError('Failed to launch or authenticate browser', launchError) } } async close(): Promise { - if (this.activePage) { - await this.activePage.close().catch(() => {}) - } - if (this.activeContext) { - await this.activeContext.close().catch(() => {}) - } - if (this.browserInstance) { - await this.browserInstance.close().catch(() => {}) - } + if (this.activePage) await this.activePage.close().catch(() => {}) + if (this.browserContext) await this.browserContext.close().catch(() => {}) + if (this.browserInstance) await this.browserInstance.close().catch(() => {}) + this.activePage = null - this.activeContext = null + this.browserContext = null this.browserInstance = null } - private async launchBrowser(headless: boolean | 'new'): Promise { + private async initializeBrowserComponents(isHeadlessMode: boolean | 'new') { + const headlessValueForLaunch = isHeadlessMode === 'new' ? true : isHeadlessMode try { - const actualHeadlessValue = headless === 'new' ? true : headless - this.browserInstance = await chromium.launch({ - headless: actualHeadlessValue, - args: ['--disable-blink-features=AutomationControlled'], - }) - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new BrowserManager.BrowserLaunchError(`Failed to launch browser: ${errorMessage}`) - } - } - - private async initializeBrowserContext(): Promise { - if (!this.browserInstance) { - throw new BrowserManager.ContextError('Browser not initialized') - } - - const isSavedAuthValid = this.isSavedAuthenticationFresh(this.config.authStoragePath) - - if (isSavedAuthValid) { - logger.info('Loading saved authentication state...') - try { - const storageStateJson = readFileSync(this.config.authStoragePath, 'utf-8') - const storageStateData = JSON.parse(storageStateJson) - this.activeContext = await this.browserInstance.newContext({ - storageState: storageStateData, + this.browserInstance = await chromium.launch({ headless: headlessValueForLaunch }) + + const isSessionFresh = this.isAuthenticationStateFresh(this.applicationConfig.authStoragePath) + const contextOptions = isSessionFresh + ? { storageState: JSON.parse(readFileSync(this.applicationConfig.authStoragePath, 'utf8')) } + : {} + + this.browserContext = await this.browserInstance.newContext(contextOptions) + + if (this.applicationConfig.debug) { + this.browserContext.on('request', request => { + const requestUrl = request.url() + const isRelevantUrl = requestUrl.includes('perplexity.ai') && !requestUrl.includes('static') + if (isRelevantUrl) { + logHttpRequest(request, true) + } + }) + this.browserContext.on('response', response => { + const responseUrl = response.url() + const isRelevantUrl = responseUrl.includes('perplexity.ai') && !responseUrl.includes('static') + if (isRelevantUrl) { + logHttpResponse(response, true) + } }) - } catch (error) { - logger.warn('Failed to load saved auth state, starting fresh.', error) - this.activeContext = await this.browserInstance.newContext() - } - } else { - const authFileExists = existsSync(this.config.authStoragePath) - if (authFileExists) { - logger.info('Saved authentication is older than 1 day, discarding.') } - this.activeContext = await this.browserInstance.newContext() - } - if (this.config.debug && this.activeContext) { - this.activeContext.on('request', (req) => { - const requestUrl = req.url() - const isRelevantUrl = - (requestUrl.includes('perplexity.ai/rest') || requestUrl.includes('perplexity.ai/api')) && - !requestUrl.includes('static') - if (isRelevantUrl) logHttpRequest(req) - }) - this.activeContext.on('response', (res) => { - const responseUrl = res.url() - const isRelevantUrl = - (responseUrl.includes('perplexity.ai/rest') || - responseUrl.includes('perplexity.ai/api')) && - !responseUrl.includes('static') - if (isRelevantUrl) logHttpResponse(res) - }) + this.activePage = await this.browserContext.newPage() + const SETTINGS_PAGE_URL = 'https://www.perplexity.ai/settings' + await this.activePage.goto(SETTINGS_PAGE_URL, { timeout: 15000 }).catch(() => {}) + } catch (initializationError) { + errorBus.raiseError('Browser initialization failed', initializationError) } } - private isSavedAuthenticationFresh(filePath: string): boolean { - const fileExists = existsSync(filePath) - if (!fileExists) return false + private isAuthenticationStateFresh(filePath: string): boolean { + if (!existsSync(filePath)) return false - try { - const fileStats = statSync(filePath) - const fileAgeMs = Date.now() - fileStats.mtimeMs - const TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000 - return fileAgeMs < TWENTY_FOUR_HOURS_MS - } catch (_error) { - return false - } - } + const ONE_DAY_IN_MILLISECONDS = 24 * 60 * 60 * 1000 + const fileLastModifiedTimeMilliseconds = statSync(filePath).mtimeMs + const currentTimestampMilliseconds = Date.now() + const fileAgeMilliseconds = currentTimestampMilliseconds - fileLastModifiedTimeMilliseconds - private async navigateToSettingsPage(): Promise { - if (!this.activeContext) { - throw new BrowserManager.NavigationError('No browser context available') - } - - this.activePage = await this.activeContext.newPage() - const SETTINGS_URL = 'https://www.perplexity.ai/settings' - const NAVIGATION_TIMEOUT_MS = 15_000 + return fileAgeMilliseconds < ONE_DAY_IN_MILLISECONDS + } + private async isUserAuthenticated(): Promise { + if (!this.activePage) return false try { - await this.activePage.goto(SETTINGS_URL, { - timeout: NAVIGATION_TIMEOUT_MS, + const sessionData = await this.activePage.evaluate(async () => { + try { + const authSessionResponse = await fetch('/api/auth/session') + return await authSessionResponse.json() + } catch { + return {} + } }) - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new BrowserManager.NavigationError(`Failed to navigate to settings: ${errorMessage}`) + return !!(sessionData.user || sessionData.expires) + } catch { + return false } } - private async ensureUserIsAuthenticated(): Promise { - if (!this.activePage) { - throw new BrowserManager.AuthError('Page not initialized') - } + private async ensureUserIsAuthenticatedInBrowser() { + if (await this.isUserAuthenticated()) return - const isLoggedIn = await this.verifyLoginStatus(this.activePage) - if (isLoggedIn) { - logger.success('Already logged in!') - return - } + logger.info('Please log in manually...') + await confirm({ message: 'Press Enter when logged in and on settings page' }) - logger.info('Please log in manually in the browser window...') - await confirm({ - message: 'Press Enter when you are logged in and on the settings page', - default: true, - }) + const SETTINGS_PAGE_URL = 'https://www.perplexity.ai/settings' + await this.activePage!.goto(SETTINGS_PAGE_URL, { waitUntil: 'networkidle' }) - const SETTINGS_URL = 'https://www.perplexity.ai/settings' - await this.activePage.goto(SETTINGS_URL, { - waitUntil: 'networkidle', - }) - - const isLoginConfirmed = await this.verifyLoginStatus(this.activePage) - if (!isLoginConfirmed) { - const currentUrl = this.activePage.url() - throw new BrowserManager.AuthError(`Login verification failed. Current URL: ${currentUrl}`) + if (!(await this.isUserAuthenticated())) { + errorBus.raiseError('Login verification failed') } - await this.persistAuthenticationState() - logger.success('Authentication state saved!') - } - - private async verifyLoginStatus(page: Page): Promise { - await page.waitForTimeout(1000).catch(() => {}) - await page.waitForLoadState('domcontentloaded').catch(() => {}) - - const result = await page.evaluate(async () => { - try { - const res = await fetch('/api/auth/session', { - method: 'GET', - credentials: 'include', - }) - const text = await res.text() - return { body: text } - } catch (error) { - return { body: '' } - } - }) - - logger.debug(`verifyLoginStatus: body=${result.body}`) - - const trimmed = result.body.trim() - if (!trimmed) return false - - try { - const parsed = JSON.parse(trimmed) as Record - return Boolean(parsed.user || parsed.expires || parsed.email) - } catch { - return false - } + await this.persistAuthenticationStateToDisk() + logger.success('Auth saved!') } - private async persistAuthenticationState(): Promise { - if (!this.activeContext) { - throw new BrowserManager.AuthError('No browser context available to save') - } - const currentStorageState = await this.activeContext.storageState() - logger.debug( - `Persisting auth state: ${currentStorageState.cookies.length} cookies, ${currentStorageState.origins.length} origins` - ) - - if (currentStorageState.cookies.length === 0) { - logger.warn( - 'persistAuthenticationState: no cookies found — skipping write to avoid overwriting valid state' - ) - return - } - - const serializedState = JSON.stringify(currentStorageState, null, 2) - writeFileSync(this.config.authStoragePath, serializedState) - } + private async persistAuthenticationStateToDisk() { + if (!this.browserContext) return + const currentStorageState = await this.browserContext.storageState() - private getActivePage(): Page { - if (!this.activePage) { - throw new BrowserManager.ContextError('Page not initialized') + if (currentStorageState.cookies.length > 0) { + const serializedStateJson = JSON.stringify(currentStorageState, null, 2) + await (writeFileAtomic as any)(this.applicationConfig.authStoragePath, serializedStateJson) } - return this.activePage } } diff --git a/src/scraper/checkpoint-manager.ts b/src/scraper/checkpoint-manager.ts index bab50ae..2300da4 100644 --- a/src/scraper/checkpoint-manager.ts +++ b/src/scraper/checkpoint-manager.ts @@ -1,5 +1,6 @@ import { errorBus } from '../utils/error-bus.js' -import { readFileSync, writeFileSync, existsSync } from 'node:fs' +import { readFileSync, existsSync } from 'node:fs' +import writeFileAtomic from 'write-file-atomic' import { type Config } from '../utils/config.js' export interface ConversationMeta { @@ -14,66 +15,64 @@ export interface ProgressState { } interface CheckpointData { - discoveryPhaseComplete: boolean + isDiscoveryPhaseComplete: boolean discoveredConversations: ConversationMeta[] - processedIds: string[] + processedConversationIds: string[] } export class CheckpointManager { private readonly checkpointFilePath: string - private currentState: CheckpointData + private currentCheckpointState: CheckpointData - constructor(config: Config) { - this.checkpointFilePath = config.checkpointPath - this.currentState = this.loadCheckpoint() + constructor(applicationConfig: Config) { + this.checkpointFilePath = applicationConfig.checkpointPath + this.currentCheckpointState = this.loadCheckpointFromDisk() } setDiscoveredConversations(newlyDiscoveredConversations: ConversationMeta[]): void { - // Preserve content hashes for already known conversations - this.currentState.discoveredConversations = newlyDiscoveredConversations.map((newConv) => { - const existingConversation = this.currentState.discoveredConversations.find( - (existing) => existing.id === newConv.id + this.currentCheckpointState.discoveredConversations = newlyDiscoveredConversations.map(newConversation => { + const existingConversation = this.currentCheckpointState.discoveredConversations.find( + (existing) => existing.id === newConversation.id ) return existingConversation - ? { ...newConv, contentHash: existingConversation.contentHash } - : newConv + ? { ...newConversation, contentHash: existingConversation.contentHash } + : newConversation }) - this.currentState.discoveryPhaseComplete = true - this.saveCheckpoint() + this.currentCheckpointState.isDiscoveryPhaseComplete = true + this.persistCheckpointToDisk() } isDiscoveryPhaseComplete(): boolean { - return this.currentState.discoveryPhaseComplete + return this.currentCheckpointState.isDiscoveryPhaseComplete } getPendingConversations(): ConversationMeta[] { - return this.currentState.discoveredConversations.filter( - (conversation) => !this.currentState.processedIds.includes(conversation.id) + const processedIdsSet = new Set(this.currentCheckpointState.processedConversationIds) + return this.currentCheckpointState.discoveredConversations.filter( + (conversation) => !processedIdsSet.has(conversation.id) ) } getContentHash(conversationId: string): string | undefined { - const conversation = this.currentState.discoveredConversations.find( - (c) => c.id === conversationId - ) - return conversation?.contentHash + return this.currentCheckpointState.discoveredConversations.find( + (conversation) => conversation.id === conversationId + )?.contentHash } markAsProcessed(conversationId: string, updatedContentHash?: string): void { let hasStateChanged = false + const processedIdsSet = new Set(this.currentCheckpointState.processedConversationIds) - const isAlreadyProcessed = this.currentState.processedIds.includes(conversationId) - if (!isAlreadyProcessed) { - this.currentState.processedIds.push(conversationId) + if (!processedIdsSet.has(conversationId)) { + this.currentCheckpointState.processedConversationIds.push(conversationId) hasStateChanged = true } if (updatedContentHash) { - const targetConversation = this.currentState.discoveredConversations.find( - (c) => c.id === conversationId + const targetConversation = this.currentCheckpointState.discoveredConversations.find( + (conversation) => conversation.id === conversationId ) - const isHashDifferent = - targetConversation && targetConversation.contentHash !== updatedContentHash + const isHashDifferent = targetConversation && targetConversation.contentHash !== updatedContentHash if (isHashDifferent) { targetConversation.contentHash = updatedContentHash @@ -82,56 +81,54 @@ export class CheckpointManager { } if (hasStateChanged) { - this.saveCheckpoint() + this.persistCheckpointToDisk() } } getProcessingProgress(): ProgressState { return { - processed: this.currentState.processedIds.length, - total: this.currentState.discoveredConversations.length, + processed: this.currentCheckpointState.processedConversationIds.length, + total: this.currentCheckpointState.discoveredConversations.length, } } prepareForUpdateRun(): void { - this.currentState.processedIds = [] - this.currentState.discoveryPhaseComplete = false - this.saveCheckpoint() + this.currentCheckpointState.processedConversationIds = [] + this.currentCheckpointState.isDiscoveryPhaseComplete = false + this.persistCheckpointToDisk() } resetCheckpoint(): void { - this.currentState = { - discoveryPhaseComplete: false, + this.currentCheckpointState = { + isDiscoveryPhaseComplete: false, discoveredConversations: [], - processedIds: [], + processedConversationIds: [] } - this.saveCheckpoint() + this.persistCheckpointToDisk() } - private loadCheckpoint(): CheckpointData { - const doesCheckpointExist = existsSync(this.checkpointFilePath) - if (doesCheckpointExist) { + private loadCheckpointFromDisk(): CheckpointData { + if (existsSync(this.checkpointFilePath)) { try { - const rawCheckpointData = readFileSync(this.checkpointFilePath, 'utf-8') - return JSON.parse(rawCheckpointData) - } catch (error) { - errorBus.emitError('Failed to load checkpoint file. Starting fresh.', error) + const rawJsonData = readFileSync(this.checkpointFilePath, 'utf-8') + return JSON.parse(rawJsonData) + } catch (loadError) { + errorBus.emitError('Failed to load checkpoint file', loadError) } } - return { - discoveryPhaseComplete: false, + isDiscoveryPhaseComplete: false, discoveredConversations: [], - processedIds: [], + processedConversationIds: [] } } - private saveCheckpoint(): void { + private async persistCheckpointToDisk() { try { - const serializedState = JSON.stringify(this.currentState, null, 2) - writeFileSync(this.checkpointFilePath, serializedState) - } catch (error) { - errorBus.emitError('Failed to save checkpoint file', error) + const serializedStateJson = JSON.stringify(this.currentCheckpointState, null, 2) + await (writeFileAtomic as any)(this.checkpointFilePath, serializedStateJson) + } catch (saveError) { + errorBus.emitError('Failed to save checkpoint file', saveError) } } } diff --git a/src/scraper/conversation-extractor.ts b/src/scraper/conversation-extractor.ts index d1ded6d..ce11f92 100644 --- a/src/scraper/conversation-extractor.ts +++ b/src/scraper/conversation-extractor.ts @@ -1,401 +1,100 @@ -import { createHash } from 'node:crypto' -import { errorBus } from '../utils/error-bus.js' -import { z } from 'zod' -import { type Page, type BrowserContext, type Response } from '@playwright/test' -import { logger } from '../utils/logger.js' -import { waitStrategy } from '../utils/wait-strategy.js' -import { ApiDiagnosticsWriter } from '../utils/api-diagnostics.js' +import { type BrowserContext, type Page } from 'patchright' import { type Config } from '../utils/config.js' +import { ApiDiagnosticsWriter } from '../utils/api-diagnostics.js' +import { createWaitStrategy } from '../utils/wait-strategy.js' +import { logger } from '../utils/logger.js' +import { errorBus } from '../utils/error-bus.js' -export interface ExtractedConversation { - id: string - contentHash: string - title: string - spaceName: string - timestamp: Date - content: string -} +import { PageNavigator } from './extractor/navigator.js' +import { ApiInterceptor } from './extractor/interceptor.js' +import { DataParser } from './extractor/parser.js' +import { MarkdownFormatter } from './extractor/formatter.js' +import { type ExtractedConversation } from './extractor/types.js' export class ConversationExtractor { - private static readonly BlockSchema = z.object({ - intended_usage: z.string().optional(), - markdown_block: z - .object({ - answer: z.string().optional(), - }) - .optional(), - }) + private static readonly MINIMUM_TIMEOUT_MILLISECONDS = 3000 + private static readonly MAXIMUM_TIMEOUT_MILLISECONDS = 15000 + private static readonly TIMEOUT_RECOVERY_STEP_MILLISECONDS = 2000 + private static readonly TIMEOUT_REDUCTION_STEP_MILLISECONDS = 1000 - private static readonly EntrySchema = z.object({ - thread_title: z.string().optional(), - collection_info: z - .object({ - title: z.string().optional(), - }) - .optional(), - updated_datetime: z.string().optional(), - query_str: z.string().optional(), - blocks: z.array(ConversationExtractor.BlockSchema).optional(), - }) + private currentTimeoutMilliseconds = 8000 + private readonly pageNavigator = new PageNavigator() + private readonly apiInterceptor: ApiInterceptor + private readonly dataParser: DataParser + private readonly markdownFormatter = new MarkdownFormatter() + private readonly apiDiagnosticsWriter: ApiDiagnosticsWriter - private static readonly ApiResponseSchema = z.union([ - z.array(ConversationExtractor.EntrySchema), - z.object({ - entries: z.array(ConversationExtractor.EntrySchema), - background_entries: z.array(z.unknown()).optional(), - collection_info: z - .object({ - has_next_page: z.boolean().optional(), - }) - .optional(), - }), - ]) - - static readonly ExtractionError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'ExtractionError' - } - } - - static readonly NavigationError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'NavigationError' - } - } - - static readonly NotFoundError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'NotFoundError' - } - } - - static readonly AuthError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'AuthError' - } - } - - static readonly ServerError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'ServerError' - } - } - - static readonly NoDataError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'NoDataError' - } - } - - static readonly ParsingError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'ParsingError' - } - } - - private static readonly TIMEOUT_MAX_MS = 30_000 - private static readonly TIMEOUT_MIN_MS = 8_000 - private static readonly TIMEOUT_STEP_DOWN_MS = 3_000 - private static readonly TIMEOUT_STEP_UP_MS = 1_000 - - private currentTimeoutMs = ConversationExtractor.TIMEOUT_MAX_MS - private readonly diagnostics: ApiDiagnosticsWriter - - constructor( - private readonly config: Config, - private readonly context: BrowserContext - ) { - this.diagnostics = new ApiDiagnosticsWriter(config) + constructor(private readonly applicationConfig: Config, private readonly browserContext: BrowserContext) { + this.apiDiagnosticsWriter = new ApiDiagnosticsWriter(applicationConfig) + this.apiInterceptor = new ApiInterceptor(this.apiDiagnosticsWriter) + this.dataParser = new DataParser(this.apiDiagnosticsWriter) } reduceTimeout(): void { - this.currentTimeoutMs = Math.max( - ConversationExtractor.TIMEOUT_MIN_MS, - this.currentTimeoutMs - ConversationExtractor.TIMEOUT_STEP_DOWN_MS + this.currentTimeoutMilliseconds = Math.max( + ConversationExtractor.MINIMUM_TIMEOUT_MILLISECONDS, + this.currentTimeoutMilliseconds - ConversationExtractor.TIMEOUT_REDUCTION_STEP_MILLISECONDS ) - logger.debug(`[extractor] timeout reduced to ${this.currentTimeoutMs}ms`) + logger.debug(`[extractor] timeout reduced to ${this.currentTimeoutMilliseconds}ms`) } recoverTimeout(): void { - this.currentTimeoutMs = Math.min( - ConversationExtractor.TIMEOUT_MAX_MS, - this.currentTimeoutMs + ConversationExtractor.TIMEOUT_STEP_UP_MS + this.currentTimeoutMilliseconds = Math.min( + ConversationExtractor.MAXIMUM_TIMEOUT_MILLISECONDS, + this.currentTimeoutMilliseconds + ConversationExtractor.TIMEOUT_RECOVERY_STEP_MILLISECONDS ) } async extract(conversationUrl: string): Promise { - await this.ensureContextIsAlive() + if (!this.browserContext) { + return errorBus.raiseError('Browser context missing') + } let conversationPage: Page | null = null try { - conversationPage = await this.context.newPage() - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new ConversationExtractor.ExtractionError(`Failed to create new page: ${errorMessage}`) + conversationPage = await this.browserContext.newPage() + } catch (pageCreationError) { + return errorBus.raiseError(`Failed to create new page for ${conversationUrl}`, pageCreationError) } - const apiResponsePromise = this.captureConversationApiResponse(conversationPage) + const captureApiResponsePromise = this.apiInterceptor.capture(conversationPage, this.currentTimeoutMilliseconds) try { - await this.navigateToConversationUrl(conversationPage, conversationUrl) - await waitStrategy(this.config).afterScroll(conversationPage) + await this.pageNavigator.navigateTo(conversationPage, conversationUrl) + await createWaitStrategy(this.applicationConfig).afterScroll(conversationPage) - const capturedApiData = await apiResponsePromise + const capturedApiData = await captureApiResponsePromise if (!capturedApiData) { - throw new ConversationExtractor.NoDataError('API response timeout or not found') - } - - const extractedConversation = this.parseConversationData(capturedApiData, conversationUrl) - if (!extractedConversation) { - throw new ConversationExtractor.ParsingError('Failed to parse conversation data') - } - - return extractedConversation - } catch (error) { - if (error instanceof Error) throw error - throw new ConversationExtractor.ExtractionError(String(error)) - } finally { - if (conversationPage) { - await conversationPage.close().catch((closeError) => { - logger.warn(`Failed to close page: ${closeError}`) - }) - } - } - } - - private async ensureContextIsAlive(): Promise { - if (!this.context) { - throw new ConversationExtractor.ExtractionError('Browser context is missing') - } - try { - await this.context.pages() - } catch (_error) { - throw new ConversationExtractor.ExtractionError('Browser context is no longer available') - } - } - - private captureConversationApiResponse(page: Page): Promise { - const accumulatedEntries: unknown[] = [] - let isRequestResolved = false - - return new Promise((resolve) => { - const timeoutId = setTimeout(() => { - if (!isRequestResolved) { - if (accumulatedEntries.length > 0) { - logger.info( - `API response timeout – resolving with ${accumulatedEntries.length} accumulated entries` - ) - resolve({ entries: accumulatedEntries }) - } else { - logger.warn('API response timeout – resolving with null') - resolve(null) - } - isRequestResolved = true - } - }, this.currentTimeoutMs) - - page.on('response', async (response: Response) => { - if (isRequestResolved) return - - const responseUrl = response.url() - const isThreadApiRequest = responseUrl.includes('/rest/thread/') - const isListRequest = - responseUrl.includes('list_ask_threads') || - responseUrl.includes('list_recent') || - responseUrl.includes('list_pinned') - - if (!isThreadApiRequest || isListRequest) return - if (page.isClosed()) return - - try { - const jsonResponse = await response.json() - if (isRequestResolved) return - - const parseResult = ConversationExtractor.ApiResponseSchema.safeParse(jsonResponse) - - if (!parseResult.success) { - this.diagnostics - .writeFailure({ - url: response.url(), - errorType: 'zod_error', - zodErrorPaths: parseResult.error.issues.map((issue) => issue.path.join('.')), - }) - .catch(() => {}) - } else { - const responseData = parseResult.data - const currentEntries = Array.isArray(responseData) ? responseData : responseData.entries - accumulatedEntries.push(...currentEntries) - - const hasNextPage = - !Array.isArray(responseData) && responseData.collection_info?.has_next_page === true - - if (!hasNextPage) { - clearTimeout(timeoutId) - isRequestResolved = true - resolve({ entries: accumulatedEntries }) - } else { - logger.info( - `Captured paginated response, ${accumulatedEntries.length} entries so far...` - ) - } - } - } catch (_error) { - // Silent catch for JSON parse errors from non-JSON responses - } - }) - }) - } - - private async navigateToConversationUrl(page: Page, url: string): Promise { - const NAVIGATION_TIMEOUT_MS = 30000 - const navigationResponse = await page.goto(url, { - waitUntil: 'domcontentloaded', - timeout: NAVIGATION_TIMEOUT_MS, - }) - this.validateNavigationResponse(navigationResponse) - } - - private validateNavigationResponse(response: Response | null): void { - if (!response) { - throw new ConversationExtractor.NavigationError('Navigation failed – no response') - } - - const httpStatusCode = response.status() - if (httpStatusCode === 404) { - throw new ConversationExtractor.NotFoundError('Conversation not found (404)') - } - if (httpStatusCode === 403 || httpStatusCode === 401) { - throw new ConversationExtractor.AuthError('Authentication required or expired') - } - if (httpStatusCode >= 500) { - throw new ConversationExtractor.ServerError(`Server error (${httpStatusCode})`) - } - if (httpStatusCode >= 400) { - throw new ConversationExtractor.NavigationError(`HTTP error ${httpStatusCode}`) - } - } - - private hashEntries(rawEntries: unknown[]): string { - const stableJsonString = JSON.stringify(rawEntries, (_key, value) => { - const isObjectButNotArray = value && typeof value === 'object' && !Array.isArray(value) - if (isObjectButNotArray) { - return Object.keys(value) - .sort() - .reduce((sortedObj: Record, currentKey) => { - sortedObj[currentKey] = (value as Record)[currentKey] - return sortedObj - }, {}) + errorBus.raiseError('API response timeout (no data captured)') } - return value - }) - return createHash('sha256').update(stableJsonString).digest('hex') - } - - private parseConversationData( - apiData: unknown, - conversationUrl: string - ): ExtractedConversation | null { - try { - const formattedEntries = this.ensureEntriesFormat(apiData, conversationUrl) - - const entriesValidationResult = z - .array(ConversationExtractor.EntrySchema) - .nonempty({ message: 'No valid entries found' }) - .safeParse(formattedEntries) - if (!entriesValidationResult.success) { - if (formattedEntries.length === 0) { - this.diagnostics - .writeFailure({ url: conversationUrl, errorType: 'empty_entries' }) - .catch(() => {}) - } - logger.warn( - `Entry validation failed for ${conversationUrl}: ${entriesValidationResult.error.message}` - ) - return null - } - - const validatedEntries = entriesValidationResult.data - const firstEntry = validatedEntries[0]! - const conversationId = this.extractIdFromUrl(conversationUrl) - - const threadTitleFromData = (apiData as any)?.thread_title - const collectionTitleFromData = (apiData as any)?.collection_info?.title - - const title = firstEntry.thread_title ?? threadTitleFromData ?? 'Untitled' - const spaceName = firstEntry.collection_info?.title ?? collectionTitleFromData ?? 'General' - const timestamp = this.extractTimestamp(firstEntry, apiData) - const contentHash = this.hashEntries(validatedEntries) - const markdownContent = this.convertEntriesToMarkdown(validatedEntries, title) - - if (!markdownContent) { - logger.warn(`Thread has empty content after formatting: ${conversationUrl}`) - return null + const parsedDataResult = this.dataParser.parse(capturedApiData, conversationUrl) + if (!parsedDataResult) { + errorBus.raiseError('Failed to parse API data') } return { - id: conversationId, - title, - spaceName, - timestamp, - content: markdownContent, - contentHash, + conversationId: parsedDataResult!.meta.id, + conversationTitle: parsedDataResult!.meta.title, + conversationSpaceName: parsedDataResult!.meta.spaceName, + extractionTimestamp: parsedDataResult!.meta.timestamp, + contentIntegrityHash: parsedDataResult!.hash, + formattedMarkdownContent: this.markdownFormatter.format(parsedDataResult!.entries, parsedDataResult!.meta.title) } - } catch (error) { - errorBus.emitError('Failed to parse conversation data.', error) - return null - } - } - - private ensureEntriesFormat(data: unknown, url: string): unknown[] { - if (Array.isArray(data)) return data as unknown[] - - const dataObject = data as Record - if (dataObject && Array.isArray(dataObject.entries)) return dataObject.entries as unknown[] - if (dataObject && (dataObject.query_str || dataObject.blocks)) return [data] - - this.diagnostics.writeFailure({ url, errorType: 'unknown_shape' }).catch(() => {}) - - return [] - } + } catch (extractionError) { + const isExpectedError = extractionError instanceof Error && + (extractionError.message.includes('timeout') || extractionError.message.includes('parse')) - private extractIdFromUrl(url: string): string { - const match = url.match(/\/search\/([^/?]+)/) - return match?.[1] ?? 'unknown' - } - - private extractTimestamp(firstEntry: any, data: unknown): Date { - const rawTimestamp = firstEntry.updated_datetime ?? (data as any)?.updated_datetime - return rawTimestamp ? new Date(rawTimestamp) : new Date() - } - - private convertEntriesToMarkdown(entries: unknown[], threadTitle: string): string { - let markdown = '' - const typedEntries = entries as any[] - - for (let i = 0; i < typedEntries.length; i++) { - const entry = typedEntries[i] - let question = entry.query_str ?? (i === 0 ? threadTitle : 'Follow‑up') - - let answer = '' - for (const block of entry.blocks ?? []) { - if (block.markdown_block?.answer) { - answer += block.markdown_block.answer + '\n\n' - } + if (isExpectedError) { + throw extractionError + } + return errorBus.raiseError(`Extraction failed for ${conversationUrl}`, extractionError) + } finally { + if (conversationPage) { + await conversationPage.close().catch(pageCloseError => { + logger.warn(`Failed to close page: ${pageCloseError}`) + }) } - - if (question) markdown += `## ${question}\n\n` - if (answer) markdown += `${answer.trim()}\n\n` - markdown += '---\n\n' } - - return markdown.trim() } } diff --git a/src/scraper/extractor/formatter.ts b/src/scraper/extractor/formatter.ts new file mode 100644 index 0000000..73a94aa --- /dev/null +++ b/src/scraper/extractor/formatter.ts @@ -0,0 +1,29 @@ +export class MarkdownFormatter { + format(conversationEntries: any[], conversationThreadTitle: string): string { + let resultMarkdownText = '' + + for (let entryIndex = 0; entryIndex < conversationEntries.length; entryIndex++) { + const currentEntry = conversationEntries[entryIndex] + const questionText = currentEntry.query_str ?? (entryIndex === 0 ? conversationThreadTitle : 'Follow-up') + + let answerContentText = '' + for (const contentBlock of currentEntry.blocks ?? []) { + if (contentBlock.markdown_block?.answer) { + answerContentText += contentBlock.markdown_block.answer + '\n\n' + } + } + + if (questionText) { + resultMarkdownText += `## ${questionText}\n\n` + } + if (answerContentText) { + resultMarkdownText += `${answerContentText.trim()}\n\n` + } + + const horizontalRuleSeparator = '---\n\n' + resultMarkdownText += horizontalRuleSeparator + } + + return resultMarkdownText.trim() + } +} diff --git a/src/scraper/extractor/interceptor.ts b/src/scraper/extractor/interceptor.ts new file mode 100644 index 0000000..78e94d3 --- /dev/null +++ b/src/scraper/extractor/interceptor.ts @@ -0,0 +1,69 @@ +import { type Page } from 'patchright' +import { z } from 'zod' +import { type ApiDiagnosticsWriter } from '../../utils/api-diagnostics.js' + +export class ApiInterceptor { + private static readonly ApiResponseSchema = z.union([ + z.array(z.any()), + z.object({ + entries: z.array(z.any()), + collection_info: z.object({ has_next_page: z.boolean().optional() }).optional(), + }), + ]) + + constructor(private readonly apiDiagnosticsWriter: ApiDiagnosticsWriter) {} + + async capture(webPage: Page, captureTimeoutMilliseconds: number): Promise { + const accumulatedApiEntries: any[] = [] + let isRequestResolved = false + + return new Promise((resolve) => { + const timeoutTimerIdentifier = setTimeout(() => { + if (!isRequestResolved) { + isRequestResolved = true + resolve(accumulatedApiEntries.length > 0 ? { entries: accumulatedApiEntries } : null) + } + }, captureTimeoutMilliseconds) + + webPage.on('response', async (webResponse) => { + if (isRequestResolved || webPage.isClosed()) { + return + } + + const responseUrl = webResponse.url() + const isThreadApiRequest = responseUrl.includes('/rest/thread/') + const isExcludedListRequest = responseUrl.includes('list_') + + if (!isThreadApiRequest || isExcludedListRequest) { + return + } + + try { + const jsonResponseData = await webResponse.json() + const validationResult = ApiInterceptor.ApiResponseSchema.safeParse(jsonResponseData) + + if (!validationResult.success) { + await this.apiDiagnosticsWriter.writeFailure({ + url: webResponse.url(), + errorType: 'zod_error', + zodErrorPaths: validationResult.error.issues.map(issue => issue.path.join('.')) + }) + } else { + const validatedData: any = validationResult.data + const newEntriesToAppend = Array.isArray(validatedData) ? validatedData : validatedData.entries + accumulatedApiEntries.push(...newEntriesToAppend) + + const hasMorePagesToCapture = !Array.isArray(validatedData) && validatedData.collection_info?.has_next_page + if (!hasMorePagesToCapture) { + clearTimeout(timeoutTimerIdentifier) + isRequestResolved = true + resolve({ entries: accumulatedApiEntries }) + } + } + } catch (jsonParsingError) { + // Ignore responses that are not valid JSON + } + }) + }) + } +} diff --git a/src/scraper/extractor/navigator.ts b/src/scraper/extractor/navigator.ts new file mode 100644 index 0000000..8a5cf43 --- /dev/null +++ b/src/scraper/extractor/navigator.ts @@ -0,0 +1,33 @@ +import { type Page, type Response } from 'patchright' +import { errorBus } from '../../utils/error-bus.js' + +export class PageNavigator { + async navigateTo(webPage: Page, targetUrl: string): Promise { + try { + const navigationResponse = await webPage.goto(targetUrl, { + waitUntil: 'domcontentloaded', + timeout: 30000, + }) + this.validateResponse(navigationResponse) + } catch (navigationError) { + const isTimeoutError = navigationError instanceof Error && navigationError.name === 'TimeoutError' + if (isTimeoutError) { + errorBus.raiseError(`Navigation timeout for ${targetUrl}`, navigationError) + } + throw navigationError + } + } + + private validateResponse(navigationResponse: Response | null): void { + if (!navigationResponse) { + errorBus.raiseError('Navigation failed - no response') + return + } + + const httpStatusCode = navigationResponse.status() + if (httpStatusCode === 404) errorBus.raiseError('Conversation not found (404)') + if (httpStatusCode === 403 || httpStatusCode === 401) errorBus.raiseError('Auth required or expired') + if (httpStatusCode >= 500) errorBus.raiseError(`Server error (${httpStatusCode})`) + if (httpStatusCode >= 400) errorBus.raiseError(`HTTP error ${httpStatusCode}`) + } +} diff --git a/src/scraper/extractor/parser.ts b/src/scraper/extractor/parser.ts new file mode 100644 index 0000000..6018971 --- /dev/null +++ b/src/scraper/extractor/parser.ts @@ -0,0 +1,64 @@ +import { z } from 'zod' +import { createHash } from 'node:crypto' +import stringify from 'fast-json-stable-stringify' +import { type ApiDiagnosticsWriter } from '../../utils/api-diagnostics.js' +import { errorBus } from '../../utils/error-bus.js' + +export class DataParser { + private static readonly ConversationEntrySchema = z.object({ + uuid: z.string().optional(), + query_str: z.string().nullable().optional(), + thread_title: z.string().nullable().optional(), + blocks: z.array(z.any()).optional(), + updated_datetime: z.string().optional(), + collection_info: z.object({ title: z.string().optional() }).optional().nullable(), + }) + + constructor(private readonly apiDiagnosticsWriter: ApiDiagnosticsWriter) {} + + parse(rawApiData: any, conversationUrl: string): { entries: any[], meta: any, hash: string } | null { + const normalizedEntries = this.normalizeApiData(rawApiData, conversationUrl) + const validationResult = z.array(DataParser.ConversationEntrySchema).nonempty().safeParse(normalizedEntries) + + if (!validationResult.success) { + const areEntriesEmpty = normalizedEntries.length === 0 + if (areEntriesEmpty) { + this.apiDiagnosticsWriter.writeFailure({ url: conversationUrl, errorType: 'empty_entries' }).catch(() => {}) + } + errorBus.emitError(`Entry validation failed for ${conversationUrl}`, validationResult.error) + return null + } + + const validatedEntries = validationResult.data + const firstEntryInConversation = validatedEntries[0]! + const contentIntegrityHash = createHash('sha256').update(stringify(validatedEntries)).digest('hex') + + const conversationIdentifierMatch = conversationUrl.match(/\/search\/([^/?]+)/) + const conversationId = conversationIdentifierMatch?.[1] ?? 'unknown' + + return { + entries: validatedEntries, + hash: contentIntegrityHash, + meta: { + id: conversationId, + title: firstEntryInConversation.thread_title ?? rawApiData?.thread_title ?? 'Untitled', + spaceName: firstEntryInConversation.collection_info?.title ?? rawApiData?.collection_info?.title ?? 'General', + timestamp: new Date(firstEntryInConversation.updated_datetime ?? rawApiData?.updated_datetime ?? new Date()) + } + } + } + + private normalizeApiData(rawApiData: any, conversationUrl: string): any[] { + const isAlreadyAnArray = Array.isArray(rawApiData) + if (isAlreadyAnArray) return rawApiData + + const hasEntriesProperty = rawApiData?.entries && Array.isArray(rawApiData.entries) + if (hasEntriesProperty) return rawApiData.entries + + const isSingleEntryObject = rawApiData?.query_str || rawApiData?.blocks + if (isSingleEntryObject) return [rawApiData] + + this.apiDiagnosticsWriter.writeFailure({ url: conversationUrl, errorType: 'unknown_shape' }).catch(() => {}) + return [] + } +} diff --git a/src/scraper/extractor/types.ts b/src/scraper/extractor/types.ts new file mode 100644 index 0000000..5b3adf5 --- /dev/null +++ b/src/scraper/extractor/types.ts @@ -0,0 +1,8 @@ +export interface ExtractedConversation { + conversationId: string + conversationTitle: string + conversationSpaceName: string + extractionTimestamp: Date + formattedMarkdownContent: string + contentIntegrityHash: string +} diff --git a/src/scraper/library-discovery.ts b/src/scraper/library-discovery.ts index 803b021..9923b23 100644 --- a/src/scraper/library-discovery.ts +++ b/src/scraper/library-discovery.ts @@ -1,269 +1,154 @@ -import type { Page } from '@playwright/test' +import { type Page } from 'patchright' import { logger } from '../utils/logger.js' +import { errorBus } from '../utils/error-bus.js' -// ─── Constants ─────────────────────────────────────────────────────────────── +const PERPLEXITY_BASE_URL = 'https://www.perplexity.ai' +const PERPLEXITY_LIBRARY_URL = `${PERPLEXITY_BASE_URL}/library` +const THREAD_BATCH_FETCH_LIMIT = 50 +const PAGE_READY_CONFIRMATION_BUFFER_MILLISECONDS = 500 -const BASE_URL = 'https://www.perplexity.ai' -const LIBRARY_URL = `${BASE_URL}/library` -const BATCH_SIZE = 50 -const MAX_RETRIES = 3 -const RETRY_DELAY_MS = 1500 -const PAGE_READY_BUFFER_MS = 500 - -/** - * Only capture API version from endpoints that fire AFTER the library page - * is fully initialized. /api/auth/session is intentionally excluded — it fires - * too early (before cookies/CSRF are hydrated) and causes list_ask_threads - * to return []. - */ -const VERSIONED_URL_PATTERNS = [ +const API_VERSION_URL_PATTERNS = [ '/rest/userinfo', '/rest/thread/list_ask_threads', '/rest/thread/list_pinned_ask_threads', '/rest/sidebar', ] -// ─── Types ──────────────────────────────────────────────────────────────────── - -interface RawThread { +interface RawPerplexityThread { uuid: string slug: string title: string query_str: string - first_answer: string - answer_preview: string - last_query_datetime: string - mode: string - status: string - display_model: string - thread_access: number - has_next_page: boolean total_threads: number - collection: Collection | null - sources: string[] - query_count: number - search_focus: string + collection: { title: string } | null [key: string]: unknown } -interface Collection { - uuid: string - title: string - emoji: string - slug: string -} - export interface ConversationMeta { id: string url: string - uuid: string - slug: string - title: string - query_str: string - first_answer: string - answer_preview: string - last_query_datetime: string - mode: string - status: string - display_model: string - thread_access: number - collection: Collection | null - sources: string[] - query_count: number - search_focus: string [key: string]: unknown } -interface ThreadBatchResponse { - threads: RawThread[] - hasMore: boolean - total: number +function extractApiVersionFromUrl(targetUrl: string): string | null { + const versionMatch = targetUrl.match(/[?&]version=([\d.]+)/) + return versionMatch?.[1] ?? null } -// ─── Helpers ────────────────────────────────────────────────────────────────── - -function extractVersionFromUrl(url: string): string | null { - const match = url.match(/[?&]version=([\d.]+)/) - return match?.[1] ?? null -} - -function rawThreadToConversationMeta(thread: RawThread): ConversationMeta { - return { - ...thread, - id: thread.uuid, - url: `${BASE_URL}/search/${thread.slug}`, - } -} - -// ─── Version Detection ──────────────────────────────────────────────────────── - -async function detectApiVersion(page: Page): Promise { +async function detectCurrentApiVersion(webPage: Page): Promise { try { - const response = await page.waitForResponse( - (res) => VERSIONED_URL_PATTERNS.some((p) => res.url().includes(p)) && res.status() === 200, + const apiResponse = await webPage.waitForResponse( + (response) => API_VERSION_URL_PATTERNS.some((pattern) => response.url().includes(pattern)) && response.status() === 200, { timeout: 15_000 } ) - const version = extractVersionFromUrl(response.url()) ?? '2.18' - logger.debug(`Detected API version: ${version} (from ${new URL(response.url()).pathname})`) - return version - } catch { - logger.debug('Version detection timeout — using fallback 2.18') + return extractApiVersionFromUrl(apiResponse.url()) ?? '2.18' + } catch (detectionTimeout) { return '2.18' } } -// ─── Page Readiness ─────────────────────────────────────────────────────────── - -/** - * Wait until the library page has finished its initialization network burst. - * /rest/userinfo fires at library load, well after /api/auth/session, ensuring - * cookies/CSRF are fully hydrated before we call list_ask_threads. - */ -async function waitForLibraryReady(page: Page, timeout = 12_000): Promise { +async function waitLibraryPageToBeReady(webPage: Page): Promise { try { - await page.waitForResponse( - (res) => res.url().includes('/rest/userinfo') && res.status() === 200, - { timeout } + const userInfoEndpointPattern = '/rest/userinfo' + await webPage.waitForResponse( + (response) => response.url().includes(userInfoEndpointPattern) && response.status() === 200, + { timeout: 12000 } ) - logger.debug('Library page ready (userinfo confirmed)') - } catch { - logger.debug('waitForLibraryReady: timeout — proceeding anyway') - } - - await page.waitForTimeout(PAGE_READY_BUFFER_MS) + } catch (readyTimeout) {} + await webPage.waitForTimeout(PAGE_READY_CONFIRMATION_BUFFER_MILLISECONDS) } -// ─── API Fetching ───────────────────────────────────────────────────────────── - async function fetchThreadBatch( - page: Page, - version: string, - offset: number -): Promise { - const url = `${BASE_URL}/rest/thread/list_ask_threads?version=${version}&source=default` - - const raw = await page.evaluate( - async ({ url, offset, batchSize }: { url: string; offset: number; batchSize: number }) => { - const res = await fetch(url, { - method: 'POST', - headers: { 'content-type': 'application/json' }, - body: JSON.stringify({ - limit: batchSize, - offset, - ascending: false, - include_assets: true, - search_term: '', - send_last_entry: true, - thread_type_filter: null, - with_temporary_threads: false, - }), - credentials: 'include', - }) - const text = await res.text() - return { status: res.status, body: text } - }, - { url, offset, batchSize: BATCH_SIZE } - ) - - logger.debug(`list_ask_threads offset=${offset}: status=${raw.status}`) - logger.debug(`list_ask_threads offset=${offset}: body=${raw.body.slice(0, 500)}`) - - if (raw.status !== 200) { - throw new Error(`list_ask_threads returned HTTP ${raw.status}`) + webPage: Page, + apiVersion: string, + itemOffset: number +): Promise<{ threads: RawPerplexityThread[], hasMoreThreads: boolean, totalThreadsOnServer: number }> { + const fetchUrl = `${PERPLEXITY_BASE_URL}/rest/thread/list_ask_threads?version=${apiVersion}&source=default` + + const executionResult = await webPage.evaluate(async ({ url, offset, batchSize }) => { + const response = await fetch(url, { + method: 'POST', + headers: { 'content-type': 'application/json' }, + body: JSON.stringify({ + limit: batchSize, + offset, + ascending: false, + include_assets: true, + search_term: '', + send_last_entry: true, + thread_type_filter: null, + with_temporary_threads: false, + }), + credentials: 'include', + }) + return { httpStatus: response.status, responseBodyText: await response.text() } + }, { url: fetchUrl, offset: itemOffset, batchSize: THREAD_BATCH_FETCH_LIMIT }) + + const isSuccessfulResponse = executionResult.httpStatus === 200 + if (!isSuccessfulResponse) { + errorBus.raiseError(`API error: ${executionResult.httpStatus}`, undefined, { body: executionResult.responseBodyText }) } - let parsed: unknown + let parsedResponseThreads: any try { - parsed = JSON.parse(raw.body) - } catch { - throw new Error(`list_ask_threads: invalid JSON — body: ${raw.body.slice(0, 200)}`) + parsedResponseThreads = JSON.parse(executionResult.responseBodyText) + } catch (jsonParsingError) { + errorBus.raiseError('Invalid JSON from API', jsonParsingError) } - if (!Array.isArray(parsed)) { - throw new Error(`list_ask_threads: expected array, got ${typeof parsed}`) + if (!Array.isArray(parsedResponseThreads)) { + errorBus.raiseError('Expected array from API') } - const threads = parsed as RawThread[] - const total = threads[0]?.total_threads ?? threads.length + const threadList = parsedResponseThreads as RawPerplexityThread[] + const totalCountFromServer = threadList[0]?.total_threads ?? threadList.length return { - threads, - hasMore: offset + threads.length < total, - total, - } -} - -async function fetchFirstBatch(page: Page, version: string): Promise { - for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) { - const result = await fetchThreadBatch(page, version, 0) - - if (result.threads.length > 0) { - logger.debug(`First batch OK — ${result.threads.length} threads (total: ${result.total})`) - return result - } - - if (attempt < MAX_RETRIES) { - logger.debug(`Attempt ${attempt}: empty batch, retrying in ${RETRY_DELAY_MS}ms…`) - await page.waitForTimeout(RETRY_DELAY_MS) - } + threads: threadList, + hasMoreThreads: itemOffset + threadList.length < totalCountFromServer, + totalThreadsOnServer: totalCountFromServer } - - logger.info('No threads found in library') - return { threads: [], hasMore: false, total: 0 } } -// ─── Main Discovery ─────────────────────────────────────────────────────────── - export class LibraryDiscovery { - constructor() {} - - async discoverAllConversationsFromLibrary(page: Page): Promise { - logger.info('Discovering threads via REST API...') - - // Start version detection BEFORE navigation so we catch the first matching response - const versionPromise = detectApiVersion(page) - - // Navigate to library - await page.goto(LIBRARY_URL, { waitUntil: 'domcontentloaded' }) - - // Wait for page to be fully ready (userinfo fired + hydration buffer) - await waitForLibraryReady(page) - - // Resolve detected version (fallback to 2.18 on timeout) - const version = await versionPromise - logger.info(`Detected API version: ${version} (from /rest/thread/list_ask_threads)`) - - // First batch with retry logic - const allThreads: RawThread[] = [] - const firstBatch = await fetchFirstBatch(page, version) - allThreads.push(...firstBatch.threads) - - if (firstBatch.threads.length === 0) { - logger.success(`Discovered 0 threads`) - return [] - } - - logger.debug(`Total threads on server: ${firstBatch.total}`) - - // Paginate remaining batches - let offset = firstBatch.threads.length - let hasMore = firstBatch.hasMore - - while (hasMore) { - // Randomized delay to avoid Cloudflare triggers (from PR #12) - const delay = 800 + Math.random() * 700 - await page.waitForTimeout(delay) - - const batch = await fetchThreadBatch(page, version, offset) - allThreads.push(...batch.threads) - offset += batch.threads.length - hasMore = batch.hasMore - - logger.debug(`Fetched ${allThreads.length} / ${batch.total} threads`) + async discoverAllConversationsFromLibrary(webPage: Page): Promise { + try { + logger.info('Discovering threads...') + const versionDetectionPromise = detectCurrentApiVersion(webPage) + + await webPage.goto(PERPLEXITY_LIBRARY_URL, { waitUntil: 'domcontentloaded' }) + await waitLibraryPageToBeReady(webPage) + const currentApiVersion = await versionDetectionPromise + + let allDiscoveredThreads: RawPerplexityThread[] = [] + let currentItemOffset = 0 + let areMoreThreadsAvailable = true + + while (areMoreThreadsAvailable) { + const isSubsequentBatch = currentItemOffset > 0 + if (isSubsequentBatch) { + const randomizedJitterDelay = 800 + Math.random() * 700 + await webPage.waitForTimeout(randomizedJitterDelay) + } + + const threadBatchResult = await fetchThreadBatch(webPage, currentApiVersion, currentItemOffset) + allDiscoveredThreads.push(...threadBatchResult.threads) + currentItemOffset += threadBatchResult.threads.length + areMoreThreadsAvailable = threadBatchResult.hasMoreThreads + + logger.debug(`Fetched ${allDiscoveredThreads.length} / ${threadBatchResult.totalThreadsOnServer} threads`) + } + + const conversationMetadataList = allDiscoveredThreads.map(thread => ({ + ...thread, + id: thread.uuid, + url: `${PERPLEXITY_BASE_URL}/search/${thread.slug}` + })) + + logger.success(`Discovered ${conversationMetadataList.length} threads`) + return conversationMetadataList + } catch (discoveryError) { + return errorBus.raiseError('Discovery failed', discoveryError) } - - const conversations = allThreads.map(rawThreadToConversationMeta) - logger.success(`Discovered ${conversations.length} threads`) - return conversations } } diff --git a/src/scraper/worker-pool.ts b/src/scraper/worker-pool.ts index e64f3b1..d63fafa 100644 --- a/src/scraper/worker-pool.ts +++ b/src/scraper/worker-pool.ts @@ -1,165 +1,174 @@ import { errorBus } from '../utils/error-bus.js' -import { type Browser, type BrowserContext } from '@playwright/test' +import { type Browser, type BrowserContext } from 'patchright' import { ConversationExtractor } from './conversation-extractor.js' import { CheckpointManager, type ConversationMeta } from './checkpoint-manager.js' import { FileWriter } from '../export/file-writer.js' import { logger } from '../utils/logger.js' import { type Config } from '../utils/config.js' +import pLimit from 'p-limit' -const MAX_RETRIES = 2 -const POLLING_INTERVAL_MS = 100 +const MAXIMUM_RETRY_ATTEMPTS = 2 interface ExtractionWorker { - id: number - extractor: ConversationExtractor - isBusy: boolean + workerId: number + conversationExtractor: ConversationExtractor + isCurrentlyBusy: boolean } -interface QueueItem { - meta: ConversationMeta - attempts: number +interface ExtractionQueueItem { + conversationMetadata: ConversationMeta + currentAttemptCount: number } export class WorkerPool { - private readonly workers: ExtractionWorker[] = [] - private readonly fileWriter: FileWriter + private readonly activeWorkers: ExtractionWorker[] = [] + private readonly conversationFileWriter: FileWriter private sharedBrowserContext: BrowserContext | null = null - private isRefreshing = false + private isContextRefreshingInProgress = false constructor( - private readonly config: Config, + private readonly applicationConfig: Config, private readonly checkpointManager: CheckpointManager, - private readonly browser: Browser + private readonly browserInstance: Browser ) { - this.fileWriter = new FileWriter(config) + this.conversationFileWriter = new FileWriter(applicationConfig) } async initialize(): Promise { try { - this.sharedBrowserContext = await this.browser.newContext({ - storageState: this.config.authStoragePath, + this.sharedBrowserContext = await this.browserInstance.newContext({ + storageState: this.applicationConfig.authStoragePath, }) - for (let i = 0; i < this.config.parallelWorkers; i++) { - this.workers.push({ - id: i, - extractor: new ConversationExtractor(this.config, this.sharedBrowserContext), - isBusy: false, + for (let workerIndex = 0; workerIndex < this.applicationConfig.parallelWorkers; workerIndex++) { + this.activeWorkers.push({ + workerId: workerIndex, + conversationExtractor: new ConversationExtractor(this.applicationConfig, this.sharedBrowserContext), + isCurrentlyBusy: false, }) } - } catch (error) { - errorBus.emitError('Failed to initialize worker pool', error) - throw error + } catch (initializationError) { + errorBus.raiseError('Failed to initialize worker pool', initializationError) } } async processConversations(conversationsToProcess: ConversationMeta[]): Promise { - const queue: QueueItem[] = conversationsToProcess.map((meta) => ({ meta, attempts: 0 })) - const activeTasks: Promise[] = [] + const concurrencyLimiter = pLimit(this.applicationConfig.parallelWorkers) + const extractionQueue: ExtractionQueueItem[] = conversationsToProcess.map((metadata) => ({ + conversationMetadata: metadata, + currentAttemptCount: 0 + })) - while (queue.length > 0 || activeTasks.length > 0) { - const worker = this.workers.find((w) => !w.isBusy) + const extractionTasks = extractionQueue.map((queueItem) => + concurrencyLimiter(() => this.executeExtractionWithRetryLogic(queueItem)) + ) - if (worker && queue.length > 0) { - const item = queue.shift()! - worker.isBusy = true + await Promise.all(extractionTasks) - const task = this.runExtraction(worker, item, queue).finally(() => { - worker.isBusy = false - activeTasks.splice(activeTasks.indexOf(task), 1) - }) + const totalConversationsRequested = conversationsToProcess.length + const totalConversationsProcessed = this.checkpointManager.getProcessingProgress().processed + const failedConversationsCount = totalConversationsRequested - totalConversationsProcessed - activeTasks.push(task) - } else { - await new Promise((resolve) => setTimeout(resolve, POLLING_INTERVAL_MS)) - } + if (failedConversationsCount > 0) { + logger.warn(`${failedConversationsCount} conversation(s) failed and will be retried on next run.`) } + } - const failedCount = - conversationsToProcess.length - this.checkpointManager.getProcessingProgress().processed - if (failedCount > 0) { - logger.warn(`${failedCount} conversation(s) failed and will be retried on next run.`) + private async executeExtractionWithRetryLogic(queueItem: ExtractionQueueItem): Promise { + const availableWorker = this.findAvailableWorker() + availableWorker.isCurrentlyBusy = true + try { + await this.performConversationExtraction(availableWorker, queueItem) + } finally { + availableWorker.isCurrentlyBusy = false } } + private findAvailableWorker(): ExtractionWorker { + const worker = this.activeWorkers.find(w => !w.isCurrentlyBusy) + if (worker) return worker + // Fallback to first worker if none are marked free (should not happen with p-limit) + return this.activeWorkers[0]! + } + async close(): Promise { await this.sharedBrowserContext?.close().catch(() => {}) } - private async runExtraction( - worker: ExtractionWorker, - item: QueueItem, - queue: QueueItem[] - ): Promise { + private async performConversationExtraction(worker: ExtractionWorker, queueItem: ExtractionQueueItem): Promise { try { - const result = await worker.extractor.extract(item.meta.url) - await this.handleSuccess(worker, item.meta, result) - } catch (error) { - await this.handleFailure(worker, item, queue, error) + const extractionResult = await worker.conversationExtractor.extract(queueItem.conversationMetadata.url) + await this.handleExtractionSuccess(worker, queueItem.conversationMetadata, extractionResult) + } catch (extractionError) { + await this.handleExtractionFailure(worker, queueItem, extractionError) } } - private async handleSuccess( + private async handleExtractionSuccess( worker: ExtractionWorker, - meta: ConversationMeta, - result: Awaited> + conversationMetadata: ConversationMeta, + extractionResult: any ): Promise { - const existingHash = this.checkpointManager.getContentHash(meta.id) - const { processed, total } = this.checkpointManager.getProcessingProgress() - const progressLabel = `[${processed}/${total}]` + const existingContentHash = this.checkpointManager.getContentHash(conversationMetadata.id) + const currentProgress = this.checkpointManager.getProcessingProgress() + const progressStatusLabel = `[${currentProgress.processed}/${currentProgress.total}]` - if (existingHash && existingHash === result.contentHash) { - this.checkpointManager.markAsProcessed(meta.id) - logger.info(`${progressLabel} Up to date: ${result.title} (skipped write)`) + const isContentUnchanged = existingContentHash && existingContentHash === extractionResult.contentIntegrityHash + + if (isContentUnchanged) { + this.checkpointManager.markAsProcessed(conversationMetadata.id) + logger.info(`${progressStatusLabel} Up to date: ${extractionResult.conversationTitle} (skipped write)`) } else { - this.fileWriter.write(result) - this.checkpointManager.markAsProcessed(meta.id, result.contentHash) - logger.info(`${progressLabel} Processed: ${result.title}`) + await this.conversationFileWriter.write(extractionResult) + this.checkpointManager.markAsProcessed(conversationMetadata.id, extractionResult.contentIntegrityHash) + logger.info(`${progressStatusLabel} Processed: ${extractionResult.conversationTitle}`) } - worker.extractor.recoverTimeout() + worker.conversationExtractor.recoverTimeout() } - private async handleFailure( + private async handleExtractionFailure( worker: ExtractionWorker, - item: QueueItem, - queue: QueueItem[], - error: unknown + queueItem: ExtractionQueueItem, + errorObject: unknown ): Promise { - const isTimeout = error instanceof Error && error.message.includes('API response timeout') - const isContextLost = - error instanceof Error && error.message.includes('context is no longer available') + const errorMessage = errorObject instanceof Error ? errorObject.message : String(errorObject) + const isTimeoutError = errorMessage.includes('API response timeout') + const isBrowserContextLost = errorMessage.includes('context is no longer available') || + errorMessage.includes('Target page, context or browser has been closed') - if (isTimeout) worker.extractor.reduceTimeout() + if (isTimeoutError) { + worker.conversationExtractor.reduceTimeout() + } - if (isContextLost) { - logger.warn('Browser context lost. Refreshing worker context...') - await this.refreshContext() + if (isBrowserContextLost) { + await this.refreshSharedBrowserContext() } - if (item.attempts < MAX_RETRIES) { - item.attempts++ - logger.warn(`Retrying ${item.meta.url} (attempt ${item.attempts}/${MAX_RETRIES})...`) - queue.push(item) + const canRetry = queueItem.currentAttemptCount < MAXIMUM_RETRY_ATTEMPTS + if (canRetry) { + queueItem.currentAttemptCount++ + logger.warn(`Retrying ${queueItem.conversationMetadata.url} (attempt ${queueItem.currentAttemptCount}/${MAXIMUM_RETRY_ATTEMPTS})...`) + await this.executeExtractionWithRetryLogic(queueItem) } else { - errorBus.emitError(`Failed to process ${item.meta.url} after ${MAX_RETRIES} retries`, error) + errorBus.emitError(`Failed to process ${queueItem.conversationMetadata.url} after ${MAXIMUM_RETRY_ATTEMPTS} retries`, errorObject) } } - private async refreshContext(): Promise { - if (this.isRefreshing) return - this.isRefreshing = true + private async refreshSharedBrowserContext(): Promise { + if (this.isContextRefreshingInProgress) return + this.isContextRefreshingInProgress = true try { await this.sharedBrowserContext?.close().catch(() => {}) - this.sharedBrowserContext = await this.browser.newContext({ - storageState: this.config.authStoragePath, + this.sharedBrowserContext = await this.browserInstance.newContext({ + storageState: this.applicationConfig.authStoragePath, }) - for (const worker of this.workers) { - worker.extractor = new ConversationExtractor(this.config, this.sharedBrowserContext) + for (const worker of this.activeWorkers) { + worker.conversationExtractor = new ConversationExtractor(this.applicationConfig, this.sharedBrowserContext) } - } catch (error) { - errorBus.emitError('Failed to refresh worker context', error) + } catch (refreshError) { + errorBus.emitError('Failed to refresh worker context', refreshError) } finally { - this.isRefreshing = false + this.isContextRefreshingInProgress = false } } } diff --git a/src/search/rg-search.ts b/src/search/rg-search.ts index 08d8758..fadb56b 100644 --- a/src/search/rg-search.ts +++ b/src/search/rg-search.ts @@ -3,7 +3,7 @@ import { existsSync } from 'node:fs' import { createInterface } from 'node:readline' import { type Config } from '../utils/config.js' import { logger } from '../utils/logger.js' -import chalk from 'chalk' +import { errorBus } from '../utils/error-bus.js' import { rgPath } from '@vscode/ripgrep' export interface RgSearchOptions { @@ -20,188 +20,132 @@ export interface RgMatch { } export class RgSearch { - static readonly RgSearchError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'RgSearchError' - } - } - - static readonly RgNotFoundError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'RgNotFoundError' - } - } + constructor(private readonly applicationConfig: Config) {} - constructor(private readonly config: Config) {} - - async search(options: RgSearchOptions): Promise { - this.ensureExportDirectoryIsAccessible() - const ripgrepCommandArguments = this.constructRipgrepArguments(options) - await this.spawnRipgrepProcess(ripgrepCommandArguments) + async search(searchOptions: RgSearchOptions): Promise { + this.ensureExportDirectoryExists() + const ripgrepArguments = this.constructRipgrepArguments(searchOptions) + await this.executeRipgrepProcess(ripgrepArguments) } - async captureSearchMatches(options: RgSearchOptions): Promise { - this.ensureExportDirectoryIsAccessible() - - const baseArguments = this.constructRipgrepArguments(options) - const jsonOutputArguments = baseArguments - .filter((arg) => arg !== '--color=always') + async captureSearchMatches(searchOptions: RgSearchOptions): Promise { + this.ensureExportDirectoryExists() + const ripgrepArguments = this.constructRipgrepArguments(searchOptions) + .filter(argument => argument !== '--color=always') .concat(['--color=never', '--json', '--max-filesize', '1M', '--no-binary']) return new Promise((resolve, reject) => { - const MAX_MATCHES_PER_QUERY = 100 - const SEARCH_TIMEOUT_MS = 30000 - const matches: RgMatch[] = [] + const MAXIMUM_MATCHES_TO_CAPTURE = 100 + const capturedMatches: RgMatch[] = [] - const ripgrepProcess = spawn(rgPath, jsonOutputArguments, { cwd: this.config.exportDir }) + const ripgrepProcess = spawn(rgPath, ripgrepArguments, { cwd: this.applicationConfig.exportDir }) + const readlineInterface = createInterface({ input: ripgrepProcess.stdout, terminal: false }) + const SEARCH_TIMEOUT_MILLISECONDS = 30000 const timeoutId = setTimeout(() => { - const timeoutSeconds = SEARCH_TIMEOUT_MS / 1000 - logger.warn( - `Ripgrep search for "${options.pattern}" timed out after ${timeoutSeconds}s. Killing process.` - ) - ripgrepProcess.kill('SIGKILL') - }, SEARCH_TIMEOUT_MS) - - const readlineInterface = createInterface({ - input: ripgrepProcess.stdout, - terminal: false, - }) + ripgrepProcess.kill() + reject(new Error('ripgrep search timed out after 30 seconds')) + }, SEARCH_TIMEOUT_MILLISECONDS) - readlineInterface.on('line', (line) => { - if (matches.length >= MAX_MATCHES_PER_QUERY) { + readlineInterface.on('line', (outputLine) => { + if (capturedMatches.length >= MAXIMUM_MATCHES_TO_CAPTURE) { ripgrepProcess.kill() return } try { - const parsedLine = JSON.parse(line) - if (parsedLine.type === 'match') { - matches.push({ - path: parsedLine.data.path.text, - line: parsedLine.data.line_number, - text: parsedLine.data.lines.text, + if (!outputLine.trim()) return + const parsedJsonLine = JSON.parse(outputLine) + if (parsedJsonLine.type === 'match') { + capturedMatches.push({ + path: parsedJsonLine.data.path.text, + line: parsedJsonLine.data.line_number, + text: parsedJsonLine.data.lines.text, }) } - } catch (_err) { - // Ignore lines that are not valid JSON or of a different type + } catch (parsingError) { + // Ignore invalid JSON lines from ripgrep } }) - ripgrepProcess.stderr.on('data', () => { - // Silently consume stderr to avoid buffer filling up + ripgrepProcess.on('close', (exitCode) => { + clearTimeout(timeoutId) + if (exitCode === 0 || exitCode === 1 || ripgrepProcess.killed) { + resolve(capturedMatches) + } else { + const errorMessage = `ripgrep exited with code ${exitCode}` + errorBus.emitError(errorMessage) + reject(new Error(errorMessage)) + } }) ripgrepProcess.on('error', (processError) => { clearTimeout(timeoutId) - readlineInterface.close() + errorBus.emitError('ripgrep failed to start', processError) reject(processError) }) - - ripgrepProcess.on('close', (exitCode) => { - clearTimeout(timeoutId) - readlineInterface.close() - - const isSuccessfulExit = - exitCode === 0 || exitCode === 1 || exitCode === null || ripgrepProcess.killed - if (isSuccessfulExit) { - resolve(matches) - } else { - reject(new RgSearch.RgSearchError(`ripgrep exited with code ${exitCode}`)) - } - }) }) } - private ensureExportDirectoryIsAccessible(): void { - const exportsExist = existsSync(this.config.exportDir) - if (!exportsExist) { - throw new RgSearch.RgSearchError( - 'No exports directory found. Please run the "start" command first to export your history.' - ) + private ensureExportDirectoryExists() { + if (!existsSync(this.applicationConfig.exportDir)) { + errorBus.raiseError('No exports directory found. Please run export first.') } } - private constructRipgrepArguments(options: RgSearchOptions): string[] { - const argumentsList: string[] = [ - '--color=always', - '--heading', - '--line-number', - '--no-messages', - '--column', - '--smart-case', - ] - - if (options.caseSensitive) { - argumentsList.push('--case-sensitive') - } + private constructRipgrepArguments(searchOptions: RgSearchOptions): string[] { + const ripgrepArguments = ['--color=never', '--heading', '--line-number', '--no-messages', '--column', '--smart-case'] - if (options.wholeWord) { - argumentsList.push('--word-regexp') + if (searchOptions.caseSensitive) { + ripgrepArguments.push('--case-sensitive') + } + if (searchOptions.wholeWord) { + ripgrepArguments.push('--word-regexp') } - if (options.regex) { - argumentsList.push('--regexp', options.pattern) + if (searchOptions.regex) { + ripgrepArguments.push('--regexp', searchOptions.pattern) } else { - argumentsList.push('--fixed-strings', options.pattern) + ripgrepArguments.push('--fixed-strings', searchOptions.pattern) } - argumentsList.push('--type', 'markdown') - return argumentsList + ripgrepArguments.push('--type', 'markdown') + // Search only in Markdown files within the current directory and its subdirectories + ripgrepArguments.push('.') + + return ripgrepArguments } - private spawnRipgrepProcess(args: string[]): Promise { + private executeRipgrepProcess(ripgrepArguments: string[]): Promise { return new Promise((resolve, reject) => { - const ripgrepProcess = spawn(rgPath, args, { - cwd: this.config.exportDir, - stdio: ['ignore', 'pipe', 'pipe'], - }) - - let hasFoundMatches = false - - ripgrepProcess.stdout.on('data', (data) => { - hasFoundMatches = true - process.stdout.write(data) - }) - - ripgrepProcess.stderr.on('data', (data) => { - const errorText = data.toString() - const isNotNotFoundError = !errorText.includes('No such file or directory') - if (isNotNotFoundError) { - process.stderr.write(chalk.red(data)) - } + const ripgrepProcess = spawn(rgPath, ripgrepArguments, { + cwd: this.applicationConfig.exportDir, + stdio: ['ignore', 'pipe', 'pipe'] }) - ripgrepProcess.on('error', (processError) => { - const isMissingBinary = processError.message.includes('ENOENT') - if (isMissingBinary) { - reject(new RgSearch.RgNotFoundError(this.getRipgrepInstallationInstructions())) - } else { - reject(new RgSearch.RgSearchError(`Search failed: ${processError.message}`)) - } + let hasFoundAnyMatches = false + ripgrepProcess.stdout.on('data', (outputDataChunk) => { + hasFoundAnyMatches = true + process.stdout.write(outputDataChunk) }) ripgrepProcess.on('close', (exitCode) => { - const isSuccessStatus = exitCode === 0 || exitCode === 1 - if (isSuccessStatus) { - const isEmptyResult = exitCode === 1 && !hasFoundMatches - if (isEmptyResult) { + if (exitCode === 0 || exitCode === 1) { + if (exitCode === 1 && !hasFoundAnyMatches) { logger.info('No results found.') } resolve() } else { - reject(new RgSearch.RgSearchError(`ripgrep exited with code ${exitCode}`)) + const errorMessage = `ripgrep exited with code ${exitCode}` + errorBus.emitError(errorMessage) + reject(new Error(errorMessage)) } }) - }) - } - private getRipgrepInstallationInstructions(): string { - return ( - 'Bundled ripgrep (rg) not found or failed to execute. ' + - 'Please ensure the application was installed correctly.' - ) + ripgrepProcess.on('error', (processError) => { + errorBus.emitError('ripgrep failed', processError) + reject(processError) + }) + }) } } diff --git a/src/search/search-orchestrator.ts b/src/search/search-orchestrator.ts index b430452..fa7d8ad 100644 --- a/src/search/search-orchestrator.ts +++ b/src/search/search-orchestrator.ts @@ -1,6 +1,7 @@ import { RgSearch, type RgSearchOptions } from './rg-search.js' import { VectorStore } from './vector-store.js' import { logger } from '../utils/logger.js' +import { errorBus } from '../utils/error-bus.js' import { type Config } from '../utils/config.js' import { RagOrchestrator } from '../ai/rag-orchestrator.js' import chalk from 'chalk' @@ -8,35 +9,19 @@ import chalk from 'chalk' export type SearchMode = 'rg' | 'vector' | 'auto' | 'rag' export class SearchOrchestrator { - static readonly SearchOrchestratorError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'SearchOrchestratorError' - } - } - - static readonly ValidationError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'SearchOrchestratorValidationError' - } - } - - private readonly rgSearch: RgSearch + private readonly ripgrepSearch: RgSearch private readonly vectorStore: VectorStore private readonly ragOrchestrator: RagOrchestrator - constructor(private readonly config: Config) { - this.rgSearch = new RgSearch(config) - this.vectorStore = new VectorStore(config) - this.ragOrchestrator = new RagOrchestrator(config) + constructor(private readonly applicationConfig: Config) { + this.ripgrepSearch = new RgSearch(applicationConfig) + this.vectorStore = new VectorStore(applicationConfig) + this.ragOrchestrator = new RagOrchestrator(applicationConfig) } async validateVectorSearch(): Promise { - if (!this.config.enableVectorSearch) { - const vectorSearchDisabledErrorMessage = - 'Vector search is disabled (ENABLE_VECTOR_SEARCH=false).' - throw new SearchOrchestrator.ValidationError(vectorSearchDisabledErrorMessage) + if (!this.applicationConfig.enableVectorSearch) { + errorBus.raiseError('Vector search disabled') } await this.vectorStore.validate() } @@ -45,67 +30,56 @@ export class SearchOrchestrator { await this.vectorStore.rebuildFromExports() } - async search(query: string, mode: SearchMode, rgOptions: RgSearchOptions): Promise { + async search(searchQuery: string, searchMode: SearchMode, ripgrepOptions: RgSearchOptions): Promise { try { - switch (mode) { + switch (searchMode) { case 'rg': - await this.rgSearch.search(rgOptions) + await this.ripgrepSearch.search(ripgrepOptions) break case 'vector': - await this.performVectorOnlySearch(query) + await this.executeVectorOnlySearch(searchQuery) break case 'rag': - await this.ragOrchestrator.answerQuestion(query) + await this.ragOrchestrator.answerQuestion(searchQuery) break case 'auto': default: - await this.executeAutoSearch(query, rgOptions) + await this.executeAutoSearch(searchQuery, ripgrepOptions) break } - } catch (_error) { - if (_error instanceof Error) { - const searchFailedErrorMessage = `Search failed: ${_error.message}` - throw new SearchOrchestrator.SearchOrchestratorError(searchFailedErrorMessage) - } - throw _error + } catch (searchError) { + errorBus.raiseError(`Search failed`, searchError) } } - private async executeAutoSearch(query: string, rgOptions: RgSearchOptions): Promise { - const LONG_QUERY_WORD_COUNT_THRESHOLD = 5 - const queryWordCount = query.trim().split(/\s+/).length - const isLongQuery = queryWordCount > LONG_QUERY_WORD_COUNT_THRESHOLD + private async executeAutoSearch(searchQuery: string, ripgrepOptions: RgSearchOptions) { + const LONG_QUERY_THRESHOLD_WORDS = 5 + const isLongQuery = searchQuery.trim().split(/\s+/).length > LONG_QUERY_THRESHOLD_WORDS if (isLongQuery) { - await this.performVectorOnlySearch(query) + await this.executeVectorOnlySearch(searchQuery) } else { - await this.rgSearch.search(rgOptions) + await this.ripgrepSearch.search(ripgrepOptions) } } - private async performVectorOnlySearch(query: string): Promise { - logger.info('Using vector search (Ollama + Vectra)...') - const SEARCH_RESULT_LIMIT = 10 - const searchResults = await this.vectorStore.search(query, SEARCH_RESULT_LIMIT) + private async executeVectorOnlySearch(searchQuery: string) { + logger.info('Using semantic search...') + const MAXIMUM_VECTOR_RESULTS = 10 + const searchResults = await this.vectorStore.search(searchQuery, MAXIMUM_VECTOR_RESULTS) if (searchResults.length === 0) { - logger.info('No vector search results found.') + logger.info('No results.') return } - for (const result of searchResults) { - const { meta, score } = result - const relevanceScoreLabel = score.toFixed(3) - - const spaceNameDisplay = chalk.green(meta['spaceName'] as string) - const arrowSeparator = chalk.gray('›') - const titleDisplay = chalk.cyan(meta['title'] as string) - const scoreDisplay = chalk.gray(`(${relevanceScoreLabel})`) - const pathDisplay = chalk.gray(meta['path'] as string) + for (const searchResult of searchResults) { + const spaceNameDisplay = chalk.green(searchResult.meta['spaceName'] as string) + const titleDisplay = chalk.cyan(searchResult.meta['title'] as string) + const relevanceScoreDisplay = chalk.gray(`(${searchResult.score.toFixed(3)})`) + const filePathDisplay = chalk.gray(searchResult.meta['path'] as string) - logger.info( - `${spaceNameDisplay} ${arrowSeparator} ${titleDisplay} ${scoreDisplay}\n${pathDisplay}\n` - ) + logger.info(`${spaceNameDisplay} › ${titleDisplay} ${relevanceScoreDisplay}\n${filePathDisplay}\n`) } } } diff --git a/src/search/vector-store.ts b/src/search/vector-store.ts index 0878b53..a44abb9 100644 --- a/src/search/vector-store.ts +++ b/src/search/vector-store.ts @@ -1,7 +1,7 @@ import { errorBus } from '../utils/error-bus.js' import { LocalIndex } from 'vectra' import { join } from 'node:path' -import { readFileSync, readdirSync, statSync } from 'node:fs' +import fileSystem from 'node:fs/promises' import { type Config } from '../utils/config.js' import { logger } from '../utils/logger.js' import { OllamaClient } from '../ai/ollama-client.js' @@ -15,232 +15,146 @@ export interface VectorSearchResult { } export class VectorStore { - static readonly VectorStoreError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'VectorStoreError' - } - } - - static readonly IndexError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'VectorStoreIndexError' - } - } - - static readonly EmbeddingError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'VectorStoreEmbeddingError' - } - } - - static readonly SearchError = class extends Error { - constructor(message: string) { - super(message) - this.name = 'VectorStoreSearchError' - } - } - private readonly vectorIndex: LocalIndex private readonly ollamaClient: OllamaClient - constructor(private readonly config: Config) { - this.vectorIndex = new LocalIndex(config.vectorIndexPath) - this.ollamaClient = new OllamaClient(config) + constructor(private readonly applicationConfig: Config) { + this.vectorIndex = new LocalIndex(applicationConfig.vectorIndexPath) + this.ollamaClient = new OllamaClient(applicationConfig) } async validate(): Promise { try { await this.ollamaClient.validate() - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new VectorStore.VectorStoreError(`Vector store validation failed: ${errorMessage}`) + } catch (validationError) { + errorBus.raiseError(`Vector store validation failed`, validationError) } } async rebuildFromExports(): Promise { logger.info('Building vector index from exports folder...') - const markdownFilePaths = this.getMarkdownFilePathsRecursively(this.config.exportDir) + const markdownFilePaths = await this.discoverMarkdownFilesRecursively(this.applicationConfig.exportDir) if (markdownFilePaths.length === 0) { logger.warn('No markdown files found to index.') return } - await this.ensureIndexExists() - await this.processMarkdownFilesByBatches(markdownFilePaths) - + await this.ensureVectorIndexIsCreated() + await this.indexFilesByBatches(markdownFilePaths) logger.success('Vector index rebuild complete.') } - async search(query: string, limit = 10): Promise { + async search(searchQuery: string, resultLimit = 10): Promise { try { - const queryEmbedding = await this.generateQueryEmbedding(query) - const rawResults = await this.queryVectorIndex(queryEmbedding, query, limit) - return this.formatVectorSearchResults(rawResults) - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new VectorStore.SearchError(`Vector search failed: ${errorMessage}`) - } - } + const [searchQueryEmbedding] = await this.ollamaClient.embed([searchQuery]) + if (!searchQueryEmbedding) { + return errorBus.raiseError('Failed to generate embedding for query') + } - async searchWithMetadataFilter( - query: string, - filter: (meta: Record) => boolean, - limit = 10 - ): Promise { - try { - const queryEmbedding = await this.generateQueryEmbedding(query) - const rawResults = await this.vectorIndex.queryItems( - queryEmbedding, - query, - limit, - filter as any - ) - return this.formatVectorSearchResults(rawResults) - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - throw new VectorStore.SearchError(`Filtered vector search failed: ${errorMessage}`) + const rawSearchResults = await this.vectorIndex.queryItems(searchQueryEmbedding, searchQuery, resultLimit) + return rawSearchResults.map(searchResult => ({ + meta: searchResult.item.metadata as VectorDocMeta, + score: searchResult.score + })) + } catch (searchError) { + return errorBus.raiseError('Vector search failed', searchError, { searchQuery }) } } - private async ensureIndexExists(): Promise { - const isAlreadyCreated = await this.vectorIndex.isIndexCreated() - if (!isAlreadyCreated) { + private async ensureVectorIndexIsCreated() { + const isIndexAlreadyCreated = await this.vectorIndex.isIndexCreated() + if (!isIndexAlreadyCreated) { await this.vectorIndex.createIndex() } } - private getMarkdownFilePathsRecursively(directoryPath: string): string[] { - const directoryEntries = readdirSync(directoryPath) + private async discoverMarkdownFilesRecursively(directoryPath: string): Promise { + const directoryEntries = await fileSystem.readdir(directoryPath, { withFileTypes: true }) const markdownFilePaths: string[] = [] - for (const entryName of directoryEntries) { - const fullPath = join(directoryPath, entryName) - const pathStatus = statSync(fullPath) - - if (pathStatus.isDirectory()) { - markdownFilePaths.push(...this.getMarkdownFilePathsRecursively(fullPath)) - } else if (pathStatus.isFile() && fullPath.endsWith('.md')) { - markdownFilePaths.push(fullPath) + for (const directoryEntry of directoryEntries) { + const fullEntryPath = join(directoryPath, directoryEntry.name) + if (directoryEntry.isDirectory()) { + const nestedMarkdownPaths = await this.discoverMarkdownFilesRecursively(fullEntryPath) + markdownFilePaths.push(...nestedMarkdownPaths) + } else if (fullEntryPath.endsWith('.md')) { + markdownFilePaths.push(fullEntryPath) } } return markdownFilePaths } - private async processMarkdownFilesByBatches(filePaths: string[]): Promise { + private async indexFilesByBatches(markdownFilePaths: string[]) { await this.vectorIndex.beginUpdate() - const EMBEDDING_BATCH_SIZE = 10 - let pendingTextsToEmbed: string[] = [] - let pendingMetadataToInsert: VectorDocMeta[] = [] + let pendingTextChunks: string[] = [] + let pendingMetadataEntries: VectorDocMeta[] = [] - for (let i = 0; i < filePaths.length; i++) { - const currentFilePath = filePaths[i]! - const { contentChunks, fileMetadata } = this.extractContentAndMetadata(currentFilePath) + for (let fileIndex = 0; fileIndex < markdownFilePaths.length; fileIndex++) { + const currentFilePath = markdownFilePaths[fileIndex]! + const { markdownChunks, fileMetadata } = await this.extractChunksAndMetadata(currentFilePath) - for (let chunkIndex = 0; chunkIndex < contentChunks.length; chunkIndex++) { - const textChunk = contentChunks[chunkIndex]! - pendingTextsToEmbed.push(textChunk) - pendingMetadataToInsert.push({ + for (let chunkIndex = 0; chunkIndex < markdownChunks.length; chunkIndex++) { + const textChunk = markdownChunks[chunkIndex]! + pendingTextChunks.push(textChunk) + pendingMetadataEntries.push({ ...fileMetadata, - id: `${fileMetadata['id']}_part_${chunkIndex}`, + id: `${fileMetadata['id']}_p${chunkIndex}`, title: `${fileMetadata['title']} (Part ${chunkIndex + 1})`, - snippet: textChunk, + snippet: textChunk }) - const isBatchFull = pendingTextsToEmbed.length >= EMBEDDING_BATCH_SIZE - if (isBatchFull) { - await this.processAndInsertEmbeddingBatch(pendingTextsToEmbed, pendingMetadataToInsert) - pendingTextsToEmbed = [] - pendingMetadataToInsert = [] + if (pendingTextChunks.length >= EMBEDDING_BATCH_SIZE) { + await this.insertEmbeddingBatchIntoIndex(pendingTextChunks, pendingMetadataEntries) + pendingTextChunks = [] + pendingMetadataEntries = [] } } - const isLogCheckpoint = (i + 1) % 10 === 0 - if (isLogCheckpoint) { - logger.debug(`Processed ${i + 1}/${filePaths.length} files...`) + const LOGGING_FREQUENCY = 10 + if ((fileIndex + 1) % LOGGING_FREQUENCY === 0) { + logger.debug(`Processed ${fileIndex + 1}/${markdownFilePaths.length} files...`) } } - const hasRemainingItems = pendingTextsToEmbed.length > 0 - if (hasRemainingItems) { - await this.processAndInsertEmbeddingBatch(pendingTextsToEmbed, pendingMetadataToInsert) + if (pendingTextChunks.length > 0) { + await this.insertEmbeddingBatchIntoIndex(pendingTextChunks, pendingMetadataEntries) } - await this.vectorIndex.endUpdate() } - private extractContentAndMetadata(filePath: string): { - contentChunks: string[] - fileMetadata: VectorDocMeta - } { - const fileContent = readFileSync(filePath, 'utf-8') - - const titleMatch = fileContent.match(/^# (.+)$/m) - const spaceMatch = fileContent.match(/^\*\*Space:\*\* (.+?)\s{2,}$/m) - const idMatch = fileContent.match(/^\*\*ID:\*\* (.+?)\s{2,}$/m) - const dateMatch = fileContent.match(/^\*\*Date:\*\* (.+?)\s{2,}$/m) - - const CHUNK_SIZE_CHARS = 1500 - const CHUNK_OVERLAP_CHARS = 100 - - const fileMetadata: VectorDocMeta = { - id: idMatch?.[1] ?? filePath, + private async extractChunksAndMetadata(filePath: string) { + const fileContent = await fileSystem.readFile(filePath, 'utf-8') + const fileMetadata = { + id: fileContent.match(/^\*\*ID:\*\* (.+?)\s{2,}$/m)?.[1] ?? filePath, path: filePath, - title: titleMatch?.[1] ?? 'Untitled', - spaceName: spaceMatch?.[1] ?? 'General', - date: dateMatch?.[1] ?? new Date().toISOString(), + title: fileContent.match(/^# (.+)$/m)?.[1] ?? 'Untitled', + spaceName: fileContent.match(/^\*\*Space:\*\* (.+?)\s{2,}$/m)?.[1] ?? 'General', + date: fileContent.match(/^\*\*Date:\*\* (.+?)\s{2,}$/m)?.[1] ?? new Date().toISOString(), + } + const MAXIMUM_CHARS_PER_CHUNK = 1500 + const OVERLAP_CHARS_BETWEEN_CHUNKS = 100 + return { + markdownChunks: chunkMarkdown(fileContent, MAXIMUM_CHARS_PER_CHUNK, OVERLAP_CHARS_BETWEEN_CHUNKS), + fileMetadata } - - const contentChunks = chunkMarkdown(fileContent, CHUNK_SIZE_CHARS, CHUNK_OVERLAP_CHARS) - - return { contentChunks, fileMetadata } } - private async processAndInsertEmbeddingBatch( - batchTexts: string[], - batchMetas: VectorDocMeta[] - ): Promise { + private async insertEmbeddingBatchIntoIndex(textChunks: string[], metadataEntries: VectorDocMeta[]) { try { - const embeddingVectors = await this.ollamaClient.embed(batchTexts) - - for (let i = 0; i < embeddingVectors.length; i++) { - const currentVector = embeddingVectors[i] - if (!currentVector) continue - - await this.vectorIndex.insertItem({ - vector: currentVector, - metadata: batchMetas[i] as Record, - }) + const embeddingVectors = await this.ollamaClient.embed(textChunks) + for (let vectorIndex = 0; vectorIndex < embeddingVectors.length; vectorIndex++) { + const vector = embeddingVectors[vectorIndex] + if (vector) { + await this.vectorIndex.insertItem({ + vector: vector, + metadata: metadataEntries[vectorIndex] as any + }) + } } - } catch (error) { - errorBus.emitError('Batch embedding failed', error) - } - } - - private async generateQueryEmbedding(query: string): Promise { - const [queryEmbeddingVector] = await this.ollamaClient.embed([query]) - if (!queryEmbeddingVector) { - throw new VectorStore.EmbeddingError('Failed to generate embedding for query') + } catch (embeddingError) { + errorBus.emitError('Batch embedding failed', embeddingError) } - return queryEmbeddingVector - } - - private async queryVectorIndex( - queryEmbedding: number[], - queryString: string, - resultLimit: number - ): Promise { - return this.vectorIndex.queryItems(queryEmbedding, queryString, resultLimit) - } - - private formatVectorSearchResults(rawResults: any[]): VectorSearchResult[] { - return rawResults.map((result) => ({ - meta: result.item.metadata as VectorDocMeta, - score: result.score, - })) } } diff --git a/src/utils/api-diagnostics.ts b/src/utils/api-diagnostics.ts index 4ecccbb..a852a7f 100644 --- a/src/utils/api-diagnostics.ts +++ b/src/utils/api-diagnostics.ts @@ -1,6 +1,6 @@ -import fs from 'node:fs/promises' +import fileSystem from 'node:fs/promises' import path from 'node:path' -import { logger } from './logger.js' +import { errorBus } from './error-bus.js' import type { Config } from './config.js' export interface ApiDiagnosticEntry { @@ -11,28 +11,27 @@ export interface ApiDiagnosticEntry { } export class ApiDiagnosticsWriter { - private readonly DEBUG_DIRECTORY = 'debug' - private readonly DIAGNOSTICS_FILENAME = 'api-diagnostics.jsonl' + private static readonly DEBUG_DIRECTORY = 'debug' + private static readonly DIAGNOSTICS_FILENAME = 'api-diagnostics.jsonl' - constructor(private readonly config: Config) {} + constructor(private readonly applicationConfig: Config) {} - async writeFailure(entry: Omit): Promise { - if (!this.config.debug) return + async writeFailure(failureEntry: Omit): Promise { + if (!this.applicationConfig.debug) return try { const diagnosticEntry: ApiDiagnosticEntry = { timestamp: new Date().toISOString(), - ...entry, + ...failureEntry, } - await fs.mkdir(this.DEBUG_DIRECTORY, { recursive: true }) - const diagnosticLogPath = path.join(this.DEBUG_DIRECTORY, this.DIAGNOSTICS_FILENAME) + await fileSystem.mkdir(ApiDiagnosticsWriter.DEBUG_DIRECTORY, { recursive: true }) + const diagnosticLogPath = path.join(ApiDiagnosticsWriter.DEBUG_DIRECTORY, ApiDiagnosticsWriter.DIAGNOSTICS_FILENAME) - const entryAsJsonLine = JSON.stringify(diagnosticEntry) + '\n' - await fs.appendFile(diagnosticLogPath, entryAsJsonLine, 'utf8') - } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error) - logger.warn(`Failed to write API diagnostic: ${errorMessage}`) + const diagnosticEntryAsJsonLine = JSON.stringify(diagnosticEntry) + '\n' + await fileSystem.appendFile(diagnosticLogPath, diagnosticEntryAsJsonLine, 'utf8') + } catch (failureError) { + errorBus.emitError('Failed to write API diagnostic', failureError) } } } diff --git a/src/utils/chunking.ts b/src/utils/chunking.ts index 7fa4c25..1e881b9 100644 --- a/src/utils/chunking.ts +++ b/src/utils/chunking.ts @@ -1,44 +1,40 @@ -export function chunkMarkdown(markdown: string, maxChars = 1500, overlapChars = 150): string[] { - const HEADER_OR_RULE_REGEX = /(?=^#{1,3}\s)|(?=^---)/gm - - const sections = markdown.split(HEADER_OR_RULE_REGEX) - - const chunks: string[] = [] - let currentChunk = '' - - for (const section of sections) { - const trimmedSection = section.trim() - if (!trimmedSection) continue - - const wouldExceedMaxSize = currentChunk.length + trimmedSection.length > maxChars - const isCurrentChunkPopulated = currentChunk.length > 0 - - if (wouldExceedMaxSize && isCurrentChunkPopulated) { - chunks.push(currentChunk.trim()) - - const overlapText = currentChunk.slice(-overlapChars).replace(/^---\s*/, '') - currentChunk = overlapText + '\n\n' + trimmedSection +export function chunkMarkdown(sourceMarkdownText: string, maximumCharactersPerChunk = 1500, overlapCharactersBetweenChunks = 150): string[] { + const MARKDOWN_SECTION_MARKER_REGEX = /(?=^#{1,3}\s)|(?=^---)/gm + const markdownSections = sourceMarkdownText.split(MARKDOWN_SECTION_MARKER_REGEX) + const resultChunks: string[] = [] + let currentChunkTextBuffer = '' + + for (const markdownSection of markdownSections) { + const trimmedSectionText = markdownSection.trim() + if (!trimmedSectionText) continue + + const isChunkFull = currentChunkTextBuffer.length + trimmedSectionText.length > maximumCharactersPerChunk + const hasExistingContentInChunk = currentChunkTextBuffer.length > 0 + + if (isChunkFull && hasExistingContentInChunk) { + resultChunks.push(currentChunkTextBuffer.trim()) + const overlapText = currentChunkTextBuffer.slice(-overlapCharactersBetweenChunks).replace(/^---\s*/, '') + currentChunkTextBuffer = overlapText + '\n\n' + trimmedSectionText } else { - const separator = currentChunk ? '\n\n' : '' - currentChunk += separator + trimmedSection + const sectionSeparator = currentChunkTextBuffer ? '\n\n' : '' + currentChunkTextBuffer += sectionSeparator + trimmedSectionText } } - const trimmedRemainingChunk = currentChunk.trim() - if (trimmedRemainingChunk.length > 0) { - chunks.push(trimmedRemainingChunk) + if (currentChunkTextBuffer.trim()) { + resultChunks.push(currentChunkTextBuffer.trim()) } - const MAX_CHUNK_THRESHOLD = maxChars + 500 - return chunks.flatMap((chunk) => { - if (chunk.length <= MAX_CHUNK_THRESHOLD) { - return [chunk] + return resultChunks.flatMap(accumulatedChunk => { + const MAXIMUM_ALLOWED_CHUNK_SIZE_BEFORE_SPLITTING = maximumCharactersPerChunk + 500 + if (accumulatedChunk.length <= MAXIMUM_ALLOWED_CHUNK_SIZE_BEFORE_SPLITTING) { + return [accumulatedChunk] } - const oversizedSubChunks: string[] = [] - for (let offset = 0; offset < chunk.length; offset += maxChars) { - oversizedSubChunks.push(chunk.slice(offset, offset + maxChars)) + const subChunksAfterSplittingLargeBlock: string[] = [] + for (let currentOffset = 0; currentOffset < accumulatedChunk.length; currentOffset += maximumCharactersPerChunk) { + subChunksAfterSplittingLargeBlock.push(accumulatedChunk.slice(currentOffset, currentOffset + maximumCharactersPerChunk)) } - return oversizedSubChunks + return subChunksAfterSplittingLargeBlock }) } diff --git a/src/utils/config.ts b/src/utils/config.ts index eca3ac5..f656066 100644 --- a/src/utils/config.ts +++ b/src/utils/config.ts @@ -1,101 +1,87 @@ -import { config as loadEnv } from 'dotenv' +import { config as loadEnvironmentVariables } from 'dotenv' import { existsSync, mkdirSync } from 'node:fs' import { dirname, join } from 'node:path' import { z } from 'zod' -import { logger } from './logger.js' +import { errorBus } from './error-bus.js' -loadEnv() +loadEnvironmentVariables() -const configSchema = z.object({ +const applicationConfigurationSchema = z.object({ authStoragePath: z.string().min(1), waitMode: z.enum(['dynamic', 'static']), - rateLimitMs: z.number().int().positive(), - parallelWorkers: z.number().int().min(1).max(20), - checkpointSaveInterval: z.number().int().positive(), + rateLimitMs: z.coerce.number().int().positive(), + parallelWorkers: z.coerce.number().int().min(1).max(20), + checkpointSaveInterval: z.coerce.number().int().positive(), exportDir: z.string().min(1), checkpointPath: z.string().min(1), vectorIndexPath: z.string().min(1), ollamaUrl: z.string().url(), ollamaModel: z.string().min(1), ollamaEmbedModel: z.string().min(1), - enableVectorSearch: z - .string() - .optional() - .transform((val) => val === 'true'), - headless: z.union([z.boolean(), z.literal('new')]), - debug: z.boolean(), + enableVectorSearch: z.string().optional().transform((value) => value === 'true'), + headless: z.preprocess((value) => { + if (value === 'true') return true + if (value === 'false') return false + return value + }, z.union([z.boolean(), z.literal('new')])), + debug: z.preprocess((value) => value === 'true', z.boolean()), }) -export type Config = z.infer -export type WaitMode = Config['waitMode'] +export type Config = z.infer -function parseEnvConfig(): Config { - const DEFAULT_OLLAMA_URL = 'http://localhost:11434' - const DEFAULT_RATE_LIMIT_MS = '500' - const DEFAULT_PARALLEL_WORKERS = '5' - const DEFAULT_CHECKPOINT_INTERVAL = '10' - - const rawHeadlessValue = process.env['HEADLESS'] ?? 'false' - let headless: boolean | 'new' = false - if (rawHeadlessValue === 'true') { - headless = true - } else if (rawHeadlessValue === 'new') { - headless = 'new' - } - - const rawConfig = { +function parseConfigurationFromEnvironment(): Config { + const rawEnvironmentValues = { authStoragePath: process.env['AUTH_STORAGE_PATH'] ?? join('.storage', 'auth.json'), waitMode: process.env['WAIT_MODE'] ?? 'dynamic', - rateLimitMs: parseInt(process.env['RATE_LIMIT_MS'] ?? DEFAULT_RATE_LIMIT_MS, 10), - parallelWorkers: parseInt(process.env['PARALLEL_WORKERS'] ?? DEFAULT_PARALLEL_WORKERS, 10), - checkpointSaveInterval: parseInt( - process.env['CHECKPOINT_SAVE_INTERVAL'] ?? DEFAULT_CHECKPOINT_INTERVAL, - 10 - ), + rateLimitMs: process.env['RATE_LIMIT_MS'] ?? '500', + parallelWorkers: process.env['PARALLEL_WORKERS'] ?? '5', + checkpointSaveInterval: process.env['CHECKPOINT_SAVE_INTERVAL'] ?? '10', exportDir: process.env['EXPORT_DIR'] ?? 'exports', checkpointPath: process.env['CHECKPOINT_PATH'] ?? join('.storage', 'checkpoint.json'), vectorIndexPath: process.env['VECTOR_INDEX_PATH'] ?? join('.storage', 'vector-index'), - ollamaUrl: process.env['OLLAMA_URL'] ?? DEFAULT_OLLAMA_URL, + ollamaUrl: process.env['OLLAMA_URL'] ?? 'http://localhost:11434', ollamaModel: process.env['OLLAMA_MODEL'] ?? 'llama3.1', ollamaEmbedModel: process.env['OLLAMA_EMBED_MODEL'] ?? 'nomic-embed-text', enableVectorSearch: process.env['ENABLE_VECTOR_SEARCH'], - headless: headless, - debug: process.env['DEBUG'] === 'true', + headless: process.env['HEADLESS'] ?? 'false', + debug: process.env['DEBUG'] ?? 'false', } - const result = configSchema.safeParse(rawConfig) + const validationResult = applicationConfigurationSchema.safeParse(rawEnvironmentValues) - if (!result.success) { - logger.error('Invalid configuration detected:') - result.error.issues.forEach((issue) => { - const fieldPath = issue.path.join('.') - const envVarName = camelToSnakeCase(fieldPath).toUpperCase() - logger.error(` ${envVarName}: ${issue.message}`) + if (!validationResult.success) { + validationResult.error.issues.forEach((validationIssue) => { + const configurationFieldPath = validationIssue.path.join('.') + const environmentVariableName = configurationFieldPath.replace(/[A-Z]/g, (letter) => `_${letter.toLowerCase()}`).toUpperCase() + errorBus.emitError(`Configuration error: ${environmentVariableName} - ${validationIssue.message}`) }) - logger.error('\nPlease check your .env file and fix the above errors.') process.exit(1) } - return result.data + return validationResult.data } -function camelToSnakeCase(camelStr: string): string { - return camelStr.replace(/[A-Z]/g, (letter) => `_${letter.toLowerCase()}`) -} +export const config: Config = parseConfigurationFromEnvironment() -function ensureDirectoryExistsForFile(filePath: string): void { - const dirPath = dirname(filePath) - if (!existsSync(dirPath)) { - mkdirSync(dirPath, { recursive: true }) +function ensureDirectoryExistsForPath(targetPath: string) { + const directoryPath = dirname(targetPath) + try { + if (!existsSync(directoryPath)) { + mkdirSync(directoryPath, { recursive: true }) + } + } catch (filesystemError) { + errorBus.emitError(`Failed to create directory for path: ${targetPath}`, filesystemError) } } -export const config: Config = parseEnvConfig() - -ensureDirectoryExistsForFile(config.authStoragePath) -ensureDirectoryExistsForFile(config.checkpointPath) -ensureDirectoryExistsForFile(config.vectorIndexPath) +ensureDirectoryExistsForPath(config.authStoragePath) +ensureDirectoryExistsForPath(config.checkpointPath) +ensureDirectoryExistsForPath(config.vectorIndexPath) -if (!existsSync(config.exportDir)) { - mkdirSync(config.exportDir, { recursive: true }) +try { + if (!existsSync(config.exportDir)) { + mkdirSync(config.exportDir, { recursive: true }) + } +} catch (exportDirectoryError) { + errorBus.emitError(`Failed to create export directory`, exportDirectoryError) } diff --git a/src/utils/error-bus.ts b/src/utils/error-bus.ts index 093ea79..4c2cc84 100644 --- a/src/utils/error-bus.ts +++ b/src/utils/error-bus.ts @@ -11,27 +11,36 @@ export interface AppError { class ErrorBus extends EventEmitter { constructor() { super() - this.on('error', (appError: AppError) => { - const contextSuffix = appError.context - ? ` | Context: ${JSON.stringify(appError.context)}` - : '' - logger.error(`${appError.message}${contextSuffix}`) - - const isDebugEnabled = process.env['DEBUG'] === 'true' || process.env['DEBUG_MODE'] === 'true' - if (appError.error && isDebugEnabled) { - console.error(appError.error) - } - }) + this.on('error', () => {}) } - emitError(message: string, error?: unknown, context?: Record): void { - const appError: AppError = { - message, - error, - context, + emitError(errorMessage: string, errorObject?: unknown, contextMetadata?: Record): void { + const applicationError: AppError = { + message: errorMessage, + error: errorObject, + context: contextMetadata, timestamp: new Date(), } - this.emit('error', appError) + this.emit('error', applicationError) + this.logApplicationError(applicationError) + } + + raiseError(errorMessage: string, errorObject?: unknown, contextMetadata?: Record): never { + this.emitError(errorMessage, errorObject, contextMetadata) + if (errorObject instanceof Error) { + throw errorObject + } + throw new Error(errorMessage) + } + + private logApplicationError(applicationError: AppError): void { + const contextSuffix = applicationError.context ? ` | Context: ${JSON.stringify(applicationError.context)}` : '' + logger.error(`${applicationError.message}${contextSuffix}`) + + const isDebugModeActive = process.env['DEBUG'] === 'true' + if (applicationError.error && isDebugModeActive) { + console.error(applicationError.error) + } } } diff --git a/src/utils/http-logger.ts b/src/utils/http-logger.ts index b3dc229..33d3667 100644 --- a/src/utils/http-logger.ts +++ b/src/utils/http-logger.ts @@ -1,20 +1,19 @@ import { appendFileSync, existsSync, mkdirSync } from 'node:fs' import { join } from 'node:path' -import type { Request, Response } from '@playwright/test' -import { config } from './config.js' +import type { Request, Response } from 'patchright' +import { errorBus } from './error-bus.js' -const LOGS_DIRECTORY = 'logs' -const LOG_FILE_TIMESTAMP = new Date().toISOString().replace(/[:.]/g, '-') -const HTTP_LOG_FILENAME = `http-req-res-log-${LOG_FILE_TIMESTAMP}.txt` -const HTTP_LOG_PATH = join(LOGS_DIRECTORY, HTTP_LOG_FILENAME) +const LOGS_DIRECTORY_NAME = 'logs' +const LOG_FILE_TIMESTAMP_SUFFIX = new Date().toISOString().replace(/[:.]/g, '-') +const HTTP_REQUEST_RESPONSE_LOG_FILENAME = `http-request-response-log-${LOG_FILE_TIMESTAMP_SUFFIX}.txt` +const HTTP_LOG_FULL_PATH = join(LOGS_DIRECTORY_NAME, HTTP_REQUEST_RESPONSE_LOG_FILENAME) -const SENSITIVE_HEADERS = ['authorization', 'cookie', 'set-cookie', 'x-api-key'] -const PROMPT_KEYWORDS = ['"query"', '"prompt"', '"messages"'] +const SENSITIVE_HEADER_NAMES = ['authorization', 'cookie', 'set-cookie', 'x-api-key'] +const PROMPT_KEYWORD_INDICATORS = ['"query"', '"prompt"', '"messages"'] -function redactSensitiveHeaders(headers: Record): Record { - const redactedHeaders = { ...headers } - - for (const headerKey of SENSITIVE_HEADERS) { +function redactSensitiveHeaders(headersRecord: Record): Record { + const redactedHeaders = { ...headersRecord } + for (const headerKey of SENSITIVE_HEADER_NAMES) { if (redactedHeaders[headerKey]) { redactedHeaders[headerKey] = '[REDACTED]' } @@ -22,89 +21,70 @@ function redactSensitiveHeaders(headers: Record): Record postData.includes(keyword)) - if (containsPromptKeyword) { - return true - } + return PROMPT_KEYWORD_INDICATORS.some(keyword => requestPostData.includes(keyword)) } } return false } -function ensureLogsDirectoryExists(): void { - if (!existsSync(LOGS_DIRECTORY)) { - mkdirSync(LOGS_DIRECTORY, { recursive: true }) - } -} +export function logHttpRequest(webRequest: Request, isDebugModeEnabled: boolean): void { + if (!isDebugModeEnabled) return -export async function logHttpRequest(request: Request): Promise { - if (!config.debug) return - - ensureLogsDirectoryExists() - - const requestUrl = request.url() - const requestMethod = request.method() - const sanitizedHeaders = redactSensitiveHeaders(request.headers()) - const rawPostData = request.postData() + try { + if (!existsSync(LOGS_DIRECTORY_NAME)) { + mkdirSync(LOGS_DIRECTORY_NAME, { recursive: true }) + } - const requestBody = isPromptRequest(requestUrl, rawPostData) ? '[PROMPT REDACTED]' : rawPostData + const bodyDisplayContent = isRequestContainingUserPrompts(webRequest.url(), webRequest.postData()) + ? '[PROMPT REDACTED]' + : webRequest.postData() - const logTimestamp = new Date().toISOString() - const logEntry = [ - `[${logTimestamp}] REQUEST: ${requestMethod} ${requestUrl}`, - `Headers: ${JSON.stringify(sanitizedHeaders, null, 2)}`, - `Body: ${requestBody ?? 'None'}`, - '--------------------------------------------------------------------------------', - ].join('\n') + const logEntryText = `[${new Date().toISOString()}] REQUEST: ${webRequest.method()} ${webRequest.url()}\n` + + `Headers: ${JSON.stringify(redactSensitiveHeaders(webRequest.headers()), null, 2)}\n` + + `Body: ${bodyDisplayContent ?? 'None'}\n` + + '--------------------------------------------------------------------------------\n' - appendFileSync(HTTP_LOG_PATH, logEntry + '\n') + appendFileSync(HTTP_LOG_FULL_PATH, logEntryText) + } catch (loggingError) { + errorBus.emitError('HTTP Request logging failed', loggingError) + } } -export async function logHttpResponse(response: Response): Promise { - if (!config.debug) return - - const originalRequest = response.request() - const responseUrl = originalRequest.url() - const responseStatus = response.status() - const sanitizedHeaders = redactSensitiveHeaders(response.headers()) - - let responseBody = '[BODY SKIPPED]' - - const contentType = sanitizedHeaders['content-type'] ?? '' - const isJsonContent = contentType.includes('application/json') - const isPrompt = isPromptRequest(responseUrl, originalRequest.postData()) - - if (isJsonContent && !isPrompt) { - try { - const jsonResponse = await response.json() - responseBody = JSON.stringify(jsonResponse, null, 2) - } catch { - responseBody = '[COULD NOT PARSE JSON BODY]' +export async function logHttpResponse(webResponse: Response, isDebugModeEnabled: boolean): Promise { + if (!isDebugModeEnabled) return + + try { + const originalRequest = webResponse.request() + let responseBodyDisplay = '[BODY SKIPPED]' + const responseContentType = webResponse.headers()['content-type'] ?? '' + const isJsonContent = responseContentType.includes('json') + const isRequestAPrompt = isRequestContainingUserPrompts(originalRequest.url(), originalRequest.postData()) + + if (isJsonContent && !isRequestAPrompt) { + try { + responseBodyDisplay = JSON.stringify(await webResponse.json(), null, 2) + } catch { + responseBodyDisplay = '[PARSE ERROR]' + } } - } - const logTimestamp = new Date().toISOString() - const logEntry = [ - `[${logTimestamp}] RESPONSE: ${responseStatus} ${responseUrl}`, - `Headers: ${JSON.stringify(sanitizedHeaders, null, 2)}`, - `Body: ${responseBody}`, - '--------------------------------------------------------------------------------', - ].join('\n') + const logEntryText = `[${new Date().toISOString()}] RESPONSE: ${webResponse.status()} ${webResponse.url()}\n` + + `Headers: ${JSON.stringify(redactSensitiveHeaders(webResponse.headers()), null, 2)}\n` + + `Body: ${responseBodyDisplay}\n` + + '--------------------------------------------------------------------------------\n' - appendFileSync(HTTP_LOG_PATH, logEntry + '\n') + appendFileSync(HTTP_LOG_FULL_PATH, logEntryText) + } catch (loggingError) { + errorBus.emitError('HTTP Response logging failed', loggingError) + } } diff --git a/src/utils/logger.ts b/src/utils/logger.ts index 2a54b04..dc2675d 100644 --- a/src/utils/logger.ts +++ b/src/utils/logger.ts @@ -2,58 +2,54 @@ import chalk from 'chalk' import { appendFileSync, mkdirSync, existsSync } from 'node:fs' import { join } from 'node:path' -const IS_DEBUG_MODE = - process.env['DEBUG_MODE'] === 'true' || process.env['DIAGNOSIS_MODE'] === 'true' -const LOGS_DIRECTORY = 'logs' -const LOG_FILE_TIMESTAMP = new Date().toISOString().replace(/[:.]/g, '-') -const MAIN_LOG_FILENAME = `main-log-${LOG_FILE_TIMESTAMP}.txt` -const MAIN_LOG_PATH = join(LOGS_DIRECTORY, MAIN_LOG_FILENAME) +function isVerboseLoggingEnabled(): boolean { + return process.env['DEBUG'] === 'true' +} + +const LOGS_ROOT_DIRECTORY = 'logs' +const LOG_FILE_TIMESTAMP_IDENTIFIER = new Date().toISOString().replace(/[:.]/g, '-') +const MAIN_APPLICATION_LOG_FILENAME = `main-log-${LOG_FILE_TIMESTAMP_IDENTIFIER}.txt` +const MAIN_LOG_FILE_PATH = join(LOGS_ROOT_DIRECTORY, MAIN_APPLICATION_LOG_FILENAME) -function writeToLogFile(message: string): void { - if (!IS_DEBUG_MODE) return +function writeMessageToFile(logMessage: string): void { + if (!isVerboseLoggingEnabled()) return - if (!existsSync(LOGS_DIRECTORY)) { - mkdirSync(LOGS_DIRECTORY, { recursive: true }) + if (!existsSync(LOGS_ROOT_DIRECTORY)) { + mkdirSync(LOGS_ROOT_DIRECTORY, { recursive: true }) } - const ANSI_ESCAPE_REGEX = /\x1b\[[0-9;]*m/g - const plainTextLines = message.replace(ANSI_ESCAPE_REGEX, '') - const logTimestamp = new Date().toISOString() + const ANSI_COLOR_CODE_REGEX = /\x1b\[[0-9;]*m/g + const plainTextMessage = logMessage.replace(ANSI_COLOR_CODE_REGEX, '') + const currentTimestamp = new Date().toISOString() - appendFileSync(MAIN_LOG_PATH, `[${logTimestamp}] ${plainTextLines}\n`) + appendFileSync(MAIN_LOG_FILE_PATH, `[${currentTimestamp}] ${plainTextMessage}\n`) } export const logger = { - info(...args: unknown[]): void { - const message = args.join(' ') - console.log(chalk.blue('ℹ'), message) - writeToLogFile(`INFO: ${message}`) + info(...messageArguments: unknown[]): void { + const combinedMessage = messageArguments.join(' ') + console.log(chalk.blue('ℹ'), combinedMessage) + writeMessageToFile(`INFO: ${combinedMessage}`) }, - - success(...args: unknown[]): void { - const message = args.join(' ') - console.log(chalk.green('✓'), message) - writeToLogFile(`SUCCESS: ${message}`) + success(...messageArguments: unknown[]): void { + const combinedMessage = messageArguments.join(' ') + console.log(chalk.green('✓'), combinedMessage) + writeMessageToFile(`SUCCESS: ${combinedMessage}`) }, - - warn(...args: unknown[]): void { - const message = args.join(' ') - console.log(chalk.yellow('⚠'), message) - writeToLogFile(`WARN: ${message}`) + warn(...messageArguments: unknown[]): void { + const combinedMessage = messageArguments.join(' ') + console.log(chalk.yellow('⚠'), combinedMessage) + writeMessageToFile(`WARN: ${combinedMessage}`) }, - - error(...args: unknown[]): void { - const message = args.join(' ') - console.error(chalk.red('✗'), message) - writeToLogFile(`ERROR: ${message}`) + error(...messageArguments: unknown[]): void { + const combinedMessage = messageArguments.join(' ') + console.error(chalk.red('✗'), combinedMessage) + writeMessageToFile(`ERROR: ${combinedMessage}`) }, - - debug(...args: unknown[]): void { - const isVerboseDebug = process.env['DEBUG'] === 'true' - if (!isVerboseDebug) return - - const message = args.join(' ') - console.log(chalk.gray('›'), message) - writeToLogFile(`DEBUG: ${message}`) + debug(...messageArguments: unknown[]): void { + if (!isVerboseLoggingEnabled()) return + const combinedMessage = messageArguments.join(' ') + console.log(chalk.gray('›'), combinedMessage) + writeMessageToFile(`DEBUG: ${combinedMessage}`) }, } diff --git a/src/utils/wait-strategy.ts b/src/utils/wait-strategy.ts index 712549d..5ae35f7 100644 --- a/src/utils/wait-strategy.ts +++ b/src/utils/wait-strategy.ts @@ -1,61 +1,46 @@ -import type { Page } from '@playwright/test' +import type { Page } from 'patchright' import { type Config } from './config.js' export interface WaitStrategy { - afterClick(page: Page): Promise - afterScroll(page: Page): Promise - forSelector(page: Page, selector: string): Promise + afterClick(webPage: Page): Promise + afterScroll(webPage: Page): Promise + forSelector(webPage: Page, elementSelector: string): Promise } -class DynamicWaitStrategy implements WaitStrategy { - private static readonly NETWORK_IDLE_TIMEOUT_MS = 2000 - private static readonly SELECTOR_TIMEOUT_MS = 5000 - - async afterClick(page: Page): Promise { - await page - .waitForLoadState('networkidle', { timeout: DynamicWaitStrategy.NETWORK_IDLE_TIMEOUT_MS }) - .catch(() => {}) +class DynamicNetworkWaitStrategy implements WaitStrategy { + async afterClick(webPage: Page): Promise { + const NETWORK_IDLE_TIMEOUT_MILLISECONDS = 2000 + await webPage.waitForLoadState('networkidle', { timeout: NETWORK_IDLE_TIMEOUT_MILLISECONDS }).catch(() => {}) } - async afterScroll(page: Page): Promise { - await page.waitForLoadState('domcontentloaded') + async afterScroll(webPage: Page): Promise { + await webPage.waitForLoadState('domcontentloaded') } - async forSelector(page: Page, selector: string): Promise { - await page.waitForSelector(selector, { - state: 'visible', - timeout: DynamicWaitStrategy.SELECTOR_TIMEOUT_MS, - }) + async forSelector(webPage: Page, elementSelector: string): Promise { + const SELECTOR_VISIBILITY_TIMEOUT_MILLISECONDS = 5000 + await webPage.waitForSelector(elementSelector, { state: 'visible', timeout: SELECTOR_VISIBILITY_TIMEOUT_MILLISECONDS }) } } -class StaticWaitStrategy implements WaitStrategy { - private readonly baseDelayMs: number - - constructor(delayMs: number) { - this.baseDelayMs = delayMs - } - - private async randomPause(page: Page): Promise { - const jitter = Math.floor(this.baseDelayMs * 0.5 * Math.random()) - const totalWaitTime = this.baseDelayMs + jitter - await page.waitForTimeout(totalWaitTime) - } - - async afterClick(page: Page): Promise { - await this.randomPause(page) - } +class StaticDelayWaitStrategy implements WaitStrategy { + constructor(private readonly baseDelayMilliseconds: number) {} - async afterScroll(page: Page): Promise { - await this.randomPause(page) + private async pauseWithJitter(webPage: Page) { + const jitterFactor = 0.5 + const randomJitter = Math.random() * this.baseDelayMilliseconds * jitterFactor + const totalWaitTime = this.baseDelayMilliseconds + randomJitter + await webPage.waitForTimeout(totalWaitTime) } - async forSelector(page: Page, _selector: string): Promise { - await this.randomPause(page) - } + async afterClick(webPage: Page) { await this.pauseWithJitter(webPage) } + async afterScroll(webPage: Page) { await this.pauseWithJitter(webPage) } + async forSelector(webPage: Page) { await this.pauseWithJitter(webPage) } } -export const waitStrategy = (config: Config): WaitStrategy => { - const isDynamicMode = config.waitMode === 'dynamic' - return isDynamicMode ? new DynamicWaitStrategy() : new StaticWaitStrategy(config.rateLimitMs) +export const createWaitStrategy = (applicationConfig: Config): WaitStrategy => { + const isDynamicMode = applicationConfig.waitMode === 'dynamic' + return isDynamicMode + ? new DynamicNetworkWaitStrategy() + : new StaticDelayWaitStrategy(applicationConfig.rateLimitMs) } diff --git a/test/e2e/export-to-search-flow.e2e.test.ts b/test/e2e/export-to-search-flow.e2e.test.ts new file mode 100644 index 0000000..7d71efa --- /dev/null +++ b/test/e2e/export-to-search-flow.e2e.test.ts @@ -0,0 +1,123 @@ +import { describe, it, expect, beforeAll, afterAll, vi } from 'vitest' +import { ExportHandler } from '../../src/repl/handlers/export.js' +import { SearchOrchestrator } from '../../src/search/search-orchestrator.js' +import { CheckpointManager } from '../../src/scraper/checkpoint-manager.js' +import { OllamaClient } from '../../src/ai/ollama-client.js' +import { chromium, type Browser } from 'patchright' +import { type Config } from '../../src/utils/config.js' +import { fileURLToPath } from 'node:url' +import { dirname, join } from 'node:path' +import { existsSync, rmSync, mkdirSync } from 'node:fs' +import { BrowserManager } from '../../src/scraper/browser.js' +import { WorkerPool } from '../../src/scraper/worker-pool.js' + +const __dirname = dirname(fileURLToPath(import.meta.url)) +const TEST_STORAGE_DIR = join(__dirname, '../../.test-storage-e2e') + +const mockApplicationConfig: Config = { + authStoragePath: join(TEST_STORAGE_DIR, 'auth.json'), + waitMode: 'static', + rateLimitMs: 1, + parallelWorkers: 1, + checkpointSaveInterval: 1, + exportDir: join(TEST_STORAGE_DIR, 'exports'), + checkpointPath: join(TEST_STORAGE_DIR, 'checkpoint.json'), + vectorIndexPath: join(TEST_STORAGE_DIR, 'vector-index'), + ollamaUrl: 'http://localhost:11434', + ollamaModel: 'llama3.1', + ollamaEmbedModel: 'nomic-embed-text', + enableVectorSearch: true, + headless: true, + debug: false, +} + +describe('Export to Search E2E Flow', () => { + let browser: Browser + + beforeAll(async () => { + if (existsSync(TEST_STORAGE_DIR)) { + rmSync(TEST_STORAGE_DIR, { recursive: true, force: true }) + } + mkdirSync(TEST_STORAGE_DIR, { recursive: true }) + browser = await chromium.launch({ headless: true }) + }) + + afterAll(async () => { + await browser?.close() + if (existsSync(TEST_STORAGE_DIR)) { + rmSync(TEST_STORAGE_DIR, { recursive: true, force: true }) + } + vi.restoreAllMocks() + }) + + it('should complete the full flow from export to semantic search', async () => { + const browserContext = await browser.newContext() + + await browserContext.route('**/api/auth/session', async (route) => { + await route.fulfill({ status: 200, contentType: 'application/json', body: JSON.stringify({ user: { id: 'user-1' } }) }) + }) + await browserContext.route('**/settings', async (route) => { + await route.fulfill({ status: 200, contentType: 'text/html', body: 'Settings' }) + }) + await browserContext.route('**/rest/thread/list_ask_threads**', async (route) => { + const mockedThreads = [ + { uuid: 'refactor-tips', slug: 'refactor-tips', title: 'Refactoring Tips', total_threads: 1 } + ] + await route.fulfill({ status: 200, contentType: 'application/json', body: JSON.stringify(mockedThreads) }) + }) + await browserContext.route('**/rest/userinfo', async (route) => { + await route.fulfill({ status: 200, contentType: 'application/json', body: JSON.stringify({}) }) + }) + await browserContext.route('**/search/refactor-tips', async (route) => { + const html = `` + await route.fulfill({ status: 200, contentType: 'text/html', body: html }) + }) + await browserContext.route('**/rest/thread/refactor-tips', async (route) => { + const threadData = { + entries: [{ + thread_title: 'Refactoring Tips', + query_str: 'How to refactor?', + blocks: [{ markdown_block: { answer: 'Small commits.' } }], + updated_datetime: new Date().toISOString() + }] + } + await route.fulfill({ status: 200, contentType: 'application/json', body: JSON.stringify(threadData) }) + }) + + const checkpointManager = new CheckpointManager(mockApplicationConfig) + const searchOrchestrator = new SearchOrchestrator(mockApplicationConfig) + const exportHandler = new ExportHandler(mockApplicationConfig, checkpointManager, searchOrchestrator) + + vi.spyOn(BrowserManager.prototype, 'launch').mockResolvedValue(await browserContext.newPage()) + vi.spyOn(BrowserManager.prototype, 'close').mockResolvedValue(undefined) + + vi.spyOn(WorkerPool.prototype, 'initialize').mockImplementation(async function(this: any) { + (this as any).sharedBrowserContext = browserContext; + for (let i = 0; i < (this as any).applicationConfig.parallelWorkers; i++) { + (this as any).activeWorkers.push({ + workerId: i, + conversationExtractor: new (await import('../../src/scraper/conversation-extractor.js')).ConversationExtractor((this as any).applicationConfig, browserContext), + isCurrentlyBusy: false, + }) + } + }) + + await exportHandler.handleStartLibraryExport() + + const exportFile = join(mockApplicationConfig.exportDir, 'General', 'Refactoring_Tips (refactor-tips).md') + expect(existsSync(exportFile)).toBe(true) + + vi.spyOn(OllamaClient.prototype, 'embed').mockImplementation(async (texts: string[]) => { + return texts.map(() => [1, 0, 0]) + }) + vi.spyOn(OllamaClient.prototype, 'validate').mockResolvedValue(undefined) + + await searchOrchestrator.vectorizeNow() + + const searchResults = await (searchOrchestrator as any).vectorStore.search('refactor') + expect(searchResults.length).toBeGreaterThan(0) + expect(searchResults[0].meta.title).toContain('Refactoring Tips') + + await browserContext.close() + }, 60000) +}) diff --git a/test/e2e/scraper-critical-path.e2e.test.ts b/test/e2e/scraper-critical-path.e2e.test.ts index 302b4d7..ae44709 100644 --- a/test/e2e/scraper-critical-path.e2e.test.ts +++ b/test/e2e/scraper-critical-path.e2e.test.ts @@ -1,39 +1,38 @@ import { describe, it, expect, beforeAll, afterAll } from 'vitest' -import { chromium, type Browser, type BrowserContext } from '@playwright/test' +import { chromium, type Browser, type BrowserContext } from 'patchright' import { ConversationExtractor } from '../../src/scraper/conversation-extractor.js' -import { config } from '../../src/utils/config.js' +import { config as applicationConfiguration } from '../../src/utils/config.js' import { existsSync, rmSync } from 'node:fs' -const TEST_OUTPUT = './test-output-e2e' +const TEST_OUTPUT_DIRECTORY = './test-output-e2e' describe('Scraper E2E - Critical Path', () => { - let browser: Browser - let context: BrowserContext + let browserInstance: Browser + let browserContext: BrowserContext beforeAll(async () => { - browser = await chromium.launch({ headless: true }) - if (existsSync(TEST_OUTPUT)) rmSync(TEST_OUTPUT, { recursive: true }) + browserInstance = await chromium.launch({ headless: true }) + if (existsSync(TEST_OUTPUT_DIRECTORY)) { + rmSync(TEST_OUTPUT_DIRECTORY, { recursive: true }) + } }) afterAll(async () => { - await browser?.close() - if (existsSync(TEST_OUTPUT)) rmSync(TEST_OUTPUT, { recursive: true }) + await browserInstance?.close() + if (existsSync(TEST_OUTPUT_DIRECTORY)) { + rmSync(TEST_OUTPUT_DIRECTORY, { recursive: true }) + } }) - // Skip this - requires real authenticated Perplexity session - it.skip('should complete full workflow: discover → extract → save', async () => { - // Manual test only - replace URL with real conversation from your account - }, 60000) + it('should handle missing/invalid URL gracefully by raising a descriptive error', async () => { + browserContext = await browserInstance.newContext() + const conversationExtractor = new ConversationExtractor(applicationConfiguration, browserContext) - it('should handle missing/invalid URL gracefully without crashing', async () => { - context = await browser.newContext() - const extractor = new ConversationExtractor(config, context) - - // ✅ Now we expect it to THROW with a descriptive error + // We expect the extraction to fail with an authentication or status error for a nonexistent thread await expect( - extractor.extract('https://www.perplexity.ai/search/nonexistent-xyz-12345') - ).rejects.toThrow(/Authentication required|403|401|No API response/) + conversationExtractor.extract('https://www.perplexity.ai/search/nonexistent-xyz-12345') + ).rejects.toThrow(/Auth required or expired|Authentication required|403|401|No API response/) - await context.close() + await browserContext.close() }, 30000) }) diff --git a/test/integration/ollama-client-mocked.integration.test.ts b/test/integration/ollama-client-mocked.integration.test.ts deleted file mode 100644 index 4744b81..0000000 --- a/test/integration/ollama-client-mocked.integration.test.ts +++ /dev/null @@ -1,55 +0,0 @@ -import { describe, it, expect, beforeAll, afterEach, afterAll } from 'vitest' -import { setupServer } from 'msw/node' -import { http, HttpResponse } from 'msw' -import { OllamaClient } from '../../src/ai/ollama-client.js' -import { config } from '../../src/utils/config.js' - -const mockEmbeddingsResponse = [{ embedding: [0.1, 0.2, 0.3] }, { embedding: [0.4, 0.5, 0.6] }] - -const mswServer = setupServer( - http.post(`${config.ollamaUrl}/v1/embeddings`, () => { - return HttpResponse.json({ data: mockEmbeddingsResponse }) - }), - http.post(`${config.ollamaUrl}/api/generate`, async ({ request }) => { - const requestBody = (await request.json()) as { prompt: string } - return HttpResponse.json({ - model: config.ollamaModel, - created_at: new Date().toISOString(), - response: `Mocked response for prompt: ${requestBody.prompt}`, - done: true, - }) - }) -) - -beforeAll(() => mswServer.listen()) -afterEach(() => mswServer.resetHandlers()) -afterAll(() => mswServer.close()) - -describe('OllamaClient (MSW Mocked)', () => { - it('should return embeddings using OpenAI format', async () => { - const ollamaClientInstance = new OllamaClient(config) - const resultVectors = await ollamaClientInstance.embed(['text1', 'text2']) - expect(resultVectors).toEqual([ - [0.1, 0.2, 0.3], - [0.4, 0.5, 0.6], - ]) - }) - - it('should generate a response from a prompt', async () => { - const ollamaClientInstance = new OllamaClient(config) - const generatedText = await ollamaClientInstance.generate('test prompt') - expect(generatedText).toBe('Mocked response for prompt: test prompt') - }) - - it('should throw an error when the server returns a 500 status', async () => { - mswServer.use( - http.post(`${config.ollamaUrl}/v1/embeddings`, () => { - return new HttpResponse(null, { status: 500 }) - }) - ) - const ollamaClientInstance = new OllamaClient(config) - await expect(ollamaClientInstance.embed(['text'])).rejects.toThrow( - 'Ollama request failed with status 500' - ) - }) -}) diff --git a/test/integration/ollama-client.integration.test.ts b/test/integration/ollama-client.integration.test.ts deleted file mode 100644 index c7a9546..0000000 --- a/test/integration/ollama-client.integration.test.ts +++ /dev/null @@ -1,33 +0,0 @@ -import { config } from '../../src/utils/config.js' -import { describe, it, expect } from 'vitest' -import { OllamaClient } from '../../src/ai/ollama-client.js' -import { isOllamaAvailable } from '../ollama-available.js' - -describe.runIf(await isOllamaAvailable())('OllamaClient Integration', () => { - it('should validate Ollama is running and model is available', async () => { - const client = new OllamaClient(config) - await expect(client.validate()).resolves.not.toThrow() - }) - - it('should embed single text and return correct shape', async () => { - const client = new OllamaClient(config) - const result = await client.embed(['hello']) - expect(result).toBeInstanceOf(Array) - expect(result[0]).toBeInstanceOf(Array) - expect(result[0].length).toBeGreaterThan(0) - }) - - it('should embed batch of texts in parallel', async () => { - const client = new OllamaClient(config) - const texts = ['hello', 'world', 'test'] - const result = await client.embed(texts) - expect(result).toHaveLength(3) - result.forEach((emb) => expect(emb.length).toBeGreaterThan(0)) - }) - - it('should handle empty array gracefully', async () => { - const client = new OllamaClient(config) - const result = await client.embed([]) - expect(result).toEqual([]) - }) -}) diff --git a/test/integration/rag-orchestrator-mocked.integration.test.ts b/test/integration/rag-orchestrator-mocked.integration.test.ts deleted file mode 100644 index 1cf95df..0000000 --- a/test/integration/rag-orchestrator-mocked.integration.test.ts +++ /dev/null @@ -1,72 +0,0 @@ -import { describe, it, expect, beforeAll, afterEach, afterAll, vi } from 'vitest' -import { setupServer } from 'msw/node' -import { http, HttpResponse } from 'msw' -import { RagOrchestrator } from '../../src/ai/rag-orchestrator.js' -import { config } from '../../src/utils/config.js' -import { VectorStore } from '../../src/search/vector-store.js' -import { RgSearch } from '../../src/search/rg-search.js' - -const mockSearchOutcome = [ - { - meta: { - title: 'Mocked Title', - path: 'path/to/mocked.md', - snippet: 'This is some mocked content from a Perplexity export.', - id: 'mock-1', - }, - score: 0.95, - }, -] - -const mswServer = setupServer( - http.post(`${config.ollamaUrl}/api/generate`, async ({ request }) => { - const body = (await request.json()) as { prompt: string } - - let responseText = '' - if (body.prompt.includes('Analyze:')) { - responseText = - '{"strategy": "precise", "queries": ["What is in my history?"], "hardKeywords": ["mocked"], "filters": {}}' - } else if (body.prompt.includes('You are the Researcher.')) { - responseText = - '[{"fact": "Based on your history, there is a Mocked Title.", "node_id": 0, "thread": "Mocked Title"}]' - } else if (body.prompt.includes('You are the Narrator.')) { - responseText = 'Based on your history, there is a Mocked Title.' - } else if (body.prompt.includes('Verify the answer.')) { - responseText = '{"status": "ok"}' - } else { - responseText = '{"status": "ok"}' - } - - return HttpResponse.json({ - model: config.ollamaModel, - created_at: new Date().toISOString(), - response: responseText, - done: true, - }) - }) -) - -beforeAll(() => mswServer.listen()) -afterEach(() => { - mswServer.resetHandlers() - vi.restoreAllMocks() -}) -afterAll(() => mswServer.close()) - -describe('RagOrchestrator (MSW Mocked)', () => { - it('should orchestrate the RAG flow successfully', async () => { - vi.spyOn(VectorStore.prototype, 'search').mockResolvedValue(mockSearchOutcome) - vi.spyOn(VectorStore.prototype, 'validate').mockResolvedValue(undefined) - vi.spyOn(RgSearch.prototype, 'captureSearchMatches').mockResolvedValue([]) - - const ragOrchestratorInstance = new RagOrchestrator(config) - const consoleLogSpy = vi.spyOn(console, 'log').mockImplementation(() => {}) - - await ragOrchestratorInstance.answerQuestion('What is in my history?') - - expect(consoleLogSpy).toHaveBeenCalledWith(expect.stringContaining('Based on your history')) - expect(consoleLogSpy).toHaveBeenCalledWith(expect.stringContaining('Mocked Title')) - - consoleLogSpy.mockRestore() - }) -}) diff --git a/test/integration/scraper-engine.integration.test.ts b/test/integration/scraper-engine.integration.test.ts new file mode 100644 index 0000000..9658dff --- /dev/null +++ b/test/integration/scraper-engine.integration.test.ts @@ -0,0 +1,135 @@ +import { describe, it, expect, beforeAll, afterAll } from 'vitest' +import { chromium, type Browser, type BrowserContext } from 'patchright' +import { ConversationExtractor } from '../../src/scraper/conversation-extractor.js' +import { type Config } from '../../src/utils/config.js' +import { fileURLToPath } from 'node:url' +import { dirname, join } from 'node:path' +import { existsSync, rmSync } from 'node:fs' + +const __dirname = dirname(fileURLToPath(import.meta.url)) +const TEST_STORAGE_DIR = join(__dirname, '../../.test-storage-scraper') + +const mockApplicationConfig: Config = { + authStoragePath: join(TEST_STORAGE_DIR, 'auth.json'), + waitMode: 'static', + rateLimitMs: 10, + parallelWorkers: 1, + checkpointSaveInterval: 1, + exportDir: join(TEST_STORAGE_DIR, 'exports'), + checkpointPath: join(TEST_STORAGE_DIR, 'checkpoint.json'), + vectorIndexPath: join(TEST_STORAGE_DIR, 'vector-index'), + ollamaUrl: 'http://localhost:11434', + ollamaModel: 'llama3.1', + ollamaEmbedModel: 'nomic-embed-text', + enableVectorSearch: true, + headless: true, + debug: false, +} + +describe('Scraper Engine Integration', () => { + let browser: Browser + let browserContext: BrowserContext + + beforeAll(async () => { + if (existsSync(TEST_STORAGE_DIR)) { + rmSync(TEST_STORAGE_DIR, { recursive: true, force: true }) + } + browser = await chromium.launch({ headless: true }) + }) + + afterAll(async () => { + await browser?.close() + if (existsSync(TEST_STORAGE_DIR)) { + rmSync(TEST_STORAGE_DIR, { recursive: true, force: true }) + } + }) + + it('should successfully extract and format a conversation using mocked network responses', async () => { + browserContext = await browser.newContext() + + // Mock the main page navigation. + // IMPORTANT: It must trigger a fetch to the thread API to simulate real site behavior. + await browserContext.route('**/search/**', async (route) => { + const htmlContent = ` + + + + + + ` + await route.fulfill({ status: 200, contentType: 'text/html', body: htmlContent }) + }) + + // Intercept Perplexity API calls to return realistic mocked data + await browserContext.route('**/rest/thread/**', async (route) => { + const url = route.request().url() + if (url.includes('list_')) { + await route.fulfill({ status: 200, contentType: 'application/json', body: JSON.stringify([]) }) + return + } + + const mockedPerplexityThreadResponse = { + entries: [ + { + uuid: 'entry-1', + thread_title: 'The Great Refactor', + query_str: 'How do I refactor a God Object?', + blocks: [ + { + markdown_block: { + answer: 'Break it down into smaller, specialized services with single responsibilities.' + } + } + ], + updated_datetime: '2026-06-01T12:00:00.000Z', + collection_info: { title: 'Coding Best Practices' } + } + ] + } + await route.fulfill({ + status: 200, + contentType: 'application/json', + body: JSON.stringify(mockedPerplexityThreadResponse) + }) + }) + + // Mock settings page to avoid real navigation failures + await browserContext.route('**/settings', async (route) => { + await route.fulfill({ status: 200, contentType: 'text/html', body: 'Settings' }) + }) + + const conversationExtractor = new ConversationExtractor(mockApplicationConfig, browserContext) + const testConversationUrl = 'https://www.perplexity.ai/search/the-great-refactor-123' + + const extractionResult = await conversationExtractor.extract(testConversationUrl) + + expect(extractionResult.conversationId).toBe('the-great-refactor-123') + expect(extractionResult.conversationTitle).toBe('The Great Refactor') + expect(extractionResult.conversationSpaceName).toBe('Coding Best Practices') + expect(extractionResult.formattedMarkdownContent).toContain('How do I refactor a God Object?') + expect(extractionResult.formattedMarkdownContent).toContain('Break it down into smaller') + expect(extractionResult.contentIntegrityHash).toBeDefined() + expect(extractionResult.contentIntegrityHash.length).toBe(64) // SHA-256 hex length + + await browserContext.close() + }) + + it('should throw an error bus exception for a 404 response', async () => { + browserContext = await browser.newContext() + + await browserContext.route('**/search/non-existent', async (route) => { + await route.fulfill({ status: 404 }) + }) + + const conversationExtractor = new ConversationExtractor(mockApplicationConfig, browserContext) + + await expect( + conversationExtractor.extract('https://www.perplexity.ai/search/non-existent') + ).rejects.toThrow(/404/) + + await browserContext.close() + }) +}) diff --git a/test/integration/search-engine.integration.test.ts b/test/integration/search-engine.integration.test.ts new file mode 100644 index 0000000..ca0b228 --- /dev/null +++ b/test/integration/search-engine.integration.test.ts @@ -0,0 +1,94 @@ +import { describe, it, expect, beforeAll, afterAll, vi } from 'vitest' +import { VectorStore } from '../../src/search/vector-store.js' +import { SearchOrchestrator } from '../../src/search/search-orchestrator.js' +import { OllamaClient } from '../../src/ai/ollama-client.js' +import { RgSearch } from '../../src/search/rg-search.js' +import { type Config } from '../../src/utils/config.js' +import { fileURLToPath } from 'node:url' +import { dirname, join } from 'node:path' +import { existsSync, rmSync, mkdirSync, writeFileSync } from 'node:fs' + +const __dirname = dirname(fileURLToPath(import.meta.url)) +const TEST_STORAGE_DIR = join(__dirname, '../../.test-storage-search') +const EXPORTS_DIR = join(TEST_STORAGE_DIR, 'exports') + +const mockApplicationConfig: Config = { + authStoragePath: join(TEST_STORAGE_DIR, 'auth.json'), + waitMode: 'static', + rateLimitMs: 10, + parallelWorkers: 1, + checkpointSaveInterval: 1, + exportDir: EXPORTS_DIR, + checkpointPath: join(TEST_STORAGE_DIR, 'checkpoint.json'), + vectorIndexPath: join(TEST_STORAGE_DIR, 'vector-index'), + ollamaUrl: 'http://localhost:11434', + ollamaModel: 'llama3.1', + ollamaEmbedModel: 'nomic-embed-text', + enableVectorSearch: true, + headless: true, + debug: false, +} + +describe('Search Engine Integration', () => { + beforeAll(() => { + if (existsSync(TEST_STORAGE_DIR)) { + rmSync(TEST_STORAGE_DIR, { recursive: true, force: true }) + } + mkdirSync(EXPORTS_DIR, { recursive: true }) + + const codingDir = join(EXPORTS_DIR, 'Coding') + mkdirSync(codingDir, { recursive: true }) + + writeFileSync( + join(codingDir, 'Refactoring (thread-1).md'), + `# The Great Refactor\n\n**Space:** Coding \n**ID:** thread-1 \n**Date:** 2026-06-01T12:00:00.000Z \n\n## Question\nHow do I refactor?\n\nAnswer: Use small steps.` + ) + }) + + afterAll(() => { + if (existsSync(TEST_STORAGE_DIR)) { + rmSync(TEST_STORAGE_DIR, { recursive: true, force: true }) + } + vi.restoreAllMocks() + }) + + it('should successfully index and search using VectorStore (Ollama mocked)', async () => { + const embedSpy = vi.spyOn(OllamaClient.prototype, 'embed').mockImplementation(async (texts: string[]) => { + return texts.map(text => { + if (text.toLowerCase().includes('refactor')) return [1, 0, 0] + return [0, 0, 1] + }) + }) + + const vectorStore = new VectorStore(mockApplicationConfig) + await vectorStore.rebuildFromExports() + + const searchResults = await vectorStore.search('refactor') + + expect(searchResults.length).toBeGreaterThan(0) + expect(searchResults[0]!.meta['title']).toContain('The Great Refactor') + + embedSpy.mockRestore() + }) + + it('should find content using ripgrep (RgSearch)', async () => { + const ripgrepSearch = new RgSearch(mockApplicationConfig) + const searchMatches = await ripgrepSearch.captureSearchMatches({ pattern: 'refactor' }) + expect(searchMatches.length).toBeGreaterThan(0) + }) + + it('should coordinate searches correctly via SearchOrchestrator', async () => { + const orchestrator = new SearchOrchestrator(mockApplicationConfig) + const vectorSearchSpy = vi.spyOn(VectorStore.prototype, 'search').mockResolvedValue([ + { meta: { spaceName: 'Coding', title: 'Vector Result', path: 'path' }, score: 0.99 } + ]) + + await orchestrator.search('short', 'auto', { pattern: 'short' }) + expect(vectorSearchSpy).not.toHaveBeenCalled() + + await orchestrator.search('this is a very long query about refactoring', 'auto', { pattern: 'this is a very long query about refactoring' }) + expect(vectorSearchSpy).toHaveBeenCalled() + + vectorSearchSpy.mockRestore() + }) +}) diff --git a/test/integration/vector-store.integration.test.ts b/test/integration/vector-store.integration.test.ts deleted file mode 100644 index 30ea6cd..0000000 --- a/test/integration/vector-store.integration.test.ts +++ /dev/null @@ -1,110 +0,0 @@ -import { config } from '../../src/utils/config.js' -import { describe, it, expect, beforeAll, afterAll, beforeEach } from 'vitest' -import { rmSync, existsSync, mkdirSync, writeFileSync, readFileSync } from 'node:fs' -import { join } from 'node:path' -import { isOllamaAvailable } from '../ollama-available.js' - -const TEST_EXPORTS = join(process.cwd(), 'test-fixtures', 'exports') -const TEST_INDEX = join(process.cwd(), 'test-fixtures', 'vector-index') - -// Import and patch config before loading VectorStore -let VectorStore: any - -describe.runIf(await isOllamaAvailable())('VectorStore Integration', () => { - beforeAll(async () => { - // Setup test directories - ;[TEST_EXPORTS, TEST_INDEX].forEach((dir) => { - if (existsSync(dir)) rmSync(dir, { recursive: true }) - mkdirSync(dir, { recursive: true }) - }) - - // Dynamically import and patch - process.env.EXPORT_DIR = TEST_EXPORTS - process.env.VECTOR_INDEX_PATH = TEST_INDEX - - const configModule = await import('../../src/utils/config.js') - // Override config properties - Object.defineProperty(configModule.config, 'exportDir', { - get: () => TEST_EXPORTS, - configurable: true, - }) - Object.defineProperty(configModule.config, 'vectorIndexPath', { - get: () => TEST_INDEX, - configurable: true, - }) - - const vectorStoreModule = await import('../../src/search/vector-store.js') - VectorStore = vectorStoreModule.VectorStore - }) - - afterAll(() => { - ;[TEST_EXPORTS, TEST_INDEX].forEach((dir) => { - if (existsSync(dir)) rmSync(dir, { recursive: true }) - }) - }) - - beforeEach(() => { - // Clean between tests - ;[join(TEST_EXPORTS, '*.md'), join(TEST_INDEX, '*')].forEach((pattern) => { - const dir = pattern.replace('/*', '').replace('/*.md', '') - if (existsSync(dir)) { - const files = require('fs').readdirSync(dir) - for (const file of files) { - const isFileMdOrJson = file.endsWith('.md') || file.endsWith('.json') - if (isFileMdOrJson) rmSync(join(dir, file)) - } - } - }) - }) - - it('should build index from markdown files with real Ollama embeddings', async () => { - const store = new VectorStore(config) - - writeFileSync( - join(TEST_EXPORTS, 'test-conv.md'), - `# Test Conversation\n\n**Space:** General\n**ID:** test-123\n\n## Question\n\nWhat is testing?\n\n---\n\n## Answer\n\nTesting verifies software behavior.` - ) - - await store.rebuildFromExports() - - expect(existsSync(join(TEST_INDEX, 'index.json'))).toBe(true) - - // Verify index has content - const indexContent = readFileSync(join(TEST_INDEX, 'index.json'), 'utf-8') - expect(indexContent.length).toBeGreaterThan(100) - }, 30000) - - it('should chunk large files automatically during indexing', async () => { - const store = new VectorStore(config) - - const largeContent = `# Large File\n\n**Space:** Test\n**ID:** large-1\n\n${'Lorem ipsum dolor sit amet consectetur adipiscing elit. '.repeat(100)}` - writeFileSync(join(TEST_EXPORTS, 'large.md'), largeContent) - - await store.rebuildFromExports() - - expect(existsSync(join(TEST_INDEX, 'index.json'))).toBe(true) - }, 30000) - - it('should search and return relevant results with scores', async () => { - const store = new VectorStore(config) - - writeFileSync( - join(TEST_EXPORTS, 'typescript.md'), - `# TypeScript Guide\n\n**Space:** Dev\n**ID:** ts-123\n\nTypeScript adds static typing to JavaScript for safer code.` - ) - - await store.rebuildFromExports() - - const results = await store.search('TypeScript static typing', 5) - - expect(results.length).toBeGreaterThan(0) - expect(results[0]).toHaveProperty('meta') - expect(results[0]).toHaveProperty('score') - expect(results[0]!.score).toBeGreaterThan(0) - }, 30000) - - it('should handle empty exports directory gracefully', async () => { - const store = new VectorStore(config) - await expect(store.rebuildFromExports()).resolves.not.toThrow() - }) -}) diff --git a/test/setup.ts b/test/setup.ts index d9c83f6..513909d 100644 --- a/test/setup.ts +++ b/test/setup.ts @@ -1,13 +1,13 @@ import { beforeAll, afterAll } from 'vitest' -import { chromium, type Browser } from '@playwright/test' +import { chromium, type Browser } from 'patchright' let sharedBrowserInstance: Browser beforeAll(async () => { try { sharedBrowserInstance = await chromium.launch({ headless: true }) - } catch (_error) { - console.warn('Could not launch browser in setup.ts, some tests might fail if they require it.') + } catch (launchError) { + // Suppress errors here; individual tests should handle missing browsers or attempt launch } }) diff --git a/test/unit/conversation-extractor.unit.test.ts b/test/unit/conversation-extractor.unit.test.ts index d087a01..cfbae91 100644 --- a/test/unit/conversation-extractor.unit.test.ts +++ b/test/unit/conversation-extractor.unit.test.ts @@ -1,7 +1,7 @@ import { describe, it, expect, vi, beforeEach } from 'vitest' import { ConversationExtractor } from '../../src/scraper/conversation-extractor.js' import { ApiDiagnosticsWriter } from '../../src/utils/api-diagnostics.js' -import type { BrowserContext } from '@playwright/test' +import type { BrowserContext } from 'patchright' vi.mock('../../src/utils/api-diagnostics.js', () => { return { @@ -31,45 +31,42 @@ describe('ConversationExtractor (Unit)', () => { vi.clearAllMocks() }) - describe('ensureEntriesFormat', () => { + describe('Data Normalization (via DataParser)', () => { it('should return array if input is array', () => { const data = [{ query_str: 'test' }] - const result = (extractor as any).ensureEntriesFormat(data, 'http://test.com') + const result = (extractor as any).dataParser.normalizeApiData(data, 'http://test.com') expect(result).toEqual(data) - expect((extractor as any).diagnostics.writeFailure).not.toHaveBeenCalled() }) it('should return data.entries if input has entries array', () => { const data = { entries: [{ query_str: 'test' }] } - const result = (extractor as any).ensureEntriesFormat(data, 'http://test.com') + const result = (extractor as any).dataParser.normalizeApiData(data, 'http://test.com') expect(result).toEqual(data.entries) - expect((extractor as any).diagnostics.writeFailure).not.toHaveBeenCalled() }) it('should return [data] if input has query_str', () => { const data = { query_str: 'test' } - const result = (extractor as any).ensureEntriesFormat(data, 'http://test.com') + const result = (extractor as any).dataParser.normalizeApiData(data, 'http://test.com') expect(result).toEqual([data]) - expect((extractor as any).diagnostics.writeFailure).not.toHaveBeenCalled() }) it('should return empty array and call diagnostics for unknown shape', () => { const data = { foo: 'bar' } - const result = (extractor as any).ensureEntriesFormat(data, 'http://test.com') + const result = (extractor as any).dataParser.normalizeApiData(data, 'http://test.com') expect(result).toEqual([]) - expect((extractor as any).diagnostics.writeFailure).toHaveBeenCalledWith({ + expect((extractor as any).dataParser.apiDiagnosticsWriter.writeFailure).toHaveBeenCalledWith({ url: 'http://test.com', errorType: 'unknown_shape', }) }) }) - describe('parseConversationData', () => { + describe('Data Parsing (via DataParser)', () => { it('should return null and call diagnostics if entries are empty', () => { const data = { entries: [] } - const result = extractor.parseConversationData(data, 'http://test.com') + const result = (extractor as any).dataParser.parse(data, 'http://test.com') expect(result).toBeNull() - expect((extractor as any).diagnostics.writeFailure).toHaveBeenCalledWith({ + expect((extractor as any).dataParser.apiDiagnosticsWriter.writeFailure).toHaveBeenCalledWith({ url: 'http://test.com', errorType: 'empty_entries', }) @@ -85,11 +82,9 @@ describe('ConversationExtractor (Unit)', () => { }, ], } - const result = extractor.parseConversationData(data, 'https://perplexity.ai/search/uuid') + const result = (extractor as any).dataParser.parse(data, 'https://perplexity.ai/search/uuid') expect(result).not.toBeNull() - expect(result?.title).toBe('Test Thread') - expect(result?.content).toContain('What is 1+1?') - expect(result?.content).toContain('2') + expect(result?.meta.title).toBe('Test Thread') }) }) }) diff --git a/test/unit/hashing.unit.test.ts b/test/unit/hashing.unit.test.ts index ed8841f..dcd05a9 100644 --- a/test/unit/hashing.unit.test.ts +++ b/test/unit/hashing.unit.test.ts @@ -1,46 +1,39 @@ import { describe, it, expect, beforeEach } from 'vitest' import { ConversationExtractor } from '../../src/scraper/conversation-extractor.js' -import type { BrowserContext } from '@playwright/test' +import type { BrowserContext } from 'patchright' describe('ConversationExtractor Hashing (Unit)', () => { let extractor: ConversationExtractor - let mockContext: BrowserContext - const mockConfig = { - waitMode: 'static', - rateLimitMs: 1000, - debug: true, - } as any + const mockConfig = {} as any + const mockContext = {} as any beforeEach(() => { - mockContext = {} as unknown as BrowserContext extractor = new ConversationExtractor(mockConfig, mockContext) }) it('should generate the same hash for identical entries', () => { - const entries = [{ id: '1', query_str: 'test', blocks: [{ markdown_block: { answer: 'hi' } }] }] - const hash1 = (extractor as any).hashEntries(entries) - const hash2 = (extractor as any).hashEntries(entries) - expect(hash1).toBe(hash2) - expect(hash1).toHaveLength(64) + const entries = [{ id: '1', query_str: 'test', blocks: [{ markdown_block: { answer: 'abc' } }] }] + const data = { entries } + const res1 = (extractor as any).dataParser.parse(data, 'http://test.com/search/1') + const res2 = (extractor as any).dataParser.parse(data, 'http://test.com/search/1') + expect(res1.hash).toBe(res2.hash) }) it('should generate different hashes for different entries', () => { - const entries1 = [ - { id: '1', query_str: 'test', blocks: [{ markdown_block: { answer: 'hi' } }] }, - ] - const entries2 = [ - { id: '1', query_str: 'test', blocks: [{ markdown_block: { answer: 'hello' } }] }, - ] - const hash1 = (extractor as any).hashEntries(entries1) - const hash2 = (extractor as any).hashEntries(entries2) - expect(hash1).not.toBe(hash2) + const entries1 = [{ id: '1', query_str: 'test', blocks: [{ markdown_block: { answer: 'abc' } }] }] + const entries2 = [{ id: '1', query_str: 'test', blocks: [{ markdown_block: { answer: 'def' } }] }] + const res1 = (extractor as any).dataParser.parse({ entries: entries1 }, 'http://test.com/search/1') + const res2 = (extractor as any).dataParser.parse({ entries: entries2 }, 'http://test.com/search/1') + expect(res1.hash).not.toBe(res2.hash) }) it('should be stable regardless of key order in entries', () => { const entries1 = [{ a: 1, b: 2 }] const entries2 = [{ b: 2, a: 1 }] - const hash1 = (extractor as any).hashEntries(entries1) - const hash2 = (extractor as any).hashEntries(entries2) - expect(hash1).toBe(hash2) + const data1 = { query_str: 'q', entries: entries1 } + const data2 = { query_str: 'q', entries: entries2 } + const res1 = (extractor as any).dataParser.parse(data1, 'http://test.com/search/1') + const res2 = (extractor as any).dataParser.parse(data2, 'http://test.com/search/1') + expect(res1.hash).toBe(res2.hash) }) }) diff --git a/test/unit/worker-pool.unit.test.ts b/test/unit/worker-pool.unit.test.ts index ccdaaa9..ab5b88a 100644 --- a/test/unit/worker-pool.unit.test.ts +++ b/test/unit/worker-pool.unit.test.ts @@ -1,60 +1,71 @@ import { describe, it, expect, vi, beforeEach } from 'vitest' import { WorkerPool } from '../../src/scraper/worker-pool.js' +import { CheckpointManager } from '../../src/scraper/checkpoint-manager.js' +import { FileWriter } from '../../src/export/file-writer.js' +import { ConversationExtractor } from '../../src/scraper/conversation-extractor.js' +import type { Browser } from 'patchright' -vi.mock('../../src/scraper/conversation-extractor.js') vi.mock('../../src/scraper/checkpoint-manager.js') vi.mock('../../src/export/file-writer.js') +vi.mock('../../src/scraper/conversation-extractor.js', () => { + return { + ConversationExtractor: vi.fn().mockImplementation(function() { + return { + extract: vi.fn(), + recoverTimeout: vi.fn(), + reduceTimeout: vi.fn(), + } + }), + } +}) describe('WorkerPool Skip Logic (Unit)', () => { let pool: WorkerPool + let mockConfig: any let mockCheckpoint: any let mockBrowser: any - let mockConfig: any beforeEach(() => { - mockConfig = { parallelWorkers: 1 } - mockCheckpoint = { - getContentHash: vi.fn(), - markAsProcessed: vi.fn(), - getProcessingProgress: vi.fn().mockReturnValue({ processed: 1, total: 1 }), - } - mockBrowser = { - newContext: vi.fn().mockResolvedValue({ - close: vi.fn().mockResolvedValue(undefined), - }), - } + mockConfig = { parallelWorkers: 1, authStoragePath: 'path' } + mockCheckpoint = new CheckpointManager(mockConfig) as any + mockBrowser = { newContext: vi.fn().mockResolvedValue({ close: vi.fn() }) } as unknown as Browser pool = new WorkerPool(mockConfig, mockCheckpoint, mockBrowser) + vi.clearAllMocks() }) it('should skip file write if hash matches', async () => { await pool.initialize() - const worker = (pool as any).workers[0] - worker.extractor.extract = vi.fn().mockResolvedValue({ - id: 'thread-1', - title: 'Title', - contentHash: 'hash-match', + const worker = (pool as any).activeWorkers[0] + worker.conversationExtractor.extract = vi.fn().mockResolvedValue({ + conversationId: 'thread-1', + conversationTitle: 'Title', + contentIntegrityHash: 'hash-abc', }) - mockCheckpoint.getContentHash.mockReturnValue('hash-match') - await pool.processConversations([{ id: 'thread-1', url: 'http://url' }]) + mockCheckpoint.getContentHash.mockReturnValue('hash-abc') + mockCheckpoint.getProcessingProgress.mockReturnValue({ processed: 1, total: 1 }) - expect((pool as any).fileWriter.write).not.toHaveBeenCalled() + await (pool as any).performConversationExtraction(worker, { conversationMetadata: { id: 'thread-1', url: 'url' }, currentAttemptCount: 0 }) + + expect(FileWriter.prototype.write).not.toHaveBeenCalled() expect(mockCheckpoint.markAsProcessed).toHaveBeenCalledWith('thread-1') }) it('should perform file write if hash differs', async () => { await pool.initialize() - const worker = (pool as any).workers[0] - worker.extractor.extract = vi.fn().mockResolvedValue({ - id: 'thread-1', - title: 'Title', - contentHash: 'hash-new', + const worker = (pool as any).activeWorkers[0] + worker.conversationExtractor.extract = vi.fn().mockResolvedValue({ + conversationId: 'thread-1', + conversationTitle: 'Title', + contentIntegrityHash: 'hash-new', }) + mockCheckpoint.getContentHash.mockReturnValue('hash-old') + mockCheckpoint.getProcessingProgress.mockReturnValue({ processed: 1, total: 1 }) - await pool.processConversations([{ id: 'thread-1', url: 'http://url' }]) + await (pool as any).performConversationExtraction(worker, { conversationMetadata: { id: 'thread-1', url: 'url' }, currentAttemptCount: 0 }) - expect((pool as any).fileWriter.write).toHaveBeenCalled() + expect(FileWriter.prototype.write).toHaveBeenCalled() expect(mockCheckpoint.markAsProcessed).toHaveBeenCalledWith('thread-1', 'hash-new') }) })