🦀 crabQL-rs

High-Throughput ETL Engine for the GitHub GraphQL v4 API documentation.

crabQL-rs is an asynchronous extraction engine designed for data engineering and AI pipelines.

It ingests, sanitizes, and structures dynamic GitHub documentation (Next.js-based) into deterministic and typed datasets, ready for consumption in knowledge bases, security analysis, or RAG (Retrieval-Augmented Generation) architectures.

Instead of relying on heavy headless browsers or fragile DOM parsing, the tool directly intercepts the Next.js hydration state (__NEXT_DATA__), ensuring clean, predictable extraction with a minimal resource footprint.
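To make the hydration-state approach concrete, here is a minimal, standard-library-only sketch of pulling the JSON payload out of a page. The function name `extract_next_data` is hypothetical and this is not the project's actual implementation; it assumes the payload sits in a `<script id="__NEXT_DATA__">` tag and contains no literal `</script>` sequence.

```rust
/// Returns the JSON text embedded in the `<script id="__NEXT_DATA__">` tag, if present.
/// Hypothetical helper for illustration; the real crate may locate the tag differently.
pub fn extract_next_data(html: &str) -> Option<&str> {
    // Locate the id attribute that marks the Next.js hydration script.
    let marker = html.find("id=\"__NEXT_DATA__\"")?;
    // The payload begins right after the '>' that closes the opening tag.
    let start = marker + html[marker..].find('>')? + 1;
    // The payload ends at the next closing script tag.
    let end = start + html[start..].find("</script>")?;
    Some(html[start..end].trim())
}
```

The returned slice borrows from the input, so no copy is made before the JSON is handed to a deserializer.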

🚀 Architecture and Engineering Decisions

The pipeline was designed with a focus on resilience and solid performance, operating in three strict stages:

Extract (I/O Bound): Concurrent HTTP requests scheduled on tokio. TLS is handled by rustls, a TLS stack written in Rust, which avoids the memory corruption vulnerabilities common in C dependencies (like OpenSSL).
Transform (CPU Bound): Isolation of intensive processing via spawn_blocking to prevent thread starvation in the asynchronous runtime. Unstructured HTML is sanitized and semantically converted into Markdown optimized for LLM context windows.
Load (Memory Optimized): Results are accumulated in memory and then serialized to disk in a single pass as one consolidated JSON artifact. Serde keeps serialization overhead low; because the full dataset is held in memory before writing, this design suits moderate-scale datasets on typical developer hardware.
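As an illustration of the Transform stage, the sketch below converts HTML code tags into Markdown using only the standard library. It is a deliberately naive stand-in, not the project's converter: the function name `code_tags_to_markdown` is hypothetical, it assumes tags carry no attributes, and it handles only three HTML entities. In the pipeline described above, a function like this would be the kind of CPU-bound work handed to spawn_blocking.

```rust
/// Naive HTML-to-Markdown conversion for code tags only.
/// Hypothetical helper; assumes attribute-free `<pre><code>` and `<code>` tags.
pub fn code_tags_to_markdown(html: &str) -> String {
    html
        // Block-level code becomes a fenced block on its own lines.
        .replace("<pre><code>", "\n```\n")
        .replace("</code></pre>", "\n```\n")
        // Remaining inline code becomes backtick spans.
        .replace("<code>", "`")
        .replace("</code>", "`")
        // Decode the most common entities; `&amp;` goes last to avoid double decoding.
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}
```

Block-level tags are replaced before inline ones so that the `<code>` inside a `<pre>` is not turned into a stray backtick.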

✨ System Guarantees

Strict Data Contracts (Type-Safety): The input schema is rigorously validated at deserialization time. Malformed payloads are blocked at the edge, ensuring downstream pipelines receive only intact data.
AI-Ready Artifacts: The output converts HTML code blocks and references into native Markdown, maximizing embedding quality for language models.
Supply Chain Security: Drastic reduction of the attack surface by disabling unused default crate features and forcing pure Rust implementations for network cryptography.
Resource Isolation: Use of lightweight synchronization primitives (OnceLock for selector caching, relaxed Atomics) to minimize lock contention under high concurrency.
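The synchronization pattern named in the last guarantee can be sketched with the standard library alone. The `Selector` struct, `cached_selector`, and `run_workers` below are hypothetical stand-ins: a real selector would come from an HTML parsing crate, and scoped threads take the place of tokio tasks here so the example runs without dependencies.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

/// Hypothetical stand-in for a compiled CSS selector.
pub struct Selector {
    pub pattern: String,
}

/// Builds the selector on first use; every later call returns the same instance.
pub fn cached_selector() -> &'static Selector {
    static SELECTOR: OnceLock<Selector> = OnceLock::new();
    SELECTOR.get_or_init(|| Selector {
        pattern: "script#__NEXT_DATA__".to_string(),
    })
}

/// Spawns `n` workers that share the cached selector and bump a progress counter.
pub fn run_workers(n: usize) -> usize {
    let done = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..n {
            s.spawn(|| {
                let _selector = cached_selector();
                // Relaxed is enough: the counter does not guard any other data.
                done.fetch_add(1, Ordering::Relaxed);
            });
        }
    });
    // All scoped threads have been joined here, so every increment is visible.
    done.load(Ordering::Relaxed)
}
```

Calling `run_workers(16)` returns 16. The selector is initialized once no matter how many workers race to read it, and the counter never takes a lock.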

🛠️ Pipeline Execution

Requires the Rust toolchain installed in the environment.

Clone the repository and access the directory:

git clone https://github.com/usrbinbrain/crabql-rs.git
cd crabql-rs

Compile and start ingestion in release mode for maximum compiler optimization:

cargo run --release

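Expected output (the tool prints its status messages in Portuguese: starting with maximum concurrency, processing progress, extraction complete, and the saved file name):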

🦀 Iniciando o crabQL-rs com concorrência máxima...
⚡ Processando... [16/16] concluídas
✅ Extração concluída!
📄 Arquivo salvo como: github_api_graphql_v4_docs.json

The typed artifact github_api_graphql_v4_docs.json is written to the project root when the run finishes.

📄 Output Contract

The final artifact follows a predictable JSON schema, separating textual guides from GraphQL schema definitions:

[
  {
    "url": "https://docs.github.com/en/graphql/guides/migrating-graphql-global-node-ids",
    "type": "document",
    "content": "## Background\n\nThe GitHub GraphQL API currently supports two types..."
  },
  {
    "url": "https://docs.github.com/en/graphql/reference/queries",
    "type": "schema",
    "content": { 
      /* GraphQL Schema AST Structure */ 
    }
  }
]
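A downstream consumer might model this contract along the following lines. This is a sketch under stated assumptions, not the crate's own types: `Record`, `Content`, `type_tag`, and `markdown_guides` are hypothetical names, and because the schema AST structure is elided above, the schema payload is kept here as an opaque JSON string.

```rust
/// The two payload kinds distinguished by the "type" field.
#[derive(Debug, PartialEq)]
pub enum Content {
    /// Markdown text of a guide page ("type": "document").
    Document(String),
    /// Schema payload ("type": "schema"), kept as raw JSON text in this sketch.
    Schema(String),
}

/// One entry of the output array.
#[derive(Debug, PartialEq)]
pub struct Record {
    pub url: String,
    pub content: Content,
}

impl Record {
    /// The value this record carries in the "type" field.
    pub fn type_tag(&self) -> &'static str {
        match self.content {
            Content::Document(_) => "document",
            Content::Schema(_) => "schema",
        }
    }
}

/// Collects only the Markdown guides, e.g. to feed an embedding step.
pub fn markdown_guides(records: &[Record]) -> Vec<&str> {
    records
        .iter()
        .filter_map(|r| match &r.content {
            Content::Document(md) => Some(md.as_str()),
            Content::Schema(_) => None,
        })
        .collect()
}
```

Using an enum keeps the two shapes of `content` apart at compile time, so code that expects Markdown cannot be handed a schema entry by accident.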

🛡️ Rust Lib Stack

Runtime: Tokio
Networking: Reqwest + Rustls
Data Serialization: Serde
