High-Throughput ETL Engine for the GitHub GraphQL v4 API documentation.
crabQL-rs is an asynchronous extraction engine designed for data engineering and AI pipelines.
It ingests, sanitizes, and structures dynamic GitHub documentation (Next.js-based) into deterministic and typed datasets, ready for consumption in knowledge bases, security analysis, or RAG (Retrieval-Augmented Generation) architectures.
Instead of relying on heavy headless browsers or fragile DOM parsing, the tool directly intercepts the Next.js hydration state (__NEXT_DATA__), ensuring clean, predictable extraction with a minimal resource footprint.
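As a rough illustration of the idea, the hydration payload can be located with plain string searching, without building a DOM. This is a minimal sketch using only the standard library; the function name `extract_next_data` is hypothetical, and the actual tool may locate and parse the payload differently (for example with a selector library and a typed deserializer).

```rust
/// Returns the raw JSON text embedded in the `__NEXT_DATA__` script tag,
/// or `None` if the page does not contain one.
/// (Hypothetical helper for illustration; not taken from the crabQL-rs source.)
fn extract_next_data(html: &str) -> Option<&str> {
    const MARKER: &str = r#"id="__NEXT_DATA__""#;

    // Locate the attribute that identifies the hydration script tag.
    let marker_at = html.find(MARKER)?;

    // Skip to the end of the opening <script ...> tag.
    let body_start = marker_at + html[marker_at..].find('>')? + 1;

    // The payload runs until the closing tag.
    let body_len = html[body_start..].find("</script>")?;

    Some(html[body_start..body_start + body_len].trim())
}
```

The returned slice borrows from the input, so no copy of the page is made; the JSON text would then be handed to a deserializer for validation.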
The pipeline was designed with a focus on resilience and solid performance, operating in three strict stages:
- Extract (I/O Bound): Concurrent HTTP requests scaled via tokio. It uses a fully memory-safe TLS stack (rustls), avoiding the memory corruption vulnerabilities common in C dependencies (like OpenSSL).
- Transform (CPU Bound): Intensive processing is isolated via spawn_blocking to prevent thread starvation in the asynchronous runtime. Unstructured HTML is sanitized and semantically converted into Markdown optimized for LLM context windows.
- Load (Memory Optimized): Results are accumulated in memory and then serialized in a single pass to disk as a consolidated JSON artifact. Serde's zero-cost abstractions keep serialization overhead low while remaining suitable for moderate-scale datasets on typical developer hardware.
- Strict Data Contracts (Type-Safety): The input schema is rigorously validated at deserialization time. Malformed payloads are blocked at the edge, ensuring downstream pipelines receive only intact data.
- AI-Ready Artifacts: The output converts HTML code blocks and references into native Markdown, maximizing embedding quality for language models.
- Supply Chain Security: The attack surface is reduced by disabling unused default crate features and forcing pure Rust implementations for network cryptography.
- Resource Isolation: Use of optimized synchronization primitives (OnceLock for selector caching, relaxed Atomics) to minimize CPU lock contention during massive concurrency.
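The synchronization pattern named in the last item can be sketched with the standard library alone. In this sketch, `Selector`, `content_selector`, and `process_pages` are hypothetical names, the selector string is an assumption, and plain OS threads stand in for the tool's asynchronous tasks; only `OnceLock` and relaxed atomics come from the description above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

/// Hypothetical stand-in for a compiled selector; the real type is not shown here.
struct Selector {
    pattern: String,
}

/// Initialized at most once, then shared by every worker without locking.
static CONTENT_SELECTOR: OnceLock<Selector> = OnceLock::new();

fn content_selector() -> &'static Selector {
    CONTENT_SELECTOR.get_or_init(|| Selector {
        // Assumed selector string, for illustration only.
        pattern: "script#__NEXT_DATA__".to_string(),
    })
}

/// Runs `workers` threads and returns how many finished.
fn process_pages(workers: usize) -> usize {
    let completed = AtomicUsize::new(0);

    // Scoped threads are all joined before `scope` returns.
    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| {
                let selector = content_selector();
                assert!(!selector.pattern.is_empty());
                // A pure counter needs no ordering with other memory,
                // so a relaxed increment is enough.
                completed.fetch_add(1, Ordering::Relaxed);
            });
        }
    });

    completed.load(Ordering::Relaxed)
}
```

Relaxed ordering is appropriate here because the counter is only read after every thread has been joined, and joining already establishes the necessary synchronization.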
Requires the Rust toolchain installed in the environment.
- Clone the repository and enter the directory:
    git clone https://github.com/usrbinbrain/crabql-rs.git
    cd crabql-rs
- Compile and start ingestion in release mode, which enables full compiler optimizations:
    cargo run --release
  Expected output (status messages are printed in Portuguese):
    🦀 Iniciando o crabQL-rs com concorrência máxima...
    ⚡ Processando... [16/16] concluídas
    ✅ Extração concluída!
    📄 Arquivo salvo como: github_api_graphql_v4_docs.json
- The typed artifact github_api_graphql_v4_docs.json is written to the project root at the end of the run.
The final artifact follows a predictable JSON schema, separating textual guides from GraphQL schema definitions:
[
  {
    "url": "https://docs.github.com/en/graphql/guides/migrating-graphql-global-node-ids",
    "type": "document",
    "content": "## Background\n\nThe GitHub GraphQL API currently supports two types..."
  },
  {
    "url": "https://docs.github.com/en/graphql/reference/queries",
    "type": "schema",
    "content": {
      /* GraphQL Schema AST Structure */
    }
  }
]
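A downstream consumer might model the two record kinds as a Rust enum so that guides and schema definitions cannot be confused. The sketch below is an assumption about how such a consumer could look, not the tool's own types: `Record`, `Content`, `type_tag`, and `markdown_guides` are hypothetical names, and the schema payload is kept as raw JSON text because its internal structure is not specified above.

```rust
/// Payload of one record; mirrors the "type" field of the JSON artifact.
#[derive(Debug, PartialEq)]
enum Content {
    /// "document": Markdown text of a guide page.
    Document(String),
    /// "schema": a GraphQL schema structure, kept here as raw JSON text.
    Schema(String),
}

/// One entry of the artifact array (hypothetical consumer-side model).
#[derive(Debug, PartialEq)]
struct Record {
    url: String,
    content: Content,
}

impl Record {
    /// The value this record carries in the artifact's "type" field.
    fn type_tag(&self) -> &'static str {
        match self.content {
            Content::Document(_) => "document",
            Content::Schema(_) => "schema",
        }
    }
}

/// Collects (url, markdown) pairs for the textual guides only,
/// for example as input to an embedding step.
fn markdown_guides(records: &[Record]) -> Vec<(&str, &str)> {
    records
        .iter()
        .filter_map(|record| match &record.content {
            Content::Document(markdown) => Some((record.url.as_str(), markdown.as_str())),
            Content::Schema(_) => None,
        })
        .collect()
}
```

Separating the variants at the type level means code that embeds Markdown never has to check the "type" string at runtime.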