usrbinbrain/crabql-rs

🦀 crabQL-rs

High-Throughput ETL Engine for the GitHub GraphQL v4 API documentation.

Rust · JSON · Linux · Windows · macOS

crabQL-rs is an asynchronous extraction engine designed for data engineering and AI pipelines.

It ingests, sanitizes, and structures dynamic GitHub documentation (Next.js-based) into deterministic and typed datasets, ready for consumption in knowledge bases, security analysis, or RAG (Retrieval-Augmented Generation) architectures.

Instead of relying on heavy headless browsers or fragile DOM parsing, the tool directly intercepts the Next.js hydration state (__NEXT_DATA__), ensuring clean, predictable extraction with a minimal resource footprint.
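As a rough illustration of the idea, the sketch below pulls the raw JSON out of the `__NEXT_DATA__` script tag with plain string searches from the standard library. The function name `extract_next_data` is hypothetical and is not taken from the crabQL-rs source, which may well use a proper HTML parser and CSS selectors instead. The sketch also assumes the payload contains no literal `</script>` sequence.

```rust
/// Returns the raw JSON text inside the `__NEXT_DATA__` script tag, if present.
/// Hypothetical helper for illustration only; not the project's actual code.
fn extract_next_data(html: &str) -> Option<&str> {
    // Locate the attribute that identifies the hydration script.
    let id_pos = html.find("id=\"__NEXT_DATA__\"")?;
    // The payload starts right after the end of the opening tag.
    let body_start = id_pos + html[id_pos..].find('>')? + 1;
    // The payload ends at the next closing script tag
    // (assumption: the JSON itself never contains "</script>").
    let body_len = html[body_start..].find("</script>")?;
    Some(html[body_start..body_start + body_len].trim())
}
```

The returned slice can then be handed to a JSON deserializer; pages without the tag yield `None` rather than an error.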

🚀 Architecture and Engineering Decisions

The pipeline was designed with a focus on resilience and solid performance, operating in three strict stages:

  • Extract (I/O Bound): Concurrent HTTP requests are scheduled via tokio. TLS is handled by rustls, a TLS implementation written in Rust, which avoids the class of memory-corruption vulnerabilities commonly found in C dependencies such as OpenSSL.
  • Transform (CPU Bound): Isolation of intensive processing via spawn_blocking to prevent thread starvation in the asynchronous runtime. Unstructured HTML is sanitized and semantically converted into Markdown optimized for LLM context windows.
  • Load (Memory Optimized): Results are accumulated in memory and then serialized to disk in a single pass as one consolidated JSON artifact. Serde's zero-cost abstractions keep serialization overhead low, and the in-memory approach is suitable for moderate-scale datasets on typical developer hardware.
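The shape of the three stages can be sketched with the standard library alone. This is a simplified analogue under stated assumptions: it uses `std::thread` and a channel instead of tokio and `spawn_blocking`, and the `extract` and `transform` functions are hypothetical stand-ins that perform no real network I/O or HTML parsing.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the network fetch: returns fake HTML for a URL (hypothetical).
fn extract(url: &str) -> String {
    format!("<h2>{url}</h2>")
}

/// Stand-in for the HTML-to-Markdown step (hypothetical, handles only <h2>).
fn transform(html: &str) -> String {
    html.replace("<h2>", "## ").replace("</h2>", "")
}

/// Runs extract + transform for each URL on its own thread, then collects
/// every result in memory (the "load" stage), sorted for a stable order.
fn run_pipeline(urls: &[&str]) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    thread::scope(|s| {
        for &url in urls {
            let tx = tx.clone();
            s.spawn(move || {
                let doc = transform(&extract(url));
                tx.send(doc).expect("receiver is alive");
            });
        }
    });
    // Drop the last sender so the receiving iterator terminates.
    drop(tx);
    let mut docs: Vec<String> = rx.into_iter().collect();
    docs.sort();
    docs
}
```

Calling `run_pipeline(&["a", "b"])` yields one Markdown string per input. In an async runtime the same separation applies, with the CPU-heavy `transform` step moved off the I/O threads.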

✨ System Guarantees

  • Strict Data Contracts (Type-Safety): The input schema is rigorously validated at deserialization time. Malformed payloads are blocked at the edge, ensuring downstream pipelines receive only intact data.
  • AI-Ready Artifacts: The output converts HTML code blocks and references into native Markdown, maximizing embedding quality for language models.
  • Supply Chain Security: Drastic reduction of the attack surface by disabling unused default crate features and forcing pure Rust implementations for network cryptography.
  • Resource Isolation: Use of optimized synchronization primitives (OnceLock for selector caching, relaxed Atomics) to minimize CPU lock contention during massive concurrency.
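The synchronization pattern named in the last item can be sketched as follows. This is a minimal, standard-library-only illustration: the selector string, `cached_selector`, and `run_workers` are hypothetical, and a plain `String` stands in for whatever compiled selector type the project actually caches.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

/// Counts how many times the initializer below actually ran.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Returns a lazily built value shared by all threads. The string is a
/// placeholder for a compiled selector (hypothetical).
fn cached_selector() -> &'static str {
    static SELECTOR: OnceLock<String> = OnceLock::new();
    SELECTOR.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::Relaxed);
        String::from("script#__NEXT_DATA__")
    })
}

/// Spawns `n` workers that each read the cached value and bump a progress
/// counter. Returns (completed workers, initializer runs).
fn run_workers(n: usize) -> (usize, usize) {
    let completed = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..n {
            s.spawn(|| {
                let _selector = cached_selector();
                // Relaxed is enough: only the count matters, not ordering
                // relative to other memory operations.
                completed.fetch_add(1, Ordering::Relaxed);
            });
        }
    });
    // All threads are joined when the scope ends, so the loads see every update.
    (completed.load(Ordering::Relaxed), INIT_CALLS.load(Ordering::Relaxed))
}
```

`OnceLock::get_or_init` runs its initializer at most once even when many threads race on the first call, so workers never take a lock on the hot path after initialization.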

🛠️ Pipeline Execution

Requires an installed Rust toolchain.

  1. Clone the repository and enter the directory:

    git clone https://github.com/usrbinbrain/crabql-rs.git
    cd crabql-rs
  2. Compile and start ingestion in release mode for maximum compiler optimization:

    cargo run --release

    Expected output (the tool logs in Portuguese):

    🦀 Iniciando o crabQL-rs com concorrência máxima...
    ⚡ Processando... [16/16] concluídas
    ✅ Extração concluída!
    📄 Arquivo salvo como: github_api_graphql_v4_docs.json
    
  3. The typed artifact github_api_graphql_v4_docs.json is written to the project root when the run completes.

📄 Output Contract

The final artifact follows a predictable JSON schema, separating textual guides from GraphQL schema definitions:

[
  {
    "url": "https://docs.github.com/en/graphql/guides/migrating-graphql-global-node-ids",
    "type": "document",
    "content": "## Background\n\nThe GitHub GraphQL API currently supports two types..."
  },
  {
    "url": "https://docs.github.com/en/graphql/reference/queries",
    "type": "schema",
    "content": { 
      /* GraphQL Schema AST Structure */ 
    }
  }
]
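One way to model this contract in Rust is an enum over the two payload kinds. The sketch below is an assumption about how such types might look, not the project's actual definitions: the names `Record`, `Content`, and `type_tag` are hypothetical, and the schema body is kept as raw JSON text because its structure is elided above.

```rust
/// The two payload kinds shown in the sample output (hypothetical model).
#[derive(Debug, PartialEq)]
enum Content {
    /// Markdown text of a guide page.
    Document(String),
    /// Schema payload, kept as raw JSON text since its shape is not shown.
    Schema(String),
}

/// One entry of the output array (hypothetical model).
#[derive(Debug, PartialEq)]
struct Record {
    url: String,
    content: Content,
}

impl Record {
    /// Value of the `type` discriminator field for this record.
    fn type_tag(&self) -> &'static str {
        match self.content {
            Content::Document(_) => "document",
            Content::Schema(_) => "schema",
        }
    }
}
```

With serde, a layout like this would typically be expressed as an adjacently tagged enum using `type` as the tag and `content` as the payload field, so consumers can branch on `type` before touching `content`.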

🛡️ Rust Lib Stack
