Seamlessly encode an 800-page cybersecurity book for AI agent access.
This system combines encoding strategies from representation learning and modern retrieval-augmented generation (RAG) architectures to enable semantic search, concept algebra, and analogical reasoning over your book's content.
- Child chunks (512 tokens): Precise, specific segments for accurate retrieval
- Parent chunks (2048 tokens): Full context windows for comprehensive understanding
- Semantic boundary detection: Respects paragraphs, lists, and sentence boundaries
- Uses BGE-large-en-v1.5 (1024 dimensions) optimized for technical retrieval
- Contextual enrichment: Prepends chapter/section breadcrumbs to anchor vectors
- Optional hyperbolic projection for hierarchical concept relationships
- Dense search: Semantic similarity for meaning-based queries
- Sparse search: Keyword matching for exact technical terms (CVEs, protocols)
- Reciprocal Rank Fusion: Merges results for best of both worlds
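As a rough illustration of the fusion step, the sketch below merges a dense and a sparse ranking with Reciprocal Rank Fusion. The function name, the chunk IDs, and the constant `k=60` (a conventional default for RRF) are assumptions for the example, not necessarily what `vector_store.py` uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID scores 1 / (k + rank) per list it appears in (rank starts at 1).
    k=60 is a conventional default and an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the dense and sparse searches.
dense_hits = ["chunk_12", "chunk_07", "chunk_31"]
sparse_hits = ["chunk_07", "chunk_44", "chunk_12"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one search still survives in the fused list.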
Zero Trust + Cloud Architecture ≈ Cloud-Native Identity Management
Buffer Overflow - C++ + Rust ≈ Memory Safety Patterns
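The expressions above can be read as vector arithmetic followed by a nearest-neighbour lookup. The sketch below shows the idea with hand-made 3-dimensional toy vectors; the vectors, the `TOY_EMBEDDINGS` table, and the `concept_search` helper are all hypothetical and stand in for real model embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Toy vectors chosen by hand for illustration; real embeddings have 1024 dims.
TOY_EMBEDDINGS = {
    "zero trust": [1.0, 0.0, 0.0],
    "cloud architecture": [0.0, 1.0, 0.0],
    "cloud-native identity management": [0.7, 0.7, 0.1],
    "buffer overflow": [0.0, 0.1, 1.0],
}


def concept_search(add, subtract=()):
    """Add and subtract unit vectors, then return the closest other concept."""
    dim = len(next(iter(TOY_EMBEDDINGS.values())))
    query = [0.0] * dim
    for term in add:
        query = [q + x for q, x in zip(query, normalize(TOY_EMBEDDINGS[term]))]
    for term in subtract:
        query = [q - x for q, x in zip(query, normalize(TOY_EMBEDDINGS[term]))]
    used = set(add) | set(subtract)
    scores = {
        name: cosine(query, vec)
        for name, vec in TOY_EMBEDDINGS.items()
        if name not in used  # never return one of the input terms
    }
    return max(scores, key=scores.get)


best = concept_search(add=["zero trust", "cloud architecture"])
print(best)
```

Excluding the input terms from the candidates matters: otherwise the nearest neighbour of a sum is usually one of its own operands.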
- CVE extraction: Automatically detects and indexes CVE patterns
- Severity tagging: Critical, High, Medium, Low classifications
- Category detection: Network, Web, Crypto, Malware, Cloud, etc.
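A minimal sketch of how CVE extraction and severity tagging could work, assuming a regex over the standard `CVE-YYYY-NNNN` identifier format and a naive keyword match for severity. The function names are hypothetical, and a keyword heuristic like this will mislabel phrases such as "high-level overview".

```python
import re

# CVE IDs are "CVE-", a 4-digit year, and a sequence number of 4+ digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)

# Checked in order, so the most severe label present wins.
SEVERITY_LEVELS = ("critical", "high", "medium", "low")


def extract_cves(text):
    """Return unique CVE IDs in order of first appearance, upper-cased."""
    found = []
    for match in CVE_PATTERN.findall(text):
        cve_id = match.upper()
        if cve_id not in found:
            found.append(cve_id)
    return found


def tag_severity(text):
    """Return the first severity keyword found, or None (naive heuristic)."""
    lowered = text.lower()
    for level in SEVERITY_LEVELS:
        if re.search(rf"\b{level}\b", lowered):
            return level
    return None


sample = (
    "Log4Shell (CVE-2021-44228) is a critical remote code execution flaw; "
    "see also cve-2021-45046."
)
cves = extract_cves(sample)
severity = tag_severity(sample)
print(cves, severity)
```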
pip install -r requirements.txt
For GPU acceleration (recommended for large books):
pip install torch --index-url https://download.pytorch.org/whl/cu118
python main.py encode path/to/cybersecurity_book.pdf
With custom options:
python main.py encode book.pdf \
--model "BAAI/bge-large-en-v1.5" \
--chunk-size 512 \
--parent-chunk-size 2048 \
--device cuda
Natural language query:
python main.py query "How do buffer overflow attacks work?"
Concept algebra:
python main.py concept "Zero Trust + Cloud Architecture"
Analogical reasoning:
python main.py analogy "SQL Injection" "Web Application" "Memory Corruption"
# Returns: content about memory corruption exploits in applications
python main.py interactive
Commands:
- /concept <expression> - Vector arithmetic
- /category <cat> <query> - Filter by category
- /severity <level> <query> - Filter by severity
- /quit - Exit
from agent_interface import create_agent
# Create agent connected to encoded book
agent = create_agent(db_path="./vector_db")
# Natural language query
response = agent.query("What are the OWASP Top 10 vulnerabilities?")
print(response.context) # Ready for LLM consumption
# Concept algebra
response = agent.concept_search("Ransomware + Prevention - Detection")
# Analogical reasoning
response = agent.analogy_search(
    a="SQL Injection",
    b="Input Validation",
    c="Buffer Overflow"
)
# Returns: content about memory protection techniques
# Category-filtered search
response = agent.search_by_category(
    question="authentication bypass",
    category="web_security"
)
# Severity-filtered search
response = agent.search_by_severity(
    question="remote code execution",
    severity="critical"
)
┌─────────────────────────────────────────────────────────────┐
│                      ENCODING PIPELINE                      │
├─────────────────────────────────────────────────────────────┤
│    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    │
│    │  PDF/Text   │───▶│  Semantic   │───▶│ Contextual  │    │
│    │   Loader    │    │   Chunker   │    │  Embedder   │    │
│    └─────────────┘    └─────────────┘    └─────────────┘    │
│           │                  │                  │           │
│           ▼                  ▼                  ▼           │
│       Structure       Child + Parent     1024-dim Dense     │
│       Detection           Chunks             Vectors        │
├─────────────────────────────────────────────────────────────┤
│                        VECTOR STORE                         │
├─────────────────────────────────────────────────────────────┤
│    ┌───────────────────────────────────────────────────┐    │
│    │                     ChromaDB                      │    │
│    │  ┌───────────┐  ┌───────────┐  ┌───────────────┐  │    │
│    │  │   Dense   │  │ Metadata  │  │    Parent     │  │    │
│    │  │   Index   │  │  Filters  │  │     Store     │  │    │
│    │  └───────────┘  └───────────┘  └───────────────┘  │    │
│    └───────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────────┤
│                     AI AGENT INTERFACE                      │
├─────────────────────────────────────────────────────────────┤
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────────┐   │
│   │   Hybrid    │   │   Vector    │   │    Metadata     │   │
│   │   Search    │   │   Algebra   │   │    Filtering    │   │
│   └─────────────┘   └─────────────┘   └─────────────────┘   │
└─────────────────────────────────────────────────────────────┘
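The Contextual Embedder stage in the diagram prepends chapter/section breadcrumbs to each chunk before it is embedded. A minimal sketch of that enrichment step follows; the `Chunk` dataclass, the `with_breadcrumb` helper, and the bracketed breadcrumb format are assumptions for illustration, not the actual structures in `embeddings.py`.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record carrying its position in the book."""
    text: str
    chapter: str = ""
    section: str = ""


def with_breadcrumb(chunk, separator=" > "):
    """Return the text to embed: a breadcrumb line followed by the chunk text."""
    parts = [part for part in (chunk.chapter, chunk.section) if part]
    if not parts:
        return chunk.text  # nothing to prepend
    return f"[{separator.join(parts)}]\n{chunk.text}"


chunk = Chunk(
    text="Stack canaries detect overwrites of the return address.",
    chapter="Chapter 7: Memory Corruption",
    section="7.3 Mitigations",
)
enriched = with_breadcrumb(chunk)
print(enriched)
```

Only the embedded text changes; the original chunk text can still be stored and returned to the agent unmodified.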
Edit config.py to customize:
from config import PipelineConfig, ChunkingConfig, EmbeddingConfig
config = PipelineConfig(
    chunking=ChunkingConfig(
        chunk_size=512,          # Child chunk target size
        parent_chunk_size=2048,  # Parent context window
        overlap=0.25,            # 25% overlap between chunks
    ),
    embedding=EmbeddingConfig(
        model_name="BAAI/bge-large-en-v1.5",
        dimension=1024,
        device="cuda",
    ),
    use_hyperbolic_embeddings=True,  # Enable for hierarchical data
)
new_thing/
├── main.py # CLI entry point
├── pipeline.py # Encoding orchestrator
├── document_loader.py # PDF/text parsing with structure detection
├── semantic_chunker.py # Parent-document chunking strategy
├── embeddings.py # Dense embeddings + vector arithmetic
├── vector_store.py # ChromaDB with hybrid search
├── agent_interface.py # High-level AI agent API
├── config.py # Configuration management
├── requirements.txt # Dependencies
└── README.md # This file
Enable for hierarchical cybersecurity concepts:
config = PipelineConfig(use_hyperbolic_embeddings=True)
This projects Euclidean embeddings into hyperbolic space where:
- Parent concepts (e.g., "Malware") are near the origin
- Child concepts (e.g., "WannaCry") are near the boundary
- Hierarchical distances are preserved with minimal distortion
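The README does not say how the projection is computed. One common choice is the exponential map at the origin of the Poincaré ball (curvature -1), sketched below under that assumption. With this map, vectors with a small Euclidean norm stay near the origin and vectors with a large norm approach the boundary, so it only yields the parent/child layout above if the norm already reflects how specific a concept is.

```python
import math


def exp_map_origin(v, eps=1e-9):
    """Map a Euclidean vector into the open unit (Poincare) ball.

    Uses exp_0(v) = tanh(|v|) * v / |v|, the exponential map at the origin
    for curvature -1. This is an assumed choice, not the project's code.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        return [0.0] * len(v)
    scale = math.tanh(norm) / norm
    return [scale * x for x in v]


def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit ball."""
    def sq_norm(w):
        return sum(x * x for x in w)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Toy inputs: a short vector for a general concept, a long one for a specific one.
general = exp_map_origin([0.1, 0.0])
specific = exp_map_origin([3.0, 0.0])
origin = [0.0, 0.0]

radius_general = math.hypot(*general)
radius_specific = math.hypot(*specific)
dist_specific = poincare_distance(origin, specific)
print(radius_general, radius_specific, dist_specific)
```

Points close to the boundary are far from everything else in hyperbolic terms, which is what leaves room for many fine-grained leaves under a single broad concept.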
from embeddings import VectorArithmetic, EmbeddingEngine
engine = EmbeddingEngine()
va = VectorArithmetic(engine)
# Concept composition
result = va.compute("Zero Trust + Cloud Architecture - On-Premise")
# Analogy: A is to B as C is to ?
result = va.analogy("Firewall", "Network", "WAF")  # → Web Application
- Use GPU: Set --device cuda for 10x faster encoding
- Batch size: Increase --batch-size if you have VRAM headroom
- Chunk size: Smaller chunks = more precise retrieval, larger = more context
- Overlap: 20-30% overlap prevents context loss at boundaries
- Overlap: 20-30% overlap prevents context loss at boundaries
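To make the chunk size and overlap trade-off concrete, here is a simplified sketch of parent/child chunking with a sliding window. It treats list items as tokens and ignores semantic boundaries, so it is not the strategy in `semantic_chunker.py`; the function names are hypothetical, and the defaults mirror the 512/2048/0.25 values used elsewhere in this README.

```python
def sliding_chunks(tokens, size, overlap=0.0):
    """Split tokens into windows of `size`, each sharing `overlap` with the last."""
    if size <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("size must be positive and overlap in [0, 1)")
    step = max(1, int(size * (1.0 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


def parent_child_chunks(tokens, child_size=512, parent_size=2048, overlap=0.25):
    """Return (parent_id, child_tokens) pairs; parents do not overlap."""
    pairs = []
    for parent_id, parent in enumerate(sliding_chunks(tokens, parent_size)):
        for child in sliding_chunks(parent, child_size, overlap):
            pairs.append((parent_id, child))
    return pairs


# Stand-in tokens; a real pipeline would count model tokens instead.
tokens = [f"tok{i}" for i in range(4096)]
pairs = parent_child_chunks(tokens)
print(len(pairs), "child chunks across", pairs[-1][0] + 1, "parents")
```

With 25% overlap the window advances 384 tokens at a time, so consecutive children share 128 tokens and a sentence that straddles a boundary appears whole in at least one of them.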
This tool processes potentially sensitive cybersecurity content. The vector database is stored locally by default. For production deployments, consider:
- Encrypting the vector store
- Access controls on the database directory
- Audit logging for queries
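As one possible starting point for the last two items, the sketch below restricts a directory to its owner and wraps a query function with an audit log entry, using only the standard library. The helper names, the logger name, and the stub query function are assumptions; `os.chmod` permission bits apply on POSIX systems only, and a real deployment would log to a file or central log service rather than an in-memory buffer.

```python
import io
import logging
import os
import stat
import tempfile


def restrict_directory(path):
    """Allow only the owning user to read, write, or enter the directory."""
    os.chmod(path, stat.S_IRWXU)  # mode 0o700; effective on POSIX systems


def make_audit_logger(stream):
    """Build a logger that writes timestamped audit lines to `stream`."""
    logger = logging.getLogger("book_agent.audit")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.handlers = [handler]
    return logger


def audited(query_fn, logger, user):
    """Wrap a query function so every call is logged before it runs."""
    def wrapper(question):
        logger.info("user=%s query=%r", user, question)
        return query_fn(question)
    return wrapper


# Stand-ins for the real vector store directory and agent query method.
db_dir = tempfile.mkdtemp()
restrict_directory(db_dir)

log_buffer = io.StringIO()
audit_log = make_audit_logger(log_buffer)
search = audited(lambda q: f"results for {q}", audit_log, user="analyst1")
result = search("remote code execution")
print(log_buffer.getvalue().strip())
```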
Built for seamless AI agent access to cybersecurity knowledge.