nm727/Data-Embedding

Cybersecurity Book Vector Encoder

Seamlessly encode an 800-page cybersecurity book for AI agent access.

This system implements advanced encoding strategies from representation theory and modern RAG architectures to enable semantic search, concept algebra, and analogical reasoning over your book's content.

🎯 Key Features

Semantic Chunking with Parent-Document Retrieval

  • Child chunks (512 tokens): Precise, specific segments for accurate retrieval
  • Parent chunks (2048 tokens): Full context windows for comprehensive understanding
  • Semantic boundary detection: Respects paragraphs, lists, and sentence boundaries
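
The parent/child relationship above can be sketched as follows. This is a minimal illustration rather than the project's `semantic_chunker.py`: the `Chunk` class and `build_parent_child_chunks` function are hypothetical names, whitespace-separated words stand in for model tokens, and semantic boundary detection is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical container; the real project may store more metadata."""
    chunk_id: str
    text: str
    parent_id: Optional[str] = None


def split_words(words, size):
    """Split a word list into consecutive windows of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def build_parent_child_chunks(text, parent_size=2048, child_size=512):
    """Return (parents, children); every child records the id of its parent.

    Assumption: words approximate tokens. A real chunker would count
    tokens with the embedding model's tokenizer.
    """
    parents, children = [], []
    for p_idx, parent_words in enumerate(split_words(text.split(), parent_size)):
        parent_id = f"p{p_idx}"
        parents.append(Chunk(parent_id, " ".join(parent_words)))
        for c_idx, child_words in enumerate(split_words(parent_words, child_size)):
            children.append(
                Chunk(f"{parent_id}-c{c_idx}", " ".join(child_words), parent_id)
            )
    return parents, children
```

At query time the small child chunks are matched against the query, and the `parent_id` of each hit is used to fetch the larger parent text that is handed to the LLM.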

High-Dimensional Dense Embeddings

  • Uses BGE-large-en-v1.5 (1024 dimensions), a general-purpose English embedding model that performs well on technical retrieval
  • Contextual enrichment: Prepends chapter/section breadcrumbs to anchor vectors
  • Optional hyperbolic projection for hierarchical concept relationships
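
Contextual enrichment can be illustrated with a small helper that prepends the breadcrumb trail before the text is embedded. The function name `add_breadcrumb` and the bracketed format are assumptions made for this sketch; the project may format its breadcrumbs differently.

```python
def add_breadcrumb(chunk_text, chapter, section=""):
    """Prefix a chunk with its chapter/section trail before embedding.

    Hypothetical helper: the "[chapter > section]" layout is illustrative.
    """
    trail = " > ".join(part for part in (chapter, section) if part)
    return f"[{trail}]\n{chunk_text}" if trail else chunk_text


enriched = add_breadcrumb(
    "A stack canary is a value placed before the return address.",
    chapter="Chapter 7: Memory Corruption",
    section="Stack Protections",
)
```

Because the trail is part of the embedded text, two chunks with similar wording but different chapters end up with distinguishable vectors.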

Hybrid Search with RRF Fusion

  • Dense search: Semantic similarity for meaning-based queries
  • Sparse search: Keyword matching for exact technical terms (CVEs, protocols)
  • Reciprocal Rank Fusion: Merges the dense and sparse rankings into a single result list, so a chunk ranked well by either method surfaces
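
As a rough sketch of how Reciprocal Rank Fusion combines the two result lists: each chunk receives 1 / (k + rank) from every ranking it appears in, and the sums decide the final order. The function name and the constant k = 60 (a commonly used default) are assumptions here, not values taken from this project's `vector_store.py`.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; 60 is a common default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c1", "c2", "c3"]   # from semantic search
sparse_hits = ["c3", "c1", "c4"]  # from keyword search
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Only ranks are used, so the dense similarity scores and the sparse keyword scores never need to be put on a common scale.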

Vector Arithmetic (Concept Algebra)

Zero Trust + Cloud Architecture ≈ Cloud-Native Identity Management
Buffer Overflow - C++ + Rust ≈ Memory Safety Patterns
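
One plausible way to evaluate such expressions is to parse them into signed terms, add or subtract the term embeddings, and normalize the result before searching. The sketch below is an assumption about the approach, not the project's `VectorArithmetic` class; `parse_expression`, `combine`, and the `embed` callable are hypothetical names.

```python
import math
import re


def parse_expression(expr):
    """Split 'A + B - C' into [(1.0, 'A'), (1.0, 'B'), (-1.0, 'C')].

    Operators must be surrounded by whitespace, so terms such as
    'C++' or 'On-Premise' are kept intact.
    """
    tokens = re.split(r"\s+([+-])\s+", expr.strip())
    terms = [(1.0, tokens[0].strip())]
    for op, term in zip(tokens[1::2], tokens[2::2]):
        terms.append((1.0 if op == "+" else -1.0, term.strip()))
    return terms


def combine(terms, embed):
    """Sum signed term embeddings and scale the result to unit length.

    embed: any callable mapping a string to a list of floats (assumed).
    """
    total = [0.0] * len(embed(terms[0][1]))
    for sign, term in terms:
        for i, x in enumerate(embed(term)):
            total[i] += sign * x
    norm = math.sqrt(sum(x * x for x in total)) or 1.0
    return [x / norm for x in total]
```

The combined vector is then used as the query vector for a nearest-neighbor search over the chunk embeddings.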

Cybersecurity-Specific Features

  • CVE extraction: Automatically detects and indexes CVE patterns
  • Severity tagging: Critical, High, Medium, Low classifications
  • Category detection: Network, Web, Crypto, Malware, Cloud, etc.
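
CVE detection can be approximated with a regular expression over the standard identifier format (`CVE-`, a four-digit year, then four or more digits). This is a sketch under that assumption; `extract_cves` is a hypothetical name and the project's own pattern may differ.

```python
import re

# CVE identifiers: "CVE-", a 4-digit year, then a sequence of 4+ digits.
CVE_PATTERN = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b", re.IGNORECASE)


def extract_cves(text):
    """Return the unique CVE ids in `text`, upper-cased, in order of appearance."""
    found = []
    for match in CVE_PATTERN.finditer(text):
        cve_id = f"CVE-{match.group(1)}-{match.group(2)}"
        if cve_id not in found:
            found.append(cve_id)
    return found


sample = "Log4Shell (CVE-2021-44228) and cve-2017-0144; see CVE-2021-44228 again."
cves = extract_cves(sample)
```

The extracted ids can be stored as chunk metadata so that exact-match lookups do not depend on embedding similarity.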

📦 Installation

pip install -r requirements.txt

For GPU acceleration (recommended for large books):

pip install torch --index-url https://download.pytorch.org/whl/cu118

🚀 Quick Start

1. Encode Your Book

python main.py encode path/to/cybersecurity_book.pdf

With custom options:

python main.py encode book.pdf \
    --model "BAAI/bge-large-en-v1.5" \
    --chunk-size 512 \
    --parent-chunk-size 2048 \
    --device cuda

2. Query the Encoded Book

Natural language query:

python main.py query "How do buffer overflow attacks work?"

Concept algebra:

python main.py concept "Zero Trust + Cloud Architecture"

Analogical reasoning:

python main.py analogy "SQL Injection" "Web Application" "Memory Corruption"
# Returns: content about memory corruption exploits in applications

3. Interactive Mode

python main.py interactive

Commands:

  • /concept <expression> - Vector arithmetic
  • /category <cat> <query> - Filter by category
  • /severity <level> <query> - Filter by severity
  • /quit - Exit
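
The command syntax above could be dispatched along these lines. This is an illustrative parser, not the code in `main.py`; the function name, the returned dictionary layout, and the choice to treat any line without a leading slash as a plain query are all assumptions.

```python
def parse_command(line):
    """Turn one line of interactive input into an action dictionary.

    Hypothetical helper: lines without a leading "/" are assumed to be
    natural language queries.
    """
    line = line.strip()
    if not line.startswith("/"):
        return {"action": "query", "query": line}
    name, _, rest = line[1:].partition(" ")
    rest = rest.strip()
    if name == "quit":
        return {"action": "quit"}
    if name == "concept":
        return {"action": "concept", "expression": rest}
    if name in ("category", "severity"):
        # First word is the filter value, the remainder is the query.
        value, _, query = rest.partition(" ")
        return {"action": name, name: value, "query": query.strip()}
    return {"action": "unknown", "raw": line}
```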

🤖 AI Agent Integration

from agent_interface import create_agent

# Create agent connected to encoded book
agent = create_agent(db_path="./vector_db")

# Natural language query
response = agent.query("What are the OWASP Top 10 vulnerabilities?")
print(response.context)  # Ready for LLM consumption

# Concept algebra
response = agent.concept_search("Ransomware + Prevention - Detection")

# Analogical reasoning
response = agent.analogy_search(
    a="SQL Injection",
    b="Input Validation",
    c="Buffer Overflow"
)
# Returns: content about memory protection techniques

# Category-filtered search
response = agent.search_by_category(
    question="authentication bypass",
    category="web_security"
)

# Severity-filtered search
response = agent.search_by_severity(
    question="remote code execution",
    severity="critical"
)
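
Once a response is retrieved, its `context` string can be placed into an LLM prompt. The helper below is a hedged sketch: `build_prompt`, its wording, and the `max_chars` truncation limit are assumptions, and only `response.context` comes from the interface shown above.

```python
def build_prompt(question, context, max_chars=8000):
    """Combine retrieved context and a question into a single prompt string.

    Hypothetical helper; `max_chars` is an arbitrary guard against
    overly long context, not a value defined by the project.
    """
    context = context[:max_chars]
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# With the agent above this would be called as:
#   prompt = build_prompt(question, response.context)
prompt = build_prompt("What is XSS?", "XSS injects scripts into pages.")
```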

📊 Architecture

┌─────────────────────────────────────────────────────────────┐
│                     ENCODING PIPELINE                        │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │   PDF/Text  │───▶│  Semantic   │───▶│  Contextual │     │
│  │   Loader    │    │  Chunker    │    │  Embedder   │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
│        │                  │                   │              │
│        ▼                  ▼                   ▼              │
│   Structure         Child + Parent      1024-dim Dense      │
│   Detection         Chunks              Vectors             │
├─────────────────────────────────────────────────────────────┤
│                      VECTOR STORE                            │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐   │
│  │                    ChromaDB                          │   │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────────┐   │   │
│  │  │  Dense    │  │  Metadata │  │    Parent     │   │   │
│  │  │  Index    │  │  Filters  │  │    Store      │   │   │
│  │  └───────────┘  └───────────┘  └───────────────┘   │   │
│  └─────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│                     AI AGENT INTERFACE                       │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐    │
│  │   Hybrid    │  │   Vector    │  │   Metadata      │    │
│  │   Search    │  │   Algebra   │  │   Filtering     │    │
│  └─────────────┘  └─────────────┘  └─────────────────┘    │
└─────────────────────────────────────────────────────────────┘

🔧 Configuration

Edit config.py to customize:

from config import PipelineConfig, ChunkingConfig, EmbeddingConfig

config = PipelineConfig(
    chunking=ChunkingConfig(
        chunk_size=512,           # Child chunk target size
        parent_chunk_size=2048,   # Parent context window
        overlap=0.25,             # 25% overlap between chunks
    ),
    embedding=EmbeddingConfig(
        model_name="BAAI/bge-large-en-v1.5",
        dimension=1024,
        device="cuda",
    ),
    use_hyperbolic_embeddings=True,  # Enable for hierarchical data
)

📁 Project Structure

new_thing/
├── main.py              # CLI entry point
├── pipeline.py          # Encoding orchestrator
├── document_loader.py   # PDF/text parsing with structure detection
├── semantic_chunker.py  # Parent-document chunking strategy
├── embeddings.py        # Dense embeddings + vector arithmetic
├── vector_store.py      # ChromaDB with hybrid search
├── agent_interface.py   # High-level AI agent API
├── config.py            # Configuration management
├── requirements.txt     # Dependencies
└── README.md            # This file

🧠 Advanced: Representation Theory Features

Hyperbolic Embeddings (Poincaré Ball)

Enable for hierarchical cybersecurity concepts:

config = PipelineConfig(use_hyperbolic_embeddings=True)

This projects Euclidean embeddings into hyperbolic space where:

  • Parent concepts (e.g., "Malware") are near the origin
  • Child concepts (e.g., "WannaCry") are near the boundary
  • Hierarchical (tree-like) distances can be represented with less distortion than in Euclidean space of the same dimension
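
For intuition, the sketch below maps a Euclidean vector into the Poincaré ball with the exponential map at the origin and measures distances with the Poincaré metric (curvature -1). Whether the project uses this particular projection is an assumption; the function names are hypothetical.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def exp_map_origin(v, eps=1e-9):
    """Map a Euclidean vector into the open unit ball: tanh(|v|) * v / |v|."""
    n = norm(v)
    if n < eps:
        return [0.0] * len(v)
    scale = math.tanh(n) / n
    return [scale * x for x in v]


def poincare_distance(u, v, eps=1e-9):
    """Distance in the Poincare ball:
    arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = max((1.0 - norm(u) ** 2) * (1.0 - norm(v) ** 2), eps)
    return math.acosh(1.0 + 2.0 * diff_sq / denom)


general = exp_map_origin([0.3, 0.4])   # short vector, stays near the origin
specific = exp_map_origin([3.0, 4.0])  # long vector, lands near the boundary
```

Vectors with a small norm stay near the origin and vectors with a large norm are pushed toward the boundary, where distances grow rapidly. That growth is what leaves room for many specific concepts beneath one general concept.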

Vector Arithmetic Examples

from embeddings import VectorArithmetic, EmbeddingEngine

engine = EmbeddingEngine()
va = VectorArithmetic(engine)

# Concept composition
result = va.compute("Zero Trust + Cloud Architecture - On-Premise")

# Analogy: A is to B as C is to ?
result = va.analogy("Firewall", "Network", "WAF")  # → Web Application
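
The analogy "A is to B as C is to ?" is commonly computed as the nearest neighbor of B - A + C. The following sketch uses tiny hand-made vectors to show the mechanics; the `analogy` function here is a standalone illustration, not the `VectorArithmetic.analogy` method, and the toy vectors are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def analogy(a, b, c, vectors):
    """Return the key closest to vectors[b] - vectors[a] + vectors[c],
    excluding the three input terms."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {k: v for k, v in vectors.items() if k not in (a, b, c)}
    return max(candidates, key=lambda k: cosine(target, candidates[k]))


# Toy 3-d vectors (invented): [network-ness, web-ness, is-a-control]
toy_vectors = {
    "Network": [1.0, 0.0, 0.0],
    "Firewall": [1.0, 0.0, 1.0],
    "Web Application": [0.0, 1.0, 0.0],
    "WAF": [0.0, 1.0, 1.0],
    "Malware": [0.5, 0.5, -1.0],
}
answer = analogy("Firewall", "Network", "WAF", toy_vectors)
```

With real 1024-dimensional embeddings the relationship is much noisier than in this toy setup, so results are best treated as suggestions.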

📈 Performance Tips

  1. Use GPU: Set --device cuda for up to roughly 10x faster encoding than on CPU, depending on hardware
  2. Batch size: Increase --batch-size if you have VRAM headroom
  3. Chunk size: Smaller chunks = more precise retrieval, larger = more context
  4. Overlap: 20-30% overlap reduces context loss at chunk boundaries
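
To make the overlap setting concrete: with a chunk size of 512 and 25% overlap, each window starts 384 tokens after the previous one, so consecutive windows share 128 tokens. The function below is a sketch with a hypothetical name, operating on any list of tokens.

```python
def sliding_windows(tokens, chunk_size=512, overlap=0.25):
    """Split `tokens` into windows of `chunk_size` that overlap by a fraction.

    overlap=0.25 means each window repeats the last 25% of the previous one.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(chunk_size * (1 - overlap)))
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


windows = sliding_windows(list(range(1000)))
```

Higher overlap increases the number of chunks to embed and store, so it trades index size for robustness at boundaries.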

🔐 Security Note

This tool processes potentially sensitive cybersecurity content. The vector database is stored locally by default. For production deployments, consider:

  • Encrypting the vector store
  • Restricting access to the database directory
  • Logging queries for audit purposes
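
As one possible shape for query audit logging, the sketch below builds a JSON record that stores a SHA-256 hash of the query rather than its text, so the log itself does not reveal what was searched. The record layout and the `audit_record` name are assumptions; this project does not define an audit format.

```python
import hashlib
import json
import time


def audit_record(user, query, now=None):
    """Build one JSON audit line for a query.

    Hypothetical format: the query is stored only as a SHA-256 digest.
    `now` lets callers supply a timestamp (useful for testing).
    """
    record = {
        "ts": now if now is not None else time.time(),
        "user": user,
        "query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


line = audit_record("alice", "remote code execution", now=0)
```

Each line can be appended to a log file whose permissions are restricted in the same way as the database directory.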

Built for seamless AI agent access to cybersecurity knowledge.
