Cybersecurity Book Vector Encoder

Seamlessly encode an 800-page cybersecurity book for AI agent access.

This system implements advanced encoding strategies from representation theory and modern RAG architectures to enable semantic search, concept algebra, and analogical reasoning over your book's content.

🎯 Key Features

Semantic Chunking with Parent-Document Retrieval

Child chunks (512 tokens): Precise, specific segments for accurate retrieval
Parent chunks (2048 tokens): Full context windows for comprehensive understanding
Semantic boundary detection: Respects paragraphs, lists, and sentence boundaries
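
The parent/child relationship above can be sketched roughly as follows. This is an illustration only, not the code in semantic_chunker.py: it counts whitespace-separated words as a stand-in for model tokens, ignores semantic boundaries and overlap, and the names Chunk and build_parent_child_chunks are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None for parent chunks


def build_parent_child_chunks(
    text: str, child_size: int = 512, parent_size: int = 2048
) -> Tuple[Dict[str, Chunk], List[Chunk]]:
    """Split text into parent windows, then split each parent into children.

    Sizes are measured in whitespace-separated words (an assumption; the
    real pipeline presumably counts tokens).
    """
    words = text.split()
    parents: Dict[str, Chunk] = {}
    children: List[Chunk] = []
    for p_idx, p_start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = f"parent-{p_idx}"
        parents[parent_id] = Chunk(parent_id, " ".join(parent_words))
        # Every child records which parent it was cut from.
        for c_idx, c_start in enumerate(range(0, len(parent_words), child_size)):
            child_words = parent_words[c_start:c_start + child_size]
            children.append(
                Chunk(f"{parent_id}-child-{c_idx}", " ".join(child_words), parent_id)
            )
    return parents, children


parents, children = build_parent_child_chunks(
    "a b c d e f g h i j", child_size=2, parent_size=4
)
hit = children[0]                       # pretend this child matched a query
context = parents[hit.parent_id].text   # hand the wider parent text to the LLM
```

At query time only the child chunks are searched; a matching child is then swapped for its parent so the agent receives the surrounding context.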

High-Dimensional Dense Embeddings

Uses BAAI/bge-large-en-v1.5 (1024 dimensions), a general-purpose English embedding model suited to retrieval
Contextual enrichment: Prepends chapter/section breadcrumbs to anchor vectors
Optional hyperbolic projection for hierarchical concept relationships
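
Contextual enrichment, as described above, amounts to prefixing each chunk with its location in the book before it is embedded. A minimal sketch, assuming a " > " separator and a blank line between breadcrumb and body; the function name, the separator, and the sample chapter titles are all hypothetical:

```python
from typing import Sequence


def enrich_with_breadcrumbs(chunk_text: str, breadcrumbs: Sequence[str]) -> str:
    """Prefix a chunk with its chapter/section path before embedding."""
    trail = " > ".join(part.strip() for part in breadcrumbs if part.strip())
    if not trail:
        return chunk_text  # nothing to prepend
    return f"{trail}\n\n{chunk_text}"


enriched = enrich_with_breadcrumbs(
    "A stack canary is a value placed before the return address.",
    ["Chapter 7: Memory Corruption", "7.3 Stack Protections"],
)
print(enriched)
```

The enriched string is what gets embedded; the original chunk text can still be stored separately for display.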

Hybrid Search with RRF Fusion

Dense search: Semantic similarity for meaning-based queries
Sparse search: Keyword matching for exact technical terms (CVEs, protocols)
Reciprocal Rank Fusion: Merges the dense and sparse rankings by summing reciprocal ranks, so chunks ranked highly by either method rise to the top
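
Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranking it appears in. The sketch below shows the formula on two small ranked lists; it is not taken from vector_store.py, the function name is hypothetical, and k = 60 is a commonly used default rather than a value confirmed by this project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]    # ids ranked by embedding similarity
sparse_hits = ["b", "d", "a"]   # ids ranked by keyword match
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale.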

Vector Arithmetic (Concept Algebra)

Zero Trust + Cloud Architecture ≈ Cloud-Native Identity Management
Buffer Overflow - C++ + Rust ≈ Memory Safety Patterns
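
Concept algebra of this kind adds and subtracts embedding vectors, then looks up the nearest stored vector by cosine similarity. The sketch below uses hand-made 3-dimensional toy vectors purely for illustration; real embeddings come from the model and have 1024 dimensions, and the helper names here are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compose(plus: Iterable[Sequence[float]], minus: Iterable[Sequence[float]] = ()) -> List[float]:
    """Add the 'plus' vectors and subtract the 'minus' vectors element-wise."""
    plus, minus = list(plus), list(minus)
    dim = len(plus[0])
    out = [0.0] * dim
    for vec in plus:
        out = [o + x for o, x in zip(out, vec)]
    for vec in minus:
        out = [o - x for o, x in zip(out, vec)]
    return out


def nearest(query: Sequence[float], vocab: Dict[str, Sequence[float]], exclude: Iterable[str] = ()) -> str:
    """Return the vocabulary entry most similar to the query vector."""
    skip = set(exclude)
    candidates = {name: vec for name, vec in vocab.items() if name not in skip}
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Toy vectors (made up for this example, not real embeddings).
TOY = {
    "zero trust": [1.0, 0.0, 0.0],
    "cloud architecture": [0.0, 1.0, 0.0],
    "cloud-native identity management": [0.9, 0.9, 0.1],
    "malware analysis": [0.0, 0.1, 1.0],
}

query = compose([TOY["zero trust"], TOY["cloud architecture"]])
match = nearest(query, TOY, exclude=["zero trust", "cloud architecture"])
print(match)
```

Excluding the input terms from the lookup matters: otherwise the nearest neighbour of a sum is often one of its own operands.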

Cybersecurity-Specific Features

CVE extraction: Automatically detects and indexes CVE patterns
Severity tagging: Critical, High, Medium, Low classifications
Category detection: Network, Web, Crypto, Malware, Cloud, etc.
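
CVE identifiers follow the form CVE-YYYY-NNNN, with four or more digits in the sequence number, so they can be picked out with a regular expression. A minimal sketch of that idea; the function name and the choice to deduplicate and uppercase the matches are assumptions, not necessarily what the project does:

```python
import re
from typing import List

# "CVE-", a four-digit year, then a sequence number of four or more digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(text: str) -> List[str]:
    """Return unique CVE ids in order of first appearance, uppercased."""
    found: List[str] = []
    for match in CVE_PATTERN.findall(text):
        cve_id = match.upper()
        if cve_id not in found:
            found.append(cve_id)
    return found


sample = "Log4Shell (CVE-2021-44228) and Heartbleed (cve-2014-0160); see CVE-2021-44228."
cves = extract_cves(sample)
print(cves)
```

The extracted ids can be stored as chunk metadata so that a query for an exact CVE can be answered by filtering instead of similarity search.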

📦 Installation

pip install -r requirements.txt

For GPU acceleration (recommended for large books):

pip install torch --index-url https://download.pytorch.org/whl/cu118

🚀 Quick Start

1. Encode Your Book

python main.py encode path/to/cybersecurity_book.pdf

With custom options:

python main.py encode book.pdf \
    --model "BAAI/bge-large-en-v1.5" \
    --chunk-size 512 \
    --parent-chunk-size 2048 \
    --device cuda

2. Query the Encoded Book

Natural language query:

python main.py query "How do buffer overflow attacks work?"

Concept algebra:

python main.py concept "Zero Trust + Cloud Architecture"

Analogical reasoning:

python main.py analogy "SQL Injection" "Web Application" "Memory Corruption"
# Returns: content about memory corruption exploits in applications

3. Interactive Mode

python main.py interactive

Commands:

/concept <expression> - Vector arithmetic
/category <cat> <query> - Filter by category
/severity <level> <query> - Filter by severity
/quit - Exit

🤖 AI Agent Integration

from agent_interface import create_agent

# Create agent connected to encoded book
agent = create_agent(db_path="./vector_db")

# Natural language query
response = agent.query("What are the OWASP Top 10 vulnerabilities?")
print(response.context)  # Ready for LLM consumption

# Concept algebra
response = agent.concept_search("Ransomware + Prevention - Detection")

# Analogical reasoning
response = agent.analogy_search(
    a="SQL Injection",
    b="Input Validation",
    c="Buffer Overflow"
)
# Returns: content about memory protection techniques

# Category-filtered search
response = agent.search_by_category(
    question="authentication bypass",
    category="web_security"
)

# Severity-filtered search
response = agent.search_by_severity(
    question="remote code execution",
    severity="critical"
)

📊 Architecture

┌─────────────────────────────────────────────────────────────┐
│                     ENCODING PIPELINE                        │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │   PDF/Text  │───▶│  Semantic   │───▶│  Contextual │     │
│  │   Loader    │    │  Chunker    │    │  Embedder   │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
│        │                  │                   │              │
│        ▼                  ▼                   ▼              │
│   Structure         Child + Parent      1024-dim Dense      │
│   Detection         Chunks              Vectors             │
├─────────────────────────────────────────────────────────────┤
│                      VECTOR STORE                            │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐   │
│  │                    ChromaDB                          │   │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────────┐   │   │
│  │  │  Dense    │  │  Metadata │  │    Parent     │   │   │
│  │  │  Index    │  │  Filters  │  │    Store      │   │   │
│  │  └───────────┘  └───────────┘  └───────────────┘   │   │
│  └─────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│                     AI AGENT INTERFACE                       │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐    │
│  │   Hybrid    │  │   Vector    │  │   Metadata      │    │
│  │   Search    │  │   Algebra   │  │   Filtering     │    │
│  └─────────────┘  └─────────────┘  └─────────────────┘    │
└─────────────────────────────────────────────────────────────┘

🔧 Configuration

Edit config.py to customize:

from config import PipelineConfig, ChunkingConfig, EmbeddingConfig

config = PipelineConfig(
    chunking=ChunkingConfig(
        chunk_size=512,           # Child chunk target size
        parent_chunk_size=2048,   # Parent context window
        overlap=0.25,             # 25% overlap between chunks
    ),
    embedding=EmbeddingConfig(
        model_name="BAAI/bge-large-en-v1.5",
        dimension=1024,
        device="cuda",
    ),
    use_hyperbolic_embeddings=True,  # Enable for hierarchical data
)

📁 Project Structure

new_thing/
├── main.py              # CLI entry point
├── pipeline.py          # Encoding orchestrator
├── document_loader.py   # PDF/text parsing with structure detection
├── semantic_chunker.py  # Parent-document chunking strategy
├── embeddings.py        # Dense embeddings + vector arithmetic
├── vector_store.py      # ChromaDB with hybrid search
├── agent_interface.py   # High-level AI agent API
├── config.py            # Configuration management
├── requirements.txt     # Dependencies
└── README.md            # This file

🧠 Advanced: Representation Theory Features

Hyperbolic Embeddings (Poincaré Ball)

Enable for hierarchical cybersecurity concepts:

config = PipelineConfig(use_hyperbolic_embeddings=True)

This projects Euclidean embeddings into hyperbolic space where:

Parent concepts (e.g., "Malware") are near the origin
Child concepts (e.g., "WannaCry") are near the boundary
Tree-like hierarchies can be represented with less distortion than in Euclidean space of the same dimension
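
One common way to move a Euclidean vector into the Poincaré ball is the exponential map at the origin, which rescales a vector of norm r to norm tanh(r). The sketch below shows that map and the Poincaré distance for curvature -1. Whether the project uses exactly this projection is an assumption, and the function names are hypothetical. Note that the resulting radius depends only on the input norm, so placing general concepts near the origin requires the upstream vectors to encode depth in their length.

```python
import math
from typing import List, Sequence


def exp_map_origin(v: Sequence[float]) -> List[float]:
    """Map a Euclidean vector into the open unit ball (curvature -1)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(norm) / norm  # new norm is tanh(norm), always below 1
    return [scale * x for x in v]


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


general = exp_map_origin([0.3, 0.4])    # short input vector: stays near the origin
specific = exp_map_origin([3.0, 4.0])   # long input vector: pushed toward the boundary
origin = [0.0, 0.0]
print(math.hypot(*general), math.hypot(*specific))
print(poincare_distance(origin, general))
```

Distances grow rapidly near the boundary, which is what leaves room for many leaf concepts to spread out around a shared ancestor.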

Vector Arithmetic Examples

from embeddings import VectorArithmetic, EmbeddingEngine

engine = EmbeddingEngine()
va = VectorArithmetic(engine)

# Concept composition
result = va.compute("Zero Trust + Cloud Architecture - On-Premise")

# Analogy: A is to B as C is to ?
result = va.analogy("Firewall", "Network", "WAF")  # → Web Application
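
Under the hood, an analogy "A is to B as C is to ?" is usually computed as the vector B - A + C followed by a nearest-neighbour lookup. The sketch below reproduces that on hand-made toy vectors; it does not use the project's VectorArithmetic class, and the vectors and helper names are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a: str, b: str, c: str, vocab: Dict[str, Sequence[float]]) -> str:
    """Solve 'a is to b as c is to ?' with the offset b - a + c."""
    target: List[float] = [
        vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])
    ]
    # The three input terms are excluded so they cannot be returned.
    candidates = [name for name in vocab if name not in (a, b, c)]
    return max(candidates, key=lambda name: cosine(target, vocab[name]))


# Toy axes: (network-ness, web-ness, filtering-ness). Not real embeddings.
TOY = {
    "Firewall": [1.0, 0.0, 1.0],
    "Network": [1.0, 0.0, 0.0],
    "WAF": [0.0, 1.0, 1.0],
    "Web Application": [0.0, 1.0, 0.0],
    "Ransomware": [0.2, 0.1, 0.1],
}

answer = analogy("Firewall", "Network", "WAF", TOY)
print(answer)
```

With real embeddings the offset rarely lands exactly on a stored vector, so results are best read as the closest available content, not as a guaranteed answer.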

📈 Performance Tips

Use GPU: Set --device cuda for roughly 10x faster encoding than on CPU (the exact speedup depends on hardware)
Batch size: Increase --batch-size if you have VRAM headroom
Chunk size: Smaller chunks give more precise retrieval; larger chunks carry more context
Overlap: 20-30% overlap reduces context loss at chunk boundaries
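
To make the overlap setting concrete: with a fractional overlap, each window starts size * (1 - overlap) tokens after the previous one. A small sketch of that arithmetic, assuming a simple sliding window over a token list; the function name is hypothetical and the project's chunker may handle boundaries differently.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence, size: int = 512, overlap: float = 0.25) -> List[list]:
    """Cut tokens into windows of `size` that share `overlap` of their length."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(size * (1.0 - overlap)))  # 512 tokens at 25% -> 384
    windows: List[list] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return windows


windows = sliding_windows(list(range(10)), size=4, overlap=0.25)
print(windows)
```

Higher overlap means more chunks to embed and store, so the 20-30% range trades index size against continuity at the boundaries.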

🔐 Security Note

This tool processes potentially sensitive cybersecurity content. The vector database is stored locally by default. For production deployments, consider:

Encrypting the vector store
Restricting access to the database directory
Logging queries for audit purposes

Built for seamless AI agent access to cybersecurity knowledge.
