CustomKB: AI-Powered Knowledgebase System

CustomKB transforms documents into AI-powered, searchable knowledgebases using vector embeddings, FAISS indexing, and LLM integration. It supports multiple AI providers (OpenAI, Anthropic, Google, xAI, Ollama) and delivers contextually relevant answers from your data.

How It Works

CustomKB follows a three-stage pipeline:

1. Document Processing (customkb database)
   ├─ Text extraction (Markdown, HTML, PDF, code, plain text)
   ├─ Language detection (27+ languages)
   ├─ Intelligent chunking (configurable token ranges, context-aware)
   └─ Metadata extraction (filenames, categories, timestamps)

2. Embedding Generation (customkb embed)
   ├─ Vector embeddings via OpenAI, Google, or local models
   ├─ Batch processing with checkpoints and retry logic
   ├─ FAISS index creation (auto-selects Flat/IVF/HNSW by dataset size)
   └─ Two-tier caching (memory LRU + disk)

3. Semantic Query (customkb query)
   ├─ Query embedding + optional enhancement (spelling, synonyms)
   ├─ Vector similarity search (k-NN via FAISS)
   ├─ Optional hybrid search (vector + BM25 keyword matching)
   ├─ Optional cross-encoder reranking for precision
   ├─ Context assembly from top results
   └─ LLM response generation with retrieved context
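
The vector similarity step in stage 3 can be illustrated without FAISS. The sketch below is not CustomKB's code: it is a brute-force k-NN over toy vectors using cosine similarity, and the names `cosine`, `knn`, and the sample chunk ids are made up for illustration. FAISS performs the same kind of nearest-neighbour lookup far more efficiently.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn(query: list[float], chunks: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
chunks = {
    "install": [0.9, 0.1, 0.0],
    "config": [0.1, 0.9, 0.0],
    "license": [0.0, 0.1, 0.9],
}
top = knn([1.0, 0.2, 0.0], chunks, k=2)
print([chunk_id for chunk_id, _ in top])  # ['install', 'config']
```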

Key Features

Category	Features
Search	Semantic search by meaning; hybrid vector + BM25 keyword search; cross-encoder reranking (+20-40% accuracy)
AI Providers	OpenAI, Anthropic, Google, xAI, and local models via Ollama
Documents	Markdown, HTML, PDF, code files, plain text; 27+ languages with auto-detection
Performance	Auto-tuning memory tiers (4GB–128GB+); GPU acceleration (CUDA 11/12); concurrent batch processing; two-tier LRU cache
Security	Zero pickle (JSON-only serialization); SQL injection protection; path traversal prevention; parameterized queries; no API keys in logs
Reliability	Checkpoint saving; exponential backoff with retry; graceful error handling; atomic DB operations

Prerequisites

Python 3.12+
SQLite 3.45+ (bundled with most Python builds; check with python -c "import sqlite3; print(sqlite3.sqlite_version)")
RAM 4GB+ (8GB+ recommended)
GPU (optional): NVIDIA with CUDA 11 or 12
API Keys: For chosen AI providers (OpenAI, Anthropic, Google, xAI) — or use local Ollama models for zero API dependency

Installation

1. Clone and Install

git clone https://github.com/Open-Technology-Foundation/customkb.git
cd customkb

# Install uv if not already available
# See: https://docs.astral.sh/uv/getting-started/installation/

# Install dependencies (auto-creates .venv)
uv sync --extra faiss-gpu-cu12 --extra mcp --extra test

# CPU-only (no NVIDIA GPU):
# uv sync --extra faiss-cpu --extra mcp --extra test

# Or auto-detect FAISS variant:
# ./setup/install_faiss.sh

2. Install NLTK Data

sudo ./setup/nltk_setup.py download cleanup

3. Setup Knowledgebase Directory

# System-wide (requires sudo)
sudo mkdir -p /var/lib/vectordbs && sudo chown "$USER:$USER" /var/lib/vectordbs
export VECTORDBS="/var/lib/vectordbs"

# Or user-local (no sudo)
mkdir -p "$HOME/knowledgebases"
export VECTORDBS="$HOME/knowledgebases"

Add to your shell profile (~/.bashrc):

export VECTORDBS="/var/lib/vectordbs"  # or $HOME/knowledgebases

4. Configure API Keys

export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"      # Optional
export XAI_API_KEY="your-xai-key"            # Optional

Quick Start

# 1. Create knowledgebase directory
mkdir -p "$VECTORDBS/myproject"

# 2. Create configuration
cat > "$VECTORDBS/myproject/myproject.cfg" << 'EOF'
[DEFAULT]
vector_model = text-embedding-3-small
query_model = gpt-4o-mini
db_min_tokens = 200
db_max_tokens = 400
EOF

# 3. Process documents
customkb database myproject docs/*.md *.txt

# 4. Generate embeddings
customkb embed myproject

# 5. Query
customkb query myproject "What are the main features?"

Core Commands

`database` — Import Documents

customkb database <kb_name> [files...] [options]

Option	Description
`-l, --language`	Stopwords language (en, fr, de, etc.)
`--detect-language`	Auto-detect language per file
`-f, --force`	Reprocess existing files
`-v / -q`	Verbose / quiet output

`embed` — Generate Embeddings

customkb embed <kb_name> [options]

Option	Description
`-r, --reset-database`	Reset embedding status and regenerate all
`-v / -q`	Verbose / quiet output

`query` — Search & Ask Questions

customkb query <kb_name> "<question>" [options]

Option	Description
`-c, --context-only`	Return context without AI response
`-m, --model`	AI model to use
`-k, --top-k`	Number of results (default: 50)
`-s, --context-scope`	Context segments per result
`-t, --temperature`	Response creativity (0.0–2.0)
`-f, --format`	Output format: xml, json, markdown, plain
`-p, --prompt-template`	Style: default, instructive, scholarly, concise, analytical, conversational, technical
`--categories`	Filter by categories (comma-separated)
`--context-files`	Additional context files to include

`categorize` — AI-Powered Categorization

customkb categorize <kb_name> [options]

Option	Description
`-S, --sample N`	Process only N sample articles
`-f, --full`	Process all articles
`--fresh`	Ignore checkpoint, reprocess all
`--import`	Import categories to database
`--list`	List existing categories and counts
`-m, --model`	AI model (default: `claude-haiku-4-5`)
`-s, --sampling T-M-B`	Chunk sampling (e.g., `5-10-5`)
`-c, --confidence-threshold`	Minimum confidence (default: 0.5)
`-D, --no-dedup`	Disable category deduplication
`--dedup-threshold`	Similarity threshold for dedup (default: 85.0)
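
To give a feel for what a dedup threshold of 85.0 means, here is a rough sketch of threshold-based fuzzy deduplication. CustomKB uses rapidfuzz for this; the sketch swaps in the standard library's `difflib` so it runs without extra packages, which means the exact scores will differ from rapidfuzz. The function names and sample categories are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in for rapidfuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def dedup(categories: list[str], threshold: float = 85.0) -> list[str]:
    """Keep the first spelling of each category; drop later near-duplicates."""
    kept: list[str] = []
    for name in categories:
        if not any(similarity(name, existing) >= threshold for existing in kept):
            kept.append(name)
    return kept

print(dedup(["Machine Learning", "machine learning", "Machine-Learning", "Databases"]))
# ['Machine Learning', 'Databases']
```

Raising the threshold keeps more near-identical spellings apart; lowering it merges more aggressively.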

Other Commands

Command	Description
`customkb edit <kb_name>`	Open KB config in `$EDITOR`
`customkb optimize <kb_name>`	Auto-optimize performance settings
`customkb optimize --analyze`	Show system recommendations
`customkb verify-indexes <kb_name>`	Check database index health
`customkb bm25 <kb_name>`	Build BM25 hybrid search index
`customkb convert-encoding <files>`	Convert files to UTF-8
`customkb version`	Show version information

Configuration

CustomKB uses INI-style configuration with environment variable overrides.

Priority order: Environment variables > .cfg file > defaults.

Example Configuration

[DEFAULT]
# Models
vector_model = text-embedding-3-small
query_model = gpt-4o-mini

# Text Processing
db_min_tokens = 200          # Minimum chunk size
db_max_tokens = 400          # Maximum chunk size

# Query Settings
query_max_tokens = 4096      # Max tokens in LLM response
query_top_k = 30             # Chunks to retrieve
query_temperature = 0.1      # 0=precise, 2=creative
query_role = You are a helpful expert assistant.

# Output
reference_format = json      # xml, json, markdown, plain
query_prompt_template = technical

[ALGORITHMS]
similarity_threshold = 0.6   # Minimum similarity (0–1)
enable_hybrid_search = true  # Vector + keyword search
bm25_weight = 0.5            # BM25 weight in hybrid mode
bm25_max_results = 1000
enable_reranking = true
reranking_model = cross-encoder/ms-marco-MiniLM-L-6-v2
reranking_top_k = 30

[PERFORMANCE]
embedding_batch_size = 100
cache_thread_pool_size = 4
memory_cache_size = 10000
checkpoint_interval = 10

[API]
api_call_delay_seconds = 0.05
api_max_concurrency = 8
api_max_retries = 20

Configuration Tips

db_min_tokens/db_max_tokens: Smaller chunks = more precise retrieval; larger = more context per result
similarity_threshold: Lower (0.5) for broader recall, higher (0.7) for strict relevance
enable_hybrid_search: Recommended for technical documentation
query_temperature: 0.0–0.3 for factual answers, 0.7–1.0 for creative responses
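
To see how similarity_threshold and bm25_weight might interact, consider the sketch below. It assumes a simple linear blend of the vector score and a max-normalized BM25 score, with the threshold applied to the vector score before blending. The README does not specify CustomKB's actual fusion formula, so treat this as an illustration of the idea only.

```python
def hybrid_scores(
    vector: dict[str, float],
    bm25: dict[str, float],
    bm25_weight: float = 0.5,
    similarity_threshold: float = 0.6,
) -> list[tuple[str, float]]:
    """Drop weak vector matches, then blend vector and keyword scores."""
    max_bm25 = max(bm25.values(), default=0.0)
    fused: dict[str, float] = {}
    for chunk_id, sim in vector.items():
        if sim < similarity_threshold:
            continue  # below the minimum similarity, so excluded
        # Scale BM25 into 0-1 so both scores are comparable (an assumption).
        keyword = bm25.get(chunk_id, 0.0) / max_bm25 if max_bm25 > 0 else 0.0
        fused[chunk_id] = (1 - bm25_weight) * sim + bm25_weight * keyword
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

vector = {"a": 0.9, "b": 0.7, "c": 0.4}
bm25 = {"a": 2.0, "b": 8.0, "c": 10.0}
ranked = hybrid_scores(vector, bm25)
print([chunk_id for chunk_id, _ in ranked])  # ['b', 'a']
```

Chunk "c" has the best keyword score but falls under the 0.6 threshold, while "b" overtakes "a" once keyword evidence is weighted in.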

Architecture

customkb                     # Bash wrapper (activates .venv, calls customkb.py)
customkb.py                  # CLI entry point, match/case command dispatch
├── config/
│   ├── config_manager.py    # KnowledgeBase class, config loading, name resolution
│   └── models.py            # Pydantic config models with env var overrides
├── categorize/
│   ├── categorize_manager.py    # LLM-based document categorization
│   ├── category_deduplicator.py # Fuzzy deduplication via rapidfuzz
│   └── import_to_db.py          # Category import pipeline to SQLite
├── database/
│   ├── db_manager.py        # Document processing pipeline
│   ├── connection.py        # SQLite connection with WAL + PRAGMA tuning
│   ├── chunking.py          # Text chunking with overlap (langchain splitters)
│   ├── index_manager.py     # Database index creation and verification
│   └── migrations.py        # Schema migrations with version tracking
├── embedding/
│   ├── embed_manager.py     # Embedding orchestration, FAISS index management
│   ├── providers.py         # Embedding provider abstraction (OpenAI, Google, local)
│   ├── litellm_provider.py  # LiteLLM unified embedding interface
│   ├── bm25_manager.py      # BM25 index for hybrid search
│   ├── rerank_manager.py    # Cross-encoder reranking with score caching
│   ├── cache.py             # Thread-safe two-tier cache (memory LRU + disk)
│   ├── batch.py             # Batch progress tracking with ETA
│   └── index.py             # FAISS index type auto-selection
├── query/
│   ├── query_manager.py     # Re-export hub for all query submodules
│   ├── processing.py        # Query orchestration pipeline
│   ├── search.py            # FAISS vector search, hybrid search, result assembly
│   ├── response.py          # LLM response generation (multi-provider)
│   ├── llm.py               # Unified LLM interface via LiteLLM
│   ├── embedding.py         # Query embedding generation and caching
│   ├── enhancement.py       # Spelling correction, synonym expansion
│   ├── formatters.py        # Output formatting (XML, JSON, Markdown, plain)
│   └── prompt_templates.py  # Response style templates
├── models/
│   └── model_manager.py     # Model registry from Models.json (aliases, providers)
├── mcp_server/
│   └── server.py            # MCP server for Claude Code integration
└── utils/
    ├── security_utils.py    # Input validation, path sanitization, safe SQL
    ├── text_utils.py        # Text cleaning, entity preservation
    ├── logging_config.py    # Centralized KB-specific logging
    ├── optimization_manager.py  # Memory tier auto-optimization
    ├── performance_analyzer.py  # System profiling and recommendations
    ├── exceptions.py        # Custom exception hierarchy
    ├── faiss_loader.py      # FAISS loading with GPU/CPU fallback
    ├── gpu_utils.py         # GPU detection and memory management
    ├── resource_manager.py  # Thread pool and resource lifecycle
    ├── language_detector.py # Language detection with caching
    ├── encoding_converter.py # File encoding conversion to UTF-8
    ├── enums.py             # Shared enumerations
    └── context_managers.py  # Database and resource context managers

Data Flow

Database: Files → text extraction → language-aware chunking → SQLite storage with metadata
Embed: Chunks → vector embeddings (via LiteLLM) → FAISS index (auto-selected type) → two-tier cache
Query: Question → embedding → FAISS search → optional BM25 hybrid → optional reranking → context assembly → LLM response

Storage Structure

$VECTORDBS/<kb_name>/
├── <kb_name>.cfg      # Configuration (required)
├── <kb_name>.db       # SQLite database (chunks, metadata, categories)
├── <kb_name>.faiss    # FAISS vector index
└── <kb_name>.bm25     # BM25 index (optional, for hybrid search)

Design Principles

Lazy imports: Heavy modules loaded only when their command is invoked (fast CLI startup)
Config-driven: All thresholds, batch sizes, and behaviors configurable via .cfg with env var overrides
Provider-agnostic: LiteLLM abstraction layer for embeddings and LLM calls
Safe by default: JSON serialization only, parameterized SQL, validated file paths

Advanced Features

Supported Models

Embedding Models

Model	Provider	Dimensions	Notes
`text-embedding-3-small`	OpenAI	1536	Cost-effective default
`text-embedding-3-large`	OpenAI	3072	Best quality
`text-embedding-ada-002`	OpenAI	1536	Legacy
`gemini-embedding-001`	Google	768/1536/3072	30k token context, Matryoshka support

LLM Models (via LiteLLM)

Provider	Models
OpenAI	GPT-5.x, GPT-4.1, GPT-4o, o3, o4-mini
Anthropic	Claude Opus 4.5, Sonnet 4.5, Haiku 4.5, Opus 4.1
Google	Gemini 2.5 Pro/Flash, Gemini 1.5 Pro
xAI	Grok 4, Grok 4-fast
Ollama	Llama 3.3, Gemma 3, DeepSeek R1, Qwen 2.5, Mistral, Phi-4

Model aliases are resolved via Models.json. Run customkb query <kb> --model <alias> to use any supported model.

Prompt Templates

customkb query myproject "question" --prompt-template <template>

Available: default, instructive, scholarly, concise, analytical, conversational, technical

Output Formats

customkb query myproject "search" --format json     # Structured API output
customkb query myproject "search" --format xml      # XML with references
customkb query myproject "search" --format markdown # Documentation-friendly
customkb query myproject "search" --format plain    # Plain text

KB Name Resolution

All of these resolve to the same knowledgebase:

customkb query myproject "test"
customkb query myproject.cfg "test"
customkb query "$VECTORDBS/myproject" "test"
customkb query "$VECTORDBS/myproject/myproject.cfg" "test"
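
One way to picture this resolution is to reduce every accepted spelling to a bare KB name and rebuild the canonical config path from it. The helper below is a simplified sketch, not the logic in config_manager.py: `resolve_kb` is a hypothetical name, and it ignores any directory part of the argument rather than validating it.

```python
from pathlib import Path

def resolve_kb(name: str, base: str) -> Path:
    """Map any accepted spelling to <base>/<kb>/<kb>.cfg."""
    path = Path(name)
    # "myproject.cfg" -> "myproject"; a bare name or directory keeps its last part.
    kb = path.stem if path.suffix == ".cfg" else path.name
    return Path(base) / kb / f"{kb}.cfg"

base = "/var/lib/vectordbs"
spellings = [
    "myproject",
    "myproject.cfg",
    f"{base}/myproject",
    f"{base}/myproject/myproject.cfg",
]
print({str(resolve_kb(s, base)) for s in spellings})
# {'/var/lib/vectordbs/myproject/myproject.cfg'}
```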

Security

Implemented Protections

Safe serialization: JSON-only format for all caches, checkpoints, and indexes (zero pickle)
SQL injection prevention: Parameterized queries and table name validation throughout
Path traversal protection: All file paths validated before access
API key security: Environment variable storage, never logged or exposed in errors
Atomic operations: Database writes with rollback support
Input validation: All user-provided parameters sanitized at system boundaries
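
The SQL injection protections combine two techniques: values are bound as parameters, and identifiers such as table names, which cannot be bound, are validated against a strict pattern. The sketch below shows both with the standard `sqlite3` module. The `docs` table, its columns, and the helper names are hypothetical and do not reflect CustomKB's schema.

```python
import re
import sqlite3

_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def safe_table(name: str) -> str:
    """Allow only plain identifiers, since table names cannot be bound as parameters."""
    if not _TABLE_NAME.fullmatch(name):
        raise ValueError(f"invalid table name: {name!r}")
    return name

def find_chunks(conn: sqlite3.Connection, table: str, category: str) -> list[tuple]:
    """Look up rows by category; the value is bound with a ? placeholder."""
    query = f"SELECT id, text FROM {safe_table(table)} WHERE category = ?"
    return conn.execute(query, (category,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO docs (text, category) VALUES (?, ?)",
    [("Install guide", "setup"), ("License text", "legal")],
)

print(find_chunks(conn, "docs", "setup"))             # [(1, 'Install guide')]
print(find_chunks(conn, "docs", "setup' OR '1'='1"))  # [] because the input is treated as data
```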

Best Practices

Store API keys in environment variables, never in config files
Restrict $VECTORDBS directory permissions to the application user
Run behind an authentication proxy for network-exposed deployments
Use local Ollama models for maximum data privacy

Performance Optimization

Auto-Optimization

customkb optimize myproject           # Apply optimizations
customkb optimize --analyze           # Show recommendations
customkb optimize myproject --dry-run # Preview changes

Memory Tiers

CustomKB auto-configures based on available system memory:

Memory	Tier	Batch Size	Cache Size	Hybrid Search
<16GB	Low	50	5,000	Disabled
16–64GB	Medium	100	10,000	Available
64–128GB	High	200	20,000	Enabled
>128GB	Very High	300	50,000	Enabled

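The tier table above can be read as a simple lookup on available memory. The function below is a sketch of that mapping, not the code in optimization_manager.py. The table leaves the exact boundaries open, so this sketch assumes 64GB counts as High and 128GB still counts as High; `Tier` and `tier_for` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    batch_size: int
    cache_size: int
    hybrid_search: str

def tier_for(memory_gb: float) -> Tier:
    """Pick a tier from system memory in GB (boundary handling is an assumption)."""
    if memory_gb < 16:
        return Tier("Low", 50, 5_000, "Disabled")
    if memory_gb < 64:
        return Tier("Medium", 100, 10_000, "Available")
    if memory_gb <= 128:
        return Tier("High", 200, 20_000, "Enabled")
    return Tier("Very High", 300, 50_000, "Enabled")

print(tier_for(32))
# Tier(name='Medium', batch_size=100, cache_size=10000, hybrid_search='Available')
```
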
GPU Acceleration

CUDA-enabled NVIDIA GPUs accelerate cross-encoder reranking (20-40% faster) and FAISS index operations:

uv sync --extra faiss-gpu-cu12   # Install GPU FAISS
./scripts/benchmark_gpu.py       # Benchmark GPU vs CPU

Database Indexes

customkb verify-indexes myproject  # Check index health
customkb bm25 myproject           # Build hybrid search index

MCP Server Integration

CustomKB includes a Model Context Protocol server for integration with Claude Code and other MCP-compatible tools:

uv sync --extra mcp  # Install MCP dependencies

The MCP server exposes knowledgebase search as tools, allowing AI assistants to query your knowledgebases directly. See mcp_server/server.py for configuration.

Troubleshooting

Common Issues

"Knowledgebase not found"

ls -la "$VECTORDBS/"                            # Verify KB exists
ls -la "$VECTORDBS/myproject/myproject.cfg"     # Check config file

"API rate limit exceeded" — Increase delay in config:

api_call_delay_seconds = 0.1
api_max_concurrency = 4
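
These settings work alongside retry with exponential backoff. As a rough sketch of that pattern (not CustomKB's actual retry code), the helper below doubles the wait after each failed attempt, caps it, and adds a little jitter. `RuntimeError` stands in for a provider's rate-limit exception, and the `max_delay` cap and jitter amount are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 20,
    base_delay: float = 0.05,
    max_delay: float = 30.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call func, retrying on RuntimeError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RuntimeError:  # stand-in for a provider's rate-limit error
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))  # small random jitter
    raise AssertionError("unreachable")

# Demo: fail twice, then succeed; record delays instead of actually sleeping.
attempts = {"count": 0}

def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

delays: list[float] = []
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))  # ok 2
```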

"Out of memory during embedding"

customkb optimize myproject  # Auto-adjust for your system
# Or manually: embedding_batch_size = 50

Low similarity scores — Adjust search parameters:

similarity_threshold = 0.5
enable_hybrid_search = true
# Or use stronger model: vector_model = text-embedding-3-large

Debug Mode

customkb query myproject "test" -v -d   # Verbose + debug
./scripts/diagnose_crashes.py myproject # Run diagnostics

Utility Scripts

Script	Purpose
`scripts/benchmark_gpu.py`	GPU vs CPU performance comparison
`scripts/benchmark_vectordb.py`	Vector database benchmarking
`scripts/show_optimization_tiers.py`	Display memory tier settings
`scripts/emergency_optimize.py`	Conservative recovery settings
`scripts/clean_corrupted_cache.py`	Clean corrupted cache files
`scripts/rebuild_bm25_filtered.py`	Rebuild BM25 indexes with filters
`scripts/upgrade_bm25_tokens.py`	Upgrade database for BM25 tokens
`scripts/diagnose_crashes.py`	Analyze crash logs and system state
`scripts/security-check.sh`	Run security validation checks

Testing

uv sync --extra test              # Install test dependencies

python run_tests.py               # All tests
python run_tests.py --unit        # Unit tests only
python run_tests.py --integration # Integration tests only
python run_tests.py --safe --memory-limit 2048  # With safety limits
python run_tests.py --coverage    # Generate coverage report

Test markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.slow, @pytest.mark.requires_api, @pytest.mark.performance, @pytest.mark.requires_data, @pytest.mark.resource_intensive

Frequently Asked Questions

Can I use CustomKB without API keys? Yes. Use local Ollama models for both embeddings and queries. No external API calls required.

How much does embedding cost? OpenAI text-embedding-3-small: ~$0.02/1M tokens. A 500-page manual (~250k tokens) costs ~$0.005. Ollama is free.
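
The arithmetic behind that estimate is simple enough to script for your own corpus. The helper name is hypothetical, and the default price is the figure quoted above, which may change over time.

```python
def embedding_cost(tokens: int, usd_per_million: float = 0.02) -> float:
    """Estimated embedding cost in USD for a given token count."""
    return tokens / 1_000_000 * usd_per_million

print(f"${embedding_cost(250_000):.4f}")  # $0.0050
```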

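Is my data private? Source files, the SQLite database, and the FAISS index stay on your machine. When you use a hosted provider, the text of each chunk is sent to that provider during embedding, and the query plus retrieved context is sent during querying. For maximum privacy, use local Ollama models.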

Can I use multiple embedding models in one KB? No. Each KB uses one embedding model. To switch, create a new KB or change vector_model and regenerate all embeddings with customkb embed <kb_name> --reset-database.

How do I update a KB when documents change?

customkb database myproject docs/*.md --force
customkb embed myproject

Maximum KB size? Tested to 10M+ chunks (~4GB database). FAISS scales to billions of vectors.

Environment Variables

Variable	Purpose
`VECTORDBS`	Knowledgebase base directory (default: `/var/lib/vectordbs`)
`OPENAI_API_KEY`	OpenAI API key
`ANTHROPIC_API_KEY`	Anthropic API key
`GOOGLE_API_KEY`	Google/Gemini API key
`XAI_API_KEY`	xAI API key
`NLTK_DATA`	NLTK data location (optional)

Contributing

Fork and clone the repository
Create a feature branch: git checkout -b feature/amazing-feature
Set up dev environment: uv sync --extra test --extra dev
Make changes, add tests, run python run_tests.py
Submit a pull request

See docs/DEVELOPMENT.md for detailed development guidelines.

License

GPL-3.0 License — see LICENSE file.

Maintained by the Indonesian Open Technology Foundation
