Transform your documentation and codebases into intelligent, queryable knowledge bases for RAG applications
CPM (Context Packet Manager) is a modular Python framework that transforms documentation, codebases, and knowledge repositories into chunked, embedded, FAISS-indexed context packets optimized for Retrieval-Augmented Generation (RAG).
- Plugin Architecture - Extend without modifying core code
- Language-Aware Chunking - 40+ languages with AST/Tree-sitter parsing
- Incremental Builds - Hash-based caching so rebuilds re-embed only changed chunks
- Claude Desktop Integration - Native MCP support for AI assistants
- Package Management - Versioned packets with semantic versioning
- Zero Config - Intelligent defaults, works out of the box
Create custom commands, builders, and retrievers without touching core code. Plugins are auto-discovered from .cpm/plugins/ and integrate with the CLI.
cpm plugin:list # List loaded plugins
cpm my-plugin:custom-command # Your command, integrated

CPM automatically detects and applies the optimal chunking strategy for your content. Can't find the right chunker? Use --builder custom-builder to plug in your own.
| Language | Strategy | Approach |
|---|---|---|
| Python | AST-based | Function/class boundaries |
| Java | Structure-aware | Method scope preservation |
| JavaScript/TypeScript | Tree-sitter | Syntax-aware parsing |
| Markdown | Header-based | Hierarchy preservation |
| 40+ more | Tree-sitter/Fallback | Universal coverage |
Fully extensible: Implement your own builder for custom logic.
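To make the header-based strategy in the table concrete, here is a minimal sketch of splitting Markdown at headings while keeping each chunk's heading path. It is an illustration only, not CPM's chunker: `chunk_markdown` and the `path`/`text` keys are hypothetical names, and the sketch does not skip `#` lines inside code fences.

```python
import re

# ATX headings only: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str) -> list[dict]:
    """Split Markdown at headings, recording each chunk's heading path."""
    chunks: list[dict] = []
    path: list[str] = []   # current heading hierarchy, e.g. ["Guide", "Install"]
    body: list[str] = []   # lines collected under the current heading

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"path": list(path), "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then push this one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading path with each chunk lets a retriever show where a passage came from even when the chunk text itself never repeats the section title.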
Rebuild only what changed. SHA-256 hash-based caching reuses existing embeddings:
# First build: 250 chunks
[embed] missing_vectors shape=(250, 768)
# Edit one file, rebuild
[cache] new_chunks=251 reused=250 to_embed=1 removed=0
[embed] missing_vectors shape=(1, 768)

Native Model Context Protocol (MCP) support. Expose your context packets as tools directly in Claude Desktop:
{
  "mcpServers": {
    "cpm": {
      "command": "cpm",
      "args": ["mcp:serve"]
    }
  }
}

Claude can now search your docs, code, and knowledge bases conversationally!
Versioned packets with semantic versioning, pinning, and pruning:
cpm pkg list # List installed packets
cpm pkg use my-packet@1.2.0 # Pin specific version
cpm pkg prune my-packet --keep 2 # Keep 2 latest versions

# Clone repository
git clone https://github.com/AEndrix03/component-rag.git
cd component-rag
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install CPM
pip install -e .
# Install dev dependencies (optional)
pip install -e ".[dev]" # black, ruff, mypy, pytest

# Create .cpm/ workspace structure
cpm init
# Verify installation
cpm doctor

Point CPM to an adapter exposing POST /v1/embeddings:
cpm embed add \
--name adapter-local \
--url http://127.0.0.1:8080 \
--model text-embedding-3-small \
--dims 768 \
--set-default

Minimal .cpm/config/embeddings.yml example:

default: adapter-local
providers:
  - name: adapter-local
    type: http
    url: http://127.0.0.1:8080
    model: text-embedding-3-small
    dims: 768
    http:
      path: /v1/embeddings
    hints:
      normalize: true

Supported hint headers sent by CPM connector:

- X-Embedding-Dim
- X-Embedding-Normalize
- X-Embedding-Task
- X-Model-Hint
See cpm_builtin/embeddings/README.md for full adapter spec, Docker Compose examples, and troubleshooting.
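For orientation, the sketch below builds (but does not send) a request to such an adapter using only the standard library. Several details are assumptions rather than the documented adapter spec: the OpenAI-style `{"model", "input"}` body, the string values used for the hint headers, and the use of `X-Model-Hint` to carry the model name. `build_embedding_request` is a hypothetical helper, not part of CPM.

```python
import json
import urllib.request


def build_embedding_request(base_url, model, texts, dims=None, normalize=None):
    """Build a POST /v1/embeddings request without sending it."""
    headers = {
        "Content-Type": "application/json",
        "X-Model-Hint": model,  # assumption: the hint carries the model name
    }
    if dims is not None:
        headers["X-Embedding-Dim"] = str(dims)
    if normalize is not None:
        # assumption: booleans are sent as lowercase strings
        headers["X-Embedding-Normalize"] = "true" if normalize else "false"

    # assumption: OpenAI-style request body
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    url = base_url.rstrip("/") + "/v1/embeddings"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_embedding_request(
    "http://127.0.0.1:8080",
    "text-embedding-3-small",
    ["hello world"],
    dims=768,
    normalize=True,
)
# To actually call the adapter: urllib.request.urlopen(request)
```

The adapter README referenced above remains the authority on the real request and response shapes.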
# Start embedding server (or use remote service)
# (See embedding server docs for setup)
# Build a context packet from your docs
cpm build \
--source ./docs \
--destination ./packets/my-docs \
--model jinaai/jina-embeddings-v2-base-code \
--version 1.0.0

# 1) Standard build (default builder)
cpm build \
--source ./docs \
--name my-docs \
--version 1.0.0 \
--model jinaai/jina-embeddings-v2-base-code \
--embed-url http://127.0.0.1:8876

# 2) LLM builder (explicit embedding model)
cpm build \
--source C:\path\to\repo \
--builder llm:cpm-llm-builder \
--name repo-packet \
--version 0.0.1 \
--model BAAI/bge-base-en-v1.5 \
--embed-url http://127.0.0.1:8876

# 3) Rebuild same packet/version to regenerate vectors + FAISS
# (useful if a previous run produced chunks/cache but no vectors/index)
cpm build \
--source C:\path\to\repo \
--builder llm:cpm-llm-builder \
--name repo-packet \
--version 0.0.1 \
--model BAAI/bge-base-en-v1.5 \
--embed-url http://127.0.0.1:8876

# 4) Migrate to a different embedder/model using workspace default provider
# (embed URL is resolved from .cpm/config/embeddings.yml default provider)
cpm build \
--source C:\path\to\repo \
--builder llm:cpm-llm-builder \
--name repo-packet \
--version 0.0.1 \
--model intfloat/multilingual-e5-base

# 5) Re-embed an existing packet directly from docs.jsonl chunks
# (builder is not required in this mode)
cpm build embed \
--source ./dist/repo-packet/0.0.1 \
--model intfloat/multilingual-e5-base

Notes:

- `--packet-version` remains supported as a compatibility alias, but `--version` is preferred.
- `--source` and `--builder` are still required for deterministic rebuilds (`cpm build run`) because chunk generation depends on builder behavior and source content.
- `cpm build embed` starts from an already built packet (`docs.jsonl` required) and regenerates `vectors.f16.bin`, `faiss/index.faiss`, and `manifest.json`.
Output:
[scan] files_indexed=145 chunks_total=1250
[cache] enabled: cached_vectors=0 dim=768
[embed] missing_vectors shape=(1250, 768)
[faiss] ntotal=1250
[done] build ok
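The `[cache]` line in the build output comes from hash-based reuse of embeddings. The sketch below shows one plausible way to plan such a rebuild; it is not CPM's implementation. `chunk_key` and `plan_embedding` are hypothetical names, and keying the cache on model name plus chunk text is an assumption.

```python
import hashlib


def chunk_key(text: str, model: str) -> str:
    """SHA-256 over model name and chunk text; changing either yields a new key."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def plan_embedding(chunks, cache, model):
    """Split chunks into cache hits and misses, and find stale cache entries.

    `cache` maps chunk keys to previously computed vectors.
    Returns (reused_keys, chunks_to_embed, removed_keys).
    """
    keys = [chunk_key(chunk, model) for chunk in chunks]
    live = set(keys)
    reused = [key for key in keys if key in cache]
    to_embed = [chunk for chunk, key in zip(chunks, keys) if key not in cache]
    removed = [key for key in cache if key not in live]
    return reused, to_embed, removed


# One chunk is new and one cached chunk no longer exists in the source.
cache = {chunk_key(text, "demo-model"): [0.0] for text in ("alpha", "beta", "stale")}
reused, to_embed, removed = plan_embedding(["alpha", "beta", "gamma"], cache, "demo-model")
print(f"[cache] reused={len(reused)} to_embed={len(to_embed)} removed={len(removed)}")
```

Including the model name in the key means that switching embedders invalidates every entry, which matches the need to regenerate vectors when migrating to a different model.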
# Query for relevant context (auto-detects retriever from project config)
cpm query \
--packet my-docs \
--query "authentication setup" \
-k 5
# Or specify a custom retriever
cpm query --packet my-docs --query "auth" --retriever custom-retriever

- Configure Claude Desktop

  Edit ~/.config/Claude/claude_desktop_config.json (Linux) or equivalent:

  {
    "mcpServers": {
      "cpm": {
        "command": "/path/to/.venv/bin/cpm",
        "args": ["mcp:serve"],
        "env": {
          "RAG_CPM_DIR": "/path/to/workspace/.cpm"
        }
      }
    }
  }

- Restart Claude Desktop

- Use in conversation:

  You: "What packets are available?"
  Claude: [calls lookup tool] I can see 3 context packets...
  You: "Search my-docs for authentication examples"
  Claude: [calls query tool] Here are the relevant sections...

CPM follows a modular, plugin-based architecture:

┌─────────────────────────────────────────────────────────────┐
│                             CPM                             │
└─────────────────────────────────────────────────────────────┘

              cpm_cli                    CLI Entry Point
                 │
                 ├─ Command Resolution
                 ├─ Token Parsing
                 │
                 ▼
      ┌──────────────────────┐
      │       cpm_core       │           Foundation Layer
      │                      │
      │  • CPMApp            │           Application Bootstrap
      │  • FeatureRegistry   │           Command/Plugin Registry
      │  • PluginManager     │           Plugin Discovery/Loading
      │  • Workspace         │           .cpm/ Management
      │  • EventBus          │           Lifecycle Hooks
      │  • ServiceContainer  │           Dependency Injection
      └──────────────────────┘
                 │
     ┌───────────┼───────────┐
     │           │           │
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌──────────┐
│ Plugins │ │Builtins │ │  Build   │
│         │ │         │ │  System  │
│ • MCP   │ │• Init   │ │• Chunker │
│ • ...   │ │• Doctor │ │• Embedder│
└─────────┘ └─────────┘ │• FAISS   │
                        └──────────┘
                             │
                             ▼
                   ┌───────────────────┐
                   │  Context Packet   │
                   │                   │
                   │  • docs.jsonl     │
                   │  • vectors.f16    │
                   │  • faiss/index    │
                   │  • manifest.json  │
                   └───────────────────┘

component-rag/
├── cpm_core/      Foundation layer (app, plugins, registry)
├── cpm_cli/       CLI routing and command resolution
├── cpm_builtin/   Built-in features (chunking, embeddings, packages)
└── cpm_plugins/   Official plugins (MCP, etc.)

See Architecture Docs for detailed component documentation.

CPM is built for extensibility. Create custom commands without touching core code.
1. Create plugin directory:
mkdir -p .cpm/plugins/my-plugin
cd .cpm/plugins/my-plugin

2. Create plugin.toml:
[plugin]
id = "my-plugin"
name = "My Custom Plugin"
version = "1.0.0"
entrypoint = "entrypoint:register_plugin"

3. Create entrypoint.py:
from cpm_core.api import CPMAbstractCommand, cpmcommand
from cpm_core.plugin import PluginContext


@cpmcommand(name="hello", group="my-plugin")
class HelloCommand(CPMAbstractCommand):
    """Say hello to the user."""

    def configure(self, parser):
        # Add this command's options to the parser.
        parser.add_argument("--name", default="World")

    def run(self, args):
        print(f"Hello, {args.name}!")
        return 0


# Entry point named in plugin.toml (entrypoint:register_plugin).
def register_plugin(ctx: PluginContext):
    ctx.logger.info("My plugin loaded!")

4. Use your plugin:
cpm my-plugin:hello --name CPM
# Output: Hello, CPM!

CPM automatically detects and selects the optimal chunking strategy for your content. If the default doesn't fit, implement your own builder and pass --builder your-builder during build.
| Chunker | Languages | Key Feature |
|---|---|---|
| python_ast | Python | Preserves function/class boundaries |
| java | Java | Maintains method scope |
| treesitter_generic | JS, TS, Go, Rust, C/C++, and 35+ more | Syntax tree parsing |
| markdown | Markdown, reStructuredText | Header hierarchy |
| text | Plain text | Token-budget with overlap |
| brace_fallback | C-style languages | Brace-based sectioning |
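As an illustration of what "preserves function/class boundaries" can mean for Python sources, the sketch below uses the standard-library `ast` module to emit one chunk per top-level function or class. It is not the `python_ast` chunker itself; `chunk_python` and the dictionary keys are hypothetical, and code outside definitions (imports, module-level statements) is simply skipped here.

```python
import ast


def chunk_python(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "kind": type(node).__name__,
                    "start_line": node.lineno,    # line of the def/class keyword
                    "end_line": node.end_lineno,  # last line of the body
                    # Exact source text of the node (decorator lines are not included).
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks


example = (
    "import os\n"
    "\n"
    "def greet(name):\n"
    "    return f'hi {name}'\n"
    "\n"
    "class Box:\n"
    "    def get(self):\n"
    "        return 1\n"
)
for chunk in chunk_python(example):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Cutting at definition boundaries keeps each chunk syntactically complete, which a fixed token window cannot guarantee.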
Builders: CPM intelligently selects builders based on project structure. Need custom logic?
cpm build --source ./docs --builder my-custom-builder

Retrievers: Auto-detected from project configuration, or explicitly specified:

cpm query --packet my-docs --query "search" --retriever my-custom-retriever

Hierarchical Chunking: Built-in support for multi-level chunking:

config = ChunkingConfig(
    hierarchical=True,
    chunk_tokens=800,          # Parent chunk size
    micro_chunk_tokens=220,    # Child chunk size
    emit_parent_chunks=False,  # Only index children
)

CPM includes a built-in Model Context Protocol plugin for seamless Claude Desktop integration.
{
  "name": "lookup",
  "description": "List available context packets",
  "inputSchema": {
    "type": "object",
    "properties": {
      "cpm_dir": {
        "type": "string",
        "optional": true
      }
    }
  }
}

{
  "name": "query",
  "description": "Search context packets for relevant information",
  "inputSchema": {
    "type": "object",
    "properties": {
      "packet": {
        "type": "string",
        "required": true
      },
      "query": {
        "type": "string",
        "required": true
      },
      "k": {
        "type": "number",
        "default": 5
      }
    }
  }
}

// Claude Desktop config
{
  "mcpServers": {
    "cpm": {
      "command": "cpm",
      "args": ["mcp:serve"],
      "env": {
        "RAG_CPM_DIR": "/path/to/.cpm",
        "RAG_EMBED_URL": "http://127.0.0.1:8876"
      }
    }
  }
}

Conversation with Claude:
User: Search my python-stdlib packet for file I/O examples
Claude: [Calls query tool]
Here are the most relevant sections from python-stdlib:
1. **File Operations (score: 0.92)**
"The `open()` function is the primary way to work with files..."
2. **Context Managers (score: 0.89)**
"Using `with open()` ensures proper file closure..."
| Command | Description |
|---|---|
| `cpm init` | Initialize CPM workspace |
| `cpm doctor` | Validate workspace and diagnose issues |
| `cpm build` | Build a context packet from source |
| `cpm pkg list` | List installed packets |
| `cpm pkg use <pkg@version>` | Pin a packet version |
| `cpm pkg prune <pkg>` | Remove old packet versions |
| `cpm plugin:list` | List loaded plugins |
| `cpm plugin:doctor` | Diagnose plugin issues |
| `cpm mcp:serve` | Start MCP server for Claude |
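To illustrate what pruning to the latest N semantic versions involves, here is a small sketch of the selection step. It is not the logic behind `cpm pkg prune`: `versions_to_prune` is a hypothetical helper, only plain MAJOR.MINOR.PATCH strings are handled, and keeping the pinned version regardless of age is an assumption.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into integers so that 1.10.0 sorts above 1.9.1."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def versions_to_prune(installed, keep, pinned=None):
    """Return the versions that fall outside the newest `keep` releases."""
    ordered = sorted(installed, key=parse_version, reverse=True)
    keep_set = set(ordered[:keep])
    if pinned is not None:
        keep_set.add(pinned)  # assumption: a pinned version is never removed
    return [version for version in ordered if version not in keep_set]


installed = ["1.2.0", "1.10.0", "1.9.1", "0.9.0"]
print(versions_to_prune(installed, keep=2))
print(versions_to_prune(installed, keep=2, pinned="0.9.0"))
```

Comparing parsed integer tuples rather than raw strings matters here: a plain string sort would rank `1.9.1` above `1.10.0`.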
| Variable | Purpose | Default |
|---|---|---|
| `RAG_CPM_DIR` | Workspace root directory | `.cpm` |
| `RAG_EMBED_URL` | Embedding server URL | `http://127.0.0.1:8876` |
| `CPM_CONFIG` | Main config file path | `.cpm/config/cpm.toml` |
| `CPM_EMBEDDINGS` | Embeddings config path | `.cpm/config/embeddings.yml` |
.cpm/
├── packages/            # Installed context packets
│   └── <name>/
│       └── <version>/
├── config/              # Configuration files
│   ├── cpm.toml         # Main configuration
│   └── embeddings.yml   # Embedding providers
├── plugins/             # Workspace plugins
├── cache/               # Query result caches
├── state/               # Runtime state (pins, active versions)
├── logs/                # Application logs
└── pins/                # Version pin files
CPM includes comprehensive documentation for every component:
- cpm_core - Foundation layer architecture
- cpm_core/api - Extension interfaces
- cpm_core/plugin - Plugin system deep dive
- cpm_core/registry - Feature registry
- cpm_core/build - Build system internals
- cpm_core/packet - Packet data structures
- cpm_builtin/chunking - Chunking strategies
- cpm_builtin/embeddings - Embedding management
- cpm_builtin/packages - Package management
- cpm_plugins/mcp - MCP plugin for Claude Desktop
- DOCUMENTATION.md - Complete documentation index
- Python 3.11+
- Virtual environment recommended
# Clone and install
git clone https://github.com/AEndrix03/component-rag.git
cd component-rag
python -m venv .venv
source .venv/bin/activate
# Install with dev dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install

# Run all tests
pytest
# Run with coverage
pytest --cov=cpm_core --cov=cpm_builtin --cov=cpm_cli
# Run specific test file
pytest tests/test_core.py
# Run with verbose output
pytest -v

# Format code
black .
# Lint
ruff check .
# Type check
mypy .

Contributions are welcome! Please follow these guidelines:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Write tests for new functionality
- Ensure code quality (black, ruff, mypy pass)
- Commit with clear messages (`git commit -m 'Add amazing feature'`)
- Push to your fork (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 style guide
- Use type hints for all functions
- Write docstrings for public APIs
- Add tests for bug fixes and new features
- Update documentation for user-facing changes
- Scanning: ~5,000 files/second
- Chunking: ~2,000 files/second (language-dependent)
- Incremental builds: 90%+ cache hit rate for small edits
- FAISS search: Sub-millisecond on 100k vectors
- Scalability: Tested with 10M+ vector indices
- Memory: ~3 KB of raw storage per vector (768-dim float32 = 768 × 4 bytes)
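The per-vector memory figure can be checked with simple arithmetic: raw storage is dimensions times bytes per value, before any index overhead. The helper names below are hypothetical, and the float16 case is included only because packets store `vectors.f16.bin`.

```python
def vector_bytes(dims: int, bytes_per_value: int) -> int:
    """Raw storage for one vector, excluding any index overhead."""
    return dims * bytes_per_value


def raw_size_mib(n_vectors: int, dims: int, bytes_per_value: int) -> float:
    """Raw storage for n vectors, in MiB."""
    return n_vectors * vector_bytes(dims, bytes_per_value) / (1024 ** 2)


print(vector_bytes(768, 4))           # float32: 3072 bytes per vector
print(vector_bytes(768, 2))           # float16: 1536 bytes per vector
print(raw_size_mib(100_000, 768, 4))  # 292.96875 MiB for 100k float32 vectors
```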
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
- Built with FAISS for efficient vector search
- Uses Sentence Transformers for embeddings
- Tree-sitter integration for multi-language parsing
- FastMCP for Model Context Protocol support
Made with ❤️ for Everyone
CPM supports packaging packets for standard OCI registries (Harbor, GHCR, GitLab, Nexus OCI compatible).
- Packet tag mapping: `name@version -> <registry>/<project>/<name>:<version>`
- Immutable identity: always consume by digest (`@sha256:...`) after resolve
- OCI staging layout includes:
  - `packet.manifest.json`
  - `packet.lock.json` (when present)
  - `payload/` (`cpm.yml`, `manifest.json`, `docs.jsonl`, `vectors.f16.bin`, `faiss/index.faiss`)
Digest form example:
registry.local/project/demo@sha256:<digest>
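The tag mapping and digest form above can be expressed as two small string functions. This is a sketch of the naming convention only, with hypothetical helper names; it assumes `registry` already includes the project segment, as in `--registry registry.local/project`, and it performs no registry calls.

```python
def split_spec(spec: str) -> tuple[str, str]:
    """Split 'name@version' into its two parts."""
    name, sep, version = spec.partition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name@version', got {spec!r}")
    return name, version


def oci_tag_ref(spec: str, registry: str) -> str:
    """name@version -> <registry>/<project>/<name>:<version>"""
    name, version = split_spec(spec)
    return f"{registry.rstrip('/')}/{name}:{version}"


def oci_digest_ref(spec: str, registry: str, digest: str) -> str:
    """Immutable form: <registry>/<project>/<name>@sha256:<digest>"""
    name, _ = split_spec(spec)
    if not digest.startswith("sha256:"):
        raise ValueError("digest must start with 'sha256:'")
    return f"{registry.rstrip('/')}/{name}@{digest}"


print(oci_tag_ref("demo@1.0.0", "registry.local/project"))
```

A tag can be moved to point at different content, while a digest cannot, which is why the digest form is the one to record after resolving.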
Example publish/install/query flow with OCI registries:
# Publish a built packet directory
cpm publish --from-dir ./dist/demo/1.0.0 --registry registry.local/project
# Install from OCI by name@version
cpm install demo@1.0.0 --registry registry.local/project
# Query uses selected model from install lock when available
cpm query --packet demo --query "authentication setup" -k 5

# Publish/install without vectors (chunks + metadata only)
cpm publish --from-dir ./dist/demo/1.0.0 --registry registry.local/project --no-embed
cpm install demo@1.0.0 --registry registry.local/project --no-embed
# Then generate vectors locally with your preferred model/provider
cpm build embed --source ./.cpm/packages/demo/1.0.0 --model intfloat/multilingual-e5-base

For Harbor, use the project/repository form in --registry, for example:
harbor.local/my-project