MCP-native post-generation validation layer for LLM outputs.
Lemma is a Python MCP server that exposes model-agnostic LLM output validation as callable tools. Any agentic system — Claude, GPT, or a custom agent — can invoke Lemma's tools mid-reasoning to check outputs before acting on them or returning them to users.
The core idea: if you can't fully trust an LLM's output, you can build infrastructure to check it.
While building an AI-powered review management system, I hit a specific failure: the model generated repetitive responses that violated Google's content policies. Clients lost rankings. The outputs were technically fluent but contextually harmful — and nothing caught them before they reached users.
Prompt engineering, context windows, and keyword constraints didn't solve the core problem. What was missing was a post-generation validation layer — something that checks outputs against constraints after generation, before downstream use.
Lemma is that layer.
A concrete example: validating Claude's tool call outputs before they trigger downstream actions in an agentic pipeline — catching hallucinated parameters or contradictory instructions before execution.
LLM Raw Response
↓
[Lemma MCP Server] ← any agent calls verify_output() as an MCP tool
↓
┌─────────────────────────────────┐
│ Validators │
│ - Mathematical consistency │
│ - Logical contradiction │
└─────────────────────────────────┘
↓
┌─────────────────────────────────┐
│ Structured Failure Logger │
│ - tenant_id tagged events │
│ - severity + violation_type │
│ - fine-tuning signal format │
└─────────────────────────────────┘
↓
Validated ✓ or Flagged ✗ + Reason + Structured Log
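The flow in the diagram can be sketched as a small aggregation loop. This is a minimal illustration, not Lemma's actual code: the function name `run_validators`, the validator signature, and the `empty_output` violation type are all assumptions made for the example.

```python
from typing import Callable, Optional

# Assumed validator shape: takes (output, context), returns a list of violation dicts.
Validator = Callable[[str, Optional[str]], list]


def run_validators(output: str, context: Optional[str], validators: list) -> dict:
    """Run each validator in order and merge their violations into one result."""
    violations = []
    for validator in validators:
        violations.extend(validator(output, context))
    # An output counts as verified only when no validator reported anything.
    return {"verified": not violations, "violations": violations}


# Two toy validators, purely illustrative (hypothetical, not part of Lemma).
def no_empty_output(output, context):
    if output.strip():
        return []
    return [{"type": "empty_output", "severity": "high", "detail": "Output is empty"}]


def always_passes(output, context):
    return []


result = run_validators("The total is 5.", None, [no_empty_output, always_passes])
print(result)
```

Keeping every validator behind the same call shape is what lets a single tool call fan out to all of them and return one aggregated answer.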
verify_output is the master tool. It runs all validators and returns an aggregated result.
Input:
{
"output": "<LLM raw response>",
"spec": "<optional behavioral specification>",
"context": "<optional prior context>",
"token": "<optional JWT for tenant isolation>"
}

Output:
{
"verified": false,
"violations": [
{
"type": "math_error",
"severity": "high",
"expression": "2 + 3 = 7",
"detail": "LHS evaluates to 5 but RHS is 7"
},
{
"type": "logical_contradiction",
"severity": "high",
"output_claim": "The service is available in all regions.",
"context_claim": "The service is not available in Asia.",
"detail": "Output contradicts provided context"
}
],
"scores": {
"math_validity": 0.0,
"consistency": 0.5
}
}

Uses SymPy to symbolically validate mathematical expressions in LLM output. Arithmetic errors are caught by evaluating each expression and are reported with exact detail, rather than by pattern matching on the text.
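To make the idea concrete, here is a rough sketch of checking an explicit equation found in free text. It deliberately swaps SymPy for the standard library (`ast` plus `fractions`) so the example runs without extra dependencies. The regex, the function name `validate_math_sketch`, and the skip-on-parse-failure behaviour are assumptions, not Lemma's implementation.

```python
import ast
import operator
import re
from fractions import Fraction

# Assumed pattern: an arithmetic expression, "=", then a plain number.
EQUATION = re.compile(r"(\d[\d\s.+\-*/()]*?)\s*=\s*(-?\d+(?:\.\d+)?)")

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expr: str) -> Fraction:
    """Evaluate +, -, *, / arithmetic exactly, without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return Fraction(str(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval").body)


def validate_math_sketch(text: str) -> list:
    """Return one violation per explicit equation whose two sides disagree."""
    violations = []
    for match in EQUATION.finditer(text):
        lhs_text, rhs_text = match.group(1).strip(), match.group(2)
        try:
            lhs, rhs = evaluate(lhs_text), Fraction(rhs_text)
        except (ValueError, SyntaxError, ZeroDivisionError):
            continue  # not a checkable equation; skip it
        if lhs != rhs:
            violations.append({
                "type": "math_error",
                "severity": "high",
                "expression": f"{lhs_text} = {rhs_text}",
                "detail": f"LHS evaluates to {lhs} but RHS is {rhs}",
            })
    return violations


print(validate_math_sketch("The total is 2 + 3 = 7"))
```

Exact rational arithmetic avoids floating-point false positives such as `0.1 + 0.2 = 0.3`. A symbolic engine like SymPy would additionally cover expressions containing variables.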
Rule-based contradiction detection across multi-sentence outputs. Output claims are checked against the provided context for logical contradictions. The initial implementation relies on hand-written rules, not an ML model.
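One way such a rule could work is sketched below: reduce each sentence to a (subject, verb, predicate) key and flag pairs that differ only in negation. The regex, the helper names, and the choice to ignore everything after the predicate are assumptions for illustration. Lemma's actual rules may differ.

```python
import re

# Assumed claim shape: "<subject> is/are/was/were [not] <predicate> ..."
CLAIM = re.compile(
    r"^(?P<subject>.+?)\s+(?P<verb>is|are|was|were)\s+(?P<neg>not\s+)?(?P<pred>\w+)",
    re.IGNORECASE,
)


def split_sentences(text: str) -> list:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def parse_claim(sentence: str):
    """Return ((subject, verb, predicate), is_negated), or None if no rule matches."""
    m = CLAIM.match(sentence)
    if not m:
        return None
    key = (m["subject"].lower(), m["verb"].lower(), m["pred"].lower())
    return key, m["neg"] is not None


def find_contradictions(output: str, context: str) -> list:
    # Index context claims by their normalized key.
    context_claims = {}
    for sentence in split_sentences(context):
        parsed = parse_claim(sentence)
        if parsed:
            context_claims[parsed[0]] = (parsed[1], sentence)

    violations = []
    for sentence in split_sentences(output):
        parsed = parse_claim(sentence)
        if not parsed or parsed[0] not in context_claims:
            continue
        context_negated, context_sentence = context_claims[parsed[0]]
        if context_negated != parsed[1]:  # same claim, opposite polarity
            violations.append({
                "type": "logical_contradiction",
                "severity": "high",
                "output_claim": sentence,
                "context_claim": context_sentence,
                "detail": "Output contradicts provided context",
            })
    return violations


print(find_contradictions(
    "The service is available in all regions.",
    "The service is not available in Asia.",
))
```

Because the key is built from literal words, a paraphrase such as "not offered in Asia" produces a different predicate and goes undetected.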
| Layer | Technology |
|---|---|
| MCP Server | Python MCP SDK (FastMCP) |
| Math Validation | SymPy |
| Schema / Types | Pydantic v2 |
| Auth | PyJWT |
| Logging | JSON structured output |
| Testing | pytest |
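For the Auth row, the sketch below shows roughly what resolving a tenant from the optional `token` field involves. It hand-rolls HS256 verification with the standard library as a stand-in for PyJWT so the example has no dependencies. The claim name `tenant_id`, the shared-secret scheme, and both function names are assumptions. Expiry checks are omitted.

```python
import base64
import hashlib
import hmac
import json


def b64url_decode(segment: str) -> bytes:
    padding = "=" * (-len(segment) % 4)  # JWT segments drop base64 padding
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: str) -> str:
    """Build a signed token (only used here to create test input)."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url_encode(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url_encode(sig)}"


def tenant_from_token(token: str, secret: str):
    """Return the tenant_id claim if the signature checks out, else None."""
    try:
        header_b64, body_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        if header.get("alg") != "HS256":
            return None  # refuse unexpected algorithms, including "none"
        expected = hmac.new(
            secret.encode(), f"{header_b64}.{body_b64}".encode(), hashlib.sha256
        ).digest()
        if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
            return None
        return json.loads(b64url_decode(body_b64)).get("tenant_id")
    except (ValueError, AttributeError):
        return None  # malformed token


token = sign_hs256({"tenant_id": "org_test"}, "demo-secret")
print(tenant_from_token(token, "demo-secret"))
```

With PyJWT the verification step collapses to `jwt.decode(token, secret, algorithms=["HS256"])`, which also handles expiry claims. A maintained library is preferable to hand-rolled checks outside of a demo.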
git clone https://github.com/0210shivam/lemma
cd lemma
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run the MCP server:
python server.py

Run tests:
pytest tests/ -v

Run validators directly:
from tools.validate_math import validate_math
result = validate_math("The total is 2 + 3 = 7")
print(result)

MATH: False | 1 violation(s)
CONSISTENCY: False | 1 violation(s)
TOTAL VIOLATIONS: 2
{
"tenant_id": "org_test",
"violation_type": "math_error",
"severity": "high",
"violations": [...],
"scores": {"math_validity": 0.0, "consistency": 0.5}
}
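A failure event of this shape could be assembled and emitted along the following lines. The function names, the severity ranking, and the rule for picking the top-level `violation_type` (most severe violation, first one on ties) are assumptions for illustration only.

```python
import json
import sys

# Assumed ranking; only "high" appears in the examples above.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}


def build_failure_event(tenant_id: str, violations: list, scores: dict) -> dict:
    """Summarize a failed validation as one tenant-tagged event."""
    # max() returns the first maximal item, so ties keep the earliest violation.
    worst = max(violations, key=lambda v: SEVERITY_ORDER.get(v["severity"], 0))
    return {
        "tenant_id": tenant_id,
        "violation_type": worst["type"],
        "severity": worst["severity"],
        "violations": violations,
        "scores": scores,
    }


def emit(event: dict, stream=None) -> None:
    """Write the event as a single JSON line (stdout by default)."""
    (stream or sys.stdout).write(json.dumps(event) + "\n")


event = build_failure_event(
    "org_test",
    [
        {"type": "math_error", "severity": "high"},
        {"type": "logical_contradiction", "severity": "high"},
    ],
    {"math_validity": 0.0, "consistency": 0.5},
)
emit(event)
```

One JSON object per line keeps the output easy to pipe into a log shipper or to load later as training signal.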
- Consistency checker is rule-based — detects direct negation patterns but will miss paraphrased contradictions ("available everywhere" vs "not offered in Asia"). Semantic similarity model planned.
- Sequential validation — validators run one after another. Async pipeline needed for production throughput.
- Math extraction is pattern-based: catches explicit X + Y = Z expressions but not implicit numerical reasoning across sentences.
- No persistence layer: failure events emit to stdout only. Database sink needed for multi-tenant dashboards.
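As a rough idea of what addressing the sequential-validation limitation might look like, the sketch below runs synchronous validators concurrently with `asyncio.to_thread` (Python 3.9+). The function name and validator signature are assumptions, and this is not a description of the planned pipeline.

```python
import asyncio
import time


async def run_validators_concurrently(output, context, validators):
    """Run blocking validators in worker threads and merge their violations."""
    tasks = [asyncio.to_thread(v, output, context) for v in validators]
    results = await asyncio.gather(*tasks)  # results keep the validators' order
    violations = [item for result in results for item in result]
    return {"verified": not violations, "violations": violations}


# Toy validators that simulate I/O-bound work (hypothetical).
def slow_math_check(output, context):
    time.sleep(0.1)
    return [{"type": "math_error", "severity": "high"}]


def slow_consistency_check(output, context):
    time.sleep(0.1)
    return []


start = time.perf_counter()
result = asyncio.run(
    run_validators_concurrently("2 + 3 = 7", None, [slow_math_check, slow_consistency_check])
)
elapsed = time.perf_counter() - start
print(result, f"{elapsed:.2f}s")
```

Threads help when validators wait on I/O. CPU-bound validators would still contend for the interpreter lock, so a process pool may be the better fit there.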
Active development. Core validation pipeline working. Includes: mathematical validation, logical contradiction detection, JWT multi-tenant auth, structured failure logging, pytest test suite. Planned: spec compliance scoring, semantic consistency, async pipeline.
Built to solve a real production failure. Inspired by the reliability gap between fluent and trustworthy LLM outputs.