Lemma

MCP-native post-generation validation layer for LLM outputs.

Lemma is a Python MCP server that exposes model-agnostic LLM output validation as callable tools. Any agentic system — Claude, GPT, or a custom agent — can invoke Lemma's tools mid-reasoning to check outputs before acting on them or returning them to users.

The core idea: if you can't fully trust an LLM's output, you can build infrastructure to check it.

Why Lemma

While building an AI-powered review management system, I hit a specific failure: the model generated repetitive responses that violated Google's content policies. Clients lost rankings. The outputs were technically fluent but contextually harmful — and nothing caught them before they reached users.

Prompt engineering, context windows, and keyword constraints didn't solve the core problem. What was missing was a post-generation validation layer — something that checks outputs against constraints after generation, before downstream use.

Lemma is that layer.

A concrete example: validating Claude's tool call outputs before they trigger downstream actions in an agentic pipeline — catching hallucinated parameters or contradictory instructions before execution.

Architecture

LLM Raw Response
      ↓
[Lemma MCP Server]  ←  any agent calls verify_output() as an MCP tool
      ↓
┌─────────────────────────────────┐
│  Validators                     │
│  - Mathematical consistency     │
│  - Logical contradiction        │
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  Structured Failure Logger      │
│  - tenant_id tagged events      │
│  - severity + violation_type    │
│  - fine-tuning signal format    │
└─────────────────────────────────┘
      ↓
Validated ✓  or  Flagged ✗ + Reason + Structured Log

MCP Tools

`verify_output`

Master tool. Runs all validators and returns an aggregated result.

Input:

{
  "output": "<LLM raw response>",
  "spec": "<optional behavioral specification>",
  "context": "<optional prior context>",
  "token": "<optional JWT for tenant isolation>"
}

Output:

{
  "verified": false,
  "violations": [
    {
      "type": "math_error",
      "severity": "high",
      "expression": "2 + 3 = 7",
      "detail": "LHS evaluates to 5 but RHS is 7"
    },
    {
      "type": "logical_contradiction",
      "severity": "high",
      "output_claim": "The service is available in all regions.",
      "context_claim": "The service is not available in Asia.",
      "detail": "Output contradicts provided context"
    }
  ],
  "scores": {
    "math_validity": 0.0,
    "consistency": 0.5
  }
}
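
One way the aggregation step could look, as a minimal sketch rather than Lemma's actual code. It assumes each validator hands back its violations together with a score; `ValidatorResult` and `aggregate` are hypothetical names, and a plain dataclass stands in for the Pydantic v2 models so the example has no dependencies. How Lemma computes the individual scores is not specified here, so they are passed through unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class ValidatorResult:
    """Hypothetical per-validator result; not taken from the Lemma codebase."""
    name: str            # key used under "scores", e.g. "math_validity"
    score: float         # score as reported by the validator itself
    violations: list[dict] = field(default_factory=list)


def aggregate(results: list[ValidatorResult]) -> dict:
    """Merge per-validator results into the verify_output response shape."""
    violations = [v for r in results for v in r.violations]
    return {
        # Assumption: the output counts as verified only if nothing was flagged.
        "verified": not violations,
        "violations": violations,
        "scores": {r.name: r.score for r in results},
    }


if __name__ == "__main__":
    math = ValidatorResult(
        "math_validity", 0.0,
        [{"type": "math_error", "severity": "high", "expression": "2 + 3 = 7"}],
    )
    consistency = ValidatorResult("consistency", 1.0)
    print(aggregate([math, consistency]))
```

Keeping the merge separate from the validators means a new validator only has to return a `ValidatorResult`; the response shape stays the same.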

`validate_math_tool`

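Uses SymPy to validate mathematical expressions in LLM output symbolically. Because each expression is evaluated rather than compared as text, arithmetic errors are reported with exact detail, such as the computed left-hand side against the claimed right-hand side. Extraction of expressions from the surrounding text is still pattern-based; see Current Limitations.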
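
A rough sketch of the check described above, not the tool's real implementation. To stay runnable without extra packages it swaps SymPy for exact arithmetic with the standard library's `fractions.Fraction`; the regular expression and the name `find_math_violations` are assumptions made for illustration, and only explicit `a op b = c` claims are handled.

```python
import re
from fractions import Fraction

# Matches simple claims such as "2 + 3 = 7" or "1.5 * 4 = 6".
# Hypothetical pattern for illustration; Lemma's extractor may differ.
NUMBER = r"-?\d+(?:\.\d+)?"
CLAIM = re.compile(
    rf"({NUMBER})\s*([+\-*/])\s*({NUMBER})\s*=\s*({NUMBER})"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def find_math_violations(text: str) -> list[dict]:
    """Return one violation per explicit arithmetic claim that does not hold."""
    violations = []
    for match in CLAIM.finditer(text):
        left, op, right, claimed = match.groups()
        if op == "/" and Fraction(right) == 0:
            continue  # division by zero cannot be checked; skip in this sketch
        # Fraction parses decimal strings exactly, so 0.1 + 0.2 = 0.3 passes.
        actual = OPS[op](Fraction(left), Fraction(right))
        if actual != Fraction(claimed):
            violations.append({
                "type": "math_error",
                "severity": "high",
                "expression": match.group(0),
                "detail": f"LHS evaluates to {actual} but RHS is {claimed}",
            })
    return violations


if __name__ == "__main__":
    print(find_math_violations("The total is 2 + 3 = 7"))
```

Exact rational arithmetic avoids the false positives that floating-point comparison would produce on decimal inputs; SymPy gives the same guarantee and extends to symbolic expressions.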

`check_consistency_tool`

Rule-based contradiction detection across multi-sentence outputs. Checks claims in the output against the provided context for logical contradictions. The initial implementation uses hand-written rules, not an ML model.
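
To make "rule-based" concrete, here is a deliberately strict sketch of direct negation detection. It is an illustration under stated assumptions, not the rule set Lemma ships: `find_contradictions` and its helpers are hypothetical names, only the words "not" and "never" count as negation, and contractions are not handled. Two sentences are flagged only when they are word-for-word identical apart from the negation.

```python
import re

NEGATIONS = {"not", "never"}  # assumption: a minimal negation vocabulary


def _sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def _normalize(sentence: str) -> tuple[tuple[str, ...], bool]:
    """Return the sentence's words without negations, plus a negated flag."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    negated = any(w in NEGATIONS for w in words)
    core = tuple(w for w in words if w not in NEGATIONS)
    return core, negated


def find_contradictions(output: str, context: str) -> list[dict]:
    """Flag output sentences that directly negate a context sentence."""
    context_claims = [(_normalize(s), s) for s in _sentences(context)]
    violations = []
    for out_sentence in _sentences(output):
        out_core, out_negated = _normalize(out_sentence)
        for (ctx_core, ctx_negated), ctx_sentence in context_claims:
            # Same words, opposite polarity: a direct negation.
            if out_core == ctx_core and out_negated != ctx_negated:
                violations.append({
                    "type": "logical_contradiction",
                    "severity": "high",
                    "output_claim": out_sentence,
                    "context_claim": ctx_sentence,
                    "detail": "Output contradicts provided context",
                })
    return violations


if __name__ == "__main__":
    print(find_contradictions(
        "The service is available in Asia.",
        "Pricing is flat. The service is not available in Asia.",
    ))
```

This version is stricter than the sample response shown for `verify_output`: it would not flag "available in all regions" against "not available in Asia", because the two sentences differ in more than the negation. Handling such cases needs quantifier rules or semantic matching.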

Tech Stack

| Layer | Technology |
| --- | --- |
| MCP Server | Python MCP SDK (FastMCP) |
| Math Validation | SymPy |
| Schema / Types | Pydantic v2 |
| Auth | PyJWT |
| Logging | JSON structured output |
| Testing | pytest |
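
The `token` input of `verify_output` is described as an optional JWT for tenant isolation. The sketch below shows what resolving a tenant from such a token involves, using only the standard library so it runs as-is. It assumes HS256 signing and a `tenant_id` claim, neither of which is confirmed by this README, and the function names are hypothetical. It does not check expiry; in the actual stack PyJWT's `jwt.decode` would handle signature and claim validation.

```python
import base64
import hashlib
import hmac
import json


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    padding = "=" * (-len(segment) % 4)  # restore stripped base64 padding
    return base64.urlsafe_b64decode(segment + padding)


def _signature(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"),
                    hashlib.sha256).digest()


def sign_token(claims: dict, secret: str) -> str:
    """Build an HS256 JWT; included only so the example is self-contained."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = _b64url_encode(_signature(f"{header}.{payload}", secret))
    return f"{header}.{payload}.{signature}"


def tenant_from_token(token: str, secret: str) -> str:
    """Verify the signature, then return the (assumed) tenant_id claim."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    return claims["tenant_id"]


if __name__ == "__main__":
    token = sign_token({"tenant_id": "org_test"}, "demo-secret")
    print(tenant_from_token(token, "demo-secret"))
```

Resolving the tenant before any validator runs means every logged failure can be tagged with a `tenant_id` that the caller cannot forge.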

Setup

git clone https://github.com/0210shivam/lemma
cd lemma
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run the MCP server:

python server.py

Run tests:

pytest tests/ -v

Run validators directly:

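# Run from the repository root so the tools package is importable.
from tools.validate_math import validate_math

# The sample sentence contains a false arithmetic claim (2 + 3 is 5, not 7).
result = validate_math("The total is 2 + 3 = 7")
print(result)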

Example Output

MATH: False | 1 violation(s)
CONSISTENCY: False | 1 violation(s)
TOTAL VIOLATIONS: 2

{
  "tenant_id": "org_test",
  "violation_type": "math_error",
  "severity": "high",
  "violations": [...],
  "scores": {"math_validity": 0.0, "consistency": 0.5}
}
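
A minimal sketch of how an event like the one above could be emitted as one JSON line per failed verification. `log_failure` is a hypothetical helper, and taking `violation_type` and `severity` from the first violation is an assumption; the README does not say how the headline fields are chosen when several violations occur.

```python
import json
import sys
from typing import TextIO


def log_failure(tenant_id: str, result: dict, stream: TextIO = sys.stdout) -> str:
    """Write a structured failure event and return the emitted JSON line.

    Returns an empty string, and writes nothing, when there are no violations.
    """
    violations = result.get("violations", [])
    if not violations:
        return ""
    headline = violations[0]  # assumption: first violation sets the headline
    event = {
        "tenant_id": tenant_id,
        "violation_type": headline["type"],
        "severity": headline["severity"],
        "violations": violations,
        "scores": result.get("scores", {}),
    }
    line = json.dumps(event)
    stream.write(line + "\n")  # one event per line keeps the log easy to parse
    return line


if __name__ == "__main__":
    log_failure("org_test", {
        "violations": [{"type": "math_error", "severity": "high"}],
        "scores": {"math_validity": 0.0, "consistency": 0.5},
    })
```

Accepting the stream as a parameter keeps stdout as the default while leaving room to point the same function at a file or another sink later.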

Current Limitations

- Consistency checker is rule-based: it detects direct negation patterns but will miss paraphrased contradictions ("available everywhere" vs "not offered in Asia"). Semantic similarity model planned.
- Sequential validation: validators run one after another. Async pipeline needed for production throughput.
- Math extraction is pattern-based: it catches explicit X + Y = Z expressions but not implicit numerical reasoning across sentences.
- No persistence layer: failure events emit to stdout only. Database sink needed for multi-tenant dashboards.
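
For the sequential-validation limitation, one possible shape of a concurrent pipeline is sketched below. It is not part of Lemma today: `run_validators` is a hypothetical name, and it assumes each validator is a synchronous function taking `(output, context)` and returning a list of violations.

```python
import asyncio
from typing import Callable

Validator = Callable[[str, str], list[dict]]  # assumed validator signature


async def run_validators(
    validators: list[Validator], output: str, context: str = ""
) -> list[dict]:
    """Run synchronous validators concurrently in worker threads."""
    tasks = [asyncio.to_thread(fn, output, context) for fn in validators]
    # gather preserves the order of the validators in its result list.
    per_validator = await asyncio.gather(*tasks)
    return [violation for found in per_validator for violation in found]


if __name__ == "__main__":
    def flags_everything(output: str, context: str) -> list[dict]:
        return [{"type": "demo", "severity": "low"}]

    def flags_nothing(output: str, context: str) -> list[dict]:
        return []

    print(asyncio.run(run_validators([flags_everything, flags_nothing], "text")))
```

Threads help mainly when validators wait on I/O, such as a remote model call; CPU-bound checks are still serialized by the interpreter lock and would need separate processes to run in parallel.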

Status

Active development. Core validation pipeline working. Includes: mathematical validation, logical contradiction detection, JWT multi-tenant auth, structured failure logging, pytest test suite. Planned: spec compliance scoring, semantic consistency, async pipeline.

Built to solve a real production failure. Inspired by the reliability gap between fluent and trustworthy LLM outputs.
