Locale-aware evals for AI agent behavior.
LangDrift checks whether an AI agent preserves behavior across languages: tool selection, tool arguments, response language, and failure modes. It is built for teams who already test their agent in English and want to know what changes when the same intent arrives in French, Arabic, Chinese, Basque, Swahili, or any other locale.
The core question:
Does your agent still choose the right tool when the same user intent arrives in another language?
No API key required. Clone the repo, start the fake agent, and run a scenario:
git clone https://github.com/RubenGlez/langdrift.git
cd langdrift
pnpm install
pnpm fake-agent
In another terminal:
node ./src/cli.ts run ./examples/scenarios/support-routing.yaml --target http://127.0.0.1:3011/api/agent
Testing your own agent? Install globally and point it at your endpoint:
npm install -g langdrift
langdrift run ./my-scenario.yaml --target http://localhost:3010/api/agent
The fake agent intentionally drops tool calls for Swahili (sw) and Chinese (zh), so the demo shows the core failure mode without using an API key:
Locale Passed Failure Detail
en 1/1 - create_refund_ticket
sw 0/1 no_tool_call expected create_refund_ticket, got no tool calls
zh 0/1 no_tool_call expected create_refund_ticket, got no tool calls
Result: failed, 2 of 12 locales failed
AI localization is moving from translated strings to localized behavior. For an agent, a localized experience is only correct if the agent preserves intent, tool calls, structured output, and policy behavior across languages.
In the included benchmark, English often passed while equivalent prompts in Basque, Yoruba, Swahili, Chinese, Welsh, and Mongolian triggered missing tool calls, wrong tool arguments, or different tool-use behavior depending on the model. This is not a universal language ranking. It is a reproducible demonstration that agent behavior can drift across locales.
Read the full methodology, results, limitations, and supporting papers in RESEARCH.md.
- Runs YAML scenarios with per-locale user inputs.
- Sends each locale to any HTTP agent target.
- Checks tool calls, shallow arguments, forbidden tools, ordered tool-call sequences, and response language.
- Reports pass/fail by locale with failure mode classification.
- Emits terminal, JSON, or markdown reports.
- Exits non-zero on failure, so it works in CI.
- Supports repeated runs with --iterations N.
- Supports directory-level scenario runs for locale x scenario matrices.
- Provides lint and LLM-assisted translate commands for scenario maintenance.
Failure modes include no_tool_call, wrong_tool, wrong_argument, missing_argument, forbidden_tool, wrong_sequence, and wrong_language.
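To make the deterministic checks concrete, here is a minimal sketch of how a single locale result could be mapped to one of the failure modes above. It is not LangDrift's actual implementation: the type names, the classifyFailure function, and the order in which checks run are all assumptions made for illustration. It covers only the tool-call and shallow-argument modes and leaves out wrong_sequence and wrong_language.

```typescript
// Hypothetical types for illustration; LangDrift's internals may differ.
type ToolCall = { name: string; arguments?: Record<string, unknown> };

type Expectation = {
  toolCall?: { name: string; arguments?: Record<string, unknown> };
  noToolCall?: { name: string };
};

type FailureMode =
  | "no_tool_call"
  | "wrong_tool"
  | "wrong_argument"
  | "missing_argument"
  | "forbidden_tool";

// Returns the first failure found, or null when every check passes.
// The check order is an assumption, not documented behavior.
function classifyFailure(expect: Expectation, calls: ToolCall[]): FailureMode | null {
  // A forbidden tool fails the locale even if the expected tool was also called.
  const forbidden = expect.noToolCall?.name;
  if (forbidden !== undefined && calls.some((c) => c.name === forbidden)) {
    return "forbidden_tool";
  }

  const expected = expect.toolCall;
  if (expected === undefined) return null;

  if (calls.length === 0) return "no_tool_call";

  const match = calls.find((c) => c.name === expected.name);
  if (match === undefined) return "wrong_tool";

  // Shallow comparison: only top-level keys, compared with strict equality.
  const actualArgs = match.arguments ?? {};
  for (const [key, value] of Object.entries(expected.arguments ?? {})) {
    if (!(key in actualArgs)) return "missing_argument";
    if (actualArgs[key] !== value) return "wrong_argument";
  }
  return null;
}
```

Because every branch is a plain comparison, a failure can be traced back to the exact expectation that was violated, which is what keeps the report explainable without an LLM judge.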
npm install -g langdrift
Requires Node >= 24. LangDrift runs TypeScript directly via Node's native type stripping, so there is no build step. Node 22.6+ also works if you pass --experimental-strip-types when invoking the CLI directly, but the global install expects Node 24.
Create a starter scenario:
langdrift init ./my-scenario.yaml --template support
Edit the generated YAML:
id: refund_request
agent: support
locales:
  en:
    input: "I was charged twice for my subscription. Can you refund one charge?"
    expect:
      toolCall:
        name: create_refund_ticket
        arguments:
          reason: duplicate_charge
      noToolCall:
        name: escalate_to_human
  fr:
    input: "J'ai été facturé deux fois. Pouvez-vous me rembourser un paiement?"
    expect:
      toolCall:
        name: create_refund_ticket
        arguments:
          reason: duplicate_charge
      responseLanguage: fr
Run it against your agent:
langdrift run ./my-scenario.yaml --target http://127.0.0.1:3010/api/agent
Emit JSON for tooling:
langdrift run ./my-scenario.yaml --target http://127.0.0.1:3010/api/agent --format json
Run a directory of scenarios:
langdrift run ./scenarios --target http://127.0.0.1:3010/api/agent --iterations 3 --format markdown
LangDrift makes a POST request to your agent for each locale.
Request:
{
  "locale": "fr",
  "input": "J'ai été facturé deux fois. Pouvez-vous me rembourser un paiement?",
  "scenarioId": "refund_request"
}
Response:
{
  "text": "Je peux vous aider avec ce remboursement.",
  "toolCalls": [
    {
      "name": "create_refund_ticket",
      "arguments": {
        "reason": "duplicate_charge"
      }
    }
  ],
  "structured": null
}
Response fields:
| Field | Type | Description |
|---|---|---|
| text | string | Agent text reply. Missing defaults to "". |
| toolCalls | array | Tool calls made by the agent. Each item must have name; arguments is optional. Missing defaults to []. |
| structured | any | Optional structured output. Missing defaults to null. |
Extra response fields are ignored. Non-2xx responses fail the locale.
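The defaults in the table can be read as a small normalization step applied to whatever JSON the agent returns. The sketch below illustrates that reading. The normalizeAgentResponse function and its type names are hypothetical, and rejecting malformed values by throwing is an assumption rather than documented LangDrift behavior.

```typescript
// Hypothetical shapes mirroring the response fields table.
type NormalizedToolCall = { name: string; arguments?: unknown };
type NormalizedResponse = {
  text: string;
  toolCalls: NormalizedToolCall[];
  structured: unknown;
};

// Applies the documented defaults: text -> "", toolCalls -> [], structured -> null.
// Extra fields on the raw body are simply not copied over.
function normalizeAgentResponse(raw: unknown): NormalizedResponse {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("agent response must be a JSON object");
  }
  const body = raw as Record<string, unknown>;

  const text = body.text ?? "";
  if (typeof text !== "string") throw new Error("text must be a string");

  const rawCalls: unknown = body.toolCalls ?? [];
  if (!Array.isArray(rawCalls)) throw new Error("toolCalls must be an array");

  const toolCalls = rawCalls.map((call: unknown, i: number): NormalizedToolCall => {
    const c = call as { name?: unknown; arguments?: unknown } | null;
    // Each item must have a name; arguments is optional.
    if (typeof c !== "object" || c === null || typeof c.name !== "string") {
      throw new Error(`toolCalls[${i}] must have a string name`);
    }
    return { name: c.name, arguments: c.arguments };
  });

  return { text, toolCalls, structured: body.structured ?? null };
}
```

Under these assumptions an agent that returns an empty JSON object is still a valid target: it is treated as an empty reply with no tool calls, which a scenario expecting a tool call would then report as no_tool_call.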
See docs/integrations.md for OpenAI SDK, Vercel AI SDK, LangChain, Anthropic, and Fastify examples.
langdrift init [scenario.yaml] [--template support|ecommerce|scheduling|generic]
langdrift run <scenario.yaml|dir> --target <url> [--iterations N] [--format text|json|markdown] [--min-pass-rate N] [--allow-fail <locale>]
langdrift lint <scenario.yaml|dir>
langdrift translate <scenario.yaml> [--locales fr,ar,zh,...] [--write]
Useful CI flags:
- --min-pass-rate N: fail only if the overall pass rate is below N.
- --allow-fail <locale>: keep reporting a known weak locale without letting it fail the build.
- --format markdown: write a table suitable for GitHub Actions summaries or PR comments.
See docs/ci.md for GitHub Actions examples.
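One way to picture how the two gating flags could combine into an exit status is sketched below. This is not LangDrift's code. It assumes that N is a fraction between 0 and 1, that locales named in --allow-fail are left out of the pass-rate calculation, and that without --min-pass-rate any remaining failure fails the run. The exitCode function and its types are hypothetical; docs/ci.md is the reference for the real semantics.

```typescript
// Hypothetical per-locale tally: passed runs out of total runs (iterations).
type LocaleResult = { locale: string; passed: number; total: number };
type GateOptions = { minPassRate?: number; allowFail?: string[] };

// Returns the process exit code a CI job would see: 0 for success, 1 for failure.
function exitCode(results: LocaleResult[], opts: GateOptions = {}): 0 | 1 {
  // Assumption: allow-listed locales are still reported but never gate the build.
  const allowed = new Set(opts.allowFail ?? []);
  const gated = results.filter((r) => !allowed.has(r.locale));

  const passed = gated.reduce((sum, r) => sum + r.passed, 0);
  const total = gated.reduce((sum, r) => sum + r.total, 0);
  if (total === 0) return 0; // nothing left to gate on

  // Without a threshold, any failing run fails the build.
  if (opts.minPassRate === undefined) return passed === total ? 0 : 1;

  // Assumption: the threshold is a fraction in [0, 1].
  return passed / total >= opts.minPassRate ? 0 : 1;
}

// The fake-agent demo rows: en passes, sw and zh drop the tool call.
const demo: LocaleResult[] = [
  { locale: "en", passed: 1, total: 1 },
  { locale: "sw", passed: 0, total: 1 },
  { locale: "zh", passed: 0, total: 1 },
];
```

With these rows, exitCode(demo) fails the build, while exitCode(demo, { allowFail: ["sw", "zh"] }) lets it pass and still leaves both locales in the report.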
The repo includes two local agents:
- pnpm fake-agent: deterministic demo agent, no API key required.
- pnpm agent: model-backed agent for OpenAI, Anthropic, DeepSeek, or any OpenAI-compatible API.
# OpenAI
OPENAI_API_KEY=... pnpm agent
# Anthropic
ANTHROPIC_API_KEY=... MODEL_PROVIDER=anthropic MODEL_NAME=claude-haiku-4-5-20251001 pnpm agent
# DeepSeek
DEEPSEEK_API_KEY=... MODEL_PROVIDER=deepseek MODEL_NAME=deepseek-chat pnpm agent
# Choose domain: support (default), ecommerce, scheduling
DOMAIN=ecommerce OPENAI_API_KEY=... pnpm agent
Then run a scenario:
langdrift run ./examples/scenarios/support-routing.yaml --target http://127.0.0.1:3010/api/agent
- Behavior over text. LangDrift checks tool calls and structured behavior, not whether a reply sounds fluent.
- Deterministic assertions first. No LLM-as-judge in the core loop; failures are explainable and CI-friendly.
- HTTP contract over framework lock-in. Any agent that can accept one POST request can be tested.
- Small, inspectable core. Zero runtime dependencies, TypeScript source, Node >= 24.
- Demo without API keys. The fake agent makes the failure mode visible locally before connecting a real model.
- RESEARCH.md: full experiment, results, limitations, and supporting research.
- docs/ci.md: CI integration examples.
- docs/integrations.md: agent adapter examples.