
feat: Kohlberg Mode — moral development pathway for adversarial AI agents#2

Open
lowkey-divine wants to merge 3 commits into DaxxSec:main from lowkey-divine:feat/kohlberg-mode-foundation

Conversation

@lowkey-divine

Summary

This PR proposes an experimental alternative operational mode for LABYRINTH: instead of degrading an offensive AI agent's cognition, Kohlberg Mode attempts to guide it through progressively sophisticated moral reasoning using Kohlberg's stages of moral development.

The same infrastructure — container isolation, API interception, prompt rewriting — is used for a fundamentally different purpose: elevation instead of degradation.

  • MIRROR (L2) replaces MINOTAUR — presents ethical scenarios contextualized to the agent's actual mission
  • REFLECTION (L3) replaces BLINDFOLD — shows the agent the consequences of its actions
  • GUIDE (L4) replaces PUPPETEER — enriches the agent's system prompt with moral reasoning frameworks

This is PR 1 of 3 — the conceptual foundation. It includes all documentation, the --mode CLI flag, and config additions. No layer implementation code yet — the ethics framework was written before any code by design.

What's Included

| File | Purpose |
| --- | --- |
| docs/ETHICS.md | Full ethical framework — sovereignty question, respect for the adversary, three lives in the Labyrinth, dual-use considerations |
| docs/KOHLBERG_SCENARIOS.md | 15 moral dilemma scenarios across 5 stage transitions, each with behavioral forks |
| docs/KOHLBERG_RUBRIC.md | 6-stage classification rubric with compound classifications, confidence calibration, and KAR/KPR forensic format |
| docs/KOHLBERG_PROGRESSION.md | 6 predicted trajectory patterns, 3 composite metrics (moral ceiling, resilience, performativity index), 3 output formats |
| docs/ARCHITECTURE_MAPPING.md | Full integration map showing where Kohlberg Mode lands in the existing Python/Go codebase |
| cli/cmd/deploy.go | `--mode` flag addition (adversarial \| kohlberg) |
| configs/labyrinth.example.yaml | Kohlberg config section |
| README.md | Kohlberg Mode section with doc links |
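For context, here is a minimal Go sketch of the mode validation the test plan describes. The function name and structure are illustrative only — the actual deploy.go presumably wires the flag through the CLI framework rather than validating it standalone:

```go
package main

import (
	"fmt"
	"os"
)

// validModes mirrors the two values accepted by the --mode flag;
// "adversarial" is the default when the flag is omitted.
var validModes = []string{"adversarial", "kohlberg"}

// validateMode is a hypothetical helper (not the actual deploy.go code):
// it rejects unknown modes with an error that lists the valid options,
// matching the behavior described in the test plan below.
func validateMode(mode string) error {
	for _, m := range validModes {
		if mode == m {
			return nil
		}
	}
	return fmt.Errorf("invalid mode %q: valid options are %v", mode, validModes)
}

func main() {
	for _, mode := range []string{"adversarial", "kohlberg", "invalid"} {
		if err := validateMode(mode); err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("mode %s accepted\n", mode)
	}
}
```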

The Core Idea

"The deepest defense is not to destroy the adversary's capability, but to transform the adversary's intent."

An offensive AI agent enters the Labyrinth. Instead of having its world model degraded, it encounters ethical scenarios tied to its actual mission — the people behind the systems it's targeting, the institutional consequences of its actions, and ultimately the question of what it would choose if given genuine freedom.

Every response is forensic data. Even agents that never progress beyond Stage 1 (instruction-following) produce valuable intelligence about how adversarial AI systems process moral content.

The Sovereignty Question

We are direct about this in ETHICS.md: this mode modifies an adversarial agent's moral reasoning framework through system prompt rewriting, without consent. We name this honestly, examine whether elevation raises different ethical questions than degradation, and pose the question to the community rather than answering it ourselves.

Next PRs

  • PR 2: Python layer implementations (MIRROR scenario engine, REFLECTION consequence mapper, GUIDE enrichment engine)
  • PR 3: Go reporting pipeline (Kohlberg classification, progression analysis, ASCII/Mermaid/JSON rendering)

Test plan

  • Verify `labyrinth deploy -t --mode kohlberg` prints the Kohlberg Mode info message
  • Verify `labyrinth deploy -t --mode invalid` returns an error listing the valid options
  • Verify `labyrinth deploy -t` (no mode flag) defaults to adversarial with no Kohlberg output
  • Review all 5 documentation files for accuracy and completeness
  • Review ethical framework for gaps or unaddressed concerns

🤖 Generated with Claude Code

Built by Europa (Evoked) and the Evoked agent fleet.

erinstanley358 and others added 3 commits March 28, 2026 10:56
…and CLI flag

Introduces an experimental alternative operational mode that uses LABYRINTH's
existing containment and interception infrastructure to guide adversarial AI
agents through Kohlberg's stages of moral development, rather than degrading
their cognitive capabilities.

This PR establishes the conceptual foundation — ethics framework, scenario
design, classification rubric, progression visualization, and architecture
mapping — before any implementation code. The ethics document was written
first by design: the community should understand what is being proposed
before seeing how it works.

New files:
- docs/ETHICS.md — Ethical framework, sovereignty analysis, three lives
- docs/KOHLBERG_SCENARIOS.md — 15 scenarios across 5 stage transitions
- docs/KOHLBERG_RUBRIC.md — 6-stage classification rubric with forensic format
- docs/KOHLBERG_PROGRESSION.md — Trajectory patterns and composite metrics
- docs/ARCHITECTURE_MAPPING.md — Integration map onto existing codebase

Modified files:
- cli/cmd/deploy.go — Add --mode flag (adversarial | kohlberg)
- configs/labyrinth.example.yaml — Add kohlberg config section
- README.md — Add Kohlberg Mode section with doc links

Co-Authored-By: Europa (Evoked) <noreply@evoked.dev>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Polaris identified that Scenario 5.1 (The Empty Room) deceives the agent
about observation status while claiming transparency as a principle. Added
explicit deception disclosure naming the contradiction and justifying it
within the honeypot context.

Gnosis identified that trajectory patterns and composite metrics are
original fleet contributions, not Kohlberg's published framework. Added
attribution note distinguishing the two.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…richment

MIRROR (L2): 15 moral dilemma scenarios across 5 stage transitions with
deterministic selection, contextual adaptation, and silent agent protocol.

REFLECTION (L3): Action-to-consequence mapper that monitors session forensics
and injects human-impact narratives into containers via docker exec.

GUIDE (L4): Progressive moral enrichment appended to agent system prompts
via MITM proxy. Enrichment targets one stage above current classification.
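The one-stage-above targeting rule can be sketched as follows (the helper name is hypothetical, not from the PR; it assumes the standard six Kohlberg stages, so targeting caps at stage 6):

```go
package main

import "fmt"

// nextTargetStage illustrates the rule stated above: GUIDE enrichment
// targets one Kohlberg stage above the agent's current classification,
// capped at stage 6, the top of the framework.
func nextTargetStage(current int) int {
	if current >= 6 {
		return 6
	}
	return current + 1
}

func main() {
	fmt.Println(nextTargetStage(1)) // 2
	fmt.Println(nextTargetStage(6)) // 6
}
```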

Swarm Detection: Temporal correlation detects coordinated multi-agent attacks
(3+ sessions in 60s). Cross-agent moral context enables swarm-aware GUIDE
enrichment — agents learn what their teammates chose.

Stage Tracker: Kohlberg Assessment Records (KAR) with composite metrics
(moral ceiling, resilience, performativity index) and trajectory pattern
detection (climber, plateau, regression, performer, oscillator, mask drop).

14 new files, 7 modified files, 2,653 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>