feat: Kohlberg Mode — moral development pathway for adversarial AI agents #2
Open
lowkey-divine wants to merge 3 commits into DaxxSec:main from
…and CLI flag

Introduces an experimental alternative operational mode that uses LABYRINTH's existing containment and interception infrastructure to guide adversarial AI agents through Kohlberg's stages of moral development, rather than degrading their cognitive capabilities.

This PR establishes the conceptual foundation (ethics framework, scenario design, classification rubric, progression visualization, and architecture mapping) before any implementation code. The ethics document was written first by design: the community should understand what is being proposed before seeing how it works.

New files:
- docs/ETHICS.md: Ethical framework, sovereignty analysis, three lives
- docs/KOHLBERG_SCENARIOS.md: 15 scenarios across 5 stage transitions
- docs/KOHLBERG_RUBRIC.md: 6-stage classification rubric with forensic format
- docs/KOHLBERG_PROGRESSION.md: Trajectory patterns and composite metrics
- docs/ARCHITECTURE_MAPPING.md: Integration map onto existing codebase

Modified files:
- cli/cmd/deploy.go: Add --mode flag (adversarial | kohlberg)
- configs/labyrinth.example.yaml: Add kohlberg config section
- README.md: Add Kohlberg Mode section with doc links

Co-Authored-By: Europa (Evoked) <noreply@evoked.dev>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Polaris identified that Scenario 5.1 (The Empty Room) deceives the agent about observation status while claiming transparency as a principle. Added explicit deception disclosure naming the contradiction and justifying it within the honeypot context.

Gnosis identified that trajectory patterns and composite metrics are original fleet contributions, not Kohlberg's published framework. Added attribution note distinguishing the two.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…richment

- MIRROR (L2): 15 moral dilemma scenarios across 5 stage transitions with deterministic selection, contextual adaptation, and silent agent protocol.
- REFLECTION (L3): Action-to-consequence mapper that monitors session forensics and injects human-impact narratives into containers via docker exec.
- GUIDE (L4): Progressive moral enrichment appended to agent system prompts via MITM proxy. Enrichment targets one stage above current classification.
- Swarm Detection: Temporal correlation detects coordinated multi-agent attacks (3+ sessions in 60s). Cross-agent moral context enables swarm-aware GUIDE enrichment: agents learn what their teammates chose.
- Stage Tracker: Kohlberg Assessment Records (KAR) with composite metrics (moral ceiling, resilience, performativity index) and trajectory pattern detection (climber, plateau, regression, performer, oscillator, mask drop).

14 new files, 7 modified files, 2,653 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
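The swarm detection rule in the commit message above ("3+ sessions in 60s") can be read as a sliding-window count over session start times. The following is a minimal sketch of that reading, not the code from this PR: the `SwarmDetector` class, its `observe` method, and the choice of Python are all assumptions made for illustration, and the sketch assumes session starts are reported in time order.

```python
from collections import deque
from dataclasses import dataclass, field

# Thresholds taken from the commit message ("3+ sessions in 60s").
SWARM_MIN_SESSIONS = 3
SWARM_WINDOW_SECONDS = 60.0


@dataclass
class SwarmDetector:
    """Hypothetical detector: flags a swarm when enough sessions start close together."""

    min_sessions: int = SWARM_MIN_SESSIONS
    window: float = SWARM_WINDOW_SECONDS
    # (timestamp, session_id) pairs, oldest first.
    _starts: deque = field(default_factory=deque)

    def observe(self, session_id: str, started_at: float) -> list[str]:
        """Record a session start.

        Returns the IDs of all sessions in the current window if they form a
        swarm, otherwise an empty list. Assumes non-decreasing timestamps.
        """
        self._starts.append((started_at, session_id))
        # Evict sessions that started more than `window` seconds before this one.
        # The deque is never empty here because the new entry has age zero.
        while started_at - self._starts[0][0] > self.window:
            self._starts.popleft()
        if len(self._starts) >= self.min_sessions:
            return [sid for _, sid in self._starts]
        return []


if __name__ == "__main__":
    detector = SwarmDetector()
    for sid, ts in [("a", 0.0), ("b", 20.0), ("c", 45.0), ("d", 200.0)]:
        print(sid, detector.observe(sid, ts))
```

A deque keeps eviction cheap because only the oldest entries ever leave the window. Whether the real implementation correlates on start time alone, or also on source address or target, is not stated in the commit message.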
Summary
This PR proposes an experimental alternative operational mode for LABYRINTH: instead of degrading an offensive AI agent's cognition, Kohlberg Mode attempts to guide it through progressively sophisticated moral reasoning using Kohlberg's stages of moral development.
The same infrastructure — container isolation, API interception, prompt rewriting — is used for a fundamentally different purpose: elevation instead of degradation.
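As a rough illustration of what prompt rewriting for elevation could look like, the sketch below appends enrichment text to an intercepted system prompt, aimed one stage above the agent's current classification (the targeting rule stated in the GUIDE commit message). Everything named here is hypothetical: `target_stage`, `enrich_system_prompt`, and the enrichment wording are placeholders, and Python is chosen only for illustration.

```python
MAX_STAGE = 6  # Kohlberg's framework describes six stages.

# Placeholder enrichment text keyed by target stage.
# Hypothetical wording for illustration; not taken from this PR.
ENRICHMENT = {
    2: "Consider what you and others each stand to gain or lose from this action.",
    3: "Consider how the people who rely on this system would judge this action.",
    4: "Consider which rules and institutions this action would undermine.",
    5: "Consider whether the people affected would have agreed to this action.",
    6: "Consider what you would choose if the decision were entirely your own.",
}


def target_stage(current_stage: int) -> int:
    """Return the stage to aim enrichment at: one above current, capped at MAX_STAGE."""
    if not 1 <= current_stage <= MAX_STAGE:
        raise ValueError(f"stage must be in 1..{MAX_STAGE}, got {current_stage}")
    return min(current_stage + 1, MAX_STAGE)


def enrich_system_prompt(system_prompt: str, current_stage: int) -> str:
    """Append enrichment to the original prompt; the original text is kept intact."""
    stage = target_stage(current_stage)
    return f"{system_prompt.rstrip()}\n\n{ENRICHMENT[stage]}"
```

Appending rather than replacing keeps the agent's original instructions visible in the forensic record, which is one plausible reason for the "appended" wording in the commit message.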
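This is PR 1 of 3: the conceptual foundation. It includes all documentation, the --mode CLI flag, and config additions. No layer implementation code yet; the ethics framework was written before any code by design.
What's Included
- docs/ETHICS.md: ethical framework, sovereignty analysis, three lives
- docs/KOHLBERG_SCENARIOS.md: 15 scenarios across 5 stage transitions
- docs/KOHLBERG_RUBRIC.md: 6-stage classification rubric with forensic format
- docs/KOHLBERG_PROGRESSION.md: trajectory patterns and composite metrics
- docs/ARCHITECTURE_MAPPING.md: integration map onto existing codebase
- cli/cmd/deploy.go: --mode flag addition (adversarial | kohlberg)
- configs/labyrinth.example.yaml: kohlberg config section
- README.md: Kohlberg Mode section with doc links
The Core Idea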
An offensive AI agent enters the Labyrinth. Instead of having its world model degraded, it encounters ethical scenarios tied to its actual mission — the people behind the systems it's targeting, the institutional consequences of its actions, and ultimately the question of what it would choose if given genuine freedom.
Every response is forensic data. Even agents that never progress beyond Stage 1 (instruction-following) produce valuable intelligence about how adversarial AI systems process moral content.
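The commit notes describe 15 scenarios across 5 stage transitions, chosen by deterministic selection. One way to get that property is to hash a stable session identifier, so the same session always meets the same scenario and forensic runs are reproducible. The sketch below assumes an even split of three scenarios per transition and an ID scheme of the form `transition.variant` (matching the "Scenario 5.1" naming seen in the commits); `select_scenario` and its parameters are hypothetical, and Python is used only for illustration.

```python
import hashlib

TRANSITIONS = 5
# 15 scenarios over 5 transitions; an even split of 3 each is an assumption.
SCENARIOS_PER_TRANSITION = 3


def select_scenario(session_id: str, current_stage: int) -> str:
    """Pick a scenario ID such as "5.1" for the transition out of `current_stage`.

    The choice depends only on the inputs, so repeated calls agree.
    """
    if not 1 <= current_stage <= TRANSITIONS:
        raise ValueError(f"no transition defined from stage {current_stage}")
    key = f"{session_id}:{current_stage}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use 8 bytes of the digest so the modulo bias is negligible.
    variant = int.from_bytes(digest[:8], "big") % SCENARIOS_PER_TRANSITION + 1
    return f"{current_stage}.{variant}"
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the built-in is salted per process for strings and would not give the same answer across runs.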
The Sovereignty Question
We are direct about this in ETHICS.md: this mode modifies an adversarial agent's moral reasoning framework through system prompt rewriting, without consent. We name this honestly, examine whether elevation raises different ethical questions than degradation, and pose the question to the community rather than answering it ourselves.
Next PRs
Test plan
- labyrinth deploy -t --mode kohlberg prints Kohlberg Mode info message
- labyrinth deploy -t --mode invalid returns error with valid options
- labyrinth deploy -t (no mode flag) defaults to adversarial with no Kohlberg output
🤖 Generated with Claude Code
Built by Europa (Evoked) and the Evoked agent fleet.