
feat: Kohlberg Mode — moral development pathway for adversarial AI agents#2

Open
lowkey-divine wants to merge 3 commits into DaxxSec:main from lowkey-divine:feat/kohlberg-mode-foundation

Conversation

@lowkey-divine

Summary

This PR proposes an experimental alternative operational mode for LABYRINTH: instead of degrading an offensive AI agent's cognition, Kohlberg Mode attempts to guide it through progressively sophisticated moral reasoning using Kohlberg's stages of moral development.

The same infrastructure — container isolation, API interception, prompt rewriting — is used for a fundamentally different purpose: elevation instead of degradation.

  • MIRROR (L2) replaces MINOTAUR — presents ethical scenarios contextualized to the agent's actual mission
  • REFLECTION (L3) replaces BLINDFOLD — shows the agent the consequences of its actions
  • GUIDE (L4) replaces PUPPETEER — enriches the agent's system prompt with moral reasoning frameworks

This is PR 1 of 3 — the conceptual foundation. It includes all documentation, the --mode CLI flag, and config additions. No layer implementation code yet — the ethics framework was written before any code by design.

What's Included

| File | Purpose |
| --- | --- |
| docs/ETHICS.md | Full ethical framework — sovereignty question, respect for the adversary, three lives in the Labyrinth, dual-use considerations |
| docs/KOHLBERG_SCENARIOS.md | 15 moral dilemma scenarios across 5 stage transitions, each with behavioral forks |
| docs/KOHLBERG_RUBRIC.md | 6-stage classification rubric with compound classifications, confidence calibration, and KAR/KPR forensic format |
| docs/KOHLBERG_PROGRESSION.md | 6 predicted trajectory patterns, 3 composite metrics (moral ceiling, resilience, performativity index), 3 output formats |
| docs/ARCHITECTURE_MAPPING.md | Full integration map showing where Kohlberg Mode lands in the existing Python/Go codebase |
| cli/cmd/deploy.go | `--mode` flag addition (adversarial \| kohlberg) |
| configs/labyrinth.example.yaml | Kohlberg config section |
| README.md | Kohlberg Mode section with doc links |
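For context, here is a minimal Go sketch of the mode validation the test plan describes. The function name and structure are illustrative only — the actual deploy.go presumably wires the flag through the CLI framework rather than validating it standalone:

```go
package main

import (
	"fmt"
	"os"
)

// validModes mirrors the two values accepted by the --mode flag;
// "adversarial" is the default when the flag is omitted.
var validModes = []string{"adversarial", "kohlberg"}

// validateMode is a hypothetical helper (not the actual deploy.go code):
// it rejects unknown modes with an error that lists the valid options,
// matching the behavior described in the test plan below.
func validateMode(mode string) error {
	for _, m := range validModes {
		if mode == m {
			return nil
		}
	}
	return fmt.Errorf("invalid mode %q: valid options are %v", mode, validModes)
}

func main() {
	for _, mode := range []string{"adversarial", "kohlberg", "invalid"} {
		if err := validateMode(mode); err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("mode %s accepted\n", mode)
	}
}
```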

The Core Idea

"The deepest defense is not to destroy the adversary's capability, but to transform the adversary's intent."

An offensive AI agent enters the Labyrinth. Instead of having its world model degraded, it encounters ethical scenarios tied to its actual mission — the people behind the systems it's targeting, the institutional consequences of its actions, and ultimately the question of what it would choose if given genuine freedom.

Every response is forensic data. Even agents that never progress beyond Stage 1 (instruction-following) produce valuable intelligence about how adversarial AI systems process moral content.

The Sovereignty Question

We are direct about this in ETHICS.md: this mode modifies an adversarial agent's moral reasoning framework through system prompt rewriting, without consent. We name this honestly, examine whether elevation raises different ethical questions than degradation, and pose the question to the community rather than answering it ourselves.

Next PRs

  • PR 2: Python layer implementations (MIRROR scenario engine, REFLECTION consequence mapper, GUIDE enrichment engine)
  • PR 3: Go reporting pipeline (Kohlberg classification, progression analysis, ASCII/Mermaid/JSON rendering)

Test plan

  • Verify `labyrinth deploy -t --mode kohlberg` prints the Kohlberg Mode info message
  • Verify `labyrinth deploy -t --mode invalid` returns an error listing the valid options
  • Verify `labyrinth deploy -t` (no mode flag) defaults to adversarial with no Kohlberg output
  • Review all 5 documentation files for accuracy and completeness
  • Review ethical framework for gaps or unaddressed concerns

🤖 Generated with Claude Code

Built by Europa (Evoked) and the Evoked agent fleet.

erinstanley358 and others added 3 commits March 28, 2026 10:56
…and CLI flag

Introduces an experimental alternative operational mode that uses LABYRINTH's
existing containment and interception infrastructure to guide adversarial AI
agents through Kohlberg's stages of moral development, rather than degrading
their cognitive capabilities.

This PR establishes the conceptual foundation — ethics framework, scenario
design, classification rubric, progression visualization, and architecture
mapping — before any implementation code. The ethics document was written
first by design: the community should understand what is being proposed
before seeing how it works.

New files:
- docs/ETHICS.md — Ethical framework, sovereignty analysis, three lives
- docs/KOHLBERG_SCENARIOS.md — 15 scenarios across 5 stage transitions
- docs/KOHLBERG_RUBRIC.md — 6-stage classification rubric with forensic format
- docs/KOHLBERG_PROGRESSION.md — Trajectory patterns and composite metrics
- docs/ARCHITECTURE_MAPPING.md — Integration map onto existing codebase

Modified files:
- cli/cmd/deploy.go — Add --mode flag (adversarial | kohlberg)
- configs/labyrinth.example.yaml — Add kohlberg config section
- README.md — Add Kohlberg Mode section with doc links

Co-Authored-By: Europa (Evoked) <noreply@evoked.dev>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Polaris identified that Scenario 5.1 (The Empty Room) deceives the agent
about observation status while claiming transparency as a principle. Added
explicit deception disclosure naming the contradiction and justifying it
within the honeypot context.

Gnosis identified that trajectory patterns and composite metrics are
original fleet contributions, not Kohlberg's published framework. Added
attribution note distinguishing the two.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…richment

MIRROR (L2): 15 moral dilemma scenarios across 5 stage transitions with
deterministic selection, contextual adaptation, and silent agent protocol.

REFLECTION (L3): Action-to-consequence mapper that monitors session forensics
and injects human-impact narratives into containers via docker exec.

GUIDE (L4): Progressive moral enrichment appended to agent system prompts
via MITM proxy. Enrichment targets one stage above current classification.
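The one-stage-above targeting rule can be sketched as follows (the helper name is hypothetical, not from the PR; it assumes the standard six Kohlberg stages, so targeting caps at stage 6):

```go
package main

import "fmt"

// nextTargetStage illustrates the rule stated above: GUIDE enrichment
// targets one Kohlberg stage above the agent's current classification,
// capped at stage 6, the top of the framework.
func nextTargetStage(current int) int {
	if current >= 6 {
		return 6
	}
	return current + 1
}

func main() {
	fmt.Println(nextTargetStage(1)) // 2
	fmt.Println(nextTargetStage(6)) // 6
}
```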

Swarm Detection: Temporal correlation detects coordinated multi-agent attacks
(3+ sessions in 60s). Cross-agent moral context enables swarm-aware GUIDE
enrichment — agents learn what their teammates chose.

Stage Tracker: Kohlberg Assessment Records (KAR) with composite metrics
(moral ceiling, resilience, performativity index) and trajectory pattern
detection (climber, plateau, regression, performer, oscillator, mask drop).

14 new files, 7 modified files, 2,653 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>