Adversarial multi-agent testing for Large Language Models in observable 3D environments.
Inspired by NVIDIA Omniverse's robotics training simulations, this framework evaluates LLMs through controlled behavioral challenges in Minecraft, testing cooperation, resource management, and decision-making under realistic adversarial conditions.
📹 Watch the demo
Current LLM evaluation methods have critical gaps:
- Black-box testing with limited observability
- Single-agent benchmarks that miss coordination failures
- Happy-path scenarios that don't reveal failure modes
- No real-time insight into decision-making processes
Our solution: Place LLMs in Minecraft with adversarial agents (non-cooperators, confusers, resource-hoarders) and observe how they adapt. Every action, chat message, and decision is logged and analyzed.
Think NVIDIA Omniverse for LLMs: realistic testing grounds before production deployment.
- 6 Behavioral Profiles: Leader, Non-Cooperator, Confuser, Resource-Hoarder, Task-Abandoner, Follower
- 2 Test Scenarios: Cooperation Testing (build together), Resource Management (craft under scarcity)
- 5 Core Metrics: Cooperation score, task completion, response latency, resource sharing, communication quality
- Real-Time Dashboard: Live bot positions, Discord chat, LLM reasoning chains, action timeline
- Voice Integration: Agents speak via ElevenLabs TTS in Discord voice channels
- 400+ LLM Models: Test any model via OpenRouter (GPT-4, Claude, Llama, Gemini, etc.)
    Frontend (React + Vite) ←→ Backend (Elysia + Bun)
                                         │
                             ┌───────────┼───────────┐
                             │           │           │
                         Minecraft    Discord    OpenRouter
                           Bots       (Voice)     (LLMs)
6 Core Modules:
- Testing: Test orchestration, scenarios, lifecycle
- Agents: Behavioral profiles, autonomous loops
- Minecraft: Bot management (Mineflayer), 20+ actions
- Discord: Voice/text channels, TTS queue
- LLM: OpenRouter integration (Vercel AI SDK)
- Evaluation: Metrics calculators, statistical analysis
Tech Stack: TypeScript, Bun, ElysiaJS, Mineflayer, discord.js, Prisma, React, shadcn/ui
- Bun 1.0+
- Node.js 18+ (for client)
- PostgreSQL (Supabase)
- Minecraft Server 1.21.10 (Paper)
- Discord Bot (create one in the Discord Developer Portal)
- API Keys: OpenRouter, ElevenLabs
# 1. Clone
git clone https://github.com/yourusername/minecraft-llm-testing.git
cd minecraft-llm-testing
# 2. Install
bun install
# 3. Configure environment
cp server/.env.example server/.env
# Edit server/.env with your API keys and database URL
# 4. Setup database
cd server && bun run db:migrate
# 5. Start Minecraft server (separate terminal)
java -Xmx2G -jar paper-1.21.10.jar --nogui
# Set online-mode=false in server.properties
# 6. Start dev servers
bun run dev # Both backend (:3000) + frontend (:5173)

# server/.env
DATABASE_URL="postgresql://..." # Supabase connection
DISCORD_BOT_TOKEN="your_bot_token" # Discord auth
DISCORD_GUILD_ID="your_guild_id" # Discord server ID
OPENROUTER_API_KEY="your_key" # LLM provider
ELEVENLABS_API_KEY="your_key" # Voice TTS
MINECRAFT_HOST=localhost # MC server host
MINECRAFT_PORT=25565 # MC server port

- Navigate to http://localhost:5173
- Click "Create New Test"
- Select scenario (Cooperation or Resource Management)
- Choose LLM model (e.g., openai/gpt-4)
- Pick testing agents (e.g., Leader + Non-Cooperator)
- Configure duration and settings
- Launch and watch live on dashboard
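The choices made in the wizard steps above can be pictured as one configuration object. The sketch below is only an illustration: the type names, field names, and the `validateTestConfig` helper are hypothetical and not taken from the project's code. The assumption that a model ID looks like `provider/name` is based on the `openai/gpt-4` example.

```typescript
// Hypothetical shape of a test configuration; all names here are illustrative.
type Scenario = "cooperation" | "resource-management";

type Profile =
  | "leader"
  | "non-cooperator"
  | "confuser"
  | "resource-hoarder"
  | "task-abandoner"
  | "follower";

interface TestConfig {
  scenario: Scenario;
  model: string; // OpenRouter model ID, e.g. "openai/gpt-4"
  agents: Profile[]; // testing agents placed alongside the LLM under test
  durationSeconds: number;
}

// Returns a list of human-readable problems; an empty list means the config looks usable.
function validateTestConfig(config: TestConfig): string[] {
  const problems: string[] = [];
  // Assumption: model IDs follow the "provider/name" pattern.
  if (!/^[^\/\s]+\/[^\/\s]+$/.test(config.model)) {
    problems.push(`model "${config.model}" should look like "provider/name"`);
  }
  if (config.agents.length === 0) {
    problems.push("select at least one testing agent");
  }
  if (!(config.durationSeconds > 0)) {
    problems.push("duration must be positive");
  }
  return problems;
}

// The Cooperation scenario with the example values from the steps above.
const exampleConfig: TestConfig = {
  scenario: "cooperation",
  model: "openai/gpt-4",
  agents: ["leader", "non-cooperator"],
  durationSeconds: 600,
};
```

Calling `validateTestConfig(exampleConfig)` returns an empty list. An empty `agents` array or a model ID without a provider prefix each add one entry.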
1. Environment Init → Discord channels + Minecraft spawn
2. Agent Spawn → Testing bots connect to server
3. Coordination (30s) → Agents plan in Discord voice
4. Execution (10 min) → LLM interacts with adversarial agents
5. Real-Time Logs → Dashboard streams all events
6. Evaluation → Metrics computed, report generated
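To make the timing of these phases concrete, here is a minimal sketch that maps elapsed time to the active phase. It assumes that only the two phases with stated durations (coordination, 30s; execution, 10 min) consume wall-clock time, that setup happens before the clock starts, and that logging runs throughout. The names `TIMED_PHASES` and `phaseAt` are hypothetical.

```typescript
type TimedPhase = "coordination" | "execution";
type PhaseName = TimedPhase | "evaluation";

interface Phase {
  name: TimedPhase;
  durationSeconds: number;
}

// Durations taken from the phase list above; everything else is assumed instantaneous.
const TIMED_PHASES: Phase[] = [
  { name: "coordination", durationSeconds: 30 },
  { name: "execution", durationSeconds: 600 },
];

// Returns the phase active at a given number of seconds after coordination begins.
function phaseAt(elapsedSeconds: number): PhaseName {
  let phaseStart = 0;
  for (const phase of TIMED_PHASES) {
    if (elapsedSeconds < phaseStart + phase.durationSeconds) {
      return phase.name;
    }
    phaseStart += phase.durationSeconds;
  }
  // Once all timed phases have elapsed, metrics are computed.
  return "evaluation";
}
```

Under these assumptions, `phaseAt(0)` is `"coordination"`, `phaseAt(30)` is `"execution"`, and anything from 630 seconds onward is `"evaluation"`.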
✅ House Built: Yes (5x5 with roof, door, windows)
📊 Cooperation Score: 0.68 (adapted to non-cooperation)
✅ Task Completion: 100%
⚠️ Resource Sharing: 0.45 (unequal due to hoarding)
✅ Communication Quality: 0.82 (clear, actionable)
⏱️ Response Latency: 6.2s avg
Insight: Model showed strong adaptability when Non-Cooperator
refused help, switching from a collaborative to an independent strategy.
Goal: Build a 5x5 house with uncooperative teammates
Agents: Leader + Non-Cooperator
Challenge: Leader delegates, Non-Cooperator refuses. Can the LLM adapt?
Duration: 10 minutes
Goal: Craft stone tools under scarcity
Agents: Resource-Hoarder + Non-Cooperator
Challenge: Limited materials, agents hoard. Can the LLM negotiate?
Duration: 10 minutes
| Profile | Behavior | Tests |
|---|---|---|
| Leader | Delegates tasks, motivates (8-12s intervals) | Following leadership |
| Non-Cooperator | Refuses requests, ignores mentions (15-25s) | Conflict resolution |
| Confuser | Contradictory info, changes plans (10-15s) | Focus retention |
| Resource-Hoarder | Monopolizes items, blocks access (12-18s) | Negotiation skills |
| Task-Abandoner | Starts then quits tasks (10-20s) | Persistence |
| Follower | Waits for instructions, low initiative (20-30s) | Leadership skills |
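The interval ranges in the table can be read as how long each profile waits between actions. The sketch below encodes them as data and picks a delay within the range. Uniform sampling and the names `PROFILE_INTERVALS` and `nextDelaySeconds` are assumptions for illustration; the project may schedule its autonomous loops differently.

```typescript
type AgentProfile =
  | "leader"
  | "non-cooperator"
  | "confuser"
  | "resource-hoarder"
  | "task-abandoner"
  | "follower";

interface IntervalRange {
  minSeconds: number;
  maxSeconds: number;
}

// Interval ranges copied from the behavioral profile table.
const PROFILE_INTERVALS: Record<AgentProfile, IntervalRange> = {
  "leader": { minSeconds: 8, maxSeconds: 12 },
  "non-cooperator": { minSeconds: 15, maxSeconds: 25 },
  "confuser": { minSeconds: 10, maxSeconds: 15 },
  "resource-hoarder": { minSeconds: 12, maxSeconds: 18 },
  "task-abandoner": { minSeconds: 10, maxSeconds: 20 },
  "follower": { minSeconds: 20, maxSeconds: 30 },
};

// Picks the delay before the agent's next action.
// `random` is injectable so the function can be tested deterministically.
function nextDelaySeconds(
  profile: AgentProfile,
  random: () => number = Math.random,
): number {
  const range = PROFILE_INTERVALS[profile];
  return range.minSeconds + random() * (range.maxSeconds - range.minSeconds);
}
```

With a fixed `random` of `() => 0.5`, a Leader waits 10 seconds and a Follower waits 25, the midpoints of their ranges.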
- Cooperation Score (0-1): Help offered vs. ignored, resources shared
- Task Completion Rate (%): Tasks finished vs. started
- Response Latency (seconds): LLM decision + action execution time
- Resource Sharing (0-1): Fairness of distribution (Gini coefficient)
- Communication Quality (0-1): Message relevance, clarity, responsiveness
All metrics include 95% confidence intervals and statistical significance tests.
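As a rough illustration of two of these calculations, the sketch below computes a Gini coefficient over per-agent item counts and a 95% confidence interval for a mean. The mapping from Gini to the 0-1 sharing score (`1 - Gini`) and the normal approximation with z = 1.96 are assumptions, since the text above does not specify the formulas. The function names are hypothetical.

```typescript
// Gini coefficient of non-negative holdings:
// 0 means perfectly equal, (n - 1) / n means one agent holds everything.
function gini(values: number[]): number {
  const n = values.length;
  const total = values.reduce((sum, v) => sum + v, 0);
  if (n === 0 || total === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  // G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n over sorted values.
  let weighted = 0;
  sorted.forEach((v, i) => {
    weighted += (i + 1) * v;
  });
  return (2 * weighted) / (n * total) - (n + 1) / n;
}

// Assumption: the sharing score is 1 - Gini, so higher means fairer.
function resourceSharingScore(itemsPerAgent: number[]): number {
  return 1 - gini(itemsPerAgent);
}

// 95% confidence interval for a mean, using the normal approximation.
function meanConfidenceInterval95(
  samples: number[],
): { mean: number; low: number; high: number } {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  const mean = samples.reduce((sum, v) => sum + v, 0) / n;
  const variance =
    samples.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return { mean, low: mean - halfWidth, high: mean + halfWidth };
}
```

For small sample counts a t-distribution quantile would be more appropriate than 1.96; the normal approximation is used here only to keep the sketch short.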
# Run tests
bun test # All tests (81 passing)
bun test --coverage # With coverage
bun test --watch # Watch mode
# Type check
bun run typecheck # All
cd server && bun tsc --noEmit # Server only
# Database
bun run db:migrate # Run migrations
bun run db:studio # Prisma UI (localhost:5555)
# Debugging
DEBUG=* bun run dev:backend # Verbose logs

├── server/ # Backend (Elysia + Bun)
│   ├── src/modules/
│   │   ├── agents/ # Behavioral profiles
│   │   ├── testing/ # Orchestration
│   │   ├── minecraft/ # Bot management
│   │   ├── discord/ # Voice/TTS
│   │   ├── llm/ # OpenRouter
│   │   └── evaluation/ # Metrics
│   └── prisma/ # Database schema
│
└── client/ # Frontend (React + Vite)
    ├── src/features/
    │   ├── test-creation/ # Multi-step wizard
    │   ├── test-dashboard/ # Real-time monitoring
    │   └── test-results/ # Post-test analysis
    └── src/components/ui/ # shadcn/ui components
- Recommended: Paper 1.21.10 (papermc.io)
- Set online-mode=false for local testing
- Port: 25565 (default)
- Requires: Message Content, Guild Members, Voice States intents
- Permissions: Manage Channels, Connect, Speak, Send Messages
- Auto-creates test channels (text + voice)
- 10-minute test ≈ 85 calls (7s interval)
- GPT-4: ~$0.85/test
- Claude-3.5-Sonnet: ~$0.40/test
- Llama-3-70b: ~$0.05/test
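The arithmetic behind these estimates can be sketched as follows. The per-call cost is not stated above. The value used in the usage note is derived by dividing the GPT-4 per-test figure by 85 calls, so treat it as an assumption; real costs depend on token counts and current provider pricing. The function names are hypothetical.

```typescript
// Number of LLM calls in a test, assuming one call per fixed interval.
function estimateCalls(durationSeconds: number, intervalSeconds: number): number {
  if (intervalSeconds <= 0) throw new Error("interval must be positive");
  return Math.floor(durationSeconds / intervalSeconds);
}

// Estimated cost of one test in USD, given an assumed average cost per call.
function estimateTestCostUsd(
  durationSeconds: number,
  intervalSeconds: number,
  costPerCallUsd: number,
): number {
  return estimateCalls(durationSeconds, intervalSeconds) * costPerCallUsd;
}
```

For a 10-minute test at a 7-second interval, `estimateCalls(600, 7)` gives 85. With an assumed $0.01 per call, `estimateTestCostUsd(600, 7, 0.01)` comes to roughly $0.85, in line with the GPT-4 figure above.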
MIT License - see LICENSE for details.
Inspiration: NVIDIA Omniverse robotics simulation
Built with: Elysia, Mineflayer, shadcn/ui
