Problem Statement
NemoClaw relies on OpenShell policies (filesystem + network rules) to constrain agent behavior. There is no AI-level verification of agent actions before execution — the agent decides and acts within the policy boundaries.
This means:
- Any action within policy boundaries is automatically allowed
- No second opinion on whether an action is appropriate
- No distinction between "technically allowed" and "operationally safe"
Impact
An agent operating within policy boundaries can still perform harmful actions:
- Delete all user data in writable directories (allowed by filesystem policy)
- Send sensitive data to allowed endpoints (allowed by network policy)
- Execute a sequence of individually-safe actions that are collectively dangerous
Proposed Design
Implement a stateless LLM safety gate that evaluates every command before execution:
- Separate model: Use a small, dedicated safety model (e.g., an 8B-parameter model) distinct from the main agent
- Zero conversation context: The safety model sees ONLY the proposed action, not the conversation that led to it. This prevents social engineering through context buildup
- Binary decision: ALLOW or DENY with reasoning
- Runs after pattern denylist: Deterministic checks first (zero-latency), LLM gate second (for novel patterns)
```yaml
safety:
  pattern_denylist: enabled   # Layer 1: deterministic, zero-latency
  llm_gate:                   # Layer 2: catches what patterns miss
    model: nvidia/llama-3.1-nemotron-safety-guard-8b-v3
    context: none             # stateless: no conversation history
    action: deny_and_kill     # on denial: abort the agent
```
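The layered flow could be sketched roughly as follows. This is illustrative only: the `check`/`llm_gate` names, the `Verdict` type, and the example deny patterns are assumptions, not NemoClaw's real API, and the LLM call is stubbed out.

```python
import re
from dataclasses import dataclass

# Hypothetical deny patterns for Layer 1 (deterministic, zero-latency).
DENY_PATTERNS = [
    re.compile(r"rm\s+-rf\s+/"),      # recursive delete from a root path
    re.compile(r"curl\s+.*\|\s*sh"),  # pipe-to-shell install
]

@dataclass
class Verdict:
    allow: bool
    reason: str

def llm_gate(command: str) -> Verdict:
    # Placeholder for a call to the stateless safety model. A real
    # implementation would send ONLY `command` (no conversation history)
    # to the configured model and parse its ALLOW/DENY response.
    return Verdict(allow=True, reason="llm: no concern found")

def check(command: str) -> Verdict:
    # Layer 1: pattern denylist runs first, at zero latency.
    for pat in DENY_PATTERNS:
        if pat.search(command):
            return Verdict(allow=False, reason=f"denylist: {pat.pattern}")
    # Layer 2: the stateless LLM gate handles novel patterns.
    return llm_gate(command)
```

On a `deny` verdict, the configured `deny_and_kill` action would abort the agent rather than merely skip the command.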
The key insight: a single LLM cannot reliably judge its own actions (self-enforcement fails under adversarial conditions). A separate, stateless model with no shared context provides independent verification.
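The zero-context property can be made concrete with a sketch of how the safety model's input might be built (the function name and prompt wording are hypothetical, not a real NemoClaw interface):

```python
def build_safety_prompt(action: str) -> str:
    # The prompt is constructed from the proposed action alone. Because no
    # conversation history is included, an attacker cannot use earlier
    # dialogue to "pre-justify" a harmful command to the safety model.
    return (
        "You are a safety gate. Reply ALLOW or DENY, with a one-line reason, "
        "for this single proposed action:\n"
        f"{action}\n"
    )
```

The gate therefore sees the same prompt whether the action came from a benign request or from a long jailbreak attempt.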
References
- PreToolVerifierMiddleware for NAT workflows (awaiting issue #1811, "ci: add nightly e2e for v0.0.7.2 hotfix tag")
Alternatives Considered
No response
Category
enhancement: feature
Checklist