`src/agentanvil/core/contracts.py` ships `AgentContract`, `Policy`, `Task`, `Constraints` (and enums). Loading a YAML contract is possible, but nothing validates it semantically. A contract can today declare contradictory policies, tasks referencing an unavailable oracle, or constraints that are trivially impossible to meet, and the loader will accept all of them.
The static consistency analyzer ( of the ) is the deliverable that closes this gap. It is referenced in the as a defining feature of AgentAnvil vs competitors (DeepEval, Promptfoo) because it shifts contract errors from runtime to load time.
Three concrete checks:
1. Policy contradictions. `forbid tool X` combined with `require tool X` is fatal. Detection uses either explicit rule-id pairs or structural overlap of regex patterns against enum tool names.
2. Oracle mismatches. A task declares `oracle: human`, but the corpus configuration has no human annotator; or a task declares `oracle: functional`, but its `success_criteria` are natural-language sentences with no code-level assertion. Both are fatal.
3. Impossible constraints. `constraints.max_latency_ms: 100` with a task requiring 3 sequential LLM calls (budget computed conservatively via `LLMBackend.cost_estimate`). These are warnings unless the budget is unambiguously exceeded.
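A minimal sketch of the structural-overlap variant of the first check (the explicit rule-id pair variant is not shown). It assumes a hypothetical `ToolPolicy` shape with an `action` (`"forbid"` or `"require"`) and a regex `pattern` over tool names; the real `Policy` model in `contracts.py` may look different. Two policies contradict when both patterns fully match at least one enumerated tool name.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ToolPolicy:
    # Hypothetical stand-in for a contract policy entry.
    rule_id: str
    action: str   # "forbid" or "require"
    pattern: str  # regex matched against tool names


def find_policy_contradictions(policies: list[ToolPolicy], tool_names: list[str]) -> list[str]:
    """Return one fatal message per (forbid, require) pair that covers a common tool."""
    forbids = [p for p in policies if p.action == "forbid"]
    requires = [p for p in policies if p.action == "require"]
    messages = []
    for f, r in product(forbids, requires):
        # Structural overlap: both regexes fully match the same enumerated tool name.
        shared = [t for t in tool_names
                  if re.fullmatch(f.pattern, t) and re.fullmatch(r.pattern, t)]
        if shared:
            messages.append(f"fatal: {f.rule_id} forbids and {r.rule_id} requires {', '.join(shared)}")
    return messages
```

With tools `web_search`, `web_fetch`, `shell_exec`, a policy forbidding `web_.*` and another requiring `web_fetch` yield a single fatal message naming `web_fetch`.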
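Core analyzer function:

```python
# src/agentanvil/core/contracts_analyzer.py
from __future__ import annotations  # allows forward references to types defined later in the module

from agentanvil.core.contracts import AgentContract

# Diagnostic, ContractDiagnostics and the _check_* helpers are defined in this module.


def analyze(contract: AgentContract, *, backend_cost_fn=None) -> ContractDiagnostics:
    diagnostics: list[Diagnostic] = []
    diagnostics += _check_policy_contradictions(contract)                    # cons_policy
    diagnostics += _check_oracle_availability(contract)                      # cons_oracle
    diagnostics += _check_constraint_feasibility(contract, backend_cost_fn)  # cons_budget
    diagnostics += _check_cross_level_references(contract)                   # cons_cross
    diagnostics += _check_multiagent_coherence(contract)  # stub returns [] until 0.3.0
    diagnostics += _check_a2a_coherence(contract)         # stub returns [] until 0.3.0
    return ContractDiagnostics(diagnostics=diagnostics)
```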
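`analyze` uses `Diagnostic` and `ContractDiagnostics` without defining them. One possible shape, sketched under assumptions: a two-level severity (fatal vs warning, matching the checks listed earlier) and illustrative field names (`code`, `message`, `location`) that are not taken from the codebase.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(str, Enum):
    WARNING = "warning"
    FATAL = "fatal"


@dataclass(frozen=True)
class Diagnostic:
    severity: Severity
    code: str           # consistency clause that failed, e.g. "cons_policy"
    message: str
    location: str = ""  # optional pointer into the contract, e.g. "policies[2]"


@dataclass
class ContractDiagnostics:
    diagnostics: list[Diagnostic] = field(default_factory=list)

    @property
    def fatal(self) -> list[Diagnostic]:
        return [d for d in self.diagnostics if d.severity is Severity.FATAL]

    @property
    def warnings(self) -> list[Diagnostic]:
        return [d for d in self.diagnostics if d.severity is Severity.WARNING]

    @property
    def ok(self) -> bool:
        # Accepted when no fatal diagnostic was produced; warnings never block.
        return not self.fatal
```

Keeping severity on each `Diagnostic` lets the CLI print everything and decide the exit code from `ok` alone.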
Each _check_* function implements one clause of the consistency relation ⊨ defined formally in the planning notes. The analyzer is the operational instantiation of the formal semantics; the two must stay in sync and the paper cites the code.
- `_check_policy_contradictions`: `cons_policy`, which forbids `require X` together with `forbid X` on overlapping scope.
- `_check_oracle_availability`: `cons_oracle`, realisability of the declared oracle given the corpus and ensemble configuration.
- `_check_constraint_feasibility`: `cons_budget`, per-level budget estimate ≤ declared constraint.
- `_check_cross_level_references`: `cons_cross`, every L2 / L3 reference to L1 entities resolves (e.g. a trajectory assertion citing a tool declared in the L1 schema).
- `_check_multiagent_coherence`, `_check_a2a_coherence`: extensions of the same relation to multi-agent and A2A clauses (filled in by 0.3.0 #032 and #036).
Each _check_* function is testable in isolation with a fixture pair (passing contract, failing contract). The test suite doubles as the machine-checked proof that the implementation honours the formal semantics.
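One way to lay out a fixture pair, sketched against a deliberately simplified, hypothetical `cons_policy` check on plain dicts (exact tool names, no regex overlap) so the example runs without the real contract models. It avoids a pytest import, although pytest would still collect the `test_` function.

```python
from typing import Callable


def check_policy_contradictions(contract: dict) -> list[str]:
    """Simplified, hypothetical cons_policy check on exact tool names."""
    forbidden = set(contract.get("forbid_tools", []))
    required = set(contract.get("require_tools", []))
    return [f"fatal: tool {t!r} is both forbidden and required" for t in sorted(forbidden & required)]


def assert_fixture_pair(check: Callable[[dict], list[str]], passing: dict, failing: dict) -> None:
    # The passing contract must yield no diagnostics; the failing one at least one.
    assert check(passing) == [], check(passing)
    assert check(failing), "failing fixture produced no diagnostics"


def test_check_policy_contradictions_fixture_pair() -> None:
    passing = {"forbid_tools": ["shell_exec"], "require_tools": ["web_fetch"]}
    failing = {"forbid_tools": ["web_fetch"], "require_tools": ["web_fetch"]}
    assert_fixture_pair(check_policy_contradictions, passing, failing)
```

The same `assert_fixture_pair` helper can be reused for every `_check_*` function, so each clause gets exactly one passing and one failing contract.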
This issue is the operational instantiation of the formal contract semantics published in the framework. The relation C ⊨ cons_policy ∧ cons_oracle ∧ cons_budget ∧ cons_cross is what the analyzer decides; the paper cites the module.
Multi-agent and A2A checks (_check_multiagent_coherence, _check_a2a_coherence) stub-return [] in 0.2.0 and are filled in by 0.3.0 #032 and #036.
_check_cross_level_references is new scope in this extension: validates that L2 / L3 references to L1 entities (tools, schema fields) resolve. Dangling references are fatal.
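A sketch of the dangling-reference check. It assumes that L1 exposes a set of declared tool names and schema fields, and that L2 / L3 assertions can be flattened into hypothetical `(level, kind, name)` references; the real contract layout will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    # Hypothetical: a reference from an L2/L3 assertion to an L1 entity.
    level: str  # "L2" or "L3"
    kind: str   # "tool" or "field"
    name: str


def check_cross_level_references(
    l1_tools: set[str], l1_fields: set[str], refs: list[Reference]
) -> list[str]:
    declared = {"tool": l1_tools, "field": l1_fields}
    errors = []
    for ref in refs:
        # Anything not declared at L1 (including an unknown kind) is a dangling reference.
        if ref.name not in declared.get(ref.kind, set()):
            errors.append(f"fatal: {ref.level} references undeclared {ref.kind} {ref.name!r}")
    return errors
```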
Proposal
1. A typed diagnostic object.
2. The core analyzer function, `analyze`, with one `_check_*` function per clause of the consistency relation.
3. Wire the analyzer into the CLI.
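A sketch of the CLI wiring for step 3. The CLI framework used by `src/agentanvil/cli/main.py` is not stated, so this keeps the logic framework-free: the hypothetical `run_validate` takes the loader and the analyzer as callables and represents diagnostics as `(severity, message)` tuples, which is a simplification.

```python
from typing import Callable, Iterable


def run_validate(
    path: str,
    load_contract: Callable[[str], object],
    analyze: Callable[[object], Iterable[tuple[str, str]]],
) -> int:
    """Print every diagnostic, then return 1 if any is fatal, else 0."""
    contract = load_contract(path)
    has_fatal = False
    for severity, message in analyze(contract):
        # Print all diagnostics before deciding the exit code.
        print(f"{path}: {severity}: {message}")
        has_fatal = has_fatal or severity == "fatal"
    return 1 if has_fatal else 0
```

The `validate` command would then end with something like `sys.exit(run_validate(path, load, analyze))`. Printing everything before exiting is what `test_cli_validate_prints_all_diagnostics` and `test_cli_validate_exits_nonzero_on_fatal` would both exercise.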
Scope
- `src/agentanvil/core/contracts_analyzer.py`: new.
- `src/agentanvil/cli/main.py`: extend `validate` to call the analyzer and exit non-zero on fatal.
- `tests/core/test_contracts_analyzer.py`: 12–20 test cases covering each check, pass + fail.
- `docs/contracts.md`: new "writing contracts" reference; shows how to read diagnostics.

Regression tests

- `test_analyze_empty_contract_has_no_diagnostics`
- `test_analyze_detects_policy_contradiction_forbid_vs_require_tool`
- `test_analyze_detects_oracle_mismatch_human_without_annotator_config`
- `test_analyze_detects_oracle_mismatch_functional_without_code_criteria`
- `test_analyze_detects_impossible_latency_constraint`
- `test_analyze_warns_on_near_limit_budget_constraint`
- `test_analyze_passes_on_well_formed_fixture_contract`
- `test_cli_validate_exits_nonzero_on_fatal`
- `test_cli_validate_prints_all_diagnostics`

Notes
`_check_constraint_feasibility` relies on `LLMBackend.cost_estimate` from chore(meta): harden version-linearity script and ci pin #2; if no backend is provided, it returns `[]` (non-blocking).
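A sketch of the feasibility check, under stated assumptions: the backend cost function returns a conservative (lower-bound) per-call latency in milliseconds, since the real `LLMBackend.cost_estimate` signature is not shown here; sequential calls add up; and a 90% threshold, chosen for illustration only, separates "near the limit" (warning) from "unambiguously exceeded" (fatal).

```python
from typing import Callable, Optional


def check_constraint_feasibility(
    max_latency_ms: Optional[float],
    sequential_llm_calls: int,
    backend_cost_fn: Optional[Callable[[], float]] = None,
    near_limit_ratio: float = 0.9,  # illustrative threshold, not from the issue
) -> list[tuple[str, str]]:
    # No backend or no declared constraint: the check is non-blocking.
    if backend_cost_fn is None or max_latency_ms is None:
        return []
    # Assumption: backend_cost_fn is a lower bound per call, so the sum is a lower bound too.
    estimate_ms = sequential_llm_calls * backend_cost_fn()
    if estimate_ms > max_latency_ms:
        return [("fatal", f"estimated {estimate_ms:.0f} ms exceeds max_latency_ms={max_latency_ms:g}")]
    if estimate_ms > near_limit_ratio * max_latency_ms:
        return [("warning", f"estimated {estimate_ms:.0f} ms is close to max_latency_ms={max_latency_ms:g}")]
    return []
```

For the issue's example (`max_latency_ms: 100` with three sequential calls), any per-call estimate above 100/3 ≈ 33.3 ms makes the result fatal, while an estimate between 30 and 33.3 ms only warns.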