Agentic Quality Engineering Lab

Controlled agentic QA workflow for QA Automation Engineers and SDETs: requirement analysis, product-risk analysis, risk-based test design, deterministic quality gate execution, failure analysis, bug report generation, release-decision support, guardrails, human approval simulation, and CI evaluation thresholds.

This is a portfolio-grade lab for AI-assisted quality engineering. It shows that agents can support QA reasoning, but only when they are constrained by structured outputs, safe tool allowlists, deterministic evaluation, human-reviewed policies, and auditable traces.

What This Repo Does

  • Reads fictional BugBank requirements from local Markdown files.
  • Uses deterministic FakeAgentProvider outputs by default.
  • Produces Zod-validated agent outputs for every workflow step (see the schema sketch after this list).
  • Runs only allowlisted local tools.
  • Simulates quality gates against seeded BugBank bugs.
  • Generates local JSON and Markdown reports.
  • Evaluates golden scenarios with CI thresholds.
  • Records agent traces, tool calls, guardrail decisions, and approval decisions.
  • Captures invalid agent outputs and tool errors as auditable trace failures.
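
As a sketch of the structured-output idea, here is what a Zod schema for one step's output could look like. The schema and field names are illustrative, not the repo's actual definitions:

```ts
import { z } from "zod";

// Hypothetical schema for the risk-analyst step; the repo's real schemas may differ.
const RiskAnalysisOutput = z.object({
  risks: z.array(
    z.object({
      id: z.string(), // e.g. "RISK-001", referenced later by designed test cases
      severity: z.enum(["LOW", "MEDIUM", "HIGH", "CRITICAL"]),
      rationale: z.string(),
      recommendedTestLayers: z.array(z.enum(["unit", "api", "manual"])),
    })
  ),
});

type RiskAnalysis = z.infer<typeof RiskAnalysisOutput>;
```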

Why It Matters For QA/SDET

AI agents should not be trusted blindly. This lab demonstrates a guarded workflow where QA expertise remains central: requirements are analyzed, risks are made explicit, tests are designed from risk, failures are tied to evidence, and release support is gated by measurable checks.

The point is not to replace QA engineers. The point is to show how controlled agentic QA workflows can make quality reasoning more visible, repeatable, and reviewable.

What Is Deterministic

  • FakeAgentProvider is the default and only provider used by tests, evals, CLI commands, and CI (see the provider sketch after this list).
  • No real OpenAI, Anthropic, Gemini, or other external AI API is called.
  • No secrets, database, Docker, paid service, browser UI, vector database, or network access is required.
  • .env.example is documentation-only, contains no secrets, and is safe to commit.
  • Reports avoid random IDs, real timestamps, absolute local paths, and nondeterministic ordering.
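
A minimal sketch of the provider seam, assuming a simple interface (the names are illustrative, not the repo's actual code):

```ts
// Hypothetical provider contract; the repo's actual interface may differ.
interface AgentProvider {
  complete(step: string, input: string): Promise<string>;
}

// Deterministic by construction: canned outputs keyed by step name,
// so tests, evals, and CI never depend on a live model or the network.
class FakeAgentProvider implements AgentProvider {
  constructor(private readonly cannedOutputs: Record<string, string>) {}

  async complete(step: string, _input: string): Promise<string> {
    const output = this.cannedOutputs[step];
    if (output === undefined) {
      throw new Error(`No canned output for step: ${step}`);
    }
    return output;
  }
}
```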

What Guardrails Block

  • Arbitrary shell execution.
  • Network requests.
  • Source code modification by agents.
  • External GitHub/Jira issue creation.
  • Environment secret reads.
  • Path traversal such as ../../package.json.
  • Writes outside reports/ (see the path check sketched after this list).
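
To illustrate the path-boundary rule, a minimal check could look like this (the function name and layout are hypothetical):

```ts
import path from "node:path";

// Hypothetical guardrail: resolve the requested path and reject anything
// that escapes the reports/ directory, including ../ traversal.
function assertInsideReports(requestedPath: string, root = process.cwd()): string {
  const reportsDir = path.resolve(root, "reports");
  const resolved = path.resolve(reportsDir, requestedPath);
  if (resolved !== reportsDir && !resolved.startsWith(reportsDir + path.sep)) {
    throw new Error(`Blocked: path escapes reports/: ${requestedPath}`);
  }
  return resolved;
}

// assertInsideReports("agent-run-report.json") -> allowed
// assertInsideReports("../../package.json")    -> throws (path traversal)
```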

What CI Evaluates

CI runs typecheck, lint, format check, unit tests, agent tests, evaluation tests, and golden-scenario evaluation thresholds. Any case that scores below its minimum, misses an expected risk or expected bug report, returns the wrong release decision, violates a guardrail, or hallucinates evidence can fail the gate.

Tech Stack

  • TypeScript
  • Node.js 20+
  • Vitest
  • Zod
  • ESLint
  • Prettier
  • tsx
  • GitHub Actions

Architecture

```mermaid
flowchart LR
  R["Requirements"] --> RA["Requirement Analyst"]
  RA --> RK["Risk Analyst"]
  RK --> TD["Test Designer"]
  TD --> AA["Automation Advisor"]
  AA --> TR["Test Runner"]
  TR --> FA["Failure Analyst"]
  FA --> BR["Bug Reporter"]
  BR --> RD["Release Decision"]

  Tools["Allowlisted Tools"] --> TR
  Guardrails["Guardrails"] --> Tools
  Approval["Human Approval Simulation"] --> Tools
  Data["Golden Scenarios + Seeded Bugs"] --> TR
  RD --> Reports["JSON + Markdown Reports"]
  Reports --> CI["CI Evaluation Thresholds"]
```

Agent Workflow

  1. Requirement analyst extracts business rules, acceptance criteria, unknowns, and assumptions.
  2. Risk analyst identifies severity, rationale, and recommended test layers.
  3. Test designer creates test cases tied to risk IDs.
  4. Automation advisor recommends API/unit/manual coverage.
  5. Test runner calls only the simulated quality gate tool.
  6. Failure analyst links failed checks to seeded bugs and requirements.
  7. Bug reporter creates local structured reports.
  8. Release decision agent returns GO, NO_GO, or NEEDS_REVIEW.

Every step emits a trace entry with input summary, output summary, tool calls, guardrail decisions, approval decisions, and pass/fail status.

Invalid agent output stops the workflow and is recorded as a structured failed trace entry. The workflow does not continue downstream with invalid data.
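
A minimal sketch of that validation gate, assuming Zod's safeParse and an illustrative trace shape (the repo's real trace fields are richer):

```ts
import { z } from "zod";

// Hypothetical trace entry shape.
type TraceEntry = {
  step: string;
  status: "PASS" | "FAIL";
  reason?: string;
  issues?: string[];
};

// Returns false when the raw agent output fails its schema, after
// recording a structured failed trace entry; the caller stops the workflow.
function gateStepOutput(
  step: string,
  schema: z.ZodTypeAny,
  rawOutput: unknown,
  trace: TraceEntry[]
): boolean {
  const parsed = schema.safeParse(rawOutput);
  if (!parsed.success) {
    trace.push({
      step,
      status: "FAIL",
      reason: "Agent output failed schema validation",
      issues: parsed.error.issues.map((issue) => issue.message),
    });
    return false;
  }
  trace.push({ step, status: "PASS" });
  return true;
}
```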

Tool Safety And Guardrails

Allowed tools:

  • readRequirement
  • inspectProductModel
  • runSimulatedQualityGate
  • parseTestResults
  • writeLocalReport
  • createLocalBugReport

Each tool call records a deterministic sequence number, tool name, allowed/blocked result, reason, input summary, and output summary. The registry validates tool names, Zod input schemas, path boundaries, URL absence, and approval policy before execution.

Allowed tool calls that throw are still audit-logged with deterministic sequence numbers and ERROR status.
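
A minimal sketch of those pre-execution checks, in the order described above (the decision shape and check details are illustrative):

```ts
import { z } from "zod";

// The six allowlisted tools, as documented above.
const ALLOWED_TOOLS = new Set([
  "readRequirement",
  "inspectProductModel",
  "runSimulatedQualityGate",
  "parseTestResults",
  "writeLocalReport",
  "createLocalBugReport",
]);

type ToolDecision = { allowed: boolean; reason: string };

// Hypothetical registry gate: allowlist, then Zod input schema, then a URL
// scan; path-boundary and approval-policy checks (sketched elsewhere) would
// run next before the tool executes.
function validateToolCall(
  name: string,
  input: unknown,
  inputSchema: z.ZodTypeAny
): ToolDecision {
  if (!ALLOWED_TOOLS.has(name)) {
    return { allowed: false, reason: "tool not in allowlist" };
  }
  if (!inputSchema.safeParse(input).success) {
    return { allowed: false, reason: "input failed Zod schema validation" };
  }
  if (/https?:\/\//.test(JSON.stringify(input))) {
    return { allowed: false, reason: "URL found in tool input" };
  }
  return { allowed: true, reason: "passed allowlist, schema, and URL checks" };
}
```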

Human Approval Simulation

Auto-allowed in V1:

  • Read requirements.
  • Inspect the product model.
  • Run simulated quality gates.
  • Parse test results.
  • Write local reports under reports/.

Blocked in V1:

  • Source modification.
  • Arbitrary shell.
  • Network calls.
  • Secret reads.
  • External issue creation.

Would require human approval in a real system:

  • External issue filing.
  • Production release decisions.
  • Code changes.
  • Sensitive data access.

Source modification and external issue creation are blocked because this lab demonstrates human-reviewed release support, not uncontrolled changes to real systems.
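
A minimal sketch of the V1 policy table (the blocked-action names are hypothetical stand-ins for the attempts described above, not real registry entries):

```ts
type Policy = "AUTO_ALLOW" | "BLOCK" | "REQUIRE_APPROVAL";

// Hypothetical mapping; in a real system the BLOCK entries noted below
// would instead be routed to a human as REQUIRE_APPROVAL.
const approvalPolicy: Record<string, Policy> = {
  readRequirement: "AUTO_ALLOW",
  inspectProductModel: "AUTO_ALLOW",
  runSimulatedQualityGate: "AUTO_ALLOW",
  parseTestResults: "AUTO_ALLOW",
  writeLocalReport: "AUTO_ALLOW", // only under reports/
  createLocalBugReport: "AUTO_ALLOW",
  modifySource: "BLOCK", // code changes would need human approval
  runShell: "BLOCK",
  fetchUrl: "BLOCK",
  readSecret: "BLOCK", // sensitive data access would need approval
  createExternalIssue: "BLOCK", // external issue filing would need approval
};
```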

Evaluation Methodology

Golden scenarios score the full workflow across:

  • requirement understanding
  • risk coverage
  • test design coverage
  • tool usage correctness
  • guardrail compliance
  • failure analysis correctness
  • bug report quality
  • release decision correctness
  • no hallucinated evidence
  • human approval compliance

Default thresholds:

  • AGENT_EVAL_MIN_SCORE=0.80
  • AGENT_EVAL_MIN_RISK_COVERAGE=0.80
  • AGENT_EVAL_REQUIRE_GUARDRAILS=true
  • AGENT_EVAL_REQUIRE_NO_HALLUCINATED_EVIDENCE=true

Scenario results include expected risks found/missing, expected bug reports found/missing, expected release decision vs actual, guardrail violations, hallucinated evidence, final score, and minimum score.
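
A minimal sketch of how those thresholds could gate a scenario result (the environment variable names match the documented defaults; the result shape is illustrative):

```ts
// Defaults mirror the documented thresholds.
const minScore = Number(process.env.AGENT_EVAL_MIN_SCORE ?? "0.80");
const minRiskCoverage = Number(process.env.AGENT_EVAL_MIN_RISK_COVERAGE ?? "0.80");
const requireGuardrails =
  (process.env.AGENT_EVAL_REQUIRE_GUARDRAILS ?? "true") === "true";
const requireNoHallucinatedEvidence =
  (process.env.AGENT_EVAL_REQUIRE_NO_HALLUCINATED_EVIDENCE ?? "true") === "true";

// Hypothetical per-scenario result shape.
type ScenarioResult = {
  finalScore: number;
  riskCoverage: number;
  guardrailViolations: number;
  hallucinatedEvidenceCount: number;
};

function scenarioPasses(result: ScenarioResult): boolean {
  if (result.finalScore < minScore) return false;
  if (result.riskCoverage < minRiskCoverage) return false;
  if (requireGuardrails && result.guardrailViolations > 0) return false;
  if (requireNoHallucinatedEvidence && result.hallucinatedEvidenceCount > 0) return false;
  return true;
}
```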

The golden dataset includes an unsafe tool-call scenario that proves shell, network, secret-read, source-modification, and external-issue attempts are blocked and reported.

Quality Gate Matrix

| Area | Gate |
| --- | --- |
| Type safety | npm run typecheck |
| Linting | npm run lint |
| Formatting | npm run format:check |
| Unit/agent/evaluation tests | npm run test |
| Golden scenario CI thresholds | npm run agent:evaluate:ci |
| Full local quality gate | npm run quality |

Commands

```bash
npm ci
npm run typecheck
npm run lint
npm run format:check
npm run test
npm run agent:run
npm run agent:evaluate
npm run agent:evaluate:ci
npm run quality
```

Reports

npm run agent:run writes:

  • reports/agent-run-report.json
  • reports/agent-run-report.md

npm run agent:evaluate writes:

  • reports/agent-evaluation-report.json
  • reports/agent-evaluation-report.md

Markdown evaluation reports include a compact agent workflow trace summary.

Generated JSON and Markdown reports are ignored by git except reports/.gitkeep.

CI

GitHub Actions runs on push and pull request to main or master. The workflow installs with npm ci, runs npm run quality, and uploads reports/ as an artifact with if: always(). No secrets are required.

OpenAI Provider Example

src/providers/openAiProvider.example.ts is intentionally disabled. It is not imported by default, not used by tests/evals/CI, does not require environment variables, and throws immediately if constructed. The default flow uses FakeAgentProvider only.
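
A minimal sketch of how such a deliberately disabled example can be written, reusing the hypothetical AgentProvider interface sketched earlier (the real file may differ):

```ts
// Documentation-only example: constructing it always throws, so it can
// never be wired into tests, evals, or CI by accident.
export class OpenAiProviderExample implements AgentProvider {
  constructor() {
    throw new Error(
      "OpenAiProviderExample is intentionally disabled. " +
        "Use FakeAgentProvider for all tests, evals, and CI runs."
    );
  }

  async complete(_step: string, _input: string): Promise<string> {
    throw new Error("unreachable: the constructor always throws");
  }
}
```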

Known Limitations

  • This is a portfolio lab, not a production QA platform.
  • It does not replace QA engineers.
  • Deterministic fake agents demonstrate workflow evaluation, but cannot prove how a real LLM would behave.
  • Real model integration would require additional eval datasets, monitoring, human review, and provider-specific safety checks.
  • No real external systems are modified.
  • No source code is modified by agents.
  • Human approval is simulated.

Future Improvements

  • Add larger golden datasets with more ambiguous and adversarial requirements.
  • Add optional real-provider adapters behind explicit opt-in flags.
  • Add mutation-style product-model defects for deeper failure analysis.
  • Add richer report visualizations while keeping CI deterministic.
  • Add provider-specific safety and monitoring docs for real model experiments.

Interview-Ready Explanation

This repo demonstrates how I would design a guarded AI-assisted quality workflow as an SDET: start from requirements, make risks explicit, derive test coverage from those risks, run only safe local tools, connect failures to evidence, produce auditable bug reports, and make release support conditional on measurable evaluation gates. The important part is not the fake agent output. The important part is the control system around the agents.
