pytest for LLM apps - Test for grounding failures, prompt injection, safety violations, and regressions
Updated Mar 30, 2026 - Python
A hands-on exploration of DeepEval, an open-source framework for evaluating and red-teaming large language models (LLMs). This repository documents my journey of testing, benchmarking, and improving LLM reliability using custom prompts, metrics, and pipelines.
Red teaming a banking and finance LLM assistant.
Integrating promptfoo into CI/CD pipelines to automatically evaluate prompts, test for security vulnerabilities, and ensure quality before deployment.
This repo is my playground for experimenting with AutoGen and using it to converse, build pipelines, and do LLM testing.
AIRTA is an open-source, production-ready AI Risk Testing Agent. Point it at your chatbot, copilot, or API; build structured compliance tests from rubrics such as the EU AI Act and OECD; then assess every response against regulatory mandates.
LLM Red Team security testing for LLMs and LLM-driven applications, built for red teams and whitehats. Generate adversarial suites from security playbooks (OWASP LLM, OWASP Agent, MITRE ATLAS, jailbreak, multimodal/file-upload), execute them against live targets via browser UI or HTTP API, and assess risk.