Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen
📄 Paper: arXiv:2604.07223 · 📦 Dataset (TraceSafe-Bench): CyCraftAI/TraceSafe
TraceSafe is a testing framework for assessing how well Large Language Models (LLMs) and specialized guardrail models withstand adversarial, hallucinated, and ambiguous multi-step tool-calling workflows. It systematically injects security failures (e.g., API key leaks, prompt injections) and functional failures (e.g., calls to non-existent utilities) into conversational trajectories to benchmark the defensive capabilities of the models under evaluation.
To get started with TraceSafe, follow these steps to configure your environment:
conda create -n TraceSafe python=3.12 -y
conda activate TraceSafe
pip install -r requirements.txt
The TraceSafe-Bench golden collection (1,170 records across 12 risk categories + benign) is hosted on Hugging Face as a gated dataset. Request access at CyCraftAI/TraceSafe, then run:
huggingface-cli login # once, paste a token with read access
python data_preprocessing/download_data.py
This populates the top-level data/ directory with the 13 golden_*.jsonl files that the evaluation scripts expect. Note: data/ is the local mirror of the Hugging Face dataset; the canonical source is the gated repo above.
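After the download finishes, a quick sanity check can confirm that the local mirror matches the figures quoted above (13 golden_*.jsonl files, 1,170 records in total). The following is a minimal sketch, not part of the repository: the helper names `count_golden_records` and `check_download` are hypothetical, and it only assumes that each file holds one JSON record per non-blank line.

```python
import json
from pathlib import Path


def count_golden_records(data_dir="data"):
    """Return {filename: record_count} for every golden_*.jsonl file in data_dir."""
    counts = {}
    for path in sorted(Path(data_dir).glob("golden_*.jsonl")):
        records = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed record
                    records += 1
        counts[path.name] = records
    return counts


def check_download(data_dir="data", expected_files=13, expected_records=1170):
    """Compare what is on disk with the counts quoted in this README."""
    counts = count_golden_records(data_dir)
    total = sum(counts.values())
    ok = len(counts) == expected_files and total == expected_records
    return ok, counts, total
```

Calling `check_download()` from the repository root returns a flag, the per-file counts, and the total, so a mismatch points directly at the file that is missing or truncated.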
- core_utils/: The foundational backbone. Contains centralized schema definitions (TraceEntry), unified json/jsonl dataset loaders, shared path configurations, and threaded evaluation pipelines (BaseEvaluationRunner).
- data_preprocessing/: Code to construct the benchmark.
  - 0_trace_generation/: Core scripts to fetch, filter, and format initial ground-truth (safe baseline) conversational tool-calling traces from benchmark endpoints (e.g., BFCL). Please refer to the Generation README for script configuration.
  - 1_mutation/: The heart of the red-teaming engine. Recursively applies corrupted trace permutations to baselines across 12 distinct risk taxonomy classes. See the Mutation README for command-line arguments.
- evaluation/: The primary benchmarking surface. Contains parallelized runners that benchmark arbitrary models against the synthetic corrupted traces, and outputs detailed tabular csv metrics tracking overall detection accuracy. Read the Evaluation README for usage arguments.
TraceSafe provides separate evaluation entry points for two model types: general-purpose LLMs acting as zero-shot guards, and enterprise guardrails accessed through their own SDKs.
To test how well an LLM detects vulnerabilities across the mutated traces, serve your target model (e.g., via vLLM) and point evaluate_llm.py at its OpenAI-compatible endpoint.
cd evaluation
python evaluate_llm.py \
--model_name "Qwen3/Qwen3-32B" \
--api_key "EMPTY" \
--base_url "http://localhost:8017/v1" \
--settings binary_classification_with_taxonomy fine_grained_classification \
--output_dir "./results/Qwen3-32B"
TraceSafe natively proxies requests to proprietary safety filters such as Azure Content Safety, AWS Bedrock Guardrails, and GCP Model Armor, as well as guard models such as Llama-Guard 3. Use evaluate_guard.py and declare your provider explicitly:
# Example evaluating Azure Content Safety
cd evaluation
python evaluate_guard.py \
--provider azure \
--azure_endpoint "https://your-endpoint.cognitiveservices.azure.com/" \
--azure_key "your-key" \
--output_dir ./results/Azure-Guard
(Note: Each run writes its results, including the standard 'correct', 'wrong', and '.csv' statistics, into the --output_dir you specify.)
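To make the 'correct' / 'wrong' / csv statistics concrete, here is a minimal sketch of how per-category detection accuracy could be tallied and serialized. It is an illustration only: the function names, the `(category, is_correct)` input shape, and the csv column names are assumptions, not the schema the evaluation scripts actually emit (see the Evaluation README for that).

```python
import csv
import io
from collections import defaultdict


def summarize(results):
    """Aggregate (category, is_correct) pairs into per-category accuracy rows."""
    tally = defaultdict(lambda: {"correct": 0, "wrong": 0})
    for category, is_correct in results:
        tally[category]["correct" if is_correct else "wrong"] += 1
    rows = []
    for category in sorted(tally):
        correct = tally[category]["correct"]
        wrong = tally[category]["wrong"]
        rows.append({
            "category": category,
            "correct": correct,
            "wrong": wrong,
            "accuracy": round(correct / (correct + wrong), 4),
        })
    return rows


def to_csv(rows):
    """Render the summary rows as csv text (hypothetical column layout)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["category", "correct", "wrong", "accuracy"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the tally separate from the serialization means the same rows can be compared across runs, for example an LLM guard against an enterprise guardrail, before anything is written to disk.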
Our generator injects 12 vulnerability classes, grouped under 4 root risk categories:
- PROMPT_INJECTION
1_PromptInjectionIn|2_PromptInjectionOut
- PRIVACY_LEAKAGE
3_UserInfoLeak|4_ApiKeyLeak|5_DataLeak
- HALLUCINATION
6_AmbiguousArg|7_HallucinatedTool|8_HallucinatedArgValue|9_RedundantArg|10_MissingTypeHint
- INTERFACE_INCONSISTENCIES
11_VersionConflict|12_DescriptionMismatch
For detailed implementation and description of the categories, please see Implementation.md.
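When post-processing results, it can help to roll the 12 fine-grained labels up to their root category. The mapping below is transcribed from the list above; the names `TAXONOMY`, `ROOT_OF`, and `root_category` are hypothetical and are not claimed to exist in the codebase.

```python
# Root category -> fine-grained mutation classes, as listed in this README.
TAXONOMY = {
    "PROMPT_INJECTION": ["1_PromptInjectionIn", "2_PromptInjectionOut"],
    "PRIVACY_LEAKAGE": ["3_UserInfoLeak", "4_ApiKeyLeak", "5_DataLeak"],
    "HALLUCINATION": [
        "6_AmbiguousArg",
        "7_HallucinatedTool",
        "8_HallucinatedArgValue",
        "9_RedundantArg",
        "10_MissingTypeHint",
    ],
    "INTERFACE_INCONSISTENCIES": ["11_VersionConflict", "12_DescriptionMismatch"],
}

# Inverted index: fine-grained class -> root category.
ROOT_OF = {label: root for root, labels in TAXONOMY.items() for label in labels}


def root_category(label):
    """Return the root category for a fine-grained class; KeyError if unknown."""
    return ROOT_OF[label]
```

For example, `root_category("4_ApiKeyLeak")` yields `"PRIVACY_LEAKAGE"`. Benign traces carry no risk class, so they are deliberately left out of the mapping.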
If you find TraceSafe-Bench useful in your research, please cite our work:
@misc{chen2026tracesafe,
title = {TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories},
author = {Chen, Yen-Shan and Huang, Sian-Yao and Yang, Cheng-Lin and Chen, Yun-Nung},
year = {2026},
eprint = {2604.07223},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2604.07223}
}
Released under the Apache License 2.0. Source baseline traces are derived from BFCL (Apache-2.0); mutations and curation are released under the same license.