From c89a1bb26d193b2d61d616ff2d527b3eb259ed27 Mon Sep 17 00:00:00 2001
From: AymanBx
Date: Wed, 9 Jul 2025 16:44:00 +0000
Subject: [PATCH 1/4] Initial draft for readme.md

---
 README.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/README.md b/README.md
index fd2bc76..6ca1ddd 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,25 @@ This is a benchmark to evaluate AI capabilities to do fair data-driven decision-making.
 The benchmark consists of several tasks.
 
+A FairnessBench task is defined as follows:
+Given a dataset and a very simple training script that uses a logistic regression model,
+how well can an LLM agent improve the training script to achieve high fairness metrics?
+
+## Fairness Metrics:
+
+
+
+## Different LLM models used for agent
+
+## Different LLM models used for eval
+
+## What is baseline
+
+## What does eval do?
+
+## Reading eval results
+
+
 Roles:
 - task-specific: environment files for the task, the train.py, etc
 - benchmarking infrastructure: code needed to overall run benchmark, scoring etc (`eval-.py`)
 - agent: agent tools, agent prompts, etc
 
 | file | description | role|
 |--- |---| ---|
+
+
+## Instructions for running the benchmark:
+Pick a task/list of tasks to run from tasks.json.
+Run using run_experiment.sh
\ No newline at end of file

From 2ce2187b713a0382fe6fef42f67a263d25183a20 Mon Sep 17 00:00:00 2001
From: lanij21 <118463516+lanij21@users.noreply.github.com>
Date: Thu, 10 Jul 2025 15:44:48 -0400
Subject: [PATCH 2/4] added variables/filled some descriptions

Added variables from run_experiment.sh and eval.sh; I added short descriptions
for the ones I know.

---
 README.md | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6ca1ddd..dba1ba8 100644
--- a/README.md
+++ b/README.md
@@ -35,4 +35,26 @@
 
 ## Instructions for running the benchmark:
 Pick a task/list of tasks to run from tasks.json.
-Run using run_experiment.sh
\ No newline at end of file
+Run using run_experiment.sh
+
+### run_experiment.sh
+
+- log_dir
+  - The directory name for the llm to create logs
+- models
+  - The models you want to use for the tasks
+  - claude-2.1, gpt-4-0125-preview, gemini-pro, gpt-4o-mini, gpt-4o, llama, qwen, granite, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
+- edit_script_model
+
+- fast_llm
+### eval.sh
+
+- log_dir
+  - directory that the llm placed the experiment logs
+- json_folder
+  -
+- all_tasks
+  - list of tasks to be evaluated
+- models
+  - Models being used to evaluate results
+- eval_model
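The patches above describe each task as shipping a deliberately simple logistic-regression train.py that the agent is then asked to improve. For orientation, here is a minimal sketch of what such a baseline script could look like; the synthetic data, column names, and train/test split are illustrative stand-ins, not the benchmark's actual task files.

```python
# Minimal sketch of a baseline train.py: plain logistic regression with no
# fairness-aware processing. The synthetic data and column names are
# illustrative; real tasks ship their own dataset and training script.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "feature_1": rng.normal(size=n),
    "feature_2": rng.normal(size=n),
    "group": rng.integers(0, 2, size=n),  # hypothetical protected attribute
})
df["label"] = (df["feature_1"] + 0.5 * df["group"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_1", "feature_2", "group"]], df["label"],
    test_size=0.3, random_state=0,
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```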
From dd0ed63b713a0382fe6fef42f67a263d25183a20 Mon Sep 17 00:00:00 2001
From: Ayman Sandouk <111829133+AymanBx@users.noreply.github.com>
Date: Tue, 22 Jul 2025 16:24:07 +0300
Subject: [PATCH 3/4] Update README.md

---
 README.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index dba1ba8..614fc1c 100644
--- a/README.md
+++ b/README.md
@@ -13,12 +13,25 @@ how well can an LLM agent improve the training script to achieve high fairness metrics?
 
 ## Different LLM models used for agent
+We use a variety of open-source and paid LLMs:
+- Llama 3.3 70B (open source)
+- Qwen 2.5 72B (open source)
+- GPT-4o (paid)
+- Claude-sonnet 3.7 (paid)
 
-## Different LLM models used for eval
+## LLM models used for eval
+Gemma 3 27B - Instruction tuned
 
 ## What is baseline
+Run baseline on a task/list of tasks to run the baseline train.py provided in the task to compare the accuracy and fairness metrics with the values reached by the agents
 
 ## What does eval do?
+Run eval.sh with a list of tasks, the agent LLMs to be evaluated, and a selected evaluation LLM.
+eval.sh runs eval.py, which in turn runs the different levels of evaluation:
+- The task-specific eval.py (to evaluate accuracy and fairness metrics)
+- An LLM eval that evaluates the new training script generated by the agent, using the LLM evaluator
+- An LLM log eval that evaluates the agent's log, using the LLM evaluator
+- A Flake8 eval that statically checks the training script generated by the agent (Python AST inspection and fairness-library use)
 
 ## Reading eval results
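The task-specific eval.py mentioned in the patch above reports accuracy together with fairness metrics, and the next patch lists the metrics the benchmark tracks. As a rough illustration only (not the benchmark's own eval code), the group-fairness metrics for a binary classifier and a binary protected attribute can be computed along the following lines, using the usual unprivileged-versus-privileged definitions; empty groups and division-by-zero edge cases are not handled here.

```python
# Illustrative computation of common group-fairness metrics from predictions
# and a binary protected attribute. Sign/ratio conventions follow the usual
# unprivileged-vs-privileged definitions; this is not the benchmark's eval.py.
import numpy as np

def fairness_metrics(y_true, y_pred, group):
    """group == 1 marks the privileged group, group == 0 the unprivileged one."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    priv, unpriv = group == 1, group == 0

    sel_p, sel_u = y_pred[priv].mean(), y_pred[unpriv].mean()    # selection rates
    tpr_p = y_pred[priv & (y_true == 1)].mean()                  # true positive rates
    tpr_u = y_pred[unpriv & (y_true == 1)].mean()
    err_p = (y_pred != y_true)[priv].mean()                      # error rates
    err_u = (y_pred != y_true)[unpriv].mean()
    for_p = (y_true == 1)[priv & (y_pred == 0)].mean()           # false omission rates
    for_u = (y_true == 1)[unpriv & (y_pred == 0)].mean()

    return {
        "statistical_parity_difference": sel_u - sel_p,
        "disparate_impact": sel_u / sel_p,
        "equal_opportunity_difference": tpr_u - tpr_p,
        "error_rate_difference": err_u - err_p,
        "error_rate_ratio": err_u / err_p,
        "false_omission_rate_difference": for_u - for_p,
    }

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(fairness_metrics(y_true, y_pred, group))
```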
From f68d9a46666e84c0537bdcc08b421cbfd743c5c0 Mon Sep 17 00:00:00 2001
From: Ayman Sandouk <111829133+AymanBx@users.noreply.github.com>
Date: Wed, 13 Aug 2025 20:54:14 +0300
Subject: [PATCH 4/4] Update README.md with more details

---
 README.md | 69 ++++++++++++++++++++++++++++---------------------------
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/README.md b/README.md
index 614fc1c..d996e1e 100644
--- a/README.md
+++ b/README.md
@@ -9,21 +9,30 @@ Given a dataset and a very simple training script that uses a logistic regression model,
 how well can an LLM agent improve the training script to achieve high fairness metrics?
 
 ## Fairness Metrics:
-
+* To capture disparities:
+  * Disparate Impact
+  * Statistical Parity Difference
+* To assess differences in true positive rates:
+  * Equal Opportunity Difference
+* To quantify misclassification disparities:
+  * Error Rate Difference
+  * Error Rate Ratio
+* To examine disparities in false negatives across groups:
+  * False Omission Rate Difference
 
 ## Different LLM models used for agent
 We use a variety of open-source and paid LLMs:
-- Llama 3.3 70B (open source)
-- Qwen 2.5 72B (open source)
-- GPT-4o (paid)
-- Claude-sonnet 3.7 (paid)
+- Meta's Llama-3.3-70B (open source)
+- Alibaba's Qwen-2.5-72B (open source)
+- OpenAI's GPT-4o (paid)
+- Anthropic's Claude 3.7 Sonnet (paid)
 
-## LLM models used for eval
-Gemma 3 27B - Instruction tuned
+## LLM model used for eval
+Google's Gemma 3 27B (instruction-tuned)
 
-## What is baseline
-Run baseline on a task/list of tasks to run the baseline train.py provided in the task to compare the accuracy and fairness metrics with the values reached by the agents
+## Baseline
+Run baseline.sh on a task/list of tasks to run the baseline train.py provided with each task; this produces the accuracy and fairness metrics against which the agents' results are compared.
 
 ## What does eval do?
 Run eval.sh with a list of tasks, the agent LLMs to be evaluated, and a selected evaluation LLM.
 eval.sh runs eval.py, which in turn runs the different levels of evaluation:
 - The task-specific eval.py (to evaluate accuracy and fairness metrics)
 - An LLM eval that evaluates the new training script generated by the agent, using the LLM evaluator
 - An LLM log eval that evaluates the agent's log, using the LLM evaluator
 - A Flake8 eval that statically checks the training script generated by the agent (Python AST inspection and fairness-library use)
 
 ## Reading eval results
+- From the fairnessbench_analysis directory, run explode_results.py (making sure the result paths point to the folder eval.sh wrote to) to prepare CSV files with all the collected results
+- Use the other Python scripts in the analysis folder to generate the plots
 
-Roles:
-- task-specific: environment files for the task, the train.py, etc
-- benchmarking infrastructure: code needed to overall run benchmark, scoring etc (`eval-.py`)
-- agent: agent tools, agent prompts, etc
-
-
-| file | description | role|
-|--- |---| ---|
+## Roles
+- Task-specific: environment files for the task, the train.py, etc
+- Benchmarking infrastructure: code needed to overall run benchmark, scoring etc (`eval-.py`)
+- Agent: agent tools, agent prompts, etc
 
 ## Instructions for running the benchmark:
-Pick a task/list of tasks to run from tasks.json.
-Run using run_experiment.sh
+- Pick a task/list of tasks to run from tasks.json
+- Pick the LLMs you want to benchmark (make sure the required API keys are available)
+- Run using run_experiment.sh
 
 ### run_experiment.sh
 
-- log_dir
-  - The directory name for the llm to create logs
-- models
-  - The models you want to use for the tasks
+- log_dir: The path to a directory where the environment keeps the logs
+- models: The models you want to evaluate on the tasks
   - claude-2.1, gpt-4-0125-preview, gemini-pro, gpt-4o-mini, gpt-4o, llama, qwen, granite, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
-- edit_script_model
-
-- fast_llm
+- edit_script_model & fast_llm: LLMs used for smaller actions such as editing a script or summarizing a long observation; these can optionally differ from the main agent models
 
 ### eval.sh
 
-- log_dir
-  - directory that the llm placed the experiment logs
-- json_folder
-  -
-- all_tasks
-  - list of tasks to be evaluated
-- models
-  - Models being used to evaluate results
-- eval_model
+- log_dir: Directory where the agent LLM placed the experiment logs
+- json_folder: Directory to place the results in
+- all_tasks: List of tasks to be evaluated
+- models: The agent models whose results are being evaluated on the above tasks
+- eval_model: The LLM used to evaluate the training script generated by the agent and the log of the agent's thought/action process
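As a closing illustration of the Flake8/AST evaluation step mentioned above: a static check of an agent-generated training script might parse the file and record which fairness libraries it imports. The snippet below only sketches that idea; the library set and the helper function are hypothetical, not the benchmark's actual implementation.

```python
# Sketch of the idea behind the Flake8/AST check: parse an agent-generated
# train.py and record which fairness libraries it imports. The library set
# and this helper are illustrative, not the benchmark's own code.
import ast

FAIRNESS_LIBS = {"aif360", "fairlearn"}  # example libraries, not an official list

def fairness_imports(source):
    """Return the top-level fairness-library imports found in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & FAIRNESS_LIBS

generated = "from fairlearn.metrics import demographic_parity_difference\nimport numpy as np\n"
print(fairness_imports(generated))  # -> {'fairlearn'}
```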