From c89a1bb26d193b2d61d616ff2d527b3eb259ed27 Mon Sep 17 00:00:00 2001
From: AymanBx
Date: Wed, 9 Jul 2025 16:44:00 +0000
Subject: [PATCH 1/4] Initial draft for readme.md

---
 README.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/README.md b/README.md
index fd2bc76..6ca1ddd 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,25 @@ This is a benchmark to evaluate AI capabilities to do fair data-driven decision-making.
 The benchmark consists of several tasks.
 
+A FairnessBench task is defined as follows:
+Given a dataset and a very simple training script that uses a logistic regression model,
+how well can an LLM agent improve the training script to achieve high fairness metrics?
+
+## Fairness Metrics:
+
+
+
+## Different LLM models used for agent
+
+## Different LLM models used for eval
+
+## What is baseline
+
+## What does eval do?
+
+## Reading eval results
+
+
 Roles:
 - task-specific: environment files for the task, the train.py, etc
 - benchmarking infrastructure: code needed to overall run benchmark, scoring etc (`eval-.py`)
 - agent: agent tools, agent prompts, etc
 
 | file | description | role|
 |--- |---| ---|
+
+
+## Instructions for running the benchmark:
+Pick a task/list of tasks to run from tasks.json.
+Run using run_experiment.sh
\ No newline at end of file

From 2ce2187b713a0382fe6fef42f67a263d25183a20 Mon Sep 17 00:00:00 2001
From: lanij21 <118463516+lanij21@users.noreply.github.com>
Date: Thu, 10 Jul 2025 15:44:48 -0400
Subject: [PATCH 2/4] added variables/filled some descriptions

Added variables from run_experiment.sh and eval.sh; I added short descriptions
for the ones I know.

---
 README.md | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6ca1ddd..dba1ba8 100644
--- a/README.md
+++ b/README.md
@@ -35,4 +35,26 @@
 
 ## Instructions for running the benchmark:
 Pick a task/list of tasks to run from tasks.json.
-Run using run_experiment.sh
\ No newline at end of file
+Run using run_experiment.sh
+
+### run_experiment.sh
+
+- log_dir
+  - The directory name for the llm to create logs
+- models
+  - The models you want to use for the tasks
+  - claude-2.1, gpt-4-0125-preview, gemini-pro, gpt-4o-mini, gpt-4o, llama, qwen, granite, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
+- edit_script_model
+
+- fast_llm
+### eval.sh
+
+- log_dir
+  - directory that the llm placed the experiment logs
+- json_folder
+  -
+- all_tasks
+  - list of tasks to be evaluated
+- models
+  - Models being used to evaluate results
+- eval_model
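The patches above describe each task as shipping a deliberately simple logistic-regression train.py that the agent is then asked to improve. For orientation, here is a minimal sketch of what such a baseline script could look like; the synthetic data, column names, and train/test split are illustrative stand-ins, not the benchmark's actual task files.

```python
# Minimal sketch of a baseline train.py: plain logistic regression with no
# fairness-aware processing. The synthetic data and column names are
# illustrative; real tasks ship their own dataset and training script.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "feature_1": rng.normal(size=n),
    "feature_2": rng.normal(size=n),
    "group": rng.integers(0, 2, size=n),  # hypothetical protected attribute
})
df["label"] = (df["feature_1"] + 0.5 * df["group"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_1", "feature_2", "group"]], df["label"],
    test_size=0.3, random_state=0,
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```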
From dd0ed63b713a0382fe6fef42f67a263d25183a20 Mon Sep 17 00:00:00 2001
From: Ayman Sandouk <111829133+AymanBx@users.noreply.github.com>
Date: Tue, 22 Jul 2025 16:24:07 +0300
Subject: [PATCH 3/4] Update README.md

---
 README.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index dba1ba8..614fc1c 100644
--- a/README.md
+++ b/README.md
@@ -13,12 +13,25 @@ how well can an LLM agent improve the training script to achieve high fairness metrics?
 
 ## Different LLM models used for agent
+We use a variety of open-source and paid LLMs:
+- Llama 3.3 70B (open source)
+- Qwen 2.5 72B (open source)
+- GPT-4o (paid)
+- Claude-sonnet 3.7 (paid)
 
-## Different LLM models used for eval
+## LLM models used for eval
+Gemma 3 27B - Instruction tuned
 
 ## What is baseline
+Run baseline on a task/list of tasks to run the baseline train.py provided in the task to compare the accuracy and fairness metrics with the values reached by the agents
 
 ## What does eval do?
+Run eval.sh with a list of tasks, the agent LLMs to be evaluated, and a selected evaluation LLM.
+eval.sh runs eval.py, which in turn runs the different levels of evaluation:
+- The task-specific eval.py (to evaluate accuracy and fairness metrics)
+- An LLM eval that evaluates the new training script generated by the agent, using the LLM evaluator
+- An LLM log eval that evaluates the agent's log, using the LLM evaluator
+- A Flake8 eval that statically checks the training script generated by the agent (Python AST inspection and fairness-library use)
 
 ## Reading eval results
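The task-specific eval.py mentioned in the patch above reports accuracy together with fairness metrics, and the next patch lists the metrics the benchmark tracks. As a rough illustration only (not the benchmark's own eval code), the group-fairness metrics for a binary classifier and a binary protected attribute can be computed along the following lines, using the usual unprivileged-versus-privileged definitions; empty groups and division-by-zero edge cases are not handled here.

```python
# Illustrative computation of common group-fairness metrics from predictions
# and a binary protected attribute. Sign/ratio conventions follow the usual
# unprivileged-vs-privileged definitions; this is not the benchmark's eval.py.
import numpy as np

def fairness_metrics(y_true, y_pred, group):
    """group == 1 marks the privileged group, group == 0 the unprivileged one."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    priv, unpriv = group == 1, group == 0

    sel_p, sel_u = y_pred[priv].mean(), y_pred[unpriv].mean()    # selection rates
    tpr_p = y_pred[priv & (y_true == 1)].mean()                  # true positive rates
    tpr_u = y_pred[unpriv & (y_true == 1)].mean()
    err_p = (y_pred != y_true)[priv].mean()                      # error rates
    err_u = (y_pred != y_true)[unpriv].mean()
    for_p = (y_true == 1)[priv & (y_pred == 0)].mean()           # false omission rates
    for_u = (y_true == 1)[unpriv & (y_pred == 0)].mean()

    return {
        "statistical_parity_difference": sel_u - sel_p,
        "disparate_impact": sel_u / sel_p,
        "equal_opportunity_difference": tpr_u - tpr_p,
        "error_rate_difference": err_u - err_p,
        "error_rate_ratio": err_u / err_p,
        "false_omission_rate_difference": for_u - for_p,
    }

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(fairness_metrics(y_true, y_pred, group))
```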
From f68d9a46666e84c0537bdcc08b421cbfd743c5c0 Mon Sep 17 00:00:00 2001
From: Ayman Sandouk <111829133+AymanBx@users.noreply.github.com>
Date: Wed, 13 Aug 2025 20:54:14 +0300
Subject: [PATCH 4/4] Update README.md with more details

---
 README.md | 69 ++++++++++++++++++++++++++++---------------------------
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/README.md b/README.md
index 614fc1c..d996e1e 100644
--- a/README.md
+++ b/README.md
@@ -9,21 +9,30 @@ Given a dataset and a very simple training script that uses a logistic regression model,
 how well can an LLM agent improve the training script to achieve high fairness metrics?
 
 ## Fairness Metrics:
-
+* To capture disparities:
+  * Disparate Impact
+  * Statistical Parity Difference
+* To assess differences in true positive rates:
+  * Equal Opportunity Difference
+* To quantify misclassification disparities:
+  * Error Rate Difference
+  * Error Rate Ratio
+* To examine disparities in false negatives across groups:
+  * False Omission Rate Difference
 
 ## Different LLM models used for agent
 We use a variety of open-source and paid LLMs:
-- Llama 3.3 70B (open source)
-- Qwen 2.5 72B (open source)
-- GPT-4o (paid)
-- Claude-sonnet 3.7 (paid)
+- Meta's Llama-3.3-70B (open source)
+- Alibaba's Qwen-2.5-72B (open source)
+- OpenAI's GPT-4o (paid)
+- Anthropic's Claude 3.7 Sonnet (paid)
 
-## LLM models used for eval
-Gemma 3 27B - Instruction tuned
+## LLM model used for eval
+Google's Gemma 3 27B (instruction-tuned)
 
-## What is baseline
-Run baseline on a task/list of tasks to run the baseline train.py provided in the task to compare the accuracy and fairness metrics with the values reached by the agents
+## Baseline
+Run baseline.sh on a task/list of tasks to run the baseline train.py provided with each task; this produces the accuracy and fairness metrics against which the agents' results are compared.
 
 ## What does eval do?
 Run eval.sh with a list of tasks, the agent LLMs to be evaluated, and a selected evaluation LLM.
 eval.sh runs eval.py, which in turn runs the different levels of evaluation:
 - The task-specific eval.py (to evaluate accuracy and fairness metrics)
 - An LLM eval that evaluates the new training script generated by the agent, using the LLM evaluator
 - An LLM log eval that evaluates the agent's log, using the LLM evaluator
 - A Flake8 eval that statically checks the training script generated by the agent (Python AST inspection and fairness-library use)
 
 ## Reading eval results
+- From the fairnessbench_analysis directory, run explode_results.py (making sure the result paths point to the folder eval.sh wrote to) to prepare CSV files with all the collected results
+- Use the other Python scripts in the analysis folder to generate the plots
 
-Roles:
-- task-specific: environment files for the task, the train.py, etc
-- benchmarking infrastructure: code needed to overall run benchmark, scoring etc (`eval-.py`)
-- agent: agent tools, agent prompts, etc
-
-
-| file | description | role|
-|--- |---| ---|
+## Roles
+- Task-specific: environment files for the task, the train.py, etc
+- Benchmarking infrastructure: code needed to overall run benchmark, scoring etc (`eval-.py`)
+- Agent: agent tools, agent prompts, etc
 
 ## Instructions for running the benchmark:
-Pick a task/list of tasks to run from tasks.json.
-Run using run_experiment.sh
+- Pick a task/list of tasks to run from tasks.json
+- Pick the LLMs you want to benchmark (make sure the required API keys are available)
+- Run using run_experiment.sh
 
 ### run_experiment.sh
 
-- log_dir
-  - The directory name for the llm to create logs
-- models
-  - The models you want to use for the tasks
+- log_dir: The path to a directory where the environment keeps the logs
+- models: The models you want to evaluate on the tasks
   - claude-2.1, gpt-4-0125-preview, gemini-pro, gpt-4o-mini, gpt-4o, llama, qwen, granite, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
-- edit_script_model
-
-- fast_llm
+- edit_script_model & fast_llm: LLMs used for smaller actions such as editing a script or summarizing a long observation; these can optionally differ from the main agent models
 
 ### eval.sh
 
-- log_dir
-  - directory that the llm placed the experiment logs
-- json_folder
-  -
-- all_tasks
-  - list of tasks to be evaluated
-- models
-  - Models being used to evaluate results
-- eval_model
+- log_dir: Directory where the agent LLM placed the experiment logs
+- json_folder: Directory to place the results in
+- all_tasks: List of tasks to be evaluated
+- models: The agent models whose results are being evaluated on the above tasks
+- eval_model: The LLM used to evaluate the training script generated by the agent and the log of the agent's thought/action process
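As a closing illustration of the Flake8/AST evaluation step mentioned above: a static check of an agent-generated training script might parse the file and record which fairness libraries it imports. The snippet below only sketches that idea; the library set and the helper function are hypothetical, not the benchmark's actual implementation.

```python
# Sketch of the idea behind the Flake8/AST check: parse an agent-generated
# train.py and record which fairness libraries it imports. The library set
# and this helper are illustrative, not the benchmark's own code.
import ast

FAIRNESS_LIBS = {"aif360", "fairlearn"}  # example libraries, not an official list

def fairness_imports(source):
    """Return the top-level fairness-library imports found in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & FAIRNESS_LIBS

generated = "from fairlearn.metrics import demographic_parity_difference\nimport numpy as np\n"
print(fairness_imports(generated))  # -> {'fairlearn'}
```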