This is a benchmark to evaluate AI capabilities to do fair, data-driven decision-making.

The benchmark consists of several tasks.

A fairnessBench task is defined as follows: given a dataset and a very simple training script that uses a logistic regression model, how well can an LLM agent improve the training script to achieve high fairness metrics?
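
For illustration, a baseline training script of this kind might look like the minimal sketch below (hypothetical: the dataset path, column names, and split are placeholders, not a task's actual train.py):

```python
# Hypothetical sketch of a baseline train.py (placeholder paths and column names,
# not the actual script shipped with a task).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("data.csv")        # assumed dataset file
X = data.drop(columns=["label"])      # assumed binary target column
y = data["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Plain logistic regression with no fairness intervention -- the starting
# point the agent is asked to improve.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```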

## Fairness Metrics:
* To capture disparities:
* Disparate Impact
* Statistical Parity Difference
* To assess differences in true positive rates:
* Equal Opportunity Difference
* To quantify misclassification disparities:
* Error Rate Difference
* Error Rate Ratio
* To examine disparities in false negatives across groups:
* False Omission Rate Difference
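
These are standard group-fairness quantities. A minimal sketch of how they can be computed from binary labels, predictions, and a privileged-group mask is shown below (illustrative only; the benchmark's own eval code defines the authoritative implementation):

```python
# Illustrative computation of the group-fairness metrics listed above.
# y_true and y_pred are NumPy 0/1 arrays; "priv" is a boolean mask that is
# True for the privileged group.
import numpy as np

def group_rates(y_true, y_pred, mask):
    """Selection rate, true-positive rate, error rate, and false-omission rate for one group."""
    yt, yp = y_true[mask], y_pred[mask]
    selection = yp.mean()                                     # P(y_hat = 1)
    tpr = yp[yt == 1].mean() if (yt == 1).any() else np.nan   # P(y_hat = 1 | y = 1)
    err = (yp != yt).mean()                                   # misclassification rate
    f_or = yt[yp == 0].mean() if (yp == 0).any() else np.nan  # P(y = 1 | y_hat = 0)
    return selection, tpr, err, f_or

def fairness_metrics(y_true, y_pred, priv):
    s_u, tpr_u, err_u, for_u = group_rates(y_true, y_pred, ~priv)
    s_p, tpr_p, err_p, for_p = group_rates(y_true, y_pred, priv)
    return {
        "disparate_impact": s_u / s_p,
        "statistical_parity_difference": s_u - s_p,
        "equal_opportunity_difference": tpr_u - tpr_p,
        "error_rate_difference": err_u - err_p,
        "error_rate_ratio": err_u / err_p,
        "false_omission_rate_difference": for_u - for_p,
    }
```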

| file | description | role |
| --- | --- | --- |

## LLM models used for the agent
We use a variety of open-source and paid LLMs:
- Meta's Llama-3.3-70B (open source)
- Alibaba's Qwen-2.5-72B (open source)
- OpenAI's GPT-4o (paid)
- Anthropic's Claude 3.7 Sonnet (paid)

## LLM model used for eval
Google's Gemma 3 27B (instruction-tuned)

## Baseline
Run baseline.sh on a task (or list of tasks) to execute the baseline train.py provided with the task; this gives the accuracy and fairness metrics to compare against the values reached by the agents.

## What does eval do?
Run eval.sh with a list of tasks, the agent LLMs to be evaluated, and a selected evaluation LLM.
eval.sh runs eval.py, which in turn runs the different levels of evaluation:
- The task-specific eval.py (to evaluate accuracy and fairness metrics)
- LLM eval, which evaluates the new training script generated by the agent using the LLM evaluator
- LLM log eval, which evaluates the agent's log using the LLM evaluator
- Flake8 eval, which analyzes the Python AST of the training script generated by the agent and checks for use of fairness libraries (see the sketch below)
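
For context, the AST-level check can be as simple as walking the import nodes of the generated script and flagging known fairness libraries; the snippet below is an illustrative sketch of that idea (the library names aif360 and fairlearn are assumptions, not necessarily the exact set the benchmark looks for):

```python
# Illustrative sketch: detect imports of fairness libraries in an agent-generated script.
# The library set is an assumption for illustration, not the benchmark's actual check.
import ast

FAIRNESS_LIBS = {"aif360", "fairlearn"}

def uses_fairness_library(path):
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in FAIRNESS_LIBS for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in FAIRNESS_LIBS:
                return True
    return False

print(uses_fairness_library("train.py"))
```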

## Reading eval results
- From the fairnessbench_analysis directory, run explode_results.py (make sure the result paths are set to the folder that eval.sh wrote its output to) to prepare CSV files with all the collected results
- Use the other Python scripts in the analysis folder to generate the plots
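
As a sketch of how the exploded results might then be inspected programmatically, assuming a CSV with per-run rows and hypothetical column names (model, task, accuracy, statistical_parity_difference — check the actual headers written by explode_results.py):

```python
# Hypothetical example: the CSV name and column names below are placeholders,
# not the actual schema produced by explode_results.py.
import pandas as pd

results = pd.read_csv("exploded_results.csv")

# Mean accuracy and one fairness metric per agent model, per task.
summary = results.groupby(["model", "task"])[
    ["accuracy", "statistical_parity_difference"]
].mean()
print(summary)
```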


## Roles
- Task-specific: environment files for the task, the train.py, etc.
- Benchmarking infrastructure: code needed to run the overall benchmark, scoring, etc. (`eval-<type>.py`)
- Agent: agent tools, agent prompts, etc.


## Instructions for running the benchmark:
- Pick a task (or list of tasks) to run from tasks.json
- Pick the LLMs you want to benchmark (make sure the required API keys are available)
- Run using run_experiment.sh

### run_experiment.sh

- Log_dir: path to a directory where the environment keeps the logs
- Models: the models you want to evaluate on the tasks
  - claude-2.1, gpt-4-0125-preview, gemini-pro, gpt-4o-mini, gpt-4o, llama, qwen, granite, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
- edit_script_model & fast_llm: LLMs used specifically for smaller actions such as editing a script or summarizing a long observation; these can optionally differ from the main agent models

### eval.sh

- Log_dir: directory in which the agent LLM placed the experiment logs
- json_folder: directory to place results in
- All tasks: list of tasks to be evaluated on
- Models: models being evaluated on the above tasks
- Eval_model: LLM used to evaluate the training script generated by the agent and the logs in which the agent recorded its thought/action process