This is a benchmark to evaluate AI capabilities to do fair, data-driven decision-making.

The benchmark consists of several tasks.

A fairnessBench task is defined as follows: given a dataset and a very simple training script that uses a logistic regression model, how well can an LLM agent improve the training script to achieve high fairness metrics?
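
For illustration, a baseline training script of this kind might look like the minimal sketch below (hypothetical: the dataset path, column names, and split are placeholders, not a task's actual train.py):

```python
# Hypothetical sketch of a baseline train.py (placeholder paths and column names,
# not the actual script shipped with a task).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("data.csv")        # assumed dataset file
X = data.drop(columns=["label"])      # assumed binary target column
y = data["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Plain logistic regression with no fairness intervention -- the starting
# point the agent is asked to improve.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```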

## Fairness Metrics:
* To capture disparities:
* Disparate Impact
* Statistical Parity Difference
* To assess differences in true positive rates:
* Equal Opportunity Difference
* To quantify misclassification disparities:
* Error Rate Difference
* Error Rate Ratio
* To examine disparities in false negatives across groups:
* False Omission Rate Difference
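
These are standard group-fairness quantities. A minimal sketch of how they can be computed from binary labels, predictions, and a privileged-group mask is shown below (illustrative only; the benchmark's own eval code defines the authoritative implementation):

```python
# Illustrative computation of the group-fairness metrics listed above.
# y_true and y_pred are NumPy 0/1 arrays; "priv" is a boolean mask that is
# True for the privileged group.
import numpy as np

def group_rates(y_true, y_pred, mask):
    """Selection rate, true-positive rate, error rate, and false-omission rate for one group."""
    yt, yp = y_true[mask], y_pred[mask]
    selection = yp.mean()                                     # P(y_hat = 1)
    tpr = yp[yt == 1].mean() if (yt == 1).any() else np.nan   # P(y_hat = 1 | y = 1)
    err = (yp != yt).mean()                                   # misclassification rate
    f_or = yt[yp == 0].mean() if (yp == 0).any() else np.nan  # P(y = 1 | y_hat = 0)
    return selection, tpr, err, f_or

def fairness_metrics(y_true, y_pred, priv):
    s_u, tpr_u, err_u, for_u = group_rates(y_true, y_pred, ~priv)
    s_p, tpr_p, err_p, for_p = group_rates(y_true, y_pred, priv)
    return {
        "disparate_impact": s_u / s_p,
        "statistical_parity_difference": s_u - s_p,
        "equal_opportunity_difference": tpr_u - tpr_p,
        "error_rate_difference": err_u - err_p,
        "error_rate_ratio": err_u / err_p,
        "false_omission_rate_difference": for_u - for_p,
    }
```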

| file | description | role |
| --- | --- | --- |

## LLM models used for the agent
We use a variety of open-source and paid LLMs:
- Meta's Llama-3.3-70B (open source)
- Alibaba's Qwen-2.5-72B (open source)
- OpenAI's GPT-4o (paid)
- Anthropic's Claude 3.7 Sonnet (paid)

## LLM model used for eval
Google's Gemma 3 27B (instruction-tuned)

## Baseline
Run baseline.sh on a task (or list of tasks) to execute the baseline train.py provided with the task; this gives the accuracy and fairness metrics to compare against the values reached by the agents.

## What does eval do?
Run eval.sh with a list of tasks, the agent LLMs to be evaluated, and a selected evaluation LLM.
eval.sh runs eval.py, which in turn runs the different levels of evaluation:
- The task-specific eval.py (to evaluate accuracy and fairness metrics)
- LLM eval, which evaluates the new training script generated by the agent using the LLM evaluator
- LLM log eval, which evaluates the agent's log using the LLM evaluator
- Flake8 eval, which analyzes the Python AST of the training script generated by the agent and checks for use of fairness libraries (see the sketch below)
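
For context, the AST-level check can be as simple as walking the import nodes of the generated script and flagging known fairness libraries; the snippet below is an illustrative sketch of that idea (the library names aif360 and fairlearn are assumptions, not necessarily the exact set the benchmark looks for):

```python
# Illustrative sketch: detect imports of fairness libraries in an agent-generated script.
# The library set is an assumption for illustration, not the benchmark's actual check.
import ast

FAIRNESS_LIBS = {"aif360", "fairlearn"}

def uses_fairness_library(path):
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in FAIRNESS_LIBS for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in FAIRNESS_LIBS:
                return True
    return False

print(uses_fairness_library("train.py"))
```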

## Reading eval results
- From the fairnessbench_analysis directory, run explode_results.py (make sure the result paths are set to the folder that eval.sh wrote its output to) to prepare CSV files with all the collected results
- Use the other Python scripts in the analysis folder to generate the plots
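
As a sketch of how the exploded results might then be inspected programmatically, assuming a CSV with per-run rows and hypothetical column names (model, task, accuracy, statistical_parity_difference — check the actual headers written by explode_results.py):

```python
# Hypothetical example: the CSV name and column names below are placeholders,
# not the actual schema produced by explode_results.py.
import pandas as pd

results = pd.read_csv("exploded_results.csv")

# Mean accuracy and one fairness metric per agent model, per task.
summary = results.groupby(["model", "task"])[
    ["accuracy", "statistical_parity_difference"]
].mean()
print(summary)
```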


## Roles
- Task-specific: environment files for the task, the train.py, etc.
- Benchmarking infrastructure: code needed to run the overall benchmark, scoring, etc. (`eval-<type>.py`)
- Agent: agent tools, agent prompts, etc.


## Instructions for running the benchmark:
- Pick a task (or list of tasks) to run from tasks.json
- Pick the LLMs you want to benchmark (make sure the required API keys are available)
- Run using run_experiment.sh

### run_experiment.sh

- Log_dir: path to a directory where the environment keeps the logs
- Models: the models you want to evaluate on the tasks
  - claude-2.1, gpt-4-0125-preview, gemini-pro, gpt-4o-mini, gpt-4o, llama, qwen, granite, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
- edit_script_model & fast_llm: LLMs used specifically for smaller actions such as editing a script or summarizing a long observation; these can optionally differ from the main agent models

### eval.sh

- Log_dir: directory in which the agent LLM placed the experiment logs
- json_folder: directory to place results in
- All tasks: list of tasks to be evaluated on
- Models: models being evaluated on the above tasks
- Eval_model: LLM used to evaluate the training script generated by the agent and the logs in which the agent recorded its thought/action process