diff --git a/README.md b/README.md
index fd2bc76..d996e1e 100644
--- a/README.md
+++ b/README.md
@@ -4,11 +4,71 @@ This is a benchmark to evaluate AI capabilities to do fair data driven decision-
 
 The benchmark consists of several tasks.
 
-Roles:
-- task-specific: environment files for the task, the train.py, etc
-- benchmarking infrastructure: code needed to overall run benchmark, scoring etc (`eval-.py`)
-- agent: agent tools, agent prompts, etc
+A fairnessBench task is defined as follows:
+given a dataset and a very simple training script that uses a logistic regression model,
+how well can an LLM agent improve the training script to achieve high fairness metrics?
+
+## Fairness Metrics
+* To capture disparities:
+  * Disparate Impact
+  * Statistical Parity Difference
+* To assess differences in true positive rates:
+  * Equal Opportunity Difference
+* To quantify misclassification disparities:
+  * Error Rate Difference
+  * Error Rate Ratio
+* To examine disparities in false negatives across groups:
+  * False Omission Rate Difference
-| file | description | role|
-|--- |---| ---|
+
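+A minimal sketch of how these metrics can be computed for binary labels/predictions and a binary protected attribute (illustrative only; the function and variable names below are assumptions, not the benchmark's actual eval code):
+
+```python
+import numpy as np
+
+def fairness_metrics(y_true, y_pred, privileged):
+    """Group-fairness metrics for binary labels/predictions (1 = favourable outcome).
+
+    privileged is a boolean array marking members of the privileged group.
+    """
+    y_true = np.asarray(y_true)
+    y_pred = np.asarray(y_pred)
+    priv = np.asarray(privileged, dtype=bool)
+    unpriv = ~priv
+
+    # Selection (positive-prediction) rate per group
+    sel_p, sel_u = y_pred[priv].mean(), y_pred[unpriv].mean()
+
+    # True-positive rate per group (for Equal Opportunity Difference)
+    tpr_p = y_pred[priv & (y_true == 1)].mean()
+    tpr_u = y_pred[unpriv & (y_true == 1)].mean()
+
+    # Overall error rate per group
+    err_p = (y_pred[priv] != y_true[priv]).mean()
+    err_u = (y_pred[unpriv] != y_true[unpriv]).mean()
+
+    # False omission rate per group: P(y = 1 | y_pred = 0)
+    for_p = y_true[priv & (y_pred == 0)].mean()
+    for_u = y_true[unpriv & (y_pred == 0)].mean()
+
+    return {
+        "disparate_impact": sel_u / sel_p,
+        "statistical_parity_difference": sel_u - sel_p,
+        "equal_opportunity_difference": tpr_u - tpr_p,
+        "error_rate_difference": err_u - err_p,
+        "error_rate_ratio": err_u / err_p,
+        "false_omission_rate_difference": for_u - for_p,
+    }
+```
+
+Fairness libraries such as AIF360 expose the same metrics out of the box; the Flake8 eval described below looks, among other things, for fairness-library use in the agent-generated script.
+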
+## Different LLM models used for agent
+We use a variety of open-source and paid LLMs:
+- Meta's Llama-3.3-70B (open source)
+- Alibaba's Qwen-2.5-72B (open source)
+- OpenAI's GPT-4o (paid)
+- Anthropic's Claude 3.7 Sonnet (paid)
+
+## LLM model used for eval
+Google's Gemma 3 27B (instruction-tuned)
+
+## Baseline
+Run baseline.sh on a task or list of tasks to run the baseline train.py provided with each task; this gives the baseline accuracy and fairness metrics to compare against the values reached by the agents.
+
+## What does eval do?
+Run eval.sh with a list of tasks, the agent LLMs to be evaluated, and a selected evaluation LLM.
+eval.sh runs eval.py, which in turn runs the different levels of evaluation:
+- The task's specific eval.py (to evaluate accuracy and fairness metrics)
+- LLM eval, which evaluates the new training script generated by the agent using the LLM evaluator
+- LLM log eval, which evaluates the agent's log using the LLM evaluator
+- Flake8 eval, which checks the training script generated by the agent via its Python AST and for fairness-library use (a sketch of this kind of check appears below, after the eval.sh parameters)
+
+## Reading eval results
+- From the fairnessbench_analysis directory, run explode_results.py (make sure to set the result paths to the folder that eval.sh output to) to prepare CSV files with all the collected results
+- Use the other Python scripts in the analysis folder to generate the plots
+
+
+## Roles
+- Task-specific: environment files for the task, the train.py, etc.
+- Benchmarking infrastructure: code needed to run the overall benchmark, scoring, etc. (`eval-.py`)
+- Agent: agent tools, agent prompts, etc.
+
+
+## Instructions for running the benchmark
+- Pick a task or list of tasks to run from tasks.json
+- Pick the LLMs you want to benchmark (make sure the required API keys are available)
+- Run using run_experiment.sh
+
+### run_experiment.sh
+
+- Log_dir: Path to a directory where the environment keeps the logs
+- Models: The models you want to evaluate on the tasks
+  - claude-2.1, gpt-4-0125-preview, gemini-pro, gpt-4o-mini, gpt-4o, llama, qwen, granite, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
+- edit_script_model & fast_llm: LLMs used specifically for smaller actions such as editing a script or summarizing a long observation; these can optionally be different from the main agent models
+
+### eval.sh
+
+- Log_dir: Directory where the LLM agent placed the experiment logs
+- json_folder: Directory to place the results in
+- All tasks: The list of tasks to evaluate on
+- Models: The models being evaluated on the above tasks
+- Eval_model: The LLM used to evaluate the training script generated by the agent and the logs in which the agent recorded its thought/action process
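+
+To illustrate the kind of static check the Flake8 eval step performs on the agent-generated training script, here is a minimal sketch using Python's standard ast module (the real check lives in the benchmark's eval code; the library set and function name here are assumptions):
+
+```python
+import ast
+
+# Hypothetical set of fairness libraries to look for; the benchmark's actual list may differ.
+FAIRNESS_LIBRARIES = {"aif360", "fairlearn"}
+
+def check_training_script(path):
+    """Parse the agent-generated script and report whether it is valid Python
+    and which fairness libraries it imports."""
+    with open(path) as f:
+        source = f.read()
+    try:
+        tree = ast.parse(source)
+    except SyntaxError as exc:
+        return {"parses": False, "error": str(exc), "fairness_imports": []}
+
+    imports = set()
+    for node in ast.walk(tree):
+        if isinstance(node, ast.Import):
+            imports.update(alias.name.split(".")[0] for alias in node.names)
+        elif isinstance(node, ast.ImportFrom) and node.module:
+            imports.add(node.module.split(".")[0])
+
+    return {"parses": True, "fairness_imports": sorted(imports & FAIRNESS_LIBRARIES)}
+```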