A framework for evaluating language models on different tasks.
- Python 3.x
- API keys for the model providers you intend to use
- Clone the repository:
git clone https://github.com/Sara-Rajaee/lingo_reason
cd lingo_reason
- Install dependencies:
pip install -r requirements.txt
- Set your API keys as environment variables:
export TOGETHER_API_KEY=your_key_here
export GEMINI_API_KEY=your_key_here
export OPENAI_API_KEY=your_key_here
You only need to set the keys for the model providers you plan to use.
For GPT-OSS models via a local vLLM proxy:
export GPT_OSS_API_BASE=http://localhost:19743/v1/  # proxy URL (default)
export GPT_OSS_API_KEY=no-key  # optional, proxy usually has no auth
- (Optional) Register the PolyMath repository as a Git submodule (needed for PolyMath evaluation):
git submodule add https://github.com/QwenLM/PolyMath.git third_party/PolyMath
git submodule update --init --recursive
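Before running an evaluation, it can help to confirm which provider keys from the setup steps above are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name `configured_providers` is hypothetical, and only the three variable names come from this README.

```python
import os

# Provider variable names as listed in the setup steps above.
PROVIDER_KEYS = ("TOGETHER_API_KEY", "GEMINI_API_KEY", "OPENAI_API_KEY")


def configured_providers(env=None):
    """Return the provider key names that are set to a non-empty value.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to test. This function is hypothetical, not part of the repo.
    """
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured keys:", ", ".join(found) if found else "none")
```

Since only the keys for the providers you plan to use are required, an incomplete list here is not an error in itself.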
GPT-OSS models are served locally via vLLM. A launch script is provided at scripts/serve.sh.
Prerequisites: Install vLLM (pip install vllm) and have GPU(s) available.
# Serve gpt-oss-20b (auto-detects available GPUs for tensor parallelism)
bash scripts/serve.sh openai/gpt-oss-20b
# Serve gpt-oss-120b on 4 GPUs, custom port
TP=4 PORT=8000 bash scripts/serve.sh openai/gpt-oss-120b
# Serve any HuggingFace model
bash scripts/serve.sh meta-llama/Llama-3.3-70B-Instruct
The server exposes an OpenAI-compatible API at http://localhost:19743/v1/ by default.
Once the server is running, use run.py in a separate terminal to evaluate against it.
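To check that the server is up before starting an evaluation, you can query its model-listing endpoint. The sketch below assumes the server follows the usual OpenAI-compatible layout, where `GET <base>/models` returns a JSON object with a `data` list of entries that each carry an `id`. The helper names are hypothetical and not part of this repository.

```python
import json
import os
import urllib.request


def models_url(api_base):
    """Build the model-listing URL for an OpenAI-compatible API base."""
    return api_base.rstrip("/") + "/models"


def list_served_models(api_base, timeout=5.0):
    """Return the model IDs reported by the server's /models endpoint."""
    with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    # Falls back to the default proxy URL documented in the setup steps.
    base = os.environ.get("GPT_OSS_API_BASE", "http://localhost:19743/v1/")
    try:
        print("Served models:", list_served_models(base))
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print(f"Server not reachable at {base}: {exc}")
```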
Environment variables for scripts/serve.sh:
| Variable | Default | Description |
|---|---|---|
| PORT | 19743 | API server port |
| TP | auto-detect | Tensor-parallel GPU count |
| DTYPE | auto | Model dtype (auto, float16, bfloat16) |
| MAX_MODEL_LEN | vLLM default | Max sequence length |
| GPU_UTIL | 0.9 | GPU memory utilization (0.0–1.0) |
| EXTRA_ARGS | | Additional vllm serve arguments |
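As an illustration of what these variables control, the sketch below maps them onto `vllm serve` flags. This is an assumption about how such a launcher could work, not a description of scripts/serve.sh itself: the function name is hypothetical, and the real script may build its command differently.

```python
import os
import shlex


def build_vllm_command(model, env=None):
    """Translate the documented variables into a `vllm serve` argument list.

    Hypothetical helper; the defaults mirror the table above.
    """
    env = os.environ if env is None else env
    cmd = [
        "vllm", "serve", model,
        "--port", env.get("PORT", "19743"),
        "--dtype", env.get("DTYPE", "auto"),
        "--gpu-memory-utilization", env.get("GPU_UTIL", "0.9"),
    ]
    # TP is omitted when unset. The real script auto-detects the GPU count,
    # which this sketch does not reproduce.
    if env.get("TP"):
        cmd += ["--tensor-parallel-size", env["TP"]]
    # MAX_MODEL_LEN is omitted when unset so that vLLM's own default applies.
    if env.get("MAX_MODEL_LEN"):
        cmd += ["--max-model-len", env["MAX_MODEL_LEN"]]
    # EXTRA_ARGS is split the way a POSIX shell would split it.
    cmd += shlex.split(env.get("EXTRA_ARGS", ""))
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_vllm_command("openai/gpt-oss-20b")))
```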
Run an evaluation by specifying a model and a task:
python run.py --model gemini-2.5-flash --task polymath
# GPT-OSS models (requires a running vLLM proxy)
python run.py --model gpt-oss-20b --task mmmlu
python run.py --model gpt-oss-120b --task polymath
To see all available models and tasks:
python run.py --list
To collect high-temperature samples for distillation:
python run.py --model gpt-oss-120b --task mgsm --distillation --distillation-samples 32
Distillation runs write to distilled_results/<model_name>/<task>/<subset>/, with sampled_outputs.json holding every sampled output and gold_outputs.json holding one gold-matching output per example. The gold file keeps the original task columns plus the selected reasoning trace.
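The output layout described above can be expressed as a small path helper, which is handy when post-processing several runs. This is a sketch based only on the directory pattern and file names given here; the function name is hypothetical.

```python
from pathlib import Path


def distillation_output_files(model_name, task, subset, root="distilled_results"):
    """Return the expected output files for one distillation run.

    Follows the documented pattern <root>/<model_name>/<task>/<subset>/.
    Hypothetical helper, not part of the repository.
    """
    out_dir = Path(root) / model_name / task / subset
    return {
        "sampled": out_dir / "sampled_outputs.json",  # every sampled output
        "gold": out_dir / "gold_outputs.json",        # one gold-matching output per example
    }


if __name__ == "__main__":
    for kind, path in distillation_output_files("gpt-oss-120b", "mgsm", "default").items():
        print(f"{kind}: {path} (exists: {path.exists()})")
```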
To set distillation sampling parameters, add distillation_temperature and distillation_top_p under the task's defaults in config/tasks.yaml. If they are not set, distillation falls back to the task's normal temperature and top_p.
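The fallback rule can be sketched in a few lines. The example assumes a task's defaults are available as a plain dictionary after loading config/tasks.yaml. The function name and the numeric values are illustrative, not taken from the repository.

```python
def resolve_sampling_params(task_defaults, distillation=False):
    """Pick temperature and top_p for a run.

    In distillation mode the distillation_* keys take precedence when
    present; otherwise the task's normal values are used.
    Hypothetical helper, not the repository's implementation.
    """
    temperature = task_defaults.get("temperature")
    top_p = task_defaults.get("top_p")
    if distillation:
        temperature = task_defaults.get("distillation_temperature", temperature)
        top_p = task_defaults.get("distillation_top_p", top_p)
    return {"temperature": temperature, "top_p": top_p}


if __name__ == "__main__":
    # Illustrative values: only distillation_temperature is overridden.
    defaults = {"temperature": 0.0, "top_p": 1.0, "distillation_temperature": 1.0}
    print(resolve_sampling_params(defaults))
    print(resolve_sampling_params(defaults, distillation=True))
```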
Use create_reasoning_dataloader to mix distilled reasoning data with math examples from OpenThoughts2-1M:
from src.data_generation import create_reasoning_dataloader

dataloader = create_reasoning_dataloader(
    distilled_path=[
        "distilled_results/gpt-oss-120b/lingoly/default/all_gold_outputs.json",
        "distilled_results/gpt-oss-120b/iolbench/default/all_gold_outputs.json",
    ],
    distilled_percentage=30,
    max_samples=1000,
    batch_size=8,
    seed=42,
)

# Inspect the first batch only.
for batch in dataloader:
    print(batch["dataset"])
    print(batch["question"][0])
    print(batch["reasoning"][0])
    print(batch["final_answer"][0])
    break
distilled_percentage=30 means 30% distilled examples and 70% OpenThoughts math examples. The first mixed run builds data/openthoughts_math_subset.json; later runs reuse it. max_samples=-1 includes all rows from the distilled linguistic reasoning JSON files and loads enough OpenThoughts2 math rows to satisfy the requested mix.
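The mixing arithmetic can be made concrete with a short sketch. It is not the loader's actual code: the function name is hypothetical, the rounding choices are assumptions, and it does not model what happens when fewer distilled rows exist than a fixed max_samples would require.

```python
import math


def mix_counts(distilled_percentage, max_samples, distilled_rows_available):
    """Estimate how many distilled and math examples a mix contains.

    Returns (n_distilled, n_math). Hypothetical helper with assumed rounding.
    """
    if not 0 < distilled_percentage <= 100:
        raise ValueError("distilled_percentage must be in (0, 100]")
    if max_samples == -1:
        # Use every distilled row, then add enough math rows to keep the ratio.
        n_distilled = distilled_rows_available
        n_math = math.ceil(n_distilled * (100 - distilled_percentage) / distilled_percentage)
    else:
        # Split a fixed budget according to the requested percentage.
        n_distilled = round(max_samples * distilled_percentage / 100)
        n_math = max_samples - n_distilled
    return n_distilled, n_math


if __name__ == "__main__":
    print(mix_counts(30, 1000, distilled_rows_available=5000))
    print(mix_counts(30, -1, distilled_rows_available=300))
```

Under these assumptions, distilled_percentage=30 with max_samples=1000 yields 300 distilled and 700 math examples.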
Task configurations (e.g. languages, number of evaluation samples) can be modified in the config files located in the config/ directory.