A comprehensive Armenian model evaluation framework for benchmarking large language models (LLMs).
ArmBench-LLM enables thorough evaluation of language models across multiple Armenian-specific and general datasets. The framework supports both vLLM-optimized inference and Hugging Face's AutoModelForCausalLM for flexible evaluation options.
- Evaluate models on specialized Armenian datasets (language, literature, history)
- Support for the Armenian version of MMLU-Pro (MMLU-Pro-Hy)
- vLLM optimization for faster inference
- Ability to resume interrupted evaluations
- Comprehensive scoring system
- Submit your model results to the ArmBench-LLM leaderboard
- Python 3.10.16
- CUDA-compatible GPU (preferable)
- Required dependencies (install via `pip install -r requirements.txt`)
- GPU preferable: this benchmark uses vLLM for optimized inference, which is designed for CUDA-compatible GPUs (a quick check is shown below).
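If you are unsure whether a CUDA device is visible in your environment, one quick way to check is with PyTorch (installed as a dependency of both vLLM and transformers):

```python
# Quick check that a CUDA-compatible GPU is visible to PyTorch,
# which vLLM relies on for optimized inference.
import torch

if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; vLLM optimisation will not be usable.")
```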
The benchmark already includes configuration files in the `configs` directory:
- `model_config.yaml` - contains model specification and parameters
- `generation_config.yaml` - contains text generation parameters
- `dataset_config.yaml` - contains dataset selection for evaluation
You can modify these existing config files according to your needs before running the benchmark.
Example of `configs/config.yaml`:

```yaml
model_config: "configs/model_config.yaml"
generation_config: "configs/generation_config.yaml"
dataset_config: "configs/dataset_config.yaml"
get_results: True
batch_size: 8
```

Example of `configs/model_config.yaml`:

```yaml
model_name: 'your_hf_model_name' # Required
vllm_optimisation: True # Optional, defaults to True
# Add other model parameters supported by vLLM or Hugging Face
# Examples:
# dtype: 'bfloat16'
# trust_remote_code: True
# revision: 'main'
```

Example of `configs/generation_config.yaml`:

```yaml
# Text generation parameters
temperature: 0.7
top_p: 0.9
top_k: 40
max_tokens: 512
# Add other generation parameters according to vLLM or Hugging Face documentation
```

Example of `configs/dataset_config.yaml`:

```yaml
datasets:
  - Armenian language and literature
  - Armenian history
  - Mathematics
  - mmlu_pro
```

If no dataset configuration is provided, the framework evaluates on all available datasets.
The benchmark includes the following configuration files:
- General configuration file: `config.yaml` references the individual configuration files and sets other parameters
- Individual configuration files:
  - `model_config.yaml` - contains model specification and parameters
  - `generation_config.yaml` - contains text generation parameters
  - `dataset_config.yaml` - contains dataset selection for evaluation
The main script supports command-line arguments for flexible configuration:
```bash
python main.py [--config CONFIG_PATH] [--model_config MODEL_CONFIG_PATH] \
               [--generation_config GEN_CONFIG_PATH] [--dataset_config DATASET_CONFIG_PATH] \
               [--get_results GET_RESULTS] [--batch_size BATCH_SIZE]
```

- `--config`: Path to the general YAML config file (default: `configs/config.yaml`)
- `--model_config`: Path to the model configuration file (default: `configs/model_config.yaml`)
- `--generation_config`: Path to the generation configuration file (default: `configs/generation_config.yaml`)
- `--dataset_config`: Path to the dataset configuration file (default: None)
- `--get_results`: Whether to run evaluation after generating completions (default: True)
- `--batch_size`: Batch size for processing (default: 8)
Using the general config file:

```bash
python main.py --config configs/config.yaml
```

Using individual configuration files:

```bash
python main.py --model_config configs/model_config.yaml \
               --generation_config configs/generation_config.yaml \
               --batch_size 4
```

The `get_results` parameter controls whether evaluation results are computed and saved automatically after running the main script (`main.py`).
- `get_results=True` (default): the script runs the evaluation and automatically generates a `results.json` file in the model's output directory.
- `get_results=False`: the main script does not calculate and save the evaluation results immediately. Instead, you can compute the results later by running the `evaluate_model.py` script.
This step is only required if you set `get_results=False` in the previous step. Run the evaluation script with:

```bash
python evaluate_model.py [--config CONFIG_PATH] \
                         [--model_config MODEL_CONFIG_PATH] \
                         [--dataset_config DATASET_CONFIG_PATH]
```

- `--config`: Path to the general YAML config file (default: `configs/config.yaml`)
- `--model_config`: Path to the model configuration file (default: `configs/model_config.yaml`)
- `--dataset_config`: Path to the dataset configuration file (default: None)
Important: You must use the same configuration values in `evaluate_model.py` as you used in `main.py` to ensure consistent evaluation results.
This will create a `results.json` file in the model's output directory with comprehensive scores.
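As a quick sanity check, you can load and pretty-print the file. This sketch only assumes `results.json` is valid JSON and makes no assumption about its exact key layout:

```python
# Minimal sketch: pretty-print the generated results file.
# Adjust the path to your model's output directory; the exact keys
# depend on which datasets you evaluated.
import json
from pathlib import Path

results_path = Path("model_outputs") / "your_model_name" / "results.json"  # hypothetical model name
with results_path.open(encoding="utf-8") as f:
    results = json.load(f)

print(json.dumps(results, ensure_ascii=False, indent=2))
```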
After evaluating your model, you can submit your results to the ArmBench-LLM leaderboard by following these steps:
- Push your model and tokenizer to the Hugging Face Hub
- Add the `ArmBench-LLM` tag to your model repository
- Include the `results.json` file in your repository
Important Submission Requirements: To be included in the official ArmBench leaderboard, your evaluation must include at least one of the following:
- All unified exam datasets (Armenian language and literature, Armenian history, Mathematics)
- The MMLU-Pro dataset (mmlu_pro)
Results that include only partial evaluations (e.g., only Armenian language and history tests) will be discarded from the Hugging Face ArmBench space.
Here's an example of how to push your model with results to the Hugging Face Hub:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import HfApi

hub_repo_name = "company_name/model_name"
results_path = "results.json"

# Load the evaluated model and tokenizer
model = AutoModelForCausalLM.from_pretrained(hub_repo_name)
tokenizer = AutoTokenizer.from_pretrained(hub_repo_name)

# Upload the benchmark results to the model repository
api = HfApi()
api.upload_file(
    path_or_fileobj=results_path,
    path_in_repo="results.json",
    repo_id=hub_repo_name,
    repo_type="model",
    commit_message="Add ARM benchmark results",
)

# Push the model and tokenizer with the ArmBench-LLM tag
model.push_to_hub(hub_repo_name, tags=['ArmBench-LLM'])
tokenizer.push_to_hub(hub_repo_name, tags=['ArmBench-LLM'])
```

After pushing your model with the `ArmBench-LLM` tag and a `results.json` file that meets the submission requirements, your model will appear on the ArmBench leaderboard.
During evaluation, the framework creates output directories in the following structure:
```
model_outputs/
└── {model_name}/
    ├── dataset_x_outputs.json
    └── results.json
```
Where `{model_name}` is the model name specified in your `model_config.yaml` file. Each model gets its own directory to store:
- Raw model responses for each dataset
- Final `results.json` with detailed performance metrics
The model_outputs directory also serves as a valuable resource for error analysis, allowing you to inspect the model's raw responses and understand its performance characteristics in detail.
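For example, a short script along these lines can be used to skim the raw output files; the `*outputs.json` naming and JSON layout are assumptions based on the directory listing above, so adapt them to the files you actually see:

```python
# Hedged sketch for error analysis: list the raw output files for a model
# and report how many entries each contains. File naming and structure
# are assumptions; inspect your own model_outputs/ directory first.
import json
from pathlib import Path

model_dir = Path("model_outputs") / "your_model_name"  # hypothetical model name

for output_file in sorted(model_dir.glob("*outputs.json")):
    with output_file.open(encoding="utf-8") as f:
        data = json.load(f)
    print(f"{output_file.name}: {len(data)} entries")
```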
If an evaluation is interrupted, you can resume it by running the main script again with the same configuration parameters.
Important: If you need to change any parameters during evaluation (model configuration, generation settings, etc.), you MUST delete the corresponding model directory from the `model_outputs` folder before continuing. For example, if you want to change parameters for a model named "google_gemma-2-2b-it", you should delete the entire `model_outputs/google_gemma-2-2b-it` directory.
It is strongly recommended to complete the entire evaluation with the same set of parameters, so that all evaluation data for a model is generated consistently. Failing to delete previous results when changing parameters may leave some parts of the dataset evaluated with different settings than others.
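If you prefer to remove stale outputs programmatically rather than from the shell, a minimal sketch (using the hypothetical model name from the example above) is:

```python
# Minimal sketch: remove a model's previous outputs before re-running
# with changed parameters. "google_gemma-2-2b-it" is the example model
# name used above; replace it with your own model's directory.
import shutil
from pathlib import Path

model_dir = Path("model_outputs") / "google_gemma-2-2b-it"
if model_dir.exists():
    shutil.rmtree(model_dir)
    print(f"Deleted {model_dir}")
else:
    print(f"Nothing to delete at {model_dir}")
```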
This benchmark evaluates Language Models on Armenian-specific tasks, including Armenian Unified Test Exams and a subsample of the MMLU-Pro benchmark, translated into Armenian (MMLU-Pro-Hy). It is designed to measure the models' understanding and generation capabilities in the Armenian language.
The framework includes the following datasets:
- Armenian language and literature
- Armenian history
- Mathematics
- mmlu_pro
You can evaluate on specific datasets by specifying them in the dataset_config.yaml file.
By default, the framework uses vLLM for optimized inference, which requires a CUDA-compatible GPU. If you do not have one, set `vllm_optimisation: False` in `model_config.yaml` to fall back to Hugging Face's AutoModelForCausalLM.
You can add any parameter supported by vLLM or Hugging Face's AutoModelForCausalLM to the model configuration file; see their respective documentation for details.
Similarly, you can customize text generation by adding parameters to the generation configuration file (an illustrative sketch follows this list):
- For vLLM: See vLLM Generation Parameters
- For Hugging Face: See Generation Strategies
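For illustration only (this is not the framework's internal code), here is roughly how the parameters from the example `generation_config.yaml` above are typically passed to each backend:

```python
# Illustrative only: how generation parameters such as those in
# generation_config.yaml are typically consumed by each backend.
# This is not ArmBench-LLM's internal implementation.

# vLLM: parameters go into a SamplingParams object
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, top_k=40, max_tokens=512)
# outputs = llm.generate(prompts, sampling_params)

# Hugging Face transformers: the same values are passed to model.generate()
# outputs = model.generate(**inputs, do_sample=True, temperature=0.7,
#                          top_p=0.9, top_k=40, max_new_tokens=512)
```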
Check out the current leaderboard at Metric-AI/ArmBench-LLM to see how your model compares to others on Armenian language tasks.
Contributions are welcome! Please feel free to submit a Pull Request.