GRAAL-Research/COLE

title: COLE !
emoji: 🐳
colorFrom: purple
colorTo: gray
sdk: docker
app_port: 7860

COLE: Comprehensive Benchmark for Quebec French Language Understanding Evaluation

COLE is a comprehensive benchmark for evaluating Quebec French Natural Language Understanding (NLU). It includes 23 diverse tasks covering sentiment analysis, paraphrase detection, natural language inference, question answering, grammatical judgment, word sense disambiguation, and more — with a particular focus on linguistic phenomena relevant to the French language.

We benchmark 94 large language models (LLMs), providing an extensive analysis of the current state of Quebec French NLU. Our results highlight a significant performance gap between closed- and open-weight models and identify key challenging frontiers such as zero-shot extractive question answering, fine-grained word sense disambiguation, and understanding of regional language variations.

Tasks

COLE consists of 23 tasks grouped by NLU capability:

Sentiment Analysis

| Task | Description | Test size |
|------|-------------|-----------|
| Allocine | Sentiment classification of French movie reviews (positive/negative) | 20,000 |
| MMS-fr | Sentiment analysis with 3 classes (positive, neutral, negative) | 63,190 |

Natural Language Inference (NLI)

| Task | Description | Test size |
|------|-------------|-----------|
| FraCaS | NLI involving quantifiers, plurality, anaphora, and ellipsis | 346 |
| GQNLI-fr | NLI with quantifier logic (e.g., most, at least, more than half) | 30 |
| LingNLI | NLI corpus constructed with a linguist in the loop | 4,893 |
| MNLI-nineeleven-Fr-MT | French machine-translated MNLI using 9/11 context | 2,000 |
| RTE3-Fr | French version of RTE3 for textual entailment | 3,121 |
| SICK-fr | Sentence pair relatedness and entailment | 4,906 |
| XNLI-fr | Cross-lingual NLI in French | 5,010 |

Question Answering

| Task | Description | Test size |
|------|-------------|-----------|
| FQuAD | Extractive QA on high-quality French Wikipedia articles | 400 |
| Fr-BoolQ | Boolean question answering in French | 178 |
| PIAF | French extractive QA pairs | 384 |

Paraphrase Detection

| Task | Description | Test size |
|------|-------------|-----------|
| PAWS-X | Paraphrase identification from sentence pairs | 2,000 |
| QFrBLiMP | Semantic equivalence detection between sentence pairs | 2,290 |

Grammatical Judgment

| Task | Description | Test size |
|------|-------------|-----------|
| DACCORD | Semantic plausibility of French sentences (binary) | 1,034 |
| MultiBLiMP-Fr | Grammatical correctness from minimal pairs | 77 |
| QFrCoLA | Sentence acceptability in French (grammar, syntax) | 7,546 |

Semantic Similarity

| Task | Description | Test size |
|------|-------------|-----------|
| STS22 | Document-level similarity of multilingual news articles | 72 |

Word Sense Disambiguation

| Task | Description | Test size |
|------|-------------|-----------|
| WSD-Fr | Disambiguating verb meanings in context | 3,121 |

Quebec French

| Task | Description | Test size |
|------|-------------|-----------|
| QFrCoRE | Matching Quebec French expressions to standard definitions | 4,633 |
| QFrCoRT | Matching Quebec French terms to standard definitions | 201 |

Coreference / Pronoun Resolution

| Task | Description | Test size |
|------|-------------|-----------|
| Wino-X-LM | Pronoun resolution with ambiguous referents | 2,793 |
| Wino-X-MT | Translation-based pronoun resolution with gendered pronouns | 2,988 |
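
For scripting against the benchmark, the task tables above can be mirrored as a small lookup structure. The following is a minimal sketch, not part of the COLE code base: the task names and test sizes are copied from the tables, while the `COLE_TASKS` dictionary and the helper functions `count_tasks`, `group_sizes`, and `total_test_examples` are hypothetical names introduced here for illustration.

```python
# Illustrative registry of COLE tasks grouped by NLU capability.
# Names and test sizes come from the tables above; the structure and
# helper functions are assumptions made for this sketch.
COLE_TASKS: dict[str, dict[str, int]] = {
    "Sentiment Analysis": {"Allocine": 20_000, "MMS-fr": 63_190},
    "Natural Language Inference (NLI)": {
        "FraCaS": 346,
        "GQNLI-fr": 30,
        "LingNLI": 4_893,
        "MNLI-nineeleven-Fr-MT": 2_000,
        "RTE3-Fr": 3_121,
        "SICK-fr": 4_906,
        "XNLI-fr": 5_010,
    },
    "Question Answering": {"FQuAD": 400, "Fr-BoolQ": 178, "PIAF": 384},
    "Paraphrase Detection": {"PAWS-X": 2_000, "QFrBLiMP": 2_290},
    "Grammatical Judgment": {
        "DACCORD": 1_034,
        "MultiBLiMP-Fr": 77,
        "QFrCoLA": 7_546,
    },
    "Semantic Similarity": {"STS22": 72},
    "Word Sense Disambiguation": {"WSD-Fr": 3_121},
    "Quebec French": {"QFrCoRE": 4_633, "QFrCoRT": 201},
    "Coreference / Pronoun Resolution": {
        "Wino-X-LM": 2_793,
        "Wino-X-MT": 2_988,
    },
}


def count_tasks(groups: dict[str, dict[str, int]]) -> int:
    """Return the number of tasks across all capability groups."""
    return sum(len(tasks) for tasks in groups.values())


def group_sizes(groups: dict[str, dict[str, int]]) -> dict[str, int]:
    """Return the summed test-set size of each capability group."""
    return {name: sum(tasks.values()) for name, tasks in groups.items()}


def total_test_examples(groups: dict[str, dict[str, int]]) -> int:
    """Return the summed test-set size over every task."""
    return sum(group_sizes(groups).values())


if __name__ == "__main__":
    print(count_tasks(COLE_TASKS))          # 23
    print(total_test_examples(COLE_TASKS))  # 131213
    for name, size in group_sizes(COLE_TASKS).items():
        print(f"{name}: {size:,}")
```

Test sizes range from 30 examples (GQNLI-fr) to 63,190 (MMS-fr), so a structure like this can help when deciding whether to weight per-task scores by test size or to average them uniformly.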

Language

All data in COLE is in French.

Citation

If you use COLE in your research, please cite our paper:

@article{beauchemin2025cole,
  title={COLE: a Comprehensive Benchmark for Quebec French Language Understanding Evaluation},
  author={Beauchemin, David and Tremblay, Yan and Youssef, Mohamed Amine and Khoury, Richard},
  journal={arXiv preprint arXiv:2510.05046},
  year={2025},
  url={https://arxiv.org/abs/2510.05046}
}