COLE: Comprehensive Benchmark for Quebec French Language Understanding Evaluation
COLE is a comprehensive benchmark for evaluating Quebec French Natural Language Understanding (NLU). It includes 23 diverse tasks covering sentiment analysis, paraphrase detection, natural language inference, question answering, grammatical judgment, word sense disambiguation, and more, with a particular focus on linguistic phenomena relevant to the French language.
We benchmark 94 large language models (LLMs), providing an extensive analysis of the current state of Quebec French NLU. Our results highlight a significant performance gap between closed- and open-weight models and identify key challenging frontiers such as zero-shot extractive question answering, fine-grained word sense disambiguation, and understanding of regional language variations.
COLE consists of 23 tasks grouped by NLU capability:
Sentiment Analysis
| Task | Description | Test size |
| --- | --- | ---: |
| Allocine | Sentiment classification of French movie reviews (positive/negative) | 20,000 |
| MMS-fr | Sentiment analysis with 3 classes (positive, neutral, negative) | 63,190 |
Natural Language Inference (NLI)
| Task | Description | Test size |
| --- | --- | ---: |
| FraCaS | NLI involving quantifiers, plurality, anaphora, and ellipsis | 346 |
| GQNLI-fr | NLI with quantifier logic (e.g., most, at least, more than half) | 30 |
| LingNLI | NLI corpus constructed with a linguist in the loop | 4,893 |
| MNLI-nineeleven-Fr-MT | French machine-translated MNLI using 9/11 context | 2,000 |
| RTE3-Fr | French version of RTE3 for textual entailment | 3,121 |
| SICK-fr | Sentence pair relatedness and entailment | 4,906 |
| XNLI-fr | Cross-lingual NLI in French | 5,010 |
Question Answering
| Task | Description | Test size |
| --- | --- | ---: |
| FQuAD | Extractive QA on high-quality French Wikipedia articles | 400 |
| Fr-BoolQ | Boolean question answering in French | 178 |
| PIAF | French extractive QA pairs | 384 |
Paraphrase Detection
| Task | Description | Test size |
| --- | --- | ---: |
| PAWS-X | Paraphrase identification from sentence pairs | 2,000 |
| QFrBLiMP | Semantic equivalence detection between sentence pairs | 2,290 |
Grammatical Judgment
| Task | Description | Test size |
| --- | --- | ---: |
| DACCORD | Semantic plausibility of French sentences (binary) | 1,034 |
| MultiBLiMP-Fr | Grammatical correctness from minimal pairs | 77 |
| QFrCoLA | Sentence acceptability in French (grammar, syntax) | 7,546 |
Semantic Similarity
| Task | Description | Test size |
| --- | --- | ---: |
| STS22 | Document-level similarity of multilingual news articles | 72 |
Word Sense Disambiguation
| Task | Description | Test size |
| --- | --- | ---: |
| WSD-Fr | Disambiguating verb meanings in context | 3,121 |
Quebec French
| Task | Description | Test size |
| --- | --- | ---: |
| QFrCoRE | Matching Quebec French expressions to standard definitions | 4,633 |
| QFrCoRT | Matching Quebec French terms to standard definitions | 201 |
Coreference / Pronoun Resolution
| Task | Description | Test size |
| --- | --- | ---: |
| Wino-X-LM | Pronoun resolution with ambiguous referents | 2,793 |
| Wino-X-MT | Translation-based pronoun resolution with gendered pronouns | 2,988 |
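The task inventory above can be summarized programmatically. The following is a minimal sketch, not part of the COLE codebase: it copies the test sizes listed in this README into a plain Python dictionary and derives per-category totals. The names `COLE_TASKS`, `category_totals`, and `small_tasks` are hypothetical, chosen only for illustration, and the 100-example threshold is an arbitrary assumption.

```python
# Test-set sizes copied from the task tables in this README.
# COLE_TASKS, category_totals and small_tasks are illustrative names,
# not identifiers from any COLE release.
COLE_TASKS = {
    "Sentiment Analysis": {"Allocine": 20_000, "MMS-fr": 63_190},
    "Natural Language Inference (NLI)": {
        "FraCaS": 346,
        "GQNLI-fr": 30,
        "LingNLI": 4_893,
        "MNLI-nineeleven-Fr-MT": 2_000,
        "RTE3-Fr": 3_121,
        "SICK-fr": 4_906,
        "XNLI-fr": 5_010,
    },
    "Question Answering": {"FQuAD": 400, "Fr-BoolQ": 178, "PIAF": 384},
    "Paraphrase Detection": {"PAWS-X": 2_000, "QFrBLiMP": 2_290},
    "Grammatical Judgment": {
        "DACCORD": 1_034,
        "MultiBLiMP-Fr": 77,
        "QFrCoLA": 7_546,
    },
    "Semantic Similarity": {"STS22": 72},
    "Word Sense Disambiguation": {"WSD-Fr": 3_121},
    "Quebec French": {"QFrCoRE": 4_633, "QFrCoRT": 201},
    "Coreference / Pronoun Resolution": {
        "Wino-X-LM": 2_793,
        "Wino-X-MT": 2_988,
    },
}


def category_totals(tasks: dict) -> dict:
    """Sum the test-set sizes of all tasks within each category."""
    return {category: sum(sizes.values()) for category, sizes in tasks.items()}


def small_tasks(tasks: dict, threshold: int = 100) -> list:
    """Return the names of tasks with fewer than `threshold` test examples."""
    return sorted(
        name
        for sizes in tasks.values()
        for name, size in sizes.items()
        if size < threshold
    )


num_tasks = sum(len(sizes) for sizes in COLE_TASKS.values())
total_examples = sum(category_totals(COLE_TASKS).values())

if __name__ == "__main__":
    print(f"{num_tasks} tasks, {total_examples:,} test examples in total")
    for category, total in category_totals(COLE_TASKS).items():
        print(f"{category}: {total:,}")
    print("Tasks under 100 examples:", ", ".join(small_tasks(COLE_TASKS)))
```

A summary like this makes the size imbalance visible: a handful of tasks (GQNLI-fr, STS22, MultiBLiMP-Fr) have fewer than 100 test examples, so an unweighted average over tasks and an example-weighted average can differ noticeably.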
Language
All data in COLE is in French.
Citation
If you use COLE in your research, please cite our paper:
@article{beauchemin2025cole,
  title={COLE: a Comprehensive Benchmark for Quebec French Language Understanding Evaluation},
  author={Beauchemin, David and Tremblay, Yan and Youssef, Mohamed Amine and Khoury, Richard},
  journal={arXiv preprint arXiv:2510.05046},
  year={2025},
  url={https://arxiv.org/abs/2510.05046}
}
About
COLE: Comprehensive Benchmark for Quebec French Language Understanding Evaluation (ICLR 2025 Workshop)