feat: add new ranking tasks for MELO #37
Merged
Mattdl merged 34 commits into techwolf-ai:main on Feb 26, 2026
Conversation
…y with dataset_id abstraction
Introduce a DatasetLanguages type and a LanguageAggregationMode enum with three modes: monolingual_only, crosslingual_group_input_languages, and crosslingual_group_output_languages.

Rename get_dataset_language() to get_dataset_languages(); it now returns input and output language sets. Replace MetricsResult.language with input_languages/output_languages. Datasets incompatible with the chosen aggregation mode are skipped rather than causing a failure.

BREAKING CHANGE: MetricsResult.language is replaced by input_languages/output_languages.
Skip datasets incompatible with the chosen LanguageAggregationMode before evaluation to avoid unnecessary compute. Extract get_language_grouping_key() as a shared function in types.py so the same compatibility logic is reused by both the eval-time filter (_filter_pending_work) and the aggregation-time skip in results.py.

- Add get_language_grouping_key() to types.py; refactor BenchmarkResults._get_language_grouping_key() to delegate to it
- Add ExecutionMode enum (LAZY / ALL)
- Add language_aggregation_mode and execution_mode params to evaluate()
- Add language_aggregation_mode param to get_summary_metrics()
- Export ExecutionMode and LanguageAggregationMode from workrb
- Add tests for the shared function and MELO-like filtering scenarios
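To illustrate the idea, here is a minimal sketch of what a shared grouping-key helper could look like. The type and enum names come from the commit above, but the exact signatures, enum member spellings, and skip semantics in workrb's types.py are assumptions, not the actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LanguageAggregationMode(Enum):
    MONOLINGUAL_ONLY = "monolingual_only"
    CROSSLINGUAL_GROUP_INPUT_LANGUAGES = "crosslingual_group_input_languages"
    CROSSLINGUAL_GROUP_OUTPUT_LANGUAGES = "crosslingual_group_output_languages"


@dataclass(frozen=True)
class DatasetLanguages:
    input_languages: frozenset
    output_languages: frozenset


def get_language_grouping_key(
    langs: DatasetLanguages, mode: LanguageAggregationMode
) -> Optional[str]:
    """Return the language bucket for a dataset, or None when the dataset
    is incompatible with the mode and should be skipped."""
    if mode is LanguageAggregationMode.MONOLINGUAL_ONLY:
        # Keep only datasets whose queries and corpus share a single language.
        if (
            langs.input_languages == langs.output_languages
            and len(langs.input_languages) == 1
        ):
            return next(iter(langs.input_languages))
        return None
    if mode is LanguageAggregationMode.CROSSLINGUAL_GROUP_INPUT_LANGUAGES:
        return "_".join(sorted(langs.input_languages))
    # CROSSLINGUAL_GROUP_OUTPUT_LANGUAGES
    return "_".join(sorted(langs.output_languages))
```

Returning None for incompatible datasets lets the same function drive both the eval-time filter and the aggregation-time skip, which is the point of hoisting it into types.py.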
Conflicts:
- README.md
- src/workrb/tasks/ranking/__init__.py
The base class interface evolved from get_dataset_language() (singular, returning Language | None) to get_dataset_languages() (plural, returning DatasetLanguages with input/output language sets). Update MELORanking and MELSRanking to implement the current interface, reusing the existing _parse_dataset_id() logic. This enables lazy execution filtering and per-language aggregation for these tasks.

- Rename LANGUAGE_TO_DATASETS to MELO_DATASET_IDS / MELS_DATASET_IDS
- Replace get_dataset_language() with get_dataset_languages()
- Add tests covering get_dataset_languages() for all MELO and MELS dataset IDs, including monolingual, cross-lingual, and multilingual corpus cases
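As a rough sketch of the dataset-ID parsing this relies on: MELO IDs later in this PR follow a "<country>_q_<query-lang>_c_<corpus-lang>" pattern (e.g. aut_q_de_c_en). The helper below is hypothetical — the real _parse_dataset_id() in workrb may have a different name, signature, and return type — but it shows how such an ID could be split into input/output language sets.

```python
def parse_melo_dataset_id(dataset_id: str):
    """Split an ID like 'bel_q_fr_c_en' into (country, input_langs, output_langs).

    Hypothetical helper for illustration; not the actual workrb implementation.
    """
    parts = dataset_id.split("_")
    if len(parts) != 5 or parts[1] != "q" or parts[3] != "c":
        raise ValueError(f"unexpected dataset id format: {dataset_id}")
    country, _, query_lang, _, corpus_lang = parts
    # Query language drives the input set; corpus language drives the output set.
    return country, {query_lang}, {corpus_lang}
```

With this shape, a cross-lingual dataset such as bel_q_fr_c_en naturally yields differing input ({"fr"}) and output ({"en"}) sets, which is what the plural get_dataset_languages() interface needs.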
This branch is now up-to-date with the latest changes in #34.
…evaluate()

Address PR techwolf-ai#34 review feedback from @Mattdl. The aggregation mode now flows consistently through the entire evaluation and aggregation pipeline instead of being an optional parameter.
…ized-index-melo-ranking-tasks
Override get_dataset_languages() and languages_to_dataset_ids() in the freelancer candidate ranking tasks, replacing the Language.CROSS sentinel with a proper DATASET_LANGUAGES_MAP that describes each dataset's input and output languages.

This lets the aggregation and filtering logic handle the multilingual dataset ("cross_lingual" renamed to "multilingual") as a regular entry rather than a special-cased language enum value.
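A map-based override along these lines can be sketched as follows. The map entries and the selection rule are invented for illustration — the actual DATASET_LANGUAGES_MAP contents and the real languages_to_dataset_ids() semantics in workrb may differ.

```python
# Hypothetical entries: (input languages, output languages) per dataset ID.
DATASET_LANGUAGES_MAP = {
    "english": ({"en"}, {"en"}),
    "french": ({"fr"}, {"fr"}),
    "multilingual": ({"en", "fr", "de"}, {"en"}),
}


def languages_to_dataset_ids(languages):
    """Return dataset IDs whose input languages are covered by the request.

    Assumed selection rule for this sketch; the real override may differ.
    """
    requested = set(languages)
    return [
        dataset_id
        for dataset_id, (inputs, _outputs) in DATASET_LANGUAGES_MAP.items()
        if inputs <= requested
    ]
```

The benefit over a Language.CROSS sentinel is that the multilingual dataset carries real language sets, so the same filtering and aggregation code paths apply to it with no special case.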
…ized-index-melo-ranking-tasks
Per-task aggregation now groups datasets by language before averaging, giving equal weight to each language regardless of how many datasets it contains. Previously, all compatible datasets were flat-averaged, which over-represented languages with more datasets. Add SKIP_LANGUAGE_AGGREGATION mode for users who want the old flat average with no filtering and no per-language output.
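A toy example of the difference between the two averages, with made-up scores and dataset names: two German datasets and one French dataset.

```python
# Made-up per-dataset scores for illustration.
scores = {"de_a": 0.80, "de_b": 0.90, "fr_a": 0.60}
dataset_language = {"de_a": "de", "de_b": "de", "fr_a": "fr"}

# Old behavior: flat average over datasets; German counts twice.
flat_avg = sum(scores.values()) / len(scores)  # ≈ 0.767

# New behavior: average within each language first, then across languages.
by_language = {}
for dataset_id, score in scores.items():
    by_language.setdefault(dataset_language[dataset_id], []).append(score)
language_means = {lang: sum(s) / len(s) for lang, s in by_language.items()}
language_weighted_avg = sum(language_means.values()) / len(language_means)  # 0.725
```

The flat average (≈ 0.767) is pulled toward German because it has more datasets; the language-weighted average ((0.85 + 0.60) / 2 = 0.725) gives each language equal weight, which is what the new default computes. SKIP_LANGUAGE_AGGREGATION preserves the old flat behavior.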
Add four example scripts illustrating the two aggregation modes (language-weighted vs flat average) combined with task-level language filtering (selected subset vs all available languages).
…ized-index-melo-ranking-tasks
Add 6 new MELO dataset IDs for Austria (aut) and Belgium (bel):
- aut_q_de_c_de, aut_q_de_c_en
- bel_q_fr_c_fr, bel_q_fr_c_en
- bel_q_nl_c_nl, bel_q_nl_c_en

Update test expectations in TestMELORankingDatasetIds accordingly.
… UK as supported language
…ized-index-melo-ranking-tasks
… naming conventions

Addresses PR review feedback requesting richer documentation for users without domain knowledge. Both docstrings now include scope and stats, corpus construction details (surface forms from ESCO concepts), dataset variant descriptions (monolingual vs cross-lingual), the dataset ID naming convention, and concrete examples using real data from the benchmarks.
Mattdl approved these changes on Feb 26, 2026
Mattdl (Collaborator) left a comment
LGTM! Thanks again for the really impactful contributions 🚀
Addresses #30
Description
This PR introduces the MELO Benchmark (Multilingual Entity Linking of Occupations) as a new ranking task for job title normalization into ESCO. MELO provides 42 evaluation datasets spanning 21 languages, built from crosswalks between national occupation taxonomies and ESCO published by official labor-related organizations across EU member states.
Additionally, we include MELS (Multilingual Entity Linking of Skills), a sibling benchmark following the same methodology but targeting skill normalization into ESCO Skills rather than occupations. MELS currently covers 5 languages with 8 datasets, providing complementary evaluation coverage for the skill normalization task group.
This PR is built on top of #34, which introduces a refactor with the generalized dataset indexing infrastructure required for this implementation. As such, this PR is contingent on #34 being merged. If the maintainers prefer a different approach for the refactor, I would be happy to adapt the implementation accordingly.
Changes:
- MELORanking task class with 42 datasets across 21 languages for job normalization
- MELSRanking task class with 8 datasets across 5 languages for skill normalization
- RankingDataset constructor extended to support an allow_duplicate_targets parameter (required by MELO)

Checklist