
feat: add new ranking tasks for melo#37

Merged
Mattdl merged 34 commits into techwolf-ai:main from
federetyk:feat/generalized-index-melo-ranking-tasks
Feb 26, 2026
Conversation

@federetyk
Contributor

Addresses #30

Description

This PR introduces the MELO Benchmark (Multilingual Entity Linking of Occupations) as a new ranking task for job title normalization into ESCO. MELO provides 42 evaluation datasets spanning 21 languages, built from crosswalks between national occupation taxonomies and ESCO published by official labor-related organizations across EU member states.

Additionally, we include MELS (Multilingual Entity Linking of Skills), a sibling benchmark following the same methodology but targeting skill normalization into ESCO Skills rather than occupations. MELS currently covers 5 languages with 8 datasets, providing complementary evaluation coverage for the skill normalization task group.

This PR is built on top of #34, which introduces a refactor with the generalized dataset indexing infrastructure required for this implementation. As such, this PR is contingent on #34 being merged. If the maintainers prefer a different approach for the refactor, I would be happy to adapt the implementation accordingly.

Changes:

  • Add MELORanking task class with 42 datasets across 21 languages for job normalization
  • Add MELSRanking task class with 8 datasets across 5 languages for skill normalization
  • Extend RankingDataset constructor to support allow_duplicate_targets parameter (required by MELO)
  • Add unit tests for dataset ID filtering logic with various language combinations
  • Add defensive check in e2e test to skip tasks with no datasets for the requested language set
  • Update README with new task entries
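As a rough illustration of the allow_duplicate_targets extension listed above, here is a minimal sketch. The real RankingDataset constructor in workrb takes more arguments; the validation behavior shown here is an assumption, only meant to convey why MELO needs the flag (several national taxonomy entries can map to the same ESCO target).

```python
from dataclasses import dataclass


@dataclass
class RankingDataset:
    """Hypothetical sketch of the extended constructor; not the real workrb class."""

    queries: list[str]
    targets: list[str]
    allow_duplicate_targets: bool = False  # new parameter, required by MELO

    def __post_init__(self) -> None:
        # Assumed behavior: reject duplicate targets unless explicitly allowed.
        if not self.allow_duplicate_targets:
            duplicates = {t for t in self.targets if self.targets.count(t) > 1}
            if duplicates:
                raise ValueError(f"Duplicate targets found: {sorted(duplicates)}")
```

With the flag set, a MELO-style dataset where two query titles normalize to the same ESCO occupation can be constructed without error.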

Checklist

  • Added new tests for new functionality
  • Tested locally with example tasks
  • Code follows project style guidelines
  • Documentation updated
  • No new warnings introduced

Introduce DatasetLanguages type and LanguageAggregationMode enum with
three modes: monolingual_only, crosslingual_group_input_languages,
and crosslingual_group_output_languages.

Rename get_dataset_language() to get_dataset_languages(), now returning
input and output language sets. Replace MetricsResult.language with
input_languages/output_languages. Datasets incompatible with the chosen
aggregation mode are skipped rather than causing a failure.

BREAKING CHANGE: MetricsResult.language replaced by input_languages/output_languages
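A minimal sketch of what the new type and enum could look like, based on the names in the commit message above. The field and property names beyond those mentioned (e.g. is_monolingual) are assumptions, not the actual workrb API.

```python
from dataclasses import dataclass
from enum import Enum


class LanguageAggregationMode(Enum):
    """The three aggregation modes named in the commit message."""

    MONOLINGUAL_ONLY = "monolingual_only"
    CROSSLINGUAL_GROUP_INPUT_LANGUAGES = "crosslingual_group_input_languages"
    CROSSLINGUAL_GROUP_OUTPUT_LANGUAGES = "crosslingual_group_output_languages"


@dataclass(frozen=True)
class DatasetLanguages:
    """Input and output language sets for one dataset (sketch)."""

    input_languages: frozenset[str]
    output_languages: frozenset[str]

    @property
    def is_monolingual(self) -> bool:
        # Assumed convenience check: one language, same on both sides.
        return (
            self.input_languages == self.output_languages
            and len(self.input_languages) == 1
        )
```

Under MONOLINGUAL_ONLY, a dataset with input {"fr"} and output {"en"} would be skipped rather than failing, matching the behavior described above.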
Skip datasets incompatible with the chosen LanguageAggregationMode
before evaluation to avoid unnecessary compute. Extract
get_language_grouping_key() as a shared function in types.py so
the same compatibility logic is reused by both the eval-time filter
(_filter_pending_work) and the aggregation-time skip in results.py.

- Add get_language_grouping_key() to types.py, refactor
  BenchmarkResults._get_language_grouping_key() to delegate to it
- Add ExecutionMode enum (LAZY / ALL)
- Add language_aggregation_mode and execution_mode params to evaluate()
- Add language_aggregation_mode param to get_summary_metrics()
- Export ExecutionMode and LanguageAggregationMode from workrb
- Add tests for shared function and MELO-like filtering scenarios
Conflicts:
- README.md
- src/workrb/tasks/ranking/__init__.py
The base class interface evolved from get_dataset_language() (singular,
returning Language | None) to get_dataset_languages() (plural, returning
DatasetLanguages with input/output language sets). Update MELORanking
and MELSRanking to implement the current interface, reusing the existing
_parse_dataset_id() logic. This enables lazy execution filtering and
per-language aggregation for these tasks.

- Rename LANGUAGE_TO_DATASETS to MELO_DATASET_IDS / MELS_DATASET_IDS
- Replace get_dataset_language() with get_dataset_languages()
- Add tests covering get_dataset_languages() for all MELO and MELS
  dataset IDs, including monolingual, cross-lingual, and multilingual
  corpus cases
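A self-contained sketch of the dataset-ID parsing these tasks rely on, based on IDs such as aut_q_de_c_de (country, query language, corpus language). The helper names echo _parse_dataset_id() and get_dataset_languages() from the PR, but the signatures here are assumptions.

```python
def parse_dataset_id(dataset_id: str) -> tuple[str, str, str]:
    """Split an ID like 'aut_q_de_c_de' into (country, query_lang, corpus_lang)."""
    country, rest = dataset_id.split("_q_", 1)
    query_lang, corpus_lang = rest.split("_c_", 1)
    return country, query_lang, corpus_lang


def get_dataset_languages(dataset_id: str) -> tuple[set[str], set[str]]:
    """Return (input_languages, output_languages) for a single dataset (sketch)."""
    _, query_lang, corpus_lang = parse_dataset_id(dataset_id)
    return {query_lang}, {corpus_lang}
```

For a cross-lingual dataset like bel_q_nl_c_en, this yields input {"nl"} and output {"en"}, which is exactly what the aggregation modes above need to decide whether to group or skip it.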
@federetyk
Contributor Author

This branch is now up-to-date with the latest changes in #34.

…evaluate()

Address PR techwolf-ai#34 review feedback from @Mattdl. The aggregation mode now flows
consistently through the entire evaluation and aggregation pipeline
instead of being an optional parameter.
Override get_dataset_languages() and languages_to_dataset_ids() in the
freelancer candidate ranking tasks, replacing the Language.CROSS sentinel
with a proper DATASET_LANGUAGES_MAP that describes each dataset's input
and output languages.

This lets the aggregation and filtering logic handle the multilingual
dataset ("cross_lingual" renamed to "multilingual") as a regular entry
rather than a special-cased language enum value.
Per-task aggregation now groups datasets by language before averaging,
giving equal weight to each language regardless of how many datasets
it contains. Previously, all compatible datasets were flat-averaged,
which over-represented languages with more datasets.

Add SKIP_LANGUAGE_AGGREGATION mode for users who want the old flat
average with no filtering and no per-language output.
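The language-weighted averaging described above can be sketched as follows. The function name and signature are assumptions (the real aggregation operates on MetricsResult objects), but the two-level mean captures the fix: each language contributes equally, however many datasets it has.

```python
from collections import defaultdict
from statistics import mean


def aggregate(
    scores: dict[str, float],
    dataset_language: dict[str, str],
    language_weighted: bool = True,
) -> float:
    """Average per-dataset scores; optionally weight each language equally."""
    if not language_weighted:
        # Old behavior: flat average over all datasets, which over-represents
        # languages with many datasets.
        return mean(scores.values())
    # New behavior: average within each language first, then across languages.
    by_lang: defaultdict[str, list[float]] = defaultdict(list)
    for ds_id, score in scores.items():
        by_lang[dataset_language[ds_id]].append(score)
    return mean(mean(lang_scores) for lang_scores in by_lang.values())
```

With two German datasets scoring 1.0 and 0.0 and one French dataset scoring 0.0, the flat average is 1/3, while the language-weighted average is 0.25 (German contributes 0.5, French 0.0).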
Add four example scripts illustrating the two aggregation modes
(language-weighted vs flat average) combined with task-level
language filtering (selected subset vs all available languages).
Add 6 new MELO dataset IDs for Austria (aut) and Belgium (bel):
- aut_q_de_c_de, aut_q_de_c_en
- bel_q_fr_c_fr, bel_q_fr_c_en
- bel_q_nl_c_nl, bel_q_nl_c_en

Update test expectations in TestMELORankingDatasetIds accordingly.
Collaborator

@Mattdl left a comment


Left some minor comments (might be outdated already, in which case you can ignore), and just needs to resolve conflicts from latest merge with #34

Then looks ready to merge! (: 🚀

… naming conventions

Addresses PR review feedback requesting richer documentation for users
without domain knowledge. Both docstrings now include scope and stats,
corpus construction details (surface forms from ESCO concepts), dataset
variant descriptions (monolingual vs cross-lingual), the dataset ID
naming convention, and concrete examples using real data from the
benchmarks.
@federetyk
Contributor Author

@Mattdl Thanks for the review! I have expanded the class docstrings for both tasks in the latest commit. The merge conflicts with #34 have already been resolved as well. Let me know if there is anything else needed.

@federetyk requested a review from Mattdl on February 26, 2026 at 14:52
Collaborator

@Mattdl left a comment


LGTM! Thanks again for the really impactful contributions 🚀

@Mattdl merged commit 5f27df6 into techwolf-ai:main on Feb 26, 2026
2 checks passed
@federetyk deleted the feat/generalized-index-melo-ranking-tasks branch on February 26, 2026 at 17:45