
refactor: generalize dataset indexing from language-based to dataset_id-based#34

Open
federetyk wants to merge 20 commits into techwolf-ai:main from federetyk:refactor/generalize-dataset-indexing

Conversation

@federetyk
Contributor

Addresses #33

Description

This PR generalizes dataset indexing within tasks from the Language enum to arbitrary string identifiers (dataset_id). The current architecture limits each task to at most one dataset per language, which prevents supporting tasks with multiple monolingual datasets per language, cross-lingual datasets, or multilingual datasets.

The refactor introduces a languages_to_dataset_ids() method with a default 1:1 mapping that preserves backward compatibility for existing tasks. Tasks that require more complex dataset structures can override this method to return custom identifiers. A new get_dataset_language() method maps datasets back to their language for proper per-language result aggregation, returning None for cross-lingual or multilingual datasets.
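As a rough sketch, the default 1:1 mapping described above might look like the following (hedged: the real workrb Task base class has more machinery, and the Language values here are illustrative):

```python
from __future__ import annotations

from enum import Enum


class Language(str, Enum):
    EN = "en"
    FR = "fr"


class Task:
    def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
        # Default 1:1 mapping: one dataset per language, id = language code.
        return [language.value for language in languages]

    def get_dataset_language(self, dataset_id: str) -> Language | None:
        # Inverse of the default mapping; None for ids that do not name a
        # single language (e.g. cross-lingual or multilingual datasets).
        try:
            return Language(dataset_id)
        except ValueError:
            return None


task = Task()
assert task.languages_to_dataset_ids([Language.EN, Language.FR]) == ["en", "fr"]
assert task.get_dataset_language("en") is Language.EN
assert task.get_dataset_language("en-fr") is None  # cross-lingual id
```

Tasks with more complex dataset structures override both methods together, keeping the id scheme and its inverse consistent.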

Changes:

  • Rename lang_datasets: dict[Language, Dataset] to datasets: dict[str, Dataset] in Task base class
  • Add languages_to_dataset_ids(languages) -> list[str] method with default backward-compatible mapping
  • Rename load_monolingual_data(language, split) to load_dataset(dataset_id, split) across all tasks
  • Add get_dataset_language(dataset_id) -> Language | None method for per-language aggregation
  • Add language field to MetricsResult to track dataset language
  • Update _aggregate_per_language() to group by the language field, skipping datasets marked as cross-lingual or multilingual
  • Update all task implementations to use the new method signature
  • Add unit test for multi-dataset task scenarios
  • Fix minor issues in some files in examples/

All tests pass, and the output of examples/run_multiple_models.py produces results consistent with the main branch.

Checklist

  • Added new tests for new functionality
  • Tested locally with example tasks
  • Code follows project style guidelines
  • Documentation updated
  • No new warnings introduced

Collaborator

@Mattdl Mattdl left a comment


I left some comments. In general, I think this is a great PR because:

  1. the dataset_id decouples the execution of the evaluation loop from the evaluation strategy, while also allowing for backwards compatibility.
  2. The aggregation logic for the metrics can be tied to the dataset_id. Each dataset_id has a required mapping to the language.

One issue with the current implementation is that the metric aggregation simply ignores cross-lingual tasks. Hence, we would not want to execute these, as they are ignored in the final result calculation. So we need to find a way to include the cross-lingual results in the final metric calculation.

My proposal is to default to a metric aggregation mode that the user can define on the benchmark (in workrb.evaluate). There are then 2 steps involved:

  1. Check if all defined tasks match with the defined mode (otherwise we raise, or we need to filter underlying tasks, which could also be defined with a user-defined mode).
  2. Then, in the aggregation step, we need to make the distinction between the input_language and output_language. This would require a refactor of all tasks to define both (they are the same for monolingual). The metric aggregation logic then looks at:
    1. monolingual_only (default): checks if all tasks input/output languages are the same, and only aggregates on those.
    2. crosslingual_group_input_languages: would aggregate all results based on input_language
    3. crosslingual_group_output_languages: would aggregate all results based on output_language
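The three proposed modes can be sketched as follows (a hedged illustration using plain dicts and string keys; the real workrb types and signatures may differ):

```python
from collections import defaultdict
from statistics import mean

# Each raw result records the dataset's input/output language and a score.
results = [
    {"input": "en", "output": "en", "score": 0.75},
    {"input": "en", "output": "fr", "score": 0.25},
    {"input": "fr", "output": "fr", "score": 0.5},
]


def aggregate(results, mode):
    groups = defaultdict(list)
    for r in results:
        if mode == "monolingual_only":
            if r["input"] != r["output"]:
                continue  # cross-lingual datasets are ignored in this mode
            key = r["input"]
        elif mode == "crosslingual_group_input_languages":
            key = r["input"]
        elif mode == "crosslingual_group_output_languages":
            key = r["output"]
        else:
            raise ValueError(f"Unknown mode: {mode}")
        groups[key].append(r["score"])
    return {lang: mean(scores) for lang, scores in groups.items()}


aggregate(results, "monolingual_only")                    # {'en': 0.75, 'fr': 0.5}
aggregate(results, "crosslingual_group_input_languages")  # {'en': 0.5, 'fr': 0.5}
```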

@Mattdl Mattdl mentioned this pull request Feb 20, 2026
Introduce DatasetLanguages type and LanguageAggregationMode enum with
three modes: monolingual_only, crosslingual_group_input_languages,
and crosslingual_group_output_languages.

Rename get_dataset_language() to get_dataset_languages(), now returning
input and output language sets. Replace MetricsResult.language with
input_languages/output_languages. Datasets incompatible with the chosen
aggregation mode are skipped rather than causing a failure.

BREAKING CHANGE: MetricsResult.language replaced by input_languages/output_languages
@federetyk
Contributor Author

@Mattdl Thank you very much for your thorough review! I have pushed a new commit that addresses your proposal. It introduces a new enum with the three aggregation modes you suggested, and it also generalizes get_dataset_languages(), which now returns both input and output languages. I ran all the tests and they pass.

[...]
One issue with current implementation is that the metric aggregation just ignores cross-lingual tasks. Hence, we would not want to execute these as they are ignored in the final result calculation. So we need to find a way to include the cross-lingual results in the final metric calculation.
[...]

  1. Check if all defined tasks match with the defined mode (otherwise we raise, or we need to filter underlying tasks, which could also be defined with a user-defined mode).

Between raising and filtering, I would lean toward filtering. I would suggest we consider the per-dataset raw results as the primary output artifact of a run, and the aggregations as derived. Since the evaluation loop is the expensive part, raw results should always be saved. Then, the user can re-aggregate with a different mode at any time. So rather than raising when datasets are incompatible with the aggregation mode, they should be skipped during aggregation. And if the user specifically wants only the aggregated results, we could support a filtering option that restricts execution to the datasets that would contribute to that aggregation, avoiding unnecessary evaluations.

Taking this a step further, if you are interested, I would be glad to work (perhaps in a different PR) on separating the reporting layer from the evaluation loop. The run would produce the primary per-dataset results, and then separate analysis modules would handle aggregation, visualization, formatting as Markdown/LaTeX tables, etc. These modules could be optionally invoked by the user in the experiment's config, or run a posteriori on previously stored results, making it possible to perform arbitrarily complex post-hoc analysis without complicating the evaluation loop.

Let me know what you think!

Collaborator

@Mattdl Mattdl left a comment


@federetyk this is such a great PR. Thanks again for the contributions and insightful ideas for next steps of the package!

I left some small comments. Just want to check with you on the filtering vs raising you mentioned (see my comment below), and then we can close this PR.

Current PR: Filtering vs raising

I would suggest we consider the per-dataset raw results as the primary output artifact of a run, and the aggregations as derived. Since the evaluation loop is the expensive part, raw results should always be saved.

I completely agree with that philosophy! However, I think for the very same reason that the "evaluation loop is the expensive part", we should default to lazy execution: run only the relevant dataset_ids and tasks that make sense w.r.t. the chosen aggregation strategy. (Also see the following comment on checkpointing.)

Since the evaluation loop is the expensive part, raw results should always be saved. Then, the user can re-aggregate with a different mode at any time.

I think here we can make use of the checkpointing. As our checkpointing is defined on the (task_name, dataset_id) level, we can perfectly do a partial execution by filtering.

This builds on the idea that a user has to always explicitly define which set of tasks (and models) they are interested in, and only the relevant ones will be executed. Hence, I completely agree with your following comment:

So rather than raising when datasets are incompatible with the aggregation mode, they should be skipped during aggregation.

As a default I would raise warnings of skipped datasets based on the aggregation strategy.

To double-check with you: Currently we are already taking the (task_name, dataset_id) skipping strategy if I'm not mistaken? So in case you agree we can keep the PR as is, and maybe just add the warnings for skipped dataset_ids in execution.
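Skipping with a warning could be as simple as the following sketch (filter_compatible and the compatibility predicate are hypothetical names, not the actual workrb API):

```python
import logging

logger = logging.getLogger("workrb")


def filter_compatible(dataset_ids, is_compatible):
    """Keep only the dataset_ids usable under the chosen aggregation mode,
    logging a warning for every skipped entry."""
    kept = []
    for dataset_id in dataset_ids:
        if is_compatible(dataset_id):
            kept.append(dataset_id)
        else:
            logger.warning(
                "Skipping dataset '%s': incompatible with the chosen "
                "language aggregation mode",
                dataset_id,
            )
    return kept


# e.g. treat ids containing '-' as cross-lingual and skip them:
filter_compatible(["en", "en-fr", "fr"], lambda d: "-" not in d)  # ['en', 'fr']
```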

Regarding next steps

Docs update

To keep the scope of this PR limited, I've created a separate issue #43 regarding documentation for the README and CONTRIBUTING guidelines that need to be updated w.r.t. this refactor.

New PR: Separated evaluation logic idea

Taking this a step further, if you are interested, I would be glad to work (perhaps in a different PR) on separating the reporting layer from the evaluation loop.

I absolutely love that idea. One major thing to keep in mind is basically the same as mentioned above: I'd focus on lazy execution depending on the chosen analysis. Additional evaluations are then executed on the fly if required.
For the user that doesn't want lazy loading, we could also introduce 2 execution modes of the evaluation loop: ExecutionMode.LazyTasksRequiredByAggregation or ExecutionMode.AllTasks, where the user can explicitly define the behavior.

I agree with you it's super valuable to be able to do any post-hoc analysis. By using the execution mode, all tasks are skipped by default if they're already present. Then we can perfectly have a separated reporting/aggregation layer on top.

Feel free to make an Issue or PR with how you see the main changes compared to current aggregation strategy. Really curious to have your view on that!

Skip datasets incompatible with the chosen LanguageAggregationMode
before evaluation to avoid unnecessary compute. Extract
get_language_grouping_key() as a shared function in types.py so
the same compatibility logic is reused by both the eval-time filter
(_filter_pending_work) and the aggregation-time skip in results.py.

- Add get_language_grouping_key() to types.py, refactor
  BenchmarkResults._get_language_grouping_key() to delegate to it
- Add ExecutionMode enum (LAZY / ALL)
- Add language_aggregation_mode and execution_mode params to evaluate()
- Add language_aggregation_mode param to get_summary_metrics()
- Export ExecutionMode and LanguageAggregationMode from workrb
- Add tests for shared function and MELO-like filtering scenarios
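The shared helper described in this commit message might look roughly like the following sketch (names taken from the message above; the argument shapes are assumptions, not the actual workrb signatures):

```python
from __future__ import annotations


def get_language_grouping_key(input_languages: set[str],
                              output_languages: set[str],
                              mode: str) -> str | None:
    """Return the language bucket for a dataset, or None when the dataset is
    incompatible with the chosen aggregation mode and should be skipped.
    Reused by both the eval-time filter and the aggregation step."""
    if mode == "monolingual_only":
        if input_languages == output_languages and len(input_languages) == 1:
            return next(iter(input_languages))
        return None  # cross-lingual or multilingual dataset: skip
    if mode == "crosslingual_group_input_languages":
        return next(iter(input_languages)) if len(input_languages) == 1 else None
    if mode == "crosslingual_group_output_languages":
        return next(iter(output_languages)) if len(output_languages) == 1 else None
    raise ValueError(f"Unknown aggregation mode: {mode}")
```

Because both the filter and the aggregation call the same function, a dataset that survives filtering is guaranteed to land in a language bucket later.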
@federetyk
Contributor Author

federetyk commented Feb 23, 2026

@Mattdl Thanks for the detailed review. Hope the ever-growing scope of this refactor is not making you regret the invitation to contribute! :)

Regarding your question: the code you reviewed yesterday only filtered incompatible datasets at aggregation time, not at evaluation time. I have just pushed a new commit that addresses your suggestions. Now, evaluate() and evaluate_multiple_models() accept new parameters to control both the execution plan and the aggregation of results.

The default "aggregation mode" is None, which I understand preserves the behavior you had before this refactor. But when the user does specify an aggregation mode, then the default "execution mode" is to skip the evaluation of datasets that are not going to be used for the chosen aggregation mode. In other words, lazy execution is the default. Is this what you had in mind? Also, as you suggested, we are logging a warning message if a dataset is skipped because of the chosen aggregation strategy.

With this, there are now two ways to control which datasets get evaluated. Taking run_multiple_models.py as an example:

# 1. The `languages` parameter in the Task constructor controls which
#    dataset IDs are generated in the first place:
tasks = [
    ESCOJob2SkillRanking(split="test", languages=["en", "fr", "de", "nl"]),
    ...
]

# 2. The new parameters in evaluate() filter out datasets that would be
#    discarded by the chosen aggregation mode:
results = workrb.evaluate(
    model=model,
    tasks=tasks,
    output_folder="results/my_model",
    language_aggregation_mode=LanguageAggregationMode.MONOLINGUAL_ONLY,  # default is None
    execution_mode=ExecutionMode.LAZY,  # this is the default
)

The first mechanism predates this refactor, and selects which datasets take part in the evaluation; the second narrows that set based on how the results will be aggregated. Does this match how you see the high-level API? If so, we could do some clean-up and documentation around this in the follow-up PR we discussed.

If you prefer to merge the PR as it was before this commit and handle lazy execution in a separate PR, I would be happy to revert. Just let me know which you prefer.

@federetyk federetyk requested a review from Mattdl February 23, 2026 22:04
Collaborator

@Mattdl Mattdl left a comment


Thanks for the detailed review. Hope the ever-growing scope of this refactor is not making you regret the invitation to contribute! :)

On the contrary, the package has improved a lot thanks to your contributions!!🥇

LanguageAggregation mode None

For the LanguageAggregation we will always have to make a choice, so None for filtering would not be an option. The default would currently be LanguageAggregationMode.MONOLINGUAL_ONLY, as we didn't support cross-lingual before.

The reason we need to define this, is that the aggregation must happen at some point in reporting the metrics. Especially in src/workrb/results.py, in the get_summary_metrics() call.

The second choice a user now has is whether they still want to execute all tasks regardless of the chosen LanguageAggregation. With the default ExecutionMode.LAZY it will skip any cross-lingual entries that won't be relevant for get_summary_metrics(), or execute all of them with ExecutionMode.ALL.

So the requested change here is that we don't have a None for language filtering; it must always be defined. Feel free to add your own viewpoint on this one!

Aggregation revision

The problem in src/workrb/results.py, in the get_summary_metrics() is that we now have a dependency on the language_aggregation_mode.

There's a couple of issues arising:

Problem 1

get_summary_metrics(): The language_aggregation_mode should not be optional but a required argument to make the strategy explicit and avoid eval errors in the code.

Problem 2

The BenchmarkResults.__str__ that is used in a typical user flow with print(results) API won't work anymore, as we'll require explicit definition of language aggregation mode. Example:

  # 1. Setup model and tasks
  model = workrb.models.BiEncoderModel("all-MiniLM-L6-v2")
  tasks = [workrb.tasks.ESCOSkill2JobRanking(split="test", languages=["en"])]

  # 2. Run the benchmark
  agg_mode = LanguageAggregationMode.MONOLINGUAL_ONLY
  results = workrb.evaluate(
      model,
      tasks,
      output_folder="results/demo",
      description="WorkRB demo",
      force_restart=True,
      language_aggregation_mode=agg_mode,
  )  # -> Will already print the results via `reporting.py`

  # (Optional) Display results
  print("\nResults:")
  print(results)  # -> This will not be possible anymore.

A possible solution would be to make the LanguageAggregation mode part of the results state, enabling print(results). However, it becomes error-prone when prior results were run with a different LanguageAggregation than the original evaluation run. With the lazy execution mode, we would need a check that all necessary tasks have been executed. Not sure yet what would be the best solution here. In any case we want to keep the API as clean and easy to use as possible, so maybe using the default aggregation mode and documenting the default behavior is the better option.

Problem 3

For efficiency reasons, src/workrb/metrics/reporting.py:format_results recalculates the metrics on demand. But with the refactor this feels like an error-prone approach. Instead we should simplify by moving the internals of results.get_summary_metrics(...) into a new function results._get_summary_metrics, which should return all 5 of mean_per_task, ..., mean_per_language.

  1. The original results.get_summary_metrics(...) would then just combine the results of the call to results._get_summary_metrics.
  2. src/workrb/metrics/reporting.py:format_results would also just call results._get_summary_metrics, so the logic lives in one place.
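Sketched out, this single-source-of-truth refactor could look like the following (the raw-result shape and the two concrete views are assumptions for illustration; the real modules compute all 5 views):

```python
from statistics import mean


def _get_summary_metrics(raw):
    """Compute every aggregate view once, in one place.
    raw: list of (task_name, language, score) tuples."""
    def group_mean(index):
        groups = {}
        for row in raw:
            groups.setdefault(row[index], []).append(row[2])
        return {key: mean(scores) for key, scores in groups.items()}

    return {
        "mean_per_task": group_mean(0),
        "mean_per_language": group_mean(1),
        # ...the remaining views would be computed here as well
    }


def get_summary_metrics(raw):
    # Public API just combines the shared internals.
    return _get_summary_metrics(raw)


def format_results(raw):
    # reporting.py reuses the same internals, so the logic exists only once.
    summary = _get_summary_metrics(raw)
    return "\n".join(f"{name}: {view}" for name, view in summary.items())
```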

@Mattdl
Collaborator

Mattdl commented Feb 24, 2026

We're currently working on the paper, so I want to recap the short-term prios.
I love the contributions in this PR and would love to have these in the paper!
Especially if there's any API changes or changes to how to add tasks/models, these should be included in the main paper.

Prios:

Let's keep the separate reporting layer as future steps for now to maintain focus (:

Super excited to be adding your features to the paper-version of WorkRB (:🚀

…evaluate()

Address PR techwolf-ai#34 review feedback from @Mattdl. The aggregation mode now flows
consistently through the entire evaluation and aggregation pipeline
instead of being an optional parameter.
Override get_dataset_languages() and languages_to_dataset_ids() in the
freelancer candidate ranking tasks, replacing the Language.CROSS sentinel
with a proper DATASET_LANGUAGES_MAP that describes each dataset's input
and output languages.

This lets the aggregation and filtering logic handle the multilingual
dataset ("cross_lingual" renamed to "multilingual") as a regular entry
rather than a special-cased language enum value.
@federetyk
Contributor Author

@Mattdl Thank you again for your feedback! In the latest commit, I have addressed your suggestions. Now the parameter language_aggregation_mode is non-optional, BenchmarkMetadata does store it, and the aggregation methods all receive and use it as well. Also, the commit includes a refactor of freelancer_project_matching.py and a small fix to the evaluation plan summary, because get_tasks_overview used to show all datasets regardless of our new filtering logic.

Happy to adjust anything before merging.


Not a blocker for this PR, but I wanted to share some ideas for future discussion. I think this is also where some of my earlier confusion came from. As I see it, there are at least three alternatives for the aggregation chain:

  1. Flat average, no filtering.
    dataset → task → task_group → benchmark
    Here, per-task aggregation is a flat macro-average of datasets. It averages all evaluated datasets within each task, with no filtering and no language awareness. In this scenario, the per-language aggregation is a separate, independent analysis outside this hierarchy; the language aggregation mode is only needed for per-language aggregation, to determine which language bucket each dataset belongs to.

  2. Filter by mode, then average.
    dataset (filtered by mode) → task → task_group → benchmark
    This is what we implemented in this PR. Datasets incompatible with the language aggregation mode are filtered out, and the remaining ones are flat-averaged to produce the per-task score. Unlike option 1, some datasets are excluded from the main hierarchy.

  3. Group by language, then average.
    dataset → language (given a grouping criterion) → task → task_group → benchmark
    Within each task, datasets are first grouped by language (using the language aggregation mode to determine the bucket), averaged within each language group, and then those per-language scores are averaged to produce the per-task score. This weights languages equally regardless of how many datasets each has. Incompatible datasets are also excluded.

I can see the case for options 2 and 3: they ensure consistency between the main hierarchy and the per-language view, since both operate on the same set of datasets. Under option 1, the benchmark score would include datasets that the per-language breakdown cannot account for, which could be confusing.

That said, option 1 has the appeal of being a straightforward cascade of macro-averages where each dataset is an independent data point... and per-language aggregation is independent, not being part of the chain. I feel it is also more natural for users who prefer ExecutionMode.ALL, and the per-language breakdown becomes a secondary output artifact. Under options 2 and 3, by contrast, some evaluated datasets would always be excluded from the benchmark score, since no single mode is compatible with every dataset type.

For now I have implemented option 2, which is how I understood your feedback. If you had option 3 in mind instead, it would be a small change. I can apply it before merging this PR.
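The numerical difference between the flat average (option 1) and the language-grouped average (option 3) can be seen in a small sketch (illustrative data only, not workrb code):

```python
from statistics import mean

# One task with two 'en' datasets and one 'fr' dataset.
datasets = [("en", 1.0), ("en", 0.5), ("fr", 0.25)]

# Option 1: flat macro-average, every dataset is one data point.
flat = mean(score for _, score in datasets)  # ~0.583

# Option 3: group by language first, then average the language means.
by_lang = {}
for lang, score in datasets:
    by_lang.setdefault(lang, []).append(score)
grouped = mean(mean(scores) for scores in by_lang.values())  # 0.5

# 'fr' counts for a third of the score under option 1, but half under option 3.
```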

@federetyk federetyk requested a review from Mattdl February 25, 2026 02:10
Collaborator

@Mattdl Mattdl left a comment


@federetyk thanks for the clear overview! 🚀 All looks good, I'd just make this final change if you agree:

I can definitely see the advantage of Option 1; as you mention, it would just use each dataset_id as a separate data point.

Regarding options 2 and 3, my preference would be Option 3, as it gives an equally-weighted performance aggregation over languages.

So combining the 2 approaches, we could indeed leave a LanguageAggregationMode of None (for Option 1), although I'd prefer to make it more explicit with LanguageAggregationMode.SKIP_LANGUAGE_AGGREGATION, where we can also document the behavior for Option 1. I'd keep the default at LanguageAggregationMode.MONOLINGUAL_ONLY.

To document our final flow this would result in the following 2 flows:

  • lang_agg_mode=LanguageAggregationMode.{MONOLINGUAL/CROSS_LINGUAL}:
    dataset (filtered based on lang_agg_mode AND if ExectionMode.LAZY, or not filtered if ExectionMode.ALL) → language (follows the lang_agg_mode grouping criterion) → tasktask_groupbenchmark

  • lang_agg_mode=LanguageAggregationMode.SKIP_LANGUAGE_AGGREGATION:
    dataset (no filtering regardless of ExectutionMode) → language (skipped) → tasktask_groupbenchmark

Per-task aggregation now groups datasets by language before averaging,
giving equal weight to each language regardless of how many datasets
it contains. Previously, all compatible datasets were flat-averaged,
which over-represented languages with more datasets.

Add SKIP_LANGUAGE_AGGREGATION mode for users who want the old flat
average with no filtering and no per-language output.
Add four example scripts illustrating the two aggregation modes
(language-weighted vs flat average) combined with task-level
language filtering (selected subset vs all available languages).
@federetyk
Contributor Author

@Mattdl I sincerely appreciate the time you have put into reviewing and directing this refactor! This was a prerequisite for including MELO in the benchmark, so it directly unblocks a contribution I care about.

I think allowing the user to choose between the flat average and the language-grouped average is good. It makes the benchmark flexible enough to accommodate both perspectives we discussed, and users can pick the one that best fits their evaluation goals.

The latest commits in the branch reflect your suggestions: in per-task aggregation, we now use language-grouped averaging in the way described as option 3; there is a new language aggregation mode SKIP_LANGUAGE_AGGREGATION for what we described as option 1, and in this case no per-language aggregation output is computed; finally, I added four new example scripts in examples/ that showcase the two aggregation modes combined with the task-level language filtering that already existed in main (this could be useful for supporting the documentation you are planning in #43). Also, all tests pass.

Let me know if you would like me to change anything before we merge!

@federetyk federetyk requested a review from Mattdl February 25, 2026 20:36