
refactor: generalize dataset indexing from language-based to dataset_id-based#34

Open
federetyk wants to merge 20 commits into techwolf-ai:main from federetyk:refactor/generalize-dataset-indexing

Conversation

@federetyk
Contributor

Addresses #33

Description

This PR generalizes dataset indexing within tasks from the Language enum to arbitrary string identifiers (dataset_id). The current architecture limits each task to at most one dataset per language, which prevents supporting tasks with multiple monolingual datasets per language, cross-lingual datasets, or multilingual datasets.

The refactor introduces a languages_to_dataset_ids() method with a default 1:1 mapping that preserves backward compatibility for existing tasks. Tasks that require more complex dataset structures can override this method to return custom identifiers. A new get_dataset_language() method maps datasets back to their language for proper per-language result aggregation, returning None for cross-lingual or multilingual datasets.
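As a rough sketch, the default 1:1 mapping described above might look like the following (hedged: the real workrb Task base class has more machinery, and the Language values here are illustrative):

```python
from __future__ import annotations

from enum import Enum


class Language(str, Enum):
    EN = "en"
    FR = "fr"


class Task:
    def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
        # Default 1:1 mapping: one dataset per language, id = language code.
        return [language.value for language in languages]

    def get_dataset_language(self, dataset_id: str) -> Language | None:
        # Inverse of the default mapping; None for ids that do not name a
        # single language (e.g. cross-lingual or multilingual datasets).
        try:
            return Language(dataset_id)
        except ValueError:
            return None


task = Task()
assert task.languages_to_dataset_ids([Language.EN, Language.FR]) == ["en", "fr"]
assert task.get_dataset_language("en") is Language.EN
assert task.get_dataset_language("en-fr") is None  # cross-lingual id
```

Tasks with more complex dataset structures override both methods together, keeping the id scheme and its inverse consistent.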

Changes:

  • Rename lang_datasets: dict[Language, Dataset] to datasets: dict[str, Dataset] in Task base class
  • Add languages_to_dataset_ids(languages) -> list[str] method with default backward-compatible mapping
  • Rename load_monolingual_data(language, split) to load_dataset(dataset_id, split) across all tasks
  • Add get_dataset_language(dataset_id) -> Language | None method for per-language aggregation
  • Add language field to MetricsResult to track dataset language
  • Update _aggregate_per_language() to group by the language field, skipping datasets marked as cross-lingual or multilingual
  • Update all task implementations to use the new method signature
  • Add unit test for multi-dataset task scenarios
  • Fix minor issues in some files in examples/

All tests pass, and the output of examples/run_multiple_models.py produces results consistent with the main branch.

Checklist

  • Added new tests for new functionality
  • Tested locally with example tasks
  • Code follows project style guidelines
  • Documentation updated
  • No new warnings introduced

Collaborator

@Mattdl Mattdl left a comment


I left some comments. In general, I think this is a great PR because:

  1. the dataset_id decouples the execution of the evaluation loop from the evaluation strategy, while also allowing for backwards compatibility.
  2. The aggregation logic for the metrics can be tied to the dataset_id. Each dataset_id has a required mapping to the language.

One issue with the current implementation is that the metric aggregation simply ignores cross-lingual tasks. Hence, we would not want to execute these, as they are ignored in the final result calculation. So we need to find a way to include the cross-lingual results in the final metric calculation.

My proposal is to default to a metric aggregation mode that the user can define on the benchmark (in workrb.evaluate). There are then 2 steps involved:

  1. Check if all defined tasks match with the defined mode (otherwise we raise, or we need to filter underlying tasks, which could also be defined with a user-defined mode).
  2. Then, in the aggregation step, we need to make the distinction between the input_language and output_language. This would require a refactor of all tasks to define both (they are the same for monolingual). The metric aggregation logic then looks at:
    1. monolingual_only (default): checks if all tasks input/output languages are the same, and only aggregates on those.
    2. crosslingual_group_input_languages: would aggregate all results based on input_language
    3. crosslingual_group_output_languages: would aggregate all results based on output_language
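The three proposed modes can be sketched as follows (a hedged illustration using plain dicts and string keys; the real workrb types and signatures may differ):

```python
from collections import defaultdict
from statistics import mean

# Each raw result records the dataset's input/output language and a score.
results = [
    {"input": "en", "output": "en", "score": 0.75},
    {"input": "en", "output": "fr", "score": 0.25},
    {"input": "fr", "output": "fr", "score": 0.5},
]


def aggregate(results, mode):
    groups = defaultdict(list)
    for r in results:
        if mode == "monolingual_only":
            if r["input"] != r["output"]:
                continue  # cross-lingual datasets are ignored in this mode
            key = r["input"]
        elif mode == "crosslingual_group_input_languages":
            key = r["input"]
        elif mode == "crosslingual_group_output_languages":
            key = r["output"]
        else:
            raise ValueError(f"Unknown mode: {mode}")
        groups[key].append(r["score"])
    return {lang: mean(scores) for lang, scores in groups.items()}


aggregate(results, "monolingual_only")                    # {'en': 0.75, 'fr': 0.5}
aggregate(results, "crosslingual_group_input_languages")  # {'en': 0.5, 'fr': 0.5}
```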

@Mattdl Mattdl mentioned this pull request Feb 20, 2026
Introduce DatasetLanguages type and LanguageAggregationMode enum with
three modes: monolingual_only, crosslingual_group_input_languages,
and crosslingual_group_output_languages.

Rename get_dataset_language() to get_dataset_languages(), now returning
input and output language sets. Replace MetricsResult.language with
input_languages/output_languages. Datasets incompatible with the chosen
aggregation mode are skipped rather than causing a failure.

BREAKING CHANGE: MetricsResult.language replaced by input_languages/output_languages
@federetyk
Contributor Author

@Mattdl Thank you very much for your thorough review! I have pushed a new commit that addresses your proposal. It introduces a new enum with the three aggregation modes you suggested, and it also generalizes get_dataset_languages(), which now returns both input and output languages. I ran all the tests and they pass.

[...]
One issue with current implementation is that the metric aggregation just ignores cross-lingual tasks. Hence, we would not want to execute these as they are ignored in the final result calculation. So we need to find a way to include the cross-lingual results in the final metric calculation.
[...]

  1. Check if all defined tasks match with the defined mode (otherwise we raise, or we need to filter underlying tasks, which could also be defined with a user-defined mode).

Between raising and filtering, I would lean toward filtering. I would suggest we consider the per-dataset raw results as the primary output artifact of a run, and the aggregations as derived. Since the evaluation loop is the expensive part, raw results should always be saved. Then, the user can re-aggregate with a different mode at any time. So rather than raising when datasets are incompatible with the aggregation mode, they should be skipped during aggregation. And if the user specifically wants only the aggregated results, we could support a filtering option that restricts execution to the datasets that would contribute to that aggregation, avoiding unnecessary evaluations.

Taking this a step further, if you are interested, I would be glad to work (perhaps in a different PR) on separating the reporting layer from the evaluation loop. The run would produce the primary per-dataset results, and then separate analysis modules would handle aggregation, visualization, formatting as Markdown/LaTeX tables, etc. These modules could be optionally invoked by the user in the experiment's config, or run a posteriori on previously stored results, making it possible to perform arbitrarily complex post-hoc analysis without complicating the evaluation loop.

Let me know what you think!

Collaborator

@Mattdl Mattdl left a comment


@federetyk this is such a great PR. Thanks again for the contributions and insightful ideas for next steps of the package!

I left some small comments. Just want to check with you on the filtering vs raising you mentioned (see my comment below), and then we can close this PR.

Current PR: Filtering vs raising

I would suggest we consider the per-dataset raw results as the primary output artifact of a run, and the aggregations as derived. Since the evaluation loop is the expensive part, raw results should always be saved.

I completely agree with that philosophy! However, I think for the very same reason that the "evaluation loop is the expensive part", we should default to lazy execution: run only the relevant dataset_ids and tasks that make sense w.r.t. the chosen aggregation strategy. (Also see the following comment on checkpointing.)

Since the evaluation loop is the expensive part, raw results should always be saved. Then, the user can re-aggregate with a different mode at any time.

I think here we can make use of the checkpointing. As our checkpointing is defined on the (task_name, dataset_id) level, we can perfectly do a partial execution by filtering.

This builds on the idea that a user has to always explicitly define which set of tasks (and models) they are interested in, and only the relevant ones will be executed. Hence, I completely agree with your following comment:

So rather than raising when datasets are incompatible with the aggregation mode, they should be skipped during aggregation.

As a default I would raise warnings of skipped datasets based on the aggregation strategy.

To double-check with you: Currently we are already taking the (task_name, dataset_id) skipping strategy if I'm not mistaken? So in case you agree we can keep the PR as is, and maybe just add the warnings for skipped dataset_ids in execution.
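Skipping with a warning could be as simple as the following sketch (filter_compatible and the compatibility predicate are hypothetical names, not the actual workrb API):

```python
import logging

logger = logging.getLogger("workrb")


def filter_compatible(dataset_ids, is_compatible):
    """Keep only the dataset_ids usable under the chosen aggregation mode,
    logging a warning for every skipped entry."""
    kept = []
    for dataset_id in dataset_ids:
        if is_compatible(dataset_id):
            kept.append(dataset_id)
        else:
            logger.warning(
                "Skipping dataset '%s': incompatible with the chosen "
                "language aggregation mode",
                dataset_id,
            )
    return kept


# e.g. treat ids containing '-' as cross-lingual and skip them:
filter_compatible(["en", "en-fr", "fr"], lambda d: "-" not in d)  # ['en', 'fr']
```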

Regarding next steps

Docs update

To keep the scope of this PR limited, I've created a separate issue #43 regarding documentation for the README and CONTRIBUTING guidelines that need to be updated w.r.t. this refactor.

New PR: Separated evaluation logic idea

Taking this a step further, if you are interested, I would be glad to work (perhaps in a different PR) on separating the reporting layer from the evaluation loop.

I absolutely love that idea. One major thing to keep in mind is basically the same as mentioned above: I'd focus on lazy execution depending on the chosen analysis. Additional evaluations are then executed on the fly if required.
For the user that doesn't want lazy loading, we could also introduce 2 execution modes of the evaluation loop: ExecutionMode.LazyTasksRequiredByAggregation or ExecutionMode.AllTasks, where the user can explicitly define the behavior.

I agree with you it's super valuable to be able to do any post-hoc analysis. By using the execution mode, all tasks are skipped by default if they're already present. Then we can perfectly have a separated reporting/aggregation layer on top.

Feel free to make an Issue or PR with how you see the main changes compared to current aggregation strategy. Really curious to have your view on that!

Skip datasets incompatible with the chosen LanguageAggregationMode
before evaluation to avoid unnecessary compute. Extract
get_language_grouping_key() as a shared function in types.py so
the same compatibility logic is reused by both the eval-time filter
(_filter_pending_work) and the aggregation-time skip in results.py.

- Add get_language_grouping_key() to types.py, refactor
  BenchmarkResults._get_language_grouping_key() to delegate to it
- Add ExecutionMode enum (LAZY / ALL)
- Add language_aggregation_mode and execution_mode params to evaluate()
- Add language_aggregation_mode param to get_summary_metrics()
- Export ExecutionMode and LanguageAggregationMode from workrb
- Add tests for shared function and MELO-like filtering scenarios
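The shared helper described in this commit message might look roughly like the following sketch (names taken from the message above; the argument shapes are assumptions, not the actual workrb signatures):

```python
from __future__ import annotations


def get_language_grouping_key(input_languages: set[str],
                              output_languages: set[str],
                              mode: str) -> str | None:
    """Return the language bucket for a dataset, or None when the dataset is
    incompatible with the chosen aggregation mode and should be skipped.
    Reused by both the eval-time filter and the aggregation step."""
    if mode == "monolingual_only":
        if input_languages == output_languages and len(input_languages) == 1:
            return next(iter(input_languages))
        return None  # cross-lingual or multilingual dataset: skip
    if mode == "crosslingual_group_input_languages":
        return next(iter(input_languages)) if len(input_languages) == 1 else None
    if mode == "crosslingual_group_output_languages":
        return next(iter(output_languages)) if len(output_languages) == 1 else None
    raise ValueError(f"Unknown aggregation mode: {mode}")
```

Because both the filter and the aggregation call the same function, a dataset that survives filtering is guaranteed to land in a language bucket later.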
@federetyk
Contributor Author

federetyk commented Feb 23, 2026

@Mattdl Thanks for the detailed review. Hope the ever-growing scope of this refactor is not making you regret the invitation to contribute! :)

Regarding your question: the code you reviewed yesterday only filtered incompatible datasets at aggregation time, not at evaluation time. I have just pushed a new commit that addresses your suggestions. Now, evaluate() and evaluate_multiple_models() accept new parameters to control both the execution plan and the aggregation of results.

The default "aggregation mode" is None, which I understand preserves the behavior you had before this refactor. But when the user does specify an aggregation mode, then the default "execution mode" is to skip the evaluation of datasets that are not going to be used for the chosen aggregation mode. In other words, lazy execution is the default. Is this what you had in mind? Also, as you suggested, we are logging a warning message if a dataset is skipped because of the chosen aggregation strategy.

With this, there are now two ways to control which datasets get evaluated. Taking run_multiple_models.py as an example:

# 1. The `languages` parameter in the Task constructor controls which
#    dataset IDs are generated in the first place:
tasks = [
    ESCOJob2SkillRanking(split="test", languages=["en", "fr", "de", "nl"]),
    ...
]

# 2. The new parameters in evaluate() filter out datasets that would be
#    discarded by the chosen aggregation mode:
results = workrb.evaluate(
    model=model,
    tasks=tasks,
    output_folder="results/my_model",
    language_aggregation_mode=LanguageAggregationMode.MONOLINGUAL_ONLY,  # default is None
    execution_mode=ExecutionMode.LAZY,  # this is the default
)

The first mechanism predates this refactor, and selects which datasets take part in the evaluation; the second narrows that set based on how the results will be aggregated. Does this match how you see the high-level API? If so, we could do some clean-up and documentation around this in the follow-up PR we discussed.

If you prefer to merge the PR as it was before this commit and handle lazy execution in a separate PR, I would be happy to revert. Just let me know which you prefer.

@federetyk federetyk requested a review from Mattdl February 23, 2026 22:04
Collaborator

@Mattdl Mattdl left a comment


Thanks for the detailed review. Hope the ever-growing scope of this refactor is not making you regret the invitation to contribute! :)

On the contrary, the package has improved a lot thanks to your contributions!!🥇

LanguageAggregation mode None

For the LanguageAggregation we will always have to make a choice, so None for filtering would not be an option. The default would currently be LanguageAggregationMode.MONOLINGUAL_ONLY, as we didn't support cross-lingual before.

The reason we need to define this, is that the aggregation must happen at some point in reporting the metrics. Especially in src/workrb/results.py, in the get_summary_metrics() call.

The second choice a user now has is whether they still want to execute all tasks regardless of the chosen LanguageAggregation. With the default ExecutionMode.LAZY it will skip any cross-lingual entries that won't be relevant for get_summary_metrics(), or execute all of them with ExecutionMode.ALL.

So the requested change here is that we don't have a None for language filtering; it must always be defined. Feel free to add your own viewpoint on this one!

Aggregation revision

The problem in src/workrb/results.py, in the get_summary_metrics() is that we now have a dependency on the language_aggregation_mode.

There's a couple of issues arising:

Problem 1

get_summary_metrics(): The language_aggregation_mode should not be optional but a required argument to make the strategy explicit and avoid eval errors in the code.

Problem 2

The BenchmarkResults.__str__ that is used in a typical user flow with print(results) API won't work anymore, as we'll require explicit definition of language aggregation mode. Example:

  # 1. Setup model and tasks
  model = workrb.models.BiEncoderModel("all-MiniLM-L6-v2")
  tasks = [workrb.tasks.ESCOSkill2JobRanking(split="test", languages=["en"])]

  # 2. Run the benchmark
  agg_mode = LanguageAggregationMode.MONOLINGUAL_ONLY
  results = workrb.evaluate(
      model,
      tasks,
      output_folder="results/demo",
      description="WorkRB demo",
      force_restart=True,
      language_aggregation_mode=agg_mode,
  )  # -> Will already print the results via `reporting.py`

  # (Optional) Display results
  print("\nResults:")
  print(results)  # -> This will not be possible anymore.

A possible solution would be to make the LanguageAggregation mode part of the results state, enabling print(results). However, it becomes error-prone when prior results were run with a different LanguageAggregation than the original evaluation run. With the lazy execution mode, we would need a check that all necessary tasks have been executed. Not sure yet what would be the best solution here. In any case we want to keep the API as clean and easy to use as possible, so maybe using the default aggregation mode and documenting the default behavior is the better option.

Problem 3

For efficiency reasons, src/workrb/metrics/reporting.py:format_results recalculates the metrics on demand. But with the refactor this feels like an error-prone approach. Instead we should simplify by moving the internals of results.get_summary_metrics(...) into a new function results._get_summary_metrics, which should return all 5 of mean_per_task, ..., mean_per_language.

  1. The original results.get_summary_metrics(...) would then just combine the results of the call to results._get_summary_metrics.
  2. src/workrb/metrics/reporting.py:format_results would also just call results._get_summary_metrics, so the logic lives in one place.
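Sketched out, this single-source-of-truth refactor could look like the following (the raw-result shape and the two concrete views are assumptions for illustration; the real modules compute all 5 views):

```python
from statistics import mean


def _get_summary_metrics(raw):
    """Compute every aggregate view once, in one place.
    raw: list of (task_name, language, score) tuples."""
    def group_mean(index):
        groups = {}
        for row in raw:
            groups.setdefault(row[index], []).append(row[2])
        return {key: mean(scores) for key, scores in groups.items()}

    return {
        "mean_per_task": group_mean(0),
        "mean_per_language": group_mean(1),
        # ...the remaining views would be computed here as well
    }


def get_summary_metrics(raw):
    # Public API just combines the shared internals.
    return _get_summary_metrics(raw)


def format_results(raw):
    # reporting.py reuses the same internals, so the logic exists only once.
    summary = _get_summary_metrics(raw)
    return "\n".join(f"{name}: {view}" for name, view in summary.items())
```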

@Mattdl
Collaborator

Mattdl commented Feb 24, 2026

We're currently working on the paper, so I want to recap the short-term prios.
I love the contributions in this PR and would love to have these in the paper!
Especially if there's any API changes or changes to how to add tasks/models, these should be included in the main paper.

Prios:

Let's keep the separate reporting layer as future steps for now to maintain focus (:

Super excited to be adding your features to the paper-version of WorkRB (:🚀

…evaluate()

Address PR techwolf-ai#34 review feedback from @Mattdl. The aggregation mode now flows
consistently through the entire evaluation and aggregation pipeline
instead of being an optional parameter.
Override get_dataset_languages() and languages_to_dataset_ids() in the
freelancer candidate ranking tasks, replacing the Language.CROSS sentinel
with a proper DATASET_LANGUAGES_MAP that describes each dataset's input
and output languages.

This lets the aggregation and filtering logic handle the multilingual
dataset ("cross_lingual" renamed to "multilingual") as a regular entry
rather than a special-cased language enum value.
@federetyk
Contributor Author

@Mattdl Thank you again for your feedback! In the latest commit, I have addressed your suggestions. Now the parameter language_aggregation_mode is non-optional, BenchmarkMetadata does store it, and the aggregation methods all receive and use it as well. Also, the commit includes a refactor of freelancer_project_matching.py and a small fix to the evaluation plan summary, because get_tasks_overview used to show all datasets regardless of our new filtering logic.

Happy to adjust anything before merging.


Not a blocker for this PR, but I wanted to share some ideas for future discussion. I think this is also where some of my earlier confusion came from. As I see it, there are at least three alternatives for the aggregation chain:

  1. Flat average, no filtering.
    dataset → task → task_group → benchmark
    Here, per-task aggregation is a flat macro-average of datasets. It averages all evaluated datasets within each task, with no filtering and no language awareness. In this scenario, the per-language aggregation is a separate, independent analysis outside this hierarchy; the language aggregation mode is only needed for per-language aggregation, to determine which language bucket each dataset belongs to.

  2. Filter by mode, then average.
    dataset (filtered by mode) → task → task_group → benchmark
    This is what we implemented in this PR. Datasets incompatible with the language aggregation mode are filtered out, and the remaining ones are flat-averaged to produce the per-task score. Unlike option 1, some datasets are excluded from the main hierarchy.

  3. Group by language, then average.
    dataset → language (given a grouping criterion) → task → task_group → benchmark
    Within each task, datasets are first grouped by language (using the language aggregation mode to determine the bucket), averaged within each language group, and then those per-language scores are averaged to produce the per-task score. This weights languages equally regardless of how many datasets each has. Incompatible datasets are also excluded.

I can see the case for options 2 and 3: they ensure consistency between the main hierarchy and the per-language view, since both operate on the same set of datasets. Under option 1, the benchmark score would include datasets that the per-language breakdown cannot account for, which could be confusing.

That said, option 1 has the appeal of being a straightforward cascade of macro-averages where each dataset is an independent data point... and per-language aggregation is independent, not being part of the chain. I feel it is also more natural for users who prefer ExecutionMode.ALL, and the per-language breakdown becomes a secondary output artifact. Under options 2 and 3, by contrast, some evaluated datasets would always be excluded from the benchmark score, since no single mode is compatible with every dataset type.

For now I have implemented option 2, which is how I understood your feedback. If you had option 3 in mind instead, it would be a small change. I can apply it before merging this PR.
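The numerical difference between the flat average (option 1) and the language-grouped average (option 3) can be seen in a small sketch (illustrative data only, not workrb code):

```python
from statistics import mean

# One task with two 'en' datasets and one 'fr' dataset.
datasets = [("en", 1.0), ("en", 0.5), ("fr", 0.25)]

# Option 1: flat macro-average, every dataset is one data point.
flat = mean(score for _, score in datasets)  # ~0.583

# Option 3: group by language first, then average the language means.
by_lang = {}
for lang, score in datasets:
    by_lang.setdefault(lang, []).append(score)
grouped = mean(mean(scores) for scores in by_lang.values())  # 0.5

# 'fr' counts for a third of the score under option 1, but half under option 3.
```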

@federetyk federetyk requested a review from Mattdl February 25, 2026 02:10
Collaborator

@Mattdl Mattdl left a comment


@federetyk thanks for the clear overview! 🚀 All looks good, I'd just make this final change if you agree:

I can definitely see the advantage of Option 1; as you mention, it would just use each dataset_id as a separate data point.

Regarding options 2 and 3, my preference would be Option 3, as it gives an equally-weighted performance aggregation over languages.

So combining the 2 approaches, we could indeed leave a LanguageAggregationMode of None (for Option 1), although I'd prefer to make it more explicit with LanguageAggregationMode.SKIP_LANGUAGE_AGGREGATION, where we can also document the behavior for Option 1. I'd keep the default at LanguageAggregationMode.MONOLINGUAL_ONLY.

To document our final flow this would result in the following 2 flows:

  • lang_agg_mode=LanguageAggregationMode.{MONOLINGUAL/CROSS_LINGUAL}:
    dataset (filtered based on lang_agg_mode AND if ExectionMode.LAZY, or not filtered if ExectionMode.ALL) → language (follows the lang_agg_mode grouping criterion) → tasktask_groupbenchmark

  • lang_agg_mode=LanguageAggregationMode.SKIP_LANGUAGE_AGGREGATION:
    dataset (no filtering regardless of ExectutionMode) → language (skipped) → tasktask_groupbenchmark

Per-task aggregation now groups datasets by language before averaging,
giving equal weight to each language regardless of how many datasets
it contains. Previously, all compatible datasets were flat-averaged,
which over-represented languages with more datasets.

Add SKIP_LANGUAGE_AGGREGATION mode for users who want the old flat
average with no filtering and no per-language output.
Add four example scripts illustrating the two aggregation modes
(language-weighted vs flat average) combined with task-level
language filtering (selected subset vs all available languages).
@federetyk
Contributor Author

@Mattdl I sincerely appreciate the time you have put into reviewing and directing this refactor! This was a prerequisite for including MELO in the benchmark, so it directly unblocks a contribution I care about.

I think allowing the user to choose between the flat average and the language-grouped average is good. It makes the benchmark flexible enough to accommodate both perspectives we discussed, and users can pick the one that best fits their evaluation goals.

The latest commits in the branch reflect your suggestions: in per-task aggregation, we now use language-grouped averaging in the way described as option 3; there is a new language aggregation mode SKIP_LANGUAGE_AGGREGATION for what we described as option 1, and in this case no per-language aggregation output is computed; finally, I added four new example scripts in examples/ that showcase the two aggregation modes combined with the task-level language filtering that already existed in main (this could be useful for supporting the documentation you are planning in #43). Also, all tests pass.

Let me know if you would like me to change anything before we merge!

@federetyk federetyk requested a review from Mattdl February 25, 2026 20:36