refactor: generalize dataset indexing from language-based to dataset_id-based #34
federetyk wants to merge 20 commits into techwolf-ai:main
Conversation
Mattdl
left a comment
I left some comments. In general, I think this is a great PR because:
- the `dataset_id` allows decoupling the execution of the evaluation loop from the evaluation strategy. Additionally, it allows for backwards compatibility.
- The aggregation logic for the metrics can be tied to the `dataset_id`. Each `dataset_id` has a required mapping to the language.
One issue with the current implementation is that the metric aggregation simply ignores cross-lingual tasks. Hence, we would not want to execute these, as they are ignored in the final result calculation. So we need to find a way to include the cross-lingual results in the final metric calculation.
My proposal is to stick by default to a metric aggregation mode that can be defined by the user for the benchmark (in `workrb.evaluate`). Then there are 2 steps involved:
- Check if all defined tasks match the defined mode (otherwise we raise, or we need to filter the underlying tasks, which could also be controlled by a user-defined mode).
- Then, in the aggregation step, we need to make the distinction between the `input_language` and the `output_language`. This would require a refactor of all tasks to define both (they are the same for monolingual). The metric aggregation logic then looks at:
  - `monolingual_only` (default): checks if all tasks' input/output languages are the same, and only aggregates on those.
  - `crosslingual_group_input_languages`: would aggregate all results based on `input_language`.
  - `crosslingual_group_output_languages`: would aggregate all results based on `output_language`.
…y with dataset_id abstraction
Introduce DatasetLanguages type and LanguageAggregationMode enum with three modes: monolingual_only, crosslingual_group_input_languages, and crosslingual_group_output_languages. Rename get_dataset_language() to get_dataset_languages(), now returning input and output language sets. Replace MetricsResult.language with input_languages/output_languages. Datasets incompatible with the chosen aggregation mode are skipped rather than causing a failure. BREAKING CHANGE: MetricsResult.language replaced by input_languages/output_languages
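As a rough sketch of the shapes this commit describes (field names and methods here are assumptions; the actual types in workrb may differ):

```python
from dataclasses import dataclass
from enum import Enum


class LanguageAggregationMode(Enum):
    """The three aggregation modes from the proposal above."""
    MONOLINGUAL_ONLY = "monolingual_only"
    CROSSLINGUAL_GROUP_INPUT_LANGUAGES = "crosslingual_group_input_languages"
    CROSSLINGUAL_GROUP_OUTPUT_LANGUAGES = "crosslingual_group_output_languages"


@dataclass(frozen=True)
class DatasetLanguages:
    """Input and output language sets for a single dataset_id."""
    input_languages: frozenset[str]
    output_languages: frozenset[str]

    @property
    def is_monolingual(self) -> bool:
        # One language, identical on the input and output side.
        return (
            len(self.input_languages) == 1
            and self.input_languages == self.output_languages
        )
```

A monolingual dataset then reports the same single language on both sides, while a cross-lingual one (e.g. `en` input, `fr` output) does not.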
@Mattdl Thank you very much for your thorough review! I have pushed a new commit that addresses your proposal. It introduces a new enum with the three aggregation modes you suggested, and it also generalizes `get_dataset_language()` into `get_dataset_languages()`, which now returns input and output language sets.
Between raising and filtering, I would lean toward filtering. I would suggest we consider the per-dataset raw results as the primary output artifact of a run, and the aggregations as derived. Since the evaluation loop is the expensive part, raw results should always be saved. Then, the user can re-aggregate with a different mode at any time. So rather than raising when datasets are incompatible with the aggregation mode, they should be skipped during aggregation. And if the user specifically wants only the aggregated results, we could support a filtering option that restricts execution to the datasets that would contribute to that aggregation, to avoid the computation of unnecessary evaluations.

Taking this a step further, if you are interested, I would be glad to work (perhaps in a different PR) on separating the reporting layer from the evaluation loop. The run would produce the primary per-dataset results, and then separate analysis modules would handle aggregation, visualization, formatting as Markdown/LaTeX tables, etc. These modules could be optionally invoked by the user in the experiment's config, or run a posteriori on previously stored results, making it possible to perform arbitrarily complex post-hoc analysis without complicating the evaluation loop. Let me know what you think!
Mattdl
left a comment
@federetyk this is such a great PR. Thanks again for the contributions and insightful ideas for next steps of the package!
I left some small comments. Just want to check with you on the filtering vs raising you mentioned (see my comment below), and then we can close this PR.
Current PR: Filtering vs raising
> I would suggest we consider the per-dataset raw results as the primary output artifact of a run, and the aggregations as derived. Since the evaluation loop is the expensive part, raw results should always be saved.
I completely agree with that philosophy! However, I think for the very same reason that the "evaluation loop is the expensive part", we should default to lazy execution: run only the relevant dataset_ids and tasks that make sense w.r.t. the chosen aggregation strategy. (Also see the following comment on checkpointing.)
> Since the evaluation loop is the expensive part, raw results should always be saved. Then, the user can re-aggregate with a different mode at any time.
I think here we can make use of the checkpointing. As our checkpointing is defined on the (task_name, dataset_id) level, we can perfectly do a partial execution by filtering.
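As a toy illustration of that partial execution (the names and data layout here are hypothetical, not workrb's real checkpointing API):

```python
# Checkpoints are keyed on (task_name, dataset_id), so a partial run just
# filters out the pairs that already have a stored result.
completed = {("skill2job", "en"), ("skill2job", "fr")}  # already checkpointed
requested = [("skill2job", "en"), ("skill2job", "de"), ("job2skill", "en")]

pending = [pair for pair in requested if pair not in completed]
print(pending)  # only the missing (task_name, dataset_id) pairs get executed
```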
This builds on the idea that a user has to always explicitly define which set of tasks (and models) they are interested in, and only the relevant ones will be executed. Hence, I completely agree with your following comment:
> So rather than raising when datasets are incompatible with the aggregation mode, they should be skipped during aggregation.
As a default, I would raise warnings for datasets skipped based on the aggregation strategy.
To double-check with you: currently we are already using the `(task_name, dataset_id)` skipping strategy, if I'm not mistaken? So in case you agree, we can keep the PR as is and maybe just add the warnings for skipped `dataset_id`s in execution.
Regarding next steps
Docs update
To keep the scope of this PR limited, I've created a separate issue #43 regarding documentation for the README and CONTRIBUTING guidelines that need to be updated w.r.t. this refactor.
New PR: Separated evaluation logic idea
> Taking this a step further, if you are interested, I would be glad to work (perhaps in a different PR) on separating the reporting layer from the evaluation loop.
I absolutely love that idea. One major thing to keep in mind is basically the same as I mentioned above: I'd focus on lazy execution depending on the chosen analysis. Additional evaluations are then just executed on the go if required.
For the user that doesn't want lazy execution, we could also introduce 2 execution modes of the evaluation loop, `ExecutionMode.LazyTasksRequiredByAggregation` or `ExecutionMode.AllTasks`, where the user can explicitly define the behavior.
I agree with you it's super valuable to be able to do any post-hoc analysis. By using the execution mode, all tasks are skipped by default if they're already present. Then we can perfectly have a separated reporting/aggregation layer on top.
Feel free to make an Issue or PR with how you see the main changes compared to current aggregation strategy. Really curious to have your view on that!
Skip datasets incompatible with the chosen LanguageAggregationMode before evaluation to avoid unnecessary compute. Extract get_language_grouping_key() as a shared function in types.py so the same compatibility logic is reused by both the eval-time filter (_filter_pending_work) and the aggregation-time skip in results.py.

- Add get_language_grouping_key() to types.py; refactor BenchmarkResults._get_language_grouping_key() to delegate to it
- Add ExecutionMode enum (LAZY / ALL)
- Add language_aggregation_mode and execution_mode params to evaluate()
- Add language_aggregation_mode param to get_summary_metrics()
- Export ExecutionMode and LanguageAggregationMode from workrb
- Add tests for the shared function and MELO-like filtering scenarios
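A minimal sketch of what that shared helper could do (modes are plain strings here for brevity; the real function in types.py takes the enum and its signature may differ):

```python
def get_language_grouping_key(input_languages, output_languages, mode):
    """Return the language key a dataset aggregates under, or None when the
    dataset is incompatible with `mode` and should be skipped or filtered.
    Sketch only: workrb's actual implementation may differ."""
    if mode == "monolingual_only":
        if len(input_languages) == 1 and input_languages == output_languages:
            return next(iter(input_languages))
        return None  # cross-lingual/multilingual datasets are skipped
    if mode == "crosslingual_group_input_languages":
        return "+".join(sorted(input_languages))
    if mode == "crosslingual_group_output_languages":
        return "+".join(sorted(output_languages))
    raise ValueError(f"unknown aggregation mode: {mode}")


print(get_language_grouping_key({"en"}, {"en"}, "monolingual_only"))  # en
print(get_language_grouping_key({"en"}, {"fr"}, "monolingual_only"))  # None
```

Because the eval-time filter and the aggregation-time skip call the same function, a dataset filtered before evaluation is exactly one that would have been dropped during aggregation anyway.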
@Mattdl Thanks for the detailed review. Hope the ever-growing scope of this refactor is not making you regret the invitation to contribute! :)

Regarding your question: the code you reviewed yesterday only filtered incompatible datasets at aggregation time, not at evaluation time. I have just pushed a new commit that addresses your suggestions. The default aggregation mode is `None`. With this, there are now two ways to control which datasets get evaluated:

```python
# 1. The `languages` parameter in the Task constructor controls which
#    dataset IDs are generated in the first place:
tasks = [
    ESCOJob2SkillRanking(split="test", languages=["en", "fr", "de", "nl"]),
    ...
]

# 2. The new parameters in evaluate() filter out datasets that would be
#    discarded by the chosen aggregation mode:
results = workrb.evaluate(
    model=model,
    tasks=tasks,
    output_folder="results/my_model",
    language_aggregation_mode=LanguageAggregationMode.MONOLINGUAL_ONLY,  # default is None
    execution_mode=ExecutionMode.LAZY,  # this is the default
)
```

The first mechanism predates this refactor and selects which datasets take part in the evaluation; the second narrows that set based on how the results will be aggregated. Does this match how you see the high-level API? If so, we could do some clean-up and documentation around this in the follow-up PR we discussed. If you prefer to merge the PR as it was before this commit and handle lazy execution in a separate PR, I would be happy to revert. Just let me know which you prefer.
> Thanks for the detailed review. Hope the ever-growing scope of this refactor is not making you regret the invitation to contribute! :)
On the contrary, the package has improved a lot thanks to your contributions!!🥇
`LanguageAggregationMode` None
For the `LanguageAggregationMode` we will always have to make a choice, so `None` for filtering would not be an option. The default would currently be `LanguageAggregationMode.MONOLINGUAL_ONLY`, as we didn't support cross-lingual before.
The reason we need to define this is that the aggregation must happen at some point when reporting the metrics, especially in `src/workrb/results.py`, in the `get_summary_metrics()` call.
The second choice a user now has is whether they still want to execute all tasks regardless of the chosen `LanguageAggregationMode`. The default `ExecutionMode.LAZY` will skip any cross-lingual entries that won't be relevant for `get_summary_metrics()`, while `ExecutionMode.ALL` executes all of them.
So the requested change here is that we don't allow `None` for language filtering; the mode must always be defined. Feel free to add your own viewpoint on this one!
Aggregation revision
The problem in `src/workrb/results.py` is that `get_summary_metrics()` now has a dependency on the `language_aggregation_mode`.
There are a couple of issues arising:
Problem 1
`get_summary_metrics()`: The `language_aggregation_mode` should not be optional but a required argument, to make the strategy explicit and avoid evaluation errors in the code.
Problem 2
The `BenchmarkResults.__str__` that is used in a typical user flow with the `print(results)` API won't work anymore, as we'll require an explicit definition of the language aggregation mode. Example:
```python
# 1. Setup model and tasks
model = workrb.models.BiEncoderModel("all-MiniLM-L6-v2")
tasks = [workrb.tasks.ESCOSkill2JobRanking(split="test", languages=["en"])]

# 2. Run the benchmark
agg_mode = LanguageAggregationMode.MONOLINGUAL_ONLY
results = workrb.evaluate(
    model,
    tasks,
    output_folder="results/demo",
    description="WorkRB demo",
    force_restart=True,
    language_aggregation_mode=agg_mode,
)  # -> Will already print the results via `reporting.py`

# (Optional) Display results
print("\nResults:")
print(results)  # -> This will not be possible anymore.
```
A possible solution would be to make the `LanguageAggregationMode` part of the results state, enabling `print(results)`. However, it becomes error-prone when prior results were run with a different `LanguageAggregationMode` than the original evaluation run. With the lazy execution mode, we would need a check that all necessary tasks have been executed. Not sure yet what the best solution would be here. In any case, we want to keep the API as clean as possible and easy to use for the user, so maybe using the default aggregation mode and documenting the default behavior is the better option.
Problem 3
For efficiency reasons, `src/workrb/metrics/reporting.py:format_results` recalculates the metrics on demand. But with the refactor this feels like an error-prone approach. Instead, we should simplify by moving the internals of `results.get_summary_metrics(...)` to a new function `results._get_summary_metrics`, which should return all 5 of `mean_per_task`, ..., `mean_per_language`.
- The original `results.get_summary_metrics(...)` would then just combine the results of the call to `results._get_summary_metrics`.
- `src/workrb/metrics/reporting.py:format_results` would also just call `results._get_summary_metrics`, so the logic lives in just 1 place.
We're currently working on the paper, so I want to recap the short-term prios.
Let's keep the separate reporting layer as a future step for now to maintain focus (: Super excited to be adding your features to the paper version of WorkRB (:🚀
…evaluate()

Address PR techwolf-ai#34 review feedback from @Mattdl. The aggregation mode now flows consistently through the entire evaluation and aggregation pipeline instead of being an optional parameter.
Override get_dataset_languages() and languages_to_dataset_ids() in the
freelancer candidate ranking tasks, replacing the Language.CROSS sentinel
with a proper DATASET_LANGUAGES_MAP that describes each dataset's input
and output languages.
This lets the aggregation and filtering logic handle the multilingual
dataset ("cross_lingual" renamed to "multilingual") as a regular entry
rather than a special-cased language enum value.
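A sketch of what such an override could look like (the map contents and helper bodies here are illustrative, not the actual task code in the PR):

```python
# Hypothetical per-dataset language map for a task with two monolingual
# datasets and one multilingual one (contents are illustrative only).
DATASET_LANGUAGES_MAP = {
    "en": ({"en"}, {"en"}),
    "fr": ({"fr"}, {"fr"}),
    "multilingual": ({"en", "fr", "de", "nl"}, {"en", "fr", "de", "nl"}),
}


def languages_to_dataset_ids(languages):
    """Keep every dataset whose input languages intersect the request."""
    requested = set(languages)
    return [
        dataset_id
        for dataset_id, (input_langs, _output_langs) in DATASET_LANGUAGES_MAP.items()
        if input_langs & requested
    ]


def get_dataset_languages(dataset_id):
    """Map a dataset back to its (input, output) language sets."""
    return DATASET_LANGUAGES_MAP[dataset_id]


print(languages_to_dataset_ids(["de"]))  # only the multilingual dataset matches
```

With this shape, the multilingual dataset participates in filtering and aggregation like any other entry, with no `Language.CROSS` sentinel needed.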
@Mattdl Thank you again for your feedback! In the latest commit, I have addressed your suggestions: the `language_aggregation_mode` parameter is now required rather than optional. Happy to adjust anything before merging.

Not a blocker for this PR, but I wanted to share some ideas for future discussion. I think this is also where some of my earlier confusion came from. As I see it, there are at least three alternatives for the aggregation chain:
I can see the case for options 2 and 3: they ensure consistency between the main hierarchy and the per-language view, since both operate on the same set of datasets. Under option 1, the benchmark score would include datasets that the per-language breakdown cannot account for, which could be confusing. That said, option 1 has the appeal of being a straightforward cascade of macro-averages where each dataset is an independent data point, with per-language aggregation as an independent view rather than part of the chain. I feel it is also more natural for users who prefer that perspective.

For now I have implemented option 2, which is how I understood your feedback. If you had option 3 in mind instead, it would be a small change. I can apply it before merging this PR.
Mattdl
left a comment
@federetyk thanks for the clear overview! 🚀 All looks good, I'd just make this final change if you agree:
I can definitely see the advantage of Option 1; as you mention, it would just use each `dataset_id` as a separate data point.
Regarding Options 2 and 3, my preference would be Option 3, as it gives an equally-weighted performance aggregation over languages.
So combining the 2 approaches, we could indeed leave a `LanguageAggregationMode` of `None` (for Option 1), although I'd prefer to make it more explicit with `LanguageAggregationMode.SKIP_LANGUAGE_AGGREGATION`, where we can also document the behavior for Option 1. I'd keep the default at `LanguageAggregationMode.MONOLINGUAL_ONLY`.
To document our final flow, this would result in the following 2 flows:
- `lang_agg_mode=LanguageAggregationMode.{MONOLINGUAL/CROSS_LINGUAL}`:
  `dataset` (filtered based on `lang_agg_mode` if `ExecutionMode.LAZY`, or not filtered if `ExecutionMode.ALL`) → `language` (follows the `lang_agg_mode` grouping criterion) → `task` → `task_group` → `benchmark`
- `lang_agg_mode=LanguageAggregationMode.SKIP_LANGUAGE_AGGREGATION`:
  `dataset` (no filtering regardless of `ExecutionMode`) → `language` (skipped) → `task` → `task_group` → `benchmark`
Per-task aggregation now groups datasets by language before averaging, giving equal weight to each language regardless of how many datasets it contains. Previously, all compatible datasets were flat-averaged, which over-represented languages with more datasets. Add SKIP_LANGUAGE_AGGREGATION mode for users who want the old flat average with no filtering and no per-language output.
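A small numeric sketch of the difference between the two averages (the scores are made up for illustration):

```python
from collections import defaultdict
from statistics import mean

# Made-up scores: 'en' has two datasets, 'fr' only one.
scores = {"en_a": 0.8, "en_b": 0.6, "fr_a": 0.9}
dataset_language = {"en_a": "en", "en_b": "en", "fr_a": "fr"}

# Old flat average: 'en' contributes two data points, so it dominates.
flat = mean(scores.values())

# New language-grouped average: average within each language first,
# then across languages, giving each language equal weight.
per_language = defaultdict(list)
for dataset_id, score in scores.items():
    per_language[dataset_language[dataset_id]].append(score)
grouped = mean(mean(values) for values in per_language.values())

print(f"{flat:.4f}")     # 0.7667
print(f"{grouped:.4f}")  # 0.8000
```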
Add four example scripts illustrating the two aggregation modes (language-weighted vs flat average) combined with task-level language filtering (selected subset vs all available languages).
@Mattdl I sincerely appreciate the time you have put into reviewing and directing this refactor! This was a prerequisite for including MELO in the benchmark, so it directly unblocks a contribution I care about. I think allowing the user to choose between the flat average and the language-grouped average is good. It makes the benchmark flexible enough to accommodate both perspectives we discussed, and users can pick the one that best fits their evaluation goals. The latest commits in the branch reflect your suggestions: in per-task aggregation, we now use language-grouped averaging in the way described as Option 3, and there is a new language aggregation mode, `SKIP_LANGUAGE_AGGREGATION`. Let me know if you would like me to change anything before we merge!
… UK as supported language
Addresses #33
Description
This PR generalizes dataset indexing within tasks from the `Language` enum to arbitrary string identifiers (`dataset_id`). The current architecture limits each task to at most one dataset per language, which prevents supporting tasks with multiple monolingual datasets per language, cross-lingual datasets, or multilingual datasets.

The refactor introduces a `languages_to_dataset_ids()` method with a default 1:1 mapping that preserves backward compatibility for existing tasks. Tasks that require more complex dataset structures can override this method to return custom identifiers. A new `get_dataset_language()` method maps datasets back to their language for proper per-language result aggregation, returning `None` for cross-lingual or multilingual datasets.

Changes:
- `lang_datasets: dict[Language, Dataset]` to `datasets: dict[str, Dataset]` in the Task base class
- `languages_to_dataset_ids(languages) -> list[str]` method with a default backward-compatible mapping
- `load_monolingual_data(language, split)` to `load_dataset(dataset_id, split)` across all tasks
- `get_dataset_language(dataset_id) -> Language | None` method for per-language aggregation
- `language` field to `MetricsResult` to track the dataset language
- `_aggregate_per_language()` to group by the `language` field, skipping datasets marked as cross-lingual or multilingual
- `examples/`

All tests pass, and the output of `examples/run_multiple_models.py` produces results consistent with the main branch.

Checklist