
Implement OmniSTEval evaluation #24

Open
pe-trik wants to merge 2 commits into hlt-mt:main from pe-trik:omnisteval

Conversation


pe-trik commented Mar 1, 2026

This pull request adds support for combined quality and latency evaluation using the OmniSTEval toolkit. It introduces a new CLI command and updates the documentation to describe this functionality.

The most important changes are:

Combined Quality and Latency Evaluation

  • Introduced a new script simulstream/metrics/score_omnisteval.py that provides a CLI (simulstream_run_omnisteval) to compute both quality (BLEU, chrF, COMET) and latency (LongYAAL, LongLAAL, LongAL, LongDAL, LongAP) metrics in a single run using OmniSTEval. The script handles input parsing and metric computation, and writes results in TSV and TXT formats.

CLI and Dependency Updates

  • Registered the new CLI command simulstream_run_omnisteval in the pyproject.toml entry points, making it available as a command-line tool. Feel free to change the CLI command name.
  • Added omnisteval as an evaluation dependency in pyproject.toml.
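The two `pyproject.toml` changes above might look roughly like the following fragment. This is a sketch, not the actual diff: the module path matches the script mentioned in the PR, but the `main` function name and the `evaluation` extra name are assumptions.

```toml
# Hypothetical pyproject.toml fragment; ":main" and the extra's
# name are illustrative — only the command name and module path
# come from the PR description.
[project.scripts]
simulstream_run_omnisteval = "simulstream.metrics.score_omnisteval:main"

[project.optional-dependencies]
evaluation = ["omnisteval"]
```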

Documentation Improvements

  • Updated README.md to replace the generic metrics log file name with inference_log_file.jsonl for clarity and consistency in all example commands. The original metrics.jsonl was a bit confusing. Feel free to revert this change.
  • Added a new section to the README.md describing how to use the combined evaluation command.

pe-trik mentioned this pull request Mar 1, 2026

mgaido91 commented Mar 2, 2026

I think the previous PR was better: we should keep score quality and score latency and just add the new metrics there. I am fine with using the dependency on omnisteval for doing so, even though I prefer the approach in #19. Besides, I would not add a new command that also computes metrics that can already be computed with the existing code.

Let me know if you want to proceed with the approach in #19 or whether you prefer to add the dependency on omnisteval, but let's keep things within the current framework of scoring computation, just adding a new scorer.
If you want, I can also help with the revisions requested in #19, just let me know. Thanks.


pe-trik commented Mar 2, 2026

Thanks for the feedback!

I completely understand your preference for keeping everything within the existing score quality and score latency framework. The main reason I proposed the combined command in this PR wasn't to disrupt the current setup, but to address a specific performance bottleneck.

This PR introduces the new SoftSegmenter. Under the current SimulStream framework, the resegmenter runs independently for each metric. Because the SoftSegmenter is computationally heavy, running it multiple times for a single evaluation is highly inefficient. The combined approach calculates the resegmentation once and uses it for all metrics, saving a lot of compute.
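The saving described here can be sketched with a small caching wrapper that runs the costly resegmentation once and lets every metric reuse the result. All names below are illustrative, not the actual SimulStream or OmniSTEval API:

```python
class CachingResegmenter:
    """Wraps an expensive resegmentation function so that repeated
    calls with the same inputs reuse a single computed result.

    This is a sketch of the "resegment once, score many times" idea
    from the discussion; the real SoftSegmenter interface may differ.
    """

    def __init__(self, resegment_fn):
        # resegment_fn: callable(hypotheses, references) -> segmented output
        self._resegment_fn = resegment_fn
        self._cache = {}

    def __call__(self, hypotheses, references):
        # Use immutable tuples of the inputs as the cache key.
        key = (tuple(hypotheses), tuple(references))
        if key not in self._cache:
            self._cache[key] = self._resegment_fn(hypotheses, references)
        return self._cache[key]
```

With this in place, each metric can call the wrapper independently while the heavy segmentation runs only once per input pair.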

Regarding the dependency on OmniSTEval over the approach in #19: bringing the code directly into SimulStream would mean maintaining two separate copies. OmniSTEval is designed to evaluate logs from outside SimulStream as well (e.g., older SimulEval logs or offline speech/text translation). Keeping it as a dependency ensures fair and 1:1 reproducibility across current and previous systems without fragmentation.

My ultimate goal is just to let SimulStream users benefit from OmniSTEval seamlessly via the SimulStream CLI. Do you think there is a way we can integrate OmniSTEval into the existing implementation without triggering the resegmenter multiple times?


mgaido91 commented Mar 2, 2026

> This PR introduces the new SoftSegmenter. Under the current SimulStream framework, the resegmenter runs independently for each metric. Because the SoftSegmenter is computationally heavy, running it multiple times for a single evaluation is highly inefficient. The combined approach calculates the resegmentation once and uses it for all metrics, saving a lot of compute.

I see your point, and it is valid also for the mwersegmenter or other segmenters we might add in the future. However, the resegmentation does not require costly resources (e.g., a GPU), so it is probably an inefficiency we can tolerate at this stage. One can also run the quality and latency evaluations in parallel.

In the future we might revisit the current organization of the code and introduce a single scoring entrypoint, where one can include both quality and latency metrics and compute them all in one shot. For the moment, though, I would go ahead with the current mechanism, so we can release a 0.4 version soon that can be used for IWSLT this year. Then we can work on the refactoring and, since it is a breaking change, release a 1.0 version in time for next year's campaign.

> Regarding the dependency on OmniSTEval over the approach in #19: bringing the code directly into SimulStream would mean maintaining two separate copies.

I am fine with your proposal, then: we can create a resegmenter class and a scoring class like in #19, but inside them call the functions from OmniSTEval. This would mean running the segmentation twice, as you mentioned, but, as I said before, this is not a big deal for the moment IMHO and we can work on that later.
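The proposed wrapping could look roughly like this. Everything here is a sketch: the base-class shape stands in for SimulStream's scorer interface, and the `omnisteval` import and function name are placeholders to be replaced with the package's real API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional, Sequence


class QualityScorer(ABC):
    """Stand-in for SimulStream's scorer interface; the real base
    class in the framework may differ."""

    @abstractmethod
    def score(self, hypotheses: Sequence[str], references: Sequence[str]) -> float:
        ...


class OmniSTEvalScorer(QualityScorer):
    """Scorer that delegates metric computation to OmniSTEval.

    `metric_fn` is injectable so the real omnisteval function can be
    plugged in; the default import below uses a hypothetical name and
    must be checked against the actual omnisteval package.
    """

    def __init__(self, metric_fn: Optional[Callable[..., float]] = None):
        if metric_fn is None:
            # Placeholder import: the real function name is unknown here.
            from omnisteval import compute_metric  # hypothetical
            metric_fn = compute_metric
        self._metric_fn = metric_fn

    def score(self, hypotheses, references):
        # Delegate the actual metric computation to OmniSTEval.
        return self._metric_fn(hypotheses, references)
```

Keeping the scorer class inside SimulStream while delegating the computation avoids the code duplication mentioned earlier in the thread.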

> My ultimate goal is just to let SimulStream users benefit from OmniSTEval seamlessly via the SimulStream CLI.

I would say that just having a CLI that runs OmniSTEval is not that useful: one can simply install OmniSTEval and run it directly at that point.
