
Implement OmniSTEval evaluation #24

Open
pe-trik wants to merge 2 commits into hlt-mt:main from pe-trik:omnisteval

Conversation


pe-trik commented Mar 1, 2026

This pull request adds support for combined quality and latency evaluation using the OmniSTEval toolkit. It introduces a new CLI command and updates the documentation to describe this functionality.

The most important changes are:

Combined Quality and Latency Evaluation

  • Introduced a new script simulstream/metrics/score_omnisteval.py that provides a CLI (simulstream_run_omnisteval) to compute both quality (BLEU, chrF, COMET) and latency (LongYAAL, LongLAAL, LongAL, LongDAL, LongAP) metrics in a single run using OmniSTEval. The script handles input parsing and metric computation, and writes results in TSV and TXT formats.

CLI and Dependency Updates

  • Registered the new CLI command simulstream_run_omnisteval in the pyproject.toml entry points, making it available as a command-line tool. Feel free to change the CLI command name.
  • Added omnisteval as an evaluation dependency in pyproject.toml.
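The two `pyproject.toml` changes above might look roughly like the following fragment. This is a sketch, not the actual diff: the module path matches the script mentioned in the PR, but the `main` function name and the `evaluation` extra name are assumptions.

```toml
# Hypothetical pyproject.toml fragment; ":main" and the extra's
# name are illustrative — only the command name and module path
# come from the PR description.
[project.scripts]
simulstream_run_omnisteval = "simulstream.metrics.score_omnisteval:main"

[project.optional-dependencies]
evaluation = ["omnisteval"]
```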

Documentation Improvements

  • Updated README.md to replace the generic metrics log file name with inference_log_file.jsonl for clarity and consistency in all example commands. The original metrics.jsonl was a bit confusing. Feel free to revert this change.
  • Added a new section to the README.md describing how to use the combined evaluation command.

pe-trik mentioned this pull request Mar 1, 2026

mgaido91 commented Mar 2, 2026

I think the previous PR was better: we should keep score quality and score latency and just add the new metrics there. I am fine with using the dependency on omnisteval for doing so, even though I prefer the approach in #19. Besides, I would not add a new command that also computes metrics that can already be computed with the existing code.

Let me know if you want to proceed with the approach in #19 or whether you prefer to add the dependency on omnisteval, but let's keep things within the current framework of scoring computation, just adding a new scorer.
If you want, I can also help with the revisions requested in #19, just let me know. Thanks.


pe-trik commented Mar 2, 2026

Thanks for the feedback!

I completely understand your preference for keeping everything within the existing score quality and score latency framework. The main reason I proposed the combined command in this PR wasn't to disrupt the current setup, but to address a specific performance bottleneck.

This PR introduces the new SoftSegmenter. Under the current SimulStream framework, the resegmenter runs independently for each metric. Because the SoftSegmenter is computationally heavy, running it multiple times for a single evaluation is highly inefficient. The combined approach calculates the resegmentation once and uses it for all metrics, saving a lot of compute.
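The saving described here can be sketched with a small caching wrapper that runs the costly resegmentation once and lets every metric reuse the result. All names below are illustrative, not the actual SimulStream or OmniSTEval API:

```python
class CachingResegmenter:
    """Wraps an expensive resegmentation function so that repeated
    calls with the same inputs reuse a single computed result.

    This is a sketch of the "resegment once, score many times" idea
    from the discussion; the real SoftSegmenter interface may differ.
    """

    def __init__(self, resegment_fn):
        # resegment_fn: callable(hypotheses, references) -> segmented output
        self._resegment_fn = resegment_fn
        self._cache = {}

    def __call__(self, hypotheses, references):
        # Use immutable tuples of the inputs as the cache key.
        key = (tuple(hypotheses), tuple(references))
        if key not in self._cache:
            self._cache[key] = self._resegment_fn(hypotheses, references)
        return self._cache[key]
```

With this in place, each metric can call the wrapper independently while the heavy segmentation runs only once per input pair.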

Regarding the dependency on OmniSTEval over the approach in #19: bringing the code directly into SimulStream would mean maintaining two separate copies. OmniSTEval is designed to evaluate logs from outside SimulStream as well (e.g., older SimulEval logs or offline speech/text translation). Keeping it as a dependency ensures fair and 1:1 reproducibility across current and previous systems without fragmentation.

My ultimate goal is just to let SimulStream users benefit from OmniSTEval seamlessly via the SimulStream CLI. Do you think there is a way we can integrate OmniSTEval into the existing implementation without triggering the resegmenter multiple times?


mgaido91 commented Mar 2, 2026

> This PR introduces the new SoftSegmenter. Under the current SimulStream framework, the resegmenter runs independently for each metric. Because the SoftSegmenter is computationally heavy, running it multiple times for a single evaluation is highly inefficient. The combined approach calculates the resegmentation once and uses it for all metrics, saving a lot of compute.

I see your point, and it is valid also for the mwersegmenter or other segmenters we might add in the future. However, the resegmentation does not require costly resources (e.g., a GPU), so it is probably an inefficiency we can tolerate at this stage. One can also run the quality and latency evaluations in parallel.

In the future we might revisit the current organization of the code and introduce a single scoring entrypoint, where one can include both quality and latency metrics and compute them all in one shot. For the moment, though, I would go ahead with the current mechanism, so we can release a 0.4 version soon that can be used for IWSLT this year. Then we can work on the refactoring and, since it is a breaking change, release a 1.0 version in time for next year's campaign.

> Regarding the dependency on OmniSTEval over the approach in #19: bringing the code directly into SimulStream would mean maintaining two separate copies.

I am fine with your proposal, then: we can create a resegmenter class and a scoring class like in #19, but inside them call the functions from OmniSTEval. This would mean running the segmentation twice, as you mentioned, but, as I said before, this is not a big deal for the moment IMHO and we can work on that later.
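The proposed wrapping could look roughly like this. Everything here is a sketch: the base-class shape stands in for SimulStream's scorer interface, and the `omnisteval` import and function name are placeholders to be replaced with the package's real API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional, Sequence


class QualityScorer(ABC):
    """Stand-in for SimulStream's scorer interface; the real base
    class in the framework may differ."""

    @abstractmethod
    def score(self, hypotheses: Sequence[str], references: Sequence[str]) -> float:
        ...


class OmniSTEvalScorer(QualityScorer):
    """Scorer that delegates metric computation to OmniSTEval.

    `metric_fn` is injectable so the real omnisteval function can be
    plugged in; the default import below uses a hypothetical name and
    must be checked against the actual omnisteval package.
    """

    def __init__(self, metric_fn: Optional[Callable[..., float]] = None):
        if metric_fn is None:
            # Placeholder import: the real function name is unknown here.
            from omnisteval import compute_metric  # hypothetical
            metric_fn = compute_metric
        self._metric_fn = metric_fn

    def score(self, hypotheses, references):
        # Delegate the actual metric computation to OmniSTEval.
        return self._metric_fn(hypotheses, references)
```

Keeping the scorer class inside SimulStream while delegating the computation avoids the code duplication mentioned earlier in the thread.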

> My ultimate goal is just to let SimulStream users benefit from OmniSTEval seamlessly via the SimulStream CLI.

I would say that just having a CLI that runs OmniSTEval is not that useful: one can simply install OmniSTEval and run it directly at that point.
