Conversation
I think the previous PR was better, as I think we should keep score quality and score latency and just add the metrics there. I am fine with using the dependency as well. Let me know if you want to proceed with the approach in #19 or whether you prefer to add the dependency on OmniSTEval.
Thanks for the feedback! I completely understand your preference for keeping everything within the existing score quality and score latency framework. The main reason I proposed the combined command in this PR wasn't to disrupt the current setup, but to address a specific performance bottleneck: running quality and latency evaluation separately triggers the resegmentation step twice. This PR introduces the new simulstream_run_omnisteval command, which computes both in a single run.

Regarding the dependency on OmniSTEval versus the approach in #19: bringing the code directly into SimulStream would mean maintaining two separate copies. OmniSTEval is designed to evaluate logs from outside SimulStream as well (e.g., older SimulEval logs or offline speech/text translation). Keeping it as a dependency ensures fair, 1:1 reproducibility across current and previous systems without fragmentation. My ultimate goal is just to let SimulStream users benefit from OmniSTEval seamlessly via the SimulStream CLI.

Do you think there is a way we can integrate OmniSTEval into the existing implementation without triggering the resegmenter multiple times?
I see your point, and it also applies to mwersegmenter or other segmenters we might add in the future. However, the resegmentation does not require costly resources (e.g., a GPU), so it is probably an inefficiency we can tolerate at this stage. One can also run the quality and latency evaluations in parallel. In the future we might revisit the current organization of the code and introduce a single scoring entrypoint, where one can include both quality and latency metrics and compute them all in one shot. For the moment, though, I would go ahead with the current mechanism, so we can soon release a 0.4 version that can be used for IWSLT this year. Then we can work on the refactoring and, since it is a breaking change, release a 1.0 version in time for next year's campaign.
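To illustrate the "run quality and latency in parallel" workaround mentioned above, here is a minimal Python sketch using the standard library. The two scoring functions are stand-ins, not the actual SimulStream entry points:

```python
# Sketch: run quality and latency scoring concurrently on the same log.
# score_quality/score_latency below are placeholders for the real scorers.
from concurrent.futures import ThreadPoolExecutor

def score_quality(log_path):
    # placeholder: would resegment and compute quality metrics on the log
    return {"bleu": 30.0}

def score_latency(log_path):
    # placeholder: would resegment and compute latency metrics on the log
    return {"LongAL": 1.9}

with ThreadPoolExecutor(max_workers=2) as pool:
    q = pool.submit(score_quality, "inference_log_file.jsonl")
    lat = pool.submit(score_latency, "inference_log_file.jsonl")
    results = {**q.result(), **lat.result()}

print(results)  # {'bleu': 30.0, 'LongAL': 1.9}
```

Each scorer would still resegment independently, which is exactly the tolerated inefficiency discussed above; parallelism only hides the wall-clock cost.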
I am fine with your proposal. We can then create a resegmenter class and a scoring class like in #19, but have them call the functions in OmniSTEval internally. This would mean running the segmentation twice, as you mentioned, but, as I said before, this is not a big deal for the moment IMHO, and we can work on it later.
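The wrapper idea above could look roughly like the following sketch. Since OmniSTEval's actual API is not shown in this thread, the metric function is injected as a callable here (an assumption for illustration only):

```python
# Sketch of a SimulStream-side scoring class that delegates the actual
# metric computation to a toolkit function (in practice, an OmniSTEval
# function; its name/signature is unknown here, so it is injected).
from typing import Callable, Dict, List

class DelegatingScorer:
    def __init__(self, metric_fn: Callable[[List[str], List[str]], float], name: str):
        self.metric_fn = metric_fn  # e.g., an OmniSTEval metric (hypothetical)
        self.name = name

    def score(self, hypotheses: List[str], references: List[str]) -> Dict[str, float]:
        # Delegate to the toolkit function and wrap the result.
        return {self.name: self.metric_fn(hypotheses, references)}

# Demo with a trivial stand-in metric (sentence-level exact-match rate):
def exact_match(hyps: List[str], refs: List[str]) -> float:
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)

scorer = DelegatingScorer(exact_match, "exact_match")
print(scorer.score(["a b", "c d"], ["a b", "x y"]))  # {'exact_match': 0.5}
```

A resegmenter class could follow the same delegation pattern, so SimulStream keeps its own class hierarchy while OmniSTEval remains the single source of truth for the computations.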
I would say that just having a CLI that runs OmniSTEval is not that useful: one can simply install OmniSTEval and run it directly at that point.
This pull request adds support for combined quality and latency evaluation using the OmniSTEval toolkit. It introduces a new CLI command and updates the documentation to describe this functionality.
The most important changes are:
Combined Quality and Latency Evaluation
- Added simulstream/metrics/score_omnisteval.py, which provides a CLI (simulstream_run_omnisteval) to compute both quality (BLEU, chrF, COMET) and latency (LongYAAL, LongLAAL, LongAL, LongDAL, LongAP) metrics in a single run using OmniSTEval. The script handles input parsing, metric computation, and outputs results in TSV and TXT formats.

CLI and Dependency Updates

- Registered simulstream_run_omnisteval in the pyproject.toml entry points, making it available as a command-line tool. Feel free to change the CLI command name.
- Added omnisteval as an evaluation dependency in pyproject.toml.

Documentation Improvements

- Updated README.md to replace the generic metrics log file name with inference_log_file.jsonl for clarity and consistency in all example commands. The original metrics.jsonl was a bit confusing. Feel free to revert this change.
- Added a section to README.md describing how to use the combined evaluation command.
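Since the script writes its results in TSV, the output step might look roughly like this sketch. The metric names and values are illustrative, not the actual schema produced by score_omnisteval.py:

```python
# Sketch of TSV output for a combined metrics run (illustrative values;
# the real column schema in score_omnisteval.py may differ).
import csv
import io

metrics = {"BLEU": 28.4, "chrF": 55.1, "LongAL": 1.92}

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(metrics.keys())    # header row: metric names
writer.writerow(metrics.values())  # value row: one combined run

print(buf.getvalue())
```

Writing a single header/value pair per run keeps the TSV trivially concatenable across systems, which fits the reproducibility goal discussed in the conversation above.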