Conversation
sarapapi left a comment
Looks good, just minor changes. Additionally, have you manually tested it?
Updated the docstring and yes, I also tested the code using a small sample XML.
@sarapapi Passes the checks now.
Thanks, there are still comments pending.
@retkowski could you please address the remaining requested changes by @mgaido91? |
@sarapapi I already added a new commit that incorporates the requested changes. However, I replied to two of his comments because I think further confirmation/clarification would be good.
@retkowski I do not see your comments, could you please check?
src/mcif/evaluation.py (Outdated)
src/mcif/evaluation.py (Outdated)
```python
audio_paths = {}
if hypo_path is not None:
    hypo_xml = ET.parse(hypo_path)
    for task in hypo_xml.getroot().iter("task"):
        if task.attrib['track'] == track and task.attrib['text_lang'] == language:
            for s in task.iter("sample"):
                audio_paths[s.attrib['id']] = s.find('audio_path').text
            break
```
Why do we need this? The audio_path is already in the reference; we do not need all of this.
src/mcif/evaluation.py (Outdated)
```python
for sid in sample_ids:
    if sid in audio_paths:
        sample_metadata['audio_path'] = audio_paths[sid]
```
No need for this. We can just edit line 152 to be:

```python
sample_metadata = {'audio_path': next(sample.iter('reference')).text}
```

or you can put it here if you prefer, but it is just this; there is no need to look at the hypo.
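As a minimal sketch of the suggested approach (using a hypothetical XML layout; the real MCIF schema may differ), `next(sample.iter('reference'))` grabs the first `<reference>` descendant of each sample, so no second pass over the hypothesis file is needed:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal XML resembling the reference format.
xml_text = """
<task track="ACHAP" text_lang="en">
  <sample id="s1">
    <reference>audio/sample_0001.wav</reference>
  </sample>
</task>
"""

root = ET.fromstring(xml_text)
for sample in root.iter("sample"):
    # next(sample.iter('reference')) returns the first <reference>
    # element below the sample, as in the reviewer's suggestion.
    sample_metadata = {"audio_path": next(sample.iter("reference")).text}
    print(sample_metadata)  # {'audio_path': 'audio/sample_0001.wav'}
```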
mgaido91 left a comment
@retkowski I tried this but I am having a hard time. I suggested a fix to solve a first problem, but then a second problem arises:
```
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/bin/mcif_eval", line 7, in <module>
    sys.exit(cli_script())
  File "/Users/mgaido/github/mcif/src/mcif/evaluation.py", line 446, in cli_script
    scores = main(
  File "/Users/mgaido/github/mcif/src/mcif/evaluation.py", line 410, in main
    scores.update(score_achap(base_ref_path, hypo, ref, lang))
  File "/Users/mgaido/github/mcif/src/mcif/evaluation.py", line 350, in score_achap
    results = evaluate_batch(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/evaluate.py", line 210, in evaluate_batch
    result = evaluate(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/evaluate.py", line 86, in evaluate
    hyp_timestamps = _timestamps_from_transcript(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/evaluate.py", line 311, in _timestamps_from_transcript
    result = parse_transcript(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/parsers.py", line 406, in parse_transcript
    titles, sections = parse_markdown(text)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/parsers.py", line 228, in parse_markdown
    sections.append(_tokenize_section(section_text))
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/parsers.py", line 121, in _tokenize_section
    return sent_tokenize(text)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/data.py", line 696, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource 'punkt_tab' not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')
```
```python
def score_achap(
    hypo_dict: Dict[str, str],
```

Suggested change:

```diff
-    hypo_dict: Dict[str, str],
+    base_ref_path: Path,
+    hypo_dict: Dict[str, str],
```
```python
ref_chapters = json.loads(ref_sample.reference)  # [[title, start_sec], ...]
ref_titles = [(t, float(s)) for t, s in ref_chapters]
ref_boundaries = [float(s) for _, s in ref_chapters]
audio_path = ref_sample.metadata["audio_path"]
```

Suggested change:

```diff
-audio_path = ref_sample.metadata["audio_path"]
+audio_path = base_ref_path / "LONG_AUDIOS" / ref_sample.metadata["audio_path"]
```
```python
ref_titles = [(t, float(s)) for t, s in ref_chapters]
ref_boundaries = [float(s) for _, s in ref_chapters]
audio_path = ref_sample.metadata["audio_path"]
duration = _audio_duration(audio_path)
```

Suggested change:

```diff
-duration = _audio_duration(audio_path)
+duration = _audio_duration(audio_path.absolute().as_posix())
```
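The two suggestions above compose `pathlib.Path` operations: joining the reference directory with a relative audio path, then converting to an absolute, forward-slash string for libraries that expect plain path strings. A minimal sketch (directory and file names here are hypothetical):

```python
from pathlib import Path

base_ref_path = Path("data/ref")          # hypothetical reference directory
audio_rel = "LONG_AUDIOS/talk_0001.wav"   # hypothetical relative audio path

# The / operator joins path components portably.
audio_path = base_ref_path / audio_rel

# absolute() anchors the path at the current working directory;
# as_posix() renders it with forward slashes on every OS.
audio_str = audio_path.absolute().as_posix()

assert audio_str.endswith("data/ref/LONG_AUDIOS/talk_0001.wav")
```

Keeping only relative paths in the XML and resolving them against the reference file's location is what makes the evaluation portable across machines.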
```python
    Main function computing all the scores and returning a Dictionary with the scores
    """
    hypo = read_hypo(hypo_path, track, lang)
    ref = read_reference(ref_path, track, lang, modality=filter_modality)
```

Suggested change:

```diff
-ref = read_reference(ref_path, track, lang, modality=filter_modality)
+base_ref_path = Path(ref_path).parent
```
```python
assert "TRANS" in ref.keys()
scores["TRANS-COMET"] = score_st(hypo, ref, lang)
if "ACHAP" in ref.keys():
    scores.update(score_achap(hypo, ref, lang))
```

Suggested change:

```diff
-    scores.update(score_achap(hypo, ref, lang))
+    scores.update(score_achap(base_ref_path, hypo, ref, lang))
```
I fixed the issue within chunkseg by automatically downloading the punkt_tab sentence tokenizer if not available. Updated the dependency in the requirements. Not entirely sure about the path issue; in my test, I provided an absolute path directly.
Cool, thanks!
I provided suggestions on how to fix it; if you just apply them, it will work. We cannot put absolute paths in the XML: those files have to run everywhere, on any cluster and laptop.
This PR adds the ACHAP task and its metrics to MCIF. Selected metrics include collar F1 for segmentation quality and BERTScore for title quality. Optionally, we can also get the WER of the transcript. It handles forced alignment automatically using torchaudio. Important: Markdown format is assumed for the structured transcript (i.e., "\n# Title\n" as separator between chapters).
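To illustrate the segmentation metric named in the description, here is a generic sketch of collar-based boundary F1 (the exact matching rules implemented in chunkseg may differ): a predicted chapter boundary counts as a true positive if it falls within a collar, in seconds, of a not-yet-matched reference boundary.

```python
def collar_f1(hyp_boundaries, ref_boundaries, collar=5.0):
    """Generic collar F1 over boundary times, in seconds.

    A hypothesis boundary is a true positive if it lies within
    `collar` seconds of a reference boundary; each reference
    boundary can be matched at most once.
    """
    matched_ref = set()
    tp = 0
    for h in hyp_boundaries:
        # Greedily match to the closest unmatched reference boundary
        # within the collar.
        best, best_dist = None, collar
        for i, r in enumerate(ref_boundaries):
            if i in matched_ref:
                continue
            d = abs(h - r)
            if d <= best_dist:
                best, best_dist = i, d
        if best is not None:
            matched_ref.add(best)
            tp += 1
    precision = tp / len(hyp_boundaries) if hyp_boundaries else 0.0
    recall = tp / len(ref_boundaries) if ref_boundaries else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Both predicted boundaries fall within 5 s of a reference boundary.
print(collar_f1([10.0, 62.0], [12.0, 60.0], collar=5.0))  # 1.0
```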