
Add ACHAP task / evaluation metrics#37

Open
retkowski wants to merge 8 commits into hlt-mt:main from
retkowski:achap

Conversation

@retkowski

This PR adds the ACHAP task and its metrics to MCIF. Selected metrics include collar F1 for segmentation quality and BERTScore for title quality. Optionally, we can also get the WER of the transcript. It handles forced alignment automatically using torchaudio. Important: Markdown format is assumed for the structured transcript (i.e., "\n# Title\n" as separator between chapters).
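The chapter format and the collar-based segmentation metric described above can be sketched as follows. This is a minimal illustration under stated assumptions, not MCIF's or chunkseg's actual implementation: `parse_chapters` and `collar_f1` are hypothetical names, and the 5-second collar is an assumed default.

```python
import re

def parse_chapters(transcript):
    """Split a Markdown-structured transcript into (title, body) pairs,
    treating "# Title" lines (i.e. the "\n# Title\n" separator) as chapter boundaries."""
    parts = re.split(r"^# (.+)$", transcript, flags=re.MULTILINE)
    # re.split with one capturing group yields [preamble, title1, body1, title2, body2, ...]
    return [(parts[i].strip(), parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]

def collar_f1(ref_boundaries, hyp_boundaries, collar=5.0):
    """Boundary-level F1: a hypothesis boundary counts as a true positive when
    it lies within +/- `collar` seconds of a not-yet-matched reference boundary."""
    if not ref_boundaries or not hyp_boundaries:
        return 0.0
    unmatched = list(ref_boundaries)
    true_positives = 0
    for b in sorted(hyp_boundaries):
        hit = next((r for r in unmatched if abs(r - b) <= collar), None)
        if hit is not None:
            unmatched.remove(hit)  # each reference boundary may be matched once
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(hyp_boundaries)
    recall = true_positives / len(ref_boundaries)
    return 2 * precision * recall / (precision + recall)
```

For example, with reference boundaries at 10 s and 50 s, a hypothesis at 12 s and 49 s scores an F1 of 1.0 under the 5-second collar, while a single hypothesis at 12 s has perfect precision but 0.5 recall.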

Collaborator

@sarapapi sarapapi left a comment


Looks good, just minor changes. Additionally, have you manually tested it?

@retkowski
Author

Updated the docstring and yes, I also tested the code using a small sample XML.

@retkowski
Author

@sarapapi Passes the checks now.

@sarapapi
Collaborator

> @sarapapi Passes the checks now.

Thanks, but there are still pending comments.

@sarapapi
Collaborator

@retkowski could you please address the remaining requested changes by @mgaido91?

@retkowski
Author

@sarapapi I already added a new commit that incorporates the requested changes. However, I replied to two of his comments because I think further confirmation/clarification would be good.

@mgaido91
Contributor

@retkowski I do not see your comments, could you please check?

@@ -1,3 +1,4 @@

Contributor


please remove this line

Comment on lines +148 to +155
audio_paths = {}
if hypo_path is not None:
    hypo_xml = ET.parse(hypo_path)
    for task in hypo_xml.getroot().iter("task"):
        if task.attrib['track'] == track and task.attrib['text_lang'] == language:
            for s in task.iter("sample"):
                audio_paths[s.attrib['id']] = s.find('audio_path').text
            break
Contributor


why do we need this? the audio_path is already in the reference. We do not need all this part.

Comment on lines +173 to +175
for sid in sample_ids:
    if sid in audio_paths:
        sample_metadata['audio_path'] = audio_paths[sid]
Contributor


No need for this. We can just edit line 152 to be:

sample_metadata = {'audio_path': next(sample.iter('reference')).text}

or you can put it here if you prefer, but it is just that; there is no need to look at the hypo.

Contributor

@mgaido91 mgaido91 left a comment


LGTM, thanks!

Contributor

@mgaido91 mgaido91 left a comment


@retkowski I tried this but I am having a hard time. I suggested a fix for the first problem, but then a second problem arises:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/bin/mcif_eval", line 7, in <module>
    sys.exit(cli_script())
  File "/Users/mgaido/github/mcif/src/mcif/evaluation.py", line 446, in cli_script
    scores = main(
  File "/Users/mgaido/github/mcif/src/mcif/evaluation.py", line 410, in main
    scores.update(score_achap(base_ref_path, hypo, ref, lang))
  File "/Users/mgaido/github/mcif/src/mcif/evaluation.py", line 350, in score_achap
    results = evaluate_batch(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/evaluate.py", line 210, in evaluate_batch
    result = evaluate(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/evaluate.py", line 86, in evaluate
    hyp_timestamps = _timestamps_from_transcript(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/evaluate.py", line 311, in _timestamps_from_transcript
    result = parse_transcript(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/parsers.py", line 406, in parse_transcript
    titles, sections = parse_markdown(text)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/parsers.py", line 228, in parse_markdown
    sections.append(_tokenize_section(section_text))
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/chunkseg/parsers.py", line 121, in _tokenize_section
    return sent_tokenize(text)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
  File "/opt/homebrew/Caskroom/miniforge/base/envs/mcif/lib/python3.10/site-packages/nltk/data.py", line 696, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource 'punkt_tab' not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')



def score_achap(
    hypo_dict: Dict[str, str],
Contributor


Suggested change
- hypo_dict: Dict[str, str],
+ base_ref_path: Path,
+ hypo_dict: Dict[str, str],

ref_chapters = json.loads(ref_sample.reference) # [[title, start_sec], ...]
ref_titles = [(t, float(s)) for t, s in ref_chapters]
ref_boundaries = [float(s) for _, s in ref_chapters]
audio_path = ref_sample.metadata["audio_path"]
Contributor


Suggested change
- audio_path = ref_sample.metadata["audio_path"]
+ audio_path = base_ref_path / "LONG_AUDIOS" / ref_sample.metadata["audio_path"]

ref_titles = [(t, float(s)) for t, s in ref_chapters]
ref_boundaries = [float(s) for _, s in ref_chapters]
audio_path = ref_sample.metadata["audio_path"]
duration = _audio_duration(audio_path)
Contributor


Suggested change
- duration = _audio_duration(audio_path)
+ duration = _audio_duration(audio_path.absolute().as_posix())

Main function computing all the scores and returning a Dictionary with the scores
"""
hypo = read_hypo(hypo_path, track, lang)
ref = read_reference(ref_path, track, lang, modality=filter_modality)
Contributor


Suggested change
- ref = read_reference(ref_path, track, lang, modality=filter_modality)
+ base_ref_path = Path(ref_path).parent

assert "TRANS" in ref.keys()
scores["TRANS-COMET"] = score_st(hypo, ref, lang)
if "ACHAP" in ref.keys():
scores.update(score_achap(hypo, ref, lang))
Contributor


Suggested change
- scores.update(score_achap(hypo, ref, lang))
+ scores.update(score_achap(base_ref_path, hypo, ref, lang))

@retkowski
Author

> @retkowski I tried this but I am having a hard time. I suggested a fix for the first problem, but then a second problem arises:
>
> import nltk
> nltk.download('punkt_tab')

I fixed the issue within chunkseg by automatically downloading the punkt_tab sentence tokenizer if not available. Updated the dependency in the requirements.
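The guard described above can be approximated with a small helper. This is only a sketch of the approach; `ensure_punkt_tab` is a hypothetical name, not the actual chunkseg code.

```python
def ensure_punkt_tab():
    """Download NLTK's punkt_tab tokenizer data on first use, so that
    sent_tokenize() does not raise LookupError in a fresh environment."""
    import nltk  # imported lazily so the guard costs nothing when unused
    try:
        # Same resource path that the traceback above shows nltk looking up.
        nltk.data.find("tokenizers/punkt_tab/english/")
    except LookupError:
        nltk.download("punkt_tab", quiet=True)
```

Calling `ensure_punkt_tab()` before the first `sent_tokenize()` makes the evaluation self-contained instead of requiring a manual `nltk.download` step.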

Not entirely sure about the path issue. In my test, I provided an absolute path directly.

@mgaido91
Contributor

> I fixed the issue within chunkseg by automatically downloading the punkt_tab sentence tokenizer if not available. Updated the dependency in the requirements.

Cool, thanks!

> Not entirely sure about the path issue. In my test, I provided an absolute path directly.

I provided suggestions on how to fix it; if you just apply them, it will work. We cannot put absolute paths in the XML: the evaluation has to run everywhere, on any cluster or laptop.
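Taken together, the suggestions resolve audio paths relative to the reference XML rather than storing absolute paths. A rough sketch, assuming the `LONG_AUDIOS` layout from the suggestions (`resolve_audio_path` is a hypothetical helper, not code from the PR):

```python
from pathlib import Path

def resolve_audio_path(ref_path, rel_audio_path):
    """Resolve an audio path stored relative to the reference XML file,
    mirroring the reviewer's base_ref_path / LONG_AUDIOS suggestion."""
    base_ref_path = Path(ref_path).parent
    audio_path = base_ref_path / "LONG_AUDIOS" / rel_audio_path
    # as_posix() gives a plain string, as libraries expecting str paths need
    return audio_path.absolute().as_posix()
```

Because the path is derived from wherever the reference XML lives, the same XML works unchanged on any machine.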
