Closed
28 commits
717fd70
Gemini infer scripts
zhopto3 Feb 4, 2026
16ddd25
wmt output for Gemini
zhopto3 Feb 4, 2026
0d12706
seed for reproducability
zhopto3 Feb 4, 2026
3d70b09
WMT output w consistent seed
zhopto3 Feb 4, 2026
866c1fd
fleurs output
zhopto3 Feb 5, 2026
c2b4dc2
add wmt and fleurs evals for gemini
JAVI897 Feb 5, 2026
9180939
add correlations analysis
JAVI897 Feb 3, 2026
ebf0024
remove merge with manifests (not needed)
JAVI897 Feb 3, 2026
c589068
fix item-level computation to compute Group-by-item Spearman
JAVI897 Feb 4, 2026
fd35732
add latex table, fix mcif item-level
JAVI897 Feb 4, 2026
1b20a12
Merge branch 'main' into add-gemini
Gldkslfmsd Feb 6, 2026
d82877b
infer code for openrouter gpt-audio
Gldkslfmsd Feb 6, 2026
7a8c7a2
gpt-audio processed wmt
Gldkslfmsd Feb 6, 2026
9606ed6
Convert empty outputs to str
zhopto3 Feb 9, 2026
0e850b1
Gemini Europarl
zhopto3 Feb 9, 2026
da89c6e
script to count duration of test audio files
Gldkslfmsd Feb 9, 2026
3a8f261
test and stat script finished and documented
Gldkslfmsd Feb 9, 2026
81461fe
add wmt evals for gpt-audio
JAVI897 Feb 9, 2026
e9206be
add europarl evals for gemini-2.5-flash
JAVI897 Feb 9, 2026
8761921
Merge pull request #36 from sarapapi/data-statistics
sarapapi Feb 16, 2026
c1bea46
Intermediate covost outputs WIP
zhopto3 Feb 16, 2026
ab9443f
gemini covost2 outputs
zhopto3 Feb 23, 2026
71217c0
add gemini evals on covost2
JAVI897 Feb 24, 2026
4f3aad2
gemini exception handling
zhopto3 Mar 3, 2026
5834bb2
remove gpt-audio results
zhopto3 Mar 3, 2026
d74fa3f
remove gpt audio wmt out
zhopto3 Mar 3, 2026
685fede
update readme
zhopto3 Mar 19, 2026
62f7901
remove gpt from infer
zhopto3 Mar 19, 2026
9 changes: 9 additions & 0 deletions README.md
@@ -57,6 +57,10 @@ and set `${H2T_DATADIR}` to the directory containing the corresponding audio fil
- **Emotion**: [`emotiontalk`](manifests/emotiontalk/README.md), [`mexpresso`](manifests/mexpresso/README.md)
- **Long-Form**: [`acl6060-long`](manifests/acl6060-long/README.md), [`acl6060-short`](manifests/acl6060-short/README.md), [`mcif-long`](manifests/mcif-long/README.md), [`mcif-short`](manifests/mcif-short/README.md)

Optionally:
- Check that every audio file in the dataset exists at the expected location: `tests/test_dataset.py`
- Compute test-set statistics: `tests/stat_dataset.py`
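
The existence check mentioned above can be sketched roughly as follows. This is only an illustration and not the contents of `tests/test_dataset.py`: the helper name `find_missing_audio` and the idea of passing a list of relative paths are assumptions, since the manifest format is not shown here. Only `${H2T_DATADIR}` comes from the README.

```python
from pathlib import Path


def find_missing_audio(datadir, relative_paths):
    """Return the relative paths that do not resolve to a file under datadir.

    Hypothetical helper: the caller is assumed to have already extracted
    the audio paths from a manifest, whose real format is not shown here.
    """
    root = Path(datadir)
    return [rel for rel in relative_paths if not (root / rel).is_file()]
```

A caller might pass the value of `H2T_DATADIR` as `datadir` and treat a non-empty return value as a failed check.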

### 2. Run inference

Run inference with the following command:
@@ -70,6 +74,11 @@ Run inference with the following command:
The full list of supported models can be obtained with `python infer.py -h`.
Supported benchmarks are listed above, while benchmark-specific language coverage is documented in the corresponding READMEs.

To use the Gemini API, set the `GEMINI_API_KEY` environment variable to your API key:
```
export GEMINI_API_KEY=<your-api-key>
```
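
On the Python side, a script can fail early with a clear message when the variable is missing. The snippet below is a minimal sketch under that assumption; the function name `get_gemini_api_key` is hypothetical and is not taken from `infer.py`.

```python
import os


def get_gemini_api_key(env=None):
    """Return the Gemini API key from the environment or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to exercise in tests. (Hypothetical helper, not the repo's code.)
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running inference."
        )
    return key
```

Reading the key once at startup keeps the failure close to the cause, rather than surfacing later as an authentication error from the API.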

### 3. Run evaluation

After generating model outputs, run the evaluation suite using the scripts in the `evaluation/` directory.
13,511 changes: 13,511 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/de_en/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 40.031, "chrF": 64.3759, "XCOMET": 0.9059, "XCOMET-QE": 0.8795, "RefMetricX_24": 3.635, "QEMetricX_24": 3.6569, "LinguaPy": 4.4482, "RefMetricX_24-Strict-linguapy": 4.5578, "QEMetricX_24-Strict-linguapy": 4.5728, "XCOMET-Strict-linguapy": 0.8677, "XCOMET-QE-Strict-linguapy": 0.8424}
15,531 changes: 15,531 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/en_de/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 34.9453, "chrF": 60.7662, "XCOMET": 0.9147, "XCOMET-QE": 0.9177, "RefMetricX_24": 2.1869, "QEMetricX_24": 2.1879, "LinguaPy": 1.526, "RefMetricX_24-Strict-linguapy": 2.4985, "QEMetricX_24-Strict-linguapy": 2.4942, "XCOMET-Strict-linguapy": 0.9031, "XCOMET-QE-Strict-linguapy": 0.9061}
15,531 changes: 15,531 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/en_es/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"XCOMET-QE": 0.8569, "QEMetricX_24": 3.566, "LinguaPy": 4.0757, "QEMetricX_24-Strict-linguapy": 4.3974, "XCOMET-QE-Strict-linguapy": 0.8284}
15,531 changes: 15,531 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/en_fr/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"XCOMET-QE": 0.8171, "QEMetricX_24": 3.6918, "LinguaPy": 1.9767, "QEMetricX_24-Strict-linguapy": 4.0718, "XCOMET-QE-Strict-linguapy": 0.8043}
15,531 changes: 15,531 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/en_it/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"XCOMET-QE": 0.8376, "QEMetricX_24": 3.5933, "LinguaPy": 3.8246, "QEMetricX_24-Strict-linguapy": 4.3602, "XCOMET-QE-Strict-linguapy": 0.8124}
15,531 changes: 15,531 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/en_nl/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"XCOMET-QE": 0.8926, "QEMetricX_24": 3.3753, "LinguaPy": 3.8954, "QEMetricX_24-Strict-linguapy": 4.1708, "XCOMET-QE-Strict-linguapy": 0.8612}
15,531 changes: 15,531 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/en_pt/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"XCOMET-QE": 0.8656, "QEMetricX_24": 4.1375, "LinguaPy": 3.9019, "QEMetricX_24-Strict-linguapy": 4.8819, "XCOMET-QE-Strict-linguapy": 0.8372}
15,531 changes: 15,531 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/en_zh/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 40.759, "chrF": 33.5414, "XCOMET": 0.8681, "XCOMET-QE": 0.8493, "RefMetricX_24": 2.0848, "QEMetricX_24": 2.2457, "LinguaPy": 0.1095, "RefMetricX_24-Strict-linguapy": 2.1023, "QEMetricX_24-Strict-linguapy": 2.2589, "XCOMET-Strict-linguapy": 0.8675, "XCOMET-QE-Strict-linguapy": 0.8489}
13,221 changes: 13,221 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/es_en/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 42.4109, "chrF": 66.8989, "XCOMET": 0.9022, "XCOMET-QE": 0.8847, "RefMetricX_24": 3.639, "QEMetricX_24": 3.4421, "LinguaPy": 4.4777, "RefMetricX_24-Strict-linguapy": 4.5582, "QEMetricX_24-Strict-linguapy": 4.3559, "XCOMET-Strict-linguapy": 0.8641, "XCOMET-QE-Strict-linguapy": 0.8476}
8,951 changes: 8,951 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/it_en/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 39.2412, "chrF": 65.1239, "XCOMET": 0.8851, "XCOMET-QE": 0.861, "RefMetricX_24": 3.6573, "QEMetricX_24": 3.4602, "LinguaPy": 5.2955, "RefMetricX_24-Strict-linguapy": 4.7895, "QEMetricX_24-Strict-linguapy": 4.5939, "XCOMET-Strict-linguapy": 0.8415, "XCOMET-QE-Strict-linguapy": 0.8169}
4,023 changes: 4,023 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/pt_en/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 54.4028, "chrF": 73.1512, "XCOMET": 0.9101, "XCOMET-QE": 0.871, "RefMetricX_24": 2.7843, "QEMetricX_24": 2.9325, "LinguaPy": 2.262, "RefMetricX_24-Strict-linguapy": 3.2472, "QEMetricX_24-Strict-linguapy": 3.3935, "XCOMET-Strict-linguapy": 0.8919, "XCOMET-QE-Strict-linguapy": 0.8542}
4,898 changes: 4,898 additions & 0 deletions evaluation/output_evals/covost2/gemini-2.5-flash/zh_en/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 23.4225, "chrF": 52.5903, "XCOMET": 0.7809, "XCOMET-QE": 0.8179, "RefMetricX_24": 4.683, "QEMetricX_24": 3.6345, "LinguaPy": 11.7191, "RefMetricX_24-Strict-linguapy": 6.9063, "QEMetricX_24-Strict-linguapy": 6.0799, "XCOMET-Strict-linguapy": 0.7026, "XCOMET-QE-Strict-linguapy": 0.7296}
2,631 changes: 2,631 additions & 0 deletions evaluation/output_evals/europarl_st/gemini-2.5-flash/de_en/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 29.5399, "chrF": 57.5177, "XCOMET": 0.8715, "XCOMET-QE": 0.8654, "RefMetricX_24": 4.5573, "QEMetricX_24": 4.0121, "LinguaPy": 0.3801, "RefMetricX_24-Strict-linguapy": 4.6187, "QEMetricX_24-Strict-linguapy": 4.0835, "XCOMET-Strict-linguapy": 0.8686, "XCOMET-QE-Strict-linguapy": 0.8625}
1,253 changes: 1,253 additions & 0 deletions evaluation/output_evals/europarl_st/gemini-2.5-flash/en_de/results.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1 @@
{"SacreBLEU": 32.1047, "chrF": 62.0079, "XCOMET": 0.9429, "XCOMET-QE": 0.95, "RefMetricX_24": 1.8137, "QEMetricX_24": 1.6219, "LinguaPy": 0.0798, "RefMetricX_24-Strict-linguapy": 1.8336, "QEMetricX_24-Strict-linguapy": 1.6416, "XCOMET-Strict-linguapy": 0.9421, "XCOMET-QE-Strict-linguapy": 0.9492}
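
Each summary line above is a single JSON object mapping metric names to scores, and several metrics appear both plainly and with a `-Strict-linguapy` suffix. The sketch below shows one way to read such a line and compute how far each strict variant moves from its base metric. The function name `strict_deltas` is hypothetical, and the example is a reader-side illustration, not part of the evaluation scripts.

```python
import json

SUFFIX = "-Strict-linguapy"


def strict_deltas(summary_line):
    """Map each base metric to (strict score - base score), rounded to 4 places."""
    scores = json.loads(summary_line)
    deltas = {}
    for name, strict_value in scores.items():
        if name.endswith(SUFFIX):
            base = name[: -len(SUFFIX)]
            # Skip strict metrics whose base metric is absent from the line.
            if base in scores:
                deltas[base] = round(strict_value - scores[base], 4)
    return deltas


# Abridged copy of the last summary line shown above.
line = '{"XCOMET": 0.9429, "XCOMET-Strict-linguapy": 0.9421, "LinguaPy": 0.0798}'
print(strict_deltas(line))
```

Because lower is better for the MetricX scores and higher is better for XCOMET, the sign of each delta has to be read per metric.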