Description
I was testing out the SketchBLEU implementation via CodeBLEU (from `validation/evaluation_scripts/codebleu`), and while I was able to get it running after a couple of modifications, I ran into an issue where the syntax match score is extremely low, even when evaluating two identical repos.
Steps to Reproduce
1. Clone the repo and set up the environment (conda environment details provided below).

2. Apply the following fixes to make CodeBLEU run without errors:

   - In `CodeS/validation/evaluation_scripts/codebleu/codebleu/syntax_match.py`, update the `to_str` function in the `FileOrNode` class:

     ```python
     field_names.append(cursor.current_field_name)  # instead of cursor.field_name
     ```

   - In `CodeS/validation/evaluation_scripts/codebleu/codebleu/__main__.py`, update the `main` function to properly handle `repo_bleu` runs:

     ```python
     def main(
         ref_files: List[str],
         hyp_file: str,
         lang: str,
         weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
         repo_bleu: bool = False,
     ) -> None:
         if repo_bleu:
             repo_bleu_score = calc_repobleu(
                 [Path(ref_file) for ref_file in ref_files],
                 [Path(hyp_file)],
                 lang,
                 weights=weights,
             )
             print("Repo-level CodeBLEU score: ", repo_bleu_score)
         else:
             code_bleu_score = calc_codebleu(
                 references,
                 hypothesis,
                 lang,
                 weights=weights,
             )
             ...
     ```

3. Create two dummy repos with identical files:

   - `tmp.py`:

     ```python
     # This script reads from 'input.txt' and writes its content to 'output.txt'
     with open('input.txt', 'r') as infile:
         data = infile.read()
     with open('output.txt', 'w') as outfile:
         outfile.write(data)
     ```

   - `codebleu.py` (copied directly from this repo's `codebleu.py`)

4. Run the command:

   ```shell
   python -m codebleu --refs "../repo_1" --hyp "../repo_2" --lang python --repo
   ```
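For convenience, the dummy-repo setup in step 3 can also be scripted. This is just a sketch of my setup, not part of the repo: the repo names mirror the `--refs`/`--hyp` arguments, and `tempfile` is used so nothing is written next to the checkout.

```python
import tempfile
from pathlib import Path

# Contents of tmp.py, exactly as in step 3 above.
TMP_PY = """\
# This script reads from 'input.txt' and writes its content to 'output.txt'
with open('input.txt', 'r') as infile:
    data = infile.read()
with open('output.txt', 'w') as outfile:
    outfile.write(data)
"""

def make_dummy_repo(root: Path) -> Path:
    """Create a one-file repo containing tmp.py."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "tmp.py").write_text(TMP_PY)
    return root

# Two byte-identical repos, matching the --refs / --hyp arguments.
base = Path(tempfile.mkdtemp())
repo_1 = make_dummy_repo(base / "repo_1")
repo_2 = make_dummy_repo(base / "repo_2")
```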
Expected Behavior
Since both repos are identical, I expected all scores (including syntax match) to be 1.0 (or very close to it).
Actual Behavior
The output was:
```
Repo-level CodeBLEU score:  {
    'codebleu': np.float64(0.7503026634382567),
    'ngram_match_score': 1.0,
    'weighted_ngram_match_score': 1.0,
    'syntax_match_score': 0.0012106537530266344,
    'dataflow_match_score': np.float64(1.0)
}
```
The syntax match score is ~0.001, even though the repos are identical.
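For reference, a back-of-the-envelope version of subtree matching (trees as nested tuples, not the actual tree-sitter ASTs the implementation uses) shows that identical trees should score exactly 1.0:

```python
def subtrees(tree):
    """Yield every subtree of a nested-tuple tree: (label, child, child, ...)."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)

def syntax_match(reference, hypothesis):
    """Fraction of reference subtrees that also occur in the hypothesis."""
    ref_subtrees = list(subtrees(reference))
    hyp_subtrees = set(subtrees(hypothesis))  # tuples are hashable
    matched = sum(1 for s in ref_subtrees if s in hyp_subtrees)
    return matched / len(ref_subtrees)

# Toy AST roughly shaped like tmp.py above.
tree = ("module",
        ("with", ("open", "'input.txt'")),
        ("with", ("open", "'output.txt'")))
print(syntax_match(tree, tree))  # identical trees -> 1.0
```

So whatever the repo-level code is doing, a score of ~0.001 for byte-identical inputs points at the matching or aggregation step, not at the trees themselves.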
Environment
Conda environment (relevant libraries):
```
python             3.11.13
numpy              2.3.3
tree-sitter        0.20.1
types-tree-sitter  0.20.1.20240311
codebleu           0.4.0
```
Notes
- This might be related to the `tree-sitter` API changes (hence the fix needed in `syntax_match.py`).
- Possibly the repo-level aggregation is not correctly handling syntax trees across files.
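Purely to illustrate the second point, here is a toy model (not the actual repo-level code, and the "subtrees" are just strings): if per-file match counts are summed over all reference/hypothesis file pairs instead of only the corresponding pairs, the denominator grows with the number of pairs and the score collapses even for identical repos.

```python
def pair_score(ref_files, hyp_files, paired=True):
    """ref_files / hyp_files: lists of per-file subtree sets."""
    matched = total = 0
    if paired:
        pairs = list(zip(ref_files, hyp_files))  # file i vs file i only
    else:
        pairs = [(r, h) for r in ref_files for h in hyp_files]  # all-vs-all
    for ref, hyp in pairs:
        matched += len(ref & hyp)  # subtrees shared by this pair
        total += len(ref)          # denominator grows with every pair
    return matched / total

# Two files with disjoint subtree sets, compared against themselves.
repo = [{"a1", "a2"}, {"b1", "b2"}]
print(pair_score(repo, repo, paired=True))   # 1.0
print(pair_score(repo, repo, paired=False))  # 0.5: cross-file pairs share nothing
```

With more files (and real repos have many), the all-vs-all variant keeps shrinking toward zero, which is at least consistent with the ~0.001 I observed.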
Could you clarify whether this is expected behavior or if the syntax match computation is incorrect at the repo level?