Unexpectedly Low Syntax Match Score in Repo-Level CodeBLEU for Identical Repos #3

@StwayneXG

Description

I was testing the SketchBLEU implementation via CodeBLEU (from validation/evaluation_scripts/codebleu). After a couple of modifications I was able to get it running, but I ran into an issue where the syntax match score is extremely low, even when evaluating two identical repos.


Steps to Reproduce

  1. Clone the repo and set up the environment (conda environment details provided below).

  2. Apply the following fixes to make CodeBLEU run without errors:

    • In CodeS/validation/evaluation_scripts/codebleu/codebleu/syntax_match.py, update the to_str function in the FileOrNode class:

      field_names.append(cursor.current_field_name)  # instead of cursor.field_name
    • In CodeS/validation/evaluation_scripts/codebleu/codebleu/__main__.py, update the main function to properly handle repo_bleu runs:

      def main(
          ref_files: List[str],
          hyp_file: str,
          lang: str,
          weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
          repo_bleu: bool = False,
      ) -> None:
          if repo_bleu:
              repo_bleu_score = calc_repobleu(
                  [Path(ref_file) for ref_file in ref_files],
                  [Path(hyp_file)],
                  lang,
                  weights=weights,
              )
              print("Repo-level CodeBLEU score: ", repo_bleu_score)
          else:
              code_bleu_score = calc_codebleu(
                  references,
                  hypothesis,
                  lang,
                  weights=weights,
              )
              ...
  3. Create two dummy repos with identical files:

    tmp.py

    # This script reads from 'input.txt' and writes its content to 'output.txt'
    
    with open('input.txt', 'r') as infile:
        data = infile.read()
    
    with open('output.txt', 'w') as outfile:
        outfile.write(data)

    codebleu.py
    (copied directly from this repo’s codebleu.py)

  4. Run the command:

    python -m codebleu --refs "../repo_1" --hyp "../repo_2" --lang python --repo
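For reference, the fix from step 2 can be made tolerant of different py-tree-sitter versions with a small shim. This is just a sketch; `get_field_name` is a hypothetical helper, not part of CodeS or the codebleu package, and the dummy cursor classes below only stand in for the real `TreeCursor` API shapes:

```python
def get_field_name(cursor):
    """Return the field name of the cursor's current node, trying the
    attribute names used by different py-tree-sitter releases."""
    for attr_name in ("field_name", "current_field_name"):
        if hasattr(cursor, attr_name):
            attr = getattr(cursor, attr_name)
            # Some releases expose this as a method, others as a property.
            return attr() if callable(attr) else attr
    return None


# Dummy stand-ins for the API shapes, for illustration only.
class OldMethodCursor:
    def current_field_name(self):  # older API: method
        return "body"

class NewCursor:
    field_name = "body"            # newer API: attribute

print(get_field_name(OldMethodCursor()))  # body
print(get_field_name(NewCursor()))        # body
```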

Expected Behavior

Since both repos are identical, I expected all scores (including syntax match) to be 1.0 (or very close to it).

Actual Behavior

The output was:

Repo-level CodeBLEU score:  {
    'codebleu': np.float64(0.7503026634382567),
    'ngram_match_score': 1.0,
    'weighted_ngram_match_score': 1.0,
    'syntax_match_score': 0.0012106537530266344,
    'dataflow_match_score': np.float64(1.0)
}

The syntax match score is ~0.001, even though the repos are identical.
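For context on why 1.0 is the expected value: CodeBLEU's syntax match is, roughly, the fraction of reference AST subtrees that also occur in the hypothesis AST, so identical code should match completely. A toy sketch of that idea, using nested tuples in place of tree-sitter nodes (not the library's actual code):

```python
def subtrees(tree):
    """Enumerate all subtrees of a tree given as (label, *children)."""
    out = [tree]
    for child in tree[1:]:
        if isinstance(child, tuple):  # leaves (strings) are not subtrees
            out.extend(subtrees(child))
    return out

def syntax_match(ref, hyp):
    """Fraction of reference subtrees that appear in the hypothesis."""
    ref_subtrees = subtrees(ref)
    hyp_subtrees = subtrees(hyp)
    matched = sum(1 for t in ref_subtrees if t in hyp_subtrees)
    return matched / len(ref_subtrees)

# A toy AST loosely resembling the two `with open(...)` blocks in tmp.py.
ast = ("module", ("with", ("call", "open")), ("with", ("call", "open")))
print(syntax_match(ast, ast))  # 1.0 for identical trees
```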


Environment

Conda environment (relevant libraries):

python 3.11.13
numpy 2.3.3
tree-sitter 0.20.1
types-tree-sitter 0.20.1.20240311
codebleu 0.4.0

Notes

  • This might be related to the tree-sitter API changes (hence the fix needed in syntax_match.py).
  • Possibly the repo-level aggregation is not correctly handling syntax trees across files.
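To illustrate the second hypothesis: if the repo-level aggregation pairs each reference file with the wrong hypothesis file when summing matched-subtree counts, the ratio collapses even for identical repos. The numbers below are purely illustrative, not taken from the codebase:

```python
# Hypothetical per-file (matched_subtrees, total_subtrees) when each
# reference file is compared against its correct counterpart.
per_file = {"tmp.py": (40, 40), "codebleu.py": (960, 960)}

matched = sum(m for m, _ in per_file.values())
total = sum(t for _, t in per_file.values())
print(matched / total)  # 1.0 with correct pairing

# If the files are mis-paired (tmp.py's subtrees searched in
# codebleu.py's tree and vice versa), almost nothing matches:
mispaired_matched = 1 + 1  # only trivial subtrees happen to overlap
print(mispaired_matched / total)  # 0.002
```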

Could you clarify whether this is expected behavior or if the syntax match computation is incorrect at the repo level?
