Dataset generation is not fully reproducible — local output cannot cover all HuggingFace samples

When reproducing the dataset generation pipeline (`data_gen.cpp` → `data_verbalize.py`), the locally generated data does not fully match the published HuggingFace dataset (causal-nlp/corr2cause).

Environment:

- OS: Windows 11
- Compiler: g++ 15.2.0 (MSYS2 MinGW)

For example, for node=4:

Category | Count
-- | --
Matched pairs (both sides) | 472
Only in local | 644
Only in HF | 248



All premises match between local and HF. The mismatch is entirely in hypotheses.
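
A comparison of this kind can be sketched roughly as follows. This is a minimal illustration, not the exact script behind the table: the field names `premise` and `hypothesis`, the helper names `normalize` and `compare_pairs`, and the toy rows are all assumptions, and the real column names in the local output or the HuggingFace dataset may differ.

```python
from typing import Dict, Iterable, Mapping, Set, Tuple


def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting differences are not counted as mismatches."""
    return " ".join(text.split())


def to_pairs(rows: Iterable[Mapping[str, str]]) -> Set[Tuple[str, str]]:
    """Build the set of (premise, hypothesis) pairs from rows.

    The keys "premise" and "hypothesis" are hypothetical field names.
    """
    return {(normalize(r["premise"]), normalize(r["hypothesis"])) for r in rows}


def compare_pairs(
    local_rows: Iterable[Mapping[str, str]],
    hf_rows: Iterable[Mapping[str, str]],
) -> Dict[str, object]:
    """Count pairs present on both sides, only locally, and only in the HF data."""
    local_pairs = to_pairs(local_rows)
    hf_pairs = to_pairs(hf_rows)
    # Premise-level check: compare the sets of distinct premises on each side.
    local_premises = {premise for premise, _ in local_pairs}
    hf_premises = {premise for premise, _ in hf_pairs}
    return {
        "matched": len(local_pairs & hf_pairs),
        "only_local": len(local_pairs - hf_pairs),
        "only_hf": len(hf_pairs - local_pairs),
        "premises_match": local_premises == hf_premises,
    }


if __name__ == "__main__":
    # Toy rows for illustration only; they are not taken from the dataset.
    local = [
        {"premise": "P1", "hypothesis": "H1"},
        {"premise": "P1", "hypothesis": "H2"},
    ]
    hf = [
        {"premise": "P1", "hypothesis": "H1"},
        {"premise": "P1", "hypothesis": "H3"},
    ]
    print(compare_pairs(local, hf))
```

Comparing sets of pairs rather than row-by-row output keeps the check independent of generation order, so only the choice of hypotheses shows up as a difference.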

I understand the hypothesis mismatch is likely caused by randomness during generation, but my goal is to fully reproduce the published HuggingFace dataset in order to verify the correctness of every (premise, hypothesis, label) triple end-to-end.

Could you help me figure out how to fully reproduce the published HuggingFace dataset so that my local output covers all the samples?

