When reproducing the dataset generation pipeline (data_gen.cpp → data_verbalize.py), the locally generated data does not fully match the published HuggingFace dataset (causal-nlp/corr2cause).
Environment:
OS: Windows 11
Compiler: g++ 15.2.0 (MSYS2 MinGW)
For example, for node=4:
| Category | Count |
| --- | --- |
| Matched pairs (both sides) | 472 |
| Only in local | 644 |
| Only in HF | 248 |
All premises match between local and HF. The mismatch is entirely in hypotheses.
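For reference, a minimal sketch of how the overlap counts above can be computed from two sets of (premise, hypothesis) pairs (the function name and the toy inputs are hypothetical, not the real data):

```python
def overlap_counts(local_pairs, hf_pairs):
    """Compare two collections of (premise, hypothesis) pairs
    and report matched / local-only / HF-only counts."""
    local, hf = set(local_pairs), set(hf_pairs)
    return {
        "matched": len(local & hf),     # present on both sides
        "only_local": len(local - hf),  # generated locally but absent from HF
        "only_hf": len(hf - local),     # in HF but not reproduced locally
    }

# Tiny illustrative inputs (not the real data).
local = [("P1", "A causes B"), ("P1", "B causes A"), ("P2", "A causes C")]
hf    = [("P1", "A causes B"), ("P2", "C causes A")]
print(overlap_counts(local, hf))  # {'matched': 1, 'only_local': 2, 'only_hf': 1}
```

Since all premises agree, the same comparison keyed on hypotheses alone gives the same picture.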
I understand the hypothesis mismatch is likely caused by randomness during generation, but my goal is to reproduce the published HuggingFace dataset exactly, so that every (premise, hypothesis, label) triple can be verified end-to-end.
Could you help me figure out how to achieve a full reproduction, so that my local output covers all of the published samples?