
Dataset generation is not fully reproducible — local output cannot cover all HuggingFace samples #9

@passerby-bye

When reproducing the dataset generation pipeline (data_gen.cpp → data_verbalize.py), the locally generated data does not fully match the published HuggingFace dataset (causal-nlp/corr2cause).

Environment:

OS: Windows 11
Compiler: g++ 15.2.0 (MSYS2 MinGW)

For example, for node=4:

| Category | Count |
| --- | --- |
| Matched pairs (both sides) | 472 |
| Only in local | 644 |
| Only in HF | 248 |

All premises match between local and HF. The mismatch is entirely in hypotheses.
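For reference, a comparison of this kind can be sketched as below. This is a minimal illustration, not the exact script used for the counts above: the `compare_pairs` and `normalize` helpers and the toy data are hypothetical, and the sketch assumes each sample has already been loaded as a (premise, hypothesis) pair of strings.

```python
from typing import Iterable, NamedTuple

Pair = tuple[str, str]  # (premise, hypothesis)


class Overlap(NamedTuple):
    matched: int
    only_local: int
    only_hf: int


def normalize(text: str) -> str:
    """Collapse whitespace so line endings or double spaces do not cause false mismatches."""
    return " ".join(text.split())


def compare_pairs(local: Iterable[Pair], hf: Iterable[Pair]) -> Overlap:
    """Count pairs present on both sides, only locally, and only in the HF data."""
    local_set = {(normalize(p), normalize(h)) for p, h in local}
    hf_set = {(normalize(p), normalize(h)) for p, h in hf}
    return Overlap(
        matched=len(local_set & hf_set),
        only_local=len(local_set - hf_set),
        only_hf=len(hf_set - local_set),
    )


# Toy data standing in for the real local output and the HF split (hypothetical).
local = [("P1", "H1"), ("P1", "H2"), ("P2", "H3")]
hf = [("P1", "H1"), ("P2", "H4")]

result = compare_pairs(local, hf)
print(result)  # Overlap(matched=1, only_local=2, only_hf=1)
```

Normalizing whitespace before comparing matters on Windows in particular, where `\r\n` line endings could otherwise make identical samples look different.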

I understand the hypothesis mismatch is likely caused by randomness during generation, but my goal is to fully reproduce the published HuggingFace dataset in order to verify the correctness of every (premise, hypothesis, label) triple end-to-end.
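To illustrate the kind of randomness I suspect, here is a small sketch. It assumes, purely for illustration, that hypotheses are drawn by random sampling from a larger candidate pool; whether data_gen.cpp actually works this way, or is seeded at all, is an assumption on my part. The `sample_hypotheses` function and the candidate names are hypothetical.

```python
import random


def sample_hypotheses(candidates: list[str], k: int, seed: int) -> list[str]:
    """Draw k hypotheses using a private, explicitly seeded generator."""
    rng = random.Random(seed)  # local RNG, independent of global state
    return sorted(rng.sample(candidates, k))


# Hypothetical candidate pool; the real pipeline's candidates are not shown here.
candidates = [f"hypothesis_{i}" for i in range(20)]

run_a = sample_hypotheses(candidates, 8, seed=0)
run_b = sample_hypotheses(candidates, 8, seed=0)
run_c = sample_hypotheses(candidates, 8, seed=1)

print(run_a == run_b)  # True: same seed, same sample
print(len(set(run_a) & set(run_c)), "hypotheses shared between seed 0 and seed 1")
```

With a different seed the two runs typically share only part of their samples, which would look like the partial overlap reported above. Note that in C++ even a fixed seed may not carry across toolchains, since the results of `std::shuffle` and the standard distributions are implementation-defined.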

Could you help me figure out how to fully reproduce the published HuggingFace dataset so that my local output covers all the samples?
