Streaming validation dataset will lead to infinite loop #42

@isaacveg

In train_diloco_torch.py, the validation set is loaded with streaming=True (whenever c4_tiny is not set).
This means evaluation will run indefinitely, since an IterableDataset does not define len().

ds = (
    load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)
    if c4_tiny
    else load_dataset(
        "allenai/c4",
        "en",
        streaming=True,
        data_files={
            "train": "en/c4-train.*.json.gz",
            "validation": "en/c4-validation.00000-of-00008.json.gz",
        },
    )
)

Two possible fixes: evaluate perplexity on a fixed number of samples (for example, 1000), or simply load the validation dataset with streaming=False.
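
A minimal sketch of the first option, capping evaluation at a fixed number of samples. It is not the code from train_diloco_torch.py: `endless_stream`, `evaluate`, and `loss_fn` are hypothetical names, and the generator only stands in for a streamed validation split. The cap uses `itertools.islice` from the standard library, which works on any iterable; the `datasets` library's `IterableDataset.take(n)` should serve the same purpose on the real dataset.

```python
import math
from itertools import islice


def endless_stream():
    """Hypothetical stand-in for a streamed validation split with no length."""
    i = 0
    while True:
        yield {"text": f"sample {i}"}
        i += 1


def evaluate(stream, loss_fn, max_samples=1000):
    """Average loss_fn over at most max_samples items and return
    (mean_loss, perplexity, samples_seen)."""
    total = 0.0
    seen = 0
    # islice stops after max_samples items, so the loop ends
    # even if the underlying stream never does.
    for sample in islice(stream, max_samples):
        total += loss_fn(sample)
        seen += 1
    if seen == 0:
        raise ValueError("validation stream yielded no samples")
    mean_loss = total / seen
    return mean_loss, math.exp(mean_loss), seen


if __name__ == "__main__":
    # Dummy constant loss; a real loss_fn would run the model on the sample.
    mean_loss, ppl, seen = evaluate(endless_stream(), lambda s: 2.0)
    print(f"samples={seen} loss={mean_loss:.3f} ppl={ppl:.3f}")
```

Capping the sample count keeps the memory benefits of streaming, while streaming=False trades a full download of the validation shard for a dataset with a known length.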
