Run output size can exceed 50 GB – investigate and reduce disk usage #66

Description

@floo-dck

Problem

A single AutoRecLab run can produce an output directory of 50 GB or more.
This is far too large for most machines and makes it impractical to store or
compare multiple experiments.

Expected behaviour

A single run should only consume storage proportional to the actual model
outputs (generated code, results, statistics). Storage should not silently
balloon due to implementation artefacts.

Likely causes to investigate

  • Checkpoints: The checkpoint/ subfolder saves the full workspace state
    for every tree node. If the agent downloads or generates large datasets, every
    node checkpoint duplicates that data.
  • Workspace accumulation: The workspace/ directory accumulates all files
    produced by the generated code (downloaded datasets, trained model weights,
    intermediate files) and is never pruned between nodes.
  • keep_only_relevant_files = false (current default): All intermediate
    files are retained. Setting it to true deletes some artefacts, but
    apparently not enough to keep the run size in check.
  • No size cap / warning: There is currently no mechanism to alert the user
    when the output directory exceeds a configurable threshold.
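
To help narrow down which of the causes above dominates, and to illustrate what a size warning could look like, here is a minimal sketch. It is not existing AutoRecLab code: the function names (dir_size_bytes, size_by_subdir, warn_if_over) and the max_bytes threshold are hypothetical, and it assumes checkpoint/ and workspace/ sit directly under the run's output directory. It sums apparent file sizes, so hard-linked files are counted once per link and sparse files may be overestimated.

```python
import os
import warnings
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue
            try:
                total += path.stat().st_size
            except OSError:
                # File vanished or is unreadable; ignore it for the estimate.
                continue
    return total


def size_by_subdir(run_dir: Path) -> dict[str, int]:
    """Size in bytes of each immediate subdirectory of run_dir, largest first."""
    sizes = {p.name: dir_size_bytes(p) for p in run_dir.iterdir() if p.is_dir()}
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))


def warn_if_over(run_dir: Path, max_bytes: int) -> bool:
    """Emit a warning and return True if run_dir exceeds max_bytes (hypothetical cap)."""
    total = dir_size_bytes(run_dir)
    if total > max_bytes:
        warnings.warn(
            f"Run output {run_dir} is {total / 1e9:.2f} GB, "
            f"above the configured limit of {max_bytes / 1e9:.2f} GB"
        )
        return True
    return False
```

Calling size_by_subdir on a finished run directory should show whether checkpoint/ or workspace/ accounts for most of the space. A check like warn_if_over could be called after each tree node, although walking a 50 GB tree that often may itself be slow, so a less frequent check might be preferable.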
