TimeLeft utility fails in containerized jobs due to batch system isolation #8416

@aldbr

Problem
When jobs run inside Apptainer containers via SingularityComputingElement, they cannot access the batch system tools (e.g., sacct, qstat, bjobs) needed to compute the remaining time. This can prevent elastic jobs from getting a reliable time-left estimate.

The SingularityComputingElement creates an isolated container environment, and the batch system binaries are not bind-mounted into it, so tools like sacct (SLURM) or condor_q (HTCondor) are not available inside the container.
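As a rough way to confirm this from inside a job, the sketch below checks whether any of the batch tools named above can be found on `PATH`. The helper names (`find_batch_tools`, `batch_system_reachable`) and the tool list are illustrative assumptions, not part of DIRAC.

```python
import shutil

# Batch system query tools named in this issue; illustrative, not exhaustive.
BATCH_TOOLS = ("sacct", "qstat", "bjobs", "condor_q")


def find_batch_tools(tools=BATCH_TOOLS, path=None):
    """Map each tool name to its resolved path, or None if it is not found.

    `path` overrides the search path (default: the PATH environment variable).
    """
    return {tool: shutil.which(tool, path=path) for tool in tools}


def batch_system_reachable(tools=BATCH_TOOLS, path=None):
    """Return True if at least one batch tool is visible in this environment."""
    return any(find_batch_tools(tools, path).values())


if __name__ == "__main__":
    # Inside the container described above, every entry is expected to be None.
    for tool, location in find_batch_tools().items():
        print(f"{tool}: {location or 'not found'}")
```

Running this once in the pilot environment and once inside the container should show the difference described here, assuming the pilot has the tools on its `PATH`.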

Current Behavior

  • Pilot job: can access the batch system, compute the time left, and store it in /LocalSite/CPUTimeLeft in pilot.cfg
  • Job in container:
    • Watchdog calls TimeLeft().getTimeLeft(), which tries to execute batch commands and fails.
    • LHCbDIRAC modules also call TimeLeft().getTimeLeft() to compute the number of events a given job can process within the allocated time. This fails too.

Result: jobs cannot determine their remaining time and may be killed unexpectedly.

Proposed Solution

Jobs should probably read CPUTimeLeft from the pilot.cfg file (through gConfig), since the pilot already computes and stores the time left every cycle. We could reuse this value in containerized environments.
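A minimal sketch of that fallback, assuming two hypothetical callables: `query_batch_system`, standing in for the existing batch query, and `read_pilot_value`, standing in for whatever reads /LocalSite/CPUTimeLeft (for example a gConfig lookup). Neither name exists in DIRAC, and the function is not the actual TimeLeft implementation.

```python
def time_left_with_fallback(query_batch_system, read_pilot_value):
    """Return (seconds_left, source), preferring a live batch system query.

    query_batch_system: hypothetical callable returning seconds left,
        or raising when the batch tools are unavailable.
    read_pilot_value: hypothetical callable returning the value the pilot
        stored under /LocalSite/CPUTimeLeft, or None if it is not set.
    """
    try:
        value = query_batch_system()
        if value is not None and float(value) > 0:
            return float(value), "batch"
    except Exception:
        # Broad on purpose for this sketch: any failure of the batch query
        # (missing binary, non-zero exit, parse error) triggers the fallback.
        pass

    stored = read_pilot_value()
    if stored is not None and float(stored) > 0:
        return float(stored), "pilot.cfg"

    return None, "unavailable"


if __name__ == "__main__":
    def failing_query():
        raise OSError("sacct: command not found")

    # Simulates a containerized job where only the pilot value is available.
    print(time_left_with_fallback(failing_query, lambda: "3600"))
```

The stored value reflects the pilot's last cycle, so a real implementation would probably need to account for the time elapsed since it was written.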

Any opinion?
