Description
Problem
When jobs run inside Apptainer containers via SingularityComputingElement, they cannot access the batch system tools (e.g., sacct, qstat, bjobs) needed to compute the remaining time. As a result, elastic jobs may not get a reliable estimate of the time left.
The SingularityComputingElement creates an isolated container environment and batch system binaries are not bind-mounted into the container, so tools like sacct (SLURM) or condor_q (HTCondor) are not available.
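One quick way to confirm this from inside a container is to check which batch commands are on the PATH. The snippet below is only a minimal sketch: the helper name `available_batch_tools` and the tool list are illustrative and not part of DIRAC.

```python
import shutil

# Batch system commands mentioned in this issue (illustrative list).
BATCH_TOOLS = ("sacct", "qstat", "bjobs", "condor_q")


def available_batch_tools(tools=BATCH_TOOLS):
    """Return the subset of `tools` that can be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is not None]


if __name__ == "__main__":
    found = available_batch_tools()
    if found:
        print("Batch tools visible here:", ", ".join(found))
    else:
        print("No batch tools visible here")
```

Run in the pilot environment and again inside the container, this should make the difference between the two visible.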
Current Behavior
- Pilot job: can access the batch system, compute the time left and store it in /LocalSite/CPUTimeLeft in pilot.cfg.
- Job in container: Watchdog calls TimeLeft().getTimeLeft(), which tries to execute batch commands and fails.
- LHCbDIRAC modules also call TimeLeft().getTimeLeft() to compute the number of events a given job can process within the allocated time. It fails too.
Result: jobs cannot determine the remaining time and may be killed unexpectedly.
Proposed Solution
Jobs should likely read CPUTimeLeft from the pilot.cfg file (through gConfig): the pilot already computes and stores time left information every cycle. We could reuse this value in containerized environments.
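A rough sketch of that fallback, to make the idea concrete. Everything here is an assumption for illustration: `estimate_time_left`, `failing_batch_query` and `fake_config_lookup` are hypothetical names, and the two callables stand in for the real batch query and for a gConfig lookup (presumably something like gConfig.getValue("/LocalSite/CPUTimeLeft", 0), which is not imported here so that the snippet runs without DIRAC).

```python
from typing import Callable, Optional


def estimate_time_left(
    query_batch_system: Callable[[], float],
    read_config_value: Callable[[str, float], float],
    option_path: str = "/LocalSite/CPUTimeLeft",
) -> Optional[float]:
    """Prefer the batch system; fall back to the value stored by the pilot.

    Returns None when neither source yields a usable value.
    """
    try:
        return query_batch_system()
    except (OSError, RuntimeError):
        # Inside the container the batch binaries are missing, so the
        # query is expected to fail; fall through to the stored value.
        pass

    stored = read_config_value(option_path, 0.0)
    return stored if stored > 0 else None


# --- Stand-ins for illustration only (hypothetical, not DIRAC code) ---
def failing_batch_query() -> float:
    raise OSError("sacct: command not found")


fake_pilot_cfg = {"/LocalSite/CPUTimeLeft": 3600.0}


def fake_config_lookup(path: str, default: float) -> float:
    return fake_pilot_cfg.get(path, default)


time_left = estimate_time_left(failing_batch_query, fake_config_lookup)
print(time_left)
```

Note that the stored value is a snapshot from the last time the pilot updated it, so a real implementation would probably need to account for the time elapsed since then.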
Any opinion?