Description
Hello, I have a use case where I would like to use HyperQueue to run a single very large (multi-node) job at a time, where each job runs for a long time (essentially the whole time limit, say ~24h). Then, the workflow engine remotely retrieves the results, performs some minimal tasks, and submits the next big job to the queue.
When submitting directly with SLURM, these few minutes let other jobs start and occupy some of the nodes; the next step then waits in the queue for a long time until the machine drains enough to free the required number of nodes, even if the user has a high SLURM priority.
One way to solve this, if the next job does not require any "remote" workflow logic, is to submit the next scheduler job to SLURM in advance with a dependency (--dependency=afterok:XXX). As soon as the first job ends, the second starts, and SLURM does not try to allocate other users' jobs in between.
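For reference, this plain-SLURM chaining can be sketched as follows. The job id and script names are illustrative; the snippet echoes the second command (dry run) instead of actually calling sbatch:

```shell
# Dry-run sketch of chaining two SLURM jobs with a dependency.
# In practice the id would come from: JOB1_ID=$(sbatch --parsable job1.sh)
JOB1_ID=123456

# Build the dependency flag so job2 only starts after job1 completes successfully
DEP_FLAG="--dependency=afterok:${JOB1_ID}"
echo "sbatch ${DEP_FLAG} job2.sh"
```

With this, SLURM holds the second job until the first ends with exit code 0 and then starts it without scheduling other users' jobs on those nodes in between.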
However, we are in a use case where some minimal logic needs to happen in between (say, at most 1 minute).
I'm trying to address this with HyperQueue.
I imagine I would use a single allocation of the required size (number of nodes) and time limit, and then submit my jobs to HQ.
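If I understand the HQ docs correctly, the allocation queue for such a setup would be created roughly like this (the flag values are placeholders for my use case; everything after `--` is passed through to sbatch):

```shell
# Automatic allocation queue backed by SLURM:
# --backlog 1 keeps at most one allocation queued in SLURM at a time
hq alloc add slurm \
    --time-limit 24h \
    --backlog 1 \
    --workers-per-alloc 4 \
    --idle-timeout 5m \
    -- --partition=mypartition
```

The proposal below is about the sbatch submission that this queue performs for its backlog allocation.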
My goal/question is: is it possible in HyperQueue to set an option so that the automatic allocator, with a backlog of 1, puts the next allocation in the SLURM queue, BUT with --dependency=afterok:XXX passed to SLURM (where XXX is the job id of the currently running allocation)?
I think this would solve my issue. Is the following reasoning correct?
- no jobs submitted: nothing in the SLURM queue
- first job submitted to HQ: HQ puts an allocation in the SLURM queue, and then starts running the actual workload inside the allocation as soon as it begins running (possibly after waiting a bit in the queue)
- at that point, HQ submits a new allocation to the queue. Without what I am suggesting, that allocation might start running before the first one finishes, wait for the 5-minute idle timeout, and then stop (is this correct?). I suggest that the allocation is instead submitted with --dependency=afterok:XXX. Then, as long as the first allocation is still running, the second does not start. After at most 5 minutes of idling, the first allocation stops (though in my use case I will essentially be very close to the maximum wall time allowed by SLURM, so the allocation will be terminated relatively soon in any case).
- at that point, the new allocation will start running (in my case almost immediately, because of SLURM priorities), and my remote workflow engine would still have time (up to the 5-minute idle timeout) to actually submit the next step to HQ. If this is achieved, that job will start running immediately.
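The behavior I am proposing could be sketched like this. All names here are illustrative, not actual HQ internals; the function echoes the sbatch command (dry run) so the logic is visible without a cluster:

```shell
# Hypothetical sketch of what HQ's allocator would do with the proposed option.
submit_allocation() {
    # $1: SLURM job id of the currently running allocation (empty if none)
    if [ -n "$1" ]; then
        # Backlog allocation: held by SLURM until the running one finishes
        echo "sbatch --dependency=afterok:$1 hq-worker-start.sh"
    else
        # First allocation: nothing running yet, so no dependency
        echo "sbatch hq-worker-start.sh"
    fi
}

submit_allocation ""
submit_allocation 987654
```

The only change versus the current behavior would be the extra `--dependency` flag on the backlog submission, derived from the job id of the allocation that is already running.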
Is this interpretation correct, and does such a feature exist (or could it be implemented)?
(As a final comment: even if there are these 5-10 minutes where many nodes are blocked while the allocation waits for the idle timeout, this is still much more efficient than the current situation: currently, when the second job is scheduled, SLURM prevents other jobs from starting, so many nodes can remain idle for a much longer time, possibly even 10-20 hours.)