Job.load() raises KeyError for array parent job when parent has finished but child tasks are still running #397

@dliptai

We are encountering an issue when working with Slurm job arrays via pyslurm.Job.load().

In a job array where:

  • SLURM_JOB_ID == SLURM_ARRAY_JOB_ID (i.e. the array parent job ID)
  • The parent job has finished
  • Some array tasks are still running

Calling:

pyslurm.Job.load(job_id)

where job_id is the array parent ID, results in:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pyslurm/core/job/job.pyx", line 307, in pyslurm.core.job.job.Job.load
  File "pyslurm/core/job/job.pyx", line 300, in pyslurm.core.job.job.Job.load
  File "pyslurm/core/job/step.pyx", line 103, in pyslurm.core.job.step.JobSteps._load_single
KeyError: 222

(where 222 is the job ID in this case)

Environment:

  • Slurm version: 24.11.6
  • pyslurm version: 24.11.0

Analysis:
Job.load() attempts to load job steps as part of job construction.

Normally:

  • If the returned dictionary of steps (from JobSteps._load_data()) is empty and the job is not pending, an RPCError is raised:
    data = steps._load_data(job.id, slurm.SHOW_ALL)
    if not data and not slurm.IS_JOB_PENDING(job.ptr):
        msg = f"Failed to load step info for Job {job.id}."
        raise RPCError(msg=msg)
  • That RPC error is handled in Job.load().
    if not slurm.IS_JOB_PENDING(wrap.ptr):
        # Just ignore if the steps couldn't be loaded here.
        try:
            wrap.steps = JobSteps._load_single(wrap)
        except RPCError:
            pass

The problematic case appears to be:

  1. JobSteps._load_single() is called for the array parent job ID.
  2. The RPC, and thus JobSteps._load_data(), returns steps for all array tasks that still have running steps.
  3. The parent job itself has no steps (it is already finished).
  4. Therefore, the returned dictionary is non-empty, but does not contain an entry for the parent job ID.
  5. JobSteps._load_single() then attempts to index into the dictionary using the parent job ID.
  6. Since that key does not exist, a KeyError is raised.
  7. Because a KeyError is not an RPCError, it bypasses the except RPCError handler in Job.load() and propagates to the caller.
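
The sequence above can be mimicked without a Slurm cluster. The following is only a minimal sketch of the control flow as described in this report, not pyslurm's actual code: `RPCError` here is a local stand-in class, `load_single` and `job_load` are hypothetical helpers, and the task job IDs 223 and 224 are made up for illustration.

```python
class RPCError(Exception):
    """Stand-in for pyslurm's RPCError (hypothetical, for illustration only)."""


def load_single(job_id, data):
    # Mirrors the described logic: an empty result raises RPCError,
    # otherwise the result is indexed by the requested job ID.
    if not data:
        raise RPCError(f"Failed to load step info for Job {job_id}.")
    return data[job_id]


def job_load(job_id, data):
    # Mirrors the handler described above: only RPCError is ignored.
    try:
        return load_single(job_id, data)
    except RPCError:
        return None


# Finished non-array job: the dict is empty, so RPCError is raised and swallowed.
finished_single = job_load(222, {})

# Finished array parent 222 while tasks (assumed IDs 223, 224) still have steps:
# the dict is non-empty but has no key 222, so KeyError escapes the handler.
try:
    job_load(222, {223: ["step 0"], 224: ["step 0"]})
    array_parent_outcome = "ok"
except KeyError as exc:
    array_parent_outcome = f"KeyError: {exc}"
```

In this sketch the first call quietly yields `None`, while the second ends with the same `KeyError: 222` seen in the traceback.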

By contrast:

  • If a single (non-array) job is finished, JobSteps._load_data() returns an empty dict.
  • That empty dict correctly triggers the RPC error path, which is handled.

So the failure only occurs when:

  • The array parent is finished, and
  • Some child tasks are still running.
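
One possible direction, sketched here under assumptions rather than proposed as the definitive fix, would be to treat "no entry for the requested job ID" the same way as "no data at all", so that the existing RPCError handling covers both cases. `RPCError` is again a local stand-in, `select_steps` and `load_steps_or_none` are hypothetical names, the task job ID 223 is made up, and the pending-job check from the real code is left out for brevity.

```python
class RPCError(Exception):
    """Stand-in for pyslurm's RPCError (hypothetical, for illustration only)."""


def select_steps(job_id, data):
    # A missing key is reported like an empty result, so a caller that
    # already ignores RPCError handles the array-parent case as well.
    if job_id not in data:
        raise RPCError(f"Failed to load step info for Job {job_id}.")
    return data[job_id]


def load_steps_or_none(job_id, data):
    # Same shape as the handler quoted earlier: RPCError is ignored.
    try:
        return select_steps(job_id, data)
    except RPCError:
        return None


# Array parent 222 is absent from the result: handled, no KeyError.
parent_steps = load_steps_or_none(222, {223: ["step 0"]})

# A task that is present still gets its steps.
task_steps = load_steps_or_none(223, {223: ["step 0"]})
```

Whether the real change belongs in JobSteps._load_single() or in Job.load() is a design decision for the maintainers; the sketch only shows that a membership check is enough to route this case onto the handled path.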
