Improve error handling of spank-qrmi(Removal of QRMI_PLUGIN_ERROR env var) #188

Open
ohtanim wants to merge 12 commits into qiskit-community:main from ohtanim:stop_job_if_qrmi_failed

Collaborator

@ohtanim ohtanim commented May 11, 2026

Description of Change

Our SPANK plugin handles error codes returned from the QRMI functions (new(), acquire(), and release()).

When an error occurs in slurm_spank_init_post_opt(), our SPANK plugin returns SLURM_ERROR to Slurm. However, it appears that returning SLURM_ERROR from slurm_spank_init_post_opt() does not terminate the job execution. Instead, the job remains in the pending state, the node becomes drained, and manual intervention by a system administrator is required to resume it. This behavior differs from what is described in the documentation.
As a result, in the current implementation, we propagate the error to the workload launched by Slurm via the QRMI_PLUGIN_ERROR environment variable. If this environment variable is set, the workload immediately terminates with an error. However, this approach relies on the workload itself (for example, QRMIService) to handle the environment variable, and therefore does not provide a complete solution.
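To make the limitation concrete, here is a minimal sketch of the kind of check every workload would need under the environment-variable approach. It is not taken from QRMIService or any other real workload; the function names and the choice to treat an empty value as "no error" are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: maps the value of QRMI_PLUGIN_ERROR to an exit code.
 * Returns 0 when the variable is unset or empty (assumed to mean "no error")
 * and 1 when it carries an error message. */
static int plugin_error_exit_code(const char *value)
{
    if (value == NULL || value[0] == '\0') {
        return 0;
    }
    return 1;
}

/* Hypothetical entry check a workload would run before doing any work. */
static int check_qrmi_plugin_error(void)
{
    return plugin_error_exit_code(getenv("QRMI_PLUGIN_ERROR"));
}
```

A workload that omits such a check keeps running even though the plugin failed, which is why this approach cannot be a complete solution.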

This PR proposes a revised solution to address these issues. Although returning SLURM_ERROR from slurm_spank_init_post_opt() does not terminate the job, returning SLURM_ERROR from slurm_spank_task_init() does. In addition, when slurm_error() is called within slurm_spank_task_init(), the error is surfaced to the user and written to the sbatch error output. Based on this observation, the proposed approach is as follows:

  • Errors that occur in slurm_spank_init_post_opt() are recorded internally within the SPANK plugin (in g_init_post_opt_errors), and SLURM_SUCCESS is returned to allow execution to proceed.
  • In slurm_spank_task_init(), the plugin first checks whether an error was recorded in slurm_spank_init_post_opt(). If so, it logs the error using slurm_error() and returns SLURM_ERROR, which terminates the job.
  • To ensure that jobs are terminated when an error occurs in the spank_qrmi plugin, the plugin definition in plugstack.conf is changed from optional to required.
  • The previously used QRMI_PLUGIN_ERROR environment variable is removed.
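The record-then-report pattern in the list above can be sketched roughly as follows. This is a simplified model, not the plugin's actual code: it builds without Slurm headers, so SLURM_SUCCESS, SLURM_ERROR, and the logger are local stand-ins, and sketch_init_post_opt() and sketch_task_init() are hypothetical models of the post-opt callback and the task-init callback (slurm_spank_task_init() in the SPANK API). Only the name g_init_post_opt_errors comes from this PR; its layout here is assumed.

```c
#include <stdio.h>

/* Stand-ins so the sketch builds without Slurm headers. A real plugin takes
 * these from <slurm/spank.h> and calls slurm_error() instead of log_error(). */
#define SLURM_SUCCESS 0
#define SLURM_ERROR (-1)

#define MAX_ERRORS 8
#define MAX_ERROR_LEN 256

/* Name taken from the PR text; the array layout is an assumption. */
static char g_init_post_opt_errors[MAX_ERRORS][MAX_ERROR_LEN];
static int g_num_init_post_opt_errors = 0;

/* Stand-in for slurm_error(). */
static void log_error(const char *msg)
{
    fprintf(stderr, "error: spank_qrmi, %s\n", msg);
}

/* Remember an error instead of failing right away. */
static void record_error(const char *msg)
{
    if (g_num_init_post_opt_errors < MAX_ERRORS) {
        snprintf(g_init_post_opt_errors[g_num_init_post_opt_errors],
                 MAX_ERROR_LEN, "%s", msg);
        g_num_init_post_opt_errors++;
    }
}

/* Hypothetical model of the post-opt callback. acquire_ok simulates the
 * outcome of the QRMI acquire() call. */
static int sketch_init_post_opt(int acquire_ok)
{
    if (!acquire_ok) {
        record_error("No QPU resource available");
    }
    /* Always report success here so the node is not drained. */
    return SLURM_SUCCESS;
}

/* Hypothetical model of the task-init callback: log every deferred error,
 * then fail the task if any were recorded. */
static int sketch_task_init(void)
{
    for (int i = 0; i < g_num_init_post_opt_errors; i++) {
        log_error(g_init_post_opt_errors[i]);
    }
    return g_num_init_post_opt_errors > 0 ? SLURM_ERROR : SLURM_SUCCESS;
}
```

In this model the failure surfaces only in the task-init step, and the plugin has to be listed as required in plugstack.conf for that return value to stop the job.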

With these changes, when a job fails, the sbatch error output will look as follows:

[2026-05-11T12:58:14.952] error: spank_qrmi, ibm_kingston is not accessible. Failed to get backend status: ResponseError(ResponseContent { status: 404, content: "{\"message\":\"Cannot GET /api/v1--/backends/ibm_kingston/status\",\"error\":\"Not Found\",\"statusCode\":404}", entity: Some(UnknownValue(Object {"error": String("Not Found"), "message": String("Cannot GET /api/v1--/backends/ibm_kingston/status"), "statusCode": Number(404)})) })
[2026-05-11T12:58:14.953] error: spank_qrmi, failed to acquire resource: ibm_kingston
[2026-05-11T12:58:14.953] error: spank_qrmi, No QPU resource available
[2026-05-11T12:58:14.954] error: spank: required plugin spank_qrmi.so: task_init() failed with rc=-1
[2026-05-11T12:58:14.954] error: Failed to invoke spank plugin stack

Checklist ✅

  • Have you included a description of this change?
  • Have you updated the relevant documentation to reflect this change?
  • Have you made sure CI is passing before requesting a review?

Ticket

@ohtanim ohtanim self-assigned this May 11, 2026
@ohtanim ohtanim added the enhancement New feature or request label May 11, 2026
@ohtanim ohtanim changed the title from "WIP - Improve error handling of spank-qrmi(Removal of QRMI_PLUGIN_ERROR env var)" to "Improve error handling of spank-qrmi(Removal of QRMI_PLUGIN_ERROR env var)" May 11, 2026
@ohtanim ohtanim marked this pull request as ready for review May 11, 2026 15:29
@@ -1 +1 @@
-optional /shared/spank-plugins/plugins/spank_qrmi/build/spank_qrmi.so /etc/slurm/qrmi_config.json
+required /shared/spank-plugins/plugins/spank_qrmi/build/spank_qrmi.so /etc/slurm/qrmi_config.json
Collaborator

I would suggest not doing this, as it could unnecessarily bring down nodes.

It is fine with me to have it in the demo patch, though.

Collaborator Author

However, unless the plugin is defined as required, an error raised by the SPANK plugin cannot terminate the job.

We avoid returning an error from slurm_spank_init_post_opt() so that the node itself is not drained.
