Improve error handling of spank-qrmi (Removal of QRMI_PLUGIN_ERROR env var) #188
Open
ohtanim wants to merge 12 commits into
Conversation
@@ -1 +1 @@
-optional /shared/spank-plugins/plugins/spank_qrmi/build/spank_qrmi.so /etc/slurm/qrmi_config.json
+required /shared/spank-plugins/plugins/spank_qrmi/build/spank_qrmi.so /etc/slurm/qrmi_config.json
Collaborator
I would suggest not doing this, as it could unnecessarily bring down nodes.
It is fine with me to have it in the demo patch, though.
Collaborator
Author
However, unless the plugin is defined as required, an error raised by the SPANK plugin cannot terminate the job.
We avoid returning an error from slurm_spank_init_post_opt() so that the node itself is not drained.
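The deferred-error flow discussed in this thread can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: SLURM_SUCCESS and SLURM_ERROR are redefined locally so the sketch compiles without the Slurm headers, log_error() is a hypothetical stand-in for slurm_error(), fake_qrmi_acquire() is a hypothetical stand-in for a QRMI call, the error message text is made up, and init_post_opt() / init_task() only mirror slurm_spank_init_post_opt() / slurm_spank_init_task() without their real arguments. The name g_init_post_opt_errors comes from the PR description; representing it as a fixed-size buffer is an assumption.

```c
#include <stdarg.h>
#include <stdio.h>

/* Local stand-ins so the sketch builds without Slurm headers.
 * The values match Slurm's own definitions (0 and -1). */
#define SLURM_SUCCESS 0
#define SLURM_ERROR (-1)

/* Hypothetical stand-in for slurm_error(): formats the message,
 * keeps a copy, and prints it to stderr. */
static char g_last_log[256];

void log_error(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(g_last_log, sizeof g_last_log, fmt, ap);
    va_end(ap);
    fprintf(stderr, "error: %s\n", g_last_log);
}

/* Name from the PR description; the buffer type is an assumption.
 * An empty string means "no error recorded". */
static char g_init_post_opt_errors[256];

/* Hypothetical stand-in for a QRMI call such as acquire(). */
int fake_qrmi_acquire(int should_fail)
{
    return should_fail ? -1 : 0;
}

/* Mirrors slurm_spank_init_post_opt(): a failure is recorded, but
 * SLURM_SUCCESS is always returned so the node is not drained. */
int init_post_opt(int qrmi_should_fail)
{
    g_init_post_opt_errors[0] = '\0';
    if (fake_qrmi_acquire(qrmi_should_fail) != 0) {
        /* Made-up message text, for illustration only. */
        snprintf(g_init_post_opt_errors, sizeof g_init_post_opt_errors,
                 "failed to acquire QPU resource");
    }
    return SLURM_SUCCESS;
}

/* Mirrors slurm_spank_init_task(): if an error was recorded earlier,
 * log it and return SLURM_ERROR, which is what terminates the job. */
int init_task(void)
{
    if (g_init_post_opt_errors[0] != '\0') {
        log_error("%s", g_init_post_opt_errors);
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}
```

In an actual plugin, the two functions would be the SPANK callbacks with the signature `int name(spank_t spank, int argc, char **argv)`, and the logging call would be slurm_error() itself. Calling init_post_opt(1) followed by init_task() in this sketch yields SLURM_SUCCESS from the first call and SLURM_ERROR from the second, which is the ordering the thread relies on.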
Description of Change
Our SPANK plugin handles error codes returned from the QRMI functions (new(), acquire(), and release()).

When an error occurs in slurm_spank_init_post_opt(), our SPANK plugin returns SLURM_ERROR to Slurm. However, it appears that returning SLURM_ERROR from slurm_spank_init_post_opt() does not terminate the job execution. Instead, the job remains in the pending state, the node becomes drained, and manual intervention by a system administrator is required to resume it. This behavior differs from what is described in the documentation.

As a result, in the current implementation, we propagate the error to the workload launched by Slurm via the QRMI_PLUGIN_ERROR environment variable. If this environment variable is set, the workload immediately terminates with an error. However, this approach relies on the workload itself (for example, QRMIService) to handle the environment variable, and therefore does not provide a complete solution.

This PR proposes a revised solution to address these issues. Although returning SLURM_ERROR from slurm_spank_init_post_opt() cannot terminate the job, returning SLURM_ERROR from slurm_spank_init_task() does terminate the job. In addition, when slurm_error() is called within this function, the error is surfaced to the user and written to the sbatch error output. Based on this observation, the proposed approach is as follows:

- Errors in slurm_spank_init_post_opt() are recorded internally (in g_init_post_opt_errors) within the SPANK plugin, and SLURM_SUCCESS is returned to allow execution to proceed.
- In slurm_spank_init_task(), the plugin first checks whether an error occurred in slurm_spank_init_post_opt(). If so, it logs the error using slurm_error(), returns SLURM_ERROR, and terminates the job.
- For the spank_qrmi plugin, the plugin definition in plugstack.conf is changed from optional to required.
- The QRMI_PLUGIN_ERROR environment variable is removed.

With these changes, when a job fails, the sbatch error output will look as follows:
Checklist ✅
Ticket