Skip to content

update go-nvml with changes from cuda 13.2.1#179

Open
tariq1890 wants to merge 1 commit into
mainfrom
upd-nvml-h-cuda-13.2.1
Open

update go-nvml with changes from cuda 13.2.1#179
tariq1890 wants to merge 1 commit into
mainfrom
upd-nvml-h-cuda-13.2.1

Conversation

@tariq1890
Copy link
Copy Markdown
Contributor

@tariq1890 tariq1890 commented May 6, 2026

The following commands were run to update the go-nvml API with the latest changes from CUDA 13.2.1

  • make update-nvml-h
  • make bindings
  • go generate ./...

@tariq1890 tariq1890 requested review from cdesiniotis and elezar May 6, 2026 19:58
@tariq1890 tariq1890 force-pushed the upd-nvml-h-cuda-13.2.1 branch 2 times, most recently from 2881034 to b031cd0 Compare May 6, 2026 20:26
@tariq1890
Copy link
Copy Markdown
Contributor Author

Raised PR #180 to address the CI failures

@klueska
Copy link
Copy Markdown
Collaborator

klueska commented May 7, 2026

This doesn't seem to update any of versioned libs, or add any of the wrapper functions as described in the README.

@tariq1890 tariq1890 force-pushed the upd-nvml-h-cuda-13.2.1 branch 3 times, most recently from ce28140 to 8289fbb Compare May 9, 2026 02:38
@tariq1890
Copy link
Copy Markdown
Contributor Author

@klueska I have added the wrapper functions. As for the new versioned methods, I am not able to do the expected symbol remapping as the input/output struct types are also versioned. These struct versions are also different in each of these versioned methods.

@tariq1890 tariq1890 requested a review from klueska May 9, 2026 02:48
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
@tariq1890 tariq1890 force-pushed the upd-nvml-h-cuda-13.2.1 branch from 8289fbb to 536f8d4 Compare May 12, 2026 02:53
@klueska
Copy link
Copy Markdown
Collaborator

klueska commented May 12, 2026

There should be an intermediate PR for CUDA 13.1.x that lands before this one.

Comment thread gen/nvml/nvml.yml
const:
- {action: accept, from: "^NVML_"}
- {action: accept, from: "^nvml"}
- {action: ignore, from: "NVML_MCDM_SUPPORT"}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this need ignoring?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable was not being rendered properly, so it led to a compilation error in nvml.go. Ignoring it was the only way to unblock this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what the constant gets rendered as in nvml.go

const (
// NO_UNVERSIONED_FUNC_DEFS as defined in go-nvml/<predefine>:24
NO_UNVERSIONED_FUNC_DEFS = 1
// MCDM_SUPPORT as defined in nvml/nvml.h:106
[77 67 68 77 95 83 85 80 80 79 82 84]
// API_VERSION as defined in nvml/nvml.h:111
API_VERSION = 13
// API_VERSION_STR as defined in nvml/nvml.h:112
API_VERSION_STR = "13"
// VALUE_NOT_AVAILABLE as defined in nvml/nvml.h:156
VALUE_NOT_AVAILABLE = -1

You can see that MCDM_SUPPORT is mis-rendered

Sample1: sample1,
Sample2: sample2,
Metrics: [210]nvml.GpmMetric{
Metrics: [333]nvml.GpmMetric{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this number need bumping?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct's schema was updated in the latest nvml.h.

Comment thread gen/nvml/nvml.h
Comment on lines 118 to 139
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only need to create an alias to a versioned function if it appears in this list. In this update, this list is unmodified, so there are no new versioned function aliases necessary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip, this is good to know

Comment thread Makefile

# Update nvml.h from the NVIDIA CUDA redistributable JSON
update-nvml-h: CUDA_VERSION := 13.0.0
update-nvml-h: CUDA_VERSION := 13.2.1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need an intermediate PR for the 13.1 change before this one. We are unfortunately behind on updating this, but that doesn't mean we should just skip a version.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood

Comment thread gen/nvml/nvml.h
{
unsigned int counterId; //!< Counter ID, one of \ref nvmlPRMCounterId_t
/* Input data */
nvmlPRMCounterInput_v1_t inData; //!< PRM input values
Copy link
Copy Markdown
Collaborator

@klueska klueska May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this struct has input values, we need to make sure our wrappers take this struct as an input. There should be other examples of this in older PRs.

Comment thread pkg/nvml/device.go
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look closely yet, but do any of the structs used here require the NVML_STRUCT_VERSION to be set?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this struct :

type RusdSettings_v1 struct {
	Version  uint32
	PollMask uint64
}

I've addressed this in #182

@rajathagasthya rajathagasthya added this to the v0.14.0 milestone May 13, 2026
@tariq1890
Copy link
Copy Markdown
Contributor Author

There should be an intermediate PR for CUDA 13.1.x that lands before this one.

#182

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants