Add installation of data-files build artifacts into Python package#574
Add installation of data-files build artifacts into Python package#574jakelishman wants to merge 4 commits intoPyO3:mainfrom
Conversation
This allows a `RustExtension` to "own" some data files, produced programmatically as part of its `build.rs` script, and for these to be installed along with the built extension module somewhere into the Python package tree. The motivating example use is an extension module that provides a C API, and generates the header file needed to access it and a function-pointer table in a `PyCapsule` as part of its build script; the header file ideally should be installed as a regular file inside the Python wheel, so the package can be used as a build dependency.
davidhewitt
left a comment
There was a problem hiding this comment.
Thanks, a bunch of thoughts posted as comments below. Good to see this can all come together though I guess we need to think a bit about the config.
I wonder, an alternative design I could think of which would give user more control would be to maybe use an "in-tree" build backend https://peps.python.org/pep-0517/#in-tree-build-backends - maybe we could define some public API "hooks" into setuptools-rust build which in-tree backends could use to e.g. parse the cargo messages themselves and do custom logic? That might be more flexible than trying to fit this option directly into setuptools-rust itself...
examples/data-files/pyproject.toml
Outdated
| # Keys correspond to files/directories in the Rust extension's build directory `OUT_DIR`. | ||
| # Values are Python packages that the corresponding file or directory should be placed inside. | ||
| "my_file.txt" = "data_files" | ||
| "dir" = "data_files._data" |
There was a problem hiding this comment.
I wonder if the value should be the final filename inside the directory? e.g. data_files/my_file.txt, data_files/data/dir. Not sure about the Python dotted path syntax here.
There was a problem hiding this comment.
I agree it's a bit weird as-is. I'd prototyped it this way because "find the installation directory of this Python package" is a built-in function to the setuptools machinery, and I was concerned that if I did it in terms of file structure, it would be harder to associate the relative position without at least having to first split off the top level into an implicit Python package to locate the installation location.
There was a problem hiding this comment.
I've left this as-is in the latest commit for now but I'm not married to it. With generated-files now implying that they can also be Python files, it seems less strange to me that the target locations are specified as Python packages.
Just for comparison purposes too:
- we use a Python package name in
RustExtension.targetto mark the install path setuptools'package_dataargument is also defined in terms of Python packages (the keys of the table). I suspect it's for the same reason my initial inclination was this way -setuptoolsis more designed to think about packages, and the directory layout on the filesystem is more of an implementation detail.
examples/data-files/pyproject.toml
Outdated
| [[tool.setuptools-rust.ext-modules]] | ||
| target = "data_files._lib" | ||
| [tool.setuptools-rust.ext-modules.data-files] |
There was a problem hiding this comment.
Possible sticking point here - ext-modules is a list, but ext-modules.data-files is a table? How does this interact if there are multiple ext-modules?
There was a problem hiding this comment.
I'm relatively sure that the data-files table attaches to the most recent entry in the ext-modules list - I think this is just the single-item case of it.
I actually wanted to write this example as
[[tool.setuptools-rust.ext-modules]]
target = "data_files._lib"
data-files = {
"my_file.txt" = "data_files",
"dir" = "data_files._data",
}but using an inline table with linebreaks inside it only arrived in TOML 1.1 late last year, and the Python tomllib doesn't handle it yet.
There was a problem hiding this comment.
Sorry, to finish the thought - so the options you can write in TOML 1.0 today are
[[tool.setuptools-rust.ext-modules]]
target = "pkg_a"
data-files = { "my_file.txt" = "pkg_a", "my_file2.txt" = "pkg_a.sub" }
[[tool.setuptools-rust.ext-modules]]
target = "pkg_b"
data-files = { "file3" = "pkg_b" }or
[[tool.setuptools-rust.ext-modules]]
target = "pkg_a"
[tool.setuptools-rust.ext-modules.data-files]
"my_file.txt" = "pkg_a"
"my_file2.txt" = "pkg_a.sub"
[[tool.setuptools-rust.ext-modules]]
target = "pkg_b"
[tool.setuptools-rust.ext-modules.data-files]
"file3" = "pkg_b"and these parse to the same thing:
import tomllib
a = """
[[tool.setuptools-rust.ext-modules]]
target = "pkg_a"
[tool.setuptools-rust.ext-modules.data-files]
"my_file.txt" = "pkg_a"
"my_file2.txt" = "pkg_a.sub"
[[tool.setuptools-rust.ext-modules]]
target = "pkg_b"
[tool.setuptools-rust.ext-modules.data-files]
"file3" = "pkg_b"
"""
b = """
[[tool.setuptools-rust.ext-modules]]
target = "pkg_a"
data-files = { "my_file.txt" = "pkg_a", "my_file2.txt" = "pkg_a.sub" }
[[tool.setuptools-rust.ext-modules]]
target = "pkg_b"
data-files = { "file3" = "pkg_b" }
"""
assert tomllib.loads(a) == tomllib.loads(b)There was a problem hiding this comment.
The two ways of writing the multi-table I gave above both work fine in Python and produce the expected list output. It's (at the end of the day) just regular TOML syntax, though admittedly one that I wasn't 100% confident on off the top of my head.
setuptools_rust/extension.py
Outdated
| universal2_data_files_from: If there are ``data_files`` to copy over during a | ||
| ``universal2`` build, take them from this location. By default, this uses | ||
| only the AArch64 build. |
There was a problem hiding this comment.
I'm unsure how I feel about this; it seems to me that content generated by the build.rs is likely to be target-specific. A conservative option might be to refuse to build multiple targets if data_files is set, i.e. require thin macOS builds for each platform. (Not sure how this interacts with multiple ext-modules.)
There was a problem hiding this comment.
I'm quite content with forbidding universal2 builds entirely, or at least until there's a further user request for it.
Fwiw, in my particular use-case, the build files aren't target-specific (they're platform-agnostic header files), but that's just one case.
setuptools_rust/extension.py
Outdated
| if self.data_files and len(self.target) > 1: | ||
| raise ValueError( | ||
| "using 'data_files' with multiple targets is not supported" | ||
| ) |
There was a problem hiding this comment.
Exactly this makes sense to me, we should check how the config interacts with multiple ext-modules, and I'd be in favour of disallowing universal2 with data files unless someone has a really good reason why they need it.
There was a problem hiding this comment.
I will check the multiple ext-modules, but I don't expect that particular one to be a problem, other than the user defining files that conflict between RustExtensions and making themselves build-order dependent. I don't think we could meaningfully restrict data-files across different RustExtension definitions because they don't see each other.
Just to make sure (because I got mixed up here too) - the self.target referred to here is potentially multiple Rust binaries to build and install, rather than multiple target tuples (it's a different duplication to the universal2 case).
There was a problem hiding this comment.
I removed the universal2 handling in 13807a7, so everything's just consistent in this form now.
|
Thanks for the tip on in-tree build backends - I hadn't come across those. If we went that route, I think there's more internal details of One thing I thought about after I wrote this is the name |
The handling as written was overly complex without a strong motivating use case. Better to leave it unimplemented at first, and wait to see if it becomes necessary.
Using `data-files` was needlessly opinionated about the content of the files. There's no particular reason that one of the generated files might not be a Python file (for example, a `build.rs` script that generates a `ctypes` wrapper).
|
Sorry I disappeared for a while there. I think this is ready for another review - note I pivoted the name to |
|
By the way (and this is in no way intended as pressure): I assume that there's little chance for a |
We need PyO3/setuptools-rust#574 (or something akin to it) to handle the movement of the generated files from the build script to the output. This is not yet ready for merge, and is highly unlikely to be ready in time for the Qiskit 2.4.0 release (or at least 2.4.0rc1), so to avoid blocking ourselves, we vendor in a monkey-patching reimplementation of the code temporarily. The allowed range of `setuptools-rust` is _probably_ larger than the hard pin, but the monkey-patch is so invasive and v1.12.0 is sufficiently old (August 2025) that it's not worth the risk. There are no meaningful licensing concerns since the vendored code here was entirely written by me both here and in the `setuptools-rust` patch; there is no cross-contamination from that repository and no code inclusion from it. This commit should be reverted as soon as we can swap to a safe released version of `setuptools-rust`.
This allows a
RustExtensionto "own" some data files, produced programmatically as part of itsbuild.rsscript, and for these to be installed along with the built extension module somewhere into the Python package tree. The motivating example use is an extension module that provides a C API, and generates the header file needed to access it and a function-pointer table in aPyCapsuleas part of its build script; the header file ideally should be installed as a regular file inside the Python wheel, so the package can be used as a build dependency.I've written in a basic example to use as a test case because I didn't want to just to immediately writing something like the
cbindgen-based setup that motivated this work from my side. I got repeatedly bitten by TOML 1.0 (what Python'stomllibstdlib library can handle) not supporting newlines in inline tables when trying to write thedata-fileskey! But happily there's syntax that works, even with the array-of-tables form there.I'm not convinced by the logic around
universal2handling, but I also can't entirely see the use case for generated files and universal2 other than only needing one copy, so maybe what I've done is overkill.I was able to use this logic entirely successfully with my own downstream package (Qiskit/qiskit#15711, though the code there is much more involved).
Close #563.