Skip preallocation and use sparse-aware extraction for GNU sparse tar entries #128748
rzikm wants to merge 3 commits into
When extracting a GNU sparse tar entry to a file, the entry's expanded (real) size was being used as the FileStream preallocation size, which could reserve massive amounts of disk space for entries whose actual on-disk payload was a tiny fraction of that.

Sparse entries are now:

* Created without any preallocation reservation.
* Marked sparse on Windows (FSCTL_SET_SPARSE) so unwritten ranges become real holes rather than zero-filled extents.
* Extracted by seeking to each populated segment's virtual offset and writing only that segment's data; the gaps between segments are left unwritten so the file system can keep them as holes. A final SetLength call extends the file to its declared real size to materialize any trailing hole.

On file systems that don't support sparse files the result is functionally identical to the previous extraction (zero-filled holes, full disk allocation) but without the up-front preallocation request.

Fixes #128283.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
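The seek-and-write steps in the commit message above can be illustrated with a minimal sketch. This is not the PR's actual code: `SparseCopySketch`, `CopySegments`, and the `(Offset, Length)` tuple shape are hypothetical names chosen for illustration, and the sketch assumes the sparse map is already parsed into ascending, non-overlapping segments.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only; not the PR's GnuSparseStream code.
public static class SparseCopySketch
{
    // packed:      stream positioned at the first byte of the packed payload.
    // destination: seekable, writable output (for example a FileStream).
    // segments:    populated ranges as (virtual offset, length), ascending.
    // realSize:    the entry's declared expanded size.
    public static void CopySegments(
        Stream packed,
        Stream destination,
        IReadOnlyList<(long Offset, long Length)> segments,
        long realSize)
    {
        byte[] buffer = new byte[81920];

        foreach ((long offset, long length) in segments)
        {
            // Jump over the hole instead of writing zeros into it.
            destination.Seek(offset, SeekOrigin.Begin);

            long remaining = length;
            while (remaining > 0)
            {
                int toRead = (int)Math.Min(buffer.Length, remaining);
                int read = packed.Read(buffer, 0, toRead);
                if (read == 0)
                {
                    throw new EndOfStreamException(
                        "Packed payload ended before the sparse map was satisfied.");
                }

                destination.Write(buffer, 0, read);
                remaining -= read;
            }
        }

        // Extend to the declared size so a trailing hole is represented.
        destination.SetLength(realSize);
    }
}
```

Whether the skipped ranges actually become holes depends on the file system; on one without sparse support they would simply read back as zeros.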
Tagging subscribers to this area: @dotnet/area-system-formats-tar
Pull request overview
This PR makes System.Formats.Tar extraction sparse-aware for GNU sparse tar entries, avoiding full logical-size preallocation and writing only populated ranges while preserving holes where supported.
Changes:
- Skips FileStreamOptions.PreallocationSize for GNU sparse entries.
- Adds sparse segment copy helpers for sync/async extraction.
- Marks Windows output files sparse via FSCTL_SET_SPARSE and adds extraction coverage.
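As a rough illustration of the first item, the sketch below builds a `FileStreamOptions` that requests preallocation only for non-sparse entries. The class name, method name, and `isGnuSparse` parameter are assumptions made for this example; the PR's real `CreateFileStreamOptions` may take different inputs and set other options.

```csharp
using System.IO;

// Hypothetical helper for illustration; names and parameters are assumptions.
public static class PreallocationSketch
{
    public static FileStreamOptions CreateOptions(long entryLength, bool isGnuSparse)
    {
        var options = new FileStreamOptions
        {
            Mode = FileMode.CreateNew,
            Access = FileAccess.Write,
            Share = FileShare.None,
        };

        // For a sparse entry, entryLength is the expanded size, which can be far
        // larger than the bytes actually written, so no reservation is requested.
        if (!isGnuSparse && entryLength > 0)
        {
            options.PreallocationSize = entryLength;
        }

        return options;
    }
}
```

Leaving `PreallocationSize` at its default of 0 means the file system is not asked to reserve anything up front.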
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarEntry.cs | Routes GNU sparse extraction through sparse-aware copy logic and skips preallocation. |
| src/libraries/System.Formats.Tar/src/System/Formats/Tar/GnuSparseStream.cs | Adds sync/async helpers to copy populated sparse segments directly. |
| src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarEntry.Windows.cs | Adds Windows sparse-file marking helper. |
| src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarEntry.Unix.cs | Adds Unix no-op sparse marking helper. |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.DeviceIoControl.cs | Adds FSCTL_SET_SPARSE constant. |
| src/libraries/System.Formats.Tar/src/System.Formats.Tar.csproj | Includes shared DeviceIoControl interop source. |
| src/libraries/System.Formats.Tar/tests/TarReader/TarReader.SparseFile.Tests.cs | Adds sync/async sparse extraction verification. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Code Review: PR #128748 - Sparse-aware extraction for GNU sparse tar entries

Note
This review was generated with the assistance of GitHub Copilot.

Summary
This PR makes GNU sparse tar entry extraction sparse-aware: instead of preallocating the full expanded size and writing every byte (including zero-filled holes), it seeks over holes and writes only the populated segments. On Windows, it marks the file sparse via

Verdict: ✅ Approve
The change is well-structured, addresses a real resource-consumption issue, and the prior review concerns (position guard, FAT/exFAT fallback) have been addressed. No blocking issues found.

Findings
💡 suggestion - Consider
Note
This PR was authored with assistance from GitHub Copilot.
GNU sparse tar entries report their expanded (real) size via TarEntry.Length, but the archive only stores the much smaller packed payload. Extracting such an entry used to reserve disk space proportional to the real size (potentially many GB for a few-KB archive) and then write the expanded form including all the zero blocks, so the resulting file was fully allocated even on file systems that support holes.

This change makes the extraction sparse-aware:

- CreateFileStreamOptions no longer sets PreallocationSize for GNU sparse entries, so the destination FileStream is not asked to reserve real-size bytes up front.
- ExtractAsRegularFile[Async] now walks the sparse map directly, seeking the destination to each populated segment's virtual offset and writing only that segment's bytes. A final SetLength to the real size materializes any trailing hole.
- On Windows, the destination file is marked sparse with FSCTL_SET_SPARSE before writing so the unwritten ranges become real holes on NTFS rather than zero-filled extents. The call is best-effort and silently ignored on file systems that don't support it (FAT/exFAT), where the result is the same as before (zero-filled holes, full allocation) but without the up-front preallocation.

A new theory ExtractToFile_SparseEntry_ExpandsCorrectly covers both the sync and async extraction paths, checks the expanded content, and asserts the SparseFile attribute on Windows.

Fixes #128283.
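For readers unfamiliar with the Windows side, here is a minimal, best-effort sketch of marking a file sparse through `DeviceIoControl`. It is an assumption-laden illustration rather than the PR's interop code: `SparseMarkSketch` and `TryMarkSparse` are hypothetical names, the P/Invoke is declared inline instead of using the shared `Interop.DeviceIoControl.cs` source, and the constant value `0x000900C4` is the documented Win32 control code for `FSCTL_SET_SPARSE`.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical helper for illustration; not the PR's TarEntry.Windows.cs code.
public static class SparseMarkSketch
{
    // Win32 control code for FSCTL_SET_SPARSE.
    private const uint FSCTL_SET_SPARSE = 0x000900C4;

    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool DeviceIoControl(
        SafeFileHandle hDevice,
        uint dwIoControlCode,
        IntPtr lpInBuffer,
        uint nInBufferSize,
        IntPtr lpOutBuffer,
        uint nOutBufferSize,
        out uint lpBytesReturned,
        IntPtr lpOverlapped);

    // Returns true if the file was marked sparse. A false result is not treated
    // as an error: on other platforms, or on file systems without sparse
    // support, extraction can still proceed and holes read back as zeros.
    public static bool TryMarkSparse(FileStream stream)
    {
        if (!OperatingSystem.IsWindows())
        {
            return false;
        }

        // With no input buffer, FSCTL_SET_SPARSE sets the sparse flag.
        return DeviceIoControl(
            stream.SafeFileHandle,
            FSCTL_SET_SPARSE,
            IntPtr.Zero, 0,
            IntPtr.Zero, 0,
            out _,
            IntPtr.Zero);
    }
}
```

After a successful call on NTFS, `File.GetAttributes(path)` would be expected to include `FileAttributes.SparseFile`, which is the attribute the new test is described as asserting.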