Skip to content

Skip preallocation and use sparse-aware extraction for GNU sparse tar entries#128748

Open
rzikm wants to merge 3 commits into
mainfrom
rzikm/animated-umbrella
Open

Skip preallocation and use sparse-aware extraction for GNU sparse tar entries#128748
rzikm wants to merge 3 commits into
mainfrom
rzikm/animated-umbrella

Conversation

@rzikm
Copy link
Copy Markdown
Member

@rzikm rzikm commented May 29, 2026

Note

This PR was authored with assistance from GitHub Copilot.

GNU sparse tar entries report their expanded (real) size via TarEntry.Length, but the archive only stores the much smaller packed payload. Extracting such an entry used to reserve disk space proportional to the real size (potentially many GB for a few-KB archive) and then write the expanded form including all the zero blocks, so the resulting file was fully allocated even on file systems that support holes.

This change makes the extraction sparse-aware:

  • CreateFileStreamOptions no longer sets PreallocationSize for GNU sparse entries, so the destination FileStream is not asked to reserve real-size bytes up front.
  • ExtractAsRegularFile[Async] now walks the sparse map directly, seeking the destination to each populated segment's virtual offset and writing only that segment's bytes. A final SetLength to the real size materializes any trailing hole.
  • On Windows, the destination is marked sparse via FSCTL_SET_SPARSE before writing so the unwritten ranges become real holes on NTFS rather than zero-filled extents. The call is best-effort and silently ignored on file systems that don't support it (FAT/exFAT), where the result is the same as before (zero-filled holes, full allocation) but without the up-front preallocation.
  • On Unix, no extra call is needed: the kernel produces holes automatically when writes skip past existing EOF on file systems that support them.

A new theory ExtractToFile_SparseEntry_ExpandsCorrectly covers both the sync and async extraction paths, checks the expanded content, and asserts the SparseFile attribute on Windows.

Fixes #128283.

When extracting a GNU sparse tar entry to a file, the entry's expanded
(real) size was being used as the FileStream preallocation size, which
could reserve massive amounts of disk space for entries whose actual
on-disk payload was a tiny fraction of that. Sparse entries are now:

 * Created without any preallocation reservation.
 * Marked sparse on Windows (FSCTL_SET_SPARSE) so unwritten ranges
   become real holes rather than zero-filled extents.
 * Extracted by seeking to each populated segment's virtual offset and
   writing only that segment's data; the gaps between segments are
   left unwritten so the file system can keep them as holes. A final
   SetLength call extends the file to its declared real size to
   materialize any trailing hole.

On file systems that don't support sparse files the result is
functionally identical to the previous extraction (zero-filled holes,
full disk allocation) but without the up-front preallocation request.

Fixes #128283.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 09:51
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @dotnet/area-system-formats-tar
See info in area-owners.md if you want to be subscribed.

@github-actions

This comment has been minimized.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes System.Formats.Tar extraction sparse-aware for GNU sparse tar entries, avoiding full logical-size preallocation and writing only populated ranges while preserving holes where supported.

Changes:

  • Skips FileStreamOptions.PreallocationSize for GNU sparse entries.
  • Adds sparse segment copy helpers for sync/async extraction.
  • Marks Windows output files sparse via FSCTL_SET_SPARSE and adds extraction coverage.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarEntry.cs Routes GNU sparse extraction through sparse-aware copy logic and skips preallocation.
src/libraries/System.Formats.Tar/src/System/Formats/Tar/GnuSparseStream.cs Adds sync/async helpers to copy populated sparse segments directly.
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarEntry.Windows.cs Adds Windows sparse-file marking helper.
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarEntry.Unix.cs Adds Unix no-op sparse marking helper.
src/libraries/Common/src/Interop/Windows/Kernel32/Interop.DeviceIoControl.cs Adds FSCTL_SET_SPARSE constant.
src/libraries/System.Formats.Tar/src/System.Formats.Tar.csproj Includes shared DeviceIoControl interop source.
src/libraries/System.Formats.Tar/tests/TarReader/TarReader.SparseFile.Tests.cs Adds sync/async sparse extraction verification.

Comment thread src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarEntry.cs Outdated
Comment thread src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarEntry.cs Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@rzikm rzikm requested review from a team and Copilot May 29, 2026 10:35
@github-actions

This comment has been minimized.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 11:18
@github-actions
Copy link
Copy Markdown
Contributor

Code Review: PR #128748 — Sparse-aware extraction for GNU sparse tar entries

Note

This review was generated with the assistance of GitHub Copilot.

Summary

This PR makes GNU sparse tar entry extraction sparse-aware: instead of preallocating the full expanded size and writing every byte (including zero-filled holes), it seeks over holes and writes only the populated segments. On Windows, it marks the file sparse via FSCTL_SET_SPARSE; on Unix, the kernel handles holes natively.

Verdict: ✅ Approve

The change is well-structured, addresses a real resource-consumption issue, and the prior review concerns (position guard, FAT/exFAT fallback) have been addressed. No blocking issues found.


Findings

💡 suggestion — Consider Stream parameter instead of FileStream

CopyPopulatedDataTo and CopyPopulatedDataToAsync accept FileStream but only use Position (set), Write/WriteAsync, Length, and SetLength — all available on Stream. Accepting Stream would make the methods more testable and reusable (e.g., extracting to a MemoryStream in tests without touching the file system). This is a non-blocking follow-up idea.

💡 suggestion — ReadFromPackedData seek pattern

In CopyPopulatedDataTo, each iteration computes _packedStartOffsets[i] + written as the read offset. This means ReadFromPackedData may need to seek the underlying raw stream on every call. For large segments this is fine (sequential reads), but the pattern is worth noting for future readers — the current implementation in ReadFromPackedData already handles this correctly via _nextPackedOffset tracking, so no action needed.

💡 suggestion — Test assertion gated on NTFS

The test already gates the SparseFile attribute check behind DriveFormat.Equals("NTFS", ...), which is good. One minor note: DriveInfo on mounted subst drives or network volumes could return unexpected format strings. This is very unlikely in CI and not a blocker.


What looks good

  • Position guard ({ Position: 0 }) correctly preserves the old DataStream.CopyTo path when the sparse stream has been partially consumed — addresses the earlier review feedback cleanly.
  • Best-effort FSCTL_SET_SPARSE — silently ignored on unsupported file systems, no error paths to worry about.
  • No preallocation for sparse entries — straightforward conditional in CreateFileStreamOptions prevents surprising disk reservation.
  • Sync/async parity — both code paths are structurally identical, reducing maintenance risk.
  • Test coverage — the new theory exercises both sync and async paths and validates expanded content correctness plus the Windows sparse attribute.

Note

🔒 Integrity filter blocked 1 item

The following item was blocked because it doesn't meet the GitHub integrity level.

  • #128283 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Code Review for issue #128748 · ● 1.4M ·

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unexpected disk preallocation when extracting GNU sparse tar entries

3 participants