Skip to content

Support bounded parallel chunk transfers#29341

Closed
tyler-french wants to merge 1 commit into
bazelbuild:masterfrom
tyler-french:tfrench/cdc-concurrent
Closed

Support bounded parallel chunk transfers#29341
tyler-french wants to merge 1 commit into
bazelbuild:masterfrom
tyler-french:tfrench/cdc-concurrent

Conversation

@tyler-french
Copy link
Copy Markdown
Contributor

@tyler-french tyler-french commented Apr 19, 2026

Description

For --experimental_remote_cache_chunking implemented in #28437

This PR enables parallel uploads and downloads for chunked files to improve performance. Transport-level concurrency is still bounded by the existing remote cache gRPC connection/concurrency limits. The chunk transfer managers add a fixed per-blob window of 16 to prevent a single large blob from fanning out too aggressively.

To avoid issues with batches, uploads and downloads use simple sliding-window style transfer managers. This PR also fixes #29544 by properly copying the data to out

RELNOTES: CDC chunk uploads and downloads can now happen in parallel within a large blob.

Benchmarking

Benchmarks were rerun on 2026-05-12 with the JMH target //src/test/java/com/google/devtools/build/lib/remote:ChunkedTransferBenchmark. The parent baseline was measured by running the same benchmark harness against the parent commit.

With the synthetic benchmark of network delays and simulated jitter, the current branch is about 11-13x faster than the parent baseline for these cases. As usual, this synthetic benchmark is not a substitute for real remote-cache measurements.

After Change:

Benchmark                                 (avgChunkSizeBytes)  (chunkCount)  (chunkSizeBytes)  (delayMillis)  (fileSizeBytes)  (jitterMillis)  (schedulerThreads)  Mode  Cnt   Score   Error  Units
ChunkedTransferBenchmark.downloadChunked                  N/A            32              1024             25              N/A              10                   1  avgt    3  67.963 ± 2.039  ms/op
ChunkedTransferBenchmark.downloadChunked                  N/A            32              1024             25              N/A              10                   2  avgt    3  67.925 ± 2.257  ms/op
ChunkedTransferBenchmark.downloadChunked                  N/A            32              1024             25              N/A              10                   4  avgt    3  67.915 ± 2.497  ms/op
ChunkedTransferBenchmark.downloadChunked                  N/A            32              1024             25              N/A              10                   8  avgt    3  67.946 ± 2.863  ms/op
ChunkedTransferBenchmark.uploadChunked                   1024           N/A               N/A             25            32768              10                   1  avgt    3  61.357 ± 7.385  ms/op
ChunkedTransferBenchmark.uploadChunked                   1024           N/A               N/A             25            32768              10                   2  avgt    3  61.373 ± 7.920  ms/op
ChunkedTransferBenchmark.uploadChunked                   1024           N/A               N/A             25            32768              10                   4  avgt    3  61.326 ± 8.220  ms/op
ChunkedTransferBenchmark.uploadChunked                   1024           N/A               N/A             25            32768              10                   8  avgt    3  61.393 ± 8.376  ms/op

Before Change:

Benchmark                                 (avgChunkSizeBytes)  (chunkCount)  (chunkSizeBytes)  (delayMillis)  (fileSizeBytes)  (jitterMillis)  (schedulerThreads)  Mode  Cnt    Score     Error  Units
ChunkedTransferBenchmark.downloadChunked                  N/A            32              1024             25              N/A              10                   1  avgt    3  812.081 ± 463.666  ms/op
ChunkedTransferBenchmark.downloadChunked                  N/A            32              1024             25              N/A              10                   2  avgt    3  812.570 ± 442.021  ms/op
ChunkedTransferBenchmark.downloadChunked                  N/A            32              1024             25              N/A              10                   4  avgt    3  811.883 ± 459.534  ms/op
ChunkedTransferBenchmark.downloadChunked                  N/A            32              1024             25              N/A              10                   8  avgt    3  812.371 ± 461.231  ms/op
ChunkedTransferBenchmark.uploadChunked                   1024           N/A               N/A             25            32768              10                   1  avgt    3  740.734 ± 389.653  ms/op
ChunkedTransferBenchmark.uploadChunked                   1024           N/A               N/A             25            32768              10                   2  avgt    3  742.434 ± 412.117  ms/op
ChunkedTransferBenchmark.uploadChunked                   1024           N/A               N/A             25            32768              10                   4  avgt    3  742.483 ± 395.466  ms/op
ChunkedTransferBenchmark.uploadChunked                   1024           N/A               N/A             25            32768              10                   8  avgt    3  742.509 ± 397.122  ms/op

Big File:

CURRENT BRANCH (512 MiB)

Benchmark                                 (avgChunkSizeBytes)  (chunkCount)  (chunkSizeBytes)  (delayMillis)  (fileSizeBytes)  (jitterMillis)  (schedulerThreads)  Mode  Cnt     Score     Error  Units
ChunkedTransferBenchmark.downloadChunked                  N/A           512           1048576             25              N/A              10                   8  avgt    3   991.101 ± 173.890  ms/op
ChunkedTransferBenchmark.uploadChunked                1048576           N/A               N/A             25        536870912              10                   8  avgt    3  1087.504 ± 123.857  ms/op
PARENT BASELINE (512 MiB)

Benchmark                                 (avgChunkSizeBytes)  (chunkCount)  (chunkSizeBytes)  (delayMillis)  (fileSizeBytes)  (jitterMillis)  (schedulerThreads)  Mode  Cnt      Score      Error  Units
ChunkedTransferBenchmark.downloadChunked                  N/A           512           1048576             25              N/A              10                   8  avgt    3  12849.136 ± 1733.063  ms/op
ChunkedTransferBenchmark.uploadChunked                1048576           N/A               N/A             25        536870912              10                   8  avgt    3  11888.109 ± 2591.883  ms/op

Resolves: #29544

@tyler-french tyler-french requested a review from a team as a code owner April 19, 2026 18:23
@github-actions github-actions Bot added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Apr 19, 2026
@tyler-french
Copy link
Copy Markdown
Contributor Author

FYI @tjgq I think this was a follow-up from the original PR

@tyler-french
Copy link
Copy Markdown
Contributor Author

@bazel-io fork 9.2.0

Copy link
Copy Markdown
Contributor

@sluongng sluongng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit skeptical of the current approach, so I will skip on reading the window filling implementation right now. (though Codex does suggest there is a problem there)

Each invocation may have multiple actions/spawns running in parallel, each creates multiple blob uploads/downloads, and some of those blobs are chunked blobs. Adding parallelism on the blob level feels like a local optimization, and the new flag does not offer strong control over the total parallelism of the invocation.

I wonder if we need something higher-level that lets us effectively enforce a global parallelism for uploads and downloads🤔

Comment thread src/main/java/com/google/devtools/build/lib/remote/ChunkedBlobUploader.java Outdated
@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch from 8969051 to 33689c3 Compare April 22, 2026 15:44
Copilot AI review requested due to automatic review settings April 22, 2026 15:44
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Enables bounded parallelism for content-defined chunk (CDC) blob uploads/downloads, improving throughput while avoiding unbounded per-blob fanout.

Changes:

  • Implement sliding-window style, per-blob bounded concurrency for chunk uploads in ChunkedBlobUploader.
  • Implement sliding-window style, per-blob bounded concurrency for chunk downloads (including in-flight dedup) in ChunkedBlobDownloader.
  • Add/expand unit tests for window refill, cancellation, and failure propagation; add a JMH benchmark binary target.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/main/java/com/google/devtools/build/lib/remote/ChunkedBlobUploader.java Adds bounded in-flight chunk upload window and cancellation on failure.
src/main/java/com/google/devtools/build/lib/remote/ChunkedBlobDownloader.java Adds bounded in-flight chunk download window with reassembly and in-flight dedup.
src/test/java/com/google/devtools/build/lib/remote/ChunkedBlobUploaderTest.java Adds tests for window refill, cancellation, and failure handling for parallel uploads.
src/test/java/com/google/devtools/build/lib/remote/ChunkedBlobDownloaderTest.java Updates tests for new download API and adds parallel-window behavior tests.
src/test/java/com/google/devtools/build/lib/remote/ChunkedTransferBenchmark.java Introduces a JMH benchmark for chunked upload/download with latency + jitter.
src/test/java/com/google/devtools/build/lib/remote/BUILD Adds a java_opt_binary target to run the new benchmark.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/main/java/com/google/devtools/build/lib/remote/ChunkedBlobDownloader.java Outdated
Comment thread src/test/java/com/google/devtools/build/lib/remote/ChunkedTransferBenchmark.java Outdated
Comment thread src/main/java/com/google/devtools/build/lib/remote/ChunkedBlobUploader.java Outdated
@tyler-french
Copy link
Copy Markdown
Contributor Author

I'm a bit skeptical of the current approach, so I will skip on reading the window filling implementation right now. (though Codex does suggest there is a problem there)

Each invocation may have multiple actions/spawns running in parallel, each creates multiple blob uploads/downloads, and some of those blobs are chunked blobs. Adding parallelism on the blob level feels like a local optimization, and the new flag does not offer strong control over the total parallelism of the invocation.

I wonder if we need something higher-level that lets us effectively enforce a global parallelism for uploads and downloads🤔

Updated this in the follow-up direction you suggested: I removed the flag/plumbing and kept only a small hardcoded per-blob window of 32 as a guard against huge single-blob fanout. The actual global active RPC limit is still the shared gRPC channel pool (--remote_max_connections / --remote_max_concurrency_per_connection), so this isn’t intended to be a new invocation-level concurrency control - the combined cache doesn't have such restriction as far as I can tell.

@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch from 33689c3 to 8363b37 Compare April 22, 2026 17:15
Comment thread src/main/java/com/google/devtools/build/lib/remote/ChunkedBlobDownloader.java Outdated
Comment thread src/main/java/com/google/devtools/build/lib/remote/ChunkedBlobDownloader.java Outdated
@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch from 8363b37 to 976a25f Compare April 27, 2026 14:46
@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch 2 times, most recently from c64cbdb to d846665 Compare May 11, 2026 15:12
Copy link
Copy Markdown
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main point of contention with this approach is that files are buffered in memory in their entirety while being transferred, which could easily cause us to run out of heap. I'm concerned that this might make chunked transfers useless in practice for builds with lots of parallelism.

I think this is easy to fix for the upload side: instead of passing a byte[] around, pass a (path, offset) and delay the opening/reading of the InputStream until the upload starts. (That does increase the number of open descriptors, but as long as they're still bounded by MAX_IN_FLIGHT_CHUNK_UPLOADS, it might be ok.)

The download side unfortunately isn't as easy. We could try to write a chunk out as soon as it completes instead of waiting for all of them the next one in order, but that doesn't help much if the downloads complete roughly at the same time (which is what we'd expect if the transport is completely fair).

Any thoughts?

@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch from d846665 to 2a2987d Compare May 11, 2026 20:29
@tyler-french
Copy link
Copy Markdown
Contributor Author

My main point of contention with this approach is that files are buffered in memory in their entirety while being transferred, which could easily cause us to run out of heap. I'm concerned that this might make chunked transfers useless in practice for builds with lots of parallelism.

I think this is easy to fix for the upload side: instead of passing a byte[] around, pass a (path, offset) and delay the opening/reading of the InputStream until the upload starts. (That does increase the number of open descriptors, but as long as they're still bounded by MAX_IN_FLIGHT_CHUNK_UPLOADS, it might be ok.)

The download side unfortunately isn't as easy. We could try to write a chunk out as soon as it completes instead of waiting for all of them the next one in order, but that doesn't help much if the downloads complete roughly at the same time (which is what we'd expect if the transport is completely fair).

Any thoughts?

I like this. I adjusted the upload path to lazily matierialize the chunk. But for download we can't really do this. I adjusted the lookahead to match the concurrency, but realistically this would be rare that these take up a lot of memory, since we should immediately consume the memory once the next chunk in order is available. It's a tad messier

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch from 2a2987d to cc1c408 Compare May 12, 2026 20:05
Copy link
Copy Markdown
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I'm happy with the current state; we can revisit the downloader if anyone reports problems. I only have a couple of nits, will approve once they've been addressed.

@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch from cc1c408 to 5e2051f Compare May 12, 2026 20:58
@tyler-french
Copy link
Copy Markdown
Contributor Author

@bazel-io fork 9.2.0

@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels May 12, 2026
@tyler-french
Copy link
Copy Markdown
Contributor Author

@bazel-io fork 9.2.0

@iancha1992 Let me know if there's issues importing this one here. They may be a merge conflicts

@iancha1992
Copy link
Copy Markdown
Member

@tyler-french We're currently on it.

@iancha1992
Copy link
Copy Markdown
Member

@bazel-io fork 9.2.0

@iancha1992
Copy link
Copy Markdown
Member

@bazel-io fork 9.1.1

@Wyverald
Copy link
Copy Markdown
Member

Hmm... During import, certain tests are somehow failing for RBE (https://buildkite.com/bazel/google-bazel-presubmit/builds/102948/list?jid=019e4143-51bb-4115-9a99-73ded8407a81&tab=artifacts). We recently upgraded RBE VMs to Ubuntu 24.04 from 20.04. Could you rebase on top of HEAD and see if you can trigger the same failures?

@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch from 5e2051f to 16f48f9 Compare May 19, 2026 18:52
@tyler-french
Copy link
Copy Markdown
Contributor Author

Hmm... During import, certain tests are somehow failing for RBE (buildkite.com/bazel/google-bazel-presubmit/builds/102948/list?jid=019e4143-51bb-4115-9a99-73ded8407a81&tab=artifacts). We recently upgraded RBE VMs to Ubuntu 24.04 from 20.04. Could you rebase on top of HEAD and see if you can trigger the same failures?

@Wyverald I think there's some underlying permissions issues with 24.04 because of unprivileged user namespace restrictions. These test failure do not seem like they could be caused by anything in this PR, probably just because they were previously cache hits which are now needing to be re-run

@tyler-french
Copy link
Copy Markdown
Contributor Author

tyler-french commented May 19, 2026

@Wyverald @iancha1992 this seems to fix the issue - but maybe there's a better way: #29597

EDIT: seems to still be failing 😞

@Wyverald
Copy link
Copy Markdown
Member

Thanks for the investigation! @coeuvre to take a look as well.

@coeuvre
Copy link
Copy Markdown
Member

coeuvre commented May 20, 2026

It might be related to a recent upgrade of the rbe worker pool. Still investigating.

In the meanwhile, the affected tests on RBE have been disabled by bca5e06.

@tyler-french tyler-french force-pushed the tfrench/cdc-concurrent branch from 16f48f9 to cb1a9c2 Compare May 20, 2026 13:52
@tyler-french
Copy link
Copy Markdown
Contributor Author

tyler-french commented May 20, 2026

In the meanwhile, the affected tests on RBE have been disabled by bca5e06.

Great, CI is passing! @coeuvre Can we cherry-pick the fix onto 9.1.1

@iancha1992
Copy link
Copy Markdown
Member

@tyler-french Sure, this will be cherry-picked as soon as it's merged to master.

@github-actions github-actions Bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Remote-Exec Issues and PRs for the Execution (Remote) team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

--experimental_remote_cache_chunking can produce truncated top-level outputs when combined with --disk_cache

8 participants