Feat : Multiple download functionality #271
Feat : Multiple download functionality #271Dhiren-Mhatre wants to merge 22 commits intouc-cdis:masterfrom
Conversation
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
I haven't had time to review other Python code, will get to that soon
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
533d187 to
97c29eb
Compare
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
gen3/cli/download.py
Outdated
| @click.argument("guid") | ||
| @click.option( | ||
| "--download-path", | ||
| default=".", |
There was a problem hiding this comment.
let's make this default more dynamic and a folder by default.
maybe let's do a timestamped folder name like:
from datetime import datetime
...
default=f"download_{datetime.now().strftime("%d_%b_%Y")}",
There was a problem hiding this comment.
do the same thing for multiple download async
gen3/cli/download.py
Outdated
| ) | ||
| @click.option( | ||
| "--max-concurrent-requests", | ||
| default=300, |
There was a problem hiding this comment.
this is really, really high. Let's scale this default to something like 20 and people can manually bump this up as needed.
| type=int, | ||
| ) | ||
| @click.option( | ||
| "--skip-completed", |
There was a problem hiding this comment.
I'm not sure this is working as intended. I get a lot of warnings when I try this:
$ poetry run gen3 -vv download-multiple-async --manifest MIDRC_case_manifest.json --download-path ./downloads --skip-completed
[2025-09-02 14:28:36,954][ DEBUG] Initializing auth..
Found 5178 files to download
Continue with async download? [y/N]: y
Downloading: 0%| | 0/5178 [00:00<?, ?it/s][2025-09-02 14:28:40,004][WARNING] File will be overwritten: downloads/10041569-u_H3HaB1lES6-HXVpiEfMA/2.16.840.1.114274.1818.48858790339993589885669552017090272896/2.16.840.1.114274.1818.56909342758958044433235758152671341713.zip
[2025-09-02 14:28:40,010][WARNING] File will be overwritten: downloads/10041569-FtpnR4GEPEe9JztuDUqrXg/2.16.840.1.114274.1818.544490635779373958610541144202466729913/2.16.840.1.114274.1818.55741368199706394947919032241968638388.zip
[2025-09-02 14:28:40,085][WARNING] File will be overwritten: downloads/10041569-CooqJZVIRkywB6_o9CJi0Q/2.16.840.1.114274.1818.567640554746210778914433462824885428113/2.16.840.1.114274.1818.523353596424499437510054502719882590104.zip
Actually... it seems like maybe it is actually skipping but we need to suppress all these warnings and/or not overwrite existing files, just skip downloading again.
gen3/cli/download.py
Outdated
| ) | ||
| @click.option( | ||
| "--skip-completed", | ||
| is_flag=True, |
There was a problem hiding this comment.
I don't think you can have an is_flag and default to True b/c how do I set this to false then? it just needs to be a boolean but default to true
There was a problem hiding this comment.
you could reverse the logic here and just call it: redownload-completed and then leave is_flag=True but default to false
There was a problem hiding this comment.
make sure to update docs if you change this
There was a problem hiding this comment.
don't change it, actually. We need to ensure we're matching previous commands exactly.
Here's the help from gen3-client:
Flags:
--download-path string The directory in which to store the downloaded files (default ".")
--filename-format string The format of filename to be used, including "original", "guid" and "combined" (default "original")
-h, --help help for download-multiple
--manifest string The manifest file to read from. A valid manifest can be acquired by using the "Download Manifest" button in Data Explorer from a data common's portal
--no-prompt If set to true, will not display user prompt message for confirmation
--numparallel int Number of downloads to run in parallel (default 1)
--protocol string Specify the preferred protocol with --protocol=s3
--rename Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
--skip-completed If set to true, will check for filename and size before download and skip any files in "download-path" that matches both
Let's change the overall command name to match too. download-multiple instead of download-multiple-async
Keep skip completed but make it a boolean so I can pass in false.
I'm okay with --numparallel not existing and being different b/c the parallelization scheme in Python is a bit different... maybe we can allow that to be passed in though and just use the value as max-concurrent-requests if that isn't supplied. So people can easily port their commands over and have them work right away
docs/howto/asyncDownloadMultiple.md
Outdated
| - **High-bandwidth networks**: Increase the number of worker processes | ||
| - **Limited memory**: Reduce queue sizes to manage memory usage | ||
|
|
||
| ### Memory Management |
There was a problem hiding this comment.
let's remove this section, I feel it's redundant given the above config info
docs/howto/asyncDownloadMultiple.md
Outdated
| - **Batch Size**: Balance between memory usage and processing overhead | ||
| - **Process Count**: Match available CPU cores for optimal performance | ||
|
|
||
| ### Network Optimization |
There was a problem hiding this comment.
let's also remove this section
docs/howto/asyncDownloadMultiple.md
Outdated
|
|
||
| - Check network bandwidth and server limits | ||
| - Reduce concurrent request limits if server is overwhelmed | ||
| - Verify authentication token is valid |
There was a problem hiding this comment.
auth token would not be a reason for slow downloads, let's remove this
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
Make sure to add unit tests for this new functionality
gen3/cli/__main__.py
Outdated
| main.add_command(drs_pull.drs_pull) | ||
| main.add_command(file.file) | ||
| main.add_command(download.download_single, name="download-single") | ||
| main.add_command(download.download_multiple_async, name="download-multiple-async") |
There was a problem hiding this comment.
| main.add_command(download.download_multiple_async, name="download-multiple-async") | |
| main.add_command(download.download_multiple, name="download-multiple") |
There was a problem hiding this comment.
Try to remember to retest after any changes to your code, this was broken b/c the the function doesn't exist anymore
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
| ] | ||
|
|
||
|
|
||
| def test_download_single_success(gen3_file): |
There was a problem hiding this comment.
please add a docstring for each test describing what you're testing
docs/howto/asyncDownloadMultiple.md
Outdated
| gen3 --endpoint data.commons.io --auth creds.json download-multiple \ | ||
| --manifest large_dataset.json \ | ||
| --download-path ./large_downloads \ | ||
| --max-concurrent-requests 20 \ |
There was a problem hiding this comment.
let's change to:
| --max-concurrent-requests 20 \ | |
| --max-concurrent-requests 64 \ | |
| --max-concurrent-requests 8 |
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
You will also need to update this test. I got it working:
def test_download_single_basic_functionality(gen3_file):
"""
Test download_single basic functionality with synchronous download.
Verifies that download_single downloads a file successfully using
synchronous requests and returns True.
"""
gen3_file._auth_provider._refresh_token = {"api_key": "123"}
with patch.object(gen3_file, 'get_presigned_url') as mock_presigned, \
patch('gen3.file.requests.get') as mock_get, \
patch('gen3.index.Gen3Index.get_record') as mock_index:
mock_presigned.return_value = {"url": "https://fake-url.com/file"}
mock_index.return_value = {"file_name": "test-file.txt"}
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.headers = {"content-length": "12"}
mock_response.iter_content = lambda size: [b"test content"]
mock_get.return_value = mock_response
result = gen3_file.download_single(object_id="test-guid", path="/tmp", protocol="s3")
assert result == True
mock_presigned.assert_called_once_with("test-guid", protocol="s3")
mock_index.assert_called_once_with("test-guid")Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
…github.com/Dhiren-Mhatre/gen3sdk-python into feat/multiple-download-performance-testing
gen3/cli/download.py
Outdated
| @click.option( | ||
| "--numparallel", | ||
| default=None, | ||
| help="Number of downloads to run in parallel (compatibility with gen3-client)", |
There was a problem hiding this comment.
| help="Number of downloads to run in parallel (compatibility with gen3-client)", | |
| help="Same as max-concurrent-requests, for compatibility with previous gen3-client arguments", |
gen3/cli/download.py
Outdated
| click.echo("\nFailed downloads:") | ||
| for failure in result["failed"]: | ||
| click.echo( | ||
| f" - {failure.get('guid', 'unknown')}: {failure.get('error', 'Unknown error')}" | ||
| ) | ||
|
|
There was a problem hiding this comment.
this is not working, I'm getting this:
[2026-01-21 15:19:03,719][ ERROR] Async batch download failed: 'str' object has no attribute 'get'
[2026-01-21 15:19:03,719][ ERROR] Async batch download failed: 'str' object has no attribute 'get'
if I provide the command poetry run gen3 download-multiple --manifest MIDRC_case_manifest_small.json --no-progress --no-prompt --protocol foo --skip-completed false
I suggest we remove this listing of failures entirely, as the ERROR logs themselves have the GUIDs that failed in them
gen3/cli/download.py
Outdated
| ) | ||
|
|
||
| click.echo( | ||
| "\nTo retry failed downloads, run the same command with --skip-completed flag:" |
There was a problem hiding this comment.
| "\nTo retry failed downloads, run the same command with --skip-completed flag:" | |
| "\nTo retry failed downloads, run the same command with --skip-completed true" |
gen3/cli/download.py
Outdated
| "\nTo retry failed downloads, run the same command with --skip-completed flag:" | ||
| ) | ||
|
|
||
| success_rate = len(result["succeeded"]) / len(manifest_data) * 100 |
There was a problem hiding this comment.
we should rethink this because the way it's working, I get a non 100 success if I skip previously downloaded files.
$ poetry run gen3 -vv download-multiple --manifest MIDRC_case_manifest.json --no-prompt --max-concurrent-requests 64 --num-processes 8
...
Async Download Results:
✓ Succeeded: 4871
- Skipped: 307
Success rate: 94.1%
If --skip-completed is true, we should count skipped towards success imo
There was a problem hiding this comment.
this is even more apparent when I rerun a successful download and it skips everything
Async Download Results:
✓ Succeeded: 0
- Skipped: 42
Success rate: 0.0%
gen3/cli/download.py
Outdated
| auth = ctx.obj["auth_factory"].get() | ||
|
|
||
| # Use numparallel as max_concurrent_requests if provided (for gen3-client compatibility) | ||
| if numparallel is not None and max_concurrent_requests == 20: # 20 is the default |
There was a problem hiding this comment.
create a global for this default value, don't hard code and make a magic number
| help="Protocol for presigned URL (e.g., s3) (default: auto-detect)", | ||
| ) | ||
| @click.pass_context | ||
| def download_single( |
There was a problem hiding this comment.
I would like this support the same interface as the gen3-client, and ideally the same params (the ones that translate) from download-multiple. What we could maybe do is just wrap download-multiple behind the scenes but pass a manifest with a single GUID (e.g. don't repeat code we don't need to repeat).
As a reference, this is gen3-client's support:
$ gen3-client download-single --help
Gets a presigned URL for a file from a GUID and then downloads the specified file.
Usage:
gen3-client download-single [flags]
Examples:
./gen3-client download-single --profile=<profile-name> --guid=206dfaa6-bcf1-4bc9-b2d0-77179f0f48fc
Flags:
--download-path string The directory in which to store the downloaded files (default ".")
--filename-format string The format of filename to be used, including "original", "guid" and "combined" (default "original")
--guid string Specify the guid for the data you would like to work with
-h, --help help for download-single
--no-prompt If set to true, will not display user prompt message for confirmation
--protocol string Specify the preferred protocol with --protocol=gs
--rename Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
--skip-completed If set to true, will check for filename and size before download and skip any files in "download-path" that matches both
Global Flags:
--profile string Specify profile to use
Avantol13
left a comment
There was a problem hiding this comment.
in general things are looking good but see my comments. It's probably prudent to test using Python 3.13 (I did locally and had to fix a dep issue, I think just a bump, but double check that).
| help="Disable progress bar (default: false)", | ||
| ) | ||
| @click.pass_context | ||
| def download_single( |
There was a problem hiding this comment.
this was already defined and is now being overridden. remove the old code.
Also can you adjust so that the GUID is not an arg but a --guid VALUE to match the gen3-client exactly?
There was a problem hiding this comment.
I'm not sure if this causes the issues I'm seeing, but download-single is hanging for me and not completing:
poetry run gen3 -vv d-vv download-single dg.MD1R/28fb1a28-d662-429b-9b74-91b2aa03c63f
[2026-02-12 12:34:35,203][ DEBUG] Initializing auth..
Downloading: 0%| | 0/1 [00:00<?, ?it/s]
it just hangs there forever
New Features
Dependency updates
Updated fastavro from 1.8.4 to 1.11.1
Updated pypfb to include extras: pypfb = {extras = ["gen3"], version = "^0.5.33"}
Updated importlib-metadata from 8.5.0 to 4.13.0