Adding batching and multi-histogram support #12

Open
perrymcmanis144 wants to merge 16 commits into main from pct

@perrymcmanis144
Contributor

cc @edugfilho

My queries are getting queued, so I can't really test any more. I thought I had a code regression, but no, it's BQ making me wait.

This code can take multiple columns and runs samples 10% at a time. This actually seems to improve stability quite a bit: 10 columns at 10% would finish, while 1 column at 100% would not. I mostly used the most populous columns I could find. I think this is a real path towards being able to do 100% of histograms.

results = glam_style_histogram(
    [
        "wr_renderer_time",
        "dns_native_queuing",
        "gc_ms",
        "ssl_time_until_ready",
        "dns_native_lookup_time",
        "cycle_collector_max_pause",
        "http_kbread_per_conn2",
        "gc_pretenure_count_2",
        "network_cache_v2_miss_time_ms",
        "input_event_response_ms",
    ],
    False,
    "2023-01-08",
    limit=None,
    batch_size=None,
    table="mozdata.telemetry.main",
)
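
For readers following along, here is a minimal sketch of the "10% at a time" idea, not the code in this branch. It assumes the sample space can be split into 100 equal buckets and that per-batch results arrive as `{bucket: count}` dicts; the names `pct_batches` and `merge_histograms` are hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def pct_batches(step: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield half-open [lo, hi) percent ranges that together cover 0..100."""
    if not 0 < step <= 100:
        raise ValueError("step must be in 1..100")
    lo = 0
    while lo < 100:
        hi = min(lo + step, 100)
        yield lo, hi
        lo = hi


def merge_histograms(partials: Iterable[Dict[int, int]]) -> Dict[int, int]:
    """Sum per-bucket counts from each batch into one histogram."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return dict(total)


# Hypothetical usage: each batch would query only the rows in [lo, hi),
# so only one slice of the data has to be held in memory at a time.
batches = list(pct_batches(10))
merged = merge_histograms([{1: 2, 5: 1}, {1: 3, 7: 4}])
```

Because bucket counts are additive, merging the per-batch results should give the same histogram as a single full pass, as long as the batches do not overlap.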

@edugfilho

neat. How long did the run that finished take?

@perrymcmanis144
Contributor Author

perrymcmanis144 commented Jan 23, 2023

> neat. How long did the run that finished take?

60 minutes on my laptop. Unfortunately, batching has some weird interactions because of needing to pull data down. I think it should be much faster on a VM if you don't have the same queuing issue I still appear to be suffering from.

For example, if we return to our small histogram wr_renderer_time, with 10% repeat sampling (so 10, 20, ..., 100%) it takes 6 minutes. Most of that appears to be waiting for BQ to get through the queue and actually start work: running with no sampling is also taking much longer than it should, even though the post-query section runs at normal speed. Testing suggests that 10% sampling may increase runtime significantly anyway, though.
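
One way to separate queue wait from actual execution time is to compare a job's timestamps. As far as I know, job objects in google-cloud-bigquery expose `created`, `started`, and `ended` datetimes; the sketch below takes those three values as plain arguments so it runs without a BQ connection. `JobTiming` and `split_timing` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class JobTiming(NamedTuple):
    queued: timedelta   # time between job creation and the start of execution
    running: timedelta  # time between the start and the end of execution


def split_timing(created: datetime, started: datetime, ended: datetime) -> JobTiming:
    """Split a job's wall time into queue wait and execution time."""
    return JobTiming(queued=started - created, running=ended - started)


# Illustrative values only, not measurements from this PR.
t0 = datetime(2023, 1, 23, 12, 0, 0, tzinfo=timezone.utc)
timing = split_timing(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=6))
print(f"queued {timing.queued}, running {timing.running}")
```

If `queued` dominates, the slowdown is on the BQ side rather than in the batching code.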

I am going to set sampling to 20%. Would you be able to run with this pct branch?

Increasing that rate shows a noticeable improvement, probably due to less queuing, and I think your VM should be able to handle 20%. I was able to churn through all 10 columns at 100%; it just took a long time. But there was no OOM, which I was hitting before and which I know you hit without sampling. Ideally we push this number even higher, or we increase the number of columns we process at a single time.

@perrymcmanis144
Contributor Author

Also, would it be possible to test on one of the memory-optimized VMs? It looks like Google has some options with close to a TB of memory for a similar spot price (e.g. m1-ultramem-40), and I'd very much like to know whether this is really an OOM problem or whether we are, for example, exceeding some max size such that it would never finish irrespective of how much memory the device has.

@edugfilho

Just updating this thread: I've been testing on memory-optimized VMs but found a performance regression, apparently due to the version of the google-cloud-bigquery Python library. Force-updating that library normally fixes the issue, but somehow that isn't happening in the environment on the memory-optimized VM. In the meantime, on another VM I could update the library and fix the performance issue, but all of a sudden I started having permission issues with the source table.
I will update this thread with a doc of execution times as soon as I am able to get a stable environment to test.
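
When a forced upgrade does not seem to take effect, it can help to check which version the running interpreter actually sees. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name `google-cloud-bigquery` is assumed to be the one in use.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run this inside the same environment (venv, notebook kernel, etc.) as the job.
print(installed_version("google-cloud-bigquery"))
```

Comparing this output across the two VMs would show whether the upgrade actually landed in the environment the job runs in.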
