Adding batching and multi-histogram support #12
Neat. How long did the run that finished take?
60 minutes on my laptop. Unfortunately, batching has some weird interactions because of needing to pull data down; I think it should be much faster on a VM if you don't have the queuing issue I still appear to be suffering from. For example, if we return to our small histogram wr_renderer_time, with repeated 10% sampling (so 10, 20, ..., 100%) it takes 6 minutes. That appears to be mostly waiting for BQ to get through the queue and actually start doing work, since running with no sampling is also taking much longer than it should even though the post-query section runs at normal speed. Testing suggests that 10% sampling may increase runtime significantly anyway, though. I am going to set sampling to 20%; would you be able to run with this pct branch? Increasing the rate shows a noticeable improvement, probably due to less queuing, and I think your VM should be able to handle 20%. I was able to churn through all 10 columns at 100%; it just took a long time. But there was no OOM, which I was hitting before and which I know you hit without sampling. Ideally we push this number even higher, or we increase the number of columns we process at a time.
Also, would it be possible to test on one of the memory-optimized VMs? It looks like Google has some options with close to a TB of memory for a similar spot price (e.g. m1-ultramem-40), and I'd very much like to know whether this is really an OOM problem or whether we are, for example, exceeding some maximum size such that it would never finish irrespective of how much memory the machine has.
Just updating this thread: I've been testing on memory-optimized VMs but found a regression in performance, apparently due to
cc @edugfilho
My queries are getting queued, so I can't really test any more. I thought I had a code regression, but no, it's BQ making me wait.
This code can take multiple columns and runs samples 10% at a time. It turns out that this actually seems to improve stability quite a bit; i.e. 10 columns at 10% would finish and 1 column at 100% would not. I mostly used the most populous columns I could find. I think this is a real path towards being able to do 100% of histograms.
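
For anyone skimming the thread, the sampling-window idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this PR: `sample_windows`, `run_batched`, and the `fetch_counts` callback are hypothetical names, and it assumes each window returns per-column bucket counts that can simply be summed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shape: {column name: {bucket: count}}
Histograms = Dict[str, Dict[int, int]]


def sample_windows(step_pct: int) -> List[Tuple[int, int]]:
    """Split 0..100 into half-open [lo, hi) percent windows of width step_pct."""
    if not 0 < step_pct <= 100:
        raise ValueError("step_pct must be between 1 and 100")
    return [(lo, min(lo + step_pct, 100)) for lo in range(0, 100, step_pct)]


def run_batched(
    columns: List[str],
    step_pct: int,
    fetch_counts: Callable[[List[str], int, int], Histograms],
) -> Histograms:
    """Query all columns one sample window at a time and merge the results.

    fetch_counts is a stand-in (an assumption) for whatever issues the
    query for one window; only one window's data is held at a time.
    """
    totals: Histograms = {col: {} for col in columns}
    for lo, hi in sample_windows(step_pct):
        partial = fetch_counts(columns, lo, hi)
        for col, buckets in partial.items():
            for bucket, count in buckets.items():
                totals[col][bucket] = totals[col].get(bucket, 0) + count
    return totals
```

With `step_pct=10` this makes ten passes over all columns, and with `step_pct=20` it makes five, so the trade-off discussed above is between more queued queries and more data pulled down per query.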