Our team has encountered a performance issue with package functions that call dplyr::group_by() followed by dplyr::summarise() on very large data frames; get_ds_rt() is a common example.
This appears to be a general dplyr limitation that shows up across many such usages rather than something specific to this package.
A workaround is to split up the dataframe into chunks and run one chunk at a time. This could be parallelized by the user as well.
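As a rough illustration of that workaround (a sketch only, using a generic grouped summary as a stand-in for whatever get_ds_rt() actually computes; the grouping column `g` and summary `mean(x)` are placeholder assumptions):

```r
library(dplyr)
library(parallel)

# Placeholder data: many groups in a large data frame
df <- tibble(g = rep(seq_len(1000), each = 100), x = runif(100000))

# Split into chunks by group ranges so every group lands in exactly one
# chunk, then summarise each chunk on its own worker process.
# Note: mclapply() with mc.cores > 1 is POSIX-only; on Windows use
# parLapply() with a PSOCK cluster instead.
chunks <- split(df, cut(df$g, breaks = 4, labels = FALSE))
result <- bind_rows(mclapply(chunks, function(ch) {
  ch %>%
    group_by(g) %>%
    summarise(mean_x = mean(x), .groups = "drop")
}, mc.cores = 4))
```

Splitting on the grouping variable (rather than on raw row ranges) matters here: it guarantees no group is divided across chunks, so the per-chunk summaries can simply be row-bound back together.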
However, perhaps this could be supported within the package itself? It could be off by default and implemented by adding an argument such as nthreads to these functions, letting the user specify how many threads they are willing to devote to the operation (e.g. via https://multidplyr.tidyverse.org/ or a similar package).
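A sketch of what such an opt-in nthreads argument might look like internally, assuming multidplyr (new_cluster(), partition(), and collect() are the real multidplyr API; the wrapper function, its name, and the summary body are hypothetical stand-ins for the package's functions):

```r
library(dplyr)
library(multidplyr)

# Hypothetical wrapper illustrating the proposed nthreads argument;
# the grouped summary stands in for whatever get_ds_rt() computes.
get_grouped_summary <- function(df, nthreads = 1) {
  if (nthreads > 1) {
    cluster <- multidplyr::new_cluster(nthreads)
    df %>%
      group_by(g) %>%
      partition(cluster) %>%            # shard groups across worker processes
      summarise(mean_x = mean(x)) %>%
      collect() %>%                     # gather results back to main session
      arrange(g)
  } else {
    # nthreads = 1: plain single-threaded dplyr path (the default)
    df %>%
      group_by(g) %>%
      summarise(mean_x = mean(x), .groups = "drop")
  }
}
```

Keeping the single-threaded path as the default preserves current behavior; spinning up a cluster has real startup and data-transfer overhead, so it only pays off on data frames large enough that the grouped summary dominates.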