explore parallelization in operations using group_by() #21

@henry-ngo

Description

Our team has encountered a performance issue with package functions that call dplyr::group_by() followed by dplyr::summarise() on very large data frames. A common example is get_ds_rt().

This appears to be a general dplyr performance limitation rather than anything specific to this package; grouped summaries slow down substantially as the number of groups and rows grows.

A workaround is to split the data frame into chunks and process one chunk at a time. The user could also parallelize this step themselves.
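The chunked workaround could look something like the sketch below. The column names (`id` for the grouping variable, `rt` for the summarised value) are illustrative assumptions, not the package's actual schema:

```r
library(dplyr)

# Sketch: split a large data frame into chunks by group, summarise each
# chunk separately, then recombine. Column names are hypothetical.
summarise_in_chunks <- function(df, n_chunks = 4) {
  groups <- unique(df$id)
  chunk_ids <- split(groups, cut(seq_along(groups), n_chunks, labels = FALSE))
  results <- lapply(chunk_ids, function(ids) {
    df %>%
      filter(id %in% ids) %>%
      group_by(id) %>%
      summarise(mean_rt = mean(rt, na.rm = TRUE), .groups = "drop")
  })
  bind_rows(results)
}
```

On Unix-like systems the `lapply()` call could be swapped for `parallel::mclapply()` to run chunks concurrently with no other changes.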

However, perhaps there is a way to do this within the package? It could be off by default and implemented by adding an argument to these functions, such as nthreads, letting the user specify how many threads to devote to the operation.

(e.g., https://multidplyr.tidyverse.org/, or other similar packages)
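With multidplyr, the proposed nthreads argument might be wired up roughly as follows. This is a sketch only: the function name, the nthreads parameter, and the `id`/`rt` columns are all hypothetical, and a real implementation would need to weigh the cost of copying data to the worker processes:

```r
library(dplyr)
library(multidplyr)

# Sketch of a possible opt-in parallel path. Falls back to plain dplyr
# when nthreads = 1, so existing behaviour is unchanged by default.
get_summary_parallel <- function(df, nthreads = 1) {
  if (nthreads <= 1) {
    return(df %>%
      group_by(id) %>%
      summarise(mean_rt = mean(rt, na.rm = TRUE), .groups = "drop"))
  }
  cluster <- new_cluster(nthreads)   # spin up worker processes
  on.exit(rm(cluster), add = TRUE)   # release workers when done
  df %>%
    group_by(id) %>%
    partition(cluster) %>%           # shard groups across workers
    summarise(mean_rt = mean(rt, na.rm = TRUE)) %>%
    collect()                        # gather results back
}
```

Note that partition()/collect() round-trips the data through the worker processes, so the speedup only materializes when the grouped computation dominates the serialization cost.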
