I think the current naming scheme of the `pp.io` module is somewhat inconsistent and confusing:
- `df_to_graph` is misleading since it does not result in a "proper" graph: the node attributes are missing. The reverse also holds: the function `graph_to_df` is in reality closer to an `edges_to_df`.
- There is a `write_csv` function but no `read_csv` function; instead, there are several `read_csv_...` functions, one for each class. Inside `write_csv`.
- Hardcoding the column names to `v` and `w` seems very inflexible to me.

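To illustrate the last point: with hardcoded column names, an edge list that uses any other naming has to be renamed by hand before it can be converted. A minimal pandas-only sketch; the column names `source`, `target` and `weight` are made up for illustration, and the actual call into `pp.io` is left out on purpose.

```python
import pandas as pd

# Hypothetical edge list whose columns do not follow the v/w convention.
edges = pd.DataFrame(
    {"source": ["a", "b"], "target": ["b", "c"], "weight": [1.0, 2.0]}
)

# Workaround needed today: rename the columns to the hardcoded names first.
renamed = edges.rename(columns={"source": "v", "target": "w"})

print(list(renamed.columns))
```

With `source_col` and `target_col` parameters this renaming step would not be necessary.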
I propose the following changes:
- Instead of `df_to_graph`, there should be a `read_dataframe(edges: Optional[DataFrame], nodes: Optional[DataFrame])` function that takes two optional dataframes, one for edges and one for nodes. This should then also be made consistent with all other formats, e.g. `read_csv(edges: Optional[str | Path], nodes: Optional[str | Path])`. Ideally, these methods should automatically attempt to infer the class, i.e. simple or temporal graph, with an additional option to set the class explicitly.
- Instead of `graph_to_df`, we would then have a function `write_dataframe(...) -> DataFrame, DataFrame` that returns two dataframes by default, one for the nodes and one for the edges.
- Additionally, there could be functions like `edges_to_df` and `nodes_to_df` which would split the functionality of `write_dataframe` into two parts.
- For flexibility, all of these methods should have parameters like `source_col`, `target_col` and `time_col` that specify the names of the columns that contain the source node, the target node and the timestamp.

`PathData` could either be handled separately or included in the above with an additional optional parameter, e.g. `read_csv(edges, nodes, paths)`.
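To make the proposal a bit more concrete, here is a rough sketch of how the dataframe functions could fit together. It is only an illustration under several assumptions: `GraphData` is a hypothetical stand-in and not an existing `pp` class, the `temporal` flag, the `node_id` column, the default `time_col="t"` and the (nodes, edges) return order are my own choices, and the class is inferred simply from whether a time column is present.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import pandas as pd


@dataclass
class GraphData:
    """Hypothetical stand-in for a graph object (not a pathpyG class)."""
    edges: pd.DataFrame
    nodes: pd.DataFrame
    temporal: bool


def read_dataframe(
    edges: Optional[pd.DataFrame] = None,
    nodes: Optional[pd.DataFrame] = None,
    source_col: str = "v",
    target_col: str = "w",
    time_col: str = "t",  # assumed default, not taken from the library
    temporal: Optional[bool] = None,
) -> GraphData:
    """Build a graph from optional edge and node dataframes."""
    if edges is None:
        edges = pd.DataFrame(columns=[source_col, target_col])
    missing = {source_col, target_col} - set(edges.columns)
    if missing:
        raise ValueError(f"edge dataframe lacks columns: {sorted(missing)}")

    # Infer the class from the time column unless it was set explicitly.
    if temporal is None:
        temporal = time_col in edges.columns
    elif temporal and time_col not in edges.columns:
        raise ValueError(f"temporal graph requested but no '{time_col}' column")

    if nodes is None:
        # No node dataframe given: fall back to the ids seen in the edge list.
        ids = pd.unique(
            pd.concat([edges[source_col], edges[target_col]], ignore_index=True)
        )
        nodes = pd.DataFrame({"node_id": ids})  # hypothetical column name

    return GraphData(edges=edges.copy(), nodes=nodes.copy(), temporal=temporal)


def nodes_to_df(g: GraphData) -> pd.DataFrame:
    """Return only the node table."""
    return g.nodes.copy()


def edges_to_df(g: GraphData) -> pd.DataFrame:
    """Return only the edge table."""
    return g.edges.copy()


def write_dataframe(g: GraphData) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Return (nodes, edges); the order is an assumption of this sketch."""
    return nodes_to_df(g), edges_to_df(g)


# Usage: custom column names, class inferred from the time column.
edge_df = pd.DataFrame({"src": ["a", "b"], "dst": ["b", "c"], "time": [1, 2]})
g = read_dataframe(edge_df, source_col="src", target_col="dst", time_col="time")
nodes_df, edges_df = write_dataframe(g)
```

A `read_csv(edges, nodes)` counterpart could then load the files and delegate to `read_dataframe`, so that all formats share the same parameters.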