Releases: awslabs/graphstorm
V0.5.1 Release Note
GraphStorm 0.5.1 is a minor release that brings enhanced real-time inference capabilities with support for raw text input and learnable embeddings in the payload JSON, enables edge feature support across all GraphStorm built-in GNN models, and introduces experimental Mitra integration for generating embeddings from tabular features. This release improves the flexibility of real-time inference workflows, expands the modeling capabilities for edge-rich graph datasets, and provides new options for automatic feature engineering from tabular data. We have also improved documentation and fixed several issues to enhance the overall user experience.
Major Features
- Enhanced Real-time Inference Support
- Edge Features Support for Built-in GNN Models
- Experimental Mitra Integration for Tabular Feature Embeddings
Documentation Updates
Fixes
- [Bug fix] Save and reload edge feature encoder as part of model parameters (#1347).
- [Bug fix] Fix HGT encoder to handle blocks with source node types only (#1356).
- [Bug fix] Fix edge input encoder to handle 1D input features (#1370).
Breaking Changes
- In 0.5.1, the model layer arguments changed from `embed|dense_embed,gnn,decoder` to `node_embed,edge_embed,gnn,decoder`, and the `restore_model()` behavior was updated accordingly, removing the `embed` and `dense_embed` options. As a result, GraphStorm models trained with version 0.5 or earlier cannot be loaded by GraphStorm 0.5.1 or later versions. Users must either retrain models with version 0.5.1 or later, or load models using version 0.5 or earlier.
Contributors
- Xiang Song from AWS
- Jian Zhang from AWS
- Runjie Ma from AWS
- Han Xie from AWS
- Haiyang Yu from AWS
V0.5 Release Note
The GraphStorm V0.5 release marks a major milestone with the introduction of native real-time inference support for GraphStorm, enabling users to deploy SageMaker endpoints with trained GraphStorm models for seamless node prediction services in real time. The key improvements include: 1/ a specialized SageMaker endpoint for hosting and serving GraphStorm models; 2/ real-time processing of graph payloads in JSON format during inference; 3/ a comprehensive launch script that streamlines GraphStorm SageMaker endpoint deployment, reducing setup complexity. The end-to-end real-time inference pipeline delivers exceptional performance, processing a 100-node graph in just 160 ms. The enhanced process includes graph sampling from Neptune DB, payload preparation on the SageMaker endpoint, and producing the inference result. Users can follow this documentation to deploy a GraphStorm SageMaker endpoint and this documentation to prepare their node prediction workload.
Major features
- [Feature] Real-time inference for Node Prediction Tasks (#1284, #1285, #1286, #1288, #1292, #1296, #1305, #1307, #1317, #1319, #1320, #1321, #1322, #1326, #1333)
Documentation Update
- [Doc] Update GraphStorm environment setup documentation (#1277)
- [Doc] Correct the directory handling in the GSProcessing Docker script and documentation. (#1306)
- [Doc] Real-time inference documentation (#1312)
Minor features
- [Feature] Enable the Config "fixed-test-size" for LM training with Link Prediction. (#1310)
Fixes
- [Bugfix] Fix the case when the subtask name is too long in multi-task learning (#1279)
- [Bugfix] Fix the inference code for non-autoregressive mode example (#1281)
- [Bugfix] Fix a bug in remapping node embeddings when shared filesystem is not available. (#1299)
- [Bugfix] Solve local model cache issue when running on EMR for GSProcessing (#1298)
Breaking changes
- In 0.5, GraphStorm training on SageMaker will only upload the latest epoch's artifacts instead of uploading the entire model output. For GraphStorm version <= 0.4.2, SageMaker training uploaded every model checkpoint to S3. Now we determine the latest epoch from the list of checkpoint directories and copy only the model weights, any embeddings, and config files (json/yaml/yml) to the standard SageMaker output directory (#1314)
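The latest-epoch selection described above can be sketched as follows. This is an illustrative sketch, not GraphStorm's actual implementation; the `epoch-N` directory naming and the set of file extensions are assumptions.

```python
import re
import shutil
from pathlib import Path

def copy_latest_epoch(model_dir: str, output_dir: str) -> str:
    """Copy only the newest epoch checkpoint to the output directory.

    Scans for sub-directories named like 'epoch-<N>' (a hypothetical
    naming scheme for illustration), picks the highest N, and copies
    model weights, embeddings, and config files over.
    """
    epoch_dirs = [
        d for d in Path(model_dir).iterdir()
        if d.is_dir() and re.fullmatch(r"epoch-\d+", d.name)
    ]
    latest = max(epoch_dirs, key=lambda d: int(d.name.split("-")[1]))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Keep model weights, embeddings, and config files only.
    keep_suffixes = (".bin", ".pt", ".json", ".yaml", ".yml")
    for f in latest.iterdir():
        if f.is_file() and f.suffix in keep_suffixes:
            shutil.copy2(f, out / f.name)
    return latest.name
```

Sorting numerically rather than lexically matters here: `epoch-10` must win over `epoch-2`.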
Contributors
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Han Xie from AWS
- Haiyang Yu from AWS
GraphStorm v0.4.2 release
v0.4.2 Release notes
GraphStorm 0.4.2 is a minor release that brings support for knowledge graph embedding training, new metrics for classification to measure the tradeoff between recall and precision, and makes it possible to generate all-node predictions and embeddings with a single argument. We also added support for HyperBand HPO on SageMaker, and improved interoperability between single-instance and distributed graph pre-processing, making it easier to transition between the two modes depending on the size of your graphs.
Major features
- Support separate encoder for different features of the same node type #1221
- Add `fscore_at_beta`, `precision_at_recall`, and `recall_at_precision` metrics for classification tasks #1235 #1273
- Add `--infer-all-target-nodes` argument to trigger inference on all nodes in target node types #1235
- Add support for knowledge graph embedding training #1260
Documentation update
Minor features
- Allow no-op operation in GConstruct to read strings of delimited numbers as vectors #1225
- Focal loss no longer requires setting num_classes to 1. #1231
- [SageMaker] Support HyperBand optimization strategy #1249
- [GConstruct/GSProcessing] Allow GConstruct to accept GSProcessing Config #1230
- [GSProcessing] Handle `*` wildcards in filenames #1268
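The no-op vector parsing mentioned in the minor features above can be sketched as below. The `;` delimiter is an assumption for illustration; check the GConstruct documentation for the actual option name and supported delimiters.

```python
def parse_delimited_vector(value: str, delimiter: str = ";") -> list:
    """Turn a string of delimited numbers (e.g. "0.1;0.2;0.3")
    into a float vector, the idea behind reading strings of
    delimited numbers as vectors."""
    return [float(token) for token in value.split(delimiter)]
```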
Fixes
- [GSProcessing] Support `:` in label column names #1234
- [GSPartition] Ensure random partition assigns at least one node per partition #1239
- [BugFix] Fix the case when early stopping does not work with report_eval_per_type #1223
- Fix the case when the subtask name is too long in multi-task learning #1279
Breaking changes
- Add future warning for focal loss shape and clarify use in documentation #1275
Contributors
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Han Xie from AWS
- Yuke Wang from Rice University
GraphStorm v0.4.1 release
v0.4.1 Release notes
We are happy to announce the GraphStorm v0.4.1 release.
In this version we have added native support for edge features in message passing to HGT and RGAT models. We added two new loss functions, Bayesian Personalized Ranking loss for link prediction tasks, and Shrinkage Loss for imbalanced regression tasks. We added support for exporting training metrics to Tensorboard, and continued expanding our SageMaker coverage by adding HPO support.
Major features
- Support Edge Features in message passing for HGT model. (#1188)
- Support Edge Features in message passing for RGAT model.(#1185)
- Add Shrinkage Loss to handle regression label imbalance problem. (#1157)
- Add Bayesian Personalized Ranking Loss for link prediction. (#1149)
- Add support for Tensorboard visualization of training job metrics. You can now set the `task_tracker` parameter to `tensorboard_task_tracker` to produce log output that can be visualized with Tensorboard. (#1155, #1156)
- Add support for SageMaker HPO jobs (#1133)
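A minimal YAML sketch of the Tensorboard tracker setting. Only the `task_tracker` value comes from the release note above; the `gsf`/`basic` nesting is an assumption based on GraphStorm's usual training-config layout and should be checked against the documentation.

```yaml
gsf:
  basic:
    task_tracker: tensorboard_task_tracker
```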
Documentation updates
Minor features
- Add option to use ParMETIS within the local Dockerfile. Now users can provide `--use-parmetis` when building their local image to enable support of ParMETIS partitioning. (#1102)
- Add support for DGL 2.5 (#1009)
Fixes
- [GSProcessing] Docker images no longer require Poetry or Python 3.9 to build. (#1076)
- [GSProcessing] Fix ParquetRowCounter bug when different node/edge types have features with identical names (#1140)
- [GSProcessing] Fix mis-ordered label outputs for edge classification (#1192)
- Fix ID mapping writing when writing to multiple files. Previously when writing multiple ID files, files could contain overlapping IDs. This is now fixed, so each file contains distinct IDs (#1178)
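The distinct-IDs-per-file property restored by the fix above can be sketched generically: give each output file a contiguous, non-overlapping slice of the ID range. This is an illustrative sketch, not GraphStorm's actual implementation.

```python
def split_ids_into_files(num_ids: int, num_files: int) -> list:
    """Assign each output file a contiguous, non-overlapping ID
    range, spreading any remainder over the first files so every
    ID appears in exactly one file."""
    base, remainder = divmod(num_ids, num_files)
    ranges, start = [], 0
    for i in range(num_files):
        size = base + (1 if i < remainder else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```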
Breaking changes
- `RelationalAttLayer` adds two new arguments, `edge_feat_name` and `edge_feat_mp_op`, to support using edge features in message passing. In its forward function, the input argument `inputs` is split into two arguments, `n_h` and `e_h`, for node embeddings and edge embeddings, respectively. (#1185)
- `RelationalGATEncoder` adds two new arguments, `edge_feat_name` and `edge_feat_mp_op`. Its forward function's input argument changes from `h` into `n_h` and `e_hs`. (#1185)
- `HGTEncoder` adds two new arguments, `edge_feat_name` and `edge_feat_mp_op`, to support using edge features in message passing. Its forward function's input argument changes from `h` into `n_h` and `e_hs`, for node embeddings and edge embeddings, respectively. (#1188)
Contributors
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Han Xie from AWS
- Ronald Xu from AWS
- Yuke Wang from Rice University
V0.4 Release Note
The GraphStorm V0.4 release contains several major feature enhancements. In this version, we have introduced experimental support for using edge features in GNN message passing computation. Now users can use edge features by setting two new command line (CLI) arguments, --edge-feat-name and --edge-feat-mp-op. GraphStorm APIs were also updated to support using edge features in message passing. In addition, we introduced support for DGL’s GraphBolt in this version. GraphBolt is a new data loading module for DGL that enables faster and more efficient graph sampling. For link prediction on Paper100M, we achieved a 1.4X speedup in training and a 3.6X speedup in inference with GraphBolt enabled in GraphStorm. We also enhanced distributed graph processing (GSProcess) to support hard negative sampling, multitask mask generation, and saving and loading numeric feature transformations. We added RotatE and TransE score functions for link prediction. In this version, we added a new GraphStorm example that predicts complex and dynamic network traffic and an example that demonstrates how to use the super-node method to perform graph-level prediction tasks. We also added a new example that demonstrates how to use SageMaker Pipelines with GraphStorm and how to run GraphBolt-enabled jobs.
Major features
- Support using edge features in GNN message passing computation. Users only need to set two new CLI arguments to use this new feature with an RGCN encoder. #1057, #1070, #1074, #1084, #1088, #1096, #1098, #1104.
- GraphBolt integration. Users can use GraphBolt by setting one argument, `--use-graphbolt`, in graph processing and model training and inference. #1001, #1011, #1024, #1025, #1029, #1083, #1116.
- GSProcessing enhancements: supporting hard negative sampling, multitask mask generation, and saving numeric feature transformations. #994, #1050, #1073, #1085, #1091, #1076, #1117.
New Examples
- Network time series traffic prediction. This example demonstrates how to make time series predictions on synthetic air transportation traffic using GraphStorm. #1109.
- Graph-level prediction. This example demonstrates how to use the `super-node` method to perform graph-level prediction tasks using GraphStorm CLIs and APIs. #1021, #1026.
- A new notebook example of using customized models with CLIs. #1049, #1087.
- A new notebook example of conducting distributed training pipeline on SageMaker. #1108, #1126.
Minor features
- Link prediction enhancements: add RotatE and TransE score functions, and add the adjusted mean ranking index link prediction metric. #986, #991, #1031, #1042, #1046, #1061, #1075.
Breaking changes
- API changes: `RelGraphConvLayer` adds two new arguments, `edge_feat_name` and `edge_feat_mp_op`, to support using edge features, and in its `forward` function the input argument `inputs` is split into two arguments, `n_h` and `e_h`, for node embeddings and edge embeddings, respectively. `RelationalGCNEncoder` adds the two new arguments `edge_feat_name` and `edge_feat_mp_op` too, and its `forward` function's input argument `h` changes into `n_h` and `e_hs` as well. #1074.
- Decoders, including `EntityClassifier`, `EntityRegression`, `DenseBiDecoder`, `EdgeRegression`, `MLPEdgeDecoder`, and `MLPEFeatEdgeDecoder`, have a new argument, `use_bias`, to allow users to set bias in these decoders. #1111, #1125.
- Modify the GSProcessing configuration parser to be equivalent to GConstruct. #1117.
Contributors
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Han Xie from AWS
- Ronald Xu from AWS
- Yuke Wang from Rice University
V0.3.1 Release Note
The GraphStorm V0.3.1 release contains a few major feature enhancements. In this version, we have reorganized the overall documentation and tutorials to facilitate a more efficient learning curve for users. The new documentation is organized into four sections: i) Getting Started, which offers a concise tutorial on using GraphStorm; ii) Command Line Interface User Guide, which provides an overview of the GraphStorm command line interfaces (CLIs); iii) Programming Interface User Guide, which details the application programming interfaces (APIs) of GraphStorm; and iv) Advanced Topics, which explores complex subjects such as custom model implementation, link prediction training optimization, multi-task learning, etc. In addition, we have enhanced the distributed graph processing functionalities to improve user experience. We provide four notebook examples to demonstrate the use of GraphStorm APIs in developing custom models and training/inference pipelines.
Major features
- Reorganized the documentation and tutorials to group the main contents under two top-level menus, i.e., `COMMAND LINE INTERFACE USER GUIDE` and `PROGRAMMING INTERFACE USER GUIDE`. #956
  - Under the CLI user guide menu, regrouped the contents into two 2nd-level menus, i.e., `GraphStorm Graph Construction` and `GraphStorm Model Training and Inference`.
    - Under `GraphStorm Graph Construction`, added a new document, `Input Raw Data Specification`, to explain the specifications of the input data and provide a simple raw data example. #996
    - Added a new document, `Single Machine Graph Construction`, to introduce the `gconstruct` module and provide a simple construction configuration JSON example. #996
    - In `Distributed Graph Construction`, reorganized the document structure of GSProcessing. #907
  - Renamed `DISTRIBUTED TRAINING` to `GraphStorm Model Training and Inference` and moved it under `COMMAND LINE INTERFACE USER GUIDE`. #956
    - Added a new `Model Training and Inference on a Single Machine` 2nd-level menu to explain the launch commands.
  - Under the `PROGRAMMING INTERFACE USER GUIDE` menu, refined the hard negative tutorial and multi-task learning tutorial. #898 #944
- Added a new GSProcessing launch script for EMR on EC2 that allows users to run a GSProcessing job as an EMR step, simplifying the user experience. #902
New examples
- Add a Jupyter Notebook example for using GraphStorm APIs to implement GraphStorm built-in GNN model #919
- Add a Jupyter Notebook example for using GraphStorm APIs to customize GNN model components #929
Minor features
- Add a hit@k evaluator for both classification and link prediction tasks. #911 #948
- Remove the limit that the save model frequency must be divisible by the evaluation frequency. Allow users to set the save model frequency freely. #893 #948
- Added a new `truncate_dim` argument to the GSProcessing no-op transformation and to `gconstruct.construct_graph` as well. #922
Breaking changes
- Add a new argument `norm` in the `__init__` of GraphStorm classification and regression decoders. This allows users to set layer or batch normalization on the neural network layers of these decoders. Only `MLPFeatEdgeDecoder` implements the normalization in this release. #948
- Rename `pos_graph_feat_fields` to `pos_graph_edge_feat_fields` in the `GSgnnLinkPredictionDataLoaderBase` class to make its meaning clearer. #934
Contributors
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Han Xie from AWS
GraphStorm v0.3 release
GraphStorm V0.3 release contains a few major feature enhancements. In this release, we have introduced support for multi-task learning, allowing users to define multiple training targets on different nodes and edges within a single training loop. The supported training supervisions for multi-task learning include node classification/regression, edge classification/regression, link prediction, and node feature reconstruction. Users can specify the training targets through the YAML configuration file. We have refactored the implementation of DataLoader and Dataset to decouple Dataset from DataLoader, simplifying the customization of both, and simplified the APIs of DataLoader, Dataset, and Evaluator. We now support re-applying saved feature transformation rules to new data in the distributed graph processing pipeline. We added the GATv2 model to the GraphStorm model zoo. We also added demos of running node classification and link prediction with custom GNN models using Jupyter Notebook.
Major features
- Support graph multi-task learning, which enables users to define multiple training targets, including node classification/regression, edge classification/regression, link prediction and node feature reconstruction, in a single training loop. #804 #813 #825 #828 #842 #837 #834 #843 #852 #855 #860 #863 #871 #861
- Refactor the implementations of DataLoader and Dataset to decouple Dataset from DataLoader and simplify the APIs of DataLoader, Dataset and Evaluator. #795 #820 #821 #822
- Support re-applying saved feature transformation rules to new data in the distributed graph processing pipeline #857 #870
New Examples
- Add a Jupyter Notebook example for node classification using a custom GNN model #830
- Add a Jupyter Notebook example for link prediction using a custom GNN model #846
- Add link prediction support in GPEFT example #760
- Add GraphStorm benchmarks using MAG and Amazon Review dataset. #765 #818
Minor features
- Allow re-partitioning to run on the Spark leader, removing the need for a follow-up re-partition job. #767
- Add support for custom graph splits, allowing users to define their own train/validation/test sets. #761
- Allow custom out_dtype for numerical feature transformations in GSProcessing. #739
New Built-in Models
- GATv2 #771
Breaking changes
GraphStorm API changes
Simplify graphstorm.initialize() by giving default values, e.g., ip_config, backend and local_rank. (#781 #783)
- The initialize() method adds default values: ip_config=None, backend='gloo', local_rank=0.
- The gsf.py adds a default device by using the local_rank so other classes can call get_device() directly.
Refactor evaluators with a new base class and several interfaces for different tasks. (#803 #807 #822)
- Deprecate GSgnnInstanceEvaluator and GSgnnAccEvaluator in favor of GSgnnBaseEvaluator and GSgnnClassificationEvaluator. Refactor GSgnnRegressionEvaluator, GSgnnMrrLPEvaluator, and GSgnnPerEtypeLPEvaluator.
Unify different GraphStorm data classes (GSgnnNodeTrainData, GSgnnNodeInferData, GSgnnEdgeTrainData, GSgnnEdgeInferData) with one GSgnnData and one set of constructor arguments. Deprecate GSgnnNodeTrainData, GSgnnNodeInferData, GSgnnEdgeTrainData, and GSgnnEdgeInferData. GSgnnData now only provides interfaces for accessing graph data, e.g., node features, edge features, labels, train masks, etc. (#795 #820 #821)
- Update the init arguments of GSgnnData from (graph_name, part_config, node_feat_field, edge_feat_field, decoder_edge_feat, lm_feat_ntypes, lm_feat_etypes) to (part_config, node_feat_field, edge_feat_field, lm_feat_ntypes, lm_feat_etypes).
- Add property functions:
- graph_name(), return a string of the graph name value in config json.
- Add new functions:
- get_node_feats(self, input_nodes, nfeat_fields, device='cpu'), given the node ids (input_nodes) and node features to retrieve from the graph data (nfeat_fields), return the corresponding node features.
- get_edge_feats(self, input_edges, efeat_fields, device='cpu'), given the edge ids (input_edges) and edge features to retrieve from the graph data (efeat_fields), return the corresponding edge features.
- get_node_train_set(self, ntypes, mask), return the node training set.
- get_node_val_set(self, ntypes, mask), return the node validation set.
- get_node_test_set(self, ntypes, mask), return the node test set.
- get_node_infer_set(self, ntypes, mask), return the node inference set.
- get_edge_train_set(self, etypes, mask, reverse_edge_types_map), return the edge training set.
- get_edge_val_set(self, etypes, mask, reverse_edge_types_map), return the edge validation set.
- get_edge_test_set(self, etypes, mask, reverse_edge_types_map), return the edge test set.
- get_edge_infer_set(self, etypes, mask, reverse_edge_types_map), return the edge inference set.
- Update some functions:
- get_node_feats(self, input_nodes, nfeat_fields, device='cpu'), requires the caller to provide the node feature fields to access
- get_edge_feats(self, input_edges, efeat_fields, device='cpu'), requires the caller to provide the edge feature fields to access
- get_labels function is replaced with get_node_feats and get_edge_feats with label field names.
Refactor all dataloader classes, adding new constructor arguments. (#795 #820 #821)
- GSgnnNodeDataLoaderBase and its subclasses require three new init arguments:
- label_field: Label field of the node task.
- node_feats: Node feature fields used by the node task.
- edge_feats: Edge feature fields used by the node task.
- GSgnnEdgeDataLoader and its subclasses require four new init arguments:
- label_field: Label field of the edge task.
- node_feats: Node feature fields used by the edge task.
- edge_feats: Edge feature fields used by the edge task.
- decoder_edge_feats: Edge feature fields used in the edge task decoder.
- GSgnnLinkPredictionDataLoader and its subclasses require three new init arguments:
- node_feats: Node feature fields used by the link prediction task.
- edge_feats: Edge feature fields used by the link prediction task.
- pos_graph_edge_feats: The field of the edge features used by positive graph in link prediction.
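The refactored GSgnnData call pattern above can be sketched with a stub class. The method names and argument lists mirror the release note; the stub itself (and its canned return values) is purely illustrative and is not GraphStorm code.

```python
class StubGSgnnData:
    """Stand-in mimicking the unified GSgnnData interface: built from
    the partition config, with feature fields supplied by the caller
    on each access rather than fixed at construction time."""

    def __init__(self, part_config, node_feat_field=None,
                 edge_feat_field=None, lm_feat_ntypes=None,
                 lm_feat_etypes=None):
        self.part_config = part_config
        # Fake stored features keyed by (node type, feature field).
        self._node_feats = {("paper", "feat"): [0.1, 0.2]}

    def get_node_feats(self, input_nodes, nfeat_fields, device="cpu"):
        # The caller now provides the node feature fields to access.
        return {ntype: self._node_feats[(ntype, field)]
                for ntype, field in nfeat_fields.items()}

    def get_node_train_set(self, ntypes, mask="train_mask"):
        # Return a fake training set per requested node type.
        return {ntype: [0, 1, 2] for ntype in ntypes}

data = StubGSgnnData("part_config.json")
feats = data.get_node_feats({"paper": [0]}, {"paper": "feat"})
train_set = data.get_node_train_set(["paper"])
```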
GraphStorm GSProcessing updates
GSProcessing now supports re-applying saved feature transformation rules on new data. GSProcessing will now create a new file precomputed_transformations.json in the output location. Users can copy that file to the top-level path of new input data (at the same level as the input configuration JSON) and GSProcessing will use the existing transformations for the same features. This way, a model that has been trained on previous data can continue working even if new values appear in the new data. In this release, we only support re-applying categorical transformations.
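The idea of freezing and re-applying a categorical transformation can be sketched as below. The real `precomputed_transformations.json` schema is GSProcessing-internal; this sketch only illustrates the concept of persisting a category-to-index mapping so that indices stay stable when new data (possibly with unseen values) arrives.

```python
import json

def save_categorical_transformation(values, path):
    """Persist a category -> index mapping to JSON, mimicking the
    idea of saving a fitted categorical transformation."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    with open(path, "w") as f:
        json.dump(mapping, f)

def apply_categorical_transformation(values, path, unknown=-1):
    """Re-apply a saved mapping to new data; categories unseen at
    training time map to a sentinel value."""
    with open(path) as f:
        mapping = json.load(f)
    return [mapping.get(v, unknown) for v in values]
```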
Contributors
- Da Zheng from AWS
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Qi Zhu from AWS
Special thanks to the DGL project and WholeGraph project for supporting GraphStorm 0.3 release.
V 0.2.2 Release Note
GraphStorm V0.2.2 release contains a few major feature enhancements. In this release, we have enhanced the NVIDIA WholeGraph support to speed up learnable embedding training and cached BERT embedding access. We have added a customized negative sampling method for link prediction tasks, which enables users to define negative edges for each individual edge. We have provided two new feature transformations in our distributed graph processing pipeline: textual feature tokenization with HuggingFace models and textual feature encoding with HuggingFace models. We further simplified the command line interface for model prototyping by removing the requirement of setting up ssh for running GraphStorm jobs on a single machine. We also added an example of GPEFT training to enhance LLMs with graph data using the custom model interface.
Major features
- Support using WholeGraph distributed embedding to speed up learnable embedding training. #677 #697 #734
- Support using WholeGraph distributed embedding to speed up cached BERT embedding reads. #737
- Support hard negatives for link prediction tasks. #678 #684 #703
- Distributed graph processing pipeline supports using HuggingFace models to encode textual node features #724
- Distributed graph processing pipeline supports using HuggingFace models to tokenize textual node features #700
- Support running GraphStorm jobs on a single machine without using ssh. #712
New Examples
- Add the GPEFT method to enhance LLM with graph data as a GraphStorm example using the custom model interface. It trains a GNN model to encode the neighborhood of a target node as a prompt and perform parameter efficient fine-tuning (PEFT) to enhance LLM for computing the node representation of the target node. See GPEFT example for how to run. #673 #701
Minor features
- Add support for balancing training/validation/test sets in graph partitioning for node classification tasks. #714 #741
- Allow users to start training/inference jobs without specifying target_ntype/target_etype on homogeneous graphs. #686 #683
- Unify the ID mapping output of GConstruct and GProcessing. #461
Breaking changes
- Previously, GConstruct created the ID mappings as a single Parquet file, with its filename prefixed by the node type. After the 0.2.2 release, GConstruct creates partitioned Parquet files for each node type under its own directory. This change unifies the output of GConstruct and GProcessing. See more details in #461.
- We unified the behavior of handling errors in evaluation functions. Previously, evaluation functions such as roc_auc or F1 score would not raise an exception when an error happened. After the 0.2.2 release, evaluation functions stop execution and raise an exception with a corresponding error message when an error happens. See more details in #711.
Contributors
- Da Zheng from AWS
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Qi Zhu from AWS
- Zichen Wang from AWS
- Chang Liu from NVidia
GraphStorm v0.2.1 release
GraphStorm V0.2.1 release contains a few major feature enhancements. In this release, we have improved the GraphStorm model inference user experience by automatically mapping inference results (prediction results and generated node embeddings) into the raw node ID space, i.e., the same ID space as the raw input data. The resulting output is stored in Parquet format. We have added a new inference command (graphstorm.run.gs_gen_node_embedding) for computing node embeddings on any given graph with a trained GraphStorm model. We have improved our distributed graph processing pipeline to provide multiple feature transformations, including categorical feature transformation, numerical bucketing, etc. We added the GAT model to the GraphStorm model zoo. We also added a demo of running GraphStorm using Jupyter Notebook.
Major features
- Automatically map inference results (prediction results and generated node embeddings) into Raw Node ID space (#481, #524, #527, #543, #533, #578, #597, #621, #633, #641)
- Provide a command line to generate GNN embeddings (#478)
- Provide multiple feature transformations, including categorical feature transformation (#623), Rank-Gauss (#615), numerical bucketing (#583), and Min/Max normalization (#575)
Minor features
- Support caching BERT embeddings on disks for GNN model fine-tuning. #516
- Allows customization of GLEM trainable parameters grouping. #506
- Support using NVidia WholeGraph to store edge features #555
- Add contrastive loss for link prediction tasks #619
- Support in-batch negatives for link prediction tasks #596
- Support NCCL backend for sparse embedding #549
New Built-in Models
- GAT
Breaking changes
We changed the file format and the content of saved node embeddings and saved prediction results of GraphStorm training and inference pipelines. By default, if the task is launched through a command under `graphstorm.run.*`, GraphStorm automatically saves generated node embeddings and prediction results in Parquet files. For node embeddings, the files contain two columns: `nid`, storing the node IDs in the raw node ID space, and `emb`, storing the node embeddings. For node prediction results, the files contain two columns: `nid`, storing the node IDs in the raw node ID space, and `pred`, storing the prediction results. For edge prediction results, the files contain three columns: `src_nid` and `dst_nid`, storing the node IDs of source and destination nodes in the raw node ID space respectively, and `pred`, storing the prediction results.
Contributors
- Da Zheng from AWS
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Israt Nisa from AWS
- Qi Zhu from AWS
- Zichen Wang from AWS
- Nicolas Castet from NVidia
- Chang Liu from NVidia
GraphStorm v0.2 release
GraphStorm V0.2 release contains a few major feature enhancements. In this release, we have added distributed graph processing support for large-scale graphs. Users may now use Spark clusters, such as SageMaker with PySpark, to execute distributed graph processing. We have added multi-task learning support for node classification tasks. GraphStorm now supports even more HuggingFace language models (LMs), like BERT, RoBERTa, ALBERT, etc. (see https://github.com/awslabs/graphstorm/blob/v0.2/python/graphstorm/model/lm_model/utils.py#L22 for more details). We have enhanced GraphStorm model training speed by supporting the NCCL backend, with further performance enhancements to speed up node feature fetching during distributed GNN training through a collaboration with NVIDIA on WholeGraph support. We have expanded graph model support by distilling a GNN model into a HuggingFace DistilBertModel, and added two new models, HGT and GraphSage, to the GraphStorm model zoo. New GraphStorm docs and tutorials are available on https://graphstorm.readthedocs.io for all user groups.
Major features
- Support multi-task learning for node classification tasks (#410)
- Enable NCCL backend (#383, #337)
- Publish GraphStorm doc on https://graphstorm.readthedocs.io.
- Support using multiple language models available in HuggingFace, including BERT, RoBERTa, ALBERT, etc., in graph-aware LM fine-tuning, GNN-LM co-training and GLEM. (#385)
- [Experimental] Distributed graph processing support (#435, #427, #419, #408, #407, #400)
- [Experimental] Support using NVidia WholeGraph to speed up node feature fetching during distributed GNN training. (#428, #405)
- [Pre-View] Support for distilling a GNN model into a Huggingface DistilBertModel. (#443, #463)
New Built-in Models
- Heterogeneous Graph Transformer (HGT) (#396)
- GraphSage (#352)
- [Experimental] GLEM semi-supervised training for node tasks. (#327, #432)
Minor features
- Support per edge type link prediction metric report (#393)
- Support per class roc-auc report for multi-label multi-class classification tasks (#397)
- Support batch norm and layer norm (#384)
- Enable standalone mode that allows users to run the training/inference scripts without using the launch script (#331)
API breaking changes
- We changed the filename format of saved embeddings (either learnable embeddings or node embeddings) and model prediction results from `.pt` to `<padding_zeros>.pt`. For example, with 4 trainers, the saved node embeddings are named `emb.part00000.pt`, `emb.part00001.pt`, `emb.part00002.pt`, `emb.part00003.pt`.
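The padded naming scheme above can be sketched as follows; this is an illustrative helper, not GraphStorm code, and the five-digit padding width is inferred from the examples in the note.

```python
def checkpoint_filenames(prefix: str, num_trainers: int,
                         width: int = 5) -> list:
    """Generate zero-padded per-trainer filenames, e.g.
    emb.part00000.pt ... emb.part00003.pt for 4 trainers."""
    return [f"{prefix}.part{i:0{width}d}.pt" for i in range(num_trainers)]
```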
Contributors
- Da Zheng from AWS
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Israt Nisa from AWS
- Qi Zhu from AWS
- Houyu Zhang from Amazon Search
- Zichen Wang from AWS
- Weilin Cong from Penn State University
- Nicolas Castet from NVidia
- Chang Liu from NVidia