Two datasets (SPNLG and Wiki) can be downloaded from https://drive.google.com/drive/folders/1FsNlFh2aUbuBl45zEjgvAXDkp_e4hQmV?usp=sharing
| Dataset | Train (Paired) | Train (Raw) | Valid (Paired) | Valid (Raw) | Test (Paired) |
|---|---|---|---|---|---|
| SPNLG | 14k | 150k | 21k | / | 21k |
| Wiki | 84k | 842k | 73k | 43k | 73k |
- SPNLG
  - The dataset comes from the sentence-planning-NLG dataset, which describes restaurant information and consists of 3 CSV files.
  - We aggregate all 3 CSV files and split them with train:valid:test = 8:1:1, using paired:raw = 1:10 for the train set.
- Wiki
  - The dataset is constructed from both the Wiki-Bio Dataset and the Wikipedia Person and Animal Dataset.
  - We use the same valid and test sets as Wiki-Bio.
  - For the training set, we randomly sample 84k examples from Wiki-Bio-train as paired data. We use the remaining sentences in Wiki-Bio-train and the person descriptions from the Wikipedia Person and Animal Dataset as raw data (842k in total).
- Sentence planning NLG dataset: https://nlds.soe.ucsc.edu/sentence-planning-NLG
- Wikipedia biography dataset (Wiki-Bio): https://github.com/DavidGrangier/wikipedia-biography-dataset
- Wikipedia Person and Animal Dataset: https://drive.google.com/file/d/1TzcNdjZ0EsLh_rC1pBC7dU70jINcsVJd/view
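
The SPNLG split ratios described above (train:valid:test = 8:1:1, then paired:raw = 1:10 within the train set) could be reproduced roughly as follows. This is only a minimal sketch and not the script used to build the released files: the function name `split_dataset`, the `(table, text)` sample layout, the fixed seed, and the choice to keep only the text side for raw data are all assumptions made for illustration.

```python
import random


def split_dataset(samples, seed=0):
    """Split (table, text) samples 8:1:1, then split train into paired:raw = 1:10.

    Hypothetical helper: the sample layout and the seed are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)

    # 8:1:1 split into train / valid / test.
    n = len(shuffled)
    n_train = n * 8 // 10
    n_valid = n // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]

    # Within train, 1 part stays paired and 10 parts become raw.
    n_paired = len(train) // 11
    paired = train[:n_paired]
    # Assumption: raw data keeps only the text, dropping the table.
    raw = [text for _table, text in train[n_paired:]]

    return {
        "train_paired": paired,
        "train_raw": raw,
        "valid": valid,
        "test": test,
    }
```

Calling `split_dataset(samples)` on a list of `(table, text)` tuples returns a dictionary with the four subsets; changing `seed` gives a different random partition.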