This repository contains the FBPosts dataset that is used in the experimental setup of the paper "Automating Data Quality Validation for Dynamic Data Ingestion".
The tsv files are tab separated. FBPosts_dirty.tsv contains crawled Facebook posts. FBPosts_clean.tsv is a variant that was semi-automatically cleaned with the OpenRefine tool. The records that could not be cleaned were removed. The FBPosts_dirty_shortened.tsv contains the original records that could be cleaned in the FBPosts_clean.tsv. The partitions/{clean/dirty}/FBPosts_{clean/dirty}_{idx}.tsv files contain the corresponding data partitions of week idx, from 1 to 53 respectively.
For a short demo, run pip install -r requirements.txt and python demo.py.