`dq_tool.py` is a command-line tool for basic data quality assessment of CSV and Excel files. It checks completeness and type accuracy, and recognizes common data patterns (such as email addresses, phone numbers, and postal codes) in your datasets. The tool writes a report in JSON format and can pretty-print results in the terminal.
- Clone or download this repository.
- Create and activate a virtual environment (recommended):

    python -m venv .venv
    # On Windows:
    .venv\Scripts\activate
    # On macOS/Linux:
    source .venv/bin/activate

- Install dependencies:
pip install pandas pyarrow tabulate
`tabulate` is optional, but recommended for pretty terminal output.
Run the tool from the command line, specifying your input file:

    python dq_tool.py --input sample.csv

- Use `--input` to specify the path to your CSV or Excel file.
- Optionally, use `--report` to specify the output report file (default: `dq_report.json`).
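
For readers curious how these two flags could be wired up, here is a minimal sketch using the standard library's `argparse`. It is an assumption about the shape of the interface, not the parser actually used in `dq_tool.py`; the `build_parser` name and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags;
    # the real definition inside dq_tool.py may differ.
    parser = argparse.ArgumentParser(
        prog="dq_tool.py",
        description="Basic data quality assessment for CSV and Excel files",
    )
    parser.add_argument("--input", required=True,
                        help="Path to the CSV or Excel file")
    parser.add_argument("--report", default="dq_report.json",
                        help="Path of the JSON report to write")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--input", "sample.csv"])
print(f"input={args.input} report={args.report}")
```

Making `--input` required and giving `--report` a default matches the behavior described above: the report path only needs to be supplied when the default is not wanted.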
Example:

    python dq_tool.py --input data.xlsx --report results/my_report.json

To add or modify pattern recognition rules (e.g., to detect new data types):
- Open `dq_tool.py` and locate the `check_pattern_recognition` function.
- Update the `pattern_rules` dictionary to add new rules or adjust existing ones. Write each pattern as a raw string with single backslashes, since a doubled backslash inside a raw string matches a literal backslash. For example:

      pattern_rules = {
          'email': r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$',
          'phone': r'^\+?\d{10,15}$',
          'postal_code': r'^\d{5}(-\d{4})?$',
          'custom_rule': r'^your-regex-here$'  # Add your own
      }

- Save the file and re-run the tool. The new pattern will be automatically included in the analysis.
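
Before adding a rule, it can help to try the regex against a few sample values. The sketch below is a standalone check using only the standard library's `re` module. The `match_rate` helper, the `sku` rule, and the sample values are hypothetical illustrations; how `check_pattern_recognition` actually applies the patterns is not shown here and may differ.

```python
import re

pattern_rules = {
    "email": r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$",
    "phone": r"^\+?\d{10,15}$",
    "postal_code": r"^\d{5}(-\d{4})?$",
    "sku": r"^[A-Z]{3}-\d{4}$",  # hypothetical custom rule for illustration
}


def match_rate(values, pattern):
    """Return the fraction of non-empty values that match the pattern."""
    compiled = re.compile(pattern)
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if compiled.match(v))
    return hits / len(non_empty)


# Made-up sample values: some should match, some should not.
samples = {
    "email": ["a@example.com", "not-an-email"],
    "phone": ["+14155550100", "555-0100"],
    "postal_code": ["12345", "12345-6789", "1234"],
    "sku": ["ABC-1234", "abc-1234"],
}

for name, values in samples.items():
    print(name, match_rate(values, pattern_rules[name]))
```

If a new rule matches fewer of your known-good samples than expected, check the anchors (`^` and `$`) and the escaping first.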
For questions or contributions, please open an issue or pull request.