___ (_)__ / /__
(_-</ / _ \/ '_/
/___/_/_//_/_/\_\
Data hygiene for music PR. Scrub, rinse, and soak your contact lists.
The product is sink; the binary is `sink`. It's published on npm as `datasink` because the `sink` name was already taken there. One tool, one name, plus an npm address.
Demo uses fictional contacts for illustration.
sink-web-indol.vercel.app — drop a CSV
and watch the real engine run client-side. Your contacts never leave your
browser; only domain names are checked against DNS. Source in web/.
npx datasink scrub contacts.csv # validate emails
npx datasink rinse contacts.csv # deduplicate
npx datasink wash contacts.csv # full pipeline

Or install globally:

npm install -g datasink
sink scrub contacts.csv

| Command | Description |
|---|---|
| `sink` | Interactive menu (no args) |
| `sink wash <file>` | Full pipeline: scrub + rinse + soak + steep |
| `sink scrub <file>` | Validate & clean emails |
| `sink rinse <file>` | Deduplicate contacts |
| `sink soak <file>` | Enrich contacts with AI |
| `sink steep <file>` | Discover channels via outlet site scraping |
| `sink spot <email>` | Spot-check a single email (format, typo, MX) |
| `sink inspect <file>` | Data quality score |
| `sink drain <file>` | Convert between formats |
| `sink tui <file>` | Full TUI dashboard |

- Built for music PR. Knows BBC Radio 1 from Radio X, catches `bbc.com` → `bbc.co.uk` typos, flags role-based emails like `press@`. Not a generic email validator: it understands your industry.
- Zero config. Point it at a CSV and go. Flexible header matching means it works with whatever your spreadsheet exports. No mapping files, no setup wizard.
- Four phases, one metaphor. Scrub cleans. Rinse deduplicates. Soak enriches. Steep extracts channels from outlet sites. Run them individually or all at once with `wash`. Like doing the washing up, but for data.

Validates and cleans email addresses:
- RFC 5322 format validation
- UK domain typo correction (`bbc.com` → `bbc.co.uk`, `gmial.com` → `gmail.com`)
- Disposable domain detection
- MX record verification
- Role-based email flagging (`press@`, `info@`)
- Catch-all domain detection
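
As a rough illustration of the typo-correction and role-flagging checks listed above, here is a minimal sketch. Everything in it (`TYPO_MAP`, `ROLE_PREFIXES`, `scrubEmail`, the result shape) is hypothetical and not datasink's actual code, and the regular expression is a deliberately simplified stand-in for RFC 5322 validation.

```javascript
// Hypothetical sketch: these names and this result shape are assumptions,
// not datasink's real implementation.
const TYPO_MAP = new Map([
  ['bbc.com', 'bbc.co.uk'],
  ['gmial.com', 'gmail.com'],
]);
const ROLE_PREFIXES = new Set(['press', 'info']);

// Simplified pattern: exactly one "@", no whitespace, a dot in the domain.
// Real RFC 5322 validation accepts far more forms than this.
const SIMPLE_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function scrubEmail(input) {
  const email = input.trim().toLowerCase();
  if (!SIMPLE_EMAIL.test(email)) {
    return { email, valid: false, corrected: false, roleBased: false };
  }
  const at = email.lastIndexOf('@');
  const local = email.slice(0, at);
  const domain = email.slice(at + 1);
  // Swap a known-bad domain for its correction, otherwise keep it as-is.
  const fixedDomain = TYPO_MAP.get(domain) ?? domain;
  return {
    email: `${local}@${fixedDomain}`,
    valid: true,
    corrected: fixedDomain !== domain,
    roleBased: ROLE_PREFIXES.has(local),
  };
}
```

The MX step is left out because it needs network access; in Node it could be built on `resolveMx` from `node:dns/promises`.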
Deduplicates and resolves identities:
- Exact email -- case-insensitive dedup, keeps the richer record
- Fuzzy name -- Jaro-Winkler similarity within same domain (threshold: 0.92)
- Cross-field -- matches by phone across different emails
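
The fuzzy-name strategy above can be sketched with a textbook Jaro-Winkler implementation. This is an illustration under assumptions, not sink's own code: `jaro`, `jaroWinkler`, `isFuzzyDuplicate`, and the `{ name, email }` record shape are hypothetical; only the 0.92 threshold and the same-domain rule come from the list above.

```javascript
// Textbook Jaro similarity; returns a score between 0 and 1.
function jaro(a, b) {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  // Characters only count as matching within this distance of each other.
  const matchRange = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array(a.length).fill(false);
  const bMatched = new Array(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - matchRange);
    const hi = Math.min(b.length - 1, i + matchRange);
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that appear in a different order.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) outOfOrder++;
    k++;
  }
  const t = outOfOrder / 2;
  return (matches / a.length + matches / b.length + (matches - t) / matches) / 3;
}

// Winkler adjustment: boost the score for a shared prefix of up to 4 characters.
function jaroWinkler(a, b, prefixScale = 0.1) {
  const j = jaro(a, b);
  const maxPrefix = Math.min(4, a.length, b.length);
  let prefix = 0;
  while (prefix < maxPrefix && a[prefix] === b[prefix]) prefix++;
  return j + prefix * prefixScale * (1 - j);
}

function domainOf(email) {
  return email.slice(email.lastIndexOf('@') + 1).toLowerCase();
}

// Hypothetical helper: only compare names when both emails share a domain.
function isFuzzyDuplicate(a, b, threshold = 0.92) {
  if (domainOf(a.email) !== domainOf(b.email)) return false;
  return jaroWinkler(a.name.toLowerCase(), b.name.toLowerCase()) >= threshold;
}
```

Restricting the comparison to one domain keeps two different people with similar names at different outlets from being merged.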
Enriches contacts with AI:
- Platform type detection (radio, press, playlist, blog, podcast)
- Genre identification
- Geographic scope
- Submission guidelines
- Pitch tips
Supports Anthropic (Claude Haiku) and OpenAI (GPT-4o-mini).
Discovers contact channels by scraping the outlet's public website:
- Outlet socials (Instagram, Twitter, LinkedIn, Facebook)
- Submission portal URL + submission email
- Submission format (mp3, link, form, mixed)
- Recent presenters / journalists named on the site
- Per-contact attribution: when a contact's name appears on the team / presenter page, their personal handles are extracted
- Pitch hooks: specific, observable hooks pulled from the scraped text ("submissions form requires ISRC", "no submissions email -- portal only")
One scrape powers every contact at that outlet. The CLI caches scrapes in
memory for the duration of a run; a persistent 30-day cache is available to
programmatic consumers that supply their own CacheAdapter (see below).
Requires FIRECRAWL_API_KEY and an LLM provider key. The phase is silently skipped if either credential is missing.
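
The "one scrape powers every contact at that outlet" behaviour amounts to memoising by outlet for the length of a run. A minimal sketch, assuming a hypothetical `scrapeOutlet` function standing in for the real scrape step; `createRunCache` is also an invented name, and this is not the `CacheAdapter` interface, whose shape is not shown in this section.

```javascript
// Hypothetical sketch of per-run caching: scrapeOutlet is a stand-in for the
// real scrape + extraction step and may return a value or a promise.
function createRunCache(scrapeOutlet) {
  const cache = new Map(); // lives only as long as this run
  return function scrapeOnce(domain) {
    const key = domain.toLowerCase();
    if (!cache.has(key)) {
      // Store the result (or in-flight promise) so later contacts reuse it.
      cache.set(key, scrapeOutlet(key));
    }
    return cache.get(key);
  };
}
```

Storing the promise rather than the resolved value means contacts processed concurrently share a single in-flight request.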
-o, --output <path>          Output file path
--format <csv|json|jsonl>    Output format (default: csv)
--config <path>              Config file path
--dry-run                    Preview without writing files
--verbose                    Detailed output
-q, --quiet                  Suppress all output except errors
--json                       JSON stdout (for piping)
--no-colour                  Disable colours
--provider <name>            Enrichment provider (anthropic|openai)

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | File error (not found, permission denied, is a directory) |
| 2 | Parse error (invalid CSV, no usable data) |
| 3 | Config error (invalid config file) |
| 4 | Pipeline error (enrichment failure, unexpected crash) |
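
When calling sink from a script, the exit codes above can be turned into readable messages. A small sketch; `EXIT_MEANINGS` and `describeExit` are hypothetical helpers written for this example, not part of datasink.

```javascript
// Hypothetical helper mirroring the exit-code table above.
const EXIT_MEANINGS = {
  0: 'Success',
  1: 'File error',
  2: 'Parse error',
  3: 'Config error',
  4: 'Pipeline error',
};

function describeExit(code) {
  // Fall back to a generic message for codes the table does not list.
  return EXIT_MEANINGS[code] ?? `Unknown exit code ${code}`;
}
```

In Node, the code to pass in could come from the `status` field of the object returned by `spawnSync` in `node:child_process`.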
export ANTHROPIC_API_KEY=sk-ant-...
sink soak contacts.csv --provider anthropic

export OPENAI_API_KEY=sk-...
sink soak contacts.csv --provider openai

Accepts CSV files with flexible column names:

| Field | Accepted Headers |
|---|---|
| Name | name, contact, full name, person |
| Email | email, e mail, email address |
| Outlet | outlet, publication, media, company, station |
| Role | role, title, position, job title |
| Phone | phone, telephone, mobile |
| Website | website, url, web |
| Notes | notes, comments, description |
| Tags | tags, categories, labels |
First/last name columns are automatically joined. Unmapped columns are preserved in extras.
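
Flexible header matching of this kind can be sketched as an alias lookup. This is an illustration only: `HEADER_ALIASES`, `normaliseHeader`, and `mapRow` are hypothetical names, and treating `_` and `-` as spaces is an assumption, not documented sink behaviour.

```javascript
// Hypothetical sketch: alias lists are copied from the table above.
const HEADER_ALIASES = {
  name: ['name', 'contact', 'full name', 'person'],
  email: ['email', 'e mail', 'email address'],
  outlet: ['outlet', 'publication', 'media', 'company', 'station'],
  role: ['role', 'title', 'position', 'job title'],
  phone: ['phone', 'telephone', 'mobile'],
  website: ['website', 'url', 'web'],
  notes: ['notes', 'comments', 'description'],
  tags: ['tags', 'categories', 'labels'],
};

// Build a reverse lookup: alias -> canonical field name.
const HEADER_LOOKUP = new Map();
for (const [field, aliases] of Object.entries(HEADER_ALIASES)) {
  for (const alias of aliases) HEADER_LOOKUP.set(alias, field);
}

// Assumption: case, underscores, hyphens, and extra spaces are ignored.
function normaliseHeader(header) {
  return header.trim().toLowerCase().replace(/[\s_-]+/g, ' ');
}

function mapRow(row) {
  const record = { extras: {} };
  let first = '';
  let last = '';
  for (const [header, value] of Object.entries(row)) {
    const key = normaliseHeader(header);
    const field = HEADER_LOOKUP.get(key);
    if (field) record[field] = value;
    else if (key === 'first name') first = value;
    else if (key === 'last name') last = value;
    else record.extras[header] = value; // keep unmapped columns
  }
  // Join first/last name only when no full-name column was present.
  if (!record.name && (first || last)) record.name = `${first} ${last}`.trim();
  return record;
}
```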
Create a sink.config.mjs (or sink.config.json) in your project root. Sink
auto-discovers sink.config.mjs, .js, .ts, then .json.
// sink.config.mjs
export default {
  scrub: {
    mxCacheTTL: 1800, // seconds
    typoMap: './data/custom-typos.json',
  },
  rinse: {
    fuzzyThreshold: 0.92,
    strategies: ['exact-email', 'fuzzy-name', 'cross-field'],
  },
  soak: {
    provider: 'anthropic',
    anthropic: {
      model: 'claude-haiku-4-5-20251001',
      apiKey: process.env.ANTHROPIC_API_KEY,
    },
  },
  output: {
    format: 'csv',
    locale: 'en-GB',
  },
}

A TypeScript `sink.config.ts` also works, but only on Node >= 23.6 (which can strip types natively). On Node 20/22 use `.mjs` or `.json`. If a config file is present but cannot be loaded, sink warns and falls back to defaults; an explicit `--config <path>` that is missing or invalid exits with code 3.

import { runPipeline, loadConfig } from 'datasink'

const config = await loadConfig()
const records = [
  {
    id: '1',
    raw: { name: 'Sarah Jones', email: 'sarah@bbc.co.uk', outlet: 'BBC Radio 1' },
    phases: [],
    timestamp: new Date().toISOString(),
  },
]

const { records: processed, stats } = await runPipeline(records, {
  phases: ['scrub', 'rinse'],
  config,
})
console.log(stats)

See CONTRIBUTING.md for dev setup, code style, and PR guidelines.
See CHANGELOG.md for release history.
Tools I build for music PR, by Chris Schofield. Part of Total Audio Promo.
| Project | Description |
|---|---|
| TAP | Campaign management for music PR agencies |
| totalaud.io | Release planning for emerging artists |
| SpotCheck | Spotify playlist validation |
| Newsjack | Music industry newsjacking |
| Podflow | Podcast intelligence for music PR |
| Sink | Contact data hygiene CLI |
Questions? Reach me on X/@chrisschouk or info@totalaudiopromo.com.
MIT

