totalaudiopromo/sink-cli

     ___ (_)__  / /__
    (_-</ / _ \/  '_/
   /___/_/_//_/_/\_\

Data hygiene for music PR. Scrub, rinse, and soak your contact lists.

The product and the binary are both called sink. The npm package is named datasink because the name sink was already taken on npm.

Demo: sink-cli quick demo (uses fictional contacts for illustration).

Demo: full workflow (scrub, rinse, inspect).


Try it in the browser

sink-web-indol.vercel.app — drop a CSV and watch the real engine run client-side. Your contacts never leave your browser; only domain names are checked against DNS. Source in web/.

Quick Start

npx datasink scrub contacts.csv          # validate emails
npx datasink rinse contacts.csv          # deduplicate
npx datasink wash contacts.csv           # full pipeline

Or install globally:

npm install -g datasink
sink scrub contacts.csv

Commands

Command               Description
sink                  Interactive menu (no args)
sink wash <file>      Full pipeline: scrub + rinse + soak + steep
sink scrub <file>     Validate & clean emails
sink rinse <file>     Deduplicate contacts
sink soak <file>      Enrich contacts with AI
sink steep <file>     Discover channels via outlet site scraping
sink spot <email>     Spot-check a single email (format, typo, MX)
sink inspect <file>   Data quality score
sink drain <file>     Convert between formats
sink tui <file>       Full TUI dashboard

Why sink?

  • Built for music PR. Knows BBC Radio 1 from Radio X, catches bbc.com → bbc.co.uk typos, flags role-based emails like press@. Not a generic email validator -- it understands your industry.
  • Zero config. Point it at a CSV and go. Flexible header matching means it works with whatever your spreadsheet exports. No mapping files, no setup wizard.
  • Four phases, one metaphor. Scrub cleans. Rinse deduplicates. Soak enriches. Steep extracts channels from outlet sites. Run them individually or all at once with wash. Like doing the washing up, but for data.

Phases

Scrub

Validates and cleans email addresses:

  • RFC 5322 format validation
  • UK domain typo correction (bbc.com → bbc.co.uk, gmial.com → gmail.com)
  • Disposable domain detection
  • MX record verification
  • Role-based email flagging (press@, info@)
  • Catch-all domain detection
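
Two of the checks above, typo correction and role-based flagging, can be sketched without any network access. This is a minimal illustration and not sink's actual implementation: the function name scrubEmail, the result shape, and the contents of both lookup tables are assumptions, apart from the two typo pairs and the two role prefixes named in the list. The wellFormed flag is only a crude structural check, not RFC 5322 validation.

```javascript
// Hypothetical typo map; only these two pairs are named in the README.
const TYPO_MAP = new Map([
  ['bbc.com', 'bbc.co.uk'],
  ['gmial.com', 'gmail.com'],
])

// Hypothetical set of role-based local parts (press@, info@).
const ROLE_LOCAL_PARTS = new Set(['press', 'info'])

function scrubEmail(input) {
  const email = input.trim().toLowerCase()
  const at = email.lastIndexOf('@')

  // Crude structural check: something before and after the last '@'.
  if (at <= 0 || at === email.length - 1) {
    return { email, wellFormed: false, corrected: false, roleBased: false }
  }

  const local = email.slice(0, at)
  const domain = email.slice(at + 1)
  const fixedDomain = TYPO_MAP.get(domain) ?? domain

  return {
    email: `${local}@${fixedDomain}`,
    wellFormed: true,
    corrected: fixedDomain !== domain,
    roleBased: ROLE_LOCAL_PARTS.has(local),
  }
}
```

Keeping the typo map as data rather than code is what would let a config option such as typoMap point at a custom file.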

Rinse

Deduplicates and resolves identities:

  • Exact email -- case-insensitive dedup, keeps the richer record
  • Fuzzy name -- Jaro-Winkler similarity within same domain (threshold: 0.92)
  • Cross-field -- matches by phone across different emails
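
The fuzzy-name strategy can be illustrated with a textbook Jaro-Winkler implementation. This is a sketch under stated assumptions, not sink's own code: the function names, the lower-casing of names, and the use of >= against the 0.92 threshold are all guesses, and the prefix scale of 0.1 with a four-character prefix cap is the standard Winkler setting rather than anything the README specifies.

```javascript
// Jaro similarity: matching characters within a sliding window, minus transpositions.
function jaro(a, b) {
  if (a === b) return 1
  if (a.length === 0 || b.length === 0) return 0

  const window = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1)
  const aMatched = new Array(a.length).fill(false)
  const bMatched = new Array(b.length).fill(false)
  let matches = 0

  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - window)
    const hi = Math.min(b.length - 1, i + window)
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true
        bMatched[j] = true
        matches++
        break
      }
    }
  }
  if (matches === 0) return 0

  // Count matched characters that appear in a different order.
  let k = 0
  let outOfOrder = 0
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue
    while (!bMatched[k]) k++
    if (a[i] !== b[k]) outOfOrder++
    k++
  }
  const t = outOfOrder / 2
  return (matches / a.length + matches / b.length + (matches - t) / matches) / 3
}

// Winkler adjustment: boost pairs that share a prefix of up to four characters.
function jaroWinkler(a, b, prefixScale = 0.1) {
  const j = jaro(a, b)
  const maxPrefix = Math.min(4, a.length, b.length)
  let prefix = 0
  while (prefix < maxPrefix && a[prefix] === b[prefix]) prefix++
  return j + prefix * prefixScale * (1 - j)
}

// Hypothetical helper: only compare names when the email domains match.
function isFuzzyDuplicate(x, y, threshold = 0.92) {
  const domainOf = (email) => email.slice(email.lastIndexOf('@') + 1).toLowerCase()
  if (domainOf(x.email) !== domainOf(y.email)) return false
  return jaroWinkler(x.name.toLowerCase(), y.name.toLowerCase()) >= threshold
}
```

Restricting the comparison to a single domain keeps the check cheap and avoids merging two different people who happen to share a common name at different outlets.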

Soak

Enriches contacts with AI:

  • Platform type detection (radio, press, playlist, blog, podcast)
  • Genre identification
  • Geographic scope
  • Submission guidelines
  • Pitch tips

Supports Anthropic (Claude Haiku) and OpenAI (GPT-4o-mini).

Steep

Discovers contact channels by scraping the outlet's public website:

  • Outlet socials (Instagram, Twitter, LinkedIn, Facebook)
  • Submission portal URL + submission email
  • Submission format (mp3, link, form, mixed)
  • Recent presenters / journalists named on the site
  • Per-contact attribution: when a contact's name appears on the team / presenter page, their personal handles are extracted
  • Pitch hooks: specific, observable hooks pulled from the scraped text ("submissions form requires ISRC", "no submissions email -- portal only")

One scrape powers every contact at that outlet. The CLI caches scrapes in memory for the duration of a run; a persistent 30-day cache is available to programmatic consumers that supply their own CacheAdapter (see below).
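
The "one scrape per outlet" idea with a 30-day expiry can be sketched as a small in-memory cache. Everything named here is hypothetical: the README does not show the CacheAdapter interface, so the get/set method names, the synchronous style, the injected clock, and the scrapeOnce helper are assumptions made for illustration. Check the package's exported types for the real adapter shape before relying on any of this.

```javascript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000

// Hypothetical cache: entries expire once they are older than ttlMs.
class MemoryScrapeCache {
  constructor({ ttlMs = THIRTY_DAYS_MS, now = () => Date.now() } = {}) {
    this.ttlMs = ttlMs
    this.now = now // injectable clock, which makes expiry easy to test
    this.entries = new Map()
  }

  get(key) {
    const entry = this.entries.get(key)
    if (entry === undefined) return undefined
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(key) // stale: drop it and report a miss
      return undefined
    }
    return entry.value
  }

  set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() })
  }
}

// Scrape an outlet at most once per cache lifetime.
// `scrape` is a caller-supplied function, not part of sink.
function scrapeOnce(cache, outletDomain, scrape) {
  const cached = cache.get(outletDomain)
  if (cached !== undefined) return cached
  const fresh = scrape(outletDomain)
  cache.set(outletDomain, fresh)
  return fresh
}
```

A persistent adapter would presumably swap the Map for a file or database and return promises; the expiry logic would stay the same.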

Requires FIRECRAWL_API_KEY and an LLM provider key. If credentials are missing, the phase is skipped silently.

Global Flags

-o, --output <path>       Output file path
--format <csv|json|jsonl>  Output format (default: csv)
--config <path>            Config file path
--dry-run                  Preview without writing files
--verbose                  Detailed output
-q, --quiet                Suppress all output except errors
--json                     JSON stdout (for piping)
--no-colour                Disable colours
--provider <name>          Enrichment provider (anthropic|openai)

Exit Codes

Code  Meaning
0     Success
1     File error (not found, permission denied, is a directory)
2     Parse error (invalid CSV, no usable data)
3     Config error (invalid config file)
4     Pipeline error (enrichment failure, unexpected crash)
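
When sink runs inside a script or CI job, the exit code can be turned back into a readable message. The lookup below mirrors the table above; the function name describeExit and the fallback message are assumptions for illustration.

```javascript
// Exit codes as documented in the table above.
const EXIT_MEANINGS = {
  0: 'success',
  1: 'file error',
  2: 'parse error',
  3: 'config error',
  4: 'pipeline error',
}

// Hypothetical helper: map a numeric exit code to a short label.
function describeExit(code) {
  return EXIT_MEANINGS[code] ?? `unknown exit code ${code}`
}
```

In a Node script the code would typically come from the status field returned by spawnSync in node:child_process, for example after running sink scrub on a file.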

Provider Setup

Anthropic

export ANTHROPIC_API_KEY=sk-ant-...
sink soak contacts.csv --provider anthropic

OpenAI

export OPENAI_API_KEY=sk-...
sink soak contacts.csv --provider openai

Input Format

Accepts CSV files with flexible column names:

Field    Accepted Headers
Name     name, contact, full name, person
Email    email, e mail, email address
Outlet   outlet, publication, media, company, station
Role     role, title, position, job title
Phone    phone, telephone, mobile
Website  website, url, web
Notes    notes, comments, description
Tags     tags, categories, labels

First/last name columns are automatically joined. Unmapped columns are preserved in extras.
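
Flexible header matching of this kind can be sketched as a normalise-then-lookup step. The alias lists come from the table above; everything else is an assumption made for illustration, including the function names, the normalisation rule (lower-case, with underscores and hyphens treated as spaces), the "first name" and "last name" header spellings, and the shape of the returned object.

```javascript
// Alias lists copied from the Accepted Headers table.
const HEADER_ALIASES = {
  name: ['name', 'contact', 'full name', 'person'],
  email: ['email', 'e mail', 'email address'],
  outlet: ['outlet', 'publication', 'media', 'company', 'station'],
  role: ['role', 'title', 'position', 'job title'],
  phone: ['phone', 'telephone', 'mobile'],
  website: ['website', 'url', 'web'],
  notes: ['notes', 'comments', 'description'],
  tags: ['tags', 'categories', 'labels'],
}

// Reverse index: normalised header -> canonical field name.
const FIELD_BY_HEADER = new Map(
  Object.entries(HEADER_ALIASES).flatMap(([field, aliases]) =>
    aliases.map((alias) => [alias, field]),
  ),
)

// Assumed normalisation: "E-Mail", "e_mail" and " E mail " all become "e mail".
function normaliseHeader(header) {
  return header.trim().toLowerCase().replace(/[\s_-]+/g, ' ')
}

function mapRow(row) {
  const mapped = {}
  const extras = {}
  let first = ''
  let last = ''

  for (const [header, value] of Object.entries(row)) {
    const key = normaliseHeader(header)
    const field = FIELD_BY_HEADER.get(key)
    if (field !== undefined && mapped[field] === undefined) mapped[field] = value
    else if (key === 'first name') first = value
    else if (key === 'last name') last = value
    else extras[header] = value // unmapped columns are preserved
  }

  // Join first/last name columns when no full-name column was present.
  if (mapped.name === undefined && (first || last)) {
    mapped.name = `${first} ${last}`.trim()
  }
  return { ...mapped, extras }
}
```

For example, a row with the headers First Name, Last_Name, E-mail, Station and Shoe Size would map to name, email and outlet fields, with Shoe Size kept under extras.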

Configuration

Create a sink.config.mjs (or sink.config.json) in your project root. Sink auto-discovers config files in this order: sink.config.mjs, sink.config.js, sink.config.ts, then sink.config.json.

// sink.config.mjs
export default {
  scrub: {
    mxCacheTTL: 1800, // seconds
    typoMap: './data/custom-typos.json',
  },
  rinse: {
    fuzzyThreshold: 0.92,
    strategies: ['exact-email', 'fuzzy-name', 'cross-field'],
  },
  soak: {
    provider: 'anthropic',
    anthropic: {
      model: 'claude-haiku-4-5-20251001',
      apiKey: process.env.ANTHROPIC_API_KEY,
    },
  },
  output: {
    format: 'csv',
    locale: 'en-GB',
  },
}

A TypeScript sink.config.ts also works, but only on Node >= 23.6 (which can strip types natively). On Node 20/22 use .mjs or .json. If a config file is present but cannot be loaded, sink warns and falls back to defaults; an explicit --config <path> that is missing or invalid exits with code 3.

Programmatic API

import { runPipeline, loadConfig } from 'datasink'

const config = await loadConfig()
const records = [
  {
    id: '1',
    raw: { name: 'Sarah Jones', email: 'sarah@bbc.co.uk', outlet: 'BBC Radio 1' },
    phases: [],
    timestamp: new Date().toISOString(),
  },
]

const { records: processed, stats } = await runPipeline(records, {
  phases: ['scrub', 'rinse'],
  config,
})

console.log(stats)

Contributing

See CONTRIBUTING.md for dev setup, code style, and PR guidelines.

Changelog

See CHANGELOG.md for release history.

Part of Total Audio

Tools I build for music PR, by Chris Schofield. Part of Total Audio Promo.

Project      Description
TAP          Campaign management for music PR agencies
totalaud.io  Release planning for emerging artists
SpotCheck    Spotify playlist validation
Newsjack     Music industry newsjacking
Podflow      Podcast intelligence for music PR
Sink         Contact data hygiene CLI

Questions? Reach me on X/@chrisschouk or info@totalaudiopromo.com.

Licence

MIT
