Playwright-based data connectors for DataConnect. Each connector exports a user's data from a web platform using browser automation. Credentials never leave the device.
| Platform | Company | Runtime | Scopes |
|---|---|---|---|
| ChatGPT | OpenAI | playwright | chatgpt.conversations, chatgpt.memories |
| GitHub | GitHub | playwright | github.profile, github.repositories, github.starred |
| Instagram | Meta | playwright | instagram.profile, instagram.posts |
| LinkedIn | LinkedIn | playwright | linkedin.profile, .experience, .education, .skills, .languages |
| Spotify | Spotify | playwright | spotify.profile, spotify.savedTracks, spotify.playlists |
| YouTube | Google | playwright | youtube.profile, youtube.subscriptions, youtube.playlists, youtube.playlistItems, youtube.likes, youtube.watchLater, youtube.history (top 50 recent items) |
node run-connector.cjs ./github/github-playwright.js # JSON output (for agents)
node run-connector.cjs ./github/github-playwright.js --pretty # colored output (for humans)
node run-connector.cjs ./github/github-playwright.js --inputs '{"username":"x","password":"y"}'

See skills/vana-connect/ for the agent skill: setup, running, creating new connectors, and data recipes.
├── run-connector.cjs # Connector runner (symlink)
├── registry.json # Central registry (checksums, versions)
├── skills/vana-connect/ # Agent skill (setup, create, run, recipes)
├── types/
│ └── connector.d.ts # TypeScript type definitions
├── schemas/ # JSON schemas for exported data
│ ├── chatgpt.conversations.json
│ └── ...
├── openai/
│ ├── chatgpt-playwright.js # Connector script
│ └── chatgpt-playwright.json # Metadata
├── github/
│ ├── github-playwright.js
│ └── github-playwright.json
├── linkedin/
│ ├── linkedin-playwright.js
│ └── linkedin-playwright.json
├── meta/
│ ├── instagram-playwright.js
│ └── instagram-playwright.json
├── spotify/
│ ├── spotify-playwright.js
│ └── spotify-playwright.json
└── google/
  ├── youtube-playwright.js # Connector script
  └── youtube-playwright.json # Metadata
Each connector consists of two files inside a `<company>/` directory:
- `<name>-playwright.js` -- the connector script (plain JS, runs inside the Playwright runner sidecar)
- `<name>-playwright.json` -- metadata (display name, login URL, selectors, scopes)
Connectors run in a sandboxed Playwright browser managed by the DataConnect app. The runner provides a page API object (not raw Playwright). The browser starts headless; connectors call page.showBrowser() when login is needed and page.goHeadless() after.
Phase 1 -- Login (visible browser)
- Navigate to the platform's login page (headless)
- Check if the user is already logged in via persistent session
- If not, show the browser so the user can log in manually
- Extract auth tokens/cookies once logged in
Phase 2 -- Data collection (headless)
- Switch to headless mode (browser disappears)
- Fetch data via API calls, network capture, or DOM scraping
- Report structured progress to the UI
- Return the collected data with an export summary
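The two phases above might be wired together roughly as follows. This is a minimal sketch, not the structure of any shipped connector: `runTwoPhase`, `isLoggedIn`, `collect`, and the stub page are hypothetical names introduced for illustration. Only `page.goto`, `page.showBrowser`, `page.promptUser`, `page.goHeadless`, and `page.setProgress` come from the Page API in this README, and how the real runner invokes `checkFn` is an assumption.

```javascript
// Sketch of the two-phase flow. `runTwoPhase`, `isLoggedIn`, and `collect`
// are hypothetical helpers; only the page.* calls come from the Page API.
async function runTwoPhase(page, { loginUrl, isLoggedIn, collect }) {
  // Phase 1: navigate headless and reuse a persistent session if one exists.
  await page.goto(loginUrl);
  if (!(await isLoggedIn(page))) {
    // No session: surface the browser so the user can log in manually.
    const { headed } = await page.showBrowser(loginUrl);
    if (!headed) {
      throw new Error('Login required, but headed mode is unavailable');
    }
    await page.promptUser('Please log in', () => isLoggedIn(page), 2000);
  }

  // Phase 2: hide the browser and collect data.
  await page.goHeadless();
  await page.setProgress({
    phase: { step: 1, total: 1, label: 'Collecting' },
    message: 'Starting export...',
    count: 0,
  });
  return collect(page);
}

// Stub page that records calls, so the sketch runs outside the real runner.
function createStubPage() {
  const calls = [];
  let loggedIn = false;
  return {
    calls,
    isLoggedIn: () => loggedIn,
    goto: async (url) => { calls.push(`goto:${url}`); },
    showBrowser: async () => { calls.push('showBrowser'); return { headed: true }; },
    goHeadless: async () => { calls.push('goHeadless'); },
    setProgress: async () => { calls.push('setProgress'); },
    // Simulates the user completing login on the first poll.
    promptUser: async (msg, checkFn) => {
      calls.push('promptUser');
      loggedIn = true;
      await checkFn();
    },
  };
}

const stubPage = createStubPage();
const twoPhaseRun = runTwoPhase(stubPage, {
  loginUrl: 'https://example.com/login',
  isLoggedIn: async (p) => p.isLoggedIn(),
  collect: async () => ({ 'example.profile': { name: 'demo' } }),
});
```

Passing `isLoggedIn` and `collect` in as callbacks keeps the platform-specific logic separate from the phase switching, which makes the flow easy to exercise against a stub.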
Connectors return a scoped result object where data keys use the format source.category (e.g., linkedin.profile, chatgpt.conversations). The frontend auto-detects scoped keys (any key containing a . that isn't a metadata field) and POSTs each scope separately to the Personal Server at POST /v1/data/{scope}.
const result = {
'platform.scope1': { /* scope data */ },
'platform.scope2': { /* scope data */ },
exportSummary: { count, label, details },
timestamp: new Date().toISOString(),
version: '2.0.0-playwright',
platform: 'platform-name',
};

Metadata keys (`exportSummary`, `timestamp`, `version`, `platform`) are not treated as scopes.
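The scope-detection rule can be illustrated with a small helper. This is a sketch of the rule as described in this README, not the frontend's actual code; `splitScopedResult` and `scopeEndpoints` are hypothetical names.

```javascript
// Metadata keys named in this README. Any other key containing a dot is
// treated as a scope. This mirrors the described rule, not the app's code.
const METADATA_KEYS = new Set(['exportSummary', 'timestamp', 'version', 'platform']);

function splitScopedResult(result) {
  const scopes = {};
  const metadata = {};
  for (const [key, value] of Object.entries(result)) {
    if (!METADATA_KEYS.has(key) && key.includes('.')) {
      scopes[key] = value;
    } else {
      metadata[key] = value;
    }
  }
  return { scopes, metadata };
}

// Map each scope to the Personal Server path it would be POSTed to.
function scopeEndpoints(result) {
  return Object.keys(splitScopedResult(result).scopes).map((scope) => `/v1/data/${scope}`);
}

const exampleResult = {
  'linkedin.profile': { name: 'demo' },
  'linkedin.skills': ['a', 'b'],
  exportSummary: { count: 2, label: 'items', details: '' },
  timestamp: new Date().toISOString(),
  version: '2.0.0-playwright',
  platform: 'linkedin',
};
console.log(scopeEndpoints(exampleResult));
```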
| Pattern | When to use | Example connector |
|---|---|---|
| API fetch via `page.evaluate()` | Platform has REST/JSON APIs | `openai/chatgpt-playwright.js` |
| Network capture via `page.captureNetwork()` | Platform uses GraphQL/XHR that fires on navigation | `meta/instagram-playwright.js` |
| DOM scraping via `page.evaluate()` | No API available, data only in rendered HTML | `linkedin/linkedin-playwright.js` |
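For the network-capture pattern, a connector typically needs to wait until the registered capture has fired. The sketch below shows one way to poll for it. `waitForCapture` and the stub page are hypothetical. The pattern argument types (plain strings here) and the shape of the captured response are assumptions, since this README does not specify them.

```javascript
// Hypothetical helper: poll for a registered network capture.
// Assumes page.getCapturedResponse(key) resolves to null until a match arrives.
async function waitForCapture(page, key, { timeoutMs = 10000, intervalMs = 500 } = {}) {
  let waited = 0;
  while (waited <= timeoutMs) {
    const captured = await page.getCapturedResponse(key);
    if (captured !== null && captured !== undefined) return captured;
    await page.sleep(intervalMs);
    waited += intervalMs;
  }
  return null; // Caller decides whether to retry or fall back to DOM scraping.
}

// Stub page for running the sketch outside the runner: the capture "arrives"
// on the third poll. The response shape is invented for the demo.
function createCaptureStub() {
  let polls = 0;
  return {
    captureNetwork: async () => {},
    goto: async () => {},
    sleep: async () => {},
    getCapturedResponse: async () => (++polls >= 3 ? { data: { posts: [] } } : null),
    pollCount: () => polls,
  };
}

async function captureDemo() {
  const page = createCaptureStub();
  // Register the capture before navigating so the request is not missed.
  await page.captureNetwork({ urlPattern: '/graphql', bodyPattern: 'posts', key: 'posts' });
  await page.goto('https://example.com/profile');
  const captured = await waitForCapture(page, 'posts', { intervalMs: 10 });
  return { captured, polls: page.pollCount() };
}

const captureDemoRun = captureDemo();
```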
See skills/vana-connect/CREATE.md for the full walkthrough. Summary:
- Scaffold: `node scripts/scaffold.cjs <platform> [company]` -- generates script, metadata, and stub schema
- Implement: Write login + data collection logic (see CREATE.md for auth patterns, extraction strategies, and reference connectors)
- Validate structure: `node scripts/validate-connector.cjs <company>/<name>-playwright.js`
- Test: `node run-connector.cjs <company>/<name>-playwright.js --inputs '{"username":"x","password":"y"}'`
- Validate output: `node scripts/validate-connector.cjs <company>/<name>-playwright.js --check-result ~/.dataconnect/last-result.json`
- Register: `node scripts/register.cjs <company>/<name>-playwright.js` -- adds entry + checksums to `registry.json`
The page object is available as a global in connector scripts. The runner implementation lives in data-connect/playwright-runner.
| Method | Description |
|---|---|
| `page.evaluate(jsString)` | Run JS in browser context, return result |
| `page.screenshot()` | Take a JPEG screenshot, returns base64 string |
| `page.requestInput({message, schema?})` | Request data from the driver (credentials, 2FA codes, etc.) |
| `page.goto(url, options?)` | Navigate to URL |
| `page.sleep(ms)` | Wait for milliseconds |
| `page.setData(key, value)` | Send data to host (`'status'`, `'error'`, `'result'`) |
| `page.setProgress({phase, message, count})` | Structured progress for the UI |
| `page.showBrowser(url?)` | Escalate to headed mode; returns `{ headed: true/false }` |
| `page.goHeadless()` | Switch to headless mode (no-op if already headless) |
| `page.promptUser(msg, checkFn, interval)` | Poll `checkFn` until truthy |
| `page.captureNetwork({urlPattern, bodyPattern, key})` | Register a network capture |
| `page.getCapturedResponse(key)` | Get captured response or `null` |
| `page.hasCapturedResponse(key)` | Check if a response was captured |
| `page.clearNetworkCaptures()` | Clear all captures |
| `page.closeBrowser()` | Close browser, keep process for HTTP work |
| `page.httpFetch(url, options?)` | Node.js fetch with auto-injected cookies from the browser session |
showBrowser switches the browser to headed mode for cases that require live human interaction (e.g., interactive CAPTCHAs). It returns { headed: true } on success or { headed: false } if the driver doesn't support headed mode. Connectors should check the return value and handle the fallback:
const { headed } = await page.showBrowser(url);
if (!headed) {
// Headed not available — retry, skip, or report error
}

For normal login flows, use `requestInput` to ask the driver for credentials without showing a browser:
const { email, password } = await page.requestInput({
message: 'Log in to ChatGPT',
schema: {
type: 'object',
properties: {
email: { type: 'string', format: 'email' },
password: { type: 'string', format: 'password' }
},
required: ['email', 'password']
}
});

The runner relays the request to the driver (Tauri app, agent, CLI) and resolves with the response. The `schema` field uses JSON Schema, the same format used by OpenAI, Anthropic, and Google for LLM tool definitions. See the headless-first runner spec for the full protocol design.
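Because the response comes from an external driver, a connector may want to confirm that the required fields actually arrived before using them. The helper below is a hypothetical sketch that checks only the schema's `required` list. It is not a JSON Schema validator, and nothing here implies the runner does or does not validate responses itself.

```javascript
// Hypothetical guard: list required properties that are absent or empty in a
// requestInput response. Checks only `required`, not the full JSON Schema.
function missingRequiredFields(schema, response) {
  const required = Array.isArray(schema.required) ? schema.required : [];
  return required.filter((name) => {
    const value = response ? response[name] : undefined;
    return value === undefined || value === null || value === '';
  });
}

const loginSchema = {
  type: 'object',
  properties: {
    email: { type: 'string', format: 'email' },
    password: { type: 'string', format: 'password' },
  },
  required: ['email', 'password'],
};

console.log(missingRequiredFields(loginSchema, { email: 'a@example.com' })); // [ 'password' ]
```

In a connector, a non-empty list could be reported with `page.setData('error', ...)` or used to re-issue the `requestInput` call.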
await page.setProgress({
phase: { step: 1, total: 3, label: 'Fetching memories' },
message: 'Downloaded 50 of 200 items...',
count: 50,
});

- `phase.step` / `phase.total` -- drives the step indicator ("Step 1 of 3")
- `phase.label` -- short label for the current phase
- `message` -- human-readable progress text
- `count` -- numeric count for progress tracking
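When a connector has several phases, building the payload in one place keeps `step`, `total`, and `label` consistent. A minimal sketch, assuming a hypothetical `buildProgress` helper and an illustrative phase list:

```javascript
// Hypothetical helper that builds setProgress payloads from a phase list.
function buildProgress(phases, stepIndex, done, total) {
  if (stepIndex < 0 || stepIndex >= phases.length) {
    throw new RangeError(`stepIndex ${stepIndex} is out of range`);
  }
  return {
    // step is 1-based so the UI can show "Step 1 of 3".
    phase: { step: stepIndex + 1, total: phases.length, label: phases[stepIndex] },
    message: `Downloaded ${done} of ${total} items...`,
    count: done,
  };
}

// Illustrative phase names; a real connector defines its own.
const phases = ['Fetching memories', 'Fetching conversations', 'Building summary'];
const payload = buildProgress(phases, 0, 50, 200);
console.log(JSON.stringify(payload));
// In a connector: await page.setProgress(payload);
```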
- DataConnect cloned and able to run (`npm run tauri:dev`)
- Clone this repo alongside DataConnect:
git clone https://github.com/vana-com/data-connectors.git
- Point DataConnect to your local connectors during development:
# From the DataConnect repo
CONNECTORS_PATH=../data-connectors npm run tauri:dev

The `CONNECTORS_PATH` environment variable tells the fetch script to skip downloading and use your local directory instead.
- After editing connector files, sync them to the app's runtime directory:
# From the DataConnect repo
node scripts/sync-connectors-dev.js

This copies your connector files to `~/.dataconnect/connectors/`, where the running app reads them. The app checks this directory first, so your local edits take effect without rebuilding.
- Edit your connector script
- Run `node scripts/sync-connectors-dev.js` (from the DataConnect repo)
- Click the connector in the app to test
- Check logs in `~/Library/Logs/DataConnect/` (macOS) for debugging
Test connectors without the full DataConnect app. The runner spawns playwright-runner as a child process and outputs JSON protocol messages.
Prerequisites: The DataConnect repo cloned alongside this one (the runner auto-detects ../data-dt-app/playwright-runner), or set PLAYWRIGHT_RUNNER_DIR to point to the playwright-runner directory.
# Run a connector (headed by default, browser visible)
node run-connector.cjs ./linkedin/linkedin-playwright.js
# Colored, human-readable output
node run-connector.cjs ./linkedin/linkedin-playwright.js --pretty
# Pre-supply credentials
node run-connector.cjs ./linkedin/linkedin-playwright.js --inputs '{"username":"x","password":"y"}'
# Run headless (no visible browser)
node run-connector.cjs ./linkedin/linkedin-playwright.js --headless
# Override the initial URL
node run-connector.cjs ./linkedin/linkedin-playwright.js --url https://linkedin.com/feed
# Save result to a custom path (default: ./connector-result.json)
node run-connector.cjs ./linkedin/linkedin-playwright.js --output ./my-result.json

The runner reads the connector's sibling `.json` metadata to resolve the `connectURL`. In headed mode, `goHeadless()` becomes a no-op so the browser stays visible throughout.
- Fork this repo
- Create a branch: `git checkout -b feat/<platform>-connector`
- Add your files in `connectors/<company>/`:
  - `<name>-playwright.js` -- connector script
  - `<name>-playwright.json` -- metadata
  - `schemas/<platform>.<scope>.json` -- data schema (optional but encouraged)
- Test locally using the instructions above
- Update `registry.json` with your connector entry and checksums
- Open a pull request
- Fork and branch
- Make your changes to the connector script and/or metadata
- Test locally
- Update the version in the metadata JSON
- Regenerate checksums and update `registry.json`
- Open a pull request
- Credentials stay on-device. Never send tokens or passwords to external servers.
- Use `page.setProgress()` to report progress during long exports.
- Include `exportSummary` in the result. The UI uses it to display what was collected.
- Handle errors. Use `page.setData('error', message)` with clear error messages.
- Prefer API fetch over DOM scraping. APIs are more stable than DOM structure.
- Avoid obfuscated CSS class names. Use structural selectors, heading text, and content heuristics.
- Rate-limit API calls. Add `page.sleep()` between requests.
- Test pagination edge cases -- empty results, single page, large datasets.
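Several of these guidelines (progress reporting, error handling, rate limiting, pagination edge cases) can be combined in one collection loop. The following is a sketch under stated assumptions: `collectAllPages` and `fetchPage` are hypothetical, and the `{ items, nextCursor }` page shape is invented for illustration, since real platforms paginate differently.

```javascript
// Sketch of a rate-limited pagination loop. `fetchPage` is a hypothetical
// callback returning { items, nextCursor }; real platforms differ.
async function collectAllPages(page, fetchPage, { delayMs = 500, label = 'Fetching items' } = {}) {
  const items = [];
  let cursor = null;
  try {
    do {
      const batch = await fetchPage(cursor);
      items.push(...batch.items);
      cursor = batch.nextCursor ?? null;
      await page.setProgress({
        phase: { step: 1, total: 1, label },
        message: `Collected ${items.length} items...`,
        count: items.length,
      });
      if (cursor !== null) await page.sleep(delayMs); // Pause between requests.
    } while (cursor !== null);
  } catch (err) {
    // Report a clear error to the host, then let the caller decide what to do.
    await page.setData('error', `Export failed after ${items.length} items: ${err.message}`);
    throw err;
  }
  return items;
}

// Stub page and a fake three-page data source, including an empty last page.
const paginationLog = { sleeps: 0, counts: [], errors: [] };
const paginationStubPage = {
  sleep: async () => { paginationLog.sleeps += 1; },
  setProgress: async ({ count }) => { paginationLog.counts.push(count); },
  setData: async (key, value) => { paginationLog.errors.push([key, value]); },
};
const fakePages = {
  start: { items: [1, 2], nextCursor: 'a' },
  a: { items: [3], nextCursor: 'b' },
  b: { items: [], nextCursor: null },
};
const paginationRun = collectAllPages(
  paginationStubPage,
  async (cursor) => fakePages[cursor ?? 'start'],
  { delayMs: 10 }
);
```

The loop sleeps only when another page follows, so a single-page or empty export finishes without an unnecessary delay.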
The registry uses SHA-256 checksums to verify file integrity during OTA updates. Always regenerate checksums when modifying connector files:
shasum -a 256 <company>/<name>-playwright.js | awk '{print "sha256:" $1}'
shasum -a 256 <company>/<name>-playwright.json | awk '{print "sha256:" $1}'

DataConnect fetches `registry.json` from this repo on app startup and during `npm postinstall`. For each connector listed:
- Check if local files exist with matching checksums
- If not, download from `baseUrl/<file_path>` (this repo's raw GitHub URL)
- Verify SHA-256 checksums match
- Write to local `connectors/` directory
This enables OTA connector updates without a full app release.
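The checksum comparison at the heart of this flow can be sketched in Node.js. This is an illustration, not DataConnect's fetch script: `sha256Checksum` and `needsDownload` are hypothetical names, and `expected` stands in for a value read from `registry.json`, whose exact layout is not shown in this README.

```javascript
const nodeCrypto = require('node:crypto');

// Produce a checksum string in the same "sha256:<hex>" form as the shasum
// commands in this README.
function sha256Checksum(contents) {
  return 'sha256:' + nodeCrypto.createHash('sha256').update(contents).digest('hex');
}

// Decide whether a local file must be re-downloaded. `expected` stands in for
// a checksum read from registry.json (layout assumed, not documented here).
function needsDownload(localContents, expected) {
  if (localContents === null) return true; // File missing locally.
  return sha256Checksum(localContents) !== expected;
}

const expectedChecksum = sha256Checksum('abc');
console.log(expectedChecksum);
console.log(needsDownload('abc', expectedChecksum)); // false: contents match
console.log(needsDownload('abd', expectedChecksum)); // true: contents differ
```

In practice the contents would come from `fs.readFileSync(path)`, which returns a Buffer that `update()` accepts directly.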