A TypeScript Discord bot that monitors channels, indexes shared links, and provides semantic search capabilities. Perfect for communities that want to build a searchable knowledge base from shared resources.
SourceBase automatically extracts URLs from Discord messages, fetches content metadata, generates AI-powered summaries and embeddings, and stores everything in a searchable database. When you or your community members share links, the bot makes them discoverable through natural language search.
Key Use Cases:
- Build a searchable archive of resources shared in your Discord community
- Quickly find previously shared articles, videos, and documentation
- Create a knowledge base that grows organically as your community shares content
Key Features:
- 🔗 Link Ingestion: Extracts and stores URLs from Discord messages
- 📺 YouTube Support: Full support for YouTube videos with metadata, captions, and transcripts
- 🧠 AI-Powered: Generates summaries and embeddings using LLM (Ollama/OpenAI compatible)
- 🔍 Semantic Search: Search links by meaning, not just keywords
- 📊 Backfill Queue: Automatic retry for failed operations with SLA tracking
- 🎭 Discord Reactions: Success/failure feedback on message processing
Want to get running quickly? Here's the minimal setup:
```bash
# 1. Install dependencies
npm install

# 2. Copy and fill environment variables
cp .env.example .env
# Edit .env and add: DISCORD_BOT_TOKEN, DISCORD_CHANNEL_ID, DATABASE_URL, LLM_BASE_URL, LLM_MODEL

# 3. Set up database
npm run db:migrate

# 4. Run the bot
npm run start:dev
```

See Local Setup below for detailed instructions.
Prerequisites:
- Node.js 20+
- npm 10+
- PostgreSQL 14+ with pgvector extension
- (Optional) Ollama or OpenAI-compatible API for LLM features
1. Install dependencies:

   ```bash
   npm install
   ```

2. Copy the environment template:

   ```bash
   cp .env.example .env
   ```

3. Fill in the required values in `.env`:

   Required:
   - `DISCORD_BOT_TOKEN` - From the Discord Developer Portal
   - `DISCORD_CHANNEL_ID` - Channel to monitor
   - `DATABASE_URL` - PostgreSQL connection string
   - `LLM_BASE_URL` - e.g., `http://localhost:11434/v1` for Ollama
   - `LLM_MODEL` - e.g., `llama3.2` or `gpt-4o-mini`

   Optional (for YouTube support):
   - `YOUTUBE_API_KEY` - From the Google Cloud Console
   - `YOUTUBE_CAPTION_LANGUAGE` - Preferred caption language (default: `en`)
   - `ENABLE_YOUTUBE_CAPTIONS` - Enable/disable captions (default: `true`)

4. Set up the database:

   ```bash
   npm run db:migrate
   ```

5. Run the bot:

   ```bash
   npm run start:dev
   ```
Available scripts:

- `npm run start:dev` - Run the bot in watch mode with TypeScript
- `npm run build` - Compile TypeScript into `dist/`
- `npm run start` - Run the compiled bot from `dist/`
- `npm run lint` - Type-check without emitting files
- `npm run test` - Run unit tests
- `npm run db:migrate` - Apply SQL migrations to PostgreSQL
The bot provides a CLI tool for database operations outside of Discord:
After building the project, the `sb` command is available:
```bash
npm run build

# Option 1: Use npx
npx sb <command>

# Option 2: Link globally
npm link
sb <command>
```

Process URLs immediately (bypasses the queue):
```bash
# Add a single URL
sb add https://example.com/article

# Add multiple URLs
sb add https://url1.com https://url2.com https://url3.com

# Add with verbose progress output
sb add --verbose https://example.com
```

Features:
- Extracts content, generates a summary and an embedding immediately
- Shows progress: Downloading → Extracting → Summarizing → Embedding → Storing → Completed
- Outputs `Added: <title> (ID: <id>)` on success, `Failed: <url> - <error>` on failure
- Supports YouTube URLs with transcript extraction
- Exit codes: 0 (success), 1 (error), 2 (invalid args)
Add URLs to the processing queue (processed by the bot asynchronously):
```bash
# Queue a single URL
sb queue https://example.com/article

# Queue multiple URLs
sb queue https://url1.com https://url2.com https://url3.com

# Queue with verbose output
sb queue --verbose https://example.com
```

Features:
- Inserts URLs into `document_queue` with `pending` status
- URLs are processed sequentially by the bot
- Discord notifications are sent to the configured channel on progress/completion
- Outputs `Queued: <url> (ID: <id>)` on success
- Exit codes: 0 (success), 1 (error), 2 (invalid args)
Search the database using semantic similarity:
```bash
# Basic search (table output)
sb search "machine learning"

# JSON output for programmatic use
sb search --format json "neural networks"

# Get just the URLs
sb search --format urls-only "web development" | xargs -I {} curl {}

# Limit results
sb search --limit 10 "artificial intelligence"
```

Features:
- Semantic search using vector embeddings
- Results sorted by relevance
- Exit codes: 0 (success), 1 (error), 2 (invalid args)
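The ranking itself happens in the database, but the idea behind "results sorted by relevance" can be sketched in TypeScript: embed the query, then order stored links by cosine similarity to their embeddings. This is an illustrative sketch — the types and function names are assumptions, not the bot's actual code.

```typescript
// Cosine similarity between two embedding vectors: 1 means identical direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical shape of a stored link row.
interface StoredLink { url: string; embedding: number[]; }

// Rank stored links against a query embedding, best match first.
function rankBySimilarity(query: number[], links: StoredLink[], limit = 20): StoredLink[] {
  return [...links]
    .sort((x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, limit);
}
```

In production this ordering is done inside PostgreSQL by pgvector's distance operators (e.g. `ORDER BY embedding <=> $1`) rather than in application code.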
Display statistics about the indexed content:
```bash
# Table output (default)
sb stats

# JSON output
sb stats --format json
```

Features:
- Total link count
- Counts of links with summaries, embeddings, content, and transcripts
- Recent activity (24h, 7d, 30d)
- Exit codes: 0 (success), 1 (error), 2 (invalid args)
The `sb add` command supports multiple output formats for progress events:
```bash
# Console format (human-friendly, default in a TTY)
sb add https://example.com/article

# NDJSON format (one JSON line per event, ideal for automation)
sb add --format ndjson https://example.com/article
sb add --ndjson https://example.com/article  # shorthand

# Webhook format (POSTs events to a URL)
sb add --format webhook --webhook-url https://example.com/hook https://example.com/article
```

NDJSON Output Example:

```bash
$ sb add --ndjson https://example.com/article
{"type":"progress","phase":"downloading","url":"https://example.com/article","current":1,"total":1,"timestamp":"2026-03-29T12:00:00.000Z"}
{"type":"progress","phase":"extracting_links","url":"https://example.com/article","current":1,"total":1,"timestamp":"2026-03-29T12:00:01.000Z"}
{"type":"progress","phase":"summarizing","url":"https://example.com/article","current":1,"total":1,"chunkCurrent":1,"chunkTotal":3,"timestamp":"2026-03-29T12:00:02.000Z"}
{"type":"progress","phase":"completed","url":"https://example.com/article","current":1,"total":1,"title":"Example Article","summary":"This is a summary...","timestamp":"2026-03-29T12:00:05.000Z"}
```

Webhook Events:

When using `--format webhook`, each progress event is POSTed as JSON to the provided URL. The webhook receives the same JSON structure as the NDJSON output.
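Because each NDJSON line (and each webhook body) is a standalone JSON object, downstream tooling can consume the stream line by line. A minimal consumer sketch — the `ProgressEvent` shape mirrors the fields shown in the example output above; anything beyond those fields is an assumption:

```typescript
// Event shape inferred from the NDJSON example; optional fields appear
// only on certain phases (e.g. title/summary on completion).
interface ProgressEvent {
  type: string;
  phase: string;
  url: string;
  current: number;
  total: number;
  timestamp: string;
  title?: string;
  summary?: string;
}

// Parse a chunk of NDJSON output into typed events, skipping blank lines.
function parseNdjson(output: string): ProgressEvent[] {
  return output
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as ProgressEvent);
}

// Example: keep only terminal events (completed/failed).
function terminalEvents(events: ProgressEvent[]): ProgressEvent[] {
  return events.filter((e) => e.phase === "completed" || e.phase === "failed");
}
```

A consumer could then be wired up as `sb add --ndjson <url> | node consumer.js`, reading stdin and calling `parseNdjson` on the accumulated output.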
When the Discord bot invokes the CLI, it passes context via flags:
```bash
sb add \
  --channel-id "123456789" \
  --message-id "987654321" \
  --author-id "111222333" \
  https://example.com/article
```

These flags associate the operation with Discord entities but are optional for standalone CLI usage.
All commands support:
- `--help`, `-h` - Show usage information
- `--version`, `-v` - Show version
- `--verbose` - Enable detailed JSON logging output
CLI commands operate independently of Discord and require only:
Required:
- `DATABASE_URL` - PostgreSQL connection string

Required for the `add` command:
- `LLM_BASE_URL` - LLM API endpoint (e.g., `http://localhost:11434/v1`)
- `LLM_MODEL` - Model name (e.g., `gpt-4o-mini`)

Optional:
- `LLM_EMBEDDING_MODEL` - Separate model for embeddings
- `YOUTUBE_API_KEY` - For YouTube metadata extraction
- `LOG_LEVEL` - Control verbosity (`debug`, `info`, `warn`, `error`)

Note: The CLI does NOT require `DISCORD_BOT_TOKEN` or `DISCORD_CHANNEL_ID`; these are bot-only variables.
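A startup check along these lines lets the CLI fail fast when a required variable is missing. This is a sketch under the requirements listed above — the command-to-variable mapping and function name are assumptions, not the CLI's actual validation code:

```typescript
// Which variables are required depends on the command being run
// (per the lists above; assumed mapping).
const REQUIRED_BY_COMMAND: Record<string, string[]> = {
  add: ["DATABASE_URL", "LLM_BASE_URL", "LLM_MODEL"],
  queue: ["DATABASE_URL"],
  search: ["DATABASE_URL"],
  stats: ["DATABASE_URL"],
};

// Return the names of required variables missing from the environment.
function missingEnvVars(
  command: string,
  env: Record<string, string | undefined>,
): string[] {
  return (REQUIRED_BY_COMMAND[command] ?? []).filter((name) => !env[name]);
}
```

A caller would pass `process.env` and exit with a usage error when the returned list is non-empty.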
The bot provides enhanced support for YouTube URLs with the following features:
- Standard: `https://youtube.com/watch?v=VIDEO_ID`
- Short: `https://youtu.be/VIDEO_ID`
- Shorts: `https://youtube.com/shorts/VIDEO_ID`
- Live: `https://youtube.com/live/VIDEO_ID`
- Embed: `https://youtube.com/embed/VIDEO_ID`
- Mobile: `https://m.youtube.com/watch?v=VIDEO_ID`
- Music: `https://music.youtube.com/watch?v=VIDEO_ID`
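Recognizing all of these shapes comes down to extracting the video ID and rebuilding a canonical URL. A sketch of that normalization — the function names are hypothetical, not the bot's implementation in `src/ingestion/`:

```typescript
// Extract a YouTube video ID from any of the supported URL shapes, or null.
function extractVideoId(raw: string): string | null {
  let url: URL;
  try { url = new URL(raw); } catch { return null; }
  // Strip www./m./music. prefixes so all variants look alike.
  const host = url.hostname.replace(/^(www|m|music)\./, "");
  if (host === "youtu.be") return url.pathname.slice(1) || null;
  if (host !== "youtube.com") return null;
  const watchId = url.searchParams.get("v");
  if (watchId) return watchId;
  // shorts/live/embed paths carry the ID as the second path segment.
  const match = url.pathname.match(/^\/(shorts|live|embed)\/([\w-]+)/);
  return match ? match[2] : null;
}

// Canonical form, useful for deduplicating the same video shared different ways.
function canonicalYoutubeUrl(raw: string): string | null {
  const id = extractVideoId(raw);
  return id ? `https://youtube.com/watch?v=${id}` : null;
}
```

Normalizing before storage means `youtu.be/X` and `m.youtube.com/watch?v=X` index as one link rather than two.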
1. Metadata Extraction (requires `YOUTUBE_API_KEY`):
   - Video title and description
   - Channel name
   - Published date
   - Thumbnail (highest quality available)

2. Caption/Transcript Fetching (optional):
   - Automatic caption download when available
   - Language fallback: preferred → English → any
   - Auto-generated caption detection
   - Stored in the `transcript` field for LLM processing

3. LLM Enhancement:
   - Summaries are generated from the transcript when available (better quality)
   - Embeddings use the transcript text for semantic search
   - Falls back to metadata when captions are unavailable

4. Rate Limiting:
   - Exponential backoff on YouTube API rate limits
   - `Retry-After` header support
   - Max 3 retries per request
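The retry behavior above amounts to: honor `Retry-After` when the API sends one, otherwise back off exponentially, and give up after three attempts. A sketch of the delay calculation — the base delay constant is an assumption, not the bot's actual value:

```typescript
const BASE_DELAY_MS = 1000; // first retry after ~1s (assumed, not the bot's value)
const MAX_RETRIES = 3;

// Delay before retry `attempt` (1-based); null when retries are exhausted.
// A Retry-After header (in seconds) takes precedence over the exponential schedule.
function retryDelayMs(attempt: number, retryAfterSeconds?: number): number | null {
  if (attempt > MAX_RETRIES) return null;
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  return BASE_DELAY_MS * 2 ** (attempt - 1); // 1s, 2s, 4s
}
```

In practice the computed delay would feed a `setTimeout`/`await sleep(...)` wrapper around each YouTube API call.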
1. Get a YouTube Data API v3 key:
   - Visit the Google Cloud Console
   - Create a new project or select an existing one
   - Enable "YouTube Data API v3"
   - Create credentials → API Key
   - Copy the key to `YOUTUBE_API_KEY` in `.env`

2. Configure caption preferences:

   ```bash
   YOUTUBE_CAPTION_LANGUAGE=en
   ENABLE_YOUTUBE_CAPTIONS=true
   ```
Issue: YouTube URLs not being processed
- Check that `YOUTUBE_API_KEY` is set correctly
- Verify the API key has YouTube Data API v3 enabled
- Check the logs for quota-exceeded errors

Issue: Captions not being fetched
- Not all videos have captions (especially non-English ones)
- Check that `ENABLE_YOUTUBE_CAPTIONS=true` is set in `.env`
- Verify the video has captions available on YouTube

Issue: Rate limit errors
- The YouTube API has quota limits (10,000 units/day on the free tier)
- The bot applies exponential backoff automatically
- Consider caching or reducing ingestion frequency
Failed operations (embeddings, summaries, transcripts) are automatically queued for retry:
- Queue table: `backfill_queue` tracks pending items
- SLA: 24-hour maximum from creation to completion
- Retry logic: up to 3 attempts with exponential backoff
- Processing: hourly by default (configurable via `BACKFILL_INTERVAL_MS`)
- Metrics: queue depth, success rate, SLA violations
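The SLA and retry rules can be expressed as a small eligibility check. This is a sketch of the logic only — the `QueueItem` shape and function names are assumptions, not the backfill service's code:

```typescript
const SLA_MS = 24 * 60 * 60 * 1000; // 24 hours from creation to completion
const MAX_ATTEMPTS = 3;

// Hypothetical shape of a backfill_queue row.
interface QueueItem { createdAt: Date; attempts: number; }

// An item violates the SLA once it has been pending longer than 24h.
function violatesSla(item: QueueItem, now: Date): boolean {
  return now.getTime() - item.createdAt.getTime() > SLA_MS;
}

// An item is retried only while attempts remain.
function shouldRetry(item: QueueItem): boolean {
  return item.attempts < MAX_ATTEMPTS;
}
```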
```bash
BACKFILL_INTERVAL_MS=3600000  # 1 hour in milliseconds
MAX_BACKFILL_ATTEMPTS=3       # Max retry attempts
```

The storage layer uses PostgreSQL with pgvector:
- `migrations/001_initial_schema.sql` - Core tables and indexes
- `migrations/003_add_transcript_column.sql` - YouTube transcript support
- `migrations/004_backfill_queue.sql` - Backfill queue tracking
Key tables:
- `links` - Stored links with metadata, summaries, and embeddings
- `backfill_queue` - Pending retry operations
- `app_checkpoints` - Processing state per channel
1. URL Detection (`src/ingestion/url.ts`):
   - Extract URLs from Discord messages
   - Detect YouTube URLs and normalize them to the canonical format

2. Content Extraction:
   - YouTube: fetch metadata via the YouTube Data API
   - Generic: use `@extractus/article-extractor`
   - Captions: download via `youtube-transcript`

3. LLM Processing:
   - Generate a summary (using the transcript if available)
   - Create an embedding vector for semantic search

4. Storage:
   - Upsert to PostgreSQL with pgvector
   - Update the backfill queue on failures
   - React to the Discord message with a success/failure emoji
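The four stages above can be sketched as one async pipeline. Everything here is illustrative — the dependency interface and function names are hypothetical stand-ins for the real modules in `src/`:

```typescript
interface ExtractedContent { title: string; text: string; transcript?: string; }
interface ProcessedLink { url: string; title: string; summary: string; embedding: number[]; }

// Hypothetical stage implementations are injected, which also makes the
// pipeline easy to unit-test with stubs.
interface PipelineDeps {
  extract: (url: string) => Promise<ExtractedContent>;
  summarize: (text: string) => Promise<string>;
  embed: (text: string) => Promise<number[]>;
  store: (link: ProcessedLink) => Promise<void>;
}

// Run one URL through extract → summarize → embed → store.
// Prefers the transcript for the LLM steps when one is available.
async function processUrl(url: string, deps: PipelineDeps): Promise<ProcessedLink> {
  const content = await deps.extract(url);
  const llmInput = content.transcript ?? content.text;
  const summary = await deps.summarize(llmInput);
  const embedding = await deps.embed(llmInput);
  const link: ProcessedLink = { url, title: content.title, summary, embedding };
  await deps.store(link);
  return link;
}
```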
Structured logging throughout the pipeline:
- `ingestion.youtube` - YouTube-specific operations
- `ingestion.backfill` - Queue processing
- `llm.*` - LLM operations
- `db.*` - Database operations
Set `LOG_LEVEL` to control verbosity:

- `error` - Errors only
- `warn` - Warnings and errors
- `info` - Standard operational logs (default)
- `debug` - Verbose debugging
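Level filtering follows the usual severity ordering (error most severe, debug least). As a sketch, not the bot's actual logger:

```typescript
// Ordered most-severe first, matching the levels listed above.
const LEVELS = ["error", "warn", "info", "debug"] as const;
type LogLevel = (typeof LEVELS)[number];

// A message is emitted when its severity is at or above the configured threshold,
// e.g. LOG_LEVEL=info emits error/warn/info but suppresses debug.
function shouldLog(configured: LogLevel, message: LogLevel): boolean {
  return LEVELS.indexOf(message) <= LEVELS.indexOf(configured);
}
```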
The backfill service tracks:
- Queue depth (pending items)
- Processed today
- Failed today
- SLA violations
Access them via `BackfillService.getMetrics()`.
```
Discord Message
      ↓
URL Extraction (youtube.ts, url.ts)
      ↓
Content Fetch (YouTube API / Article Extractor)
      ↓
LLM Processing (summarize + embed)
      ↓
PostgreSQL + pgvector
      ↓
Discord Reaction
```
Run the test suite:
```bash
npm test
```

Test coverage includes:
- URL extraction and normalization
- YouTube API client (mocked)
- Caption fetching (mocked)
- Backfill queue operations
- Database repository
We welcome contributions from the community! Here's how to get started:
1. Fork and clone the repository
2. Install dependencies: `npm install`
3. Copy `.env.example` to `.env` and configure your environment
4. Run the tests to ensure everything works: `npm test`
5. Create a feature branch: `git checkout -b feature/your-feature-name`
6. Make your changes, following the existing code style
7. Add tests for new functionality
8. Run the test suite: `npm test`
9. Run linting: `npm run lint`
10. Commit your changes with clear, descriptive messages
11. Push your branch to your fork
12. Open a Pull Request against the `main` branch, describing what it does and why
13. Ensure all tests pass and the PR is up to date with `main`
- Follow TypeScript best practices
- Use meaningful variable and function names
- Add comments for complex logic
- Keep functions focused and small
- Write tests for new features
This project is licensed under the MIT License - see the LICENSE file for details.