Discord Link Aggregation Bot

A TypeScript Discord bot that monitors channels, indexes shared links, and provides semantic search capabilities. Perfect for communities that want to build a searchable knowledge base from shared resources.

Overview

SourceBase automatically extracts URLs from Discord messages, fetches content metadata, generates AI-powered summaries and embeddings, and stores everything in a searchable database. When you or your community members share links, the bot makes them discoverable through natural language search.

Key Use Cases:

Build a searchable archive of resources shared in your Discord community
Quickly find previously shared articles, videos, and documentation
Create a knowledge base that grows organically as your community shares content

Features

🔗 Link Ingestion: Extracts and stores URLs from Discord messages
📺 YouTube Support: Full support for YouTube videos with metadata, captions, and transcripts
🧠 AI-Powered: Generates summaries and embeddings using LLM (Ollama/OpenAI compatible)
🔍 Semantic Search: Search links by meaning, not just keywords
📊 Backfill Queue: Automatic retry for failed operations with SLA tracking
🎭 Discord Reactions: Success/failure feedback on message processing

Quick Start

Want to get running quickly? Here's the minimal setup:

# 1. Install dependencies
npm install

# 2. Copy and fill environment variables
cp .env.example .env
# Edit .env and add: DISCORD_BOT_TOKEN, DISCORD_CHANNEL_ID, DATABASE_URL, LLM_BASE_URL, LLM_MODEL

# 3. Set up database
npm run db:migrate

# 4. Run the bot
npm run start:dev

See Local Setup below for detailed instructions.

Requirements

Node.js 20+
npm 10+
PostgreSQL 14+ with pgvector extension
(Optional) Ollama or OpenAI-compatible API for LLM features

Local Setup

Install dependencies:
```
npm install
```
Copy environment template:
```
cp .env.example .env
```
Fill in required values in .env:

Required:
- DISCORD_BOT_TOKEN - From Discord Developer Portal
- DISCORD_CHANNEL_ID - Channel to monitor
- DATABASE_URL - PostgreSQL connection string
- LLM_BASE_URL - e.g., http://localhost:11434/v1 for Ollama
- LLM_MODEL - e.g., llama3.2 or gpt-4o-mini
Optional (for YouTube support):
- YOUTUBE_API_KEY - From Google Cloud Console
- YOUTUBE_CAPTION_LANGUAGE - Preferred caption language (default: en)
- ENABLE_YOUTUBE_CAPTIONS - Enable/disable captions (default: true)
Set up the database:
```
npm run db:migrate
```
Run the bot:
```
npm run start:dev
```

Scripts

npm run start:dev - Run bot in watch mode with TypeScript
npm run build - Compile TypeScript into dist/
npm run start - Run compiled bot from dist/
npm run lint - Type check without emitting files
npm run test - Run unit tests
npm run db:migrate - Apply SQL migrations to PostgreSQL

CLI Commands

The bot provides a CLI tool for database operations outside of Discord:

Installation

After building the project, the sb command is available:

npm run build
# Option 1: Use npx
npx sb <command>

# Option 2: Link globally
npm link
sb <command>

Commands

`sb add` - Add URLs directly to the database

Process URLs immediately (bypasses the queue):

# Add single URL
sb add https://example.com/article

# Add multiple URLs
sb add https://url1.com https://url2.com https://url3.com

# Add with verbose progress output
sb add --verbose https://example.com

Features:

Extracts content, generates summary and embedding immediately
Shows progress: Downloading → Extracting → Summarizing → Embedding → Storing → Completed
Outputs: Added: <title> (ID: <id>) on success, Failed: <url> - <error> on failure
Supports YouTube URLs with transcript extraction
Exit codes: 0 (success), 1 (error), 2 (invalid args)

`sb queue` - Queue URLs for later processing

Add URLs to the processing queue (processed by the bot asynchronously):

# Queue single URL
sb queue https://example.com/article

# Queue multiple URLs
sb queue https://url1.com https://url2.com https://url3.com

# Queue with verbose output
sb queue --verbose https://example.com

Features:

Inserts URLs into document_queue with pending status
URLs processed sequentially by the bot
Discord notifications sent to the configured channel on progress/completion
Outputs: Queued: <url> (ID: <id>) on success
Exit codes: 0 (success), 1 (error), 2 (invalid args)

`sb search` - Search indexed content

Search the database using semantic similarity:

# Basic search (table output)
sb search "machine learning"

# JSON output for programmatic use
sb search --format json "neural networks"

# Get just the URLs
sb search --format urls-only "web development" | xargs -I {} curl {}

# Limit results
sb search --limit 10 "artificial intelligence"

Features:

Semantic search using vector embeddings
Results sorted by relevance
Exit codes: 0 (success), 1 (error), 2 (invalid args)

`sb stats` - Database statistics

Display statistics about the indexed content:

# Table output (default)
sb stats

# JSON output
sb stats --format json

Features:

Total links count
Links with summaries, embeddings, content, transcripts
Recent activity (24h, 7d, 30d)
Exit codes: 0 (success), 1 (error), 2 (invalid args)

Progress Output Formats (add command)

The sb add command supports multiple output formats for progress events:

# Console format (human-friendly, default in TTY)
sb add https://example.com/article

# NDJSON format (one JSON line per event, ideal for automation)
sb add --format ndjson https://example.com/article
sb add --ndjson https://example.com/article  # shorthand

# Webhook format (POSTs events to URL)
sb add --format webhook --webhook-url https://example.com/hook https://example.com/article

NDJSON Output Example:

$ sb add --ndjson https://example.com/article
{"type":"progress","phase":"downloading","url":"https://example.com/article","current":1,"total":1,"timestamp":"2026-03-29T12:00:00.000Z"}
{"type":"progress","phase":"extracting_links","url":"https://example.com/article","current":1,"total":1,"timestamp":"2026-03-29T12:00:01.000Z"}
{"type":"progress","phase":"summarizing","url":"https://example.com/article","current":1,"total":1,"chunkCurrent":1,"chunkTotal":3,"timestamp":"2026-03-29T12:00:02.000Z"}
{"type":"progress","phase":"completed","url":"https://example.com/article","current":1,"total":1,"title":"Example Article","summary":"This is a summary...","timestamp":"2026-03-29T12:00:05.000Z"}

Webhook Events: When using --format webhook, each progress event is POSTed as JSON to the provided URL. The webhook receives the same JSON structure as NDJSON output.

Context Flags for Bot Integration

When the Discord bot invokes the CLI, it passes context via flags:

sb add \
  --channel-id "123456789" \
  --message-id "987654321" \
  --author-id "111222333" \
  https://example.com/article

These flags associate the operation with Discord entities but are optional for standalone CLI usage.

Global Options

All commands support:

--help, -h - Show usage information
--version, -v - Show version
--verbose - Enable detailed JSON logging output

Environment Requirements

CLI commands operate independently of Discord and require only:

Required:

DATABASE_URL - PostgreSQL connection string

Required for add command:

LLM_BASE_URL - LLM API endpoint (e.g., http://localhost:11434/v1)
LLM_MODEL - Model name (e.g., gpt-4o-mini)

Optional:

LLM_EMBEDDING_MODEL - Separate model for embeddings
YOUTUBE_API_KEY - For YouTube metadata extraction
LOG_LEVEL - Control verbosity (debug, info, warn, error)

Note: The CLI does NOT require DISCORD_BOT_TOKEN or DISCORD_CHANNEL_ID. These are bot-only variables.

YouTube Ingestion

The bot provides enhanced support for YouTube URLs with the following features:

Supported URL Formats

Standard: https://youtube.com/watch?v=VIDEO_ID
Short: https://youtu.be/VIDEO_ID
Shorts: https://youtube.com/shorts/VIDEO_ID
Live: https://youtube.com/live/VIDEO_ID
Embed: https://youtube.com/embed/VIDEO_ID
Mobile: https://m.youtube.com/watch?v=VIDEO_ID
Music: https://music.youtube.com/watch?v=VIDEO_ID

Features

Metadata Extraction (requires YOUTUBE_API_KEY):
- Video title and description
- Channel name
- Published date
- Thumbnail (highest quality available)
Caption/Transcript Fetching (optional):
- Automatic caption download when available
- Language fallback: preferred → English → any
- Auto-generated caption detection
- Stored in transcript field for LLM processing
LLM Enhancement:
- Summaries generated from transcript when available (better quality)
- Embeddings use transcript text for semantic search
- Fallback to metadata when captions unavailable
Rate Limiting:
- Exponential backoff on YouTube API rate limits
- Retry-After header support
- Max 3 retries per request

Setup

Get a YouTube Data API v3 key:
- Visit Google Cloud Console
- Create a new project or select existing
- Enable "YouTube Data API v3"
- Create credentials → API Key
- Copy the key to YOUTUBE_API_KEY in .env

Configure caption preferences:

YOUTUBE_CAPTION_LANGUAGE=en
ENABLE_YOUTUBE_CAPTIONS=true

Troubleshooting

Issue: YouTube URLs not being processed

Check YOUTUBE_API_KEY is set correctly
Verify API key has YouTube Data API v3 enabled
Check logs for quota exceeded errors

Issue: Captions not being fetched

Not all videos have captions (especially non-English)
Check ENABLE_YOUTUBE_CAPTIONS=true in .env
Verify video has captions available on YouTube

Issue: Rate limit errors

YouTube API has quota limits (10,000 units/day for free tier)
Bot implements exponential backoff automatically
Consider caching or reducing ingestion frequency

Backfill Queue

Failed operations (embeddings, summaries, transcripts) are automatically queued for retry:

Queue Table: backfill_queue tracks pending items
SLA: 24-hour maximum from creation to completion
Retry Logic: Up to 3 attempts with exponential backoff
Processing: Hourly by default (configurable via BACKFILL_INTERVAL_MS)
Metrics: Track queue depth, success rate, SLA violations

Configuration

BACKFILL_INTERVAL_MS=3600000  # 1 hour in milliseconds
MAX_BACKFILL_ATTEMPTS=3       # Max retry attempts

Database

The storage layer uses PostgreSQL with pgvector:

migrations/001_initial_schema.sql - Core tables and indexes
migrations/003_add_transcript_column.sql - YouTube transcript support
migrations/004_backfill_queue.sql - Backfill queue tracking

Key tables:

links - Stored links with metadata, summaries, embeddings
backfill_queue - Pending retry operations
app_checkpoints - Processing state per channel

Ingestion Pipeline

URL Detection (src/ingestion/url.ts):
- Extract URLs from Discord messages
- Detect YouTube URLs and normalize to canonical format
Content Extraction:
- YouTube: Fetch metadata via YouTube Data API
- Generic: Use @extractus/article-extractor
- Captions: Download via youtube-transcript
LLM Processing:
- Generate summary (using transcript if available)
- Create embedding vector for semantic search
Storage:
- Upsert to PostgreSQL with pgvector
- Update backfill queue on failures
- React to Discord message with success/failure emoji

Observability

Logging

Structured logging throughout the pipeline:

ingestion.youtube - YouTube-specific operations
ingestion.backfill - Queue processing
llm.* - LLM operations
db.* - Database operations

Set LOG_LEVEL to control verbosity:

error - Errors only
warn - Warnings and errors
info - Standard operational logs (default)
debug - Verbose debugging

Metrics

The backfill service tracks:

Queue depth (pending items)
Processed today
Failed today
SLA violations

Access via BackfillService.getMetrics()

Architecture

Discord Message
    ↓
URL Extraction (youtube.ts, url.ts)
    ↓
Content Fetch (YouTube API / Article Extractor)
    ↓
LLM Processing (summarize + embed)
    ↓
PostgreSQL + pgvector
    ↓
Discord Reaction

Testing

Run the test suite:

npm test

Test coverage includes:

URL extraction and normalization
YouTube API client (mocked)
Caption fetching (mocked)
Backfill queue operations
Database repository

Contributing

We welcome contributions from the community! Here's how to get started:

Development Setup

Fork and clone the repository
Install dependencies: npm install
Copy .env.example to .env and configure your environment
Run tests to ensure everything works: npm test

Making Changes

Create a feature branch: git checkout -b feature/your-feature-name
Make your changes following the existing code style
Add tests for new functionality
Run the test suite: npm test
Run linting: npm run lint
Commit your changes with clear, descriptive messages

Pull Request Process

Push your branch to your fork
Open a Pull Request against the main branch
Describe what your PR does and why
Ensure all tests pass and the PR is up to date with main

Code Style

Follow TypeScript best practices
Use meaningful variable and function names
Add comments for complex logic
Keep functions focused and small
Write tests for new features

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 110 Commits
.githooks		.githooks
scripts		scripts
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLI_EXTRACTION_PLAN.md		CLI_EXTRACTION_PLAN.md
LICENSE		LICENSE
README.md		README.md
config.yaml		config.yaml
links-list.jsonl		links-list.jsonl
package-lock.json		package-lock.json
package.json		package.json
tmux.windows.conf.yaml		tmux.windows.conf.yaml
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Folders and files

Latest commit

History

Repository files navigation

Discord Link Aggregation Bot

Overview

Features

Quick Start

Requirements

Local Setup

Scripts

CLI Commands

Installation

Commands

sb add - Add URLs directly to the database

sb queue - Queue URLs for later processing

sb search - Search indexed content

sb stats - Database statistics

Progress Output Formats (add command)

Context Flags for Bot Integration

Global Options

Environment Requirements

YouTube Ingestion

Supported URL Formats

Features

Setup

Troubleshooting

Backfill Queue

Configuration

Database

Ingestion Pipeline

Observability

Logging

Metrics

Architecture

Testing

Contributing

Development Setup

Making Changes

Pull Request Process

Code Style

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`sb add` - Add URLs directly to the database

`sb queue` - Queue URLs for later processing

`sb search` - Search indexed content

`sb stats` - Database statistics

Packages