This project has two main components:
- Dataset Creation: A Python-based data pipeline that crawls, parses, and organizes transcripts from the classic CSPAN show Booknotes
- Interactive Chat App: A web application that lets users "resume" historical Booknotes conversations using AI
cspan-booknotes/
├── dataset/ # Python data pipeline & HuggingFace Hub management
└── app/ # Web application (Svelte + Bun + Neon DB)
The /dataset directory contains tools for creating a unique dataset from the public archives of the wonderful CSPAN program Booknotes. The dataset includes transcripts of conversations between host Brian Lamb and his more than 800 guests.
Three datasets are available:
- programs: Information for ~809 episodes, including title, description, and guest information
- transcripts: Full conversation transcripts (~200 turns/conversation) between Brian Lamb and his guests
- related_items: Related or recommended programs (~5) for each episode
The transcripts dataset is the key dataset, with the other two providing additional context and information about each episode. Using the sequence, speaker_role, and text fields, each transcript can be reassembled into an ordered, chat-style conversation.
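As a minimal sketch of that idea, the snippet below turns a list of transcript turns into chat messages. The function name `turns_to_messages` and the mapping of `host` to `user` and `guest` to `assistant` are assumptions made for illustration; the project may use a different convention.

```python
def turns_to_messages(turns):
    """Convert transcript turns into an ordered list of chat messages.

    Each turn is expected to carry the `sequence`, `speaker_role`, and
    `text` fields described above.
    """
    # Assumed mapping: the host asks (user), the guest answers (assistant).
    role_map = {"host": "user", "guest": "assistant"}

    # Sort by `sequence` so the conversation order is restored even if
    # the rows arrive shuffled.
    ordered = sorted(turns, key=lambda turn: turn["sequence"])

    return [
        {
            # Unknown roles are passed through unchanged.
            "role": role_map.get(turn["speaker_role"], turn["speaker_role"]),
            "content": turn["text"],
        }
        for turn in ordered
    ]


example_turns = [
    {"sequence": 1, "speaker_role": "guest", "text": "Yes."},
    {"sequence": 0, "speaker_role": "host", "text": "Christopher Hitchens, author of For the Sake of Argument..."},
]
messages = turns_to_messages(example_turns)
```

Sorting inside the function keeps it safe to call on rows filtered from a larger table, where the original order is not guaranteed.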
cd dataset
uv sync # Install dependencies
uv run scripts/parse_programs.py # Parse episodes
uv run scripts/upload_to_hf.py # Upload to HuggingFace Hub
See dataset/README.md for detailed documentation.
The /app directory contains an interactive web application that lets users select historical Booknotes episodes and "resume" the conversations using AI. The application simulates both Brian Lamb and his guests continuing their discussions.
- Frontend: SvelteKit
- Backend: Bun + Hono
- Database: Neon DB (PostgreSQL)
- AI: Claude (Anthropic)
- Deployment: Railway
cd app
bun install # Install dependencies
bun run dev # Start both frontend and backend
The frontend runs on http://localhost:5173 and the backend API on http://localhost:3000.
See app/README.md for detailed documentation.
To understand how the three tables work together, here is an example of the source JSON for a single program:
{
"id": "51559-1",
"url": "https://booknotes.c-span.org/Watch/51559-1",
"title": "For the Sake of Argument",
"guest": "Christopher Hitchens",
"description": "Mr. Hitchens discussed the recent publication...",
"book_isbn": "0860914356",
"air_date": "October 17, 1993",
"transcript": [
{
"sequence": 0,
"speaker_role": "host",
"speaker_name": "BRIAN LAMB, HOST:",
"text": "Christopher Hitchens, author of For the Sake of Argument..."
},
{
"sequence": 1,
"speaker_role": "guest",
"speaker_name": "CHRISTOPHER HITCHENS:",
"text": "Yes."
}
],
"related": [
{
"id": "55567-1",
"url": "https://booknotes.c-span.org/Watch/55567-1",
"author": "John Corry",
"title": "My Times: Adventures in the News Trade"
}
]
}
See LICENSE.md for details.
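As a rough illustration of how the program JSON shown above could map onto the three tables, the sketch below splits one record into a programs row, transcript rows, and related-item rows. The function name `split_program` and the `program_id` and `related_id` column names are hypothetical; the published datasets may name their columns differently.

```python
def split_program(record):
    """Split one source program record into rows for the three tables."""
    program_id = record["id"]

    # programs: every top-level field except the nested lists.
    program_row = {
        key: value
        for key, value in record.items()
        if key not in ("transcript", "related")
    }

    # transcripts: one row per turn, tagged with the parent program id
    # (hypothetical `program_id` column).
    transcript_rows = [
        {"program_id": program_id, **turn}
        for turn in record.get("transcript", [])
    ]

    # related_items: one row per related program; the nested `id` is
    # renamed to `related_id` (hypothetical) to avoid ambiguity.
    related_rows = [
        {
            "program_id": program_id,
            "related_id": item["id"],
            **{key: value for key, value in item.items() if key != "id"},
        }
        for item in record.get("related", [])
    ]

    return program_row, transcript_rows, related_rows


record = {
    "id": "51559-1",
    "title": "For the Sake of Argument",
    "guest": "Christopher Hitchens",
    "transcript": [
        {"sequence": 0, "speaker_role": "host", "text": "Christopher Hitchens, author of For the Sake of Argument..."},
        {"sequence": 1, "speaker_role": "guest", "text": "Yes."},
    ],
    "related": [
        {"id": "55567-1", "author": "John Corry", "title": "My Times: Adventures in the News Trade"},
    ],
}
program_row, transcript_rows, related_rows = split_program(record)
```

Carrying the program id on every child row is what lets the transcripts and related items be joined back to their episode.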