Automated Research Digest Pipeline

A daily digest system that fetches academic papers from arXiv, HuggingFace Daily Papers, and OpenAlex, then extracts and semantically chunks their text, stores embeddings in a vector database, and delivers an LLM-generated summary via email. Supports 150+ arXiv categories, 26 OpenAlex academic fields, and any LLM provider through litellm.

How It Works

arXiv RSS ─────────────┐
  (fetch pool)         │
                       ├──> Dedup & Seen Filter ──> Extract / Chunk / Store ALL
HuggingFace Daily ─────┤                                      |
  (optional)           │                                      v
                       │                             Vector Storage (ChromaDB)
OpenAlex API ──────────┤                              [full corpus retained]
  (optional,           │                                      |
   fetch pool) ────────┘                                      v
                                                     Interest-Based Ranking
                                                     (keyword + LLM scoring)
                                                              |
                                                              v
                                                     Top N for Digest
                                                              |
                                                    LLM Summarization (litellm)
                                                              |
                                                              v
                                             Post-Processing (ELI5, Implications & Critiques)
                                                              |
                                                              v
                                             Email Dispatch (SMTP/STARTTLS)

Pipeline Steps

  1. Fetch — Fetches papers from the arXiv RSS feed for your configured topics. When interest-based ranking is configured, a larger "fetch pool" is retrieved (e.g., 200 papers). PDFs are downloaded in parallel. Optionally also fetches from HuggingFace Daily Papers (community-upvoted) and OpenAlex (broad academic coverage across 26 fields). Papers are deduplicated across sources by DOI and filtered against a 30-day seen history.
  2. Extract — Uses pypdf to pull raw text from each PDF. Image-only documents are flagged as unparseable and stored separately. Papers without PDFs (HuggingFace, OpenAlex) use their abstract.
  3. Chunk — Splits extracted text into semantic segments using Chonkie's SemanticChunker with the potion-base-32M embedding model.
  4. Store — Persists chunks with embeddings and metadata (title, authors, URL, date, chunk index) in ChromaDB for future retrieval. All fetched papers are stored — not just the ones selected for the digest — so the vector store builds a comprehensive corpus over time. Steps 2-4 run in parallel (configurable workers) for faster ingestion.
  5. Rank — When interest-based ranking is configured, papers are scored by keyword match + LLM relevance and the top N per source are selected for the digest (e.g., top 50 arXiv, top 20 OpenAlex). Papers that don't make the digest cut are still in the vector store from step 4.
  6. Summarize — Sends paper text to an LLM via litellm, producing per-paper structured summaries. Supports OpenAI, Anthropic, Google, Cohere, Ollama, Azure, and 100+ other providers.
  7. Post-process — Optionally generates ELI5 (plain-language explanations), practical implications (who benefits, how to apply), and structured critiques (strengths, weaknesses, open questions) via separate LLM calls. Enabled post-processors run in parallel by default.
  8. Email — Delivers the digest as a styled HTML + plaintext email via SMTP with STARTTLS, or prints to console in dry-run mode.
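The dedup and seen-filter logic in step 1 can be sketched roughly as follows; the function and field names here are illustrative, not the pipeline's actual internals:

```python
from datetime import datetime, timedelta

SEEN_WINDOW_DAYS = 30  # matches the 30-day seen history described above

def dedup_and_filter(papers, seen):
    """Drop papers already fetched from another source (by DOI) or already
    seen within the last 30 days. `seen` maps DOI -> first-seen datetime."""
    cutoff = datetime.utcnow() - timedelta(days=SEEN_WINDOW_DAYS)
    fresh, dois = [], set()
    for paper in papers:
        doi = paper["doi"]
        if doi in dois:                         # cross-source duplicate
            continue
        if doi in seen and seen[doi] > cutoff:  # seen within the window
            continue
        dois.add(doi)
        fresh.append(paper)
    return fresh
```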

What You Get

Each digest email contains per-paper sections with:

  • Summary — Key findings and contributions
  • Practical Implications — Who benefits and how to apply the research
  • Critique — Methodological strengths, weaknesses, and open questions
  • Metadata — Authors, categories/fields of study, source, and direct links

Getting Started

Prerequisites

  • Python 3.10+ (the project's Ruff config targets 3.10)
  • An API key for your chosen LLM provider

Installation

git clone https://github.com/Blevene/Daylight-know.git
cd Daylight-know
pip install -e ".[dev]"

Option A: Interactive Setup Wizard (Recommended)

The setup wizard walks you through configuring everything interactively, with topic browsing, connection testing, and .env file generation:

digest-pipeline setup

The wizard will:

  1. Help you browse and select arXiv topics — search by keyword, browse by group (CS, Math, Physics, etc.), or type codes directly
  2. Configure your LLM provider and optionally test the connection
  3. Configure SMTP email settings and optionally test delivery
  4. Set up ChromaDB storage location
  5. Toggle optional features (ELI5, implications, critiques)
  6. Write a complete .env file (with backup if one exists)

Option B: Manual Configuration

Copy the example environment file and edit it:

cp .env.example .env

Browsing arXiv Topics

Not sure which topics to subscribe to? Use the topic browser:

# List all topic groups with counts
digest-pipeline topics list

# Search by keyword
digest-pipeline topics search "machine learning"
digest-pipeline topics search "quantum"
digest-pipeline topics search "natural language"

# List all topics in a group
digest-pipeline topics group cs
digest-pipeline topics group physics

# Validate topic codes
digest-pipeline topics validate cs.AI cs.LG stat.ML

Configuration Reference

All settings are configured via environment variables in .env:
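For instance, a minimal .env for a dry-run against arXiv only might look like this (all values are placeholders):

```
ARXIV_TOPICS=cs.AI,cs.LG
LLM_MODEL=openai/gpt-4o-mini
LLM_API_KEY=sk-...
DRY_RUN=true
```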

arXiv Settings

Variable Description Default
ARXIV_TOPICS Comma-separated arXiv category codes cs.AI,cs.LG
ARXIV_MAX_RESULTS Max papers to include in digest 50
ARXIV_FETCH_POOL Papers to fetch before ranking (when ranking is enabled) 200

LLM Settings

Variable Description Default
LLM_MODEL litellm model string openai/gpt-4o-mini
LLM_API_KEY API key for your LLM provider required
LLM_MAX_TOKENS Max tokens for LLM responses 32768
LLM_API_BASE Custom API base URL (optional)

LLM_MODEL uses litellm's provider/model format. Examples:

  • openai/gpt-4o-mini — OpenAI
  • anthropic/claude-sonnet-4-20250514 — Anthropic
  • gemini/gemini-pro-latest — Google (latest Pro model)
  • ollama/llama3 — Local Ollama
  • azure/gpt-4o — Azure OpenAI

See litellm providers for the full list.

Email / SMTP Settings

Variable Description Default
SMTP_HOST SMTP server hostname smtp.gmail.com
SMTP_PORT SMTP port (STARTTLS) 587
SMTP_USER SMTP username (usually your email) required
SMTP_PASSWORD SMTP password or app password required
EMAIL_FROM Sender email address required
EMAIL_TO Recipient email address required

Gmail users: You'll need an App Password. Go to Google Account > Security > 2-Step Verification > App Passwords, generate one for "Mail", and use it as SMTP_PASSWORD.
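The email step is plain SMTP with STARTTLS on port 587, so it can be sketched with Python's standard library alone; the helper names below are illustrative, not the pipeline's own emailer API:

```python
import smtplib
from email.message import EmailMessage

def build_digest_email(subject, html_body, text_body, sender, recipient):
    # Multipart message with both plaintext and HTML parts,
    # mirroring the styled HTML + plaintext digest described above.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(text_body)                      # plaintext part
    msg.add_alternative(html_body, subtype="html")  # HTML part
    return msg

def send_digest(msg, host="smtp.gmail.com", port=587, user=None, password=None):
    # STARTTLS on port 587, matching the SMTP_* defaults above
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)
```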

ChromaDB Settings

Variable Description Default
CHROMA_PERSIST_DIR Directory for ChromaDB storage ./data/chromadb
CHROMA_COLLECTION Collection name research_digest

Pipeline Modes

Variable Description Default
DRY_RUN Print digest to console instead of emailing true
PDF_DOWNLOAD_MAX_RETRIES PDF download retry attempts 3
PDF_DOWNLOAD_WORKERS Parallel PDF download threads 8
PDF_ARCHIVE_DIR Directory to archive PDFs (empty = disabled)

Post-Processing

Variable Description Default
POSTPROCESSING_IMPLICATIONS Generate practical implications true
POSTPROCESSING_CRITIQUES Generate structured critiques true
POSTPROCESSING_ELI5 Generate plain-language ELI5 explanations true

Pipeline Performance

Variable Description Default
PIPELINE_INGEST_WORKERS Parallel threads for extract/chunk/store 4
PIPELINE_POSTPROCESS_PARALLEL Run post-processing LLM calls concurrently true
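Since each post-processor (ELI5, implications, critiques) is an independent LLM call, they parallelize cleanly. A minimal sketch of concurrent post-processing with a thread pool, using hypothetical processor callables:

```python
from concurrent.futures import ThreadPoolExecutor

def run_postprocessors(summary, processors):
    """Run enabled post-processing steps concurrently.
    `processors` maps a name (e.g. "eli5") to a callable taking the
    per-paper summary and returning the generated section."""
    with ThreadPoolExecutor(max_workers=len(processors)) as pool:
        futures = {name: pool.submit(fn, summary)
                   for name, fn in processors.items()}
        return {name: f.result() for name, f in futures.items()}
```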

HuggingFace Daily Papers (Optional)

Variable Description Default
HUGGINGFACE_ENABLED Include HuggingFace community papers false
HUGGINGFACE_TOKEN HuggingFace API token (for rate limits)
HUGGINGFACE_MAX_RESULTS Max HuggingFace papers 20

Papers from HuggingFace are deduplicated against arXiv — those already fetched from arXiv appear as "trending" in a sidebar rather than being processed twice.

OpenAlex (Optional)

Variable Description Default
OPENALEX_ENABLED Include OpenAlex academic papers false
OPENALEX_API_KEY OpenAlex API key (required since Feb 2026)
OPENALEX_EMAIL Email for polite pool (recommended)
OPENALEX_MAX_RESULTS Papers to include in digest 20
OPENALEX_QUERY Search query (used when ranking is off) machine learning
OPENALEX_FIELDS JSON array of academic fields to filter

Valid fields: Computer Science, Mathematics, Physics and Astronomy, Chemistry, Engineering, Medicine, Psychology, and 19 others (26 total). See the setup wizard for the full list.
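For example, to enable OpenAlex restricted to two fields (email and field choices are illustrative):

```
OPENALEX_ENABLED=true
OPENALEX_EMAIL=you@example.com
OPENALEX_FIELDS=["Computer Science","Engineering"]
```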

Interest-Based Ranking (Optional)

When configured, the pipeline fetches a larger "fetch pool" of papers from each source and uses LLM scoring to select only the most relevant ones for your digest. Ranking applies to both arXiv and OpenAlex papers.

Pipeline-wide settings (apply to all sources):

Variable Description Default
INTEREST_PROFILE Natural language description of your research interests
INTEREST_KEYWORDS JSON array of boost keywords

OpenAlex-specific overrides (take priority over pipeline-wide for OpenAlex):

Variable Description Default
OPENALEX_INTEREST_PROFILE OpenAlex-specific interest profile
OPENALEX_INTEREST_KEYWORDS OpenAlex-specific boost keywords
OPENALEX_FETCH_POOL OpenAlex papers to fetch before ranking 100

How it works: The ranker scores each paper using two signals:

  1. Keyword boost — +2 points per keyword match in title or abstract
  2. LLM scoring — Papers are sent in batches of 20 to your configured LLM, which rates each paper 1-10 against your interest profile

For arXiv, ARXIV_FETCH_POOL papers are fetched and ranked down to ARXIV_MAX_RESULTS. For OpenAlex, OPENALEX_FETCH_POOL papers are fetched and ranked down to OPENALEX_MAX_RESULTS. If the LLM call fails, keyword-only ranking is used as a fallback. If neither profile nor keywords are configured, all fetched papers pass through (backward compatible).

Only the top-ranked papers proceed to LLM summarization — the ranking step saves API costs. All fetched papers are still stored in the vector store for future retrieval.
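The two signals can be sketched as follows; the exact way the pipeline combines them is an assumption here (a simple sum), and `llm_scores` stands in for the batched 1-10 LLM ratings:

```python
def keyword_score(paper, keywords):
    """+2 points per interest keyword found in the title or abstract."""
    text = (paper["title"] + " " + paper["abstract"]).lower()
    return sum(2 for kw in keywords if kw.lower() in text)

def select_top(papers, keywords, llm_scores, top_n):
    """Rank by keyword boost + LLM relevance rating and keep the top N.
    A missing LLM score contributes 0, which degrades gracefully to the
    keyword-only fallback described above."""
    ranked = sorted(
        papers,
        key=lambda p: keyword_score(p, keywords) + llm_scores.get(p["id"], 0),
        reverse=True,
    )
    return ranked[:top_n]
```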

Example:

INTEREST_PROFILE="AI applications including world models, frontier AI methods, memory and retrieval systems, and cybersecurity"
INTEREST_KEYWORDS=["world model","RAG","knowledge graph","cybersecurity","LLM","reasoning","agent"]
ARXIV_FETCH_POOL=200
OPENALEX_FETCH_POOL=100

Usage

Running the Pipeline

# Dry-run (prints digest to console)
digest-pipeline --dry-run

# With specific topics (overrides .env)
digest-pipeline --dry-run --topics cs.CL cs.CV

# Verbose logging
digest-pipeline --dry-run -v

# Production mode (sends email)
# Set DRY_RUN=false in .env, then:
digest-pipeline

You can also use the explicit run subcommand:

digest-pipeline run --dry-run --topics cs.AI

Scheduling with Cron

The pipeline fetches papers from the last 24 hours, so running it once daily on weekdays is ideal. arXiv publishes new submissions Sun-Thu around 20:00 UTC and does not publish on weekends, so a Monday-Friday schedule catches every batch.

1. Find your executable path:

which digest-pipeline

2. Edit your crontab:

crontab -e

3. Add the pipeline and log rotation:

# Research digest pipeline - Mon-Fri at 7:00 AM EST (12:00 UTC)
0 12 * * 1-5 cd /path/to/Daylight-know && /path/to/bin/digest-pipeline >> /path/to/Daylight-know/logs/digest-pipeline.log 2>&1

# Log rotation - keep previous week's log, truncate current (Monday midnight)
0 0 * * 1 cp /path/to/Daylight-know/logs/digest-pipeline.log /path/to/Daylight-know/logs/digest-pipeline.log.prev && : > /path/to/Daylight-know/logs/digest-pipeline.log

Replace /path/to/Daylight-know with your project directory and /path/to/bin/digest-pipeline with the output of which digest-pipeline.

Important:

  • Working directory: The cd is required so the pipeline finds your .env file and the relative ChromaDB storage path resolves correctly.
  • Logs directory: Create logs/ in the project root before the first run: mkdir -p logs
  • Timing: 12:00 UTC = 7:00 AM EST. Adjust for your timezone. Morning runs give the best coverage since arXiv publishes the previous evening.
  • Log rotation: The Monday midnight job copies the current log to .log.prev and truncates the current file, keeping two weeks of history.

4. Verify it works:

# Test the exact command cron will run
cd /path/to/Daylight-know && digest-pipeline

5. Set production mode:

Make sure DRY_RUN=false in your .env before the first scheduled run, and verify your SMTP credentials work with a manual test run first.

Scheduling with systemd (Linux)

For more robust scheduling with automatic logging via journalctl and Persistent=true (runs missed jobs on next boot):

# /etc/systemd/system/digest-pipeline.service
[Unit]
Description=Research Digest Pipeline
After=network-online.target

[Service]
Type=oneshot
WorkingDirectory=/path/to/Daylight-know
ExecStart=/path/to/bin/digest-pipeline
User=your-username

# /etc/systemd/system/digest-pipeline.timer
[Unit]
Description=Run Research Digest Pipeline weekdays

[Timer]
OnCalendar=Mon..Fri *-*-* 12:00:00 UTC
Persistent=true

[Install]
WantedBy=timers.target

Enable and inspect the timer:

sudo systemctl enable --now digest-pipeline.timer
sudo systemctl status digest-pipeline.timer
journalctl -u digest-pipeline.service  # view logs

Tech Stack

Core Libraries

Library Version Role
litellm >=1.30 Universal LLM gateway — routes to OpenAI, Anthropic, Google, Ollama, Azure, and 100+ other providers
ChromaDB >=0.5 Local vector store with persistent SQLite + HNSW indexing for chunk storage and retrieval
pypdf >=5.0 PDF text extraction
Chonkie >=1.0 Semantic text chunking with the potion-base-32M embedding model
Pydantic >=2.0 Data validation and schema definitions (Paper, TextChunk models)
pydantic-settings >=2.0 Configuration management from environment variables and .env files
Jinja2 >=3.1 HTML and plaintext email templating
Mistune >=3.0 Markdown-to-HTML conversion for email rendering
Rich >=13.0 Interactive setup wizard TUI (console, tables, panels, prompts)
Requests >=2.31 HTTP client for HuggingFace and OpenAlex APIs
python-dotenv >=1.0 .env file loading

Dev & Testing

Library Role
pytest Test runner with custom markers (unit, integration, e2e, network)
Ruff Linter and formatter (target: Python 3.10, line length: 100)
FastAPI + Uvicorn Stub LLM server for integration tests
aiosmtpd Fake SMTP server for email integration tests
reportlab Test PDF fixture generation
Hatchling Build backend

Project Structure

src/digest_pipeline/
├── __init__.py          # Package version
├── arxiv_topics.py      # Full arXiv taxonomy (150+ topics) with search/validate
├── chunker.py           # Semantic text chunking via Chonkie
├── config.py            # Centralized settings via pydantic-settings
├── emailer.py           # HTML/plaintext email formatting and SMTP dispatch
├── extractor.py         # PDF text extraction via pypdf
├── fetcher.py           # arXiv paper fetching with retry-based PDF download
├── hf_fetcher.py        # HuggingFace Daily Papers fetching & deduplication
├── llm_utils.py         # Shared LLM call utilities (backoff, structured output)
├── openalex_fetcher.py  # OpenAlex paper fetching with field filtering
├── pipeline.py          # Main orchestrator and CLI entry point
├── postprocessor.py     # LLM post-processing (ELI5, implications & critiques)
├── prompts/             # LLM prompt templates (Markdown)
│   ├── summarizer.md    # Summarization prompt
│   ├── eli5.md          # ELI5 plain-language explanation prompt
│   ├── implications.md  # Practical implications prompt
│   ├── critiques.md     # Structured critique prompt
│   └── ranker.md        # Interest-based relevance scoring prompt
├── ranker.py            # Interest-based paper ranking (keyword + LLM scoring)
├── seen_papers.py       # Cross-day deduplication tracking
├── setup.py             # Interactive setup wizard
├── summarizer.py        # LLM-powered summarization with backoff
├── topics_cli.py        # Topic browser CLI
└── vectorstore.py       # ChromaDB vector store for chunk storage

tests/
├── unit/                # Fast tests, no external dependencies
├── integration/         # Tests with real local dependencies (ChromaDB, pypdf)
├── e2e/                 # Full pipeline runs
└── fixtures/            # Test PDFs and data

Running Tests

# Run all unit tests
pytest tests/unit/ -q

# Run integration tests (requires local dependencies)
pytest tests/integration/ -q

# Run end-to-end tests (requires LLM API key)
pytest tests/e2e/ -q

# Run everything
pytest

# Run with markers
pytest -m unit          # fast, no external deps
pytest -m integration   # real local deps
pytest -m e2e           # full pipeline
pytest -m network       # requires internet

Design Documentation

The system is specified using the EARS (Easy Approach to Requirements Syntax) methodology. See docs/ears-design-document.md for the full requirements specification including data schema definitions.

License

Apache-2.0
