Automated Research Digest Pipeline

A daily digest system that fetches academic papers from arXiv, HuggingFace Daily Papers, and OpenAlex, then extracts and semantically chunks their text, stores embeddings in a vector database, and delivers an LLM-generated summary via email. Supports 150+ arXiv categories, 26 OpenAlex academic fields, and any LLM provider through litellm.

How It Works

arXiv RSS ─────────────┐
  (fetch pool)         │
                       ├──> Dedup & Seen Filter ──> Extract / Chunk / Store ALL
HuggingFace Daily ─────┤                                      |
  (optional)           │                                      v
                       │                             Vector Storage (ChromaDB)
OpenAlex API ──────────┤                              [full corpus retained]
  (optional,           │                                      |
   fetch pool) ────────┘                                      v
                                                     Interest-Based Ranking
                                                     (keyword + LLM scoring)
                                                              |
                                                              v
                                                     Top N for Digest
                                                              |
                                                    LLM Summarization (litellm)
                                                              |
                                                              v
                                             Post-Processing (ELI5, Implications & Critiques)
                                                              |
                                                              v
                                             Email Dispatch (SMTP/STARTTLS)

Pipeline Steps

  1. Fetch — Fetches papers from the arXiv RSS feed for your configured topics. When interest-based ranking is configured, a larger "fetch pool" is retrieved (e.g., 200 papers). PDFs are downloaded in parallel. Optionally also fetches from HuggingFace Daily Papers (community-upvoted) and OpenAlex (broad academic coverage across 26 fields). Papers are deduplicated across sources by DOI and filtered against a 30-day seen history.
  2. Extract — Uses pypdf to pull raw text from each PDF. Image-only documents are flagged as unparseable and stored separately. Papers without PDFs (HuggingFace, OpenAlex) use their abstract.
  3. Chunk — Splits extracted text into semantic segments using Chonkie's SemanticChunker with the potion-base-32M embedding model.
  4. Store — Persists chunks with embeddings and metadata (title, authors, URL, date, chunk index) in ChromaDB for future retrieval. All fetched papers are stored — not just the ones selected for the digest — so the vector store builds a comprehensive corpus over time. Steps 2-4 run in parallel (configurable workers) for faster ingestion.
  5. Rank — When interest-based ranking is configured, papers are scored by keyword match + LLM relevance and the top N per source are selected for the digest (e.g., top 50 arXiv, top 20 OpenAlex). Papers that don't make the digest cut are still in the vector store from step 4.
  6. Summarize — Sends paper text to an LLM via litellm, producing per-paper structured summaries. Supports OpenAI, Anthropic, Google, Cohere, Ollama, Azure, and 100+ other providers.
  7. Post-process — Optionally generates ELI5 (plain-language explanations), practical implications (who benefits, how to apply), and structured critiques (strengths, weaknesses, open questions) via separate LLM calls. Enabled post-processors run in parallel by default.
  8. Email — Delivers the digest as a styled HTML + plaintext email via SMTP with STARTTLS, or prints to console in dry-run mode.
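The dedup and seen-filter logic in step 1 can be sketched roughly as follows; the function and field names here are illustrative, not the pipeline's actual internals:

```python
from datetime import datetime, timedelta

SEEN_WINDOW_DAYS = 30  # matches the 30-day seen history described above

def dedup_and_filter(papers, seen):
    """Drop papers already fetched from another source (by DOI) or already
    seen within the last 30 days. `seen` maps DOI -> first-seen datetime."""
    cutoff = datetime.utcnow() - timedelta(days=SEEN_WINDOW_DAYS)
    fresh, dois = [], set()
    for paper in papers:
        doi = paper["doi"]
        if doi in dois:                         # cross-source duplicate
            continue
        if doi in seen and seen[doi] > cutoff:  # seen within the window
            continue
        dois.add(doi)
        fresh.append(paper)
    return fresh
```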

What You Get

Each digest email contains per-paper sections with:

  • Summary — Key findings and contributions
  • Practical Implications — Who benefits and how to apply the research
  • Critique — Methodological strengths, weaknesses, and open questions
  • Metadata — Authors, categories/fields of study, source, and direct links

Getting Started

Prerequisites

  • Python 3.10+ (the project's Ruff config targets 3.10)
  • An API key for your chosen LLM provider

Installation

git clone https://github.com/Blevene/Daylight-know.git
cd Daylight-know
pip install -e ".[dev]"

Option A: Interactive Setup Wizard (Recommended)

The setup wizard walks you through configuring everything interactively, with topic browsing, connection testing, and .env file generation:

digest-pipeline setup

The wizard will:

  1. Help you browse and select arXiv topics — search by keyword, browse by group (CS, Math, Physics, etc.), or type codes directly
  2. Configure your LLM provider and optionally test the connection
  3. Configure SMTP email settings and optionally test delivery
  4. Set up ChromaDB storage location
  5. Toggle optional features (ELI5, implications, critiques)
  6. Write a complete .env file (with backup if one exists)

Option B: Manual Configuration

Copy the example environment file and edit it:

cp .env.example .env

Browsing arXiv Topics

Not sure which topics to subscribe to? Use the topic browser:

# List all topic groups with counts
digest-pipeline topics list

# Search by keyword
digest-pipeline topics search "machine learning"
digest-pipeline topics search "quantum"
digest-pipeline topics search "natural language"

# List all topics in a group
digest-pipeline topics group cs
digest-pipeline topics group physics

# Validate topic codes
digest-pipeline topics validate cs.AI cs.LG stat.ML

Configuration Reference

All settings are configured via environment variables in .env:
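For instance, a minimal .env for a dry-run against arXiv only might look like this (all values are placeholders):

```
ARXIV_TOPICS=cs.AI,cs.LG
LLM_MODEL=openai/gpt-4o-mini
LLM_API_KEY=sk-...
DRY_RUN=true
```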

arXiv Settings

Variable Description Default
ARXIV_TOPICS Comma-separated arXiv category codes cs.AI,cs.LG
ARXIV_MAX_RESULTS Max papers to include in digest 50
ARXIV_FETCH_POOL Papers to fetch before ranking (when ranking is enabled) 200

LLM Settings

Variable Description Default
LLM_MODEL litellm model string openai/gpt-4o-mini
LLM_API_KEY API key for your LLM provider required
LLM_MAX_TOKENS Max tokens for LLM responses 32768
LLM_API_BASE Custom API base URL (optional)

LLM_MODEL uses litellm's provider/model format. Examples:

  • openai/gpt-4o-mini — OpenAI
  • anthropic/claude-sonnet-4-20250514 — Anthropic
  • gemini/gemini-pro-latest — Google (latest Pro model)
  • ollama/llama3 — Local Ollama
  • azure/gpt-4o — Azure OpenAI

See litellm providers for the full list.

Email / SMTP Settings

Variable Description Default
SMTP_HOST SMTP server hostname smtp.gmail.com
SMTP_PORT SMTP port (STARTTLS) 587
SMTP_USER SMTP username (usually your email) required
SMTP_PASSWORD SMTP password or app password required
EMAIL_FROM Sender email address required
EMAIL_TO Recipient email address required

Gmail users: You'll need an App Password. Go to Google Account > Security > 2-Step Verification > App Passwords, generate one for "Mail", and use it as SMTP_PASSWORD.
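The email step is plain SMTP with STARTTLS on port 587, so it can be sketched with Python's standard library alone; the helper names below are illustrative, not the pipeline's own emailer API:

```python
import smtplib
from email.message import EmailMessage

def build_digest_email(subject, html_body, text_body, sender, recipient):
    # Multipart message with both plaintext and HTML parts,
    # mirroring the styled HTML + plaintext digest described above.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(text_body)                      # plaintext part
    msg.add_alternative(html_body, subtype="html")  # HTML part
    return msg

def send_digest(msg, host="smtp.gmail.com", port=587, user=None, password=None):
    # STARTTLS on port 587, matching the SMTP_* defaults above
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)
```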

ChromaDB Settings

Variable Description Default
CHROMA_PERSIST_DIR Directory for ChromaDB storage ./data/chromadb
CHROMA_COLLECTION Collection name research_digest

Pipeline Modes

Variable Description Default
DRY_RUN Print digest to console instead of emailing true
PDF_DOWNLOAD_MAX_RETRIES PDF download retry attempts 3
PDF_DOWNLOAD_WORKERS Parallel PDF download threads 8
PDF_ARCHIVE_DIR Directory to archive PDFs (empty = disabled)

Post-Processing

Variable Description Default
POSTPROCESSING_IMPLICATIONS Generate practical implications true
POSTPROCESSING_CRITIQUES Generate structured critiques true
POSTPROCESSING_ELI5 Generate plain-language ELI5 explanations true

Pipeline Performance

Variable Description Default
PIPELINE_INGEST_WORKERS Parallel threads for extract/chunk/store 4
PIPELINE_POSTPROCESS_PARALLEL Run post-processing LLM calls concurrently true
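Since each post-processor (ELI5, implications, critiques) is an independent LLM call, they parallelize cleanly. A minimal sketch of concurrent post-processing with a thread pool, using hypothetical processor callables:

```python
from concurrent.futures import ThreadPoolExecutor

def run_postprocessors(summary, processors):
    """Run enabled post-processing steps concurrently.
    `processors` maps a name (e.g. "eli5") to a callable taking the
    per-paper summary and returning the generated section."""
    with ThreadPoolExecutor(max_workers=len(processors)) as pool:
        futures = {name: pool.submit(fn, summary)
                   for name, fn in processors.items()}
        return {name: f.result() for name, f in futures.items()}
```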

HuggingFace Daily Papers (Optional)

Variable Description Default
HUGGINGFACE_ENABLED Include HuggingFace community papers false
HUGGINGFACE_TOKEN HuggingFace API token (for rate limits)
HUGGINGFACE_MAX_RESULTS Max HuggingFace papers 20

Papers from HuggingFace are deduplicated against arXiv — those already fetched from arXiv appear as "trending" in a sidebar rather than being processed twice.

OpenAlex (Optional)

Variable Description Default
OPENALEX_ENABLED Include OpenAlex academic papers false
OPENALEX_API_KEY OpenAlex API key (required since Feb 2026)
OPENALEX_EMAIL Email for polite pool (recommended)
OPENALEX_MAX_RESULTS Papers to include in digest 20
OPENALEX_QUERY Search query (used when ranking is off) machine learning
OPENALEX_FIELDS JSON array of academic fields to filter

Valid fields: Computer Science, Mathematics, Physics and Astronomy, Chemistry, Engineering, Medicine, Psychology, and 19 others (26 total). See the setup wizard for the full list.
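For example, to enable OpenAlex restricted to two fields (email and field choices are illustrative):

```
OPENALEX_ENABLED=true
OPENALEX_EMAIL=you@example.com
OPENALEX_FIELDS=["Computer Science","Engineering"]
```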

Interest-Based Ranking (Optional)

When configured, the pipeline fetches a larger "fetch pool" of papers from each source and uses LLM scoring to select only the most relevant ones for your digest. Ranking applies to both arXiv and OpenAlex papers.

Pipeline-wide settings (apply to all sources):

Variable Description Default
INTEREST_PROFILE Natural language description of your research interests
INTEREST_KEYWORDS JSON array of boost keywords

OpenAlex-specific overrides (take priority over pipeline-wide for OpenAlex):

Variable Description Default
OPENALEX_INTEREST_PROFILE OpenAlex-specific interest profile
OPENALEX_INTEREST_KEYWORDS OpenAlex-specific boost keywords
OPENALEX_FETCH_POOL OpenAlex papers to fetch before ranking 100

How it works: The ranker scores each paper using two signals:

  1. Keyword boost — +2 points per keyword match in title or abstract
  2. LLM scoring — Papers are sent in batches of 20 to your configured LLM, which rates each paper 1-10 against your interest profile

For arXiv, ARXIV_FETCH_POOL papers are fetched and ranked down to ARXIV_MAX_RESULTS. For OpenAlex, OPENALEX_FETCH_POOL papers are fetched and ranked down to OPENALEX_MAX_RESULTS. If the LLM call fails, keyword-only ranking is used as a fallback. If neither profile nor keywords are configured, all fetched papers pass through (backward compatible).

Only the top-ranked papers proceed to LLM summarization — the ranking step saves API costs. All fetched papers are still stored in the vector store for future retrieval.
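The two signals can be sketched as follows; the exact way the pipeline combines them is an assumption here (a simple sum), and `llm_scores` stands in for the batched 1-10 LLM ratings:

```python
def keyword_score(paper, keywords):
    """+2 points per interest keyword found in the title or abstract."""
    text = (paper["title"] + " " + paper["abstract"]).lower()
    return sum(2 for kw in keywords if kw.lower() in text)

def select_top(papers, keywords, llm_scores, top_n):
    """Rank by keyword boost + LLM relevance rating and keep the top N.
    A missing LLM score contributes 0, which degrades gracefully to the
    keyword-only fallback described above."""
    ranked = sorted(
        papers,
        key=lambda p: keyword_score(p, keywords) + llm_scores.get(p["id"], 0),
        reverse=True,
    )
    return ranked[:top_n]
```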

Example:

INTEREST_PROFILE="AI applications including world models, frontier AI methods, memory and retrieval systems, and cybersecurity"
INTEREST_KEYWORDS=["world model","RAG","knowledge graph","cybersecurity","LLM","reasoning","agent"]
ARXIV_FETCH_POOL=200
OPENALEX_FETCH_POOL=100

Usage

Running the Pipeline

# Dry-run (prints digest to console)
digest-pipeline --dry-run

# With specific topics (overrides .env)
digest-pipeline --dry-run --topics cs.CL cs.CV

# Verbose logging
digest-pipeline --dry-run -v

# Production mode (sends email)
# Set DRY_RUN=false in .env, then:
digest-pipeline

You can also use the explicit run subcommand:

digest-pipeline run --dry-run --topics cs.AI

Scheduling with Cron

The pipeline fetches papers from the last 24 hours, so running it once daily on weekdays is ideal. arXiv publishes new submissions Sun-Thu around 20:00 UTC and does not publish on weekends, so a Monday-Friday schedule catches every batch.

1. Find your executable path:

which digest-pipeline

2. Edit your crontab:

crontab -e

3. Add the pipeline and log rotation:

# Research digest pipeline - Mon-Fri at 7:00 AM EST (12:00 UTC)
0 12 * * 1-5 cd /path/to/Daylight-know && /path/to/bin/digest-pipeline >> /path/to/Daylight-know/logs/digest-pipeline.log 2>&1

# Log rotation - keep previous week's log, truncate current (Monday midnight)
0 0 * * 1 cp /path/to/Daylight-know/logs/digest-pipeline.log /path/to/Daylight-know/logs/digest-pipeline.log.prev && : > /path/to/Daylight-know/logs/digest-pipeline.log

Replace /path/to/Daylight-know with your project directory and /path/to/bin/digest-pipeline with the output of which digest-pipeline.

Important:

  • Working directory: The cd is required so the pipeline finds your .env file and the relative ChromaDB storage path resolves correctly.
  • Logs directory: Create logs/ in the project root before the first run: mkdir -p logs
  • Timing: 12:00 UTC = 7:00 AM EST. Adjust for your timezone. Morning runs give the best coverage since arXiv publishes the previous evening.
  • Log rotation: The Monday midnight job copies the current log to .log.prev and truncates the current file, keeping two weeks of history.

4. Verify it works:

# Test the exact command cron will run
cd /path/to/Daylight-know && digest-pipeline

5. Set production mode:

Make sure DRY_RUN=false in your .env before the first scheduled run, and verify your SMTP credentials work with a manual test run first.

Scheduling with systemd (Linux)

For more robust scheduling with automatic logging via journalctl and Persistent=true (runs missed jobs on next boot):

# /etc/systemd/system/digest-pipeline.service
[Unit]
Description=Research Digest Pipeline
After=network-online.target

[Service]
Type=oneshot
WorkingDirectory=/path/to/Daylight-know
ExecStart=/path/to/bin/digest-pipeline
User=your-username

# /etc/systemd/system/digest-pipeline.timer
[Unit]
Description=Run Research Digest Pipeline weekdays

[Timer]
OnCalendar=Mon..Fri *-*-* 12:00:00 UTC
Persistent=true

[Install]
WantedBy=timers.target

Enable and inspect the timer:

sudo systemctl enable --now digest-pipeline.timer
sudo systemctl status digest-pipeline.timer
journalctl -u digest-pipeline.service  # view logs

Tech Stack

Core Libraries

Library Version Role
litellm >=1.30 Universal LLM gateway — routes to OpenAI, Anthropic, Google, Ollama, Azure, and 100+ other providers
ChromaDB >=0.5 Local vector store with persistent SQLite + HNSW indexing for chunk storage and retrieval
pypdf >=5.0 PDF text extraction
Chonkie >=1.0 Semantic text chunking with the potion-base-32M embedding model
Pydantic >=2.0 Data validation and schema definitions (Paper, TextChunk models)
pydantic-settings >=2.0 Configuration management from environment variables and .env files
Jinja2 >=3.1 HTML and plaintext email templating
Mistune >=3.0 Markdown-to-HTML conversion for email rendering
Rich >=13.0 Interactive setup wizard TUI (console, tables, panels, prompts)
Requests >=2.31 HTTP client for HuggingFace and OpenAlex APIs
python-dotenv >=1.0 .env file loading

Dev & Testing

Library Role
pytest Test runner with custom markers (unit, integration, e2e, network)
Ruff Linter and formatter (target: Python 3.10, line length: 100)
FastAPI + Uvicorn Stub LLM server for integration tests
aiosmtpd Fake SMTP server for email integration tests
reportlab Test PDF fixture generation
Hatchling Build backend

Project Structure

src/digest_pipeline/
├── __init__.py          # Package version
├── arxiv_topics.py      # Full arXiv taxonomy (150+ topics) with search/validate
├── chunker.py           # Semantic text chunking via Chonkie
├── config.py            # Centralized settings via pydantic-settings
├── emailer.py           # HTML/plaintext email formatting and SMTP dispatch
├── extractor.py         # PDF text extraction via pypdf
├── fetcher.py           # arXiv paper fetching with retry-based PDF download
├── hf_fetcher.py        # HuggingFace Daily Papers fetching & deduplication
├── llm_utils.py         # Shared LLM call utilities (backoff, structured output)
├── openalex_fetcher.py  # OpenAlex paper fetching with field filtering
├── pipeline.py          # Main orchestrator and CLI entry point
├── postprocessor.py     # LLM post-processing (ELI5, implications & critiques)
├── prompts/             # LLM prompt templates (Markdown)
│   ├── summarizer.md    # Summarization prompt
│   ├── eli5.md          # ELI5 plain-language explanation prompt
│   ├── implications.md  # Practical implications prompt
│   ├── critiques.md     # Structured critique prompt
│   └── ranker.md        # Interest-based relevance scoring prompt
├── ranker.py            # Interest-based paper ranking (keyword + LLM scoring)
├── seen_papers.py       # Cross-day deduplication tracking
├── setup.py             # Interactive setup wizard
├── summarizer.py        # LLM-powered summarization with backoff
├── topics_cli.py        # Topic browser CLI
└── vectorstore.py       # ChromaDB vector store for chunk storage

tests/
├── unit/                # Fast tests, no external dependencies
├── integration/         # Tests with real local dependencies (ChromaDB, pypdf)
├── e2e/                 # Full pipeline runs
└── fixtures/            # Test PDFs and data

Running Tests

# Run all unit tests
pytest tests/unit/ -q

# Run integration tests (requires local dependencies)
pytest tests/integration/ -q

# Run end-to-end tests (requires LLM API key)
pytest tests/e2e/ -q

# Run everything
pytest

# Run with markers
pytest -m unit          # fast, no external deps
pytest -m integration   # real local deps
pytest -m e2e           # full pipeline
pytest -m network       # requires internet

Design Documentation

The system is specified using the EARS (Easy Approach to Requirements Syntax) methodology. See docs/ears-design-document.md for the full requirements specification including data schema definitions.

License

Apache-2.0
