A daily digest system that fetches academic papers from arXiv, HuggingFace Daily Papers, and OpenAlex, then extracts and semantically chunks their text, stores embeddings in a vector database, and delivers an LLM-generated summary via email. Supports 150+ arXiv categories, 26 OpenAlex academic fields, and any LLM provider through litellm.
```
arXiv RSS ─────────────┐
  (fetch pool)         │
                       ├──> Dedup & Seen Filter ──> Extract / Chunk / Store ALL
HuggingFace Daily ─────┤                                        |
  (optional)           │                                        v
                       │                          Vector Storage (ChromaDB)
OpenAlex API ──────────┤                           [full corpus retained]
  (optional,           │                                        |
   fetch pool) ────────┘                                        v
                                                  Interest-Based Ranking
                                                  (keyword + LLM scoring)
                                                                |
                                                                v
                                                        Top N for Digest
                                                                |
                                                                v
                                                  LLM Summarization (litellm)
                                                                |
                                                                v
                                   Post-Processing (ELI5, Implications & Critiques)
                                                                |
                                                                v
                                                  Email Dispatch (SMTP/STARTTLS)
```
- Fetch — Fetches papers from the arXiv RSS feed for your configured topics. When interest-based ranking is configured, a larger "fetch pool" is retrieved (e.g., 200 papers). PDFs are downloaded in parallel. Optionally also fetches from HuggingFace Daily Papers (community-upvoted) and OpenAlex (broad academic coverage across 26 fields). Papers are deduplicated across sources by DOI and filtered against a 30-day seen history.
- Extract — Uses pypdf to pull raw text from each PDF. Image-only documents are flagged as unparseable and stored separately. Papers without PDFs (HuggingFace, OpenAlex) use their abstract.
- Chunk — Splits extracted text into semantic segments using Chonkie's `SemanticChunker` with the `potion-base-32M` embedding model.
- Store — Persists chunks with embeddings and metadata (title, authors, URL, date, chunk index) in ChromaDB for future retrieval. All fetched papers are stored — not just the ones selected for the digest — so the vector store builds a comprehensive corpus over time. Steps 2-4 run in parallel (configurable workers) for faster ingestion.
- Rank — When interest-based ranking is configured, papers are scored by keyword match + LLM relevance and the top N per source are selected for the digest (e.g., top 50 arXiv, top 20 OpenAlex). Papers that don't make the digest cut are still in the vector store from step 4.
- Summarize — Sends paper text to an LLM via litellm, producing per-paper structured summaries. Supports OpenAI, Anthropic, Google, Cohere, Ollama, Azure, and 100+ other providers.
- Post-process — Optionally generates ELI5 (plain-language explanations), practical implications (who benefits, how to apply), and structured critiques (strengths, weaknesses, open questions) via separate LLM calls. Enabled post-processors run in parallel by default.
- Email — Delivers the digest as a styled HTML + plaintext email via SMTP with STARTTLS, or prints to console in dry-run mode.
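The cross-source deduplication and 30-day seen filter from the fetch step can be sketched in plain Python. This is an illustrative sketch, not the project's actual `seen_papers` implementation; the dict keys and function name are assumptions:

```python
from datetime import datetime, timedelta, timezone

def dedup_and_filter(papers, seen, now=None):
    """Sketch: dedupe papers by DOI across sources, then drop papers
    already included in a digest within the last 30 days.

    papers: list of dicts with a 'doi' key (may be empty for some sources).
    seen:   mapping of DOI -> datetime the paper was last included.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=30)
    kept, kept_dois = [], set()
    for paper in papers:
        doi = (paper.get("doi") or "").lower()
        if doi:
            if doi in kept_dois:
                continue  # duplicate of a paper from another source
            last = seen.get(doi)
            if last is not None and last > cutoff:
                continue  # appeared in a digest within the last 30 days
            kept_dois.add(doi)
        kept.append(paper)  # papers without a DOI always pass through
    return kept
```

Note that DOIs are lowercased before comparison, since DOI matching is case-insensitive.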
Each digest email contains per-paper sections with:
- Summary — Key findings and contributions
- Practical Implications — Who benefits and how to apply the research
- Critique — Methodological strengths, weaknesses, and open questions
- Metadata — Authors, categories/fields of study, source, and direct links
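Assembling and dispatching such a plaintext + HTML email can be sketched with Python's standard library alone. This is a minimal sketch of the pattern, not the project's `emailer` module; the function names are assumptions:

```python
import smtplib
from email.message import EmailMessage

def build_digest_email(subject, text_body, html_body, sender, recipient):
    """Build a multipart/alternative message: plaintext plus styled HTML."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(text_body)                      # plaintext fallback part
    msg.add_alternative(html_body, subtype="html")  # styled HTML part
    return msg

def send_digest(msg, host, port, user, password):
    """Dispatch over SMTP with STARTTLS, matching the pipeline's defaults."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)
```

Mail clients that cannot render HTML fall back to the plaintext part automatically.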
- Python 3.10+
- An API key for any litellm-supported LLM provider
- SMTP credentials for email delivery (e.g., a Gmail App Password)
```bash
git clone https://github.com/Blevene/Daylight-know.git
cd Daylight-know
pip install -e ".[dev]"
```

The setup wizard walks you through configuring everything interactively, with topic browsing, connection testing, and `.env` file generation:

```bash
digest-pipeline setup
```

The wizard will:
- Help you browse and select arXiv topics — search by keyword, browse by group (CS, Math, Physics, etc.), or type codes directly
- Configure your LLM provider and optionally test the connection
- Configure SMTP email settings and optionally test delivery
- Set up ChromaDB storage location
- Toggle optional features (ELI5, implications, critiques)
- Write a complete `.env` file (with backup if one exists)
Copy the example environment file and edit it:
```bash
cp .env.example .env
```

Not sure which topics to subscribe to? Use the topic browser:

```bash
# List all topic groups with counts
digest-pipeline topics list

# Search by keyword
digest-pipeline topics search "machine learning"
digest-pipeline topics search "quantum"
digest-pipeline topics search "natural language"

# List all topics in a group
digest-pipeline topics group cs
digest-pipeline topics group physics

# Validate topic codes
digest-pipeline topics validate cs.AI cs.LG stat.ML
```

All settings are configured via environment variables in `.env`:
| Variable | Description | Default |
|---|---|---|
| `ARXIV_TOPICS` | Comma-separated arXiv category codes | `cs.AI,cs.LG` |
| `ARXIV_MAX_RESULTS` | Max papers to include in digest | `50` |
| `ARXIV_FETCH_POOL` | Papers to fetch before ranking (when ranking is enabled) | `200` |
| Variable | Description | Default |
|---|---|---|
| `LLM_MODEL` | litellm model string | `openai/gpt-4o-mini` |
| `LLM_API_KEY` | API key for your LLM provider | required |
| `LLM_MAX_TOKENS` | Max tokens for LLM responses | `32768` |
| `LLM_API_BASE` | Custom API base URL (optional) | — |
The `LLM_MODEL` uses litellm's `provider/model` format. Examples:

- `openai/gpt-4o-mini` — OpenAI
- `anthropic/claude-sonnet-4-20250514` — Anthropic
- `gemini/gemini-pro-latest` — Google (latest Pro model)
- `ollama/llama3` — Local Ollama
- `azure/gpt-4o` — Azure OpenAI
See litellm providers for the full list.
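As a sketch, a per-paper summarization call through litellm might look like the following. The prompt wording, `build_messages` helper, and `summarize` signature are illustrative assumptions, not the project's actual code; only `litellm.completion` and the `provider/model` string are litellm's real API:

```python
def build_messages(title, text, max_chars=20000):
    """Assemble a chat prompt for one paper (prompt wording is illustrative)."""
    return [
        {"role": "system",
         "content": "Summarize the key findings and contributions of this paper."},
        {"role": "user", "content": f"Title: {title}\n\n{text[:max_chars]}"},
    ]

def summarize(title, text, model="openai/gpt-4o-mini", max_tokens=32768):
    """One LLM call via litellm; the provider is chosen by the model prefix."""
    import litellm  # imported lazily so the sketch loads without the dependency
    resp = litellm.completion(model=model,
                              messages=build_messages(title, text),
                              max_tokens=max_tokens)
    return resp.choices[0].message.content
```

Swapping providers is just a matter of changing the `model` string, e.g. `ollama/llama3` for a local model.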
| Variable | Description | Default |
|---|---|---|
| `SMTP_HOST` | SMTP server hostname | `smtp.gmail.com` |
| `SMTP_PORT` | SMTP port (STARTTLS) | `587` |
| `SMTP_USER` | SMTP username (usually your email) | required |
| `SMTP_PASSWORD` | SMTP password or app password | required |
| `EMAIL_FROM` | Sender email address | required |
| `EMAIL_TO` | Recipient email address | required |
Gmail users: you'll need an App Password. Go to Google Account > Security > 2-Step Verification > App Passwords, generate one for "Mail", and use it as `SMTP_PASSWORD`.
| Variable | Description | Default |
|---|---|---|
| `CHROMA_PERSIST_DIR` | Directory for ChromaDB storage | `./data/chromadb` |
| `CHROMA_COLLECTION` | Collection name | `research_digest` |
| Variable | Description | Default |
|---|---|---|
| `DRY_RUN` | Print digest to console instead of emailing | `true` |
| `PDF_DOWNLOAD_MAX_RETRIES` | PDF download retry attempts | `3` |
| `PDF_DOWNLOAD_WORKERS` | Parallel PDF download threads | `8` |
| `PDF_ARCHIVE_DIR` | Directory to archive PDFs (empty = disabled) | — |
| Variable | Description | Default |
|---|---|---|
| `POSTPROCESSING_IMPLICATIONS` | Generate practical implications | `true` |
| `POSTPROCESSING_CRITIQUES` | Generate structured critiques | `true` |
| `POSTPROCESSING_ELI5` | Generate plain-language ELI5 explanations | `true` |
| Variable | Description | Default |
|---|---|---|
| `PIPELINE_INGEST_WORKERS` | Parallel threads for extract/chunk/store | `4` |
| `PIPELINE_POSTPROCESS_PARALLEL` | Run post-processing LLM calls concurrently | `true` |
| Variable | Description | Default |
|---|---|---|
| `HUGGINGFACE_ENABLED` | Include HuggingFace community papers | `false` |
| `HUGGINGFACE_TOKEN` | HuggingFace API token (for rate limits) | — |
| `HUGGINGFACE_MAX_RESULTS` | Max HuggingFace papers | `20` |
Papers from HuggingFace are deduplicated against arXiv — those already fetched from arXiv appear as "trending" in a sidebar rather than being processed twice.
| Variable | Description | Default |
|---|---|---|
| `OPENALEX_ENABLED` | Include OpenAlex academic papers | `false` |
| `OPENALEX_API_KEY` | OpenAlex API key (required since Feb 2026) | — |
| `OPENALEX_EMAIL` | Email for polite pool (recommended) | — |
| `OPENALEX_MAX_RESULTS` | Papers to include in digest | `20` |
| `OPENALEX_QUERY` | Search query (used when ranking is off) | `machine learning` |
| `OPENALEX_FIELDS` | JSON array of academic fields to filter | — |
Valid fields: Computer Science, Mathematics, Physics and Astronomy,
Chemistry, Engineering, Medicine, Psychology, and 19 others
(26 total). See the setup wizard for the full list.
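For example, restricting OpenAlex to two fields in `.env` (the field names must match the taxonomy exactly; the two chosen here are just an illustration):

```
OPENALEX_ENABLED=true
OPENALEX_FIELDS=["Computer Science","Mathematics"]
```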
When configured, the pipeline fetches a larger "fetch pool" of papers from each source and uses LLM scoring to select only the most relevant ones for your digest. Ranking applies to both arXiv and OpenAlex papers.
Pipeline-wide settings (apply to all sources):
| Variable | Description | Default |
|---|---|---|
| `INTEREST_PROFILE` | Natural language description of your research interests | — |
| `INTEREST_KEYWORDS` | JSON array of boost keywords | — |
OpenAlex-specific overrides (take priority over pipeline-wide for OpenAlex):
| Variable | Description | Default |
|---|---|---|
| `OPENALEX_INTEREST_PROFILE` | OpenAlex-specific interest profile | — |
| `OPENALEX_INTEREST_KEYWORDS` | OpenAlex-specific boost keywords | — |
| `OPENALEX_FETCH_POOL` | OpenAlex papers to fetch before ranking | `100` |
How it works: The ranker scores each paper using two signals:
- Keyword boost — +2 points per keyword match in title or abstract
- LLM scoring — Papers are sent in batches of 20 to your configured LLM, which rates each paper 1-10 against your interest profile
For arXiv, `ARXIV_FETCH_POOL` papers are fetched and ranked down to `ARXIV_MAX_RESULTS`. For OpenAlex, `OPENALEX_FETCH_POOL` papers are fetched and ranked down to `OPENALEX_MAX_RESULTS`. If the LLM call fails, keyword-only ranking is used as a fallback. If neither profile nor keywords are configured, all fetched papers pass through (backward compatible).
Only the top-ranked papers proceed to LLM summarization — the ranking step saves API costs. All fetched papers are still stored in the vector store for future retrieval.
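The keyword-boost signal and the keyword-only fallback can be sketched as follows. This is an illustrative sketch, not the project's `ranker.py`; the dict keys and function names are assumptions, but the scoring rule (+2 per keyword match in title or abstract) follows the description above:

```python
def keyword_score(paper, keywords):
    """+2 points per interest keyword found in the title or abstract."""
    haystack = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
    return sum(2 for kw in keywords if kw.lower() in haystack)

def rank_papers(papers, keywords, llm_scores=None, top_n=50):
    """Combine keyword boosts with optional 1-10 LLM relevance ratings.

    llm_scores maps paper id -> LLM rating; when the LLM call fails it is
    None and ranking falls back to keywords alone, as the pipeline does.
    """
    def total(paper):
        score = keyword_score(paper, keywords)
        if llm_scores:
            score += llm_scores.get(paper.get("id"), 0)
        return score
    return sorted(papers, key=total, reverse=True)[:top_n]
```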
Example:

```
INTEREST_PROFILE="AI applications including world models, frontier AI methods, memory and retrieval systems, and cybersecurity"
INTEREST_KEYWORDS=["world model","RAG","knowledge graph","cybersecurity","LLM","reasoning","agent"]
ARXIV_FETCH_POOL=200
OPENALEX_FETCH_POOL=100
```

```bash
# Dry-run (prints digest to console)
digest-pipeline --dry-run

# With specific topics (overrides .env)
digest-pipeline --dry-run --topics cs.CL cs.CV

# Verbose logging
digest-pipeline --dry-run -v

# Production mode (sends email)
# Set DRY_RUN=false in .env, then:
digest-pipeline
```

You can also use the explicit `run` subcommand:

```bash
digest-pipeline run --dry-run --topics cs.AI
```

The pipeline fetches papers from the last 24 hours, so running it once daily on weekdays is ideal. arXiv publishes new submissions Sun-Thu around 20:00 UTC and does not publish on weekends, so a Monday-Friday schedule catches every batch.
1. Find your executable path:

```bash
which digest-pipeline
```

2. Edit your crontab:

```bash
crontab -e
```

3. Add the pipeline and log rotation:

```
# Research digest pipeline - Mon-Fri at 7:00 AM EST (12:00 UTC)
0 12 * * 1-5 cd /path/to/Daylight-know && /path/to/bin/digest-pipeline >> /path/to/Daylight-know/logs/digest-pipeline.log 2>&1

# Log rotation - keep previous week's log, truncate current (Monday midnight)
0 0 * * 1 cp /path/to/Daylight-know/logs/digest-pipeline.log /path/to/Daylight-know/logs/digest-pipeline.log.prev && : > /path/to/Daylight-know/logs/digest-pipeline.log
```

Replace `/path/to/Daylight-know` with your project directory and `/path/to/bin/digest-pipeline` with the output of `which digest-pipeline`.
Important:

- Working directory: The `cd` is required so the pipeline finds your `.env` file and the relative ChromaDB storage path resolves correctly.
- Logs directory: Create `logs/` in the project root before the first run: `mkdir -p logs`
- Timing: 12:00 UTC = 7:00 AM EST. Adjust for your timezone. Morning runs give the best coverage since arXiv publishes the previous evening.
- Log rotation: The Monday midnight job copies the current log to `.log.prev` and truncates the current file, keeping two weeks of history.
4. Verify it works:

```bash
# Test the exact command cron will run
cd /path/to/Daylight-know && digest-pipeline
```

5. Set production mode:

Make sure `DRY_RUN=false` in your `.env` before the first scheduled run, and verify your SMTP credentials work with a manual test run first.
For more robust scheduling with automatic logging via journalctl and `Persistent=true` (runs missed jobs on next boot):

```
# /etc/systemd/system/digest-pipeline.service
[Unit]
Description=Research Digest Pipeline
After=network-online.target

[Service]
Type=oneshot
WorkingDirectory=/path/to/Daylight-know
ExecStart=/path/to/bin/digest-pipeline
User=your-username
```

```
# /etc/systemd/system/digest-pipeline.timer
[Unit]
Description=Run Research Digest Pipeline weekdays

[Timer]
OnCalendar=Mon..Fri *-*-* 12:00:00 UTC
Persistent=true

[Install]
WantedBy=timers.target
```

```bash
sudo systemctl enable --now digest-pipeline.timer
sudo systemctl status digest-pipeline.timer
journalctl -u digest-pipeline.service   # view logs
```

| Library | Version | Role |
|---|---|---|
| litellm | >=1.30 | Universal LLM gateway — routes to OpenAI, Anthropic, Google, Ollama, Azure, and 100+ other providers |
| ChromaDB | >=0.5 | Local vector store with persistent SQLite + HNSW indexing for chunk storage and retrieval |
| pypdf | >=5.0 | PDF text extraction |
| Chonkie | >=1.0 | Semantic text chunking with the potion-base-32M embedding model |
| Pydantic | >=2.0 | Data validation and schema definitions (Paper, TextChunk models) |
| pydantic-settings | >=2.0 | Configuration management from environment variables and .env files |
| Jinja2 | >=3.1 | HTML and plaintext email templating |
| Mistune | >=3.0 | Markdown-to-HTML conversion for email rendering |
| Rich | >=13.0 | Interactive setup wizard TUI (console, tables, panels, prompts) |
| Requests | >=2.31 | HTTP client for HuggingFace and OpenAlex APIs |
| python-dotenv | >=1.0 | .env file loading |
| Library | Role |
|---|---|
| pytest | Test runner with custom markers (unit, integration, e2e, network) |
| Ruff | Linter and formatter (target: Python 3.10, line length: 100) |
| FastAPI + Uvicorn | Stub LLM server for integration tests |
| aiosmtpd | Fake SMTP server for email integration tests |
| reportlab | Test PDF fixture generation |
| Hatchling | Build backend |
```
src/digest_pipeline/
├── __init__.py           # Package version
├── arxiv_topics.py       # Full arXiv taxonomy (150+ topics) with search/validate
├── chunker.py            # Semantic text chunking via Chonkie
├── config.py             # Centralized settings via pydantic-settings
├── emailer.py            # HTML/plaintext email formatting and SMTP dispatch
├── extractor.py          # PDF text extraction via pypdf
├── fetcher.py            # arXiv paper fetching with retry-based PDF download
├── hf_fetcher.py         # HuggingFace Daily Papers fetching & deduplication
├── llm_utils.py          # Shared LLM call utilities (backoff, structured output)
├── openalex_fetcher.py   # OpenAlex paper fetching with field filtering
├── pipeline.py           # Main orchestrator and CLI entry point
├── postprocessor.py      # LLM post-processing (ELI5, implications & critiques)
├── prompts/              # LLM prompt templates (Markdown)
│   ├── summarizer.md     # Summarization prompt
│   ├── eli5.md           # ELI5 plain-language explanation prompt
│   ├── implications.md   # Practical implications prompt
│   ├── critiques.md      # Structured critique prompt
│   └── ranker.md         # Interest-based relevance scoring prompt
├── ranker.py             # Interest-based paper ranking (keyword + LLM scoring)
├── seen_papers.py        # Cross-day deduplication tracking
├── setup.py              # Interactive setup wizard
├── summarizer.py         # LLM-powered summarization with backoff
├── topics_cli.py         # Topic browser CLI
└── vectorstore.py        # ChromaDB vector store for chunk storage
```

```
tests/
├── unit/          # Fast tests, no external dependencies
├── integration/   # Tests with real local dependencies (ChromaDB, pypdf)
├── e2e/           # Full pipeline runs
└── fixtures/      # Test PDFs and data
```
```bash
# Run all unit tests
pytest tests/unit/ -q

# Run integration tests (requires local dependencies)
pytest tests/integration/ -q

# Run end-to-end tests (requires LLM API key)
pytest tests/e2e/ -q

# Run everything
pytest

# Run with markers
pytest -m unit           # fast, no external deps
pytest -m integration    # real local deps
pytest -m e2e            # full pipeline
pytest -m network        # requires internet
```

The system is specified using the EARS (Easy Approach to Requirements Syntax) methodology. See docs/ears-design-document.md for the full requirements specification, including data schema definitions.
Apache-2.0