feat: Add Telegram file/image support with vision, OCR, and document-to-context pipeline#31
Copilot wants to merge 15 commits into
@copilot use ocrs (https://github.com/robertknight/ocrs) for image OCR instead of Tesseract
…rs OCR

Co-authored-by: chinkan <16433287+chinkan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/chinkan/RustFox/sessions/9c502a20-dd1e-47bb-8b4d-34cf2f26c7f1
Add MessageContent enum supporting Text(String) and Parts(Vec<ContentPart>) to replace Option<String> in ChatMessage.content. Backwards-compatible: Text variant serializes as plain JSON string, Parts as JSON array.

- Add ContentPart, ImageUrlContent, MessageContent types to llm.rs
- Update ChatMessage.content: Option<String> → Option<MessageContent>
- Fix all usage sites in agent.rs, memory/conversations.rs, memory/query_rewriter.rs, memory/summarizer.rs, memory/rag.rs
- DB storage continues to use extracted plain text via as_text()
- All 113 tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
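To make the text-or-parts model concrete, here is a minimal std-only sketch of the two types and the `as_text()` extraction. The field names, the newline join, and the `String` return type are assumptions; the PR also mentions an `ImageUrlContent` type and serde-based serialization, both of which are left out here.

```rust
/// One part of a multimodal message. Hypothetical shape: the PR's real
/// `ContentPart` wraps an `ImageUrlContent` type that is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text { text: String },
    ImageUrl { url: String },
}

/// Message content: either a plain string or a list of parts.
#[derive(Debug, Clone, PartialEq)]
pub enum MessageContent {
    Text(String),
    Parts(Vec<ContentPart>),
}

impl MessageContent {
    /// Build plain-text content.
    pub fn from_text(text: impl Into<String>) -> Self {
        MessageContent::Text(text.into())
    }

    /// Extract only the textual content, skipping image parts.
    /// Joining text parts with a newline is an assumption of this sketch.
    pub fn as_text(&self) -> String {
        match self {
            MessageContent::Text(s) => s.clone(),
            MessageContent::Parts(parts) => parts
                .iter()
                .filter_map(|p| match p {
                    ContentPart::Text { text } => Some(text.as_str()),
                    ContentPart::ImageUrl { .. } => None,
                })
                .collect::<Vec<_>>()
                .join("\n"),
        }
    }
}
```

Because `as_text()` drops image parts, the database layer can keep storing plain strings while the LLM request carries the full multimodal payload.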
- Add `supports_vision` bool field to `OpenRouterConfig` (default false)
- Add `OcrConfig` struct with `model_dir` (default $HOME/.cache/ocrs)
- Add `ocr` field to `Config` with serde default
- Add 3 tests: supports_vision defaults false, parses true, ocr default dir
- Update config.example.toml with comments for new fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
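The `$HOME/.cache/ocrs` default could be computed roughly as follows. This is a std-only sketch: `default_model_dir` is a hypothetical helper, the relative-path fallback when `HOME` is unset is an assumption, and the serde wiring mentioned in the commit is omitted.

```rust
use std::env;
use std::path::PathBuf;

/// OCR settings; only `model_dir` is named in the commit message.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrConfig {
    pub model_dir: PathBuf,
}

/// Hypothetical helper: build the default cache directory from a home dir.
/// Falling back to a relative `.cache/ocrs` is an assumption of this sketch.
pub fn default_model_dir(home: Option<&str>) -> PathBuf {
    match home {
        Some(h) if !h.is_empty() => PathBuf::from(h).join(".cache").join("ocrs"),
        _ => PathBuf::from(".cache").join("ocrs"),
    }
}

impl Default for OcrConfig {
    fn default() -> Self {
        // Read HOME at construction time so the default tracks the environment.
        let home = env::var("HOME").ok();
        OcrConfig {
            model_dir: default_model_dir(home.as_deref()),
        }
    }
}
```

Note that Rust's `PathBuf` does not expand a leading `~`, so a configured value such as `"~/.cache/ocrs"` would need explicit expansion if it is used verbatim.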
- Add src/file_processor/mod.rs with process_attachments() entry point
- Image handling: base64 vision parts (supports_vision=true) or OCR via ocrs
- OCR: auto-downloads text-detection/recognition .rten models on first use
- PDF: text extraction via pdf_extract::extract_text_from_mem
- DOCX: text extraction by traversing docx_rs document children
- Long context (>6000 chars): chunk+store in MemoryStore, RAG-retrieve
- Unit tests for chunk_text (4 cases)
- Register module via mod file_processor; in src/main.rs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
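The commit names a `chunk_text` function and a 6000-character threshold but does not show either. One possible shape, assuming fixed-size chunks counted in characters rather than bytes (the actual signature and chunking strategy may differ):

```rust
/// Threshold from the commit message: longer text goes through chunk + RAG.
pub const LONG_CONTEXT_CHARS: usize = 6000;

/// True when `text` is too long to inline directly into the prompt.
pub fn needs_chunking(text: &str) -> bool {
    text.chars().count() > LONG_CONTEXT_CHARS
}

/// Split `text` into chunks of at most `max_chars` characters.
/// Works on `char`s so multi-byte UTF-8 sequences are never split.
/// Assumed signature; the PR's `chunk_text` is not shown.
pub fn chunk_text(text: &str, max_chars: usize) -> Vec<String> {
    if text.is_empty() || max_chars == 0 {
        return Vec::new();
    }
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|c| c.iter().collect::<String>())
        .collect()
}
```

Slicing by byte offsets instead (for example `&text[..6000]`) would panic whenever the cut lands inside a multi-byte character, which is why this sketch iterates over `char`s.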
…comment on pdf unwrap_or_default

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace early-return for non-text messages with caption fallback
- Add download_telegram_file helper (creates temp dir, fetches via Telegram API, detects MIME with infer crate)
- Add classify_attachment_kind and mime_from_extension helpers
- Collect photo and document attachments into Vec<Attachment>
- Pass attachments to IncomingMessage instead of empty vec
- Clean up temp dir after agent processing completes
- Add 5 unit tests for classify_attachment_kind

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
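The two helpers named above are not shown in the conversation. A plausible sketch, assuming both work on MIME strings and file extensions only (the real signatures may also take the filename or sniffed bytes):

```rust
/// Attachment categories listed in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttachmentKind {
    Image,
    Pdf,
    Docx,
    Other,
}

const DOCX_MIME: &str =
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

/// Fallback MIME lookup by extension, for when content sniffing finds nothing.
/// The extension table here is illustrative, not the PR's actual list.
pub fn mime_from_extension(ext: &str) -> &'static str {
    match ext.to_ascii_lowercase().as_str() {
        "jpg" | "jpeg" => "image/jpeg",
        "png" => "image/png",
        "gif" => "image/gif",
        "webp" => "image/webp",
        "pdf" => "application/pdf",
        "docx" => DOCX_MIME,
        _ => "application/octet-stream",
    }
}

/// Map a MIME type onto the processing route for the attachment.
pub fn classify_attachment_kind(mime: &str) -> AttachmentKind {
    if mime.starts_with("image/") {
        AttachmentKind::Image
    } else if mime == "application/pdf" {
        AttachmentKind::Pdf
    } else if mime == DOCX_MIME {
        AttachmentKind::Docx
    } else {
        AttachmentKind::Other
    }
}
```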
…match

Co-authored-by: chinkan <16433287+chinkan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/chinkan/RustFox/sessions/9c502a20-dd1e-47bb-8b4d-34cf2f26c7f1
Switched OCR from tesseract to `ocrs`.

Warning: Firewall rules blocked me from connecting to one or more addresses.
Pull request overview
Adds Telegram media (photos + documents) support by downloading attachments to temp storage, extracting usable text (OCR/PDF/DOCX) and/or sending images as multimodal content to vision-capable models, with long-document fallback to the existing knowledge/vector store.
Changes:
- Extend platform message model with typed attachments and add Telegram download + cleanup logic for photos/documents.
- Promote LLM `ChatMessage.content` to a backwards-compatible multimodal representation (text or parts).
- Add a file processing pipeline (vision base64, OCR via `ocrs`, PDF/DOCX extraction, long-context chunking + RAG retrieval) and wire it into the agent.
Reviewed changes
Copilot reviewed 17 out of 19 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| src/supervisor/backend/reasoning.rs | Populates new attachments field for supervisor-originated messages. |
| src/platform/telegram.rs | Downloads Telegram photos/documents into temp dirs, classifies attachments, cleans up after processing, adds unit tests. |
| src/platform/mod.rs | Introduces AttachmentKind, Attachment, and IncomingMessage.attachments. |
| src/memory/summarizer.rs | Adapts summarizer to new MessageContent and as_text() extraction. |
| src/memory/rag.rs | Updates RAG snippet building to use MessageContent::as_text(). |
| src/memory/query_rewriter.rs | Updates query rewriter to build prompts from MessageContent::as_text(). |
| src/memory/conversations.rs | Stores/retrieves text-only content while mapping DB rows to MessageContent::Text. |
| src/llm.rs | Adds multimodal content types (MessageContent, ContentPart) with backward-compatible serialization + tests. |
| src/lib.rs | Exposes new file_processor module. |
| src/learning.rs | Updates learning flows to use MessageContent consistently. |
| src/file_processor/mod.rs | New attachment processing module: vision/OCR, PDF/DOCX extraction, chunking + knowledge-store indexing/retrieval. |
| src/config.rs | Adds openrouter.supports_vision and [ocr].model_dir defaults + tests. |
| src/agent.rs | Integrates attachment processing into message construction and persistence behavior. |
| src/agent_prompt.rs | Adjusts prompt estimation/compaction logic for MessageContent. |
| docs/plans/2026-03-25-telegram-file-image-support.md | Adds an implementation plan documenting the intended architecture and dependencies. |
| config.example.toml | Documents new supports_vision and [ocr] settings. |
| Cargo.toml | Adds new crates for OCR/image/PDF/DOCX/mime/base64 support. |
| Cargo.lock | Locks new transitive dependencies for added crates. |
| .gitignore | Updates ignored DB filename pattern to rustfox.db*. |
```rust
let dest_name = match filename {
    Some(n) => n.to_string(),
    None => format!("{}.{}", uuid::Uuid::new_v4(), ext),
};
let dest_path = dest_dir.join(&dest_name);
```
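Since `filename` originates from the sender, joining it onto `dest_dir` unchanged deserves care: `Path::join` replaces the base entirely when given an absolute path, and `..` components can climb out of the temp directory. A minimal sketch of one mitigation, using a hypothetical `safe_file_name` helper that is not part of the PR:

```rust
use std::path::Path;

/// Hypothetical helper: keep only the final component of a client-supplied
/// name, so directory parts and `..` segments cannot affect the destination.
/// Returns `None` when no usable file name remains.
pub fn safe_file_name(name: &str) -> Option<String> {
    Path::new(name)
        .file_name()
        .and_then(|n| n.to_str())
        .map(|n| n.to_string())
}
```

When this returns `None`, a caller could fall back to the generated UUID name that the excerpt already uses for the no-filename case.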
```rust
let ext = Path::new(&file.path)
    .extension()
    .and_then(|e| e.to_str())
    .unwrap_or("bin");
```
```rust
    dest_dir: &Path,
    filename: Option<&str>,
) -> Result<(PathBuf, String)> {
    std::fs::create_dir_all(dest_dir).context("Failed to create temp directory")?;
```
```rust
    .await
    .context("Failed to download file from Telegram")?;

std::fs::write(&dest_path, &bytes).context("Failed to write downloaded file")?;
```
```rust
let response = reqwest::get(url)
    .await
    .context("Failed to fetch OCR model")?;
let bytes = response
    .bytes()
    .await
    .context("Failed to read OCR model bytes")?;
```
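The download-on-first-use caching around this fetch is not visible in the excerpt. The sketch below shows one way it could work, with the network call abstracted behind a closure so the example stays std-only. `ensure_cached` and the `.part` temp-file rename are assumptions of this sketch, not the PR's code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return the cached model path, calling `fetch` only when the file is missing.
/// `fetch` stands in for the HTTP download shown in the excerpt above.
pub fn ensure_cached<F>(dir: &Path, file_name: &str, fetch: F) -> io::Result<PathBuf>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    let dest = dir.join(file_name);
    if dest.exists() {
        return Ok(dest);
    }
    fs::create_dir_all(dir)?;
    let bytes = fetch()?;
    // Write to a temporary name first, then rename, so an interrupted
    // download is never mistaken for a complete model file.
    let tmp = dir.join(format!("{}.part", file_name));
    fs::write(&tmp, &bytes)?;
    fs::rename(&tmp, &dest)?;
    Ok(dest)
}
```

In an async context the real download would still be awaited; only the decision of whether to download at all is illustrated here.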
```rust
// Save a text-only version to DB (avoid storing base64 image data in message history)
let db_content = if incoming.attachments.is_empty() {
    user_msg_content.clone()
} else {
    let mut db_text = incoming.text.clone();
    if !attachment_text.is_empty() {
        db_text.push_str("\n\n[Attachment processed]");
    }
    MessageContent::from_text(db_text)
};
```
```toml
# Used to extract text from images when supports_vision = false.
# Models are downloaded automatically on first use.
# [ocr]
# model_dir = "~/.cache/ocrs"  # Where OCR model files are cached (downloaded on first use)
```
```rust
tokio::task::spawn_blocking(move || -> Result<String> {
    let detection_model =
        rten::Model::load_file(&det_path).context("Failed to load OCR detection model")?;
    let recognition_model =
        rten::Model::load_file(&rec_path).context("Failed to load OCR recognition model")?;
```
```rust
match extract_pdf_text(&attachment.path) {
    Ok(text) => {
        let ctx = handle_context_length(&text, fname, user_query, memory).await;
        text_parts.push(ctx);
    }
    Err(e) => {
        tracing::warn!("PDF extraction failed: {}", e);
        text_parts.push(format!("[PDF processing failed: {}]", e));
    }
}
```
```rust
match extract_docx_text(&attachment.path) {
    Ok(text) => {
        let ctx = handle_context_length(&text, fname, user_query, memory).await;
        text_parts.push(ctx);
    }
    Err(e) => {
        tracing::warn!("DOCX extraction failed: {}", e);
        text_parts.push(format!("[DOCX processing failed: {}]", e));
    }
}
```
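Both branches hand the extracted text to `handle_context_length`, whose body is not shown. The sketch below illustrates the general idea only: short text is inlined, long text is chunked and the best-matching chunks are kept. It deliberately swaps in a naive keyword-overlap score where the PR uses embedding-based retrieval from the knowledge store, and `select_context` with its parameters is hypothetical.

```rust
/// Threshold taken from the commit message (>6000 chars triggers chunking).
const MAX_INLINE_CHARS: usize = 6000;

/// Count how many query words occur in the chunk (case-insensitive).
/// A stand-in for embedding similarity, not what the PR does.
fn overlap_score(chunk: &str, query: &str) -> usize {
    let chunk_lower = chunk.to_lowercase();
    let query_lower = query.to_lowercase();
    let score = query_lower
        .split_whitespace()
        .filter(|w| chunk_lower.contains(*w))
        .count();
    score
}

/// Inline short documents; for long ones keep only the `top_k` best chunks.
pub fn select_context(text: &str, query: &str, chunk_chars: usize, top_k: usize) -> String {
    if text.chars().count() <= MAX_INLINE_CHARS {
        return text.to_string();
    }
    let chars: Vec<char> = text.chars().collect();
    let mut scored: Vec<(usize, String)> = chars
        .chunks(chunk_chars.max(1))
        .map(|c| c.iter().collect::<String>())
        .map(|c| (overlap_score(&c, query), c))
        .collect();
    // Stable sort: chunks with equal scores stay in document order.
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored
        .into_iter()
        .take(top_k)
        .map(|(_, c)| c)
        .collect::<Vec<_>>()
        .join("\n---\n")
}
```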
Previously, the Telegram integration only handled text messages. This PR adds handling for photos and file attachments (PDF, DOCX, images), routing them through a vision/OCR/document extraction pipeline before injecting context into the LLM.
Platform Layer
- `IncomingMessage` gains `attachments: Vec<Attachment>` (`kind: Image|Pdf|Docx|Other`, `path`, `mime_type`)
- `telegram.rs` now handles `msg.photo()` (highest-res variant) and `msg.document()`, downloading to per-request temp dirs that are cleaned up after processing

LLM Message Model (`llm.rs`)
- `ChatMessage.content` promoted from `Option<String>` to `MessageContent`: either a plain string or `Vec<ContentPart>` (text/image_url)

File Processor (`src/file_processor/`)
- Images: `openrouter.supports_vision = true` → base64-encode into multi-modal content parts; otherwise → OCR via `ocrs` (pure Rust, neural-network-based OCR engine, no system library dependencies) → inject as text
- PDF: text extraction via `pdf-extract`
- DOCX: text extraction via `docx-rs`
- Long documents (>6000 chars): chunked and stored in the existing `EmbeddingEngine` + `sqlite-vec` knowledge store, and top-K chunks are RAG-retrieved per user query; no new DB required

OCR
Uses `ocrs`, a pure Rust neural-network OCR engine backed by the `rten` inference runtime. Models (text detection + recognition) are downloaded automatically from S3 on first use and cached in `~/.cache/ocrs/`. No native system library (e.g. Tesseract) required.

Config

New Dependencies
- `ocrs`
- `rten` (runtime for the `ocrs` model files)
- `image`
- `pdf-extract`
- `docx-rs`
- `infer`
- `base64`
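For the vision path under File Processor, the image bytes end up base64-encoded inside an `image_url` content part. The sketch below shows the encoding step with a hand-rolled encoder, so it runs without the `base64` crate that the PR actually depends on. The `data:` URL format is the form commonly accepted by OpenAI-compatible vision APIs and is an assumption about what this PR emits.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with `=` padding (a stand-in for the `base64` crate).
pub fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = chunk.get(1).copied().unwrap_or(0) as u32;
        let b2 = chunk.get(2).copied().unwrap_or(0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 63) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 63) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Build the URL string that would go into an `image_url` content part.
pub fn image_data_url(mime: &str, bytes: &[u8]) -> String {
    format!("data:{};base64,{}", mime, base64_encode(bytes))
}
```

Base64 inflates the payload by roughly a third, which is one reason the agent stores a text-only version of the message in the database rather than the encoded image.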