Is your feature request related to a problem?
Describe the feature you'd like
🌟 Feature Description
Offload blocking embedding work from the async event loop so that EmbeddingService does not block other concurrent requests. The service should run CPU/GPU-bound model.encode() calls in a thread pool (e.g. via asyncio.to_thread()) while keeping the public API async.
🔍 Problem Statement
EmbeddingService exposes async methods (get_embedding, get_embeddings, summarize_user_profile, search_similar_profiles) but internally calls synchronous SentenceTransformer code:
- self.model.encode(...) in get_embedding and get_embeddings is blocking: it runs on CPU/GPU and does not yield to the event loop.
- While one request is generating embeddings, the entire process is blocked: other HTTP requests, agent tools, and background tasks stall until the encode finishes.
- The LLM call in summarize_user_profile correctly uses await self.llm.ainvoke(...) and is non-blocking; only the embedding step blocks.
This hurts latency and concurrency wherever the service is used (e.g. issue_processor.py, contributor_recommendation.py, user profiling).
🎯 Expected Outcome
- Event loop stays responsive during embedding generation: other async work (API handlers, other embeddings, LLM calls) can run while model.encode() runs in a worker thread.
- No change to the public API: callers keep using await embedding_service.get_embedding(...) and await embedding_service.get_embeddings(...).
- Implementation approach: add a synchronous helper that performs model.encode() and the tensor-to-list conversion; call it from get_embedding and get_embeddings via asyncio.to_thread() (or loop.run_in_executor() with a ThreadPoolExecutor).
- Optional: run model lazy-load in a thread at first use to avoid blocking on first request.
📷 Screenshots and Design Ideas
Before: One long embedding request blocks the event loop → other requests wait.
After: Embedding runs in a thread pool → event loop continues handling other requests; embedding call still awaited by the original request.
No UI changes; this is a backend concurrency fix.
📋 Additional Context
- File to change: backend/app/services/embedding_service/service.py
- Consumers: app/services/github/user/profiling.py, app/services/github/issue_processor.py, app/agents/devrel/github/tools/contributor_recommendation.py
- Suggested steps:
  - Add import asyncio.
  - Add a sync helper method (e.g. _encode(texts)) that calls self.model.encode(...) and returns list(s) of floats.
  - In get_embedding: replace the direct model.encode call with await asyncio.to_thread(self._encode, [text]), then return the single embedding list.
  - In get_embeddings: replace the direct model.encode call with await asyncio.to_thread(self._encode, texts) and return the list of lists.
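The steps above could look roughly like this. This is a sketch, not the actual service code: the constructor and method signatures are assumptions, and `model` is assumed to expose a SentenceTransformer-style synchronous `encode()`.

```python
import asyncio


class EmbeddingService:
    """Sketch of the proposed change to
    backend/app/services/embedding_service/service.py (signatures assumed)."""

    def __init__(self, model):
        # model: any object with a blocking .encode(list[str]) method
        # returning per-text vectors (numpy arrays, tensors, or lists).
        self.model = model

    def _encode(self, texts):
        # Synchronous helper: all blocking CPU/GPU work stays here,
        # including the vector-to-list-of-floats conversion.
        embeddings = self.model.encode(texts)
        return [[float(x) for x in emb] for emb in embeddings]

    async def get_embedding(self, text):
        # Offload the blocking encode to a worker thread; the event
        # loop keeps serving other tasks while it runs.
        results = await asyncio.to_thread(self._encode, [text])
        return results[0]

    async def get_embeddings(self, texts):
        return await asyncio.to_thread(self._encode, texts)
```

Callers are unaffected: they still `await embedding_service.get_embedding(...)` exactly as before.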
- Verification: While one request is generating embeddings, trigger another (e.g. health check or simple async endpoint); the second should respond without waiting for the first.
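The verification idea can be demonstrated in isolation with a stand-in for `model.encode()`: a blocking call offloaded via `asyncio.to_thread()` lets a fast coroutine (playing the role of a health check) respond without waiting for the slow encode. All names here are illustrative.

```python
import asyncio
import time


def slow_encode(texts):
    # Stand-in for a blocking model.encode() call (assumed ~0.2 s of work).
    time.sleep(0.2)
    return [[0.0] for _ in texts]


async def embed():
    # Offloaded to a worker thread, so awaiting it does not block the loop.
    return await asyncio.to_thread(slow_encode, ["hello"])


async def health_check():
    # A trivially fast async endpoint.
    return "ok"


async def main():
    start = time.monotonic()
    task = asyncio.create_task(embed())   # long embedding request in flight
    status = await health_check()         # second request served immediately
    health_latency = time.monotonic() - start
    vectors = await task                  # original request still gets its result
    return status, health_latency, vectors
```

With a blocking `model.encode()` called directly on the event loop, the health check would instead be delayed by the full encode duration.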