Improve model loading: cache resolutions, share ModelManager, configurable quantization by devin-ai-integration[bot] · Pull Request #2 · donvito/aibackends

devin-ai-integration · 2026-05-14T07:54:06Z

Summary

Addresses several model loading inefficiencies found during code review:

Cache _model_id() in TransformersRuntime — Previously _model_id() called ensure_model() on every invocation (2× per load in _load_generator and _load_embedder). Now caches the result in _resolved_model_id after the first call.
Share ModelManager instances across runtimes — Added get_model_manager(cache_dir) factory with double-checked locking that returns shared instances keyed by cache_dir. Previously each runtime constructor created its own ModelManager.
Cache GGUF file selection — Added _gguf_selection_cache dict to ModelManager so that once a repo's GGUF files are listed and the best quant selected, subsequent calls skip the list_repo_files() HF API call entirely.
Configurable quantization preference — Added preferred_quantization: str | None field to TransformerModelProfile. The llamacpp model support handler reads this (falling back to extra_options["preferred_quantization"]) and passes it to _select_gguf_file(), which prepends it to the preference order.
Fix pre-existing mypy error — Added llama_cpp.llama_chat_format to the mypy ignore_missing_imports overrides (the top-level llama_cpp entry didn't cover submodules).
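As a rough illustration of the shared-instance factory described above, here is a minimal sketch of double-checked locking keyed by cache_dir. It is not the PR's actual code: the stand-in ModelManager class and the _MANAGER_LOCK name are assumptions made for this example, and only get_model_manager and _MANAGER_INSTANCES are names taken from this description.

```python
from __future__ import annotations

import threading


class ModelManager:
    """Stand-in for the real ModelManager (assumption: it accepts a cache_dir)."""

    def __init__(self, cache_dir: str | None = None) -> None:
        self.cache_dir = cache_dir


# Module-level registry of shared instances, keyed by cache_dir.
_MANAGER_INSTANCES: dict[str | None, ModelManager] = {}
# Hypothetical lock name; the PR does not state what the lock is called.
_MANAGER_LOCK = threading.Lock()


def get_model_manager(cache_dir: str | None = None) -> ModelManager:
    """Return the shared ModelManager for cache_dir, creating it at most once."""
    # First check without the lock: the common case once an instance exists.
    manager = _MANAGER_INSTANCES.get(cache_dir)
    if manager is None:
        with _MANAGER_LOCK:
            # Second check under the lock: another thread may have created
            # the instance between the first check and acquiring the lock.
            manager = _MANAGER_INSTANCES.get(cache_dir)
            if manager is None:
                manager = ModelManager(cache_dir)
                _MANAGER_INSTANCES[cache_dir] = manager
    return manager
```

Under these assumptions, two runtimes that pass the same cache_dir receive the same object, while code that calls ModelManager() directly still gets a private instance.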

Review & Testing Checklist for Human

Verify the get_model_manager() shared-instance pattern is acceptable — it uses module-level state (_MANAGER_INSTANCES). If tests need isolation, ModelManager() can still be constructed directly.
Verify _gguf_selection_cache keyed by "{repo_id}::{preferred_quantization}" is correct — if a user changes quantization preference between calls for the same repo, a new cache entry is created.
Check that preferred_quantization on TransformerModelProfile being a frozen dataclass field with None default doesn't break any downstream serialization.
Run the full test suite (pytest tests) — all 81 tests pass locally with ruff + mypy clean.
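To make the cache-key item in the checklist easier to reason about, the sketch below shows one way a selection cache keyed by "{repo_id}::{preferred_quantization}" could behave. It is an illustration only: DEFAULT_QUANT_ORDER, select_gguf_file, GGUFSelector, and the injected list_files callable are hypothetical, and the substring-matching rule is an assumption rather than the PR's actual selection logic.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical default preference order; the real list is not shown in this PR.
DEFAULT_QUANT_ORDER = ("Q4_K_M", "Q5_K_M", "Q8_0")


def select_gguf_file(files: list[str], preferred_quantization: str | None = None) -> str:
    """Pick a GGUF file, trying the preferred quantization before the defaults."""
    order = list(DEFAULT_QUANT_ORDER)
    if preferred_quantization:
        # Prepend the preference and drop any duplicate from the defaults.
        order = [preferred_quantization] + [q for q in order if q != preferred_quantization]
    gguf_files = [name for name in files if name.endswith(".gguf")]
    if not gguf_files:
        raise ValueError("no GGUF files in repo listing")
    for quant in order:
        for name in gguf_files:
            if quant.lower() in name.lower():
                return name
    # Nothing matched the preference order: fall back to the first GGUF file.
    return gguf_files[0]


class GGUFSelector:
    """Caches the selected file per (repo, preference) so listing happens once."""

    def __init__(self, list_files: Callable[[str], list[str]]) -> None:
        self._list_files = list_files  # stands in for the HF listing call
        self._gguf_selection_cache: dict[str, str] = {}

    def resolve(self, repo_id: str, preferred_quantization: str | None = None) -> str:
        key = f"{repo_id}::{preferred_quantization}"
        cached = self._gguf_selection_cache.get(key)
        if cached is not None:
            return cached
        selected = select_gguf_file(self._list_files(repo_id), preferred_quantization)
        self._gguf_selection_cache[key] = selected
        return selected
```

With this key format, a call with no preference is stored under "repo::None", and changing the preference for the same repo produces a separate entry and one additional listing call.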

Notes

The GLiNER PII backend already had process-level model caching (_MODEL_CACHE + _CACHE_LOCK). These changes bring similar patterns to the core model loading infrastructure.
Direct ModelManager() construction still works for tests and one-off use; get_model_manager() is used by the runtime constructors and CLI.

Link to Devin session: https://app.devin.ai/sessions/f84a43ef798f44b3b1623ec8657b673c
Requested by: @donvito

…F selections, configurable quant

- Cache _model_id() result in TransformersRuntime to avoid redundant ensure_model() calls (was called 2x per load for model + tokenizer)
- Add get_model_manager() factory that returns shared ModelManager instances keyed by cache_dir (double-checked locking pattern)
- Cache GGUF file selection in ModelManager._gguf_selection_cache to avoid repeated list_repo_files() API calls for the same repo
- Add preferred_quantization field to TransformerModelProfile and wire it through the llamacpp model support handler
- Support preferred_quantization via extra_options in RuntimeConfig as a fallback when no profile preference is set
- Fix pre-existing mypy error by adding llama_cpp.llama_chat_format to ignore_missing_imports overrides

Co-Authored-By: Melvin <melvindave@gmail.com>
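The first item in the commit message, caching the result of _model_id(), can be pictured with the sketch below. It assumes ensure_model() takes a model reference and returns a resolved identifier; the FakeModelManager and TransformersRuntimeSketch classes and the _model_ref attribute are hypothetical stand-ins, and only _model_id, ensure_model, and _resolved_model_id are names taken from this PR.

```python
from __future__ import annotations


class FakeModelManager:
    """Hypothetical stand-in that counts how often ensure_model() is called."""

    def __init__(self) -> None:
        self.calls = 0

    def ensure_model(self, model_ref: str) -> str:
        self.calls += 1
        return f"/cache/{model_ref}"


class TransformersRuntimeSketch:
    """Illustrative runtime; not the real TransformersRuntime."""

    def __init__(self, model_ref: str, manager: FakeModelManager) -> None:
        self._model_ref = model_ref
        self._manager = manager
        self._resolved_model_id: str | None = None

    def _model_id(self) -> str:
        # Resolve once, then reuse the stored value on later calls.
        if self._resolved_model_id is None:
            self._resolved_model_id = self._manager.ensure_model(self._model_ref)
        return self._resolved_model_id


manager = FakeModelManager()
runtime = TransformersRuntimeSketch("org/model", manager)
first = runtime._model_id()
second = runtime._model_id()
```

In this sketch the second call returns the stored value, so ensure_model() runs once per runtime instance rather than once per call.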

devin-ai-integration · 2026-05-14T07:54:08Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

devin-ai-integration Bot assigned donvito May 14, 2026

devin-ai-integration[bot] wants to merge 1 commit into main from devin/1778745145-improve-model-loading
