Improve model loading: cache resolutions, share ModelManager, configurable quantization #2

Open
devin-ai-integration[bot] wants to merge 1 commit into main from devin/1778745145-improve-model-loading


@devin-ai-integration devin-ai-integration Bot commented May 14, 2026

Summary

Addresses several model loading inefficiencies found during code review:

  1. Cache _model_id() in TransformersRuntime — Previously _model_id() called ensure_model() on every invocation (2× per load in _load_generator and _load_embedder). Now caches the result in _resolved_model_id after the first call.

  2. Share ModelManager instances across runtimes — Added get_model_manager(cache_dir) factory with double-checked locking that returns shared instances keyed by cache_dir. Previously each runtime constructor created its own ModelManager.

  3. Cache GGUF file selection — Added _gguf_selection_cache dict to ModelManager so that once a repo's GGUF files are listed and the best quant selected, subsequent calls skip the list_repo_files() HF API call entirely.

  4. Configurable quantization preference — Added preferred_quantization: str | None field to TransformerModelProfile. The llamacpp model support handler reads this (falling back to extra_options["preferred_quantization"]) and passes it to _select_gguf_file(), which prepends it to the preference order.

  5. Fix pre-existing mypy error — Added llama_cpp.llama_chat_format to the mypy ignore_missing_imports overrides (the top-level llama_cpp entry didn't cover submodules).
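
As a rough illustration of item 4, the sketch below shows one way a preferred quantization could be resolved (profile field first, then `extra_options`) and prepended to a default preference order when choosing among a repo's GGUF files. The names `resolve_preference`, `select_gguf_file`, and `DEFAULT_QUANT_ORDER`, the contents of the default order, and the substring-matching rule are all assumptions made for illustration; the PR's actual `_select_gguf_file()` may behave differently.

```python
from __future__ import annotations

# Hypothetical default order; the real preference list in the PR is not shown here.
DEFAULT_QUANT_ORDER = ("Q4_K_M", "Q5_K_M", "Q8_0")


def resolve_preference(profile_pref: str | None, extra_options: dict[str, str]) -> str | None:
    """Use the profile field if set, otherwise fall back to extra_options."""
    if profile_pref is not None:
        return profile_pref
    return extra_options.get("preferred_quantization")


def select_gguf_file(files: list[str], preferred: str | None = None) -> str | None:
    """Return the first .gguf file matching the highest-priority quantization."""
    order = list(DEFAULT_QUANT_ORDER)
    if preferred:
        # Prepend the preference and drop any duplicate further down the list.
        order = [preferred] + [q for q in order if q.lower() != preferred.lower()]
    ggufs = [f for f in files if f.lower().endswith(".gguf")]
    for quant in order:
        for name in ggufs:
            # Assumed matching rule: case-insensitive substring of the file name.
            if quant.lower() in name.lower():
                return name
    # No known quantization matched: fall back to any GGUF file, if present.
    return ggufs[0] if ggufs else None
```

In this sketch, a preference that matches no file does not fail; selection simply continues down the default order.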

Review & Testing Checklist for Human

  • Verify the get_model_manager() shared-instance pattern is acceptable — it uses module-level state (_MANAGER_INSTANCES). If tests need isolation, ModelManager() can still be constructed directly.
  • Verify _gguf_selection_cache keyed by "{repo_id}::{preferred_quantization}" is correct — if a user changes quantization preference between calls for the same repo, a new cache entry is created.
  • Check that adding preferred_quantization to TransformerModelProfile (a frozen dataclass) as a field with a None default doesn't break any downstream serialization.
  • Run the full test suite (pytest tests) — all 81 tests pass locally with ruff + mypy clean.
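
To make the cache-key item in the checklist above easier to review, here is a minimal sketch of a selection cache keyed by `"{repo_id}::{preferred_quantization}"`. The class `GGUFSelectionCache` and its injected `list_files` and `select` callables are hypothetical stand-ins for the relevant part of `ModelManager`, `list_repo_files()`, and `_select_gguf_file()`; this is not the PR's code.

```python
from __future__ import annotations

from typing import Callable


class GGUFSelectionCache:
    """Hypothetical stand-in for the caching part of ModelManager."""

    def __init__(
        self,
        list_files: Callable[[str], list[str]],
        select: Callable[[list[str], str | None], str | None],
    ) -> None:
        self._list_files = list_files  # stands in for the HF list_repo_files() call
        self._select = select          # stands in for _select_gguf_file()
        self._gguf_selection_cache: dict[str, str] = {}

    def get(self, repo_id: str, preferred_quantization: str | None = None) -> str | None:
        # With no preference, the key ends in "::None".
        key = f"{repo_id}::{preferred_quantization}"
        cached = self._gguf_selection_cache.get(key)
        if cached is not None:
            return cached  # cache hit: no file listing needed
        chosen = self._select(self._list_files(repo_id), preferred_quantization)
        if chosen is not None:
            self._gguf_selection_cache[key] = chosen
        return chosen
```

Under this keying, changing the preference for the same repo triggers one more file listing and adds a second entry rather than overwriting the first.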

Notes

  • The GLiNER PII backend already had process-level model caching (_MODEL_CACHE + _CACHE_LOCK). These changes bring similar patterns to the core model loading infrastructure.
  • Direct ModelManager() construction still works for tests and one-off use; get_model_manager() is used by the runtime constructors and CLI.
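
For reviewers unfamiliar with the pattern, the following is a minimal sketch of a shared-instance factory using double-checked locking, keyed by `cache_dir`. `get_model_manager` and `_MANAGER_INSTANCES` are named in this PR; the stub `ModelManager` class, the `_MANAGER_LOCK` name, and the way `cache_dir` is normalized into a key are assumptions for illustration only.

```python
from __future__ import annotations

import threading
from pathlib import Path


class ModelManager:
    """Minimal stub; the real class does much more."""

    def __init__(self, cache_dir: str | Path | None = None) -> None:
        self.cache_dir = cache_dir


# Module-level state shared by every caller in the process.
_MANAGER_INSTANCES: dict[str, ModelManager] = {}
_MANAGER_LOCK = threading.Lock()  # hypothetical name


def get_model_manager(cache_dir: str | Path | None = None) -> ModelManager:
    """Return the shared ModelManager for cache_dir, creating it at most once."""
    # Assumed key normalization: the empty string represents "no cache_dir".
    key = str(cache_dir) if cache_dir is not None else ""
    manager = _MANAGER_INSTANCES.get(key)  # first check, without the lock
    if manager is None:
        with _MANAGER_LOCK:
            manager = _MANAGER_INSTANCES.get(key)  # second check, under the lock
            if manager is None:
                manager = ModelManager(cache_dir)
                _MANAGER_INSTANCES[key] = manager
    return manager
```

Tests that need isolation can bypass the factory by calling `ModelManager(...)` directly, which never touches the shared dictionary.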

Link to Devin session: https://app.devin.ai/sessions/f84a43ef798f44b3b1623ec8657b673c
Requested by: @donvito



…F selections, configurable quant

- Cache _model_id() result in TransformersRuntime to avoid redundant
  ensure_model() calls (was called 2x per load for model + tokenizer)
- Add get_model_manager() factory that returns shared ModelManager
  instances keyed by cache_dir (double-checked locking pattern)
- Cache GGUF file selection in ModelManager._gguf_selection_cache to
  avoid repeated list_repo_files() API calls for the same repo
- Add preferred_quantization field to TransformerModelProfile and
  wire it through the llamacpp model support handler
- Support preferred_quantization via extra_options in RuntimeConfig
  as a fallback when no profile preference is set
- Fix pre-existing mypy error by adding llama_cpp.llama_chat_format
  to ignore_missing_imports overrides

Co-Authored-By: Melvin <melvindave@gmail.com>

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.



@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.
