Improve model loading: cache resolutions, share ModelManager, configurable quantization#2
Open
devin-ai-integration[bot] wants to merge 1 commit into
Open
Conversation
…F selections, configurable quant - Cache _model_id() result in TransformersRuntime to avoid redundant ensure_model() calls (was called 2x per load for model + tokenizer) - Add get_model_manager() factory that returns shared ModelManager instances keyed by cache_dir (double-checked locking pattern) - Cache GGUF file selection in ModelManager._gguf_selection_cache to avoid repeated list_repo_files() API calls for the same repo - Add preferred_quantization field to TransformerModelProfile and wire it through the llamacpp model support handler - Support preferred_quantization via extra_options in RuntimeConfig as a fallback when no profile preference is set - Fix pre-existing mypy error by adding llama_cpp.llama_chat_format to ignore_missing_imports overrides Co-Authored-By: Melvin <melvindave@gmail.com>
Author
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses several model loading inefficiencies found during code review:
Cache
_model_id()in TransformersRuntime — Previously_model_id()calledensure_model()on every invocation (2× per load in_load_generatorand_load_embedder). Now caches the result in_resolved_model_idafter the first call.Share
ModelManagerinstances across runtimes — Addedget_model_manager(cache_dir)factory with double-checked locking that returns shared instances keyed bycache_dir. Previously each runtime constructor created its ownModelManager.Cache GGUF file selection — Added
_gguf_selection_cachedict toModelManagerso that once a repo's GGUF files are listed and the best quant selected, subsequent calls skip thelist_repo_files()HF API call entirely.Configurable quantization preference — Added
preferred_quantization: str | Nonefield toTransformerModelProfile. The llamacpp model support handler reads this (falling back toextra_options["preferred_quantization"]) and passes it to_select_gguf_file(), which prepends it to the preference order.Fix pre-existing mypy error — Added
llama_cpp.llama_chat_formatto the mypyignore_missing_importsoverrides (the top-levelllama_cppentry didn't cover submodules).Review & Testing Checklist for Human
get_model_manager()shared-instance pattern is acceptable — it uses module-level state (_MANAGER_INSTANCES). If tests need isolation,ModelManager()can still be constructed directly._gguf_selection_cachekeyed by"{repo_id}::{preferred_quantization}"is correct — if a user changes quantization preference between calls for the same repo, a new cache entry is created.preferred_quantizationonTransformerModelProfilebeing a frozen dataclass field withNonedefault doesn't break any downstream serialization.pytest tests) — all 81 tests pass locally with ruff + mypy clean.Notes
_MODEL_CACHE+_CACHE_LOCK). These changes bring similar patterns to the core model loading infrastructure.ModelManager()construction still works for tests and one-off use;get_model_manager()is used by the runtime constructors and CLI.Link to Devin session: https://app.devin.ai/sessions/f84a43ef798f44b3b1623ec8657b673c
Requested by: @donvito