feat(pymllm): VocabParallelEmbedding & pymllm's cuda infra init #640

chenghuaWang wants to merge 4 commits into UbiquitousLearning:main from
Conversation
…mple
- Replaced the previous JIT utility functions with a streamlined `jit` decorator for kernel registration.
- Updated the README.md to reflect the new recommended pattern for CPU kernel implementation.
- Simplified the example for using JIT compilation with a focus on clarity and ease of use.

- Updated `apache-tvm-ffi` version to `0.1.8` in `pyproject.toml` and `mllm-kernel/pyproject.toml`.
- Refactored mobile module imports and structure, moving scripts to `pymllm.mobile` and removing unused backends.
- Introduced new classes and methods for quantization and model deployment in the Qualcomm backend.
- Added new README files for mobile and Qualcomm transformer components.

- Added `flashinfer-python` to the optional `cuda` dependencies in `pyproject.toml`.
- Introduced new configuration files for server, model, and layers to centralize runtime settings.
- Created initial structure for various layers and components to support future development.

- Added main entry points for `pymllm` and `mllm-kernel` in their respective `pyproject.toml` files.
- Implemented a configuration module for `pymllm` to manage global settings, including server, model, runtime, and cache configurations.
- Introduced the `VocabParallelEmbedding` layer and utility functions for weight management in the layers module.
- Created initial tests for the `VocabParallelEmbedding` layer to validate functionality with tensor parallelism.
📝 Walkthrough

This PR introduces comprehensive configuration management, distributed training orchestration, tensor-parallel layer components, and refactors mobile backend imports. It adds a global configuration system with server/model/runtime/cache configs, distributed group coordination, embedding parallelism support, and consolidates CLI entry points. Multiple import paths in mobile backends are updated to new mobile-qualified namespace locations, and FFI bindings expose ParameterFileObj. Dependencies are pinned to specific versions.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (9)
pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py (1)
31-35: ⚠️ Potential issue | 🟡 Minor

`--output_dir` is not marked `required=True` — crashes with an opaque `TypeError` when omitted.

`args.output_dir` defaults to `None`. Line 50 then calls `os.makedirs(None, exist_ok=True)`, raising `TypeError: expected str, bytes or os.PathLike object, not NoneType` instead of a clear argparse error.

🐛 Proposed fix

```diff
 parser.add_argument(
     "--output_dir",
     type=str,
+    required=True,
     help="Directory to save the quantized model",
 )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py` around lines 31 - 35, The --output_dir argument is optional but later used in os.makedirs causing a TypeError when omitted; update the parser.add_argument for "--output_dir" (the one in train.py) to include required=True (or alternatively provide a sensible default) so args.output_dir is never None, and ensure any subsequent call (os.makedirs(args.output_dir, exist_ok=True) in this file) receives a valid path; keep the change localized to the parser.add_argument for "--output_dir" and/or add an explicit argparse error if args.output_dir is falsy before calling os.makedirs.

pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py (1)
136-160: ⚠️ Potential issue | 🟠 Major

Fix `apply_rotary_pos_emb` function calls to `rotate_half`.

`rotate_half()` requires 4 positional arguments (`x`, `x_observer`, `x2_neg_fake_quant`, `concat_observer`), but `apply_rotary_pos_emb()` calls it with only 1 argument at lines 158–159:

```python
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
```

This causes
`TypeError: rotate_half() missing 3 required positional arguments` for any caller. `LlamaAttention.forward` (lines 397–420) bypasses this by calling `rotate_half()` directly with all required arguments, but the public function remains broken and unusable.
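A minimal repair sketch (not the author's patch): thread the observer/fake-quant helpers through `apply_rotary_pos_emb` so it can satisfy `rotate_half`'s four-argument signature. The parameter names simply mirror those quoted in this report and are otherwise assumptions.

```python
# Sketch only — assumes the observer objects are available at the call site
# and that rotate_half is the module's existing helper.
def apply_rotary_pos_emb(q, k, cos, sin, x_observer, x2_neg_fake_quant, concat_observer):
    q_embed = (q * cos) + (rotate_half(q, x_observer, x2_neg_fake_quant, concat_observer) * sin)
    k_embed = (k * cos) + (rotate_half(k, x_observer, x2_neg_fake_quant, concat_observer) * sin)
    return q_embed, k_embed
```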
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py` around lines 136 - 160, apply_rotary_pos_emb currently calls rotate_half with a single argument but rotate_half requires four positional args (x, x_observer, x2_neg_fake_quant, concat_observer); update apply_rotary_pos_emb to call rotate_half with the same four arguments used in LlamaAttention.forward (i.e., pass through the observer/quant args or supply None defaults) so the calls become rotate_half(q, x_observer, x2_neg_fake_quant, concat_observer) and rotate_half(k, x_observer, x2_neg_fake_quant, concat_observer) (or explicitly pass None for those three if observers are not available), ensuring the function signature and callers are updated accordingly.

pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)
114-138: ⚠️ Potential issue | 🟠 Major

Remove `apply_rotary_pos_emb` function — it is dead code with signature mismatch.

`apply_rotary_pos_emb` calls `rotate_half(q)` and `rotate_half(k)` with 1 argument each (lines 136–137), but `rotate_half` requires 4 arguments (`x`, `x_observer`, `x2_neg_fake_quant`, `concat_observer`). This would raise `TypeError` if called. The function is not invoked anywhere in the codebase and is absent from `__all__` exports. `Qwen3Attention.forward` applies RoPE manually with correct argument passing. Remove this unused function.

Note: The same issue exists identically in `pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` and `pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around lines 114 - 138, The function apply_rotary_pos_emb is dead code and wrong: it calls rotate_half(q) and rotate_half(k) with only one argument while rotate_half requires four, and Qwen3Attention.forward applies RoPE correctly; remove the entire apply_rotary_pos_emb function definition from modeling_qwen3.py to avoid the incorrect call and dead export, and likewise remove the identical apply_rotary_pos_emb definitions from modeling_qwen2.py and modeling_llama.py so no mismatched helper remains in the qualcomm backend.

pymllm/mobile/backends/qualcomm/transformers/llama/train.py (1)
32-35: ⚠️ Potential issue | 🟠 Major

`--output_dir` will crash with `TypeError` if omitted.

`--output_dir` has no `default` and no `required=True`. When `args.output_dir` is `None`, line 50 (`os.makedirs(args.output_dir, exist_ok=True)`) raises `TypeError: expected str, bytes or os.PathLike object, not NoneType`. The same issue exists in `qwen2/train.py`.

🐛 Proposed fix

```diff
 parser.add_argument(
     "--output_dir",
     type=str,
+    required=True,
     help="Directory to save the quantized model",
 )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/llama/train.py` around lines 32 - 35, The CLI argument "--output_dir" is missing a default or required=True so args.output_dir can be None and later break on os.makedirs(args.output_dir, exist_ok=True); update the argument definition in train.py (and mirror the same change in qwen2/train.py) to either set a sensible default (e.g., "./output") or mark it required=True and then add a defensive check before calling os.makedirs (e.g., ensure args.output_dir is not None and raise a clear error or use the default), referencing the argument parsing block that defines "--output_dir" and the code that calls os.makedirs(args.output_dir, exist_ok=True).

pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (3)
291-298: ⚠️ Potential issue | 🟡 Minor

Misleading streaming comment and `trust_remote_code=True` security note.

- Line 292 comment states `streaming=True` is used, but the actual `MsDataset.load()` call has no such argument. This will cause the full dataset to be downloaded locally — the opposite of what the comment describes.
- `trust_remote_code=True` on line 297 permits execution of arbitrary Python code bundled with the dataset. While common in research settings, it should be an explicit opt-in with a clear docstring warning rather than a silent default in a library.
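One possible shape for the fix (a sketch only): make remote-code execution an explicit opt-in and pass the streaming flag the comment promises. The wrapper name `load_calibration_dataset` and the assumption that `MsDataset.load` forwards these keyword arguments exactly as the review suggests are both illustrative.

```python
def load_calibration_dataset(name: str, *, allow_trust_remote_code: bool = False):
    # Stream the dataset instead of downloading it in full, and only execute
    # dataset-bundled code when the caller explicitly opts in.
    return MsDataset.load(
        name,
        streaming=True,
        trust_remote_code=allow_trust_remote_code,
    )
```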
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 291 - 298, The comment above the MsDataset.load call is misleading and the use of trust_remote_code=True is unsafe; update the MsDataset.load invocation (the call to MsDataset.load in runner.py) to pass streaming=True to match the comment and avoid downloading the entire dataset, and change trust_remote_code=True to a safe default (False) or wire it to an explicit opt-in flag (e.g., allow_trust_remote_code) exposed in the function/class API with a clear docstring warning about executing remote code so callers must consciously opt in; also update the inline comment to accurately reflect the behavior.
166-185: ⚠️ Potential issue | 🔴 Critical

Add `QLinearW8A16_PerChannelSym` handling to `enable_fake_quant` and `disable_fake_quant` functions.

`QLinearW8A16_PerChannelSym` is included in `freeze_qwen3_linear_weight` (line 147) and `convert_weight` (line 189), but is missing from `enable_fake_quant` and `disable_fake_quant` (lines 166–185). Unlike `QLinearLPBQ`, which defines `enable_fakequant()` and `disable_fakequant()` methods, `QLinearW8A16_PerChannelSym` currently lacks these methods. Both classes should provide consistent interfaces for managing fake-quantization state during calibration. Add `QLinearW8A16_PerChannelSym` to the type checks in both functions, and implement the corresponding methods in the class if they do not exist.
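A sketch of the suggested type-check change, assuming `QLinearW8A16_PerChannelSym` gains the same `enable_fakequant()` / `disable_fakequant()` methods as `QLinearLPBQ` (the real functions may walk modules differently):

```python
def enable_fake_quant(module):
    # Toggle fake-quant on for both quantized-linear variants.
    if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
        module.enable_fakequant()


def disable_fake_quant(module):
    # Toggle fake-quant off for both quantized-linear variants.
    if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
        module.disable_fakequant()
```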
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 166 - 185, Add QLinearW8A16_PerChannelSym to the fake-quant toggles and implement matching methods on the class: update enable_fake_quant and disable_fake_quant to check isinstance(..., QLinearW8A16_PerChannelSym) and call the new methods; then add enable_fakequant(self) and disable_fakequant(self) to the QLinearW8A16_PerChannelSym class to mirror QLinearLPBQ's behavior (toggle whatever internal flag or buffer used for fake-quant state). Ensure the implementation is consistent with how QLinearLPBQ, ActivationQDQ, QRMSNorm, and QEmbedding manage fake-quant state so freeze_qwen3_linear_weight and convert_weight (which already reference QLinearW8A16_PerChannelSym) behave correctly during calibration.
69-73: ⚠️ Potential issue | 🟡 Minor

The diagnostic logging loop is dead code and contains an `AttributeError`.

`FakeQuantize` registers `scale` as a buffer (via `register_buffer`), not as a parameter. Therefore, `module.fake_quant.named_parameters()` will never yield `scale`, making the loop body at lines 70–73 unreachable dead code.

Additionally, line 72 references `module.scale`, but `ActivationQDQ` does not expose a `.scale` property — it should be `module.fake_quant.scale`.

Fix both issues:

🐛 Proposed fix

```diff
-        for key, value in module.fake_quant.named_parameters():
+        for key, value in module.fake_quant.named_buffers():
             if value is module.fake_quant.scale:
-                print(f"{module._get_name()}.{key}: {module.scale}")
+                print(f"{module._get_name()}.{key}: {module.fake_quant.scale}")
                 break
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 69 - 73, The loop using module.fake_quant.named_parameters() is dead because FakeQuantize.register_buffer registers scale as a buffer, and it also wrongly references module.scale; update the diagnostic to iterate module.fake_quant.named_buffers() (or check both named_parameters() and named_buffers() if desired) and reference module.fake_quant.scale when printing; specifically modify the block around the loop that uses module.fake_quant.named_parameters() and the print that uses module.scale so it inspects named_buffers() and prints module.fake_quant.scale (or the buffer name) instead.

pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py (2)
366-369: ⚠️ Potential issue | 🟡 Minor

Pre-existing typo: `layer_dix` should be `layer_idx`.

Line 369 assigns `self.layer_dix = layer_idx` (note "dix" vs "idx"). This typo propagates to Lines 384 and 404 where `self.layer_dix` is read. While internally consistent, it's misleading and will cause confusion for anyone extending `Qwen2DecoderLayer`.

Not introduced by this PR, but flagging for awareness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around lines 366 - 369, The Qwen2DecoderLayer constructor uses a typo'd attribute name self.layer_dix instead of self.layer_idx; update the constructor in class Qwen2DecoderLayer to assign self.layer_idx = layer_idx and then change any reads/writes of self.layer_dix (e.g., at the sites currently reading it on lines where it’s accessed) to self.layer_idx so the attribute name is consistent and clear across the class (keep the class name Qwen2DecoderLayer and parameter layer_idx the same).
90-123: ⚠️ Potential issue | 🟡 Minor

Fix the arity mismatch in `apply_rotary_pos_emb`: it calls `rotate_half` with 1 argument but the function requires 4.

`rotate_half` on line 90 requires 4 parameters (`x`, `x_observer`, `x2_neg_fake_quant`, `concat_observer`), but `apply_rotary_pos_emb` calls it on lines 121–122 with only 1 argument. This will raise `TypeError` at runtime if `apply_rotary_pos_emb` is ever invoked.

The actual RoPE implementation in the attention forward method (lines 281–299) calls `rotate_half` with all 4 arguments correctly, so the bug only affects the unused `apply_rotary_pos_emb` function. Suggest either updating `apply_rotary_pos_emb` to pass all required arguments or removing it if it's not needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around lines 90 - 123, The apply_rotary_pos_emb function calls rotate_half(q) and rotate_half(k) but rotate_half requires four parameters (x, x_observer, x2_neg_fake_quant, concat_observer); update apply_rotary_pos_emb to accept and forward those observers/quantizers (e.g., add parameters x_observer, x2_neg_fake_quant: ActivationQDQ, concat_observer: ConcatObserver to apply_rotary_pos_emb signature and call rotate_half(q, x_observer, x2_neg_fake_quant, concat_observer) and rotate_half(k, x_observer, x2_neg_fake_quant, concat_observer)), or if apply_rotary_pos_emb is unused remove it—ensure the fix mirrors how rotate_half is invoked in the attention forward implementation.
🧹 Nitpick comments (32)
pymllm/orchestrator/parallel_state.py (3)
36-37: Silent early return when `dist` is not initialized may mask configuration errors.

If a caller invokes `initialize_model_parallel(tp=4, dp=2, ...)` without first calling `dist.init_process_group`, the function silently returns without setting any groups or config. Downstream code relying on `get_tp_group()` etc. will get `None` with no indication of why. Consider logging a warning so that misconfigured setups are easier to diagnose.

♻️ Proposed change

```diff
     if not dist.is_initialized():
+        logger.warning(
+            "torch.distributed is not initialized; "
+            "skipping model parallel group setup."
+        )
         return
```

As per coding guidelines, "Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 36 - 37, The early return guarded by dist.is_initialized() in initialize_model_parallel(tp=..., dp=...) should not be silent; update it to log a warning (using the module logger) that process groups were not initialized and advise calling torch.distributed.init_process_group before initialize_model_parallel, so downstream calls like get_tp_group() don't fail silently; ensure the message includes which function returned early (initialize_model_parallel) and the check (dist.is_initialized()) to aid debugging.
20-25: No thread safety or re-initialization guard on the global state.
`initialize_model_parallel` mutates module-level globals without any guard against concurrent or repeated calls. A second call would leak the prior `dist.new_group` handles. Consider adding an early check via `model_parallel_is_initialized()` and raising if already initialized, or explicitly destroying old groups first.

♻️ Example guard

```diff
 def initialize_model_parallel(
     tensor_model_parallel_size: int = 1,
     data_parallel_size: int = 1,
     pipeline_model_parallel_size: int = 1,
     backend: str = "nccl",
 ) -> None:
+    if model_parallel_is_initialized():
+        raise RuntimeError(
+            "Model parallel groups are already initialized. "
+            "Call destroy_model_parallel() first."
+        )
+
     global _TP_GROUP, _DP_GROUP, _PP_GROUP
```

Also applies to: 177-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 20 - 25, The initialize_model_parallel function mutates module-level globals and may leak prior dist.new_group handles on concurrent or repeated calls; add a re-initialization guard by checking model_parallel_is_initialized() at the top of initialize_model_parallel and either raise a clear exception if already initialized or call a cleanup routine (e.g., destroy_model_parallel or a new teardown function) to explicitly destroy existing groups before creating new ones; also protect the initialization logic with a simple threading.Lock to ensure thread-safety when creating dist.new_group handles and apply the same guard/cleanup pattern to the other initialization site(s) that mutate the same globals.
66-72: Same `assert`-based validation concern as in `group_coordinator.py`.

This is the primary invariant for the entire parallelism setup. If the product doesn't match `world_size`, the system will silently compute incorrect groups under `-O`. Raise a `ValueError` instead.

♻️ Proposed fix

```diff
-    assert (
-        tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
-        == world_size
-    ), (
-        f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
-        f"PP({pipeline_model_parallel_size}) != World({world_size})"
-    )
+    total = tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
+    if total != world_size:
+        raise ValueError(
+            f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
+            f"PP({pipeline_model_parallel_size}) = {total} != World({world_size})"
+        )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 66 - 72, Replace the assert-based check that multiplies tensor_model_parallel_size, data_parallel_size, and pipeline_model_parallel_size against world_size with an explicit runtime validation that raises a ValueError when the product doesn't equal world_size; locate the expression that currently uses assert (the block referencing tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size == world_size and the f-string message) and change it to compute the product, compare to world_size, and raise ValueError(f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * PP({pipeline_model_parallel_size}) != World({world_size})") when they mismatch.

pymllm/orchestrator/group_coordinator.py (2)
72-77: `assert` for input validation is stripped under `python -O`; prefer raising exceptions in public API functions.

`divide()` and `split_tensor_along_dim()` are public API (exported via `__init__.py`). `assert` statements are removed when Python runs with the `-O` flag, silently disabling these guards and leading to incorrect silent behavior (e.g., truncated results from integer division). Use explicit `ValueError` raises for robustness.

♻️ Example for `divide`; apply the same pattern to `split_tensor_along_dim`

```diff
 def divide(numerator: int, denominator: int) -> int:
     """Divide and ensure divisibility."""
-    assert numerator % denominator == 0, (
-        f"{numerator} is not divisible by {denominator}"
-    )
+    if numerator % denominator != 0:
+        raise ValueError(f"{numerator} is not divisible by {denominator}")
     return numerator // denominator
```

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions" and "Validate inputs for public APIs and critical internal functions."
Also applies to: 80-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/group_coordinator.py` around lines 72 - 77, The public functions divide and split_tensor_along_dim currently use assert for input validation which is removed under python -O; replace those asserts with explicit exception raises (e.g., raise ValueError(...)) that include the same descriptive message so invalid inputs (non-divisible numerator/denominator or bad split parameters) raise a clear exception instead of failing silently; update the divide function to raise ValueError when numerator % denominator != 0 and apply the same pattern to split_tensor_along_dim to validate its inputs and raise ValueError with contextual messages while preserving the existing return behavior.
19-37: `local_rank` is stored but never used within the class.

`self.local_rank` is assigned but never referenced by any method in `GroupCoordinator`. If it's only needed by external callers (e.g., for device placement), that's fine, but consider documenting that intent. Otherwise, it's dead state that may mislead future maintainers.
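If the attribute is meant for external callers, one lightweight way to document that intent (a sketch; it assumes the stored value is renamed to `_local_rank`) is a read-only property:

```python
@property
def local_rank(self) -> int:
    """Local device index for this rank; exposed for callers doing device placement."""
    return self._local_rank
```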
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/group_coordinator.py` around lines 19 - 37, The constructor currently assigns self.local_rank but nothing in GroupCoordinator uses it; either remove the local_rank parameter and attribute from GroupCoordinator.__init__ (and update callers) to avoid dead state, or explicitly document its purpose for external use by adding a docstring/param comment to the GroupCoordinator class and keeping self.local_rank; if you choose removal, delete the local_rank argument and any assignment to self.local_rank and update instantiations, and if you choose to keep it, add a clear docstring on GroupCoordinator and/or expose it via a property (e.g., def local_rank(self): return self._local_rank) so the intent is explicit (references: GroupCoordinator.__init__, self.local_rank, rank_in_group, device_group).

pymllm/orchestrator/__init__.py (1)
25-48: Categorical grouping in `__all__` is reasonable; ignore the sort lint.

The static analysis tool flags that `__all__` is not alphabetically sorted (RUF022), but the current categorical grouping (GroupCoordinator, TP, DP, PP, State) is intentional and more readable for this API surface. Consider adding a `# noqa: RUF022` comment if you want to suppress the lint.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/__init__.py` around lines 25 - 48, The linter flags the __all__ list as unsorted (RUF022) even though the categorical grouping is intentional; suppress the warning by adding a noqa comment for RUF022 to the __all__ definition (e.g., append "# noqa: RUF022" to the __all__ assignment) so the grouped exports (including symbols like "GroupCoordinator", "divide", "get_tp_group", "data_parallel_all_reduce", "get_pp_group", "initialize_model_parallel", etc.) keep their categorical order without lint noise.

pymllm/mobile/backends/qualcomm/nn.py (1)
4-11: Add docstrings and document placeholder intent for `QnnSoftmax` and `QnnRoPE`.

Both classes are bare subclasses that add no behavior. Per coding guidelines, public classes must have docstrings explaining purpose. Without any comment it is also unclear whether these are intentional stubs awaiting Qualcomm-specific overrides or permanent thin wrappers.
♻️ Proposed fix: add minimal docstrings
```diff
 class QnnSoftmax(Softmax):
+    """Qualcomm-specific Softmax layer.
+
+    Thin wrapper around Softmax; reserved for Qualcomm QNN-specific
+    operator overrides.
+    """
     def __init__(self):
         super().__init__()


 class QnnRoPE(RoPE):
+    """Qualcomm-specific Rotary Position Embedding layer.
+
+    Thin wrapper around RoPE; reserved for Qualcomm QNN-specific
+    operator overrides.
+    """
     def __init__(self):
         super().__init__()
```

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/nn.py` around lines 4 - 11, The public classes QnnSoftmax and QnnRoPE are empty subclasses with no docstrings; add concise class docstrings to each explaining their intent (e.g., "Qualcomm-specific subclass of Softmax/RoPE; currently a thin wrapper/placeholder for future Qualcomm overrides") and note any expected behavior or why they exist (compatibility shim, vendor-specific hooks), so readers know they are intentional stubs and not omissions; ensure the docstrings appear directly under the class definitions for QnnSoftmax and QnnRoPE.

pymllm/mobile/ffi/base.py (1)
9-25: Add a pre-flight existence check and a descriptive error in `_load_lib()`.

`tvm_ffi.load_module(lib_path)` is called without checking whether `lib_path` exists. On any environment where the native library hasn't been built or installed, the module import fails with a raw OS/TVM error that gives no indication of which path was searched or what the user should do. This is worsened by the module-level `_LIB = _load_lib()` at line 25, which means the error fires on any `import` of this module.

♻️ Proposed fix: descriptive pre-flight check

```diff
     lib_path = os.path.join(parent_dir, "lib", lib_name)
+    if not os.path.exists(lib_path):
+        raise RuntimeError(
+            f"MllmFFIExtension library not found at '{lib_path}'. "
+            "Ensure the native library is built and installed correctly."
+        )
     return tvm_ffi.load_module(lib_path)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/ffi/base.py` around lines 9 - 25, The current _load_lib() calls tvm_ffi.load_module(lib_path) without verifying the file exists, causing opaque import-time errors; update _load_lib to check os.path.exists(lib_path) (or os.path.isfile) before calling tvm_ffi.load_module and, if the file is missing, raise a clear FileNotFoundError/RuntimeError that includes the full lib_path, the platform-determined lib_name, and a short hint about building or installing the native library; keep the _LIB = _load_lib() behavior but ensure the new error message surfaces the searched path and recovery steps so users know what to do.

pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)
379-382: Typo: `layer_dix` should be `layer_idx`.

`self.layer_dix = layer_idx` (line 382) is a persistent typo. It is used consistently as `layer_dix` within the class (lines 394, 397, 417), so there is no immediate runtime error, but it clashes with the attribute name `layer_idx` used by the parent `Qwen3Attention` and the rest of the decoder stack.

🧹 Proposed rename (all occurrences in this class)

```diff
-        self.layer_dix = layer_idx
+        self.layer_idx = layer_idx
```

Then update lines 394, 397, 417 from `self.layer_dix` → `self.layer_idx`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around lines 379 - 382, The class Qwen3DecoderLayer has a persistent typo: the attribute was set as self.layer_dix instead of self.layer_idx; update the attribute assignment in Qwen3DecoderLayer.__init__ to self.layer_idx = layer_idx and rename all uses of self.layer_dix to self.layer_idx within this class (including places that interact with Qwen3Attention) so the decoder layer attribute matches the expected layer_idx name used elsewhere.
105-111: Unused `x_observer` parameter in `rotate_half`.

`x_observer` is accepted but never referenced in the function body. If it is no longer needed, remove it; if it was meant to be used (e.g., to trigger an observer forward pass on `x`), add the missing call.
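A sketch of the second option, following the call order described in the agent prompt below; the half-split via `chunk` is the usual rotate-half convention and whether the observer should actually run here is an assumption for the author to confirm:

```python
import torch


def rotate_half(x, x_observer, x2_neg_fake_quant, concat_observer):
    x = x_observer(x)  # record the input activation before rotating (assumed intent)
    x1, x2 = x.chunk(2, dim=-1)
    return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))
```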
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around lines 105 - 111, The parameter x_observer in function rotate_half is unused; either remove it from rotate_half's signature and all callers if observers are not needed, or invoke it to trigger observation (e.g., call x_observer(x) before returning) so the input tensor is recorded; update the rotate_half definition and ensure symbols x2_neg_fake_quant and concat_observer are still applied in the existing order (use x_observer(x) then return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))) and adjust any call sites accordingly.

pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py (3)
219-234: Use the already-defined `num_steps` constant instead of repeating `2**bitwidth_of_scale`.

`num_steps` is computed at line 221 but `2**bitwidth_of_scale` is used again directly at line 231, breaking the named-constant rule.

🧹 Proposed fix

```diff
-            max=2**bitwidth_of_scale,
+            max=num_steps,
```

As per coding guidelines, "Use named constants instead of magic numbers."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines 219 - 234, Replace the duplicated magic expression 2**bitwidth_of_scale with the already-defined constant num_steps in the quantization loop: inside the loop in qlinear.py where q_scales is computed (the torch.clamp call that currently uses max=2**bitwidth_of_scale), change that max to num_steps to use the named constant and avoid repeating the exponentiation.
32-36: Redundant `is not None` guard inside the outer null-check.

The outer `if self.weight_quant is not None:` at line 32 already ensures the inner branch is never reached with `None`. The `and self.weight_quant is not None` on line 36 adds nothing.

🧹 Proposed cleanup

```diff
-        if (
-            isinstance(self.weight_quant, PerBlockParamFakeQuantize)
-            and self.weight_quant is not None
-        ):
+        if isinstance(self.weight_quant, PerBlockParamFakeQuantize):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines 32 - 36, Remove the redundant null-check: inside the outer guard "if self.weight_quant is not None:" drop the duplicated "and self.weight_quant is not None" from the inner condition that tests "isinstance(self.weight_quant, PerBlockParamFakeQuantize)"; leave only the isinstance check so the inner branch reads like a single isinstance(self.weight_quant, PerBlockParamFakeQuantize) test referencing the existing self.weight_quant variable.
205-215: Replace `assert` with an explicit `ValueError` for runtime data validation.

`assert` statements are silently stripped when Python runs with `-O`; a data-validity check on user-provided tensors should survive in optimized mode.

🛡️ Proposed fix

```diff
-assert self.weight.shape[-1] % self.block_size[1] == 0
-assert linear_zero_point.sum() == 0
+if self.weight.shape[-1] % self.block_size[1] != 0:
+    raise ValueError(
+        f"in_features ({self.weight.shape[-1]}) must be divisible by "
+        f"block_size ({self.block_size[1]})"
+    )
+if linear_zero_point.sum() != 0:
+    raise ValueError("LPBQ deployment requires symmetric quantization (zero_point must be all zeros)")
```

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines 205 - 215, Replace the two runtime assertions in QLinear's quantization block with explicit checks that raise ValueError: check that self.weight.shape[-1] % self.block_size[1] == 0 and raise ValueError with a clear message if not, and check that linear_zero_point.sum() == 0 and raise ValueError with a clear message if not; keep the rest of the flow (calling _quantize_affine and assigning weight_int4) unchanged so invalid tensor shapes or zero-point sums still raise informative exceptions at runtime.

mllm-kernel/pyproject.toml (1)
3-3: `apache-tvm-ffi` is unpinned in `[build-system].requires` but pinned to `0.1.8` in `[project].dependencies`.

During the `scikit-build-core` build step a different (potentially incompatible) version of `apache-tvm-ffi` could be resolved, while the installed package enforces `== 0.1.8` at runtime. If the build scripts rely on the same API surface, pin the build-system requirement to match.

🔧 Proposed fix

```diff
 requires = [
-    "scikit-build-core>=0.11.0", "apache-tvm-ffi", "torch-c-dlpack-ext"
+    "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8", "torch-c-dlpack-ext"
 ]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/pyproject.toml` at line 3, The build-system requirement for apache-tvm-ffi is unpinned while the runtime dependency in [project].dependencies is pinned to 0.1.8; update the [build-system].requires entry for "apache-tvm-ffi" in pyproject.toml to pin it to "apache-tvm-ffi==0.1.8" so the build resolver uses the same version as the runtime and avoids API/compatibility mismatches.

pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py (1)
127-133: `x_observer` parameter accepted but never used in `rotate_half`.

Same issue as the Qwen3 variant: the `x_observer` argument (line 128) is passed at every call-site but the function body ignores it entirely. Either remove it from the signature (and all call-sites) or add the intended forward/observation logic.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py` around lines 127 - 133, The rotate_half function currently accepts x_observer but never uses it; either remove x_observer from the signature and from all call-sites, or apply the observer to the input (e.g., call x = x_observer(x) or x_observer.observe(x) as appropriate) before splitting/rotating so the input is recorded; update the function rotate_half and all callers to match the chosen approach and keep the existing ActivationQDQ x2_neg_fake_quant and ConcatObserver concat_observer usage intact.

pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (2)
305-340: `tqdm` progress bar never closed.

`pbar` is created on line 305 but never explicitly closed. The final display update (e.g., elapsed time) will be missing, and if multiple calibration runs occur in the same process, stale bars may accumulate. Use a context manager:

♻️ Proposed fix

```diff
-    pbar = tqdm(total=num_samples, desc="Calibrating")
-    for entry in dataset:
-        ...
-        pbar.update(1)
+    with tqdm(total=num_samples, desc="Calibrating") as pbar:
+        for entry in dataset:
+            ...
+            pbar.update(1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 305 - 340, The tqdm progress bar instantiated as pbar in the calibration loop is never closed; wrap the progress bar in a context manager or explicitly close it to avoid stale bars (e.g., use with tqdm(total=num_samples, desc="Calibrating") as pbar: around the for entry in dataset loop or call pbar.close() after the loop). Update the block where pbar is created and used (variable pbar, the for entry in dataset loop, and samples_processed updates) so the progress bar is safely closed regardless of early breaks or exceptions.
19-195: Significant code duplication across runner modules.

The utility functions `recompute_scale_zp`, `validate_concat_observer_fn`, `disable_qdq_observer`, `enable_qdq_observer`, `enable_fake_quant`, `disable_fake_quant`, and `convert_weight` (lines 19–195) are byte-for-byte identical in `qwen3/runner.py`, `llama/runner.py`, and likely `qwen2/runner.py`. This violates DRY and makes any future fix/enhancement require triple (or more) edits.

Consider extracting these into a shared `pymllm/mobile/backends/qualcomm/transformers/core/runner_utils.py` module and importing from there.
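A sketch of what the call sites could look like after such an extraction; the module path comes from the suggestion above, and the wholesale re-import style is an assumption:

```python
# qwen3/runner.py (and likewise llama/runner.py, qwen2/runner.py) after moving the helpers:
from pymllm.mobile.backends.qualcomm.transformers.core.runner_utils import (
    recompute_scale_zp,
    validate_concat_observer_fn,
    disable_qdq_observer,
    enable_qdq_observer,
    enable_fake_quant,
    disable_fake_quant,
    convert_weight,
)
```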
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 19 - 195, These functions (recompute_scale_zp, validate_concat_observer_fn, disable_qdq_observer, enable_qdq_observer, enable_fake_quant, disable_fake_quant, convert_weight) are duplicated across runner modules—extract them into a shared module (e.g., create a new runner_utils.py under transformers/core) and move the implementations there, then replace the in-file definitions in qwen3/runner.py (and the identical ones in llama/runner.py and qwen2/runner.py) with imports from that shared module; ensure the shared functions keep the same signatures and continue to reference ActivationQDQ, ConcatObserver, QLinearLPBQ, QLinearW8A16_PerChannelSym, QRMSNorm, QEmbedding, etc., so callers (model.apply or named_modules loops) work unchanged.

pyproject.toml (2)
24-24: Exact-version pin `apache-tvm-ffi == 0.1.8` prevents receiving security/bug-fix patches.

Strict `==` pinning in `dependencies` (not just `build-system.requires`) means any downstream environment that already has a newer minor release will fail to resolve. Consider a compatible-release specifier:

♻️ Proposed loosening

```diff
-    "apache-tvm-ffi == 0.1.8",
+    "apache-tvm-ffi >= 0.1.8, < 0.2.0",
```

If ABI breakage is the concern, at minimum document why strict pinning is required here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` at line 24, The dependency line pins apache-tvm-ffi exactly to "apache-tvm-ffi == 0.1.8", which blocks patch/minor updates and can break resolution in downstream environments; change this to a compatible-release specifier such as "apache-tvm-ffi ~= 0.1.8" or "apache-tvm-ffi >=0.1.8,<0.2.0" in pyproject.toml to allow receiving bug/security fixes while preventing ABI bumps, and if you truly require an exact pin keep the existing strict pin but add a short justification comment in the project docs explaining why exact-version pinning of apache-tvm-ffi is necessary.
21-23: `pytest` and `pytest-html` should not be runtime dependencies.

Test runner packages installed into every end-user environment bloat the install footprint and create unnecessary version conflicts. Move them to an optional `[dev]` or `[test]` group:

♻️ Proposed fix

```diff
 dependencies=[
     "packaging",
-    "pytest",
-    "pytest-html",
     "apache-tvm-ffi == 0.1.8",
     ...
 ]
+[project.optional-dependencies]
+dev = ["pytest", "pytest-html"]
 cuda = ["tilelang", "flashinfer-python"]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 21 - 23, The pyproject currently lists "pytest" and "pytest-html" as regular runtime dependencies; remove those two entries from the main dependencies list and add them to a development/test-only group instead (for example add them under tool.poetry.dev-dependencies if using Poetry, or under project.optional-dependencies.test if following PEP 621). Ensure you delete the "pytest" and "pytest-html" lines from the dependencies block and add equivalent entries in the chosen dev/test section so they are only installed for development/testing.

pymllm/layers/utils.py (1)
3-10: Inconsistent type hint style: `Dict` from `typing` is mixed with the `|` union syntax.

Line 3 imports `Dict` from `typing`, but Line 10 uses `Dict[str, Any] | None` with the `|` union operator (Python 3.10+). Since you're already using 3.10+ syntax, consider using the built-in `dict` instead of the imported `Dict` for consistency:

✏️ Suggested change

```diff
-from typing import Any, Dict
+from typing import Any

 import torch


 def set_weight_attrs(
     weight: torch.Tensor,
-    weight_attrs: Dict[str, Any] | None,
+    weight_attrs: dict[str, Any] | None,
 ) -> None:
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/utils.py` around lines 3 - 10, The type hint style is inconsistent: remove the unused Dict import and change the annotation on set_weight_attrs (parameter weight_attrs) from Dict[str, Any] | None to the modern built-in form dict[str, Any] | None to match Python 3.10+ union syntax; keep the Any import and update the function signature accordingly (refer to the function name set_weight_attrs and parameter weight_attrs).

pymllm/tests/test_vocab_parallel_embedding.py (2)
61-62: Hardcoded `MASTER_PORT` may cause conflicts in CI.

Port `29500` is a common default and may already be in use by other distributed tests or services running in parallel. Consider using a dynamic free port or at minimum reading from an environment variable with a fallback.

✏️ Suggested change

```diff
 os.environ["MASTER_ADDR"] = "localhost"
-os.environ["MASTER_PORT"] = "29500"
+os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", "29500")
```
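For the fully dynamic option mentioned above, a small helper along these lines (a sketch; the helper name is illustrative) asks the OS for an unused ephemeral port:

```python
import os
import socket


def find_free_port() -> int:
    # Binding to port 0 lets the OS pick an unused ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("localhost", 0))
        return sock.getsockname()[1]


os.environ.setdefault("MASTER_PORT", str(find_free_port()))
```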
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 61 - 62, Replace the hardcoded MASTER_PORT assignment in the test setup (currently setting os.environ["MASTER_PORT"] = "29500") with logic that first reads an optional env var and otherwise selects a free ephemeral port at runtime; e.g., use os.environ.get("MASTER_PORT") as primary and fall back to a dynamically discovered free port (via a helper like find_free_port() that binds a socket to port 0 to obtain an available port) before assigning os.environ["MASTER_PORT"], keeping os.environ["MASTER_ADDR"] assignment as-is.
22-23: Module-level `logging.basicConfig(force=True)` can interfere with other tests.

Using `force=True` overwrites the root logger configuration, which can affect logging behavior of other test modules when run in the same pytest session. Consider scoping this to a fixture or using a module-specific logger instead.
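A minimal sketch of the module-scoped alternative (kept deliberately simple; whether DEBUG is the right level here is for the author to decide):

```python
import logging

# Configure only this test module's logger instead of the root logger.
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
```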
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 22 - 23, Remove the module-level call that forcefully reconfigures the root logger; instead stop using logging.basicConfig(..., force=True) and avoid touching the root logger with logging.getLogger().setLevel at import time. Update this test module to either configure a module-scoped logger (use logging.getLogger(__name__) and set its level) or move root-level configuration into a pytest fixture (use caplog or a test-scoped setup) so other tests aren't affected; look for the existing calls logging.basicConfig and logging.getLogger().setLevel in this file and replace them with a module logger or a fixture-based configuration.

pymllm/layers/base.py (1)
8-27: Consider adding a class-level docstring for `MllmBaseLayer`.

This is the base class for all layers in the `pymllm.layers` package and is part of the public API (re-exported from `pymllm.layers.__init__`). A brief docstring explaining its role, how subclasses should override `weight_loader` and `forward`, and the purpose of `quant_recipe` would help consumers.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
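One possible shape for such a docstring (a sketch only; the wording is a suggestion, and the attribute/method names are those referenced in this review):

```python
from torch import nn


class MllmBaseLayer(nn.Module):
    """Base class for all layers in ``pymllm.layers``.

    Subclasses override ``weight_loader`` to copy (and, under tensor
    parallelism, shard) checkpoint tensors into their parameters, and
    ``forward`` to implement the layer computation. ``quant_recipe``
    optionally carries the quantization settings applied to the layer.
    """
```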
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/base.py` around lines 8 - 27, Add a concise class-level docstring to MllmBaseLayer describing its role as the base class for layers in pymllm.layers, documenting the quant_recipe attribute's purpose (optional QuantRecipe used for quantization settings), describing that subclasses should override weight_loader to implement custom weight-loading/sharding behavior and forward to implement the layer's forward pass, and briefly list expected parameters/returns/behaviour for weight_loader (param: Parameter, loaded_weight: torch.Tensor) and forward (raise NotImplementedError by default). Reference MllmBaseLayer, weight_loader, forward, and the quant_recipe attribute in the docstring so consumers of the public API understand usage and extension points.

pymllm/layers/__init__.py (1)
7-11: `__all__` is not sorted (Ruff RUF022).

✏️ Suggested fix

```diff
 __all__ = [
     "MllmBaseLayer",
-    "set_weight_attrs",
     "VocabParallelEmbedding",
+    "set_weight_attrs",
 ]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/__init__.py` around lines 7 - 11, The __all__ list is not alphabetically sorted (Ruff RUF022); reorder the exported symbols in the __all__ list so they are lexicographically sorted — e.g., place "MllmBaseLayer", "VocabParallelEmbedding", and "set_weight_attrs" in alphabetical order (adjust the list containing MllmBaseLayer, set_weight_attrs, VocabParallelEmbedding accordingly) to satisfy the linter.

pymllm/quantization/quant_recipe.py (1)
1-3: Add a docstring to the public `QuantRecipe` class.

This class is referenced as a type annotation in `MllmBaseLayer` (in `pymllm/layers/base.py`). A brief docstring explaining its intended purpose would help consumers understand it's a placeholder for quantization configuration.

✏️ Suggested change

```diff
 class QuantRecipe:
+    """Configuration recipe for quantization.
+
+    This is a placeholder class that will be extended with quantization
+    parameters and methods as quantization support matures.
+    """
+
     def __init__(self):
         pass
```

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/quantization/quant_recipe.py` around lines 1 - 3, Add a brief docstring to the public QuantRecipe class describing its purpose as a placeholder/structured container for quantization configuration and how it is intended to be used (e.g., fields or subclasses should represent quantization settings) so consumers and type annotations (such as MllmBaseLayer in pymllm/layers/base.py) understand its role; place the docstring immediately under the class declaration for QuantRecipe and keep it concise, describing expected responsibilities, typical usage, and any important invariants or extension points.

pymllm/layers/embedding.py (3)
73-112: Prefer raising exceptions over `assert` for runtime validation in public API methods.

`assert` statements are stripped when Python runs with `-O` (optimized mode), silently disabling these shape checks. Since `weight_loader` is part of the public API and loads external data, these checks should use explicit exceptions (e.g., `ValueError`/`RuntimeError`) to guarantee validation at all times.

Example for one assertion (apply same pattern to others)

```diff
-        assert param.data.shape == loaded_weight.shape, (
-            f"Shape mismatch: param {param.data.shape} vs "
-            f"loaded {loaded_weight.shape}"
-        )
+        if param.data.shape != loaded_weight.shape:
+            raise ValueError(
+                f"Shape mismatch: param {param.data.shape} vs "
+                f"loaded {loaded_weight.shape}"
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/embedding.py` around lines 73 - 112, Replace all runtime validation assert statements in the weight_loader method with explicit exceptions so checks are always enforced; for each assert in weight_loader (the shape equality check when no sharding, the loaded_weight vocab-size check using output_dim vs self.num_embeddings, and the shard shape equality check), raise an appropriate exception (ValueError or RuntimeError) with the same descriptive message instead of using assert; ensure both branches (output_dim is None or tp_size == 1 and the sharded branch that uses self.vocab_start_index/self.vocab_end_index or .narrow with self.num_embeddings_per_partition) perform the validations via exceptions before calling param.data.copy_(...) so errors are never skipped under -O.
31-31: Use explicit `Optional[int]` or `int | None` instead of implicit `Optional`.

PEP 484 prohibits implicit `Optional`. This was also flagged by Ruff (RUF013).

Proposed fix

```diff
-        padding_idx: int = None,
+        padding_idx: int | None = None,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/embedding.py` at line 31, The parameter padding_idx currently uses an implicit Optional via "padding_idx: int = None"; update its type annotation to an explicit Optional by changing it to "padding_idx: Optional[int] = None" (or "padding_idx: int | None = None" for Python 3.10+), and add the corresponding import from typing (Optional) if you choose that form; ensure you update the signature where padding_idx is declared so Ruff (RUF013) and PEP 484 rules are satisfied.
44-50: Redundant divisibility check: both the `if` (Line 44) and `divide()` (Line 50) validate the same condition.

The explicit `ValueError` on Line 44 is the better check since it gives a clear message and can't be stripped. The subsequent `divide()` call on Line 50 internally asserts the same condition. This is harmless but worth noting — the `divide()` assert is now dead code for this path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/embedding.py` around lines 44 - 50, Keep the explicit divisibility check and clear ValueError for num_embeddings vs tp_size, but remove the redundant runtime assertion by avoiding the call to divide() here: set self.num_embeddings_per_partition using integer division of num_embeddings by self.tp_size (e.g., compute num_embeddings // self.tp_size) instead of calling divide(), so the check remains in this function (ValueError in the if) and the duplicate assert inside divide() is not hit for this path; update references to num_embeddings_per_partition accordingly.

pymllm/mobile/__init__.py (1)
1-45: Consider defining `__all__` to control the public API surface.

Without `__all__`, a wildcard import (`from pymllm.mobile import *`) will also pull in internal submodule names like `ffi`, `convertor`, `utils`, `quantize`, `nn`, `service`, and `backends`. Defining `__all__` explicitly would keep the public surface intentional and consistent with how `pymllm/__init__.py` manages its exports.

Example

```python
__all__ = [
    # dtypes
    "float32", "float16", "bfloat16",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "boolean",
    # devices
    "cpu", "cuda", "qnn",
    # core
    "Tensor", "empty", "echo", "device",
    "is_torch_available", "is_numpy_available",
    "from_torch", "from_numpy",
    "zeros", "ones", "arange", "random",
    # ops
    "matmul",
    # submodules (if intentionally public)
    "ffi", "nn", "backends", "convertor", "utils", "quantize", "service",
]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/__init__.py` around lines 1 - 45, Add an explicit __all__ list in pymllm.mobile.__init__ to control the public API surface: include the public symbols currently re-exported from .ffi (float32, float16, bfloat16, int8, int16, int32, int64, uint8, uint16, uint32, uint64, boolean, cpu, cuda, qnn, Tensor, empty, echo, device, is_torch_available, is_numpy_available, from_torch, from_numpy, zeros, ones, arange, random), plus matmul from .nn.functional; decide whether to also expose submodules (ffi, nn, backends, convertor, utils, quantize, service) and only include them in __all__ if they are intentionally public. Ensure the name __all__ is defined at module level so wildcard imports and documentation reflect the intended public API.

pymllm/__init__.py (1)
26-27: Add a docstring to the public `is_mobile_available()` function.

Per coding guidelines, public APIs should have clear docstrings explaining purpose and return value.

Proposed fix

```diff
 def is_mobile_available() -> bool:
+    """Check whether the mobile (on-device) FFI libraries are available.
+
+    Returns:
+        True if the platform-specific MllmFFIExtension shared library
+        exists under pymllm/lib/, False otherwise.
+    """
     return _has_mobile_libs()
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/__init__.py` around lines 26 - 27, Add a concise docstring to the public function is_mobile_available() that explains its purpose (checking if mobile-specific dependencies are present), what it returns (bool indicating availability), and any notes about it delegating to the internal helper _has_mobile_libs(); place the docstring immediately below the def is_mobile_available() signature and follow project docstring style (one-line summary and a short return description).

pymllm/configs/__init__.py (1)
12-21: `__all__` is not sorted (Ruff RUF022).

♻️ Proposed fix

```diff
 __all__ = [
-    # Main singleton
-    "GlobalConfig",
-    "get_global_config",
-    # Sub configs
-    "ServerConfig",
-    "ModelConfig",
-    "RuntimeConfig",
     "CacheConfig",
+    "GlobalConfig",
+    "ModelConfig",
+    "RuntimeConfig",
+    "ServerConfig",
+    "get_global_config",
 ]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/configs/__init__.py` around lines 12 - 21, The __all__ list in pymllm.configs.__init__ is not alphabetized; update the __all__ definition so the exported symbol names are sorted alphabetically (e.g., "CacheConfig", "GlobalConfig", "ModelConfig", "RuntimeConfig", "ServerConfig", "get_global_config") while preserving it as a list of strings so Ruff RUF022 is satisfied and the module exports remain the same.

pymllm/configs/global_config.py (1)
173-173: Using `base_gpu_id < 0` as an undocumented CPU sentinel is confusing.

The mapping `device = "cuda" if base_gpu_id >= 0 else "cpu"` encodes CPU selection as a negative GPU index, a convention that is non-obvious and unmentioned in the `ServerConfig` field comment (`# Base GPU index used for process-to-device mapping`). A typo like `-1` intended as "auto" would silently switch the device to CPU.

Consider an explicit device field on `ServerConfig` (e.g., `device: Literal["cuda","cpu","auto"] = "auto"`) or at least document this sentinel contract on the `base_gpu_id` field.
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pymllm/configs/global_config.py` at line 173, The code currently treats server_config.base_gpu_id < 0 as a hidden "use CPU" sentinel; update ServerConfig to include an explicit device field (e.g., device: Literal["cuda","cpu","auto"] = "auto") and change the runtime.device assignment to honor server_config.device first (if "cuda" set device="cuda", if "cpu" set device="cpu", if "auto" fall back to using base_gpu_id >= 0 ? "cuda" : "cpu"); also add a brief comment on the ServerConfig.device field explaining the semantics so the contract is explicit (refer to ServerConfig, server_config.base_gpu_id, and runtime.device).
Summary by CodeRabbit
New Features
Dependencies
Documentation