
feat(pymllm): VocabParallelEmbedding & pymllm's cuda infra init #640

Open

chenghuaWang wants to merge 4 commits into UbiquitousLearning:main from chenghuaWang:wch-main

Conversation

chenghuaWang (Collaborator) commented Feb 18, 2026

Summary by CodeRabbit

  • New Features

    • Added comprehensive global configuration system for server, model, runtime, and cache management
    • Introduced platform-aware mobile support detection and improved mobile APIs
    • Implemented tensor-parallel vocabulary embeddings for distributed inference
    • Added distributed model parallelism support (tensor, data, pipeline)
    • Added CLI command to display system configuration
  • Dependencies

    • Updated apache-tvm-ffi to version 0.1.8
    • Added optional flashinfer-python for CUDA support
  • Documentation

    • Updated mllm_kernel usage patterns to a unified decorator-based approach

…mple

- Replaced the previous JIT utility functions with a streamlined `jit` decorator for kernel registration.
- Updated the README.md to reflect the new recommended pattern for CPU kernel implementation.
- Simplified the JIT compilation example, focusing on clarity and ease of use.
- Updated `apache-tvm-ffi` version to `0.1.8` in `pyproject.toml` and `mllm-kernel/pyproject.toml`.
- Refactored mobile module imports and structure, moving scripts to `pymllm.mobile` and removing unused backends.
- Introduced new classes and methods for quantization and model deployment in the Qualcomm backend.
- Added new README files for mobile and Qualcomm transformer components.
- Added `flashinfer-python` to the optional `cuda` dependencies in `pyproject.toml`.
- Introduced new configuration files for server, model, and layers to centralize runtime settings.
- Created initial structure for various layers and components to support future development.
- Added main entry points for `pymllm` and `mllm-kernel` in their respective `pyproject.toml` files.
- Implemented a configuration module for `pymllm` to manage global settings, including server, model, runtime, and cache configurations.
- Introduced the `VocabParallelEmbedding` layer and utility functions for weight management in the layers module.
- Created initial tests for the `VocabParallelEmbedding` layer to validate functionality with tensor parallelism.
coderabbitai bot (Contributor) commented Feb 18, 2026

📝 Walkthrough

This PR introduces comprehensive configuration management, distributed training orchestration, and tensor-parallel layer components, and refactors mobile backend imports. It adds a global configuration system with server/model/runtime/cache configs, distributed group coordination, and embedding parallelism support, and consolidates CLI entry points. Multiple import paths in the mobile backends are updated to the new mobile-qualified namespace, and the FFI bindings expose ParameterFileObj. Dependencies are pinned to specific versions.
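For orientation, the tensor-parallel embedding pattern mentioned here (each rank holds a vocabulary shard, out-of-range token ids are masked, and the partial results are combined with an all-reduce) can be sketched as follows. This is a generic illustration of the technique under assumed names, not the PR's actual VocabParallelEmbedding:

import torch
import torch.distributed as dist
import torch.nn.functional as F

class ToyVocabParallelEmbedding(torch.nn.Module):
    # Hypothetical, simplified version for illustration only.
    def __init__(self, num_embeddings: int, embedding_dim: int, tp_rank: int, tp_size: int):
        super().__init__()
        # Each tensor-parallel rank owns a contiguous slice of the vocabulary.
        self.per_partition = num_embeddings // tp_size
        self.vocab_start = tp_rank * self.per_partition
        self.vocab_end = self.vocab_start + self.per_partition
        self.weight = torch.nn.Parameter(torch.empty(self.per_partition, embedding_dim))
        torch.nn.init.normal_(self.weight)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Mask ids outside this rank's shard, look the rest up locally,
        # zero the masked rows, then all-reduce so every rank gets the full embedding.
        mask = (input_ids < self.vocab_start) | (input_ids >= self.vocab_end)
        local_ids = (input_ids - self.vocab_start).clamp(0, self.per_partition - 1)
        out = F.embedding(local_ids, self.weight)
        out = out.masked_fill(mask.unsqueeze(-1), 0.0)
        if dist.is_initialized() and dist.get_world_size() > 1:
            dist.all_reduce(out)
        return out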

Changes

  • Configuration System (pymllm/configs/__init__.py, pymllm/configs/global_config.py, pymllm/configs/server_config.py): New comprehensive configuration framework with a GlobalConfig singleton (aggregating server, model, runtime, and cache configs), a ServerConfig dataclass with validation pipelines, ModelConfig with HF config parsing, and RuntimeConfig/CacheConfig for distributed settings and KV-cache parameters.
  • Layer Components (pymllm/layers/__init__.py, pymllm/layers/base.py, pymllm/layers/embedding.py, pymllm/layers/utils.py): New layer package with the MllmBaseLayer abstract class, VocabParallelEmbedding for tensor parallelism with per-partition sharding and all-reduce aggregation, and a set_weight_attrs utility for custom weight attributes.
  • Orchestrator & Parallelism (pymllm/orchestrator/__init__.py, pymllm/orchestrator/group_coordinator.py, pymllm/orchestrator/parallel_state.py): Distributed coordination system: GroupCoordinator for multi-process group operations (all-reduce, all-gather, broadcast), parallel_state for TP/DP/PP initialization and rank/size accessors, and integration with GlobalConfig for runtime updates.
  • Mobile Backend Imports (pymllm/mobile/backends/qualcomm/nn.py, pymllm/mobile/backends/qualcomm/qnn_aot_env.py, pymllm/mobile/backends/qualcomm/transformers/.../*, pymllm/mobile/backends/__init__.py, pymllm/mobile/ffi/base.py): Refactored import paths from pymllm.backends.qualcomm to pymllm.mobile.backends.qualcomm across QNN AOT, LLaMA/Qwen2/Qwen3 models, observers, and core components; adjusted library path resolution in the FFI base.
  • FFI & Dependencies (mllm/ffi/Extension.cc, mllm/ffi/vendors/tvm-ffi, pyproject.toml, mllm-kernel/pyproject.toml): Exposed the ParameterFileObj FFI binding in Extension.cc, pinned apache-tvm-ffi to 0.1.8, updated the submodule reference for the tvm-ffi vendor, and added flashinfer-python to the optional CUDA dependencies.
  • CLI & Entry Points (pymllm/__init__.py, pymllm/__main__.py, mllm-kernel/mllm_kernel/__main__.py, pymllm/backends/__init__.py): New pymllm.__main__ with show_config() and main() CLI, mobile availability detection via is_mobile_available(), conditional mobile module export, updated mllm-kernel help text, and removed unconditional backend submodule re-exports.
  • Documentation & Examples (mllm-kernel/README.md, pymllm/mobile/README.md): Updated the mllm-kernel JIT documentation from the load_cpu/cuda_jit pattern to the unified @mllm_kernel.jit decorator with an example wrapper; added a markdown header to the mobile README.
  • Tests & Quantization (pymllm/tests/test_vocab_parallel_embedding.py, pymllm/quantization/quant_recipe.py, pymllm/mobile/tests/test_*.py, pymllm/backends/cuda/tilelang_compile_test.py (deleted)): New comprehensive multi-GPU TP-8 test for VocabParallelEmbedding with weight-loading validation, a new QuantRecipe class, updated mobile test imports to the pymllm.mobile namespace, and deletion of the legacy TileLang CUDA kernel test.
  • Utility & Mobile Package (pymllm/mobile/__init__.py, pymllm/utils/mllm_convertor_server/service.py): Re-exported mobile FFI types (dtypes, device, tensor utilities) and matmul from pymllm.mobile for a simplified public API; removed the license header from service.py.
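As background for the Orchestrator & Parallelism cohort above, the usual way TP/DP/PP process groups are carved out of the world is to factor the global rank. Below is a rough, generic sketch of the tensor-parallel part; the function name and group layout are assumptions, not the code in pymllm/orchestrator:

import torch.distributed as dist

def build_tp_group(tp_size: int):
    # With world_size=8 and tp_size=4, the TP groups are ranks [0..3] and [4..7].
    world_size = dist.get_world_size()
    if world_size % tp_size != 0:
        raise ValueError(f"world_size ({world_size}) must be divisible by tp_size ({tp_size})")
    my_rank = dist.get_rank()
    my_group = None
    for start in range(0, world_size, tp_size):
        ranks = list(range(start, start + tp_size))
        # Every rank must create every group; each keeps only the one it belongs to.
        group = dist.new_group(ranks)
        if my_rank in ranks:
            my_group = group
    return my_group

Data-parallel and pipeline groups are built from the remaining factors of world_size in the same way, which is why the TP * DP * PP == world_size invariant discussed later in this review is load-bearing.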

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • liang1232018
  • oreomaker
  • yirongjie

Poem

🐰 A hop through configs, a leap through the layers,
Distributed dancers in parallel affairs!
From mobile backends to orchestrated schemes,
We've woven together the threading of dreams.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): No pull request description was provided by the author, but the template requires a clear and complete description for efficient review. Resolution: Add a comprehensive description following the provided template, explaining the objectives, changes, and rationale for VocabParallelEmbedding and CUDA infrastructure initialization.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 70.13%, which is insufficient; the required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title clearly identifies the primary changes: VocabParallelEmbedding implementation and pymllm's CUDA infrastructure initialization. It is specific and concise.


coderabbitai bot (Contributor) left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (9)
pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py (1)

31-35: ⚠️ Potential issue | 🟡 Minor

--output_dir is not marked required=True — crashes with an opaque TypeError when omitted.

args.output_dir defaults to None. Line 50 then calls os.makedirs(None, exist_ok=True), raising TypeError: expected str, bytes or os.PathLike object, not NoneType instead of a clear argparse error.

🐛 Proposed fix
     parser.add_argument(
         "--output_dir",
         type=str,
+        required=True,
         help="Directory to save the quantized model",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py` around lines 31
- 35, The --output_dir argument is optional but later used in os.makedirs
causing a TypeError when omitted; update the parser.add_argument for
"--output_dir" (the one in train.py) to include required=True (or alternatively
provide a sensible default) so args.output_dir is never None, and ensure any
subsequent call (os.makedirs(args.output_dir, exist_ok=True) in this file)
receives a valid path; keep the change localized to the parser.add_argument for
"--output_dir" and/or add an explicit argparse error if args.output_dir is falsy
before calling os.makedirs.
pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py (1)

136-160: ⚠️ Potential issue | 🟠 Major

Fix apply_rotary_pos_emb function calls to rotate_half.

rotate_half() requires 4 positional arguments (x, x_observer, x2_neg_fake_quant, concat_observer), but apply_rotary_pos_emb() calls it with only 1 argument at lines 158–159:

q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)

This causes TypeError: rotate_half() missing 3 required positional arguments for any caller. LlamaAttention.forward (lines 397–420) bypasses this by calling rotate_half() directly with all required arguments, but the public function remains broken and unusable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py` around
lines 136 - 160, apply_rotary_pos_emb currently calls rotate_half with a single
argument but rotate_half requires four positional args (x, x_observer,
x2_neg_fake_quant, concat_observer); update apply_rotary_pos_emb to call
rotate_half with the same four arguments used in LlamaAttention.forward (i.e.,
pass through the observer/quant args or supply None defaults) so the calls
become rotate_half(q, x_observer, x2_neg_fake_quant, concat_observer) and
rotate_half(k, x_observer, x2_neg_fake_quant, concat_observer) (or explicitly
pass None for those three if observers are not available), ensuring the function
signature and callers are updated accordingly.
pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)

114-138: ⚠️ Potential issue | 🟠 Major

Remove apply_rotary_pos_emb function — it is dead code with signature mismatch.

apply_rotary_pos_emb calls rotate_half(q) and rotate_half(k) with 1 argument each (lines 136–137), but rotate_half requires 4 arguments (x, x_observer, x2_neg_fake_quant, concat_observer). This would raise TypeError if called. The function is not invoked anywhere in the codebase and is absent from __all__ exports. Qwen3Attention.forward applies RoPE manually with correct argument passing. Remove this unused function.

Note: The same issue exists identically in pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py and pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 114 - 138, The function apply_rotary_pos_emb is dead code and wrong: it
calls rotate_half(q) and rotate_half(k) with only one argument while rotate_half
requires four, and Qwen3Attention.forward applies RoPE correctly; remove the
entire apply_rotary_pos_emb function definition from modeling_qwen3.py to avoid
the incorrect call and dead export, and likewise remove the identical
apply_rotary_pos_emb definitions from modeling_qwen2.py and modeling_llama.py so
no mismatched helper remains in the qualcomm backend.
pymllm/mobile/backends/qualcomm/transformers/llama/train.py (1)

32-35: ⚠️ Potential issue | 🟠 Major

--output_dir will crash with TypeError if omitted.

--output_dir has no default and no required=True. When args.output_dir is None, line 50 (os.makedirs(args.output_dir, exist_ok=True)) raises TypeError: expected str, bytes or os.PathLike object, not NoneType. The same issue exists in qwen2/train.py.

🐛 Proposed fix
 parser.add_argument(
     "--output_dir",
     type=str,
+    required=True,
     help="Directory to save the quantized model",
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/train.py` around lines 32
- 35, The CLI argument "--output_dir" is missing a default or required=True so
args.output_dir can be None and later break on os.makedirs(args.output_dir,
exist_ok=True); update the argument definition in train.py (and mirror the same
change in qwen2/train.py) to either set a sensible default (e.g., "./output") or
mark it required=True and then add a defensive check before calling os.makedirs
(e.g., ensure args.output_dir is not None and raise a clear error or use the
default), referencing the argument parsing block that defines "--output_dir" and
the code that calls os.makedirs(args.output_dir, exist_ok=True).
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (3)

291-298: ⚠️ Potential issue | 🟡 Minor

Misleading streaming comment and trust_remote_code=True security note.

  • Line 292 comment states streaming=True is used, but the actual MsDataset.load() call has no such argument. This will cause the full dataset to be downloaded locally — the opposite of what the comment describes.
  • trust_remote_code=True on line 297 permits execution of arbitrary Python code bundled with the dataset. While common in research settings, it should be an explicit opt-in with a clear docstring warning rather than a silent default in a library.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
291 - 298, The comment above the MsDataset.load call is misleading and the use
of trust_remote_code=True is unsafe; update the MsDataset.load invocation (the
call to MsDataset.load in runner.py) to pass streaming=True to match the comment
and avoid downloading the entire dataset, and change trust_remote_code=True to a
safe default (False) or wire it to an explicit opt-in flag (e.g.,
allow_trust_remote_code) exposed in the function/class API with a clear
docstring warning about executing remote code so callers must consciously opt
in; also update the inline comment to accurately reflect the behavior.

166-185: ⚠️ Potential issue | 🔴 Critical

Add QLinearW8A16_PerChannelSym handling to enable_fake_quant and disable_fake_quant functions.

QLinearW8A16_PerChannelSym is included in freeze_qwen3_linear_weight (line 147) and convert_weight (line 189), but is missing from enable_fake_quant and disable_fake_quant (lines 166–185). Unlike QLinearLPBQ, which defines enable_fakequant() and disable_fakequant() methods, QLinearW8A16_PerChannelSym currently lacks these methods. Both classes should provide consistent interfaces for managing fake-quantization state during calibration. Add QLinearW8A16_PerChannelSym to the type checks in both functions, and implement the corresponding methods in the class if they do not exist.
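A minimal sketch of the shape this fix could take in the two toggle helpers (the import path and the new methods on QLinearW8A16_PerChannelSym are assumptions; its internals are not shown here):

# assumed import location, mirroring the existing core qlinear module
# from pymllm.mobile.backends.qualcomm.transformers.core.qlinear import (
#     QLinearLPBQ, QLinearW8A16_PerChannelSym,
# )

def enable_fake_quant(model):
    # Re-enable fake quantization on every supported quantized linear layer.
    for module in model.modules():
        if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
            module.enable_fakequant()

def disable_fake_quant(model):
    for module in model.modules():
        if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
            module.disable_fakequant()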

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
166 - 185, Add QLinearW8A16_PerChannelSym to the fake-quant toggles and
implement matching methods on the class: update enable_fake_quant and
disable_fake_quant to check isinstance(..., QLinearW8A16_PerChannelSym) and call
the new methods; then add enable_fakequant(self) and disable_fakequant(self) to
the QLinearW8A16_PerChannelSym class to mirror QLinearLPBQ’s behavior (toggle
whatever internal flag or buffer used for fake-quant state). Ensure the
implementation is consistent with how QLinearLPBQ, ActivationQDQ, QRMSNorm, and
QEmbedding manage fake-quant state so freeze_qwen3_linear_weight and
convert_weight (which already reference QLinearW8A16_PerChannelSym) behave
correctly during calibration.

69-73: ⚠️ Potential issue | 🟡 Minor

The diagnostic logging loop is dead code and contains an AttributeError.

FakeQuantize registers scale as a buffer (via register_buffer), not as a parameter. Therefore, module.fake_quant.named_parameters() will never yield scale, making the loop body at lines 70–73 unreachable dead code.

Additionally, line 72 references module.scale, but ActivationQDQ does not expose a .scale property — it should be module.fake_quant.scale.

Fix both issues:

🐛 Proposed fix
-            for key, value in module.fake_quant.named_parameters():
+            for key, value in module.fake_quant.named_buffers():
                 if value is module.fake_quant.scale:
-                    print(f"{module._get_name()}.{key}: {module.scale}")
+                    print(f"{module._get_name()}.{key}: {module.fake_quant.scale}")
                     break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 69
- 73, The loop using module.fake_quant.named_parameters() is dead because
FakeQuantize.register_buffer registers scale as a buffer, and it also wrongly
references module.scale; update the diagnostic to iterate
module.fake_quant.named_buffers() (or check both named_parameters() and
named_buffers() if desired) and reference module.fake_quant.scale when printing;
specifically modify the block around the loop that uses
module.fake_quant.named_parameters() and the print that uses module.scale so it
inspects named_buffers() and prints module.fake_quant.scale (or the buffer name)
instead.
pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py (2)

366-369: ⚠️ Potential issue | 🟡 Minor

Pre-existing typo: layer_dix should be layer_idx.

Line 369 assigns self.layer_dix = layer_idx (note "dix" vs "idx"). This typo propagates to Lines 384 and 404 where self.layer_dix is read. While internally consistent, it's misleading and will cause confusion for anyone extending Qwen2DecoderLayer.

Not introduced by this PR, but flagging for awareness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around
lines 366 - 369, The Qwen2DecoderLayer constructor uses a typo'd attribute name
self.layer_dix instead of self.layer_idx; update the constructor in class
Qwen2DecoderLayer to assign self.layer_idx = layer_idx and then change any
reads/writes of self.layer_dix (e.g., at the sites currently reading it on lines
where it’s accessed) to self.layer_idx so the attribute name is consistent and
clear across the class (keep the class name Qwen2DecoderLayer and parameter
layer_idx the same).

90-123: ⚠️ Potential issue | 🟡 Minor

Fix the arity mismatch in apply_rotary_pos_emb: it calls rotate_half with 1 argument but the function requires 4.

rotate_half on line 90 requires 4 parameters (x, x_observer, x2_neg_fake_quant, concat_observer), but apply_rotary_pos_emb calls it on lines 121–122 with only 1 argument. This will raise TypeError at runtime if apply_rotary_pos_emb is ever invoked.

The actual RoPE implementation in the attention forward method (lines 281–299) calls rotate_half with all 4 arguments correctly, so the bug only affects the unused apply_rotary_pos_emb function. Suggest either updating apply_rotary_pos_emb to pass all required arguments or removing it if it's not needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around
lines 90 - 123, The apply_rotary_pos_emb function calls rotate_half(q) and
rotate_half(k) but rotate_half requires four parameters (x, x_observer,
x2_neg_fake_quant, concat_observer); update apply_rotary_pos_emb to accept and
forward those observers/quantizers (e.g., add parameters x_observer,
x2_neg_fake_quant: ActivationQDQ, concat_observer: ConcatObserver to
apply_rotary_pos_emb signature and call rotate_half(q, x_observer,
x2_neg_fake_quant, concat_observer) and rotate_half(k, x_observer,
x2_neg_fake_quant, concat_observer)), or if apply_rotary_pos_emb is unused
remove it—ensure the fix mirrors how rotate_half is invoked in the attention
forward implementation.
🧹 Nitpick comments (32)
pymllm/orchestrator/parallel_state.py (3)

36-37: Silent early return when dist is not initialized may mask configuration errors.

If a caller invokes initialize_model_parallel(tp=4, dp=2, ...) without first calling dist.init_process_group, the function silently returns without setting any groups or config. Downstream code relying on get_tp_group() etc. will get None with no indication of why. Consider logging a warning so that misconfigured setups are easier to diagnose.

♻️ Proposed change
     if not dist.is_initialized():
+        logger.warning(
+            "torch.distributed is not initialized; "
+            "skipping model parallel group setup."
+        )
         return

As per coding guidelines, "Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 36 - 37, The early return
guarded by dist.is_initialized() in initialize_model_parallel(tp=..., dp=...)
should not be silent; update it to log a warning (using the module logger) that
process groups were not initialized and advise calling
torch.distributed.init_process_group before initialize_model_parallel, so
downstream calls like get_tp_group() don't fail silently; ensure the message
includes which function returned early (initialize_model_parallel) and the check
(dist.is_initialized()) to aid debugging.

20-25: No thread safety or re-initialization guard on the global state.

initialize_model_parallel mutates module-level globals without any guard against concurrent or repeated calls. A second call would leak the prior dist.new_group handles. Consider adding an early check via model_parallel_is_initialized() and raising if already initialized, or explicitly destroying old groups first.

♻️ Example guard
 def initialize_model_parallel(
     tensor_model_parallel_size: int = 1,
     data_parallel_size: int = 1,
     pipeline_model_parallel_size: int = 1,
     backend: str = "nccl",
 ) -> None:
+    if model_parallel_is_initialized():
+        raise RuntimeError(
+            "Model parallel groups are already initialized. "
+            "Call destroy_model_parallel() first."
+        )
+
     global _TP_GROUP, _DP_GROUP, _PP_GROUP

Also applies to: 177-179

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 20 - 25, The
initialize_model_parallel function mutates module-level globals and may leak
prior dist.new_group handles on concurrent or repeated calls; add a
re-initialization guard by checking model_parallel_is_initialized() at the top
of initialize_model_parallel and either raise a clear exception if already
initialized or call a cleanup routine (e.g., destroy_model_parallel or a new
teardown function) to explicitly destroy existing groups before creating new
ones; also protect the initialization logic with a simple threading.Lock to
ensure thread-safety when creating dist.new_group handles and apply the same
guard/cleanup pattern to the other initialization site(s) that mutate the same
globals.

66-72: Same assert-based validation concern as in group_coordinator.py.

This is the primary invariant for the entire parallelism setup. If the product doesn't match world_size, the system will silently compute incorrect groups under -O. Raise a ValueError instead.

♻️ Proposed fix
-    assert (
-        tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
-        == world_size
-    ), (
-        f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
-        f"PP({pipeline_model_parallel_size}) != World({world_size})"
-    )
+    total = tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
+    if total != world_size:
+        raise ValueError(
+            f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
+            f"PP({pipeline_model_parallel_size}) = {total} != World({world_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 66 - 72, Replace the
assert-based check that multiplies tensor_model_parallel_size,
data_parallel_size, and pipeline_model_parallel_size against world_size with an
explicit runtime validation that raises a ValueError when the product doesn't
equal world_size; locate the expression that currently uses assert (the block
referencing tensor_model_parallel_size * data_parallel_size *
pipeline_model_parallel_size == world_size and the f-string message) and change
it to compute the product, compare to world_size, and raise
ValueError(f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) *
PP({pipeline_model_parallel_size}) != World({world_size})") when they mismatch.
pymllm/orchestrator/group_coordinator.py (2)

72-77: assert for input validation is stripped under python -O; prefer raising exceptions in public API functions.

divide() and split_tensor_along_dim() are public API (exported via __init__.py). assert statements are removed when Python runs with the -O flag, silently disabling these guards and leading to incorrect silent behavior (e.g., truncated results from integer division). Use explicit ValueError raises for robustness.

♻️ Example for `divide`; apply the same pattern to `split_tensor_along_dim`
 def divide(numerator: int, denominator: int) -> int:
     """Divide and ensure divisibility."""
-    assert numerator % denominator == 0, (
-        f"{numerator} is not divisible by {denominator}"
-    )
+    if numerator % denominator != 0:
+        raise ValueError(f"{numerator} is not divisible by {denominator}")
     return numerator // denominator

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions" and "Validate inputs for public APIs and critical internal functions."

Also applies to: 80-98
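A quick, self-contained illustration of the -O behavior (hypothetical demo, not project code):

# demo.py
def divide(numerator: int, denominator: int) -> int:
    assert numerator % denominator == 0, f"{numerator} is not divisible by {denominator}"
    return numerator // denominator

print(divide(10, 3))
# $ python demo.py     -> AssertionError: 10 is not divisible by 3
# $ python -O demo.py  -> prints 3; the assert was stripped, so the bad input passes silently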

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/group_coordinator.py` around lines 72 - 77, The public
functions divide and split_tensor_along_dim currently use assert for input
validation which is removed under python -O; replace those asserts with explicit
exception raises (e.g., raise ValueError(...)) that include the same descriptive
message so invalid inputs (non-divisible numerator/denominator or bad split
parameters) raise a clear exception instead of failing silently; update the
divide function to raise ValueError when numerator % denominator != 0 and apply
the same pattern to split_tensor_along_dim to validate its inputs and raise
ValueError with contextual messages while preserving the existing return
behavior.

19-37: local_rank is stored but never used within the class.

self.local_rank is assigned but never referenced by any method in GroupCoordinator. If it's only needed by external callers (e.g., for device placement), that's fine, but consider documenting that intent. Otherwise, it's dead state that may mislead future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/group_coordinator.py` around lines 19 - 37, The
constructor currently assigns self.local_rank but nothing in GroupCoordinator
uses it; either remove the local_rank parameter and attribute from
GroupCoordinator.__init__ (and update callers) to avoid dead state, or
explicitly document its purpose for external use by adding a docstring/param
comment to the GroupCoordinator class and keeping self.local_rank; if you choose
removal, delete the local_rank argument and any assignment to self.local_rank
and update instantiations, and if you choose to keep it, add a clear docstring
on GroupCoordinator and/or expose it via a property (e.g., def local_rank(self):
return self._local_rank) so the intent is explicit (references:
GroupCoordinator.__init__, self.local_rank, rank_in_group, device_group).
pymllm/orchestrator/__init__.py (1)

25-48: Categorical grouping in __all__ is reasonable; ignore the sort lint.

The static analysis tool flags that __all__ is not alphabetically sorted (RUF022), but the current categorical grouping (GroupCoordinator, TP, DP, PP, State) is intentional and more readable for this API surface. Consider adding a # noqa: RUF022 comment if you want to suppress the lint.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/__init__.py` around lines 25 - 48, The linter flags the
__all__ list as unsorted (RUF022) even though the categorical grouping is
intentional; suppress the warning by adding a noqa comment for RUF022 to the
__all__ definition (e.g., append "# noqa: RUF022" to the __all__ assignment) so
the grouped exports (including symbols like "GroupCoordinator", "divide",
"get_tp_group", "data_parallel_all_reduce", "get_pp_group",
"initialize_model_parallel", etc.) keep their categorical order without lint
noise.
pymllm/mobile/backends/qualcomm/nn.py (1)

4-11: Add docstrings and document placeholder intent for QnnSoftmax and QnnRoPE.

Both classes are bare subclasses that add no behavior. Per coding guidelines, public classes must have docstrings explaining purpose. Without any comment it is also unclear whether these are intentional stubs awaiting Qualcomm-specific overrides or permanent thin wrappers.

♻️ Proposed fix: add minimal docstrings
 class QnnSoftmax(Softmax):
+    """Qualcomm-specific Softmax layer.
+
+    Thin wrapper around Softmax; reserved for Qualcomm QNN-specific
+    operator overrides.
+    """
     def __init__(self):
         super().__init__()


 class QnnRoPE(RoPE):
+    """Qualcomm-specific Rotary Position Embedding layer.
+
+    Thin wrapper around RoPE; reserved for Qualcomm QNN-specific
+    operator overrides.
+    """
     def __init__(self):
         super().__init__()

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/nn.py` around lines 4 - 11, The public
classes QnnSoftmax and QnnRoPE are empty subclasses with no docstrings; add
concise class docstrings to each explaining their intent (e.g.,
"Qualcomm-specific subclass of Softmax/RoPE; currently a thin
wrapper/placeholder for future Qualcomm overrides") and note any expected
behavior or why they exist (compatibility shim, vendor-specific hooks), so
readers know they are intentional stubs and not omissions; ensure the docstrings
appear directly under the class definitions for QnnSoftmax and QnnRoPE.
pymllm/mobile/ffi/base.py (1)

9-25: Add a pre-flight existence check and a descriptive error in _load_lib().

tvm_ffi.load_module(lib_path) is called without checking whether lib_path exists. On any environment where the native library hasn't been built or installed, the module import fails with a raw OS/TVM error that gives no indication of which path was searched or what the user should do. This is worsened by the module-level _LIB = _load_lib() at line 25, which means the error fires on any import of this module.

♻️ Proposed fix: descriptive pre-flight check
     lib_path = os.path.join(parent_dir, "lib", lib_name)
+    if not os.path.exists(lib_path):
+        raise RuntimeError(
+            f"MllmFFIExtension library not found at '{lib_path}'. "
+            "Ensure the native library is built and installed correctly."
+        )
     return tvm_ffi.load_module(lib_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/ffi/base.py` around lines 9 - 25, The current _load_lib() calls
tvm_ffi.load_module(lib_path) without verifying the file exists, causing opaque
import-time errors; update _load_lib to check os.path.exists(lib_path) (or
os.path.isfile) before calling tvm_ffi.load_module and, if the file is missing,
raise a clear FileNotFoundError/RuntimeError that includes the full lib_path,
the platform-determined lib_name, and a short hint about building or installing
the native library; keep the _LIB = _load_lib() behavior but ensure the new
error message surfaces the searched path and recovery steps so users know what
to do.
pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)

379-382: Typo: layer_dix should be layer_idx.

self.layer_dix = layer_idx (line 382) is a persistent typo. It is used consistently as layer_dix within the class (lines 394, 397, 417), so there is no immediate runtime error, but it clashes with the attribute name layer_idx used by the parent Qwen3Attention and the rest of the decoder stack.

🧹 Proposed rename (all occurrences in this class)
-        self.layer_dix = layer_idx
+        self.layer_idx = layer_idx

Then update lines 394, 397, 417 from self.layer_dix to self.layer_idx.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 379 - 382, The class Qwen3DecoderLayer has a persistent typo: the
attribute was set as self.layer_dix instead of self.layer_idx; update the
attribute assignment in Qwen3DecoderLayer.__init__ to self.layer_idx = layer_idx
and rename all uses of self.layer_dix to self.layer_idx within this class
(including places that interact with Qwen3Attention) so the decoder layer
attribute matches the expected layer_idx name used elsewhere.

105-111: Unused x_observer parameter in rotate_half.

x_observer is accepted but never referenced in the function body. If it is no longer needed, remove it; if it was meant to be used (e.g., to trigger an observer forward pass on x), add the missing call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 105 - 111, The parameter x_observer in function rotate_half is unused;
either remove it from rotate_half's signature and all callers if observers are
not needed, or invoke it to trigger observation (e.g., call x_observer(x) before
returning) so the input tensor is recorded; update the rotate_half definition
and ensure symbols x2_neg_fake_quant and concat_observer are still applied in
the existing order (use x_observer(x) then return
concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))) and adjust any
call sites accordingly.
pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py (3)

219-234: Use the already-defined num_steps constant instead of repeating 2**bitwidth_of_scale.

num_steps is computed at line 221 but 2**bitwidth_of_scale is used again directly at line 231, breaking the named-constant rule.

🧹 Proposed fix
-            max=2**bitwidth_of_scale,
+            max=num_steps,

As per coding guidelines, "Use named constants instead of magic numbers."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines
219 - 234, Replace the duplicated magic expression 2**bitwidth_of_scale with the
already-defined constant num_steps in the quantization loop: inside the loop in
qlinear.py where q_scales is computed (the torch.clamp call that currently uses
max=2**bitwidth_of_scale), change that max to num_steps to use the named
constant and avoid repeating the exponentiation.

32-36: Redundant is not None guard inside the outer null-check.

The outer if self.weight_quant is not None: at line 32 already ensures the inner branch is never reached with None. The and self.weight_quant is not None on line 36 adds nothing.

🧹 Proposed cleanup
-    if (
-        isinstance(self.weight_quant, PerBlockParamFakeQuantize)
-        and self.weight_quant is not None
-    ):
+    if isinstance(self.weight_quant, PerBlockParamFakeQuantize):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines 32
- 36, Remove the redundant null-check: inside the outer guard "if
self.weight_quant is not None:" drop the duplicated "and self.weight_quant is
not None" from the inner condition that tests "isinstance(self.weight_quant,
PerBlockParamFakeQuantize)"; leave only the isinstance check so the inner branch
reads like a single isinstance(self.weight_quant, PerBlockParamFakeQuantize)
test referencing the existing self.weight_quant variable.

205-215: Replace assert with an explicit ValueError for runtime data validation.

assert statements are silently stripped when Python runs with -O; a data-validity check on user-provided tensors should survive in optimized mode.

🛡️ Proposed fix
-assert self.weight.shape[-1] % self.block_size[1] == 0
-assert linear_zero_point.sum() == 0
+if self.weight.shape[-1] % self.block_size[1] != 0:
+    raise ValueError(
+        f"in_features ({self.weight.shape[-1]}) must be divisible by "
+        f"block_size ({self.block_size[1]})"
+    )
+if linear_zero_point.sum() != 0:
+    raise ValueError("LPBQ deployment requires symmetric quantization (zero_point must be all zeros)")

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines
205 - 215, Replace the two runtime assertions in QLinear's quantization block
with explicit checks that raise ValueError: check that self.weight.shape[-1] %
self.block_size[1] == 0 and raise ValueError with a clear message if not, and
check that linear_zero_point.sum() == 0 and raise ValueError with a clear
message if not; keep the rest of the flow (calling _quantize_affine and
assigning weight_int4) unchanged so invalid tensor shapes or zero-point sums
still raise informative exceptions at runtime.
mllm-kernel/pyproject.toml (1)

3-3: apache-tvm-ffi is unpinned in [build-system].requires but pinned to 0.1.8 in [project].dependencies.

During the scikit-build-core build step a different (potentially incompatible) version of apache-tvm-ffi could be resolved, while the installed package enforces == 0.1.8 at runtime. If the build scripts rely on the same API surface, pin the build-system requirement to match.

🔧 Proposed fix
 requires = [
-  "scikit-build-core>=0.11.0", "apache-tvm-ffi", "torch-c-dlpack-ext"
+  "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8", "torch-c-dlpack-ext"
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/pyproject.toml` at line 3, The build-system requirement for
apache-tvm-ffi is unpinned while the runtime dependency in
[project].dependencies is pinned to 0.1.8; update the [build-system].requires
entry for "apache-tvm-ffi" in pyproject.toml to pin it to
"apache-tvm-ffi==0.1.8" so the build resolver uses the same version as the
runtime and avoids API/compatibility mismatches.
pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py (1)

127-133: x_observer parameter accepted but never used in rotate_half.

Same issue as the Qwen3 variant: the x_observer argument (line 128) is passed at every call-site but the function body ignores it entirely. Either remove it from the signature (and all call-sites) or add the intended forward/observation logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py` around
lines 127 - 133, The rotate_half function currently accepts x_observer but never
uses it; either remove x_observer from the signature and from all call-sites, or
apply the observer to the input (e.g., call x = x_observer(x) or
x_observer.observe(x) as appropriate) before splitting/rotating so the input is
recorded; update the function rotate_half and all callers to match the chosen
approach and keep the existing ActivationQDQ x2_neg_fake_quant and
ConcatObserver concat_observer usage intact.
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (2)

305-340: tqdm progress bar never closed.

pbar is created on line 305 but never explicitly closed. The final display update (e.g., elapsed time) will be missing, and if multiple calibration runs occur in the same process, stale bars may accumulate. Use a context manager:

♻️ Proposed fix
-        pbar = tqdm(total=num_samples, desc="Calibrating")
-        for entry in dataset:
-            ...
-            pbar.update(1)
+        with tqdm(total=num_samples, desc="Calibrating") as pbar:
+            for entry in dataset:
+                ...
+                pbar.update(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
305 - 340, The tqdm progress bar instantiated as pbar in the calibration loop is
never closed; wrap the progress bar in a context manager or explicitly close it
to avoid stale bars (e.g., use with tqdm(total=num_samples, desc="Calibrating")
as pbar: around the for entry in dataset loop or call pbar.close() after the
loop). Update the block where pbar is created and used (variable pbar, the for
entry in dataset loop, and samples_processed updates) so the progress bar is
safely closed regardless of early breaks or exceptions.

19-195: Significant code duplication across runner modules.

The utility functions recompute_scale_zp, validate_concat_observer_fn, disable_qdq_observer, enable_qdq_observer, enable_fake_quant, disable_fake_quant, and convert_weight (lines 19–195) are byte-for-byte identical in qwen3/runner.py, llama/runner.py, and likely qwen2/runner.py. This violates DRY and makes any future fix/enhancement require triple (or more) edits.

Consider extracting these into a shared pymllm/mobile/backends/qualcomm/transformers/core/runner_utils.py module and importing from there.
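A sketch of what the suggested layout could look like (the module path and helper names come from the suggestion above; this is a proposal, not existing code):

# pymllm/mobile/backends/qualcomm/transformers/core/runner_utils.py would hold the single
# copy of the shared helpers; each runner then imports them instead of redefining them:
from pymllm.mobile.backends.qualcomm.transformers.core.runner_utils import (
    recompute_scale_zp,
    validate_concat_observer_fn,
    enable_qdq_observer,
    disable_qdq_observer,
    enable_fake_quant,
    disable_fake_quant,
    convert_weight,
)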

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 19
- 195, These functions (recompute_scale_zp, validate_concat_observer_fn,
disable_qdq_observer, enable_qdq_observer, enable_fake_quant,
disable_fake_quant, convert_weight) are duplicated across runner modules—extract
them into a shared module (e.g., create a new runner_utils.py under
transformers/core) and move the implementations there, then replace the in-file
definitions in qwen3/runner.py (and the identical ones in llama/runner.py and
qwen2/runner.py) with imports from that shared module; ensure the shared
functions keep the same signatures and continue to reference ActivationQDQ,
ConcatObserver, QLinearLPBQ, QLinearW8A16_PerChannelSym, QRMSNorm, QEmbedding,
etc., so callers (model.apply or named_modules loops) work unchanged.
pyproject.toml (2)

24-24: Exact-version pin apache-tvm-ffi == 0.1.8 prevents receiving security/bug-fix patches.

Strict == pinning in dependencies (not just build-system.requires) means any downstream environment that already has a newer minor release will fail to resolve. Consider a compatible-release specifier:

♻️ Proposed loosening
-  "apache-tvm-ffi == 0.1.8",
+  "apache-tvm-ffi >= 0.1.8, < 0.2.0",

If ABI breakage is the concern, at minimum document why strict pinning is required here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 24, The dependency line pins apache-tvm-ffi exactly
to "apache-tvm-ffi == 0.1.8", which blocks patch/minor updates and can break
resolution in downstream environments; change this to a compatible-release
specifier such as "apache-tvm-ffi ~= 0.1.8" or "apache-tvm-ffi >=0.1.8,<0.2.0"
in pyproject.toml to allow receiving bug/security fixes while preventing ABI
bumps, and if you truly require an exact pin keep the existing strict pin but
add a short justification comment in the project docs explaining why
exact-version pinning of apache-tvm-ffi is necessary.

21-23: pytest and pytest-html should not be runtime dependencies.

Test runner packages installed into every end-user environment bloat the install footprint and create unnecessary version conflicts. Move them to an optional [dev] or [test] group:

♻️ Proposed fix
 dependencies=[
   "packaging",
-  "pytest",
-  "pytest-html",
   "apache-tvm-ffi == 0.1.8",
   ...
 ]

+[project.optional-dependencies]
+dev = ["pytest", "pytest-html"]
 cuda = ["tilelang", "flashinfer-python"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 21 - 23, The pyproject currently lists "pytest"
and "pytest-html" as regular runtime dependencies; remove those two entries from
the main dependencies list and add them to a development/test-only group instead
(for example add them under tool.poetry.dev-dependencies if using Poetry, or
under project.optional-dependencies.test if following PEP 621). Ensure you
delete the "pytest" and "pytest-html" lines from the dependencies block and add
equivalent entries in the chosen dev/test section so they are only installed for
development/testing.
pymllm/layers/utils.py (1)

3-10: Inconsistent type hint style: Dict import unused in favor of | union syntax.

Line 3 imports Dict from typing, but Line 10 uses Dict[str, Any] | None with the | union operator (Python 3.10+). Since you're already using 3.10+ syntax, consider using the built-in dict instead of the imported Dict for consistency:

✏️ Suggested change
-from typing import Any, Dict
+from typing import Any
 
 import torch
 
 
 def set_weight_attrs(
     weight: torch.Tensor,
-    weight_attrs: Dict[str, Any] | None,
+    weight_attrs: dict[str, Any] | None,
 ) -> None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/utils.py` around lines 3 - 10, The type hint style is
inconsistent: remove the unused Dict import and change the annotation on
set_weight_attrs (parameter weight_attrs) from Dict[str, Any] | None to the
modern built-in form dict[str, Any] | None to match Python 3.10+ union syntax;
keep the Any import and update the function signature accordingly (refer to the
function name set_weight_attrs and parameter weight_attrs).
pymllm/tests/test_vocab_parallel_embedding.py (2)

61-62: Hardcoded MASTER_PORT may cause conflicts in CI.

Port 29500 is a common default and may already be in use by other distributed tests or services running in parallel. Consider using a dynamic free port or at minimum reading from an environment variable with a fallback.

✏️ Suggested change
     os.environ["MASTER_ADDR"] = "localhost"
-    os.environ["MASTER_PORT"] = "29500"
+    os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", "29500")
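If a dynamically chosen port is preferred over an environment-variable fallback, a small helper along these lines would work (hypothetical, not currently in the test):

import socket

def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused ephemeral port, then release it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]

os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", str(find_free_port()))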
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 61 - 62, Replace
the hardcoded MASTER_PORT assignment in the test setup (currently setting
os.environ["MASTER_PORT"] = "29500") with logic that first reads an optional env
var and otherwise selects a free ephemeral port at runtime; e.g., use
os.environ.get("MASTER_PORT") as primary and fall back to a dynamically
discovered free port (via a helper like find_free_port() that binds a socket to
port 0 to obtain an available port) before assigning os.environ["MASTER_PORT"],
keeping os.environ["MASTER_ADDR"] assignment as-is.

22-23: Module-level logging.basicConfig(force=True) can interfere with other tests.

Using force=True overwrites the root logger configuration, which can affect logging behavior of other test modules when run in the same pytest session. Consider scoping this to a fixture or using a module-specific logger instead.
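One way to avoid reconfiguring the root logger is a module-scoped logger, roughly as follows (a sketch, assuming the test only needs its own output):

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
if not logger.handlers:
    logger.addHandler(logging.StreamHandler())
logger.info("starting VocabParallelEmbedding TP test")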

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 22 - 23, Remove
the module-level call that forcefully reconfigures the root logger; instead stop
using logging.basicConfig(..., force=True) and avoid touching the root logger
with logging.getLogger().setLevel at import time. Update this test module to
either configure a module-scoped logger (use logging.getLogger(__name__) and set
its level) or move root-level configuration into a pytest fixture (use caplog or
a test-scoped setup) so other tests aren’t affected; look for the existing calls
logging.basicConfig and logging.getLogger().setLevel in this file and replace
them with a module logger or a fixture-based configuration.
pymllm/layers/base.py (1)

8-27: Consider adding a class-level docstring for MllmBaseLayer.

This is the base class for all layers in the pymllm.layers package and is part of the public API (re-exported from pymllm.layers.__init__). A brief docstring explaining its role, how subclasses should override weight_loader and forward, and the purpose of quant_recipe would help consumers.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/base.py` around lines 8 - 27, Add a concise class-level
docstring to MllmBaseLayer describing its role as the base class for layers in
pymllm.layers, documenting the quant_recipe attribute's purpose (optional
QuantRecipe used for quantization settings), describing that subclasses should
override weight_loader to implement custom weight-loading/sharding behavior and
forward to implement the layer's forward pass, and briefly list expected
parameters/returns/behaviour for weight_loader (param: Parameter, loaded_weight:
torch.Tensor) and forward (raise NotImplementedError by default). Reference
MllmBaseLayer, weight_loader, forward, and the quant_recipe attribute in the
docstring so consumers of the public API understand usage and extension points.
pymllm/layers/__init__.py (1)

7-11: __all__ is not sorted (Ruff RUF022).

✏️ Suggested fix
 __all__ = [
     "MllmBaseLayer",
-    "set_weight_attrs",
     "VocabParallelEmbedding",
+    "set_weight_attrs",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/__init__.py` around lines 7 - 11, The __all__ list is not
alphabetically sorted (Ruff RUF022); reorder the exported symbols in the __all__
list so they are lexicographically sorted — e.g., place "MllmBaseLayer",
"VocabParallelEmbedding", and "set_weight_attrs" in alphabetical order (adjust
the list containing MllmBaseLayer, set_weight_attrs, VocabParallelEmbedding
accordingly) to satisfy the linter.
pymllm/quantization/quant_recipe.py (1)

1-3: Add a docstring to the public QuantRecipe class.

This class is referenced as a type annotation in MllmBaseLayer (in pymllm/layers/base.py). A brief docstring explaining its intended purpose would help consumers understand it's a placeholder for quantization configuration.

✏️ Suggested change
 class QuantRecipe:
+    """Configuration recipe for quantization.
+
+    This is a placeholder class that will be extended with quantization
+    parameters and methods as quantization support matures.
+    """
+
     def __init__(self):
         pass

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/quantization/quant_recipe.py` around lines 1 - 3, Add a brief
docstring to the public QuantRecipe class describing its purpose as a
placeholder/structured container for quantization configuration and how it is
intended to be used (e.g., fields or subclasses should represent quantization
settings) so consumers and type annotations (such as MllmBaseLayer in
pymllm/layers/base.py) understand its role; place the docstring immediately
under the class declaration for QuantRecipe and keep it concise, describing
expected responsibilities, typical usage, and any important invariants or
extension points.
pymllm/layers/embedding.py (3)

73-112: Prefer raising exceptions over assert for runtime validation in public API methods.

assert statements are stripped when Python runs with -O (optimized mode), silently disabling these shape checks. Since weight_loader is part of the public API and loads external data, these checks should use explicit exceptions (e.g., ValueError / RuntimeError) to guarantee validation at all times.

Example for one assertion (apply same pattern to others)
-        assert param.data.shape == loaded_weight.shape, (
-            f"Shape mismatch: param {param.data.shape} vs "
-            f"loaded {loaded_weight.shape}"
-        )
+        if param.data.shape != loaded_weight.shape:
+            raise ValueError(
+                f"Shape mismatch: param {param.data.shape} vs "
+                f"loaded {loaded_weight.shape}"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/embedding.py` around lines 73 - 112, Replace all runtime
validation assert statements in the weight_loader method with explicit
exceptions so checks are always enforced; for each assert in weight_loader (the
shape equality check when no sharding, the loaded_weight vocab-size check using
output_dim vs self.num_embeddings, and the shard shape equality check), raise an
appropriate exception (ValueError or RuntimeError) with the same descriptive
message instead of using assert; ensure both branches (output_dim is None or
tp_size == 1 and the sharded branch that uses
self.vocab_start_index/self.vocab_end_index or .narrow with
self.num_embeddings_per_partition) perform the validations via exceptions before
calling param.data.copy_(...) so errors are never skipped under -O.

31-31: Use explicit Optional[int] or int | None instead of implicit Optional.

PEP 484 prohibits implicit Optional. This was also flagged by Ruff (RUF013).

Proposed fix
-        padding_idx: int = None,
+        padding_idx: int | None = None,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/embedding.py` at line 31, The parameter padding_idx currently
uses an implicit Optional via "padding_idx: int = None"; update its type
annotation to an explicit Optional by changing it to "padding_idx: Optional[int]
= None" (or "padding_idx: int | None = None" for Python 3.10+), and add the
corresponding import from typing (Optional) if you choose that form; ensure you
update the signature where padding_idx is declared so Ruff (RUF013) and PEP 484
rules are satisfied.

44-50: Redundant divisibility check: both the if (Line 44) and divide() (Line 50) validate the same condition.

The explicit ValueError on Line 44 is the better check since it gives a clear message and can't be stripped. The subsequent divide() call on Line 50 internally asserts the same condition. This is harmless but worth noting — the divide() assert is now dead code for this path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/embedding.py` around lines 44 - 50, Keep the explicit
divisibility check and clear ValueError for num_embeddings vs tp_size, but
remove the redundant runtime assertion by avoiding the call to divide() here:
set self.num_embeddings_per_partition using integer division of num_embeddings
by self.tp_size (e.g., compute num_embeddings // self.tp_size) instead of
calling divide(), so the check remains in this function (ValueError in the if)
and the duplicate assert inside divide() is not hit for this path; update
references to num_embeddings_per_partition accordingly.
pymllm/mobile/__init__.py (1)

1-45: Consider defining __all__ to control the public API surface.

Without __all__, a wildcard import (from pymllm.mobile import *) will also pull in internal submodule names like ffi, convertor, utils, quantize, nn, service, and backends. Defining __all__ explicitly would keep the public surface intentional and consistent with how pymllm/__init__.py manages its exports.

Example
__all__ = [
    # dtypes
    "float32", "float16", "bfloat16",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "boolean",
    # devices
    "cpu", "cuda", "qnn",
    # core
    "Tensor", "empty", "echo", "device",
    "is_torch_available", "is_numpy_available",
    "from_torch", "from_numpy",
    "zeros", "ones", "arange", "random",
    # ops
    "matmul",
    # submodules (if intentionally public)
    "ffi", "nn", "backends", "convertor", "utils", "quantize", "service",
]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/__init__.py` around lines 1 - 45, Add an explicit __all__ list
in pymllm.mobile.__init__ to control the public API surface: include the public
symbols currently re-exported from .ffi (float32, float16, bfloat16, int8,
int16, int32, int64, uint8, uint16, uint32, uint64, boolean, cpu, cuda, qnn,
Tensor, empty, echo, device, is_torch_available, is_numpy_available, from_torch,
from_numpy, zeros, ones, arange, random), plus matmul from .nn.functional;
decide whether to also expose submodules (ffi, nn, backends, convertor, utils,
quantize, service) and only include them in __all__ if they are intentionally
public. Ensure the name __all__ is defined at module level so wildcard imports
and documentation reflect the intended public API.
pymllm/__init__.py (1)

26-27: Add a docstring to the public is_mobile_available() function.

Per coding guidelines, public APIs should have clear docstrings explaining purpose and return value.

Proposed fix
 def is_mobile_available() -> bool:
+    """Check whether the mobile (on-device) FFI libraries are available.
+
+    Returns:
+        True if the platform-specific MllmFFIExtension shared library
+        exists under pymllm/lib/, False otherwise.
+    """
     return _has_mobile_libs()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/__init__.py` around lines 26 - 27, Add a concise docstring to the
public function is_mobile_available() that explains its purpose (checking if
mobile-specific dependencies are present), what it returns (bool indicating
availability), and any notes about it delegating to the internal helper
_has_mobile_libs(); place the docstring immediately below the def
is_mobile_available() signature and follow project docstring style (one-line
summary and a short return description).
pymllm/configs/__init__.py (1)

12-21: __all__ is not sorted (Ruff RUF022).

♻️ Proposed fix
 __all__ = [
-    # Main singleton
-    "GlobalConfig",
-    "get_global_config",
-    # Sub configs
-    "ServerConfig",
-    "ModelConfig",
-    "RuntimeConfig",
     "CacheConfig",
+    "GlobalConfig",
+    "ModelConfig",
+    "RuntimeConfig",
+    "ServerConfig",
+    "get_global_config",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/__init__.py` around lines 12 - 21, The __all__ list in
pymllm.configs.__init__ is not alphabetized; update the __all__ definition so
the exported symbol names are sorted alphabetically (e.g., "CacheConfig",
"GlobalConfig", "ModelConfig", "RuntimeConfig", "ServerConfig",
"get_global_config") while preserving it as a list of strings so Ruff RUF022 is
satisfied and the module exports remain the same.
pymllm/configs/global_config.py (1)

173-173: Using base_gpu_id < 0 as an undocumented CPU sentinel is confusing.

The mapping device = "cuda" if base_gpu_id >= 0 else "cpu" encodes CPU selection as a negative GPU index, a convention that is non-obvious and unmentioned in the ServerConfig field comment (# Base GPU index used for process-to-device mapping). A typo like -1 intended as "auto" would silently switch the device to CPU.

Consider an explicit device field on ServerConfig (e.g., device: Literal["cuda","cpu","auto"] = "auto") or at least document this sentinel contract on the base_gpu_id field.
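A hedged sketch of what that explicit field could look like (everything beyond base_gpu_id is an assumption, not the PR's ServerConfig):

from dataclasses import dataclass
from typing import Literal

@dataclass
class ServerConfig:  # illustrative subset only
    base_gpu_id: int = 0  # Base GPU index used for process-to-device mapping
    device: Literal["cuda", "cpu", "auto"] = "auto"

def resolve_device(cfg: ServerConfig) -> str:
    # Honor the explicit setting first; fall back to the GPU-index sentinel only on "auto".
    if cfg.device != "auto":
        return cfg.device
    return "cuda" if cfg.base_gpu_id >= 0 else "cpu"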

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/global_config.py` at line 173, The code currently treats
server_config.base_gpu_id < 0 as a hidden "use CPU" sentinel; update
ServerConfig to include an explicit device field (e.g., device:
Literal["cuda","cpu","auto"] = "auto") and change the runtime.device assignment
to honor server_config.device first (if "cuda" set device="cuda", if "cpu" set
device="cpu", if "auto" fall back to using base_gpu_id >= 0 ? "cuda" : "cpu");
also add a brief comment on the ServerConfig.device field explaining the
semantics so the contract is explicit (refer to ServerConfig,
server_config.base_gpu_id, and runtime.device).
