
Remaining Array API gaps in ezmsg-learn #10

@cboulay

Description


After the most recent Array API conversion (#9), several modules still contain NumPy-only boundaries or are entirely unconverted. This issue tracks those gaps, their root causes, and which of them can be resolved.

Modules with intentional NumPy boundaries

These modules have been partially converted — Array API is used for data manipulation, but a NumPy boundary exists before calling into a library that requires it.
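The boundary pattern shared by these modules can be sketched as follows. This is a minimal illustration, not the actual ezmsg-learn code: `_StubEstimator` and `predict_with_numpy_boundary` are hypothetical names, and `xp` stands in for the data's Array API namespace (plain numpy here).

```python
import numpy as np

class _StubEstimator:
    """Stand-in for a numpy-only estimator (e.g. sklearn's IncrementalPCA)."""
    def predict(self, X):
        return X.sum(axis=-1)

def predict_with_numpy_boundary(model, data, xp=np):
    # Convert to numpy at the estimator boundary...
    result_np = model.predict(np.asarray(data))
    # ...then hand the result back in the caller's namespace.
    return xp.asarray(result_np)

preds = predict_with_numpy_boundary(_StubEstimator(), np.ones((4, 3)))
```

The conversion cost is paid twice per call (in and out), which is why removing these boundaries is worthwhile wherever the underlying estimator gains Array API support.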

model/refit_kalman.py

| Boundary | Reason | Removable? |
| --- | --- | --- |
| `_compute_gain()` — converts all matrices to numpy before `scipy.linalg.solve_discrete_are` | No Array API (or even generic) DARE solver exists outside scipy | No — would require a pure Array API DARE implementation |
| `refit()` — per-sample mutation loop uses `np.linalg.norm` on 2-element vectors and scalar element assignment | Element-wise mutation with Python-level indexing; impractical to vectorise as written | Possibly — the loop could be vectorised with masked operations, which would also be a performance win |
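The vectorisation suggested in the second row could look roughly like this. It is a hypothetical sketch: `vecs` and `thresh` are illustrative stand-ins for the per-sample quantities inside `refit()`, and the norm is computed without `np.linalg` so the code stays within plain Array API operations.

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 2))   # one 2-element vector per sample
thresh = 1.0

# Instead of: for i in range(100): if np.linalg.norm(vecs[i]) > thresh: ...
norms = np.sqrt(np.sum(vecs**2, axis=-1))   # vectorised, Array-API-friendly norm
mask = norms > thresh                       # boolean mask over samples
# Masked update: rescale only the samples exceeding the threshold,
# guarding the division against zero norms.
scaled = np.where(mask[:, None], vecs / np.maximum(norms, 1e-12)[:, None], vecs)
```

The same `where`-plus-mask shape works for any per-sample conditional assignment, and it removes both the Python-level loop and the `np.linalg` dependency in one step.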

process/slda.py

| Boundary | Reason | Removable? |
| --- | --- | --- |
| `_process()` — converts to numpy before `sklearn.LDA.predict_proba` | sklearn `LinearDiscriminantAnalysis` requires numpy by default | Yes — sklearn 1.8 supports `LinearDiscriminantAnalysis` with `solver="svd"` under `array_api_dispatch=True` (see below) |
| `_reset_state()` — model init and `.mat` loading use numpy throughout | Template creation and weight loading are one-time setup | Low priority |

process/adaptive_linear_regressor.py

| Boundary | Reason | Removable? |
| --- | --- | --- |
| `partial_fit()` / `_process()` — sklearn path converts to numpy before `model.partial_fit` / `model.predict` | sklearn `SGDRegressor` / `PassiveAggressiveRegressor` have no Array API support | No |
| `partial_fit()` / `_process()` — river path converts to pandas before `learn_many` / `predict_many` | river has no Array API support | No |

dim_reduce/adaptive_decomp.py

| Boundary | Reason | Removable? |
| --- | --- | --- |
| `partial_fit()` — converts to numpy before `sklearn.IncrementalPCA.partial_fit` | `IncrementalPCA` has no Array API support; sklearn supports batch `PCA` under `array_api_dispatch=True` but not `IncrementalPCA` | No (not yet in sklearn) |
| `_process()` — converts to numpy before `estimator.transform`, then converts back | Same as above | No |
| `MiniBatchNMF` — same pattern | `MiniBatchNMF` has no Array API support in sklearn | No |

Fully unconverted modules

These modules were not touched by the Array API conversion.

process/linear_regressor.py

  • Uses np.any(np.isnan(...)) guard + sklearn.LinearModel.predict
  • sklearn.linear_model.Ridge does support Array API with solver="svd" and array_api_dispatch=True
  • The predict call would also return arrays in the source namespace under dispatch mode
  • This module could be made fully Array API compliant if:
    1. NaN check converted to xp.any(xp.isnan(...))
    2. sklearn.set_config(array_api_dispatch=True) is enabled
    3. Ridge is configured with solver="svd"
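The three steps above might look like this in practice. This is a sketch rather than the module's actual code: plain numpy serves as the `xp` namespace, the data is synthetic, and step 2 (enabling dispatch) is described in comments only so the snippet runs without `array-api-compat`.

```python
import numpy as np
from sklearn.linear_model import Ridge

xp = np  # stand-in for the incoming message's Array API namespace

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = X @ rng.normal(size=4)

# Step 1: namespace-generic NaN guard instead of np.any(np.isnan(...)).
if not xp.any(xp.isnan(X)):
    # Step 3: "svd" is the Ridge solver with Array API support.
    model = Ridge(alpha=1.0, solver="svd").fit(X, y)
    # Step 2 (not enabled here): with sklearn.set_config(array_api_dispatch=True),
    # fit/predict would accept and return arrays in X's own namespace directly,
    # so the surrounding np.asarray conversions could be deleted.
    preds = model.predict(X)
```

With dispatch enabled and, say, a torch array as input, `predict` would return a torch tensor, eliminating the round-trip through numpy entirely.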

process/sgd.py

  • Uses np.any(np.isnan(...)) guard, .reshape, sklearn.SGDClassifier._predict_proba_lr
  • SGDClassifier has no Array API support in sklearn
  • Could trivially convert the NaN check and reshape to xp, but the sklearn boundary would remain
  • Low value — the estimator is the bottleneck

process/sklearn.py

  • Generic wrapper — accepts arbitrary sklearn/river/hmmlearn models by class name string
  • Uses numpy for reshaping, prediction output conversion, axis construction
  • Cannot be generically converted because the wrapped model class is not known until runtime
  • Some specific model classes used through this wrapper (e.g. Ridge, LinearDiscriminantAnalysis) do support Array API dispatch, but the wrapper cannot assume this

dim_reduce/incremental_decomp.py

  • Uses np.prod(train_msg.data.shape) and np.asarray(range(...)) in _partial_fit_windowed
  • These are trivial (shape is a Python tuple, range is a Python object) but the module delegates to adaptive_decomp.py which has the sklearn boundary anyway
  • Low value — the underlying estimator is the bottleneck
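For reference, the two trivial replacements would look like this (illustrative only, since the sklearn boundary remains downstream in adaptive_decomp.py; `shape` is a stand-in for `train_msg.data.shape`):

```python
import math
import numpy as np

xp = np           # stand-in for the data's Array API namespace
shape = (4, 8, 2) # e.g. train_msg.data.shape — a plain Python tuple

n_elements = math.prod(shape)  # replaces np.prod(train_msg.data.shape)
idx = xp.arange(5)             # replaces np.asarray(range(5))
```

`math.prod` works on the tuple directly with no array library involved, and `xp.arange` keeps the index array in the source namespace.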

process/rnn.py

  • PyTorch-native — uses torch.Tensor throughout, not Array API
  • No conversion needed or appropriate; PyTorch is the target backend

Opportunities with sklearn array_api_dispatch=True

sklearn 1.8.0 (currently installed) supports Array API dispatch for a specific set of estimators. The relevant ones for ezmsg-learn are:

| sklearn estimator | Used in | Array API support | Constraint |
| --- | --- | --- | --- |
| `LinearDiscriminantAnalysis` | process/slda.py | Yes | Requires `solver="svd"` |
| `Ridge` | process/linear_regressor.py (via `StaticLinearRegressor`) | Yes | Requires `solver="svd"` |
| `PCA` | Not directly used (uses `IncrementalPCA` instead) | Yes | `IncrementalPCA` is not supported |
| `SGDClassifier` | process/sgd.py | No | — |
| `SGDRegressor` | process/adaptive_linear_regressor.py | No | — |
| `IncrementalPCA` | dim_reduce/adaptive_decomp.py | No | — |
| `MiniBatchNMF` | dim_reduce/adaptive_decomp.py | No | — |

What enabling dispatch would require

  1. Set sklearn.set_config(array_api_dispatch=True) at import time or in a config context
  2. For slda.py: change LDA solver to "svd" (currently "lsqr" for the .mat path; pickle path is user-controlled)
  3. For linear_regressor.py: ensure Ridge uses solver="svd"
  4. Remove the np.asarray numpy boundary in those modules — sklearn would accept and return arrays in the source namespace directly

Risks

  • array_api_dispatch is marked experimental in sklearn 1.8
  • Solver constraints (solver="svd") may change numerical results slightly
  • The .mat-loading path in slda.py manually constructs LDA weights and uses solver="lsqr" with shrinkage="auto" — the "svd" solver does not support shrinkage, so this path cannot be converted
  • Enabling dispatch globally could have unintended effects on other sklearn usage in the same process
