
Remaining Array API gaps in ezmsg-learn #10

@cboulay

Description


After the most recent Array API conversion (#9), several modules still contain NumPy-only boundaries or are entirely unconverted. This issue tracks those gaps, their root causes, and which of them can be resolved.

Modules with intentional NumPy boundaries

These modules have been partially converted — Array API is used for data manipulation, but a NumPy boundary exists before calling into a library that requires it.
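The boundary pattern shared by these modules can be sketched as follows. This is a minimal illustration, not the actual ezmsg-learn code: `_StubEstimator` and `predict_with_numpy_boundary` are hypothetical names, and `xp` stands in for the data's Array API namespace (plain numpy here).

```python
import numpy as np

class _StubEstimator:
    """Stand-in for a numpy-only estimator (e.g. sklearn's IncrementalPCA)."""
    def predict(self, X):
        return X.sum(axis=-1)

def predict_with_numpy_boundary(model, data, xp=np):
    # Convert to numpy at the estimator boundary...
    result_np = model.predict(np.asarray(data))
    # ...then hand the result back in the caller's namespace.
    return xp.asarray(result_np)

preds = predict_with_numpy_boundary(_StubEstimator(), np.ones((4, 3)))
```

The conversion cost is paid twice per call (in and out), which is why removing these boundaries is worthwhile wherever the underlying estimator gains Array API support.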

model/refit_kalman.py

| Boundary | Reason | Removable? |
| --- | --- | --- |
| `_compute_gain()` — converts all matrices to numpy before `scipy.linalg.solve_discrete_are` | No Array API (or even generic) DARE solver exists outside scipy | No — would require a pure Array API DARE implementation |
| `refit()` — per-sample mutation loop uses `np.linalg.norm` on 2-element vectors and scalar element assignment | Element-wise mutation with Python-level indexing; impractical to vectorise as written | Possibly — the loop could be vectorised with masked operations, which would also be a performance win |
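The vectorisation suggested in the second row could look roughly like this. It is a hypothetical sketch: `vecs` and `thresh` are illustrative stand-ins for the per-sample quantities inside `refit()`, and the norm is computed without `np.linalg` so the code stays within plain Array API operations.

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 2))   # one 2-element vector per sample
thresh = 1.0

# Instead of: for i in range(100): if np.linalg.norm(vecs[i]) > thresh: ...
norms = np.sqrt(np.sum(vecs**2, axis=-1))   # vectorised, Array-API-friendly norm
mask = norms > thresh                       # boolean mask over samples
# Masked update: rescale only the samples exceeding the threshold,
# guarding the division against zero norms.
scaled = np.where(mask[:, None], vecs / np.maximum(norms, 1e-12)[:, None], vecs)
```

The same `where`-plus-mask shape works for any per-sample conditional assignment, and it removes both the Python-level loop and the `np.linalg` dependency in one step.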

process/slda.py

| Boundary | Reason | Removable? |
| --- | --- | --- |
| `_process()` — converts to numpy before `sklearn.LDA.predict_proba` | sklearn `LinearDiscriminantAnalysis` requires numpy by default | Yes — sklearn 1.8 supports `LinearDiscriminantAnalysis` with `solver="svd"` under `array_api_dispatch=True` (see below) |
| `_reset_state()` — model init and `.mat` loading use numpy throughout | Template creation and weight loading are one-time setup | Low priority |

process/adaptive_linear_regressor.py

| Boundary | Reason | Removable? |
| --- | --- | --- |
| `partial_fit()` / `_process()` — sklearn path converts to numpy before `model.partial_fit` / `model.predict` | sklearn `SGDRegressor` / `PassiveAggressiveRegressor` have no Array API support | No |
| `partial_fit()` / `_process()` — river path converts to pandas before `learn_many` / `predict_many` | river has no Array API support | No |

dim_reduce/adaptive_decomp.py

| Boundary | Reason | Removable? |
| --- | --- | --- |
| `partial_fit()` — converts to numpy before `sklearn.IncrementalPCA.partial_fit` | `IncrementalPCA` has no Array API support; sklearn supports batch `PCA` under `array_api_dispatch=True` but not `IncrementalPCA` | No (not yet in sklearn) |
| `_process()` — converts to numpy before `estimator.transform`, then converts back | Same as above | No |
| `MiniBatchNMF` — same pattern | `MiniBatchNMF` has no Array API support in sklearn | No |

Fully unconverted modules

These modules were not touched by the Array API conversion.

process/linear_regressor.py

  • Uses np.any(np.isnan(...)) guard + sklearn.LinearModel.predict
  • sklearn.linear_model.Ridge does support Array API with solver="svd" and array_api_dispatch=True
  • The predict call would also return arrays in the source namespace under dispatch mode
  • This module could be made fully Array API compliant if:
    1. NaN check converted to xp.any(xp.isnan(...))
    2. sklearn.set_config(array_api_dispatch=True) is enabled
    3. Ridge is configured with solver="svd"
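The three steps above might look like this in practice. This is a sketch rather than the module's actual code: plain numpy serves as the `xp` namespace, the data is synthetic, and step 2 (enabling dispatch) is described in comments only so the snippet runs without `array-api-compat`.

```python
import numpy as np
from sklearn.linear_model import Ridge

xp = np  # stand-in for the incoming message's Array API namespace

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = X @ rng.normal(size=4)

# Step 1: namespace-generic NaN guard instead of np.any(np.isnan(...)).
if not xp.any(xp.isnan(X)):
    # Step 3: "svd" is the Ridge solver with Array API support.
    model = Ridge(alpha=1.0, solver="svd").fit(X, y)
    # Step 2 (not enabled here): with sklearn.set_config(array_api_dispatch=True),
    # fit/predict would accept and return arrays in X's own namespace directly,
    # so the surrounding np.asarray conversions could be deleted.
    preds = model.predict(X)
```

With dispatch enabled and, say, a torch array as input, `predict` would return a torch tensor, eliminating the round-trip through numpy entirely.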

process/sgd.py

  • Uses np.any(np.isnan(...)) guard, .reshape, sklearn.SGDClassifier._predict_proba_lr
  • SGDClassifier has no Array API support in sklearn
  • Could trivially convert the NaN check and reshape to xp, but the sklearn boundary would remain
  • Low value — the estimator is the bottleneck

process/sklearn.py

  • Generic wrapper — accepts arbitrary sklearn/river/hmmlearn models by class name string
  • Uses numpy for reshaping, prediction output conversion, axis construction
  • Cannot be generically converted because the wrapped model class is not known until runtime
  • Some specific model classes used through this wrapper (e.g. Ridge, LinearDiscriminantAnalysis) do support Array API dispatch, but the wrapper cannot assume this

dim_reduce/incremental_decomp.py

  • Uses np.prod(train_msg.data.shape) and np.asarray(range(...)) in _partial_fit_windowed
  • These are trivial (shape is a Python tuple, range is a Python object) but the module delegates to adaptive_decomp.py which has the sklearn boundary anyway
  • Low value — the underlying estimator is the bottleneck
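For reference, the two trivial replacements would look like this (illustrative only, since the sklearn boundary remains downstream in adaptive_decomp.py; `shape` is a stand-in for `train_msg.data.shape`):

```python
import math
import numpy as np

xp = np           # stand-in for the data's Array API namespace
shape = (4, 8, 2) # e.g. train_msg.data.shape — a plain Python tuple

n_elements = math.prod(shape)  # replaces np.prod(train_msg.data.shape)
idx = xp.arange(5)             # replaces np.asarray(range(5))
```

`math.prod` works on the tuple directly with no array library involved, and `xp.arange` keeps the index array in the source namespace.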

process/rnn.py

  • PyTorch-native — uses torch.Tensor throughout, not Array API
  • No conversion needed or appropriate; PyTorch is the target backend

Opportunities with sklearn array_api_dispatch=True

sklearn 1.8.0 (currently installed) supports Array API dispatch for a specific set of estimators. The relevant ones for ezmsg-learn are:

| sklearn estimator | Used in | Array API support | Constraint |
| --- | --- | --- | --- |
| `LinearDiscriminantAnalysis` | process/slda.py | Yes | Requires `solver="svd"` |
| `Ridge` | process/linear_regressor.py (via `StaticLinearRegressor`) | Yes | Requires `solver="svd"` |
| `PCA` | Not directly used (uses `IncrementalPCA` instead) | Yes | `IncrementalPCA` is not supported |
| `SGDClassifier` | process/sgd.py | No | — |
| `SGDRegressor` | process/adaptive_linear_regressor.py | No | — |
| `IncrementalPCA` | dim_reduce/adaptive_decomp.py | No | — |
| `MiniBatchNMF` | dim_reduce/adaptive_decomp.py | No | — |

What enabling dispatch would require

  1. Set sklearn.set_config(array_api_dispatch=True) at import time or in a config context
  2. For slda.py: change LDA solver to "svd" (currently "lsqr" for the .mat path; pickle path is user-controlled)
  3. For linear_regressor.py: ensure Ridge uses solver="svd"
  4. Remove the np.asarray numpy boundary in those modules — sklearn would accept and return arrays in the source namespace directly

Risks

  • array_api_dispatch is marked experimental in sklearn 1.8
  • Solver constraints (solver="svd") may change numerical results slightly
  • The .mat-loading path in slda.py manually constructs LDA weights and uses solver="lsqr" with shrinkage="auto" — the "svd" solver does not support shrinkage, so this path cannot be converted
  • Enabling dispatch globally could have unintended effects on other sklearn usage in the same process
