202 model selection module implementation #232

Open

NusretSalli wants to merge 3 commits into paucablop:main from NusretSalli:202-Model-selection-module-implementation

Conversation

@NusretSalli (Collaborator)

This PR will hopefully close #202 at some point 😄

Goal of the PR - what does it solve?

A CandidateSelector class has been implemented with the goal of easily creating, managing, and selecting "candidates" based on performance metrics, while also including less commonly used but still important metrics such as variance.

The introduced methods encourage users not only to pick the best candidate by metrics such as RMSE, but also to select it based on the variance of the model, which the class hopefully makes easier to do. 😃

Implementation details

Two files have been created to implement the model selection module:

  • CandidateSelector class, the main driving force of the module
  • BaseFittedModel class, which works as a container for a single fitted model - CandidateSelector returns its output as BaseFittedModel instances (one for each candidate).

Result from the implementation

Below are some of the ways that you can use it (at least for now):

Example 1

Imagine we need to train a series of PLS models with different parameters and preprocessing applied:

# Create a pipeline with preprocessing and PLS regression
pipeline = make_pipeline(
    SavitzkyGolayFilter(),
    PLSRegression()
)

# Define parameter grid to search - use the auto-generated step names from make_pipeline
# make_pipeline uses lowercase class names: 'savitzkygolayfilter', 'plsregression'
param_grid = {
    "savitzkygolayfilter__window_size": [5, 11, 21, 31],
    "savitzkygolayfilter__polynomial_order": [1, 2, 3],
    "plsregression__n_components": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}

Normally you would use something like GridSearchCV, but now we can instantiate a CandidateSelector instead (which in fact runs a GridSearchCV under the hood) and fit it:

# Create CandidateSelector with RMSE scoring and train score tracking
selector = CandidateSelector(
    estimator=pipeline,
    param_grid=param_grid,
    scoring="root_mean_squared_error",  # Enables RMSE metrics
    cv=5,
    n_jobs=-1,  # Use parallel processing for faster search
    return_train_score=True
)

# Fit the selector
selector.fit(X_train, y_train)

And from this, each trained "model" will be a BaseFittedModel instance that contains several interesting metrics such as rank, parameters, and test metric scores (RMSE and variance), as well as RMSECV, RMSE ratio, etc.
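As a rough sketch of what such a container might look like (the field names below are inferred from the examples in this PR description; the actual BaseFittedModel signature in the PR may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FittedModelSketch:
    """Illustrative stand-in for the BaseFittedModel container described above."""
    rank: int                           # 1 = best candidate
    params: dict                        # parameter combination from the grid
    mean_test_score: float              # raw CV score (negative for neg_* scorers)
    variance: float                     # variance of the CV fold scores
    rmsecv: Optional[float] = None      # RMSE from cross-validation
    rmse_train: Optional[float] = None  # RMSE on the training folds
    rmse_ratio: Optional[float] = None  # rmsecv / rmse_train (overfitting indicator)

best = FittedModelSketch(
    rank=1,
    params={"plsregression__n_components": 4},
    mean_test_score=-0.31,
    variance=0.0021,
    rmsecv=0.31,
    rmse_train=0.27,
    rmse_ratio=0.31 / 0.27,
)
print(f"Rank {best.rank}: ratio={best.rmse_ratio:.3f}")
```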

Example 2 - accessing candidates after training

After you have trained your candidates, several calls can be made to explore them further - for instance, you can ask for the top 3 candidates and print some of their metrics:

# Get top 3 candidates
candidates = selector.get_candidates(n=3)

print("Top 3 Candidates:")
print("-" * 80)
for candidate in candidates:
    print(f"Rank {candidate.rank}: {candidate.params}")
    print(f"  RMSECV: {candidate.rmsecv:.4f}")
    print(f"  RMSE Train: {candidate.rmse_train:.4f}")
    print(f"  RMSE Ratio: {candidate.rmse_ratio:.4f}")
    print(f"  Variance: {candidate.variance:.6f}")

You can also filter your candidates based on the RMSE ratio (i.e. RMSECV / RMSE on the training set) to identify overfitting:

# Filter candidates with good generalization using filter_candidates()
robust_candidates = selector.filter_candidates(metric='rmse_ratio', threshold=1.3, mode='<=')

print(f"Found {len(robust_candidates)} candidates with RMSE ratio <= 1.3:")
for c in robust_candidates[:5]:  # Show first 5
    print(f"  Rank {c.rank}: ratio={c.rmse_ratio:.3f}, RMSECV={c.rmsecv:.4f}")

Visualization is also possible with the plot_cv_metrics call:

# Plot RMSECV vs RMSE ratio - colored by number of components
selector.plot_cv_metrics(
    color_by="plsregression__n_components",
    title="Cross-validation Error vs Overfitting"
)
plt.show()

Which plots something like this (not visually polished, but it shows the idea quite well nonetheless 😃):

image

Or if you want to have a plot against test metric and variance:

# Plot test score vs variance - colored by number of components
selector.plot_score_vs_variance(
    color_by="plsregression__n_components",
    title="Test Score vs Variance"
)
plt.show()

Which produces plots like this:

image

What to do from here

With the addition of the inspector module, it is now of interest to align these two modules, as I imagine the workflow should be something like model selection module --> inspector module. In that sense I would like feedback on how I should go about this (and also feedback in general, i.e. whether this is good or whether I should make any larger changes).

So I don't expect this PR to be done now; it is merely meant as a "feedback" round, so the model selection module can be better aligned and we can decide how to approach that 😃

@NusretSalli NusretSalli requested a review from paucablop January 27, 2026 23:30
@NusretSalli NusretSalli self-assigned this Jan 27, 2026
Copilot AI review requested due to automatic review settings January 27, 2026 23:30
Contributor

Copilot AI left a comment


Pull request overview

This PR implements a model selection module to address issue #202, introducing two key classes: CandidateSelector for managing model selection with grid search, and BaseFittedModel as a container for individual candidate models. The module wraps sklearn's GridSearchCV and provides enhanced functionality for evaluating models based on RMSE metrics, variance, and overfitting indicators (RMSE ratio), with visualization capabilities.

Changes:

  • Added CandidateSelector class that extends GridSearchCV with candidate ranking, filtering, and RMSE-specific metrics
  • Added BaseFittedModel dataclass for storing individual model candidate results and metadata
  • Implemented comprehensive test suite covering instantiation, fitting, candidate retrieval, filtering, and visualization

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 18 comments.

File Description
chemotools/model_selection/__init__.py Module initialization exporting the two main classes
chemotools/model_selection/_fitted_model.py BaseFittedModel dataclass implementation for storing candidate model information including RMSE metrics
chemotools/model_selection/_candidate_selector.py CandidateSelector class wrapping GridSearchCV with enhanced model selection, filtering, and visualization features
tests/model_selection/test_candidate_selector.py Comprehensive test suite covering core functionality, edge cases, and plotting methods


Comment on lines +320 to +335
markers = ["o", "s", "^", "D", "v", "*", "p", "h"]
cmap = plt.colormaps.get_cmap("tab10")

for idx, key in enumerate(sorted(groups.keys())):
    data = groups[key]
    ax.scatter(
        [d[0] for d in data],
        [d[1] for d in data],
        marker=markers[idx % len(markers)],
        c=[cmap(idx % 10)],
        s=80,
        label=str(key),
        edgecolors="black",
        linewidths=0.5,
        alpha=0.8,
    )

Copilot AI Jan 27, 2026


Potential issue with plot coloring when there are many parameter groups. The code uses cmap(idx % 10) on line 329, which means after 10 groups, colors will repeat. Similarly, markers cycle through 8 options. When a grid search has more than 8-10 unique values for the color_by parameter, the visualization becomes ambiguous. Consider: (1) adding a warning when there are more groups than available colors/markers, (2) using a different colormap that handles more distinct values, or (3) documenting this limitation in the docstring.
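A minimal, standalone sketch of option (1) - warning when the groups outnumber the available styles. `warn_if_groups_exceed_styles` is a hypothetical helper, not part of the PR:

```python
import warnings

MARKERS = ["o", "s", "^", "D", "v", "*", "p", "h"]  # the 8 markers used in the PR
N_COLORS = 10  # size of the tab10 colormap

def warn_if_groups_exceed_styles(n_groups: int) -> None:
    """Emit a UserWarning when marker or color assignments will start repeating."""
    limit = min(len(MARKERS), N_COLORS)
    if n_groups > limit:
        warnings.warn(
            f"{n_groups} groups exceed the {limit} distinct marker/color styles; "
            "some groups will share a style and may be visually ambiguous.",
            UserWarning,
        )

# Demonstration: 12 groups is more than the 8 available markers
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_groups_exceed_styles(12)
```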

Comment on lines +127 to +133
def get_candidate(self, rank: int = 1) -> BaseFittedModel:
    """Return candidate by rank (1 = best)."""
    check_is_fitted(self, ["candidates_"])
    for c in self.candidates_:
        if c.rank == rank:
            return c
    raise ValueError(f"No candidate with rank {rank}.")

Copilot AI Jan 27, 2026


Missing input validation for the rank parameter. The method does not validate that rank is positive. If a user passes rank=0 or rank=-1, the method will raise a "No candidate with rank" ValueError, but it would be clearer to validate the input upfront and provide a more specific error message indicating that rank must be a positive integer (>= 1).
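A standalone sketch of the suggested upfront validation. Candidates are modeled as plain dicts here for illustration; `get_candidate_validated` is a hypothetical function, not the PR's API:

```python
def get_candidate_validated(candidates, rank=1):
    """Return the candidate with the given rank, validating rank upfront."""
    if not isinstance(rank, int) or isinstance(rank, bool) or rank < 1:
        raise ValueError(f"rank must be a positive integer (>= 1), got {rank!r}.")
    for c in candidates:
        if c["rank"] == rank:
            return c
    raise ValueError(f"No candidate with rank {rank}.")

# Illustrative candidate list
cands = [{"rank": 1, "params": {"alpha": 0.1}}, {"rank": 2, "params": {"alpha": 1.0}}]
```

With this check, `rank=0` fails with a message naming the actual requirement instead of the misleading "No candidate with rank 0".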

best_params_ : dict
Parameters that achieved the best score.
best_score_ : float
Best cross-validation score.

Copilot AI Jan 27, 2026


The best_index_ attribute is set during fit() but is not documented in the class docstring's "Attributes" section. For API consistency, all fitted attributes (those set during fit and ending with underscore) should be documented. Add best_index_ to the Attributes section with a description like "Index of best parameter combination in cv_results_."

Suggested change

      Best cross-validation score.
+ best_index_ : int
+     Index of best parameter combination in ``cv_results_``.

Comment on lines +10 to +16

import matplotlib.pyplot as plt
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_is_fitted
import operator

Copilot AI Jan 27, 2026


Import order inconsistency: the operator module should be imported earlier with other standard library imports (before third-party imports like matplotlib, numpy, sklearn). Following PEP 8 conventions, imports should be grouped as: (1) standard library, (2) third-party, (3) local imports.

Suggested change

- import matplotlib.pyplot as plt
- import numpy as np
- from sklearn.base import BaseEstimator
- from sklearn.model_selection import GridSearchCV
- from sklearn.utils.validation import check_is_fitted
- import operator
+ import operator
+
+ import matplotlib.pyplot as plt
+ import numpy as np
+ from sklearn.base import BaseEstimator
+ from sklearn.model_selection import GridSearchCV
+ from sklearn.utils.validation import check_is_fitted

Comment on lines +11 to +24
@pytest.fixture
def fitted_selector(dummy_data_loader):
    """Return a fitted CandidateSelector for testing."""
    X, y = dummy_data_loader
    selector = CandidateSelector(
        estimator=Ridge(random_state=0),
        param_grid={"alpha": [0.1, 1.0, 10.0]},
        cv=3,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
        n_jobs=1,
    )
    selector.fit(X, y)
    return selector

Copilot AI Jan 27, 2026


Missing test coverage for non-RMSE scoring functions. All tests use scoring="neg_root_mean_squared_error", but the CandidateSelector should work with other scoring metrics too. The behavior differs when RMSE metrics are None (as documented in line 108 of _fitted_model.py). Add tests that verify: (1) the selector works with other scorers like "r2" or "neg_mean_squared_error", (2) RMSE-specific attributes (rmsecv, rmse_train, rmse_ratio) are None when not using RMSE scoring, and (3) plotting and filtering still work with non-RMSE metrics.

@@ -0,0 +1,4 @@
from ._candidate_selector import CandidateSelector

Copilot AI Jan 27, 2026


Missing module-level docstring. Other modules in the codebase (e.g., smooth, inspector) have docstrings at the module level. Add a docstring describing the model selection module's purpose and main classes.

Comment on lines +106 to +114
# Calculate RMSE metrics if using neg_root_mean_squared_error
rmsecv, rmse_train_val, rmse_ratio = None, None, None
if isinstance(scoring, str) and "neg_root_mean_squared_error" in scoring:
    rmsecv = -mean_test
    if mean_train is not None:
        rmse_train_val = -mean_train
        if rmse_train_val != 0:
            rmse_ratio = rmsecv / rmse_train_val


Copilot AI Jan 27, 2026


The RMSE metric calculation only checks for "neg_root_mean_squared_error" in the scoring string, which may miss other RMSE-related scorers or fail silently for non-RMSE metrics. Consider: (1) documenting this RMSE-specific behavior more prominently in the class docstring, (2) adding a warning when RMSE metrics will not be calculated, or (3) providing a clearer way to enable/disable RMSE-specific metrics. Users who don't use RMSE scoring will have None for these metrics but may not understand why.
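The behavior being flagged can be illustrated with a standalone mirror of the derivation quoted above (`derive_rmse_metrics` is a hypothetical helper, not the PR's API): for any scorer other than neg_root_mean_squared_error, all three RMSE fields silently stay None.

```python
def derive_rmse_metrics(scoring, mean_test, mean_train=None):
    """Mirror the quoted RMSE derivation: only neg_root_mean_squared_error
    populates the RMSE metrics; other scorers leave them as None."""
    rmsecv = rmse_train = rmse_ratio = None
    if isinstance(scoring, str) and "neg_root_mean_squared_error" in scoring:
        rmsecv = -mean_test  # the scorer is negated, so flip the sign back
        if mean_train is not None:
            rmse_train = -mean_train
            if rmse_train != 0:
                rmse_ratio = rmsecv / rmse_train
    return rmsecv, rmse_train, rmse_ratio

# RMSE scorer: all three metrics are populated
rmse_case = derive_rmse_metrics("neg_root_mean_squared_error", -0.31, -0.27)
# Non-RMSE scorer: metrics stay None with no indication why
r2_case = derive_rmse_metrics("r2", 0.95, 0.97)
```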

"""Return top *n* candidates (all if n is None)."""
check_is_fitted(self, ["candidates_"])
if n is None:
return self.candidates_

Copilot AI Jan 27, 2026


Missing input validation for the n parameter. The method does not validate that n is positive when provided. If a user passes n=0 or n=-1, the method will return an empty list or unexpected results respectively, without any warning or error. Consider adding validation to ensure n is either None or a positive integer, raising a ValueError for invalid inputs.

Suggested change

      return self.candidates_
+ if not isinstance(n, int) or n < 1:
+     raise ValueError(
+         f"n must be a positive integer or None, got {n!r}."
+     )

Comment on lines +1 to +6
import numpy as np
import pytest
from sklearn.linear_model import Ridge

from chemotools.model_selection import BaseFittedModel, CandidateSelector


Copilot AI Jan 27, 2026


Missing sklearn compatibility test. Other modules in the codebase (e.g., smooth/test_savitzky_golay_filter.py) include a test using check_estimator from sklearn.utils.estimator_checks to verify sklearn API compliance. Consider adding a similar test for CandidateSelector to ensure it follows sklearn conventions, though note that meta-estimators like this one may need special handling or exclusions for certain checks.

Comment on lines +298 to +300
# Auto-detect color_by parameter
if color_by is None and self.candidates_:
    color_by = next(iter(self.candidates_[0].params), None)

Copilot AI Jan 27, 2026


Potential edge case: if self.candidates_[0].params is an empty dictionary (which could happen if param_grid is empty or has no keys), next(iter(self.candidates_[0].params), None) will return None, which is handled. However, this edge case is unlikely in practice since GridSearchCV requires a non-empty param_grid. Consider adding a check or documenting this assumption.

Suggested change

- # Auto-detect color_by parameter
- if color_by is None and self.candidates_:
-     color_by = next(iter(self.candidates_[0].params), None)
+ # Auto-detect color_by parameter if candidates have parameters
+ if color_by is None and self.candidates_:
+     first_params = getattr(self.candidates_[0], "params", None) or {}
+     if first_params:
+         # Use the first available parameter name for coloring
+         color_by = next(iter(first_params))



Development

Successfully merging this pull request may close these issues.

Epic: Model Selection Module

2 participants