Evolve interpretable quantitative factors. Evaluate them on CPU or GPU. Keep the formula.
QuantGplearn is a genetic-programming framework for quantitative factor research. Instead of fitting an opaque set of weights, it searches for human-readable expressions built from market features, rolling operators, and cross-sectional transformations:
cs_rank(ts_zscore(close, 168))
div(ts_delta(vwap, 24), ts_std(close, 72))
cs_demean(ts_one_ols_resid(close, volume, 336))
The same symbolic program representation can run through the original NumPy/Pandas engine or through a Torch tensor backend designed for dense panel data. The result is a practical bridge between explainable symbolic research and GPU-accelerated factor evaluation.
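To make the idea of a formula as an expression tree concrete, here is a minimal sketch. It is not QuantGplearn's internal representation: the nested-tuple tree, the `OPS` table, the `evaluate` and `to_string` helpers, and the simplified `ts_delta` and `cs_demean` implementations are all illustrative assumptions.

```python
import numpy as np

def ts_delta(x, window):
    """Change over `window` steps along time (axis 0); warm-up rows are NaN."""
    out = np.full(x.shape, np.nan)
    out[window:] = x[window:] - x[:-window]
    return out

def cs_demean(x):
    """Subtract the cross-sectional mean at each timestamp (axis 1); NaNs propagate."""
    return x - x.mean(axis=1, keepdims=True)

# Hypothetical operator table: name -> callable.
OPS = {"ts_delta": ts_delta, "cs_demean": cs_demean}

def evaluate(node, features):
    """Recursively evaluate a nested-tuple expression tree."""
    if isinstance(node, str):            # terminal: feature name
        return features[node]
    if isinstance(node, (int, float)):   # terminal: constant such as a window
        return node
    name, *args = node
    return OPS[name](*(evaluate(arg, features) for arg in args))

def to_string(node):
    """Render the tree back into a readable formula."""
    if not isinstance(node, tuple):
        return str(node)
    name, *args = node
    return f"{name}({', '.join(to_string(a) for a in args)})"

close = np.array([[1.0, 2.0], [2.0, 4.0], [4.0, 5.0]])  # shape [T=3, N=2]
tree = ("cs_demean", ("ts_delta", "close", 1))

formula = to_string(tree)
factor = evaluate(tree, {"close": close})
print(formula)
print(factor)
```

The same tree yields both the readable string and the factor values, which is what lets a search procedure keep the formula alongside its score.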
QuantGplearn discovers candidate signals; it is not a promise of investment performance. Validate every factor with leakage-aware, out-of-sample research and a realistic execution model.
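Summary (translated from the Chinese introduction)
QuantGplearn is a genetic-programming framework for quantitative factor research. Rather than producing hard-to-interpret black-box weights, it evolves readable formulas composed of market features, time-series operators, and cross-sectional operators. The project keeps the original NumPy/Pandas CPU path and adds a Torch/GPU execution backend for [time, symbol, feature] panel data. It supports IC, RankIC, ICIR, and long-short portfolio Sharpe proxy objectives, as well as factor-correlation filtering.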
| Capability | What it gives you |
|---|---|
| Formula-native research | Every candidate is an inspectable expression tree, not a hidden parameter vector. |
| Two execution backends | Keep the mature NumPy/Pandas workflow or evaluate dense panel expressions with Torch. |
| Time-series + cross-section semantics | Mix rolling logic within each security with ranking and normalization across securities. |
| Finance-aware objectives | Search with IC, RankIC, ICIR, RankICIR, or a long-short Sharpe proxy. |
| Diversity controls | Hall-of-fame ranking and factor-correlation filtering reduce duplicate discoveries. |
| Research ergonomics | Scikit-learn-style estimators, warm starts, low-memory mode, score caching, and expression export. |
- `SymbolicRegressor` evolves formulas for continuous prediction.
- `SymbolicClassifier` evolves binary classification formulas.
- `SymbolicTransformer` produces multiple symbolic features on the CPU path.
- `GpuSymbolicTransformer` mines multiple panel factors with Torch execution.
- `grow`, `full`, and `half and half` population initialization.
- Tournament selection.
- Subtree crossover.
- Subtree, hoist, point, and point-replacement mutation.
- Reproduction of surviving programs.
- A maximum-length evolution guard and parsimony pressure against unnecessarily complex formulas.
- Warm-started evolution and generation-level diagnostics.
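The selection and crossover steps in the list above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: trees are nested lists, the helper names are hypothetical, type constraints between operators are ignored, and the parsimony penalty follows the gplearn-style form `raw_fitness - coefficient * length`.

```python
import random

def tree_length(tree):
    """Number of nodes in a nested-list expression tree."""
    if not isinstance(tree, list):
        return 1
    return 1 + sum(tree_length(child) for child in tree[1:])

def penalized_fitness(raw_fitness, length, parsimony_coefficient):
    """Higher is better: subtract a length penalty from the raw score."""
    return raw_fitness - parsimony_coefficient * length

def tournament(population, scores, tournament_size, rng):
    """Sample contenders at random and return the best-scoring one."""
    contenders = rng.sample(range(len(population)), tournament_size)
    winner = max(contenders, key=lambda i: scores[i])
    return population[winner]

def subtree_paths(tree, path=()):
    """Yield the index path of every node in the tree."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))

def get_subtree(tree, path):
    """Follow an index path down to a node."""
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    head, *rest = path
    copy = list(tree)
    copy[head] = replace_subtree(tree[head], tuple(rest), new)
    return copy

def crossover(parent, donor, rng):
    """Replace a random subtree of `parent` with a random subtree of `donor`."""
    target = rng.choice(list(subtree_paths(parent)))
    source = rng.choice(list(subtree_paths(donor)))
    return replace_subtree(parent, target, get_subtree(donor, source))

rng = random.Random(0)
population = [
    ["add", "close", "volume"],
    ["mul", ["sub", "close", "vwap"], "volume"],
    ["cs_rank", "close"],
]
raw = [0.030, 0.035, 0.020]
scores = [penalized_fitness(r, tree_length(t), 0.001) for r, t in zip(raw, population)]

parent = tournament(population, scores, tournament_size=3, rng=rng)
child = crossover(parent, population[0], rng)
print(parent, child)
```

A larger `tournament_size` makes it more likely that the strongest program wins, which is why the parameter acts as selection pressure.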
- Converts a `pandas.DataFrame` into a dense `[T, N, F]` tensor.
- Applies time-series operators along time for each security.
- Applies cross-section operators across securities at each timestamp.
- Carries a validity mask for missing features and targets.
- Cleans non-finite outputs and optionally normalizes each cross-section.
- Caches expression scores and, optionally, factor tensors.
- Ranks a hall of fame by the chosen research objective.
- Applies best-effort correlation filtering with `tolerable_corr`; if too few diverse candidates remain, the strongest candidates fill the requested component count.
- Exports factor values as tensors, NumPy arrays, or panel DataFrames.
- Exports rank, score, expression, tree length, and tree depth.
- Provides `TensorFactorCalculator` and `GPAlphaPool` for IC-based factor pools.
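The best-effort correlation filtering mentioned above can be pictured as a greedy pass followed by a backfill. The sketch below is an assumption about how such a filter might work, not the library's implementation; `select_diverse` is a hypothetical helper and factors are assumed to be flattened to one row per candidate with no missing values.

```python
import numpy as np

def select_diverse(scores, factors, n_components, tolerable_corr):
    """Greedy pick by descending score, skipping near-duplicates, then backfill."""
    order = np.argsort(scores)[::-1]          # strongest candidate first
    corr = np.abs(np.corrcoef(factors))       # [K, K] absolute correlations
    chosen = []
    for k in order:
        if len(chosen) == n_components:
            break
        if all(corr[k, j] <= tolerable_corr for j in chosen):
            chosen.append(int(k))
    # Best effort: if too few diverse candidates remain,
    # fill the remaining slots with the strongest leftovers.
    for k in order:
        if len(chosen) == n_components:
            break
        if int(k) not in chosen:
            chosen.append(int(k))
    return chosen

rng = np.random.default_rng(0)
base = rng.normal(size=500)
factors = np.stack([
    base,                                  # candidate 0
    base + 0.01 * rng.normal(size=500),    # candidate 1: near-duplicate of 0
    rng.normal(size=500),                  # candidate 2: independent
])
scores = np.array([0.05, 0.04, 0.03])

picked = select_diverse(scores, factors, n_components=2, tolerable_corr=0.7)
filled = select_diverse(scores, factors, n_components=3, tolerable_corr=0.7)
print(picked, filled)
```

With two components requested, the near-duplicate is skipped in favour of the weaker but independent candidate. With three requested, the duplicate is admitted only because nothing else is left.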
git clone https://github.com/WYFHHH/QuantGplearn.git
cd QuantGplearn
python -m pip install -e .
QuantGplearn targets Python 3.11+ on Linux. For GPU execution, install the PyTorch build that matches your CUDA environment before installing the project.
The GPU API accepts a DataFrame with a [datetime, symbol] MultiIndex, or
equivalent ordinary columns:
import numpy as np
import pandas as pd
rng = np.random.default_rng(7)
times = pd.date_range("2024-01-01", periods=256, freq="h")
symbols = [f"asset_{i}" for i in range(12)]
index = pd.MultiIndex.from_product(
    [times, symbols], names=["datetime", "symbol"]
)
returns = rng.normal(0.0, 0.01, size=(len(times), len(symbols)))
close = 100.0 * np.exp(np.cumsum(returns, axis=0))
volume = rng.lognormal(10.0, 0.7, size=close.shape)
target = np.roll(returns, -1, axis=0)  # next-period return as the target
target[-1] = np.nan  # the last timestamp has no future return
panel = pd.DataFrame(
    {
        "close": close.reshape(-1),
        "volume": volume.reshape(-1),
        "target": target.reshape(-1),
    },
    index=index,
)
from QuantGplearn.gpu_transformer import GpuSymbolicTransformer
model = GpuSymbolicTransformer(
    population_size=256,
    generations=10,
    hall_of_fame=50,
    n_components=8,
    tournament_size=32,
    function_set=[
        "add", "sub", "mul", "div",
        "ts_delta", "ts_mean", "ts_std", "ts_zscore",
        "cs_rank", "cs_demean",
    ],
    feature_names=["close", "volume"],
    objective="icir",
    max_length=20,
    tolerable_corr=0.7,
    device="cuda:0",
    random_state=2025,
    verbose=1,
)
model.fit_panel(panel, target_col="target")
expressions = model.get_factor_expressions()
factors = model.transform_panel(output="dataframe")
print(expressions)
print(factors.tail())
If CUDA is unavailable, a CUDA device request falls back to CPU tensor execution with a warning. This makes the same example useful as a functional smoke test on a non-GPU machine.
The original estimators remain available for conventional tabular symbolic learning:
from QuantGplearn.genetic import SymbolicTransformer
cpu_model = SymbolicTransformer(
    population_size=1000,
    generations=20,
    hall_of_fame=100,
    n_components=10,
    function_set=["add", "sub", "mul", "div", "sqrt", "log"],
    metric="spearman",
    parsimony_coefficient=0.001,
    random_state=2025,
)
cpu_model.fit(X_train, y_train)
symbolic_features = cpu_model.transform(X_test)
flowchart LR
A["Panel DataFrame<br/>datetime × symbol"] --> B["TensorPanelData<br/>[T, N, F] + mask"]
B --> C["GP Population<br/>expression trees"]
C --> D["_Program.execute_tensor"]
D --> E["Torch operators<br/>raw · time-series · cross-section"]
E --> F["ProgramEvaluator<br/>clean · normalize · cache"]
F --> G["TensorFitness<br/>IC · RankIC · ICIR · Sharpe"]
G --> H["Selection<br/>parsimony · hall of fame · diversity"]
H --> C
H --> I["Factors + readable expressions"]
For an internal tensor with shape [T, N, F]:
| Dimension | Meaning | Example operation |
|---|---|---|
| `T` | ordered timestamps | `ts_mean(close, 24)` rolls along time |
| `N` | securities at one timestamp | `cs_rank(factor)` ranks across securities |
| `F` | input feature channels | `close`, `volume`, `vwap`, custom features |
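The two axes in the table can be illustrated on a small `[T, N]` array. This is a NumPy sketch under simplifying assumptions (no missing values, ties broken by position), and the function bodies are illustrative rather than the library's Torch kernels.

```python
import numpy as np

def ts_mean(x, window):
    """Rolling mean along time (axis 0); the first window-1 rows stay NaN."""
    out = np.full(x.shape, np.nan)
    for t in range(window - 1, x.shape[0]):
        out[t] = x[t - window + 1 : t + 1].mean(axis=0)
    return out

def cs_rank(x):
    """Rank across securities (axis 1) at each timestamp, scaled into (0, 1]."""
    ranks = x.argsort(axis=1).argsort(axis=1) + 1   # 1..N within each row
    return ranks / x.shape[1]

close = np.array([
    [10.0, 20.0, 30.0],
    [12.0, 18.0, 33.0],
    [14.0, 29.0, 27.0],
])  # shape [T=3, N=3]

rolled = ts_mean(close, window=2)   # each column is processed independently
ranked = cs_rank(close)             # each row is processed independently
print(rolled)
print(ranked)
```

Time-series operators never mix information between securities, and cross-section operators never mix information between timestamps, so the two families compose freely inside one formula.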
A program returns one [T, N] factor tensor. Warm-up periods, missing
observations, and invalid targets are represented through NaNs and a boolean
mask rather than silently becoming valid samples.
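As a rough illustration of the NaN-plus-mask convention, the sketch below densifies a few sparse observations into a `[T, N]` array and derives a validity mask. The record layout and variable names are assumptions chosen for the example; the library's own conversion works from a DataFrame.

```python
import numpy as np

# (timestamp, symbol, close) observations; symbol "B" is missing at t=1.
records = [
    (0, "A", 10.0), (0, "B", 20.0),
    (1, "A", 11.0),
    (2, "A", 12.0), (2, "B", 22.0),
]

times = sorted({t for t, _, _ in records})
symbols = sorted({s for _, s, _ in records})
t_index = {t: i for i, t in enumerate(times)}
s_index = {s: i for i, s in enumerate(symbols)}

# Dense [T, N] grid; cells without an observation stay NaN.
dense = np.full((len(times), len(symbols)), np.nan)
for t, s, value in records:
    dense[t_index[t], s_index[s]] = value

mask = np.isfinite(dense)          # True only where a sample is valid
valid_mean = dense[mask].mean()    # statistics use valid cells only
print(dense)
print(mask)
print(valid_mean)
```

Keeping the gap as NaN with `mask == False` prevents the missing cell from being scored as if it were a real observation.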
| Objective | Alias | Interpretation |
|---|---|---|
| `ic` | `pearson` | Mean cross-sectional Pearson correlation with the target. |
| `rank_ic` | `spearman` | Mean cross-sectional rank correlation. |
| `icir` | - | Mean IC divided by IC volatility. |
| `rank_icir` | - | Mean RankIC divided by RankIC volatility. |
| `long_short_sharpe` | `sharpe` | Sharpe proxy for a top-minus-bottom factor portfolio. |
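The IC-family objectives can be written down in a few lines of NumPy. This is a sketch of the definitions in the table, not the library's `TensorFitness` code. The helper names are hypothetical, rank ties are broken by position, and "IC volatility" is assumed to mean the sample standard deviation of the per-timestamp ICs.

```python
import numpy as np

def ordinal_rank(values):
    """Ordinal ranks 0..n-1; ties are broken by position (a simplification)."""
    return values.argsort().argsort().astype(float)

def per_timestamp_ic(factor, target, use_rank=False):
    """Correlation between factor and target across securities, per timestamp."""
    ics = []
    for f_row, y_row in zip(factor, target):
        valid = np.isfinite(f_row) & np.isfinite(y_row)
        if valid.sum() < 3:
            continue                  # too few securities at this timestamp
        f, y = f_row[valid], y_row[valid]
        if use_rank:
            f, y = ordinal_rank(f), ordinal_rank(y)
        if f.std() == 0 or y.std() == 0:
            continue                  # constant cross-section: undefined
        ics.append(np.corrcoef(f, y)[0, 1])
    return np.array(ics)

def summarize(ics):
    """Return (mean IC, mean IC divided by IC volatility)."""
    mean_ic = float(ics.mean())
    return mean_ic, mean_ic / float(ics.std(ddof=1))

rng = np.random.default_rng(0)
target = rng.normal(size=(50, 20))                 # [T, N] future returns
factor = target + rng.normal(size=(50, 20))        # noisy but informative factor
target[-1] = np.nan                                # last row has no target

ics = per_timestamp_ic(factor, target)
rank_ics = per_timestamp_ic(factor, target, use_rank=True)
ic, icir = summarize(ics)
rank_ic, rank_icir = summarize(rank_ics)
print(ic, icir, rank_ic, rank_icir)
```

ICIR rewards factors whose IC is stable over time, not just high on average, which is why it is often preferred as a search objective.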
The GPU evaluator can normalize each timestamp's cross-section before scoring. Expression errors and non-finite scores are converted to neutral scores so one invalid tree does not stop an evolutionary run.
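One way to picture this behaviour is a wrapper that normalizes each cross-section and then guards the score. The sketch below is an assumption about the general pattern, not the library's evaluator: `NEUTRAL_SCORE`, `cs_normalize`, and `safe_score` are hypothetical names, and the neutral value of `0.0` is a guess.

```python
import math
import numpy as np

NEUTRAL_SCORE = 0.0   # assumed neutral value; the library's choice may differ

def cs_normalize(factor):
    """Z-score each timestamp's cross-section, ignoring NaNs."""
    mean = np.nanmean(factor, axis=1, keepdims=True)
    std = np.nanstd(factor, axis=1, keepdims=True)
    std = np.where(std > 0, std, np.nan)   # constant rows become NaN, not inf
    return (factor - mean) / std

def safe_score(program, objective, data, target):
    """Score a candidate; exceptions and non-finite scores become NEUTRAL_SCORE."""
    try:
        score = float(objective(cs_normalize(program(data)), target))
    except Exception:
        return NEUTRAL_SCORE
    return score if math.isfinite(score) else NEUTRAL_SCORE

def pooled_corr(factor, target):
    """Toy objective: correlation over all cells."""
    return np.corrcoef(factor.ravel(), target.ravel())[0, 1]

def identity(x):
    return x

def broken(x):
    return x["missing"]          # raises IndexError on a NumPy array

def constant(x):
    return np.ones_like(x)       # zero variance in every cross-section

data = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 9.0]])
target = np.array([[0.1, 0.2, 0.3], [0.0, 0.1, 0.5]])

good_score = safe_score(identity, pooled_corr, data, target)
broken_score = safe_score(broken, pooled_corr, data, target)
constant_score = safe_score(constant, pooled_corr, data, target)
print(good_score, broken_score, constant_score)
```

Because failures map to a neutral score instead of raising, an evolutionary run can evaluate thousands of random trees without a single malformed one aborting the generation.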
The recommended GPU panel preset contains 49 operators. Torch backends are
registered when QuantGplearn.torch_functions is imported.
add sub mul div sqrt log abs neg inv max min sig
Division, logarithm, square root, inverse, tangent, and ratio-style indicators use protected or clipped implementations to keep expression trees numerically closed.
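For readers unfamiliar with protected operators, the sketch below shows the gplearn-style convention. The `1e-3` threshold and the fallback values are assumptions taken from that convention and may not match QuantGplearn's exact constants.

```python
import numpy as np

EPS = 1e-3  # assumed threshold, following the gplearn convention

def protected_div(a, b):
    """a / b where |b| is large enough, otherwise 1.0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(b) > EPS, np.divide(a, b), 1.0)

def protected_log(x):
    """log(|x|) where |x| is large enough, otherwise 0.0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x) > EPS, np.log(np.abs(x)), 0.0)

def protected_sqrt(x):
    """sqrt(|x|), defined for negative inputs as well."""
    return np.sqrt(np.abs(x))

quotient = protected_div(np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, -4.0]))
logs = protected_log(np.array([np.e, 0.0, -1.0]))
roots = protected_sqrt(np.array([-4.0, 9.0]))
print(quotient, logs, roots)
```

Every output is finite for finite inputs, so any randomly generated tree built from these primitives can be evaluated without raising or producing infinities.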
| Family | Operators |
|---|---|
| Lag and change | ts_shift, ts_delta, ts_mom |
| Rolling statistics | ts_min, ts_max, ts_sum, ts_mean, ts_std, ts_zscore |
| Rolling position | ts_argmax, ts_argmin, ts_rank, ts_freq |
| Relationships | ts_corr, ts_one_ols_k, ts_one_ols_resid, ts_hedge |
| Distribution shape | ts_skew, ts_kurt |
| Price action | ts_cdlbodym, ts_bar_bs, ts_xs_ratio, ts_bband |
| Technical indicators | ts_adx, ts_aroon, ts_bopr, ts_cmo, ts_ema, ts_macd, ts_rsi, ts_stochf, ts_atr |
cs_rank cs_zscore cs_demean cs_scale cs_winsorize
from QuantGplearn import functions
from QuantGplearn.torch_functions import GPU_SAFE_PANEL_FUNCTIONS
functions.all_function # raw + time-series operators
functions.section_function # cross-section-only operators
functions.panel_function # raw + time-series + cross-section
GPU_SAFE_PANEL_FUNCTIONS # recommended 49-operator Torch preset
Trigonometric Torch backends for `sin`, `cos`, and `tan` are also registered and can be selected explicitly, although they are not part of the recommended default preset.
| Parameter | Purpose |
|---|---|
| `population_size` | Programs evaluated in each generation. |
| `generations` | Maximum evolutionary generations. |
| `tournament_size` | Selection pressure during parent choice. |
| `init_depth` / `init_method` | Shape and diversity of the initial trees. |
| `p_crossover` | Probability of exchanging subtrees between parents. |
| `p_subtree_mutation` | Probability of replacing a subtree. |
| `p_hoist_mutation` | Probability of simplifying a tree by hoisting a subtree. |
| `p_point_mutation` | Probability of replacing compatible functions or terminals. |
| `max_length` | Evolution guard: overlength parents are reproduced instead of being mutated again. |
| `parsimony_coefficient` | Penalty for complexity; `"auto"` is supported. |
| `p_point_replace` | Replacement probability inside point mutation. |
| `warm_start` | Continue evolution from an existing run. |
| `low_memory` | Discard older populations to reduce memory usage. |
| `cache_scores` / `cache_factors` | Reuse repeated expression evaluations. |
| `tolerable_corr` | Maximum accepted mutual factor correlation during selection. |
The probabilities for crossover and the three mutation methods must leave room for reproduction; their cumulative sum cannot exceed one.
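The constraint is easiest to see as cumulative thresholds on a single uniform draw, with whatever is left over going to reproduction. The helpers below are a hypothetical sketch of that scheme, not the estimator's validation code.

```python
METHODS = ["crossover", "subtree_mutation", "hoist_mutation", "point_mutation"]

def method_thresholds(p_crossover, p_subtree_mutation, p_hoist_mutation, p_point_mutation):
    """Return cumulative thresholds and the leftover reproduction probability."""
    thresholds, total = [], 0.0
    for p in (p_crossover, p_subtree_mutation, p_hoist_mutation, p_point_mutation):
        if p < 0:
            raise ValueError("probabilities must be non-negative")
        total += p
        thresholds.append(total)
    if total > 1.0 + 1e-12:   # small tolerance for floating-point rounding
        raise ValueError("operation probabilities must sum to at most 1")
    return thresholds, max(1.0 - total, 0.0)

def pick_method(u, thresholds):
    """Map a uniform draw u in [0, 1) to a genetic operation."""
    for name, threshold in zip(METHODS, thresholds):
        if u < threshold:
            return name
    return "reproduction"

thresholds, p_reproduction = method_thresholds(0.7, 0.1, 0.05, 0.1)
choices = [pick_method(u, thresholds) for u in (0.50, 0.75, 0.97)]
print(thresholds, p_reproduction, choices)
```

With these settings roughly 5% of offspring are unmodified copies of their parent. Setting the four probabilities so they sum to exactly one would leave no room for reproduction.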
get_factor_expressions() returns a compact research table:
rank score expression length depth
transform_panel() supports:
model.transform_panel(output="tensor") # list of [T, N] tensors
model.transform_panel(output="numpy") # flattened [samples, factors]
model.transform_panel(output="dataframe") # MultiIndex panel DataFrame
The estimator also records generation diagnostics in `run_details_`, including average expression length, average fitness, best fitness, and generation time.
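The relationship between the tensor and flattened outputs of `transform_panel()` can be sketched with NumPy. The row order shown here (time-major, then security) is an assumption for illustration; check the library's output before relying on it.

```python
import numpy as np

T, N = 3, 2
factor_a = np.arange(T * N, dtype=float).reshape(T, N)   # one [T, N] factor
factor_b = -factor_a                                      # a second [T, N] factor
factors = [factor_a, factor_b]                            # "list of [T, N]" layout

stacked = np.stack(factors, axis=-1)                      # [T, N, K]
flat = stacked.reshape(T * N, len(factors))               # [samples, factors]

# Under the assumed order, sample index = t * N + n.
t, n = 1, 1
row = flat[t * N + n]
print(flat.shape, row)
```

The flattened layout is convenient for feeding factors into a downstream tabular model, while the per-factor `[T, N]` layout keeps the time and security axes available for further panel operations.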
For workflows that build a factor library incrementally:
from QuantGplearn.alpha_pool import GPAlphaPool, TensorFactorCalculator
calculator = TensorFactorCalculator(
    model.tensor_data_,
    normalize=True,
    cache_factors=True,
)
pool = GPAlphaPool(
    capacity=20,
    calculator=calculator,
    ic_lower_bound=0.01,
    mutual_ic_upper_bound=0.7,
)
pool.update(list(model))
records = pool.to_records()
The pool scores single-factor IC, rejects factors that are too correlated with existing members, and keeps the strongest expressions up to its capacity.
Use make_function to add a numerically closed vector operator:
import numpy as np
from QuantGplearn.functions import make_function
def signed_square(x):
    x = np.asarray(x, dtype=float)
    return np.sign(x) * x**2
signed_square_fn = make_function(
    function=signed_square,
    name="signed_square",
    arity=1,
)
A custom `_Function` can participate in GPU evolution after attaching a Torch callable with matching semantics:
import torch
def torch_signed_square(x):
    return torch.sign(x) * x**2
signed_square_fn.torch_function = torch_signed_square
model = GpuSymbolicTransformer(
    function_set=["add", "sub", signed_square_fn],
    feature_names=feature_names,
)
Custom CPU fitness functions can be created with `QuantGplearn.fitness.make_fitness`. The GPU path uses `TensorFitness` objectives from `QuantGplearn.tensor_fitness`.
| | NumPy/Pandas path | Torch path |
|---|---|---|
| Main API | `SymbolicRegressor`, `SymbolicClassifier`, `SymbolicTransformer` | `GpuSymbolicTransformer` |
| Typical data | arrays or panel DataFrames | panel DataFrames / dense tensors |
| Parallel model | joblib/pathos CPU workers | one Torch device per estimator |
| Time-series operators | grouped by security | tensor time dimension |
| Cross-section operators | grouped by timestamp | tensor security dimension |
| Best use | compatibility, custom CPU research, smaller workloads | dense panel evaluation and GPU-oriented factor mining |
- The GPU estimator evaluates a population serially on one Torch device; it is not yet a distributed multi-GPU trainer.
- `max_samples` and `n_jobs` are retained in the GPU estimator signature for API compatibility but are not yet applied by its evaluator.
- Dense `[T, N, F]` panels can be memory intensive when the source panel is highly sparse or unbalanced.
- Rolling operators based on `torch.unfold` may create large intermediate tensors for long histories and windows.
- GPU rank tie handling and a few NaN edge cases can differ slightly from Pandas semantics.
- The GPU path currently focuses on numeric factors and does not support the legacy category-return function system.
- `long_short_sharpe` is a fast training proxy, not a replacement for a realistic portfolio backtest.
These constraints are documented deliberately so research results are easier to interpret and future contributions have clear targets.
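To get a feel for the rolling-window memory limitation, a back-of-the-envelope estimate helps. The function below is a hypothetical helper that assumes the window tensor is fully materialized as `[T - w + 1, N, w]` in float32. Whether a given operator actually materializes it depends on the implementation, so treat the result as a rough upper bound.

```python
def unfolded_bytes(n_times, n_securities, window, bytes_per_value=4):
    """Size of a fully materialized [T - w + 1, N, w] rolling-window tensor."""
    n_windows = max(n_times - window + 1, 0)
    return n_windows * n_securities * window * bytes_per_value

n_times, n_securities, window = 20_000, 500, 336

base_bytes = n_times * n_securities * 4              # one float32 [T, N] factor
rolled_bytes = unfolded_bytes(n_times, n_securities, window)

print(f"factor:   {base_bytes / 1e6:.0f} MB")
print(f"unfolded: {rolled_bytes / 1e9:.1f} GB")
```

A 40 MB factor can expand to roughly 13 GB once every 336-step window is laid out explicitly, which is why long windows on long panels are the first place to look when a run exhausts device memory.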
QuantGplearn/
├── genetic.py # CPU symbolic estimators and genetic evolution
├── _program.py # expression-tree representation and execution
├── functions.py # NumPy/Pandas primitives and operator metadata
├── gpu_transformer.py # GPU panel factor-mining estimator
├── tensor_data.py # DataFrame <-> [T, N, F] tensor conversion
├── torch_functions.py # Torch implementations of factor operators
├── tensor_fitness.py # IC, RankIC, ICIR, and Sharpe objectives
├── evaluator.py # cleaning, normalization, evaluation, and caches
└── alpha_pool.py # IC and mutual-correlation factor pool
The public repository contains the reusable framework, its documentation, and framework-level tests. Proprietary datasets, mined strategies, and private research workflows are intentionally outside the repository.
python -m pip install pytest
pytest -q
The GPU transformer smoke test runs on CPU tensors when CUDA is unavailable.
- Chunked and batched expression evaluation.
- Multi-GPU evaluator workers.
- Lower-memory rolling kernels for long panels.
- More exact CPU/GPU parity for rank ties and missing-value edges.
- Weight optimization for multi-factor alpha pools.
- Expanded tests and reproducible performance benchmarks.
Contributions that improve numerical correctness, execution efficiency, or research reproducibility are welcome.
QuantGplearn is inspired by and adapted from:
QuantGplearn is released under the MIT License.
