Neural Network-enhanced Local Search for the Parallel Machine Scheduling Problem with Maintenance
This project presents a Deep Reinforcement Learning (DRL) approach for the Parallel Machine Scheduling Problem with Maintenance (PMSP-ML).
It combines the efficiency of local search with the generalization power of Graph Neural Networks (GNNs): an autonomous agent learns complex search heuristics through two complementary training strategies, online fine-tuning with PPO and AlphaZero-style self-play with MCTS.
- 2 identical parallel machines with mandatory maintenance intervals
- n jobs, each with processing time p_j, weight w_j, and rejection penalty u_j
- Lexicographic objective (see the comparison sketch below):
  1. 🥇 Minimize f1 = Σ u_j (rejection cost)
  2. 🥈 Minimize f2 = Σ w_j·C_j (total weighted completion time)
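As an illustration of the lexicographic order, here is a minimal sketch with hypothetical types (not code from the repository): f1 always dominates, and f2 only breaks ties.

```cpp
#include <utility>

// Hypothetical illustration: compare two solutions lexicographically.
// f1 (total rejection cost) dominates; f2 (weighted flowtime) breaks ties.
struct Objectives {
    long long f1;  // Σ u_j over rejected jobs
    long long f2;  // Σ w_j * C_j over accepted jobs
};

// Returns true if `a` is strictly better than `b`.
bool lexBetter(const Objectives& a, const Objectives& b) {
    return std::pair(a.f1, a.f2) < std::pair(b.f1, b.f2);
}
```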
```
┌────────────────────────────────────────────────────────────────────────┐
│                           TRAINING PIPELINE                            │
│                                                                        │
│   TSMN Expert         Behavior          Online            AlphaZero    │
│   Demonstrations ──►  Cloning     ──►   PPO         or    Self-Play    │
│                       (Imitation)       (instance)        (MCTS + GNN) │
│                                                                        │
│                  ┌────────────────────────────────────┐                │
│                  │             GNN Policy             │                │
│                  │   GAT layers · dual value heads    │                │
│                  │   phase selector · action heads    │                │
│                  └────────────────┬───────────────────┘                │
│                                   │                                    │
│                         Inference / Search                             │
│                      NeuroGuidedSearch (NGS)                           │
└────────────────────────────────────────────────────────────────────────┘
```
The solution is represented as a heterogeneous graph with 3 distinct node types:
```
┌─────────────────────────────────────────────────┐
│                   GRAPH STATE                   │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌─────┐  ┌─────┐  ┌─────┐       V_J (Jobs)     │
│  │ J_0 │──│ J_1 │──│ J_2 │...    [p,u,w,wspt,   │
│  └──┬──┘  └──┬──┘  └──┬──┘        status]       │
│     │        │        │                         │
│     ▼        ▼        ▼                         │
│  ┌─────────────────────┐         V_B (Blocks)   │
│  │  Block 0   Block 1  │         [time,slack,   │
│  └──────────┬──────────┘          machine]      │
│             │                                   │
│             ▼                                   │
│        ┌─────────┐               V_M (Machines) │
│        │ Machine │               [T_i, δ_i]     │
│        └─────────┘                              │
└─────────────────────────────────────────────────┘
```
Each node type has normalized features extracted from the current solution:
```mermaid
classDiagram
    class JobNode_VJ["Job Node · V_J (5 features)"] {
        p_j float : processing time / max(p)
        u_j float : rejection penalty / max(u)
        w_j float : weight / max(w)
        wspt float : WSPT ratio / max(wspt)
        status float : 1.0 accepted · 0.0 rejected
    }
    class BlockNode_VB["Block Node · V_B (3 features)"] {
        total_time float : sum of p_j in block / T_max
        slack float : free time before maintenance / T_max
        machine_id float : 0.0 or 1.0
    }
    class MachineNode_VM["Machine Node · V_M (2 features)"] {
        T_i float : maintenance time limit / max(T)
        d_i float : maintenance duration / max(δ)
    }
    JobNode_VJ --> JobNode_VJ : job↔job (sequence adjacency)
    JobNode_VJ --> BlockNode_VB : job→block (membership)
    BlockNode_VB --> MachineNode_VM : block→machine (allocation)
```
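A minimal sketch of how the V_J features above can be computed, assuming the WSPT ratio is w_j / p_j and every feature is divided by its instance-wide maximum (hypothetical helper, not the project's `GraphState.cpp`):

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Sketch of V_J feature extraction: each feature scaled to [0, 1]
// by the instance-wide maximum, as described in the diagram above.
struct Job { double p, u, w; bool accepted; };

std::vector<std::array<double, 5>> jobFeatures(const std::vector<Job>& jobs) {
    double pMax = 0, uMax = 0, wMax = 0, wsptMax = 0;
    for (const auto& j : jobs) {
        pMax = std::max(pMax, j.p);
        uMax = std::max(uMax, j.u);
        wMax = std::max(wMax, j.w);
        wsptMax = std::max(wsptMax, j.w / j.p);  // assumed WSPT ratio w_j / p_j
    }
    std::vector<std::array<double, 5>> feats;
    feats.reserve(jobs.size());
    for (const auto& j : jobs)
        feats.push_back({j.p / pMax, j.u / uMax, j.w / wMax,
                         (j.w / j.p) / wsptMax, j.accepted ? 1.0 : 0.0});
    return feats;
}
```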
The GNN produces the following outputs for action selection:
```mermaid
classDiagram
    class PolicyOutput {
        phase_probs Tensor_3 : P(Phase 1 · 2 · 3)
        phase1_accepted_scores Tensor_n_jobs : scores to swap out accepted
        phase1_rejected_scores Tensor_n_jobs : scores to swap in rejected
        phase2_block_scores Tensor_nB_x_2 : scores for block swap start/end
        phase3_job_scores Tensor_n_jobs : scores for job relocation
        value_f1 Tensor_1 : predicted Δf1 (dual head)
        value_f2 Tensor_1 : predicted Δf2 (dual head)
    }
```
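How these outputs map to a concrete move can be sketched as follows (hypothetical greedy decoding, with tensors shown as `std::vector<float>` and Phase 2 omitted for brevity; the actual selection also goes through masking and sampling, described later):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: greedy decoding of an action from PolicyOutput.
int argmaxIdx(const std::vector<float>& v) {
    return static_cast<int>(std::max_element(v.begin(), v.end()) - v.begin());
}

struct Action { int phase; int swapOut = -1; int swapIn = -1; int job = -1; };

Action decode(const std::vector<float>& phaseProbs,
              const std::vector<float>& phase1Accepted,
              const std::vector<float>& phase1Rejected,
              const std::vector<float>& phase3JobScores) {
    Action a;
    a.phase = argmaxIdx(phaseProbs) + 1;  // phases are numbered 1..3
    if (a.phase == 1) {
        a.swapOut = argmaxIdx(phase1Accepted);  // accepted job to swap out
        a.swapIn  = argmaxIdx(phase1Rejected);  // rejected job to swap in
    } else if (a.phase == 3) {
        a.job = argmaxIdx(phase3JobScores);     // job to relocate
    }
    return a;
}
```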
The system's "intelligence" resides in the GNNPolicy, a GAT-based network that processes the graph state to make strategic decisions.
```
┌─────────────────────────────────────────────────────────────────┐
│                           GNN POLICY                            │
├─────────────────────────────────────────────────────────────────┤
│  Input: GraphState                                              │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────────┐                  │
│  │         Node Encoders (by type)           │                  │
│  │   Job:     Linear(5 → hidden)             │                  │
│  │   Block:   Linear(3 → hidden)             │                  │
│  │   Machine: Linear(2 → hidden)             │                  │
│  └───────────────────────────────────────────┘                  │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────────┐                  │
│  │    3x GATConvLayer (4 heads, highway)     │                  │
│  │    + LayerNorm + gated residual           │                  │
│  └───────────────────────────────────────────┘                  │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────────┐                  │
│  │        Global Pooling (mean+max)          │                  │
│  └───────────────────────────────────────────┘                  │
│      │                                                          │
│      ├──► Phase Selector Head [3]   → Neighborhood Strategy     │
│      ├──► Phase 1 Heads [jobs]      → SWAP accepted↔rejected    │
│      ├──► Phase 2 Head [blocks]     → SWAP blocks               │
│      ├──► Phase 3 Head [jobs]       → RELOCATE jobs             │
│      ├──► Value Head f1 [1]         → Predicted Δf1 (MCTS)      │
│      └──► Value Head f2 [1]         → Predicted Δf2 (MCTS)      │
└─────────────────────────────────────────────────────────────────┘
```
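A minimal LibTorch sketch of the per-type encoders and the mean+max pooling, assuming hidden = 64; the custom GATConvLayer stack in between is omitted, and this is an illustration rather than the project's `GNNPolicy.cpp`:

```cpp
#include <torch/torch.h>

// Sketch: one Linear encoder per node type, then mean+max global pooling.
struct EncoderSketch : torch::nn::Module {
    torch::nn::Linear jobEnc{nullptr}, blockEnc{nullptr}, machineEnc{nullptr};

    explicit EncoderSketch(int64_t hidden = 64) {
        jobEnc     = register_module("jobEnc",     torch::nn::Linear(5, hidden));
        blockEnc   = register_module("blockEnc",   torch::nn::Linear(3, hidden));
        machineEnc = register_module("machineEnc", torch::nn::Linear(2, hidden));
    }

    // x_*: [num_nodes, features] per node type. Returns a global embedding.
    torch::Tensor forward(torch::Tensor xJ, torch::Tensor xB, torch::Tensor xM) {
        auto h = torch::cat({torch::relu(jobEnc->forward(xJ)),
                             torch::relu(blockEnc->forward(xB)),
                             torch::relu(machineEnc->forward(xM))}, /*dim=*/0);
        // ... 3x GAT message passing over the heterogeneous edges here ...
        auto meanPool = h.mean(/*dim=*/0);
        auto maxPool  = std::get<0>(h.max(/*dim=*/0));
        return torch::cat({meanPool, maxPool});  // [2 * hidden]
    }
};
```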
The NeuroGuidedSearch component allows the agent to adapt in real time during inference.
- Imitation Learning (warm-start): the agent is trained via Behavior Cloning to imitate the TSMN expert.
- Online Adaptation (PPO): while solving an instance, the agent uses Proximal Policy Optimization to update its weights online, specializing the policy to that instance; a sketch of the standard PPO clipped loss follows.
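For reference, the standard PPO clipped surrogate loss in LibTorch, with `clipEps` and `entropyCoef` matching the NGSConfig defaults listed later (a generic formulation, not the project's exact code):

```cpp
#include <torch/torch.h>

// Sketch of the PPO clipped surrogate loss with an entropy bonus.
torch::Tensor ppoLoss(torch::Tensor logProb, torch::Tensor logProbOld,
                      torch::Tensor advantage, torch::Tensor entropy,
                      double clipEps = 0.2, double entropyCoef = 0.01) {
    auto ratio = (logProb - logProbOld).exp();  // π/π_old
    auto unclipped = ratio * advantage;
    auto clipped = torch::clamp(ratio, 1.0 - clipEps, 1.0 + clipEps) * advantage;
    // Maximize the clipped surrogate and the entropy bonus (hence the minus).
    return -torch::min(unclipped, clipped).mean() - entropyCoef * entropy.mean();
}
```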
The AlphaZeroTrainer implements a full self-play training loop:
- Self-play: MCTS with PUCT selection and neural priors generates episodes (a PUCT sketch follows the loop below). Each step records `(state, π, z)`, where π is the MCTS visit distribution and z the lexicographic outcome.
- Replay Buffer: examples are stored in a circular replay buffer (capacity 100k by default).
- Network update: at each iteration, the GNN is trained via cross-entropy against π (policy head) and MSE against z_f1/z_f2 (dual value heads).
- Lexicographic MCTS: the `best_f1` threshold switches the value signal: when f1 is at its best, MCTS optimizes f2.
AlphaZero Training Loop:
```
for each iteration:
  ├─ Self-play N games → collect (s, π, z) examples
  ├─ Add to ReplayBuffer
  ├─ Sample batch → forward pass → policy loss + value loss → Adam
  └─ Every eval_interval: evaluate on val instances + save checkpoint
```
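The PUCT rule used during self-play can be sketched as follows (standard AlphaZero child selection; `cPuct = 1.5` matches the MCTSConfig shown later, and the struct is hypothetical):

```cpp
#include <cmath>
#include <vector>

// Sketch of PUCT child selection with neural priors.
struct Child { double prior; int visits; double valueSum; };

int selectChild(const std::vector<Child>& children, int parentVisits,
                double cPuct = 1.5) {
    int best = 0;
    double bestScore = -1e18;
    for (int i = 0; i < static_cast<int>(children.size()); ++i) {
        const auto& c = children[i];
        double q = c.visits > 0 ? c.valueSum / c.visits : 0.0;  // mean value
        double u = cPuct * c.prior *
                   std::sqrt(static_cast<double>(parentVisits)) / (1 + c.visits);
        if (q + u > bestScore) { bestScore = q + u; best = i; }
    }
    return best;
}
```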
To prevent deterministic action oscillation during inference, the system uses two mechanisms (see the sketch after this list):
- Epsilon-Greedy: with probability `inference_epsilon` (default: 10%), the agent samples an action at random instead of taking the argmax.
- Tabu List: recent actions are stored in a tabu list (size `tabu_size`, default: 10) to prevent immediate repetition of the same move.
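A sketch of how the two mechanisms combine, with hypothetical helper names and the NGSConfig defaults (`inference_epsilon = 0.1`, `tabu_size = 10`); it assumes at least one non-tabu action exists:

```cpp
#include <algorithm>
#include <deque>
#include <random>
#include <vector>

// Epsilon-greedy selection restricted to non-tabu actions.
int pickAction(const std::vector<float>& scores, std::deque<int>& tabu,
               std::mt19937& rng, double epsilon = 0.1, size_t tabuSize = 10) {
    std::vector<int> candidates;
    for (int a = 0; a < static_cast<int>(scores.size()); ++a)
        if (std::find(tabu.begin(), tabu.end(), a) == tabu.end())
            candidates.push_back(a);  // skip recently used actions

    int action;
    if (std::uniform_real_distribution<>(0.0, 1.0)(rng) < epsilon) {
        // Explore: sample uniformly among non-tabu actions.
        action = candidates[std::uniform_int_distribution<size_t>(
            0, candidates.size() - 1)(rng)];
    } else {
        // Exploit: argmax over non-tabu actions.
        action = *std::max_element(candidates.begin(), candidates.end(),
            [&](int a, int b) { return scores[a] < scores[b]; });
    }
    tabu.push_back(action);
    if (tabu.size() > tabuSize) tabu.pop_front();  // bounded tabu memory
    return action;
}
```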
- C++17 compiler (GCC 9+ or Clang 10+)
- LibTorch 2.0+ (CPU or CUDA)
- Meson build system
- Python 3.8+ with matplotlib, pandas, networkx
```bash
# 1. Activate conda environment
source ~/miniconda3/etc/profile.d/conda.sh && conda activate pmsp-gns

# 2. Configure the project
meson setup build

# 3. Compile
meson compile -C build
```

```mermaid
mindmap
  root((build/src/))
    solver/
      pmspml
        Classical TSMN solver
    neural/
      data_generator
        Expert dataset generator
      train_imitation
        Imitation learning trainer
      test_ngs
        Neural solver + online PPO
      train_alphazero
        AlphaZero self-play training
      train_alns
        Neural ALNS training
      export_graph
        Graph structure visualizer
      benchmark_all
        Ablation benchmark CSV
```
```bash
./build/src/solver/pmspml instances/Ins1.txt --run-tsmn --time-limit=5
```
Output:
- Initial greedy solution
- Improvement via Phases 1, 2, and 3 (Tabu Search)
- Final f1 and f2 values
```bash
./build/src/neural/data_generator
```
What happens:
- For each instance, runs the full TSMN
- `DataCollector` captures every expert move
- Saves transitions: (state, phase, action, f1_before, f2_before, f1_after, f2_after)
```bash
./build/src/neural/train_imitation
```
Losses (sketched below):
- Selector Loss: CrossEntropy for phase selection
- Phase Losses: CrossEntropy for specific actions
- Value Loss: MSE for value estimation
Checkpoints saved in: checkpoints/
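The three losses can be sketched with standard LibTorch calls (tensor shapes are assumptions for illustration, not the project's exact `TrainImitation.cpp`):

```cpp
#include <torch/torch.h>

// Sketch of the imitation-learning objective: selector CE + phase CE + value MSE.
torch::Tensor imitationLoss(torch::Tensor phaseLogits,  // [B, 3]
                            torch::Tensor expertPhase,  // [B] int64
                            torch::Tensor actionLogits, // [B, n_actions]
                            torch::Tensor expertAction, // [B] int64
                            torch::Tensor valuePred,    // [B, 1]
                            torch::Tensor valueTarget)  // [B, 1]
{
    namespace F = torch::nn::functional;
    auto selectorLoss = F::cross_entropy(phaseLogits, expertPhase);
    auto phaseLoss    = F::cross_entropy(actionLogits, expertAction);
    auto valueLoss    = torch::mse_loss(valuePred, valueTarget);
    return selectorLoss + phaseLoss + valueLoss;
}
```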
```bash
# Single-run mode
./build/src/neural/test_ngs instances/Ins1.txt

# With pre-trained checkpoint
./build/src/neural/test_ngs instances/Ins1.txt checkpoints/best_model.pt

# Online mode for 10 minutes
./build/src/neural/test_ngs instances/Ins1.txt --online --time 10

# With initial checkpoint
./build/src/neural/test_ngs instances/Ins1.txt checkpoint.pt --online --time 30

# With action debug
./build/src/neural/test_ngs instances/Ins1.txt --online --time 5 --action_debug
```
Options:
```mermaid
flowchart LR
    EXE["test_ngs\ninstances/Ins1.txt"]
    EXE --> A["--online\nEnable continuous\ntraining mode"]
    EXE --> B["--time X\nRun for X minutes\n(default: 5)"]
    EXE --> C["--verbose\nShow f1/f2\nimprovements"]
    EXE --> D["--action_debug\nPrint details\nof each action"]
```
What happens:
- Continuous loop: `solve()` → `updatePolicy()` → repeat
- PPO fine-tuning specializes the model to the current instance
- The final checkpoint is saved to `ngs_checkpoint_final.pt`
- Metrics are saved to `ngs_metrics.csv`
Output Example (verbose mode):
```
--- Episode 1 (elapsed: 0.0 min, remaining: 10.0 min) ---
[Step 12] ✅ Phase3: f2 2672505 -> 2210110 (Δ=-462395)
[Step 48] ✅ Phase1: f1 32340 -> 30041 (Δ=-2299)
Steps: 100 | Improving: 2
f1: 32340.00 -> 30041.00
f2: 2672505.00 -> 2210110.00
```
Trains the GNN via MCTS-guided self-play across all training instances, without requiring expert demonstrations.
```bash
# Smoke test (verify pipeline runs)
./build/src/neural/train_alphazero \
  --instances-dir instances/ \
  --val-dir instances/ \
  --iterations 2 --games 1 --simulations 5 --steps 2 --batch 4 --epochs 1

# From imitation checkpoint (recommended)
./build/src/neural/train_alphazero \
  --instances-dir instances_train/ \
  --val-dir instances_val/ \
  --checkpoint checkpoints/imitation_best.pt \
  --checkpoint-dir checkpoints/ \
  --iterations 200 \
  --games 20 \
  --simulations 200 \
  --steps 100 \
  --lr 5e-5 \
  --batch 128 \
  --epochs 10 \
  --eval-interval 10 \
  --temperature-steps 30
```
Options:
```mermaid
classDiagram
    class AlphaZeroConfig {
        --instances-dir string : instances_train/
        --val-dir string : instances_val/
        --checkpoint string : (optional) load checkpoint
        --checkpoint-dir string : ./checkpoints
        --hidden-dim int : 64
        --iterations int : 100
        --games int : 10
        --simulations int : 100
        --steps int : 50
        --lr double : 1e-4
        --batch int : 64
        --epochs int : 5
        --eval-interval int : 5
        --temperature-steps int : 30
    }
```
Recommended progression:
```mermaid
flowchart LR
    D["Debug\niters=5 · sims=10 · steps=5\nVerify pipeline · z ≠ 0"]
    W["Warm-up\niters=50 · sims=50 · steps=20\nCheck loss trends"]
    T["Training\niters=200 · sims=200 · steps=50\nFrom imitation checkpoint"]
    P["Production\niters=500+ · sims=400-800 · steps=100\nGPU recommended"]
    D --> W --> T --> P
```
What to monitor:
- `buffer=N` grows each iteration
- `policy_loss` decreases over time (if it oscillates, reduce `--lr`)
- `value_loss` decreases (if not, the z signal may be degenerate; check the `z` distribution)
- `New global best f1=...` appears in early iterations
Using the trained checkpoint with NGS:
```bash
./build/src/neural/test_ngs instances/Ins1.txt checkpoints/az_final.pt --time 30 --verbose
```
Train the Neural ALNS and compare variants for the TCC ablation study.
```bash
# Train Neural ALNS from scratch
./build/src/neural/train_alns \
  --instances-dir instances_train/ --val-dir instances_val/ \
  --checkpoint-dir checkpoints_alns_scratch/

# Train with IL backbone (warm-start)
./build/src/neural/train_alns \
  --backbone checkpoints/best_policy.pt \
  --instances-dir instances_train/ --val-dir instances_val/ \
  --checkpoint-dir checkpoints_alns_il/

# Run full benchmark: TSMN vs all ALNS variants
./build/src/benchmark_all \
  --instances-dir instances/ \
  --tsmn-time 5 \
  --models checkpoints_alns_scratch/alns_final.pt,checkpoints_alns_il/alns_final.pt \
  --names scratch,IL \
  --output results/benchmark.csv

# Visualize ablation results
python scripts/analysis/plot_ngs_metrics.py results/benchmark.csv
```
```mermaid
mindmap
  root((src/solver/))
    TSMN.cpp
      Tabu Search Multi-Neighborhood
    Greedy.cpp
      WSPT constructive heuristic
    Neighborhood.cpp
      Phase 1 accepted↔rejected
      Phase 2 block swaps
      Phase 3 job relocations
    Solution.cpp
      BlockCache O(1) access
    Instance.cpp
      Problem instance parser
    BestKnown.h
      Gap analysis reference values
```
```mermaid
mindmap
  root((src/neural/))
    policy/
      GNNPolicy.cpp
        GAT + dual value heads
      GraphState.cpp
        Feature extraction
      GraphBatch.cpp
        Batched processing
    env/
      SchedulingEnv.cpp
        RL step + lexicographic reward
      ActionSampler.cpp
        Masking + sampling
      PolicyUtils.cpp
        Differentiable log-probs
    search/
      MCTS.cpp
        PUCT + neural priors
      NeuroGuidedSearch.cpp
        NGS + online PPO
      InferenceServer.cpp
        Batched GPU inference
      ReplayBuffer.cpp
        Circular buffer 100k
    training/
      DataCollector.cpp
        Expert demonstrations
      TrainImitation.cpp
        Behavior cloning
      AlphaZeroTrainer.cpp
        Self-play + network update
    alns/
      ALNSPolicy.cpp
        Operator selection network
      ALNSSearch.cpp
        ALNS loop + PPO update
      ALNSEnv.cpp
        2-stage reward environment
      ALNSOperators.cpp
        5 destroy + 4 repair ops
```
Generates system architecture visualizations.
```bash
python scripts/visualize.py graph.json
python scripts/visualize.py --all
```
Plots NGS training metrics.
```bash
python scripts/plot_ngs_metrics.py ngs_metrics.csv --output plots.png
```
Verifies demonstration dataset integrity.
```bash
python scripts/verify_dataset.py dataset/
```
Splits instances into train/val sets (80%/20%), stratified by category (job count n).
```bash
python scripts/split_instances.py
```
Dataset split:
```mermaid
pie title "120 Instances – Train / Val Split"
    "Train (96 · 80%)" : 96
    "Val (24 · 20%)" : 24
```
Distribution by job count:
```mermaid
xychart-beta
    title "Training Instances by Job Count (n)"
    x-axis ["n=20", "n=100", "n=105", "n=110", "n=200", "n=210", "n=220", "n=300", "n=315", "n=330"]
    y-axis "Train instances" 0 --> 25
    bar [24, 8, 8, 8, 8, 8, 8, 8, 8, 8]
```
Output directories:
- `instances_train/`: 80% of instances for training
- `instances_val/`: 20% of instances for validation
```
pmsp-ml/
├── 📁 instances/            # Benchmark instances (120 total, 20–330 jobs)
├── 📁 instances_train/      # 80% split for training
├── 📁 instances_val/        # 20% split for validation
├── 📁 dataset/              # Expert demonstration dataset (.pt tensors)
├── 📁 checkpoints/          # Trained model checkpoints
├── 📁 logs/                 # Training metrics, plots, CSVs
│
├── 📁 docs/
│   ├── 📁 architecture/     # GNN, ALNS, batching, phase docs
│   ├── 📁 planning/         # AlphaZero guide, TCC plan, roadmaps
│   └── 📁 analysis/         # Behavior analysis, CUDA optimization
│
├── 📁 src/
│   ├── benchmark_all.cpp    # Ablation benchmark entry point
│   ├── meson.build
│   ├── 📁 solver/           # Classical TSMN solver
│   │   ├── main.cpp
│   │   ├── Instance.{h,cpp}
│   │   ├── Solution.{h,cpp}
│   │   ├── TSMN.{h,cpp}
│   │   ├── Greedy.{h,cpp}
│   │   ├── Neighborhood.{h,cpp}
│   │   ├── Globals.{h,cpp}
│   │   └── BestKnown.h
│   ├── 📁 neural/           # Neural components
│   │   ├── 📁 policy/       # GNNPolicy, GraphState, GraphBatch
│   │   ├── 📁 env/          # SchedulingEnv, ActionSampler, PolicyUtils
│   │   ├── 📁 search/       # MCTS, NeuroGuidedSearch, InferenceServer
│   │   ├── 📁 training/     # DataCollector, TrainImitation, AlphaZeroTrainer
│   │   ├── 📁 alns/         # ALNSPolicy, ALNSSearch, ALNSEnv, ALNSOperators
│   │   └── [*_main.cpp]     # Executable entry points
│   └── 📁 tools/            # C++ diagnostic tools (export_model, check_cuda)
│
├── 📁 scripts/
│   ├── 📁 analysis/         # plot_*.py, diagnosis.py, visualize.py
│   ├── 📁 data/             # split_instances.py, verify_dataset.py, gnn_policy.py
│   └── [*.sh]               # Pipeline launchers
├── meson.build
└── README.md
```
```
R = W₁ · Δf₁ + W₂ · Δf₂

where:
  W₁  = 1,000,000  (weight for rejection cost: absolute priority)
  W₂  = 1.0        (weight for weighted flowtime)
  Δf₁ = f₁_before − f₁_after
  Δf₂ = f₂_before − f₂_after
```
```
z_f1 = (f1_init − f1_final) / max(f1_init, 1)  → trains value_f1 head
z_f2 = (f2_init − f2_final) / max(f2_init, 1)  → trains value_f2 head
z    = z_f1 if |Δf1| > ε, else z_f2 × 0.01     → lexicographic scalar
```
The dual value heads allow MCTS to switch its evaluation criterion: when f1 is at its global best, MCTS optimizes f2 instead.
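Putting the reward and the value targets together in one sketch (hypothetical helpers; `eps` is an assumed small threshold for "f1 moved"):

```cpp
#include <algorithm>
#include <cmath>

// Sketch of the step reward and the AlphaZero value targets defined above.
struct ValueTargets { double zF1, zF2, z; };

double reward(double f1Before, double f1After, double f2Before, double f2After) {
    const double W1 = 1e6, W2 = 1.0;  // f1 dominates f2 lexicographically
    return W1 * (f1Before - f1After) + W2 * (f2Before - f2After);
}

ValueTargets targets(double f1Init, double f1Final,
                     double f2Init, double f2Final, double eps = 1e-9) {
    ValueTargets t;
    t.zF1 = (f1Init - f1Final) / std::max(f1Init, 1.0);
    t.zF2 = (f2Init - f2Final) / std::max(f2Init, 1.0);
    // Lexicographic scalar: use z_f1 when f1 moved, otherwise a damped z_f2.
    t.z = std::abs(f1Init - f1Final) > eps ? t.zF1 : 0.01 * t.zF2;
    return t;
}
```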
Hyperparameters in NGSConfig:
```mermaid
classDiagram
    class NGSConfig {
        clip_epsilon : 0.2 → ratio π/π_old clipping
        gamma : 0.99 → discount factor
        gae_lambda : 0.95 → GAE λ parameter
        ppo_epochs : 4 → update epochs per rollout
        entropy_coef : 0.01 → entropy bonus weight
        normalize_rewards : true → reward normalization
        inference_epsilon : 0.1 → random sampling probability
        tabu_size : 10 → recent actions to avoid
    }
```
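For reference, the standard Generalized Advantage Estimation recursion that `gamma` and `gae_lambda` configure (a generic sketch, not the project's exact code):

```cpp
#include <vector>

// Sketch of GAE: values must hold V(s_0..s_T), i.e. rewards.size() + 1 entries.
std::vector<double> gae(const std::vector<double>& rewards,
                        const std::vector<double>& values,
                        double gamma = 0.99, double lambda = 0.95) {
    const int T = static_cast<int>(rewards.size());
    std::vector<double> adv(T);
    double running = 0.0;
    for (int t = T - 1; t >= 0; --t) {
        // TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        double delta = rewards[t] + gamma * values[t + 1] - values[t];
        running = delta + gamma * lambda * running;  // backward accumulation
        adv[t] = running;
    }
    return adv;
}
```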
```mermaid
classDiagram
    class MCTSConfig {
        num_simulations : 100 → simulations per step
        c_puct : 1.5 → PUCT exploration constant
        dirichlet_alpha : 0.3 → noise concentration at root
        dirichlet_weight : 0.25 → noise weight (0 during eval)
        max_children : 20 → max actions expanded per node
        reuse_tree : true → subtree reuse after selection
    }
```
ActionSampler applies masks so that only valid actions can be sampled (see the sketch after this list):
- Phase 1: only accepted ↔ rejected job swaps
- Phase 2: only blocks on the same machine
- Phase 3: only accepted jobs
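A sketch of the Phase 1 masks (hypothetical structures, not the project's `ActionSampler.cpp`): accepted jobs are candidates to leave the schedule, rejected jobs are candidates to enter it.

```cpp
#include <vector>

// Build Phase 1 validity masks from job acceptance status.
struct JobState { bool accepted; };

void phase1Masks(const std::vector<JobState>& jobs,
                 std::vector<bool>& swapOutMask,   // accepted → can leave
                 std::vector<bool>& swapInMask) {  // rejected → can enter
    swapOutMask.resize(jobs.size());
    swapInMask.resize(jobs.size());
    for (size_t j = 0; j < jobs.size(); ++j) {
        swapOutMask[j] = jobs[j].accepted;
        swapInMask[j]  = !jobs[j].accepted;
    }
    // Masked logits are typically set to -inf before softmax/sampling.
}
```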
- TSMN Original: Taillard, E. (1993). "Benchmarks for Basic Scheduling Problems"
- GNN for Scheduling: Zhang et al. (2020). "Learning to Dispatch"
- PPO: Schulman et al. (2017). "Proximal Policy Optimization Algorithms"
- AlphaZero: Silver et al. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play"
- GAT: Veličković et al. (2018). "Graph Attention Networks"
MIT License - see LICENSE
Happy Scheduling!