Neural Network-enhanced Local Search for the Parallel Machine Scheduling Problem with Maintenance
This project presents a Deep Reinforcement Learning (DRL) approach for the Parallel Machine Scheduling Problem with Maintenance (PMSP-ML).
It combines the efficiency of local search with the generalization power of Graph Neural Networks (GNNs): an autonomous agent learns complex search heuristics through two complementary training strategies, online fine-tuning with PPO and AlphaZero-style self-play with MCTS.
- 2 identical parallel machines with mandatory maintenance intervals
- n jobs, each with processing time p_j, weight w_j, and rejection penalty u_j
- Lexicographic objective (see the comparison sketch below):
  1. 🥇 Minimize f1 = Σ u_j (rejection cost)
  2. 🥈 Minimize f2 = Σ w_j·C_j (total weighted completion time)
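As an illustration of the lexicographic order, here is a minimal sketch with hypothetical types (not code from the repository): f1 always dominates, and f2 only breaks ties.

```cpp
#include <utility>

// Hypothetical illustration: compare two solutions lexicographically.
// f1 (total rejection cost) dominates; f2 (weighted flowtime) breaks ties.
struct Objectives {
    long long f1;  // Σ u_j over rejected jobs
    long long f2;  // Σ w_j * C_j over accepted jobs
};

// Returns true if `a` is strictly better than `b`.
bool lexBetter(const Objectives& a, const Objectives& b) {
    return std::pair(a.f1, a.f2) < std::pair(b.f1, b.f2);
}
```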
```
┌────────────────────────────────────────────────────────────────────────┐
│                           TRAINING PIPELINE                            │
│                                                                        │
│   TSMN Expert         Behavior          Online            AlphaZero    │
│   Demonstrations ──►  Cloning     ──►   PPO         or    Self-Play    │
│                       (Imitation)       (instance)        (MCTS + GNN) │
│                                                                        │
│                  ┌────────────────────────────────────┐                │
│                  │             GNN Policy             │                │
│                  │   GAT layers · dual value heads    │                │
│                  │   phase selector · action heads    │                │
│                  └────────────────┬───────────────────┘                │
│                                   │                                    │
│                         Inference / Search                             │
│                      NeuroGuidedSearch (NGS)                           │
└────────────────────────────────────────────────────────────────────────┘
```
The solution is represented as a heterogeneous graph with 3 distinct node types:
```
┌─────────────────────────────────────────────────┐
│                   GRAPH STATE                   │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌─────┐  ┌─────┐  ┌─────┐       V_J (Jobs)     │
│  │ J_0 │──│ J_1 │──│ J_2 │...    [p,u,w,wspt,   │
│  └──┬──┘  └──┬──┘  └──┬──┘        status]       │
│     │        │        │                         │
│     ▼        ▼        ▼                         │
│  ┌─────────────────────┐         V_B (Blocks)   │
│  │  Block 0   Block 1  │         [time,slack,   │
│  └──────────┬──────────┘          machine]      │
│             │                                   │
│             ▼                                   │
│        ┌─────────┐               V_M (Machines) │
│        │ Machine │               [T_i, δ_i]     │
│        └─────────┘                              │
└─────────────────────────────────────────────────┘
```
Each node type has normalized features extracted from the current solution:
```mermaid
classDiagram
    class JobNode_VJ["Job Node · V_J (5 features)"] {
        p_j float : processing time / max(p)
        u_j float : rejection penalty / max(u)
        w_j float : weight / max(w)
        wspt float : WSPT ratio / max(wspt)
        status float : 1.0 accepted · 0.0 rejected
    }
    class BlockNode_VB["Block Node · V_B (3 features)"] {
        total_time float : sum of p_j in block / T_max
        slack float : free time before maintenance / T_max
        machine_id float : 0.0 or 1.0
    }
    class MachineNode_VM["Machine Node · V_M (2 features)"] {
        T_i float : maintenance time limit / max(T)
        d_i float : maintenance duration / max(δ)
    }
    JobNode_VJ --> JobNode_VJ : job↔job (sequence adjacency)
    JobNode_VJ --> BlockNode_VB : job→block (membership)
    BlockNode_VB --> MachineNode_VM : block→machine (allocation)
```
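A minimal sketch of how the V_J features above can be computed, assuming the WSPT ratio is w_j / p_j and every feature is divided by its instance-wide maximum (hypothetical helper, not the project's `GraphState.cpp`):

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Sketch of V_J feature extraction: each feature scaled to [0, 1]
// by the instance-wide maximum, as described in the diagram above.
struct Job { double p, u, w; bool accepted; };

std::vector<std::array<double, 5>> jobFeatures(const std::vector<Job>& jobs) {
    double pMax = 0, uMax = 0, wMax = 0, wsptMax = 0;
    for (const auto& j : jobs) {
        pMax = std::max(pMax, j.p);
        uMax = std::max(uMax, j.u);
        wMax = std::max(wMax, j.w);
        wsptMax = std::max(wsptMax, j.w / j.p);  // assumed WSPT ratio w_j / p_j
    }
    std::vector<std::array<double, 5>> feats;
    feats.reserve(jobs.size());
    for (const auto& j : jobs)
        feats.push_back({j.p / pMax, j.u / uMax, j.w / wMax,
                         (j.w / j.p) / wsptMax, j.accepted ? 1.0 : 0.0});
    return feats;
}
```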
The GNN produces the following outputs for action selection:
```mermaid
classDiagram
    class PolicyOutput {
        phase_probs Tensor_3 : P(Phase 1 · 2 · 3)
        phase1_accepted_scores Tensor_n_jobs : scores to swap out accepted
        phase1_rejected_scores Tensor_n_jobs : scores to swap in rejected
        phase2_block_scores Tensor_nB_x_2 : scores for block swap start/end
        phase3_job_scores Tensor_n_jobs : scores for job relocation
        value_f1 Tensor_1 : predicted Δf1 (dual head)
        value_f2 Tensor_1 : predicted Δf2 (dual head)
    }
```
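How these outputs map to a concrete move can be sketched as follows (hypothetical greedy decoding, with tensors shown as `std::vector<float>` and Phase 2 omitted for brevity; the actual selection also goes through masking and sampling, described later):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: greedy decoding of an action from PolicyOutput.
int argmaxIdx(const std::vector<float>& v) {
    return static_cast<int>(std::max_element(v.begin(), v.end()) - v.begin());
}

struct Action { int phase; int swapOut = -1; int swapIn = -1; int job = -1; };

Action decode(const std::vector<float>& phaseProbs,
              const std::vector<float>& phase1Accepted,
              const std::vector<float>& phase1Rejected,
              const std::vector<float>& phase3JobScores) {
    Action a;
    a.phase = argmaxIdx(phaseProbs) + 1;  // phases are numbered 1..3
    if (a.phase == 1) {
        a.swapOut = argmaxIdx(phase1Accepted);  // accepted job to swap out
        a.swapIn  = argmaxIdx(phase1Rejected);  // rejected job to swap in
    } else if (a.phase == 3) {
        a.job = argmaxIdx(phase3JobScores);     // job to relocate
    }
    return a;
}
```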
The system's "intelligence" resides in the GNNPolicy, a GAT-based network that processes the graph state to make strategic decisions.
```
┌─────────────────────────────────────────────────────────────────┐
│                           GNN POLICY                            │
├─────────────────────────────────────────────────────────────────┤
│  Input: GraphState                                              │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────────┐                  │
│  │         Node Encoders (by type)           │                  │
│  │   Job:     Linear(5 → hidden)             │                  │
│  │   Block:   Linear(3 → hidden)             │                  │
│  │   Machine: Linear(2 → hidden)             │                  │
│  └───────────────────────────────────────────┘                  │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────────┐                  │
│  │    3x GATConvLayer (4 heads, highway)     │                  │
│  │    + LayerNorm + gated residual           │                  │
│  └───────────────────────────────────────────┘                  │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────────┐                  │
│  │        Global Pooling (mean+max)          │                  │
│  └───────────────────────────────────────────┘                  │
│      │                                                          │
│      ├──► Phase Selector Head [3]   → Neighborhood Strategy     │
│      ├──► Phase 1 Heads [jobs]      → SWAP accepted↔rejected    │
│      ├──► Phase 2 Head [blocks]     → SWAP blocks               │
│      ├──► Phase 3 Head [jobs]       → RELOCATE jobs             │
│      ├──► Value Head f1 [1]         → Predicted Δf1 (MCTS)      │
│      └──► Value Head f2 [1]         → Predicted Δf2 (MCTS)      │
└─────────────────────────────────────────────────────────────────┘
```
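A minimal LibTorch sketch of the per-type encoders and the mean+max pooling, assuming hidden = 64; the custom GATConvLayer stack in between is omitted, and this is an illustration rather than the project's `GNNPolicy.cpp`:

```cpp
#include <torch/torch.h>

// Sketch: one Linear encoder per node type, then mean+max global pooling.
struct EncoderSketch : torch::nn::Module {
    torch::nn::Linear jobEnc{nullptr}, blockEnc{nullptr}, machineEnc{nullptr};

    explicit EncoderSketch(int64_t hidden = 64) {
        jobEnc     = register_module("jobEnc",     torch::nn::Linear(5, hidden));
        blockEnc   = register_module("blockEnc",   torch::nn::Linear(3, hidden));
        machineEnc = register_module("machineEnc", torch::nn::Linear(2, hidden));
    }

    // x_*: [num_nodes, features] per node type. Returns a global embedding.
    torch::Tensor forward(torch::Tensor xJ, torch::Tensor xB, torch::Tensor xM) {
        auto h = torch::cat({torch::relu(jobEnc->forward(xJ)),
                             torch::relu(blockEnc->forward(xB)),
                             torch::relu(machineEnc->forward(xM))}, /*dim=*/0);
        // ... 3x GAT message passing over the heterogeneous edges here ...
        auto meanPool = h.mean(/*dim=*/0);
        auto maxPool  = std::get<0>(h.max(/*dim=*/0));
        return torch::cat({meanPool, maxPool});  // [2 * hidden]
    }
};
```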
The NeuroGuidedSearch component allows the agent to adapt in real time during inference.
- Imitation Learning (warm-start): the agent is trained via Behavior Cloning to imitate the TSMN expert.
- Online Adaptation (PPO): while solving an instance, the agent uses Proximal Policy Optimization to update its weights online, specializing the policy to that instance; a sketch of the standard PPO clipped loss follows.
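For reference, the standard PPO clipped surrogate loss in LibTorch, with `clipEps` and `entropyCoef` matching the NGSConfig defaults listed later (a generic formulation, not the project's exact code):

```cpp
#include <torch/torch.h>

// Sketch of the PPO clipped surrogate loss with an entropy bonus.
torch::Tensor ppoLoss(torch::Tensor logProb, torch::Tensor logProbOld,
                      torch::Tensor advantage, torch::Tensor entropy,
                      double clipEps = 0.2, double entropyCoef = 0.01) {
    auto ratio = (logProb - logProbOld).exp();  // π/π_old
    auto unclipped = ratio * advantage;
    auto clipped = torch::clamp(ratio, 1.0 - clipEps, 1.0 + clipEps) * advantage;
    // Maximize the clipped surrogate and the entropy bonus (hence the minus).
    return -torch::min(unclipped, clipped).mean() - entropyCoef * entropy.mean();
}
```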
The AlphaZeroTrainer implements a full self-play training loop:
- Self-play: MCTS with PUCT selection and neural priors generates episodes (a PUCT sketch follows the loop below). Each step records `(state, π, z)`, where π is the MCTS visit distribution and z the lexicographic outcome.
- Replay Buffer: examples are stored in a circular replay buffer (capacity 100k by default).
- Network update: at each iteration, the GNN is trained via cross-entropy against π (policy head) and MSE against z_f1/z_f2 (dual value heads).
- Lexicographic MCTS: the `best_f1` threshold switches the value signal: when f1 is at its best, MCTS optimizes f2.
AlphaZero Training Loop:
```
for each iteration:
  ├─ Self-play N games → collect (s, π, z) examples
  ├─ Add to ReplayBuffer
  ├─ Sample batch → forward pass → policy loss + value loss → Adam
  └─ Every eval_interval: evaluate on val instances + save checkpoint
```
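The PUCT rule used during self-play can be sketched as follows (standard AlphaZero child selection; `cPuct = 1.5` matches the MCTSConfig shown later, and the struct is hypothetical):

```cpp
#include <cmath>
#include <vector>

// Sketch of PUCT child selection with neural priors.
struct Child { double prior; int visits; double valueSum; };

int selectChild(const std::vector<Child>& children, int parentVisits,
                double cPuct = 1.5) {
    int best = 0;
    double bestScore = -1e18;
    for (int i = 0; i < static_cast<int>(children.size()); ++i) {
        const auto& c = children[i];
        double q = c.visits > 0 ? c.valueSum / c.visits : 0.0;  // mean value
        double u = cPuct * c.prior *
                   std::sqrt(static_cast<double>(parentVisits)) / (1 + c.visits);
        if (q + u > bestScore) { bestScore = q + u; best = i; }
    }
    return best;
}
```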
To prevent deterministic action oscillation during inference, the system uses two mechanisms (see the sketch after this list):
- Epsilon-Greedy: with probability `inference_epsilon` (default: 10%), the agent samples an action at random instead of taking the argmax.
- Tabu List: recent actions are stored in a tabu list (size `tabu_size`, default: 10) to prevent immediate repetition of the same move.
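A sketch of how the two mechanisms combine, with hypothetical helper names and the NGSConfig defaults (`inference_epsilon = 0.1`, `tabu_size = 10`); it assumes at least one non-tabu action exists:

```cpp
#include <algorithm>
#include <deque>
#include <random>
#include <vector>

// Epsilon-greedy selection restricted to non-tabu actions.
int pickAction(const std::vector<float>& scores, std::deque<int>& tabu,
               std::mt19937& rng, double epsilon = 0.1, size_t tabuSize = 10) {
    std::vector<int> candidates;
    for (int a = 0; a < static_cast<int>(scores.size()); ++a)
        if (std::find(tabu.begin(), tabu.end(), a) == tabu.end())
            candidates.push_back(a);  // skip recently used actions

    int action;
    if (std::uniform_real_distribution<>(0.0, 1.0)(rng) < epsilon) {
        // Explore: sample uniformly among non-tabu actions.
        action = candidates[std::uniform_int_distribution<size_t>(
            0, candidates.size() - 1)(rng)];
    } else {
        // Exploit: argmax over non-tabu actions.
        action = *std::max_element(candidates.begin(), candidates.end(),
            [&](int a, int b) { return scores[a] < scores[b]; });
    }
    tabu.push_back(action);
    if (tabu.size() > tabuSize) tabu.pop_front();  // bounded tabu memory
    return action;
}
```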
- C++17 compiler (GCC 9+ or Clang 10+)
- LibTorch 2.0+ (CPU or CUDA)
- Meson build system
- Python 3.8+ with matplotlib, pandas, networkx
```bash
# 1. Activate conda environment
source ~/miniconda3/etc/profile.d/conda.sh && conda activate pmsp-gns

# 2. Configure the project
meson setup build

# 3. Compile
meson compile -C build
```

```mermaid
mindmap
  root((build/src/))
    solver/
      pmspml
        Classical TSMN solver
    neural/
      data_generator
        Expert dataset generator
      train_imitation
        Imitation learning trainer
      test_ngs
        Neural solver + online PPO
      train_alphazero
        AlphaZero self-play training
      train_alns
        Neural ALNS training
      export_graph
        Graph structure visualizer
      benchmark_all
        Ablation benchmark CSV
```
```bash
./build/src/solver/pmspml instances/Ins1.txt --run-tsmn --time-limit=5
```
Output:
- Initial greedy solution
- Improvement via Phases 1, 2, and 3 (Tabu Search)
- Final f1 and f2 values
```bash
./build/src/neural/data_generator
```
What happens:
- For each instance, runs the full TSMN
- `DataCollector` captures every expert move
- Saves transitions: (state, phase, action, f1_before, f2_before, f1_after, f2_after)
```bash
./build/src/neural/train_imitation
```
Losses (sketched below):
- Selector Loss: CrossEntropy for phase selection
- Phase Losses: CrossEntropy for specific actions
- Value Loss: MSE for value estimation
Checkpoints saved in: checkpoints/
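The three losses can be sketched with standard LibTorch calls (tensor shapes are assumptions for illustration, not the project's exact `TrainImitation.cpp`):

```cpp
#include <torch/torch.h>

// Sketch of the imitation-learning objective: selector CE + phase CE + value MSE.
torch::Tensor imitationLoss(torch::Tensor phaseLogits,  // [B, 3]
                            torch::Tensor expertPhase,  // [B] int64
                            torch::Tensor actionLogits, // [B, n_actions]
                            torch::Tensor expertAction, // [B] int64
                            torch::Tensor valuePred,    // [B, 1]
                            torch::Tensor valueTarget)  // [B, 1]
{
    namespace F = torch::nn::functional;
    auto selectorLoss = F::cross_entropy(phaseLogits, expertPhase);
    auto phaseLoss    = F::cross_entropy(actionLogits, expertAction);
    auto valueLoss    = torch::mse_loss(valuePred, valueTarget);
    return selectorLoss + phaseLoss + valueLoss;
}
```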
```bash
# Single-run mode
./build/src/neural/test_ngs instances/Ins1.txt

# With pre-trained checkpoint
./build/src/neural/test_ngs instances/Ins1.txt checkpoints/best_model.pt

# Online mode for 10 minutes
./build/src/neural/test_ngs instances/Ins1.txt --online --time 10

# With initial checkpoint
./build/src/neural/test_ngs instances/Ins1.txt checkpoint.pt --online --time 30

# With action debug
./build/src/neural/test_ngs instances/Ins1.txt --online --time 5 --action_debug
```
Options:
```mermaid
flowchart LR
    EXE["test_ngs\ninstances/Ins1.txt"]
    EXE --> A["--online\nEnable continuous\ntraining mode"]
    EXE --> B["--time X\nRun for X minutes\n(default: 5)"]
    EXE --> C["--verbose\nShow f1/f2\nimprovements"]
    EXE --> D["--action_debug\nPrint details\nof each action"]
```
What happens:
- Continuous loop: `solve()` → `updatePolicy()` → repeat
- PPO fine-tuning specializes the model to the current instance
- The final checkpoint is saved to `ngs_checkpoint_final.pt`
- Metrics are saved to `ngs_metrics.csv`
Output Example (verbose mode):
```
--- Episode 1 (elapsed: 0.0 min, remaining: 10.0 min) ---
[Step 12] ✅ Phase3: f2 2672505 -> 2210110 (Δ=-462395)
[Step 48] ✅ Phase1: f1 32340 -> 30041 (Δ=-2299)
Steps: 100 | Improving: 2
f1: 32340.00 -> 30041.00
f2: 2672505.00 -> 2210110.00
```
Trains the GNN via MCTS-guided self-play across all training instances, without requiring expert demonstrations.
```bash
# Smoke test (verify pipeline runs)
./build/src/neural/train_alphazero \
  --instances-dir instances/ \
  --val-dir instances/ \
  --iterations 2 --games 1 --simulations 5 --steps 2 --batch 4 --epochs 1

# From imitation checkpoint (recommended)
./build/src/neural/train_alphazero \
  --instances-dir instances_train/ \
  --val-dir instances_val/ \
  --checkpoint checkpoints/imitation_best.pt \
  --checkpoint-dir checkpoints/ \
  --iterations 200 \
  --games 20 \
  --simulations 200 \
  --steps 100 \
  --lr 5e-5 \
  --batch 128 \
  --epochs 10 \
  --eval-interval 10 \
  --temperature-steps 30
```
Options:
```mermaid
classDiagram
    class AlphaZeroConfig {
        --instances-dir string : instances_train/
        --val-dir string : instances_val/
        --checkpoint string : (optional) load checkpoint
        --checkpoint-dir string : ./checkpoints
        --hidden-dim int : 64
        --iterations int : 100
        --games int : 10
        --simulations int : 100
        --steps int : 50
        --lr double : 1e-4
        --batch int : 64
        --epochs int : 5
        --eval-interval int : 5
        --temperature-steps int : 30
    }
```
Recommended progression:
```mermaid
flowchart LR
    D["Debug\niters=5 · sims=10 · steps=5\nVerify pipeline · z ≠ 0"]
    W["Warm-up\niters=50 · sims=50 · steps=20\nCheck loss trends"]
    T["Training\niters=200 · sims=200 · steps=50\nFrom imitation checkpoint"]
    P["Production\niters=500+ · sims=400-800 · steps=100\nGPU recommended"]
    D --> W --> T --> P
```
What to monitor:
- `buffer=N` grows each iteration
- `policy_loss` decreases over time (if it oscillates, reduce `--lr`)
- `value_loss` decreases (if not, the z signal may be degenerate; check the `z` distribution)
- `New global best f1=...` appears in early iterations
Using the trained checkpoint with NGS:
```bash
./build/src/neural/test_ngs instances/Ins1.txt checkpoints/az_final.pt --time 30 --verbose
```
Train the Neural ALNS and compare variants for the TCC ablation study.
```bash
# Train Neural ALNS from scratch
./build/src/neural/train_alns \
  --instances-dir instances_train/ --val-dir instances_val/ \
  --checkpoint-dir checkpoints_alns_scratch/

# Train with IL backbone (warm-start)
./build/src/neural/train_alns \
  --backbone checkpoints/best_policy.pt \
  --instances-dir instances_train/ --val-dir instances_val/ \
  --checkpoint-dir checkpoints_alns_il/

# Run full benchmark: TSMN vs all ALNS variants
./build/src/benchmark_all \
  --instances-dir instances/ \
  --tsmn-time 5 \
  --models checkpoints_alns_scratch/alns_final.pt,checkpoints_alns_il/alns_final.pt \
  --names scratch,IL \
  --output results/benchmark.csv

# Visualize ablation results
python scripts/analysis/plot_ngs_metrics.py results/benchmark.csv
```
```mermaid
mindmap
  root((src/solver/))
    TSMN.cpp
      Tabu Search Multi-Neighborhood
    Greedy.cpp
      WSPT constructive heuristic
    Neighborhood.cpp
      Phase 1 accepted↔rejected
      Phase 2 block swaps
      Phase 3 job relocations
    Solution.cpp
      BlockCache O(1) access
    Instance.cpp
      Problem instance parser
    BestKnown.h
      Gap analysis reference values
```
```mermaid
mindmap
  root((src/neural/))
    policy/
      GNNPolicy.cpp
        GAT + dual value heads
      GraphState.cpp
        Feature extraction
      GraphBatch.cpp
        Batched processing
    env/
      SchedulingEnv.cpp
        RL step + lexicographic reward
      ActionSampler.cpp
        Masking + sampling
      PolicyUtils.cpp
        Differentiable log-probs
    search/
      MCTS.cpp
        PUCT + neural priors
      NeuroGuidedSearch.cpp
        NGS + online PPO
      InferenceServer.cpp
        Batched GPU inference
      ReplayBuffer.cpp
        Circular buffer 100k
    training/
      DataCollector.cpp
        Expert demonstrations
      TrainImitation.cpp
        Behavior cloning
      AlphaZeroTrainer.cpp
        Self-play + network update
    alns/
      ALNSPolicy.cpp
        Operator selection network
      ALNSSearch.cpp
        ALNS loop + PPO update
      ALNSEnv.cpp
        2-stage reward environment
      ALNSOperators.cpp
        5 destroy + 4 repair ops
```
Generates system architecture visualizations.
```bash
python scripts/visualize.py graph.json
python scripts/visualize.py --all
```
Plots NGS training metrics.
```bash
python scripts/plot_ngs_metrics.py ngs_metrics.csv --output plots.png
```
Verifies demonstration dataset integrity.
```bash
python scripts/verify_dataset.py dataset/
```
Splits instances into train/val sets (80%/20%), stratified by category (job count n).
```bash
python scripts/split_instances.py
```
Dataset split:
```mermaid
pie title "120 Instances – Train / Val Split"
    "Train (96 · 80%)" : 96
    "Val (24 · 20%)" : 24
```
Distribution by job count:
```mermaid
xychart-beta
    title "Training Instances by Job Count (n)"
    x-axis ["n=20", "n=100", "n=105", "n=110", "n=200", "n=210", "n=220", "n=300", "n=315", "n=330"]
    y-axis "Train instances" 0 --> 25
    bar [24, 8, 8, 8, 8, 8, 8, 8, 8, 8]
```
Output directories:
- `instances_train/`: 80% of instances for training
- `instances_val/`: 20% of instances for validation
```
pmsp-ml/
├── 📁 instances/            # Benchmark instances (120 total, 20–330 jobs)
├── 📁 instances_train/      # 80% split for training
├── 📁 instances_val/        # 20% split for validation
├── 📁 dataset/              # Expert demonstration dataset (.pt tensors)
├── 📁 checkpoints/          # Trained model checkpoints
├── 📁 logs/                 # Training metrics, plots, CSVs
│
├── 📁 docs/
│   ├── 📁 architecture/     # GNN, ALNS, batching, phase docs
│   ├── 📁 planning/         # AlphaZero guide, TCC plan, roadmaps
│   └── 📁 analysis/         # Behavior analysis, CUDA optimization
│
├── 📁 src/
│   ├── benchmark_all.cpp    # Ablation benchmark entry point
│   ├── meson.build
│   ├── 📁 solver/           # Classical TSMN solver
│   │   ├── main.cpp
│   │   ├── Instance.{h,cpp}
│   │   ├── Solution.{h,cpp}
│   │   ├── TSMN.{h,cpp}
│   │   ├── Greedy.{h,cpp}
│   │   ├── Neighborhood.{h,cpp}
│   │   ├── Globals.{h,cpp}
│   │   └── BestKnown.h
│   ├── 📁 neural/           # Neural components
│   │   ├── 📁 policy/       # GNNPolicy, GraphState, GraphBatch
│   │   ├── 📁 env/          # SchedulingEnv, ActionSampler, PolicyUtils
│   │   ├── 📁 search/       # MCTS, NeuroGuidedSearch, InferenceServer
│   │   ├── 📁 training/     # DataCollector, TrainImitation, AlphaZeroTrainer
│   │   ├── 📁 alns/         # ALNSPolicy, ALNSSearch, ALNSEnv, ALNSOperators
│   │   └── [*_main.cpp]     # Executable entry points
│   └── 📁 tools/            # C++ diagnostic tools (export_model, check_cuda)
│
├── 📁 scripts/
│   ├── 📁 analysis/         # plot_*.py, diagnosis.py, visualize.py
│   ├── 📁 data/             # split_instances.py, verify_dataset.py, gnn_policy.py
│   └── [*.sh]               # Pipeline launchers
├── meson.build
└── README.md
```
```
R = W₁ · Δf₁ + W₂ · Δf₂

where:
  W₁  = 1,000,000  (weight for rejection cost: absolute priority)
  W₂  = 1.0        (weight for weighted flowtime)
  Δf₁ = f₁_before − f₁_after
  Δf₂ = f₂_before − f₂_after
```
```
z_f1 = (f1_init − f1_final) / max(f1_init, 1)  → trains value_f1 head
z_f2 = (f2_init − f2_final) / max(f2_init, 1)  → trains value_f2 head
z    = z_f1 if |Δf1| > ε, else z_f2 × 0.01     → lexicographic scalar
```
The dual value heads allow MCTS to switch its evaluation criterion: when f1 is at its global best, MCTS optimizes f2 instead.
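Putting the reward and the value targets together in one sketch (hypothetical helpers; `eps` is an assumed small threshold for "f1 moved"):

```cpp
#include <algorithm>
#include <cmath>

// Sketch of the step reward and the AlphaZero value targets defined above.
struct ValueTargets { double zF1, zF2, z; };

double reward(double f1Before, double f1After, double f2Before, double f2After) {
    const double W1 = 1e6, W2 = 1.0;  // f1 dominates f2 lexicographically
    return W1 * (f1Before - f1After) + W2 * (f2Before - f2After);
}

ValueTargets targets(double f1Init, double f1Final,
                     double f2Init, double f2Final, double eps = 1e-9) {
    ValueTargets t;
    t.zF1 = (f1Init - f1Final) / std::max(f1Init, 1.0);
    t.zF2 = (f2Init - f2Final) / std::max(f2Init, 1.0);
    // Lexicographic scalar: use z_f1 when f1 moved, otherwise a damped z_f2.
    t.z = std::abs(f1Init - f1Final) > eps ? t.zF1 : 0.01 * t.zF2;
    return t;
}
```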
Hyperparameters in NGSConfig:
```mermaid
classDiagram
    class NGSConfig {
        clip_epsilon : 0.2 → ratio π/π_old clipping
        gamma : 0.99 → discount factor
        gae_lambda : 0.95 → GAE λ parameter
        ppo_epochs : 4 → update epochs per rollout
        entropy_coef : 0.01 → entropy bonus weight
        normalize_rewards : true → reward normalization
        inference_epsilon : 0.1 → random sampling probability
        tabu_size : 10 → recent actions to avoid
    }
```
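For reference, the standard Generalized Advantage Estimation recursion that `gamma` and `gae_lambda` configure (a generic sketch, not the project's exact code):

```cpp
#include <vector>

// Sketch of GAE: values must hold V(s_0..s_T), i.e. rewards.size() + 1 entries.
std::vector<double> gae(const std::vector<double>& rewards,
                        const std::vector<double>& values,
                        double gamma = 0.99, double lambda = 0.95) {
    const int T = static_cast<int>(rewards.size());
    std::vector<double> adv(T);
    double running = 0.0;
    for (int t = T - 1; t >= 0; --t) {
        // TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        double delta = rewards[t] + gamma * values[t + 1] - values[t];
        running = delta + gamma * lambda * running;  // backward accumulation
        adv[t] = running;
    }
    return adv;
}
```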
```mermaid
classDiagram
    class MCTSConfig {
        num_simulations : 100 → simulations per step
        c_puct : 1.5 → PUCT exploration constant
        dirichlet_alpha : 0.3 → noise concentration at root
        dirichlet_weight : 0.25 → noise weight (0 during eval)
        max_children : 20 → max actions expanded per node
        reuse_tree : true → subtree reuse after selection
    }
```
ActionSampler applies masks so that only valid actions can be sampled (see the sketch after this list):
- Phase 1: only accepted ↔ rejected job swaps
- Phase 2: only blocks on the same machine
- Phase 3: only accepted jobs
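A sketch of the Phase 1 masks (hypothetical structures, not the project's `ActionSampler.cpp`): accepted jobs are candidates to leave the schedule, rejected jobs are candidates to enter it.

```cpp
#include <vector>

// Build Phase 1 validity masks from job acceptance status.
struct JobState { bool accepted; };

void phase1Masks(const std::vector<JobState>& jobs,
                 std::vector<bool>& swapOutMask,   // accepted → can leave
                 std::vector<bool>& swapInMask) {  // rejected → can enter
    swapOutMask.resize(jobs.size());
    swapInMask.resize(jobs.size());
    for (size_t j = 0; j < jobs.size(); ++j) {
        swapOutMask[j] = jobs[j].accepted;
        swapInMask[j]  = !jobs[j].accepted;
    }
    // Masked logits are typically set to -inf before softmax/sampling.
}
```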
- TSMN Original: Taillard, E. (1993). "Benchmarks for Basic Scheduling Problems"
- GNN for Scheduling: Zhang et al. (2020). "Learning to Dispatch"
- PPO: Schulman et al. (2017). "Proximal Policy Optimization Algorithms"
- AlphaZero: Silver et al. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play"
- GAT: Veličković et al. (2018). "Graph Attention Networks"
MIT License - see LICENSE
Happy Scheduling!