Franky03/pmsp-ml
🧠 Neuro-Guided Search for Parallel Machine Scheduling Problem (PMSP-GNS)

Neural Network-enhanced Local Search for the Parallel Machine Scheduling Problem with Maintenance

C++17 PyTorch Meson




🎯 Overview

This project presents a Deep Reinforcement Learning (DRL) approach for the Parallel Machine Scheduling Problem with Maintenance (PMSP-ML).

The system combines the efficiency of Local Search algorithms with the generalization power of Graph Neural Networks (GNNs): an autonomous agent learns complex local-search heuristics through two complementary training strategies, online fine-tuning (PPO) and AlphaZero-style self-play with MCTS.

The Problem

  • 2 identical parallel machines with mandatory maintenance intervals
  • n jobs with processing time p_j, weight w_j, and rejection penalty u_j
  • Lexicographic Objective:
    1. 🥇 Minimize f1 = Σ u_j (rejection cost)
    2. 🥈 Minimize f2 = Σ w_j·C_j (total weighted completion time)
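For intuition, here is the lexicographic comparison in executable form, on a hypothetical 3-job toy instance (not from the benchmark set):

```python
# Lexicographic evaluation sketch for PMSP-ML (hypothetical toy data).
# A solution is described by which jobs are rejected and the completion
# time C_j of each accepted job.

jobs = {            # j: (p_j, w_j, u_j)
    0: (4, 2, 50),
    1: (3, 1, 30),
    2: (5, 3, 80),
}

def evaluate(rejected, completion):
    """f1 = sum of rejection penalties, f2 = sum of w_j * C_j."""
    f1 = sum(jobs[j][2] for j in rejected)
    f2 = sum(jobs[j][1] * completion[j] for j in completion)
    return (f1, f2)   # Python tuple comparison == lexicographic order

# Solution A rejects job 1; its accepted jobs finish at t=4 and t=9.
a = evaluate({1}, {0: 4, 2: 9})           # (30, 2*4 + 3*9) = (30, 35)
# Solution B accepts everything but finishes later.
b = evaluate(set(), {0: 4, 1: 7, 2: 12})  # (0, 8 + 7 + 36) = (0, 51)

assert b < a   # f1 dominates: B wins despite a worse f2
```

Because the objectives form a tuple, any f1 improvement beats any f2 improvement, which is exactly what the lexicographic reward below encodes.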

The Approach

```
┌──────────────────────────────────────────────────────────────────┐
│                         TRAINING PIPELINE                        │
│                                                                  │
│   TSMN Expert        Behavior         Online        AlphaZero    │
│   Demonstrations ──► Cloning    ──►   PPO      or   Self-Play    │
│                    (Imitation)      (instance)      (MCTS + GNN) │
│                                                                  │
│              ┌──────────────────────────────────┐                │
│              │        GNN Policy                │                │
│              │  GAT layers · dual value heads   │                │
│              │  phase selector · action heads   │                │
│              └──────────────┬───────────────────┘                │
│                             │                                    │
│                     Inference / Search                           │
│                    NeuroGuidedSearch (NGS)                       │
└──────────────────────────────────────────────────────────────────┘
```

πŸ— Architecture

Heterogeneous Graph

The solution is represented as a heterogeneous graph with 3 distinct node types:

```
┌───────────────────────────────────────────────┐
│                 GRAPH STATE                   │
├───────────────────────────────────────────────┤
│                                               │
│   ┌─────┐  ┌─────┐  ┌─────┐       V_J (Jobs)  │
│   │ J_0 │──│ J_1 │──│ J_2 │──...  [p,u,w,wspt,│
│   └──┬──┘  └──┬──┘  └──┬──┘        status]    │
│      │        │        │                      │
│      ▼        ▼        ▼                      │
│   ┌──────────────────────┐      V_B (Blocks)  │
│   │   Block 0   Block 1  │       [time,slack, │
│   └──────────┬───────────┘        machine]    │
│              │                                │
│              ▼                                │
│         ┌─────────┐            V_M (Machines) │
│         │ Machine │               [T_i, δ_i]  │
│         └─────────┘                           │
└───────────────────────────────────────────────┘
```

Node Features (Inputs)

Each node type has normalized features extracted from the current solution:

```mermaid
classDiagram
    class JobNode_VJ["Job Node · V_J (5 features)"] {
        p_j  float : processing time / max(p)
        u_j  float : rejection penalty / max(u)
        w_j  float : weight / max(w)
        wspt float : WSPT ratio / max(wspt)
        status float : 1.0 accepted · 0.0 rejected
    }
    class BlockNode_VB["Block Node · V_B (3 features)"] {
        total_time  float : sum of p_j in block / T_max
        slack       float : free time before maintenance / T_max
        machine_id  float : 0.0 or 1.0
    }
    class MachineNode_VM["Machine Node · V_M (2 features)"] {
        T_i   float : maintenance time limit / max(T)
        d_i   float : maintenance duration / max(δ)
    }
    JobNode_VJ --> JobNode_VJ : job→job (sequence adjacency)
    JobNode_VJ --> BlockNode_VB : job→block (membership)
    BlockNode_VB --> MachineNode_VM : block→machine (allocation)
```
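To make the feature table concrete, here is a minimal sketch of the V_J normalization on toy values. The actual extraction lives in GraphState.cpp; the WSPT ratio is assumed here to be w_j/p_j:

```python
# Sketch of the V_J (job) feature normalization described above.
# Hypothetical toy instance; the real code is GraphState.cpp (C++).

jobs = [            # (p_j, u_j, w_j)
    (4.0, 50.0, 2.0),
    (3.0, 30.0, 1.0),
    (5.0, 80.0, 4.0),
]
accepted = {0, 2}   # job 1 is currently rejected

wspt = [w / p for (p, u, w) in jobs]        # assumed WSPT ratio w_j / p_j
max_p = max(p for p, _, _ in jobs)
max_u = max(u for _, u, _ in jobs)
max_w = max(w for _, _, w in jobs)
max_wspt = max(wspt)

features = [
    [p / max_p, u / max_u, w / max_w, wspt[j] / max_wspt,
     1.0 if j in accepted else 0.0]
    for j, (p, u, w) in enumerate(jobs)
]

# Every feature lands in [0, 1], keeping the GNN inputs well scaled.
assert all(0.0 <= x <= 1.0 for row in features for x in row)
```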

Policy Outputs

The GNN produces the following outputs for action selection:

```mermaid
classDiagram
    class PolicyOutput {
        phase_probs            Tensor_3       : P(Phase 1 · 2 · 3)
        phase1_accepted_scores Tensor_n_jobs  : scores to swap out accepted
        phase1_rejected_scores Tensor_n_jobs  : scores to swap in rejected
        phase2_block_scores    Tensor_nB_x_2  : scores for block swap start/end
        phase3_job_scores      Tensor_n_jobs  : scores for job relocation
        value_f1               Tensor_1       : predicted Δf1 (dual head)
        value_f2               Tensor_1       : predicted Δf2 (dual head)
    }
```

GNN Policy

The system's "intelligence" resides in the GNNPolicy, a GAT-based network that processes the graph state to make strategic decisions.

```
┌────────────────────────────────────────────────────────────────┐
│                      GNN POLICY                                │
├────────────────────────────────────────────────────────────────┤
│  Input: GraphState                                             │
│    │                                                           │
│    ▼                                                           │
│  ┌──────────────────────────────────────────┐                  │
│  │        Node Encoders (by type)           │                  │
│  │  Job: Linear(5 → hidden)                 │                  │
│  │  Block: Linear(3 → hidden)               │                  │
│  │  Machine: Linear(2 → hidden)             │                  │
│  └──────────────────────────────────────────┘                  │
│    │                                                           │
│    ▼                                                           │
│  ┌──────────────────────────────────────────┐                  │
│  │  3x GATConvLayer (4 heads, highway)      │                  │
│  │  + LayerNorm + gated residual            │                  │
│  └──────────────────────────────────────────┘                  │
│    │                                                           │
│    ▼                                                           │
│  ┌──────────────────────────────────────────┐                  │
│  │         Global Pooling (mean+max)        │                  │
│  └──────────────────────────────────────────┘                  │
│    │                                                           │
│    ├──► Phase Selector Head [3]   → Neighborhood Strategy      │
│    ├──► Phase 1 Heads [jobs]      → SWAP accepted↔rejected     │
│    ├──► Phase 2 Head [blocks]     → SWAP blocks                │
│    ├──► Phase 3 Head [jobs]       → RELOCATE jobs              │
│    ├──► Value Head f1 [1]         → Predicted Δf1 (MCTS)       │
│    └──► Value Head f2 [1]         → Predicted Δf2 (MCTS)       │
└────────────────────────────────────────────────────────────────┘
```

Neuro-Guided Search & Online PPO

The NeuroGuidedSearch component allows the agent to adapt in real time during inference.

  1. Imitation Learning (Warm-start): The agent is first trained via Behavior Cloning to imitate the TSMN expert.
  2. Online Adaptation (PPO): During resolution, the agent uses Proximal Policy Optimization to adjust its weights in real time, specializing to the current instance.

AlphaZero Self-Play

The AlphaZeroTrainer implements a full self-play training loop:

  1. Self-play: MCTS with PUCT selection and neural priors generates episodes. Each step records (state, π, z), where π is the MCTS visit distribution and z the lexicographic outcome.
  2. Replay Buffer: Examples are stored in a circular replay buffer (capacity 100k by default).
  3. Network update: At each iteration, the GNN is trained via cross-entropy against π (policy head) and MSE against z_f1/z_f2 (dual value heads).
  4. Lexicographic MCTS: The best_f1 threshold switches the value signal: when f1 is at its best, MCTS optimizes f2.

```
AlphaZero Training Loop:
  for each iteration:
    ├─ Self-play N games → collect (s, π, z) examples
    ├─ Add to ReplayBuffer
    ├─ Sample batch → forward pass → policy loss + value loss → Adam
    └─ Every eval_interval: evaluate on val instances + save checkpoint
```

Inference Exploration

To prevent deterministic action oscillation during inference, the system uses:

  1. Epsilon-Greedy: With probability inference_epsilon (default 10%), the agent samples an action at random instead of taking the argmax.
  2. Tabu List: The most recent actions are stored in a tabu list (size tabu_size, default 10) to prevent immediate repetition of the same move.
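A minimal sketch of how these two mechanisms combine (illustrative names and shapes; the actual logic lives in NeuroGuidedSearch.cpp):

```python
import random
from collections import deque

# Sketch of epsilon-greedy selection with a tabu list, as described above.
# `scores` maps action ids to policy scores; names are illustrative.

def select_action(scores, tabu, epsilon=0.1):
    """Pick an action: random with prob. epsilon, else best non-tabu score."""
    candidates = [a for a in scores if a not in tabu]
    if not candidates:                  # everything tabu: fall back to all
        candidates = list(scores)
    if random.random() < epsilon:
        action = random.choice(candidates)
    else:
        action = max(candidates, key=lambda a: scores[a])
    tabu.append(action)                 # deque(maxlen=...) evicts the oldest
    return action

tabu = deque(maxlen=10)                 # tabu_size = 10
scores = {0: 0.2, 1: 0.9, 2: 0.5}

first = select_action(scores, tabu, epsilon=0.0)
second = select_action(scores, tabu, epsilon=0.0)
assert first == 1 and second == 2      # best move is tabu on the second call
```

With epsilon=0.0 the selection is purely greedy, so the assertion shows the tabu list alone is enough to break the argmax oscillation.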

🛠 Installation

Prerequisites

  • C++17 compiler (GCC 9+ or Clang 10+)
  • LibTorch 2.0+ (CPU or CUDA)
  • Meson build system
  • Python 3.8+ with matplotlib, pandas, networkx

Build

```bash
# 1. Activate conda environment
source ~/miniconda3/etc/profile.d/conda.sh && conda activate pmsp-gns

# 2. Configure the project
meson setup build

# 3. Compile
meson compile -C build
```

Generated Executables

```mermaid
mindmap
  root((build/src/))
    solver/
      pmspml
        Classical TSMN solver
    neural/
      data_generator
        Expert dataset generator
      train_imitation
        Imitation learning trainer
      test_ngs
        Neural solver + online PPO
      train_alphazero
        AlphaZero self-play training
      train_alns
        Neural ALNS training
      export_graph
        Graph structure visualizer
    benchmark_all
      Ablation benchmark CSV
```

🚀 Complete Pipeline

Step 1: Run TSMN (Baseline)

```bash
./build/src/solver/pmspml instances/Ins1.txt --run-tsmn --time-limit=5
```

Output:

  • Initial greedy solution
  • Improvement via Phase 1, 2, 3 (Tabu Search)
  • Final f1 and f2 values

Step 2: 📊 Generate Demonstration Dataset

```bash
./build/src/neural/data_generator
```

What happens:

  1. For each instance, runs full TSMN
  2. DataCollector captures every expert move
  3. Saves transitions: (state, phase, action, f1_before, f2_before, f1_after, f2_after)

Step 3: 🎓 Imitation Training

```bash
./build/src/neural/train_imitation
```

Losses:

  • Selector Loss: CrossEntropy for phase selection
  • Phase Losses: CrossEntropy for specific actions
  • Value Loss: MSE for value estimation

Checkpoints saved in: checkpoints/


Step 4: 🧠 NeuroGuidedSearch (Inference)

```bash
# Single-run mode
./build/src/neural/test_ngs instances/Ins1.txt

# With pre-trained checkpoint
./build/src/neural/test_ngs instances/Ins1.txt checkpoints/best_model.pt
```

Step 5: 🔄 Online Training (PPO)

```bash
# Online mode for 10 minutes
./build/src/neural/test_ngs instances/Ins1.txt --online --time 10

# With initial checkpoint
./build/src/neural/test_ngs instances/Ins1.txt checkpoint.pt --online --time 30

# With action debug
./build/src/neural/test_ngs instances/Ins1.txt --online --time 5 --action_debug
```

Options:

```mermaid
flowchart LR
    EXE["test_ngs\ninstances/Ins1.txt"]
    EXE --> A["--online\nEnable continuous\ntraining mode"]
    EXE --> B["--time X\nRun for X minutes\n(default: 5)"]
    EXE --> C["--verbose\nShow f1/f2\nimprovements"]
    EXE --> D["--action_debug\nPrint details\nof each action"]
```

What happens:

  1. Continuous loop: solve() → updatePolicy() → repeat
  2. PPO fine-tuning specializes the model to the current instance
  3. Saves final checkpoint to ngs_checkpoint_final.pt
  4. Metrics saved to ngs_metrics.csv

Output Example (verbose mode):

```
--- Episode 1 (elapsed: 0.0 min, remaining: 10.0 min) ---
[Step   12] ★ Phase3: f2 2672505 -> 2210110 (Δ=-462395)
[Step   48] ★ Phase1: f1 32340 -> 30041 (Δ=-2299)
  Steps: 100 | Improving: 2
  f1: 32340.00 -> 30041.00
  f2: 2672505.00 -> 2210110.00
```

Step 6: ♟ AlphaZero Self-Play Training

Trains the GNN via MCTS-guided self-play across all training instances, without requiring expert demonstrations.

```bash
# Smoke test (verify pipeline runs)
./build/src/neural/train_alphazero \
    --instances-dir instances/ \
    --val-dir instances/ \
    --iterations 2 --games 1 --simulations 5 --steps 2 --batch 4 --epochs 1

# From imitation checkpoint (recommended)
./build/src/neural/train_alphazero \
    --instances-dir instances_train/ \
    --val-dir instances_val/ \
    --checkpoint checkpoints/imitation_best.pt \
    --checkpoint-dir checkpoints/ \
    --iterations 200 \
    --games 20 \
    --simulations 200 \
    --steps 100 \
    --lr 5e-5 \
    --batch 128 \
    --epochs 10 \
    --eval-interval 10 \
    --temperature-steps 30
```

Options:

```mermaid
classDiagram
    class AlphaZeroConfig {
        --instances-dir    string  : instances_train/
        --val-dir          string  : instances_val/
        --checkpoint       string  : (optional) load checkpoint
        --checkpoint-dir   string  : ./checkpoints
        --hidden-dim       int     : 64
        --iterations       int     : 100
        --games            int     : 10
        --simulations      int     : 100
        --steps            int     : 50
        --lr               double  : 1e-4
        --batch            int     : 64
        --epochs           int     : 5
        --eval-interval    int     : 5
        --temperature-steps int    : 30
    }
```

Recommended progression:

```mermaid
flowchart LR
    D["🔍 Debug\niters=5 · sims=10 · steps=5\nVerify pipeline · z ≠ 0"]
    W["🌑 Warm-up\niters=50 · sims=50 · steps=20\nCheck loss trends"]
    T["🎓 Training\niters=200 · sims=200 · steps=50\nFrom imitation checkpoint"]
    P["🚀 Production\niters=500+ · sims=400-800 · steps=100\nGPU recommended"]
    D --> W --> T --> P
```

What to monitor:

  • buffer=N grows each iteration
  • policy_loss decreases over time (if it oscillates, reduce --lr)
  • value_loss decreases (if not, the z signal may be degenerate: check the z distribution)
  • New global best f1=... appears in early iterations

Using the trained checkpoint with NGS:

```bash
./build/src/neural/test_ngs instances/Ins1.txt checkpoints/az_final.pt --time 30 --verbose
```

Step 7: 🔬 Neural ALNS & Ablation Benchmark

Train the Neural ALNS and compare variants for the TCC ablation study.

```bash
# Train Neural ALNS from scratch
./build/src/neural/train_alns \
    --instances-dir instances_train/ --val-dir instances_val/ \
    --checkpoint-dir checkpoints_alns_scratch/

# Train with IL backbone (warm-start)
./build/src/neural/train_alns \
    --backbone checkpoints/best_policy.pt \
    --instances-dir instances_train/ --val-dir instances_val/ \
    --checkpoint-dir checkpoints_alns_il/

# Run full benchmark: TSMN vs all ALNS variants
./build/src/benchmark_all \
    --instances-dir instances/ \
    --tsmn-time 5 \
    --models checkpoints_alns_scratch/alns_final.pt,checkpoints_alns_il/alns_final.pt \
    --names scratch,IL \
    --output results/benchmark.csv

# Visualize ablation results
python scripts/analysis/plot_ngs_metrics.py results/benchmark.csv
```

📦 Components

Classical Solver (src/solver/)

```mermaid
mindmap
  root((src/solver/))
    TSMN.cpp
      Tabu Search Multi-Neighborhood
    Greedy.cpp
      WSPT constructive heuristic
    Neighborhood.cpp
      Phase 1 accepted↔rejected
      Phase 2 block swaps
      Phase 3 job relocations
    Solution.cpp
      BlockCache O(1) access
    Instance.cpp
      Problem instance parser
    BestKnown.h
      Gap analysis reference values
```

Neural Components (src/neural/)

```mermaid
mindmap
  root((src/neural/))
    policy/
      GNNPolicy.cpp
        GAT + dual value heads
      GraphState.cpp
        Feature extraction
      GraphBatch.cpp
        Batched processing
    env/
      SchedulingEnv.cpp
        RL step + lexicographic reward
      ActionSampler.cpp
        Masking + sampling
      PolicyUtils.cpp
        Differentiable log-probs
    search/
      MCTS.cpp
        PUCT + neural priors
      NeuroGuidedSearch.cpp
        NGS + online PPO
      InferenceServer.cpp
        Batched GPU inference
      ReplayBuffer.cpp
        Circular buffer 100k
    training/
      DataCollector.cpp
        Expert demonstrations
      TrainImitation.cpp
        Behavior cloning
      AlphaZeroTrainer.cpp
        Self-play + network update
    alns/
      ALNSPolicy.cpp
        Operator selection network
      ALNSSearch.cpp
        ALNS loop + PPO update
      ALNSEnv.cpp
        2-stage reward environment
      ALNSOperators.cpp
        5 destroy + 4 repair ops
```

🐍 Python Scripts

📊 scripts/visualize.py

Generates system architecture visualizations.

```bash
python scripts/visualize.py graph.json
python scripts/visualize.py --all
```

📈 scripts/plot_ngs_metrics.py

Plots NGS training metrics.

```bash
python scripts/plot_ngs_metrics.py ngs_metrics.csv --output plots.png
```

✅ scripts/verify_dataset.py

Verifies demonstration dataset integrity.

```bash
python scripts/verify_dataset.py dataset/
```

📂 scripts/split_instances.py

Splits instances into train/val sets (80%/20%) stratified by category (n size).

```bash
python scripts/split_instances.py
```

Dataset split:

```mermaid
pie title "120 Instances - Train / Val Split"
    "Train (96 · 80%)" : 96
    "Val (24 · 20%)" : 24
```

Distribution by job count:

```mermaid
xychart-beta
    title "Training Instances by Job Count (n)"
    x-axis ["n=20", "n=100", "n=105", "n=110", "n=200", "n=210", "n=220", "n=300", "n=315", "n=330"]
    y-axis "Train instances" 0 --> 25
    bar [24, 8, 8, 8, 8, 8, 8, 8, 8, 8]
```

Output directories:

  • instances_train/ - 80% of instances for training
  • instances_val/ - 20% of instances for validation

πŸ“ Project Structure

```
pmsp-ml/
├── 📂 instances/           # Benchmark instances (120 total, 20–330 jobs)
├── 📂 instances_train/     # 80% split for training
├── 📂 instances_val/       # 20% split for validation
├── 📂 dataset/             # Expert demonstration dataset (.pt tensors)
├── 📂 checkpoints/         # Trained model checkpoints
├── 📂 logs/                # Training metrics, plots, CSVs
│
├── 📂 docs/
│   ├── 📂 architecture/    # GNN, ALNS, batching, phase docs
│   ├── 📂 planning/        # AlphaZero guide, TCC plan, roadmaps
│   └── 📂 analysis/        # Behavior analysis, CUDA optimization
│
├── 📂 src/
│   ├── benchmark_all.cpp   # Ablation benchmark entry point
│   ├── meson.build
│   ├── 📂 solver/          # Classical TSMN solver
│   │   ├── main.cpp
│   │   ├── Instance.{h,cpp}
│   │   ├── Solution.{h,cpp}
│   │   ├── TSMN.{h,cpp}
│   │   ├── Greedy.{h,cpp}
│   │   ├── Neighborhood.{h,cpp}
│   │   ├── Globals.{h,cpp}
│   │   └── BestKnown.h
│   ├── 📂 neural/          # Neural components
│   │   ├── 📂 policy/      # GNNPolicy, GraphState, GraphBatch
│   │   ├── 📂 env/         # SchedulingEnv, ActionSampler, PolicyUtils
│   │   ├── 📂 search/      # MCTS, NeuroGuidedSearch, InferenceServer
│   │   ├── 📂 training/    # DataCollector, TrainImitation, AlphaZeroTrainer
│   │   ├── 📂 alns/        # ALNSPolicy, ALNSSearch, ALNSEnv, ALNSOperators
│   │   └── [*_main.cpp]    # Executable entry points
│   └── 📂 tools/           # C++ diagnostic tools (export_model, check_cuda)
│
├── 📂 scripts/
│   ├── 📂 analysis/        # plot_*.py, diagnosis.py, visualize.py
│   ├── 📂 data/            # split_instances.py, verify_dataset.py, gnn_policy.py
│   └── [*.sh]              # Pipeline launchers
├── meson.build
└── README.md
```

🔬 Technical Details

Lexicographic Reward (PPO)

```
R = W₁ · Δf₁ + W₂ · Δf₂

where:
  W₁ = 1,000,000  (weight for rejection: absolute priority)
  W₂ = 1.0        (weight for weighted flowtime)
  Δf₁ = f₁_before - f₁_after
  Δf₂ = f₂_before - f₂_after
```
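As a sanity check, the reward formula in executable form (toy deltas, not from a real run):

```python
# Lexicographic PPO reward from the formula above (toy numbers).
W1, W2 = 1_000_000, 1.0

def reward(f1_before, f1_after, f2_before, f2_after):
    return W1 * (f1_before - f1_after) + W2 * (f2_before - f2_after)

# Rejecting one fewer job (Δf1 = 1) dwarfs any realistic flowtime change,
# so f1 improvements always dominate the reward signal.
assert reward(10, 9, 5000, 9000) > reward(10, 10, 5000, 1000)
```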

AlphaZero Outcome Signal

```
z_f1 = (f1_init − f1_final) / max(f1_init, 1)   ← trains value_f1 head
z_f2 = (f2_init − f2_final) / max(f2_init, 1)   ← trains value_f2 head
z    = z_f1  if |Δf1| > ε,  else  z_f2 × 0.01   ← lexicographic scalar
```

The dual value heads allow MCTS to switch its evaluation criterion: when f1 is at its global best, MCTS optimizes f2 instead.
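The same signal in executable form, assuming Δf1 means f1_init − f1_final and using an illustrative ε threshold:

```python
# Outcome signal from the formulas above (sketch).
EPS = 1e-9   # illustrative ε; the real threshold is set by the trainer

def outcome(f1_init, f1_final, f2_init, f2_final):
    z_f1 = (f1_init - f1_final) / max(f1_init, 1)
    z_f2 = (f2_init - f2_final) / max(f2_init, 1)
    # Lexicographic scalar: f1 progress dominates; otherwise a damped f2 term.
    z = z_f1 if abs(f1_init - f1_final) > EPS else z_f2 * 0.01
    return z_f1, z_f2, z

# Episode that removed rejections: z follows the normalized f1 gain.
z_f1, z_f2, z = outcome(100, 80, 5000, 4000)
assert z == z_f1 == 0.2

# Episode with f1 already optimal: z is the damped f2 gain.
_, z_f2, z = outcome(0, 0, 5000, 4000)
assert z == z_f2 * 0.01
```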

PPO (Proximal Policy Optimization)

Hyperparameters in NGSConfig:

```mermaid
classDiagram
    class NGSConfig {
        clip_epsilon      : 0.2   - ratio π/π_old clipping
        gamma             : 0.99  - discount factor
        gae_lambda        : 0.95  - GAE λ parameter
        ppo_epochs        : 4     - update epochs per rollout
        entropy_coef      : 0.01  - entropy bonus weight
        normalize_rewards : true  - reward normalization
        inference_epsilon : 0.1   - random sampling probability
        tabu_size         : 10    - recent actions to avoid
    }
```
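The clip_epsilon entry feeds PPO's standard clipped surrogate objective; a minimal sketch for a single transition:

```python
import math

# PPO clipped surrogate objective for one transition (sketch).
def ppo_loss(logp_new, logp_old, advantage, clip_epsilon=0.2):
    ratio = math.exp(logp_new - logp_old)          # π(a|s) / π_old(a|s)
    clipped = max(min(ratio, 1 + clip_epsilon), 1 - clip_epsilon)
    # Maximize the surrogate, i.e. minimize its negation.
    return -min(ratio * advantage, clipped * advantage)

# A large policy jump (ratio ≈ 2.23) is clipped back to 1.2:
loss = ppo_loss(logp_new=-0.5, logp_old=-1.3, advantage=1.0)
assert abs(loss - (-1.2)) < 1e-9
```

Clipping caps how far a single instance-specific update can move the policy, which matters for the online fine-tuning mode where the same instance is replayed many times.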

MCTS (AlphaZero)

```mermaid
classDiagram
    class MCTSConfig {
        num_simulations  : 100   - simulations per step
        c_puct           : 1.5   - PUCT exploration constant
        dirichlet_alpha  : 0.3   - noise concentration at root
        dirichlet_weight : 0.25  - noise weight (0 during eval)
        max_children     : 20    - max actions expanded per node
        reuse_tree       : true  - subtree reuse after selection
    }
```
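For reference, the PUCT rule these parameters feed, in its standard AlphaZero form (the repo's exact implementation in MCTS.cpp may differ in details):

```python
import math

# PUCT score used to pick a child during MCTS selection (sketch).
def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

# An unvisited child with a strong neural prior can outrank a visited one:
explored = puct_score(q=0.5, prior=0.1, parent_visits=100, child_visits=50)
fresh    = puct_score(q=0.0, prior=0.6, parent_visits=100, child_visits=0)
assert fresh > explored
```

c_puct trades off the value estimate Q against the prior-weighted exploration bonus: higher values make the search follow the GNN's priors longer before trusting visit statistics.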

Action Masking

ActionSampler applies masks to ensure valid actions:

  • Phase 1: only allows swaps between accepted and rejected jobs
  • Phase 2: only allows swaps between blocks on the same machine
  • Phase 3: only allows relocation of accepted jobs
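A minimal sketch of score masking before sampling (illustrative shapes; the real logic lives in ActionSampler.cpp):

```python
import math

# Masking sketch: invalid actions get -inf before softmax/argmax,
# so they receive zero probability and can never be selected.
def mask_scores(scores, valid):
    return [s if ok else -math.inf for s, ok in zip(scores, valid)]

# Phase 3 example: only accepted jobs may be relocated.
scores   = [0.7, 1.4, 0.2]
accepted = [True, False, True]   # job 1 is rejected
masked   = mask_scores(scores, accepted)

assert max(range(3), key=lambda j: masked[j]) == 0   # job 1 never chosen
```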

📚 References

  1. TSMN Original: Taillard, E. (1993). "Benchmarks for Basic Scheduling Problems"
  2. GNN for Scheduling: Zhang et al. (2020). "Learning to Dispatch"
  3. PPO: Schulman et al. (2017). "Proximal Policy Optimization Algorithms"
  4. AlphaZero: Silver et al. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play"
  5. GAT: Veličković et al. (2018). "Graph Attention Networks"

πŸ“ License

MIT License - see LICENSE


🚀 Happy Scheduling!
