rl-lab


A compact, portfolio-worthy reinforcement learning starter focused on clean abstractions, reproducible experiments, and practical engineering discipline.

This repository is intentionally not a giant framework. It is a serious baseline you can understand, extend, and present:

  • Environment abstraction instead of coupling training logic directly to Gymnasium
  • DQN baseline with replay buffer, target network, gradient clipping, and optional Double DQN target selection
  • Config-driven experiments via YAML snapshots
  • Train / evaluate split suitable for iterative experimentation
  • Artifact management for metrics, checkpoints, and reproducibility
  • Minimal tests to keep the core substrate honest

Why this project exists

Many RL repos fail one of two tests:

  1. They are too toy-like to demonstrate engineering maturity.
  2. They are too large and opaque to learn from or adapt quickly.

rl-lab aims for the middle ground:

  • small enough to audit in one sitting,
  • structured enough to scale into PPO / SAC / Rainbow-style extensions,
  • and polished enough to function as a GitHub portfolio project.

Implemented baseline

DQN

The current baseline targets discrete-action control tasks such as CartPole-v1.

Included features:

  • MLP Q-network
  • target network synchronization
  • replay buffer
  • epsilon-greedy exploration schedule
  • gradient clipping
  • configurable warm-up period (learning_starts)
  • periodic evaluation
  • checkpointing
  • optional Double DQN bootstrap action selection
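The replay buffer at the core of this feature list fits in a few lines. The sketch below is a simplified stand-in for `src/rl_lab/agents/dqn/replay.py` (the real module may store tensors and pre-allocate arrays), assuming uniform sampling from a fixed-capacity FIFO:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO transition store with uniform random sampling."""

    def __init__(self, capacity: int):
        # deque(maxlen=...) silently evicts the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement over stored transitions.
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```

Prioritized replay (listed in the roadmap) would replace `sample` with importance-weighted draws while keeping the same interface.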

This makes a strong first production-grade baseline because it exercises the essential RL system loop:

$$ (s_t, a_t, r_t, s_{t+1}, d_t) \rightarrow \text{replay buffer} \rightarrow \text{batched optimization} \rightarrow \text{policy improvement} $$

with Bellman target

$$ y_t = r_t + \gamma (1 - d_t) \max_{a'} Q_{\theta^-}(s_{t+1}, a') $$

and, when Double DQN is enabled,

$$ y_t = r_t + \gamma (1 - d_t) Q_{\theta^-}\left(s_{t+1}, \arg\max_{a'} Q_{\theta}(s_{t+1}, a')\right) $$

which reduces maximization bias relative to vanilla DQN.
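Both targets can be written down directly. The NumPy sketch below (a stand-in for the tensor code inside `agent.py`, with illustrative function names) makes the one-line difference explicit: vanilla DQN lets the target network both select and evaluate the bootstrap action, while Double DQN splits those roles:

```python
import numpy as np


def dqn_targets(r, done, q_next_target, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates a'."""
    return r + gamma * (1.0 - done) * q_next_target.max(axis=1)


def double_dqn_targets(r, done, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online network selects a*, the target network evaluates it."""
    a_star = q_next_online.argmax(axis=1)                   # argmax under Q_theta
    q_eval = q_next_target[np.arange(len(a_star)), a_star]  # value under Q_theta^-
    return r + gamma * (1.0 - done) * q_eval
```

When the online network overestimates an action that the target network values modestly, the Double DQN target is strictly smaller, which is the bias reduction described above.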


Repository layout

rl-lab/
├── configs/
│   └── cartpole_dqn.yaml
├── scripts/
│   ├── train.py
│   └── evaluate.py
├── src/rl_lab/
│   ├── agents/
│   │   ├── base.py
│   │   └── dqn/
│   │       ├── agent.py
│   │       ├── network.py
│   │       └── replay.py
│   ├── envs/
│   │   ├── base.py
│   │   ├── factory.py
│   │   └── gym_env.py
│   ├── trainers/
│   │   └── dqn_trainer.py
│   ├── utils/
│   │   ├── checkpoint.py
│   │   ├── device.py
│   │   ├── logging.py
│   │   └── seeding.py
│   ├── config.py
│   └── evaluation.py
├── tests/
│   ├── test_config.py
│   └── test_replay.py
├── .gitignore
├── pyproject.toml
└── README.md

Quickstart

For a concrete training/evaluation flow, see docs/experiment-walkthrough.md.

1. Create an environment

cd rl-lab
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"

2. Train the CartPole baseline

python scripts/train.py --config configs/cartpole_dqn.yaml

Artifacts are written to:

artifacts/cartpole_dqn/
├── config.snapshot.json
├── metrics.jsonl
├── checkpoint_best.pt
├── checkpoint_step_10000.pt
├── checkpoint_step_20000.pt
├── checkpoint_last.pt
└── summary.json

3. Evaluate a checkpoint

python scripts/evaluate.py \
  --config configs/cartpole_dqn.yaml \
  --checkpoint artifacts/cartpole_dqn/checkpoint_best.pt \
  --episodes 25

4. Run tests

pytest

Training architecture

The training pipeline is intentionally decomposed:

  1. Config layer

    • YAML is parsed into typed dataclasses.
    • Training snapshots are persisted for reproducibility.
  2. Environment layer

    • EnvAdapter isolates the trainer from raw Gymnasium APIs.
    • This makes swapping to custom environments or wrappers straightforward.
  3. Agent layer

    • DQNAgent owns Q-networks, optimizer, replay buffer, and Bellman updates.
  4. Trainer layer

    • DQNTrainer owns rollout collection, logging, evaluation cadence, and checkpoint policy.
  5. Utilities

    • checkpointing
    • deterministic seeding
    • device resolution
    • JSONL metric logging
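The config layer's dict-to-dataclass step can be sketched as follows. This is an illustrative reduction of `src/rl_lab/config.py` (field names mirror the YAML shown below; the real module parses YAML via `yaml.safe_load`, which is elided here to keep the sketch stdlib-only):

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DQNConfig:
    gamma: float = 0.99
    learning_rate: float = 1e-3
    batch_size: int = 64
    double_dqn: bool = True


@dataclass(frozen=True)
class ExperimentConfig:
    experiment_name: str
    dqn: DQNConfig


def load_config(raw: dict) -> ExperimentConfig:
    # In the real pipeline `raw` comes from yaml.safe_load(open(path)).
    # Unknown keys fail loudly via the dataclass constructor.
    return ExperimentConfig(
        experiment_name=raw["experiment_name"],
        dqn=DQNConfig(**raw.get("dqn", {})),
    )


def snapshot(cfg: ExperimentConfig) -> str:
    # Persisted alongside artifacts as config.snapshot.json.
    return json.dumps(asdict(cfg), indent=2, sort_keys=True)
```

Frozen dataclasses give each run an immutable, serializable record of its hyperparameters, which is what makes the `config.snapshot.json` artifact trustworthy.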

This separation matters because RL code degrades quickly when rollout logic, optimization logic, evaluation, and environment plumbing are fused into one script.


Config philosophy

Experiments are YAML-driven. The default config is intentionally simple:

experiment_name: cartpole-dqn-baseline
algorithm: dqn

env:
  id: CartPole-v1

network:
  hidden_sizes: [128, 128]

dqn:
  gamma: 0.99
  learning_rate: 0.001
  batch_size: 64
  buffer_size: 50000
  learning_starts: 1000
  epsilon_decay_steps: 20000
  double_dqn: true

train:
  total_steps: 30000
  eval_interval: 5000
  checkpoint_interval: 10000
  artifact_dir: artifacts/cartpole_dqn

This structure is easy to extend with:

  • prioritized replay
  • dueling heads
  • n-step returns
  • vectorized environments
  • experiment sweeps
  • TensorBoard / Weights & Biases logging

Metrics and artifacts

Training writes append-only JSONL records for easy downstream analysis.

Typical metric types:

  • train_episode
  • optimization
  • evaluation

Why JSONL instead of a hidden logger dependency?

  • trivial to parse with Python, pandas, or jq
  • easy to diff and inspect in GitHub artifacts
  • avoids locking the starter to one observability vendor

Example analysis:

jq 'select(.kind == "evaluation")' artifacts/cartpole_dqn/metrics.jsonl
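The same filter takes a few lines of stdlib Python, which is the point of the format. This sketch assumes each record carries a `kind` field as in the jq example above:

```python
import json
from pathlib import Path


def eval_records(path):
    """Yield evaluation records from an append-only metrics.jsonl file."""
    for line in Path(path).read_text().splitlines():
        record = json.loads(line)
        if record.get("kind") == "evaluation":
            yield record
```

From here a pandas `DataFrame(eval_records(...))` gives learning curves with no logging framework in the loop.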

Practical engineering notes

What this starter does well

  • establishes a strong RL project skeleton
  • demonstrates separation of concerns
  • keeps the baseline understandable
  • supports checkpoint-based evaluation
  • is easy to fork for portfolio experimentation

What it does not yet do

  • distributed rollouts
  • vectorized environments
  • mixed precision
  • continuous-action methods
  • benchmark suite automation
  • hyperparameter sweep orchestration
  • comprehensive statistical evaluation across many seeds

That is intentional. A good starter should be extensible without pretending to already be a full research platform.


Suggested roadmap

Near-term

  • TensorBoard logging backend
  • CLI overrides for config values
  • reward normalization / observation normalization wrappers
  • model cards for trained checkpoints
  • more environment presets (Acrobot, LunarLander)

Algorithmic upgrades

  • prioritized replay
  • dueling DQN
  • n-step DQN
  • categorical / distributional DQN
  • PPO baseline for on-policy comparison

Systems upgrades

  • experiment registry + sweep runner
  • vectorized environment interface
  • structured event schema for metrics and lifecycle states
  • CI with lint + tests + smoke training

Portfolio framing

If you present this repository publicly, emphasize:

  • not just the algorithm, but the software architecture around the algorithm
  • typed config ingestion
  • environment abstraction
  • reproducibility through snapshots and seeded evaluation
  • modularity that makes future algorithm additions clean rather than chaotic

That is what distinguishes a serious engineering portfolio piece from a notebook dump.


Common extension points

Add a new environment backend

Implement the EnvAdapter contract and register it in envs/factory.py.
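A minimal sketch of that contract and a toy backend follows. The method names here are illustrative; the authoritative interface is whatever `src/rl_lab/envs/base.py` defines:

```python
from abc import ABC, abstractmethod


class EnvAdapter(ABC):
    """Illustrative version of the adapter contract the trainer depends on."""

    @abstractmethod
    def reset(self, seed=None):
        """Start a new episode and return the initial observation."""

    @abstractmethod
    def step(self, action):
        """Return (observation, reward, done, info) for one transition."""

    @property
    @abstractmethod
    def num_actions(self) -> int:
        """Size of the discrete action space."""


class CountdownEnv(EnvAdapter):
    """Toy backend: reward 1 per step, episode ends after `horizon` steps."""

    def __init__(self, horizon: int = 3):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, {}

    @property
    def num_actions(self) -> int:
        return 2
```

Because the trainer only sees this surface, a custom simulator slots in exactly like the Gymnasium-backed adapter in `gym_env.py`.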

Add a new value-based algorithm

Follow the DQN layout:

  • network module
  • replay / storage module if needed
  • algorithm agent class
  • trainer variant if rollout/optimization cadence differs materially

Add experiment analysis

Build a small notebooks/ or analysis/ layer that reads metrics.jsonl and generates:

  • learning curves
  • checkpoint comparisons
  • seed variance summaries
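The seed-variance summary, for instance, needs nothing beyond the stdlib. A sketch, assuming you have already collected one final evaluation return per seed:

```python
import statistics


def seed_summary(returns_by_seed: dict) -> dict:
    """Summarize final returns across seeds as mean and sample std."""
    values = list(returns_by_seed.values())
    return {
        "n_seeds": len(values),
        "mean": statistics.mean(values),
        # Sample std needs at least two seeds; report 0.0 otherwise.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }
```

Reporting mean plus spread across seeds, rather than a single lucky run, is the cheapest step toward the "comprehensive statistical evaluation" listed above as out of scope.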

Development

ruff check .
pytest

License note

No license file is included by default in this starter. For public release, add an explicit license (MIT, Apache-2.0, etc.) based on your intended reuse model.


Bottom line

rl-lab is a disciplined reinforcement learning starter: small enough to learn from, structured enough to extend, and polished enough to ship as a serious GitHub project.
