A compact, portfolio-worthy reinforcement learning starter focused on clean abstractions, reproducible experiments, and practical engineering discipline.
This repository is intentionally not a giant framework. It is a serious baseline you can understand, extend, and present:
- Environment abstraction instead of coupling training logic directly to Gymnasium
- DQN baseline with replay buffer, target network, gradient clipping, and optional Double DQN target selection
- Config-driven experiments via YAML snapshots
- Train / evaluate split suitable for iterative experimentation
- Artifact management for metrics, checkpoints, and reproducibility
- Minimal tests to keep the core substrate honest
Many RL repos fail one of two tests:
- They are too toy-like to demonstrate engineering maturity.
- They are too large and opaque to learn from or adapt quickly.
rl-lab aims for the middle ground:
- small enough to audit in one sitting,
- structured enough to scale into PPO / SAC / Rainbow-style extensions,
- and polished enough to function as a GitHub portfolio project.
The current baseline targets discrete-action control tasks such as CartPole-v1.
Included features:
- MLP Q-network
- target network synchronization
- replay buffer
- epsilon-greedy exploration schedule
- gradient clipping
- configurable warm-up period (`learning_starts`)
- periodic evaluation
- checkpointing
- optional Double DQN bootstrap action selection
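The exploration schedule is usually a linear anneal from a high epsilon to a small floor; a minimal sketch (the `start`/`end` values here are illustrative assumptions, only the decay horizon mirrors the default config's `epsilon_decay_steps: 20000`):

```python
def epsilon(step, start=1.0, end=0.05, decay_steps=20000):
    # Linearly anneal exploration probability from `start` to `end`
    # over the first `decay_steps` environment steps, then hold at `end`.
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```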
This is a good first productionized baseline because it demonstrates the essential RL system loop:
$$(s_t, a_t, r_t, s_{t+1}, d_t) \rightarrow \text{replay buffer} \rightarrow \text{batched optimization} \rightarrow \text{policy improvement}$$

with Bellman target

$$y_t = r_t + \gamma (1 - d_t) \max_{a'} Q_{\theta^-}(s_{t+1}, a')$$

and, when Double DQN is enabled,

$$y_t = r_t + \gamma (1 - d_t)\, Q_{\theta^-}\!\left(s_{t+1}, \arg\max_{a'} Q_{\theta}(s_{t+1}, a')\right)$$
which reduces maximization bias relative to vanilla DQN.
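The two targets differ only in which network chooses the bootstrap action; a per-transition sketch in plain Python (list-based for clarity, not the repo's tensor implementation):

```python
def dqn_target(r, done, q_next_target, gamma=0.99):
    # Vanilla DQN: bootstrap with the max over the target network's Q-values,
    # zeroing the bootstrap term at terminal transitions.
    return r + gamma * (0.0 if done else max(q_next_target))

def double_dqn_target(r, done, q_next_online, q_next_target, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    if done:
        return r
    a_star = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return r + gamma * q_next_target[a_star]
```

When the online network overestimates an action's value, the target network's (independent) estimate tempers it, which is the source of the bias reduction.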
```
rl-lab/
├── configs/
│   └── cartpole_dqn.yaml
├── scripts/
│   ├── train.py
│   └── evaluate.py
├── src/rl_lab/
│   ├── agents/
│   │   ├── base.py
│   │   └── dqn/
│   │       ├── agent.py
│   │       ├── network.py
│   │       └── replay.py
│   ├── envs/
│   │   ├── base.py
│   │   ├── factory.py
│   │   └── gym_env.py
│   ├── trainers/
│   │   └── dqn_trainer.py
│   ├── utils/
│   │   ├── checkpoint.py
│   │   ├── device.py
│   │   ├── logging.py
│   │   └── seeding.py
│   ├── config.py
│   └── evaluation.py
├── tests/
│   ├── test_config.py
│   └── test_replay.py
├── .gitignore
├── pyproject.toml
└── README.md
```
For a concrete training/evaluation flow, see docs/experiment-walkthrough.md.
```bash
cd rl-lab
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"
```

Train the baseline:

```bash
python scripts/train.py --config configs/cartpole_dqn.yaml
```

Artifacts are written to:
```
artifacts/cartpole_dqn/
├── config.snapshot.json
├── metrics.jsonl
├── checkpoint_best.pt
├── checkpoint_step_10000.pt
├── checkpoint_step_20000.pt
├── checkpoint_last.pt
└── summary.json
```
```bash
python scripts/evaluate.py \
  --config configs/cartpole_dqn.yaml \
  --checkpoint artifacts/cartpole_dqn/checkpoint_best.pt \
  --episodes 25
```

Run the tests:

```bash
pytest
```

The training pipeline is intentionally decomposed:
- **Config layer**
  - YAML is parsed into typed dataclasses.
  - Training snapshots are persisted for reproducibility.
- **Environment layer**
  - `EnvAdapter` isolates the trainer from raw Gymnasium APIs.
  - This makes swapping to custom environments or wrappers straightforward.
- **Agent layer**
  - `DQNAgent` owns Q-networks, optimizer, replay buffer, and Bellman updates.
- **Trainer layer**
  - `DQNTrainer` owns rollout collection, logging, evaluation cadence, and checkpoint policy.
- **Utilities**
  - checkpointing
  - deterministic seeding
  - device resolution
  - JSONL metric logging
This separation matters because RL code degrades quickly when rollout logic, optimization logic, evaluation, and environment plumbing are fused into one script.
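To illustrate the environment boundary, here is a hypothetical `Protocol`-style sketch of what an `EnvAdapter` contract might look like; the method names and the toy `CountingEnv` are assumptions for illustration, not the repo's actual API:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class EnvAdapter(Protocol):
    # The only surface the trainer is allowed to touch.
    def reset(self, seed=None) -> Any: ...
    def step(self, action) -> tuple: ...

class CountingEnv:
    """Toy adapter: reward 1.0 per step, episode ends after 3 steps."""
    def reset(self, seed=None):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3, {}
```

Because the trainer depends only on this narrow surface, swapping Gymnasium for a custom simulator is a one-file change in the environment layer.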
Experiments are YAML-driven. The default config is intentionally simple:
```yaml
experiment_name: cartpole-dqn-baseline
algorithm: dqn

env:
  id: CartPole-v1

network:
  hidden_sizes: [128, 128]

dqn:
  gamma: 0.99
  learning_rate: 0.001
  batch_size: 64
  buffer_size: 50000
  learning_starts: 1000
  epsilon_decay_steps: 20000
  double_dqn: true

train:
  total_steps: 30000
  eval_interval: 5000
  checkpoint_interval: 10000

artifact_dir: artifacts/cartpole_dqn
```

This structure is easy to extend with:
- prioritized replay
- dueling heads
- n-step returns
- vectorized environments
- experiment sweeps
- TensorBoard / Weights & Biases logging
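Of these, n-step returns are an especially self-contained change; a sketch of the n-step target, where `bootstrap_value` stands in for the bootstrapped estimate $\max_{a'} Q_{\theta^-}(s_{t+n}, a')$:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, n=3):
    # G_t = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * bootstrap_value
    # If fewer than n rewards remain (end of episode buffer), use what's there.
    head = rewards[:n]
    g = sum((gamma ** k) * r for k, r in enumerate(head))
    return g + (gamma ** len(head)) * bootstrap_value
```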
Training writes append-only JSONL records for easy downstream analysis.
Typical metric types:
- `train_episode`
- `optimization`
- `evaluation`
Why JSONL instead of a hidden logger dependency?
- trivial to parse with Python, pandas, or jq
- easy to diff and inspect in GitHub artifacts
- avoids locking the starter to one observability vendor
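Because each line is one JSON object, parsing needs only the standard library; a sketch (the record fields shown are illustrative, not the repo's exact schema):

```python
import json

def load_metrics(lines, kind=None):
    # Each line of metrics.jsonl is one JSON object; optionally filter by kind.
    records = [json.loads(line) for line in lines if line.strip()]
    return [r for r in records if kind is None or r.get("kind") == kind]

sample = [
    '{"kind": "train_episode", "step": 100, "return": 22.0}',
    '{"kind": "evaluation", "step": 5000, "mean_return": 180.5}',
]
evals = load_metrics(sample, kind="evaluation")
```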
Example analysis:
```bash
cat artifacts/cartpole_dqn/metrics.jsonl | jq 'select(.kind == "evaluation")'
```

The current baseline:

- establishes a strong RL project skeleton
- demonstrates separation of concerns
- keeps the baseline understandable
- supports checkpoint-based evaluation
- is easy to fork for portfolio experimentation
It does not include:

- distributed rollouts
- vectorized environments
- mixed precision
- continuous-action methods
- benchmark suite automation
- hyperparameter sweep orchestration
- comprehensive statistical evaluation across many seeds
That is intentional. A good starter should be extensible without pretending to already be a full research platform.
Candidate extensions include:

- TensorBoard logging backend
- CLI overrides for config values
- reward normalization / observation normalization wrappers
- model cards for trained checkpoints
- more environment presets (`Acrobot`, `LunarLander`)
- prioritized replay
- dueling DQN
- n-step DQN
- categorical / distributional DQN
- PPO baseline for on-policy comparison
- experiment registry + sweep runner
- vectorized environment interface
- structured event schema for metrics and lifecycle states
- CI with lint + tests + smoke training
If you present this repository publicly, emphasize:
- not just the algorithm, but the software architecture around the algorithm
- typed config ingestion
- environment abstraction
- reproducibility through snapshots and seeded evaluation
- modularity that makes future algorithm additions clean rather than chaotic
That is what distinguishes a serious engineering portfolio piece from a notebook dump.
To add a new environment, implement the `EnvAdapter` contract and register it in `envs/factory.py`.
To add a new algorithm, follow the DQN layout:
- network module
- replay / storage module if needed
- algorithm agent class
- trainer variant if rollout/optimization cadence differs materially
To extend the analysis tooling, build a small `notebooks/` or `analysis/` layer that reads `metrics.jsonl` and generates:
- learning curves
- checkpoint comparisons
- seed variance summaries
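For learning curves, a trailing moving average over raw episode returns is often all that is needed; a minimal sketch such a layer could start from:

```python
def smooth(returns, window=5):
    # Trailing moving average: each point averages up to `window`
    # preceding returns, which tames the noise in episode returns.
    out = []
    for i in range(len(returns)):
        lo = max(0, i - window + 1)
        out.append(sum(returns[lo:i + 1]) / (i + 1 - lo))
    return out
```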
```bash
ruff check .
pytest
```

No license file is included by default in this starter. For public release, add an explicit license (MIT, Apache-2.0, etc.) based on your intended reuse model.
rl-lab is a disciplined reinforcement learning starter: small enough to learn from, structured enough to extend, and polished enough to ship as a serious GitHub project.