A compact, portfolio-worthy reinforcement learning starter focused on clean abstractions, reproducible experiments, and practical engineering discipline.
This repository is intentionally not a giant framework. It is a serious baseline you can understand, extend, and present:
- Environment abstraction instead of coupling training logic directly to Gymnasium
- DQN baseline with replay buffer, target network, gradient clipping, and optional Double DQN target selection
- Config-driven experiments via YAML snapshots
- Train / evaluate split suitable for iterative experimentation
- Artifact management for metrics, checkpoints, and reproducibility
- Minimal tests to keep the core substrate honest
Many RL repos fail one of two tests:
- They are too toy-like to demonstrate engineering maturity.
- They are too large and opaque to learn from or adapt quickly.
rl-lab aims for the middle ground:
- small enough to audit in one sitting,
- structured enough to scale into PPO / SAC / Rainbow-style extensions,
- and polished enough to function as a GitHub portfolio project.
The current baseline targets discrete-action control tasks such as CartPole-v1.
Included features:
- MLP Q-network
- target network synchronization
- replay buffer
- epsilon-greedy exploration schedule
- gradient clipping
- configurable warm-up period (`learning_starts`)
- periodic evaluation
- checkpointing
- optional Double DQN bootstrap action selection
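The exploration schedule is usually a linear anneal from a high epsilon to a small floor; a minimal sketch (the `start`/`end` values here are illustrative assumptions, only the decay horizon mirrors the default config's `epsilon_decay_steps: 20000`):

```python
def epsilon(step, start=1.0, end=0.05, decay_steps=20000):
    # Linearly anneal exploration probability from `start` to `end`
    # over the first `decay_steps` environment steps, then hold at `end`.
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```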
This is a good first productionized baseline because it demonstrates the essential RL system loop:
$$(s_t, a_t, r_t, s_{t+1}, d_t) \rightarrow \text{replay buffer} \rightarrow \text{batched optimization} \rightarrow \text{policy improvement}$$

with Bellman target

$$y_t = r_t + \gamma (1 - d_t) \max_{a'} Q_{\theta^-}(s_{t+1}, a')$$

and, when Double DQN is enabled,

$$y_t = r_t + \gamma (1 - d_t)\, Q_{\theta^-}\!\left(s_{t+1}, \arg\max_{a'} Q_{\theta}(s_{t+1}, a')\right)$$
which reduces maximization bias relative to vanilla DQN.
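The two targets differ only in which network chooses the bootstrap action; a per-transition sketch in plain Python (list-based for clarity, not the repo's tensor implementation):

```python
def dqn_target(r, done, q_next_target, gamma=0.99):
    # Vanilla DQN: bootstrap with the max over the target network's Q-values,
    # zeroing the bootstrap term at terminal transitions.
    return r + gamma * (0.0 if done else max(q_next_target))

def double_dqn_target(r, done, q_next_online, q_next_target, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    if done:
        return r
    a_star = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return r + gamma * q_next_target[a_star]
```

When the online network overestimates an action's value, the target network's (independent) estimate tempers it, which is the source of the bias reduction.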
```
rl-lab/
├── configs/
│   └── cartpole_dqn.yaml
├── scripts/
│   ├── train.py
│   └── evaluate.py
├── src/rl_lab/
│   ├── agents/
│   │   ├── base.py
│   │   └── dqn/
│   │       ├── agent.py
│   │       ├── network.py
│   │       └── replay.py
│   ├── envs/
│   │   ├── base.py
│   │   ├── factory.py
│   │   └── gym_env.py
│   ├── trainers/
│   │   └── dqn_trainer.py
│   ├── utils/
│   │   ├── checkpoint.py
│   │   ├── device.py
│   │   ├── logging.py
│   │   └── seeding.py
│   ├── config.py
│   └── evaluation.py
├── tests/
│   ├── test_config.py
│   └── test_replay.py
├── .gitignore
├── pyproject.toml
└── README.md
```
For a concrete training/evaluation flow, see docs/experiment-walkthrough.md.
```bash
cd rl-lab
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"
```

Train the baseline:

```bash
python scripts/train.py --config configs/cartpole_dqn.yaml
```

Artifacts are written to:
```
artifacts/cartpole_dqn/
├── config.snapshot.json
├── metrics.jsonl
├── checkpoint_best.pt
├── checkpoint_step_10000.pt
├── checkpoint_step_20000.pt
├── checkpoint_last.pt
└── summary.json
```
```bash
python scripts/evaluate.py \
  --config configs/cartpole_dqn.yaml \
  --checkpoint artifacts/cartpole_dqn/checkpoint_best.pt \
  --episodes 25
```

Run the tests:

```bash
pytest
```

The training pipeline is intentionally decomposed:
- **Config layer**
  - YAML is parsed into typed dataclasses.
  - Training snapshots are persisted for reproducibility.
- **Environment layer**
  - `EnvAdapter` isolates the trainer from raw Gymnasium APIs.
  - This makes swapping to custom environments or wrappers straightforward.
- **Agent layer**
  - `DQNAgent` owns Q-networks, optimizer, replay buffer, and Bellman updates.
- **Trainer layer**
  - `DQNTrainer` owns rollout collection, logging, evaluation cadence, and checkpoint policy.
- **Utilities**
  - checkpointing
  - deterministic seeding
  - device resolution
  - JSONL metric logging
This separation matters because RL code degrades quickly when rollout logic, optimization logic, evaluation, and environment plumbing are fused into one script.
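To illustrate the environment boundary, here is a hypothetical `Protocol`-style sketch of what an `EnvAdapter` contract might look like; the method names and the toy `CountingEnv` are assumptions for illustration, not the repo's actual API:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class EnvAdapter(Protocol):
    # The only surface the trainer is allowed to touch.
    def reset(self, seed=None) -> Any: ...
    def step(self, action) -> tuple: ...

class CountingEnv:
    """Toy adapter: reward 1.0 per step, episode ends after 3 steps."""
    def reset(self, seed=None):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3, {}
```

Because the trainer depends only on this narrow surface, swapping Gymnasium for a custom simulator is a one-file change in the environment layer.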
Experiments are YAML-driven. The default config is intentionally simple:
```yaml
experiment_name: cartpole-dqn-baseline
algorithm: dqn

env:
  id: CartPole-v1

network:
  hidden_sizes: [128, 128]

dqn:
  gamma: 0.99
  learning_rate: 0.001
  batch_size: 64
  buffer_size: 50000
  learning_starts: 1000
  epsilon_decay_steps: 20000
  double_dqn: true

train:
  total_steps: 30000
  eval_interval: 5000
  checkpoint_interval: 10000

artifact_dir: artifacts/cartpole_dqn
```

This structure is easy to extend with:
- prioritized replay
- dueling heads
- n-step returns
- vectorized environments
- experiment sweeps
- TensorBoard / Weights & Biases logging
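Of these, n-step returns are an especially self-contained change; a sketch of the n-step target, where `bootstrap_value` stands in for the bootstrapped estimate $\max_{a'} Q_{\theta^-}(s_{t+n}, a')$:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, n=3):
    # G_t = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * bootstrap_value
    # If fewer than n rewards remain (end of episode buffer), use what's there.
    head = rewards[:n]
    g = sum((gamma ** k) * r for k, r in enumerate(head))
    return g + (gamma ** len(head)) * bootstrap_value
```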
Training writes append-only JSONL records for easy downstream analysis.
Typical metric types:
- `train_episode`
- `optimization`
- `evaluation`
Why JSONL instead of a hidden logger dependency?
- trivial to parse with Python, pandas, or jq
- easy to diff and inspect in GitHub artifacts
- avoids locking the starter to one observability vendor
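Because each line is one JSON object, parsing needs only the standard library; a sketch (the record fields shown are illustrative, not the repo's exact schema):

```python
import json

def load_metrics(lines, kind=None):
    # Each line of metrics.jsonl is one JSON object; optionally filter by kind.
    records = [json.loads(line) for line in lines if line.strip()]
    return [r for r in records if kind is None or r.get("kind") == kind]

sample = [
    '{"kind": "train_episode", "step": 100, "return": 22.0}',
    '{"kind": "evaluation", "step": 5000, "mean_return": 180.5}',
]
evals = load_metrics(sample, kind="evaluation")
```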
Example analysis:
```bash
cat artifacts/cartpole_dqn/metrics.jsonl | jq 'select(.kind == "evaluation")'
```

The current baseline:

- establishes a strong RL project skeleton
- demonstrates separation of concerns
- keeps the baseline understandable
- supports checkpoint-based evaluation
- is easy to fork for portfolio experimentation
It does not include:

- distributed rollouts
- vectorized environments
- mixed precision
- continuous-action methods
- benchmark suite automation
- hyperparameter sweep orchestration
- comprehensive statistical evaluation across many seeds
That is intentional. A good starter should be extensible without pretending to already be a full research platform.
Candidate extensions include:

- TensorBoard logging backend
- CLI overrides for config values
- reward normalization / observation normalization wrappers
- model cards for trained checkpoints
- more environment presets (`Acrobot`, `LunarLander`)
- prioritized replay
- dueling DQN
- n-step DQN
- categorical / distributional DQN
- PPO baseline for on-policy comparison
- experiment registry + sweep runner
- vectorized environment interface
- structured event schema for metrics and lifecycle states
- CI with lint + tests + smoke training
If you present this repository publicly, emphasize:
- not just the algorithm, but the software architecture around the algorithm
- typed config ingestion
- environment abstraction
- reproducibility through snapshots and seeded evaluation
- modularity that makes future algorithm additions clean rather than chaotic
That is what distinguishes a serious engineering portfolio piece from a notebook dump.
To add a new environment, implement the `EnvAdapter` contract and register it in `envs/factory.py`.
To add a new algorithm, follow the DQN layout:
- network module
- replay / storage module if needed
- algorithm agent class
- trainer variant if rollout/optimization cadence differs materially
To extend the analysis tooling, build a small `notebooks/` or `analysis/` layer that reads `metrics.jsonl` and generates:
- learning curves
- checkpoint comparisons
- seed variance summaries
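For learning curves, a trailing moving average over raw episode returns is often all that is needed; a minimal sketch such a layer could start from:

```python
def smooth(returns, window=5):
    # Trailing moving average: each point averages up to `window`
    # preceding returns, which tames the noise in episode returns.
    out = []
    for i in range(len(returns)):
        lo = max(0, i - window + 1)
        out.append(sum(returns[lo:i + 1]) / (i + 1 - lo))
    return out
```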
```bash
ruff check .
pytest
```

No license file is included by default in this starter. For public release, add an explicit license (MIT, Apache-2.0, etc.) based on your intended reuse model.
rl-lab is a disciplined reinforcement learning starter: small enough to learn from, structured enough to extend, and polished enough to ship as a serious GitHub project.