A self-contained, interactive mini Reinforcement Learning (RL) simulation built for hackathons and educational demonstrations. It provides a real-time Streamlit dashboard where users can watch an agent learn to solve a 1D grid world from scratch.
- Custom RL Environment: A bounded 1D grid world styled after the Gymnasium API (formerly OpenAI Gym). Features built-in penalty and reward systems.
- Deep RL Agent (REINFORCE): Implements the policy-gradient REINFORCE algorithm in PyTorch, with dynamic epsilon-greedy exploration parameters.
- Live "Educator" Commentary: At every step, the dashboard displays the policy network's action probabilities, showing what the agent currently favors and why it chose each move.
- Dynamic Training Metrics: Tracks multi-episode averages, live plots, and progressive 'phase' indicators (Exploration > Learning > Mastery).
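The epsilon-greedy exploration mentioned in the feature list can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the names `epsilon_at` and `choose_action` and the schedule constants are assumptions, since the README only says the exploration parameters are dynamic.

```python
import random

# Assumed schedule values; the README does not state the real ones.
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 0.97


def epsilon_at(episode: int) -> float:
    """Decay epsilon exponentially from EPS_START down to the EPS_END floor."""
    return max(EPS_END, EPS_START * EPS_DECAY ** episode)


def choose_action(probs, epsilon, rng):
    """Return (action, explored).

    With probability epsilon pick a uniformly random action (exploration);
    otherwise pick the action the policy rates highest (greedy).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(probs)), True
    greedy = max(range(len(probs)), key=lambda i: probs[i])
    return greedy, False


rng = random.Random(0)
probs = [0.2, 0.8]  # hypothetical policy output: P(left), P(right)
for episode in (0, 50, 200):
    eps = epsilon_at(episode)
    action, explored = choose_action(probs, eps, rng)
    print(f"episode={episode} epsilon={eps:.3f} action={action} explored={explored}")
```

Returning the `explored` flag alongside the action is one way a dashboard could tell the viewer whether a move was random or chosen by the policy.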
RL-Pytorch/
│
├── streamlit_app.py   # Primary dashboard UI & Streamlit frontend execution
├── train.py           # Alternate CLI/headless training pipeline
├── agent.py           # Core Policy Gradient algorithm & Action Extractor logic
├── model.py           # PyTorch Multi-layer Perceptron (Policy Network)
├── custom_env.py      # The custom 1D grid interaction environment rules engine
├── config.py          # Centralized Global Configuration and Hyperparameters
├── utils.py           # Supplemental chart & logging helper functions
└── README.md          # Project roadmap
Assuming you have Python and the standard data science libraries installed, running this project is straightforward.
All controls are built into the dashboard UI, so non-technical evaluators do not need the command line once the app is running.
1. Install Core Dependencies

    pip install torch numpy pandas matplotlib streamlit gymnasium

2. Start the AI Dashboard

    streamlit run streamlit_app.py

- The MiniGridEnv forces the agent to start at coordinate 0.
- The policy network evaluates the current state directly. In early episodes, random "Exploration" dominates.
- If the agent hits the start boundary (wall), the step yields a -5 penalty. An ordinary step costs -1, which discourages wandering. Reaching the flag grants a +10 reward and ends the episode.
- Using gradient ascent, the PyTorch agent computes discounted cumulative returns backwards from the end of each episode and raises the probability of the actions that led to higher returns.
- You can watch the improvement curve update live in the browser over X episodes.
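The reward rules and the backwards discounting described above can be sketched in plain Python. This is an illustration under stated assumptions, not the contents of custom_env.py or agent.py: the class `SketchGridEnv`, the grid length, the step cap, the flag sitting in the last cell, and the discount factor are all hypothetical. Only the -5, -1, and +10 rewards and the start at coordinate 0 come from this README.

```python
from dataclasses import dataclass

# Assumed values: the README does not state the grid length or a step cap.
GRID_SIZE = 5
MAX_STEPS = 20
LEFT, RIGHT = 0, 1


@dataclass
class SketchGridEnv:
    """Hypothetical stand-in for MiniGridEnv with the README's reward rules."""
    position: int = 0
    steps: int = 0

    def reset(self) -> int:
        self.position = 0  # the agent always starts at coordinate 0
        self.steps = 0
        return self.position

    def step(self, action: int):
        self.steps += 1
        if action == LEFT and self.position == 0:
            reward, done = -5.0, False      # bumped into the start wall
        else:
            self.position += 1 if action == RIGHT else -1
            if self.position == GRID_SIZE - 1:
                reward, done = 10.0, True   # reached the flag (assumed last cell)
            else:
                reward, done = -1.0, False  # ordinary step cost
        truncated = self.steps >= MAX_STEPS and not done
        return self.position, reward, done, truncated


def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} by walking the episode backwards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


env = SketchGridEnv()
env.reset()
rewards = []
for action in [LEFT, RIGHT, RIGHT, RIGHT, RIGHT]:
    _, reward, done, truncated = env.step(action)
    rewards.append(reward)
    if done or truncated:
        break

print(rewards)                                 # [-5.0, -1.0, -1.0, -1.0, 10.0]
print(discounted_returns(rewards, gamma=0.5))  # [-5.25, -0.5, 1.0, 4.0, 10.0]
```

In REINFORCE, each return G_t weights the log-probability of the action taken at step t, so the wall bump at the start (return -5.25) is discouraged while the final move onto the flag (return 10.0) is reinforced.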