Authors: Armando Abelho & Pravir Padayachee
Institution: Wits University, BSc Hons Computer Science
Analysis of the training data reveals a stark contrast in sample efficiency and stability between the two algorithms. The base REINFORCE agent failed to learn complex behaviors, plateauing at a minimal mean reward due to the high variance of Monte Carlo returns preventing stable convergence. In contrast, the PPO architecture provided a stable learning foundation. The integration of an LSTM to handle partial observability resulted in a significant breakthrough, allowing the PPO agent to learn multi-step planning. Ultimately, with the addition of reward shaping, the PPO agent achieved advanced milestones such as crafting stone tools, demonstrating the necessity of memory and guided exploration in sparse-reward environments.
The Crafter environment is a complex reinforcement learning environment with volatile dynamics, multiple paths for exploration, and the constant possibility of death. We decided to direct our analysis toward comparing foundational function approximation and actor-critic methods with (comparatively) state-of-the-art approaches. To achieve this, we implemented the following two algorithms:
- REINFORCE: A foundational Monte Carlo policy gradient method implemented from scratch using PyTorch.
- Proximal Policy Optimisation (PPO): A state-of-the-art actor-critic method implemented using the Stable Baselines3 library.
We use standard Convolutional Neural Networks (CNNs) to process the image-based observations and train our models with general RL training methods. Our models underwent two stages of architectural and algorithmic enhancements to tackle the challenges of high variance, partial observability, and sparse rewards; the kind of CNN encoder we mean is sketched below.
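As a concrete, illustrative sketch of such an encoder (assuming 64×64 RGB observations as in Crafter; the layer sizes are indicative rather than our exact trained architecture):

```python
import torch
import torch.nn as nn

class CnnEncoder(nn.Module):
    """Small convolutional encoder for image observations (sizes are illustrative)."""
    def __init__(self, in_channels: int = 3, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy 64x64 input, then project to feature_dim.
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, in_channels, 64, 64)).shape[1]
        self.fc = nn.Sequential(nn.Linear(n_flat, feature_dim), nn.ReLU())

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, channels, height, width), scaled to [0, 1]
        return self.fc(self.conv(obs))
```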
Our first agent is based on the REINFORCE algorithm. This algorithm has a few known weaknesses, and due to its Monte Carlo nature, it will struggle with the sparse rewards seen in this environment.
The initial agent (R1) used the REINFORCE algorithm, whose updates follow the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$ is the discounted Monte Carlo return from timestep $t$.
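For concreteness, a minimal PyTorch sketch of this update, assuming the log-probabilities and rewards for one completed episode have already been collected; the optimizer and the return normalisation are illustrative choices rather than our exact implementation:

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE update from a single completed episode (illustrative sketch).

    log_probs: list of log pi(a_t | s_t) tensors collected during the rollout.
    rewards:   list of scalar rewards r_t for the same timesteps.
    """
    # Compute discounted Monte Carlo returns G_t by scanning the episode backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Normalising the returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient loss: -sum_t log pi(a_t | s_t) * G_t.
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```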
Initially, the agent showed clear signs of learning: rewards improved steadily as training progressed, but both rewards and survival rates remained low. To reduce variance, we decided to implement a baseline function.
The second agent (R2) was rebuilt as a standard Advantage Actor-Critic (A2C) model. The critic is trained on the mean squared error between its value predictions and the observed returns, and the actor is updated using the advantage function, as covered in class; the combined loss is sketched below.
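A minimal sketch of how the two losses combine, assuming the critic's value predictions and the episode returns are already available; the 0.5 value-loss weight is an illustrative default rather than our exact setting:

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_probs, values, returns, value_coef=0.5):
    """Actor and critic losses for one batch (illustrative sketch).

    log_probs: log pi(a_t | s_t), shape (T,)
    values:    critic predictions V(s_t), shape (T,)
    returns:   Monte Carlo (or bootstrapped) returns G_t, shape (T,)
    """
    # Advantage: how much better the return was than the critic expected.
    advantages = returns - values.detach()

    actor_loss = -(log_probs * advantages).mean()   # policy gradient with a baseline
    critic_loss = F.mse_loss(values, returns)       # MSE between V(s_t) and G_t

    return actor_loss + value_coef * critic_loss
```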
While the introduction of the Advantage Actor-Critic (A2C) architecture theoretically reduces the variance seen in R1, the empirical results in REINFORCE2.jsonl show only marginal improvements. The agent achieved a slightly higher mean reward compared to the base model. However, the agent remained incapable of multi-step planning. It became evident that reducing variance alone is insufficient; the agent's inability to remember past observations in a partially observable environment severely bottlenecked its progress.
The R2 agent still lacks two key components: memory to handle partial observability and a way to overcome sparse rewards. The final iteration (R3) addresses both by adding an LSTM layer (for memory) and our reward shaping wrapper (for guidance).
The final REINFORCE agent (R3) integrates an LSTM layer after the CNN feature extractor. The LSTM's hidden state is fed into the actor and critic heads, providing the agent with memory. We also apply our reward shaping wrapper to provide small bonuses for key intermediate actions (e.g., collecting wood, placing a table).
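A simplified sketch of this architecture, reusing the encoder sketched earlier; module names and sizes are illustrative rather than our exact implementation:

```python
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """CNN features -> LSTM memory -> actor and critic heads (illustrative sketch)."""
    def __init__(self, encoder: nn.Module, feature_dim: int, hidden_dim: int, n_actions: int):
        super().__init__()
        self.encoder = encoder                           # e.g. the CnnEncoder sketched earlier
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.actor = nn.Linear(hidden_dim, n_actions)    # action logits
        self.critic = nn.Linear(hidden_dim, 1)           # state-value estimate

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, channels, height, width)
        b, t = obs_seq.shape[:2]
        feats = self.encoder(obs_seq.flatten(0, 1)).view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)           # carry hidden state across timesteps
        return self.actor(out), self.critic(out).squeeze(-1), hidden
```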
The final R3 agent exhibited highly unstable performance, ultimately achieving a mean reward comparable to earlier iterations. While the LSTM was intended to provide meaningful memory, the added architectural complexity appeared to exacerbate the instability of the critic's learning process. Furthermore, the reward shaping—designed to guide the agent—struggled to yield benefits because the underlying policy optimization remained too unstable to consistently capitalize on those intermediate rewards. The agent still failed to consistently unlock mid-tier achievements.
Our second agent is based on PPO, a more modern and stable algorithm which we expect to outperform REINFORCE quite significantly.
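This stability stems largely from PPO's clipped surrogate objective (Schulman et al., 2017), which bounds how far a single update can move the policy:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

where $\hat{A}_t$ is the estimated advantage and $\epsilon$ is the clipping range.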
The base PPO agent (P1) was implemented using Stable Baselines3.
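A minimal sketch of such a setup, assuming the `crafter` package's registered gym environment and a gym-compatible Stable Baselines3 version; the hyperparameters mirror the appendix table but are illustrative rather than our exact run configuration:

```python
import gym
import crafter  # importing crafter registers the CrafterReward-v1 gym environment
from stable_baselines3 import PPO

env = gym.make("CrafterReward-v1")

# Illustrative hyperparameters (see the appendix table for the final settings).
model = PPO("CnnPolicy", env, learning_rate=3e-4, n_steps=2048,
            batch_size=64, ent_coef=0.01, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_crafter_p1")
```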
Immediately we see strong improvements in the PPO model, with its reward jumping ahead of the base REINFORCE agent within the first 1,000 iterations. We did not hesitate to begin working on improvements to this algorithm.
With a feed-forward policy, PPO agents are traditionally memoryless: each decision is based only on the current observation, which limits multi-step planning under partial observability. The next iteration adds an LSTM to address this, giving the agent both short-term and long-term memory.
The second PPO agent (P2) uses the recurrent CnnLstmPolicy from the Stable Baselines ecosystem. This architecture integrates an LSTM layer after the CNN feature extractor, allowing the agent to base its decisions on a history of observations.
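Within the Stable Baselines3 ecosystem, recurrent PPO policies are provided by the sb3-contrib extension; a minimal sketch, assuming that implementation and illustrative hyperparameters:

```python
import gym
import crafter  # registers the CrafterReward-v1 gym environment
from sb3_contrib import RecurrentPPO  # recurrent policies live in the sb3-contrib package

env = gym.make("CrafterReward-v1")
model = RecurrentPPO("CnnLstmPolicy", env, learning_rate=3e-4, n_steps=2048,
                     batch_size=64, ent_coef=0.01, verbose=1)
model.learn(total_timesteps=1_000_000)
```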
Here we consider two different dimensions. First, rewards have continued to improve over time, and learning was noticeably faster for the agent that utilised an LSTM. We also consider achievement unlock rates: for trivial achievements the two agents are very similar, but the unlock rate of more advanced achievements (such as making a wood pickaxe) has shot up, and our model has begun to learn tasks that require planning over multiple steps.
The P2 agent can now plan, but it still struggles with working towards complex tasks such as stone collection and crafting a stone pickaxe. This is likely due to sparse rewards. It may not explore enough to discover complex crafting recipes.
Our final PPO agent is identical to the previous, but is trained using our RewardShaping wrapper. This provides small +0.1 bonuses for key intermediate steps like collecting stone or placing a furnace.
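The exact bonus schedule is ours, but the following sketch illustrates the idea, assuming the environment's info dict exposes per-achievement counts under an `achievements` key (as Crafter does); the listed achievements and the +0.1 bonus are illustrative:

```python
import gym

class RewardShaping(gym.Wrapper):
    """Adds a small bonus the first time selected intermediate achievements are unlocked.

    Illustrative sketch: assumes the wrapped env's info dict exposes per-achievement
    counts under info["achievements"], as Crafter does.
    """
    BONUS_ACHIEVEMENTS = ("collect_wood", "place_table", "collect_stone", "place_furnace")

    def __init__(self, env, bonus=0.1):
        super().__init__(env)
        self.bonus = bonus
        self._unlocked = set()

    def reset(self, **kwargs):
        self._unlocked = set()
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        for name in self.BONUS_ACHIEVEMENTS:
            if name not in self._unlocked and info.get("achievements", {}).get(name, 0) > 0:
                self._unlocked.add(name)
                reward += self.bonus  # shaped bonus for a key intermediate step
        return obs, reward, done, info
```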
At first it seems as though we have hit a standstill: despite training for longer, our model has not seen an increase in reward and is actually scoring worse. This is where we look to our achievement unlock rates.
Our agent has learned to make stone tools! However, this was accomplished less than 0.1 percent of the time. It is a significant step in the right direction, but it raises the further question of how to cement these skills.
This experiment provides a clear comparison between the REINFORCE and PPO learning frameworks in a complex, partially observable environment.
Table 1 summarizes the final performance of all six agents across our key evaluation metrics. (Exact quantitative data pending.)
| Agent Configuration | Mean Reward | Mean Survival Time | Geometric Mean Score |
|---|---|---|---|
| R1: REINFORCE (Base) | TBD | TBD | TBD |
| R2: REINFORCE + A2C | TBD | TBD | TBD |
| R3: REINFORCE + A2C + LSTM + RS | TBD | TBD | TBD |
| P1: PPO (Base) | TBD | TBD | TBD |
| P2: PPO + LSTM | TBD | TBD | TBD |
| P3: PPO + LSTM + RS | TBD | TBD | TBD |
RS: Reward Shaping. Data from final runs.
The collected data illustrates the clear superiority of the PPO-based agents. The REINFORCE models (R1-R3) stagnated early, with their geometric mean scores peaking at sub-optimal levels. This indicates a failure to consistently survive or explore deeply. Conversely, the PPO agents demonstrated excellent scalability. P1 immediately outperformed the best REINFORCE model, and P2 pushed the mean reward significantly higher. Interestingly, while P3 saw a slight dip in overall mean reward, this was a trade-off for deeper exploration, as it was the only agent to successfully craft stone tools. PPO's clipped objective function allowed it to absorb the complexity of the LSTM and Reward Shaping without collapsing.
A direct comparison between R3 and P3 highlights the disparity in algorithmic capability within Crafter. The reward trajectory graph below shows R3 remaining relatively flat and highly volatile, indicative of an agent stuck in a localized, sub-optimal policy (likely just foraging for basic survival). P3, however, shows a steady climb in cumulative reward before exploring riskier, complex tasks. The achievement breakdown chart confirms this: R3 rarely progresses past basic wood collection, whereas P3 successfully executes the long-term planning required to craft wood and stone pickaxes.
This experiment highlights the core differences between the algorithms.
- Stability
  - REINFORCE: Extremely low. The high variance of Monte Carlo returns led to unstable training, and the critic in the A2C variant frequently failed to converge when architectural complexity was increased.
  - PPO: Very high. The clipped surrogate objective allowed for consistent, stable policy updates, preventing catastrophic forgetting even when adding complex components like an LSTM and custom reward wrappers.
- Sample Efficiency
  - REINFORCE: Very low. The agent failed to learn basic crafting sequences even after extensive training, requiring significantly more data to find even marginal policy improvements.
  - PPO: High. The agent learned basic crafting within the initial training steps and showed clear, rapid progress in survival time compared to the REINFORCE baseline.
- Scalability
  - REINFORCE: Poor. Overall performance stagnated or degraded as complexity (memory, shaping) was added, likely due to fundamental gradient instability.
  - PPO: Excellent. Performance scaled positively with architectural additions, effectively utilizing the LSTM to solve partial observability and utilizing reward shaping to find deep exploration milestones.
Our iterative experiments demonstrated that standard REINFORCE and its A2C variant are ill-suited for the high-variance, partially observable, and sparse-reward challenges presented by the Crafter environment. In contrast, PPO provided a highly stable foundation for learning. The addition of an LSTM to the PPO architecture was the most significant improvement in our pipeline, effectively solving partial observability and unlocking the multi-step planning required for mid-game achievements. Finally, while reward shaping slightly disrupted short-term reward maximization, it provided the necessary guidance to achieve the most complex milestone of the project: crafting stone tools. Future work could focus on tuning the reward shaping weights to balance optimal survival with deep exploration.
| Parameter | REINFORCE (R3) | PPO (P3) |
|---|---|---|
| Learning Rate | 1e-4 | 3e-4 |
| n_steps | N/A (Full Episode) | 2048 |
| batch_size | N/A | 64 |
| ent_coef | 0.01 | 0.01 |
- D. Hafner, et al., "Crafter: A Benchmarking Environment for Open-World Generalization," 2021. [Online]. Available: https://arxiv.org/abs/2109.06780.
- J. Schulman, et al., "Proximal Policy Optimization Algorithms," 2017. [Online]. Available: https://arxiv.org/abs/1707.06347.
- A. Raffin, et al., "Stable-Baselines3: Reliable Reinforcement Learning Implementations," Journal of Machine Learning Research, 2021.