Authors: Armando Abelho & Pravir Padayachee
Institution: Wits University, BSc Hons Computer Science
Analysis of the training data reveals a stark contrast in sample efficiency and stability between the two algorithms. The base REINFORCE agent failed to learn complex behaviors, plateauing at a minimal mean reward due to the high variance of Monte Carlo returns preventing stable convergence. In contrast, the PPO architecture provided a stable learning foundation. The integration of an LSTM to handle partial observability resulted in a significant breakthrough, allowing the PPO agent to learn multi-step planning. Ultimately, with the addition of reward shaping, the PPO agent achieved advanced milestones such as crafting stone tools, demonstrating the necessity of memory and guided exploration in sparse-reward environments.
The Crafter environment is a complex reinforcement learning environment with volatile dynamics, multiple paths for exploration, and the constant possibility of death. We decided to direct our analysis toward comparing foundational function approximation and actor-critic methods with (comparatively) state-of-the-art approaches. To achieve this, we implemented the following two algorithms:
- REINFORCE: A foundational Monte Carlo policy gradient method implemented from scratch using PyTorch.
- Proximal Policy Optimisation (PPO): A state-of-the-art actor-critic method implemented using the Stable Baselines3 library.
We use standard Convolutional Neural Networks (CNNs) to process the image-based observations and train our models with general RL training methods. Our models underwent two stages of architectural and algorithmic enhancements to tackle the challenges of high variance, partial observability, and sparse rewards; the kind of CNN encoder we mean is sketched below.
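As a concrete, illustrative sketch of such an encoder (assuming 64×64 RGB observations as in Crafter; the layer sizes are indicative rather than our exact trained architecture):

```python
import torch
import torch.nn as nn

class CnnEncoder(nn.Module):
    """Small convolutional encoder for image observations (sizes are illustrative)."""
    def __init__(self, in_channels: int = 3, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy 64x64 input, then project to feature_dim.
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, in_channels, 64, 64)).shape[1]
        self.fc = nn.Sequential(nn.Linear(n_flat, feature_dim), nn.ReLU())

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, channels, height, width), scaled to [0, 1]
        return self.fc(self.conv(obs))
```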
Our first agent is based on the REINFORCE algorithm. This algorithm has a few known weaknesses, and due to its Monte Carlo nature, it will struggle with the sparse rewards seen in this environment.
The initial agent (R1) used the REINFORCE algorithm, whose updates follow the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$ is the discounted Monte Carlo return from timestep $t$.
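For concreteness, a minimal PyTorch sketch of this update, assuming the log-probabilities and rewards for one completed episode have already been collected; the optimizer and the return normalisation are illustrative choices rather than our exact implementation:

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE update from a single completed episode (illustrative sketch).

    log_probs: list of log pi(a_t | s_t) tensors collected during the rollout.
    rewards:   list of scalar rewards r_t for the same timesteps.
    """
    # Compute discounted Monte Carlo returns G_t by scanning the episode backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Normalising the returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient loss: -sum_t log pi(a_t | s_t) * G_t.
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```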
Initially, the agent showed clear signs of learning: rewards improved steadily as training progressed, but both rewards and survival rates remained low. To reduce variance, we decided to implement a baseline function.
The second agent (R2) was rebuilt as a standard Advantage Actor-Critic (A2C) model. The critic is trained on the mean squared error between its value predictions and the observed returns, and the actor is updated using the advantage function, as covered in class; the combined loss is sketched below.
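A minimal sketch of how the two losses combine, assuming the critic's value predictions and the episode returns are already available; the 0.5 value-loss weight is an illustrative default rather than our exact setting:

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_probs, values, returns, value_coef=0.5):
    """Actor and critic losses for one batch (illustrative sketch).

    log_probs: log pi(a_t | s_t), shape (T,)
    values:    critic predictions V(s_t), shape (T,)
    returns:   Monte Carlo (or bootstrapped) returns G_t, shape (T,)
    """
    # Advantage: how much better the return was than the critic expected.
    advantages = returns - values.detach()

    actor_loss = -(log_probs * advantages).mean()   # policy gradient with a baseline
    critic_loss = F.mse_loss(values, returns)       # MSE between V(s_t) and G_t

    return actor_loss + value_coef * critic_loss
```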
While the introduction of the Advantage Actor-Critic (A2C) architecture theoretically reduces the variance seen in R1, the empirical results in REINFORCE2.jsonl show only marginal improvements. The agent achieved a slightly higher mean reward compared to the base model. However, the agent remained incapable of multi-step planning. It became evident that reducing variance alone is insufficient; the agent's inability to remember past observations in a partially observable environment severely bottlenecked its progress.
The R2 agent still lacks two key components: memory to handle partial observability and a way to overcome sparse rewards. The final iteration (R3) addresses both by adding an LSTM layer (for memory) and our reward shaping wrapper (for guidance).
The final REINFORCE agent (R3) integrates an LSTM layer after the CNN feature extractor. The LSTM's hidden state is fed into the actor and critic heads, providing the agent with memory. We also apply our reward shaping wrapper to provide small bonuses for key intermediate actions (e.g., collecting wood, placing a table).
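A simplified sketch of this architecture, reusing the encoder sketched earlier; module names and sizes are illustrative rather than our exact implementation:

```python
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """CNN features -> LSTM memory -> actor and critic heads (illustrative sketch)."""
    def __init__(self, encoder: nn.Module, feature_dim: int, hidden_dim: int, n_actions: int):
        super().__init__()
        self.encoder = encoder                           # e.g. the CnnEncoder sketched earlier
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.actor = nn.Linear(hidden_dim, n_actions)    # action logits
        self.critic = nn.Linear(hidden_dim, 1)           # state-value estimate

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, channels, height, width)
        b, t = obs_seq.shape[:2]
        feats = self.encoder(obs_seq.flatten(0, 1)).view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)           # carry hidden state across timesteps
        return self.actor(out), self.critic(out).squeeze(-1), hidden
```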
The final R3 agent exhibited highly unstable performance, ultimately achieving a mean reward comparable to earlier iterations. While the LSTM was intended to provide meaningful memory, the added architectural complexity appeared to exacerbate the instability of the critic's learning process. Furthermore, the reward shaping—designed to guide the agent—struggled to yield benefits because the underlying policy optimization remained too unstable to consistently capitalize on those intermediate rewards. The agent still failed to consistently unlock mid-tier achievements.
Our second agent is based on PPO, a more modern and stable algorithm which we expect to outperform REINFORCE quite significantly.
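This stability stems largely from PPO's clipped surrogate objective (Schulman et al., 2017), which bounds how far a single update can move the policy:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

where $\hat{A}_t$ is the estimated advantage and $\epsilon$ is the clipping range.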
The base PPO agent (P1) was implemented using Stable Baselines3.
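A minimal sketch of such a setup, assuming the `crafter` package's registered gym environment and a gym-compatible Stable Baselines3 version; the hyperparameters mirror the appendix table but are illustrative rather than our exact run configuration:

```python
import gym
import crafter  # importing crafter registers the CrafterReward-v1 gym environment
from stable_baselines3 import PPO

env = gym.make("CrafterReward-v1")

# Illustrative hyperparameters (see the appendix table for the final settings).
model = PPO("CnnPolicy", env, learning_rate=3e-4, n_steps=2048,
            batch_size=64, ent_coef=0.01, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_crafter_p1")
```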
Immediately we see strong improvements in the PPO model, with its reward jumping ahead of the base REINFORCE agent within the first 1,000 iterations. We did not hesitate to begin working on improvements to this algorithm.
With a feed-forward policy, PPO agents are traditionally memoryless: each decision is based only on the current observation, which limits multi-step planning under partial observability. The next iteration adds an LSTM to address this, giving the agent both short-term and long-term memory.
The second PPO agent (P2) uses the recurrent CnnLstmPolicy from the Stable Baselines ecosystem. This architecture integrates an LSTM layer after the CNN feature extractor, allowing the agent to base its decisions on a history of observations.
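Within the Stable Baselines3 ecosystem, recurrent PPO policies are provided by the sb3-contrib extension; a minimal sketch, assuming that implementation and illustrative hyperparameters:

```python
import gym
import crafter  # registers the CrafterReward-v1 gym environment
from sb3_contrib import RecurrentPPO  # recurrent policies live in the sb3-contrib package

env = gym.make("CrafterReward-v1")
model = RecurrentPPO("CnnLstmPolicy", env, learning_rate=3e-4, n_steps=2048,
                     batch_size=64, ent_coef=0.01, verbose=1)
model.learn(total_timesteps=1_000_000)
```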
Here we consider two different dimensions. First, rewards have continued to improve over time, and learning was noticeably faster for the agent that utilised an LSTM. We also consider achievement unlock rates: for trivial achievements the two agents are very similar, but the unlock rate of more advanced achievements (such as making a wood pickaxe) has shot up, and our model has begun to learn tasks that require planning over multiple steps.
The P2 agent can now plan, but it still struggles with working towards complex tasks such as stone collection and crafting a stone pickaxe. This is likely due to sparse rewards. It may not explore enough to discover complex crafting recipes.
Our final PPO agent is identical to the previous, but is trained using our RewardShaping wrapper. This provides small +0.1 bonuses for key intermediate steps like collecting stone or placing a furnace.
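The exact bonus schedule is ours, but the following sketch illustrates the idea, assuming the environment's info dict exposes per-achievement counts under an `achievements` key (as Crafter does); the listed achievements and the +0.1 bonus are illustrative:

```python
import gym

class RewardShaping(gym.Wrapper):
    """Adds a small bonus the first time selected intermediate achievements are unlocked.

    Illustrative sketch: assumes the wrapped env's info dict exposes per-achievement
    counts under info["achievements"], as Crafter does.
    """
    BONUS_ACHIEVEMENTS = ("collect_wood", "place_table", "collect_stone", "place_furnace")

    def __init__(self, env, bonus=0.1):
        super().__init__(env)
        self.bonus = bonus
        self._unlocked = set()

    def reset(self, **kwargs):
        self._unlocked = set()
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        for name in self.BONUS_ACHIEVEMENTS:
            if name not in self._unlocked and info.get("achievements", {}).get(name, 0) > 0:
                self._unlocked.add(name)
                reward += self.bonus  # shaped bonus for a key intermediate step
        return obs, reward, done, info
```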
At first it seems as though we have hit a standstill: despite training for longer, our model has not seen an increase in reward and is actually scoring worse. This is where we look to our achievement unlock rates.
Our agent has learned to make stone tools! However, this was accomplished less than 0.1 percent of the time. It is a significant step in the right direction, but it raises the further question of how to cement these skills.
This experiment provides a clear comparison between the REINFORCE and PPO learning frameworks in a complex, partially observable environment.
Table 1 summarizes the final performance of all six agents across our key evaluation metrics. (Exact quantitative data pending.)
| Agent Configuration | Mean Reward | Mean Survival Time | Geometric Mean Score |
|---|---|---|---|
| R1: REINFORCE (Base) | TBD | TBD | TBD |
| R2: REINFORCE + A2C | TBD | TBD | TBD |
| R3: REINFORCE + A2C + LSTM + RS | TBD | TBD | TBD |
| P1: PPO (Base) | TBD | TBD | TBD |
| P2: PPO + LSTM | TBD | TBD | TBD |
| P3: PPO + LSTM + RS | TBD | TBD | TBD |
RS: Reward Shaping. Data from final runs.
The collected data illustrates the clear superiority of the PPO-based agents. The REINFORCE models (R1-R3) stagnated early, with their geometric mean scores peaking at sub-optimal levels. This indicates a failure to consistently survive or explore deeply. Conversely, the PPO agents demonstrated excellent scalability. P1 immediately outperformed the best REINFORCE model, and P2 pushed the mean reward significantly higher. Interestingly, while P3 saw a slight dip in overall mean reward, this was a trade-off for deeper exploration, as it was the only agent to successfully craft stone tools. PPO's clipped objective function allowed it to absorb the complexity of the LSTM and Reward Shaping without collapsing.
A direct comparison between R3 and P3 highlights the disparity in algorithmic capability within Crafter. The reward trajectory graph below shows R3 remaining relatively flat and highly volatile, indicative of an agent stuck in a localized, sub-optimal policy (likely just foraging for basic survival). P3, however, shows a steady climb in cumulative reward before exploring riskier, complex tasks. The achievement breakdown chart confirms this: R3 rarely progresses past basic wood collection, whereas P3 successfully executes the long-term planning required to craft wood and stone pickaxes.
This experiment highlights the core differences between the algorithms.
- Stability
  - REINFORCE: Extremely low. The high variance of Monte Carlo returns led to unstable training, and the critic in the A2C variant frequently failed to converge when architectural complexity was increased.
  - PPO: Very high. The clipped surrogate objective allowed for consistent, stable policy updates, preventing catastrophic forgetting even when adding complex components like an LSTM and custom reward wrappers.
- Sample Efficiency
  - REINFORCE: Very low. The agent failed to learn basic crafting sequences even after extensive training, requiring significantly more data to find even marginal policy improvements.
  - PPO: High. The agent learned basic crafting within the initial training steps and showed clear, rapid progress in survival time compared to the REINFORCE baseline.
- Scalability
  - REINFORCE: Poor. Overall performance stagnated or degraded as complexity (memory, shaping) was added, likely due to fundamental gradient instability.
  - PPO: Excellent. Performance scaled positively with architectural additions, effectively utilizing the LSTM to solve partial observability and utilizing reward shaping to find deep exploration milestones.
Our iterative experiments demonstrated that standard REINFORCE and its A2C variant are ill-suited for the high-variance, partially observable, and sparse-reward challenges presented by the Crafter environment. In contrast, PPO provided a highly stable foundation for learning. The addition of an LSTM to the PPO architecture was the most significant improvement in our pipeline, effectively solving partial observability and unlocking the multi-step planning required for mid-game achievements. Finally, while reward shaping slightly disrupted short-term reward maximization, it provided the necessary guidance to achieve the most complex milestone of the project: crafting stone tools. Future work could focus on tuning the reward shaping weights to balance optimal survival with deep exploration.
| Parameter | REINFORCE (R3) | PPO (P3) |
|---|---|---|
| Learning Rate | 1e-4 | 3e-4 |
| n_steps | N/A (Full Episode) | 2048 |
| batch_size | N/A | 64 |
| ent_coef | 0.01 | 0.01 |
- D. Hafner, et al., "Crafter: A Benchmarking Environment for Open-World Generalization," 2021. [Online]. Available: https://arxiv.org/abs/2109.06780.
- J. Schulman, et al., "Proximal Policy Optimization Algorithms," 2017. [Online]. Available: https://arxiv.org/abs/1707.06347.
- A. Raffin, et al., "Stable-Baselines3: Reliable Reinforcement Learning Implementations," Journal of Machine Learning Research, 2021.