The DQN algorithm uses a neural network to approximate the Q-value function, which is then used to select the optimal action in a given state.
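As a minimal sketch (assuming a PyTorch setup and a hypothetical environment with `state_dim` observations and `n_actions` discrete actions; names are illustrative), the Q-network maps a state to one Q-value per action, and the greedy action is the argmax:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection: pick the action with the highest estimated Q-value.
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)            # a single (batched) observation
action = q_net(state).argmax(dim=1)  # optimal action under the current estimate
```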
PPO is a model-free, on-policy RL algorithm that works well in both discrete and continuous action spaces. PPO uses an actor-critic framework with two networks: an actor (the policy network) and a critic (the value function network).
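To make the actor side concrete, here is a hedged sketch of PPO's clipped surrogate loss (assuming PyTorch, pre-computed advantages, and log-probabilities from the old and new policies; the critic would be trained separately with a value regression loss):

```python
import torch

def ppo_actor_loss(new_log_probs: torch.Tensor,
                   old_log_probs: torch.Tensor,
                   advantages: torch.Tensor,
                   clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective: discourage updates that move the
    probability ratio too far from 1."""
    ratio = torch.exp(new_log_probs - old_log_probs)                # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # maximize objective = minimize its negative
```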
The Q value can be learned by parameterizing the Q function with a neural network.
This leads us to Actor Critic Methods, where:
The “Critic” estimates the value function. This could be the action-value (the Q value) or state-value (the V value).
The “Actor” updates the policy distribution in the direction suggested by the Critic (such as with policy gradients).
and both the Critic and Actor functions are parameterized with neural networks. When the Critic network parameterizes the Q value, the method is called Q Actor Critic.
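A hedged sketch of one Q Actor Critic update step (assuming PyTorch, an `actor` that returns a distribution over actions, and a `critic` that returns Q(s, a); all names here are illustrative, not from a specific library):

```python
import torch

def q_actor_critic_step(actor, critic, actor_opt, critic_opt,
                        state, action, reward, next_state, gamma=0.99):
    # --- Critic: move Q(s, a) toward the one-step TD target. ---
    with torch.no_grad():
        next_action = actor(next_state).sample()
        td_target = reward + gamma * critic(next_state, next_action)
    critic_loss = (critic(state, action) - td_target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # --- Actor: policy gradient weighted by the Critic's Q estimate. ---
    log_prob = actor(state).log_prob(action)
    actor_loss = -(log_prob * critic(state, action).detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```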
Deep Deterministic Policy Gradient (DDPG) is a model-free off-policy algorithm for learning continuous actions.
It combines ideas from DPG (Deterministic Policy Gradient) and DQN (Deep Q-Network). It uses Experience Replay and slow-learning target networks from DQN, and it is based on DPG, which can operate over continuous action spaces.
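For intuition, here is a hedged sketch of a DDPG-style deterministic actor (assuming PyTorch and actions bounded in [-max_action, max_action]); unlike a stochastic policy, it outputs an action directly rather than a distribution:

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """DDPG-style actor: state -> a single continuous action (no sampling)."""
    def __init__(self, state_dim: int, action_dim: int, max_action: float, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.max_action * self.net(state)        # scale to the action bounds
```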
DDPG uses two more techniques not present in the original DQN:
First, it uses two Target networks.
Why? Because they add stability to training. In short, we are learning from estimated targets, and the target networks are updated slowly, which keeps those estimated targets stable.
Conceptually, this is like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better", as opposed to saying "I'm going to re-learn how to play this entire game after every move". See this StackOverflow answer.
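A hedged sketch of the slow ("soft") target update used in DDPG (assuming PyTorch modules; `tau` is the small mixing factor, commonly around 0.005):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau: float = 0.005):
    """Move each target parameter a small step toward the online parameter,
    keeping the learning targets slow-moving and stable."""
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)
```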
Second, it uses Experience Replay.
We store a list of tuples (state, action, reward, next_state), and instead of learning only from recent experience, we learn by sampling from all of the experience accumulated so far.
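A minimal replay buffer sketch (plain Python; a real implementation might also cap memory differently or store tensors directly):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) tuples and samples them uniformly."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is dropped when full

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```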