The DQN algorithm uses a neural network to approximate the Q-value function, which is then used to select the optimal action in a given state.
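As a minimal sketch (assuming a PyTorch setup and a hypothetical environment with `state_dim` observations and `n_actions` discrete actions; names are illustrative), the Q-network maps a state to one Q-value per action, and the greedy action is the argmax:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection: pick the action with the highest estimated Q-value.
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)            # a single (batched) observation
action = q_net(state).argmax(dim=1)  # optimal action under the current estimate
```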
PPO is a model-free, on-policy RL algorithm that works well in both discrete and continuous action spaces. PPO uses an actor-critic framework with two networks: an actor (the policy network) and a critic (the value function network).
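To make the actor side concrete, here is a hedged sketch of PPO's clipped surrogate loss (assuming PyTorch, pre-computed advantages, and log-probabilities from the old and new policies; the critic would be trained separately with a value regression loss):

```python
import torch

def ppo_actor_loss(new_log_probs: torch.Tensor,
                   old_log_probs: torch.Tensor,
                   advantages: torch.Tensor,
                   clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective: discourage updates that move the
    probability ratio too far from 1."""
    ratio = torch.exp(new_log_probs - old_log_probs)                # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # maximize objective = minimize its negative
```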
The Q value can be learned by parameterizing the Q function with a neural network.
This leads us to Actor Critic Methods, where:
The “Critic” estimates the value function. This could be the action-value (the Q value) or state-value (the V value).
The “Actor” updates the policy distribution in the direction suggested by the Critic (such as with policy gradients).
and both the Critic and Actor functions are parameterized with neural networks. When the Critic network parameterizes the Q value, the method is called Q Actor Critic.
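A hedged sketch of one Q Actor Critic update step (assuming PyTorch, an `actor` that returns a distribution over actions, and a `critic` that returns Q(s, a); all names here are illustrative, not from a specific library):

```python
import torch

def q_actor_critic_step(actor, critic, actor_opt, critic_opt,
                        state, action, reward, next_state, gamma=0.99):
    # --- Critic: move Q(s, a) toward the one-step TD target. ---
    with torch.no_grad():
        next_action = actor(next_state).sample()
        td_target = reward + gamma * critic(next_state, next_action)
    critic_loss = (critic(state, action) - td_target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # --- Actor: policy gradient weighted by the Critic's Q estimate. ---
    log_prob = actor(state).log_prob(action)
    actor_loss = -(log_prob * critic(state, action).detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```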
Deep Deterministic Policy Gradient (DDPG) is a model-free off-policy algorithm for learning continuous actions.
It combines ideas from DPG (Deterministic Policy Gradient) and DQN (Deep Q-Network). It uses Experience Replay and slow-learning target networks from DQN, and it is based on DPG, which can operate over continuous action spaces.
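For intuition, here is a hedged sketch of a DDPG-style deterministic actor (assuming PyTorch and actions bounded in [-max_action, max_action]); unlike a stochastic policy, it outputs an action directly rather than a distribution:

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """DDPG-style actor: state -> a single continuous action (no sampling)."""
    def __init__(self, state_dim: int, action_dim: int, max_action: float, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.max_action * self.net(state)        # scale to the action bounds
```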
DDPG uses two more techniques not present in the original DQN:
First, it uses two Target networks.
Why? Because they add stability to training. In short, we are learning from estimated targets, and the target networks are updated slowly, which keeps those estimated targets stable.
Conceptually, this is like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better", as opposed to saying "I'm going to re-learn how to play this entire game after every move". See this StackOverflow answer.
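A hedged sketch of the slow ("soft") target update used in DDPG (assuming PyTorch modules; `tau` is the small mixing factor, commonly around 0.005):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau: float = 0.005):
    """Move each target parameter a small step toward the online parameter,
    keeping the learning targets slow-moving and stable."""
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)
```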
Second, it uses Experience Replay.
We store a list of tuples (state, action, reward, next_state), and instead of learning only from recent experience, we learn by sampling from all of the experience accumulated so far.
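A minimal replay buffer sketch (plain Python; a real implementation might also cap memory differently or store tensors directly):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) tuples and samples them uniformly."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is dropped when full

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```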