Definition

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment through trial and error, receiving rewards (positive or negative) as their only feedback.

Policy: The agent's behavior function, a mapping from states to actions.
- on-policy: The policy being learned is the same as the policy used to generate the data.
- off-policy: The policy being learned is different from the policy used to generate the data.
Goal: Maximize the expected return, i.e. the expected cumulative reward.
Markov Decision Process (MDP): Thanks to the Markov property, the agent needs only the current state (not the history of states and actions) to decide on its next action.

Q-learning

Value Function

State-value function: maps a state to its expected return, $V(s)$.
Action-value function: maps a state-action pair to its expected return, $Q(s, a)$.

Bellman Equation

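It's like dynamic programming: the value of a state is expressed recursively as the immediate reward plus the discounted value of the next state.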
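For reference, one common way to write the Bellman expectation equation for the state-value function is given below. The policy symbol $\pi$ and the expectation notation do not appear in these notes and are assumed here; $\gamma$ is the discount factor used in the Temporal Difference section.

$$
V_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s \right]
$$

The same $R_{t+1} + \gamma V(S_{t+1})$ term reappears later as the TD target.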

Epsilon greedy strategy

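A policy that alternates between exploration (random actions) and exploitation (greedy actions): with probability $\epsilon$ it picks a random action, and otherwise it picks the action with the highest estimated value.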
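A minimal sketch of epsilon-greedy action selection, assuming the Q-values for the current state are available as a plain list indexed by action. The function name `epsilon_greedy` and its parameters are illustrative and not taken from any library.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of the current state."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value
    # (the first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice, epsilon is often started near 1 and decayed over training so the agent explores early and exploits later.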

Learning

Monte Carlo

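Learning happens at the end of the episode, once the full return $G_t$ is known.

The estimate settles as $V(S_t)$ is moved step by step toward the observed return $G_t$: $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$, where $\alpha$ is the learning rate.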
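The Monte Carlo update can be sketched as follows, assuming a tabular value function stored in a dict and one finished episode given as parallel lists of states and rewards. The names `discounted_returns` and `mc_update`, and the default `alpha` and `gamma`, are illustrative choices. This is the every-visit variant.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every step t of one finished episode.

    rewards[t] holds R_{t+1}, the reward received after step t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def mc_update(V, states, rewards, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
    for s, g in zip(states, discounted_returns(rewards, gamma)):
        v = V.get(s, 0.0)
        V[s] = v + alpha * (g - v)
    return V
```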

Temporal Difference

$G_t$ is replaced by $R_{t+1} + \gamma V(S_{t+1})$, the TD target, which serves as an estimate of the return. This allows learning at every step instead of waiting for the end of the episode.
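A minimal TD(0) sketch under the same tabular assumption (values stored in a dict, unseen states defaulting to 0). The function name `td0_update` and the `done` flag for terminal transitions are assumptions added for illustration.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: V(S_t) <- V(S_t) + alpha * (TD target - V(S_t))."""
    # Bootstrap from the next state's estimate unless the episode ended.
    td_target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```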

Q-learning

Notice the TD error. It is similar to the loss in deep learning (the difference between the prediction and the target).
The difference: the TD error is not a supervised signal. The target is itself an estimate, and it is continuously updated as information about distant rewards propagates back.
Value information flows from far to near (bottom-up).
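The Q-learning update and its TD error can be sketched as below, assuming a tabular Q-function keyed by `(state, action)` and a fixed number of discrete actions. The function name, the `done` flag, and the default hyperparameters are illustrative.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, done, n_actions,
                      alpha=0.1, gamma=0.99):
    """One Q-learning step on a table Q[(state, action)]."""
    # The target uses the best next action, regardless of which action
    # the behavior policy will actually take (off-policy).
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(n_actions))
    td_target = r + gamma * best_next
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Unseen state-action pairs default to 0.
Q = defaultdict(float)
```

Because the target takes the max over next actions while the agent may act epsilon-greedily, this is an off-policy method in the sense defined earlier.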

DQN

Combine Q-learning with deep learning.

Optimization

Experience Replay

Problem: catastrophic interference.

Experience replay avoids forgetting previous experiences (also known as catastrophic interference or catastrophic forgetting) and reduces the correlation between experiences.

If we feed sequential samples of experience to our neural network, it tends to forget previous experiences as it receives new ones.

It also makes more efficient use of experiences during training, since each one can be reused several times.
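A minimal replay buffer sketch, assuming transitions are stored as `(state, action, reward, next_state, done)` tuples. The class name `ReplayBuffer` and its methods are illustrative, not a specific library's API.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Training would typically start only once the buffer holds at least `batch_size` transitions.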

Fixed Q-targets

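Problem: The Q-network is unstable, because the same parameters produce both the prediction and the TD target, so the two depend on each other.
Solution: Use a separate network with fixed parameters to estimate the TD target.
Copy the parameters from our Deep Q-Network every C steps to update the target network.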
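The periodic copy can be sketched without any deep learning framework by treating the network parameters as an ordinary Python object. Everything here (`TargetNetworkSync`, `sync_every`, `after_gradient_step`) is a hypothetical name for illustration. In a real DQN the parameters would be the network's weights.

```python
import copy


class TargetNetworkSync:
    """Keeps a frozen copy of the online parameters, refreshed every C steps."""

    def __init__(self, online_params, sync_every):
        self.online = online_params
        # The target starts as an independent copy of the online parameters.
        self.target = copy.deepcopy(online_params)
        self.sync_every = sync_every  # the "C" in the notes above
        self.steps = 0

    def after_gradient_step(self):
        """Call once per training step, after the online parameters change."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target = copy.deepcopy(self.online)
```

Between syncs, TD targets are computed from `target`, so they stay fixed while `online` is being updated.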

Double DQN

Problem: If non-optimal actions are regularly given a higher Q-value than the optimal action, learning becomes difficult.
Solution: Use two networks to decouple the selection of the action from its evaluation.
Double DQN acts as a kind of regularization on DQN.

In practice, Double DQN is similar to fixed Q-targets: the second network is like a checkpoint of the first rather than an independently trained network.
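The difference between the two targets can be sketched as below, assuming the next-state Q-values from the online network and from the target network are given as plain lists indexed by action. The function names and the `done` flag are illustrative.

```python
def dqn_target(r, q_target_next, done, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, done, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return r
    # Selection: best next action according to the online network.
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # Evaluation: value of that action according to the target network.
    return r + gamma * q_target_next[best_action]
```

When the two networks disagree about the best action, the Double DQN target is never larger than the standard one, which is how it counteracts overestimated Q-values.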

Policy Gradient

From value-based to policy-based.

In value-based methods, the policy is simple (greedy). The key is to estimate the value function.

But in policy-based methods, we want to learn the policy directly.

Policy Gradient Theorem

In practice the objective is negated to form a loss, because PyTorch optimizers perform gradient descent rather than gradient ascent.
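The sign flip can be illustrated with a REINFORCE-style loss written in plain Python, assuming a softmax policy over discrete actions and precomputed returns for each step. The names `softmax` and `reinforce_loss` are illustrative. In a PyTorch implementation the same scalar would be built from tensors so that autograd can differentiate it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_loss(logits_per_step, actions, returns):
    """Negated policy-gradient objective: -sum_t log pi(a_t | s_t) * G_t."""
    loss = 0.0
    for logits, a, g in zip(logits_per_step, actions, returns):
        log_prob = math.log(softmax(logits)[a])
        # The minus sign turns "maximize expected return" into
        # "minimize loss", which is what a descent optimizer expects.
        loss += -log_prob * g
    return loss
```

Raising the probability of an action that led to a positive return lowers this loss, so minimizing it performs gradient ascent on the original objective.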

Multi-agent RL

Reward engineering problem:
- The reward function becomes too complex in an attempt to force the agent to behave as you want.
- Why is this a problem? You might miss interesting strategies that the agent would find with a simpler reward function.

Curiosity

Sparse rewards problem: most rewards carry no information and are therefore set to zero.
The extrinsic reward function is handmade.

Actor-Critic methods

An Actor that controls how our agent behaves (policy-based method).
A Critic that measures how good the taken action is (value-based method).

Advantage Actor Critic (A2C)