To solve the Banana World problem, I implemented a Deep Q-Network. The parameters, network architecture, and learning algorithm are described below.
| Hyperparameter | Value |
| --- | --- |
| Learning rate | 1e-4 |
| Step-size for soft update | 1e-3 |
| Discount rate | 0.99 |
| Update target every ... episodes | 4 |
| Minibatch size | 64 |
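For convenience, these hyperparameters could be grouped in a single configuration dictionary; the key names below are illustrative and not necessarily those used in the training script:

```python
# Illustrative grouping of the hyperparameters above; the key names are assumptions.
HYPERPARAMETERS = {
    "learning_rate": 1e-4,   # optimizer step size for the local network
    "tau": 1e-3,             # step size for the soft update of the target network
    "gamma": 0.99,           # discount rate
    "update_every": 4,       # episodes between target-network updates
    "batch_size": 64,        # minibatch size sampled from the replay buffer
}
```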
I used an epsilon-greedy policy, represented by the class `DecayEpsilonGreedy`, which resembles the following code:
```python
from dataclasses import dataclass


@dataclass
class DecayEpsilonGreedy:
    epsilon_min: float
    epsilon_decay_rate: float
    epsilon: float

    def step(self, time_step: int) -> float:
        # Multiplicative decay, clipped at epsilon_min.
        self.epsilon *= self.epsilon_decay_rate
        self.epsilon = max(self.epsilon, self.epsilon_min)
        return self.epsilon
```
The step method is called after every time step, updating and returning the value of epsilon.
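As a usage illustration (the decay values below are assumptions, not the settings used in this project), the policy could be stepped like this:

```python
# Assumed example values; the project's actual epsilon schedule may differ.
policy = DecayEpsilonGreedy(epsilon_min=0.01, epsilon_decay_rate=0.995, epsilon=1.0)

for t in range(3):
    eps = policy.step(t)  # decay once per time step
    print(f"t={t}: epsilon={eps:.4f}")
```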
The model used to estimate the action-value function had the following configuration:
```
Linear(in_features=37, out_features=64)
ReLU()
Linear(in_features=64, out_features=128)
ReLU()
Linear(in_features=128, out_features=256)
ReLU()
Linear(in_features=256, out_features=256)
ReLU()
Linear(in_features=256, out_features=256)
ReLU()
Linear(in_features=256, out_features=64)
ReLU()
Linear(in_features=64, out_features=4)
```
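One way to express this architecture in PyTorch is the `nn.Sequential` sketch below; this is a reconstruction from the layer listing above, not necessarily the exact module definition used in the project:

```python
import torch.nn as nn

# Reconstruction of the listed architecture: 37-dimensional state in, 4 action-values out.
q_network = nn.Sequential(
    nn.Linear(37, 64),
    nn.ReLU(),
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 4),
)
```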
The learning algorithm used was the Deep Q-Network (DQN). The algorithm stands out for its use of an experience replay buffer and a secondary network that serves as the target estimate during training. The replay buffer stores each experience (current state, action taken, reward received, and next state) so the agent can reuse it later during learning; a minimal sketch of such a buffer is shown below. The target network has the same architecture as the local network, but its parameters are only updated every C episodes (in this project, C=4).
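The following is a minimal replay-buffer sketch, assuming a deque-based implementation with uniform random sampling; the class and method names are illustrative and not necessarily those used in the project code:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of experience tuples (state, action, reward, next_state, done)."""

    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are discarded first

    def add(self, state, action, reward, next_state, done):
        # The done flag is commonly stored as well, although it is not listed above.
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Uniformly sample a minibatch of stored experiences.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```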
Whenever the target network is updated, a minibatch of experiences is pulled from the replay buffer. The estimated action-values for these experiences are then calculated with each network: the local network estimates the value of the action actually taken, while the target network provides the value used in the temporal-difference target. The loss is the mean squared error between these estimates. Finally, the local network weights are updated through stochastic gradient descent, while the target network weights are soft-updated with a constant tau (in this project, I used tau=1e-3). This process is easier to follow in the code file; a simplified sketch of the update step is shown below.
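For reference, a simplified sketch of this update step (assuming PyTorch; the function and variable names are illustrative, not taken from the project code) could look like the following:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount rate
TAU = 1e-3    # soft-update step size


def learn(local_net, target_net, optimizer, states, actions, rewards, next_states, dones):
    # Expected shapes: actions is a LongTensor of shape [batch, 1];
    # rewards and dones are FloatTensors of shape [batch, 1].

    # TD target: r + gamma * max_a' Q_target(s', a'), zeroed for terminal states.
    with torch.no_grad():
        q_targets_next = target_net(next_states).max(dim=1, keepdim=True)[0]
        q_targets = rewards + GAMMA * q_targets_next * (1 - dones)

    # Q-values estimated by the local network for the actions actually taken.
    q_expected = local_net(states).gather(1, actions)

    # Mean squared error between the two estimates, minimized by gradient descent.
    loss = F.mse_loss(q_expected, q_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update: the target weights slowly track the local weights.
    for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):
        target_param.data.copy_(TAU * local_param.data + (1.0 - TAU) * target_param.data)
```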
The environment was solved in 266 episodes, achieving an average score of 13.0 over the next 100 episodes. When the goal average score over the last 100 episodes was set to 15.0, the agent achieved it after 574 episodes. The following gif demonstrates the agent performing in the environment.
Agent trained to reach the average score of 15.0
The score values for each episode are shown in the following figures.
Scores plot with agent trained to reach the average score of 13.0
Scores plot with agent trained to reach the average score of 15.0
To better understand the state of the art for reinforcement learning with nonlinear function approximation, it would be interesting to compare these results with other methods, such as Dueling DQN, Double DQN, and Prioritized Experience Replay. There is also the challenge of solving this environment from image data, which is harder and would be a valuable exercise to solidify the acquired knowledge.