To solve the Banana World problem, I implemented a Deep Q-Network. The parameters, network architecture, and learning algorithm are described below.
| Hyperparameter | Value |
| --- | --- |
| Learning rate | 1e-4 |
| Step-size for soft update | 1e-3 |
| Discount rate | 0.99 |
| Update target every ... episodes | 4 |
| Minibatch size | 64 |
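For convenience, these hyperparameters could be grouped in a single configuration dictionary; the key names below are illustrative and not necessarily those used in the training script:

```python
# Illustrative grouping of the hyperparameters above; the key names are assumptions.
HYPERPARAMETERS = {
    "learning_rate": 1e-4,   # optimizer step size for the local network
    "tau": 1e-3,             # step size for the soft update of the target network
    "gamma": 0.99,           # discount rate
    "update_every": 4,       # episodes between target-network updates
    "batch_size": 64,        # minibatch size sampled from the replay buffer
}
```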
I used an epsilon-greedy policy, represented by the class `DecayEpsilonGreedy`, which resembles the following code:
```python
from dataclasses import dataclass


@dataclass
class DecayEpsilonGreedy:
    epsilon_min: float
    epsilon_decay_rate: float
    epsilon: float

    def step(self, time_step: int) -> float:
        # Multiplicative decay, clipped at epsilon_min.
        self.epsilon *= self.epsilon_decay_rate
        self.epsilon = max(self.epsilon, self.epsilon_min)
        return self.epsilon
```
The step method is called after every time step, updating and returning the value of epsilon.
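As a usage illustration (the decay values below are assumptions, not the settings used in this project), the policy could be stepped like this:

```python
# Assumed example values; the project's actual epsilon schedule may differ.
policy = DecayEpsilonGreedy(epsilon_min=0.01, epsilon_decay_rate=0.995, epsilon=1.0)

for t in range(3):
    eps = policy.step(t)  # decay once per time step
    print(f"t={t}: epsilon={eps:.4f}")
```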
The model used to estimate the action-value function had the following configuration:
```
Linear(in_features=37, out_features=64)
ReLU()
Linear(in_features=64, out_features=128)
ReLU()
Linear(in_features=128, out_features=256)
ReLU()
Linear(in_features=256, out_features=256)
ReLU()
Linear(in_features=256, out_features=256)
ReLU()
Linear(in_features=256, out_features=64)
ReLU()
Linear(in_features=64, out_features=4)
```
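One way to express this architecture in PyTorch is the `nn.Sequential` sketch below; this is a reconstruction from the layer listing above, not necessarily the exact module definition used in the project:

```python
import torch.nn as nn

# Reconstruction of the listed architecture: 37-dimensional state in, 4 action-values out.
q_network = nn.Sequential(
    nn.Linear(37, 64),
    nn.ReLU(),
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 4),
)
```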
The learning algorithm used was the Deep Q-Network (DQN). The algorithm stands out for its use of an experience replay buffer and a secondary network that serves as the target estimate during training. The replay buffer stores each experience (current state, action taken, reward received, and next state) so the agent can reuse it later during learning; a minimal sketch of such a buffer is shown below. The target network has the same architecture as the local network, but its parameters are only updated every C episodes (in this project, C=4).
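The following is a minimal replay-buffer sketch, assuming a deque-based implementation with uniform random sampling; the class and method names are illustrative and not necessarily those used in the project code:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of experience tuples (state, action, reward, next_state, done)."""

    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are discarded first

    def add(self, state, action, reward, next_state, done):
        # The done flag is commonly stored as well, although it is not listed above.
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Uniformly sample a minibatch of stored experiences.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```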
Whenever the target network is updated, a minibatch of experiences is pulled from the replay buffer. The estimated action-values for these experiences are then calculated with each network: the local network estimates the value of the action actually taken, while the target network provides the value used in the temporal-difference target. The loss is the mean squared error between these estimates. Finally, the local network weights are updated through stochastic gradient descent, while the target network weights are soft-updated with a constant tau (in this project, I used tau=1e-3). This process is easier to follow in the code file; a simplified sketch of the update step is shown below.
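For reference, a simplified sketch of this update step (assuming PyTorch; the function and variable names are illustrative, not taken from the project code) could look like the following:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount rate
TAU = 1e-3    # soft-update step size


def learn(local_net, target_net, optimizer, states, actions, rewards, next_states, dones):
    # Expected shapes: actions is a LongTensor of shape [batch, 1];
    # rewards and dones are FloatTensors of shape [batch, 1].

    # TD target: r + gamma * max_a' Q_target(s', a'), zeroed for terminal states.
    with torch.no_grad():
        q_targets_next = target_net(next_states).max(dim=1, keepdim=True)[0]
        q_targets = rewards + GAMMA * q_targets_next * (1 - dones)

    # Q-values estimated by the local network for the actions actually taken.
    q_expected = local_net(states).gather(1, actions)

    # Mean squared error between the two estimates, minimized by gradient descent.
    loss = F.mse_loss(q_expected, q_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update: the target weights slowly track the local weights.
    for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):
        target_param.data.copy_(TAU * local_param.data + (1.0 - TAU) * target_param.data)
```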
The environment was solved in 266 episodes, achieving an average score of 13.0 over the next 100 episodes. When the goal average score over the last 100 episodes was set to 15.0, the agent achieved it after 574 episodes. The following gif demonstrates the agent performing in the environment.
Agent trained to reach the average score of 15.0
The score values for each episode are shown in the following figures.
Scores plot with agent trained to reach the average score of 13.0
Scores plot with agent trained to reach the average score of 15.0
To better understand the state of the art for reinforcement learning with nonlinear function approximation, it would be interesting to compare these results with other methods, such as Dueling DQN, Double DQN, and Prioritized Experience Replay. There is also the challenge of solving this environment from image data, which is harder and would be a valuable exercise to solidify the acquired knowledge.