Deep Q-Networks on Atari Games

Author: Maxim Bocquillion
Date: November 2025

Introduction

The goal of this project is to implement a Deep Q-Network (DQN) agent as described in the seminal paper Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013). The objective is to train an agent to play Atari games from the Arcade Learning Environment (ALE) directly from raw pixels, learning control policies without hand-crafted features.

The project explores three distinct network architectures across two games: Pong and Breakout.

Installation & Usage

To reproduce the experiments, configure the game and architecture inside main.py.

  1. Install dependencies:
     pip install gymnasium[atari] ale-py torch numpy matplotlib imageio
  2. Configuration (in main.py); a sketch of this block follows below:
     • Set game = "Pong" or "Breakout"
     • Set NETWORK_TYPE = "original", "standard", or "dueling"
  3. Run training:
     python main.py
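
For reference, a minimal sketch of what this configuration block at the top of main.py might look like; the variable names follow the list above, but the actual file may organize them differently:

```python
# Hypothetical configuration block in main.py (names follow the README;
# the real file may differ in layout).
game = "Pong"              # or "Breakout"
NETWORK_TYPE = "dueling"   # "original", "standard", or "dueling"
```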

Implementation Overview

All experiments share a common training pipeline designed to run efficiently on a laptop while remaining faithful to the core DQN algorithm.

Preprocessing

Raw frames (210×160) are converted to grayscale, downsampled, and cropped to 84×84.
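
A minimal sketch of this preprocessing step, assuming OpenCV (cv2) for the conversion and resize; the exact downsampling and crop used in the project may differ:

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Convert a raw 210x160x3 ALE frame into an 84x84 grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                        # 210x160
    resized = cv2.resize(gray, (84, 110), interpolation=cv2.INTER_AREA)   # 110x84
    cropped = resized[18:102, :]                                          # keep the 84x84 playing area
    return cropped.astype(np.uint8)
```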

Frame Stacking

A stack of the last 4 frames is used as input to capture motion.
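
One possible implementation of the frame stack, using a fixed-length deque (the project may instead rely on a gymnasium wrapper):

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keeps the last k preprocessed frames and exposes them as a (k, 84, 84) array."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        # Fill the stack with copies of the first frame of the episode.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames, axis=0)

    def step(self, frame: np.ndarray) -> np.ndarray:
        # Push the newest frame; the oldest one is dropped automatically.
        self.frames.append(frame)
        return np.stack(self.frames, axis=0)
```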

Replay Buffer

Transitions are stored in a replay memory. Due to laptop RAM limits, the replay memory is capped at 250k transitions (vs. 1M in the original DQN).
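
A minimal uniform-sampling buffer capped at 250k transitions, as described above; the project's implementation may store frames more compactly (e.g., as uint8 arrays) to fit in RAM:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity replay memory; the oldest transitions are discarded once full."""

    def __init__(self, capacity: int = 250_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```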

Optimization

Training uses RMSProp and an ε-greedy exploration strategy decaying from 1.0 → 0.1.
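
A sketch of the optimizer setup and a linear epsilon schedule; the learning rate and decay horizon shown here are placeholder values, not necessarily those used in main.py:

```python
import torch

def make_optimizer(q_network: torch.nn.Module) -> torch.optim.Optimizer:
    # RMSProp as in the original DQN; the learning rate is an assumed placeholder.
    return torch.optim.RMSprop(q_network.parameters(), lr=2.5e-4)

def epsilon_at(step: int, eps_start: float = 1.0, eps_end: float = 0.1,
               decay_steps: int = 500_000) -> float:
    """Linearly anneal epsilon from 1.0 down to 0.1 over the first decay_steps steps."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```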

Evaluation Protocol

  • Evaluations use a fully greedy policy where ε is set to 0.
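
For illustration, a hypothetical ε-greedy action helper; with epsilon = 0 it reduces to the fully greedy policy used during evaluation:

```python
import random

import torch

def act(q_network: torch.nn.Module, state: torch.Tensor,
        num_actions: int, epsilon: float = 0.0) -> int:
    """Epsilon-greedy action selection; epsilon=0 always picks argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_network(state.unsqueeze(0)).argmax(dim=1).item())
```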

Double DQN

Double DQN logic is applied: action selection and target evaluation are decoupled.
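
A sketch of how the Double DQN target can be computed, with the online network selecting the next action and the target network evaluating it (the discount factor gamma = 0.99 is an assumption):

```python
import torch

def double_dqn_targets(rewards: torch.Tensor, next_states: torch.Tensor,
                       dones: torch.Tensor, online_net: torch.nn.Module,
                       target_net: torch.nn.Module, gamma: float = 0.99) -> torch.Tensor:
    """Compute r + gamma * Q_target(s', argmax_a Q_online(s', a)) for non-terminal s'."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # action selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # action evaluation
        return rewards + gamma * next_q * (1.0 - dones)
```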

Constraints

While the original DQN studies typically train for around 10-50 million frames, this study is limited to 1M frames for Breakout and 1.5M frames for Pong, together with a replay buffer of 250k transitions, due to both hardware and time constraints. These constraints largely explain the relatively low scores of the agents tested.

Architecture

The "Original" NIPS 2013 Architecture (original_dqn.py)

This phase reproduces the architecture from Mnih et al. (2013). It is relatively shallow, meant to demonstrate that a CNN can learn control from pixels; a sketch of the module follows the layer list below.

  • Conv1: 16 filters (8×8), stride 4, ReLU
  • Conv2: 32 filters (4×4), stride 2, ReLU
  • FC1: 256 units
  • Output: One Q-value per action
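
A sketch of how original_dqn.py might define this network in PyTorch (the input scaling and layer naming are assumptions):

```python
import torch.nn as nn

class OriginalDQN(nn.Module):
    """NIPS 2013 network: two conv layers, one 256-unit hidden layer."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),   # 84x84 input -> 9x9 feature maps
            nn.Linear(256, num_actions),
        )

    def forward(self, x):
        return self.head(self.features(x / 255.0))
```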

The "Standard" Nature 2015 Architecture (standard_dqn.py)

This model corresponds to the deeper architecture used in the Nature 2015 paper. It adds capacity and is far more expressive; a sketch follows the layer list below.

  • Conv1: 32 filters (8×8), stride 4
  • Conv2: 64 filters (4×4), stride 2
  • Conv3: 64 filters (3×3), stride 1
  • FC1: 512 units
  • Output: Q-values
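
A corresponding sketch for standard_dqn.py, under the same assumptions:

```python
import torch.nn as nn

class StandardDQN(nn.Module):
    """Nature 2015 network: three conv layers, one 512-unit hidden layer."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 84x84 input -> 7x7 feature maps
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        return self.head(self.features(x / 255.0))
```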

The Dueling Network Architecture (dueling_dqn.py)

A Dueling DQN separates the value and advantage estimation paths:

  • Value Stream: estimates V(s)
  • Advantage Stream: estimates A(s,a)

These combine via:

$Q(s,a) = V(s) + \left( A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \right)$

  • Convolutional backbone identical to the "standard" model
  • Two 512-unit fully connected heads
  • Custom dueling aggregation layer
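
A sketch of how dueling_dqn.py might implement the two streams and the aggregation above (again, details such as input scaling are assumptions):

```python
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling network: shared conv backbone, separate value and advantage streams."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.value = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions))

    def forward(self, x):
        h = self.features(x / 255.0)
        v = self.value(h)        # V(s), shape (batch, 1)
        a = self.advantage(h)    # A(s, a), shape (batch, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)   # dueling aggregation
```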

Experiments and results

Results: Pong

Original

  • Training Steps: 1.5M
  • Final Outcome: 11.0

Pong Original DQN - Video

Pong Original DQN - Graph

During the first 400,000 training steps (≈400 episodes), the agent remained stuck around a mean reward of –20 to –18, which is expected since a Pong agent initially loses every point. This corresponds to a long exploration phase in which the agent has not yet learned to return the ball reliably.

After around 500,000 steps, ε reached its minimum value of 0.1, and the agent began to steadily improve. The mean reward progressively increased from about –17 to approximately –3 by the end of training. Although the mean reward never reached positive values during training, the agent did begin to score points more consistently, even achieving a positive episode reward of +9 around episode 800.

The training was cut short at 1.5M steps due to hardware limits, but when evaluated with a greedy (ε = 0) policy, the final agent achieved a score of +11, demonstrating that the underlying policy learned to win reliably, despite the mean reward during training remaining negative.

Standard

  • Training Steps: 1.5M
  • Final Outcome: 17.0

Pong Standard DQN - Video

Pong Standard DQN - Graph

For the first 300,000 steps (≈325 episodes), the agent remained in a pure exploration phase, with a mean reward consistently around –20, which corresponds to losing nearly every point of every match.

Starting around 350,000 steps, the agent finally begins to escape this plateau. The mean reward improves from –19.64 (episode 300) to –17.98 (episode 350), indicating that the agent is learning to occasionally return the ball rather than losing immediately.

As ε continues to decrease, the agent steadily improves. By episode 450, the mean reward reaches –13.18, and by episode 500, it rises further to –11.36. In the later stages of training, once ε has reached its minimum value (0.1), the agent makes significant progress, going from –9.90 at episode 550 to –2.96 at episode 650.

Although the mean reward never becomes positive during training, the agent clearly learns a strong policy. When evaluated with a greedy policy (ε = 0), the final agent achieves a total score of 17, corresponding to a win of 21–4.

Dueling

  • Training Steps: 1.5M
  • Final Outcome: +20

Pong Dueling DQN - Video

Pong Dueling DQN - Graph

Training the dueling DQN agent on Pong showed the typical behavior observed in the literature when using pixel-based Atari input:

  • We observe a very long plateau for the first million training steps where the agent remained stuck around –20 to –21 reward, indicating that it only learned to survive marginally but had not yet discovered reliably winning trajectories.

A breakthrough finally appears around 1.1M steps, where the mean reward jumps from 5.20 to around 14.14 within a span of roughly 120k steps.

The agent then stabilizes around a mean reward of +15 with ε at its minimum of 0.1.

The final evaluation episode (where ε = 0) reached a score of +20, meaning the agent lost only a single point during the entire game.

However, the video shows signs of overfitting to the opponent's deterministic serves after it loses a point.

Results: Breakout

Original

  • Training Steps: 1M
  • Training Time: 1h10
  • Final Outcome: 45.0

Breakout Original DQN - Video

Breakout Original DQN - Graph

From the start of training until around 300,000 steps (≈1600 episodes), the agent stayed in a low-performance plateau, with mean rewards consistently below 2.0.

After this point, the agent entered a clear learning phase: the mean reward began to rise, reaching 5–10 between 400k and 500k steps, and continuing to improve as ε annealed to its minimum value. In the final portion of training (700k–1M steps), the agent stabilized with mean rewards in the range of 11 to 19, consistent with the original DQN’s behavior on Breakout.

When evaluated greedily (ε = 0), the final policy achieved a score of 45, demonstrating a strong and fully developed breakout strategy.

Standard

  • Training Steps: 1M
  • Training Time: 1h27
  • Final Outcome: 103.0

Breakout Standard DQN - Video

Breakout Standard DQN - Graph

During the first 300,000 steps, the agent remains in a low-performance plateau, with mean rewards staying under 2.0. This corresponds to the exploration-heavy part of training, where the agent only learns to occasionally hit the ball.

A small improvement appears slightly before 300k steps, with occasional high rewards (e.g., 5.0 at episodes 700 and 1050), but the mean reward remains low and unstable.

After roughly 320,000–350,000 steps, the agent enters a clear learning phase.

Once ε reaches 0.1 (from ~500k steps onward), the agent stabilizes around mean rewards of 14–20, with episode rewards occasionally exceeding 60 and the mean reward peaking at 29 (episode 2550).

Finally, during the evaluation with ε = 0, the agent reaches a score of 103.

Dueling

  • Training Steps: 1M
  • Training Time : 1h38
  • Final Outcome: +93

Breakout Dueling DQN - Video

Breakout Dueling DQN - Graph

During training on Breakout, the first 300k steps are dominated by exploration, with mean rewards around 1.0–1.6.

From 300k to 700k steps, rewards increase steadily as the replay buffer fills with higher-quality transitions.

By ~700k steps, the agent achieved episodes scoring more than 65 points, and it maintained a mean reward above 30 until the end of training.

During the final evaluation episodes, the agent's reward peaks at 93.

Recap

Final Performance Summary — Pong

| Architecture | Training Steps | Mean Reward During Training | Final Greedy Evaluation | Notes |
| --- | --- | --- | --- | --- |
| Original (2013) | 1.5M | –20 → –3 | +11 | Learns slowly; clearly undertrained vs. literature |
| Standard (2015) | 1.5M | –20 → –3 | +17 | Stronger feature extractor leads to faster gains |
| Dueling DQN | 1.5M | –20 plateau → +15 | +20 | Best performance on Pong; big jump after 1M steps |

Final Performance Summary — Breakout

| Architecture | Training Steps | Mean Reward During Training | Final Greedy Evaluation | Notes |
| --- | --- | --- | --- | --- |
| Original (2013) | 1.0M | <2 → ~15 | 45 | Good performance despite short training |
| Standard (2015) | 1.0M | <2 → 14–20 | 103 | Best architecture on Breakout |
| Dueling DQN | 1.0M | ~1.5 → >30 | 93 | Strong results; slightly behind Standard on Breakout |

Conclusion

This project successfully reproduces the behavior of early Deep Reinforcement Learning algorithms on both Pong and Breakout under strict hardware and training-time constraints.

From what we have observed, the three models suffered from long plateaus at the beginning, but still reached decent performance given the time and hardware constraints.

  • Original (2013) learns slowly, with long plateaus, but still reaches competent greedy performance.
  • Standard (2015) benefits from its deeper convolutional stack, achieving the best results on Breakout within the limited training budget.
  • Dueling provides the best results on Pong.

Despite hardware limits restricting training to 1M steps for Breakout and 1.5M for Pong (vs. 10M or more in the literature), all three architectures still learn reliable strategies, with performance trends consistent with the original 2013 paper.
