Author: Maxim Bocquillion
Date: November 2025
The goal of this project is to implement a Deep Q-Network (DQN) agent as described in the seminal paper Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013). The objective is to train an agent to play Atari games from the Arcade Learning Environment (ALE) directly from raw pixels, learning control policies without hand-crafted features.
The project explores three distinct network architectures across two games: Pong and Breakout.
To reproduce the experiments, configure the game and architecture inside main.py.
- Install dependencies

  ```bash
  pip install gymnasium[atari] ale-py torch numpy matplotlib imageio
  ```

- Configuration (in main.py)

  Set game = "Pong" or "Breakout"
  Set NETWORK_TYPE = "original", "standard", or "dueling"

- Run training

  ```bash
  python main.py
  ```

All experiments share a common training pipeline designed to run efficiently on a laptop while remaining faithful to the core DQN algorithm.
Raw frames (210×160) are converted to grayscale, downsampled, and cropped to 84×84.
A stack of the last 4 frames is used as input to capture motion.
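For illustration, here is a minimal sketch of this preprocessing and stacking step built on Gymnasium's AtariPreprocessing wrapper; the wrapper resizes directly to 84×84 rather than reproducing the exact crop used in main.py, and the FrameStack helper, the ALE/Pong-v5 id, and the frameskip settings are assumptions for this example:

```python
import collections
import numpy as np
import gymnasium as gym
import ale_py  # provides the ALE environments

gym.register_envs(ale_py)  # recent gymnasium/ale-py versions; older ones register on import

# AtariPreprocessing converts frames to grayscale and resizes them to 84x84; the base env
# is created with frameskip=1 so that the wrapper controls the 4-frame skip itself.
env = gym.make("ALE/Pong-v5", frameskip=1)
env = gym.wrappers.AtariPreprocessing(env, frame_skip=4, screen_size=84, grayscale_obs=True)

class FrameStack:
    """Stacks the last 4 preprocessed frames into a (4, 84, 84) uint8 array."""
    def __init__(self, k: int = 4):
        self.frames = collections.deque(maxlen=k)

    def reset(self, obs: np.ndarray) -> np.ndarray:
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs: np.ndarray) -> np.ndarray:
        self.frames.append(obs)
        return np.stack(self.frames)
```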
Transitions are stored in a replay memory. Due to laptop RAM limits, the replay memory is capped at 250k transitions (vs. 1M in the original DQN).
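A minimal sketch of such a capped replay memory (the Transition fields and the uniform sampling are assumptions about how main.py stores transitions):

```python
import collections
import random

# Hypothetical transition layout; main.py may store these fields differently.
Transition = collections.namedtuple("Transition", "state action reward next_state done")

class ReplayMemory:
    """Uniform-sampling replay buffer capped at 250k transitions (laptop RAM limit)."""
    def __init__(self, capacity: int = 250_000):
        self.buffer = collections.deque(maxlen=capacity)

    def push(self, *args) -> None:
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int) -> Transition:
        batch = random.sample(self.buffer, batch_size)
        return Transition(*zip(*batch))  # batched fields, ready to convert to tensors

    def __len__(self) -> int:
        return len(self.buffer)
```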
Training uses RMSProp and an ε-greedy exploration strategy decaying from 1.0 → 0.1.
Evaluations use a fully greedy policy with ε set to 0.
Double DQN logic is applied: action selection and target evaluation are decoupled.
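A sketch of the ε schedule and the Double DQN target computation (the 500k-step linear decay and γ = 0.99 are illustrative assumptions rather than values confirmed in main.py):

```python
import torch

def linear_epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.1,
                   decay_steps: int = 500_000) -> float:
    """Linear ε decay from 1.0 to 0.1; evaluation episodes use ε = 0 (fully greedy)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones.float()) * next_q
```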
It is worth noting that while the original DQN studies typically train for around 10–50 million frames, this study is limited to 1M frames for Breakout and 1.5M frames for Pong, with a replay buffer of 250k transitions, due to hardware and time constraints. These constraints largely explain the relatively low scores reported below.
This phase reproduces the architecture from Mnih et al. (2013). It is relatively shallow, meant to demonstrate that a CNN can learn control from pixels.
- Conv1: 16 filters (8×8), stride 4, ReLU
- Conv2: 32 filters (4×4), stride 2, ReLU
- FC1: 256 units
- Output: One Q-value per action
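A minimal PyTorch sketch consistent with the layer sizes above (the [0, 1] input scaling and the 32×9×9 flattened size for 84×84 inputs are assumptions; main.py may differ in details):

```python
import torch
import torch.nn as nn

class OriginalDQN(nn.Module):
    """2013-style DQN: two conv layers and a 256-unit hidden layer."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # 84x84 -> 20x20 after conv1 -> 9x9 after conv2, so 32 * 9 * 9 = 2592 features
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x.float() / 255.0))
```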
This model corresponds to the deeper architecture used in the Nature 2015 paper. It adds capacity and is far more expressive.
- Conv1: 32 filters (8×8), stride 4
- Conv2: 64 filters (4×4), stride 2
- Conv3: 64 filters (3×3), stride 1
- FC1: 512 units
- Output: Q-values
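The corresponding sketch, under the same scaling assumption (64×7×7 = 3136 features after the three conv layers on an 84×84 input):

```python
import torch
import torch.nn as nn

class StandardDQN(nn.Module):
    """Nature-2015-style DQN: three conv layers and a 512-unit hidden layer."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 84x84 -> 20x20 -> 9x9 -> 7x7, so 64 * 7 * 7 = 3136 features
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x.float() / 255.0))
```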
A Dueling DQN separates the value and advantage estimation paths:
- Value Stream: estimates V(s)
- Advantage Stream: estimates A(s,a)
These combine via the standard dueling aggregation Q(s, a) = V(s) + A(s, a) − mean over actions of A(s, a), which subtracts the mean advantage to keep the two streams identifiable. The implementation uses:
- Convolutional backbone identical to the "standard" model
- Two 512-unit fully connected heads
- Custom dueling aggregation layer
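A sketch of the dueling head on top of the standard backbone, using the mean-subtracted aggregation above (the exact behavior of main.py's custom aggregation layer is not shown here, so treat this as one common variant):

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling DQN: shared conv backbone, separate value and advantage streams."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.value = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x.float() / 255.0)
        v = self.value(h)       # (B, 1)
        a = self.advantage(h)   # (B, n_actions)
        # Q = V + (A - mean A): subtracting the mean advantage keeps V and A identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```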
- Training Steps: 1.5M
- Final Outcome: 11.0
During the first 400,000 training steps (≈400 episodes), the agent remained stuck around a mean reward of –20 to –18, which is expected since a Pong agent initially loses every point. This corresponds to a long exploration phase in which the agent has not yet learned to return the ball reliably.
After around 500,000 steps, ε reached its minimum value of 0.1, and the agent began to steadily improve. The mean reward progressively increased from about –17 to approximately –3 by the end of training. Although the mean reward never reached positive values during training, the agent did begin to score points more consistently, even achieving a positive episode reward of +9 around episode 800.
The training was cut short at 1.5M steps due to hardware limits, but when evaluated with a greedy (ε = 0) policy, the final agent achieved a score of +11, demonstrating that the underlying policy learned to win reliably, despite the mean reward during training remaining negative.
- Training Steps: 1.5M
- Final Outcome: 17.0
For the first 300,000 steps (≈325 episodes), the agent remained in a pure exploration phase, with a mean reward consistently around –20, which corresponds to losing nearly every point of every match.
Starting around 350,000 steps, the agent finally began to escape this plateau. The mean reward improved from –19.64 (episode 300) to –17.98 (episode 350), indicating that it was learning to occasionally return the ball rather than losing immediately.
As ε continued to decrease, the agent steadily improved. By episode 450, the mean reward reached –13.18, and by episode 500 it rose further to –11.36. In the later stages of training, once ε had reached its minimum value (0.1), the agent made significant progress, going from –9.90 at episode 550 to –2.96 at episode 650.
Although the mean reward never became positive during training, the agent clearly learned a strong policy. When evaluated with a greedy policy (ε = 0), the final agent achieved a total score of 17, corresponding to a 21–4 win.
- Training Steps: 1.5M
- Final Outcome: +20
Training the dueling DQN agent on Pong showed the typical behavior observed in the literature when using pixel-based Atari input:
- A very long plateau over the first million training steps, with the mean reward stuck around –20 to –21: the agent learned to survive marginally but had not yet discovered reliably winning trajectories.
- A breakthrough around 1.1M steps, with the mean reward jumping from 5.20 to around 14.14 within roughly 120k steps.
- Stabilization around a mean reward of +15 once ε reached its minimum of 0.1.
- A final greedy evaluation (ε = 0) scoring +20, meaning the agent lost only a single point during the entire game.
However, the recorded video shows signs of overfitting to the deterministic serves the opponent makes after losing a point.
- Training Steps: 1M
- Training Time: 1h10
- Final Outcome: 45.0
From the start of training until around 300,000 steps (≈1600 episodes), the agent stayed in a low-performance plateau, with mean rewards consistently below 2.0.
After this point, the agent entered a clear learning phase: the mean reward began to rise, reaching 5–10 between 400k and 500k steps, and continuing to improve as ε annealed to its minimum value. In the final portion of training (700k–1M steps), the agent stabilized with mean rewards in the range of 11 to 19, consistent with the original DQN’s behavior on Breakout.
When evaluated greedily (ε = 0), the final policy achieved a score of 45, demonstrating a strong and fully developed breakout strategy.
- Training Steps: 1M
- Training Time: 1h27
- Final Outcome: 103.0
During the first 300,000 steps, the agent remains in a low-performance plateau, with mean rewards staying under 2.0. This corresponds to the exploration-heavy part of training, where the agent only learns to occasionally hit the ball.
A small improvement appears slightly before 300k steps, with occasional high rewards (e.g., 5.0 at episodes 700 and 1050), but the mean reward remains low and unstable.
After roughly 320,000–350,000 steps, the agent enters a clear learning phase.
Once ε reaches 0.1 (from ~500k steps onward), the agent stabilizes around mean rewards of 14–20, with occasional episode rewards above 60 and a peak mean reward of 29 (episode 2550).
Finally, during the evaluation with ε = 0, the agent reaches a score of 103.
- Training Steps: 1M
- Training Time: 1h38
- Final Outcome: +93
During training on Breakout, the first 300k steps are dominated by exploration, with mean rewards around 1.0–1.6.
From 300k to 700k steps, rewards increase steadily as the replay buffer fills with higher-quality transitions.
By ~700k steps, the agent achieves episodes scoring more than 65 points, and it maintains a mean reward above 30 until the end of training.
During the final evaluation episodes, the agent's reward peaks at 93.
| Architecture | Training Steps | Mean Reward During Training | Final Greedy Evaluation | Notes |
|---|---|---|---|---|
| Original (2013) | 1.5M | –20 → –3 | +11 | Learns slowly; clearly undertrained vs literature |
| Standard (2015) | 1.5M | –20 → –3 | +17 | Stronger feature extractor leads to faster gains |
| Dueling DQN | 1.5M | –20 plateau → +15 | +20 | Best performance on Pong; big jump after 1M steps |
| Architecture | Training Steps | Mean Reward During Training | Final Greedy Evaluation | Notes |
|---|---|---|---|---|
| Original (2013) | 1.0M | <2 → ~15 | 45 | Good performance despite short training |
| Standard (2015) | 1.0M | <2 → 14–20 | 103 | Best architecture on Breakout |
| Dueling DQN | 1.0M | ~1.5 → >30 | 93 | Strong results; slightly behind Standard on Breakout |
This project successfully reproduces the behavior of early deep reinforcement learning algorithms on both Pong and Breakout under strict hardware and training-time constraints.
All three models suffered from long initial plateaus, but still reached decent performance given those constraints.
- Original (2013) shows slow learning and long plateaus, but still reaches competent greedy performance.
- Standard (2015) benefits from its deeper convolutional stack, achieving the best results on Breakout within the limited training budget.
- Dueling provides the best results on Pong.
Despite hardware limits restricting training to 1M steps for Breakout and 1.5M for Pong (vs. 10M+ in the literature), all three architectures still learn reliable strategies, with behavior broadly aligning with the original 2013 paper.











