Author: Maxim Bocquillion
Date: November 2025
The goal of this project is to implement a Deep Q-Network (DQN) agent as described in the seminal paper Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013). The objective is to train an agent to play Atari games from the Arcade Learning Environment (ALE) directly from raw pixels, learning control policies without hand-crafted features.
The project explores three distinct network architectures across two games: Pong and Breakout.
To reproduce the experiments, configure the game and architecture inside main.py.
- Install dependencies

  ```bash
  pip install gymnasium[atari] ale-py torch numpy matplotlib imageio
  ```

- Configuration (in main.py)

  Set game = "Pong" or "Breakout"
  Set NETWORK_TYPE = "original", "standard", or "dueling"

- Run training

  ```bash
  python main.py
  ```

All experiments share a common training pipeline designed to run efficiently on a laptop while remaining faithful to the core DQN algorithm.
Raw frames (210×160) are converted to grayscale, downsampled, and cropped to 84×84.
A stack of the last 4 frames is used as input to capture motion.
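For illustration, here is a minimal sketch of this preprocessing and stacking step built on Gymnasium's AtariPreprocessing wrapper; the wrapper resizes directly to 84×84 rather than reproducing the exact crop used in main.py, and the FrameStack helper, the ALE/Pong-v5 id, and the frameskip settings are assumptions for this example:

```python
import collections
import numpy as np
import gymnasium as gym
import ale_py  # provides the ALE environments

gym.register_envs(ale_py)  # recent gymnasium/ale-py versions; older ones register on import

# AtariPreprocessing converts frames to grayscale and resizes them to 84x84; the base env
# is created with frameskip=1 so that the wrapper controls the 4-frame skip itself.
env = gym.make("ALE/Pong-v5", frameskip=1)
env = gym.wrappers.AtariPreprocessing(env, frame_skip=4, screen_size=84, grayscale_obs=True)

class FrameStack:
    """Stacks the last 4 preprocessed frames into a (4, 84, 84) uint8 array."""
    def __init__(self, k: int = 4):
        self.frames = collections.deque(maxlen=k)

    def reset(self, obs: np.ndarray) -> np.ndarray:
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs: np.ndarray) -> np.ndarray:
        self.frames.append(obs)
        return np.stack(self.frames)
```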
Transitions are stored in a replay memory. Due to laptop RAM limits, the replay memory is capped at 250k transitions (vs. 1M in the original DQN).
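A minimal sketch of such a capped replay memory (the Transition fields and the uniform sampling are assumptions about how main.py stores transitions):

```python
import collections
import random

# Hypothetical transition layout; main.py may store these fields differently.
Transition = collections.namedtuple("Transition", "state action reward next_state done")

class ReplayMemory:
    """Uniform-sampling replay buffer capped at 250k transitions (laptop RAM limit)."""
    def __init__(self, capacity: int = 250_000):
        self.buffer = collections.deque(maxlen=capacity)

    def push(self, *args) -> None:
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int) -> Transition:
        batch = random.sample(self.buffer, batch_size)
        return Transition(*zip(*batch))  # batched fields, ready to convert to tensors

    def __len__(self) -> int:
        return len(self.buffer)
```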
Training uses RMSProp and an ε-greedy exploration strategy decaying from 1.0 → 0.1.
Evaluations use a fully greedy policy with ε set to 0.
Double DQN logic is applied: action selection and target evaluation are decoupled.
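A sketch of the ε schedule and the Double DQN target computation (the 500k-step linear decay and γ = 0.99 are illustrative assumptions rather than values confirmed in main.py):

```python
import torch

def linear_epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.1,
                   decay_steps: int = 500_000) -> float:
    """Linear ε decay from 1.0 to 0.1; evaluation episodes use ε = 0 (fully greedy)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones.float()) * next_q
```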
It is worth noting that while the original DQN studies typically train for around 10–50 million frames, this study is limited to 1M frames for Breakout and 1.5M frames for Pong, with a replay buffer of 250k transitions, due to hardware and time constraints. These constraints largely explain the relatively low scores reported below.
This phase reproduces the architecture from Mnih et al. (2013). It is relatively shallow, meant to demonstrate that a CNN can learn control from pixels.
- Conv1: 16 filters (8×8), stride 4, ReLU
- Conv2: 32 filters (4×4), stride 2, ReLU
- FC1: 256 units
- Output: One Q-value per action
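A minimal PyTorch sketch consistent with the layer sizes above (the [0, 1] input scaling and the 32×9×9 flattened size for 84×84 inputs are assumptions; main.py may differ in details):

```python
import torch
import torch.nn as nn

class OriginalDQN(nn.Module):
    """2013-style DQN: two conv layers and a 256-unit hidden layer."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # 84x84 -> 20x20 after conv1 -> 9x9 after conv2, so 32 * 9 * 9 = 2592 features
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x.float() / 255.0))
```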
This model corresponds to the deeper architecture used in the Nature 2015 paper. It adds capacity and is far more expressive.
- Conv1: 32 filters (8×8), stride 4
- Conv2: 64 filters (4×4), stride 2
- Conv3: 64 filters (3×3), stride 1
- FC1: 512 units
- Output: Q-values
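The corresponding sketch, under the same scaling assumption (64×7×7 = 3136 features after the three conv layers on an 84×84 input):

```python
import torch
import torch.nn as nn

class StandardDQN(nn.Module):
    """Nature-2015-style DQN: three conv layers and a 512-unit hidden layer."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 84x84 -> 20x20 -> 9x9 -> 7x7, so 64 * 7 * 7 = 3136 features
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x.float() / 255.0))
```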
A Dueling DQN separates the value and advantage estimation paths:
- Value Stream: estimates V(s)
- Advantage Stream: estimates A(s,a)
These combine via the standard dueling aggregation Q(s, a) = V(s) + A(s, a) − mean over actions of A(s, a), which subtracts the mean advantage to keep the two streams identifiable. The implementation uses:
- Convolutional backbone identical to the "standard" model
- Two 512-unit fully connected heads
- Custom dueling aggregation layer
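A sketch of the dueling head on top of the standard backbone, using the mean-subtracted aggregation above (the exact behavior of main.py's custom aggregation layer is not shown here, so treat this as one common variant):

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling DQN: shared conv backbone, separate value and advantage streams."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.value = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x.float() / 255.0)
        v = self.value(h)       # (B, 1)
        a = self.advantage(h)   # (B, n_actions)
        # Q = V + (A - mean A): subtracting the mean advantage keeps V and A identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```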
- Training Steps: 1.5M
- Final Outcome: 11.0
During the first 400,000 training steps (≈400 episodes), the agent remained stuck around a mean reward of –20 to –18, which is expected since a Pong agent initially loses every point. This corresponds to a long exploration phase in which the agent has not yet learned to return the ball reliably.
After around 500,000 steps, ε reached its minimum value of 0.1, and the agent began to steadily improve. The mean reward progressively increased from about –17 to approximately –3 by the end of training. Although the mean reward never reached positive values during training, the agent did begin to score points more consistently, even achieving a positive episode reward of +9 around episode 800.
The training was cut short at 1.5M steps due to hardware limits, but when evaluated with a greedy (ε = 0) policy, the final agent achieved a score of +11, demonstrating that the underlying policy learned to win reliably, despite the mean reward during training remaining negative.
- Training Steps: 1.5M
- Final Outcome: 17.0
For the first 300,000 steps (≈325 episodes), the agent remained in a pure exploration phase, with a mean reward consistently around –20, which corresponds to losing nearly every point of every match.
Starting around 350,000 steps, the agent finally began to escape this plateau. The mean reward improved from –19.64 (episode 300) to –17.98 (episode 350), indicating that it was learning to occasionally return the ball rather than losing immediately.
As ε continued to decrease, the agent steadily improved. By episode 450, the mean reward reached –13.18, and by episode 500 it rose further to –11.36. In the later stages of training, once ε had reached its minimum value (0.1), the agent made significant progress, going from –9.90 at episode 550 to –2.96 at episode 650.
Although the mean reward never became positive during training, the agent clearly learned a strong policy. When evaluated with a greedy policy (ε = 0), the final agent achieved a total score of 17, corresponding to a 21–4 win.
- Training Steps: 1.5M
- Final Outcome: +20
Training the dueling DQN agent on Pong showed the typical behavior observed in the literature when using pixel-based Atari input:
- A very long plateau over the first million training steps, with the mean reward stuck around –20 to –21: the agent learned to survive marginally but had not yet discovered reliably winning trajectories.
- A breakthrough around 1.1M steps, with the mean reward jumping from 5.20 to around 14.14 within roughly 120k steps.
- Stabilization around a mean reward of +15 once ε reached its minimum of 0.1.
- A final greedy evaluation (ε = 0) scoring +20, meaning the agent lost only a single point during the entire game.
However, the recorded video shows signs of overfitting to the deterministic serves the opponent makes after losing a point.
- Training Steps: 1M
- Training Time: 1h10
- Final Outcome: 45.0
From the start of training until around 300,000 steps (≈1600 episodes), the agent stayed in a low-performance plateau, with mean rewards consistently below 2.0.
After this point, the agent entered a clear learning phase: the mean reward began to rise, reaching 5–10 between 400k and 500k steps, and continuing to improve as ε annealed to its minimum value. In the final portion of training (700k–1M steps), the agent stabilized with mean rewards in the range of 11 to 19, consistent with the original DQN’s behavior on Breakout.
When evaluated greedily (ε = 0), the final policy achieved a score of 45, demonstrating a strong and fully developed breakout strategy.
- Training Steps: 1M
- Training Time: 1h27
- Final Outcome: 103.0
During the first 300,000 steps, the agent remains in a low-performance plateau, with mean rewards staying under 2.0. This corresponds to the exploration-heavy part of training, where the agent only learns to occasionally hit the ball.
A small improvement appears slightly before 300k steps, with occasional high rewards (e.g., 5.0 at episodes 700 and 1050), but the mean reward remains low and unstable.
After roughly 320,000–350,000 steps, the agent enters a clear learning phase.
Once ε reaches 0.1 (from ~500k steps onward), the agent stabilizes around mean rewards of 14–20, with occasional episode rewards above 60 and a peak mean reward of 29 (episode 2550).
Finally, during the evaluation with ε = 0, the agent reaches a score of 103.
- Training Steps: 1M
- Training Time: 1h38
- Final Outcome: +93
During training on Breakout, the first 300k steps are dominated by exploration, with mean rewards around 1.0–1.6.
From 300k to 700k steps, rewards increase steadily as the replay buffer fills with higher-quality transitions.
By ~700k steps, the agent achieves episodes scoring more than 65 points, and it maintains a mean reward above 30 until the end of training.
During the final evaluation episodes, the agent's reward peaks at 93.
| Architecture | Training Steps | Mean Reward During Training | Final Greedy Evaluation | Notes |
|---|---|---|---|---|
| Original (2013) | 1.5M | –20 → –3 | +11 | Learns slowly; clearly undertrained vs literature |
| Standard (2015) | 1.5M | –20 → –3 | +17 | Stronger feature extractor leads to faster gains |
| Dueling DQN | 1.5M | –20 plateau → +15 | +20 | Best performance on Pong; big jump after 1M steps |
| Architecture | Training Steps | Mean Reward During Training | Final Greedy Evaluation | Notes |
|---|---|---|---|---|
| Original (2013) | 1.0M | <2 → ~15 | 45 | Good performance despite short training |
| Standard (2015) | 1.0M | <2 → 14–20 | 103 | Best architecture on Breakout |
| Dueling DQN | 1.0M | ~1.5 → >30 | 93 | Strong results; slightly behind Standard on Breakout |
This project successfully reproduces the behavior of early deep reinforcement learning algorithms on both Pong and Breakout under strict hardware and training-time constraints.
All three models suffered from long initial plateaus, but still reached decent performance given those constraints.
- Original (2013) shows slow learning and long plateaus, but still reaches competent greedy performance.
- Standard (2015) benefits from its deeper convolutional stack, achieving the best results on Breakout within the limited training budget.
- Dueling provides the best results on Pong.
Despite hardware limits restricting training to 1M steps for Breakout and 1.5M for Pong (vs. 10M+ in the literature), all three architectures still learn reliable strategies, with behavior broadly aligning with the original 2013 paper.











