Does Optimism Help PPO? Optimistic Gradient Updates for Multi-Agent Games and Exploration Benchmarks
This project investigates whether optimism, a key concept in online learning (e.g., Optimistic Multiplicative Weights), can improve the performance of Proximal Policy Optimization (PPO) in multi-agent and complex environments.
Standard PPO is a function-approximation cousin of mirror descent. In theory, if gradients are predictable, adding an optimistic lookahead or extrapolation term can reduce regret and stabilize dynamics in games. This project implements a lightweight parameter-space optimism mechanism inside PPO (OPPO) and evaluates it across a suite of competitive, cooperative, and hard-exploration tasks.
- Positive effect: In cooperative and imperfect-information games (e.g., Robot Warehouse, Hanabi, Leduc Poker), small amounts of optimism ($\beta \approx 0.1$) consistently improved performance by ~5-10%.
- Negative effect: Optimism performs poorly in structured, low-uncertainty games (e.g., Checkers, Go, and Triple Triad), where value estimates are already stable. Here, optimistic extrapolation magnifies errors instead of encouraging useful exploration.
- Stability: Optimism acts as an accelerator. It often produces higher early peaks but can lead to late-stage instability or collapse (e.g., in Atari Seaquest). We add support for annealing optimism to potentially mitigate this (see the sketch after this list).
- Scope Matters: Applying optimism only to the policy head was often worse than the baseline. Applying it to both the policy and value networks recovered improvements, as seen in Checkers.
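As a concrete illustration of the annealing option mentioned above, here is a minimal sketch of a linear schedule that decays the optimism coefficient to zero over training; the function name and schedule shape are illustrative assumptions, not the repository's actual implementation.

```python
# Hypothetical linear annealing schedule for the optimism coefficient;
# the repository's --anneal-beta option may use a different shape.
def annealed_beta(beta0: float, step: int, total_steps: int) -> float:
    """Decay beta linearly from beta0 at step 0 to 0 at total_steps."""
    frac = min(step / total_steps, 1.0)
    return beta0 * (1.0 - frac)

# Example: beta0 = 0.1 gives 0.05 halfway through training and 0.0 at the end.
assert abs(annealed_beta(0.1, 100_000, 200_000) - 0.05) < 1e-12
```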
Recommended setup uses uv (a fast Python package manager) or standard pip.
- Python 3.10+
- PufferLib (for vectorized environments)
```bash
git clone https://github.com/ryanhlewis/oppo.git
cd oppo

# Install dependencies (using pip)
pip install -r requirements.txt

# Or using uv (recommended for speed)
uv pip install -r requirements.txt
```

Note on Dependencies: This project relies on `pufferlib`, `shimmy`, `pettingzoo`, and `openspiel`. Multi-agent environment installation can be tricky on some systems. Please consult the PufferLib documentation if you encounter build errors with specific environments.
⚠️ Warning (Windows/WSL Users): `pufferlib` requires system build tools. We have provided a helper script to automate the setup of system dependencies (core tools, CMake, SWIG) and the Python environment (using `uv`).

Manual Install:

```bash
sudo apt-get update && sudo apt-get install -y build-essential cmake swig
```

Automatic Setup (Recommended for WSL):

```bash
# Make the script executable
chmod +x setup/setup_wsl.sh

# Run the setup script (installs system deps, Python 3.10+, and pip packages)
./setup/setup_wsl.sh

# Activate the environment
source .venv/bin/activate
```

Note on CUDA: If you encounter `RuntimeError: The detected CUDA version ... mismatches ...`, ensure your system CUDA version (check `nvcc --version`) matches the PyTorch version installed by pip. You may need to install a specific PyTorch version (e.g., `pip install torch --index-url https://download.pytorch.org/whl/cu118`) before installing `pufferlib`.
To ensure that all environments (Core, Imperfect Information, Atari) are correctly installed and running, you can run the comprehensive smoke test suite:

```bash
python setup/verify_all_games.py
```

This script will sequentially launch and step through every supported environment to verify no dependencies (like ROMs or shimmy wrappers) are missing.
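For reference, the kind of check such a script performs looks roughly like the sketch below, written against the generic Gymnasium API rather than the project's actual environment wrappers:

```python
# Minimal per-environment smoke test, assuming a Gymnasium-style API.
# This is an illustrative sketch, not the project's verify script.
import gymnasium as gym

def smoke_test(env_id: str, steps: int = 10) -> None:
    env = gym.make(env_id)
    obs, info = env.reset(seed=0)
    for _ in range(steps):
        action = env.action_space.sample()     # random action
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:            # restart finished episodes
            obs, info = env.reset()
    env.close()

smoke_test("CartPole-v1")
```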
The code is organized into `src/` (core implementation) and `experiments/` (launch scripts).
You can run the main training script directly from the root directory:
```bash
# Example: Run PPO with Optimism (beta=0.1) on Robot Warehouse
python src/oppo.py \
    --env puffer_rware \
    --optimism-beta 0.1 \
    --optimism-scope both \
    --total-timesteps 200000 \
    --track # Optional: enable tracking (wandb/tensorboard)
```

Key Arguments:
- `--env`: Environment ID (e.g., `puffer_rware`, `puffer_checkers`, `puffer_hanabi`, `montezuma`).
- `--optimism-beta`: Coefficient for optimism (default: 0.0, which is standard PPO).
- `--optimism-mode`: `delta` (extrapolation) or `lookahead` (extra-gradient).
- `--optimism-scope`: `policy` (actor only) or `both` (actor + critic); a sketch of scope selection follows below.
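To make the scope flag concrete, here is a sketch of how the two scopes might select parameter groups; the agent attribute names (`actor`, `critic`) are illustrative assumptions, not the repository's actual field names.

```python
# Hypothetical scope selection: which parameters receive the optimistic
# update. Attribute names are illustrative, not the repo's actual fields.
def optimism_params(agent, scope: str):
    if scope == "policy":
        return list(agent.actor.parameters())
    if scope == "both":
        return list(agent.actor.parameters()) + list(agent.critic.parameters())
    raise ValueError(f"unknown --optimism-scope: {scope}")
```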
The experiments/ directory contains scripts to reproduce the results from the report.
Core MARL Benchmarks:

```bash
# Runs Checkers, Go, Rware with various betas
python experiments/run_paper_experiments.py --experiment core
```

Imperfect Information Games:

```bash
# Runs Hanabi, Leduc Poker, Matrix RPS
python experiments/run_paper_experiments.py --experiment imperfect
```

Grid Sweep:

```bash
# Run a broader sweep over environments and parameters
python experiments/run_grid.py
```

For Atari games (Montezuma's Revenge, Seaquest, etc.), use the specialized script:

```bash
python src/oppo_atari.py --env montezuma --optimism-beta 0.1 --anneal-beta
```

Repository Structure:
- `src/oppo.py`: Main PPO implementation with support for PufferLib Ocean environments.
- `src/oppo_atari.py`: Adapted implementation for single-player Atari/Crafter benchmarks.
- `experiments/`: Scripts for launching large-scale sweeps and specific figure reproduction.
- `results/`: Directory where logs and summary CSVs are stored. Note: the original paper results are present in `results/firstrun`, `results/secondrun`, and `results/thirdrun` CSV files for analysis.
- `setup/`: Setup helpers (e.g., for WSL).
The core optimism mechanism is a simple gradient extrapolation:

$$g_t \leftarrow g_t + \beta\,(g_t - g_{t-1})$$
This mimics the "optimistic" prediction step in online learning algorithms. We also implement a "lookahead" variant (extra-gradient) which evaluates the loss at a predicted future parameter.
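As a minimal sketch of the delta (extrapolation) variant, assuming PyTorch and illustrative names (not the repository's exact code), the extrapolation can be applied to the stored gradients between the backward pass and the optimizer step:

```python
# Sketch of the delta (extrapolation) variant in PyTorch; names are
# illustrative. Call after loss.backward() and before optimizer.step().
import torch

@torch.no_grad()
def apply_optimism(params, prev_grads, beta: float) -> None:
    """Replace each gradient g_t with g_t + beta * (g_t - g_{t-1})."""
    for p, g_prev in zip(params, prev_grads):
        if p.grad is None:
            continue
        g_t = p.grad.clone()                    # raw gradient this step
        p.grad.add_(g_t - g_prev, alpha=beta)   # optimistic extrapolation
        g_prev.copy_(g_t)                       # remember g_t for next step

# prev_grads would be initialized once, e.g.:
# prev_grads = [torch.zeros_like(p) for p in params]
```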
For full technical details, please refer to the paper.