Does Optimism Help PPO? Optimistic Gradient Updates for Multi-Agent Games and Exploration Benchmarks
This project investigates whether optimism, a key concept in online learning (e.g., Optimistic Multiplicative Weights), can improve the performance of Proximal Policy Optimization (PPO) in multi-agent and complex environments.
Standard PPO is a function-approximation cousin of mirror descent. In theory, if gradients are predictable, adding an optimistic lookahead or extrapolation term can reduce regret and stabilize dynamics in games. This project implements a lightweight parameter-space optimism mechanism inside PPO (OPPO) and evaluates it across a suite of competitive, cooperative, and hard-exploration tasks.
- Positive effect: In cooperative and imperfect-information games (e.g., Robot Warehouse, Hanabi, Leduc Poker), small amounts of optimism ($\beta \approx 0.1$) consistently improved performance by ~5-10%.
- Negative effect: Optimism performs poorly in structured, low-uncertainty games (e.g., Checkers, Go, and Triple Triad), where value estimates are already stable. Here, optimistic extrapolation magnifies errors instead of encouraging useful exploration.
- Stability: Optimism acts as an accelerator. It often produces higher early peaks but can lead to late-stage instability or collapse (e.g., in Atari Seaquest). We add support for annealing optimism to potentially mitigate this (see the sketch after this list).
- Scope Matters: Applying optimism only to the policy head was often worse than the baseline. Applying it to both the policy and value networks recovered improvements, as seen in Checkers.
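As a concrete illustration of the annealing option mentioned above, here is a minimal sketch of a linear schedule that decays the optimism coefficient to zero over training; the function name and schedule shape are illustrative assumptions, not the repository's actual implementation.

```python
# Hypothetical linear annealing schedule for the optimism coefficient;
# the repository's --anneal-beta option may use a different shape.
def annealed_beta(beta0: float, step: int, total_steps: int) -> float:
    """Decay beta linearly from beta0 at step 0 to 0 at total_steps."""
    frac = min(step / total_steps, 1.0)
    return beta0 * (1.0 - frac)

# Example: beta0 = 0.1 gives 0.05 halfway through training and 0.0 at the end.
assert abs(annealed_beta(0.1, 100_000, 200_000) - 0.05) < 1e-12
```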
Recommended setup uses uv (a fast Python package manager) or standard pip.
- Python 3.10+
- PufferLib (for vectorized environments)
```bash
git clone https://github.com/ryanhlewis/oppo.git
cd oppo

# Install dependencies (using pip)
pip install -r requirements.txt

# Or using uv (recommended for speed)
uv pip install -r requirements.txt
```

Note on Dependencies: This project relies on `pufferlib`, `shimmy`, `pettingzoo`, and `openspiel`. Multi-agent environment installation can be tricky on some systems. Please consult the PufferLib documentation if you encounter build errors with specific environments.
⚠️ Warning (Windows/WSL Users): `pufferlib` requires system build tools. We have provided a helper script to automate the setup of system dependencies (core tools, CMake, SWIG) and the Python environment (using `uv`).

Manual Install:

```bash
sudo apt-get update && sudo apt-get install -y build-essential cmake swig
```

Automatic Setup (Recommended for WSL):

```bash
# Make the script executable
chmod +x setup/setup_wsl.sh

# Run the setup script (installs system deps, Python 3.10+, and pip packages)
./setup/setup_wsl.sh

# Activate the environment
source .venv/bin/activate
```

Note on CUDA: If you encounter `RuntimeError: The detected CUDA version ... mismatches ...`, ensure your system CUDA version (check `nvcc --version`) matches the PyTorch version installed by pip. You may need to install a specific PyTorch version (e.g., `pip install torch --index-url https://download.pytorch.org/whl/cu118`) before installing `pufferlib`.
To ensure that all environments (Core, Imperfect Information, Atari) are correctly installed and running, you can run the comprehensive smoke test suite:

```bash
python setup/verify_all_games.py
```

This script will sequentially launch and step through every supported environment to verify no dependencies (like ROMs or shimmy wrappers) are missing.
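For reference, the kind of check such a script performs looks roughly like the sketch below, written against the generic Gymnasium API rather than the project's actual environment wrappers:

```python
# Minimal per-environment smoke test, assuming a Gymnasium-style API.
# This is an illustrative sketch, not the project's verify script.
import gymnasium as gym

def smoke_test(env_id: str, steps: int = 10) -> None:
    env = gym.make(env_id)
    obs, info = env.reset(seed=0)
    for _ in range(steps):
        action = env.action_space.sample()     # random action
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:            # restart finished episodes
            obs, info = env.reset()
    env.close()

smoke_test("CartPole-v1")
```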
The code is organized into `src/` (core implementation) and `experiments/` (launch scripts).
You can run the main training script directly from the root directory:
```bash
# Example: Run PPO with Optimism (beta=0.1) on Robot Warehouse
python src/oppo.py \
    --env puffer_rware \
    --optimism-beta 0.1 \
    --optimism-scope both \
    --total-timesteps 200000 \
    --track # Optional: enable tracking (wandb/tensorboard)
```

Key Arguments:
- `--env`: Environment ID (e.g., `puffer_rware`, `puffer_checkers`, `puffer_hanabi`, `montezuma`).
- `--optimism-beta`: Coefficient for optimism (default: 0.0, which is standard PPO).
- `--optimism-mode`: `delta` (extrapolation) or `lookahead` (extra-gradient).
- `--optimism-scope`: `policy` (actor only) or `both` (actor + critic); a sketch of scope selection follows below.
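To make the scope flag concrete, here is a sketch of how the two scopes might select parameter groups; the agent attribute names (`actor`, `critic`) are illustrative assumptions, not the repository's actual field names.

```python
# Hypothetical scope selection: which parameters receive the optimistic
# update. Attribute names are illustrative, not the repo's actual fields.
def optimism_params(agent, scope: str):
    if scope == "policy":
        return list(agent.actor.parameters())
    if scope == "both":
        return list(agent.actor.parameters()) + list(agent.critic.parameters())
    raise ValueError(f"unknown --optimism-scope: {scope}")
```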
The experiments/ directory contains scripts to reproduce the results from the report.
Core MARL Benchmarks:

```bash
# Runs Checkers, Go, Rware with various betas
python experiments/run_paper_experiments.py --experiment core
```

Imperfect Information Games:

```bash
# Runs Hanabi, Leduc Poker, Matrix RPS
python experiments/run_paper_experiments.py --experiment imperfect
```

Grid Sweep:

```bash
# Run a broader sweep over environments and parameters
python experiments/run_grid.py
```

For Atari games (Montezuma's Revenge, Seaquest, etc.), use the specialized script:

```bash
python src/oppo_atari.py --env montezuma --optimism-beta 0.1 --anneal-beta
```

Repository Structure:
- `src/oppo.py`: Main PPO implementation with support for PufferLib Ocean environments.
- `src/oppo_atari.py`: Adapted implementation for single-player Atari/Crafter benchmarks.
- `experiments/`: Scripts for launching large-scale sweeps and specific figure reproduction.
- `results/`: Directory where logs and summary CSVs are stored. Note: the original paper results are present in `results/firstrun`, `results/secondrun`, and `results/thirdrun` CSV files for analysis.
- `setup/`: Setup helpers (e.g., for WSL).
The core optimism mechanism is a simple gradient extrapolation:

$$g_t \leftarrow g_t + \beta\,(g_t - g_{t-1})$$
This mimics the "optimistic" prediction step in online learning algorithms. We also implement a "lookahead" variant (extra-gradient) which evaluates the loss at a predicted future parameter.
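As a minimal sketch of the delta (extrapolation) variant, assuming PyTorch and illustrative names (not the repository's exact code), the extrapolation can be applied to the stored gradients between the backward pass and the optimizer step:

```python
# Sketch of the delta (extrapolation) variant in PyTorch; names are
# illustrative. Call after loss.backward() and before optimizer.step().
import torch

@torch.no_grad()
def apply_optimism(params, prev_grads, beta: float) -> None:
    """Replace each gradient g_t with g_t + beta * (g_t - g_{t-1})."""
    for p, g_prev in zip(params, prev_grads):
        if p.grad is None:
            continue
        g_t = p.grad.clone()                    # raw gradient this step
        p.grad.add_(g_t - g_prev, alpha=beta)   # optimistic extrapolation
        g_prev.copy_(g_t)                       # remember g_t for next step

# prev_grads would be initialized once, e.g.:
# prev_grads = [torch.zeros_like(p) for p in params]
```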
For full technical details, please refer to the paper.