
Does Optimism Help PPO? Optimistic Gradient Updates for Multi-Agent Games and Exploration Benchmarks

Project Overview

This project investigates whether optimism, a key concept from online learning (e.g., Optimistic Multiplicative Weights), can improve the performance of Proximal Policy Optimization (PPO) in multi-agent games and hard-exploration environments.

Standard PPO is a function-approximation cousin of mirror descent. In theory, if gradients are predictable, adding an optimistic lookahead or extrapolation term can reduce regret and stabilize dynamics in games. This project implements a lightweight parameter-space optimism mechanism inside PPO (OPPO) and evaluates it across a suite of competitive, cooperative, and hard-exploration tasks.
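Concretely, the optimistic step treats the previous gradient as a prediction of the next one: with learning rate $\eta$, the parameter update becomes $\theta_{t+1} = \theta_t - \eta\,[g_t + \beta\,(g_t - g_{t-1})]$, which recovers standard PPO when $\beta = 0$ (see Implementation Details below).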

Key Findings

  • Positive effect: In cooperative and imperfect-information games (e.g., Robot Warehouse, Hanabi, Leduc Poker), small amounts of optimism ($\beta \approx 0.1$) consistently improved performance by ~5-10%.
  • Negative effect: Optimism performs poorly in structured, low-uncertainty games (e.g., Checkers, Go, and Triple Triad), where value estimates are already stable. Here, optimistic extrapolation magnifies errors instead of encouraging useful exploration.
  • Stability: Optimism acts as an accelerator. It often produces higher early peaks but can lead to late-stage instability or collapses (e.g., in Atari Seaquest). We add support for annealing optimism to potentially mitigate this.
  • Scope Matters: Applying optimism only to the policy head was often worse than baseline. Applying it to both the policy and value networks recovered improvements, as seen in Checkers.

Installation

Recommended setup using uv (fast Python package manager) or standard pip.

Prerequisites

  • Python 3.10+
  • PufferLib (for vectorized environments)

Setup

git clone https://github.com/ryanhlewis/oppo.git
cd oppo

# Install dependencies (using pip)
pip install -r requirements.txt

# Or using uv (recommended for speed)
uv pip install -r requirements.txt

Note on Dependencies: This project relies on pufferlib, shimmy, pettingzoo, and openspiel. Multi-agent environment installation can be tricky on some systems. Please consult the PufferLib documentation if you encounter build errors with specific environments.

⚠️ Warning (Windows/WSL Users): pufferlib requires system build tools. We provide a helper script that automates the setup of system dependencies (build-essential, CMake, SWIG) and the Python environment (using uv).

Manual Install:

sudo apt-get update && sudo apt-get install -y build-essential cmake swig

Automatic Setup (Recommended for WSL):

# Make the script executable
chmod +x setup/setup_wsl.sh
# Run the setup script (installs system deps, python 3.10+, and pip packages)
./setup/setup_wsl.sh
# Activate the environment
source .venv/bin/activate

Note on CUDA: If you encounter RuntimeError: The detected CUDA version ... mismatches ..., ensure your system CUDA version (check nvcc --version) matches the CUDA build of the PyTorch wheel installed by pip. You may need to install a specific PyTorch build (e.g., pip install torch --index-url https://download.pytorch.org/whl/cu118) before installing pufferlib.

Verification

To ensure that all environments (Core, Imperfect Information, Atari) are correctly installed and running, you can run the comprehensive smoke test suite:

python setup/verify_all_games.py

This script sequentially launches and steps through every supported environment to verify that no dependencies (such as Atari ROMs or Shimmy wrappers) are missing.


Usage

The code is organized into src/ (core implementation) and experiments/ (launch scripts).

1. Running a Single Experiment

You can run the main training script directly from the root directory:

# Example: Run PPO with Optimism (beta=0.1) on Robot Warehouse
python src/oppo.py \
    --env puffer_rware \
    --optimism-beta 0.1 \
    --optimism-scope both \
    --total-timesteps 200000 \
    --track  # Optional: enable tracking (wandb/tensorboard)

Key Arguments:

  • --env: Environment ID (e.g., puffer_rware, puffer_checkers, puffer_hanabi, montezuma)
  • --optimism-beta: Coefficient for optimism (default: 0.0, which is standard PPO).
  • --optimism-mode: delta (extrapolation) or lookahead (extra-gradient).
  • --optimism-scope: policy (actor only) or both (actor + critic); see the sketch after this list for one way to interpret scope.
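
For intuition only (the attribute names agent.policy_net and agent.value_net below are hypothetical and may not match the identifiers used in src/oppo.py), the scope flag can be thought of as selecting which parameters receive the optimistic update:

# Hypothetical sketch: choose which parameters the optimistic update touches.
# Module names are illustrative; src/oppo.py may use different identifiers.
def optimistic_params(agent, optimism_scope):
    if optimism_scope == "policy":
        return list(agent.policy_net.parameters())  # actor only
    return list(agent.policy_net.parameters()) + list(agent.value_net.parameters())  # actor + critic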

2. Reproducing Paper Experiments

The experiments/ directory contains scripts to reproduce the results from the report.

Core MARL Benchmarks:

# Runs Checkers, Go, Rware with various betas
python experiments/run_paper_experiments.py --experiment core

Imperfect Information Games:

# Runs Hanabi, Leduc Poker, Matrix RPS
python experiments/run_paper_experiments.py --experiment imperfect

Grid Sweep:

# Run a broader sweep over environments and parameters
python experiments/run_grid.py

3. Atari / Exploration

For Atari games (Montezuma's Revenge, Seaquest, etc.), use the specialized script:

python src/oppo_atari.py --env montezuma --optimism-beta 0.1 --anneal-beta
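
The --anneal-beta flag decays the optimism coefficient over training (see Key Findings). Below is a minimal sketch of one possible schedule, a linear decay to zero; the helper name and the exact schedule are assumptions, not necessarily what src/oppo_atari.py implements.

# Hypothetical sketch: linearly anneal beta from its initial value to 0 over training.
# The actual schedule in src/oppo_atari.py may differ.
def annealed_beta(beta_init, global_step, total_timesteps):
    frac = max(1.0 - global_step / total_timesteps, 0.0)
    return beta_init * frac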

Repository Structure

  • src/oppo.py: Main PPO implementation with support for PufferLib Ocean environments.
  • src/oppo_atari.py: Adapted implementation for single-player Atari/Crafter benchmarks.
  • experiments/: Scripts for launching large-scale sweeps and specific figure reproduction.
  • results/: Directory where logs and summary CSVs are stored.
    • Note: The results used in the paper are stored as CSV files under results/firstrun, results/secondrun, and results/thirdrun for analysis.
  • setup/: Setup helpers (e.g., for WSL).

Implementation Details

The core optimism mechanism is a simple gradient extrapolation: $g_t \leftarrow g_t + \beta\,(g_t - g_{t-1})$

This mimics the "optimistic" prediction step in online learning algorithms. We also implement a "lookahead" variant (extra-gradient) which evaluates the loss at a predicted future parameter.
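
A minimal PyTorch-style sketch of the delta (extrapolation) mode is shown below, for illustration only; the function and the prev_grads buffer are assumptions, and src/oppo.py may organize this differently.

import torch

# Hypothetical sketch of the "delta" mode: after loss.backward() and before
# optimizer.step(), replace each gradient g with g + beta * (g - g_prev).
def apply_optimism(params, prev_grads, beta):
    for p, g_prev in zip(params, prev_grads):
        if p.grad is None:
            continue
        g = p.grad.detach().clone()
        p.grad.add_(g - g_prev, alpha=beta)  # g <- g + beta * (g - g_prev)
        g_prev.copy_(g)                      # remember g for the next update

# Usage sketch: initialize prev_grads = [torch.zeros_like(p) for p in params] before training.

The lookahead (extra-gradient) variant instead takes a provisional step using the previous gradient, evaluates the PPO loss at that predicted point, and applies the resulting gradient at the original parameters.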

For full technical details, please refer to the paper.
