
Conversation


@dzorlu dzorlu commented Jan 25, 2026

Summary

  • Remove <done> from vLLM stop tokens to fix premature generation termination
  • Qwen3 thinking models output <done> inside <think> sections when describing their plan
  • This caused vLLM to stop generation before the model could actually call tools

Problem

<think>
...I'll inform the user and include the <done>   ← vLLM STOPS HERE
</think>

Model was planning to say <done>, not actually signaling completion.

Solution

  • Only stop on </tool_call> for tool execution (see the sketch below)
  • Let model naturally finish with <|im_end|> (EOS token)
  • env.step() already detects <done> in the full response to mark episode completion
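
For illustration only, a minimal sketch of what the stop-string change amounts to, written against vLLM's Python `SamplingParams` API; the PR itself makes the equivalent change in the task YAMLs listed under Files Changed, so the surrounding configuration here is an assumption:

```python
from vllm import SamplingParams

# Sketch of the stop-token change (illustrative; the real change lives in the task YAMLs).
# Before: stop=["</tool_call>", "<done>"] halted generation as soon as the model
# mentioned <done> inside its <think> block.
# After: only </tool_call> stops generation; the model otherwise ends at its EOS
# token (<|im_end|>), and env.step() scans the full response for <done>.
sampling_params = SamplingParams(
    max_tokens=4096,
    stop=["</tool_call>"],
    include_stop_str_in_output=True,  # keep the closing tag visible to the tool-call parser
)
```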

Files Changed

  • skyrl-train/tasks/openenv-fleet-grpo-qwen3-8b.yaml
  • skyrl-train/tasks/openenv-fleet-grpo.yaml

🤖 Generated with Claude Code

dzorlu and others added 30 commits January 20, 2026 18:35
* feat: Add OpenEnv Fleet training CI with SkyPilot

- Add SkyPilot task YAML for GRPO training on neoclouds (Lambda, RunPod, Vast)
- Add GitHub Actions workflow for PR-triggered training runs
- Update .gitignore for SkyPilot venv

The workflow:
1. Triggers on PRs to main (paths: skyrl-train/integrations/openenv/**)
2. Configures neocloud credentials (Lambda, RunPod, Vast)
3. Launches training job via `sky jobs launch`
4. Posts job info as PR comment

Required secrets:
- LAMBDA_API_KEY
- RUNPOD_API_KEY
- VAST_API_KEY
- FLEET_API_KEY
- WANDB_API_KEY_TOOL_USE
- WANDB_API_KEY_COMPUTER_USE

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Only trigger workflow on skyrl-train/tasks/** changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Add permissions for PR comments

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Remove auto-trigger, manual dispatch only

Training is expensive - only trigger manually via Actions tab.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Simplify WandB key + add Slack notifications

- Use single WANDB_API_KEY instead of per-modality keys
- Add Slack notification to #fleet-training channel

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Slack channel to #fleet-training-runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Use modality-specific WandB keys

- WANDB_API_KEY_TOOL_USE for tool_use modality
- WANDB_API_KEY_COMPUTER_USE for computer_use modality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Proper WandB key selection + separate Slack success/failure notifications

- Use job-level env vars for bash runtime selection
- Add validation that WandB key is set
- Separate Slack notifications for success vs failure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add WandB dashboard link to Slack notification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Sync launch - wait for job completion before Slack notification

- Removed --async flag, workflow now waits for training to complete
- Slack notification shows completion status (success/failure)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Add Fleet task integration for SkyRL training

This PR adds a new integration for training on Fleet-hosted environments:

- `integrations/fleet/env.py`: FleetTaskEnv that wraps OpenEnv's FleetTaskEnv
  and adapts it to SkyRL's BaseTextEnv interface
- `integrations/fleet/prepare_dataset.py`: Converts Fleet task JSON files to
  SkyRL parquet dataset format
- `integrations/fleet/entrypoints/main_fleet.py`: Training entrypoint that
  registers the fleet_task environment
- Updated `tasks/openenv-fleet-grpo.yaml` to use Fleet integration

The integration supports:
- Loading tasks from Fleet API or JSON files
- MCP tool execution via OpenEnv's FleetTaskEnv
- Verifier-based rewards on episode completion
- Filtering by modality (tool_use/computer_use) and env_key

Usage:
```bash
sky jobs launch tasks/openenv-fleet-grpo.yaml \
  --env FLEET_API_KEY=sk_... \
  --env WANDB_API_KEY=wandb_... \
  --env MODALITY=tool_use
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Simplify FleetTaskEnv and add unit tests

- Simplified parse_tool_call: removed complex regex patterns, now only
  supports tag-based formats (<tool_call>, <function_call>); see the sketch after this list
- Removed unused _verified attribute
- Extracted _run_async helper for cleaner async handling
- Removed tools_cache (not needed after init)
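
A rough sketch of the tag-based parsing described in the first bullet above; the actual helper in integrations/fleet/env.py may accept other shapes and return a different structure:

```python
import json
import re
from typing import Optional


def parse_tool_call(text: str) -> Optional[dict]:
    """Extract a JSON tool call from <tool_call>...</tool_call> or
    <function_call>...</function_call> tags; return None if no valid call is found.

    Sketch only: the real implementation may differ.
    """
    for tag in ("tool_call", "function_call"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if not match:
            continue
        try:
            call = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict) and "name" in call:
            return {"name": call["name"], "arguments": call.get("arguments", {})}
    return None
```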

Added comprehensive unit tests:
- load_tasks_from_json: array/object formats, caching, errors
- parse_tool_call: various formats, edge cases
- FleetTaskEnv.__init__: validation, config priority
- FleetTaskEnv.init: environment creation, tools info
- FleetTaskEnv.step: tool calls, done signal, max turns
- FleetTaskEnv.close: cleanup, error handling
- FleetTaskEnv.get_metrics/aggregate_metrics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GHA workflow for Fleet task integration

- Replace env_name with env_key input for Fleet environment filtering
- Add max_tasks input for limiting tasks during testing
- Update Slack notifications to say "Fleet Task" instead of "OpenEnv"
- Update WandB project link to fleet-task-grpo
- Build launch command conditionally (only pass ENV_KEY/MAX_TASKS if set)
- Add max_tasks to job summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix SkyPilot task for OpenEnv namespace package

- Clone OpenEnv repo for namespace package access (envs/ has no __init__.py)
- Add OpenEnv/src to PYTHONPATH so 'from envs.fleet_env import ...' works
- Remove redundant --with openenv since we use PYTHONPATH
- Use fleet-integration branch (update to main after PR merge)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use main branch for SkyPilot workdir

The workflow should use main branch - merge PR first before running.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix ruff linting errors

- Remove unused variable 'loop' in _run_async()
- Remove unused variable 'call_args' in test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix linting: remove unused imports, apply black formatting

- Remove unused imports (tempfile, AsyncMock)
- Apply black formatting to all fleet integration files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Fix YAML parsing error and improve Slack notifications

1. Move inline Python script to separate file (export_tasks.py)
   - Heredoc in YAML caused parsing errors

2. Fix Job ID extraction
   - Use portable grep patterns instead of Perl-style \K
   - Try multiple patterns for different SkyPilot output formats

3. Clarify Slack notifications
   - "Job Launched" instead of "Training Completed"
   - This workflow only launches the job, doesn't wait for completion
   - Actual training status comes from WandB or sky jobs logs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix: Move API keys from secrets to envs section

SkyPilot secrets require --secret flag, but GHA passes via --env.
Move WANDB_API_KEY and FLEET_API_KEY to envs section with empty
defaults that get overridden by --env flags.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Add disk_size: 40 (RunPod max is 40GB, default was 50GB causing failures)
- Remove aws (not configured in GHA workflow)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The managed jobs controller has its own default disk_size=50,
independent of our task YAML. Configure it via ~/.sky/config.yaml
to use disk_size=30 (RunPod limit is 40GB).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Split workflow into separate steps: Submit → Slack Submitted → Stream Logs → Slack Done
- Stream training logs directly to GHA using `sky jobs logs --follow`
- Add job status tracking and report final status in Slack
- Switch from A100:4 to H100:2 (easier to find availability)
- Remove memory requirement (H100 has sufficient memory)
- Add PR_NUMBER env var to training job

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The Configure SkyPilot step was setting disk_size for the controller
but this might be causing issues. Let SkyPilot auto-select controller
resources from available clouds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Use `sky launch` instead of `sky jobs launch` for direct GPU provisioning
- No more CPU controller - GPUs are provisioned directly
- Use `sky logs --follow` to stream training logs
- Auto-terminate cluster after training completes
- Add cleanup step on failure

This removes the managed jobs controller abstraction and directly
provisions H100:2 GPU instances on Lambda/RunPod/Vast.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
H100:2 is more readily available on Lambda/RunPod/Vast.
Removed memory requirement (H100 has sufficient memory).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Remove OpenEnv dependency, use Fleet SDK directly

Changes:
- Rewrite env.py to use Fleet SDK directly (fleet.load_tasks, task.make, task.verify)
- Remove all OpenEnv clone/install steps from task YAML
- Update __init__.py with lazy import to handle missing dependencies gracefully
- Comprehensive tests for error cases (file not found, task not found, API errors, etc.)

This fixes the setup failure where OpenEnv (private repo) couldn't be cloned.

Error cases now covered:
- Missing tasks file
- Invalid JSON format
- Empty tasks array
- Task not found in file
- Task not found in Fleet API
- task.make() failure
- Tool execution errors
- Verifier errors
- Close errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Move Fleet tests to tests/cpu/ for CI pickup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix black formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Mock fleet and skyrl_gym in tests to allow running without deps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add noqa for E402 (module imports after sys.modules mock)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix BaseTextEnvStepOutput mock to return actual objects

The MagicMock was causing assertions to fail because attributes like result.done
were returning MagicMock objects instead of actual values.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Skip Fleet tests if dependencies unavailable instead of mocking sys.modules

The previous approach of mocking sys.modules at module level polluted
global state and broke other tests (skyrl_gym.envs became MagicMock).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix test isolation by using importlib instead of sys.modules mocking

The previous approach of mocking sys.modules polluted global state and
broke other tests (generator tests). This change uses importlib.util.find_spec
to check for dependencies and pytest.mark.skipif to skip tests when
dependencies aren't available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use OpenEnv's FleetTaskEnv instead of Fleet SDK directly

This commit rewrites the Fleet integration to use OpenEnv's FleetTaskEnv
as the abstraction layer instead of calling Fleet SDK directly.

Changes:
- integrations/fleet/env.py: Rewrite to use envs.fleet_env.FleetTaskEnv
- tasks/openenv-fleet-grpo.yaml: Install openenv package
- tests/cpu/test_fleet_env.py: Update mocks for OpenEnv
- CLAUDE.md: Add instructions about integration patterns

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This fixes the error:
  Key 'fleet_task' is not in struct
  full_key: environment.skyrl_gym.fleet_task

Changes:
- Add fleet_task section to skyrl_gym_config/default.yaml with:
  - tasks_file: Path to exported Fleet tasks JSON
  - api_key: Fleet API key (defaults to FLEET_API_KEY env var)
  - ttl_seconds: TTL for Fleet environment instances (default: 600)
- Add tests for config validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Change H100:2 to H100:1 due to capacity constraints on cloud
providers. The config parameters using $TOTAL_GPUS will
automatically adjust to 1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Add 100 booking tasks sample (fleet_booking_sample.json)
- Update YAML to use committed sample instead of Fleet API export
- Simplifies initial testing before full dataset

Environment: booking
Tasks: 100
Modality: tool_use

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Change `from envs.fleet_env import FleetTaskEnv` to
`from openenv import FleetTaskEnv` - the correct import
path when openenv is installed via pip.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Change `fleet-ai` to `thefleet` in WandB URLs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
FleetTaskEnv is not in PyPI release - install from git branch.
- Change import to `from envs.fleet_env import FleetTaskEnv`
- Install from git+https://github.com/fleet-ai/OpenEnv.git@deniz/fleet_client

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Change terminology from "Cluster" to "Run" in user-facing messages
- Update GPU count from H100:2 to H100:1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The --with "openenv" was installing from PyPI, overriding
the git branch install. Now uses git URL directly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Comprehensive guide covering:
- How to trigger training via GitHub Actions
- Required secrets and configuration
- Training hyperparameters
- Monitoring and troubleshooting
- Architecture overview

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* feat: Add Fleet task integration to skyrl-agent

Implements multi-turn agentic training on Fleet-hosted environments:

- Add FleetTask class implementing BaseTask interface
- Add skyrl_fleet.yaml config for training
- Add run_fleet.sh launch script
- Add sample dataset (100 booking tasks)
- Add unit tests for Fleet task
- Move docs from skyrl-train to skyrl-agent

The Fleet task:
- Creates Fleet environments via FleetTaskEnv (from OpenEnv)
- Provides task prompts as agent instructions
- Exposes MCP tools for agent interaction
- Evaluates results using task verifiers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: Format fleet_task.py with black

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: Remove Fleet integration from skyrl-train

Fleet integration has been moved to skyrl-agent.
Deleting old files to avoid confusion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Fix GitHub Actions workflow path for Fleet training

- Create SkyPilot YAML in skyrl-agent/tasks/fleet-task-training.yaml
- Update workflow to reference new YAML path instead of deleted skyrl-train path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update docs to reference correct SkyPilot YAML path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Change Slack channel to #fleet-training-runs-test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix YAML syntax error and add validation tests

- Move inline Python code to separate script (scripts/prepare_fleet_dataset.py)
- Add TestYAMLValidation class to catch YAML syntax errors in CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: Format with black

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Restore skyrl-train/integrations/fleet/ from before PR #21
- Restore skyrl-train/tasks/openenv-fleet-grpo.yaml
- Remove skyrl-agent fleet files (tasks/fleet, tests, examples)
- Move data and docs back to skyrl-train
- Update workflow to use skyrl-train path

This reverts the approach from PR #21 which moved Fleet integration
to skyrl-agent. The skyrl-train approach using BaseTextEnv is simpler
and better suited for Fleet's MCP tool interface.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* feat: Add S3 checkpoint upload to prevent disk exhaustion

- Add S3CheckpointUploader module for async checkpoint upload (see the sketch below)
- Wrap trainer to upload checkpoints to S3 after each save
- Delete local checkpoints after successful upload to save disk
- Increase disk_size from 40GB to 100GB as fallback
- Add AWS credentials support in workflow and SkyPilot YAML

GitHub Secrets required:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY

S3 bucket: skyrl-checkpoints (configurable via S3_CHECKPOINT_BUCKET)
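
A hedged sketch of the upload-then-delete flow described above, using boto3; the class name S3CheckpointUploader comes from the commit message, but the method names and threading approach here are assumptions, not the actual s3_checkpoints.py API:

```python
import shutil
import threading
from pathlib import Path

import boto3


class S3CheckpointUploader:
    """Sketch of an async checkpoint uploader (names are illustrative)."""

    def __init__(self, bucket: str = "skyrl-checkpoints", prefix: str = "checkpoints"):
        self.bucket = bucket
        self.prefix = prefix
        self.s3 = boto3.client("s3")

    def upload_async(self, checkpoint_dir: str, delete_local: bool = True) -> threading.Thread:
        # Run the upload in a background thread so training is not blocked.
        thread = threading.Thread(target=self._upload, args=(checkpoint_dir, delete_local), daemon=True)
        thread.start()
        return thread

    def _upload(self, checkpoint_dir: str, delete_local: bool) -> None:
        root = Path(checkpoint_dir)
        for path in root.rglob("*"):
            if path.is_file():
                key = f"{self.prefix}/{root.name}/{path.relative_to(root)}"
                self.s3.upload_file(str(path), self.bucket, key)
        if delete_local:
            shutil.rmtree(root)  # free local disk after a successful upload
```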

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Include model name in S3 checkpoint path

* refactor: Use Path library for model name extraction

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
* docs: Simplify fleet-training.md

* fix: Make checkpoint cleanup synchronous, always clean old checkpoints before save

- Upload to S3 is now synchronous (blocking) to ensure disk space freed
- Always clean up old checkpoints BEFORE saving new one (keeps only 1 local)
- Works with or without AWS credentials (local cleanup still happens)
- Simplified entrypoint - always wraps trainer for checkpoint management

* fix: Revert to async upload, key fix is cleanup BEFORE save

The actual issue was:
1. AWS credentials not set → no S3 upload happening
2. Old checkpoints not cleaned up BEFORE saving → disk full

Fix:
- Keep async upload (non-blocking, better performance)
- Clean up old checkpoints BEFORE saving new one (key fix)
- Keep 2 local checkpoints for safety margin (~10GB)
- S3 upload deletes local after successful upload

* fix: Increase disk to 200GB and limit Ray object store memory

The first checkpoint save was failing because the disk was already 95% full
from Ray temp files, vLLM cache, and model/optimizer states before any
checkpoint was saved.

Changes:
- Increase disk_size from 100GB to 200GB
- Limit Ray object store memory to 10GB (prevents unbounded growth)
- Add --object-store-memory flag to ray start command

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add task_file input parameter to workflow

Allows selecting which SkyPilot YAML file to use, enabling users to test
different configurations by pushing custom YAML files.

Default: skyrl-train/tasks/openenv-fleet-grpo.yaml

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Change task_file to task_name input parameter

Use just the task name (e.g., 'openenv-fleet-grpo') instead of full path.
Path is constructed as: skyrl-train/tasks/{task_name}.yaml

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: Fix black formatting in s3_checkpoints.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Update fleet-training.md with task_name param and Slack channel

- Add task_name parameter mention in Quick Start
- Change Slack channel to #fleet-training-runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Add --with boto3 to uv run --isolated (required for S3 uploads)
- Change Slack channel from #fleet-training-runs-test to #fleet-training-runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* feat: Use S3 datasets instead of sample data

Downloads real task datasets from S3:
- tool_use: s3://fleet-internal-datasets/v0.1/openenv/all_tool_use.json (3,603 tasks)
- computer_use: s3://fleet-internal-datasets/v0.1/openenv/all_computer_use.json (1,278 tasks)

Changes:
- Add S3_DATASET_BUCKET env var
- Download dataset from S3 based on MODALITY
- Add validation for required env vars (FLEET_API_KEY, AWS credentials, MODALITY)
- Make AWS credentials required in workflow
- Update docs to reflect AWS is now required

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Remove troubleshooting and creating PRs sections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
dzorlu and others added 16 commits January 22, 2026 22:51
- Add id to "Training Run Started" step to capture message ts
- Use thread_ts in Completed/Failed messages to reply in thread

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix: Handle Fleet env reset failures gracefully

Instead of crashing on reset timeout, log error and continue:
- Add logging to fleet env.py
- On reset failure: log error, mark _init_failed=True
- In step(): if init failed, return done=True with reward=0

This allows training to continue even if some environments fail to reset; a condensed sketch of the pattern follows below.
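
A self-contained toy illustration of that failure-handling pattern (not the real FleetTaskEnv; names and return types are simplified):

```python
import logging

logger = logging.getLogger(__name__)


class FleetTaskEnvSketch:
    """Toy sketch of the reset-failure handling described above."""

    def __init__(self, task_key: str, openenv_task_env):
        self.task_key = task_key
        self.openenv_task_env = openenv_task_env
        self._init_failed = False

    def init(self):
        try:
            self.openenv_task_env.reset()
        except Exception as e:
            # Log and continue instead of crashing the rollout worker.
            logger.error(f"Failed to reset Fleet environment for task {self.task_key}: {e}")
            self._init_failed = True

    def step(self, action: str) -> dict:
        if self._init_failed:
            # Environment never came up: end the episode with zero reward so
            # training can continue with the other environments in the batch.
            return {"observations": [], "reward": 0.0, "done": True, "metadata": {"done_reason": "init_failed"}}
        # ... normal tool-call handling would go here ...
        return {"observations": [], "reward": 0.0, "done": False, "metadata": {}}
```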

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: Clean up verbose import blocks

* fix: Move logger after imports to fix ruff E402

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…re (#33)

The model can still generate actions even if env reset failed.
Errors are handled in the tool execution try/except blocks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
SkyPilot will try H100 first on all clouds (Lambda, RunPod, Vast),
then fall back to B200 if no H100 capacity available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* feat: Add per-environment metrics breakdown in WandB

- Add data_source field to Fleet dataset records (using env_key)
- Add calculate_per_source_reward_metrics() for training metrics
- Update postprocess_generator_output to compute per-env metrics
- Track data_source through GeneratedOutputGroup in async trainer

This enables WandB to show performance broken down by Fleet environment
(github, booking, reddit, etc.) for both training and eval metrics.

Metrics format (aggregation sketched below):
- reward/{env_key}/avg_score (training)
- reward/{env_key}/pass_at_N (training)
- eval/{env_key}/avg_score (evaluation)
- eval/{env_key}/pass_at_N (evaluation)
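
A small sketch of how the per-source aggregation could work; the real calculate_per_source_reward_metrics() in trainer_utils.py may differ in signature and in how pass@N is defined:

```python
from collections import defaultdict
from typing import Dict, List


def per_source_reward_metrics(
    rewards: List[float], data_sources: List[str], n_samples_per_prompt: int
) -> Dict[str, float]:
    """Group trajectory rewards by data_source (env_key) and compute avg score and pass@N.

    Sketch only: assumes trajectories for one prompt are contiguous and share a data_source.
    """
    by_source: Dict[str, List[float]] = defaultdict(list)
    for reward, source in zip(rewards, data_sources):
        by_source[source].append(reward)

    metrics: Dict[str, float] = {}
    for source, values in by_source.items():
        metrics[f"reward/{source}/avg_score"] = sum(values) / len(values)
        groups = [values[i : i + n_samples_per_prompt] for i in range(0, len(values), n_samples_per_prompt)]
        metrics[f"reward/{source}/pass_at_{n_samples_per_prompt}"] = sum(
            1 for g in groups if max(g) > 0
        ) / len(groups)
    return metrics
```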

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add pre-commit formatting rule to CLAUDE.md

Always run pre-commit before creating PRs to ensure code is properly formatted.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: Fix black formatting in trainer_utils.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Add primeintellect to SkyPilot task config for H100 and B200
- Configure Prime Intellect credentials in GitHub workflow
- Add PRIME_INTELLECT_API_KEY to required secrets in docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* feat: stratified eval split with held-out test environments

- Stratified split by environment (each env maintains train/eval ratio)
- Hash-based deterministic assignment (same task always goes to same split)
- Minimum 10 eval samples per env (otherwise all go to train)
- Held-out test envs: outlook (tool_use), instacart (computer_use)
- Document split strategy in fleet-training.md
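
A sketch of the hash-based assignment; hash_to_split is named after the helper mentioned in the follow-up test commit, and the ratio and thresholds shown are the values from this commit (later commits change them), so treat the constants as illustrative:

```python
import hashlib

MIN_EVAL_SAMPLES = 10                      # per-env threshold from this commit (later lowered to 5)
HELD_OUT_ENVS = {"outlook", "instacart"}   # held-out test environments


def hash_to_split(task_key: str, eval_ratio: float = 0.1) -> str:
    """Deterministically assign a task to 'train' or 'eval' from a hash of its key,
    so the same task always lands in the same split across runs. Sketch only."""
    digest = hashlib.sha256(task_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "eval" if bucket < eval_ratio else "train"
```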

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test: add tests for prepare_dataset stratified split

- 23 tests covering hash_to_split, _task_to_record, load_tasks_from_json
- Integration tests for held-out envs, stratified split, modality/env filters
- Test deterministic split reproducibility across runs
- Add per-environment breakdown table to summary output

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: simplify to train/eval splits, fix WandB Slack links

- Remove test split - held-out envs (outlook, instacart) now go to eval
- Update Slack messages to link to specific WandB run (not just project)
- Update docs and tests to reflect train/eval only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: simplify WandB run name to fleet_{modality}_{random}

- Remove PR number and env_key from run name
- Use random 8-char hex suffix for uniqueness
- Fallback to random if RUN_ID not set (for local runs)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
WandB search URLs don't work reliably. Instead:
- Show the run name as text (e.g., fleet_tool_use_a3f2b1)
- Link to project dashboard (user can search by name)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Change eval_ratio from 10% to 2% (default in prepare_dataset.py)
- Lower MIN_EVAL_SAMPLES from 10 to 5 (threshold for creating eval split)
- Add eval_n_samples_per_prompt=3 to YAML (vs 4 for train)
- Update tests and documentation

This reduces eval time from ~50 min to ~5 min per epoch:
- Before: 366 samples × 4 trajectories = 1464 total
- After: ~88 samples × 3 trajectories = ~264 total

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Change cleanup step condition from `failure()` to `failure() || cancelled()`
- Add Slack notification for cancelled runs

Previously, cancelling via GitHub UI left GPU pods running.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Add new task config `openenv-fleet-grpo-qwen3-8b.yaml`:
- Model: Qwen/Qwen3-8B (8B params, instruct-tuned)
- GPUs: B200:2 (preferred) or H100:4 (fallback)

Original config unchanged (Qwen2.5-1.5B-Instruct, H100:1).

Usage:
  sky launch tasks/openenv-fleet-grpo-qwen3-8b.yaml ...

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Replace hardcoded "H100:1" with dynamic task_name in Slack
notifications and workflow summary. Each task config defines
its own GPU requirements.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
4 → 8 prompts per batch (32 trajectories with n_samples=4)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The model was not using tools - instead giving suggestions/instructions.

Root cause: No system message was being sent, and only tool names (not full
schemas) were added to the user prompt.

Fix:
- Add a system prompt that clearly explains the model is an agent that must
  execute tools
- Include full tool definitions as JSON (not just names), as sketched below
- Separate system and user messages properly
- Use concise prompt style similar to theseus orchestrator
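
A hedged sketch of the message construction after this fix; the actual system prompt wording used in the integration is not reproduced here:

```python
import json


def build_messages(task_prompt: str, tools: list) -> list:
    """Build separate system and user messages with full tool schemas embedded as JSON.

    Sketch only: the real system prompt text differs.
    """
    system_prompt = (
        "You are an agent that completes tasks by executing tools. "
        'Respond with <tool_call>{"name": "...", "arguments": {...}}</tool_call> to call a tool, '
        "and emit <done> only after the required tool calls have been executed.\n\n"
        "Available tools:\n" + json.dumps(tools, indent=2)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task_prompt},
    ]
```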

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Upload eval trajectories to S3 for persistence

Eval results (JSONL files with full multi-turn traces) were only saved locally
and lost when the cluster terminated.

Changes:
- Add upload_eval_results_to_s3() function to s3_checkpoints.py
- Call S3 upload after local dump in evaluate.py (both regular and step_wise)
- Use separate bucket: S3_TRAJECTORY_BUCKET (default: skyrl-trajectories)
- Add S3_TRAJECTORY_BUCKET env var to SkyPilot task YAMLs

S3 path: s3://skyrl-trajectories/evals/{run_name}/global_step_{N}/

The JSONL files contain full multi-turn traces with model reasoning,
tool calls, and tool results (all decoded from response_ids).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* learnings

* Fix premature <done> by improving system prompt

Model was saying <done> after thinking without calling tools.
Updated prompt to explicitly require tool calls before <done>.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add max_input_length for multi-turn context growth

- MAX_INPUT_LENGTH=24000 (fits in Qwen3-8B's 32K context)
- MAX_GENERATE_LENGTH=4096
- Prevents context truncation during long tool-use conversations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Make training fully on-policy

Set policy_mini_batch_size = train_batch_size for single optimizer step per batch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add thinking tokens documentation to learnings

Explains when/why thinking gets stripped and how to preserve it for on-policy training.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Qwen3 thinking models may output <done> inside <think> sections when
describing their plan, causing premature generation termination.

Now vLLM only stops on </tool_call>. The env.step() already detects
<done> in the full response to mark episode completion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vercel bot commented Jan 25, 2026

Someone is attempting to deploy a commit to the Tyler's projects Team on Vercel.

A member of the Team first needs to authorize it.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a major new feature: integration with Fleet-hosted environments for reinforcement learning training. It adds a comprehensive set of new modules, including a custom environment, dataset preparation scripts, S3 checkpointing, and extensive documentation and tests. The change to remove <done> from vLLM stop tokens, as mentioned in the title, is a small but important part of this larger integration. My review focuses on the new Fleet integration code. I've identified a critical issue with asynchronous code handling that could lead to runtime errors, as well as some high-severity issues related to external dependencies and error handling. Additionally, there are opportunities to improve maintainability by reducing code duplication and to fix inaccuracies in documentation. Overall, this is a substantial and valuable contribution that will be ready for merging after addressing the feedback.

Comment on lines +223 to +318
    def step(self, action: str) -> BaseTextEnvStepOutput:
        """
        Execute one step in the Fleet environment.

        Parses the action for tool calls, executes via OpenEnv's FleetTaskEnv,
        and returns observation. Reward is computed by the verifier on completion.
        """
        self.turns += 1
        self.chat_history.append({"role": "assistant", "content": action})

        max_turns_reached = self.turns >= self.max_turns

        # Check if agent signals completion
        agent_done = "<done>" in action.lower() or "[done]" in action.lower()

        # Parse tool call from LLM response
        tool_call = parse_tool_call(action)

        tool_result = None
        error = None
        reward = 0.0

        # Execute tool call if present via OpenEnv
        if tool_call and self.openenv_task_env:
            # Build action dict for OpenEnv
            openenv_action = {
                "tool": tool_call["name"],
                "params": tool_call.get("arguments", {}),
                "done": agent_done,
            }

            try:
                # Use async step method
                obs, reward, done, info = asyncio.get_event_loop().run_until_complete(
                    self.openenv_task_env.step_async(openenv_action)
                )
                tool_result = obs.get("observation")
                if "tool_error" in info:
                    error = info["tool_error"]
            except Exception as e:
                error = str(e)
        elif agent_done and self.openenv_task_env:
            # Agent signaled done without tool call
            openenv_action = {"done": True}
            try:
                obs, reward, done, info = asyncio.get_event_loop().run_until_complete(
                    self.openenv_task_env.step_async(openenv_action)
                )
            except Exception as e:
                error = str(e)

        # Check if episode is done
        episode_done = agent_done or max_turns_reached

        # Build observation message
        if max_turns_reached:
            return BaseTextEnvStepOutput(
                observations=[],
                reward=reward,
                done=True,
                metadata={"done_reason": "max_turns", "task_key": self.task_key},
            )

        # Build response observation
        if error:
            obs_content = f"Error: {error}"
        elif tool_result:
            if isinstance(tool_result, dict):
                obs_content = f"Tool result:\n{json.dumps(tool_result, indent=2)}"
            else:
                obs_content = f"Tool result:\n{tool_result}"
        elif agent_done:
            obs_content = "Task marked as complete."
        elif not tool_call:
            obs_content = 'No tool call found. Use <tool_call>{"name": "...", "arguments": {...}}</tool_call> format.'
        else:
            obs_content = "Action executed."

        new_obs = {"role": "user", "content": obs_content}
        self.chat_history.append(new_obs)

        metadata = {
            "task_key": self.task_key,
            "turn": self.turns,
            "tool_call": tool_call,
            "tool_result": tool_result,
            "error": error,
            "done_reason": "agent_done" if agent_done else None,
        }

        return BaseTextEnvStepOutput(
            observations=[new_obs],
            reward=reward,
            done=episode_done,
            metadata=metadata,
        )

critical

The step method is synchronous but calls asyncio.get_event_loop().run_until_complete() to execute an async function. Since the training loop is initiated with asyncio.run(), an event loop is already running. This will cause a RuntimeError: This event loop is already running.

To fix this, the step method should be defined as async def and use await to call self.openenv_task_env.step_async(). This change might require updating the BaseTextEnv class, but it's the correct approach for handling asynchronous operations within an async application. A similar issue exists on lines 268-270.

    async def step(self, action: str) -> BaseTextEnvStepOutput:
        """
        Execute one step in the Fleet environment.

        Parses the action for tool calls, executes via OpenEnv's FleetTaskEnv,
        and returns observation. Reward is computed by the verifier on completion.
        """
        self.turns += 1
        self.chat_history.append({"role": "assistant", "content": action})

        max_turns_reached = self.turns >= self.max_turns

        # Check if agent signals completion
        agent_done = "<done>" in action.lower() or "[done]" in action.lower()

        # Parse tool call from LLM response
        tool_call = parse_tool_call(action)

        tool_result = None
        error = None
        reward = 0.0

        # Execute tool call if present via OpenEnv
        if tool_call and self.openenv_task_env:
            # Build action dict for OpenEnv
            openenv_action = {
                "tool": tool_call["name"],
                "params": tool_call.get("arguments", {}),
                "done": agent_done,
            }

            try:
                # Use async step method
                obs, reward, done, info = await self.openenv_task_env.step_async(openenv_action)
                tool_result = obs.get("observation")
                if "tool_error" in info:
                    error = info["tool_error"]
            except Exception as e:
                error = str(e)
        elif agent_done and self.openenv_task_env:
            # Agent signaled done without tool call
            openenv_action = {"done": True}
            try:
                obs, reward, done, info = await self.openenv_task_env.step_async(openenv_action)
            except Exception as e:
                error = str(e)

        # Check if episode is done
        episode_done = agent_done or max_turns_reached

        # Build observation message
        if max_turns_reached:
            return BaseTextEnvStepOutput(
                observations=[],
                reward=reward,
                done=True,
                metadata={"done_reason": "max_turns", "task_key": self.task_key},
            )

        # Build response observation
        if error:
            obs_content = f"Error: {error}"
        elif tool_result:
            if isinstance(tool_result, dict):
                obs_content = f"Tool result:\n{json.dumps(tool_result, indent=2)}"
            else:
                obs_content = f"Tool result:\n{tool_result}"
        elif agent_done:
            obs_content = "Task marked as complete."
        elif not tool_call:
            obs_content = 'No tool call found. Use <tool_call>{"name": "...", "arguments": {...}}</tool_call> format.'
        else:
            obs_content = "Action executed."

        new_obs = {"role": "user", "content": obs_content}
        self.chat_history.append(new_obs)

        metadata = {
            "task_key": self.task_key,
            "turn": self.turns,
            "tool_call": tool_call,
            "tool_result": tool_result,
            "error": error,
            "done_reason": "agent_done" if agent_done else None,
        }

        return BaseTextEnvStepOutput(
            observations=[new_obs],
            reward=reward,
            done=episode_done,
            metadata=metadata,
        )

Comment on lines +177 to +180
        except Exception as e:
            logger.error(f"Failed to reset Fleet environment for task {self.task_key}: {e}")
            self._init_failed = True
            obs = {}

high

The init method catches any exception during self.openenv_task_env.reset(), logs an error, and then continues. This can mask critical environment setup failures, leading to silent errors or unpredictable behavior later in the training process. The test test_init_reset_fails correctly expects a RuntimeError in this scenario. It would be better to re-raise the exception to make the failure explicit and prevent the system from continuing in a bad state.

Suggested change
        except Exception as e:
            logger.error(f"Failed to reset Fleet environment for task {self.task_key}: {e}")
            self._init_failed = True
            obs = {}
        except Exception as e:
            logger.error(f"Failed to reset Fleet environment for task {self.task_key}: {e}")
            raise RuntimeError(f"Failed to reset Fleet environment for task {self.task_key}: {e}") from e

uv pip install wandb boto3 awscli
# Install OpenEnv for Fleet environment access (from branch with FleetTaskEnv)
uv pip install "git+https://github.com/fleet-ai/OpenEnv.git@deniz/fleet_client" fleet-python

high

The setup script installs OpenEnv from a personal git branch (deniz/fleet_client). This introduces a dependency on a branch that may not be stable or could be deleted, making the build process fragile and difficult to reproduce. It is highly recommended to use a released version from PyPI or a stable branch from the main repository. This same issue is present on line 148.

  uv pip install openenv-client fleet-python

Comment on lines +21 to +22
- Fleet SDK repo: `/Users/deniz/repos/fleet-sdk`
- OpenEnv repo: `/Users/deniz/repos/OpenEnv`

medium

The repository paths for Fleet SDK and OpenEnv are hardcoded to a local user directory. For this documentation to be useful to all contributors, these should be replaced with the public repository URLs.

Suggested change
- Fleet SDK repo: `/Users/deniz/repos/fleet-sdk`
- OpenEnv repo: `/Users/deniz/repos/OpenEnv`
- Fleet SDK repo: https://github.com/fleet-ai/fleet-sdk
- OpenEnv repo: https://github.com/fleet-ai/OpenEnv

# Fleet Task Environment Integration for SkyRL
#
# This module provides a SkyRL-compatible environment wrapper for Fleet-hosted tasks.
# It uses the Fleet SDK directly (no OpenEnv dependency).

medium

The comment here states that the module uses the Fleet SDK directly with no OpenEnv dependency. This is incorrect, as the implementation in integrations/fleet/env.py uses OpenEnvFleetTaskEnv. The comment should be updated to accurately reflect the dependency on OpenEnv.

Suggested change
# It uses the Fleet SDK directly (no OpenEnv dependency).
# It uses OpenEnv's FleetTaskEnv as an abstraction layer.

Comment on lines +123 to +138
    # Upload to S3 if credentials are available
    try:
        from integrations.fleet.s3_checkpoints import upload_eval_results_to_s3

        run_name = getattr(cfg.trainer, "run_name", None)
        if run_name:
            upload_eval_results_to_s3(
                local_dir=str(data_save_dir),
                run_name=run_name,
                global_step=global_step,
                delete_local=False,  # Keep local copy
            )
    except ImportError:
        pass  # S3 upload not available
    except Exception as e:
        logger.warning(f"Failed to upload eval results to S3: {e}")

medium

The logic for uploading evaluation results to S3 is duplicated in both the evaluate and evaluate_step_wise functions (lines 254-269). To improve maintainability and avoid redundancy, this logic should be refactored into a separate helper function.
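
One possible shape for the shared helper the reviewer suggests (a sketch; the function name and placement are assumptions, and the body mirrors the quoted block above):

```python
def _maybe_upload_eval_results_to_s3(cfg, data_save_dir, global_step, logger) -> None:
    """Shared helper for the S3-upload block duplicated in evaluate() and evaluate_step_wise()."""
    try:
        from integrations.fleet.s3_checkpoints import upload_eval_results_to_s3

        run_name = getattr(cfg.trainer, "run_name", None)
        if run_name:
            upload_eval_results_to_s3(
                local_dir=str(data_save_dir),
                run_name=run_name,
                global_step=global_step,
                delete_local=False,  # keep the local copy
            )
    except ImportError:
        pass  # S3 upload not available
    except Exception as e:
        logger.warning(f"Failed to upload eval results to S3: {e}")
```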

@dzorlu dzorlu closed this Jan 25, 2026
@dzorlu dzorlu deleted the fix/remove-done-stop-token branch January 25, 2026 18:16