This repository contains code for analyzing LLM generalization across difficulty levels, as described in the paper "Revisiting Generalization Across Difficulty Levels: It's Not So Easy".
- Clone the repository:

  ```bash
  git clone https://github.com/BatsResearch/cross-difficulty.git
  cd cross-difficulty
  ```

- Create and activate the conda environment:

  ```bash
  conda env create -f config/vllm-environment.yml
  conda activate easy2hard-vllm
  ```

This project uses the LM Evaluation Harness for evaluation.
- Clone and install the LM Evaluation Harness separately:

  ```bash
  git clone https://github.com/EleutherAI/lm-evaluation-harness.git
  cd lm-evaluation-harness
  pip install -e .
  ```

- Copy the evaluation tasks from this project to the LM Evaluation Harness:

  ```bash
  cp -r /path/to/Cross-Difficulty-Generalization/tasks/* /path/to/lm-evaluation-harness/lm_eval/tasks/
  ```
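To confirm the copied tasks are registered, you can ask the harness to list the tasks it can see. A minimal check, assuming the task names defined in the copied YAML configs (the grep pattern below is a placeholder, not an actual task name from this repository):

```bash
# Print every task the harness knows about and filter for the copied ones.
# Replace <task_name> with a name from the YAML files in tasks/ (placeholder).
lm_eval --tasks list | grep <task_name>
```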
The main training script handles training, model conversion, and evaluation in a single pipeline.

- Configure the scripts in the `scripts` directory by filling in your paths and settings.

- Run the pipeline:

  ```bash
  bash scripts/sft_b200.sh
  ```

This script will:
- Train models on specified difficulty bins
- Convert checkpoints from DeepSpeed ZeRO format to standard PyTorch format (see the sketch after this list)
- Run evaluation using LM Eval Harness
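The conversion step is handled by the pipeline, but for reference DeepSpeed writes a `zero_to_fp32.py` utility into each ZeRO checkpoint directory that consolidates the sharded states into a single fp32 PyTorch state dict. A minimal sketch, assuming a checkpoint directory produced during training (paths below are placeholders):

```bash
# zero_to_fp32.py is saved by DeepSpeed next to the ZeRO checkpoint shards.
# It merges the sharded parameter/optimizer states into one fp32 state dict.
# (In newer DeepSpeed versions the second argument is an output directory.)
cd /path/to/output/checkpoint-XXXX   # placeholder checkpoint directory
python zero_to_fp32.py . pytorch_model.bin
```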
To run zero-shot evaluation without training:

- Configure the script by editing `scripts/run-zero-shot-eval-vllm_b200.sh` to set the appropriate model paths and output directories.

- Run zero-shot evaluation:

  ```bash
  bash scripts/run-zero-shot-eval-vllm_b200.sh
  ```
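Such a run ultimately comes down to an LM Evaluation Harness call with the vLLM backend. A minimal sketch of what that call can look like, assuming a local model path and a task name from the copied configs (both placeholders; the actual invocation is defined in the script):

```bash
lm_eval \
  --model vllm \
  --model_args pretrained=/path/to/model,tensor_parallel_size=1,dtype=auto \
  --tasks <task_name> \
  --num_fewshot 0 \
  --batch_size auto \
  --output_path results/
```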
Paper citation:

```bibtex
@misc{kordi2025revisitinggeneralizationdifficultylevels,
title={Revisiting Generalization Across Difficulty Levels: It's Not So Easy},
author={Yeganeh Kordi and Nihal V. Nayak and Max Zuo and Ilana Nguyen and Stephen H. Bach},
year={2025},
eprint={2511.21692},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.21692},
}
```

LM Evaluation Harness citation:

```bibtex
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {The Language Model Evaluation Harness},
month = 07,
year = 2024,
publisher = {Zenodo},
version = {v0.4.3},
doi = {10.5281/zenodo.12608602},
url = {https://zenodo.org/records/12608602}
}
```