This repository contains code for analyzing LLM generalization across difficulty levels, as described in the paper "Revisiting Generalization Across Difficulty Levels: It's Not So Easy".
- Clone the repository:

  ```bash
  git clone https://github.com/BatsResearch/cross-difficulty.git
  cd cross-difficulty
  ```

- Create and activate the conda environment:

  ```bash
  conda env create -f config/vllm-environment.yml
  conda activate easy2hard-vllm
  ```

This project uses the LM Evaluation Harness for evaluation.
- Clone and install the LM Evaluation Harness separately:

  ```bash
  git clone https://github.com/EleutherAI/lm-evaluation-harness.git
  cd lm-evaluation-harness
  pip install -e .
  ```

- Copy the evaluation tasks from this project to the LM Evaluation Harness:

  ```bash
  cp -r /path/to/Cross-Difficulty-Generalization/tasks/* /path/to/lm-evaluation-harness/lm_eval/tasks/
  ```
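To confirm the copied tasks are registered, you can ask the harness to list the tasks it can see. A minimal check, assuming the task names defined in the copied YAML configs (the grep pattern below is a placeholder, not an actual task name from this repository):

```bash
# Print every task the harness knows about and filter for the copied ones.
# Replace <task_name> with a name from the YAML files in tasks/ (placeholder).
lm_eval --tasks list | grep <task_name>
```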
The main training script handles training, model conversion, and evaluation in a single pipeline.

- Configure the scripts in the `scripts` directory by filling in your paths and settings.

- Run the pipeline:

  ```bash
  bash scripts/sft_b200.sh
  ```

This script will:
- Train models on specified difficulty bins
- Convert checkpoints from DeepSpeed ZeRO format to standard PyTorch format (see the sketch after this list)
- Run evaluation using LM Eval Harness
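The conversion step is handled by the pipeline, but for reference DeepSpeed writes a `zero_to_fp32.py` utility into each ZeRO checkpoint directory that consolidates the sharded states into a single fp32 PyTorch state dict. A minimal sketch, assuming a checkpoint directory produced during training (paths below are placeholders):

```bash
# zero_to_fp32.py is saved by DeepSpeed next to the ZeRO checkpoint shards.
# It merges the sharded parameter/optimizer states into one fp32 state dict.
# (In newer DeepSpeed versions the second argument is an output directory.)
cd /path/to/output/checkpoint-XXXX   # placeholder checkpoint directory
python zero_to_fp32.py . pytorch_model.bin
```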
To run zero-shot evaluation without training:

- Configure the script by editing `scripts/run-zero-shot-eval-vllm_b200.sh` to set the appropriate model paths and output directories.

- Run zero-shot evaluation:

  ```bash
  bash scripts/run-zero-shot-eval-vllm_b200.sh
  ```
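Such a run ultimately comes down to an LM Evaluation Harness call with the vLLM backend. A minimal sketch of what that call can look like, assuming a local model path and a task name from the copied configs (both placeholders; the actual invocation is defined in the script):

```bash
lm_eval \
  --model vllm \
  --model_args pretrained=/path/to/model,tensor_parallel_size=1,dtype=auto \
  --tasks <task_name> \
  --num_fewshot 0 \
  --batch_size auto \
  --output_path results/
```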
Paper citation:

```bibtex
@misc{kordi2025revisitinggeneralizationdifficultylevels,
title={Revisiting Generalization Across Difficulty Levels: It's Not So Easy},
author={Yeganeh Kordi and Nihal V. Nayak and Max Zuo and Ilana Nguyen and Stephen H. Bach},
year={2025},
eprint={2511.21692},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.21692},
}
```

LM Evaluation Harness citation:

```bibtex
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {The Language Model Evaluation Harness},
month = 07,
year = 2024,
publisher = {Zenodo},
version = {v0.4.3},
doi = {10.5281/zenodo.12608602},
url = {https://zenodo.org/records/12608602}
}
```