UniCode is a novel framework that addresses the limitations of static, human-authored problem sets by automatically generating high-quality algorithmic problems and robust, contamination-resistant test cases. Inspired by biological evolution, our framework creates diverse and challenging programming problems through systematic generation strategies.
To evaluate models with our benchmark, use the following command:
python run_benchmark.py \
--models gpt-4o gpt-4.1 gpt-4.1-mini o3-mini gpt-4.5-preview \
--max-workers 4

- --models: List of models to evaluate
- --max-workers: Number of parallel workers for evaluation
To generate new problems and test cases:
# Generate new problems
python gen_new_questions.py
# Generate test cases for new problems
python generate_test_cases_by_brute.py
python generate_test_cases_by_opt.py
python filter.py

UniCode employs three biologically inspired strategies to create novel algorithmic challenges: (1) Single-Problem Extension, (2) Same-Type Fusion, and (3) Cross-Type Fusion.
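The sketch below illustrates how these three strategies could be dispatched when prompting an LLM to draft a new problem. The template wording, the draft_problem function, and the llm callable are illustrative assumptions for this sketch, not the actual implementation in gen_new_questions.py.

```python
from typing import Callable, Optional

# Prompt templates for the three strategies (illustrative wording only).
PROMPTS = {
    "single_extension": (
        "Extend the following problem with an extra constraint or twist, "
        "keeping it self-contained:\n{a}"
    ),
    "same_type_fusion": (
        "Combine these two problems of the same algorithmic type into one "
        "harder problem:\n{a}\n---\n{b}"
    ),
    "cross_type_fusion": (
        "Fuse these two problems from different algorithmic types into a "
        "single novel problem:\n{a}\n---\n{b}"
    ),
}

def draft_problem(strategy: str,
                  seed_a: str,
                  llm: Callable[[str], str],
                  seed_b: Optional[str] = None) -> str:
    """Fill the template for the chosen strategy and ask an LLM for a new problem."""
    prompt = PROMPTS[strategy].format(a=seed_a, b=seed_b or "")
    return llm(prompt)
```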
Our stress-driven pipeline ensures high-quality test suites without requiring ground-truth solutions; a sketch of the core validation loop follows the list below:
- Random Generation: Broad sampling from valid input space
- Adversarial Generation: Targets boundary conditions and worst-case scenarios
- LLM-based Synthesis: Creates small-scale challenging inputs
- Brute-Force Validation: Establishes trusted outputs for small-scale inputs
- Solver Filtration: Filters optimized solutions using stress tests
- Consensus Validation: Uses majority voting for large-scale inputs
- LLM Adjudication: Resolves conflicts with powerful LLMs
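To make the interplay of these steps concrete, here is a minimal sketch of the validation loop, assuming each candidate solution is a callable from input text to output text. The function names, the brute-force/consensus split, and the 60% agreement threshold are illustrative assumptions, not the exact logic of filter.py or generate_test_cases_by_opt.py.

```python
from collections import Counter
from typing import Callable, List, Optional

Solution = Callable[[str], str]  # maps a raw input string to an output string

def filter_solutions(candidates: List[Solution],
                     brute_force: Solution,
                     small_inputs: List[str]) -> List[Solution]:
    """Keep candidates that match the brute-force outputs on every small
    stress-test input (Brute-Force Validation + Solver Filtration)."""
    trusted = {x: brute_force(x) for x in small_inputs}
    return [s for s in candidates
            if all(s(x) == trusted[x] for x in small_inputs)]

def label_large_input(survivors: List[Solution],
                      x: str,
                      min_agreement: float = 0.6) -> Optional[str]:
    """Label one large input by majority vote over surviving solutions
    (Consensus Validation). Returns None when no clear majority exists,
    signalling that the case should be escalated to LLM Adjudication."""
    if not survivors:
        return None
    votes = Counter(s(x) for s in survivors)
    answer, count = votes.most_common(1)[0]
    return answer if count / len(survivors) >= min_agreement else None
```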


