This repository contains the official implementation of the data synthesis pipeline proposed in our paper:
LLM-Driven Preference Data Synthesis for Proactive Prediction of the User’s Next Utterance in Human–Machine Dialogue
The code implements an LLM-driven framework for synthesizing reasoning-aware preference data tailored to the task of proactive next utterance prediction in multi-turn human–machine dialogues.
Proactive prediction of a user’s next utterance requires models to anticipate user intent progression before the user explicitly expresses it. Existing approaches are limited by the scarcity of task-specific, reasoning-oriented training data.
This work addresses this challenge by introducing a data synthesis paradigm that:
- Explicitly models user intent evolution using intent trees;
- Decomposes prediction into utterance category reasoning and dual-view insight reasoning;
- Generates preference-style data with chosen and rejected reasoning trajectories;
- Automatically constructs hard negative samples via incorrect intent-path perturbation and reasoning revision.
The synthesized data can be directly used to train or align small-to-medium LLMs for proactive next utterance prediction.
proutt/
├── src/
│ ├── config.py # Configuration and hyperparameters
│ ├── client.py # Async LLM client and JSON-safe calls
│ ├── prompts.py # All prompt templates
│ ├── utils.py # Shared utility functions
│ ├── io_utils.py # JSONL I/O helpers
│ └── synthesis.py # Main data synthesis pipeline
│ └── postprocess.py # Post-process synthesized_raw into SFT/DPO-style training formats
├── data/
│ ├── raw/ # Original input data (not included)
│ ├── synthesized_raw/ # Direct LLM-generated synthesis results before post-processing
│ └── processed/ # Final processed preference data used for training
├── requirements.txt
├── .gitignore
└── README.md
The final processed preference datasets are publicly released on Hugging Face.
Specifically, this repository provides links to the following datasets:
- LMSYS-ProUtt-10K: https://huggingface.co/datasets/jqwang/LMSYS_ProUtt_10K
- CrossWOZ-ProUtt-5K: https://huggingface.co/datasets/jqwang/CrossWOZ_ProUtt_5K
Both datasets are constructed using the proposed LLM-driven data synthesis pipeline and are ready for downstream training and evaluation in proactive next utterance prediction.
The input file (input_file) is a JSONL file where each line corresponds to one multi-turn dialogue sample, containing:
- a unique dialogue ID;
- a list of messages in OpenAI-style format (
role,content).
Example:
{
"id": "sample_001",
"conversation": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]
}- Python ≥ 3.9
- An OpenAI-compatible LLM API endpoint (e.g., Zhipu GLM, DeepSeek, Qwen) Install dependencies:
pip install -r requirements.txtBefore running the pipeline, set the API key as an environment variable. Example (Zhipu / GLM):
export ZAI_API_KEY=<your_api_key>The default configuration in src/synthesis.py is:
base_url = "https://open.bigmodel.cn/api/paas/v4/"
model_id = "glm-4.6"mkdir -p data/raw data/synthesized_raw data/processedPlace the input file at:
data/raw/LMSYS.jsonlFrom the project root directory:
python -m src.synthesisThe synthesized preference data will be written to:
data/synthesized_raw/LMSYS.jsonlConvert the raw synthesized results into training-ready formats (SFT or preference/DPO):
- Preference (DPO-style):
python -m src.postprocess \
--input data/synthesized_raw/LMSYS.jsonl \
--output data/processed/LMSYS_dpo.json \
--mode pref- SFT:
python -m src.postprocess \
--input data/synthesized_raw/LMSYS.jsonl \
--output data/processed/LMSYS_sft.json \
--mode sftThe final processed datasets will be saved under:
data/processed/If you use this code, please cite:
@article{wang2025llm,
title={LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue},
author={Wang, Jinqiang and Ning, Huansheng and Ding, Jianguo and Zhu, Tao and Chen, Liming and Nugent, Chris},
journal={arXiv preprint arXiv:2601.09713},
year={2025}
}