A reasoning-focused open-domain Question Answering benchmark for Persian (FA)
covering Boolean, Factoid, and Multiple-choice questions in both Reasoning and Multi-hop settings.
- Designed to evaluate the reasoning capabilities of LLMs in a low-resource language
- Supports Zero-shot, Few-shot, and Chain-of-Thought (CoT) evaluation
- Includes scripts for automatic evaluation + fine-tuning utilities
- Comes with human evaluation interfaces (quality + difficulty validation)
Parse is publicly available on HuggingFace:
- Dataset: [JamshidJDMY/Parse](https://huggingface.co/datasets/JamshidJDMY/Parse)
This repository also contains the dataset as JSON files under `dataset/`:
- `full.json` - the complete Parse benchmark
- `train.json` - training split (used for fine-tuning experiments)
- `test.json` - test split (used for fine-tuning evaluation)

Note: `train.json` and `test.json` are provided for reproducibility of the fine-tuning experiments.
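If you prefer the raw files over the HuggingFace loader, a minimal loading sketch follows; it assumes only that the files are UTF-8 JSON, and the exact record fields are described on the dataset card:

```python
import json

# Minimal sketch for the raw JSON files under dataset/.
# The record schema (field names) follows the dataset card on HuggingFace.
with open("dataset/full.json", encoding="utf-8") as f:
    full = json.load(f)

print(type(full))   # top-level container (list or dict, depending on the export)
print(len(full))    # number of records / keys
```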
| Question Type | Subtypes (Categories) |
|---|---|
| Boolean | Reasoning: Simple, Negation, Comparative; Multihop: Simple, Negation, Comparative |
| Factoid | Reasoning: Simple, NonAnswerable, ListBased; Multihop: Simple, NonAnswerable, ListBased |
| Multiple-choice | Reasoning: SingleAnswer, MultiAnswer, NonAnswerable; Multihop: SingleAnswer, MultiAnswer, NonAnswerable |
| Dimension | Values |
|---|---|
| Reasoning Types | Reasoning, Multihop |
| Difficulty | Easy, Medium, Hard |
| Languages | Persian + English prompts supported |
Parse contains 10,800 questions, designed with a balanced and fully controlled taxonomy.
- Total questions: 10,800
- Uniform coverage: 18 configuration families, each with 600 questions
- Difficulty is balanced inside each configuration: 200 Easy / 200 Medium / 200 Hard
| QA Type | Dimension | Subtypes | # per subtype | Total |
|---|---|---|---|---|
| Boolean | Reasoning | Simple / Negation / Comparative | 600 | 1,800 |
| Boolean | Multihop | Simple / Negation / Comparative | 600 | 1,800 |
| Multiple-choice | Reasoning | Single-Ans / Multi-Ans / Non-Ans | 600 | 1,800 |
| Multiple-choice | Multihop | Single-Ans / Multi-Ans / Non-Ans | 600 | 1,800 |
| Factoid | Reasoning | Simple / List-based / Non-Ans | 600 | 1,800 |
| Factoid | Multihop | Simple / List-based / Non-Ans | 600 | 1,800 |
Overall: 6 blocks × 1,800 = 10,800 questions.
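To sanity-check these counts against the released data, a sketch along the following lines can help. The column names used here (`qa_type`, `reasoning_type`, `difficulty`) are placeholders for illustration; replace them with the actual names printed by `column_names`:

```python
from collections import Counter
from datasets import load_dataset

ds = load_dataset("JamshidJDMY/Parse")
split = ds[next(iter(ds))]          # pick the first available split
print(split.column_names)           # check the real field names first

# NOTE: qa_type / reasoning_type / difficulty are hypothetical field names.
counts = Counter(
    (row.get("qa_type"), row.get("reasoning_type"), row.get("difficulty"))
    for row in split
)
for key, n in sorted(counts.items(), key=str):
    print(key, n)
```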
We benchmark multilingual and Persian LLMs under:
- Zero-shot
- Few-shot
- Chain-of-Thought (CoT)
Key findings:
- Persian prompts generally improve results compared to English prompts.
- Structured prompting helps:
- CoT is most effective for Boolean and Multiple-choice
- Few-shot is most effective for Factoid
- Fine-tuning improves performance, particularly for Persian-specialized models.
Full result tables are provided in the paper (e.g., Table 4 for Boolean and Table 5 for Multiple-choice).
Install the `datasets` library and load Parse directly from HuggingFace:

```bash
pip install datasets
```

```python
from datasets import load_dataset

ds = load_dataset("JamshidJDMY/Parse")
print(ds)

example = ds["train"][0]
print(example)
```

The `prompts/` directory contains all prompt templates used during benchmark creation (question generation), organized by the following (a listing sketch follows the list):
- question type (Boolean / Factoid / Multichoice)
- reasoning type (Reasoning / Multihop)
- sub-category (e.g., Simple, Negation, Comparative, ListBased, NonAnswerable)
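A quick way to see what is available locally, without assuming a particular nesting depth or file extension for the templates:

```python
from pathlib import Path

# List every prompt template shipped under prompts/ (organized by question
# type / reasoning type / sub-category, as described above).
for path in sorted(Path("prompts").rglob("*")):
    if path.is_file():
        print(path)
```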
The `evaluation/` directory includes all automatic evaluation code:
- `zero_shot/`
- `few_shot/`
- `chain_of_thought/`

Each evaluation setting contains:
- `boolean_sh.sh`
- `factoid_sh.sh`
- `multichoice_sh.sh`
The `finetune/` directory provides utilities to convert Parse into the TogetherAI fine-tuning format:
- `to_together_ai.py` - output example: `finetune/together_ai_data_format/train_together.jsonl`
- `evaluation/human_difficulty_validation/` - difficulty validation study
- `evaluation/human_quality_evaluation/` - quality evaluation study

Annotation interfaces and guide:
- `quality_evaluation_interface.html`
- `difficulty_evalation_interface.html`
- `QA_Annotation_Guide.pdf`
Recommended: Python 3.10+

```bash
python -m venv .venv
source .venv/bin/activate   # Linux/Mac
# .venv\Scripts\activate    # Windows
```

Install dependencies:

```bash
pip install -U pip
pip install prettytable termcolor together tenacity datasets
```

If you use API-based models, ensure you have your TogetherAI API key configured.
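One way to do this, assuming the scripts read the key from the environment (which is how the `together` Python client looks it up by default):

```bash
# Assumption: the evaluation scripts pick up the key from the environment.
export TOGETHER_API_KEY="your-api-key-here"
```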
All evaluation scripts follow the same structure and produce JSON predictions under `prompt_results/`.

Zero-shot:

```bash
cd evaluation/zero_shot
bash boolean_sh.sh
bash factoid_sh.sh
bash multichoice_sh.sh
```

Few-shot:

```bash
cd evaluation/few_shot
bash boolean_sh.sh
bash factoid_sh.sh
bash multichoice_sh.sh
```

Chain-of-Thought:

```bash
cd evaluation/chain_of_thought
bash boolean_sh.sh
bash factoid_sh.sh
bash multichoice_sh.sh
```

Predictions are stored here:
```
evaluation/<setting>/prompt_results/<task>/<language>/
```

Example:

```
evaluation/chain_of_thought/prompt_results/boolean/persian/answers_llama-3-70b.json
```
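To peek at a predictions file, a small inspection sketch (the internal structure of the JSON is not documented here, so this only prints the first entry):

```python
import json
from pprint import pprint

# Path depends on the setting/task/language/model you ran.
path = "evaluation/chain_of_thought/prompt_results/boolean/persian/answers_llama-3-70b.json"
with open(path, encoding="utf-8") as f:
    predictions = json.load(f)

# Print the first entry, whether the top level is a list or a dict.
first = predictions[0] if isinstance(predictions, list) else next(iter(predictions.items()))
pprint(first)
```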
Each evaluation setting includes the scoring scripts:
- `evaluate_results.py`
- `evaluate_finetuned_results.py`

Example:

```bash
python evaluate_results.py
```

Fine-tuning helper scripts and prompts are available in `finetune/`.

Key script:
- `to_together_ai.py` - converts Parse into TogetherAI-compatible JSONL
Output example:
`finetune/together_ai_data_format/train_together.jsonl`
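For orientation, one line of the converted JSONL could look roughly like the record below. This is only an illustrative sketch of TogetherAI-style conversational fine-tuning data, not the exact schema produced by `to_together_ai.py`; inspect the generated file for the real fields:

```python
import json

# Illustrative record only; the field contents are hypothetical.
record = {
    "messages": [
        {"role": "user", "content": "<Persian question + instructions>"},
        {"role": "assistant", "content": "<gold answer>"},
    ]
}
# Each record is written as one JSON object per line in the .jsonl file.
print(json.dumps(record, ensure_ascii=False))
```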
We conducted two human evaluation studies to validate benchmark quality and difficulty labels.
Annotators evaluated:
- Ambiguity
- Readability
- Correctness
Average scores across groups:
| Metric | Avg. Score (1–5) |
|---|---|
| Ambiguity | 4.404 |
| Readability | 4.669 |
| Correctness | 4.389 |
These results indicate high linguistic quality and strong factual correctness.
Human accuracy aligns with our difficulty labels (Easy > Medium > Hard) consistently across Boolean, Multiple-choice, and Factoid.
```
.
├── dataset/
├── prompts/
├── evaluation/
├── finetune/
├── interface/
├── LICENSE
└── README.md
```

If you use Parse, please cite:
See LICENSE.