🌟 Parse: An Open-Domain Reasoning QA Benchmark for Persian

A reasoning-focused, open-domain Question Answering benchmark for Persian (FA), covering Boolean, Factoid, and Multiple-choice questions in both Reasoning and Multihop settings.

✨ Highlights

  • 🧠 Designed to evaluate reasoning capabilities of LLMs in a low-resource language
  • ✅ Supports Zero-shot, Few-shot, and Chain-of-Thought (CoT) evaluation
  • 🧪 Includes scripts for automatic evaluation + fine-tuning utilities
  • 👥 Comes with human evaluation interfaces (quality + difficulty validation)

🤗 Dataset

Parse is publicly available on Hugging Face as the dataset JamshidJDMY/Parse.

Local dataset files (dataset/)

This repository also contains the dataset as JSON files under dataset/:

  • full.json → the complete Parse benchmark
  • train.json → training split (used for fine-tuning experiments)
  • test.json → test split (used for fine-tuning evaluation)

Note: train.json and test.json are provided for reproducibility of fine-tuning experiments.
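
For a quick local check, the JSON files can be loaded directly. The snippet below is a minimal sketch; it assumes only that the files are standard JSON and makes no assumptions about the per-question fields.

import json

# Load the complete benchmark from the local dataset/ folder.
with open("dataset/full.json", encoding="utf-8") as f:
    data = json.load(f)

# Peek at the size and one record to discover the available fields.
print(len(data))
print(data[0] if isinstance(data, list) else next(iter(data.items())))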

📌 Task Coverage

Question Types & Subtypes

| Question Type | Dimension | Subtypes (Categories) |
|---|---|---|
| Boolean | Reasoning | Simple, Negation, Comparative |
| Boolean | Multihop | Simple, Negation, Comparative |
| Factoid | Reasoning | Simple, NonAnswerable, ListBased |
| Factoid | Multihop | Simple, NonAnswerable, ListBased |
| Multiple-choice | Reasoning | SingleAnswer, MultiAnswer, NonAnswerable |
| Multiple-choice | Multihop | SingleAnswer, MultiAnswer, NonAnswerable |

Benchmark Dimensions

| Dimension | Values |
|---|---|
| Reasoning Types | Reasoning, Multihop |
| Difficulty | Easy, Medium, Hard |
| Languages | Persian and English prompts supported |

📈 Benchmark Statistics

Parse contains 10,800 questions, designed around a balanced and fully controlled taxonomy.

Dataset Size & Balance

  • Total questions: 10,800
  • Uniform coverage: 18 configuration families, each with 600 questions
  • Difficulty is balanced inside each configuration: 200 Easy / 200 Medium / 200 Hard

Taxonomy Breakdown (Table 2 in the paper)

| QA Type | Dimension | Subtypes | # per subtype | Total |
|---|---|---|---|---|
| Boolean | Reasoning | Simple / Negation / Comparative | 600 | 1,800 |
| Boolean | Multihop | Simple / Negation / Comparative | 600 | 1,800 |
| Multiple-choice | Reasoning | Single-Ans / Multi-Ans / Non-Ans | 600 | 1,800 |
| Multiple-choice | Multihop | Single-Ans / Multi-Ans / Non-Ans | 600 | 1,800 |
| Factoid | Reasoning | Simple / List-based / Non-Ans | 600 | 1,800 |
| Factoid | Multihop | Simple / List-based / Non-Ans | 600 | 1,800 |

Overall: 6 blocks × 1,800 = 10,800 questions.
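
As a quick sanity check on the arithmetic above (purely illustrative):

# 3 question types × 2 reasoning dimensions × 3 subtypes = 18 configuration families
families = 3 * 2 * 3
per_family = 200 + 200 + 200             # Easy + Medium + Hard per configuration
assert families * per_family == 10_800   # total number of questions in Parse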

🧪 Benchmarking Results (Paper Summary)

We benchmark multilingual and Persian LLMs under:

  • Zero-shot
  • Few-shot
  • Chain-of-Thought (CoT)

Key findings:

  • Persian prompts generally improve results compared to English prompts.
  • Structured prompting helps:
    • CoT is most effective for Boolean and Multiple-choice
    • Few-shot is most effective for Factoid
  • Fine-tuning improves performance, particularly for Persian-specialized models.

Full result tables are provided in the paper (e.g., Table 4 for Boolean and Table 5 for Multiple-choice).

🚀 Quick Start

Install

pip install datasets

Load with 🤗 Datasets

from datasets import load_dataset

ds = load_dataset("JamshidJDMY/Parse")
print(ds)

example = ds["train"][0]
print(example)
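
The column layout is defined by the dataset itself, so it is worth inspecting the splits and features before filtering; the sketch below assumes nothing about Parse's exact schema.

from datasets import load_dataset

# Reload (or reuse ds from above) and inspect each split: size and column names.
ds = load_dataset("JamshidJDMY/Parse")
for split_name, split in ds.items():
    print(split_name, len(split), split.column_names)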

📦 Repository Overview

prompts/

Contains all prompt templates used during benchmark creation (question generation), organized by the following dimensions (a short listing sketch is given after the list):

  • question type (Boolean / Factoid / Multichoice)
  • reasoning type (Reasoning / Multihop)
  • sub-category (e.g., Simple, Negation, Comparative, ListBased, NonAnswerable)
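
To browse these templates locally, a plain directory walk is enough; the sketch below assumes only that prompts/ contains template files and does not rely on specific file names.

from pathlib import Path

# List every prompt template under prompts/, shown by its relative path
# (question type / reasoning type / sub-category, as organized in the repository).
for path in sorted(Path("prompts").rglob("*")):
    if path.is_file():
        print(path.relative_to("prompts"))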

evaluation/

Includes all automatic evaluation code:

  • zero_shot/
  • few_shot/
  • chain_of_thought/

Each evaluation setting contains:

  • boolean_sh.sh
  • factoid_sh.sh
  • multichoice_sh.sh

finetune/

Utilities to convert Parse into TogetherAI fine-tuning format:

  • to_together_ai.py
  • output example: finetune/together_ai_data_format/train_together.jsonl

Human evaluation data

  • evaluation/human_difficulty_validation/ → difficulty validation study
  • evaluation/human_quality_evaluation/ → quality evaluation study

interface/

Annotation interfaces and guide:

  • quality_evaluation_interface.html
  • difficulty_evalation_interface.html
  • QA_Annotation_Guide.pdf

🔁 Reproducibility (Minimal Setup)

Recommended: Python 3.10+

python -m venv .venv
source .venv/bin/activate   # Linux/Mac
# .venv\Scripts\activate    # Windows

Install dependencies:

pip install -U pip
pip install prettytable termcolor together tenacity datasets

If you use API-based models, ensure you have your TogetherAI API key configured.
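
The together Python client reads the key from the TOGETHER_API_KEY environment variable by default; a minimal pre-flight check (illustrative only):

import os

# The TogetherAI client picks up the API key from this environment variable.
if not os.environ.get("TOGETHER_API_KEY"):
    raise SystemExit("Set TOGETHER_API_KEY before running the evaluation scripts.")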

🧪 Evaluation (TogetherAI)

All evaluation scripts follow the same structure and produce JSON predictions under prompt_results/.

Running experiments

✅ Zero-shot

cd evaluation/zero_shot
bash boolean_sh.sh
bash factoid_sh.sh
bash multichoice_sh.sh

✅ Few-shot

cd evaluation/few_shot
bash boolean_sh.sh
bash factoid_sh.sh
bash multichoice_sh.sh

✅ Chain-of-Thought (CoT)

cd evaluation/chain_of_thought
bash boolean_sh.sh
bash factoid_sh.sh
bash multichoice_sh.sh

Output format

Predictions are stored here:

evaluation/<setting>/prompt_results/<task>/<language>/

Example:

evaluation/chain_of_thought/prompt_results/boolean/persian/answers_llama-3-70b.json
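
A predictions file can be inspected with plain JSON loading; the keys inside each record depend on the evaluation scripts, so the sketch below does not assume any.

import json

# Example predictions file from the CoT Boolean run with Persian prompts.
path = "evaluation/chain_of_thought/prompt_results/boolean/persian/answers_llama-3-70b.json"
with open(path, encoding="utf-8") as f:
    predictions = json.load(f)

print(f"Loaded {len(predictions)} predictions")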

📊 Scoring

Each evaluation setting includes the scoring scripts:

  • evaluate_results.py
  • evaluate_finetuned_results.py

Example:

python evaluate_results.py

🔧 Fine-tuning

Fine-tuning helper scripts and prompts are available in:

finetune/

Key script:

  • to_together_ai.py → converts Parse into TogetherAI-compatible JSONL (see the read-back sketch below)

Output example:

  • finetune/together_ai_data_format/train_together.jsonl
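
To sanity-check the converted file, it can be read back line by line as JSON Lines; the exact field layout is whatever to_together_ai.py emits, so no keys are assumed here.

import json

# Each line of the converted file is one JSON record in TogetherAI's fine-tuning format.
with open("finetune/together_ai_data_format/train_together.jsonl", encoding="utf-8") as f:
    first_record = json.loads(f.readline())

print(sorted(first_record))  # show the top-level keys of the first record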

👥 Human Evaluation Summary

We conducted two human evaluation studies to validate benchmark quality and difficulty labels.

✅ Quality Evaluation (1–5 rating)

Annotators evaluated:

  • Ambiguity
  • Readability
  • Correctness

Average scores across groups:

| Metric | Avg. Score (1–5) |
|---|---|
| Ambiguity | 4.404 |
| Readability | 4.669 |
| Correctness | 4.389 |

These results indicate high linguistic quality and strong factual correctness.

✅ Difficulty Validation

Human accuracy aligns with our difficulty labels (Easy > Medium > Hard) consistently across Boolean, Multiple-choice, and Factoid.

📁 Repository Structure (Short)

.
├── dataset/
├── prompts/
├── evaluation/
├── finetune/
├── interface/
├── LICENSE
└── README.md

📜 Citation

If you use Parse, please cite:

📄 License

See LICENSE.
