Hello, I'm reproducing the paper results and found 2 weird things:
- There seems to be an error in the eval script: I can't reproduce def eval_triva, because the function loads ds = load_dataset("rajpurkar/squad", split='validation'), i.e. the SQuAD validation split rather than TriviaQA, which is what the function name suggests (see the sketch right after this list for what I expected).
- The reproduced HumanEval-infill value does not match the value reported in the paper (see the last column of the second screenshot; it doesn't match the last column of your table).
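
For the first point, this is roughly what I expected eval_triva to do. Note the dataset id, config name, and field accesses below are my guesses from the Hugging Face Hub, not taken from your repo, so please correct me if the intended setup is different:

```python
from datasets import load_dataset

def load_trivia_examples():
    # My guess: TriviaQA's closed-book validation split ("rc.nocontext" on the Hub)
    # instead of SQuAD. The field names follow the Hub schema for this dataset.
    ds = load_dataset("mandarjoshi/trivia_qa", "rc.nocontext", split="validation")
    for ex in ds:
        question = ex["question"]
        answers = ex["answer"]["aliases"]  # accepted answer surface forms
        yield question, answers
```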
We were able to reproduce the Lambada and ROCStories results well. However, could you update the script once more for the other evaluations? I'd like to exactly reproduce all the values reported in the paper.
Also, if possible, could you integrate a GSM8K evaluation into eval-diffullama.py? I'd like to be sure I'm evaluating things in line with your intentions, and the best way to ensure that seems to be following your code exactly as it is. Something like the minimal loop below is what I have in mind.
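
This is only a sketch of the kind of exact-match evaluation I mean; generate_fn is a stand-in for however eval-diffullama.py actually calls the model, and the prompting/decoding details would of course need to follow your setup:

```python
import re
from datasets import load_dataset

def eval_gsm8k(generate_fn, n_samples=None):
    """Minimal GSM8K exact-match loop (sketch, not your actual eval code).
    generate_fn(prompt) -> str is assumed to wrap the model's generation."""
    ds = load_dataset("openai/gsm8k", "main", split="test")
    if n_samples:
        ds = ds.select(range(n_samples))
    correct = 0
    for ex in ds:
        # GSM8K gold answers end with "#### <number>"
        gold = ex["answer"].split("####")[-1].strip().replace(",", "")
        pred = generate_fn(ex["question"])
        # take the last number in the generation as the predicted answer
        nums = re.findall(r"-?\d+\.?\d*", pred.replace(",", ""))
        if nums and float(nums[-1]) == float(gold):
            correct += 1
    return correct / len(ds)
```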