Hello, I'm reproducing the paper results and found 2 weird things:
- There seems to be an error in the eval script: I can't reproduce def eval_triva, because the function loads ds = load_dataset("rajpurkar/squad", split='validation'), i.e. the SQuAD validation split rather than TriviaQA, which is what the function name suggests (see the sketch right after this list for what I expected).
- The reproduced HumanEval-infill value does not match the value reported in the paper (see the last column of the second screenshot; it doesn't match the last column of your table).
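
For the first point, this is roughly what I expected eval_triva to do. Note the dataset id, config name, and field accesses below are my guesses from the Hugging Face Hub, not taken from your repo, so please correct me if the intended setup is different:

```python
from datasets import load_dataset

def load_trivia_examples():
    # My guess: TriviaQA's closed-book validation split ("rc.nocontext" on the Hub)
    # instead of SQuAD. The field names follow the Hub schema for this dataset.
    ds = load_dataset("mandarjoshi/trivia_qa", "rc.nocontext", split="validation")
    for ex in ds:
        question = ex["question"]
        answers = ex["answer"]["aliases"]  # accepted answer surface forms
        yield question, answers
```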
We were able to reproduce the Lambada and ROCStories results well. However, could you update the script once more for the other evaluations? I'd like to exactly reproduce all the values reported in the paper.
Also, if possible, could you integrate a GSM8K evaluation into eval-diffullama.py? I'd like to be sure I'm evaluating things in line with your intentions, and the best way to ensure that seems to be following your code exactly as it is. Something like the minimal loop below is what I have in mind.
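
This is only a sketch of the kind of exact-match evaluation I mean; generate_fn is a stand-in for however eval-diffullama.py actually calls the model, and the prompting/decoding details would of course need to follow your setup:

```python
import re
from datasets import load_dataset

def eval_gsm8k(generate_fn, n_samples=None):
    """Minimal GSM8K exact-match loop (sketch, not your actual eval code).
    generate_fn(prompt) -> str is assumed to wrap the model's generation."""
    ds = load_dataset("openai/gsm8k", "main", split="test")
    if n_samples:
        ds = ds.select(range(n_samples))
    correct = 0
    for ex in ds:
        # GSM8K gold answers end with "#### <number>"
        gold = ex["answer"].split("####")[-1].strip().replace(",", "")
        pred = generate_fn(ex["question"])
        # take the last number in the generation as the predicted answer
        nums = re.findall(r"-?\d+\.?\d*", pred.replace(",", ""))
        if nums and float(nums[-1]) == float(gold):
            correct += 1
    return correct / len(ds)
```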