
Error in evaluation script (def eval_triva), and reproduced HumanEval-infill value does not match the paper #27

@JiHoonLee9898

Description

(Two screenshots attached.)

Hello, I'm reproducing the paper's results and found two issues:

  1. There is an error in the eval script: I can't reproduce def eval_triva because the function uses ds = load_dataset("rajpurkar/squad", split='validation'), i.e., it loads SQuAD even though the function appears to be a TriviaQA evaluation (see the sketch after this list).
  2. The reproduced HumanEval-infill value does not match the value reported in the paper (compare the last column of the second screenshot with the last column of your table).
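
To make the first point concrete, here is a minimal sketch of what I would expect eval_triva to do, assuming the intent is TriviaQA. The dataset id (mandarjoshi/trivia_qa), the rc.nocontext config, the alias-based scoring, and the generate_fn callable are my own assumptions for illustration, not taken from your code:

```python
# Hypothetical sketch only -- not the repo's code. It assumes eval_triva is meant
# to score TriviaQA; the dataset id, config, and scoring below are my guesses.
from datasets import load_dataset

def eval_triva(generate_fn, max_samples=200):
    """generate_fn: any callable mapping a prompt string to a model completion."""
    # The current script loads rajpurkar/squad here; TriviaQA would look like this:
    ds = load_dataset("mandarjoshi/trivia_qa", "rc.nocontext", split="validation")
    ds = ds.select(range(min(max_samples, len(ds))))

    correct = 0
    for ex in ds:
        prompt = f"Question: {ex['question']}\nAnswer:"
        pred = generate_fn(prompt).strip().lower()
        # TriviaQA provides normalized answer aliases; count a hit if any appears.
        aliases = [a.lower() for a in ex["answer"]["normalized_aliases"]]
        correct += int(any(alias in pred for alias in aliases))

    return correct / len(ds)
```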

We were able to reproduce the LAMBADA and ROCStories results without problems. However, could you update the script once more for the other evaluations? I'd like to reproduce all of the values reported in the paper exactly.

Also, if possible, could you integrate a GSM8K evaluation into eval-diffullama.py? I'd like to be sure I'm evaluating things the way you intended, and the safest way to do that is to follow your code exactly. A rough sketch of what I have in mind is below.
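
This is only a sketch under my own assumptions (the openai/gsm8k dataset with the main config, answer extraction from the "#### <number>" suffix, and a generate_fn callable standing in for however eval-diffullama.py samples from the model), not a claim about how you would implement it:

```python
# Hypothetical GSM8K sketch -- my own assumptions, not the actual DiffuLLaMA eval code.
import re
from datasets import load_dataset

def eval_gsm8k(generate_fn, max_samples=200):
    """generate_fn: any callable mapping a prompt string to a model completion."""
    ds = load_dataset("openai/gsm8k", "main", split="test")
    ds = ds.select(range(min(max_samples, len(ds))))

    def last_number(text):
        # GSM8K references end with "#### <answer>"; for predictions, take the last number.
        nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
        return nums[-1] if nums else None

    correct = 0
    for ex in ds:
        prompt = f"Question: {ex['question']}\nAnswer:"
        pred = last_number(generate_fn(prompt))
        gold = ex["answer"].split("####")[-1].strip().replace(",", "")
        correct += int(pred is not None and pred == gold)

    return correct / len(ds)
```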
