
Buggy Benchmark Generation #183

Closed
Adarsh321123 opened this issue Jul 16, 2024 · 12 comments

@Adarsh321123 commented Jul 16, 2024

Description

I believe scripts/generate-benchmark-lean4.ipynb is buggy. I evaluated the ReProver premise retriever on a dataset I generated from Mathlib4. According to the paper, I should get a recall of around 34.7. When I generated the dataset at commit 29dcec074de168ac2bf835a77ef68bbe069194c5, my recall was absurdly high at around 70, and other commits I tried with this generation code gave similar results. However, the downloaded v9 dataset at https://zenodo.org/records/10929138 gave a recall of 36, as expected. I am unsure which commit v9 was generated from, but regardless, I believe this recall disparity suggests buggy code.

Detailed Steps to Reproduce the Behavior

  1. Run scripts/generate-benchmark-lean4.ipynb on commit 29dcec074de168ac2bf835a77ef68bbe069194c5 to reproduce LeanDojo Benchmark 4 version v9.
  2. Clone ReProver and follow its instructions to evaluate the premise-retriever checkpoint on the generated dataset.
  3. Evaluate the same checkpoint on the v9 dataset downloaded from https://zenodo.org/records/10929138.

Logs

Downloaded dataset:

Average R@1 = 13.057546943693088 %, R@10 = 35.968246863464174 %, MRR = 0.31899176730322715

Generated dataset:

Average R@1 = 28.444059618579143 %, R@10 = 69.65759521602048 %, MRR = 0.5975602059714626
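For context, these metrics are computed roughly as follows (a sketch, not ReProver's exact evaluation code): R@k averages, over test theorems, the fraction of ground-truth premises found among the top-k retrieved, and MRR is the mean reciprocal rank of the first correct premise.

def recall_at_k(ground_truth, retrieved, k):
    # Fraction of a theorem's ground-truth premises that appear in the top-k retrieved.
    if not ground_truth:
        return None
    return len(set(retrieved[:k]) & set(ground_truth)) / len(ground_truth)

def reciprocal_rank(ground_truth, retrieved):
    # Reciprocal rank of the first retrieved premise that is a ground-truth premise.
    for rank, premise in enumerate(retrieved, start=1):
        if premise in ground_truth:
            return 1.0 / rank
    return 0.0

The logged numbers are averages of these per-theorem scores over the test split, with R@k reported as a percentage.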

Also, there are some differences between the two datasets. With this code,

import json

# Paths to the two benchmarks being compared: the v9 dataset downloaded from
# Zenodo and the dataset regenerated with scripts/generate-benchmark-lean4.ipynb.
downloaded_dataset_path = 'leandojo_benchmark_4_downloaded/random'
new_commit_dataset_path = 'leandojo_benchmark_4_new_commit/random'
downloaded_corpus_path = 'leandojo_benchmark_4_downloaded/corpus.jsonl'
new_commit_corpus_path = 'leandojo_benchmark_4_new_commit/corpus.jsonl'

def load_jsonl(file_path):
    # Load a corpus.jsonl file (one JSON object per line).
    with open(file_path, 'r') as file:
        print(f"Loading {file_path}")
        return [json.loads(line) for line in file]

def load_json(file_path):
    # Load the train/val/test splits of a benchmark directory.
    datasets = {}
    for split in ["train", "val", "test"]:
        new_file_path = f"{file_path}/{split}.json"
        with open(new_file_path, 'r') as file:
            print(f"Loading {new_file_path}")
            datasets[split] = json.load(file)
    return datasets

downloaded_dataset = load_json(downloaded_dataset_path)
downloaded_dataset_train, downloaded_dataset_val, downloaded_dataset_test = downloaded_dataset["train"], downloaded_dataset["val"], downloaded_dataset["test"]
new_commit_dataset = load_json(new_commit_dataset_path)
new_commit_dataset_train, new_commit_dataset_val, new_commit_dataset_test = new_commit_dataset["train"], new_commit_dataset["val"], new_commit_dataset["test"]
downloaded_corpus = load_jsonl(downloaded_corpus_path)
new_commit_corpus = load_jsonl(new_commit_corpus_path)

analysis_results = {
    "downloaded_dataset_train_size": len(downloaded_dataset_train),
    "downloaded_dataset_val_size": len(downloaded_dataset_val),
    "downloaded_dataset_test_size": len(downloaded_dataset_test),
    "new_commit_dataset_train_size": len(new_commit_dataset_train),
    "new_commit_dataset_val_size": len(new_commit_dataset_val),
    "new_commit_dataset_test_size": len(new_commit_dataset_test),
    "downloaded_corpus_size": len(downloaded_corpus),
    "new_commit_corpus_size": len(new_commit_corpus),
    "downloaded_corpus_file_premises_size": len(downloaded_corpus[0]['premises']),
    "new_commit_corpus_file_premises_size": len(new_commit_corpus[0]['premises']),
}

print(analysis_results)

I get this final output:

{
    'downloaded_dataset_train_size': 112729,
    'downloaded_dataset_val_size': 2000,
    'downloaded_dataset_test_size': 2000,
    'new_commit_dataset_train_size': 118517,
    'new_commit_dataset_val_size': 2000,
    'new_commit_dataset_test_size': 2000,
    'downloaded_corpus_size': 5192,
    'new_commit_corpus_size': 5674,
    'downloaded_corpus_file_premises_size': 465,
    'new_commit_corpus_file_premises_size': 478
}
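The test splits themselves can also be compared directly. A rough sketch, reusing the variables from the script above (the file_path and full_name fields are assumed to be present, as in LeanDojo Benchmark 4):

def theorem_keys(examples):
    # Identify a theorem by its source file and fully qualified name.
    return {(ex["file_path"], ex["full_name"]) for ex in examples}

shared_test = theorem_keys(downloaded_dataset_test) & theorem_keys(new_commit_dataset_test)
print(f"Theorems common to both test splits: {len(shared_test)} / 2000")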

Platform Information

  • OS: Ubuntu
  • LeanDojo Version: 1.8.4
@yangky11 (Member)

It's probably because the model checkpoint you're evaluating was trained on a different version of the dataset. The train/val/test split differs for each version, so you're essentially testing on the training data.
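One quick way to check this hypothesis is to count how many theorems in the regenerated test split also appear in the train split the checkpoint was actually trained on. A sketch, reusing the variables from the comparison script above (field names assumed as in LeanDojo Benchmark 4):

downloaded_train_keys = {(ex["file_path"], ex["full_name"]) for ex in downloaded_dataset_train}
new_test_keys = {(ex["file_path"], ex["full_name"]) for ex in new_commit_dataset_test}
leaked = new_test_keys & downloaded_train_keys
print(f"Regenerated test theorems also in the downloaded train split: {len(leaked)} / {len(new_test_keys)}")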

@Adarsh321123 (Author)

I am using the same model checkpoint from HuggingFace for both datasets, so shouldn't the performance still be the same assuming that both datasets are on commit 29dcec074de168ac2bf835a77ef68bbe069194c5?

@yangky11 (Member) commented Jul 27, 2024

Do you know which mathlib commit was used to generate the dataset you downloaded from Zenodo? You can find it in metadata.json.
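A quick way to inspect this (the path is an assumption; adjust it to wherever the downloaded dataset was extracted):

import json

with open("leandojo_benchmark_4_downloaded/metadata.json") as f:
    metadata = json.load(f)
print(metadata)  # should include the mathlib4 repo URL and commit the dataset was traced from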

@Adarsh321123 (Author)

It seems that v9 was generated from commit fe4454af900584467d21f4fd4fe951d29d9332a7. Crucially, I tried generating the dataset from that commit and reproduced the same bug described in the original post. Can you please let me know if you can reproduce it?

@yangky11 (Member)

So you generated a new dataset from fe4454af900584467d21f4fd4fe951d29d9332a7, evaluated the retriever model, and got a Recall@10 of 36 or 70?

What model checkpoint are you using?

@Adarsh321123 (Author)

So you generated a new dataset from fe4454af900584467d21f4fd4fe951d29d9332a7, evaluated the retriever model, and got a Recall@10 of 36 or 70?

I got a Recall@10 of about 70 on the dataset I generated from that commit. However, when I evaluated on the downloaded v9 dataset, I got a Recall@10 of 36, as expected.

I am using the retriever trained on the random split that used to be hosted at https://huggingface.co/kaiyuy/leandojo-pl-ckpts.

@yangky11 (Member) commented Aug 3, 2024

That's expected. You should get ~36% only if using exactly the same dataset used for training the model.

@Adarsh321123 (Author)

I agree. However, I don't understand why I get ~70% when using a dataset generated from the same commit (fe4454af900584467d21f4fd4fe951d29d9332a7) that the model was trained on. Shouldn't I get ~36%? The generated dataset should be identical to the v9 dataset, right?

@Adarsh321123 (Author) commented Aug 10, 2024

Any thoughts @yangky11? Thank you so much for your help so far!

@yangky11 (Member)

I don't think it's necessarily going to be the same. If you run the current benchmark generation code twice, does it give you exactly the same dataset?

@Adarsh321123 (Author)

Thank you for your response, @yangky11. It doesn't give the same dataset, but the two datasets are close enough that the two R@10 values are essentially the same.

@Adarsh321123 (Author)

@yangky11 ?
