This repository contains experiments conducted by the TurkuNLP group on Hugging Face's Fineweb data. The primary purpose of this project is to replicate Hugging Face's Fineweb-edu results and develop experimental models based on the dataset.
Fineweb is a dataset developed by Hugging Face that includes web data used for educational purposes and model training. Fineweb-edu is a specific subset of the Fineweb dataset used for model evaluation and fine-tuning experiments. The dataset supports various machine learning tasks such as natural language understanding and generation.
-
TurkuNLP/
: This folder contains evaluation results of models recreated by us using the Fineweb-edu data. These evaluations are part of our efforts to replicate and expand upon the results provided by Hugging Face. -
HF_FW/
: This folder holds evaluation results of the original Hugging Face Fineweb-edu models (e.g.,ablation-model-fineweb-edu
). These models were evaluated by us, as the official results were not made publicly available. -
examples/
: In this folder, you will find example Slurm scripts for running experiments on the LUMI supercomputer.
All evaluations in this project are conducted using the Lighteval suite, which we have adapted for use on the LUMI supercomputer. The evaluation procedure follows the instructions outlined in the official Fineweb repository: lighteval_tasks.py#L12.
The Lighteval suite has been specifically adapted for execution on the LUMI supercomputer. Example Slurm scripts are provided in the examples/
folder to help set up and run experiments efficiently on LUMI.