Preliminary design
The datasets server aims to provide services for the datasets of the Hugging Face Hub through a web API.
Datasets can be very big. Getting metadata, fetching data, and querying or processing data require a lot of resources (time, bandwidth, computing, storage). For some use cases (notebooks, web pages, etc.), these resources are not available. The datasets server is a third party that bears the cost of these resources and provides a curated list of services on the datasets through a lightweight web API.
The Hugging Face Hub would ideally become the one-stop shop for ML datasets in the near future. To increase the usage of Hub datasets, it is crucial to provide the services users need to do their work. By providing specialized services, the datasets server will allow the Hub to add value to the dataset pages (view the data, show stats, run queries, etc.).
The datasets of the Hugging Face Hub can be accessed directly using git or HTTP, or through the datasets library.
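As an illustration, here is a minimal sketch of these access paths; the dataset name ("glue") and the commands shown in the comments are only examples, not part of this design:

```python
# Direct access (examples):
#   git clone https://huggingface.co/datasets/glue
#   curl https://huggingface.co/api/datasets/glue
# Access through the datasets library:
from datasets import load_dataset

dataset = load_dataset("glue", "cola", split="train")
print(dataset[0])
```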
Other related projects:
- huggingface_hub allows retrieving metadata on the datasets (see the sketch after this list)
- AutoNLP (autonlp-backend and autonlp-ui) allows training models using datasets of the Hub
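For example, a short sketch of retrieving dataset metadata with huggingface_hub (the dataset id is an arbitrary example):

```python
from huggingface_hub import HfApi

api = HfApi()
# Fetch the metadata of a dataset repository: tags, card data, files, latest revision...
info = api.dataset_info("glue")  # "glue" is only an example dataset id
print(info.id, info.sha, info.tags)
```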
This project is an evolution of datasets-preview-backend (the previous name of this repository), which provided the list of configs, splits, and first rows of the datasets (using the streaming mode of datasets).
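To make this concrete, here is a rough sketch of how configs, splits, and first rows can be obtained with the streaming mode of datasets; the dataset name and the number of rows are arbitrary choices for the example:

```python
from itertools import islice

from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset

dataset_name = "glue"  # arbitrary example
configs = get_dataset_config_names(dataset_name)
splits = get_dataset_split_names(dataset_name, configs[0])

# Stream the first rows without downloading the whole dataset.
streamed = load_dataset(dataset_name, configs[0], split=splits[0], streaming=True)
first_rows = list(islice(streamed, 10))
```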
The datasets server will provide the following services through a web API (see the endpoint sketch after this list):
- get the metadata of a dataset: tags, configs, splits, features (columns)...
- get the first N rows of a split
- get a quality report on how well the dataset can be accessed using datasets (has metadata, can be downloaded, can be streamed, etc.)
- generate the dataset-info.json (see https://github.com/huggingface/datasets/issues/3507#issue-1091214808)
- get basic statistics about a split: number of samples, size in bytes
- get statistics about a column of a split: distribution, mean, median, etc. (related: https://github.com/huggingface/data-measurements-tool computes frequent words, average and standard deviation of sentence length and word length, and the number of samples per tag/label)
- get a range of rows of a split (random access)
- post SQL queries
- scan files for vulnerabilities (related to security scan)
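As a purely illustrative sketch of what some of these endpoints could look like (the web framework, paths, parameters, and response shapes are assumptions for the example, not the actual implementation):

```python
from itertools import islice

from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset
from fastapi import FastAPI

app = FastAPI()

@app.get("/splits")
def splits(dataset: str):
    # List the configs and splits of a dataset (no caching or error handling, for brevity).
    return {
        "splits": [
            {"dataset": dataset, "config": config, "split": split}
            for config in get_dataset_config_names(dataset)
            for split in get_dataset_split_names(dataset, config)
        ]
    }

@app.get("/first-rows")
def first_rows(dataset: str, config: str, split: str, n: int = 10):
    # Stream the first n rows instead of downloading the whole dataset.
    rows = load_dataset(dataset, config, split=split, streaming=True)
    return {"rows": list(islice(rows, n))}
```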
At least for a first version, the following points are out of scope:
- launch queries on multiple datasets
- make the API public (two main issues: 1. traffic increase, 2. legal rights to redistribute the datasets)
- part of it could become the "HuggingFace Actions"
- "Know your data" for HF hub (https://github.com/huggingface/datasets/issues/3761)
- index / free-text search through the datasets:
  - inside the data. See threads on Slack: https://huggingface.slack.com/archives/C01229B19EX/p1645643928941829, https://huggingface.slack.com/archives/C01229B19EX/p1647372809022069, https://huggingface.slack.com/archives/C01229B19EX/p1651014009925419. Also look at https://opensearch.org/
  - inside the metadata (dataset cards). See thread on Slack: https://huggingface.slack.com/archives/C02V51Q3800/p1650993144004379
- access the cached files directly (mount a remote filesystem?); see https://huggingface.slack.com/archives/C01229B19EX/p1646166958049289
Multiple aspects must be taken into account for the implementation. Not all are equally important.
- size (and cost) of the storage: some datasets are very big (several TB, generally for audio or vision)
- number of files: some datasets have a lot of small files
- bandwidth: downloading (and updating regularly) big datasets takes a lot of bandwidth. It should not be a problem on the datasets server's side, but it might be on the other side (not all the datasets' data files are hosted on the Hub). The dataset hosting platform might rate-limit or have availability issues
- changes: the datasets are live objects that are versioned with git
- changes: the hosted data files might change
- response time: generating the response to the services can take time, because 1. the dataset must be downloaded, but also 2. querying the data on the local files can take time too (possibly reduce-like operations)
- access rights: some datasets are gated, others are private, others must be downloaded manually
- private hub: on-premise hubs might also want to benefit from the services provided by the datasets server
- security: downloading a dataset can require executing arbitrary code (the .py loading script), which might raise security issues
- dependencies: the .py script might require extra packages, but there is currently no way to specify these dependencies
- resources: the processes might take too many resources (memory, CPU, storage, time); see the sketch after this list
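As a rough illustration of how the resource concern could be handled when running dataset processing jobs (a minimal sketch using POSIX resource limits; the function name, limits, and isolation strategy are assumptions, not the actual implementation):

```python
import resource
import subprocess

def run_job(command, max_seconds=600, max_memory_bytes=4 * 1024**3):
    """Run a dataset processing job in a separate process with bounded CPU time and memory.

    Storage quotas and environment isolation are out of scope for this sketch.
    """
    def set_limits():
        # Applied in the child process just before exec (POSIX only).
        resource.setrlimit(resource.RLIMIT_CPU, (max_seconds, max_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (max_memory_bytes, max_memory_bytes))

    return subprocess.run(command, preexec_fn=set_limits, timeout=max_seconds, check=False)
```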
For a first version, the proposed approach is the following:
- the services are provided only for a selection of datasets, all of them being "small" public datasets from the Hugging Face Hub
- the datasets are stored on the server (be it in their original form or in a transformed form: parquet, arrow, SQL?)
- the services are provided only for one version of the dataset, ideally the latest revision of the main branch of a dataset repository
- the datasets are updated regularly to try to give access to the "current" version (possibilities: webhooks on git changes, periodic check of the ETags, manual trigger; see the sketch after this list)
- the responses of the static services (statistics, first rows, metadata) are cached
- the dynamic services (SQL query, random access) can be rate-limited, possibly per user through a token
- indexes are set up to query the dataset contents
- the processes on a dataset will run as jobs in an isolated Python environment, with all the required dependencies already installed
- the jobs' resources will be bounded (memory, CPU, storage, time)
- the general cost of every dataset (storage, jobs, queries) is evaluated
- use streaming when possible to speed up the dataset refreshes (see https://huggingface.slack.com/archives/C0311GZ7R6K/p1651592155530169?thread_ts=1651590983.338949&cid=C0311GZ7R6K) or to provide a fallback (for example, if the dataset is too big to be stored on disk)
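As a sketch of the "periodic check" option for detecting changes (checking the latest revision of the repository via huggingface_hub rather than ETags; the function name and caching logic are assumptions for the example):

```python
from huggingface_hub import HfApi

def needs_refresh(dataset_id: str, cached_sha: str) -> bool:
    """Return True if the dataset repository has a new revision on the Hub.

    cached_sha is assumed to be the revision used to build the currently cached responses.
    """
    latest_sha = HfApi().dataset_info(dataset_id).sha
    return latest_sha != cached_sha
```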
For the infrastructure, see https://github.com/huggingface/datasets-server/tree/main/infra.