Preliminary design
The datasets server aims to provide services for the datasets of the Hugging Face Hub through a web API.
Datasets can be very big. Getting metadata, fetching data, and querying or processing data require a lot of resources (time, bandwidth, computing, storage). For some use cases (notebooks, web pages, etc.), these resources are not available. The datasets server is a third party that bears the cost of these resources and provides a curated list of services on the datasets through a lightweight web API.
The Hugging Face Hub would ideally become the one-stop shop for ML datasets in the near future. To increase the usage of Hub datasets, it is crucial to provide the services users need to do their work. By providing specialized services, the datasets server will allow the Hub to add value to the dataset pages (view the data, show stats, run queries, etc.).
The datasets of the Hugging Face Hub can be accessed directly using git or HTTP, or through the datasets library.
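As an illustration, here is a minimal sketch of these access paths; the dataset name ("glue") and the commands shown in the comments are only examples, not part of this design:

```python
# Direct access (examples):
#   git clone https://huggingface.co/datasets/glue
#   curl https://huggingface.co/api/datasets/glue
# Access through the datasets library:
from datasets import load_dataset

dataset = load_dataset("glue", "cola", split="train")
print(dataset[0])
```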
Other related projects:
- huggingface_hub allows retrieving metadata on the datasets (see the sketch after this list)
- AutoNLP (autonlp-backend and autonlp-ui) allows training models using datasets of the Hub
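For example, a short sketch of retrieving dataset metadata with huggingface_hub (the dataset id is an arbitrary example):

```python
from huggingface_hub import HfApi

api = HfApi()
# Fetch the metadata of a dataset repository: tags, card data, files, latest revision...
info = api.dataset_info("glue")  # "glue" is only an example dataset id
print(info.id, info.sha, info.tags)
```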
This project is an evolution of datasets-preview-backend (the previous name of this repository), which provided the list of configs, splits, and first rows of the datasets (using the streaming mode of datasets).
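To make this concrete, here is a rough sketch of how configs, splits, and first rows can be obtained with the streaming mode of datasets; the dataset name and the number of rows are arbitrary choices for the example:

```python
from itertools import islice

from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset

dataset_name = "glue"  # arbitrary example
configs = get_dataset_config_names(dataset_name)
splits = get_dataset_split_names(dataset_name, configs[0])

# Stream the first rows without downloading the whole dataset.
streamed = load_dataset(dataset_name, configs[0], split=splits[0], streaming=True)
first_rows = list(islice(streamed, 10))
```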
The datasets server will provide the following services through a web API (see the endpoint sketch after this list):
- get the metadata of a dataset: tags, configs, splits, features (columns)...
- get the first N rows of a split
- get a quality report on how well the dataset can be accessed using datasets (has metadata, can be downloaded, can be streamed, etc.)
- generate the dataset-info.json (see https://github.com/huggingface/datasets/issues/3507#issue-1091214808)
- get basic statistics about a split: number of samples, size in bytes
- get statistics about a column of a split: distribution, mean, median, etc. (related: https://github.com/huggingface/data-measurements-tool computes frequent words, average and standard deviation of sentence length and word length, and the number of samples per tag/label)
- get a range of rows of a split (random access)
- post SQL queries
- scan files for vulnerabilities (related to security scan)
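As a purely illustrative sketch of what some of these endpoints could look like (the web framework, paths, parameters, and response shapes are assumptions for the example, not the actual implementation):

```python
from itertools import islice

from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset
from fastapi import FastAPI

app = FastAPI()

@app.get("/splits")
def splits(dataset: str):
    # List the configs and splits of a dataset (no caching or error handling, for brevity).
    return {
        "splits": [
            {"dataset": dataset, "config": config, "split": split}
            for config in get_dataset_config_names(dataset)
            for split in get_dataset_split_names(dataset, config)
        ]
    }

@app.get("/first-rows")
def first_rows(dataset: str, config: str, split: str, n: int = 10):
    # Stream the first n rows instead of downloading the whole dataset.
    rows = load_dataset(dataset, config, split=split, streaming=True)
    return {"rows": list(islice(rows, n))}
```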
At least for a first version, the following points are out of scope:
- launch queries on multiple datasets
- make the API public (two main issues: 1. traffic increase, 2. legal rights to redistribute the datasets)
- part of it could become the "HuggingFace Actions"
- "Know your data" for HF hub (https://github.com/huggingface/datasets/issues/3761)
- index / free-text search through the datasets:
  - inside the data. See threads on Slack: https://huggingface.slack.com/archives/C01229B19EX/p1645643928941829, https://huggingface.slack.com/archives/C01229B19EX/p1647372809022069, https://huggingface.slack.com/archives/C01229B19EX/p1651014009925419. Also look at https://opensearch.org/
  - inside the metadata (dataset cards). See thread on Slack: https://huggingface.slack.com/archives/C02V51Q3800/p1650993144004379
- access the cached files directly (mount a remote filesystem?); see https://huggingface.slack.com/archives/C01229B19EX/p1646166958049289
Multiple aspects must be taken into account for the implementation. Not all are equally important.
- size (and cost) of the storage: some datasets are very big (several TB, generally for audio or vision)
- number of files: some datasets have a lot of small files
- bandwidth: downloading (and updating regularly) big datasets takes a lot of bandwidth. It should not be a problem on the datasets server's side, but it might be on the other side (not all the datasets' data files are hosted on the Hub). The dataset hosting platform might rate-limit or have availability issues
- changes: the datasets are live objects that are versioned with git
- changes: the hosted data files might change
- response time: generating the response to the services can take time, because 1. the dataset must be downloaded, but also 2. querying the data on the local files can take time too (possibly reduce-like operations)
- access rights: some datasets are gated, others are private, others must be downloaded manually
- private hub: on-premise hubs might also want to benefit from the services provided by the datasets server
- security: downloading a dataset can require executing arbitrary code (the .py loading script), which might raise security issues
- dependencies: the .py script might require extra packages, but there is currently no way to specify these dependencies
- resources: the processes might take too many resources (memory, CPU, storage, time); see the sketch after this list
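As a rough illustration of how the resource concern could be handled when running dataset processing jobs (a minimal sketch using POSIX resource limits; the function name, limits, and isolation strategy are assumptions, not the actual implementation):

```python
import resource
import subprocess

def run_job(command, max_seconds=600, max_memory_bytes=4 * 1024**3):
    """Run a dataset processing job in a separate process with bounded CPU time and memory.

    Storage quotas and environment isolation are out of scope for this sketch.
    """
    def set_limits():
        # Applied in the child process just before exec (POSIX only).
        resource.setrlimit(resource.RLIMIT_CPU, (max_seconds, max_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (max_memory_bytes, max_memory_bytes))

    return subprocess.run(command, preexec_fn=set_limits, timeout=max_seconds, check=False)
```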
For a first version, the proposed approach is the following:
- the services are provided only for a selection of datasets, all of them being "small" public datasets from the Hugging Face Hub
- the datasets are stored on the server (be it in their original form or in a transformed form: parquet, arrow, SQL?)
- the services are provided only for one version of the dataset, ideally the latest revision of the main branch of a dataset repository
- the datasets are updated regularly to try to give access to the "current" version (possibilities: webhooks on git changes, periodic check of the ETags, manual trigger; see the sketch after this list)
- the responses of the static services (statistics, first rows, metadata) are cached
- the dynamic services (SQL query, random access) can be rate-limited, possibly per user through a token
- indexes are set up to query the dataset contents
- the processes on a dataset will run as jobs in an isolated Python environment, with all the required dependencies already installed
- the jobs' resources will be bounded (memory, CPU, storage, time)
- the general cost of every dataset (storage, jobs, queries) is evaluated
- use streaming when possible to speed up the dataset refreshes (see https://huggingface.slack.com/archives/C0311GZ7R6K/p1651592155530169?thread_ts=1651590983.338949&cid=C0311GZ7R6K) or to provide a fallback (for example, if the dataset is too big to be stored on disk)
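As a sketch of the "periodic check" option for detecting changes (checking the latest revision of the repository via huggingface_hub rather than ETags; the function name and caching logic are assumptions for the example):

```python
from huggingface_hub import HfApi

def needs_refresh(dataset_id: str, cached_sha: str) -> bool:
    """Return True if the dataset repository has a new revision on the Hub.

    cached_sha is assumed to be the revision used to build the currently cached responses.
    """
    latest_sha = HfApi().dataset_info(dataset_id).sha
    return latest_sha != cached_sha
```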
For the infrastructure, see https://github.com/huggingface/datasets-server/tree/main/infra.