Developer guide

This document is intended for developers who want to install, test or contribute to the code.

Set up development environment

Linux

Install rust:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env

Install pyenv:

$ curl https://pyenv.run | bash

Install Python 3.9.18:

$ pyenv install 3.9.18

Check that the expected local version of Python is used:

$ cd services/worker
$ python --version
Python 3.9.18

Install Poetry with pipx:

  • Either a single version:
pipx install poetry==1.8.2
poetry --version
  • Or a parallel version (with a unique suffix):
pipx install poetry==1.8.2 --suffix=@1.8.2
poetry@1.8.2 --version

Set the Python version to use with Poetry:

poetry env use 3.9.18

or

poetry@1.8.2 env use 3.9.18

Install the dependencies:

make install

Mac OS

To install the worker on Mac OS, follow these steps.

First: as an administrator

Install brew:

$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then: as a normal user

Install pyenv:

$ curl https://pyenv.run | bash

Append the following lines to ~/.zshrc:

export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

Log out and log in again.

Install Python 3.9.18:

$ pyenv install 3.9.18

Check that the expected local version of Python is used:

$ cd services/worker
$ python --version
Python 3.9.18

Install Poetry with pipx:

  • Either a single version:
pipx install poetry==1.8.2
poetry --version
  • Or a parallel version (with a unique suffix):
pipx install poetry==1.8.2 --suffix=@1.8.2
poetry@1.8.2 --version

Append the following line to ~/.zshrc:

export PATH="$HOME/.local/bin:$PATH"

Install rust:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env

Set the Python version to use with Poetry:

poetry env use 3.9.18

or

poetry@1.8.2 env use 3.9.18

Install the dependencies:

make install

Install dataset-viewer

To start working on the project:

git clone git@github.com:huggingface/dataset-viewer.git
cd dataset-viewer

Install all the packages:

make install

Install Docker (see https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository and https://docs.docker.com/engine/install/linux-postinstall/).

Run the project locally:

make start

When the docker containers have started, open http://localhost:8100/healthcheck in a browser: it should show ok.
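The healthcheck can also be polled from a script while the containers start. The following is a minimal sketch, assuming the endpoint answers with the plain text ok; the function names, the number of attempts and the delay are illustrative and not part of the project.

```python
import time
import urllib.request
from typing import Callable


def fetch_body(url: str, timeout: float = 2.0) -> str:
    """Return the response body of `url` as stripped text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def wait_for_healthcheck(
    url: str = "http://localhost:8100/healthcheck",
    attempts: int = 30,
    delay: float = 1.0,
    fetch: Callable[[str], str] = fetch_body,
) -> bool:
    """Poll `url` until it answers "ok"; return False after `attempts` tries."""
    for attempt in range(attempts):
        try:
            # Assumption: the body is exactly "ok" once the service is ready.
            if fetch(url) == "ok":
                return True
        except OSError:
            # Connection refused or timed out: the containers are probably still starting.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_for_healthcheck() after make start returns True as soon as the endpoint answers ok, and False if it never does within the allowed attempts. The fetch parameter exists only so that the function can be exercised without a running stack.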

Run the project in development mode:

make dev-start

In development mode, you don't need to rebuild the docker images to apply a change in a worker. You can just restart the worker's docker container and it will apply your changes.

To install a single job (in jobs), library (in libs) or service (in services), go to its directory and install Python 3.9 (consider pyenv) and Poetry (don't forget to add Poetry to the PATH environment variable).

If you use pyenv:

cd libs/libcommon/
pyenv install 3.9.18
pyenv local 3.9.18
poetry env use python3.9

then:

make install

It will create a virtual environment in a ./.venv/ subdirectory.

If you use VSCode, it might be useful to use the "monorepo" workspace (see a blogpost for more explanations). It is a multi-root workspace, with one folder for each library and service (note that we hide them from the ROOT to avoid editing there). Each folder has its own Python interpreter, with access to the dependencies installed by Poetry. You might have to select the interpreter manually the first time you open each folder; VSCode then stores the choice in its local storage.

Architecture

The repository is structured as a monorepo, with Python libraries and applications in jobs, libs and services:

The following diagram represents the general architecture of the project (figure: Architecture).

  • Mongo Server, a Mongo server with databases for: "cache", "queue" and "maintenance".
  • jobs contains the jobs run by Helm before deploying the pods or on a scheduled basis. For now, there are two types of jobs.
  • libs contains the Python libraries used by the services and workers. For now, there are two libraries:
    • libcommon, which contains the common code for the services and workers.
    • libapi, which contains common code for authentication, http requests, exceptions and other utilities for the services.
  • services contains the applications:
    • api, the public API, is a web server that exposes the API endpoints. All the responses are served from pre-computed responses stored in the Mongo server. That's the main point of this project: generating these responses takes time, so they are computed ahead of time and the API server only serves them to the users.
    • webhook, which exposes the /webhook endpoint called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended to the "queue" database.
    • rows
    • search
    • admin, the admin API (which is separated from the public API and might be published under its own domain at some point)
    • reverse proxy, the reverse proxy
    • worker, the worker that processes the queue asynchronously: it takes a job from the "job" collection (caution: these are the jobs stored in the queue, not the Helm jobs), computes the expected response for the associated endpoint, and stores the response in the "cache" collection. The workers also create local files when the dataset contains images or audio. A shared directory (ASSETS_STORAGE_ROOT) must therefore be provisioned with sufficient space for the generated files. The /first-rows endpoint responses contain URLs to these files, served by the API under the /assets/ endpoint.
    • sse-api
  • Clients
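The interplay between the webhook, the workers and the API described in the list above can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: the deque and dict replace the "queue" and "cache" Mongo databases, and the event names, function names and job shape are hypothetical, not the project's actual code.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple

# In-memory stand-ins for the "queue" and "cache" Mongo databases (hypothetical shapes).
Job = Tuple[str, str]  # (dataset, endpoint)
queue: Deque[Job] = deque()
cache: Dict[Job, dict] = {}


def on_webhook(event: str, dataset: str, endpoint: str = "/first-rows") -> None:
    """Mimic the webhook service: delete cached responses or enqueue a job."""
    if event == "delete":
        # On deletion, drop every cached response of the dataset.
        for key in [k for k in cache if k[0] == dataset]:
            del cache[key]
    elif event in ("create", "update"):
        # On creation or update, append a new job to the queue.
        queue.append((dataset, endpoint))
    else:
        raise ValueError(f"unknown event: {event}")


def work_once(compute: Callable[[str, str], dict]) -> Optional[Job]:
    """Mimic one worker iteration: take a job, compute its response, cache it."""
    if not queue:
        return None
    job = queue.popleft()
    cache[job] = compute(*job)
    return job


def serve(dataset: str, endpoint: str) -> Optional[dict]:
    """Mimic the API: only ever read pre-computed responses from the cache."""
    return cache.get((dataset, endpoint))
```

The point of the split is visible in serve: it never computes anything, so its latency does not depend on how long compute takes in the worker.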

If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Datasets-server-464848da2a984e999c540a4aa7f0ece5.

Hence, the working application has the following core components:

  • a Mongo server with two main databases: "cache" and "queue"
  • one instance of the API service which exposes a port
  • one instance of the ROWS service which exposes a port
  • one instance of the SEARCH service which exposes a port
  • N worker instances that process the pending "jobs" and store the results in the "cache"

The application also has optional components:

  • a reverse proxy in front of the API to serve static files and proxy the rest to the API server
  • an admin server to serve technical endpoints
  • a shared directory for the assets and cached-assets in S3 (it can be configured to point to a local storage instead)
  • a shared storage for temporary files created by the workers in EFS (it can be configured to point to a local storage instead)
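When the assets directory points to local storage, it can be worth checking the available space before starting the workers. The following is a minimal sketch, assuming the directory is given by an ASSETS_STORAGE_ROOT environment variable; the function names, the 1 GiB threshold and the fallback to the current directory are illustrative assumptions.

```python
import os
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


def check_assets_storage(required_bytes: int = 1 * 1024**3) -> bool:
    """Check the directory named by ASSETS_STORAGE_ROOT (falls back to the current directory)."""
    # Assumption: the variable holds a local path, not an S3 location.
    root = os.environ.get("ASSETS_STORAGE_ROOT", ".")
    return os.path.isdir(root) and has_free_space(root, required_bytes)
```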

The following environments contain all the modules: reverse proxy, API server, admin API server, workers, and the Mongo database.

| Environment | URL | Type | How to deploy |
| --- | --- | --- | --- |
| Production | https://datasets-server.huggingface.co | Helm / Kubernetes | Argo CD |
| Development | https://datasets-server.us.dev.moon.huggingface.tech | Helm / Kubernetes | Argo CD |
| Local build | http://localhost:8100 | Docker compose | make start (builds docker images) |

Jobs queue

The following diagram represents the logic when a worker pulls a job from the queue:

Jobs queue

Source: https://www.figma.com/board/Yymk75rQTYpZuIwTqffyKQ/Queues-in-dataset-viewer

Quality

The CI checks the quality of the code through a GitHub action. To manually format the code of a job, library, service or worker:

make style

To check the quality (which includes checking the style, but also security vulnerabilities):

make quality

Tests

The CI runs the tests through a GitHub action. To manually test a job, library, service or worker:

make test

Note that it requires the resources to be ready, i.e. Mongo and the storage for assets.
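A quick way to see whether Mongo is reachable before running the tests is to try a TCP connection. The following is a minimal sketch; is_port_open is a hypothetical helper, and the host and port to check depend on your setup (27017 is MongoDB's default port, but the test configuration may use another one).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: the resource is not ready.
        return False
```

For example, is_port_open("localhost", 27017) tells whether something is listening on the default Mongo port. It does not prove that the listener is a healthy Mongo server.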

To launch the end to end tests:

make e2e

Versions

We don't use the package versions (in pyproject.toml files), so there is no need to update them.

Pull requests

All contributions should go through a pull request. The pull requests must be "squashed" (i.e. one commit per pull request).

GitHub Actions

You can use act to test the GitHub Actions (see .github/workflows/) locally. It shortens the feedback loop when working on the GitHub Actions, avoids polluting the branches with empty pushes only meant to trigger the CI, and lets you run only specific actions.

For example, to launch the build and push of the docker images to Docker Hub:

act -j build-and-push-image-to-docker-hub --secret-file my.secrets

with my.secrets a file with the secrets:

DOCKERHUB_USERNAME=xxx
DOCKERHUB_PASSWORD=xxx
GITHUB_TOKEN=xxx
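Before launching act, the secrets file can be checked for missing entries. The following is a minimal sketch, assuming simple KEY=VALUE lines as in the example above; it is a simplified parser written for illustration, not the one act uses, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List

# The three names come from the example secrets file above.
REQUIRED_SECRETS = ("DOCKERHUB_USERNAME", "DOCKERHUB_PASSWORD", "GITHUB_TOKEN")


def parse_secrets(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    secrets: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, value = line.split("=", 1)
        secrets[key.strip()] = value.strip()
    return secrets


def missing_secrets(text: str, required: Iterable[str] = REQUIRED_SECRETS) -> List[str]:
    """Return the required names that are absent or empty in the secrets text."""
    secrets = parse_secrets(text)
    return [name for name in required if not secrets.get(name)]
```

Reading my.secrets and passing its content to missing_secrets returns an empty list when all three entries are set.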