Epic Type: Implementation Epic
We want to implement the Work Package Service outlined in https://docs.ghga-dev.de/main/architecture_concepts/ac001_file_validation_and_encryption.html#client-configuration-and-authorization
The Work Package Service REST API should have the following end points:
Used by the web frontend to create work packages:
POST /work-packages
- auth header: internal access token
- request body:
  - dataset_id: string (the ID of a dataset)
  - type: enum (download/upload)
  - file_ids: array of strings (null = all files of the dataset)
  - user_public_crypt4gh_key: string (the user's public Crypt4GH key)
- response body:
  - id: string (the ID of the created work package)
  - token: string (the encrypted and base64-encoded work package access token)
Possible extension (not in this epic): The response body could also contain info about the expiration of the token.
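For illustration, creating a work package from a client could look like the following sketch; the base URL, IDs, key value, and the Bearer header scheme are placeholders and assumptions, not part of the specification above.

```python
import httpx

WPS_URL = "https://data.example.org/api/wps"       # placeholder base URL
INTERNAL_ACCESS_TOKEN = "<internal access token>"  # placeholder

response = httpx.post(
    f"{WPS_URL}/work-packages",
    headers={"Authorization": f"Bearer {INTERNAL_ACCESS_TOKEN}"},
    json={
        "dataset_id": "DS-0001",
        "type": "download",
        "file_ids": ["FILE-0001", "FILE-0002"],  # or None for all files of the dataset
        "user_public_crypt4gh_key": "<base64-encoded Crypt4GH public key>",
    },
)
response.raise_for_status()
work_package = response.json()
work_package_id = work_package["id"]     # ID of the created work package
encrypted_token = work_package["token"]  # encrypted, base64-encoded access token
```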
Used by the GHGA connector to retrieve work packages and create work order tokens:
GET /work-packages/{work_package_id}
- auth header: work package access token
- gets details on the specified work package
POST /work-packages/{work_package_id}/files/{file_id}/work-order-tokens
- auth header: work package access token
- gets an encrypted work order token for the specified work package and file
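On the connector side, once the work package access token has been decrypted (see below), the two endpoints could be used roughly as in this sketch; the base URL, the Bearer scheme, and the shape of the work package details response are assumptions.

```python
import httpx

WPS_URL = "https://data.example.org/api/wps"            # placeholder base URL
WORK_PACKAGE_ID = "WP-0001"                             # first part of the copied string
ACCESS_TOKEN = "<decrypted work package access token>"  # second part, after decryption

headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# Get details on the work package (assumed to include the file IDs).
details = httpx.get(f"{WPS_URL}/work-packages/{WORK_PACKAGE_ID}", headers=headers)
details.raise_for_status()
file_ids = list(details.json()["files"])

# Request a short-lived, encrypted work order token for one of the files.
order = httpx.post(
    f"{WPS_URL}/work-packages/{WORK_PACKAGE_ID}/files/{file_ids[0]}/work-order-tokens",
    headers=headers,
)
order.raise_for_status()
encrypted_work_order_token = order.json()
```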
The following endpoints could be added later to manage work packages in the web frontend; for now, they will not be implemented and are not part of the API.
GET /work-packages
- auth header: internal access token
- gets a list of all work packages of the current user
GET /work-packages/{work_package_id}
- auth header: internal access token
- gets work package with given id if it belongs to the current user
DELETE /work-packages/{work_package_id}
- auth header: internal access token
- deletes work package with given id if it belongs to the current user
Deletion of work packages should be logged or implemented as deactivation (immediate expiration) so that the fact that a work package access token had been created by a user is archived in the system.
The work package access token is a random string generated by the Work Package Service on a POST request. Its SHA-256 hash is stored in the database. The token itself is not stored; instead, it is encrypted with the user's public Crypt4GH key (user_public_crypt4gh_key) and then returned in the response.
The frontend should display the work package ID together with the encrypted work package access token separated by a colon as a single string that can be copied and pasted to the CLI client (GHGA connector), so that the CLI client gets both pieces of information.
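A minimal sketch of token issuance, assuming the token is encrypted for the user's Crypt4GH (X25519) public key with a libsodium sealed box via PyNaCl; the actual encryption helper used by the service may differ.

```python
import base64
import hashlib
import secrets

from nacl.public import PublicKey, SealedBox


def issue_access_token(user_public_crypt4gh_key: str) -> tuple[str, str]:
    """Create a work package access token; return (token_hash, encrypted_token)."""
    # Generate the random token; only its SHA-256 hash is persisted in the database.
    token = secrets.token_urlsafe(24)
    token_hash = hashlib.sha256(token.encode("ascii")).hexdigest()

    # Encrypt the token with the user's public Crypt4GH key (an X25519 key) and
    # base64-encode it for the JSON response; the plain token is never stored.
    public_key = PublicKey(base64.b64decode(user_public_crypt4gh_key))
    encrypted = SealedBox(public_key).encrypt(token.encode("ascii"))
    encrypted_token = base64.b64encode(encrypted).decode("ascii")

    return token_hash, encrypted_token


# The response carries the work package ID and the encrypted token separately;
# the frontend joins them as f"{work_package_id}:{encrypted_token}" for copy & paste.
```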
The work order token is a JWT with a very short time to live (maximum 30 seconds). It is signed by the Work Package Service and then encrypted with the public Crypt4GH key of the user, similar to the work package access token. It contains the following claims:
- type: download or upload
- file_id: the ID of the file that shall be downloaded or uploaded
- user_id: the internal ID of the user
- user_public_crypt4gh_key: the public key of the user stored in the work package
- full_user_name: the full name of the user (with academic title)
- email: the email address of the user
The user name and email are part of the token so that the upload/download controllers can log this information and notify the users.
The work order tokens are signed by the Work Package Service. A separate key pair (different from the keys used for the internal access tokens) is created for this purpose, stored in HashiCorp Vault, and injected into the service via configuration. The public key must be provided to the download/upload controller services so that they can validate the work order tokens; the private key must only be made available to the Work Package Service.
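A sketch of how a work order token could be built and signed, here with jwcrypto and an ES256 key; the signature algorithm, the iat/exp claims, and the in-memory key generation are assumptions made only for this sketch.

```python
import time

from jwcrypto import jwk, jwt

# In production, the signing key pair is created once, stored in HashiCorp Vault
# and injected via configuration; here one is generated just for the sketch.
signing_key = jwk.JWK.generate(kty="EC", crv="P-256")


def create_work_order_token(
    *, work_type: str, file_id: str, user_id: str,
    user_public_crypt4gh_key: str, full_user_name: str, email: str,
) -> str:
    """Create a signed work order token that is valid for at most 30 seconds."""
    now = int(time.time())
    claims = {
        "type": work_type,                                    # "download" or "upload"
        "file_id": file_id,
        "user_id": user_id,
        "user_public_crypt4gh_key": user_public_crypt4gh_key,
        "full_user_name": full_user_name,
        "email": email,
        "iat": now,
        "exp": now + 30,                                      # very short time to live
    }
    token = jwt.JWT(header={"alg": "ES256", "typ": "JWT"}, claims=claims)
    token.make_signed_token(signing_key)
    # The serialized JWT would then be encrypted with the user's public Crypt4GH
    # key, just like the work package access token, before it is returned.
    return token.serialize()
```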
To allow the Work Package Service to check whether a given user has download access to a given dataset, the claims repository provides the following internal endpoint:
GET /download-access/users/{user_id}/datasets
- authorization: only internal from download controller (via service mesh)
- returns a list of all dataset IDs that can be downloaded by the user
GET /download-access/users/{user_id}/datasets/{dataset_id}
- authorization: only internal from download controller (via service mesh)
- returns true or false as a scalar resource
Possible extension (not in this epic): Instead of true/false, the latter endpoint could return null or the date when the access expires (null could be confused with unlimited access though). This could be used by the Work Package Service to limit the expiration date of the tokens, even though the Work Package Service re-checks access on every operation anyway.
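From the Work Package Service, these access checks could be wrapped in a small client like the following sketch; the base URL is a placeholder, and the internal-only restriction is enforced by the service mesh rather than by this code.

```python
import httpx

CLAIMS_REPOSITORY_URL = "http://claims-repository"  # placeholder, internal via service mesh


async def get_downloadable_datasets(user_id: str) -> list[str]:
    """Return the IDs of all datasets the user is allowed to download."""
    async with httpx.AsyncClient(base_url=CLAIMS_REPOSITORY_URL) as client:
        response = await client.get(f"/download-access/users/{user_id}/datasets")
        response.raise_for_status()
        return response.json()


async def has_download_access(user_id: str, dataset_id: str) -> bool:
    """Check whether the user is allowed to download the given dataset."""
    async with httpx.AsyncClient(base_url=CLAIMS_REPOSITORY_URL) as client:
        response = await client.get(
            f"/download-access/users/{user_id}/datasets/{dataset_id}"
        )
        response.raise_for_status()
        return bool(response.json())  # the endpoint returns true or false as a scalar
```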
In order to facilitate authorization, the path of these endpoints starts with download-access and not with users, which is already used by other endpoints of the user registry and claims repository.
Note that this is a shortcut as long as we don't have a visa issuer service. Later, the claims repository should not be contacted directly, but the information about access grants will be requested from the visa issuer service and checked with the help of the visa library.
Possible extension (not in this epic): In order to check whether a given user has upload access to a given dataset, we may introduce a similar endpoint that will be based on checking custom visa types for dataset submission in the claims repository. Alternatively, we could store a list of internal user IDs of dataset submitters also in the Dataset
collection described below and update them similarly to the file IDs belonging to the dataset.
In order to provide the user with a list of downloadable datasets, the Work Package Service combines the information from the access checks explained above with information about the dataset as explained below and makes it available to the frontend via the following endpoint:
GET /users/{user_id}/datasets
- auth header: internal access token with the user context
- returns a list of all dataset IDs that can be downloaded by the user
Note: The users/ part has been included in the endpoint path to make it more RESTful, even though the user is already known from the user context. The endpoint must validate that the user_id matches the auth context. It is assumed that this endpoint is used with a special prefix for the Work Package Service so that it does not collide, e.g., with the /users endpoint of the user registry.
Possible extension (not in this epic): The dataset objects returned here could be supplemented with information on when user access expires, obtained via the access checks explained above, if these are extended as well. This information could then be also shown to the user on the profile page or the work package creation page.
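A minimal FastAPI-style sketch of this endpoint, assuming a hypothetical auth dependency that extracts the user ID from the internal access token and using a stub for the access-check helper from the earlier sketch.

```python
from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()


def get_authenticated_user_id() -> str:
    """Hypothetical dependency: validates the internal access token and returns
    the user ID from its auth context."""
    raise NotImplementedError


async def get_downloadable_datasets(user_id: str) -> list[str]:
    """Access check against the claims repository (see the earlier sketch)."""
    raise NotImplementedError


@app.get("/users/{user_id}/datasets")
async def get_datasets(
    user_id: str, auth_user_id: str = Depends(get_authenticated_user_id)
) -> list[str]:
    # The user_id in the path must match the user from the auth context.
    if user_id != auth_user_id:
        raise HTTPException(status_code=403, detail="Not authorized")
    # Combine the access checks with the Dataset collection described below.
    return await get_downloadable_datasets(user_id)
```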
The download and upload service controllers authorize access to files by validating the content and signature of the passed work order token. The work order tokens are only valid for a very short time and for a specific file.
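On the controller side, validation could look roughly as follows with jwcrypto, using the public verification key from the key pair mentioned above; checking the type and file_id claims against the concrete request is shown as one possible approach.

```python
import json

from jwcrypto import jwk, jwt


def validate_work_order_token(
    serialized_token: str,
    public_key: jwk.JWK,
    *,
    expected_type: str,
    expected_file_id: str,
) -> dict:
    """Verify signature and expiry, then check that the token matches the request."""
    # jwcrypto verifies the signature and the exp claim while deserializing.
    verified = jwt.JWT(jwt=serialized_token, key=public_key)
    claims = json.loads(verified.claims)
    if claims["type"] != expected_type or claims["file_id"] != expected_file_id:
        raise PermissionError("work order token does not match the requested operation")
    return claims
```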
The database of the Work Package Service should store the following WorkPackage objects:
WorkPackage:
id: str # the ID of this work package
dataset_id: str # all files must belong to the dataset with this ID
type: enum # the work order type, either download or upload
files: dict[str, str] # IDs of all included files mapped to their extensions
user_id: str # the internal user ID
full_user_name: str # the full name of the user (with academic title)
email: str # the email address of the user
user_public_crypt4gh_key: str # the user's public Crypt4GH key
token_hash: str # hash of the work package access token
created: datetime # creation date of this work package
expires: datetime # expiry date of this work package
The files field is a mapping of file IDs to file extensions. If the user specified a list of file IDs when creating the work package, the Work Package Service checks that they actually belong to the given dataset_id. If the user didn't specify a list of file IDs, the mapping will contain all file IDs that belong to the given dataset_id. So it can be assumed that this mapping is never empty.
An entry in the WorkPackage collection is only created by the Work Package Service after verification that the user with the given user_id is allowed to access the dataset with the given dataset_id (see access checks above).
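Putting these rules together, work package creation could follow a sketch like this one; the two helper functions are hypothetical stand-ins for the access check and the Dataset lookup, and the expiry policy is an assumption.

```python
from datetime import datetime, timedelta, timezone


async def has_download_access(user_id: str, dataset_id: str) -> bool:
    """Access check against the claims repository (see the earlier sketch)."""
    raise NotImplementedError


async def get_dataset_files(dataset_id: str) -> dict[str, str]:
    """Hypothetical lookup of file IDs and extensions from the Dataset collection."""
    raise NotImplementedError


async def create_work_package(
    *, user_id: str, dataset_id: str, work_type: str, file_ids: list[str] | None
) -> dict:
    """Validate the request and assemble a WorkPackage document."""
    # The user must be allowed to access the dataset (see access checks above).
    if not await has_download_access(user_id, dataset_id):
        raise PermissionError("user has no download access to this dataset")

    # Build the files mapping (file ID -> extension); it is never empty.
    dataset_files = await get_dataset_files(dataset_id)  # e.g. {"FILE-0001": ".fastq.gz"}
    if file_ids is None:
        files = dict(dataset_files)  # all files of the dataset
    else:
        unknown = set(file_ids) - set(dataset_files)
        if unknown:
            raise ValueError(f"files not in dataset {dataset_id}: {unknown}")
        files = {file_id: dataset_files[file_id] for file_id in file_ids}

    now = datetime.now(timezone.utc)
    return {
        "dataset_id": dataset_id,
        "type": work_type,
        "files": files,
        "user_id": user_id,
        "created": now,
        "expires": now + timedelta(days=1),  # the expiry policy is an assumption
    }
```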
In the database, the Work Package Service also keeps track of datasets, particularly how the datasets are named and which file IDs with which extensions belong to them (maybe also which users are allowed to submit data for the individual datasets). The following association collection is used for this purpose:
Dataset:
id: str # id of the dataset
title: str # short title of the dataset
description: str # long description of the dataset
files: list[DatasetFile] # all files contained in the dataset
DatasetFile:
id: str # id of the file
extension: str # the file extension with a leading dot
To populate this collection, the Work Package Service listens to events of type metadata_dataset_overview, which contain all the necessary information.
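A sketch of the corresponding event handler; the payload field names are assumptions, since the exact schema of metadata_dataset_overview events is not specified here.

```python
from dataclasses import dataclass


@dataclass
class DatasetFile:
    id: str
    extension: str  # the file extension with a leading dot


@dataclass
class Dataset:
    id: str
    title: str
    description: str
    files: list[DatasetFile]


def handle_dataset_overview_event(payload: dict) -> Dataset:
    """Translate a metadata_dataset_overview payload into a Dataset document.

    The field names used here are assumptions; the resulting Dataset would be
    upserted into the Dataset collection.
    """
    return Dataset(
        id=payload["accession"],
        title=payload["title"],
        description=payload["description"],
        files=[
            DatasetFile(id=f["accession"], extension=f["extension"])
            for f in payload["files"]
        ],
    )
```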
As part of this epic, a simple form for creating work packages should be added to the data portal frontend. The form should allow the user to:
- Select a dataset from the list of accessible datasets
- Show the description of the dataset
- Enter a list of file IDs to restrict the scope of the work package (if the list is empty, all files will be included)
- Create a work package and access token after clicking a submit button
- Show the access token together with the work package ID in the format that can be pasted to the CLI (GHGA connector)
- Add a button to copy this to the clipboard
- Possible extension (not in this epic): Show how long the access token is valid
Number of sprints required: 2
Number of developers required: 1