Data Publishing - first draft documentation for the data publishing service #191

Open: wants to merge 29 commits into base `main`.

**Commits (29)**

- `0d029df` Data publishing doc draft. (marioa, Feb 4, 2025)
- `b46427b` Added an image to create a dataset. (marioa, Feb 4, 2025)
- `1176b6a` Moved project file content to assumptions. (marioa, Feb 4, 2025)
- `c543202` Removed metadata for data publishing. (marioa, Feb 4, 2025)
- `bbc53c1` Renamed files. (marioa, Feb 4, 2025)
- `d871f96` Added more generic instructions. (marioa, Feb 4, 2025)
- `37eb8f1` Content asking for feedback. (marioa, Feb 4, 2025)
- `8f61b71` Capitalised Data Catalogue. (marioa, Feb 4, 2025)
- `69409be` Feedback for missing content. (marioa, Feb 4, 2025)
- `27b72df` Moved dataset content to the catalogue file. (marioa, Feb 4, 2025)
- `e98a40b` Corrected link to the image. (marioa, Feb 4, 2025)
- `a81859f` Added an adominition about content. (marioa, Feb 4, 2025)
- `1676c2d` Added some linking text to the next section. (marioa, Feb 4, 2025)
- `8efe0ca` Moved the downloading content to uploading. (marioa, Feb 4, 2025)
- `79af0e0` Renamed uploading to accessing. (marioa, Feb 4, 2025)
- `4bc0899` Uploading/Downloading -> Manipulating. (marioa, Feb 4, 2025)
- `f5ce201` Added note on providing documentation on how to use published data. (marioa, Feb 4, 2025)
- `91767f7` Added something on downloading using the aws client. (marioa, Feb 4, 2025)
- `3ffcad6` Removed empty file. (marioa, Feb 5, 2025)
- `af7708a` Fixed typos. (marioa, Feb 5, 2025)
- `5a89d3f` Changed the admonition types. (marioa, Feb 5, 2025)
- `08d573b` Changed section titles. (marioa, Feb 5, 2025)
- `245722a` Moved S3 content to the S3 section. (marioa, Feb 11, 2025)
- `b02b376` Corrections and improvements. (marioa, Feb 11, 2025)
- `d52e5ed` Minor changes. (marioa, Feb 11, 2025)
- `8d110e5` Merge with main, fixed conflict. (marioa, Feb 11, 2025)
- `62eac3b` Merge branch 'main' into publishing (marioa, Feb 25, 2025)
- `b51c107` S3->https for publick S3 buckets. (marioa, Feb 25, 2025)
- `be13d1a` Fixed typo. (marioa, Feb 25, 2025)
Binary file added docs/images/CreateDataset.png
56 changes: 56 additions & 0 deletions docs/services/datapublishing/catalogue.md
@@ -0,0 +1,56 @@
# Data Publishing

## Customising your entry in the EIDF Data Catalogue

Once your project is approved and you are close to publishing your data, a CKAN organisation will be created for you in the [EIDF Data Catalogue](https://catalogue.eidf.ac.uk/). We do not create organisations automatically, to avoid the catalogue filling up with organisations that have no published data.

You can log in to the EIDF Data Catalogue using your SAFE credentials; there is a "Log in" link at the top right. Find your organisation, then customise it by clicking the "Manage" button at the top right. For example, you can replace the EIDF project number with a friendlier name, add a description of your organisation, provide a logo or image representing it, and associate metadata pairs to aid discovery. Customising your organisation makes it more attractive to potential users of your data and also aids discovery.

!!! warning "**Do NOT use the CKAN interface to create Datasets**"
    The data ingest process creates these for you and associates S3 links with your data. You can provide additional metadata once the Dataset records are in CKAN, but please do not add datasets through the CKAN interface. Contact us if you would like anything removed.

## Creating your dataset(s)

Once your project is approved, go to your project in the EIDF portal at this link:

* [https://projects.eidf.ac.uk/ingest/](https://projects.eidf.ac.uk/ingest/)

Select the project which you want to use to ingest data. The list of `Ingest Datasets` will be empty unless you have already created Datasets.

Create a Dataset by pressing the `New` button. You will need to provide the following information:

* **Name**: The name for your dataset.
* **S3 Bucket name**: this field is populated automatically from your dataset name. You can customise it by editing the field directly, subject to the constraints listed below the text box. Note, however, that changing the dataset name will overwrite any customised bucket name, and the project ID prefix at the start cannot be changed.
* **Number of buckets**: you may want to distribute your data over a number of S3 buckets if your dataset is big.
* **Description**: a description of your dataset.
* **Link**: a link describing your group/contact information.
* **Contact email**: a contact email to answer queries about your data set (this is optional).
* **License**: the license under which you will distribute your data.
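As a rough illustration, the automatic derivation of the bucket name can be pictured as a slugification step with the project ID prefixed. This is only a sketch under assumed rules: the authoritative constraints are those displayed below the text box in the portal, and `suggest_bucket_name` is a hypothetical helper, not part of the EIDF portal.

```python
import re

def suggest_bucket_name(project_id: str, dataset_name: str) -> str:
    """Sketch of deriving an S3-friendly bucket name from a dataset name.

    NOTE: the real portal rules (length limits, allowed characters) are
    shown below the text box in the portal and may differ from this sketch.
    """
    slug = dataset_name.strip().lower()
    # keep only letters and digits; collapse everything else into dashes
    slug = re.sub(r"[^a-z0-9]+", "-", slug).strip("-")
    return f"{project_id}-{slug}"

print(suggest_bucket_name("eidf123", "My Walking Dataset"))
# eidf123-my-walking-dataset
```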

An example of the form is given below.

![Interface to create a dataset](../../images/CreateDataset.png)

Once you are happy with the content, press the `Create` button. This information will be used to create an S3 bucket to which you will be able to migrate your data.

This creates a Dataset within your organisation in the EIDF Data Catalogue and the corresponding data bucket(s) in S3.

You should now be able to click on a link to your dataset to see a copy of the information you provided. Once the S3 bucket has been created and you have added your data, you can add the S3 link to the catalogue entry. You can supplement your Dataset entry in the EIDF catalogue with additional metadata once you have logged in to the Data Catalogue using your SAFE credentials.

## Metadata format

Metadata for resources in your dataset is added directly through the EIDF Data Catalogue.

Make sure you are logged in to the [EIDF Data Catalogue](https://catalogue.eidf.ac.uk). Open your dataset's page and click "Manage" at the top right. Open the "Resources" tab and press the "+ Add new resource" button. You can now fill in the form and describe your data as you wish. Some entries are required; these are marked with a red "\*" in the EIDF Data Catalogue:

* **Name**: a descriptive name for your resource.
* **Access URL**: a link to a file in S3, or to a set of files with a common prefix, that you uploaded as explained above.
* **Description**: a human-readable description of your data set.
* **Resource format**: the type of data included in your resource.
* **Unique Identifier**
* **Licence**: the licence under which you are releasing your data.

Having created an S3 bucket and provided metadata for your data set in the EIDF Data Catalogue, please consult the [S3 tutorial section](../s3/tutorial.md) for an overview of the commands you will require, together with some examples.

!!! note
    If it will not be immediately obvious to a third party how your data can be used, please provide a link to documentation explaining how to unpack and use it. Not everyone who wants to use your data will be a domain expert in your field.
36 changes: 36 additions & 0 deletions docs/services/datapublishing/service.md
@@ -0,0 +1,36 @@
# Data Publishing

## Service provision

The EIDF guarantees, to the best of its ability, to continue its services until at least 31-Mar-2032, and aims to continue beyond 2032 subject to funding. Should we have to terminate the service, we will give you at least 3 months' notice to retrieve your data, so it is important to keep your contact details up to date. The publishing service is not an archiving service, and we recommend that, where possible, you keep a backup copy of your data outside the EIDF.

We reserve the right to remove any data that is illegal or inappropriate, as well as any data that remains at the end of service provision.

Some basic assumptions:

* You already have a [SAFE](https://safe.epcc.ed.ac.uk/) account and can access the [EIDF portal](https://portal.eidf.ac.uk/). Otherwise consult the [EIDF portal documentation](https://docs.eidf.ac.uk/access/project/) before proceeding.
* To qualify for the free data publishing service your data must be open and freely available to all. If you want to control access to your data you should use the [S3 service](https://epcced.github.io/eidf-docs/services/s3/) instead, which is not a free service.
* The service is free up to a generous threshold data volume. If the data you wish to publish exceeds this threshold, we will get in touch when you apply for a data publishing project.

If you find anything in this documentation that is unclear, missing or wrong, please let us know via the EIDF query system.

## Applying for a data project

To start the process you will need to apply for an EIDF data project which is slightly different from other EIDF project applications.

!!! note
A data publishing version of the EIDF portal will be deployed in the near future. For now, you will have to use the generic portal.

In the [EIDF portal](https://portal.eidf.ac.uk/):

* Press the `Your project applications` link.
* Press the `New Application` link and put in an application for us to host your data.
* You will be asked to supply a title for your application, a start date (when you hope to start publishing your data) and a proposed end date (at the moment this cannot be later than 31-Dec-2032).

For the EIDF services you require, choose "*ingest data formally into EIDF for long-term hosting*". Note that all the other EIDF services have a [cost](https://edinburgh-international-data-facility.ed.ac.uk/access), so adding any other EIDF service will incur a charge. Data publishing only incurs a cost if you exceed a threshold; we will get in touch if you pass it.

Be sure to describe the dataset(s) you wish to ingest, then submit your application. Your application will be reviewed and you will be notified whether your project has been approved or rejected; someone may be in touch to clarify points in your application.

Once your data project has been approved we will create an organisation in our EIDF [Data Catalogue](https://catalogue.eidf.ac.uk). The Data Catalogue is a customised instance of [CKAN](https://ckan.org/), an open-source data management application. We map EIDF projects to CKAN organisations. A CKAN organisation allows you to brand your organisation, to provide metadata that aids the discovery of your data, and to publish your datasets together with metadata specific to those data sets.
53 changes: 52 additions & 1 deletion docs/services/s3/tutorial.md
@@ -67,6 +67,56 @@ To read from a public bucket without providing credentials, add the option `--no-sign-request`:

```bash
aws s3 ls s3://<bucketname> --no-sign-request
```

### Examples

Suppose you want to upload selected files from a subdirectory to your S3 bucket:

```bash
aws s3 cp ./mydir s3://mybucket --recursive --exclude "*" \
--include "*.dat"
```

Here, only the `*.dat` files in `mydir` will be uploaded to `s3://mybucket`.

You can check your upload using:

```bash
aws s3 ls --summarize --human-readable --recursive s3://mybucket/
```

You can get help on the options for any command using:

```bash
aws s3 help
```

or for a particular command:

```bash
aws s3 ls help
```

For public S3 buckets, such as those provided by the data publishing service, you can construct a downloadable HTTPS link from an S3 URI. For example, taking:

```text
s3://eidfXXX-my-dataset/mydatafile.csv
```

and applying the following transformation:

```text
https://s3.eidf.ac.uk/eidfXXX-my-dataset/mydatafile.csv
```
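The transformation above can be captured in a small helper; the endpoint `https://s3.eidf.ac.uk` is taken from the example above, and `s3_to_https` is an illustrative function, not part of any EIDF tooling:

```python
def s3_to_https(s3_uri: str, endpoint: str = "https://s3.eidf.ac.uk") -> str:
    """Rewrite an s3:// URI as a public HTTPS download link."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {s3_uri}")
    # the bucket/key path carries over unchanged; only the scheme and host change
    return f"{endpoint}/{s3_uri[len(prefix):]}"

print(s3_to_https("s3://eidfXXX-my-dataset/mydatafile.csv"))
# https://s3.eidf.ac.uk/eidfXXX-my-dataset/mydatafile.csv
```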

You can use your browser to download a particular file. Alternatively, you can use the AWS CLI to download an entire data set:

```bash
aws s3 cp --recursive s3://eidf158-walkingtraveltimemaps/ ./walkingtraveltimemaps \
--no-sign-request
```

will copy the entire contents of the S3 bucket to your `walkingtraveltimemaps` subdirectory. Note that you must use `--no-sign-request` when accessing buckets owned by other people.

## Python using `boto3`

The following examples use the Python library `boto3`.
@@ -158,7 +208,8 @@ Buckets owned by an EIDF project are placed in a tenancy in the EIDF S3 Service.
The project code is a prefix on the bucket name, separated by a colon (`:`), for example `eidfXX1:somebucket`.
Note that some S3 client libraries do not accept bucket names in this format.

Bucket permissions use IAM (Identity and Access Management) policies. You can grant other accounts (within the same project or from other projects) read or write access to your buckets.

For example, to grant permissions to put, get, delete and list objects in bucket `eidfXX1:somebucket` to the account `account2` in project `eidfXX2`:

```json
```
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -76,6 +76,9 @@ nav:
- "Overview": services/s3/index.md
- "Manage": services/s3/manage.md
- "Tutorial": services/s3/tutorial.md
- "Data Publishing":
- "Getting started": services/datapublishing/service.md
- "Your Data Catalogue entry": services/datapublishing/catalogue.md
- "Data Catalogue":
- "Metadata information": services/datacatalogue/metadata.md
#- "Managed File Transfer":