GSoC 2021 Project Ideas

Please ask questions here. Tag @ethanwhite, @henrykironde

Preferred names (Henry, Ethan), Preferred_greeting (Hi|Hello|Dear|Thanks|Thank you [First_name])

The code of conduct should be your first read.

Data Retriever: Add Data Packages

Rationale

The Data Retriever is a package manager for data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. The Data Retriever handles tabular and spatial data, as well as compressed versions of these data forms, e.g. zip, gz, and tar files.

Approach

The goal of the project is to scale up the number of data packages available in the Data Retriever. The Data Retriever uses the data package specification to identify and process data. The data packages are defined using JSON, and the edge cases are defined and built using Python. The Data Retriever's public data packages are stored in the Retriever recipes repository. This project aims to add public datasets as packages to the Data Retriever. Public data is available online under several domains and hosted in various forms. The Data Retriever supports multiple source data formats, such as CSV, XML, JSON, SQLite, and geospatial data. Some of the sources for public data are Kaggle, Google datasets, data.gov, and the National Ecological Observatory Network. For this project, we will identify a number of these data sources and create data packages for the data.
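
To give a feel for what a JSON-defined data package involves, here is a minimal sketch in Python. It is not an actual Retriever recipe: the dataset name, URL, field names, and the `check_package` helper are all hypothetical, and the keys shown (`name`, `resources`, `url`, `schema`) are only assumed to resemble the data package layout. Check the Retriever recipes repository for the authoritative format.

```python
import json

# Hypothetical descriptor: the dataset name, URL, and field names are
# placeholders for illustration, not an actual Retriever recipe.
example_package = {
    "name": "example-dataset",
    "title": "Example dataset",
    "resources": [
        {
            "name": "main",
            "url": "https://example.org/data/main.csv",
            "schema": {
                "fields": [
                    {"name": "site_id", "type": "int"},
                    {"name": "species", "type": "char"},
                ]
            },
        }
    ],
}


def check_package(package):
    """Return a list of problems found in a data package descriptor.

    This is an illustrative check, not the Data Retriever's own validation.
    """
    problems = []
    if not package.get("name"):
        problems.append("package has no name")
    resources = package.get("resources") or []
    if not resources:
        problems.append("package has no resources")
    for index, resource in enumerate(resources):
        if not resource.get("name"):
            problems.append(f"resource {index} has no name")
        if not (resource.get("url") or resource.get("path")):
            problems.append(f"resource {index} has no url or path")
    return problems


if __name__ == "__main__":
    # Serialize the descriptor the way it would be stored in a JSON file.
    print(json.dumps(example_package, indent=2))
    print(check_package(example_package))
```

A check like this is the kind of thing a contributor might run before opening a pull request with a new package, so that obvious omissions are caught early.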

Some example sources for these raw data forms.

Degree of difficulty and needed skills

Easy
Knowledge of Python
Knowledge of Object Oriented Programming
Knowledge of JSON

Useful skills

Knowledge of Git, continuous development and deployment tools
Knowledge of R and Julia Programming

Involved developer communities

The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.

Mentors

@henrysenyondo
@ethanwhite

Data Retriever: Support for Login/API

Rationale

The Data Retriever is a package manager for publicly accessible data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. A number of data providers require an account with an associated login or API key to access data programmatically. The Data Retriever currently supports the Kaggle API, allowing users to securely install datasets hosted by Kaggle.

Approach

The project's goal is to generalize the handling of data source platforms that require a login or an API key. The first step is to identify more sources of public data, like Kaggle, where a login or API key is required. Users will place the appropriate credentials in a file in their home directory. The Data Retriever will automatically identify the required credential file for each data package and handle the specific login or API request necessary for the server hosting the data. This will enable users to access several data sources with minimum effort.
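
One way the credential lookup described above might work is sketched below. The `~/.<provider>/<provider>.json` layout and the function names are assumptions made for illustration (the layout mirrors the `~/.kaggle/kaggle.json` file, with `username` and `key` entries, used by the Kaggle API client); the project itself may settle on a different design.

```python
import json
from pathlib import Path


def credentials_path(provider, home=None):
    """Build the expected credentials path for a provider.

    The ~/.<provider>/<provider>.json layout is a hypothetical convention.
    """
    home = Path(home) if home is not None else Path.home()
    return home / f".{provider}" / f"{provider}.json"


def load_credentials(provider, required_keys, home=None):
    """Load a provider's credentials and check that the required keys exist."""
    path = credentials_path(provider, home)
    if not path.is_file():
        raise FileNotFoundError(f"No credentials file for {provider!r} at {path}")
    with path.open(encoding="utf-8") as handle:
        credentials = json.load(handle)
    # Fail early with a clear message instead of sending a broken request.
    missing = [key for key in required_keys if key not in credentials]
    if missing:
        raise KeyError(f"{path} is missing keys: {', '.join(missing)}")
    return credentials
```

A data package could then declare which provider and keys it needs, for example `load_credentials("kaggle", ["username", "key"])`, and the provider-specific login or API request would be built from the returned values.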

Degree of difficulty and needed skills

Easy
Knowledge of Python
Knowledge of Object Oriented Programming

Useful skills

Knowledge of Git, continuous development and deployment tools
Knowledge of R and Julia Programming

Involved developer communities

The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.

Mentors

@henrysenyondo
@ethanwhite

Data Retriever: Update the retriever-dashboard

Rationale

The Retriever dashboard is a Django project that helps to track changes in the data. The dashboard downloads the data using the Data Retriever's SQLite engine and checks for differences between the previous version of the data and the current version. The dashboard reports any changes between these two versions and archives the current version.
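
The version comparison can be illustrated with a small sketch. It uses Python's standard `difflib` module as a stand-in; how the dashboard actually computes its diffs is not described here, and the `diff_versions` function and sample data are hypothetical.

```python
import difflib


def diff_versions(previous_text, current_text, name="dataset.csv"):
    """Return unified-diff lines between two versions of a text file."""
    return list(
        difflib.unified_diff(
            previous_text.splitlines(),
            current_text.splitlines(),
            fromfile=f"previous/{name}",
            tofile=f"current/{name}",
            lineterm="",
        )
    )


# Made-up sample data: one value changes between the two versions.
previous = "id,count\n1,10\n2,20\n"
current = "id,count\n1,10\n2,25\n"

changes = diff_versions(previous, current)
print("\n".join(changes) if changes else "no changes")
```

An empty result means the two versions are identical, so nothing needs to be reported for that dataset.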

Approach

The project's goal is to update and improve the current state of the dashboard. This includes adding support for spatial datasets. The current version uses the "IGNORE_LIST" to exclude datasets that are either very large or spatial.

Degree of difficulty and needed skills

Easy
Knowledge of Python and Django
Knowledge of Object Oriented Programming

Useful skills

Knowledge of Git, continuous development and deployment tools
Knowledge of R and Julia Programming

Involved developer communities

The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.

Mentors

@henrysenyondo
@ethanwhite
