
GSoC 2021 Project Ideas


Please ask questions here and tag @ethanwhite or @henrykironde.

Preferred names: Henry, Ethan. Preferred greetings: Hi, Hello, Dear, Thanks, or Thank you, followed by the first name.

Join the chat at https://gitter.im/weecology/retriever

The code of conduct should be your first read.

Data Retriever: Add Data Packages

Rationale

The Data Retriever is a package manager for data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. The Data Retriever handles both tabular and spatial data, as well as compressed versions of these forms, e.g. zip, gz, and tar files.
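
To make this concrete, here is a minimal sketch of installing a dataset through the Retriever's Python interface. The function names follow the documented API of recent retriever releases, and "iris" is used as a small example dataset; check the current documentation if your installed version differs.

```python
# A minimal sketch of using the Data Retriever's Python interface.
# Function names follow recent retriever releases; consult the docs
# if your installed version differs.
import retriever as rt

# Download the raw files for a dataset into the working directory.
rt.download("iris")

# Or install the dataset straight into flat CSV files that are
# ready to analyse.
rt.install_csv("iris")
```

The same operations are available from the command line, e.g. `retriever ls` to list the available data packages and `retriever install csv iris` to install one.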

Approach

The goal of this project is to scale up the number of data packages available in the Data Retriever. The Data Retriever uses the data package specification to identify and process data: packages are defined in JSON, and edge cases are handled with Python. The Data Retriever's public data packages are stored in the Retriever recipes repository. This project aims to add more public datasets as packages to the Data Retriever. Public data is available online across many domains and is hosted in various forms; the Data Retriever supports multiple source formats, including CSV, XML, JSON, SQLite, and geospatial data. Some sources of public data are Kaggle, Google datasets, data.gov, and NEON. For this project, we shall identify many of these data sources and create data packages for them.

The sources listed above are examples of where these raw data forms can be found.
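
To illustrate what adding a data package involves, below is a hedged sketch of a JSON package definition for a hypothetical tabular dataset. The field names mirror the general shape of the scripts in the Retriever recipes repository, but the authoritative schema is defined there; the dataset, URL, and columns below are invented for illustration.

```json
{
  "name": "example-dataset",
  "title": "Example dataset (illustrative only)",
  "description": "A hypothetical tabular dataset used to sketch the package format.",
  "homepage": "https://example.org/data",
  "citation": "Example citation",
  "keywords": ["example", "tabular"],
  "version": "1.0.0",
  "resources": [
    {
      "name": "main_table",
      "url": "https://example.org/data/main_table.csv",
      "schema": {
        "fields": [
          {"name": "site_id", "type": "int"},
          {"name": "species", "type": "char"},
          {"name": "count", "type": "int"}
        ]
      }
    }
  ]
}
```

Adding a package of this kind is mostly a matter of describing where the data lives and what its columns look like; edge cases that the declarative format cannot express are handled with a small amount of Python.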

Degree of difficulty and needed skills

  • Easy
  • Knowledge of Python
  • Knowledge of Object Oriented Programming
  • Knowledge of JSON

Useful skills

  • Knowledge of Git and of continuous integration and deployment tools
  • Knowledge of R and Julia programming

Involved developer communities

The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel. Join the chat at https://gitter.im/weecology/retriever

Mentors

  • @henrysenyondo
  • @ethanwhite

Data Retriever: Support for Login/API

Rationale

The Data Retriever is a package manager for publicly accessible data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. Many data providers now require a login or an API key to access their data. The Data Retriever currently supports the Kaggle API, so users can securely install datasets hosted by Kaggle.
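
For reference, the existing Kaggle support builds on Kaggle's own credential convention: the Kaggle API reads a small JSON file from the user's home directory (~/.kaggle/kaggle.json). The values below are placeholders.

```json
{
  "username": "your-kaggle-username",
  "key": "your-kaggle-api-key"
}
```

Because the credentials sit in a well-known location, the Retriever's Kaggle integration can pick them up without the user re-entering them for each install.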

Approach

The project's goal is to generalize how the Data Retriever handles data source platforms that require a login or an API. The first step is to identify more sources of public data that, like Kaggle, require a login or provide an API. Users should be able to keep the required credentials in their home directory, and the Data Retriever will automatically locate the credential file that a given data package needs. The Data Retriever then builds the login or API request to the server hosting the data. This will let users access many data sources with minimal effort; a rough sketch of this credential handling is shown below.
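
As an illustration of the credential handling described above, here is a hedged Python sketch. The file layout and the load_credentials helper are hypothetical design choices for this illustration, not existing Retriever code; only the Kaggle path follows Kaggle's documented convention.

```python
# Hypothetical sketch of per-provider credential discovery; not part of
# the current Data Retriever code base.
import json
from pathlib import Path

# Map each data provider to the credential file expected in the user's
# home directory. The Kaggle entry follows Kaggle's own convention; the
# other entry is invented for illustration.
CREDENTIAL_FILES = {
    "kaggle": Path.home() / ".kaggle" / "kaggle.json",
    "example-provider": Path.home() / ".retriever" / "example_provider.json",
}

def load_credentials(provider):
    """Return the stored credentials for a provider, or None if missing."""
    path = CREDENTIAL_FILES.get(provider)
    if path is None or not path.exists():
        return None
    with open(path) as fp:
        return json.load(fp)

# A data package could name the provider it needs; the Retriever would then
# look up the matching credentials before building the login or API request
# to the server hosting the data.
credentials = load_credentials("kaggle")
if credentials is None:
    print("No Kaggle credentials found in the home directory.")
```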

Degree of difficulty and needed skills

  • Easy
  • Knowledge of Python
  • Knowledge of Object Oriented Programming

Useful skills

  • Knowledge of Git and of continuous integration and deployment tools
  • Knowledge of R and Julia programming

Involved developer communities

The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel. Join the chat at https://gitter.im/weecology/retriever

Mentors

  • @henrysenyondo
  • @ethanwhite