GSoC 2021 Project Ideas
Please ask questions here. Tag @ethanwhite, @henrykironde
Preferred names (Henry, Ethan), Preferred_greeting (Hi|Hello|Dear|Thanks|Thank you [First_name])
The code of conduct should be your first read.
The Data Retriever is a package manager for data. It automatically finds, downloads, and pre-processes publicly available datasets, and stores them in a ready-to-analyse state. The Data Retriever handles both tabular and spatial data, including compressed versions of these data forms such as zip, gz, and tar files.
The goal of the project is to scale up the number of data packages available in the Data Retriever. The Data Retriever uses the data packaging specification to identify and process data. Data packages are defined in JSON, and edge cases are handled in Python. The public data packages are stored in the Retriever recipes repository. This project aims to add more public datasets as packages. Public data is available online across many domains and hosted in various forms. The Data Retriever supports multiple source data formats, including CSV, XML, JSON, SQLite, and geospatial data. Some sources of public data are Kaggle, Google datasets, data.gov, and the National Ecological Observatory Network. For this project, we will identify a number of these data sources and create data packages for their data.
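To give a feel for what a JSON-defined data package looks like, here is a minimal sketch in Python. The field names, the example URL, and the `check_package` helper are all assumptions made for illustration; they follow the general shape of a data package descriptor and may not match the exact schema used in the Retriever recipes repository, so check existing recipes before writing a real one.

```python
import json

# Hypothetical, simplified descriptor. Field names and values are
# illustrative only and are not taken from an actual Retriever recipe.
EXAMPLE_PACKAGE = {
    "name": "example-dataset",
    "title": "Example dataset",
    "resources": [
        {
            "name": "observations",
            "url": "https://example.com/observations.csv",
            "schema": {
                "fields": [
                    {"name": "site_id", "type": "int"},
                    {"name": "species", "type": "char"},
                ]
            },
        }
    ],
}


def check_package(package):
    """Return a list of problems found in a descriptor (empty if none)."""
    problems = []
    # Top-level keys this sketch treats as required.
    for key in ("name", "resources"):
        if key not in package:
            problems.append(f"missing top-level key: {key}")
    # Each resource needs a name and a location for the data.
    for index, resource in enumerate(package.get("resources", [])):
        for key in ("name", "url"):
            if key not in resource:
                problems.append(f"resource {index} is missing key: {key}")
    return problems


if __name__ == "__main__":
    print(json.dumps(EXAMPLE_PACKAGE, indent=2))
    print(check_package(EXAMPLE_PACKAGE))
```

A small check like this can catch incomplete descriptors before a package is submitted as a pull request.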
- The data.gov sample data
- https://www.neonscience.org/data-collection
- https://www.kaggle.com/datasets
- Easy
- Knowledge of Python
- Knowledge of Object Oriented Programming
- Knowledge of JSON
- Knowledge of Git, continuous development and deployment tools
- Knowledge of R and Julia Programming
The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.
- @henrysenyondo
- @ethanwhite
The Data Retriever is a package manager for publicly accessible data. It automatically finds, downloads, and pre-processes publicly available datasets, and stores them in a ready-to-analyse state. A number of data providers require an account with an associated login or API key to access data programmatically. The Data Retriever currently supports the Kaggle API, allowing users to securely install datasets hosted by Kaggle.
The project's goal is to generalize the handling of data source platforms that require a login or an API key. The first step is to identify more sources of public data, like Kaggle, where a login or API key is required. Users will place the appropriate credentials in a file in their home directory. The Data Retriever will automatically identify the credential file required for each data package and handle the specific login or API request needed by the server hosting the data. This will enable users to access several data sources with minimal effort.
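The credential lookup described above could be sketched roughly as follows. The `~/.data_credentials/<provider>.json` layout, the function names, and the key names are hypothetical and chosen only for illustration; they are not the Data Retriever's actual scheme. The pattern mirrors Kaggle's own client, which reads a JSON file from the user's home directory.

```python
import json
from pathlib import Path


def credentials_path(provider, home=None):
    """Build the path to a provider's credentials file.

    Assumed layout (hypothetical): <home>/.data_credentials/<provider>.json
    """
    base = Path(home) if home is not None else Path.home()
    return base / ".data_credentials" / f"{provider}.json"


def load_credentials(provider, required_keys, home=None):
    """Load a provider's credentials and verify the required keys exist."""
    path = credentials_path(provider, home)
    if not path.is_file():
        raise FileNotFoundError(
            f"No credentials file for {provider!r} at {path}"
        )
    with path.open(encoding="utf-8") as handle:
        credentials = json.load(handle)
    # Fail early with a clear message instead of at request time.
    missing = [key for key in required_keys if key not in credentials]
    if missing:
        raise KeyError(f"{path} is missing keys: {', '.join(missing)}")
    return credentials
```

Each data package could then declare which provider and which keys it needs, so the same loader serves every platform. The optional `home` argument exists mainly to make the lookup easy to test against a temporary directory.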
- Easy
- Knowledge of Python
- Knowledge of Object Oriented Programming
- Knowledge of Git, continuous development and deployment tools
- Knowledge of R and Julia Programming
The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.
- @henrysenyondo
- @ethanwhite
The Retriever dashboard is a Django project that tracks changes in the data. The dashboard downloads the data using the Data Retriever's SQLite engine and checks for differences between the previous version of the data and the current version. It reports any changes (the diff) between the two versions and archives the current version.
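As a rough illustration of the version comparison, the sketch below diffs two text exports of a dataset using Python's standard `difflib` module. This is not the dashboard's actual implementation; the function name, the sample data, and the file labels are assumptions made for the example.

```python
import difflib


def diff_versions(previous_text, current_text, name="dataset.csv"):
    """Return unified-diff lines between two text exports of a dataset."""
    return list(
        difflib.unified_diff(
            previous_text.splitlines(),
            current_text.splitlines(),
            fromfile=f"previous/{name}",
            tofile=f"current/{name}",
            lineterm="",  # inputs have no trailing newlines after splitlines()
        )
    )


# Made-up sample data: one row changed and one row added.
previous = "id,species\n1,oak\n2,pine\n"
current = "id,species\n1,oak\n2,spruce\n3,birch\n"

for line in diff_versions(previous, current):
    print(line)
```

An empty result means the two versions are identical, so only a non-empty diff would need to be reported.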
The project's goal is to update and improve the current state of the dashboard. This includes adding support for spatial datasets. The current version places datasets that are either very large or spatial on the "IGNORE_LIST".
- Easy
- Knowledge of Python and Django
- Knowledge of Object Oriented Programming
- Knowledge of Git, continuous development and deployment tools
- Knowledge of R and Julia Programming
The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.
- @henrysenyondo
- @ethanwhite