GSoC 2021 Project Ideas
Please ask questions here. Tag @ethanwhite, @henrykironde.
Preferred names: Henry, Ethan.
Preferred greeting: Hi | Hello | Dear | Thanks | Thank you [First_name].
The code of conduct should be your first read.
The Data Retriever is a package manager for data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. The Data Retriever handles both tabular and spatial data, as well as compressed versions of these forms (e.g., zip, gz, and tar files).
The goal of this project is to scale up the number of data packages available in the Data Retriever. The Data Retriever uses the data packaging specification to identify and process data: packages are defined in JSON, and edge cases are handled in Python. The Data Retriever's public data packages are stored in the Retriever recipes repository. This project aims to add more public datasets as packages. Public data is available online across many domains and hosted in various forms; the Data Retriever supports multiple source formats, including CSV, XML, JSON, SQLite, and geospatial/spatial data. Some sources of public data are Kaggle, Google Datasets, data.gov, and the NEON datasets. For this project, we will identify many of these data sources and create data packages for them.
- The data.gov sample data
- https://www.neonscience.org/data-collection
- https://www.kaggle.com/datasets
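As a sketch of what adding a dataset involves, the snippet below builds a minimal JSON recipe for a hypothetical CSV dataset. The field names are illustrative only and follow the general shape of data-package definitions; the authoritative schema should be checked against the retriever-recipes repository.

```python
import json

# A minimal, hypothetical data-package definition for a CSV dataset.
# Field names are illustrative; consult the retriever-recipes repository
# for the authoritative schema.
package = {
    "name": "example-dataset",
    "title": "Example public dataset",
    "description": "A small CSV dataset used to illustrate a package recipe.",
    "homepage": "https://example.org/data",
    "resources": [
        {
            "name": "observations",
            "url": "https://example.org/data/observations.csv",
        }
    ],
}

# Recipes are stored as JSON files, one per dataset.
recipe_json = json.dumps(package, indent=4)
print(recipe_json)
```

Most of the work in this project is of this shape: locating a public dataset, describing its tables and URLs in JSON, and adding Python code only where the source needs special handling.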
- Easy
- Knowledge of Python
- Knowledge of Object Oriented Programming
- Knowledge of JSON
- Knowledge of Git and continuous integration/deployment tools
- Knowledge of R and Julia Programming
The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.
- @henrysenyondo
- @ethanwhite
The Data Retriever is a package manager for publicly accessible data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. Many data providers now require a login or an API key to access their data. The Data Retriever currently supports the Kaggle API, so users can securely install datasets hosted by Kaggle.
The project's goal is to generalize the handling of data sources that require a login or an API. The first step is to identify more sources of public data that, like Kaggle, require a login or an API key. Users should be able to keep the required credentials in their home directory; the Data Retriever will automatically identify the distinct credential files required for each data package and create a login or API request to the server hosting the data. This will let users access several data sources with minimal effort.
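To make the credential-lookup idea concrete: Kaggle's API reads a token from `~/.kaggle/kaggle.json`, and a generalized version could check a per-provider file in the home directory before issuing a request. The helper below is a hypothetical sketch of that lookup, not the Data Retriever's actual implementation, and the file-layout convention it assumes is an assumption.

```python
import json
from pathlib import Path

def load_credentials(provider, home=None):
    """Look up credentials for a data provider in the user's home directory.

    Follows the Kaggle convention of a per-provider JSON file, e.g.
    ~/.kaggle/kaggle.json. This is a hypothetical helper for illustration,
    not part of the Data Retriever's current API.
    """
    home = Path(home) if home is not None else Path.home()
    cred_file = home / f".{provider}" / f"{provider}.json"
    if not cred_file.exists():
        raise FileNotFoundError(
            f"No credentials found for {provider!r}; expected {cred_file}"
        )
    return json.loads(cred_file.read_text())
```

With a lookup like this, each new login-based source only needs to declare which provider it belongs to, and the Retriever can attach the right credentials to the request automatically.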
- Easy
- Knowledge of Python
- Knowledge of Object Oriented Programming
- Knowledge of Git and continuous integration/deployment tools
- Knowledge of R and Julia Programming
The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.
- @henrysenyondo
- @ethanwhite