From 51be2e3cdf8a3080bcecaef95ae973199170c96c Mon Sep 17 00:00:00 2001 From: imbilalbutt Date: Mon, 15 Jan 2024 16:20:15 +0100 Subject: [PATCH] project work 7 --- LICENSE | 21 ++++++++++++++++++ README.md | 66 ++++++++++++++++++++++++++++++------------------------- 2 files changed, 57 insertions(+), 30 deletions(-) create mode 100644 LICENSE diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000000..a22c54d71e --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2023 Bilal Ahmad Butt + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. \ No newline at end of file diff --git a/README.md b/README.md index 234fed9f53..29409fc3e7 100644 --- a/README.md +++ b/README.md @@ -1,42 +1,48 @@ -# Methods of Advanced Data Engineering Template Project +# `Trends of crop yield in the Netherlands with respect to emission in water` -This template project provides some structure for your open data project in the MADE module at FAU. 
-This repository contains (a) a data science project that is developed by the student over the course of the semester, and (b) the exercises that are submitted over the course of the semester. -Before you begin, make sure you have [Python](https://www.python.org/) and [Jayvee](https://github.com/jvalue/jayvee) installed. We will work with [Jupyter notebooks](https://jupyter.org/). The easiest way to do so is to set up [VSCode](https://code.visualstudio.com/) with the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter). +This code repository contains the exercises and a project developed during the course **Methods of Advanced Data Engineering** in the winter semester 2023/24 of the MSc Data Science at FAU (Friedrich-Alexander-Universität Erlangen-Nürnberg). It publishes automated data pipelines that download data from a specified URL, transform it, and load it into a database, using Python and Jayvee (a tool developed at FAU). +**[Exercises](https://github.com/imbilalbutt/made-template-ws2324/tree/main/exercises):** Each exercise uses a different URL and implements different tasks involved in a data pipeline. +**[Project](https://github.com/imbilalbutt/made-template-ws2324/tree/main/project):** A data engineering project to understand the relationship between crops and water in the Netherlands, a European country, from 2000 to 2015. -## Project Work -Your data engineering project will run alongside lectures during the semester. We will ask you to regularly submit project work as milestones so you can reasonably pace your work. All project work submissions **must** be placed in the `project` folder. +## Description of project -### Exporting a Jupyter Notebook -Jupyter Notebooks can be exported using `nbconvert` (`pip install nbconvert`). 
For example, to export the example notebook to html: `jupyter nbconvert --to html examples/final-report-example.ipynb --embed-images --output final-report.html` +This project examines the trend and relationship between the yield of different vegetables and the compounds present in water, which may be emitted by sources such as industrial, chemical manufacturing, and pharmaceutical manufacturing waste; these are covered in [report.ipynb](https://github.com/imbilalbutt/made-template-ws2324/blob/main/project/report.ipynb). The project applies advanced data engineering practices (e.g. an automated data pipeline and test cases). Note that these sources emit not only heavy-element compounds, which are strictly damaging to crop yield, but also nutrients such as phosphorus- and nitrogen-based compounds. Although nutrients are good for crops, an excess of them can also reduce yield. Thus, the hypothesis analysed and confirmed in the report is: **With a decrease in heavy elements and nutrients, the yield of different vegetables will increase.** +> **Important:** I recorded the [PowerPoint presentation](https://github.com/imbilalbutt/made-template-ws2324/blob/main/project/pipeline.py) in this [video](https://github.com/imbilalbutt/made-template-ws2324/blob/main/project/pipeline.py). -## Exercises -During the semester you will need to complete exercises, sometimes using [Python](https://www.python.org/), sometimes using [Jayvee](https://github.com/jvalue/jayvee). You **must** place your submission in the `exercises` folder in your repository and name them according to their number from one to five: `exercise.`. +## Datasets -In regular intervalls, exercises will be given as homework to complete during the semester. We will divide you into two groups, one completing an exercise in Jayvee, the other in Python, switching each exercise. 
Details and deadlines will be discussed in the lecture, also see the [course schedule](https://made.uni1.de/). At the end of the semester, you will therefore have the following files in your repository: +For this project, two datasets from CBS Open Data StatLine were used. StatLine is the statistics database of the Netherlands and offers a wealth of data on the Dutch economy and society. -1. `./exercises/exercise1.jv` or `./exercises/exercise1.py` -2. `./exercises/exercise2.jv` or `./exercises/exercise2.py` -3. `./exercises/exercise3.jv` or `./exercises/exercise3.py` -4. `./exercises/exercise4.jv` or `./exercises/exercise4.py` -5. `./exercises/exercise5.jv` or `./exercises/exercise5.py` +The following datasets were used. -### Exercise Feedback -We provide automated exercise feedback using a GitHub action (that is defined in `.github/workflows/exercise-feedback.yml`). +[1]: [Dataset 1: Vegetables: Yield and cultivated area per kind (type) of vegetable](https://opendata.cbs.nl/statline/#/CBS/en/dataset/37738ENG/table) -To view your exercise feedback, navigate to Actions -> Exercise Feedback in your repository. +[2]: [Dataset 2: Environmental accounts; emissions to water](https://opendata.cbs.nl/statline/#/CBS/en/dataset/83605ENG/table?ts=1698675109480) -The exercise feedback is executed whenever you make a change in files in the `exercise` folder and push your local changes to the repository on GitHub. To see the feedback, open the latest GitHub Action run, open the `exercise-feedback` job and `Exercise Feedback` step. You should see command line output that contains output like this: +The analysis covers four years to understand the trend and relationship between crop yield and the quantity of different elemental compounds in water: + + 1. 2000 + 2. 2005 + 3. 2010 + 4. 2015 -```sh -Found exercises/exercise1.jv, executing model... -Found output file airports.sqlite, grading... 
-Grading Exercise 1 - Overall points 17 of 17 - --- - By category: - Shape: 4 of 4 - Types: 13 of 13 -``` +## Context + +This repository is the result of my participation in the course [Advanced Methods of Software Engineering](https://oss.cs.fau.de/teaching/specific/amse/) offered by the [Professorship of Open-Source Software](https://oss.cs.fau.de/) at FAU. The task was to build a data engineering project that takes at least two publicly available data sources and processes them with an automated data pipeline in order to report findings from the result. + +## Tools and requirements + + - attrs==22.2.0 + - greenlet==2.0.2 + - iniconfig==2.0.0 + - numpy==1.24.2 + - packaging==23.0 + - pandas==1.5.3 + - pluggy==1.0.0 + - python-dateutil==2.8.2 + - pytz==2022.7.1 + - six==1.16.0 + - SQLAlchemy==1.4.46 + - typing_extensions==4.5.0 \ No newline at end of file
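
The README describes a pipeline that downloads data from a URL, transforms it, and loads it into a database, but this patch does not show `pipeline.py` itself. The following is a minimal sketch of that download, transform, load shape under stated assumptions, not the project's actual implementation. The function names, the `year` column, the `;` separator, and the use of SQLite through the standard-library `sqlite3` module (rather than the pinned SQLAlchemy) are all hypothetical choices made to keep the example self-contained; only the four analysed years come from the README.

```python
import sqlite3
from contextlib import closing
from io import StringIO

import pandas as pd

# Years analysed in the project (taken from the README).
YEARS = (2000, 2005, 2010, 2015)


def extract(source) -> pd.DataFrame:
    """Read a CSV from a URL, a file path, or a file-like object."""
    # The ";" separator is an assumption, not taken from the project.
    return pd.read_csv(source, sep=";")


def transform(df: pd.DataFrame, year_column: str = "year") -> pd.DataFrame:
    """Drop incomplete rows and keep only the analysed years."""
    # "year" is a hypothetical column name used for illustration.
    df = df.dropna()
    return df[df[year_column].isin(YEARS)].reset_index(drop=True)


def load(df: pd.DataFrame, db_path: str, table: str) -> None:
    """Write the frame to a SQLite table, replacing any existing one."""
    with closing(sqlite3.connect(db_path)) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)
        conn.commit()


def run_pipeline(source, db_path: str, table: str) -> int:
    """Run extract, transform, and load; return the number of rows stored."""
    df = transform(extract(source))
    load(df, db_path, table)
    return len(df)
```

Because `pd.read_csv` accepts a URL as well as a file-like object, the same `run_pipeline` call can be pointed at a remote CSV or at a `StringIO` fixture in a test case.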