week-05_ETL_detailed

ETL: Extract Transform Load


1. ETL and ELT in Pictures

1.1 ETL in a Picture

1.2 ETL vs ELT

1.3 ETL vs ELT


2. What is ETL and ELT?

2.1 ETL

Extract, transform, and load (ETL) is the process 
of combining data from multiple sources into a 
large, central repository called a data warehouse. 

ETL uses a set of business rules to clean and organize 
raw data and prepare it for storage, data analytics, 
and machine learning (ML). You can address specific 
business intelligence needs through data analytics 
(such as predicting the outcome of business decisions, 
generating reports and dashboards, reducing operational 
inefficiency, and more).

source: https://aws.amazon.com/what-is/etl/

2.2 ELT

ELT, which stands for “Extract, Load, Transform,” 
is another type of data integration process, similar 
to its counterpart ETL, “Extract, Transform, Load”. 
This process moves raw data from a source system to 
a destination resource, such as a data warehouse. 
The difference is the order of the steps: ELT loads 
the raw data first and transforms it inside the 
destination, whereas ETL transforms it before loading. 
ELT has gained adoption more recently with the 
transition to cloud environments.

3. The ETL Process

  • The most underestimated process in DW development
  • The most time-consuming process in DW development
  • Up to 80% of the development time is spent on ETL!

3.1 Extract

  • Extract relevant data
  • Extraction can be from many data sources
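
A minimal sketch of the extract step, assuming two hypothetical sources (a CSV export and a JSON feed) that are held in strings so the example runs without any files. The names `ORDERS_CSV`, `ORDERS_JSON`, and `extract_all` are made up for illustration and are not part of this module's sample programs.

```python
import csv
import io
import json

# Hypothetical source 1: a CSV export from an orders system.
ORDERS_CSV = """order_id,customer,amount,internal_note
1,alice,10.50,checked
2,bob,20.00,
"""

# Hypothetical source 2: a JSON feed from a web shop.
ORDERS_JSON = '[{"order_id": 3, "customer": "carol", "amount": 7.25}]'


def extract_csv(text):
    """Read CSV text into a list of dicts (all values are still strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Read a JSON array into a list of dicts."""
    return json.loads(text)


def extract_all():
    """Combine every source and keep only the relevant fields."""
    relevant = ("order_id", "customer", "amount")
    rows = extract_csv(ORDERS_CSV) + extract_json(ORDERS_JSON)
    return [{key: row[key] for key in relevant} for row in rows]


raw_rows = extract_all()
print(raw_rows)
```

Note that the extract step does no cleaning: the CSV values are still strings while the JSON values are already numbers. Reconciling them is left to the transform step.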

3.2 Transform

  • Transform data to DW format
  • Build DW keys, etc.
  • Cleansing of data
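
The three bullets above can be sketched in plain Python as follows. This is only an illustration under assumed inputs: the row layout (`customer`, `amount`) and the rule "drop rows with an empty name or a non-numeric amount" are hypothetical business rules, and the DW key is a simple surrogate integer assigned per distinct customer.

```python
def transform(raw_rows):
    """Cleanse raw rows and replace customer names with surrogate DW keys."""
    customer_keys = {}  # natural key (cleaned name) -> surrogate key (int)
    facts = []
    for row in raw_rows:
        # Cleansing: normalize whitespace and case of the customer name.
        name = str(row.get("customer") or "").strip().lower()
        try:
            # Convert to the DW format: amount as a float with two decimals.
            amount = round(float(row.get("amount")), 2)
        except (TypeError, ValueError):
            continue  # cleansing: skip rows whose amount is not a number
        if not name:
            continue  # cleansing: skip rows without a customer
        # Build the DW key: reuse the existing key or assign the next one.
        key = customer_keys.setdefault(name, len(customer_keys) + 1)
        facts.append({"customer_key": key, "amount": amount})
    return customer_keys, facts


sample = [
    {"customer": " Alice ", "amount": "10.50"},
    {"customer": "BOB", "amount": "n/a"},
    {"customer": "alice", "amount": 4},
    {"customer": "", "amount": "1"},
    {"customer": "Carol", "amount": "7.25"},
]
customer_keys, facts = transform(sample)
print(customer_keys)
print(facts)
```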

3.3 Load

  • Load data into DW
  • Build aggregates, etc.
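
A minimal sketch of the load step, using an in-memory SQLite database from the Python standard library as a stand-in for a real data warehouse. The table names `fact_sales` and `agg_sales_by_customer` and the input rows are assumptions chosen for illustration.

```python
import sqlite3


def load(conn, facts):
    """Insert fact rows, then rebuild a small aggregate table from them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(customer_key INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO fact_sales (customer_key, amount) VALUES (?, ?)",
        [(f["customer_key"], f["amount"]) for f in facts],
    )
    # Build the aggregate: one row per customer with count and total.
    conn.execute("DROP TABLE IF EXISTS agg_sales_by_customer")
    conn.execute(
        "CREATE TABLE agg_sales_by_customer AS "
        "SELECT customer_key, COUNT(*) AS order_count, "
        "SUM(amount) AS total_amount "
        "FROM fact_sales GROUP BY customer_key"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [
    {"customer_key": 1, "amount": 10.5},
    {"customer_key": 1, "amount": 4.0},
    {"customer_key": 2, "amount": 7.25},
])
aggregates = conn.execute(
    "SELECT customer_key, order_count, total_amount "
    "FROM agg_sales_by_customer ORDER BY customer_key"
).fetchall()
print(aggregates)
```

Precomputing the aggregate at load time trades extra storage for faster reporting queries, which is one common reason the load step does more than a plain insert.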

4. Sample ETL Program

4.1.0 Create Parquet File by PySpark (as a data source)

4.1.1 Create Parquet File by PySpark (log file)

4.2 ETL: 1. extract, 2. transform, and 3. load
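
As a rough end-to-end illustration of the ETL order (extract, then transform, then load), here is a sketch that uses only the Python standard library. It is not the PySpark/Parquet program this section refers to: a hard-coded list of log lines stands in for the Parquet source, an in-memory SQLite database stands in for the warehouse, and all names (`RAW_LOG`, `log_events`, `run_etl`) are hypothetical.

```python
import sqlite3

# Hypothetical raw log lines standing in for the Parquet data source.
RAW_LOG = [
    "2024-01-01 info login alice",
    "2024-01-01 ERROR payment bob",
    "malformed line",
    "2024-01-02 error login alice",
]


def extract():
    """1. Extract: read the raw lines from the source."""
    return list(RAW_LOG)


def transform(lines):
    """2. Transform: parse, validate, and normalize BEFORE loading."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # drop lines that do not match the expected layout
        day, level, event, username = parts
        rows.append((day, level.upper(), event, username))
    return rows


def load(conn, rows):
    """3. Load: write only the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE log_events "
        "(day TEXT, level TEXT, event TEXT, username TEXT)"
    )
    conn.executemany("INSERT INTO log_events VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


conn = run_etl()
error_count = conn.execute(
    "SELECT COUNT(*) FROM log_events WHERE level = 'ERROR'"
).fetchone()[0]
print(error_count)
```

Because the transform runs before the load, the warehouse never sees the malformed line; the cost is that the raw data is no longer available there if the rules change later.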


5. Sample ELT Program

5.1.0 Create Parquet File by PySpark (as a data source)

5.1.1 Create Parquet File by PySpark (log file)

5.2 ELT: 1. extract, 2. load, and 3. transform
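
As a rough illustration of the ELT order (extract, then load, then transform), here is a sketch that uses only the Python standard library. It is not the PySpark/Parquet program this section refers to: a hard-coded list of rows stands in for the Parquet source, an in-memory SQLite database stands in for the warehouse, and the names `RAW_ORDERS`, `raw_orders`, and `orders_clean` are hypothetical.

```python
import sqlite3

# Hypothetical raw rows standing in for the Parquet data source.
RAW_ORDERS = [
    ("1", " Alice ", "10.50"),
    ("2", "BOB", "n/a"),
    ("3", "alice", "4"),
    ("4", "", "1"),
]


def extract():
    """1. Extract: read the raw rows from the source."""
    return list(RAW_ORDERS)


def load_raw(conn, rows):
    """2. Load: store the rows untouched, as text, in a staging table."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def transform_in_warehouse(conn):
    """3. Transform: clean the data with SQL inside the destination."""
    conn.execute(
        "CREATE TABLE orders_clean AS "
        "SELECT CAST(order_id AS INTEGER) AS order_id, "
        "LOWER(TRIM(customer)) AS customer, "
        "CAST(amount AS REAL) AS amount "
        "FROM raw_orders "
        "WHERE TRIM(customer) <> '' AND amount GLOB '[0-9]*'"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_raw(conn, extract())
transform_in_warehouse(conn)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
clean_rows = conn.execute(
    "SELECT order_id, customer, amount FROM orders_clean ORDER BY order_id"
).fetchall()
print(raw_count, clean_rows)
```

Unlike the ETL order, all raw rows (including the bad ones) remain in `raw_orders`, so the transformation can be rerun with different rules without going back to the source.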



