Extract, transform, and load (ETL) is the process
of combining data from multiple sources into a
large, central repository called a data warehouse.
ETL uses a set of business rules to clean and organize
raw data and prepare it for storage, data analytics,
and machine learning (ML). You can address specific
business intelligence needs through data analytics,
such as predicting the outcome of business decisions,
generating reports and dashboards, and reducing
operational inefficiency.
source: https://aws.amazon.com/what-is/etl/
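The ETL flow described above can be sketched with a minimal, self-contained example. This is an illustrative assumption, not a real pipeline: the two "source systems", the field names, and the in-memory SQLite database standing in for the data warehouse are all made up, and the business rules are just trim-and-lowercase.

```python
import sqlite3

# Extract: raw records from two hypothetical source systems (assumed data).
crm_rows = [("Alice", " alice@example.com "), ("Bob", "BOB@EXAMPLE.COM")]
web_rows = [("Carol", "carol@example.com"), ("Bob", "bob@example.com")]

def transform(rows):
    # Apply simple business rules BEFORE loading: trim whitespace,
    # normalize e-mail addresses to lowercase.
    return [(name, email.strip().lower()) for name, email in rows]

# Load: write the cleaned, combined data into a central repository
# (an in-memory SQLite database stands in for the data warehouse).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (name TEXT, email TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    transform(crm_rows) + transform(web_rows),
)

# Because the data was cleaned before loading, duplicates across
# sources can be resolved with a plain DISTINCT in the warehouse.
distinct = warehouse.execute(
    "SELECT COUNT(DISTINCT email) FROM customers"
).fetchone()[0]
print(distinct)  # Bob appears in both sources -> 3 distinct e-mails
```

The defining trait of ETL shows up in the ordering: `transform()` runs on the way in, so only cleaned data ever reaches the warehouse.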
ELT, which stands for “Extract, Load, Transform,”
is another data integration process, the counterpart
of ETL, “Extract, Transform, Load”. It also moves raw
data from a source system to a destination resource,
such as a data warehouse, but it reverses the last two
steps: the raw data is loaded first and then transformed
inside the destination. This approach to data
pre-processing has gained adoption more recently with
the transition to cloud environments, whose warehouses
have the compute capacity to run transformations themselves.
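To contrast with ETL, here is a minimal ELT sketch under the same assumptions (sample rows, table names, and an in-memory SQLite database standing in for a cloud warehouse are all illustrative): the raw data lands unchanged in a staging table, and the transformation then runs as SQL inside the destination engine.

```python
import sqlite3

# Extract: raw, uncleaned records from an assumed source system.
raw_orders = [("  widget ", "12.50"), ("GADGET", "7.00"), ("  widget ", "3.25")]

dwh = sqlite3.connect(":memory:")

# Load FIRST: land the raw data as-is in a staging table.
dwh.execute("CREATE TABLE staging_orders (product TEXT, amount TEXT)")
dwh.executemany("INSERT INTO staging_orders VALUES (?, ?)", raw_orders)

# Transform LAST: the destination engine does the cleaning and typing,
# producing a curated table from the staged raw data.
dwh.execute("""
    CREATE TABLE orders AS
    SELECT lower(trim(product)) AS product,
           CAST(amount AS REAL)  AS amount
    FROM staging_orders
""")

total = dwh.execute(
    "SELECT product, SUM(amount) FROM orders GROUP BY product ORDER BY product"
).fetchall()
print(total)  # [('gadget', 7.0), ('widget', 15.75)]
```

Keeping the untouched staging table around is one practical reason for ELT: if a business rule changes, the transformation can be re-run against the original raw data without re-extracting from the source.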
- The most underestimated process in DW development
- The most time-consuming process in DW development
  - Up to 80% of the development time is spent on ETL!
- Extract relevant data
  - Extraction can be from many data sources
- Transform data to DW format
  - Build DW keys, etc.
  - Cleansing of data
- Load data into DW
  - Build aggregates, etc.
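The DW-specific sub-steps above (building DW keys, loading, building aggregates) can be sketched as follows. This is a minimal illustration, assuming already-cleansed sample rows, a simple surrogate-key scheme, and an in-memory SQLite database as the warehouse:

```python
import sqlite3

# Assumed cleansed source rows: (customer natural key, sale amount).
sales = [("C-7", 100.0), ("C-2", 40.0), ("C-7", 60.0)]

dw = sqlite3.connect(":memory:")

# Build DW keys: a dimension table assigns each business (natural) key
# a surrogate key, which fact rows reference instead of the source key.
dw.execute("""CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
    natural_key TEXT UNIQUE)""")
dw.execute("""CREATE TABLE fact_sales (
    customer_sk INTEGER REFERENCES dim_customer(customer_sk),
    amount REAL)""")

# Load data into DW: look up (or create) the surrogate key per row.
for natural_key, amount in sales:
    dw.execute(
        "INSERT OR IGNORE INTO dim_customer (natural_key) VALUES (?)",
        (natural_key,),
    )
    sk = dw.execute(
        "SELECT customer_sk FROM dim_customer WHERE natural_key = ?",
        (natural_key,),
    ).fetchone()[0]
    dw.execute("INSERT INTO fact_sales VALUES (?, ?)", (sk, amount))

# Build aggregates: a summary table precomputed from the fact table.
dw.execute("""CREATE TABLE agg_sales_by_customer AS
    SELECT customer_sk, SUM(amount) AS total
    FROM fact_sales GROUP BY customer_sk""")

rows = dw.execute(
    "SELECT customer_sk, total FROM agg_sales_by_customer ORDER BY customer_sk"
).fetchall()
print(rows)  # [(1, 160.0), (2, 40.0)]
```

The surrogate key decouples the warehouse from source-system identifiers, and the precomputed aggregate table is what later reports and dashboards would query.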
4.1.0 Create Parquet File by PySpark (as a data source)
4.1.1 Create Parquet File by PySpark (log file)
4.2 ETL: 1. extract, 2. transform, and 3. load
5.1.0 Create Parquet File by PySpark (as a data source)
5.1.1 Create Parquet File by PySpark (log file)
5.2 ELT: 1. extract, 2. load, and 3. transform