
Understanding the ETL process

Extracting and processing data from Wikipedia dump files containing XML-formatted documents can be a time-consuming task, especially for very large Wikipedias. Fortunately, we can use multiprocessing to speed up this workflow, even on a single computer, as long as it has a multi-core CPU.

The following figure illustrates the solution implemented in WikiDAT to achieve this.

[Figure: ETL workflow in WikiDAT]

We can configure up to two nested levels of multiprocessing:

  1. The first (lower) multiprocessing level enables parallel processing of individual elements extracted from dump files (e.g. revision, page or logitem). This is possible for any dump file containing XML-formatted documents. It corresponds to the orange-shaded areas in the graph.

  2. The second (higher) multiprocessing level allows parallel processing of multiple dump files. In turn, each file can use nested multiprocessing to work on the extracted data elements (the lower level). It corresponds to the blue-line rectangles in the graph. Important: this is only feasible for dump files split into several chunks. A minimal sketch of this two-level scheme is shown right after this list.
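
To make this two-level scheme more concrete, here is a minimal Python sketch. It is not WikiDAT's actual code: the chunk file names and the parse_element/iter_elements helpers are hypothetical placeholders. It simply shows one non-daemonic process per dump-file chunk (higher level), each spawning a small pool of element workers (lower level).

import multiprocessing as mp

def parse_element(raw_element):
    # Placeholder for per-element work (e.g. parsing one <revision> document).
    return len(raw_element)

def iter_elements(chunk_path):
    # Placeholder generator yielding raw XML elements from one chunk file.
    yield from ("<revision>...</revision>",) * 10

def process_chunk(chunk_path, n_workers):
    # Lower level: parallel processing of individual elements within one chunk.
    with mp.Pool(processes=n_workers) as pool:
        results = pool.map(parse_element, iter_elements(chunk_path))
    print(chunk_path, "->", len(results), "elements processed")

if __name__ == "__main__":
    chunks = ["eswiki-chunk1.xml", "eswiki-chunk2.xml"]  # hypothetical names
    # Higher level: one process per dump-file chunk. These are regular
    # (non-daemonic) processes, so each one can create its own worker pool.
    procs = [mp.Process(target=process_chunk, args=(chunk, 3)) for chunk in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()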

Therefore, an Extract-Transform-Load process in WikiDAT refers to the higher multiprocessing level. To date, this is only possible for very large Wikipedias whose dump files have been split into several chunks. WikiDAT will download (and verify the integrity of) all dump files of the specified type for a given language. Then, it will proceed with the ETL process according to the options specified in the configuration file.

However, the lower multiprocessing level can be activated for any dump file containing XML documents (stub-meta-history, rev-page-history or pages-logging files). As long as we have a multi-core CPU in our system, WikiDAT will spawn the number of subprocesses indicated in the configuration file for that type of ETL process.

For example, to extract the full revision history dump file of the Spanish Wikipedia with 2 ETL processes, 1 worker for page elements and 3 workers for revision elements, we edit the options in the configuration file like this:

[ETL:RevHistory]
# Some other options first
etl_lines=2
page_fan=1
rev_fan=3
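
Assuming the configuration file is a standard INI file (the name config.ini below is hypothetical), these options map to Python values as in the following sketch using the standard configparser module; this is only an illustration, not WikiDAT's own loading code.

from configparser import ConfigParser

config = ConfigParser()
config.read("config.ini")                # hypothetical file name

section = config["ETL:RevHistory"]
etl_lines = section.getint("etl_lines")  # number of parallel ETL processes
page_fan = section.getint("page_fan")    # workers for page elements
rev_fan = section.getint("rev_fan")      # workers for revision elements
print(etl_lines, page_fan, rev_fan)      # -> 2 1 3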

Finally, an important remark about the base_ports and control_ports parameters in each ETL section. Internally, WikiDAT uses the ZMQ messaging library to implement fast data pipelines among subprocesses. Each pipeline must use a different network port to work correctly. In practice, this means the user must always specify as many ports in each list as the number of ETL processes that will be created.
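
As an illustration of the pipeline idea only (not WikiDAT's internal protocol), the following pyzmq sketch pushes items from a producer to a consumer over a single TCP port. A second pipeline running at the same time would have to bind a different port, which is why one port per ETL process must be listed. The port number and message format are invented for the example.

import multiprocessing as mp
import zmq

def producer(port):
    # Binds the pipeline's port and pushes work items downstream.
    ctx = zmq.Context()
    sender = ctx.socket(zmq.PUSH)
    sender.bind("tcp://127.0.0.1:%d" % port)
    for i in range(5):
        sender.send_pyobj({"revision_id": i})
    sender.send_pyobj(None)              # sentinel: no more work
    sender.close()
    ctx.term()

def consumer(port):
    # Connects to the same port and pulls items until the sentinel arrives.
    ctx = zmq.Context()
    receiver = ctx.socket(zmq.PULL)
    receiver.connect("tcp://127.0.0.1:%d" % port)
    while True:
        item = receiver.recv_pyobj()
        if item is None:
            break
        print("processing", item)
    receiver.close()
    ctx.term()

if __name__ == "__main__":
    port = 10000                         # one distinct port per pipeline
    proc = mp.Process(target=consumer, args=(port,))
    proc.start()
    producer(port)
    proc.join()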

For example, if we want to create 3 ETL lines to process the English Wikipedia, with 1 page worker and 3 revision workers in each ETL line, we can set up the following configuration:

[ETL:RevHistory]
# Some other options first
etl_lines=3
page_fan=1
rev_fan=3
# Communication ports
# There must be at least one base_port and control_port for each ETL line
base_ports=[10000, 10100, 10200]
control_ports=[11000, 11001, 11002]
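
Since the port options are lists, a sketch like the following (again hypothetical, using configparser plus ast.literal_eval to parse the lists) can check that enough ports have been listed for the requested number of ETL lines:

import ast
from configparser import ConfigParser

config = ConfigParser()
config.read("config.ini")                # hypothetical file name
section = config["ETL:RevHistory"]

etl_lines = section.getint("etl_lines")
base_ports = ast.literal_eval(section["base_ports"])        # e.g. [10000, 10100, 10200]
control_ports = ast.literal_eval(section["control_ports"])  # e.g. [11000, 11001, 11002]

if len(base_ports) < etl_lines or len(control_ports) < etl_lines:
    raise ValueError("list at least one base_port and one control_port per ETL line")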

Only the number of ETL processes matters for the communication ports. In the case of base_ports, it is advisable to leave an ample margin (50-100) between consecutive port numbers in the list, as WikiDAT internally opens individual ports for each worker subprocess (this is transparent to the user and cannot currently be configured).
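
The following toy calculation (a hypothetical allocation scheme, not WikiDAT's actual internal port layout) illustrates why a wide margin between base ports is advisable: if each ETL line derives one extra port per worker from its base port, closely spaced base ports could collide.

base_ports = [10000, 10100, 10200]       # one per ETL line, as in the example above
page_fan, rev_fan = 1, 3

for line, base in enumerate(base_ports):
    # Hypothetical rule: one additional port per worker, offset from the base port.
    worker_ports = [base + offset for offset in range(1, page_fan + rev_fan + 1)]
    print("ETL line %d: base port %d, worker ports %s" % (line, base, worker_ports))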

Table of content | (Prev) Config files
