Configuration files
You can always refer to the latest version of the example configuration file to obtain a list of accepted parameters in WikiDAT.
Configuration parameters are organized in several different sections:
Parameters affecting general features.
- `lang` (String): Target Wikipedia project to be processed, as specified in http://dumps.wikimedia.org. For example, `enwiki`, `dewiki` or `eswiki` are valid project names.
- `date` (YYYYMMDD): Date of the dump to be processed (must be a valid date, listed on the mirror site and already created). Alternatively, the value `latest` will try to download the latest available dump.
- `mirror` (URL): Mirror site from which dump files will be downloaded.
- `download_files` (Boolean): If True, database dump files will be downloaded. Otherwise, the program will try to process dump files already retrieved and stored in a local directory (see next option).
- `dumps_dir` (Path): Absolute or relative path to the local directory in which the dump files for this language have already been stored. If the previous option is True, this value will be skipped.
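An illustrative fragment for this group of parameters might look as follows (the INI-style layout and the section name are assumptions based on this page; always check the latest example configuration file shipped with WikiDAT for the canonical format):

```ini
[General]
# Target project and dump date
lang = enwiki
date = latest
# Mirror site to fetch dump files from
mirror = http://dumps.wikimedia.org
# Download fresh dumps (True) or reuse files already stored in dumps_dir (False)
download_files = True
dumps_dir = ./dumps/enwiki
```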
Parameters affecting database-related features.
- host (Hostname): Name of the host in which the local database is running.
- port (Port num.): Port to connect to the local database.
- `db_engine` (Engine name): Name of the database engine for the tables that will store extracted information. Recommended values are `MyISAM` for MySQL databases and `Aria` for MariaDB.
- `db_user` (User name): Valid user name to connect to the database. The user must have privileges for database and table creation.
- db_passw (Password): Password for this user to connect to the local database.
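A sketch of this group of parameters (section name, default MySQL port and credential values here are illustrative assumptions, not the canonical file):

```ini
[Database]
# Local database connection
host = localhost
port = 3306
# Recommended: MyISAM for MySQL, Aria for MariaDB
db_engine = MyISAM
db_user = wikidat
db_passw = secret
```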
Parameters affecting the Extract-Transform-Load process for full revision history dumps.
- etl_lines (Positive Integer): Number of ETL processing lines to be created. See next section to understand the structure of the ETL process and how it is parallelized.
- `page_fan` (Positive Integer): Number of worker processes to handle extracted `page` elements.
- `rev_fan` (Positive Integer): Number of worker processes to handle extracted `revision` elements.
- `page_cache_size` (Positive Integer): Number of rows with `page` information that will be stored in a temporary file before it is uploaded to the local database.
- `rev_cache_size` (Positive Integer): Number of rows with `revision` (and revision hash) information that will be stored in temporary files before they are uploaded to the local database.
- `base_ports` (List of port numbers): Port numbers for the data communication sockets created with ZeroMQ. At least one port number must be provided for each ETL line.
- `control_ports` (List of port numbers): Port numbers for the command sockets created with ZeroMQ. At least one port number must be provided for each ETL line (avoid overlapping with the base ports specified above).
- `detect_FA` (Boolean): If `True`, revisions corresponding to featured articles (containing the FA template in that language) will be detected.
- `detect_FLIST` (Boolean): If `True`, revisions corresponding to featured lists (containing the FLIST template in that language) will be detected.
- `detect_GA` (Boolean): If `True`, revisions corresponding to good articles (containing the GA template in that language) will be detected.
Parameters affecting the Extract-Transform-Load process for metadata revision history dumps.
NOT IMPLEMENTED YET
Parameters affecting the Extract-Transform-Load process for dumps containing records of administrative events (the `logging` table in MediaWiki). Since `pages-logging` dump files are not split into different chunks for any language, we cannot set up more than a single ETL process in this case. Hence, the only parallelization level available is using more workers to process `logitem` elements (the data units stored in this kind of file):
- `log_fan` (Positive Integer): Number of worker processes to handle extracted `logitem` elements.
- `log_cache_size` (Positive Integer): Number of rows with `logitem` information that will be stored in a temporary file before it is uploaded to the local database.
Table of contents | (Prev) Default execution | (Next) Understanding ETL
WikiDAT: Wikipedia Data Analysis Toolkit. CC-BY-SA 3.0 Felipe Ortega. Icons: Font Awesome