Persisting Data
isaacmg edited this page Jun 10, 2020 · 11 revisions
New COVID-19 data arrives daily, so we need pipelines that join and store the relevant data sources. The data should be easy to track and version so that models are reproducible.
Airflow
Airflow will be used to schedule daily jobs to persist data to GCS and Dataverse.
GCS Layout
GCS will be organized into directories named by date. For example, COVID files will be stashed at:
06-10-2020/raw_data.csv
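The date-based layout above can be sketched as a small path helper. This is a minimal illustration, not the project's actual code: the function name `gcs_object_path` is hypothetical, and the `MM-DD-YYYY` directory format is an assumption inferred from the example path and the page's edit date.

```python
from datetime import date


def gcs_object_path(run_date: date, filename: str = "raw_data.csv") -> str:
    """Build the object path for a file stored under a date-named directory.

    Assumes directories are named MM-DD-YYYY (an inference from the
    example path on this page, not a confirmed convention).
    """
    return f"{run_date.strftime('%m-%d-%Y')}/{filename}"


if __name__ == "__main__":
    # Path for the files persisted on June 10, 2020.
    print(gcs_object_path(date(2020, 6, 10)))
```

A daily job could call this helper with its run date so that each day's files land in their own directory.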
BigQuery