To build data pipelines between our data sources and the data warehouse, the workflow of each pipeline must be defined as a DAG. Each data source has its own pipeline and corresponding DAG: if there are ten data sources, then at least ten DAGs must be defined in the ETL pipeline.
A DAG is defined by setting the parameters it needs to be instantiated, such as the start date, schedule interval, number of retries, contingency handling, and backfilling behavior. The different tasks in the DAG are then implemented using operators. The tasks in a standard DAG are as follows:
This figure shows the corresponding DAG, with the different tasks and their dependencies represented by arrows. There are six tasks in total, each using a different operator that may or may not interact with a big data technology.
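As a rough illustration, the sketch below shows how such a DAG could be declared with Airflow 2.x. The DAG id, schedule, task names, and callables are hypothetical placeholders standing in for the six tasks in the figure; only the structure mirrors what is described above: parameters set at instantiation, one operator per task, and dependencies declared between tasks.

```python
# Minimal sketch of a per-source ETL DAG (Airflow 2.x).
# All ids, names, and commands below are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data-engineering",           # illustrative owner
    "retries": 2,                          # number of retries on task failure
    "retry_delay": timedelta(minutes=5),   # wait between retries
}

with DAG(
    dag_id="example_source_etl",           # hypothetical id; one DAG per data source
    default_args=default_args,
    start_date=datetime(2023, 1, 1),       # illustrative start date
    schedule_interval="@daily",            # how often the pipeline runs
    catchup=False,                         # disable backfilling of past runs
) as dag:

    def _extract(**_):
        # Placeholder: pull raw data from the source system.
        pass

    def _validate(**_):
        # Placeholder: check the extracted data before transforming it.
        pass

    extract = PythonOperator(task_id="extract", python_callable=_extract)
    validate = PythonOperator(task_id="validate", python_callable=_validate)

    # In a real pipeline these steps might use provider operators
    # (e.g. a Spark or warehouse-specific operator); BashOperator is
    # used here only to keep the sketch self-contained.
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")
    quality_check = BashOperator(task_id="quality_check", bash_command="echo check")
    notify = BashOperator(task_id="notify", bash_command="echo done")

    # The arrows in the figure correspond to these dependencies.
    extract >> validate >> transform >> load >> quality_check >> notify
```

Because every data source follows the same structure, a template like this can be reused by parameterizing the DAG id and the source-specific tasks, which keeps the ten-plus DAGs consistent and easy to maintain.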
This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small part of a whitepaper that will be released at the end of the series.