An essential step in creating a unified data platform is building high-performance data pipelines. Data pipelines execute a process known as Extract, Transform, and Load (ETL). These pipelines are tasked with moving data into the data warehouse, cleaning and transforming it, and modeling the schema within the data warehouse.
The data pipelines for the unified data platform will be built using Apache Airflow. According to Apache.org’s Airflow documentation, Airflow can programmatically author, schedule, and monitor data pipelines. A critical difference between Airflow and other orchestrators is that data pipelines are defined as code, and tasks are instantiated dynamically.
There are five key concepts in Airflow. The first concept is the directed acyclic graph (DAG). A DAG is a directed graph with no cycles that represents tasks along with their dependencies. Another concept is the operator. An operator describes a single task in a DAG. For example, a Python operator is used to execute Python code. A task is an instance of an operator, and a task instance represents a specific run of a task. A task instance can be expressed with this formula:
DAG ∩ Task ∩ Point in Time
That is, a task instance is the intersection of a DAG, a task in that DAG, and a point in time.
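The relationship between a DAG, its tasks, and task instances can be sketched in plain Python. This is a conceptual model only, not Airflow's API: the `TaskInstance` class, the `task_instances` helper, the `warehouse_etl` name, and the three-task pipeline are all hypothetical, chosen to illustrate the formula above. It assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from dataclasses import dataclass
from datetime import datetime
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# Because the graph is acyclic, a valid execution order always exists.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


@dataclass(frozen=True)
class TaskInstance:
    """One run of one task: the (DAG, task, point in time) triple."""
    dag_id: str
    task_id: str
    logical_date: datetime


def task_instances(dag_id, dependencies, logical_date):
    """Create one task instance per task, in dependency order."""
    order = TopologicalSorter(dependencies).static_order()
    return [TaskInstance(dag_id, task_id, logical_date) for task_id in order]


run = task_instances("warehouse_etl", dependencies, datetime(2021, 1, 1))
for ti in run:
    print(ti.dag_id, ti.task_id, ti.logical_date.date())
```

Running the same DAG for a different date yields a different set of task instances, even though the DAG and its tasks are unchanged.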
The last essential concept is the workflow, which is the combination of all the concepts mentioned above: a DAG of tasks, each defined by an operator and executed as task instances.
Airflow provides a way of configuring data pipelines via Python code, making them dynamic. The user interface provides an overview of DAGs and access to metadata, such as the time taken by each task, delay between executions, and how many times a task failed.
Airflow can be scaled either vertically or horizontally, allowing multiple tasks to run in parallel across distributed workers.
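To illustrate why a DAG lends itself to parallel execution, here is a minimal single-machine sketch, again in plain Python rather than Airflow's own API. Tasks whose dependencies are all complete are independent of one another, so they can run at the same time. The task names and the `run_task` and `run_in_waves` helpers are hypothetical, and a thread pool stands in for what would be separate workers in a distributed deployment.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical pipeline: two independent extracts feed one join step.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
}


def run_task(task_id):
    """Placeholder for real work; returns the task id when finished."""
    return task_id


def run_in_waves(dependencies, max_workers=2):
    """Run every currently ready task in parallel, one wave at a time."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Tasks returned together have no unmet dependencies.
            ready = sorted(sorter.get_ready())
            finished = list(pool.map(run_task, ready))
            # Marking tasks done unlocks their downstream tasks.
            sorter.done(*finished)
            waves.append(finished)
    return waves


waves = run_in_waves(dependencies)
print(waves)
```

The two extract tasks land in the first wave and run concurrently; the join task only becomes ready once both have finished.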
Airflow implements backfilling, which is the ability to run a DAG for dates in the past. Backfilling is useful for replaying data pipelines after an error or for loading historical data into a database.
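The idea behind backfilling can be sketched as follows, assuming a pipeline that runs once per day. This is an illustration of the concept rather than Airflow's implementation: the `backfill_dates` and `run_pipeline` functions are hypothetical, and the dates are arbitrary.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """Return every daily run date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


def run_pipeline(run_date):
    """Placeholder for one pipeline run scoped to a single day's data."""
    return f"processed {run_date.isoformat()}"


# Replay the pipeline once for each missed or historical day.
results = [
    run_pipeline(run_date)
    for run_date in backfill_dates(date(2021, 1, 1), date(2021, 1, 3))
]
print(results)
```

Because each run is tied to a specific date, replaying a range of dates reprocesses exactly the data that belongs to those days.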
Another notable feature is Airflow's modularity. Because it is an open-source project, contributors can create operators that add new functionality. Additionally, the Airflow repository is continuously updated and improved by many open-source contributors.
An important note is that Airflow is not a data streaming solution. It is built primarily to perform scheduled batch jobs. Batch jobs suit the unified data platform’s use case, since real-time data delivered in a streaming manner is not required.
This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small part of a whitepaper that will be released at the end of this series.