[9/23] Directed Acyclic Graphs: A Cost-Effective Unified Data Platform for Small Organizations

By Eric Burt on December 9, 2020 in Apache Airflow, Data Engineering

Directed Acyclic Graphs

The most important concept in Airflow is the directed acyclic graph (DAG). Concisely, a DAG is a finite directed graph that contains no cycles. For example, suppose there are two tasks, A and B, such that B depends on A. This is a valid DAG because no path through the graph leads from a vertex back to itself.

Simple Directed Acyclic Graph

A second dependency from B back to A cannot be added, because it would create a cycle and the graph would no longer be acyclic. Task A would run and trigger task B; once task B finished, it would trigger task A again, creating an endless loop.

Invalid Directed Acyclic Graph
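
As a rough illustration of the valid and invalid cases above, the check for a cycle can be sketched with a depth-first search. This is not how Airflow itself validates a DAG; the function name `has_cycle` and the representation of dependencies as (upstream, downstream) pairs are assumptions made for this example.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (upstream, downstream) pairs has a cycle."""
    # Build an adjacency list: task -> tasks that depend on it.
    graph = {}
    for upstream, downstream in edges:
        graph.setdefault(upstream, []).append(downstream)
        graph.setdefault(downstream, [])

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # We reached a task that is still on the current path, so a cycle exists.
                return True
            if state[nxt] == UNVISITED and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNVISITED and visit(node) for node in graph)


valid = [("A", "B")]                 # B depends on A
invalid = [("A", "B"), ("B", "A")]   # A also depends on B

print(has_cycle(valid))    # False
print(has_cycle(invalid))  # True
```

The recursive search is fine for small graphs like these; a very long dependency chain would need an iterative version to stay within Python's recursion limit.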

In Airflow, a DAG represents a collection of tasks that are scheduled according to the dependencies between them. Ensuring that tasks execute at the right time and in the right order is the job of the workflow. Each node is a task, and each edge is a dependency between two tasks; a task may have any number of upstream and downstream dependencies.

DAG Data Pipeline

Suppose we have a DAG with five tasks: A, B, C, D, and E. Task A extracts data from an API. Task B then processes the data using Spark. Once the processing is done, task C checks that the result is formatted as expected. Task D transforms the data into the schema model, and finally task E loads the data into the data warehouse.

This is an example of the kind of workflow that can be orchestrated with Airflow. The figure above shows how the workflow is visualized in the Airflow user interface. Each node is a task, and each pair of consecutive tasks is linked by a dependency. When a task finishes, the next task is triggered, until all tasks have been executed. If a dependency from E to C were added, the DAG would become invalid because of the endless loop from C to D to E and back to C.
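
The triggering behaviour described above can be sketched outside Airflow with Python's standard-library `graphlib` module (Python 3.9 or later). This is only a minimal model of the five-task pipeline, not Airflow code: the `pipeline` mapping and the `run_order` helper are hypothetical names introduced for this example, and "running" a task is reduced to recording its name.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
pipeline = {
    "A": set(),   # extract data from an API
    "B": {"A"},   # process the data with Spark
    "C": {"B"},   # check the result format
    "D": {"C"},   # transform into the schema model
    "E": {"D"},   # load into the data warehouse
}


def run_order(dependencies):
    """Return the order in which tasks are triggered as their upstream tasks finish."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    order = []
    while sorter.is_active():
        for task in sorter.get_ready():
            order.append(task)  # stand-in for actually running the task
            sorter.done(task)   # marking it finished unlocks downstream tasks
    return order


print(run_order(pipeline))  # ['A', 'B', 'C', 'D', 'E']

# Adding a dependency from E to C means C now also waits on E.
invalid = {**pipeline, "C": {"B", "E"}}
try:
    run_order(invalid)
except CycleError:
    print("invalid DAG: cycle detected")
```

Because the cycle is detected in `prepare()`, the invalid graph is rejected before any task runs, which mirrors the idea that a graph with the extra dependency from E to C is not a valid DAG at all.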

______________________________________________________________________________

This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small part of a whitepaper that will be released at the end of the series.
