[13/23] Systems Distribution: A Cost-Effective Unified Data Platform for Small Organizations

Systems Distribution

There are several ways of executing tasks in Airflow. When a DAG is scheduled, each of its tasks gets a record in the metadata database that tracks the task's status. The scheduler periodically reads the metadata database to check whether any task is ready to run. When a task is ready to be triggered, it is pushed onto an internal queue handled by the executor, which is responsible for defining how that task will be executed. Airflow ships with several executors, such as the sequential, local, and Kubernetes executors. The difference between executors lies in the resources they have available and how they choose to use them to get the work done.
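To make these moving pieces concrete, here is a minimal sketch of a DAG in the Airflow 2.x style (the DAG id, task ids, and commands are illustrative, not taken from the platform itself). Each operator becomes a task instance whose status the scheduler tracks in the metadata database before handing it to the configured executor.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",            # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # One task instance per operator per DAG run; the scheduler queues the ones
    # whose dependencies are met and the executor decides how to run them.
    extract >> load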

Sequential Executor

The sequential executor runs a single task instance at a time, in a linear fashion; it does not support parallelism. It is mainly used for testing, because being restricted to one running task is too limiting for a production orchestrator.

Local Executor

The local executor executes tasks in parallel on a single machine, such as a server. It is built on Python's multiprocessing library: a process is spawned whenever a task needs to be triggered. For example, if a DAG has five tasks ready to execute, the local executor will spawn five parallel processes. While this executor allows for parallelism, it does not offer the durability or availability of a multi-node distributed architecture. If the machine goes down, Airflow and any running tasks go down with it. In addition, the number of tasks that can run in parallel is bounded by the machine's resources: as the number of tasks increases, so do the CPU, memory, and I/O requirements.
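The sketch below is not Airflow's actual code, but it illustrates the idea behind the local executor: each ready task is handed to its own worker process through Python's multiprocessing library (run_task here is only a placeholder for executing a single task instance).

# A toy model of the local executor: one process per ready task.
from multiprocessing import Pool

def run_task(task_id: str) -> str:
    # Placeholder for running one Airflow task instance.
    return f"{task_id} done"

if __name__ == "__main__":
    ready_tasks = ["extract", "transform_a", "transform_b", "load", "notify"]
    # Five ready tasks -> five parallel worker processes, bounded by the
    # machine's CPU, memory, and I/O.
    with Pool(processes=len(ready_tasks)) as pool:
        for result in pool.map(run_task, ready_tasks):
            print(result)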

Kubernetes Executor

The executor used in the architecture of the unified data platform is the Kubernetes executor, which is built on Kubernetes, an open-source platform for managing containerized applications.

Kubernetes orchestrates computing, networking, and storage infrastructure while offering auto-scaling, load-balancing, and monitoring. It reduces the time needed to deploy applications and allows zero-downtime deployments with rolling updates.

Kubernetes can automatically scale an application up or down, both vertically and horizontally, based on demand. This flexibility is a huge advantage, as it avoids wasting resources when the application is not being used. Kubernetes also has redundancy and failover built in, which makes applications more reliable and resilient.

While setting up a Kubernetes cluster for testing is simple, running one in production is challenging and requires a significant investment of time and energy. To simplify management of the cluster that backs Airflow, the pipeline will be deployed on Amazon Web Services' Elastic Kubernetes Service (EKS), a managed Kubernetes service.

The Kubernetes executor executes tasks inside a Kubernetes cluster. Each task runs in its own dedicated container, one task per container, which keeps tasks isolated from one another: if a crash happens, only the container where the crash happened is affected, and Kubernetes automatically restarts it.
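Switching Airflow to this executor is a configuration change. The sketch below assumes Airflow 2.x option names (they vary slightly between versions) and an "airflow" namespace; in practice these values are set in airflow.cfg or as environment variables on the EKS deployment rather than in Python code.

import os

# Select the Kubernetes executor and the namespace where task pods are launched.
os.environ["AIRFLOW__CORE__EXECUTOR"] = "KubernetesExecutor"
os.environ["AIRFLOW__KUBERNETES__NAMESPACE"] = "airflow"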

It also offers fine-grained control over resources in terms of CPU and memory. If a task needs a lot of memory but little CPU, that can be specified on its container.
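As a sketch of what that looks like in Airflow 2.x, a task can carry an executor_config with a pod override built from the official kubernetes client models. The script path and resource values below are illustrative assumptions, and the operator is assumed to be defined inside a DAG as in the earlier example.

from kubernetes.client import models as k8s

from airflow.operators.bash import BashOperator

memory_heavy = BashOperator(
    task_id="memory_heavy_transform",
    bash_command="python /opt/jobs/transform.py",      # hypothetical script
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",                    # the task container in a worker pod
                        resources=k8s.V1ResourceRequirements(
                            requests={"cpu": "250m", "memory": "2Gi"},
                            limits={"cpu": "500m", "memory": "4Gi"},
                        ),
                    )
                ]
            )
        )
    },
)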

It can also scale down to near zero when no tasks are being executed: if no task containers are running, their resources are released and can be assigned to other workloads.

Containers also run to completion: if the scheduler goes down, tasks that are already running can still finish. This is particularly useful when Airflow needs to be updated and the scheduler and web server are temporarily down.

Another great feature is that the scheduler subscribes to the Kubernetes API event stream. When a task finishes, the scheduler is notified of the event and checkpoints its progress. If the scheduler goes down, it picks up where it left off, without running anything twice and without forgetting to run pending tasks.
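The sketch below is not the scheduler's internal code, but it shows the underlying pattern using the official kubernetes Python client, assuming task pods live in an "airflow" namespace: subscribe to the pod event stream and react to state changes as they arrive.

from kubernetes import client, config, watch

config.load_kube_config()          # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

watcher = watch.Watch()
for event in watcher.stream(v1.list_namespaced_pod, namespace="airflow"):
    pod = event["object"]
    phase = pod.status.phase       # Pending, Running, Succeeded, Failed, ...
    print(f"{event['type']}: {pod.metadata.name} -> {phase}")
    # By recording these transitions, a restarted scheduler can resume from the
    # last observed state instead of re-running finished tasks.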

______________________________________________________________________________

This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small part of a whitepaper that will be released at the end of the series.
