An essential requirement is that the architecture must be highly available. That means if a node goes down, the Airflow components running in the EKS cluster should remain accessible and scheduled tasks should continue to run.
The architecture must also be highly scalable. It has to execute as many tasks as the workload demands, expanding and shrinking its resources accordingly.
The next requirement is that there should be separate environments, such as the development, staging, and production environments mentioned in the testing section. Having separate environments is vital because DAGs and tasks need to be tested before being promoted to production.
Once the DAGs and tasks have been tested, they will be promoted to the production environment. Continuous integration and continuous deployment (CI/CD) pipelines will do this automatically: they fetch code from a GitHub repository, validate and test it, and deploy it to the Kubernetes cluster.
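The "validate and test" stage of such a pipeline typically checks, at minimum, that every DAG file parses before it is deployed. A minimal sketch of that check using only the standard library (the `dags/` folder name is an assumption; a real pipeline would also load the DAGs through Airflow itself):

```python
"""CI-style pre-deployment check: statically compile every DAG file so
plain syntax errors never reach the production cluster. This is a sketch;
the dag_folder path is an assumption, not a fixed project convention."""
import ast
import pathlib


def find_syntax_errors(dag_folder: str) -> dict:
    """Return a {file_path: error_message} mapping for DAG files that fail to parse."""
    errors = {}
    for path in pathlib.Path(dag_folder).rglob("*.py"):
        try:
            # ast.parse catches syntax errors without executing the file
            ast.parse(path.read_text(), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = str(exc)
    return errors
```

In a real pipeline this would run as a test step (for example under pytest), and the build would fail if the returned mapping is non-empty.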
Another essential requirement is remote logging. Logging is critical for debugging when DAGs fail, and because tasks execute across multiple nodes, logs must be written to shared storage rather than to any single node's local disk. Remote logging will be implemented using AWS S3.
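In Airflow 2.x, S3 remote logging is enabled through a handful of configuration options, which can be supplied as environment variables on every Airflow container. A minimal sketch, assuming a hypothetical bucket name and connection id:

```python
# Sketch of the environment-variable overrides that turn on Airflow's
# S3 remote logging (Airflow 2.x option names). The bucket, prefix, and
# connection id below are assumptions for illustration.
def remote_logging_env(bucket: str, conn_id: str = "aws_default") -> dict:
    """Build env-var overrides to inject into every Airflow container."""
    return {
        "AIRFLOW__LOGGING__REMOTE_LOGGING": "True",
        "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER": f"s3://{bucket}/airflow-logs",
        "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID": conn_id,
    }
```

In a Helm-based deployment these values would go into the chart's environment section so the scheduler, web server, and workers all share the same logging destination.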
Other requirements include a secrets backend to store credentials, DAG serialization to make Airflow's web server stateless and improve performance, and AWS's Elastic File System (EFS) to share DAGs across all nodes from a single file system.
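The secrets backend and DAG serialization are, like remote logging, configured through environment variables. A minimal sketch using the AWS Secrets Manager backend from the Amazon provider package (the secret-name prefixes are assumptions; note that DAG serialization is always on in Airflow 2.x, so the last flag only matters on 1.10):

```python
import json


# Sketch of env-var overrides for an AWS Secrets Manager secrets backend
# and DAG serialization. The connections/variables prefixes are assumed
# naming conventions, not requirements.
def secrets_and_serialization_env() -> dict:
    """Build env-var overrides enabling the secrets backend and serialization."""
    return {
        "AIRFLOW__SECRETS__BACKEND": (
            "airflow.providers.amazon.aws.secrets."
            "secrets_manager.SecretsManagerBackend"
        ),
        # BACKEND_KWARGS must be a JSON string, not a Python dict
        "AIRFLOW__SECRETS__BACKEND_KWARGS": json.dumps(
            {
                "connections_prefix": "airflow/connections",
                "variables_prefix": "airflow/variables",
            }
        ),
        # Only needed on Airflow 1.10; serialization is default in 2.x
        "AIRFLOW__CORE__STORE_SERIALIZED_DAGS": "True",
    }
```

With this in place, credentials never live in Airflow's metadata database or in environment variables on the containers themselves; Airflow looks them up in Secrets Manager by prefix at runtime.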
This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small part of a whitepaper that will be released at the end of the series.