The architecture used to deploy the ETL pipeline is illustrated in the figure below.
The first component is in the upper left-hand corner of the figure. Instead of deploying resources directly to the EKS cluster, a Git repository stores the manifest files that define namespaces and deployments. Flux detects any changes made to these files and applies them to the EKS cluster.
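To make the GitOps flow concrete, here is a minimal sketch of the kind of manifest that would live in that repository (the namespace, names, and image URL are illustrative placeholders); Flux watches the repository and applies any change to the cluster:

```yaml
# Illustrative manifest stored in the GitOps repository.
# Flux syncs this file to the EKS cluster whenever it changes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-webserver
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airflow-webserver
  template:
    metadata:
      labels:
        app: airflow-webserver
    spec:
      containers:
        - name: webserver
          image: <account-id>.dkr.ecr.us-east-1.amazonaws.com/airflow:latest
          ports:
            - containerPort: 8080
```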
The component in the upper right-hand corner of the figure corresponds to the CI/CD pipeline. A second Git repository holds the different environments mentioned earlier in the Testing section. When new DAGs are pushed to the production/master branch, the CI/CD pipeline is triggered: it builds new Docker images containing those changes and pushes them to the AWS ECR registry. As soon as a new Airflow image is available in ECR, the EKS cluster will pull and use it.
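As an illustration of what that build stage might look like (a tool-agnostic sketch; the account ID, region, repository name, and `$COMMIT_SHA` variable are placeholders), the pipeline authenticates against ECR, builds the image, and pushes it:

```yaml
# Illustrative CI job; the exact syntax varies by CI tool.
build_and_push:
  script:
    # Authenticate Docker against the ECR registry.
    - aws ecr get-login-password --region us-east-1 |
        docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
    # Build the Airflow image with the new DAGs baked in.
    - docker build -t airflow:$COMMIT_SHA .
    # Tag and push it to ECR so the cluster can pull it.
    - docker tag airflow:$COMMIT_SHA <account-id>.dkr.ecr.us-east-1.amazonaws.com/airflow:$COMMIT_SHA
    - docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/airflow:$COMMIT_SHA
```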
The largest and most important component, shown beneath the previous two in the figure, is the EKS cluster. The cluster will be deployed across three availability zones within the US East region on AWS. Spreading the nodes across availability zones in different physical locations is critical for making the architecture highly available: if a node goes down in one availability zone, the cluster can still be reached from another.
The cluster consists of four nodes, each running on an EC2 t3.medium instance; t3.medium instances were chosen because they are cheap and burstable. Each availability zone contains a public and a private subnet. An application load balancer runs in the public subnet to expose the Airflow user interface, while the Airflow components run inside the private subnet; keeping the scheduler and webservers in private subnets is a security best practice. There is only one scheduler in the cluster because Airflow's current version does not support multiple concurrent schedulers; Airflow 2.0, which will be released soon, will add that support.
There are five AWS services at the bottom of the figure. RDS will be Airflow's metadata database (the metastore), where all metadata related to DAGs is stored. A db.t3.micro multi-AZ instance is sufficient for RDS because it is only responsible for reading and writing metadata. AWS S3 will serve as a shared file system for storing and reading logs from the Airflow instances; a shared file system is essential because tasks will execute across different nodes, and their logs must be available in one place for debugging. Amazon Elasticsearch Service will be used to deploy Elasticsearch in order to track and monitor logs automatically.
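For reference, a sketch of the `airflow.cfg` settings that ship task logs to S3 in Airflow 1.10 (the bucket name and connection ID below are placeholders):

```ini
[core]
# Ship task logs to the shared S3 bucket instead of local disk.
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/
remote_log_conn_id = aws_default
encrypt_s3_logs = False
```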
The final two components are AWS Lambda and CloudWatch. AWS only charges for EC2 instances while they are running, which means that when the instances are not executing any workloads, they should be turned off to save money. Since the pipeline will only run for one or two hours per day, there has to be a process that automatically turns the cluster's EC2 instances on and off.
The DAGs need to be triggered once a day at 12:00 AM PST, and the cluster shut down once the data has been moved. At 12:00 AM, CloudWatch will trigger a serverless Lambda function that turns on the EC2 instances and starts the pipeline. When all of the DAGs have completed, CloudWatch will trigger another Lambda function to turn the EC2 instances off.
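A minimal sketch of what the two Lambda handlers might look like (the instance IDs and region are illustrative placeholders; boto3 is available by default in the AWS Lambda Python runtime). Since CloudWatch Events schedules are expressed in UTC, 12:00 AM PST corresponds to the cron expression `cron(0 8 * * ? *)`:

```python
# Hypothetical Lambda handlers that start and stop the cluster's EC2
# instances on a CloudWatch Events schedule. The IDs below are
# placeholders for the cluster's four t3.medium instances.
INSTANCE_IDS = [
    "i-0123456789abcdef0",
    "i-0123456789abcdef1",
    "i-0123456789abcdef2",
    "i-0123456789abcdef3",
]

def start_handler(event, context):
    """Triggered at 12:00 AM PST: turn the instances on."""
    import boto3  # bundled with the AWS Lambda Python runtime
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.start_instances(InstanceIds=INSTANCE_IDS)
    return {"started": INSTANCE_IDS}

def stop_handler(event, context):
    """Triggered once all DAGs have completed: turn the instances off."""
    import boto3
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.stop_instances(InstanceIds=INSTANCE_IDS)
    return {"stopped": INSTANCE_IDS}
```

Keeping the start and stop logic in two separate functions lets each CloudWatch rule target exactly one handler.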
This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small piece of a whitepaper that will be released at the end of the series.