Monitoring the Airflow instance and its DAGs is a top priority in production. Without a monitoring system, there is no way to know whether something has gone wrong, when, or why.
Airflow's logging system is built on the Python standard-library logging module, which offers great flexibility in configuration. Logs are the stream of aggregated, time-ordered events collected from the output streams of all running processes. Airflow relies on several components, such as a web server, a scheduler, and workers, and each component writes its own log stream to a file by default via the logging module. The module provides logger objects that emit only the records at or above a configured log level, such as DEBUG, INFO, WARNING, or ERROR. The records are then formatted according to the settings in the airflow.cfg configuration file and, finally, routed to a destination determined by the handler in use.
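The logger/level/formatter/handler chain described above can be sketched with the logging module directly. The logger name and format string below mirror Airflow's defaults but are used here purely for illustration; Airflow derives its actual settings from airflow.cfg.

```python
import logging

# Create a logger and set its threshold: records below INFO are dropped.
logger = logging.getLogger("airflow.task")
logger.setLevel(logging.INFO)

# A handler decides where records go (here: the console); a formatter
# decides how they look before being written out.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "[%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s"
))
logger.addHandler(handler)

logger.debug("Filtered out: below the configured INFO level")
logger.info("Task started")   # emitted through the handler
```

Swapping the `StreamHandler` for a different handler class is all it takes to redirect the same stream of records to another destination, which is exactly the mechanism the next section relies on.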
The destination the handler will send the log files to is an AWS S3 bucket. An S3 bucket is a public cloud object storage service on Amazon Web Services. By default, Airflow stores its log files locally and uncompressed. Since many jobs will run on the pipeline, disk space would fill up fairly quickly. Storing logs on S3 not only removes that disk-maintenance burden; S3 also provides stronger durability and availability guarantees for these files.
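Airflow ships with built-in support for remote task logging, configured in airflow.cfg. A minimal sketch might look like the fragment below; the bucket name and connection ID are placeholders, and the section name can differ between Airflow versions (older releases kept these keys under `[core]`).

```ini
[logging]
# Ship task logs to S3 in addition to writing them locally.
remote_logging = True
# Placeholder bucket; point this at your own S3 path.
remote_base_log_folder = s3://my-airflow-logs/
# Airflow connection holding the AWS credentials.
remote_log_conn_id = aws_default
```

With this in place, task logs are uploaded to the configured S3 prefix when a task finishes, and the web server reads them back from S3 when displaying logs.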
Once the log files are stored on AWS S3, there must be a way to monitor them and raise alerts automatically when an error occurs, which requires the log data to be easy to analyze. The analysis will be done with Elasticsearch, an open-source text search engine with an HTTP interface and schema-free JSON documents. It stores, searches, and analyzes large volumes of unstructured text data, and it is well suited to collecting and aggregating log data at scale to surface trends, statistics, and summaries in near real time. Elasticsearch will be deployed on AWS using the managed Amazon Elasticsearch Service and connected to the log files stored in S3.
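To make the error-alerting idea concrete, the sketch below builds the kind of Elasticsearch query body such a check could POST to the cluster's `_search` endpoint. The index pattern and field names (`log_level`, `@timestamp`) are assumptions about how the logs end up indexed, not something prescribed by Airflow or Elasticsearch.

```python
import json

def error_query(minutes=15):
    """Elasticsearch Query DSL body matching ERROR-level log events
    from the last `minutes` minutes. Field names are assumed."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Exact match on the (assumed) log-level field.
                    {"term": {"log_level": "ERROR"}},
                    # Restrict to a recent time window using date math.
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        }
    }

body = error_query()
print(json.dumps(body, indent=2))
```

A periodic job (itself schedulable as an Airflow DAG) could run this query against the assumed `airflow-logs-*` indices and fire a notification whenever the hit count is non-zero.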
This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small excerpt of a whitepaper that will be released at the end of the series.