[11/23] Workflow: A Cost-Effective Unified Data Platform for Small Organizations

By Eric Burt on December 30, 2020 in Apache Airflow, Data Engineering

To build data pipelines between our data sources and the data warehouse, each workflow must be defined as a DAG. Every data source gets its own pipeline and corresponding DAG(s), so with ten data sources the ETL layer contains at least ten DAGs.

A DAG is defined by setting the parameters it needs to be instantiated, such as the start date, schedule interval, number of retries, failure handling, and backfilling (catchup); a minimal sketch of this instantiation follows the task list below. The different tasks in the DAG are then implemented using operators. The tasks in a standard DAG for this pipeline are as follows:

  • Checking that the data source API is available
  • Extracting unstructured JSON data for a given day from the API using Python
  • Saving the data in AWS S3
  • Processing the data using Spark on an AWS EMR Cluster
  • Transferring the transformed data from S3 to AWS Redshift
  • Sending a Slack notification of the DAG status
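As a rough illustration of that instantiation, the sketch below creates a daily DAG with retries and backfilling enabled. The DAG ID, owner, schedule, and retry values here are hypothetical placeholders, not the project's actual settings.

```python
from datetime import datetime, timedelta

from airflow import DAG

# Hypothetical defaults; the real pipeline sets these per data source.
default_args = {
    "owner": "data-engineering",
    "retries": 3,                         # number of retries per task
    "retry_delay": timedelta(minutes=5),  # wait between retries
    "depends_on_past": False,
}

dag = DAG(
    dag_id="data_source_etl",             # one DAG per data source
    default_args=default_args,
    start_date=datetime(2020, 1, 1),      # first execution date
    schedule_interval="@daily",           # one run per day of source data
    catchup=True,                         # backfill all runs since start_date
    max_active_runs=1,                    # keep backfill runs from overlapping
)
```

With catchup=True, Airflow schedules a run for every day between the start date and the present, which is how historical data from a source gets backfilled.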
Figure: DAG workflow

The figure shows the corresponding DAG, with the six tasks and the dependencies between them represented by arrows. Each task uses a different operator, some of which interact with a big data technology and some of which do not.
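Continuing the sketch above, the six tasks might be wired together along the following lines. The connection IDs, bucket, table, and script path are placeholders, and the import paths assume the Airflow 2.x HTTP, Amazon, and Slack provider packages (the 1.10 contrib paths differ).

```python
from airflow.operators.python import PythonOperator
from airflow.providers.http.sensors.http import HttpSensor
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator


def extract_json(ds, **_):
    """Placeholder: call the source API for execution date `ds` and return raw JSON."""
    ...


def save_to_s3(ds, **_):
    """Placeholder: upload the extracted JSON to S3 (e.g. via S3Hook) under a dated key."""
    ...


# 1. Check that the data source API is reachable.
check_api = HttpSensor(
    task_id="check_api_available",
    http_conn_id="data_source_api",        # placeholder connection ID
    endpoint="health",                     # placeholder endpoint
    poke_interval=60,
    timeout=600,
    dag=dag,
)

# 2-3. Extract the day's JSON with Python, then persist it in S3.
extract = PythonOperator(task_id="extract_json", python_callable=extract_json, dag=dag)
upload = PythonOperator(task_id="save_to_s3", python_callable=save_to_s3, dag=dag)

# 4. Run the Spark transformation as a step on an existing EMR cluster.
spark_step = [{
    "Name": "transform_daily_json",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-data-lake/scripts/transform.py", "{{ ds }}"],
    },
}]
transform = EmrAddStepsOperator(
    task_id="spark_transform",
    job_flow_id="{{ var.value.emr_cluster_id }}",  # placeholder: ID of a running cluster
    aws_conn_id="aws_default",
    steps=spark_step,
    dag=dag,
)

# 5. COPY the transformed data from S3 into Redshift.
load = S3ToRedshiftOperator(
    task_id="s3_to_redshift",
    schema="public",
    table="daily_facts",                   # placeholder table
    s3_bucket="my-data-lake",              # placeholder bucket
    s3_key="transformed/{{ ds }}/",
    redshift_conn_id="redshift_default",
    aws_conn_id="aws_default",
    copy_options=["FORMAT AS PARQUET"],
    dag=dag,
)

# 6. Report the run's outcome to Slack, whatever the upstream result.
notify = SlackWebhookOperator(
    task_id="notify_slack",
    http_conn_id="slack_webhook",          # connection holding the webhook URL
    message="DAG data_source_etl finished for {{ ds }}",
    trigger_rule="all_done",               # fire on success or failure
    dag=dag,
)

check_api >> extract >> upload >> transform >> load >> notify
```

In practice an EmrStepSensor would usually sit between the Spark step and the Redshift load so the COPY only starts once the step has finished; it is omitted here to mirror the six tasks in the figure.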

______________________________________________________________________________

This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small part of a whitepaper that will be released at the end of the series.
