Testing is an essential part of writing software. Test-driven development helps ensure that nothing breaks after an application is modified, and automated tests become even more critical when collaborating in teams and pushing into production without bugs.
Pytest will be used to write unit tests for data pipelines and DAGs. Pytest is a testing framework that lets an engineer write everything from simple unit tests to complex functional tests for Python applications and libraries.
In order to test DAGs, there are five categories of tests that will be implemented.
DAG validation tests are standard tests applied to every DAG in a pipeline. They check for typos, cycles, and incorrectly set default arguments in the DAGs. The goal of these tests is to avoid the Airflow UI's broken-DAG error message.
The next category is definition tests, which check the definitions of all DAGs. The goal is not to verify a DAG's logic but only to check whether a modification was intentional. These tests check how many tasks a DAG has and the nature of those tasks, and that the upstream and downstream dependencies of each task are set to the right tasks. Unlike validation tests, there will be a single definition test for each DAG.
Since Airflow is only an orchestrator and not a data processing framework like Spark, there should not be too much logic inside the DAG operators. Processing should be externalized to the big data tools, with Airflow simply calling them in the right order. As a best practice, operators should be kept small and clean, each with a single responsibility. The only operators and sensors that need unit tests in Airflow are the custom ones written in-house; external processing components should be unit tested within those components themselves. For example, unit tests for Spark processes should be done in Spark.
An integration test is a different form of testing in which the interactions between two or more tasks are explicitly tested. Integration tests verify that the tasks of a DAG work together: for example, whether a task can fetch the data it needs, whether two or more tasks can exchange data between them, and whether external resource connections are available from a given task. Integration tests can be more complicated and slower than the others, as external frameworks like Spark must be involved.
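The idea can be sketched without any infrastructure: two task callables exchange data through a plain dict that stands in for Airflow's XCom store, and the test checks that they work together:

```python
# Minimal sketch of an integration test: the dict stands in for XCom
def extract(xcom):
    # Hypothetical extract step publishing its result
    xcom["raw"] = [1, 2, 3]

def transform(xcom):
    # Fails with a KeyError if extract did not publish the data it needs
    xcom["clean"] = [value * 2 for value in xcom["raw"]]

def test_tasks_work_together():
    xcom = {}
    extract(xcom)
    transform(xcom)
    assert xcom["clean"] == [2, 4, 6]
```

A real integration test would run the actual tasks against staging resources, but the shape is the same: run task A, run task B, and assert on what crossed the boundary.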
The end-to-end pipeline tests are the last tests to execute, and their goal is to check the full logic of the DAGs from start to finish. End-to-end pipeline tests verify that a given DAG's output is correct and that the full run completes within an acceptable time.
A full copy of the production data will be used to stay as close as possible to production conditions. This setup requires an environment tracked in version control, called Staging, where the tests and all the required tools are set up. Of course, it is impossible to guarantee that no errors will reach production; however, after running these categories of tests, the odds of an error occurring are dramatically reduced.
When building and testing pipelines, there will be three main branches within version control for development: Development, Staging, and Production.
Development -> Staging -> Production
The first branch used in testing will be the development branch. The development branch uses small faked data to run the validation, definition, and unit tests. This environment aims to verify that the data pipelines' tasks exit successfully, with zero exit codes. Because the input data is small, the tests execute quickly.
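"Small faked data" can be as simple as a few hand-written rows defined inside the test itself. The transformation below, `total_amount`, is a hypothetical stand-in for a pipeline step:

```python
import csv
import io

# Hypothetical transformation under test: sums an amount column
def total_amount(rows):
    return sum(float(row["amount"]) for row in rows)

def test_total_amount_on_faked_data():
    # A couple of hand-written rows stand in for production data,
    # so the test finishes in milliseconds
    fake_csv = io.StringIO("order_id,amount\n1,10.5\n2,4.5\n")
    rows = list(csv.DictReader(fake_csv))
    assert total_amount(rows) == 15.0
```

Keeping the fixture data inline makes the test self-documenting and trivially fast.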
The second branch is the staging branch. There will be a full copy of the production data to verify that a DAG passes integration and end-to-end pipeline tests. The staging branch is also where the pipeline must meet the defined requirements before pushing it to the production branch.
Finally, the production branch will be the environment set for deployment, where the risk of errors is minimized.
The following figure illustrates the different Git version control branches with their corresponding environments. When a DAG needs to be modified, a feature branch is created for the changes. Once the feature is done, the code is reviewed and manually merged into the development branch. On merge, AWS CodePipeline runs the validation, definition, and unit tests. After they succeed, the CI tool merges the development branch into the staging branch for the integration and end-to-end pipeline tests. At this point, the DAG is not automatically deployed to production: as a best practice, it is better to check that everything works as expected and that the standard requirements are met. If these tests pass, a merge request is opened against the production (master) branch for a final check. The merge request is then manually validated, and the changes are pushed into production.
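The development-branch step of this flow could be wired up with a CodeBuild buildspec along these lines; the file layout, requirements file, and test folder names are assumptions for illustration:

```yaml
# Hypothetical buildspec.yml for the development branch:
# run only the fast test categories on every merge.
version: 0.2
phases:
  install:
    commands:
      - pip install -r requirements.txt
  build:
    commands:
      # Validation, definition, and unit tests only; integration and
      # end-to-end pipeline tests run later, in the staging environment
      - pytest tests/validation tests/definition tests/unit
```

The staging stage would use a similar buildspec pointing at the slower integration and end-to-end test suites.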
This post is part of a 23 part mini-series about implementing a cost-effective modern data infrastructure for a small organization. This is a small part of a whitepaper that will be released at the end of this series.