In Airflow, tasks are created using operators. While a DAG describes how to run a workflow, an operator determines what actually gets done. An operator describes a single task in a workflow, and it is usually atomic and idempotent: it can stand on its own and does not need to share resources or information with any other operators. The DAG ensures that operators run in the order their dependencies define, but otherwise operators run independently. Two tasks can run on completely different machines for scalability reasons, and a single task can be triggered from the terminal without triggering the entire DAG.
An operator always defines a single task. For example, a Bash command is run with the BashOperator. It is best practice to divide processing into multiple tasks, each with only a single thing to do. Dividing processing this way makes it easier to spot errors, keep DAGs organized, and rerun only the part of a pipeline that failed.
Operators should be idempotent: running an operator multiple times with the same input should produce the same output, with no additional side effects. For example, a task that runs a SQL query should leave the database in the same state whether it runs once or five times.
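Idempotency can be illustrated in plain Python, independent of Airflow. In this sketch (the function and file path are hypothetical), a task that overwrites its output file is idempotent, whereas one that appends to it would not be:

```python
import json
import os

def export_report(records, path):
    # Idempotent: write to a temporary file and atomically replace the target,
    # so running the task N times leaves exactly the same output as running it
    # once. Appending to `path` instead would duplicate records on every retry.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(records, f)
    os.replace(tmp_path, path)
```

The same idea applies to SQL tasks: a delete-then-insert (or upsert) for the processed interval is idempotent, while a bare INSERT is not.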
In the case of a failure, the scheduler will automatically restart the task, provided a retry parameter is specified on it.
Finally, when an operator is instantiated, that operator becomes a node in the DAG. Airflow brings many operators by default, such as the BashOperator to execute a Bash command, the PythonOperator to execute an arbitrary Python function, and a SlackOperator to send Slack notifications.
Although Airflow officially supports these operators, there is a growing list of operators made by open-source contributors.
This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small part of a whitepaper that will be released at the end of this series.