Using Airflow in Glovo for data orchestration

Summary
In this article we briefly introduce Apache Airflow as a data workflow orchestrator, and we present Glovo’s data strategy based on the Data Mesh paradigm. We then illustrate how Airflow is used in Glovo, and present some customizations that have made a successful implementation of Data Mesh possible. Finally, we show the evolution towards a declarative approach that truly democratizes data production and usage, making Airflow a cornerstone of Glovo’s data architecture.
What is Apache Airflow?
Apache Airflow is “an open-source platform for developing, scheduling, and monitoring batch-oriented workflows” [1]. It was created by Maxime Beauchemin in 2014 while working at Airbnb to handle increasingly complicated data engineering pipelines. The project joined the Apache Software Foundation in 2016, and became a top-level project in 2019, ensuring its future continuity [2]. Today, it’s probably the most used orchestration tool in the Data Engineering field [3][4].
Orchestration, in the context of Data Engineering, automates the scheduling of jobs, and the sequencing of the steps required to perform the movement and transformation of data between systems. It is crucial to ensure timeliness and quality of the data to be used in analytics, reporting, modeling or machine learning [5][6][7].

Airflow organizes the data pipelines or workflows in so-called DAGs (Directed Acyclic Graphs):
- The different jobs to be performed are represented as nodes in a graph, which are called tasks in Airflow. Tasks are instances of operators that perform a certain type of work (for example, reading from a database or writing to a file).
- The relationships between the tasks are reflected as arcs connecting the nodes, and they are called dependencies in Airflow. These relationships between tasks are directed: a certain task needs to be executed after one or more other tasks.
- Being acyclic means that the graph contains no loops: no task can depend, directly or indirectly, on itself, so within a single run the execution never circles back to a task that has already completed.
DAGs are not exclusive to Airflow, and there are many applications of this data structure.

These properties allow Airflow to implement the “sequencing of the steps” component of orchestration by ensuring that:
- There is a clear beginning of the execution of the tasks.
- There is a clear path forward when each task is completed.
- There is a clear ending of the execution of the tasks.
- Eventually all the tasks will be completed.
Additionally, Airflow implements the “scheduling of jobs” component of orchestration by allowing a cron-like expression in each DAG that determines at which moments in time the DAG needs to be run. More complex schedules can be defined through timetables, which do not need to be periodic in time.
Dependencies between DAGs can be handled in several ways:
- The dependent DAG’s schedule could be set up so that it starts after the dependencies have normally completed execution. However, this setup is vulnerable to errors or delays, as there is no way to verify whether the dependencies have actually run.
- It is also possible to trigger a DAG run from another DAG through the TriggerDagRun operator.
- Airflow’s recommended way is to use the ExternalTaskSensor operator in the dependent DAG to wait for the dependencies to complete, as in the sketch below.
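To make these concepts concrete, here is a minimal sketch of a DAG combining a cron-like schedule, task dependencies and an ExternalTaskSensor. All DAG and task ids are hypothetical, and a recent Airflow 2.x release is assumed:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="example_dependent_dag",
    schedule_interval="30 5 * * *",  # cron-like expression: run every day at 05:30
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    # Wait for a task in another (hypothetical) DAG to finish before starting.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="example_upstream_dag",
        external_task_id="final_task",
        execution_delta=timedelta(hours=1),  # the upstream DAG is scheduled one hour earlier
        mode="reschedule",
    )

    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")

    # Directed dependencies: extract runs after the sensor, transform after extract.
    wait_for_upstream >> extract >> transform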
Glovo’s data strategy
Three years ago, data at Glovo was in distress. The growth of the business also implied a growth in the usage of data for informed decision-making, but the infrastructure and the organization supporting the increased usage could not scale any further. There was no way to make our centralized Data Warehouse larger or more powerful. Outages were frequent, recovery times were in the order of days, and there was no clear technical, operational or business owner for many of the data processes.
Glovo decided to migrate this centralized approach to a Data Mesh organization. After more than two years of work, the decentralization strategy has been a resounding success, and we have been able to turn off our gigantic Data Warehouse. The four pillars of Data Mesh have been crucial to achieve this level of success [8]:

Domain ownership
In the same way as Engineering teams have achieved decentralization by embracing domain-driven design and adopting the Reverse Conway Maneuver, Data teams need to be arranged into separate business domains. Each domain covers a bounded context, for which the data team has full ownership and is fully accountable. Each domain data team decides which data to expose to other domains, while keeping the implementation details internal.
Data as a product
In Data Mesh, a data product is the smallest architectural block that can be deployed as a cohesive unit, and is the result of applying product thinking to domain-oriented data. It is composed of the code needed to perform the required transformations, the data resulting from those transformations, the metadata that identifies the product and the outputs, and the infrastructure required to run the previous elements.
A data product in Glovo represents a set of tables designed to fulfill the same use cases / user needs, with the same timeliness, loading frequency and criticality requirements. These tables can be exposed to users via multiple interfaces such as other data products, query engines, BI tools or others. There is no way to produce a data table that is outside of the data product enclosure.
Self-serve data platform
Data domain teams can autonomously own their data products by having access to a data platform that provides a higher level of abstraction than direct infrastructure management. This platform removes the complexity and friction of provisioning and managing the lifecycle of data products by providing simple declarative interfaces, and by implementing the cross-cutting concerns that are defined as a set of standards and global conventions across the organization. The self-serve data platform also includes capabilities to lower the cost and specialization needed to build data products.
Federated computational governance
Data Mesh follows a distributed system architecture: a collection of separate data products, each with an independent lifecycle, built and deployed by autonomous data domain teams. However, to get greater value, these independent data products need to interoperate, which is possible through a governance model that embraces decentralization and domain self-sovereignty, global standardization, a dynamic topology, and, most importantly, automated execution of decisions by the data platform.
Usage of Airflow in Glovo
Airflow honors its orchestrator role by acting as the central piece in the computation of Data Products in Glovo, as illustrated below:

Each Data Product unit includes at least one Airflow DAG for periodic computation, although in many cases there are additional DAGs for a variety of purposes:
- Running transformations that differ from the main ones: with a different periodicity or temporality, a different intent, or to split outputs with different criticality.
- Performing initial data loads.
- Backfilling data.
- Doing auxiliary operations such as table definition changes or deletions.
Regardless of their purpose, all the Data Product DAGs have the same general structure, although some of the blocks may not always be present:

DagFactory to simplify DAG creation
This general structure has led us to build an internal package to abstract and simplify the definition of DAGs. Writing convoluted Python code defining a workflow is replaced by a much simpler file setting up a DAG configuration. We call this module DagFactory, much inspired by the dag-factory package by Astronomer [9] and, to a lesser degree, by the airflow-declarative project [10].
Below is an example of how a DAG is defined with DagFactory:
from datetime import datetime
from datetime import timedelta
from pathlib import Path

from data_pipeline_tools.airflow.dag_factory.dag_factory import DagFactory

dag_configuration = {
    "dag_name": "my_first_dag_factory_dag",
    "image_version": "0.2.22",
    "dags_path": str(Path(__file__).parent.resolve()),
    "process_name": "calculate_odps_courier_distances",
    "domain": "courier",
    "data_product_name": "order_flow",
    "data_product_key_prefix": "OF",
    "schedule_interval": "30 5 * * *",
    "default_args": {
        "owner": "Operations Data Engineering",
        "description": "A set of ODPs covering order-level and city-day KPIs related to courier distances of Orders.",
        "start_date": datetime(2021, 11, 12),
        "retries": 2,
        "email_on_failure": False,
        "email_on_retry": False,
        "retry_delay": timedelta(minutes=5),
        "depends_on_past": False,
        "max_active_runs_per_dag": 1,
    },
    "runtime_date_local": "'2022-01-08'",
    "runtime_date_dev": "'2022-04-15'",
    "num_days_local": 8,
    "num_days_default": 30,
    "slack_channel_prod": "data-mesh-monitors-courier",
    "slack_channel_dev": "data-mesh-monitors-courier-dev",
    "slack_conn_id": "slack_webhook",
    "script_module": "order_flow.jobs.courier_order_flow_job",
    "jobs": {
        "courier_distances_points_intermediate": [],
        "courier_distances_order_level_attributes": [
            "courier_distances_points_intermediate"
        ],
    },
    "sensor_specs": {
        "ORDER_DESCRIPTORS_ORDER_DESCRIPTORS": {
            "checkpointer_path": "COD__DAG_CHECKPOINT_PATH",
            "domain": "central",
            "product": "central_order_descriptors",
            "task": "order_descriptors",
            "freshness_hours": 8,
            "timeout_hours": 8,
            "mode": "reschedule",
            "poke_interval": 300,
        },
    },
}

# Magic words, DO NOT MODIFY
airflow_dag_factory = DagFactory(**dag_configuration)
globals()[dag_configuration["dag_name"]] = airflow_dag_factory.create_pyspark_dag()
Although this seems like quite a simple DAG, it is actually formed by 9 tasks (not counting task groups). There is a stark difference with the code required for a standard Airflow DAG definition: the DagFactory script is much shorter, reducing the cognitive load required to understand the structure and lowering the chance of introducing errors.
As a parallel benefit, this package has brought a high level of standardization in the definition and operation of the Data Products. Before DagFactory, the DAGs defined by the different domains, and even those within the same domain, had significant differences in the grouping of tasks, naming, parameterization and other aspects. This made operating the DAGs quite dangerous, as mistakes were relatively easy to make, even leading to accidental deletion of data. After implementing DagFactory, all the workflows behave in the same way, and anyone can operate them with confidence that no unexpected side effects will occur.
Another benefit of the standardization is that the blocks of tasks forming the general structure of the DAGs have the same name across all of the data pipelines. In particular, the block of transformations always ends with a “transformations_end” task. This has been crucial to facilitate the creation of early alerts for DAG failures in observability tools using only standard Airflow metrics.

In summary, DagFactory has been an accelerator for the Data Engineers tasked with creating Data Products.
Checkpoints and sensors to manage DAG dependencies
Another component that has been developed in Glovo is an alternative way to ensure that the data dependencies for the transformations contained in a DAG are met before starting to process them:
- In every DAG, a custom operator writes a small JSON file indicating the time at which it executed. We call these files checkpoints, and the operator is named CheckpointerOperator.
- Each dependent DAG can use custom sensor operators that understand this file format and are able to determine whether the execution can take place or should be kept on hold while the dependencies finish. We call this sensor operator CheckpointSensor.
A checkpoint file looks like this:

{
  "data": {
    "updated_at": "2024-09-24T08:07:51.058120+00:00",
    "backfilling": false
  }
}
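For illustration only, a highly simplified version of such a sensor could look like the sketch below. The real CheckpointSensor reads checkpoints from cloud storage and also handles backfilling flags, freshness and timeout settings, so the local file handling and the parameters shown here are assumptions:

import json
from datetime import datetime, timedelta, timezone

from airflow.sensors.base import BaseSensorOperator


class SimpleCheckpointSensor(BaseSensorOperator):
    """Waits until a checkpoint file exists and is fresh enough (illustrative sketch)."""

    def __init__(self, checkpoint_path: str, freshness_hours: int = 8, **kwargs):
        super().__init__(**kwargs)
        self.checkpoint_path = checkpoint_path
        self.freshness_hours = freshness_hours

    def poke(self, context) -> bool:
        try:
            with open(self.checkpoint_path) as handle:
                checkpoint = json.load(handle)
        except FileNotFoundError:
            return False  # the dependency has not produced a checkpoint yet
        updated_at = datetime.fromisoformat(checkpoint["data"]["updated_at"])
        age = datetime.now(timezone.utc) - updated_at
        return age <= timedelta(hours=self.freshness_hours)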
These custom checkpoints allow greater flexibility than the standard solutions, as they decouple the Data Product contents from the DAG that generates them. That is, they work at the table level and free the owners of a Data Product to decide how it is computed without affecting their consumers. As a consequence, they favor the separation into domains that is key to our data strategy.
In the general structure of a DAG we saw how the sensors are the first set of tasks to be run. As for the checkpoints, they are created as part of the transformations group of tasks. This group is formed by chaining together different transformation blocks, each of them composed of three stages:
- The computation of the transformation, either a PySpark step or a set of dbt models.
- A data quality assessment of the transformed data.
- The creation of a checkpoint file.
If the computation of the transformation or the data quality assessment fails, the checkpoint is not generated, and downstream consumers are not signaled that a particular Data Product output is ready for consumption. A simplified sketch of one such block follows below.
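As a rough sketch of this three-stage pattern (the real DAGs use Glovo’s internal PySpark/dbt, data quality and CheckpointerOperator tasks, so the operators and task ids below are mere placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_transformation_block",
    schedule_interval=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    compute = EmptyOperator(task_id="courier_distances__compute")         # PySpark step or dbt models
    quality_check = EmptyOperator(task_id="courier_distances__dq_check")  # data quality assessment
    checkpoint = EmptyOperator(task_id="courier_distances__checkpoint")   # writes the checkpoint JSON
    transformations_end = EmptyOperator(task_id="transformations_end")

    # If compute or quality_check fails, checkpoint never runs, so downstream
    # Data Products receive no readiness signal for this output.
    compute >> quality_check >> checkpoint >> transformations_end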

Different 3-step transformation blocks can be linked according to their dependencies so that the overall process is performed in the right order. Also, transformations computing more than a single output can be split into parallel blocks to allow a more granular control of the checkpoints. In this case, some transformation blocks may fail, but checkpoints would still be generated for the successful blocks. This pattern increases the robustness of the Data Mesh, as subsequent Data Product DAGs dependent on the successful outputs can start their updates.


Checkpoints along with transformation splitting have increased the overall availability of data, which would be seriously impaired if checkpoints worked only at the Data Product level.
The next iteration: more democratization and autonomy
The first implementation of Data Mesh has been so successful that Glovo has developed an even simpler approach based on purely declarative interfaces, as was publicly introduced in our Data Experts’ RoundTable Meetup some months ago. The main advantage of this approach is a higher abstraction layer over the way of defining data transformations and the underlying infrastructure that runs them. As a consequence, the technical skills and the cognitive load needed to build a data product are highly reduced. This has produced a powerful democratization of data, and a surge in the value of the Data Mesh (which is based on the number of meaningful relationships among data products, not on the number of data products per se [11]).
The comparison below shows the differences between the first and the second approaches to Data Mesh:
- In Data Mesh v1 the data product creators were only Data Engineers, whereas in the Declarative Data Mesh they can be anyone with SQL and basic coding skills.
- In Data Mesh v1 the infrastructure management was distributed and handled by each data domain, whereas in the Declarative Data Mesh it belongs to a centralized data platform.
- In Data Mesh v1 the computing engine availability was limited and created on demand, whereas in the Declarative Data Mesh it is always on.
- In Data Mesh v1 the orchestration was managed by each domain, whereas in the Declarative Data Mesh it belongs to a centralized data platform.
- In Data Mesh v1 there was no golden path or standardized structure, whereas in the Declarative Data Mesh they are well-defined and enforced.
- In Data Mesh v1 the maintainability was low due to the lack of standardized structure, whereas in the Declarative Data Mesh it is high due to the centralization of infrastructure and orchestration.
- In Data Mesh v1 the cost efficiency was low and managed by each domain, whereas in the Declarative Data Mesh it is high and centrally managed.
- In Data Mesh v1 the technical complexity and the testability were high, whereas in the Declarative Data Mesh they are low.
- In Data Mesh v1 the time to develop a data product was in the order of hours to weeks, whereas in the Declarative Data Mesh it is in the order of minutes to hours.
Airflow plays a crucial role in Glovo’s data platform. Each declarative Data Product is mapped to a DAG in a centralized Airflow instance, which is one of the main visible interfaces of the data platform (the other being the query/computing engine). Owners have full capacity to operate the DAGs of their data products: clearing tasks or DAG runs, marking them as successes or failures, triggering manual runs, and enabling or disabling entire DAGs. This capability is in line with the “domain ownership” and “self-serve data platform” principles of Data Mesh: owners cannot ensure timeliness and quality of the data they are responsible for if they are not able to operate their data products effectively. A great responsibility needs to bring great power along.
Declarative DAG definition
In the new approach, DAGs are automatically created from a declarative definition of the tasks to be executed. Only an internal Python package defining an SDK for Data Product creation is required to start building pipelines, as illustrated in the following code fragment:
from glovo_data_platform.declarative.manager import DataProductManager
from glovo_data_platform.declarative.utils import print_deployment_info


def data_product_definition() -> DataProductManager:
    data_product_manager = DataProductManager(
        domain="growth",
        name="sample_ddp_scripting",
        owner="pablo.rodriguez@glovoapp.com",
        tier="t2",
        contacts=[
            {"kind": "email", "value": "ga.eng@glovoapp.com"},
            {"kind": "email", "value": "pablo.rodriguez@glovoapp.com"},
        ],
    )
    data_product_manager.add_sql_transformation(
        data_classification="l0",
        sql="""SELECT
                   gsc_date,
                   COUNT(1) as cnt_records
               FROM
                   "delta"."growth_master_attribution_odp"."google_search_console"
               GROUP BY gsc_date""",
        partition_by=[],
        target_table="summary_gsc",
        write_mode="FULL",
        is_odp=True,
    )
    return data_product_manager


if __name__ == "__main__":
    schedule = None
    publish = False
    creation_reason = "Testing creation of T2 DDPs through scripting."
    revision_name = None

    data_product_manager = data_product_definition()
    revision = data_product_manager.submit(
        schedule=schedule,
        publish=publish,
        creation_reason=creation_reason,
        revision_name=revision_name,
    )
    print_deployment_info(revision)
The path from this code to an Airflow DAG is not straightforward, although this complexity is hidden from the Data Product creator. The SDK first encodes all the definitions in a JSON DTO (Data Transfer Object). It then invokes an internal API to deploy the Data Product. Finally, the SDK communicates the result of the deployment back to the Data Product developer.
The internal API is in reality the gateway to the Meshub system. Meshub stores the defining characteristics of the Data Products, manages their lifecycle, and provides information for Data Mesh Governance purposes, among several other functions. When deploying a Data Product, the appropriate lifecycle management methods of Meshub validate the encoded Data Product definition, attach the relevant parameters of the different Airflow operators to be used, and copy the modified JSON DTO into a Python file ready for Airflow to parse as a DAG. The following code block shows the file generated from the sample declarative Data Product code illustrated above.
from glovo_data_platform.declarative_airflow.dag_creator import build_dags

REVISION_DEPLOY_SPEC_JSON = r"""
{
  "revision": {
    "revision_id": "2af464bc-96f5-4443-b73f-f0507a61fdde",
    "data_product": {
      "domain": "growth",
      "name": "sample_ddp_scripting",
      ...
}
"""

globals().update(build_dags(REVISION_DEPLOY_SPEC_JSON, __file__))
The conversion of the file to an actual Airflow DAG is performed by a custom package at every file processing cycle. This step translates the JSON DTO into Airflow operators, relationships, and parameters. Additional operators for execution control are also added. The following DAG is the result of processing the Python file shown above:

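Conceptually, and greatly simplified, the translation performed by build_dags resembles the sketch below. The spec fields used here (“dag_id”, “schedule”, “tasks”, “depends_on”) are assumptions; the real DTO is richer and is translated into many different operator types:

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def build_dags_sketch(revision_spec_json: str) -> dict:
    """Turn a JSON revision spec into Airflow DAG objects (illustrative sketch only)."""
    spec = json.loads(revision_spec_json)
    dag = DAG(
        dag_id=spec["dag_id"],
        schedule_interval=spec.get("schedule"),
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )
    tasks = {}
    for task_spec in spec["tasks"]:
        # The real package instantiates SQL, sensor and execution-control operators here.
        tasks[task_spec["id"]] = EmptyOperator(task_id=task_spec["id"], dag=dag)
    for task_spec in spec["tasks"]:
        for upstream_id in task_spec.get("depends_on", []):
            tasks[upstream_id] >> tasks[task_spec["id"]]
    # The returned mapping is merged into globals() so the scheduler discovers the DAG.
    return {dag.dag_id: dag}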
The whole process is summarized in the next diagram:

Checkpoints and sensors to manage DAG dependencies
The way to handle DAG dependencies has evolved towards a more orchestrator-independent solution, in exchange for more cloud-dependent components. In particular, the checkpointing subsystem leverages several services from AWS, Glovo’s cloud provider. Equivalent components can be found in other cloud providers.
The following diagram shows this process schematically:

As Data Product outputs are ultimately daily-partitioned Delta Lake files written to AWS S3, the checkpointing system is notified whenever a Delta changelog file is added. This triggers an AWS Lambda function that processes the changelog and extracts the paths of the modified partitions. These paths are then looked up in AWS Glue to obtain the database and table names. Finally, the information about which table and partitions have been modified is recorded in a DynamoDB table for later usage.
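A bare-bones sketch of such a Lambda handler is shown below. The bucket layout, the way database and table names are derived (the real function resolves them through Glue and parses the changelog contents) and the DynamoDB schema are all assumptions:

import boto3

dynamodb = boto3.resource("dynamodb")
checkpoint_table = dynamodb.Table("data_product_checkpoints")  # hypothetical table name


def handler(event, context):
    """Record which table changed whenever a Delta changelog file lands in S3 (sketch)."""
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        # e.g. "order_flow/courier_distances/_delta_log/00000000000000000123.json"
        database, table = key.split("/")[0], key.split("/")[1]
        checkpoint_table.put_item(
            Item={
                "table_name": f"{database}.{table}",
                "changelog_key": key,
                "updated_at": record["eventTime"],
            }
        )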

Downstream Data Products can check whether their dependencies have completed their processes through a custom CheckpointSensor operator:

The custom CheckpointSensor operator queries DynamoDB for the existence of a partition of a particular table. The interplay between Airflow macros and partition names allows checking whether the daily data required for a given execution is ready or not. For example, in the code below {{ data_interval_start | ds }} renders as the start date of the run’s data interval in YYYY-MM-DD format, so the sensor waits for exactly the daily partition being processed:
wait_for_order_descriptors_v2 = data_product_manager.add_wait_for_table(
    domain="central",
    data_product="order_descriptors",
    table="order_descriptors_v2",
    partitions=["p_creation_date={{ data_interval_start | ds }}"],
)
Conclusion: the role of Airflow in Glovo’s Data Mesh
Whether in the first implementation of Data Mesh or in the declarative approach, Airflow is a cornerstone of Glovo’s data architecture. Going beyond the already powerful features of Airflow, Glovo has built custom components to handle dependencies between the DAGs that orchestrate the computation of Data Products, and has designed abstractions that simplify the definition of DAGs in order to reduce the cognitive load of building Data Products.
Airflow is the main interface to inspect and operate the computation of most of the Data Products that compose Glovo’s Data Mesh. Understanding how Airflow works is crucial, as anyone with basic coding skills is now able to create Data Products, bringing a true democratization of data transformation and usage across the company.
References
[1] https://airflow.apache.org/docs/apache-airflow/stable/index.html
[2] https://airflow.apache.org/docs/apache-airflow/stable/project.html#history
[3] https://gradientflow.com/wp-content/uploads/2022/06/GradientFlow-2022-Workflow-Orchestration-Report.pdf
[4] https://6sense.com/tech/workflow-automation
[5] https://en.wikipedia.org/wiki/Orchestration_(computing)
[6] https://www.reddit.com/r/dataengineering/comments/uvckp1/can_someone_please_explain_orchestration_and_why/
[7] https://www.ascend.io/blog/what-is-data-pipeline-orchestration-and-why-you-need-it/
[8] https://martinfowler.com/articles/data-mesh-principles.html
[9] https://github.com/astronomer/dag-factory
[10] https://github.com/rambler-digital-solutions/airflow-declarative
[11] Dehghani, Zhamak. Data Mesh: Delivering Data-Driven Value at Scale. O’Reilly Media, Inc., March 2022. ISBN: 9781492092391.