
Aurora MySQL at Glovo — The Foundation

17.03.2025 | by Nishaad Ajani

Let me take you back to a time when managing Aurora MySQL databases at Glovo felt like wrestling with a growing beast. It was 2021, and our small Platform team was juggling a rapidly expanding fleet of databases with tools that, while powerful, were showing their limits. Every new challenge — scaling clusters, rolling out updates, handling upgrades — felt like a mountain we had to climb manually, armed with Terraform, custom scripts and a lot of caffeine.
We knew there had to be a better way. We dreamed of a system where managing Aurora MySQL clusters didn’t require late-night interventions or painstaking coordination across teams. What if we could build something that just worked — automatically, safely, and at scale? That dream led us down an ambitious path, one where a handful of engineers would build a Kubernetes operator that changed everything.
This blog series is the story of that journey. It’s about how a small team tackled big problems, transforming database management at Glovo from a tedious manual process into a seamless, automated system. It’s a story of innovation, challenges, and the power of leveraging Kubernetes to not just solve problems but create a foundation for future growth. Join me as we dive into how we built this operator, the impact it had, and what we learned along the way.

The Challenge: Growing Pains in a Rapidly Expanding Company

Back in early 2021, our database infrastructure at Glovo was manageable — barely. With just a handful of Aurora MySQL clusters, a single Terraform module was enough to keep things running. But as Glovo grew, the cracks in this setup started to show, and what once felt straightforward turned into a maze of complexity.
It began with distributed configurations. Each team owned its own git repository and Terraform workspace, which sounded great for autonomy but quickly turned into a headache. Rolling out a simple update meant tracking down dozens of configurations, hoping nothing broke along the way. It wasn’t long before essential tasks — like scaling, backups, and reboots — became anything but straightforward. These jobs ate up hours of engineering time, and as the number of clusters grew, so did the grind.
The real pain came with major version upgrades. MySQL upgrades are tricky at the best of times, but doing them manually, often late at night to avoid disrupting traffic, was downright brutal. And then there were the inevitable mishaps — a misplaced configuration or a poorly reviewed Terraform apply could mean downtime or worse, leaving us scrambling to recover a deleted database cluster.
As our database fleet ballooned to over 200 clusters, even simple updates became cumbersome and error-prone, taking weeks to roll out across all teams. It was clear that the system we’d relied on for so long just wasn’t built to handle this level of growth. We needed a new approach, one that didn’t just patch over the problems but completely rethought how we managed our databases. It was time to scale smarter, not harder.
Terraform Scalability Challenges
Terraform was our trusty tool for managing infrastructure, but as our needs grew, we started to hit its limits. It’s great for describing the end state you want — “Make it so!” — but not so much for handling the messy in-between. Managing Aurora MySQL clusters highlighted these gaps, especially when we tried to scale.
Take complex business logic, for example. Imagine you need to change the instance type of a database cluster, but only during a specific maintenance window. Terraform doesn’t natively support adding that kind of conditional logic. Either you manually intervene at just the right time or lean on AWS features, like maintenance scheduling, when they’re available. And if they aren’t? You’re stuck with manual effort and a bit of hope.
Then there were the orchestration challenges. For operations like scaling or major version upgrades, we often needed multi-step workflows. A task as simple as resizing an instance might involve draining traffic, updating configurations, restarting instances, and checking everything is back online — steps Terraform can’t sequence dynamically. This left us juggling AWS automation tools where possible and writing custom scripts to fill in the gaps.
Perhaps the trickiest part was state management. Terraform’s state file is great for tracking what’s been done but doesn’t handle transitions well. If an instance type change fails during the AWS-scheduled maintenance window, Terraform might still think everything’s fine because the desired state was technically applied. Recovery often meant rolling up our sleeves to manually fix state files — a risky, tedious process.
It became clear that while Terraform was powerful, it wasn’t designed for the dynamic, time-sensitive workflows that managing Aurora MySQL at scale demanded. We needed something more — something that could handle the transitions, incorporate business logic, and still let us leverage Terraform’s strengths. That’s where our Kubernetes operator came into play.

Interim Solutions: Bridging the Gap

We knew we couldn’t solve all our challenges overnight, so we introduced several interim measures to ease the growing pains and reduce operational overhead. These stopgap solutions weren’t perfect, but they gave us the breathing room we needed to keep things running as our infrastructure scaled.
One of the key changes was making better use of AWS maintenance windows. Instead of leaving them unmanaged, we scheduled them during low-traffic hours to reduce risk and improve efficiency. By carefully distributing updates, we minimised the impact of potential issues and ensured non-critical problems could be addressed promptly. This approach wasn’t revolutionary, but it was effective: it preserved high availability and gave teams a reliable, structured way to make certain changes with greater confidence.
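As a rough illustration of what that scheduling looked like in practice, here is a minimal sketch using the AWS SDK for Go v2 to pin a cluster’s preferred maintenance window to a low-traffic slot. The cluster identifier and window value are illustrative, not our actual configuration.

package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/rds"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := rds.NewFromConfig(cfg)

	// Move the cluster's maintenance window to a low-traffic slot (UTC).
	// AWS then applies pending maintenance actions within that window.
	_, err = client.ModifyDBCluster(ctx, &rds.ModifyDBClusterInput{
		DBClusterIdentifier:        aws.String("orders-db"),
		PreferredMaintenanceWindow: aws.String("mon:03:00-mon:04:00"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
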
Another improvement was a tool that partially automated MySQL version upgrades. It streamlined a notoriously complex process with a structured workflow (a simplified orchestration sketch follows the list):

  • Clone Creation: A new clone of the database was provisioned.

  • Upgrade Process: The clone was upgraded to the target MySQL version.

  • Binlog Replication: Synchronisation was maintained between the old and new clusters.

  • Integrity Checks: Data integrity was validated to catch issues early.

  • Traffic Cutover: With manual approval, traffic was shifted to the upgraded cluster.
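
Below is a simplified Go sketch of how such a workflow can be sequenced, with an explicit manual gate before cutover. The step bodies are placeholders for the real automation; only the ordering and the approval gate are the point.

package main

import (
	"context"
	"errors"
	"log"
)

type step struct {
	name string
	run  func(ctx context.Context) error
}

// upgradeCluster runs the workflow steps in order and stops at the first
// failure; the final cutover only proceeds with explicit approval.
func upgradeCluster(ctx context.Context, approved func() bool) error {
	steps := []step{
		{"clone creation", func(ctx context.Context) error { return nil /* provision a clone */ }},
		{"upgrade process", func(ctx context.Context) error { return nil /* upgrade clone to target version */ }},
		{"binlog replication", func(ctx context.Context) error { return nil /* keep old and new in sync */ }},
		{"integrity checks", func(ctx context.Context) error { return nil /* validate data */ }},
		{"traffic cutover", func(ctx context.Context) error {
			if !approved() {
				return errors.New("cutover requires manual approval")
			}
			return nil /* shift traffic to the upgraded cluster */
		}},
	}
	for _, s := range steps {
		log.Printf("running step: %s", s.name)
		if err := s.run(ctx); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := upgradeCluster(context.Background(), func() bool { return false }); err != nil {
		log.Fatal(err)
	}
}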


These were just two examples. Across the platform, we worked to streamline other operational tasks and build tools that tackled immediate pain points. From automating routine maintenance to refining monitoring and alerting, we made incremental improvements wherever we could.
While these measures helped reduce some of the toil and risks, they weren’t enough to address the underlying complexity of managing Aurora MySQL at scale. Each solution felt like a patch on a system that needed a complete rethink. We knew the only way forward was to build a more cohesive and automated approach — one that could handle the scale and complexity of our growing infrastructure. That vision set us on the path to creating our Kubernetes operator.
The Turning Point: Introducing a Kubernetes Operator
The breakthrough came with the decision to build a Kubernetes operator tailored to manage Aurora MySQL clusters. Kubernetes operators extend the Kubernetes API, encapsulating the logic required to automate the lifecycle of complex applications. This approach aligned perfectly with our goals:
Why Operators?
  • Automate complex, application-specific tasks (e.g., scaling, backups, upgrades).

  • Manage stateful applications (like databases) seamlessly in Kubernetes environments.

  • Provide consistent deployment and management across environments.

  • Encapsulate domain-specific knowledge, reducing manual interventions.
For Glovo, this meant transitioning from manual, distributed workflows to a centralised and automated control plane, tailored for Aurora MySQL at scale.
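To make the pattern concrete, here is a stripped-down reconcile loop in the style of controller-runtime. The AuroraResourceReconciler name and the helpers it would call are stand-ins; the real operator logic is more involved, but the shape is the same: observe, compare, converge, requeue.

package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// AuroraResourceReconciler drives an Aurora cluster towards the state
// declared in its custom resource.
type AuroraResourceReconciler struct {
	client.Client
}

func (r *AuroraResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the desired state declared by the developer (the CRD spec).
	// 2. Observe the actual state of the Aurora cluster (via Terraform/AWS).
	// 3. If they differ, plan and apply the change, respecting guardrails
	//    such as maintenance windows and manual-approval flags.
	// 4. Update the resource status and requeue to keep converging.
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
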
The First Generation: A Hybrid Approach
In its first iteration, our operator leveraged the existing Terraform module instead of building everything from scratch or relying on the AWS RDS operator. This allowed us to capitalise on the rich, business-critical logic already built into our Terraform setup, including:
  • Custom Metrics Collectors: Automated provisioning of Lambda functions to capture detailed InnoDB table and query-level metrics that went beyond CloudWatch’s default capabilities.

  • MySQL Partition Rotation: Lambda functions to automate the creation and rotation of MySQL range partitions, optimising query performance and storage retention for time-series data (a rotation sketch follows this list).

  • Disaster Recovery Readiness: Support for provisioning Aurora global clusters, ensuring a robust setup in our disaster recovery (DR) region.
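
As a flavour of what the rotation Lambda did, here is a rough Go sketch that pre-creates the next daily range partition and drops the one falling out of retention. The table and partition naming, and the assumption that the job runs once a day, are illustrative only.

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// rotateDailyPartitions adds the partition bufferDays ahead and drops the
// partition that has aged out of the retention window.
func rotateDailyPartitions(db *sql.DB, table string, retentionDays, bufferDays int) error {
	now := time.Now().UTC()

	// Pre-create the partition covering the day bufferDays from now.
	next := now.AddDate(0, 0, bufferDays)
	add := fmt.Sprintf(
		"ALTER TABLE %s ADD PARTITION (PARTITION p%s VALUES LESS THAN (TO_DAYS('%s')))",
		table, next.Format("20060102"), next.AddDate(0, 0, 1).Format("2006-01-02"))
	if _, err := db.Exec(add); err != nil {
		return err
	}

	// Drop the partition that has just fallen outside the retention window.
	expired := now.AddDate(0, 0, -(retentionDays + 1))
	drop := fmt.Sprintf("ALTER TABLE %s DROP PARTITION p%s", table, expired.Format("20060102"))
	_, err := db.Exec(drop)
	return err
}

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(localhost:3306)/orders")
	if err != nil {
		log.Fatal(err)
	}
	if err := rotateDailyPartitions(db, "my_table", 7, 5); err != nil {
		log.Fatal(err)
	}
}
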
However, we designed the architecture to clearly separate developer responsibilities from platform management, ensuring simplicity and safety.
Developer-Centric YAML Configuration
Developers interacted with the system via a minimal YAML configuration stored directly in their service repositories. This specification included only the details they cared about, such as instance size, scaling limits, and partitioned tables. For example:

apiVersion: storage.platform.glovoapp.com/v1alpha1
kind: AuroraResource
metadata:
  name: orders-db
spec:
  version: 5.7.mysql_aurora.2.11.2
  instanceClass: db.r6g.large
  scaling:
    targetCpuUsage: 70
    minReaders: 1
    maxReaders: 3
  parameters:
    maxConnections: 200
  mysqlPartitionedTables:
    - name: my_table
      intervalType: "DAY"
      intervalFormat: "Snowflake"
      retention: 7
      buffer: 5

This approach allowed product teams to define their database requirements declaratively, abstracting away the complexities of underlying infrastructure.
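On the operator side, a spec like this typically maps onto Go types used to generate the CRD schema. The sketch below simply mirrors the YAML keys above; the actual type definitions in our operator may differ.

package v1alpha1

// AuroraResourceSpec mirrors the developer-facing YAML shown above.
type AuroraResourceSpec struct {
	Version                string             `json:"version"`
	InstanceClass          string             `json:"instanceClass"`
	Scaling                ScalingSpec        `json:"scaling,omitempty"`
	Parameters             ParametersSpec     `json:"parameters,omitempty"`
	MySQLPartitionedTables []PartitionedTable `json:"mysqlPartitionedTables,omitempty"`
}

type ScalingSpec struct {
	TargetCpuUsage int `json:"targetCpuUsage"`
	MinReaders     int `json:"minReaders"`
	MaxReaders     int `json:"maxReaders"`
}

type ParametersSpec struct {
	MaxConnections int `json:"maxConnections"`
}

type PartitionedTable struct {
	Name           string `json:"name"`
	IntervalType   string `json:"intervalType"`
	IntervalFormat string `json:"intervalFormat"`
	Retention      int    `json:"retention"`
	Buffer         int    `json:"buffer"`
}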

Platform-Controlled Terraform Repository

On the platform side, all Terraform code was centralised in a dedicated repository managed by the Platform team. This repository contained all the Terraform configurations for Aurora MySQL clusters, which were automatically generated by the Kubernetes operator based on the developer-provided YAML specifications.
The repository served as a standardised and centralised home for all database clusters, replacing the fragmented, team-specific Terraform setups that had been manually maintained before. This approach allowed us to:
  • Provision and Update Aurora Clusters: Automatically translate YAML configurations into Terraform code to handle cluster lifecycle tasks.

  • Roll Out Updates Faster: Updates to our custom metrics collectors and other advanced functionality could be rolled out quickly and transparently to developer teams.

  • Enforce Guardrails: Use automated checks to validate Terraform plans, ensuring safety and consistency.
All configurations for database clusters were linked to their own Terraform Cloud workspace, creating a controlled environment for running plans and applies. The lift-and-shift process brought all existing Terraform configurations under a single, standardised structure, ensuring consistency across all database clusters.
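A minimal sketch of that translation step: the operator renders Terraform from the developer spec and commits it to the central repository. The module source, variable names, and output handling below are hypothetical; they only show the "YAML in, HCL out" idea.

package main

import (
	"os"
	"text/template"
)

const moduleTmpl = `module "{{ .Name }}" {
  source         = "../../modules/aurora-mysql"
  engine_version = "{{ .Version }}"
  instance_class = "{{ .InstanceClass }}"
  min_readers    = {{ .MinReaders }}
  max_readers    = {{ .MaxReaders }}
}
`

type clusterSpec struct {
	Name          string
	Version       string
	InstanceClass string
	MinReaders    int
	MaxReaders    int
}

func main() {
	spec := clusterSpec{
		Name:          "orders-db",
		Version:       "5.7.mysql_aurora.2.11.2",
		InstanceClass: "db.r6g.large",
		MinReaders:    1,
		MaxReaders:    3,
	}
	tmpl := template.Must(template.New("module").Parse(moduleTmpl))
	// In the real operator the rendered file would be committed to the
	// central Terraform repository; here we just print it.
	if err := tmpl.Execute(os.Stdout, spec); err != nil {
		panic(err)
	}
}
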


This setup completely eliminated the need for product developers to interact directly with Terraform, reducing errors and freeing them to focus on their applications. Instead, their simple YAML configurations drove the entire process, with the operator handling the generation and application of Terraform code behind the scenes.

How It Worked

Developer Workflow:
  • Developers updated their database configurations in a minimal YAML file located in their service repositories.

  • These changes triggered GitHub Actions, which synced the YAML to the corresponding Kubernetes CRD.

  • The operator then took over, orchestrating the necessary Terraform updates in the platform’s central repository and managing the lifecycle of the database cluster.

  • Status updates were reported back to the developer repository via GitHub commit statuses, providing visibility into the progress and outcome of the changes.
Centralised CI/CD Pipeline:
  • The operator translated the developer’s YAML spec into standardised Terraform configurations and committed these to the platform’s central Terraform repository.

  • Updates were validated in Terraform Cloud workspaces, enforcing safety and consistency:

      • Sentinel Checks: Automatically blocked unsafe changes, such as accidental deletions or misconfigurations.

      • Automated PR Validation: Ensured all changes adhered to predefined standards before being merged and applied.
Manual Review When Needed:
For high-impact changes — such as major version upgrades, provisioning global clusters, or adjusting disaster recovery setups — the system flagged updates for manual review and approval to ensure additional oversight.
Safe Application:
Once validated, the operator applied the changes via Terraform Cloud, ensuring consistent and safe updates across all environments. Developers could monitor the entire process through the commit status updates in their service repository, ensuring transparency without requiring direct interaction with the Terraform workflows.



Challenges Faced Along the Way

Building the Kubernetes operator wasn’t without its hurdles. One particularly tricky challenge arose from how Terraform Cloud interacted with GitHub. As we scaled, we ran into significant bottlenecks caused by GitHub API rate limits.
Here’s what happened:
  • Rate Limiting on GitHub API: Terraform Cloud frequently updated Git commit statuses to report the state of each workspace. However, as our fleet of Aurora MySQL clusters grew, these calls overwhelmed the GitHub API, triggering rate limits.

  • Unintended Consequences: When Terraform Cloud hit the rate limit, it couldn’t accurately detect which files had changed. Instead of running plans only for the affected database, it would trigger plans for all workspaces in the central repository. This created a cascade of issues:

      • The Terraform apply queue became overwhelmed, blocking changes from other teams.

      • With limited Terraform Cloud agents, critical updates were delayed, impacting productivity across multiple projects.
The Solution: Smarter Commit Status Updates
To address this, we made a critical adjustment:
  • We updated the operator to enable commit status updates in Terraform Cloud workspaces only on demand.

  • For any database change, the operator dynamically toggled this setting to ensure that only the affected workspace updated Git commit statuses.
This adjustment drastically reduced the number of API calls to GitHub, avoiding rate limits and ensuring Terraform Cloud only processed the necessary plans. It also prevented the apply queue from being flooded, allowing teams to work without interference.
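Conceptually, the toggle is a small workspace update issued right before the operator pushes a change for a given database, and reverted afterwards. The sketch below calls Terraform Cloud’s workspace-update endpoint; the attribute name used for the commit-status setting is a placeholder, not the real API field.

package operator

import (
	"bytes"
	"fmt"
	"net/http"
)

// setCommitStatusUpdates flips a per-workspace setting so that only the
// workspace being changed reports commit statuses back to GitHub.
// "vcs-status-updates-enabled" is a hypothetical attribute name.
func setCommitStatusUpdates(org, workspace, token string, enabled bool) error {
	url := fmt.Sprintf("https://app.terraform.io/api/v2/organizations/%s/workspaces/%s", org, workspace)
	body := fmt.Sprintf(`{"data":{"type":"workspaces","attributes":{"vcs-status-updates-enabled":%t}}}`, enabled)

	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/vnd.api+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("workspace update failed: %s", resp.Status)
	}
	return nil
}
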
This challenge highlighted the complexities of integrating multiple systems at scale, but it also reinforced the value of automation. With this workaround, we ensured our operator could continue to scale alongside Glovo’s growing infrastructure needs.

The Results: A Transformed Landscape

The introduction of the Kubernetes operator was a game-changer for how we manage Aurora MySQL clusters at Glovo. What started as a small-scale experiment soon became the backbone of our database infrastructure. Here’s what changed:
  • Centralised Control: Gone were the days of fragmented configurations. Now everything was unified — one consistent approach to managing all clusters across the platform.

  • Reduced Toil: Routine tasks like Terraform module updates became automated, giving engineers more time to focus on strategic projects that added value.

  • Enhanced Safety: Built-in guardrails, canary releases of new Terraform changes, and automated checks dramatically reduced the risk of human error, ensuring safer deployments and fewer incidents.

  • Improved Developer Experience: With simple YAML files in their service repositories, developers no longer needed to worry about the complexities of Terraform or underlying infrastructure. They could self-service their database needs, boosting productivity and reducing friction.
This shift didn’t just streamline operations — it reshaped how we think about infrastructure management. The operator turned a complicated, manual process into something that scaled with us, providing reliability and efficiency.

Looking Ahead

This operator has laid the foundation for a more scalable and efficient infrastructure management system at Glovo. In Part 2, we’ll explore how this architecture enabled us to automate one of the most complex and critical tasks: MySQL version upgrades — and the advanced features we built to support product teams. Stay tuned!

Nishaad Ajani

Software Engineer in Barcelona at Glovo
