DevOps
Business Honor
14 August, 2025
Agoda unveils a custom DevOps-driven Kafka failover system for seamless, resilient multi-data center streaming.
Agoda's engineering team has released a Kafka failover and failback solution built to keep consumers running smoothly across multiple on-premise data centers, an effort that reflects a substantial investment in DevOps and site reliability engineering (SRE) practices. Handling more than 3 trillion Kafka records per day, Agoda needed its streaming data pipelines to remain available during outages without data loss or duplication. Existing tools such as Kafka's stretch clusters and MirrorMaker 2 (MM2) did not meet Agoda's requirements: stretch clusters suffer from latency between geographically separated sites, and MM2 lacks bidirectional offset synchronisation.
To close this gap, Agoda built on MM2, developing a custom sync service with Kafka Connect and OffsetSync to enable automated failover and transparent failback. The DevOps-driven solution is based on real-time, two-way synchronization of consumer group offsets across clusters. When failover takes place, the secondary Kafka cluster resumes processing from the exact offset where the primary left off. When the primary resumes operation, offsets are reverse-translated for seamless failback, which removes the need for manual intervention and avoids the risk of reprocessing.
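The idea of translating a consumer group's offset between clusters, and reversing that mapping for failback, can be illustrated with a minimal sketch. This is not Agoda's implementation or MM2's API: the `SyncPoint` shape, the function names, and the sample offsets are all hypothetical, and the sketch assumes records after a sync point were replicated one-to-one, which would not hold if records were skipped or compacted.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical data shape: each sync point pairs an offset in the source
# cluster with the offset of the same record in the target cluster.
SyncPoint = Tuple[int, int]


def translate_offset(sync_points: List[SyncPoint], committed: int) -> Optional[int]:
    """Map a committed source-cluster offset to the target cluster.

    sync_points must be sorted by source offset. Returns None when no sync
    point at or below `committed` exists, so no translation is possible yet.
    """
    sources = [src for src, _ in sync_points]
    idx = bisect_right(sources, committed) - 1  # latest sync point <= committed
    if idx < 0:
        return None
    src, dst = sync_points[idx]
    # Assumption: offsets advance in lockstep after the sync point.
    return dst + (committed - src)


def reverse(sync_points: List[SyncPoint]) -> List[SyncPoint]:
    """Swap the mapping direction for failback, sorted by the new source offset."""
    return sorted((dst, src) for src, dst in sync_points)


# Made-up sync points from the primary cluster to the secondary cluster.
primary_to_secondary = [(100, 40), (200, 140), (300, 245)]

# Failover: the group had committed offset 250 on the primary.
failover_offset = translate_offset(primary_to_secondary, 250)

# Failback: translate the secondary's position back to the primary.
failback_offset = translate_offset(reverse(primary_to_secondary), failover_offset)
```

With these sample values, the round trip lands back on the original primary offset, which is the property that lets a consumer fail over and fail back without reprocessing or skipping records.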
At the core of the system's resilience is its observability plane. Targeted Grafana dashboards track replication lag, sync status, and consumer lag, so the platform can detect and respond to anomalies early, which is a core SRE task.
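As a rough illustration of one of these metrics, consumer lag is commonly computed per partition as the log end offset minus the group's committed offset. The sketch below uses hypothetical function names, offsets, and a made-up alert threshold. It does not reflect Agoda's dashboards, and it assumes a partition with no committed offset is counted from offset 0.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus the group's committed offset."""
    # Assumption: a partition with no committed offset is counted from 0.
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def partitions_over_threshold(lag: Dict[int, int], threshold: int) -> List[int]:
    """Partitions whose lag exceeds the alert threshold, in ascending order."""
    return sorted(p for p, value in lag.items() if value > threshold)


# Made-up offsets for a three-partition topic.
end_offsets = {0: 1000, 1: 500, 2: 750}
committed = {0: 990, 1: 200}

lag = consumer_lag(end_offsets, committed)
alerts = partitions_over_threshold(lag, threshold=100)
```

A dashboard would typically plot values like `lag` over time and alert on sustained breaches rather than on a single sample.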
Agoda's architecture reflects the DevOps emphasis on building strong, observable, and automated infrastructure at scale. In contrast to Netflix or Uber, where replay or idempotency takes higher precedence, Agoda's system aims for real-time correctness by keeping offsets synchronized at all times. This solution reflects the growing need for customized platform engineering in modern DevOps, especially for businesses running high-throughput, mission-critical streaming platforms across multiple data centers.