
The payment service is a Tier-0 service that orchestrates payment transactions at Hotstar and is critical for the purchase and renewal of subscriptions.

Throughput and availability are paramount, given its role in accepting payments across multiple payment methods and payment gateways.

As part of our routine high-velocity event readiness, we ran benchmarks to certify the throughput of the system. We discovered, to our surprise, that the system could not scale to our target transactions per second (TPS).

Below, we detail how we investigated the issue and scaled the system to handle the target TPS.

Fig 1. High-level Payments Architecture

Let’s briefly look at the payment service architecture before proceeding further (a minimal sketch of the request path follows the list):

  • Payment service is a Java Spring Boot application behind a load balancer. It accepts incoming transaction requests, routes each request to the appropriate payment gateway based on the user’s preferred payment mode, and creates and tracks the state of payment transactions.
  • Payment service persists transaction state in an Aurora MySQL cluster — consisting of a single writer and multiple readers.
  • Workers are responsible for running scheduled reconciliation with Payment Gateways and relaying payment confirmation notifications to dependent services.
  • Data persisted in MySQL is replicated in the data lake via the Change Data Capture (CDC) process using Debezium connectors and Kafka.
  • Kafka is also used to queue payment notifications.
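To make the request path concrete, here is a minimal, hypothetical sketch of the routing and state-tracking steps described above. The class, endpoint, and method names (PaymentController, GatewayClient, TransactionStore, /v1/payments) are illustrative and not the actual service code.

    import java.util.Map;

    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RestController;

    // Hypothetical abstractions; backed by Aurora MySQL in the real service.
    interface TransactionStore {
        String create(String userId, long amountPaise);        // INSERT, returns transaction id
        void updateState(String transactionId, String state);  // UPDATE
    }

    interface GatewayClient {
        String initiate(String transactionId, long amountPaise);
    }

    record PaymentRequest(String userId, String paymentMode, long amountPaise) {}
    record PaymentResponse(String transactionId, String state) {}

    @RestController
    class PaymentController {
        private final Map<String, GatewayClient> gatewaysByMode; // e.g. "UPI" -> UPI gateway client
        private final TransactionStore store;

        PaymentController(Map<String, GatewayClient> gatewaysByMode, TransactionStore store) {
            this.gatewaysByMode = gatewaysByMode;
            this.store = store;
        }

        @PostMapping("/v1/payments")
        PaymentResponse initiate(@RequestBody PaymentRequest request) {
            String txnId = store.create(request.userId(), request.amountPaise());  // one INSERT
            String state = gatewaysByMode.get(request.paymentMode())
                    .initiate(txnId, request.amountPaise());                       // route by payment mode
            store.updateState(txnId, state);                                       // one UPDATE
            return new PaymentResponse(txnId, state);
        }
    }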

We started the investigation by checking and eliminating some of the common suspects that impact application performance.

Scale-out Application

Experimented with increasing the number of pods, but there was no improvement.

Pod CPU & Memory

Analyzed individual pod CPU and memory usage but didn’t notice any anomalies. Even at peak load, both CPU and memory were well below 50% of maximum capacity.

Scale-up Aurora MySQL

Experimented with different instance types provided by AWS but didn’t see any improvement. The CloudWatch metrics showed no indication of a bottleneck in CPU, memory, or I/O; all of them were well within bounds.

Though it was not apparent, we suspected the database was the point of contention, since it is the only component that is not horizontally scalable, and we suspected we had hit some I/O bottleneck.

At this point, we started digging deeper into whether anything was causing contention on the database, even though the single-node MySQL instance was not showing any signs of a bottleneck.

The TPS on the instance was not high enough for the database to give up. We wanted to run a few sanity checks and eliminate unknowns before narrowing down the root cause.

Inspecting MySQL Process list

Took a dump of the MySQL process list to detect any lingering queries, but didn’t find any.
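For reference, a process-list dump of this kind can be taken over JDBC; the endpoint, credentials, and the 5-second threshold below are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ProcessListDump {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://<aurora-writer-endpoint>:3306/payments";
            try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW FULL PROCESSLIST")) {
                while (rs.next()) {
                    long timeSeconds = rs.getLong("Time");
                    // Flag anything running longer than a few seconds as a potential lingering query.
                    if (timeSeconds > 5) {
                        System.out.printf("id=%d user=%s state=%s time=%ds info=%s%n",
                                rs.getLong("Id"), rs.getString("User"),
                                rs.getString("State"), timeSeconds, rs.getString("Info"));
                    }
                }
            }
        }
    }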

Re-evaluating Indexes

Listed every query that our application could run on the DB and verified the indexes for each one against the production database.
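Index verification like this can be done by running EXPLAIN for each query; a hedged sketch is below, with the table, column, and connection details as placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplainCheck {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://<aurora-reader-endpoint>:3306/payments";
            String query = "EXPLAIN SELECT * FROM payment_transactions WHERE order_id = 'ORD-123'";
            try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(query)) {
                while (rs.next()) {
                    // 'key' shows the index actually chosen; 'rows' estimates rows examined.
                    System.out.printf("table=%s type=%s key=%s rows=%s%n",
                            rs.getString("table"), rs.getString("type"),
                            rs.getString("key"), rs.getString("rows"));
                }
            }
        }
    }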

Tuning DB connection pool

Changed the Tomcat-managed database connection pool properties, but saw no noticeable change in performance.
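For context, these are the kinds of Tomcat JDBC pool properties we mean; the values below are illustrative, not our production settings.

    import org.apache.tomcat.jdbc.pool.DataSource;
    import org.apache.tomcat.jdbc.pool.PoolProperties;

    public class PooledDataSourceFactory {
        public static DataSource create(String url, String user, String password) {
            PoolProperties p = new PoolProperties();
            p.setUrl(url);
            p.setUsername(user);
            p.setPassword(password);
            p.setDriverClassName("com.mysql.cj.jdbc.Driver"); // placeholder driver class
            p.setMaxActive(100);          // upper bound on concurrent connections
            p.setInitialSize(20);         // connections opened at startup
            p.setMinIdle(20);             // keep warm connections around between bursts
            p.setMaxWait(2000);           // ms to wait for a free connection before failing
            p.setTestOnBorrow(true);      // validate connections before handing them out
            p.setValidationQuery("SELECT 1");
            DataSource ds = new DataSource();
            ds.setPoolProperties(p);
            return ds;
        }
    }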

None of the above sanity checks eliminated the issue or provided hints for further debugging, so we went ahead with query instrumentation.

DB Query instrumentation

We instrumented the DB queries and noticed that some of the select queries were taking much more time than expected, as can be seen in Fig 2 from 19:30 to 19:40.

Fig 2. Database Query Latency
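The instrumentation itself can be as simple as a timer around each query. Below is a minimal sketch using Micrometer (one common choice in Spring Boot); the metric name, tag, and query are illustrative. Note that a timer placed like this also counts the time spent obtaining a pooled connection, which becomes relevant shortly.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import javax.sql.DataSource;

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;

    public class InstrumentedQueries {
        private final DataSource dataSource;
        private final MeterRegistry registry;

        public InstrumentedQueries(DataSource dataSource, MeterRegistry registry) {
            this.dataSource = dataSource;
            this.registry = registry;
        }

        public boolean transactionExists(String txnId) throws Exception {
            Timer timer = Timer.builder("db.query.latency")
                    .tag("query", "select_transaction_by_id")
                    .publishPercentiles(0.5, 0.99)   // expose p50/p99 like Fig 2
                    .register(registry);
            // The recorded time includes getConnection(), i.e. time waiting for a pooled connection.
            return timer.recordCallable(() -> {
                try (Connection conn = dataSource.getConnection();
                     PreparedStatement ps = conn.prepareStatement(
                             "SELECT 1 FROM payment_transactions WHERE id = ?")) {
                    ps.setString(1, txnId);
                    try (ResultSet rs = ps.executeQuery()) {
                        return rs.next();
                    }
                }
            });
        }
    }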

This was unusual: all of these queries used indexes and were very selective, so their high latency didn’t make much sense.

Hence we decided to confirm this on the database side.

We enabled slow query logging (latency > 50 ms) in the database. During the load test run we found that only insert/update queries showed up; the slow select queries we saw in the previous section were absent from the slow query logs.
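On Aurora, the slow query log and its threshold are typically configured through the DB parameter group (slow_query_log, long_query_time); the hedged snippet below only verifies the effective settings from the application side.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SlowLogCheck {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://<aurora-writer-endpoint>:3306/payments";
            try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SHOW GLOBAL VARIABLES WHERE Variable_name IN "
                         + "('slow_query_log', 'long_query_time', 'log_output')")) {
                while (rs.next()) {
                    // long_query_time of 0.05 corresponds to the 50 ms threshold used here.
                    System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
                }
            }
        }
    }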

We re-analysed the application-side telemetry and noticed that the slow select queries were also the top three queries by throughput, by a good margin over the rest.

We formed a hypothesis that the update/insert queries were the actual culprit. The application-side instrumentation reported the select queries as slow because, being the top three queries by throughput, they competed for database connections far more often than the update/insert queries.

So the latency visible in Fig 2 was mostly time spent acquiring a connection rather than running the query on the DB. We validated this by increasing the connection pool size.
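A hedged sketch of how connection-acquisition time can be separated from query-execution time to validate this; the metric names and query are illustrative.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.concurrent.TimeUnit;
    import javax.sql.DataSource;

    import io.micrometer.core.instrument.MeterRegistry;

    public class ConnectionVsQueryTiming {
        private final DataSource dataSource;
        private final MeterRegistry registry;

        public ConnectionVsQueryTiming(DataSource dataSource, MeterRegistry registry) {
            this.dataSource = dataSource;
            this.registry = registry;
        }

        public boolean transactionExists(String txnId) throws Exception {
            long t0 = System.nanoTime();
            try (Connection conn = dataSource.getConnection()) {
                long acquired = System.nanoTime();
                // Time spent waiting for a pooled connection.
                registry.timer("db.connection.acquire").record(acquired - t0, TimeUnit.NANOSECONDS);
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT 1 FROM payment_transactions WHERE id = ?")) {
                    ps.setString(1, txnId);
                    try (ResultSet rs = ps.executeQuery()) {
                        boolean found = rs.next();
                        // Time spent actually executing the query on the DB.
                        registry.timer("db.query.execute")
                                .record(System.nanoTime() - acquired, TimeUnit.NANOSECONDS);
                        return found;
                    }
                }
            }
        }
    }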

Though we were now sure that the updates and inserts were the source of the bottleneck, the reason was not clear yet. Was the insert throughput so high that MySQL could not handle it?

Dead End

We had tried every obvious way to debug the issue: scaling the database to the largest instance type, tuning the DB connection pool and application thread pool, verifying database indexes, tuning certain Aurora parameters, and looking for anomalies in RDS/app metrics. But nothing got us closer to the root cause.

We had indeed reached a dead end, but decided to dive deeper into Aurora’s architecture and explore the instrumentation tools provided by AWS.

We landed on Aurora Performance Insights provided by AWS to analyze database load.

Performance Insights (PI)

Performance Insights (PI) is an Aurora feature built on top of the MySQL performance schema that helps visualise database load across dimensions like wait conditions, SQL statements, etc. We enabled PI for Aurora MySQL and saw that most of the active sessions were waiting on the two conditions below:

  1. wait/io/aurora_redo_log_flush

The redo log ensures durability in MySQL by recording changes to physical data pages, which enables the DB to correct data written by incomplete transactions when restarting after a crash.

Changes to data pages are first written to a log buffer which is flushed to disk in the following scenarios:

  • Periodically, once per second, by the MySQL master thread
  • When the free space in the log buffer falls below half
  • On transaction commit

The above event is emitted when this log is flushed to disk. A large number of concurrent sessions waiting on redo log flush could therefore be due to a small log buffer size or a large number of commits.

The log buffer and redo log file sizes, although tunable in vanilla MySQL (and it is advised to do so), can’t be tuned in Aurora, so the only thing we could investigate further was the number of commits.

  2. wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done

The active binlog in MySQL has a mutex that synchronizes reader and writer sessions. This event showing up in Performance Insights indicates that a large number of concurrent sessions are competing to acquire the mutex, which could be a result of high commit throughput or a large number of consumers reading the binlog.
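Since Performance Insights is built on the performance schema, the same wait events can also be inspected directly with a query like the hedged sketch below (assuming the relevant wait instruments are enabled); the endpoint and credentials are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class WaitEventCheck {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://<aurora-writer-endpoint>:3306/performance_schema";
            // SUM_TIMER_WAIT is reported in picoseconds; divide by 1e12 for seconds.
            String query = "SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT / 1e12 AS total_wait_seconds "
                         + "FROM events_waits_summary_global_by_event_name "
                         + "WHERE EVENT_NAME LIKE 'wait/io/aurora%' "
                         + "   OR EVENT_NAME LIKE 'wait/synch/cond/sql/MYSQL_BIN_LOG%' "
                         + "ORDER BY SUM_TIMER_WAIT DESC LIMIT 10";
            try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(query)) {
                while (rs.next()) {
                    System.out.printf("%s count=%d wait=%.1fs%n",
                            rs.getString("EVENT_NAME"), rs.getLong("COUNT_STAR"),
                            rs.getDouble("total_wait_seconds"));
                }
            }
        }
    }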

Fig 3. MySQL Performance Insights

To confirm the hypothesis, we turned off the binlog and ran another load test. This wait condition vanished from the Performance Insights graph, as visible in the second half of Fig 3, and we were able to scale above our planned TPS.

Before diving into the root cause, let’s understand Aurora’s architecture in a bit more detail.

Fig 4. Aurora Architecture

Aurora decouples storage and compute: storage autoscaling is self-managed, and developers scale compute by increasing the instance size or adding more reader nodes. Unlike MySQL, where each replica has independent storage, storage in Aurora is shared amongst the writer and reader nodes.

Replication

The compute layer consists of one writer, and developers can add up to 15 reader nodes for scaling read operations and for high availability. Since the storage layer is shared amongst nodes, replication takes milliseconds, as updates made by the writer are almost instantly available to all readers. This also eliminates the need for binlog replication.

But then how do all the compute nodes synchronize in-memory data (log buffers, cached pages, etc.) to ensure consistency? Hitting the disk for every query would degrade performance.

This is achieved by transferring redo log records from the writer to other reader nodes as shown in Fig 4. Only the writer is responsible for persisting log records to disk.

Persistence

The writer instance sends redo log records to the storage layer, which consists of multiple storage nodes. The storage layer replicates this data to storage nodes across multiple Availability Zones (AZs) so that it can withstand the loss of an entire AZ.

Commit Commit Commit

From the query patterns, we noticed that on average there’s a fan-out of 1:10 from a payment transaction API call to DML operations on Aurora.

Since we tend to avoid transactions, each DML operation results in its own commit, which creates I/O and synchronization contention down the line.

High commit throughput results in a high number of redo log flushes to disk. Similarly, it also results in the binlog being flushed frequently, and using the ROW binlog format only amplifies the impact.

Of the two wait events detailed above, the first (wait/io/aurora_redo_log_flush) is a direct result of this, and the second (wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done) is a side effect of the same.

Reducing the number of commits

There are a few ways to reduce the commit throughput (a sketch of the latter two options follows this list):

  • Turning off auto-commit around multiple DML operations
  • Using Transactions to pack in multiple DML queries in a single commit operation
  • Batching multiple DML statements in a single query
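A minimal sketch of the second and third options with plain JDBC; the table and column names are illustrative, not our actual schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import javax.sql.DataSource;

    public class LowCommitWrites {

        // Option 2: several DML statements packed into a single commit.
        public static void recordAttempt(DataSource ds, String txnId, String gateway) throws Exception {
            try (Connection conn = ds.getConnection()) {
                conn.setAutoCommit(false);                 // stop committing per statement
                try (PreparedStatement insert = conn.prepareStatement(
                             "INSERT INTO payment_attempts (txn_id, gateway, state) VALUES (?, ?, 'INIT')");
                     PreparedStatement update = conn.prepareStatement(
                             "UPDATE payment_transactions SET state = 'ATTEMPTED' WHERE id = ?")) {
                    insert.setString(1, txnId);
                    insert.setString(2, gateway);
                    insert.executeUpdate();
                    update.setString(1, txnId);
                    update.executeUpdate();
                    conn.commit();                         // a single redo-log/binlog flush point
                } catch (Exception e) {
                    conn.rollback();
                    throw e;
                }
            }
        }

        // Option 3: batch many rows of the same statement into one round trip and one commit.
        public static void markNotified(DataSource ds, Iterable<String> txnIds) throws Exception {
            try (Connection conn = ds.getConnection()) {
                conn.setAutoCommit(false);
                try (PreparedStatement ps = conn.prepareStatement(
                             "UPDATE payment_transactions SET notified = 1 WHERE id = ?")) {
                    for (String id : txnIds) {
                        ps.setString(1, id);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                    conn.commit();
                } catch (Exception e) {
                    conn.rollback();
                    throw e;
                }
            }
        }
    }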

Though all of the above are viable, they require significant codebase changes and testing effort, since introducing transactions could change a lot of behaviour. So we started exploring simpler mitigation strategies.

Enabling Binlog I/O cache for Aurora

In Aurora, the binlog isn’t required for replication between nodes, but we knew it is used by the Debezium connectors to replicate data into the data lake, as shown in Fig 1. Also, the ROW binlog format is required by Debezium.

After some more digging, we landed on this AWS article, which describes a feature called binlog I/O cache, available in Aurora MySQL 2.10 onwards.

With this feature enabled, Aurora still writes events to the binlog file as before, but consumers read them from a cache, thereby removing the synchronisation contention. AWS claims that DB performance with this feature is nearly identical to performance without any active binlog replication.

We decided to experiment with this feature, but we were on the older Aurora MySQL 2.07 version. The Aurora team also mentioned in the above article that the older version has parameters like aurora_binlog_replication_max_yield_seconds that can be used to improve binlog performance; however, we didn’t notice any significant improvement from tuning these parameters on our workload.
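As a sanity check, the aurora_binlog_* parameters exposed by a given cluster version can be listed from the application side with a generic query like the hedged snippet below; the exact I/O-cache parameter names are described in the AWS article referenced above.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class AuroraBinlogSettings {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://<aurora-writer-endpoint>:3306/payments";
            try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW GLOBAL VARIABLES LIKE 'aurora_binlog%'")) {
                while (rs.next()) {
                    // Lists whatever binlog-related parameters this Aurora version exposes.
                    System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
                }
            }
        }
    }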

We then ran a load test on the upgraded Aurora version to verify the binlog I/O cache, and we were able to scale without any major issues. Performance Insights showed almost no sessions stuck on the wait condition wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done.

After regression-testing the changes through multiple test beds, we performed a minor version upgrade on the production DB to reap the performance benefits.

  • Database cost was halved by migrating to a smaller instance for the duration of the tournament.
  • ~30% uptick in TPS (Fig 5, 19:31–19:40 vs 19:55–20:10).
Fig 5. Load Test — Throughput
  • 7x drop in P99 latency (Fig 6, 19:31–19:40 vs 19:45–20:10) at 30% higher load.
  • Aurora Performance Insights is a great monitoring tool from AWS for analysing database load, and it is safe to keep enabled in production for most large instances. It uses around 1–3 GB of memory without any significant performance impact, and the default 7 days of data retention is free.
  • An in-depth architectural understanding of the data store is extremely helpful for performance optimisation. Although Aurora is a closed-source DB, AWS has good documentation providing some high-level implementation details.
  • Diving deep and being persistent about solving performance bottlenecks can drive a significant impact on customer experience and cost.

If you are stoked about solving such hard performance and scale problems, join us — https://careers.hotstar.com/

Thanks to Tanuj Johal, Mohit Garg, Aravind S

  1. Overview of Performance Insights on Amazon Aurora — Amazon Aurora
  2. Aurora Redo Log flush event
  3. Optimizing InnoDB logging in MySQL
  4. Aurora MySQL synch wait events
  5. Introducing binlog I/O cache in Amazon Aurora MySQL to improve binlog performance
  6. MySQL Performance Schema
