Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

AOF fsync Delay Issue in Redis: Analysis and Solutions

Tech May 15 1

Issue Description

Today, I encountered a classic Redis issue in production related to AOF persistence. This article documents the analysis and troubleshooting process.

Background:

Around 10:30 AM, the business team reported service timeouts with numerous IO timeout alerts.

[Image: Business monitoring chart showing a spike in request latency around 10:30]

The alert specifically identified the Redis port and domain, allowing us to pinpoint the problematic server IP.

Analysis and Diagnosis

First, we checked the connection count and noticed a significant increase at 10:32 AM.

[Image: Connection count chart showing a spike at 10:32]

Next, we examined the Redis logs and found a large number of repeated warnings:

24:S 16 Aug 10:32:03.357 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
24:S 16 Aug 10:32:05.067 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
24:S 16 Aug 10:32:39.058 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
24:S 16 Aug 10:33:13.067 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
24:S 16 Aug 10:33:47.064 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
24:S 16 Aug 10:34:21.099 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

The log timestamps match the alert time frame (10:32 AM). The warning essentially means: The asynchronous AOF fsync operation is taking too long, causing Redis to write the AOF buffer directly without waiting for fsync to complete. This can degrade Redis performance.

Since this is a disk I/O issue, we checked disk utilization:

[Image: Disk utilization chart showing 100% at 10:32]

At 10:32 AM, disk utilization reached 100%. The server used HDD disks with inherently poor performance. Heavy write loads on such disks inevitably impact Redis performance.

At this point, an obvious solution is to replace HDD with SSD. However, this upgrade involves cost assessment and cannot be implemented immediately.

Source Code Analysis

One advantage of Redis over MySQL is its relatively small codebase, which makes it feasible to trace issues through source code.

Using an IDE, I searched for the log message and found the following function:

/* Write the append only file buffer on disk.
 *
 * Since we are required to write the AOF before replying to the client,
 * and the only way the client socket can get a write is entering when the
 * event loop, we accumulate all the AOF writes in a memory
 * buffer and write it on disk using this function just before entering
 * the event loop again.
 *
 * About the 'force' argument:
 *
 * When the fsync policy is set to 'everysec' we may delay the flush if there
 * is still an fsync() going on in the background thread, since for instance
 * on Linux write(2) will be blocked by the background fsync anyway.
 * When this happens we remember that there is some aof buffer to be
 * flushed ASAP, and will try to do that in the serverCron() function.
 *
 * However if force is set to 1 we'll write regardless of the background
 * fsync. */
#define AOF_WRITE_LOG_ERROR_RATE 30 /* Seconds between errors logging. */
void flushAppendOnlyFile(int force) {
    ssize_t nwrittan;
    int sync_in_progress = 0;
    mstime_t latency;

    if (sdslen(server.aof_buf) == 0) return;

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = bioPendingJobsOfType(BIO_AOF_FSYNC) != 0;

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                /* No previous write postponing, remember that we are
                 * postponing the flush and return. */
                server.aof_flush_postponed_start = server.unixtime;
                return;
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                /* We were already waiting for fsync to finish, but for less
                 * than two seconds this is still ok. Postpone again. */
                return;
            }
            /* Otherwise fall through, and go write since we can't wait
             * over two seconds. */
            server.aof_delayed_fsync++;
            serverLog(LL_NOTICE,\

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.