Performance Improvements in RaftKeeper 2.1.0

RaftKeeper is a high-performance distributed consensus service compatible with ZooKeeper, designed to address performance bottlenecks in systems like ClickHouse and other big data components.

Performance Benchmark Results

Benchmarking was conducted using the raftkeeper-bench tool on a three-node cluster, each with 16 CPU cores, 32GB RAM, and 100GB storage. Tests compared RaftKeeper 2.1.0, RaftKeeper 2.0.4, and ZooKeeper 3.7.1 under default configurations.

In a pure create operation test with 100-byte values, RaftKeeper 2.1.0 showed an 11% improvement over version 2.0.4 and a 143% improvement over ZooKeeper.

A mixed workload test with create (1%), set (8%), get (45%), list (45%), and delete (1%) operations demonstrated a 118% performance gain over version 2.0.4 and a 198% improvement over ZooKeeper. Both average response time and TP99 metrics were superior in version 2.1.0.

Engineering Optimizations

Parallel Response Serialization

Analysis of CPU profiling data revealed that the ResponseThread, which serializes responses, consumed significant CPU, with about one-third of its cycles spent on serialization itself. Because all serialization ran on this single thread, it became a latency bottleneck.

Offloading serialization to the IO threads allowed responses to be serialized concurrently, improving throughput. In addition, lock contention on the synchronization queue was reduced by freeing each response's memory before the next tryPop call. With a 50-byte response size and 10 concurrent clients, these optimizations increased TPS by 31% and reduced average response time by 32%.
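
As a rough sketch of the tryPop optimization (SyncQueue, Response, and sendToClient are illustrative stand-ins, not RaftKeeper's actual classes): the consumer resets its previous response before calling tryPop, so the deallocation happens outside the window in which the queue's lock is held.

#include <memory>
#include <mutex>
#include <queue>
#include <vector>

struct Response { std::vector<char> payload; };  // illustrative stand-in
void sendToClient(const Response&);              // hypothetical send routine

class SyncQueue {
public:
    void push(std::shared_ptr<Response> r) {
        std::lock_guard<std::mutex> g(mtx);
        q.push(std::move(r));
    }
    bool tryPop(std::shared_ptr<Response>& out) {
        std::lock_guard<std::mutex> g(mtx);
        if (q.empty()) return false;
        out = std::move(q.front());  // if out still held a response, its
        q.pop();                     // destructor would run under this lock
        return true;
    }
private:
    std::mutex mtx;
    std::queue<std::shared_ptr<Response>> q;
};

void responseLoop(SyncQueue& queue) {
    std::shared_ptr<Response> resp;
    while (true) {
        resp.reset();  // free the previous response here, outside the lock
        if (queue.tryPop(resp))
            sendToClient(*resp);
    }
}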

List Request Enhancement

List request processing was identified as a bottleneck, consuming most of the request-processor thread's CPU time. The original implementation returned child names as a std::vector<std::string>, which requires a separate heap allocation for each string.

A compact string storage design was implemented, using two contiguous memory blocks: one for data and another for offsets. This reduced CPU usage for List requests from 5.46% to 3.37%, increased TPS from 458,433 to 619,388, and lowered TP99 latency.

// Compact string storage example: all strings live in one contiguous
// buffer; a second buffer records the end offset of each string.
struct CompactStrings {
    std::vector<char> data_buffer;     // concatenated string bytes
    std::vector<size_t> offset_buffer; // end offset of string i

    void add(std::string_view s) {
        data_buffer.insert(data_buffer.end(), s.begin(), s.end());
        offset_buffer.push_back(data_buffer.size());
    }
};

Reduction of Unnecessary System Calls

Profiling with bpftrace identified excessive getsockname and getsockopt system calls, which added measurable overhead. They originated from logging statements that resolved the socket's address every time they executed.

Dropping the address lookup from these log statements eliminated the associated penalty.

// Before: Unnecessary system call in logging
LOG_TRACE(log, "Dispatch event {} for {} ", notification.name(), sock.address().toString());

// After: Avoid system call
LOG_TRACE(log, "Dispatch event {}", notification.name());

Thread Pool Refactoring

In mixed read-write workloads, the request-processor thread spent over 60% of CPU time waiting on condition variables due to thread pool scheduling overhead for read requests.

Switching to single-threaded read request processing eliminated scheduling delays, reducing read latency from three times that of write requests to comparable levels. This change increased TPS by 13% in benchmarks.

// Simplified request processing flow
void processRequests() {
    // Single-threaded read processing
    processReadRequests();
    // Single-threaded write processing
    processWriteRequests();
}

Snapshot Improvements

Asynchronous Snapshot Creation

Previously, snapshot creation blocked the main request processing thread, causing timeouts and leader election issues. For 60 million data entries, this blocking lasted 180 seconds.

Version 2.1.0 introduces asynchronous snapshots: the main thread blocks only to copy the DataTree (4.5 seconds for 60 million entries), and the copy is then serialized to disk in the background. The trade-off is that memory usage temporarily rises by over 50% while the copy is alive.
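
A minimal sketch of this flow, with illustrative names (DataTree and serializeToDisk are stand-ins, not RaftKeeper's actual API):

#include <memory>
#include <string>
#include <thread>
#include <unordered_map>

// Illustrative stand-in; the real DataTree is far richer.
struct DataTree { std::unordered_map<std::string, std::string> nodes; };
void serializeToDisk(const DataTree& tree);  // hypothetical snapshot writer

// Only the in-memory copy blocks the request-processing path; the slow
// disk serialization runs on a detached background thread.
void createSnapshotAsync(const DataTree& live) {
    auto copy = std::make_shared<DataTree>(live);  // short blocking copy
    std::thread([copy] { serializeToDisk(*copy); }).detach();
}

The extra memory usage noted above is the cost of the copy staying alive until the background serialization completes.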

Further optimization using SIMD instructions for DataTree copying reduced copy time from 4.5 to 3.5 seconds.

#include <cstring>
#include <emmintrin.h> // SSE2 intrinsics

// Copy n bytes in 16-byte SSE chunks, then memcpy the tail.
void fast_memcpy(char* dst, const char* src, size_t n) {
    size_t chunks = n / 16;
    for (size_t i = 0; i < chunks; ++i) {
        __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), data);
        dst += 16;
        src += 16;
    }
    memcpy(dst, src, n % 16); // remaining bytes that do not fill a chunk
}

Accelerated Snapshot Loading

Loading a snapshot of 60 million entries previously took 180 seconds on NVMe storage, primarily due to single-threaded construction of parent-child relationships in the DataTree.

Parallelizing this process by distributing buckets of the two-level HashMap structure across threads reduced load time to 99 seconds. Additional optimizations including lock refinement, snapshot format improvements, and reduced data copying further decreased load time to 22 seconds.
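
One lock-free way to parallelize such a build, sketched here with illustrative types rather than RaftKeeper's actual code: every worker scans all nodes, but only mutates parents whose bucket index it owns, so no two threads ever write to the same node.

#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Illustrative stand-ins; the real two-level DataTree layout differs.
struct Node { std::vector<std::string> children; };
using Bucket = std::unordered_map<std::string, Node>;

size_t bucketOf(const std::string& path, size_t n) {
    return std::hash<std::string>{}(path) % n;
}

std::string parentOf(const std::string& path) {
    auto pos = path.rfind('/');
    return pos == 0 ? "/" : path.substr(0, pos);
}

// Each worker writes only to parents in buckets it owns, so the
// child lists are built with no locking at all.
void buildChildrenParallel(std::vector<Bucket>& buckets, size_t workers) {
    std::vector<std::thread> pool;
    for (size_t w = 0; w < workers; ++w)
        pool.emplace_back([&buckets, workers, w] {
            for (auto& bucket : buckets)
                for (auto& entry : bucket) {
                    const std::string& path = entry.first;
                    if (path == "/") continue;          // root has no parent
                    std::string parent = parentOf(path);
                    size_t idx = bucketOf(parent, buckets.size());
                    if (idx % workers != w) continue;   // not ours to write
                    auto it = buckets[idx].find(parent);
                    if (it != buckets[idx].end())
                        it->second.children.push_back(path);
                }
        });
    for (auto& t : pool) t.join();
}

Partitioning writes by the parent's bucket is what removes the need for locks; the further gains from lock refinement, the snapshot format changes, and reduced copying are separate optimizations.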

Production Deployment Results

In a ClickHouse cluster that places heavy load on its coordination service (approximately 170,000 queries per second, predominantly List requests), upgrading to RaftKeeper 2.1.0 delivered significantly better performance than both ZooKeeper and RaftKeeper 2.0.4.
