
Deep Dive into QNNPACK and the Indirect Convolution Algorithm

Tech · May 9

Background: Traditional Convolution Computation

While a simple 1x1 convolution can be directly mapped to matrix-matrix multiplication, more complex convolutions involving larger kernels, padding, or striding require a different approach. The standard method combines the Im2Col algorithm with GEMM (General Matrix Multiplication). This technique rearranges the 4D input image into a 2D matrix based on the convolution window size and flattens the convolution kernels into another 2D matrix. This transforms the convolution operation into a standard matrix multiplication.
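To make the rearrangement concrete, here is a minimal single-channel, stride-1, no-padding Im2Col sketch; the function name and layout are illustrative, not taken from any particular library:

```c
#include <stddef.h>

/* Expand a single-channel H x W input into an im2col matrix of shape
 * (KH*KW) x (OH*OW), assuming unit stride and no padding.
 * Each row holds the input pixels seen by one kernel coordinate. */
void im2col_1ch(const float *input, int H, int W,
                int KH, int KW,
                float *col /* (KH*KW) x (OH*OW), row-major */) {
    int OH = H - KH + 1;   /* output height for stride 1, no padding */
    int OW = W - KW + 1;
    for (int kh = 0; kh < KH; kh++)
        for (int kw = 0; kw < KW; kw++)
            for (int oh = 0; oh < OH; oh++)
                for (int ow = 0; ow < OW; ow++) {
                    int row = kh * KW + kw;      /* kernel coordinate */
                    int c = oh * OW + ow;        /* output pixel index */
                    col[row * (OH * OW) + c] =
                        input[(oh + kh) * W + (ow + kw)];
                }
}
```

Multiplying the flattened kernel matrix by this column matrix then yields the convolution output as an ordinary GEMM.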

After the Im2Col transformation, the operation becomes a matrix multiplication where Matrix B (kernels) is of size N × K, Matrix A (feature map) is K × M, and Result Matrix C is N × M. To compute a specific cell in C, one must multiply a row of B by a column of A.

With vectorization, multiple results are computed in parallel, typically calculating a small block of size MR × NR. Traditional Im2Col+GEMM splits the computation along the K dimension. In this approach, partial sums are accumulated into the output buffer repeatedly, requiring frequent load-store operations for the output.
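The repeated load-store pattern can be seen in a scalar sketch of an MR × NR output block accumulated over K in panels; the sizes and names below are illustrative, and a real kernel would vectorize the inner loops:

```c
#define MR 4
#define NR 4

/* Accumulate an MR x NR block of C over the K dimension in panels of
 * KC, as a cache-blocked GEMM does. Note that each panel reloads and
 * restores the partial sums in C. */
void gemm_block(const float *A /* MR x K, row-major */,
                const float *B /* K x NR, row-major */,
                float *C /* MR x NR, pre-zeroed */, int K, int KC) {
    for (int k0 = 0; k0 < K; k0 += KC) {      /* one K panel at a time */
        int kend = (k0 + KC < K) ? k0 + KC : K;
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++) {
                float acc = C[i * NR + j];    /* reload partial sum */
                for (int k = k0; k < kend; k++)
                    acc += A[i * K + k] * B[k * NR + j];
                C[i * NR + j] = acc;          /* store it back */
            }
    }
}
```

Every K panel touches the output block again, which is exactly the traffic QNNPACK avoids by keeping the whole K dimension inside one kernel invocation.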

The traditional method suffers from two main drawbacks:

  • High Memory Overhead: Flattening the input and kernels into 2D matrices consumes significant memory.
  • Cache Pressure: GEMM must stream large panels of A and B through the cache hierarchy, which splitting strategies only partially mitigate.

Neural Network Quantization Basics

QNNPACK (Quantized Neural Networks PACKage) was initially designed to optimize quantized inference. Quantization converts model weights and activations from 32-bit floating-point (FP32) to lower-bit integers (like INT8) to improve speed and memory efficiency on mobile devices.

The linear quantization formula involves three parameters:

  • Scale: The factor scaling the floating-point range to the integer range.
  • Zero Point: The offset adjusting the quantized values.
  • Quantized Value: The resulting integer representation.

The quantization process is defined as:

q = round(r / scale) + zeropoint

Where q is the quantized integer. The scale and zero point are derived from the data range:

scale = (rmax - rmin) / (qmax - qmin)
zeropoint = round(qmin - rmin / scale)

During inference, dequantization restores the approximate floating-point value:

r = scale × (q - zeropoint)

QNNPACK utilizes a scheme compatible with Android's Neural Networks API, assuming 8-bit unsigned integers for quantized values.

The Indirect Convolution Algorithm

Developed by Marat Dukhan, the Indirect Convolution Algorithm is the core innovation behind QNNPACK. Unlike traditional BLAS libraries that focus on double-precision floating-point for scientific computing, QNNPACK targets low-precision, mobile-specific vision models.

Computation Partitioning

QNNPACK utilizes PDOT (Parallel Dot Product) micro-kernels optimized for 8-bit elements. Mobile architectures limit MR and NR (register blocks) to 8 or less. Consequently, the data read into the micro-kernel fits within the L1 cache (e.g., 16KB for 8x8 blocks).

The critical distinction of QNNPACK is its optimization for scenarios where input panels fit into L1 cache, allowing it to eliminate all non-essential memory transformations. While traditional GEMM libraries repack matrices A and B to utilize cache hierarchies, QNNPACK processes the entire K dimension within the kernel for a given block without swapping data in and out.

Memory Repacking Optimization

Memory repacking (rearranging data layouts) is usually done to reduce fragmentation and improve cache locality. However, QNNPACK minimizes this:

  • Matrix A (Activations): Since input data changes every run, traditional methods repack A constantly. QNNPACK avoids this by ensuring the data panel fits in L1 cache.
  • Matrix B (Weights): Weights are static. QNNPACK repacks them once into a layout optimal for the micro-kernel during initialization.

By avoiding the runtime repacking of Matrix A, QNNPACK saves significant overhead.

The Indirect Buffer

The key innovation is the Indirect Buffer. Instead of copying actual pixel data into an Im2Col buffer, QNNPACK creates a buffer of pointers to the input pixels required for the computation.

If the input tensor resides at a constant memory address (or is copied to a fixed region), this indirect buffer can be set up once and reused across multiple inference runs. The micro-kernel loads pointers from this buffer rather than computing addresses on the fly.

To handle padding (often implicit zero-padding in convolutions), QNNPACK uses a shared "explicit zero vector." If the convolution window slides out of bounds, the pointer in the indirect buffer is pointed to this zero vector instead of the input tensor.
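A sketch of filling such a buffer for one output row shows the padding trick: every kernel tap that falls outside the image gets the shared zero vector instead of an input pointer. The function name, NHWC layout, and single-batch assumption are illustrative:

```c
#include <stdint.h>

/* Build the indirect-buffer entries for output row oy of a KH x KW
 * convolution over an H x W x C input (NHWC). For each output x and
 * kernel tap (kh, kw), store a pointer to the input pixel's channel
 * vector, or to a shared zero vector when the tap is out of bounds. */
void build_indirect_row(const uint8_t *input, int H, int W, int C,
                        int KH, int KW, int stride, int pad,
                        int oy, int OW,
                        const uint8_t *zero_vector /* C zeros */,
                        const uint8_t **buf /* OW * KH * KW entries */) {
    for (int ox = 0; ox < OW; ox++)
        for (int kh = 0; kh < KH; kh++)
            for (int kw = 0; kw < KW; kw++) {
                int iy = oy * stride - pad + kh;
                int ix = ox * stride - pad + kw;
                buf[(ox * KH + kh) * KW + kw] =
                    (iy < 0 || iy >= H || ix < 0 || ix >= W)
                        ? zero_vector                 /* implicit padding */
                        : input + (iy * W + ix) * C;  /* real pixel */
            }
}
```

As long as `input` stays at the same address, this setup runs once and the pointers remain valid across inferences.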

The structure of the indirect buffer depends on:

  • Model Parameters: Kernel size, stride, dilation, channels (set once).
  • Tensor Shape: Input/Output height and width (rarely changes).
  • Batch Size: May require partial reinitialization.
  • Memory Address: Changes require full reinitialization.

Execution Workflow

The execution is divided into preparation and runtime:

  1. Preparation: Load the model, configure the input buffer, and repack weights into the optimal layout.
  2. Runtime: Loop through the output pixels. For every M × N block of output, use the indirect buffer to fetch the corresponding input pointers and weights, then compute using GEMM.

Indirect Buffer Layout

The buffer is organized as a set of pointers corresponding to the convolution kernel size (KH × KW). For a specific output position, the buffer stores pointers to the input rows needed. Since adjacent output pixels often reuse input data (depending on stride), the indirect buffer naturally captures this data locality.

The pointers are organized so that for the M output pixels being computed simultaneously, the relevant input pointers for a specific kernel coordinate (kh, kw) are grouped together. This allows the micro-kernel to iterate through the input channels efficiently.

Computation with Indirect Buffer

The following logic illustrates how the indirect buffer is used to compute the output without explicitly expanding the input matrix:


// Pseudo-code: compute M output pixels for one output channel at once
for (int kh = 0; kh < KH; kh++) {
    for (int kw = 0; kw < KW; kw++) {
        // Load M pointers corresponding to kernel position (kh, kw);
        // p_indirect points to the structured indirect buffer
        const uint8_t **input_ptrs =
            get_pointers_from_indirect_buffer(p_indirect, kh, kw);

        // Iterate over input channels
        for (int ic = 0; ic < IC; ic++) {
            // For each of the M outputs, accumulate input * weight
            // (for quantized data, into a 32-bit accumulator)
            for (int m = 0; m < M; m++) {
                output[m] += input_ptrs[m][ic] * weight(kh, kw, ic);
            }
        }
    }
}

This process simulates scanning the Im2Col matrix but operates directly on the 3D tensor via pointers. The total number of kernel calls required is:

ceil(OH * OW / M) * ceil(OC / N)
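This tile count can be expressed as a small helper; the example sizes in the test are made up for illustration:

```c
/* Number of micro-kernel invocations needed to cover an OH x OW x OC
 * output when tiled into M-pixel by N-channel blocks:
 * ceil(OH*OW / M) * ceil(OC / N). */
static long kernel_calls(int OH, int OW, int OC, int M, int N) {
    long pixel_tiles = ((long)OH * OW + M - 1) / M;   /* ceil division */
    long channel_tiles = ((long)OC + N - 1) / N;
    return pixel_tiles * channel_tiles;
}
```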

Comparison with Im2Col

Compared to the standard Im2Col approach, the Indirect Convolution algorithm offers:

  1. Elimination of Transformation: No need to copy data to an Im2Col buffer.
  2. Better Cache Utilization: Reuses input rows for different output pixels more effectively.
  3. Pointer Overhead: Introduces the cost of loading pointers from the indirect buffer.
  4. Loop Structure: Nested loops for kernel size and channels might be slightly less efficient than a single flat loop in GEMM, though usually offset by the removal of memory copies.

Overall, the indirect convolution algorithm solves memory copy issues and address calculation complexity, providing superior performance for mobile quantized neural networks despite a slight increase in memory usage for the buffer structures.
