Deep Dive into QNNPACK and the Indirect Convolution Algorithm
Background: Traditional Convolution Computation
While a simple 1x1 convolution can be directly mapped to matrix-matrix multiplication, more complex convolutions involving larger kernels, padding, or striding require a different approach. The standard method combines the Im2Col algorithm with GEMM (General Matrix Multiplication). This technique rearranges the 4D input image into a 2D matrix based on the convolution window size and flattens the convolution kernels into another 2D matrix. This transforms the convolution operation into a standard matrix multiplication.
After the Im2Col transformation, the operation becomes a matrix multiplication in which Matrix B (kernels) is of size N × K, Matrix A (the transformed feature map) is K × M, and the result Matrix C is N × M. To compute a specific cell of C, one multiplies a row of B by a column of A.
With vectorization, multiple results are computed in parallel, typically calculating a small block of size MR × NR. Traditional Im2Col+GEMM splits the computation along the K dimension. In this approach, partial sums are accumulated into the output buffer repeatedly, requiring frequent load-store operations for the output.
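To make the transformation concrete, here is a minimal C sketch of Im2Col for a single input channel with no padding; the function name, layouts, and the single-channel restriction are simplifying assumptions for illustration, not QNNPACK code.

// Minimal Im2Col sketch for one input channel (illustrative).
// in: IH x IW input, kernel KH x KW, stride S, no padding.
// col: (KH*KW) x (OH*OW) matrix, row-major, so convolution becomes
//      a GEMM of the flattened kernel with col.
void im2col_single_channel(const float* in, int IH, int IW,
                           int KH, int KW, int S, float* col) {
    const int OH = (IH - KH) / S + 1;
    const int OW = (IW - KW) / S + 1;
    for (int kh = 0; kh < KH; kh++) {
        for (int kw = 0; kw < KW; kw++) {
            for (int oh = 0; oh < OH; oh++) {
                for (int ow = 0; ow < OW; ow++) {
                    const int row = kh * KW + kw;       // index along K
                    const int m   = oh * OW + ow;       // index along M
                    col[row * (OH * OW) + m] =
                        in[(oh * S + kh) * IW + (ow * S + kw)];
                }
            }
        }
    }
}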
The traditional method suffers from two main drawbacks:
- High Memory Overhead: Flattening the input and kernels into 2D matrices consumes significant memory.
- Cache Pressure: GEMM must keep large panels of the input and kernel matrices resident in cache, which remains an issue even when the computation is split along K.
Neural Network Quantization Basics
QNNPACK (Quantized Neural Networks PACKage) was initially designed to optimize quantized inference. Quantization converts model weights and activations from 32-bit floating-point (FP32) to lower-bit integers (like INT8) to improve speed and memory efficiency on mobile devices.
The linear quantization formula involves three parameters:
- Scale: The factor scaling the floating-point range to the integer range.
- Zero Point: The offset adjusting the quantized values.
- Quantized Value: The resulting integer representation.
The quantization process is defined as:
q = round(r / scale) + zeropoint
Where q is the quantized integer. The scale and zero point are derived from the data range:
scale = (rmax - rmin) / (qmax - qmin)
zeropoint = round(qmin - (rmin / scale))
During inference, dequantization restores the approximate floating-point value:
r = scale × (q - zeropoint)
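As a minimal illustration, the two formulas map to C roughly as follows, assuming 8-bit unsigned values (qmin = 0, qmax = 255) as in the scheme described next; the function names are illustrative.

#include <math.h>
#include <stdint.h>

// Affine quantization: q = round(r / scale) + zero_point, clamped to [0, 255].
static uint8_t quantize(float r, float scale, int32_t zero_point) {
    int32_t q = (int32_t)lrintf(r / scale) + zero_point;
    if (q < 0)   q = 0;
    if (q > 255) q = 255;
    return (uint8_t)q;
}

// Dequantization: r = scale * (q - zero_point).
static float dequantize(uint8_t q, float scale, int32_t zero_point) {
    return scale * (float)((int32_t)q - zero_point);
}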
QNNPACK utilizes a scheme compatible with Android's Neural Networks API, assuming 8-bit unsigned integers for quantized values.
The Indirect Convolution Algorithm
Developed by Marat Dukhan, the Indirect Convolution Algorithm is the core innovation behind QNNPACK. Unlike traditional BLAS libraries that focus on double-precision floating-point for scientific computing, QNNPACK targets low-precision, mobile-specific vision models.
Computation Partitioning
QNNPACK utilizes PDOT (Parallel Dot Product) micro-kernels optimized for 8-bit elements. Mobile architectures limit MR and NR (the register block sizes) to 8 or fewer. Consequently, the panels of data read by the micro-kernel (for example, two 8 × K panels of 8-bit values for an 8×8 block) fit within the L1 cache for typical values of K.
The critical distinction of QNNPACK is its optimization for scenarios where input panels fit into L1 cache, allowing it to eliminate all non-essential memory transformations. While traditional GEMM libraries repack matrices A and B to utilize cache hierarchies, QNNPACK processes the entire K dimension within the kernel for a given block without swapping data in and out.
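A scalar sketch of such a micro-kernel is shown below: for one MR × NR output block, the entire K loop accumulates into local int32 accumulators, and the block is written out exactly once. The names, layout, and zero-point handling are illustrative assumptions, not QNNPACK's actual kernel.

#include <stdint.h>

#define MR 8
#define NR 8

// Scalar sketch of an MR x NR micro-kernel over 8-bit operands.
// a: MR rows of length K (row-major); b: K rows of NR columns (already repacked);
// c: MR x NR output block with row stride ldc. Accumulation stays in int32
// for the full K dimension; each output element is stored once.
void pdot_microkernel_sketch(int K,
                             const uint8_t* a, const uint8_t* b,
                             int32_t* c, int ldc,
                             int32_t a_zero_point, int32_t b_zero_point) {
    int32_t acc[MR][NR] = {{0}};
    for (int k = 0; k < K; k++) {
        for (int m = 0; m < MR; m++) {
            const int32_t av = (int32_t)a[m * K + k] - a_zero_point;
            for (int n = 0; n < NR; n++) {
                const int32_t bv = (int32_t)b[k * NR + n] - b_zero_point;
                acc[m][n] += av * bv;
            }
        }
    }
    for (int m = 0; m < MR; m++)
        for (int n = 0; n < NR; n++)
            c[m * ldc + n] = acc[m][n];   // single store per output element
}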
Memory Repacking Optimization
Memory repacking (rearranging data into the layout the micro-kernel reads) is usually done to make memory accesses sequential and improve cache locality. However, QNNPACK minimizes this:
- Matrix A (Activations): Since input data changes every run, traditional methods repack A constantly. QNNPACK avoids this by ensuring the data panel fits in L1 cache.
- Matrix B (Weights): Weights are static. QNNPACK repacks them once into a layout optimal for the micro-kernel during initialization.
By avoiding the runtime repacking of Matrix A, QNNPACK saves significant overhead.
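As an illustration of this one-time step, the sketch below repacks a K × N weight matrix into NR-wide panels that the micro-kernel can read sequentially; this is one common packing choice, not necessarily QNNPACK's exact format.

#include <stdint.h>
#define NR 8

// Repack weights w (K x N, row-major) into NR-wide column panels:
// panel p holds columns [p*NR, p*NR + NR), laid out k-major, so the
// micro-kernel reads NR weights contiguously for each k. Done once at init.
void repack_weights(int K, int N, const uint8_t* w, uint8_t* packed) {
    for (int p = 0; p < (N + NR - 1) / NR; p++) {
        for (int k = 0; k < K; k++) {
            for (int j = 0; j < NR; j++) {
                const int n = p * NR + j;
                *packed++ = (n < N) ? w[k * N + n] : 0;   // zero-pad ragged edge
            }
        }
    }
}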
The Indirect Buffer
The key innovation is the Indirect Buffer. Instead of copying actual pixel data into an Im2Col buffer, QNNPACK creates a buffer of pointers to the input pixels required for the computation.
If the input tensor resides at a constant memory address (or is copied to a fixed region), this indirect buffer can be set up once and reused across multiple inference runs. The micro-kernel loads pointers from this buffer rather than computing addresses on the fly.
To handle padding (often implicit zero-padding in convolutions), QNNPACK uses a shared "explicit zero vector." If the convolution window slides out of bounds, the pointer in the indirect buffer is pointed to this zero vector instead of the input tensor.
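A sketch of how the entries for one output pixel might be filled, with out-of-bounds taps redirected to the shared zero vector, follows; the NHWC layout, names, and signature are assumptions made for illustration.

#include <stdint.h>

// Fill the pointer entries for one output pixel (oy, ox).
// input is NHWC with IC channels; zero_vector is a shared buffer of IC
// zero-point values. Entries are ordered by kernel position (kh, kw).
void fill_indirect_entries(const uint8_t* input, int IH, int IW, int IC,
                           int KH, int KW, int stride, int pad,
                           int oy, int ox,
                           const uint8_t* zero_vector,
                           const uint8_t** entries) {
    for (int kh = 0; kh < KH; kh++) {
        for (int kw = 0; kw < KW; kw++) {
            const int iy = oy * stride + kh - pad;
            const int ix = ox * stride + kw - pad;
            const int in_bounds = (iy >= 0 && iy < IH && ix >= 0 && ix < IW);
            entries[kh * KW + kw] = in_bounds
                ? input + (iy * IW + ix) * IC   // points at an IC-long pixel row
                : zero_vector;                  // implicit zero padding
        }
    }
}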
The structure of the indirect buffer depends on:
- Model Parameters: Kernel size, stride, dilation, channels (set once).
- Tensor Shape: Input/Output height and width (rarely changes).
- Batch Size: May require partial reinitialization.
- Memory Address: Changes require full reinitialization.
Execution Workflow
The execution is divided into preparation and runtime:
- Preparation: Load the model, repack the weights into the micro-kernel's optimal layout, and set up the indirect buffer once the input address and shape are known.
- Runtime: Loop over the output. For every M × N block of output pixels and channels, use the indirect buffer to fetch the corresponding input pointers and the packed weights, then compute the block with the GEMM-style micro-kernel (see the loop sketch below).
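The runtime half of that split can be sketched as two tiling loops; the buffer indexing and the micro-kernel signature below are illustrative assumptions.

#include <stdint.h>

// Hypothetical micro-kernel: consumes the M*KH*KW input-row pointers of one
// tile and the packed weights of N output channels, writes an M x N block.
void microkernel(const uint8_t** tile_ptrs, const uint8_t* weights,
                 int32_t* out, int ldc,
                 int KH, int KW, int IC, int M, int N);

// Sketch of the runtime loops: tile the output into M x N blocks and call the
// micro-kernel once per block.
void run_convolution(int OH, int OW, int OC, int KH, int KW, int IC,
                     int M, int N,
                     const uint8_t** indirect_buffer,
                     const uint8_t* packed_weights,
                     int32_t* output) {
    for (int pixel = 0; pixel < OH * OW; pixel += M) {      // M output pixels per tile
        for (int oc = 0; oc < OC; oc += N) {                 // N output channels per tile
            microkernel(indirect_buffer + pixel * KH * KW,   // pointer entries for this tile
                        packed_weights + oc * KH * KW * IC,  // weights for these N channels
                        output + pixel * OC + oc, OC,        // top-left of the output block
                        KH, KW, IC, M, N);
        }
    }
}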
Indirect Buffer Layout
The buffer is organized as a set of pointers corresponding to the convolution kernel size (KH × KW). For a specific output position, the buffer stores pointers to the input rows needed. Since adjacent output pixels often reuse input data (depending on stride), the indirect buffer naturally captures this data locality.
The pointers are organized so that for the M output pixels being computed simultaneously, the relevant input pointers for a specific kernel coordinate (kh, kw) are grouped together. This allows the micro-kernel to iterate through the input channels efficiently.
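One way to realize this grouping is an indexing helper like the following; the layout is an assumption, introduced here so that the pseudo-code in the next subsection has a concrete reading.

#include <stdint.h>

// Hypothetical layout for one output tile: KH*KW groups of M pointers stored
// back to back, one group per kernel position (kh, kw). KW and M are shown as
// constants only to keep the signature identical to the pseudo-code below.
enum { KW = 3, M = 8 };   // kernel width and tile size, illustrative values

// p_indirect points at the first pointer entry of the current output tile.
static inline const uint8_t** get_pointers_from_indirect_buffer(
        const uint8_t** p_indirect, int kh, int kw) {
    return p_indirect + (kh * KW + kw) * M;
}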
Computation with Indirect Buffer
The following logic illustrates how the indirect buffer is used to compute the output without explicitly expanding the input matrix:
// Pseudo-code: compute M output pixels at once for one output channel.
// p_indirect points at the pointer entries of the current output tile;
// output[] holds M accumulators, assumed initialized to zero (bias omitted).
for (int kh = 0; kh < KH; kh++) {
    for (int kw = 0; kw < KW; kw++) {
        // Load the M input-row pointers stored for kernel position (kh, kw)
        const uint8_t** input_ptrs =
            get_pointers_from_indirect_buffer(p_indirect, kh, kw);
        // Iterate over input channels
        for (int ic = 0; ic < IC; ic++) {
            // For each of the M outputs, multiply the input value by the weight
            for (int m = 0; m < M; m++) {
                output[m] += (int32_t)input_ptrs[m][ic] * weight(kh, kw, ic);
            }
        }
    }
}
This process simulates scanning the Im2Col matrix but operates directly on the 3D tensor via pointers. The total number of micro-kernel calls required is:
ceil(OH * OW / M) * ceil(OC / N)
For example, a 56 × 56 output with 64 output channels and an 8 × 8 tile requires ceil(3136 / 8) * ceil(64 / 8) = 392 * 8 = 3136 micro-kernel calls.
Comparison with Im2Col
Compared with the standard Im2Col approach, the Indirect Convolution Algorithm has the following trade-offs:
- Elimination of Transformation: No need to copy data to an Im2Col buffer.
- Better Cache Utilization: Reuses input rows for different output pixels more effectively.
- Pointer Overhead: Introduces the cost of loading pointers from the indirect buffer.
- Loop Structure: Nested loops for kernel size and channels might be slightly less efficient than a single flat loop in GEMM, though usually offset by the removal of memory copies.
Overall, the indirect convolution algorithm solves memory copy issues and address calculation complexity, providing superior performance for mobile quantized neural networks despite a slight increase in memory usage for the buffer structures.