Azimuth-Equivariant Feature Learning and Camera-Decoupled Depth Estimation for Multi-View 3D Object Detection
Standard Lift-Splat-Shoot (LSS) paradigms for multi-view 3D perception often neglect the radial symmetry inherent in Bird's-Eye-View (BEV) representations. Conventional 2D backbones and detection heads apply isotropic sampling grids and Cartesian coordinate systems, which introduces two primary limitations. First, shared convolutional kernels produce inconsistent feature embeddings when the same scene is observed from varying azimuth angles. Second, Cartesian-based anchor boxes yield fragmented prediction distributions for identical objects captured across multiple camera viewpoints. To resolve these rotational inconsistencies, a novel architecture integrates azimuth-equivariant convolutional operations, polar-coordinate anchor definitions, and a camera-intrinsic-decoupled depth regression strategy.
Azimuth-Equivariant Convolution (AeConv)
The core mechanism replaces standard fixed-grid sampling with dynamically rotated sampling patterns aligned to the local azimuth angle. By rotating the receptive field relative to a defined radial coordinate system, the operation preserves feature consistency regardless of viewing orientation. In multi-camera configurations, disparate sensor positions naturally create non-uniform azimuth reference frames. To synchronize these views, the sampling grids are aligned relative to a shared virtual origin, approximated by averaging the physical camera coordinates. Implementation relies on differentiable bilinear interpolation to sample features from the rotated grid, ensuring gradient flow during end-to-end training.
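As a rough sketch of this mechanism (function names and the single-offset convention are illustrative, not taken from the original implementation), each BEV cell's sampling offsets can be rotated by the cell's azimuth relative to the shared virtual origin, with features then read off via bilinear interpolation:

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D feature map at continuous coords (x, y)."""
    H, W = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    # Clamp indices so out-of-bounds samples reuse border values.
    x0c, x1c = np.clip([x0, x1], 0, W - 1)
    y0c, y1c = np.clip([y0, y1], 0, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0c, x0c] + wx * feat[y0c, x1c]
    bot = (1 - wx) * feat[y1c, x0c] + wx * feat[y1c, x1c]
    return (1 - wy) * top + wy * bot

def azimuth_rotated_grid(cx, cy, origin, base_offsets):
    """Rotate a kernel's sampling offsets to align with the local azimuth.

    base_offsets: (K, 2) array of kernel offsets in the canonical frame;
    returns (K, 2) absolute sampling coordinates around the cell (cx, cy).
    """
    azimuth = np.arctan2(cy - origin[1], cx - origin[0])
    c, s = np.cos(azimuth), np.sin(azimuth)
    rot = np.array([[c, -s], [s, c]])
    return np.array([cx, cy]) + base_offsets @ rot.T
```

For a cell on the +y axis of the origin, a canonical "radial" offset of (1, 0) rotates to point along +y, so the same kernel slot always samples radially outward regardless of azimuth.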
Polar-Coordinate Anchor Formulation
Traditional anchor-free or Cartesian anchor-based detectors implicitly assume fixed directional biases. The proposed framework redefines the anchor space using polar coordinates (x, y, z, l, w, h, α), where α represents the local azimuth angle. Instead of predicting absolute orientation, the detection head outputs the relative heading deviation Δθ = θ_gt - α. Spatial offsets and velocity vectors are also decomposed into radial and orthogonal components relative to the anchor's orientation. The transformation from local polar offsets back to global Cartesian coordinates is achieved through a rotation matrix multiplication:
import numpy as np

def transform_polar_to_cartesian(delta_r, delta_o, azimuth):
    """Convert radial and orthogonal offsets to global Cartesian coordinates."""
    # Standard 2D rotation by the anchor's azimuth angle.
    rotation_matrix = np.array([
        [np.cos(azimuth), -np.sin(azimuth)],
        [np.sin(azimuth),  np.cos(azimuth)],
    ])
    local_offsets = np.array([delta_r, delta_o])
    global_offsets = rotation_matrix @ local_offsets
    return global_offsets
This formulation inherently decouples object properties from camera viewpoint, ensuring consistent bounding box regression across overlapping fields of view.
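The inverse mapping, needed when encoding ground-truth Cartesian offsets into an anchor's local radial/orthogonal frame during target assignment, is the transpose of the same rotation. A minimal self-contained sketch (the function name is illustrative):

```python
import numpy as np

def decompose_to_polar(dx, dy, azimuth):
    """Project global Cartesian offsets onto the anchor's radial/orthogonal axes.

    Inverse of the polar-to-Cartesian rotation: applies the transposed
    rotation matrix for the given azimuth.
    """
    c, s = np.cos(azimuth), np.sin(azimuth)
    delta_r = c * dx + s * dy    # component along the radial direction
    delta_o = -s * dx + c * dy   # component orthogonal to the radius
    return delta_r, delta_o
```

For an anchor at azimuth α = π/2, a global offset of (0, 1) is purely radial, so it decomposes to delta_r = 1, delta_o = 0.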
Camera-Decoupled Virtual Depth Estimation
Conventional depth prediction modules tightly couple network weights to specific camera intrinsics, hindering cross-camera generalization. The architecture introduces a virtual depth pipeline that operates independently of physical focal lengths. A unified virtual focal length f_v is assumed during the initial depth classification stage, which discretizes the range [0, d_v] into M uniform bins. The resulting probability distribution generates a virtual depth map that is subsequently projected into physical space.
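One common way to turn the per-bin probability distribution into a continuous virtual depth is a soft-argmax (expectation) over the bin centers; the sketch below assumes uniform bins on [0, d_v] and a softmax over raw logits, which is an implementation choice rather than something the text specifies:

```python
import numpy as np

def virtual_depth_from_logits(logits, d_max, M):
    """Expected virtual depth from M uniform bins on [0, d_max] (soft-argmax)."""
    # Centers of a uniform discretization of [0, d_max] into M bins.
    centers = (np.arange(M) + 0.5) * (d_max / M)
    # Numerically stable softmax over the bin logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs @ centers)
```

With uniform logits over M = 6 bins and d_max = 60, the expectation lands at the midpoint of the range, 30.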
The conversion from virtual bin resolution Δd to real-world bin resolution Δd_r follows the proportional relationship dictated by the actual focal length f_r:
def map_virtual_to_real_depth(virtual_depth, f_real, f_virtual):
    """Scale virtual depth bins to match real-world camera intrinsics."""
    return virtual_depth * (f_real / f_virtual)
Direct application of this scaling factor causes non-uniform BEV feature densities across cameras with varying intrinsics. To counteract this, the depth bin allocation is dynamically adjusted per camera to maintain a fixed spatial resolution in the BEV plane. Furthermore, a deformable convolution module processes the semantic context branch to mitigate scale discrepancies between depth regression and object recognition features. This separation allows the depth network to focus purely on geometric relationships without being biased by sensor-specific calibration parameters.
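The per-camera adjustment can be sketched as follows: since a virtual bin of size Δd maps to Δd · (f_r / f_v) in real depth, choosing the virtual bin size to compensate for each camera's focal length keeps the real-world bin resolution, and hence the BEV feature density, constant. This is a minimal illustration under those assumptions; the function name and ceiling-based bin count are hypothetical:

```python
import math

def bins_for_fixed_bev_resolution(d_virtual, f_real, f_virtual, delta_d_real):
    """Number of uniform virtual bins so each bin spans delta_d_real in real space.

    A virtual bin of size delta_v maps to delta_v * (f_real / f_virtual)
    metres of real depth, so delta_v is chosen to hit the target resolution.
    """
    delta_v = delta_d_real * (f_virtual / f_real)
    return math.ceil(d_virtual / delta_v)
```

A camera with twice the focal length of the virtual reference thus receives twice as many (finer) virtual bins, so both cameras populate the BEV grid at the same real-world depth resolution.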
Experimental Validation
The framework builds upon a ResNet image encoder and a view transformer, replacing traditional camera-aware attention modules with the virtual depth projection mechanism. Evaluations on the nuScenes dataset demonstrate significant improvements over baseline methods. Rotational robustness tests confirm stable performance when camera azimuths are artificially shifted by 60 degrees, whereas Cartesian baselines exhibit substantial metric degradation. Ablation studies validate the impact of depth bin granularity and the necessity of the deformable context branch. The integrated approach achieves 62.0% NDS, surpassing prior multi-view 3D detectors by effectively harmonizing radial symmetry, anchor orientation, and cross-camera depth normalization.