We present a simple vector quantizer that combines low distortion with fast search and apply it to approximate nearest neighbor (ANN) search in high dimensional spaces. Leveraging the very same data structure that is used to provide non-exhaustive search, i.e., inverted lists or a multi-index, the idea is to locally optimize an individual product quantizer (PQ) per cell and use it to encode residuals.
$K$-means, depicted in Fig. 1(a), is a vector quantization method where, given $k$ centroids, $\log_2 k$ bits suffice to represent an arbitrary data point in $\mathbb{R}^d$ for any dimension $d$; but naive search is $O(dk)$ and low distortion requires a very large $k$. By constraining centroids to an axis-aligned, $m$-dimensional grid, Product Quantization (PQ) [1] achieves $k^m$ centroids while keeping search at $O(dk)$; but as illustrated in Fig. 1(b), many of these centroids remain without data support, e.g. if the distributions over the $m$ subspaces are not independent.
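The grid structure of PQ is easy to see in code. Below is a minimal sketch (not the paper's implementation) of PQ training, encoding, and asymmetric distance computation, using numpy and scikit-learn; the function names (`train_pq`, `encode_pq`, `adc_distances`) and parameter defaults are ours for illustration.

```python
# Minimal product quantization sketch: split R^d into m subspaces, learn k
# centroids per subspace with k-means, and encode each vector as m sub-codebook
# indices (m * log2(k) bits). Names and defaults are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def train_pq(X, m=4, k=256):
    """Return a list of m sub-codebooks, each of shape (k, d/m)."""
    d = X.shape[1]
    assert d % m == 0, "dimension must split evenly into m subspaces"
    ds = d // m
    return [KMeans(n_clusters=k, n_init=4).fit(X[:, j*ds:(j+1)*ds]).cluster_centers_
            for j in range(m)]

def encode_pq(codebooks, x):
    """Encode one vector as m sub-centroid indices."""
    ds = codebooks[0].shape[1]
    return np.array([np.argmin(np.linalg.norm(C - x[j*ds:(j+1)*ds], axis=1))
                     for j, C in enumerate(codebooks)])

def adc_distances(codebooks, q, codes):
    """Asymmetric distances from query q to PQ-encoded points 'codes' (n, m):
    build per-subspace lookup tables in O(dk), then O(m) per encoded point."""
    ds = codebooks[0].shape[1]
    tables = [np.linalg.norm(C - q[j*ds:(j+1)*ds], axis=1)**2
              for j, C in enumerate(codebooks)]
    return sum(tables[j][codes[:, j]] for j in range(len(codebooks)))
```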
Optimized Product Quantization (OPQ) [2] allows the grid to undergo an arbitrary rotation and re-ordering of dimensions, so that it better aligns with the data and balances variance across subspaces to match the (also balanced) bit allocation. But as shown in Fig. 1(c), a strongly multi-modal distribution may not benefit from such alignment.
Our solution in this work is Locally Optimized Product Quantization (LOPQ). Following a common search option, a coarse quantizer is used to index data by inverted lists, and residuals between data points and centroids are PQ-encoded. Within-cell distributions are largely unimodal; hence, as in Fig. 1(d), we locally optimize an individual product quantizer per cell. Without any assumptions on the distribution, practically all centroids are supported by data, yielding lower distortion.
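To make the indexing structure concrete, here is a hedged sketch (again, not the released code) that reuses the PQ helpers above: a coarse k-means quantizer defines the inverted lists, and residuals to each coarse centroid are encoded with a quantizer trained per cell. For brevity the local optimization step is stubbed with plain PQ; the actual method also learns a per-cell rotation, as in the parametric solution sketched further down.

```python
# LOPQ-style index sketch: coarse quantizer -> inverted lists of locally
# encoded residuals. Per-cell optimization is simplified to plain PQ here.
import numpy as np
from sklearn.cluster import KMeans

def build_lopq_index(X, n_cells=64, m=4, k=256):
    """Return (coarse quantizer, dict: cell -> (point ids, local codebooks, codes))."""
    coarse = KMeans(n_clusters=n_cells, n_init=4).fit(X)
    cells = coarse.predict(X)
    index = {}
    for c in range(n_cells):
        ids = np.where(cells == c)[0]
        if len(ids) == 0:
            continue
        residuals = X[ids] - coarse.cluster_centers_[c]
        # Local sub-quantizers trained on this cell's residuals only.
        codebooks = train_pq(residuals, m=m, k=min(k, len(ids)))
        codes = np.array([encode_pq(codebooks, r) for r in residuals])
        index[c] = (ids, codebooks, codes)
    return coarse, index
```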
[1] Jegou et al. Product quantization for nearest neighbor search. PAMI, 2011.
[2] Ge et al. Optimized product quantization for approximate nearest neighbor search. CVPR, 2013.
[3] Babenko and Lempitsky. The inverted multi-index. CVPR, 2012.
If $\mathcal{Z}$ is the set of residuals of data points quantized to some cell and $|\mathcal{C}^j| = k$ for $j \in \mathcal{M} = \{ 1, \dots, m \}$, we locally optimize both the space decomposition and the sub-quantizers per cell, using OPQ:
\begin{equation}
\begin{array}{rl}
\mathrm{minimize} & {\displaystyle \sum_{\mathbf{z} \in \mathcal{Z}} \min_{\hat{\mathbf{c}} \in \hat{\mathcal{C}}} \| \mathbf{z} - R\hat{\mathbf{c}} \|^2} \\
\mathrm{subject\ to} & \hat{\mathcal{C}} = \mathcal{C}^1 \times \dots \times \mathcal{C}^m \\
& R^{T}R = I,
\end{array}
\label{eq:opt}
\end{equation}
where minimization is over the orthogonal matrix $R$ and the sub-codebooks $\mathcal{C}^1, \dots, \mathcal{C}^m$.
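The objective of Eq. (1) can be evaluated directly, which is useful for checking a candidate $(R, \mathcal{C}^1, \dots, \mathcal{C}^m)$. The small numpy sketch below (names are ours; inputs are numpy arrays) exploits the fact that, since $R$ is orthogonal, $\|\mathbf{z} - R\hat{\mathbf{c}}\|^2 = \|R^T\mathbf{z} - \hat{\mathbf{c}}\|^2$, so the inner minimization over the product codebook decomposes per subspace.

```python
import numpy as np

def lopq_cell_distortion(Z, R, codebooks):
    """Sum over residuals z of min_{c_hat} ||z - R c_hat||^2, where c_hat ranges
    over C^1 x ... x C^m. Z: (n, d), R: (d, d) orthogonal, codebooks: m arrays (k, d/m)."""
    Zr = Z @ R                      # rows are R^T z, since R is orthogonal
    ds = codebooks[0].shape[1]
    total = 0.0
    for j, C in enumerate(codebooks):
        sub = Zr[:, j*ds:(j+1)*ds]
        # (n, k) squared distances between sub-residuals and sub-centroids
        d2 = ((sub[:, None, :] - C[None, :, :])**2).sum(-1)
        total += np.sum(np.min(d2, axis=1))
    return total
```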
Parametric solution: assuming a normal data distribution, minimize the theoretical distortion lower bound as a function of $R$ alone, via PCA alignment and eigenvalue allocation. Sub-quantizer optimization then follows as in PQ.
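The parametric solution can be sketched in a few lines (assuming numpy; the function name and details are ours, following the PCA-alignment plus eigenvalue-allocation recipe): compute the PCA basis of the cell's residuals, then greedily assign eigen-dimensions to the $m$ subspaces so that the products of eigenvalues, and hence the subspace variances, are balanced.

```python
# Sketch of the parametric rotation: PCA alignment + greedy eigenvalue allocation.
import numpy as np

def parametric_rotation(Z, m):
    """Return an orthogonal R (d x d); R^T z is the rotated, re-ordered residual."""
    d = Z.shape[1]
    assert d % m == 0
    ds = d // m
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Greedy eigenvalue allocation: give the next largest eigenvalue to the
    # non-full subspace with the smallest current log-product of eigenvalues.
    buckets = [[] for _ in range(m)]
    logprod = np.zeros(m)
    for i, lam in enumerate(eigvals):
        free = [b for b in range(m) if len(buckets[b]) < ds]
        b = min(free, key=lambda b: logprod[b])
        buckets[b].append(i)
        logprod[b] += np.log(max(lam, 1e-12))
    perm = [i for b in buckets for i in b]
    return eigvecs[:, perm]                      # columns: permuted principal axes
```

Sub-quantizers are then trained per subspace on the rotated residuals, exactly as in the PQ sketch above.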
Residual distributions are closer to normal, so the parametric solution is a better fit for LOPQ.
Overhead on top of IVFADC (resp. Multi-D-ADC):
Optimized source code for LOPQ is not publicly available. All resources found below are for demonstrative purposes and do not follow the exact search methodology presented in the paper. For fast and scalable ANN search with LOPQ, one may modify Artem Babenko's excellent implementation of the inverted multi-index [3].
Source code in MATLAB and C++ can be found here. This code is for demonstrative purposes only; precomputing projections for all queries is done only to facilitate parameter tuning and is suboptimal.
Source code in Python with distributed training in Spark can be found here.