Image and Video Analysis IVA

Inspired by the close relation between nearest neighbor search and clustering in high-dimensional spaces as well as the success of one helping to solve the other, we introduce a new paradigm where both problems are solved simultaneously. Our solution is recursive, not in the size of input data but in the number of dimensions. One result is a **dimenionality-recursive clustering (DRC)** algorithm that is tuned to small codebooks but does not need all data in memory at the same time and is practically constant in the data size. As a by-product, a tree structure performs either exact or approximate quantization on trained centroids, the latter being not very precise but extremely fast. Both schemes are referred to as **dimenionality-recursive quantization (DRQ)**.

We often visualize a clustering process in two dimensions as in Figure 1, where a number of centroids partition the underlying space into *Voronoi cells*. Even with $k$-means, which is arguably the fastest alternative at large scale, the cost is dominated by the assignment of data points to the nearest centroid. It is thus popular to solve this subproblem by approximate nearest neighbor search. But during clustering, one requires millions of queries to the very same points: the cluster centroids. In the two-dimensional discrete space, one may then envision solving first the inverse problem of computing a distance map on the entire two-dimensional grid, which could then respond to assignment queries by *lookup*.

But how about spaces of up to $128$ dimensions as in the case of SIFT descriptors? Unfortunately, the number of grid positions increases exponentially in the number of dimensions, which prevents us from visiting or even representing the entire space. This is exactly our contribution in this work: we use a two-dimensional discrete grid not just as an analogy but to actually solve clustering or search problems in higher-dimensional spaces. The key idea is that the grid actually represents a $2d$-dimensional space $S$. The two "dimensions" that we see in fact capture the discrete topology of two subspaces $S^L, S^R$, each of $d$ dimensions, that decompose $S$ into a *Cartesian product* $S = S^L \times S^R$.

Our clustering algorithm is a variant of $k$-means. We adopt a bottom-up description, from the one-dimensional base case to recursion in higher dimensions. While the former appears like a simple problem, it actually prepares the ground for the latter.

DRC Base case: one dimension

We assume we are **given**:

- a set $X$ of $N$ data points on real interval $I = [a,b)$
- a target number $K > 1$ of centroids

We uniformly partition $I$ into $B \gg K$ subintervals (bins) of length $\ell = (b-a) / B$ and we allocate each $x \in X$ to bin $r(x) = \left\lfloor{(x-a) / \ell}\right\rfloor \in \{0, \ldots, B-1\}$. If $Z = \{z_0, \ldots, z_{B-1}\}$ are the midpoints of subintervals, this is equivalent to *scalar quantization* via $h : I \to Z$ with
\begin{equation*}
x \mapsto h(x) = z_{r(x)} = a + \ell r(x) + \ell / 2.
\end{equation*}
We thus use $Z$ as a discrete representation of $I$.

We measure discrete distribution $f$ of data points by normalized histogram frequency $f_i = |X_i| / N$, where $X_i = \{ x \in X : r(x) = i \}$ is the set of points allocated to bin $i$. We then **initialize** by taking $K$ samples out of $Z$ with replacement, according to $f$. These samples $C = \{c_0, \ldots, c_{K-1}\}$ are the initial centroids.

Now, both at each $k$-means assignment step and as its final output, the optimal quantizer is a function $q : I \to C$ that maps to the nearest centroid: \begin{equation*} x \mapsto q(x) = \arg\min_{c \in C} \|x - c\|. \end{equation*} Instead, we only consider its restriction $q^* : Z \to C$ to the discrete domain $Z$, that is, we compute $q(z)$ and store as $q^*[z]$ for all $z \in Z$.

At each **assignment step**, we compute Voronoi cell $V_k = \{ z \in Z : q(z) = c_k \}$ for each centroid $c_k \in C$. This is easy in one dimension: If $m_k$ is the midpoint of $[c_{k-1}, c_k)$ for $k = 1, \ldots, K-1$ and $m_0 = a$, $m_K = b$, then
\begin{equation*}
V_k = Z \cap [m_k, m_{k+1})
\end{equation*}
for all $c_k \in C$. We then assign $q^*[z] \gets c_k$ for all $z \in V_k$. With one linear scan, we thus assign each bin midpoint to the nearest centroid, without referring to data points.

The **update step** consists of weighted averaging over Voronoi cells
\begin{equation*}
c_k \gets \sum_{i: z_i \in V_k} f_i z_i
\end{equation*}
for all $c_k \in C$. Again, we update centroids without referring to data points.

At **termination**, we approximately quantize each data point $x \in X$ by
\begin{equation*}
q(x) \simeq q^*[h(x)] \in C.
\end{equation*}
We also construct graph $G = \{C, E\}$ with edges $E = \{(c_{k-1}, c_k) : k = 1, \ldots, K-1\}$ between successive centroids as a neighborhood system over $I$.

DRC Recursion: from $n$ to $2n$ dimensions

Given a $2d$-dimensional space $S$, the key to recursion is to decompose it into Cartesian product $S^L \times S^R$ of $d$-dimensional subspaces $S^L, S^R$ and write each vector $x \in S$ as $x = (x^L, x^R)$ with projections $x^L \in S^L, x^R \in S^R$.

In this case, we assume we are **given**:

- a set $X$ of $N$ data points on interval $I = I^L \times I^R$ of $S$
- a target number $K > 1$ of centroids
- the sets of projections $X^L, X^R$ are clustered into $C^L, C^R$, each of $J$ centroids
- each projection $x^L$ (respectively, $x^R$) is quantized to $q^L(x^L) \in C^L$ (respectively, $q^R(x^R) \in C^R$)
- graphs $G^L = \{C^L, E^L\}$, $G^R = \{C^R, E^R\}$ representing neighborhood systems over $I^L, I^R$

Now, let $Z = C^L \times C^R$ be a grid of $B = J \times J$ points in $S$. Writing $Z = \{z_0, \ldots, z_{B-1}\}$, we again use $Z$ as a discrete representation of $I$. We again quantize points on the grid via $h : I \to Z$ with \begin{equation*} x \mapsto h(x) = (q^L(x^L), q^R(x^R)). \end{equation*}

**Initialization** is exactly the same as one dimension: we measure a discrete distribution $f$ on $Z$ and sample the initial centroids $C = \{c_0, \ldots, c_{K-1}\}$ according to $f$.

In the **assignment step**, we again compute $q(z)$ and store it as $q^*[z]$ for all $z \in Z$. This is much more difficult than in one dimension, and is carried out by our algorithm *product propagation*, a variant of fast marching. With time complexity $O(K^3)$, this is the bottleneck of the entire algorithm, although independent of $n$. The **update step** is exactly the same as in one dimension.

At **termination**, we quantize centroids to the nearest points on grid $Z$ as $c_k \gets h(c_k)$ for $c_k \in C$, and we again approximate
\begin{equation*}
q(x) \simeq q^*[h(x)] \in C
\end{equation*}
for all $x \in X$. We also compute graph $G = \{C, E\}$ once at final assignment step, as by-product of propagation, representing a neighborhood system over $I$.

Dimensionality-recursive quantization (DRQ)

The outcome of clustering is not only a set of centroids, a set of data labels, and a graph representing a neighborhood system. The recursive implementation also gives rise to a *tree structure*: each produced codebook is a *node* in the tree and an one-dimensional codebook is a *leaf*. In turn, this structure can recursively respond to approximate or exact nearest neighbor queries over its centroids. The former is used during $k$-means assignment, and the latter to quantize new data.

Approximate quantization

Given a point $x \in I$, we recursively compute $q(x)$ by delegating $q^L(x), q^R(x)$ to underlying codebooks if $d>1$: \begin{equation*} q(x) \simeq \left\{ \begin{array}{ll} q^*[a + \ell r(x) + \ell / 2], & d = 1 \\ q^*[q^L(x^L), q^R(x^R)], & d > 1 \end{array} \right. \end{equation*} The time complexity when $D=2^P$ is $O(D)$. This is because the tree is binary and has $D$ leaves and $D-1$ internal nodes, hence only $D$ scalar quantizations and $D-1$ lookups are needed. This scheme is not precise enough for NN search, but is fine for $k$-means assignment.

Exact quantization

In this case, we recursively compute squared Euclidean distance to all centroids, again by delegating distance computations to underlying codebooks if $d>1$. Given a new point $x=(x^L,x^R)$:

- if $d = 1$, we compute $\delta(x,c) = (x-c)^2$ for all $c \in C$.
- If $d > 1$, we first delegate computation of $\delta^L(x^L, z^L)$, $\delta^R(x^R, z^R)$, for all $z \in Z$. We then exhaustively minimize \begin{equation*} q(x) = \arg\min_{c \in C} \delta(x, c) \end{equation*} where \begin{equation*} \delta(x, z) = \delta^L(x^L, z^L) + \delta^R(x^R, z^R) \end{equation*} for any $x \in I, z \in Z$. So although computation is exhaustive in each node, it takes only one scalar addition per centroid.

The above computation is exact because centroids are stored for $d=1$ and quantized on the grid for $d>1$; and also because squared Euclidean distance is separable across subspaces. When $D=2^P$, the tree is binary with $P$ levels and $2^{P-p}$ nodes at level $p$ (at $2^p$ dimensions). The time complexity is then $O(\phi(\mathcal{K}))$ where \begin{equation*} \phi(\mathcal{K}) = \sum_{p=0}^P 2^{P-p} K_p, \end{equation*} assuming there are $K_p$ centroids at level $p$ and denoting $\mathcal{K} = \{ K_0, \ldots, K_P \}$. With appropriate selection of $K_p$ increasing with $p$, this is $O(K\log D)$ in contrast to $O(K_P 2^P) = O(KD)$ for naive search, so the gain increases with dimensionality.

Experiments

Our main contribution refers to off-line processes, i.e. dimensionality-recursive clustering (codebook training) and vector quantization. On-line applications like nearest neighbor search and image retrieval mainly serve as validation. We focus on the latter in this work.

We apply our methods to specific object retrieval and evaluate on public datasets Oxford buildings and Paris, containing $5062$ and $6412$ images, as well as $55$ queries each. We train codebooks on both datasets and evaluate retrieval performance on the same or on different datasets. At larger scale, we also use the additional $100$K distractor images provided with Oxford buildings. We use local features detected with the modified Hessian-affine detector and $128$-dimensional SIFT descriptors, which are the data points for our algorithms.

Clustering

We choose to focus on fourth order indices, that is, we decompose the $128$-dimensional SIFT descriptor space into $n=4$ $32$-dimensional subspaces. The size $K_p$ of each child codebook increases with the dimensionality $d=2^p$ of the underlying subspace. It is clearly seen in Table 1 that the training time depends explicitly on the codebook size at $d = 16$, which determines the size of the grid where the root codebook is trained.

This reveals that the bottleneck of the algorithm is distance propagation on the grid. At the given sizes, training is much more efficient than approximate $k$-means (AKM), e.g. $25\times$ faster for $4$K codebooks; however at larger sizes it becomes impractical. It is interesting that training is independent of the data size, $N$.

Vector quantization

Table 2 shows average vector quantization times. Our *exact* quantization comes at a speed that offers a practical alternative over other approximate schemes, and this is exactly what we have used to label images for indexing. For instance, FLANN takes $0.118$ms per point on average at the same setup for a $4$K codebook using $200$ checks, corresponding roughly to a precision of $98\%$. Our *approximate*, lookup-based scheme offers unprecedented speed, but its performance is quite low as revealed in Figure 4. Although this is not adequate for labeling images, it is still appropriate for training.

Image retrieval

For image retrieval we use a fourth-order multi-index with a $2^{12}=4$K codebook for each $32$-dimensional subspace of the $128$-dimensional SIFT descriptor space. The resulting codebook is extremely fine ($2^{48}$) so we only partially invert it, resulting in $24$ bits embedded in the index for each image descriptor. We soft-assign each query sub-vector to its $k=90$ nearest neighbors in the corresponding sub-codebook, resulting in a total of $90^4$ neighboring cells to be searched. Table 3 compares our solution to the state of the art on different combinations of training and test sets or distractors.

We clearly outperform most methods. It is also remarkable that searching on the same or on a different dataset than the training one has little impact, unlike most known methods. Perdoch et al. 2009 is superior at the Oxford5K/Oxford5K combination, but using a different test set is more important in general to avoid over-fitting behavior. Mikulik et al. 2012 is superior when using alternative words, but this method is not really comparable as it employs large scale learning over a different training set of millions of images using geometry.

Code

C++ source code is available, including DRC clustering and DRQ approximate/exact vector quantization. Please visit the DRVQ software page or download directly from github.

Publications

[ Abstract ] [ Bibtex ] [ PDF ]