Image and Video Analysis

Approximate Gaussian Mixtures for Large Scale Vocabularies


Figure 1. Estimating the number, population, position and extent of clusters on an 8-mode 2d Gaussian mixture sampled at 800 points, in just 3 iterations (iteration 2 not shown). Red circles: cluster centers; blue: two standard deviations. Clusters are initialized on 50 data points sampled at random, with initial $\sigma = 0.02$. Observe the 'space-filling' behavior of the two clusters on the left.


If function $q$ represents any component or cluster, we define the generalized responsibility $\hat{\gamma}_{ik} = \hat{\gamma}_k(p_i) \in [0,1]$ of component $k$ for component $i$, similar to responsibility $\gamma_{k}(x)$ of $k$ for point $x$: \begin{equation*} \gamma_k(x) = \frac{p_k(x)} {\sum_{j=1}^K p_j(x)} \qquad \rightarrow \qquad \hat{\gamma}_k(q) = \frac{\langle q,p_k \rangle} {\sum_{j=1}^K \langle q,p_j \rangle}, \label{eq:gamma-gen} \end{equation*} \\ where $p_k(x) = \pi_k\mathcal{N}(x|\mu_k,\sigma_k^2)$ and the $L^2$ inner product $\langle p_i,p_k \rangle = \pi_i\pi_k \mathcal{N}(\mathbf{\mu}_i|\mathbf{\mu}_k,(\sigma_i^2+\sigma_k^2)\mathbf{I})$ measures the overlap of components $p_i,p_k$ under the spherical Gaussian model.

If $\hat{\gamma}_{ii}$ is the responsibility of component $i$ for itself and given a set $\mathcal{K}$ of components and one component $i \notin \mathcal{K}$, we define the responsibility $\rho_{i,\mathcal{K}} \in [0,1]$ of component $i$ for itself relative to $\mathcal{K}$ as \begin{equation*} \rho_{i,\mathcal{K}} = \frac{\hat{\gamma}_{ii}} {\hat{\gamma}_{ii} + \sum_{j \in \mathcal{K}} \hat{\gamma}_{ij}} = \frac{\|p_i\|^2} {\|p_i\|^2 + \sum_{j \in \mathcal{K}} \langle p_i,p_j \rangle}. \label{eq:rho} \end{equation*}

If $\rho_{i,\mathcal{K}}$ is large, component $i$ can `explain' itself better than set $\mathcal{K}$ as a whole; otherwise $i$ appears to be redundant. So, if $\mathcal{K}$ represents the components we have decided to keep so far, it makes sense to purge component $i$ if $\rho_{i,\mathcal{K}}$ drops below overlap threshold $\tau \in [0,1]$.


We overestimate the extent of each component as much as this does not overlap with its neighboring components.

The re-estimation equation for the covariance of each component can be decomposed into $D\sigma_k^2 = \frac{\underline{N}_k}{N_k} \underline{\mathrm{\Sigma}}_k + \frac{\overline{N}_k}{N_k} \overline{\mathrm{\Sigma}}_k$ where the inner sum $\underline{\mathrm{\Sigma}}_k$ expresses a weighted average distance from $\mathbf{\mu}_k$ of data points that are better `explained' by component $k$, hence fits the underlying data of the corresponding cluster.

We bias the weighted sum towards the outer sum $\overline{\mathrm{\Sigma}}_k$, and the re-estimation equation becomes $ D\sigma_k^2 = w_k \underline{\mathrm{\Sigma}}_k + (1-w_k) \overline{\mathrm{\Sigma}}_k $, where $w_k = \frac{\underline{N}_k}{N_k}(1-\lambda)$ and $\lambda \in [0,1]$ is an expansion factor.

Figure 2. Component expansion for iterations 1, 2 and 3 of the example of Figure 1. Blue circles: two standard deviations with expansion and $\lambda = 0.25$, as in Figure 1; magenta: without expansion; dashed green: inner and outer sum contributions..


We have conducted experiments on two publicly available datasets, namely Oxford Buildings and world cities. The annotated part of the latter is referred to as the Barcelona dataset and consists of 927 images grouped into 17 landmarks in Barcelona, again with 5 queries for each. The distractor set of world cities consists of 2M images from different cities; we use the first one million as distractors in our experiments.


We choose Barcelona for parameter tuning because, combined with descriptor assignment, it would be prohibitive to use a larger dataset. It turns out however, that AGM outperforms the state of the art at much larger scale with the same parameters. The behavior of EGM is controlled by expansion factor $\lambda$ and overlap threshold $\tau$; AGM is also controlled by memory m that determines ANN precision and cost. Figure 3 (left) shows mAP against number of iterations for different values of $\lambda$ and Figure 3 (right) shows mAP versus overlap threshold $\tau$ for different iterations.

Figure 3. Barcelona-specific parameter tuning. (left) mAP performance versus iteration during learning, for varying expansion factor $\lambda$ and fixed $\tau = 0.5$. (right) mAP versus overlap threshold $\tau$ for different iterations, with fixed $\lambda = 0.2$.


Figure 4 (left) compares all methods on convergence for m = 50 and m = 100, where AKM/RAKM are trained for 80K vocabularies, found to be optimal. For large scale image retrieval, we train generic vocabularies on an independent dataset of 15K images and 6.5M descriptors, and evaluate on Oxford buildings in the presence of up to one million distractors from world cities. Keeping the optimal values found through the tuning experiments on the much smaller Barcelona dataset, i.e. $\lambda$ = 0.2, $\tau$ = 0.55, we extend the experiment up to 1M distractors for the best RAKM 550K and the automatically inferred AGM 857K vocabularies under query-side soft-assignment, as depicted in Figure 4 (right).

Figure 4. (Left) Barcelona-specific mAP versus learning time, measured in vector operations (vop) per data point, for AGM, AKM and RAKM under varying approximation levels, measured in FLANN checks. There are 5 measurements on each curve, corresponding, from left to right, to iteration 5, 10, 20, 30 and 40. (Right) Oxford buildings generic mAP in the presence of up to 1 million distractor images for AGM and RAKM, using query-side soft assignment with 1, 3 and 5 NNs.

The query time of RAKM (AGM) is 354ms (342ms) on average for the full 1M distractor index, so there appears to be no issue with the balance of cluster population. Spatial verification and re-ranking further increases mAP performance to 0.387 (0.411) for RAKM (AGM) at an additional cost of 323ms (328ms) using Hough pyramid matching on a shortlist of 200 images at 1M distractors. Variants of Hamming embedding, advanced scoring or query expansion are complementary, hence expected to further boost performance.


A Matlab script with a demo of EGM can be found here. A C++ binary will soon be released.



Y. Avrithis, Y. Kalantidis. Approximate Gaussian Mixtures for Large Scale Vocabularies. In Proceedings of European Conference on Computer Vision (ECCV 2012), Florence, Italy, October 2012.
[ Abstract ]
[ Bibtex ] [ PDF ]

PhD Thesis

Y. Kalantidis. Clustering and nearest neighbor methods for visual search. PhD Thesis, National Technical University of Athens, November 2014.
[ Abstract ]
[ Bibtex ] [ PDF ]