Purge
Letting function $q$ represent any component or cluster, we define the generalized responsibility $\hat{\gamma}_{ik} = \hat{\gamma}_k(p_i) \in [0,1]$ of component $k$ for component $i$, by analogy to the responsibility $\gamma_{k}(x)$ of $k$ for point $x$: \begin{equation*} \gamma_k(x) = \frac{p_k(x)} {\sum_{j=1}^K p_j(x)} \qquad \rightarrow \qquad \hat{\gamma}_k(q) = \frac{\langle q,p_k \rangle} {\sum_{j=1}^K \langle q,p_j \rangle}, \label{eq:gamma-gen} \end{equation*} where $p_k(x) = \pi_k\mathcal{N}(x|\mathbf{\mu}_k,\sigma_k^2\mathbf{I})$ and the $L^2$ inner product $\langle p_i,p_k \rangle = \pi_i\pi_k \mathcal{N}(\mathbf{\mu}_i|\mathbf{\mu}_k,(\sigma_i^2+\sigma_k^2)\mathbf{I})$ measures the overlap of components $p_i,p_k$ under the spherical Gaussian model.
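As an illustration, both the overlap and the generalized responsibility can be computed in closed form; the following NumPy sketch uses our own function names and array layout (mixing weights in \texttt{pi}, means as rows of \texttt{mu}, spherical variances in \texttt{var}) and is not the released implementation.
\begin{verbatim}
import numpy as np

def overlap(pi_i, mu_i, var_i, pi_k, mu_k, var_k):
    # L2 inner product <p_i, p_k> of two weighted spherical Gaussians
    # pi * N(x | mu, var * I), with closed form
    # pi_i * pi_k * N(mu_i | mu_k, (var_i + var_k) * I).
    d = mu_i.shape[0]
    s = var_i + var_k
    dist2 = np.sum((mu_i - mu_k) ** 2)
    return pi_i * pi_k * (2.0 * np.pi * s) ** (-d / 2.0) * np.exp(-dist2 / (2.0 * s))

def generalized_responsibility(i, pi, mu, var):
    # hat{gamma}_k(p_i) for all k: overlap of component i with every
    # component, normalized to sum to one over k.
    ov = np.array([overlap(pi[i], mu[i], var[i], pi[k], mu[k], var[k])
                   for k in range(len(pi))])
    return ov / ov.sum()
\end{verbatim}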
Given a set $\mathcal{K}$ of components and a component $i \notin \mathcal{K}$, and writing $\hat{\gamma}_{ii}$ for the responsibility of component $i$ for itself, we define the responsibility $\rho_{i,\mathcal{K}} \in [0,1]$ of component $i$ for itself relative to $\mathcal{K}$ as \begin{equation*} \rho_{i,\mathcal{K}} = \frac{\hat{\gamma}_{ii}} {\hat{\gamma}_{ii} + \sum_{j \in \mathcal{K}} \hat{\gamma}_{ij}} = \frac{\|p_i\|^2} {\|p_i\|^2 + \sum_{j \in \mathcal{K}} \langle p_i,p_j \rangle}. \label{eq:rho} \end{equation*}
If $\rho_{i,\mathcal{K}}$ is large, component $i$ can `explain' itself better than the set $\mathcal{K}$ as a whole; otherwise $i$ appears redundant. So, if $\mathcal{K}$ represents the components we have decided to keep so far, it makes sense to purge component $i$ when $\rho_{i,\mathcal{K}}$ drops below an overlap threshold $\tau \in [0,1]$.
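The purge step then amounts to a single pass over the components. The sketch below visits components in order of decreasing mixing weight; this ordering is our choice for illustration and is not prescribed above.
\begin{verbatim}
import numpy as np

def purge(pi, mu, var, tau=0.55):
    # Keep component i only if it explains itself better than the set K
    # of components kept so far, i.e.
    # rho_{i,K} = ||p_i||^2 / (||p_i||^2 + sum_{j in K} <p_i, p_j>) >= tau.
    d = mu.shape[1]
    # pairwise L2 inner products <p_i, p_j> of the weighted spherical Gaussians
    s = var[:, None] + var[None, :]                             # sigma_i^2 + sigma_j^2
    dist2 = np.sum((mu[:, None, :] - mu[None, :, :]) ** 2, -1)  # ||mu_i - mu_j||^2
    G = (pi[:, None] * pi[None, :]
         * (2.0 * np.pi * s) ** (-d / 2.0) * np.exp(-dist2 / (2.0 * s)))

    kept = []
    for i in np.argsort(-pi):            # visit large components first (our choice)
        rho = G[i, i] / (G[i, i] + sum(G[i, j] for j in kept))
        if rho >= tau:                   # i explains itself well enough: keep it
            kept.append(i)
    return sorted(kept)
\end{verbatim}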
Expand
We overestimate the extent of each component as long as it does not overlap with its neighboring components.
The re-estimation equation for the covariance of each component can be decomposed into $D\sigma_k^2 = \frac{\underline{N}_k}{N_k} \underline{\mathrm{\Sigma}}_k + \frac{\overline{N}_k}{N_k} \overline{\mathrm{\Sigma}}_k$, where the inner sum $\underline{\mathrm{\Sigma}}_k$ expresses a weighted average squared distance from $\mathbf{\mu}_k$ of the data points that are better `explained' by component $k$, and hence fits the underlying data of the corresponding cluster, while the outer sum $\overline{\mathrm{\Sigma}}_k$ collects the remaining points, which are better explained by other components.
We bias the weighted sum towards the outer sum $\overline{\mathrm{\Sigma}}_k$, and the re-estimation equation becomes $ D\sigma_k^2 = w_k \underline{\mathrm{\Sigma}}_k + (1-w_k) \overline{\mathrm{\Sigma}}_k $, where $w_k = \frac{\underline{N}_k}{N_k}(1-\lambda)$ and $\lambda \in [0,1]$ is an expansion factor.
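For completeness, the biased re-estimation can be sketched as follows, assuming that a point is `better explained' by component $k$ when $k$ has the largest responsibility for it; the function name and array layout are again illustrative only.
\begin{verbatim}
import numpy as np

def expanded_variance(X, gamma, mu, lam=0.2):
    # Expansion-biased M-step for the spherical variances.
    # X: (N, D) data, gamma: (N, K) responsibilities, mu: (K, D) means.
    N, D = X.shape
    K = mu.shape[0]
    best = np.argmax(gamma, axis=1)                  # component best explaining each point
    var = np.empty(K)
    for k in range(K):
        d2 = np.sum((X - mu[k]) ** 2, axis=1)        # squared distances to mu_k
        g = gamma[:, k]
        inner = (best == k)
        N_in, N_out = g[inner].sum(), g[~inner].sum()
        N_k = N_in + N_out
        Sig_in = (g[inner] * d2[inner]).sum() / max(N_in, 1e-12)      # inner sum
        Sig_out = (g[~inner] * d2[~inner]).sum() / max(N_out, 1e-12)  # outer sum
        w = (N_in / N_k) * (1.0 - lam)               # bias towards the outer sum
        var[k] = (w * Sig_in + (1.0 - w) * Sig_out) / D
    return var
\end{verbatim}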
We have conducted experiments on two publicly available datasets, namely Oxford Buildings and world cities. The annotated part of the latter is referred to as the Barcelona dataset and consists of 927 images grouped into 17 landmarks in Barcelona, with 5 queries for each landmark. The distractor set of world cities consists of 2M images from different cities; we use the first one million as distractors in our experiments.
Tuning
We choose Barcelona for parameter tuning because tuning, combined with descriptor assignment, would be prohibitive on a larger dataset. It turns out, however, that AGM outperforms the state of the art at much larger scale with the same parameters. The behavior of EGM is controlled by the expansion factor $\lambda$ and the overlap threshold $\tau$; AGM is additionally controlled by memory $m$, which determines ANN precision and cost. Figure 3 (left) shows mAP against the number of iterations for different values of $\lambda$, and Figure 3 (right) shows mAP versus overlap threshold $\tau$ for different iterations.
Comparisons
Figure 4 (left) compares all methods on convergence for $m = 50$ and $m = 100$, where AKM/RAKM are trained for 80K vocabularies, which were found to be optimal. For large-scale image retrieval, we train generic vocabularies on an independent dataset of 15K images and 6.5M descriptors, and evaluate on Oxford Buildings in the presence of up to one million distractors from world cities. Keeping the optimal values found in the tuning experiments on the much smaller Barcelona dataset, i.e. $\lambda = 0.2$, $\tau = 0.55$, we extend the experiment up to 1M distractors for the best RAKM 550K vocabulary and the automatically inferred AGM 857K vocabulary under query-side soft-assignment, as depicted in Figure 4 (right).
The query time of RAKM (AGM) is 354ms (342ms) on average for the full 1M-distractor index, so there appears to be no issue with the balance of cluster populations. Spatial verification and re-ranking further increase mAP to 0.387 (0.411) for RAKM (AGM), at an additional cost of 323ms (328ms), using Hough pyramid matching on a shortlist of 200 images at 1M distractors. Variants of Hamming embedding, advanced scoring, or query expansion are complementary and hence expected to further boost performance.
A Matlab script with a demo of EGM can be found here. A C++ binary will soon be released.