Image and Video Analysis IVA

Overview

**Approximate Gaussian Mixtures (AGM):**a clustering method that combines the flexibility of Gaussian mixtures with the scaling properties needed to construct visual vocabularies for image retrieval. The algorithm can*dynamically*estimate the number of clusters. An example on 2d data is shown in Figure 1.**Approximate:**Keep a fixed number $m$ of nearest neighbors per data point across iterations so that we: (a) have enough information for an*approximate Gaussian mixture*model and (b) speed-up the nearest neighbor search process.**Purge:**Initialize with all data points as cluster centers and purge them as necessary using an overlap criterion on neighboring clusters.**Expand:**Clusters neighboring to the ones being purged expand towards empty space, boosting convergence rate.**Algorithm:**A modification of EM, where (a) a P-step (purge) is interleaved with the E- and M- steps at each iteration; (b) the E-step is approximate and incremental (N-step); (c) $\sigma$ is over-estimated at the M-step (expand).

Purge

If function $q$ represents any component or cluster, we define the *generalized responsibility* $\hat{\gamma}_{ik} = \hat{\gamma}_k(p_i) \in [0,1]$ of component $k$ for *component* $i$, similar to responsibility $\gamma_{k}(x)$ of $k$ for point $x$:
\begin{equation*}
\gamma_k(x) =
\frac{p_k(x)}
{\sum_{j=1}^K p_j(x)} \qquad \rightarrow \qquad
\hat{\gamma}_k(q) =
\frac{\langle q,p_k \rangle}
{\sum_{j=1}^K \langle q,p_j \rangle},
\label{eq:gamma-gen}
\end{equation*} \\
where $p_k(x) = \pi_k\mathcal{N}(x|\mu_k,\sigma_k^2)$ and the $L^2$ inner product $\langle p_i,p_k \rangle = \pi_i\pi_k
\mathcal{N}(\mathbf{\mu}_i|\mathbf{\mu}_k,(\sigma_i^2+\sigma_k^2)\mathbf{I})$ measures the *overlap* of components $p_i,p_k$ under the spherical Gaussian model.

If $\hat{\gamma}_{ii}$ is the responsibility of component $i$ for *itself* and given a set $\mathcal{K}$ of components and one component $i \notin \mathcal{K}$, we define the responsibility $\rho_{i,\mathcal{K}} \in [0,1]$ of component $i$ for itself *relative to* $\mathcal{K}$ as
\begin{equation*}
\rho_{i,\mathcal{K}} = \frac{\hat{\gamma}_{ii}}
{\hat{\gamma}_{ii} + \sum_{j \in \mathcal{K}} \hat{\gamma}_{ij}}
= \frac{\|p_i\|^2}
{\|p_i\|^2 + \sum_{j \in \mathcal{K}} \langle p_i,p_j \rangle}.
\label{eq:rho}
\end{equation*}

If $\rho_{i,\mathcal{K}}$ is large, component $i$ can `explain' itself better than set $\mathcal{K}$ *as a whole*; otherwise $i$ appears to be redundant. So, if $\mathcal{K}$ represents the components we have decided to *keep* so far, it makes sense to purge component $i$ if $\rho_{i,\mathcal{K}}$ drops below *overlap threshold* $\tau \in [0,1]$.

Expand

We *overestimate* the extent of each component as much as this does not overlap with its neighboring components.

The re-estimation equation for the covariance of each component can be decomposed into $D\sigma_k^2 = \frac{\underline{N}_k}{N_k} \underline{\mathrm{\Sigma}}_k + \frac{\overline{N}_k}{N_k} \overline{\mathrm{\Sigma}}_k$ where the *inner sum* $\underline{\mathrm{\Sigma}}_k$ expresses a weighted average distance from $\mathbf{\mu}_k$ of data points that are better `explained' by component $k$, hence fits the underlying data of the corresponding cluster.

We *bias* the weighted sum towards the *outer sum* $\overline{\mathrm{\Sigma}}_k$, and the re-estimation equation becomes
$
D\sigma_k^2 = w_k \underline{\mathrm{\Sigma}}_k + (1-w_k) \overline{\mathrm{\Sigma}}_k
$, where $w_k = \frac{\underline{N}_k}{N_k}(1-\lambda)$ and $\lambda \in [0,1]$ is an *expansion factor*.

Experiments

We have conducted experiments on two publicly available datasets, namely Oxford Buildings and world cities. The annotated part of the latter is referred to as the Barcelona dataset and consists of 927 images grouped into 17 landmarks in Barcelona, again with 5 queries for each. The distractor set of world cities consists of 2M images from different cities; we use the first one million as distractors in our experiments.

Tuning

We choose Barcelona for parameter tuning because, combined with descriptor assignment, it would be prohibitive to use a larger dataset. It turns out however, that AGM outperforms the state of the art at much larger scale with the same parameters. The behavior of EGM is controlled by expansion factor $\lambda$ and overlap threshold $\tau$; AGM is also controlled by memory m that determines ANN precision and cost. Figure 3 (left) shows mAP against number of iterations for different values of $\lambda$ and Figure 3 (right) shows mAP versus overlap threshold $\tau$ for different iterations.

Comparisons

Figure 4 (left) compares all methods on convergence for m = 50 and m = 100, where AKM/RAKM are trained for 80K vocabularies, found to be optimal.
For large scale image retrieval, we train generic vocabularies on an independent dataset of 15K images and 6.5M descriptors, and evaluate on Oxford
buildings in the presence of up to one million distractors from world cities. Keeping the optimal values found through the tuning experiments on the much smaller Barcelona dataset, *i.e.* $\lambda$ = 0.2, $\tau$ = 0.55, we extend the experiment up to 1M distractors for the best RAKM 550K and the automatically inferred AGM 857K vocabularies under query-side soft-assignment, as depicted in Figure 4 (right).

The query time of RAKM (AGM) is 354ms (342ms) on average for the full 1M distractor index, so there appears to be no issue with the balance of cluster population. Spatial verification and re-ranking further increases mAP performance to 0.387 (0.411) for RAKM (AGM) at an additional cost of 323ms (328ms) using Hough pyramid matching on a shortlist of 200 images at 1M distractors. Variants of Hamming embedding, advanced scoring or query expansion are complementary, hence expected to further boost performance.

Code

A Matlab script with a demo of EGM can be found here. A C++ binary will soon be released.

Publications

[ Abstract ] [ Bibtex ] [ PDF ]

[ Abstract ] [ Bibtex ] [ PDF ]