New applications that exploit the huge data volume in community photo collections are emerging every day, and visual image search is therefore becoming increasingly important. In this thesis we propose clustering- and nearest neighbor-based improvements for visual image search. Clustering is performed either in feature space or in image space, i.e. in high-dimensional vector spaces or in metric spaces, respectively.
We first introduce a clustering method that combines the flexibility of Gaussian mixtures with the scaling properties needed to construct visual vocabularies for image retrieval. It is a variant of expectation-maximization that can converge rapidly while dynamically estimating the number of components. We employ approximate nearest neighbor search to speed up the E-step and exploit its iterative nature to make search incremental, boosting both speed and precision. We achieve superior performance in large-scale retrieval, while being as fast as the best known approximate k-means algorithm.
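To make the E-step idea concrete, the following minimal sketch (not the thesis implementation) shows an EM loop in which each point's responsibilities are computed only over its nearest components, and components whose weight collapses are purged so the number of components shrinks dynamically. The brute-force neighbor search stands in for a real approximate nearest neighbor index, the isotropic-covariance model is a simplifying assumption, and all names (approx_gmm_em, purge_thresh, m) are illustrative.

import numpy as np

def approx_gmm_em(X, K=50, m=5, iters=10, purge_thresh=1e-4, seed=0):
    """Toy EM for an isotropic Gaussian mixture with a restricted E-step.

    Responsibilities are computed over each point's m nearest components
    (brute force here; an ANN index would replace this at scale), and
    components whose mixing weight drops below purge_thresh are removed,
    so the number of components is estimated dynamically.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)]      # component means
    var = np.full(K, X.var())                    # isotropic variances
    pi = np.full(K, 1.0 / K)                     # mixing weights

    for _ in range(iters):
        K = len(pi)
        # "ANN" step: nearest m components per point (brute force here)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        nbrs = np.argpartition(d2, min(m, K) - 1, axis=1)[:, :m]

        # E-step restricted to the retrieved neighbors
        rows = np.arange(n)[:, None]
        logp = (-0.5 * d2[rows, nbrs] / var[nbrs]
                - 0.5 * d * np.log(var[nbrs]) + np.log(pi[nbrs]))
        logp -= logp.max(axis=1, keepdims=True)
        w = np.exp(logp)
        R = np.zeros((n, K))
        R[rows, nbrs] = w / w.sum(axis=1, keepdims=True)

        # M-step
        Nk = R.sum(axis=0) + 1e-12
        mu = (R.T @ X) / Nk[:, None]
        var = np.array([(R[:, k] * ((X - mu[k]) ** 2).sum(1)).sum() / (d * Nk[k])
                        for k in range(K)]) + 1e-6
        pi = Nk / n

        # dynamically drop starved components
        keep = pi > purge_thresh
        mu, var, pi = mu[keep], var[keep], pi[keep] / pi[keep].sum()

    return mu, var, pi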
We then present our locally optimized product quantization scheme, an approximate nearest neighbor search method that locally optimizes a product quantizer per cell after clustering the data in the original space. Combined with a multi-index, its performance is unprecedented and sets the new state of the art on a billion-scale dataset. At the same time, our approach enjoys query times on the order of a few milliseconds, becoming comparable in speed even to hashing-based approaches.
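The following simplified sketch illustrates the per-cell training idea only: a coarse quantizer assigns points to cells, and a separate product quantizer is then trained on the residuals of every cell, so each cell gets codebooks adapted to its local data. The per-cell rotation of the actual scheme is omitted, SciPy's k-means stands in for large-scale training, and the function and parameter names (train_locally_optimized_pq, n_cells, n_sub, n_codes) are illustrative, not those of the thesis implementation.

import numpy as np
from scipy.cluster.vq import kmeans2

def train_locally_optimized_pq(X, n_cells=64, n_sub=4, n_codes=256, seed=0):
    """Sketch of per-cell product quantization on coarse-cell residuals."""
    np.random.seed(seed)
    d = X.shape[1]
    assert d % n_sub == 0
    ds = d // n_sub

    # coarse quantizer over the original space
    coarse, labels = kmeans2(X, n_cells, iter=20, minit='++')

    local_codebooks = {}
    for c in range(n_cells):
        pts = X[labels == c]
        if len(pts) < n_codes:        # skip tiny cells in this toy sketch
            continue
        residuals = pts - coarse[c]
        # one sub-quantizer per residual sub-vector of this cell
        books = []
        for s in range(n_sub):
            sub = residuals[:, s * ds:(s + 1) * ds]
            cb, _ = kmeans2(sub, n_codes, iter=20, minit='++')
            books.append(cb)
        local_codebooks[c] = books
    return coarse, local_codebooks

A database vector would then be encoded by its coarse cell index plus, for each sub-vector of its residual, the index of the nearest code in that cell's local codebook.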
We next focus on large community photo collections. Most applications for such collections focus on popular subsets, e.g. images containing landmarks or associated with Wikipedia articles. In this thesis we are concerned with the problem of accurately finding the location where a photo was taken without needing any metadata, that is, solely by its visual content. We also recognize landmarks where applicable, automatically linking them to Wikipedia. We show that the time is right for automating the geo-tagging process, and we show how this can work at large scale. In doing so, we do exploit redundancy of content in popular locations—but unlike most existing solutions, we do not restrict ourselves to landmarks. In other words, we can compactly represent the visual content of the thousands of images depicting e.g. the Parthenon and still retrieve any single, isolated, non-landmark image like a house or graffiti on a wall. Starting from an existing, geo-tagged dataset, we cluster images into sets of different views of the same scene. This is a very efficient, scalable, and fully automated mining process. We then align all views in a set to one reference image and construct a 2D scene map. Our indexing scheme operates directly on scene maps. We evaluate our solution on a challenging one million urban image dataset and provide public access to our service through our online application, VIRaL.
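As a rough illustration of scene-map construction, the sketch below assumes that scene clustering, feature matching, and homography estimation have already been performed; it simply projects every view's local features into the reference frame and de-duplicates them into a single compact map that can be indexed in place of the individual images. The data layout and names (build_scene_map, features_per_image, homographies) are hypothetical, not the exact pipeline of the thesis.

import numpy as np

def build_scene_map(features_per_image, homographies, ref_idx=0):
    """Sketch: merge aligned views into one 2D scene map.

    features_per_image[i] is a list of (x, y, visual_word) tuples for view i;
    homographies[i] is a 3x3 matrix mapping view i into the reference frame
    (assumed precomputed by matching and geometric verification).
    """
    scene_map = []
    for i, feats in enumerate(features_per_image):
        H = np.eye(3) if i == ref_idx else homographies[i]
        for x, y, word in feats:
            p = H @ np.array([x, y, 1.0])
            scene_map.append((p[0] / p[2], p[1] / p[2], word))
    # keep one feature per (visual word, quantized position) in the reference frame
    dedup = {(word, round(x), round(y)): (x, y, word)
             for x, y, word in scene_map}
    return list(dedup.values())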
The thesis concludes with two chapters. The first summarizes other approaches to visual search and related applications, such as geometry indexing, logo detection, and clothing recognition, while the second presents conclusions and possible future directions.