Image and Video Analysis



We consider a family of metrics to compare images based on their local descriptors. This family encompasses the VLAD descriptor and matching techniques such as Hamming Embedding. Bridging these approaches leads us to propose a match kernel that takes the best of existing techniques by combining an aggregation procedure with a selective match kernel. Finally, the representation underpinning this kernel is approximated, yielding large-scale image search that is both precise and scalable, as shown by our experiments on several benchmarks.
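The combination described above, aggregating residual descriptors per visual word and scoring the words common to two images through a selective non-linear function, can be illustrated with a minimal sketch. The function names, the exponent `alpha` and the threshold `tau` are illustrative choices, not the exact formulation or parameters of the paper:

```python
import numpy as np

def aggregate_residuals(descriptors, assignments, centroids):
    """Aggregate the residuals of all descriptors assigned to each
    visual word into a single normalized vector per word (VLAD-style)."""
    agg = {}
    for d, w in zip(descriptors, assignments):
        agg[w] = agg.get(w, 0) + (d - centroids[w])
    for w, v in agg.items():
        agg[w] = v / (np.linalg.norm(v) + 1e-12)
    return agg

def selective_similarity(u, v, alpha=3.0, tau=0.0):
    """Selective function: emphasize strong matches via the exponent
    alpha; similarities at or below tau contribute nothing."""
    s = float(np.dot(u, v))
    return np.sign(s) * abs(s) ** alpha if s > tau else 0.0

def match_kernel(agg_a, agg_b, alpha=3.0, tau=0.0):
    """Sum selective similarities over visual words common to both images."""
    return sum(selective_similarity(agg_a[w], agg_b[w], alpha, tau)
               for w in agg_a.keys() & agg_b.keys())
```

Aggregation makes the per-word comparison a single dot product, while the selective function suppresses the many weak, noisy matches that would otherwise dominate the score.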


We propose a query expansion technique for image search in which an enriched representation of the query is obtained by exploiting the binary representation offered by the Hamming Embedding image matching approach. The initial local descriptors are refined by aggregating those of the database, while new descriptors are produced from the images that are deemed relevant. In contrast to previous query expansion methods, the technique is effective even without using any geometry.
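One plausible reading of refining a query's binary signature by aggregating matched database signatures is a bitwise vote; the sketch below illustrates that idea only. The `refine_signature` helper and its tie-breaking rule are our illustration, not the exact aggregation used in the paper:

```python
import numpy as np

def refine_signature(query_sig, matched_sigs):
    """Refine one query binary signature (0/1 vector) by a bitwise
    majority vote over the database signatures matched to it.
    Including the query signature biases ties toward the original bit."""
    votes = np.vstack([query_sig] + list(matched_sigs))
    return (votes.mean(axis=0) >= 0.5).astype(np.uint8)
```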


We exploit self-similarities, symmetries and repeating patterns to select features within a single image. On a dataset of unique views of buildings or urban scenes, in the presence of one million distractors of similar nature, we achieve the same performance as the full feature set with only a small fraction of its index size. Our best solution is linear in the number of correspondences, with practical running times of just a few milliseconds.
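As a minimal sketch of selecting features by within-image repetition, one could keep only those features whose descriptor closely matches another feature of the same image. The function name and threshold are hypothetical, and the actual selection criteria of the work (involving symmetries as well) are richer than this:

```python
import numpy as np

def select_repeating(descriptors, thresh=0.5):
    """Keep the indices of features whose descriptor has another
    descriptor of the *same* image within `thresh` (evidence of a
    repeating pattern or self-similarity)."""
    D = np.asarray(descriptors, float)
    d = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # ignore self-matches
    return np.where(d.min(axis=1) < thresh)[0]
```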


We present a spatial matching model that is flexible, allowing non-rigid motion and multiple matching surfaces or objects, yet fast enough to perform re-ranking in large-scale image retrieval. Correspondences between local features are mapped to the space of geometric transformations, and a histogram pyramid is then used to compute a similarity measure based on the density of correspondences. Our model enforces a one-to-one mapping and is linear in the number of correspondences. Applied to image retrieval, it yields superior performance and a dramatic speed-up compared to the state of the art.
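As an illustration of mapping correspondences to a transformation space and scoring by density, here is a minimal sketch restricted to 2-D translations. The one-to-one mapping constraint and the full transformation parameters are omitted for brevity, and all names and constants are hypothetical:

```python
import numpy as np

def pyramid_similarity(translations, levels=3, extent=16.0):
    """Score a set of correspondences, each mapped to a 2-D translation,
    by how densely they group in a histogram pyramid: correspondences
    that share a bin at a fine level count more than those that only
    meet at a coarse level."""
    pts = np.asarray(translations, float)
    score = 0.0
    for level in range(levels):              # level 0 = finest grid
        cell = extent / 2 ** (levels - level)
        bins = {}
        for p in pts:
            key = tuple(np.floor(p / cell).astype(int))
            bins[key] = bins.get(key, 0) + 1
        # each correspondence is grouped with (c - 1) others in its bin;
        # coarser levels are down-weighted by half per level
        score += sum(c * (c - 1) for c in bins.values()) / 2 ** level
    return score
```

Binning is a constant number of operations per correspondence and level, which is what keeps such a model linear in the number of correspondences.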


State-of-the-art data mining and image retrieval in community photo collections typically focus on popular subsets, e.g. images containing landmarks or associated with Wikipedia articles. We propose an image clustering scheme that, seen as vector quantization, compresses a large corpus of images by grouping visually consistent ones while providing a guaranteed distortion bound. This allows us, for instance, to represent the visual content of the thousands of images depicting the Parthenon in just a few dozen scene maps, and still be able to retrieve any single, isolated, non-landmark image, like a house or a graffiti on a wall.
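A quantizer with a guaranteed distortion bound can be sketched with a simple greedy, leader-style clustering in which the bound holds by construction. This is an illustrative simplification under assumed names, not the paper's actual algorithm:

```python
import numpy as np

def leader_cluster(vectors, bound):
    """Greedy quantization with a guaranteed distortion bound: each
    vector joins the first existing center within `bound`, otherwise it
    starts a new cluster, so no vector is ever farther than `bound`
    from its assigned center."""
    centers, assign = [], []
    for v in np.asarray(vectors, float):
        for i, c in enumerate(centers):
            if np.linalg.norm(v - c) <= bound:
                assign.append(i)
                break
        else:
            centers.append(v)
            assign.append(len(centers) - 1)
    return np.array(centers), assign
```

The trade-off is the usual one in vector quantization: a smaller bound yields more clusters (less compression) but tighter visual consistency within each cluster.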


We present a new approach to image indexing and retrieval, which integrates appearance with global image geometry in the indexing process, while remaining robust against viewpoint change, photometric variations, occlusion, and background clutter. Each image is represented by a collection of feature maps, and RANSAC-like matching is reduced to a number of set intersections. We extend min-wise independent permutations and exploit sparseness to build an inverted file whereby the retrieval process is sub-linear in the total number of images. We achieve excellent performance on 10^4 images, with a query time on the order of milliseconds.
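The min-wise hashing step underlying such set intersections can be sketched as follows. Simulating random permutations by modular multiplication with random salts is a common simplification, and the helper names are ours, not the paper's:

```python
import numpy as np

def minhash_signature(feature_set, num_perm=64, seed=0):
    """Min-wise hashing: for each of num_perm simulated permutations,
    keep the minimum permuted hash value of the set."""
    rng = np.random.default_rng(seed)
    salts = rng.integers(1, 2**31, size=num_perm)
    p = 2_147_483_647                     # Mersenne prime, 2^31 - 1
    items = np.fromiter((hash(x) & 0x7FFFFFFF for x in feature_set),
                        dtype=np.int64)
    # (salt * x) % p acts as a cheap permutation of the hash domain
    return np.array([int(np.min((s * items) % p)) for s in salts])

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing minima estimates the set resemblance
    |A ∩ B| / |A ∪ B| of the two underlying feature sets."""
    return float(np.mean(sig_a == sig_b))
```

Comparing two images then costs a fixed number of equality tests on short signatures rather than an exact intersection of the full feature sets.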


The motivation of this work is to tackle the problem of high-level concept detection in image and video documents using a globally annotated training set. The goal is to determine whether a concept is present in an image, along with a degree of confidence, but not its actual position. Since this approach begins with a coarse image segmentation, the high-level concepts that it is able to tackle can be described as "materials" or "scenes". MPEG-7 color and texture features are extracted locally from regions obtained by a coarse segmentation based on an RSST variation. Using a significantly large set of images, a hierarchical clustering algorithm is applied to all regions, and a relatively small number of them is selected. These regions are called "region types". This set of region types composes a visual dictionary which facilitates the mapping of low-level to high-level features.
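The dictionary-building step, clustering all region features hierarchically and keeping a small number of representatives as "region types", can be sketched with a simple agglomerative procedure. The concrete clustering algorithm, distance, and names below are our illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def region_types(features, num_types):
    """Greedy agglomerative clustering: repeatedly merge the two closest
    clusters (by centroid distance) until num_types remain; the final
    centroids serve as the 'region types' of the visual dictionary."""
    clusters = [np.asarray(f, float) for f in features]  # one region each
    sizes = [1] * len(clusters)
    while len(clusters) > num_types:
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i] - clusters[j])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        merged = ((sizes[i] * clusters[i] + sizes[j] * clusters[j])
                  / (sizes[i] + sizes[j]))
        clusters[i], sizes[i] = merged, sizes[i] + sizes[j]
        del clusters[j], sizes[j]
    return np.vstack(clusters)

def map_to_type(feature, types):
    """Map a region's low-level feature to its nearest region type."""
    return int(np.argmin(np.linalg.norm(types - feature, axis=1)))
```

Once built, the dictionary turns each segmented region into a discrete region-type label, which is what makes the low-level to high-level mapping tractable.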