Image and Video Analysis



We propose a simple vector quantizer that combines low distortion with fast search and apply it to approximate nearest neighbor (ANN) search in high-dimensional spaces. Leveraging the very same data structure that is used to provide non-exhaustive search, i.e. inverted lists or a multi-index, the idea is to locally optimize an individual product quantizer (PQ) per cell and use it to encode residuals. Local optimization is over rotation and space decomposition; interestingly, we apply a parametric solution that assumes a normal distribution and is extremely fast to train. With a reasonable space and time overhead that is constant in the data size, we set a new state of the art on several public datasets, including a billion-scale one.
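
The encoding of residuals with a per-cell product quantizer can be illustrated with a minimal sketch. It assumes already-trained codebooks (the toy values of `coarse` and `local_pq` below are hypothetical) and leaves out the per-cell rotation and the optimized space decomposition, so it shows only the general residual-PQ idea rather than the full method.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(v, codebook):
    """Index of the codeword closest to v."""
    return min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))

def encode(x, coarse, local_pq):
    """Return (cell id, sub-codeword ids) for vector x.

    coarse      -- list of coarse centroids (one per cell)
    local_pq[c] -- list of sub-codebooks for cell c, one per subspace
    """
    c = nearest(x, coarse)
    residual = [xi - ci for xi, ci in zip(x, coarse[c])]
    books = local_pq[c]
    d = len(residual) // len(books)  # dimensions per subspace
    codes = [nearest(residual[j * d:(j + 1) * d], books[j])
             for j in range(len(books))]
    return c, codes

def decode(c, codes, coarse, local_pq):
    """Reconstruct the approximation: coarse centroid plus quantized residual."""
    approx = list(coarse[c])
    books = local_pq[c]
    d = len(approx) // len(books)
    for j, k in enumerate(codes):
        for t, val in enumerate(books[j][k]):
            approx[j * d + t] += val
    return approx

# Hypothetical toy codebooks: 2 cells, 4 dimensions, 2 subspaces, 2 codewords each.
coarse = [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]]
local_pq = [
    [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [-1.0, 1.0]]],  # cell 0
    [[[0.0, 0.0], [0.5, 0.5]], [[0.0, 0.0], [1.0, 0.0]]],   # cell 1
]
cell, codes = encode([10.4, 10.6, 10.9, 10.1], coarse, local_pq)
approx = decode(cell, codes, coarse, local_pq)
```

Because each cell has its own sub-codebooks, the stored code is only meaningful together with the cell id, which the inverted list already provides.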


Inspired by the close relation between nearest neighbor search and clustering in high-dimensional spaces as well as the success of one helping to solve the other, we introduce a new paradigm where both problems are solved simultaneously. Our solution is recursive, not in the size of input data but in the number of dimensions. One result is a clustering algorithm that is tuned to small codebooks but does not need all data in memory at the same time and is practically constant in the data size. As a by-product, a tree structure performs either exact or approximate quantization on trained centroids, the latter being not very precise but extremely fast.


We consider a family of metrics to compare images based on their local descriptors. This family encompasses the VLAD descriptor and matching techniques such as Hamming Embedding. Bridging these approaches leads us to propose a match kernel that takes the best of existing techniques by combining an aggregation procedure with a selective match kernel. Finally, the representation underpinning this kernel is approximated, yielding large-scale image search that is both precise and scalable, as shown by our experiments on several benchmarks.
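
As a rough illustration of what a selective match kernel can look like, the sketch below compares two images whose descriptors are grouped by visual word. The thresholded power-law selectivity function, the parameter names `alpha` and `tau`, and the use of unit-norm residual vectors are assumptions made for this example; aggregation per visual word and image-level normalization are omitted.

```python
import math

def selectivity(u, alpha=3.0, tau=0.0):
    """Assumed selectivity function: keep only similarities above tau
    and raise them to the power alpha (sign-preserving)."""
    if u <= tau:
        return 0.0
    return math.copysign(abs(u) ** alpha, u)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def selective_match_kernel(X, Y, alpha=3.0, tau=0.0):
    """X, Y map a visual word id to a list of unit-norm residual vectors.
    Only descriptors assigned to the same visual word are compared."""
    total = 0.0
    for word in X.keys() & Y.keys():
        for x in X[word]:
            for y in Y[word]:
                total += selectivity(dot(x, y), alpha, tau)
    return total

# Hypothetical toy images with 2-D residuals.
X = {1: [[1.0, 0.0]], 2: [[0.0, 1.0]]}
Y = {1: [[1.0, 0.0], [0.0, 1.0]], 3: [[1.0, 0.0]]}
score = selective_match_kernel(X, Y)
```

Raising `tau` discards weak matches entirely, while a larger `alpha` down-weights the remaining ones relative to near-identical descriptors.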


We exploit self-similarities, symmetries and repeating patterns to select features within a single image. We match the performance of the full feature set using only a small fraction of its index size on a dataset of unique views of buildings or urban scenes, in the presence of one million distractors of similar nature. Our best solution is linear in the number of correspondences, with practical running times of just a few milliseconds.


We present a clustering method that combines the flexibility of Gaussian mixtures with the scaling properties needed to construct visual vocabularies for image retrieval. It is a variant of expectation-maximization that can converge rapidly while dynamically estimating the number of components. We employ approximate nearest neighbor search to speed up the E-step and exploit its iterative nature to make search incremental, boosting both speed and precision.
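
The idea of restricting the E-step to a few nearby components can be sketched in one dimension. Here an exhaustive sort stands in for the approximate nearest neighbor search, and the function names, the parameter `k`, and the toy data are hypothetical. The sketch does not estimate the number of components and does not make the search incremental.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(points, weights, means, variances, k=2):
    """Responsibilities computed over the k nearest components only."""
    resp = []
    for x in points:
        # Stand-in for ANN search: exhaustive sort by distance to the means.
        near = sorted(range(len(means)), key=lambda j: abs(x - means[j]))[:k]
        p = {j: weights[j] * gaussian(x, means[j], variances[j]) for j in near}
        total = sum(p.values())
        resp.append({j: v / total for j, v in p.items()})
    return resp

def update_means(points, resp, means):
    """M-step for the means; a component with no mass keeps its old mean."""
    new_means = []
    for j, old in enumerate(means):
        mass = sum(r.get(j, 0.0) for r in resp)
        if mass == 0.0:
            new_means.append(old)
        else:
            weighted = sum(r.get(j, 0.0) * x for x, r in zip(points, resp))
            new_means.append(weighted / mass)
    return new_means

points = [0.0, 0.1, 5.0]
weights = [1.0 / 3.0] * 3
means = [0.0, 5.0, 100.0]
variances = [1.0, 1.0, 1.0]
resp = e_step(points, weights, means, variances, k=2)
new_means = update_means(points, resp, means)
```

Each point then costs work proportional to `k` rather than to the total number of components, which is what makes large vocabularies tractable.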


We propose a new feature detector based on weighted alpha shapes. The detected features are blob-like and include non-extremal regions as well as regions determined by cavities of boundary shape.


Medial features are image regions of arbitrary scale and shape, extracted without explicit scale space construction. They rely on a weighted distance map of image gradient, computed using an exact linear-time algorithm. The corresponding weighted medial axis is then decomposed into a graph representing image structure. A duality property enables reconstruction of regions using the same distance propagation. We select features according to our shape fragmentation factor, favoring those well enclosed by boundaries.


We present a spatial matching model that is flexible and allows non-rigid motion and multiple matching surfaces or objects, yet is fast enough to perform re-ranking in large scale image retrieval. Correspondences between local features are mapped to the geometric transformation space, and a histogram pyramid is then used to compute a similarity measure based on density of correspondences. Our model imposes one-to-one mapping and is linear in the number of correspondences. We apply it to image retrieval, yielding superior performance and a dramatic speed-up compared to the state of the art.
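
A simplified sketch of a histogram pyramid over the transformation space is given below. It assumes each correspondence has already been mapped to a parameter vector normalized to [0, 1], uses a hypothetical weight of 2^(-level) for pairs that first share a bin at a given level, and leaves out the one-to-one mapping constraint of the model.

```python
from collections import Counter

def bin_of(t, level, levels):
    """Bin index of transformation t at a pyramid level (level 0 is finest)."""
    n = 2 ** (levels - 1 - level)  # bins per dimension at this level
    return tuple(min(int(v * n), n - 1) for v in t)

def pyramid_similarity(transforms, levels=3):
    """Density-based similarity: correspondences sharing a fine bin count
    more than those that only meet in a coarse bin."""
    counts = [Counter(bin_of(t, l, levels) for t in transforms)
              for l in range(levels)]
    score = 0.0
    for t in transforms:
        prev = 1  # the correspondence itself
        for l in range(levels):
            here = counts[l][bin_of(t, l, levels)]
            # Correspondences that first join this one at level l.
            score += (here - prev) * 2.0 ** (-l)
            prev = here
    return score

# Hypothetical toy input: two consistent correspondences and one outlier.
transforms = [(0.10, 0.10), (0.12, 0.11), (0.90, 0.90)]
similarity = pyramid_similarity(transforms, levels=3)
```

Since the bins are nested, one histogram per level suffices and the cost grows linearly with the number of correspondences for a fixed number of levels.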


We propose a detector that starts from single scale edges and produces reliable and interpretable blob-like regions and groups of regions of arbitrary shape. The detector is based on merging local maxima of the binary distance transform guided by the gradient strength of the surrounding edges.
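
The starting point of such a detector, local maxima of the distance transform of a binary edge map, can be sketched as follows. The sketch assumes a city-block distance computed by breadth-first propagation and at least one edge pixel in the map; the gradient-guided merging of maxima is not included.

```python
from collections import deque

def distance_transform(edges):
    """City-block distance of every pixel to the nearest edge pixel (value 1)."""
    h, w = len(edges), len(edges[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if edges[y][x]:
                dist[y][x] = 0
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist

def local_maxima(dist):
    """Non-edge pixels whose distance is not exceeded by any 8-neighbor."""
    h, w = len(dist), len(dist[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            d = dist[y][x]
            if d == 0:
                continue
            neighbors = [dist[ny][nx]
                         for ny in range(max(0, y - 1), min(h, y + 2))
                         for nx in range(max(0, x - 1), min(w, x + 2))
                         if (ny, nx) != (y, x)]
            if all(d >= n for n in neighbors):
                peaks.append((y, x))
    return peaks

# Toy edge map: a closed square boundary around a 3x3 interior.
edges = [[1, 1, 1, 1, 1],
         [1, 0, 0, 0, 1],
         [1, 0, 0, 0, 1],
         [1, 0, 0, 0, 1],
         [1, 1, 1, 1, 1]]
dist = distance_transform(edges)
peaks = local_maxima(dist)
```

Each maximum marks the center of a region enclosed by edges, and its distance value gives the region's approximate radius.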


State-of-the-art data mining and image retrieval in community photo collections typically focus on popular subsets, e.g. images containing landmarks or associated with Wikipedia articles. We propose an image clustering scheme that, seen as vector quantization, compresses a large corpus of images by grouping visually consistent ones while providing a guaranteed distortion bound. This allows us, for instance, to represent the visual content of all the thousands of images depicting the Parthenon in just a few dozen scene maps and still be able to retrieve any single, isolated, non-landmark image like a house or graffiti on a wall.


We present a new approach to image indexing and retrieval, which integrates appearance with global image geometry in the indexing process, while enjoying robustness against viewpoint change, photometric variations, occlusion, and background clutter. Each image is represented by a collection of feature maps and RANSAC-like matching is reduced to a number of set intersections. We extend min-wise independent permutations and finally exploit sparseness to build an inverted file whereby the retrieval process is sub-linear in the total number of images. We achieve excellent performance on 10^4 images, with a query time in the order of milliseconds.
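
For background, the basic min-wise hashing scheme that this work extends can be sketched as below. It estimates the overlap of two non-empty sets of integer ids, e.g. visual words, from short signatures. The linear hash family, the prime modulus and the function names are assumptions of this example, and the geometric extension described above is not covered.

```python
import random

PRIME = 2_147_483_647  # modulus for the hash family; ids must be smaller

def make_hashes(k, seed=0):
    """k random linear hash functions x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(k)]

def signature(items, hashes):
    """Min-hash signature of a non-empty set of integer ids."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]

def estimate_jaccard(sig1, sig2):
    """Fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)

hashes = make_hashes(64)
image_a = {3, 17, 42, 256, 999}
image_b = {3, 17, 42, 300, 1000}
image_c = {5, 6, 7}
sim_ab = estimate_jaccard(signature(image_a, hashes), signature(image_b, hashes))
sim_ac = estimate_jaccard(signature(image_a, hashes), signature(image_c, hashes))
```

Signature positions can serve as keys of an inverted file, so only images that collide on some position need to be examined at query time.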


We use saliency for spatiotemporal feature detection in videos by incorporating color and motion in addition to intensity. Saliency is computed by a global minimization process subject to purely volumetric constraints, each of them related to an informative visual aspect inspired by Gestalt theory.


Based on established computational models of visual attention we propose novel models and methods both for spatial (images) and spatiotemporal (video sequences) analysis. Applications include visual classification and spatiotemporal feature detection.


In this work we propose an object detection approach that extracts a limited number of candidate local regions to guide the detection process. The basic idea is that object location can be determined by clustering points of interest and hierarchically forming candidate regions according to similarity and spatial proximity predicates. Statistical validation shows that the method is robust across a substantial range of content diversity, while its performance appears comparable to that of other state-of-the-art object detectors.


Automatic segmentation of images and videos is a very challenging task in computer vision and one of the most crucial steps toward image and video understanding. In this research work we propose to include semantic criteria in the segmentation process to capture the semantic properties of objects that visual features, such as color or texture, are not able to describe.


The use of visual context information is motivated by the fact that not all human acts are relevant in all situations, and the same holds for image analysis problems. Since visual context is a difficult notion to grasp and capture, in our research work we restrict it to the notion of ontological context. The latter is defined as part of a "fuzzified" version of traditional ontologies. Typical problems to be addressed include how to meaningfully readjust the membership degrees of image regions and how to use visual context to steer the overall results of knowledge-assisted image analysis towards higher performance.


The motivation of this work is to tackle the problem of high-level concept detection within image and video documents using a globally annotated training set. The goal is to determine whether a concept exists within an image, along with a degree of confidence, and not its actual position. Since this approach begins with a coarse image segmentation, the high-level concepts that it is able to tackle can be described as "materials" or "scenes". MPEG-7 color and texture features are locally extracted from coarsely segmented regions using an RSST variation. Using a significantly large set of images and after the application of a hierarchical clustering algorithm on all regions, a relatively small number of them is selected. These regions are called "region types". This set of region types constitutes a visual dictionary which facilitates the mapping of low- to high-level features.