- & Cees Snoek, University of Amsterdam, The Netherlands
Many solutions to image and video indexing can only be applied in narrow domains using specific concept detectors, e.g., "sunset" or "face". The use of multimodal indexing, advances in machine learning, and the availability of large, annotated information sources, e.g., the TRECVID benchmark, have paved the way to increasing lexicon size by orders of magnitude (now 100 concepts, in a few years 1,000). This brings video indexing within reach of research in ontology engineering, which creates and maintains large (typically 10,000+ concepts) structured sets of shared concepts.
This tutorial lays the foundation for these exciting new horizons. It covers basic video analysis techniques, video indexing, connections to ontologies, and interactive access to the data.
Room: Themistocles B

| Time | Session |
|---|---|
| 14:00 - 14:45 | Basic Techniques for Video Indexing |
| 14:45 - 15:30 | Semantic Video Indexing |
| 15:30 - 16:00 | Coffee Break |
| 16:00 - 16:45 | Semantic Retrieval |
| 16:45 - 17:30 | Demonstration |
We present the authoring metaphor: a multimodal perspective on video representation, which also forms the guiding framework for analysis. The representation covers both layout and content. We briefly introduce shot detection and various low-level image, audio, and text features.
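To give a flavour of such analysis in practice, here is a minimal shot-boundary detection sketch, assuming OpenCV is available; the histogram-correlation threshold is a simple illustrative heuristic, not the specific detectors covered in the tutorial:

```python
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    """Flag frames whose colour histogram differs strongly from the
    previous frame -- a classic cut-detection heuristic."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Correlation near 1 means similar frames; a sharp drop signals a cut.
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:
                boundaries.append(frame_idx)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return boundaries
```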
We give a brief introduction to machine learning, discuss general principles for learning semantic concepts from low-level features, and provide examples, followed by an overview of popular state-of-the-art machine learning techniques.
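A minimal sketch of this supervised-learning setup, assuming shot-level feature vectors and binary concept labels (e.g., from TRECVID-style annotations) are already available; the SVM here stands in for the range of classifiers discussed, and the data is placeholder:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# X: one low-level feature vector per shot; y: 1 if the concept is present.
X = np.random.rand(1000, 128)          # placeholder features
y = np.random.randint(0, 2, 1000)      # placeholder "sunset" labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# Probabilistic outputs double as concept-confidence scores for ranking shots.
scores = clf.predict_proba(X_test)[:, 1]
print("held-out accuracy:", clf.score(X_test, y_test))
```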
The content perspective relates segments to elements that an author uses to create a video document. The following elements can be distinguished:
- Setting: time and place in which the video's story takes place
- Objects: noticeable static or dynamic entities in the video document
- People: human beings appearing in the video document
Techniques for detecting these content elements will be highlighted.
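As one concrete example for the People element, a sketch of face detection using the Haar-cascade classifier bundled with OpenCV; the parameters are illustrative, and this is only one of many possible detectors:

```python
import cv2

# Load the frontal-face cascade shipped with the opencv-python package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def contains_people(frame):
    """Return True if at least one frontal face is detected in the frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0
```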
Conversion: For analysis, converting elements of the visual and auditory modalities to text is most appropriate. We highlight video optical character recognition, automatic speech recognition, and machine translation.
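A minimal video OCR sketch, assuming pytesseract (a wrapper around the Tesseract engine) stands in for the dedicated video OCR systems discussed; the frame-sampling interval is arbitrary:

```python
import cv2
import pytesseract

def extract_overlay_text(video_path, every_n_frames=25):
    """Sample frames at a fixed interval, run OCR on each sampled frame,
    and yield (frame_index, recognized_text) pairs."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray).strip()
            if text:
                yield idx, text
        idx += 1
    cap.release()
```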
Integration: The purpose of integrating multimodal layout and content elements is to improve classification performance. To that end, an added modality may serve as a verification method, as a means of compensating for inaccuracies, or as an additional information source.
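One common integration scheme is late fusion, where per-modality classifier scores are combined after classification; a minimal sketch, with hypothetical weights that would in practice be learned or tuned per concept:

```python
import numpy as np

def late_fusion(visual_scores, audio_scores, text_scores,
                weights=(0.5, 0.3, 0.2)):
    """Weighted average of per-modality concept scores for each shot.
    A modality that is unreliable for a concept gets a lower weight."""
    stacked = np.vstack([visual_scores, audio_scores, text_scores])
    return np.average(stacked, axis=0, weights=weights)

# Example: three modalities scoring the same five shots for one concept.
fused = late_fusion(np.array([0.9, 0.2, 0.7, 0.1, 0.5]),
                    np.array([0.8, 0.3, 0.6, 0.2, 0.4]),
                    np.array([0.7, 0.1, 0.9, 0.3, 0.6]))
print(fused)
```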
Semantic video indexes: We present a systematic overview of the large variety of semantic indexes reported in the video indexing literature, and of the information from which they are derived. They are grouped as follows (a data-structure sketch of the hierarchy follows the list):
- Genre: set of video documents sharing similar style
- Sub-genre: a subset of a genre where the video documents share similar content
- Logical units: a continuous part of a video document's content, consisting of a set of named events or other logical units that together carry a meaning
- Named events: short segments that can be assigned a meaning which does not change over time
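One way to make this grouping concrete is as a containment hierarchy; a minimal sketch, with all class and field names illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class NamedEvent:
    label: str            # e.g., "goal", "explosion"
    start: float          # start time in seconds
    end: float            # end time in seconds

@dataclass
class LogicalUnit:
    label: str                                   # e.g., "second half"
    events: list = field(default_factory=list)   # NamedEvent instances

@dataclass
class VideoDocument:
    genre: str                                   # e.g., "sports"
    subgenre: str                                # e.g., "soccer match"
    units: list = field(default_factory=list)    # LogicalUnit instances
```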
We discuss techniques to integrate ontologies such as WordNet with video indexing. We consider both low-level techniques, which add (visual) attributes or visual examples to the ontology, and higher-level techniques, which connect the semantic video indexes to the ontology's concepts.
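A minimal sketch of such a link, assuming NLTK's WordNet interface: a detector label is mapped to a synset, whose hypernym path connects it to increasingly general concepts in the ontology. The naive first-sense choice is an illustrative simplification; real systems disambiguate:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def link_concept_to_ontology(label):
    """Map a detector label to its first WordNet noun synset and return
    the chain of increasingly general concepts (hypernym path)."""
    synsets = wn.synsets(label, pos=wn.NOUN)
    if not synsets:
        return []
    synset = synsets[0]  # naive sense choice
    return [s.name() for s in synset.hypernym_paths()[0]]

print(link_concept_to_ontology("sunset"))
# e.g. ['entity.n.01', ..., 'sunset.n.01']
```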
Various interaction techniques for accessing the indexed video collection are discussed. Starting from the different ways to pose a query to the system (query-by-keyword, query-by-example, query-by-concept), we consider techniques that support the user in browsing through advanced interaction mechanisms, such as relevance feedback from the user combined with visualization.
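A sketch of query-by-concept ranking plus one simple relevance-feedback scheme (a Rocchio-style update, which is only one of many possibilities), assuming precomputed per-shot concept scores and feature vectors; all names and weights are illustrative:

```python
import numpy as np

def query_by_concept(concept_scores, top_k=10):
    """Rank shots by their detector score for the queried concept."""
    return np.argsort(concept_scores)[::-1][:top_k]

def rocchio_update(query_vec, features, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query vector toward shots the user marked relevant
    and away from those marked non-relevant."""
    q = alpha * query_vec
    if len(relevant):
        q += beta * features[relevant].mean(axis=0)
    if len(nonrelevant):
        q -= gamma * features[nonrelevant].mean(axis=0)
    return q
```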
We discuss the challenges that lie ahead in this exciting field.
The result of the techniques above is demonstrated in the MediaMill semantic video search engine, currently based on a lexicon of over 400 learned concepts.