Although human perception appears to be automatic and unconscious, complex sensory mechanisms form the preattentive component of understanding and lead to awareness. Considerable research has been carried out into these preattentive mechanisms, and computational models have been developed for similar problems in the fields of computer vision and speech analysis. The focus here is to explore aural and visual information in video streams for modeling attention and detecting salient events. The separate aural and visual modules may convey explicit, complementary or mutually exclusive information around the detected audio-visual events. Based on recent studies on perceptual and computational attention modeling, we formulate measures of attention using features of saliency for the audio-visual stream. Audio saliency is captured by signal modulations and related multifrequency band features, extracted through nonlinear operators and energy tracking. Visual saliency is measured by means of a spatiotemporal attention model driven by various feature cues (intensity, color, motion). Features from both modules are mapped to one-dimensional, time-varying saliency curves, from which statistics of salient segments can be extracted and important audio or visual events can be detected through adaptive, threshold-based mechanisms. Audio and video curves are integrated into a single attention curve, where events may be enhanced or suppressed, or may vanish altogether. Salient events from the audio-visual curve are detected through geometrical features such as local extrema, sharp transitions and level sets. The potential of inter-module fusion and audio-visual event detection is demonstrated in applications such as video key-frame selection, video skimming and video annotation.
Multimodal Processing and Interaction: Audio, Video, Text, Springer-Verlag, pp. 179-199, 2008.
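
The curve-fusion and event-detection steps described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the chapter's method: the abstract does not state the fusion rule, so a weighted linear combination is used; the adaptive threshold is taken to be the mean plus `k` standard deviations; and the function names `normalize`, `fuse` and `detect_events` and the parameters `w_audio` and `k` are hypothetical.

```python
import numpy as np


def normalize(curve):
    """Scale a 1-D saliency curve to [0, 1]; a flat curve maps to zeros."""
    curve = np.asarray(curve, dtype=float)
    span = curve.max() - curve.min()
    if span == 0:
        return np.zeros_like(curve)
    return (curve - curve.min()) / span


def fuse(audio, visual, w_audio=0.5):
    """Combine two saliency curves of equal length into one attention curve.

    Assumption: weighted linear fusion. The source only says the curves
    are "integrated"; other schemes (max, product, nonlinear) are possible.
    """
    return w_audio * normalize(audio) + (1.0 - w_audio) * normalize(visual)


def detect_events(curve, k=1.0):
    """Return indices of local maxima that exceed an adaptive threshold.

    Assumption: the threshold adapts to the curve as mean + k * std.
    Endpoints are never reported, since they have only one neighbour.
    """
    curve = np.asarray(curve, dtype=float)
    threshold = curve.mean() + k * curve.std()
    interior = curve[1:-1]
    is_peak = (
        (interior > curve[:-2])      # rises from the left neighbour
        & (interior >= curve[2:])    # does not rise further to the right
        & (interior > threshold)     # salient relative to the whole curve
    )
    return np.flatnonzero(is_peak) + 1


# Toy curves: both modalities agree on a strong event at frame 2,
# while a weak visual-only bump at frame 6 is suppressed after fusion.
audio = [0, 0, 1, 0, 0, 0, 0, 0]
visual = [0, 0, 1, 0, 0, 0, 0.2, 0]
attention = fuse(audio, visual)
events = detect_events(attention)
```

In this toy input the event shared by both modalities is enhanced in the fused curve, while the weak visual-only bump falls below the threshold, which mirrors the enhancement and suppression behaviour the abstract mentions.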