In this paper we propose a novel saliency-based computational model of visual attention. The model processes both top-down (goal-directed) and bottom-up information. Processing in the top-down channel creates the so-called skin conspicuity map and emulates the visual search for human faces performed by humans. This is clearly a goal-directed task, but it is generic enough to be context-independent. Processing in the bottom-up channel follows the principles set by Itti et al., but deviates from them by computing the orientation, intensity and color conspicuity maps within a unified multi-resolution framework based on wavelet subband analysis. In particular, we apply a wavelet-based approach for efficient computation of the topographic feature maps. Given that wavelets and multiresolution theory are naturally connected, wavelet decomposition is a natural choice for mimicking the center-surround process in human vision. Our implementation goes further, however: we also use the wavelet decomposition for inline computation of the features (such as orientation angles) from which the topographic feature maps are created. The bottom-up topographic feature maps and the top-down skin conspicuity map are then combined through a sigmoid function to produce the final saliency map. A prototype of the proposed model was implemented as an embedded system on the TMDSDMK642-0E DSP platform, allowing real-time operation. The model was evaluated, in terms of perceived visual quality and video compression improvement, in a region-of-interest (ROI) based video compression setup. Extensive experiments on both MPEG-1 and low bit-rate MPEG-4 video encoding showed a significant improvement in video compression efficiency without perceived deterioration in visual quality.
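As a rough illustration of the pipeline described above, the following minimal Python sketch computes a bottom-up conspicuity value from the detail subbands of a wavelet decomposition and fuses it with a top-down skin map through a sigmoid. It is not the authors' implementation. The abstract does not specify the wavelet family, the feature definitions, or the exact form of the sigmoid combination, so the one-level Haar transform, the function names (`haar_level`, `detail_energy`, `fuse_sigmoid`) and the parameters `gain` and `offset` are all assumptions made for illustration.

```python
import math


def haar_level(img):
    """One level of a 2D Haar decomposition (assumed wavelet, not from the paper).

    `img` is a list of rows with even height and width. Returns four
    half-resolution maps: the approximation and three detail subbands.
    """
    approx, d_rows, d_cols, d_diag = [], [], [], []
    for r in range(0, len(img), 2):
        a_row, dr_row, dc_row, dd_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            a_row.append((a + b + d + e) / 4.0)   # local average (coarse scale)
            dr_row.append((a + b - d - e) / 4.0)  # difference between the two rows
            dc_row.append((a - b + d - e) / 4.0)  # difference between the two columns
            dd_row.append((a - b - d + e) / 4.0)  # diagonal difference
        approx.append(a_row)
        d_rows.append(dr_row)
        d_cols.append(dc_row)
        d_diag.append(dd_row)
    return approx, d_rows, d_cols, d_diag


def detail_energy(img):
    """Hypothetical bottom-up conspicuity: summed magnitude of the detail subbands."""
    _, d_rows, d_cols, d_diag = haar_level(img)
    return [
        [abs(x) + abs(y) + abs(z) for x, y, z in zip(r1, r2, r3)]
        for r1, r2, r3 in zip(d_rows, d_cols, d_diag)
    ]


def fuse_sigmoid(bottom_up, top_down, gain=1.0, offset=0.0):
    """Assumed fusion rule: sigmoid of the summed maps (same shape required)."""
    return [
        [1.0 / (1.0 + math.exp(-gain * (bu + td - offset)))
         for bu, td in zip(bu_row, td_row)]
        for bu_row, td_row in zip(bottom_up, top_down)
    ]
```

A multi-resolution version would apply `haar_level` repeatedly to the approximation map, so that each level's detail subbands act as a center-surround difference at a coarser scale. In an ROI-based coding setup, the fused map could then be thresholded to select the regions that receive more bits.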
International Journal of Neural Systems, Volume 17, Issue 4, pp.1-16, August 2007.