Paper: SS-12.6
Session: Information Fusion for Multimedia Annotation and Retrieval
Time: Friday, May 21, 14:25 - 14:42
Presentation: Special Session Lecture
Topic: Special Sessions: Information Fusion for Multimedia Annotation and Retrieval
Title: MULTIMODAL VIDEO SEARCH TECHNIQUES: LATE FUSION OF SPEECH-BASED RETRIEVAL AND VISUAL CONTENT-BASED RETRIEVAL
Authors: Arnon Amir, IBM Almaden Research Center
Giridharan Iyengar, IBM T. J. Watson Research Center
Ching-Yung Lin, IBM T. J. Watson Research Center
Milind Naphade, IBM T. J. Watson Research Center
Apostol Natsev, IBM T. J. Watson Research Center
Chalapathy Neti, IBM T. J. Watson Research Center
Harriet J. Nock, IBM T. J. Watson Research Center
John R. Smith, IBM T. J. Watson Research Center
Belle Tseng, IBM T. J. Watson Research Center
Abstract: There has been extensive research into systems for content-based or text-based (e.g. closed captioning, speech transcript) search, some of which has been applied to video. However, the 2001 and 2002 NIST TRECVID benchmarks of broadcast video search systems showed that designing multimodal video search systems which integrate both speech and image (or image sequence) cues, and thereby improve performance beyond that achievable by systems using only speech or image cues, remains a challenging problem. This paper describes multimodal systems constructed by IBM for the TRECVID 2003 benchmark of search systems for broadcast video. These multimodal systems all use a late fusion of independently developed speech-based and visual content-based retrieval systems, and they outperform our individual speech-based and content-based retrieval systems on both manual and interactive search tasks, which represents significant progress in video search beyond the state of the art at TRECVID 2002. For the manual task, our best system uses a query-dependent linear weighting between the speech-based and image-based retrieval systems. This system has Mean Average Precision (MAP) performance 20% above that of our best unimodal system for manual search. For interactive search, where the user has full knowledge of the query topic and the performance of the individual search systems, our best system uses an interlacing approach. The user determines the (subjectively) optimal weights A and B for the speech-based and image-based systems, and the multimodal result set is aggregated by combining the top A documents from system A followed by the top B documents from system B, repeating this process until the desired result set size is reached. This multimodal interactive search has MAP 30% above that of our best unimodal interactive search system.
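The query-dependent linear weighting described for the manual task can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, the use of per-document score dictionaries, and the assumption that both systems' scores have already been normalized to a common range are all assumptions made here for clarity.

```python
def linear_fusion(speech_scores, visual_scores, weight):
    """Late fusion by query-dependent linear weighting (illustrative sketch).

    speech_scores / visual_scores: dicts mapping document id -> retrieval
    score, assumed already normalized to a comparable range.
    weight: the speech-system weight chosen for this query (0..1); the
    visual system receives the complementary weight (1 - weight).
    """
    docs = set(speech_scores) | set(visual_scores)
    fused = {
        d: weight * speech_scores.get(d, 0.0)
           + (1.0 - weight) * visual_scores.get(d, 0.0)
        for d in docs
    }
    # Rank documents by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)
```

For example, with `weight = 0.5`, a document retrieved strongly by both systems outranks one retrieved strongly by only one of them, which is the intended behavior of score-level late fusion.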
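The interlacing aggregation used for interactive search can be sketched as below. The function name and the choice to skip documents already taken from the other list (deduplication) are assumptions on top of the abstract's description, which only states that the top A documents from one system and the top B from the other are combined in alternation until the result set is full.

```python
def interlace(results_a, results_b, a, b, size):
    """Aggregate two ranked lists by interlacing (illustrative sketch).

    Repeatedly take the next `a` unseen documents from results_a, then the
    next `b` unseen documents from results_b, until `size` documents are
    collected or both lists are exhausted. Duplicates are kept only at
    their first (highest-ranked) occurrence -- an assumption made here.
    """
    out, seen = [], set()
    it_a, it_b = iter(results_a), iter(results_b)

    def take(it, n):
        # Pull up to n previously unseen documents from the iterator.
        count = 0
        for doc in it:
            if doc not in seen:
                seen.add(doc)
                out.append(doc)
                count += 1
                if count == n or len(out) == size:
                    return

    while len(out) < size:
        before = len(out)
        take(it_a, a)
        take(it_b, b)
        if len(out) == before:  # both lists exhausted
            break
    return out
```

With `a = 2, b = 1`, for instance, the merged list alternates two speech-based hits with one image-based hit, reflecting a user's judgment that the speech system is roughly twice as reliable for that query.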