Statistical Modeling of the Content of Online Videos for the Automated Detection of Semantic Concepts in Videos
Summary of Project Results
The MOONVID project investigated user-tagged web video - as available from portals like YouTube - as an information source for training concept detectors, i.e., visual recognition systems that detect semantic concepts such as objects, locations, and actions in video data. This way, user-generated annotations on the web (tags, titles, etc.) can be exploited as an alternative to cost-intensive expert labeling of training data, such that concept detectors can be trained at a much larger scale and with significantly less effort. The key results of MOONVID are the following:

• Evaluation of web-based detectors. A state-of-the-art concept detection system was implemented and evaluated when trained on YouTube material as opposed to standard expert-labeled datasets from the TRECVID benchmark. Results indicate that (1) web-based detectors are outperformed when sufficient manual annotations on the target domain are available, but also that (2) web-based detectors generalize to novel domains as well as systems trained on expert-labeled data do.

• Techniques to address the label noise problem. The term "label noise" refers to the fact that user-generated tags are subjective, incomplete, and context-dependent, and thus make unreliable concept labels. We have shown that label noise degrades the accuracy of web-based detectors by up to 33%, and have developed several strategies to overcome this problem: (1) adapting the machine learning models underlying concept detection to weak label information, (2) taking manual labels of a few carefully selected samples into account, and (3) automatically refining the queries by which web video training content is retrieved (a minimal sketch of the relevance-filtering idea follows this list).

• Combination of motion segmentation and object recognition. To improve the recognition of moving objects in video, motion segmentation was used to filter clutter before applying patch-based recognition. We demonstrated that - though patch-based recognition methods already provide some inherent robustness to clutter - motion segmentation has the potential to improve recognition further. Compared to a baseline operating on unsegmented images, the recognition error decreased from 8.1% to 4.4%, and the precision of concept detection improved from 31% to 41%.

• User-generated contexts. A novel approach was developed to make use of user-generated categories on the web. In particular, Flickr Groups (user-structured photo categories on Flickr) are employed as an information source for learning visual contexts. By aggregating training material by these groups and learning group-specific models, concept detection can be tailored much better to a user's particular target content, and accuracy can be improved by up to 100%.
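To illustrate the general approach, the following is a minimal sketch (in Python; not the project's actual implementation) of learning a concept detector from weakly labeled web video keyframes with a simple relevance-filtering loop, in the spirit of strategies (1) and (2) above. All names are illustrative, and keyframe features (e.g., bag-of-visual-words descriptors) are assumed to be precomputed:

```python
# Illustrative sketch only: training a concept detector on keyframes
# from tag-labeled web videos, with a relevance-filtering loop that
# iteratively discards the least confident "positives", assuming
# these are label noise. Feature extraction is assumed to be given.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_with_relevance_filtering(X_pos, X_neg, rounds=3, drop_frac=0.1):
    """X_pos: features of keyframes from videos tagged with the concept
    (noisy positives); X_neg: features of keyframes without the tag.
    Returns a classifier scoring how likely a keyframe shows the concept."""
    pos = np.asarray(X_pos)
    neg = np.asarray(X_neg)
    clf = None
    for _ in range(rounds):
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        # Score the current positives and drop the lowest-scoring
        # fraction, which is most likely mislabeled content.
        scores = clf.predict_proba(pos)[:, 1]
        keep = np.argsort(scores)[int(drop_frac * len(pos)):]
        pos = pos[keep]
    return clf
```

In the project itself, more elaborate variants of this idea were studied, e.g., combining relevance filtering with active learning, where the few samples to be labeled manually are selected by the model (see the publication list below).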
Overall, with 16 peer-reviewed publications, several awards (among others, a Google Research Award in 2010), and presentations at international conferences, trade shows, and other research groups, the project results have been received very positively by the scientific community.

Future Work. The accuracy of concept detection is still limited, particularly when concept detectors are applied to different domains (i.e., sources of video data) than they were trained on. Therefore, one important question for future work remains how to successfully apply concept detectors trained on web video to other video domains like professional TV content. Here, a highly interesting approach would be to integrate domain adaptation with the whole concept learning cycle, i.e., training data acquisition and the use of context information should be adapted to knowledge of the target domain as well.

Applications. With the rapid growth that multimedia collections are currently experiencing, content analysis (and with it concept detection) will become even more important for providing efficient access to image and video collections such as web video portals, personal image and video content, or digital archives maintained to preserve cultural heritage. With respect to these applications, the manual acquisition of training data does not satisfy the scalability requirements for a practical application of concept detection. In our research group, we recently experienced this in the project Edutainment 3.0 (targeted at applying concept detection in a journalistic video search scenario): training data was acquired with large manual effort, but the resulting training sets were found to be too small to generalize well, and the concept vocabulary remained limited. In this context, our results in MOONVID suggest that web data can form an interesting information source for visual concept learning. We therefore plan to apply for a DFG transfer project to explore the potential of web-based concept learning integrated into a practical video search system.
Project-Related Publications (Selection)
- A. Ulges, M. Koch, D. Borth, and T. Breuel. TubeTagger - YouTube-based Concept Detection. In Proc. Int. Workshop on Internet Multimedia Mining, December 2009.
- A. Ulges. Visual Concept Learning from User-tagged Web Video. PhD thesis, University of Kaiserslautern, Germany, 2009.
- A. Ulges and T. Breuel. Can Motion Segmentation Improve Patch-based Object Recognition? In Proc. Int. Conf. on Pattern Recognition, pages 3041-3044, August 2010.
- A. Ulges, C. Schulze, M. Koch, and T. Breuel. Learning Automatic Concept Detectors from Online Video. Computer Vision and Image Understanding, 114(4):429-438, 2010.
- D. Borth, A. Ulges, and T. Breuel. Relevance Filtering meets Active Learning: Improving Web-based Concept Detectors. In Proc. Int. Conf. on Multimedia Information Retrieval, March 2010.
- A. Ulges, D. Borth, and T. Breuel. Visual Concept Learning from Weakly Labeled Web Videos. In Video Search and Mining, Springer-Verlag, 2010.
- A. Ulges, M. Worring, and T. Breuel. Learning Visual Contexts for Image Annotation from Flickr Groups. IEEE Transactions on Multimedia, 13(2):330-341, 2011.