Project Details

Completely Unsupervised Multimodal Character Identification on TV Series and Movies

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term from 2016 to 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 316692988
Final Report Year 2020

Final Report Abstract

With the rise of deep learning, AI has seen rapid progress in conventional vision tasks such as object detection and recognition, semantic segmentation, and action recognition. The community is now moving towards a higher level of semantic abstraction, aiming at joint vision-and-language tasks such as video understanding. This project focussed on completely unsupervised character identification in TV series and movies. Motivated by prior successes in leveraging complementary temporal and multimodal information, the project makes the following contributions: (i) We aimed to learn a representation that exhibits small distances between samples from the same person and large inter-person distances in feature space. Metric learning can achieve this, as it comprises a pull term, pulling data points from the same class closer together, and a push term, pushing data points from different classes further apart. Metric learning improves feature quality but requires some form of external supervision to provide labels for same/different pairs. In the case of face clustering in TV series, we may obtain this supervision from tracks, clustering, similarity, and other cues. Tracking acts as a form of high-precision clustering (grouping detections within a shot) and is used to automatically generate positive and negative pairs of face images. Inspired by this, we proposed: (a) two variants of discriminative approaches, the Track-supervised Siamese network (TSiam) and the Self-supervised Siamese network (SSiam); (b) Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses automatically discovered partitions obtained from our clustering algorithm (FINCH) as weak supervision, along with inherent video constraints, to learn discriminative face features; and (c) Face Grouping on Graphs (FGG), a method for unsupervised fine-tuning of deep face feature representations.
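The pull/push mechanism described above can be illustrated with a simple contrastive pair loss. The sketch below is our own minimal NumPy illustration, not the project's published implementation: the function name, margin value, and toy feature vectors are assumptions. Positive pairs stand in for detections from the same face track; negative pairs for detections from co-occurring tracks.

```python
import numpy as np

def contrastive_pair_loss(f_a, f_b, same_person, margin=2.0):
    """Metric-learning loss on a pair of face features (illustrative).

    Pull term: positive pairs (same track) are pulled together.
    Push term: negative pairs are pushed beyond `margin`.
    """
    d = np.linalg.norm(f_a - f_b)
    if same_person:
        return d ** 2                       # pull term
    return max(0.0, margin - d) ** 2        # push term

# Pairs mined automatically from tracking: two detections within the
# same face track form a positive pair; detections from two tracks
# that co-occur in a shot form a negative pair.
anchor = np.array([0.1, 0.9])
positive = np.array([0.15, 0.85])           # same track
negative = np.array([0.9, 0.1])             # co-occurring track

pull = contrastive_pair_loss(anchor, positive, True)
push = contrastive_pair_loss(anchor, negative, False)
```

Minimizing the summed loss over many such automatically mined pairs shrinks intra-person distances while enforcing the margin between different identities, without any manual labels.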
(ii) True understanding of videos comes from a joint analysis of all their modalities: the video frames, the audio track, and any accompanying text such as closed captions. We presented a way to learn a compact multimodal feature representation that encodes all these modalities. For this purpose, we proposed the temporal ordering of video clips as a new task. Our dataset is built on top of the Large Scale Movie Description Challenge (LSMDC) and consists of 202 movies with 118,081 video clips. In total, there are 25,269 scenes in the training set, 1,784 scenes in the validation set, and 2,443 scenes in the test set. Further, we proposed Temporal Compact Bilinear Pooling (TCBP), an extension of the Tensor Sketch projection algorithm [Pham, 2013] that incorporates a temporal dimension for representing face tracks in videos. Using TCBP, we learned a multimodal clip representation that jointly encodes images, audio, video, and text for the video ordering task. Additionally, we showed that TCBP features transfer exceptionally well to video retrieval and face clustering in videos. All of our datasets and source code are publicly available for research in this field.
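To make the Tensor Sketch foundation of TCBP concrete, the following NumPy sketch computes a compact bilinear descriptor for two modality features via Count Sketch and FFT-based circular convolution, then pools over time. The mean-over-time pooling here is a hypothetical stand-in for TCBP's temporal extension; all function names and the pooling choice are our assumptions, not the published formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sketch_params(d, D):
    # Random hash h: [d] -> [D] and random signs s: [d] -> {-1, +1},
    # fixed once and reused for every feature vector.
    h = rng.integers(0, D, size=d)
    s = rng.choice([-1.0, 1.0], size=d)
    return h, s

def count_sketch(x, h, s, D):
    # Project x in R^d down to R^D by signed hashing.
    psi = np.zeros(D)
    np.add.at(psi, h, s * x)
    return psi

def tensor_sketch(x, y, params_x, params_y, D):
    # Circular convolution of the two count sketches (done via FFT)
    # approximates the count sketch of the outer product x (x) y,
    # i.e. a compact bilinear pooling of the two modalities.
    px = count_sketch(x, *params_x, D)
    py = count_sketch(y, *params_y, D)
    return np.real(np.fft.ifft(np.fft.fft(px) * np.fft.fft(py)))

def temporal_cbp(frames_a, frames_b, params_a, params_b, D):
    # Hypothetical temporal pooling: sketch each time step of the two
    # modality streams, then average over time into one clip descriptor.
    return np.mean([tensor_sketch(a, b, params_a, params_b, D)
                    for a, b in zip(frames_a, frames_b)], axis=0)

# Usage: two 16-d modality streams over 5 time steps -> one 256-d descriptor.
d, D = 16, 256
pa, pb = make_sketch_params(d, D), make_sketch_params(d, D)
frames = rng.standard_normal((5, d))
clip_descriptor = temporal_cbp(frames, frames, pa, pb, D)
```

The appeal of this projection is that the bilinear (outer-product) interaction between modalities is captured in D dimensions instead of d*d, which keeps clip descriptors compact enough for retrieval and clustering.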

Publications

