Computer Graphics Laboratory ETH Zurich

Label-Based Automatic Alignment of Video with Narrative Sentences

P. Dogan, M. Gross, J.C. Bazin

Proceedings of Computer Vision - ECCV 2016 Workshops (Amsterdam, the Netherlands, October 8-10 and 15-16, 2016), pp. 605-620

Abstract

In this paper, we consider videos (e.g., Hollywood movies) and their accompanying natural language descriptions in the form of narrative sentences (e.g., movie scripts without timestamps). We propose a method for temporally aligning the video frames with the sentences using both visual and textual information, which provides automatic timestamps for each narrative sentence. We compute the similarity between both types of information using vectorial descriptors and cast the alignment task as a matching problem that we solve via dynamic programming. Our approach is simple to implement, highly efficient, and does not require frequent dialogues, subtitles, or character face recognition. Experiments on various movies demonstrate that our method can successfully align movie script sentences with the corresponding video frames.
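
To make the matching idea in the abstract concrete, the following is a minimal sketch of an order-preserving alignment solved by dynamic programming. It is not the authors' implementation: the function names (`cosine`, `similarity_matrix`, `align`), the use of cosine similarity, the toy label vectors, and the constraint that every sentence receives at least one frame are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(sentence_vecs, frame_vecs):
    """sim[i][j] = similarity between sentence i and frame j."""
    return [[cosine(s, f) for f in frame_vecs] for s in sentence_vecs]

def align(sim):
    """Assign each frame to one sentence, preserving temporal order.

    Assumed constraints (hypothetical, for this sketch only): sentences
    appear in the same order as the frames, and every sentence covers
    at least one frame. Returns a list giving the sentence index of
    each frame, chosen to maximize the summed similarity.
    """
    n, m = len(sim), len(sim[0])
    if n > m:
        raise ValueError("need at least one frame per sentence")
    neg_inf = float("-inf")
    score = [[neg_inf] * m for _ in range(n)]
    advanced = [[False] * m for _ in range(n)]  # True: frame j starts sentence i

    # First sentence: it must cover frames 0..j contiguously.
    score[0][0] = sim[0][0]
    for j in range(1, m):
        score[0][j] = score[0][j - 1] + sim[0][j]

    for i in range(1, n):
        for j in range(i, m):
            stay = score[i][j - 1]         # previous frame also in sentence i
            move = score[i - 1][j - 1]     # previous frame in sentence i - 1
            if move > stay:
                advanced[i][j] = True
                score[i][j] = move + sim[i][j]
            else:
                score[i][j] = stay + sim[i][j]

    # Backtrack from the last frame, which must belong to the last sentence.
    assignment = [0] * m
    i = n - 1
    for j in range(m - 1, -1, -1):
        assignment[j] = i
        if advanced[i][j]:
            i -= 1
    return assignment

# Toy example: three sentences and six frames described by one-hot label vectors.
sentences = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
frames = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
result = align(similarity_matrix(sentences, frames))
print(result)
```

In this sketch, the first and last frame index assigned to a sentence would act as its start and end timestamps. The table has one cell per sentence-frame pair, so the cost grows with the product of the two counts.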

@Inbook{Dogan2016,
author="Dogan, Pelin and Gross, Markus and Bazin, Jean-Charles",
editor="Hua, Gang and J{\'e}gou, Herv{\'e}",
title="Label-Based Automatic Alignment of Video with Narrative Sentences",
bookTitle="Computer Vision -- ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part I",
year="2016",
publisher="Springer International Publishing",
address="Cham",
pages="605--620",
isbn="978-3-319-46604-0",
doi="10.1007/978-3-319-46604-0_43",
url="http://dx.doi.org/10.1007/978-3-319-46604-0_43"
}
