Contextual Video Captioning


Abstract

In recent years, video captioning has received considerable attention. Transforming visual input into a different representation, such as the textual domain, benefits applications like video indexing, navigation and retrieval, automatic search, and human-robot interaction. Whereas most research has focused on improving generic video captions, this work focuses on producing more precise video captions that incorporate background knowledge. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input and mines relevant knowledge, such as names and locations, from contextual text. In contrast to previous approaches, we do not preprocess the text further, but let the model learn to attend over it directly. Guided by the visual input, the model is able to copy words from the context via a pointer-generator network, allowing it to produce more specific video descriptions. We achieve competitive results on the News Video dataset. Further, we augment a subset of the Large Scale Movie Description Challenge (LSMDC) dataset with additional context in the form of movie script scenes, and use the extended dataset to set a first benchmark.
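The copy mechanism mentioned above can be illustrated with a small, simplified sketch; this is not the thesis implementation, but a generic pointer-generator head in the spirit of See et al. It assumes a PyTorch decoder that already provides its hidden state, current input embedding, attended context vector, and attention weights over the context tokens, and it assumes for simplicity that all of these share one hidden dimension and that context words are in the vocabulary. The class and argument names (PointerGeneratorHead, dec_state, ctx_token_ids, ...) are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerGeneratorHead(nn.Module):
    """Mixes the decoder's generation distribution with a copy distribution
    over context tokens (pointer-generator style). Illustrative sketch only."""

    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)  # generation scores
        self.p_gen_proj = nn.Linear(3 * hidden_dim, 1)       # soft switch p_gen

    def forward(self, dec_state, dec_input, ctx_vector, attn_weights, ctx_token_ids):
        # dec_state, dec_input, ctx_vector: (batch, hidden_dim)
        # attn_weights:  (batch, ctx_len)  attention over context tokens
        # ctx_token_ids: (batch, ctx_len)  vocabulary ids of the context tokens
        p_vocab = F.softmax(self.vocab_proj(dec_state), dim=-1)
        p_gen = torch.sigmoid(
            self.p_gen_proj(torch.cat([dec_state, dec_input, ctx_vector], dim=-1))
        )
        # Scale the generation distribution by p_gen ...
        p_final = p_gen * p_vocab
        # ... then add the copy probability mass of each context token.
        p_final = p_final.scatter_add(1, ctx_token_ids, (1 - p_gen) * attn_weights)
        return p_final  # (batch, vocab_size)

In this formulation the attention weights, which in the thesis are guided by the visual input, directly become copy probabilities, so specific context words such as names or locations can appear in the caption even when the pure generation distribution would favor generic wording.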


Philipp Rimle

Master's Thesis

Status: Completed
