DOI: 10.1049/cvi2.12195

A latent topic‐aware network for dense video captioning

Tao Xu, Yuanyuan Cui, Xinyu He, Caihua Liu
  • Computer Vision and Pattern Recognition
  • Software

Abstract

Multiple events in a long untrimmed video tend to be similar and continuous. These characteristics can be regarded as a kind of topic‐level semantic information, which may manifest as the same sport, similar scenes, or the same objects. Inspired by this, a novel latent topic‐aware network (LTNet) is proposed in this article. LTNet explores latent themes within videos and generates more continuous captions. First, a global visual topic finder detects the similarity among events and obtains latent topic‐level features. Second, a latent topic‐oriented relation learner further enhances the topic‐level representations by capturing the relationship between each event and the video themes. Benefiting from the finder and the learner, the caption generator can predict more accurate and coherent descriptions. The effectiveness of the proposed method is demonstrated on the ActivityNet Captions and YouCook2 datasets, where LTNet achieves relative gains of over 3.03% and 0.50% in CIDEr score, respectively.
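
The idea of pooling topic‐level information from similar events can be illustrated with a small sketch. This is not the LTNet architecture, whose finder and learner are trained neural modules not specified in the abstract. It is a hand-written stand-in that assumes events are given as plain feature vectors, uses cosine similarity in place of a learned similarity, and adds the pooled "topic" vector back to each event. The names `topic_aware_features` and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_aware_features(events, temperature=0.1):
    """Assumed stand-in for topic-aware enhancement, not the paper's method.

    For each event, weight every event in the video by its similarity to the
    current one, pool them into a topic-level vector, and add that vector to
    the original event feature.
    """
    enhanced = []
    for event in events:
        weights = softmax([cosine(event, other) / temperature for other in events])
        topic = [
            sum(w * other[d] for w, other in zip(weights, events))
            for d in range(len(event))
        ]
        enhanced.append([a + b for a, b in zip(event, topic)])
    return enhanced


# Three toy events: the first two are similar, the third is different.
events = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
enhanced = topic_aware_features(events)
```

In this toy input the first two events share most of their pooled topic vector, while the third event draws almost entirely on itself, which mirrors the intuition that similar events in a video share a latent theme.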
