Cross-modality Multiple Relations Learning for Knowledge-Based Visual Question Answering
Yan Wang, Peize Li, Qingyi Si, Hanwen Zhang, Wenyu Zang, Zheng Lin, Peng Fu
Knowledge-based visual question answering must not only answer questions about images but also incorporate external knowledge to support reasoning in the joint space of vision and language. To bridge the gap between visual content and semantic cues, it is important to capture question-related, semantics-rich vision-language connections. Most existing solutions model only simple intra-modality relations or represent a cross-modality relation with a single vector, which makes it difficult to model the complex connections between visual features and question features. We therefore propose a cross-modality multiple relations learning model that enriches cross-modality representations and constructs advanced multi-modality knowledge triplets. First, we design a simple yet effective method to generate multiple relations that capture the rich cross-modality interactions: the various cross-modality relations link the textual question to the related visual objects, and the resulting multi-modality triplets align the visual objects with their corresponding textual answers. Second, to encourage the multiple relations to align with distinct semantic relations, we further formulate a novel global-local loss. The global loss pulls visual objects and their corresponding textual answers close to each other through the cross-modality relations in the vision-language space, while the local loss preserves semantic diversity among the multiple relations. Experimental results on the OKVQA and KRVQA datasets demonstrate that our model outperforms state-of-the-art methods.
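To make the global-local objective concrete, the following is a minimal sketch of one plausible instantiation. It assumes a TransE-style global term, in which each relation vector should translate a visual object embedding close to its answer embedding, and a local term that penalizes pairwise cosine similarity among the relation vectors to keep them semantically diverse. The function name, the translation-based formulation, and the weighting factor `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def global_local_loss(obj, rels, ans, alpha=0.1):
    """Hypothetical sketch of a global-local loss.

    obj:  (d,)   visual object embedding
    rels: (K, d) multiple cross-modality relation embeddings
    ans:  (d,)   textual answer embedding
    alpha: assumed weight balancing the local diversity term
    """
    # Global alignment term (TransE-style assumption): each relation
    # should translate the object embedding close to the answer,
    # i.e. ||obj + r_k - ans|| should be small for every relation r_k.
    global_loss = np.mean([np.linalg.norm(obj + r - ans) for r in rels])

    # Local diversity term: penalize the average pairwise cosine
    # similarity among relations so they model distinct semantics.
    K = len(rels)
    if K > 1:
        unit = rels / np.linalg.norm(rels, axis=1, keepdims=True)
        sim = unit @ unit.T
        off_diag = sim[~np.eye(K, dtype=bool)]
        local_loss = np.mean(np.maximum(off_diag, 0.0))
    else:
        local_loss = 0.0

    return global_loss + alpha * local_loss
```

Under this reading, the global term drives (object, relation, answer) triplets to hold as translations in the shared space, while the local term stops the K relations from collapsing onto a single direction.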