DOI: 10.3390/app14083193 ISSN: 2076-3417

EmoStyle: Emotion-Aware Semantic Image Manipulation with Audio Guidance

Qiwei Shen, Junjie Xu, Jiahao Mei, Xingjiao Wu, Daoguo Dong
  • Fluid Flow and Transfer Processes
  • Computer Science Applications
  • Process Chemistry and Technology
  • General Engineering
  • Instrumentation
  • General Materials Science

With the rapid development of generative models, image manipulation is receiving increasing attention. Beyond the text modality, several designs have explored using audio to manipulate images. However, existing methods mainly focus on image generation conditioned on semantic alignment and ignore the affective information conveyed by the audio. We propose an Emotion-aware StyleGAN Manipulator (EmoStyle), a framework in which affective information is explicitly extracted from audio and then used during image manipulation. Specifically, we first use the multi-modal model ImageBind for an initial cross-modal retrieval between images and music, and select the music-related image for further manipulation. In parallel, we extract sentiment polarity from the lyrics of the audio and use it to generate an emotionally rich auxiliary music branch that accentuates the affective information. We then use pre-trained encoders to encode the audio and the audio-related image into the same embedding space. With the aligned embeddings, we manipulate the image via direct latent optimization. We conduct objective and subjective evaluations on the generated images, and the results show that our framework can generate images that reflect the specified human emotions conveyed in the audio.
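The two embedding-space steps named in the abstract, cross-modal retrieval and direct latent optimization, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random vectors stand in for ImageBind embeddings, the matrix `A` is a hypothetical linear stand-in for the StyleGAN generator followed by a pre-trained image encoder, and the loss (cosine distance plus a penalty for drifting from the initial latent) and all function names are assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_image(audio_emb, image_embs):
    """Return the index of the image embedding most similar to the audio.

    Similarity is cosine similarity in a shared embedding space
    (assumed here; the embeddings are placeholders, not ImageBind outputs).
    """
    sims = l2_normalize(image_embs) @ l2_normalize(audio_emb)
    return int(np.argmax(sims)), sims


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(l2_normalize(u) @ l2_normalize(v))


def optimize_latent(w0, audio_emb, A, steps=200, lr=0.1, reg=0.01):
    """Gradient descent on a latent code w.

    Hypothetical loss: (1 - cos(A @ w, audio_emb)) + reg * ||w - w0||^2,
    where A is a linear stand-in for "generator + image encoder".
    """
    a_hat = l2_normalize(audio_emb)
    w = w0.copy()
    for _ in range(steps):
        u = A @ w
        u_norm = np.linalg.norm(u) + 1e-12
        u_hat = u / u_norm
        cos = float(u_hat @ a_hat)
        # Derivative of cos(u, a) with respect to u, for unit-length a.
        dcos_du = (a_hat - cos * u_hat) / u_norm
        grad = -A.T @ dcos_du + 2.0 * reg * (w - w0)
        w -= lr * grad
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_emb, d_lat = 8, 16
    audio = rng.standard_normal(d_emb)          # placeholder audio embedding
    images = rng.standard_normal((5, d_emb))    # placeholder image embeddings
    idx, _ = retrieve_image(audio, images)
    A = rng.standard_normal((d_emb, d_lat)) / np.sqrt(d_lat)
    w0 = rng.standard_normal(d_lat)             # placeholder initial latent
    w = optimize_latent(w0, audio, A)
    print("retrieved image index:", idx)
    print("cosine before:", cosine(A @ w0, audio))
    print("cosine after: ", cosine(A @ w, audio))
```

In a real pipeline the linear map would be replaced by a differentiable generator and encoder, and the gradient would come from automatic differentiation rather than the closed form used here.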
