A Multimodal Framework For Automated Background Music Generation In Japanese Manga Using Large Language Models

Megha Sharma; Muhammad Taimoor Haseeb; Gus Xia; Yoshimasa Tsuruoka

doi:10.24140/ijfma.v11i1.10913

Megha Sharma, MBZUAI, Abu Dhabi
Muhammad Taimoor Haseeb, MBZUAI, Abu Dhabi
Gus Xia, MBZUAI, Abu Dhabi
Yoshimasa Tsuruoka, The University of Tokyo, Japan

Abstract

Recent waves of digitization have introduced a new way of enjoying comics: with music. We begin our analysis of this movement by discussing the many modes of pairing music and comics, as well as how music-comic pairs are received by readers and scholars. Our literature review reveals that this movement has been growing slowly because of the time, effort, and cost required to produce music centered on the detailed story of a comic. In this work, we present a possible approach that uses generative AI to automate background music generation. The biggest challenge in this task is the lack of available data or a baseline. After extensive experimentation, we propose an audio generation pipeline that produces background music for an input manga (Japanese comic) book. By incorporating scene segmentation, longer context, and prompt engineering, we create a novel reading experience for manga readers by adding music as an additional stimulus. The pipeline begins by using the dialogue in a manga to detect scene boundaries and by classifying emotion from the characters’ faces within each scene. Then, we use GPT-4o to translate this low-level scene information into a high-level music directive. Conditioned on the scene information and the music directive, another instance of GPT-4o generates page-level music captions to guide a state-of-the-art text-to-music model. This produces music that is aligned with the manga’s evolving narrative. In a subjective evaluation, participants preferred the proposed pipeline over the baseline pipeline by a statistically significant margin across the metrics of relevancy, quality, and consistency. In particular, we find that our pipeline’s largest contributions are providing consistency across pages and generating efficient text prompts for the music generation step.
Alongside development, we examined the ethical risks of generative AI concerning transparency, bias, access, and exploitation, and mitigated them where possible. We provide output samples of the pipeline at: manga-to-music.github.io/M2M-Gen/
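
The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Page` type, the function names, and the prompt wording are all hypothetical, and the GPT-4o and text-to-music steps are represented by plain callables that the caller supplies. Scene boundaries and emotion labels are assumed to have been computed upstream.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names here are hypothetical and chosen for illustration only.


@dataclass
class Page:
    dialogue: str   # text taken from the page's speech bubbles
    scene_id: int   # assumed output of an upstream scene-boundary detector
    emotion: str    # assumed output of an upstream face-emotion classifier


LLM = Callable[[str], str]            # stand-in for a GPT-4o call
TextToMusic = Callable[[str], bytes]  # stand-in for a text-to-music model


def group_by_scene(pages: List[Page]) -> List[List[Page]]:
    """Group consecutive pages that share the same scene_id."""
    scenes: List[List[Page]] = []
    for page in pages:
        if scenes and scenes[-1][0].scene_id == page.scene_id:
            scenes[-1].append(page)
        else:
            scenes.append([page])
    return scenes


def generate_music(
    pages: List[Page], llm: LLM, text_to_music: TextToMusic
) -> List[bytes]:
    """Return one audio clip per page, in reading order."""
    clips: List[bytes] = []
    for scene in group_by_scene(pages):
        # Step 1: low-level scene information -> high-level music directive.
        summary = " | ".join(f"[{p.emotion}] {p.dialogue}" for p in scene)
        directive = llm(f"Write a music directive for this scene: {summary}")
        # Step 2: one caption per page, conditioned on the shared directive,
        # so that pages within a scene stay musically consistent.
        for p in scene:
            caption = llm(
                f"Directive: {directive}\n"
                f"Page emotion: {p.emotion}\n"
                f"Page dialogue: {p.dialogue}\n"
                "Write a short music caption."
            )
            # Step 3: the caption guides the text-to-music model.
            clips.append(text_to_music(caption))
    return clips
```

Sharing one directive across every page of a scene is what this sketch uses to reflect the cross-page consistency the abstract highlights; how the actual pipeline carries longer context is not specified here.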

Keywords: Manga, Multimodal Analysis, Digital Comics, Generative AI, Content-Based Music Generation

Author Biographies

Megha Sharma, MBZUAI, Abu Dhabi

Megha Sharma is a master of science at the University of Tokyo. She specialises in multimodal music retrieval and generation, as well as ethical MIR and NLP. She is a research assistant at MBZUAI, United Arab Emirates. Her interests lie in methods for ethical work with AI, in particular generating music, predicting historical risks, and analysing social media outputs.

ORCID ID: 0000-0002-4313-7459

Muhammad Taimoor Haseeb, MBZUAI, Abu Dhabi

Muhammad Taimoor Haseeb is a PhD candidate at MBZUAI, focusing on AI-driven computational music. His research includes multimodal frameworks for music generation and music loop compatibility for Digital Audio Workstations. He is an alumnus of McGill and York University.

Gus Xia, MBZUAI, Abu Dhabi

Prof. Gus Xia is an Assistant Professor of Machine Learning at MBZUAI, working at the intersection of ML, HCI, Robotics, and Computer Music. His research focuses on interactive intelligent systems for musical creation. He holds a PhD from Carnegie Mellon, was a Neukom Fellow at Dartmouth, and is a professional DI/XIAO player, blending technical expertise with deep musical artistry.

Yoshimasa Tsuruoka, The University of Tokyo, Japan

Prof. Yoshimasa Tsuruoka is a Professor at the Department of Information and Communication Engineering at the University of Tokyo. He works on three research directions: Natural Language Processing, Reinforcement Learning, and Artificial Intelligence for games. He holds a PhD from the University of Tokyo.
