A Multimodal Framework For Automated Background Music Generation In Japanese Manga Using Large Language Models
Abstract
Recent waves of digitization have introduced a new way of enjoying comics: with music. We begin our analysis of this movement by discussing the many modes of pairing music with comics and how music-comic pairs are received by readers and scholars. Our literature review reveals that this movement has grown slowly because of the time, effort, and cost required to produce music centered on the detailed story of a comic. In this work, we present an approach that uses generative AI to automate background music generation. The biggest challenge in this task is the lack of available data or a baseline. After extensive experimentation, we propose an audio generation pipeline that produces background music for an input manga (Japanese comic) book. By incorporating scene segmentation, longer context, and prompt engineering, we create a novel reading experience for manga readers by adding music as an additional stimulus. The pipeline begins by using the dialogue in a manga to detect scene boundaries and by classifying emotions from the characters’ faces within each scene. Then, we use GPT-4o to translate this low-level scene information into a high-level music directive. Conditioned on the scene information and the music directive, another instance of GPT-4o generates page-level music captions to guide a state-of-the-art text-to-music model. This produces music that is aligned with the manga’s evolving narrative. In a subjective evaluation, we find that participants prefer the proposed pipeline over the baseline pipeline by a statistically significant margin on the metrics of relevancy, quality, and consistency. In particular, we find that our pipeline’s largest contribution comes from providing consistency across pages and from generating efficient text prompts for the music generation step.
Alongside development, we examined the ethical risks that generative AI poses to transparency, bias, access, and exploitation, and mitigated them where possible. We provide output samples of the pipeline at: manga-to-music.github.io/M2M-Gen/
Keywords: Manga, Multimodal Analysis, Digital Comics, Generative AI, Content-Based Music Generation
Copyright (c) 2026 International Journal of Film and Media Arts

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.





