MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models
Paper: arXiv:2506.23009
Phi-3-MusiX is a LoRA adapter for microsoft/Phi-3-vision-128k-instruct that adds understanding of symbolic music in the form of scanned music sheets, MIDI files, and structured annotations. With the adapter attached, Phi-3 can perform symbolic music reasoning and answer questions about music-sheet images and MIDI content. The snippet below loads the base model, attaches the adapter, and asks a single question about a sheet-music image.
import torch
import requests
from io import BytesIO
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
def load_img(img_dir):
    # Load an image from a URL or a local path and convert it to RGB
    if img_dir.startswith('http://') or img_dir.startswith('https://'):
        response = requests.get(img_dir)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(img_dir).convert('RGB')
    return image
# Load the base model and processor, then attach the MusiX LoRA adapter (requires peft)
model = AutoModelForCausalLM.from_pretrained('microsoft/Phi-3-vision-128k-instruct', device_map="cuda", trust_remote_code=True, torch_dtype="auto")
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
model.load_adapter('puar-playground/Phi-3-MusiX')
# Example inputs: a scanned sheet image (local path or URL) and a question about it
img_dir = 'path/to/music_sheet.png'  # placeholder path, replace with your own image
question_string = 'What is the key signature of this piece?'  # example question
prompt = f'USER: Answer the question:\n{question_string}. ASSISTANT:'
# setup message
messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]
# load image from dir
image = load_img(img_dir)
prompt_in = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt_in, [image], return_tensors="pt").to("cuda")
generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.1,  # ignored here since do_sample=False (greedy decoding)
    "do_sample": False,
}
with torch.no_grad():
    generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)
# drop the prompt tokens, keeping only the newly generated ones
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
model_answer = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(model_answer)
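For repeated queries, the steps above can be wrapped in a small convenience helper. This is only a sketch built on the objects already defined (model, processor, generation_args, load_img); it is not part of the released adapter, and the image path and question in the example call are placeholders.

def ask(image, question):
    # Build the Phi-3-vision chat prompt with one image placeholder and the question
    messages = [{"role": "user", "content": f"<|image_1|>\nUSER: Answer the question:\n{question}. ASSISTANT:"}]
    text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    batch = processor(text, [image], return_tensors="pt").to("cuda")
    with torch.no_grad():
        out_ids = model.generate(**batch, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)
    # keep only the newly generated tokens and decode them
    out_ids = out_ids[:, batch['input_ids'].shape[1]:]
    return processor.batch_decode(out_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# Example call with a placeholder image path and question
print(ask(load_img('path/to/music_sheet.png'), 'How many measures are in the first line?'))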
The model is trained on the MusiXQA dataset, which includes four QA sets. Each entry in the dataset includes a scanned music sheet, the corresponding MIDI file, and structured annotations (metadata.json).
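A minimal sketch of downloading the dataset files and reading the annotation file, assuming the dataset is hosted on the Hugging Face Hub; the repository ID below is a hypothetical placeholder, not confirmed by this card:

import json
from huggingface_hub import snapshot_download

# Hypothetical dataset repository ID; replace with the actual MusiXQA repo
local_dir = snapshot_download(repo_id='puar-playground/MusiXQA', repo_type='dataset')
# metadata.json is assumed to sit at the repository root
with open(f'{local_dir}/metadata.json') as f:
    metadata = json.load(f)
print(type(metadata))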
If you use this dataset in your work, please cite it using the following reference:
@misc{chen2025musixqaadvancingvisualmusic,
title={MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models},
author={Jian Chen and Wenye Ma and Penghang Liu and Wei Wang and Tengwei Song and Ming Li and Chenguang Wang and Jiayu Qin and Ruiyi Zhang and Changyou Chen},
year={2025},
eprint={2506.23009},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.23009},
}
Base model: microsoft/Phi-3-vision-128k-instruct