---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---

# Phi-4-Audio

**Phi-4-Audio** is a streamlined, audio-only adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, optimized exclusively for audio-text interactions (e.g., Automatic Speech Recognition). By surgically removing the vision processing components (the image encoder, vision projection layers, and associated processing logic), it delivers lower memory usage while retaining the original model's audio understanding capabilities.

## Usage & Performance

This model is ideal for scenarios where audio is the sole input modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well suited for researchers who want to fine-tune the model for audio tasks without carrying unused vision parameters.

### Key Improvements

Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU:

* **Reduced Footprint:** Parameter count reduced by approximately **450 million** (see the comparison table below).
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains the full audio-understanding capabilities while running lighter.

## Uses

### Intended Use

* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation.
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.

### Out of Scope

- **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.

## How to Get Started

The model is fully compatible with the Hugging Face `transformers` library. You can use it exactly like the original model, except that image inputs are not supported.
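The end-to-end example below downloads a short clip and transcribes it. Note that the processor call in the example passes `sampling_rate=16000`; if your own recordings use a different sample rate, resampling them to 16 kHz first keeps the audio consistent with that argument. A minimal sketch, assuming `librosa` is installed (the file name is hypothetical):

```python
import librosa
import soundfile as sf

# Hypothetical local file; replace with your own recording.
audio, sr = sf.read("my_recording.wav")
if audio.ndim > 1:
    # Downmix stereo to mono before resampling.
    audio = audio.mean(axis=1)
if sr != 16000:
    # Match the 16 kHz rate passed to the processor in the example below.
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sr = 16000
```

The resampled `audio` array can then be passed to the processor exactly as in the example that follows.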
```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen

from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


class StrippedVisionModule(nn.Module):
    """Placeholder module that rejects any attempt to run the vision path."""

    def __init__(self):
        super().__init__()

    def forward(self, **kwargs):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    """Replace the unused vision submodules with stubs so their memory can be freed."""
    passed_model = model
    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model

    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()

    try:
        torch.cuda.empty_cache()
    except Exception:
        pass

    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)

# Download a sample audio clip and decode it with soundfile.
audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

# Build the chat prompt with the audio placeholder token.
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

generate_ids = model.generate(**inputs)

# Decode only the newly generated tokens (skip the prompt).
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]
print(f"{response=}")
```

## Model Details

- Base Architecture: Phi-4 Multimodal
- Modifications:
  - Removed `embed_tokens_extend.image_embed`
  - Removed `audio_embed.down_proj_for_vision_speech`
  - Removed `audio_embed.up_proj_for_vision_speech`

## Comparisons

### Parameter Count

| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |

### Benchmark Results

Tested on an NVIDIA RTX 5090 with `torch.bfloat16`. A minimal reproduction sketch is given at the end of this card.

| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |

## Citation

If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.

```bibtex
@article{abouelenin2025phi,
  title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
  author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
  journal={arXiv preprint arXiv:2503.01743},
  year={2025}
}
```
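## Reproducing the Benchmarks

The exact benchmark script is not included in this card. The following is a minimal sketch of one way to check the parameter count and measure peak memory and warm throughput, assuming `model`, `inputs`, and `device` from the usage example above are still in scope; `max_new_tokens=64` is an arbitrary choice, and the numbers will vary with hardware and generation length.

```python
import time

import torch

# Parameter count (compare with the table above).
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

# Warm-up pass so the timed run excludes one-time setup costs.
_ = model.generate(**inputs, max_new_tokens=64)
torch.cuda.synchronize(device)

# Timed run: reset the peak-memory counter, generate, then read it back.
torch.cuda.reset_peak_memory_stats(device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=64)
torch.cuda.synchronize(device)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
print(f"Peak memory: {peak_gb:.2f} GB, ~{new_tokens / elapsed:.1f} tokens/s")
```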