---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---
# Phi-4-Audio
**Phi-4-Audio** is a streamlined adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, optimized exclusively for audio-text interactions (e.g., Automatic Speech Recognition).
The vision processing components (the image encoder, vision projection layers, and associated processing logic) have been removed, yielding a lighter model with lower memory usage that retains the original model's audio understanding capabilities.
## Usage & Performance
This model is ideal for scenarios where audio processing is the sole modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well-suited for researchers aiming to fine-tune the model specifically for audio tasks without the overhead of unused vision parameters.
### Key Improvements
Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU (a measurement sketch follows this list):
* **Reduced Footprint:** Parameter count reduced by approximately **454 million** (4.74B → 4.29B).
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains full audio-understanding capabilities while running lighter.
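The exact figures appear in the comparison tables later in this card. As a rough way to reproduce the parameter count and GPU footprint on your own hardware, here is a minimal sketch; the `report` helper is purely illustrative, it assumes a CUDA device and a `transformers` version that loads both checkpoints, and it measures allocation at load time rather than the full inference peak:

```python
import torch
from transformers import AutoModelForCausalLM

def report(model_path: str) -> None:
    # Reset the CUDA peak-memory counter, load the model, and report its size.
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(
        model_path, device_map="cuda", dtype=torch.bfloat16
    )
    total_params = sum(p.numel() for p in model.parameters())
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{model_path}: {total_params:,} params, {peak_gb:.2f} GB peak allocated")
    del model
    torch.cuda.empty_cache()

report("JacobLinCool/phi-4-audio")
report("microsoft/Phi-4-multimodal-instruct")
```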
## Uses
### Intended Use
* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation (see the prompt sketch after this list).
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.
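All of these tasks share the chat template used in the usage example below; only the instruction placed after the `<|audio|>` token changes. A minimal sketch follows; the `build_prompt` helper and the exact instruction wordings are illustrative, not fixed strings required by the model:

```python
# Prompt template shared by all audio tasks: <|user|><|audio|>{instruction}<|end|><|assistant|>
def build_prompt(instruction: str) -> str:
    return f"<|user|><|audio|>{instruction}<|end|><|assistant|>"

# Illustrative instructions; phrase them however suits your task.
asr_prompt = build_prompt("Transcribe the audio clip into text.")
translation_prompt = build_prompt("Translate the audio clip into English.")
summary_prompt = build_prompt("Summarize the main points of the audio clip.")
```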
### Out of Scope
- **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.
## How to Get Started
The model is fully compatible with the Hugging Face `transformers` library and can be used the same way as the original model, except that image inputs are not supported.
```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen
from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


class StrippedVisionModule(nn.Module):
    """Placeholder that replaces the vision submodules and rejects any vision input."""

    def __init__(self):
        super().__init__()

    def forward(self, **kwargs):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    """Replace the vision-related submodules with inert placeholders, in place."""
    passed_model = model
    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model
    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()
    try:
        # Free the memory held by the removed vision modules.
        torch.cuda.empty_cache()
    except Exception:
        pass
    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)

# Load a sample audio clip.
audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

# Build the chat prompt with an audio placeholder token.
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

generate_ids = model.generate(**inputs)
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]
print(f"{response=}")
```
## Model Details
- Base Architecture: Phi-4 Multimodal
- Modifications (see the verification sketch after this list):
- Removed `embed_tokens_extend.image_embed`
- Removed `audio_embed.down_proj_for_vision_speech`
- Removed `audio_embed.up_proj_for_vision_speech`
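To confirm the vision paths are gone (or replaced by the inert `StrippedVisionModule` stubs after calling `strip_vision_inplace` as in the usage example above), a quick check along these lines may help; `model` is assumed to be the loaded Phi-4-Audio model from that example:

```python
# After strip_vision_inplace (see the usage example above), the vision paths
# should be inert StrippedVisionModule stubs with no parameters.
emb_ext = model.model.embed_tokens_extend
print(type(getattr(emb_ext, "image_embed", None)).__name__)
print(type(getattr(emb_ext.audio_embed, "down_proj_for_vision_speech", None)).__name__)
print(type(getattr(emb_ext.audio_embed, "up_proj_for_vision_speech", None)).__name__)

# The total parameter count should roughly match the table below (~4.29B).
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```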
## Comparisons
### Parameter Count
| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |
### Benchmark Results
Tested on an NVIDIA RTX 5090 with `torch.bfloat16`; a throughput-measurement sketch follows the table.
| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory (GB) | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |
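To reproduce a rough warm-throughput number, a minimal sketch reusing `model` and `inputs` from the ASR example above is shown below; exact figures depend on prompt length and generation settings:

```python
import time

# Warm-up pass so one-time setup costs do not skew the timing.
_ = model.generate(**inputs, max_new_tokens=8)
torch.cuda.synchronize()

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```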
## Citation
If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.
```bibtex
@article{abouelenin2025phi,
title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
journal={arXiv preprint arXiv:2503.01743},
year={2025}
}
```