---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---

# Phi-4-Audio

**Phi-4-Audio** is a highly efficient adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, exclusively optimized for audio-text interactions (e.g., Automatic Speech Recognition).

By removing the vision processing components (the image encoder, vision projection layers, and associated processing logic), we have created a streamlined model that delivers lower memory usage while retaining the original model's audio understanding capabilities.

## Usage & Performance

This model is ideal for scenarios where audio processing is the sole modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well-suited for researchers aiming to fine-tune the model specifically for audio tasks without the overhead of unused vision parameters.

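For audio-only fine-tuning, one option is to freeze the language model and train just the audio embedding tower. The sketch below assumes that approach and relies on the `embed_tokens_extend.audio_embed` module path used later in this card; it is a starting point, not a full training recipe.

```python
import torch
from transformers import AutoModelForCausalLM

# Minimal sketch: freeze everything, then unfreeze only the audio embedding tower.
model = AutoModelForCausalLM.from_pretrained("JacobLinCool/phi-4-audio", dtype=torch.bfloat16)
for param in model.parameters():
    param.requires_grad = False
for param in model.model.embed_tokens_extend.audio_embed.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```
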
### Key Improvements

Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU:

* **Reduced Footprint:** Parameter count reduced by approximately **454 million**.
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains full audio-understanding capabilities while running lighter.

## Uses

### Intended Use

* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation.
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.

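All of these tasks use the same chat template as the transcription example in the How to Get Started section; only the instruction text changes. The wordings below are illustrative (only the transcription prompt is taken from that example), not fixed strings required by the model:

```python
# Illustrative task instructions; the transcription prompt matches the example
# below, the other wordings are assumptions you can rephrase freely.
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"

asr_instruction = "Transcribe the audio clip into text."
translation_instruction = "Translate the audio clip into English text."
summary_instruction = "Summarize the audio clip in a few sentences."

prompt = f"{user_prompt}<|audio|>{translation_instruction}{prompt_suffix}{assistant_prompt}"
```
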
### Out of Scope

- **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.

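For illustration, once `strip_vision_inplace` from the snippet in the next section has replaced the vision submodules with stubs, any call into the vision path raises immediately:

```python
# Sketch: after strip_vision_inplace(model), the stubbed vision path raises
# instead of silently producing embeddings.
try:
    model.model.embed_tokens_extend.image_embed()
except ValueError as err:
    print(err)  # "Vision is not supported"
```
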
## How to Get Started

The model is fully compatible with the Hugging Face `transformers` library. You can use it in the same way as the original model, except that image inputs are not supported.

```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen
from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


class StrippedVisionModule(nn.Module):
    """Placeholder module that raises if any vision path is exercised."""

    def __init__(self):
        super().__init__()

    def forward(self, **kwargs):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    """Replace any vision submodules still present on the loaded model with
    stubs and release the memory they held."""
    passed_model = model

    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model

    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()

    # Release any GPU memory cached for the dropped modules.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)


# Download a sample audio clip and decode it from the in-memory buffer.
audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

# Build the chat-style prompt with an audio placeholder token.
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

# The clip is assumed to be sampled at 16 kHz; resample first if yours differs.
inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

generate_ids = model.generate(**inputs)
# Decode only the newly generated tokens (everything after the prompt).
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]

print(f"{response=}")
```

## Model Details

- Base Architecture: Phi-4 Multimodal
- Modifications:
  - Removed `embed_tokens_extend.image_embed`
  - Removed `audio_embed.down_proj_for_vision_speech`
  - Removed `audio_embed.up_proj_for_vision_speech`

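A quick sanity check is to scan the module tree for vision-related names; after `strip_vision_inplace` from the snippet above has run, each of them should be the `StrippedVisionModule` placeholder:

```python
# Sketch: confirm the vision-related submodules are stubs after stripping.
for name, module in model.named_modules():
    if "image_embed" in name or "for_vision_speech" in name:
        print(name, "->", type(module).__name__)
```
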
## Comparisons

### Parameter Count

| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |

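These totals can be reproduced by summing parameter counts over the loaded model (a small sketch; if the vision submodules get re-instantiated from the config at load time, run it after `strip_vision_inplace` to match the Phi-4-Audio row):

```python
# Count the parameters of the currently loaded model.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
```
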
### Benchmark Results

Tested on NVIDIA RTX 5090, `torch.bfloat16`.

| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |

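One way to measure peak inference memory with PyTorch's CUDA memory statistics (a sketch, assuming the generation snippet from the How to Get Started section has already prepared `model` and `inputs`):

```python
import torch

# Measure peak GPU memory around a generation call.
torch.cuda.reset_peak_memory_stats()
_ = model.generate(**inputs)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak memory: {peak_gb:.2f} GB")
```
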
## Citation

If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.

```bibtex
@article{abouelenin2025phi,
  title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
  author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
  journal={arXiv preprint arXiv:2503.01743},
  year={2025}
}
```