---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---

# Phi-4-Audio

**Phi-4-Audio** is a streamlined, audio-only adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, optimized exclusively for audio-text interactions (e.g., Automatic Speech Recognition). By surgically removing the vision processing components (the image encoder, vision projection layers, and associated processing logic), it delivers lower memory usage while retaining the original model's audio understanding capabilities.

## Usage & Performance

This model is ideal for scenarios where audio is the sole input modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well suited for researchers who want to fine-tune the model for audio tasks without carrying unused vision parameters.

### Key Improvements

Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU:

* **Reduced Footprint:** Parameter count reduced by approximately **450 million** (see the comparison table below).
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains the full audio-understanding capabilities while running lighter.

## Uses

### Intended Use

* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation.
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.

### Out of Scope

- **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.

## How to Get Started

The model is fully compatible with the Hugging Face `transformers` library. You can use it exactly like the original model, except that image inputs are not supported.
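The end-to-end example below downloads a short clip and transcribes it. Note that the processor call in the example passes `sampling_rate=16000`; if your own recordings use a different sample rate, resampling them to 16 kHz first keeps the audio consistent with that argument. A minimal sketch, assuming `librosa` is installed (the file name is hypothetical):

```python
import librosa
import soundfile as sf

# Hypothetical local file; replace with your own recording.
audio, sr = sf.read("my_recording.wav")
if audio.ndim > 1:
    # Downmix stereo to mono before resampling.
    audio = audio.mean(axis=1)
if sr != 16000:
    # Match the 16 kHz rate passed to the processor in the example below.
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sr = 16000
```

The resampled `audio` array can then be passed to the processor exactly as in the example that follows.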
```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen

from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


class StrippedVisionModule(nn.Module):
    """Placeholder module that rejects any attempt to run the vision path."""

    def __init__(self):
        super().__init__()

    def forward(self, **kwargs):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    """Replace the unused vision submodules with stubs so their memory can be freed."""
    passed_model = model
    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model

    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()

    try:
        torch.cuda.empty_cache()
    except Exception:
        pass

    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)

# Download a sample audio clip and decode it with soundfile.
audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

# Build the chat prompt with the audio placeholder token.
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

generate_ids = model.generate(**inputs)

# Decode only the newly generated tokens (skip the prompt).
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]
print(f"{response=}")
```

## Model Details

- Base Architecture: Phi-4 Multimodal
- Modifications:
  - Removed `embed_tokens_extend.image_embed`
  - Removed `audio_embed.down_proj_for_vision_speech`
  - Removed `audio_embed.up_proj_for_vision_speech`

## Comparisons

### Parameter Count

| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |

### Benchmark Results

Tested on an NVIDIA RTX 5090 with `torch.bfloat16`. A minimal reproduction sketch is given at the end of this card.

| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |

## Citation

If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.

```bibtex
@article{abouelenin2025phi,
  title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
  author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
  journal={arXiv preprint arXiv:2503.01743},
  year={2025}
}
```
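## Reproducing the Benchmarks

The exact benchmark script is not included in this card. The following is a minimal sketch of one way to check the parameter count and measure peak memory and warm throughput, assuming `model`, `inputs`, and `device` from the usage example above are still in scope; `max_new_tokens=64` is an arbitrary choice, and the numbers will vary with hardware and generation length.

```python
import time

import torch

# Parameter count (compare with the table above).
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

# Warm-up pass so the timed run excludes one-time setup costs.
_ = model.generate(**inputs, max_new_tokens=64)
torch.cuda.synchronize(device)

# Timed run: reset the peak-memory counter, generate, then read it back.
torch.cuda.reset_peak_memory_stats(device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=64)
torch.cuda.synchronize(device)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
print(f"Peak memory: {peak_gb:.2f} GB, ~{new_tokens / elapsed:.1f} tokens/s")
```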