---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---
# Phi-4-Audio
**Phi-4-Audio** is a streamlined adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, optimized exclusively for audio-text interactions (e.g., Automatic Speech Recognition).
The vision processing components (the image encoder, vision projection layers, and associated processing logic) have been removed, yielding a lighter model with lower memory usage that retains the original model's audio understanding capabilities.
## Usage & Performance
This model is ideal for scenarios where audio processing is the sole modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well-suited for researchers aiming to fine-tune the model specifically for audio tasks without the overhead of unused vision parameters.
### Key Improvements
Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU (a measurement sketch follows this list):
* **Reduced Footprint:** Parameter count reduced by approximately **454 million** (4.74B → 4.29B).
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains full audio-understanding capabilities while running lighter.
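The exact figures appear in the comparison tables later in this card. As a rough way to reproduce the parameter count and GPU footprint on your own hardware, here is a minimal sketch; the `report` helper is purely illustrative, it assumes a CUDA device and a `transformers` version that loads both checkpoints, and it measures allocation at load time rather than the full inference peak:

```python
import torch
from transformers import AutoModelForCausalLM

def report(model_path: str) -> None:
    # Reset the CUDA peak-memory counter, load the model, and report its size.
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(
        model_path, device_map="cuda", dtype=torch.bfloat16
    )
    total_params = sum(p.numel() for p in model.parameters())
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{model_path}: {total_params:,} params, {peak_gb:.2f} GB peak allocated")
    del model
    torch.cuda.empty_cache()

report("JacobLinCool/phi-4-audio")
report("microsoft/Phi-4-multimodal-instruct")
```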
## Uses
### Intended Use
* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation (see the prompt sketch after this list).
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.
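All of these tasks share the chat template used in the usage example below; only the instruction placed after the `<|audio|>` token changes. A minimal sketch follows; the `build_prompt` helper and the exact instruction wordings are illustrative, not fixed strings required by the model:

```python
# Prompt template shared by all audio tasks: <|user|><|audio|>{instruction}<|end|><|assistant|>
def build_prompt(instruction: str) -> str:
    return f"<|user|><|audio|>{instruction}<|end|><|assistant|>"

# Illustrative instructions; phrase them however suits your task.
asr_prompt = build_prompt("Transcribe the audio clip into text.")
translation_prompt = build_prompt("Translate the audio clip into English.")
summary_prompt = build_prompt("Summarize the main points of the audio clip.")
```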
### Out of Scope
- **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.
## How to Get Started
The model is fully compatible with the Hugging Face `transformers` library and can be used the same way as the original model, except that image inputs are not supported.
```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen
from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


class StrippedVisionModule(nn.Module):
    """Placeholder that replaces the vision submodules and rejects any vision input."""

    def __init__(self):
        super().__init__()

    def forward(self, **kwargs):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    """Replace the vision-related submodules with inert placeholders, in place."""
    passed_model = model
    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model
    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()
    try:
        # Free the memory held by the removed vision modules.
        torch.cuda.empty_cache()
    except Exception:
        pass
    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)

# Load a sample audio clip.
audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

# Build the chat prompt with an audio placeholder token.
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

generate_ids = model.generate(**inputs)
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]
print(f"{response=}")
```
## Model Details
- Base Architecture: Phi-4 Multimodal
- Modifications (see the verification sketch after this list):
- Removed `embed_tokens_extend.image_embed`
- Removed `audio_embed.down_proj_for_vision_speech`
- Removed `audio_embed.up_proj_for_vision_speech`
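To confirm the vision paths are gone (or replaced by the inert `StrippedVisionModule` stubs after calling `strip_vision_inplace` as in the usage example above), a quick check along these lines may help; `model` is assumed to be the loaded Phi-4-Audio model from that example:

```python
# After strip_vision_inplace (see the usage example above), the vision paths
# should be inert StrippedVisionModule stubs with no parameters.
emb_ext = model.model.embed_tokens_extend
print(type(getattr(emb_ext, "image_embed", None)).__name__)
print(type(getattr(emb_ext.audio_embed, "down_proj_for_vision_speech", None)).__name__)
print(type(getattr(emb_ext.audio_embed, "up_proj_for_vision_speech", None)).__name__)

# The total parameter count should roughly match the table below (~4.29B).
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```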
## Comparisons
### Parameter Count
| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |
### Benchmark Results
Tested on an NVIDIA RTX 5090 with `torch.bfloat16`; a throughput-measurement sketch follows the table.
| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory (GB) | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |
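To reproduce a rough warm-throughput number, a minimal sketch reusing `model` and `inputs` from the ASR example above is shown below; exact figures depend on prompt length and generation settings:

```python
import time

# Warm-up pass so one-time setup costs do not skew the timing.
_ = model.generate(**inputs, max_new_tokens=8)
torch.cuda.synchronize()

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```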
## Citation
If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.
```bibtex
@article{abouelenin2025phi,
title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
journal={arXiv preprint arXiv:2503.01743},
year={2025}
}
```