# TE-86M – Trimodal Embeddings (Depth-2)

TE-86M maps image, audio, and text into a shared 1280-dim embedding space, enabling cross-modal retrieval with a single vector index. All embeddings support Matryoshka truncation down to 128 dims.

Built for edge deployment – the entire model runs on a Raspberry Pi 5.
Successor to TE-75M, with depth-2 residual projection heads that break through the cross-modal retrieval ceiling of depth-1 architectures while maintaining text retrieval quality.
Also available in GGUF format for quantized edge deployment.
## Architecture

TE-86M uses lightweight edge encoders with depth-2 residual projection heads that expand through a 1920-dim hidden layer before projecting into the shared 1280-dim embedding space:

```
Text  --> LEAF-IR (768-d) --------------> DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280)
Image --> MobileNetV4-Medium (1280-d) --> DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280)
Audio --> EfficientAT mn20_as (1920-d) -> DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280)
```
All outputs are L2-normalized into the shared 1280-dim space for cross-modal cosine similarity.
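Because every output is unit-length, cross-modal cosine similarity reduces to a single dot product, so one matrix multiply ranks the whole index. A minimal retrieval sketch with random placeholder vectors standing in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Project each row onto the unit sphere, as the model does for every modality.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for projected embeddings (real ones come from the encoders + heads).
image_embs = l2_normalize(rng.standard_normal((100, 1280)))  # indexed images
text_query = l2_normalize(rng.standard_normal((1, 1280)))    # one text query

# For unit vectors, cosine similarity == dot product: one matmul scores the index.
scores = text_query @ image_embs.T           # (1, 100)
top5 = np.argsort(-scores[0])[:5]            # best-matching image indices
```

The same code works for any modality pair, since all six directions share the one space.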
| Component | Architecture | Params | Size |
|---|---|---|---|
| Text encoder | LEAF-IR (MongoDB/mdbr-leaf-ir) | 22.7M | 87.2 MB |
| Image encoder | MobileNetV4-Medium (timm) | 8.4M | 32.4 MB |
| Audio encoder | EfficientAT mn20_as | 17.9M | 68.5 MB |
| Image projection | DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280) | 12.3M | 47.0 MB |
| Audio projection | DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280) | 13.5M | 51.7 MB |
| Text projection | DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280) | 11.3M | 43.2 MB |
| Total | | 86.1M | 329.9 MB |
### Projection head detail

Each DeepProjectionHead-d2 is a depth-2 residual MLP with Matryoshka-aware training:

```
Linear(encoder_dim, 1920) -> GELU -> LayerNorm -> Dropout(0.3)
  -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
  -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
  -> Linear(1920, 1280)
```
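The forward pass above can be sketched in plain numpy (inference mode, so dropout is a no-op; LayerNorm's learnable affine parameters are omitted for brevity, and the weights below are random placeholders, not the released checkpoint):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, w, b):
    # One residual unit: Linear -> GELU -> LayerNorm, plus skip connection.
    return layer_norm(gelu(x @ w + b)) + x

def deep_projection_head_d2(x, p):
    h = layer_norm(gelu(x @ p["w_in"] + p["b_in"]))   # Linear(encoder_dim, 1920)
    h = block(h, p["w1"], p["b1"])                    # residual block 1
    h = block(h, p["w2"], p["b2"])                    # residual block 2
    z = h @ p["w_out"] + p["b_out"]                   # Linear(1920, 1280)
    # Final L2-normalization into the shared space (applied after the head).
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
enc_dim, hidden, out = 768, 1920, 1280  # text-head dimensions
params = {
    "w_in": rng.standard_normal((enc_dim, hidden)) * 0.02, "b_in": np.zeros(hidden),
    "w1": rng.standard_normal((hidden, hidden)) * 0.02, "b1": np.zeros(hidden),
    "w2": rng.standard_normal((hidden, hidden)) * 0.02, "b2": np.zeros(hidden),
    "w_out": rng.standard_normal((hidden, out)) * 0.02, "b_out": np.zeros(out),
}
emb = deep_projection_head_d2(rng.standard_normal((2, enc_dim)), params)  # (2, 1280)
```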
### Why depth-2?
Ablation experiments showed depth-1 heads hit an I->T retrieval ceiling at ~0.60 R@1 regardless of hyperparameter tuning. Depth-2 heads broke through to 0.618, providing the representational capacity to serve cross-modal AND text retrieval simultaneously. The extra 11M params (75M -> 86M) remain edge-viable.
### Matryoshka dimensions

Embeddings can be truncated to [1280, 768, 512, 256, 128] dimensions while preserving retrieval quality – the model is trained with Matryoshka Representation Learning (MRL).
## Benchmarks

SALT retrieval benchmarks use 5K trimodal samples. The full MTEB / MAEB evaluation used the 768-d Matryoshka truncation.

### Cross-modal retrieval – SALT (5K trimodal samples)
| Direction | TE-86M (86M) | TE-75M (75M) | ImageBind (1.2B) | EBind (1.78B*) |
|---|---|---|---|---|
| Image -> Text R@1 | 0.618 | 0.615 | 0.736 | 0.783 |
| Text -> Image R@1 | 0.630 | 0.614 | 0.712 | 0.779 |
| Text -> Audio R@1 | 0.108 | 0.103 | 0.038 | 0.047 |
| Audio -> Text R@1 | 0.087 | 0.082 | 0.039 | 0.035 |
| Image -> Audio R@1 | 0.068 | 0.062 | 0.023 | 0.027 |
| Audio -> Image R@1 | 0.070 | 0.063 | 0.025 | 0.032 |
### Audio retrieval – AudioCaps & Clotho
| Benchmark | Direction | TE-86M | TE-75M | CLAP-Large | ImageBind | EBind |
|---|---|---|---|---|---|---|
| AudioCaps | A->T R@1 | 0.229 | 0.210 | 0.420 | 0.116 | 0.225 |
| AudioCaps | T->A R@1 | 0.156 | 0.148 | 0.280 | 0.080 | 0.219 |
| Clotho | A->T R@1 | 0.219 | 0.208 | 0.195 | 0.061 | 0.088 |
| Clotho | T->A R@1 | 0.177 | 0.172 | 0.167 | 0.074 | 0.118 |
### Image-text retrieval – MSCOCO & Flickr30k
| Benchmark | Direction | TE-86M (86M) | TE-75M (75M) | EBind (1.78B*) | ImageBind (1.2B) |
|---|---|---|---|---|---|
| Flickr30k | I->T R@1 | 0.494 | 0.478 | 0.951 | 0.918 |
| Flickr30k | T->I R@1 | 0.332 | 0.303 | 0.853 | 0.766 |
| MSCOCO 5K | I->T R@1 | 0.343 | 0.320 | 0.743 | 0.658 |
| MSCOCO 5K | T->I R@1 | 0.225 | 0.208 | 0.559 | 0.490 |
### Zero-shot classification – ESC-50
| Model | Params | Accuracy |
|---|---|---|
| TE-86M | 86M | 93.9% |
| CLAP-Large | 67.8M | 90.5% |
| TE-75M | 75M | 93.2% |
| EBind | 1.78B* | 77.0% |
| ImageBind | 1.2B | 66.4% |
### Text retrieval – MTEB (NDCG@10)
Text-text retrieval quality in the shared embedding space, measured on MTEB retrieval tasks:
| Task | TE-86M | TE-75M | Raw LEAF-IR | Recovery |
|---|---|---|---|---|
| ArguAna | 0.545 | 0.544 | 0.594 | 92% |
| CQADupstackGaming | 0.515 | 0.506 | 0.607 | 85% |
| CQADupstackUnix | 0.334 | 0.355 | 0.428 | 78% |
| FEVERHardNegatives | 0.561 | 0.551 | 0.863 | 65% |
| HotpotQAHardNegatives | 0.554 | 0.531 | 0.700 | 79% |
| FiQA2018 | 0.291 | 0.292 | 0.392 | 74% |
| ClimateFEVER | 0.231 | 0.215 | 0.353 | 65% |
| SCIDOCS | 0.154 | 0.153 | 0.198 | 78% |
| TRECCOVID | 0.507 | 0.474 | 0.820 | 62% |
TE-86M improves MTEB text retrieval over TE-75M on 7/9 tasks. The depth-2 projection heads recover 62-92% of raw LEAF-IR's retrieval quality while mapping into the cross-modal shared space.
## Full MTEB / MAEB kitchen-sink evaluation

Full evaluation used the 768-d Matryoshka truncation of the TE-86M checkpoint `checkpoints/trimodal_3head_h1920_d2_v2_d30_tt025/best_model.pt` with MTEB 2.12.15.
The run completed 63/71 requested tasks (MTEB(eng, v2) and MAEB). Failed tasks were: BirdCLEF, CREMA_DClustering, CommonLanguageAgeDetection, FleursT2ARetrieval, IEMOCAPGender, VoxCelebSA, VoxPopuliGenderClustering, VoxPopuliLanguageID.
| Family | Tasks | Mean primary score |
|---|---|---|
| Text retrieval | 11 | 0.437 |
| Text reranking | 1 | 0.314 |
| Text summarization | 1 | 0.308 |
| Text classification | 8 | 0.693 |
| Text clustering | 8 | 0.445 |
| Text STS / pair | 12 | 0.780 |
| Audio retrieval | 8 | 0.169 |
| Audio classification | 9 | 0.379 |
| Audio pair / reranking | 4 | 0.620 |
| Audio clustering | 1 | 0.009 |
### Full per-task scores
| Task | Family | Metric | Score | NDCG@10 | Recall@10 | Subsets |
|---|---|---|---|---|---|---|
| BeijingOpera | Audio classification | Main score | 0.868 | | | 1 |
| CREMA_D | Audio classification | Main score | 0.285 | | | 1 |
| FSD2019Kaggle | Audio classification | Main score | 0.548 | | | 2 |
| GTZANGenre | Audio classification | Main score | 0.742 | | | 1 |
| MInDS14 | Audio classification | Main score | 0.094 | | | 12 |
| MridinghamTonic | Audio classification | Main score | 0.350 | | | 1 |
| RavdessZeroshot | Audio classification | Main score | 0.197 | | | 1 |
| SIBFLEURS | Audio classification | Main score | 0.218 | | | 102 |
| SpeechCommandsZeroshotv0.02 | Audio classification | Main score | 0.111 | | | 1 |
| VehicleSoundClustering | Audio clustering | Main score | 0.009 | | | 1 |
| CREMADPairClassification | Audio pair / reranking | Main score | 0.528 | | | 1 |
| GTZANAudioReranking | Audio pair / reranking | NDCG@10 | 0.874 | 0.874 | 0.987 | 1 |
| NMSQAPairClassification | Audio pair / reranking | Main score | 0.547 | | | 1 |
| VoxPopuliAccentPairClassification | Audio pair / reranking | Main score | 0.529 | | | 1 |
| ClothoT2ARetrieval | Audio retrieval | NDCG@10 | 0.294 | 0.294 | 0.467 | 1 |
| CommonVoiceMini21T2ARetrieval | Audio retrieval | NDCG@10 | 0.023 | 0.023 | 0.052 | 117 |
| GigaSpeechT2ARetrieval | Audio retrieval | NDCG@10 | 0.002 | 0.002 | 0.004 | 1 |
| JamAltArtistA2ARetrieval | Audio retrieval | NDCG@10 | 0.873 | 0.873 | 0.183 | 4 |
| JamAltLyricA2TRetrieval | Audio retrieval | NDCG@10 | 0.008 | 0.008 | 0.013 | 4 |
| MACST2ARetrieval | Audio retrieval | NDCG@10 | 0.136 | 0.136 | 0.252 | 1 |
| SpokenSQuADT2ARetrieval | Audio retrieval | NDCG@10 | 0.010 | 0.010 | 0.020 | 1 |
| UrbanSound8KT2ARetrieval | Audio retrieval | NDCG@10 | 0.009 | 0.009 | 0.020 | 1 |
| BIOSSES | Text STS / pair | Main score | 0.803 | | | 1 |
| SICK-R | Text STS / pair | Main score | 0.746 | | | 1 |
| STS12 | Text STS / pair | Main score | 0.709 | | | 1 |
| STS13 | Text STS / pair | Main score | 0.783 | | | 1 |
| STS14 | Text STS / pair | Main score | 0.731 | | | 1 |
| STS15 | Text STS / pair | Main score | 0.828 | | | 1 |
| STS17 | Text STS / pair | Main score | 0.859 | | | 1 |
| STS22.v2 | Text STS / pair | Main score | 0.679 | | | 1 |
| STSBenchmark | Text STS / pair | Main score | 0.810 | | | 1 |
| SprintDuplicateQuestions | Text STS / pair | Main score | 0.957 | | | 1 |
| TwitterSemEval2015 | Text STS / pair | Main score | 0.620 | | | 1 |
| TwitterURLCorpus | Text STS / pair | Main score | 0.837 | | | 1 |
| AmazonCounterfactualClassification | Text classification | Main score | 0.681 | | | 1 |
| Banking77Classification | Text classification | Main score | 0.742 | | | 1 |
| ImdbClassification | Text classification | Main score | 0.722 | | | 1 |
| MTOPDomainClassification | Text classification | Main score | 0.896 | | | 1 |
| MassiveIntentClassification | Text classification | Main score | 0.623 | | | 1 |
| MassiveScenarioClassification | Text classification | Main score | 0.709 | | | 1 |
| ToxicConversationsClassification | Text classification | Main score | 0.623 | | | 1 |
| TweetSentimentExtractionClassification | Text classification | Main score | 0.545 | | | 1 |
| ArXivHierarchicalClusteringP2P | Text clustering | Main score | 0.549 | | | 1 |
| ArXivHierarchicalClusteringS2S | Text clustering | Main score | 0.523 | | | 1 |
| BiorxivClusteringP2P.v2 | Text clustering | Main score | 0.352 | | | 1 |
| MedrxivClusteringP2P.v2 | Text clustering | Main score | 0.352 | | | 1 |
| MedrxivClusteringS2S.v2 | Text clustering | Main score | 0.333 | | | 1 |
| StackExchangeClustering.v2 | Text clustering | Main score | 0.589 | | | 1 |
| StackExchangeClusteringP2P.v2 | Text clustering | Main score | 0.405 | | | 1 |
| TwentyNewsgroupsClustering.v2 | Text clustering | Main score | 0.457 | | | 1 |
| MindSmallReranking | Text reranking | Main score | 0.314 | 0.317 | 0.542 | 1 |
| ArguAna | Text retrieval | NDCG@10 | 0.546 | 0.546 | 0.826 | 1 |
| AskUbuntuDupQuestions | Text retrieval | NDCG@10 | 0.659 | 0.659 | 0.740 | 1 |
| CQADupstackGamingRetrieval | Text retrieval | NDCG@10 | 0.519 | 0.519 | 0.659 | 1 |
| CQADupstackUnixRetrieval | Text retrieval | NDCG@10 | 0.355 | 0.355 | 0.462 | 1 |
| ClimateFEVERHardNegatives | Text retrieval | NDCG@10 | 0.239 | 0.239 | 0.301 | 1 |
| FEVERHardNegatives | Text retrieval | NDCG@10 | 0.569 | 0.569 | 0.765 | 1 |
| FiQA2018 | Text retrieval | NDCG@10 | 0.292 | 0.292 | 0.361 | 1 |
| HotpotQAHardNegatives | Text retrieval | NDCG@10 | 0.553 | 0.553 | 0.605 | 1 |
| SCIDOCS | Text retrieval | NDCG@10 | 0.154 | 0.154 | 0.164 | 1 |
| TRECCOVID | Text retrieval | NDCG@10 | 0.513 | 0.513 | 0.014 | 1 |
| Touche2020Retrieval.v3 | Text retrieval | NDCG@10 | 0.414 | 0.414 | 0.173 | 1 |
| SummEvalSummarization.v2 | Text summarization | Main score | 0.308 | | | 1 |
## Usage

### Loading components

```python
from safetensors.torch import load_file

# Load the entire model
tensors = load_file("TE-86M.safetensors")

# Extract components by prefix
text_enc_sd = {k.removeprefix("text_encoder."): v for k, v in tensors.items() if k.startswith("text_encoder.")}
image_enc_sd = {k.removeprefix("image_encoder."): v for k, v in tensors.items() if k.startswith("image_encoder.")}
audio_enc_sd = {k.removeprefix("audio_encoder."): v for k, v in tensors.items() if k.startswith("audio_encoder.")}
image_proj_sd = {k.removeprefix("image_projection."): v for k, v in tensors.items() if k.startswith("image_projection.")}
audio_proj_sd = {k.removeprefix("audio_projection."): v for k, v in tensors.items() if k.startswith("audio_projection.")}
text_proj_sd = {k.removeprefix("text_projection."): v for k, v in tensors.items() if k.startswith("text_projection.")}
```
### Matryoshka truncation

```python
import torch.nn.functional as F

# Full 1280-dim embedding
embedding = model(inputs)  # (N, 1280)

# Truncate to 256-dim and re-normalize
embedding_256 = F.normalize(embedding[:, :256], dim=-1)
```
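Re-normalization after truncation is not optional: a 256-d prefix of a unit vector has norm below 1, so raw prefixes would distort cosine scores. A quick numpy check with random unit vectors illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.standard_normal((4, 1280))
full /= np.linalg.norm(full, axis=-1, keepdims=True)   # unit-length 1280-d embeddings

prefix = full[:, :256]                                 # Matryoshka truncation
prefix_norms = np.linalg.norm(prefix, axis=-1)         # strictly below 1
emb_256 = prefix / prefix_norms[:, None]               # re-normalize for cosine use
```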
## File layout

```
TE-86M.safetensors    # All components in one file (~330 MB)
```

### Tensor key prefixes

| Prefix | Component | Tensors |
|---|---|---|
| `text_encoder.*` | LEAF-IR (float32) | 103 |
| `image_encoder.*` | MobileNetV4-Medium | 462 |
| `audio_encoder.*` | EfficientAT mn20_as | 312 |
| `image_projection.*` | Depth-2 projection head | 14 |
| `audio_projection.*` | Depth-2 projection head | 14 |
| `text_projection.*` | Depth-2 projection head | 14 |
## Training
- Loss: InfoNCE (contrastive) with Matryoshka Representation Learning
- Data: ~2.2M synthetically generated trimodal triplets (WordNet) + 200K MSCOCO img+txt + 262K WavCaps aud+txt + 1.5M Nomic text pairs
- Hardware: 2x NVIDIA L4 GPUs
- Optimizer: AdamW, lr=1.41e-3, weight decay=1e-4, cosine scheduler
- Epochs: 50
- Batch size: 4096
- Dropout: 0.20 -> 0.25 (ep27) -> 0.30 (ep29) – mid-run regularization increases
- Text mixing: λ_tt=0.5 (ep1-9) -> 0.25 (ep10-50) – Nomic supervised text pairs
- Projection heads only – source encoders are frozen during training
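The objective above (InfoNCE combined with MRL) can be sketched as symmetric contrastive loss averaged over re-normalized Matryoshka prefixes. This is an illustrative reconstruction from the bullet list, not the project's training code; the temperature and equal prefix weighting are assumptions:

```python
import numpy as np

def logsumexp(x, axis=1):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce(a, b, temp=0.07):
    # Symmetric InfoNCE: matched (a_i, b_i) pairs are positives, every other
    # pairing in the batch is a negative.
    logits = (a @ b.T) / temp                                   # (N, N) similarities
    loss_ab = -np.mean(np.diag(logits - logsumexp(logits, axis=1)))
    loss_ba = -np.mean(np.diag(logits.T - logsumexp(logits.T, axis=1)))
    return 0.5 * (loss_ab + loss_ba)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mrl_info_nce(za, zb, dims=(1280, 768, 512, 256, 128)):
    # Average the contrastive loss over re-normalized Matryoshka prefixes,
    # so every truncation level is trained to be retrieval-ready.
    return float(np.mean([info_nce(l2n(za[:, :d]), l2n(zb[:, :d])) for d in dims]))

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 1280))
txt = img + 0.1 * rng.standard_normal((8, 1280))   # loosely aligned image/text pairs
loss = mrl_info_nce(img, txt)                      # small positive value
```

In training, one such term would be computed per modality pair in the batch, with the text-text pairs weighted by λ_tt.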
## Improvements over TE-75M
| Change | TE-75M | TE-86M |
|---|---|---|
| Projection depth | 1 (single residual block) | 2 (two residual blocks) |
| Head params | 26.1M | 37.2M |
| Total params | 75.2M | 86.1M |
| SALT I->T R@1 | 0.615 | 0.618 (+0.5%) |
| SALT T->I R@1 | 0.614 | 0.630 (+2.6%) |
| MSCOCO I->T R@1 | 0.320 | 0.343 (+7.2%) |
| Clotho A->T R@1 | 0.208 | 0.219 (+5.3%) |
| ESC-50 | 93.2% | 93.9% (+0.7%) |
## Design decisions

- Depth-2 residual heads: Ablation confirmed depth-1 hits an I->T ceiling at ~0.60 regardless of dropout or λ_tt. Depth-2 provides the capacity to serve cross-modal and text retrieval simultaneously.
- 3-head shared space: All modalities project into a learned 1280-dim space (image-native dimension)
- LEAF-IR text encoder: 23M-param retrieval-optimized text encoder enables fully edge-deployable text inference
- Frozen source encoders: MobileNetV4, EfficientAT, and LEAF-IR are kept frozen; only projection heads are trained
- Edge-first: All source encoders can run on devices like Raspberry Pi 5
## Limitations
- Audio retrieval lags behind specialist models like CLAP on audio-only benchmarks
- Image-text retrieval trails models with larger vision encoders; the accuracy gap is the price of edge deployability
- Text retrieval recovers 62-92% of raw LEAF-IR quality (gap is domain-dependent)
## Links
- Website: augmem.ai
- GitHub: github.com/augmem
## License
Apache 2.0