TE-86M: Trimodal Embeddings (Depth-2)

TE-86M maps image, audio, and text into a shared 1280-dim embedding space, enabling cross-modal retrieval with a single vector index. Embeddings support full Matryoshka truncation down to 128 dims.

Built for edge deployment: the entire model runs on a Raspberry Pi 5.

Successor to TE-75M, with depth-2 residual projection heads that break through the cross-modal retrieval ceiling of depth-1 architectures while maintaining text retrieval quality.

Also available in GGUF format for quantized edge deployment.

Architecture

TE-86M uses lightweight edge encoders with depth-2 residual projection heads that expand through a 1920-dim hidden layer before projecting into a shared 1280-dim embedding space:

Text  --> LEAF-IR (768-d) -----------> DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280)
Image --> MobileNetV4-Medium (1280-d) --> DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280)
Audio --> EfficientAT mn20_as (1920-d) --> DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280)

All outputs are L2-normalized into the shared 1280-dim space for cross-modal cosine similarity.
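Because every embedding is unit-norm in the same 1280-dim space, cross-modal retrieval reduces to a dot product. A minimal sketch, with random placeholder tensors standing in for real encoder outputs:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for projected encoder outputs.
torch.manual_seed(0)
text_emb = F.normalize(torch.randn(4, 1280), dim=-1)    # 4 text queries
image_emb = F.normalize(torch.randn(10, 1280), dim=-1)  # 10 candidate images

# With L2-normalized vectors, cosine similarity is just a matrix product.
sim = text_emb @ image_emb.T   # (4, 10) similarity matrix
top1 = sim.argmax(dim=-1)      # best-matching image index per text query
```

The same dot-product search works for any modality pair (text->audio, image->audio, etc.), which is what allows a single vector index to serve all directions.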

| Component | Architecture | Params | Size |
|---|---|---|---|
| Text encoder | LEAF-IR (MongoDB/mdbr-leaf-ir) | 22.7M | 87.2 MB |
| Image encoder | MobileNetV4-Medium (timm) | 8.4M | 32.4 MB |
| Audio encoder | EfficientAT mn20_as | 17.9M | 68.5 MB |
| Image projection | DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280) | 12.3M | 47.0 MB |
| Audio projection | DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280) | 13.5M | 51.7 MB |
| Text projection | DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280) | 11.3M | 43.2 MB |
| Total | | 86.1M | 329.9 MB |

Projection head detail

Each DeepProjectionHead-d2 is a depth-2 residual MLP with Matryoshka-aware training:

Linear(encoder_dim, 1920) -> GELU -> LayerNorm -> Dropout(0.3)
  -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
  -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
  -> Linear(1920, 1280)
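The stack above can be sketched as a small PyTorch module. The class name and exact residual placement are inferred from the listing; this is a sketch, not the released training code:

```python
import torch
import torch.nn as nn

class DeepProjectionHeadD2(nn.Module):
    """Depth-2 residual MLP sketch: encoder_dim -> 1920 -> 1920 -> out_dim."""

    def __init__(self, encoder_dim: int, hidden: int = 1920,
                 out_dim: int = 1280, dropout: float = 0.3):
        super().__init__()
        # Expansion stem into the 1920-dim hidden width.
        self.stem = nn.Sequential(
            nn.Linear(encoder_dim, hidden), nn.GELU(),
            nn.LayerNorm(hidden), nn.Dropout(dropout))
        # Two residual blocks at the hidden width (the "depth-2" part).
        self.block1 = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.LayerNorm(hidden), nn.Dropout(dropout))
        self.block2 = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.LayerNorm(hidden), nn.Dropout(dropout))
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.stem(x)
        h = h + self.block1(h)  # + residual
        h = h + self.block2(h)  # + residual
        return self.out(h)

head = DeepProjectionHeadD2(768)   # text head: 768 -> 1920 -> 1920 -> 1280
z = head(torch.randn(2, 768))      # (2, 1280)
```

As a sanity check, this sketch has roughly 11.3M parameters for encoder_dim=768, consistent with the text projection row in the component table.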

Why depth-2?

Ablation experiments showed depth-1 heads hit an I->T retrieval ceiling at ~0.60 R@1 regardless of hyperparameter tuning. Depth-2 heads broke through to 0.618, providing the representational capacity to serve cross-modal and text retrieval simultaneously. The extra 11M params (75M -> 86M) keep the model edge-viable.

Matryoshka dimensions

Embeddings can be truncated to [1280, 768, 512, 256, 128] dimensions while preserving retrieval quality; the model is trained with Matryoshka Representation Learning (MRL).

Benchmarks

SALT retrieval benchmarks use 5K trimodal samples. Full MTEB / MAEB evaluation used the 768-d Matryoshka truncation.

Cross-modal retrieval: SALT (5K trimodal samples)

| Direction | TE-86M (86M) | TE-75M (75M) | ImageBind (1.2B) | EBind (1.78B*) |
|---|---|---|---|---|
| Image -> Text R@1 | 0.618 | 0.615 | 0.736 | 0.783 |
| Text -> Image R@1 | 0.630 | 0.614 | 0.712 | 0.779 |
| Text -> Audio R@1 | 0.108 | 0.103 | 0.038 | 0.047 |
| Audio -> Text R@1 | 0.087 | 0.082 | 0.039 | 0.035 |
| Image -> Audio R@1 | 0.068 | 0.062 | 0.023 | 0.027 |
| Audio -> Image R@1 | 0.070 | 0.063 | 0.025 | 0.032 |

Audio retrieval: AudioCaps & Clotho

| Benchmark | Direction | TE-86M | TE-75M | CLAP-Large | ImageBind | EBind |
|---|---|---|---|---|---|---|
| AudioCaps | A->T R@1 | 0.229 | 0.210 | 0.420 | 0.116 | 0.225 |
| AudioCaps | T->A R@1 | 0.156 | 0.148 | 0.280 | 0.080 | 0.219 |
| Clotho | A->T R@1 | 0.219 | 0.208 | 0.195 | 0.061 | 0.088 |
| Clotho | T->A R@1 | 0.177 | 0.172 | 0.167 | 0.074 | 0.118 |

Image-text retrieval: MSCOCO & Flickr30k

| Benchmark | Direction | TE-86M (86M) | TE-75M (75M) | EBind (1.78B*) | ImageBind (1.2B) |
|---|---|---|---|---|---|
| Flickr30k | I->T R@1 | 0.494 | 0.478 | 0.951 | 0.918 |
| Flickr30k | T->I R@1 | 0.332 | 0.303 | 0.853 | 0.766 |
| MSCOCO 5K | I->T R@1 | 0.343 | 0.320 | 0.743 | 0.658 |
| MSCOCO 5K | T->I R@1 | 0.225 | 0.208 | 0.559 | 0.490 |

Zero-shot classification: ESC-50

| Model | Params | Accuracy |
|---|---|---|
| TE-86M | 86M | 93.9% |
| CLAP-Large | 67.8M | 90.5% |
| TE-75M | 75M | 93.2% |
| EBind | 1.78B* | 77.0% |
| ImageBind | 1.2B | 66.4% |

Text retrieval: MTEB (NDCG@10)

Text-text retrieval quality in the shared embedding space, measured on MTEB retrieval tasks:

| Task | TE-86M | TE-75M | Raw LEAF-IR | Recovery |
|---|---|---|---|---|
| ArguAna | 0.545 | 0.544 | 0.594 | 92% |
| CQADupstackGaming | 0.515 | 0.506 | 0.607 | 85% |
| CQADupstackUnix | 0.334 | 0.355 | 0.428 | 78% |
| FEVERHardNegatives | 0.561 | 0.551 | 0.863 | 65% |
| HotpotQAHardNegatives | 0.554 | 0.531 | 0.700 | 79% |
| FiQA2018 | 0.291 | 0.292 | 0.392 | 74% |
| ClimateFEVER | 0.231 | 0.215 | 0.353 | 65% |
| SCIDOCS | 0.154 | 0.153 | 0.198 | 78% |
| TRECCOVID | 0.507 | 0.474 | 0.820 | 62% |

TE-86M improves MTEB text retrieval over TE-75M on 7/9 tasks. The depth-2 projection heads recover 62-92% of raw LEAF-IR's retrieval quality while mapping into the cross-modal shared space.

Full MTEB / MAEB kitchen-sink evaluation

Full evaluation used the 768-d Matryoshka truncation of the TE-86M checkpoint (checkpoints/trimodal_3head_h1920_d2_v2_d30_tt025/best_model.pt) with MTEB 2.12.15. The run completed 63 of the 71 requested tasks (MTEB(eng, v2) and MAEB). The failed tasks were: BirdCLEF, CREMA_DClustering, CommonLanguageAgeDetection, FleursT2ARetrieval, IEMOCAPGender, VoxCelebSA, VoxPopuliGenderClustering, VoxPopuliLanguageID.

| Family | Tasks | Mean primary score |
|---|---|---|
| Text retrieval | 11 | 0.437 |
| Text reranking | 1 | 0.314 |
| Text summarization | 1 | 0.308 |
| Text classification | 8 | 0.693 |
| Text clustering | 8 | 0.445 |
| Text STS / pair | 12 | 0.780 |
| Audio retrieval | 8 | 0.169 |
| Audio classification | 9 | 0.379 |
| Audio pair / reranking | 4 | 0.620 |
| Audio clustering | 1 | 0.009 |
Full per-task scores

| Task | Family | Metric | Score | NDCG@10 | Recall@10 | Subsets |
|---|---|---|---|---|---|---|
| BeijingOpera | Audio classification | Main score | 0.868 | | | 1 |
| CREMA_D | Audio classification | Main score | 0.285 | | | 1 |
| FSD2019Kaggle | Audio classification | Main score | 0.548 | | | 2 |
| GTZANGenre | Audio classification | Main score | 0.742 | | | 1 |
| MInDS14 | Audio classification | Main score | 0.094 | | | 12 |
| MridinghamTonic | Audio classification | Main score | 0.350 | | | 1 |
| RavdessZeroshot | Audio classification | Main score | 0.197 | | | 1 |
| SIBFLEURS | Audio classification | Main score | 0.218 | | | 102 |
| SpeechCommandsZeroshotv0.02 | Audio classification | Main score | 0.111 | | | 1 |
| VehicleSoundClustering | Audio clustering | Main score | 0.009 | | | 1 |
| CREMADPairClassification | Audio pair / reranking | Main score | 0.528 | | | 1 |
| GTZANAudioReranking | Audio pair / reranking | NDCG@10 | 0.874 | 0.874 | 0.987 | 1 |
| NMSQAPairClassification | Audio pair / reranking | Main score | 0.547 | | | 1 |
| VoxPopuliAccentPairClassification | Audio pair / reranking | Main score | 0.529 | | | 1 |
| ClothoT2ARetrieval | Audio retrieval | NDCG@10 | 0.294 | 0.294 | 0.467 | 1 |
| CommonVoiceMini21T2ARetrieval | Audio retrieval | NDCG@10 | 0.023 | 0.023 | 0.052 | 117 |
| GigaSpeechT2ARetrieval | Audio retrieval | NDCG@10 | 0.002 | 0.002 | 0.004 | 1 |
| JamAltArtistA2ARetrieval | Audio retrieval | NDCG@10 | 0.873 | 0.873 | 0.183 | 4 |
| JamAltLyricA2TRetrieval | Audio retrieval | NDCG@10 | 0.008 | 0.008 | 0.013 | 4 |
| MACST2ARetrieval | Audio retrieval | NDCG@10 | 0.136 | 0.136 | 0.252 | 1 |
| SpokenSQuADT2ARetrieval | Audio retrieval | NDCG@10 | 0.010 | 0.010 | 0.020 | 1 |
| UrbanSound8KT2ARetrieval | Audio retrieval | NDCG@10 | 0.009 | 0.009 | 0.020 | 1 |
| BIOSSES | Text STS / pair | Main score | 0.803 | | | 1 |
| SICK-R | Text STS / pair | Main score | 0.746 | | | 1 |
| STS12 | Text STS / pair | Main score | 0.709 | | | 1 |
| STS13 | Text STS / pair | Main score | 0.783 | | | 1 |
| STS14 | Text STS / pair | Main score | 0.731 | | | 1 |
| STS15 | Text STS / pair | Main score | 0.828 | | | 1 |
| STS17 | Text STS / pair | Main score | 0.859 | | | 1 |
| STS22.v2 | Text STS / pair | Main score | 0.679 | | | 1 |
| STSBenchmark | Text STS / pair | Main score | 0.810 | | | 1 |
| SprintDuplicateQuestions | Text STS / pair | Main score | 0.957 | | | 1 |
| TwitterSemEval2015 | Text STS / pair | Main score | 0.620 | | | 1 |
| TwitterURLCorpus | Text STS / pair | Main score | 0.837 | | | 1 |
| AmazonCounterfactualClassification | Text classification | Main score | 0.681 | | | 1 |
| Banking77Classification | Text classification | Main score | 0.742 | | | 1 |
| ImdbClassification | Text classification | Main score | 0.722 | | | 1 |
| MTOPDomainClassification | Text classification | Main score | 0.896 | | | 1 |
| MassiveIntentClassification | Text classification | Main score | 0.623 | | | 1 |
| MassiveScenarioClassification | Text classification | Main score | 0.709 | | | 1 |
| ToxicConversationsClassification | Text classification | Main score | 0.623 | | | 1 |
| TweetSentimentExtractionClassification | Text classification | Main score | 0.545 | | | 1 |
| ArXivHierarchicalClusteringP2P | Text clustering | Main score | 0.549 | | | 1 |
| ArXivHierarchicalClusteringS2S | Text clustering | Main score | 0.523 | | | 1 |
| BiorxivClusteringP2P.v2 | Text clustering | Main score | 0.352 | | | 1 |
| MedrxivClusteringP2P.v2 | Text clustering | Main score | 0.352 | | | 1 |
| MedrxivClusteringS2S.v2 | Text clustering | Main score | 0.333 | | | 1 |
| StackExchangeClustering.v2 | Text clustering | Main score | 0.589 | | | 1 |
| StackExchangeClusteringP2P.v2 | Text clustering | Main score | 0.405 | | | 1 |
| TwentyNewsgroupsClustering.v2 | Text clustering | Main score | 0.457 | | | 1 |
| MindSmallReranking | Text reranking | Main score | 0.314 | 0.317 | 0.542 | 1 |
| ArguAna | Text retrieval | NDCG@10 | 0.546 | 0.546 | 0.826 | 1 |
| AskUbuntuDupQuestions | Text retrieval | NDCG@10 | 0.659 | 0.659 | 0.740 | 1 |
| CQADupstackGamingRetrieval | Text retrieval | NDCG@10 | 0.519 | 0.519 | 0.659 | 1 |
| CQADupstackUnixRetrieval | Text retrieval | NDCG@10 | 0.355 | 0.355 | 0.462 | 1 |
| ClimateFEVERHardNegatives | Text retrieval | NDCG@10 | 0.239 | 0.239 | 0.301 | 1 |
| FEVERHardNegatives | Text retrieval | NDCG@10 | 0.569 | 0.569 | 0.765 | 1 |
| FiQA2018 | Text retrieval | NDCG@10 | 0.292 | 0.292 | 0.361 | 1 |
| HotpotQAHardNegatives | Text retrieval | NDCG@10 | 0.553 | 0.553 | 0.605 | 1 |
| SCIDOCS | Text retrieval | NDCG@10 | 0.154 | 0.154 | 0.164 | 1 |
| TRECCOVID | Text retrieval | NDCG@10 | 0.513 | 0.513 | 0.014 | 1 |
| Touche2020Retrieval.v3 | Text retrieval | NDCG@10 | 0.414 | 0.414 | 0.173 | 1 |
| SummEvalSummarization.v2 | Text summarization | Main score | 0.308 | | | 1 |

Usage

Loading components

from safetensors.torch import load_file

# Load entire model
tensors = load_file("TE-86M.safetensors")

# Extract components by prefix
text_enc_sd = {k.removeprefix("text_encoder."): v for k, v in tensors.items() if k.startswith("text_encoder.")}
image_enc_sd = {k.removeprefix("image_encoder."): v for k, v in tensors.items() if k.startswith("image_encoder.")}
audio_enc_sd = {k.removeprefix("audio_encoder."): v for k, v in tensors.items() if k.startswith("audio_encoder.")}
image_proj_sd = {k.removeprefix("image_projection."): v for k, v in tensors.items() if k.startswith("image_projection.")}
audio_proj_sd = {k.removeprefix("audio_projection."): v for k, v in tensors.items() if k.startswith("audio_projection.")}
text_proj_sd = {k.removeprefix("text_projection."): v for k, v in tensors.items() if k.startswith("text_projection.")}

Matryoshka truncation

import torch.nn.functional as F

# Full 1280-dim embedding
embedding = model(input)  # (N, 1280)

# Truncate to 256-dim and re-normalize
embedding_256 = F.normalize(embedding[:, :256], dim=-1)

File layout

TE-86M.safetensors     # All components in one file (~330 MB)

Tensor key prefixes

| Prefix | Component | Tensors |
|---|---|---|
| text_encoder.* | LEAF-IR (float32) | 103 |
| image_encoder.* | MobileNetV4-Medium | 462 |
| audio_encoder.* | EfficientAT mn20_as | 312 |
| image_projection.* | Depth-2 projection head | 14 |
| audio_projection.* | Depth-2 projection head | 14 |
| text_projection.* | Depth-2 projection head | 14 |

Training

  • Loss: InfoNCE (contrastive) with Matryoshka Representation Learning
  • Data: ~2.2M synthetically generated trimodal triplets (WordNet) + 200K MSCOCO img+txt + 262K WavCaps aud+txt + 1.5M Nomic text pairs
  • Hardware: 2x NVIDIA L4 GPUs
  • Optimizer: AdamW, lr=1.41e-3, weight decay=1e-4, cosine scheduler
  • Epochs: 50
  • Batch size: 4096
  • Dropout: 0.20 -> 0.25 (ep27) -> 0.30 (ep29); regularization was increased mid-run
  • Text mixing: λ_tt=0.5 (ep1-9) -> 0.25 (ep10-50); weight on the Nomic supervised text pairs
  • Projection heads only: source encoders are frozen during training
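The loss bullet above (InfoNCE with Matryoshka Representation Learning) can be sketched as follows. The symmetric averaging, uniform per-dimension weighting, and temperature value are assumptions, not the released training code:

```python
import torch
import torch.nn.functional as F

def info_nce_mrl(a: torch.Tensor, b: torch.Tensor,
                 dims=(1280, 768, 512, 256, 128),
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE averaged over Matryoshka prefix dimensions.

    a, b: (N, 1280) paired embeddings from two modalities, where row i
    of `a` matches row i of `b` (the in-batch negatives are all j != i).
    """
    target = torch.arange(a.size(0), device=a.device)
    losses = []
    for d in dims:
        # Truncate to the prefix and re-normalize, as MRL prescribes.
        za = F.normalize(a[:, :d], dim=-1)
        zb = F.normalize(b[:, :d], dim=-1)
        logits = za @ zb.T / temperature
        # Contrast in both directions (a->b and b->a).
        losses.append(0.5 * (F.cross_entropy(logits, target)
                             + F.cross_entropy(logits.T, target)))
    return torch.stack(losses).mean()

loss = info_nce_mrl(torch.randn(8, 1280), torch.randn(8, 1280))
```

Averaging the contrastive loss over every truncation dimension is what makes the 128-d to 1280-d prefixes all usable at inference time.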

Improvements over TE-75M

| Change | TE-75M | TE-86M |
|---|---|---|
| Projection depth | 1 (single residual block) | 2 (two residual blocks) |
| Head params | 26.1M | 37.2M |
| Total params | 75.2M | 86.1M |
| SALT I->T R@1 | 0.615 | 0.618 (+0.5%) |
| SALT T->I R@1 | 0.614 | 0.630 (+2.6%) |
| MSCOCO I->T R@1 | 0.320 | 0.343 (+7.2%) |
| Clotho A->T R@1 | 0.208 | 0.219 (+5.3%) |
| ESC-50 | 93.2% | 93.9% (+0.7%) |

Design decisions

  • Depth-2 residual heads: ablations confirmed depth-1 hits an I->T ceiling at ~0.60 regardless of dropout or λ_tt. Depth-2 provides the capacity to serve cross-modal and text retrieval simultaneously.
  • 3-head shared space: All modalities project into a learned 1280-dim space (image-native dimension)
  • LEAF-IR text encoder: 23M-param retrieval-optimized text encoder enables fully edge-deployable text inference
  • Frozen source encoders: MobileNetV4, EfficientAT, and LEAF-IR are kept frozen; only projection heads are trained
  • Edge-first: All source encoders can run on devices like Raspberry Pi 5

Limitations

  • Audio retrieval lags behind specialist models like CLAP on audio-only benchmarks
  • Image-text retrieval accuracy trails larger vision encoders; this is the trade-off for edge deployability
  • Text retrieval recovers 62-92% of raw LEAF-IR quality (gap is domain-dependent)

License

Apache 2.0
