# TE-86M – Trimodal Embeddings (Depth-2)

TE-86M maps image, audio, and text into a shared 1280-dim embedding space, enabling cross-modal retrieval with a single vector index. All embeddings support Matryoshka truncation down to 128 dims.

Built for edge deployment – the entire model runs on a Raspberry Pi 5.
Successor to TE-75M, with depth-2 residual projection heads that break through the cross-modal retrieval ceiling of depth-1 architectures while maintaining text retrieval quality.
Also available in GGUF format for quantized edge deployment.
## Architecture

TE-86M uses lightweight edge encoders with depth-2 residual projection heads that expand through a 1920-dim hidden layer before projecting into the shared 1280-dim embedding space:

```
Text  --> LEAF-IR (768-d) --------------> DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280)
Image --> MobileNetV4-Medium (1280-d) --> DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280)
Audio --> EfficientAT mn20_as (1920-d) -> DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280)
```
All outputs are L2-normalized into the shared 1280-dim space for cross-modal cosine similarity.
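Because every output is unit-length, cross-modal cosine similarity reduces to a single dot product, so one matrix multiply ranks the whole index. A minimal retrieval sketch with random placeholder vectors standing in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Project each row onto the unit sphere, as the model does for every modality.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for projected embeddings (real ones come from the encoders + heads).
image_embs = l2_normalize(rng.standard_normal((100, 1280)))  # indexed images
text_query = l2_normalize(rng.standard_normal((1, 1280)))    # one text query

# For unit vectors, cosine similarity == dot product: one matmul scores the index.
scores = text_query @ image_embs.T           # (1, 100)
top5 = np.argsort(-scores[0])[:5]            # best-matching image indices
```

The same code works for any modality pair, since all six directions share the one space.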
| Component | Architecture | Params | Size |
|---|---|---|---|
| Text encoder | LEAF-IR (MongoDB/mdbr-leaf-ir) | 22.7M | 87.2 MB |
| Image encoder | MobileNetV4-Medium (timm) | 8.4M | 32.4 MB |
| Audio encoder | EfficientAT mn20_as | 17.9M | 68.5 MB |
| Image projection | DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280) | 12.3M | 47.0 MB |
| Audio projection | DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280) | 13.5M | 51.7 MB |
| Text projection | DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280) | 11.3M | 43.2 MB |
| Total | | 86.1M | 329.9 MB |
### Projection head detail

Each DeepProjectionHead-d2 is a depth-2 residual MLP with Matryoshka-aware training:

```
Linear(encoder_dim, 1920) -> GELU -> LayerNorm -> Dropout(0.3)
  -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
  -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
  -> Linear(1920, 1280)
```
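The forward pass above can be sketched in plain numpy (inference mode, so dropout is a no-op; LayerNorm's learnable affine parameters are omitted for brevity, and the weights below are random placeholders, not the released checkpoint):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, w, b):
    # One residual unit: Linear -> GELU -> LayerNorm, plus skip connection.
    return layer_norm(gelu(x @ w + b)) + x

def deep_projection_head_d2(x, p):
    h = layer_norm(gelu(x @ p["w_in"] + p["b_in"]))   # Linear(encoder_dim, 1920)
    h = block(h, p["w1"], p["b1"])                    # residual block 1
    h = block(h, p["w2"], p["b2"])                    # residual block 2
    z = h @ p["w_out"] + p["b_out"]                   # Linear(1920, 1280)
    # Final L2-normalization into the shared space (applied after the head).
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
enc_dim, hidden, out = 768, 1920, 1280  # text-head dimensions
params = {
    "w_in": rng.standard_normal((enc_dim, hidden)) * 0.02, "b_in": np.zeros(hidden),
    "w1": rng.standard_normal((hidden, hidden)) * 0.02, "b1": np.zeros(hidden),
    "w2": rng.standard_normal((hidden, hidden)) * 0.02, "b2": np.zeros(hidden),
    "w_out": rng.standard_normal((hidden, out)) * 0.02, "b_out": np.zeros(out),
}
emb = deep_projection_head_d2(rng.standard_normal((2, enc_dim)), params)  # (2, 1280)
```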
### Why depth-2?
Ablation experiments showed depth-1 heads hit an I->T retrieval ceiling at ~0.60 R@1 regardless of hyperparameter tuning. Depth-2 heads broke through to 0.618, providing the representational capacity to serve cross-modal AND text retrieval simultaneously. The extra 11M params (75M -> 86M) remain edge-viable.
### Matryoshka dimensions

Embeddings can be truncated to [1280, 768, 512, 256, 128] dimensions while preserving retrieval quality – the model is trained with Matryoshka Representation Learning (MRL).
## Benchmarks

SALT retrieval benchmarks use 5K trimodal samples. The full MTEB / MAEB evaluation used the 768-d Matryoshka truncation.

### Cross-modal retrieval – SALT (5K trimodal samples)
| Direction | TE-86M (86M) | TE-75M (75M) | ImageBind (1.2B) | EBind (1.78B*) |
|---|---|---|---|---|
| Image -> Text R@1 | 0.618 | 0.615 | 0.736 | 0.783 |
| Text -> Image R@1 | 0.630 | 0.614 | 0.712 | 0.779 |
| Text -> Audio R@1 | 0.108 | 0.103 | 0.038 | 0.047 |
| Audio -> Text R@1 | 0.087 | 0.082 | 0.039 | 0.035 |
| Image -> Audio R@1 | 0.068 | 0.062 | 0.023 | 0.027 |
| Audio -> Image R@1 | 0.070 | 0.063 | 0.025 | 0.032 |
### Audio retrieval – AudioCaps & Clotho
| Benchmark | Direction | TE-86M | TE-75M | CLAP-Large | ImageBind | EBind |
|---|---|---|---|---|---|---|
| AudioCaps | A->T R@1 | 0.229 | 0.210 | 0.420 | 0.116 | 0.225 |
| AudioCaps | T->A R@1 | 0.156 | 0.148 | 0.280 | 0.080 | 0.219 |
| Clotho | A->T R@1 | 0.219 | 0.208 | 0.195 | 0.061 | 0.088 |
| Clotho | T->A R@1 | 0.177 | 0.172 | 0.167 | 0.074 | 0.118 |
### Image-text retrieval – MSCOCO & Flickr30k
| Benchmark | Direction | TE-86M (86M) | TE-75M (75M) | EBind (1.78B*) | ImageBind (1.2B) |
|---|---|---|---|---|---|
| Flickr30k | I->T R@1 | 0.494 | 0.478 | 0.951 | 0.918 |
| Flickr30k | T->I R@1 | 0.332 | 0.303 | 0.853 | 0.766 |
| MSCOCO 5K | I->T R@1 | 0.343 | 0.320 | 0.743 | 0.658 |
| MSCOCO 5K | T->I R@1 | 0.225 | 0.208 | 0.559 | 0.490 |
### Zero-shot classification – ESC-50
| Model | Params | Accuracy |
|---|---|---|
| TE-86M | 86M | 93.9% |
| CLAP-Large | 67.8M | 90.5% |
| TE-75M | 75M | 93.2% |
| EBind | 1.78B* | 77.0% |
| ImageBind | 1.2B | 66.4% |
### Text retrieval – MTEB (NDCG@10)
Text-text retrieval quality in the shared embedding space, measured on MTEB retrieval tasks:
| Task | TE-86M | TE-75M | Raw LEAF-IR | Recovery |
|---|---|---|---|---|
| ArguAna | 0.545 | 0.544 | 0.594 | 92% |
| CQADupstackGaming | 0.515 | 0.506 | 0.607 | 85% |
| CQADupstackUnix | 0.334 | 0.355 | 0.428 | 78% |
| FEVERHardNegatives | 0.561 | 0.551 | 0.863 | 65% |
| HotpotQAHardNegatives | 0.554 | 0.531 | 0.700 | 79% |
| FiQA2018 | 0.291 | 0.292 | 0.392 | 74% |
| ClimateFEVER | 0.231 | 0.215 | 0.353 | 65% |
| SCIDOCS | 0.154 | 0.153 | 0.198 | 78% |
| TRECCOVID | 0.507 | 0.474 | 0.820 | 62% |
TE-86M improves MTEB text retrieval over TE-75M on 7/9 tasks. The depth-2 projection heads recover 62-92% of raw LEAF-IR's retrieval quality while mapping into the cross-modal shared space.
## Full MTEB / MAEB kitchen-sink evaluation

Full evaluation used the 768-d Matryoshka truncation of the TE-86M checkpoint `checkpoints/trimodal_3head_h1920_d2_v2_d30_tt025/best_model.pt` with MTEB 2.12.15.
The run completed 63/71 requested tasks (MTEB(eng, v2) and MAEB). Failed tasks were: BirdCLEF, CREMA_DClustering, CommonLanguageAgeDetection, FleursT2ARetrieval, IEMOCAPGender, VoxCelebSA, VoxPopuliGenderClustering, VoxPopuliLanguageID.
| Family | Tasks | Mean primary score |
|---|---|---|
| Text retrieval | 11 | 0.437 |
| Text reranking | 1 | 0.314 |
| Text summarization | 1 | 0.308 |
| Text classification | 8 | 0.693 |
| Text clustering | 8 | 0.445 |
| Text STS / pair | 12 | 0.780 |
| Audio retrieval | 8 | 0.169 |
| Audio classification | 9 | 0.379 |
| Audio pair / reranking | 4 | 0.620 |
| Audio clustering | 1 | 0.009 |
### Full per-task scores
| Task | Family | Metric | Score | NDCG@10 | Recall@10 | Subsets |
|---|---|---|---|---|---|---|
| BeijingOpera | Audio classification | Main score | 0.868 | | | 1 |
| CREMA_D | Audio classification | Main score | 0.285 | | | 1 |
| FSD2019Kaggle | Audio classification | Main score | 0.548 | | | 2 |
| GTZANGenre | Audio classification | Main score | 0.742 | | | 1 |
| MInDS14 | Audio classification | Main score | 0.094 | | | 12 |
| MridinghamTonic | Audio classification | Main score | 0.350 | | | 1 |
| RavdessZeroshot | Audio classification | Main score | 0.197 | | | 1 |
| SIBFLEURS | Audio classification | Main score | 0.218 | | | 102 |
| SpeechCommandsZeroshotv0.02 | Audio classification | Main score | 0.111 | | | 1 |
| VehicleSoundClustering | Audio clustering | Main score | 0.009 | | | 1 |
| CREMADPairClassification | Audio pair / reranking | Main score | 0.528 | | | 1 |
| GTZANAudioReranking | Audio pair / reranking | NDCG@10 | 0.874 | 0.874 | 0.987 | 1 |
| NMSQAPairClassification | Audio pair / reranking | Main score | 0.547 | | | 1 |
| VoxPopuliAccentPairClassification | Audio pair / reranking | Main score | 0.529 | | | 1 |
| ClothoT2ARetrieval | Audio retrieval | NDCG@10 | 0.294 | 0.294 | 0.467 | 1 |
| CommonVoiceMini21T2ARetrieval | Audio retrieval | NDCG@10 | 0.023 | 0.023 | 0.052 | 117 |
| GigaSpeechT2ARetrieval | Audio retrieval | NDCG@10 | 0.002 | 0.002 | 0.004 | 1 |
| JamAltArtistA2ARetrieval | Audio retrieval | NDCG@10 | 0.873 | 0.873 | 0.183 | 4 |
| JamAltLyricA2TRetrieval | Audio retrieval | NDCG@10 | 0.008 | 0.008 | 0.013 | 4 |
| MACST2ARetrieval | Audio retrieval | NDCG@10 | 0.136 | 0.136 | 0.252 | 1 |
| SpokenSQuADT2ARetrieval | Audio retrieval | NDCG@10 | 0.010 | 0.010 | 0.020 | 1 |
| UrbanSound8KT2ARetrieval | Audio retrieval | NDCG@10 | 0.009 | 0.009 | 0.020 | 1 |
| BIOSSES | Text STS / pair | Main score | 0.803 | | | 1 |
| SICK-R | Text STS / pair | Main score | 0.746 | | | 1 |
| STS12 | Text STS / pair | Main score | 0.709 | | | 1 |
| STS13 | Text STS / pair | Main score | 0.783 | | | 1 |
| STS14 | Text STS / pair | Main score | 0.731 | | | 1 |
| STS15 | Text STS / pair | Main score | 0.828 | | | 1 |
| STS17 | Text STS / pair | Main score | 0.859 | | | 1 |
| STS22.v2 | Text STS / pair | Main score | 0.679 | | | 1 |
| STSBenchmark | Text STS / pair | Main score | 0.810 | | | 1 |
| SprintDuplicateQuestions | Text STS / pair | Main score | 0.957 | | | 1 |
| TwitterSemEval2015 | Text STS / pair | Main score | 0.620 | | | 1 |
| TwitterURLCorpus | Text STS / pair | Main score | 0.837 | | | 1 |
| AmazonCounterfactualClassification | Text classification | Main score | 0.681 | | | 1 |
| Banking77Classification | Text classification | Main score | 0.742 | | | 1 |
| ImdbClassification | Text classification | Main score | 0.722 | | | 1 |
| MTOPDomainClassification | Text classification | Main score | 0.896 | | | 1 |
| MassiveIntentClassification | Text classification | Main score | 0.623 | | | 1 |
| MassiveScenarioClassification | Text classification | Main score | 0.709 | | | 1 |
| ToxicConversationsClassification | Text classification | Main score | 0.623 | | | 1 |
| TweetSentimentExtractionClassification | Text classification | Main score | 0.545 | | | 1 |
| ArXivHierarchicalClusteringP2P | Text clustering | Main score | 0.549 | | | 1 |
| ArXivHierarchicalClusteringS2S | Text clustering | Main score | 0.523 | | | 1 |
| BiorxivClusteringP2P.v2 | Text clustering | Main score | 0.352 | | | 1 |
| MedrxivClusteringP2P.v2 | Text clustering | Main score | 0.352 | | | 1 |
| MedrxivClusteringS2S.v2 | Text clustering | Main score | 0.333 | | | 1 |
| StackExchangeClustering.v2 | Text clustering | Main score | 0.589 | | | 1 |
| StackExchangeClusteringP2P.v2 | Text clustering | Main score | 0.405 | | | 1 |
| TwentyNewsgroupsClustering.v2 | Text clustering | Main score | 0.457 | | | 1 |
| MindSmallReranking | Text reranking | Main score | 0.314 | 0.317 | 0.542 | 1 |
| ArguAna | Text retrieval | NDCG@10 | 0.546 | 0.546 | 0.826 | 1 |
| AskUbuntuDupQuestions | Text retrieval | NDCG@10 | 0.659 | 0.659 | 0.740 | 1 |
| CQADupstackGamingRetrieval | Text retrieval | NDCG@10 | 0.519 | 0.519 | 0.659 | 1 |
| CQADupstackUnixRetrieval | Text retrieval | NDCG@10 | 0.355 | 0.355 | 0.462 | 1 |
| ClimateFEVERHardNegatives | Text retrieval | NDCG@10 | 0.239 | 0.239 | 0.301 | 1 |
| FEVERHardNegatives | Text retrieval | NDCG@10 | 0.569 | 0.569 | 0.765 | 1 |
| FiQA2018 | Text retrieval | NDCG@10 | 0.292 | 0.292 | 0.361 | 1 |
| HotpotQAHardNegatives | Text retrieval | NDCG@10 | 0.553 | 0.553 | 0.605 | 1 |
| SCIDOCS | Text retrieval | NDCG@10 | 0.154 | 0.154 | 0.164 | 1 |
| TRECCOVID | Text retrieval | NDCG@10 | 0.513 | 0.513 | 0.014 | 1 |
| Touche2020Retrieval.v3 | Text retrieval | NDCG@10 | 0.414 | 0.414 | 0.173 | 1 |
| SummEvalSummarization.v2 | Text summarization | Main score | 0.308 | | | 1 |
## Usage

### Loading components

```python
from safetensors.torch import load_file

# Load the entire model
tensors = load_file("TE-86M.safetensors")

# Extract components by prefix
text_enc_sd = {k.removeprefix("text_encoder."): v for k, v in tensors.items() if k.startswith("text_encoder.")}
image_enc_sd = {k.removeprefix("image_encoder."): v for k, v in tensors.items() if k.startswith("image_encoder.")}
audio_enc_sd = {k.removeprefix("audio_encoder."): v for k, v in tensors.items() if k.startswith("audio_encoder.")}
image_proj_sd = {k.removeprefix("image_projection."): v for k, v in tensors.items() if k.startswith("image_projection.")}
audio_proj_sd = {k.removeprefix("audio_projection."): v for k, v in tensors.items() if k.startswith("audio_projection.")}
text_proj_sd = {k.removeprefix("text_projection."): v for k, v in tensors.items() if k.startswith("text_projection.")}
```
### Matryoshka truncation

```python
import torch.nn.functional as F

# Full 1280-dim embedding
embedding = model(inputs)  # (N, 1280)

# Truncate to 256-dim and re-normalize
embedding_256 = F.normalize(embedding[:, :256], dim=-1)
```
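Re-normalization after truncation is not optional: a 256-d prefix of a unit vector has norm below 1, so raw prefixes would distort cosine scores. A quick numpy check with random unit vectors illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.standard_normal((4, 1280))
full /= np.linalg.norm(full, axis=-1, keepdims=True)   # unit-length 1280-d embeddings

prefix = full[:, :256]                                 # Matryoshka truncation
prefix_norms = np.linalg.norm(prefix, axis=-1)         # strictly below 1
emb_256 = prefix / prefix_norms[:, None]               # re-normalize for cosine use
```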
## File layout

```
TE-86M.safetensors    # All components in one file (~330 MB)
```

### Tensor key prefixes

| Prefix | Component | Tensors |
|---|---|---|
| `text_encoder.*` | LEAF-IR (float32) | 103 |
| `image_encoder.*` | MobileNetV4-Medium | 462 |
| `audio_encoder.*` | EfficientAT mn20_as | 312 |
| `image_projection.*` | Depth-2 projection head | 14 |
| `audio_projection.*` | Depth-2 projection head | 14 |
| `text_projection.*` | Depth-2 projection head | 14 |
## Training
- Loss: InfoNCE (contrastive) with Matryoshka Representation Learning
- Data: ~2.2M synthetically generated trimodal triplets (WordNet) + 200K MSCOCO img+txt + 262K WavCaps aud+txt + 1.5M Nomic text pairs
- Hardware: 2x NVIDIA L4 GPUs
- Optimizer: AdamW, lr=1.41e-3, weight decay=1e-4, cosine scheduler
- Epochs: 50
- Batch size: 4096
- Dropout: 0.20 -> 0.25 (ep27) -> 0.30 (ep29) – mid-run regularization increases
- Text mixing: λ_tt=0.5 (ep1-9) -> 0.25 (ep10-50) – Nomic supervised text pairs
- Projection heads only – source encoders are frozen during training
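The objective above (InfoNCE combined with MRL) can be sketched as symmetric contrastive loss averaged over re-normalized Matryoshka prefixes. This is an illustrative reconstruction from the bullet list, not the project's training code; the temperature and equal prefix weighting are assumptions:

```python
import numpy as np

def logsumexp(x, axis=1):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce(a, b, temp=0.07):
    # Symmetric InfoNCE: matched (a_i, b_i) pairs are positives, every other
    # pairing in the batch is a negative.
    logits = (a @ b.T) / temp                                   # (N, N) similarities
    loss_ab = -np.mean(np.diag(logits - logsumexp(logits, axis=1)))
    loss_ba = -np.mean(np.diag(logits.T - logsumexp(logits.T, axis=1)))
    return 0.5 * (loss_ab + loss_ba)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mrl_info_nce(za, zb, dims=(1280, 768, 512, 256, 128)):
    # Average the contrastive loss over re-normalized Matryoshka prefixes,
    # so every truncation level is trained to be retrieval-ready.
    return float(np.mean([info_nce(l2n(za[:, :d]), l2n(zb[:, :d])) for d in dims]))

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 1280))
txt = img + 0.1 * rng.standard_normal((8, 1280))   # loosely aligned image/text pairs
loss = mrl_info_nce(img, txt)                      # small positive value
```

In training, one such term would be computed per modality pair in the batch, with the text-text pairs weighted by λ_tt.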
## Improvements over TE-75M
| Change | TE-75M | TE-86M |
|---|---|---|
| Projection depth | 1 (single residual block) | 2 (two residual blocks) |
| Head params | 26.1M | 37.2M |
| Total params | 75.2M | 86.1M |
| SALT I->T R@1 | 0.615 | 0.618 (+0.5%) |
| SALT T->I R@1 | 0.614 | 0.630 (+2.6%) |
| MSCOCO I->T R@1 | 0.320 | 0.343 (+7.2%) |
| Clotho A->T R@1 | 0.208 | 0.219 (+5.3%) |
| ESC-50 | 93.2% | 93.9% (+0.7%) |
## Design decisions

- Depth-2 residual heads: Ablation confirmed depth-1 hits an I->T ceiling at ~0.60 regardless of dropout or λ_tt. Depth-2 provides the capacity to serve cross-modal and text retrieval simultaneously.
- 3-head shared space: All modalities project into a learned 1280-dim space (image-native dimension)
- LEAF-IR text encoder: 23M-param retrieval-optimized text encoder enables fully edge-deployable text inference
- Frozen source encoders: MobileNetV4, EfficientAT, and LEAF-IR are kept frozen; only projection heads are trained
- Edge-first: All source encoders can run on devices like Raspberry Pi 5
## Limitations
- Audio retrieval lags behind specialist models like CLAP on audio-only benchmarks
- Image-text retrieval trails models with larger vision encoders; the accuracy gap is the price of edge deployability
- Text retrieval recovers 62-92% of raw LEAF-IR quality (gap is domain-dependent)
## Links
- Website: augmem.ai
- GitHub: github.com/augmem
## License
Apache 2.0