AIST-95M

AIST-95M is the dual-audio Trimodal Embeddings teacher checkpoint built on:

  • text: MongoDB/mdbr-leaf-ir
  • image: mobilenetv4_conv_medium.e180_r384_in12k
  • audio: mn20_as + whisper-tiny encoder

Its canonical Augmem name follows the repo standard:

  • AIST = audio + image + speech + text, alphabetized and reduced to first letters
  • 95M = exact loaded parameter count rounded to integer millions
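
The naming rule above is mechanical enough to sketch in a few lines. The helper name `canonical_name` is hypothetical and is not part of this repo; it only illustrates how the letters and the rounded parameter count combine.

```python
def canonical_name(modalities, param_count):
    """Build a name such as 'AIST-95M' (hypothetical helper, not a repo API)."""
    # Alphabetize the modality names, then keep each first letter, uppercased.
    letters = "".join(m[0].upper() for m in sorted(modalities))
    # Round the exact loaded parameter count to whole millions.
    millions = round(param_count / 1_000_000)
    return f"{letters}-{millions}M"


name = canonical_name(["text", "image", "audio", "speech"], 95_315_959)
print(name)
```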

It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at [1280, 768, 512, 256, 128].
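
A minimal sketch of Matryoshka truncation, assuming the usual practice of keeping the leading coordinates and re-normalizing before cosine similarity. The card does not state whether re-normalization is required, so treat that step and the helper name `truncate_embedding` as assumptions.

```python
import math

# Truncation sizes listed in this card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and L2-normalize them (assumed convention)."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported dim {dim}; expected one of {MATRYOSHKA_DIMS}")
    if len(vec) < dim:
        raise ValueError("embedding is shorter than the requested dim")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [x / norm for x in head]


full = [1.0] * 1280  # stand-in for a real 1280-d embedding
small = truncate_embedding(full, 256)
```

Truncated vectors from different modalities should only be compared at the same dimension.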

This repo publishes the dual-audio teacher as a safetensors release artifact plus the exact local gate baseline used for later teacher-recovery experiments.

Parameter Count

Exact loaded parameter count in the deployed evaluation path:

| Component | Params |
| --- | ---: |
| Text encoder (MongoDB/mdbr-leaf-ir) | 22,861,056 |
| Image encoder (mobilenetv4_conv_medium.e180_r384_in12k) | 8,434,512 |
| Audio encoder (mn20_as, full loaded module) | 17,909,287 |
| Audio encoder (openai/whisper-tiny, encoder only) | 8,208,384 |
| Image projection head | 12,306,560 |
| Audio projection head | 14,272,640 |
| Text projection head | 11,323,520 |
| Total exact loaded params | 95,315,959 |

For continuity with older notes:

  • historical shorthand: TE-86M Dual Audio
  • 89,048,552 params if you exclude the EfficientAT classifier head from the mn20_as module
  • 37,902,720 params are trainable checkpoint weights in the three projection heads
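
The figures in this section can be cross-checked with plain arithmetic. The sketch below copies the per-component counts from the table above; the dictionary keys are illustrative labels, and the classifier-head size is derived by subtraction rather than stated anywhere in this card.

```python
# Per-component counts copied from the parameter table in this card.
components = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder_mn20_as": 17_909_287,
    "audio_encoder_whisper_tiny": 8_208_384,
    "image_head": 12_306_560,
    "audio_head": 14_272_640,
    "text_head": 11_323_520,
}

total = sum(components.values())
heads = sum(v for k, v in components.items() if k.endswith("_head"))
encoders = total - heads

# Derived value: the gap between the full total and the
# "excluding the EfficientAT classifier head" figure.
classifier_head = total - 89_048_552

print(f"total={total:,} heads={heads:,} encoders={encoders:,}")
print(f"implied classifier head={classifier_head:,}")
```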

Architecture

The dual-audio teacher uses a frozen-encoder / trained-head setup:

Text   -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
Audio  -> EfficientAT mn20_as (1920-d) \
                                           +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280
         Whisper-Tiny encoder (384-d)    /

The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.
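
As a rough illustration of the fusion step, the sketch below concatenates one pooled vector per branch into the 2304-d input of the audio projection head. It assumes each encoder output has already been pooled to a single vector and that EfficientAT comes first in the concatenation, following the diagram; neither detail is confirmed by this card, and `fuse_audio_features` is a hypothetical name.

```python
EFFICIENTAT_DIM = 1920  # mn20_as branch width from the diagram
WHISPER_DIM = 384       # Whisper-Tiny encoder branch width from the diagram


def fuse_audio_features(env_feat, speech_feat):
    """Concatenate the two audio branches (order is an assumption)."""
    if len(env_feat) != EFFICIENTAT_DIM:
        raise ValueError(f"expected {EFFICIENTAT_DIM}-d EfficientAT features")
    if len(speech_feat) != WHISPER_DIM:
        raise ValueError(f"expected {WHISPER_DIM}-d Whisper features")
    return list(env_feat) + list(speech_feat)


fused = fuse_audio_features([0.0] * EFFICIENTAT_DIM, [1.0] * WHISPER_DIM)
print(len(fused))
```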

Core training config:

  • projection hidden dim: 1920
  • projection output dim: 1280
  • projection depth: 2
  • loss: InfoNCE
  • audio encoder dim after concat: 2304
  • Matryoshka dims: [1280, 768, 512, 256, 128]
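
The definition of DeepProjectionHead-d2 is not included in this card. As a sanity check only, the sketch below counts parameters for one hypothetical layout: an input linear layer plus LayerNorm, `depth` hidden linear layers each followed by LayerNorm, and an output linear layer, all with biases. This layout happens to reproduce the three head counts in the parameter table, but other layouts could match as well, so it should not be read as the actual module.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out


def layernorm_params(width):
    # Learnable scale and shift.
    return 2 * width


def head_params(in_dim, hidden=1920, out_dim=1280, depth=2):
    """Parameter count for the hypothetical head layout described above."""
    total = linear_params(in_dim, hidden) + layernorm_params(hidden)
    for _ in range(depth):
        total += linear_params(hidden, hidden) + layernorm_params(hidden)
    total += linear_params(hidden, out_dim)
    return total


for label, in_dim in [("text", 768), ("image", 1280), ("audio", 2304)]:
    print(label, f"{head_params(in_dim):,}")
```

Independently of the assumed layout, the heads differ only in their input width: each extra input dimension adds 1920 weights, which accounts for the gap between the text, image, and audio head counts.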

Published config file: te_mn20_whisper_d2_validaudio.yaml

Local Gate Baseline

The attached JSON file, teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json, is the canonical local gate baseline for later teacher continuation experiments.

Seeded split-excluded baseline at 1280d:

| Slice | Metric | Value |
| --- | --- | ---: |
| Speech holdout | A->T R@1 | 0.5652 |
| Speech holdout | T->A R@1 | 0.5992 |
| Speech holdout | avg R@1 | 0.5822 |
| WavCaps FSD | A->T R@1 | 0.1078 |
| WavCaps FSD | T->A R@1 | 0.1030 |
| WavCaps FSD | avg R@1 | 0.1054 |
| SALT | A->I R@1 | 0.1692 |
| SALT | I->A R@1 | 0.1261 |
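
For readers unfamiliar with the metric, here is a minimal sketch of R@1 over a similarity matrix, assuming exactly one ground-truth match per query located on the diagonal. The card does not describe the evaluation protocol in detail, so this pairing assumption and the toy matrix are illustrative only. The last lines confirm that each reported avg R@1 is the mean of its two directions.

```python
def recall_at_1(sim):
    """sim[i][j] is the similarity of query i to candidate j; item i matches item i."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Toy audio-by-text similarity matrix (made-up values).
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
a2t = recall_at_1(sim)             # audio queries, text candidates
t2a = recall_at_1(transpose(sim))  # text queries, audio candidates

# Reported averages are the mean of the two directions.
speech_avg = (0.5652 + 0.5992) / 2
wavcaps_avg = (0.1078 + 0.1030) / 2
print(round(speech_avg, 4), round(wavcaps_avg, 4))
```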

Important scope note:

  • These are the exact local gate numbers used for bounded recovery experiments.
  • They are not a claim of broad public benchmark superiority.
  • The external 4-task audio smoke baseline was not packaged into this release.

Files

| File | Purpose |
| --- | --- |
| AIST-95M.safetensors | Self-contained dual-audio teacher release artifact |
| te_mn20_whisper_d2_validaudio.yaml | Training config for the teacher line |
| teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json | Canonical exact-gate baseline |
| parameter_breakdown.json | Exact parameter accounting used in this card |

Loading

This release is a self-contained safetensors artifact containing:

  • text encoder weights
  • image encoder weights
  • EfficientAT audio encoder weights
  • Whisper-Tiny encoder weights
  • text / image / audio projection heads
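
This card does not document the tensor key names inside the artifact, so the sketch below only inspects the file rather than rebuilding the model. It reads tensor shapes with the `safetensors` library and groups element counts by the text before the first dot in each key; that grouping rule and both helper names are assumptions for illustration.

```python
from collections import defaultdict
from math import prod


def count_by_prefix(shapes):
    """Sum element counts per top-level key prefix (grouping rule is an assumption)."""
    counts = defaultdict(int)
    for name, shape in shapes.items():
        counts[name.split(".", 1)[0]] += prod(shape)
    return dict(counts)


def read_shapes(path):
    """Read tensor shapes from a safetensors file without loading tensor data."""
    # Imported lazily so count_by_prefix works without safetensors installed.
    from safetensors import safe_open

    shapes = {}
    with safe_open(path, framework="np") as f:
        for name in f.keys():
            shapes[name] = tuple(f.get_slice(name).get_shape())
    return shapes


# Example usage, assuming the artifact has been downloaded locally:
# for prefix, n in sorted(count_by_prefix(read_shapes("AIST-95M.safetensors")).items()):
#     print(prefix, f"{n:,}")
```

If the file also stores non-parameter buffers, the element totals from this inspection may differ from the parameter counts listed earlier in this card.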

Caveats

  • This release uses the canonical Augmem name AIST-95M.
  • Older TE-86M Dual Audio references are legacy aliases for the same artifact line.
  • The older augmem/TE-86M release already on Hugging Face is a different artifact line.