AIST-95M

AIST-95M is the dual-audio Trimodal Embeddings teacher checkpoint built on:

text: MongoDB/mdbr-leaf-ir
image: mobilenetv4_conv_medium.e180_r384_in12k
audio: mn20_as + whisper-tiny encoder

Its canonical Augmem name follows the repo standard:

AIST = audio + image + speech + text, alphabetized and reduced to first letters
95M = exact loaded parameter count (95,315,959) rounded to integer millions

It maps text, image, and audio into a shared 1280-dimensional embedding space with Matryoshka truncation support at [1280, 768, 512, 256, 128].
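As a minimal sketch of what Matryoshka truncation usually means in practice: keep the first k coordinates of the 1280-d vector and re-normalize. The helper names below are hypothetical, and re-normalizing after truncation is an assumed convention, since this card does not say whether the released heads expect it.

```python
import math

# Truncation sizes listed in this card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit L2 length.

    Assumption: prefix truncation followed by re-normalization, which is the
    common Matryoshka convention; the card does not confirm this detail.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError("embedding is shorter than the requested dimension")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # leave an all-zero vector untouched
    return [x / norm for x in head]


def cosine(a, b):
    """Dot product of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))
```

Both sides of a comparison should be truncated to the same size before calling `cosine`.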

This repo publishes the dual-audio teacher as a safetensors release artifact plus the exact local gate baseline used for later teacher-recovery experiments.

Parameter Count

Exact loaded parameter count in the deployed evaluation path:

Component	Params
Text encoder (`MongoDB/mdbr-leaf-ir`)	22,861,056
Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`)	8,434,512
Audio encoder (`mn20_as`, full loaded module)	17,909,287
Audio encoder (`openai/whisper-tiny`, encoder only)	8,208,384
Image projection head	12,306,560
Audio projection head	14,272,640
Text projection head	11,323,520
Total exact loaded params	95,315,959

For continuity with older notes:

historical shorthand: TE-86M Dual Audio
89,048,552 params if the EfficientAT classifier head is excluded from the mn20_as module
37,902,720 params are trainable; these are the checkpoint weights of the three projection heads
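The figures above can be cross-checked with a few lines of arithmetic. The dictionary keys are hypothetical labels rather than names from the checkpoint, and the classifier-head size is only implied by subtracting the two totals; it is not stated separately in this card.

```python
# Per-component counts copied from the parameter table in this card.
COMPONENT_PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder_mn20_as": 17_909_287,
    "audio_encoder_whisper_tiny": 8_208_384,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 14_272_640,
    "text_projection_head": 11_323_520,
}

total = sum(COMPONENT_PARAMS.values())
heads = sum(
    count
    for name, count in COMPONENT_PARAMS.items()
    if name.endswith("_projection_head")
)

# Count reported when the EfficientAT classifier head is excluded.
without_classifier_head = 89_048_552
implied_classifier_head = total - without_classifier_head  # derived, not stated

print(total)                    # 95315959
print(heads)                    # 37902720
print(implied_classifier_head)  # 6267407
print(round(total / 1e6))       # 95, the "95M" in the name
```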

Architecture

The dual-audio teacher uses a frozen-encoder / trained-head setup:

Text   -> mdbr-leaf-ir (768-d) ----------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ---------> DeepProjectionHead-d2 -> 1280
Audio  -> EfficientAT mn20_as (1920-d) \
                                           +--> concat(2304-d) -> DeepProjectionHead-d2 -> 1280
         Whisper-Tiny encoder (384-d)    /

The audio path is dual-encoder by construction. EfficientAT contributes the environmental / general audio branch; Whisper-Tiny contributes the speech-sensitive branch.
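A minimal sketch of the fusion step in the diagram, assuming both branches have already been pooled to one vector per clip. The function name is hypothetical, and the concatenation order (EfficientAT first, then Whisper-Tiny) and the pooling of the Whisper encoder output are assumptions; the card only states the dimensions.

```python
EFFICIENTAT_DIM = 1920  # mn20_as feature size from the diagram
WHISPER_DIM = 384       # Whisper-Tiny encoder width from the diagram
FUSED_DIM = EFFICIENTAT_DIM + WHISPER_DIM  # 2304


def fuse_audio_features(efficientat_feat, whisper_feat):
    """Concatenate the two per-clip audio feature vectors into one 2304-d vector.

    Assumption: EfficientAT features come first, matching the top-to-bottom
    order of the diagram; the real checkpoint may use a different order.
    """
    if len(efficientat_feat) != EFFICIENTAT_DIM:
        raise ValueError(f"expected {EFFICIENTAT_DIM} EfficientAT features")
    if len(whisper_feat) != WHISPER_DIM:
        raise ValueError(f"expected {WHISPER_DIM} Whisper features")
    return list(efficientat_feat) + list(whisper_feat)
```

The fused vector is what the audio projection head would consume before mapping to the shared 1280-d space.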

Core training config:

projection hidden dim: 1920
projection output dim: 1280
projection depth: 2
loss: InfoNCE
audio encoder dim after concat: 2304
Matryoshka dims: [1280, 768, 512, 256, 128]
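For readers unfamiliar with the loss named above, here is a small pure-Python sketch of a symmetric InfoNCE over a batch of paired embeddings. The temperature of 0.07, the symmetric averaging, and the unit-normalized inputs are all assumptions; the actual values and the way the loss is applied across the Matryoshka dims live in the published config and training code, not in this card.

```python
import math


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows, treating entry i as the match."""
    losses = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[i])
    return sum(losses) / len(losses)


def info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE for paired embeddings (emb_a[i] matches emb_b[i]).

    Assumptions: inputs are unit-normalized lists of floats, and the
    temperature default is illustrative only.
    """
    logits = [
        [sum(x * y for x, y in zip(a, b)) / temperature for b in emb_b]
        for a in emb_a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # b -> a direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

Correctly paired batches should score a lower loss than the same batch with its pairs shuffled.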

Published config file: te_mn20_whisper_d2_validaudio.yaml

Local Gate Baseline

The attached JSON file teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json is the canonical local gate baseline used for later teacher continuation experiments.

Seeded split-excluded baseline at 1280d:

Slice and metric	Value
Speech holdout A->T R@1	0.5652
Speech holdout T->A R@1	0.5992
Speech holdout avg R@1	0.5822
WavCaps FSD A->T R@1	0.1078
WavCaps FSD T->A R@1	0.1030
WavCaps FSD avg R@1	0.1054
SALT A->I R@1	0.1692
SALT I->A R@1	0.1261
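The R@1 columns can be read as follows: for each query, check whether the single highest-scoring candidate is its paired item. The sketch below assumes one-to-one pairs with the ground truth on the diagonal of a similarity matrix, which may be simpler than the actual gate protocol; the function names are hypothetical. The "avg" rows in the table are consistent with a plain mean of the two directions.

```python
def recall_at_1(sim):
    """Fraction of rows whose highest-scoring column is the row's own index.

    `sim[i][j]` is the similarity between query i and candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += best == i
    return hits / len(sim)


def bidirectional_recall_at_1(sim):
    """Return (forward R@1, backward R@1, their mean) for a square matrix."""
    forward = recall_at_1(sim)
    backward = recall_at_1([list(col) for col in zip(*sim)])
    return forward, backward, (forward + backward) / 2


# The reported averages match the mean of the two directions.
speech_avg = (0.5652 + 0.5992) / 2   # 0.5822 after rounding to 4 places
wavcaps_avg = (0.1078 + 0.1030) / 2  # 0.1054 after rounding to 4 places
```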

Important scope note:

These are the exact local gate numbers used for bounded recovery experiments.
They are not a claim of broad public benchmark superiority.
The external 4-task audio smoke baseline was not packaged into this release.

Files

File	Purpose
`AIST-95M.safetensors`	Self-contained dual-audio teacher release artifact
`te_mn20_whisper_d2_validaudio.yaml`	Training config for the teacher line
`teacher_dual_mn20whisper_exact_gate_baseline_20260424T155324Z.json`	Canonical exact-gate baseline
`parameter_breakdown.json`	Exact parameter accounting used in this card

Loading

This release is a self-contained safetensors artifact containing:

text encoder weights
image encoder weights
EfficientAT audio encoder weights
Whisper-Tiny encoder weights
text / image / audio projection heads
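One way to inspect what the artifact contains without loading any weights is to read the safetensors header directly: the file starts with an 8-byte little-endian length, followed by a JSON header mapping tensor names to dtype, shape, and byte offsets. The sketch below uses only the standard library; the helper names are hypothetical, and the tensor-name prefixes inside `AIST-95M.safetensors` are not documented in this card, so the grouping depth may need adjusting.

```python
import json
import struct
from collections import Counter
from math import prod


def read_safetensors_header(path):
    """Return the tensor table of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def params_by_prefix(header, depth=1):
    """Sum element counts per tensor-name prefix (first `depth` dotted parts)."""
    counts = Counter()
    for name, info in header.items():
        prefix = ".".join(name.split(".")[:depth])
        counts[prefix] += prod(info["shape"])  # empty shape counts as 1 element
    return dict(counts)
```

For example, `params_by_prefix(read_safetensors_header("AIST-95M.safetensors"))` would print per-prefix element counts. Stored buffers such as normalization running statistics are counted too, so the totals may not match the parameter table exactly.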

Caveats

This release uses the canonical Augmem name AIST-95M.
Older TE-86M Dual Audio references are legacy aliases for the same artifact line.
The existing older augmem/TE-86M release on Hugging Face is a different artifact line.
