# Argus-Lite
Multi-task perception on a single frozen EUPE-ViT-S backbone, adapted from phanerozoic/argus at roughly ¼ the parameter budget.
## Architecture

```
Image → EUPE-ViT-S (frozen, 21M) → shared features
                      │
   ┌──────────────────┼──────────────────┬──────────────────┐
   ▼                  ▼                  ▼                  ▼
Classification    Segmentation         Depth           Detection
Linear(384,1K)  BN+Conv(384,150)     DPT-style    Split-tower (384-D)
 385 K params     58 K params      1.54 M params     2.91 M params
```
Plus correspondence via cosine max on patch tokens (0 params).
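The zero-parameter correspondence head is just nearest-neighbour matching in embedding space. A minimal NumPy sketch of the cosine-max step (tensor names are illustrative, not the repo's API):

```python
import numpy as np

def cosine_max_matches(tokens_a, tokens_b):
    """Match each patch token in image A to its nearest patch in image B
    by cosine similarity. tokens_*: (N, D) arrays of patch embeddings."""
    a = tokens_a / np.linalg.norm(tokens_a, axis=1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=1, keepdims=True)
    sim = a @ b.T                    # (N_a, N_b) cosine similarities
    matches = sim.argmax(axis=1)     # best patch in B for each patch in A
    scores = sim.max(axis=1)         # similarity of that best match
    return matches, scores
```

No learned weights are involved, which is why the component contributes 0 params.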
| Component | Params |
|---|---|
| EUPE-ViT-S backbone (frozen) | 21.59 M |
| Classifier head | 0.39 M |
| Segmentation head | 0.06 M |
| Depth head (DPT decoder) | 13.06 M |
| Detection head | 2.91 M |
| Total | ~38.0 M |
Roughly ⅓ of the Argus-B system's 103 M parameter count.
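As a quick arithmetic check, the head counts in the table do sum to the quoted total:

```python
# Parameter counts from the table above, in millions
params_m = {
    "EUPE-ViT-S backbone (frozen)": 21.59,
    "classifier head": 0.39,
    "segmentation head": 0.06,
    "depth head (DPT decoder)": 13.06,
    "detection head": 2.91,
}
total = sum(params_m.values())
print(f"{total:.2f} M")  # 38.01 M, about a third of Argus-B's 103 M
```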
## Training

All four heads are trained on pre-cached ViT-S features produced by a single forward pass over each target dataset; the backbone stays frozen throughout.
| Head | Dataset | Input | Recipe | Result |
|---|---|---|---|---|
| Classifier | ImageNet-1k train | 224 px CLS token | SGD, lr 30, WD 0, cosine, 30 epochs | 82.87 % train top-1 / 79.13 % val top-1 / 95.53 % val top-5 |
| Segmentation | ADE20K (20,210 train / 2,000 val) | 512 px, 32×32 grid | AdamW, lr 1e-3, 32 epochs (~40k iters) | mIoU 0.419 |
| Depth | NYUv2 (32K train / 5K val) | 416 px, 4 hooked blocks at strides 4/8/16/32 | SILog, AdamW, lr 1e-4 cosine, DPT decoder with reflection-padded 3×3 convs | RMSE 0.537 |
| Detection | COCO train2017 (117 K) | 768 px, 48×48 grid | FCOS targets, AdamW, lr 1e-4, 2 epochs | COCO val2017 mAP 27.3 (AP@50 49.6 · AR@100 43.2); RF100-VL AR@100 0.266 (20-domain subset) |
Training logs per head live alongside the weights (`*_training_log.json`).
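Because the backbone is frozen, each recipe reduces to supervised training on cached features. A sketch of the classifier row above, with synthetic stand-ins for the cached CLS tokens (the real pipeline reads pre-extracted ImageNet features; shapes and loop structure here are illustrative):

```python
import torch
import torch.nn as nn

# Stand-ins for cached features: (N, 384) CLS tokens and ImageNet labels
feats = torch.randn(512, 384)
labels = torch.randint(0, 1000, (512,))

head = nn.Linear(384, 1000)
# lr 30 with zero weight decay, as in the recipe table; very high LRs like
# this are common for linear probes on frozen, well-conditioned features
opt = torch.optim.SGD(head.parameters(), lr=30.0, weight_decay=0.0)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=30)

for epoch in range(30):
    for i in range(0, len(feats), 256):
        x, y = feats[i:i + 256], labels[i:i + 256]
        loss = nn.functional.cross_entropy(head(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()  # cosine decay per epoch
```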
## Files

| File | Contents |
|---|---|
| `cls_head.safetensors` | Linear(384, 1000) classifier |
| `cls_training_log.json` | Classifier training log |
| `seg_head.safetensors` | BN + Conv2d(384, 150, 1) |
| `seg_training_log.json` | Segmentation training log |
| `depth_head.safetensors` | DPT decoder over 4 hooked ViT-S blocks |
| `depth_training_log.json` | Depth training log |
| `det_head.safetensors` | SplitTowerHead (feat_dim=384) |
| `det_training_log.json` | Detection training log |
| `argus_lite.py` | ArgusLite class |
| `infer.py` | CLI dispatcher (6 subcommands) |
## Usage
```python
from argus_lite import ArgusLite

model = ArgusLite.from_pretrained('phanerozoic/argus-lite').cuda().eval()

# Single-image multi-task inference
out = model.perceive('image.jpg')
# out['classification']  {'label': 'tabby', 'score': 0.62, 'top5': [...]}
# out['segmentation']    (512, 512) int array of ADE20K class ids
# out['depth']           (416, 416) float array, metric depth in meters
# out['detection']       [{'box': [x1, y1, x2, y2], 'score': 0.78, 'class_id': 17}, ...]
# out['correspondence']  None (needs a second image)

# Per-task methods
model.classify('image.jpg', top_k=5)
model.segment('street.jpg')                  # (512, 512)
model.depth('room.jpg')                      # (416, 416) metric meters
model.detect('photo.jpg', score_thresh=0.3)  # list of dicts
model.correspond('a.jpg', 'b.jpg')           # cosine-max patch matches

# Paired inference with correspondence populated
out = model.perceive('a.jpg', image_b='b.jpg')
# out['correspondence']  {'matches': (1024,), 'scores': (1024,), 'grid': 32}
```
The CLI dispatcher in `infer.py` exposes six subcommands:

```bash
python infer.py classify cat.jpg
python infer.py segment street.jpg --save seg.png
python infer.py depth room.jpg --save depth.png
python infer.py detect photo.jpg --thresh 0.3
python infer.py correspond a.jpg b.jpg
python infer.py perceive image.jpg --second image2.jpg --save out/
```
Each head runs at its own training resolution: classifier 224 px (CLS token), segmentation 512 px (32×32 patch grid), depth 416 px (26×26 grid, DPT decoder over hooked blocks 2, 5, 8, 11), detection 768 px (48×48 grid). `perceive()` therefore does four backbone forward passes per image.
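The grids quoted above all correspond to a 16-pixel patch size (input side ÷ 16); that patch size is an inference from the numbers rather than a documented constant:

```python
# Patch grid per task, assuming the 16-px patch size implied by the grids above
PATCH = 16
for task, side in [("classify", 224), ("segment", 512), ("depth", 416), ("detect", 768)]:
    print(f"{task}: {side} px -> {side // PATCH}x{side // PATCH} patch grid")
# classify: 224 px -> 14x14 patch grid
# segment: 512 px -> 32x32 patch grid
# depth: 416 px -> 26x26 patch grid
# detect: 768 px -> 48x48 patch grid
```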
Requires `argus.py` from phanerozoic/argus on `sys.path` for the `DinoVisionTransformer` and `SplitTowerHead` classes.
## Cross-domain detection benchmark (RF100-VL subset)
Same 20-domain class-agnostic AR@100 protocol as Argus, evaluated live through the ViT-S backbone at 768 px input.
| Model | Total params | Mean AR@100 |
|---|---|---|
| Argus+FCOS (ViT-B backbone) | 102.1 M | 0.251 |
| Argus-Lite (this model) | ~26.5 M | 0.266 |
| Argus+(current picker, ViT-B) | 89.0 M | 0.289 |
Per-domain numbers live in `rf100vl_results.json`.
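For readers unfamiliar with the metric: class-agnostic AR@100 keeps each image's top-100 detections by score, ignores predicted labels, and averages ground-truth recall over IoU thresholds 0.50:0.95. A single-image sketch (the real protocol uses the COCO evaluator's greedy one-to-one matching, which this simplification skips):

```python
import numpy as np

def iou(box_a, boxes_b):
    """IoU of one [x1, y1, x2, y2] box against an (N, 4) array of boxes."""
    x1 = np.maximum(box_a[0], boxes_b[:, 0])
    y1 = np.maximum(box_a[1], boxes_b[:, 1])
    x2 = np.minimum(box_a[2], boxes_b[:, 2])
    y2 = np.minimum(box_a[3], boxes_b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a + area_b - inter)

def ar_at_100(gt, preds, scores):
    """Class-agnostic AR@100 for one image, averaged over IoU 0.50:0.95."""
    top = preds[np.argsort(-scores)][:100]   # keep top-100 detections by score
    recalls = []
    for thr in np.arange(0.5, 1.0, 0.05):
        matched = sum(1 for g in gt if (iou(g, top) >= thr).any())
        recalls.append(matched / len(gt))
    return float(np.mean(recalls))
```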
## Evaluation details
- Classifier val top-1 is 79.13 %, top-5 95.53 % on the 50K-image ImageNet 2012 validation set, using the TensorFlow Models repo's synset-label mapping for ground truth. This is above the EUPE-ViT-S paper's kNN baseline (78.2 %).
- Detection head: COCO val2017 mAP 0.273 (AP@50 0.496, AP@75 0.268, AR@100 0.432). See `coco_val_eval.json` for the full breakdown, including per-size AP.
- Depth head is a DPT decoder reassembling the four hooked ViT-S block activations (blocks 2, 5, 8, 11) at strides [4, 8, 16, 32], followed by 4 FeatureFusion blocks with residual conv units and a 256-bin depth head. It was trained on NYUv2 (32K train / 5K held-out val split) with SILog loss and AdamW, lr 1e-4 on a cosine schedule. For reference, the same DPT decoder on EUPE-ViT-B (Argus, a 4× larger backbone) reaches 0.391 RMSE on the equivalent split.
- Segmentation head is a linear probe at 5 epochs; the EUPE-ViT-S paper reports mIoU 0.466 at a much longer schedule.
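The SILog loss mentioned above is the scale-invariant log-depth loss of Eigen et al.: it penalises per-pixel log-depth error while discounting a shared global scale. A common formulation (the λ value and the square root are conventional choices, not confirmed from this repo's code):

```python
import torch

def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss: sqrt(mean(d^2) - lam * mean(d)^2),
    where d is the per-pixel log-depth error. With lam = 1 a pure
    global scale error costs nothing; lam < 1 re-penalises it mildly."""
    valid = target > eps                       # mask out invalid / zero depth
    d = torch.log(pred[valid] + eps) - torch.log(target[valid] + eps)
    var_like = (d ** 2).mean() - lam * d.mean() ** 2
    return torch.sqrt(torch.clamp(var_like, min=0.0))
```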
## Source backbone
EUPE-ViT-S from Meta FAIR (arXiv:2603.22387, Zhu et al., March 2026). Three-stage distillation from PEcore-G + PElang-G + DINOv3-H+ via a 1.9 B proxy teacher. License: FAIR Research License (non-commercial).
## Model tree for phanerozoic/argus-lite

Base model: `facebook/EUPE-ViT-S`