Argus-Lite

Multi-task perception on a single frozen EUPE-ViT-S backbone, adapted from phanerozoic/argus at roughly ¼ the parameter budget.

Architecture

Image → EUPE-ViT-S (frozen, 21M) → shared features
                                    │
                    ┌───────────────┼──────────────┬──────────────┐
                    ▼               ▼              ▼              ▼
              Classification  Segmentation      Depth        Detection
              Linear(384,1K)  BN+Conv(384,150)  DPT-style   Split-tower (384-D)
              385 K params     58 K params     1.54 M params 2.91 M params

Plus correspondence via cosine max on patch tokens (0 params).
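The zero-parameter correspondence path can be sketched directly from that description. Below is a minimal NumPy illustration; the function name and input shapes are assumptions for the sketch, not the repo's API:

```python
import numpy as np

def cosine_max_match(feats_a: np.ndarray, feats_b: np.ndarray):
    """Match each patch token in image A to its best cosine match in image B.

    feats_a, feats_b: (N, D) patch tokens from the frozen backbone
    (N = 32*32 = 1024 on a 512 px input). Returns argmax indices and scores.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                      # (N, N) cosine similarities
    return sim.argmax(axis=1), sim.max(axis=1)
```

The (1024,) matches/scores shapes reported by perceive() fall out of the 32×32 patch grid here.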

| Component                    | Params  |
|------------------------------|---------|
| EUPE-ViT-S backbone (frozen) | 21.59 M |
| Classifier head              | 0.39 M  |
| Segmentation head            | 0.06 M  |
| Depth head (DPT decoder)     | 13.06 M |
| Detection head               | 2.91 M  |
| Total                        | ~38.0 M |

Roughly ⅓ of the Argus-B system's parameter count (103 M).

Training

All four heads are trained on pre-cached ViT-S features produced by a single forward pass over each target dataset; the backbone stays frozen throughout.
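As a sketch of this recipe, here is a softmax linear probe trained with plain full-batch gradient descent on pre-cached features (illustrative only; the actual heads use the optimizers and schedules listed in the table):

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, epochs=30, lr=0.1):
    """Gradient descent on softmax cross-entropy over cached features.

    feats: (N, D) backbone features computed once, backbone never updated.
    labels: (N,) integer class ids. Returns the (D, n_classes) weight matrix.
    """
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    for _ in range(epochs):
        logits = feats @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(n), labels] -= 1.0                # d(softmax-CE)/d(logits)
        W -= lr * (feats.T @ p) / n
    return W
```

Because the features are fixed, each epoch is a cheap matrix multiply; this is what makes the 30-epoch ImageNet probe tractable on cached activations.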

| Head | Dataset | Input | Recipe | Result |
|------|---------|-------|--------|--------|
| Classifier | ImageNet-1k train | 224 px, CLS token | SGD, lr 30, WD 0, cosine, 30 epochs | 82.87 % train top-1 / 79.13 % val top-1 / 95.53 % val top-5 |
| Segmentation | ADE20K (20,210 train / 2,000 val) | 512 px, 32×32 grid | AdamW, lr 1e-3, 32 epochs (~40k iters) | mIoU 0.419 |
| Depth | NYUv2 (32K train / 5K val) | 416 px, 4 hooked blocks at strides 4/8/16/32 | SILog, AdamW, lr 1e-4 cosine, DPT decoder with reflection-padded 3×3 convs | RMSE 0.537 |
| Detection | COCO train2017 (117 K) | 768 px, 48×48 grid | FCOS targets, AdamW, lr 1e-4, 2 epochs | COCO val2017 mAP 27.3 (AP@50 49.6 · AR@100 43.2); RF100-VL AR@100 0.266 (20-domain subset) |
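FCOS-style targets assign each grid cell whose center falls inside a ground-truth box a (left, top, right, bottom) regression target. A minimal single-box sketch of that assignment (the real head's version, with center sampling and multi-box resolution, is more involved):

```python
import numpy as np

def fcos_targets(grid, stride, box):
    """(l, t, r, b) regression targets for one GT box on a square grid.

    grid: cells per side (e.g. 48 at 768 px), stride: pixels per cell,
    box: (x1, y1, x2, y2). Positives are cells whose center lies inside the box.
    """
    cs = (np.arange(grid) + 0.5) * stride            # cell-center coordinates
    cx, cy = np.meshgrid(cs, cs)
    x1, y1, x2, y2 = box
    ltrb = np.stack([cx - x1, cy - y1, x2 - cx, y2 - cy], axis=-1)
    pos = ltrb.min(axis=-1) > 0                      # all four distances positive
    return ltrb, pos
```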

Training logs per head live alongside the weights (*_training_log.json).

Files

cls_head.safetensors          Linear(384, 1000) classifier
cls_training_log.json
seg_head.safetensors          BN + Conv2d(384, 150, 1)
seg_training_log.json
depth_head.safetensors        DPT decoder over 4 hooked ViT-S blocks
depth_training_log.json
det_head.safetensors          SplitTowerHead (feat_dim=384)
det_training_log.json
argus_lite.py                 ArgusLite class
infer.py                      CLI dispatcher (6 subcommands)

Usage

from argus_lite import ArgusLite

model = ArgusLite.from_pretrained('phanerozoic/argus-lite').cuda().eval()

# Single-image multi-task inference
out = model.perceive('image.jpg')
# out['classification']  {'label': 'tabby', 'score': 0.62, 'top5': [...]}
# out['segmentation']    (512, 512) int array of ADE20K class ids
# out['depth']           (416, 416) float array, metric depth in meters
# out['detection']       [{'box': [x1,y1,x2,y2], 'score': 0.78, 'class_id': 17}, ...]
# out['correspondence']  None  (needs a second image)

# Per-task methods
model.classify('image.jpg', top_k=5)
model.segment('street.jpg')                          # (512, 512)
model.depth('room.jpg')                              # (416, 416) metric meters
model.detect('photo.jpg', score_thresh=0.3)          # list of dicts
model.correspond('a.jpg', 'b.jpg')                   # cosine-max patch matches

# Paired inference with correspondence populated
out = model.perceive('a.jpg', image_b='b.jpg')
# out['correspondence']  {'matches': (1024,), 'scores': (1024,), 'grid': 32}

CLI dispatcher in infer.py (six subcommands):

python infer.py classify   cat.jpg
python infer.py segment    street.jpg --save seg.png
python infer.py depth      room.jpg --save depth.png
python infer.py detect     photo.jpg --thresh 0.3
python infer.py correspond a.jpg b.jpg
python infer.py perceive   image.jpg --second image2.jpg --save out/

Each head runs at its own training resolution: classifier 224 px (CLS token), segmentation 512 px (32×32 patch grid), depth 416 px (26×26 grid, DPT decoder over hooked blocks 2, 5, 8, 11), detection 768 px (48×48 grid). perceive() therefore does four backbone forward passes per image.
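With the ViT's 16 px patches, each grid size above follows directly from its input resolution. A small sanity check of that relationship (the 16 px patch size is inferred from the stated resolutions and grids, not read from the repo):

```python
# Per-task input resolutions from this card; grid = resolution // patch_size.
TASK_RES = {"classify": 224, "segment": 512, "depth": 416, "detect": 768}

def patch_grid(task: str, patch: int = 16) -> int:
    """Side length of the patch-token grid at a task's training resolution."""
    return TASK_RES[task] // patch
```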

Requires argus.py from phanerozoic/argus on sys.path for the DinoVisionTransformer and SplitTowerHead classes.

Cross-domain detection benchmark (RF100-VL subset)

Same 20-domain class-agnostic AR@100 protocol as Argus, evaluated live through the ViT-S backbone at 768 px input.

Model Total params Mean AR@100
Argus+FCOS (ViT-B backbone) 102.1 M 0.251
Argus-Lite (this model) ~26.5 M 0.266
Argus+(current picker, ViT-B) 89.0 M 0.289

Per-domain numbers live in rf100vl_results.json.

Evaluation details

  • Classifier val top-1 is 79.13 %, top-5 95.53 % on 50K ImageNet val 2012 images, using the TensorFlow Models repo's synset-label mapping for ground truth. Above the EUPE-ViT-S paper kNN baseline (78.2).
  • Detection head: COCO val2017 mAP 0.273 (AP@50 0.496, AP@75 0.268, AR@100 0.432). See coco_val_eval.json for the full breakdown including per-size AP.
  • Depth head is a DPT decoder reassembling the four hooked ViT-S block activations (blocks 2, 5, 8, 11) at strides [4, 8, 16, 32], followed by 4 FeatureFusion blocks with residual conv units and a 256-bin depth head. Trained on NYUv2 (32K train / 5K val held-out split) with SILog loss and AdamW lr 1e-4 on a cosine schedule. For reference, the same DPT decoder on EUPE-ViT-B (Argus, 4Γ— larger backbone) reaches 0.391 RMSE on the equivalent split.
  • Segmentation head is a linear probe at 5 epochs; the EUPE-ViT-S paper reports mIoU 0.466 at a much longer schedule.

Source backbone

EUPE-ViT-S from Meta FAIR (arXiv:2603.22387, Zhu et al., March 2026). Three-stage distillation from PEcore-G + PElang-G + DINOv3-H+ via a 1.9 B proxy teacher. License: FAIR Research License (non-commercial).
