# Argus-Lite
Multi-task perception on a single frozen EUPE-ViT-S backbone, adapted from phanerozoic/argus at roughly ¼ the parameter budget.
## Architecture

```
Image → EUPE-ViT-S (frozen, 21M) → shared features
                      │
   ┌──────────────────┼──────────────────┬──────────────────┐
   ▼                  ▼                  ▼                  ▼
Classification    Segmentation         Depth           Detection
Linear(384,1K)  BN+Conv(384,150)     DPT-style    Split-tower (384-D)
 385 K params     58 K params      1.54 M params     2.91 M params
```
Plus correspondence via cosine max on patch tokens (0 params).
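The zero-parameter correspondence head is just nearest-neighbour matching in embedding space. A minimal NumPy sketch of the cosine-max step (tensor names are illustrative, not the repo's API):

```python
import numpy as np

def cosine_max_matches(tokens_a, tokens_b):
    """Match each patch token in image A to its nearest patch in image B
    by cosine similarity. tokens_*: (N, D) arrays of patch embeddings."""
    a = tokens_a / np.linalg.norm(tokens_a, axis=1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=1, keepdims=True)
    sim = a @ b.T                    # (N_a, N_b) cosine similarities
    matches = sim.argmax(axis=1)     # best patch in B for each patch in A
    scores = sim.max(axis=1)         # similarity of that best match
    return matches, scores
```

No learned weights are involved, which is why the component contributes 0 params.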
| Component | Params |
|---|---|
| EUPE-ViT-S backbone (frozen) | 21.59 M |
| Classifier head | 0.39 M |
| Segmentation head | 0.06 M |
| Depth head (DPT decoder) | 13.06 M |
| Detection head | 2.91 M |
| Total | ~38.0 M |
Roughly ⅓ of the Argus-B system's 103 M parameter count.
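As a quick arithmetic check, the head counts in the table do sum to the quoted total:

```python
# Parameter counts from the table above, in millions
params_m = {
    "EUPE-ViT-S backbone (frozen)": 21.59,
    "classifier head": 0.39,
    "segmentation head": 0.06,
    "depth head (DPT decoder)": 13.06,
    "detection head": 2.91,
}
total = sum(params_m.values())
print(f"{total:.2f} M")  # 38.01 M, about a third of Argus-B's 103 M
```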
## Training

All four heads are trained on pre-cached ViT-S features produced by a single forward pass over each target dataset; the backbone stays frozen throughout.
| Head | Dataset | Input | Recipe | Result |
|---|---|---|---|---|
| Classifier | ImageNet-1k train | 224 px CLS token | SGD, lr 30, WD 0, cosine, 30 epochs | 82.87 % train top-1 / 79.13 % val top-1 / 95.53 % val top-5 |
| Segmentation | ADE20K (20,210 train / 2,000 val) | 512 px, 32×32 grid | AdamW, lr 1e-3, 32 epochs (~40k iters) | mIoU 0.419 |
| Depth | NYUv2 (32K train / 5K val) | 416 px, 4 hooked blocks at strides 4/8/16/32 | SILog, AdamW, lr 1e-4 cosine, DPT decoder with reflection-padded 3×3 convs | RMSE 0.537 |
| Detection | COCO train2017 (117 K) | 768 px, 48×48 grid | FCOS targets, AdamW, lr 1e-4, 2 epochs | COCO val2017 mAP 27.3 (AP@50 49.6 · AR@100 43.2); RF100-VL AR@100 0.266 (20-domain subset) |
Training logs per head live alongside the weights (`*_training_log.json`).
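Because the backbone is frozen, each recipe reduces to supervised training on cached features. A sketch of the classifier row above, with synthetic stand-ins for the cached CLS tokens (the real pipeline reads pre-extracted ImageNet features; shapes and loop structure here are illustrative):

```python
import torch
import torch.nn as nn

# Stand-ins for cached features: (N, 384) CLS tokens and ImageNet labels
feats = torch.randn(512, 384)
labels = torch.randint(0, 1000, (512,))

head = nn.Linear(384, 1000)
# lr 30 with zero weight decay, as in the recipe table; very high LRs like
# this are common for linear probes on frozen, well-conditioned features
opt = torch.optim.SGD(head.parameters(), lr=30.0, weight_decay=0.0)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=30)

for epoch in range(30):
    for i in range(0, len(feats), 256):
        x, y = feats[i:i + 256], labels[i:i + 256]
        loss = nn.functional.cross_entropy(head(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()  # cosine decay per epoch
```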
## Files

| File | Contents |
|---|---|
| `cls_head.safetensors` | Linear(384, 1000) classifier |
| `cls_training_log.json` | Classifier training log |
| `seg_head.safetensors` | BN + Conv2d(384, 150, 1) |
| `seg_training_log.json` | Segmentation training log |
| `depth_head.safetensors` | DPT decoder over 4 hooked ViT-S blocks |
| `depth_training_log.json` | Depth training log |
| `det_head.safetensors` | SplitTowerHead (feat_dim=384) |
| `det_training_log.json` | Detection training log |
| `argus_lite.py` | ArgusLite class |
| `infer.py` | CLI dispatcher (6 subcommands) |
## Usage
```python
from argus_lite import ArgusLite

model = ArgusLite.from_pretrained('phanerozoic/argus-lite').cuda().eval()

# Single-image multi-task inference
out = model.perceive('image.jpg')
# out['classification']  {'label': 'tabby', 'score': 0.62, 'top5': [...]}
# out['segmentation']    (512, 512) int array of ADE20K class ids
# out['depth']           (416, 416) float array, metric depth in meters
# out['detection']       [{'box': [x1, y1, x2, y2], 'score': 0.78, 'class_id': 17}, ...]
# out['correspondence']  None (needs a second image)

# Per-task methods
model.classify('image.jpg', top_k=5)
model.segment('street.jpg')                  # (512, 512)
model.depth('room.jpg')                      # (416, 416) metric meters
model.detect('photo.jpg', score_thresh=0.3)  # list of dicts
model.correspond('a.jpg', 'b.jpg')           # cosine-max patch matches

# Paired inference with correspondence populated
out = model.perceive('a.jpg', image_b='b.jpg')
# out['correspondence']  {'matches': (1024,), 'scores': (1024,), 'grid': 32}
```
The CLI dispatcher in `infer.py` exposes six subcommands:

```bash
python infer.py classify cat.jpg
python infer.py segment street.jpg --save seg.png
python infer.py depth room.jpg --save depth.png
python infer.py detect photo.jpg --thresh 0.3
python infer.py correspond a.jpg b.jpg
python infer.py perceive image.jpg --second image2.jpg --save out/
```
Each head runs at its own training resolution: classifier 224 px (CLS token), segmentation 512 px (32×32 patch grid), depth 416 px (26×26 grid, DPT decoder over hooked blocks 2, 5, 8, 11), detection 768 px (48×48 grid). `perceive()` therefore does four backbone forward passes per image.
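The grids quoted above all correspond to a 16-pixel patch size (input side ÷ 16); that patch size is an inference from the numbers rather than a documented constant:

```python
# Patch grid per task, assuming the 16-px patch size implied by the grids above
PATCH = 16
for task, side in [("classify", 224), ("segment", 512), ("depth", 416), ("detect", 768)]:
    print(f"{task}: {side} px -> {side // PATCH}x{side // PATCH} patch grid")
# classify: 224 px -> 14x14 patch grid
# segment: 512 px -> 32x32 patch grid
# depth: 416 px -> 26x26 patch grid
# detect: 768 px -> 48x48 patch grid
```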
Requires `argus.py` from phanerozoic/argus on `sys.path` for the `DinoVisionTransformer` and `SplitTowerHead` classes.
## Cross-domain detection benchmark (RF100-VL subset)
Same 20-domain class-agnostic AR@100 protocol as Argus, evaluated live through the ViT-S backbone at 768 px input.
| Model | Total params | Mean AR@100 |
|---|---|---|
| Argus+FCOS (ViT-B backbone) | 102.1 M | 0.251 |
| Argus-Lite (this model) | ~26.5 M | 0.266 |
| Argus+(current picker, ViT-B) | 89.0 M | 0.289 |
Per-domain numbers live in `rf100vl_results.json`.
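For readers unfamiliar with the metric: class-agnostic AR@100 keeps each image's top-100 detections by score, ignores predicted labels, and averages ground-truth recall over IoU thresholds 0.50:0.95. A single-image sketch (the real protocol uses the COCO evaluator's greedy one-to-one matching, which this simplification skips):

```python
import numpy as np

def iou(box_a, boxes_b):
    """IoU of one [x1, y1, x2, y2] box against an (N, 4) array of boxes."""
    x1 = np.maximum(box_a[0], boxes_b[:, 0])
    y1 = np.maximum(box_a[1], boxes_b[:, 1])
    x2 = np.minimum(box_a[2], boxes_b[:, 2])
    y2 = np.minimum(box_a[3], boxes_b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a + area_b - inter)

def ar_at_100(gt, preds, scores):
    """Class-agnostic AR@100 for one image, averaged over IoU 0.50:0.95."""
    top = preds[np.argsort(-scores)][:100]   # keep top-100 detections by score
    recalls = []
    for thr in np.arange(0.5, 1.0, 0.05):
        matched = sum(1 for g in gt if (iou(g, top) >= thr).any())
        recalls.append(matched / len(gt))
    return float(np.mean(recalls))
```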
## Evaluation details
- Classifier val top-1 is 79.13 %, top-5 95.53 % on the 50K-image ImageNet 2012 validation set, using the TensorFlow Models repo's synset-label mapping for ground truth. This is above the EUPE-ViT-S paper's kNN baseline (78.2 %).
- Detection head: COCO val2017 mAP 0.273 (AP@50 0.496, AP@75 0.268, AR@100 0.432). See `coco_val_eval.json` for the full breakdown, including per-size AP.
- Depth head is a DPT decoder reassembling the four hooked ViT-S block activations (blocks 2, 5, 8, 11) at strides [4, 8, 16, 32], followed by 4 FeatureFusion blocks with residual conv units and a 256-bin depth head. It was trained on NYUv2 (32K train / 5K held-out val split) with SILog loss and AdamW, lr 1e-4 on a cosine schedule. For reference, the same DPT decoder on EUPE-ViT-B (Argus, a 4× larger backbone) reaches 0.391 RMSE on the equivalent split.
- Segmentation head is a linear probe at 5 epochs; the EUPE-ViT-S paper reports mIoU 0.466 at a much longer schedule.
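The SILog loss mentioned above is the scale-invariant log-depth loss of Eigen et al.: it penalises per-pixel log-depth error while discounting a shared global scale. A common formulation (the λ value and the square root are conventional choices, not confirmed from this repo's code):

```python
import torch

def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss: sqrt(mean(d^2) - lam * mean(d)^2),
    where d is the per-pixel log-depth error. With lam = 1 a pure
    global scale error costs nothing; lam < 1 re-penalises it mildly."""
    valid = target > eps                       # mask out invalid / zero depth
    d = torch.log(pred[valid] + eps) - torch.log(target[valid] + eps)
    var_like = (d ** 2).mean() - lam * d.mean() ** 2
    return torch.sqrt(torch.clamp(var_like, min=0.0))
```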
## Source backbone
EUPE-ViT-S from Meta FAIR (arXiv:2603.22387, Zhu et al., March 2026). Three-stage distillation from PEcore-G + PElang-G + DINOv3-H+ via a 1.9 B proxy teacher. License: FAIR Research License (non-commercial).
## Model tree for phanerozoic/argus-lite

Base model: `facebook/EUPE-ViT-S`