	---
	license: apache-2.0
	tags:
	- robotics
	- lerobot
	- pi0
	- vla
	- imitation-learning
	- so101
	datasets:
	- abdul004/so101_ball_in_cup_v5
	pipeline_tag: robotics
	---

	# SO-101 Ball-in-Cup Pi0.5 Policy

	A fine-tuned [Pi0.5 (π₀.₅)](https://www.physicalintelligence.company/blog/pi05) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm.

	## Task Description

	Goal: Pick up an orange ball from the table and place it into a pink cup.

	Robot: [SO-101](https://github.com/huggingface/lerobot/blob/main/examples/10_use_so100.md) - 6-DOF robot arm with gripper

	Cameras: Dual camera setup (overhead + wrist-mounted)

	## Model Architecture

	Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence:

	| Component | Description |
	|-----------|-------------|
	| Vision Encoder | SigLIP 400M - processes camera images |
	| Language Model | Gemma 2B - scene understanding & task grounding |
	| Action Expert | Flow Matching head - generates smooth action trajectories |
	| Total Parameters | ~3B |

	The model takes a natural-language instruction and camera images as input and outputs continuous joint actions.
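
To give a feel for how a flow-matching head turns noise into an action, here is a toy sketch in plain Python. It is not the Pi0.5 implementation: `toy_velocity` is a hypothetical closed-form stand-in for the learned network, the target action is made up, and the time convention (noise at t=0, action at t=1) is an assumption chosen for readability.

```python
import random
from typing import Callable, List

Vector = List[float]


def toy_velocity(x: Vector, t: float, target: Vector) -> Vector:
    """Hypothetical stand-in for the learned network.

    Returns the velocity that moves x along a straight line so that it
    reaches `target` at t = 1.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_action(
    velocity: Callable[[Vector, float], Vector],
    dim: int,
    num_steps: int = 10,
    seed: int = 0,
) -> Vector:
    """Integrate the velocity field from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest t used is (num_steps - 1) / num_steps, so 1 - t > 0
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up 6-value joint command, used only to drive the toy velocity field.
target = [0.1, -0.4, 0.7, 0.0, 0.3, 1.0]
action = sample_action(lambda x, t: toy_velocity(x, t, target), dim=len(target))
print(action)
```

In the real model the velocity comes from a network conditioned on the images, the instruction, and the robot state, so the sample lands on a plausible action rather than a fixed target.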

	## Training Details

	| Parameter | Value |
	|-----------|-------|
	| Base Model | Pi0.5 (Physical Intelligence) |
	| Dataset | [abdul004/so101_ball_in_cup_v5](https://huggingface.co/datasets/abdul004/so101_ball_in_cup_v5) |
	| Episodes | 72 teleoperated demonstrations |
	| Frames | 25,045 |
	| Fine-tuning Steps | 5,000 |
	| Hardware | A100 80GB on RunPod |
	| Training Time | ~3-4 hours |
	| Cost | ~$6-8 USD |
	| Framework | OpenPi (JAX/Flax) |
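
As a quick sanity check on the numbers in this table, the sketch below derives the average episode length and the hourly GPU rate implied by the cost and time ranges. Pairing the low cost with the low duration (and high with high) is an assumption; the helper names are illustrative only.

```python
def frames_per_episode(total_frames: int, episodes: int) -> float:
    """Average number of frames in one demonstration."""
    return total_frames / episodes


def implied_hourly_rate(cost_usd: float, hours: float) -> float:
    """Hourly price implied by a total cost and a training duration."""
    return cost_usd / hours


avg_frames = frames_per_episode(25_045, 72)
low_rate = implied_hourly_rate(6.0, 3.0)   # low end of both ranges (assumed pairing)
high_rate = implied_hourly_rate(8.0, 4.0)  # high end of both ranges (assumed pairing)

print(f"{avg_frames:.1f} frames per episode on average")
print(f"${low_rate:.2f}-${high_rate:.2f} per GPU hour implied")
```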

	## Inference Performance

	### JPEG Compression Optimization

	We implemented JPEG compression to reduce network transfer time for remote inference:

	| Location | Raw Images (latency) | JPEG Q80 (latency) | Speedup |
	|----------|----------------------|--------------------|---------|
	| EU Spot | 1448 ms | 375 ms | 3.9x |
	| US On-Demand | 600 ms | 270 ms | 2.2x |

	| Metric | Before (raw) | After (JPEG Q80) |
	|--------|--------------|------------------|
	| Payload Size | 1.8 MB | 71 KB |
	| Control Rate (US) | 1.7 Hz | 3.7 Hz |
	| Compression Ratio | - | 25x |
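
The speedup and control-rate figures follow directly from the measured latencies. The sketch below reproduces them, assuming the control loop sends one request per step and waits for the reply, so the control rate is simply the inverse of the latency. The function names are illustrative.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the 'after' latency is."""
    return before_ms / after_ms


def control_rate_hz(latency_ms: float) -> float:
    """Control rate of a blocking loop that does one inference per step."""
    return 1000.0 / latency_ms


# (raw latency in ms, JPEG latency in ms), taken from the latency table above.
measurements = {"EU Spot": (1448, 375), "US On-Demand": (600, 270)}

for location, (raw_ms, jpeg_ms) in measurements.items():
    print(
        f"{location}: {speedup(raw_ms, jpeg_ms):.1f}x faster, "
        f"{control_rate_hz(raw_ms):.1f} Hz -> {control_rate_hz(jpeg_ms):.1f} Hz"
    )
```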

	### Architecture

	```
	[RunPod GPU Server]             [Robot Mac]
	┌─────────────────┐             ┌──────────────┐
	│ Pi0.5 Model     │◄── WSS ────►│ run_pi05.py  │
	│ (RTX 4090)      │    JPEG     │ (Robot ctrl) │
	└─────────────────┘             └──────────────┘
	```

	## Demo

	### With JPEG Compression (~270ms latency)

	![Evaluation Demo - JPEG](eval_demo_jpeg.gif)
	Side-by-side: Overhead camera (left) + Wrist camera (right) - Smooth 3.7 Hz control

	### Without JPEG Compression (~600ms latency)

	![Evaluation Demo - Raw](eval_demo_raw.gif)
	Side-by-side: Same task but with raw image transfer - 1.7 Hz control

	## Sample Evaluation

	### JPEG Compression (Fast)
	![Evaluation Composite - JPEG](eval_composite_jpeg.png)
	5-frame composite: Start → Approach → Grasp → Transport → Final

	### Raw Images (Slow)
	![Evaluation Composite - Raw](eval_composite_raw.png)
	Same task without JPEG optimization

	## Usage

	### Server Setup (RunPod)

	```bash
	# Clone OpenPi fork with JPEG support
	git clone https://github.com/abdulrahman004/openpi.git
	cd openpi
	uv sync

	# Download checkpoint
	uv run huggingface-cli download abdul004/pi05_so101_checkpoint \
	  --include "4999/**" \
	  --local-dir checkpoints/pi05_so101

	# Start server
	uv run scripts/serve_policy.py --port 8000 \
	  policy:checkpoint \
	  --policy.config=pi05_so101 \
	  --policy.dir=checkpoints/pi05_so101/4999
	```

	### Client (Robot Mac)

	```bash
	pip install openpi-client

	# Run inference with JPEG compression
	python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net

	# Or without compression (slower)
	python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg
	```
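
The source of `run_pi05.py` is not reproduced in this README. As a rough illustration of how the two flags above could be wired up, here is a minimal `argparse` sketch; the `use_jpeg` attribute name and the help strings are assumptions, not the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Run Pi0.5 inference against a remote server")
    parser.add_argument("--server", required=True, help="WebSocket URL of the policy server")
    # JPEG compression is on by default; --no-jpeg switches to raw images.
    parser.add_argument(
        "--no-jpeg",
        dest="use_jpeg",
        action="store_false",
        help="send raw images instead of JPEG-compressed ones",
    )
    return parser


args = build_parser().parse_args(["--server", "wss://YOUR-POD-8000.proxy.runpod.net"])
print(args.server, args.use_jpeg)
```

Making compression the default and exposing an opt-out flag keeps the fast path the one users get without extra arguments.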

	## External Videos (Phone Capture)

	Real-world demonstrations recorded externally during evaluation runs:

	### JPEG Compression (~270ms)
	![External Video - JPEG](external_jpeg.gif)
	External phone recording showing smooth robot control with JPEG compression

	### Raw Images (~600ms)
	![External Video - Raw](external_raw.gif)
	Same task without compression - noticeably slower/choppier control

	### Edge Retrieval (Out-of-Distribution)
	![External Video - Edge](external_edge.gif)
	Ball placed at workspace edge - a position that appeared in <10% of training episodes

	## Comparison with ACT Policy

	Trained on the same dataset:

	| Policy | Architecture | Inference | Grasp | Generalization |
	|--------|--------------|-----------|-------|----------------|
	| Pi0.5 | VLA (3B params) | Remote GPU | ✅ | ✅ Edge positions |
	| ACT | Transformer (25M) | Local | ✅ | ❌ Center only |

	### Edge Retrieval: Pi0.5 vs ACT

	ACT failed at edge positions. It was trained from scratch on the same 72 episodes, in which the ball sat mostly in the central, easily reachable area. When the ball was placed at the edge of the workspace, ACT would miss it or fail to reach it entirely.

	Pi0.5 succeeds at edge positions despite having the same training data. A plausible explanation is VLA pre-training:

	1. SigLIP (vision encoder) was pre-trained on billions of images - understands "ball" and "edge" concepts generally
	2. Gemma (language model) provides semantic grounding - "pick up ball" applies regardless of position
	3. Action Expert learned smooth motion primitives from diverse robot arms during base model training

	The base Pi0.5 model was trained on data from many different robot arms performing various tasks. This gives it a strong prior on reachable workspace and arm kinematics that ACT (trained from scratch) simply doesn't have.

	## Infrastructure Notes

	Remote Inference Setup:
	- Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand)
	- Client: Mac Mini M4 controlling SO-101 robot
	- Protocol: WebSocket with msgpack serialization
	- Optimization: JPEG compression reduces 1.8MB → 71KB per inference
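
The raw payload size is consistent with two uncompressed RGB frames. The sketch below shows the arithmetic under an assumed camera resolution of 640x480 at 8 bits per channel; the resolution is not stated in this README, so treat it as a guess that happens to land near 1.8 MB.

```python
def raw_payload_bytes(
    width: int,
    height: int,
    channels: int = 3,
    num_cameras: int = 2,
    bytes_per_value: int = 1,
) -> int:
    """Size of the uncompressed images sent per inference request."""
    return width * height * channels * num_cameras * bytes_per_value


raw_bytes = raw_payload_bytes(640, 480)  # assumed resolution, two cameras
jpeg_bytes = 71_000                      # 71 KB, as reported above
ratio = raw_bytes / jpeg_bytes

print(f"raw: {raw_bytes / 1e6:.2f} MB, ratio: {ratio:.0f}x")
```

Under this assumption the ratio comes out slightly above the reported 25x, which is expected given the rounding of both sizes.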

	Known Issues:
	- RTX 4090 is borderline for memory - occasional OOM during model loading
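	- US datacenters preferred: in the measurements above, latency was about 2.4x lower than EU with raw images (600 ms vs 1448 ms) and about 1.4x lower with JPEG (270 ms vs 375 ms)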
	- First inference takes 30-60s (JAX JIT compilation)
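
One way to keep the slow first inference out of the control loop is to send a throwaway request before the robot starts moving. The sketch below illustrates the idea with a hypothetical `infer` callable and a made-up observation dictionary; `fake_infer` only simulates a delay and is not the OpenPi client.

```python
import time
from typing import Any, Callable, Dict


def warm_up(infer: Callable[[Dict[str, Any]], Any], dummy_observation: Dict[str, Any]) -> float:
    """Run one throwaway inference and return how long it took, in seconds."""
    start = time.monotonic()
    infer(dummy_observation)  # result is discarded; this call absorbs the compilation cost
    return time.monotonic() - start


def fake_infer(observation: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a remote policy call; sleeps briefly to mimic a slow first request."""
    time.sleep(0.05)
    return {"actions": [0.0] * 6}


elapsed = warm_up(fake_infer, {"prompt": "warm-up"})
print(f"warm-up took {elapsed:.2f} s")
```

With the real server, any client-side timeout would need to tolerate the 30-60 s compilation on this first call.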

	## Limitations

	- Requires GPU server for inference (not yet optimized for edge deployment)
	- Sensitive to lighting changes
	- 72 training episodes may limit extreme edge case handling

	## Citation

	```bibtex
	@misc{so101_pi05_ball_in_cup,
	author = {Abdul},
	title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/abdul004/pi05_so101_checkpoint}
	}
	```

	## Acknowledgments

	- [Physical Intelligence](https://www.physicalintelligence.company/) for Pi0.5 and OpenPi
	- [LeRobot](https://github.com/huggingface/lerobot) by Hugging Face
	- SO-101 robot design community