| license: apache-2.0 | |
| tags: | |
| - robotics | |
| - lerobot | |
| - pi0 | |
| - vla | |
| - imitation-learning | |
| - so101 | |
| datasets: | |
| - abdul004/so101_ball_in_cup_v5 | |
| pipeline_tag: robotics | |
| # SO-101 Ball-in-Cup Pi0.5 Policy | |
| A fine-tuned [Pi0.5 (π₀.₅)](https://www.physicalintelligence.company/blog/pi05) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm. | |
| ## Task Description | |
| **Goal:** Pick up an orange ball from the table and place it into a pink cup. | |
| **Robot:** [SO-101](https://github.com/huggingface/lerobot/blob/main/examples/10_use_so100.md) - 6-DOF robot arm with gripper | |
| **Cameras:** Dual camera setup (overhead + wrist-mounted) | |
| ## Model Architecture | |
| Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence: | |
| | Component | Description | | |
| |-----------|-------------| | |
| | **Vision Encoder** | SigLIP 400M - processes camera images | | |
| | **Language Model** | Gemma 2B - scene understanding & task grounding | | |
| | **Action Expert** | Flow Matching head - generates smooth action trajectories | | |
| | **Total Parameters** | ~3B | | |
| The model takes a natural-language instruction and camera images as input and outputs continuous joint actions. | |
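To make the "Flow Matching head" row more concrete, here is a minimal sketch of how a flow-matching sampler turns noise into an action vector by Euler-integrating a velocity field. It is not the OpenPi implementation: `toy_velocity`, `TARGET`, the step count, and the convention of running from t=0 (noise) to t=1 (action) are all assumptions chosen for illustration. In the real model the velocity comes from the action expert network conditioned on images and language.

```python
import random


def sample_action(velocity, dim, steps=10, seed=0):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1.0 inside the loop
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 6-value joint command used as the "ground truth" action.
TARGET = [0.1, -0.4, 0.8, 0.0, 0.3, 1.0]


def toy_velocity(x, t):
    """Stand-in for the learned network: points straight at TARGET."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


action = sample_action(toy_velocity, dim=len(TARGET))
print([round(a, 3) for a in action])
```

With this toy field every Euler step moves the sample along a straight line toward `TARGET`, which is the intuition behind the "smooth action trajectories" claim. A learned field only approximates this behaviour.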
| ## Training Details | |
| | Parameter | Value | | |
| |-----------|-------| | |
| | **Base Model** | Pi0.5 (Physical Intelligence) | | |
| | **Dataset** | [abdul004/so101_ball_in_cup_v5](https://huggingface.co/datasets/abdul004/so101_ball_in_cup_v5) | | |
| | **Episodes** | 72 teleoperated demonstrations | | |
| | **Frames** | 25,045 | | |
| | **Fine-tuning Steps** | 5,000 | | |
| | **Hardware** | A100 80GB on RunPod | | |
| | **Training Time** | ~3-4 hours | | |
| | **Cost** | ~$6-8 USD | | |
| | **Framework** | OpenPi (JAX/Flax) | | |
| ## Inference Performance | |
| ### JPEG Compression Optimization | |
| We implemented JPEG compression to reduce network transfer time for remote inference: | |
| | Location | Raw Images | JPEG (Q80) | Speedup | | |
| |----------|-----------|------------|---------| | |
| | EU Spot | 1448ms | 375ms | **3.9x** | | |
| | **US On-Demand** | **600ms** | **270ms** | **2.2x** | | |
| | Metric | Before | After | | |
| |--------|--------|-------| | |
| | **Payload Size** | 1.8 MB | 71 KB | | |
| | **Control Rate (US)** | 1.7 Hz | 3.7 Hz | | |
| | **Compression Ratio** | - | 25x | | |
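The figures in the two tables above can be cross-checked with a few lines of arithmetic. The sketch below assumes the control rate is simply the inverse of the round-trip latency (one blocking inference per control step). The 640x480 RGB resolution for each of the two cameras is also an assumption, since the card does not state the image size; it is shown only because it yields a raw payload close to the reported 1.8 MB.

```python
def control_rate_hz(latency_ms: float) -> float:
    """Control rate if each control step waits for one inference round trip."""
    return 1000.0 / latency_ms


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the 'after' latency is."""
    return before_ms / after_ms


def raw_payload_bytes(width: int, height: int, cameras: int, channels: int = 3) -> int:
    """Size of uncompressed 8-bit images sent per inference request."""
    return width * height * channels * cameras


print(round(control_rate_hz(600), 1))   # raw images, US
print(round(control_rate_hz(270), 1))   # JPEG Q80, US
print(round(speedup(1448, 375), 1))     # EU spot
print(round(speedup(600, 270), 1))      # US on-demand
print(round(1.8 * 1000 / 71, 1))        # reported payload sizes, MB vs KB

# ASSUMED resolution: two 640x480 RGB frames.
print(raw_payload_bytes(640, 480, cameras=2))
```

Under these assumptions the computed rates and speedups match the rounded values in the tables, and the payload ratio comes out at roughly 25x.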
| ### Architecture | |
| ``` | |
| [RunPod GPU Server]             [Robot Mac] | |
| ┌─────────────────┐           ┌──────────────┐ | |
| │  Pi0.5 Model    │◄── WSS ──►│ run_pi05.py  │ | |
| │  (RTX 4090)     │   JPEG    │ (Robot ctrl) │ | |
| └─────────────────┘           └──────────────┘ | |
| ``` | |
| ## Demo | |
| ### With JPEG Compression (~270ms latency) | |
|  | |
| *Side-by-side: Overhead camera (left) + Wrist camera (right) - Smooth 3.7 Hz control* | |
| ### Without JPEG Compression (~600ms latency) | |
|  | |
| *Side-by-side: Same task but with raw image transfer - 1.7 Hz control* | |
| ## Sample Evaluation | |
| ### JPEG Compression (Fast) | |
|  | |
| *5-frame composite: Start → Approach → Grasp → Transport → Final* | |
| ### Raw Images (Slow) | |
|  | |
| *Same task without JPEG optimization* | |
| ## Usage | |
| ### Server Setup (RunPod) | |
| ```bash | |
| # Clone OpenPi fork with JPEG support | |
| git clone https://github.com/abdulrahman004/openpi.git | |
| cd openpi | |
| uv sync | |
| # Download checkpoint | |
| uv run huggingface-cli download abdul004/pi05_so101_checkpoint \ | |
| --include "4999/**" \ | |
| --local-dir checkpoints/pi05_so101 | |
| # Start server | |
| uv run scripts/serve_policy.py --port 8000 \ | |
| policy:checkpoint \ | |
| --policy.config=pi05_so101 \ | |
| --policy.dir=checkpoints/pi05_so101/4999 | |
| ``` | |
| ### Client (Robot Mac) | |
| ```bash | |
| pip install openpi-client | |
| # Run inference with JPEG compression | |
| python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net | |
| # Or without compression (slower) | |
| python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg | |
| ``` | |
| ## External Videos (Phone Capture) | |
| Real-world demonstrations recorded externally during evaluation runs: | |
| ### JPEG Compression (~270ms) | |
|  | |
| *External phone recording showing smooth robot control with JPEG compression* | |
| ### Raw Images (~600ms) | |
|  | |
| *Same task without compression - noticeably slower/choppier control* | |
| ### Edge Retrieval (Out-of-Distribution) | |
|  | |
| *Ball placed at workspace edge - a position that appeared in <10% of training episodes* | |
| ## Comparison with ACT Policy | |
| Trained on the same dataset: | |
| | Policy | Architecture | Inference | Grasp | Generalization | | |
| |--------|-------------|-----------|-------|----------------| | |
| | **Pi0.5** | VLA (3B params) | Remote GPU | ✅ | ✅ Edge positions | |
| | ACT | Transformer (25M) | Local | ✅ | ❌ Center only | |
| ### Edge Retrieval: Pi0.5 vs ACT | |
| **ACT failed at edge positions.** In most of the 72 training episodes the ball was in the central, easily reachable area. When the ball was placed at the edge of the workspace, ACT missed it or failed to reach it at all. | |
| **Pi0.5 succeeds at edge positions** even though it was fine-tuned on the same data. We attribute this to VLA pre-training: | |
| 1. **SigLIP** (vision encoder) was pre-trained on billions of images - understands "ball" and "edge" concepts generally | |
| 2. **Gemma** (language model) provides semantic grounding - "pick up ball" applies regardless of position | |
| 3. **Action Expert** learned smooth motion primitives from diverse robot arms during base model training | |
| The base Pi0.5 model was trained on data from many different robot arms performing various tasks. This gives it a strong prior on reachable workspace and arm kinematics that ACT (trained from scratch) simply doesn't have. | |
| ## Infrastructure Notes | |
| **Remote Inference Setup:** | |
| - Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand) | |
| - Client: Mac Mini M4 controlling SO-101 robot | |
| - Protocol: WebSocket with msgpack serialization | |
| - Optimization: JPEG compression reduces the payload from 1.8 MB to 71 KB per inference | |
| **Known Issues:** | |
| - RTX 4090 is borderline for memory - occasional OOM during model loading | |
| - US datacenters preferred: in our measurements, round trips were ~2.4x faster than EU with raw images and ~1.4x faster with JPEG | |
| - First inference takes 30-60s (JAX JIT compilation) | |
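Because the first inference includes JIT compilation, one way to keep that delay out of the control loop is to send a few throwaway requests before the robot starts moving. The sketch below shows the idea with a timing helper. `warm_up`, `FakePolicy`, and the observation dictionary are hypothetical stand-ins, not part of `run_pi05.py` or the OpenPi client; any callable that takes an observation and returns actions would fit.

```python
import time


def warm_up(infer, dummy_observation, calls=3):
    """Call `infer` a few times before control starts; return latencies in seconds."""
    latencies = []
    for _ in range(calls):
        start = time.perf_counter()
        infer(dummy_observation)
        latencies.append(time.perf_counter() - start)
    return latencies


class FakePolicy:
    """Stand-in for a remote policy: slow on the first call, fast afterwards."""

    def __init__(self):
        self.compiled = False

    def infer(self, observation):
        if not self.compiled:
            time.sleep(0.05)  # simulates the one-off compilation delay
            self.compiled = True
        return {"actions": [0.0] * 6}


policy = FakePolicy()
latencies = warm_up(policy.infer, {"prompt": "pick up the ball"})
print([f"{s * 1000:.1f} ms" for s in latencies])
```

Logging the warm-up latencies also gives a quick check that the steady-state round trip is in the expected range before the arm is enabled.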
| ## Limitations | |
| - Requires GPU server for inference (not yet optimized for edge deployment) | |
| - Sensitive to lighting changes | |
| - 72 training episodes may limit extreme edge case handling | |
| ## Citation | |
| ```bibtex | |
| @misc{so101_pi05_ball_in_cup, | |
| author = {Abdul}, | |
| title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning}, | |
| year = {2026}, | |
| publisher = {Hugging Face}, | |
| url = {https://huggingface.co/abdul004/pi05_so101_checkpoint} | |
| } | |
| ``` | |
| ## Acknowledgments | |
| - [Physical Intelligence](https://www.physicalintelligence.company/) for Pi0.5 and OpenPi | |
| - [LeRobot](https://github.com/huggingface/lerobot) by Hugging Face | |
| - SO-101 robot design community | |