Ropedia Xperience-10M Task Suite

A research and development project built on the public Xperience-10M sample episode released by Ropedia. The goal is to make one richly multimodal egocentric episode understandable, turn it into concrete embodied-AI task definitions, and prepare the same pipeline for future held-out multi-episode training.

The central research questions are:

What can be learned from one aligned Xperience-10M episode while separating sample-specific observations from later multi-episode questions?
Which input/output tasks are meaningful for embodied AI when video, depth, pose, mocap, IMU, and language annotations are synchronized?
What baseline models and evaluation files should exist before scaling to Qwen3-Omni or other multimodal foundation-model fine-tuning?

Why This Project Exists

This project is organized as a compact research artifact around Xperience-10M: start from a real public episode, make every modality and label path inspectable, turn the data into concrete embodied-AI tasks, and keep the evaluation boundary clear while preparing the next multi-episode experiments. The emphasis is on research judgment as much as implementation: what the sample can show, what it cannot show, and what evidence should exist before claiming model quality.

The work is designed to demonstrate four capabilities that matter for embodied-AI research infrastructure:

Capability	What this project shows
Multimodal data understanding	Parses the public sample into synchronized windows across video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals
Task design	Defines 12 human-readable tasks plus four direction-extension probes with inputs, outputs, process modules, metrics, and case-study walkthroughs
Model and evaluation discipline	Runs minimal and compact neural baselines, records predictions/metrics, keeps chronological split boundaries explicit, and separates sample evidence from held-out claims
Scale-up planning	Connects the public-sample pipeline to 32/128-episode held-out pilots, Qwen3-Omni LoRA, Cosmos-style world-model branches, policy-model branches, and the future Xperience-native foundation-model pretraining goal

Start Here

For a first pass, use PROJECT_BRIEF.md or the machine-readable docs/data/project_brief.json. They give the project shape on one page: what exists now, what the public sample can support, where the 12 tasks and baselines live, and what must happen before the multi-episode omni-model stage becomes a real held-out evaluation.

Reader goal	Best entry point
Understand the whole project quickly	`PROJECT_BRIEF.md`
See the visual research dashboard	GitHub Pages dashboard
Navigate the 12 tasks, four tracks, and scale-up plan	Interactive research roadmap, `docs/data/research_roadmap_interactive.json`
Compare current task metrics	`RESEARCH_TAKEAWAYS.md`, `docs/data/summary_metrics.json`
Compare possible foundation backbones	`FOUNDATION_MODEL_PLAN.md`, `docs/data/foundation_model_plan.json`
Understand the future native pretraining goal	`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`
See additional concrete project directions	`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`, `docs/data/additional_development_directions.json`
Understand one model input	`results/episode_task_suite/feature_manifest.json`, `results/episode_task_suite/windows.csv`
Check multi-episode data status	`results/omni_finetune/DATA_ACCESS_STATUS.md`

Research Project Overview

Theme	Current implementation
Dataset slice	One public Xperience-10M sample episode, 5,821 frames, 1,161 windows, and an 8,546-dimensional representation
Modalities	Video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, calibration, and language annotations
Task suite	12 human-readable embodied-AI task contracts with input, process, output, metrics, predictions, and case-study walkthroughs
Baselines	Minimal linear/ridge/logistic heads plus compact PyTorch MLP task heads over the same chronological split
Research directions	Task mapping and extension probes for human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling
Scale-up path	The gated Xperience-10M dataset is available for a selected 128-episode pilot before Qwen3-Omni LoRA, followed by Cosmos 3/world-model and VLA/policy branches; the long-term goal is an Xperience-native embodied foundation model if full-corpus data, storage, and compute are available
Public surfaces	GitHub repo, GitHub Pages dashboard, GHCR static-site package, HF Space, HF artifact dataset, HF baseline-model repo, and HF collection

For the fastest interpretation of the current metrics, start with RESEARCH_TAKEAWAYS.md and docs/data/research_takeaways.json. They summarize what the public sample results actually show: class shift under chronological splits, neural gains on dynamics/order/alignment, harder retrieval/reconstruction probes, and why the next model-quality step needs held-out episodes.

Current contributions:

manifested sliding-window features over the currently extracted modalities,
motion-only and current all-feature baseline models,
12 end-to-end episode-level tasks,
lightweight neural MLP heads for the same 12 task contracts,
a generated four-direction research taxonomy matching the Ropedia job tracks,
four additional direction-extension probes with minimal and neural baselines,
human-readable research task cards and an interactive scrub/play walkthrough storyboard for every task,
an interactive research roadmap connecting 12 tasks, four research tracks, current sample evidence, the Qwen3-Omni scale-up path, and foundation-model branch selection,
a next-milestone track for Qwen3-Omni fine-tuning, Cosmos 3 world modeling, and sensor-bridge evaluation,
a future pretraining plan for an Xperience Embodied Foundation Model over the full corpus after smaller multi-episode stages prove value,
metrics, predictions, model weights, manifests, charts, and a two-level tabbed static research website,
a clear explanation of what is implemented now and what moves to the multi-episode stage.

Current Research Scope

This project is best read as a staged embodied-AI research study:

Layer	Current scope	Where to start
Data understanding	One public Xperience-10M sample episode is converted into 5,821 frames, 1,161 aligned windows, and an 8,546-dimensional multimodal representation.	`PROJECT_BRIEF.md`, `PROJECT_STATUS.md`
Task suite	Twelve human-readable tasks cover action, procedure, contact, object, language, retrieval, reconstruction, order, and synchronization questions.	`RESEARCH_TAKEAWAYS.md`, `results/episode_task_suite/summary_report.json`
Baselines	Minimal heads and compact PyTorch MLP heads provide a first controlled comparison on the same chronological split.	`results/episode_task_suite/neural_mlp/`
Diagnostics	Audio contribution, modality ablations, timeline overlays, object labels, and alignment stress tests show which signals are useful and which tasks remain hard.	`results/audio_ablation/AUDIO_ABLATION_SUMMARY.md`, `docs/single_episode_explorer.html`
Scale-up	A selected 128-episode Qwen3-Omni LoRA pilot is being prepared from the gated dataset; held-out model metrics will be added only after training and evaluation finish. The long-term native-pretraining plan is documented separately as a future research goal.	`RESEARCH_ROADMAP.md`, `FOUNDATION_MODEL_PLAN.md`, `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`, `results/omni_finetune/DATA_ACCESS_STATUS.md`

Detailed dataset notes, reproduction checks, and generated JSON reports are included for readers who want to inspect the implementation, but they are supporting materials rather than the main reading path. Use ARTIFACT_GUIDE.md when you want the full file map.

Project Status

If you only have one minute, use PROJECT_STATUS.md and docs/data/project_status.json. They give the current research state in one compact table:

Area	Current decision
Public-sample pipeline	Verified on one public sample episode: 5,821 frames, 1,161 windows, 8,546 dimensions
12-task suite	Verified minimal baselines with committed metrics, predictions, and manifests
Neural heads	Verified compact PyTorch MLP heads over the same task contracts and chronological splits
Dataset context	Official Xperience-10M links, sample-vs-gated-data boundary, modality coverage, and redistribution policy are documented
Evaluation protocol	Verified generated protocol for windowing, split policy, leakage controls, and per-task metrics
Website and Hub pages	Public dashboard, Hugging Face Space, artifact dataset, baseline model repo, and collection use the same project framing and links
Qwen3-Omni multi-episode pilot	The gated Xperience-10M dataset is available for selected 128-episode preparation, with full metrics pending completed preprocessing, training, and held-out evaluation
Raw Xperience-10M data / full Qwen weights	Not redistributed

90-Second Research Project Path

If you are reading the project cold, open these in order:

Step	Question	Primary artifacts	What should be true
1	What is this project?	`PROJECT_BRIEF.md`, `PROJECT_STATUS.md`, dashboard	A public-sample Xperience-10M research project with 12 tasks, baselines, and a scale-up plan.
2	What data is used?	`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`, official HF dataset, sample HF dataset	The implemented suite uses one public sample episode; the gated dataset is reserved for selected multi-episode training.
3	What does one model input contain?	`windows.csv`, `feature_manifest.json`, `available_modalities.json`	Each window is an aligned multimodal unit with video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals.
4	What are the 12 tasks?	`results/episode_task_suite/task_walkthroughs/`, `docs/data/task_walkthroughs.json`	Every task has a human-readable name, case study, input, process modules, output, metric, and limitation.
5	How are tasks evaluated?	`EVALUATION_PROTOCOL.md`, `docs/data/evaluation_protocol.json`	The window unit, chronological split, leakage controls, task metrics, and current limitations are explicit.
6	What do the current results mean?	`RESEARCH_TAKEAWAYS.md`, `docs/data/research_takeaways.json`, `docs/data/summary_metrics.json`	Current metrics describe sample-level task behavior and identify which signals need larger held-out experiments.
7	Which models are implemented?	`results/episode_task_suite/summary_report.json`, `results/episode_task_suite/neural_mlp/`, HF baseline repo	Each task has minimal and neural-head evidence over the same feature windows.
8	What research directions does this support?	`RESEARCH_ROADMAP.md`, `docs/data/research_directions.json`, `docs/data/research_direction_extensions.json`	The tasks are mapped to human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling.
9	Which foundation model comes next?	`FOUNDATION_MODEL_PLAN.md`, `docs/data/foundation_model_plan.json`, `XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`	Qwen3-Omni is the first held-out LoRA baseline; Cosmos 3 is the first world-model branch; policy models wait for explicit action targets; Xperience-native pretraining is the full-corpus future goal.
10	How do I reproduce it?	`REPRODUCIBILITY.md`, `notes/reproducibility_audit.md`	Public commands and expected outputs are documented for the sample-episode task suite.
11	What is still pending?	`DATA_ACCESS_STATUS.md`, `MULTI_EPISODE_ACCESS_STATUS.md`	Multi-episode Qwen3-Omni model quality will be reported after preprocessing, training, and held-out evaluation complete.

A compact reader-path summary is available at docs/data/project_packet.json.

Supporting Files

ARTIFACT_GUIDE.md is the human-readable map for readers who want to inspect the project files after the first pass. It groups the main briefs, task outputs, baseline results, visual assets, data notes, and scale-up documents.

docs/data/artifact_index.json is the compact machine-readable companion used by the website and Hugging Face artifact dataset.

Evaluation Protocol

EVALUATION_PROTOCOL.md and docs/data/evaluation_protocol.json are generated from committed metric artifacts. They define:

the 20-frame window unit, stride, feature dimension, and raw-data policy,
the chronological 70/30 single-episode split and its generalization limit,
the per-task input, target, primary metric, minimal score, and neural score,
leakage controls for future labels, target-side signals, caption/object labels, and train-only normalization,
current limitations, including cross-episode generalization, audio-visual learning, pixel-depth reconstruction, and real held-out multi-episode Qwen3-Omni quality.
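
The window and split arithmetic behind these items can be checked in a few lines. This is a minimal sketch, not the repository's implementation: it assumes a stride of 5 frames (the stride named in the Sample Count Decision table), no padding of a final partial window, and a split boundary at floor(0.7 x windows). All function names are hypothetical.

```python
from statistics import mean, pstdev


def count_windows(n_frames: int, window: int = 20, stride: int = 5) -> int:
    """Number of complete windows that fit in n_frames (assumes no padding)."""
    if n_frames < window:
        return 0
    return (n_frames - window) // stride + 1


def chronological_split(n_windows: int, train_fraction: float = 0.7):
    """Earlier windows go to train, later windows to test; nothing is shuffled."""
    cut = int(n_windows * train_fraction)  # assumed rounding rule
    return range(0, cut), range(cut, n_windows)


def fit_train_only_scaler(values, train_idx):
    """Mean/std from training windows only, so test statistics never leak in."""
    train_values = [values[i] for i in train_idx]
    std = pstdev(train_values)
    return mean(train_values), (std if std > 0 else 1.0)


n = count_windows(5821)
train_idx, test_idx = chronological_split(n)
print(n, len(train_idx), len(test_idx))  # 1161 812 349
```

Under these assumptions the 5,821-frame sample yields the 1,161 windows reported above. The exact train/test boundary in the committed artifacts may differ by a window depending on the rounding rule the scripts use.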

Dataset Context

The official ropedia-ai/xperience-10m dataset is a gated large-scale egocentric multimodal dataset for embodied AI, robotics, spatial intelligence, and world modeling. The public ropedia-ai/xperience-10m-sample repo provides the sample episode used for the implemented task suite here.

This project keeps those layers separate: the public sample supports the current 12-task study, while the gated full dataset is used only for the selected multi-episode Qwen3-Omni pilot. Raw Xperience-10M MP4/HDF5/RRD files are not redistributed in this repo or in the Hugging Face mirrors.
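
One way to spot-check the no-redistribution boundary locally is to scan a checkout or a prepared mirror bundle for raw-data file suffixes. The sketch below is an illustration only: the function name and the suffix list are assumptions derived from the sentence above, and it is not the logic of the repository's own validators.

```python
from pathlib import Path

# Suffixes treated as raw Xperience-10M payloads (assumed from "MP4/HDF5/RRD").
RAW_SUFFIXES = {".mp4", ".hdf5", ".rrd"}


def find_raw_files(root: str) -> list:
    """Return POSIX-style relative paths under root that look like raw data."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.rglob("*")
        if p.is_file() and p.suffix.lower() in RAW_SUFFIXES
    )
```

An empty list from `find_raw_files(".")` is consistent with the policy; any hit deserves a manual look before publishing.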

The current verified public-sample subset is:

one public sample episode, 5,821 frames, and 1,161 aligned windows,
raw sample files with six MP4 video streams and audio streams,
annotation.hdf5 carrying depth, SLAM/camera pose, hand/body mocap, IMU, language/caption annotations, calibration, metadata, and timing records,
an 8,546-dimensional baseline representation using video, audio, depth, pose/SLAM, mocap, IMU, calibration, and language-derived signals.

Detailed dataset notes are available in XPERIENCE10M_DATASET_CARD_ALIGNMENT.md for readers who need the full upstream-card and access-term context. The practical boundary is simple: current results come from the public sample, and multi-episode model quality is pending the selected held-out pilot.

Start with the visual dashboard:

chaoyue0307.github.io/ropedia-xperience-10m-task-suite

Hugging Face Space app:

cy0307-ropedia-xperience-10m-task-suite.static.hf.space

Read This Project In Three Layers

Layer	What to inspect	Why it matters
Project status	`PROJECT_STATUS.md`, `docs/data/project_status.json`	Gives a one-table current project summary before reading the full artifact trail
Data contract	`windows.csv`, `feature_manifest.json`, modality manifests	Confirms what each sample window contains before modeling
Dataset context	`XPERIENCE10M_DATASET_CARD_ALIGNMENT.md`, official dataset links	Explains the official dataset, public sample, modalities, access boundary, and what this repo uses
Visual assets	`FIGURE_INDEX.md`, `docs/assets/`	Shows the task-suite graphic, modality thumbnails, pipeline diagrams, charts, and logo assets
Evaluation protocol	`EVALUATION_PROTOCOL.md`, `docs/data/evaluation_protocol.json`	Defines the task unit, split, metrics, leakage controls, and current limitations
Research roadmap	`RESEARCH_ROADMAP.md`, `docs/data/research_roadmap.json`	Shows the path from sample-level task development to multi-episode work, larger model branches, and the future native-pretraining goal
Additional development directions	`ADDITIONAL_DEVELOPMENT_DIRECTIONS.md`, `docs/data/additional_development_directions.json`	Records concrete non-backbone tracks: taxonomy, benchmark protocol, representation learning, skill graphs, affordances, 3D/4D memory, QA, and policy transfer
Xperience Embodied Foundation Model plan	`XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md`	Describes the long-term full-corpus pretraining goal, target modules, objectives, staged scale-up, hardware ranges, and evaluation protocol
Minimal heads	softmax, ridge projection/regression, multi-label logistic heads	Keeps every input/output contract visible and inspectable
Neural heads	PyTorch MLP classifiers/regressors under `neural_mlp/`	Checks whether nonlinear heads improve each task without changing features
Evidence	metrics, predictions, confusion matrices, diagrams, dashboard	Makes the single-episode task development inspectable without rerunning first
Artifact guide	`ARTIFACT_GUIDE.md`	Groups the public evidence into research-project layers after the first-pass overview
Reproducibility contract	`REPRODUCIBILITY.md`, `docs/data/reproducibility_matrix.json`	States public commands, expected outputs, exact-match reproduction evidence, and non-reproducible boundaries
Citation metadata	`CITATION.cff`, `codemeta.json`, `LICENSE`	Makes the repo easier to cite, index, and reuse without confusing code license and dataset terms

Links

Resource	Link
This GitHub repo	github.com/ChaoYue0307/ropedia-xperience-10m-task-suite
This project website	chaoyue0307.github.io/ropedia-xperience-10m-task-suite
This Hugging Face Space	huggingface.co/spaces/cy0307/ropedia-xperience-10m-task-suite
Live Hugging Face static app	cy0307-ropedia-xperience-10m-task-suite.static.hf.space
GitHub Container package	ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite
Derived artifacts on Hugging Face	huggingface.co/datasets/cy0307/ropedia-xperience-10m-task-suite-artifacts
Minimal and neural task baselines on Hugging Face	huggingface.co/cy0307/ropedia-xperience-10m-task-baselines
Hugging Face collection	huggingface.co/collections/cy0307/ropedia-xperience-10m-task-suite
Xperience-10M dataset website	ropedia.com/dataset
Xperience-10M release page	ropedia.com/blog/20260316_xperience_10m
Ropedia GitHub organization	github.com/Ropedia
HOMIE Toolkit	github.com/Ropedia/HOMIE-toolkit
Xperience-10M Hugging Face dataset	huggingface.co/datasets/ropedia-ai/xperience-10m
Xperience-10M sample on Hugging Face	huggingface.co/datasets/ropedia-ai/xperience-10m-sample
Ropedia Hugging Face organization	huggingface.co/ropedia-ai

Citation, License, And Metadata

Use CITATION.cff when citing this project. The repository also includes codemeta.json for machine-readable software metadata and docs/data/project_manifest.json for website/Hugging Face surface metadata.

The code files are MIT-licensed. Raw Xperience-10M data is not redistributed here, and dataset use remains governed by the official Ropedia/Xperience-10M terms. See LICENSE and DATA_NOTICE.md.

The infographic uses a custom text-free research background and puts the shared processing contract plus all 12 task families before the modality atlas. Public-sample modality thumbnails remain enlarged below the task map. The task names, input/output summaries, and metrics are overlaid from results/episode_task_suite/summary_report.json with scripts/render_task_suite_infographic.py, so the published PNG is a presentation graphic with verified labels and metrics, not a hallucinated metric sheet.

The website also includes a responsive native modality atlas backed by docs/data/modality_atlas.json and docs/assets/modalities/. Those assets are small derived thumbnails from the public sample, not raw Xperience-10M files.

The pipeline and architecture figures use the same pattern: text-free visual backgrounds carry the composition, while scripts/render_overview_figures.py overlays exact labels, dimensions, and metrics from the committed result files.

Scope

This is a learning, inspection, and pipeline-validation repo built from one public sample episode. The next model-quality stage is to run the same suite over many episodes and split train/test by held-out episode.

What Is Inside

scripts/
  train_min_action_model.py         # motion/IMU baseline
  train_all_modalities_model.py     # current all-feature lightweight baseline
  episode_task_suite.py             # 12 end-to-end task definitions
  neural_task_models.py             # optional PyTorch MLP heads for all 12 tasks
  research_direction_taxonomy.py    # maps 12 tasks to the four research tracks
  research_direction_extension_tasks.py # one extra data-backed probe per track
  task_walkthroughs.py              # human-readable task-card and walkthrough-storyboard metadata
  generate_visualizations.py        # refreshes SVG charts + summary JSON
  render_task_suite_infographic.py  # renders the task-suite presentation PNG
  export_modality_atlas_assets.py   # exports responsive modality-card assets
  render_overview_figures.py        # renders polished pipeline/architecture PNGs
  build_brand_assets.py             # derives logo sizes, favicon, social card
  build_artifact_index.py           # builds the compact artifact guide data
  build_quality_gates.py            # builds release checks
  validate_mirror_parity.py         # checks prepared GitHub/HF mirror file parity
  validate_scope_claims.py          # separates setup artifacts from completed model metrics
  validate_task_surface.py          # checks readable task cards and interactive storyboard wiring
  validate_website_integrity.py     # checks local site links, anchors, and images
  validate_publication_package.py   # checks public repo + HF bundle contents
  publish_hf_bundles.py             # uploads prepared HF Space/artifact/model bundles
  omni/
    download_sample_modelscope.py   # ModelScope sample download helper
    build_episode_manifest.py       # metadata-only multi-episode scanner
    plan_finetune_sample_budget.py  # storage/sample-count planner
    qwen3_omni_adapter_smoke.py     # real-data Qwen3-Omni adapter setup check

results/
  min_action_model/                 # motion-only action baseline artifacts
  min_subtask_model/                # motion-only subtask baseline artifacts
  min_all_modalities_action_model/  # current all-feature action artifacts
  min_all_modalities_subtask_model/ # current all-feature subtask artifacts
  episode_task_suite/               # 12-task suite metrics and predictions
    neural_mlp/                     # optional neural baseline artifacts per task
    research_directions/            # four-track taxonomy, CSV, and summary
    research_direction_extensions/  # four extra direction probes + predictions
    task_walkthroughs/              # case-study walkthroughs for all 12 tasks
  omni_exploration/                 # ModelScope readiness-check artifacts

docs/
  index.html                        # GitHub Pages dashboard
  data/additional_development_directions.json # concrete non-backbone project directions
  data/summary_metrics.json         # website-readable metrics bundle
  data/evidence_contract.json       # machine-readable project scope
  data/artifact_index.json          # compact project-artifact catalog
  data/live_publication_status.json # live GitHub/HF publication verification
  data/quality_gates.json           # machine-readable release checks
  data/task_surface_integrity.json  # machine-readable task-card/storyboard integrity check
  data/project_manifest.json        # machine-readable public-surface metadata
  data/project_packet.json          # compact project path and scope summary
  data/research_roadmap.json        # multi-episode and omni-model roadmap
  data/research_directions.json     # four-track website data bundle
  data/research_direction_extensions.json # four extra probe data bundle
  data/task_walkthroughs.json       # human-readable task-card and walkthrough-storyboard data
  data/modality_atlas.json          # responsive modality-card data
  assets/brand/*.png                # project logo, favicon, social card
  assets/task_suite_infographic.png # 12-task presentation graphic
  assets/modalities/                # public-sample derived modality thumbnails
  assets/pipeline_diagram.png       # verified episode pipeline graphic
  assets/qwen3_omni_lora_pipeline.png # Qwen3-Omni LoRA training-flow figure
  assets/task_architectures.png     # verified 12-task minimal architecture map
  assets/charts/*.svg               # regenerated visualizations

notes/
  min_action_model.md
  all_modalities_model.md
  episode_task_suite.md

Raw Xperience-10M data is not committed. Download it from the official Ropedia distribution and follow the dataset terms.

GitHub Package

The public dashboard is packaged as a static-site container on GitHub Container Registry. It contains the docs/ site plus the main reader documents; it does not include raw Xperience-10M videos, raw annotations, gated data, or model weights.

docker pull ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest
docker run --rm -p 8080:80 ghcr.io/chaoyue0307/ropedia-xperience-10m-task-suite:latest

Then open http://localhost:8080.

Data Expected

The scripts expect a workspace with the Ropedia HOMIE toolkit and the Xperience-10M sample episode:

<workspace>/
  HOMIE-toolkit/
  data/sample/xperience-10m-sample/
    annotation.hdf5
    fisheye_cam0.mp4
    fisheye_cam1.mp4
    fisheye_cam2.mp4
    fisheye_cam3.mp4
    stereo_left.mp4
    stereo_right.mp4
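
Before running the scripts, the layout above can be verified with a small check. This is a hedged sketch: `missing_sample_files` is a hypothetical helper, and the file list simply mirrors the tree shown above.

```python
from pathlib import Path

# File names copied from the expected workspace layout.
EXPECTED_SAMPLE_FILES = (
    "annotation.hdf5",
    "fisheye_cam0.mp4",
    "fisheye_cam1.mp4",
    "fisheye_cam2.mp4",
    "fisheye_cam3.mp4",
    "stereo_left.mp4",
    "stereo_right.mp4",
)


def missing_sample_files(workspace: str) -> list:
    """List expected sample files that are absent from the workspace."""
    sample_dir = Path(workspace) / "data" / "sample" / "xperience-10m-sample"
    return [
        name for name in EXPECTED_SAMPLE_FILES
        if not (sample_dir / name).is_file()
    ]
```

An empty result means all seven files are in place. A partial download will list the streams that are still missing.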

The public sample dataset identifier is:

ropedia-ai/xperience-10m-sample

Hugging Face URL:

https://huggingface.co/datasets/ropedia-ai/xperience-10m-sample

Quickstart

From a workspace folder:

git clone https://github.com/Ropedia/HOMIE-toolkit.git
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet

Download the sample:

hf download ropedia-ai/xperience-10m-sample \
  --repo-type dataset \
  --local-dir data/sample/xperience-10m-sample

If Hugging Face access is unavailable in your environment, use ModelScope:

python scripts/omni/download_sample_modelscope.py \
  --output-dir data/sample/xperience-10m-sample \
  --mode minimal

--mode minimal downloads annotation.hdf5, README.md, and fisheye_cam0.mp4. Use --mode all-training to add all six MP4 streams while still skipping visualization.rrd.
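
When Hugging Face access works, a similar file filter can be approximated with `huggingface_hub.snapshot_download` and its `allow_patterns` / `ignore_patterns` arguments. This is a sketch under assumptions: the helper names are hypothetical, the mode names are borrowed from the ModelScope helper for illustration, and the "all-training" branch downloads everything except visualization.rrd, which may include more files than the ModelScope helper fetches.

```python
# Files named above for the minimal mode.
MINIMAL_FILES = ["annotation.hdf5", "README.md", "fisheye_cam0.mp4"]


def snapshot_kwargs(local_dir: str, mode: str = "minimal") -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {
        "repo_id": "ropedia-ai/xperience-10m-sample",
        "repo_type": "dataset",
        "local_dir": local_dir,
    }
    if mode == "minimal":
        kwargs["allow_patterns"] = MINIMAL_FILES
    elif mode == "all-training":
        kwargs["ignore_patterns"] = ["visualization.rrd"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return kwargs


def download_sample(local_dir: str, mode: str = "minimal") -> str:
    """Download the filtered sample; requires network access to the Hub."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(**snapshot_kwargs(local_dir, mode))
```

For example, `download_sample("data/sample/xperience-10m-sample")` would fetch only the three minimal files.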

Clone and run this repo:

git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
cd ropedia-xperience-10m-task-suite
python scripts/episode_task_suite.py --workspace /path/to/workspace

Run the same 12-task suite with lightweight neural heads:

pip install torch
python scripts/episode_task_suite.py \
  --workspace /path/to/workspace \
  --include-neural

Run the smaller baselines:

python scripts/train_min_action_model.py --workspace /path/to/workspace
python scripts/train_all_modalities_model.py --workspace /path/to/workspace

Xperience-10M Fine-Tuning Exploration

This repo includes a first Qwen3-Omni fine-tuning path over Xperience-10M. The current artifacts are setup-stage evidence, with held-out multi-episode metrics pending completed staging, preprocessing, training, and evaluation. The useful distinction is:

direct Qwen3-Omni inputs: RGB/fisheye video, embedded MP4 audio, and language prompts,
adapter-required Xperience-10M sensor inputs: depth, pose/SLAM, hand/body mocap, contacts, and IMU.
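
The distinction can be made explicit as a routing table. The sketch below is illustrative only: the modality keys and the `route_modalities` helper are hypothetical names, not identifiers from the export scripts.

```python
# Modalities Qwen3-Omni can consume directly, per the list above.
DIRECT_INPUTS = {"video", "audio", "language"}
# Modalities that need a sensor bridge/adapter, per the list above.
ADAPTER_INPUTS = {"depth", "pose_slam", "mocap", "contacts", "imu"}


def route_modalities(available):
    """Split available modalities into direct and sensor-bridge groups."""
    unknown = sorted(set(available) - DIRECT_INPUTS - ADAPTER_INPUTS)
    if unknown:
        raise ValueError(f"unrouted modalities: {unknown}")
    return {
        "direct": sorted(set(available) & DIRECT_INPUTS),
        "sensor_bridge": sorted(set(available) & ADAPTER_INPUTS),
    }
```

Raising on unknown keys keeps a newly extracted modality from being silently dropped from both paths.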

The figure shows the intended end-to-end training flow: raw valid episodes enter episode-level split validation, parallel media/sensor export creates Qwen-style JSONL records, Qwen3-Omni receives video/audio/text directly, the sensor bridge adds depth/pose/mocap/IMU features, LoRA adapters are trained on prepared train/val episodes, and sealed held-out test evaluation produces predictions, metrics, run reports, and upload-ready adapter artifacts.

The current scale-up artifacts show that the export, manifest, sensor-feature, LoRA, and evaluation scripts can run on the available sample episode. They do not show a real multi-episode result. A real pilot requires valid prepared episodes, held-out episode splits, training metadata, predictions, metrics, and a run report; the current selected pilot target is 128 episodes.

Sample Count Decision

Do not treat "10M" as a reason to start with the entire dataset. The engineering unit that matters first is diverse held-out episodes, not adjacent windows from one session.

Phase	Episodes/samples	Approx windows at stride 5	Purpose
Readiness	1-3	1k-3k	Verify loaders, token alignment, and task heads
Pilot	16-32	18k-37k	First held-out-episode evaluation
Useful LoRA run	64-128	74k-149k	Train sensor adapters plus selected Qwen3-Omni LoRA
Storage-heavy run	256+	297k+	Only after download layout and checkpoint size are stable
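
The window estimates in the table follow from scaling the sample episode's window count. A minimal sketch, assuming every episode is about as long as the public sample (1,161 windows at stride 5); real episodes vary in length, so these are order-of-magnitude figures only.

```python
WINDOWS_PER_SAMPLE_EPISODE = 1161  # from the public sample episode

PHASES = {
    "Readiness": (1, 3),
    "Pilot": (16, 32),
    "Useful LoRA run": (64, 128),
}


def approx_windows(episodes: int, per_episode: int = WINDOWS_PER_SAMPLE_EPISODE) -> int:
    """Estimate total windows if every episode matched the sample's length."""
    return episodes * per_episode


for name, (low, high) in PHASES.items():
    print(f"{name}: {approx_windows(low):,}-{approx_windows(high):,} windows")
```

For the storage-heavy phase, `approx_windows(256)` gives 297,216, in line with the 297k+ entry.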

Use the budget helper before downloading:

python scripts/omni/plan_finetune_sample_budget.py \
  --storage-root /path/to/storage \
  --target-free-after-download-gb 800 \
  --all-training-per-episode-gb 2.4 \
  --full-preview-per-episode-gb 5.1
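
The core arithmetic behind such a budget is simple enough to sanity-check by hand. The sketch below is not the helper's actual logic; it assumes the budget is "free space minus the reserve to keep free, divided by per-episode size", and both function names are hypothetical.

```python
import math
import shutil


def free_space_gb(path: str) -> float:
    """Free space on the filesystem holding path, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def max_episodes(free_gb: float, target_free_after_gb: float, per_episode_gb: float) -> int:
    """Episodes that fit while leaving the requested reserve untouched."""
    usable = free_gb - target_free_after_gb
    if usable <= 0:
        return 0
    return math.floor(usable / per_episode_gb)
```

With 1,200 GB free and an 800 GB reserve, `max_episodes(1200, 800, 2.4)` returns 166 and `max_episodes(1200, 800, 5.1)` returns 78.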

Multi-Episode Readiness Gate

python scripts/omni/discover_xperience10m_sources.py \
  --workspace /path/to/ropedia-xperience-10m-task-suite \
  --data-root /path/to/xperience10m_data \
  --output results/omni_finetune/source_discovery.json

Current status in this repo:

public_sample_valid_episodes: 1 (degraded-valid: annotation + fisheye_cam0.mp4)
gated_metadata_audit: 12,102 complete visible episodes across 802 complete sessions
selected_episode_plan: 128 metadata-balanced episodes, 96/16/16 train/val/test
selected_download_size: 277.71 GiB excluding visualization.rrd
ready_for_held_out_pilot: false until the selected episodes are fully prepared and checked
gated dataset: available for selected multi-episode data preparation
source_discovery: results/omni_finetune/source_discovery.json
data_status: results/omni_finetune/DATA_ACCESS_STATUS.md
access_status: results/omni_finetune/MULTI_EPISODE_ACCESS_STATUS.md

Use this gate before scheduling any full fine-tune run. The pilot should use balanced held-out selection, not the first paths in repository order. The current 128-episode selection filters for complete leaf episodes, excludes visualization.rrd, balances episode-size bands, and preserves one selected episode per top-level session UUID.
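
Two of the selection rules, one episode per session and balancing across size bands, can be sketched as follows. This is a hypothetical illustration, not the repository's selection code: the record schema (`session`, `path`, `size_gb`), the number of bands, and the tie-breaking rules are all assumptions.

```python
def select_balanced(episodes, n_select, n_bands=4):
    """Pick up to n_select episodes, one per session, spread over size bands."""
    # Keep one candidate per session (first path in sorted order, for determinism).
    per_session = {}
    for ep in sorted(episodes, key=lambda e: e["path"]):
        per_session.setdefault(ep["session"], ep)
    candidates = sorted(per_session.values(), key=lambda e: (e["size_gb"], e["path"]))
    total = len(candidates)
    # Cut the size-sorted candidates into contiguous bands of near-equal count.
    bands = [
        candidates[i * total // n_bands:(i + 1) * total // n_bands]
        for i in range(n_bands)
    ]
    # Round-robin across bands so no size range dominates the selection.
    selected = []
    while len(selected) < n_select and any(bands):
        for band in bands:
            if band and len(selected) < n_select:
                selected.append(band.pop(0))
    return selected
```

Because selection works on per-session candidates rather than raw paths, it does not favor whichever sessions happen to appear first in repository order.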

Progressive Train/Validation Pilot

The selected 128-episode plan can be used before every episode has arrived by training only on prepared train episodes and monitoring prepared val episodes. The final test episodes stay sealed until the end, so early development does not contaminate held-out evaluation.

python scripts/omni/build_selection_episode_manifest.py \
  --workspace /path/to/ropedia-xperience-10m-task-suite \
  --data-root /path/to/xperience10m_128 \
  --selection-json results/omni_finetune/xperience10m_128_episode_selection.json \
  --output results/omni_finetune/trainval_progressive/episode_manifest_trainval.json \
  --include-split train \
  --include-split val

scripts/omni/run_trainval_progressive_128.sh wraps the same guard, exports a train/val-only Qwen3-Omni JSONL dataset, and launches LoRA training without running final test evaluation. The exporter uses session-qualified episode IDs and path-based split matching so repeated folder names such as ep1 cannot collide across different sessions.
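
The collision problem and its fix can be illustrated in a few lines. This sketch assumes episodes live at `<data_root>/<session>/<episode>` and that the ID joins the relative path parts with `__`; both the layout and the separator are assumptions, and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def session_qualified_id(episode_path: str, data_root: str) -> str:
    """Build an ID from the path relative to data_root, e.g. 'sessA__ep1'."""
    rel = PurePosixPath(episode_path).relative_to(PurePosixPath(data_root))
    return "__".join(rel.parts)


def build_id_map(episode_paths, data_root: str) -> dict:
    """Map each qualified ID to its path, failing loudly on any collision."""
    ids = {}
    for path in episode_paths:
        eid = session_qualified_id(path, data_root)
        if eid in ids:
            raise ValueError(f"duplicate episode id {eid!r}: {ids[eid]} vs {path}")
        ids[eid] = path
    return ids
```

Keying on the bare folder name would map `/data/sessA/ep1` and `/data/sessB/ep1` to the same `ep1`; the qualified form keeps them distinct.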

For larger prepared subsets, scripts/omni/run_trainval_parallel_export_8gpu.sh uses the same split guard, exports episodes in parallel CPU shards, skips and reports episodes that contain no labeled windows under the configured label rule, then launches Qwen3-Omni LoRA with NUM_PROCESSES=8.

Full 128-Episode Held-Out Pilot

Once all selected episodes are complete, use the fixed selected-episode split:

96 train episodes,
16 validation episodes,
16 held-out test episodes.
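
A split like this can be checked for both counts and session-level leakage before any export. The sketch below is a hedged illustration: the row schema (`session`, `split`) is an assumption about what a selection file might contain, and `check_split` is a hypothetical helper rather than the launcher's validator.

```python
from collections import Counter

EXPECTED_COUNTS = {"train": 96, "val": 16, "test": 16}


def check_split(rows, expected=EXPECTED_COUNTS):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    counts = Counter(row["split"] for row in rows)
    for split in sorted(set(counts) | set(expected)):
        if counts.get(split, 0) != expected.get(split, 0):
            problems.append(
                f"{split}: expected {expected.get(split, 0)}, found {counts.get(split, 0)}"
            )
    # A session that spans two splits would leak information into held-out data.
    first_split = {}
    for row in rows:
        seen = first_split.setdefault(row["session"], row["split"])
        if seen != row["split"]:
            problems.append(f"session {row['session']} is in both {seen} and {row['split']}")
    return problems
```

Returning all problems at once, rather than raising on the first, makes it easier to fix a partially prepared selection in one pass.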

The clean full-run launcher validates the selected split, exports all splits in parallel, trains Qwen3-Omni LoRA on train/val only, then evaluates on the held-out test split:

RUN_ID=xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
DATA_ROOT=/path/to/xperience10m_128 \
SELECTION_JSON=results/omni_finetune/xperience10m_128_episode_selection.json \
MODEL_DIR=/path/to/Qwen__Qwen3-Omni-30B-A3B-Instruct \
NUM_PROCESSES=8 \
scripts/omni/run_128_fullsplit_parallel_export_8gpu.sh

Monitor the run with:

python scripts/omni/monitor_omni_progress.py \
  --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu

Validate the run artifacts stage by stage:

python scripts/omni/validate_omni_finetune_run.py \
  --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
  --require-stage manifest

python scripts/omni/validate_omni_finetune_run.py \
  --run-id xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu \
  --require-stage eval \
  --min-json-validity 0.98
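
A plausible reading of the JSON-validity threshold is the fraction of model outputs that parse as JSON. The sketch below assumes that definition; the validator's actual rule is not described here, and both function names are hypothetical.

```python
import json


def json_validity(outputs) -> float:
    """Fraction of outputs that parse as JSON (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue  # malformed text or a non-string output counts as invalid
        valid += 1
    return valid / len(outputs)


def passes_validity_gate(outputs, minimum: float = 0.98) -> bool:
    """True when the parse rate meets the assumed lower bound."""
    return json_validity(outputs) >= minimum
```

Under this definition, 98 parseable outputs out of 100 would just meet a 0.98 threshold.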

After dataset export, a model-neutral window index can be created for future backbones:

python scripts/omni/export_model_neutral_window_index.py \
  --dataset-jsonl results/omni_finetune/xperience10m_qwen3_omni_128ep_fullsplit_fast8gpu_dataset/dataset.jsonl

This produces window_index.jsonl and window_index_manifest.json so Cosmos-style world models and VLA/policy branches can reuse the same split-checked windows without depending on Qwen chat-message records.

Uploading the Pilot Qwen3-Omni LoRA

A prepared upload package is available at results/omni_finetune/hf_upload.

python3 scripts/omni/upload_qwen3_omni_lora_to_hf.py \
  --repo-id cy0307/ropedia-qwen3-omni-lora-readiness \
  --source-dir results/omni_finetune/hf_upload \
  --message "Upload Xperience-10M Qwen3-Omni LoRA pilot"

This script requires a valid Hugging Face token, supplied via HF_TOKEN or --token, and network access to huggingface.co.

Foundation Backbone Plan

The next modeling plan tracks several foundation-model branches instead of assuming one backbone solves every Xperience-10M objective.

Branch	Current role	When to use it
Qwen3-Omni	First trainable multimodal LoRA pilot	Use for the selected 128-episode held-out baseline over video/audio/language plus sensor-bridge features.
Cosmos 3	First world-model/action-generation branch	Use after data preparation for future-window prediction, action-conditioned world modeling, and synthetic-data usefulness tests.
GR00T	Humanoid/action-policy branch	Use after mocap/contact retargeting creates well-defined humanoid action targets.
OpenVLA / openpi	Open VLA/policy baselines	Use after the project defines robot-compatible or action-token targets.
Gemini Robotics	External reasoning reference	Use only for qualitative comparison or annotation support unless local trainable access exists.
Xperience Embodied Foundation Model	Future Xperience-native pretraining goal	Use only after multi-episode pilots, full-corpus storage, distributed training infrastructure, and scaling evidence justify a from-scratch domain model.

See FOUNDATION_MODEL_PLAN.md and docs/data/foundation_model_plan.json for the full selection matrix, source links, and model-specific evaluation additions. See XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md for the long-term full-corpus pretraining plan.

Backbone-specific contracts now live in configs/omni_backbones. The extension contract is documented in OMNI_MODEL_EXTENSION_CONTRACT.md, and the registry can be checked with:

python scripts/omni/backbone_registry.py --validate --json

Additional Development Directions

Beyond backbone selection and fine-tuning, Xperience-10M supports several concrete research-development tracks:

Direction	First useful artifact	Role in the project
Episode taxonomy and data engine	Episode atlas, balance report, and split builder	Select representative data before training.
Standardized benchmark protocol	Versioned train/val/test manifests and metric scripts	Make future model results comparable.
Multimodal representation learning	Contrastive and masked-window encoder objectives	Learn reusable video/audio/depth/pose/mocap/IMU/language features.
Skill and procedure graph mining	Step graph, transitions, preconditions, and effects	Connect perception to planning and long-horizon reasoning.
Human-object affordance modeling	Contact, reachable-object, tool-use, and next-affordance tasks	Model what actions the scene makes possible.
3D/4D scene and object memory	Persistent scene/object maps from depth, pose, multiview video, and objects	Track world state beyond single frames.
Data-quality and synchronization diagnostics	Per-episode QA for drift, missing streams, calibration, and corrupted files	Keep large multimodal training trustworthy.
Policy, retargeting, and simulation transfer	Action-token conversion and robot-compatible imitation examples	Bridge human egocentric experience to robot policy work.

See ADDITIONAL_DEVELOPMENT_DIRECTIONS.md and docs/data/additional_development_directions.json.

Four Research Directions

The 12 tasks are now organized against the four Ropedia research directions in a generated artifact, not only in prose.

The taxonomy uses two current baselines for every task:

Baseline	Role
Minimal interpretable heads	Softmax, logistic, ridge, and retrieval heads over the 8,546-dimensional multimodal representation. These expose the input/output contract cleanly.
Neural MLP heads	Small PyTorch MLP classifiers/regressors on the same features and splits. These check whether nonlinear heads help before moving to Qwen/Omni fine-tuning.

Current direction-level coverage:

Direction	Current status	Covered task evidence	What is not solved yet
A. Human Modeling & Motion Understanding	Partially implemented	Hand Trajectory Forecasting and Contact State Prediction are direct; Action Recognition and Object Relevance Prediction are proxies. Neural MLP improves hand forecasting from `0.8647` to `0.1079` MPJPE.	No full body/shape model, SMPL/MANO target, deformation prior, or multi-episode motion-generation evaluation yet.
B. 3D/4D Reconstruction & Neural Rendering	Proxy tasks only	Cross-Modal Retrieval, Cross-Modal Reconstruction, and Multimodal Synchronization Detection test alignment/reconstruction prerequisites.	No NeRF, Gaussian Splatting, TSDF, mesh, novel-view synthesis, or calibrated 4D reconstruction model yet.
C. Egocentric Vision & Interaction	Strongest implemented track	6 direct tasks: action, subtask, transition, next-action, object relevance, and caption grounding, plus alignment/order diagnostics and audio ablation.	Single-episode chronological split limits generalization; stronger audio and video-language backbones still need multi-episode testing.
D. Scene Reconstruction & World Modeling	Early proxy tasks	Procedure Step Recognition, Next-Action Prediction, Object Relevance Prediction, Cross-Modal Retrieval, Cross-Modal Reconstruction, Temporal Order Verification, and Multimodal Synchronization Detection provide state/world-model probes.	No persistent scene graph, object permanence task, long-term map, or held-out-episode world model yet.

The important interpretation is that all four directions can be started from the Xperience-10M sample modalities, but only direction C is strongly represented by the current 12-task suite. Directions A, B, and D need additional targets and multi-episode training before they become full research deliverables.

Four Direction-Extension Probes

Beyond the original 12 core tasks, the repo now includes one extra data-backed probe for each research direction. These probes are computed from the same shared_windows.npz, windows.csv, and feature_manifest.json artifacts, so every reported number traces back to sample-derived features and saved metric artifacts.

Direction	New extension task	Input	Output	Minimal	Neural MLP	Why it matters
A. Human Modeling & Motion Understanding	Body and Hand Motion Intensity	non-mocap video/depth/pose/IMU/SLAM/language features	high vs low body/hand motion	`0.7827` macro-F1	`0.7986` macro-F1	Starts a human-motion-energy target without leaking mocap input.
B. 3D/4D Reconstruction & Neural Rendering	Multi-View Consistency Retrieval	fisheye camera feature query	synchronized stereo-left view rank	`0.5534` MRR	`0.3469` MRR	Tests whether multi-view features preserve synchronized 4D scene identity.
C. Egocentric Vision & Interaction	Action Phase Progress Estimation	non-caption multimodal window	progress inside current action segment	`0.3416` MAE	`0.3038` MAE	Adds a task-structure/intent-style target beyond class labels.
D. Scene Reconstruction & World Modeling	Short-Horizon Ego-Motion Forecasting	current sensors excluding camera translation and captions	future camera-translation delta	`0.1989` MAE	`0.0989` MAE	Starts a short-horizon world-model target over wearer motion.

Run:

python scripts/research_direction_extension_tasks.py

These four probes make the four-direction mapping more concrete, but they are still single-episode extension baselines. Full research conclusions still require multi-episode training, held-out episode evaluation, and stronger task-specific models.

Task Walkthroughs For Juniors

Every task now has a beginner-facing explanation with:

a concrete coffee-episode case study,
exact input contract,
middle process modules,
output contract,
minimal and neural metric,
one important limitation.

Compact map:

Task	Case study	Input -> process -> output
Action Recognition	A pouring window should be named as the current action.	all-modality window -> action label builder + classifier -> action class
Procedure Step Recognition	A fine action is grouped into a broader drink-preparation stage.	all-modality window -> subtask label builder + classifier -> subtask label
Action Boundary Detection	Detect the change from preparing to pouring.	window -> boundary builder + binary classifier -> boundary/steady
Next-Action Prediction	A preparing window predicts what happens 20 frames later.	current window -> future-label shift + classifier -> next action
Hand Trajectory Forecasting	A hand moving toward a cup becomes a future 3D hand path.	current window -> future mocap target + regressor -> hand trajectory
Contact State Prediction	Decide whether hand/body contact is happening.	non-contact features -> contact target + binary classifier -> contact label
Object Relevance Prediction	Infer milk, cup, coffee, or related objects during pouring.	non-caption features -> multi-hot object target + sigmoid heads -> object set
Language Grounding	Query Pour milk into coffee and retrieve the matching moment.	text-like query + candidates -> projection + cosine ranker -> ranked windows
Cross-Modal Retrieval	Motion/IMU from pouring retrieves matching depth/video.	motion/IMU/camera -> projection + candidate index -> ranked depth/video windows
Cross-Modal Reconstruction	Infer depth/video features from motion, IMU, and camera pose.	source modalities -> scaler + regressor -> target modality vector
Temporal Order Verification	Tell whether reaching then pouring was reversed.	adjacent window pair -> pair combiner + binary classifier -> correct/reversed
Multimodal Synchronization Detection	Catch motion paired with visual/depth features shifted in time.	motion side + visual side -> aligned/shifted pair builder + classifier -> aligned/shifted

Minimal 12-Task Architectures

These are deliberately minimal baselines. They are useful because every input/output contract is explicit, not because they are strong embodied-AI models.

Shared setup:

raw episode -> 20-frame windows, stride 5 -> 8,546-dimensional multimodal representation
chronological split: first 70% train, last 30% test
scalers are fit on train windows only
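The shared setup can be sketched with NumPy as below. The window length, stride, and 70/30 chronological cut follow the values above; the random features, the tiny feature width, and the helper names are placeholders for illustration, not the project's featurizer.

```python
import numpy as np


def make_windows(num_frames, window=20, stride=5):
    """Return (start, end) frame index pairs for full windows only."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def chronological_split(num_windows, train_fraction=0.7):
    """First part of the episode trains, the remainder tests (no shuffling)."""
    cut = int(num_windows * train_fraction)
    return np.arange(cut), np.arange(cut, num_windows)


def fit_scaler(train_features):
    """Z-score statistics estimated from train windows only."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero on constant columns
    return mean, std


rng = np.random.default_rng(0)
windows = make_windows(num_frames=120)
features = rng.normal(size=(len(windows), 4))  # stand-in for the real features
train_idx, test_idx = chronological_split(len(windows))
mean, std = fit_scaler(features[train_idx])
train_z = (features[train_idx] - mean) / std
test_z = (features[test_idx] - mean) / std  # reuses train statistics
```

With a 20-frame window and stride 5, windows on either side of the cut share frames. This sketch does not drop those boundary windows; whether the task suite does is not stated here.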

There are four reusable head families:

Head family	Used by	What it means
Linear softmax classifier	Action Recognition, Procedure Step Recognition, Action Boundary Detection, Next-Action Prediction, Contact State Prediction, Temporal Order Verification, Multimodal Synchronization Detection	z-score features, then `XW+b`, softmax, cross-entropy, L2
Dual ridge regression/projection	Hand Trajectory Forecasting, Cross-Modal Reconstruction	z-score input/target, solve ridge regression with L2=10
Ridge + cosine ranking	Language Grounding, Cross-Modal Retrieval	project one modality into another feature space, then rank candidates by cosine
Multi-label logistic regression	Object Relevance Prediction	z-score non-caption features, sigmoid object heads, threshold at 0.5
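To make the ridge and cosine-ranking heads concrete, here is a minimal NumPy sketch using the closed-form ridge solution with L2=10 and MRR as the ranking metric. The synthetic data, the helper names, and the convention that query i matches candidate i are assumptions for this example; z-scoring is omitted for brevity.

```python
import numpy as np


def fit_ridge(X, Y, l2=10.0):
    """Closed-form ridge: W = (X^T X + l2 * I)^-1 X^T Y."""
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(dim), X.T @ Y)


def cosine_rank(queries, candidates):
    """Candidate indices per query, sorted from most to least similar."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(q @ c.T), axis=1)


def mean_reciprocal_rank(order):
    """MRR, assuming the correct candidate for query i is candidate i."""
    targets = np.arange(order.shape[0])[:, None]
    ranks = np.argmax(order == targets, axis=1) + 1
    return float(np.mean(1.0 / ranks))


rng = np.random.default_rng(0)
true_map = rng.normal(size=(6, 4))     # synthetic source-to-target relation
X_train = rng.normal(size=(200, 6))    # stand-in for motion/IMU/camera features
Y_train = X_train @ true_map           # stand-in for depth/video features
X_test = rng.normal(size=(20, 6))
Y_test = X_test @ true_map

W = fit_ridge(X_train, Y_train)
order = cosine_rank(X_test @ W, Y_test)  # project, then rank test candidates
mrr = mean_reciprocal_rank(order)
```

The same `fit_ridge` call covers the regression heads: the projected vector is either scored directly against the target or used to rank candidates.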

The optional neural run keeps the same window representation, leakage filters, chronological splits, and metrics, but replaces the task heads with small PyTorch MLP classifiers or regressors. Its outputs live under results/episode_task_suite/neural_mlp/, and the rollup is stored in the neural_tasks section of results/episode_task_suite/summary_report.json.

The task-specific heads are:

Task	Input	Minimal head	Output
Action Recognition	all featurized modalities	linear softmax	current action class
Procedure Step Recognition	all featurized modalities	linear softmax	current subtask class
Action Boundary Detection	all featurized modalities	linear softmax	steady vs action boundary
Next-Action Prediction	all featurized modalities at `t`	linear softmax	action at `t+20` frames
Hand Trajectory Forecasting	all featurized modalities at `t`	ridge regression	future 10-frame left/right hand joints
Contact State Prediction	non-contact and non-caption signals	linear softmax	any body contact
Object Relevance Prediction	non-caption signals	multi-label logistic	relevant object set
Language Grounding	sensor windows projected to text space	ridge projection + cosine ranking	matching time window for text query
Cross-Modal Retrieval	motion/IMU/camera projected to visual space	ridge projection + cosine ranking	matching depth/video window
Cross-Modal Reconstruction	motion/IMU/camera	ridge regression	compressed depth/video target
Temporal Order Verification	`[x_t, x_t+1, x_t+1-x_t]`	binary linear softmax	correct vs reversed order
Multimodal Synchronization Detection	motion plus visual pair	binary linear softmax	aligned vs shifted by 8 windows
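The two pair-based inputs in the last rows can be built roughly as follows. The `[x_t, x_t+1, x_t+1-x_t]` layout and the 8-window shift come from the table; the label convention (1 for correct or aligned, 0 for reversed or shifted) and the helper names are assumptions for this sketch.

```python
import numpy as np


def temporal_order_pairs(x):
    """Build [x_t, x_t+1, x_t+1 - x_t] (label 1) and its reversal (label 0)."""
    first, second = x[:-1], x[1:]
    correct = np.concatenate([first, second, second - first], axis=1)
    reversed_pairs = np.concatenate([second, first, first - second], axis=1)
    features = np.concatenate([correct, reversed_pairs], axis=0)
    labels = np.concatenate(
        [np.ones(len(first), dtype=int), np.zeros(len(first), dtype=int)]
    )
    return features, labels


def sync_pairs(motion, visual, shift=8):
    """Pair motion with same-time visual (label 1) or visual shifted by `shift` (label 0)."""
    n = len(motion) - shift
    aligned = np.concatenate([motion[:n], visual[:n]], axis=1)
    shifted = np.concatenate([motion[:n], visual[shift:shift + n]], axis=1)
    features = np.concatenate([aligned, shifted], axis=0)
    labels = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    return features, labels


rng = np.random.default_rng(0)
x = rng.normal(size=(30, 3))        # placeholder window features
order_X, order_y = temporal_order_pairs(x)

motion = rng.normal(size=(30, 2))   # placeholder motion-side features
visual = rng.normal(size=(30, 5))   # placeholder visual-side features
sync_X, sync_y = sync_pairs(motion, visual)
```

To keep the chronological boundary intact, pairs like these would be built separately inside the train and test segments rather than across the cut.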

Key Results

Experiment	Main score	Accuracy	Notes
Motion-only action	0.9688 macro-F1	0.9828	Uses motion/IMU features only
Current all-feature action	0.9829 macro-F1	0.9863	8,546-dimensional multimodal representation
Motion-only subtask	0.9528 macro-F1	0.9759	Strong within-episode subtask signal
Current all-feature subtask	0.9173 macro-F1	0.9828	High accuracy, lower class-balanced score
Cross-modal retrieval	0.3678 top-5	n/a	Motion/IMU/camera/audio retrieves matching depth/video
Transition detection	0.6118 macro-F1	0.9080	Boundary F1 is 0.1250
Hand trajectory forecast	0.8647 MPJPE	n/a	Predicts future hand-joint trajectory
Neural MLP hand forecast	0.1079 MPJPE	n/a	Same features/split, nonlinear regression head
Neural MLP temporal order	0.8520 F1	0.8578	Strong improvement on adjacent-window ordering
Neural MLP misalignment	0.7153 F1	0.7009	Detects shifted motion/visual/audio pairs better than the linear head
Audio ablation	+0.0418 mean delta	n/a	Current audio variant improves the primary metric on 6 of 12 task contracts
Alternate audio representation	+0.0936 mean delta	n/a	Alternate audio-window representation improves over the baseline audio variant on 6 of 12 task contracts
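Several rows above pair a high accuracy with a lower macro-F1. The toy example below shows why the two can diverge when classes are imbalanced. It is a plain-Python sketch of the standard definitions, not the project's metric code, and the labels are invented.

```python
def accuracy(y_true, y_pred):
    """Share of windows whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Nine windows of a frequent class, one window of a rare class that is missed.
y_true = ["pour"] * 9 + ["stir"]
y_pred = ["pour"] * 10
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here accuracy is 0.9 while macro-F1 is about 0.47, because the missed rare class contributes an F1 of zero with the same weight as the frequent class.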

Audio Contribution Study

The audio ablation keeps the same windows and task labels, then compares input variants under the same chronological split. The script scripts/audio_ablation_and_raw_upgrade.py reuses the real task-suite windows and evaluates six variants for every task: current inputs, no audio, audio-only, alternate audio-only, audio representation replacement, and all inputs plus the alternate audio representation.
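One way to express the six input variants is as lists of feature groups, as in the sketch below. The group names, the column ranges, and both helper functions are hypothetical; the actual grouping lives in the script and the feature manifest.

```python
def build_audio_variants(groups, audio="audio", alt_audio="audio_alt"):
    """Return the six variants as ordered lists of feature-group names."""
    others = [g for g in groups if g not in (audio, alt_audio)]
    return {
        "current": others + [audio],
        "no_audio": others,
        "audio_only": [audio],
        "alt_audio_only": [alt_audio],
        "audio_replaced": others + [alt_audio],
        "all_plus_alt_audio": others + [audio, alt_audio],
    }


def select_columns(row, column_groups, variant):
    """Concatenate one row's values for the groups in a variant, in order."""
    selected = []
    for name in variant:
        start, stop = column_groups[name]
        selected.extend(row[start:stop])
    return selected


# Hypothetical column ranges: group name -> (start, stop) in the feature row.
column_groups = {"mocap": (0, 2), "imu": (2, 4), "audio": (4, 5), "audio_alt": (5, 7)}
row = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
variants = build_audio_variants(list(column_groups))
no_audio_row = select_columns(row, column_groups, variants["no_audio"])
```

Because every variant selects columns from the same windows, the labels and the chronological split stay fixed and only the input changes.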

The measured single-episode result is task-specific:

Readout	Value
Tasks where current audio improves the primary metric	6 / 12
Mean current-audio delta	+0.0418
Tasks where alternate audio representation improves over baseline audio	6 / 12
Mean alternate-representation delta vs baseline audio	+0.0936

Neural MLP Results

The neural baseline was run locally with --include-neural for all 12 tasks using 80 epochs, hidden size 128, batch size 128, and CPU execution. It is not a foundation model result; it is a controlled nonlinear-head comparison over the same 8,546-dimensional multimodal representation.

Task	Neural metric	Minimal metric	Readout
Action Recognition	0.0148 macro-F1	0.0500 macro-F1	Still blocked by unseen future classes
Procedure Step Recognition	0.0281 macro-F1	0.0506 macro-F1	Same single-episode split limitation
Action Boundary Detection	0.5862 macro-F1	0.6118 macro-F1	Similar to the linear baseline
Next-Action Prediction	0.0419 macro-F1	0.0593 macro-F1	Same unseen-label issue
Hand Trajectory Forecasting	0.1079 MPJPE	0.8647 MPJPE	Neural regression improves this target
Contact State Prediction	1.0000 macro-F1	1.0000 macro-F1	Degenerate one-class sample
Object Relevance Prediction	0.1679 micro-F1	0.1803 micro-F1	Similar weak object signal
Language Grounding	0.0168 MRR	0.0160 MRR	Similar ranking behavior
Cross-Modal Retrieval	0.1300 MRR	0.2693 MRR	Linear ridge remains stronger here
Cross-Modal Reconstruction	-0.0102 R2	-0.0153 R2	Small improvement but still weak
Temporal Order Verification	0.8520 F1	0.5400 F1	Neural head captures local temporal structure
Multimodal Synchronization Detection	0.7153 F1	0.5052 F1	Neural head improves alignment detection

The strongest single-episode self-supervised signal is cross-modal retrieval: motion/IMU/camera/audio features retrieve matching depth/video windows substantially better than random.

Single-Episode Diagnostics and Explorer

While waiting for broader Xperience-10M access, the repo now includes an artifact-driven diagnostics pass over the public sample episode:

results/single_episode_diagnostics/object_labels/window_object_labels.csv exports 1,161 real window-level object-label sets from annotation.hdf5.
results/single_episode_diagnostics/modality_ablation/ablation_metrics.csv recomputes all 96 task/modality cells, including object relevance.
results/single_episode_diagnostics/timeline_overlay/timeline_overlay.csv aligns 2,079 existing prediction rows back to the episode timeline.
results/single_episode_diagnostics/alignment_stress/alignment_shift_metrics.csv evaluates cross-modal retrieval under explicit time shifts.
docs/single_episode_explorer.html is a static interactive page for inspecting window labels, objects, predictions, modality statistics, and diagnostic scores.

These are single-episode research diagnostics. They are useful for studying task definitions, feature behavior, and model errors before scaling to more episodes; they are not reported as multi-episode benchmark results.

Reproducibility Check

I re-ran the full pipeline from the local raw public sample into a temporary local workspace and compared regenerated metrics with the committed artifacts. The baseline metrics, 12 task metrics, feature manifest, and available modality manifest matched exactly after float normalization.

See notes/reproducibility_audit.md for the commands and verification evidence.

Why Some Scores Are Low

The task suite intentionally uses a chronological split:

first 70% of the episode -> train
last 30% of the episode  -> test

The test segment contains some action/subtask labels never seen during training. Timeline and next-action classifiers therefore expose the core limitation of single-episode learning instead of hiding it behind random splits.
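The effect is easy to check on any label sequence. The sketch below lists the test labels that never occur in the train segment and the share of test windows they cover. The labels are invented for illustration and `unseen_test_labels` is a hypothetical helper.

```python
def unseen_test_labels(labels, train_fraction=0.7):
    """Return labels that occur only after the cut, and their share of test windows."""
    cut = int(len(labels) * train_fraction)
    train_labels = set(labels[:cut])
    test = labels[cut:]
    unseen = sorted(set(test) - train_labels)
    share = sum(label in unseen for label in test) / len(test) if test else 0.0
    return unseen, share


# Toy timeline: "stir" first appears after the chronological cut.
labels = ["prepare"] * 5 + ["pour"] * 4 + ["stir"] * 2
unseen, share = unseen_test_labels(labels)
```

In this toy timeline half of the test windows carry a label the classifier never saw, so no head trained on the first segment can score well on them.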

Modalities Used

The current public-sample pipeline uses:

hand/body mocap joints and contact labels,
camera translation and rotation,
IMU acceleration and gyroscope traces,
depth confidence features,
six video streams,
audio from the sample MP4 stream,
caption/object/interaction text features,
SLAM point-cloud summary features,
calibration parameters.

The full technical source manifest is stored in results/episode_task_suite/feature_manifest.json.

Data Notice

Xperience-10M data belongs to its original authors and is subject to the official Ropedia dataset license and access terms. This repo contains code and derived experiment artifacts only; it does not redistribute the raw videos or raw annotation dataset.
