Framework for generating synthetic Inertial Measurement Unit (IMU) data from human motion. It bridges two domains that are rarely connected: pose estimation and motion generation on one side, and wearable sensor simulation on the other. Given a video of a person moving, or a plain-text description of a movement, the framework produces synthetic accelerometer readings as if physical sensors had been attached to the body.
The primary motivation is the scarcity of labelled IMU datasets. Collecting real IMU data requires physical hardware, synchronized recordings, and careful sensor placement, a process that is slow, expensive, and hard to scale. The framework addresses this by deriving IMU signals from 3D skeletal trajectories, which can themselves be obtained either from video (via pose estimation) or generated from text prompts (via language-conditioned motion models). This makes it possible to produce large, diverse, and controllable datasets for training and evaluating activity recognition models without attaching any physical sensor.
The framework integrates two main pipelines. The first takes a monocular video, lifts the detected 2D keypoints to a metric 3D skeleton using MotionBERT, and feeds the resulting trajectory into IMUSim to synthesize the sensor signals. The second pipeline uses a Vision-Language Model (Qwen3-VL) to automatically describe short windows of real motion and feeds those descriptions — together with real joint positions as anchors — into Kimodo, a SMPL-X motion generator, to produce novel but physically grounded synthetic motions.
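To give an intuition for how a sensor signal can be derived from a 3D trajectory, the sketch below approximates an accelerometer reading for a single joint by taking a central second difference of its positions and subtracting gravity. This is only an illustration under stated assumptions: the function name, the z-up world frame, and the gravity constant are hypothetical choices, and IMUSim models considerably more (sensor orientation, rotation into the sensor frame, gyroscope signals).

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
GRAVITY: Vec3 = (0.0, 0.0, -9.81)  # assumption: world z-axis points up, units in m/s^2


def specific_force(positions: Sequence[Vec3], dt: float) -> List[Vec3]:
    """Approximate world-frame accelerometer readings for one joint.

    Uses a central second difference of the positions and subtracts gravity.
    The first and last frames are dropped because they lack a neighbour.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    readings: List[Vec3] = []
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # Second derivative of position: (p[t+1] - 2 p[t] + p[t-1]) / dt^2
        accel = tuple((n - 2.0 * c + p) / (dt * dt) for p, c, n in zip(prev, cur, nxt))
        # An accelerometer measures acceleration minus gravity (specific force).
        readings.append(tuple(a - g for a, g in zip(accel, GRAVITY)))
    return readings


# A joint at rest should read roughly +9.81 m/s^2 along z.
still = [(0.0, 0.0, 1.0)] * 5
print(specific_force(still, dt=0.01)[0])
```

A real sensor reports this vector in its own rotating frame, so a full simulation also needs the joint orientation at every time step.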
This project was developed in a partnership between the Cognitive Architectures research line of the Hub for Artificial Intelligence and Cognitive Architectures (H.IAAC) at the State University of Campinas (UNICAMP), Brazil, and the Robotics and Artificial Intelligence Lab (AIRLab) at Politecnico di Milano (POLIMI), Italy.
- pose_module/: core pipeline for 2D/3D pose estimation, virtual IMU synthesis, and MotionBERT lifting. Runs in the pose_module conda env.
- robot_emotions_vlm/: Qwen3-VL video description, anchor catalog construction, and Kimodo batch generation. Runs in the kimodo conda env.
- kimodo/: git submodule with the SMPL-X motion generator CLI (kimodo_gen, kimodo_textencoder).
- imusim/: IMU physics simulation library used to synthesize accelerometer/gyroscope readings from 3D skeleton trajectories.
- data/: input datasets (e.g. data/RobotEmotions/).
- output/: per-clip outputs organized by experiment; manifests (JSONL) index all artifacts.
- evaluation/: notebooks and scripts for classifier experiments and IMU quality assessment.
- scripts/: utility scripts.
Requirements: Linux, CUDA-capable GPU (≥ 21 GB VRAM recommended).
This project uses Miniconda to manage Python environments. We do not recommend python-env or venv, because the project requires multiple envs with different Python versions.
Install Miniconda as appropriate for your Linux distribution. If it is not installed yet, follow the official Miniconda installation instructions.
Also, install ffmpeg for your distro. For Ubuntu/Debian, use:
sudo apt update && sudo apt install ffmpeg -y
Now clone the project's repository. To fetch the required third-party code, use the --recurse-submodules flag:
git clone --recurse-submodules git@github.com:H-IAAC/POSE2IMU-Framework.git
cd POSE2IMU-Framework
All environments (pose_module, openmmlab, kimodo) are configured by a single script:
sudo chmod +x config_envs.sh
bash config_envs.sh
The script creates and installs each conda environment in order, printing progress for each step. If any step fails, it stops immediately.
Qwen3-VL weights (Qwen/Qwen3-VL-8B-Instruct) are downloaded automatically from Hugging Face on first use.
The project supports two main pipelines. All commands are run from the repository root.
Converts real video recordings into synthetic IMU data. The pose_module drives the full chain: 2D detection (OpenMMLab/ViTPose), 3D lifting (MotionBERT), metric normalization, root estimation, and physics-based IMU synthesis (IMUSim).
# Step 1 — Export 3D poses from video
conda run -n pose_module python -m pose_module.robot_emotions export-pose3d \
--dataset-root data/RobotEmotions \
--domains 10ms 30ms \
--output-dir output/robot_emotions_pose3d \
--env-name openmmlab \
--no-debug-2d --no-debug-3d
# Step 2 — Synthesize virtual IMU signals, reusing the poses computed above
conda run -n pose_module python -m pose_module.robot_emotions export-virtual-imu \
--dataset-root data/RobotEmotions \
--domains 10ms 30ms \
--output-dir output/robot_emotions_virtual_imu \
--pose3d-manifest-path output/robot_emotions_pose3d/pose3d_manifest.jsonl \
--no-debug-2d --no-debug-3d
Passing --pose3d-manifest-path skips the pose estimation stage for every clip found in the manifest and loads the existing pose3d.npz directly, going straight to IK + IMUSim. Clips not found in the manifest fall back to the full pipeline. Omitting the flag runs the full pipeline for all clips (original behaviour).
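The reuse-or-recompute decision described above can be sketched as a lookup against the manifest. This is a minimal illustration, not the framework's code: the function names and the clip_id field are hypothetical, and the actual manifest schema may use different keys.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Optional


def load_manifest_index(manifest_path: Path, key: str = "clip_id") -> Dict[str, dict]:
    """Read a JSONL manifest into a dict keyed by clip identifier.

    The "clip_id" field name is an assumption made for this sketch.
    """
    index: Dict[str, dict] = {}
    with manifest_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                record = json.loads(line)
                index[record[key]] = record
    return index


def plan_clips(clip_ids: Iterable[str], index: Optional[Dict[str, dict]]) -> Dict[str, str]:
    """Return 'reuse' for clips with an existing pose entry, else 'full'."""
    if index is None:  # flag omitted: every clip runs the full pipeline
        return {clip_id: "full" for clip_id in clip_ids}
    return {clip_id: "reuse" if clip_id in index else "full" for clip_id in clip_ids}
```

Passing None as the index mirrors the case where the flag is omitted, so every clip is planned for the full pipeline.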
The pipeline exports raw (uncalibrated) signals. Calibration follows a rank-transform method that maps the virtual signal distribution onto the real IMU distribution of the same clip. There are two use cases:
- Per fold during evaluation: evaluation/classifiers_pose_experiments.ipynb recalibrates each fold using only training-set subjects, avoiding data leakage.
- Batch via CLI: calibrate-virtual-imu reads the manifest and calibrates each clip against its own imu.npz, writing virtual_imu_calibrated.npz alongside the original:
conda run -n pose_module python -m pose_module.robot_emotions calibrate-virtual-imu \
--manifest-path output/robot_emotions_virtual_imu/virtual_imu_manifest.jsonl \
--calibration-fraction 0.5

| Flag | Default | Description |
|---|---|---|
| --manifest-path | required | virtual_imu_manifest.jsonl from export-virtual-imu. |
| --calibration-fraction | 1.0 | Fraction of each clip's real imu.npz to use as reference (e.g. 0.5 = first 50%). |
| --activity-label-key | None | Manifest label field for per-class calibration (e.g. action). |
| --signal-mode | acc | Channels to calibrate: acc, gyro, or both. |
| --in-place | off | Overwrite virtual_imu.npz instead of writing virtual_imu_calibrated.npz. |
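To illustrate the rank-transform idea, the sketch below maps each virtual sample to the real-signal value at the same empirical quantile, using only the leading fraction of the real signal as reference. It is a minimal single-channel version under stated assumptions: the function name, the tie handling, and the linear interpolation between reference values are choices made here and may differ from what calibrate-virtual-imu actually does.

```python
from typing import List, Sequence


def rank_calibrate(virtual: Sequence[float], real: Sequence[float],
                   calibration_fraction: float = 1.0) -> List[float]:
    """Map each virtual sample to the real value at the same empirical quantile."""
    if not virtual or not real:
        raise ValueError("both signals must be non-empty")
    if not 0.0 < calibration_fraction <= 1.0:
        raise ValueError("calibration_fraction must be in (0, 1]")
    # Reference distribution: the leading fraction of the real signal, sorted.
    n_ref = max(1, int(len(real) * calibration_fraction))
    reference = sorted(real[:n_ref])
    # Rank of every virtual sample (stable sort keeps ties in time order).
    order = sorted(range(len(virtual)), key=lambda i: virtual[i])
    calibrated = [0.0] * len(virtual)
    for rank, index in enumerate(order):
        quantile = rank / (len(virtual) - 1) if len(virtual) > 1 else 0.5
        position = quantile * (n_ref - 1)
        low = int(position)
        high = min(low + 1, n_ref - 1)
        weight = position - low
        # Linear interpolation between neighbouring reference values.
        calibrated[index] = reference[low] * (1.0 - weight) + reference[high] * weight
    return calibrated
```

Because the mapping depends only on ranks, the temporal ordering of peaks and troughs in the virtual signal is preserved while its amplitude distribution is replaced by the reference one. For multi-axis data the same function would be applied to each channel separately.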
Outputs per clip under output/<experiment>/<clip_id>/:
- pose/pose3d/pose3d.npz: 3D skeleton trajectory
- imu/virtual_imu.npz: synthetic accelerometer + gyroscope (uncalibrated)
- imu/virtual_imu_calibrated.npz: calibrated signal (when calibrate-virtual-imu is run without --in-place)
Uses real video windows as anchors: Qwen3-VL describes each 5-second segment; Kimodo generates new SMPL-X motion conditioned on the text and real joint positions. The resulting synthetic motions are then converted to virtual IMU signals using the same simulation stages as Pipeline 1.
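As an illustration of how a clip can be split into overlapping segments, the sketch below lists window boundaries for a 5-second window with a 2.5-second hop, the values passed to describe-windows in Step 2. The function name is hypothetical, and dropping a trailing partial window is an assumption of this sketch, not a documented behaviour of the tool.

```python
from typing import List, Tuple


def window_bounds(duration_sec: float, window_sec: float = 5.0,
                  hop_sec: float = 2.5) -> List[Tuple[float, float]]:
    """List (start, end) times of full windows that fit inside the clip."""
    if window_sec <= 0 or hop_sec <= 0:
        raise ValueError("window_sec and hop_sec must be positive")
    bounds = []
    index = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while index * hop_sec + window_sec <= duration_sec:
        start = index * hop_sec
        bounds.append((start, start + window_sec))
        index += 1
    return bounds


print(window_bounds(12.0))
```

With a hop of half the window length, consecutive windows overlap by 50%, so every moment of the clip away from its edges is described twice.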
Step 1 — Export real pose3d (same as Pipeline 1, pose_module env):
conda run -n pose_module python -m pose_module.robot_emotions export-pose3d \
--dataset-root data/RobotEmotions --domains 10ms 30ms \
--output-dir output/robot_emotions_pose3d \
--env-name openmmlab \
--no-debug-2d --no-debug-3d
Step 2 — Describe windows with Qwen3-VL (kimodo env):
conda run -n kimodo python -m robot_emotions_vlm describe-windows \
--pose3d-manifest-path output/robot_emotions_pose3d/pose3d_manifest.jsonl \
--output-dir output/robot_emotions_qwen_windows \
--window-sec 5.0 --window-hop-sec 2.5 --num-video-frames 48
Step 3 — Build anchor catalog (kimodo env):
Extracts ground trajectory and optional end-effector keyframes from the real pose3d to spatially constrain Kimodo generation.
conda run -n kimodo python -m robot_emotions_vlm build-anchor-catalog \
--pose3d-manifest-path output/robot_emotions_pose3d/pose3d_manifest.jsonl \
--qwen-window-catalog-path output/robot_emotions_qwen_windows/kimodo_window_prompt_catalog.jsonl \
--output-dir output/robot_emotions_kimodo_anchors \
--model Kimodo-SMPLX-RP-v1 \
--effector-keyframes 5
Step 4 — Generate with Kimodo (kimodo env):
conda run -n kimodo python -m robot_emotions_vlm generate-kimodo \
--model Kimodo-SMPLX-RP-v1 \
--catalog-path output/robot_emotions_kimodo_anchors/kimodo_anchor_catalog.jsonl \
--output-dir output/robot_emotions_kimodo
Each generated clip produces motion.npz + motion_amass.npz (SMPL-X) under output/robot_emotions_kimodo/<prompt_id>/.
Step 5 — Export virtual IMU (kimodo env):
Applies metric normalization → root estimation → IMUSim → geometric alignment. Percentile calibration is not applied here — it is deferred to evaluation time (per fold, restricted to training subjects) to avoid data leakage.
conda run -n kimodo python -m robot_emotions_vlm export-kimodo-virtual-imu \
--kimodo-manifest output/robot_emotions_kimodo/kimodo_generation_manifest.jsonl \
--output-dir output/robot_emotions_kimodo_imu \
--real-imu-root output/robot_emotions_virtual_imu
Each clip produces virtual_imu.npz (geometrically aligned, pre-calibration) under output/robot_emotions_kimodo_imu/<prompt_id>/virtual_imu/.
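The reason for deferring calibration can be made concrete with a small sketch: within each fold, the reference distribution is built only from real IMU samples of training subjects, so nothing from held-out subjects leaks into the calibrated signals. The record layout (subject and real keys), the function names, and the leave-one-subject-out split are all assumptions for illustration; the evaluation notebook may organize folds differently.

```python
from typing import Dict, Iterator, List, Tuple

Clip = Dict[str, object]  # hypothetical record: {"subject": str, "real": list of floats}


def subject_folds(clips: List[Clip]) -> Iterator[Tuple[str, List[Clip], List[Clip]]]:
    """Yield (held_out_subject, train_clips, test_clips), one fold per subject."""
    for subject in sorted({clip["subject"] for clip in clips}):
        train = [clip for clip in clips if clip["subject"] != subject]
        test = [clip for clip in clips if clip["subject"] == subject]
        yield subject, train, test


def training_reference(train_clips: List[Clip]) -> List[float]:
    """Pool real IMU samples from training clips only, sorted for quantile lookup."""
    return sorted(value for clip in train_clips for value in clip["real"])
```

In each fold, the sorted reference returned by training_reference would serve as the target distribution when calibrating both the training and the held-out virtual signals.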
Step 6 — Merge real + synthetic (kimodo env):
Combines the real-video manifest and the Kimodo manifest into a single JSONL for mixed training experiments.
conda run -n kimodo python -m robot_emotions_vlm export-mixed-virtual-imu \
--real-manifest output/robot_emotions_virtual_imu/virtual_imu_manifest.jsonl \
--synthetic-manifest output/robot_emotions_kimodo_imu/virtual_imu_manifest.jsonl \
--output-dir output/robot_emotions_mixed_imu
Kimodo can also be run directly from a text prompt:
conda run -n kimodo kimodo_gen "A person sits down and stands up" \
--model Kimodo-SMPLX-RP-v1 --duration 10.0 --output output/kimodo_direct/
@software{POSE2IMU,
author = {Parede, Henrique and Bonarini, Andrea and Dornhofer Paro Costa, Paula},
title = {POSE2IMU-Framework},
url = {https://github.com/H-IAAC/POSE2IMU-Framework}
}

- (2026-) Henrique Parede: Computer Engineering student, FEEC-UNICAMP
- (Advisor, 2026-) Andrea Bonarini: Professor, DEIB-POLIMI
- (Advisor, 2026-) Paula Dornhofer Paro Costa: Professor, FEEC-UNICAMP
This study was financed by the São Paulo Research Foundation (FAPESP), Brasil. Process Number 2025/21964-5.
This codebase was developed starting from a fork of IMUGPT by Leng et al. We are grateful for their foundational work, which made this project possible.
We also gratefully acknowledge the authors of the following open-source projects, which are integrated as submodules:
