This is the official repository for the paper:
MotionVLA: Vision-Language-Action Model for Humanoid Motion
Nonghai Zhang*, Siyu Zhai*, Zeyu Zhang*, and Hao Tang#
*Equal contribution. †Project lead. #Corresponding author.
Note
MotionVLA generates humanoid motion from a scene image and a text instruction by combining a Qwen3.5 autoregressive backbone with DSFT (Dual-Stream Frequency-domain Tokenizer), which decouples low-frequency pose semantics from high-frequency physical dynamics.
If you find our code or paper helpful, please consider starring us and citing:
@article{zhang2026motionvla,
title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
author={Zhang, Nonghai and Zhai, Siyu and Li, Yanjun and Zhang, Zeyu and Yin, Zhihan and Guo, Yandong and Shi, Boxin and Tang, Hao},
journal={arXiv preprint arXiv:2606.15142},
year={2026}
}
Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. Many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion shows a clear mismatch: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, biasing single-codebook quantization toward pose statistics and under-representing high-frequency velocity components.
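The mismatch can be reproduced qualitatively on a toy signal. The sketch below is not the repository's analysis script: it uses a hand-written orthonormal DCT-II and a synthetic one-dimensional trajectory (one slow cycle plus a small fast ripple, both chosen purely for illustration) and compares how much energy the first five coefficients hold for positions versus finite-difference velocities.

```python
import math

def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (O(n^2), fine for short clips)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        acc = sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                  for t, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * acc)
    return coeffs

def low_freq_energy_fraction(signal, k=5):
    """Share of total energy held by the first k DCT coefficients."""
    energy = [c * c for c in dct2(signal)]
    return sum(energy[:k]) / sum(energy)

# Hypothetical toy trajectory: one slow cycle plus a small 8-cycle ripple.
T = 64
position = [math.sin(2 * math.pi * t / T) + 0.2 * math.sin(2 * math.pi * 8 * t / T)
            for t in range(T)]
# Velocity as a forward finite difference of the position signal.
velocity = [position[t + 1] - position[t] for t in range(T - 1)]

pos_frac = low_freq_energy_fraction(position, k=5)
vel_frac = low_freq_energy_fraction(velocity, k=5)
print(f"position: {pos_frac:.2f}, velocity: {vel_frac:.2f}")
```

Differentiation scales each frequency component by its frequency, so the small ripple that barely registers in the position signal dominates the velocity signal. The exact percentages here depend entirely on the toy signal and are not the paper's numbers.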
To address this, we propose DSFT (Dual-Stream Frequency-domain Tokenizer), which separates motion into a Base stream (joint rotations + positions + root orientation/coordinates) and a Phys stream (joint velocities + root velocities), and compresses them independently with DCT truncation + BPE. We then present MotionVLA, a Qwen3.5-based autoregressive model that arranges the two streams in a unified sequence
[ M_BOS, b_1, …, b_N, M_SEP, p_1, …, p_M, M_EOS ]
so that Phys tokens are predicted after all Base tokens through causal attention, giving a hierarchical semantic-to-physical generation order.
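As a minimal sketch of this layout, the helpers below assemble a unified sequence from two token lists and split it back. The function names and the integer IDs chosen for the three markers are hypothetical; the real IDs depend on how the motion vocabulary is appended to the Qwen tokenizer.

```python
# Hypothetical IDs for the three motion markers (illustration only).
M_BOS, M_SEP, M_EOS = 0, 1, 2

def build_sequence(base_tokens, phys_tokens):
    """Lay out Base tokens first, then Phys tokens, framed by the three markers."""
    return [M_BOS, *base_tokens, M_SEP, *phys_tokens, M_EOS]

def split_sequence(seq):
    """Recover (base_tokens, phys_tokens) from a well-formed unified sequence."""
    if len(seq) < 3 or seq[0] != M_BOS or seq[-1] != M_EOS:
        raise ValueError("sequence must start with M_BOS and end with M_EOS")
    body = seq[1:-1]
    if body.count(M_SEP) != 1:
        raise ValueError("sequence must contain exactly one M_SEP")
    sep = body.index(M_SEP)
    return body[:sep], body[sep + 1:]

seq = build_sequence([10, 11, 12], [20, 21])
base, phys = split_sequence(seq)
print(seq, base, phys)
```

Because every Phys token sits to the right of every Base token, a causal decoder conditions each Phys prediction on the complete Base stream without any change to the attention pattern.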
Despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation.
- DSFT Tokenizer: Frequency-domain decomposition of motion into Base (semantic) and Phys (dynamic) streams with independent DCT truncation (default K_base=5, K_phys=25) and BPE compression.
- Qwen3.5 VLA Backbone: A standard autoregressive transformer extended with motion tokens, supporting scene-image + text conditioning.
- Unified Sequence: Base-then-Phys layout enables phase-aware causal generation; a logit mask enforces the BASE → SEP → PHYS → EOS order at inference.
- Two-Phase ms-swift Training: Phase 1 warms up new motion-token embeddings, Phase 2 runs LoRA SFT on the full backbone.
- Lightweight Default: 2B backbone is sufficient for SOTA-competitive results; ablations cover 0.8B / 2B / 4B / 9B.
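The phase-aware logit mask mentioned above can be sketched as a small state machine over the tokens generated so far. This is an illustrative reading, not the repository's implementation: the ID layout, the names `allowed_next` and `mask_logits`, and the rule that each phase must contain at least one token are all assumptions.

```python
NEG_INF = float("-inf")

# Hypothetical layout of the motion vocabulary (offsets within it only).
BASE_IDS = range(0, 4096)
PHYS_IDS = range(4096, 8192)
M_BOS, M_SEP, M_EOS = 8192, 8193, 8194
VOCAB_SIZE = 8195

def allowed_next(prefix):
    """Token IDs permitted after `prefix` under the BASE -> SEP -> PHYS -> EOS order."""
    if not prefix:
        return {M_BOS}
    last = prefix[-1]
    if last == M_EOS:
        return set()                      # generation is finished
    if M_SEP in prefix:                   # Phys phase
        allowed = set(PHYS_IDS)
        if last != M_SEP:                 # assumed: at least one Phys token before EOS
            allowed.add(M_EOS)
        return allowed
    allowed = set(BASE_IDS)               # Base phase
    if last != M_BOS:                     # assumed: at least one Base token before SEP
        allowed.add(M_SEP)
    return allowed

def mask_logits(logits, prefix):
    """Set the logits of disallowed tokens to -inf so they can never be sampled."""
    allowed = allowed_next(prefix)
    return [x if i in allowed else NEG_INF for i, x in enumerate(logits)]

# After M_SEP, Base tokens are masked out and Phys tokens stay available.
masked = mask_logits([0.0] * VOCAB_SIZE, [M_BOS, 5, M_SEP])
print(masked[5], masked[4096])
```

In a real decoding loop the same rule would be applied to the motion slice of the full vocabulary at every step, before sampling.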
2026/06/01: Project website is live at aigeeksgroup.github.io/MotionVLA.
2026/06/01: Code is released; models and tokenizer checkpoints will be uploaded to HuggingFace shortly.
Important
We are actively developing and improving MotionVLA. Stay tuned for updates!
- Release MotionVLA training code (ms-swift two-phase pipeline)
- Release DSFT tokenizer training & inference code
- Release dual-stream frequency analysis scripts
- Upload paper to arXiv and finalize project page
- Release pre-trained MotionVLA checkpoints (2B / 4B / 9B) on HuggingFace
- Release the ViMoGen-derived training JSONL on HuggingFace
- Release motion visualization & MuJoCo simulation toolkit
- Add interactive demo on HuggingFace Spaces
Scene Image + Text Instruction
              │
              ▼
Qwen3.5 (default 2B)                  ┌── new motion vocabulary ──┐
Vocabulary V = V_LM + V_motion + 3    │ Base : 4096 tokens        │
                                      │ Phys : 4096 tokens        │
                                      │ M_BOS / M_SEP / M_EOS     │
                                      └───────────────────────────┘
              │
              ▼  masked next-token prediction
[ M_BOS, b_1, …, b_N, M_SEP, p_1, …, p_M, M_EOS ]
              │
              ▼  phase-aware logit mask at inference
BPE⁻¹ + IDCT (per-stream)
              │
              ▼
Reconstructed motion ∈ ℝ^{T × D}
DSFT decomposes motion into two streams via DCT (HumanML3D 263-dim → Base 190 / Phys 73; ViMoGen 276-dim → Base 201 / Phys 75):
- Base stream: joint rotations + positions + root orientation/coordinates, carrying low-frequency pose semantics (~86–93% of energy in the first 5 DCT coefficients).
- Phys stream: joint velocities + root velocities, carrying high-frequency physical dynamics (only ~37% of energy in the first 5 coefficients).
Each stream is DCT-truncated (K_base=5, K_phys=25) and BPE-encoded independently, yielding two compact discrete vocabularies.
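To illustrate why the two streams get different truncation lengths, the sketch below truncates a toy channel at K=5 and K=25 and measures the relative RMSE after the inverse DCT. It assumes the DCT runs along the time axis of each feature channel, and it omits the quantization and BPE stages entirely. The channel is built from two exact DCT basis functions (indices 2 and 16) so the effect is easy to see; all names are illustrative.

```python
import math

def dct2(signal):
    """Orthonormal DCT-II along time."""
    n = len(signal)
    return [(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
            * sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                  for t, x in enumerate(signal))
            for k in range(n)]

def idct2(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    return [sum((math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
                * c * math.cos(math.pi * (t + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for t in range(n)]

def truncate_and_reconstruct(signal, k):
    """Keep the first k coefficients, zero the rest, and invert."""
    coeffs = dct2(signal)
    kept = coeffs[:k] + [0.0] * (len(coeffs) - k)
    return idct2(kept)

def rrmse(reference, estimate):
    """Relative root-mean-square error of a reconstruction."""
    num = sum((a - b) ** 2 for a, b in zip(reference, estimate))
    den = sum(a * a for a in reference)
    return math.sqrt(num / den)

# Hypothetical Phys-like channel: a slow term plus a stronger fast term.
T = 64
channel = [0.10 * math.cos(2 * math.pi * (t + 0.5) / T)
           + 0.15 * math.cos(2 * math.pi * 8 * (t + 0.5) / T)
           for t in range(T)]

err_k5 = rrmse(channel, truncate_and_reconstruct(channel, 5))
err_k25 = rrmse(channel, truncate_and_reconstruct(channel, 25))
print(f"K=5: {err_k5:.3f}  K=25: {err_k25:.3f}")
```

With K=5 the fast component is discarded and most of the signal is lost, while K=25 retains it. A pose-like channel dominated by its slow term would already reconstruct well at K=5.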
MotionVLA/
├── tokenizer/
│   ├── ds_fast_tokenizer.py        # DSFT dual-stream tokenizer (DSFT class)
│   ├── train_tokenizer.py          # Train DSFT from raw motion data
│   ├── tokenize_dataset.py         # Batch tokenize a motion dataset
│   └── 276to263/                   # 276-dim → 263-dim conversion utilities
├── training/
│   ├── prepare_swift_data.py       # Convert tokenized dataset → ms-swift JSONL
│   ├── train_swift_phase1.sh       # Phase 1: motion-token embed warmup
│   ├── train_swift_phase2.sh       # Phase 2: LoRA SFT
│   └── train_swift_h100.sh         # Combined Phase 1 + Phase 2 (H100)
├── analysis/                       # Frequency analysis & DSFT reconstruction quality
│   ├── freq_analysis_combined.py
│   └── dsfast/                     # Per-dim low-freq ratio, energy coverage, rRMSE
├── theory/                         # Supporting theoretical analyses
│   ├── theory1_dualstream.py
│   ├── theory3_tokenizer.py
│   └── theory4_dualstream_vs_single.py
├── docs/
│   ├── ARCHITECTURE.md
│   ├── DATA_FORMAT.md
│   └── TRAINING.md
├── requirements.txt
└── README.md                       # This file
Tested with CUDA 11.8 / 12.x and Python 3.10:
conda create -n motionvla python=3.10
conda activate motionvla
# PyTorch (pick the build matching your CUDA)
pip install "torch>=2.1.0" torchvision --index-url https://download.pytorch.org/whl/cu118
# Other dependencies
pip install -r requirements.txt
Key dependencies (see requirements.txt):
- torch>=2.1.0
- transformers>=4.45.0
- peft>=0.12.0
- ms-swift>=2.0.0
- qwen-vl-utils
- tokenizers, scipy
# Qwen3.5 backbone (replace size with the one you want to use)
huggingface-cli download <qwen3.5-checkpoint-name> --local-dir checkpoints/Qwen3.5-VL-8B
# (Once released) DSFT tokenizer & MotionVLA checkpoints
huggingface-cli download AIGeeksGroup/MotionVLA --local-dir checkpoints/MotionVLA
The dataset JSON format and motion .pt layout are described in docs/DATA_FORMAT.md. At a high level:
- Each entry has id, text, motion_path, and an optional image_path.
- Motion .pt files contain a tokenized sequence seq = [BOS, base…, SEP, phys…, EOS] produced by DSFT.
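A quick sanity check of the dataset JSON can catch missing fields before tokenization or training. The sketch below assumes the file is a top-level JSON list of entries; the helper name, the example paths, and the captions are made up for illustration, and docs/DATA_FORMAT.md remains the authoritative description.

```python
import json

REQUIRED_FIELDS = ("id", "text", "motion_path")   # image_path is optional

def validate_entry(entry):
    """Return a list of problems found in one dataset entry (empty if it looks fine)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if "motion_path" in entry and not str(entry["motion_path"]).endswith(".pt"):
        problems.append("motion_path should point to a .pt file")
    return problems

# Hypothetical entries; the second one lacks motion_path on purpose.
raw = """[
  {"id": "000001", "text": "a person walks forward", "motion_path": "data/motions/000001.pt"},
  {"id": "000002", "text": "a person sits down", "image_path": "data/images/000002.jpg"}
]"""
entries = json.loads(raw)
report = {entry["id"]: validate_entry(entry) for entry in entries}
print(report)
```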
Train DSFT on raw motion data:
python tokenizer/train_tokenizer.py \
--motiondata_root data/motions \
--output_dir tokenizer/checkpoints \
--K_base 5 --K_phys 25 \
--base_vocab 4096 --phys_vocab 2048
Then tokenize a dataset:
python tokenizer/tokenize_dataset.py \
--json data/dataset.json \
--motiondata data/motions \
--tok_dir tokenizer/checkpoints \
--out_dir data/motions_tokenized \
--out_json data/dataset_tokenized.json \
--workers 4
The official training pipeline uses ms-swift and runs in two phases.
Step 1 - Prepare ms-swift JSONL
python training/prepare_swift_data.py \
--json data/dataset_tokenized.json \
--root . \
--out data/swift \
--split 0.9
This writes train.jsonl, val.jsonl, and motion_tokens.txt (the new motion-token vocabulary) into data/swift/.
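For intuition, a 0.9 split followed by JSONL export could look like the sketch below. It is a guess at the general shape, not the behavior of prepare_swift_data.py: whether that script shuffles, which seed it uses, and the record schema it emits are not stated here, so the records and helper names are placeholders.

```python
import json
import os
import random
import tempfile

def split_entries(entries, train_ratio=0.9, seed=0):
    """Shuffle deterministically, then cut into (train, val) lists."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")

# Placeholder records; the real schema is defined by prepare_swift_data.py.
entries = [{"id": f"{i:06d}", "text": f"caption {i}"} for i in range(10)]
train, val = split_entries(entries, train_ratio=0.9)

out_dir = tempfile.mkdtemp()   # stands in for data/swift/
write_jsonl(os.path.join(out_dir, "train.jsonl"), train)
write_jsonl(os.path.join(out_dir, "val.jsonl"), val)
print(len(train), len(val))
```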
Step 2 - Phase 1: Embed warmup
Freeze all transformer layers; train only embed_tokens and lm_head rows for the new motion tokens (LR 1e-3, ~500 steps):
bash training/train_swift_phase1.sh
Outputs land under checkpoints/phase1_embed/.
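The idea behind the warmup, updating only the rows that belong to newly added tokens, can be shown without any framework. The sketch below is a conceptual illustration with plain lists and a single SGD step; it is not how ms-swift implements the phase, and the function name and the tiny matrix are hypothetical.

```python
def masked_sgd_step(embeddings, grads, first_new_row, lr=1e-3):
    """Apply SGD only to rows >= first_new_row; pretrained rows stay frozen."""
    updated = [list(row) for row in embeddings]
    for row in range(first_new_row, len(embeddings)):
        updated[row] = [w - lr * g for w, g in zip(embeddings[row], grads[row])]
    return updated

# Hypothetical 5 x 2 embedding table: 3 pretrained rows, then 2 new motion-token rows.
embeddings = [[1.0, 1.0] for _ in range(3)] + [[0.0, 0.0] for _ in range(2)]
grads = [[1.0, -1.0] for _ in range(5)]

after = masked_sgd_step(embeddings, grads, first_new_row=3, lr=1e-3)
print(after)
```

The pretrained rows receive the same gradient but are left untouched, which is what keeps the language model's existing vocabulary stable while the motion tokens find useful embeddings.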
Step 3 - Phase 2: LoRA SFT
Load the Phase 1 checkpoint and run LoRA SFT (rank=32, alpha=64, target_modules=all-linear, LR=2e-4, 3 epochs):
bash training/train_swift_phase2.sh
Outputs land under checkpoints/swift_lora/.
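To give a sense of scale for rank=32 and alpha=64, the arithmetic below counts the trainable parameters LoRA adds to one linear layer and the scaling applied to the low-rank update under the standard alpha / rank convention. The 2048 × 2048 layer size is a hypothetical example, not a figure taken from the Qwen3.5 configuration.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters LoRA adds to one d_in x d_out linear layer (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)

def lora_scaling(alpha, rank):
    """Standard LoRA scaling factor applied to the update B @ A."""
    return alpha / rank

RANK, ALPHA = 32, 64
HIDDEN = 2048                      # hypothetical layer width

extra = lora_extra_params(HIDDEN, HIDDEN, RANK)
full = HIDDEN * HIDDEN
scaling = lora_scaling(ALPHA, RANK)
print(extra, full, extra / full, scaling)
```

For this example layer the adapter trains about 3% as many parameters as the frozen weight it modifies, and the update is scaled by 2.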
Combined H100 recipe (Phase 1 + Phase 2 in one script):
bash training/train_swift_h100.sh
For configuration details and hyper-parameters see docs/TRAINING.md.
| Method | M-C Cons. | M-Gen. | Jitter | Dynamic | F-Float | F-Slide | B-Pen | P-Qual |
|---|---|---|---|---|---|---|---|---|
| MDM (ICLR'23) | 0.42 | 0.51 | 0.0136 | 0.0376 | 0.156 | 0.0136 | 1.68 | 2.67 |
| T2M-GPT (CVPR'23) | 0.39 | 0.38 | 0.0156 | 0.0349 | 0.209 | 0.0156 | 1.33 | 2.43 |
| MoMask (CVPR'24) | 0.38 | 0.44 | 0.0147 | 0.0396 | 0.178 | 0.0147 | 1.48 | 2.67 |
| MotionDiffuse (TPAMI'24) | 0.44 | 0.42 | 0.0111 | 0.0289 | 0.126 | 0.0063 | 1.35 | 2.21 |
| MotionLCM (ECCV'24) | 0.48 | 0.55 | 0.0218 | 0.0439 | 0.193 | 0.0202 | 1.73 | 2.40 |
| FineMoGen (NeurIPS'24) | 0.37 | 0.42 | 0.0118 | 0.0386 | 0.281 | 0.0091 | 1.18 | 2.28 |
| MotionCraft (CVPR'25) | 0.42 | 0.45 | 0.0132 | 0.0420 | 0.402 | 0.0090 | 1.15 | 2.12 |
| ViMoGen (ICLR'26) | 0.53 | 0.68 | 0.0108 | 0.0251 | 0.204 | 0.0064 | 1.78 | 2.38 |
| ViMoGen-light (ICLR'26) | 0.47 | 0.55 | 0.0129 | 0.0294 | 0.155 | 0.0051 | 1.43 | 2.10 |
| MotionVLA (Ours) † | 0.55 | 0.66 | 0.0110 | 0.0419 | 0.149 | 0.0049 | 1.34 | 2.14 |
†: uses additional visual (scene) input. Best in bold.
| Method | R-Top1 ↑ | R-Top2 ↑ | R-Top3 ↑ | FID ↓ | MM-Dist ↓ | Diversity → | MModality ↑ |
|---|---|---|---|---|---|---|---|
| Real | 0.511 | 0.703 | 0.797 | 0.002 | 2.974 | 9.503 | - |
| TM2T (ECCV'22) | 0.424 | 0.618 | 0.729 | 1.501 | 3.467 | 8.589 | 2.424 |
| MDM (ICLR'23) | - | - | 0.611 | 0.544 | 5.566 | - | 2.799 |
| MotionDiffuse (TPAMI'24) | 0.491 | 0.681 | 0.782 | 0.630 | 3.113 | 9.410 | 1.553 |
| T2M-GPT (CVPR'23) | 0.492 | 0.679 | 0.775 | 0.141 | 3.121 | 9.722 | 1.831 |
| FineMoGen (NeurIPS'24) | 0.504 | 0.690 | 0.784 | 0.151 | 2.998 | 9.263 | 2.696 |
| MoMask (CVPR'24) | 0.521 | 0.713 | 0.807 | 0.045 | 2.958 | - | 1.241 |
| DisCoRD (ICCV'25) | 0.524 | 0.715 | 0.809 | 0.032 | 2.938 | - | 1.288 |
| GenM3 (ICCV'25) | 0.511 | 0.705 | 0.804 | 0.046 | 2.852 | 9.675 | - |
| MG-MotionLLM (CVPR'25) | 0.516 | 0.706 | 0.802 | 0.303 | 2.952 | 9.960 | 2.125 |
| MotionVLA (Ours) | 0.507 | 0.699 | 0.798 | 0.071 | 2.906 | 9.548 | 2.821 |
MotionVLA achieves the Diversity score closest to real data and the highest MModality among generated methods, despite using only a 2B backbone.
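The Diversity comparison can be checked directly from the table. The snippet below uses only the Diversity values listed above (methods without a reported value are left out) and measures each method's absolute gap to the real-data score of 9.503. Reading the "over 50%" reduction as a comparison against the closest baseline in this table is an assumption.

```python
REAL_DIVERSITY = 9.503

# Diversity values copied from the HumanML3D table above.
diversity = {
    "TM2T": 8.589,
    "MotionDiffuse": 9.410,
    "T2M-GPT": 9.722,
    "FineMoGen": 9.263,
    "GenM3": 9.675,
    "MG-MotionLLM": 9.960,
    "MotionVLA (Ours)": 9.548,
}

# Absolute distance to the real-data Diversity score (smaller is better).
gaps = {name: abs(value - REAL_DIVERSITY) for name, value in diversity.items()}
ranked = sorted(gaps, key=gaps.get)
best, runner_up = ranked[0], ranked[1]
reduction = 1 - gaps[best] / gaps[runner_up]

print(f"{best}: gap {gaps[best]:.3f}")
print(f"{runner_up}: gap {gaps[runner_up]:.3f}")
print(f"gap reduction vs. runner-up: {reduction:.1%}")
```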
| Backbone | Params | M-C Cons. | M-Gen. | Jitter | F-Slide |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | 0.51 | 0.60 | 0.0122 | 0.0058 |
| Qwen3.5-2B (default) | 2B | 0.55 | 0.66 | 0.0110 | 0.0049 |
| Qwen3.5-4B | 4B | 0.55 | 0.66 | 0.0109 | 0.0049 |
| Qwen3.5-9B | 9B | 0.56 | 0.68 | 0.0107 | 0.0047 |
- Animation & content creation: Text- and image-conditioned humanoid motion for previs, games, and film.
- Robotics & embodied AI: Vision-language motion priors for humanoid policies; deployment on platforms such as Unitree G1.
- AR / VR & interactive media: Avatar animation from natural-language prompts and scene context.
- Research: Studying disentanglement of semantic intent vs. physical dynamics, and frequency-aware motion tokenization.
We welcome contributions. Please feel free to:
- Report bugs and issues
- Submit pull requests
- Suggest new features
- Share your results and applications
This project is released under the MIT License. See LICENSE for details.
We thank the authors of the following projects for their open-source contributions:
- Qwen for the Qwen3.5 backbone
- ms-swift for the training framework
- PEFT for LoRA implementations
- HumanML3D and ViMoGen for datasets and benchmarks
- The motion-generation research community for prior work on motion tokenization
For questions and discussions, please:
- Open an issue on GitHub
- Visit our project website
- Browse our models on HuggingFace