During 1-step inference, the input is pure noise and no "noisy image" actually exists yet, so the action branch is either guessing blindly or relying heavily on historical context. Have you run any ablation that completely removes or masks out the current-chunk video prediction branch during 1-step inference? Also, have you considered approaches such as MeanFlow or Consistency Models to rectify the 1-step vector field?