ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework
for Image-Based Virtual Try-On

Korea Advanced Institute of Science and Technology (KAIST)
CVPR 2025

Abstract

This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all hidden-layer features from the image encoder into a single feature of the same size, guided by the diffusion timestep and the complexity of the garment image. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details depending on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex regions of the garment and provide high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy enhances the preservation of fine details in highly salient garment regions while optimizing computational resources by avoiding unnecessary high-resolution processing of the entire garment image. Comparative evaluations confirm that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.

Overview

Overview of the ITA-MDT Framework for Image-Based Virtual Try-On (IVTON).
The framework takes multiple reference images encoded into latent space as the query, which include the Garment Agnostic Map \( A \), DensePose \( P \), and Garment Agnostic Mask \( M_X \).
These reference latents are concatenated and patch-embedded for the masked diffusion process within our MDT-IVTON, which follows the architecture of MDTv2 with integrated cross-attention blocks.
The image feature of Garment \( X \) is extracted with the ViT image encoder DINOv2 and adaptively aggregated with our proposed Image-Timestep Adaptive Feature Aggregator (ITAFA) to produce the Garment Feature \( F_g \).
With our Salient Region Extractor (SRE), the Salient Region \( X_s \) is extracted from Garment \( X \) and processed through ITAFA separately to produce the Salient Feature \( F_s \). The Garment Feature \( F_g \) and Salient Feature \( F_s \) are concatenated to serve as the conditions of MDT-IVTON, as sketched below.
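The following is a minimal, illustrative sketch of how this conditioning pipeline fits together. The module names and call signatures (`dinov2`, `itafa`, `sre`, `mdt_ivton`) are assumptions for illustration, and the reference inputs \( A \), \( P \), \( M_X \) are assumed to be pre-encoded latents; the released implementation may differ.

```python
# Illustrative sketch only: module names and signatures are assumptions,
# not the paper's released code. `dinov2` is a ViT encoder returning the
# stacked hidden-layer embeddings; `itafa` and `sre` follow the text above.
import torch

def build_condition(dinov2, itafa, sre, garment, t_emb):
    """Build the MDT-IVTON cross-attention context [F_g ; F_s] for Garment X."""
    f_layers = dinov2(garment)             # hidden-layer embeddings {f_i}, i = 0..h
    F_g = itafa(f_layers, t_emb, garment)  # Garment Feature F_g

    x_s = sre(garment)                     # Salient Region X_s (high-entropy crop)
    F_s = itafa(dinov2(x_s), t_emb, x_s)   # Salient Feature F_s

    return torch.cat([F_g, F_s], dim=1)    # condition for cross-attention

def denoise_step(mdt_ivton, z_t, t, agnostic, densepose, mask, cond):
    """One denoising step: the query concatenates the reference latents."""
    query = torch.cat([z_t, agnostic, densepose, mask], dim=1)  # A, P, M_X
    return mdt_ivton(query, timestep=t, context=cond)
```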

ITAFA & SRE

Illustration of the Image-Timestep Adaptive Feature Aggregator (ITAFA) and the Salient Region Extractor (SRE).
Left: The ITAFA dynamically aggregates Vision Transformer (ViT) feature embeddings \( f \) based on a combination of the Timestep Embedding Projector, which projects the diffusion timestep embedding \( T \) to match the feature embedding dimensions, and the Image Complexity Projector, which transforms the image complexity vector \( [S, V, G] \) (sparsity, variance, gradient magnitude) into a comparable dimension. The weight vectors are combined and normalized via softmax to form \( W \), which is used to adaptively aggregate the feature embeddings \( \{f_i\}_{i=0}^{h} \) across hidden layers and produce the final output tensor \( F \). Garment \( X \) and Salient Region \( X_s \) are processed separately through ITAFA to generate the Garment Feature \( F_g \) and Salient Feature \( F_s \) (see the ITAFA sketch below).
Right: The SRE processes the input image \( X \) by computing an entropy map \( X_e \), creating a binary High-Entropy Mask \( X_m \), and applying circular region expansion from the entropy centroid to extract the final high-entropy region \( X_s \), ensuring preservation of detail within a consistent aspect ratio (see the SRE sketch below).
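As a concrete reference for the Left panel, here is a minimal PyTorch sketch of ITAFA under stated assumptions: `feats` stacks the \( h+1 \) hidden-layer embeddings, the sparsity/variance/gradient statistics are simple proxies, and the projector widths and names are illustrative rather than the paper's exact implementation.

```python
# Minimal ITAFA sketch (illustrative layer sizes and statistics, not the
# paper's exact implementation). Two projectors map the timestep embedding
# and the image complexity vector [S, V, G] to per-layer weights, which are
# combined, softmax-normalized into W, and used to aggregate {f_i}.
import torch
import torch.nn as nn
import torch.nn.functional as F

def image_complexity(x: torch.Tensor) -> torch.Tensor:
    """[S, V, G]: sparsity, variance, and mean gradient magnitude per image."""
    s = (x.abs() < 1e-2).float().mean(dim=(1, 2, 3))          # sparsity proxy
    v = x.var(dim=(1, 2, 3))                                   # variance
    gy = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean(dim=(1, 2, 3))
    gx = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean(dim=(1, 2, 3))
    return torch.stack([s, v, gx + gy], dim=-1)                # (B, 3)

class ITAFA(nn.Module):
    def __init__(self, num_layers: int, t_dim: int, hidden: int = 128):
        super().__init__()
        # Timestep Embedding Projector: T -> per-layer weight logits.
        self.t_proj = nn.Sequential(nn.Linear(t_dim, hidden), nn.SiLU(),
                                    nn.Linear(hidden, num_layers))
        # Image Complexity Projector: [S, V, G] -> per-layer weight logits.
        self.c_proj = nn.Sequential(nn.Linear(3, hidden), nn.SiLU(),
                                    nn.Linear(hidden, num_layers))

    def forward(self, feats: torch.Tensor, t_emb: torch.Tensor,
                image: torch.Tensor) -> torch.Tensor:
        # feats: (B, L, N, D) -- L = h+1 hidden layers of N tokens each.
        w = self.t_proj(t_emb) + self.c_proj(image_complexity(image))
        w = F.softmax(w, dim=-1)                               # W: (B, L)
        return torch.einsum('bl,blnd->bnd', w, feats)          # F: (B, N, D)
```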
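And for the Right panel, a rough sketch of SRE using a local Shannon entropy map from scikit-image; the disk radius, the entropy quantile threshold, and the square crop standing in for the circular region expansion are illustrative hyperparameters, not the paper's exact procedure.

```python
# Rough SRE sketch (assumed hyperparameters): local entropy map X_e, binary
# High-Entropy Mask X_m, then a crop expanded from the entropy centroid.
import numpy as np
from skimage.color import rgb2gray
from skimage.filters.rank import entropy
from skimage.morphology import disk
from skimage.util import img_as_ubyte

def salient_region(img_rgb: np.ndarray, thresh_q: float = 0.8) -> np.ndarray:
    """Crop the high-entropy region X_s from garment image X."""
    x_e = entropy(img_as_ubyte(rgb2gray(img_rgb)), disk(5))   # entropy map X_e
    x_m = x_e > np.quantile(x_e, thresh_q)                    # binary mask X_m

    # Entropy centroid of the masked region.
    ys, xs = np.nonzero(x_m)
    cy, cx = int(ys.mean()), int(xs.mean())

    # Expand a circle from the centroid until it covers the masked pixels,
    # then take the enclosing square crop to keep a consistent aspect ratio.
    r = int(np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2).max())
    h, w = x_m.shape
    y0, y1 = max(cy - r, 0), min(cy + r, h)
    x0, x1 = max(cx - r, 0), min(cx + r, w)
    return img_rgb[y0:y1, x0:x1]
```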

Qualitative Comparisons

Citation

Citation TBA

Acknowledgements

This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II211381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments) and partly supported by an IITP grant funded by the Korea government (MSIT) (No. RS-2022-II220184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).