ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

ACD enables accurate and interpretable video synthesis by conditioning diffusion models on sparse, 3D-aware object layouts through direct attention supervision.

Abstract

Controllable generation is critical for the practical deployment of diffusion-based generative models, particularly in video synthesis where alignment with conditioning signals is essential. Existing approaches typically rely on indirect conditioning by learning the joint distribution \( p(x, c) \), where \( x \) is the generated sample and \( c \) is the conditioning input. However, such data-driven strategies offer limited enforcement of conditioning semantics and often fail to guarantee that generated content reflects the intended guidance. Classifier-based guidance enforces stronger conditioning but is prone to adversarial artifacts, while classifier-free guidance is empirically effective but offers limited interpretability and control precision. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves semantically grounded and precise generation. To support this, we introduce a sparse 3D-aware object layout as an efficient and interpretable conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing a robust and interpretable paradigm for conditional video synthesis.
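For context, the classifier-free guidance baseline that the abstract contrasts with steers sampling by extrapolating between conditional and unconditional noise estimates:

\[
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)
\]

Raising the guidance scale \( w \) strengthens adherence to \( c \) globally, but it cannot specify where in the frame a condition should apply; that spatial gap is what ACD's attention supervision targets.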

Method

Overview of our Attention-Conditional Diffusion (ACD) framework. The input video and its masked version are encoded into visual tokens, while the sparse 3D-aware object layout is converted into layout tokens. These tokens pass through stacked Attention-Conditional DiT blocks, where a router constraint supervises attention maps between masked and unmasked video tokens. Gradients from this constraint update the model parameters. A VAE decoder then reconstructs the video, enabling ACD to generate outputs that closely follow the given layouts and camera trajectories.
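The exact form of the router constraint is not spelled out in this overview, so the following is a minimal PyTorch sketch of one plausible instantiation, assuming the constraint penalizes divergence between a DiT block's attention map (queries from masked video tokens, keys from unmasked video tokens) and a layout-derived target routing distribution. All names and shapes here are illustrative assumptions, not the released API.

import torch
import torch.nn.functional as F

def attention_supervision_loss(q_masked, k_unmasked, target_routing, temperature=1.0):
    """Hypothetical sketch of ACD's router constraint (names are illustrative).

    q_masked:       (B, N_m, D) queries from masked video tokens
    k_unmasked:     (B, N_u, D) keys from unmasked video tokens
    target_routing: (B, N_m, N_u) layout-derived target attention; rows sum to 1
    """
    # Scaled dot-product attention map between masked queries and unmasked keys.
    logits = q_masked @ k_unmasked.transpose(-1, -2) / (q_masked.shape[-1] ** 0.5)
    attn = F.softmax(logits / temperature, dim=-1)  # (B, N_m, N_u)

    # Cross-entropy between predicted and target routing distributions;
    # gradients from this term flow back into the DiT block parameters.
    return -(target_routing * torch.log(attn + 1e-8)).sum(dim=-1).mean()

# Toy usage with dummy tensors (shapes are assumptions).
B, N_m, N_u, D = 2, 16, 64, 128
q = torch.randn(B, N_m, D, requires_grad=True)
k = torch.randn(B, N_u, D, requires_grad=True)
target = F.softmax(torch.randn(B, N_m, N_u), dim=-1)
loss = attention_supervision_loss(q, k, target)
loss.backward()

In practice this auxiliary term would be summed with the standard diffusion denoising loss, so the attention maps are shaped by the layout while the model still learns to reconstruct the video.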

Comparisons with State-of-the-Art Methods

Qualitative comparisons with state-of-the-art methods.

More Video Results

Given a single reference image and a sparse object layout with an associated camera trajectory, ACD generates videos that faithfully preserve structural semantics and exhibit precise camera motion.
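The layout schema itself is not published on this page; as a rough illustration only, a sparse 3D-aware object layout with an associated camera trajectory might be represented as below. Every field name here is a hypothetical assumption for exposition.

from dataclasses import dataclass, field

@dataclass
class ObjectBox3D:
    """One object's 3D box at a single sparse keyframe (hypothetical schema)."""
    label: str                            # semantic class, e.g. "car"
    center: tuple[float, float, float]    # world-space box center (x, y, z)
    size: tuple[float, float, float]      # box extents (width, height, depth)
    yaw: float                            # heading angle in radians

@dataclass
class LayoutFrame:
    """Sparse layout at one timestep, plus the camera pose for that frame."""
    t: int                                # frame index
    boxes: list[ObjectBox3D] = field(default_factory=list)
    camera_extrinsics: list[list[float]] = field(default_factory=list)  # 4x4 world-to-camera

# A two-keyframe trajectory; intermediate frames would be interpolated.
layout = [
    LayoutFrame(t=0,  boxes=[ObjectBox3D("car", (0.0, 0.0, 8.0),  (1.8, 1.5, 4.2), 0.0)]),
    LayoutFrame(t=48, boxes=[ObjectBox3D("car", (2.0, 0.0, 20.0), (1.8, 1.5, 4.2), 0.1)]),
]

Because the layout is sparse (a handful of keyframed boxes and poses rather than dense per-pixel maps), it stays cheap to annotate and easy to inspect, which is the interpretability argument made in the abstract.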

Long Camera Trajectories