Controllable generation is critical for the practical deployment of diffusion-based generative models, particularly in video synthesis, where alignment with conditioning signals is essential. Existing approaches typically rely on indirect conditioning by learning the joint distribution \( p(x, c) \), where \( x \) is the generated sample and \( c \) is the conditioning input. However, such data-driven strategies enforce conditioning semantics only weakly and often fail to guarantee that the generated content reflects the intended guidance. Classifier-based guidance imposes stronger conditioning but is prone to adversarial artifacts, while classifier-free guidance is empirically effective yet offers limited interpretability and control precision. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model’s attention maps with external control signals, ACD achieves semantically grounded and precise generation. To support this, we introduce a sparse 3D-aware object layout as an efficient and interpretable conditioning signal, together with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing a robust and interpretable paradigm for conditional video synthesis.
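To make the attention-supervision idea concrete, the following is a minimal PyTorch sketch of one possible formulation: an auxiliary term added to the standard denoising objective that rewards cross-attention mass falling inside a region rendered from the layout. The names and shapes (attn_maps, layout_mask, lambda_attn) are assumptions for illustration, not the paper's exact loss.

```python
# Illustrative sketch only (assumed shapes and names, not the paper's exact formulation):
# supervise cross-attention maps with a layout-derived target region.
import torch
import torch.nn.functional as F

def attention_supervision_loss(attn_maps: torch.Tensor,
                               layout_mask: torch.Tensor,
                               eps: float = 1e-6) -> torch.Tensor:
    """
    attn_maps:   (B, T, H*W) cross-attention weights for the conditioning token(s),
                 one map per frame, averaged over heads.
    layout_mask: (B, T, H*W) binary mask rendered from the sparse 3D object layout
                 (1 inside the object's projected region, 0 elsewhere).
    Returns a scalar loss that is small when attention mass lies inside the mask.
    """
    # Normalize each attention map into a distribution over spatial positions.
    attn = attn_maps / (attn_maps.sum(dim=-1, keepdim=True) + eps)
    # Fraction of attention mass inside the layout region; penalize mass outside it.
    inside = (attn * layout_mask).sum(dim=-1)
    return (1.0 - inside).mean()

def training_loss(noise_pred, noise_target, attn_maps, layout_mask, lambda_attn=0.1):
    # Standard epsilon-prediction objective plus the attention-alignment term.
    diff_loss = F.mse_loss(noise_pred, noise_target)
    attn_loss = attention_supervision_loss(attn_maps, layout_mask)
    return diff_loss + lambda_attn * attn_loss
```

Because the term is differentiable through the attention weights, it can be minimized jointly with the denoising loss during fine-tuning, directly steering where the model attends rather than relying on the data distribution alone.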
Figure: Qualitative comparisons with state-of-the-art methods. Given a single reference image and a sparse object layout with an associated camera trajectory, ACD generates videos that faithfully preserve structural semantics and exhibit precise camera motion.