Synthetic human dynamics aims to generate photorealistic videos of human subjects performing expressive, intention-driven motions. However, current approaches face two core challenges: (1) geometric inconsistency and coarse reconstruction, caused by limited 3D modeling and poor detail preservation; and (2) limited motion generalization and scene disharmony, stemming from weak generative capabilities. To address these challenges, we present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents: (1) the Reconstructor builds a 3D-consistent human-scene representation from monocular video using 3D Gaussian Splatting and deformation decomposition; (2) the Critique Agent enhances reconstruction fidelity by identifying and refining poorly reconstructed regions through multi-round MLLM-based reflection; (3) the Pose Guider enables motion generalization by generating expressive pose sequences with time-aware parametric encoders; and (4) the Video Harmonizer synthesizes photorealistic, coherent video through a hybrid rendering pipeline with diffusion, refining the Reconstructor via a Back-to-4D feedback loop. HumanGenesis achieves state-of-the-art performance on text-guided synthesis, video reenactment, and novel-pose generalization, significantly improving expressiveness, geometric fidelity, and scene integration.
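To make the collaboration among the four agents concrete, the following is a minimal sketch of how such a pipeline could be orchestrated: reconstruction, a critique-and-refine loop, pose generation, and harmonized rendering with feedback routed back to the Reconstructor. All class names, method signatures, and data structures here are illustrative assumptions; the paper does not expose an implementation or API.

```python
# Hypothetical orchestration of the four-agent pipeline; names and interfaces
# are assumptions for illustration, not the authors' actual implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SceneRepresentation:
    """Stand-in for the 3D Gaussian Splatting human-scene representation."""
    gaussians: list = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


class Reconstructor:
    def build(self, monocular_video) -> SceneRepresentation:
        # Fit Gaussians and decompose deformation from the input video (stub).
        return SceneRepresentation()

    def refine(self, scene: SceneRepresentation, feedback: List[str]) -> SceneRepresentation:
        # Update poorly reconstructed regions from critique or rendering feedback (stub).
        scene.notes.extend(feedback)
        return scene


class CritiqueAgent:
    def review(self, scene: SceneRepresentation) -> List[str]:
        # Multi-round MLLM-style reflection; returns no issues in this stub.
        return []


class PoseGuider:
    def generate_poses(self, instruction: str, num_frames: int) -> List[dict]:
        # Produce a time-aware pose sequence for the requested motion (stub).
        return [{"frame": t, "pose": None} for t in range(num_frames)]


class VideoHarmonizer:
    def render(self, scene: SceneRepresentation, poses: List[dict]):
        # Hybrid rendering plus diffusion refinement (stub); also emits
        # feedback that can flow back to the Reconstructor ("Back-to-4D").
        video = [f"frame_{p['frame']}" for p in poses]
        feedback = ["harmonizer: no inconsistencies detected"]
        return video, feedback


def human_genesis_pipeline(monocular_video, instruction: str,
                           num_frames: int = 16, critique_rounds: int = 3):
    reconstructor, critic = Reconstructor(), CritiqueAgent()
    pose_guider, harmonizer = PoseGuider(), VideoHarmonizer()

    # 1) Reconstruct, then iterate critique -> refine until no feedback remains.
    scene = reconstructor.build(monocular_video)
    for _ in range(critique_rounds):
        feedback = critic.review(scene)
        if not feedback:
            break
        scene = reconstructor.refine(scene, feedback)

    # 2) Generate an expressive pose sequence for the target motion.
    poses = pose_guider.generate_poses(instruction, num_frames)

    # 3) Render and harmonize; route rendering feedback back to the Reconstructor.
    video, render_feedback = harmonizer.render(scene, poses)
    reconstructor.refine(scene, render_feedback)
    return video


if __name__ == "__main__":
    frames = human_genesis_pipeline(monocular_video=None, instruction="wave both hands")
    print(len(frames), "frames rendered")
```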
Figure: Qualitative comparisons with state-of-the-art methods.