HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics

Weiqi Li1, Zehao Zhang2, Liang Lin1,3,4, Guangrun Wang1,3,4
1Sun Yat-sen University, 2Yonsei University, 3X-Era AI Lab, 4Guangdong Key Laboratory of Big Data Analysis and Processing

Compelling applications of our proposed method in human dynamics synthesis highlight four key strengths: temporal consistency, geometric plausibility, expressive motion handling, and seamless integration into target scenes. These capabilities enable a wide range of applications, including but not limited to: (a) animating characters with novel motions—either drawn from motion capture datasets such as AMASS or synthesized from textual descriptions; and (b) video reenactment, where characters are smoothly inserted into target scenes and animated according to predefined motion trajectories.

Abstract

Synthetic human dynamics aims to generate photorealistic videos of human subjects performing expressive, intention-driven motions. However, current approaches face two core challenges: (1) geometric inconsistency and coarse reconstruction, due to limited 3D modeling and detail preservation; and (2) limited motion generalization and scene inharmonization, stemming from weak generative capabilities. To address these, we present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents: (1) Reconstructor builds 3D-consistent human-scene representations from monocular video using 3D Gaussian Splatting and deformation decomposition. (2) Critique Agent enhances reconstruction fidelity by identifying and refining poorly reconstructed regions via multi-round MLLM-based reflection. (3) Pose Guider enables motion generalization by generating expressive pose sequences using time-aware parametric encoders. (4) Video Harmonizer synthesizes photorealistic, coherent video via a hybrid rendering pipeline with diffusion, refining the Reconstructor through a Back-to-4D feedback loop. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization, significantly improving expressiveness, geometric fidelity, and scene integration.

Method

The Reconstructor first recovers the 3D human and scene from monocular video by decomposing motion into rigid and non-rigid deformations. The Critique Agent then evaluates the rendered outputs to identify and refine low-quality regions, enabling fine-grained reconstruction. Next, the Pose Guider generates temporally aware embeddings from novel parametric pose sequences using a time-aware encoder, allowing expressive motion synthesis. Finally, the Video Harmonizer applies Spatial Feature Transform (SFT) within a video diffusion pipeline to produce photorealistic sequences and feeds its outputs back to the Reconstructor through the Back-to-4D loop. Illustrative sketches of each stage follow below.
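As a concrete illustration of the Reconstructor's deformation decomposition, the sketch below deforms canonical Gaussian centers with a linear-blend-skinning rigid term plus a pose-conditioned non-rigid offset predicted by a small MLP. The module name, shapes, and network sizes are our assumptions for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DeformationDecomposition(nn.Module):
    """Illustrative split of a Gaussian's motion into a rigid (skinning)
    term and a learned non-rigid residual, as in the Reconstructor."""

    def __init__(self, num_joints: int = 24, hidden: int = 128):
        super().__init__()
        # Non-rigid branch: canonical position + pose -> small residual offset.
        self.offset_mlp = nn.Sequential(
            nn.Linear(3 + num_joints * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x_canon, skin_weights, bone_transforms, pose):
        """
        x_canon:         (N, 3)    canonical Gaussian centers
        skin_weights:    (N, J)    per-Gaussian skinning weights
        bone_transforms: (J, 4, 4) per-bone rigid transforms for this frame
        pose:            (J * 3,)  flattened pose parameters (e.g. axis-angle)
        """
        N = x_canon.shape[0]
        # Rigid term: blend bone transforms with skinning weights (LBS).
        T = torch.einsum("nj,jab->nab", skin_weights, bone_transforms)  # (N,4,4)
        ones = torch.ones(N, 1, device=x_canon.device)
        x_h = torch.cat([x_canon, ones], dim=-1)                        # (N,4)
        x_rigid = torch.einsum("nab,nb->na", T, x_h)[:, :3]
        # Non-rigid term: pose-conditioned residual in canonical space.
        pose_in = pose.unsqueeze(0).expand(N, -1)
        offset = self.offset_mlp(torch.cat([x_canon, pose_in], dim=-1))
        return x_rigid + offset
```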
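The Critique Agent's multi-round reflection can be pictured as the loop below. `mllm_critique` and `refine_region` are hypothetical placeholders, not real APIs from the paper or any library; only the render-critique-refine control flow is taken from the text.

```python
# Hypothetical skeleton of the Critique Agent's multi-round reflection loop.
# `mllm_critique` and `refine_region` stand in for an MLLM call and a local
# re-optimization step; both are assumptions for illustration.

def critique_loop(render_fn, refine_region, mllm_critique,
                  max_rounds: int = 3, quality_threshold: float = 0.9):
    """Render, ask the MLLM to flag low-quality regions, refine, repeat."""
    for _ in range(max_rounds):
        frame = render_fn()
        # The MLLM returns (region, score) pairs; low scores mean poor fidelity.
        critiques = mllm_critique(frame)
        bad_regions = [r for r, score in critiques if score < quality_threshold]
        if not bad_regions:
            break  # reconstruction judged acceptable; stop reflecting
        for region in bad_regions:
            refine_region(region)  # e.g. densify Gaussians / re-optimize locally
    return render_fn()
```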
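For the Pose Guider, one plausible reading of a time-aware parametric encoder is to fuse each pose vector in a sequence with a sinusoidal embedding of its normalized timestep, as sketched below. The dimensions (e.g. a 72-D axis-angle pose, as in SMPL) and the architecture are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class TimeAwarePoseEncoder(nn.Module):
    """Sketch of a time-aware parametric encoder: each pose vector in a
    sequence is fused with a sinusoidal embedding of its timestep."""

    def __init__(self, pose_dim: int = 72, time_dim: int = 32, out_dim: int = 256):
        super().__init__()
        self.time_dim = time_dim  # must be even (sin/cos halves)
        self.proj = nn.Sequential(
            nn.Linear(pose_dim + time_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def time_embedding(self, t: torch.Tensor) -> torch.Tensor:
        # Standard sinusoidal embedding over normalized timesteps in [0, 1].
        half = self.time_dim // 2
        freqs = torch.exp(
            torch.arange(half, device=t.device, dtype=torch.float32)
            * (-math.log(1e4) / half)
        )
        angles = t[:, None] * freqs[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        """poses: (T, pose_dim), e.g. 24 joints x 3 axis-angle values."""
        T = poses.shape[0]
        t = torch.linspace(0.0, 1.0, T, device=poses.device)
        emb = self.time_embedding(t)                 # (T, time_dim)
        return self.proj(torch.cat([poses, emb], dim=-1))
```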
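Spatial Feature Transform itself is a known conditioning mechanism: per-pixel scale and shift maps are predicted from a guidance signal and modulate intermediate features. The minimal layer below shows the operation; how HumanGenesis wires it into its video diffusion backbone, and the exact condition inputs, are assumptions here, and two independent condition branches are used in place of the original shared trunk for brevity.

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial Feature Transform: modulate diffusion features with per-pixel
    scale and shift predicted from a condition map (e.g. the hybrid render)."""

    def __init__(self, feat_ch: int, cond_ch: int, hidden: int = 64):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(cond_ch, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, feat_ch, 3, padding=1),
            )
        self.to_gamma = branch()  # per-pixel scale
        self.to_beta = branch()   # per-pixel shift

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        """feat: (B, feat_ch, H, W) diffusion features;
        cond: (B, cond_ch, H, W) rendered guidance at the same resolution."""
        gamma = self.to_gamma(cond)
        beta = self.to_beta(cond)
        # Residual formulation (1 + gamma) keeps the identity map easy to learn.
        return feat * (1.0 + gamma) + beta
```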

Comparisons with State-of-the-Art Methods

Qualitative comparisons with state-of-the-art methods.