Li Weiqi 李伟祺
M.S. Student · Computer Technology
Sun Yat-sen University, Guangzhou, China
I am a first-year M.S. student at Sun Yat-sen University (SYSU), advised by Prof. Liang Lin and Prof. Guangrun Wang. I received my B.S. in Software Engineering from South China University of Technology (SCUT) in 2024. Currently, I am a research intern at Tencent RoboticX, focusing on VLA-based mobile manipulation.
My research interests lie in Embodied AI, Vision-Language-Action (VLA) models, Multimodal Large Language Models, and Controllable Video Generation. I aim to build robust, generalizable embodied agents that can seamlessly operate in diverse real-world environments.
News
- 2025.12 🎉 One paper accepted at CVPR 2026 (CCF-A): VLA Models Are More Generalizable Than You Think.
- 2025.12 📄 New preprint: ACD submitted to IJCV. [arXiv:2512.21268]
- 2025.09 📄 HumanGenesis submitted to NeurIPS 2026. [arXiv:2508.09858]
- 2025.05 🤖 Joined Tencent RoboticX as a research intern, working on VLA-based mobile manipulation.
- 2024.09 🎓 Started M.S. at Sun Yat-sen University.
Publications
Bold denotes my name | † denotes corresponding author
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
CVPR 2026
CCF-A
Investigates why VLA models (e.g., π0.5) fail dramatically under novel viewpoints.
Decouples the issue into physical vs. spatial modeling failures, showing that the pretrained model retains
strong physical understanding while spatial representation mismatch is the key bottleneck.
Proposes two lightweight adaptation methods — FTM (token affine modulation) and
FLA (low-rank ViT update) — that recover cross-viewpoint performance by updating only
4K–4.7M parameters, achieving 90.8% success rate on the LIBERO-V benchmark
with a 99× parameter efficiency gain over LoRA.
One-shot sim-to-real transfer validated on a real Franka arm.
HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics
NeurIPS 2026 (Under Review)
Proposes HumanGenesis, a multi-agent collaborative framework that unifies
Real2Sim and Sim2Real in a closed loop for human dynamics modeling.
The system integrates a 3DGS+SMPL+Learnable-LBS Reconstructor,
a Qwen2.5-VL-driven Critique Agent with multi-round self-reflection for
fine-grained reconstruction refinement, and a Video Harmonizer that enhances
human-scene consistency and temporal coherence in rendered videos.
Achieves state-of-the-art results on HumanVid and NeuMan benchmarks.
ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision
IJCV (Under Review)
Proposes Attention-Conditional Diffusion (ACD), a controllable video generation
framework built on CogVideoX.
Unlike traditional guidance-level conditioning, ACD directly supervises cross-attention maps
inside the diffusion model via a dual-branch (masked/unmasked) shared-parameter fine-tuning scheme,
moving conditioning from output-level to attention-level and eliminating common artifacts.
Uses sparse 3D-aware object layouts as control signals with a layout ControlNet,
supported by an automated annotation pipeline on 20K RealEstate training clips.
Outperforms AC3D and other baselines on FID/FVD and camera error metrics.
Research Experience
Tencent RoboticX
Research Intern — Embodied AI
Working on Vision-Language-Action (VLA) models for mobile manipulation tasks,
investigating generalization, spatial understanding, and sim-to-real transfer
in whole-body robot control pipelines.
2025.05 – Present
Sun Yat-sen University — Graduate Research
M.S. Researcher — Embodied AI & Multimodal Generation
Research on generalizable VLA models (CVPR 2026) and multi-agent frameworks
for human dynamics modeling (NeurIPS 2026 submission).
2024.09 – Present
Education
Sun Yat-sen University (中山大学)
M.S. in Computer Technology
School of Computer Science and Engineering
Sep 2024 – Jun 2027
South China University of Technology (华南理工大学)
B.S. in Software Engineering
School of Software Engineering
Sep 2020 – Jun 2024
Technical Skills
Frameworks:
PyTorch · HuggingFace Transformers · Diffusers · LoRA / PEFT · 3D Gaussian Splatting · SMPL
Research Areas:
Embodied AI · VLA Policy Learning · Multimodal Modeling · Controllable Video Generation · Large Model Fine-tuning
Languages:
Python · C++ · CUDA
Last updated: May 2026