Li Weiqi | 李伟祺

M.S. Student · Computer Technology
Sun Yat-sen University, Guangzhou, China

I am an M.S. student at Sun Yat-sen University (SYSU), advised by Prof. Guangrun Wang and Prof. Liang Lin. I received my B.S. in Software Engineering from South China University of Technology (SCUT) in 2024. Currently, I am a research intern at Tencent Robotics X, focusing on VLA-based mobile manipulation.

My research interests lie in Embodied AI, Vision-Language-Action (VLA) models, Multimodal Large Language Models, and Controllable Video Generation. I aim to build robust, generalizable embodied agents that can seamlessly operate in diverse real-world environments.

Embodied AI · VLA Models · Mobile Manipulation · Multimodal LLMs · Video Diffusion · 3D Vision

Email · GitHub · Google Scholar · arXiv

News

2026.07 📄 New preprint: Frustum-Aware 3D Vision-Language Learning for Adaptive City-Scale Spatial Reasoning — active frustum-focused reasoning (FAPO) and the CityVerse-Bench benchmark for city-scale 3D-VLMs.
2026.05 🤖 Joined Tencent Robotics X as a research intern, working on VLA-based mobile manipulation.
2025.12 🎉 One paper accepted at CVPR 2026 (CCF-A): VLA Models Are More Generalizable Than You Think.
2025.12 📄 New preprint: ACD — direct conditional control for video diffusion via attention supervision. [arXiv:2512.21268]
2025.09 📄 New preprint: HumanGenesis — agent-based geometric and generative modeling for synthetic human dynamics. [arXiv:2508.09858]
2024.09 🎓 Started M.S. at Sun Yat-sen University.

Publications

Bold denotes my name | * denotes equal contribution | † denotes corresponding author

Frustum-Aware 3D Vision-Language Learning for Adaptive City-Scale Spatial Reasoning

Peixin Chen*, Weiqi Li*, Sibei Yang, Guangrun Wang, Liang Lin, Guanbin Li

Preprint

Introduces Aerial 3D Urban Reasoning, a new task for low-altitude city-scale 3D understanding and spatial reasoning, where sparse targets and severe depth ambiguity make passive global perception ineffective. Proposes Frustum-Aware Policy Optimization (FAPO), which enables active vision-language interactive reasoning with adaptive frustum focusing: a two-stage strategy that first uses supervised fine-tuning to initialize frustum-based visual evidence localization and step-by-step reasoning, then reinforcement learning to optimize the focusing policy end-to-end with a depth-aware IoU (D-IoU) reward. Also establishes CityVerse-Bench, a large-scale city-level benchmark unifying posed images, point clouds, and depth maps for fine-grained description, referential grounding, holistic scene understanding, and complex spatial reasoning. Consistently outperforms strong baselines across diverse urban tasks.

VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang†

CVPR 2026 (CCF-A)

Investigates why VLA models (e.g., π0.5) fail dramatically under novel viewpoints. Decouples the issue into physical vs. spatial modeling failures, showing that the pretrained model retains strong physical understanding while spatial representation mismatch is the key bottleneck. Proposes two lightweight adaptation methods, FTM (token affine modulation) and FLA (low-rank ViT update), that recover cross-viewpoint performance by updating only 4K–4.7M parameters, reaching a 90.8% success rate on the LIBERO-V benchmark with a 99× parameter efficiency gain over LoRA. Validates one-shot sim-to-real transfer on a real Franka arm.

📄 arXiv

HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics

Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang†

Preprint

Proposes HumanGenesis, a multi-agent collaborative framework that unifies Real2Sim and Sim2Real in a closed loop for human dynamics modeling. The system integrates a 3DGS+SMPL+Learnable-LBS Reconstructor, a Qwen2.5-VL-driven Critique Agent with multi-round self-reflection for fine-grained reconstruction refinement, and a Video Harmonizer that enhances human-scene consistency and temporal coherence in rendered videos. Achieves state-of-the-art results on HumanVid and NeuMan benchmarks.

📄 arXiv

ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang†

Preprint

Proposes Attention-Conditional Diffusion (ACD), a controllable video generation framework built on CogVideoX. Unlike traditional guidance-level conditioning, ACD directly supervises cross-attention maps inside the diffusion model via a dual-branch (masked/unmasked) shared-parameter fine-tuning scheme, moving conditioning from output-level to attention-level and eliminating common artifacts. Uses sparse 3D-aware object layouts as control signals with a layout ControlNet, supported by an automated annotation pipeline on 20K RealEstate training clips. Outperforms AC3D and other baselines on FID/FVD and camera error metrics.

🌐 Project Page · 📄 arXiv

Research Experience

Tencent Robotics X

Research Intern — Embodied AI

Working on Vision-Language-Action (VLA) models for mobile manipulation tasks, investigating generalization, spatial understanding, and sim-to-real transfer in whole-body robot control pipelines.

May 2026 – Present

Sun Yat-sen University — Graduate Research

M.S. Researcher — Embodied AI & Multimodal Generation

Research on generalizable VLA models (CVPR 2026) and multi-agent frameworks for human dynamics modeling (ICML 2026 submission).

Sep 2024 – Present

Education

Sun Yat-sen University (中山大学)

M.S. in Computer Technology

School of Computer Science and Engineering

Sep 2024 – Jun 2027

South China University of Technology (华南理工大学)

B.S. in Software Engineering

School of Software Engineering

Sep 2020 – Jun 2024

Technical Skills

Machine Learning: PyTorch · Diffusers · LoRA / PEFT · ControlNet · Multimodal Model Fine-tuning

Embodied AI: VLA / VLM · Mobile Manipulation · Spatial Grounding · Simulation-to-Real Transfer

Generative & 3D: Controllable Video Diffusion · 3D Gaussian Splatting · SMPL · Human Dynamics Modeling

Last updated: July 2026