Yingqing He
Hi there👋😋.
I am a Ph.D. student at HKUST, supervised by Prof. Qifeng Chen .
My research focuses on text-to-video generation, multimodal generation, and controllable generation.
I welcome all kinds of research collaborations and discussions!
Email: yhebm [at] connect [dot] ust [dot] hk
Google Scholar /
Github /
LinkedIn /
Twitter /
Rednote
News
- [12/2024] 1 paper was accepted to AAAI 2025 .
- [10/2024] 1 paper was accepted to WACV 2025 .
- [08/2024] 1 paper was accepted to ECCV 2024 AI4VA Workshop .
- [07/2024] 1 paper was accepted to SIGGRAPH Asia 2024 .
- [07/2024] 1 paper was accepted to ECCV 2024 .
- [05/2024] We released a survey paper: LLMs Meet Multimodal Generation and Editing: A Survey .
- [03/2024] 1 paper was accepted to CVPR 2024 .
- [02/2024] 1 paper was accepted to TVCG 2024 .
- [01/2024] 2 papers were accepted to ICLR 2024 (including 1 Spotlight paper).
- [12/2023] 1 paper was accepted to AAAI 2024 .
- [11/2023] We released VideoCrafter 1 .
- [08/2023] 1 paper was accepted to SIGGRAPH Asia 2023 .
- [04/2023] We released VideoCrafter 0.9 .
- [08/2021] 1 paper was accepted to ACM MM 2021 as an Oral paper.
VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training
Yingqing He , Yazhou Xing, Zhefan Rao, Haoyu Wu, Zhaoyang Liu, Jingye Chen, Pengjun Fang, Jiajun Li, Liya Ji, Runtao Liu, Xiaowei Chi, Yang Fei, Guocheng Shao, Yue Ma, Qifeng Chen
Github , 2024 Nov
Project page /
Github
To the best of our knowledge, VideoTuna is the first repository that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation, supporting both fine-tuning and post-training.
Additionally, VideoTuna provides a comprehensive video generation pipeline covering pre-training, continual training, post-training (alignment), and fine-tuning.
VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE
Yazhou Xing* ,
Yang Fei* ,
Yingqing He*† ,
Jingye Chen ,
Jiaxin Xie ,
Xiaowei Chi ,
Qifeng Chen†
arXiv , 2024 Dec
Project page /
arXiv /
Github
VideoVAE+ is a state-of-the-art Video VAE model that can encode and decode video clips with large motion and high definition.
ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement
Zhefan Rao* ,
Liya Ji* ,
Yazhou Xing ,
Runtao Liu ,
Zhaoyang Liu ,
Jiaxin Xie ,
Ziqiao Peng ,
Yingqing He† ,
Qifeng Chen†
arXiv , 2024 Dec
Project page /
arXiv /
Github
ModelGrow is a method that scales model capacity and enhances the language understanding of text-to-video models during continual pre-training.
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
Runtao Liu* ,
Haoyu Wu* ,
Ziqiang Zheng ,
Chen Wei ,
Yingqing He ,
Renjie Pi ,
Qifeng Chen†
arXiv , 2024 Dec
Project page /
arXiv /
Github
HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
Xinyu Liu, Yingqing He , Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo
arXiv , 2024 Sep
Project page /
arXiv /
Github
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Xiaowei Chi ,
Aosong Cheng ,
Pengjun Fang ,
Yatian Wang ,
Zeyue Tian ,
Yingqing He ,
Zhaoyang Liu ,
Xingqun Qi ,
Jiahao Pan ,
Rongyu Zhang ,
Mengfei Li ,
Yanbing Jiang ,
Wei Xue ,
Wenhan Luo ,
Qifeng Chen ,
Shanghang Zhang ,
Qifeng Liu ,
Yike Guo
arXiv , 2024
Project page /
arXiv /
Github
MMTrail is a large-scale multimodal video-language dataset with over 20M trailer clips. It features high-quality multimodal captions that integrate context, visual frames, and background music, aiming to advance cross-modality studies and fine-grained multimodal language model training.
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models
Haonan Qiu ,
Zhaoxi Chen ,
Zhouxia Wang ,
Yingqing He ,
Menghan Xia ,
Ziwei Liu
arXiv , 2024
Project page /
arXiv /
Github
FreeTraj is a tuning-free method for trajectory-controllable video generation based on pre-trained video diffusion models.
LLMs Meet Multimodal Generation and Editing: A Survey
Yingqing He ,
Zhaoyang Liu ,
Jingye Chen ,
Zeyue Tian ,
Hongyu Liu ,
Xiaowei Chi ,
Runtao Liu ,
Ruibin Yuan ,
Yazhou Xing ,
Wenhai Wang ,
Jifeng Dai ,
Yong Zhang ,
Wei Xue ,
Qifeng Liu ,
Yike Guo ,
Qifeng Chen
arXiv , 2024
arXiv /
Github
This survey covers works on image, video, 3D, and audio generation and editing.
We emphasize the role of LLMs in the generation and editing of these modalities.
We also include works on multimodal agents and generative AI safety.
Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts
Yue Ma * ,
Yingqing He * ,
Hongfa Wang ,
Andong Wang ,
Chenyang Qi ,
Chengfei Cai ,
Xiu Li ,
Zhifeng Li ,
Heung-Yeung Shum ,
Wei Liu ,
Qifeng Chen
AAAI , 2025
Project page /
arXiv /
Github
Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation
Yue Ma* ,
Hongyu Liu* ,
Hongfa Wang* ,
Heng Pan* ,
Yingqing He ,
Junkun Yuan ,
Ailing Zeng ,
Chengfei Cai ,
Heung-Yeung Shum ,
Wei Liu ,
Qifeng Chen
SIGGRAPH Asia , 2024
Project page /
arXiv /
Github
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
Yingqing He* ,
Menghan Xia* ,
Haoxin Chen* ,
Xiaodong Cun ,
Yuan Gong,
Jinbo Xing ,
Yong Zhang ,
Xintao Wang ,
Chao Weng,
Ying Shan ,
Qifeng Chen
ECCV AI4VA Workshop , 2024  
Project page /
arXiv /
Github
A novel story-to-video pipeline with both structure and character controls, facilitating the generation of a vlog for a teddy bear.
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
Lanqing Guo * ,
Yingqing He * ,
Haoxin Chen ,
Menghan Xia ,
Xiaodong Cun ,
Yufei Wang ,
Siyu Huang,
Yong Zhang ,
Xintao Wang ,
Qifeng Chen ,
Ying Shan ,
Bihan Wen
ECCV , 2024
Project page /
arXiv /
Github
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Yazhou Xing * ,
Yingqing He * ,
Zeyue Tian* ,
Xintao Wang ,
Qifeng Chen
CVPR , 2024
Project page /
arXiv /
Github
ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models
Yingqing He* ,
Shaoshu Yang* ,
Haoxin Chen,
Xiaodong Cun ,
Menghan Xia ,
Yong Zhang ,
Xintao Wang ,
Ran He,
Qifeng Chen ,
Ying Shan
ICLR , 2024   (Spotlight)
Project page /
arXiv /
Github
Generating 16x higher-resolution images and 4x higher-resolution videos without any extra data or training effort.
MagicStick: Controllable Video Editing via Control Handle Transformations
Yue Ma ,
Xiaodong Cun ,
Yingqing He ,
Chenyang Qi ,
Xintao Wang ,
Ying Shan ,
Xiu Li,
Qifeng Chen
WACV , 2025
Project page /
arXiv /
Github
FreeNoise: Tuning-free longer video diffusion via noise rescheduling
Haonan Qiu ,
Menghan Xia ,
Yong Zhang ,
Yingqing He ,
Xintao Wang ,
Ying Shan ,
Ziwei Liu
ICLR , 2024
Project page /
arXiv /
Github
Follow Your Pose: Pose-guided text-to-video generation using pose-free videos
Yue Ma * ,
Yingqing He * ,
Xiaodong Cun ,
Xintao Wang ,
Siran Chen,
Ying Shan ,
Xiu Li,
Qifeng Chen
AAAI , 2024
Project page /
arXiv /
Github
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
Jinbo Xing ,
Menghan Xia ,
Yuxin Liu,
Yuechen Zhang ,
Yong Zhang ,
Yingqing He ,
Hanyuan Liu ,
Haoxin Chen,
Xiaodong Cun ,
Xintao Wang ,
Ying Shan ,
Tien-Tsin Wong
TVCG , 2024
Project page /
arXiv
Given a text description and video structure (depth), our approach can generate temporally coherent and high-fidelity videos. Its applications include dynamic 3D-scene-to-video creation, real-life-scene-to-video conversion, and video re-rendering.
TaleCrafter: Interactive Story Visualization with Multiple Characters
Yuan Gong,
Youxi Pang,
Xiaodong Cun ,
Menghan Xia ,
Yingqing He ,
Haoxin Chen,
Longyue Wang,
Yong Zhang ,
Xintao Wang ,
Ying Shan ,
Yujiu Yang
SIGGRAPH Asia , 2023
Project page /
arXiv /
Github
VideoCrafter1: Open diffusion models for high-quality video generation
Haoxin Chen * ,
Menghan Xia * ,
Yingqing He * ,
Yong Zhang ,
Xiaodong Cun ,
Shaoshu Yang,
Jinbo Xing ,
Yaofang Liu,
Qifeng Chen ,
Xintao Wang ,
Chao Weng,
Ying Shan
arXiv , 2023
Project page /
arXiv /
Github
An open-source foundational text-to-video and image-to-video diffusion model for high-quality video generation.
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Yingqing He ,
Tianyu Yang ,
Yong Zhang ,
Ying Shan ,
Qifeng Chen
arXiv , 2022
Project page /
arXiv /
Github
Interpreting class conditional GANs with channel awareness
Yingqing He ,
Zhiyi Zhang,
Jiapeng Zhu,
Yujun Shen ,
Qifeng Chen
arXiv , 2022
Project page /
arXiv /
Github
Unsupervised portrait shadow removal via generative priors
Yingqing He* ,
Yazhou Xing* ,
Tianjia Zhang,
Qifeng Chen
ACM MM , 2021   (Oral)
arXiv /
Github
We propose an unsupervised method for portrait shadow removal, leveraging facial priors from StyleGAN2 .
Our approach also supports facial tattoo and watermark removal.
Academic Services
Conference Reviewer for CVPR, ICLR, SIGGRAPH Asia.
Journal Reviewer for TPAMI, IJCV, ACM Computing Surveys.
The webpage template is borrowed from this