I am Yang Jin (金阳), a Ph.D. student at the Wangxuan Institute of Computer Technology (WICT), Peking University, advised by Prof. Yadong Mu. Before that, I obtained my B.S. degree in Computer Science and Engineering from Beihang University. I have been a research intern at ByteDance and Kuaishou Technology.

My research interests cover visual grounding, multi-modal learning, and generative models. Recently, my work has mainly focused on developing effective multi-modal large language models. I have published several papers and served as a reviewer for conferences such as CVPR, ECCV, and ICCV. If you are interested in my research, feel free to contact me by e-mail.

🔥 News

  • 2024.05: One paper is accepted at ICML 2024!
  • 2024.01: One paper is accepted at ICLR 2024!
  • 2023.07: One paper is accepted at ICCV 2023!
  • 2023.03: One paper is accepted at CVPR 2023!
  • 2022.09: One paper is accepted at NeurIPS 2022 as a spotlight presentation!
  • 2022.03: One paper is accepted at CVPR 2022!
  • 2021.04: One paper is accepted at TMM 2021!

📝 Publications

[ICML 2024] Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu

[Paper] [Project] [Code]

  • We present a multimodal LLM capable of both comprehending and generating videos, based on an efficient decomposed video representation.

[ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu

[Paper] [Code]

  • We present an effective dynamic discrete visual tokenizer that represents an image as a foreign language for Large Language Models, supporting both multi-modal understanding and generation.

[CVPR 2023] Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce

Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu


  • We propose a generic multi-modal foundation model in E-commerce that learns the instance-level representation of products and achieves superior performance on massive downstream E-commerce applications.

[NeurIPS 2022 Spotlight] Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu

[Paper] [Code]

  • We propose STCAT, a new one-stage spatio-temporal video grounding model that enjoys more consistent cross-modal feature alignment and tube prediction. It also achieved state-of-the-art performance on VidSTG and HC-STVG benchmarks.

[CVPR 2022] Complex Video Action Reasoning via Learnable Markov Logic Network

Yang Jin, Linchao Zhu, Yadong Mu


  • We devise a video action reasoning framework that performs Markov Logic Network (MLN) based probabilistic logical inference. The proposed framework enjoys remarkable interpretability through the learned logical rules.

🎖 Honors and Awards

  • Peking University President’s Scholarship
  • Wang Xuan Scholarship
  • Peking University Study Excellence Award
  • Peking University Excellent Research Award

📖 Education

💻 Internships

  • 2023.07 - present, Content Understanding and Generation Group, Kuaishou Technology, China.
  • 2022.04 - 2023.06, Content Understanding Group, ByteDance, China.

🏫 Professional Services

  • Reviewer for CVPR 2023, CVPR 2024, ICCV 2023, ECCV 2024.