Yijie Qian
M.S. Student
Institute of Cyber-Systems and Control, Zhejiang University, China
Biography
I am pursuing my M.S. degree at the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My research interests lie in deep learning and computer vision, with a focus on object detection and segmentation.
Research Interests
- Deep Learning
- Computer Vision
Publications
- Chao Xu, Yijie Qian, Shaoting Zhu, Baigui Sun, Jian Zhao, Yong Liu, and Xuelong Li. UniFace++: Revisiting a Unified Framework for Face Reenactment and Swapping via 3D Priors. International Journal of Computer Vision, 133:4538-4554, 2025.
  Abstract: Face reenactment and swapping share a similar pattern of identity and attribute manipulation. Our previous work UniFace has preliminarily explored establishing a unification between the two at the feature level, but it heavily relies on the accuracy of feature disentanglement, and GANs are also unstable during training. In this work, we delve into the intrinsic connections between the two from a more general training paradigm perspective, introducing a novel diffusion-based unified method UniFace++. Specifically, this work combines the advantages of each, i.e., stability of reconstruction training from reenactment, simplicity and effectiveness of the target-oriented processing from swapping, and redefining both as target-oriented reconstruction tasks. In this way, face reenactment avoids complex source feature deformation and face swapping mitigates the unstable seesaw-style optimization. The core of our approach is the rendered face obtained from reassembled 3D facial priors serving as the target pivot, which contains precise geometry and coarse identity textures. We further incorporate it with the proposed Texture-Geometry-aware Diffusion Model (TGDM) to perform texture transfer under the reconstruction supervision for high-fidelity face synthesis. Extensive quantitative and qualitative experiments demonstrate the superiority of our method for both tasks.
  DOI: 10.1007/s11263-025-02395-6
- Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, and Jingdong Wang. TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On. International Journal of Computer Vision, 133:3781-3802, 2025.
  Abstract: Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the widespread applications. In this work, we propose an effective and efficient framework, termed TryOn-Adapter. Specifically, we first decouple clothing identity into fine-grained factors: style for color and category information, texture for high-frequency details, and structure for smooth spatial adaptive transformation. Our approach utilizes a pre-trained exemplar-based diffusion model as the fundamental network, whose parameters are frozen except for the attention layers. We then customize three lightweight modules (Style Preserving, Texture Highlighting, and Structure Adapting) incorporated with fine-tuning techniques to enable precise and efficient identity control. Meanwhile, we introduce the training-free T-RePaint strategy to further enhance clothing identity preservation while maintaining the realistic try-on effect during the inference. Our experiments demonstrate that our approach achieves state-of-the-art performance on two widely-used benchmarks. Additionally, compared with recent full-tuning diffusion-based methods, we only use about half of their tunable parameters during training. The code will be made publicly available at https://github.com/jiazheng-xing/TryOn-Adapter.
  DOI: 10.1007/s11263-025-02352-3
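The parameter-efficient recipe described in the TryOn-Adapter abstract, freezing a pre-trained diffusion backbone and tuning only its attention layers, can be illustrated with a minimal PyTorch sketch. This is not the TryOn-Adapter implementation (see the GitHub link above for the real code); `TinyUNet` is a stand-in module and all sizes are arbitrary assumptions.

```python
# Minimal sketch: freeze everything except attention layers (illustrative, not the paper's code).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stand-in for a pretrained exemplar-based diffusion UNet (assumed toy architecture)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv_in = nn.Conv2d(4, dim, 3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.conv_out = nn.Conv2d(dim, 4, 3, padding=1)

    def forward(self, x):
        h = self.conv_in(x)                      # (B, C, H, W)
        b, c, hgt, wid = h.shape
        seq = h.flatten(2).transpose(1, 2)       # (B, H*W, C) spatial tokens
        seq, _ = self.attn(seq, seq, seq)        # self-attention over tokens
        h = seq.transpose(1, 2).reshape(b, c, hgt, wid)
        return self.conv_out(h)

def freeze_all_but_attention(model: nn.Module):
    # Freeze every parameter, then re-enable gradients only inside attention submodules.
    for p in model.parameters():
        p.requires_grad_(False)
    for m in model.modules():
        if isinstance(m, nn.MultiheadAttention):
            for p in m.parameters():
                p.requires_grad_(True)

unet = TinyUNet()
freeze_all_but_attention(unet)
trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
x = torch.randn(1, 4, 32, 32)
print(unet(x).shape, f"trainable params: {trainable}/{total}")
```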
- Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving. In 39th AAAI Conference on Artificial Intelligence (AAAI), pages 9327-9335, 2025.
  Abstract: World models envision potential future states based on various ego actions. They embed extensive knowledge about the driving environment, facilitating safe and scalable autonomous driving. Most existing methods primarily focus on either data generation or the pretraining paradigms of world models. Unlike the aforementioned prior works, we propose Drive-OccWorld, which adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. Specifically, we first introduce a semantic and motion-conditional normalization in the memory module, which accumulates semantic and dynamic information from historical BEV embeddings. These BEV features are then conveyed to the world decoder for future occupancy and flow forecasting, considering both geometry and spatiotemporal modeling. Additionally, we propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation and facilitate a broader range of downstream applications. Furthermore, we explore integrating the generative capabilities of the 4D world model with end-to-end planning, enabling continuous forecasting of future states and the selection of optimal trajectories using an occupancy-based cost function. Extensive experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy, opening new avenues for driving world generation and end-to-end planning.
  DOI: 10.1609/aaai.v39i9.33010
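As a minimal sketch of what the "semantic and motion-conditional normalization" over BEV features could look like, the code below predicts per-channel scale and shift from a conditioning map, in a SPADE-style design. This is my illustrative reading, not the Drive-OccWorld implementation; the class name `ConditionalBEVNorm` and all shapes are assumptions.

```python
# Illustrative conditional normalization over BEV features (assumed design, not the paper's code).
import torch
import torch.nn as nn

class ConditionalBEVNorm(nn.Module):
    def __init__(self, feat_channels, cond_channels, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, bev_feat, cond):
        # bev_feat: (B, C, H, W) historical BEV embedding
        # cond:     (B, C_cond, H, W) semantic/motion conditioning map
        h = self.shared(cond)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        return self.norm(bev_feat) * (1 + gamma) + beta

bev = torch.randn(2, 128, 50, 50)
cond = torch.randn(2, 32, 50, 50)
print(ConditionalBEVNorm(128, 32)(bev, cond).shape)  # torch.Size([2, 128, 50, 50])
```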
- Xiaojun Hou, Jiazheng Xing, Yijie Qian, Yaowei Guo, Shuo Xin, Junhao Chen, Kai Tang, Mengmeng Wang, Zhengkai Jiang, Liang Liu, and Yong Liu. SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26541-26551, 2024.
  Abstract: Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at: https://github.com/hoqolo/SDSTrack.
  DOI: 10.1109/CVPR52733.2024.02507
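The "lightweight adaptation" idea mentioned in the SDSTrack abstract, a small residual bottleneck trained on top of a frozen RGB-pretrained block, can be sketched as below. This is not the SDSTrack release (see the repository URL above); `BottleneckAdapter`, `frozen_block`, and the dimensions are illustrative assumptions.

```python
# Illustrative residual bottleneck adapter over a frozen block (not the SDSTrack code).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)
        nn.init.zeros_(self.up.weight)   # zero-init so the adapter starts as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens):           # tokens: (B, N, dim)
        return tokens + self.up(self.act(self.down(tokens)))

# Usage: keep the RGB-pretrained block frozen and train only the adapter on the
# auxiliary modality (depth / thermal / event) branch.
frozen_block = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
for p in frozen_block.parameters():
    p.requires_grad_(False)
adapter = BottleneckAdapter(256)

x = torch.randn(2, 196, 256)             # patch tokens from an auxiliary modality
y = adapter(frozen_block(x))              # only `adapter` receives gradients
print(y.shape)
```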
- Hanchen Tai, Yijie Qian, Xiao Kang, Liang Liu, and Yong Liu. Fusing LiDAR and Radar with Pillars Attention for 3D Object Detection. In 7th International Symposium on Autonomous Systems (ISAS), 2024.
  Abstract: In recent years, LiDAR has emerged as one of the primary sensors for mobile robots, enabling accurate detection of 3D objects. On the other hand, 4D millimeter-wave Radar presents several advantages which can be a complementary for LiDAR, including an extended detection range, enhanced sensitivity to moving objects, and the ability to operate seamlessly in various weather conditions, making it a highly promising technology. To leverage the strengths of both sensors, this paper proposes a novel fusion method that combines LiDAR and 4D millimeter-wave Radar for 3D object detection. The proposed approach begins with an efficient multi-modal feature extraction technique utilizing a pillar representation. This method captures comprehensive information from both LiDAR and millimeter-wave Radar data, facilitating a holistic understanding of the environment. Furthermore, a Pillar Attention Fusion (PAF) module is employed to merge the extracted features, enabling seamless integration and fusion of information from both sensors. This fusion process results in lightweight detection headers capable of accurately predicting object boxes. To evaluate the effectiveness of our proposed approach, extensive experiments were conducted on the VoD dataset. The experimental results demonstrate the superiority of our fusion method, showcasing improved performance in terms of detection accuracy and robustness across different environmental conditions. The fusion of LiDAR and 4D millimeter-wave Radar holds significant potential for enhancing the capabilities of mobile robots in real-world scenarios. The proposed method, with its efficient multi-modal feature extraction and attention-based fusion, provides a reliable and effective solution for 3D object detection.
  DOI: 10.1109/ISAS61044.2024.10552581
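As a rough illustration of attention-based fusion of two pillar/BEV feature maps, the sketch below cross-attends LiDAR and radar BEV tokens and merges the results. It is not the paper's PAF module; the class name, channel sizes, and the concat-then-project output are assumptions made for the example.

```python
# Illustrative cross-attention fusion of LiDAR and radar BEV features (assumed design).
import torch
import torch.nn as nn

class PillarAttentionFusion(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.lidar_from_radar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.radar_from_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, lidar_bev, radar_bev):
        # lidar_bev, radar_bev: (B, C, H, W) pillar-scattered BEV features
        b, c, h, w = lidar_bev.shape
        l = lidar_bev.flatten(2).transpose(1, 2)   # (B, H*W, C) BEV-cell tokens
        r = radar_bev.flatten(2).transpose(1, 2)
        l2, _ = self.lidar_from_radar(l, r, r)     # LiDAR queries attend to radar
        r2, _ = self.radar_from_lidar(r, l, l)     # radar queries attend to LiDAR
        fused = torch.cat([l + l2, r + r2], dim=-1)
        fused = fused.transpose(1, 2).reshape(b, 2 * c, h, w)
        return self.out(fused)                     # (B, C, H, W) fused BEV map

lidar = torch.randn(1, 64, 32, 32)
radar = torch.randn(1, 64, 32, 32)
print(PillarAttentionFusion()(lidar, radar).shape)
```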
