Address

Room 101, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: yaowei@zju.edu.cn

Yaowei Guo

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I received my B.S. and M.S. degrees from the University of California, Los Angeles, in 2019 and 2021, respectively. My research interests include deep learning and its applications in autonomous driving.

Research Interests

  • Deep learning and its applications in autonomous driving

Publications

  • Xiaojun Hou, Jiazheng Xing, Yijie Qian, Yaowei Guo, Shuo Xin, Junhao Chen, Kai Tang, Mengmeng Wang, Zhengkai Jiang, Liang Liu, and Yong Liu. SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26541-26551, 2024.
    Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at: https://github.com/hoqolo/SDSTrack.
    @inproceedings{hou2024sds,
    title = {SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking},
    author = {Xiaojun Hou and Jiazheng Xing and Yijie Qian and Yaowei Guo and Shuo Xin and Junhao Chen and Kai Tang and Mengmeng Wang and Zhengkai Jiang and Liang Liu and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {26541--26551},
    doi = {10.1109/CVPR52733.2024.02507}
    }
  • Shuo Xin, Zhen Zhang, Mengmeng Wang, Xiaojun Hou, Yaowei Guo, Xiao Kang, Liang Liu, and Yong Liu. Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 337-344, 2024.
    Tracking a specific person in 3D scene is gaining momentum due to its numerous applications in robotics. Currently, most 3D trackers focus on driving scenarios with neglected jitter and uncomplicated surroundings, which results in their severe degeneration in complex environments, especially on jolting robot platforms (only 20-60% success rate). To improve the accuracy, a Point-Video-based Transformer Tracking model (PVTrack) is presented for robots. It is the first multi-modal 3D human tracking work that incorporates point clouds together with RGB videos to achieve information complementarity. Moreover, PVTrack proposes the Siamese Point-Video Transformer for feature aggregation to overcome dynamic environments, which captures more target-aware information through the hierarchical attention mechanism adaptively. Considering the violent shaking on robots and rugged terrains, a lateral Human-ware Proposal Network is designed together with an Anti-shake Proposal Compensation module. It alleviates the disturbance caused by complex scenes as well as the particularity of the robot platform. Experiments show that our method achieves state-of-the-art performance on both KITTI/Waymo datasets and a quadruped robot for various indoor and outdoor scenes.
    @inproceedings{xin2024mmh,
    title = {Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer},
    author = {Shuo Xin and Zhen Zhang and Mengmeng Wang and Xiaojun Hou and Yaowei Guo and Xiao Kang and Liang Liu and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {337--344},
    doi = {10.1109/ICRA57147.2024.10610979}
    }
  • Jiazheng Xing, Mengmeng Wang, Yudi Ruan, Bofan Chen, Yaowei Guo, Boyu Mu, Guang Dai, Jingdong Wang, and Yong Liu. Boosting Few-Shot Action Recognition with Graph-Guided Hybrid Matching. In 19th IEEE/CVF International Conference on Computer Vision (ICCV), pages 1740-1750, 2023.
    Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM.
    @inproceedings{xing2023bfs,
    title = {Boosting Few-Shot Action Recognition with Graph-Guided Hybrid Matching},
    author = {Jiazheng Xing and Mengmeng Wang and Yudi Ruan and Bofan Chen and Yaowei Guo and Boyu Mu and Guang Dai and Jingdong Wang and Yong Liu},
    year = 2023,
    booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {1740--1750},
    doi = {10.1109/ICCV51070.2023.00167}
    }