Shuo Xin
MS Student
Institute of Cyber-Systems and Control, Zhejiang University, China
Biography
I joined Zhejiang University as an MS student in September 2022, supervised by Yong Liu. My research interests lie in machine learning and computer vision, including object tracking and object detection.
Research and Interests
- 3D Object Tracking
- Small Object Detection
- Deep Learning Methods
Publications
- Deye Zhu, Chengrui Zhu, Zhen Zhang, Shuo Xin, and Yong Liu. Learning Safe Locomotion for Quadrupedal Robots by Derived-Action Optimization. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6870-6876, 2024.
  Abstract: Deep reinforcement learning controllers with exteroception have enabled quadrupedal robots to traverse terrain robustly. However, most of these controllers heavily depend on complex reward functions and suffer from poor convergence. This work proposes a novel learning framework called derived-action optimization. The derived action is defined as a high-level representation of a policy and can be introduced into the reward function to guide decision-making behaviors. The proposed derived-action optimization method is applied to learn safer quadrupedal locomotion, achieving fast convergence and better performance. Specifically, we choose the foothold as the derived action and optimize the flatness of the terrain around the foothold to reduce potential sliding and collisions. Extensive experiments demonstrate the high safety and effectiveness of our method.
  DOI: 10.1109/IROS58592.2024.10802725
- Shuo Xin, Zhen Zhang, Liang Liu, Xiaojun Hou, Deye Zhu, Mengmeng Wang, and Yong Liu. A Robotic-centric Paradigm for 3D Human Tracking Under Complex Environments Using Multi-modal Adaptation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4934-4940, 2024.
  Abstract: The goal of this paper is to strike a feasible tracking paradigm that can make 3D human trackers applicable on robot platforms and enable more high-level tasks. Till now, two fundamental problems haven't been adequately addressed. One is the computational cost lightweight enough for robotic deployment, and the other is the easily-influenced accuracy varied greatly in complex real environments. In this paper, a robotic-centric tracking paradigm called MATNet is proposed that directly matches the LiDAR point clouds and RGB videos through end-to-end learning. To improve the low accuracy of human tracking against disturbance, a coarse-to-fine Transformer along with target-aware augmentation is proposed by fusing RGB videos and point clouds through a pyramid encoding and decoding strategy. To better meet the real-time requirement of actual robot deployment, we introduce the parameter-efficient adaptation tuning that greatly shortens the model's training time. Furthermore, we also propose a five-step Anti-shake Refinement strategy and have added human prior values to overcome the strong shaking on the robot platform. Extensive experiments confirm that MATNet significantly outperforms the previous state-of-the-art on both open-source datasets and large-scale robotic datasets.
  DOI: 10.1109/IROS58592.2024.10802166
- Xiaojun Hou, Jiazheng Xing, Yijie Qian, Yaowei Guo, Shuo Xin, Junhao Chen, Kai Tang, Mengmeng Wang, Zhengkai Jiang, Liang Liu, and Yong Liu. SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26541-26551, 2024.
  Abstract: Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at: https://github.com/hoqolo/SDSTrack.
  DOI: 10.1109/CVPR52733.2024.02507
- Shuo Xin, Zhen Zhang, Mengmeng Wang, Xiaojun Hou, Yaowei Guo, Xiao Kang, Liang Liu, and Yong Liu. Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 337-344, 2024.
  Abstract: Tracking a specific person in 3D scene is gaining momentum due to its numerous applications in robotics. Currently, most 3D trackers focus on driving scenarios with neglected jitter and uncomplicated surroundings, which results in their severe degeneration in complex environments, especially on jolting robot platforms (only 20-60% success rate). To improve the accuracy, a Point-Video-based Transformer Tracking model (PVTrack) is presented for robots. It is the first multi-modal 3D human tracking work that incorporates point clouds together with RGB videos to achieve information complementarity. Moreover, PVTrack proposes the Siamese Point-Video Transformer for feature aggregation to overcome dynamic environments, which captures more target-aware information through the hierarchical attention mechanism adaptively. Considering the violent shaking on robots and rugged terrains, a lateral Human-ware Proposal Network is designed together with an Anti-shake Proposal Compensation module. It alleviates the disturbance caused by complex scenes as well as the particularity of the robot platform. Experiments show that our method achieves state-of-the-art performance on both KITTI/Waymo datasets and a quadruped robot for various indoor and outdoor scenes.
  DOI: 10.1109/ICRA57147.2024.10610979
- Shuo Xin, Liang Liu, Xiao Kang, Zhen Zhang, Mengmeng Wang, and Yong Liu. Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network. In 7th International Symposium on Autonomous Systems (ISAS), 2024.
  Abstract: 3D human tracking plays a crucial role in the automation intelligence system. Current approaches focus on achieving higher performance on traditional driving datasets like KITTI, which overlook the jitteriness of the platform and the complexity of the environments. Once the scenarios are migrated to jolting robot platforms, they all degenerate severely with only a 20-60% success rate, which greatly restricts the high-level application of autonomous systems. In this work, beyond traditional flat scenes, we introduce Multi-modal Human Tracking Paradigm (MHTrack), a unified multimodal transformer-based model that can effectively track the target person frame-by-frame in point and video sequences. Specifically, we design a speed-inertia module-assisted stabilization mechanism along with an alternate training strategy to better migrate the tracking algorithm to the robot platform. To capture more target-aware information, we combine the geometric and appearance features of point clouds and video frames together based on a hierarchical Siamese Transformer Network. Additionally, considering the prior characteristics of the human category, we design a lateral cross-attention pyramid head for deeper feature aggregation and final 3D BBox generation. Extensive experiments confirm that MHTrack significantly outperforms the previous state-of-the-arts on both open-source datasets and large-scale robotic datasets. Further analysis verifies each component's effectiveness and shows the robotic-centric paradigm's promising potential when deployed into dynamic robotic systems.
  DOI: 10.1109/ISAS61044.2024.10552604
