Address

Room 105, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: MaYukai@zju.edu.cn

Yukai Ma

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I am pursuing my Ph.D. degree in the College of Control Science and Engineering at Zhejiang University, Hangzhou, China. My main research interests are machine learning for sensor fusion and SLAM.

Research Interests

  • Sensor Fusion
  • Machine Learning in Sensor Fusion
  • SLAM

Publications

  • Xiaolei Lang, Chao Chen, Kai Tang, Yukai Ma, Jiajun Lv, Yong Liu, and Xingxing Zuo. Coco-LIC: Continuous-Time Tightly-Coupled LiDAR-Inertial-Camera Odometry using Non-Uniform B-spline. IEEE Robotics and Automation Letters (RA-L), 8:7074-7081, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    In this paper, we propose an efficient continuous-time LiDAR-Inertial-Camera Odometry, utilizing non-uniform B-splines to tightly couple measurements from the LiDAR, IMU, and camera. In contrast to uniform B-spline-based continuous-time methods, our non-uniform B-spline approach offers significant advantages in terms of achieving real-time efficiency and high accuracy. This is accomplished by dynamically and adaptively placing control points, taking into account the varying dynamics of the motion. To enable efficient fusion of heterogeneous LiDAR-Inertial-Camera data within a short sliding-window optimization, we assign depth to visual pixels using corresponding map points from a global LiDAR map, and formulate frame-to-map reprojection factors for the associated pixels in the current image frame. This way circumvents the necessity for depth optimization of visual pixels, which typically entails a lengthy sliding window with numerous control points for continuous-time trajectory estimation. We conduct dedicated experiments on real-world datasets to demonstrate the advantage and efficacy of adopting non-uniform continuous-time trajectory representation. Our LiDAR-Inertial-Camera odometry system is also extensively evaluated on both challenging scenarios with sensor degenerations and large-scale scenarios, and has shown comparable or higher accuracy than the state-of-the-art methods. The codebase of this paper will also be open-sourced at https://github.com/APRIL-ZJU/Coco-LIC .
    @article{lang2023lic,
    title = {Coco-LIC: Continuous-Time Tightly-Coupled LiDAR-Inertial-Camera Odometry using Non-Uniform B-spline},
    author = {Xiaolei Lang and Chao Chen and Kai Tang and Yukai Ma and Jiajun Lv and Yong Liu and Xingxing Zuo},
    year = 2023,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = 8,
    pages = {7074-7081},
    doi = {10.1109/LRA.2023.3315542},
    abstract = {In this paper, we propose an efficient continuous-time LiDAR-Inertial-Camera Odometry, utilizing non-uniform B-splines to tightly couple measurements from the LiDAR, IMU, and camera. In contrast to uniform B-spline-based continuous-time methods, our non-uniform B-spline approach offers significant advantages in terms of achieving real-time efficiency and high accuracy. This is accomplished by dynamically and adaptively placing control points, taking into account the varying dynamics of the motion. To enable efficient fusion of heterogeneous LiDAR-Inertial-Camera data within a short sliding-window optimization, we assign depth to visual pixels using corresponding map points from a global LiDAR map, and formulate frame-to-map reprojection factors for the associated pixels in the current image frame. This way circumvents the necessity for depth optimization of visual pixels, which typically entails a lengthy sliding window with numerous control points for continuous-time trajectory estimation. We conduct dedicated experiments on real-world datasets to demonstrate the advantage and efficacy of adopting non-uniform continuous-time trajectory representation. Our LiDAR-Inertial-Camera odometry system is also extensively evaluated on both challenging scenarios with sensor degenerations and large-scale scenarios, and has shown comparable or higher accuracy than the state-of-the-art methods. The codebase of this paper will also be open-sourced at https://github.com/APRIL-ZJU/Coco-LIC .}
    }
  • Laijian Li, Yukai Ma, Kai Tang, Xiangrui Zhao, Chao Chen, Jianxin Huang, Jianbiao Mei, and Yong Liu. Geo-localization with Transformer-based 2D-3D match Network. IEEE Robotics and Automation Letters (RA-L), 8:4855-4862, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    This letter presents a novel method for geographical localization by registering satellite maps with LiDAR point clouds. This method includes a Transformer-based 2D-3D matching network called D-GLSNet that directly matches the LiDAR point clouds and satellite images through end-to-end learning. Without the need for feature point detection, D-GLSNet provides accurate pixel-to-point association between the LiDAR point clouds and satellite images. And then, we can easily calculate the horizontal offset (Δx, Δy) and angular deviation Δθ_yaw between them, thereby achieving accurate registration. To demonstrate our network's localization potential, we have designed a Geo-localization Node (GLN) that implements geographical localization and is plug-and-play in the SLAM system. Compared to GPS, GLN is less susceptible to external interference, such as building occlusion. In urban scenarios, our proposed D-GLSNet can output high-quality matching, enabling GLN to function stably and deliver more accurate localization results. Extensive experiments on the KITTI dataset show that our D-GLSNet method achieves a mean Relative Translation Error (RTE) of 1.43 m. Furthermore, our method outperforms state-of-the-art LiDAR-based geospatial localization methods when combined with odometry.
    @article{li2023glw,
    title = {Geo-localization with Transformer-based 2D-3D match Network},
    author = {Laijian Li and Yukai Ma and Kai Tang and Xiangrui Zhao and Chao Chen and Jianxin Huang and Jianbiao Mei and Yong Liu},
    year = 2023,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = 8,
    pages = {4855-4862},
    doi = {10.1109/LRA.2023.3290526},
    abstract = {This letter presents a novel method for geographical localization by registering satellite maps with LiDAR point clouds. This method includes a Transformer-based 2D-3D matching network called D-GLSNet that directly matches the LiDAR point clouds and satellite images through end-to-end learning. Without the need for feature point detection, D-GLSNet provides accurate pixel-to-point association between the LiDAR point clouds and satellite images. And then, we can easily calculate the horizontal offset (Δx, Δy) and angular deviation Δθ_yaw between them, thereby achieving accurate registration. To demonstrate our network's localization potential, we have designed a Geo-localization Node (GLN) that implements geographical localization and is plug-and-play in the SLAM system. Compared to GPS, GLN is less susceptible to external interference, such as building occlusion. In urban scenarios, our proposed D-GLSNet can output high-quality matching, enabling GLN to function stably and deliver more accurate localization results. Extensive experiments on the KITTI dataset show that our D-GLSNet method achieves a mean Relative Translation Error (RTE) of 1.43 m. Furthermore, our method outperforms state-of-the-art LiDAR-based geospatial localization methods when combined with odometry.}
    }
  • Chao Chen, Yukai Ma, Jiajun Lv, Xiangrui Zhao, Laijian Li, Yong Liu, and Wang Gao. OL-SLAM: A Robust and Versatile System of Object Localization and SLAM. Sensors, 23:801, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    This paper proposes a real-time, versatile Simultaneous Localization and Mapping (SLAM) and object localization system, which fuses measurements from LiDAR, camera, Inertial Measurement Unit (IMU), and Global Positioning System (GPS). Our system can locate itself in an unknown environment and build a scene map based on which we can also track and obtain the global location of objects of interest. Precisely, our SLAM subsystem consists of the following four parts: LiDAR-inertial odometry, Visual-inertial odometry, GPS-inertial odometry, and global pose graph optimization. The target-tracking and positioning subsystem is developed based on YOLOv4. Benefiting from the use of GPS sensor in the SLAM system, we can obtain the global positioning information of the target; therefore, it can be highly useful in military operations, rescue and disaster relief, and other scenarios.
    @article{chen2023ols,
    title = {OL-SLAM: A Robust and Versatile System of Object Localization and SLAM},
    author = {Chao Chen and Yukai Ma and Jiajun Lv and Xiangrui Zhao and Laijian Li and Yong Liu and Wang Gao},
    year = 2023,
    journal = {Sensors},
    volume = 23,
    pages = {801},
    doi = {10.3390/s23020801},
    abstract = {This paper proposes a real-time, versatile Simultaneous Localization and Mapping (SLAM) and object localization system, which fuses measurements from LiDAR, camera, Inertial Measurement Unit (IMU), and Global Positioning System (GPS). Our system can locate itself in an unknown environment and build a scene map based on which we can also track and obtain the global location of objects of interest. Precisely, our SLAM subsystem consists of the following four parts: LiDAR-inertial odometry, Visual-inertial odometry, GPS-inertial odometry, and global pose graph optimization. The target-tracking and positioning subsystem is developed based on YOLOv4. Benefiting from the use of GPS sensor in the SLAM system, we can obtain the global positioning information of the target; therefore, it can be highly useful in military operations, rescue and disaster relief, and other scenarios.}
    }
  • Yukai Ma, Xiangrui Zhao, Han Li, Yaqing Gu, Xiaolei Lang, and Yong Liu. RoLM: Radar on LiDAR Map Localization. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Multi-sensor fusion-based localization technology has achieved high accuracy in autonomous systems. How to improve the robustness is the main challenge at present. The most commonly used LiDAR and camera are weather-sensitive, while the FMCW radar has strong adaptability but suffers from noise and ghost effects. In this paper, we propose a heterogeneous localization method of Radar on LiDAR Map (RoLM), which can eliminate the accumulated error of radar odometry in real-time to achieve higher localization accuracy without dependence on loop closures. We embed the two sensor modalities into a density map and calculate the spatial vector similarity with offset to seek the corresponding place index in the candidates and calculate the rotation and translation. We use ICP to pursue perfect matching on the LiDAR submap based on the coarse alignment. Extensive experiments on the MulRan Radar Dataset, Oxford Radar RobotCar Dataset, and our data verify the feasibility and effectiveness of our approach.
    @inproceedings{ma2023rol,
    title = {RoLM: Radar on LiDAR Map Localization},
    author = {Yukai Ma and Xiangrui Zhao and Han Li and Yaqing Gu and Xiaolei Lang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE International Conference on Robotics and Automation (ICRA)},
    doi = {10.1109/ICRA48891.2023.10161203},
    abstract = {Multi-sensor fusion-based localization technology has achieved high accuracy in autonomous systems. How to improve the robustness is the main challenge at present. The most commonly used LiDAR and camera are weather-sensitive, while the FMCW radar has strong adaptability but suffers from noise and ghost effects. In this paper, we propose a heterogeneous localization method of Radar on LiDAR Map (RoLM), which can eliminate the accumulated error of radar odometry in real-time to achieve higher localization accuracy without dependence on loop closures. We embed the two sensor modalities into a density map and calculate the spatial vector similarity with offset to seek the corresponding place index in the candidates and calculate the rotation and translation. We use ICP to pursue perfect matching on the LiDAR submap based on the coarse alignment. Extensive experiments on the MulRan Radar Dataset, Oxford Radar RobotCar Dataset, and our data verify the feasibility and effectiveness of our approach.}
    }
  • Xiaolei Lang, Jiajun Lv, Jianxin Huang, Yukai Ma, Yong Liu, and Xingxing Zuo. Ctrl-VIO: Continuous-Time Visual-Inertial Odometry for Rolling Shutter Cameras. IEEE Robotics and Automation Letters (RA-L), 7(4):11537-11544, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    In this letter, we propose a probabilistic continuous-time visual-inertial odometry (VIO) for rolling shutter cameras. The continuous-time trajectory formulation naturally facilitates the fusion of asynchronized high-frequency IMU data and motion-distorted rolling shutter images. To prevent intractable computation load, the proposed VIO is sliding-window and keyframe-based. We propose to probabilistically marginalize the control points to keep a constant number of keyframes in the sliding window. Furthermore, the line exposure time difference (line delay) of the rolling shutter camera can be online calibrated in our continuous-time VIO. To extensively examine the performance of our continuous-time VIO, experiments are conducted on the publicly-available WHU-RSVI, TUM-RSVI, and SenseTime-RSVI rolling shutter datasets. The results demonstrate the proposed continuous-time VIO significantly outperforms the existing state-of-the-art VIO methods.
    @article{lang2022ctv,
    title = {Ctrl-VIO: Continuous-Time Visual-Inertial Odometry for Rolling Shutter Cameras},
    author = {Xiaolei Lang and Jiajun Lv and Jianxin Huang and Yukai Ma and Yong Liu and Xingxing Zuo},
    year = 2022,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = 7,
    number = 4,
    pages = {11537-11544},
    doi = {10.1109/LRA.2022.3202349},
    abstract = {In this letter, we propose a probabilistic continuous-time visual-inertial odometry (VIO) for rolling shutter cameras. The continuous-time trajectory formulation naturally facilitates the fusion of asynchronized high-frequency IMU data and motion-distorted rolling shutter images. To prevent intractable computation load, the proposed VIO is sliding-window and keyframe-based. We propose to probabilistically marginalize the control points to keep a constant number of keyframes in the sliding window. Furthermore, the line exposure time difference (line delay) of the rolling shutter camera can be online calibrated in our continuous-time VIO. To extensively examine the performance of our continuous-time VIO, experiments are conducted on the publicly-available WHU-RSVI, TUM-RSVI, and SenseTime-RSVI rolling shutter datasets. The results demonstrate the proposed continuous-time VIO significantly outperforms the existing state-of-the-art VIO methods.}
    }
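
Code Sketches

The Coco-LIC paper above rests on placing B-spline control points non-uniformly according to the motion dynamics. Below is a minimal Python sketch of that idea, not the authors' implementation: the inverse-CDF knot placement, the toy gyroscope signal, and all parameters are illustrative assumptions. Knots are packed densely where the gyroscope reports fast motion and spaced out elsewhere.

    # Adaptive knot placement for a non-uniform B-spline trajectory (toy sketch).
    import numpy as np
    from scipy.interpolate import BSpline

    def adaptive_knots(t_imu, gyro_norm, n_ctrl, k=3, eps=1e-3):
        """Place knots by inverse-CDF sampling of motion intensity, so spline
        segments are short where |omega| is large and long where it is small."""
        density = gyro_norm + eps                      # keep the CDF strictly increasing
        cdf = np.cumsum(density)
        cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])      # normalize to [0, 1]
        u = np.linspace(0.0, 1.0, n_ctrl - k + 1)      # uniform in CDF space
        interior = np.interp(u, cdf, t_imu)            # hence non-uniform in time
        # Clamped knot vector: repeat each end knot k extra times.
        return np.concatenate([np.full(k, interior[0]), interior, np.full(k, interior[-1])])

    # Toy data: a burst of fast rotation in the middle of the trajectory.
    t_imu = np.linspace(0.0, 10.0, 500)
    gyro_norm = 0.1 + 2.0 * np.exp(-(t_imu - 5.0) ** 2)

    n_ctrl, k = 20, 3
    knots = adaptive_knots(t_imu, gyro_norm, n_ctrl, k)
    ctrl = np.random.randn(n_ctrl, 3)                  # stand-in 3D control points
    traj = BSpline(knots, ctrl, k)                     # continuous-time trajectory
    print(traj(np.array([1.0, 5.0, 9.0])))             # query poses at arbitrary times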
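RoLM above follows a coarse-to-fine scheme: a rotation hypothesis from a global signature of the scan, then ICP refinement against the LiDAR submap. Here is a toy 2D Python sketch in that spirit; the azimuth-histogram signature, the circular correlation search, and the minimal point-to-point ICP are simplifying assumptions, not the paper's actual descriptors or matching strategy.

    # Coarse-to-fine 2D alignment sketch: circular correlation for yaw, then ICP.
    import numpy as np
    from scipy.spatial import cKDTree

    def polar_signature(points_xy, n_bins=360):
        """Density of points over azimuth: a crude rotation-sensitive signature."""
        theta = np.arctan2(points_xy[:, 1], points_xy[:, 0])
        hist, _ = np.histogram(theta, bins=n_bins, range=(-np.pi, np.pi))
        return hist / max(hist.sum(), 1)

    def coarse_yaw(sig_query, sig_map):
        """Circular shift of the query signature that best matches the map's."""
        n = len(sig_map)
        scores = [np.dot(np.roll(sig_query, s), sig_map) for s in range(n)]
        return 2.0 * np.pi * np.argmax(scores) / n

    def rot2d(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[c, -s], [s, c]])

    def icp_2d(src, dst, iters=30):
        """Minimal point-to-point ICP refining (R, t) after the coarse alignment."""
        R, t = np.eye(2), np.zeros(2)
        tree = cKDTree(dst)
        for _ in range(iters):
            cur = src @ R.T + t
            _, idx = tree.query(cur)                   # nearest-neighbour matches
            matched = dst[idx]
            mu_s, mu_d = cur.mean(0), matched.mean(0)
            H = (cur - mu_s).T @ (matched - mu_d)
            U, _, Vt = np.linalg.svd(H)                # Kabsch rotation estimate
            if np.linalg.det(Vt.T @ U.T) < 0:          # guard against reflections
                Vt[-1] *= -1
            dR = Vt.T @ U.T
            R, t = dR @ R, dR @ (t - mu_s) + mu_d      # compose the increment
        return R, t

    # Toy usage: a rotated, noisy copy of a LiDAR submap.
    rng = np.random.default_rng(0)
    map_xy = rng.uniform(-20, 20, size=(800, 2))
    radar_xy = map_xy @ rot2d(0.6).T + rng.normal(scale=0.05, size=(800, 2))
    yaw0 = coarse_yaw(polar_signature(radar_xy), polar_signature(map_xy))
    R, t = icp_2d(radar_xy @ rot2d(yaw0).T, map_xy)    # fine alignment on top of yaw0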
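Ctrl-VIO above exploits the fact that each row of a rolling-shutter image is exposed at a slightly different time, so each row should be associated with its own pose sampled from the continuous-time trajectory. A minimal sketch of that timing model follows; the linear interpolation stands in for the paper's B-spline trajectory, and the line-delay value is only an assumed order of magnitude (the quantity Ctrl-VIO calibrates online).

    # Rolling-shutter timing model: one pose per image row (toy sketch).
    import numpy as np

    def row_timestamp(t_frame, row, line_delay):
        """Each row is exposed line_delay seconds after the previous one."""
        return t_frame + row * line_delay

    def interp_position(t, t_knots, p_knots):
        """Stand-in for continuous-time trajectory evaluation (e.g. a B-spline)."""
        return np.array([np.interp(t, t_knots, p_knots[:, d]) for d in range(3)])

    # A 480-row image captured while the sensor moves: every row sees a
    # slightly different pose, which is exactly what distorts the image.
    t_knots = np.linspace(0.0, 1.0, 50)
    p_knots = np.cumsum(np.random.randn(50, 3) * 0.01, axis=0)  # random-walk positions
    line_delay = 30e-6                         # assumed ~30 us per row
    for row in (0, 240, 479):
        t_row = row_timestamp(0.5, row, line_delay)
        print(row, interp_position(t_row, t_knots, p_knots))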