Jianbiao Mei

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Address

Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: jianbiaomei@zju.edu.cn

Biography

I am pursuing my Ph.D. degree in the College of Control Science and Engineering at Zhejiang University, Hangzhou, China. My research interests include video segmentation, 3D perception, and autonomous driving.

Research Interests

  • Video Segmentation
  • 3D Perception
  • Autonomous Driving

Publications

  • Yukai Ma, Tiantian Wei, Naiting Zhong, Jianbiao Mei, Tao Hu, Licheng Wen, Xuemeng Yang, Botian Shi, and Yong Liu. LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking. IEEE Transactions on Neural Networks and Learning Systems, 2025.
    [BibTeX] [Abstract] [DOI]
    While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this article, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes—including appearance, motion patterns, and associated risks—LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module mimicking the human-driving learning process. The system consists of an analytic process (System-II) that accumulates driving experience through logical reasoning and a heuristic process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared with camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/
    @article{ma2025leap,
    title = {LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking},
    author = {Yukai Ma and Tiantian Wei and Naiting Zhong and Jianbiao Mei and Tao Hu and Licheng Wen and Xuemeng Yang and Botian Shi and Yong Liu},
    year = 2025,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2025.3626711},
    abstract = {While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this article, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes—including appearance, motion patterns, and associated risks—LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module mimicking the human-driving learning process. The system consists of an analytic process (System-II) that accumulates driving experience through logical reasoning and a heuristic process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared with camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/}
    }
  • Yu Yang, Jianbiao Mei, Siliang Du, Yilin Xiao, Huifeng Wu, Xiao Xu, and Yong Liu. DQFormer: Toward Unified LiDAR Panoptic Segmentation With Decoupled Queries for Large-Scale Outdoor Scenes. IEEE Transactions on Geoscience and Remote Sensing, 63:5702515, 2025.
    [BibTeX] [Abstract] [DOI] [PDF]
    LiDAR panoptic segmentation (LPS) performs semantic and instance segmentation for things (foreground objects) and stuff (background elements), essential for scene perception and remote sensing. While most existing methods separate these tasks using distinct branches (i.e., semantic and instance), recent approaches have unified LPS through a query-based paradigm. However, the distinct spatial distributions of foreground objects and background elements in large-scale outdoor scenes pose challenges. This article presents DQFormer, a novel framework for unified LPS that employs a decoupled query workflow to adapt to the characteristics of things and stuff in outdoor scenes. It first utilizes a feature encoder to extract multiscale voxel-wise, point-wise, and bird’s eye view (BEV) features. Then, a decoupled query generator proposes informative queries by localizing things/stuff positions and fusing multilevel BEV embeddings. A query-oriented mask decoder uses masked cross-attention to decode segmentation masks, which are combined with query semantics to produce panoptic results. Extensive experiments on large-scale outdoor scenes, including the vehicular datasets nuScenes and SemanticKITTI, as well as the aerial point cloud dataset DALES, show that DQFormer outperforms superior methods by +1.8%, +0.9%, and +3.5% in panoptic quality (PQ), respectively. Code is available at https://github.com/yuyang-cloud/DQFormer
    @article{yang2025dqf,
    title = {DQFormer: Toward Unified LiDAR Panoptic Segmentation With Decoupled Queries for Large-Scale Outdoor Scenes},
    author = {Yu Yang and Jianbiao Mei and Siliang Du and Yilin Xiao and Huifeng Wu and Xiao Xu and Yong Liu},
    year = 2025,
    journal = {IEEE Transactions on Geoscience and Remote Sensing},
    volume = 63,
    pages = {5702515},
    doi = {10.1109/TGRS.2025.3558951},
    abstract = {LiDAR panoptic segmentation (LPS) performs semantic and instance segmentation for things (foreground objects) and stuff (background elements), essential for scene perception and remote sensing. While most existing methods separate these tasks using distinct branches (i.e., semantic and instance), recent approaches have unified LPS through a query-based paradigm. However, the distinct spatial distributions of foreground objects and background elements in large-scale outdoor scenes pose challenges. This article presents DQFormer, a novel framework for unified LPS that employs a decoupled query workflow to adapt to the characteristics of things and stuff in outdoor scenes. It first utilizes a feature encoder to extract multiscale voxel-wise, point-wise, and bird’s eye view (BEV) features. Then, a decoupled query generator proposes informative queries by localizing things/stuff positions and fusing multilevel BEV embeddings. A query-oriented mask decoder uses masked cross-attention to decode segmentation masks, which are combined with query semantics to produce panoptic results. Extensive experiments on large-scale outdoor scenes, including the vehicular datasets nuScenes and SemanticKITTI, as well as the aerial point cloud dataset DALES, show that DQFormer outperforms superior methods by +1.8%, +0.9%, and +3.5% in panoptic quality (PQ), respectively. Code is available at https://github.com/yuyang-cloud/DQFormer}
    }
  • Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang. ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition. IEEE Transactions on Neural Networks and Learning Systems, 36:625-637, 2025.
    [BibTeX] [Abstract] [DOI] [PDF]
    The canonical approach to video action recognition dictates a neural network model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferability on new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters’ requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub “pre-train, adapt and fine-tune.” This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task to act more like pre-training problems via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.
    @article{wang2025aclip,
    title = {ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition},
    author = {Mengmeng Wang and Jiazheng Xing and Jianbiao Mei and Yong Liu and Yunliang Jiang},
    year = 2025,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    volume = 36,
    pages = {625-637},
    doi = {10.1109/TNNLS.2023.3331841},
    abstract = {The canonical approach to video action recognition dictates a neural network model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferability on new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters' requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, adapt and fine-tune." This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task to act more like pre-training problems via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.}
    }
  • Yukai Ma, Jianbiao Mei, Xuemeng Yang, Licheng Wen, Weihua Xu, Jiangning Zhang, Xingxing Zuo, Botian Shi, and Yong Liu. LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction using LiDAR and Camera. IEEE Robotics and Automation Letters, 10:852-859, 2025.
    [BibTeX] [Abstract] [DOI] [PDF]
    Semantic Scene Completion (SSC) is pivotal in autonomous driving perception, frequently confronted with the complexities of weather and illumination changes. The long-term strategy involves fusing multi-modal information to bolster the system’s robustness. Radar, increasingly utilized for 3D target detection, is gradually replacing LiDAR in autonomous driving applications, offering a robust sensing alternative. In this letter, we focus on the potential of 3D radar in semantic scene completion, pioneering cross-modal refinement techniques for improved robustness against weather and illumination changes and enhancing SSC performance. Regarding model architecture, we propose a three-stage tight fusion approach on BEV to realize a fusion framework for point clouds and images. Based on this foundation, we designed three cross-modal distillation modules—CMRD, BRD, and PDD. Our approach enhances the performance in radar-only (R-LiCROcc) and radar-camera (RC-LiCROcc) settings by distilling to them the rich semantic and structural information of the fused features of LiDAR and camera. Finally, our LC-Fusion, R-LiCROcc and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, with mIOU exceeding the baseline by 22.9%, 44.1%, and 15.5%, respectively.
    @article{ma2025locro,
    title = {LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction using LiDAR and Camera},
    author = {Yukai Ma and Jianbiao Mei and Xuemeng Yang and Licheng Wen and Weihua Xu and Jiangning Zhang and Xingxing Zuo and Botian Shi and Yong Liu},
    year = 2025,
    journal = {IEEE Robotics and Automation Letters},
    volume = 10,
    pages = {852-859},
    doi = {10.1109/LRA.2024.3511427},
    abstract = {Semantic Scene Completion (SSC) is pivotal in autonomous driving perception, frequently confronted with the complexities of weather and illumination changes. The long-term strategy involves fusing multi-modal information to bolster the system’s robustness. Radar, increasingly utilized for 3D target detection, is gradually replacing LiDAR in autonomous driving applications, offering a robust sensing alternative. In this letter, we focus on the potential of 3D radar in semantic scene completion, pioneering cross-modal refinement techniques for improved robustness against weather and illumination changes and enhancing SSC performance. Regarding model architecture, we propose a three-stage tight fusion approach on BEV to realize a fusion framework for point clouds and images. Based on this foundation, we designed three cross-modal distillation modules—CMRD, BRD, and PDD. Our approach enhances the performance in radar-only (R-LiCROcc) and radar-camera (RC-LiCROcc) settings by distilling to them the rich semantic and structural information of the fused features of LiDAR and camera. Finally, our LC-Fusion, R-LiCROcc and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, with mIOU exceeding the baseline by 22.9%, 44.1%, and 15.5%, respectively.}
    }
  • Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, and Gim Hee Lee. X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability. In 39th Conference on Neural Information Processing Systems (NeurIPS), 2025.
    [BibTeX]
    @inproceedings{yang2025xscene,
    title = {X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability},
    author = {Yu Yang and Alan Liang and Jianbiao Mei and Yukai Ma and Yong Liu and Gim Hee Lee},
    year = 2025,
    booktitle = {39th Conference on Neural Information Processing Systems (NeurIPS)}
    }
  • Yuehao Huang, Liang Liu, Shuangming Lei, Yukai Ma, Hao Su, Jianbiao Mei, Pengxiang Zhao, Yaqing Gu, Yong Liu, and Jiajun Lv. CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking. In Proceedings of the 33rd ACM International Conference on Multimedia (MM), pages 5237-5246, 2025.
    [BibTeX] [Abstract] [DOI] [PDF]
    Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiting their generalization capability in unseen scenarios. In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. CogDDN identifies appropriate target objects by semantically aligning detected objects with the given instructions. Furthermore, it incorporates a dual-process decision-making module, comprising a Heuristic Process for rapid, efficient decisions and an Analytic Process that analyzes past errors, accumulates them in a knowledge base, and continuously improves performance. Chain of Thought (CoT) reasoning strengthens the decision-making process. Extensive closed-loop evaluations on the AI2Thor simulator with the ProcThor dataset show that CogDDN outperforms single-view camera-only methods by 15%, demonstrating significant improvements in navigation accuracy and adaptability. The project page is available at https://yuehaohuang.github.io/CogDDN/.
    @inproceedings{huang2025cog,
    title = {CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking},
    author = {Yuehao Huang and Liang Liu and Shuangming Lei and Yukai Ma and Hao Su and Jianbiao Mei and Pengxiang Zhao and Yaqing Gu and Yong Liu and Jiajun Lv},
    year = 2025,
    booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM)},
    pages = {5237--5246},
    doi = {10.1145/3746027.3755832},
    abstract = {Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiting their generalization capability in unseen scenarios. In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. CogDDN identifies appropriate target objects by semantically aligning detected objects with the given instructions. Furthermore, it incorporates a dual-process decision-making module, comprising a Heuristic Process for rapid, efficient decisions and an Analytic Process that analyzes past errors, accumulates them in a knowledge base, and continuously improves performance. Chain of Thought (CoT) reasoning strengthens the decision-making process. Extensive closed-loop evaluations on the AI2Thor simulator with the ProcThor dataset show that CogDDN outperforms single-view camera-only methods by 15%, demonstrating significant improvements in navigation accuracy and adaptability. The project page is available at https://yuehaohuang.github.io/CogDDN/.}
    }
  • Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving. In 39th AAAI Conference on Artificial Intelligence (AAAI), pages 9327-9335, 2025.
    [BibTeX] [Abstract] [DOI] [PDF]
    World models envision potential future states based on various ego actions. They embed extensive knowledge about the driving environment, facilitating safe and scalable autonomous driving. Most existing methods primarily focus on either data generation or the pretraining paradigms of world models. Unlike the aforementioned prior works, we propose Drive-OccWorld, which adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. Specifically, we first introduce a semantic and motion-conditional normalization in the memory module, which accumulates semantic and dynamic information from historical BEV embeddings. These BEV features are then conveyed to the world decoder for future occupancy and flow forecasting, considering both geometry and spatiotemporal modeling. Additionally, we propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation and facilitate a broader range of downstream applications. Furthermore, we explore integrating the generative capabilities of the 4D world model with end-to-end planning, enabling continuous forecasting of future states and the selection of optimal trajectories using an occupancy-based cost function. Extensive experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy, opening new avenues for driving world generation and end-to-end planning.
    @inproceedings{yang2025dow,
    title = {Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving},
    author = {Yu Yang and Jianbiao Mei and Yukai Ma and Siliang Du and Wenqing Chen and Yijie Qian and Yuxiang Feng and Yong Liu},
    year = 2025,
    booktitle = {39th AAAI Conference on Artificial Intelligence (AAAI)},
    pages = {9327-9335},
    doi = {10.1609/aaai.v39i9.33010},
    abstract = {World models envision potential future states based on various ego actions. They embed extensive knowledge about the driving environment, facilitating safe and scalable autonomous driving. Most existing methods primarily focus on either data generation or the pretraining paradigms of world models. Unlike the aforementioned prior works, we propose Drive-OccWorld, which adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. Specifically, we first introduce a semantic and motion-conditional normalization in the memory module, which accumulates semantic and dynamic information from historical BEV embeddings. These BEV features are then conveyed to the world decoder for future occupancy and flow forecasting, considering both geometry and spatiotemporal modeling. Additionally, we propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation and facilitate a broader range of downstream applications. Furthermore, we explore integrating the generative capabilities of the 4D world model with end-to-end planning, enabling continuous forecasting of future states and the selection of optimal trajectories using an occupancy-based cost function. Extensive experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy, opening new avenues for driving world generation and end-to-end planning.}
    }
  • Jianbiao Mei, Yu Yang, Mengmeng Wang, Junyu Zhu, Jongwon Ra, Yukai Ma, Laijian Li, and Yong Liu. Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network. IEEE Transactions on Image Processing, 33:5468-5481, 2024.
    [BibTeX] [Abstract] [DOI] [PDF]
    Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.
    @article{mei2024cbs,
    title = {Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Junyu Zhu and Jongwon Ra and Yukai Ma and Laijian Li and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Image Processing},
    volume = 33,
    pages = {5468-5481},
    doi = {10.1109/TIP.2024.3461989},
    abstract = {Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.}
    }
  • Jianbiao Mei, Mengmeng Wang, Yu Yang, Zizhang Li, and Yong Liu. Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation. Applied Intelligence, 54:6138-6153, 2024.
    [BibTeX] [Abstract] [DOI] [PDF]
    Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they need adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git.
    @article{mei2024lsr,
    title = {Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation},
    author = {Jianbiao Mei and Mengmeng Wang and Yu Yang and Zizhang Li and Yong Liu},
    year = 2024,
    journal = {Applied Intelligence},
    volume = 54,
    pages = {6138-6153},
    doi = {10.1007/s10489-024-05486-y},
    abstract = {Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they need adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git.}
    }
  • Jianbiao Mei, Yu Yang, Mengmeng Wang, Zizhang Li, Jongwon Ra, and Yong Liu. LiDAR Video Object Segmentation with Dynamic Kernel Refinement. Pattern Recognition Letters, 178:21-27, 2024.
    [BibTeX] [Abstract] [DOI] [PDF]
    In this paper, we formalize memory- and tracking-based methods to perform the LiDAR-based Video Object Segmentation (VOS) task, which segments points of the specific 3D target (given in the first frame) in a LiDAR sequence. LiDAR-based VOS can directly provide target-aware geometric information for practical application scenarios like behavior analysis and anticipating danger. We first construct a LiDAR-based VOS dataset named KITTI-VOS based on SemanticKITTI, which acts as a testbed and facilitates comprehensive evaluations of algorithm performance. Next, we provide two types of baselines, i.e., memory-based and tracking-based baselines, to explore this task. Specifically, the first memory-based pipeline is built on a space–time memory network equipped with the non-local spatiotemporal attention-based memory bank. We further design a more potent variant to introduce the locality into the spatiotemporal attention module by local self-attention and cross-attention modules. For the second tracking-based baseline, we modify two representative 3D object tracking methods to adapt to LiDAR-based VOS tasks. Finally, we propose a refine module that takes mask priors and generates object-aware kernels, which could boost all the baselines’ performance. We evaluate the proposed methods on the dataset and demonstrate their effectiveness.
    @article{mei2024lvo,
    title = {LiDAR Video Object Segmentation with Dynamic Kernel Refinement},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Zizhang Li and Jongwon Ra and Yong Liu},
    year = 2024,
    journal = {Pattern Recognition Letters},
    volume = 178,
    pages = {21-27},
    doi = {10.1016/j.patrec.2023.12.013},
    abstract = {In this paper, we formalize memory- and tracking-based methods to perform the LiDAR-based Video Object Segmentation (VOS) task, which segments points of the specific 3D target (given in the first frame) in a LiDAR sequence. LiDAR-based VOS can directly provide target-aware geometric information for practical application scenarios like behavior analysis and anticipating danger. We first construct a LiDAR-based VOS dataset named KITTI-VOS based on SemanticKITTI, which acts as a testbed and facilitates comprehensive evaluations of algorithm performance. Next, we provide two types of baselines, i.e., memory-based and tracking-based baselines, to explore this task. Specifically, the first memory-based pipeline is built on a space–time memory network equipped with the non-local spatiotemporal attention-based memory bank. We further design a more potent variant to introduce the locality into the spatiotemporal attention module by local self-attention and cross-attention modules. For the second tracking-based baseline, we modify two representative 3D object tracking methods to adapt to LiDAR-based VOS tasks. Finally, we propose a refine module that takes mask priors and generates object-aware kernels, which could boost all the baselines’ performance. We evaluate the proposed methods on the dataset and demonstrate their effectiveness.}
    }
  • Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Xinyu Cai, Xin Li, Daocheng Fu, Bo Zhang, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yong Liu, and Yu Qiao. Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving. In 38th Conference on Neural Information Processing Systems (NeurIPS), 2024.
    [BibTeX] [Abstract]
    Autonomous driving has advanced significantly due to sensors, machine learning, and artificial intelligence improvements. However, prevailing methods struggle with intricate scenarios and causal relationships, hindering adaptability and interpretability in varied environments. To address the above problems, we introduce LeapAD, a novel paradigm for autonomous driving inspired by the human cognitive process. Specifically, LeapAD emulates human attention by selecting critical objects relevant to driving decisions, simplifying environmental interpretation, and mitigating decision-making complexities. Additionally, LeapAD incorporates an innovative dual-process decision-making module, which consists of an Analytic Process (System-II) for thorough analysis and reasoning, along with a Heuristic Process (System-I) for swift and empirical processing. The Analytic Process leverages its logical reasoning to accumulate linguistic driving experience, which is then transferred to the Heuristic Process by supervised fine-tuning. Through reflection mechanisms and a growing memory bank, LeapAD continuously improves itself from past mistakes in a closed-loop environment. Closed-loop testing in CARLA shows that LeapAD outperforms all methods relying solely on camera input, requiring 1-2 orders of magnitude less labeled data. Experiments also demonstrate that as the memory bank expands, the Heuristic Process with only 1.8B parameters can inherit the knowledge from a GPT-4 powered Analytic Process and achieve continuous performance improvement. Project page: https://pjlab-adg.github.io/LeapAD/.
    @inproceedings{mei2024cla,
    title = {Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving},
    author = {Jianbiao Mei and Yukai Ma and Xuemeng Yang and Licheng Wen and Xinyu Cai and Xin Li and Daocheng Fu and Bo Zhang and Pinlong Cai and Min Dou and Botian Shi and Liang He and Yong Liu and Yu Qiao},
    year = 2024,
    booktitle = {38th Conference on Neural Information Processing Systems (NeurIPS)},
    abstract = {Autonomous driving has advanced significantly due to sensors, machine learning, and artificial intelligence improvements. However, prevailing methods struggle with intricate scenarios and causal relationships, hindering adaptability and interpretability in varied environments. To address the above problems, we introduce LeapAD, a novel paradigm for autonomous driving inspired by the human cognitive process. Specifically, LeapAD emulates human attention by selecting critical objects relevant to driving decisions, simplifying environmental interpretation, and mitigating decision-making complexities. Additionally, LeapAD incorporates an innovative dual-process decision-making module, which consists of an Analytic Process (System-II) for thorough analysis and reasoning, along with a Heuristic Process (System-I) for swift and empirical processing. The Analytic Process leverages its logical reasoning to accumulate linguistic driving experience, which is then transferred to the Heuristic Process by supervised fine-tuning. Through reflection mechanisms and a growing memory bank, LeapAD continuously improves itself from past mistakes in a closed-loop environment. Closed-loop testing in CARLA shows that LeapAD outperforms all methods relying solely on camera input, requiring 1-2 orders of magnitude less labeled data. Experiments also demonstrate that as the memory bank expands, the Heuristic Process with only 1.8B parameters can inherit the knowledge from a GPT-4 powered Analytic Process and achieve continuous performance improvement. Project page: https://pjlab-adg.github.io/LeapAD/.}
    }
  • Chencan Fu, Lin Li, Jianbiao Mei, Yukai Ma, Linpeng Peng, Xiangrui Zhao, and Yong Liu. A Coarse-to-Fine Place Recognition Approach using Attention-guided Descriptors and Overlap Estimation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8493-8499, 2024.
    [BibTeX] [Abstract] [DOI] [PDF]
    Place recognition is a challenging but crucial task in robotics. Current description-based methods may be limited by representation capabilities, while pairwise similarity-based methods require exhaustive searches, which is time-consuming. In this paper, we present a novel coarse-to-fine approach to address these problems, which combines BEV (Bird’s Eye View) feature extraction, coarse-grained matching and fine-grained verification. In the coarse stage, our approach utilizes an attention-guided network to generate attention-guided descriptors. We then employ a fast affinity-based candidate selection process to identify the Top-K most similar candidates. In the fine stage, we estimate pairwise overlap among the narrowed-down place candidates to determine the final match. Experimental results on the KITTI and KITTI-360 datasets demonstrate that our approach outperforms state-of-the-art methods. The code will be released publicly soon.
    @inproceedings{fu2024ctf,
    title = {A Coarse-to-Fine Place Recognition Approach using Attention-guided Descriptors and Overlap Estimation},
    author = {Chencan Fu and Lin Li and Jianbiao Mei and Yukai Ma and Linpeng Peng and Xiangrui Zhao and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {8493-8499},
    doi = {10.1109/ICRA57147.2024.10611569},
    abstract = {Place recognition is a challenging but crucial task in robotics. Current description-based methods may be limited by representation capabilities, while pairwise similarity-based methods require exhaustive searches, which is time-consuming. In this paper, we present a novel coarse-to-fine approach to address these problems, which combines BEV (Bird's Eye View) feature extraction, coarse-grained matching and fine-grained verification. In the coarse stage, our approach utilizes an attention-guided network to generate attention-guided descriptors. We then employ a fast affinity-based candidate selection process to identify the Top-K most similar candidates. In the fine stage, we estimate pairwise overlap among the narrowed-down place candidates to determine the final match. Experimental results on the KITTI and KITTI-360 datasets demonstrate that our approach outperforms state-of-the-art methods. The code will be released publicly soon.}
    }
  • Jongwon Ra, Mengmeng Wang, Jianbiao Mei, Shanqi Liu, Yu Yang, and Yong Liu. Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks. In 11th International Conference on 3D Vision (3DV), pages 842-851, 2024.
    [BibTeX] [Abstract] [DOI]
    The point cloud-based 3D single object tracking plays an indispensable role in autonomous driving. However, the application of 3D object tracking in the real world is still challenging due to the inherent sparsity and self-occlusion of point cloud data. Therefore, it is necessary to exploit as much useful information from limited data as we can. Since 3D object tracking is a video-level task, the appearance of objects changes gradually over time, and there is rich spatiotemporal contextual information among historical frames. However, existing methods do not fully utilize this information. To address this, we propose a new method called SCTrack, which utilizes a memory-based paradigm to exploit spatiotemporal contextual information. SCTrack incorporates both long-term and short-term memory banks to store the spatiotemporal features of targets from historical frames. By doing so, the tracker can benefit from the entire video sequence and make more informed predictions. Additionally, SCTrack extracts the mask prior to augmenting the target representation, improving the target-background discriminability. Extensive experiments on KITTI, nuScenes, and Waymo Open datasets verify the effectiveness of our proposed method.
    @inproceedings{Ra2024esc,
    title = {Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks},
    author = {Jongwon Ra and Mengmeng Wang and Jianbiao Mei and Shanqi Liu and Yu Yang and Yong Liu},
    year = 2024,
    booktitle = {11th International Conference on 3D Vision (3DV)},
    pages = {842-851},
    doi = {10.1109/3DV62453.2024.00050},
    abstract = {The point cloud-based 3D single object tracking plays an indispensable role in autonomous driving. However, the application of 3D object tracking in the real world is still challenging due to the inherent sparsity and self-occlusion of point cloud data. Therefore, it is necessary to exploit as much useful information from limited data as we can. Since 3D object tracking is a video-level task, the appearance of objects changes gradually over time, and there is rich spatiotemporal contextual information among historical frames. However, existing methods do not fully utilize this information. To address this, we propose a new method called SCTrack, which utilizes a memory-based paradigm to exploit spatiotemporal contextual information. SCTrack incorporates both long-term and short-term memory banks to store the spatiotemporal features of targets from historical frames. By doing so, the tracker can benefit from the entire video sequence and make more informed predictions. Additionally, SCTrack extracts the mask prior to augmenting the target representation, improving the target-background discriminability. Extensive experiments on KITTI, nuScenes, and Waymo Open datasets verify the effectiveness of our proposed method.}
    }
  • Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, and Yong Liu. A Multimodal, Multi-task Adapting Framework for Video Action Recognition. In 38th AAAI Conference on Artificial Intelligence (AAAI), pages 5517-5525, 2024.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models’ generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter, that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
    @inproceedings{wang2024amm,
    title = {A Multimodal, Multi-task Adapting Framework for Video Action Recognition},
    author = {Mengmeng Wang and Jiazheng Xing and Boyuan Jiang and Jun Chen and Jianbiao Mei and Xingxing Zuo and Guang Dai and Jingdong Wang and Yong Liu},
    year = 2024,
    booktitle = {38th AAAI Conference on Artificial Intelligence (AAAI)},
    pages = {5517-5525},
    doi = {10.1609/aaai.v38i6.28361},
    abstract = {Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter, that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.}
    }
  • Laijian Li, Yukai Ma, Kai Tang, Xiangrui Zhao, Chao Chen, Jianxin Huang, Jianbiao Mei, and Yong Liu. Geo-localization with Transformer-based 2D-3D match Network. IEEE Robotics and Automation Letters (RA-L), 8:4855-4862, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    This letter presents a novel method for geographical localization by registering satellite maps with LiDAR point clouds. This method includes a Transformer-based 2D-3D matching network called D-GLSNet that directly matches the LiDAR point clouds and satellite images through end-to-end learning. Without the need for feature point detection, D-GLSNet provides accurate pixel-to-point association between the LiDAR point clouds and satellite images. And then, we can easily calculate the horizontal offset (Δx,Δy) and angular deviation Δθyaw between them, thereby achieving accurate registration. To demonstrate our network’s localization potential, we have designed a Geo-localization Node (GLN) that implements geographical localization and is plug-and-play in the SLAM system. Compared to GPS, GLN is less susceptible to external interference, such as building occlusion. In urban scenarios, our proposed D-GLSNet can output high-quality matching, enabling GLN to function stably and deliver more accurate localization results. Extensive experiments on the KITTI dataset show that our D-GLSNet method achieves a mean Relative Translation Error (RTE) of 1.43 m. Furthermore, our method outperforms state-of-the-art LiDAR-based geospatial localization methods when combined with odometry.
    @article{li2023glw,
    title = {Geo-localization with Transformer-based 2D-3D match Network},
    author = {Laijian Li and Yukai Ma and Kai Tang and Xiangrui Zhao and Chao Chen and Jianxin Huang and Jianbiao Mei and Yong Liu},
    year = 2023,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = 8,
    pages = {4855-4862},
    doi = {10.1109/LRA.2023.3290526},
    abstract = {This letter presents a novel method for geographical localization by registering satellite maps with LiDAR point clouds. This method includes a Transformer-based 2D-3D matching network called D-GLSNet that directly matches the LiDAR point clouds and satellite images through end-to-end learning. Without the need for feature point detection, D-GLSNet provides accurate pixel-to-point association between the LiDAR point clouds and satellite images. And then, we can easily calculate the horizontal offset (Δx,Δy) and angular deviation Δθyaw between them, thereby achieving accurate registration. To demonstrate our network's localization potential, we have designed a Geo-localization Node (GLN) that implements geographical localization and is plug-and-play in the SLAM system. Compared to GPS, GLN is less susceptible to external interference, such as building occlusion. In urban scenarios, our proposed D-GLSNet can output high-quality matching, enabling GLN to function stably and deliver more accurate localization results. Extensive experiments on the KITTI dataset show that our D-GLSNet method achieves a mean Relative Translation Error (RTE) of 1.43 m. Furthermore, our method outperforms state-of-the-art LiDAR-based geospatial localization methods when combined with odometry.}
    }
  • Yu Yang, Mengmeng Wang, Jianbiao Mei, and Yong Liu. Exploiting Semantic-level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos. Applied Intelligence, 53:15516-15536, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Temporal action proposal (TAP) aims to detect the action instances’ starting and ending times in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of the temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wider receptive field, while proposal and global methods lack the focalization of learning action frames and contain background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and nonaction clips (backgrounds) can learn the discriminations between them and action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) by exploiting the foreground mask to guide the self-attention mechanism to focus on and calculate semantic affinities with the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.
    @article{yang2023esl,
    title = {Exploiting Semantic-level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos},
    author = {Yu Yang and Mengmeng Wang and Jianbiao Mei and Yong Liu},
    year = 2023,
    journal = {Applied Intelligence},
    volume = 53,
    pages = {15516-15536},
    doi = {10.1007/s10489-022-04261-1},
    abstract = {Temporal action proposal (TAP) aims to detect the action instances' starting and ending times in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of the temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wider receptive field, while proposal and global methods lack the focalization of learning action frames and contain background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and nonaction clips (backgrounds) can learn the discriminations between them and action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) by exploiting the foreground mask to guide the self-attention mechanism to focus on and calculate semantic affinities with the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.}
    }
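    A minimal sketch of the mask-guided attention idea described in the entry above, assuming per-frame features and a soft foreground mask as inputs; the class name, tensor shapes, and the 0.5 threshold are illustrative assumptions, not the paper's exact MGT module:

    import torch
    import torch.nn as nn

    class MaskGuidedAttention(nn.Module):
        """Self-attention whose keys are restricted to (predicted) foreground frames."""
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x, fg_mask, threshold=0.5):
            # x: (B, T, C) per-frame features; fg_mask: (B, T) soft foreground scores in [0, 1]
            key_padding_mask = fg_mask < threshold            # True = ignore this key
            # Guard: if a video is predicted as all background, fall back to full attention.
            key_padding_mask[key_padding_mask.all(dim=1)] = False
            out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
            return out

    feats = torch.randn(2, 100, 256)   # 2 videos, 100 snippets, 256-d features
    fg = torch.rand(2, 100)            # stand-in for a foreground-mask prediction
    print(MaskGuidedAttention(256)(feats, fg).shape)   # torch.Size([2, 100, 256])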
  • Jianbiao Mei, Yu Yang, Mengmeng Wang, Zizhang Li, Xiaojun Hou, Jongwon Ra, Laijian Li, and Yong Liu. CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation. In 31st ACM International Conference on Multimedia (MM), pages 1884-1894, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embedding and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.
    @inproceedings{mei2023lps,
    title = {CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Zizhang Li and Xiaojun Hou and Jongwon Ra and Laijian Li and Yong Liu},
    year = 2023,
    booktitle = {31st ACM International Conference on Multimedia (MM)},
    pages = {1884-1894},
    doi = {10.1145/3581783.3612080},
    abstract = {This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embedding and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.}
    }
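    As a rough illustration of the detection-free, clustering-free decoding paradigm above, the sketch below turns each center embedding into a dynamic kernel that scores per-point instance membership; the feature dimension and the single linear kernel head are assumptions for illustration only:

    import torch
    import torch.nn as nn

    class DynamicMaskDecoder(nn.Module):
        """Decode instance masks by matching per-point features against center-derived kernels."""
        def __init__(self, feat_dim=64):
            super().__init__()
            self.kernel_head = nn.Linear(feat_dim, feat_dim)   # center embedding -> dynamic kernel

        def forward(self, point_feats, center_embeds):
            # point_feats: (N, C) per-point features; center_embeds: (K, C) proposed instance centers
            kernels = self.kernel_head(center_embeds)          # (K, C)
            return point_feats @ kernels.t()                   # (N, K) point-vs-instance logits

    points = torch.randn(50000, 64)
    centers = torch.randn(20, 64)
    masks = DynamicMaskDecoder()(points, centers).sigmoid() > 0.5   # boolean (N, K) instance masks
    print(masks.shape)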
  • Jianbiao Mei, Yu Yang, Mengmeng Wang, Tianxin Huang, Xuemeng Yang, and Yong Liu. SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7718-7725, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Semantic scene completion (SSC) jointly predicts the semantics and geometry of the entire 3D scene, which plays an essential role in 3D scene understanding for autonomous driving systems. SSC has achieved rapid progress with the help of semantic context in segmentation. However, how to effectively exploit the relationships between the semantic context in semantic segmentation and geometric structure in scene completion remains under exploration. In this paper, we propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. Specifically, we present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. And a BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is presented to aggregate the multi-scale features effectively and efficiently. Due to the low computational burden and powerful representation ability, our model has good generality while running in real-time. Extensive experiments on SemanticKITTI demonstrate our SSC-RS achieves state-of-the-art performance. Code is available at https://github.com/Jieqianyu/SSC-RS.git.
    @inproceedings{mei2023ssc,
    title = {SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Tianxin Huang and Xuemeng Yang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {7718-7725},
    doi = {10.1109/IROS55552.2023.10341742},
    abstract = {Semantic scene completion (SSC) jointly predicts the semantics and geometry of the entire 3D scene, which plays an essential role in 3D scene understanding for autonomous driving systems. SSC has achieved rapid progress with the help of semantic context in segmentation. However, how to effectively exploit the relationships between the semantic context in semantic segmentation and geometric structure in scene completion remains under exploration. In this paper, we propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. Specifically, we present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. And a BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is presented to aggregate the multi-scale features effectively and efficiently. Due to the low computational burden and powerful representation ability, our model has good generality while running in real-time. Extensive experiments on SemanticKITTI demonstrate our SSC-RS achieves state-of-the-art performance. Code is available at https://github.com/Jieqianyu/SSC-RS.git.}
    }
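    The adaptive fusion of the semantic and completion branches can be pictured with the sketch below, which gates two BEV feature maps with a learned per-location weight; the channel count and the sigmoid-gating form are assumptions, not the exact ARF module:

    import torch
    import torch.nn as nn

    class AdaptiveBEVFusion(nn.Module):
        """Fuse semantic and completion BEV maps with a learned per-location gate."""
        def __init__(self, channels=128):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.Sigmoid(),
            )

        def forward(self, sem_bev, geo_bev):
            # sem_bev, geo_bev: (B, C, H, W) BEV features from the two branches
            g = self.gate(torch.cat([sem_bev, geo_bev], dim=1))   # gate in [0, 1]
            return g * sem_bev + (1.0 - g) * geo_bev

    sem, geo = torch.randn(1, 128, 256, 256), torch.randn(1, 128, 256, 256)
    print(AdaptiveBEVFusion()(sem, geo).shape)   # torch.Size([1, 128, 256, 256])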
  • Jianbiao Mei, Yu Yang, Mengmeng Wang, Xiaojun Hou, Laijian Li, and Yong Liu. PANet: LiDAR Panoptic Segmentation with Sparse Instance Proposal and Aggregation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7726-7733, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Reliable LiDAR panoptic segmentation (LPS), including both semantic and instance segmentation, is vital for many robotic applications, such as autonomous driving. This work proposes a new LPS framework named PANet to eliminate the dependency on the offset branch and improve the performance on large objects, which are always over-segmented by clustering algorithms. Firstly, we propose a non-learning Sparse Instance Proposal (SIP) module with the “sampling-shifting-grouping” scheme to directly group thing points into instances from the raw point cloud efficiently. More specifically, balanced point sampling is introduced to generate sparse seed points with more uniform point distribution over the distance range. And a shift module, termed bubble shifting, is proposed to shrink the seed points to the clustered centers. Then we utilize the connected component label algorithm to generate instance proposals. Furthermore, an instance aggregation module is devised to integrate potentially fragmented instances, improving the performance of the SIP module on large objects. Extensive experiments show that PANet achieves state-of-the-art performance among published works on the SemanticKITTI validation and nuScenes validation for the panoptic segmentation task. Code is available at https://github.com/Jieqianyu/PANet.git.
    @inproceedings{mei2023pan,
    title = {PANet: LiDAR Panoptic Segmentation with Sparse Instance Proposal and Aggregation},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Xiaojun Hou and Laijian Li and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {7726-7733},
    doi = {10.1109/IROS55552.2023.10342468},
    abstract = {Reliable LiDAR panoptic segmentation (LPS), including both semantic and instance segmentation, is vital for many robotic applications, such as autonomous driving. This work proposes a new LPS framework named PANet to eliminate the dependency on the offset branch and improve the performance on large objects, which are always over-segmented by clustering algorithms. Firstly, we propose a non-learning Sparse Instance Proposal (SIP) module with the “sampling-shifting-grouping” scheme to directly group thing points into instances from the raw point cloud efficiently. More specifically, balanced point sampling is introduced to generate sparse seed points with more uniform point distribution over the distance range. And a shift module, termed bubble shifting, is proposed to shrink the seed points to the clustered centers. Then we utilize the connected component label algorithm to generate instance proposals. Furthermore, an instance aggregation module is devised to integrate potentially fragmented instances, improving the performance of the SIP module on large objects. Extensive experiments show that PANet achieves state-of-the-art performance among published works on the SemanticKITTI validation and nuScenes validation for the panoptic segmentation task. Code is available at https://github.com/Jieqianyu/PANet.git.}
    }
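    A simplified, NumPy/SciPy-only sketch of the “sampling-shifting-grouping” idea above: points are iteratively shifted toward local density centers and then grouped by connected components on a radius graph. The radius, iteration count, and mean-shift-style step are illustrative choices rather than the paper's SIP module:

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def group_instances(points, radius=0.5, iters=3):
        """Shift 'thing' points toward local centers, then group them by connected components."""
        shifted = points.copy()
        for _ in range(iters):                                  # shrink points toward local centers
            tree = cKDTree(shifted)
            neighborhoods = tree.query_ball_point(shifted, r=radius)
            shifted = np.stack([shifted[idx].mean(axis=0) for idx in neighborhoods])
        tree = cKDTree(shifted)                                  # connect points closer than radius
        pairs = np.array(list(tree.query_pairs(r=radius)))
        n = len(points)
        if len(pairs) == 0:
            return np.arange(n)                                  # every point is its own instance
        adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
        _, labels = connected_components(adj, directed=False)
        return labels                                            # (N,) instance id per point

    pts = np.random.rand(1000, 3) * 20.0
    print(group_instances(pts).max() + 1, "instances")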
  • Mengmeng Wang, Jianbiao Mei, Lina Liu, and Yong Liu. Delving Deeper Into Mask Utilization in Video Object Segmentation. IEEE Transactions on Image Processing, 31:6255-6266, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    This paper focuses on the mask utilization of video object segmentation (VOS). The mask here means the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.
    @article{wang2022ddi,
    title = {Delving Deeper Into Mask Utilization in Video Object Segmentation},
    author = {Mengmeng Wang and Jianbiao Mei and Lina Liu and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Image Processing},
    volume = {31},
    pages = {6255-6266},
    doi = {10.1109/TIP.2022.3208409},
    abstract = {This paper focuses on the mask utilization of video object segmentation (VOS). The mask here means the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.}
    }
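    One of the simplest mask-fusion strategies compared in this line of work is early fusion, i.e., feeding the reference mask to the encoder as an extra input channel; the backbone stem and channel counts below are assumptions for illustration, not the paper's chosen configuration:

    import torch
    import torch.nn as nn

    class MaskFusedEncoder(nn.Module):
        """Early fusion: the reference mask enters the encoder as a fourth input channel."""
        def __init__(self, out_dim=64):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(4, out_dim, kernel_size=7, stride=2, padding=3),   # 3 RGB + 1 mask channel
                nn.BatchNorm2d(out_dim),
                nn.ReLU(inplace=True),
            )

        def forward(self, frame, mask):
            # frame: (B, 3, H, W) reference frame; mask: (B, 1, H, W) its reference mask
            return self.stem(torch.cat([frame, mask], dim=1))

    feat = MaskFusedEncoder()(torch.randn(2, 3, 384, 384), torch.rand(2, 1, 384, 384))
    print(feat.shape)   # torch.Size([2, 64, 192, 192])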
  • Zizhang Li, Mengmeng Wang, Huaijin Pi, Kechun Xu, Jianbiao Mei, and Yong Liu. E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context. In European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]
    Recently, the image-wise implicit neural representation of videos, NeRV, has gained popularity for its promising results and swift speed compared to regular pixel-wise implicit representations. However, the redundant parameters within the network structure can cause a large model size when scaling up for desirable performance. The key reason of this phenomenon is the coupled formulation of NeRV, which outputs the spatial and temporal information of video frames directly from the frame index input. In this paper, we propose E-NeRV, which dramatically expedites NeRV by decomposing the image-wise implicit neural representation into separate spatial and temporal context. Under the guidance of this new formulation, our model greatly reduces the redundant model parameters, while retaining the representation ability. We experimentally find that our method can improve the performance to a large extent with fewer parameters, resulting in a more than 8× faster speed on convergence. Code is available at https://github.com/kyleleey/E-NeRV.
    @inproceedings{li2022ene,
    title = {E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context},
    author = {Zizhang Li and Mengmeng Wang and Huaijin Pi and Kechun Xu and Jianbiao Mei and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-19833-5_16},
    abstract = {Recently, the image-wise implicit neural representation of videos, NeRV, has gained popularity for its promising results and swift speed compared to regular pixel-wise implicit representations. However, the redundant parameters within the network structure can cause a large model size when scaling up for desirable performance. The key reason of this phenomenon is the coupled formulation of NeRV, which outputs the spatial and temporal information of video frames directly from the frame index input. In this paper, we propose E-NeRV, which dramatically expedites NeRV by decomposing the image-wise implicit neural representation into separate spatial and temporal context. Under the guidance of this new formulation, our model greatly reduces the redundant model parameters, while retaining the representation ability. We experimentally find that our method can improve the performance to a large extent with fewer parameters, resulting in a more than 8× faster speed on convergence. Code is available at https://github.com/kyleleey/E-NeRV.}
    }
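    The spatial-temporal decomposition can be illustrated with the toy model below, where a shared spatial grid is modulated by a temporal embedding of the frame index and then decoded to an image; all sizes, the modulation scheme, and the decoder are assumptions, not E-NeRV's actual architecture:

    import torch
    import torch.nn as nn

    class TinyImplicitVideo(nn.Module):
        """Frame index -> temporal code; a shared spatial grid is modulated and decoded to a frame."""
        def __init__(self, dim=64, grid=(9, 16)):
            super().__init__()
            self.spatial = nn.Parameter(torch.randn(1, dim, *grid))          # shared spatial context
            self.temporal = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))
            self.decoder = nn.Sequential(
                nn.Upsample(scale_factor=4, mode='nearest'),
                nn.Conv2d(dim, 3, kernel_size=3, padding=1),
            )

        def forward(self, t):
            # t: (B, 1) normalized frame indices in [0, 1]
            mod = self.temporal(t)[:, :, None, None]                         # (B, dim, 1, 1)
            return torch.sigmoid(self.decoder(self.spatial * mod))           # (B, 3, 36, 64)

    frames = TinyImplicitVideo()(torch.tensor([[0.0], [0.5]]))
    print(frames.shape)   # torch.Size([2, 3, 36, 64])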

Links