Mengmeng Wang

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I joined Zhejiang University as a PhD student in September 2020, supervised by Yong Liu. My research areas include object tracking, object detection, action recognition, depth estimation, deepfakes, and pose estimation, among others.

My work has been published at top computer vision conferences (CVPR, ICCV, ECCV, AAAI, etc.) and top robotics conferences (ICRA, IROS).

Research Interests

Object Tracking: Single Object Tracking (SOT), Multiple Object Tracking (MOT)
Action Recognition
Depth Estimation

Publications

Jiazheng Xing, Jian Zhao, Chao Xu, Mengmeng Wang, Guang Dai, Yong Liu, Jingdong Wang, and Xuelong Li. MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition. Pattern Recognition, 169:111902, 2026.

Applying large-scale vision-language pre-trained models like CLIP to few-shot action recognition (FSAR) can significantly enhance both performance and efficiency. While several studies have recognized this advantage, most rely on full-parameter fine-tuning to adapt CLIP’s visual encoder to FSAR data, which not only incurs high computational costs but also overlooks the potential of the visual encoder to engage in temporal modeling and focus on targeted semantics directly. To tackle these issues, we introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action related temporal and semantic representations. Our solution involves a token-level Fine-grained Multimodal Adaptation mechanism: a Global Temporal Adaptation captures motion cues from video sequences, while a Local Multimodal Adaptation integrates text-guided semantics from the support set to emphasize action-critical features. Additionally, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes. Extensive experiments demonstrate our superior performance in various tasks using minor trainable parameters.

@article{xing2026maf,
title = {MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition},
author = {Jiazheng Xing and Jian Zhao and Chao Xu and Mengmeng Wang and Guang Dai and Yong Liu and Jingdong Wang and Xuelong Li},
year = 2026,
journal = {Pattern Recognition},
volume = 169,
pages = {111902},
doi = {10.1016/j.patcog.2025.111902},
abstract = {Applying large-scale vision-language pre-trained models like CLIP to few-shot action recognition (FSAR) can significantly enhance both performance and efficiency. While several studies have recognized this advantage, most rely on full-parameter fine-tuning to adapt CLIP’s visual encoder to FSAR data, which not only incurs high computational costs but also overlooks the potential of the visual encoder to engage in temporal modeling and focus on targeted semantics directly. To tackle these issues, we introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action related temporal and semantic representations. Our solution involves a token-level Fine-grained Multimodal Adaptation mechanism: a Global Temporal Adaptation captures motion cues from video sequences, while a Local Multimodal Adaptation integrates text-guided semantics from the support set to emphasize action-critical features. Additionally, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes. Extensive experiments demonstrate our superior performance in various tasks using minor trainable parameters.}
}

Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang. ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition. IEEE Transactions on Neural Networks and Learning Systems, 36:625-637, 2025.

The canonical approach to video action recognition dictates a neural network model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferability on new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters’ requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub “pre-train, adapt and fine-tune.” This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task to act more like pre-training problems via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.

@article{wang2025aclip,
title = {ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition},
author = {Mengmeng Wang and Jiazheng Xing and Jianbiao Mei and Yong Liu and Yunliang Jiang},
year = 2025,
journal = {IEEE Transactions on Neural Networks and Learning Systems},
volume = 36,
pages = {625-637},
doi = {10.1109/TNNLS.2023.3331841},
abstract = {The canonical approach to video action recognition dictates a neural network model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferability on new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters' requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, adapt and fine-tune." This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task to act more like pre-training problems via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.}
}

Guanzhong Tian, Yiran Sun, Yuang Liu, Xianfang Zeng, Mengmeng Wang, Yong Liu, Jiangning Zhang, and Jun Chen. Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention. IEEE Transactions on Neural Networks and Learning Systems, 36:3930-3942, 2025.

Filter pruning is a significant feature selection technique to shrink the existing feature fusion schemes (especially on convolution calculation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somehow tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline suffer from being sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue through one stage: using an attention-based architecture that adaptively fuses the filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) to make the model focus on the filters of higher significance by training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate the gradients in the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on the complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on the two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.

@article{tian2025abp,
title = {Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention},
author = {Guanzhong Tian and Yiran Sun and Yuang Liu and Xianfang Zeng and Mengmeng Wang and Yong Liu and Jiangning Zhang and Jun Chen},
year = 2025,
journal = {IEEE Transactions on Neural Networks and Learning Systems},
volume = 36,
pages = {3930-3942},
doi = {10.1109/TNNLS.2021.3106917},
abstract = {Filter pruning is a significant feature selection technique to shrink the existing feature fusion schemes (especially on convolution calculation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somehow tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline suffer from being sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue through one stage: using an attention-based architecture that adaptively fuses the filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) to make the model focus on the filters of higher significance by training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate the gradients in the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on the complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. 
Extensive experimental results on the two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.}
}

Mengmeng Wang, Zeyi Huang, Xiangjie Kong, Guojiang Shen, Guang Dai, Jingdong Wang, and Yong Liu. Action Detail Matters: Refining Video Recognition with Local Action Queries. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19132-19142, 2025.

Video action recognition involves interpreting both global context and specific details to accurately identify actions. While previous models are effective at capturing spatiotemporal features, they often lack a focused representation of key action details. To address this, we introduce FocusVideo, a framework designed for refining video action recognition through integrated global and local feature learning. Inspired by human visual cognition theory, our approach balances the focus on both broad contextual changes and action-specific details, minimizing the influence of irrelevant background noise. We first employ learnable action queries to selectively emphasize action-relevant regions without requiring region-specific labels. Next, these queries are learned by a local action streaming branch that enables progressive query propagation. Moreover, we introduce a parameter-free feature interaction mechanism for effective multi-scale interaction between global and local features with minimal additional overhead. Extensive experiments demonstrate that FocusVideo achieves state-of-the-art performance across multiple action recognition datasets, validating its effectiveness and robustness in handling action-relevant details.

@inproceedings{wang2025adm,
title = {Action Detail Matters: Refining Video Recognition with Local Action Queries},
author = {Mengmeng Wang and Zeyi Huang and Xiangjie Kong and Guojiang Shen and Guang Dai and Jingdong Wang and Yong Liu},
year = 2025,
booktitle = {2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {19132-19142},
doi = {10.1109/CVPR52734.2025.01782},
abstract = {Video action recognition involves interpreting both global context and specific details to accurately identify actions. While previous models are effective at capturing spatiotemporal features, they often lack a focused representation of key action details. To address this, we introduce FocusVideo, a framework designed for refining video action recognition through integrated global and local feature learning. Inspired by human visual cognition theory, our approach balances the focus on both broad contextual changes and action-specific details, minimizing the influence of irrelevant background noise. We first employ learnable action queries to selectively emphasize action-relevant regions without requiring region-specific labels. Next, these queries are learned by a local action streaming branch that enables progressive query propagation. Moreover, we introduce a parameter-free feature interaction mechanism for effective multi-scale interaction between global and local features with minimal additional overhead. Extensive experiments demonstrate that FocusVideo achieves state-of-the-art performance across multiple action recognition datasets, validating its effectiveness and robustness in handling action-relevant details.}
}

Jun Chen, Hanwen Chen, Mengmeng Wang, Guang Dai, Ivor W. Tsang, and Yong Liu. Learning Discretized Neural Networks under Ricci Flow. Journal of Machine Learning Research, 25:1-44, 2024.

In this paper, we study Discretized Neural Networks (DNNs) composed of low-precision weights and activations, which suffer from either infinite or zero gradients due to the non-differentiable discrete function during training. Most training-based DNNs in such scenarios employ the standard Straight-Through Estimator (STE) to approximate the gradient w.r.t. discrete values. However, the use of STE introduces the problem of gradient mismatch, arising from perturbations in the approximated gradient. To address this problem, this paper reveals that this mismatch can be interpreted as a metric perturbation in a Riemannian manifold, viewed through the lens of duality theory. Building on information geometry, we construct the Linearly Nearly Euclidean (LNE) manifold for DNNs, providing a background for addressing perturbations. By introducing a partial differential equation on metrics, i.e., the Ricci flow, we establish the dynamical stability and convergence of the LNE metric with the L2-norm perturbation. In contrast to previous perturbation theories with convergence rates in fractional powers, the metric perturbation under the Ricci flow exhibits exponential decay in the LNE manifold. Experimental results across various datasets demonstrate that our method achieves superior and more stable performance for DNNs compared to other representative training-based methods.

@article{chen2024ldn,
title = {Learning Discretized Neural Networks under Ricci Flow},
author = {Jun Chen and Hanwen Chen and Mengmeng Wang and Guang Dai and Ivor W. Tsang and Yong Liu},
year = 2024,
journal = {Journal of Machine Learning Research},
volume = 25,
pages = {1-44},
abstract = {In this paper, we study Discretized Neural Networks (DNNs) composed of low-precision weights and activations, which suffer from either infinite or zero gradients due to the non-differentiable discrete function during training. Most training-based DNNs in such scenarios employ the standard Straight-Through Estimator (STE) to approximate the gradient w.r.t. discrete values. However, the use of STE introduces the problem of gradient mismatch, arising from perturbations in the approximated gradient. To address this problem, this paper reveals that this mismatch can be interpreted as a metric perturbation in a Riemannian manifold, viewed through the lens of duality theory. Building on information geometry, we construct the Linearly Nearly Euclidean (LNE) manifold for DNNs, providing a background for addressing perturbations. By introducing a partial differential equation on metrics, i.e., the Ricci flow, we establish the dynamical stability and convergence of the LNE metric with the L2-norm perturbation. In contrast to previous perturbation theories with convergence rates in fractional powers, the metric perturbation under the Ricci flow exhibits exponential decay in the LNE manifold. Experimental results across various datasets demonstrate that our method achieves superior and more stable performance for DNNs compared to other representative training-based methods.}
}

Jianbiao Mei, Yu Yang, Mengmeng Wang, Junyu Zhu, Jongwon Ra, Yukai Ma, Laijian Li, and Yong Liu. Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network. IEEE Transactions on Image Processing, 33:5468-5481, 2024.

Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.

@article{mei2024cbs,
title = {Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network},
author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Junyu Zhu and Jongwon Ra and Yukai Ma and Laijian Li and Yong Liu},
year = 2024,
journal = {IEEE Transactions on Image Processing},
volume = 33,
pages = {5468-5481},
doi = {10.1109/TIP.2024.3461989},
abstract = {Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.}
}

Jianbiao Mei, Mengmeng Wang, Yu Yang, Zizhang Li, and Yong Liu. Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation. Applied Intelligence, 54:6138-6153, 2024.

Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they need adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the common-used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git.

@article{mei2024lsr,
title = {Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation},
author = {Jianbiao Mei and Mengmeng Wang and Yu Yang and Zizhang Li and Yong Liu},
year = 2024,
journal = {Applied Intelligence},
volume = 54,
pages = {6138-6153},
doi = {10.1007/s10489-024-05486-y},
abstract = {Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they need adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the common-used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git.}
}

Jianbiao Mei, Yu Yang, Mengmeng Wang, Zizhang Li, Jongwon Ra, and Yong Liu. LiDAR Video Object Segmentation with Dynamic Kernel Refinement. Pattern Recognition Letters, 178:21-27, 2024.

In this paper, we formalize memory- and tracking-based methods to perform the LiDAR-based Video Object Segmentation (VOS) task, which segments points of the specific 3D target (given in the first frame) in a LiDAR sequence. LiDAR-based VOS can directly provide target-aware geometric information for practical application scenarios like behavior analysis and anticipating danger. We first construct a LiDAR-based VOS dataset named KITTI-VOS based on SemanticKITTI, which acts as a testbed and facilitates comprehensive evaluations of algorithm performance. Next, we provide two types of baselines, i.e., memory-based and tracking-based baselines, to explore this task. Specifically, the first memory-based pipeline is built on a space–time memory network equipped with the non-local spatiotemporal attention-based memory bank. We further design a more potent variant to introduce the locality into the spatiotemporal attention module by local self-attention and cross-attention modules. For the second tracking-based baseline, we modify two representative 3D object tracking methods to adapt to LiDAR-based VOS tasks. Finally, we propose a refine module that takes mask priors and generates object-aware kernels, which could boost all the baselines’ performance. We evaluate the proposed methods on the dataset and demonstrate their effectiveness.

@article{mei2024lvo,
title = {LiDAR Video Object Segmentation with Dynamic Kernel Refinement},
author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Zizhang Li and Jongwon Ra and Yong Liu},
year = 2024,
journal = {Pattern Recognition Letters},
volume = 178,
pages = {21-27},
doi = {10.1016/j.patrec.2023.12.013},
abstract = {In this paper, we formalize memory- and tracking-based methods to perform the LiDAR-based Video Object Segmentation (VOS) task, which segments points of the specific 3D target (given in the first frame) in a LiDAR sequence. LiDAR-based VOS can directly provide target-aware geometric information for practical application scenarios like behavior analysis and anticipating danger. We first construct a LiDAR-based VOS dataset named KITTI-VOS based on SemanticKITTI, which acts as a testbed and facilitates comprehensive evaluations of algorithm performance. Next, we provide two types of baselines, i.e., memory-based and tracking-based baselines, to explore this task. Specifically, the first memory-based pipeline is built on a space–time memory network equipped with the non-local spatiotemporal attention-based memory bank. We further design a more potent variant to introduce the locality into the spatiotemporal attention module by local self-attention and cross-attention modules. For the second tracking-based baseline, we modify two representative 3D object tracking methods to adapt to LiDAR-based VOS tasks. Finally, we propose a refine module that takes mask priors and generates object-aware kernels, which could boost all the baselines’ performance. We evaluate the proposed methods on the dataset and demonstrate their effectiveness.}
}

Xingxing Zuo, Mingming Zhang, Mengmeng Wang, Yiming Chen, Guoquan Huang, Yong Liu, and Mingyang Li. Visual-Based Kinematics and Pose Estimation for Skid-Steering Robots. IEEE Transactions on Automation Science and Engineering, 21:91-105, 2024.

To build commercial robots, skid-steering mechanical design is of increased popularity due to its manufacturing simplicity and unique mechanism. However, these also cause significant challenges on software and algorithm design, especially for the pose estimation (i.e., determining the robot’s rotation and position) of skid-steering robots, since they change their orientation with an inevitable skid. To tackle this problem, we propose a probabilistic sliding-window estimator dedicated to skid-steering robots, using measurements from a monocular camera, the wheel encoders, and optionally an inertial measurement unit (IMU). Specifically, we explicitly model the kinematics of skid-steering robots by both track instantaneous centers of rotation (ICRs) and correction factors, which are capable of compensating for the complexity of track-to-terrain interaction, the imperfectness of mechanical design, terrain conditions and smoothness, etc. To prevent performance reduction in robots’ long-term missions, the time- and location-varying kinematic parameters are estimated online along with pose estimation states in a tightly-coupled manner. More importantly, we conduct in-depth observability analysis for different sensors and design configurations in this paper, which provides us with theoretical tools in making the correct choice when building real commercial robots. In our experiments, we validate the proposed method by both simulation tests and real-world experiments, which demonstrate that our method outperforms competing methods by wide margins.

@article{zuo2024vbk,
title = {Visual-Based Kinematics and Pose Estimation for Skid-Steering Robots},
author = {Xingxing Zuo and Mingming Zhang and Mengmeng Wang and Yiming Chen and Guoquan Huang and Yong Liu and Mingyang Li},
year = 2024,
journal = {IEEE Transactions on Automation Science and Engineering},
volume = 21,
pages = {91-105},
doi = {10.1109/TASE.2022.3214984},
abstract = {To build commercial robots, skid-steering mechanical design is of increased popularity due to its manufacturing simplicity and unique mechanism. However, these also cause significant challenges on software and algorithm design, especially for the pose estimation (i.e., determining the robot’s rotation and position) of skid-steering robots, since they change their orientation with an inevitable skid. To tackle this problem, we propose a probabilistic sliding-window estimator dedicated to skid-steering robots, using measurements from a monocular camera, the wheel encoders, and optionally an inertial measurement unit (IMU). Specifically, we explicitly model the kinematics of skid-steering robots by both track instantaneous centers of rotation (ICRs) and correction factors, which are capable of compensating for the complexity of track-to-terrain interaction, the imperfectness of mechanical design, terrain conditions and smoothness, etc. To prevent performance reduction in robots’ long-term missions, the time- and location-varying kinematic parameters are estimated online along with pose estimation states in a tightly-coupled manner. More importantly, we conduct in-depth observability analysis for different sensors and design configurations in this paper, which provides us with theoretical tools in making the correct choice when building real commercial robots. In our experiments, we validate the proposed method by both simulation tests and real-world experiments, which demonstrate that our method outperforms competing methods by wide margins.}
}

Lina Liu, Xibin Song, Mengmeng Wang, Yuchao Dai, Yong Liu, and Liangjun Zhang. AGDF-Net: Learning Domain Generalizable Depth Features with Adaptive Guidance Fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46:3137-3155, 2024.

Cross-domain generalizable depth estimation aims to estimate the depth of target domains (i.e., real-world) using models trained on the source domains (i.e., synthetic). Previous methods mainly use additional real-world domain datasets to extract depth specific information for cross-domain generalizable depth estimation. Unfortunately, due to the large domain gap, adequate depth specific information is hard to obtain and interference is difficult to remove, which limits the performance. To relieve these problems, we propose a domain generalizable feature extraction network with adaptive guidance fusion (AGDF-Net) to fully acquire essential features for depth estimation at multi-scale feature levels. Specifically, our AGDF-Net first separates the image into initial depth and weak-related depth components with reconstruction and contrary losses. Subsequently, an adaptive guidance fusion module is designed to sufficiently intensify the initial depth features for domain generalizable intensified depth features acquisition. Finally, taking intensified depth features as input, an arbitrary depth estimation network can be used for real-world depth estimation. Using only synthetic datasets, our AGDF-Net can be applied to various real-world datasets (i.e., KITTI, NYUDv2, NuScenes, DrivingStereo and CityScapes) with state-of-the-art performances. Furthermore, experiments with a small amount of real-world data in a semi-supervised setting also demonstrate the superiority of AGDF-Net over state-of-the-art approaches.

@article{liu2024agdf,
title = {AGDF-Net: Learning Domain Generalizable Depth Features with Adaptive Guidance Fusion},
author = {Lina Liu and Xibin Song and Mengmeng Wang and Yuchao Dai and Yong Liu and Liangjun Zhang},
year = 2024,
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = 46,
pages = {3137-3155},
doi = {10.1109/TPAMI.2023.3342634},
abstract = {Cross-domain generalizable depth estimation aims to estimate the depth of target domains (i.e., real-world) using models trained on the source domains (i.e., synthetic). Previous methods mainly use additional real-world domain datasets to extract depth specific information for cross-domain generalizable depth estimation. Unfortunately, due to the large domain gap, adequate depth specific information is hard to obtain and interference is difficult to remove, which limits the performance. To relieve these problems, we propose a domain generalizable feature extraction network with adaptive guidance fusion (AGDF-Net) to fully acquire essential features for depth estimation at multi-scale feature levels. Specifically, our AGDF-Net first separates the image into initial depth and weak-related depth components with reconstruction and contrary losses. Subsequently, an adaptive guidance fusion module is designed to sufficiently intensify the initial depth features for domain generalizable intensified depth features acquisition. Finally, taking intensified depth features as input, an arbitrary depth estimation network can be used for real-world depth estimation. Using only synthetic datasets, our AGDF-Net can be applied to various real-world datasets (i.e., KITTI, NYUDv2, NuScenes, DrivingStereo and CityScapes) with state-of-the-art performances. Furthermore, experiments with a small amount of real-world data in a semi-supervised setting also demonstrate the superiority of AGDF-Net over state-of-the-art approaches.}
}

Juntao Jiang, Mengmeng Wang, Huizhong Tian, Linbo Cheng, and Yong Liu. LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation. In 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 4240-4246, 2024.

While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deployable models with real-time performance. However, existing lightweight models often suffer from poor robustness across datasets, limiting their widespread adoption. To address these challenges, this paper introduces LV-UNet, a lightweight and vanilla model that leverages pre-trained MobileNetv3-Large backbones and incorporates fusible modules. LV-UNet employs an enhanced deep training strategy and switches to a deployment mode during inference by re-parametrization, significantly reducing parameter count and computational overhead. Experimental results on ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvasir-SEG datasets demonstrate a better trade-off between performance and the computational load. The code will be released at https://github.com/juntaoJianggavin/LV-UNet.

@inproceedings{jiang2024lvu,
title = {LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation},
author = {Juntao Jiang and Mengmeng Wang and Huizhong Tian and Linbo Cheng and Yong Liu},
year = 2024,
booktitle = {2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
pages = {4240-4246},
doi = {10.1109/BIBM62325.2024.10822465},
abstract = {While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deployable models with real-time performance. However, existing lightweight models often suffer from poor robustness across datasets, limiting their widespread adoption. To address these challenges, this paper introduces LV-UNet, a lightweight and vanilla model that leverages pre-trained MobileNetv3-Large backbones and incorporates fusible modules. LV-UNet employs an enhanced deep training strategy and switches to a deployment mode during inference by re-parametrization, significantly reducing parameter count and computational overhead. Experimental results on ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvasir-SEG datasets demonstrate a better trade-off between performance and the computational load. The code will be released at https://github.com/juntaoJianggavin/LV-UNet.}
}

Shuo Xin, Zhen Zhang, Liang Liu, Xiaojun Hou, Deye Zhu, Mengmeng Wang, and Yong Liu. A Robotic-centric Paradigm for 3D Human Tracking Under Complex Environments Using Multi-modal Adaptation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4934-4940, 2024.

The goal of this paper is to strike a feasible tracking paradigm that can make 3D human trackers applicable on robot platforms and enable more high-level tasks. Till now, two fundamental problems haven’t been adequately addressed. One is the computational cost lightweight enough for robotic deployment, and the other is the easily-influenced accuracy varied greatly in complex real environments. In this paper, a robotic-centric tracking paradigm called MATNet is proposed that directly matches the LiDAR point clouds and RGB videos through end-to-end learning. To improve the low accuracy of human tracking against disturbance, a coarse-to-fine Transformer along with target-ware augmentation is proposed by fusing RGB videos and point clouds through a pyramid encoding and decoding strategy. To better meet the real-time requirement of actual robot deployment, we introduce the parameter-efficient adaptation tuning that greatly shortens the model’s training time. Furthermore, we also propose a five-step Anti-shake Refinement strategy and have added human prior values to overcome the strong shaking on the robot platform. Extensive experiments confirm that MATNet significantly outperforms the previous state-of-the-art on both open-source datasets and large-scale robotic datasets.

@inproceedings{xin2024arc,
title = {A Robotic-centric Paradigm for 3D Human Tracking Under Complex Environments Using Multi-modal Adaptation},
author = {Shuo Xin and Zhen Zhang and Liang Liu and Xiaojun Hou and Deye Zhu and Mengmeng Wang and Yong Liu},
year = 2024,
booktitle = {2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
pages = {4934-4940},
doi = {10.1109/IROS58592.2024.10802166},
abstract = {The goal of this paper is to strike a feasible tracking paradigm that can make 3D human trackers applicable on robot platforms and enable more high-level tasks. Till now, two fundamental problems haven't been adequately addressed. One is the computational cost lightweight enough for robotic deployment, and the other is the easily-influenced accuracy varied greatly in complex real environments. In this paper, a robotic-centric tracking paradigm called MATNet is proposed that directly matches the LiDAR point clouds and RGB videos through end-to-end learning. To improve the low accuracy of human tracking against disturbance, a coarse-to-fine Transformer along with target-ware augmentation is proposed by fusing RGB videos and point clouds through a pyramid encoding and decoding strategy. To better meet the real-time requirement of actual robot deployment, we introduce the parameter-efficient adaptation tuning that greatly shortens the model's training time. Furthermore, we also propose a five-step Anti-shake Refinement strategy and have added human prior values to overcome the strong shaking on the robot platform. Extensive experiments confirm that MATNet significantly outperforms the previous state-of-the-art on both open-source datasets and large-scale robotic datasets.}
}

Xiaojun Hou, Jiazheng Xing, Yijie Qian, Yaowei Guo, Shuo Xin, Junhao Chen, Kai Tang, Mengmeng Wang, Zhengkai Jiang, Liang Liu, and Yong Liu. SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26541-26551, 2024.

Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at: https://github.com/hoqolo/SDSTrack.

@inproceedings{hou2024sds,
title = {SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking},
author = {Xiaojun Hou and Jiazheng Xing and Yijie Qian and Yaowei Guo and Shuo Xin and Junhao Chen and Kai Tang and Mengmeng Wang and Zhengkai Jiang and Liang Liu and Yong Liu},
year = 2024,
booktitle = {2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {26541-26551},
doi = {10.1109/CVPR52733.2024.02507},
abstract = {Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at: https://github.com/hoqolo/SDSTrack.}
}

Shuo Xin, Zhen Zhang, Mengmeng Wang, Xiaojun Hou, Yaowei Guo, Xiao Kang, Liang Liu, and Yong Liu. Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 337-344, 2024.

Tracking a specific person in 3D scene is gaining momentum due to its numerous applications in robotics. Currently, most 3D trackers focus on driving scenarios with neglected jitter and uncomplicated surroundings, which results in their severe degeneration in complex environments, especially on jolting robot platforms (only 20-60% success rate). To improve the accuracy, a Point-Video-based Transformer Tracking model (PVTrack) is presented for robots. It is the first multi-modal 3D human tracking work that incorporates point clouds together with RGB videos to achieve information complementarity. Moreover, PVTrack proposes the Siamese Point-Video Transformer for feature aggregation to overcome dynamic environments, which captures more target-aware information through the hierarchical attention mechanism adaptively. Considering the violent shaking on robots and rugged terrains, a lateral Human-ware Proposal Network is designed together with an Anti-shake Proposal Compensation module. It alleviates the disturbance caused by complex scenes as well as the particularity of the robot platform. Experiments show that our method achieves state-of-the-art performance on both KITTI/Waymo datasets and a quadruped robot for various indoor and outdoor scenes.

@inproceedings{xin2024mmh,
title = {Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer},
author = {Shuo Xin and Zhen Zhang and Mengmeng Wang and Xiaojun Hou and Yaowei Guo and Xiao Kang and Liang Liu and Yong Liu},
year = 2024,
booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
pages = {337-344},
doi = {10.1109/ICRA57147.2024.10610979},
abstract = {Tracking a specific person in 3D scene is gaining momentum due to its numerous applications in robotics. Currently, most 3D trackers focus on driving scenarios with neglected jitter and uncomplicated surroundings, which results in their severe degeneration in complex environments, especially on jolting robot platforms (only 20-60% success rate). To improve the accuracy, a Point-Video-based Transformer Tracking model (PVTrack) is presented for robots. It is the first multi-modal 3D human tracking work that incorporates point clouds together with RGB videos to achieve information complementarity. Moreover, PVTrack proposes the Siamese Point-Video Transformer for feature aggregation to overcome dynamic environments, which captures more target-aware information through the hierarchical attention mechanism adaptively. Considering the violent shaking on robots and rugged terrains, a lateral Human-ware Proposal Network is designed together with an Anti-shake Proposal Compensation module. It alleviates the disturbance caused by complex scenes as well as the particularity of the robot platform. Experiments show that our method achieves state-of-the-art performance on both KITTI/Waymo datasets and a quadruped robot for various indoor and outdoor scenes.}
}

Jun Chen, Haishan Ye, Mengmeng Wang, Tianxin Huang, Guang Dai, Ivor W. Tsang, and Yong Liu. Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold. In 12th International Conference on Learning Representations (ICLR), 2024.

The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than that of second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there is little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.

@inproceedings{chen2024drc,
title = {Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold},
author = {Jun Chen and Haishan Ye and Mengmeng Wang and Tianxin Huang and Guang Dai and Ivor W Tsang and Yong Liu},
year = 2024,
booktitle = {12th International Conference on Learning Representations (ICLR)},
abstract = {The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than that of second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there is little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.}
}

Shuo Xin, Liang Liu, Xiao Kang, Zhen Zhang, Mengmeng Wang, and Yong Liu. Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network. In 7th International Symposium on Autonomous Systems (ISAS), 2024.

3D human tracking plays a crucial role in the automation intelligence system. Current approaches focus on achieving higher performance on traditional driving datasets like KITTI, which overlook the jitteriness of the platform and the complexity of the environments. Once the scenarios are migrated to jolting robot platforms, they all degenerate severely with only a 20-60% success rate, which greatly restricts the high-level application of autonomous systems. In this work, beyond traditional flat scenes, we introduce Multi-modal Human Tracking Paradigm (MHTrack), a unified multimodal transformer-based model that can effectively track the target person frame-by-frame in point and video sequences. Specifically, we design a speed-inertia module-assisted stabilization mechanism along with an alternate training strategy to better migrate the tracking algorithm to the robot platform. To capture more target-aware information, we combine the geometric and appearance features of point clouds and video frames together based on a hierarchical Siamese Transformer Network. Additionally, considering the prior characteristics of the human category, we design a lateral cross-attention pyramid head for deeper feature aggregation and final 3D BBox generation. Extensive experiments confirm that MHTrack significantly outperforms the previous state-of-the-arts on both open-source datasets and large-scale robotic datasets. Further analysis verifies each component’s effectiveness and shows the robotic-centric paradigm’s promising potential when deployed into dynamic robotic systems.

@inproceedings{xin2024btd,
title = {Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network},
author = {Shuo Xin and Liang Liu and Xiao Kang and Zhen Zhang and Mengmeng Wang and Yong Liu},
year = 2024,
booktitle = {7th International Symposium on Autonomous Systems (ISAS)},
doi = {10.1109/ISAS61044.2024.10552604},
abstract = {3D human tracking plays a crucial role in the automation intelligence system. Current approaches focus on achieving higher performance on traditional driving datasets like KITTI, which overlook the jitteriness of the platform and the complexity of the environments. Once the scenarios are migrated to jolting robot platforms, they all degenerate severely with only a 20-60% success rate, which greatly restricts the high-level application of autonomous systems. In this work, beyond traditional flat scenes, we introduce Multi-modal Human Tracking Paradigm (MHTrack), a unified multimodal transformer-based model that can effectively track the target person frame-by-frame in point and video sequences. Specifically, we design a speed-inertia module-assisted stabilization mechanism along with an alternate training strategy to better migrate the tracking algorithm to the robot platform. To capture more target-aware information, we combine the geometric and appearance features of point clouds and video frames together based on a hierarchical Siamese Transformer Network. Additionally, considering the prior characteristics of the human category, we design a lateral cross-attention pyramid head for deeper feature aggregation and final 3D BBox generation. Extensive experiments confirm that MHTrack significantly outperforms the previous state-of-the-arts on both open-source datasets and large-scale robotic datasets. Further analysis verifies each component's effectiveness and shows the robotic-centric paradigm's promising potential when deployed into dynamic robotic systems.}
}

Jongwon Ra, Mengmeng Wang, Jianbiao Mei, Shanqi Liu, Yu Yang, and Yong Liu. Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks. In 11th International Conference on 3D Vision (3DV), pages 842-851, 2024.

The point cloud-based 3D single object tracking plays an indispensable role in autonomous driving. However, the application of 3D object tracking in the real world is still challenging due to the inherent sparsity and self-occlusion of point cloud data. Therefore, it is necessary to exploit as much useful information from limited data as we can. Since 3D object tracking is a video-level task, the appearance of objects changes gradually over time, and there is rich spatiotemporal contextual information among historical frames. However, existing methods do not fully utilize this information. To address this, we propose a new method called SCTrack, which utilizes a memory-based paradigm to exploit spatiotemporal contextual information. SCTrack incorporates both long-term and short-term memory banks to store the spatiotemporal features of targets from historical frames. By doing so, the tracker can benefit from the entire video sequence and make more informed predictions. Additionally, SCTrack extracts the mask prior to augmenting the target representation, improving the target-background discriminability. Extensive experiments on KITTI, nuScenes, and Waymo Open datasets verify the effectiveness of our proposed method.

@inproceedings{Ra2024esc,
title = {Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks},
author = {Jongwon Ra and Mengmeng Wang and Jianbiao Mei and Shanqi Liu and Yu Yang and Yong Liu},
year = 2024,
booktitle = {11th International Conference on 3D Vision (3DV)},
pages = {842-851},
doi = {10.1109/3DV62453.2024.00050},
abstract = {The point cloud-based 3D single object tracking plays an indispensable role in autonomous driving. However, the application of 3D object tracking in the real world is still challenging due to the inherent sparsity and self-occlusion of point cloud data. Therefore, it is necessary to exploit as much useful information from limited data as we can. Since 3D object tracking is a video-level task, the appearance of objects changes gradually over time, and there is rich spatiotemporal contextual information among historical frames. However, existing methods do not fully utilize this information. To address this, we propose a new method called SCTrack, which utilizes a memory-based paradigm to exploit spatiotemporal contextual information. SCTrack incorporates both long-term and short-term memory banks to store the spatiotemporal features of targets from historical frames. By doing so, the tracker can benefit from the entire video sequence and make more informed predictions. Additionally, SCTrack extracts the mask prior to augmenting the target representation, improving the target-background discriminability. Extensive experiments on KITTI, nuScenes, and Waymo Open datasets verify the effectiveness of our proposed method.}
}

Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, and Yong Liu. A Multimodal, Multi-task Adapting Framework for Video Action Recognition. In 38th AAAI Conference on Artificial Intelligence (AAAI), pages 5517-5525, 2024.

Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models’ generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter, that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.

@inproceedings{wang2024amm,
title = {A Multimodal, Multi-task Adapting Framework for Video Action Recognition},
author = {Mengmeng Wang and Jiazheng Xing and Boyuan Jiang and Jun Chen and Jianbiao Mei and Xingxing Zuo and Guang Dai and Jingdong Wang and Yong Liu},
year = 2024,
booktitle = {38th AAAI Conference on Artificial Intelligence (AAAI)},
pages = {5517-5525},
doi = {10.1609/aaai.v38i6.28361},
abstract = {Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter, that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.}
}

Yufei Liang, Mengmeng Wang, Yining Jin, Shuwen Pan, and Yong Liu. Hierarchical Supervisions with Two-Stream Network for Deepfake Detection. Pattern Recognition Letters, 172:121-127, 2023.

Recently, the quality of face generation and manipulation has reached impressive levels, making it difficult even for humans to distinguish real and fake faces. At the same time, methods to distinguish fake faces from reals came out, such as Deepfake detection. However, the task of Deepfake detection remains challenging, especially the low-quality fake images circulating on the Internet and the diversity of face generation methods. In this work, we propose a new Deepfake detection network that could effectively distinguish both high-quality and low-quality faces generated by various generation methods. First, we design a two-stream framework that incorporates a regular spatial stream and a frequency stream to handle the low-quality problem since we find that the frequency domain artifacts of low-quality images will be preserved. Second, we introduce hierarchical supervisions in a coarse-to-fine manner, which consists of a coarse binary classification branch to classify reals and fakes and a five-category classification branch to classify reals and four different types of fakes. Extensive experiments have proved the effectiveness of our framework on several widely used datasets.

@article{liang2023hs,
title = {Hierarchical Supervisions with Two-Stream Network for Deepfake Detection},
author = {Yufei Liang and Mengmeng Wang and Yining Jin and Shuwen Pan and Yong Liu},
year = 2023,
journal = {Pattern Recognition Letters},
volume = 172,
pages = {121-127},
doi = {10.1016/j.patrec.2023.05.029},
abstract = {Recently, the quality of face generation and manipulation has reached impressive levels, making it difficult even for humans to distinguish real and fake faces. At the same time, methods to distinguish fake faces from reals came out, such as Deepfake detection. However, the task of Deepfake detection remains challenging, especially the low-quality fake images circulating on the Internet and the diversity of face generation methods. In this work, we propose a new Deepfake detection network that could effectively distinguish both high-quality and low-quality faces generated by various generation methods. First, we design a two-stream framework that incorporates a regular spatial stream and a frequency stream to handle the low-quality problem since we find that the frequency domain artifacts of low-quality images will be preserved. Second, we introduce hierarchical supervisions in a coarse-to-fine manner, which consists of a coarse binary classification branch to classify reals and fakes and a five-category classification branch to classify reals and four different types of fakes. Extensive experiments have proved the effectiveness of our framework on several widely used datasets.}
}

Jiajun Lv, Xiaolei Lang, Jinhong Xu, Mengmeng Wang, Yong Liu, and Xingxing Zuo. Continuous-Time Fixed-Lag Smoothing for LiDAR-Inertial-Camera SLAM. IEEE/ASME Transactions on Mechatronics, 28:2259-2270, 2023.

Localization and mapping with heterogeneous multi-sensor fusion have been prevalent in recent years. To adequately fuse multi-modal sensor measurements received at different time instants and different frequencies, we estimate the continuous-time trajectory by fixed-lag smoothing within a factor-graph optimization framework. With the continuous-time formulation, we can query poses at any time instants corresponding to the sensor measurements. To bound the computation complexity of the continuous-time fixed-lag smoother, we maintain temporal and keyframe sliding windows with constant size, and probabilistically marginalize out control points of the trajectory and other states, which allows preserving prior information for future sliding-window optimization. Based on continuous-time fixed-lag smoothing, we design tightly-coupled multi-modal SLAM algorithms with a variety of sensor combinations, like the LiDAR-inertial and LiDAR-inertial-camera SLAM systems, in which online time-offset calibration is also naturally supported. More importantly, benefiting from the marginalization and our derived analytical Jacobians for optimization, the proposed continuous-time SLAM systems can achieve real-time performance regardless of the high complexity of continuous-time formulation. The proposed multi-modal SLAM systems have been widely evaluated on three public datasets and self-collected datasets. The results demonstrate that the proposed continuous-time SLAM systems can achieve high-accuracy pose estimations and outperform existing state-of-the-art methods. To benefit the research community, we will open source our code at https://github.com/APRIL-ZJU/clic.

@article{lv2023ctfl,
title = {Continuous-Time Fixed-Lag Smoothing for LiDAR-Inertial-Camera SLAM},
author = {Jiajun Lv and Xiaolei Lang and Jinhong Xu and Mengmeng Wang and Yong Liu and Xingxing Zuo},
year = 2023,
journal = {IEEE/ASME Transactions on Mechatronics},
volume = 28,
pages = {2259-2270},
doi = {10.1109/TMECH.2023.3241398},
abstract = {Localization and mapping with heterogeneous multi-sensor fusion have been prevalent in recent years. To adequately fuse multi-modal sensor measurements received at different time instants and different frequencies, we estimate the continuous-time trajectory by fixed-lag smoothing within a factor-graph optimization framework. With the continuous-time formulation, we can query poses at any time instants corresponding to the sensor measurements. To bound the computation complexity of the continuous-time fixed-lag smoother, we maintain temporal and keyframe sliding windows with constant size, and probabilistically marginalize out control points of the trajectory and other states, which allows preserving prior information for future sliding-window optimization. Based on continuous-time fixed-lag smoothing, we design tightly-coupled multi-modal SLAM algorithms with a variety of sensor combinations, like the LiDAR-inertial and LiDAR-inertial-camera SLAM systems, in which online time-offset calibration is also naturally supported. More importantly, benefiting from the marginalization and our derived analytical Jacobians for optimization, the proposed continuous-time SLAM systems can achieve real-time performance regardless of the high complexity of continuous-time formulation. The proposed multi-modal SLAM systems have been widely evaluated on three public datasets and self-collected datasets. The results demonstrate that the proposed continuous-time SLAM systems can achieve high-accuracy pose estimations and outperform existing state-of-the-art methods. To benefit the research community, we will open source our code at {https://github.com/APRIL-ZJU/clic}.}
}

Yu Yang, Mengmeng Wang, Jianbiao Mei, and Yong Liu. Exploiting Semantic-level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos. Applied Intelligence, 53:15516-15536, 2023.

Temporal action proposal (TAP) aims to detect the action instances’ starting and ending times in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of the temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wider receptive field, while proposal and global methods lack the focalization of learning action frames and contain background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and nonaction clips (backgrounds) can learn the discriminations between them and action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) by exploiting the foreground mask to guide the self-attention mechanism to focus on and calculate semantic affinities with the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.

@article{yang2023esl,
title = {Exploiting Semantic-level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos},
author = {Yu Yang and Mengmeng Wang and Jianbiao Mei and Yong Liu},
year = 2023,
journal = {Applied Intelligence},
volume = 53,
pages = {15516-15536},
doi = {10.1007/s10489-022-04261-1},
abstract = {Temporal action proposal (TAP) aims to detect the action instances' starting and ending times in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of the temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wider receptive field, while proposal and global methods lack the focalization of learning action frames and contain background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and nonaction clips (backgrounds) can learn the discriminations between them and action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) by exploiting the foreground mask to guide the self-attention mechanism to focus on and calculate semantic affinities with the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.}
}

Chao Xu, Xia Wu, Mengmeng Wang, Feng Qiu, Yong Liu, and Jun Ren. Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture. Neurocomputing, 523:58-68, 2023.

Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for a man to interact naturally with a machine. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems since: (1) It is challenging to locate the temporal boundaries of performing gestures; (2) There are significant differences in performing gestures among different people, resulting in a variety of gestures; (3) There must be a trade-off between the accuracy and the computational consumption. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in time series, then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.

@article{xv2022idg,
title = {Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture},
author = {Chao Xu and Xia Wu and Mengmeng Wang and Feng Qiu and Yong Liu and Jun Ren},
year = 2023,
journal = {Neurocomputing},
volume = 523,
pages = {58-68},
doi = {10.1016/j.neucom.2022.12.022},
abstract = {Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for a man to interact naturally with a machine. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems since: (1) It is challenging to locate the temporal boundaries of performing gestures; (2) There are significant differences in performing gestures among different people, resulting in a variety of gestures; (3) There must be a trade-off between the accuracy and the computational consumption. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in time series, then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.}
}

Mengmeng Wang, Jiazheng Xing, Jing Su, Jun Chen, and Yong Liu. Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:3347-3362, 2023.

Recent methods for action recognition always apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flows to present motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both kinds of features in a unified 2D CNN without any 3D convolution or optical flow calculation. In particular, we first design a channel-wise spatiotemporal module to present the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Secondly, we combine these two modules and an identity mapping path into one united block that can easily replace the original residual block in the ResNet architecture, forming a simple yet effective network termed STM network by introducing very limited extra computation cost and parameters. Thirdly, we propose a novel Twins Training framework for action recognition by incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to fully stretch the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 & v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods in all the datasets.

@article{wang2022lsm,
title = {Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition},
author = {Mengmeng Wang and Jiazheng Xing and Jing Su and Jun Chen and Yong Liu},
year = 2023,
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = 45,
pages = {3347-3362},
doi = {10.1109/TPAMI.2022.3173658},
abstract = {Recent methods for action recognition always apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flows to present motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both kinds of features in a unified 2D CNN without any 3D convolution or optical flow calculation. In particular, we first design a channel-wise spatiotemporal module to present the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Secondly, we combine these two modules and an identity mapping path into one united block that can easily replace the original residual block in the ResNet architecture, forming a simple yet effective network termed STM network by introducing very limited extra computation cost and parameters. Thirdly, we propose a novel Twins Training framework for action recognition by incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to fully stretch the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 \& v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods in all the datasets.}
}

Honglin Lin, Mengmeng Wang, Yong Liu, and Jiaxin Kou. Correlation-based and content-enhanced network for video style transfer. Pattern Analysis and Applications, 26:343-355, 2023.

Artistic style transfer aims to migrate the style pattern from a referenced style image to a given content image, which has achieved significant advances in recent years. However, producing temporally coherent and visually pleasing stylized frames is still challenging. Although existing works have made some effort, they rely on the inefficient optical flow or other cumbersome operations to model spatiotemporal information. In this paper, we propose an arbitrary video style transfer network that can generate consistent results with reasonable style patterns and clear content structure. We adopt multi-channel correlation module to render the input images stably according to cross-domain feature correlation. Meanwhile, Earth Movers’ Distance is used to capture the main characteristics of style images. To maintain the semantic structure during the stylization, we also employ the AdaIN-based skip connections and self-similarity loss, which can further improve the temporal consistency. Qualitative and quantitative experiments have demonstrated the effectiveness of our framework.

@article{lin2023cbc,
title = {Correlation-based and content-enhanced network for video style transfer},
author = {Honglin Lin and Mengmeng Wang and Yong Liu and Jiaxin Kou},
year = 2023,
journal = {Pattern Analysis and Applications},
volume = 26,
pages = {343-355},
doi = {10.1007/s10044-022-01106-y},
abstract = {Artistic style transfer aims to migrate the style pattern from a referenced style image to a given content image, which has achieved significant advances in recent years. However, producing temporally coherent and visually pleasing stylized frames is still challenging. Although existing works have made some effort, they rely on the inefficient optical flow or other cumbersome operations to model spatiotemporal information. In this paper, we propose an arbitrary video style transfer network that can generate consistent results with reasonable style patterns and clear content structure. We adopt multi-channel correlation module to render the input images stably according to cross-domain feature correlation. Meanwhile, Earth Movers' Distance is used to capture the main characteristics of style images. To maintain the semantic structure during the stylization, we also employ the AdaIN-based skip connections and self-similarity loss, which can further improve the temporal consistency. Qualitative and quantitative experiments have demonstrated the effectiveness of our framework.}
}

Jianbiao Mei, Yu Yang, Mengmeng Wang, Zizhang Li, Xiaojun Hou, Jongwon Ra, Laijian Li, and Yong Liu. CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation. In 31st ACM International Conference on Multimedia (MM), pages 1884-1894, 2023.

This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embedding and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.

@inproceedings{mei2023lps,
title = {CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation},
author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Zizhang Li and Xiaojun Hou and Jongwon Ra and Laijian Li and Yong Liu},
year = 2023,
booktitle = {31st ACM International Conference on Multimedia (MM)},
pages = {1884-1894},
doi = {10.1145/3581783.3612080},
abstract = {This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embedding and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.}
}

Zizhang Li, Xiaoyang Lyu, Yuanyuan Ding, Mengmeng Wang, Yiyi Liao, and Yong Liu. RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction. In 19th IEEE/CVF International Conference on Computer Vision (ICCV), pages 17715-17725, 2023.

Recently, neural implicit surfaces have become popular for multi-view reconstruction. To facilitate practical applications like scene editing and manipulation, some works extend the framework with semantic masks input for the object-compositional reconstruction rather than the holistic perspective. Though achieving plausible disentanglement, the performance drops significantly when processing the indoor scenes where objects are usually partially observed. We propose RICO to address this by regularizing the unobservable regions for indoor compositional reconstruction. Our key idea is to first regularize the smoothness of the occluded background, which then in turn guides the foreground object reconstruction in unobservable regions based on the object-background relationship. Particularly, we regularize the geometry smoothness of occluded background patches. With the improved background surface, the signed distance function and the reversedly rendered depth of objects can be optimized to bound them within the background range. Extensive experiments show our method outperforms other methods on synthetic and real-world indoor scenes and prove the effectiveness of proposed regularizations. The code is available at https://github.com/kyleleey/RICO

@inproceedings{li2023rico,
title = {RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction},
author = {Zizhang Li and Xiaoyang Lyu and Yuanyuan Ding and Mengmeng Wang and Yiyi Liao and Yong Liu},
year = 2023,
booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
pages = {17715-17725},
doi = {10.1109/ICCV51070.2023.01628},
abstract = {Recently, neural implicit surfaces have become popular for multi-view reconstruction. To facilitate practical applications like scene editing and manipulation, some works extend the framework with semantic masks input for the object-compositional reconstruction rather than the holistic perspective. Though achieving plausible disentanglement, the performance drops significantly when processing the indoor scenes where objects are usually partially observed. We propose RICO to address this by regularizing the unobservable regions for indoor compositional reconstruction. Our key idea is to first regularize the smoothness of the occluded background, which then in turn guides the foreground object reconstruction in unobservable regions based on the object-background relationship. Particularly, we regularize the geometry smoothness of occluded background patches. With the improved background surface, the signed distance function and the reversedly rendered depth of objects can be optimized to bound them within the background range. Extensive experiments show our method outperforms other methods on synthetic and real-world indoor scenes and prove the effectiveness of proposed regularizations. The code is available at https://github.com/kyleleey/RICO}
}

Jiazheng Xing, Mengmeng Wang, Yudi Ruan, Bofan Chen, Yaowei Guo, Boyu Mu, Guang Dai, Jingdong Wang, and Yong Liu. Boosting Few-Shot Action Recognition with Graph-Guided Hybrid Matching. In 19th IEEE/CVF International Conference on Computer Vision (ICCV), pages 1740-1750, 2023.

Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM.

@inproceedings{xing2023bfs,
title = {Boosting Few-Shot Action Recognition with Graph-Guided Hybrid Matching},
author = {Jiazheng Xing and Mengmeng Wang and Yudi Ruan and Bofan Chen and Yaowei Guo and Boyu Mu and Guang Dai and Jingdong Wang and Yong Liu},
year = 2023,
booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
pages = {1740-1750},
doi = {10.1109/ICCV51070.2023.00167},
abstract = {Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM.}
}

Teli Ma, Mengmeng Wang, Jimin Xiao, Huifeng Wu, and Yong Liu. Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking. In 19th IEEE/CVF International Conference on Computer Vision (ICCV), pages 9919-9929, 2023.

Siamese network has been a de facto benchmark framework for 3D LiDAR object tracking with a shared-parametric encoder extracting features from template and search region, respectively. This paradigm relies heavily on an additional matching network to model the cross-correlation/similarity of the template and search region. In this paper, we forsake the conventional Siamese paradigm and propose a novel single-branch framework, SyncTrack, synchronizing the feature extracting and matching to avoid forwarding encoder twice for template and search region as well as introducing extra parameters of matching network. The synchronization mechanism is based on the dynamic affinity of the Transformer, and an in-depth analysis of the relevance is provided theoretically. Moreover, based on the synchronization, we introduce a novel Attentive PointsSampling strategy into the Transformer layers (APST), replacing the random/Farthest Points Sampling (FPS) method with sampling under the supervision of attentive relations between the template and search region. It implies connecting point-wise sampling with the feature learning, beneficial to aggregating more distinctive and geometric features for tracking with sparse points. Extensive experiments on two benchmark datasets (KITTI and NuScenes) show that SyncTrack achieves state-of-the-art performance in realtime tracking.

@inproceedings{ma2023sfe,
title = {Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking},
author = {Teli Ma and Mengmeng Wang and Jimin Xiao and Huifeng Wu and Yong Liu},
year = 2023,
booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
pages = {9919-9929},
doi = {10.1109/ICCV51070.2023.00913},
abstract = {Siamese network has been a de facto benchmark framework for 3D LiDAR object tracking with a shared-parametric encoder extracting features from template and search region, respectively. This paradigm relies heavily on an additional matching network to model the cross-correlation/similarity of the template and search region. In this paper, we forsake the conventional Siamese paradigm and propose a novel single-branch framework, SyncTrack, synchronizing the feature extracting and matching to avoid forwarding encoder twice for template and search region as well as introducing extra parameters of matching network. The synchronization mechanism is based on the dynamic affinity of the Transformer, and an in-depth analysis of the relevance is provided theoretically. Moreover, based on the synchronization, we introduce a novel Attentive Points-Sampling strategy into the Transformer layers (APST), replacing the random/Farthest Points Sampling (FPS) method with sampling under the supervision of attentive relations between the template and search region. It implies connecting point-wise sampling with the feature learning, beneficial to aggregating more distinctive and geometric features for tracking with sparse points. Extensive experiments on two benchmark datasets (KITTI and NuScenes) show that SyncTrack achieves state-of-the-art performance in real-time tracking.}
}

Jianbiao Mei, Yu Yang, Mengmeng Wang, Tianxin Huang, Xuemeng Yang, and Yong Liu. SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7718-7725, 2023.
[BibTeX] [Abstract] [DOI] [PDF]

Semantic scene completion (SSC) jointly predicts the semantics and geometry of the entire 3D scene, which plays an essential role in 3D scene understanding for autonomous driving systems. SSC has achieved rapid progress with the help of semantic context in segmentation. However, how to effectively exploit the relationships between the semantic context in semantic segmentation and geometric structure in scene completion remains under exploration. In this paper, we propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. Specifically, we present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. And a BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is presented to aggregate the multi-scale features effectively and efficiently. Due to the low computational burden and powerful representation ability, our model has good generality while running in real-time. Extensive experiments on SemanticKITTI demonstrate our SSC-RS achieves state-of-the-art performance. Code is available at https://github.com/Jieqianyu/SSC-RS.git.

@inproceedings{mei2023ssc,
title = {SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion},
author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Tianxin Huang and Xuemeng Yang and Yong Liu},
year = 2023,
booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
pages = {7718-7725},
doi = {10.1109/IROS55552.2023.10341742},
abstract = {Semantic scene completion (SSC) jointly predicts the semantics and geometry of the entire 3D scene, which plays an essential role in 3D scene understanding for autonomous driving systems. SSC has achieved rapid progress with the help of semantic context in segmentation. However, how to effectively exploit the relationships between the semantic context in semantic segmentation and geometric structure in scene completion remains under exploration. In this paper, we propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. Specifically, we present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. And a BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is presented to aggregate the multi-scale features effectively and efficiently. Due to the low computational burden and powerful representation ability, our model has good generality while running in real-time. Extensive experiments on SemanticKITTI demonstrate our SSC-RS achieves state-of-the-art performance. Code is available at https://github.com/Jieqianyu/SSC-RS.git.}
}

Jianbiao Mei, Yu Yang, Mengmeng Wang, Xiaojun Hou, Laijian Li, and Yong Liu. PANet: LiDAR Panoptic Segmentation with Sparse Instance Proposal and Aggregation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7726-7733, 2023.
[BibTeX] [Abstract] [DOI] [PDF]

Reliable LiDAR panoptic segmentation (LPS), including both semantic and instance segmentation, is vital for many robotic applications, such as autonomous driving. This work proposes a new LPS framework named PANet to eliminate the dependency on the offset branch and improve the performance on large objects, which are always over-segmented by clustering algorithms. Firstly, we propose a non-learning Sparse Instance Proposal (SIP) module with the “sampling-shifting-grouping” scheme to directly group thing points into instances from the raw point cloud efficiently. More specifically, balanced point sampling is introduced to generate sparse seed points with more uniform point distribution over the distance range. And a shift module, termed bubble shifting, is proposed to shrink the seed points to the clustered centers. Then we utilize the connected component label algorithm to generate instance proposals. Furthermore, an instance aggregation module is devised to integrate potentially fragmented instances, improving the performance of the SIP module on large objects. Extensive experiments show that PANet achieves state-of-the-art performance among published works on the SemanticKITTI validation and nuScenes validation for the panoptic segmentation task. Code is available at https://github.com/Jieqianyu/PANet.git.

@inproceedings{mei2023pan,
title = {PANet: LiDAR Panoptic Segmentation with Sparse Instance Proposal and Aggregation},
author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Xiaojun Hou and Laijian Li and Yong Liu},
year = 2023,
booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
pages = {7726-7733},
doi = {10.1109/IROS55552.2023.10342468},
abstract = {Reliable LiDAR panoptic segmentation (LPS), including both semantic and instance segmentation, is vital for many robotic applications, such as autonomous driving. This work proposes a new LPS framework named PANet to eliminate the dependency on the offset branch and improve the performance on large objects, which are always over-segmented by clustering algorithms. Firstly, we propose a non-learning Sparse Instance Proposal (SIP) module with the “sampling-shifting-grouping” scheme to directly group thing points into instances from the raw point cloud efficiently. More specifically, balanced point sampling is introduced to generate sparse seed points with more uniform point distribution over the distance range. And a shift module, termed bubble shifting, is proposed to shrink the seed points to the clustered centers. Then we utilize the connected component label algorithm to generate instance proposals. Furthermore, an instance aggregation module is devised to integrate potentially fragmented instances, improving the performance of the SIP module on large objects. Extensive experiments show that PANet achieves state-of-the-art performance among published works on the SemanticKITTI validation and nuScenes validation for the panoptic segmentation task. Code is available at https://github.com/Jieqianyu/PANet.git.}
}

Jiazheng Xing, Mengmeng Wang, Boyu Mu, and Yong Liu. Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition. In 37th AAAI Conference on Artificial Intelligence (AAAI), pages 3001-3009, 2023.
[BibTeX] [Abstract] [PDF]

Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets.

@inproceedings{xing2023rst,
title = {Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition},
author = {Jiazheng Xing and Mengmeng Wang and Boyu Mu and Yong Liu},
year = 2023,
booktitle = {37th AAAI Conference on Artificial Intelligence (AAAI)},
pages = {3001-3009},
abstract = {Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets.}
}

Chao Xu, Jiangning Zhang, Mengmeng Wang, Guanzhong Tian, and Yong Liu. Multi-level Spatial-temporal Feature Aggregation for Video Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7809-7820, 2022.
[BibTeX] [Abstract] [DOI] [PDF]

Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.

@article{xu2022mls,
title = {Multi-level Spatial-temporal Feature Aggregation for Video Object Detection},
author = {Chao Xu and Jiangning Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
year = 2022,
journal = {IEEE Transactions on Circuits and Systems for Video Technology},
volume = {32},
number = {11},
pages = {7809-7820},
doi = {10.1109/TCSVT.2022.3183646},
abstract = {Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.}
}

Mengmeng Wang, Jianbiao Mei, Lina Liu, and Yong Liu. Delving Deeper Into Mask Utilization in Video Object Segmentation. IEEE Transactions on Image Processing, 31:6255-6266, 2022.
[BibTeX] [Abstract] [DOI] [PDF]

This paper focuses on the mask utilization of video object segmentation (VOS). The mask here means the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.

@article{wang2022ddi,
title = {Delving Deeper Into Mask Utilization in Video Object Segmentation},
author = {Mengmeng Wang and Jianbiao Mei and Lina Liu and Yong Liu},
year = 2022,
journal = {IEEE Transactions on Image Processing},
volume = {31},
pages = {6255-6266},
doi = {10.1109/TIP.2022.3208409},
abstract = {This paper focuses on the mask utilization of video object segmentation (VOS). The mask here means the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.}
}

Yeneng Lin, Mengmeng Wang, Wenzhou Chen, Wang Gao, Lei Li, and Yong Liu. Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure. Remote Sensing, 14(16):3862, 2022.
[BibTeX] [Abstract] [DOI] [PDF]

The task of multi-object tracking via deep learning methods for UAV videos has become an important research direction. However, with some current multiple object tracking methods, the relationship between object detection and tracking is not well handled, and decisions on how to make good use of temporal information can affect tracking performance as well. To improve the performance of multi-object tracking, this paper proposes an improved multiple object tracking model based on FairMOT. The proposed model contains a structure to separate the detection and ReID heads to decrease the influence between every function head. Additionally, we develop a temporal embedding structure to strengthen the representational ability of the model. By combining the temporal-association structure and separating different function heads, the model’s performance in object detection and tracking tasks is improved, which has been verified on the VisDrone2019 dataset. Compared with the original method, the proposed model improves MOTA by 4.9% and MOTP by 1.2% and has better tracking performance than the models such as SORT and HDHNet on the UAV video dataset.

@article{lin2022mot,
title = {Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure},
author = {Yeneng Lin and Mengmeng Wang and Wenzhou Chen and Wang Gao and Lei Li and Yong Liu},
year = 2022,
journal = {Remote Sensing},
volume = {14},
number = {16},
pages = {3862},
doi = {10.3390/rs14163862},
abstract = {The task of multi-object tracking via deep learning methods for UAV videos has become an important research direction. However, with some current multiple object tracking methods, the relationship between object detection and tracking is not well handled, and decisions on how to make good use of temporal information can affect tracking performance as well. To improve the performance of multi-object tracking, this paper proposes an improved multiple object tracking model based on FairMOT. The proposed model contains a structure to separate the detection and ReID heads to decrease the influence between every function head. Additionally, we develop a temporal embedding structure to strengthen the representational ability of the model. By combining the temporal-association structure and separating different function heads, the model’s performance in object detection and tracking tasks is improved, which has been verified on the VisDrone2019 dataset. Compared with the original method, the proposed model improves MOTA by 4.9% and MOTP by 1.2% and has better tracking performance than the models such as SORT and HDHNet on the UAV video dataset.}
}

Chunfang Deng, Mengmeng Wang, Liang Liu, Yong Liu, and Yunliang Jiang. Extended feature pyramid network for small object detection. IEEE Transactions on Multimedia, 24:1968-1979, 2022.
[BibTeX] [Abstract] [DOI] [PDF]

Small object detection remains an unsolved challenge because it is hard to extract information of small objects with only a few pixels. While scale-level corresponding detection in feature pyramid network alleviates this problem, we find feature coupling of various scales still impairs the performance of small objects. In this paper, we propose an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which is used to super-resolve features and extract credible regional details simultaneously. Moreover, we introduce a cross resolution distillation mechanism to transfer the ability of perceiving details across the scales of the network, where a foreground-background-balanced loss function is designed to alleviate area imbalance of foreground and background. In our experiments, the proposed EFPN is efficient on both computation and memory, and yields state-of-the-art results on small traffic-sign dataset Tsinghua-Tencent 100K and small category of general object detection dataset MS COCO.

@article{deng2022efp,
title = {Extended feature pyramid network for small object detection},
author = {Chunfang Deng and Mengmeng Wang and Liang Liu and Yong Liu and Yunliang Jiang},
year = 2022,
journal = {IEEE Transactions on Multimedia},
volume = {24},
pages = {1968-1979},
doi = {10.1109/TMM.2021.3074273},
abstract = {Small object detection remains an unsolved challenge because it is hard to extract information of small objects with only a few pixels. While scale-level corresponding detection in feature pyramid network alleviates this problem, we find feature coupling of various scales still impairs the performance of small objects. In this paper, we propose an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which is used to super-resolve features and extract credible regional details simultaneously. Moreover, we introduce a cross resolution distillation mechanism to transfer the ability of perceiving details across the scales of the network, where a foreground-background-balanced loss function is designed to alleviate area imbalance of foreground and background. In our experiments, the proposed EFPN is efficient on both computation and memory, and yields state-of-the-art results on small traffic-sign dataset Tsinghua-Tencent 100K and small category of general object detection dataset MS COCO.}
}

Chao Xu, Xia Wu, Yachun Li, Yining Jin, Mengmeng Wang, and Yong Liu. Cross-modality online distillation for multi-view action recognition. Neurocomputing, 456:384-393, 2021.
[BibTeX] [Abstract] [DOI] [PDF]

Recently, some multi-modality features are introduced to the multi-view action recognition methods in order to obtain a more robust performance. However, it is intuitive that not all modalities are available in real applications, such as daily scenes that lack depth data and capture RGB sequences only. This raises the challenge of how to learn critical features from multi-modality data while relying on RGB sequences alone and still getting robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework: the teacher network takes advantage of multi-view geometry-and-texture features during training, while the student network is given only RGB sequences at test time. Specifically, in the first stage, a Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, we design a Viewpoint-Aware Attention (VAA) module which captures discriminative information across different views to effectively combine multi-view features. In the second stage, a Multi-view Features Strengthen (MFS) network, which also contains the VAA module, further strengthens the global view-invariance features of the student network. Besides, both CAT and MFS learn in an online distillation manner so that the teacher and the student network can be trained jointly. Extensive experiments on IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.

@article{xu2021cmo,
title = {Cross-modality online distillation for multi-view action recognition},
author = {Chao Xu and Xia Wu and Yachun Li and Yining Jin and Mengmeng Wang and Yong Liu},
year = 2021,
journal = {Neurocomputing},
volume = 456,
pages = {384-393},
doi = {10.1016/j.neucom.2021.05.077},
abstract = {Recently, some multi-modality features are introduced to the multi-view action recognition methods in order to obtain a more robust performance. However, it is intuitive that not all modalities are available in real applications, such as daily scenes that lack depth data and capture RGB sequences only. This raises the challenge of how to learn critical features from multi-modality data while relying on RGB sequences alone and still getting robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework: the teacher network takes advantage of multi-view geometry-and-texture features during training, while the student network is given only RGB sequences at test time. Specifically, in the first stage, a Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, we design a Viewpoint-Aware Attention (VAA) module which captures discriminative information across different views to effectively combine multi-view features. In the second stage, a Multi-view Features Strengthen (MFS) network, which also contains the VAA module, further strengthens the global view-invariance features of the student network. Besides, both CAT and MFS learn in an online distillation manner so that the teacher and the student network can be trained jointly. Extensive experiments on IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.}
}

Xiangfang Zeng, Yusu Pan, Hao Zhang, Mengmeng Wang, Guanzhong Tian, and Yong Liu. Unpaired Salient Object Translation via Spatial Attention Prior. Neurocomputing, 2021.
[BibTeX] [Abstract] [DOI] [PDF]

With only set-level constraints, unpaired image translation is challenging in discovering the correct semantic-level correspondences between two domains. This limitation often results in false positives such as significantly changing color and appearance of the background during image translation. To address this limitation, we propose the Spatial Attention-Aware Generative Adversarial Network (SAAGAN), a novel approach to jointly learn salient object discovery and translation. Specifically, our generator consists of (1) a spatial attention prediction branch and (2) an image translation branch. For the attention branch, we extract a spatial attention prior from a pre-trained classification network to provide weak supervision for object discovery. The proposed attention loss can largely stabilize the training process of the attention-guided generator. For the translation branch, we revise the classical adversarial loss for salient object translation. Such a discriminator only distinguishes the distribution of the object between two domains. What is more, we propose a fake sample augmentation strategy to provide extra spatial information for the discriminator. Our approach allows simultaneously locating the attention areas in each image and translating the related areas between two domains. Extensive experiments and evaluations show that our model can achieve more realistic mappings compared to state-of-the-art unpaired image translation methods.

@article{zeng2021unpairedso,
title = {Unpaired Salient Object Translation via Spatial Attention Prior},
author = {Xiangfang Zeng and Yusu Pan and Hao Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
year = 2021,
journal = {Neurocomputing},
doi = {10.1016/j.neucom.2020.05.105},
abstract = {With only set-level constraints, unpaired image translation is challenging in discovering the correct semantic-level correspondences between two domains. This limitation often results in false positives such as significantly changing color and appearance of the background during image translation. To address this limitation, we propose the Spatial Attention-Aware Generative Adversarial Network (SAAGAN), a novel approach to jointly learn salient object discovery and translation. Specifically, our generator consists of (1) a spatial attention prediction branch and (2) an image translation branch. For the attention branch, we extract a spatial attention prior from a pre-trained classification network to provide weak supervision for object discovery. The proposed attention loss can largely stabilize the training process of the attention-guided generator. For the translation branch, we revise the classical adversarial loss for salient object translation. Such a discriminator only distinguishes the distribution of the object between two domains. What is more, we propose a fake sample augmentation strategy to provide extra spatial information for the discriminator. Our approach allows simultaneously locating the attention areas in each image and translating the related areas between two domains. Extensive experiments and evaluations show that our model can achieve more realistic mappings compared to state-of-the-art unpaired image translation methods.}
}

Tianxin Huang, Hao Zou, Jinhao Cui, Xuemeng Yang, Mengmeng Wang, Xiangrui Zhao, Jiangning Zhang, Yi Yuan, Yifan Xu, and Yong Liu. RFNet: Recurrent Forward Network for Dense Point Cloud Completion. In 2021 International Conference on Computer Vision, pages 12488-12497, 2021.
[BibTeX] [Abstract] [DOI] [PDF]

Point cloud completion is an interesting and challenging task in 3D vision, aiming to recover complete shapes from sparse and incomplete point clouds. Existing learning based methods often require vast computation cost to achieve excellent performance, which limits their practical applications. In this paper, we propose a novel Recurrent Forward Network (RFNet), which is composed of three modules: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). The RFE extracts multiple global features from the incomplete point clouds for different recurrent levels, and the FDC generates point clouds in a coarse-to-fine pipeline. The RSP introduces details from the original incomplete models to refine the completion results. Besides, we propose a Sampling Chamfer Distance to better capture the shapes of models and a new Balanced Expansion Constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve the state-of-the-art with lower memory cost and faster convergence.

@inproceedings{huang2021rfnetrf,
title = {RFNet: Recurrent Forward Network for Dense Point Cloud Completion},
author = {Tianxin Huang and Hao Zou and Jinhao Cui and Xuemeng Yang and Mengmeng Wang and Xiangrui Zhao and Jiangning Zhang and Yi Yuan and Yifan Xu and Yong Liu},
year = 2021,
booktitle = {2021 International Conference on Computer Vision},
pages = {12488-12497},
doi = {https://doi.org/10.1109/ICCV48922.2021.01228},
abstract = {Point cloud completion is an interesting and challenging task in 3D vision, aiming to recover complete shapes from sparse and incomplete point clouds. Existing learning based methods often require vast computation cost to achieve excellent performance, which limits their practical applications. In this paper, we propose a novel Recurrent Forward Network (RFNet), which is composed of three modules: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). The RFE extracts multiple global features from the incomplete point clouds for different recurrent levels, and the FDC generates point clouds in a coarse-to-fine pipeline. The RSP introduces details from the original incomplete models to refine the completion results. Besides, we propose a Sampling Chamfer Distance to better capture the shapes of models and a new Balanced Expansion Constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve the state-of-the-art with lower memory cost and faster convergence.}
}

Lina Liu, Xibin Song, Mengmeng Wang, Yong Liu, and Liangjun Zhang. Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation. In 2021 International Conference on Computer Vision, pages 12717-12726, 2021.
[BibTeX] [Abstract] [DOI] [PDF]

Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach. Code and data split are available at https://github.com/LINA-lln/ADDS-DepthNet.

@inproceedings{liu2021selfsm,
title = {Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation},
author = {Lina Liu and Xibin Song and Mengmeng Wang and Yong Liu and Liangjun Zhang},
year = 2021,
booktitle = {2021 International Conference on Computer Vision},
pages = {12717-12726},
doi = {https://doi.org/10.1109/ICCV48922.2021.01250},
abstract = {Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach. Code and data split are available at https://github.com/LINA-lln/ADDS-DepthNet.}
}

Lina Liu, Xibin Song, Xiaoyang Lyu, Junwei Diao, Mengmeng Wang, Yong Liu, and Liangjun Zhang. FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
[BibTeX] [Abstract] [arXiv] [PDF]

Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate the depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates the depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage with coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from color image and coarse depth map, and an energy based fusion operation is exploited to effectively fuse these features obtained by channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on the KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.

@inproceedings{liu2020fcfrnetff,
title = {FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion},
author = {Lina Liu and Xibin Song and Xiaoyang Lyu and Junwei Diao and Mengmeng Wang and Yong Liu and Liangjun Zhang},
year = 2021,
booktitle = {Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)},
abstract = {Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate the depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates the depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage with coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from color image and coarse depth map, and an energy based fusion operation is exploited to effectively fuse these features obtained by channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on the KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.},
arxiv = {https://arxiv.org/pdf/2012.08270.pdf}
}

Xiaoyang Lyu, Liang Liu, Mengmeng Wang, Xin Kong, Lina Liu, Yong Liu, Xinxin Chen, and Yi Yuan. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
[BibTeX] [Abstract] [arXiv] [PDF]

Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation (fSE) module to fuse feature more efficiently. Using Resnet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high-resolution with only 20% parameters. All codes and models will be available at this https URL.

@inproceedings{lyu2020hrdepthhr,
title = {HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation},
author = {Xiaoyang Lyu and Liang Liu and Mengmeng Wang and Xin Kong and Lina Liu and Yong Liu and Xinxin Chen and Yi Yuan},
year = 2021,
booktitle = {Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)},
abstract = {Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation (fSE) module to fuse feature more efficiently. Using Resnet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high-resolution with only 20% parameters. All codes and models will be available at this https URL.},
arxiv = {https://arxiv.org/pdf/2012.07356.pdf}
}

Xin Kong, Xuemeng Yang, Guangyao Zhai, Xiangrui Zhao, Xianfang Zeng, Mengmeng Wang, Yong Liu, Wanlong Li, and Feng Wen. Semantic Graph Based Place Recognition for 3D Point Clouds. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8216–8223, 2020.
[BibTeX] [Abstract] [DOI] [arXiv] [PDF]

Due to the difficulty in generating the effective descriptors which are robust to occlusion and viewpoint changes, place recognition for 3D point cloud remains an open issue. Unlike most of the existing methods that focus on extracting local, global, and statistical features of raw point clouds, our method aims at the semantic level that can be superior in terms of robustness to environmental changes. Inspired by the perspective of humans, who recognize scenes through identifying semantic objects and capturing their relations, this paper presents a novel semantic graph based approach for place recognition. First, we propose a novel semantic graph representation for the point cloud scenes by reserving the semantic and topological information of the raw point cloud. Thus, place recognition is modeled as a graph matching problem. Then we design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to the occlusion as well as viewpoint changes and outperforms the state-of-the-art methods with a large margin. Our code is available at: https://github.com/kxhit/SG_PR.

@inproceedings{kong2020semanticgb,
title = {Semantic Graph Based Place Recognition for 3D Point Clouds},
author = {Xin Kong and Xuemeng Yang and Guangyao Zhai and Xiangrui Zhao and Xianfang Zeng and Mengmeng Wang and Yong Liu and Wanlong Li and Feng Wen},
year = 2020,
booktitle = {2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
pages = {8216--8223},
doi = {https://doi.org/10.1109/IROS45743.2020.9341060},
abstract = {Due to the difficulty in generating the effective descriptors which are robust to occlusion and viewpoint changes, place recognition for 3D point cloud remains an open issue. Unlike most of the existing methods that focus on extracting local, global, and statistical features of raw point clouds, our method aims at the semantic level that can be superior in terms of robustness to environmental changes. Inspired by the perspective of humans, who recognize scenes through identifying semantic objects and capturing their relations, this paper presents a novel semantic graph based approach for place recognition. First, we propose a novel semantic graph representation for the point cloud scenes by reserving the semantic and topological information of the raw point cloud. Thus, place recognition is modeled as a graph matching problem. Then we design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to the occlusion as well as viewpoint changes and outperforms the state-of-the-art methods with a large margin. Our code is available at: https://github.com/kxhit/SG_PR.},
arxiv = {https://arxiv.org/pdf/2008.11459.pdf}
}

Xianfang Zeng, Yusu Pan, Mengmeng Wang, Jiangning Zhang, and Yong Liu. Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.
[BibTeX] [Abstract] [DOI] [arXiv] [PDF]

Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmark or boundary. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact face naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in the conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on VoxCeleb1 and RaFD dataset. Experiment results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.

@inproceedings{zeng2020realisticfr,
title = {Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose},
author = {Xianfang Zeng and Yusu Pan and Mengmeng Wang and Jiangning Zhang and Yong Liu},
year = 2020,
booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
doi = {https://doi.org/10.1609/AAAI.V34I07.6970},
abstract = {Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmark or boundary. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact face naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in the conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on VoxCeleb1 and RaFD dataset. Experiment results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.},
arxiv = {https://arxiv.org/pdf/2003.12957.pdf}
}

Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, and Yunliang Jiang. DTVNet: Dynamic Time-lapse Video Generation via Single Still Image. In ECCV, pages 300–315, 2020.
[BibTeX] [Abstract] [DOI] [arXiv] [PDF]

This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.

@inproceedings{zhang2020dtvnet,
title = {DTVNet: Dynamic Time-lapse Video Generation via Single Still Image},
author = {Zhang, Jiangning and Xu, Chao and Liu, Liang and Wang, Mengmeng and Wu, Xia and Liu, Yong and Jiang, Yunliang},
year = 2020,
booktitle = {{ECCV}},
pages = {300--315},
doi = {https://doi.org/10.1007/978-3-030-58558-7_18},
abstract = {This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.},
arxiv = {https://arxiv.org/abs/2008.04776}
}

Hao Zhang, Mengmeng Wang, Yong Liu, and Yi Yuan. FDN: Feature Decoupling Network for Head Pose Estimation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.
[BibTeX] [Abstract] [DOI] [PDF]

Head pose estimation from RGB images without depth information is a challenging task due to the loss of spatial information as well as large head pose variations in the wild. The performance of existing landmark-free methods remains unsatisfactory as the quality of estimated pose is inferior. In this paper, we propose a novel three-branch network architecture, termed as Feature Decoupling Network (FDN), a more powerful architecture for landmark-free head pose estimation from a single RGB image. In FDN, we first propose a feature decoupling (FD) module to explicitly learn the discriminative features for each pose angle by adaptively recalibrating its channel-wise responses. Besides, we introduce a cross-category center (CCC) loss to constrain the distribution of the latent variable subspaces and thus we can obtain more compact and distinct subspaces. Extensive experiments on both in-the-wild and controlled environment datasets demonstrate that the proposed method outperforms other state-of-the-art methods based on a single RGB image and behaves on par with approaches based on multimodal input resources.

@inproceedings{zhang2020fdnfd,
title = {FDN: Feature Decoupling Network for Head Pose Estimation},
author = {Hao Zhang and Mengmeng Wang and Yong Liu and Yi Yuan},
year = 2020,
booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
doi = {https://doi.org/10.1609/AAAI.V34I07.6974},
abstract = {Head pose estimation from RGB images without depth information is a challenging task due to the loss of spatial information as well as large head pose variations in the wild. The performance of existing landmark-free methods remains unsatisfactory as the quality of estimated pose is inferior. In this paper, we propose a novel three-branch network architecture, termed as Feature Decoupling Network (FDN), a more powerful architecture for landmark-free head pose estimation from a single RGB image. In FDN, we first propose a feature decoupling (FD) module to explicitly learn the discriminative features for each pose angle by adaptively recalibrating its channel-wise responses. Besides, we introduce a cross-category center (CCC) loss to constrain the distribution of the latent variable subspaces and thus we can obtain more compact and distinct subspaces. Extensive experiments on both in-the-wild and controlled environment datasets demonstrate that the proposed method outperforms other state-of-the-art methods based on a single RGB image and behaves on par with approaches based on multimodal input resources.}
}

Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, and Changjie Fan. FReeNet: Multi-Identity Face Reenactment. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5325–5334, 2020.
[BibTeX] [Abstract] [DOI] [arXiv] [PDF]

This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.

@inproceedings{zhang2020freenetmf,
title = {FReeNet: Multi-Identity Face Reenactment},
author = {Jiangning Zhang and Xianfang Zeng and Mengmeng Wang and Yusu Pan and Liang Liu and Yong Liu and Yu Ding and Changjie Fan},
year = 2020,
booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {5325--5334},
doi = {https://doi.org/10.1109/cvpr42600.2020.00537},
abstract = {This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.},
arxiv = {http://arxiv.org/pdf/1905.11805}
}

Mengmeng Wang, Yong Liu, Daobilige Su, Yufan Liao, Lei Shi, Jinhong Xu, and Jaime Valls Miro. Accurate and Real-Time 3-D Tracking for the Following Robots by Fusing Vision and Ultrasonar Information. IEEE/ASME Transactions on Mechatronics, 23:997–1006, 2018.
[BibTeX] [Abstract] [DOI] [PDF]

Acquiring the accurate three-dimensional (3-D) position of a target person around a robot provides valuable information that is applicable to a wide range of robotic tasks, especially for promoting the intelligent manufacturing processes of industries. This paper presents a real-time robotic 3-D human tracking system that combines a monocular camera with an ultrasonic sensor by an extended Kalman filter (EKF). The proposed system consists of three submodules: a monocular camera sensor tracking module, an ultrasonic sensor tracking module, and the multisensor fusion algorithm. An improved visual tracking algorithm is presented to provide 2-D partial location estimation. The algorithm is designed to overcome severe occlusions, scale variation, target missing, and achieve robust redetection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot, and time of flight is used for the 2-D partial location estimation. EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor, and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both a simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the persuasive performance of the 3-D tracking system in terms of both the accuracy and robustness.

@article{wang2018accuratear,
title = {Accurate and Real-Time 3-D Tracking for the Following Robots by Fusing Vision and Ultrasonar Information},
author = {Mengmeng Wang and Yong Liu and Daobilige Su and Yufan Liao and Lei Shi and Jinhong Xu and Jaime Valls Miro},
year = 2018,
journal = {IEEE/ASME Transactions on Mechatronics},
volume = 23,
pages = {997--1006},
doi = {https://doi.org/10.1109/TMECH.2018.2820172},
abstract = {Acquiring the accurate three-dimensional (3-D) position of a target person around a robot provides valuable information that is applicable to a wide range of robotic tasks, especially for promoting the intelligent manufacturing processes of industries. This paper presents a real-time robotic 3-D human tracking system that combines a monocular camera with an ultrasonic sensor by an extended Kalman filter (EKF). The proposed system consists of three submodules: a monocular camera sensor tracking module, an ultrasonic sensor tracking module, and the multisensor fusion algorithm. An improved visual tracking algorithm is presented to provide 2-D partial location estimation. The algorithm is designed to overcome severe occlusions, scale variation, target missing, and achieve robust redetection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot, and time of flight is used for the 2-D partial location estimation. EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor, and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both a simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the persuasive performance of the 3-D tracking system in terms of both the accuracy and robustness.}
}

Mengmeng Wang, Yong Liu, and Zeyi Huang. Large Margin Object Tracking with Circulant Feature Maps. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4800–4808, 2017.

Structured output support vector machine (SVM) based tracking algorithms have shown favorable performance recently. Nonetheless, the time-consuming candidate sampling and complex optimization limit their real-time applications. In this paper, we first propose a novel large margin object tracking method that absorbs the strong discriminative ability of the structured output SVM and is significantly accelerated by the correlation filter algorithm. Second, a multimodal target detection technique is proposed to improve the target localization precision and prevent model drift introduced by similar objects or background noise. Third, we exploit the feedback from high-confidence tracking results to avoid the model corruption problem. We implement two versions of the proposed tracker, one with conventional hand-crafted features and one with deep convolutional neural network (CNN) based features, to validate the strong compatibility of the algorithm. The experimental results demonstrate that the proposed tracker outperforms several state-of-the-art algorithms on the challenging benchmark sequences while running at a speed in excess of 80 frames per second.

@inproceedings{wang2017largemo,
title = {Large Margin Object Tracking with Circulant Feature Maps},
author = {Mengmeng Wang and Yong Liu and Zeyi Huang},
year = 2017,
booktitle = {2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {4800--4808},
doi = {https://doi.org/10.1109/CVPR.2017.510},
arxiv = {http://arxiv.org/pdf/1703.05020},
abstract = {Structured output support vector machine (SVM) based tracking algorithms have shown favorable performance recently. Nonetheless, the time-consuming candidate sampling and complex optimization limit their real-time applications. In this paper, we first propose a novel large margin object tracking method that absorbs the strong discriminative ability of the structured output SVM and is significantly accelerated by the correlation filter algorithm. Second, a multimodal target detection technique is proposed to improve the target localization precision and prevent model drift introduced by similar objects or background noise. Third, we exploit the feedback from high-confidence tracking results to avoid the model corruption problem. We implement two versions of the proposed tracker, one with conventional hand-crafted features and one with deep convolutional neural network (CNN) based features, to validate the strong compatibility of the algorithm. The experimental results demonstrate that the proposed tracker outperforms several state-of-the-art algorithms on the challenging benchmark sequences while running at a speed in excess of 80 frames per second.}
}

Mengmeng Wang, Daobilige Su, Lei Shi, Yong Liu, and Jaime Valls Miro. Real-time 3D human tracking for mobile robots with multisensors. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5081–5087, 2017.

Acquiring the accurate 3-D position of a target person around a robot provides fundamental and valuable information that is applicable to a wide range of robotic tasks, including home service, navigation and entertainment. This paper presents a real-time robotic 3-D human tracking system that combines a monocular camera with an ultrasonic sensor through an extended Kalman filter (EKF). The proposed system consists of three sub-modules: a monocular camera sensor tracking model, an ultrasonic sensor tracking model, and multi-sensor fusion. An improved visual tracking algorithm is presented to provide partial location estimation (2-D). The algorithm is designed to handle severe occlusions, scale variation, and target loss, and to achieve robust re-detection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot, and Gaussian Process Regression is used for partial location estimation (2-D). The EKF is adopted to sequentially process multiple, heterogeneous measurements arriving asynchronously and separately from the vision sensor and the ultrasonic sensor. In the experiments, the proposed tracking system is tested on both a simulation platform and an actual mobile robot in various indoor and outdoor scenes. The experimental results show the superior performance of the 3-D tracking system in terms of both accuracy and robustness.

@inproceedings{wang2017realtime3h,
title = {Real-time 3D human tracking for mobile robots with multisensors},
author = {Mengmeng Wang and Daobilige Su and Lei Shi and Yong Liu and Jaime Valls Miro},
year = 2017,
booktitle = {2017 IEEE International Conference on Robotics and Automation (ICRA)},
pages = {5081--5087},
doi = {https://doi.org/10.1109/ICRA.2017.7989593},
abstract = {Acquiring the accurate 3-D position of a target person around a robot provides fundamental and valuable information that is applicable to a wide range of robotic tasks, including home service, navigation and entertainment. This paper presents a real-time robotic 3-D human tracking system that combines a monocular camera with an ultrasonic sensor through an extended Kalman filter (EKF). The proposed system consists of three sub-modules: a monocular camera sensor tracking model, an ultrasonic sensor tracking model, and multi-sensor fusion. An improved visual tracking algorithm is presented to provide partial location estimation (2-D). The algorithm is designed to handle severe occlusions, scale variation, and target loss, and to achieve robust re-detection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot, and Gaussian Process Regression is used for partial location estimation (2-D). The EKF is adopted to sequentially process multiple, heterogeneous measurements arriving asynchronously and separately from the vision sensor and the ultrasonic sensor. In the experiments, the proposed tracking system is tested on both a simulation platform and an actual mobile robot in various indoor and outdoor scenes. The experimental results show the superior performance of the 3-D tracking system in terms of both accuracy and robustness.}
}

Mengmeng Wang, Yong Liu, and Rong Xiong. Robust object tracking with a hierarchical ensemble framework. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 438–445, 2016.

Autonomous robots are widely popular nowadays and have been applied in many areas, such as home security, entertainment, delivery, navigation and guidance. It is vital for robots to track objects accurately in real time in these applications, so it is necessary to focus on tracking algorithms to improve robustness, speed and accuracy. In this paper, we propose a real-time robust object tracking algorithm based on a hierarchical ensemble framework that incorporates information including individual pixel features, local patches and holistic target models. The framework combines multiple ensemble models simultaneously instead of using a single ensemble model individually. A discriminative model that accounts for the matching degree of local patches is adopted via a bottom ensemble layer, and a generative model that exploits holistic templates is used to search for the object based on the middle ensemble layer as well as an adaptive Kalman filter. We test the proposed tracker on challenging benchmark image sequences. The experimental results demonstrate that the proposed tracker outperforms several state-of-the-art algorithms, especially when the appearance changes dramatically and occlusions occur.

@inproceedings{wang2016robustot,
title = {Robust object tracking with a hierarchical ensemble framework},
author = {Mengmeng Wang and Yong Liu and Rong Xiong},
year = 2016,
booktitle = {2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
pages = {438--445},
doi = {https://doi.org/10.1109/IROS.2016.7759091},
arxiv = {http://arxiv.org/pdf/1509.06925},
abstract = {Autonomous robots are widely popular nowadays and have been applied in many areas, such as home security, entertainment, delivery, navigation and guidance. It is vital for robots to track objects accurately in real time in these applications, so it is necessary to focus on tracking algorithms to improve robustness, speed and accuracy. In this paper, we propose a real-time robust object tracking algorithm based on a hierarchical ensemble framework that incorporates information including individual pixel features, local patches and holistic target models. The framework combines multiple ensemble models simultaneously instead of using a single ensemble model individually. A discriminative model that accounts for the matching degree of local patches is adopted via a bottom ensemble layer, and a generative model that exploits holistic templates is used to search for the object based on the middle ensemble layer as well as an adaptive Kalman filter. We test the proposed tracker on challenging benchmark image sequences. The experimental results demonstrate that the proposed tracker outperforms several state-of-the-art algorithms, especially when the appearance changes dramatically and occlusions occur.}
}

Latest Events

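APRIL Lab wins the ATEC 2025 Tech Elite Competition championship, achieving a major breakthrough for embodied intelligence in real-world scenarios

Good news! APRIL Lab master's student Dianyong Hou (侯典泳) receives the IROS 2025 "Best Paper Award Nomination in Mobile Manipulation"

Good news! APRIL Lab wins the Best Autonomous Navigation Award at the IROS 2025 Quadruped Robot Challenge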