Address

Room 101, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: mengmengwang@zju.edu.cn

Mengmeng Wang

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I have attended Zhejiang University as a Phd student in Sep 2020, supervised by Yong Liu. My research area includes object tracking, object detection, action recognition, depth estimation, deep fake, pose estimation and so on.

My works have been published on top computer vision conferences (CVPR, ICCV, ECCV, AAAI etc) and top robotic conferences (ICRA, IROS).

Research and Interests

  • Object Tracking: Single Object Tracking (SOT), Multiple Objects Tracking (MOT)
  • Action Recognition
  • Depth Estimation

Publications

  • Jun Chen, Shipeng Bai, Tianxin Huang, Mengmeng Wang, Guanzhong Tian, and Yong Liu. Data-Free Quantization via Mixed-Precision Compensation without Fine-Tuning. Pattern Recognition, 143:109780, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Neural network quantization is a very promising solution in the field of model compression, but its resulting accuracy highly depends on a training/fine-tuning process and requires the original data. This not only brings heavy computation and time costs but also is not conducive to privacy and sensitive information protection. Therefore, a few recent works are starting to focus on data-free quantization. However, data free quantization does not perform well while dealing with ultra-low precision quantization. Although researchers utilize generative methods of synthetic data to address this problem partially, data synthesis needs to take a lot of computation and time. In this paper, we propose a data-free mixed-precision compensation (DF-MPC) method to recover the performance of an ultra-low precision quantized model without any data and fine-tuning process. By assuming the quantized error caused by a low-precision quantized layer can be restored via the reconstruction of a high-precision quantized layer, we mathematically formulate the reconstruction loss between the pre-trained full-precision model and its layer-wise mixed-precision quantized model. Based on our formulation, we theoretically deduce the closed-form solution by minimizing the reconstruction loss of the feature maps. Since DF-MPC does not require any original/synthetic data, it is a more efficient method to approximate the full-precision model. Experimentally, our DF-MPC is able to achieve higher accuracy for an ultra-low precision quantized model compared to the recent methods without any data and fine-tuning process.
    @article{chen2023dfq,
    title = {Data-Free Quantization via Mixed-Precision Compensation without Fine-Tuning},
    author = {Jun Chen and Shipeng Bai and Tianxin Huang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2023,
    journal = {Pattern Recognition},
    volume = 143,
    pages = {109780},
    doi = {10.1016/j.patcog.2023.109780},
    abstract = {Neural network quantization is a very promising solution in the field of model compression, but its resulting accuracy highly depends on a training/fine-tuning process and requires the original data. This not only brings heavy computation and time costs but also is not conducive to privacy and sensitive information protection. Therefore, a few recent works are starting to focus on data-free quantization. However, data free quantization does not perform well while dealing with ultra-low precision quantization. Although researchers utilize generative methods of synthetic data to address this problem partially, data synthesis needs to take a lot of computation and time. In this paper, we propose a data-free mixed-precision compensation (DF-MPC) method to recover the performance of an ultra-low precision quantized model without any data and fine-tuning process. By assuming the quantized error caused by a low-precision quantized layer can be restored via the reconstruction of a high-precision quantized layer, we mathematically formulate the reconstruction loss between the pre-trained full-precision model and its layer-wise mixed-precision quantized model. Based on our formulation, we theoretically deduce the closed-form solution by minimizing the reconstruction loss of the feature maps. Since DF-MPC does not require any original/synthetic data, it is a more efficient method to approximate the full-precision model. Experimentally, our DF-MPC is able to achieve higher accuracy for an ultra-low precision quantized model compared to the recent methods without any data and fine-tuning process.}
    }
  • Yufei Liang, Mengmeng Wang, Yining Jin, Shuwen Pan, and Yong Liu. Hierarchical Supervisions with Two-Stream Network for Deepfake Detection. Pattern Recognition Letters, 172:121-127, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recently, the quality of face generation and manipulation has reached impressive levels, making it diffi-cult even for humans to distinguish real and fake faces. At the same time, methods to distinguish fake faces from reals came out, such as Deepfake detection. However, the task of Deepfake detection remains challenging, especially the low-quality fake images circulating on the Internet and the diversity of face generation methods. In this work, we propose a new Deepfake detection network that could effectively distinguish both high-quality and low-quality faces generated by various generation methods. First, we design a two-stream framework that incorporates a regular spatial stream and a frequency stream to handle the low-quality problem since we find that the frequency domain artifacts of low-quality images will be preserved. Second, we introduce hierarchical supervisions in a coarse-to-fine manner, which con-sists of a coarse binary classification branch to classify reals and fakes and a five-category classification branch to classify reals and four different types of fakes. Extensive experiments have proved the effec-tiveness of our framework on several widely used datasets.
    @article{liang2023hs,
    title = {Hierarchical Supervisions with Two-Stream Network for Deepfake Detection},
    author = {Yufei Liang and Mengmeng Wang and Yining Jin and Shuwen Pan and Yong Liu},
    year = 2023,
    journal = {Pattern Recognition Letters},
    volume = 172,
    pages = {121-127},
    doi = {10.1016/j.patrec.2023.05.029},
    abstract = {Recently, the quality of face generation and manipulation has reached impressive levels, making it diffi-cult even for humans to distinguish real and fake faces. At the same time, methods to distinguish fake faces from reals came out, such as Deepfake detection. However, the task of Deepfake detection remains challenging, especially the low-quality fake images circulating on the Internet and the diversity of face generation methods. In this work, we propose a new Deepfake detection network that could effectively distinguish both high-quality and low-quality faces generated by various generation methods. First, we design a two-stream framework that incorporates a regular spatial stream and a frequency stream to handle the low-quality problem since we find that the frequency domain artifacts of low-quality images will be preserved. Second, we introduce hierarchical supervisions in a coarse-to-fine manner, which con-sists of a coarse binary classification branch to classify reals and fakes and a five-category classification branch to classify reals and four different types of fakes. Extensive experiments have proved the effec-tiveness of our framework on several widely used datasets.}
    }
  • Jiajun Lv, Xiaolei Lang, Jinhong Xu, Mengmeng Wang, Yong Liu, and Xingxing Zuo. Continuous-Time Fixed-Lag Smoothing for LiDAR-Inertial-Camera SLAM. IEEE/ASME Transactions on Mechatronics, 28:2259-2270, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Localization and mapping with heterogeneous multi-sensor fusion have been prevalent in recent years. To adequately fuse multi-modal sensor measurements received at different time instants and different frequencies, we estimate the continuous-time trajectory by fixed-lag smoothing within a factor-graph optimization framework. With the continuous-time formulation, we can query poses at any time instants corresponding to the sensor measurements. To bound the computation complexity of the continuous-time fixed-lag smoother, we maintain temporal and keyframe sliding windows with constant size, and probabilistically marginalize out control points of the trajectory and other states, which allows preserving prior information for future sliding-window optimization. Based on continuous-time fixed-lag smoothing, we design tightly-coupled multi-modal SLAM algorithms with a variety of sensor combinations, like the LiDAR-inertial and LiDAR-inertial-camera SLAM systems, in which online timeoffset calibration is also naturally supported. More importantly, benefiting from the marginalization and our derived analytical Jacobians for optimization, the proposed continuous-time SLAM systems can achieve real-time performance regardless of the high complexity of continuous-time formulation. The proposed multi-modal SLAM systems have been widely evaluated on three public datasets and self-collect datasets. The results demonstrate that the proposed continuous-time SLAM systems can achieve high-accuracy pose estimations and outperform existing state-of-the-art methods. To benefit the research community, we will open source our code at {https://github.com/APRIL-ZJU/clic}.
    @article{lv2023ctfl,
    title = {Continuous-Time Fixed-Lag Smoothing for LiDAR-Inertial-Camera SLAM},
    author = {Jiajun Lv and Xiaolei Lang and Jinhong Xu and Mengmeng Wang and Yong Liu and Xingxing Zuo},
    year = 2023,
    journal = {IEEE/ASME Transactions on Mechatronics},
    volume = 28,
    pages = {2259-2270},
    doi = {10.1109/TMECH.2023.3241398},
    abstract = {Localization and mapping with heterogeneous multi-sensor fusion have been prevalent in recent years. To adequately fuse multi-modal sensor measurements received at different time instants and different frequencies, we estimate the continuous-time trajectory by fixed-lag smoothing within a factor-graph optimization framework. With the continuous-time formulation, we can query poses at any time instants corresponding to the sensor measurements. To bound the computation complexity of the continuous-time fixed-lag smoother, we maintain temporal and keyframe sliding windows with constant size, and probabilistically marginalize out control points of the trajectory and other states, which allows preserving prior information for future sliding-window optimization. Based on continuous-time fixed-lag smoothing, we design tightly-coupled multi-modal SLAM algorithms with a variety of sensor combinations, like the LiDAR-inertial and LiDAR-inertial-camera SLAM systems, in which online timeoffset calibration is also naturally supported. More importantly, benefiting from the marginalization and our derived analytical Jacobians for optimization, the proposed continuous-time SLAM systems can achieve real-time performance regardless of the high complexity of continuous-time formulation. The proposed multi-modal SLAM systems have been widely evaluated on three public datasets and self-collect datasets. The results demonstrate that the proposed continuous-time SLAM systems can achieve high-accuracy pose estimations and outperform existing state-of-the-art methods. To benefit the research community, we will open source our code at {https://github.com/APRIL-ZJU/clic}.}
    }
  • Chao Xu, Xia Wu, Mengmeng Wang, Feng Qiu, Yong Liu, and Jun Ren. Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture. Neurocomputing, 523:58-68, 2023.
    [BibTeX] [Abstract] [DOI]
    Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for a man to interact naturally with a machine. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems since: (1) It is challenging to locate the temporal boundaries of performing gestures; (2) There are significant differences in performing gestures among different people, resulting in a variety of gestures; (3) There must be a trade-off between the accuracy and the computational consumption. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in time series, then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.
    @article{xv2022idg,
    title = {Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture},
    author = {Chao Xu and Xia Wu and Mengmeng Wang and Feng Qiu and Yong Liu and Jun Ren},
    year = 2023,
    journal = {Neurocomputing},
    volume = 523,
    pages = {58-68},
    doi = {10.1016/j.neucom.2022.12.022},
    abstract = {Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for a man to interact naturally with a machine. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems since: (1) It is challenging to locate the temporal boundaries of performing gestures; (2) There are significant differences in performing gestures among different people, resulting in a variety of gestures; (3) There must be a trade-off between the accuracy and the computational consumption. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in time series, then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.}
    }
  • Mengmeng Wang, Jiazheng Xing, Jing Su, Jun Chen, and Yong Liu. Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:3347-3362, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recent methods for action recognition always apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flows to present motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both the two kinds of features in a unified 2D CNN without any 3D convolution or optical flows calculation. In particular, we first design a channel-wise spatiotemporal module to present the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Secondly, we combine these two modules and an identity mapping path into one united block that can easily replaces the original residual block in the ResNet architecture, forming a simple yet effective network termed STM network by introducing very limited extra computation cost and parameters. Thirdly, we propose a novel Twins Training framework for action recognition by incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to fully stretch the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 & v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods in all the datasets.
    @article{wang2022lsm,
    title = {Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition},
    author = {Mengmeng Wang and Jiazheng Xing and Jing Su and Jun Chen and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume = 45,
    pages = {3347-3362},
    doi = {10.1109/TPAMI.2022.3173658},
    abstract = {Recent methods for action recognition always apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flows to present motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both the two kinds of features in a unified 2D CNN without any 3D convolution or optical flows calculation. In particular, we first design a channel-wise spatiotemporal module to present the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Secondly, we combine these two modules and an identity mapping path into one united block that can easily replaces the original residual block in the ResNet architecture, forming a simple yet effective network termed STM network by introducing very limited extra computation cost and parameters. Thirdly, we propose a novel Twins Training framework for action recognition by incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to fully stretch the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 \& v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods in all the datasets.}
    }
  • Jiazheng Xing, Mengmeng Wang, Boyu Mu, and Yong Liu. Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition. In 37th AAAI Conference on Artificial Intelligence (AAAI), 2023.
    [BibTeX] [Abstract] [DOI]
    Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets.
    @inproceedings{xing2023rst,
    title = {Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition},
    author = {Jiazheng Xing and Mengmeng Wang and Boyu Mu and Yong Liu},
    year = 2023,
    booktitle = {37th AAAI Conference on Artificial Intelligence (AAAI)},
    doi = {10.48550/arXiv.2301.07944},
    abstract = {Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based
    on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on
    the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets.}
    }
  • Honglin Lin, Mengmeng Wang, Yong Liu, and Jiaxin Kou. Correlation-based and content-enhanced network for video style transfer. Pattern Analysis and Applications, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Artistic style transfer aims to migrate the style pattern from a referenced style image to a given content image, which has achieved significant advances in recent years. However, producing temporally coherent and visually pleasing stylized frames is still challenging. Although existing works have made some effort, they rely on the inefficient optical flow or other cumbersome operations to model spatiotemporal information. In this paper, we propose an arbitrary video style transfer network that can generate consistent results with reasonable style patterns and clear content structure. We adopt multi-channel correlation module to render the input images stably according to cross-domain feature correlation. Meanwhile, Earth Movers’ Distance is used to capture the main characteristics of style images. To maintain the semantic structure during the stylization, we also employ the AdaIN-based skip connections and self-similarity loss, which can further improve the temporal consistency. Qualitative and quantitative experiments have demonstrated the effectiveness of our framework.
    @article{lin2022cbc,
    title = {Correlation-based and content-enhanced network for video style transfer},
    author = {Honglin Lin and Mengmeng Wang and Yong Liu and Jiaxin Kou},
    year = 2022,
    journal = {Pattern Analysis and Applications},
    doi = {10.1007/s10044-022-01106-y},
    abstract = {Artistic style transfer aims to migrate the style pattern from a referenced style image to a given content image, which has achieved significant advances in recent years. However, producing temporally coherent and visually pleasing stylized frames is still challenging. Although existing works have made some effort, they rely on the inefficient optical flow or other cumbersome operations to model spatiotemporal information. In this paper, we propose an arbitrary video style transfer network that can generate consistent results with reasonable style patterns and clear content structure. We adopt multi-channel correlation module to render the input images stably according to cross-domain feature correlation. Meanwhile, Earth Movers' Distance is used to capture the main characteristics of style images. To maintain the semantic structure during the stylization, we also employ the AdaIN-based skip connections and self-similarity loss, which can further improve the temporal consistency. Qualitative and quantitative experiments have demonstrated the effectiveness of our framework.}
    }
  • Yu Yang, Mengmeng Wang, Jianbiao Mei, and Yong Liu. Exploiting Semantic-level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos. Applied Intelligence, 2022.
    [BibTeX] [Abstract] [DOI]
    Temporal action proposal (TAP) aims to detect the action instances’ starting and ending times in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of the temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wider receptive field, while proposal and global methods lack the focalization of learning action frames and contain background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and nonaction clips (backgrounds) can learn the discriminations between them and action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) by exploiting the foreground mask to guide the self-attention mechanism to focus on and calculate semantic affinities with the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.
    @article{yang2022esl,
    title = {Exploiting Semantic-level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos},
    author = {Yu Yang and Mengmeng Wang and Jianbiao Mei and Yong Liu},
    year = 2022,
    journal = {Applied Intelligence},
    doi = {10.1007/s10489-022-04261-1},
    abstract = {Temporal action proposal (TAP) aims to detect the action instances' starting and ending times in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of the temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wider receptive field, while proposal and global methods lack the focalization of learning action frames and contain background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and nonaction clips (backgrounds) can learn the discriminations between them and action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) by exploiting the foreground mask to guide the self-attention mechanism to focus on and calculate semantic affinities with the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.}
    }
  • Xingxing Zuo, Mingming Zhang, Mengmeng Wang, Yiming Chen, Guoquan Huang, Yong Liu, and Mingyang Li. Visual-Based Kinematics and Pose Estimation for Skid-Steering Robots. IEEE Transactions on Automation Science and Engineering, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    To build commercial robots, skid-steering mechanical design is of increased popularity due to its manufacturing simplicity and unique mechanism. However, these also cause significant challenges on software and algorithm design, especially for the pose estimation (i.e., determining the robot’s rotation and position) of skid-steering robots, since they change their orientation with an inevitable skid. To tackle this problem, we propose a probabilistic sliding-window estimator dedicated to skid-steering robots, using measurements from a monocular camera, the wheel encoders, and optionally an inertial measurement unit (IMU). Specifically, we explicitly model the kinematics of skid-steering robots by both track instantaneous centers of rotation (ICRs) and correction factors, which are capable of compensating for the complexity of track-to-terrain interaction, the imperfectness of mechanical design, terrain conditions and smoothness, etc. To prevent performance reduction in robots’ long-term missions, the time- and location- varying kinematic parameters are estimated online along with pose estimation states in a tightly-coupled manner. More importantly, we conduct indepth observability analysis for different sensors and design configurations in this paper, which provides us with theoretical tools in making the correct choice when building real commercial robots. In our experiments, we validate the proposed method by both simulation tests and real-world experiments, which demonstrate that our method outperforms competing methods by wide margins.
    @article{zuo2022vbk,
    title = {Visual-Based Kinematics and Pose Estimation for Skid-Steering Robots},
    author = {Xingxing Zuo and Mingming Zhang and Mengmeng Wang and Yiming Chen and Guoquan Huang and Yong Liu and Mingyang Li},
    year = 2022,
    journal = {IEEE Transactions on Automation Science and Engineering},
    doi = {10.1109/TASE.2022.3214984},
    abstract = {To build commercial robots, skid-steering mechanical design is of increased popularity due to its manufacturing simplicity and unique mechanism. However, these also cause significant challenges on software and algorithm design, especially for the pose estimation (i.e., determining the robot’s rotation and position) of skid-steering robots, since they change their orientation with an inevitable skid. To tackle this problem, we propose a probabilistic sliding-window estimator dedicated to skid-steering robots, using measurements from a monocular camera, the wheel encoders, and optionally an inertial measurement unit (IMU). Specifically, we explicitly model the kinematics of skid-steering robots by both track instantaneous centers of rotation (ICRs) and correction factors, which are capable of compensating for the complexity of track-to-terrain interaction, the imperfectness of mechanical design, terrain conditions and smoothness, etc. To prevent performance reduction in robots’ long-term missions, the time- and location- varying kinematic parameters are estimated online along with pose estimation states in a tightly-coupled manner. More importantly, we conduct indepth observability analysis for different sensors and design configurations in this paper, which provides us with theoretical tools in making the correct choice when building real commercial robots. In our experiments, we validate the proposed method by both simulation tests and real-world experiments, which demonstrate that our method outperforms competing methods by wide margins.}
    }
  • Chao Xu, Jiangning Zhang, Mengmeng Wang, Guanzhong Tian, and Yong Liu. Multi-level Spatial-temporal Feature Aggregation for Video Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7809-7820, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.
    @article{xu2022mls,
    title = {Multi-level Spatial-temporal Feature Aggregation for Video Object Detection},
    author = {Chao Xu and Jiangning Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Circuits and Systems for Video Technology},
    volume = {32},
    number = {11},
    pages = {7809-7820},
    doi = {10.1109/TCSVT.2022.3183646},
    abstract = {Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.}
    }
  • Mengmeng Wang, Jianbiao Mei, Lina Liu, and Yong Liu. Delving Deeper Into Mask Utilization in Video Object Segmentation. IEEE Transactions on Image Processing, 31:6255-6266, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    This paper focuses on the mask utilization of video object segmentation (VOS). The mask here mains the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.
    @article{wang2022ddi,
    title = {Delving Deeper Into Mask Utilization in Video Object Segmentation},
    author = {Mengmeng Wang and Jianbiao Mei and Lina Liu and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Image Processing},
    volume = {31},
    pages = {6255-6266},
    doi = {10.1109/TIP.2022.3208409},
    abstract = {This paper focuses on the mask utilization of video object segmentation (VOS). The mask here mains the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.}
    }
  • Yeneng Lin, Mengmeng Wang, Wenzhou Chen, Wang Gao, Lei Li, and Yong Liu. Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure. Remote Sensing, 14(16):3862, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    The task of multi-object tracking via deep learning methods for UAV videos has become an important research direction. However, with some current multiple object tracking methods, the relationship between object detection and tracking is not well handled, and decisions on how to make good use of temporal information can affect tracking performance as well. To improve the performance of multi-object tracking, this paper proposes an improved multiple object tracking model based on FairMOT. The proposed model contains a structure to separate the detection and ReID heads to decrease the influence between every function head. Additionally, we develop a temporal embedding structure to strengthen the representational ability of the model. By combing the temporal-association structure and separating different function heads, the model’s performance in object detection and tracking tasks is improved, which has been verified on the VisDrone2019 dataset. Compared with the original method, the proposed model improves MOTA by 4.9% and MOTP by 1.2% and has better tracking performance than the models such as SORT and HDHNet on the UAV video dataset.
    @article{lin2022mot,
    title = {Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure},
    author = {Yeneng Lin and Mengmeng Wang and Wenzhou Chen and Wang Gao and Lei Li and Yong Liu},
    year = 2022,
    journal = {Remote Sensing},
    volume = {14},
    number = {16},
    pages = {3862},
    doi = {10.3390/rs14163862},
    abstract = {The task of multi-object tracking via deep learning methods for UAV videos has become an important research direction. However, with some current multiple object tracking methods,
    the relationship between object detection and tracking is not well handled, and decisions on how to make good use of temporal information can affect tracking performance as well. To improve the performance of multi-object tracking, this paper proposes an improved multiple object tracking model based on FairMOT. The proposed model contains a structure to separate the detection and ReID heads to decrease the influence between every function head. Additionally, we develop a temporal embedding structure to strengthen the representational ability of the model. By combing the temporal-association structure and separating different function heads, the model’s performance in object detection and tracking tasks is improved, which has been verified on the VisDrone2019 dataset. Compared with the original method, the proposed model improves MOTA by 4.9% and MOTP by 1.2% and has better tracking performance than the models such as SORT and HDHNet on the UAV video dataset.}
    }
  • Chunfang Deng, Mengmeng Wang, Liang Liu, Yong Liu, and Yunliang Jiang. Extended feature pyramid network for small object detection. IEEE Transactions on Multimedia, 24:1968-1979, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Small object detection remains an unsolved challenge because it is hard to extract information of small objects with only a few pixels. While scale-level corresponding detection in feature pyramid network alleviates this problem, we find feature coupling of various scales still impairs the performance of small objects. In this paper, we propose an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which is used to super-resolve features and extract credible regional details simultaneously. Moreover, we introduce a cross resolution distillation mechanism to transfer the ability of perceiving details across the scales of the network, where a foreground-background-balanced loss function is designed to alleviate area imbalance of foreground and background. In our experiments, the proposed EFPN is efficient on both computation and memory, and yields state-of-the-art results on small traffic-sign dataset Tsinghua-Tencent 100K and small category of general object detection dataset MS COCO.
    @article{deng2022efp,
    title = {Extended feature pyramid network for small object detection},
    author = {Chunfang Deng and Mengmeng Wang and Liang Liu and Yong Liu and Yunliang Jiang},
    year = 2022,
    journal = {IEEE Transactions on Multimedia},
    volume = {24},
    pages = {1968-1979},
    doi = {10.1109/TMM.2021.3074273},
    abstract = {Small object detection remains an unsolved challenge because it is hard to extract information of small objects with only a few pixels. While scale-level corresponding detection in feature pyramid network alleviates this problem, we find feature coupling of various scales still impairs the performance of small objects. In this paper, we propose an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which is used to super-resolve features and extract credible regional details simultaneously. Moreover, we introduce a cross resolution distillation mechanism to transfer the ability of perceiving details across the scales of the network, where a foreground-background-balanced loss function is designed to alleviate area imbalance of foreground and background. In our experiments, the proposed EFPN is efficient on both computation and memory, and yields state-of-the-art results on small traffic-sign dataset Tsinghua-Tencent 100K and small category of general object detection dataset MS COCO.}
    }
  • Guanzhong Tian, Yiran Sun, Yuang Liu, Xianfang Zeng, Mengmeng Wang, Yong Liu, Jiangning Zhang, and Jun Chen. Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention. IEEE Transactions on Neural Networks and Learning Systems, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Filter pruning is a significant feature selection technique to shrink the existing feature fusion schemes (especially on convolution calculation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somehow tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline suffer from being sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue through one stage: using an attention-based architecture that adaptively fuses the filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) to make the model focus on the filters of higher significance by training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate the gradients in the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on the complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on the two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.
    @article{tian2021abp,
    title = {Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention},
    author = {Guanzhong Tian and Yiran Sun and Yuang Liu and Xianfang Zeng and Mengmeng Wang and Yong Liu and Jiangning Zhang and Jun Chen},
    year = 2021,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2021.3106917},
    abstract = {Filter pruning is a significant feature selection technique to shrink the existing feature fusion schemes (especially on convolution calculation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somehow tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline suffer from being sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue through one stage: using an attention-based architecture that adaptively fuses the filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) to make the model focus on the filters of higher significance by training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate the gradients in the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on the complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on the two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.}
    }
  • Chao Xu, Xia Wu, Yachun Li, Yining Jin, Mengmeng Wang, and Yong Liu. Cross-modality online distillation for multi-view action recognition. Neurocomputing, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recently, some multi-modality features are introduced to the multi-view action recognition methods in order to obtain a more robust performance. However, it is intuitive that not all modalities are available in real applications, such as daily scenes that missing depth modal data and capture RGB sequences only. This raises the challenge of how to learn critical features from multi-modality data, while relying on RGB sequences and still get robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework, the teacher network takes advantage of multi-view geometry-andtexture features during training, while a student network only given RGB sequences at test time. Specifically, in the first stage, Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, We design a Viewpoint-Aware Attention (VAA) module which captures discriminative information across different views to e_ectively combine multi-view features. In the second stage, Multi-view Features Strengthen (MFS) network that contains VAA module as well further strengthen the global view-invariance features of the student network. Besides, both of CAT and MFS learn in an online distillation manner so that the teacher and the student network can be trained jointly. Extensive experiments at IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.
    @article{xu2021cmo,
    title = {Cross-modality online distillation for multi-view action recognition},
    author = {Chao Xu and Xia Wu and Yachun Li and Yining Jin and Mengmeng Wang and Yong Liu},
    year = 2021,
    journal = {Neurocomputing},
    doi = {10.1016/j.neucom.2021.05.077},
    abstract = {Recently, some multi-modality features are introduced to the multi-view action recognition methods in order to obtain a more robust performance. However, it is intuitive that not all modalities are available in real applications, such as daily scenes that missing depth modal data and capture RGB sequences only. This raises the challenge of how to learn critical features from multi-modality data, while relying on RGB sequences and still get robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework, the teacher network takes advantage of multi-view geometry-andtexture features during training, while a student network only given RGB sequences at test time. Specifically, in the first stage, Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, We design a Viewpoint-Aware Attention (VAA) module which captures discriminative information across different views to e_ectively combine multi-view features. In the second stage, Multi-view Features Strengthen (MFS) network that contains VAA module as well further strengthen the global view-invariance features of the student network. Besides, both of CAT and MFS learn in an online distillation manner so that the teacher and the student network can be trained jointly. Extensive experiments at IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.}
    }
  • Xiangfang Zeng, Yusu Pan, Hao Zhang, Mengmeng Wang, Guanzhong Tian, and Yong Liu. Unpaired Salient Object Translation via Spatial Attention Prior. Neurocomputing, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    With only set-level constraints, unpaired image translation is challenging in discovering the correct semantic-level correspondences between two domains. This limitation often results in false positives such as significantly changing color and appearance of the background during image translation. To address this limitation, we propose the Spatial Attention-Aware Generative Adversarial Network (SAAGAN), a novel approach to jointly learn salient object discovery and translation. Specifically, our generator consists of (1) spatial attention prediction branch and (2) image translation branch. For attention branch, we extract spatial attention prior from a pre-trained classification network to provide weak supervision for object discovery. The proposed attention loss can largely stabilize the training process of attention-guided generator. For translation branch, we revise classical adversarial loss for salient object translation. Such a discriminator only distinguish the distribution of the object between two domains. What is more, we propose a fake sample augmentation strategy to provide extra spatial information for discriminator. Our approach allows simultaneously locating the attention areas in each image and translating the related areas between two domains. Extensive experiments and evaluations show that our model can achieve more realistic mappings compared to state-of-the-art unpaired image translation methods.
    @article{zeng2021unpairedso,
    title = {Unpaired Salient Object Translation via Spatial Attention Prior},
    author = {Xiangfang Zeng and Yusu Pan and Hao Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2021,
    journal = {Neurocomputing},
    doi = {10.1016/j.neucom.2020.05.105},
    abstract = {With only set-level constraints, unpaired image translation is challenging in discovering the correct semantic-level correspondences between two domains. This limitation often results in false positives such as significantly changing color and appearance of the background during image translation. To address this limitation, we propose the Spatial Attention-Aware Generative Adversarial Network (SAAGAN), a novel approach to jointly learn salient object discovery and translation. Specifically, our generator consists of (1) spatial attention prediction branch and (2) image translation branch. For attention branch, we extract spatial attention prior from a pre-trained classification network to provide weak supervision for object discovery. The proposed attention loss can largely stabilize the training process of attention-guided generator. For translation branch, we revise classical adversarial loss for salient object translation. Such a discriminator only distinguish the distribution of the object between two domains. What is more, we propose a fake sample augmentation strategy to provide extra spatial information for discriminator. Our approach allows simultaneously locating the attention areas in each image and translating the related areas between two domains. Extensive experiments and evaluations show that our model can achieve more realistic mappings compared to state-of-the-art unpaired image translation methods.}
    }
  • Tianxin Huang, Hao Zou, Jinhao Cui, Xuemeng Yang, Mengmeng Wang, Xiangrui Zhao, Jiangning Zhang and Yi Yuan, Yifan Xu, and Yong Liu. RFNet: Recurrent Forward Network for Dense Point Cloud Completion. In 2021 International Conference on Computer Vision, pages 12488-12497, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Point cloud completion is an interesting and challenging task in 3D vision, aiming to recover complete shapes from sparse and incomplete point clouds. Existing learning based methods often require vast computation cost to achieve excellent performance, which limits their practical applications. In this paper, we propose a novel Recurrent Forward Network (RFNet), which is composed of three modules: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). The RFE extracts multiple global features from the incomplete point clouds for different recurrent levels, and the FDC generates point clouds in a coarse-to-fine pipeline. The RSP introduces details from the original incomplete models to refine the completion results. Besides, we propose a Sampling Chamfer Distance to better capture the shapes of models and a new Balanced Expansion Constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve the state-of-the-art with lower memory cost and faster convergence.
    @inproceedings{huang2021rfnetrf,
    title = {RFNet: Recurrent Forward Network for Dense Point Cloud Completion},
    author = {Tianxin Huang and Hao Zou and Jinhao Cui and Xuemeng Yang and Mengmeng Wang and Xiangrui Zhao and Jiangning Zhang and Yi Yuan and Yifan Xu and Yong Liu},
    year = 2021,
    booktitle = {2021 International Conference on Computer Vision},
    pages = {12488-12497},
    doi = {https://doi.org/10.1109/ICCV48922.2021.01228},
    abstract = {Point cloud completion is an interesting and challenging task in 3D vision, aiming to recover complete shapes from sparse and incomplete point clouds. Existing learning based methods often require vast computation cost to achieve excellent performance, which limits their practical applications. In this paper, we propose a novel Recurrent Forward Network (RFNet), which is composed of three modules: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). The RFE extracts multiple global features from the incomplete point clouds for different recurrent levels, and the FDC generates point clouds in a coarse-to-fine pipeline. The RSP introduces details from the original incomplete models to refine the completion results. Besides, we propose a Sampling Chamfer Distance to better capture the shapes of models and a new Balanced Expansion Constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve the state-of-the-art with lower memory cost and faster convergence.}
    }
  • Lina Liu, Xibin Song, Mengmeng Wang, Yong Liu, and Liangjun Zhang. Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation. In 2021 International Conference on Computer Vision, pages 12717-12726, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach. Code and data split are available at https://github.com/LINA-lln/ADDS-DepthNet.
    @inproceedings{liu2021selfsm,
    title = {Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation},
    author = {Lina Liu and Xibin Song and Mengmeng Wang and Yong Liu and Liangjun Zhang},
    year = 2021,
    booktitle = {2021 International Conference on Computer Vision},
    pages = {12717-12726},
    doi = {https://doi.org/10.1109/ICCV48922.2021.01250},
    abstract = {Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach. Code and data split are available at https://github.com/LINA-lln/ADDS-DepthNet.}
    }
  • Lina Liu, Xibin Song, Xiaoyang Lyu, Junwei Diao, Mengmeng Wang, Yong Liu, and Liangjun Zhang. FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
    [BibTeX] [Abstract] [arXiv] [PDF]
    Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate the depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates the depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage with coarse depth map and color image as input. Specially, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from color image and coarse depth map, and an energy based fusion operation is exploited to effectively fuse these features obtained by channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on KITTI benchmark. Extensive experiments on other datasets future demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.
    @inproceedings{liu2020fcfrnetff,
    title = {FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion},
    author = {Lina Liu and Xibin Song and Xiaoyang Lyu and Junwei Diao and Mengmeng Wang and Yong Liu and Liangjun Zhang},
    year = 2021,
    booktitle = {Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)},
    abstract = {Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate the depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates the depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage with coarse depth map and color image as input. Specially, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from color image and coarse depth map, and an energy based fusion operation is exploited to effectively fuse these features obtained by channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on KITTI benchmark. Extensive experiments on other datasets future demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.},
    arxiv = {https://arxiv.org/pdf/2012.08270.pdf}
    }
  • Xiaoyang Lyu, Liang Liu, Mengmeng Wang, Xin Kong, Lina Liu, Yong Liu, Xinxin Chen, and Yi Yuan. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
    [BibTeX] [Abstract] [arXiv] [PDF]
    Self-supervised learning shows great potential in monoculardepth estimation, using image sequences as the only source ofsupervision. Although people try to use the high-resolutionimage for depth estimation, the accuracy of prediction hasnot been significantly improved. In this work, we find thecore reason comes from the inaccurate depth estimation inlarge gradient regions, making the bilinear interpolation er-ror gradually disappear as the resolution increases. To obtainmore accurate depth estimation in large gradient regions, itis necessary to obtain high-resolution features with spatialand semantic information. Therefore, we present an improvedDepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation(fSE) module to fuse feature more efficiently.Using Resnet-18 as the encoder, HR-Depth surpasses all pre-vious state-of-the-art(SoTA) methods with the least param-eters at both high and low resolution. Moreover, previousstate-of-the-art methods are based on fairly complex and deepnetworks with a mass of parameters which limits their realapplications. Thus we also construct a lightweight networkwhich uses MobileNetV3 as encoder. Experiments show thatthe lightweight network can perform on par with many largemodels like Monodepth2 at high-resolution with only20%parameters. All codes and models will be available at this https URL.
    @inproceedings{lyu2020hrdepthhr,
    title = {HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation},
    author = {Xiaoyang Lyu and Liang Liu and Mengmeng Wang and Xin Kong and Lina Liu and Yong Liu and Xinxin Chen and Yi Yuan},
    year = 2021,
    booktitle = {Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)},
    abstract = {Self-supervised learning shows great potential in monoculardepth estimation, using image sequences as the only source ofsupervision. Although people try to use the high-resolutionimage for depth estimation, the accuracy of prediction hasnot been significantly improved. In this work, we find thecore reason comes from the inaccurate depth estimation inlarge gradient regions, making the bilinear interpolation er-ror gradually disappear as the resolution increases. To obtainmore accurate depth estimation in large gradient regions, itis necessary to obtain high-resolution features with spatialand semantic information. Therefore, we present an improvedDepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation(fSE) module to fuse feature more efficiently.Using Resnet-18 as the encoder, HR-Depth surpasses all pre-vious state-of-the-art(SoTA) methods with the least param-eters at both high and low resolution. Moreover, previousstate-of-the-art methods are based on fairly complex and deepnetworks with a mass of parameters which limits their realapplications. Thus we also construct a lightweight networkwhich uses MobileNetV3 as encoder. Experiments show thatthe lightweight network can perform on par with many largemodels like Monodepth2 at high-resolution with only20%parameters. All codes and models will be available at this https URL.},
    arxiv = {https://arxiv.org/pdf/2012.07356.pdf}
    }
  • Xin Kong, Xuemeng Yang, Guangyao Zhai, Xiangrui Zhao, Xianfang Zeng, Mengmeng Wang, Yong Liu, Wanlong Li, and Feng Wen. Semantic Graph Based Place Recognition for 3D Point Clouds. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), page 8216–8223, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    Due to the difficulty in generating the effective descriptors which are robust to occlusion and viewpoint changes, place recognition for 3D point cloud remains an open issue. Unlike most of the existing methods that focus on extracting local, global, and statistical features of raw point clouds, our method aims at the semantic level that can be superior in terms of robustness to environmental changes. Inspired by the perspective of humans, who recognize scenes through identifying semantic objects and capturing their relations, this paper presents a novel semantic graph based approach for place recognition. First, we propose a novel semantic graph representation for the point cloud scenes by reserving the semantic and topological information of the raw point cloud. Thus, place recognition is modeled as a graph matching problem. Then we design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to the occlusion as well as viewpoint changes and outperforms the state-of-the-art methods with a large margin. Our code is available at: https://github.com/kxhit/SG_PR.
    @inproceedings{kong2020semanticgb,
    title = {Semantic Graph Based Place Recognition for 3D Point Clouds},
    author = {Xin Kong and Xuemeng Yang and Guangyao Zhai and Xiangrui Zhao and Xianfang Zeng and Mengmeng Wang and Yong Liu and Wanlong Li and Feng Wen},
    year = 2020,
    booktitle = {2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {8216--8223},
    doi = {https://doi.org/10.1109/IROS45743.2020.9341060},
    abstract = {Due to the difficulty in generating the effective descriptors which are robust to occlusion and viewpoint changes, place recognition for 3D point cloud remains an open issue. Unlike most of the existing methods that focus on extracting local, global, and statistical features of raw point clouds, our method aims at the semantic level that can be superior in terms of robustness to environmental changes. Inspired by the perspective of humans, who recognize scenes through identifying semantic objects and capturing their relations, this paper presents a novel semantic graph based approach for place recognition. First, we propose a novel semantic graph representation for the point cloud scenes by reserving the semantic and topological information of the raw point cloud. Thus, place recognition is modeled as a graph matching problem. Then we design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to the occlusion as well as viewpoint changes and outperforms the state-of-the-art methods with a large margin. Our code is available at: https://github.com/kxhit/SG_PR.},
    arxiv = {https://arxiv.org/pdf/2008.11459.pdf}
    }
  • Xianfang Zeng, Yusu Pan, Mengmeng Wang, Jiangning Zhang, and Yong Liu. Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmark or boundary. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact face naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in the conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on VoxCeleb1 and RaFD dataset. Experiment results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.
    @inproceedings{zeng2020realisticfr,
    title = {Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose},
    author = {Xianfang Zeng and Yusu Pan and Mengmeng Wang and Jiangning Zhang and Yong Liu},
    year = 2020,
    booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
    doi = {https://doi.org/10.1609/AAAI.V34I07.6970},
    abstract = {Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmark or boundary. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact face naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in the conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on VoxCeleb1 and RaFD dataset. Experiment results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.},
    arxiv = {https://arxiv.org/pdf/2003.12957.pdf}
    }
  • Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, and Yunliang Jiang. Dtvnet: Dynamic time-lapse video generation via single still image. In ECCV, page 300–315, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.
    @inproceedings{zhang2020dtvnet,
    title = {Dtvnet: Dynamic time-lapse video generation via single still image},
    author = {Zhang, Jiangning and Xu, Chao and Liu, Liang and Wang, Mengmeng and Wu, Xia and Liu, Yong and Jiang, Yunliang},
    year = 2020,
    booktitle = {{ECCV}},
    pages = {300--315},
    doi = {https://doi.org/10.1007/978-3-030-58558-7_18},
    abstract = {This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.},
    arxiv = {https://arxiv.org/abs/2008.04776}
    }
  • Hao Zhang, Mengmeng Wang, Yong Liu, and Yi Yuan. FDN: Feature Decoupling Network for Head Pose Estimation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.
    [BibTeX] [Abstract] [DOI] [PDF]
    Head pose estimation from RGB images without depth information is a challenging task due to the loss of spatial information as well as large head pose variations in the wild. The performance of existing landmark-free methods remains unsatisfactory as the quality of estimated pose is inferior. In this paper, we propose a novel three-branch network architecture, termed as Feature Decoupling Network (FDN), a more powerful architecture for landmark-free head pose estimation from a single RGB image. In FDN, we first propose a feature decoupling (FD) module to explicitly learn the discriminative features for each pose angle by adaptively recalibrating its channel-wise responses. Besides, we introduce a cross-category center (CCC) loss to constrain the distribution of the latent variable subspaces and thus we can obtain more compact and distinct subspaces. Extensive experiments on both in-the-wild and controlled environment datasets demonstrate that the proposed method outperforms other state-of-the-art methods based on a single RGB image and behaves on par with approaches based on multimodal input resources.
    @inproceedings{zhang2020fdnfd,
    title = {FDN: Feature Decoupling Network for Head Pose Estimation},
    author = {Hao Zhang and Mengmeng Wang and Yong Liu and Yi Yuan},
    year = 2020,
    booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
    doi = {https://doi.org/10.1609/AAAI.V34I07.6974},
    abstract = {Head pose estimation from RGB images without depth information is a challenging task due to the loss of spatial information as well as large head pose variations in the wild. The performance of existing landmark-free methods remains unsatisfactory as the quality of estimated pose is inferior. In this paper, we propose a novel three-branch network architecture, termed as Feature Decoupling Network (FDN), a more powerful architecture for landmark-free head pose estimation from a single RGB image. In FDN, we first propose a feature decoupling (FD) module to explicitly learn the discriminative features for each pose angle by adaptively recalibrating its channel-wise responses. Besides, we introduce a cross-category center (CCC) loss to constrain the distribution of the latent variable subspaces and thus we can obtain more compact and distinct subspaces. Extensive experiments on both in-the-wild and controlled environment datasets demonstrate that the proposed method outperforms other state-of-the-art methods based on a single RGB image and behaves on par with approaches based on multimodal input resources.}
    }
  • Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, and Changjie Fan. FReeNet: Multi-Identity Face Reenactment. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 5325–5334, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encode-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.
    @inproceedings{zhang2020freenetmf,
    title = {FReeNet: Multi-Identity Face Reenactment},
    author = {Jiangning Zhang and Xianfang Zeng and Mengmeng Wang and Yusu Pan and Liang Liu and Yong Liu and Yu Ding and Changjie Fan},
    year = 2020,
    booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {5325--5334},
    doi = {https://doi.org/10.1109/cvpr42600.2020.00537},
    abstract = {This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encode-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.},
    arxiv = {http://arxiv.org/pdf/1905.11805}
    }
  • Mengmeng Wang, Yong Liu, Daobilige Su, Yufan Liao, Lei Shi, Jinhong Xu, and Jaime Valls Miro. Accurate and Real-Time 3-D Tracking for the Following Robots by Fusing Vision and Ultrasonar Information. IEEE/ASME Transactions on Mechatronics, 23:997–1006, 2018.
    [BibTeX] [Abstract] [DOI] [PDF]
    Acquiring the accurate three-dimensional (3-D) position of a target person around a robot provides valuable information that is applicable to a wide range of robotic tasks, especially for promoting the intelligent manufacturing processes of industries. This paper presents a real-time robotic 3-D human tracking system that combines a monocular camera with an ultrasonic sensor by an extended Kalman filter (EKF). The proposed system consists of three submodules: a monocular camera sensor tracking module, an ultrasonic sensor tracking module, and the multisensor fusion algorithm. An improved visual tracking algorithm is presented to provide 2-D partial location estimation. The algorithm is designed to overcome severe occlusions, scale variation, target missing, and achieve robust redetection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot, and time of flight is used for the 2-D partial location estimation. EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor, and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both a simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the persuasive performance of the 3-D tracking system in terms of both the accuracy and robustness.
    @article{wang2018accuratear,
    title = {Accurate and Real-Time 3-D Tracking for the Following Robots by Fusing Vision and Ultrasonar Information},
    author = {Mengmeng Wang and Yong Liu and Daobilige Su and Yufan Liao and Lei Shi and Jinhong Xu and Jaime Valls Miro},
    year = 2018,
    journal = {IEEE/ASME Transactions on Mechatronics},
    volume = 23,
    pages = {997--1006},
    doi = {https://doi.org/10.1109/TMECH.2018.2820172},
    abstract = {Acquiring the accurate three-dimensional (3-D) position of a target person around a robot provides valuable information that is applicable to a wide range of robotic tasks, especially for promoting the intelligent manufacturing processes of industries. This paper presents a real-time robotic 3-D human tracking system that combines a monocular camera with an ultrasonic sensor by an extended Kalman filter (EKF). The proposed system consists of three submodules: a monocular camera sensor tracking module, an ultrasonic sensor tracking module, and the multisensor fusion algorithm. An improved visual tracking algorithm is presented to provide 2-D partial location estimation. The algorithm is designed to overcome severe occlusions, scale variation, target missing, and achieve robust redetection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot, and time of flight is used for the 2-D partial location estimation. EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor, and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both a simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the persuasive performance of the 3-D tracking system in terms of both the accuracy and robustness.}
    }
  • Mengmeng Wang, Yong Liu, and Zeyi Huang. Large Margin Object Tracking with Circulant Feature Maps. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page 4800–4808, 2017.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    Structured output support vector machine (SVM) based tracking algorithms have shown favorable performance recently. Nonetheless, the time-consuming candidate sampling and complex optimization limit their real-time applications. In this paper, we propose a novel large margin object tracking method which absorbs the strong discriminative ability from structured output SVM and speeds up by the correlation filter algorithm significantly. Secondly, a multimodal target detection technique is proposed to improve the target localization precision and prevent model drift introduced by similar objects or background noise. Thirdly, we exploit the feedback from high-confidence tracking results to avoid the model corruption problem. We implement two versions of the proposed tracker with the representations from both conventional hand-crafted and deep convolution neural networks (CNNs) based features to validate the strong compatibility of the algorithm. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms on the challenging benchmark sequences while runs at speed in excess of 80 frames per second.
    @inproceedings{wang2017largemo,
    title = {Large Margin Object Tracking with Circulant Feature Maps},
    author = {Mengmeng Wang and Yong Liu and Zeyi Huang},
    year = 2017,
    booktitle = {2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {4800--4808},
    doi = {https://doi.org/10.1109/CVPR.2017.510},
    arxiv = {http://arxiv.org/pdf/1703.05020},
    abstract = {Structured output support vector machine (SVM) based tracking algorithms have shown favorable performance recently. Nonetheless, the time-consuming candidate sampling and complex optimization limit their real-time applications. In this paper, we propose a novel large margin object tracking method which absorbs the strong discriminative ability from structured output SVM and speeds up by the correlation filter algorithm significantly. Secondly, a multimodal target detection technique is proposed to improve the target localization precision and prevent model drift introduced by similar objects or background noise. Thirdly, we exploit the feedback from high-confidence tracking results to avoid the model corruption problem. We implement two versions of the proposed tracker with the representations from both conventional hand-crafted and deep convolution neural networks (CNNs) based features to validate the strong compatibility of the algorithm. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms on the challenging benchmark sequences while runs at speed in excess of 80 frames per second.}
    }
  • Mengmeng Wang, Daobilige Su, Lei Shi, Yong Liu, and Jaime Valls Miro. Real-time 3D human tracking for mobile robots with multisensors. In 2017 IEEE International Conference on Robotics and Automation (ICRA), page 5081–5087, 2017.
    [BibTeX] [Abstract] [DOI] [PDF]
    Acquiring the accurate 3-D position of a target person around a robot provides fundamental and valuable information that is applicable to a wide range of robotic tasks, including home service, navigation and entertainment. This paper presents a real-time robotic 3-D human tracking system which combines a monocular camera with an ultrasonic sensor by the extended Kalman filter (EKF). The proposed system consists of three sub-modules: monocular camera sensor tracking model, ultrasonic sensor tracking model and multi-sensor fusion. An improved visual tracking algorithm is presented to provide partial location estimation (2-D). The algorithm is designed to overcome severe occlusions, scale variation, target missing and achieve robust re-detection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot and Gaussian Process Regression is used for partial location estimation (2-D). EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the superior performance of the 3-D tracking system in terms of both the accuracy and robustness.
    @inproceedings{wang2017realtime3h,
    title = {Real-time 3D human tracking for mobile robots with multisensors},
    author = {Mengmeng Wang and Daobilige Su and Lei Shi and Yong Liu and Jaime Valls Miro},
    year = 2017,
    booktitle = {2017 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {5081--5087},
    doi = {https://doi.org/10.1109/ICRA.2017.7989593},
    abstract = {Acquiring the accurate 3-D position of a target person around a robot provides fundamental and valuable information that is applicable to a wide range of robotic tasks, including home service, navigation and entertainment. This paper presents a real-time robotic 3-D human tracking system which combines a monocular camera with an ultrasonic sensor by the extended Kalman filter (EKF). The proposed system consists of three sub-modules: monocular camera sensor tracking model, ultrasonic sensor tracking model and multi-sensor fusion. An improved visual tracking algorithm is presented to provide partial location estimation (2-D). The algorithm is designed to overcome severe occlusions, scale variation, target missing and achieve robust re-detection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot and Gaussian Process Regression is used for partial location estimation (2-D). EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the superior performance of the 3-D tracking system in terms of both the accuracy and robustness.}
    }
  • Mengmeng Wang, Yong Liu, and Rong Xiong. Robust object tracking with a hierarchical ensemble framework. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), page 438–445, 2016.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    Autonomous robots enjoy a wide popularity nowadays and have been applied in many applications, such as home security, entertainment, delivery, navigation and guidance. It is vital for robots to track objects accurately in real time in these applications, so it is necessary to focus on tracking algorithms to improve the robustness, speed and accuracy. In this paper, we propose a real-time robust object tracking algorithm based on a hierarchical ensemble framework which incorporates information including individual pixel features, local patches and holistic target models. The framework combines multiple ensemble models simultaneously instead of using a single ensemble model individually. A discriminative model which accounts for the matching degree of local patches is adopted via a bottom ensemble layer, and a generative model which exploits holistic templates is used to search for the object based on the middle ensemble layer as well as an adaptive Kalman filter. We test the proposed tracker on challenging benchmark image sequences. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms, especially when the appearance changes dramatically and the occlusions occur.
    @inproceedings{wang2016robustot,
    title = {Robust object tracking with a hierarchical ensemble framework},
    author = {Mengmeng Wang and Yong Liu and Rong Xiong},
    year = 2016,
    booktitle = {2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {438--445},
    doi = {https://doi.org/10.1109/IROS.2016.7759091},
    arxiv = {http://arxiv.org/pdf/1509.06925},
    abstract = {Autonomous robots enjoy a wide popularity nowadays and have been applied in many applications, such as home security, entertainment, delivery, navigation and guidance. It is vital for robots to track objects accurately in real time in these applications, so it is necessary to focus on tracking algorithms to improve the robustness, speed and accuracy. In this paper, we propose a real-time robust object tracking algorithm based on a hierarchical ensemble framework which incorporates information including individual pixel features, local patches and holistic target models. The framework combines multiple ensemble models simultaneously instead of using a single ensemble model individually. A discriminative model which accounts for the matching degree of local patches is adopted via a bottom ensemble layer, and a generative model which exploits holistic templates is used to search for the object based on the middle ensemble layer as well as an adaptive Kalman filter. We test the proposed tracker on challenging benchmark image sequences. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms, especially when the appearance changes dramatically and the occlusions occur.}
    }

Links