Publications

2024

  • J. Mei, Y. Yang, M. Wang, J. Zhu, J. Ra, Y. Ma, L. Li, and Y. Liu, “Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network,” IEEE Transactions on Image Processing, vol. 33, pp. 5468-5481, 2024.

    Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.

    @article{mei2024cbs,
    title = {Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Junyu Zhu and Jongwon Ra and Yukai Ma and Laijian Li and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Image Processing},
    volume = 33,
    pages = {5468-5481},
    doi = {10.1109/TIP.2024.3461989},
    abstract = {Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.}
    }

  • W. Liu, B. Zhang, T. Liu, J. Jiang, and Y. Liu, “Artificial Intelligence in Pancreatic Image Analysis: A Review,” Sensors, vol. 24, p. 4749, 2024.

    Pancreatic cancer is a highly lethal disease with a poor prognosis. Its early diagnosis and accurate treatment mainly rely on medical imaging, so accurate medical image analysis is especially vital for pancreatic cancer patients. However, medical image analysis of pancreatic cancer is facing challenges due to ambiguous symptoms, high misdiagnosis rates, and significant financial costs. Artificial intelligence (AI) offers a promising solution by relieving medical personnel’s workload, improving clinical decision-making, and reducing patient costs. This study focuses on AI applications such as segmentation, classification, object detection, and prognosis prediction across five types of medical imaging: CT, MRI, EUS, PET, and pathological images, as well as integrating these imaging modalities to boost diagnostic accuracy and treatment efficiency. In addition, this study discusses current hot topics and future directions aimed at overcoming the challenges in AI-enabled automated pancreatic cancer diagnosis algorithms.

    @article{jiang2024aii,
    title = {Artificial Intelligence in Pancreatic Image Analysis: A Review},
    author = {Weixuan Liu and Bairui Zhang and Tao Liu and Juntao Jiang and Yong Liu},
    year = 2024,
    journal = {Sensors},
    volume = 24,
    pages = {4749},
    doi = {10.3390/s24144749},
    abstract = {Pancreatic cancer is a highly lethal disease with a poor prognosis. Its early diagnosis and accurate treatment mainly rely on medical imaging, so accurate medical image analysis is especially vital for pancreatic cancer patients. However, medical image analysis of pancreatic cancer is facing challenges due to ambiguous symptoms, high misdiagnosis rates, and significant financial costs. Artificial intelligence (AI) offers a promising solution by relieving medical personnel's workload, improving clinical decision-making, and reducing patient costs. This study focuses on AI applications such as segmentation, classification, object detection, and prognosis prediction across five types of medical imaging: CT, MRI, EUS, PET, and pathological images, as well as integrating these imaging modalities to boost diagnostic accuracy and treatment efficiency. In addition, this study discusses current hot topics and future directions aimed at overcoming the challenges in AI-enabled automated pancreatic cancer diagnosis algorithms.}
    }

  • H. Li, Y. Ma, Y. Huang, Y. Gu, W. Xu, Y. Liu, and X. Zuo, “RIDERS: Radar-Infrared Depth Estimation for Robust Sensing,” IEEE Transactions on Intelligent Transportation Systems, 2024.

    Dense depth recovery is crucial in autonomous driving, serving as a foundational element for obstacle avoidance, 3D object detection, and local path planning. Adverse weather conditions, including haze, dust, rain, snow, and darkness, introduce significant challenges to accurate dense depth estimation, thereby posing substantial safety risks in autonomous driving. These challenges are particularly pronounced for traditional depth estimation methods that rely on short electromagnetic wave sensors, such as visible spectrum cameras and near-infrared LiDAR, due to their susceptibility to diffraction noise and occlusion in such environments. To fundamentally overcome this issue, we present a novel approach for robust metric depth estimation by fusing a millimeter-wave radar and a monocular infrared thermal camera, which are capable of penetrating atmospheric particles and unaffected by lighting conditions. Our proposed Radar-Infrared fusion method achieves highly accurate and finely detailed dense depth estimation through three stages, including monocular depth prediction with global scale alignment, quasi-dense radar augmentation by learning radar-pixels correspondences, and local scale refinement of dense depth using a scale map learner. Our method achieves exceptional visual quality and accurate metric estimation by addressing the challenges of ambiguity and misalignment that arise from directly fusing multi-modal long-wave features. We evaluate the performance of our approach on the NTU4DRadLM dataset and our self-collected challenging ZJU-Multispectrum dataset. Especially noteworthy is the unprecedented robustness demonstrated by our proposed method in smoky scenarios.

    @article{li2024riders,
    title = {RIDERS: Radar-Infrared Depth Estimation for Robust Sensing},
    author = {Han Li and Yukai Ma and Yuehao Huang and Yaqing Gu and Weihua Xu and Yong Liu and Xingxing Zuo},
    year = 2024,
    journal = {IEEE Transactions on Intelligent Transportation Systems},
    doi = {10.1109/TITS.2024.3432996},
    abstract = {Dense depth recovery is crucial in autonomous driving, serving as a foundational element for obstacle avoidance, 3D object detection, and local path planning. Adverse weather conditions, including haze, dust, rain, snow, and darkness, introduce significant challenges to accurate dense depth estimation, thereby posing substantial safety risks in autonomous driving. These challenges are particularly pronounced for traditional depth estimation methods that rely on short electromagnetic wave sensors, such as visible spectrum cameras and near-infrared LiDAR, due to their susceptibility to diffraction noise and occlusion in such environments. To fundamentally overcome this issue, we present a novel approach for robust metric depth estimation by fusing a millimeter-wave radar and a monocular infrared thermal camera, which are capable of penetrating atmospheric particles and unaffected by lighting conditions. Our proposed Radar-Infrared fusion method achieves highly accurate and finely detailed dense depth estimation through three stages, including monocular depth prediction with global scale alignment, quasi-dense radar augmentation by learning radar-pixels correspondences, and local scale refinement of dense depth using a scale map learner. Our method achieves exceptional visual quality and accurate metric estimation by addressing the challenges of ambiguity and misalignment that arise from directly fusing multi-modal long-wave features. We evaluate the performance of our approach on the NTU4DRadLM dataset and our self-collected challenging ZJU-Multispectrum dataset. Especially noteworthy is the unprecedented robustness demonstrated by our proposed method in smoky scenarios.}
    }

  • L. Peng, R. Cai, J. Xiang, J. Zhu, W. Liu, W. Gao, and Y. Liu, “LiteGrasp: A Light Robotic Grasp Detection via Semi-Supervised Knowledge Distillation,” IEEE Robotics and Automation Letters, vol. 9, pp. 7995-8002, 2024.

    Grasping detection from single images in robotic applications poses a significant challenge. While contemporary deep learning techniques excel, their success often hinges on large annotated datasets and intricate network architectures. In this letter, we present LiteGrasp, a novel semi-supervised lightweight framework purpose-built for grasp detection, eliminating the necessity for exhaustive supervision and intricate networks. Our approach uses a limited amount of labeled data via a knowledge distillation method, introducing HRGrasp-Net, a model with high efficiency for extracting features and largely based on HRNet. We incorporate pseudo-label filtering within a mutual learning model set within a teacher-student paradigm. This enhances the transference of data from images with labels to those without. Additionally, we introduce the streamlined Lite HRGrasp-Net, acting as the student network which gains further distillation knowledge using a multi-level fusion cascade originating from HRGrasp-Net. Impressively, LiteGrasp thrives with just a fraction (4.3%) of HRGrasp-Net’s original model size, and with limited labeled data relative to total data (25% ratio) across all benchmarks, regularly outperforming solely supervised and semi-supervised models. Taking just 6 ms for execution, LiteGrasp showcases exceptional accuracy (99.99% and 97.21% on Cornell and Jacquard data sets respectively), as well as an impressive 95.3% rate of success in grasping when deployed using a 6DoF UR5e robotic arm. These highlights underscore the effectiveness and efficiency of LiteGrasp for grasp detection, even under resource-limited conditions.

    @article{peng2024lal,
    title = {LiteGrasp: A Light Robotic Grasp Detection via Semi-Supervised Knowledge Distillation},
    author = {Linpeng Peng and Rongyao Cai and Jingyang Xiang and Junyu Zhu and Weiwei Liu and Wang Gao and Yong Liu},
    year = 2024,
    journal = {IEEE Robotics and Automation Letters},
    volume = 9,
    pages = {7995-8002},
    doi = {10.1109/LRA.2024.3436336},
    abstract = {Grasping detection from single images in robotic applications poses a significant challenge. While contemporary deep learning techniques excel, their success often hinges on large annotated datasets and intricate network architectures. In this letter, we present LiteGrasp, a novel semi-supervised lightweight framework purpose-built for grasp detection, eliminating the necessity for exhaustive supervision and intricate networks. Our approach uses a limited amount of labeled data via a knowledge distillation method, introducing HRGrasp-Net, a model with high efficiency for extracting features and largely based on HRNet. We incorporate pseudo-label filtering within a mutual learning model set within a teacher-student paradigm. This enhances the transference of data from images with labels to those without. Additionally, we introduce the streamlined Lite HRGrasp-Net, acting as the student network which gains further distillation knowledge using a multi-level fusion cascade originating from HRGrasp-Net. Impressively, LiteGrasp thrives with just a fraction (4.3%) of HRGrasp-Net's original model size, and with limited labeled data relative to total data (25% ratio) across all benchmarks, regularly outperforming solely supervised and semi-supervised models. Taking just 6 ms for execution, LiteGrasp showcases exceptional accuracy (99.99% and 97.21% on Cornell and Jacquard data sets respectively), as well as an impressive 95.3% rate of success in grasping when deployed using a 6DoF UR5e robotic arm. These highlights underscore the effectiveness and efficiency of LiteGrasp for grasp detection, even under resource-limited conditions.}
    }

  • R. Cai, W. Gao, L. Peng, Z. Lu, K. Zhang, and Y. Liu, “Debiased Contrastive Learning With Supervision Guidance for Industrial Fault Detection,” IEEE Transactions on Industrial Informatics, 2024.

    The time series self-supervised contrastive learning framework has succeeded significantly in industrial fault detection scenarios. It typically consists of pretraining on abundant unlabeled data and fine-tuning on limited annotated data. However, the two-phase framework faces three challenges: Sampling bias, task-agnostic representation issue, and angular-centricity issue. These challenges hinder further development in industrial applications. This article introduces a debiased contrastive learning with supervision guidance (DCLSG) framework and applies it to industrial fault detection tasks. First, DCLSG employs channel augmentation to integrate temporal and frequency domain information. Pseudolabels based on momentum clustering operation are assigned to extracted representations, thereby mitigating the sampling bias raised by the selection of positive pairs. Second, the generated supervisory signal guides the pretraining phase, tackling the task-agnostic representation issue. Third, the angular-centricity issue is addressed using the proposed Gaussian distance metric measuring the radial distribution of representations. The experiments conducted on three industrial datasets (ISDB, CWRU, and practical datasets) validate the superior performance of the DCLSG compared to other fault detection methods.

    @article{cai2024dcl,
    title = {Debiased Contrastive Learning With Supervision Guidance for Industrial Fault Detection},
    author = {Rongyao Cai and Wang Gao and Linpeng Peng and Zhengming Lu and Kexin Zhang and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Industrial Informatics},
    doi = {10.1109/TII.2024.3424561},
    abstract = {The time series self-supervised contrastive learning framework has succeeded significantly in industrial fault detection scenarios. It typically consists of pretraining on abundant unlabeled data and fine-tuning on limited annotated data. However, the two-phase framework faces three challenges: Sampling bias, task-agnostic representation issue, and angular-centricity issue. These challenges hinder further development in industrial applications. This article introduces a debiased contrastive learning with supervision guidance (DCLSG) framework and applies it to industrial fault detection tasks. First, DCLSG employs channel augmentation to integrate temporal and frequency domain information. Pseudolabels based on momentum clustering operation are assigned to extracted representations, thereby mitigating the sampling bias raised by the selection of positive pairs. Second, the generated supervisory signal guides the pretraining phase, tackling the task-agnostic representation issue. Third, the angular-centricity issue is addressed using the proposed Gaussian distance metric measuring the radial distribution of representations. The experiments conducted on three industrial datasets (ISDB, CWRU, and practical datasets) validate the superior performance of the DCLSG compared to other fault detection methods.}
    }

  • Y. Han, J. Zhang, Y. Wang, C. Wang, Y. Liu, L. Qi, X. Li, and M. Yang, “Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

    Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) Overfitting suppresses novel class objects; 2) Dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) A novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) Demonstrating that support object queries encode key factors after base training, allowing query features to be enhanced twice at both feature and query levels using simple cross-attention, thus avoiding complex spatial correlation interaction; 3) Introducing a class-enhanced base knowledge distillation loss to address the issue of DETR-like models struggling with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., +8.2/ + 9.4 performance gain over state-of-the-art methods with 10/30-shots. Source code and models will be available at this github site.

    @article{han2024rta,
    title = {Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation},
    author = {Yue Han and Jiangning Zhang and Yabiao Wang and Chengjie Wang and Yong Liu and Lu Qi and Xiangtai Li and Ming-Hsuan Yang},
    year = 2024,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    doi = {10.1109/TPAMI.2024.3421340},
    abstract = {Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) Overfitting suppresses novel class objects; 2) Dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) A novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) Demonstrating that support object queries encode key factors after base training, allowing query features to be enhanced twice at both feature and query levels using simple cross-attention, thus avoiding complex spatial correlation interaction; 3) Introducing a class-enhanced base knowledge distillation loss to address the issue of DETR-like models struggling with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., +8.2/ + 9.4 performance gain over state-of-the-art methods with 10/30-shots. Source code and models will be available at this github site.}
    }

  • K. Zhang, Q. Wen, C. Zhang, R. Cai, M. Jin, Y. Liu, J. Y. Zhang, Y. Liang, G. Pang, D. Song, and S. Pan, “Self-Supervised Learning for Time Series Analysis: Taxonomy, Progress, and Prospects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 6775-6794, 2024.

    Self-supervised learning (SSL) has recently achieved impressive performance on various time series tasks. The most prominent advantage of SSL is that it reduces the dependence on labeled data. Based on the pre-training and fine-tuning strategy, even a small amount of labeled data can achieve high performance. Compared with many published self-supervised surveys on computer vision and natural language processing, a comprehensive survey for time series SSL is still missing. To fill this gap, we review current state-of-the-art SSL methods for time series data in this article. To this end, we first comprehensively review existing surveys related to SSL and time series, and then provide a new taxonomy of existing time series SSL methods by summarizing them from three perspectives: generative-based, contrastive-based, and adversarial-based. These methods are further divided into ten subcategories with detailed reviews and discussions about their key intuitions, main frameworks, advantages and disadvantages. To facilitate the experiments and validation of time series SSL methods, we also summarize datasets commonly used in time series forecasting, classification, anomaly detection, and clustering tasks. Finally, we present the future directions of SSL for time series analysis.

    @article{zhang2024ssl,
    title = {Self-Supervised Learning for Time Series Analysis: Taxonomy, Progress, and Prospects},
    author = {Kexin Zhang and Qingsong Wen and Chaoli Zhang and Rongyao Cai and Ming Jin and Yong Liu and James Y. Zhang and Yuxuan Liang and Guansong Pang and Dongjin Song and Shirui Pan},
    year = 2024,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume = 46,
    pages = {6775-6794},
    doi = {10.1109/TPAMI.2024.3387317},
    abstract = {Self-supervised learning (SSL) has recently achieved impressive performance on various time series tasks. The most prominent advantage of SSL is that it reduces the dependence on labeled data. Based on the pre-training and fine-tuning strategy, even a small amount of labeled data can achieve high performance. Compared with many published self-supervised surveys on computer vision and natural language processing, a comprehensive survey for time series SSL is still missing. To fill this gap, we review current state-of-the-art SSL methods for time series data in this article. To this end, we first comprehensively review existing surveys related to SSL and time series, and then provide a new taxonomy of existing time series SSL methods by summarizing them from three perspectives: generative-based, contrastive-based, and adversarial-based. These methods are further divided into ten subcategories with detailed reviews and discussions about their key intuitions, main frameworks, advantages and disadvantages. To facilitate the experiments and validation of time series SSL methods, we also summarize datasets commonly used in time series forecasting, classification, anomaly detection, and clustering tasks. Finally, we present the future directions of SSL for time series analysis.}
    }

  • J. Mei, M. Wang, Y. Yang, Z. Li, and Y. Liu, “Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation,” Applied Intelligence, vol. 54, pp. 6138-6153, 2024.

    Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they need adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the common-used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git.

    @article{mei2024lsr,
    title = {Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation},
    author = {Jianbiao Mei and Mengmeng Wang and Yu Yang and Zizhang Li and Yong Liu},
    year = 2024,
    journal = {Applied Intelligence},
    volume = 54,
    pages = {6138-6153},
    doi = {10.1007/s10489-024-05486-y},
    abstract = {Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they need adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the common-used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git.}
    }

  • J. Zhang, X. Li, Y. Wang, C. Wang, Y. Yang, Y. Liu, and D. Tao, “EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm,” International Journal of Computer Vision, vol. 132, pp. 3509-3536, 2024.

    Motivated by biological evolution, this paper explains the rationality of Vision Transformer by analogy with the proven practical evolutionary algorithm (EA) and derives that both have consistent mathematical formulation. Then inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., Multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a task-related head docked with transformer backbone to complete final information fusion more flexibly and improve a modulated deformable MSA to dynamically model irregular locations. Massive quantitative and quantitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. E.g., our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 only trained on ImageNet-1K with naive training recipe; EATFormer-Tiny/Small/Base armed Mask-R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP separately with less FLOPs; Our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K by Upernet that exceeds Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.

    @article{zhang2024eat,
    title = {EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm},
    author = {Jiangning Zhang and Xiangtai Li and Yabiao Wang and Chengjie Wang and Yibo Yang and Yong Liu and Dacheng Tao},
    year = 2024,
    journal = {International Journal of Computer Vision},
    volume = 132,
    pages = {3509-3536},
    doi = {10.1007/s11263-024-02034-6},
    abstract = {Motivated by biological evolution, this paper explains the rationality of Vision Transformer by analogy with the proven practical evolutionary algorithm (EA) and derives that both have consistent mathematical formulation. Then inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., Multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a task-related head docked with transformer backbone to complete final information fusion more flexibly and improve a modulated deformable MSA to dynamically model irregular locations. Massive quantitative and quantitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. E.g., our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 only trained on ImageNet-1K with naive training recipe; EATFormer-Tiny/Small/Base armed Mask-R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP separately with less FLOPs; Our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K by Upernet that exceeds Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.}
    }

  • J. Mei, Y. Yang, M. Wang, Z. Li, J. Ra, and Y. Liu, “LiDAR Video Object Segmentation with Dynamic Kernel Refinement,” Pattern Recognition Letters, vol. 178, pp. 21-27, 2024.

    In this paper, we formalize memory- and tracking-based methods to perform the LiDAR-based Video Object Segmentation (VOS) task, which segments points of the specific 3D target (given in the first frame) in a LiDAR sequence. LiDAR-based VOS can directly provide target-aware geometric information for practical application scenarios like behavior analysis and anticipating danger. We first construct a LiDAR-based VOS dataset named KITTI-VOS based on SemanticKITTI, which acts as a testbed and facilitates comprehensive evaluations of algorithm performance. Next, we provide two types of baselines, i.e., memory-based and tracking-based baselines, to explore this task. Specifically, the first memory-based pipeline is built on a space–time memory network equipped with the non-local spatiotemporal attention-based memory bank. We further design a more potent variant to introduce the locality into the spatiotemporal attention module by local self-attention and cross-attention modules. For the second tracking-based baseline, we modify two representative 3D object tracking methods to adapt to LiDAR-based VOS tasks. Finally, we propose a refine module that takes mask priors and generates object-aware kernels, which could boost all the baselines’ performance. We evaluate the proposed methods on the dataset and demonstrate their effectiveness.

    @article{mei2024lvo,
    title = {LiDAR Video Object Segmentation with Dynamic Kernel Refinement},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Zizhang Li and Jongwon Ra and Yong Liu},
    year = 2024,
    journal = {Pattern Recognition Letters},
    volume = 178,
    pages = {21-27},
    doi = {10.1016/j.patrec.2023.12.013},
    abstract = {In this paper, we formalize memory- and tracking-based methods to perform the LiDAR-based Video Object Segmentation (VOS) task, which segments points of the specific 3D target (given in the first frame) in a LiDAR sequence. LiDAR-based VOS can directly provide target-aware geometric information for practical application scenarios like behavior analysis and anticipating danger. We first construct a LiDAR-based VOS dataset named KITTI-VOS based on SemanticKITTI, which acts as a testbed and facilitates comprehensive evaluations of algorithm performance. Next, we provide two types of baselines, i.e., memory-based and tracking-based baselines, to explore this task. Specifically, the first memory-based pipeline is built on a space–time memory network equipped with the non-local spatiotemporal attention-based memory bank. We further design a more potent variant to introduce the locality into the spatiotemporal attention module by local self-attention and cross-attention modules. For the second tracking-based baseline, we modify two representative 3D object tracking methods to adapt to LiDAR-based VOS tasks. Finally, we propose a refine module that takes mask priors and generates object-aware kernels, which could boost all the baselines’ performance. We evaluate the proposed methods on the dataset and demonstrate their effectiveness.}
    }

  • K. Zhang, R. Cai, C. Zhou, and Y. Liu, “Debiased Contrastive Learning for Time-Series Representation Learning and Fault Detection,” IEEE Transactions on Industrial Informatics, vol. 20, pp. 7641-7653, 2024.

    Building reliable fault detection systems through deep neural networks is an appealing topic in industrial scenarios. In these contexts, the representations extracted by neural networks on available labeled time-series data can reflect system states. However, this endeavor remains challenging due to the necessity of labeled data. Self-supervised contrastive learning (SSCL) is one of the effective approaches to deal with this challenge, but existing SSCL-based models suffer from sampling bias and representation bias problems. This article introduces a debiased contrastive learning framework for time-series data and applies it to industrial fault detection tasks. This framework first develops the multigranularity augmented view generation method to generate augmented views at different granularities. It then introduces the momentum clustering contrastive learning strategy and the expert knowledge guidance mechanism to mitigate sampling bias and representation bias, respectively. Finally, the experiments on a public bearing fault detection dataset and a widely used valve stiction detection dataset show the effectiveness of the proposed feature learning framework.

    @article{zhang2024dcl,
    title = {Debiased Contrastive Learning for Time-Series Representation Learning and Fault Detection},
    author = {Kexin Zhang and Rongyao Cai and Chunlin Zhou and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Industrial Informatics},
    volume = 20,
    pages = {7641-7653},
    doi = {10.1109/TII.2024.3359409},
    abstract = {Building reliable fault detection systems through deep neural networks is an appealing topic in industrial scenarios. In these contexts, the representations extracted by neural networks on available labeled time-series data can reflect system states. However, this endeavor remains challenging due to the necessity of labeled data. Self-supervised contrastive learning (SSCL) is one of the effective approaches to deal with this challenge, but existing SSCL-based models suffer from sampling bias and representation bias problems. This article introduces a debiased contrastive learning framework for time-series data and applies it to industrial fault detection tasks. This framework first develops the multigranularity augmented view generation method to generate augmented views at different granularities. It then introduces the momentum clustering contrastive learning strategy and the expert knowledge guidance mechanism to mitigate sampling bias and representation bias, respectively. Finally, the experiments on a public bearing fault detection dataset and a widely used valve stiction detection dataset show the effectiveness of the proposed feature learning framework.}
    }

  • S. Li, J. Chen, S. Liu, C. Zhu, G. Tian, and Y. Liu, “MCMC: Multi-Constrained Model Compression via One-stage Envelope Reinforcement Learning,” IEEE Transactions on Neural Networks and Learning Systems, 2024.

    Model compression methods are being developed to bridge the gap between the massive scale of neural networks and the limited hardware resources on edge devices. Since most real-world applications deployed on resource-limited hardware platforms typically have multiple hardware constraints simultaneously, most existing model compression approaches that only consider optimizing one single hardware objective are ineffective. In this article, we propose an automated pruning method called multi-constrained model compression (MCMC) that allows for the optimization of multiple hardware targets, such as latency, floating point operations (FLOPs), and memory usage, while minimizing the impact on accuracy. Specifically, we propose an improved multi-objective reinforcement learning (MORL) algorithm, the one-stage envelope deep deterministic policy gradient (DDPG) algorithm, to determine the pruning strategy for neural networks. Our improved one-stage envelope DDPG algorithm reduces exploration time and offers greater flexibility in adjusting target priorities, enhancing its suitability for pruning tasks. For instance, on the visual geometry group (VGG)-16 network, our method achieved an 80% reduction in FLOPs, a 2.31x reduction in memory usage, and a 1.92x acceleration, with an accuracy improvement of 0.09% compared with the baseline. For larger datasets, such as ImageNet, we reduced FLOPs by 50% for MobileNet-V1, resulting in a 4.7x faster speed and 1.48x memory compression, while maintaining the same accuracy. When applied to edge devices, such as JETSON XAVIER NX, our method resulted in a 71% reduction in FLOPs for MobileNet-V1, leading to a 1.63x faster speed, 1.64x memory compression, and an accuracy improvement.

    @article{li2024mcmc,
    title = {MCMC: Multi-Constrained Model Compression via One-stage Envelope Reinforcement Learning},
    author = {Siqi Li and Jun Chen and Shanqi Liu and Chengrui Zhu and Guanzhong Tian and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2024.3353763},
    abstract = {Model compression methods are being developed to bridge the gap between the massive scale of neural networks and the limited hardware resources on edge devices. Since most real-world applications deployed on resource-limited hardware platforms typically have multiple hardware constraints simultaneously, most existing model compression approaches that only consider optimizing one single hardware objective are ineffective. In this article, we propose an automated pruning method called multi-constrained model compression (MCMC) that allows for the optimization of multiple hardware targets, such as latency, floating point operations (FLOPs), and memory usage, while minimizing the impact on accuracy. Specifically, we propose an improved multi-objective reinforcement learning (MORL) algorithm, the one-stage envelope deep deterministic policy gradient (DDPG) algorithm, to determine the pruning strategy for neural networks. Our improved one-stage envelope DDPG algorithm reduces exploration time and offers greater flexibility in adjusting target priorities, enhancing its suitability for pruning tasks. For instance, on the visual geometry group (VGG)-16 network, our method achieved an 80% reduction in FLOPs, a 2.31x reduction in memory usage, and a 1.92x acceleration, with an accuracy improvement of 0.09% compared with the baseline. For larger datasets, such as ImageNet, we reduced FLOPs by 50% for MobileNet-V1, resulting in a 4.7x faster speed and 1.48x memory compression, while maintaining the same accuracy. When applied to edge devices, such as JETSON XAVIER NX, our method resulted in a 71% reduction in FLOPs for MobileNet-V1, leading to a 1.63x faster speed, 1.64x memory compression, and an accuracy improvement.}
    }

  • Y. Ma, H. Li, X. Zhao, Y. Gu, X. Lang, L. Li, and Y. Liu, “FMCW Radar on LiDAR Map Localization in Structural Urban Environments,” Journal of Field Robotics, vol. 41, pp. 699-717, 2024.

    Multisensor fusion-based localization technology has achieved high accuracy in autonomous systems. How to improve the robustness is the main challenge at present. The most commonly used LiDAR and camera are weather-sensitive, while the frequency-modulated continuous wave Radar has strong adaptability but suffers from noise and ghost effects. In this paper, we propose a heterogeneous localization method called Radar on LiDAR Map, which aims to enhance localization accuracy without relying on loop closures by mitigating the accumulated error in Radar odometry in real time. To accomplish this, we utilize LiDAR scans and ground truth paths as Teach paths and Radar scans as the trajectories to be estimated, referred to as Repeat paths. By establishing a correlation between the Radar and LiDAR scan data, we can enhance the accuracy of Radar odometry estimation. Our approach involves embedding the data from both Radar and LiDAR sensors into a density map. We calculate the spatial vector similarity with an offset to determine the corresponding place index within the candidate map and estimate the rotation and translation. To refine the alignment, we utilize the Iterative Closest Point algorithm to achieve optimal matching on the LiDAR submap. The estimated bias is subsequently incorporated into the Radar SLAM for optimizing the position map. We conducted extensive experiments on the Mulran Radar Data set, Oxford Radar RobotCar Dataset, and our data set to demonstrate the feasibility and effectiveness of our proposed approach. Our proposed scan projection descriptors achieves homogeneous and heterogeneous place recognition and works much better than existing methods. Its application to the Radar SLAM system also substantially improves the positioning accuracy. All sequences’ root mean square error is 2.53 m for positioning and 1.83 degrees for angle.

    @article{ma2024fmcw,
    title = {FMCW Radar on LiDAR Map Localization in Structural Urban Environments},
    author = {Yukai Ma and Han Li and Xiangrui Zhao and Yaqing Gu and Xiaolei Lang and Laijian Li and Yong Liu},
    year = 2024,
    journal = {Journal of Field Robotics},
    volume = 41,
    pages = {699-717},
    doi = {10.1002/rob.22291},
    abstract = {Multisensor fusion-based localization technology has achieved high accuracy in autonomous systems. How to improve the robustness is the main challenge at present. The most commonly used LiDAR and camera are weather-sensitive, while the frequency-modulated continuous wave Radar has strong adaptability but suffers from noise and ghost effects. In this paper, we propose a heterogeneous localization method called Radar on LiDAR Map, which aims to enhance localization accuracy without relying on loop closures by mitigating the accumulated error in Radar odometry in real time. To accomplish this, we utilize LiDAR scans and ground truth paths as Teach paths and Radar scans as the trajectories to be estimated, referred to as Repeat paths. By establishing a correlation between the Radar and LiDAR scan data, we can enhance the accuracy of Radar odometry estimation. Our approach involves embedding the data from both Radar and LiDAR sensors into a density map. We calculate the spatial vector similarity with an offset to determine the corresponding place index within the candidate map and estimate the rotation and translation. To refine the alignment, we utilize the Iterative Closest Point algorithm to achieve optimal matching on the LiDAR submap. The estimated bias is subsequently incorporated into the Radar SLAM for optimizing the position map. We conducted extensive experiments on the Mulran Radar Data set, Oxford Radar RobotCar Dataset, and our data set to demonstrate the feasibility and effectiveness of our proposed approach. Our proposed scan projection descriptors achieves homogeneous and heterogeneous place recognition and works much better than existing methods. Its application to the Radar SLAM system also substantially improves the positioning accuracy. All sequences' root mean square error is 2.53 m for positioning and 1.83 degrees for angle.}
    }

  • T. Huang, Q. Liu, X. Zhao, J. Chen, and Y. Liu, “Learnable Chamfer Distance for Point Cloud Reconstruction,” Pattern Recognition Letters, vol. 178, pp. 43-48, 2024.

    As point clouds are 3D signals with permutation invariance, most existing works train their reconstruction networks by measuring shape differences with the average point-to-point distance between point clouds matched with predefined rules. However, the static matching rules may deviate from actual shape differences. Although some works propose dynamically-updated learnable structures to replace matching rules, they need more iterations to converge well. In this work, we propose a simple but effective reconstruction loss, named Learnable Chamfer Distance (LCD) by dynamically paying attention to matching distances with different weight distributions controlled with a group of learnable networks. By training with adversarial strategy, LCD learns to search defects in reconstructed results and overcomes the weaknesses of static matching rules, while the performances at low iterations can also be guaranteed by the basic matching algorithm. Experiments on multiple reconstruction networks confirm that LCD can help achieve better reconstruction performances and extract more representative representations with faster convergence and comparable training efficiency.

    @article{huang2024lcd,
    title = {Learnable Chamfer Distance for Point Cloud Reconstruction},
    author = {Tianxin Huang and Qingyao Liu and Xiangrui Zhao and Jun Chen and Yong Liu},
    year = 2024,
    journal = {Pattern Recognition Letters},
    volume = 178,
    pages = {43-48},
    doi = {10.1016/j.patrec.2023.12.015},
    abstract = {As point clouds are 3D signals with permutation invariance, most existing works train their reconstruction networks by measuring shape differences with the average point-to-point distance between point clouds matched with predefined rules. However, the static matching rules may deviate from actual shape differences. Although some works propose dynamically-updated learnable structures to replace matching rules, they need more iterations to converge well. In this work, we propose a simple but effective reconstruction loss, named Learnable Chamfer Distance (LCD) by dynamically paying attention to matching distances with different weight distributions controlled with a group of learnable networks. By training with adversarial strategy, LCD learns to search defects in reconstructed results and overcomes the weaknesses of static matching rules, while the performances at low iterations can also be guaranteed by the basic matching algorithm. Experiments on multiple reconstruction networks confirm that LCD can help achieve better reconstruction performances and extract more representative representations with faster convergence and comparable training efficiency.}
    }

  • Y. Chen, Y. Wu, H. Yang, J. Cao, Q. Wang, and Y. Liu, “A Distributed Pipeline for Collaborative Pursuit in the Target Guarding Problem,” IEEE Robotics and Automation Letters (RA-L), vol. 9, pp. 2064-2071, 2024.

    The target guarding problem (TGP) is a classical combat game where pursuers aim to capture evaders to protect a territory from intrusion. This paper proposes a distributed pipeline for multi-pursuer multi-evader TGP with the capability to accommodate varying numbers of evaders and criteria for successful pursuit. The pipeline integrates a cooperative encirclement-oriented distributed model predictive control (CEO-DMPC) method with a collaborative grouping strategy for trajectory planning of pursuers. This integration achieves cooperation and collision avoidance during the capture process across various scenarios. Besides, the objective function of CEO-DMPC employs sequences of predicted states instead of only a terminal state. Evaders are guided by the artificial potential field (APF) policy to reach their goals without being captured. Simulations with different parameters are conducted to validate the whole pipeline and the experiment results are illustrated and analyzed.

    @article{chen2024adp,
    title = {A Distributed Pipeline for Collaborative Pursuit in the Target Guarding Problem},
    author = {Yansong Chen and Yuchen Wu and Helei Yang and Junjie Cao and Qinqin Wang and Yong Liu},
    year = 2024,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = 9,
    pages = {2064-2071},
    doi = {10.1109/LRA.2024.3349977},
    abstract = {The target guarding problem (TGP) is a classical combat game where pursuers aim to capture evaders to protect a territory from intrusion. This paper proposes a distributed pipeline for multi-pursuer multi-evader TGP with the capability to accommodate varying numbers of evaders and criteria for successful pursuit. The pipeline integrates a cooperative encirclement-oriented distributed model predictive control (CEO-DMPC) method with a collaborative grouping strategy for trajectory planning of pursuers. This integration achieves cooperation and collision avoidance during the capture process across various scenarios. Besides, the objective function of CEO-DMPC employs sequences of predicted states instead of only a terminal state. Evaders are guided by the artificial potential field (APF) policy to reach their goals without being captured. Simulations with different parameters are conducted to validate the whole pipeline and the experiment results are illustrated and analyzed.}
    }

  • X. Zuo, M. Zhang, M. Wang, Y. Chen, G. Huang, Y. Liu, and M. Li, “Visual-Based Kinematics and Pose Estimation for Skid-Steering Robots,” IEEE Transactions on Automation Science and Engineering, vol. 21, pp. 91-105, 2024.

    To build commercial robots, skid-steering mechanical design is of increased popularity due to its manufacturing simplicity and unique mechanism. However, these also cause significant challenges on software and algorithm design, especially for the pose estimation (i.e., determining the robot’s rotation and position) of skid-steering robots, since they change their orientation with an inevitable skid. To tackle this problem, we propose a probabilistic sliding-window estimator dedicated to skid-steering robots, using measurements from a monocular camera, the wheel encoders, and optionally an inertial measurement unit (IMU). Specifically, we explicitly model the kinematics of skid-steering robots by both track instantaneous centers of rotation (ICRs) and correction factors, which are capable of compensating for the complexity of track-to-terrain interaction, the imperfectness of mechanical design, terrain conditions and smoothness, etc. To prevent performance reduction in robots’ long-term missions, the time- and location- varying kinematic parameters are estimated online along with pose estimation states in a tightly-coupled manner. More importantly, we conduct indepth observability analysis for different sensors and design configurations in this paper, which provides us with theoretical tools in making the correct choice when building real commercial robots. In our experiments, we validate the proposed method by both simulation tests and real-world experiments, which demonstrate that our method outperforms competing methods by wide margins.

    @article{zuo2024vbk,
    title = {Visual-Based Kinematics and Pose Estimation for Skid-Steering Robots},
    author = {Xingxing Zuo and Mingming Zhang and Mengmeng Wang and Yiming Chen and Guoquan Huang and Yong Liu and Mingyang Li},
    year = 2024,
    journal = {IEEE Transactions on Automation Science and Engineering},
    volume = 21,
    pages = {91-105},
    doi = {10.1109/TASE.2022.3214984},
    abstract = {To build commercial robots, skid-steering mechanical design is of increased popularity due to its manufacturing simplicity and unique mechanism. However, these also cause significant challenges on software and algorithm design, especially for the pose estimation (i.e., determining the robot’s rotation and position) of skid-steering robots, since they change their orientation with an inevitable skid. To tackle this problem, we propose a probabilistic sliding-window estimator dedicated to skid-steering robots, using measurements from a monocular camera, the wheel encoders, and optionally an inertial measurement unit (IMU). Specifically, we explicitly model the kinematics of skid-steering robots by both track instantaneous centers of rotation (ICRs) and correction factors, which are capable of compensating for the complexity of track-to-terrain interaction, the imperfectness of mechanical design, terrain conditions and smoothness, etc. To prevent performance reduction in robots’ long-term missions, the time- and location- varying kinematic parameters are estimated online along with pose estimation states in a tightly-coupled manner. More importantly, we conduct indepth observability analysis for different sensors and design configurations in this paper, which provides us with theoretical tools in making the correct choice when building real commercial robots. In our experiments, we validate the proposed method by both simulation tests and real-world experiments, which demonstrate that our method outperforms competing methods by wide margins.}
    }
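
    For readers who want a concrete picture of the ICR-based kinematics referenced above, the following is a minimal, hedged sketch in Python. It is not the authors' estimator: the sign conventions, the parameter values, and the simple per-track correction factors alpha_l/alpha_r are illustrative placeholders only.

    def skid_steer_body_velocity(v_l, v_r, y_icr_l, y_icr_r, x_icr_v,
                                 alpha_l=1.0, alpha_r=1.0):
        """Map encoder track speeds to a body-frame velocity (x forward, y left).
        y_icr_l / y_icr_r : lateral ICR coordinates of the left/right tracks [m]
        x_icr_v           : longitudinal ICR coordinate of the vehicle body [m]
        alpha_l / alpha_r : per-track correction factors (placeholders)."""
        vl, vr = alpha_l * v_l, alpha_r * v_r
        # Yaw rate from the difference of the corrected track speeds.
        omega = (vr - vl) / (y_icr_l - y_icr_r)
        # Longitudinal velocity as an ICR-weighted mix of the track speeds.
        v_x = (vr * y_icr_l - vl * y_icr_r) / (y_icr_l - y_icr_r)
        # Lateral slip induced by the vehicle ICR offset (v_y + omega * x_icr_v = 0).
        v_y = -x_icr_v * omega
        return v_x, v_y, omega

    # Hypothetical parameter values; an ideal differential drive would have
    # y_icr_l = -y_icr_r = half the track width and x_icr_v = 0.
    print(skid_steer_body_velocity(0.8, 1.2, y_icr_l=0.35, y_icr_r=-0.35, x_icr_v=0.02))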

  • L. Liu, X. Song, M. Wang, Y. Dai, Y. Liu, and L. Zhang, “AGDF-Net: Learning Domain Generalizable Depth Features with Adaptive Guidance Fusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 3137-3155, 2024.
    [BibTeX] [Abstract] [DOI] [PDF]

    Cross-domain generalizable depth estimation aims to estimate the depth of target domains (i.e., real-world) using models trained on the source domains (i.e., synthetic). Previous methods mainly use additional real-world domain datasets to extract depth specific information for cross-domain generalizable depth estimation. Unfortunately, due to the large domain gap, adequate depth specific information is hard to obtain and interference is difficult to remove, which limits the performance. To relieve these problems, we propose a domain generalizable feature extraction network with adaptive guidance fusion (AGDF-Net) to fully acquire essential features for depth estimation at multi-scale feature levels. Specifically, our AGDF-Net first separates the image into initial depth and weak-related depth components with reconstruction and contrary losses. Subsequently, an adaptive guidance fusion module is designed to sufficiently intensify the initial depth features for domain generalizable intensified depth features acquisition. Finally, taking intensified depth features as input, an arbitrary depth estimation network can be used for real-world depth estimation. Using only synthetic datasets, our AGDF-Net can be applied to various real-world datasets (i.e., KITTI, NYUDv2, NuScenes, DrivingStereo and CityScapes) with state-of-the-art performances. Furthermore, experiments with a small amount of real-world data in a semi-supervised setting also demonstrate the superiority of AGDF-Net over state-of-the-art approaches.

    @article{liu2024agdf,
    title = {AGDF-Net: Learning Domain Generalizable Depth Features with Adaptive Guidance Fusion},
    author = {Lina Liu and Xibin Song and Mengmeng Wang and Yuchao Dai and Yong Liu and Liangjun Zhang},
    year = 2024,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume = 46,
    pages = {3137-3155},
    doi = {10.1109/TPAMI.2023.3342634},
    abstract = {Cross-domain generalizable depth estimation aims to estimate the depth of target domains (i.e., real-world) using models trained on the source domains (i.e., synthetic). Previous methods mainly use additional real-world domain datasets to extract depth specific information for cross-domain generalizable depth estimation. Unfortunately, due to the large domain gap, adequate depth specific information is hard to obtain and interference is difficult to remove, which limits the performance. To relieve these problems, we propose a domain generalizable feature extraction network with adaptive guidance fusion (AGDF-Net) to fully acquire essential features for depth estimation at multi-scale feature levels. Specifically, our AGDF-Net first separates the image into initial depth and weak-related depth components with reconstruction and contrary losses. Subsequently, an adaptive guidance fusion module is designed to sufficiently intensify the initial depth features for domain generalizable intensified depth features acquisition. Finally, taking intensified depth features as input, an arbitrary depth estimation network can be used for real-world depth estimation. Using only synthetic datasets, our AGDF-Net can be applied to various real-world datasets (i.e., KITTI, NYUDv2, NuScenes, DrivingStereo and CityScapes) with state-of-the-art performances. Furthermore, experiments with a small amount of real-world data in a semi-supervised setting also demonstrate the superiority of AGDF-Net over state-of-the-art approaches.}
    }

  • Y. Liu, J. Chen, and Y. Liu, “DCCD: Reducing Neural Network Redundancy via Distillation," IEEE Transactions on Neural Networks and Learning Systems, vol. 35, pp. 10006-10017, 2024.
    [BibTeX] [Abstract] [DOI] [PDF]

    Deep neural models have achieved remarkable performance on various supervised and unsupervised learning tasks, but it is a challenge to deploy these large-size networks on resource-limited devices. As a representative type of model compression and acceleration methods, knowledge distillation (KD) solves this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks but ignore the information redundancy of student networks. In this article, we propose a novel distillation framework difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks for redundancy reduction. At the feature level, we construct an efficient contrastive objective that broadens student networks’ feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from teacher networks by making a difference between multiview augmented responses of the same instance. We enhance student networks to be more sensitive to minor dynamic changes. With the improvement of two aspects of DCCD, the student network gains contrastive and difference knowledge and reduces its overfitting and redundancy. Finally, we achieve surprising results that the student approaches and even outperforms the teacher in test accuracy on CIFAR-100. We reduce the top-1 error to 28.16% on ImageNet classification and 24.15% for cross-model transfer with ResNet-18. Empirical experiments and ablation studies on popular datasets show that our proposed method can achieve state-of-the-art accuracy compared with other distillation methods.

    @article{liu2024dccd,
    title = {DCCD: Reducing Neural Network Redundancy via Distillation},
    author = {Yuang Liu and Jun Chen and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    volume = 35,
    pages = {10006-10017},
    doi = {10.1109/TNNLS.2023.3238337},
    abstract = {Deep neural models have achieved remarkable performance on various supervised and unsupervised learning tasks, but it is a challenge to deploy these large-size networks on resource-limited devices. As a representative type of model compression and acceleration methods, knowledge distillation (KD) solves this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks but ignore the information redundancy of student networks. In this article, we propose a novel distillation framework difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks for redundancy reduction. At the feature level, we construct an efficient contrastive objective that broadens student networks' feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from teacher networks by making a difference between multiview augmented responses of the same instance. We enhance student networks to be more sensitive to minor dynamic changes. With the improvement of two aspects of DCCD, the student network gains contrastive and difference knowledge and reduces its overfitting and redundancy. Finally, we achieve surprising results that the student approaches and even outperforms the teacher in test accuracy on CIFAR-100. We reduce the top-1 error to 28.16% on ImageNet classification and 24.15% for cross-model transfer with ResNet-18. Empirical experiments and ablation studies on popular datasets show that our proposed method can achieve state-of-the-art accuracy compared with other distillation methods.}
    }
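
    As a rough intuition for the channel contrastive component described above, here is a small self-contained sketch (not the exact DCCD objective): each channel of the student and teacher feature maps is pooled into a descriptor, and matching channel indices are treated as positives in an InfoNCE-style loss. Shapes and the temperature value are assumptions for the demo.

    import numpy as np

    def channel_contrastive_loss(f_s, f_t, temperature=0.1):
        """InfoNCE over channels: pool each channel of the student (f_s) and
        teacher (f_t) feature maps of shape (N, C, H, W), then use the same
        channel index as the positive pair (matching C assumed for simplicity)."""
        s = f_s.mean(axis=(2, 3))                      # (N, C) channel descriptors
        t = f_t.mean(axis=(2, 3))
        s = s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-8)
        t = t / (np.linalg.norm(t, axis=0, keepdims=True) + 1e-8)
        logits = (s.T @ t) / temperature               # (C, C) channel similarities
        logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))             # cross-entropy on the diagonal

    rng = np.random.default_rng(0)
    f_t = rng.normal(size=(8, 16, 4, 4))               # toy teacher features
    f_s = f_t + 0.1 * rng.normal(size=f_t.shape)       # noisy student features
    print(channel_contrastive_loss(f_s, f_t))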

  • S. Liu, W. Liu, W. Chen, G. Tian, J. Chen, Y. Tong, J. Cao, and Y. Liu, “Learning Multi-Agent Cooperation via Considering Actions of Teammates," IEEE Transactions on Neural Networks and Learning Systems, vol. 35, pp. 11553-11564, 2024.
    [BibTeX] [Abstract] [DOI] [PDF]

    Recently, value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative of these methods, Q-network MIXing (QMIX), restricts the joint action Q values to be a monotonic mixing of each agent’s utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as the ad hoc team play situation. In this work, we propose a novel Q-value decomposition that considers both the return of an agent acting on its own and cooperating with other observable agents to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action searching method that can improve exploration and is not affected by changes in observable agents or changes in the order of agents’ actions. In this way, our method can adapt to the ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly.

    @article{liu2024lma,
    title = {Learning Multi-Agent Cooperation via Considering Actions of Teammates},
    author = {Shanqi Liu and Weiwei Liu and Wenzhou Chen and Guanzhong Tian and Jun Chen and Yao Tong and Junjie Cao and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    volume = 35,
    pages = {11553-11564},
    doi = {10.1109/TNNLS.2023.3262921},
    abstract = {Recently value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative method among these methods, Q-network MIXing (QMIX), restricts the joint action Q values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as ad hoc team play situation. In this work, we propose a novel Q values decomposition that considers both the return of an agent acting on its own and cooperating with other observable agents to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action searching method that can improve exploration and is not affected by changes in observable agents or changes in the order of agents' actions. In this way, our method can adapt to ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly.}
    }
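
    To make the decomposed greedy search idea above concrete, here is a toy, hedged sketch: each agent's value is the sum of an individual term and pairwise cooperation terms over observable teammates, and a coordinate-ascent pass picks greedy actions. This illustrates the general pattern only; the tensors q_self, q_coop, and obs_mask are hypothetical, not the paper's exact decomposition or search.

    import numpy as np

    def greedy_joint_action(q_self, q_coop, obs_mask, iters=5):
        """Greedy search over a decomposed value of the (assumed) form
          Q_i(a) = q_self[i, a_i] + sum_j obs_mask[i, j] * q_coop[i, a_i, j, a_j]
        q_self: (N, A), q_coop: (N, A, N, A), obs_mask: (N, N) 0/1 observability."""
        n, a = q_self.shape
        actions = q_self.argmax(axis=1)            # independent initialization
        for _ in range(iters):
            for i in range(n):
                coop = np.zeros(a)
                for j in range(n):
                    if i != j and obs_mask[i, j]:
                        coop += q_coop[i, :, j, actions[j]]
                actions[i] = np.argmax(q_self[i] + coop)
        return actions

    rng = np.random.default_rng(1)
    n_agents, n_actions = 3, 4
    print(greedy_joint_action(rng.normal(size=(n_agents, n_actions)),
                              rng.normal(size=(n_agents, n_actions, n_agents, n_actions)),
                              np.ones((n_agents, n_agents))))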

2023

  • G. Xu, X. Kang, H. Yang, Y. Wu, W. Liu, J. Cao, and Y. Liu, “Distributed Multi-Vehicle Task Assignment and Motion Planning in Dense Environments," IEEE Transactions on Automation Science and Engineering, 2023.
    [BibTeX] [Abstract] [DOI]

    This article investigates the multi-vehicle task assignment and motion planning (MVTAMP) problem. In a dense environment, a fleet of non-holonomic vehicles is appointed to visit a series of target positions and then move to a specific ending area for real-world applications such as clearing threat targets, aid rescue, and package delivery. We present a novel hierarchical method to simultaneously address the multiple vehicles’ task assignment and motion planning problem. Unlike most related work, our method considers the MVTAMP problem applied to non-holonomic vehicles in large-scale scenarios. At the high level, we propose a novel distributed algorithm to address task assignment, which produces a task assignment scheme closer to the optimum by reducing path intersections between vehicles and tasks or between tasks. At the low level, we propose a novel distributed motion planning algorithm that resolves vehicle deadlocks in local planning and then quickly generates a feasible new velocity for each non-holonomic vehicle in dense environments, guaranteeing that each vehicle efficiently visits its assigned target positions. Extensive simulation experiments in large-scale scenarios for non-holonomic vehicles and two real-world experiments demonstrate the effectiveness and advantages of our method in practical applications. The source code of our method is available at https://github.com/wuuya1/LRGO. Note to Practitioners: The motivation for this article stems from the need to solve the multi-vehicle task assignment and motion planning (MVTAMP) problem for non-holonomic vehicles in dense environments. Many real-world applications exist, such as clearing threat targets, aid rescue, and package delivery. However, when vehicles need to continuously visit a series of assigned targets, motion planning for non-holonomic vehicles becomes more difficult because sharp turns are more likely to occur between adjacent target path nodes. In this case, a better task allocation scheme can often lead to more efficient target visits and reduce the vehicles’ total traveling distance. To bridge this, we propose a hierarchical method for solving the MVTAMP problem in large-scale complex scenarios. Numerous large-scale simulations and two real-world experiments show the effectiveness of the proposed method. Our future work will focus on the integrated task assignment and motion planning problem for non-holonomic vehicles in highly dynamic scenarios.

    @article{xu2023dmv,
    title = {Distributed Multi-Vehicle Task Assignment and Motion Planning in Dense Environments},
    author = {Gang Xu and Xiao Kang and Helei Yang and Yuchen Wu and Weiwei Liu and Junjie Cao and Yong Liu},
    year = 2023,
    journal = {IEEE Transactions on Automation Science and Engineering},
    doi = {10.1109/TASE.2023.3336076},
    abstract = {This article investigates the multi-vehicle task assignment and motion planning (MVTAMP) problem. In a dense environment, a fleet of non-holonomic vehicles is appointed to visit a series of target positions and then move to a specific ending area for real-world applications such as clearing threat targets, aid rescue, and package delivery. We presented a novel hierarchical method to simultaneously address the multiple vehicles' task assignment and motion planning problem. Unlike most related work, our method considers the MVTAMP problem applied to non-holonomic vehicles in large-scale scenarios. At the high level, we proposed a novel distributed algorithm to address task assignment, which produces a closer to the optimal task assignment scheme by reducing the intersection paths between vehicles and tasks or between tasks and tasks. At the low level, we proposed a novel distributed motion planning algorithm that addresses the vehicle deadlocks in local planning and then quickly generates a feasible new velocity for the non-holonomic vehicle in dense environments, guaranteeing that each vehicle efficiently visits its assigned target positions. Extensive simulation experiments in large-scale scenarios for non-holonomic vehicles and two real-world experiments demonstrate the effectiveness and advantages of our method in practical applications. The source code of our method can be available at https://github.com/wuuya1/LRGO. Note to Practitioners-The motivation for this article stems from the need to solve the multi-vehicle task assignment and motion planning (MVTAMP) problem for non-holonomic vehicles in dense environments. Many real-world applications exist, such as clearing threat targets, aid rescue, and package delivery. However, when vehicles need to continuously visit a series of assigned targets, motion planning for non-holonomic vehicles becomes more difficult because it is more likely to occur sharp turns between adjacent target path nodes. In this case, a better task allocation scheme can often lead to more efficient target visits and save all vehicles' total traveling distance. To bridge this, we proposed a hierarchical method for solving the MVTAMP problem in large-scale complex scenarios. The numerous large-scale simulations and two real-world experiments show the effectiveness of the proposed method. Our future work will focus on the integrated task assignment and motion planning problem for non-holonomic vehicles in highly dynamic scenarios.}
    }

  • Shipeng Bai, Jun Chen, Yu Yang, and Yong Liu, “Multi-Dimension Compression of Feed-Forward Network in Vision Transformers," Pattern Recognition Letters, vol. 176, pp. 56-61, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Vision Transformers (ViTs) have recently made a splash in the computer vision domain and achieved state-of-the-art results in many vision tasks. Nevertheless, due to their vast model size and high computational costs, transformer-based models are rarely adopted in real-world applications. Since the computational cost of the attention operation is quadratic in the input size, several compression methods for the Multi-Head Self-Attention (MHSA) module have been proposed, successfully reducing its FLOPs but with almost no parameter reduction. Meanwhile, the parameters and computational costs of the Feed-Forward Network (FFN) module exceed those of the MHSA, yet its compression has been explored far less. Consequently, we focus on the compression of the FFN layer and present a pruning method named Multi-Dimension Compression of Feed-Forward Network in Vision Transformers (MCF), which greatly reduces the model’s parameters and computational costs. Firstly, we identify the critical elements in the output of the FFN module and then employ them to guide the irregular sparsity of this layer, recognizing insignificant elements of the FFN layer that have less impact on the output. Subsequently, to discard the insignificant elements, we transform the irregular sparsity into regular sparsity and prune them, thus reducing the model’s parameters and obtaining a substantial speed-up during inference. Extensive results on ImageNet-1K validate the effectiveness of our proposed method, which obtains significant parameter and computational cost reductions with almost unimpaired generalization. For example, we compress DeiT-Tiny with a 42% reduction in FLOPs and a 33% reduction in parameters, almost without losing accuracy on the ImageNet dataset. Further, we verify the effectiveness of our method on a downstream task, using the pruned DeiT-Small as the backbone for object detection on the COCO dataset, obtaining efficiency gains without compromising performance.

    @article{bai2023mdc,
    title = {Multi-Dimension Compression of Feed-Forward Network in Vision Transformers},
    author = {Shipeng Bai and Jun Chen and Yu Yang and Yong Liu},
    year = 2023,
    journal = {Pattern Recognition Letters},
    volume = 176,
    pages = {56-61},
    doi = {10.1016/j.patrec.2023.10.014},
    abstract = {Vision Transformers (ViTs) have recently made a splash in computer vision domain and achieved state-of-the-art in many vision tasks. Nevertheless, due to their vast model size and high computational costs, rare transformer-based models are adopted in real-world applications. Since the computational costs of attention operation is the square of the input size, some compression methods for the Multi-Head Self-Attention (MHSA) module have been proposed, reducing its FLOPs successfully but almost without parameters reduction. Meanwhile, the number of parameters and computational costs in the Feed-Forward Network (FFN) module exceeds the MHSA larger, while its compression technologies have not been delved deeper. Consequently, we focus our insight on the compression of FFN layer and present a pruning method named Multi-Dimension Compression of Feed-Forward Network in Vision Transformers(MCF), which greatly reduces the model's parameters and computational costs. Firstly, we identify the critical elements in the output of the FFN module and then employ them to guide the irregular sparsity of this layer, recognizing insignificant elements of FFN layer that have less impact on the output. Successively, to discard the insignificant elements, we transform the irregular sparsity into regular sparsity and prune them, thus reducing the models' parameters and getting a substantial speed-up during inference. Extensive results on ImageNet-1K validate the effectiveness of our proposed method, which obtains significant parameters and computational costs reduction with almost unimpaired generalization. For example, we compress DeiT-Tiny with 42% reduction in FLOPs and 33% reduction in parameters, almost without losing accuracy on the ImageNet dataset. Further, we verify the effectiveness of our method in the downstream task, using the pruned DeiT-Small as the backbone for the object detection task on the COCO dataset, gaining revenue without compromising its performance.}
    }
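
    A minimal sketch of what structured FFN compression looks like mechanically: score the hidden units of an FFN block and keep only the top fraction, slicing both weight matrices accordingly. The importance score below is a simple placeholder, not the MCF criterion described in the paper.

    import numpy as np

    def prune_ffn_hidden_units(w1, b1, w2, keep_ratio=0.58):
        """Structured pruning of an FFN block y = act(x @ w1 + b1) @ w2 by
        removing the least important hidden units. w1: (D, H), b1: (H,), w2: (H, D)."""
        h = w1.shape[1]
        k = max(1, int(round(keep_ratio * h)))
        # Placeholder importance: product of the input and output weight norms
        # of each hidden unit (a stand-in for an output-guided criterion).
        score = np.linalg.norm(w1, axis=0) * np.linalg.norm(w2, axis=1)
        keep = np.sort(np.argsort(score)[-k:])     # indices of units to keep
        return w1[:, keep], b1[keep], w2[keep, :]

    # Toy usage: a DeiT-Tiny-like FFN (embedding dim 192, hidden dim 768).
    rng = np.random.default_rng(0)
    w1, b1, w2 = rng.normal(size=(192, 768)), np.zeros(768), rng.normal(size=(768, 192))
    pw1, pb1, pw2 = prune_ffn_hidden_units(w1, b1, w2)
    print(pw1.shape, pb1.shape, pw2.shape)         # (192, 445) (445,) (445, 192)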

  • M. Wang, J. Xing, J. Mei, Y. Liu, and Y. Jiang, “ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition," IEEE Transactions on Neural Networks and Learning Systems, 2023.
    [BibTeX] [Abstract] [DOI]

    The canonical approach to video action recognition dictates a neural network model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferability on new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters’ requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub “pre-train, adapt and fine-tune." This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task to act more like pre-training problems via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.

    @article{wang2023aclip,
    title = {ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition},
    author = {Mengmeng Wang and Jiazheng Xing and Jianbiao Mei and Yong Liu and Yunliang Jiang},
    year = 2023,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2023.3331841},
    abstract = {The canonical approach to video action recognition dictates a neural network model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferability on new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters' requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, adapt and fine-tune." This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task to act more like pre-training problems via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.}
    }
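
    At inference time, the video-text matching formulation above reduces to scoring a video embedding against one text embedding per label. A tiny CLIP-style sketch follows, with random placeholder embeddings standing in for the actual pretrained video and text encoders; the label strings and temperature are assumptions for the demo.

    import numpy as np

    def classify_by_video_text_matching(video_emb, text_embs, temperature=0.07):
        """Score a video embedding against one text embedding per class label
        and softmax the cosine similarities (zero-shot classification sketch)."""
        v = video_emb / np.linalg.norm(video_emb)
        t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
        logits = (t @ v) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(np.argmax(probs)), probs

    rng = np.random.default_rng(0)
    labels = ["archery", "bowling", "juggling"]        # label texts become prompts
    video_emb = rng.normal(size=512)
    text_embs = rng.normal(size=(len(labels), 512))
    idx, probs = classify_by_video_text_matching(video_emb, text_embs)
    print(labels[idx], probs.round(3))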

  • X. Lang, C. Chen, K. Tang, Y. Ma, J. Lv, Y. Liu, and X. Zuo, “Coco-LIC: Continuous-Time Tightly-Coupled LiDAR-Inertial-Camera Odometry using Non-Uniform B-spline," IEEE Robotics and Automation Letters, vol. 8, pp. 7074-7081, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we propose an efficient continuous-time LiDAR-Inertial-Camera Odometry, utilizing non-uniform B-splines to tightly couple measurements from the LiDAR, IMU, and camera. In contrast to uniform B-spline-based continuous-time methods, our non-uniform B-spline approach offers significant advantages in terms of achieving real-time efficiency and high accuracy. This is accomplished by dynamically and adaptively placing control points, taking into account the varying dynamics of the motion. To enable efficient fusion of heterogeneous LiDAR-Inertial-Camera data within a short sliding-window optimization, we assign depth to visual pixels using corresponding map points from a global LiDAR map, and formulate frame-to-map reprojection factors for the associated pixels in the current image frame. This circumvents the necessity for depth optimization of visual pixels, which typically entails a lengthy sliding window with numerous control points for continuous-time trajectory estimation. We conduct dedicated experiments on real-world datasets to demonstrate the advantage and efficacy of adopting the non-uniform continuous-time trajectory representation. Our LiDAR-Inertial-Camera odometry system is also extensively evaluated on both challenging scenarios with sensor degenerations and large-scale scenarios, and has shown comparable or higher accuracy than the state-of-the-art methods. The codebase of this paper will also be open-sourced at https://github.com/APRIL-ZJU/Coco-LIC.

    @article{lang2023lic,
    title = {Coco-LIC: Continuous-Time Tightly-Coupled LiDAR-Inertial-Camera Odometry using Non-Uniform B-spline},
    author = {Xiaolei Lang and Chao Chen and Kai Tang and Yukai Ma and Jiajun Lv and Yong Liu and Xingxing Zuo},
    year = 2023,
    journal = {IEEE Robotics and Automation Letters},
    volume = 8,
    pages = {7074-7081},
    doi = {10.1109/LRA.2023.3315542},
    abstract = {In this paper, we propose an efficient continuous-time LiDAR-Inertial-Camera Odometry, utilizing non-uniform B-splines to tightly couple measurements from the LiDAR, IMU, and camera. In contrast to uniform B-spline-based continuous-time methods, our non-uniform B-spline approach offers significant advantages in terms of achieving real-time efficiency and high accuracy. This is accomplished by dynamically and adaptively placing control points, taking into account the varying dynamics of the motion. To enable efficient fusion of heterogeneous LiDAR-Inertial-Camera data within a short sliding-window optimization, we assign depth to visual pixels using corresponding map points from a global LiDAR map, and formulate frame-to-map reprojection factors for the associated pixels in the current image frame. This way circumvents the necessity for depth optimization of visual pixels, which typically entails a lengthy sliding window with numerous control points for continuous-time trajectory estimation. We conduct dedicated experiments on real-world datasets to demonstrate the advantage and efficacy of adopting non-uniform continuous-time trajectory representation. Our LiDAR-Inertial-Camera odometry system is also extensively evaluated on both challenging scenarios with sensor degenerations and large-scale scenarios, and has shown comparable or higher accuracy than the state-of-the-art methods. The codebase of this paper will also be open-sourced at https://github.com/APRIL-ZJU/Coco-LIC.}
    }
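
    For intuition about what a non-uniform B-spline trajectory is, the sketch below evaluates one with De Boor's algorithm on a knot vector that is denser where the motion is assumed to be more dynamic. It only illustrates the representation; the paper's estimator additionally couples LiDAR/IMU/camera factors in a sliding-window optimizer, which is not shown here.

    import numpy as np

    def de_boor(x, knots, ctrl, degree=3):
        """Evaluate a (possibly non-uniform) B-spline at parameter x.
        knots: non-decreasing knot vector; ctrl: one control point per row,
        with len(ctrl) == len(knots) - degree - 1."""
        # Find the knot span k with knots[k] <= x < knots[k+1], clamped to valid range.
        k = np.searchsorted(knots, x, side="right") - 1
        k = min(max(k, degree), len(knots) - degree - 2)
        d = [ctrl[j + k - degree].astype(float) for j in range(degree + 1)]
        for r in range(1, degree + 1):
            for j in range(degree, r - 1, -1):
                denom = knots[j + 1 + k - r] - knots[j + k - degree]
                alpha = (x - knots[j + k - degree]) / denom
                d[j] = (1.0 - alpha) * d[j - 1] + alpha * d[j]
        return d[degree]

    # Knots are denser around t in [0.2, 0.4], i.e. where motion is more dynamic.
    knots = np.array([0, 0, 0, 0, 0.2, 0.3, 0.35, 0.4, 1.0, 1.0, 1.0, 1.0])
    ctrl = np.random.default_rng(0).normal(size=(8, 3))    # toy 3-D control points
    print(de_boor(0.32, knots, ctrl))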

  • B. Jiang, J. Chen, and Y. Liu, “Single-Shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration," Engineering Applications of Artificial Intelligence, vol. 126, p. 106816, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Applying CNNs on embedded systems is challenging due to model size limitations. Pruning and quantization can help, but they are time-consuming to apply separately. Our Single-Shot Pruning and Quantization strategy addresses these issues by quantizing and pruning in a single process. We evaluated our method on the CIFAR-10 and CIFAR-100 datasets for image classification. Our model is 69.4% smaller with little accuracy loss, and runs 6-8 times faster on NVIDIA Xavier NX hardware.

    @article{jiang2023ssp,
    title = {Single-Shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration},
    author = {Bofeng Jiang and Jun Chen and Yong Liu},
    year = 2023,
    journal = {Engineering Applications of Artificial Intelligence},
    volume = 126,
    pages = {106816},
    doi = {10.1016/j.engappai.2023.106816},
    abstract = {Applying CNN on embedded systems is challenging due to model size limitations. Pruning and quantization can help, but are time-consuming to apply separately. Our Single-Shot Pruning and Quantization strategy addresses these issues by quantizing and pruning in a single process. We evaluated our method on CIFAR-10 and CIFAR-100 datasets for image classification. Our model is 69.4% smaller with little accuracy loss, and runs 6-8 times faster on NVIDIA Xavier NX hardware.}
    }
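
    As a rough illustration of combining the two compression steps in one pass, here is a hedged sketch that magnitude-prunes a weight tensor and uniformly quantizes the surviving weights. The sparsity level, bit width, and criteria are placeholders; the paper's single-shot strategy and hardware-friendly layout differ.

    import numpy as np

    def prune_and_quantize(w, sparsity=0.5, n_bits=8):
        """Joint magnitude pruning and uniform symmetric quantization of one
        weight tensor (n_bits <= 8 assumed for the int8 code words)."""
        # 1) Prune: zero out the smallest-magnitude fraction of the weights.
        threshold = np.quantile(np.abs(w), sparsity)
        mask = np.abs(w) >= threshold
        w_pruned = w * mask
        # 2) Quantize the surviving weights to signed n_bits integers.
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(w_pruned).max() / qmax if np.any(mask) else 1.0
        w_int = np.clip(np.round(w_pruned / scale), -qmax - 1, qmax)
        return w_int.astype(np.int8), scale, mask

    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64)).astype(np.float32)
    w_int, scale, mask = prune_and_quantize(w)
    w_hat = w_int.astype(np.float32) * scale           # dequantized weights
    print(mask.mean(), np.abs(w - w_hat).mean())       # kept ratio, reconstruction error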

  • W. Liu, W. Jing, S. Liu, Y. Ruan, K. Zhang, J. Yang, and Y. Liu, “Expert Demonstrations Guide Reward Decomposition for Multi-Agent Cooperation," Neural Computing and Applications, vol. 35, pp. 19847-19863, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Humans are able to achieve good teamwork through collaboration, since the contributions of the actions from human team members are properly understood by each individual. Therefore, reasonable credit assignment is crucial for multi-agent cooperation. Existing work uses value decomposition algorithms to mitigate the credit assignment problem, but since they decompose the global value function at the level of the agents’ local value functions, the overall evaluation of the value function can easily lead to approximation errors. Moreover, such strategies are vulnerable to sparse reward scenarios. In this paper, we propose to use expert demonstrations to guide the team reward decomposition at each time step, rather than value decomposition. The proposed method computes the reward ratio of each agent according to the similarity between the state-action pair of the agent and the expert demonstrations. In addition, under this setting, each agent can independently train its value function and evaluate its behavior, which makes the algorithm highly robust to team rewards. Moreover, the proposed method constrains the policy to collect data with a distribution similar to the expert data during exploration, which makes policy updates more robust. We conduct extensive experiments to validate our proposed method in various MARL environments; the results show that our algorithm outperforms the state-of-the-art algorithms in most scenarios, our method is robust to various reward functions, and the trajectories produced by our policy are closer to those of the expert policy.

    @article{liu2023edg,
    title = {Expert Demonstrations Guide Reward Decomposition for Multi-Agent Cooperation},
    author = {Weiwei Liu and Wei Jing and Shanqi Liu and Yudi Ruan and Kexin Zhang and Jian Yang and Yong Liu},
    year = 2023,
    journal = {Neural Computing and Applications},
    volume = 35,
    pages = {19847-19863},
    doi = {10.1007/s00521-023-08785-6},
    abstract = {Humans are able to achieve good teamwork through collaboration, since the contributions of the actions from human team members are properly understood by each individual. Therefore, reasonable credit assignment is crucial for multi-agent cooperation. Although existing work uses value decomposition algorithms to mitigate the credit assignment problem, since they decompose the global value function at multi-agents' local value function level, the overall evaluation of the value function can easily lead to approximation errors. Moreover, such strategies are vulnerable to sparse reward scenarios. In this paper, we propose to use expert demonstrations to guide the team reward decomposition at each time step, rather than value decomposition. The proposed method computes the reward ratio of each agent according to the similarity between the state-action pair of the agent and the expert demonstrations. In addition, under this setting, each agent can independently train its value function and evaluate its behavior, which makes the algorithm highly robust to team rewards. Moreover, the proposed method constrains the policy to collect data with similar distribution to the expert data during the exploration, which makes policy update more robust. We conduct extensive experiments to validate our proposed method in various MARL environments, the results show that our algorithm outperforms the state-of-the-art algorithms in most scenarios; our method is robust to various reward functions; and the trajectories by our policy is closer to that of the expert policy.}
    }
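
    A toy sketch of the reward-decomposition idea above: the team reward at a time step is split among agents in proportion to how similar each agent's state-action pair is to its nearest expert demonstration. The RBF similarity and the feature layout are illustrative assumptions, not necessarily the paper's exact measure.

    import numpy as np

    def decompose_team_reward(team_reward, agent_sa, expert_sa, beta=1.0):
        """Split a scalar team reward among agents by similarity to expert data.
        agent_sa : (N, D) current state-action features, one row per agent
        expert_sa: (M, D) expert demonstration state-action features."""
        # Similarity of each agent to its nearest expert sample (RBF kernel).
        d2 = ((agent_sa[:, None, :] - expert_sa[None, :, :]) ** 2).sum(-1)
        sim = np.exp(-beta * d2.min(axis=1))
        ratios = sim / sim.sum()
        return team_reward * ratios

    rng = np.random.default_rng(0)
    agents = rng.normal(size=(3, 6))
    experts = rng.normal(size=(50, 6))
    print(decompose_team_reward(10.0, agents, experts))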

  • Y. Liang, J. Zhang, S. Zhao, R. Wu, Y. Liu, and S. Pan, “Omni-Frequency Channel-Selection Representations for Unsupervised Anomaly Detection," IEEE Transactions on Image Processing, vol. 32, pp. 4327-4340, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Density-based and classification-based methods have ruled unsupervised anomaly detection in recent years, while reconstruction-based methods are rarely mentioned due to their poor reconstruction ability and low performance. However, reconstruction-based methods require no costly extra training samples, which makes unsupervised training more practical, so this paper focuses on improving the reconstruction-based approach and proposes a novel Omni-frequency Channel-selection Reconstruction (OCR-GAN) network to handle the sensory anomaly detection task from a frequency perspective. Concretely, we propose a Frequency Decoupling (FD) module to decouple the input image into different frequency components and model the reconstruction process as a combination of parallel omni-frequency image restorations, as we observe a significant difference in the frequency distribution of normal and abnormal images. Given the correlation among multiple frequencies, we further propose a Channel Selection (CS) module that performs frequency interaction among different encoders by adaptively selecting different channels. Abundant experiments demonstrate the effectiveness and superiority of our approach over different kinds of methods, e.g., achieving a new state-of-the-art 98.3 detection AUC on the MVTec AD dataset without extra training data, markedly surpassing the reconstruction-based baseline by +38.11 and the current SOTA method by +0.31. The source code is available in the additional materials.

    @article{liang2023omni,
    title = {Omni-Frequency Channel-Selection Representations for Unsupervised Anomaly Detection},
    author = {Yufei Liang and Jiangning Zhang and Shiwei Zhao and Runze Wu and Yong Liu and Shuwen Pan},
    year = 2023,
    journal = {IEEE Transactions on Image Processing},
    volume = 32,
    pages = {4327-4340},
    doi = {10.1109/TIP.2023.3293772},
    abstract = {Density-based and classification-based methods have ruled unsupervised anomaly detection in recent years, while reconstruction-based methods are rarely mentioned for the poor reconstruction ability and low performance. However, the latter requires no costly extra training samples for the unsupervised training that is more practical, so this paper focuses on improving reconstruction-based method and proposes a novel Omni-frequency Channel-selection Reconstruction (OCR-GAN) network to handle sensory anomaly detection task in a perspective of frequency. Concretely, we propose a Frequency Decoupling (FD) module to decouple the input image into different frequency components and model the reconstruction process as a combination of parallel omni-frequency image restorations, as we observe a significant difference in the frequency distribution of normal and abnormal images. Given the correlation among multiple frequencies, we further propose a Channel Selection (CS) module that performs frequency interaction among different encoders by adaptively selecting different channels. Abundant experiments demonstrate the effectiveness and superiority of our approach over different kinds of methods, e.g., achieving a new state-of-the-art 98.3 detection AUC on the MVTec AD dataset without extra training data that markedly surpasses the reconstruction-based baseline by +38.11. and the current SOTA method by +0.31.. The source code is available in the additional materials.}
    }
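
    To give a feel for what "decoupling an image into frequency components" means in practice, here is a small FFT-based low/high split. It is a generic stand-in only: OCR-GAN's FD module and omni-frequency branches are learned, not a fixed ideal filter, and the cutoff value below is an arbitrary assumption.

    import numpy as np

    def frequency_decouple(img, cutoff=0.1):
        """Split a grayscale image into low- and high-frequency components
        using an ideal low-pass mask in the Fourier domain."""
        f = np.fft.fftshift(np.fft.fft2(img))
        h, w = img.shape
        yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
        radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
        low_mask = radius <= cutoff
        low = np.real(np.fft.ifft2(np.fft.ifftshift(f * low_mask)))
        high = img - low                      # the two branches sum back to the input
        return low, high

    img = np.random.default_rng(0).random((64, 64))
    low, high = frequency_decouple(img)
    print(np.allclose(low + high, img))       # True by construction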

  • J. Chen, S. Bai, T. Huang, M. Wang, G. Tian, and Y. Liu, “Data-Free Quantization via Mixed-Precision Compensation without Fine-Tuning," Pattern Recognition, vol. 143, p. 109780, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Neural network quantization is a very promising solution in the field of model compression, but its resulting accuracy depends highly on a training/fine-tuning process and requires the original data. This not only brings heavy computation and time costs but also is not conducive to privacy and sensitive information protection. Therefore, a few recent works have started to focus on data-free quantization. However, data-free quantization does not perform well when dealing with ultra-low precision quantization. Although researchers partially address this problem with generative methods for synthetic data, data synthesis requires a lot of computation and time. In this paper, we propose a data-free mixed-precision compensation (DF-MPC) method to recover the performance of an ultra-low precision quantized model without any data or fine-tuning process. By assuming that the quantization error caused by a low-precision quantized layer can be restored via the reconstruction of a high-precision quantized layer, we mathematically formulate the reconstruction loss between the pre-trained full-precision model and its layer-wise mixed-precision quantized model. Based on our formulation, we theoretically deduce the closed-form solution by minimizing the reconstruction loss of the feature maps. Since DF-MPC does not require any original/synthetic data, it is a more efficient method to approximate the full-precision model. Experimentally, our DF-MPC achieves higher accuracy for ultra-low precision quantized models than recent methods, without any data or fine-tuning process.

    @article{chen2023dfq,
    title = {Data-Free Quantization via Mixed-Precision Compensation without Fine-Tuning},
    author = {Jun Chen and Shipeng Bai and Tianxin Huang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2023,
    journal = {Pattern Recognition},
    volume = 143,
    pages = {109780},
    doi = {10.1016/j.patcog.2023.109780},
    abstract = {Neural network quantization is a very promising solution in the field of model compression, but its resulting accuracy highly depends on a training/fine-tuning process and requires the original data. This not only brings heavy computation and time costs but also is not conducive to privacy and sensitive information protection. Therefore, a few recent works are starting to focus on data-free quantization. However, data free quantization does not perform well while dealing with ultra-low precision quantization. Although researchers utilize generative methods of synthetic data to address this problem partially, data synthesis needs to take a lot of computation and time. In this paper, we propose a data-free mixed-precision compensation (DF-MPC) method to recover the performance of an ultra-low precision quantized model without any data and fine-tuning process. By assuming the quantized error caused by a low-precision quantized layer can be restored via the reconstruction of a high-precision quantized layer, we mathematically formulate the reconstruction loss between the pre-trained full-precision model and its layer-wise mixed-precision quantized model. Based on our formulation, we theoretically deduce the closed-form solution by minimizing the reconstruction loss of the feature maps. Since DF-MPC does not require any original/synthetic data, it is a more efficient method to approximate the full-precision model. Experimentally, our DF-MPC is able to achieve higher accuracy for an ultra-low precision quantized model compared to the recent methods without any data and fine-tuning process.}
    }

  • Y. Liang, M. Wang, Y. Jin, S. Pan, and Y. Liu, “Hierarchical Supervisions with Two-Stream Network for Deepfake Detection," Pattern Recognition Letters, vol. 172, pp. 121-127, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Recently, the quality of face generation and manipulation has reached impressive levels, making it difficult even for humans to distinguish real and fake faces. At the same time, methods to distinguish fake faces from real ones have emerged, such as Deepfake detection. However, the task of Deepfake detection remains challenging, especially for the low-quality fake images circulating on the Internet and the diversity of face generation methods. In this work, we propose a new Deepfake detection network that can effectively distinguish both high-quality and low-quality faces generated by various generation methods. First, we design a two-stream framework that incorporates a regular spatial stream and a frequency stream to handle the low-quality problem, since we find that the frequency-domain artifacts of low-quality images are preserved. Second, we introduce hierarchical supervisions in a coarse-to-fine manner, which consist of a coarse binary classification branch to classify reals and fakes and a five-category classification branch to classify reals and four different types of fakes. Extensive experiments have proved the effectiveness of our framework on several widely used datasets.

    @article{liang2023hs,
    title = {Hierarchical Supervisions with Two-Stream Network for Deepfake Detection},
    author = {Yufei Liang and Mengmeng Wang and Yining Jin and Shuwen Pan and Yong Liu},
    year = 2023,
    journal = {Pattern Recognition Letters},
    volume = 172,
    pages = {121-127},
    doi = {10.1016/j.patrec.2023.05.029},
    abstract = {Recently, the quality of face generation and manipulation has reached impressive levels, making it difficult even for humans to distinguish real and fake faces. At the same time, methods to distinguish fake faces from reals came out, such as Deepfake detection. However, the task of Deepfake detection remains challenging, especially the low-quality fake images circulating on the Internet and the diversity of face generation methods. In this work, we propose a new Deepfake detection network that could effectively distinguish both high-quality and low-quality faces generated by various generation methods. First, we design a two-stream framework that incorporates a regular spatial stream and a frequency stream to handle the low-quality problem since we find that the frequency domain artifacts of low-quality images will be preserved. Second, we introduce hierarchical supervisions in a coarse-to-fine manner, which consists of a coarse binary classification branch to classify reals and fakes and a five-category classification branch to classify reals and four different types of fakes. Extensive experiments have proved the effectiveness of our framework on several widely used datasets.}
    }

  • Jianbiao Mei, Mengmeng Wang, Yu Yang, Yanjun Li, and Yong Liu, “Fast Real-Time Video Object Segmentation with a Tangled Memory Network," ACM Transactions on Intelligent Systems and Technology, vol. 14, p. 51, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this article, we present a fast real-time tangled memory network that segments the objects effectively and efficiently for semi-supervised video object segmentation (VOS). We propose a tangled reference encoder and a memory bank organization mechanism based on a state estimator to fully utilize the mask features and alleviate memory overhead and computational burden brought by the unlimited memory bank used in many memory-based methods. First, the tangled memory network exploits the mask features that uncover abundant object information like edges and contours but are not fully explored in existing methods. Specifically, a tangled two-stream reference encoder is designed to extract and fuse the features from both RGB frames and the predicted masks. Second, to indicate the quality of the predicted mask and feed back the online prediction state for organizing the memory bank, we devise a target state estimator to learn the IoU score between the predicted mask and ground truth. Moreover, to accelerate the forward process and avoid memory overflow, we use a memory bank of fixed size to store historical features by designing a new efficient memory bank organization mechanism based on the mask state score provided by the state estimator. We conduct comprehensive experiments on the public benchmarks DAVIS and YouTube-VOS, demonstrating that our method obtains competitive results while running at high speed (66 FPS on the DAVIS16-val set).

    @article{mei2023fast,
    title = {Fast Real-Time Video Object Segmentation with a Tangled Memory Network},
    author = {Jianbiao Mei and Mengmeng Wang and Yu Yang and Yanjun Li and Yong Liu},
    year = 2023,
    journal = {ACM Transactions on Intelligent Systems and Technology},
    volume = 14,
    pages = {51},
    doi = {10.1145/3585076},
    abstract = {In this article, we present a fast real-time tangled memory network that segments the objects effectively and efficiently for semi-supervised video object segmentation (VOS). We propose a tangled reference encoder and a memory bank organization mechanism based on a state estimator to fully utilize the mask features and alleviate memory overhead and computational burden brought by the unlimited memory bank used in many memory-based methods. First, the tangled memory network exploits the mask features that uncover abundant object information like edges and contours but are not fully explored in existing methods. Specifically, a tangled two-stream reference encoder is designed to extract and fuse the features from both RGB frames and the predicted masks. Second, to indicate the quality of the predicted mask and feedback the online prediction state for organizing the memory bank, we devise a target state estimator to learn the IoU score between the predicted mask and ground truth. Moreover, to accelerate the forward process and avoid memory overflow, we use a memory bank of fixed size to store historical features by designing a new efficient memory bank organization mechanism based on the mask state score provided by the state estimator. We conduct comprehensive experiments on the public benchmarks DAVIS and YouTube-VOS, demonstrating that our method obtains competitive results while running at high speed (66 FPS on the DAVIS16-val set).}
    }
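
    A small sketch of the fixed-size memory bank bookkeeping described above: frames are admitted with the quality score predicted by the state estimator, the reference frame is always kept, and the lowest-quality frame is evicted when the budget is exceeded. The eviction rule and the class below are an illustrative simplification, not the paper's exact organization mechanism.

    import heapq

    class FixedSizeMemoryBank:
        """Keep at most `capacity` memory frames: the first (reference) frame is
        never evicted; otherwise the frame with the lowest quality score goes."""
        def __init__(self, capacity=6):
            self.capacity = capacity
            self.reference = None              # (score, frame_id, feature)
            self.heap = []                     # min-heap of (score, frame_id, feature)

        def add(self, frame_id, feature, score):
            if self.reference is None:
                self.reference = (score, frame_id, feature)
                return
            heapq.heappush(self.heap, (score, frame_id, feature))
            if len(self.heap) > self.capacity - 1:
                heapq.heappop(self.heap)       # drop the lowest-quality memory frame

        def frames(self):
            return [self.reference] + sorted(self.heap, key=lambda x: x[1])

    bank = FixedSizeMemoryBank(capacity=3)
    for t, score in enumerate([0.95, 0.6, 0.9, 0.4, 0.8]):
        bank.add(t, feature=f"feat_{t}", score=score)
    print([(fid, s) for s, fid, _ in bank.frames()])   # kept (frame_id, score) pairs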

  • L. Li, Y. Ma, K. Tang, X. Zhao, C. Chen, J. Huang, J. Mei, and Y. Liu, “Geo-localization with Transformer-based 2D-3D match Network," IEEE Robotics and Automation Letters (RA-L), vol. 8, pp. 4855-4862, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    This letter presents a novel method for geographical localization by registering satellite maps with LiDAR point clouds. This method includes a Transformer-based 2D-3D matching network called D-GLSNet that directly matches the LiDAR point clouds and satellite images through end-to-end learning. Without the need for feature point detection, D-GLSNet provides accurate pixel-to-point association between the LiDAR point clouds and satellite images. And then, we can easily calculate the horizontal offset (Δx,Δy) and angular deviation Δθyaw between them, thereby achieving accurate registration. To demonstrate our network’s localization potential, we have designed a Geo-localization Node (GLN) that implements geographical localization and is plug-and-play in the SLAM system. Compared to GPS, GLN is less susceptible to external interference, such as building occlusion. In urban scenarios, our proposed D-GLSNet can output high-quality matching, enabling GLN to function stably and deliver more accurate localization results. Extensive experiments on the KITTI dataset show that our D-GLSNet method achieves a mean Relative Translation Error (RTE) of 1.43 m. Furthermore, our method outperforms state-of-the-art LiDAR-based geospatial localization methods when combined with odometry.

    @article{li2023glw,
    title = {Geo-localization with Transformer-based 2D-3D match Network},
    author = {Laijian Li and Yukai Ma and Kai Tang and Xiangrui Zhao and Chao Chen and Jianxin Huang and Jianbiao Mei and Yong Liu},
    year = 2023,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = 8,
    pages = {4855-4862},
    doi = {10.1109/LRA.2023.3290526},
    abstract = {This letter presents a novel method for geographical localization by registering satellite maps with LiDAR point clouds. This method includes a Transformer-based 2D-3D matching network called D-GLSNet that directly matches the LiDAR point clouds and satellite images through end-to-end learning. Without the need for feature point detection, D-GLSNet provides accurate pixel-to-point association between the LiDAR point clouds and satellite images. And then, we can easily calculate the horizontal offset (Δx,Δy) and angular deviation Δθyaw between them, thereby achieving accurate registration. To demonstrate our network's localization potential, we have designed a Geo-localization Node (GLN) that implements geographical localization and is plug-and-play in the SLAM system. Compared to GPS, GLN is less susceptible to external interference, such as building occlusion. In urban scenarios, our proposed D-GLSNet can output high-quality matching, enabling GLN to function stably and deliver more accurate localization results. Extensive experiments on the KITTI dataset show that our D-GLSNet method achieves a mean Relative Translation Error (RTE) of 1.43 m. Furthermore, our method outperforms state-of-the-art LiDAR-based geospatial localization methods when combined with odometry.}
    }
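
    Once pixel-to-point matches are available, recovering the horizontal offset (Δx, Δy) and yaw Δθ is a standard closed-form 2-D rigid registration. The sketch below shows that generic step (Kabsch/Procrustes on synthetic matches); it is not D-GLSNet itself, which learns the matching end to end.

    import numpy as np

    def estimate_2d_rigid_transform(src, dst):
        """Least-squares 2-D rigid registration: recover the rotation (yaw) and
        translation that map matched `src` points onto `dst` points, both (N, 2)."""
        mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
        h = (src - mu_s).T @ (dst - mu_d)
        u, _, vt = np.linalg.svd(h)
        d = np.sign(np.linalg.det(vt.T @ u.T))          # guard against reflections
        r = vt.T @ np.diag([1.0, d]) @ u.T
        t = mu_d - r @ mu_s
        yaw = np.arctan2(r[1, 0], r[0, 0])
        return t, yaw                                    # (dx, dy), yaw offset

    rng = np.random.default_rng(0)
    src = rng.normal(size=(100, 2))
    theta, trans = 0.3, np.array([4.0, -2.0])
    rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    dst = src @ rot.T + trans + 0.01 * rng.normal(size=src.shape)
    print(estimate_2d_rigid_transform(src, dst))         # approx ((4.0, -2.0), 0.3)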

  • W. Liu, L. Peng, L. Wen, J. Yang, and Y. Liu, “Decomposing Shared Networks for Separate Cooperation with Multi-agent Reinforcement Learning," Information Sciences, vol. 641, p. 119085, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Sharing network parameters between agents is an essential and typical operation for improving the scalability of multi-agent reinforcement learning algorithms. However, agents with different tasks sharing the same network parameters are not conducive to distinguishing the agents’ skills. In addition, the importance of communication between agents undertaking the same task is much higher than that with external agents. Therefore, we propose Dual Cooperation Networks (DCN). In order to distinguish whether agents undertake the same task, all agents are grouped according to their status through a graph neural network instead of traditional proximity. The agents communicate within the group to achieve strong cooperation. After that, the global value function is decomposed by groups to facilitate cooperation between groups. Finally, we verify the method both in simulation and on physical hardware, where the algorithm achieves excellent performance.

    @article{liu2023dsn,
    title = {Decomposing Shared Networks for Separate Cooperation with Multi-agent Reinforcement Learning},
    author = {Weiwei Liu and Linpeng Peng and Licheng Wen and Jian Yang and Yong Liu},
    year = 2023,
    journal = {Information Sciences},
    volume = 641,
    pages = {119085},
    doi = {10.1016/j.ins.2023.119085},
    abstract = {Sharing network parameters between agents is an essential and typical operation to improve the scalability of multi-agent reinforcement learning algorithms. However, agents with different tasks sharing the same network parameters are not conducive to distinguishing the agents' skills. In addition, the importance of communication between agents undertaking the same task is much higher than that with external agents. Therefore, we propose Dual Cooperation Networks (DCN). In order to distinguish whether agents undertake the same task, all agents are grouped according to their status through the graph neural network instead of the traditional proximity. The agent communicates within the group to achieve strong cooperation. After that, the global value function is decomposed by groups to facilitate cooperation between groups. Finally, we have verified it in simulation and physical hardware, and the algorithm has achieved excellent performance.}
    }

  • J. Lv, X. Lang, J. Xu, M. Wang, Y. Liu, and X. Zuo, “Continuous-Time Fixed-Lag Smoothing for LiDAR-Inertial-Camera SLAM," IEEE/ASME Transactions on Mechatronics, vol. 28, pp. 2259-2270, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Localization and mapping with heterogeneous multi-sensor fusion have been prevalent in recent years. To adequately fuse multi-modal sensor measurements received at different time instants and different frequencies, we estimate the continuous-time trajectory by fixed-lag smoothing within a factor-graph optimization framework. With the continuous-time formulation, we can query poses at any time instants corresponding to the sensor measurements. To bound the computation complexity of the continuous-time fixed-lag smoother, we maintain temporal and keyframe sliding windows with constant size, and probabilistically marginalize out control points of the trajectory and other states, which allows preserving prior information for future sliding-window optimization. Based on continuous-time fixed-lag smoothing, we design tightly-coupled multi-modal SLAM algorithms with a variety of sensor combinations, like the LiDAR-inertial and LiDAR-inertial-camera SLAM systems, in which online time-offset calibration is also naturally supported. More importantly, benefiting from the marginalization and our derived analytical Jacobians for optimization, the proposed continuous-time SLAM systems can achieve real-time performance regardless of the high complexity of continuous-time formulation. The proposed multi-modal SLAM systems have been widely evaluated on three public datasets and self-collected datasets. The results demonstrate that the proposed continuous-time SLAM systems can achieve high-accuracy pose estimations and outperform existing state-of-the-art methods. To benefit the research community, we will open source our code at {https://github.com/APRIL-ZJU/clic}.

    @article{lv2023ctfl,
    title = {Continuous-Time Fixed-Lag Smoothing for LiDAR-Inertial-Camera SLAM},
    author = {Jiajun Lv and Xiaolei Lang and Jinhong Xu and Mengmeng Wang and Yong Liu and Xingxing Zuo},
    year = 2023,
    journal = {IEEE/ASME Transactions on Mechatronics},
    volume = 28,
    pages = {2259-2270},
    doi = {10.1109/TMECH.2023.3241398},
    abstract = {Localization and mapping with heterogeneous multi-sensor fusion have been prevalent in recent years. To adequately fuse multi-modal sensor measurements received at different time instants and different frequencies, we estimate the continuous-time trajectory by fixed-lag smoothing within a factor-graph optimization framework. With the continuous-time formulation, we can query poses at any time instants corresponding to the sensor measurements. To bound the computation complexity of the continuous-time fixed-lag smoother, we maintain temporal and keyframe sliding windows with constant size, and probabilistically marginalize out control points of the trajectory and other states, which allows preserving prior information for future sliding-window optimization. Based on continuous-time fixed-lag smoothing, we design tightly-coupled multi-modal SLAM algorithms with a variety of sensor combinations, like the LiDAR-inertial and LiDAR-inertial-camera SLAM systems, in which online timeoffset calibration is also naturally supported. More importantly, benefiting from the marginalization and our derived analytical Jacobians for optimization, the proposed continuous-time SLAM systems can achieve real-time performance regardless of the high complexity of continuous-time formulation. The proposed multi-modal SLAM systems have been widely evaluated on three public datasets and self-collect datasets. The results demonstrate that the proposed continuous-time SLAM systems can achieve high-accuracy pose estimations and outperform existing state-of-the-art methods. To benefit the research community, we will open source our code at {https://github.com/APRIL-ZJU/clic}.}
    }
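
    The following is a minimal, illustrative sketch of the pose-query idea behind the continuous-time formulation in the entry above: a position-only trajectory represented by a uniform cubic B-spline can be evaluated at any sensor timestamp. The full method works on SE(3) splines with probabilistic marginalization, which this toy example omits; the function and variable names below are illustrative, not the paper's implementation.

    import numpy as np

    # Basis matrix of a uniform cubic B-spline (one common convention).
    B = (1.0 / 6.0) * np.array([[ 1,  4,  1, 0],
                                [-3,  0,  3, 0],
                                [ 3, -6,  3, 0],
                                [-1,  3, -3, 1]])

    def query_position(ctrl_pts, t0, dt, t):
        """Evaluate the spline position at an arbitrary timestamp t.

        ctrl_pts : (N, 3) control points, spaced dt seconds apart starting at t0.
        """
        s = (t - t0) / dt                  # continuous control-point index
        i = int(np.floor(s))               # segment index
        u = s - i                          # normalized time inside the segment, in [0, 1)
        cps = ctrl_pts[i - 1:i + 3]        # the four control points governing this segment
        u_vec = np.array([1.0, u, u**2, u**3])
        return u_vec @ B @ cps             # (3,) interpolated position

    # Usage: any sensor timestamp inside the sliding window can be queried directly.
    ctrl = np.cumsum(np.random.randn(10, 3), axis=0)
    p = query_position(ctrl, t0=0.0, dt=0.1, t=0.234)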

  • T. Huang, Z. Chen, W. Gao, Z. Xue, and Y. Liu, “A USV-UAV Cooperative Trajectory Planning Algorithm with Hull Dynamic Constraints," Sensors, vol. 23, p. 1845, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Efficient trajectory generation in complex dynamic environments remains an open problem in the operation of an unmanned surface vehicle (USV). The perception of a USV is usually disturbed by the swing of the hull and the ambient weather, making it challenging to plan optimal USV trajectories. In this paper, a cooperative trajectory planning algorithm for a coupled USV-UAV system is proposed to ensure that a USV can execute a safe and smooth path as it autonomously advances through multi-obstacle maps. Specifically, the unmanned aerial vehicle (UAV) plays the role of a flight sensor, providing real-time global map and obstacle information with a lightweight semantic segmentation network and 3D projection transformation. An initial obstacle-avoidance trajectory is generated by a graph-based search method. Considering the unique under-actuated kinematic characteristics of the USV, a numerical optimization method based on hull dynamic constraints is introduced to make the trajectory easier to track for motion control. Finally, a motion control method based on nonlinear model predictive control (NMPC) with a minimum-energy-consumption constraint during execution is proposed. Experimental results verify the effectiveness of the whole system, and the generated trajectory is locally optimal for the USV and can be tracked with considerable accuracy.

    @article{huang2023usv,
    title = {A USV-UAV Cooperative Trajectory Planning Algorithm with Hull Dynamic Constraints},
    author = {Tao Huang and Zhe Chen and Wang Gao and Zhenfeng Xue and Yong Liu},
    year = 2023,
    journal = {Sensors},
    volume = 23,
    pages = {1845},
    doi = {10.3390/s23041845},
    abstract = {Efficient trajectory generation in complex dynamic environments remains an open problem in the operation of an unmanned surface vehicle (USV). The perception of a USV is usually interfered by the swing of the hull and the ambient weather, making it challenging to plan optimal USV trajectories. In this paper, a cooperative trajectory planning algorithm for a coupled USV-UAV system is proposed to ensure that a USV can execute a safe and smooth path as it autonomously advances through multi-obstacle maps. Specifically, the unmanned aerial vehicle (UAV) plays the role of a flight sensor, providing real-time global map and obstacle information with a lightweight semantic segmentation network and 3D projection transformation. An initial obstacle avoidance trajectory is generated by a graph-based search method. Concerning the unique under-actuated kinematic characteristics of the USV, a numerical optimization method based on hull dynamic constraints is introduced to make the trajectory easier to be tracked for motion control. Finally, a motion control method based on NMPC with the lowest energy consumption constraint during execution is proposed. Experimental results verify the effectiveness of the whole system, and the generated trajectory is locally optimal for USV with considerable tracking accuracy.}
    }

  • C. Chen, Y. Ma, J. Lv, X. Zhao, L. Li, Y. Liu, and W. Gao, “OL-SLAM: A Robust and Versatile System of Object Localization and SLAM," Sensors, vol. 23, p. 801, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper proposes a real-time, versatile Simultaneous Localization and Mapping (SLAM) and object localization system, which fuses measurements from LiDAR, camera, Inertial Measurement Unit (IMU), and Global Positioning System (GPS). Our system can locate itself in an unknown environment and build a scene map, based on which we can also track and obtain the global location of objects of interest. Specifically, our SLAM subsystem consists of four parts: LiDAR-inertial odometry, visual-inertial odometry, GPS-inertial odometry, and global pose-graph optimization. The target-tracking and positioning subsystem is developed based on YOLOv4. Benefiting from the use of the GPS sensor in the SLAM system, we can obtain the global positioning information of the target; therefore, the system can be highly useful in military operations, rescue and disaster relief, and other scenarios.

    @article{chen2023ols,
    title = {OL-SLAM: A Robust and Versatile System of Object Localization and SLAM},
    author = {Chao Chen and Yukai Ma and Jiajun Lv and Xiangrui Zhao and Laijian Li and Yong Liu and Wang Gao},
    year = 2023,
    journal = {Sensors},
    volume = 23,
    pages = {801},
    doi = {10.3390/s23020801},
    abstract = {This paper proposes a real-time, versatile Simultaneous Localization and Mapping (SLAM) and object localization system, which fuses measurements from LiDAR, camera, Inertial Measurement Unit (IMU), and Global Positioning System (GPS). Our system can locate itself in an unknown environment and build a scene map based on which we can also track and obtain the global location of objects of interest. Precisely, our SLAM subsystem consists of the following four parts: LiDAR-inertial odometry, Visual-inertial odometry, GPS-inertial odometry, and global pose graph optimization. The target-tracking and positioning subsystem is developed based on YOLOv4. Benefiting from the use of GPS sensor in the SLAM system, we can obtain the global positioning information of the target; therefore, it can be highly useful in military operations, rescue and disaster relief, and other scenarios.}
    }

  • L. Liu, X. Song, J. Sun, X. Lyu, L. Li, Y. Liu, and L. Zhang, “MFF-Net: Towards Efficient Monocular Depth Completion with Multi-Modal Feature Fusion," IEEE Robotics and Automation Letters (RA-L), vol. 8, pp. 920-927, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Remarkable progress has been achieved by current depth completion approaches, which produce dense depth maps from sparse depth maps and corresponding color images. However, the performance of these approaches is limited by insufficient feature extraction and fusion. In this work, we propose an efficient multi-modal feature fusion based depth completion framework (MFF-Net), which can efficiently extract and fuse features of different modalities in both the encoding and decoding processes, so that more depth details and better performance can be obtained. Specifically, the encoding process contains three branches, where features of different modalities can be extracted from both the color and sparse depth inputs, and a multi-feature channel shuffle is utilized to enhance these features so that features with better representation abilities can be obtained. Meanwhile, the decoding process contains two branches to sufficiently fuse the extracted multi-modal features, and a multi-level weighted combination is employed to further enhance and fuse features of different modalities, leading to more accurate and better refined depth maps. Extensive experiments on different benchmarks demonstrate that we achieve state-of-the-art performance among online methods. Meanwhile, we further evaluate the predicted dense depth with RGB-D SLAM, a commonly used downstream robotic perception task, and obtain higher accuracy on the vehicle’s trajectory on the KITTI odometry dataset, which demonstrates the high quality of our depth prediction and the potential of improving related downstream tasks with depth completion results.

    @article{liu2023mff,
    title = {MFF-Net: Towards Efficient Monocular Depth Completion with Multi-Modal Feature Fusion},
    author = {Lina Liu and Xibin Song and Jiadai Sun and Xiaoyang Lyu and Lin Li and Yong Liu and Liangjun Zhang},
    year = 2023,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = 8,
    pages = {920-927},
    doi = {10.1109/LRA.2023.3234776},
    abstract = {Remarkable progress has been achieved by current depth completion approaches, which produce dense depth maps from sparse depth maps and corresponding color images. However, the performances of these approaches are limited due to the insufficient feature extractions and fusions. In this work, we propose an efficient multi-modal feature fusion based depth completion framework (MFF-Net), which can efficiently extract and fuse features with different modals in both encoding and decoding processes, thus more depth details with better performance can be obtained. In specific, the encoding process contains three branches where different modals of features from both color and sparse depth input can be extracted, and a multi-feature channel shuffle is utilized to enhance these features thus features with better representation abilities can be obtained. Meanwhile, the decoding process contains two branches to sufficiently fuse the extracted multi-modal features, and a multi-level weighted combination is employed to further enhance and fuse features with different modals, thus leading to more accurate and better refined depth maps. Extensive experiments on different benchmarks demonstrate that we achieve state-of-the-art among online methods. Meanwhile, we further evaluate the predicted dense depth by RGB-D SLAM, which is a commonly used downstream robotic perception task, and higher accuracy on vehicle's trajectory can be obtained in KITTI odometry dataset, which demonstrates the high quality of our depth prediction and the potential of improving the related downstream tasks with depth completion results.}
    }

  • K. Zhang, Y. Liu, Y. Gu, J. Wang, and X. Ruan, “Valve Stiction Detection Using Multitimescale Feature Consistent Constraint for Time-Series Data," IEEE/ASME Transactions on Mechatronics, vol. 28, pp. 1488-1499, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Using neural networks to build a reliable fault detection model is an attractive topic in industrial processes but remains challenging due to the lack of labeled data. We propose a feature learning approach for industrial time-series data based on self-supervised contrastive learning to tackle this challenge. The proposed approach consists of two components: data transformation and representation learning. The data transformation converts the raw time series into temporal distance matrices capable of storing temporal and spatial information. The representation learning component uses a convolution-based encoder to encode the temporal distance matrices into embedding representations. The encoder is trained using a new constraint called the multitimescale feature consistent constraint. Finally, a fault detection framework for the valve stiction detection task is developed based on the feature learning method. The proposed framework is evaluated not only on an industrial benchmark dataset but also on a hardware experimental system and in real industrial environments.

    @article{zhang2023vsd,
    title = {Valve Stiction Detection Using Multitimescale Feature Consistent Constraint for Time-Series Data},
    author = {Kexin Zhang and Yong Liu and Yong Gu and Jiadong Wang and Xiaojun Ruan},
    year = 2023,
    journal = {IEEE-ASME Transactions on Mechatronics},
    volume = 28,
    pages = {1488-1499},
    doi = {10.1109/TMECH.2022.3227960},
    abstract = {Using neural networks to build a reliable fault detection model is an attractive topic in industrial processes but remains challenging due to the lack of labeled data. We propose a feature learning approach for industrial time-series data based on self-supervised contrastive learning to tackle this challenge. The proposed approach consists of two components: data transformation and representation learning. The data transformation converts the raw time-series to temporal distance matrices capable of storing temporal and spatial information. The representation learning component uses a convolution-based encoder to encode the temporal distance matrices to embedding representations. The encoder is trained using a new constraint called multitimescale feature consistent constraint. Finally, a fault detection framework for the valve stiction detection task is developed based on the feature learning method. The proposed framework is evaluated not only on an industrial benchmark dataset but also on a hardware experimental system and real industrial environments.}
    }
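
    The data transformation step above can be pictured with the sketch below, which builds a pairwise temporal distance matrix (an unthresholded recurrence-plot-style image) from a raw signal so that a convolutional encoder can consume it. The paper's exact construction may differ, and the signals here are synthetic placeholders.

    import numpy as np

    def temporal_distance_matrix(x):
        """x : (T,) array of samples, e.g. controller output or valve position."""
        x = np.asarray(x, dtype=float)
        return np.abs(x[:, None] - x[None, :])   # (T, T), D[i, j] = |x_i - x_j|

    # Usage: stack matrices from several signals as channels of one "image".
    op = np.sin(np.linspace(0, 6 * np.pi, 128))           # synthetic controller output
    pv = np.sign(np.sin(np.linspace(0, 6 * np.pi, 128)))  # synthetic sticky-valve response
    image = np.stack([temporal_distance_matrix(op),
                      temporal_distance_matrix(pv)])      # (2, 128, 128) encoder input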

  • Y. Yang, M. Wang, J. Mei, and Y. Liu, “Exploiting Semantic-level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos," Applied Intelligence, vol. 53, pp. 15516-15536, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Temporal action proposal (TAP) aims to detect the action instances’ starting and ending times in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wider receptive field, while proposal and global methods lack the focalization of learning action frames and contain background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn to discriminate themselves from action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that exploits the foreground mask to guide the self-attention mechanism to focus on and compute semantic affinities with the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.

    @article{yang2023esl,
    title = {Exploiting Semantic-level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos},
    author = {Yu Yang and Mengmeng Wang and Jianbiao Mei and Yong Liu},
    year = 2023,
    journal = {Applied Intelligence},
    volume = 53,
    pages = {15516-15536},
    doi = {10.1007/s10489-022-04261-1},
    abstract = {Temporal action proposal (TAP) aims to detect the action instances' starting and ending times in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of the temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wider receptive field, while proposal and global methods lack the focalization of learning action frames and contain background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and nonaction clips (backgrounds) can learn the discriminations between them and action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) by exploiting the foreground mask to guide the self-attention mechanism to focus on and calculate semantic affinities with the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.}
    }

  • C. Xu, X. Wu, M. Wang, F. Qiu, Y. Liu, and J. Ren, “Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture," Neurocomputing, vol. 523, pp. 58-68, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for humans to interact naturally with machines. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems because: (1) it is challenging to locate the temporal boundaries of performed gestures; (2) there are significant differences in how different people perform gestures, resulting in a wide variety of gestures; (3) there must be a trade-off between accuracy and computational cost. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in the time series, and then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and the 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.

    @article{xv2022idg,
    title = {Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture},
    author = {Chao Xu and Xia Wu and Mengmeng Wang and Feng Qiu and Yong Liu and Jun Ren},
    year = 2023,
    journal = {Neurocomputing},
    volume = 523,
    pages = {58-68},
    doi = {10.1016/j.neucom.2022.12.022},
    abstract = {Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for a man to interact naturally with a machine. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems since: (1) It is challenging to locate the temporal boundaries of performing gestures; (2) There are significant differences in performing gestures among different people, resulting in a variety of gestures; (3) There must be a trade-off between the accuracy and the computational consumption. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in time series, then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.}
    }

  • M. Wang, J. Xing, J. Su, J. Chen, and Y. Liu, “Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 3347-3362, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Recent methods for action recognition typically apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flow to represent motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both kinds of features in a unified 2D CNN without any 3D convolution or optical flow calculation. In particular, we first design a channel-wise spatiotemporal module to represent the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Secondly, we combine these two modules and an identity mapping path into one unified block that can easily replace the original residual block in the ResNet architecture, forming a simple yet effective network termed STM while introducing very limited extra computation cost and parameters. Thirdly, we propose a novel Twins Training framework for action recognition by incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to fully stretch the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 & v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods on all these datasets.

    @article{wang2022lsm,
    title = {Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition},
    author = {Mengmeng Wang and Jiazheng Xing and Jing Su and Jun Chen and Yong Liu},
    year = 2023,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume = 45,
    pages = {3347-3362},
    doi = {10.1109/TPAMI.2022.3173658},
    abstract = {Recent methods for action recognition always apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flows to present motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both the two kinds of features in a unified 2D CNN without any 3D convolution or optical flows calculation. In particular, we first design a channel-wise spatiotemporal module to present the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Secondly, we combine these two modules and an identity mapping path into one united block that can easily replaces the original residual block in the ResNet architecture, forming a simple yet effective network termed STM network by introducing very limited extra computation cost and parameters. Thirdly, we propose a novel Twins Training framework for action recognition by incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to fully stretch the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 \& v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods in all the datasets.}
    }

  • T. Huang, H. Zou, J. Cui, J. Zhang, X. Yang, L. Li, and Y. Liu, “Adaptive Recurrent Forward Network for Dense Point Cloud Completion," IEEE Transactions on Multimedia, vol. 25, pp. 5903-5915, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Point cloud completion is an interesting and challenging task in 3D vision, which aims to recover complete shapes from sparse and incomplete point clouds. Existing completion networks often require a vast number of parameters and substantial computational costs to achieve a high performance level, which may limit their practical application. In this work, we propose a novel Adaptive efficient Recurrent Forward Network (ARFNet), which is composed of three parts: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). In the RFE, multiple short global features are extracted from incomplete point clouds, while dense completed results are generated in a coarse-to-fine pipeline in the FDC. Finally, we propose the Adamerge module to preserve the details of the original models by merging the generated results with the original incomplete point clouds in the RSP. In addition, we introduce the Sampling Chamfer Distance to better capture the shapes of the models and the balanced expansion constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network achieves state-of-the-art completion performance on dense point clouds with fewer parameters, smaller model sizes, lower memory costs and faster convergence.

    @article{huang2022arf,
    title = {Adaptive Recurrent Forward Network for Dense Point Cloud Completion},
    author = {Tianxin Huang and Hao Zou and Jinhao Cui and Jiangning Zhang and Xuemeng Yang and Lin Li and Yong Liu},
    year = 2023,
    journal = {IEEE Transactions on Multimedia},
    volume = {25},
    pages = {5903-5915},
    doi = {10.1109/TMM.2022.3200851},
    abstract = {Point cloud completion is an interesting and challenging task in 3D vision, which aims to recover complete shapes from sparse and incomplete point clouds. Existing completion networks often require a vast number of parameters and substantial computational costs to achieve a high performance level, which may limit their practical application. In this work, we propose a novel Adaptive efficient Recurrent Forward Network (ARFNet), which is composed of three parts: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). In an RFE, multiple short global features are extracted from incomplete point clouds, while a dense quantity of completed results are generated in a coarse-to-fine pipeline in the FDC. Finally, we propose the Adamerge module to preserve the details from the original models by merging the generated results with the original incomplete point clouds in the RSP. In addition, we introduce the Sampling Chamfer Distance to better capture the shapes of the models and the balanced expansion constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve state-of-the-art completion performances on dense point clouds with fewer parameters, smaller model sizes, lower memory costs and a faster convergence.}
    }
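
    For reference, the Sampling Chamfer Distance mentioned in the entry above builds on the standard symmetric Chamfer distance, sketched below; the sampling-based variant itself is not reproduced here, and the point-cloud sizes are arbitrary examples.

    import torch

    def chamfer_distance(p, q):
        """p: (N, 3) and q: (M, 3) point clouds as torch tensors."""
        d = torch.cdist(p, q)                                       # (N, M) pairwise distances
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

    # Usage: compare a predicted completion against the ground-truth shape.
    pred = torch.rand(2048, 3)
    gt = torch.rand(16384, 3)
    loss = chamfer_distance(pred, gt)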

  • H. Lin, M. Wang, Y. Liu, and J. Kou, “Correlation-based and content-enhanced network for video style transfer," Pattern Analysis and Applications, vol. 26, pp. 343-355, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Artistic style transfer aims to migrate the style pattern of a reference style image to a given content image, and has achieved significant advances in recent years. However, producing temporally coherent and visually pleasing stylized frames is still challenging. Although existing works have made some effort, they rely on inefficient optical flow or other cumbersome operations to model spatiotemporal information. In this paper, we propose an arbitrary video style transfer network that can generate consistent results with reasonable style patterns and clear content structure. We adopt a multi-channel correlation module to render the input images stably according to cross-domain feature correlation. Meanwhile, the Earth Mover’s Distance is used to capture the main characteristics of style images. To maintain the semantic structure during stylization, we also employ AdaIN-based skip connections and a self-similarity loss, which can further improve temporal consistency. Qualitative and quantitative experiments demonstrate the effectiveness of our framework.

    @article{lin2023cbc,
    title = {Correlation-based and content-enhanced network for video style transfer},
    author = {Honglin Lin and Mengmeng Wang and Yong Liu and Jiaxin Kou},
    year = 2023,
    journal = {Pattern Analysis and Applications},
    volume = {26},
    pages = {343-355},
    doi = {10.1007/s10044-022-01106-y},
    abstract = {Artistic style transfer aims to migrate the style pattern from a referenced style image to a given content image, which has achieved significant advances in recent years. However, producing temporally coherent and visually pleasing stylized frames is still challenging. Although existing works have made some effort, they rely on the inefficient optical flow or other cumbersome operations to model spatiotemporal information. In this paper, we propose an arbitrary video style transfer network that can generate consistent results with reasonable style patterns and clear content structure. We adopt multi-channel correlation module to render the input images stably according to cross-domain feature correlation. Meanwhile, Earth Movers' Distance is used to capture the main characteristics of style images. To maintain the semantic structure during the stylization, we also employ the AdaIN-based skip connections and self-similarity loss, which can further improve the temporal consistency. Qualitative and quantitative experiments have demonstrated the effectiveness of our framework.}
    }
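
    The AdaIN operation referenced by the "AdaIN-based skip connections" above can be sketched as follows: content features are re-normalized to match the channel-wise statistics of the style features. Tensor shapes and the epsilon value are illustrative.

    import torch

    def adain(content, style, eps=1e-5):
        """content, style: (B, C, H, W) feature maps."""
        c_mean = content.mean(dim=(2, 3), keepdim=True)
        c_std = content.std(dim=(2, 3), keepdim=True) + eps
        s_mean = style.mean(dim=(2, 3), keepdim=True)
        s_std = style.std(dim=(2, 3), keepdim=True) + eps
        # Shift/scale the content statistics to the style statistics.
        return s_std * (content - c_mean) / c_std + s_mean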

2022

  • T. Huang, J. Zhang, J. Chen, Z. Ding, Y. Tai, Z. Zhang, C. Wang, and Y. Liu, “3QNet: 3D Point Cloud Geometry Quantization Compression Network," ACM Transactions on Graphics, 2022.
    [BibTeX] [Abstract] [DOI]

    Since the development of 3D applications, the point cloud, as a spatial description easily acquired by sensors, has been widely used in multiple areas such as SLAM and 3D reconstruction. Point Cloud Compression (PCC) has also attracted more attention as a primary step before point cloud transfer and storage, where geometry compression is an important component of PCC that compresses the points' geometric structures. However, existing non-learning-based geometry compression methods are often limited by manually pre-defined compression rules. Though learning-based compression methods can significantly improve algorithm performance by learning compression rules from data, they still have some defects. Voxel-based compression networks introduce precision errors due to the voxelization operations, while point-based methods may have relatively weak robustness and are mainly designed for sparse point clouds. In this work, we propose a novel learning-based point cloud compression framework named the 3D Point Cloud Geometry Quantization Compression Network (3QNet), which overcomes the robustness limitation of existing point-based methods and can handle dense points. By learning a codebook of common structural features from simple and sparse shapes, 3QNet can efficiently deal with multiple kinds of point clouds. According to experiments on object models, indoor scenes, and outdoor scans, 3QNet achieves better compression performance than many representative methods.

    @article{huang2022Net,
    title = {3QNet: 3D Point Cloud Geometry Quantization Compression Network},
    author = {Tianxin Huang and Jiangning Zhang and Jun Chen and Zhonggan Ding and Ying Tai and Zhenyu Zhang and Chengjie Wang and Yong Liu},
    year = 2022,
    journal = {ACM Transactions on Graphics},
    doi = {10.1145/3550454.3555481},
    abstract = {Since the development of 3D applications, the point cloud, as a spatial description easily acquired by sensors, has been widely used in multiple areas such as SLAM and 3D reconstruction. Point Cloud Compression (PCC) has also attracted more attention as a primary step before point cloud transferring and saving, where the geometry compression is an important component of PCC to compress the points geometrical structures. However, existing non-learning-based geometry compression methods are often limited by manually pre-defined compression rules. Though learning-based compression methods can significantly improve the algorithm performances by learning compression rules from data, they still have some defects. Voxel-based compression networks introduce precision errors due to the voxelized operations, while point-based methods may have relatively weak robustness and are mainly designed for sparse point clouds. In this work, we propose a novel learning-based point cloud compression framework named 3D Point Cloud Geometry Quantiation Compression Network (3QNet), which overcomes the robustness limitation of existing point-based methods and can handle dense points. By learning a codebook including common structural features from simple and sparse shapes, 3QNet can efficiently deal with multiple kinds of point clouds. According to experiments on object models, indoor scenes, and outdoor scans, 3QNet can achieve better compression performances than many representative methods.}
    }
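
    A hedged sketch of the learned-codebook idea described in the entry above: each local feature is replaced by the index of its nearest codeword, so only the indices plus the shared codebook need to be stored or transmitted. 3QNet's actual architecture, entropy coding, and training procedure are not reproduced here; the codebook size and feature dimension are placeholders.

    import torch

    def quantize(features, codebook):
        """features: (N, D) local descriptors, codebook: (K, D) learned codewords."""
        return torch.cdist(features, codebook).argmin(dim=1)   # (N,) codeword indices

    def dequantize(indices, codebook):
        return codebook[indices]                                # (N, D) reconstruction

    codebook = torch.randn(256, 64)       # K = 256 codewords -> 8 bits per feature
    feats = torch.randn(1000, 64)
    codes = quantize(feats, codebook)
    recon = dequantize(codes, codebook)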

  • T. Huang, J. Chen, J. Zhang, Y. Liu, and J. Liang, “Fast Point Cloud Sampling Network," Pattern Recognition Letters, 2022.
    [BibTeX] [Abstract] [DOI]

    The increasing number of points in 3D point clouds has brought great challenges to the efficiency of subsequent algorithms. Down-sampling algorithms are adopted to simplify the data and accelerate the computation. Apart from the well-known random sampling and farthest point sampling, some recent works have tried to learn a sampling pattern according to the downstream task, generating sampled points with fully-connected networks that have fixed output point numbers. In this setting, either a progressive-net structure covering sampling networks for all resolutions or multiple separate sampling networks for different resolutions is required, which is inconvenient. In this work, we propose a novel learning-based point cloud sampling framework, named the Fast point cloud sampling network (FPN), which drives initially randomly sampled points to better positions instead of generating coordinates. Once trained, FPN can sample point clouds to any resolution by changing the number of initially sampled points. Results on point cloud reconstruction and recognition confirm that FPN reaches state-of-the-art performance with much higher sampling efficiency than most existing sampling methods.

    @article{huang2022fast,
    title = {Fast Point Cloud Sampling Network},
    author = {Tianxin Huang and Jun Chen and Jiangning Zhang and Yong Liu and Jie Liang},
    year = 2022,
    journal = {Pattern Recognition Letters},
    doi = {10.1016/j.patrec.2022.11.006},
    abstract = {The increasing number of points in 3D point clouds has brought great challenges for subsequent algorithm efficiencies. Down-sampling algorithms are adopted to simplify the data and accelerate the computation. Except the well-known random sampling and farthest distance sampling, some recent works have tried to learn a sampling pattern according to the downstream task, which helps generate sampled points by fully-connected networks with fixed output point numbers. In this condition, a progress-net structure covering all resolutions sampling networks or multiple separate sampling networks for different resolutions are required, which is inconvenient. In this work, we propose a novel learning-based point cloud sampling framework, named Fast point cloud sampling network (FPN), which drives initial randomly sampled points to better positions instead of generating coordinates. FPN can be used to sample points clouds to any resolution once trained by changing the number of initial randomly sampled points. Results on point cloud reconstruction and recognition confirm that FPN can reach state-of-the-art performances with much higher sampling efficiency than most existing sampling methods.}
    }
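
    For context, the sketch below implements farthest point sampling, one of the classical baselines contrasted in the entry above; FPN itself instead refines an initial random sample with a learned network, which is not shown here.

    import numpy as np

    def farthest_point_sampling(points, m):
        """points: (N, 3) array; returns indices of m sampled points."""
        n = points.shape[0]
        chosen = [np.random.randint(n)]        # seed with a random point
        dist = np.full(n, np.inf)              # distance to the current sample set
        for _ in range(m - 1):
            dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
            chosen.append(int(dist.argmax()))  # pick the point farthest from the set
        return np.array(chosen)

    idx = farthest_point_sampling(np.random.rand(4096, 3), 512)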

  • J. Lv, X. Zuo, K. Hu, J. Xu, G. Huang, and Y. Liu, “Observability-Aware Intrinsic and Extrinsic Calibration of LiDAR-IMU System," IEEE Transactions on Robotics, vol. 38, iss. 6, pp. 3734-3753, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    Accurate and reliable sensor calibration is essential to fuse LiDAR and inertial measurements, which are usually available in robotic applications. In this article, we propose a novel LiDAR-IMU calibration method within the continuous-time batch-optimization framework, where the intrinsics of both sensors and the spatial-temporal extrinsics between sensors are calibrated without using calibration infrastructure such as fiducial tags. Compared to discrete-time approaches, the continuous-time formulation has natural advantages for fusing high-rate measurements from LiDAR and IMU sensors. To improve efficiency and address degenerate motions, the following two observability-aware modules are leveraged: first, the information-theoretic data selection policy selects only the most informative segments for calibration during data collection, which significantly improves calibration efficiency by processing only the selected informative segments. Second, the observability-aware state update mechanism in nonlinear least-squares optimization updates only the identifiable directions in the state space with truncated singular value decomposition, which enables accurate calibration results even under degenerate cases where informative data segments are not available. The proposed LiDAR-IMU calibration approach has been validated extensively in both simulated and real-world experiments with different robot platforms, demonstrating its high accuracy and repeatability in commonly seen human-made environments.

    @article{lv2022oai,
    title = {Observability-Aware Intrinsic and Extrinsic Calibration of LiDAR-IMU System},
    author = {Jiajun Lv and Xingxing Zuo and Kewei Hu and Jinhong Xu and Guoquan Huang and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Robotics},
    volume = {38},
    number = {6},
    pages = {3734-3753},
    doi = {10.1109/TRO.2022.3174476},
    abstract = {Accurate and reliable sensor calibration is essential to fuse LiDAR and inertial measurements, which are usually available in robotic applications. In this article, we propose a novel LiDAR-IMU calibration method within the continuous-time batch-optimization framework, where the intrinsics of both sensors and the spatial-temporal extrinsics between sensors are calibrated without using calibration infrastructure, such as fiducial tags. Compared to discrete-time approaches, the continuous-time formulation has natural advantages for fusing high-rate measurements from LiDAR and IMU sensors. To improve efficiency and address degenerate motions, the following two observability-aware modules are leveraged: first, The information-theoretic data selection policy selects only the most informative segments for calibration during data collection, which significantly improves the calibration efficiency by processing only the selected informative segments. Second, the observability-aware state update mechanism in nonlinear least-squares optimization updates only the identifiable directions in the state space with truncated singular value decomposition, which enables accurate calibration results even under degenerate cases where informative data segments are not available. The proposed LiDAR-IMU calibration approach has been validated extensively in both simulated and real-world experiments with different robot platforms, demonstrating its high accuracy and repeatability in commonly-seen human-made environments.}
    }
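
    As an illustration of the observability-aware state update mentioned above, the sketch below solves a Gauss-Newton step with a truncated SVD so that directions whose singular values fall below a threshold (i.e., poorly observable under degenerate motion) receive no update. The threshold and the way the Jacobian is stacked are assumptions for illustration, not the paper's exact implementation.

    import numpy as np

    def truncated_svd_step(J, r, sv_threshold=1e-3):
        """J: (M, N) stacked Jacobian, r: (M,) residuals. Returns the state increment dx."""
        U, s, Vt = np.linalg.svd(J, full_matrices=False)
        keep = s > sv_threshold * s[0]              # identifiable directions only
        s_inv = np.where(keep, 1.0 / s, 0.0)        # zero out unobservable directions
        return -(Vt.T * s_inv) @ (U.T @ r)          # pseudo-inverse restricted to kept directions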

  • C. Xu, J. Zhang, M. Wang, G. Tian, and Y. Liu, “Multi-level Spatial-temporal Feature Aggregation for Video Object Detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, iss. 11, pp. 7809-7820, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    Video object detection (VOD) focuses on detecting objects in each frame of a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they operate only at the frame level or proposal level and cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a process of interacting spatial-temporal hierarchical features and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only updating the reference features. The proposal-level feature aggregation then models object relations to augment the reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on the ImageNet VID and UAVDT datasets and demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on the two datasets, outperforming the SOTA MEGA by 0.4% and 2.7%.

    @article{xu2022mls,
    title = {Multi-level Spatial-temporal Feature Aggregation for Video Object Detection},
    author = {Chao Xu and Jiangning Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Circuits and Systems for Video Technology},
    volume = {32},
    number = {11},
    pages = {7809-7820},
    doi = {10.1109/TCSVT.2022.3183646},
    abstract = {Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.}
    }

  • M. Wang, J. Mei, L. Liu, and Y. Liu, “Delving Deeper Into Mask Utilization in Video Object Segmentation," IEEE Transactions on Image Processing, vol. 31, pp. 6255-6266, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper focuses on mask utilization in video object segmentation (VOS). The mask here means the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used together with the reference frames. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in VOS, but this has not been well explored yet. To tackle this, we investigate the mask advantages in both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, the mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration can raise attention to mask utilization in VOS.

    @article{wang2022ddi,
    title = {Delving Deeper Into Mask Utilization in Video Object Segmentation},
    author = {Mengmeng Wang and Jianbiao Mei and Lina Liu and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Image Processing},
    volume = {31},
    pages = {6255-6266},
    doi = {10.1109/TIP.2022.3208409},
    abstract = {This paper focuses on the mask utilization of video object segmentation (VOS). The mask here mains the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.}
    }

  • X. Lang, J. Lv, J. Huang, Y. Ma, Y. Liu, and X. Zuo, “Ctrl-VIO: Continuous-Time Visual-Inertial Odometry for Rolling Shutter Cameras," IEEE Robotics and Automation Letters (RA-L), vol. 7, iss. 4, pp. 11537-11544, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this letter, we propose a probabilistic continuous-time visual-inertial odometry (VIO) for rolling shutter cameras. The continuous-time trajectory formulation naturally facilitates the fusion of asynchronized high-frequency IMU data and motion-distorted rolling shutter images. To prevent an intractable computation load, the proposed VIO is sliding-window and keyframe-based. We propose to probabilistically marginalize the control points to keep a constant number of keyframes in the sliding window. Furthermore, the line exposure time difference (line delay) of the rolling shutter camera can be calibrated online in our continuous-time VIO. To extensively examine the performance of our continuous-time VIO, experiments are conducted on the publicly available WHU-RSVI, TUM-RSVI, and SenseTime-RSVI rolling shutter datasets. The results demonstrate that the proposed continuous-time VIO significantly outperforms existing state-of-the-art VIO methods.

    @article{lang2022ctv,
    title = {Ctrl-VIO: Continuous-Time Visual-Inertial Odometry for Rolling Shutter Cameras},
    author = {Xiaolei Lang and Jiajun Lv and Jianxin Huang and Yukai Ma and Yong Liu and Xingxing Zuo},
    year = 2022,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = {7},
    number = {4},
    pages = {11537-11544},
    doi = {10.1109/LRA.2022.3202349},
    abstract = {In this letter, we propose a probabilistic continuous-time visual-inertial odometry (VIO) for rolling shutter cameras. The continuous-time trajectory formulation naturally facilitates the fusion of asynchronized high-frequency IMU data and motion-distorted rolling shutter images. To prevent an intractable computation load, the proposed VIO is sliding-window and keyframe-based. We propose to probabilistically marginalize the control points to keep a constant number of keyframes in the sliding window. Furthermore, the line exposure time difference (line delay) of the rolling shutter camera can be calibrated online in our continuous-time VIO. To extensively examine the performance of our continuous-time VIO, experiments are conducted on the publicly available WHU-RSVI, TUM-RSVI, and SenseTime-RSVI rolling shutter datasets. The results demonstrate that the proposed continuous-time VIO significantly outperforms existing state-of-the-art VIO methods.}
    }

  • G. Zhai, Y. Zheng, Z. Xu, X. Kong, Y. Liu, B. Busam, Y. Ren, N. Navab, and Z. Zhang, “DA^2 Dataset: Toward Dexterity-Aware Dual-Arm Grasping," IEEE Robotics and Automation Letters (RA-L), vol. 7, iss. 4, pp. 8941-8948, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we introduce DA^2, the first large-scale dual-arm dexterity-aware dataset for the generation of optimal bimanual grasping pairs for arbitrary large objects. The dataset contains about 9M pairs of parallel-jaw grasps, generated from more than 6000 objects and each labeled with various grasp dexterity measures. In addition, we propose an end-to-end dual-arm grasp evaluation model trained on the rendered scenes from this dataset. We utilize the evaluation model as our baseline to show the value of this novel and nontrivial dataset by both online analysis and real robot experiments. All data and related code will be open-sourced at https://sites.google.com/view/da2dataset.

    @article{zhai2022ddt,
    title = {DA^2 Dataset: Toward Dexterity-Aware Dual-Arm Grasping},
    author = {Guangyao Zhai and Yu Zheng and Ziwei Xu and Xin Kong and Yong Liu and Benjamin Busam and Yi Ren and Nassir Navab and Zhengyou Zhang},
    year = 2022,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = {7},
    number = {4},
    pages = {8941-8948},
    doi = {10.1109/LRA.2022.3189959},
    abstract = {In this paper, we introduce DA^2, the first large-scale dual-arm dexterity-aware dataset for the generation of optimal bimanual grasping pairs for arbitrary large objects. The dataset contains about 9M pairs of parallel-jaw grasps, generated from more than 6000 objects and each labeled with various grasp dexterity measures. In addition, we propose an end-to-end dual-arm grasp evaluation model trained on the rendered scenes from this dataset. We utilize the evaluation model as our baseline to show the value of this novel and nontrivial dataset by both online analysis and real robot experiments. All data and related code will be open-sourced at https://sites.google.com/view/da2dataset.}
    }

  • Y. Lin, M. Wang, W. Chen, W. Gao, L. Li, and Y. Liu, “Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure," Remote Sensing, vol. 14, iss. 16, p. 3862, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    The task of multi-object tracking via deep learning methods for UAV videos has become an important research direction. However, with some current multiple object tracking methods, the relationship between object detection and tracking is not well handled, and decisions on how to make good use of temporal information can affect tracking performance as well. To improve the performance of multi-object tracking, this paper proposes an improved multiple object tracking model based on FairMOT. The proposed model contains a structure that separates the detection and ReID heads to decrease the mutual influence between the function heads. Additionally, we develop a temporal embedding structure to strengthen the representational ability of the model. By combining the temporal-association structure and separating the different function heads, the model’s performance in object detection and tracking tasks is improved, which has been verified on the VisDrone2019 dataset. Compared with the original method, the proposed model improves MOTA by 4.9% and MOTP by 1.2%, and it achieves better tracking performance than models such as SORT and HDHNet on the UAV video dataset.

    @article{lin2022mot,
    title = {Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure},
    author = {Yeneng Lin and Mengmeng Wang and Wenzhou Chen and Wang Gao and Lei Li and Yong Liu},
    year = 2022,
    journal = {Remote Sensing},
    volume = {14},
    number = {16},
    pages = {3862},
    doi = {10.3390/rs14163862},
    abstract = {The task of multi-object tracking via deep learning methods for UAV videos has become an important research direction. However, with some current multiple object tracking methods, the relationship between object detection and tracking is not well handled, and decisions on how to make good use of temporal information can affect tracking performance as well. To improve the performance of multi-object tracking, this paper proposes an improved multiple object tracking model based on FairMOT. The proposed model contains a structure that separates the detection and ReID heads to decrease the mutual influence between the function heads. Additionally, we develop a temporal embedding structure to strengthen the representational ability of the model. By combining the temporal-association structure and separating the different function heads, the model’s performance in object detection and tracking tasks is improved, which has been verified on the VisDrone2019 dataset. Compared with the original method, the proposed model improves MOTA by 4.9% and MOTP by 1.2%, and it achieves better tracking performance than models such as SORT and HDHNet on the UAV video dataset.}
    }
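
    A hedged sketch of the separated-tasks idea in the entry above: the detection and re-identification branches share the backbone feature map but use independent head convolutions, reducing interference between the two tasks. Channel sizes, head layouts, and the backbone are placeholders rather than the paper's exact design.

    import torch
    import torch.nn as nn

    class SeparatedHeads(nn.Module):
        def __init__(self, in_ch=64, num_classes=1, reid_dim=128):
            super().__init__()
            def head(out_ch):
                return nn.Sequential(nn.Conv2d(in_ch, 256, 3, padding=1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(256, out_ch, 1))
            self.heatmap = head(num_classes)   # detection: object-center heatmap
            self.box = head(4)                 # detection: box size and offset
            self.reid = head(reid_dim)         # appearance embedding for association

        def forward(self, feat):               # feat: (B, in_ch, H, W) backbone output
            return self.heatmap(feat), self.box(feat), self.reid(feat)

    # Usage on a dummy backbone feature map.
    outs = SeparatedHeads()(torch.randn(1, 64, 152, 272))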

  • T. Huang, Y. Liu, and Z. Pan, “Deep Residual Surrogate Model," Information Sciences, vol. 605, pp. 86-98, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    Surrogate models are widely used to model high-computational-cost problems such as industrial simulation or engineering optimization when the amount of sampled data for modeling is greatly limited. They can significantly improve the efficiency of complex calculations by modeling the original expensive problems with simpler computation-saving functions. However, a single surrogate model cannot always perform well for various problems. In this case, hybrid surrogate models are created to improve the final performance on different problems by combining the advantages of multiple single models. Nevertheless, existing hybrid methods work by estimating weights for all alternative single models, which limits efficiency when more single models are adopted. In this paper, we propose a novel hybrid surrogate model quite different from former methods, named the Deep Residual Surrogate model (DRS). DRS does not merge all alternative single surrogate models directly by weights, but assembles selected ones in a multi-layer structure. We propose first derivate validation (FDV) to recurrently select the single surrogate model adopted in each layer from all candidates. Experimental results on multiple benchmark problems demonstrate that DRS outperforms existing single and hybrid surrogate models in both prediction accuracy and stability with higher efficiency.

    @article{huang2022drs,
    title = {Deep Residual Surrogate Model},
    author = {Tianxin Huang and Yong Liu and Zaisheng Pan},
    year = 2022,
    journal = {Information Sciences},
    volume = {605},
    pages = {86-98},
    doi = {10.1016/j.ins.2022.04.041},
    abstract = {Surrogate models are widely used to model the high computational cost problems such as industrial simulation or engineering optimization when the size of sampled data for modeling is greatly limited. They can significantly improve the efficiency of complex calculations by modeling original expensive problems with simpler computation-saving functions. However, a single surrogate model cannot always perform well for various problems. On this occasion, hybrid surrogate models are created to improve the final performances on different problems by combining advantages of multiple single models. Nevertheless, existing hybrid methods work by estimating weights for all alternative single models, which limits the efficiency when more single models are adopted. In this paper, we propose a novel hybrid surrogate model quite different from former methods, named the Deep Residual Surrogate model (DRS). DRS does not merge all alternative single surrogate models directly by weights, but by assembling selected ones in a multiple layers structure. We propose first derivate validation (FDV) to recurrently select the single surrogate model adopted in each layer from all candidates. Experimental results on multiple benchmark problems demonstrate that DRS has better performances than existing single and hybrid surrogate models in both prediction accuracy and stability with higher efficiency. (C) 2022 Elsevier Inc. All rights reserved.}
    }
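
    One rough way to picture the layered assembly described in the abstract: each layer is a single surrogate fitted to the residual left by the layers before it, with the per-layer model chosen from a candidate pool. The sketch below is an interpretation, not the authors' code; it substitutes plain 3-fold cross-validation for the paper's first derivate validation (FDV) and uses off-the-shelf scikit-learn regressors as the candidate pool.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVR
    from sklearn.model_selection import cross_val_score

    def fit_layered_residual_surrogate(X, y, candidates, n_layers=3):
        """Each layer fits the residual of the previous layers; the per-layer model is
        selected by cross-validation (a stand-in for FDV)."""
        layers, residual = [], y.copy()
        for _ in range(n_layers):
            # pick the candidate that explains the current residual best
            best = max(candidates(), key=lambda m: cross_val_score(m, X, residual, cv=3).mean())
            best.fit(X, residual)
            layers.append(best)
            residual = residual - best.predict(X)
        return layers

    def predict(layers, X):
        return sum(m.predict(X) for m in layers)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.uniform(-3, 3, size=(60, 2))
        y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

        def make_candidates():
            return [GaussianProcessRegressor(), KNeighborsRegressor(5), SVR()]

        model = fit_layered_residual_surrogate(X, y, make_candidates)
        print(np.abs(predict(model, X) - y).mean())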

  • X. Zhao, Y. Liu, Z. Wang, K. Wu, G. Dissanayake, and Y. Liu, “TG: Accurate and Efficient RGB-D Feature with Texture and Geometric Information," IEEE-ASME Transactions on Mechatronics, vol. 27, iss. 4, pp. 1973-1981, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    Feature extraction and matching are the basis of many computer vision problems, such as image retrieval, object recognition, and visual odometry. In this article, we present a novel RGB-D feature with texture and geometric information (TG). It consists of a keypoint detector and a feature descriptor, which is accurate, efficient, and robust to scene variance. In the keypoint detection, we build a simplified Gaussian image pyramid to extract the texture feature. Meanwhile, the gradient of the point cloud is superimposed as the geometric feature. In the feature description, the texture information and spatial information are encoded in relative order to build a discriminative descriptor. We also construct a novel RGB-D benchmark dataset for RGB-D detector and descriptor evaluation under single variation. Comprehensive experiments are carried out to prove the superior performance of the proposed feature compared with state-of-the-art algorithms. The experimental results also demonstrate that our TG can achieve better performance, especially in accuracy and computational efficiency, making it more suitable for real-time applications, e.g., visual odometry.

    @article{zhao2022tga,
    title = {TG: Accurate and Efficient RGB-D Feature with Texture and Geometric Information},
    author = {Xiangrui Zhao and Yu Liu and Zhengbo Wang and Kanzhi Wu and Gamini Dissanayake and Yong Liu},
    year = 2022,
    journal = {IEEE-ASME Transactions on Mechatronics},
    volume = {27},
    number = {4},
    pages = {1973-1981},
    doi = {10.1109/TMECH.2022.3175812},
    abstract = {Feature extraction and matching are the basis of many computer vision problems, such as image retrieval, object recognition, and visual odometry. In this article, we present a novel RGB-D feature with texture and geometric information (TG). It consists of a keypoint detector and a feature descriptor, which is accurate, efficient, and robust to scene variance. In the keypoint detection, we build a simplified Gaussian image pyramid to extract the texture feature. Meanwhile, the gradient of the point cloud is superimposed as the geometric feature. In the feature description, the texture information and spatial information are encoded in relative order to build a discriminative descriptor. We also construct a novel RGB-D benchmark dataset for RGB-D detector and descriptor evaluation under single variation. Comprehensive experiments are carried out to prove the superior performance of the proposed feature compared with state-of-the-art algorithms. The experimental results also demonstrate that our TG can achieve better performance especially on accuracy and the computational efficiency, making it more suitable for the real-time applications, e.g., visual odometry.}
    }

  • G. Xu, Y. Chen, J. Cao, D. Zhu, W. Liu, and Y. Liu, “Multivehicle Motion Planning with Posture Constraints in Real World," IEEE-ASME Transactions on Mechatronics, vol. 27, iss. 4, pp. 2125-2133, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    This article addresses the posture constraints problem in multivehicle motion planning for specific applications such as ground exploration tasks. Unlike most of the related work in motion planning, this article investigates more practical applications in the real world for nonholonomic unmanned ground vehicles (UGVs). In this case, a strategy of diversion is designed to optimize the smoothness of motion. Considering the problem of the posture constraints, a postured collision avoidance algorithm is proposed for the motion planning of the multiple nonholonomic UGVs. Two simulation experiments were conducted to verify the effectiveness and analyze the quantitative performance of the proposed method. Then, the practicability of the proposed algorithm was verified with an experiment in a natural environment.

    @article{xu2022mmp,
    title = {Multivehicle Motion Planning with Posture Constraints in Real World},
    author = {Gang Xu and Yansong Chen and Junjie Cao and Deye Zhu and Weiwei Liu and Yong Liu},
    year = 2022,
    journal = {IEEE-ASME Transactions on Mechatronics},
    volume = {27},
    number = {4},
    pages = {2125-2133},
    doi = {10.1109/TMECH.2022.3173130},
    abstract = {This article addresses the posture constraints problem in multivehicle motion planning for specific applications such as ground exploration tasks. Unlike most of the related work in motion planning, this article investigates more practical applications in the real world for nonholonomic unmanned ground vehicles (UGVs). In this case, a strategy of diversion is designed to optimize the smoothness of motion. Considering the problem of the posture constraints, a postured collision avoidance algorithm is proposed for the motion planning of the multiple nonholonomic UGVs. Two simulation experiments were conducted to verify the effectiveness and analyze the quantitative performance of the proposed method. Then, the practicability of the proposed algorithm was verified with an experiment in a natural environment.}
    }

  • L. Li, X. Kong, X. Zhao, T. Huang, and Y. Liu, “Semantic Scan Context: A Novel Semantic-based Loop-closure Method for LiDAR SLAM," Autonomous Robots, vol. 46, iss. 4, pp. 535-551, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    As one of the key technologies of SLAM, loop-closure detection can help eliminate the cumulative errors of the odometry. Many of the current LiDAR-based SLAM systems do not integrate a loop-closure detection module, so they will inevitably suffer from cumulative errors. This paper proposes a semantic-based place recognition method called Semantic Scan Context (SSC), which consists of the two-step global ICP and the semantic-based descriptor. Thanks to the use of high-level semantic features, our descriptor can effectively encode scene information. The proposed two-step global ICP can help eliminate the influence of rotation and translation on descriptor matching and provide a good initial value for geometric verification. Further, we built a complete loop-closure detection module based on SSC and combined it with the famous LOAM to form a full LiDAR SLAM system. Exhaustive experiments on the KITTI and KITTI-360 datasets show that our approach is competitive to the state-of-the-art methods, robust to the environment, and has good generalization ability. Our code is available at: https://github.com/lilin-hitcrt/SSC.

    @article{li2022ssc,
    title = {Semantic Scan Context: A Novel Semantic-based Loop-closure Method for LiDAR SLAM},
    author = {Lin Li and Xin Kong and Xiangrui Zhao and Tianxin Huang and Yong Liu},
    year = 2022,
    journal = {Autonomous Robots},
    volume = {46},
    number = {4},
    pages = {535-551},
    doi = {10.1007/s10514-022-10037-w},
    abstract = {As one of the key technologies of SLAM, loop-closure detection can help eliminate the cumulative errors of the odometry. Many of the current LiDAR-based SLAM systems do not integrate a loop-closure detection module, so they will inevitably suffer from cumulative errors. This paper proposes a semantic-based place recognition method called Semantic Scan Context (SSC), which consists of the two-step global ICP and the semantic-based descriptor. Thanks to the use of high-level semantic features, our descriptor can effectively encode scene information. The proposed two-step global ICP can help eliminate the influence of rotation and translation on descriptor matching and provide a good initial value for geometric verification. Further, we built a complete loop-closure detection module based on SSC and combined it with the famous LOAM to form a full LiDAR SLAM system. Exhaustive experiments on the KITTI and KITTI-360 datasets show that our approach is competitive to the state-of-the-art methods, robust to the environment, and has good generalization ability. Our code is available at:https://github.com/lilin-hitcrt/SSC.}
    }
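
    To make the descriptor side of the method concrete, the sketch below builds a polar (ring, sector) grid over a semantically labeled point cloud, stores one label per cell, and scores two descriptors by cell-wise agreement. The grid sizes and the "highest label id per cell" rule are assumptions for illustration only; the actual SSC also aligns the scans with its two-step global ICP before comparison, which is omitted here.

    import numpy as np

    def semantic_scan_context(points, labels, n_rings=20, n_sectors=60, max_range=50.0):
        """Polar (ring, sector) grid over a LiDAR scan; each cell stores a representative
        semantic label (here: the largest label id falling into the cell)."""
        x, y = points[:, 0], points[:, 1]
        r = np.hypot(x, y)
        theta = np.mod(np.arctan2(y, x), 2 * np.pi)
        keep = r < max_range
        ring = np.minimum((r[keep] / max_range * n_rings).astype(int), n_rings - 1)
        sector = np.minimum((theta[keep] / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)
        desc = np.zeros((n_rings, n_sectors), dtype=np.int32)
        np.maximum.at(desc, (ring, sector), labels[keep].astype(np.int32))
        return desc

    def similarity(a, b):
        """Cell-wise label agreement over occupied cells, as a simple matching score."""
        occupied = (a > 0) | (b > 0)
        return np.mean((a == b)[occupied]) if occupied.any() else 0.0

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        pts = rng.uniform(-40, 40, size=(5000, 3))
        lbl = rng.integers(1, 20, size=5000)
        d = semantic_scan_context(pts, lbl)
        print(d.shape, similarity(d, d))  # (20, 60) 1.0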

  • L. Li, X. Kong, X. Zhao, T. Huang, W. Li, F. Wen, H. Zhang, and Y. Liu, “RINet: Efficient 3D Lidar-Based Place Recognition Using Rotation Invariant Neural Network," IEEE Robotics and Automation Letters (RA-L), vol. 7, iss. 2, pp. 4321-4328, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    LiDAR-based place recognition (LPR) is one of the basic capabilities of robots, which can retrieve scenes from maps and identify previously visited locations based on 3D point clouds. As robots often pass the same place from different views, LPR methods are supposed to be robust to rotation, which is lacking in most current learning-based approaches. In this letter, we propose a rotation invariant neural network structure that can detect reverse loop closures even when the training data is all in the same direction. Specifically, we design a novel rotation equivariant global descriptor, which combines semantic and geometric features to improve description ability. Then a rotation invariant siamese neural network is implemented to predict the similarity of descriptor pairs. Our network is lightweight and can operate at more than 8000 FPS on an i7-9700 CPU. Exhaustive evaluations and robustness tests on the KITTI, KITTI-360, and NCLT datasets show that our approach can work stably in various scenarios and achieve state-of-the-art performance.

    @article{li2022rinet,
    title = {RINet: Efficient 3D Lidar-Based Place Recognition Using Rotation Invariant Neural Network},
    author = {Lin Li and Xin Kong and Xiangrui Zhao and Tianxin Huang and Wanlong Li and Feng Wen and Hongbo Zhang and Yong Liu},
    year = 2022,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    volume = {7},
    number = {2},
    pages = {4321-4328},
    doi = {10.1109/LRA.2022.3150499},
    abstract = {LiDAR-based place recognition (LPR) is one of the basic capabilities of robots, which can retrieve scenes from maps and identify previously visited locations based on 3D point clouds. As robots often pass the same place from different views, LPR methods are supposed to be robust to rotation, which is lacking in most current learning-based approaches. In this letter, we propose a rotation invariant neural network structure that can detect reverse loop closures even training data is all in the same direction. Specifically, we design a novel rotation equivariant global descriptor, which combines semantic and geometric features to improve description ability. Then a rotation invariant siamese neural network is implemented to predict the similarity of descriptor pairs. Our network is lightweight and can operate more than 8000 FPS on an i7-9700 CPU. Exhaustive evaluations and robustness tests on the KITTI, KITTI-360, and NCLT datasets show that our approach can work stably in various scenarios and achieve state-of-the-art performance.}
    }
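
    Rotation invariance for sector-wise LiDAR descriptors can be illustrated with a brute-force baseline: a yaw change of the sensor circularly shifts the sector axis, so taking the best match over all circular shifts removes the dependence on heading. The sketch below is only this baseline, not RINet's learned siamese network, and the descriptor shape is assumed.

    import numpy as np

    def rotation_invariant_similarity(desc_a, desc_b):
        """Best normalized correlation over all circular shifts of the sector axis
        (axis 0) of two (n_sectors, feat_dim) descriptors."""
        a = desc_a / (np.linalg.norm(desc_a) + 1e-9)
        b = desc_b / (np.linalg.norm(desc_b) + 1e-9)
        return max(float(np.sum(a * np.roll(b, shift=k, axis=0)))
                   for k in range(b.shape[0]))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        d = rng.random((60, 8))
        # a rotated copy of the same scene still scores as a perfect match
        print(rotation_invariant_similarity(d, np.roll(d, 17, axis=0)))  # 1.0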

  • K. Zhang, Y. Liu, Y. Gu, X. Ruan, and J. Wang, “Multiple Timescale Feature Learning Strategy for Valve Stiction Detection Based on Convolutional Neural Network," IEEE/ASME Transactions on Mechatronics, vol. 27, iss. 3, pp. 1478-1488, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    This article proposes a valve stiction detection strategy based on a convolutional neural network. Considering the commonly existing characteristics of industrial time-series signals, the strategy is developed to learn features on multiple timescales automatically. Unlike the traditional approaches using hand-crafted features, the proposed strategy can automatically learn representative features on the time-series data collected from industrial control loops. The strategy is composed of two complementary data conversion methods, a mixed feature learning stage and a fusion decision stage, and it has the following merits: 1) the interaction of different pairs of time series can be effectively captured; and 2) the whole process of feature learning is automatic, and no manual feature extraction is needed. The effectiveness of the proposed strategy is evaluated through the comprehensive data, including the International Stiction Data Base, and the real data collected from the real hardware experimental system and the industrial environment. Compared with four traditional methods and three deep-learning-based methods, the experimental results demonstrate that the proposed strategy outperforms the other methods. Besides performance evaluation, we give the implementation procedure of practical application of the proposed strategy and provide the detailed analysis from the perspective of the data conversion methods and the number of timescales.

    @article{zhang2022mtf,
    title = {Multiple Timescale Feature Learning Strategy for Valve Stiction Detection Based on Convolutional Neural Network},
    author = {Kexin Zhang and Yong Liu and Yong Gu and Xiaojun Ruan and Jiadong Wang},
    year = 2022,
    journal = {IEEE/ASME Transactions on Mechatronics},
    volume = {27},
    number = {3},
    pages = {1478-1488},
    doi = {10.1109/TMECH.2021.3087503},
    abstract = {This article proposes a valve stiction detection strategy based on a convolutional neural network. Considering the commonly existing characteristics of industrial time-series signals, the strategy is developed to learn features on multiple timescales automatically. Unlike the traditional approaches using hand-crafted features, the proposed strategy can automatically learn representative features on the time-series data collected from industrial control loops. The strategy is composed of two complementary data conversion methods, a mixed feature learning stage and a fusion decision stage, and it has the following merits: 1) the interaction of different pairs of time series can be effectively captured; and 2) the whole process of feature learning is automatic, and no manual feature extraction is needed. The effectiveness of the proposed strategy is evaluated through the comprehensive data, including the International Stiction Data Base, and the real data collected from the real hardware experimental system and the industrial environment. Compared with four traditional methods and three deep-learning-based methods, the experimental results demonstrate that the proposed strategy outperforms the other methods. Besides performance evaluation, we give the implementation procedure of practical application of the proposed strategy and provide the detailed analysis from the perspective of the data conversion methods and the number of timescales.}
    }
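
    The multiple-timescale idea can be pictured as parallel convolutional branches that look at the same control-loop signals (e.g., controller output and valve position stacked as channels) at different temporal resolutions and fuse their features for a stiction / no-stiction decision. The pooling-based downsampling, branch design, and two-class output below are assumptions standing in for the paper's two data conversion methods and fusion decision stage, not the authors' architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTimescaleCNN(nn.Module):
        """Parallel 1D conv branches over the same signals at several temporal
        resolutions, fused by a linear classifier."""
        def __init__(self, in_channels=2, scales=(1, 2, 4), hidden=32):
            super().__init__()
            self.scales = scales
            self.branches = nn.ModuleList(
                nn.Sequential(
                    nn.Conv1d(in_channels, hidden, kernel_size=5, padding=2), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1),
                ) for _ in scales
            )
            self.classifier = nn.Linear(hidden * len(scales), 2)

        def forward(self, x):  # x: (batch, channels, time)
            feats = []
            for s, branch in zip(self.scales, self.branches):
                xs = F.avg_pool1d(x, kernel_size=s) if s > 1 else x
                feats.append(branch(xs).squeeze(-1))
            return self.classifier(torch.cat(feats, dim=1))

    if __name__ == "__main__":
        logits = MultiTimescaleCNN()(torch.randn(4, 2, 400))
        print(logits.shape)  # torch.Size([4, 2])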

  • L. Wen, Y. Liu, and H. Li, “CL-MAPF: Multi-Agent Path Finding for Car-Like Robots with Kinematic and Spatiotemporal Constraints," Robotics and Autonomous Systems, vol. 150, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    Multi-Agent Path Finding has been widely studied in the past few years due to its broad application in the field of robotics and AI. However, previous solvers rely on several simplifying assumptions. This limits their applicability in numerous real-world domains that adopt nonholonomic car-like agents rather than holonomic ones. In this paper, we give a mathematical formalization of the Multi-Agent Path Finding for Car-Like robots (CL-MAPF) problem. We propose a novel hierarchical search-based solver called Car-Like Conflict-Based Search to address this problem. It applies a body conflict tree to address collisions considering the shapes of the agents. We introduce a new algorithm called Spatiotemporal Hybrid-State A* as the single-agent planner to generate agents’ paths satisfying both kinematic and spatiotemporal constraints. We also present a sequential planning version of our method, sacrificing a small amount of solution quality to achieve a significant reduction in runtime. We compare our method with two baseline algorithms on a dedicated benchmark and validate it in real-world scenarios. The experiment results show that the planning success rate of both baseline algorithms is below 50% for all six scenarios, while our algorithm maintains that of over 98%. It also gives clear evidence that our algorithm scales well to 100 agents in a 300 m x 300 m scenario and is able to produce solutions that can be directly applied to Ackermann-steering robots in the real world. The benchmark and source code are released at https://github.com/APRIL-ZJU/CL-CBS. The video of the experiments can be found on YouTube.

    @article{wen2022clm,
    title = {CL-MAPF: Multi-Agent Path Finding for Car-Like Robots with Kinematic and Spatiotemporal Constraints},
    author = {Licheng Wen and Yong Liu and Hongliang Li},
    year = 2022,
    journal = {Robotics and Autonomous Systems},
    volume = 150,
    doi = {10.1016/j.robot.2021.103997},
    abstract = {Multi-Agent Path Finding has been widely studied in the past few years due to its broad application in the field of robotics and AI. However, previous solvers rely on several simplifying assumptions. This limits their applicability in numerous real-world domains that adopt nonholonomic car-like agents rather than holonomic ones. In this paper, we give a mathematical formalization of the Multi-Agent Path Finding for Car-Like robots (CL-MAPF) problem. We propose a novel hierarchical search-based solver called Car-Like Conflict-Based Search to address this problem. It applies a body conflict tree to address collisions considering the shapes of the agents. We introduce a new algorithm called Spatiotemporal Hybrid-State A* as the single-agent planner to generate agents' paths satisfying both kinematic and spatiotemporal constraints. We also present a sequential planning version of our method, sacrificing a small amount of solution quality to achieve a significant reduction in runtime. We compare our method with two baseline algorithms on a dedicated benchmark and validate it in real-world scenarios. The experiment results show that the planning success rate of both baseline algorithms is below 50% for all six scenarios, while our algorithm maintains that of over 98%. It also gives clear evidence that our algorithm scales well to 100 agents in 300 m x 300 m scenario and is able to produce solutions that can be directly applied to Ackermann-steering robots in the real world. The benchmark and source code are released in https://github.com/APRIL-ZJU/CL-CBS. The video of the experiments can be found on YouTube.(C) 2021 Elsevier B.V. All rights reserved.}
    }
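
    One ingredient that distinguishes CL-MAPF from grid-based MAPF is the body conflict: two timed paths conflict when the agents' rectangular footprints overlap, not merely when cell indices coincide. The sketch below illustrates such a footprint-overlap check using shapely polygons; the footprint dimensions, the path format, and the omission of the Spatiotemporal Hybrid-State A* low-level planner are all simplifications, and this is not the released CL-CBS code.

    import numpy as np
    from shapely.geometry import Polygon

    def footprint(x, y, yaw, length=0.8, width=0.5):
        """Rectangular footprint of a car-like agent, rotated by its heading."""
        corners = np.array([[ length / 2,  width / 2], [ length / 2, -width / 2],
                            [-length / 2, -width / 2], [-length / 2,  width / 2]])
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s], [s, c]])
        return Polygon(corners @ R.T + np.array([x, y]))

    def first_body_conflict(path_a, path_b):
        """Return the first time step of two timed paths [(t, x, y, yaw), ...] at which
        the rectangular footprints overlap, or None if they never do."""
        for (t, xa, ya, ha), (_, xb, yb, hb) in zip(path_a, path_b):
            if footprint(xa, ya, ha).intersects(footprint(xb, yb, hb)):
                return t
        return None

    if __name__ == "__main__":
        a = [(t, 0.2 * t, 0.0, 0.0) for t in range(10)]
        b = [(t, 2.0 - 0.2 * t, 0.05, np.pi) for t in range(10)]
        # prints the first time step at which the two head-on footprints overlap
        print(first_body_conflict(a, b))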

  • J. Zhang, X. Zeng, C. Xu, and Y. Liu, “Real-Time Audio-Guided Multi-Face Reenactment," IEEE Signal Processing Letters, vol. 29, pp. 1-5, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    Audio-guided face reenactment aims to generate authentic target faces whose facial expressions match the input audio, and many learning-based methods have successfully achieved this. However, most methods can only reenact a particular person once trained or suffer from low-quality generation of the target images. Also, nearly none of the current reenactment works consider the model size and running speed that are important for practical use. To solve the above challenges, we propose an efficient Audio-guided Multi-face reenactment model named AMNet, which can reenact target faces among multiple persons with corresponding source faces and drive signals as inputs. Concretely, we design a Geometric Controller (GC) module to inject the drive signals so that the model can be optimized in an end-to-end manner and generate more authentic images. Also, we adopt a lightweight network for our face reenactor so that the model can run in real time on both CPU and GPU devices. Abundant experiments prove our approach’s superiority over existing methods, e.g., decreasing FID by 0.12 and increasing SSIM by 0.031 on average compared with APB2Face, while having 4x fewer parameters and 4x faster CPU speed.

    @article{zhang2022rta,
    title = {Real-Time Audio-Guided Multi-Face Reenactment},
    author = {Jiangning Zhang and Xianfang Zeng and Chao Xu and Yong Liu},
    year = 2022,
    journal = {IEEE Signal Processing Letters},
    volume = {29},
    pages = {1--5},
    doi = {10.1109/LSP.2021.3116506},
    abstract = {Audio-guided face reenactment aims to generate authentic target faces that have matched facial expression of the input audio, and many learning-based methods have successfully achieved this. However, mostmethods can only reenact a particular person once trained or suffer from the low-quality generation of the target images. Also, nearly none of the current reenactment works consider the model size and running speed that are important for practical use. To solve the above challenges, we propose an efficient Audio-guided Multi-face reenactment model named AMNet, which can reenact target faces among multiple persons with corresponding source faces and drive signals as inputs. Concretely, we design a Geometric Controller (GC) module to inject the drive signals so that the model can be optimized in an end-to-end manner and generate more authentic images. Also, we adopt a lightweight network for our face reenactor so that the model can run in realtime on both CPU and GPU devices. Abundant experiments prove our approach's superiority over existing methods, e.g., averagely decreasing FID by 0.12. and increasing SSIM by 0.031. than APB2Face, while owning fewer parameters (x4 down arrow) and faster CPU speed (x4 up arrow).}
    }

  • J. Cao, Y. Wang, Y. Liu, and X. Ni, “Multi-Robot Learning Dynamic Obstacle Avoidance in Formation with Information-Directed Exploration," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, iss. 6, pp. 1357-1367, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper presents an algorithm that generates distributed collision-free velocities for multiple robots while maintaining formation as much as possible. The adaptive formation problem is cast as a sequential decision-making problem, which is solved using reinforcement learning that trains several distributed policies to avoid dynamic obstacles on top of consensus velocities. We construct the policy with Bayesian Linear Regression based on a neural network (called BNL) to compute the state-action value uncertainty efficiently for sequential decision making. Information-directed sampling is applied in our BNL policy to achieve efficient exploration. By further combining distributional reinforcement learning, we can estimate the intrinsic uncertainty of the state-action value globally and more accurately. For continuous control tasks, efficient exploration can be achieved by optimizing a policy with the sampled action value function from a BNL model. Through our experiments in some contextual bandit and sequential decision-making tasks, we show that exploration with the BNL model has improved efficiency in both computation and training samples. By augmenting the consensus velocities with our BNL policy, experiments on multi-robot navigation demonstrate that adaptive formation is achieved.

    @article{cao2021mrl,
    title = {Multi-Robot Learning Dynamic Obstacle Avoidance in Formation with Information-Directed Exploration},
    author = {Junjie Cao and Yujie Wang and Yong Liu and Xuesong Ni},
    year = 2022,
    journal = {IEEE Transactions on Emerging Topics in Computational Intelligence},
    volume = {6},
    number = {6},
    pages = {1357-1367},
    doi = {10.1109/TETCI.2021.3127925},
    abstract = {This paper presents an algorithm that generates distributed collision-free velocities for multi-robot while maintain formation as much as possible. The adaptive formation problem is cast as a sequential decision-making problem, which is solved using reinforcement learning that trains several distributed policies to avoid dynamic obstacles on the top of consensus velocities. We construct the policy with Bayesian Linear Regression based on a neural network (called BNL) to compute the state-action value uncertainty efficiently for sequential decision making. The information-directed sampling is applied in our BNL policy to achieve efficient exploration. By further combining the distributional reinforcement learning, we can estimate the intrinsic uncertainty of the state-action value globally and more accurately. For continuous control tasks, efficient exploration can be achieved by optimizing a policy with the sampled action value function from a BNL model. Through our experiments in some contextual Bandit and sequential decision-making tasks, we show that exploration with the BNL model has improved efficiency in both computation and training samples. By augmenting the consensus velocities with our BNL policy, experiments on Multi-Robot navigation demonstrate that adaptive formation is achieved.}
    }
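
    The core of the BNL policy, as described, is a Bayesian linear regression head on top of neural-network features, which gives a closed-form Gaussian posterior over the last-layer weights and therefore cheap samples of the value uncertainty. The sketch below implements only that textbook BLR update and Thompson-style value sampling with assumed noise and prior variances; the information-directed sampling rule and the distributional extension from the paper are not shown.

    import numpy as np

    class BayesianLinearHead:
        """Gaussian posterior over last-layer weights for features phi(s, a)."""
        def __init__(self, dim, noise_var=0.1, prior_var=1.0):
            self.noise_var = noise_var
            self.precision = np.eye(dim) / prior_var   # posterior precision
            self.b = np.zeros(dim)                     # precision-weighted mean term

        def update(self, phi, target):
            self.precision += np.outer(phi, phi) / self.noise_var
            self.b += phi * target / self.noise_var

        def posterior(self):
            cov = np.linalg.inv(self.precision)
            return cov @ self.b, cov

        def sample_value(self, phi, rng):
            """Thompson-style sample of the value at features phi."""
            mean, cov = self.posterior()
            w = rng.multivariate_normal(mean, cov)
            return phi @ w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        head = BayesianLinearHead(dim=3)
        w_true = np.array([1.0, -2.0, 0.5])
        for _ in range(200):
            phi = rng.normal(size=3)
            head.update(phi, phi @ w_true + 0.1 * rng.normal())
        print(head.posterior()[0])                 # close to w_true
        print(head.sample_value(np.ones(3), rng))  # noisy estimate of 1 - 2 + 0.5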

  • C. Deng, M. Wang, L. Liu, Y. Liu, and Y. Jiang, “Extended feature pyramid network for small object detection," IEEE Transactions on Multimedia, vol. 24, pp. 1968-1979, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    Small object detection remains an unsolved challenge because it is hard to extract information of small objects with only a few pixels. While scale-level corresponding detection in feature pyramid network alleviates this problem, we find feature coupling of various scales still impairs the performance of small objects. In this paper, we propose an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which is used to super-resolve features and extract credible regional details simultaneously. Moreover, we introduce a cross resolution distillation mechanism to transfer the ability of perceiving details across the scales of the network, where a foreground-background-balanced loss function is designed to alleviate area imbalance of foreground and background. In our experiments, the proposed EFPN is efficient on both computation and memory, and yields state-of-the-art results on small traffic-sign dataset Tsinghua-Tencent 100K and small category of general object detection dataset MS COCO.

    @article{deng2022efp,
    title = {Extended feature pyramid network for small object detection},
    author = {Chunfang Deng and Mengmeng Wang and Liang Liu and Yong Liu and Yunliang Jiang},
    year = 2022,
    journal = {IEEE Transactions on Multimedia},
    volume = {24},
    pages = {1968-1979},
    doi = {10.1109/TMM.2021.3074273},
    abstract = {Small object detection remains an unsolved challenge because it is hard to extract information of small objects with only a few pixels. While scale-level corresponding detection in feature pyramid network alleviates this problem, we find feature coupling of various scales still impairs the performance of small objects. In this paper, we propose an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which is used to super-resolve features and extract credible regional details simultaneously. Moreover, we introduce a cross resolution distillation mechanism to transfer the ability of perceiving details across the scales of the network, where a foreground-background-balanced loss function is designed to alleviate area imbalance of foreground and background. In our experiments, the proposed EFPN is efficient on both computation and memory, and yields state-of-the-art results on small traffic-sign dataset Tsinghua-Tencent 100K and small category of general object detection dataset MS COCO.}
    }

  • J. Fan, S. Gao, Y. Liu, X. Ma, J. Yang, and C. Fan, “Semisupervised Game Player Categorization From Very Big Behavior Log Data," IEEE Transactions on Systems Man Cybernetics-Systems, vol. 52, iss. 6, pp. 3419-3430, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    Extracting a specific category of players, such as malignant bots, from the huge log data of massive multiplayer online role playing games (MMORPGs) is an important basic task in game security and personal recommendation. In this article, we propose a parallel semisupervised framework to categorize specific game players with a few label-known target samples, which are denoted as bait players. Our approach first presents a feature representation model based on the players’ level granularity, which can acquire aligned feature representations in the lower dimensional space from the players’ original action sequences. Then, we propose a semisupervised clustering method, extended from the bisecting k-means model, to extract the specified players with the help of those bait players. Due to massive amounts of game log data, the computation complexity is an extreme challenge to implement our feature representation and semisupervised extraction approaches. We also propose a hierarchical parallelism framework, which allows the data to be computed horizontally and vertically simultaneously and enables varied parallel combinations for the steps of our semisupervised categorization approach. Comparative experiments on real-world MMORPGs’ log data, containing more than 465 GB of data and millions of players, are carried out to demonstrate the effectiveness and efficiency of our proposed approach compared with the state-of-the-art methods.

    @article{fan2022sgp,
    title = {Semisupervised Game Player Categorization From Very Big Behavior Log Data},
    author = {Jing Fan and Shaowen Gao and Yong Liu and Xinqiang Ma and Jiandang Yang and Changjie Fan},
    year = 2022,
    journal = {IEEE Transactions on Systems Man Cybernetics-Systems},
    volume = {52},
    number = {6},
    pages = {3419-3430},
    doi = {10.1109/TSMC.2021.3066545},
    abstract = {Extracting the specific category of the players, such as the malignant Bot, from the huge log data of the massive multiplayer online role playing games, denoted as MMORPGs, is an important basic task in game security and personal recommendation. In this article, we propose a parallel semisupervised framework to categorize specific game players with a few labelknown target samples, which are denoted as bait players. Our approach first presents a feature representation model based on the players’ level granularity, which can acquire aligned feature representations in the lower dimensional space from the players’ original action sequences. Then, we propose a semisupervised clustering method, extended from the bisecting k-means model, to extract the specified players with the help of those bait players. Due to massive amounts of game log data, the computation complexity is an extreme challenge to implement our feature representation and semisupervised extraction approaches. We also propose a hierarchical parallelism framework, which allows the data to be computed horizontally and vertically simultaneously and enables varied parallel combinations for the steps of our semisupervised categorization approach. The comparable experiments on real-world MMORPGs’ log data, containing more than 465 Gbytes and million players, are carried out to demonstrate the effectiveness and efficiency of our proposed approach compared with the state-of-the-art methods.}
    }

2021

  • Q. Shen, J. Lou, Y. Liu, and Y. Jiang, “Hesitant fuzzy multi-attribute decision making based on binary connection number of set pair analysis," Soft Computing, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    To objectively evaluate the influence of hesitant fuzziness on the ranking of alternatives in multi-attribute decision making with hesitant fuzzy or probabilistic hesitant fuzzy information, the binary connection number of set pair analysis is applied to hesitant fuzzy multi-attribute decision making. The hesitant or probabilistic hesitant fuzzy set is transformed to the binary connection number. A hesitant fuzzy multi-attribute decision making model based on binary connection number is then established. Binary connection number theory is utilized to obtain the hesitant fuzzy center and decision-making suggestions about the alternative ranking under different hesitant fuzzy conditions. Experimental examples show that the hesitant fuzzy multi-attribute decision making model based on binary connection number has a certain versatility. It can determine the optimal scheme under the influence of hesitant fuzziness on the alternative ranking and contains the results of the same hesitant fuzzy decision-making problem using other methods, which helps in targeted decision making according to different hesitant fuzzy conditions.

    @article{shen2021hesitantfm,
    title = {Hesitant fuzzy multi-attribute decision making based on binary connection number of set pair analysis},
    author = {Qing Shen and Jungang Lou and Yong Liu and Yunliang Jiang},
    year = 2021,
    journal = {Soft Computing},
    doi = {10.1007/s00500-021-06215-0},
    abstract = {To objectively evaluate the influence of hesitant fuzziness on the ranking of alternatives in multi-attribute decision making with hesitant fuzzy or probabilistic hesitant fuzzy information, the binary connection number of set pair analysis is applied to hesitant fuzzy multi-attribute decision making. The hesitant or probabilistic hesitant fuzzy set is transformed to the binary connection number. A hesitant fuzzy multi-attribute decision making model based on binary connection number is then established. Binary connection number theory is utilized to obtain the hesitant fuzzy center and decision-making suggestions about the alternative ranking under different hesitant fuzzy conditions. Experimental examples show that the hesitant fuzzy multi-attribute decision making model based on binary connection number has a certain versatility. It can determine the optimal scheme under the influence of hesitant fuzziness on the alternative ranking and contains the results of the same hesitant fuzzy decision-making problem using other methods, which helps in targeted decision making according to different hesitant fuzzy conditions.}
    }

  • G. Tian, Y. Sun, Y. Liu, X. Zeng, M. Wang, Y. Liu, J. Zhang, and J. Chen, “Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention," IEEE Transactions on Neural Networks and Learning Systems, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    Filter pruning is a significant feature selection technique to shrink the existing feature fusion schemes (especially on convolution calculation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somehow tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline suffer from being sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue through one stage: using an attention-based architecture that adaptively fuses the filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) to make the model focus on the filters of higher significance by training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate the gradients in the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on the complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on the two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.

    @article{tian2021abp,
    title = {Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention},
    author = {Guanzhong Tian and Yiran Sun and Yuang Liu and Xianfang Zeng and Mengmeng Wang and Yong Liu and Jiangning Zhang and Jun Chen},
    year = 2021,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2021.3106917},
    abstract = {Filter pruning is a significant feature selection technique to shrink the existing feature fusion schemes (especially on convolution calculation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somehow tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline suffer from being sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue through one stage: using an attention-based architecture that adaptively fuses the filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) to make the model focus on the filters of higher significance by training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate the gradients in the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on the complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on the two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.}
    }
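
    The auxiliary attention layer with binary significance scores can be illustrated with a per-filter gate trained through a straight-through estimator: the forward pass thresholds the scores to {0, 1}, while gradients flow through the soft sigmoid. The specific gradient estimator in the paper is derived and analyzed there; the generic straight-through trick below is only a stand-in, and the threshold and initialization are assumptions.

    import torch
    import torch.nn as nn

    class BinaryFilterGate(nn.Module):
        """Per-filter binary gates: hard {0, 1} values in the forward pass, gradients
        passed through the soft sigmoid scores (straight-through estimator)."""
        def __init__(self, num_filters):
            super().__init__()
            self.scores = nn.Parameter(0.01 * torch.randn(num_filters))

        def forward(self, x):                       # x: (N, C, H, W)
            soft = torch.sigmoid(self.scores)
            hard = (soft > 0.5).float()
            # value equals `hard`, gradient flows through `soft`
            gate = hard + soft - soft.detach()
            return x * gate.view(1, -1, 1, 1)

    if __name__ == "__main__":
        conv = nn.Conv2d(3, 8, 3, padding=1)
        gate = BinaryFilterGate(8)
        y = gate(conv(torch.randn(2, 3, 16, 16)))
        y.sum().backward()
        print(gate.scores.grad)  # non-zero: gradients reach the binary gates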

  • S. Liu, J. Cao, Y. Wang, W. Chen, and Y. Liu, “Self-play reinforcement learning with comprehensive critic in computer games," Neurocomputing, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    Self-play reinforcement learning, where agents learn by playing with themselves, has been successfully applied in many game scenarios. However, the training procedure for self-play reinforcement learning is unstable and more sample-inefficient than (general) reinforcement learning, especially in imperfect information games. To improve the self-play training process, we incorporate a comprehensive critic into the policy gradient method to form a self-play actor-critic (SPAC) method for training agents to play computer games. We evaluate our method in four different environments in both competitive and cooperative tasks. The results show that the agent trained with our SPAC method outperforms those trained with deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO) algorithms in many different evaluation approaches, which vindicates the effect of our comprehensive critic in the self-play training procedure.

    @article{liu2021spr,
    title = {Self-play reinforcement learning with comprehensive critic in computer games},
    author = {Shanqi Liu and Junjie Cao and Yujie Wang and Wenzhou Chen and Yong Liu},
    year = 2021,
    journal = {Neurocomputing},
    doi = {10.1016/j.neucom.2021.04.006},
    abstract = {Self-play reinforcement learning, where agents learn by playing with themselves, has been successfully applied in many game scenarios. However, the training procedure for self-play reinforcement learning is unstable and more sample-inefficient than (general) reinforcement learning, especially in imperfect information games. To improve the self-play training process, we incorporate a comprehensive critic into the policy gradient method to form a self-play actor-critic (SPAC) method for training agents to play com-puter games. We evaluate our method in four different environments in both competitive and coopera-tive tasks. The results show that the agent trained with our SPAC method outperforms those trained with deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO) algorithms in many different evaluation approaches, which vindicate the effect of our comprehensive critic in the self-play training procedure. CO 2021 Elsevier B.V. All rights reserved.}
    }

  • C. Xu, X. Wu, Y. Li, Y. Jin, M. Wang, and Y. Liu, “Cross-modality online distillation for multi-view action recognition," Neurocomputing, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    Recently, multi-modality features have been introduced into multi-view action recognition methods in order to obtain more robust performance. However, not all modalities are available in real applications, such as daily scenes that lack depth data and capture only RGB sequences. This raises the challenge of how to learn critical features from multi-modality data while relying only on RGB sequences and still obtaining robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework: the teacher network takes advantage of multi-view geometry-and-texture features during training, while the student network is given only RGB sequences at test time. Specifically, in the first stage, a Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, we design a Viewpoint-Aware Attention (VAA) module which captures discriminative information across different views to effectively combine multi-view features. In the second stage, the Multi-view Features Strengthen (MFS) network, which also contains the VAA module, further strengthens the global view-invariant features of the student network. Besides, both CAT and MFS learn in an online distillation manner so that the teacher and the student network can be trained jointly. Extensive experiments on IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.

    @article{xu2021cmo,
    title = {Cross-modality online distillation for multi-view action recognition},
    author = {Chao Xu and Xia Wu and Yachun Li and Yining Jin and Mengmeng Wang and Yong Liu},
    year = 2021,
    journal = {Neurocomputing},
    doi = {10.1016/j.neucom.2021.05.077},
    abstract = {Recently, some multi-modality features are introduced to the multi-view action recognition methods in order to obtain a more robust performance. However, it is intuitive that not all modalities are available in real applications, such as daily scenes that missing depth modal data and capture RGB sequences only. This raises the challenge of how to learn critical features from multi-modality data, while relying on RGB sequences and still get robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework, the teacher network takes advantage of multi-view geometry-andtexture features during training, while a student network only given RGB sequences at test time. Specifically, in the first stage, Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, We design a Viewpoint-Aware Attention (VAA) module which captures discriminative information across different views to e_ectively combine multi-view features. In the second stage, Multi-view Features Strengthen (MFS) network that contains VAA module as well further strengthen the global view-invariance features of the student network. Besides, both of CAT and MFS learn in an online distillation manner so that the teacher and the student network can be trained jointly. Extensive experiments at IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.}
    }

  • Z. Zhang, J. Yan, X. Kong, G. Zhai, and Y. Liu, “Efficient Motion Planning based on Kinodynamic Model for Quadruped Robots Following Persons in Confined Spaces," IEEE/ASME Transactions on Mechatronics, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    Quadruped robots have superior terrain adaptability and more flexible movement capabilities than traditional robots. In this paper, we apply them to person-following tasks and propose an efficient motion planning scheme for quadruped robots to generate a flexible and effective trajectory in confined spaces. The method builds a real-time local costmap via onboard sensors, which involves both static and dynamic obstacles. We exploit a simplified kinodynamic model and formulate friction-pyramid inequality constraints on the Ground Reaction Forces (GRFs) to ensure the executability of the optimized trajectory. In addition, we obtain the optimal following trajectory in the costmap based entirely on the robot's rectangular footprint description, which ensures that it can walk through narrow spaces while avoiding collisions. Finally, a receding horizon control strategy is employed to improve the robustness of motion in complex environments. The proposed motion planning framework is integrated on the quadruped robot JueYing and tested in simulation as well as real scenarios. The execution success rates in various scenes are all over 90%.

    @article{zhang2021emp,
    title = {Efficient Motion Planning based on Kinodynamic Model for Quadruped Robots Following Persons in Confined Spaces},
    author = {Zhen Zhang and Jiaqing Yan and Xin Kong and Guangyao Zhai and Yong Liu},
    year = 2021,
    journal = {IEEE/ASME Transactions on Mechatronics},
    doi = {10.1109/TMECH.2021.3083594},
    abstract = {Quadruped robots have superior terrain adaptability and flexible movement capabilities than traditional robots. In this paper, we innovatively apply it in person-following tasks, and propose an efficient motion planning scheme for quadruped robots to generate a flexible and effective trajectory in confined spaces. The method builds a real-time local costmap via onboard sensors, which involves both static and dynamic obstacles. And we exploit a simplified kinodynamic model and formulate the friction pyramids formed by Ground Reaction Forces (GRFs) inequality constraints to ensure the executable of the optimized trajectory. In addition, we obtain the optimal following trajectory in the costmap completely based on the robots rectangular footprint description, which ensures that it can walk through the narrow spaces avoiding collision. Finally, a receding horizon control strategy is employed to improve the robustness of motion in complex environments. The proposed motion planning framework is integrated on the quadruped robot JueYing and tested in simulation as well as real scenarios. It shows that the execution success rates in various scenes are all over 90\%.}
    }

  • W. Liu, S. Liu, J. Cao, Q. Wang, X. Lang, and Y. Liu, “Learning Communication for Cooperation in Dynamic Agent-Number Environment," IEEE/ASME Transactions on Mechatronics, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    The number of agents in many multi-agent systems in the real world changes all the time, such as storage robots and drone cluster systems. Still, most current multi-agent reinforcement learning algorithms are limited to fixed network dimensions, and prior knowledge is used to preset the number of agents in the training phase, which leads to a poor generalization of the algorithm. In addition, these algorithms use centralized training to solve the instability problem of multi-agent systems. However, the centralized learning of large-scale multi-agent reinforcement learning algorithms will lead to an explosion of network dimensions, which in turn leads to very limited scalability of centralized learning algorithms. To solve these two difficulties, we propose Group Centralized Training and Decentralized Execution-Unlimited Dynamic Agent-number Network (GCTDE-UDAN). Firstly, since we use the attention mechanism to select several leaders and establish a dynamic number of teams, and UDAN performs a non-linear combination of all agents’ Q values when performing value decomposition, it is not affected by changes in the number of agents. Moreover, our algorithm can unite any agent to form a group and conduct centralized training within the group, avoiding network dimension explosion caused by global centralized training of large-scale agents. Finally, we verified on the simulation and experimental platform that the algorithm can learn and perform cooperative behaviors in many dynamic multi-agent environments.

    @article{liu2021lcf,
    title = {Learning Communication for Cooperation in Dynamic Agent-Number Environment},
    author = {Weiwei Liu and Shanqi Liu and Junjie Cao and Qi Wang and Xiaolei Lang and Yong Liu},
    year = 2021,
    journal = {IEEE/ASME Transactions on Mechatronics},
    doi = {10.1109/TMECH.2021.3076080},
    abstract = {The number of agents in many multi-agent systems in the real world changes all the time, such as storage robots and drone cluster systems. Still, most current multi-agent reinforcement learning algorithms are limited to fixed network dimensions, and prior knowledge is used to preset the number of agents in the training phase, which leads to a poor generalization of the algorithm. In addition, these algorithms use centralized training to solve the instability problem of multi-agent systems. However, the centralized learning of large-scale multi-agent reinforcement learning algorithms will lead to an explosion of network dimensions, which in turn leads to very limited scalability of centralized learning algorithms. To solve these two difficulties, we propose Group Centralized Training and Decentralized Execution-Unlimited Dynamic Agent-number Network (GCTDE-UDAN). Firstly, since we use the attention mechanism to select several leaders and establish a dynamic number of teams, and UDAN performs a non-linear combination of all agents' Q values when performing value decomposition, it is not affected by changes in the number of agents. Moreover, our algorithm can unite any agent to form a group and conduct centralized training within the group, avoiding network dimension explosion caused by global centralized training of large-scale agents. Finally, we verified on the simulation and experimental platform that the algorithm can learn and perform cooperative behaviors in many dynamic multi-agent environments.}
    }

  • Y. Jiang, K. Zhao, J. Cao, J. Fan, and Y. Liu, “Asynchronous parallel hyperparameter search with population evolution," Control and Decision, vol. 36, iss. 8, pp. 1825-1833, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    In recent years, as deep learning models, especially deep reinforcement learning models, have kept growing, the training cost, that is, the search space of hyperparameters, has also continuously increased. However, most traditional hyperparameter search algorithms are based on sequential execution of training, which often takes weeks or even months to find a better hyperparameter configuration. In order to solve the problems of long hyperparameter search times and the difficulty of finding a good hyperparameter configuration for deep reinforcement learning, this paper proposes a new hyperparameter search algorithm named asynchronous parallel hyperparameter search with population evolution. This algorithm combines the idea of evolutionary algorithms and uses a fixed resource budget to search the population of models and their hyperparameters asynchronously and in parallel, thereby improving the performance of the algorithm. The search algorithm is implemented on the Ray parallel distributed framework. Experiments show that the asynchronous parallel search with population evolution on the parallel framework outperforms traditional hyperparameter search algorithms, and its performance is stable.

    @article{fan2021aph,
    title = {Asynchronous parallel hyperparameter search with population evolution},
    author = {Yunliang Jiang and Kang Zhao and Junjie Cao and Jing Fan and Yong Liu},
    year = 2021,
    journal = {Control and Decision},
    volume = 36,
    pages = {1825--1833},
    doi = {10.13195/j.kzyjc.2019.1743},
    issue = 8,
    abstract = {In recent years, with the continuous increase of deep learning models, especially deep reinforcement learning models, the training cost, that is, the search space of hyperparameters, has also continuously increased. However, most traditional hyperparameter search algorithms are based on sequential execution of training, which often takes weeks or even months to find a better hyperparameter configuration. In order to solve the problem of the long search time hyperparameters and the difficulty in finding a better hyperparameter of deep reinforcement learning configuration, this paper proposes a new hyper-parameter search algorithm, named asynchronous parallel hyperparameter search with population evolution. This algorithm combines the idea of evolutionary algorithms and uses a fixed resource budget to search the population model and its hyperparameters asynchronously and in parallel, thereby improving the performance of the algorithm. It is realized that a parameter search algorithm can run on the Ray parallel distributed framework. Experiments show that the parametric asynchronous parallel search based on population evolution on the parallel framework is better than the traditional hyperparameter search algorithm, and its performance is stable.}
    }
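
    The search procedure described above is in the spirit of population-based training: workers train in short bursts, and poor performers copy a top performer's state and perturb its hyperparameters. The sketch below is a synchronous, single-process simplification with toy exploit/explore rules (truncation selection, multiplicative perturbation); the paper's version runs asynchronously on Ray, which is not reproduced here, and all names and the toy problem are assumptions.

    import random

    def population_search(train_step, evaluate, init_hparams, pop_size=8, rounds=20):
        """Population of (hparams, state) members trained in short bursts; the worst
        quarter copies (exploits) a top performer and perturbs (explores) its hparams."""
        population = [{"hparams": init_hparams(), "state": None, "score": float("-inf")}
                      for _ in range(pop_size)]
        for _ in range(rounds):
            for member in population:
                member["state"] = train_step(member["hparams"], member["state"])
                member["score"] = evaluate(member["state"])
            population.sort(key=lambda m: m["score"], reverse=True)
            top, bottom = population[: pop_size // 4], population[-(pop_size // 4):]
            for loser in bottom:                      # exploit + explore
                winner = random.choice(top)
                loser["state"] = winner["state"]
                loser["hparams"] = {k: v * random.choice([0.8, 1.2])
                                    for k, v in winner["hparams"].items()}
        return max(population, key=lambda m: m["score"])

    if __name__ == "__main__":
        # toy problem: "training" nudges a scalar toward 1.0 at a learned rate
        best = population_search(
            train_step=lambda hp, s: (s or 0.0) + hp["lr"] * (1.0 - (s or 0.0)),
            evaluate=lambda s: -abs(1.0 - s),
            init_hparams=lambda: {"lr": random.uniform(0.01, 0.5)},
        )
        print(best["hparams"], round(best["score"], 4))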

  • D. Yang, Z. Pan, Y. Cao, Y. Wang, X. Lai, J. Yang, and Y. Liu, “Wind measurement by computer vision on unmanned sailboat," International Journal of Intelligent Robotics and Applications, vol. 5, iss. 2, pp. 252-263, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    The measurement accuracy of wind direction and wind speed is very important for unmanned sailboat control, but both mature mechanical wind sensors and ultrasonic wind sensors have significant drawbacks when applied to unmanned sailboats. Inspired by previous works on neural networks, we propose a low-cost, real-time, and robust wind measurement system based on computer vision (CV). This CV-wind-sensor includes an airflow rope and a camera, which can be simply deployed on the sailboat. We implement a prototype system on the FPGA platform and run a series of experiments that demonstrate the promising performance of our system. For example, the absolute measurement loss of the CV sensor in this paper is basically kept below 0.4 m/s, which shows a clear accuracy advantage over the mechanical sensor.

    @article{yang2021wmb,
    title = {Wind measurement by computer vision on unmanned sailboat},
    author = {Dasheng Yang and Zaisheng Pan and Yan Cao and Yifan Wang and Xiao Lai and Jian Yang and Yong Liu},
    year = 2021,
    journal = {International Journal of Intelligent Robotics and Applications},
    volume = 5,
    pages = {252--263},
    doi = {10.1007/s41315-021-00171-6},
    issue = 2,
    abstract = {The measurement accuracy of wind direction and wind speed is very important for unmanned sailboat control, but mature mechanical wind sensors and ultrasonic wind sensors both have serious drawbacks when applied to unmanned sailboats. Inspired by previous work on neural networks, we propose a low-cost, real-time, and robust wind measurement system based on computer vision (CV). This CV wind sensor consists of an airflow rope and a camera and can be easily deployed on the sailboat. We implement a prototype system on an FPGA platform and run a series of experiments that demonstrate the promising performance of our system. For example, the absolute measurement error of the CV sensor is essentially kept below 0.4 m/s, which shows a clear accuracy advantage over the mechanical sensor.}
    }

  • S. Lin, J. Huang, W. Chen, W. Zhou, J. Xu, Y. Liu, and J. Yao, “Intelligent warehouse monitoring based on distributed system and edge computing," International Journal of Intelligent Robotics and Applications, vol. 5, pp. 130–142, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper focuses on calculating the volume of materials in warehouses where sand and gravel are stored, and on monitoring in real time whether materials are running low. Specifically, we propose a sandpile model together with the point cloud projection obtained from the LiDAR sensors to calculate the material volume. We use distributed edge computing modules to build a centralized system and transmit data remotely through a high-power wireless network, which solves sensor placement and data transmission in a complex warehouse environment. Our centralized system also reduces worker involvement in the harsh factory environment. Furthermore, the point cloud data of the warehouse is colored to visualize the actual factory environment. The system has been deployed in a real factory environment and achieves good performance.

    @article{huang2021iwm,
    title = {Intelligent warehouse monitoring based on distributed system and edge computing},
    author = {Sen Lin and Jianxin Huang and Wenzhou Chen and Wenlong Zhou and Jinhong Xu and Yong Liu and Jinqiang Yao},
    year = 2021,
    journal = {International Journal of Intelligent Robotics and Applications},
    volume = 5,
    pages = {130--142},
    doi = {10.1007/s41315-021-00173-4},
    issue = 2,
    abstract = {This paper focuses on calculating the volume of materials in warehouses where sand and gravel are stored, and on monitoring in real time whether materials are running low. Specifically, we propose a sandpile model together with the point cloud projection obtained from the LiDAR sensors to calculate the material volume. We use distributed edge computing modules to build a centralized system and transmit data remotely through a high-power wireless network, which solves sensor placement and data transmission in a complex warehouse environment. Our centralized system also reduces worker involvement in the harsh factory environment. Furthermore, the point cloud data of the warehouse is colored to visualize the actual factory environment. The system has been deployed in a real factory environment and achieves good performance.}
    }
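
    A minimal sketch of the point-cloud-projection idea behind the volume estimate, assuming the LiDAR points are already expressed in a warehouse frame with z pointing up; the 0.1 m grid resolution and the heightmap-style summation are illustrative choices rather than the paper's exact sandpile model.

    import numpy as np

    def pile_volume(points, cell=0.1, floor_z=0.0):
        """Estimate material volume by projecting a point cloud onto a 2D grid
        and summing per-cell surface heights (a simple heightmap approximation)."""
        idx = np.floor(points[:, :2] / cell).astype(int)
        idx -= idx.min(axis=0)                          # shift cell indices to start at 0
        grid = np.zeros(idx.max(axis=0) + 1)            # one height value per grid cell
        heights = np.maximum(points[:, 2] - floor_z, 0.0)
        np.maximum.at(grid, (idx[:, 0], idx[:, 1]), heights)  # keep the top surface
        return grid.sum() * cell * cell                 # sum of height * cell area

    # Synthetic cone-shaped pile as a stand-in for a scanned sand pile
    rng = np.random.default_rng(0)
    pts = rng.uniform(-1.0, 1.0, size=(20000, 3))
    pts[:, 2] = np.clip(1.0 - np.hypot(pts[:, 0], pts[:, 1]), 0.0, None)
    print(f"estimated volume: {pile_volume(pts):.2f} m^3")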

  • J. Zhang, C. Xu, X. Zhao, L. Liu, Y. Liu, J. Yao, and Z. Pan, “Learning hierarchical and efficient Person re-identification for robotic navigation," International Journal of Intelligent Robotics and Applications, vol. 5, pp. 104–118, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    Recent works on the person re-identification task mainly focus on model accuracy while ignoring factors related to efficiency, e.g., model size and latency, which are critical for practical applications. In this paper, we propose a novel Hierarchical and Efficient Network (HENet) that learns an ensemble of hierarchical global, partial, and recovery features under the supervision of multiple loss combinations. To further improve robustness against irregular occlusion, we propose a new data augmentation approach, dubbed random polygon erasing, which randomly erases irregular areas of the input image to imitate missing body parts. We also propose an Efficiency Score (ES) metric to evaluate model efficiency. Extensive experiments on the Market1501, DukeMTMC-ReID, and CUHK03 datasets show the efficiency and superiority of our approach compared with epoch-making methods. We further deploy HENet on a robotic car, and the experimental results demonstrate the effectiveness of our method for robotic navigation.

    @article{zhang2021lha,
    title = {Learning hierarchical and efficient Person re-identification for robotic navigation},
    author = {Jiangning Zhang and Chao Xu and Xiangrui Zhao and Liang Liu and Yong Liu and Jinqiang Yao and Zaisheng Pan},
    year = 2021,
    journal = {International Journal of Intelligent Robotics and Applications},
    volume = 5,
    pages = {104--118},
    doi = {10.1007/s41315-021-00167-2},
    issue = 2,
    abstract = {Recent works on the person re-identification task mainly focus on model accuracy while ignoring factors related to efficiency, e.g., model size and latency, which are critical for practical applications. In this paper, we propose a novel Hierarchical and Efficient Network (HENet) that learns an ensemble of hierarchical global, partial, and recovery features under the supervision of multiple loss combinations. To further improve robustness against irregular occlusion, we propose a new data augmentation approach, dubbed random polygon erasing, which randomly erases irregular areas of the input image to imitate missing body parts. We also propose an Efficiency Score (ES) metric to evaluate model efficiency. Extensive experiments on the Market1501, DukeMTMC-ReID, and CUHK03 datasets show the efficiency and superiority of our approach compared with epoch-making methods. We further deploy HENet on a robotic car, and the experimental results demonstrate the effectiveness of our method for robotic navigation.}
    }
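
    A minimal sketch of a random-polygon-erasing style augmentation on PIL images; the vertex count, region extent, erase probability, and fill color are illustrative parameters, not the settings used for HENet.

    import random
    from PIL import Image, ImageDraw

    def random_polygon_erase(img, num_vertices=6, max_extent=0.3, p=0.5):
        """Erase a random irregular polygon region to imitate a missing body part."""
        if random.random() > p:
            return img
        img = img.copy()
        w, h = img.size
        cx, cy = random.uniform(0, w), random.uniform(0, h)   # polygon center
        rx, ry = w * max_extent / 2, h * max_extent / 2       # maximum half-extent
        vertices = [(cx + random.uniform(-rx, rx), cy + random.uniform(-ry, ry))
                    for _ in range(num_vertices)]
        ImageDraw.Draw(img).polygon(vertices, fill=(127, 127, 127))
        return img

    # Usage: augmented = random_polygon_erase(Image.open("person.jpg").convert("RGB"))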

  • L. Liu, Y. Liao, Y. Wang, A. Geiger, and Y. Liu, “Learning Steering Kernels for Guided Depth Completion," IEEE Transactions on Image Processing, vol. 30, pp. 2850–2861, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper addresses the guided depth completion task in which the goal is to predict a dense depth map given a guidance RGB image and sparse depth measurements. Recent advances on this problem nurture hopes that one day we can acquire accurate and dense depth at a very low cost. A major challenge of guided depth completion is to effectively make use of extremely sparse measurements, e.g., measurements covering less than 1% of the image pixels. In this paper, we propose a fully differentiable model that avoids convolving on sparse tensors by jointly learning depth interpolation and refinement. More specifically, we propose a differentiable kernel regression layer that interpolates the sparse depth measurements via learned kernels. We further refine the interpolated depth map using a residual depth refinement layer which leads to improved performance compared to learning absolute depth prediction using a vanilla network. We provide experimental evidence that our differentiable kernel regression layer not only enables end-to-end training from very sparse measurements using standard convolutional network architectures, but also leads to better depth interpolation results compared to existing heuristically motivated methods. We demonstrate that our method outperforms many state-of-the-art guided depth completion techniques on both NYUv2 and KITTI. We further show the generalization ability of our method with respect to the density and spatial statistics of the sparse depth measurements.

    @article{liu2021lsk,
    title = {Learning Steering Kernels for Guided Depth Completion},
    author = {Lina Liu and Yiyi Liao and Yue Wang and Andreas Geiger and Yong Liu},
    year = 2021,
    journal = {IEEE Transactions on Image Processing},
    volume = 30,
    pages = {2850--2861},
    doi = {10.1109/TIP.2021.3055629},
    abstract = {This paper addresses the guided depth completion task in which the goal is to predict a dense depth map given a guidance RGB image and sparse depth measurements. Recent advances on this problem nurture hopes that one day we can acquire accurate and dense depth at a very low cost. A major challenge of guided depth completion is to effectively make use of extremely sparse measurements, e.g., measurements covering less than 1% of the image pixels. In this paper, we propose a fully differentiable model that avoids convolving on sparse tensors by jointly learning depth interpolation and refinement. More specifically, we propose a differentiable kernel regression layer that interpolates the sparse depth measurements via learned kernels. We further refine the interpolated depth map using a residual depth refinement layer which leads to improved performance compared to learning absolute depth prediction using a vanilla network. We provide experimental evidence that our differentiable kernel regression layer not only enables end-to-end training from very sparse measurements using standard convolutional network architectures, but also leads to better depth interpolation results compared to existing heuristically motivated methods. We demonstrate that our method outperforms many state-of-the-art guided depth completion techniques on both NYUv2 and KITTI. We further show the generalization ability of our method with respect to the density and spatial statistics of the sparse depth measurements.}
    }
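
    The interpolation idea can be illustrated with a plain Nadaraya-Watson kernel regression over the sparse samples; in the paper the kernels are predicted by a network and the result is further refined, so the fixed isotropic Gaussian below is an assumption for illustration only.

    import numpy as np

    def kernel_interpolate(sparse_depth, mask, sigma=4.0, radius=12):
        """Fill a dense depth map from sparse samples via Gaussian kernel regression."""
        h, w = sparse_depth.shape
        ys, xs = np.nonzero(mask)                 # pixel locations of valid samples
        vals = sparse_depth[ys, xs]
        dense = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                d2 = (ys - i) ** 2 + (xs - j) ** 2
                keep = d2 <= radius ** 2          # local support window
                if not np.any(keep):
                    continue
                wgt = np.exp(-d2[keep] / (2.0 * sigma ** 2))
                dense[i, j] = np.sum(wgt * vals[keep]) / np.sum(wgt)
        return dense

    # Example: reconstruct a synthetic depth ramp from ~1% random samples
    gt = np.tile(np.linspace(1.0, 5.0, 64), (48, 1))
    m = np.random.default_rng(0).random(gt.shape) < 0.01
    est = kernel_interpolate(gt * m, m)
    print(float(np.abs(est - gt)[est > 0].mean()))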

  • X. Zeng, W. Wu, G. Tian, F. Li, and Y. Liu, “Deep Superpixel Convolutional Network for Image Recognition," IEEE Signal Processing Letters, vol. 28, pp. 922-926, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    Due to the high representational efficiency, superpixel largely reduces the number of image primitives for subsequent processing. However, superpixel is scarcely utilized in recent methods since its irregular shape is intractable for standard convolutional layer. In this paper, we propose an end-to-end trainable superpixel convolutional network, named SPNet, to learn high-level representation on image superpixel primitives. We start by treating irregular superpixel lattices as a 2D point cloud, where the low-level features inside one superpixel are aggregated to one feature vector. We replace the standard convolutional layer with the PointConv layer to handle the irregular and unordered point cloud. Besides, we propose grid based downsampling strategies to output uniform 2D sampling result. The resulting network largely utilizes the efficiency of superpixel and provides a novel view for image recognition task. Experiments on image recognition task show promising results compared with prominent image classification methods. The visualization of class activation mapping shows great accuracy at object localization and boundary segmentation.

    @article{zeng2021deepsc,
    title = {Deep Superpixel Convolutional Network for Image Recognition},
    author = {Xianfang Zeng and Wenxuan Wu and Guangzhong Tian and Fuxin Li and Yong Liu},
    year = 2021,
    journal = {IEEE Signal Processing Letters},
    volume = 28,
    pages = {922-926},
    doi = {10.1109/LSP.2021.3075605},
    abstract = {Due to the high representational efficiency, superpixel largely reduces the number of image primitives for subsequent processing. However, superpixel is scarcely utilized in recent methods since its irregular shape is intractable for standard convolutional layer. In this paper, we propose an end-to-end trainable superpixel convolutional network, named SPNet, to learn high-level representation on image superpixel primitives. We start by treating irregular superpixel lattices as a 2D point cloud, where the low-level features inside one superpixel are aggregated to one feature vector. We replace the standard convolutional layer with the PointConv layer to handle the irregular and unordered point cloud. Besides, we propose grid based downsampling strategies to output uniform 2D sampling result. The resulting network largely utilizes the efficiency of superpixel and provides a novel view for image recognition task. Experiments on image recognition task show promising results compared with prominent image classification methods. The visualization of class activation mapping shows great accuracy at object localization and boundary segmentation.}
    }
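
    A minimal sketch of the aggregation step that turns a superpixel segmentation into a 2D point cloud: low-level features are mean-pooled per superpixel and paired with the superpixel centroid. The segmentation itself and the subsequent PointConv layers are outside this snippet.

    import numpy as np

    def superpixels_to_point_cloud(features, labels):
        """features: (H, W, C) low-level features; labels: (H, W) superpixel ids.
        Returns (N, 2) centroids and (N, C) mean-pooled features, one row per superpixel."""
        h, w, _ = features.shape
        yy, xx = np.mgrid[0:h, 0:w]
        centroids, pooled = [], []
        for sid in np.unique(labels):
            m = labels == sid
            centroids.append([yy[m].mean(), xx[m].mean()])   # 2D "point" position
            pooled.append(features[m].mean(axis=0))          # aggregated feature vector
        return np.asarray(centroids), np.asarray(pooled)

    # Toy example: 4 quadrant "superpixels" over random RGB features
    feats = np.random.default_rng(0).random((8, 8, 3))
    labs = np.repeat(np.repeat(np.arange(4).reshape(2, 2), 4, axis=0), 4, axis=1)
    pts, desc = superpixels_to_point_cloud(feats, labs)
    print(pts.shape, desc.shape)   # (4, 2) (4, 3)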

  • G. Tian, J. Chen, X. Zeng, and Y. Liu, “Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing," IEEE Signal Processing Letters, vol. 28, pp. 344–348, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    Filter pruning for a pre-trained convolutional neural network is normally performed through hand-crafted constraints or criteria such as norms, ranks, etc. Typically, the pruning pipeline comprises two stages: first learn a sparse structure from the original model, then optimize the weights in the newly pruned model. One disadvantage of using hand-crafted criteria to prune filters is that the design and selection of threshold criteria depend on complicated prior knowledge. Besides, the pruning process is less robust due to the impact of directly regularizing the filters. To address these problems, we propose an effective one-stage pruning framework: introducing a trainable collaborative layer to jointly prune and learn neural networks in one go. In our framework, we first add a binary collaborative layer for each original filter. Then, a new type of gradient estimator, the asymptotic gradient estimator, is introduced to pass the gradient through the binary collaborative layer. Finally, we simultaneously learn the sparse structure and optimize the weights of the original model during training. Our evaluation on typical benchmarks, CIFAR and ImageNet, demonstrates very promising results against other state-of-the-art filter pruning methods.

    @article{tian2021pbt,
    title = {Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing},
    author = {Guanzhong Tian and Jun Chen and Xianfang Zeng and Yong Liu},
    year = 2021,
    journal = {IEEE Signal Processing Letters},
    volume = 28,
    pages = {344--348},
    doi = {10.1109/LSP.2021.3054315},
    abstract = {Filter pruning for a pre-trained convolutional neural network is normally performed through hand-crafted constraints or criteria such as norms, ranks, etc. Typically, the pruning pipeline comprises two stages: first learn a sparse structure from the original model, then optimize the weights in the newly pruned model. One disadvantage of using hand-crafted criteria to prune filters is that the design and selection of threshold criteria depend on complicated prior knowledge. Besides, the pruning process is less robust due to the impact of directly regularizing the filters. To address these problems, we propose an effective one-stage pruning framework: introducing a trainable collaborative layer to jointly prune and learn neural networks in one go. In our framework, we first add a binary collaborative layer for each original filter. Then, a new type of gradient estimator, the asymptotic gradient estimator, is introduced to pass the gradient through the binary collaborative layer. Finally, we simultaneously learn the sparse structure and optimize the weights of the original model during training. Our evaluation on typical benchmarks, CIFAR and ImageNet, demonstrates very promising results against other state-of-the-art filter pruning methods.}
    }
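
    A minimal PyTorch-style sketch of jointly learning per-filter gates together with the convolution weights; the sigmoid-with-temperature surrogate below merely stands in for the paper's binary collaborative layer and asymptotic gradient estimator, so the exact gating function is an assumption.

    import torch
    import torch.nn as nn

    class GatedConv2d(nn.Module):
        """Convolution whose output channels are scaled by trainable, near-binary gates."""
        def __init__(self, in_ch, out_ch, k=3, temperature=1.0):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
            self.alpha = nn.Parameter(torch.zeros(out_ch))   # one gate logit per filter
            self.temperature = temperature                   # annealed upward in training

        def forward(self, x):
            # As the temperature grows the sigmoid saturates, pushing gates toward {0, 1}.
            gates = torch.sigmoid(self.temperature * self.alpha)
            return self.conv(x) * gates.view(1, -1, 1, 1)

    # After training, filters whose gates saturate at 0 can be physically removed.
    layer = GatedConv2d(16, 32)
    print(layer(torch.randn(2, 16, 8, 8)).shape)   # torch.Size([2, 32, 8, 8])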

  • W. Chen, J. Xu, X. Zhao, Y. Liu, and J. Yang, “Separated Sonar Localization System for Indoor Robot Navigation," IEEE Transactions on Industrial Electronics, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    This work addresses the task of mobile robot localization for indoor navigation. In this paper, we propose a novel indoor localization system based on separated sonar sensors that can be conveniently deployed in large-scale indoor environments. In our approach, the separated sonar receivers are deployed on the ceiling, and the mobile robot, equipped with the separated sonar transmitters, navigates in the indoor environment. The distance measurements between the receivers and the transmitters are obtained in real time from the receivers' control board with infrared synchronization, and the positions of the mobile robot can be computed without accumulated error. The proposed localization method achieves high precision in indoor localization tasks at very low cost. We also present a calibration method based on simultaneous localization and mapping (SLAM) to initialize the positions of our system. To evaluate the feasibility and the dynamic accuracy of the proposed system, we construct our localization system in the Virtual Robot Experimentation Platform (V-REP) simulator and deploy the system in a real-world environment. Both the simulation and real-world experiments demonstrate that our system achieves centimeter-level accuracy, which is sufficient for robot indoor navigation.

    @article{chen2021separatedsl,
    title = {Separated Sonar Localization System for Indoor Robot Navigation},
    author = {Wenzhou Chen and Jinhong Xu and Xiangrui Zhao and Yong Liu and Jian Yang},
    year = 2021,
    journal = {IEEE Transactions on Industrial Electronics},
    doi = {10.1109/TIE.2020.2994856},
    abstract = {This work addresses the task of mobile robot localization for indoor navigation. In this paper, we propose a novel indoor localization system based on separated sonar sensors that can be conveniently deployed in large-scale indoor environments. In our approach, the separated sonar receivers are deployed on the ceiling, and the mobile robot, equipped with the separated sonar transmitters, navigates in the indoor environment. The distance measurements between the receivers and the transmitters are obtained in real time from the receivers' control board with infrared synchronization, and the positions of the mobile robot can be computed without accumulated error. The proposed localization method achieves high precision in indoor localization tasks at very low cost. We also present a calibration method based on simultaneous localization and mapping (SLAM) to initialize the positions of our system. To evaluate the feasibility and the dynamic accuracy of the proposed system, we construct our localization system in the Virtual Robot Experimentation Platform (V-REP) simulator and deploy the system in a real-world environment. Both the simulation and real-world experiments demonstrate that our system achieves centimeter-level accuracy, which is sufficient for robot indoor navigation.}
    }
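
    The position computation from range measurements can be illustrated with standard linear least-squares trilateration; the receiver coordinates and ranges below are synthetic, and the paper's calibration and synchronization steps are omitted.

    import numpy as np

    def trilaterate(receivers, dists):
        """Solve ||x - p_i|| = d_i for x by differencing the squared range equations."""
        p0, d0 = receivers[0], dists[0]
        A = 2.0 * (receivers[1:] - p0)
        b = (d0 ** 2 - dists[1:] ** 2
             + np.sum(receivers[1:] ** 2, axis=1) - np.sum(p0 ** 2))
        x, *_ = np.linalg.lstsq(A, b, rcond=None)
        return x

    # Four ceiling receivers (one at a slightly different height to avoid a
    # degenerate linear system) and a transmitter on a robot at (1.0, 2.0, 0.3).
    rx = np.array([[0.0, 0.0, 3.0], [4.0, 0.0, 3.0], [0.0, 4.0, 3.0], [4.0, 4.0, 2.5]])
    robot = np.array([1.0, 2.0, 0.3])
    d = np.linalg.norm(rx - robot, axis=1)
    print(trilaterate(rx, d))   # ~[1.0, 2.0, 0.3]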

  • X. Zeng, Y. Pan, H. Zhang, M. Wang, G. Tian, and Y. Liu, “Unpaired Salient Object Translation via Spatial Attention Prior," Neurocomputing, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    With only set-level constraints, unpaired image translation is challenging in discovering the correct semantic-level correspondences between two domains. This limitation often results in false positives such as significantly changing the color and appearance of the background during image translation. To address this limitation, we propose the Spatial Attention-Aware Generative Adversarial Network (SAAGAN), a novel approach to jointly learn salient object discovery and translation. Specifically, our generator consists of (1) a spatial attention prediction branch and (2) an image translation branch. For the attention branch, we extract a spatial attention prior from a pre-trained classification network to provide weak supervision for object discovery. The proposed attention loss can largely stabilize the training process of the attention-guided generator. For the translation branch, we revise the classical adversarial loss for salient object translation, so that the discriminator only distinguishes the distribution of the object between the two domains. Moreover, we propose a fake sample augmentation strategy to provide extra spatial information for the discriminator. Our approach allows simultaneously locating the attention areas in each image and translating the related areas between two domains. Extensive experiments and evaluations show that our model can achieve more realistic mappings compared to state-of-the-art unpaired image translation methods.

    @article{zeng2021unpairedso,
    title = {Unpaired Salient Object Translation via Spatial Attention Prior},
    author = {Xianfang Zeng and Yusu Pan and Hao Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2021,
    journal = {Neurocomputing},
    doi = {10.1016/j.neucom.2020.05.105},
    abstract = {With only set-level constraints, unpaired image translation is challenging in discovering the correct semantic-level correspondences between two domains. This limitation often results in false positives such as significantly changing the color and appearance of the background during image translation. To address this limitation, we propose the Spatial Attention-Aware Generative Adversarial Network (SAAGAN), a novel approach to jointly learn salient object discovery and translation. Specifically, our generator consists of (1) a spatial attention prediction branch and (2) an image translation branch. For the attention branch, we extract a spatial attention prior from a pre-trained classification network to provide weak supervision for object discovery. The proposed attention loss can largely stabilize the training process of the attention-guided generator. For the translation branch, we revise the classical adversarial loss for salient object translation, so that the discriminator only distinguishes the distribution of the object between the two domains. Moreover, we propose a fake sample augmentation strategy to provide extra spatial information for the discriminator. Our approach allows simultaneously locating the attention areas in each image and translating the related areas between two domains. Extensive experiments and evaluations show that our model can achieve more realistic mappings compared to state-of-the-art unpaired image translation methods.}
    }

  • M. Zhang, X. Zuo, Y. Chen, Y. Liu, and M. Li, “Pose Estimation for Ground Robots: On Manifold Representation, Integration, Re-Parameterization, and Optimization," IEEE Transactions on Robotics, 2021.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    In this paper, we focus on motion estimation dedicated for non-holonomic ground robots, by probabilistically fusing measurements from the wheel odometer and exteroceptive sensors. For ground robots, the wheel odometer is widely used in pose estimation tasks, especially in applications under planar-scene based environments. However, since the wheel odometer only provides 2D motion estimates, it is extremely challenging to use that for performing accurate full 6D pose (3D position and 3D orientation) estimation. Traditional methods on 6D pose estimation either approximate sensor or motion models, at the cost of accuracy reduction, or rely on other sensors, e.g., inertial measurement unit (IMU), to provide complementary measurements. By contrast, in this paper, we propose a novel method to utilize the wheel odometer for 6D pose estimation, by modeling and utilizing motion manifold for ground robots. Our approach is probabilistically formulated and only requires the wheel odometer and an exteroceptive sensor (e.g., a camera). Specifically, our method i) formulates the motion manifold of ground robots by parametric representation, ii) performs manifold based 6D integration with the wheel odometer measurements only, and iii) re-parameterizes manifold equations periodically for error reduction. To demonstrate the effectiveness and applicability of the proposed algorithmic modules, we integrate that into a sliding-window pose estimator by using measurements from the wheel odometer and a monocular camera. By conducting extensive simulated and real-world experiments, we show that the proposed algorithm outperforms competing state-of-the-art algorithms by a significant margin in pose estimation accuracy, especially when deployed in complex large-scale real-world environments.

    @article{zhang2021poseef,
    title = {Pose Estimation for Ground Robots: On Manifold Representation, Integration, Re-Parameterization, and Optimization},
    author = {Mingming Zhang and Xingxing Zuo and Yiming Chen and Yong Liu and Mingyang Li},
    year = 2021,
    journal = {IEEE Transactions on Robotics},
    doi = {10.1109/TRO.2020.3043970},
    abstract = {In this paper, we focus on motion estimation dedicated for non-holonomic ground robots, by probabilistically fusing measurements from the wheel odometer and exteroceptive sensors. For ground robots, the wheel odometer is widely used in pose estimation tasks, especially in applications under planar-scene based environments. However, since the wheel odometer only provides 2D motion estimates, it is extremely challenging to use that for performing accurate full 6D pose (3D position and 3D orientation) estimation. Traditional methods on 6D pose estimation either approximate sensor or motion models, at the cost of accuracy reduction, or rely on other sensors, e.g., inertial measurement unit (IMU), to provide complementary measurements. By contrast, in this paper, we propose a novel method to utilize the wheel odometer for 6D pose estimation, by modeling and utilizing motion manifold for ground robots. Our approach is probabilistically formulated and only requires the wheel odometer and an exteroceptive sensor (e.g., a camera). Specifically, our method i) formulates the motion manifold of ground robots by parametric representation, ii) performs manifold based 6D integration with the wheel odometer measurements only, and iii) re-parameterizes manifold equations periodically for error reduction. To demonstrate the effectiveness and applicability of the proposed algorithmic modules, we integrate that into a sliding-window pose estimator by using measurements from the wheel odometer and a monocular camera. By conducting extensive simulated and real-world experiments, we show that the proposed algorithm outperforms competing state-of-the-art algorithms by a significant margin in pose estimation accuracy, especially when deployed in complex large-scale real-world environments.},
    arxiv = {https://arxiv.org/pdf/1909.03423.pdf}
    }
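
    As a hedged illustration of what a parametric motion manifold can look like, the constraint below assumes a locally quadratic height model with coefficient vector \mathbf{c}; the exact parameterization and the periodic re-parameterization schedule in the paper may differ.

    % Locally quadratic manifold model (illustrative assumption) and the
    % on-manifold constraint that the robot's 3D position must satisfy:
    \[
      f(x, y;\, \mathbf{c}) \;=\; c_0 + c_1 x + c_2 y + c_3 x^2 + c_4 x y + c_5 y^2,
      \qquad
      z(t) \;-\; f\bigl(x(t),\, y(t);\, \mathbf{c}\bigr) \;=\; 0 .
    \]

    Wheel-odometer increments are integrated subject to such a constraint, and the coefficient vector \mathbf{c} is re-estimated periodically so that the local surface model keeps tracking the terrain.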

  • J. Chen, L. Liu, Y. Liu, and X. Zeng, “A Learning Framework for n-Bit Quantized Neural Networks Toward FPGAs," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, pp. 1067–1081, 2021.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    The quantized neural network (QNN) is an efficient approach for network compression and can be widely used in the implementation of field-programmable gate arrays (FPGAs). This article proposes a novel learning framework for $n$-bit QNNs, whose weights are constrained to powers of two. To solve the gradient vanishing problem, we propose a reconstructed gradient function for QNNs in the back-propagation algorithm that can directly get the real gradient rather than estimating an approximate gradient of the expected loss. We also propose a novel QNN structure named $n$-BQ-NN, which uses shift operations to replace multiply operations and is more suitable for inference on FPGAs. Furthermore, we design a shift vector processing element (SVPE) array to replace all 16-bit multiplications with SHIFT operations in the convolution operations on FPGAs. We also carry out comparative experiments to evaluate our framework. The experimental results show that the quantized models of ResNet, DenseNet, and AlexNet trained through our learning framework can achieve almost the same accuracies as the original full-precision models. Moreover, when using our learning framework to train our $n$-BQ-NN from scratch, it can achieve state-of-the-art results compared with typical low-precision QNNs. Experiments on the Xilinx ZCU102 platform show that our $n$-BQ-NN with the SVPE can execute 2.9 times faster than with the vector processing element (VPE) in inference. As the SHIFT operations in our SVPE array do not consume digital signal processing (DSP) resources on FPGAs, the experiments also show that the SVPE array reduces average energy consumption to 68.7% of that of the 16-bit VPE array.

    @article{chen2021alf,
    title = {A Learning Framework for n-Bit Quantized Neural Networks Toward FPGAs},
    author = {Jun Chen and Liang Liu and Yong Liu and Xianfang Zeng},
    year = 2021,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    volume = 32,
    pages = {1067--1081},
    doi = {10.1109/TNNLS.2020.2980041},
    abstract = {The quantized neural network (QNN) is an efficient approach for network compression and can be widely used in the implementation of field-programmable gate arrays (FPGAs). This article proposes a novel learning framework for $n$-bit QNNs, whose weights are constrained to powers of two. To solve the gradient vanishing problem, we propose a reconstructed gradient function for QNNs in the back-propagation algorithm that can directly get the real gradient rather than estimating an approximate gradient of the expected loss. We also propose a novel QNN structure named $n$-BQ-NN, which uses shift operations to replace multiply operations and is more suitable for inference on FPGAs. Furthermore, we design a shift vector processing element (SVPE) array to replace all 16-bit multiplications with SHIFT operations in the convolution operations on FPGAs. We also carry out comparative experiments to evaluate our framework. The experimental results show that the quantized models of ResNet, DenseNet, and AlexNet trained through our learning framework can achieve almost the same accuracies as the original full-precision models. Moreover, when using our learning framework to train our $n$-BQ-NN from scratch, it can achieve state-of-the-art results compared with typical low-precision QNNs. Experiments on the Xilinx ZCU102 platform show that our $n$-BQ-NN with the SVPE can execute 2.9 times faster than with the vector processing element (VPE) in inference. As the SHIFT operations in our SVPE array do not consume digital signal processing (DSP) resources on FPGAs, the experiments also show that the SVPE array reduces average energy consumption to 68.7% of that of the 16-bit VPE array.},
    arxiv = {http://arxiv.org/pdf/2004.02396}
    }
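
    A minimal sketch of the power-of-two weight constraint and why it turns fixed-point multiplications into shifts; the rounding rule and bit width below are illustrative, not the exact n-BQ-NN scheme.

    import numpy as np

    def quantize_pow2(w, n_bits=4):
        """Constrain weights to signed powers of two, w_q = sign(w) * 2^e, so that a
        fixed-point multiply-accumulate can be implemented as a shift-accumulate."""
        sign = np.where(w >= 0, 1.0, -1.0)
        e = np.clip(np.round(np.log2(np.abs(w) + 1e-12)),
                    -(2 ** (n_bits - 1) - 1), 0).astype(int)
        return sign * np.ldexp(1.0, e), e

    w = np.array([0.30, -0.12, 0.55, 0.04])
    wq, exps = quantize_pow2(w)
    print(wq)     # [ 0.25    -0.125    0.5      0.03125]
    print(exps)   # [-2 -3 -1 -5]  -> each multiply becomes a right shift by |e| bits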

  • Z. Li, Y. Sun, G. Tian, L. Xie, Y. Liu, H. Su, and Y. He, “A compression pipeline for one-stage object detection model," Journal of Real-Time Image Processing, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    Deep neural networks (DNNs) have strong fitting ability on a variety of computer vision tasks, but they also require intensive computing power and large storage space, which are not always available in portable smart devices. Although many studies have contributed to the compression of image classification networks, there are few model compression algorithms for object detection models. In this paper, we propose a general compression pipeline for one-stage object detection networks to meet real-time requirements. Firstly, we propose a softer pruning strategy on the backbone to reduce the number of filters. Compared with direct pruning, our method can maintain the integrity of the network structure and reduce the drop in accuracy. Secondly, we transfer the knowledge of the original model to the small model by knowledge distillation to reduce the accuracy drop caused by pruning. Finally, as edge devices are more suitable for integer operations, we further transform the 32-bit floating-point model into an 8-bit integer model through quantization. With this pipeline, the model size and inference time are compressed to 10% or less of the original, while the mAP is reduced by only 2.5% or less. We verified the performance of the compression pipeline on the Pascal VOC dataset.

    @article{li2021acp,
    title = {A compression pipeline for one-stage object detection model},
    author = {Zhishan Li and Yiran Sun and Guanzhong Tian and Lei Xie and Yong Liu and Hongye Su and Yifan He},
    year = 2021,
    journal = {Journal of Real-Time Image Processing},
    doi = {10.1007/s11554-021-01082-2},
    abstract = {Deep neural networks (DNNs) have strong fitting ability on a variety of computer vision tasks, but they also require intensive computing power and large storage space, which are not always available in portable smart devices. Although many studies have contributed to the compression of image classification networks, there are few model compression algorithms for object detection models. In this paper, we propose a general compression pipeline for one-stage object detection networks to meet real-time requirements. Firstly, we propose a softer pruning strategy on the backbone to reduce the number of filters. Compared with direct pruning, our method can maintain the integrity of the network structure and reduce the drop in accuracy. Secondly, we transfer the knowledge of the original model to the small model by knowledge distillation to reduce the accuracy drop caused by pruning. Finally, as edge devices are more suitable for integer operations, we further transform the 32-bit floating-point model into an 8-bit integer model through quantization. With this pipeline, the model size and inference time are compressed to 10% or less of the original, while the mAP is reduced by only 2.5% or less. We verified the performance of the compression pipeline on the Pascal VOC dataset.}
    }
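
    For the distillation stage of such a pipeline, a common formulation (shown here as a generic illustration, not necessarily the paper's exact loss) softens teacher and student logits with a temperature and mixes a KL term with the usual hard-label cross-entropy:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
        """Blend soft-target KL divergence (scaled by T^2) with hard-label cross-entropy."""
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard

    # Example shapes: batch of 8 images, 20 classes
    s, t = torch.randn(8, 20), torch.randn(8, 20)
    y = torch.randint(0, 20, (8,))
    print(distillation_loss(s, t, y))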

  • W. Liu, L. Peng, J. Cao, X. Fu, Y. Liu, and Z. Pan, “Ensemble Bootstrapped Deep Deterministic Policy Gradient for Vision-Based Robotic Grasping," IEEE Access, vol. 9, pp. 19916–19925, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    With sufficient practice, humans can grasp objects they have never seen before through brain decision-making. Manipulators, which have a wide range of applications in industrial production, can still only grasp specific objects, because most grasping algorithms rely on prior knowledge such as hand-eye calibration results and object model features and can only target specific types of objects; when the task scenario and the operation target change, they cannot be redeployed effectively. To solve these problems, reinforcement learning is often used to train grasping algorithms. However, reinforcement learning for manipulator grasping mainly encounters three problems: insufficient sample utilization, poor algorithm stability, and limited exploration. This article uses learning from demonstration (LfD), behavior cloning (BC), and DDPG to improve sample utilization, and uses multiple critics to jointly evaluate input actions to address algorithm instability. Finally, inspired by Thompson sampling, the input action is evaluated from different angles, which increases the algorithm's exploration of the environment and reduces the number of interactions with it. The EDDPG and EBDDPG algorithms are designed in this article. To further improve the generalization ability of the algorithm, this article does not use extra information that is difficult to obtain directly on a physical platform, such as the real coordinates of the target object, and the continuous motion space of the manipulator's end-effector in the Cartesian coordinate system is used as the decision output. The simulation results show that, under the same number of interactions, the manipulator's success rate in grasping 1000 random objects more than doubles, reaching state-of-the-art (SOTA) performance.

    @article{liu2021ensemblebd,
    title = {Ensemble Bootstrapped Deep Deterministic Policy Gradient for Vision-Based Robotic Grasping},
    author = {Weiwei Liu and Linpeng Peng and Junjie Cao and Xiaokuan Fu and Yong Liu and Zaisheng Pan},
    year = 2021,
    journal = {IEEE Access},
    volume = 9,
    pages = {19916--19925},
    doi = {10.1109/ACCESS.2021.3049860},
    abstract = {With sufficient practice, humans can grasp objects they have never seen before through brain decision-making. Manipulators, which have a wide range of applications in industrial production, can still only grasp specific objects, because most grasping algorithms rely on prior knowledge such as hand-eye calibration results and object model features and can only target specific types of objects; when the task scenario and the operation target change, they cannot be redeployed effectively. To solve these problems, reinforcement learning is often used to train grasping algorithms. However, reinforcement learning for manipulator grasping mainly encounters three problems: insufficient sample utilization, poor algorithm stability, and limited exploration. This article uses learning from demonstration (LfD), behavior cloning (BC), and DDPG to improve sample utilization, and uses multiple critics to jointly evaluate input actions to address algorithm instability. Finally, inspired by Thompson sampling, the input action is evaluated from different angles, which increases the algorithm's exploration of the environment and reduces the number of interactions with it. The EDDPG and EBDDPG algorithms are designed in this article. To further improve the generalization ability of the algorithm, this article does not use extra information that is difficult to obtain directly on a physical platform, such as the real coordinates of the target object, and the continuous motion space of the manipulator's end-effector in the Cartesian coordinate system is used as the decision output. The simulation results show that, under the same number of interactions, the manipulator's success rate in grasping 1000 random objects more than doubles, reaching state-of-the-art (SOTA) performance.}
    }

2020

  • X. Zuo, W. Ye, Y. Yang, R. Zheng, T. Vidal-Calleja, G. Huang, and Y. Liu, “Multimodal localization: Stereo over LiDAR map," Journal of Field Robotics, vol. 37, pp. 1003–1026, 2020.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we present a real‐time high‐precision visual localization system for an autonomous vehicle which employs only low‐cost stereo cameras to localize the vehicle with a priori map built using a more expensive 3D LiDAR sensor. To this end, we construct two different visual maps: a sparse feature visual map for visual odometry (VO) based motion tracking, and a semidense visual map for registration with the prior LiDAR map. To register two point clouds sourced from different modalities (i.e., cameras and LiDAR), we leverage probabilistic weighted normal distributions transformation (ProW‐NDT), by particularly taking into account the uncertainty of source point clouds. The registration results are then fused via pose graph optimization to correct the VO drift. Moreover, surfels extracted from the prior LiDAR map are used to refine the sparse 3D visual features that will further improve VO‐based motion estimation. The proposed system has been tested extensively in both simulated and real‐world experiments, showing that robust, high‐precision, real‐time localization can be achieved.

    @article{zuo2020multimodalls,
    title = {Multimodal localization: Stereo over LiDAR map},
    author = {Xingxing Zuo and Wenlong Ye and Yulin Yang and Renjie Zheng and Teresa Vidal-Calleja and Guoquan Huang and Yong Liu},
    year = 2020,
    journal = {Journal of Field Robotics},
    volume = 37,
    pages = {1003--1026},
    doi = {10.1002/rob.21936},
    abstract = {In this paper, we present a real‐time high‐precision visual localization system for an autonomous vehicle which employs only low‐cost stereo cameras to localize the vehicle with a priori map built using a more expensive 3D LiDAR sensor. To this end, we construct two different visual maps: a sparse feature visual map for visual odometry (VO) based motion tracking, and a semidense visual map for registration with the prior LiDAR map. To register two point clouds sourced from different modalities (i.e., cameras and LiDAR), we leverage probabilistic weighted normal distributions transformation (ProW‐NDT), by particularly taking into account the uncertainty of source point clouds. The registration results are then fused via pose graph optimization to correct the VO drift. Moreover, surfels extracted from the prior LiDAR map are used to refine the sparse 3D visual features that will further improve VO‐based motion estimation. The proposed system has been tested extensively in both simulated and real‐world experiments, showing that robust, high‐precision, real‐time localization can be achieved.}
    }

  • J. Cao, W. Liu, Y. Liu, and J. Yang, “Generalize Robot Learning From Demonstration to Variant Scenarios With Evolutionary Policy Gradient," Frontiers in Neurorobotics, vol. 14, 2020.
    [BibTeX] [Abstract] [DOI] [PDF]

    There has been substantial growth in research on robot automation, which aims to make robots capable of directly interacting with the world or with humans. Robot learning for automation from human demonstration is central to such scenarios. However, the dependence on demonstrations restricts the robot to a fixed scenario, without the ability to explore variant situations to accomplish the same task as in the demonstration. Deep reinforcement learning methods may be a good way to push robot learning beyond human demonstration and to fulfill the task in unknown situations. Exploration is the core of such generalization to different environments, yet exploration in reinforcement learning can be ineffective and suffer from low sample efficiency. In this paper, we present Evolutionary Policy Gradient (EPG) to make the robot learn from demonstration and perform goal-oriented exploration efficiently. Through goal-oriented exploration, our method can generalize the learned skill to environments with different parameters. Our Evolutionary Policy Gradient combines parameter perturbation with the policy gradient method in the framework of Evolutionary Algorithms (EAs) and fuses the benefits of both, achieving effective and efficient exploration. With demonstrations guiding the evolutionary process, the robot can accelerate goal-oriented exploration and generalize its capability to variant scenarios. The experiments, carried out on robot control tasks in OpenAI Gym with dense and sparse rewards, show that our EPG provides competitive performance over the original policy gradient methods and EAs. In the manipulator task, our robot can learn to open the door with vision in environments that differ from those where the demonstrations were provided.

    @article{cao2020generalizerl,
    title = {Generalize Robot Learning From Demonstration to Variant Scenarios With Evolutionary Policy Gradient},
    author = {Junjie Cao and Weiwei Liu and Yong Liu and Jian Yang},
    year = 2020,
    journal = {Frontiers in Neurorobotics},
    volume = 14,
    doi = {10.3389/fnbot.2020.00021},
    abstract = {There has been substantial growth in research on robot automation, which aims to make robots capable of directly interacting with the world or with humans. Robot learning for automation from human demonstration is central to such scenarios. However, the dependence on demonstrations restricts the robot to a fixed scenario, without the ability to explore variant situations to accomplish the same task as in the demonstration. Deep reinforcement learning methods may be a good way to push robot learning beyond human demonstration and to fulfill the task in unknown situations. Exploration is the core of such generalization to different environments, yet exploration in reinforcement learning can be ineffective and suffer from low sample efficiency. In this paper, we present Evolutionary Policy Gradient (EPG) to make the robot learn from demonstration and perform goal-oriented exploration efficiently. Through goal-oriented exploration, our method can generalize the learned skill to environments with different parameters. Our Evolutionary Policy Gradient combines parameter perturbation with the policy gradient method in the framework of Evolutionary Algorithms (EAs) and fuses the benefits of both, achieving effective and efficient exploration. With demonstrations guiding the evolutionary process, the robot can accelerate goal-oriented exploration and generalize its capability to variant scenarios. The experiments, carried out on robot control tasks in OpenAI Gym with dense and sparse rewards, show that our EPG provides competitive performance over the original policy gradient methods and EAs. In the manipulator task, our robot can learn to open the door with vision in environments that differ from those where the demonstrations were provided.}
    }

  • J. Chen, Y. Liu, H. Zhang, S. Hou, and J. Yang, “Propagating Asymptotic-Estimated Gradients for Low Bitwidth Quantized Neural Networks," IEEE Journal of Selected Topics in Signal Processing, vol. 14, pp. 848–859, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Quantized neural networks (QNNs) can be useful for neural network acceleration and compression, but during the training process they pose a challenge: how to propagate the gradient of the loss function through a graph whose derivative is 0 almost everywhere. In response to this non-differentiable situation, we propose a novel Asymptotic-Quantized Estimator (AQE) to estimate the gradient. In particular, during back-propagation, the graph that relates inputs to output remains smooth and differentiable. At the end of training, the weights and activations have been quantized to low precision because of the asymptotic behaviour of AQE. Meanwhile, we propose an M-bit Inputs and N-bit Weights Network (MINW-Net) trained by AQE, a quantized neural network with 1–3-bit weights and activations. In the inference phase, we can use XNOR or SHIFT operations instead of convolution operations to accelerate MINW-Net. Our experiments on CIFAR datasets demonstrate that our AQE is well defined, and that QNNs with AQE perform better than those with the Straight-Through Estimator (STE). For example, in the case of the same ConvNet that has 1-bit weights and activations, our MINW-Net with AQE can achieve a prediction accuracy 1.5% higher than the Binarized Neural Network (BNN) with STE. MINW-Net, which is trained from scratch by AQE, can achieve comparable classification accuracy to its 32-bit counterparts on the CIFAR test sets. Extensive experimental results on the ImageNet dataset show the great superiority of the proposed AQE, and our MINW-Net achieves comparable results with other state-of-the-art QNNs.

    @article{chen2020propagatingag,
    title = {Propagating Asymptotic-Estimated Gradients for Low Bitwidth Quantized Neural Networks},
    author = {Jun Chen and Yong Liu and Hao Zhang and Shengnan Hou and Jian Yang},
    year = 2020,
    journal = {IEEE Journal of Selected Topics in Signal Processing},
    volume = 14,
    pages = {848--859},
    doi = {10.1109/JSTSP.2020.2966327},
    abstract = {Quantized neural networks (QNNs) can be useful for neural network acceleration and compression, but during the training process they pose a challenge: how to propagate the gradient of the loss function through a graph whose derivative is 0 almost everywhere. In response to this non-differentiable situation, we propose a novel Asymptotic-Quantized Estimator (AQE) to estimate the gradient. In particular, during back-propagation, the graph that relates inputs to output remains smooth and differentiable. At the end of training, the weights and activations have been quantized to low precision because of the asymptotic behaviour of AQE. Meanwhile, we propose an M-bit Inputs and N-bit Weights Network (MINW-Net) trained by AQE, a quantized neural network with 1–3-bit weights and activations. In the inference phase, we can use XNOR or SHIFT operations instead of convolution operations to accelerate MINW-Net. Our experiments on CIFAR datasets demonstrate that our AQE is well defined, and that QNNs with AQE perform better than those with the Straight-Through Estimator (STE). For example, in the case of the same ConvNet that has 1-bit weights and activations, our MINW-Net with AQE can achieve a prediction accuracy 1.5% higher than the Binarized Neural Network (BNN) with STE. MINW-Net, which is trained from scratch by AQE, can achieve comparable classification accuracy to its 32-bit counterparts on the CIFAR test sets. Extensive experimental results on the ImageNet dataset show the great superiority of the proposed AQE, and our MINW-Net achieves comparable results with other state-of-the-art QNNs.},
    arxiv = {http://arxiv.org/pdf/2003.04296}
    }

  • J. Ri, G. Tian, Y. Liu, W. Xu, and J. Lou, “Extreme learning machine with hybrid cost function of G-mean and probability for imbalance learning," International Journal of Machine Learning and Cybernetics, vol. 11, pp. 2007–2020, 2020.
    [BibTeX] [Abstract] [DOI] [PDF]

    Extreme learning machine (ELM) is a simple and fast machine learning algorithm. However, similar to other conventional learning algorithms, the classical ELM cannot handle imbalanced data distributions well. In this paper, in order to improve the learning performance of the classical ELM on imbalanced data, we present a novel variant of the ELM algorithm based on a hybrid cost function which employs the probability that a given training sample belongs to each class to calculate the G-mean. We perform comparative experiments for our approach and state-of-the-art methods on standard classification datasets consisting of 58 binary datasets and 9 multiclass datasets with different degrees of imbalance ratio. Experimental results show that our proposed algorithm can improve classification performance significantly compared with other state-of-the-art methods.

    @article{ri2020extremelm,
    title = {Extreme learning machine with hybrid cost function of G-mean and probability for imbalance learning},
    author = {JongHyok Ri and Guanzhong Tian and Yong Liu and Weihua Xu and Jungang Lou},
    year = 2020,
    journal = {International Journal of Machine Learning and Cybernetics},
    volume = 11,
    pages = {2007--2020},
    doi = {10.1007/s13042-020-01090-x},
    abstract = {Extreme learning machine (ELM) is a simple and fast machine learning algorithm. However, similar to other conventional learning algorithms, the classical ELM cannot handle imbalanced data distributions well. In this paper, in order to improve the learning performance of the classical ELM on imbalanced data, we present a novel variant of the ELM algorithm based on a hybrid cost function which employs the probability that a given training sample belongs to each class to calculate the G-mean. We perform comparative experiments for our approach and state-of-the-art methods on standard classification datasets consisting of 58 binary datasets and 9 multiclass datasets with different degrees of imbalance ratio. Experimental results show that our proposed algorithm can improve classification performance significantly compared with other state-of-the-art methods.}
    }
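
    The G-mean underlying the hybrid cost is simply the geometric mean of per-class recalls; a minimal sketch follows (the probabilistic, differentiable version used inside the ELM objective is not reproduced here).

    import numpy as np

    def g_mean(y_true, y_pred):
        """Geometric mean of per-class recalls; insensitive to class imbalance."""
        classes = np.unique(y_true)
        recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
        return float(np.prod(recalls) ** (1.0 / len(recalls)))

    # Imbalanced toy example: 90 negatives, 10 positives, half of the minority missed
    y_true = np.array([0] * 90 + [1] * 10)
    y_pred = np.array([0] * 90 + [1] * 5 + [0] * 5)
    print(g_mean(y_true, y_pred))   # ~0.707 even though plain accuracy is 0.95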

  • W. Ye, J. Sun, M. Xu, X. Yang, H. Li, and Y. Liu, “Detecting Aging Substation Transformers by Audio Signal with Deep Neural Network," Lecture Notes in Computer Science, pp. 70–82, 2020.
    [BibTeX] [Abstract] [DOI] [PDF]

    In order to monitor the aging of transformers and ensure operational safety in substations, a practical detection system for indoor substation transformers based on audio signal analysis is designed, which uses computer technology instead of manpower to efficiently monitor the transformers' working states in real time. Our work consists of a small, low-cost AI-STBOX and an intelligent AI Cloud Platform. The AI-STBOX is installed directionally in each transformer room to continuously collect, compress, and upload the transformer audio data. The AI Cloud Platform receives the audio data from the AI-STBOX and analyzes and organizes it into low-dimensional audio features with STFT and Mel cepstrum analysis. Feeding these features into a powerful deep neural network, the system can quickly distinguish the working states of each substation transformer before it develops serious faults. It can locate aging transformers and command the maintenance platform to quickly release repair tasks, thus avoiding unforeseeable outages and minimizing planned downtimes. The approach has achieved excellent results in the substation aging-transformer detection scenario.

    @article{ye2020detectingas,
    title = {Detecting Aging Substation Transformers by Audio Signal with Deep Neural Network},
    author = {Wei Ye and Jiasai Sun and Min Xu and Xuemeng Yang and Hongliang Li and Yong Liu},
    year = 2020,
    journal = {Lecture Notes in Computer Science},
    pages = {70--82},
    doi = {10.1007/978-3-662-61510-2_7},
    abstract = {In order to monitor the aging of transformers and ensure operational safety in substations, a practical detection system for indoor substation transformers based on audio signal analysis is designed, which uses computer technology instead of manpower to efficiently monitor the transformers' working states in real time. Our work consists of a small, low-cost AI-STBOX and an intelligent AI Cloud Platform. The AI-STBOX is installed directionally in each transformer room to continuously collect, compress, and upload the transformer audio data. The AI Cloud Platform receives the audio data from the AI-STBOX and analyzes and organizes it into low-dimensional audio features with STFT and Mel cepstrum analysis. Feeding these features into a powerful deep neural network, the system can quickly distinguish the working states of each substation transformer before it develops serious faults. It can locate aging transformers and command the maintenance platform to quickly release repair tasks, thus avoiding unforeseeable outages and minimizing planned downtimes. The approach has achieved excellent results in the substation aging-transformer detection scenario.}
    }
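
    A minimal sketch of the audio front end (a log-magnitude STFT only; the Mel cepstrum analysis used in the paper is omitted), assuming a mono signal sampled at sr Hz:

    import numpy as np
    from scipy.signal import stft

    def log_spectrogram(audio, sr, win_ms=32, hop_ms=16):
        """Return a (freq_bins, frames) log-magnitude STFT feature matrix."""
        nperseg = int(sr * win_ms / 1000)
        noverlap = nperseg - int(sr * hop_ms / 1000)
        _, _, Z = stft(audio, fs=sr, nperseg=nperseg, noverlap=noverlap)
        return np.log1p(np.abs(Z))

    # Synthetic 1-second 100 Hz hum as a stand-in for recorded transformer audio
    sr = 16000
    t = np.arange(sr) / sr
    feats = log_spectrogram(np.sin(2 * np.pi * 100.0 * t), sr)
    print(feats.shape)   # (freq bins, time frames), e.g. (257, ...)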

  • G. Zhai, L. Liu, L. Zhang, and Y. Liu, “PoseConvGRU: A Monocular Approach for Visual Ego-motion Estimation by Learning," Pattern Recognition, vol. 102, p. 107187, 2020.
    [BibTeX] [Abstract] [DOI] [PDF]

    While many visual ego-motion algorithm variants have been proposed in the past decade, learning based ego-motion estimation methods have seen an increasing attention because of its desirable properties of robustness to image noise and camera calibration independence. In this work, we propose a data-driven approach of fully trainable visual ego-motion estimation for a monocular camera. We use an end-to-end learning approach in allowing the model to map directly from input image pairs to an estimate of ego-motion (parameterized as 6-DoF transformation matrices). We introduce a novel two-module Long-term Recurrent Convolutional Neural Networks called PoseConvGRU, with an explicit sequence pose estimation loss to achieve this. The feature-encoding module encodes the short-term motion feature in an image pair, while the memory-propagating module captures the long-term motion feature in the consecutive image pairs. The visual memory is implemented with convolutional gated recurrent units, which allows propagating information over time. At each time step, two consecutive RGB images are stacked together to form a 6 channels tensor for module-1 to learn how to extract motion information and estimate poses. The sequence of output maps is then passed through a stacked ConvGRU module to generate the relative transformation pose of each image pair. We also augment the training data by randomly skipping frames to simulate the velocity variation which results in a better performance in turning and high-velocity situations. We evaluate the performance of our proposed approach on the KITTI Visual Odometry benchmark. The experiments show a competitive performance of the proposed method to the geometric method and encourage further exploration of learning based methods for the purpose of estimating camera ego-motion even though geometrical methods demonstrate promising results.

    @article{zhai2020poseconvgruam,
    title = {PoseConvGRU: A Monocular Approach for Visual Ego-motion Estimation by Learning},
    author = {Guangyao Zhai and Liang Liu and Linjian Zhang and Yong Liu},
    year = 2020,
    journal = {Pattern Recognit.},
    volume = 102,
    pages = 107187,
    doi = {10.1016/j.patcog.2019.107187},
    abstract = {While many visual ego-motion algorithm variants have been proposed in the past decade, learning-based ego-motion estimation methods have attracted increasing attention because of their desirable properties of robustness to image noise and independence from camera calibration. In this work, we propose a data-driven approach of fully trainable visual ego-motion estimation for a monocular camera. We use an end-to-end learning approach that allows the model to map directly from input image pairs to an estimate of ego-motion (parameterized as 6-DoF transformation matrices). We introduce a novel two-module long-term recurrent convolutional neural network called PoseConvGRU, with an explicit sequence pose estimation loss, to achieve this. The feature-encoding module encodes the short-term motion feature in an image pair, while the memory-propagating module captures the long-term motion feature in the consecutive image pairs. The visual memory is implemented with convolutional gated recurrent units, which allow propagating information over time. At each time step, two consecutive RGB images are stacked together to form a 6-channel tensor for module-1 to learn how to extract motion information and estimate poses. The sequence of output maps is then passed through a stacked ConvGRU module to generate the relative transformation pose of each image pair. We also augment the training data by randomly skipping frames to simulate velocity variation, which results in better performance in turning and high-velocity situations. We evaluate the performance of our proposed approach on the KITTI Visual Odometry benchmark. The experiments show that the proposed method performs competitively with the geometric method and encourage further exploration of learning-based methods for estimating camera ego-motion, even though geometric methods demonstrate promising results.}
    }
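
    A minimal PyTorch sketch of a convolutional GRU cell, the kind of recurrent visual memory described in the entry above; the channel sizes, feature-map resolution, and toy rollout are illustrative assumptions, not the authors' configuration.

      # Minimal ConvGRU cell sketch (illustrative sizes, not the paper's exact model).
      import torch
      import torch.nn as nn

      class ConvGRUCell(nn.Module):
          def __init__(self, in_ch, hid_ch, k=3):
              super().__init__()
              pad = k // 2
              self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=pad)  # update/reset gates
              self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)       # candidate state
              self.hid_ch = hid_ch

          def forward(self, x, h):
              if h is None:
                  h = torch.zeros(x.size(0), self.hid_ch, x.size(2), x.size(3), device=x.device)
              z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], 1))), 2, dim=1)
              h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
              return (1 - z) * h + z * h_tilde

      # Toy rollout: features from stacked image pairs propagate a visual memory.
      cell, h = ConvGRUCell(in_ch=32, hid_ch=64), None
      for _ in range(5):                       # 5 consecutive image pairs
          feat = torch.randn(1, 32, 24, 78)    # hypothetical encoder output
          h = cell(feat, h)
      print(h.shape)  # torch.Size([1, 64, 24, 78])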

  • X. Zhao, L. Liu, R. Zheng, W. Ye, and Y. Liu, “A Robust Stereo Feature-aided Semi-direct SLAM System," Robotics and Autonomous Systems, vol. 132, p. 103597, 2020.
    [BibTeX] [Abstract] [DOI] [PDF]

    In autonomous driving, many intelligent perception technologies have been put in use. However, visual SLAM still has problems with robustness, which limits its application, although it has been developed for a long time. We propose a feature-aided semi-direct approach to combine the direct and indirect methods in visual SLAM to allow robust localization under various situations, including large-baseline motion, textureless environment, and great illumination changes. In our approach, we first calculate inter-frame pose estimation by feature matching. Then we use the direct alignment and a multi-scale pyramid, which employs the previous coarse estimation as a priori, to obtain a more precise result. To get more accurate photometric parameters, we combine the online photometric calibration method with visual odometry. Furthermore, we replace the Shi–Tomasi corner with the ORB feature, which is more robust to illumination. For extreme brightness change, we employ the dark channel prior to weaken the halation and maintain the consistency of the image. To evaluate our approach, we build a full stereo visual SLAM system. Experiments on the publicly available dataset and our mobile robot dataset indicate that our approach improves the accuracy and robustness of the SLAM system.

    @article{zhao2020ars,
    title = {A Robust Stereo Feature-aided Semi-direct SLAM System},
    author = {Xiangrui Zhao and Lina Liu and Renjie Zheng and Wenlong Ye and Yong Liu},
    year = 2020,
    journal = {Robotics and Autonomous Systems},
    volume = 132,
    pages = 103597,
    doi = {10.1016/j.robot.2020.103597},
    abstract = {In autonomous driving, many intelligent perception technologies have been put in use. However, visual SLAM still has problems with robustness, which limits its application, although it has been developed for a long time. We propose a feature-aided semi-direct approach to combine the direct and indirect methods in visual SLAM to allow robust localization under various situations, including large-baseline motion, textureless environment, and great illumination changes. In our approach, we first calculate inter-frame pose estimation by feature matching. Then we use the direct alignment and a multi-scale pyramid, which employs the previous coarse estimation as a priori, to obtain a more precise result. To get more accurate photometric parameters, we combine the online photometric calibration method with visual odometry. Furthermore, we replace the Shi–Tomasi corner with the ORB feature, which is more robust to illumination. For extreme brightness change, we employ the dark channel prior to weaken the halation and maintain the consistency of the image. To evaluate our approach, we build a full stereo visual SLAM system. Experiments on the publicly available dataset and our mobile robot dataset indicate that our approach improves the accuracy and robustness of the SLAM system.}
    }
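
    The dark channel prior mentioned above can be sketched in a few lines of OpenCV/numpy: take the per-pixel minimum over colour channels, then a local minimum filter; the patch size and input file name below are assumptions.

      # Dark channel prior sketch: per-pixel min over channels, then a local min filter.
      import cv2
      import numpy as np

      def dark_channel(bgr, patch=15):
          """bgr: HxWx3 uint8 image; returns the HxW dark channel."""
          min_rgb = bgr.min(axis=2)                                    # min over colour channels
          kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
          return cv2.erode(min_rgb, kernel)                            # erosion = local minimum

      img = cv2.imread("frame.png")            # hypothetical input frame
      if img is not None:
          dc = dark_channel(img)
          # Bright dark-channel regions indicate haze/halation that a SLAM
          # front end may want to attenuate before feature extraction.
          print(dc.mean())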

2019

  • W. Chen, S. Zhou, Z. Pan, H. Zheng, and Y. Liu, “Mapless Collaborative Navigation for a Multi-Robot System Based on the Deep Reinforcement Learning," Applied Sciences, vol. 9, p. 4198, 2019.
    [BibTeX] [Abstract] [DOI] [PDF]

    Compared with a single-robot system, a multi-robot system has higher efficiency and fault tolerance. Multi-robot systems have great potential in application scenarios such as robot search, rescue, and escort tasks. Deep reinforcement learning provides a potential framework for multi-robot formation and collaborative navigation. This paper mainly studies the collaborative formation and navigation of multiple robots using a deep reinforcement learning algorithm. The proposed method improves the classical Deep Deterministic Policy Gradient (DDPG) to address the single-robot mapless navigation task. We also extend the single-robot Deep Deterministic Policy Gradient algorithm to the multi-robot system and obtain the Parallel Deep Deterministic Policy Gradient (PDDPG). By utilizing a 2D lidar sensor, the group of robots can accomplish the formation construction task and the collaborative formation navigation task. The experimental results on a Gazebo simulation platform illustrate that our method is capable of guiding mobile robots to construct a formation and keep it during group navigation, directly from raw lidar inputs.

    @article{chen2019maplesscn,
    title = {Mapless Collaborative Navigation for a Multi-Robot System Based on the Deep Reinforcement Learning},
    author = {Wenzhou Chen and Shizheng Zhou and Zaisheng Pan and Huixian Zheng and Yong Liu},
    year = 2019,
    journal = {Applied Sciences},
    volume = 9,
    pages = 4198,
    doi = {10.3390/app9204198},
    abstract = {Compared with a single-robot system, a multi-robot system has higher efficiency and fault tolerance. Multi-robot systems have great potential in application scenarios such as robot search, rescue, and escort tasks. Deep reinforcement learning provides a potential framework for multi-robot formation and collaborative navigation. This paper mainly studies the collaborative formation and navigation of multiple robots using a deep reinforcement learning algorithm. The proposed method improves the classical Deep Deterministic Policy Gradient (DDPG) to address the single-robot mapless navigation task. We also extend the single-robot Deep Deterministic Policy Gradient algorithm to the multi-robot system and obtain the Parallel Deep Deterministic Policy Gradient (PDDPG). By utilizing a 2D lidar sensor, the group of robots can accomplish the formation construction task and the collaborative formation navigation task. The experimental results on a Gazebo simulation platform illustrate that our method is capable of guiding mobile robots to construct a formation and keep it during group navigation, directly from raw lidar inputs.}
    }
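
    As a hedged sketch of the DDPG actor-critic pair that PDDPG parallelises across robots (not the paper's exact networks), one might write the following; the network widths, the 360-beam lidar input, and the 2-D velocity action are assumptions.

      # DDPG actor/critic sketch (one robot); PDDPG runs copies of this in parallel.
      import torch
      import torch.nn as nn

      LIDAR_DIM, ACT_DIM = 360, 2          # assumed: 360 lidar beams, (v, w) action

      actor = nn.Sequential(                # state -> deterministic action in [-1, 1]
          nn.Linear(LIDAR_DIM, 256), nn.ReLU(),
          nn.Linear(256, ACT_DIM), nn.Tanh())

      critic = nn.Sequential(               # (state, action) -> Q-value
          nn.Linear(LIDAR_DIM + ACT_DIM, 256), nn.ReLU(),
          nn.Linear(256, 1))

      state = torch.randn(1, LIDAR_DIM)     # hypothetical lidar scan
      action = actor(state)
      q = critic(torch.cat([state, action], dim=1))
      # Training would add a replay buffer, target networks, and the usual DDPG
      # losses: critic MSE to r + gamma * Q_target, actor maximises Q(s, actor(s)).
      print(action.shape, q.shape)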

  • X. Fu, Y. Liu, and Z. Wang, “Active Learning-Based Grasp for Accurate Industrial Manipulation," IEEE Transactions on Automation Science and Engineering, vol. 16, p. 1610–1618, 2019.
    [BibTeX] [Abstract] [DOI] [PDF]

    We propose an active learning-based grasp method for accurate industrial manipulation that combines the high accuracy of geometrically driven grasp methods and the generalization ability of data-driven grasp methods. Our grasp sequence consists of pregrasp stage and grasp stage which integrates the active perception and manipulation. In pregrasp stage, the manipulator actively moves and perceives the object. At each step, given the perception image, a motion is chosen so that the manipulator can adjust to a proper pose to grasp the object. We train a convolutional neural network to estimate the motion and combine the network with a closed-loop control so that the end effector can move to the pregrasp state. In grasp stage, the manipulator executes a fixed motion to complete the grasp task. The fixed motion can be acquired from the demonstration with nonexpert conveniently. Our proposed method does not require the prior knowledge of camera intrinsic parameters, hand-eye transformation, or manually designed feature of objects. Instead, the training data sets containing prior knowledge are collected through interactive perception. The method can be easily transferred to new tasks with a few human interventions and is able to complete high accuracy grasp task with a certain robustness to partial observation condition. In our circuit board grasping tests, we could achieve a grasp accuracy of 0.8 mm and 0.6°. Note to Practitioners—The research in this paper is motivated by the following practical problem. Manipulators on industrial lines can complete high accuracy tasks with hand-crafting features of objects. The perception is only used for object detection and localization. It is not flexible since the prior knowledge differs from tasks, which takes a long time to deploy in a new task. Besides, only well-trained experts are qualified to complete the deployment process. Our grasp method uses a convolutional network to estimate the motion for manipulator directly from images. The camera is mounted on the manipulator and can perceive the object actively. The training data set of the network is specific for different objects that can be automatically collected with a few human interventions. Our method simplifies the deployment process and can be applied in 3C industry (computers, communications, and consumer electronics) where the products upgrade frequently.

    @article{fu2019activelg,
    title = {Active Learning-Based Grasp for Accurate Industrial Manipulation},
    author = {Xiaokuan Fu and Yong Liu and Zhilei Wang},
    year = 2019,
    journal = {IEEE Transactions on Automation Science and Engineering},
    volume = 16,
    pages = {1610--1618},
    doi = {10.1109/TASE.2019.2897791},
    abstract = {We propose an active learning-based grasp method for accurate industrial manipulation that combines the high accuracy of geometrically driven grasp methods and the generalization ability of data-driven grasp methods. Our grasp sequence consists of pregrasp stage and grasp stage which integrates the active perception and manipulation. In pregrasp stage, the manipulator actively moves and perceives the object. At each step, given the perception image, a motion is chosen so that the manipulator can adjust to a proper pose to grasp the object. We train a convolutional neural network to estimate the motion and combine the network with a closed-loop control so that the end effector can move to the pregrasp state. In grasp stage, the manipulator executes a fixed motion to complete the grasp task. The fixed motion can be acquired from the demonstration with nonexpert conveniently. Our proposed method does not require the prior knowledge of camera intrinsic parameters, hand-eye transformation, or manually designed feature of objects. Instead, the training data sets containing prior knowledge are collected through interactive perception. The method can be easily transferred to new tasks with a few human interventions and is able to complete high accuracy grasp task with a certain robustness to partial observation condition. In our circuit board grasping tests, we could achieve a grasp accuracy of 0.8 mm and 0.6°. Note to Practitioners—The research in this paper is motivated by the following practical problem. Manipulators on industrial lines can complete high accuracy tasks with hand-crafting features of objects. The perception is only used for object detection and localization. It is not flexible since the prior knowledge differs from tasks, which takes a long time to deploy in a new task. Besides, only well-trained experts are qualified to complete the deployment process. Our grasp method uses a convolutional network to estimate the motion for manipulator directly from images. The camera is mounted on the manipulator and can perceive the object actively. The training data set of the network is specific for different objects that can be automatically collected with a few human interventions. Our method simplifies the deployment process and can be applied in 3C industry (computers, communications, and consumer electronics) where the products upgrade frequently.}
    }

  • L. Liu, Y. Liu, and J. Zhang, “Learning-Based Hand Motion Capture and Understanding in Assembly Process," IEEE Transactions on Industrial Electronics, vol. 66, p. 9703–9712, 2019.
    [BibTeX] [Abstract] [DOI] [PDF]

    Manual assembly is still an essential part in modern manufacturing. Understanding the actual state of the assembly process can not only improve quality control of products, but also collect comprehensive data for production planning and proficiency assessments. Addressing the rising complexity led by the uncertainty in manual assembly, this paper presents an efficient approach to automatically capture and analyze hand operations in the assembly process. In this paper, a detection-based tracking method is introduced to capture trajectories of hand movement from the camera installed in each workstation. Then, the actions in hand trajectories are identified with a novel temporal action localization model. The experimental results have proved that our method reached the application level with high accuracy and a low computational cost. The proposed system is lightweight enough to be quickly set up on an embedded computing device for real-time online inference and on a cloud server for offline analysis as well.

    @article{liu2019learningbasedhm,
    title = {Learning-Based Hand Motion Capture and Understanding in Assembly Process},
    author = {Liang Liu and Yong Liu and Jiangning Zhang},
    year = 2019,
    journal = {IEEE Transactions on Industrial Electronics},
    volume = 66,
    pages = {9703--9712},
    doi = {10.1109/TIE.2018.2884206},
    abstract = {Manual assembly is still an essential part in modern manufacturing. Understanding the actual state of the assembly process can not only improve quality control of products, but also collect comprehensive data for production planning and proficiency assessments. Addressing the rising complexity led by the uncertainty in manual assembly, this paper presents an efficient approach to automatically capture and analyze hand operations in the assembly process. In this paper, a detection-based tracking method is introduced to capture trajectories of hand movement from the camera installed in each workstation. Then, the actions in hand trajectories are identified with a novel temporal action localization model. The experimental results have proved that our method reached the application level with high accuracy and a low computational cost. The proposed system is lightweight enough to be quickly set up on an embedded computing device for real-time online inference and on a cloud server for offline analysis as well.}
    }

  • G. Tian, L. Liu, J. Ri, Y. Liu, and Y. Sun, “ObjectFusion: An object detection and segmentation framework with RGB-D SLAM and convolutional neural networks," Neurocomputing, vol. 345, p. 3–14, 2019.
    [BibTeX] [Abstract] [DOI] [PDF]

    Given the rapid advances in CNNs (Convolutional Neural Networks) [1], deploying deep neural networks for accurate detection and semantic reconstruction in SLAM (Simultaneous Localization and Mapping) has become a trend. However, as far as we know, almost all existing methods focus on designing a specific CNN architecture for a single task. In this paper, we propose a novel framework that fuses a general object detection CNN with a SLAM system to obtain better performance on both detection and semantic segmentation in 3D space. Our approach first uses a CNN-based detection network to obtain 2D object proposals, which are used to establish the local target map. We then use the results estimated from SLAM to update the dynamic global target map based on the local target map obtained by the CNN. Finally, we obtain the detection result for the current frame by projecting the global target map into 2D space. On the other hand, we send the estimation results back to SLAM and update the semantic surfel model in the SLAM system. Therefore, we can acquire the segmentation result by projecting the updated 3D surfel model into 2D. Our fusion scheme benefits object detection and segmentation by integrating with the SLAM system to preserve spatial continuity and temporal consistency. Evaluations on four datasets demonstrate the effectiveness and robustness of our method.

    @article{tian2019objectfusionao,
    title = {ObjectFusion: An object detection and segmentation framework with RGB-D SLAM and convolutional neural networks},
    author = {Guanzhong Tian and Liang Liu and JongHyok Ri and Yong Liu and Yiran Sun},
    year = 2019,
    journal = {Neurocomputing},
    volume = 345,
    pages = {3--14},
    doi = {10.1016/J.NEUCOM.2019.01.088},
    abstract = {Given the rapid advances in CNNs (Convolutional Neural Networks) [1], deploying deep neural networks for accurate detection and semantic reconstruction in SLAM (Simultaneous Localization and Mapping) has become a trend. However, as far as we know, almost all existing methods focus on designing a specific CNN architecture for a single task. In this paper, we propose a novel framework that fuses a general object detection CNN with a SLAM system to obtain better performance on both detection and semantic segmentation in 3D space. Our approach first uses a CNN-based detection network to obtain 2D object proposals, which are used to establish the local target map. We then use the results estimated from SLAM to update the dynamic global target map based on the local target map obtained by the CNN. Finally, we obtain the detection result for the current frame by projecting the global target map into 2D space. On the other hand, we send the estimation results back to SLAM and update the semantic surfel model in the SLAM system. Therefore, we can acquire the segmentation result by projecting the updated 3D surfel model into 2D. Our fusion scheme benefits object detection and segmentation by integrating with the SLAM system to preserve spatial continuity and temporal consistency. Evaluations on four datasets demonstrate the effectiveness and robustness of our method.}
    }
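
    One concrete step the framework relies on, projecting the 3D target map back into the current frame, can be sketched as follows; the pinhole intrinsics, identity pose, and random map points are placeholder assumptions.

      # Project 3-D map points into the current frame with a pinhole camera model.
      import numpy as np

      K = np.array([[525.0, 0.0, 319.5],     # assumed RGB-D intrinsics
                    [0.0, 525.0, 239.5],
                    [0.0, 0.0, 1.0]])

      def project(points_w, T_cw):
          """points_w: Nx3 world points; T_cw: 4x4 world->camera pose from SLAM."""
          pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
          pts_c = (T_cw @ pts_h.T).T[:, :3]              # camera-frame coordinates
          pts_c = pts_c[pts_c[:, 2] > 0]                 # keep points in front of the camera
          uv = (K @ pts_c.T).T
          return uv[:, :2] / uv[:, 2:3]                  # pixel coordinates (u, v)

      points = np.random.rand(100, 3) * 5.0              # hypothetical global target map
      pixels = project(points, np.eye(4))
      # Pixels falling inside a 2-D proposal vote for that object's label in the map.
      print(pixels.shape)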

  • X. Zuo, P. Geneva, Y. Yang, W. Ye, Y. Liu, and G. Huang, “Visual-Inertial Localization With Prior LiDAR Map Constraints," IEEE Robotics and Automation Letters, vol. 4, p. 3394–3401, 2019.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this letter, we develop a low-cost stereo visual-inertial localization system, which leverages efficient multi-state constraint Kalman filter (MSCKF)-based visual-inertial odometry (VIO) while utilizing an a priori LiDAR map to provide bounded-error three-dimensional navigation. Besides the standard sparse visual feature measurements used in VIO, the global registrations of visual semi-dense clouds to the prior LiDAR map are also exploited in a tightly-coupled MSCKF update, thus correcting accumulated drift. This cross-modality constraint between visual and LiDAR pointclouds is particularly addressed. The proposed approach is validated on both Monte Carlo simulations and real-world experiments, showing that LiDAR map constraints between clouds created through different sensing modalities greatly improve the standard VIO and provide bounded-error performance.

    @article{zuo2019visualinertiallw,
    title = {Visual-Inertial Localization With Prior LiDAR Map Constraints},
    author = {Xingxing Zuo and Patrick Geneva and Yulin Yang and Wenlong Ye and Yong Liu and Guoquan Huang},
    year = 2019,
    journal = {IEEE Robotics and Automation Letters},
    volume = 4,
    pages = {3394--3401},
    doi = {10.1109/LRA.2019.2927123},
    abstract = {In this letter, we develop a low-cost stereo visual-inertial localization system, which leverages efficient multi-state constraint Kalman filter (MSCKF)-based visual-inertial odometry (VIO) while utilizing an a priori LiDAR map to provide bounded-error three-dimensional navigation. Besides the standard sparse visual feature measurements used in VIO, the global registrations of visual semi-dense clouds to the prior LiDAR map are also exploited in a tightly-coupled MSCKF update, thus correcting accumulated drift. This cross-modality constraint between visual and LiDAR pointclouds is particularly addressed. The proposed approach is validated on both Monte Carlo simulations and real-world experiments, showing that LiDAR map constraints between clouds created through different sensing modalities greatly improve the standard VIO and provide bounded-error performance.}
    }
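
    A generic EKF position update, standing in for the much richer tightly-coupled MSCKF update with cloud-to-map registration described above; the toy state layout, noise values, and measurement are assumptions.

      # Generic EKF update with a direct position measurement (e.g. from map registration).
      import numpy as np

      x = np.zeros(6)                    # toy state: [px, py, pz, vx, vy, vz]
      P = np.eye(6) * 0.5                # state covariance
      H = np.hstack([np.eye(3), np.zeros((3, 3))])   # measurement extracts position
      R = np.eye(3) * 0.05               # registration noise (assumed)

      z = np.array([1.02, -0.48, 0.31])  # hypothetical registered position in the map frame

      y = z - H @ x                       # innovation
      S = H @ P @ H.T + R                 # innovation covariance
      K_gain = P @ H.T @ np.linalg.inv(S) # Kalman gain
      x = x + K_gain @ y
      P = (np.eye(6) - K_gain @ H) @ P
      print(x[:3])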

2018

  • Y. Liu and L. Liu, “Accurate real-time ball trajectory estimation with onboard stereo camera system for humanoid ping-pong robot," Robotics and Autonomous Systems, vol. 101, p. 34–44, 2018.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, an accurate real-time ball trajectory estimation approach working on the onboard stereo camera system of a humanoid ping-pong robot is presented. As asynchronous observations from different cameras greatly reduce the accuracy of trajectory estimation, the proposed approach mainly focuses on increasing the estimation accuracy under such asynchronous observations by exploiting the flying ball’s motion consistency. An approximate polynomial trajectory model for the flying ball is built to optimize the best parameters from the asynchronous observations in each discrete temporal interval. The experiments show the proposed approach performs much better than a method that ignores the asynchrony and achieves performance similar to the hardware-triggered synchronization-based method, which cannot be deployed in the real onboard vision system due to the limited bandwidth and the real-time output requirement.

    @article{liu2018accuraterb,
    title = {Accurate real-time ball trajectory estimation with onboard stereo camera system for humanoid ping-pong robot},
    author = {Yong Liu and Liang Liu},
    year = 2018,
    journal = {Robotics and Autonomous Systems},
    volume = 101,
    pages = {34--44},
    doi = {10.1016/j.robot.2017.12.004},
    abstract = {In this paper, an accurate real-time ball trajectory estimation approach working on the onboard stereo camera system of a humanoid ping-pong robot is presented. As asynchronous observations from different cameras greatly reduce the accuracy of trajectory estimation, the proposed approach mainly focuses on increasing the estimation accuracy under such asynchronous observations by exploiting the flying ball’s motion consistency. An approximate polynomial trajectory model for the flying ball is built to optimize the best parameters from the asynchronous observations in each discrete temporal interval. The experiments show the proposed approach performs much better than a method that ignores the asynchrony and achieves performance similar to the hardware-triggered synchronization-based method, which cannot be deployed in the real onboard vision system due to the limited bandwidth and the real-time output requirement.}
    }
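
    A numpy sketch of the core idea, fitting a per-axis polynomial trajectory to asynchronously timestamped observations and querying it at an arbitrary time; the quadratic model and the synthetic data are assumptions.

      # Fit a quadratic trajectory per axis to asynchronously timestamped observations.
      import numpy as np

      # Hypothetical observations: timestamps and 3-D positions triangulated from
      # either camera, with the two cameras sampling at slightly different times.
      t = np.sort(np.random.uniform(0.0, 0.4, 20))
      true = np.stack([2.0 * t, 0.1 + 1.5 * t, 1.0 + 3.0 * t - 4.9 * t**2], axis=1)
      obs = true + np.random.normal(scale=0.005, size=true.shape)

      coeffs = [np.polyfit(t, obs[:, axis], deg=2) for axis in range(3)]  # x(t), y(t), z(t)

      def ball_at(query_t):
          return np.array([np.polyval(c, query_t) for c in coeffs])

      print(ball_at(0.45))   # extrapolated position estimate at a future time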

  • J. Ri, L. Liu, Y. Liu, H. Wu, W. Huang, and H. Kim, “Optimal Weighted Extreme Learning Machine for Imbalanced Learning with Differential Evolution [Research Frontier]," IEEE Computational Intelligence Magazine, vol. 13, p. 32–47, 2018.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we present a formal model for the optimal weighted extreme learning machine (ELM) on imbalanced learning. Our model regards the optimal weighted ELM as an optimization problem to find the best weight matrix. We propose an approximate search algorithm, named weighted ELM with differential evolution (DE), that is a competitive stochastic search technique, to solve the optimization problem of the proposed formal imbalanced learning model. We perform experiments on standard imbalanced classification datasets which consist of 39 binary datasets and 3 multiclass datasets. The results show a significant performance improvement over standard ELM with an average Gmean improvement of 10.15% on binary datasets and 1.48% on multiclass datasets, which are also better than other state-of-the-art methods. We also demonstrate that our proposed algorithm can achieve high accuracy in representation learning by performing experiments on MNIST, CIFAR-10, and YouTube-8M, with feature representation from convolutional neural networks.

    @article{ri2018optimalwe,
    title = {Optimal Weighted Extreme Learning Machine for Imbalanced Learning with Differential Evolution [Research Frontier]},
    author = {JongHyok Ri and Liang Liu and Yong Liu and Huifeng Wu and Wenliang Huang and Hun Kim},
    year = 2018,
    journal = {IEEE Computational Intelligence Magazine},
    volume = 13,
    pages = {32--47},
    doi = {10.1109/MCI.2018.2840707},
    abstract = {In this paper, we present a formal model for the optimal weighted extreme learning machine (ELM) on imbalanced learning. Our model regards the optimal weighted ELM as an optimization problem to find the best weight matrix. We propose an approximate search algorithm, named weighted ELM with differential evolution (DE), that is a competitive stochastic search technique, to solve the optimization problem of the proposed formal imbalanced learning model. We perform experiments on standard imbalanced classification datasets which consist of 39 binary datasets and 3 multiclass datasets. The results show a significant performance improvement over standard ELM with an average Gmean improvement of 10.15% on binary datasets and 1.48% on multiclass datasets, which are also better than other state-of-the-art methods. We also demonstrate that our proposed algorithm can achieve high accuracy in representation learning by performing experiments on MNIST, CIFAR-10, and YouTube-8M, with feature representation from convolutional neural networks.}
    }
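
    A numpy sketch of a weighted ELM: a random hidden layer followed by the weighted ridge solution for the output weights. The simple inverse-frequency weighting here is a placeholder for the weight matrix the paper searches with differential evolution (scipy.optimize.differential_evolution is one off-the-shelf optimizer for that outer search).

      # Weighted ELM sketch: random hidden layer + weighted ridge output weights.
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 10))                   # toy features
      y = (rng.random(200) < 0.2).astype(int)          # imbalanced binary labels

      L, C = 100, 1.0                                  # hidden nodes, regularisation
      W_in = rng.normal(size=(X.shape[1], L))
      b = rng.normal(size=L)
      H = np.tanh(X @ W_in + b)                        # hidden-layer outputs

      T = np.eye(2)[y]                                 # one-hot targets
      w = np.where(y == 1, 1.0 / max((y == 1).sum(), 1), 1.0 / max((y == 0).sum(), 1))
      Wd = np.diag(w)                                  # per-sample weights (inverse frequency)

      # beta = (H^T W H + I/C)^{-1} H^T W T
      beta = np.linalg.solve(H.T @ Wd @ H + np.eye(L) / C, H.T @ Wd @ T)
      pred = np.argmax(H @ beta, axis=1)
      print((pred == y).mean())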

  • M. Wang, Y. Liu, D. Su, Y. Liao, L. Shi, J. Xu, and J. V. Miro, “Accurate and Real-Time 3-D Tracking for the Following Robots by Fusing Vision and Ultrasonar Information," IEEE/ASME Transactions on Mechatronics, vol. 23, p. 997–1006, 2018.
    [BibTeX] [Abstract] [DOI] [PDF]

    Acquiring the accurate three-dimensional (3-D) position of a target person around a robot provides valuable information that is applicable to a wide range of robotic tasks, especially for promoting the intelligent manufacturing processes of industries. This paper presents a real-time robotic 3-D human tracking system that combines a monocular camera with an ultrasonic sensor by an extended Kalman filter (EKF). The proposed system consists of three submodules: a monocular camera sensor tracking module, an ultrasonic sensor tracking module, and the multisensor fusion algorithm. An improved visual tracking algorithm is presented to provide 2-D partial location estimation. The algorithm is designed to overcome severe occlusions, scale variation, target missing, and achieve robust redetection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot, and time of flight is used for the 2-D partial location estimation. EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor, and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both a simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the persuasive performance of the 3-D tracking system in terms of both the accuracy and robustness.

    @article{wang2018accuratear,
    title = {Accurate and Real-Time 3-D Tracking for the Following Robots by Fusing Vision and Ultrasonar Information},
    author = {Mengmeng Wang and Yong Liu and Daobilige Su and Yufan Liao and Lei Shi and Jinhong Xu and Jaime Valls Miro},
    year = 2018,
    journal = {IEEE/ASME Transactions on Mechatronics},
    volume = 23,
    pages = {997--1006},
    doi = {10.1109/TMECH.2018.2820172},
    abstract = {Acquiring the accurate three-dimensional (3-D) position of a target person around a robot provides valuable information that is applicable to a wide range of robotic tasks, especially for promoting the intelligent manufacturing processes of industries. This paper presents a real-time robotic 3-D human tracking system that combines a monocular camera with an ultrasonic sensor by an extended Kalman filter (EKF). The proposed system consists of three submodules: a monocular camera sensor tracking module, an ultrasonic sensor tracking module, and the multisensor fusion algorithm. An improved visual tracking algorithm is presented to provide 2-D partial location estimation. The algorithm is designed to overcome severe occlusions, scale variation, target missing, and achieve robust redetection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot, and time of flight is used for the 2-D partial location estimation. EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor, and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both a simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the persuasive performance of the 3-D tracking system in terms of both the accuracy and robustness.}
    }

2017

  • J. Fan, Y. Jiang, and Y. Liu, “Quick attribute reduction with generalized indiscernibility models," Information Sciences, vol. 397, p. 15–36, 2017.
    [BibTeX] [Abstract] [DOI] [PDF]

    We present a generalized indiscernibility reduction model (GIRM) and a concept of the granular structure in GIRM. We prove that GIRM is compatible with three typical reduction models. We present a generalized attribute reduction algorithm and a generalized positive region computing algorithm based on GIRM. We present acceleration policies on two generalized algorithms and fast positive region computing approaches for three typical reduction models. The efficiency of attribute reduction is one of the important challenges being faced in the field of Big Data processing. Although many quick attribute reduction algorithms have been proposed, they are tightly coupled with their corresponding indiscernibility relations, and it is difficult to extend specific acceleration policies to other reduction models. In this paper, we propose a generalized indiscernibility reduction model (GIRM) and a concept of the granular structure in GIRM, which is a quantitative measurement induced from multiple indiscernibility relations and which can be used to represent the computation cost of varied models. Then, we prove that our GIRM is compatible with three typical reduction models. Based on the proposed GIRM, we present a generalized attribute reduction algorithm and a generalized positive region computing algorithm. We perform a quantitative analysis of the computation complexities of two algorithms using the granular structure. For the generalized attribute reduction, we present systematic acceleration policies that can reduce the computational domain and optimize the computation of the positive region. Based on the granular structure, we propose acceleration policies for the computation of the generalized positive region, and we also propose fast positive region computation approaches for three typical reduction models. Experimental results for various datasets prove the efficiency of our acceleration policies in those three typical reduction models.

    @article{jing2017quickar,
    title = {Quick attribute reduction with generalized indiscernibility models},
    author = {Jing Fan and YunLiang Jiang and Yong Liu},
    year = 2017,
    journal = {Information Sciences},
    volume = 397,
    pages = {15--36},
    doi = {10.1016/J.INS.2017.02.032},
    abstract = {We present a generalized indiscernibility reduction model (GIRM) and a concept of the granular structure in GIRM. We prove that GIRM is compatible with three typical reduction models. We present a generalized attribute reduction algorithm and a generalized positive region computing algorithm based on GIRM. We present acceleration policies on two generalized algorithms and fast positive region computing approaches for three typical reduction models. The efficiency of attribute reduction is one of the important challenges being faced in the field of Big Data processing. Although many quick attribute reduction algorithms have been proposed, they are tightly coupled with their corresponding indiscernibility relations, and it is difficult to extend specific acceleration policies to other reduction models. In this paper, we propose a generalized indiscernibility reduction model (GIRM) and a concept of the granular structure in GIRM, which is a quantitative measurement induced from multiple indiscernibility relations and which can be used to represent the computation cost of varied models. Then, we prove that our GIRM is compatible with three typical reduction models. Based on the proposed GIRM, we present a generalized attribute reduction algorithm and a generalized positive region computing algorithm. We perform a quantitative analysis of the computation complexities of two algorithms using the granular structure. For the generalized attribute reduction, we present systematic acceleration policies that can reduce the computational domain and optimize the computation of the positive region. Based on the granular structure, we propose acceleration policies for the computation of the generalized positive region, and we also propose fast positive region computation approaches for three typical reduction models. Experimental results for various datasets prove the efficiency of our acceleration policies in those three typical reduction models.}
    }
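
    A small Python sketch of the positive-region computation that such reduction models repeatedly evaluate: group samples into indiscernibility classes over a candidate attribute subset and keep the classes whose decision labels agree; the toy decision table is hypothetical.

      # Positive region of a decision table w.r.t. a subset of condition attributes.
      from collections import defaultdict

      def positive_region(rows, attrs, decision):
          """rows: list of dicts; attrs: condition attributes; decision: label key."""
          blocks = defaultdict(list)
          for i, r in enumerate(rows):
              blocks[tuple(r[a] for a in attrs)].append(i)       # indiscernibility classes
          pos = set()
          for members in blocks.values():
              labels = {rows[i][decision] for i in members}
              if len(labels) == 1:                               # consistent block
                  pos.update(members)
          return pos

      table = [  # hypothetical decision table
          {"outlook": "sunny", "windy": 0, "play": "no"},
          {"outlook": "sunny", "windy": 1, "play": "no"},
          {"outlook": "rainy", "windy": 0, "play": "yes"},
          {"outlook": "rainy", "windy": 1, "play": "no"},
      ]
      print(len(positive_region(table, ["outlook"], "play")))            # 2
      print(len(positive_region(table, ["outlook", "windy"], "play")))   # 4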

  • Y. Liao, Y. Wang, and Y. Liu, “Graph Regularized Auto-Encoders for Image Representation," IEEE Transactions on Image Processing, vol. 26, p. 2839–2852, 2017.
    [BibTeX] [Abstract] [DOI] [PDF]

    Image representation has been intensively explored in the domain of computer vision for its significant influence on the relative tasks such as image clustering and classification. It is valuable to learn a low-dimensional representation of an image which preserves its inherent information from the original image space. At the perspective of manifold learning, this is implemented with the local invariant idea to capture the intrinsic low-dimensional manifold embedded in the high-dimensional input space. Inspired by the recent successes of deep architectures, we propose a local invariant deep nonlinear mapping algorithm, called graph regularized auto-encoder (GAE). With the graph regularization, the proposed method preserves the local connectivity from the original image space to the representation space, while the stacked auto-encoders provide explicit encoding model for fast inference and powerful expressive capacity for complex modeling. Theoretical analysis shows that the graph regularizer penalizes the weighted Frobenius norm of the Jacobian matrix of the encoder mapping, where the weight matrix captures the local property in the input space. Furthermore, the underlying effects on the hidden representation space are revealed, providing insightful explanation to the advantage of the proposed method. Finally, the experimental results on both clustering and classification tasks demonstrate the effectiveness of our GAE as well as the correctness of the proposed theoretical analysis, and it also suggests that GAE is a superior solution to the current deep representation learning techniques comparing with variant auto-encoders and existing local invariant methods.

    @article{liao2017graphra,
    title = {Graph Regularized Auto-Encoders for Image Representation},
    author = {Yiyi Liao and Yue Wang and Yong Liu},
    year = 2017,
    journal = {IEEE Transactions on Image Processing},
    volume = 26,
    pages = {2839--2852},
    doi = {10.1109/TIP.2016.2605010},
    abstract = {Image representation has been intensively explored in the domain of computer vision for its significant influence on the relative tasks such as image clustering and classification. It is valuable to learn a low-dimensional representation of an image which preserves its inherent information from the original image space. At the perspective of manifold learning, this is implemented with the local invariant idea to capture the intrinsic low-dimensional manifold embedded in the high-dimensional input space. Inspired by the recent successes of deep architectures, we propose a local invariant deep nonlinear mapping algorithm, called graph regularized auto-encoder (GAE). With the graph regularization, the proposed method preserves the local connectivity from the original image space to the representation space, while the stacked auto-encoders provide explicit encoding model for fast inference and powerful expressive capacity for complex modeling. Theoretical analysis shows that the graph regularizer penalizes the weighted Frobenius norm of the Jacobian matrix of the encoder mapping, where the weight matrix captures the local property in the input space. Furthermore, the underlying effects on the hidden representation space are revealed, providing insightful explanation to the advantage of the proposed method. Finally, the experimental results on both clustering and classification tasks demonstrate the effectiveness of our GAE as well as the correctness of the proposed theoretical analysis, and it also suggests that GAE is a superior solution to the current deep representation learning techniques comparing with variant auto-encoders and existing local invariant methods.}
    }
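
    A PyTorch sketch of the general recipe, an auto-encoder whose loss adds a graph-Laplacian term tr(H^T L H) on the hidden codes; the RBF affinity graph, layer sizes, and weighting are assumptions rather than the paper's exact construction.

      # Auto-encoder with a graph (Laplacian) regulariser on the hidden codes.
      import torch
      import torch.nn as nn

      X = torch.randn(64, 100)                      # toy data: 64 samples, 100 dims

      # Assumed similarity graph: a simple RBF affinity between samples.
      d2 = torch.cdist(X, X) ** 2
      W = torch.exp(-d2 / d2.mean())
      L = torch.diag(W.sum(1)) - W                  # unnormalised graph Laplacian

      enc = nn.Sequential(nn.Linear(100, 32), nn.Sigmoid())
      dec = nn.Linear(32, 100)
      opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

      lam = 0.1                                     # regularisation strength (assumed)
      for _ in range(200):
          H = enc(X)
          recon = dec(H)
          loss = ((recon - X) ** 2).mean() + lam * torch.trace(H.T @ L @ H) / X.size(0)
          opt.zero_grad(); loss.backward(); opt.step()
      print(loss.item())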

  • Y. Wang, Y. Liu, Y. Liao, and R. Xiong, “Scalable Learning Framework for Traversable Region Detection Fusing With Appearance and Geometrical Information," IEEE Transactions on Intelligent Transportation Systems, vol. 18, p. 3267–3281, 2017.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we present an online learning framework for traversable region detection fusing both appearance and geometry information. Our framework proposes an appearance classifier supervised by the sparse geometric clues to capture the variation in online data, yielding dense detection result in real time. It provides superior detection performance using appearance information with weak geometric prior and can be further improved with more geometry from external sensors. The learning process is divided into three steps: First, we construct features from the super-pixel level, which reduces the computational cost compared with the pixel level processing. Then we classify the multi-scale super-pixels to vote the label of each pixel. Second, we use weighted extreme learning machine as our classifier to deal with the imbalanced data distribution since the weak geometric prior only initializes the labels in a small region. Finally, we employ the online learning process so that our framework can be adaptive to the changing scenes. Experimental results on three different styles of image sequences, i.e., shadow road, rain sequence, and variational sequence, demonstrate the adaptability, stability, and parameter insensitivity of our weak geometry motivated method. We further demonstrate the performance of learning framework on additional five challenging data sets captured by Kinect V2 and stereo camera, validating the method’s effectiveness and efficiency.

    @article{wang2017scalablelf,
    title = {Scalable Learning Framework for Traversable Region Detection Fusing With Appearance and Geometrical Information},
    author = {Yue Wang and Yong Liu and Yiyi Liao and Rong Xiong},
    year = 2017,
    journal = {IEEE Transactions on Intelligent Transportation Systems},
    volume = 18,
    pages = {3267--3281},
    doi = {10.1109/TITS.2017.2682218},
    abstract = {In this paper, we present an online learning framework for traversable region detection fusing both appearance and geometry information. Our framework proposes an appearance classifier supervised by the sparse geometric clues to capture the variation in online data, yielding dense detection result in real time. It provides superior detection performance using appearance information with weak geometric prior and can be further improved with more geometry from external sensors. The learning process is divided into three steps: First, we construct features from the super-pixel level, which reduces the computational cost compared with the pixel level processing. Then we classify the multi-scale super-pixels to vote the label of each pixel. Second, we use weighted extreme learning machine as our classifier to deal with the imbalanced data distribution since the weak geometric prior only initializes the labels in a small region. Finally, we employ the online learning process so that our framework can be adaptive to the changing scenes. Experimental results on three different styles of image sequences, i.e., shadow road, rain sequence, and variational sequence, demonstrate the adaptability, stability, and parameter insensitivity of our weak geometry motivated method. We further demonstrate the performance of learning framework on additional five challenging data sets captured by Kinect V2 and stereo camera, validating the method’s effectiveness and efficiency.}
    }

  • Y. Liao, S. Kodagoda, Y. Wang, L. Shi, and Y. Liu, “Place Classification With a Graph Regularized Deep Neural Network," IEEE Transactions on Cognitive and Developmental Systems, vol. 9, p. 304–315, 2017.
    [BibTeX] [Abstract] [DOI] [PDF]

    Place classification is a fundamental ability that a robot should possess to carry out effective human-robot interactions. In recent years, there is a high exploitation of artificial intelligence algorithms in robotics applications. Inspired by the recent successes of deep learning methods, we propose an end-to-end learning approach for the place classification problem. With deep architectures, this methodology automatically discovers features and contributes in general to higher classification accuracies. The pipeline of our approach is composed of three parts. First, we construct multiple layers of laser range data to represent the environment information in different levels of granularity. Second, each layer of data are fed into a deep neural network for classification, where a graph regularization is imposed to the deep architecture for keeping local consistency between adjacent samples. Finally, the predicted labels obtained from all layers are fused based on confidence trees to maximize the overall confidence. Experimental results validate the effectiveness of our end-to-end place classification framework in which both the multilayer structure and the graph regularization promote the classification performance. Furthermore, results show that the features automatically learned from the raw input range data can achieve competitive results to the features constructed based on statistical and geometrical information.

    @article{liao2017placecw,
    title = {Place Classification With a Graph Regularized Deep Neural Network},
    author = {Yiyi Liao and Sarath Kodagoda and Yue Wang and Lei Shi and Yong Liu},
    year = 2017,
    journal = {IEEE Transactions on Cognitive and Developmental Systems},
    volume = 9,
    pages = {304--315},
    doi = {10.1109/TCDS.2016.2586183},
    abstract = {Place classification is a fundamental ability that a robot should possess to carry out effective human-robot interactions. In recent years, there is a high exploitation of artificial intelligence algorithms in robotics applications. Inspired by the recent successes of deep learning methods, we propose an end-to-end learning approach for the place classification problem. With deep architectures, this methodology automatically discovers features and contributes in general to higher classification accuracies. The pipeline of our approach is composed of three parts. First, we construct multiple layers of laser range data to represent the environment information in different levels of granularity. Second, each layer of data are fed into a deep neural network for classification, where a graph regularization is imposed to the deep architecture for keeping local consistency between adjacent samples. Finally, the predicted labels obtained from all layers are fused based on confidence trees to maximize the overall confidence. Experimental results validate the effectiveness of our end-to-end place classification framework in which both the multilayer structure and the graph regularization promote the classification performance. Furthermore, results show that the features automatically learned from the raw input range data can achieve competitive results to the features constructed based on statistical and geometrical information.}
    }

2016

  • Y. Liu, Y. Liao, L. Tang, F. Tang, and W. Liu, “General subspace constrained non-negative matrix factorization for data representation," Neurocomputing, vol. 173, p. 224–232, 2016.
    [BibTeX] [Abstract] [DOI] [PDF]

    Nonnegative matrix factorization (NMF) has been proved to be a powerful data representation method, and has shown success in applications such as data representation and document clustering. However, the non-negative constraint alone is not able to capture the underlying properties of the data. In this paper, we present a framework to enforce general subspace constraints into NMF by augmenting the original objective function with two additional terms. One on constraints of the basis, the other on preserving the structural properties of the original data. This framework is general as it can be used to regularize NMF with a wide variety of subspace constraints that can be formulated into a certain form such as PCA, Fisher LDA and LPP. In addition, we present an iterative optimization algorithm to solve the general subspace constrained non-negative matrix factorization (GSC NMF). We show that the resulting subspace has enriched representation power as shown in our experiments.

    @article{liu2016generalsc,
    title = {General subspace constrained non-negative matrix factorization for data representation},
    author = {Yong Liu and Yiyi Liao and Liang Tang and Feng Tang and Weicong Liu},
    year = 2016,
    journal = {Neurocomputing},
    volume = 173,
    pages = {224--232},
    doi = {10.1016/j.neucom.2014.11.099},
    abstract = {Nonnegative matrix factorization (NMF) has been proved to be a powerful data representation method, and has shown success in applications such as data representation and document clustering. However, the non-negative constraint alone is not able to capture the underlying properties of the data. In this paper, we present a framework to enforce general subspace constraints into NMF by augmenting the original objective function with two additional terms. One on constraints of the basis, the other on preserving the structural properties of the original data. This framework is general as it can be used to regularize NMF with a wide variety of subspace constraints that can be formulated into a certain form such as PCA, Fisher LDA and LPP. In addition, we present an iterative optimization algorithm to solve the general subspace constrained non-negative matrix factorization (GSC NMF). We show that the resulting subspace has enriched representation power as shown in our experiments.}
    }
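
    For reference, the plain NMF multiplicative updates that a subspace-constrained variant such as GSC-NMF augments with additional terms; the constraint and structure-preserving terms of the paper would modify these update rules, which are shown here only as the baseline.

      # Plain NMF via multiplicative updates (the baseline GSC-NMF builds on).
      import numpy as np

      rng = np.random.default_rng(0)
      V = np.abs(rng.normal(size=(50, 40)))          # nonnegative data matrix
      k = 5                                          # latent dimension (assumed)
      W = np.abs(rng.normal(size=(50, k)))
      H = np.abs(rng.normal(size=(k, 40)))

      eps = 1e-9
      for _ in range(200):
          H *= (W.T @ V) / (W.T @ W @ H + eps)       # update coefficients
          W *= (V @ H.T) / (W @ H @ H.T + eps)       # update basis
      print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))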

  • Y. Liu, R. Xiong, Y. Wang, H. Huang, X. Xie, X. Liu, and G. Zhang, “Stereo Visual-Inertial Odometry With Multiple Kalman Filters Ensemble," IEEE Transactions on Industrial Electronics, vol. 63, p. 6205–6216, 2016.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we present a stereo visual-inertial odometry algorithm assembled with three separated Kalman filters, i.e., attitude filter, orientation filter, and position filter. Our algorithm carries out the orientation and position estimation with three filters working on different fusion intervals, which can provide more robustness even when the visual odometry estimation fails. In our orientation estimation, we propose an improved indirect Kalman filter, which uses the orientation error space represented by unit quaternion as the state of the filter. The performance of the algorithm is demonstrated through extensive experimental results, including the benchmark KITTI datasets and some challenging datasets captured in a rough terrain campus.

    @article{liu2016stereovo,
    title = {Stereo Visual-Inertial Odometry With Multiple Kalman Filters Ensemble},
    author = {Yong Liu and Rong Xiong and Yue Wang and Hong Huang and Xiaojia Xie and Xiaofeng Liu and Gaoming Zhang},
    year = 2016,
    journal = {IEEE Transactions on Industrial Electronics},
    volume = 63,
    pages = {6205--6216},
    doi = {10.1109/TIE.2016.2573765},
    abstract = {In this paper, we present a stereo visual-inertial odometry algorithm assembled with three separated Kalman filters, i.e., attitude filter, orientation filter, and position filter. Our algorithm carries out the orientation and position estimation with three filters working on different fusion intervals, which can provide more robustness even when the visual odometry estimation fails. In our orientation estimation, we propose an improved indirect Kalman filter, which uses the orientation error space represented by unit quaternion as the state of the filter. The performance of the algorithm is demonstrated through extensive experimental results, including the benchmark KITTI datasets and some challenging datasets captured in a rough terrain campus.}
    }

  • H. Zhao, Y. Liu, X. Xie, Y. Liao, and X. Liu, “Filtering Based Adaptive Visual Odometry Sensor Framework Robust to Blurred Images," Sensors (Basel, Switzerland), vol. 16, p. 1040, 2016.
    [BibTeX] [Abstract] [DOI] [PDF]

    Visual odometry (VO) estimation from blurred images is a challenging problem in practical robot applications, since blurred images severely reduce the estimation accuracy of the VO. In this paper, we address the problem of visual odometry estimation from blurred images and present an adaptive visual odometry estimation framework robust to blurred images. Our approach employs an objective image measure, named small image gradient distribution (SIGD), to evaluate the blurring degree of an image; an adaptive blurred-image classification algorithm is then proposed to recognize blurred images; finally, we propose an anti-blur key-frame selection algorithm to make the VO robust to blurred images. We also carried out comparative experiments to evaluate the performance of VO algorithms with our anti-blur framework under various blurred images, and the experimental results show that our approach achieves superior performance compared to state-of-the-art methods on blurred images while adding little computational cost to the original VO algorithms.

    @article{zhao2016filteringba,
    title = {Filtering Based Adaptive Visual Odometry Sensor Framework Robust to Blurred Images},
    author = {Haiyin Zhao and Yong Liu and Xiaojia Xie and Yiyi Liao and Xixi Liu},
    year = 2016,
    journal = {Sensors (Basel, Switzerland)},
    volume = 16,
    pages = 1040,
    doi = {10.3390/s16071040},
    abstract = {Visual odometry (VO) estimation from blurred images is a challenging problem in practical robot applications, since blurred images severely reduce the estimation accuracy of the VO. In this paper, we address the problem of visual odometry estimation from blurred images and present an adaptive visual odometry estimation framework robust to blurred images. Our approach employs an objective image measure, named small image gradient distribution (SIGD), to evaluate the blurring degree of an image; an adaptive blurred-image classification algorithm is then proposed to recognize blurred images; finally, we propose an anti-blur key-frame selection algorithm to make the VO robust to blurred images. We also carried out comparative experiments to evaluate the performance of VO algorithms with our anti-blur framework under various blurred images, and the experimental results show that our approach achieves superior performance compared to state-of-the-art methods on blurred images while adding little computational cost to the original VO algorithms.}
    }
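
    In the spirit of the SIGD measure (not its exact definition), a blur score can be sketched as the fraction of small image gradients, since blurred frames concentrate gradient magnitudes near zero; the threshold values below are assumptions.

      # Blur measure in the spirit of SIGD: fraction of small image gradients.
      import cv2
      import numpy as np

      def small_gradient_ratio(gray, grad_thresh=10.0):
          gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
          gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
          mag = np.sqrt(gx ** 2 + gy ** 2)
          return float((mag < grad_thresh).mean())   # high ratio -> likely blurred

      frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frame
      if frame is not None:
          ratio = small_gradient_ratio(frame)
          is_blurred = ratio > 0.9                    # assumed decision threshold
          print(ratio, is_blurred)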

2015

  • Y. Liu, F. Tang, and Z. Zeng, “Feature Selection Based on Dependency Margin," IEEE Transactions on Cybernetics, vol. 45, p. 1209–1221, 2015.
    [BibTeX] [Abstract] [DOI] [PDF]

    Feature selection tries to find a subset of features from a larger feature pool, and the selected subset can provide the same or even better performance compared with using the whole set. Feature selection is usually a critical preprocessing step for many machine-learning applications such as clustering and classification. In this paper, we focus on feature selection for supervised classification, which targets finding features that can best predict class labels. Traditional greedy search algorithms incrementally find features based on the relevance of candidate features and the class label. However, this may lead to suboptimal results when there are redundant features that may interfere with the selection. To solve this problem, we propose a subset selection algorithm that considers both the selected and remaining features’ relevances with the label. The intuition is that features that do not have better alternatives in the feature set should be selected first. We formulate the selection problem as maximizing the dependency margin, which is measured by the difference between the selected feature set performance and the remaining feature set performance. Extensive experiments on various data sets show the superiority of the proposed approach over traditional algorithms.

    @article{liu2015featuresb,
    title = {Feature Selection Based on Dependency Margin},
    author = {Yong Liu and Feng Tang and Zhiyong Zeng},
    year = 2015,
    journal = {IEEE Transactions on Cybernetics},
    volume = 45,
    pages = {1209--1221},
    doi = {10.1109/TCYB.2014.2347372},
    abstract = {Feature selection tries to find a subset of features from a larger feature pool, and the selected subset can provide the same or even better performance compared with using the whole set. Feature selection is usually a critical preprocessing step for many machine-learning applications such as clustering and classification. In this paper, we focus on feature selection for supervised classification, which targets finding features that can best predict class labels. Traditional greedy search algorithms incrementally find features based on the relevance of candidate features and the class label. However, this may lead to suboptimal results when there are redundant features that may interfere with the selection. To solve this problem, we propose a subset selection algorithm that considers both the selected and remaining features' relevances with the label. The intuition is that features that do not have better alternatives in the feature set should be selected first. We formulate the selection problem as maximizing the dependency margin, which is measured by the difference between the selected feature set performance and the remaining feature set performance. Extensive experiments on various data sets show the superiority of the proposed approach over traditional algorithms.}
    }
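
    A minimal sketch of the greedy dependency-margin idea described above (illustrative only, not the authors' implementation): the paper's dependency measure is replaced here by a cross-validated classifier score, and at each step the candidate feature that maximizes the gap between the selected subset's score and the remaining subset's score is added. All function and variable names are hypothetical.

    # Illustrative sketch of greedy selection by a "dependency margin".
    # The dependency measure (3-fold accuracy of a k-NN classifier) is a
    # stand-in assumption, not the measure defined in the paper.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def subset_score(X, y, feats):
        """Proxy for how well a feature subset predicts the label."""
        if not feats:
            return 0.0
        clf = KNeighborsClassifier(n_neighbors=3)
        return cross_val_score(clf, X[:, feats], y, cv=3).mean()

    def select_by_dependency_margin(X, y, k):
        """Greedily pick k features maximizing score(selected) - score(remaining)."""
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < k and remaining:
            margins = [subset_score(X, y, selected + [f]) -
                       subset_score(X, y, [r for r in remaining if r != f])
                       for f in remaining]
            best = remaining[int(np.argmax(margins))]
            selected.append(best)
            remaining.remove(best)
        return selected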

  • Y. Jiang, Y. Shen, Y. Liu, and W. Liu, “Multiclass AdaBoost ELM and Its Application in LBP Based Face Recognition," Mathematical Problems in Engineering, vol. 2015, p. 918105, 2015.
    [BibTeX] [Abstract] [DOI] [PDF]

    Extreme learning machine (ELM) is a competitive machine learning technique, which is simple in theory and fast in implementation; it can identify faults quickly and precisely as compared with traditional identification techniques such as support vector machines (SVM). As verified by the simulation results, ELM tends to have better scalability and can achieve much better generalization performance and much faster learning speed compared with traditional SVM. In this paper, we introduce a multiclass AdaBoost based ELM ensemble method. In our approach, the ELM algorithm is selected as the basic ensemble predictor due to its rapid speed and good performance. Compared with the existing boosting ELM algorithm, our algorithm can be directly used in multiclass classification problem. We also carried out comparable experiments with face recognition datasets. The experimental results show that the proposed algorithm can not only make the predicting result more stable, but also achieve better generalization performance.

    @article{jiang2015multiclassae,
    title = {Multiclass AdaBoost ELM and Its Application in LBP Based Face Recognition},
    author = {Yunliang Jiang and Yefeng Shen and Yong Liu and Weicong Liu},
    year = 2015,
    journal = {Mathematical Problems in Engineering},
    volume = 2015,
    pages = 918105,
    doi = {10.1155/2015/918105},
    abstract = {Extreme learning machine (ELM) is a competitive machine learning technique, which is simple in theory and fast in implementation; it can identify faults quickly and precisely as compared with traditional identification techniques such as support vector machines (SVM). As verified by the simulation results, ELM tends to have better scalability and can achieve much better generalization performance and much faster learning speed compared with traditional SVM. In this paper, we introduce a multiclass AdaBoost based ELM ensemble method. In our approach, the ELM algorithm is selected as the basic ensemble predictor due to its rapid speed and good performance. Compared with the existing boosting ELM algorithm, our algorithm can be directly used in multiclass classification problem. We also carried out comparable experiments with face recognition datasets. The experimental results show that the proposed algorithm can not only make the predicting result more stable, but also achieve better generalization performance.}
    }

2014

  • Y. Jiang, Y. Liu, W. Huang, and L. Huang, “Performance analysis of a mobile agent prototype system based on VIRGO P2P protocols," Concurrency and Computation: Practice and Experience, vol. 26, pp. 447–467, 2014.
    [BibTeX] [Abstract] [DOI] [PDF]

    The mobile agent technique has been broadly used in next generation distributed systems. The system performance measurement and simulation are required before the system can be deployed on a large scale. In this paper, we address performance analysis on a finite state mobile agent prototype on the basis of Virtual Hierarchical Tree Grid Organizations (VIRGO). The finite states refer to the migration, execution, and searching of the mobile agent. We introduce a novel evaluation model for the finite state mobile agent. The experimental results based on this evaluation model show that the finite mobile agents can perform well under multiple agent conditions and are superior to the traditional client/server approach.

    @article{jiang2014performanceao,
    title = {Performance analysis of a mobile agent prototype system based on VIRGO P2P protocols},
    author = {Yunliang Jiang and Yong Liu and Wenliang Huang and Lican Huang},
    year = 2014,
    journal = {Concurrency and Computation: Practice and Experience},
    volume = 26,
    pages = {447--467},
    doi = {10.1002/cpe.3006},
    abstract = {The mobile agent technique has been broadly used in next generation distributed systems. The system performance measurement and simulation are required before the system can be deployed on a large scale. In this paper, we address performance analysis on a finite state mobile agent prototype on the basis of Virtual Hierarchical Tree Grid Organizations (VIRGO). The finite states refer to the migration, execution, and searching of the mobile agent. We introduce a novel evaluation model for the finite state mobile agent. The experimental results based on this evaluation model show that the finite mobile agents can perform well under multiple agent conditions and are superior to the traditional client/server approach.}
    }

  • Y. Liu, X. Zheng, F. Tang, and X. Chen, “Ontology design with a granular approach," Expert Systems with Applications, vol. 41, pp. 4867–4877, 2014.
    [BibTeX] [Abstract] [DOI] [PDF]

    Ontology design for complex applications is quite a challenge. The quality of an ontology is highly dependent upon the capabilities of designers, and the collaborative design process is hampered by the difficulty of balancing the viewpoints of different designers. In this paper, we present a granular view of ontology: ontologies are granular, ontologies are granular approximations of conceptualizations and conceptual-relation granules of an ontology are ordered tuples. We then propose a corresponding granular ontology design approach. In our granular ontology design approach, the unified granular cognition level and hierarchies of sub-concepts are initialized before ontological terms are designed in detail, which reduces the subjective effects of the capabilities of designers. Our approach also introduces the idea of optimization to choose an optimal subset, which can best approximate the real concept domain, from the knowledge rule set presented by different domain experts. The optimal subset is chosen on the basis of the principle of granular ontology knowledge structure.

    @article{liu2014ontologydw,
    title = {Ontology design with a granular approach},
    author = {Yong Liu and Xiaoling Zheng and Feng Tang and Xiaofei Chen},
    year = 2014,
    journal = {Expert Systems with Applications},
    volume = 41,
    pages = {4867--4877},
    doi = {10.1016/j.eswa.2014.02.019},
    abstract = {Ontology design for complex applications is quite a challenge. The quality of an ontology is highly dependent upon the capabilities of designers, and the collaborative design process is hampered by the difficulty of balancing the viewpoints of different designers. In this paper, we present a granular view of ontology: ontologies are granular, ontologies are granular approximations of conceptualizations and conceptual-relation granules of an ontology are ordered tuples. We then propose a corresponding granular ontology design approach. In our granular ontology design approach, the unified granular cognition level and hierarchies of sub-concepts are initialized before ontological terms are designed in detail, which reduces the subjective effects of the capabilities of designers. Our approach also introduces the idea of optimization to choose an optimal subset, which can best approximate the real concept domain, from the knowledge rule set presented by different domain experts. The optimal subset is chosen on the basis of the principle of granular ontology knowledge structure.}
    }

  • Y. Liu, W. Huang, Y. Jiang, and Z. Zeng, “Quick attribute reduct algorithm for neighborhood rough set model," Information Sciences, vol. 271, pp. 65–81, 2014.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we propose an efficient quick attribute reduct algorithm based on neighborhood rough set model. In this algorithm we divide the objects (records) of the whole data set into a series of buckets based on their Euclidean distances, and then iterate each record by the sequence of buckets to calculate the positive region of neighborhood rough set model. We also prove that each record’s θ-neighborhood elements can only be contained in its own bucket and its adjacent buckets, thus it can reduce the iterations greatly. Based on the division of buckets, we then present a new fast algorithm to calculate the positive region of neighborhood rough set model, which can achieve a complexity of O(m|U|), m is the number of attributes, |U| is the number of records containing in the data set. Furthermore, with the new fast positive region computation algorithm, we present a quick reduct algorithm for neighborhood rough set model, and our algorithm can achieve a complexity of O(m²|U|). At last, the efficiency of this quick reduct algorithm is proved by comparable experiments, and especially this algorithm is more suitable for the reduction of big data.

    @article{liu2014quickar,
    title = {Quick attribute reduct algorithm for neighborhood rough set model},
    author = {Yong Liu and Wenliang Huang and YunLiang Jiang and Zhiyong Zeng},
    year = 2014,
    journal = {Information Sciences},
    volume = 271,
    pages = {65--81},
    doi = {10.1016/J.INS.2014.02.093},
    abstract = {In this paper, we propose an efficient quick attribute reduct algorithm based on neighborhood rough set model. In this algorithm we divide the objects (records) of the whole data set into a series of buckets based on their Euclidean distances, and then iterate each record by the sequence of buckets to calculate the positive region of neighborhood rough set model. We also prove that each record’s θ-neighborhood elements can only be contained in its own bucket and its adjacent buckets, thus it can reduce the iterations greatly. Based on the division of buckets, we then present a new fast algorithm to calculate the positive region of neighborhood rough set model, which can achieve a complexity of O(m|U|), m is the number of attributes, |U| is the number of records containing in the data set. Furthermore, with the new fast positive region computation algorithm, we present a quick reduct algorithm for neighborhood rough set model, and our algorithm can achieve a complexity of O(m²|U|). At last, the efficiency of this quick reduct algorithm is proved by comparable experiments, and especially this algorithm is more suitable for the reduction of big data.}
    }
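
    The bucketing idea in the abstract above can be sketched as follows (an assumption-laden illustration, not the authors' code): if records are bucketed by quantizing their Euclidean distance to a fixed reference point into intervals of width θ, then by the triangle inequality a record's θ-neighborhood can only lie in its own bucket or one of the two adjacent buckets, so only those buckets need to be scanned.

    # Illustrative sketch: bucket records by distance to a reference point so a
    # record's θ-neighborhood is searched only in its own and adjacent buckets.
    import numpy as np
    from collections import defaultdict

    def build_buckets(X, theta, ref=None):
        ref = np.zeros(X.shape[1]) if ref is None else ref
        d = np.linalg.norm(X - ref, axis=1)          # distance of each record to ref
        idx = (d // theta).astype(int)               # bucket index per record
        buckets = defaultdict(list)
        for i, b in enumerate(idx):
            buckets[b].append(i)
        return idx, buckets

    def theta_neighborhood(X, i, theta, idx, buckets):
        """Indices of records within Euclidean distance theta of record i."""
        cand = []
        for b in (idx[i] - 1, idx[i], idx[i] + 1):   # own bucket + adjacent buckets
            cand.extend(buckets.get(b, []))
        cand = np.array(cand)
        dist = np.linalg.norm(X[cand] - X[i], axis=1)
        return cand[dist <= theta]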

  • Y. Liu, R. Xiong, and Y. Li, “Robust and Accurate Multiple-Camera Pose Estimation toward Robotic Applications," International Journal of Advanced Robotic Systems, vol. 11, p. 153, 2014.
    [BibTeX] [Abstract] [DOI] [PDF]

    Pose estimation methods in robotics applications frequently suffer from inaccuracy due to a lack of correspondence and real-time constraints, and instability from a wide range of viewpoints, etc. In this paper, we present a novel approach for estimating the poses of all the cameras in a multi-camera system in which each camera is placed rigidly using only a few coplanar points simultaneously. Instead of solving the orientation and translation for the multi-camera system from the overlapping point correspondences among all the cameras directly, we employ homography, which can map image points with 3D coplanar-referenced points. In our method, we first establish the corresponding relations between each camera by their Euclidean geometries and optimize the homographies of the cameras; then, we solve the orientation and translation for the optimal homographies. The results from simulations and real case experiments show that our approach is accurate and robust for implementation in robotics applications. Finally, a practical implementation in a ping-pong robot is described in order to confirm the validity of our approach.

    @article{liu2014robustaa,
    title = {Robust and Accurate Multiple-Camera Pose Estimation toward Robotic Applications},
    author = {Yong Liu and Rong Xiong and Yi Li},
    year = 2014,
    journal = {International Journal of Advanced Robotic Systems},
    volume = 11,
    pages = 153,
    doi = {10.5772/58868},
    abstract = {Pose estimation methods in robotics applications frequently suffer from inaccuracy due to a lack of correspondence and real-time constraints, and instability from a wide range of viewpoints, etc. In this paper, we present a novel approach for estimating the poses of all the cameras in a multi-camera system in which each camera is placed rigidly using only a few coplanar points simultaneously. Instead of solving the orientation and translation for the multi-camera system from the overlapping point correspondences among all the cameras directly, we employ homography, which can map image points with 3D coplanar-referenced points. In our method, we first establish the corresponding relations between each camera by their Euclidean geometries and optimize the homographies of the cameras; then, we solve the orientation and translation for the optimal homographies. The results from simulations and real case experiments show that our approach is accurate and robust for implementation in robotics applications. Finally, a practical implementation in a ping-pong robot is described in order to confirm the validity of our approach.}
    }
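
    As a generic illustration of the homography step mentioned in the abstract (standard OpenCV calls, not the paper's multi-camera optimization), a planar homography can be estimated from coplanar point correspondences and decomposed into candidate rotations and translations; plane_pts, img_pts, and K below are assumed inputs and the helper name is hypothetical.

    # Generic sketch: candidate camera poses from a planar homography (OpenCV).
    import numpy as np
    import cv2

    def pose_candidates_from_plane(plane_pts, img_pts, K):
        """plane_pts: Nx2 points on the reference plane, img_pts: Nx2 pixels, K: 3x3 intrinsics."""
        H, inliers = cv2.findHomography(np.float32(plane_pts), np.float32(img_pts), cv2.RANSAC, 3.0)
        n_solutions, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
        # Each (R, t, n) triple is a candidate pose; the physically valid one is
        # usually kept by requiring positive depth for the observed points.
        return list(zip(Rs, ts, normals))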

2013

  • Y. Jiang, X. Zhang, L. Tang, W. Liu, J. Fan, and Y. Liu, “Multi-Robot Remote Interaction with FS-MAS," International Journal of Advanced Robotic Systems, vol. 10, p. 141, 2013.
    [BibTeX] [Abstract] [DOI] [PDF]

    The need to reduce bandwidth, improve productivity, autonomy and the scalability in multi-robot teleoperation has been recognized for a long time. In this article we propose a novel finite state machine mobile agent based on the network interaction service model, namely FS-MAS. This model consists of three finite state machines, namely the Finite State Mobile Agent (FS-Agent), which is the basic service module. The Service Content Finite State Machine (Content-FS), using the XML language to define workflow, to describe service content and service computation process. The Mobile Agent computation model Finite State Machine (MACM-FS), used to describe the service implementation. Finally, we apply this service model to the multi-robot system, the initial realization completing complex tasks in the form of multi-robot scheduling. This demonstrates that the robot has greatly improved intelligence, and provides a wide solution space for critical issues such as task division, rational and efficient use of resource and multi-robot collaboration.

    @article{jiang2013multirobotri,
    title = {Multi-Robot Remote Interaction with FS-MAS},
    author = {Yunliang Jiang and Xiongtao Zhang and Liang Tang and Weicong Liu and Jing Fan and Yong Liu},
    year = 2013,
    journal = {International Journal of Advanced Robotic Systems},
    volume = 10,
    pages = 141,
    doi = {10.5772/54468},
    abstract = {The need to reduce bandwidth, improve productivity, autonomy and the scalability in multi-robot teleoperation has been recognized for a long time. In this article we propose a novel finite state machine mobile agent based on the network interaction service model, namely FS-MAS. This model consists of three finite state machines, namely the Finite State Mobile Agent (FS-Agent), which is the basic service module. The Service Content Finite State Machine (Content-FS), using the XML language to define workflow, to describe service content and service computation process. The Mobile Agent computation model Finite State Machine (MACM-FS), used to describe the service implementation. Finally, we apply this service model to the multi-robot system, the initial realization completing complex tasks in the form of multi-robot scheduling. This demonstrates that the robot has greatly improved intelligence, and provides a wide solution space for critical issues such as task division, rational and efficient use of resource and multi-robot collaboration.}
    }

  • Y. Jiang, Y. Xu, and Y. Liu, “Performance evaluation of feature detection and matching in stereo visual odometry," Neurocomputing, vol. 120, pp. 380–390, 2013.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we try to evaluate which detector and descriptor may be the most appropriate solution in stereo visual odometry and whether there is any bias on calculation methods in visual odometry applications. We summarize the state of art feature detectors and descriptors in visual odometry field and divide them based on their implemented details. We present three new evaluation criterions (Detection Chain Repeatability, Average Detection Chain Re-projection Error and Matching Chain Precision) of feature detectors and descriptors. We also design experiments to evaluate the performance of different detectors and descriptors from the robustness, precision and cost of computation.

    @article{jiang2013performanceeo,
    title = {Performance evaluation of feature detection and matching in stereo visual odometry},
    author = {Yunliang Jiang and Yunxi Xu and Yong Liu},
    year = 2013,
    journal = {Neurocomputing},
    volume = 120,
    pages = {380--390},
    doi = {10.1016/j.neucom.2012.06.055},
    abstract = {In this paper, we try to evaluate which detector and descriptor may be the most appropriate solution in stereo visual odometry and whether there is any bias on calculation methods in visual odometry applications. We summarize the state of art feature detectors and descriptors in visual odometry field and divide them based on their implemented details. We present three new evaluation criterions (Detection Chain Repeatability, Average Detection Chain Re-projection Error and Matching Chain Precision) of feature detectors and descriptors. We also design experiments to evaluate the performance of different detectors and descriptors from the robustness, precision and cost of computation.}
    }

2012

  • Y. Liu, M. Zhang, F. Tang, Y. Jiang, Z. Pan, G. Liu, and H. Shen, “Constructing the virtual Jing-Hang Grand Canal with onto-draw," Expert Systems with Applications, vol. 39, pp. 12071–12084, 2012.
    [BibTeX] [Abstract] [DOI] [PDF]

    Constructing virtual 3D historical scenes from literature and records is a very challenging problem due to the difficulty in incorporating different types of domain knowledge into the modeling system. The domain knowledge comes from different experts, including: architects, historians, rendering artists, user interface designers and computer engineers. In this paper we investigate the problem of automatically generating drawings of ancient scenes by ontologies extracted from these domains. We introduce a framework called onto-draw to generate semantic models of desired scenes by constructing hierarchical ontology concept domains. Inconsistencies among them are resolved via an iterative refinement algorithm. We implement the onto-draw based ontology design approach and inconsistency removal technique in the virtual Jing-Hang Grand Canal construction project (Chen et al., 2010) and achieve encouraging results.

    @article{liu2012constructingtv,
    title = {Constructing the virtual Jing-Hang Grand Canal with onto-draw},
    author = {Yong Liu and Minming Zhang and Feng Tang and Yunliang Jiang and Zhigeng Pan and Gengdai Liu and Huaqing Shen},
    year = 2012,
    journal = {Expert Systems with Applications},
    volume = 39,
    pages = {12071--12084},
    doi = {10.1016/j.eswa.2012.04.026},
    abstract = {Constructing virtual 3D historical scenes from literature and records is a very challenging problem due to the difficulty in incorporating different types of domain knowledge into the modeling system. The domain knowledge comes from different experts, including: architects, historians, rendering artists, user interface designers and computer engineers. In this paper we investigate the problem of automatically generating drawings of ancient scenes by ontologies extracted from these domains. We introduce a framework called onto-draw to generate semantic models of desired scenes by constructing hierarchical ontology concept domains. Inconsistencies among them are resolved via an iterative refinement algorithm. We implement the onto-draw based ontology design approach and inconsistency removal technique in the virtual Jing-Hang Grand Canal construction project (Chen et al., 2010) and achieve encouraging results.}
    }

  • Y. Liu, M. Zhang, Y. Jiang, and H. Zhao, “Improving procedural modeling with semantics in digital architectural heritage," Computers and Graphics, vol. 36, pp. 178–184, 2012.
    [BibTeX] [Abstract] [DOI] [PDF]

    We first introduce three challenges in the procedural modeling of digital architectural heritages and then present a general framework, which integrates several machine intelligence and semantic techniques, e.g., the ontology design approach, pattern mining, auto-annotating and rule reduction, to improve the procedural methods in architectural modeling. Several evaluations and experiments are also presented. The experimental results illustrate the improvements following our approach.

    @article{liu2012improvingpm,
    title = {Improving procedural modeling with semantics in digital architectural heritage},
    author = {Yong Liu and Mingmin Zhang and Yunliang Jiang and Haiying Zhao},
    year = 2012,
    journal = {Computers and Graphics},
    volume = 36,
    pages = {178--184},
    doi = {10.1016/j.cag.2012.01.003},
    abstract = {We first introduce three challenges in the procedural modeling of digital architectural heritages and then present a general framework, which integrates several machine intelligence and semantic techniques, e.g., the ontology design approach, pattern mining, auto-annotating and rule reduction, to improve the procedural methods in architectural modeling. Several evaluations and experiments are also presented. The experimental results illustrate the improvements following our approach.}
    }

2024

  • J. Xiang, S. Li, J. Chen, Z. Chen, T. Huang, L. Peng, and Y. Liu, “MaxQ: Multi-Axis Query for N:m Sparsity Network," in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 15845-15854.
    [BibTeX] [Abstract] [DOI] [PDF]

    N:m sparsity has received increasing attention due to its remarkable performance and latency trade-off compared with structured and unstructured sparsity. However, existing N:m sparsity methods do not differentiate the relative importance of weights among blocks and leave important weights underappreciated. Besides, they directly apply N:m sparsity to the whole network, which will cause severe information loss. Thus, they are still sub-optimal. In this paper, we propose an efficient and effective Multi-Axis Query methodology, dubbed as MaxQ, to rectify these problems. During the training, MaxQ employs a dynamic approach to generate soft N:m masks, considering the weight importance across multiple axes. This method enhances the weights with more importance and ensures more effective updates. Meanwhile, a sparsity strategy that gradually increases the percentage of N:m weight blocks is applied, which allows the network to heal from the pruning-induced damage progressively. During the runtime, the N:m soft masks can be precomputed as constants and folded into weights without causing any distortion to the sparse pattern and incurring additional computational overhead. Comprehensive experiments demonstrate that MaxQ achieves consistent improvements across diverse CNN architectures in various computer vision tasks, including image classification, object detection and instance segmentation. For ResNet50 with 1:16 sparse pattern, MaxQ can achieve 74.6% top-1 accuracy on ImageNet and improve by over 2.8% over the state-of-the-art. Codes and checkpoints are available at https://github.com/JingyangXiang/MaxQ.

    @inproceedings{xiang2024maxq,
    title = {MaxQ: Multi-Axis Query for N:m Sparsity Network},
    author = {Jingyang Xiang and Siqi Li and Junhao Chen and Zhuangzhi Chen and Tianxin Huang and Linpeng Peng and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {15845-15854},
    doi = {10.1109/CVPR52733.2024.01500},
    abstract = {N:m sparsity has received increasing attention due to its remarkable performance and latency trade-off compared with structured and unstructured sparsity. However, existing N:m sparsity methods do not differentiate the relative importance of weights among blocks and leave important weights underappreciated. Besides, they directly apply N:m sparsity to the whole network, which will cause severe information loss. Thus, they are still sub-optimal. In this paper, we propose an efficient and effective Multi-Axis Query methodology, dubbed as MaxQ, to rectify these problems. During the training, MaxQ employs a dynamic approach to generate soft N:m masks, considering the weight importance across multiple axes. This method enhances the weights with more importance and ensures more effective updates. Meanwhile, a sparsity strategy that gradually increases the percentage of N:m weight blocks is applied, which allows the network to heal from the pruning-induced damage progressively. During the runtime, the N:m soft masks can be precomputed as constants and folded into weights without causing any distortion to the sparse pattern and incurring additional computational overhead. Comprehensive experiments demonstrate that MaxQ achieves consistent improvements across diverse CNN architectures in various computer vision tasks, including image classification, object detection and instance segmentation. For ResNet50 with 1:16 sparse pattern, MaxQ can achieve 74.6% top-1 accuracy on ImageNet and improve by over 2.8% over the state-of-the-art. Codes and checkpoints are available at https://github.com/JingyangXiang/MaxQ.}
    }
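
    For readers unfamiliar with the N:m pattern itself, the sketch below shows how a plain hard N:m mask (keep the N largest-magnitude weights in every consecutive block of m) can be built in PyTorch. It only illustrates the general pattern; MaxQ's soft, multi-axis masks are more involved, and the helper name is hypothetical.

    # Minimal sketch of a hard N:m sparsity mask (not MaxQ's soft masks).
    import torch

    def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
        """0/1 mask of weight's shape; weight.numel() must be divisible by m."""
        w = weight.reshape(-1, m)                    # group weights into blocks of m
        topk = w.abs().topk(n, dim=1).indices        # N largest magnitudes per block
        mask = torch.zeros_like(w)
        mask.scatter_(1, topk, 1.0)                  # keep only the top-N entries
        return mask.reshape(weight.shape)

    # Usage: w_sparse = w * nm_mask(w, n=2, m=4)     # 2:4 sparsity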

  • X. Hou, J. Xing, Y. Qian, Y. Guo, S. Xin, J. Chen, K. Tang, M. Wang, Z. Jiang, L. Liu, and Y. Liu, “SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking," in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26541-26551.
    [BibTeX] [Abstract] [DOI] [PDF]

    Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at: https://github.com/hoqolo/SDSTrack.

    @inproceedings{hou2024sds,
    title = {SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking},
    author = {Xiaojun Hou and Jiazheng Xing and Yijie Qian and Yaowei Guo and Shuo Xin and Junhao Chen and Kai Tang and Mengmeng Wang and Zhengkai Jiang and Liang Liu and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {26541-26551},
    doi = {10.1109/CVPR52733.2024.02507},
    abstract = {Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at: https://github.com/hoqolo/SDSTrack.}
    }

  • H. Li, Y. Ma, Y. Gu, K. Hu, Y. Liu, and X. Zuo, “RadarCam-Depth: Radar-Camera Fusion for Depth Estimation with Learned Metric Scale," in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 10665-10672.
    [BibTeX] [Abstract] [DOI] [PDF]

    We present a novel approach for metric dense depth estimation based on the fusion of a single-view image and a sparse, noisy Radar point cloud. The direct fusion of heterogeneous Radar and image data, or their encodings, tends to yield dense depth maps with significant artifacts, blurred boundaries, and suboptimal accuracy. To circumvent this issue, we learn to augment versatile and robust monocular depth prediction with the dense metric scale induced from sparse and noisy Radar data. We propose a Radar-Camera framework for highly accurate and fine-detailed dense depth estimation with four stages, including monocular depth prediction, global scale alignment of monocular depth with sparse Radar points, quasi-dense scale estimation through learning the association between Radar points and image patches, and local scale refinement of dense depth using a scale map learner. Our proposed method significantly outperforms the state-of-the-art Radar-Camera depth estimation methods by reducing the mean absolute error (MAE) of depth estimation by 25.6% and 40.2% on the challenging nuScenes dataset and our self-collected ZJU-4DRadarCam dataset, respectively. Our code and dataset will be released at https://github.com/MMOCKING/RadarCam-Depth.

    @inproceedings{li2024rcd,
    title = {RadarCam-Depth: Radar-Camera Fusion for Depth Estimation with Learned Metric Scale},
    author = {Han Li and Yukai Ma and Yaqing Gu and Kewei Hu and Yong Liu and Xingxing Zuo},
    year = 2024,
    booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {10665-10672},
    doi = {10.1109/ICRA57147.2024.10610929},
    abstract = {We present a novel approach for metric dense depth estimation based on the fusion of a single-view image and a sparse, noisy Radar point cloud. The direct fusion of heterogeneous Radar and image data, or their encodings, tends to yield dense depth maps with significant artifacts, blurred boundaries, and suboptimal accuracy. To circumvent this issue, we learn to augment versatile and robust monocular depth prediction with the dense metric scale induced from sparse and noisy Radar data. We propose a Radar-Camera framework for highly accurate and fine-detailed dense depth estimation with four stages, including monocular depth prediction, global scale alignment of monocular depth with sparse Radar points, quasi-dense scale estimation through learning the association between Radar points and image patches, and local scale refinement of dense depth using a scale map learner. Our proposed method significantly outperforms the state-of-the-art Radar-Camera depth estimation methods by reducing the mean absolute error (MAE) of depth estimation by 25.6% and 40.2% on the challenging nuScenes dataset and our self-collected ZJU-4DRadarCam dataset, respectively. Our code and dataset will be released at https://github.com/MMOCKING/RadarCam-Depth.}
    }

  • S. Xin, Z. Zhang, M. Wang, X. Hou, Y. Guo, X. Kang, L. Liu, and Y. Liu, “Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer," in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 337-344.
    [BibTeX] [Abstract] [DOI] [PDF]

    Tracking a specific person in 3D scene is gaining momentum due to its numerous applications in robotics. Currently, most 3D trackers focus on driving scenarios with neglected jitter and uncomplicated surroundings, which results in their severe degeneration in complex environments, especially on jolting robot platforms (only 20-60% success rate). To improve the accuracy, a Point-Video-based Transformer Tracking model (PVTrack) is presented for robots. It is the first multi-modal 3D human tracking work that incorporates point clouds together with RGB videos to achieve information complementarity. Moreover, PVTrack proposes the Siamese Point-Video Transformer for feature aggregation to overcome dynamic environments, which captures more target-aware information through the hierarchical attention mechanism adaptively. Considering the violent shaking on robots and rugged terrains, a lateral Human-ware Proposal Network is designed together with an Anti-shake Proposal Compensation module. It alleviates the disturbance caused by complex scenes as well as the particularity of the robot platform. Experiments show that our method achieves state-of-the-art performance on both KITTI/Waymo datasets and a quadruped robot for various indoor and outdoor scenes.

    @inproceedings{xin2024mmh,
    title = {Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer},
    author = {Shuo Xin and Zhen Zhang and Mengmeng Wang and Xiaojun Hou and Yaowei Guo and Xiao Kang and Liang Liu and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {337-344},
    doi = {10.1109/ICRA57147.2024.10610979},
    abstract = {Tracking a specific person in 3D scene is gaining momentum due to its numerous applications in robotics. Currently, most 3D trackers focus on driving scenarios with neglected jitter and uncomplicated surroundings, which results in their severe degeneration in complex environments, especially on jolting robot platforms (only 20-60% success rate). To improve the accuracy, a Point-Video-based Transformer Tracking model (PVTrack) is presented for robots. It is the first multi-modal 3D human tracking work that incorporates point clouds together with RGB videos to achieve information complementarity. Moreover, PVTrack proposes the Siamese Point-Video Transformer for feature aggregation to overcome dynamic environments, which captures more target-aware information through the hierarchical attention mechanism adaptively. Considering the violent shaking on robots and rugged terrains, a lateral Human-ware Proposal Network is designed together with an Anti-shake Proposal Compensation module. It alleviates the disturbance caused by complex scenes as well as the particularity of the robot platform. Experiments show that our method achieves state-of-the-art performance on both KITTI/Waymo datasets and a quadruped robot for various indoor and outdoor scenes.}
    }

  • C. Fu, L. Li, J. Mei, Y. Ma, L. Peng, X. Zhao, and Y. Liu, “A Coarse-to-Fine Place Recognition Approach using Attention-guided Descriptors and Overlap Estimation," in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 8493-8499.
    [BibTeX] [Abstract] [DOI] [PDF]

    Place recognition is a challenging but crucial task in robotics. Current description-based methods may be limited by representation capabilities, while pairwise similarity-based methods require exhaustive searches, which is time-consuming. In this paper, we present a novel coarse-to-fine approach to address these problems, which combines BEV (Bird’s Eye View) feature extraction, coarse-grained matching and fine-grained verification. In the coarse stage, our approach utilizes an attention-guided network to generate attention-guided descriptors. We then employ a fast affinity-based candidate selection process to identify the Top-K most similar candidates. In the fine stage, we estimate pairwise overlap among the narrowed-down place candidates to determine the final match. Experimental results on the KITTI and KITTI-360 datasets demonstrate that our approach outperforms state-of-the-art methods. The code will be released publicly soon.

    @inproceedings{fu2024ctf,
    title = {A Coarse-to-Fine Place Recognition Approach using Attention-guided Descriptors and Overlap Estimation},
    author = {Chencan Fu and Lin Li and Jianbiao Mei and Yukai Ma and Linpeng Peng and Xiangrui Zhao and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {8493-8499},
    doi = {10.1109/ICRA57147.2024.10611569},
    abstract = {Place recognition is a challenging but crucial task in robotics. Current description-based methods may be limited by representation capabilities, while pairwise similarity-based methods require exhaustive searches, which is time-consuming. In this paper, we present a novel coarse-to-fine approach to address these problems, which combines BEV (Bird's Eye View) feature extraction, coarse-grained matching and fine-grained verification. In the coarse stage, our approach utilizes an attention-guided network to generate attention-guided descriptors. We then employ a fast affinity-based candidate selection process to identify the Top-K most similar candidates. In the fine stage, we estimate pairwise overlap among the narrowed-down place candidates to determine the final match. Experimental results on the KITTI and KITTI-360 datasets demonstrate that our approach outperforms state-of-the-art methods. The code will be released publicly soon.}
    }

  • J. Zhu, L. Liu, Y. Tang, F. Wen, W. Li, and Y. Liu, “Semi-Supervised Learning for Visual Bird’s Eye View Semantic Segmentation," in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 9079-9085.
    [BibTeX] [Abstract] [DOI] [PDF]

    Visual bird’s eye view (BEV) semantic segmentation helps autonomous vehicles understand the surrounding environment only from front-view (FV) images, including static elements (e.g., roads) and dynamic elements (e.g., vehicles, pedestrians). However, the high cost of annotation procedures of full-supervised methods limits the capability of the visual BEV semantic segmentation, which usually needs HD maps, 3D object bounding boxes, and camera extrinsic matrixes. In this paper, we present a novel semi-supervised framework for visual BEV semantic segmentation to boost performance by exploiting unlabeled images during the training. A consistency loss that makes full use of unlabeled data is then proposed to constrain the model on not only semantic prediction but also the BEV feature. Furthermore, we propose a novel and effective data augmentation method named conjoint rotation which reasonably augments the dataset while maintaining the geometric relationship between the FV images and the BEV semantic segmentation. Extensive experiments on the nuScenes dataset show that our semi-supervised framework can effectively improve prediction accuracy. To the best of our knowledge, this is the first work that explores improving visual BEV semantic segmentation performance using unlabeled data. The code is available at https://github.com/Junyu-Z/Semi-BEVseg.

    @inproceedings{zhu2024ssl,
    title = {Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation},
    author = {Junyu Zhu and Lina Liu and Yu Tang and Feng Wen and Wanlong Li and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {9079-9085},
    doi = {10.1109/ICRA57147.2024.10611420},
    abstract = {Visual bird's eye view (BEV) semantic segmentation helps autonomous vehicles understand the surrounding environment only from front-view (FV) images, including static elements (e.g., roads) and dynamic elements (e.g., vehicles, pedestrians). However, the high cost of annotation procedures of full-supervised methods limits the capability of the visual BEV semantic segmentation, which usually needs HD maps, 3D object bounding boxes, and camera extrinsic matrixes. In this paper, we present a novel semi-supervised framework for visual BEV semantic segmentation to boost performance by exploiting unlabeled images during the training. A consistency loss that makes full use of unlabeled data is then proposed to constrain the model on not only semantic prediction but also the BEV feature. Furthermore, we propose a novel and effective data augmentation method named conjoint rotation which reasonably augments the dataset while maintaining the geometric relationship between the FV images and the BEV semantic segmentation. Extensive experiments on the nuScenes dataset show that our semi-supervised framework can effectively improve prediction accuracy. To the best of our knowledge, this is the first work that explores improving visual BEV semantic segmentation performance using unlabeled data. The code is available at https://github.com/Junyu-Z/Semi-BEVseg.}
    }

  • S. Liu, D. Xing, P. Gu, X. Wang, B. An, and Y. Liu, “Solving Homogeneous and Heterogeneous Cooperative Tasks with Greedy Sequential Execution," in 12th International Conference on Learning Representations (ICLR), 2024.
    [BibTeX] [Abstract]

    Cooperative multi-agent reinforcement learning (MARL) is extensively used for solving complex cooperative tasks, and value decomposition methods are a prevalent approach for this domain. However, these methods have not been successful in addressing both homogeneous and heterogeneous tasks simultaneously which is a crucial aspect for the practical application of cooperative agents. On one hand, value decomposition methods demonstrate superior performance in homogeneous tasks. Nevertheless, they tend to produce agents with similar policies, which is unsuitable for heterogeneous tasks. On the other hand, solutions based on personalized observation or assigned roles are well-suited for heterogeneous tasks. However, they often lead to a trade-off situation where the agent’s performance in homogeneous scenarios is negatively affected due to the aggregation of distinct policies. An alternative approach is to adopt sequential execution policies, which offer a flexible form for learning both types of tasks. However, learning sequential execution policies poses challenges in terms of credit assignment, and the limited information about subsequently executed agents can lead to sub-optimal solutions, which is known as the relative over-generalization problem. To tackle these issues, this paper proposes Greedy Sequential Execution (GSE) as a solution to learn the optimal policy that covers both scenarios. In the proposed GSE framework, we introduce an individual utility function into the framework of value decomposition to consider the complex interactions between agents. This function is capable of representing both the homogeneous and heterogeneous optimal policies. Furthermore, we utilize greedy marginal contribution calculated by the utility function as the credit value of the sequential execution policy to address the credit assignment and relative over-generalization problem. We evaluated GSE in both homogeneous and heterogeneous scenarios. The results demonstrate that GSE achieves significant improvement in performance across multiple domains, especially in scenarios involving both homogeneous and heterogeneous tasks.

    @inproceedings{liu2024shh,
    title = {Solving Homogeneous and Heterogeneous Cooperative Tasks with Greedy Sequential Execution},
    author = {Shanqi Liu and Dong Xing and Pengjie Gu and Xinrun Wang and Bo An and Yong Liu},
    year = 2024,
    booktitle = {12th International Conference on Learning Representations (ICLR)},
    abstract = {Cooperative multi-agent reinforcement learning (MARL) is extensively used for solving complex cooperative tasks, and value decomposition methods are a prevalent approach for this domain. However, these methods have not been successful in addressing both homogeneous and heterogeneous tasks simultaneously which is a crucial aspect for the practical application of cooperative agents. On one hand, value decomposition methods demonstrate superior performance in homogeneous tasks. Nevertheless, they tend to produce agents with similar policies, which is unsuitable for heterogeneous tasks. On the other hand, solutions based on personalized observation or assigned roles are well-suited for heterogeneous tasks. However, they often lead to a trade-off situation where the agent's performance in homogeneous scenarios is negatively affected due to the aggregation of distinct policies. An alternative approach is to adopt sequential execution policies, which offer a flexible form for learning both types of tasks. However, learning sequential execution policies poses challenges in terms of credit assignment, and the limited information about subsequently executed agents can lead to sub-optimal solutions, which is known as the relative over-generalization problem. To tackle these issues, this paper proposes Greedy Sequential Execution (GSE) as a solution to learn the optimal policy that covers both scenarios. In the proposed GSE framework, we introduce an individual utility function into the framework of value decomposition to consider the complex interactions between agents. This function is capable of representing both the homogeneous and heterogeneous optimal policies. Furthermore, we utilize greedy marginal contribution calculated by the utility function as the credit value of the sequential execution policy to address the credit assignment and relative over-generalization problem. We evaluated GSE in both homogeneous and heterogeneous scenarios. The results demonstrate that GSE achieves significant improvement in performance across multiple domains, especially in scenarios involving both homogeneous and heterogeneous tasks.}
    }

  • J. Chen, H. Ye, M. Wang, T. Huang, G. Dai, I. W. Tsang, and Y. Liu, “Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold," in 12th International Conference on Learning Representations (ICLR), 2024.
    [BibTeX] [Abstract]

    The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than that of second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there is little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.

    @inproceedings{chen2024drc,
    title = {Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold},
    author = {Jun Chen and Haishan Ye and Mengmeng Wang and Tianxin Huang and Guang Dai and Ivor W Tsang and Yong Liu},
    year = 2024,
    booktitle = {12th International Conference on Learning Representations (ICLR)},
    abstract = {The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than that of second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there is little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.}
    }

  • R. Cai, X. Xv, Z. Lu, K. Zhang, and Y. Liu, “Fusion Assessment of Safety and Security for Intelligent Industrial Unmanned Systems," in 7th International Symposium on Autonomous Systems (ISAS), 2024.
    [BibTeX] [Abstract] [DOI] [PDF]

    Fault tree analysis is the most commonly used methodology in industrial safety analysis to predict the probability or frequency of system failure. Although fault tree analysis has been proposed for more than six decades, the assumptions used in most commercial fault tree analysis codes have not changed significantly, which limits the ability of the method to represent design, operation, and maintenance characteristics in the context of the increasing complexity and specialization of modern industrial systems. The basic setup of traditional fault trees is unable to include dependencies between events, time-varying failures, and repair rate realities to explain complex maintenance strategies. To address the above shortcomings, we propose a fusion tree model combining fault tree and attack tree, and simplify the causal structure of the fusion tree by modularization, and utilize the dynamic Markov model to represent the complex coupling relationship between components or nodes. Finally, we demonstrate the calculation process of fusion tree in pressure vessel systems with temporal control.

    @inproceedings{cai2024fas,
    title = {Fusion Assessment of Safety and Security for Intelligent Industrial Unmanned Systems},
    author = {Rongyao Cai and Xiao Xv and Zhengming Lu and Kexin Zhang and Yong Liu},
    year = 2024,
    booktitle = {7th International Symposium on Autonomous Systems (ISAS)},
    doi = {10.1109/ISAS61044.2024.10552597},
    abstract = {Fault tree analysis is the most commonly used methodology in industrial safety analysis to predict the probability or frequency of system failure. Although fault tree analysis has been proposed for more than six decades, the assumptions used in most commercial fault tree analysis codes have not changed significantly, which limits the ability of the method to represent design, operation, and maintenance characteristics in the context of the increasing complexity and specialization of modern industrial systems. The basic setup of traditional fault trees is unable to include dependencies between events, time-varying failures, and repair rate realities to explain complex maintenance strategies. To address the above shortcomings, we propose a fusion tree model combining fault tree and attack tree, and simplify the causal structure of the fusion tree by modularization, and utilize the dynamic Markov model to represent the complex coupling relationship between components or nodes. Finally, we demonstrate the calculation process of fusion tree in pressure vessel systems with temporal control.}
    }

  • S. Xin, L. Liu, X. Kang, Z. Zhang, M. Wang, and Y. Liu, “Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network," in 7th International Symposium on Autonomous Systems (ISAS), 2024.
    [BibTeX] [Abstract] [DOI] [PDF]

    3D human tracking plays a crucial role in the automation intelligence system. Current approaches focus on achieving higher performance on traditional driving datasets like KITTI, which overlook the jitteriness of the platform and the complexity of the environments. Once the scenarios are migrated to jolting robot platforms, they all degenerate severely with only a 20-60% success rate, which greatly restricts the high-level application of autonomous systems. In this work, beyond traditional flat scenes, we introduce Multi-modal Human Tracking Paradigm (MHTrack), a unified multimodal transformer-based model that can effectively track the target person frame-by-frame in point and video sequences. Specifically, we design a speed-inertia module-assisted stabilization mechanism along with an alternate training strategy to better migrate the tracking algorithm to the robot platform. To capture more target-aware information, we combine the geometric and appearance features of point clouds and video frames together based on a hierarchical Siamese Transformer Network. Additionally, considering the prior characteristics of the human category, we design a lateral cross-attention pyramid head for deeper feature aggregation and final 3D BBox generation. Extensive experiments confirm that MHTrack significantly outperforms the previous state-of-the-arts on both open-source datasets and large-scale robotic datasets. Further analysis verifies each component’s effectiveness and shows the robotic-centric paradigm’s promising potential when deployed into dynamic robotic systems.

    @inproceedings{xin2024btd,
    title = {Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network},
    author = {Shuo Xin and Liang Liu and Xiao Kang and Zhen Zhang and Mengmeng Wang and Yong Liu},
    year = 2024,
    booktitle = {7th International Symposium on Autonomous Systems (ISAS)},
    doi = {10.1109/ISAS61044.2024.10552604},
    abstract = {3D human tracking plays a crucial role in the automation intelligence system. Current approaches focus on achieving higher performance on traditional driving datasets like KITTI, which overlook the jitteriness of the platform and the complexity of the environments. Once the scenarios are migrated to jolting robot platforms, they all degenerate severely with only a 20-60% success rate, which greatly restricts the high-level application of autonomous systems. In this work, beyond traditional flat scenes, we introduce Multi-modal Human Tracking Paradigm (MHTrack), a unified multimodal transformer-based model that can effectively track the target person frame-by-frame in point and video sequences. Specifically, we design a speed-inertia module-assisted stabilization mechanism along with an alternate training strategy to better migrate the tracking algorithm to the robot platform. To capture more target-aware information, we combine the geometric and appearance features of point clouds and video frames together based on a hierarchical Siamese Transformer Network. Additionally, considering the prior characteristics of the human category, we design a lateral cross-attention pyramid head for deeper feature aggregation and final 3D BBox generation. Extensive experiments confirm that MHTrack significantly outperforms the previous state-of-the-arts on both open-source datasets and large-scale robotic datasets. Further analysis verifies each component's effectiveness and shows the robotic-centric paradigm's promising potential when deployed into dynamic robotic systems.}
    }

  • H. Tai, Y. Qian, X. Kang, L. Liu, and Y. Liu, “Fusing LiDAR and Radar with Pillars Attention for 3D Object Detection," in 7th International Symposium on Autonomous Systems (ISAS), 2024.
    [BibTeX] [Abstract] [DOI] [PDF]

    In recent years, LiDAR has emerged as one of the primary sensors for mobile robots, enabling accurate detection of 3D objects. On the other hand, 4D millimeter-wave Radar presents several advantages which can be a complementary for LiDAR, including an extended detection range, enhanced sensitivity to moving objects, and the ability to operate seamlessly in various weather conditions, making it a highly promising technology. To leverage the strengths of both sensors, this paper proposes a novel fusion method that combines LiDAR and 4D millimeter-wave Radar for 3D object detection. The proposed approach begins with an efficient multi-modal feature extraction technique utilizing a pillar representation. This method captures comprehensive information from both LiDAR and millimeter-wave Radar data, facilitating a holistic understanding of the environment. Furthermore, a Pillar Attention Fusion (PAF) module is employed to merge the extracted features, enabling seamless integration and fusion of information from both sensors. This fusion process results in lightweight detection headers capable of accurately predicting object boxes. To evaluate the effectiveness of our proposed approach, extensive experiments were conducted on the VoD dataset. The experimental results demonstrate the superiority of our fusion method, showcasing improved performance in terms of detection accuracy and robustness across different environmental conditions. The fusion of LiDAR and 4D millimeter-wave Radar holds significant potential for enhancing the capabilities of mobile robots in real-world scenarios. The proposed method, with its efficient multi-modal feature extraction and attention-based fusion, provides a reliable and effective solution for 3D object detection.

    @inproceedings{tai2024lidar,
    title = {Fusing LiDAR and Radar with Pillars Attention for 3D Object Detection},
    author = {Hanchen Tai and Yijie Qian and Xiao Kang and Liang Liu and Yong Liu},
    year = 2024,
    booktitle = {7th International Symposium on Autonomous Systems (ISAS)},
    doi = {10.1109/ISAS61044.2024.10552581},
    abstract = {In recent years, LiDAR has emerged as one of the primary sensors for mobile robots, enabling accurate detection of 3D objects. On the other hand, 4D millimeter-wave Radar presents several advantages which can be a complementary for LiDAR, including an extended detection range, enhanced sensitivity to moving objects, and the ability to operate seamlessly in various weather conditions, making it a highly promising technology. To leverage the strengths of both sensors, this paper proposes a novel fusion method that combines LiDAR and 4D millimeter-wave Radar for 3D object detection. The proposed approach begins with an efficient multi-modal feature extraction technique utilizing a pillar representation. This method captures comprehensive information from both LiDAR and millimeter-wave Radar data, facilitating a holistic understanding of the environment. Furthermore, a Pillar Attention Fusion (PAF) module is employed to merge the extracted features, enabling seamless integration and fusion of information from both sensors. This fusion process results in lightweight detection headers capable of accurately predicting object boxes. To evaluate the effectiveness of our proposed approach, extensive experiments were conducted on the VoD dataset. The experimental results demonstrate the superiority of our fusion method, showcasing improved performance in terms of detection accuracy and robustness across different environmental conditions. The fusion of LiDAR and 4D millimeter-wave Radar holds significant potential for enhancing the capabilities of mobile robots in real-world scenarios. The proposed method, with its efficient multi-modal feature extraction and attention-based fusion, provides a reliable and effective solution for 3D object detection.}
    }
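
    For readers who want a concrete picture of the attention-based fusion summarized above, here is a minimal sketch (my assumptions, not the authors' released code) of cross-attending LiDAR pillar features to 4D-Radar pillar features; module names and sizes are illustrative.

    # Illustrative sketch (not the PAF implementation): fusing per-pillar LiDAR and Radar
    # features with cross-attention, roughly in the spirit of the module described above.
    import torch
    import torch.nn as nn

    class PillarAttentionFusion(nn.Module):
        """Fuse LiDAR and Radar pillar features of shape (B, N_pillars, C)."""
        def __init__(self, channels: int = 64, num_heads: int = 4):
            super().__init__()
            # LiDAR pillars attend to Radar pillars; a symmetric branch could be added.
            self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(channels)
            self.proj = nn.Sequential(nn.Linear(2 * channels, channels), nn.ReLU(inplace=True))

        def forward(self, lidar_feats: torch.Tensor, radar_feats: torch.Tensor) -> torch.Tensor:
            # Query with LiDAR pillars, key/value from Radar pillars.
            attended, _ = self.cross_attn(query=lidar_feats, key=radar_feats, value=radar_feats)
            attended = self.norm(lidar_feats + attended)           # residual + norm
            fused = self.proj(torch.cat([lidar_feats, attended], dim=-1))
            return fused                                           # (B, N_pillars, C)

    if __name__ == "__main__":
        lidar = torch.randn(2, 1024, 64)   # hypothetical pillar features
        radar = torch.randn(2, 1024, 64)
        print(PillarAttentionFusion()(lidar, radar).shape)         # torch.Size([2, 1024, 64])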

  • K. Zhang, Q. Wen, C. Zhang, L. Sun, and Y. Liu, “Skip-Step Contrastive Predictive Coding for Time Series Anomaly Detection," in 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 7065-7069.
    [BibTeX] [Abstract] [DOI] [PDF]

    Self-supervised learning (SSL) shows impressive performance in many tasks lacking sufficient labels. In this paper, we study SSL in time series anomaly detection (TSAD) by incorporating the characteristics of time series data. Specifically, we build an anomaly detection algorithm consisting of global pattern learning and local association learning. The global pattern learning module builds encoder and decoder to reconstruct the raw time series data to detect global anomalies. To complement the limitation of the global pattern learning that ignores local associations between anomaly points and their adjacent windows, we design a local association learning module, which leverages contrastive predictive coding (CPC) to transform the identification of anomaly points into positive pairs identification. Motivated by the observation that adjusting the distance between the history window and the time point to be detected directly impacts the detection performance in the CPC framework, we further propose a skip-step CPC scheme in the local association learning module which adjusts the distance for better construction of the positive pairs and detection results. The experimental results show that the proposed algorithm achieves superior performance on SMD and PSM datasets in comparison with 12 state-of-the-art algorithms.

    @inproceedings{zhang2024ssc,
    title = {Skip-Step Contrastive Predictive Coding for Time Series Anomaly Detection},
    author = {Kexin Zhang and Qingsong Wen and Chaoli Zhang and Liang Sun and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    pages = {7065-7069},
    doi = {10.1109/ICASSP48485.2024.10447104},
    abstract = {Self-supervised learning (SSL) shows impressive performance in many tasks lacking sufficient labels. In this paper, we study SSL in time series anomaly detection (TSAD) by incorporating the characteristics of time series data. Specifically, we build an anomaly detection algorithm consisting of global pattern learning and local association learning. The global pattern learning module builds encoder and decoder to reconstruct the raw time series data to detect global anomalies. To complement the limitation of the global pattern learning that ignores local associations between anomaly points and their adjacent windows, we design a local association learning module, which leverages contrastive predictive coding (CPC) to transform the identification of anomaly points into positive pairs identification. Motivated by the observation that adjusting the distance between the history window and the time point to be detected directly impacts the detection performance in the CPC framework, we further propose a skip-step CPC scheme in the local association learning module which adjusts the distance for better construction of the positive pairs and detection results. The experimental results show that the proposed algorithm achieves superior performance on SMD and PSM datasets in comparison with 12 state-of-the-art algorithms.}
    }
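
    The skip-step idea above is compact enough to sketch. The following is a hedged, self-contained toy version (not the paper's implementation): the context window ends `skip` steps before the point to be detected, and an InfoNCE loss treats the true (context, point) pair as positive against in-batch negatives.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SkipStepCPC(nn.Module):
        def __init__(self, in_dim: int = 1, hidden: int = 64, skip: int = 4, window: int = 32):
            super().__init__()
            self.skip, self.window = skip, window
            self.context_enc = nn.GRU(in_dim, hidden, batch_first=True)   # summarizes the history window
            self.point_enc = nn.Linear(in_dim, hidden)                    # embeds the target point
            self.predictor = nn.Linear(hidden, hidden)

        def forward(self, series: torch.Tensor) -> torch.Tensor:
            # series: (B, T, in_dim); target is the last step, history ends `skip` steps earlier.
            B, T, _ = series.shape
            hist = series[:, T - self.skip - self.window : T - self.skip]  # (B, window, in_dim)
            target = series[:, -1]                                         # (B, in_dim)
            _, h = self.context_enc(hist)                                  # h: (1, B, hidden)
            c = self.predictor(h.squeeze(0))                               # predicted point embedding
            z = self.point_enc(target)                                     # actual point embedding
            logits = c @ z.t() / 0.1                                       # in-batch negatives
            labels = torch.arange(B, device=series.device)
            return F.cross_entropy(logits, labels)                         # InfoNCE loss

    if __name__ == "__main__":
        print(SkipStepCPC()(torch.randn(8, 64, 1)).item())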

  • R. Cai, L. Peng, Z. Lu, K. Zhang, and Y. Liu, “DCS: Debiased Contrastive Learning with Weak Supervision for Time Series Classification," in 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 5625-5629.
    [BibTeX] [Abstract] [DOI] [PDF]

    Self-supervised contrastive learning (SSCL) has performed excellently on time series classification tasks. Most SSCL-based classification algorithms generate positive and negative samples in the time or frequency domains, focusing on mining similarities between them. However, two issues are not well addressed in the SSCL framework: the sampling bias and the task-agnostic representation problems. Sampling bias indicates fake negative sample selection in SSCL, and task-agnostic representation results in the unknown correlation between the extracted feature and downstream tasks. To address the issues, we propose Debiased Contrastive learning with weak Supervision framework, abbreviated as DCS. It employs the clustering operation to remove fake negative samples and introduces weak supervisory signals into the SSCL framework to guide feature extraction. Additionally, we propose a channel augmentation method that allows the DCS to extract features from local and global perspectives simultaneously. The comprehensive experiments on the widely used datasets show that DCS achieves performance superior to state-of-the-art methods on the widely used popular benchmark datasets.

    @inproceedings{cai2024dcs,
    title = {DCS: Debiased Contrastive Learning with Weak Supervision for Time Series Classification},
    author = {Rongyao Cai and Linpeng Peng and Zhengming Lu and Kexin Zhang and Yong Liu},
    year = 2024,
    booktitle = {2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    pages = {5625-5629},
    doi = {10.1109/ICASSP48485.2024.10446381},
    abstract = {Self-supervised contrastive learning (SSCL) has performed excellently on time series classification tasks. Most SSCL-based classification algorithms generate positive and negative samples in the time or frequency domains, focusing on mining similarities between them. However, two issues are not well addressed in the SSCL framework: the sampling bias and the task-agnostic representation problems. Sampling bias indicates fake negative sample selection in SSCL, and task-agnostic representation results in the unknown correlation between the extracted feature and downstream tasks. To address the issues, we propose Debiased Contrastive learning with weak Supervision framework, abbreviated as DCS. It employs the clustering operation to remove fake negative samples and introduces weak supervisory signals into the SSCL framework to guide feature extraction. Additionally, we propose a channel augmentation method that allows the DCS to extract features from local and global perspectives simultaneously. The comprehensive experiments on the widely used datasets show that DCS achieves performance superior to state-of-the-art methods on the widely used popular benchmark datasets.}
    }
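
    A minimal sketch of the debiasing step described above, under the assumption that clustering is done with scikit-learn KMeans on the batch embeddings and that same-cluster samples are treated as likely fake negatives; function and parameter names are hypothetical.

    import torch
    import torch.nn.functional as F
    from sklearn.cluster import KMeans

    def debiased_contrastive_loss(anchor, positive, n_clusters=4, temperature=0.1):
        # anchor, positive: (B, D) embeddings of two augmented views of the same series.
        B = anchor.size(0)
        z = F.normalize(torch.cat([anchor, positive], dim=0), dim=1)          # (2B, D)
        cluster_ids = torch.as_tensor(
            KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z.detach().numpy()))
        sim = z @ z.t() / temperature                                         # (2B, 2B) similarities
        pos_idx = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])     # index of each row's positive
        mask = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)           # same cluster -> likely fake negative
        mask[torch.arange(2 * B), pos_idx] = False                            # always keep the true positive
        mask.fill_diagonal_(True)                                             # never contrast a sample with itself
        sim = sim.masked_fill(mask, float("-inf"))                            # drop self and fake negatives
        return F.cross_entropy(sim, pos_idx)

    loss = debiased_contrastive_loss(torch.randn(16, 32), torch.randn(16, 32))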

  • J. Ra, M. Wang, J. Mei, S. Liu, Y. Yang, and Y. Liu, “Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks," in 11th International Conference on 3D Vision (3DV), 2024, pp. 842-851.
    [BibTeX] [Abstract] [DOI]

    The point cloud-based 3D single object tracking plays an indispensable role in autonomous driving. However, the application of 3D object tracking in the real world is still challenging due to the inherent sparsity and self-occlusion of point cloud data. Therefore, it is necessary to exploit as much useful information from limited data as we can. Since 3D object tracking is a video-level task, the appearance of objects changes gradually over time, and there is rich spatiotemporal contextual information among historical frames. However, existing methods do not fully utilize this information. To address this, we propose a new method called SCTrack, which utilizes a memory-based paradigm to exploit spatiotemporal contextual information. SCTrack incorporates both long-term and short-term memory banks to store the spatiotemporal features of targets from historical frames. By doing so, the tracker can benefit from the entire video sequence and make more informed predictions. Additionally, SCTrack extracts the mask prior to augmenting the target representation, improving the target-background discriminability. Extensive experiments on KITTI, nuScenes, and Waymo Open datasets verify the effectiveness of our proposed method.

    @inproceedings{Ra2024esc,
    title = {Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks},
    author = {Jongwon Ra and Mengmeng Wang and Jianbiao Mei and Shanqi Liu and Yu Yang and Yong Liu},
    year = 2024,
    booktitle = {11th International Conference on 3D Vision (3DV)},
    pages = {842-851},
    doi = {10.1109/3DV62453.2024.00050},
    abstract = {The point cloud-based 3D single object tracking plays an indispensable role in autonomous driving. However, the application of 3D object tracking in the real world is still challenging due to the inherent sparsity and self-occlusion of point cloud data. Therefore, it is necessary to exploit as much useful information from limited data as we can. Since 3D object tracking is a video-level task, the appearance of objects changes gradually over time, and there is rich spatiotemporal contextual information among historical frames. However, existing methods do not fully utilize this information. To address this, we propose a new method called SCTrack, which utilizes a memory-based paradigm to exploit spatiotemporal contextual information. SCTrack incorporates both long-term and short-term memory banks to store the spatiotemporal features of targets from historical frames. By doing so, the tracker can benefit from the entire video sequence and make more informed predictions. Additionally, SCTrack extracts the mask prior to augmenting the target representation, improving the target-background discriminability. Extensive experiments on KITTI, nuScenes, and Waymo Open datasets verify the effectiveness of our proposed method.}
    }
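
    As a rough illustration of the long-/short-term memory banks mentioned above (a schematic data structure, not the SCTrack code): the short-term bank keeps the most recent target features, the long-term bank keeps a strided subset of older frames, and both are concatenated for cross-attention in the tracker.

    from collections import deque
    import torch

    class SpatiotemporalMemory:
        def __init__(self, short_size: int = 3, long_size: int = 8, long_stride: int = 5):
            self.short = deque(maxlen=short_size)     # most recent target features
            self.long = deque(maxlen=long_size)       # sparse long-horizon history
            self.long_stride = long_stride
            self._t = 0

        def update(self, target_feat: torch.Tensor) -> None:
            """target_feat: (N_points, C) features of the tracked target in the current frame."""
            self.short.append(target_feat)
            if self._t % self.long_stride == 0:
                self.long.append(target_feat)
            self._t += 1

        def read(self) -> torch.Tensor:
            """Concatenate both banks into one (M, C) tensor for cross-attention in the tracker."""
            banks = list(self.long) + list(self.short)
            return torch.cat(banks, dim=0) if banks else torch.empty(0)

    mem = SpatiotemporalMemory()
    for t in range(12):
        mem.update(torch.randn(128, 64))
    print(mem.read().shape)   # torch.Size([768, 64]): 3 long-term + 3 short-term frames of 128 points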

  • M. Wang, J. Xing, B. Jiang, J. Chen, J. Mei, X. Zuo, G. Dai, J. Wang, and Y. Liu, “A Multimodal, Multi-task Adapting Framework for Video Action Recognition," in 38th AAAI Conference on Artificial Intelligence (AAAI), 2024, pp. 5517-5525.
    [BibTeX] [Abstract] [DOI] [PDF]

    Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models’ generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter, that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.

    @inproceedings{wang2024amm,
    title = {A Multimodal, Multi-task Adapting Framework for Video Action Recognition},
    author = {Mengmeng Wang and Jiazheng Xing and Boyuan Jiang and Jun Chen and Jianbiao Mei and Xingxing Zuo and Guang Dai and Jingdong Wang and Yong Liu},
    year = 2024,
    booktitle = {38th AAAI Conference on Artificial Intelligence (AAAI)},
    pages = {5517-5525},
    doi = {10.1609/aaai.v38i6.28361},
    abstract = {Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter, that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.}
    }
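
    A hedged sketch of a bottleneck adapter with global temporal attention and local temporal-difference convolution, in the spirit of the TED-Adapter described above; all layer choices and shapes are my assumptions for illustration.

    import torch
    import torch.nn as nn

    class TEDAdapter(nn.Module):
        def __init__(self, dim: int = 768, bottleneck: int = 128, num_frames: int = 8):
            super().__init__()
            self.down, self.up, self.act = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim), nn.GELU()
            # Global temporal enhancement: attention across the time axis.
            self.temporal_attn = nn.MultiheadAttention(bottleneck, num_heads=4, batch_first=True)
            # Local temporal difference: depthwise conv over adjacent-frame differences.
            self.diff_conv = nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1, groups=bottleneck)
            self.num_frames = num_frames

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B*T, N_tokens, dim) frame tokens from the frozen visual encoder.
            BT, N, _ = x.shape
            T = self.num_frames
            B = BT // T
            h = self.act(self.down(x))                                   # (B*T, N, b)
            # Global enhancement: attend over time for each spatial token.
            g = h.view(B, T, N, -1).permute(0, 2, 1, 3).reshape(B * N, T, -1)
            g, _ = self.temporal_attn(g, g, g)
            g = g.reshape(B, N, T, -1).permute(0, 2, 1, 3).reshape(BT, N, -1)
            # Local difference: conv over frame-to-frame feature differences.
            d = h.view(B, T, N, -1)
            d = torch.diff(d, dim=1, prepend=d[:, :1])                   # (B, T, N, b)
            d = d.permute(0, 2, 3, 1).reshape(B * N, -1, T)
            d = self.diff_conv(d).reshape(B, N, -1, T).permute(0, 3, 1, 2).reshape(BT, N, -1)
            return x + self.up(h + g + d)                                # residual adapter output

    x = torch.randn(2 * 8, 50, 768)    # 2 clips x 8 frames, 50 ViT tokens each
    print(TEDAdapter()(x).shape)       # torch.Size([16, 50, 768])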

2023

  • J. Xiang, S. Li, J. Chen, S. Bai, Y. Ma, G. Dai, and Y. Liu, “SUBP: Soft Uniform Block Pruning for 1xN Sparse CNNs Multithreading Acceleration," in 37th Conference on Neural Information Processing Systems (NeurIPS), 2023, pp. 52033-52050.
    [BibTeX] [Abstract] [PDF]

    The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1×N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space saving by a Block Sparse Row matrix. 2) Excellent performance at a high sparsity. 3) Significant speedups on CPUs with Advanced Vector Extensions. Recent work requires selecting and fine-tuning 1×N sparse weights based on dense pre-trained weights, leading to the problems such as expensive training cost and memory access, sub-optimal model quality, as well as unbalanced workload across threads (different sparsity across output channels). To overcome them, this paper proposes a novel Soft Uniform Block Pruning (SUBP) approach to train a uniform 1×N sparse structured network from scratch. Specifically, our approach tends to repeatedly allow pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process. It not only makes the model less dependent on pre-training, reduces the model redundancy and the risk of pruning the important blocks permanently but also achieves balanced workload. Empirically, on ImageNet, comprehensive experiments across various CNN architectures show that our SUBP consistently outperforms existing 1×N and structured sparsity methods based on pre-trained models or training from scratch. Source codes and models are available at https://github.com/JingyangXiang/SUBP.

    @inproceedings{xiang2023subp,
    title = {SUBP: Soft Uniform Block Pruning for 1xN Sparse CNNs Multithreading Acceleration},
    author = {Jingyang Xiang and Siqi Li and Jun Chen and Shipeng Bai and Yukai Ma and Guang Dai and Yong Liu},
    year = 2023,
    booktitle = {37th Conference on Neural Information Processing Systems (NeurIPS)},
    pages = {52033-52050},
    abstract = {The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1×N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space saving by a Block Sparse Row matrix. 2) Excellent performance at a high sparsity. 3) Significant speedups on CPUs with Advanced Vector Extensions. Recent work requires selecting and fine-tuning 1×N sparse weights based on dense pre-trained weights, leading to the problems such as expensive training cost and memory access, sub-optimal model quality, as well as unbalanced workload across threads (different sparsity across output channels). To overcome them, this paper proposes a novel Soft Uniform Block Pruning (SUBP) approach to train a uniform 1×N sparse structured network from scratch. Specifically, our approach tends to repeatedly allow pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process. It not only makes the model less dependent on pre-training, reduces the model redundancy and the risk of pruning the important blocks permanently but also achieves balanced workload. Empirically, on ImageNet, comprehensive experiments across various CNN architectures show that our SUBP consistently outperforms existing 1×N and structured sparsity methods based on pre-trained models or training from scratch. Source codes and models are available at https://github.com/JingyangXiang/SUBP.}
    }
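
    To make the 1xN block structure above concrete, here is a simplified, hedged sketch: weights are grouped into blocks of N consecutive output channels per input position, scored by L1 norm, and masked to a target sparsity. Re-evaluating the mask during training lets pruned blocks regrow (the soft-pruning flavor); the paper's block angular redundancy and importance sampling are not reproduced here.

    import torch

    def one_by_n_block_mask(weight: torch.Tensor, n: int = 4, sparsity: float = 0.5) -> torch.Tensor:
        """weight: (C_out, C_in, kh, kw) with C_out divisible by n. Returns a {0,1} mask of the same shape."""
        c_out = weight.shape[0]
        w2d = weight.reshape(c_out, -1)                                  # (C_out, C_in*kh*kw)
        blocks = w2d.reshape(c_out // n, n, -1)                          # (C_out/n, n, cols)
        scores = blocks.abs().sum(dim=1)                                 # L1 norm per 1xN block
        k = int(scores.numel() * (1.0 - sparsity))                       # number of blocks to keep
        thresh = scores.flatten().topk(k).values.min()
        block_mask = (scores >= thresh).float()                          # (C_out/n, cols)
        mask = block_mask.unsqueeze(1).expand(-1, n, -1).reshape_as(w2d)
        return mask.reshape_as(weight)

    w = torch.randn(64, 32, 3, 3)
    m = one_by_n_block_mask(w, n=4, sparsity=0.5)
    print(m.mean().item())   # about 0.5 of the weights kept
    pruned_w = w * m         # soft mask; blocks may be re-selected at the next mask update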

  • J. Mei, Y. Yang, M. Wang, Z. Li, X. Hou, J. Ra, L. Li, and Y. Liu, “CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation," in 31st ACM International Conference on Multimedia (MM), 2023, pp. 1884-1894.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embedding and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.

    @inproceedings{mei2023lps,
    title = {CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Zizhang Li and Xiaojun Hou and Jongwon Ra and Laijian Li and Yong Liu},
    year = 2023,
    booktitle = {31st ACM International Conference on Multimedia (MM)},
    pages = {1884-1894},
    doi = {10.1145/3581783.3612080},
    abstract = {This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embedding and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.}
    }

  • Z. Li, X. Lyu, Y. Ding, M. Wang, Y. Liao, and Y. Liu, “RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction," in 19th IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17715-17725.
    [BibTeX] [Abstract] [DOI] [PDF]

    Recently, neural implicit surfaces have become popular for multi-view reconstruction. To facilitate practical applications like scene editing and manipulation, some works extend the framework with semantic masks input for the object-compositional reconstruction rather than the holistic perspective. Though achieving plausible disentanglement, the performance drops significantly when processing the indoor scenes where objects are usually partially observed. We propose RICO to address this by regularizing the unobservable regions for indoor compositional reconstruction. Our key idea is to first regularize the smoothness of the occluded background, which then in turn guides the foreground object reconstruction in unobservable regions based on the object-background relationship. Particularly, we regularize the geometry smoothness of occluded background patches. With the improved background surface, the signed distance function and the reversedly rendered depth of objects can be optimized to bound them within the background range. Extensive experiments show our method outperforms other methods on synthetic and real-world indoor scenes and prove the effectiveness of proposed regularizations. The code is available at https://github.com/kyleleey/RICO

    @inproceedings{li2023rico,
    title = {RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction},
    author = {Zizhang Li and Xiaoyang Lyu and Yuanyuan Ding and Mengmeng Wang and Yiyi Liao and Yong Liu},
    year = 2023,
    booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {17715-17725},
    doi = {10.1109/ICCV51070.2023.01628},
    abstract = {Recently, neural implicit surfaces have become popular for multi-view reconstruction. To facilitate practical applications like scene editing and manipulation, some works extend the framework with semantic masks input for the object-compositional reconstruction rather than the holistic perspective. Though achieving plausible disentanglement, the performance drops significantly when processing the indoor scenes where objects are usually partially observed. We propose RICO to address this by regularizing the unobservable regions for indoor compositional reconstruction. Our key idea is to first regularize the smoothness of the occluded background, which then in turn guides the foreground object reconstruction in unobservable regions based on the object-background relationship. Particularly, we regularize the geometry smoothness of occluded background patches. With the improved background surface, the signed distance function and the reversedly rendered depth of objects can be optimized to bound them within the background range. Extensive experiments show our method outperforms other methods on synthetic and real-world indoor scenes and prove the effectiveness of proposed regularizations. The code is available at https://github.com/kyleleey/RICO}
    }

  • X. Shen, J. Zhang, J. Chen, S. Bai, Y. Han, Y. Wang, C. Wang, and Y. Liu, “Learning Global-Aware Kernel for Image Harmonization," in 19th IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 7501-7510.
    [BibTeX] [Abstract] [DOI] [PDF]

    Image harmonization aims to solve the visual inconsistency problem in composited images by adaptively adjusting the foreground pixels with the background as references. Existing methods employ local color transformation or region matching between foreground and background, which neglects powerful proximity prior and independently distinguishes fore-/back-ground as a whole part for harmonization. As a result, they still show a limited performance across varied foreground objects and scenes. To address this issue, we propose a novel Global-aware Kernel Network (GKNet) to harmonize local regions with comprehensive consideration of long-distance background references. Specifically, GKNet includes two parts, i.e., harmony kernel prediction and harmony kernel modulation branches. The former includes a Long-distance Reference Extractor (LRE) to obtain long-distance context and Kernel Prediction Blocks (KPB) to predict multi-level harmony kernels by fusing global information with local features. To achieve this goal, a novel Selective Correlation Fusion (SCF) module is proposed to better select relevant long-distance background references for local harmonization. The latter employs the predicted kernels to harmonize foreground regions with local and global awareness. Abundant experiments demonstrate the superiority of our method for image harmonization over state-of-the-art methods, e.g., achieving 39.53dB PSNR that surpasses the best counterpart by +0.78dB ↑; decreasing fMSE/MSE by 11.5%↓/6.7%↓ compared with the SoTA method. Code will be available at here.

    @inproceedings{shen2023lga,
    title = {Learning Global-Aware Kernel for Image Harmonization},
    author = {Xintian Shen and Jiangning Zhang and Jun Chen and Shipeng Bai and Yue Han and Yabiao Wang and Chengjie Wang and Yong Liu},
    year = 2023,
    booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {7501-7510},
    doi = {10.1109/ICCV51070.2023.00693},
    abstract = {Image harmonization aims to solve the visual inconsistency problem in composited images by adaptively adjusting the foreground pixels with the background as references. Existing methods employ local color transformation or region matching between foreground and background, which neglects powerful proximity prior and independently distinguishes fore-/back-ground as a whole part for harmonization. As a result, they still show a limited performance across varied foreground objects and scenes. To address this issue, we propose a novel Global-aware Kernel Network (GKNet) to harmonize local regions with comprehensive consideration of long-distance background references. Specifically, GKNet includes two parts, i.e., harmony kernel prediction and harmony kernel modulation branches. The former includes a Long-distance Reference Extractor (LRE) to obtain long-distance context and Kernel Prediction Blocks (KPB) to predict multi-level harmony kernels by fusing global information with local features. To achieve this goal, a novel Selective Correlation Fusion (SCF) module is proposed to better select relevant long-distance background references for local harmonization. The latter employs the predicted kernels to harmonize foreground regions with local and global awareness. Abundant experiments demonstrate the superiority of our method for image harmonization over state-of-the-art methods, e.g., achieving 39.53dB PSNR that surpasses the best counterpart by +0.78dB ↑; decreasing fMSE/MSE by 11.5%↓/6.7%↓ compared with the SoTA method. Code will be available at here.}
    }

  • J. Xing, M. Wang, Y. Ruan, B. Chen, Y. Guo, B. Mu, G. Dai, J. Wang, and Y. Liu, “Boosting Few-Shot Action Recognition with Graph-Guided Hybrid Matching," in 19th IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 1740-1750.
    [BibTeX] [Abstract] [DOI] [PDF]

    Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM.

    @inproceedings{xing2023bfs,
    title = {Boosting Few-Shot Action Recognition with Graph-Guided Hybrid Matching},
    author = {Jiazheng Xing and Mengmeng Wang and Yudi Ruan and Bofan Chen and Yaowei Guo and Boyu Mu and Guang Dai and Jingdong Wang and Yong Liu},
    year = 2023,
    booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {1740-1750},
    doi = {10.1109/ICCV51070.2023.00167},
    abstract = {Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM.}
    }

  • S. Bai, J. Chen, X. Shen, Y. Qian, and Y. Liu, “Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning," in 19th IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5853-5862.
    [BibTeX] [Abstract] [DOI] [PDF]

    Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only brings heavy resource consumption but also is not possible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore, a few data-free methods are proposed to address this problem, but they perform data-free pruning and quantization separately, which does not explore the complementarity of pruning and quantization. In this paper, we propose a novel framework named Unified Data-Free Compression(UDFC), which performs pruning and quantization simultaneously without any data and fine-tuning process. Specifically, UDFC starts with the assumption that the partial information of a damaged(e.g., pruned or quantized) channel can be preserved by a linear combination of other channels, and then derives the reconstruction form from the assumption to restore the information loss due to compression. Finally, we formulate the reconstruction error between the original network and its compressed network, and theoretically deduce the closed-form solution. We evaluate the UDFC on the large-scale image classification task and obtain significant improvements over various network architectures and compression methods. For example, we achieve a 20.54% accuracy improvement on ImageNet dataset compared to SOTA method with 30% pruning ratio and 6-bit quantization on ResNet-34. Code will be available at here.

    @inproceedings{bai2023udf,
    title = {Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning},
    author = {Shipeng Bai and Jun Chen and Xintian Shen and Yixuan Qian and Yong Liu},
    year = 2023,
    booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {5853-5862},
    doi = {10.1109/ICCV51070.2023.00540},
    abstract = {Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only brings heavy resource consumption but also is not possible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore, a few data-free methods are proposed to address this problem, but they perform data-free pruning and quantization separately, which does not explore the complementarity of pruning and quantization. In this paper, we propose a novel framework named Unified Data-Free Compression(UDFC), which performs pruning and quantization simultaneously without any data and fine-tuning process. Specifically, UDFC starts with the assumption that the partial information of a damaged(e.g., pruned or quantized) channel can be preserved by a linear combination of other channels, and then derives the reconstruction form from the assumption to restore the information loss due to compression. Finally, we formulate the reconstruction error between the original network and its compressed network, and theoretically deduce the closed-form solution. We evaluate the UDFC on the large-scale image classification task and obtain significant improvements over various network architectures and compression methods. For example, we achieve a 20.54% accuracy improvement on ImageNet dataset compared to SOTA method with 30% pruning ratio and 6-bit quantization on ResNet-34. Code will be available at here.}
    }

  • T. Ma, M. Wang, J. Xiao, H. Wu, and Y. Liu, “Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking," in 19th IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9919-9929.
    [BibTeX] [Abstract] [DOI] [PDF]

    Siamese network has been a de facto benchmark framework for 3D LiDAR object tracking with a shared-parametric encoder extracting features from template and search region, respectively. This paradigm relies heavily on an additional matching network to model the cross-correlation/similarity of the template and search region. In this paper, we forsake the conventional Siamese paradigm and propose a novel single-branch framework, SyncTrack, synchronizing the feature extracting and matching to avoid forwarding encoder twice for template and search region as well as introducing extra parameters of matching network. The synchronization mechanism is based on the dynamic affinity of the Transformer, and an in-depth analysis of the relevance is provided theoretically. Moreover, based on the synchronization, we introduce a novel Attentive Points Sampling strategy into the Transformer layers (APST), replacing the random/Farthest Points Sampling (FPS) method with sampling under the supervision of attentive relations between the template and search region. It implies connecting point-wise sampling with the feature learning, beneficial to aggregating more distinctive and geometric features for tracking with sparse points. Extensive experiments on two benchmark datasets (KITTI and NuScenes) show that SyncTrack achieves state-of-the-art performance in realtime tracking.

    @inproceedings{ma2023sfe,
    title = {Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking},
    author = {Teli Ma and Mengmeng Wang and Jimin Xiao and Huifeng Wu and Yong Liu},
    year = 2023,
    booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {9919-9929},
    doi = {10.1109/ICCV51070.2023.00913},
    abstract = {Siamese network has been a de facto benchmark framework for 3D LiDAR object tracking with a shared-parametric encoder extracting features from template and search region, respectively. This paradigm relies heavily on an additional matching network to model the cross-correlation/similarity of the template and search region. In this paper, we forsake the conventional Siamese paradigm and propose a novel single-branch framework, SyncTrack, synchronizing the feature extracting and matching to avoid forwarding encoder twice for template and search region as well as introducing extra parameters of matching network. The synchronization mechanism is based on the dynamic affinity of the Transformer, and an in-depth analysis of the relevance is provided theoretically. Moreover, based on the synchronization, we introduce a novel Attentive Points Sampling strategy into the Transformer layers (APST), replacing the random/Farthest Points Sampling (FPS) method with sampling under the supervision of attentive relations between the template and search region. It implies connecting point-wise sampling with the feature learning, beneficial to aggregating more distinctive and geometric features for tracking with sparse points. Extensive experiments on two benchmark datasets (KITTI and NuScenes) show that SyncTrack achieves state-of-the-art performance in realtime tracking.}
    }
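
    A hedged illustration of attention-guided point sampling in place of FPS, which is my reading of the APST idea above rather than the SyncTrack implementation: search-region points are ranked by the total attention they receive from template points, and the top-k are kept.

    import torch

    def attentive_point_sampling(search_feats, template_feats, num_keep):
        """search_feats: (B, Ns, C), template_feats: (B, Nt, C); returns kept indices (B, num_keep)."""
        scale = search_feats.shape[-1] ** 0.5
        attn = torch.softmax(template_feats @ search_feats.transpose(1, 2) / scale, dim=-1)  # (B, Nt, Ns)
        relevance = attn.sum(dim=1)                     # total attention each search point receives
        return relevance.topk(num_keep, dim=1).indices  # (B, num_keep)

    search, template = torch.randn(2, 1024, 128), torch.randn(2, 256, 128)
    idx = attentive_point_sampling(search, template, num_keep=512)
    kept = torch.gather(search, 1, idx.unsqueeze(-1).expand(-1, -1, search.size(-1)))  # (2, 512, 128)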

  • H. Yang, P. Ge, J. Cao, Y. Yang, and Y. Liu, “Large Scale Pursuit-Evasion Under Collision Avoidance Using Deep Reinforcement Learning," in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 2232-2239.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper examines a pursuit-evasion game (PEG) involving multiple pursuers and evaders. The decentralized pursuers aim to collaborate to capture the faster evaders while avoiding collisions. The policies of all agents are learning-based and are subjected to kinematic constraints that are specific to unicycles. To address the challenge of high dimensionality encountered in large-scale scenarios, we propose a state processing method named Mix-Attention, which is based on Self-Attention. This method effectively mitigates the curse of dimensionality. The simulation results provided in this study demonstrate that the combination of Mix-Attention and Independent Proximal Policy Optimization (IPPO) surpasses alternative approaches when solving the multi-pursuer multi-evader PEG, particularly as the number of entities increases. Moreover, the trained policies showcase their ability to adapt to scenarios involving varying numbers of agents and obstacles without requiring retraining. This adaptability showcases their transferability and robustness. Finally, our proposed approach has been validated through physical experiments conducted with six robots.

    @inproceedings{yang2023lsp,
    title = {Large Scale Pursuit-Evasion Under Collision Avoidance Using Deep Reinforcement Learning},
    author = {Helei Yang and Peng Ge and Junjie Cao and Yifan Yang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {2232-2239},
    doi = {10.1109/IROS55552.2023.10341975},
    abstract = {This paper examines a pursuit-evasion game (PEG) involving multiple pursuers and evaders. The decentralized pursuers aim to collaborate to capture the faster evaders while avoiding collisions. The policies of all agents are learning-based and are subjected to kinematic constraints that are specific to unicycles. To address the challenge of high dimensionality encountered in large-scale scenarios, we propose a state processing method named Mix-Attention, which is based on Self-Attention. This method effectively mitigates the curse of dimensionality. The simulation results provided in this study demonstrate that the combination of Mix-Attention and Independent Proximal Policy Optimization (IPPO) surpasses alternative approaches when solving the multi-pursuer multi-evader PEG, particularly as the number of entities increases. Moreover, the trained policies showcase their ability to adapt to scenarios involving varying numbers of agents and obstacles without requiring retraining. This adaptability showcases their transferability and robustness. Finally, our proposed approach has been validated through physical experiments conducted with six robots.}
    }

  • C. Chen, H. Wu, Y. Ma, J. Lv, L. Li, and Y. Liu, “LiDAR-Inertial SLAM with Efficiently Extracted Planes," in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 1497-1504.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper proposes a LiDAR-Inertial SLAM with efficiently extracted planes, which couples explicit planes in the odometry to improve accuracy and in the mapping for consistency. The proposed method consists of three parts: an efficient Point→Line→Plane extraction algorithm, a LiDAR-Inertial-Plane tightly coupled odometry, and a global plane-aided mapping. Specifically, we leverage the ring field of the LiDAR point cloud to accelerate the region-growing-based plane extraction algorithm. Then we tightly coupled IMU pre-integration factors, LiDAR odometry factors, and explicit plane factors in the sliding window to obtain a more accurate initial pose for mapping. Finally, we maintain explicit planes in the global map, and enhance system consistency by optimizing the factor graph of optimized odometry factors and plane observation factors. Experimental results show that our plane extraction method is efficient, and the proposed plane-aided LiDAR-Inertial SLAM significantly improves the accuracy and consistency compared to the other state-of-the-art algorithms with only a small increase in time consumption.

    @inproceedings{chen2023lidar,
    title = {LiDAR-Inertial SLAM with Efficiently Extracted Planes},
    author = {Chao Chen and Hangyu Wu and Yukai Ma and Jiajun Lv and Laijian Li and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {1497-1504},
    doi = {10.1109/IROS55552.2023.10342325},
    abstract = {This paper proposes a LiDAR-Inertial SLAM with efficiently extracted planes, which couples explicit planes in the odometry to improve accuracy and in the mapping for consistency. The proposed method consists of three parts: an efficient Point→Line→Plane extraction algorithm, a LiDAR-Inertial-Plane tightly coupled odometry, and a global plane-aided mapping. Specifically, we leverage the ring field of the LiDAR point cloud to accelerate the region-growing-based plane extraction algorithm. Then we tightly coupled IMU pre-integration factors, LiDAR odometry factors, and explicit plane factors in the sliding window to obtain a more accurate initial pose for mapping. Finally, we maintain explicit planes in the global map, and enhance system consistency by optimizing the factor graph of optimized odometry factors and plane observation factors. Experimental results show that our plane extraction method is efficient, and the proposed plane-aided LiDAR-Inertial SLAM significantly improves the accuracy and consistency compared to the other state-of-the-art algorithms with only a small increase in time consumption.}
    }

  • J. Zhu, L. Liu, B. Jiang, F. Wen, H. Zhang, W. Li, and Y. Liu, “Self-Supervised Event-Based Monocular Depth Estimation Using Cross-Modal Consistency," in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 7704-7710.
    [BibTeX] [Abstract] [DOI] [PDF]

    An event camera is a novel vision sensor that can capture per-pixel brightness changes and output a stream of asynchronous “events”. It has advantages over conventional cameras in those scenes with high-speed motions and challenging lighting conditions because of the high temporal resolution, high dynamic range, low bandwidth, low power consumption, and no motion blur. Therefore, several supervised monocular depth estimation methods from events have been proposed to address scenes difficult for conventional cameras. However, depth annotation is costly and time-consuming. In this paper, to lower the annotation cost, we propose a self-supervised event-based monocular depth estimation framework named EMoDepth. EMoDepth constrains the training process using the cross-modal consistency from intensity frames that are aligned with events in the pixel coordinate. Moreover, in inference, only events are used for monocular depth prediction. Additionally, we design a multi-scale skip-connection architecture to effectively fuse features for depth estimation while maintaining high inference speed. Experiments on MVSEC and DSEC datasets demonstrate that our contributions are effective and that the accuracy can outperform existing supervised event-based and unsupervised frame-based methods.

    @inproceedings{zhu2023sse,
    title = {Self-Supervised Event-Based Monocular Depth Estimation Using Cross-Modal Consistency},
    author = {Junyu Zhu and Lina Liu and Bofeng Jiang and Feng Wen and Hongbo Zhang and Wanlong Li and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {7704-7710},
    doi = {10.1109/IROS55552.2023.10342434},
    abstract = {An event camera is a novel vision sensor that can capture per-pixel brightness changes and output a stream of asynchronous “events”. It has advantages over conventional cameras in those scenes with high-speed motions and challenging lighting conditions because of the high temporal resolution, high dynamic range, low bandwidth, low power consumption, and no motion blur. Therefore, several supervised monocular depth estimation methods from events have been proposed to address scenes difficult for conventional cameras. However, depth annotation is costly and time-consuming. In this paper, to lower the annotation cost, we propose a self-supervised event-based monocular depth estimation framework named EMoDepth. EMoDepth constrains the training process using the cross-modal consistency from intensity frames that are aligned with events in the pixel coordinate. Moreover, in inference, only events are used for monocular depth prediction. Additionally, we design a multi-scale skip-connection architecture to effectively fuse features for depth estimation while maintaining high inference speed. Experiments on MVSEC and DSEC datasets demonstrate that our contributions are effective and that the accuracy can outperform existing supervised event-based and unsupervised frame-based methods.}
    }
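
    The cross-modal consistency above is close in spirit to standard self-supervised view synthesis; the sketch below (an assumption, not the EMoDepth code) warps a neighbouring intensity frame into the current one using the depth predicted from events, a known relative pose T, and intrinsics K, and supervises depth with a photometric L1 loss.

    import torch
    import torch.nn.functional as F

    def inverse_warp(src_img, depth, T, K):
        """src_img: (B,3,H,W), depth: (B,1,H,W), T: (B,4,4) src<-tgt pose, K: (B,3,3)."""
        B, _, H, W = depth.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()          # (3,H,W) homogeneous pixels
        pix = pix.view(1, 3, -1).expand(B, -1, -1).to(depth.device)              # (B,3,HW)
        cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)                   # back-project with depth
        cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=depth.device)], 1)
        proj = K @ (T @ cam_h)[:, :3]                                            # project into source view
        uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
        u = 2 * uv[:, 0] / (W - 1) - 1                                           # normalize for grid_sample
        v = 2 * uv[:, 1] / (H - 1) - 1
        grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
        return F.grid_sample(src_img, grid, align_corners=True)

    def photometric_loss(tgt_img, src_img, depth, T, K):
        warped = inverse_warp(src_img, depth, T, K)
        return (warped - tgt_img).abs().mean()     # L1; an SSIM term is typically added as well

    B, H, W = 2, 64, 64
    loss = photometric_loss(torch.rand(B, 3, H, W), torch.rand(B, 3, H, W),
                            torch.rand(B, 1, H, W) + 0.5,
                            torch.eye(4).repeat(B, 1, 1),
                            torch.tensor([[100., 0, 32], [0, 100., 32], [0, 0, 1]]).repeat(B, 1, 1))
    print(loss.item())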

  • J. Mei, Y. Yang, M. Wang, T. Huang, X. Yang, and Y. Liu, “SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion," in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 7718-7725.
    [BibTeX] [Abstract] [DOI] [PDF]

    Semantic scene completion (SSC) jointly predicts the semantics and geometry of the entire 3D scene, which plays an essential role in 3D scene understanding for autonomous driving systems. SSC has achieved rapid progress with the help of semantic context in segmentation. However, how to effectively exploit the relationships between the semantic context in semantic segmentation and geometric structure in scene completion remains under exploration. In this paper, we propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. Specifically, we present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. And a BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is presented to aggregate the multi-scale features effectively and efficiently. Due to the low computational burden and powerful representation ability, our model has good generality while running in real-time. Extensive experiments on SemanticKITTI demonstrate our SSC-RS achieves state-of-the-art performance. Code is available at https://github.com/Jieqianyu/SSC-RS.git.

    @inproceedings{mei2023ssc,
    title = {SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Tianxin Huang and Xuemeng Yang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {7718-7725},
    doi = {10.1109/IROS55552.2023.10341742},
    abstract = {Semantic scene completion (SSC) jointly predicts the semantics and geometry of the entire 3D scene, which plays an essential role in 3D scene understanding for autonomous driving systems. SSC has achieved rapid progress with the help of semantic context in segmentation. However, how to effectively exploit the relationships between the semantic context in semantic segmentation and geometric structure in scene completion remains under exploration. In this paper, we propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. Specifically, we present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. And a BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is presented to aggregate the multi-scale features effectively and efficiently. Due to the low computational burden and powerful representation ability, our model has good generality while running in real-time. Extensive experiments on SemanticKITTI demonstrate our SSC-RS achieves state-of-the-art performance. Code is available at https://github.com/Jieqianyu/SSC-RS.git.}
    }
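
    As a rough picture of adaptive BEV fusion in the spirit of the ARF module mentioned above (a simple gating design assumed for illustration, not the released code): per-pixel weights decide how much of the semantic and geometric BEV maps enters the fused feature.

    import torch
    import torch.nn as nn

    class AdaptiveBEVFusion(nn.Module):
        def __init__(self, channels: int = 128):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels // 4, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // 4, 2, kernel_size=1))
            self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, sem_bev: torch.Tensor, geo_bev: torch.Tensor) -> torch.Tensor:
            # sem_bev, geo_bev: (B, C, H, W) BEV maps from the semantic and completion branches.
            w = torch.softmax(self.gate(torch.cat([sem_bev, geo_bev], dim=1)), dim=1)  # (B, 2, H, W)
            fused = w[:, :1] * sem_bev + w[:, 1:] * geo_bev                            # per-pixel weighting
            return self.out(fused)

    print(AdaptiveBEVFusion()(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64)).shape)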

  • J. Mei, Y. Yang, M. Wang, X. Hou, L. Li, and Y. Liu, “PANet: LiDAR Panoptic Segmentation with Sparse Instance Proposal and Aggregation," in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 7726-7733.
    [BibTeX] [Abstract] [DOI] [PDF]

    Reliable LiDAR panoptic segmentation (LPS), including both semantic and instance segmentation, is vital for many robotic applications, such as autonomous driving. This work proposes a new LPS framework named PANet to eliminate the dependency on the offset branch and improve the performance on large objects, which are always over-segmented by clustering algorithms. Firstly, we propose a non-learning Sparse Instance Proposal (SIP) module with the “sampling-shifting-grouping” scheme to directly group thing points into instances from the raw point cloud efficiently. More specifically, balanced point sampling is introduced to generate sparse seed points with more uniform point distribution over the distance range. And a shift module, termed bubble shifting, is proposed to shrink the seed points to the clustered centers. Then we utilize the connected component label algorithm to generate instance proposals. Furthermore, an instance aggregation module is devised to integrate potentially fragmented instances, improving the performance of the SIP module on large objects. Extensive experiments show that PANet achieves state-of-the-art performance among published works on the SemanticKITTI validation and nuScenes validation for the panoptic segmentation task. Code is available at https://github.com/Jieqianyu/PANet.git.

    @inproceedings{mei2023pan,
    title = {PANet: LiDAR Panoptic Segmentation with Sparse Instance Proposal and Aggregation},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Xiaojun Hou and Laijian Li and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {7726-7733},
    doi = {10.1109/IROS55552.2023.10342468},
    abstract = {Reliable LiDAR panoptic segmentation (LPS), including both semantic and instance segmentation, is vital for many robotic applications, such as autonomous driving. This work proposes a new LPS framework named PANet to eliminate the dependency on the offset branch and improve the performance on large objects, which are always over-segmented by clustering algorithms. Firstly, we propose a non-learning Sparse Instance Proposal (SIP) module with the “sampling-shifting-grouping” scheme to directly group thing points into instances from the raw point cloud efficiently. More specifically, balanced point sampling is introduced to generate sparse seed points with more uniform point distribution over the distance range. And a shift module, termed bubble shifting, is proposed to shrink the seed points to the clustered centers. Then we utilize the connected component label algorithm to generate instance proposals. Furthermore, an instance aggregation module is devised to integrate potentially fragmented instances, improving the performance of the SIP module on large objects. Extensive experiments show that PANet achieves state-of-the-art performance among published works on the SemanticKITTI validation and nuScenes validation for the panoptic segmentation task. Code is available at https://github.com/Jieqianyu/PANet.git.}
    }
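
    A hedged illustration of the non-learning “sampling-shifting-grouping” idea described in the PANet abstract above: sample sparse seeds, shift them toward cluster centers, then group them by connectivity. The sketch below uses random sampling, a mean-shift-style update, and radius-graph connected components; the function name, radii, and iteration counts are illustrative assumptions, not the authors' implementation.

      # Minimal sketch of a sampling-shifting-grouping instance proposal (assumptions, not PANet's code).
      import numpy as np
      from scipy.spatial import cKDTree
      from scipy.sparse import csr_matrix
      from scipy.sparse.csgraph import connected_components

      def propose_instances(thing_points, num_seeds=1024, shift_iters=5, radius=0.5, group_radius=0.8):
          """Group 'thing' points (N, 3) into instance proposals without learned offsets."""
          # 1) Sampling: pick sparse seed points (random here; the paper balances over distance).
          idx = np.random.choice(len(thing_points), size=min(num_seeds, len(thing_points)), replace=False)
          seeds = thing_points[idx]

          tree = cKDTree(thing_points)
          # 2) Shifting: iteratively move each seed to the mean of its neighborhood,
          #    a mean-shift-style stand-in for the paper's "bubble shifting".
          for _ in range(shift_iters):
              neighbors = tree.query_ball_point(seeds, r=radius)
              seeds = np.stack([thing_points[n].mean(axis=0) if n else s
                                for s, n in zip(seeds, neighbors)])

          # 3) Grouping: connect seeds closer than group_radius and take connected components.
          seed_tree = cKDTree(seeds)
          pairs = seed_tree.query_pairs(r=group_radius, output_type="ndarray")
          n = len(seeds)
          adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
          _, seed_labels = connected_components(adj, directed=False)

          # Assign every thing point to the instance of its nearest (shifted) seed.
          _, nearest_seed = seed_tree.query(thing_points)
          return seed_labels[nearest_seed]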

  • R. Cai, K. Zhang, and Y. Liu, “Industrial Fault Detection Based on Time-Frequency Distillation Autoencoder," in The 42nd Chinese Control Conference (CCC), 2023, pp. 5120-5125.
    [BibTeX] [Abstract] [DOI] [PDF]

    Data-driven feature extraction is a crucial research area in control loop performance assessment (CLPA). Deep learning is a widely used technique for building feature learning models based on neural networks (NNs). However, most NN-based CLPA methods require a large amount of labeled data and do not fully leverage the potential of frequency features. We propose a novel model called time-frequency distillation autoencoder (TFDAE) to address these limitations. The TFDAE consists of a frequency distillation encoder and a representation extraction decoder. The encoder leverages self-supervised contrastive learning to learn time features that guide the distillation of key frequency information. Additionally, a multi-kernel pooling block is incorporated in the encoder, enabling multi-scale information refinement for time feature extraction. The decoder uses the distilled information to extract informative representations and reconstruct the original input series. Taking valve stiction detection in CLPA as the evaluation task, we develop a stiction detection method based on TFDAE. Finally, we evaluate our model on the benchmark dataset: International Stiction Data Base (ISDB), and the experimental results show that TFDAE outperforms traditional knowledge-based and recent NN-based methods.

    @inproceedings{cai2023ifd,
    title = {Industrial Fault Detection Based on Time-Frequency Distillation Autoencoder},
    author = {Rongyao Cai and Kexin Zhang and Yong Liu},
    year = 2023,
    booktitle = {The 42nd Chinese Control Conference (CCC)},
    pages = {5120-5125},
    doi = {10.23919/CCC58697.2023.10239980},
    abstract = {Data-driven feature extraction is a crucial research area in control loop performance assessment (CLPA). Deep learning is a widely used technique for building feature learning models based on neural networks (NNs). However, most NN-based CLPA methods require a large amount of labeled data and do not fully leverage the potential of frequency features. We propose a novel model called time-frequency distillation autoencoder (TFDAE) to address these limitations. The TFDAE consists of a frequency distillation encoder and a representation extraction decoder. The encoder leverages self-supervised contrastive learning to learn time features that guide the distillation of key frequency information. Additionally, a multi-kernel pooling block is incorporated in the encoder, enabling multi-scale information refinement for time feature extraction. The decoder uses the distilled information to extract informative representations and reconstruct the original input series. Taking valve stiction detection in CLPA as the evaluation task, we develop a stiction detection method based on TFDAE. Finally, we evaluate our model on the benchmark dataset: International Stiction Data Base (ISDB), and the experimental results show that TFDAE outperforms traditional knowledge-based and recent NN-based methods.}
    }

  • K. Zhang, R. Cai, and Y. Liu, “Industrial Fault Detection using Contrastive Representation Learning on Time-series Data," in The 22nd World Congress of the International Federation of Automatic Control (IFAC), 2023, pp. 3197-3202.
    [BibTeX] [Abstract] [DOI] [PDF]

    Deep learning (DL) has been known as one of the effective techniques for building data-driven fault detection methods. The successful DL-based methods require the condition that massive labeled data are available, but this is sometimes an inevitable obstacle in real industrial environments. As one of the solutions, autoencoders (AEs) are widely adopted since AEs can extract features from unlabeled data. However, some challenges in AE-based fault detection methods remain, such as the design of encoder architecture, the computational cost, and the usage of the limited labeled data. This paper proposes a new industrial fault detection method through learning instance-level representation of time-series based on the self-supervised contrastive learning framework (SSCL). The proposed method uses dilated-causal-convolution-based encoder-only architecture to extract the information from industrial time-series data. A new data augmentation method for time-series data is proposed based on the temporal distance distribution, which is used to construct positive pairs in SSCL. Moreover, the encoder is alternately trained by the new weighted contrastive loss and the traditional classification loss. Finally, the experiments are conducted on the industrial data set and a semi-physical system, showing the effectiveness of the proposed method.

    @inproceedings{zhang2023ifd,
    title = {Industrial Fault Detection using Contrastive Representation Learning on Time-series Data},
    author = {Kexin Zhang and Rongyao Cai and Yong Liu},
    year = 2023,
    booktitle = {The 22nd World Congress of the International Federation of Automatic Control (IFAC)},
    pages = {3197-3202},
    doi = {10.1016/j.ifacol.2023.10.1456},
    abstract = {Deep learning (DL) has been known as one of the effective techniques for building data-driven fault detection methods. The successful DL-based methods require the condition that massive labeled data are available, but this is sometimes an inevitable obstacle in real industrial environments. As one of the solutions, autoencoders (AEs) are widely adopted since AEs can extract features from unlabeled data. However, some challenges in AE-based fault detection methods remain, such as the design of encoder architecture, the computational cost, and the usage of the limited labeled data. This paper proposes a new industrial fault detection method through learning instance-level representation of time-series based on the self-supervised contrastive learning framework (SSCL). The proposed method uses dilated-causal-convolution-based encoder-only architecture to extract the information from industrial time-series data. A new data augmentation method for time-series data is proposed based on the temporal distance distribution, which is used to construct positive pairs in SSCL. Moreover, the encoder is alternately trained by the new weighted contrastive loss and the traditional classification loss. Finally, the experiments are conducted on the industrial data set and a semi-physical system, showing the effectiveness of the proposed method.}
    }

  • T. Huang, Z. Xue, Z. Chen, and Y. Liu, “Efficient Trajectory Planning and Control for USV with Vessel Dynamics and Differential Flatness," in 2023 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 2023, pp. 1273-1280.
    [BibTeX] [Abstract] [DOI] [PDF]

    Unmanned surface vessels (USVs) are widely used in ocean exploration and environmental protection. To ensure that USV can successfully perform its mission, trajectory planning and motion tracking are the two most critical technologies. This paper proposes a novel trajectory generation and tracking method for USV based on optimization theory. Specifically, the USV dynamic model is combined with differential flatness, so that the trajectory can be generated by dynamic RRT* in a linear invariant system expression form under the objective of optimal boundary value. We adjust the trajectory through local optimization to reduce the number of samples and improve efficiency. The dynamic constraints are considered in the optimization process so that the generated trajectory conforms to the kinematic characteristics of the under-actuated hull, making tracking easier. Finally, motion tracking is added with model predictive control under a sequential quadratic programming problem. Simulated results show that the planned trajectory is more consistent with the kinematic characteristics of USV, and the tracking accuracy remains at a higher level.

    @inproceedings{huang2023etp,
    title = {Efficient Trajectory Planning and Control for USV with Vessel Dynamics and Differential Flatness},
    author = {Tao Huang and Zhenfeng Xue and Zhe Chen and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM)},
    pages = {1273-1280},
    doi = {10.1109/AIM46323.2023.10196111},
    abstract = {Unmanned surface vessels (USVs) are widely used in ocean exploration and environmental protection. To ensure that USV can successfully perform its mission, trajectory planning and motion tracking are the two most critical technologies. This paper proposes a novel trajectory generation and tracking method for USV based on optimization theory. Specifically, the USV dynamic model is combined with differential flatness, so that the trajectory can be generated by dynamic RRT* in a linear invariant system expression form under the objective of optimal boundary value. We adjust the trajectory through local optimization to reduce the number of samples and improve efficiency. The dynamic constraints are considered in the optimization process so that the generated trajectory conforms to the kinematic characteristics of the under-actuated hull, making tracking easier. Finally, motion tracking is added with model predictive control under a sequential quadratic programming problem. Simulated results show that the planned trajectory is more consistent with the kinematic characteristics of USV, and the tracking accuracy remains at a higher level.}
    }

  • C. Xu, J. Zhu, J. Zhang, Y. Han, W. Chu, Y. Tai, C. Wang, Z. Xie, and Y. Liu, “High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6609-6619.
    [BibTeX] [Abstract] [DOI] [PDF]

    Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A followed style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.

    @inproceedings{xu2023hfg,
    title = {High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning},
    author = {Chao Xu and Junwei Zhu and Jiangning Zhang and Yue Han and Wenqing Chu and Ying Tai and Chengjie Wang and Zhifeng Xie and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {6609-6619},
    doi = {10.1109/CVPR52729.2023.00639},
    abstract = {Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A followed style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.}
    }

  • X. Chen, J. Zhang, C. Xu, Y. Wang, C. Wang, and Y. Liu, “Better “CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 1651-1661.
    [BibTeX] [Abstract] [DOI] [PDF]

    Most of the existing blind image Super-Resolution (SR) methods assume that the blur kernels are space-invariant. However, the blur involved in real applications is usually space-variant due to object motion, out-of-focus, etc., resulting in severe performance drop of the advanced SR methods. To address this problem, we first introduce two new datasets with out-of-focus blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research on blind SR with space-variant blur. Based on the datasets, we design a novel Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics simultaneously, which leads to improved SR results. It involves a feature Grouping Interactive Attention (GIA) module to make the two modalities interact more effectively and avoid inconsistency. GIA can also be used for the interaction of other features because of the universality of its structure. Qualitative and quantitative experiments compared with state-of-the-art methods on the above datasets and real-world images demonstrate the superiority of our method, e.g., improving PSNR/SSIM by +1.91/+0.0048 on NYUv2-BSR over MANet. Code is available at https://github.com/ByChelsea/CMOS.git.

    @inproceedings{chen2023cmos,
    title = {Better "CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution},
    author = {Xuhai Chen and Jiangning Zhang and Chao Xu and Yabiao Wang and Chengjie Wang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {1651-1661},
    doi = {10.1109/CVPR52729.2023.00165},
    abstract = {Most of the existing blind image Super-Resolution (SR) methods assume that the blur kernels are space-invariant. However, the blur involved in real applications is usually space-variant due to object motion, out-of-focus, etc., resulting in severe performance drop of the advanced SR methods. To address this problem, we first introduce two new datasets with out-of-focus blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research on blind SR with space-variant blur. Based on the datasets, we design a novel Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics simultaneously, which leads to improved SR results. It involves a feature Grouping Interactive Attention (GIA) module to make the two modalities interact more effectively and avoid inconsistency. GIA can also be used for the interaction of other features because of the universality of its structure. Qualitative and quantitative experiments compared with state-of-the-art methods on the above datasets and real-world images demonstrate the superiority of our method, e.g., improving PSNR/SSIM by +1.91/+0.0048 on NYUv2-BSR over MANet. Code is available at https://github.com/ByChelsea/CMOS.git.}
    }

  • T. Huang, Z. Ding, J. Zhang, Y. Tai, Z. Zhang, M. Chen, C. Wang, and Y. Liu, “Learning to Measure the Point Cloud Reconstruction Loss in a Representation Space," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 12208-12217.
    [BibTeX] [Abstract] [DOI] [PDF]

    For point cloud reconstruction-related tasks, the reconstruction losses to evaluate the shape differences between reconstructed results and the ground truths are typically used to train the task networks. Most existing works measure the training loss with point-to-point distance, which may introduce extra defects as predefined matching rules may deviate from the real shape differences. Although some learning-based works have been proposed to overcome the weaknesses of manually-defined rules, they still measure the shape differences in 3D Euclidean space, which may limit their ability to capture defects in reconstructed shapes. In this work, we propose a learning-based Contrastive Adversarial Loss (CALoss) to measure the point cloud reconstruction loss dynamically in a non-linear representation space by combining the contrastive constraint with the adversarial strategy. Specifically, we use the contrastive constraint to help CALoss learn a representation space with shape similarity, while we introduce the adversarial strategy to help CALoss mine differences between reconstructed results and ground truths. According to experiments on reconstruction-related tasks, CALoss can help task networks improve reconstruction performances and learn more representative representations.

    @inproceedings{huang2023ltm,
    title = {Learning to Measure the Point Cloud Reconstruction Loss in a Representation Space},
    author = {Tianxin Huang and Zhonggan Ding and Jiangning Zhang and Ying Tai and Zhenyu Zhang and Mingang Chen and Chengjie Wang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {12208-12217},
    doi = {10.1109/CVPR52729.2023.01175},
    abstract = {For point cloud reconstruction-related tasks, the reconstruction losses to evaluate the shape differences between reconstructed results and the ground truths are typically used to train the task networks. Most existing works measure the training loss with point-to-point distance, which may introduce extra defects as predefined matching rules may deviate from the real shape differences. Although some learning-based works have been proposed to overcome the weaknesses of manually-defined rules, they still measure the shape differences in 3D Euclidean space, which may limit their ability to capture defects in reconstructed shapes. In this work, we propose a learning-based Contrastive Adversarial Loss (CALoss) to measure the point cloud reconstruction loss dynamically in a non-linear representation space by combining the contrastive constraint with the adversarial strategy. Specifically, we use the contrastive constraint to help CALoss learn a representation space with shape similarity, while we introduce the adversarial strategy to help CALoss mine differences between reconstructed results and ground truths. According to experiments on reconstruction-related tasks, CALoss can help task networks improve reconstruction performances and learn more representative representations.}
    }

  • M. Wang, T. Ma, X. Zuo, J. Lv, and Y. Liu, “Correlation Pyramid Network for 3D Single Object Tracking," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023, pp. 3216-3225.
    [BibTeX] [Abstract] [DOI] [PDF]

    3D LiDAR-based single object tracking (SOT) has gained increasing attention as it plays a crucial role in 3D applications such as autonomous driving. The central problem is how to learn a target-aware representation from the sparse and incomplete point clouds. In this paper, we propose a novel Correlation Pyramid Network (CorpNet) with a unified encoder and a motion-factorized decoder. Specifically, the encoder introduces multi-level self attentions and cross attentions in its main branch to enrich the template and search region features and realize their fusion and interaction, respectively. Additionally, considering the sparsity characteristics of the point clouds, we design a lateral correlation pyramid structure for the encoder to keep as many points as possible by integrating hierarchical correlated features. The output features of the search region from the encoder can be directly fed into the decoder for predicting target locations without any extra matcher. Moreover, in the decoder of CorpNet, we design a motion-factorized head to explicitly learn the different movement patterns of the up axis and the x-y plane together. Extensive experiments on two commonly-used datasets show our CorpNet achieves state-of-the-art results while running in real-time.

    @inproceedings{wang2023cpn,
    title = {Correlation Pyramid Network for 3D Single Object Tracking},
    author = {Mengmeng Wang and Teli Ma and Xingxing Zuo and Jiajun Lv and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
    pages = {3216-3225},
    doi = {10.1109/CVPRW59228.2023.00324},
    abstract = {3D LiDAR-based single object tracking (SOT) has gained increasing attention as it plays a crucial role in 3D applications such as autonomous driving. The central problem is how to learn a target-aware representation from the sparse and incomplete point clouds. In this paper, we propose a novel Correlation Pyramid Network (CorpNet) with a unified encoder and a motion-factorized decoder. Specifically, the encoder introduces multi-level self attentions and cross attentions in its main branch to enrich the template and search region features and realize their fusion and interaction, respectively. Additionally, considering the sparsity characteristics of the point clouds, we design a lateral correlation pyramid structure for the encoder to keep as many points as possible by integrating hierarchical correlated features. The output features of the search region from the encoder can be directly fed into the decoder for predicting target locations without any extra matcher. Moreover, in the decoder of CorpNet, we design a motion-factorized head to explicitly learn the different movement patterns of the up axis and the x-y plane together. Extensive experiments on two commonly-used datasets show our CorpNet achieves state-of-the-art results while running in real-time.}
    }

  • X. Chen, J. Jiang, J. Yang, and Y. Liu, “Deep Reinforcement Learning Based Lane-level Variable Speed Limit Control," in 9th International Conference on Control Science and Systems Engineering (ICCSSE), 2023, pp. 98-104.
    [BibTeX] [Abstract] [DOI] [PDF]

    Variable speed limit (VSL) is an effective traffic control method to alleviate congestion and increase safety. This paper incorporates deep reinforcement learning (DRL) into the VSL control strategy and proposes a twin delayed deep deterministic policy gradient (TD3)-based solution. We set different speed limits between every lane to control the speed of vehicles entering the highway merging area, thereby increasing the traffic flow and improving passing efficiency. The proposed model learns a large number of discrete actions within continuous actions through the actor-critic framework, using the reward signal based on the difference between inflow and outflow to train the agent. We selected real-world road segments and collected corresponding data to test the proposed method. The simulation results show that the VSL control based on TD3 can effectively reduce average travel time and increase the number of passing vehicles.

    @inproceedings{chen2023drl,
    title = {Deep Reinforcement Learning Based Lane-level Variable Speed Limit Control},
    author = {Xiyu Chen and Juntao Jiang and Jiandang Yang and Yong Liu},
    year = 2023,
    booktitle = {9th International Conference on Control Science and Systems Engineering (ICCSSE)},
    pages = {98-104},
    doi = {10.1109/ICCSSE59359.2023.10244962},
    abstract = {Variable speed limit (VSL) is an effective traffic control method to alleviate congestion and increase safety. This paper incorporates deep reinforcement learning (DRL) into the VSL control strategy and proposes a twin delayed deep deterministic policy gradient (TD3)-based solution. We set different speed limits between every lane to control the speed of vehicles entering the highway merging area, thereby increasing the traffic flow and improving passing efficiency. The proposed model learns a large number of discrete actions within continuous actions through the actor-critic framework, using the reward signal based on the difference between inflow and outflow to train the agent. We selected real-world road segments and collected corresponding data to test the proposed method. The simulation results show that the VSL control based on TD3 can effectively reduce average travel time and increase the number of passing vehicles.}
    }
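
    Two concrete pieces of the scheme summarized in the abstract above are the mapping from the TD3 actor's continuous output to discrete per-lane speed limits and the inflow/outflow-based reward. The snippet below is a minimal sketch of one such mapping and reward under assumed constants (candidate limit set, lane-wise actions in [-1, 1]); the names and values are hypothetical and not taken from the paper.

      # Illustrative mapping from continuous TD3 actions to discrete lane speed limits (assumed values).
      import numpy as np

      SPEED_LIMITS_KMH = np.array([60, 70, 80, 90, 100])  # assumed candidate limits per lane

      def action_to_lane_limits(action):
          """action: continuous vector in [-1, 1]^num_lanes from the TD3 actor."""
          # Scale each lane's action into an index over the discrete limit set.
          idx = np.clip(((action + 1.0) / 2.0) * len(SPEED_LIMITS_KMH),
                        0, len(SPEED_LIMITS_KMH) - 1).astype(int)
          return SPEED_LIMITS_KMH[idx]

      def flow_reward(inflow_veh, outflow_veh):
          """Reward based on the difference between inflow and outflow at the merge area."""
          return -(inflow_veh - outflow_veh)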

  • S. Liu, Y. Hu, R. Wu, D. Xing, Y. Xiong, C. Fan, K. Kuang, and Y. Liu, “Adaptive Value Decomposition with Greedy Marginal Contribution Computation for Cooperative Multi-Agent Reinforcement Learning," in International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2023, pp. 31-39.
    [BibTeX] [Abstract] [PDF]

    Real-world cooperation often requires intensive coordination among agents simultaneously. This task has been extensively studied within the framework of cooperative multi-agent reinforcement learning (MARL), and value decomposition methods are among those cutting-edge solutions. However, traditional methods that learn the value function as a monotonic mixing of per-agent utilities cannot solve the tasks with non-monotonic returns. This hinders their application in generic scenarios. Recent methods tackle this problem from the perspective of implicit credit assignment by learning value functions with complete expressiveness or using additional structures to improve cooperation. However, they are either difficult to learn due to large joint action spaces or insufficient to capture the complicated interactions among agents which are essential to solving tasks with non-monotonic returns. Moreover, applications in real-world scenarios usually require policies to be interpretable, but interpretability is limited in the implicit credit assignment methods. To address these problems, we propose a novel explicit credit assignment method to address the non-monotonic problem. Our method, Adaptive Value decomposition with Greedy Marginal contribution (AVGM), is based on an adaptive value decomposition that learns the cooperative value of a group of dynamically changing agents. We first illustrate that the proposed value decomposition can consider the complicated interactions among agents and is feasible to learn in large-scale scenarios. Then, our method uses a greedy marginal contribution computed from the value decomposition as an individual credit to incentivize agents to learn the optimal cooperative policy. We further extend the module with an action encoder to guarantee the linear time complexity for computing the greedy marginal contribution. Experimental results demonstrate that our method achieves significant performance improvements in several non-monotonic domains. Besides, we showcase that our model maintains a good sense of interpretability and rationality. This suggests our model can be applied to scenarios with more realistic demands.

    @inproceedings{liu2023avd,
    title = {Adaptive Value Decomposition with Greedy Marginal Contribution Computation for Cooperative Multi-Agent Reinforcement Learning},
    author = {Shanqi Liu and Yujing Hu and Runze Wu and Dong Xing and Yu Xiong and Changjie Fan and Kun Kuang and Yong Liu},
    year = 2023,
    booktitle = {International Conference on Autonomous Agents and Multiagent Systems (AAMAS)},
    pages = {31-39},
    abstract = {Real-world cooperation often requires intensive coordination among agents simultaneously. This task has been extensively studied within the framework of cooperative multi-agent reinforcement learning (MARL), and value decomposition methods are among those cutting-edge solutions. However, traditional methods that learn the value function as a monotonic mixing of per-agent utilities cannot solve the tasks with non-monotonic returns. This hinders their application in generic scenarios. Recent methods tackle this problem from the perspective of implicit credit assignment by learning value functions with complete expressiveness or using additional structures to improve cooperation. However, they are either difficult to learn due to large joint action spaces or insufficient to capture the complicated interactions among agents which are essential to solving tasks with non-monotonic returns. Moreover, applications in real-world scenarios usually require policies to be interpretable, but interpretability is limited in the implicit credit assignment methods. To address these problems, we propose a novel explicit credit assignment method to address the non-monotonic problem. Our method, Adaptive Value decomposition with Greedy Marginal contribution (AVGM), is based on an adaptive value decomposition that learns the cooperative value of a group of dynamically changing agents. We first illustrate that the proposed value decomposition can consider the complicated interactions among agents and is feasible to learn in large-scale scenarios. Then, our method uses a greedy marginal contribution computed from the value decomposition as an individual credit to incentivize agents to learn the optimal cooperative policy. We further extend the module with an action encoder to guarantee the linear time complexity for computing the greedy marginal contribution. Experimental results demonstrate that our method achieves significant performance improvements in several non-monotonic domains. Besides, we showcase that our model maintains a good sense of interpretability and rationality. This suggests our model can be applied to scenarios with more realistic demands.}
    }

  • L. Li, W. Ding, Y. Wen, Y. Liang, Y. Liu, and G. Wan, “A Unified BEV Model for Joint Learning 3D Local Features and Overlap Estimation," in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 8341-8348.
    [BibTeX] [Abstract] [DOI] [PDF]

    Pairwise point cloud registration is a critical task for many applications, which heavily depends on finding correct correspondences from the two point clouds. However, the low overlap between input point clouds causes the registration to fail easily, leading to mistaken overlapping and mismatched correspondences, especially in scenes where non-overlapping regions contain similar structures. In this paper, we present a unified bird’s-eye view (BEV) model for jointly learning of 3D local features and overlap estimation to fulfill pairwise registration and loop closure. Feature description is performed by a sparse UNet-like network based on BEV representation, and 3D keypoints are extracted by a detection head for 2D locations, and a regression head for heights. For overlap detection, a cross-attention module is applied for interacting contextual information of input point clouds, followed by a classification head to estimate the overlapping region. We evaluate our unified model extensively on the KITTI dataset and Apollo-SouthBay dataset. The experiments demonstrate that our method significantly outperforms existing methods on overlap estimation, especially in scenes with small overlaps. It also achieves top registration performance on both datasets in terms of translation and rotation errors.

    @inproceedings{li2023bev,
    title = {A Unified BEV Model for Joint Learning 3D Local Features and Overlap Estimation},
    author = {Lin Li and Wendong Ding and Yongkun Wen and Yufei Liang and Yong Liu and Guowei Wan},
    year = 2023,
    booktitle = {2023 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {8341-8348},
    doi = {10.1109/ICRA48891.2023.10160492},
    abstract = {Pairwise point cloud registration is a critical task for many applications, which heavily depends on finding correct correspondences from the two point clouds. However, the low overlap between input point clouds causes the registration to fail easily, leading to mistaken overlapping and mismatched correspondences, especially in scenes where non-overlapping regions contain similar structures. In this paper, we present a unified bird's-eye view (BEV) model for jointly learning of 3D local features and overlap estimation to fulfill pairwise registration and loop closure. Feature description is performed by a sparse UNet-like network based on BEV representation, and 3D keypoints are extracted by a detection head for 2D locations, and a regression head for heights. For overlap detection, a cross-attention module is applied for interacting contextual information of input point clouds, followed by a classification head to estimate the overlapping region. We evaluate our unified model extensively on the KITTI dataset and Apollo-SouthBay dataset. The experiments demonstrate that our method significantly outperforms existing methods on overlap estimation, especially in scenes with small overlaps. It also achieves top registration performance on both datasets in terms of translation and rotation errors.}
    }

  • G. Xu, D. Zhu, J. Cao, Y. Liu, and J. Yang, “Shunted Collision Avoidance for Multi-UAV Motion Planning with Posture Constraints," in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 3671-3678.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper investigates the problem of fixed-wing unmanned aerial vehicle (UAV) motion planning with posture constraints and the problem of the more general symmetrical situations where UAVs have more than one optimal solution. In this paper, the posture constraints are formulated in the 3D Dubins method, and the symmetrical situations are overcome by a more collaborative strategy called the shunted strategy. The effectiveness of the proposed method has been validated by conducting extensive simulation experiments. Meanwhile, we compared the proposed method with other state-of-the-art methods, and the comparison results show that the proposed method advances the previous works. Finally, the practicability of the proposed algorithm was analyzed with statistics on computational cost. The source code of our method is available at https://github.com/wuuya1/SCA.

    @inproceedings{xu2023sca,
    title = {Shunted Collision Avoidance for Multi-UAV Motion Planning with Posture Constraints},
    author = {Gang Xu and Deye Zhu and Junjie Cao and Yong Liu and Jian Yang},
    year = 2023,
    booktitle = {2023 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {3671-3678},
    doi = {10.1109/ICRA48891.2023.10160979},
    abstract = {This paper investigates the problem of fixed-wing unmanned aerial vehicle (UAV) motion planning with posture constraints and the problem of the more general symmetrical situations where UAVs have more than one optimal solution. In this paper, the posture constraints are formulated in the 3D Dubins method, and the symmetrical situations are overcome by a more collaborative strategy called the shunted strategy. The effectiveness of the proposed method has been validated by conducting extensive simulation experiments. Meanwhile, we compared the proposed method with other state-of-the-art methods, and the comparison results show that the proposed method advances the previous works. Finally, the practicability of the proposed algorithm was analyzed with statistics on computational cost. The source code of our method is available at https://github.com/wuuya1/SCA.}
    }

  • Y. Ma, X. Zhao, H. Li, Y. Gu, X. Lang, and Y. Liu, “RoLM: Radar on LiDAR Map Localization," in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 3976-3982.
    [BibTeX] [Abstract] [DOI] [PDF]

    Multi-sensor fusion-based localization technology has achieved high accuracy in autonomous systems. How to improve the robustness is the main challenge at present. The most commonly used LiDAR and camera are weather-sensitive, while the FMCW radar has strong adaptability but suffers from noise and ghost effects. In this paper, we propose a heterogeneous localization method of Radar on LiDAR Map (RoLM), which can eliminate the accumulated error of radar odometry in real-time to achieve higher localization accuracy without dependence on loop closures. We embed the two sensor modalities into a density map and calculate the spatial vector similarity with offset to seek the corresponding place index in the candidates and calculate the rotation and translation. We use the ICP to pursue perfect matching on the LiDAR submap based on the coarse alignment. Extensive experiments on Mulran Radar Dataset, Oxford Radar RobotCar Dataset, and our data verify the feasibility and effectiveness of our approach.

    @inproceedings{ma2023rol,
    title = {RoLM: Radar on LiDAR Map Localization},
    author = {Yukai Ma and Xiangrui Zhao and Han Li and Yaqing Gu and Xiaolei Lang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {3976-3982},
    doi = {10.1109/ICRA48891.2023.10161203},
    abstract = {Multi-sensor fusion-based localization technology has achieved high accuracy in autonomous systems. How to improve the robustness is the main challenge at present. The most commonly used LiDAR and camera are weather-sensitive, while the FMCW radar has strong adaptability but suffers from noise and ghost effects. In this paper, we propose a heterogeneous localization method of Radar on LiDAR Map (RoLM), which can eliminate the accumulated error of radar odometry in real-time to achieve higher localization accuracy without dependence on loop closures. We embed the two sensor modalities into a density map and calculate the spatial vector similarity with offset to seek the corresponding place index in the candidates and calculate the rotation and translation. We use the ICP to pursue perfect matching on the LiDAR submap based on the coarse alignment. Extensive experiments on Mulran Radar Dataset, Oxford Radar RobotCar Dataset, and our data verify the feasibility and effectiveness of our approach.}
    }

  • J. Zhu, L. Liu, Y. Liu, W. Li, F. Wen, and H. Zhang, “FG-Depth: Flow-Guided Unsupervised Monocular Depth Estimation," in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 4924-4930.
    [BibTeX] [Abstract] [DOI] [PDF]

    The great potential of unsupervised monocular depth estimation has been demonstrated by many works due to low annotation cost and impressive accuracy comparable to supervised methods. To further improve the performance, recent works mainly focus on designing more complex network structures and exploiting extra supervised information, e.g., semantic segmentation. These methods optimize the models by exploiting the reconstructed relationship between the target and reference images in varying degrees. However, previous methods prove that this image reconstruction optimization is prone to get trapped in local minima. In this paper, our core idea is to guide the optimization with prior knowledge from pretrained Flow-Net. And we show that the bottleneck of unsupervised monocular depth estimation can be broken with our simple but effective framework named FG-Depth. In particular, we propose (i) a flow distillation loss to replace the typical photometric loss that limits the capacity of the model and (ii) a prior flow based mask to remove invalid pixels that bring the noise in training loss. Extensive experiments demonstrate the effectiveness of each component, and our approach achieves state-of-the-art results on both KITTI and NYU-Depth-v2 datasets.

    @inproceedings{zhu2023fgd,
    title = {FG-Depth: Flow-Guided Unsupervised Monocular Depth Estimation},
    author = {Junyu Zhu and Lina Liu and Yong Liu and Wanlong Li and Feng Wen and Hongbo Zhang},
    year = 2023,
    booktitle = {2023 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {4924-4930},
    doi = {10.1109/ICRA48891.2023.10160534},
    abstract = {The great potential of unsupervised monocular depth estimation has been demonstrated by many works due to low annotation cost and impressive accuracy comparable to supervised methods. To further improve the performance, recent works mainly focus on designing more complex network structures and exploiting extra supervised information, e.g., semantic segmentation. These methods optimize the models by exploiting the reconstructed relationship between the target and reference images in varying degrees. However, previous methods prove that this image reconstruction optimization is prone to get trapped in local minima. In this paper, our core idea is to guide the optimization with prior knowledge from pretrained Flow-Net. And we show that the bottleneck of unsupervised monocular depth estimation can be broken with our simple but effective framework named FG-Depth. In particular, we propose (i) a flow distillation loss to replace the typical photometric loss that limits the capacity of the model and (ii) a prior flow based mask to remove invalid pixels that bring the noise in training loss. Extensive experiments demonstrate the effectiveness of each component, and our approach achieves state-of-the-art results on both KITTI and NYU-Depth-v2 datasets.}
    }
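
    The flow distillation idea in the FG-Depth abstract above supervises the rigid flow induced by predicted depth and pose with the output of a frozen pretrained flow network, masked to drop unreliable pixels. Below is a minimal PyTorch-style sketch of that idea; the tensor shapes, helper names, and the simple norm-threshold mask are assumptions for illustration and differ from the paper's prior-flow mask.

      # Hedged sketch of a flow-distillation loss (assumed shapes and masking rule, not the authors' code).
      import torch

      def rigid_flow_from_depth(depth, pose, K, K_inv):
          """depth: (B,1,H,W); pose: (B,4,4) target->reference; K, K_inv: (B,3,3)."""
          B, _, H, W = depth.shape
          ys, xs = torch.meshgrid(torch.arange(H, device=depth.device),
                                  torch.arange(W, device=depth.device), indexing="ij")
          pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # (3, H, W) homogeneous pixels
          pix = pix.view(1, 3, -1).expand(B, -1, -1)                        # (B, 3, H*W)

          cam = (K_inv @ pix) * depth.view(B, 1, -1)                        # back-project to camera space
          cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)      # (B, 4, H*W)
          cam_ref = (pose @ cam_h)[:, :3]                                   # move into the reference frame
          proj = K @ cam_ref
          proj = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                  # perspective projection

          return (proj - pix[:, :2]).reshape(B, 2, H, W)                    # rigid flow in pixels

      def flow_distillation_loss(depth, pose, K, K_inv, teacher_flow, mask_thresh=250.0):
          rigid = rigid_flow_from_depth(depth, pose, K, K_inv)
          # Drop pixels whose teacher flow is implausibly large (e.g. dynamic or occluded regions).
          valid = (teacher_flow.norm(dim=1, keepdim=True) < mask_thresh).float()
          return ((rigid - teacher_flow).abs() * valid).sum() / valid.sum().clamp(min=1.0)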

  • J. Jiang, X. Chen, G. Tian, and Y. Liu, “VIG-UNET: Vision Graph Neural Networks for Medical Image Segmentation," in IEEE 20th International Symposium on Biomedical Imaging (ISBI), 2023.
    [BibTeX] [Abstract] [DOI] [PDF]

    Deep neural networks have been widely used in medical image analysis and medical image segmentation is one of the most important tasks. U-shaped neural networks with encoder-decoder are prevailing and have succeeded greatly in various segmentation tasks. While CNNs treat an image as a grid of pixels in Euclidean space and Transformers recognize an image as a sequence of patches, graph-based representation is more generalized and can construct connections for each part of an image. In this paper, we propose a novel ViG-UNet, a graph neural network-based U-shaped architecture with the encoder, the decoder, the bottleneck, and skip connections. The downsampling and upsampling modules are also carefully designed. The experimental results on ISIC 2016, ISIC 2017 and Kvasir-SEG datasets demonstrate that our proposed architecture outperforms most existing classic and state-of-the-art U-shaped networks.

    @inproceedings{jiang2023vig,
    title = {VIG-UNET: Vision Graph Neural Networks for Medical Image Segmentation},
    author = {Juntao Jiang and Xiyu Chen and Guanzhong Tian and Yong Liu},
    year = 2023,
    booktitle = {IEEE 20th International Symposium on Biomedical Imaging (ISBI)},
    doi = {10.1109/ISBI53787.2023.10230496},
    abstract = {Deep neural networks have been widely used in medical image analysis and medical image segmentation is one of the most important tasks. U-shaped neural networks with encoder-decoder are prevailing and have succeeded greatly in various segmentation tasks. While CNNs treat an image as a grid of pixels in Euclidean space and Transformers recognize an image as a sequence of patches, graph-based representation is more generalized and can construct connections for each part of an image. In this paper, we propose a novel ViG-UNet, a graph neural network-based U-shaped architecture with the encoder, the decoder, the bottleneck, and skip connections. The downsampling and upsampling modules are also carefully designed. The experimental results on ISIC 2016, ISIC 2017 and Kvasir-SEG datasets demonstrate that our proposed architecture outperforms most existing classic and state-of-the-art U-shaped networks.}
    }

  • J. Xing, M. Wang, B. Mu, and Y. Liu, “Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition," in 37th AAAI Conference on Artificial Intelligence (AAAI), 2023, pp. 3001-3009.
    [BibTeX] [Abstract] [PDF]

    Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets.

    @inproceedings{xing2023rst,
    title = {Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition},
    author = {Jiazheng Xing and Mengmeng Wang and Boyu Mu and Yong Liu},
    year = 2023,
    booktitle = {37th AAAI Conference on Artificial Intelligence (AAAI)},
    pages = {3001-3009},
    abstract = {Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets.}
    }

2022

  • T. Huang, X. Yang, J. Zhang, J. Cui, H. Zou, J. Chen, X. Zhao, and Y. Liu, “Learning to Train a Point Cloud Reconstruction Network Without Matching," in European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]

    Reconstruction networks for well-ordered data such as 2D images and 1D continuous signals are easy to optimize through element-wised squared errors, while permutation-arbitrary point clouds cannot be constrained directly because their points permutations are not fixed. Though existing works design algorithms to match two point clouds and evaluate shape errors based on matched results, they are limited by pre-defined matching processes. In this work, we propose a novel framework named PCLossNet which learns to train a point cloud reconstruction network without any matching. By training through an adversarial process together with the reconstruction network, PCLossNet can better explore the differences between point clouds and create more precise reconstruction results. Experiments on multiple datasets prove the superiority of our method, where PCLossNet can help networks achieve much lower reconstruction errors and extract more representative features, with about 4 times faster training efficiency than the commonly-used EMD loss. Our codes can be found in https://github.com/Tianxinhuang/PCLossNet.

    @inproceedings{huang2022ltt,
    title = {Learning to Train a Point Cloud Reconstruction Network Without Matching},
    author = {Tianxin Huang and Xuemeng Yang and Jiangning Zhang and Jinhao Cui and Hao Zou and Jun Chen and Xiangrui Zhao and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-19769-7_11},
    abstract = {Reconstruction networks for well-ordered data such as 2D images and 1D continuous signals are easy to optimize through element-wised squared errors, while permutation-arbitrary point clouds cannot be constrained directly because their points permutations are not fixed. Though existing works design algorithms to match two point clouds and evaluate shape errors based on matched results, they are limited by pre-defined matching processes. In this work, we propose a novel framework named PCLossNet which learns to train a point cloud reconstruction network without any matching. By training through an adversarial process together with the reconstruction network, PCLossNet can better explore the differences between point clouds and create more precise reconstruction results. Experiments on multiple datasets prove the superiority of our method, where PCLossNet can help networks achieve much lower reconstruction errors and extract more representative features, with about 4 times faster training efficiency than the commonly-used EMD loss. Our codes can be found in https://github.com/Tianxinhuang/PCLossNet.}
    }
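
    For context on the matching-based losses that PCLossNet is trained to replace, the sketch below shows a standard Chamfer-style reconstruction loss, which evaluates shape error through an explicit nearest-neighbour matching between the two point sets. This is the generic family of point-to-point losses referenced in the abstract, not the paper's method; the function name is ours.

      # Generic Chamfer-distance reconstruction loss (baseline style, not PCLossNet).
      import torch

      def chamfer_distance(pred, gt):
          """pred: (B, N, 3) reconstructed points; gt: (B, M, 3) ground-truth points."""
          d = torch.cdist(pred, gt, p=2) ** 2                # (B, N, M) pairwise squared distances
          # Average nearest-neighbour distance in both directions.
          return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)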

  • T. Huang, J. Zhang, J. Chen, Y. Liu, and Y. Liu, “Resolution-free Point Cloud Sampling Network with Data Distillation," in European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]

    Down-sampling algorithms are adopted to simplify the point clouds and save the computation cost on subsequent tasks. Existing learning-based sampling methods often need to train a big sampling network to support sampling under different resolutions, which must generate sampled points with the costly maximum resolution even if only low-resolution points need to be sampled. In this work, we propose a novel resolution-free point clouds sampling network to directly sample the original point cloud to different resolutions, which is conducted by optimizing non-learning-based initial sampled points to better positions. Besides, we introduce data distillation to assist the training process by considering the differences between task network outputs from original point clouds and sampled points. Experiments on point cloud reconstruction and recognition tasks demonstrate that our method can achieve SOTA performances with lower time and memory cost than existing learning-based sampling strategies. Codes are available at https://github.com/Tianxinhuang/PCDNet.

    @inproceedings{huang2022rfp,
    title = {Resolution-free Point Cloud Sampling Network with Data Distillation},
    author = {Tianxin Huang and Jiangning Zhang and Jun Chen and Yuang Liu and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-20086-1_4},
    abstract = {Down-sampling algorithms are adopted to simplify the point clouds and save the computation cost on subsequent tasks. Existing learning-based sampling methods often need to train a big sampling network to support sampling under different resolutions, which must generate sampled points with the costly maximum resolution even if only low-resolution points need to be sampled. In this work, we propose a novel resolution-free point clouds sampling network to directly sample the original point cloud to different resolutions, which is conducted by optimizing non-learning-based initial sampled points to better positions. Besides, we introduce data distillation to assist the training process by considering the differences between task network outputs from original point clouds and sampled points. Experiments on point cloud reconstruction and recognition tasks demonstrate that our method can achieve SOTA performances with lower time and memory cost than existing learning-based sampling strategies. Codes are available at https://github.com/Tianxinhuang/PCDNet.}
    }

  • C. Xu, J. Zhang, Y. Han, and Y. Liu, “Designing One Unified framework for High-Fidelity Face Reenactment and Swapping," in European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]

    Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and practical-unfriendly. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute unsupervisedly. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We joint AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.

    @inproceedings{xu2022dou,
    title = {Designing One Unified framework for High-Fidelity Face Reenactment and Swapping},
    author = {Chao Xu and Jiangning Zhang and Yue Han and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-19784-0_4},
    abstract = {Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and practical-unfriendly. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute unsupervisedly. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We joint AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.}
    }

  • X. Zhao, S. Yang, T. Huang, J. Chen, T. Ma, M. Li, and Y. Liu, “SuperLine3D: Self-supervised 3D Line Segmentation and Description for LiDAR Point Cloud," in European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]

    Poles and building edges are frequently observable objects on urban roads, conveying reliable hints for various computer vision tasks. To repetitively extract them as features and perform association between discrete LiDAR frames for registration, we propose the first learning-based feature segmentation and description model for 3D lines in LiDAR point cloud. To train our model without the time-consuming and tedious data labeling process, we first generate synthetic primitives for the basic appearance of target lines, and build an iterative line auto-labeling process to gradually refine line labels on real LiDAR scans. Our segmentation model can extract lines under arbitrary scale perturbations, and we use shared EdgeConv encoder layers to train the two segmentation and descriptor heads jointly. Based on the model, we can build a highly-available global registration module for point cloud registration, in conditions without initial transformation hints. Experiments have demonstrated that our line-based registration method is highly competitive to state-of-the-art point-based approaches. Our code is available at https://github.com/zxrzju/SuperLine3D.git.

    @inproceedings{zhao2022sls,
    title = {SuperLine3D: Self-supervised 3D Line Segmentation and Description for LiDAR Point Cloud},
    author = {Xiangrui Zhao and Sheng Yang and Tianxin Huang and Jun Chen and Teng Ma and Mingyang Li and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-20077-9_16},
    abstract = {Poles and building edges are frequently observable objects on urban roads, conveying reliable hints for various computer vision tasks. To repetitively extract them as features and perform association between discrete LiDAR frames for registration, we propose the first learning-based feature segmentation and description model for 3D lines in LiDAR point cloud. To train our model without the time-consuming and tedious data labeling process, we first generate synthetic primitives for the basic appearance of target lines, and build an iterative line auto-labeling process to gradually refine line labels on real LiDAR scans. Our segmentation model can extract lines under arbitrary scale perturbations, and we use shared EdgeConv encoder layers to train the two segmentation and descriptor heads jointly. Based on the model, we can build a highly-available global registration module for point cloud registration, in conditions without initial transformation hints. Experiments have demonstrated that our line-based registration method is highly competitive to state-of-the-art point-based approaches. Our code is available at https://github.com/zxrzju/SuperLine3D.git.}
    }

  • Z. Li, M. Wang, H. Pi, K. Xu, J. Mei, and Y. Liu, “E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context," in European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]

    Recently, the image-wise implicit neural representation of videos, NeRV, has gained popularity for its promising results and swift speed compared to regular pixel-wise implicit representations. However, the redundant parameters within the network structure can cause a large model size when scaling up for desirable performance. The key reason of this phenomenon is the coupled formulation of NeRV, which outputs the spatial and temporal information of video frames directly from the frame index input. In this paper, we propose E-NeRV, which dramatically expedites NeRV by decomposing the image-wise implicit neural representation into separate spatial and temporal context. Under the guidance of this new formulation, our model greatly reduces the redundant model parameters, while retaining the representation ability. We experimentally find that our method can improve the performance to a large extent with fewer parameters, resulting in a more than 8× faster speed on convergence. Code is available at https://github.com/kyleleey/E-NeRV.

    @inproceedings{li2022ene,
    title = {E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context},
    author = {Zizhang Li and Mengmeng Wang and Huaijin Pi and Kechun Xu and Jianbiao Mei and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-19833-5_16},
    abstract = {Recently, the image-wise implicit neural representation of videos, NeRV, has gained popularity for its promising results and swift speed compared to regular pixel-wise implicit representations. However, the redundant parameters within the network structure can cause a large model size when scaling up for desirable performance. The key reason of this phenomenon is the coupled formulation of NeRV, which outputs the spatial and temporal information of video frames directly from the frame index input. In this paper, we propose E-NeRV, which dramatically expedites NeRV by decomposing the image-wise implicit neural representation into separate spatial and temporal context. Under the guidance of this new formulation, our model greatly reduces the redundant model parameters, while retaining the representation ability. We experimentally find that our method can improve the performance to a large extent with fewer parameters, resulting in a more than 8× faster speed on convergence. Code is available at https://github.com/kyleleey/E-NeRV.}
    }
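
    As a rough, hedged illustration of the disentangled spatial-temporal idea in the E-NeRV entry above: the frame index yields only a compact temporal embedding, while spatial context comes from a coordinate grid shared by all frames. Everything below (embedding sizes, the broadcast-sum fusion) is an assumption for illustration; the actual E-NeRV blocks are far more elaborate.

      import numpy as np

      def temporal_embedding(t_norm, dim=16):
          """Sinusoidal embedding of a normalized frame index t in [0, 1]."""
          freqs = 2.0 ** np.arange(dim // 2)
          return np.concatenate([np.sin(np.pi * freqs * t_norm),
                                 np.cos(np.pi * freqs * t_norm)])

      def spatial_grid(h, w, dim=16):
          """Fixed per-pixel positional features shared by every frame."""
          ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
          freqs = 2.0 ** np.arange(dim // 4)
          feats = [f(np.pi * q * g) for g in (ys, xs) for q in freqs for f in (np.sin, np.cos)]
          return np.stack(feats, axis=-1)            # (h, w, dim)

      grid = spatial_grid(36, 64)                    # computed once, reused for all frames
      frame_feat = grid + temporal_embedding(0.25)   # the frame index only modulates the shared
      print(frame_feat.shape)                        # spatial part: (36, 64, 16)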

  • J. Huang, L. Li, X. Zhao, X. Lang, D. Zhu, and Y. Liu, “LODM: Large-scale Online Dense Mapping for UAV," in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper proposes a method for online large-scale dense mapping. The UAV is within a range of 150-250 meters, combining GPS and visual odometry to estimate the scaled pose and sparse points. In order to use the depth of sparse points for the depth map, we propose Sparse Confidence Cascade View-Aggregation MVSNet (SCCVA-MVSNet), which projects the depth-converged points in the sliding window on keyframes to obtain a sparse depth map. The photometric error constructs sparse confidence. The coarse depth and confidence are obtained through normalized convolution; the images of all keyframes, the coarse depth, and the confidence are used as the input of CVA-MVSNet to extract features and construct 3D cost volumes with adaptive view aggregation to balance the different stereo baselines between the keyframes. Our proposed network utilizes sparse feature point information, so the output of the network better maintains the consistency of the scale. Our experiments show that MVSNet using sparse feature point information outperforms image-only MVSNet, and our online reconstruction results are comparable to offline reconstruction methods. To benefit the research community, we open-source our code at https://github.com/hjxwhy/LODM.git

    @inproceedings{huang2022lls,
    title = {LODM: Large-scale Online Dense Mapping for UAV},
    author = {Jianxin Huang and Laijian Li and Xiangrui Zhao and Xiaolei Lang and Deye Zhu and Yong Liu},
    year = 2022,
    booktitle = {2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    doi = {10.1109/IROS47612.2022.9981994},
    abstract = {This paper proposes a method for online large-scale dense mapping. The UAV is within a range of 150-250 meters, combining GPS and visual odometry to estimate the scaled pose and sparse points. In order to use the depth of sparse points for the depth map, we propose Sparse Confidence Cascade View-Aggregation MVSNet (SCCVA-MVSNet), which projects the depth-converged points in the sliding window on keyframes to obtain a sparse depth map. The photometric error constructs sparse confidence. The coarse depth and confidence are obtained through normalized convolution; the images of all keyframes, the coarse depth, and the confidence are used as the input of CVA-MVSNet to extract features and construct 3D cost volumes with adaptive view aggregation to balance the different stereo baselines between the keyframes. Our proposed network utilizes sparse feature point information, so the output of the network better maintains the consistency of the scale. Our experiments show that MVSNet using sparse feature point information outperforms image-only MVSNet, and our online reconstruction results are comparable to offline reconstruction methods. To benefit the research community, we open-source our code at https://github.com/hjxwhy/LODM.git}
    }
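
    A minimal sketch related to the LODM entry above: projecting depth-converged 3D points into a keyframe to form a sparse depth map, assuming a simple pinhole camera. The array shapes, intrinsics, and function name are illustrative assumptions, not LODM's actual interfaces.

      import numpy as np

      def sparse_depth_map(points_cam, K, height, width):
          """points_cam: (N, 3) points already in the keyframe camera frame.
          K: (3, 3) pinhole intrinsics. Returns an (H, W) map, 0 = no measurement."""
          depth = np.zeros((height, width), dtype=np.float32)
          z = points_cam[:, 2]
          valid = z > 1e-3                                # keep points in front of the camera
          uvw = (K @ points_cam[valid].T).T               # homogeneous pixel coordinates
          u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
          v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
          inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
          depth[v[inside], u[inside]] = z[valid][inside]  # later points simply overwrite earlier ones
          return depth

      K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
      pts = np.random.rand(1000, 3) * [4, 3, 10] + [-2, -1.5, 1]
      print(np.count_nonzero(sparse_depth_map(pts, K, 480, 640)))   # number of sparse depth pixels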

  • C. Xu, J. Zhang, M. Hua, Q. He, Z. Yi, and Y. Liu, “Region-Aware Face Swapping," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper presents a novel Region-Aware Face Swapping (RAFSwap) network to achieve identity-consistent harmonious high-resolution face generation in a local-global manner: 1) Local Facial Region-Aware (FRA) branch augments local identity-relevant features by introducing the Transformer to effectively model misaligned cross-scale semantic interaction. 2) Global Source Feature Adaptive (SFA) branch further complements global identity-relevant cues for generating identity-consistent swapped faces. Besides, we propose a Face Mask Predictor (FMP) module incorporated with StyleGAN2 to predict identity-relevant soft facial masks in an unsupervised manner that is more practical for generating harmonious high-resolution faces. Abundant experiments qualitatively and quantitatively demonstrate the superiority of our method for generating more identity-consistent high-resolution swapped faces over SOTA methods, e.g., obtaining 96.70 ID retrieval that outperforms SOTA MegaFS by 5.87↑.

    @inproceedings{xu2022raf,
    title = {Region-Aware Face Swapping},
    author = {Chao Xu and Jiangning Zhang and Miao Hua and Qian He and Zili Yi and Yong Liu},
    year = 2022,
    booktitle = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    doi = {10.48550/arXiv.2203.04564},
    abstract = {This paper presents a novel Region-Aware Face Swapping (RAFSwap) network to achieve identity-consistent harmonious high-resolution face generation in a local-global manner: 1) Local Facial Region-Aware (FRA) branch augments local identity-relevant features by introducing the Transformer to effectively model misaligned cross-scale semantic interaction. 2) Global Source Feature Adaptive (SFA) branch further complements global identity-relevant cues for generating identity-consistent swapped faces. Besides, we propose a Face Mask Predictor (FMP) module incorporated with StyleGAN2 to predict identity-relevant soft facial masks in an unsupervised manner that is more practical for generating harmonious high-resolution faces. Abundant experiments qualitatively and quantitatively demonstrate the superiority of our method for generating more identity-consistent high-resolution swapped faces over SOTA methods, e.g., obtaining 96.70 ID retrieval that outperforms SOTA MegaFS by 5.87↑.}
    }

  • J. Zhang, C. Xu, J. Li, Y. Han, Y. Wang, Y. Tai, and Y. Liu, “SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution," in Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]

    In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling operation for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneous Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes, which is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).

    @inproceedings{zhang2022scs,
    title = {SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution},
    author = {Jiangning Zhang and Chao Xu and Jian Li and Yue Han and Yabiao Wang and Ying Tai and Yong Liu},
    year = 2022,
    booktitle = {Proceedings of the 36th AAAI Conference on Artificial Intelligence},
    doi = {10.48550/arXiv.2201.04364},
    abstract = {In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling operation for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneous Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes, which is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).}
    }

2021

  • Z. Chen, T. Huang, Z. Xue, Z. Zhu, J. Xu, and Y. Liu, “A Novel Unmanned Surface Vehicle with 2D-3D Fused Perception and Obstacle Avoidance Module," in 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2021, pp. 1804-1809.
    [BibTeX] [Abstract] [DOI] [PDF]

    Unmanned surface vehicles (USVs) are important intelligent equipment that can accomplish various tasks in open marine areas. During operation, environmental perception and obstacle avoidance are of vital significance to its autonomy. In this paper, we propose a novel USV equipped with a fused perception and obstacle avoidance module that contains robust perception, localization and effective obstacle avoidance strategy. The new module is named Three-Dimensional Perception Module (PMTD), which utilizes camera and LiDAR to integrate multi-dimensional environmental information. It is able to detect, identify and track target objects in the process of autonomous travel. The localization achieves centimeter-level precision with GPS and IMU devices. Meanwhile, the obstacle avoidance strategy allows the USV to efficiently keep away from static and dynamic floating objects in water areas. Through real-world experiments, we show that with the help of the proposed module, the USV can complete stable and autonomous operation and obstacle avoidance path planning even without any manual intervention. This indicates the strong ability of the module in autonomous driving for USVs.

    @inproceedings{chen2021anu,
    title = {A Novel Unmanned Surface Vehicle with 2D-3D Fused Perception and Obstacle Avoidance Module},
    author = {Zhe Chen and Tao Huang and Zhenfeng Xue and Zongzhi Zhu and Jinhong Xu and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE International Conference on Robotics and Biomimetics (ROBIO)},
    pages = {1804-1809},
    doi = {10.1109/ROBIO54168.2021.9739449},
    abstract = {Unmanned surface vehicles (USVs) are important intelligent equipment that can accomplish various tasks in open marine areas. During operation, environmental perception and obstacle avoidance are of vital significance to its autonomy. In this paper, we propose a novel USV equipped with a fused perception and obstacle avoidance module that contains robust perception, localization and effective obstacle avoidance strategy. The new module is named Three-Dimensional Perception Module (PMTD), which utilizes camera and LiDAR to integrate multi-dimensional environmental information. It is able to detect, identify and track target objects in the process of autonomous travel. The localization achieves centimeter-level precision with GPS and IMU devices. Meanwhile, the obstacle avoidance strategy allows the USV to efficiently keep away from static and dynamic floating objects in water areas. Through real-world experiments, we show that with the help of the proposed module, the USV can complete stable and autonomous operation and obstacle avoidance path planning even without any manual intervention. This indicates the strong ability of the module in autonomous driving for USVs.}
    }

  • C. Tao, Z. Li, X. Zhu, G. Huang, Y. Liu, and J. Dai, “Searching Parameterized AP Loss for Object Detection," in Advances in Neural Information Processing Systems 34 – 35th Conference on Neural Information Processing Systems, 2021, pp. 22021-22033.
    [BibTeX] [Abstract] [PDF]

    Loss functions play an important role in training deep-network-based object detectors. The most widely used evaluation metric for object detection is Average Precision (AP), which captures the performance of localization and classification sub-tasks simultaneously. However, due to the non-differentiable nature of the AP metric, traditional object detectors adopt separate differentiable losses for the two sub-tasks. Such a misalignment issue may well lead to performance degradation. To address this, existing works seek to design surrogate losses for the AP metric manually, which requires expertise and may still be sub-optimal. In this paper, we propose Parameterized AP Loss, where parameterized functions are introduced to substitute the non-differentiable components in the AP calculation. Different AP approximations are thus represented by a family of parameterized functions in a unified formula. Automatic parameter search algorithm is then employed to search for the optimal parameters. Extensive experiments on the COCO benchmark with three different object detectors (i.e., RetinaNet, Faster R-CNN, and Deformable DETR) demonstrate that the proposed Parameterized AP Loss consistently outperforms existing handcrafted losses. Code shall be released.

    @inproceedings{li2021spa,
    title = {Searching Parameterized AP Loss for Object Detection},
    author = {Chenxin Tao and Zizhang Li and Xizhou Zhu and Gao Huang and Yong Liu and Jifeng Dai},
    year = 2021,
    booktitle = {Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems},
    pages = {22021-22033},
    abstract = {Loss functions play an important role in training deep-network-based object detectors. The most widely used evaluation metric for object detection is Average Precision (AP), which captures the performance of localization and classification sub-tasks simultaneously. However, due to the non-differentiable nature of the AP metric, traditional object detectors adopt separate differentiable losses for the two sub-tasks. Such a misalignment issue may well lead to performance degradation. To address this, existing works seek to design surrogate losses for the AP metric manually, which requires expertise and may still be sub-optimal. In this paper, we propose Parameterized AP Loss, where parameterized functions are introduced to substitute the non-differentiable components in the AP calculation. Different AP approximations are thus represented by a family of parameterized functions in a unified formula. Automatic parameter search algorithm is then employed to search for the optimal parameters. Extensive experiments on the COCO benchmark with three different object detectors (i.e., RetinaNet, Faster R-CNN, and Deformable DETR) demonstrate that the proposed Parameterized AP Loss consistently outperforms existing handcrafted losses. Code shall be released.}
    }
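
    For background on the entry above: the non-differentiable part of AP is the step function that decides whether one detection is ranked above another. The sketch below replaces that step with a plain temperature-scaled sigmoid, which is only one simple member of the kind of family the paper searches over; the temperature `tau` and the exact form are assumptions here, not the searched parameterization.

      import numpy as np

      def soft_ap(scores, labels, tau=0.05):
          """scores: (N,) detection confidences, labels: (N,) with 1 for true positives."""
          sig = lambda x: 1.0 / (1.0 + np.exp(-x / tau))   # smooth stand-in for the step function
          pos = labels == 1
          terms = []
          for i in np.flatnonzero(pos):
              above = sig(scores - scores[i])              # soft "ranked above detection i"
              above[i] = 0.0
              rank_all = 1.0 + above.sum()                 # soft rank among all detections
              rank_pos = 1.0 + above[pos].sum()            # soft rank among positives only
              terms.append(rank_pos / rank_all)            # soft precision at this positive
          return float(np.mean(terms)) if terms else 0.0

      scores = np.array([0.9, 0.8, 0.7, 0.4, 0.2])
      labels = np.array([1, 0, 1, 1, 0])
      print(1.0 - soft_ap(scores, labels))                 # the corresponding loss, roughly 1 - AP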

  • J. Zhang, C. Xu, J. Li, W. Chen, Y. Wang, Y. Tai, S. Chen, C. Wang, F. Huang, and Y. Liu, “Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model," in Advances in Neural Information Processing Systems 34 – 35th Conference on Neural Information Processing Systems, 2021, pp. 26674-26688.
    [BibTeX] [Abstract] [PDF]

    Inspired by biological evolution, we explain the rationality of Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA) and derive that both of them have consistent mathematical representation. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly. Moreover, we introduce the spatial-filling curve into the current vision transformer to sequence image data into a uniform sequential format. Thus we can design a unified EAT framework to address multi-modal tasks, separating the network architecture from the data format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works while having smaller parameters and greater throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., Text-Based Image Retrieval, and our approach improves the rank-1 by +3.7 points over the baseline on the CSS dataset.

    @inproceedings{zhang2021analogous,
    title = {Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model},
    author = {Jiangning Zhang and Chao Xu and Jian Li and Wenzhou Chen and Yabiao Wang and Ying Tai and Shuo Chen and Chengjie Wang and Feiyue Huang and Yong Liu},
    year = 2021,
    booktitle = {Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems},
    pages = {26674-26688},
    abstract = {Inspired by biological evolution, we explain the rationality of Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA) and derive that both of them have consistent mathematical representation. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly. Moreover, we introduce the spatial-filling curve into the current vision transformer to sequence image data into a uniform sequential format. Thus we can design a unified EAT framework to address multi-modal tasks, separating the network architecture from the data format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works while having smaller parameters and greater throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., Text-Based Image Retrieval, and our approach improves the rank-1 by +3.7 points over the baseline on the CSS dataset.}
    }

  • W. Liu, S. Liu, J. Yang, and Y. Liu, “Learning Intra-group Cooperation in Multi-agent Systems," in 2021 27th International Conference on Mechatronics and Machine Vision in Practice, 2021, pp. 688-692.
    [BibTeX] [Abstract] [DOI] [PDF]

    Reinforcement learning is one of the algorithms used in multi-agent systems to promote agent cooperation. However, most current multi-agent reinforcement learning algorithms improve the communication capabilities of agents for cooperation, but the overall communication is costly and even harmful due to bandwidth limitations. In addition, decentralized execution cannot generate joint actions, which is not conducive to cooperation. Therefore, we propose the Hierarchical Group Cooperation Network (HGCN). The high-level strategy, Group Network (GroNet), learns to group all agents based on their state rather than their location. The low-level strategy, Group Cooperation Network (GCoNet), is a method of centralized training and centralized execution within a group, which effectively promotes agent collaboration. Finally, we validated our method in various experiments.

    @inproceedings{liu2021lig,
    title = {Learning Intra-group Cooperation in Multi-agent Systems},
    author = {Weiwei Liu and Shanqi Liu and Jian Yang and Yong Liu},
    year = 2021,
    booktitle = {2021 27th International Conference on Mechatronics and Machine Vision in Practice},
    pages = {688-692},
    doi = {10.1109/M2VIP49856.2021.9665049},
    abstract = {Reinforcement learning is one of the algorithms used in multi-agent systems to promote agent cooperation. However, most current multi-agent reinforcement learning algorithms improve the communication capabilities of agents for cooperation, but the overall communication is costly and even harmful due to bandwidth limitations. In addition, decentralized execution cannot generate joint actions, which is not conducive to cooperation. Therefore, we propose the Hierarchical Group Cooperation Network (HGCN). The high-level strategy, Group Network (GroNet), learns to group all agents based on their state rather than their location. The low-level strategy, Group Cooperation Network (GCoNet), is a method of centralized training and centralized execution within a group, which effectively promotes agent collaboration. Finally, we validated our method in various experiments.}
    }

  • T. Huang, H. Zou, J. Cui, X. Yang, M. Wang, X. Zhao, J. Zhang, Y. Yuan, Y. Xu, and Y. Liu, “RFNet: Recurrent Forward Network for Dense Point Cloud Completion," in 2021 International Conference on Computer Vision, 2021, pp. 12488-12497.
    [BibTeX] [Abstract] [DOI] [PDF]

    Point cloud completion is an interesting and challenging task in 3D vision, aiming to recover complete shapes from sparse and incomplete point clouds. Existing learning based methods often require vast computation cost to achieve excellent performance, which limits their practical applications. In this paper, we propose a novel Recurrent Forward Network (RFNet), which is composed of three modules: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). The RFE extracts multiple global features from the incomplete point clouds for different recurrent levels, and the FDC generates point clouds in a coarse-to-fine pipeline. The RSP introduces details from the original incomplete models to refine the completion results. Besides, we propose a Sampling Chamfer Distance to better capture the shapes of models and a new Balanced Expansion Constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve the state-of-the-art with lower memory cost and faster convergence.

    @inproceedings{huang2021rfnetrf,
    title = {RFNet: Recurrent Forward Network for Dense Point Cloud Completion},
    author = {Tianxin Huang and Hao Zou and Jinhao Cui and Xuemeng Yang and Mengmeng Wang and Xiangrui Zhao and Jiangning Zhang and Yi Yuan and Yifan Xu and Yong Liu},
    year = 2021,
    booktitle = {2021 International Conference on Computer Vision},
    pages = {12488-12497},
    doi = {10.1109/ICCV48922.2021.01228},
    abstract = {Point cloud completion is an interesting and challenging task in 3D vision, aiming to recover complete shapes from sparse and incomplete point clouds. Existing learning based methods often require vast computation cost to achieve excellent performance, which limits their practical applications. In this paper, we propose a novel Recurrent Forward Network (RFNet), which is composed of three modules: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). The RFE extracts multiple global features from the incomplete point clouds for different recurrent levels, and the FDC generates point clouds in a coarse-to-fine pipeline. The RSP introduces details from the original incomplete models to refine the completion results. Besides, we propose a Sampling Chamfer Distance to better capture the shapes of models and a new Balanced Expansion Constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve the state-of-the-art with lower memory cost and faster convergence.}
    }
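
    For reference alongside the RFNet entry above, the standard (squared) Chamfer distance between two point sets is shown below; it is the usual baseline that the paper's Sampling Chamfer Distance modifies, so this sketch is background only, not the paper's metric.

      import numpy as np

      def chamfer_distance(a, b):
          """a: (N, 3), b: (M, 3). Mean nearest-neighbour squared distance, both directions."""
          d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)   # (N, M) pairwise distances
          return d2.min(axis=1).mean() + d2.min(axis=0).mean()

      partial = np.random.rand(512, 3)
      complete = np.random.rand(2048, 3)
      print(chamfer_distance(partial, complete))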

  • L. Liu, X. Song, M. Wang, Y. Liu, and L. Zhang, “Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation," in 2021 International Conference on Computer Vision, 2021, pp. 12717-12726.
    [BibTeX] [Abstract] [DOI] [PDF]

    Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach. Code and data split are available at https://github.com/LINA-lln/ADDS-DepthNet.

    @inproceedings{liu2021selfsm,
    title = {Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation},
    author = {Lina Liu and Xibin Song and Mengmeng Wang and Yong Liu and Liangjun Zhang},
    year = 2021,
    booktitle = {2021 International Conference on Computer Vision},
    pages = {12717-12726},
    doi = {10.1109/ICCV48922.2021.01250},
    abstract = {Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach. Code and data split are available at https://github.com/LINA-lln/ADDS-DepthNet.}
    }
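
    A hedged sketch for the domain-separation entry above: one common way to keep a "private" sub-space complementary to an "invariant" one is an orthogonality penalty on their features. The exact loss used in ADDS-DepthNet may differ; the formulation below is only illustrative.

      import numpy as np

      def orthogonality_loss(private_feat, invariant_feat):
          """Both inputs: (B, D) feature batches. Penalizes correlation between the sub-spaces."""
          p = private_feat / (np.linalg.norm(private_feat, axis=1, keepdims=True) + 1e-8)
          s = invariant_feat / (np.linalg.norm(invariant_feat, axis=1, keepdims=True) + 1e-8)
          corr = p.T @ s                        # (D, D) cross-correlation between the two spaces
          return float(np.sum(corr ** 2))       # squared Frobenius norm, zero when orthogonal

      day_private = np.random.randn(8, 128)
      shared = np.random.randn(8, 128)
      print(orthogonality_loss(day_private, shared))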

  • J. Lv, K. Hu, J. Xu, Y. Liu, and X. Zuo, “CLINS: Continuous-Time Trajectory Estimation for LiDAR Inertial System," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 6657-6663.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we propose a highly accurate continuous-time trajectory estimation framework dedicated to SLAM (Simultaneous Localization and Mapping) applications, which enables fusing high-frequency and asynchronous sensor data effectively. We apply the proposed framework in a 3D LiDAR-inertial system for evaluations. The proposed method adopts a non-rigid registration method for continuous-time trajectory estimation and simultaneously removes the motion distortion in LiDAR scans. Additionally, we propose a two-state continuous-time trajectory correction method to effectively and efficiently tackle the computationally-intractable global optimization problem when loop closure happens. We examine the accuracy of the proposed approach on several publicly available datasets and the data we collected. The experimental results indicate that the proposed method outperforms the discrete-time methods regarding accuracy especially when aggressive motion occurs. Furthermore, we open source our code at https://github.com/APRIL-ZJU/clins to benefit the research community.

    @inproceedings{lv2021clins,
    title = {CLINS: Continuous-Time Trajectory Estimation for LiDAR Inertial System},
    author = {Jiajun Lv and Kewei Hu and Jinhong Xu and Yong Liu and Xingxing Zuo},
    year = 2021,
    booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems},
    pages = {6657-6663},
    doi = {10.1109/IROS51168.2021.9636676},
    abstract = {In this paper, we propose a highly accurate continuous-time trajectory estimation framework dedicated to SLAM (Simultaneous Localization and Mapping) applications, which enables fusing high-frequency and asynchronous sensor data effectively. We apply the proposed framework in a 3D LiDAR-inertial system for evaluations. The proposed method adopts a non-rigid registration method for continuous-time trajectory estimation and simultaneously removes the motion distortion in LiDAR scans. Additionally, we propose a two-state continuous-time trajectory correction method to effectively and efficiently tackle the computationally-intractable global optimization problem when loop closure happens. We examine the accuracy of the proposed approach on several publicly available datasets and the data we collected. The experimental results indicate that the proposed method outperforms the discrete-time methods regarding accuracy especially when aggressive motion occurs. Furthermore, we open source our code at https://github.com/APRIL-ZJU/clins to benefit the research community.}
    }
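
    To make the continuous-time idea in the CLINS entry above concrete: a trajectory parameterized by spline control points can be queried at any sensor timestamp. The sketch below evaluates a uniform cubic B-spline over position control points only; CLINS estimates full SE(3) trajectories with rotations, which this toy example omits.

      import numpy as np

      def bspline_position(ctrl, t, dt, t0=0.0):
          """ctrl: (K, 3) position control points spaced dt seconds apart, starting at t0."""
          s = (t - t0) / dt
          i = int(np.clip(np.floor(s), 1, len(ctrl) - 3))   # index of the active spline segment
          u = s - i
          basis = np.array([(1 - u) ** 3,
                            3 * u ** 3 - 6 * u ** 2 + 4,
                            -3 * u ** 3 + 3 * u ** 2 + 3 * u + 1,
                            u ** 3]) / 6.0
          return basis @ ctrl[i - 1:i + 3]                  # blend four neighbouring control points

      ctrl = np.cumsum(np.random.randn(20, 3) * 0.1, axis=0)   # a smooth-ish random path
      print(bspline_position(ctrl, t=0.37, dt=0.1))            # position at an arbitrary timestamp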

  • S. Liu, L. Wen, J. Cui, X. Yang, J. Cao, and Y. Liu, “Moving Forward in Formation: A Decentralized Hierarchical Learning Approach to Multi-Agent Moving Together," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 4777-4784.
    [BibTeX] [Abstract] [DOI] [PDF]

    Multi-agent path finding in formation has many potential real-world applications like mobile warehouse robotics. However, previous multi-agent path finding (MAPF) methods hardly take formation into consideration. Furthermore, they are usually centralized planners and require the whole state of the environment. Other decentralized partially observable approaches to MAPF are reinforcement learning (RL) methods. However, these RL methods encounter difficulties when learning path finding and formation problems at the same time. In this paper, we propose a novel decentralized partially observable RL algorithm that uses a hierarchical structure to decompose the multi-objective task into unrelated ones. It also calculates a theoretical weight that makes each task's reward have equal influence on the final RL value function. Additionally, we introduce a communication method that helps agents cooperate with each other. Experiments in simulation show that our method outperforms other end-to-end RL methods and our method can naturally scale to large world sizes where a centralized planner struggles. We also deploy and validate our method in a real-world scenario.

    @inproceedings{liu2021movingfi,
    title = {Moving Forward in Formation: A Decentralized Hierarchical Learning Approach to Multi-Agent Moving Together},
    author = {Shanqi Liu and Licheng Wen and Jinhao Cui and Xuemeng Yang and Junjie Cao and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems},
    pages = {4777-4784},
    doi = {10.1109/IROS51168.2021.9636224},
    abstract = {Multi-agent path finding in formation has many potential real-world applications like mobile warehouse robotics. However, previous multi-agent path finding (MAPF) methods hardly take formation into consideration. Furthermore, they are usually centralized planners and require the whole state of the environment. Other decentralized partially observable approaches to MAPF are reinforcement learning (RL) methods. However, these RL methods encounter difficulties when learning path finding and formation problems at the same time. In this paper, we propose a novel decentralized partially observable RL algorithm that uses a hierarchical structure to decompose the multi-objective task into unrelated ones. It also calculates a theoretical weight that makes each task's reward have equal influence on the final RL value function. Additionally, we introduce a communication method that helps agents cooperate with each other. Experiments in simulation show that our method outperforms other end-to-end RL methods and our method can naturally scale to large world sizes where a centralized planner struggles. We also deploy and validate our method in a real-world scenario.}
    }

  • H. Zou, X. Yang, T. Huang, C. Zhang, Y. Liu, W. Li, F. Wen, and H. Zhang, “Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 16-23.
    [BibTeX] [Abstract] [DOI] [PDF]

    An efficient 3D scene perception algorithm is a vital component for autonomous driving and robotics systems. In this paper, we focus on semantic scene completion, which is a task of jointly estimating the volumetric occupancy and semantic labels of objects. Since the real-world data is sparse and occluded, this is an extremely challenging task. We propose a novel framework, named Up-to-Down network (UDNet), to achieve the large-scale semantic scene completion with an encoder-decoder architecture for voxel grids. The novel up-to-down block can effectively aggregate multi-scale context information to improve labeling coherence, and the atrous spatial pyramid pooling module is leveraged to expand the receptive field while preserving detailed geometric information. Besides, the proposed multi-scale fusion mechanism efficiently aggregates global background information and improves the semantic completion accuracy. Moreover, to further satisfy the needs of different tasks, our UDNet can accomplish the multi-resolution semantic completion, achieving faster but coarser completion. Detailed experiments in the semantic scene completion benchmark of SemanticKITTI illustrate that our proposed framework surpasses the state-of-the-art methods with remarkable margins and a real-time inference speed by using only voxel grids as input.

    @inproceedings{zou2021utd,
    title = {Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion},
    author = {Hao Zou and Xuemeng Yang and Tianxin Huang and Chujuan Zhang and Yong Liu and Wanlong Li and Feng Wen and Hongbo Zhang},
    year = 2021,
    booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems},
    pages = {16-23},
    doi = {10.1109/IROS51168.2021.9635888},
    abstract = {An efficient 3D scene perception algorithm is a vital component for autonomous driving and robotics systems. In this paper, we focus on semantic scene completion, which is a task of jointly estimating the volumetric occupancy and semantic labels of objects. Since the real-world data is sparse and occluded, this is an extremely challenging task. We propose a novel framework, named Up-to-Down network (UDNet), to achieve the large-scale semantic scene completion with an encoder-decoder architecture for voxel grids. The novel up-to-down block can effectively aggregate multi-scale context information to improve labeling coherence, and the atrous spatial pyramid pooling module is leveraged to expand the receptive field while preserving detailed geometric information. Besides, the proposed multi-scale fusion mechanism efficiently aggregates global background information and improves the semantic completion accuracy. Moreover, to further satisfy the needs of different tasks, our UDNet can accomplish the multi-resolution semantic completion, achieving faster but coarser completion. Detailed experiments in the semantic scene completion benchmark of SemanticKITTI illustrate that our proposed framework surpasses the state-of-the-art methods with remarkable margins and a real-time inference speed by using only voxel grids as input.}
    }

  • H. Zou, C. Zhang, Y. Liu, W. Li, F. Wen, and H. Zhang, “PointSiamRCNN: Target Aware Two-stage Siamese Tracker for Point Clouds," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 7029-7035.
    [BibTeX] [Abstract] [DOI] [PDF]

    Currently, there have been many kinds of point-based 3D trackers, while voxel-based methods are still underexplored. In this paper, we first propose a voxel-based tracker, named PointSiamRCNN, improving tracking performance by embedding target information into the search region. Our framework is composed of two parts for achieving proposal generation and proposal refinement, which fully releases the potential of the two-stage object tracking. Specifically, it takes advantage of efficient feature learning of the voxel-based Siamese network and high-quality proposal generation of the Siamese region proposal network head. In the search region, the ground-truth annotations are utilized to realize semantic segmentation, which leads to more discriminative feature learning with pointwise supervisions. Furthermore, we propose the Self and Cross Attention Module for embedding target information into the search region. Finally, the multi-scale RoI pooling module is proposed to obtain compact representations from target-aware features for proposal refinement. Exhaustive experiments on the KITTI tracking dataset demonstrate that our framework reaches competitive performance with the state-of-the-art 3D tracking methods and achieves the state-of-the-art in terms of BEV tracking.

    @inproceedings{zou2021pta,
    title = {PointSiamRCNN: Target Aware Two-stage Siamese Tracker for Point Clouds},
    author = {Hao Zou and Chujuan Zhang and Yong Liu and Wanlong Li and Feng Wen and Hongbo Zhang},
    year = 2021,
    booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems},
    pages = {7029-7035},
    doi = {10.1109/IROS51168.2021.9636863},
    abstract = {Currently, there have been many kinds of point-based 3D trackers, while voxel-based methods are still underexplored. In this paper, we first propose a voxel-based tracker, named PointSiamRCNN, improving tracking performance by embedding target information into the search region. Our framework is composed of two parts for achieving proposal generation and proposal refinement, which fully releases the potential of the two-stage object tracking. Specifically, it takes advantage of efficient feature learning of the voxel-based Siamese network and high-quality proposal generation of the Siamese region proposal network head. In the search region, the ground-truth annotations are utilized to realize semantic segmentation, which leads to more discriminative feature learning with pointwise supervisions. Furthermore, we propose the Self and Cross Attention Module for embedding target information into the search region. Finally, the multi-scale RoI pooling module is proposed to obtain compact representations from target-aware features for proposal refinement. Exhaustive experiments on the KITTI tracking dataset demonstrate that our framework reaches competitive performance with the state-of-the-art 3D tracking methods and achieves the state-of-the-art in terms of BEV tracking.}
    }

  • X. Yang, H. Zou, X. Kong, T. Huang, Y. Liu, W. Li, F. Wen, and H. Zhang, “Semantic Segmentation-assisted Scene Completion for LiDAR Point Clouds," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 3555-3562.
    [BibTeX] [Abstract] [DOI] [PDF]

    Outdoor scene completion is a challenging issue in 3D scene understanding, which plays an important role in intelligent robotics and autonomous driving. Due to the sparsity of LiDAR acquisition, it is far more complex for 3D scene completion and semantic segmentation. Since semantic features can provide constraints and semantic priors for completion tasks, the relationship between them is worth exploring. Therefore, we propose an end-to-end semantic segmentation-assisted scene completion network, including a 2D completion branch and a 3D semantic segmentation branch. Specifically, the network takes a raw point cloud as input, and merges the features from the segmentation branch into the completion branch hierarchically to provide semantic information. By adopting BEV representation and 3D sparse convolution, we can benefit from the lower operand while maintaining effective expression. Besides, the decoder of the segmentation branch is used as an auxiliary, which can be discarded in the inference stage to save computational consumption. Extensive experiments demonstrate that our method achieves competitive performance on SemanticKITTI dataset with low latency. Code and models will be released at https://github.com/jokester-zzz/SSA-SC.

    @inproceedings{yang2021ssa,
    title = {Semantic Segmentation-assisted Scene Completion for LiDAR Point Clouds},
    author = {Xuemeng Yang and Hao Zou and Xin Kong and Tianxin Huang and Yong Liu and Wanlong Li and Feng Wen and Hongbo Zhang},
    year = 2021,
    booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems},
    pages = {3555-3562},
    doi = {10.1109/IROS51168.2021.9636662},
    abstract = {Outdoor scene completion is a challenging issue in 3D scene understanding, which plays an important role in intelligent robotics and autonomous driving. Due to the sparsity of LiDAR acquisition, it is far more complex for 3D scene completion and semantic segmentation. Since semantic features can provide constraints and semantic priors for completion tasks, the relationship between them is worth exploring. Therefore, we propose an end-to-end semantic segmentation-assisted scene completion network, including a 2D completion branch and a 3D semantic segmentation branch. Specifically, the network takes a raw point cloud as input, and merges the features from the segmentation branch into the completion branch hierarchically to provide semantic information. By adopting BEV representation and 3D sparse convolution, we can benefit from the lower operand while maintaining effective expression. Besides, the decoder of the segmentation branch is used as an auxiliary, which can be discarded in the inference stage to save computational consumption. Extensive experiments demonstrate that our method achieves competitive performance on SemanticKITTI dataset with low latency. Code and models will be released at https://github.com/jokester-zzz/SSA-SC.}
    }

  • L. Li, X. Kong, X. Zhao, T. Huang, and Y. Liu, “SSC: Semantic Scan Context for Large-Scale Place Recognition," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 2092-2099.
    [BibTeX] [Abstract] [DOI] [PDF]

    Place recognition gives a SLAM system the ability to correct cumulative errors. Unlike images that contain rich texture features, point clouds are almost pure geometric information which makes place recognition based on point clouds challenging. Existing works usually encode low-level features such as coordinate, normal, reflection intensity, etc., as local or global descriptors to represent scenes. Besides, they often ignore the translation between point clouds when matching descriptors. Different from most existing methods, we explore the use of high-level features, namely semantics, to improve the descriptor’s representation ability. Also, when matching descriptors, we try to correct the translation between point clouds to improve accuracy. Concretely, we propose a novel global descriptor, Semantic Scan Context, which explores semantic information to represent scenes more effectively. We also present a two-step global semantic ICP to obtain the 3D pose (x, y, yaw) used to align the point cloud to improve matching performance. Our experiments on the KITTI dataset show that our approach outperforms the state-of-the-art methods with a large margin. Our code is available at: https://github.com/lilin-hitcrt/SSC.

    @inproceedings{li2021ssc,
    title = {SSC: Semantic Scan Context for Large-Scale Place Recognition},
    author = {Lin Li and Xin Kong and Xiangrui Zhao and Tianxin Huang and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems},
    pages = {2092-2099},
    doi = {10.1109/IROS51168.2021.9635904},
    abstract = {Place recognition gives a SLAM system the ability to correct cumulative errors. Unlike images that contain rich texture features, point clouds are almost pure geometric information which makes place recognition based on point clouds challenging. Existing works usually encode low-level features such as coordinate, normal, reflection intensity, etc., as local or global descriptors to represent scenes. Besides, they often ignore the translation between point clouds when matching descriptors. Different from most existing methods, we explore the use of high-level features, namely semantics, to improve the descriptor’s representation ability. Also, when matching descriptors, we try to correct the translation between point clouds to improve accuracy. Concretely, we propose a novel global descriptor, Semantic Scan Context, which explores semantic information to represent scenes more effectively. We also present a two-step global semantic ICP to obtain the 3D pose (x, y, yaw) used to align the point cloud to improve matching performance. Our experiments on the KITTI dataset show that our approach outperforms the state-of-the-art methods with a large margin. Our code is available at: https://github.com/lilin-hitcrt/SSC.}
    }
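
    A toy version of the idea in the SSC entry above: a ring-by-sector polar grid that stores one semantic label per cell, matched under all possible yaw (column) shifts. This follows the general scan-context recipe and is only a rough approximation of the descriptor and the two-step global semantic ICP actually proposed in the paper.

      import numpy as np

      def semantic_descriptor(xyz, labels, rings=20, sectors=60, max_range=50.0):
          r = np.hypot(xyz[:, 0], xyz[:, 1])
          theta = np.arctan2(xyz[:, 1], xyz[:, 0])
          ring = np.clip((r / max_range * rings).astype(int), 0, rings - 1)
          sector = np.clip(((theta + np.pi) / (2 * np.pi) * sectors).astype(int), 0, sectors - 1)
          desc = np.zeros((rings, sectors), dtype=int)
          desc[ring, sector] = labels                       # last point per cell wins (toy choice)
          return desc

      def match_score(d1, d2):
          """Best label-agreement ratio over all yaw shifts of the second descriptor."""
          return max(np.mean(d1 == np.roll(d2, s, axis=1)) for s in range(d1.shape[1]))

      pts = np.random.randn(5000, 3) * 15
      labels = np.random.randint(0, 20, size=5000)
      yaw90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1.0]])   # 90-degree yaw rotation
      print(match_score(semantic_descriptor(pts, labels),
                        semantic_descriptor(pts @ yaw90, labels)))   # stays high despite the yaw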

  • S. Liu, J. Cao, W. Chen, L. Wen, and Y. Liu, “HILONet: Hierarchical Imitation Learning from Non-Aligned Observations," in 2021 IEEE 10th Data Driven Control and Learning Systems Conference, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]

    It is challenging to learn from demonstrated observation-only trajectories in a non-time-aligned environment because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation (HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Our method can solve all kinds of tasks by achieving these sub-goals, whether it has a single goal position or not. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency.

    @inproceedings{liu2021hilonethi,
    title = {HILONet: Hierarchical Imitation Learning from Non-Aligned Observations},
    author = {Shanqi Liu and Junjie Cao and Wenzhou Chen and Licheng Wen and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE 10th Data Driven Control and Learning Systems Conference},
    doi = {10.48550/arXiv.2011.02671},
    abstract = {It is challenging to learn from demonstrated observation-only trajectories in a non-time-aligned environment because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation (HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Our method can solve all kinds of tasks by achieving these sub-goals, whether it has a single goal position or not. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency.}
    }

  • K. Zhang and Y. Liu, “Unsupervised Feature Learning with Data Augmentation for Control Valve Stiction Detection," in 2021 IEEE 10th Data Driven Control and Learning Systems Conference, 2021, pp. 1385-1390.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper proposes an unsupervised feature learning approach on industrial time series data for detection of valve stiction. Considering the commonly existing characteristics of industrial time series signals and the condition that sometimes massive reliable labeled data are not available, a new time series data transformation and augmentation method is developed. The transformation stage converts the raw time series signals to 2-D matrices and the augmentation stage increases the diversity of the matrices by performing transformation on different timescales. Then a convolutional autoencoder is used to extract the representative features on the augmented data, and these new features are taken as the inputs of the traditional clustering algorithms. Unlike the traditional approaches using hand-crafted features or requiring labeled data, the proposed strategy can automatically learn features on the time series data collected from industrial control loops without supervision. The effectiveness of the proposed approach is evaluated through the International Stiction Data Base (ISDB). Compared with the traditional machine learning methods and deep learning based methods, the experimental results demonstrate that the proposed strategy outperforms the other methods. Besides performance evaluation, we provide a visualization process of feature learning via principal component analysis.

    @inproceedings{zhang2021unsupervisedfl,
    title = {Unsupervised Feature Learning with Data Augmentation for Control Valve Stiction Detection},
    author = {Kexin Zhang and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE 10th data Driven Control And Learning Systems Conference},
    pages = {1385-1390},
    doi = {https://doi.org/10.1109/DDCLS52934.2021.9455535},
    abstract = {This paper proposes an unsupervised feature learning approach on industrial time series data for the detection of valve stiction. Considering the common characteristics of industrial time series signals and the condition that massive amounts of reliable labeled data are sometimes unavailable, a new time series data transformation and augmentation method is developed. The transformation stage converts the raw time series signals to 2-D matrices, and the augmentation stage increases the diversity of the matrices by performing the transformation on different timescales. Then a convolutional autoencoder is used to extract representative features from the augmented data, and these new features are taken as the inputs of traditional clustering algorithms. Unlike traditional approaches that use hand-crafted features or require labeled data, the proposed strategy can automatically learn features on the time series data collected from industrial control loops without supervision. The effectiveness of the proposed approach is evaluated on the International Stiction Data Base (ISDB). Compared with traditional machine learning methods and deep learning based methods, the experimental results demonstrate that the proposed strategy outperforms the other methods. Besides performance evaluation, we provide a visualization of the feature learning process via principal component analysis.}
    }
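
    As a rough illustration of the transformation-and-augmentation idea in the entry above (1-D loop signals reshaped into 2-D matrices at several timescales), here is a minimal numpy sketch. The window widths and the simple row-wise reshape are illustrative assumptions, not the paper's exact recipe.

    import numpy as np

    def series_to_matrix(signal, width):
        """Reshape a 1-D signal into a 2-D matrix with `width` columns,
        dropping the tail samples that do not fill a complete row."""
        rows = len(signal) // width
        return np.asarray(signal[:rows * width]).reshape(rows, width)

    def multiscale_matrices(signal, widths=(32, 64, 128)):
        """Toy augmentation: build one matrix per timescale (row width)."""
        return [series_to_matrix(signal, w) for w in widths]

    # Example: a noisy valve-position-like signal of 4096 samples.
    pv = np.sin(np.linspace(0, 40 * np.pi, 4096)) + 0.05 * np.random.randn(4096)
    for m in multiscale_matrices(pv):
        print(m.shape)   # (128, 32), (64, 64), (32, 128)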

  • X. Zuo, N. Merrill, W. Li, Y. Liu, M. Pollefeys, and G. Huang, “CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth," in 2021 IEEE International Conference on Robotics and Automation, 2021, pp. 14382-14388.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    In this work, we present a lightweight, tightly-coupled deep depth network and visual-inertial odometry (VIO) system, which can provide accurate state estimates and dense depth maps of the immediate surroundings. Leveraging the proposed lightweight Conditional Variational Autoencoder (CVAE) for depth inference and encoding, we provide the network with previously marginalized sparse features from VIO to increase the accuracy of initial depth prediction and generalization capability. The compact encoded depth maps are then updated jointly with navigation states in a sliding window estimator in order to provide the dense local scene geometry. We additionally propose a novel method to obtain the CVAE’s Jacobian which is shown to be more than an order of magnitude faster than previous works, and we additionally leverage First-Estimate Jacobian (FEJ) to avoid recalculation. As opposed to previous works relying on completely dense residuals, we propose to only provide sparse measurements to update the depth code and show through careful experimentation that our choice of sparse measurements and FEJs can still significantly improve the estimated depth maps. Our full system also exhibits state-of-the-art pose estimation accuracy, and we show that it can run in real-time with single-thread execution while utilizing GPU acceleration only for the network and code Jacobian.

    @inproceedings{zuo2021codeviovi,
    title = {CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth},
    author = {Xingxing Zuo and Nathaniel Merrill and Wei Li and Yong Liu and Marc Pollefeys and Guoquan (Paul) Huang},
    year = 2021,
    booktitle = {2021 IEEE International Conference on Robotics and Automation},
    pages = {14382-14388},
    doi = {https://doi.org/10.1109/ICRA48506.2021.9560792},
    abstract = {In this work, we present a lightweight, tightly-coupled deep depth network and visual-inertial odometry (VIO) system, which can provide accurate state estimates and dense depth maps of the immediate surroundings. Leveraging the proposed lightweight Conditional Variational Autoencoder (CVAE) for depth inference and encoding, we provide the network with previously marginalized sparse features from VIO to increase the accuracy of initial depth prediction and generalization capability. The compact encoded depth maps are then updated jointly with navigation states in a sliding window estimator in order to provide the dense local scene geometry. We additionally propose a novel method to obtain the CVAE’s Jacobian which is shown to be more than an order of magnitude faster than previous works, and we additionally leverage First-Estimate Jacobian (FEJ) to avoid recalculation. As opposed to previous works relying on completely dense residuals, we propose to only provide sparse measurements to update the depth code and show through careful experimentation that our choice of sparse measurements and FEJs can still significantly improve the estimated depth maps. Our full system also exhibits state-of-the-art pose estimation accuracy, and we show that it can run in real-time with single-thread execution while utilizing GPU acceleration only for the network and code Jacobian.},
    arxiv = {https://arxiv.org/abs/2012.10133}
    }

  • J. Cui, H. Zou, X. Kong, X. Yang, X. Zhao, Y. Liu, W. Li, F. Wen, and H. Zhang, “PocoNet: SLAM-oriented 3D LiDAR Point Cloud Online Compression Network," in 2021 IEEE International Conference on Robotics and Automation, 2021, pp. 1868-1874.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we present PocoNet: Point cloud Online COmpression NETwork to address the task of SLAM-oriented compression. The aim of this task is to select a compact subset of points with high priority to maintain localization accuracy. The key insight is that points with high priority have similar geometric features in SLAM scenarios. Hence, we tackle this task as point cloud segmentation to capture complex geometric information. We calculate observation counts by matching between maps and point clouds and divide them into different priority levels. Trained by labels annotated with such observation counts, the proposed network could evaluate the point-wise priority. Experiments are conducted by integrating our compression module into an existing SLAM system to evaluate compression ratios and localization performances. Experimental results on two different datasets verify the feasibility and generalization of our approach.

    @inproceedings{cui2021poconetso,
    title = {PocoNet: SLAM-oriented 3D LiDAR Point Cloud Online Compression Network},
    author = {Jinhao Cui and Hao Zou and Xin Kong and Xuemeng Yang and Xiangrui Zhao and Yong Liu and Wanlong Li and Feng Wen and Hongbo Zhang},
    year = 2021,
    booktitle = {2021 IEEE International Conference on Robotics and Automation},
    pages = {1868-1874},
    doi = {https://doi.org/10.1109/ICRA48506.2021.9561309},
    abstract = {In this paper, we present PocoNet: Point cloud Online COmpression NETwork to address the task of SLAM-oriented compression. The aim of this task is to select a compact subset of points with high priority to maintain localization accuracy. The key insight is that points with high priority have similar geometric features in SLAM scenarios. Hence, we tackle this task as point cloud segmentation to capture complex geometric information. We calculate observation counts by matching between maps and point clouds and divide them into different priority levels. Trained by labels annotated with such observation counts, the proposed network could evaluate the point-wise priority. Experiments are conducted by integrating our compression module into an existing SLAM system to evaluate compression ratios and localization performances. Experimental results on two different datasets verify the feasibility and generalization of our approach.}
    }
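
    The selection step described in the entry above can be illustrated with a short sketch: assuming a network has already produced a per-point priority score, keeping the highest-scoring fraction of points is straightforward. The scores and keep ratio below are placeholders, not PocoNet outputs.

    import numpy as np

    def compress_by_priority(points, scores, keep_ratio=0.3):
        """Keep the top `keep_ratio` fraction of points ranked by priority.

        points: (N, 3) xyz coordinates
        scores: (N,) per-point priority (e.g. from a segmentation head)
        """
        k = max(1, int(len(points) * keep_ratio))
        idx = np.argsort(scores)[-k:]          # indices of the k highest scores
        return points[idx]

    # Toy usage with random data standing in for a LiDAR scan.
    pts = np.random.rand(10000, 3) * 50.0
    pri = np.random.rand(10000)
    print(compress_by_priority(pts, pri).shape)   # (3000, 3)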

  • L. Li, X. Kong, X. Zhao, and Y. Liu, “SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure," in 2021 IEEE International Conference on Robotics and Automation, 2021, pp. 7627-7634.
    [BibTeX] [Abstract] [DOI] [PDF]

    LiDAR-based SLAM systems are admittedly more accurate and stable than others, while their loop closure detection is still an open issue. With the development of 3D semantic segmentation for point clouds, semantic information can be obtained conveniently and steadily, which is essential for high-level intelligence and conducive to SLAM. In this paper, we present a novel semantic-aided LiDAR SLAM with loop closure based on LOAM, named SA-LOAM, which leverages semantics in odometry as well as loop closure detection. Specifically, we propose a semantic-assisted ICP, including semantic matching, downsampling and plane constraints, and integrate a semantic graph-based place recognition method in our loop closure detection module. Benefiting from semantics, we can improve the localization accuracy, detect loop closures effectively, and construct a globally consistent semantic map even in large-scale scenes. Extensive experiments on the KITTI and Ford Campus datasets show that our system significantly improves baseline performance, generalizes to unseen data, and achieves competitive results compared with state-of-the-art methods.

    @inproceedings{li2021ssa,
    title = {SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure},
    author = {Lin Li and Xin Kong and Xiangrui Zhao and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE International Conference on Robotics and Automation},
    pages = {7627-7634},
    doi = {https://doi.org/10.1109/ICRA48506.2021.9560884},
    abstract = {LiDAR-based SLAM systems are admittedly more accurate and stable than others, while their loop closure detection is still an open issue. With the development of 3D semantic segmentation for point clouds, semantic information can be obtained conveniently and steadily, which is essential for high-level intelligence and conducive to SLAM. In this paper, we present a novel semantic-aided LiDAR SLAM with loop closure based on LOAM, named SA-LOAM, which leverages semantics in odometry as well as loop closure detection. Specifically, we propose a semantic-assisted ICP, including semantic matching, downsampling and plane constraints, and integrate a semantic graph-based place recognition method in our loop closure detection module. Benefiting from semantics, we can improve the localization accuracy, detect loop closures effectively, and construct a globally consistent semantic map even in large-scale scenes. Extensive experiments on the KITTI and Ford Campus datasets show that our system significantly improves baseline performance, generalizes to unseen data, and achieves competitive results compared with state-of-the-art methods.}
    }
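
    A minimal sketch of the "semantic matching" idea from the entry above: restrict nearest-neighbour association in ICP to points that share a semantic label. This is a generic, brute-force illustration (no downsampling or plane constraints), not SA-LOAM's implementation.

    import numpy as np

    def semantic_nn_pairs(src_pts, src_lbl, tgt_pts, tgt_lbl):
        """For every source point, find the closest target point with the SAME
        semantic label. Returns a list of (src_index, tgt_index) pairs."""
        pairs = []
        for lbl in np.intersect1d(src_lbl, tgt_lbl):
            s_idx = np.where(src_lbl == lbl)[0]
            t_idx = np.where(tgt_lbl == lbl)[0]
            # pairwise distances between same-class points only
            d = np.linalg.norm(src_pts[s_idx, None, :] - tgt_pts[None, t_idx, :], axis=-1)
            nearest = t_idx[np.argmin(d, axis=1)]
            pairs.extend(zip(s_idx.tolist(), nearest.tolist()))
        return pairs

    # Toy usage: two tiny labelled "scans".
    src = np.random.rand(100, 3); src_l = np.random.randint(0, 4, 100)
    tgt = np.random.rand(120, 3); tgt_l = np.random.randint(0, 4, 120)
    print(len(semantic_nn_pairs(src, src_l, tgt, tgt_l)))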

  • G. Zhai, Z. Zhang, X. Kong, and Y. Liu, “Efficient Pedestrian Following by Quadruped Robots," in 2021 IEEE International Conference on Robotics and Automation Workshop, 2021.
    [BibTeX] [Abstract] [PDF]

    Legged robots have superior terrain adaptability and more flexible movement capabilities than traditional wheeled robots. In this work, we use a quadruped robot as an example of legged robots to complete a pedestrian-following task in challenging scenarios. The whole system consists of two modules, perception and planning, relying on various onboard sensors.

    @inproceedings{zhai2021epf,
    title = {Efficient Pedestrian Following by Quadruped Robots},
    author = {Guangyao Zhai and Zhen Zhang and Xin Kong and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE International Conference on Robotics and Automation Workshop},
    abstract = {Legged robots have superior terrain adaptability and more flexible movement capabilities than traditional wheeled robots. In this work, we use a quadruped robot as an example of legged robots to complete a pedestrian-following task in challenging scenarios. The whole system consists of two modules, perception and planning, relying on various onboard sensors.}
    }

  • L. Liu, X. Song, X. Lyu, J. Diao, M. Wang, Y. Liu, and L. Zhang, “FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion," in Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
    [BibTeX] [Abstract] [arXiv] [PDF]

    Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate the depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates the depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage with the coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from the color image and coarse depth map, and an energy-based fusion operation is exploited to effectively fuse the features obtained by the channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on the KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.

    @inproceedings{liu2020fcfrnetff,
    title = {FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion},
    author = {Lina Liu and Xibin Song and Xiaoyang Lyu and Junwei Diao and Mengmeng Wang and Yong Liu and Liangjun Zhang},
    year = 2021,
    booktitle = {Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)},
    abstract = {Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate the depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates the depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage with the coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from the color image and coarse depth map, and an energy-based fusion operation is exploited to effectively fuse the features obtained by the channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on the KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.},
    arxiv = {https://arxiv.org/pdf/2012.08270.pdf}
    }

  • X. Lyu, L. Liu, M. Wang, X. Kong, L. Liu, Y. Liu, X. Chen, and Y. Yuan, “HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation," in Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
    [BibTeX] [Abstract] [arXiv] [PDF]

    Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) redesigning the skip-connection in DepthNet to get better high-resolution features and (2) proposing a feature fusion Squeeze-and-Excitation (fSE) module to fuse features more efficiently. Using ResNet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters, which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as the encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high resolution with only 20% of the parameters. All codes and models will be available at this https URL.

    @inproceedings{lyu2020hrdepthhr,
    title = {HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation},
    author = {Xiaoyang Lyu and Liang Liu and Mengmeng Wang and Xin Kong and Lina Liu and Yong Liu and Xinxin Chen and Yi Yuan},
    year = 2021,
    booktitle = {Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)},
    abstract = {Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) redesigning the skip-connection in DepthNet to get better high-resolution features and (2) proposing a feature fusion Squeeze-and-Excitation (fSE) module to fuse features more efficiently. Using ResNet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters, which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as the encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high resolution with only 20% of the parameters. All codes and models will be available at this https URL.},
    arxiv = {https://arxiv.org/pdf/2012.07356.pdf}
    }
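
    The fSE idea in the entry above belongs to the Squeeze-and-Excitation family: pool the fused features to channel statistics, then reweight channels. The sketch below is a generic SE-style fusion gate written under that assumption, not the authors' exact fSE block.

    import torch
    import torch.nn as nn

    class SEFusion(nn.Module):
        """Concatenate two feature maps and reweight channels with an SE gate."""
        def __init__(self, c1, c2, reduction=4):
            super().__init__()
            c = c1 + c2
            self.squeeze = nn.AdaptiveAvgPool2d(1)          # (B, C, 1, 1)
            self.excite = nn.Sequential(
                nn.Conv2d(c, c // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(c // reduction, c, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, a, b):
            x = torch.cat([a, b], dim=1)                    # channel-wise concat
            w = self.excite(self.squeeze(x))                # per-channel weights
            return x * w                                    # recalibrated features

    # Toy usage: fuse a high-resolution skip feature with an upsampled decoder feature.
    fuse = SEFusion(64, 32)
    out = fuse(torch.randn(2, 64, 48, 160), torch.randn(2, 32, 48, 160))
    print(out.shape)   # torch.Size([2, 96, 48, 160])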

2020

  • J. Cao, Y. Liu, J. Yang, and Z. Pan, “Model-Based Robot Learning Control with Uncertainty Directed Exploration," in 2020 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 2020, pp. 2004-2010.
    [BibTeX] [Abstract] [DOI] [PDF]

    Robots with nonlinear and stochastic dynamics challenge optimal control methods that rely on an analytical model. Model-free reinforcement learning algorithms have shown their potential in robot learning control without an analytical or statistical dynamic model. However, requiring numerous samples hinders their application. Model-based reinforcement learning, which combines dynamic model learning with model predictive control, provides promising methods to control the robot with complex dynamics. Robot exploration generates diverse data for dynamic model learning. Model predictive control exploits the approximated model to select an optimal action. There is a dilemma between exploration and exploitation. Uncertainty provides a direction for robot exploration, resulting in a better exploration and exploitation trade-off. In this paper, we propose Model Predictive Control with Posterior Sampling (PSMPC) to make the robot learn to control efficiently. Our PSMPC performs approximate sampling from the posterior of the dynamic model and applies model predictive control to achieve uncertainty directed exploration. In order to reduce the computational complexity of the resulting controller, we also propose a PSMPC guided policy optimization algorithm. The results of simulation in the high fidelity simulator “MuJoCo” show the effectiveness of our proposed robot learning control scheme.

    @inproceedings{cao2020modelbasedrl,
    title = {Model-Based Robot Learning Control with Uncertainty Directed Exploration},
    author = {Junjie Cao and Yong Liu and Jian Yang and Zaisheng Pan},
    year = 2020,
    booktitle = {2020 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM)},
    pages = {2004--2010},
    doi = {https://doi.org/10.1109/aim43001.2020.9158962},
    abstract = {Robots with nonlinear and stochastic dynamics challenge optimal control methods that rely on an analytical model. Model-free reinforcement learning algorithms have shown their potential in robot learning control without an analytical or statistical dynamic model. However, requiring numerous samples hinders their application. Model-based reinforcement learning, which combines dynamic model learning with model predictive control, provides promising methods to control the robot with complex dynamics. Robot exploration generates diverse data for dynamic model learning. Model predictive control exploits the approximated model to select an optimal action. There is a dilemma between exploration and exploitation. Uncertainty provides a direction for robot exploration, resulting in a better exploration and exploitation trade-off. In this paper, we propose Model Predictive Control with Posterior Sampling (PSMPC) to make the robot learn to control efficiently. Our PSMPC performs approximate sampling from the posterior of the dynamic model and applies model predictive control to achieve uncertainty directed exploration. In order to reduce the computational complexity of the resulting controller, we also propose a PSMPC guided policy optimization algorithm. The results of simulation in the high fidelity simulator “MuJoCo” show the effectiveness of our proposed robot learning control scheme.}
    }
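
    To make the posterior-sampling-plus-MPC loop described above concrete, here is a toy numpy sketch on a scalar system: sample one dynamics model from a small set of candidates (standing in for the learned posterior), then apply the first action of the best random-shooting rollout. The dynamics, cost, and three-model "posterior" are made up for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)

    # "Posterior" over dynamics: candidate coefficients (a, b) for the toy
    # model  x_{t+1} = a * x_t + b * u_t.
    posterior = [(0.9, 0.5), (1.0, 0.4), (0.95, 0.6)]

    def psmpc_action(x, horizon=10, n_candidates=256, u_max=1.0):
        a, b = posterior[rng.integers(len(posterior))]       # Thompson-style sample
        best_u0, best_cost = 0.0, np.inf
        for _ in range(n_candidates):                         # random shooting
            u_seq = rng.uniform(-u_max, u_max, horizon)
            xt, cost = x, 0.0
            for u in u_seq:
                xt = a * xt + b * u                           # roll out sampled model
                cost += xt ** 2 + 0.1 * u ** 2                # quadratic cost
            if cost < best_cost:
                best_cost, best_u0 = cost, u_seq[0]
        return best_u0                                        # apply only the first action

    print(psmpc_action(x=2.0))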

  • J. Chen, J. Zhao, W. Zhang, and Y. Liu, “Tracking an Object over 200 FPS with the Fusion of Prior Probability and Kalman Filter," in 12th International Conference on Machine Learning and Computing (ICMLC), 2020.
    [BibTeX] [Abstract] [DOI] [PDF]

    Efficient object tracking is a challenging problem, as it needs to distinguish the object with the learned appearance model as quickly as possible. In this paper, a novel robust approach fusing the prediction information of a Kalman filter and prior probability is proposed for tracking arbitrary objects. Firstly, we obtain an image patch based on the predicted information by fusing the prior probability and the Kalman filter. Secondly, the samples derived from the obtained image patch are fed into a support vector machine (SVM) to classify the object, where Histogram of Oriented Gradients (HOG) features are first extracted from these samples. Our approach has two advantages: efficient computation and a certain anti-interference ability. The number of samples obtained from the image patch is smaller than that obtained from the whole image, which makes the SVM model more efficient in classification and reduces interference outside the image patch. Experimentally, we evaluate our approach on a standard tracking benchmark that includes 50 video sequences to demonstrate our tracker’s nearly state-of-the-art performance compared with 5 trackers. Furthermore, because extracting samples and classifying HOG features is computationally very cheap, our tracker is much faster than these mentioned trackers. It achieves over 200 fps on an Intel i3 CPU for tracking an arbitrary object on the benchmark.

    @inproceedings{chen2020trackingao,
    title = {Tracking an Object over 200 FPS with the Fusion of Prior Probability and Kalman Filter},
    author = {Jun Chen and Jinhui Zhao and Wei Zhang and Yong Liu},
    year = 2020,
    booktitle = {12th International Conference on Machine Learning and Computing (ICMLC)},
    doi = {https://doi.org/10.1145/3383972.3384011},
    abstract = {Efficient object tracking is a challenging problem, as it needs to distinguish the object with the learned appearance model as quickly as possible. In this paper, a novel robust approach fusing the prediction information of a Kalman filter and prior probability is proposed for tracking arbitrary objects. Firstly, we obtain an image patch based on the predicted information by fusing the prior probability and the Kalman filter. Secondly, the samples derived from the obtained image patch are fed into a support vector machine (SVM) to classify the object, where Histogram of Oriented Gradients (HOG) features are first extracted from these samples. Our approach has two advantages: efficient computation and a certain anti-interference ability. The number of samples obtained from the image patch is smaller than that obtained from the whole image, which makes the SVM model more efficient in classification and reduces interference outside the image patch. Experimentally, we evaluate our approach on a standard tracking benchmark that includes 50 video sequences to demonstrate our tracker's nearly state-of-the-art performance compared with 5 trackers. Furthermore, because extracting samples and classifying HOG features is computationally very cheap, our tracker is much faster than these mentioned trackers. It achieves over 200 fps on an Intel i3 CPU for tracking an arbitrary object on the benchmark.}
    }
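
    The patch-prediction step in the entry above relies on a standard constant-velocity Kalman filter over the target centre; a minimal version is sketched below (state = [x, y, vx, vy]), with noise levels chosen arbitrarily rather than taken from the paper.

    import numpy as np

    dt = 1.0                                   # one frame
    F = np.array([[1, 0, dt, 0],               # constant-velocity motion model
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    H = np.array([[1, 0, 0, 0],                # we only observe the centre (x, y)
                  [0, 1, 0, 0]], dtype=float)
    Q = np.eye(4) * 1e-2                       # process noise (illustrative)
    R = np.eye(2) * 1.0                        # measurement noise (illustrative)

    def predict(x, P):
        return F @ x, F @ P @ F.T + Q

    def update(x, P, z):
        y = z - H @ x                          # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
        return x + K @ y, (np.eye(4) - K @ H) @ P

    # Toy usage: predict the next centre (where the image patch is placed),
    # then correct it with a detection.
    x, P = np.array([100.0, 50.0, 2.0, 1.0]), np.eye(4)
    x, P = predict(x, P)
    x, P = update(x, P, np.array([103.0, 51.5]))
    print(x[:2])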

  • X. Kong, X. Yang, G. Zhai, X. Zhao, X. Zeng, M. Wang, Y. Liu, W. Li, and F. Wen, “Semantic Graph Based Place Recognition for 3D Point Clouds," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 8216-8223.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Due to the difficulty of generating effective descriptors that are robust to occlusion and viewpoint changes, place recognition for 3D point clouds remains an open issue. Unlike most of the existing methods that focus on extracting local, global, and statistical features of raw point clouds, our method aims at the semantic level, which can be superior in terms of robustness to environmental changes. Inspired by the perspective of humans, who recognize scenes through identifying semantic objects and capturing their relations, this paper presents a novel semantic graph based approach for place recognition. First, we propose a novel semantic graph representation for point cloud scenes by preserving the semantic and topological information of the raw point cloud. Thus, place recognition is modeled as a graph matching problem. Then we design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to occlusion as well as viewpoint changes and outperforms the state-of-the-art methods by a large margin. Our code is available at: https://github.com/kxhit/SG_PR.

    @inproceedings{kong2020semanticgb,
    title = {Semantic Graph Based Place Recognition for 3D Point Clouds},
    author = {Xin Kong and Xuemeng Yang and Guangyao Zhai and Xiangrui Zhao and Xianfang Zeng and Mengmeng Wang and Yong Liu and Wanlong Li and Feng Wen},
    year = 2020,
    booktitle = {2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {8216--8223},
    doi = {https://doi.org/10.1109/IROS45743.2020.9341060},
    abstract = {Due to the difficulty of generating effective descriptors that are robust to occlusion and viewpoint changes, place recognition for 3D point clouds remains an open issue. Unlike most of the existing methods that focus on extracting local, global, and statistical features of raw point clouds, our method aims at the semantic level, which can be superior in terms of robustness to environmental changes. Inspired by the perspective of humans, who recognize scenes through identifying semantic objects and capturing their relations, this paper presents a novel semantic graph based approach for place recognition. First, we propose a novel semantic graph representation for point cloud scenes by preserving the semantic and topological information of the raw point cloud. Thus, place recognition is modeled as a graph matching problem. Then we design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to occlusion as well as viewpoint changes and outperforms the state-of-the-art methods by a large margin. Our code is available at: https://github.com/kxhit/SG_PR.},
    arxiv = {https://arxiv.org/pdf/2008.11459.pdf}
    }
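
    A bare-bones version of the graph construction step from the entry above: cluster-level nodes carry a semantic label and a centroid, and edges carry pairwise distances. Instance segmentation is faked here by assuming instance IDs are already given; the real pipeline is more involved.

    import numpy as np

    def build_semantic_graph(points, sem_labels, inst_ids):
        """Return a node list of (label, centroid) and a dense edge-distance matrix.

        points: (N, 3) xyz, sem_labels: (N,) semantic class per point,
        inst_ids: (N,) instance id per point (assumed given for illustration).
        """
        nodes = []
        for inst in np.unique(inst_ids):
            mask = inst_ids == inst
            label = np.bincount(sem_labels[mask]).argmax()   # majority class
            centroid = points[mask].mean(axis=0)
            nodes.append((int(label), centroid))
        centroids = np.stack([c for _, c in nodes])
        edges = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
        return nodes, edges

    # Toy usage with random labelled points standing in for a segmented scan.
    pts = np.random.rand(500, 3) * 30
    sem = np.random.randint(0, 5, 500)
    inst = np.random.randint(0, 8, 500)
    nodes, edges = build_semantic_graph(pts, sem, inst)
    print(len(nodes), edges.shape)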

  • L. Liu, J. Zhang, R. He, Y. Liu, Y. Wang, Y. Tai, D. Luo, C. Wang, J. Li, and F. Huang, “Learning by Analogy: Reliable Supervision From Transformations for Unsupervised Optical Flow Estimation," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6488-6497.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Unsupervised learning of optical flow, which leverages the supervision from view synthesis, has emerged as a promising alternative to supervised methods. However, the objective of unsupervised learning is likely to be unreliable in challenging scenes. In this work, we present a framework to use more reliable supervision from transformations. It simply twists the general unsupervised learning pipeline by running another forward pass with transformed data from augmentation, along with using transformed predictions of original data as the self-supervision signal. Besides, we further introduce a lightweight network with multiple frames by a highly-shared flow decoder. Our method consistently achieves a leap in performance on several benchmarks, with the best accuracy among deep unsupervised methods. Also, our method achieves results competitive with recent fully supervised methods while using much fewer parameters.

    @inproceedings{liu2020learningba,
    title = {Learning by Analogy: Reliable Supervision From Transformations for Unsupervised Optical Flow Estimation},
    author = {Liang Liu and Jiangning Zhang and Ruifei He and Yong Liu and Yabiao Wang and Ying Tai and Donghao Luo and Chengjie Wang and Jilin Li and Feiyue Huang},
    year = 2020,
    booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {6488--6497},
    doi = {https://doi.org/10.1109/cvpr42600.2020.00652},
    abstract = {Unsupervised learning of optical flow, which leverages the supervision from view synthesis, has emerged as a promising alternative to supervised methods. However, the objective of unsupervised learning is likely to be unreliable in challenging scenes. In this work, we present a framework to use more reliable supervision from transformations. It simply twists the general unsupervised learning pipeline by running another forward pass with transformed data from augmentation, along with using transformed predictions of original data as the self-supervision signal. Besides, we further introduce a lightweight network with multiple frames by a highly-shared flow decoder. Our method consistently achieves a leap in performance on several benchmarks, with the best accuracy among deep unsupervised methods. Also, our method achieves results competitive with recent fully supervised methods while using much fewer parameters.},
    arxiv = {http://arxiv.org/pdf/2003.13045}
    }
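
    The core trick described above, using the transformed prediction of the original frames as a pseudo-label for the prediction on transformed frames, can be sketched with a horizontal flip as the transformation. Flipping an optical-flow field also negates its x-component; `model` here is any stand-in flow network taking two frames, not the paper's architecture.

    import torch

    def hflip_images(x):
        return torch.flip(x, dims=[-1])            # flip along the width axis

    def hflip_flow(flow):
        f = torch.flip(flow, dims=[-1]).clone()    # flow: (B, 2, H, W)
        f[:, 0] = -f[:, 0]                         # x-displacement changes sign
        return f

    def self_supervised_loss(model, img1, img2):
        flow = model(img1, img2)                                   # original prediction
        pseudo = hflip_flow(flow).detach()                         # transformed prediction = label
        flow_aug = model(hflip_images(img1), hflip_images(img2))   # second forward pass
        return (flow_aug - pseudo).abs().mean()                    # L1 self-supervision

    # Toy usage with a dummy "model" that returns a zero flow field.
    model = lambda a, b: torch.zeros(a.shape[0], 2, a.shape[2], a.shape[3])
    print(self_supervised_loss(model, torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)))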

  • J. Lv, J. Xu, K. Hu, Y. Liu, and X. Zuo, “Targetless Calibration of LiDAR-IMU System Based on Continuous-time Batch Estimation," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 9968-9975.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Sensor calibration is the fundamental block for a multi-sensor fusion system. This paper presents an accurate and repeatable LiDAR-IMU calibration method (termed LI-Calib), to calibrate the 6-DOF extrinsic transformation between the 3D LiDAR and the Inertial Measurement Unit (IMU). Regarding the high data capture rate for LiDAR and IMU sensors, LI-Calib adopts a continuous-time trajectory formulation based on B-Spline, which is more suitable for fusing high-rate or asynchronous measurements than discrete-time based approaches. Additionally, LI-Calib decomposes the space into cells and identifies the planar segments for data association, which renders the calibration problem well-constrained in usual scenarios without any artificial targets. We validate the proposed calibration approach on both simulated and real-world experiments. The results demonstrate the high accuracy and good repeatability of the proposed method in common human-made scenarios. To benefit the research community, we open-source our code at https://github.com/APRIL-ZJU/lidar_IMU_calib.

    @inproceedings{lv2020targetlessco,
    title = {Targetless Calibration of LiDAR-IMU System Based on Continuous-time Batch Estimation},
    author = {Jiajun Lv and Jinhong Xu and Kewei Hu and Yong Liu and Xingxing Zuo},
    year = 2020,
    booktitle = {2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {9968--9975},
    doi = {https://doi.org/10.1109/IROS45743.2020.9341405},
    abstract = {Sensor calibration is the fundamental block for a multi-sensor fusion system. This paper presents an accurate and repeatable LiDAR-IMU calibration method (termed LI-Calib), to calibrate the 6-DOF extrinsic transformation between the 3D LiDAR and the Inertial Measurement Unit (IMU). Regarding the high data capture rate for LiDAR and IMU sensors, LI-Calib adopts a continuous-time trajectory formulation based on B-Spline, which is more suitable for fusing high-rate or asynchronous measurements than discrete-time based approaches. Additionally, LI-Calib decomposes the space into cells and identifies the planar segments for data association, which renders the calibration problem well-constrained in usual scenarios without any artificial targets. We validate the proposed calibration approach on both simulated and real-world experiments. The results demonstrate the high accuracy and good repeatability of the proposed method in common human-made scenarios. To benefit the research community, we open-source our code at https://github.com/APRIL-ZJU/lidar_IMU_calib.},
    arxiv = {https://arxiv.org/pdf/2007.14759.pdf}
    }
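
    The continuous-time formulation in the entry above rests on B-spline trajectories; the snippet below evaluates a uniform cubic B-spline through four consecutive control positions, which is the standard textbook formulation rather than LI-Calib's exact (rotation-aware) implementation.

    import numpy as np

    # Basis matrix of the uniform cubic B-spline (standard result).
    M = (1.0 / 6.0) * np.array([[ 1,  4,  1, 0],
                                [-3,  0,  3, 0],
                                [ 3, -6,  3, 0],
                                [-1,  3, -3, 1]], dtype=float)

    def cubic_bspline(ctrl, u):
        """Evaluate the spline for 4 control points `ctrl` (4, 3) at u in [0, 1)."""
        weights = np.array([1.0, u, u**2, u**3]) @ M     # blending weights (4,)
        return weights @ ctrl                            # interpolated 3-D position

    # Toy usage: sample a short position trajectory between two knots.
    ctrl = np.array([[0, 0, 0], [1, 0.5, 0], [2, 0.4, 0.1], [3, 1.0, 0.2]], float)
    for u in (0.0, 0.5, 0.99):
        print(cubic_bspline(ctrl, u))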

  • L. Wen, J. Yan, X. Yang, Y. Liu, and Y. Gu, “Collision-free Trajectory Planning for Autonomous Surface Vehicle," in 2020 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 2020, pp. 1098-1105.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    In this paper, we propose an efficient and accurate method for autonomous surface vehicles to generate a smooth and collision-free trajectory considering their dynamics constraints. We decouple the trajectory planning problem into a front-end feasible path searching and a back-end kinodynamic trajectory optimization. Firstly, we model the two-thruster under-actuated type of surface vessel. Then we adopt a sampling-based path searching method to find an asymptotically optimal path through the obstacle-surrounding environment and extract several waypoints from it. We apply a numerical optimization method in the back-end to generate the trajectory. From the perspective of safety during field voyages, we propose the sailing corridor method to keep the trajectory away from obstacles. Moreover, considering the limited fuel an ASV carries, we design a numerical objective function which can optimize a fuel-saving trajectory. Finally, we validate and compare the proposed method in simulation environments, and the results fit our expected trajectory.

    @inproceedings{wen2020collisionfreetp,
    title = {Collision-free Trajectory Planning for Autonomous Surface Vehicle},
    author = {Licheng Wen and Jiaqing Yan and Xuemeng Yang and Yong Liu and Yong Gu},
    year = 2020,
    booktitle = {2020 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM)},
    pages = {1098--1105},
    doi = {https://doi.org/10.1109/AIM43001.2020.9158907},
    abstract = {In this paper, we propose an efficient and accurate method for autonomous surface vehicles to generate a smooth and collision-free trajectory considering their dynamics constraints. We decouple the trajectory planning problem into a front-end feasible path searching and a back-end kinodynamic trajectory optimization. Firstly, we model the two-thruster under-actuated type of surface vessel. Then we adopt a sampling-based path searching method to find an asymptotically optimal path through the obstacle-surrounding environment and extract several waypoints from it. We apply a numerical optimization method in the back-end to generate the trajectory. From the perspective of safety during field voyages, we propose the sailing corridor method to keep the trajectory away from obstacles. Moreover, considering the limited fuel an ASV carries, we design a numerical objective function which can optimize a fuel-saving trajectory. Finally, we validate and compare the proposed method in simulation environments, and the results fit our expected trajectory.},
    arxiv = {http://arxiv.org/pdf/2005.09857}
    }

  • X. Zeng, Y. Pan, M. Wang, J. Zhang, and Y. Liu, “Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose," in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmarks or boundaries. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experimental results demonstrate the superior quality of the reenacted images and the flexibility of transferring facial movements between identities.

    @inproceedings{zeng2020realisticfr,
    title = {Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose},
    author = {Xianfang Zeng and Yusu Pan and Mengmeng Wang and Jiangning Zhang and Yong Liu},
    year = 2020,
    booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
    doi = {https://doi.org/10.1609/AAAI.V34I07.6970},
    abstract = {Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmarks or boundaries. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experimental results demonstrate the superior quality of the reenacted images and the flexibility of transferring facial movements between identities.},
    arxiv = {https://arxiv.org/pdf/2003.12957.pdf}
    }

  • J. Zhang, L. Liu, Z. Xue, and Y. Liu, “APB2FACE: Audio-Guided Face Reenactment with Auxiliary Pose and Blink Signals," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 4402-4406.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Audio-guided face reenactment aims at generating photorealistic faces using audio information while maintaining the same facial movement as when speaking to a real person. However, existing methods cannot generate vivid face images or can only reenact low-resolution faces, which limits their application value. To solve those problems, we propose a novel deep neural network named APB2Face, which consists of GeometryPredictor and FaceReenactor modules. GeometryPredictor uses extra head pose and blink state signals as well as audio to predict the latent landmark geometry information, while FaceReenactor inputs the face landmark image to reenact the photorealistic face. A new dataset AnnVI collected from YouTube is presented to support the approach, and experimental results indicate the superiority of our method over state-of-the-art approaches in both authenticity and controllability.

    @inproceedings{zhang2020apb2faceaf,
    title = {APB2FACE: Audio-Guided Face Reenactment with Auxiliary Pose and Blink Signals},
    author = {Jiangning Zhang and Liang Liu and Zhucun Xue and Yong Liu},
    year = 2020,
    booktitle = {2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    pages = {4402--4406},
    doi = {https://doi.org/10.1109/ICASSP40776.2020.9052977},
    abstract = {Audio-guided face reenactment aims at generating photorealistic faces using audio information while maintaining the same facial movement as when speaking to a real person. However, existing methods cannot generate vivid face images or can only reenact low-resolution faces, which limits their application value. To solve those problems, we propose a novel deep neural network named APB2Face, which consists of GeometryPredictor and FaceReenactor modules. GeometryPredictor uses extra head pose and blink state signals as well as audio to predict the latent landmark geometry information, while FaceReenactor inputs the face landmark image to reenact the photorealistic face. A new dataset AnnVI collected from YouTube is presented to support the approach, and experimental results indicate the superiority of our method over state-of-the-art approaches in both authenticity and controllability.},
    arxiv = {http://arxiv.org/pdf/2004.14569}
    }

  • J. Zhang, C. Xu, L. Liu, M. Wang, X. Wu, Y. Liu, and Y. Jiang, “DTVNet: Dynamic Time-lapse Video Generation via Single Still Image," in European Conference on Computer Vision (ECCV), 2020, pp. 300-315.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.

    @inproceedings{zhang2020dtvnet,
    title = {Dtvnet: Dynamic time-lapse video generation via single still image},
    author = {Zhang, Jiangning and Xu, Chao and Liu, Liang and Wang, Mengmeng and Wu, Xia and Liu, Yong and Jiang, Yunliang},
    year = 2020,
    booktitle = {{ECCV}},
    pages = {300--315},
    doi = {https://doi.org/10.1007/978-3-030-58558-7_18},
    abstract = {This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.},
    arxiv = {https://arxiv.org/abs/2008.04776}
    }
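
    Since the motion stream in the entry above relies on adaptive instance normalization, here is the textbook AdaIN operation: normalize content features per channel, then rescale and shift them with parameters derived from the conditioning input. The linear layer mapping the motion vector to (scale, shift) is an assumption for illustration, not taken from the paper.

    import torch
    import torch.nn as nn

    def adain(content, scale, shift, eps=1e-5):
        """content: (B, C, H, W); scale/shift: (B, C) conditioning parameters."""
        mean = content.mean(dim=(2, 3), keepdim=True)
        std = content.std(dim=(2, 3), keepdim=True) + eps
        normalized = (content - mean) / std
        return normalized * scale[..., None, None] + shift[..., None, None]

    # Toy usage: map a normalized motion vector to per-channel (scale, shift).
    motion_dim, channels = 16, 64
    to_params = nn.Linear(motion_dim, 2 * channels)
    feat = torch.randn(2, channels, 32, 32)
    scale, shift = to_params(torch.randn(2, motion_dim)).chunk(2, dim=1)
    print(adain(feat, scale, shift).shape)   # torch.Size([2, 64, 32, 32])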

  • H. Zhang, M. Wang, Y. Liu, and Y. Yuan, “FDN: Feature Decoupling Network for Head Pose Estimation," in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.
    [BibTeX] [Abstract] [DOI] [PDF]

    Head pose estimation from RGB images without depth information is a challenging task due to the loss of spatial information as well as large head pose variations in the wild. The performance of existing landmark-free methods remains unsatisfactory as the quality of estimated pose is inferior. In this paper, we propose a novel three-branch network architecture, termed as Feature Decoupling Network (FDN), a more powerful architecture for landmark-free head pose estimation from a single RGB image. In FDN, we first propose a feature decoupling (FD) module to explicitly learn the discriminative features for each pose angle by adaptively recalibrating its channel-wise responses. Besides, we introduce a cross-category center (CCC) loss to constrain the distribution of the latent variable subspaces and thus we can obtain more compact and distinct subspaces. Extensive experiments on both in-the-wild and controlled environment datasets demonstrate that the proposed method outperforms other state-of-the-art methods based on a single RGB image and behaves on par with approaches based on multimodal input resources.

    @inproceedings{zhang2020fdnfd,
    title = {FDN: Feature Decoupling Network for Head Pose Estimation},
    author = {Hao Zhang and Mengmeng Wang and Yong Liu and Yi Yuan},
    year = 2020,
    booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
    doi = {https://doi.org/10.1609/AAAI.V34I07.6974},
    abstract = {Head pose estimation from RGB images without depth information is a challenging task due to the loss of spatial information as well as large head pose variations in the wild. The performance of existing landmark-free methods remains unsatisfactory as the quality of estimated pose is inferior. In this paper, we propose a novel three-branch network architecture, termed as Feature Decoupling Network (FDN), a more powerful architecture for landmark-free head pose estimation from a single RGB image. In FDN, we first propose a feature decoupling (FD) module to explicitly learn the discriminative features for each pose angle by adaptively recalibrating its channel-wise responses. Besides, we introduce a cross-category center (CCC) loss to constrain the distribution of the latent variable subspaces and thus we can obtain more compact and distinct subspaces. Extensive experiments on both in-the-wild and controlled environment datasets demonstrate that the proposed method outperforms other state-of-the-art methods based on a single RGB image and behaves on par with approaches based on multimodal input resources.}
    }

  • J. Zhang, X. Zeng, M. Wang, Y. Pan, L. Liu, Y. Liu, Y. Ding, and C. Fan, “FReeNet: Multi-Identity Face Reenactment," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5325-5334.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.

    @inproceedings{zhang2020freenetmf,
    title = {FReeNet: Multi-Identity Face Reenactment},
    author = {Jiangning Zhang and Xianfang Zeng and Mengmeng Wang and Yusu Pan and Liang Liu and Yong Liu and Yu Ding and Changjie Fan},
    year = 2020,
    booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {5325--5334},
    doi = {https://doi.org/10.1109/cvpr42600.2020.00537},
    abstract = {This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.},
    arxiv = {http://arxiv.org/pdf/1905.11805}
    }

  • X. Zhao, C. Deng, X. Kong, J. Xu, and Y. Liu, “Learning to Compensate for the Drift and Error of Gyroscope in Vehicle Localization," in 2020 IEEE Intelligent Vehicles Symposium (IV), 2020, pp. 852-857.
    [BibTeX] [Abstract] [DOI] [PDF]

    Self-localization is an essential technology for autonomous vehicles. Building robust odometry in a GPS-denied environment is still challenging, especially when LiDAR and camera are uninformative. In this paper, we propose a learning-based approach to compensate for the drift of the gyroscope in vehicle localization. For a consumer-level MEMS gyroscope (stability ∼10°/h), our GyroNet can estimate the error of each measurement. For a high-precision Fiber Optic Gyroscope (stability ∼0.05°/h), we build a FoGNet which can obtain its drift by observing data over a long time window. We perform comparative experiments on publicly available datasets. The results demonstrate that our GyroNet can obtain higher-precision angular velocity than traditional digital filters and static initialization methods. In vehicle localization, FoGNet can effectively correct the small drift of the Fiber Optic Gyroscope (FoG) and achieves better results than the state-of-the-art method.

    @inproceedings{zhao2020learningtc,
    title = {Learning to Compensate for the Drift and Error of Gyroscope in Vehicle Localization},
    author = {Xiangrui Zhao and Chunfang Deng and Xin Kong and Jinhong Xu and Yong Liu},
    year = 2020,
    booktitle = {2020 IEEE Intelligent Vehicles Symposium (IV)},
    pages = {852--857},
    doi = {https://doi.org/10.1109/IV47402.2020.9304715},
    abstract = {Self-localization is an essential technology for autonomous vehicles. Building robust odometry in a GPS-denied environment is still challenging, especially when LiDAR and camera are uninformative. In this paper, we propose a learning-based approach to compensate for the drift of the gyroscope in vehicle localization. For a consumer-level MEMS gyroscope (stability ∼10°/h), our GyroNet can estimate the error of each measurement. For a high-precision Fiber Optic Gyroscope (stability ∼0.05°/h), we build a FoGNet which can obtain its drift by observing data over a long time window. We perform comparative experiments on publicly available datasets. The results demonstrate that our GyroNet can obtain higher-precision angular velocity than traditional digital filters and static initialization methods. In vehicle localization, FoGNet can effectively correct the small drift of the Fiber Optic Gyroscope (FoG) and achieves better results than the state-of-the-art method.}
    }

  • H. Zou, J. Cui, X. Kong, C. Zhang, Y. Liu, F. Wen, and W. Li, “F-Siamese Tracker: A Frustum-based Double Siamese Network for 3D Single Object Tracking," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 8133-8139.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    This paper presents F-Siamese Tracker, a novel approach for single object tracking prominently characterized by more robustly integrating 2D and 3D information to reduce redundant search space. A main challenge in 3D single object tracking is how to reduce search space for generating appropriate 3D candidates. Instead of solely relying on 3D proposals, firstly, our method leverages the Siamese network applied on RGB images to produce 2D region proposals which are then extruded into 3D viewing frustums. Besides, we perform an online accuracy validation on the 3D frustum to generate refined point cloud searching space, which can be embedded directly into the existing 3D tracking backbone. For efficiency, our approach gains better performance with fewer candidates by reducing search space. In addition, benefiting from the online accuracy validation, our approach can still achieve high precision in occasional cases with strong occlusions or very sparse points, even when the 2D Siamese tracker loses the target. This approach allows us to set a new state-of-the-art in 3D single object tracking by a significant margin on a sparse outdoor dataset (KITTI tracking). Moreover, experiments on 2D single object tracking show that our framework boosts 2D tracking performance as well.

    @inproceedings{zou2020fsiameseta,
    title = {F-Siamese Tracker: A Frustum-based Double Siamese Network for 3D Single Object Tracking},
    author = {Hao Zou and Jinhao Cui and Xin Kong and Chujuan Zhang and Yong Liu and Feng Wen and Wanlong Li},
    year = 2020,
    booktitle = {2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {8133--8139},
    doi = {https://doi.org/10.1109/IROS45743.2020.9341120},
    abstract = {This paper presents F-Siamese Tracker, a novel approach for single object tracking prominently characterized by more robustly integrating 2D and 3D information to reduce redundant search space. A main challenge in 3D single object tracking is how to reduce search space for generating appropriate 3D candidates. Instead of solely relying on 3D proposals, firstly, our method leverages the Siamese network applied on RGB images to produce 2D region proposals which are then extruded into 3D viewing frustums. Besides, we perform an on-line accuracy validation on the 3D frustum to generate refined point cloud searching space, which can be embedded directly into the existing 3D tracking backbone. For efficiency, our approach gains better performance with fewer candidates by reducing search space. In addition, benefited from introducing the online accuracy validation, for occasional cases with strong occlusions or very sparse points, our approach can still achieve high precision, even when the 2D Siamese tracker loses the target. This approach allows us to set a new state-of-the-art in 3D single object tracking by a significant margin on a sparse outdoor dataset (KITTI tracking). Moreover, experiments on 2D single object tracking show that our framework boosts 2D tracking performance as well.},
    arxiv = {https://arxiv.org/pdf/2010.11510.pdf}
    }

  • X. Zuo, Y. Yang, P. Geneva, J. Lv, Y. Liu, G. Huang, and M. Pollefeys, “LIC-Fusion 2.0: LiDAR-Inertial-Camera Odometry with Sliding-Window Plane-Feature Tracking," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, p. 5112–5119.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Multi-sensor fusion of multi-modal measurements from commodity inertial, visual and LiDAR sensors to provide robust and accurate 6DOF pose estimation holds great potential in robotics and beyond. In this paper, building upon our prior work (i.e., LIC-Fusion), we develop a sliding-window filter based LiDAR-Inertial-Camera odometry with online spatiotemporal calibration (i.e., LIC-Fusion 2.0), which introduces a novel sliding-window plane-feature tracking for efficiently processing 3D LiDAR point clouds. In particular, after motion compensation for LiDAR points by leveraging IMU data, low-curvature planar points are extracted and tracked across the sliding window. A novel outlier rejection criteria is proposed in the plane-feature tracking for high quality data association. Only the tracked planar points belonging to the same plane will be used for plane initialization, which makes the plane extraction efficient and robust. Moreover, we perform the observability analysis for the IMU-LiDAR subsystem under consideration and report the degenerate cases for spatiotemporal calibration using plane features. While the estimation consistency and identified degenerate motions are validated in Monte-Carlo simulations, different real-world experiments are also conducted to show that the proposed LIC-Fusion 2.0 outperforms its predecessor and other state-of-the-art methods.

    @inproceedings{zuo2020licfusion2l,
    title = {LIC-Fusion 2.0: LiDAR-Inertial-Camera Odometry with Sliding-Window Plane-Feature Tracking},
    author = {Xingxing Zuo and Yulin Yang and Patrick Geneva and Jiajun Lv and Yong Liu and Guoquan Huang and Marc Pollefeys},
    year = 2020,
    booktitle = {2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {5112--5119},
    doi = {https://doi.org/10.1109/IROS45743.2020.9340704},
    abstract = {Multi-sensor fusion of multi-modal measurements from commodity inertial, visual and LiDAR sensors to provide robust and accurate 6DOF pose estimation holds great potential in robotics and beyond. In this paper, building upon our prior work (i.e., LIC-Fusion), we develop a sliding-window filter based LiDAR-Inertial-Camera odometry with online spatiotemporal calibration (i.e., LIC-Fusion 2.0), which introduces a novel sliding-window plane-feature tracking for efficiently processing 3D LiDAR point clouds. In particular, after motion compensation for LiDAR points by leveraging IMU data, low-curvature planar points are extracted and tracked across the sliding window. A novel outlier rejection criteria is proposed in the plane-feature tracking for high quality data association. Only the tracked planar points belonging to the same plane will be used for plane initialization, which makes the plane extraction efficient and robust. Moreover, we perform the observability analysis for the IMU-LiDAR subsystem under consideration and report the degenerate cases for spatiotemporal calibration using plane features. While the estimation consistency and identified degenerate motions are validated in Monte-Carlo simulations, different real-world experiments are also conducted to show that the proposed LIC-Fusion 2.0 outperforms its predecessor and other state-of-the-art methods.},
    arxiv = {https://arxiv.org/pdf/2008.07196.pdf}
    }

2019

  • T. Huang and Y. Liu, “3D Point Cloud Geometry Compression on Deep Learning," in Proceedings of the 27th ACM International Conference on Multimedia (MM), 2019.
    [BibTeX] [Abstract] [DOI] [PDF]

    3D point cloud presentation has been widely used in computer vision, automatic driving, augmented reality, smart cities and virtual reality. 3D point cloud compression method with higher compression ratio and tiny loss is the key to improve data transportation efficiency. In this paper, we propose a new 3D point cloud geometry compression method based on deep learning, also an auto-encoder performing better than other networks in detail reconstruction. It can reach a much higher compression ratio than the state-of-the-art while keeping tolerable loss. It also supports parallel compressing multiple models by GPU, which can improve processing efficiency greatly. The compression process is composed of two parts. Firstly, raw data is compressed into codeword by extracting feature of raw model with encoder. Then, the codeword is further compressed with sparse coding. Decompression process is implemented in reverse order. Codeword is recovered and fed into decoder to reconstruct point cloud. Detail reconstruction ability is improved by a hierarchical structure in our decoder. Latter outputs are grown from former fuzzier outputs. In this way, details are added to former output by latter layers step by step to make a more precise prediction. We compare our method with PCL compression and Draco compression on ShapeNet40 part dataset. Our method may be the first deep learning-based point cloud compression algorithm. The experiments demonstrate it is superior to former common compression algorithms with large compression ratio, which can also reserve original shapes with tiny loss.

    @inproceedings{huang20193dpc,
    title = {3D Point Cloud Geometry Compression on Deep Learning},
    author = {Tianxing Huang and Yong Liu},
    year = 2019,
    booktitle = {Proceedings of the 27th ACM International Conference on Multimedia (MM)},
    doi = {https://doi.org/10.1145/3343031.3351061},
    abstract = {3D point cloud presentation has been widely used in computer vision, automatic driving, augmented reality, smart cities and virtual reality. 3D point cloud compression method with higher compression ratio and tiny loss is the key to improve data transportation efficiency. In this paper, we propose a new 3D point cloud geometry compression method based on deep learning, also an auto-encoder performing better than other networks in detail reconstruction. It can reach a much higher compression ratio than the state-of-the-art while keeping tolerable loss. It also supports parallel compressing multiple models by GPU, which can improve processing efficiency greatly. The compression process is composed of two parts. Firstly, raw data is compressed into codeword by extracting feature of raw model with encoder. Then, the codeword is further compressed with sparse coding. Decompression process is implemented in reverse order. Codeword is recovered and fed into decoder to reconstruct point cloud. Detail reconstruction ability is improved by a hierarchical structure in our decoder. Latter outputs are grown from former fuzzier outputs. In this way, details are added to former output by latter layers step by step to make a more precise prediction. We compare our method with PCL compression and Draco compression on ShapeNet40 part dataset. Our method may be the first deep learning-based point cloud compression algorithm. The experiments demonstrate it is superior to former common compression algorithms with large compression ratio, which can also reserve original shapes with tiny loss.}
    }
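
    For intuition only, here is a minimal PointNet-style auto-encoder in the spirit of the encode-to-codeword / decode-to-points pipeline described in this entry; the layer sizes, codeword length, and the omission of the sparse-coding and hierarchical-decoder stages are simplifications, not the paper's model.

    # Illustrative sketch (assumed architecture): compress a point cloud into a
    # short codeword and reconstruct the geometry from it.
    import torch
    import torch.nn as nn

    class PointCloudAE(nn.Module):
        def __init__(self, num_points: int = 2048, code_dim: int = 128):
            super().__init__()
            self.num_points = num_points
            # Encoder: shared per-point MLP followed by a symmetric max-pool.
            self.encoder = nn.Sequential(
                nn.Conv1d(3, 64, 1), nn.ReLU(),
                nn.Conv1d(64, 128, 1), nn.ReLU(),
                nn.Conv1d(128, code_dim, 1),
            )
            # Decoder: fully connected layers growing the codeword back to N x 3.
            self.decoder = nn.Sequential(
                nn.Linear(code_dim, 512), nn.ReLU(),
                nn.Linear(512, 1024), nn.ReLU(),
                nn.Linear(1024, num_points * 3),
            )

        def forward(self, pts: torch.Tensor) -> torch.Tensor:
            # pts: (batch, 3, num_points)
            code = self.encoder(pts).max(dim=2).values   # (batch, code_dim) codeword
            recon = self.decoder(code)                   # (batch, num_points * 3)
            return recon.view(-1, 3, self.num_points)

    ae = PointCloudAE()
    cloud = torch.rand(2, 3, 2048)       # placeholder point clouds
    reconstruction = ae(cloud)           # same shape as the input cloud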

  • X. Kong, G. Zhai, B. Zhong, and Y. Liu, “PASS3D: Precise and Accelerated Semantic Segmentation for 3D Point Cloud," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, p. 3467–3473.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    In this paper, we propose PASS3D to achieve point-wise semantic segmentation for 3D point cloud. Our framework combines the efficiency of traditional geometric methods with robustness of deep learning methods, consisting of two stages: at stage-1, our accelerated cluster proposal algorithm will generate refined cluster proposals by segmenting point clouds without ground, capable of generating less redundant proposals with higher recall in an extremely short time; at stage-2, we will amplify and further process these proposals by a neural network to estimate the semantic label for each point and meanwhile propose a novel data augmentation method to enhance the network’s recognition capability for all categories especially for non-rigid objects. Evaluated on KITTI raw dataset, PASS3D stands out against the state-of-the-art on some results, making itself competent to 3D perception in autonomous driving system. Our source code will be open-sourced. A video demonstration is available at https://www.youtube.com/watch?v=cukEqDuP_Qw.

    @inproceedings{kong2019pass3dpa,
    title = {PASS3D: Precise and Accelerated Semantic Segmentation for 3D Point Cloud},
    author = {Xin Kong and Guangyao Zhai and Baoquan Zhong and Yong Liu},
    year = 2019,
    booktitle = {2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {3467--3473},
    doi = {https://doi.org/10.1109/IROS40897.2019.8968296},
    abstract = {In this paper, we propose PASS3D to achieve point-wise semantic segmentation for 3D point cloud. Our framework combines the efficiency of traditional geometric methods with robustness of deep learning methods, consisting of two stages: at stage-1, our accelerated cluster proposal algorithm will generate refined cluster proposals by segmenting point clouds without ground, capable of generating less redundant proposals with higher recall in an extremely short time; at stage-2, we will amplify and further process these proposals by a neural network to estimate the semantic label for each point and meanwhile propose a novel data augmentation method to enhance the network’s recognition capability for all categories especially for non-rigid objects. Evaluated on KITTI raw dataset, PASS3D stands out against the state-of-the-art on some results, making itself competent to 3D perception in autonomous driving system. Our source code will be open-sourced. A video demonstration is available at https://www.youtube.com/watch?v=cukEqDuP_Qw.},
    arxiv = {http://arxiv.org/pdf/1909.01643}
    }
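
    A toy version of the stage-1 proposal step (ground removal followed by Euclidean clustering); the fixed ground height and the generic DBSCAN clusterer are stand-ins for illustration, not the paper's accelerated algorithm.

    # Rough sketch: remove ground points, then cluster the rest into object proposals.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_proposals(points: np.ndarray, ground_z: float = -1.5, eps: float = 0.5):
        """points: (N, 3) LiDAR points; returns a list of per-cluster point arrays."""
        # Crude ground removal: drop points close to an assumed ground height.
        non_ground = points[points[:, 2] > ground_z + 0.2]
        # Euclidean clustering of the remaining points into proposals.
        labels = DBSCAN(eps=eps, min_samples=10).fit_predict(non_ground)
        return [non_ground[labels == k] for k in set(labels) if k != -1]

    scan = np.random.uniform(-10, 10, size=(5000, 3))   # placeholder scan
    proposals = cluster_proposals(scan)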

  • Y. Li, Y. Liu, and C. Zhang, “What Elements are Essential to Recognize Human Actions?," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
    [BibTeX] [Abstract] [PDF]

    RGB image has been widely used for human action recognition. However, it could be redundant to include all information for human action depiction. We thus ask the following question: What elements are essential for human action recognition? To this end, we investigate several different human representations. These representations emphasize dissimilarly on elements (e.g. background context, actor appearance, and human shape). Systematic analysis enables us to find out essential elements as well as unnecessary contents for human action description. More specifically, our experimental results demonstrate the following: Firstly, both context-related elements and actor appearance are not vital for action recognition in most cases. But an accurate and consistent human representation is important. Secondly, essential human representation ensures better performance and cross-dataset transferability. Thirdly, fine-tuning works only when networks acquire essential elements from human representations. Fourthly, 3D reconstruction-related representation is beneficial for human action recognition tasks. Our study shows researchers need to reflect on more essential elements to depict human actions, and it is also instructive for practical human action recognition in real-world scenarios.

    @inproceedings{li2019whatea,
    title = {What Elements are Essential to Recognize Human Actions?},
    author = {YaChun Li and Yong Liu and Chi Zhang},
    year = 2019,
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    abstract = {RGB image has been widely used for human action recognition. However, it could be redundant to include all information for human action depiction. We thus ask the following question: What elements are essential for human action recognition? To this end, we investigate several different human representations. These representations emphasize dissimilarly on elements (e.g. background context, actor appearance, and human shape). Systematic analysis enables us to find out essential elements as well as unnecessary contents for human action description. More specifically, our experimental results demonstrate the following: Firstly, both context-related elements and actor appearance are not vital for action recognition in most cases. But an accurate and consistent human representation is important. Secondly, essential human representation ensures better performance and cross-dataset transferability. Thirdly, fine-tuning works only when networks acquire essential elements from human representations. Fourthly, 3D reconstruction-related representation is beneficial for human action recognition tasks. Our study shows researchers need to reflect on more essential elements to depict human actions, and it is also instructive for practical human action recognition in real-world scenarios.}
    }

  • L. Liu, G. Zhai, W. Ye, and Y. Liu, “Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity," in 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019.
    [BibTeX] [Abstract] [DOI] [PDF]

    Scene flow estimation in the dynamic scene remains a challenging task. Computing scene flow by a combination of 2D optical flow and depth has shown to be considerably faster with acceptable performance. In this work, we present a unified framework for joint unsupervised learning of stereo depth and optical flow with explicit local rigidity to estimate scene flow. We estimate camera motion directly by a Perspective-n-Point method from the optical flow and depth predictions, with RANSAC outlier rejection scheme. In order to disambiguate the object motion and the camera motion in the scene, we distinguish the rigid region by the re-project error and the photometric similarity. By joint learning with the local rigidity, both depth and optical networks can be refined. This framework boosts all four tasks: depth, optical flow, camera motion estimation, and object motion segmentation. Through the evaluation on the KITTI benchmark, we show that the proposed framework achieves state-of-the-art results amongst unsupervised methods. Our models and code are available at https://github.com/lliuz/unrigidflow.

    @inproceedings{liu2019unsupervisedlo,
    title = {Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity},
    author = {Liang Liu and Guangyao Zhai and Wenlong Ye and Yong Liu},
    year = 2019,
    booktitle = {28th International Joint Conference on Artificial Intelligence (IJCAI)},
    doi = {https://doi.org/10.24963/ijcai.2019/123},
    abstract = {Scene flow estimation in the dynamic scene remains a challenging task. Computing scene flow by a combination of 2D optical flow and depth has shown to be considerably faster with acceptable performance. In this work, we present a unified framework for joint unsupervised learning of stereo depth and optical flow with explicit local rigidity to estimate scene flow. We estimate camera motion directly by a Perspective-n-Point method from the optical flow and depth predictions, with RANSAC outlier rejection scheme. In order to disambiguate the object motion and the camera motion in the scene, we distinguish the rigid region by the re-project error and the photometric similarity. By joint learning with the local rigidity, both depth and optical networks can be refined. This framework boosts all four tasks: depth, optical flow, camera motion estimation, and object motion segmentation. Through the evaluation on the KITTI benchmark, we show that the proposed framework achieves state-of-the-art results amongst unsupervised methods. Our models and code are available at https://github.com/lliuz/unrigidflow.}
    }
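
    The camera-motion step described in this entry (Perspective-n-Point with RANSAC on the flow and depth predictions) can be sketched with OpenCV as follows; the array layout and the intrinsics K are placeholder assumptions.

    # Sketch: back-project frame-t pixels with predicted depth, follow the predicted
    # optical flow to frame t+1, and solve PnP with RANSAC (subsample in practice).
    import cv2
    import numpy as np

    def camera_motion_from_flow_depth(depth, flow, K):
        """depth: (H, W), flow: (H, W, 2), K: (3, 3) camera intrinsics."""
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W, dtype=np.float64), np.arange(H, dtype=np.float64))
        valid = depth > 0
        z = depth[valid]
        # 3D points in the frame-t camera from the predicted depth.
        x = (u[valid] - K[0, 2]) * z / K[0, 0]
        y = (v[valid] - K[1, 2]) * z / K[1, 1]
        pts3d = np.stack([x, y, z], axis=1)
        # Their correspondences in frame t+1, given by the predicted flow.
        pts2d = np.stack([u[valid] + flow[..., 0][valid],
                          v[valid] + flow[..., 1][valid]], axis=1)
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
            reprojectionError=3.0)
        return rvec, tvec, inliers   # rotation (Rodrigues), translation, inlier indices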

  • T. Shi, Y. Yuan, C. Fan, Z. Zou, Z. Shi, and Y. Liu, “Face-to-Parameter Translation for Game Character Auto-Creation," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, p. 161–170.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Character customization system is an important component in Role-Playing Games (RPGs), where players are allowed to edit the facial appearance of their in-game characters with their own preferences rather than using default templates. This paper proposes a method for automatically creating in-game characters of players according to an input face photo. We formulate the above “artistic creation" process under a facial similarity measurement and parameter searching paradigm by solving an optimization problem over a large set of physically meaningful facial parameters. To effectively minimize the distance between the created face and the real one, two loss functions, i.e. a “discriminative loss" and a “facial content loss", are specifically designed. As the rendering process of a game engine is not differentiable, a generative network is further introduced as an “imitator" to imitate the physical behavior of the game engine so that the proposed method can be implemented under a neural style transfer framework and the parameters can be optimized by gradient descent. Experimental results demonstrate that our method achieves a high degree of generation similarity between the input face photo and the created in-game character in terms of both global appearance and local details. Our method has been deployed in a new game last year and has now been used by players over 1 million times.

    @inproceedings{shi2019facetoparametertf,
    title = {Face-to-Parameter Translation for Game Character Auto-Creation},
    author = {Tianyang Shi and Yi Yuan and Changjie Fan and Zhengxia Zou and Zhenwei Shi and Yong Liu},
    year = 2019,
    booktitle = {2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {161--170},
    doi = {https://doi.org/10.1109/ICCV.2019.00025},
    abstract = {Character customization system is an important component in Role-Playing Games (RPGs), where players are allowed to edit the facial appearance of their in-game characters with their own preferences rather than using default templates. This paper proposes a method for automatically creating in-game characters of players according to an input face photo. We formulate the above "artistic creation" process under a facial similarity measurement and parameter searching paradigm by solving an optimization problem over a large set of physically meaningful facial parameters. To effectively minimize the distance between the created face and the real one, two loss functions, i.e. a "discriminative loss" and a "facial content loss", are specifically designed. As the rendering process of a game engine is not differentiable, a generative network is further introduced as an "imitator" to imitate the physical behavior of the game engine so that the proposed method can be implemented under a neural style transfer framework and the parameters can be optimized by gradient descent. Experimental results demonstrate that our method achieves a high degree of generation similarity between the input face photo and the created in-game character in terms of both global appearance and local details. Our method has been deployed in a new game last year and has now been used by players over 1 million times.},
    arxiv = {http://arxiv.org/pdf/1909.01064}
    }

  • G. Tian, Y. Yuan, and Y. Liu, “Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks," in 2019 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2019, p. 366–371.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    We propose an end to end deep learning approach for generating real-time facial animation from just audio. Specifically, our deep architecture employs deep bidirectional long short-term memory network and attention mechanism to discover the latent representations of time-varying contextual information within the speech and recognize the significance of different information contributed to certain face status. Therefore, our model is able to drive different levels of facial movements at inference and automatically keep up with the corresponding pitch and latent speaking style in the input audio, with no assumption or further human intervention. Evaluation results show that our method could not only generate accurate lip movements from audio, but also successfully regress the speaker’s time-varying facial movements.

    @inproceedings{tian2019audio2facegs,
    title = {Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks},
    author = {Guanzhong Tian and Yi Yuan and Yong Liu},
    year = 2019,
    booktitle = {2019 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)},
    pages = {366--371},
    doi = {https://doi.org/10.1109/ICMEW.2019.00069},
    abstract = {We propose an end to end deep learning approach for generating real-time facial animation from just audio. Specifically, our deep architecture employs deep bidirectional long short-term memory network and attention mechanism to discover the latent representations of time-varying contextual information within the speech and recognize the significance of different information contributed to certain face status. Therefore, our model is able to drive different levels of facial movements at inference and automatically keep up with the corresponding pitch and latent speaking style in the input audio, with no assumption or further human intervention. Evaluation results show that our method could not only generate accurate lip movements from audio, but also successfully regress the speaker's time-varying facial movements.},
    arxiv = {http://arxiv.org/pdf/1905.11142}
    }
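
    A minimal sketch of an attention-weighted bidirectional LSTM regressor of the kind described in this entry; the audio feature dimension (e.g. MFCCs), hidden size, and output parameter count are assumptions rather than the paper's configuration.

    # Hypothetical sketch: BiLSTM over audio features, a simple temporal attention,
    # and per-frame regression of facial animation parameters.
    import torch
    import torch.nn as nn

    class AudioToFace(nn.Module):
        def __init__(self, audio_dim: int = 39, hidden: int = 128, face_dim: int = 51):
            super().__init__()
            self.lstm = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)
            self.attn = nn.Linear(2 * hidden, 1)     # scalar score per frame
            self.head = nn.Linear(2 * hidden, face_dim)

        def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
            # audio_feats: (batch, frames, audio_dim)
            h, _ = self.lstm(audio_feats)                     # (batch, frames, 2*hidden)
            scores = torch.softmax(self.attn(h), dim=1)       # attention weights over time
            context = (scores * h).sum(dim=1, keepdim=True)   # weighted temporal summary
            return self.head(h + context)                     # per-frame face parameters

    model = AudioToFace()
    mfcc = torch.randn(1, 200, 39)        # placeholder audio feature sequence
    face_params = model(mfcc)             # (1, 200, 51)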

  • Y. Yang, P. Geneva, X. Zuo, K. Eckenhoff, Y. Liu, and G. Huang, “Tightly-Coupled Aided Inertial Navigation with Point and Plane Features," in 2019 International Conference on Robotics and Automation (ICRA), 2019, p. 6094–6100.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper presents a tightly-coupled aided inertial navigation system (INS) with point and plane features, a general sensor fusion framework applicable to any visual and depth sensor (e.g., RGBD, LiDAR) configuration, in which the camera is used for point feature tracking and depth sensor for plane extraction. The proposed system exploits geometrical structures (planes) of the environments and adopts the closest point (CP) for plane parameterization. Moreover, we distinguish planar point features from non-planar point features in order to enforce point-on-plane constraints which are used in our state estimator, thus further exploiting structural information from the environment. We also introduce a simple but effective plane feature initialization algorithm for feature-based simultaneous localization and mapping (SLAM). In addition, we perform online spatial calibration between the IMU and the depth sensor as it is difficult to obtain this critical calibration parameter in high precision. Both Monte-Carlo simulations and real-world experiments are performed to validate the proposed approach.

    @inproceedings{yang2019tightlycoupledai,
    title = {Tightly-Coupled Aided Inertial Navigation with Point and Plane Features},
    author = {Yulin Yang and Patrick Geneva and Xingxing Zuo and Kevin Eckenhoff and Yong Liu and Guoquan Huang},
    year = 2019,
    booktitle = {2019 International Conference on Robotics and Automation (ICRA)},
    pages = {6094--6100},
    doi = {https://doi.org/10.1109/ICRA.2019.8794078},
    abstract = {This paper presents a tightly-coupled aided inertial navigation system (INS) with point and plane features, a general sensor fusion framework applicable to any visual and depth sensor (e.g., RGBD, LiDAR) configuration, in which the camera is used for point feature tracking and depth sensor for plane extraction. The proposed system exploits geometrical structures (planes) of the environments and adopts the closest point (CP) for plane parameterization. Moreover, we distinguish planar point features from non-planar point features in order to enforce point-on-plane constraints which are used in our state estimator, thus further exploiting structural information from the environment. We also introduce a simple but effective plane feature initialization algorithm for feature-based simultaneous localization and mapping (SLAM). In addition, we perform online spatial calibration between the IMU and the depth sensor as it is difficult to obtain this critical calibration parameter in high precision. Both Monte-Carlo simulations and real-world experiments are performed to validate the proposed approach.}
    }
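
    As a quick numerical aside on the closest-point (CP) plane parameterization adopted in this entry: a plane n·x = d (unit normal n, distance d > 0) is stored as the single vector Π = d·n, and the point-to-plane residual follows directly. This is a minimal sketch of the parameterization itself, not the paper's estimator.

    import numpy as np

    def plane_to_cp(n: np.ndarray, d: float) -> np.ndarray:
        """Closest point on the plane {x : n·x = d} to the origin (n unit-norm, d > 0)."""
        return d * n

    def cp_to_plane(cp: np.ndarray):
        d = np.linalg.norm(cp)
        return cp / d, d

    def point_to_plane_residual(cp: np.ndarray, p: np.ndarray) -> float:
        n, d = cp_to_plane(cp)
        return float(n @ p - d)

    cp = plane_to_cp(np.array([0.0, 0.0, 1.0]), 2.0)               # the plane z = 2
    print(point_to_plane_residual(cp, np.array([1.0, 1.0, 2.5])))  # 0.5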

  • W. Ye, R. Zheng, F. Zhang, Z. Ouyang, and Y. Liu, “Robust and Efficient Vehicles Motion Estimation with Low-Cost Multi-Camera and Odometer-Gyroscope," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, p. 4490–4496.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we present a robust and efficient estimation approach with multi-camera, odometer and gyroscope. Robust initialization, tightly-coupled optimization estimator and multi-camera loop-closure detection are utilized in the proposed approach. In initialization, the measurements of odometer and gyroscope are used to compute scale, and then estimate the bias of sensors. In estimator, the pre-integration of odometer and gyroscope is derived and combined with the measurements of multi-camera to estimate the motion in a tightly-coupled optimization framework. In loop-closure detection, a connection between different cameras of the vehicle can be built, which significantly improves the success rate of loop-closure detection. The proposed algorithm is validated in multiple real-world datasets collected in different places, time, weather and illumination. Experimental results show that the proposed approach can estimate the motion of vehicles robustly and efficiently.

    @inproceedings{ye2019robustae,
    title = {Robust and Efficient Vehicles Motion Estimation with Low-Cost Multi-Camera and Odometer-Gyroscope},
    author = {Wenlong Ye and Renjie Zheng and Fangqiang Zhang and Zizhou Ouyang and Yong Liu},
    year = 2019,
    booktitle = {2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {4490--4496},
    doi = {https://doi.org/10.1109/IROS40897.2019.8968048},
    abstract = {In this paper, we present a robust and efficient estimation approach with multi-camera, odometer and gyroscope. Robust initialization, tightly-coupled optimization estimator and multi-camera loop-closure detection are utilized in the proposed approach. In initialization, the measurements of odometer and gyroscope are used to compute scale, and then estimate the bias of sensors. In estimator, the pre-integration of odometer and gyroscope is derived and combined with the measurements of multi-camera to estimate the motion in a tightly-coupled optimization framework. In loop-closure detection, a connection between different cameras of the vehicle can be built, which significantly improves the success rate of loop-closure detection. The proposed algorithm is validated in multiple real-world datasets collected in different places, time, weather and illumination. Experimental results show that the proposed approach can estimate the motion of vehicles robustly and efficiently.}
    }
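
    For intuition, the odometer-gyroscope measurements mentioned in this entry already give a planar dead-reckoning estimate on their own; the sketch below shows that simple propagation step (the paper instead pre-integrates these measurements inside a tightly-coupled optimization together with the multi-camera observations).

    # Bare-bones planar dead reckoning from wheel odometry and a yaw-rate gyro.
    import numpy as np

    def propagate(pose, wheel_speed, yaw_rate, dt):
        """pose = (x, y, yaw); wheel_speed in m/s, yaw_rate in rad/s."""
        x, y, yaw = pose
        heading = yaw + 0.5 * yaw_rate * dt          # midpoint heading
        return (x + wheel_speed * dt * np.cos(heading),
                y + wheel_speed * dt * np.sin(heading),
                yaw + yaw_rate * dt)

    pose = (0.0, 0.0, 0.0)
    for speed, rate in [(1.0, 0.05)] * 100:          # placeholder measurements at 100 Hz
        pose = propagate(pose, speed, rate, dt=0.01)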

  • X. Zhao, R. Zheng, W. Ye, Y. Liu, and M. Li, “A Robust Stereo Semi-direct SLAM System Based on Hybrid Pyramid," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, p. 5376–5382.
    [BibTeX] [Abstract] [DOI] [PDF]

    We propose a hybrid pyramid based approach to fuse the direct and indirect methods in visual SLAM, to allow robust localization under various situations including large-baseline motion, low-texture environment, and various illumination changes. In our approach, we first calculate coarse inter-frame pose estimation by matching the feature points. Subsequently, we use both direct image alignment and a multiscale pyramid method, for refining the previous estimation to attain better precision. Furthermore, we perform online photometric calibration along with pose estimation, to reduce un-modelled errors. To evaluate our approach, we conducted various real-world experiments on both public datasets and self-collected ones, by implementing a full SLAM system with the proposed methods. The results show that our system improves both localization accuracy and robustness by a wide margin.

    @inproceedings{zhao2019ars,
    title = {A Robust Stereo Semi-direct SLAM System Based on Hybrid Pyramid},
    author = {Xiangrui Zhao and Renjie Zheng and Wenlong Ye and Yong Liu and Mingyang Li},
    year = 2019,
    booktitle = {2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {5376--5382},
    doi = {https://doi.org/10.1109/IROS40897.2019.8968008},
    abstract = {We propose a hybrid pyramid based approach to fuse the direct and indirect methods in visual SLAM, to allow robust localization under various situations including large-baseline motion, low-texture environment, and various illumination changes. In our approach, we first calculate coarse inter-frame pose estimation by matching the feature points. Subsequently, we use both direct image alignment and a multiscale pyramid method, for refining the previous estimation to attain better precision. Furthermore, we perform online photometric calibration along with pose estimation, to reduce un-modelled errors. To evaluate our approach, we conducted various real-world experiments on both public datasets and self-collected ones, by implementing a full SLAM system with the proposed methods. The results show that our system improves both localization accuracy and robustness by a wide margin.}
    }

  • X. Zuo, P. Geneva, W. Lee, Y. Liu, and G. Huang, “LIC-Fusion: LiDAR-Inertial-Camera Odometry," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, p. 5848–5854.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    This paper presents a tightly-coupled multi-sensor fusion algorithm termed LiDAR-inertial-camera fusion (LIC-Fusion), which efficiently fuses IMU measurements, sparse visual features, and extracted LiDAR points. In particular, the proposed LIC-Fusion performs online spatial and temporal sensor calibration between all three asynchronous sensors, in order to compensate for possible calibration variations. The key contribution is the optimal (up to linearization errors) multi-modal sensor fusion of detected and tracked sparse edge/surf feature points from LiDAR scans within an efficient MSCKF-based framework, alongside sparse visual feature observations and IMU readings. We perform extensive experiments in both indoor and outdoor environments, showing that the proposed LIC-Fusion outperforms the state-of-the-art visual-inertial odometry (VIO) and LiDAR odometry methods in terms of estimation accuracy and robustness to aggressive motions.

    @inproceedings{zuo2019licfusionlo,
    title = {LIC-Fusion: LiDAR-Inertial-Camera Odometry},
    author = {Xingxing Zuo and Patrick Geneva and Woosik Lee and Yong Liu and Guoquan Huang},
    year = 2019,
    booktitle = {2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {5848--5854},
    doi = {https://doi.org/10.1109/IROS40897.2019.8967746},
    abstract = {This paper presents a tightly-coupled multi-sensor fusion algorithm termed LiDAR-inertial-camera fusion (LIC-Fusion), which efficiently fuses IMU measurements, sparse visual features, and extracted LiDAR points. In particular, the proposed LIC-Fusion performs online spatial and temporal sensor calibration between all three asynchronous sensors, in order to compensate for possible calibration variations. The key contribution is the optimal (up to linearization errors) multi-modal sensor fusion of detected and tracked sparse edge/surf feature points from LiDAR scans within an efficient MSCKF-based framework, alongside sparse visual feature observations and IMU readings. We perform extensive experiments in both indoor and outdoor environments, showing that the proposed LIC-Fusion outperforms the state-of-the-art visual-inertial odometry (VIO) and LiDAR odometry methods in terms of estimation accuracy and robustness to aggressive motions.},
    arxiv = {http://arxiv.org/pdf/1909.04102}
    }

  • X. Zuo, M. Zhang, Y. Chen, Y. Liu, G. Huang, and M. Li, “Visual-Inertial Localization for Skid-Steering Robots with Kinematic Constraints," in 2019 The International Symposium on Robotics Research (ISRR), 2019.
    [BibTeX] [Abstract] [arXiv] [PDF]

    While visual localization or SLAM has witnessed great progress in past decades, when deploying it on a mobile robot in practice, few works have explicitly considered the kinematic (or dynamic) constraints of the real robotic system when designing state estimators. To promote the practical deployment of current state-of-the-art visual-inertial localization algorithms, in this work we propose a low-cost kinematics-constrained localization system particularly for a skid-steering mobile robot. In particular, we derive in a principle way the robot’s kinematic constraints based on the instantaneous centers of rotation (ICR) model and integrate them in a tightly-coupled manner into the sliding-window bundle adjustment (BA)-based visual-inertial estimator. Because the ICR model parameters are time-varying due to, for example, track-to-terrain interaction and terrain roughness, we estimate these kinematic parameters online along with the navigation state. To this end, we perform in-depth the observability analysis and identify motion conditions under which the state/parameter estimation is viable. The proposed kinematics-constrained visual-inertial localization system has been validated extensively in different terrain scenarios.

    @inproceedings{zuo2019visualinertiallf,
    title = {Visual-Inertial Localization for Skid-Steering Robots with Kinematic Constraints},
    author = {Xingxing Zuo and Mingming Zhang and Yiming Chen and Yong Liu and Guoquan Huang and Mingyang Li},
    year = 2019,
    booktitle = {2019 The International Symposium on Robotics Research (ISRR)},
    abstract = {While visual localization or SLAM has witnessed great progress in past decades, when deploying it on a mobile robot in practice, few works have explicitly considered the kinematic (or dynamic) constraints of the real robotic system when designing state estimators. To promote the practical deployment of current state-of-the-art visual-inertial localization algorithms, in this work we propose a low-cost kinematics-constrained localization system particularly for a skid-steering mobile robot. In particular, we derive in a principle way the robot's kinematic constraints based on the instantaneous centers of rotation (ICR) model and integrate them in a tightly-coupled manner into the sliding-window bundle adjustment (BA)-based visual-inertial estimator. Because the ICR model parameters are time-varying due to, for example, track-to-terrain interaction and terrain roughness, we estimate these kinematic parameters online along with the navigation state. To this end, we perform in-depth the observability analysis and identify motion conditions under which the state/parameter estimation is viable. The proposed kinematics-constrained visual-inertial localization system has been validated extensively in different terrain scenarios.},
    arxiv = {https://arxiv.org/pdf/1911.05787.pdf}
    }

2018

  • W. Chen and Y. Liu, “Active Planning of Robot Navigation for 3D Scene Exploration," in 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 2018, p. 516–520.
    [BibTeX] [Abstract] [DOI] [PDF]

    This work addresses the active planning of robot navigation tasks for 3D scene exploration. 3D scene exploration is an old and difficult task in robotics. In this paper, we present a strategy to guide a mobile autonomous robot equipped with a camera in order to autonomously explore the unknown 3D scene. By merging the particle filter into 3D scene exploration, we address the robot navigation problem in a heuristic way, and generate a sequence of camera poses to cover the unknown 3D scene. First, we randomly generate a bunch of potential camera pose vectors. Then, we select the vectors through our criteria. After determining the first camera pose vector, we generate the next group of vectors based on the former one. We select the new camera pose vector, and so on thereafter. We verify the algorithm theoretically and show the good performance in the simulation environment.

    @inproceedings{chen2018activepo,
    title = {Active Planning of Robot Navigation for 3D Scene Exploration},
    author = {Wenzhou Chen and Yong Liu},
    year = 2018,
    booktitle = {2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM)},
    pages = {516--520},
    doi = {https://doi.org/10.1109/AIM.2018.8452299},
    abstract = {This work addresses the active planning of robot navigation tasks for 3D scene exploration. 3D scene exploration is an old and difficult task in robotics. In this paper, we present a strategy to guide a mobile autonomous robot equipped with a camera in order to autonomously explore the unknown 3D scene. By merging the particle filter into 3D scene exploration, we address the robot navigation problem in a heuristic way, and generate a sequence of camera poses to cover the unknown 3D scene. First, we randomly generate a bunch of potential camera pose vectors. Then, we select the vectors through our criteria. After determining the first camera pose vector, we generate the next group of vectors based on the former one. We select the new camera pose vector, and so on thereafter. We verify the algorithm theoretically and show the good performance in the simulation environment.}
    }

  • J. Xu, J. Lv, Z. Pan, Y. Liu, and Y. Chen, “Real-Time LiDAR Data Association Aided by IMU in High Dynamic Environment," in 2018 IEEE International Conference on Real-time Computing and Robotics (RCAR), 2018, p. 202–205.
    [BibTeX] [Abstract] [DOI] [PDF]

    In recent years, with the breakthroughs in sensor technology, SLAM technology is developing towards high speed and high dynamic applications. The rotating multi line LiDAR sensor plays an important role. However, the rotating multi line LiDAR sensors need to restructure the data in high dynamic environment. Our work is to propose a LiDAR data correction method based on IMU and hardware synchronization, and make a hardware synchronization unit. This method can still output correct point cloud information when LiDAR sensor is moving violently.

    @inproceedings{xu2018realtimeld,
    title = {Real-Time LiDAR Data Association Aided by IMU in High Dynamic Environment},
    author = {Jinhong Xu and Jiajun Lv and Zaishen Pan and Yong Liu and Yinan Chen},
    year = 2018,
    booktitle = {2018 IEEE International Conference on Real-time Computing and Robotics (RCAR)},
    pages = {202--205},
    doi = {https://doi.org/10.1109/RCAR.2018.8621627},
    abstract = {In recent years, with the breakthroughs in sensor technology, SLAM technology is developing towards high speed and high dynamic applications. The rotating multi line LiDAR sensor plays an important role. However, the rotating multi line LiDAR sensors need to restructure the data in high dynamic environment. Our work is to propose a LiDAR data correction method based on IMU and hardware synchronization, and make a hardware synchronization unit. This method can still output correct point cloud information when LiDAR sensor is moving violently.}
    }
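
    A simplified sketch of the correction idea: each point in a sweep is mapped back to the scan-start frame using the motion accumulated up to its timestamp. Constant angular and linear velocity over the sweep is assumed here for brevity, whereas the described unit uses hardware-synchronized IMU data.

    # Illustrative LiDAR de-skewing under a constant-velocity assumption.
    import numpy as np
    from scipy.spatial.transform import Rotation

    def undistort(points, timestamps, ang_vel, lin_vel):
        """points: (N, 3); timestamps: (N,) seconds since scan start; velocities in the LiDAR frame."""
        out = np.empty_like(points)
        for i, (p, t) in enumerate(zip(points, timestamps)):
            R = Rotation.from_rotvec(ang_vel * t).as_matrix()   # rotation accumulated by time t
            out[i] = R @ p + lin_vel * t                        # map into the scan-start frame
        return out

    scan = np.random.rand(1000, 3) * 50.0
    times = np.linspace(0.0, 0.1, 1000)                          # one 10 Hz sweep
    rectified = undistort(scan, times,
                          ang_vel=np.array([0.0, 0.0, 0.5]),
                          lin_vel=np.array([5.0, 0.0, 0.0]))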

2017

  • Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu, “Parse geometry from a line: Monocular depth estimation with partial laser observation," in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, p. 5059–5066.
    [BibTeX] [Abstract] [DOI] [PDF]

    Many standard robotic platforms are equipped with at least a fixed 2D laser range finder and a monocular camera. Although those platforms do not have sensors for 3D depth sensing capability, knowledge of depth is an essential part in many robotics activities. Therefore, recently, there is an increasing interest in depth estimation using monocular images. As this task is inherently ambiguous, the data-driven estimated depth might be unreliable in robotics applications. In this paper, we have attempted to improve the precision of monocular depth estimation by introducing 2D planar observation from the remaining laser range finder without extra cost. Specifically, we construct a dense reference map from the sparse laser range data, redefining the depth estimation task as estimating the distance between the real and the reference depth. To solve the problem, we construct a novel residual of residual neural network, and tightly combine the classification and regression losses for continuous depth estimation. Experimental results suggest that our method achieves considerable promotion compared to the state-of-the-art methods on both NYUD2 and KITTI, validating the effectiveness of our method on leveraging the additional sensory information. We further demonstrate the potential usage of our method in obstacle avoidance where our methodology provides comprehensive depth information compared to the solution using monocular camera or 2D laser range finder alone.

    @inproceedings{liao2017parsegf,
    title = {Parse geometry from a line: Monocular depth estimation with partial laser observation},
    author = {Yiyi Liao and Lichao Huang and Yue Wang and Sarath Kodagoda and Yinan Yu and Yong Liu},
    year = 2017,
    booktitle = {2017 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {5059--5066},
    doi = {https://doi.org/10.1109/ICRA.2017.7989590},
    abstract = {Many standard robotic platforms are equipped with at least a fixed 2D laser range finder and a monocular camera. Although those platforms do not have sensors for 3D depth sensing capability, knowledge of depth is an essential part in many robotics activities. Therefore, recently, there is an increasing interest in depth estimation using monocular images. As this task is inherently ambiguous, the data-driven estimated depth might be unreliable in robotics applications. In this paper, we have attempted to improve the precision of monocular depth estimation by introducing 2D planar observation from the remaining laser range finder without extra cost. Specifically, we construct a dense reference map from the sparse laser range data, redefining the depth estimation task as estimating the distance between the real and the reference depth. To solve the problem, we construct a novel residual of residual neural network, and tightly combine the classification and regression losses for continuous depth estimation. Experimental results suggest that our method achieves considerable promotion compared to the state-of-the-art methods on both NYUD2 and KITTI, validating the effectiveness of our method on leveraging the additional sensory information. We further demonstrate the potential usage of our method in obstacle avoidance where our methodology provides comprehensive depth information compared to the solution using monocular camera or 2D laser range finder alone.}
    }

  • M. Wang, Y. Liu, and Z. Huang, “Large Margin Object Tracking with Circulant Feature Maps," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, p. 4800–4808.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Structured output support vector machine (SVM) based tracking algorithms have shown favorable performance recently. Nonetheless, the time-consuming candidate sampling and complex optimization limit their real-time applications. In this paper, we propose a novel large margin object tracking method which absorbs the strong discriminative ability from structured output SVM and speeds up by the correlation filter algorithm significantly. Secondly, a multimodal target detection technique is proposed to improve the target localization precision and prevent model drift introduced by similar objects or background noise. Thirdly, we exploit the feedback from high-confidence tracking results to avoid the model corruption problem. We implement two versions of the proposed tracker with the representations from both conventional hand-crafted and deep convolution neural networks (CNNs) based features to validate the strong compatibility of the algorithm. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms on the challenging benchmark sequences while runs at speed in excess of 80 frames per second.

    @inproceedings{wang2017largemo,
    title = {Large Margin Object Tracking with Circulant Feature Maps},
    author = {Mengmeng Wang and Yong Liu and Zeyi Huang},
    year = 2017,
    booktitle = {2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {4800--4808},
    doi = {https://doi.org/10.1109/CVPR.2017.510},
    arxiv = {http://arxiv.org/pdf/1703.05020},
    abstract = {Structured output support vector machine (SVM) based tracking algorithms have shown favorable performance recently. Nonetheless, the time-consuming candidate sampling and complex optimization limit their real-time applications. In this paper, we propose a novel large margin object tracking method which absorbs the strong discriminative ability from structured output SVM and speeds up by the correlation filter algorithm significantly. Secondly, a multimodal target detection technique is proposed to improve the target localization precision and prevent model drift introduced by similar objects or background noise. Thirdly, we exploit the feedback from high-confidence tracking results to avoid the model corruption problem. We implement two versions of the proposed tracker with the representations from both conventional hand-crafted and deep convolution neural networks (CNNs) based features to validate the strong compatibility of the algorithm. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms on the challenging benchmark sequences while runs at speed in excess of 80 frames per second.}
    }
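
    To make the correlation-filter part concrete, here is a toy single-channel MOSSE-style filter (learning in the Fourier domain, detection as an inverse FFT of an element-wise product); the structured-output learning, multimodal detection, and confidence feedback of the actual tracker are omitted.

    # Toy correlation filter exploiting circulant structure via the FFT.
    import numpy as np

    def train_filter(patch: np.ndarray, target: np.ndarray, lam: float = 1e-2):
        """Learn a filter so that correlating the patch reproduces the Gaussian target."""
        F = np.fft.fft2(patch)
        G = np.fft.fft2(target)
        return (G * np.conj(F)) / (F * np.conj(F) + lam)

    def detect(filter_hat: np.ndarray, patch: np.ndarray):
        """Correlate a new patch with the learned filter and return the peak location."""
        response = np.real(np.fft.ifft2(filter_hat * np.fft.fft2(patch)))
        return np.unravel_index(np.argmax(response), response.shape), response

    # Usage with a synthetic patch and a Gaussian-shaped regression target.
    size = 64
    yy, xx = np.mgrid[0:size, 0:size]
    target = np.exp(-((xx - size // 2) ** 2 + (yy - size // 2) ** 2) / (2 * 4.0 ** 2))
    patch = np.random.rand(size, size)
    H = train_filter(patch, target)
    peak, resp = detect(H, patch)        # peak lands near the patch centre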

  • M. Wang, D. Su, L. Shi, Y. Liu, and J. V. Miro, “Real-time 3D human tracking for mobile robots with multisensors," in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, p. 5081–5087.
    [BibTeX] [Abstract] [DOI] [PDF]

    Acquiring the accurate 3-D position of a target person around a robot provides fundamental and valuable information that is applicable to a wide range of robotic tasks, including home service, navigation and entertainment. This paper presents a real-time robotic 3-D human tracking system which combines a monocular camera with an ultrasonic sensor by the extended Kalman filter (EKF). The proposed system consists of three sub-modules: monocular camera sensor tracking model, ultrasonic sensor tracking model and multi-sensor fusion. An improved visual tracking algorithm is presented to provide partial location estimation (2-D). The algorithm is designed to overcome severe occlusions, scale variation, target missing and achieve robust re-detection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot and Gaussian Process Regression is used for partial location estimation (2-D). EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the superior performance of the 3-D tracking system in terms of both the accuracy and robustness.

    @inproceedings{wang2017realtime3h,
    title = {Real-time 3D human tracking for mobile robots with multisensors},
    author = {Mengmeng Wang and Daobilige Su and Lei Shi and Yong Liu and Jaime Valls Miro},
    year = 2017,
    booktitle = {2017 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {5081--5087},
    doi = {https://doi.org/10.1109/ICRA.2017.7989593},
    abstract = {Acquiring the accurate 3-D position of a target person around a robot provides fundamental and valuable information that is applicable to a wide range of robotic tasks, including home service, navigation and entertainment. This paper presents a real-time robotic 3-D human tracking system which combines a monocular camera with an ultrasonic sensor by the extended Kalman filter (EKF). The proposed system consists of three sub-modules: monocular camera sensor tracking model, ultrasonic sensor tracking model and multi-sensor fusion. An improved visual tracking algorithm is presented to provide partial location estimation (2-D). The algorithm is designed to overcome severe occlusions, scale variation, target missing and achieve robust re-detection. The scale accuracy is further enhanced by the estimated 3-D information. An ultrasonic sensor array is employed to provide the range information from the target person to the robot and Gaussian Process Regression is used for partial location estimation (2-D). EKF is adopted to sequentially process multiple, heterogeneous measurements arriving in an asynchronous order from the vision sensor and the ultrasonic sensor separately. In the experiments, the proposed tracking system is tested in both simulation platform and actual mobile robot for various indoor and outdoor scenes. The experimental results show the superior performance of the 3-D tracking system in terms of both the accuracy and robustness.}
    }
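
    A stripped-down sketch of the asynchronous fusion idea: a planar target position updated by whichever measurement arrives, here a camera bearing and an ultrasonic range. The real system estimates a full 3-D state and uses Gaussian Process Regression on the sonar side; the measurement models and noise values below are illustrative assumptions only.

    # Minimal EKF updates with two heterogeneous, asynchronous measurements.
    import numpy as np

    x = np.array([2.0, 0.5])          # target position in the robot frame (m)
    P = np.eye(2)                     # state covariance

    def ekf_update(x, P, z, h, H, R):
        """Generic EKF measurement update with measurement z, model h(x), Jacobian H."""
        y = z - h(x)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        return x + K @ y, (np.eye(len(x)) - K @ H) @ P

    # Camera: bearing angle to the person.
    bearing = lambda s: np.array([np.arctan2(s[1], s[0])])
    H_bearing = lambda s: np.array([[-s[1], s[0]]]) / (s[0] ** 2 + s[1] ** 2)
    # Ultrasonic: range to the person.
    rng = lambda s: np.array([np.hypot(s[0], s[1])])
    H_range = lambda s: np.array([[s[0], s[1]]]) / np.hypot(s[0], s[1])

    x, P = ekf_update(x, P, np.array([0.30]), bearing, H_bearing(x), np.array([[0.01]]))
    x, P = ekf_update(x, P, np.array([2.10]), rng, H_range(x), np.array([[0.05]]))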

  • K. Wu, X. Li, R. Ranasinghe, G. Dissanayake, and Y. Liu, “RISAS: A novel rotation, illumination, scale invariant appearance and shape feature," in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, p. 4008–4015.
    [BibTeX] [Abstract] [DOI] [PDF]

    This paper presents a novel appearance and shape feature, RISAS, which is robust to viewpoint, illumination, scale and rotation variations. RISAS consists of a keypoint detector and a feature descriptor both of which utilise texture and geometric information present in the appearance and shape channels. A novel response function based on the surface normals is used in combination with the Harris corner detector for selecting keypoints in the scene. A strategy that uses the depth information for scale estimation and background elimination is proposed to select the neighbourhood around the keypoints in order to build precise invariant descriptors. Proposed descriptor relies on the ordering of both grayscale intensity and shape information in the neighbourhood. Comprehensive experiments which confirm the effectiveness of the proposed RGB-D feature when compared with CSHOT [1] and LOIND[2] are presented. Furthermore, we highlight the utility of incorporating texture and shape information in the design of both the detector and the descriptor by demonstrating the enhanced performance of CSHOT and LOIND when combined with RISAS detector.

    @inproceedings{wu2017risasan,
    title = {RISAS: A novel rotation, illumination, scale invariant appearance and shape feature},
    author = {Kanzhi Wu and Xiaoyang Li and Ravindra Ranasinghe and Gamini Dissanayake and Yong Liu},
    year = 2017,
    booktitle = {2017 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {4008--4015},
    doi = {https://doi.org/10.1109/icra.2017.7989461},
    abstract = {This paper presents a novel appearance and shape feature, RISAS, which is robust to viewpoint, illumination, scale and rotation variations. RISAS consists of a keypoint detector and a feature descriptor both of which utilise texture and geometric information present in the appearance and shape channels. A novel response function based on the surface normals is used in combination with the Harris corner detector for selecting keypoints in the scene. A strategy that uses the depth information for scale estimation and background elimination is proposed to select the neighbourhood around the keypoints in order to build precise invariant descriptors. Proposed descriptor relies on the ordering of both grayscale intensity and shape information in the neighbourhood. Comprehensive experiments which confirm the effectiveness of the proposed RGB-D feature when compared with CSHOT [1] and LOIND[2] are presented. Furthermore, we highlight the utility of incorporating texture and shape information in the design of both the detector and the descriptor by demonstrating the enhanced performance of CSHOT and LOIND when combined with RISAS detector.}
    }

  • X. Zuo, X. Xie, Y. Liu, and G. Huang, “Robust visual SLAM with point and line features," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, p. 1775–1782.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    In this paper, we develop a robust efficient visual SLAM system that utilizes heterogeneous point and line features. By leveraging ORB-SLAM [1], the proposed system consists of stereo matching, frame tracking, local mapping, loop detection, and bundle adjustment of both point and line features. In particular, as the main theoretical contributions of this paper, we, for the first time, employ the orthonormal representation as the minimal parameterization to model line features along with point features in visual SLAM and analytically derive the Jacobians of the re-projection errors with respect to the line parameters, which significantly improves the SLAM solution. The proposed SLAM has been extensively tested in both synthetic and real-world experiments whose results demonstrate that the proposed system outperforms the state-of-the-art methods in various scenarios.

    @inproceedings{zuo2017robustvs,
    title = {Robust visual SLAM with point and line features},
    author = {Xingxing Zuo and Xiaojia Xie and Yong Liu and Guoquan Huang},
    year = 2017,
    booktitle = {2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {1775--1782},
    doi = {https://doi.org/10.1109/IROS.2017.8205991},
    arxiv = {http://arxiv.org/pdf/1711.08654},
    abstract = {In this paper, we develop a robust efficient visual SLAM system that utilizes heterogeneous point and line features. By leveraging ORB-SLAM [1], the proposed system consists of stereo matching, frame tracking, local mapping, loop detection, and bundle adjustment of both point and line features. In particular, as the main theoretical contributions of this paper, we, for the first time, employ the orthonormal representation as the minimal parameterization to model line features along with point features in visual SLAM and analytically derive the Jacobians of the re-projection errors with respect to the line parameters, which significantly improves the SLAM solution. The proposed SLAM has been extensively tested in both synthetic and real-world experiments whose results demonstrate that the proposed system outperforms the state-of-the-art methods in various scenarios.}
    }
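
    A minimal sketch of the orthonormal line parameterization the abstract above refers to, assuming Plücker coordinates (n, v) with n ⟂ v as the starting point; this is the standard construction, not code from the paper.

    import numpy as np

    def plucker_to_orthonormal(n, v):
        """Map Plücker line coordinates (n, v), with n . v = 0, to the orthonormal
        representation (U, W) in SO(3) x SO(2): U carries 3 DoF and W carries 1,
        giving the minimal 4-DoF line parameterization."""
        n, v = np.asarray(n, dtype=float), np.asarray(v, dtype=float)
        c = np.cross(n, v)
        U = np.column_stack([n / np.linalg.norm(n),
                             v / np.linalg.norm(v),
                             c / np.linalg.norm(c)])
        s = np.hypot(np.linalg.norm(n), np.linalg.norm(v))
        W = np.array([[np.linalg.norm(n), -np.linalg.norm(v)],
                      [np.linalg.norm(v),  np.linalg.norm(n)]]) / s
        return U, W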

2016

  • Y. Liao, S. Kodagoda, Y. Wang, L. Shi, and Y. Liu, “Understand scene categories by objects: A semantic regularized scene classifier using Convolutional Neural Networks," in 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 2318–2325.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Scene classification is a fundamental perception task for environmental understanding in today’s robotics. In this paper, we have attempted to exploit the use of popular machine learning technique of deep learning to enhance scene understanding, particularly in robotics applications. As scene images have larger diversity than the iconic object images, it is more challenging for deep learning methods to automatically learn features from scene images with less samples. Inspired by human scene understanding based on object knowledge, we address the problem of scene classification by encouraging deep neural networks to incorporate object-level information. This is implemented with a regularization of semantic segmentation. With only 5 thousand training images, as opposed to 2.5 million images, we show the proposed deep architecture achieves superior scene classification results to the state-of-the-art on a publicly available SUN RGB-D dataset. In addition, performance of semantic segmentation, the regularizer, also reaches a new record with refinement derived from predicted scene labels. Finally, we apply our model trained on SUN RGB-D dataset to a set of images captured in our university using a mobile robot, demonstrating the generalization ability of the proposed algorithm.

    @inproceedings{liao2016understandsc,
    title = {Understand scene categories by objects: A semantic regularized scene classifier using Convolutional Neural Networks},
    author = {Yiyi Liao and Sarath Kodagoda and Yue Wang and Lei Shi and Yong Liu},
    year = 2016,
    booktitle = {2016 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {2318--2325},
    doi = {10.1109/ICRA.2016.7487381},
    arxiv = {https://arxiv.org/pdf/1509.06470.pdf},
    abstract = {Scene classification is a fundamental perception task for environmental understanding in today's robotics. In this paper, we have attempted to exploit the use of popular machine learning technique of deep learning to enhance scene understanding, particularly in robotics applications. As scene images have larger diversity than the iconic object images, it is more challenging for deep learning methods to automatically learn features from scene images with less samples. Inspired by human scene understanding based on object knowledge, we address the problem of scene classification by encouraging deep neural networks to incorporate object-level information. This is implemented with a regularization of semantic segmentation. With only 5 thousand training images, as opposed to 2.5 million images, we show the proposed deep architecture achieves superior scene classification results to the state-of-the-art on a publicly available SUN RGB-D dataset. In addition, performance of semantic segmentation, the regularizer, also reaches a new record with refinement derived from predicted scene labels. Finally, we apply our model trained on SUN RGB-D dataset to a set of images captured in our university using a mobile robot, demonstrating the generalization ability of the proposed algorithm.}
    }
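
    One hedged reading of the “regularization of semantic segmentation" described above is a joint objective in which a per-pixel segmentation loss regularizes the scene-classification loss; the symbols below (heads f and g, weight λ) are ours, not the paper's notation.

    % x: input image; y: scene label; Y: per-pixel semantic labels;
    % f_theta: scene-classification head; g_theta: segmentation head (shared backbone)
    \mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{scene}}\big(y,\, f_{\theta}(x)\big)
        \;+\; \lambda\, \mathcal{L}_{\mathrm{seg}}\big(Y,\, g_{\theta}(x)\big)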

  • M. Wang, Y. Liu, and R. Xiong, “Robust object tracking with a hierarchical ensemble framework," in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 438–445.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]

    Autonomous robots enjoy a wide popularity nowadays and have been applied in many applications, such as home security, entertainment, delivery, navigation and guidance. It is vital for robots to track objects accurately in real time in these applications, so it is necessary to focus on tracking algorithms to improve the robustness, speed and accuracy. In this paper, we propose a real-time robust object tracking algorithm based on a hierarchical ensemble framework which incorporates information including individual pixel features, local patches and holistic target models. The framework combines multiple ensemble models simultaneously instead of using a single ensemble model individually. A discriminative model which accounts for the matching degree of local patches is adopted via a bottom ensemble layer, and a generative model which exploits holistic templates is used to search for the object based on the middle ensemble layer as well as an adaptive Kalman filter. We test the proposed tracker on challenging benchmark image sequences. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms, especially when the appearance changes dramatically and the occlusions occur.

    @inproceedings{wang2016robustot,
    title = {Robust object tracking with a hierarchical ensemble framework},
    author = {Mengmeng Wang and Yong Liu and Rong Xiong},
    year = 2016,
    booktitle = {2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {438--445},
    doi = {10.1109/IROS.2016.7759091},
    arxiv = {http://arxiv.org/pdf/1509.06925},
    abstract = {Autonomous robots enjoy a wide popularity nowadays and have been applied in many applications, such as home security, entertainment, delivery, navigation and guidance. It is vital for robots to track objects accurately in real time in these applications, so it is necessary to focus on tracking algorithms to improve the robustness, speed and accuracy. In this paper, we propose a real-time robust object tracking algorithm based on a hierarchical ensemble framework which incorporates information including individual pixel features, local patches and holistic target models. The framework combines multiple ensemble models simultaneously instead of using a single ensemble model individually. A discriminative model which accounts for the matching degree of local patches is adopted via a bottom ensemble layer, and a generative model which exploits holistic templates is used to search for the object based on the middle ensemble layer as well as an adaptive Kalman filter. We test the proposed tracker on challenging benchmark image sequences. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms, especially when the appearance changes dramatically and the occlusions occur.}
    }
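
    The middle ensemble layer above is combined with an adaptive Kalman filter; as a rough illustration of that filtering component only, here is a generic constant-velocity predict/update step (NumPy), not the paper's adaptive variant.

    import numpy as np

    def kalman_cv_step(x, P, z, dt=1.0, q=1e-2, r=1.0):
        """One predict/update cycle of a constant-velocity Kalman filter on a 2-D
        target centre (state x = [cx, cy, vx, vy]); a generic stand-in for the
        adaptive filter used by the tracker, not its exact formulation."""
        F = np.array([[1, 0, dt, 0],
                      [0, 1, 0, dt],
                      [0, 0, 1,  0],
                      [0, 0, 0,  1]], dtype=float)
        H = np.array([[1, 0, 0, 0],
                      [0, 1, 0, 0]], dtype=float)
        Q, R = q * np.eye(4), r * np.eye(2)
        x, P = F @ x, F @ P @ F.T + Q                     # predict
        S = H @ P @ H.T + R                               # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)                    # Kalman gain
        x = x + K @ (np.asarray(z, dtype=float) - H @ x)  # update with measured centre z
        P = (np.eye(4) - K @ H) @ P
        return x, P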

2015

  • G. Feng, Y. Liu, and Y. Liao, “LOIND: An illumination and scale invariant RGB-D descriptor," in 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 1893–1898.
    [BibTeX] [Abstract] [DOI] [PDF]

    We introduce a novel RGB-D descriptor called local ordinal intensity and normal descriptor (LOIND) with the integration of texture information in RGB image and geometric information in depth image. We implement the descriptor with a 3-D histogram supported by orders of intensities and angles between normal vectors, in addition with the spatial sub-divisions. The former ordering information which is invariant under the transformation of illumination, scale and rotation provides the robustness of our descriptor, while the latter spatial distribution provides higher information capacity so that the discriminative performance is promoted. Comparable experiments with the state-of-art descriptors, e.g. SIFT, SURF, CSHOT and BRAND, show the effectiveness of our LOIND to the complex illumination changes and scale transformation. We also provide a new method to estimate the dominant orientation with only the geometric information, which can ensure the rotation invariance under extremely poor illumination.

    @inproceedings{feng2015loindai,
    title = {LOIND: An illumination and scale invariant RGB-D descriptor},
    author = {Guanghua Feng and Yong Liu and Yiyi Liao},
    year = 2015,
    booktitle = {2015 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {1893--1898},
    doi = {10.1109/ICRA.2015.7139445},
    abstract = {We introduce a novel RGB-D descriptor called local ordinal intensity and normal descriptor (LOIND) with the integration of texture information in RGB image and geometric information in depth image. We implement the descriptor with a 3-D histogram supported by orders of intensities and angles between normal vectors, in addition with the spatial sub-divisions. The former ordering information which is invariant under the transformation of illumination, scale and rotation provides the robustness of our descriptor, while the latter spatial distribution provides higher information capacity so that the discriminative performance is promoted. Comparable experiments with the state-of-art descriptors, e.g. SIFT, SURF, CSHOT and BRAND, show the effectiveness of our LOIND to the complex illumination changes and scale transformation. We also provide a new method to estimate the dominant orientation with only the geometric information, which can ensure the rotation invariance under extremely poor illumination.}
    }
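
    As a rough illustration of the 3-D histogram idea above (intensity order × normal angle × spatial sub-division), the toy function below bins intensity ranks and normal angles over a keypoint neighbourhood; the bin counts and the caller-supplied spatial partition are assumptions, not LOIND's exact layout.

    import numpy as np

    def ordinal_normal_histogram(intensity, normals, ref_normal, spatial_bin,
                                 n_spatial=4, n_rank=6, n_angle=6):
        """Toy 3-D histogram in the spirit of LOIND: (spatial sub-region,
        intensity rank, angle to a reference normal). Illustrative only."""
        intensity = np.asarray(intensity, dtype=float)
        rank = np.argsort(np.argsort(intensity)) / max(len(intensity) - 1, 1)
        angle = np.arccos(np.clip(np.asarray(normals) @ np.asarray(ref_normal),
                                  -1.0, 1.0)) / np.pi
        sample = np.column_stack([spatial_bin, rank, angle])
        hist, _ = np.histogramdd(sample,
                                 bins=(n_spatial, n_rank, n_angle),
                                 range=((-0.5, n_spatial - 0.5), (0.0, 1.0), (0.0, 1.0)))
        return hist.ravel() / max(hist.sum(), 1.0)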

  • Y. Wang, J. Cai, Y. Wang, Y. Hu, R. Xiong, Y. Liu, J. Zhang, and L. Qi, “Probabilistic graph based spatial assembly relation inference for programming of assembly task by demonstration," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 4402–4407.
    [BibTeX] [Abstract] [DOI] [PDF]

    In robot programming by demonstration (PBD) for assembly tasks, one of the important topics is to inference the poses and spatial relations of parts during the demonstration. In this paper, we propose a world model called assembly graph (AG) to achieve this task. The model is able to represent the poses of all parts, the relations, observations provided by vision techniques and prior knowledge in a unified probabilistic graph. Then the problem is stated as likelihood maximization estimation of pose parameters with the relations being the latent variables. Classification expectation maximization algorithm (CEM) is employed to solve the model. Besides, the contradiction between relations is incorporated as prior knowledge to better shape the posterior, thus guiding the algorithm find a more accurate solution. In experiments, both simulated and real world datasets are applied to evaluate the performance of our proposed method. The experimental results show that the AG gives better accuracy than the relations as deterministic variables (RDV) employed in some previous works due to the robustness and global consistency. Finally, the solution is implemented into a PBD system with ABB industrial robotic arm simulator as the execution stage, succeeding in real world captured assembly tasks.

    @inproceedings{wang2015probabilisticgb,
    title = {Probabilistic graph based spatial assembly relation inference for programming of assembly task by demonstration},
    author = {Yue Wang and Jie Cai and Yabiao Wang and Youzhong Hu and Rong Xiong and Yong Liu and Jiafan Zhang and Liwei Qi},
    year = 2015,
    booktitle = {2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {4402--4407},
    doi = {10.1109/IROS.2015.7354002},
    abstract = {In robot programming by demonstration (PBD) for assembly tasks, one of the important topics is to inference the poses and spatial relations of parts during the demonstration. In this paper, we propose a world model called assembly graph (AG) to achieve this task. The model is able to represent the poses of all parts, the relations, observations provided by vision techniques and prior knowledge in a unified probabilistic graph. Then the problem is stated as likelihood maximization estimation of pose parameters with the relations being the latent variables. Classification expectation maximization algorithm (CEM) is employed to solve the model. Besides, the contradiction between relations is incorporated as prior knowledge to better shape the posterior, thus guiding the algorithm find a more accurate solution. In experiments, both simulated and real world datasets are applied to evaluate the performance of our proposed method. The experimental results show that the AG gives better accuracy than the relations as deterministic variables (RDV) employed in some previous works due to the robustness and global consistency. Finally, the solution is implemented into a PBD system with ABB industrial robotic arm simulator as the execution stage, succeeding in real world captured assembly tasks.}
    }
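
    For reference, one iteration of the classification EM (CEM) algorithm mentioned above, written in generic notation (poses θ, latent relations c, observations X; symbols ours, not the paper's):

    % C-step: hard-assign the latent relation labels given the current poses
    c^{(t)} \;=\; \arg\max_{c}\; p\big(c \mid X,\, \theta^{(t)}\big)
    % M-step: re-estimate the pose parameters with those relations held fixed
    \theta^{(t+1)} \;=\; \arg\max_{\theta}\; p\big(X,\, c^{(t)} \mid \theta\big)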

  • Q. Zhang, Y. Liu, Y. Liao, and Y. Wang, “Traversable region detection with a learning framework," in 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 1678–1683.
    [BibTeX] [Abstract] [DOI] [PDF]

    In this paper, we present a novel learning framework for traversable region detection. Firstly, we construct features from the super-pixel level which can reduce the computational cost compared to pixel level. Multi-scale super-pixels are extracted to give consideration to both outline and detail information. Then we classify the multiple-scale super-pixels and merge the labels in pixel level. Meanwhile, we use weighted ELM as our classifier which can deal with the imbalanced class distribution since we only assume that a small region in front of robot is traversable at the beginning of learning. Finally, we employ the online learning process so that our framework can be adaptive to varied scenes. Experimental results on three different style of image sequences, i.e. shadow road, rain sequence and variational sequence, demonstrate the adaptability, stability and parameter insensitivity of our method to the varied scenes and complex illumination.

    @inproceedings{zhang2015traversablerd,
    title = {Traversable region detection with a learning framework},
    author = {Qinquan Zhang and Yong Liu and Yiyi Liao and Yue Wang},
    year = 2015,
    booktitle = {2015 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {1678--1683},
    doi = {10.1109/ICRA.2015.7139413},
    abstract = {In this paper, we present a novel learning framework for traversable region detection. Firstly, we construct features from the super-pixel level which can reduce the computational cost compared to pixel level. Multi-scale super-pixels are extracted to give consideration to both outline and detail information. Then we classify the multiple-scale super-pixels and merge the labels in pixel level. Meanwhile, we use weighted ELM as our classifier which can deal with the imbalanced class distribution since we only assume that a small region in front of robot is traversable at the beginning of learning. Finally, we employ the online learning process so that our framework can be adaptive to varied scenes. Experimental results on three different style of image sequences, i.e. shadow road, rain sequence and variational sequence, demonstrate the adaptability, stability and parameter insensitivity of our method to the varied scenes and complex illumination.}
    }
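
    The weighted ELM classifier above has a closed-form output layer; the sketch below follows the common weighted-ELM formulation (per-sample weights w against class imbalance, ridge parameter C) and is a stand-in for, not a reproduction of, the paper's setup.

    import numpy as np

    def weighted_elm_fit(X, T, w, n_hidden=200, C=1.0, seed=0):
        """Weighted extreme-learning-machine fit: a random tanh hidden layer and a
        ridge-regularized, per-sample-weighted least-squares output layer.
        Hyper-parameters are arbitrary illustrative values."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        A = rng.standard_normal((X.shape[1], n_hidden))
        b = rng.standard_normal(n_hidden)
        H = np.tanh(X @ A + b)                        # hidden-layer activations
        Hw = H * np.asarray(w, dtype=float)[:, None]  # row-weighted activations (W @ H)
        beta = np.linalg.solve(H.T @ Hw + np.eye(n_hidden) / C, Hw.T @ T)
        return A, b, beta                             # predict: np.tanh(Xq @ A + b) @ beta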

2014

  • Y. Liao, Y. Wang, and Y. Liu, “Image Representation Learning Using Graph Regularized Auto-Encoders," in 2nd International Conference on Learning Representations (ICLR), 2014.
    [BibTeX] [Abstract] [arXiv] [PDF]

    We consider the problem of image representation for the tasks of unsupervised learning and semi-supervised learning. In those learning tasks, the raw image vectors may not provide enough representation for their intrinsic structures due to their highly dense feature space. To overcome this problem, the raw image vectors should be mapped to a proper representation space which can capture the latent structure of the original data and represent the data explicitly for further learning tasks such as clustering. Inspired by the recent research works on deep neural network and representation learning, in this paper, we introduce the multiple-layer auto-encoder into image representation, we also apply the locally invariant ideal to our image representation with auto-encoders and propose a novel method, called Graph regularized Auto-Encoder (GAE). GAE can provide a compact representation which uncovers the hidden semantics and simultaneously respects the intrinsic geometric structure. Extensive experiments on image clustering show encouraging results of the proposed algorithm in comparison to the state-of-the-art algorithms on real-word cases.

    @inproceedings{liao2014imagerl,
    title = {Image Representation Learning Using Graph Regularized Auto-Encoders},
    author = {Yiyi Liao and Yue Wang and Yong Liu},
    year = 2014,
    booktitle = {2nd International Conference on Learning Representations (ICLR)},
    arxiv = {https://arxiv.org/pdf/1312.0786.pdf},
    abstract = {We consider the problem of image representation for the tasks of unsupervised learning and semi-supervised learning. In those learning tasks, the raw image vectors may not provide enough representation for their intrinsic structures due to their highly dense feature space. To overcome this problem, the raw image vectors should be mapped to a proper representation space which can capture the latent structure of the original data and represent the data explicitly for further learning tasks such as clustering. Inspired by the recent research works on deep neural network and representation learning, in this paper, we introduce the multiple-layer auto-encoder into image representation, we also apply the locally invariant ideal to our image representation with auto-encoders and propose a novel method, called Graph regularized Auto-Encoder (GAE). GAE can provide a compact representation which uncovers the hidden semantics and simultaneously respects the intrinsic geometric structure. Extensive experiments on image clustering show encouraging results of the proposed algorithm in comparison to the state-of-the-art algorithms on real-word cases.}
    }
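
    The GAE objective described above can be summarized as reconstruction plus a graph-Laplacian penalty on the hidden codes; the notation below (encoder f, decoder g, similarity graph S, Laplacian L = D - S, weight λ) is generic rather than the paper's.

    % X: raw images; f/g: encoder/decoder; H = f(X): hidden codes;
    % S: neighbourhood similarity graph; L = D - S: graph Laplacian
    \min_{\theta}\; \big\lVert X - g_{\theta}\big(f_{\theta}(X)\big) \big\rVert_F^2
        \;+\; \lambda\, \operatorname{tr}\big(H^{\top} L\, H\big),
    \qquad \operatorname{tr}\big(H^{\top} L H\big)
        \;=\; \tfrac{1}{2} \sum_{i,j} S_{ij}\, \lVert h_i - h_j \rVert^2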

  • Q. Xie, Y. Liu, R. Xiong, and J. Chu, “Real-time accurate ball trajectory estimation with “asynchronous” stereo camera system for humanoid Ping-Pong robot," in 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 6212–6217.
    [BibTeX] [Abstract] [DOI] [PDF]

    Temporal asynchrony between two cameras in the vision system is a usual problem in practice. In some vision task such as estimating fast moving targets, the estimation error caused by the tiny temporal asynchrony will become non-ignorable essentials. This paper will address on the asynchrony in the stereo vision system of humanoid Ping-Pong robot, and present a real-time accurate Ping-Pong ball trajectory estimation algorithm. In our approach, the complex Ping-Pong ball motion model is simplified by a polynomial parameter function of time t due to the limited observing time interval and the requirement of real-time computation. We then use the perspective projection camera model to re-project the ball’s parameter function on time t into its image coordinates on both cameras. Based on the assumption that the time gap of two asynchronous cameras will maintain a const during very short time interval, we can obtain the time gap value and also the trajectory parameters of the Ping-Pong ball in a short time interval by minimizing the errors between the images of the ball in each camera and their re-projection images from the modeled parameter function on time t. Comprehensive experiments on real Ping-Pong robot cases are carried out, the results show our approach is more proper for the vision system of humanoid Ping-Pong robot, when concerning the accuracy and real-time performance simultaneously.

    @inproceedings{xie2014realtimeab,
    title = {Real-time accurate ball trajectory estimation with “asynchronous” stereo camera system for humanoid Ping-Pong robot},
    author = {Qi Xie and Yong Liu and Rong Xiong and Jian Chu},
    year = 2014,
    booktitle = {2014 IEEE International Conference on Robotics and Automation (ICRA)},
    pages = {6212--6217},
    doi = {10.1109/ICRA.2014.6907775},
    abstract = {Temporal asynchrony between two cameras in the vision system is a usual problem in practice. In some vision task such as estimating fast moving targets, the estimation error caused by the tiny temporal asynchrony will become non-ignorable essentials. This paper will address on the asynchrony in the stereo vision system of humanoid Ping-Pong robot, and present a real-time accurate Ping-Pong ball trajectory estimation algorithm. In our approach, the complex Ping-Pong ball motion model is simplified by a polynomial parameter function of time t due to the limited observing time interval and the requirement of real-time computation. We then use the perspective projection camera model to re-project the ball's parameter function on time t into its image coordinates on both cameras. Based on the assumption that the time gap of two asynchronous cameras will maintain a const during very short time interval, we can obtain the time gap value and also the trajectory parameters of the Ping-Pong ball in a short time interval by minimizing the errors between the images of the ball in each camera and their re-projection images from the modeled parameter function on time t. Comprehensive experiments on real Ping-Pong robot cases are carried out, the results show our approach is more proper for the vision system of humanoid Ping-Pong robot, when concerning the accuracy and real-time performance simultaneously.}
    }
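
    A compact statement of the estimation problem above, assuming a degree-K polynomial ball model p(t), camera projections π_L and π_R, image observations u, and an unknown constant time gap δ between the two cameras (symbols ours):

    \mathbf{p}(t) \;=\; \sum_{k=0}^{K} \mathbf{a}_k\, t^{k},
    \qquad
    \min_{\{\mathbf{a}_k\},\, \delta}\; \sum_{i}
        \big\lVert \mathbf{u}^{L}_{i} - \pi_{L}\big(\mathbf{p}(t_i)\big) \big\rVert^2
        \;+\; \big\lVert \mathbf{u}^{R}_{i} - \pi_{R}\big(\mathbf{p}(t_i + \delta)\big) \big\rVert^2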

2011

  • R. Xiong, L. Yong, and H. Zheng, “A humanoid robot for table tennis playing," in 2011 IEEE Workshop on Advanced Robotics and Its Social Impacts (ARSO), 2011, pp. 66–67.
    [BibTeX] [Abstract] [DOI] [PDF]

    Humanoid robot has been one of the most active research topics in the field of robotics. Their human-like form and configuration gives it advantages in working in human-interactive environment. The bipedal walking capability makes them possible to step over and onto obstacles, providing accessibility and mobility in cluttered space. The multi-DOF design of arms and legs enables them assist or replace humans in their normal tasks, making human life easier and safer. Humanoid robots, with their human-like outlook, also bring better interactive experience and are expect to play a part in people’s daily life and help the elderly and the children.

    @inproceedings{xiong2011ahr,
    title = {A humanoid robot for table tennis playing},
    author = {Rong Xiong and Long Yong and Hongbo Zheng},
    year = 2011,
    booktitle = {2011 IEEE Workshop on Advanced Robotics and Its Social Impacts (ARSO)},
    pages = {66--67},
    doi = {10.1109/ARSO.2011.6301960},
    abstract = {Humanoid robot has been one of the most active research topics in the field of robotics. Their human-like form and configuration gives it advantages in working in human-interactive environment. The bipedal walking capability makes them possible to step over and onto obstacles, providing accessibility and mobility in cluttered space. The multi-DOF design of arms and legs enables them assist or replace humans in their normal tasks, making human life easier and safer. Humanoid robots, with their human-like outlook, also bring better interactive experience and are expect to play a part in people's daily life and help the elderly and the children.}
    }

2018

  • Y. Liao and Y. Liu, Regularized Deep Learning and Its Applications in Robot Environment Perception (正则化深度学习及其在机器人环境感知中的应用), Science Press (科学出版社), 2018.
    [BibTeX]
    @book{正则化深度学习及其在机器人环境感知中的应用,
    title = {正则化深度学习及其在机器人环境感知中的应用},
    author = {刘勇 and 廖依伊},
    year = 2018,
    publisher = {科学出版社}
    }