• Dense 3D reconstruction

    Dense 3D reconstruction of the Kaist-Urban-07 dataset, obtained by simply assembling 2D LiDAR scans from a SICK LMS-511 along the continuous-time trajectory estimated by CLINS.

  • Time-lapse Video Generation

    In this paper, we propose DTVNet, a novel end-to-end one-stage framework for dynamic time-lapse video generation that produces diversified time-lapse videos from a single landscape image.
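
The dense-reconstruction highlight above boils down to projecting each 2D scan point into the world frame using the pose interpolated at that point's own timestamp. Below is a minimal planar sketch of that idea; CLINS itself fits a continuous-time B-spline trajectory on SE(3), so the linear pose interpolation and all function names here are illustrative assumptions, not the actual implementation:

```python
import numpy as np

def interpolate_pose(t, t0, t1, p0, p1, yaw0, yaw1):
    """Linearly interpolate a planar pose (translation + yaw) at time t.
    A stand-in for the continuous-time spline trajectory used by CLINS."""
    a = (t - t0) / (t1 - t0)
    return (1 - a) * p0 + a * p1, (1 - a) * yaw0 + a * yaw1

def scan_to_world(ranges, bearings, stamps, key_times, key_pos, key_yaw):
    """Project each 2D range/bearing sample into the world frame using
    the pose interpolated at that sample's own timestamp."""
    pts = []
    for r, b, t in zip(ranges, bearings, stamps):
        # locate the bracketing pose keyframes for this timestamp
        i = int(np.clip(np.searchsorted(key_times, t) - 1, 0, len(key_times) - 2))
        p, yaw = interpolate_pose(t, key_times[i], key_times[i + 1],
                                  key_pos[i], key_pos[i + 1],
                                  key_yaw[i], key_yaw[i + 1])
        # point in the planar sensor frame, then rotate by yaw and translate
        local = np.array([r * np.cos(b), r * np.sin(b), 0.0])
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        pts.append(R @ local + p)
    return np.array(pts)
```

Because every point gets its own interpolated pose, motion distortion within a scan is compensated, which is what makes the naive "assemble the scans" approach produce a clean dense map.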

About the Research Group

Welcome to the website of the APRIL Lab led by Prof. Yong Liu. Our lab was founded in December 2011 and is part of the Institute of Cyber-Systems and Control at Zhejiang University.

Our mission is to investigate the fundamental challenges and practical applications of robotics and computer vision for the benefit of all humanity. Our main interests encompass the areas of deep learning, computer vision, SLAM, and robotics.


Representative Publications

  • J. Mei, Y. Yang, M. Wang, J. Zhu, J. Ra, Y. Ma, L. Li, and Y. Liu, “Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network," IEEE Transactions on Image Processing, vol. 33, pp. 5468-5481, 2024.

    Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. Even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.

    @article{mei2024cbs,
    title = {Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network},
    author = {Jianbiao Mei and Yu Yang and Mengmeng Wang and Junyu Zhu and Jongwon Ra and Yukai Ma and Laijian Li and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Image Processing},
    volume = 33,
    pages = {5468-5481},
    doi = {10.1109/TIP.2024.3461989},
    abstract = {Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. Even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SemanticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.}
    }

  • W. Liu, B. Zhang, T. Liu, J. Jiang, and Y. Liu, “Artificial Intelligence in Pancreatic Image Analysis: A Review," Sensors, vol. 24, p. 4749, 2024.

    Pancreatic cancer is a highly lethal disease with a poor prognosis. Its early diagnosis and accurate treatment mainly rely on medical imaging, so accurate medical image analysis is especially vital for pancreatic cancer patients. However, medical image analysis of pancreatic cancer is facing challenges due to ambiguous symptoms, high misdiagnosis rates, and significant financial costs. Artificial intelligence (AI) offers a promising solution by relieving medical personnel’s workload, improving clinical decision-making, and reducing patient costs. This study focuses on AI applications such as segmentation, classification, object detection, and prognosis prediction across five types of medical imaging: CT, MRI, EUS, PET, and pathological images, as well as integrating these imaging modalities to boost diagnostic accuracy and treatment efficiency. In addition, this study discusses current hot topics and future directions aimed at overcoming the challenges in AI-enabled automated pancreatic cancer diagnosis algorithms.

    @article{jiang2024aii,
    title = {Artificial Intelligence in Pancreatic Image Analysis: A Review},
    author = {Weixuan Liu and Bairui Zhang and Tao Liu and Juntao Jiang and Yong Liu},
    year = 2024,
    journal = {Sensors},
    volume = 24,
    pages = {4749},
    doi = {10.3390/s24144749},
    abstract = {Pancreatic cancer is a highly lethal disease with a poor prognosis. Its early diagnosis and accurate treatment mainly rely on medical imaging, so accurate medical image analysis is especially vital for pancreatic cancer patients. However, medical image analysis of pancreatic cancer is facing challenges due to ambiguous symptoms, high misdiagnosis rates, and significant financial costs. Artificial intelligence (AI) offers a promising solution by relieving medical personnel's workload, improving clinical decision-making, and reducing patient costs. This study focuses on AI applications such as segmentation, classification, object detection, and prognosis prediction across five types of medical imaging: CT, MRI, EUS, PET, and pathological images, as well as integrating these imaging modalities to boost diagnostic accuracy and treatment efficiency. In addition, this study discusses current hot topics and future directions aimed at overcoming the challenges in AI-enabled automated pancreatic cancer diagnosis algorithms.}
    }

  • H. Li, Y. Ma, Y. Huang, Y. Gu, W. Xu, Y. Liu, and X. Zuo, “RIDERS: Radar-Infrared Depth Estimation for Robust Sensing," IEEE Transactions on Intelligent Transportation Systems, 2024.

    Dense depth recovery is crucial in autonomous driving, serving as a foundational element for obstacle avoidance, 3D object detection, and local path planning. Adverse weather conditions, including haze, dust, rain, snow, and darkness, introduce significant challenges to accurate dense depth estimation, thereby posing substantial safety risks in autonomous driving. These challenges are particularly pronounced for traditional depth estimation methods that rely on short electromagnetic wave sensors, such as visible spectrum cameras and near-infrared LiDAR, due to their susceptibility to diffraction noise and occlusion in such environments. To fundamentally overcome this issue, we present a novel approach for robust metric depth estimation by fusing a millimeter-wave radar and a monocular infrared thermal camera, which are capable of penetrating atmospheric particles and are unaffected by lighting conditions. Our proposed Radar-Infrared fusion method achieves highly accurate and finely detailed dense depth estimation through three stages, including monocular depth prediction with global scale alignment, quasi-dense radar augmentation by learning radar-pixel correspondences, and local scale refinement of dense depth using a scale map learner. Our method achieves exceptional visual quality and accurate metric estimation by addressing the challenges of ambiguity and misalignment that arise from directly fusing multi-modal long-wave features. We evaluate the performance of our approach on the NTU4DRadLM dataset and our self-collected challenging ZJU-Multispectrum dataset. Especially noteworthy is the unprecedented robustness demonstrated by our proposed method in smoky scenarios.

    @article{li2024riders,
    title = {RIDERS: Radar-Infrared Depth Estimation for Robust Sensing},
    author = {Han Li and Yukai Ma and Yuehao Huang and Yaqing Gu and Weihua Xu and Yong Liu and Xingxing Zuo},
    year = 2024,
    journal = {IEEE Transactions on Intelligent Transportation Systems},
    doi = {10.1109/TITS.2024.3432996},
    abstract = {Dense depth recovery is crucial in autonomous driving, serving as a foundational element for obstacle avoidance, 3D object detection, and local path planning. Adverse weather conditions, including haze, dust, rain, snow, and darkness, introduce significant challenges to accurate dense depth estimation, thereby posing substantial safety risks in autonomous driving. These challenges are particularly pronounced for traditional depth estimation methods that rely on short electromagnetic wave sensors, such as visible spectrum cameras and near-infrared LiDAR, due to their susceptibility to diffraction noise and occlusion in such environments. To fundamentally overcome this issue, we present a novel approach for robust metric depth estimation by fusing a millimeter-wave radar and a monocular infrared thermal camera, which are capable of penetrating atmospheric particles and are unaffected by lighting conditions. Our proposed Radar-Infrared fusion method achieves highly accurate and finely detailed dense depth estimation through three stages, including monocular depth prediction with global scale alignment, quasi-dense radar augmentation by learning radar-pixel correspondences, and local scale refinement of dense depth using a scale map learner. Our method achieves exceptional visual quality and accurate metric estimation by addressing the challenges of ambiguity and misalignment that arise from directly fusing multi-modal long-wave features. We evaluate the performance of our approach on the NTU4DRadLM dataset and our self-collected challenging ZJU-Multispectrum dataset. Especially noteworthy is the unprecedented robustness demonstrated by our proposed method in smoky scenarios.}
    }
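
The "global scale alignment" stage in the RIDERS pipeline above can be made concrete with the classic closed-form least-squares fit of a single scale factor between a monocular depth prediction and sparse radar returns. This toy sketch is an assumption about the general technique, not the authors' implementation, and all names are ours:

```python
import numpy as np

def global_scale_align(pred_depth, radar_depth, valid):
    """Closed-form scale s minimizing ||s * pred - radar||^2 over valid
    radar hits: s = <pred, radar> / <pred, pred>."""
    p = pred_depth[valid]
    d = radar_depth[valid]
    return float(p @ d / (p @ p))
```

A monocular network's depth is typically correct only up to scale; multiplying the whole prediction by `s` anchors it to the metric radar measurements before the finer per-region refinement stages.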

  • L. Peng, R. Cai, J. Xiang, J. Zhu, W. Liu, W. Gao, and Y. Liu, “LiteGrasp: A Light Robotic Grasp Detection via Semi-Supervised Knowledge Distillation," IEEE Robotics and Automation Letters, vol. 9, pp. 7995-8002, 2024.

    Grasping detection from single images in robotic applications poses a significant challenge. While contemporary deep learning techniques excel, their success often hinges on large annotated datasets and intricate network architectures. In this letter, we present LiteGrasp, a novel semi-supervised lightweight framework purpose-built for grasp detection, eliminating the necessity for exhaustive supervision and intricate networks. Our approach uses a limited amount of labeled data via a knowledge distillation method, introducing HRGrasp-Net, a model with high efficiency for extracting features and largely based on HRNet. We incorporate pseudo-label filtering within a mutual learning model set within a teacher-student paradigm. This enhances the transference of data from images with labels to those without. Additionally, we introduce the streamlined Lite HRGrasp-Net, acting as the student network which gains further distillation knowledge using a multi-level fusion cascade originating from HRGrasp-Net. Impressively, LiteGrasp thrives with just a fraction (4.3%) of HRGrasp-Net’s original model size, and with limited labeled data relative to total data (25% ratio) across all benchmarks, regularly outperforming solely supervised and semi-supervised models. Taking just 6 ms for execution, LiteGrasp showcases exceptional accuracy (99.99% and 97.21% on Cornell and Jacquard data sets respectively), as well as an impressive 95.3% rate of success in grasping when deployed using a 6DoF UR5e robotic arm. These highlights underscore the effectiveness and efficiency of LiteGrasp for grasp detection, even under resource-limited conditions.

    @article{peng2024lal,
    title = {LiteGrasp: A Light Robotic Grasp Detection via Semi-Supervised Knowledge Distillation},
    author = {Linpeng Peng and Rongyao Cai and Jingyang Xiang and Junyu Zhu and Weiwei Liu and Wang Gao and Yong Liu},
    year = 2024,
    journal = {IEEE Robotics and Automation Letters},
    volume = 9,
    pages = {7995-8002},
    doi = {10.1109/LRA.2024.3436336},
    abstract = {Grasping detection from single images in robotic applications poses a significant challenge. While contemporary deep learning techniques excel, their success often hinges on large annotated datasets and intricate network architectures. In this letter, we present LiteGrasp, a novel semi-supervised lightweight framework purpose-built for grasp detection, eliminating the necessity for exhaustive supervision and intricate networks. Our approach uses a limited amount of labeled data via a knowledge distillation method, introducing HRGrasp-Net, a model with high efficiency for extracting features and largely based on HRNet. We incorporate pseudo-label filtering within a mutual learning model set within a teacher-student paradigm. This enhances the transference of data from images with labels to those without. Additionally, we introduce the streamlined Lite HRGrasp-Net, acting as the student network which gains further distillation knowledge using a multi-level fusion cascade originating from HRGrasp-Net. Impressively, LiteGrasp thrives with just a fraction (4.3%) of HRGrasp-Net's original model size, and with limited labeled data relative to total data (25% ratio) across all benchmarks, regularly outperforming solely supervised and semi-supervised models. Taking just 6 ms for execution, LiteGrasp showcases exceptional accuracy (99.99% and 97.21% on Cornell and Jacquard data sets respectively), as well as an impressive 95.3% rate of success in grasping when deployed using a 6DoF UR5e robotic arm. These highlights underscore the effectiveness and efficiency of LiteGrasp for grasp detection, even under resource-limited conditions.}
    }
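
The teacher-student transfer at the core of the LiteGrasp entry above is usually driven by a soft-target distillation loss. Here is a generic Hinton-style sketch in NumPy; LiteGrasp's actual objective (with pseudo-label filtering and multi-level fusion) is richer, and the function names are our assumptions:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by
    T^2 so gradients keep the same magnitude as hard-label training."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1)
    return float(kl.mean() * T * T)
```

The temperature softens the teacher's distribution so the student also learns the relative ordering of the non-argmax classes ("dark knowledge"), which is what lets a 4.3%-sized student approach the teacher's accuracy.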

  • R. Cai, W. Gao, L. Peng, Z. Lu, K. Zhang, and Y. Liu, “Debiased Contrastive Learning With Supervision Guidance for Industrial Fault Detection," IEEE Transactions on Industrial Informatics, 2024.

    The time series self-supervised contrastive learning framework has succeeded significantly in industrial fault detection scenarios. It typically consists of pretraining on abundant unlabeled data and fine-tuning on limited annotated data. However, the two-phase framework faces three challenges: Sampling bias, task-agnostic representation issue, and angular-centricity issue. These challenges hinder further development in industrial applications. This article introduces a debiased contrastive learning with supervision guidance (DCLSG) framework and applies it to industrial fault detection tasks. First, DCLSG employs channel augmentation to integrate temporal and frequency domain information. Pseudolabels based on momentum clustering operation are assigned to extracted representations, thereby mitigating the sampling bias raised by the selection of positive pairs. Second, the generated supervisory signal guides the pretraining phase, tackling the task-agnostic representation issue. Third, the angular-centricity issue is addressed using the proposed Gaussian distance metric measuring the radial distribution of representations. The experiments conducted on three industrial datasets (ISDB, CWRU, and practical datasets) validate the superior performance of the DCLSG compared to other fault detection methods.

    @article{cai2024dcl,
    title = {Debiased Contrastive Learning With Supervision Guidance for Industrial Fault Detection},
    author = {Rongyao Cai and Wang Gao and Linpeng Peng and Zhengming Lu and Kexin Zhang and Yong Liu},
    year = 2024,
    journal = {IEEE Transactions on Industrial Informatics},
    doi = {10.1109/TII.2024.3424561},
    abstract = {The time series self-supervised contrastive learning framework has succeeded significantly in industrial fault detection scenarios. It typically consists of pretraining on abundant unlabeled data and fine-tuning on limited annotated data. However, the two-phase framework faces three challenges: Sampling bias, task-agnostic representation issue, and angular-centricity issue. These challenges hinder further development in industrial applications. This article introduces a debiased contrastive learning with supervision guidance (DCLSG) framework and applies it to industrial fault detection tasks. First, DCLSG employs channel augmentation to integrate temporal and frequency domain information. Pseudolabels based on momentum clustering operation are assigned to extracted representations, thereby mitigating the sampling bias raised by the selection of positive pairs. Second, the generated supervisory signal guides the pretraining phase, tackling the task-agnostic representation issue. Third, the angular-centricity issue is addressed using the proposed Gaussian distance metric measuring the radial distribution of representations. The experiments conducted on three industrial datasets (ISDB, CWRU, and practical datasets) validate the superior performance of the DCLSG compared to other fault detection methods.}
    }
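
DCLSG's debiasing idea above, treating samples that share a clustering pseudolabel as positives instead of only the augmented twin, can be sketched with a supervised-contrastive-style loss. This toy NumPy version omits the paper's channel augmentation and Gaussian distance metric, and every name here is our assumption:

```python
import numpy as np

def pseudo_label_contrastive_loss(z, pseudo, temp=0.5):
    """Contrastive loss where the positives for sample i are all samples
    sharing its pseudolabel, mitigating the false negatives (sampling
    bias) of plain InfoNCE."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit embeddings
    sim = z @ z.T / temp                              # cosine similarities
    n = len(z)
    loss = 0.0
    for i in range(n):
        mask = (pseudo == pseudo[i])
        mask[i] = False                               # exclude self
        if not mask.any():
            continue
        log_den = np.log(np.exp(np.delete(sim[i], i)).sum())
        loss += (log_den - sim[i][mask]).mean()       # -log p(positive)
    return float(loss / n)
```

With plain InfoNCE, a same-class sample drawn as a "negative" is pushed away; the pseudolabel mask turns those samples into positives instead, which is exactly the sampling bias the paper targets.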

  • Y. Han, J. Zhang, Y. Wang, C. Wang, Y. Liu, L. Qi, X. Li, and M. Yang, “Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

    Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) Overfitting suppresses novel class objects; 2) Dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) A novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) Demonstrating that support object queries encode key factors after base training, allowing query features to be enhanced twice at both feature and query levels using simple cross-attention, thus avoiding complex spatial correlation interaction; 3) Introducing a class-enhanced base knowledge distillation loss to address the issue of DETR-like models struggling with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., +8.2/ + 9.4 performance gain over state-of-the-art methods with 10/30-shots. Source code and models will be available at this github site.

    @article{han2024rta,
    title = {Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation},
    author = {Yue Han and Jiangning Zhang and Yabiao Wang and Chengjie Wang and Yong Liu and Lu Qi and Xiangtai Li and Ming-Hsuan Yang},
    year = 2024,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    doi = {10.1109/TPAMI.2024.3421340},
    abstract = {Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) Overfitting suppresses novel class objects; 2) Dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) A novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) Demonstrating that support object queries encode key factors after base training, allowing query features to be enhanced twice at both feature and query levels using simple cross-attention, thus avoiding complex spatial correlation interaction; 3) Introducing a class-enhanced base knowledge distillation loss to address the issue of DETR-like models struggling with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., +8.2/ + 9.4 performance gain over state-of-the-art methods with 10/30-shots. Source code and models will be available at this github site.}
    }
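
The "enhanced twice ... using simple cross-attention" step in the RefT entry above is ordinary scaled dot-product cross-attention from query-image features to support features. A single-head NumPy sketch, with shapes and names as illustrative assumptions:

```python
import numpy as np

def cross_attention(query, key, value):
    """Scaled dot-product cross-attention: each query row attends over
    the support rows. query: (nq, d), key: (nk, d), value: (nk, dv)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)           # (nq, nk) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value                        # (nq, dv) mixture
```

Applying this once at the feature level and once at the object-query level is the "reference twice" mechanism; no hand-crafted spatial-correlation module is needed because the attention weights carry the correspondence.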

  • K. Zhang, Q. Wen, C. Zhang, R. Cai, M. Jin, Y. Liu, J. Y. Zhang, Y. Liang, G. Pang, D. Song, and S. Pan, “Self-Supervised Learning for Time Series Analysis: Taxonomy, Progress, and Prospects," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 6775-6794, 2024.

    Self-supervised learning (SSL) has recently achieved impressive performance on various time series tasks. The most prominent advantage of SSL is that it reduces the dependence on labeled data. Based on the pre-training and fine-tuning strategy, even a small amount of labeled data can achieve high performance. Compared with many published self-supervised surveys on computer vision and natural language processing, a comprehensive survey for time series SSL is still missing. To fill this gap, we review current state-of-the-art SSL methods for time series data in this article. To this end, we first comprehensively review existing surveys related to SSL and time series, and then provide a new taxonomy of existing time series SSL methods by summarizing them from three perspectives: generative-based, contrastive-based, and adversarial-based. These methods are further divided into ten subcategories with detailed reviews and discussions about their key intuitions, main frameworks, advantages and disadvantages. To facilitate the experiments and validation of time series SSL methods, we also summarize datasets commonly used in time series forecasting, classification, anomaly detection, and clustering tasks. Finally, we present the future directions of SSL for time series analysis.

    @article{zhang2024ssl,
    title = {Self-Supervised Learning for Time Series Analysis: Taxonomy, Progress, and Prospects},
    author = {Kexin Zhang and Qingsong Wen and Chaoli Zhang and Rongyao Cai and Ming Jin and Yong Liu and James Y. Zhang and Yuxuan Liang and Guansong Pang and Dongjin Song and Shirui Pan},
    year = 2024,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume = 46,
    pages = {6775-6794},
    doi = {10.1109/TPAMI.2024.3387317},
    abstract = {Self-supervised learning (SSL) has recently achieved impressive performance on various time series tasks. The most prominent advantage of SSL is that it reduces the dependence on labeled data. Based on the pre-training and fine-tuning strategy, even a small amount of labeled data can achieve high performance. Compared with many published self-supervised surveys on computer vision and natural language processing, a comprehensive survey for time series SSL is still missing. To fill this gap, we review current state-of-the-art SSL methods for time series data in this article. To this end, we first comprehensively review existing surveys related to SSL and time series, and then provide a new taxonomy of existing time series SSL methods by summarizing them from three perspectives: generative-based, contrastive-based, and adversarial-based. These methods are further divided into ten subcategories with detailed reviews and discussions about their key intuitions, main frameworks, advantages and disadvantages. To facilitate the experiments and validation of time series SSL methods, we also summarize datasets commonly used in time series forecasting, classification, anomaly detection, and clustering tasks. Finally, we present the future directions of SSL for time series analysis.}
    }
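
Of the survey's three families above, the contrastive-based one is the easiest to make concrete: two stochastic "views" of each series are generated and pulled together in representation space. A minimal view-generation sketch, using jitter and scaling, two augmentations common in this literature; the function names are ours, not the survey's:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.03):
    """Additive Gaussian noise: a weak, shape-preserving augmentation."""
    return x + rng.normal(0.0, sigma, x.shape)

def scaling(x, sigma=0.1):
    """Per-series magnitude scaling: a stronger augmentation that keeps
    temporal structure but perturbs amplitude."""
    factor = rng.normal(1.0, sigma, (x.shape[0], 1))
    return x * factor

def make_views(x):
    """Two stochastic views of the same batch of series (rows), to be fed
    to an encoder and a contrastive objective such as InfoNCE."""
    return jitter(x), scaling(jitter(x))
```

The encoder is then trained so that the two views of the same series are nearest neighbors in embedding space, which is the "contrastive-based" category of the taxonomy; the generative-based family would instead mask spans and reconstruct them.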

  • J. Mei, M. Wang, Y. Yang, Z. Li, and Y. Liu, “Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation," Applied Intelligence, vol. 54, pp. 6138-6153, 2024.

    Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git.

    @article{mei2024lsr,
    title = {Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation},
    author = {Jianbiao Mei and Mengmeng Wang and Yu Yang and Zizhang Li and Yong Liu},
    year = 2024,
    journal = {Applied Intelligence},
    volume = 54,
    pages = {6138-6153},
    doi = {10.1007/s10489-024-05486-y},
    abstract = {Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git.}
    }
