• Dense 3D reconstruction

    Dense 3D reconstruction of the Kaist-Urban-07 dataset, obtained by simply assembling 2D LiDAR scans from a SICK LMS-511 along the continuous-time trajectory estimated by CLINS (see the assembly sketch after this list).

  • Time-lapse Video Generation

    In this paper, we propose DTVNet, a novel end-to-end one-stage dynamic time-lapse video generation framework that generates diversified time-lapse videos from a single landscape image.
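
A minimal sketch of the assembly step behind the dense-reconstruction highlight above: each 2D scan point is stamped, queried against a continuous-time trajectory, and transformed into the world frame. The scan format and the query_pose callback are placeholders for illustration, not the CLINS API.

    import numpy as np

    def assemble_scans(scans, query_pose):
        """Assemble 2D LiDAR scans into one global 3D point cloud.

        scans      : iterable of (timestamps, ranges, angles) per scan line
        query_pose : hypothetical callback t -> (R, t), a 3x3 rotation and a
                     3-vector sampled from a continuous-time trajectory
        """
        cloud = []
        for stamps, ranges, angles in scans:
            # points in the 2D scanner frame (x forward, y left, z = 0)
            pts = np.stack([ranges * np.cos(angles),
                            ranges * np.sin(angles),
                            np.zeros_like(ranges)], axis=1)
            for t, p in zip(stamps, pts):
                R, trans = query_pose(t)     # scanner pose at the point's timestamp
                cloud.append(R @ p + trans)  # move the point into the world frame
        return np.asarray(cloud)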

About Research Group

Welcome to the website of the APRIL Lab, led by Prof. Yong Liu. Our lab was founded in December 2011 and is part of the Institute of Cyber-Systems and Control at Zhejiang University.

Our mission is to investigate the fundamental challenges and practical applications of robotics and computer vision for the benefit of all humanity. Our main interests encompass the areas of deep learning, computer vision, SLAM, and robotics.


Representative Publications

  • S. Li, J. Chen, J. Xiang, C. Zhu, J. Yang, X. Wei, Y. Jiang, and Y. Liu, “Automatic Data-Free Pruning via Channel Similarity Reconstruction," Neurocomputing, vol. 661, p. 131885, 2026.

    Structured pruning methods are developed to bridge the gap between the massive scale of neural networks and the limited hardware resources. Most current structured pruning methods rely on training datasets to fine-tune the compressed model, resulting in high computational burdens and being inapplicable for scenarios with stringent requirements on privacy and security. As an alternative, some data-free methods have been proposed, however, these methods often require handcrafted parameter tuning and can only achieve inflexible reconstruction. In this paper, we propose the Automatic Data-Free Pruning (AutoDFP) method that achieves automatic pruning and reconstruction without fine-tuning. Our approach is based on the assumption that the loss of information can be partially compensated by retaining focused information from similar channels. Specifically, we formulate data-free pruning as an optimization problem, which can be effectively addressed through reinforcement learning. AutoDFP assesses the similarity of channels for each layer and provides this information to the reinforcement learning agent, guiding the pruning and reconstruction process of the network. We evaluate AutoDFP with multiple networks on multiple datasets, achieving impressive compression results.

    @article{li2026adf,
    title = {Automatic Data-Free Pruning via Channel Similarity Reconstruction},
    author = {Siqi Li and Jun Chen and Jingyang Xiang and Chengrui Zhu and Jiandang Yang and Xiaobin Wei and Yunliang Jiang and Yong Liu},
    year = 2026,
    journal = {Neurocomputing},
    volume = 661,
    pages = {131885},
    doi = {10.1016/j.neucom.2025.131885},
    abstract = {Structured pruning methods are developed to bridge the gap between the massive scale of neural networks and the limited hardware resources. Most current structured pruning methods rely on training datasets to fine-tune the compressed model, resulting in high computational burdens and being inapplicable for scenarios with stringent requirements on privacy and security. As an alternative, some data-free methods have been proposed, however, these methods often require handcrafted parameter tuning and can only achieve inflexible reconstruction. In this paper, we propose the Automatic Data-Free Pruning (AutoDFP) method that achieves automatic pruning and reconstruction without fine-tuning. Our approach is based on the assumption that the loss of information can be partially compensated by retaining focused information from similar channels. Specifically, we formulate data-free pruning as an optimization problem, which can be effectively addressed through reinforcement learning. AutoDFP assesses the similarity of channels for each layer and provides this information to the reinforcement learning agent, guiding the pruning and reconstruction process of the network. We evaluate AutoDFP with multiple networks on multiple datasets, achieving impressive compression results.}
    }
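
    As a rough illustration of the channel-similarity reconstruction idea in this paper, the toy sketch below folds a pruned channel's contribution into its most similar retained channel through the weights of the following layer. Array names and the similarity-based scale are illustrative; AutoDFP itself additionally learns the pruning and reconstruction policy with reinforcement learning.

    import numpy as np

    def prune_and_reconstruct(W_out, W_next, prune_idx):
        """Toy data-free prune-and-reconstruct for one pair of layers.

        W_out     : (C, K) weights producing the C output channels being pruned
        W_next    : (M, C) weights of the next layer consuming those C channels
        prune_idx : output channels of the first layer to remove
        """
        C = W_out.shape[0]
        keep = [c for c in range(C) if c not in set(prune_idx)]
        # cosine similarity between the channels' weight vectors
        normed = W_out / (np.linalg.norm(W_out, axis=1, keepdims=True) + 1e-12)
        sim = normed @ normed.T
        W_next_new = W_next[:, keep].copy()
        for c in prune_idx:
            # the most similar retained channel absorbs the pruned channel's inputs
            j = keep[int(np.argmax(sim[c, keep]))]
            scale = sim[c, j]  # crude proxy for how well channel j substitutes for c
            W_next_new[:, keep.index(j)] += scale * W_next[:, c]
        return W_out[keep], W_next_new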

  • J. Xing, J. Zhao, C. Xu, M. Wang, G. Dai, Y. Liu, J. Wang, and X. Li, “MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition," Pattern Recognition, vol. 169, p. 111902, 2026.

    Applying large-scale vision-language pre-trained models like CLIP to few-shot action recognition (FSAR) can significantly enhance both performance and efficiency. While several studies have recognized this advantage, most rely on full-parameter fine-tuning to adapt CLIP’s visual encoder to FSAR data, which not only incurs high computational costs but also overlooks the potential of the visual encoder to engage in temporal modeling and focus on targeted semantics directly. To tackle these issues, we introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action related temporal and semantic representations. Our solution involves a token-level Fine-grained Multimodal Adaptation mechanism: a Global Temporal Adaptation captures motion cues from video sequences, while a Local Multimodal Adaptation integrates text-guided semantics from the support set to emphasize action-critical features. Additionally, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes. Extensive experiments demonstrate our superior performance in various tasks using minor trainable parameters.

    @article{xing2026maf,
    title = {MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition},
    author = {Jiazheng Xing and Jian Zhao and Chao Xu and Mengmeng Wang and Guang Dai and Yong Liu and Jingdong Wang and Xuelong Li},
    year = 2026,
    journal = {Pattern Recognition},
    volume = 169,
    pages = {111902},
    doi = {10.1016/j.patcog.2025.111902},
    abstract = {Applying large-scale vision-language pre-trained models like CLIP to few-shot action recognition (FSAR) can significantly enhance both performance and efficiency. While several studies have recognized this advantage, most rely on full-parameter fine-tuning to adapt CLIP’s visual encoder to FSAR data, which not only incurs high computational costs but also overlooks the potential of the visual encoder to engage in temporal modeling and focus on targeted semantics directly. To tackle these issues, we introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action related temporal and semantic representations. Our solution involves a token-level Fine-grained Multimodal Adaptation mechanism: a Global Temporal Adaptation captures motion cues from video sequences, while a Local Multimodal Adaptation integrates text-guided semantics from the support set to emphasize action-critical features. Additionally, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes. Extensive experiments demonstrate our superior performance in various tasks using minor trainable parameters.}
    }
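
    The abstract above builds on Parameter-Efficient Fine-Tuning; below is a minimal PyTorch sketch of the generic bottleneck-adapter pattern, where only a small adapter is trained while the pre-trained encoder stays frozen. The class, dimensions, and the clip_visual_encoder placeholder are illustrative and do not reproduce MA-FSAR's Global Temporal or Local Multimodal Adaptation designs.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Tiny residual adapter inserted beside a frozen transformer block."""

        def __init__(self, dim: int, reduction: int = 8):
            super().__init__()
            self.down = nn.Linear(dim, dim // reduction)
            self.up = nn.Linear(dim // reduction, dim)
            self.act = nn.GELU()
            nn.init.zeros_(self.up.weight)  # start as an identity mapping
            nn.init.zeros_(self.up.bias)

        def forward(self, x):               # x: (batch, tokens, dim)
            return x + self.up(self.act(self.down(x)))

    # usage: freeze the pre-trained backbone and train only the adapters
    # encoder = clip_visual_encoder()       # hypothetical frozen CLIP backbone
    # for p in encoder.parameters():
    #     p.requires_grad_(False)
    # adapter = BottleneckAdapter(dim=768)  # trainable, few parameters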

  • Y. Ma, T. Wei, N. Zhong, J. Mei, T. Hu, L. Wen, X. Yang, B. Shi, and Y. Liu, “LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking," IEEE Transactions on Neural Networks and Learning Systems, 2025.

    While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this article, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes—including appearance, motion patterns, and associated risks—LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module mimicking the human-driving learning process. The system consists of an analytic process (System-II) that accumulates driving experience through logical reasoning and a heuristic process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared with camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/

    @article{ma2025leap,
    title = {LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking},
    author = {Yukai Ma and Tiantian Wei and Naiting Zhong and Jianbiao Mei and Tao Hu and Licheng Wen and Xuemeng Yang and Botian Shi and Yong Liu},
    year = 2025,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2025.3626711},
    abstract = {While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this article, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes—including appearance, motion patterns, and associated risks—LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module mimicking the human-driving learning process. The system consists of an analytic process (System-II) that accumulates driving experience through logical reasoning and a heuristic process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared with camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/}
    }
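
    A small sketch of the retrieval step mentioned above: compact scene embeddings are compared by cosine similarity to fetch the most relevant past driving experiences from a growing memory bank. The scene_encoder and the stored experiences are placeholders, not LeapVAD's actual modules.

    import numpy as np

    class ExperienceMemory:
        """Growing bank of (scene embedding, experience) pairs with top-k retrieval."""

        def __init__(self):
            self.embeddings, self.experiences = [], []

        def add(self, embedding, experience):
            self.embeddings.append(embedding / np.linalg.norm(embedding))
            self.experiences.append(experience)

        def retrieve(self, query, k=3):
            if not self.embeddings:
                return []
            q = query / np.linalg.norm(query)
            sims = np.stack(self.embeddings) @ q   # cosine similarities to the query
            top = np.argsort(-sims)[:k]
            return [self.experiences[i] for i in top]

    # bank = ExperienceMemory()
    # bank.add(scene_encoder(past_frame), "yield to pedestrian at crosswalk")  # hypothetical
    # advice = bank.retrieve(scene_encoder(current_frame), k=3)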

  • J. Zhang, T. Hu, H. He, Z. Xue, Y. Wang, C. Wang, Y. Liu, X. Li, and D. Tao, “EMOv2: Pushing 5M Vision Model Frontier," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, pp. 10560-10576, 2025.

    This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of the 5M magnitude lightweight model on various downstream tasks. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterparts have been recognized by attention-based design. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i²RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth and ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 that surpass equal-order CNN-/Attention-based models significantly. At the same time, EMOv2-5M-equipped RetinaNet achieves 41.5 mAP for object detection tasks that surpasses the previous EMO-5M by +2.6↑. When employing the more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M magnitude models to a new level.

    @article{zhang2025emov,
    title = {EMOv2: Pushing 5M Vision Model Frontier},
    author = {Jiangning Zhang and Teng Hu and Haoyang He and Zhucun Xue and Yabiao Wang and Chengjie Wang and Yong Liu and Xiangtai Li and Dacheng Tao},
    year = 2025,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume = 47,
    pages = {10560-10576},
    doi = {10.1109/TPAMI.2025.3596776},
    abstract = {This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of the 5M magnitude lightweight model on various downstream tasks. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterparts have been recognized by attention-based design. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i²RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth and ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 that surpass equal-order CNN-/Attention-based models significantly. At the same time, EMOv2-5M-equipped RetinaNet achieves 41.5 mAP for object detection tasks that surpasses the previous EMO-5M by +2.6↑. When employing the more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M magnitude models to a new level.}
    }
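
    A compact PyTorch sketch of a one-residual expand/mix/project block in the spirit of the Meta Mobile Block described above; a depthwise convolution stands in for the attention-based token mixer of the paper's i²RMB. This is a simplified illustration, not the released EMOv2 block.

    import torch
    import torch.nn as nn

    class MetaMobileBlockSketch(nn.Module):
        """One-residual expand -> token-mix -> project block (simplified)."""

        def __init__(self, dim: int, expand: int = 4):
            super().__init__()
            hidden = dim * expand
            self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
            # depthwise conv as a stand-in for the attention-based token mixer
            self.mix = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
            self.act = nn.SiLU()
            self.project = nn.Conv2d(hidden, dim, kernel_size=1)

        def forward(self, x):  # x: (batch, dim, H, W); shape is preserved
            return x + self.project(self.act(self.mix(self.act(self.expand(x)))))

    # y = MetaMobileBlockSketch(dim=64)(torch.randn(1, 64, 32, 32))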

  • X. Liu, M. Lin, S. Li, G. Xu, Z. Wang, H. Wu, and Y. Liu, “FLARE: Fast Large-scale Autonomous Exploration Guided by Unknown Regions," IEEE Robotics and Automation Letters, vol. 10, pp. 12197-12204, 2025.

    Autonomous exploration is a critical foundation for uncrewed aerial vehicles (UAV) applications such as search and rescue. However, existing methods typically focus only on known spaces or frontiers without considering unknown regions or providing further guidance for the global path, which results in low exploration efficiency. This letter proposes FLARE, which enables Fast UAV exploration in LARge-scale and complex unknown Environments. The incremental unknown region partitioning method partitions the unexplored space into multiple unknown regions in real-time by integrating known information with the sensor perception range. Building on this, the hierarchical planner first computes a global path that encompasses all unknown regions and then generates safe and feasible local trajectories for the UAV. We evaluate the performance of FLARE through extensive simulations and real-world experiments. The results show that, compared to existing state-of-the-art algorithms, FLARE significantly improves exploration efficiency, reducing exploration time by 16.8% to 27.9% and flight distance by 15.8% to 25.5%. The source code of FLARE will be released to benefit the community.

    @article{liu2025flare,
    title = {FLARE: Fast Large-scale Autonomous Exploration Guided by Unknown Regions},
    author = {Xinyang Liu and Min Lin and Shengbo Li and Gang Xu and Zhifang Wang and Huifeng Wu and Yong Liu},
    year = 2025,
    journal = {IEEE Robotics and Automation Letters},
    volume = 10,
    pages = {12197-12204},
    doi = {10.1109/LRA.2025.3620618},
    abstract = {Autonomous exploration is a critical foundation for uncrewed aerial vehicles (UAV) applications such as search and rescue. However, existing methods typically focus only on known spaces or frontiers without considering unknown regions or providing further guidance for the global path, which results in low exploration efficiency. This letter proposes FLARE, which enables Fast UAV exploration in LARge-scale and complex unknown Environments. The incremental unknown region partitioning method partitions the unexplored space into multiple unknown regions in real-time by integrating known information with the sensor perception range. Building on this, the hierarchical planner first computes a global path that encompasses all unknown regions and then generates safe and feasible local trajectories for the UAV. We evaluate the performance of FLARE through extensive simulations and real-world experiments. The results show that, compared to existing state-of-the-art algorithms, FLARE significantly improves exploration efficiency, reducing exploration time by 16.8% to 27.9% and flight distance by 15.8% to 25.5%. The source code of FLARE will be released to benefit the community.}
    }
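
    As a toy illustration of the global-path idea described above, the sketch below orders unknown-region centroids with a greedy nearest-neighbor tour from the UAV's current position. FLARE's incremental region partitioning and hierarchical planner are considerably more involved; the function and inputs here are illustrative.

    import numpy as np

    def greedy_region_tour(start, region_centroids):
        """Greedy nearest-neighbor visiting order over unknown-region centroids.

        start            : (3,) current UAV position
        region_centroids : (N, 3) one centroid per unknown region
        returns the visiting order as a list of region indices
        """
        remaining = list(range(len(region_centroids)))
        pos, order = np.asarray(start, dtype=float), []
        while remaining:
            dists = [np.linalg.norm(region_centroids[i] - pos) for i in remaining]
            nxt = remaining.pop(int(np.argmin(dists)))
            order.append(nxt)
            pos = np.asarray(region_centroids[nxt], dtype=float)
        return order

    # order = greedy_region_tour([0.0, 0.0, 1.0], np.random.rand(6, 3) * 50.0)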

  • R. Cai, K. Zhang, H. Tai, Y. Zhou, Y. Ding, C. Zhou, and Y. Liu, “Adversarial Multimodal Contrastive Learning for Robust Industrial Fault Diagnosis," IEEE Transactions on Instrumentation and Measurement, vol. 74, p. 3559412, 2025.

    Fault diagnosis (FD) techniques leveraging self-supervised contrastive learning (SSCL) have demonstrated significant potential in industrial scenarios due to their reduced dependence on manually annotated data. However, the existing SSCL algorithms primarily focus on establishing complex similarity relationships among unimodal augmented views. These unimodal SSCL approaches are particularly vulnerable to learning shallow, domain-dependent spurious features in the training data rather than more intrinsic and essential features. Consequently, such spurious features may cause the algorithm failure when encountering distribution shift issues resulting from environmental perturbations or changes in working conditions. To address this challenge, we propose adversarial multimodal contrastive learning (AMMCL), a novel approach designed to extract robust and generalizable multimodal representations from time series and their corresponding spectrograms. AMMCL utilizes intermodal contrastive learning and adversarial training strategy to align modal-invariant features from both elementwise and setwise perspectives. These essential features are beneficial for intradomain and cross-domain FD tasks. Furthermore, a slice segmentation processing (SSP) method based on dominant frequency is employed to enhance model’s ability to recognize varying patterns within time series. AMMCL is first evaluated on intradomain and cross-domain FD tasks using the Gearbox and XJTU-SY datasets, where it outperforms nine existing FD algorithms in terms of performance. Additionally, AMMCL is compared with ten other valve stiction detection algorithms on International Stiction Database (ISDB) dataset, successfully identifying the most loop states (23 out of 26). Finally, the trained AMMCL model on the ISDB dataset is implemented in actual industrial valve detection, demonstrating the feasibility and practicality of AMMCL in real industrial scenarios.

    @article{cai2025amc,
    title = {Adversarial Multimodal Contrastive Learning for Robust Industrial Fault Diagnosis},
    author = {Rongyao Cai and Kexin Zhang and Hanchen Tai and Yang Zhou and Yuanyuan Ding and Chunlin Zhou and Yong Liu},
    year = 2025,
    journal = {IEEE Transactions on Instrumentation and Measurement},
    volume = 74,
    pages = {3559412},
    doi = {10.1109/TIM.2025.3608323},
    abstract = {Fault diagnosis (FD) techniques leveraging self-supervised contrastive learning (SSCL) have demonstrated significant potential in industrial scenarios due to their reduced dependence on manually annotated data. However, the existing SSCL algorithms primarily focus on establishing complex similarity relationships among unimodal augmented views. These unimodal SSCL approaches are particularly vulnerable to learning shallow, domain-dependent spurious features in the training data rather than more intrinsic and essential features. Consequently, such spurious features may cause the algorithm failure when encountering distribution shift issues resulting from environmental perturbations or changes in working conditions. To address this challenge, we propose adversarial multimodal contrastive learning (AMMCL), a novel approach designed to extract robust and generalizable multimodal representations from time series and their corresponding spectrograms. AMMCL utilizes intermodal contrastive learning and adversarial training strategy to align modal-invariant features from both elementwise and setwise perspectives. These essential features are beneficial for intradomain and cross-domain FD tasks. Furthermore, a slice segmentation processing (SSP) method based on dominant frequency is employed to enhance model’s ability to recognize varying patterns within time series. AMMCL is first evaluated on intradomain and cross-domain FD tasks using the Gearbox and XJTU-SY datasets, where it outperforms nine existing FD algorithms in terms of performance. Additionally, AMMCL is compared with ten other valve stiction detection algorithms on International Stiction Database (ISDB) dataset, successfully identifying the most loop states (23 out of 26). Finally, the trained AMMCL model on the ISDB dataset is implemented in actual industrial valve detection, demonstrating the feasibility and practicality of AMMCL in real industrial scenarios.}
    }
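
    A short PyTorch sketch of a symmetric cross-modal InfoNCE loss, the standard building block behind the elementwise intermodal alignment described above; the encoders, the setwise objective, and the adversarial training strategy of AMMCL are not reproduced here.

    import torch
    import torch.nn.functional as F

    def cross_modal_infonce(z_time, z_spec, temperature=0.1):
        """Symmetric InfoNCE between time-series and spectrogram embeddings.

        z_time, z_spec : (batch, dim) embeddings of the two modalities;
                         row i of each tensor comes from the same sample
        """
        z_time = F.normalize(z_time, dim=1)
        z_spec = F.normalize(z_spec, dim=1)
        logits = z_time @ z_spec.t() / temperature        # (batch, batch) similarities
        targets = torch.arange(z_time.size(0), device=z_time.device)
        # matching pairs (the diagonal) are the positives in both directions
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # loss = cross_modal_infonce(time_encoder(x_ts), spec_encoder(x_spec))  # hypothetical encoders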

  • C. Zhu, Z. Zhang, W. Liu, S. Li, and Y. Liu, “Learning Accurate and Robust Velocity Tracking for Quadrupedal Robots," Journal of Field Robotics, 2025.
    @article{zhu2025lar,
    title = {Learning Accurate and Robust Velocity Tracking for Quadrupedal Robots},
    author = {Chengrui Zhu and Zhen Zhang and Weiwei Liu and Siqi Li and Yong Liu},
    year = 2025,
    journal = {Journal of Field Robotics},
    doi = {10.1002/rob.70028}
    }

  • G. Xu, Y. Wu, S. Tao, Y. Yang, T. Liu, T. Huang, H. Wu, and Y. Liu, “Efficient Multi-Robot Task and Path Planning in Large-Scale Cluttered Environments," IEEE Robotics and Automation Letters, vol. 10, pp. 9112-9119, 2025.

    As the potential of multi-robot systems continues to be explored and validated across various real-world applications, such as package delivery, search and rescue, and autonomous exploration, the need to improve the efficiency and quality of task and path planning has become increasingly urgent, particularly in large-scale, obstacle-rich environments. To this end, this letter investigates the problem of multi-robot task and path planning (MRTPP) in large-scale cluttered scenarios. Specifically, we first propose an obstacle-vertex search (OVS) path planner that quickly constructs the cost matrix of collision-free paths for multi-robot task planning, ensuring the rationality of task planning in obstacle-rich environments. Furthermore, we introduce an efficient auction-based method for solving the MRTPP problem by incorporating a novel memory-aware strategy, aiming to minimize the maximum travel cost among robots for task visits. The proposed method effectively improves computational efficiency while maintaining solution quality in the multi-robot task planning problem. Finally, we demonstrated the effectiveness and practicality of the proposed method through extensive benchmark comparisons.

    @article{xu2025emr,
    title = {Efficient Multi-Robot Task and Path Planning in Large-Scale Cluttered Environments},
    author = {Gang Xu and Yuchen Wu and Sheng Tao and Yifan Yang and Tao Liu and Tao Huang and Huifeng Wu and Yong Liu},
    year = 2025,
    journal = {IEEE Robotics and Automation Letters},
    volume = 10,
    pages = {9112-9119},
    doi = {10.1109/LRA.2025.3592146},
    abstract = {As the potential of multi-robot systems continues to be explored and validated across various real-world applications, such as package delivery, search and rescue, and autonomous exploration, the need to improve the efficiency and quality of task and path planning has become increasingly urgent, particularly in large-scale, obstacle-rich environments. To this end, this letter investigates the problem of multi-robot task and path planning (MRTPP) in large-scale cluttered scenarios. Specifically, we first propose an obstacle-vertex search (OVS) path planner that quickly constructs the cost matrix of collision-free paths for multi-robot task planning, ensuring the rationality of task planning in obstacle-rich environments. Furthermore, we introduce an efficient auction-based method for solving the MRTPP problem by incorporating a novel memory-aware strategy, aiming to minimize the maximum travel cost among robots for task visits. The proposed method effectively improves computational efficiency while maintaining solution quality in the multi-robot task planning problem. Finally, we demonstrated the effectiveness and practicality of the proposed method through extensive benchmark comparisons.}
    }
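
    A compact sketch of a greedy auction-style allocation aimed at the min-max objective in the abstract: each task is awarded to the robot whose bid keeps the maximum accumulated cost smallest, using a precomputed collision-free cost matrix. The memory-aware strategy and the OVS planner of the letter are not modeled; names and the task ordering are illustrative.

    import numpy as np

    def minmax_auction(cost, n_robots):
        """Greedy auction for min-max multi-robot task allocation.

        cost : (n_robots, n_tasks) collision-free path costs from robot to task
        returns the list of tasks awarded to each robot
        """
        tours = [[] for _ in range(n_robots)]
        load = np.zeros(n_robots)  # accumulated robot-to-task cost (crude stand-in for tour length)
        for task in np.argsort(cost.min(axis=0)):  # cheaper tasks are auctioned first
            bids = load + cost[:, task]
            # winner = robot whose win keeps the overall maximum load smallest
            winner = int(np.argmin(np.maximum(bids, load.max())))
            tours[winner].append(int(task))
            load[winner] += cost[winner, task]
        return tours

    # tours = minmax_auction(np.random.rand(3, 10) * 20.0, n_robots=3)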
