  • Dense 3D Reconstruction

    Dense 3D reconstruction of the Kaist-Urban-07 dataset, produced by simply assembling the 2D LiDAR scans of a SICK LMS-511 along the continuous-time trajectory estimated by CLINS; a minimal sketch of this assembly step follows this list.

  • Time-lapse Video Generation

    In this paper, we propose DTVNet, a novel end-to-end, one-stage dynamic time-lapse video generation framework that generates diversified time-lapse videos from a single landscape image.
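
A quick illustration of the reconstruction recipe above: given a continuous-time trajectory, every individually stamped 2D scan point gets its own interpolated pose before being pushed into the global frame. The sketch below is a minimal stand-in for that assembly step, not the CLINS implementation; the interpolation scheme (SLERP on rotation plus linear translation) and all variable names are assumptions.

    import numpy as np
    from scipy.spatial.transform import Rotation, Slerp

    def make_pose_interpolator(stamps, quats_xyzw, trans):
        """Approximate SE(3) interpolation: SLERP on rotation, lerp on translation.
        stamps must be increasing; trans is (N,3)."""
        slerp = Slerp(stamps, Rotation.from_quat(quats_xyzw))
        def pose_at(t):
            R = slerp(t).as_matrix()
            p = np.array([np.interp(t, stamps, trans[:, k]) for k in range(3)])
            return R, p
        return pose_at

    def assemble_scan(ranges, angles, point_stamps, pose_at, T_base_lidar=np.eye(4)):
        """Project each stamped 2D scan point into the global frame."""
        pts = []
        for r, a, t in zip(ranges, angles, point_stamps):
            p_lidar = np.array([r * np.cos(a), r * np.sin(a), 0.0, 1.0])  # planar point, homogeneous
            R, p = pose_at(t)                       # body pose at this point's own stamp
            T_wb = np.eye(4)
            T_wb[:3, :3], T_wb[:3, 3] = R, p
            pts.append((T_wb @ T_base_lidar @ p_lidar)[:3])
        return np.asarray(pts)

Concatenating the assemble_scan outputs over all sweeps yields the dense cloud; the per-point stamping is what removes motion distortion within each sweep.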

About the Research Group

Welcome to the website of the APRIL Lab, led by Prof. Yong Liu. Our lab was founded in December 2011 and is part of the Institute of Cyber-Systems and Control at Zhejiang University.

Our mission is to investigate the fundamental challenges and practical applications of robotics and computer vision for the benefit of all humanity. Our main interests span deep learning, computer vision, SLAM, and robotics.


Representative Publications

  • X. Zuo, W. Ye, Y. Yang, R. Zheng, T. Vidal-Calleja, G. Huang, and Y. Liu, “Multimodal localization: Stereo over LiDAR map,” Journal of Field Robotics, vol. 37, pp. 1003–1026, 2020.

    In this paper, we present a real‐time high‐precision visual localization system for an autonomous vehicle which employs only low‐cost stereo cameras to localize the vehicle with a priori map built using a more expensive 3D LiDAR sensor. To this end, we construct two different visual maps: a sparse feature visual map for visual odometry (VO) based motion tracking, and a semidense visual map for registration with the prior LiDAR map. To register two point clouds sourced from different modalities (i.e., cameras and LiDAR), we leverage probabilistic weighted normal distributions transformation (ProW‐NDT), by particularly taking into account the uncertainty of source point clouds. The registration results are then fused via pose graph optimization to correct the VO drift. Moreover, surfels extracted from the prior LiDAR map are used to refine the sparse 3D visual features that will further improve VO‐based motion estimation. The proposed system has been tested extensively in both simulated and real‐world experiments, showing that robust, high‐precision, real‐time localization can be achieved.

    @article{zuo2020multimodalls,
    title = {Multimodal localization: Stereo over LiDAR map},
    author = {Xingxing Zuo and Wenlong Ye and Yulin Yang and Renjie Zheng and Teresa Vidal-Calleja and Guoquan Huang and Yong Liu},
    year = 2020,
    journal = {Journal of Field Robotics},
    volume = 37,
    pages = {1003--1026},
    doi = {10.1002/rob.21936}
    }
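
    As a rough illustration of the registration idea above, the sketch below scores a candidate transform against a map summarized by per-voxel Gaussians, with each source point down-weighted by its own uncertainty, in the spirit of ProW-NDT. It is a hedged simplification, not the paper's formulation; voxel_of, the weights, and all names are hypothetical.

        import numpy as np

        def weighted_ndt_score(points, weights, T, voxel_means, voxel_covs, voxel_of):
            """points: (n,3) source cloud; weights: (n,) per-point confidence;
            T: 4x4 candidate transform; voxel_of: maps a 3D point to a voxel id or None."""
            score = 0.0
            for p, w in zip(points, weights):
                q = (T @ np.append(p, 1.0))[:3]          # source point in map frame
                v = voxel_of(q)
                if v is None:                            # point falls outside the map
                    continue
                d = q - voxel_means[v]
                score += w * np.exp(-0.5 * d @ np.linalg.solve(voxel_covs[v], d))
            return score                                 # maximize over candidate transforms T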

  • Y. Liao, Y. Wang, and Y. Liu, “Graph Regularized Auto-Encoders for Image Representation,” IEEE Transactions on Image Processing, vol. 26, pp. 2839–2852, 2017.

    Image representation has been intensively explored in the domain of computer vision for its significant influence on the relative tasks such as image clustering and classification. It is valuable to learn a low-dimensional representation of an image which preserves its inherent information from the original image space. At the perspective of manifold learning, this is implemented with the local invariant idea to capture the intrinsic low-dimensional manifold embedded in the high-dimensional input space. Inspired by the recent successes of deep architectures, we propose a local invariant deep nonlinear mapping algorithm, called graph regularized auto-encoder (GAE). With the graph regularization, the proposed method preserves the local connectivity from the original image space to the representation space, while the stacked auto-encoders provide explicit encoding model for fast inference and powerful expressive capacity for complex modeling. Theoretical analysis shows that the graph regularizer penalizes the weighted Frobenius norm of the Jacobian matrix of the encoder mapping, where the weight matrix captures the local property in the input space. Furthermore, the underlying effects on the hidden representation space are revealed, providing insightful explanation to the advantage of the proposed method. Finally, the experimental results on both clustering and classification tasks demonstrate the effectiveness of our GAE as well as the correctness of the proposed theoretical analysis, and it also suggests that GAE is a superior solution to the current deep representation learning techniques comparing with variant auto-encoders and existing local invariant methods.

    @article{liao2017graphra,
    title = {Graph Regularized Auto-Encoders for Image Representation},
    author = {Yiyi Liao and Yue Wang and Yong Liu},
    year = 2017,
    journal = {IEEE Transactions on Image Processing},
    volume = 26,
    pages = {2839--2852},
    doi = {10.1109/TIP.2016.2605010}
    }
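
    The objective above can be captured in a few lines: an auto-encoder reconstruction loss plus a Laplacian term tr(H^T L H) that keeps inputs adjacent in the neighborhood graph close in code space. Below is a minimal sketch under assumed shapes, not the authors' code.

        import torch

        def gae_loss(x, x_hat, h, W, lam=0.1):
            """x: (n,d) inputs; x_hat: reconstructions; h: (n,k) hidden codes;
            W: (n,n) symmetric affinity matrix of the neighborhood graph."""
            recon = ((x - x_hat) ** 2).sum()
            L = torch.diag(W.sum(dim=1)) - W     # graph Laplacian D - W
            graph = torch.trace(h.T @ L @ h)     # = 0.5 * sum_ij W_ij * ||h_i - h_j||^2
            return recon + lam * graph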

  • Y. Wang, Y. Liu, Y. Liao, and R. Xiong, “Scalable Learning Framework for Traversable Region Detection Fusing With Appearance and Geometrical Information,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, pp. 3267–3281, 2017.

    In this paper, we present an online learning framework for traversable region detection fusing both appearance and geometry information. Our framework proposes an appearance classifier supervised by the sparse geometric clues to capture the variation in online data, yielding dense detection result in real time. It provides superior detection performance using appearance information with weak geometric prior and can be further improved with more geometry from external sensors. The learning process is divided into three steps: First, we construct features from the super-pixel level, which reduces the computational cost compared with the pixel level processing. Then we classify the multi-scale super-pixels to vote the label of each pixel. Second, we use weighted extreme learning machine as our classifier to deal with the imbalanced data distribution since the weak geometric prior only initializes the labels in a small region. Finally, we employ the online learning process so that our framework can be adaptive to the changing scenes. Experimental results on three different styles of image sequences, i.e., shadow road, rain sequence, and variational sequence, demonstrate the adaptability, stability, and parameter insensitivity of our weak geometry motivated method. We further demonstrate the performance of learning framework on additional five challenging data sets captured by Kinect V2 and stereo camera, validating the method’s effectiveness and efficiency.

    @article{wang2017scalablelf,
    title = {Scalable Learning Framework for Traversable Region Detection Fusing With Appearance and Geometrical Information},
    author = {Yue Wang and Yong Liu and Yiyi Liao and Rong Xiong},
    year = 2017,
    journal = {IEEE Transactions on Intelligent Transportation Systems},
    volume = 18,
    pages = {3267--3281},
    doi = {10.1109/TITS.2017.2682218}
    }
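
    The weighted extreme learning machine mentioned above admits a closed-form fit, which is what keeps the online setting cheap. Below is a minimal sketch with inverse-class-frequency weights (one common choice, not necessarily the paper's exact scheme); all names are illustrative.

        import numpy as np

        def welm_fit(X, y, n_hidden=200, C=1.0, seed=0):
            """y in {-1, +1}; returns random hidden layer (A, b) and output weights beta."""
            rng = np.random.default_rng(seed)
            A = rng.normal(size=(X.shape[1], n_hidden))    # fixed random input weights
            b = rng.normal(size=n_hidden)
            H = np.tanh(X @ A + b)                         # hidden-layer activations
            n_pos, n_neg = max((y > 0).sum(), 1), max((y <= 0).sum(), 1)
            w = np.where(y > 0, 1.0 / n_pos, 1.0 / n_neg)  # balance the imbalanced classes
            HtW = H.T * w                                  # H^T W without forming diag(w)
            beta = np.linalg.solve(HtW @ H + np.eye(n_hidden) / C, HtW @ y.astype(float))
            return A, b, beta

        def welm_predict(X, A, b, beta):
            return np.sign(np.tanh(X @ A + b) @ beta)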

  • Y. Liu, R. Xiong, Y. Wang, H. Huang, X. Xie, X. Liu, and G. Zhang, “Stereo Visual-Inertial Odometry With Multiple Kalman Filters Ensemble,” IEEE Transactions on Industrial Electronics, vol. 63, pp. 6205–6216, 2016.

    In this paper, we present a stereo visual-inertial odometry algorithm assembled with three separated Kalman filters, i.e., attitude filter, orientation filter, and position filter. Our algorithm carries out the orientation and position estimation with three filters working on different fusion intervals, which can provide more robustness even when the visual odometry estimation fails. In our orientation estimation, we propose an improved indirect Kalman filter, which uses the orientation error space represented by unit quaternion as the state of the filter. The performance of the algorithm is demonstrated through extensive experimental results, including the benchmark KITTI datasets and some challenging datasets captured in a rough terrain campus.

    @article{liu2016stereovo,
    title = {Stereo Visual-Inertial Odometry With Multiple Kalman Filters Ensemble},
    author = {Yong Liu and Rong Xiong and Yue Wang and Hong Huang and Xiaojia Xie and Xiaofeng Liu and Gaoming Zhang},
    year = 2016,
    journal = {IEEE Transactions on Industrial Electronics},
    volume = 63,
    pages = {6205--6216},
    doi = {10.1109/TIE.2016.2573765}
    }
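
    To make the "orientation error space" idea above concrete, here is a compact, generic error-state update: the filter estimates a small-angle error from an orientation residual, then folds it back into the nominal unit quaternion. This is a hedged single-filter sketch, not the paper's three-filter ensemble; the residual r and noise R are placeholders.

        import numpy as np

        def quat_mul(a, b):
            """Hamilton product, quaternions as (w, x, y, z)."""
            w1, x1, y1, z1 = a
            w2, x2, y2, z2 = b
            return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                             w1*x2 + x1*w2 + y1*z2 - z1*y2,
                             w1*y2 - x1*z2 + y1*w2 + z1*x2,
                             w1*z2 + x1*y2 - y1*x2 + z1*w2])

        def error_state_update(q_nom, P, r, R=np.eye(3) * 1e-3):
            """q_nom: nominal quaternion; P: 3x3 error covariance;
            r: small-angle orientation residual from the visual front end."""
            S = P + R                                   # H = I: residual observes the error state
            K = P @ np.linalg.inv(S)                    # Kalman gain
            dtheta = K @ r                              # estimated small-angle error
            dq = np.concatenate(([1.0], 0.5 * dtheta))  # first-order error quaternion
            q = quat_mul(q_nom, dq)
            return q / np.linalg.norm(q), (np.eye(3) - K) @ P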

  • Y. Liu, F. Tang, and Z. Zeng, “Feature Selection Based on Dependency Margin,” IEEE Transactions on Cybernetics, vol. 45, pp. 1209–1221, 2015.

    Feature selection tries to find a subset of feature from a larger feature pool and the selected subset can provide the same or even better performance compared with using the whole set. Feature selection is usually a critical preprocessing step for many machine-learning applications such as clustering and classification. In this paper, we focus on feature selection for supervised classification which targets at finding features that can best predict class labels. Traditional greedy search algorithms incrementally find features based on the relevance of candidate features and the class label. However, this may lead to suboptimal results when there are redundant features that may interfere with the selection. To solve this problem, we propose a subset selection algorithm that considers both the selected and remaining features’ relevances with the label. The intuition is that features, which do not have better alternatives from the feature set, should be selected first. We formulate the selection problem as maximizing the dependency margin which is measured by the difference between the selected feature set performance and the remaining feature set performance. Extensive experiments on various data sets show the superiority of the proposed approach against traditional algorithms.

    @article{liu2015featuresb,
    title = {Feature Selection Based on Dependency Margin},
    author = {Yong Liu and Feng Tang and Zhiyong Zeng},
    year = 2015,
    journal = {IEEE Transactions on Cybernetics},
    volume = 45,
    pages = {1209--1221},
    doi = {10.1109/TCYB.2014.2347372}
    }
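
    The selection rule above can be paraphrased as a greedy loop over a margin: pick the feature that most strengthens the selected set's dependency on the label while most weakening the remaining set's. The sketch below uses mean absolute correlation as a stand-in dependency measure; the paper's actual measure differs.

        import numpy as np

        def dep(X, y, idx):
            """Stand-in dependency of feature subset idx on label y."""
            if not idx:
                return 0.0
            return float(np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in idx]))

        def select_by_margin(X, y, k):
            selected, remaining = [], list(range(X.shape[1]))
            while len(selected) < k and remaining:
                margins = [dep(X, y, selected + [f])
                           - dep(X, y, [g for g in remaining if g != f])
                           for f in remaining]
                best = remaining[int(np.argmax(margins))]   # largest dependency margin
                selected.append(best)
                remaining.remove(best)
            return selected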

  • Q. Shen, X. Zhang, J. Lou, Y. Liu, and Y. Jiang, “Interval-valued intuitionistic fuzzy multi-attribute second-order decision making based on partial connection numbers of set pair analysis,” Soft Computing, 2022.

    Multi-attribute decision making (MADM) with attribute values as interval-valued intuitionistic fuzzy numbers (IVIFNs) is essentially a second-order decision making problem with uncertainty. To this end, the partial connection number (PCN) of set pair analysis is applied to MADM with IVIFNs. The PCN is an adjoin function of the connection number (CN), and its calculation process reflects the contradictory movement of the connection component in the CN at various micro-levels. It is the main mathematical tool of multi-level analysis method for the macro state and micro-trend. First, we convert IVIFNs into ternary connection numbers (TCNs); then, we calculate the first-order and second-order total PCNs for TCNs. According to the uncertainty analysis of the first-order total PCN, the possible ranking (first-order ranking) of the schemes in the uncertain environment is given, and the deterministic ordering (second-order ranking) of the schemes is given according to the value of the second-order total PCN to meet the needs of different decision making levels. The practical application shows that the method presented is novel, and the results are in line with the uncertainty decision making. Furthermore, the current status and development trend of schemes are taken into account to make decision making progress more reasonable and operable.

    @article{shen2022ivi,
    title = {Interval-valued intuitionistic fuzzy multi-attribute second-order decision making based on partial connection numbers of set pair analysis},
    author = {Qing Shen and Xiongtao Zhang and Jungang Lou and Yong Liu and Yunliang Jiang},
    year = 2022,
    journal = {Soft Computing},
    doi = {10.1007/s00500-022-07314-2}
    }

  • G. Zhai, Y. Zheng, Z. Xu, X. Kong, Y. Liu, B. Busam, Y. Ren, N. Navab, and Z. Zhang, “DA^2 Dataset: Toward Dexterity-Aware Dual-Arm Grasping,” IEEE Robotics and Automation Letters (RA-L), 2022.

    In this paper, we introduce DA^2, the first large-scale dual-arm dexterity-aware dataset for the generation of optimal bimanual grasping pairs for arbitrary large objects. The dataset contains about 9M pairs of parallel-jaw grasps, generated from more than 6000 objects and each labeled with various grasp dexterity measures. In addition, we propose an end-to-end dual-arm grasp evaluation model trained on the rendered scenes from this dataset. We utilize the evaluation model as our baseline to show the value of this novel and nontrivial dataset by both online analysis and real robot experiments. All data and related code will be open-sourced at https://sites.google.com/view/da2dataset.

    @article{zhai2022ddt,
    title = {DA^2 Dataset: Toward Dexterity-Aware Dual-Arm Grasping},
    author = {Guangyao Zhai and Yu Zheng and Ziwei Xu and Xin Kong and Yong Liu and Benjamin Busam and Yi Ren and Nassir Navab and Zhengyou Zhang},
    year = 2022,
    journal = {IEEE Robotics and Automation Letters (RA-L)},
    doi = {10.1109/LRA.2022.3189959}
    }

  • C. Xu, J. Zhang, M. Wang, G. Tian, and Y. Liu, “Multi-level Spatial-temporal Feature Aggregation for Video Object Detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.

    Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.

    @article{xu2022mls,
    title = {Multi-level Spatial-temporal Feature Aggregation for Video Object Detection},
    author = {Chao Xu and Jiangning Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Circuits and Systems for Video Technology},
    doi = {10.1109/TCSVT.2022.3183646}
    }
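
    The frame-level stage described above amounts to cross-frame attention driven by context similarity. Below is a minimal sketch of that one stage under assumed tensor shapes; it is generic similarity-weighted aggregation, not the authors' MST module.

        import torch
        import torch.nn.functional as F

        def aggregate_frames(ref, supports):
            """ref: (C,H,W) reference features; supports: (T,C,H,W) support-frame features."""
            C, H, W = ref.shape
            q = ref.reshape(C, -1).T                                        # (HW, C) queries
            k = supports.reshape(supports.shape[0], C, -1).permute(2, 0, 1) # (HW, T, C) keys
            sim = torch.einsum('nc,ntc->nt', q, k) / C ** 0.5               # pixel-wise context similarity
            w = F.softmax(sim, dim=1)                                       # weights over support frames
            agg = torch.einsum('nt,ntc->nc', w, k)                          # similarity-weighted features
            return ref + agg.T.reshape(C, H, W)                             # residually enhanced reference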

View all Publications