Xiaoyang Lv

MS Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I received my bachelor’s degree in Mechatronic Engineering from Harbin Institute of Technology, Harbin, China, and I am pursuing a master’s degree in the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My main research interest is self-supervised monocular depth estimation. You are welcome to get in touch with me.

Research and Interests

Depth Estimation

Publications

Lina Liu, Xibin Song, Jiadai Sun, Xiaoyang Lyu, Lin Li, Yong Liu, and Liangjun Zhang. MFF-Net: Towards Efficient Monocular Depth Completion with Multi-Modal Feature Fusion. IEEE Robotics and Automation Letters (RA-L), 8:920-927, 2023.

Remarkable progress has been achieved by current depth completion approaches, which produce dense depth maps from sparse depth maps and corresponding color images. However, the performances of these approaches are limited due to the insufficient feature extractions and fusions. In this work, we propose an efficient multi-modal feature fusion based depth completion framework (MFF-Net), which can efficiently extract and fuse features with different modals in both encoding and decoding processes, thus more depth details with better performance can be obtained. In specific, the encoding process contains three branches where different modals of features from both color and sparse depth input can be extracted, and a multi-feature channel shuffle is utilized to enhance these features thus features with better representation abilities can be obtained. Meanwhile, the decoding process contains two branches to sufficiently fuse the extracted multi-modal features, and a multi-level weighted combination is employed to further enhance and fuse features with different modals, thus leading to more accurate and better refined depth maps. Extensive experiments on different benchmarks demonstrate that we achieve state-of-the-art among online methods. Meanwhile, we further evaluate the predicted dense depth by RGB-D SLAM, which is a commonly used downstream robotic perception task, and higher accuracy on vehicle’s trajectory can be obtained in KITTI odometry dataset, which demonstrates the high quality of our depth prediction and the potential of improving the related downstream tasks with depth completion results.

@article{liu2023mff,
title = {MFF-Net: Towards Efficient Monocular Depth Completion with Multi-Modal Feature Fusion},
author = {Lina Liu and Xibin Song and Jiadai Sun and Xiaoyang Lyu and Lin Li and Yong Liu and Liangjun Zhang},
year = 2023,
journal = {IEEE Robotics and Automation Letters (RA-L)},
volume = 8,
pages = {920-927},
doi = {10.1109/LRA.2023.3234776},
abstract = {Remarkable progress has been achieved by current depth completion approaches, which produce dense depth maps from sparse depth maps and corresponding color images. However, the performances of these approaches are limited due to the insufficient feature extractions and fusions. In this work, we propose an efficient multi-modal feature fusion based depth completion framework (MFF-Net), which can efficiently extract and fuse features with different modals in both encoding and decoding processes, thus more depth details with better performance can be obtained. In specific, the encoding process contains three branches where different modals of features from both color and sparse depth input can be extracted, and a multi-feature channel shuffle is utilized to enhance these features thus features with better representation abilities can be obtained. Meanwhile, the decoding process contains two branches to sufficiently fuse the extracted multi-modal features, and a multi-level weighted combination is employed to further enhance and fuse features with different modals, thus leading to more accurate and better refined depth maps. Extensive experiments on different benchmarks demonstrate that we achieve state-of-the-art among online methods. Meanwhile, we further evaluate the predicted dense depth by RGB-D SLAM, which is a commonly used downstream robotic perception task, and higher accuracy on vehicle's trajectory can be obtained in KITTI odometry dataset, which demonstrates the high quality of our depth prediction and the potential of improving the related downstream tasks with depth completion results.}
}

Zizhang Li, Xiaoyang Lyu, Yuanyuan Ding, Mengmeng Wang, Yiyi Liao, and Yong Liu. RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction. In 19th IEEE/CVF International Conference on Computer Vision (ICCV), pages 17715-17725, 2023.

Recently, neural implicit surfaces have become popular for multi-view reconstruction. To facilitate practical applications like scene editing and manipulation, some works extend the framework with semantic masks input for the object-compositional reconstruction rather than the holistic perspective. Though achieving plausible disentanglement, the performance drops significantly when processing the indoor scenes where objects are usually partially observed. We propose RICO to address this by regularizing the unobservable regions for indoor compositional reconstruction. Our key idea is to first regularize the smoothness of the occluded background, which then in turn guides the foreground object reconstruction in unobservable regions based on the object-background relationship. Particularly, we regularize the geometry smoothness of occluded background patches. With the improved background surface, the signed distance function and the reversedly rendered depth of objects can be optimized to bound them within the background range. Extensive experiments show our method outperforms other methods on synthetic and real-world indoor scenes and prove the effectiveness of proposed regularizations. The code is available at https://github.com/kyleleey/RICO

@inproceedings{li2023rico,
title = {RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction},
author = {Zizhang Li and Xiaoyang Lyu and Yuanyuan Ding and Mengmeng Wang and Yiyi Liao and Yong Liu},
year = 2023,
booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
pages = {17715-17725},
doi = {10.1109/ICCV51070.2023.01628},
abstract = {Recently, neural implicit surfaces have become popular for multi-view reconstruction. To facilitate practical applications like scene editing and manipulation, some works extend the framework with semantic masks input for the object-compositional reconstruction rather than the holistic perspective. Though achieving plausible disentanglement, the performance drops significantly when processing the indoor scenes where objects are usually partially observed. We propose RICO to address this by regularizing the unobservable regions for indoor compositional reconstruction. Our key idea is to first regularize the smoothness of the occluded background, which then in turn guides the foreground object reconstruction in unobservable regions based on the object-background relationship. Particularly, we regularize the geometry smoothness of occluded background patches. With the improved background surface, the signed distance function and the reversedly rendered depth of objects can be optimized to bound them within the background range. Extensive experiments show our method outperforms other methods on synthetic and real-world indoor scenes and prove the effectiveness of proposed regularizations. The code is available at https://github.com/kyleleey/RICO}
}

Lina Liu, Xibin Song, Xiaoyang Lyu, Junwei Diao, Mengmeng Wang, Yong Liu, and Liangjun Zhang. FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.

Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate the depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates the depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage with coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from color image and coarse depth map, and an energy based fusion operation is exploited to effectively fuse these features obtained by channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.

@inproceedings{liu2020fcfrnetff,
title = {FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion},
author = {Lina Liu and Xibin Song and Xiaoyang Lyu and Junwei Diao and Mengmeng Wang and Yong Liu and Liangjun Zhang},
year = 2021,
booktitle = {Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)},
abstract = {Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate the depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates the depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage with coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from color image and coarse depth map, and an energy based fusion operation is exploited to effectively fuse these features obtained by channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.},
arxiv = {https://arxiv.org/pdf/2012.08270.pdf}
}

Xiaoyang Lyu, Liang Liu, Mengmeng Wang, Xin Kong, Lina Liu, Yong Liu, Xinxin Chen, and Yi Yuan. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.

Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation (fSE) module to fuse feature more efficiently. Using Resnet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high-resolution with only 20% parameters. All codes and models will be available at this https URL.

@inproceedings{lyu2020hrdepthhr,
title = {HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation},
author = {Xiaoyang Lyu and Liang Liu and Mengmeng Wang and Xin Kong and Lina Liu and Yong Liu and Xinxin Chen and Yi Yuan},
year = 2021,
booktitle = {Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)},
abstract = {Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation (fSE) module to fuse feature more efficiently. Using Resnet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high-resolution with only 20% parameters. All codes and models will be available at this https URL.},
arxiv = {https://arxiv.org/pdf/2012.07356.pdf}
}
