Address

Room 101, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: 186368@zju.edu.cn

Jiangning Zhang

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I am pursuing my Ph.D. degree in the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My major research interests include Generative Adversarial Networks (GANs) and Neural Architecture Design (NAD).

Research Interests

  • Low-level Computer Vision
  • Generative Adversarial Network
  • Neural Architecture Design

Publications

  • Yue Han, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Yong Liu, Lu Qi, Xiangtai Li, and Ming-Hsuan Yang. Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
    Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) Overfitting suppresses novel class objects; 2) Dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) A novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) Demonstrating that support object queries encode key factors after base training, allowing query features to be enhanced twice at both feature and query levels using simple cross-attention, thus avoiding complex spatial correlation interaction; 3) Introducing a class-enhanced base knowledge distillation loss to address the issue of DETR-like models struggling with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., +8.2/+9.4 performance gains over state-of-the-art methods with 10/30 shots. Source code and models will be available at this GitHub site.
    @article{han2024rta,
    title = {Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation},
    author = {Yue Han and Jiangning Zhang and Yabiao Wang and Chengjie Wang and Yong Liu and Lu Qi and Xiangtai Li and Ming-Hsuan Yang},
    year = 2024,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    doi = {10.1109/TPAMI.2024.3421340},
    }
  • Jiangning Zhang, Xiangtai Li, Yabiao Wang, Chengjie Wang, Yibo Yang, Yong Liu, and Dacheng Tao. EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm. International Journal of Computer Vision, 132:3509-3536, 2024.
    Motivated by biological evolution, this paper explains the rationality of Vision Transformer by analogy with the proven practical evolutionary algorithm (EA) and derives that both have a consistent mathematical formulation. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a task-related head docked with the transformer backbone to complete final information fusion more flexibly and improve a modulated deformable MSA to dynamically model irregular locations. Massive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. E.g., our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 only trained on ImageNet-1K with a naive training recipe; EATFormer-Tiny/Small/Base armed Mask-R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP separately with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K by Upernet that exceeds Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
    @article{zhang2024eat,
    title = {EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm},
    author = {Jiangning Zhang and Xiangtai Li and Yabiao Wang and Chengjie Wang and Yibo Yang and Yong Liu and Dacheng Tao},
    year = 2024,
    journal = {International Journal of Computer Vision},
    volume = 132,
    pages = {3509-3536},
    doi = {10.1007/s11263-024-02034-6},
    }
  • Yufei Liang, Jiangning Zhang, Shiwei Zhao, Runze Wu, Yong Liu, and Shuwen Pan. Omni-Frequency Channel-Selection Representations for Unsupervised Anomaly Detection. IEEE Transactions on Image Processing, 32:4327-4340, 2023.
    Density-based and classification-based methods have ruled unsupervised anomaly detection in recent years, while reconstruction-based methods are rarely mentioned because of their poor reconstruction ability and low performance. However, the latter require no costly extra training samples for unsupervised training, which is more practical, so this paper focuses on improving reconstruction-based methods and proposes a novel Omni-frequency Channel-selection Reconstruction (OCR-GAN) network to handle the sensory anomaly detection task from a frequency perspective. Concretely, we propose a Frequency Decoupling (FD) module to decouple the input image into different frequency components and model the reconstruction process as a combination of parallel omni-frequency image restorations, as we observe a significant difference in the frequency distribution of normal and abnormal images. Given the correlation among multiple frequencies, we further propose a Channel Selection (CS) module that performs frequency interaction among different encoders by adaptively selecting different channels. Abundant experiments demonstrate the effectiveness and superiority of our approach over different kinds of methods, e.g., achieving a new state-of-the-art 98.3 detection AUC on the MVTec AD dataset without extra training data, which markedly surpasses the reconstruction-based baseline by +38.1 and the current SOTA method by +0.3. The source code is available in the additional materials.
    @article{liang2023omni,
    title = {Omni-Frequency Channel-Selection Representations for Unsupervised Anomaly Detection},
    author = {Yufei Liang and Jiangning Zhang and Shiwei Zhao and Runze Wu and Yong Liu and Shuwen Pan},
    year = 2023,
    journal = {IEEE Transactions on Image Processing},
    volume = 32,
    pages = {4327-4340},
    doi = {10.1109/TIP.2023.3293772},
    }
  • Tianxin Huang, Hao Zou, Jinhao Cui, Jiangning Zhang, Xuemeng Yang, Lin Li, and Yong Liu. Adaptive Recurrent Forward Network for Dense Point Cloud Completion. IEEE Transactions on Multimedia, 25:5903-5915, 2023.
    Point cloud completion is an interesting and challenging task in 3D vision, which aims to recover complete shapes from sparse and incomplete point clouds. Existing completion networks often require a vast number of parameters and substantial computational costs to achieve a high performance level, which may limit their practical application. In this work, we propose a novel Adaptive efficient Recurrent Forward Network (ARFNet), which is composed of three parts: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). In an RFE, multiple short global features are extracted from incomplete point clouds, while a dense quantity of completed results are generated in a coarse-to-fine pipeline in the FDC. Finally, we propose the Adamerge module to preserve the details from the original models by merging the generated results with the original incomplete point clouds in the RSP. In addition, we introduce the Sampling Chamfer Distance to better capture the shapes of the models and the balanced expansion constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve state-of-the-art completion performances on dense point clouds with fewer parameters, smaller model sizes, lower memory costs and a faster convergence.
    @article{huang2022arf,
    title = {Adaptive Recurrent Forward Network for Dense Point Cloud Completion},
    author = {Tianxin Huang and Hao Zou and Jinhao Cui and Jiangning Zhang and Xuemeng Yang and Lin Li and Yong Liu},
    year = 2023,
    journal = {IEEE Transactions on Multimedia},
    volume = {25},
    pages = {5903-5915},
    doi = {10.1109/TMM.2022.3200851},
    }
  • Xintian Shen, Jiangning Zhang, Jun Chen, Shipeng Bai, Yue Han, Yabiao Wang, Chengjie Wang, and Yong Liu. Learning Global-Aware Kernel for Image Harmonization. In 19th IEEE/CVF International Conference on Computer Vision (ICCV), pages 7501-7510, 2023.
    Image harmonization aims to solve the visual inconsistency problem in composited images by adaptively adjusting the foreground pixels with the background as references. Existing methods employ local color transformation or region matching between foreground and background, which neglects powerful proximity prior and independently distinguishes fore-/back-ground as a whole part for harmonization. As a result, they still show a limited performance across varied foreground objects and scenes. To address this issue, we propose a novel Global-aware Kernel Network (GKNet) to harmonize local regions with comprehensive consideration of long-distance background references. Specifically, GKNet includes two parts, i.e., harmony kernel prediction and harmony kernel modulation branches. The former includes a Long-distance Reference Extractor (LRE) to obtain long-distance context and Kernel Prediction Blocks (KPB) to predict multi-level harmony kernels by fusing global information with local features. To achieve this goal, a novel Selective Correlation Fusion (SCF) module is proposed to better select relevant long-distance background references for local harmonization. The latter employs the predicted kernels to harmonize foreground regions with local and global awareness. Abundant experiments demonstrate the superiority of our method for image harmonization over state-of-the-art methods, e.g., achieving 39.53dB PSNR that surpasses the best counterpart by +0.78dB↑; decreasing fMSE/MSE by 11.5%↓/6.7%↓ compared with the SoTA method. Code will be available here.
    @inproceedings{shen2023lga,
    title = {Learning Global-Aware Kernel for Image Harmonization},
    author = {Xintian Shen and Jiangning Zhang and Jun Chen and Shipeng Bai and Yue Han and Yabiao Wang and Chengjie Wang and Yong Liu},
    year = 2023,
    booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
    pages = {7501-7510},
    doi = {10.1109/ICCV51070.2023.00693},
    }
  • Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, and Yong Liu. High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6609-6619, 2023.
    Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A followed style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.
    @inproceedings{xu2023hfg,
    title = {High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning},
    author = {Chao Xu and Junwei Zhu and Jiangning Zhang and Yue Han and Wenqing Chu and Ying Tai and Chengjie Wang and Zhifeng Xie and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {6609-6619},
    doi = {10.1109/CVPR52729.2023.00639},
    }
  • Xuhai Chen, Jiangning Zhang, Chao Xu, Yabiao Wang, Chengjie Wang, and Yong Liu. Better “CMOS” Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1651-1661, 2023.
    Most of the existing blind image Super-Resolution (SR) methods assume that the blur kernels are space-invariant. However, the blur involved in real applications is usually space-variant due to object motion, out-of-focus, etc., resulting in a severe performance drop of the advanced SR methods. To address this problem, we first introduce two new datasets with out-of-focus blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research on blind SR with space-variant blur. Based on the datasets, we design a novel Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics simultaneously, which leads to improved SR results. It involves a feature Grouping Interactive Attention (GIA) module to make the two modalities interact more effectively and avoid inconsistency. GIA can also be used for the interaction of other features because of the universality of its structure. Qualitative and quantitative experiments compared with state-of-the-art methods on the above datasets and real-world images demonstrate the superiority of our method, e.g., improving PSNR/SSIM by +1.91↑/+0.0048↑ on NYUv2-BSR over MANet. Code: https://github.com/ByChelsea/CMOS.git.
    @inproceedings{chen2023cmos,
    title = {Better "CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution},
    author = {Xuhai Chen and Jiangning Zhang and Chao Xu and Yabiao Wang and Chengjie Wang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {1651-1661},
    doi = {10.1109/CVPR52729.2023.00165},
    }
  • Tianxin Huang, Zhonggan Ding, Jiangning Zhang, Ying Tai, Zhenyu Zhang, Mingang Chen, Chengjie Wang, and Yong Liu. Learning to Measure the Point Cloud Reconstruction Loss in a Representation Space. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12208-12217, 2023.
    For point cloud reconstruction-related tasks, the reconstruction losses to evaluate the shape differences between reconstructed results and the ground truths are typically used to train the task networks. Most existing works measure the training loss with point-to-point distance, which may introduce extra defects as predefined matching rules may deviate from the real shape differences. Although some learning-based works have been proposed to overcome the weaknesses of manually-defined rules, they still measure the shape differences in 3D Euclidean space, which may limit their ability to capture defects in reconstructed shapes. In this work, we propose a learning-based Contrastive Adversarial Loss (CALoss) to measure the point cloud reconstruction loss dynamically in a non-linear representation space by combining the contrastive constraint with the adversarial strategy. Specifically, we use the contrastive constraint to help CALoss learn a representation space with shape similarity, while we introduce the adversarial strategy to help CALoss mine differences between reconstructed results and ground truths. According to experiments on reconstruction-related tasks, CALoss can help task networks improve reconstruction performances and learn more representative representations.
    @inproceedings{huang2023ltm,
    title = {Learning to Measure the Point Cloud Reconstruction Loss in a Representation Space},
    author = {Tianxin Huang and Zhonggan Ding and Jiangning Zhang and Ying Tai and Zhenyu Zhang and Mingang Chen and Chengjie Wang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {12208-12217},
    doi = {10.1109/CVPR52729.2023.01175},
    }
  • Tianxin Huang, Jiangning Zhang, Jun Chen, Zhonggan Ding, Ying Tai, Zhenyu Zhang, Chengjie Wang, and Yong Liu. 3QNet: 3D Point Cloud Geometry Quantization Compression Network. ACM Transactions on Graphics, 2022.
    Since the development of 3D applications, the point cloud, as a spatial description easily acquired by sensors, has been widely used in multiple areas such as SLAM and 3D reconstruction. Point Cloud Compression (PCC) has also attracted more attention as a primary step before point cloud transferring and saving, where the geometry compression is an important component of PCC to compress the points' geometrical structures. However, existing non-learning-based geometry compression methods are often limited by manually pre-defined compression rules. Though learning-based compression methods can significantly improve the algorithm performances by learning compression rules from data, they still have some defects. Voxel-based compression networks introduce precision errors due to the voxelized operations, while point-based methods may have relatively weak robustness and are mainly designed for sparse point clouds. In this work, we propose a novel learning-based point cloud compression framework named 3D Point Cloud Geometry Quantization Compression Network (3QNet), which overcomes the robustness limitation of existing point-based methods and can handle dense points. By learning a codebook including common structural features from simple and sparse shapes, 3QNet can efficiently deal with multiple kinds of point clouds. According to experiments on object models, indoor scenes, and outdoor scans, 3QNet can achieve better compression performances than many representative methods.
    @article{huang2022Net,
    title = {3QNet: 3D Point Cloud Geometry Quantization Compression Network},
    author = {Tianxin Huang and Jiangning Zhang and Jun Chen and Zhonggan Ding and Ying Tai and Zhenyu Zhang and Chengjie Wang and Yong Liu},
    year = 2022,
    journal = {ACM Transactions on Graphics},
    doi = {10.1145/3550454.3555481},
    }
  • Tianxin Huang, Jun Chen, Jiangning Zhang, Yong Liu, and Jie Liang. Fast Point Cloud Sampling Network. Pattern Recognition Letters, 2022.
    [BibTeX] [Abstract] [DOI]
    The increasing number of points in 3D point clouds has brought great challenges for subsequent algorithm efficiencies. Down-sampling algorithms are adopted to simplify the data and accelerate the computation. Except for the well-known random sampling and farthest distance sampling, some recent works have tried to learn a sampling pattern according to the downstream task, which helps generate sampled points by fully-connected networks with fixed output point numbers. In this condition, a progress-net structure covering all resolutions sampling networks or multiple separate sampling networks for different resolutions are required, which is inconvenient. In this work, we propose a novel learning-based point cloud sampling framework, named Fast point cloud sampling network (FPN), which drives initial randomly sampled points to better positions instead of generating coordinates. FPN can be used to sample point clouds to any resolution once trained by changing the number of initial randomly sampled points. Results on point cloud reconstruction and recognition confirm that FPN can reach state-of-the-art performances with much higher sampling efficiency than most existing sampling methods.
    @article{huang2022fast,
    title = {Fast Point Cloud Sampling Network},
    author = {Tianxin Huang and Jun Chen and Jiangning Zhang and Yong Liu and Jie Liang},
    year = 2022,
    journal = {Pattern Recognition Letters},
    doi = {10.1016/j.patrec.2022.11.006},
    abstract = {The increasing number of points in 3D point clouds has brought great challenges for subsequent algorithm efficiencies. Down-sampling algorithms are adopted to simplify the data and accelerate the computation. Except for the well-known random sampling and farthest distance sampling, some recent works have tried to learn a sampling pattern according to the downstream task, which helps generate sampled points by fully-connected networks with fixed output point numbers. In this condition, a progress-net structure covering all resolutions sampling networks or multiple separate sampling networks for different resolutions are required, which is inconvenient. In this work, we propose a novel learning-based point cloud sampling framework, named Fast point cloud sampling network (FPN), which drives initial randomly sampled points to better positions instead of generating coordinates. FPN can be used to sample point clouds to any resolution once trained by changing the number of initial randomly sampled points. Results on point cloud reconstruction and recognition confirm that FPN can reach state-of-the-art performances with much higher sampling efficiency than most existing sampling methods.}
    }
  • Chao Xu, Jiangning Zhang, Mengmeng Wang, Guanzhong Tian, and Yong Liu. Multi-level Spatial-temporal Feature Aggregation for Video Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7809-7820, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.
    @article{xu2022mls,
    title = {Multi-level Spatial-temporal Feature Aggregation for Video Object Detection},
    author = {Chao Xu and Jiangning Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Circuits and Systems for Video Technology},
    volume = {32},
    number = {11},
    pages = {7809--7820},
    doi = {10.1109/TCSVT.2022.3183646},
    abstract = {Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.}
    }
  • Jiangning Zhang, Xianfang Zeng, Chao Xu, and Yong Liu. Real-Time Audio-Guided Multi-Face Reenactment. IEEE Signal Processing Letters, 29:1–5, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Audio-guided face reenactment aims to generate authentic target faces that have matched facial expression of the input audio, and many learning-based methods have successfully achieved this. However, most methods can only reenact a particular person once trained or suffer from the low-quality generation of the target images. Also, nearly none of the current reenactment works consider the model size and running speed that are important for practical use. To solve the above challenges, we propose an efficient Audio-guided Multi-face reenactment model named AMNet, which can reenact target faces among multiple persons with corresponding source faces and drive signals as inputs. Concretely, we design a Geometric Controller (GC) module to inject the drive signals so that the model can be optimized in an end-to-end manner and generate more authentic images. Also, we adopt a lightweight network for our face reenactor so that the model can run in real time on both CPU and GPU devices. Abundant experiments prove our approach’s superiority over existing methods, e.g., decreasing FID by 0.12 and increasing SSIM by 0.031 on average compared with APB2Face, while owning fewer parameters (×4↓) and faster CPU speed (×4↑).
    @article{zhang2022rta,
    title = {Real-Time Audio-Guided Multi-Face Reenactment},
    author = {Jiangning Zhang and Xianfang Zeng and Chao Xu and Yong Liu},
    year = 2022,
    journal = {IEEE Signal Processing Letters},
    volume = {29},
    pages = {1--5},
    doi = {10.1109/LSP.2021.3116506},
    abstract = {Audio-guided face reenactment aims to generate authentic target faces that have matched facial expression of the input audio, and many learning-based methods have successfully achieved this. However, most methods can only reenact a particular person once trained or suffer from the low-quality generation of the target images. Also, nearly none of the current reenactment works consider the model size and running speed that are important for practical use. To solve the above challenges, we propose an efficient Audio-guided Multi-face reenactment model named AMNet, which can reenact target faces among multiple persons with corresponding source faces and drive signals as inputs. Concretely, we design a Geometric Controller (GC) module to inject the drive signals so that the model can be optimized in an end-to-end manner and generate more authentic images. Also, we adopt a lightweight network for our face reenactor so that the model can run in real time on both CPU and GPU devices. Abundant experiments prove our approach's superiority over existing methods, e.g., decreasing FID by 0.12 and increasing SSIM by 0.031 on average compared with APB2Face, while owning fewer parameters (×4↓) and faster CPU speed (×4↑).}
    }
  • Tianxin Huang, Xuemeng Yang, Jiangning Zhang, Jinhao Cui, Hao Zou, Jun Chen, Xiangrui Zhao, and Yong Liu. Learning to Train a Point Cloud Reconstruction Network Without Matching. In European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]
    Reconstruction networks for well-ordered data such as 2D images and 1D continuous signals are easy to optimize through element-wise squared errors, while permutation-arbitrary point clouds cannot be constrained directly because their point permutations are not fixed. Though existing works design algorithms to match two point clouds and evaluate shape errors based on matched results, they are limited by pre-defined matching processes. In this work, we propose a novel framework named PCLossNet which learns to train a point cloud reconstruction network without any matching. By training through an adversarial process together with the reconstruction network, PCLossNet can better explore the differences between point clouds and create more precise reconstruction results. Experiments on multiple datasets prove the superiority of our method, where PCLossNet can help networks achieve much lower reconstruction errors and extract more representative features, with about 4 times faster training efficiency than the commonly-used EMD loss. Our codes can be found in https://github.com/Tianxinhuang/PCLossNet.
    @inproceedings{huang2022ltt,
    title = {Learning to Train a Point Cloud Reconstruction Network Without Matching},
    author = {Tianxin Huang and Xuemeng Yang and Jiangning Zhang and Jinhao Cui and Hao Zou and Jun Chen and Xiangrui Zhao and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-19769-7_11},
    abstract = {Reconstruction networks for well-ordered data such as 2D images and 1D continuous signals are easy to optimize through element-wise squared errors, while permutation-arbitrary point clouds cannot be constrained directly because their point permutations are not fixed. Though existing works design algorithms to match two point clouds and evaluate shape errors based on matched results, they are limited by pre-defined matching processes. In this work, we propose a novel framework named PCLossNet which learns to train a point cloud reconstruction network without any matching. By training through an adversarial process together with the reconstruction network, PCLossNet can better explore the differences between point clouds and create more precise reconstruction results. Experiments on multiple datasets prove the superiority of our method, where PCLossNet can help networks achieve much lower reconstruction errors and extract more representative features, with about 4 times faster training efficiency than the commonly-used EMD loss. Our codes can be found in https://github.com/Tianxinhuang/PCLossNet.}
    }
  • Tianxin Huang, Jiangning Zhang, Jun Chen, Yuang Liu, and Yong Liu. Resolution-free Point Cloud Sampling Network with Data Distillation. In European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]
    Down-sampling algorithms are adopted to simplify the point clouds and save the computation cost on subsequent tasks. Existing learning-based sampling methods often need to train a big sampling network to support sampling under different resolutions, which must generate sampled points with the costly maximum resolution even if only low-resolution points need to be sampled. In this work, we propose a novel resolution-free point cloud sampling network to directly sample the original point cloud to different resolutions, which is conducted by optimizing non-learning-based initial sampled points to better positions. Besides, we introduce data distillation to assist the training process by considering the differences between task network outputs from original point clouds and sampled points. Experiments on point cloud reconstruction and recognition tasks demonstrate that our method can achieve SOTA performances with lower time and memory cost than existing learning-based sampling strategies. Codes are available at https://github.com/Tianxinhuang/PCDNet.
    @inproceedings{huang2022rfp,
    title = {Resolution-free Point Cloud Sampling Network with Data Distillation},
    author = {Tianxin Huang and Jiangning Zhang and Jun Chen and Yuang Liu and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-20086-1_4},
    abstract = {Down-sampling algorithms are adopted to simplify the point clouds and save the computation cost on subsequent tasks. Existing learning-based sampling methods often need to train a big sampling network to support sampling under different resolutions, which must generate sampled points with the costly maximum resolution even if only low-resolution points need to be sampled. In this work, we propose a novel resolution-free point cloud sampling network to directly sample the original point cloud to different resolutions, which is conducted by optimizing non-learning-based initial sampled points to better positions. Besides, we introduce data distillation to assist the training process by considering the differences between task network outputs from original point clouds and sampled points. Experiments on point cloud reconstruction and recognition tasks demonstrate that our method can achieve SOTA performances with lower time and memory cost than existing learning-based sampling strategies. Codes are available at https://github.com/Tianxinhuang/PCDNet.}
    }
  • Chao Xu, Jiangning Zhang, Yue Han, and Yong Liu. Designing One Unified framework for High-Fidelity Face Reenactment and Swapping. In European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]
    Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and impractical. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute in an unsupervised manner. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We join AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.
    @inproceedings{xu2022dou,
    title = {Designing One Unified framework for High-Fidelity Face Reenactment and Swapping},
    author = {Chao Xu and Jiangning Zhang and Yue Han and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-19784-0_4},
    abstract = {Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and impractical. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute in an unsupervised manner. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We join AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.}
    }
  • Chao Xu, Jiangning Zhang, Miao Hua, Qian He, Zili Yi, and Yong Liu. Region-Aware Face Swapping. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    This paper presents a novel Region-Aware Face Swapping (RAFSwap) network to achieve identity-consistent harmonious high-resolution face generation in a local-global manner: 1) Local Facial Region-Aware (FRA) branch augments local identity-relevant features by introducing the Transformer to effectively model misaligned cross-scale semantic interaction. 2) Global Source Feature Adaptive (SFA) branch further complements global identity-relevant cues for generating identity-consistent swapped faces. Besides, we propose a Face Mask Predictor (FMP) module incorporated with StyleGAN2 to predict identity-relevant soft facial masks in an unsupervised manner that is more practical for generating harmonious high-resolution faces. Abundant experiments qualitatively and quantitatively demonstrate the superiority of our method for generating more identity-consistent high-resolution swapped faces over SOTA methods, e.g., obtaining 96.70 ID retrieval that outperforms SOTA MegaFS by 5.87↑.
    @inproceedings{xu2022raf,
    title = {Region-Aware Face Swapping},
    author = {Chao Xu and Jiangning Zhang and Miao Hua and Qian He and Zili Yi and Yong Liu},
    year = 2022,
    booktitle = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    doi = {10.48550/arXiv.2203.04564},
    abstract = {This paper presents a novel Region-Aware Face Swapping (RAFSwap) network to achieve identity-consistent harmonious high-resolution face generation in a local-global manner: 1) Local Facial Region-Aware (FRA) branch augments local identity-relevant features by introducing the Transformer to effectively model misaligned cross-scale semantic interaction. 2) Global Source Feature Adaptive (SFA) branch further complements global identity-relevant cues for generating identity-consistent swapped faces. Besides, we propose a Face Mask Predictor (FMP) module incorporated with StyleGAN2 to predict identity-relevant soft facial masks in an unsupervised manner that is more practical for generating harmonious high-resolution faces. Abundant experiments qualitatively and quantitatively demonstrate the superiority of our method for generating more identity-consistent high-resolution swapped faces over SOTA methods, e.g., obtaining 96.70 ID retrieval that outperforms SOTA MegaFS by 5.87↑.}
    }
  • Jiangning Zhang, Chao Xu, Jian Li, Yue Han, Yabiao Wang, Ying Tai, and Yong Liu. SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling operation for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneously Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes that is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).
    @inproceedings{zhang2022scs,
    title = {SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution},
    author = {Jiangning Zhang and Chao Xu and Jian Li and Yue Han and Yabiao Wang and Ying Tai and Yong Liu},
    year = 2022,
    booktitle = {Proceedings of the 36th AAAI Conference on Artificial Intelligence},
    doi = {10.48550/arXiv.2201.04364},
    abstract = {In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling operation for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneously Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes that is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).}
    }
  • Guanzhong Tian, Yiran Sun, Yuang Liu, Xianfang Zeng, Mengmeng Wang, Yong Liu, Jiangning Zhang, and Jun Chen. Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention. IEEE Transactions on Neural Networks and Learning Systems, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Filter pruning is a significant feature selection technique to shrink the existing feature fusion schemes (especially on convolution calculation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somewhat tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline suffers from being sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue through one stage: using an attention-based architecture that adaptively fuses the filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) to make the model focus on the filters of higher significance by training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate the gradients in the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on the complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on the two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.
    @article{tian2021abp,
    title = {Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention},
    author = {Guanzhong Tian and Yiran Sun and Yuang Liu and Xianfang Zeng and Mengmeng Wang and Yong Liu and Jiangning Zhang and Jun Chen},
    year = 2021,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2021.3106917},
    abstract = {Filter pruning is a significant feature selection technique to shrink the existing feature fusion schemes (especially on convolution calculation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somewhat tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline suffers from being sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue through one stage: using an attention-based architecture that adaptively fuses the filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) to make the model focus on the filters of higher significance by training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate the gradients in the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on the complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on the two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.}
    }
  • Jiangning Zhang, Chao Xu, Xiangrui Zhao, Liang Liu, Yong Liu, Jinqiang Yao, and Zaisheng Pan. Learning hierarchical and efficient Person re-identification for robotic navigation. International Journal of Intelligent Robotics and Applications, 5:104–118, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recent works in the person re-identification task mainly focus on the model accuracy while ignoring factors related to efficiency, e.g., model size and latency, which are critical for practical application. In this paper, we propose a novel Hierarchical and Efficient Network (HENet) that learns hierarchical global, partial, and recovery features ensemble under the supervision of multiple loss combinations. To further improve the robustness against the irregular occlusion, we propose a new dataset augmentation approach, dubbed random polygon erasing, to randomly erase the input image’s irregular area imitating the body part missing. We also propose an Efficiency Score (ES) metric to evaluate the model efficiency. Extensive experiments on Market1501, DukeMTMC-ReID, and CUHK03 datasets show the efficiency and superiority of our approach compared with epoch-making methods. We further deploy HENet on a robotic car, and the experimental result demonstrates the effectiveness of our method for robotic navigation.
    @article{zhang2021lha,
    title = {Learning hierarchical and efficient Person re-identification for robotic navigation},
    author = {Jiangning Zhang and Chao Xu and Xiangrui Zhao and Liang Liu and Yong Liu and Jinqiang Yao and Zaisheng Pan},
    year = 2021,
    journal = {International Journal of Intelligent Robotics and Applications},
    volume = 5,
    pages = {104--118},
    doi = {10.1007/s41315-021-00167-2},
    number = 2,
    abstract = {Recent works in the person re-identification task mainly focus on the model accuracy while ignoring factors related to efficiency, e.g., model size and latency, which are critical for practical application. In this paper, we propose a novel Hierarchical and Efficient Network (HENet) that learns hierarchical global, partial, and recovery features ensemble under the supervision of multiple loss combinations. To further improve the robustness against the irregular occlusion, we propose a new dataset augmentation approach, dubbed random polygon erasing, to randomly erase the input image's irregular area imitating the body part missing. We also propose an Efficiency Score (ES) metric to evaluate the model efficiency. Extensive experiments on Market1501, DukeMTMC-ReID, and CUHK03 datasets show the efficiency and superiority of our approach compared with epoch-making methods. We further deploy HENet on a robotic car, and the experimental result demonstrates the effectiveness of our method for robotic navigation.}
    }
  • Jiangning Zhang, Chao Xu, Jian Li, Wenzhou Chen, Yabiao Wang, Ying Tai, Shuo Chen, Chengjie Wang, Feiyue Huang, and Yong Liu. Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model. In Advances in Neural Information Processing Systems 34 – 35th Conference on Neural Information Processing Systems, pages 26674–26688, 2021.
    [BibTeX] [Abstract] [PDF]
    Inspired by biological evolution, we explain the rationality of Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA) and derive that both of them have consistent mathematical representation. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly. Moreover, we introduce the space-filling curve into the current vision transformer to sequence image data into a uniform sequential format. Thus we can design a unified EAT framework to address multi-modal tasks, separating the network architecture from the data format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works while having fewer parameters and greater throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., Text-Based Image Retrieval, and our approach improves the rank-1 by +3.7 points over the baseline on the CSS dataset.
    @inproceedings{zhang2021analogous,
    title = {Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model},
    author = {Jiangning Zhang and Chao Xu and Jian Li and Wenzhou Chen and Yabiao Wang and Ying Tai and Shuo Chen and Chengjie Wang and Feiyue Huang and Yong Liu},
    year = 2021,
    booktitle = {Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems},
    pages = {26674--26688},
    abstract = {Inspired by biological evolution, we explain the rationality of Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA) and derive that both of them have consistent mathematical representation. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly. Moreover, we introduce the space-filling curve into the current vision transformer to sequence image data into a uniform sequential format. Thus we can design a unified EAT framework to address multi-modal tasks, separating the network architecture from the data format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works while having fewer parameters and greater throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., Text-Based Image Retrieval, and our approach improves the rank-1 by +3.7 points over the baseline on the CSS dataset.}
    }
  • Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, and Feiyue Huang. Learning by Analogy: Reliable Supervision From Transformations for Unsupervised Optical Flow Estimation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6488–6497, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    Unsupervised learning of optical flow, which leverages the supervision from view synthesis, has emerged as a promising alternative to supervised methods. However, the objective of unsupervised learning is likely to be unreliable in challenging scenes. In this work, we present a framework to use more reliable supervision from transformations. It simply twists the general unsupervised learning pipeline by running another forward pass with transformed data from augmentation, along with using transformed predictions of original data as the self-supervision signal. Besides, we further introduce a lightweight network with multiple frames by a highly-shared flow decoder. Our method consistently gets a leap of performance on several benchmarks with the best accuracy among deep unsupervised methods. Also, our method achieves results competitive with recent fully supervised methods while using far fewer parameters.
    @inproceedings{liu2020learningba,
    title = {Learning by Analogy: Reliable Supervision From Transformations for Unsupervised Optical Flow Estimation},
    author = {Liang Liu and Jiangning Zhang and Ruifei He and Yong Liu and Yabiao Wang and Ying Tai and Donghao Luo and Chengjie Wang and Jilin Li and Feiyue Huang},
    year = 2020,
    booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {6488--6497},
    doi = {10.1109/cvpr42600.2020.00652},
    abstract = {Unsupervised learning of optical flow, which leverages the supervision from view synthesis, has emerged as a promising alternative to supervised methods. However, the objective of unsupervised learning is likely to be unreliable in challenging scenes. In this work, we present a framework to use more reliable supervision from transformations. It simply twists the general unsupervised learning pipeline by running another forward pass with transformed data from augmentation, along with using transformed predictions of original data as the self-supervision signal. Besides, we further introduce a lightweight network with multiple frames by a highly-shared flow decoder. Our method consistently gets a leap of performance on several benchmarks with the best accuracy among deep unsupervised methods. Also, our method achieves results competitive with recent fully supervised methods while using far fewer parameters.},
    arxiv = {http://arxiv.org/pdf/2003.13045}
    }
  • Xianfang Zeng, Yusu Pan, Mengmeng Wang, Jiangning Zhang, and Yong Liu. Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmark or boundary. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in the conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experimental results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.
    @inproceedings{zeng2020realisticfr,
    title = {Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose},
    author = {Xianfang Zeng and Yusu Pan and Mengmeng Wang and Jiangning Zhang and Yong Liu},
    year = 2020,
    booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
    doi = {10.1609/AAAI.V34I07.6970},
    abstract = {Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmark or boundary. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in the conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experimental results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.},
    arxiv = {https://arxiv.org/pdf/2003.12957.pdf}
    }
  • Jiangning Zhang, Liang Liu, Zhucun Xue, and Yong Liu. APB2FACE: Audio-Guided Face Reenactment with Auxiliary Pose and Blink Signals. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4402–4406, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    Audio-guided face reenactment aims at generating photorealistic faces using audio information while maintaining the same facial movement as when speaking to a real person. However, existing methods cannot generate vivid face images or only reenact low-resolution faces, which limits the application value. To solve those problems, we propose a novel deep neural network named APB2Face, which consists of GeometryPredictor and FaceReenactor modules. GeometryPredictor uses extra head pose and blink state signals as well as audio to predict the latent landmark geometry information, while FaceReenactor inputs the face landmark image to reenact the photorealistic face. A new dataset AnnVI collected from YouTube is presented to support the approach, and experimental results indicate the superiority of our method over state-of-the-art approaches in both authenticity and controllability.
    @inproceedings{zhang2020apb2faceaf,
    title = {APB2FACE: Audio-Guided Face Reenactment with Auxiliary Pose and Blink Signals},
    author = {Jiangning Zhang and Liang Liu and Zhucun Xue and Yong Liu},
    year = 2020,
    booktitle = {2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    pages = {4402--4406},
    doi = {10.1109/ICASSP40776.2020.9052977},
    abstract = {Audio-guided face reenactment aims at generating photorealistic faces using audio information while maintaining the same facial movement as when speaking to a real person. However, existing methods cannot generate vivid face images or only reenact low-resolution faces, which limits the application value. To solve those problems, we propose a novel deep neural network named APB2Face, which consists of GeometryPredictor and FaceReenactor modules. GeometryPredictor uses extra head pose and blink state signals as well as audio to predict the latent landmark geometry information, while FaceReenactor inputs the face landmark image to reenact the photorealistic face. A new dataset AnnVI collected from YouTube is presented to support the approach, and experimental results indicate the superiority of our method over state-of-the-art approaches in both authenticity and controllability.},
    arxiv = {http://arxiv.org/pdf/2004.14569}
    }
  • Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, and Yunliang Jiang. DTVNet: Dynamic time-lapse video generation via single still image. In ECCV, pages 300–315, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that is processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on the Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.
    @inproceedings{zhang2020dtvnet,
    title = {{DTVNet}: Dynamic time-lapse video generation via single still image},
    author = {Zhang, Jiangning and Xu, Chao and Liu, Liang and Wang, Mengmeng and Wu, Xia and Liu, Yong and Jiang, Yunliang},
    year = 2020,
    booktitle = {{ECCV}},
    pages = {300--315},
    doi = {10.1007/978-3-030-58558-7_18},
    abstract = {This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that is processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on the Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.},
    arxiv = {https://arxiv.org/abs/2008.04776}
    }
  • Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, and Changjie Fan. FReeNet: Multi-Identity Face Reenactment. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5325–5334, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.
    @inproceedings{zhang2020freenetmf,
    title = {FReeNet: Multi-Identity Face Reenactment},
    author = {Jiangning Zhang and Xianfang Zeng and Mengmeng Wang and Yusu Pan and Liang Liu and Yong Liu and Yu Ding and Changjie Fan},
    year = 2020,
    booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {5325--5334},
    doi = {10.1109/cvpr42600.2020.00537},
    abstract = {This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.},
    arxiv = {http://arxiv.org/pdf/1905.11805}
    }
  • Liang Liu, Yong Liu, and Jiangning Zhang. Learning-Based Hand Motion Capture and Understanding in Assembly Process. IEEE Transactions on Industrial Electronics, 66:9703–9712, 2019.
    [BibTeX] [Abstract] [DOI] [PDF]
    Manual assembly is still an essential part in modern manufacturing. Understanding the actual state of the assembly process can not only improve quality control of products, but also collect comprehensive data for production planning and proficiency assessments. Addressing the rising complexity led by the uncertainty in manual assembly, this paper presents an efficient approach to automatically capture and analyze hand operations in the assembly process. In this paper, a detection-based tracking method is introduced to capture trajectories of hand movement from the camera installed in each workstation. Then, the actions in hand trajectories are identified with a novel temporal action localization model. The experimental results have proved that our method reached the application level with high accuracy and a low computational cost. The proposed system is lightweight enough to be quickly set up on an embedded computing device for real-time online inference and on a cloud server for offline analysis as well.
    @article{liu2019learningbasedhm,
    title = {Learning-Based Hand Motion Capture and Understanding in Assembly Process},
    author = {Liang Liu and Yong Liu and Jiangning Zhang},
    year = 2019,
    journal = {IEEE Transactions on Industrial Electronics},
    volume = 66,
    pages = {9703--9712},
    doi = {10.1109/TIE.2018.2884206},
    abstract = {Manual assembly is still an essential part in modern manufacturing. Understanding the actual state of the assembly process can not only improve quality control of products, but also collect comprehensive data for production planning and proficiency assessments. Addressing the rising complexity led by the uncertainty in manual assembly, this paper presents an efficient approach to automatically capture and analyze hand operations in the assembly process. In this paper, a detection-based tracking method is introduced to capture trajectories of hand movement from the camera installed in each workstation. Then, the actions in hand trajectories are identified with a novel temporal action localization model. The experimental results have proved that our method reached the application level with high accuracy and a low computational cost. The proposed system is lightweight enough to be quickly set up on an embedded computing device for real-time online inference and on a cloud server for offline analysis as well.}
    }