Address

Room 101, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: zzlongjuanfeng@zju.edu.cn

Xianfang Zeng

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I am pursuing my Ph.D. degree in the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My research interests include computer vision and image generation.

Research and Interests

  • Computer Vision
  • Image Generation

Publications

  • Jiangning Zhang, Xianfang Zeng, Chao Xu, and Yong Liu. Real-Time Audio-Guided Multi-Face Reenactment. IEEE Signal Processing Letters, 29:1–5, 2022.
    Audio-guided face reenactment aims to generate authentic target faces whose facial expressions match the input audio, and many learning-based methods have successfully achieved this. However, most methods can only reenact a particular person once trained, or suffer from low-quality generation of the target images. Also, nearly none of the current reenactment works consider the model size and running speed that are important for practical use. To solve the above challenges, we propose an efficient Audio-guided Multi-face reenactment model named AMNet, which can reenact target faces among multiple persons with corresponding source faces and drive signals as inputs. Concretely, we design a Geometric Controller (GC) module to inject the drive signals so that the model can be optimized in an end-to-end manner and generate more authentic images. Also, we adopt a lightweight network for our face reenactor so that the model can run in real time on both CPU and GPU devices. Abundant experiments prove our approach’s superiority over existing methods, e.g., decreasing FID by 0.12 and increasing SSIM by 0.031 on average compared with APB2Face, while owning fewer parameters (×4↓) and faster CPU speed (×4↑).
    @article{zhang2022rta,
    title = {Real-Time Audio-Guided Multi-Face Reenactment},
    author = {Jiangning Zhang and Xianfang Zeng and Chao Xu and Yong Liu},
    year = 2022,
    journal = {IEEE Signal Processing Letters},
    volume = {29},
    pages = {1--5},
    doi = {10.1109/LSP.2021.3116506},
    abstract = {Audio-guided face reenactment aims to generate authentic target faces whose facial expressions match the input audio, and many learning-based methods have successfully achieved this. However, most methods can only reenact a particular person once trained, or suffer from low-quality generation of the target images. Also, nearly none of the current reenactment works consider the model size and running speed that are important for practical use. To solve the above challenges, we propose an efficient Audio-guided Multi-face reenactment model named AMNet, which can reenact target faces among multiple persons with corresponding source faces and drive signals as inputs. Concretely, we design a Geometric Controller (GC) module to inject the drive signals so that the model can be optimized in an end-to-end manner and generate more authentic images. Also, we adopt a lightweight network for our face reenactor so that the model can run in real time on both CPU and GPU devices. Abundant experiments prove our approach's superiority over existing methods, e.g., decreasing FID by 0.12 and increasing SSIM by 0.031 on average compared with APB2Face, while owning fewer parameters (×4↓) and faster CPU speed (×4↑).}
    }
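    The Geometric Controller described in this entry injects drive signals into the reenactor so that the whole model trains end-to-end. As a rough illustration only (the paper does not publish this interface, and the module and tensor names here are hypothetical), one common way to realize such conditioning is feature-wise modulation:

    import torch
    import torch.nn as nn

    class DriveInjection(nn.Module):
        """Hypothetical sketch: the drive vector predicts a per-channel scale
        and shift that modulate the reenactor's feature maps."""
        def __init__(self, drive_dim, feat_channels):
            super().__init__()
            self.to_scale_shift = nn.Linear(drive_dim, 2 * feat_channels)

        def forward(self, feats, drive):
            # feats: (B, C, H, W) source-face features; drive: (B, drive_dim)
            scale, shift = self.to_scale_shift(drive).chunk(2, dim=-1)
            return feats * (1 + scale[..., None, None]) + shift[..., None, None]

    feats = torch.randn(2, 64, 32, 32)
    drive = torch.randn(2, 128)  # e.g., an audio-derived geometry code
    out = DriveInjection(drive_dim=128, feat_channels=64)(feats, drive)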
  • Guanzhong Tian, Yiran Sun, Yuang Liu, Xianfang Zeng, Mengmeng Wang, Yong Liu, Jiangning Zhang, and Jun Chen. Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention. IEEE Transactions on Neural Networks and Learning Systems, 2021.
    Filter pruning is a significant feature selection technique for shrinking existing feature fusion schemes (especially convolution computation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somewhat tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline is sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue in one stage: using an attention-based architecture that adaptively fuses filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) that makes the model focus on the filters of higher significance through training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate gradients through the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.
    @article{tian2021abp,
    title = {Adding before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention},
    author = {Guanzhong Tian and Yiran Sun and Yuang Liu and Xianfang Zeng and Mengmeng Wang and Yong Liu and Jiangning Zhang and Jun Chen},
    year = 2021,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2021.3106917},
    abstract = {Filter pruning is a significant feature selection technique for shrinking existing feature fusion schemes (especially convolution computation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and accelerates the inference process dramatically. Existing methods mainly rely on manual constraints such as normalization to select the filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somewhat tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline is sensitive to the choice of hyperparameters, thus making the pruning procedure less robust. To address these challenges, we propose to handle the filter pruning issue in one stage: using an attention-based architecture that adaptively fuses filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) that makes the model focus on the filters of higher significance through training instead of man-made criteria such as norm, rank, etc. First, we add an auxiliary attention layer into the original model and set the significance scores in this layer to be binary. Furthermore, to propagate gradients through the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the graph flow through mathematical derivation. In the end, to relieve the dependence on complicated prior knowledge for designing the thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.}
    }
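    ABP's central idea, learning binary per-filter significance scores through an auxiliary attention layer instead of ranking filters by a man-made criterion, can be sketched as follows. This is a simplified stand-in: it uses a plain straight-through estimator rather than the specific estimator derived in the paper, and all class names are illustrative.

    import torch
    import torch.nn as nn

    class BinaryGate(torch.autograd.Function):
        """Forward: hard 0/1 gate on the scores. Backward: pass the gradient
        straight through (a stand-in for the paper's estimator)."""
        @staticmethod
        def forward(ctx, scores):
            return (scores > 0).float()

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    class GatedConv(nn.Module):
        """Conv layer whose output filters are masked by trainable scores, so
        filter selection is learned jointly with the weights."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.scores = nn.Parameter(torch.full((out_ch,), 0.1))

        def forward(self, x):
            mask = BinaryGate.apply(self.scores)  # one 0/1 entry per filter
            return self.conv(x) * mask[None, :, None, None]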
  • Xianfang Zeng, Wenxuan Wu, Guanzhong Tian, Fuxin Li, and Yong Liu. Deep Superpixel Convolutional Network for Image Recognition. IEEE Signal Processing Letters, 28:922–926, 2021.
    Due to their high representational efficiency, superpixels largely reduce the number of image primitives for subsequent processing. However, superpixels are scarcely utilized in recent methods since their irregular shape is intractable for standard convolutional layers. In this paper, we propose an end-to-end trainable superpixel convolutional network, named SPNet, to learn high-level representations on image superpixel primitives. We start by treating the irregular superpixel lattice as a 2D point cloud, where the low-level features inside each superpixel are aggregated into one feature vector. We replace the standard convolutional layer with the PointConv layer to handle the irregular and unordered point cloud. Besides, we propose grid-based downsampling strategies to output a uniform 2D sampling result. The resulting network largely exploits the efficiency of superpixels and provides a novel view of the image recognition task. Experiments on image recognition show promising results compared with prominent image classification methods. The visualization of class activation mapping shows great accuracy in object localization and boundary segmentation.
    @article{zeng2021deepsc,
    title = {Deep Superpixel Convolutional Network for Image Recognition},
    author = {Xianfang Zeng and Wenxuan Wu and Guanzhong Tian and Fuxin Li and Yong Liu},
    year = 2021,
    journal = {IEEE Signal Processing Letters},
    volume = {28},
    pages = {922--926},
    doi = {10.1109/LSP.2021.3075605},
    abstract = {Due to their high representational efficiency, superpixels largely reduce the number of image primitives for subsequent processing. However, superpixels are scarcely utilized in recent methods since their irregular shape is intractable for standard convolutional layers. In this paper, we propose an end-to-end trainable superpixel convolutional network, named SPNet, to learn high-level representations on image superpixel primitives. We start by treating the irregular superpixel lattice as a 2D point cloud, where the low-level features inside each superpixel are aggregated into one feature vector. We replace the standard convolutional layer with the PointConv layer to handle the irregular and unordered point cloud. Besides, we propose grid-based downsampling strategies to output a uniform 2D sampling result. The resulting network largely exploits the efficiency of superpixels and provides a novel view of the image recognition task. Experiments on image recognition show promising results compared with prominent image classification methods. The visualization of class activation mapping shows great accuracy in object localization and boundary segmentation.}
    }
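    The first step of SPNet, aggregating the low-level features inside each superpixel into a single vector so that the irregular lattice can be treated as a 2D point cloud, can be sketched with a scatter-style average. The function and its signature are illustrative, not taken from the paper's code:

    import torch

    def superpixels_to_points(feats, labels, num_sp):
        """Average per-pixel features within each superpixel.
        feats: (C, H, W) low-level features; labels: (H, W) superpixel ids in
        [0, num_sp); returns (num_sp, C) features, one point per superpixel."""
        C = feats.shape[0]
        flat = feats.reshape(C, -1).t()        # (H*W, C) per-pixel features
        ids = labels.reshape(-1).long()        # (H*W,) superpixel assignment
        sums = torch.zeros(num_sp, C).index_add_(0, ids, flat)
        counts = torch.zeros(num_sp).index_add_(0, ids, torch.ones(ids.shape[0]))
        return sums / counts.clamp(min=1).unsqueeze(1)

    feats = torch.randn(16, 8, 8)
    labels = torch.randint(0, 10, (8, 8))
    points = superpixels_to_points(feats, labels, num_sp=10)  # (10, 16)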
  • Guanzhong Tian, Jun Chen, Xianfang Zeng, and Yong Liu. Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing. IEEE Signal Processing Letters, 28:344–348, 2021.
    Filter pruning for a pre-trained convolutional neural network is most commonly performed through human-made constraints or criteria such as norms, ranks, etc. Typically, the pruning pipeline comprises two stages: first learning a sparse structure from the original model, then optimizing the weights in the newly pruned model. One disadvantage of using human-made criteria to prune filters is that the design and selection of threshold criteria depend on complicated prior knowledge. Besides, the pruning process is less robust due to the impact of regularizing filters directly. To address the problems mentioned, we propose an effective one-stage pruning framework: introducing a trainable collaborative layer to jointly prune and learn neural networks in one go. In our framework, we first add a binary collaborative layer for each original filter. Then, a new type of gradient estimator, the asymptotic gradient estimator, is introduced for the first time to pass the gradient through the binary collaborative layer. Finally, we simultaneously learn the sparse structure and optimize the weights from the original model in the training process. Our evaluation results on typical benchmarks, CIFAR and ImageNet, demonstrate very promising results against other state-of-the-art filter pruning methods.
    @article{tian2021pbt,
    title = {Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing},
    author = {Guanzhong Tian and Jun Chen and Xianfang Zeng and Yong Liu},
    year = 2021,
    journal = {IEEE Signal Processing Letters},
    volume = {28},
    pages = {344--348},
    doi = {10.1109/LSP.2021.3054315},
    abstract = {Filter pruning for a pre-trained convolutional neural network is most commonly performed through human-made constraints or criteria such as norms, ranks, etc. Typically, the pruning pipeline comprises two stages: first learning a sparse structure from the original model, then optimizing the weights in the newly pruned model. One disadvantage of using human-made criteria to prune filters is that the design and selection of threshold criteria depend on complicated prior knowledge. Besides, the pruning process is less robust due to the impact of regularizing filters directly. To address the problems mentioned, we propose an effective one-stage pruning framework: introducing a trainable collaborative layer to jointly prune and learn neural networks in one go. In our framework, we first add a binary collaborative layer for each original filter. Then, a new type of gradient estimator, the asymptotic gradient estimator, is introduced for the first time to pass the gradient through the binary collaborative layer. Finally, we simultaneously learn the sparse structure and optimize the weights from the original model in the training process. Our evaluation results on typical benchmarks, CIFAR and ImageNet, demonstrate very promising results against other state-of-the-art filter pruning methods.}
    }
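    The asymptotic gradient estimator can be pictured as a hard binary gate in the forward pass paired with a soft surrogate in the backward pass whose slope sharpens over training. The sketch below is one plausible reading under that assumption, not the paper's exact formulation, and the temperature schedule is invented for the example.

    import torch

    def asymptotic_binary_gate(scores, temperature):
        """Forward: hard 0/1 gate. Backward: gradient of a sigmoid whose slope
        grows with `temperature`, so the surrogate asymptotically approaches
        the hard gate as training progresses (illustrative)."""
        soft = torch.sigmoid(temperature * scores)
        hard = (scores > 0).float()
        # Hard values in the forward pass, soft gradients in the backward pass.
        return soft + (hard - soft).detach()

    scores = torch.randn(16, requires_grad=True)
    gate = asymptotic_binary_gate(scores, temperature=5.0)
    gate.sum().backward()  # gradients flow through the sigmoid surrogate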
  • Jun Chen, Liang Liu, Yong Liu, and Xianfang Zeng. A Learning Framework for n-Bit Quantized Neural Networks Toward FPGAs. IEEE Transactions on Neural Networks and Learning Systems, 32:1067–1081, 2021.
    The quantized neural network (QNN) is an efficient approach to network compression and can be widely used in the implementation of field-programmable gate arrays (FPGAs). This article proposes a novel learning framework for n-bit QNNs, whose weights are constrained to powers of two. To solve the gradient vanishing problem, we propose a reconstructed gradient function for QNNs in the back-propagation algorithm that can directly get the real gradient rather than estimating an approximate gradient of the expected loss. We also propose a novel QNN structure named n-BQ-NN, which uses shift operations to replace multiply operations and is more suitable for inference on FPGAs. Furthermore, we design a shift vector processing element (SVPE) array to replace all 16-bit multiplications with SHIFT operations in convolution operations on FPGAs. We also carry out comparative experiments to evaluate our framework. The experimental results show that the quantized models of ResNet, DenseNet, and AlexNet trained through our learning framework can achieve almost the same accuracies as the original full-precision models. Moreover, when using our learning framework to train our n-BQ-NN from scratch, it achieves state-of-the-art results compared with typical low-precision QNNs. Experiments on the Xilinx ZCU102 platform show that our n-BQ-NN with the SVPE can execute 2.9 times faster in inference than with the vector processing element (VPE). As the SHIFT operation in our SVPE array consumes no digital signal processing (DSP) resources on FPGAs, the experiments also show that using the SVPE array reduces average energy consumption to 68.7% of that of the 16-bit VPE array.
    @article{chen2021alf,
    title = {A Learning Framework for n-Bit Quantized Neural Networks Toward FPGAs},
    author = {Jun Chen and Liang Liu and Yong Liu and Xianfang Zeng},
    year = 2021,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    volume = {32},
    pages = {1067--1081},
    doi = {10.1109/TNNLS.2020.2980041},
    abstract = {The quantized neural network (QNN) is an efficient approach to network compression and can be widely used in the implementation of field-programmable gate arrays (FPGAs). This article proposes a novel learning framework for n-bit QNNs, whose weights are constrained to powers of two. To solve the gradient vanishing problem, we propose a reconstructed gradient function for QNNs in the back-propagation algorithm that can directly get the real gradient rather than estimating an approximate gradient of the expected loss. We also propose a novel QNN structure named n-BQ-NN, which uses shift operations to replace multiply operations and is more suitable for inference on FPGAs. Furthermore, we design a shift vector processing element (SVPE) array to replace all 16-bit multiplications with SHIFT operations in convolution operations on FPGAs. We also carry out comparative experiments to evaluate our framework. The experimental results show that the quantized models of ResNet, DenseNet, and AlexNet trained through our learning framework can achieve almost the same accuracies as the original full-precision models. Moreover, when using our learning framework to train our n-BQ-NN from scratch, it achieves state-of-the-art results compared with typical low-precision QNNs. Experiments on the Xilinx ZCU102 platform show that our n-BQ-NN with the SVPE can execute 2.9 times faster in inference than with the vector processing element (VPE). As the SHIFT operation in our SVPE array consumes no digital signal processing (DSP) resources on FPGAs, the experiments also show that using the SVPE array reduces average energy consumption to 68.7% of that of the 16-bit VPE array.},
    arxiv = {http://arxiv.org/pdf/2004.02396}
    }
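    The weight constraint that makes n-BQ-NN shift-friendly is that every weight is a signed power of two, so each multiply reduces to a bit shift (x * 2**-3 is x >> 3 in fixed point). A sketch of such a quantizer follows; the rounding rule and exponent range are simplified assumptions, not the paper's exact scheme.

    import torch

    def quantize_pow2(w, n_bits=4):
        """Snap each weight to a signed power of two (nearest in log2 space;
        the exponent range for an n-bit code is illustrative)."""
        sign = torch.sign(w)
        exp = torch.round(torch.log2(w.abs().clamp(min=1e-8)))
        exp = exp.clamp(min=-(2 ** (n_bits - 1)), max=0)
        return sign * torch.pow(2.0, exp)

    w = torch.randn(8) * 0.5
    print(quantize_pow2(w))  # values like ±0.5, ±0.25, ±0.125, ...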
  • Xin Kong, Xuemeng Yang, Guangyao Zhai, Xiangrui Zhao, Xianfang Zeng, Mengmeng Wang, Yong Liu, Wanlong Li, and Feng Wen. Semantic Graph Based Place Recognition for 3D Point Clouds. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8216–8223, 2020.
    Due to the difficulty of generating effective descriptors that are robust to occlusion and viewpoint changes, place recognition for 3D point clouds remains an open issue. Unlike most existing methods, which focus on extracting local, global, and statistical features of raw point clouds, our method works at the semantic level, which can be superior in terms of robustness to environmental changes. Inspired by the perspective of humans, who recognize scenes by identifying semantic objects and capturing their relations, this paper presents a novel semantic graph based approach for place recognition. First, we propose a novel semantic graph representation for point cloud scenes that preserves the semantic and topological information of the raw point cloud; place recognition is thus modeled as a graph matching problem. Then we design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to occlusion as well as viewpoint changes and outperforms state-of-the-art methods by a large margin. Our code is available at: https://github.com/kxhit/SG_PR.
    @inproceedings{kong2020semanticgb,
    title = {Semantic Graph Based Place Recognition for 3D Point Clouds},
    author = {Xin Kong and Xuemeng Yang and Guangyao Zhai and Xiangrui Zhao and Xianfang Zeng and Mengmeng Wang and Yong Liu and Wanlong Li and Feng Wen},
    year = 2020,
    booktitle = {2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    pages = {8216--8223},
    doi = {10.1109/IROS45743.2020.9341060},
    abstract = {Due to the difficulty of generating effective descriptors that are robust to occlusion and viewpoint changes, place recognition for 3D point clouds remains an open issue. Unlike most existing methods, which focus on extracting local, global, and statistical features of raw point clouds, our method works at the semantic level, which can be superior in terms of robustness to environmental changes. Inspired by the perspective of humans, who recognize scenes by identifying semantic objects and capturing their relations, this paper presents a novel semantic graph based approach for place recognition. First, we propose a novel semantic graph representation for point cloud scenes that preserves the semantic and topological information of the raw point cloud; place recognition is thus modeled as a graph matching problem. Then we design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to occlusion as well as viewpoint changes and outperforms state-of-the-art methods by a large margin. Our code is available at: https://github.com/kxhit/SG_PR.},
    arxiv = {https://arxiv.org/pdf/2008.11459.pdf}
    }
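    The semantic graph, with nodes for segmented objects (class plus centroid) and edges for their spatial relations, reduces place recognition to comparing two small graphs. A minimal sketch of the construction step follows; the dict layout and the use of plain pairwise distances as edges are assumptions, and the released code at the URL above is authoritative:

    import torch

    def build_semantic_graph(centroids, classes):
        """centroids: (N, 3) object centers from a semantically segmented scan;
        classes: (N,) semantic labels. Edges carry pairwise Euclidean distances,
        keeping the scene's topology in a compact, viewpoint-robust form."""
        adjacency = torch.cdist(centroids, centroids)  # (N, N) edge weights
        return {"xyz": centroids, "cls": classes, "adj": adjacency}

    graph = build_semantic_graph(torch.randn(12, 3), torch.randint(0, 20, (12,)))
    # A graph similarity network would then score pairs of such graphs.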
  • Xianfang Zeng, Yusu Pan, Mengmeng Wang, Jiangning Zhang, and Yong Liu. Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.
    Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmarks or boundaries. To alleviate the demand for manual annotations, in this paper we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for the video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall realism. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experimental results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.
    @inproceedings{zeng2020realisticfr,
    title = {Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose},
    author = {Xianfang Zeng and Yusu Pan and Mengmeng Wang and Jiangning Zhang and Yong Liu},
    year = 2020,
    booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
    doi = {10.1609/AAAI.V34I07.6970},
    abstract = {Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmarks or boundaries. To alleviate the demand for manual annotations, in this paper we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for the video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall realism. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experimental results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.},
    arxiv = {https://arxiv.org/pdf/2003.12957.pdf}
    }
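    The disentangling prior, that identity is shared across a clip while pose varies per frame, suggests a simple self-supervised split: average frame embeddings for the identity code and keep per-frame codes for pose. The sketch below illustrates only this prior; the paper's model uses deforming autoencoders, and every layer here is a placeholder.

    import torch
    import torch.nn as nn

    class IdentityPoseSplit(nn.Module):
        """Toy encoder: averaging over T frames cancels frame-specific pose,
        leaving a video-level identity code; a separate head keeps the
        per-frame variation as the pose code."""
        def __init__(self, dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
            self.pose_head = nn.Linear(dim, dim)

        def forward(self, frames):  # frames: (B, T, 3, H, W)
            B, T = frames.shape[:2]
            z = self.backbone(frames.flatten(0, 1)).view(B, T, -1)
            return z.mean(dim=1), self.pose_head(z)  # identity, per-frame pose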
  • Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, and Changjie Fan. FReeNet: Multi-Identity Face Reenactment. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 5325–5334, 2020.
    This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: a Unified Landmark Converter (ULC) and a Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expressions in a latent landmark space, which significantly narrows the gap between the face contours of the source and target identities. The GAG leverages the converted landmarks to reenact a photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches the facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as its flexibility for transferring facial expressions between identities.
    @inproceedings{zhang2020freenetmf,
    title = {FReeNet: Multi-Identity Face Reenactment},
    author = {Jiangning Zhang and Xianfang Zeng and Mengmeng Wang and Yusu Pan and Liang Liu and Yong Liu and Yu Ding and Changjie Fan},
    year = 2020,
    booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {5325--5334},
    doi = {10.1109/cvpr42600.2020.00537},
    abstract = {This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: a Unified Landmark Converter (ULC) and a Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expressions in a latent landmark space, which significantly narrows the gap between the face contours of the source and target identities. The GAG leverages the converted landmarks to reenact a photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches the facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as its flexibility for transferring facial expressions between identities.},
    arxiv = {http://arxiv.org/pdf/1905.11805}
    }
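    The ULC performs a landmark-to-landmark mapping: encode the source expression into a latent landmark space, then decode it conditioned on the target identity. Below is a minimal sketch under assumed shapes; the landmark count and identity embedding are placeholders, not the paper's configuration:

    import torch
    import torch.nn as nn

    class LandmarkConverter(nn.Module):
        """Illustrative encoder-decoder: source landmarks in, landmarks adapted
        to the target person's face contour out."""
        def __init__(self, n_points=68, id_dim=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(2 * n_points, 256), nn.ReLU())
            self.dec = nn.Sequential(nn.Linear(256 + id_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * n_points))

        def forward(self, src_landmarks, target_id):
            # src_landmarks: (B, n_points, 2); target_id: (B, id_dim)
            latent = self.enc(src_landmarks.flatten(1))
            out = self.dec(torch.cat([latent, target_id], dim=1))
            return out.view(-1, src_landmarks.shape[1], 2)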