Address

Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: 22132041@zju.edu.cn

Yue Han

MS Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I am pursuing my master's degree at the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My main research interest is generative adversarial networks (GANs).

Research and Interests

  • Computer Vision
  • Generative Adversarial Networks

Publications

  • Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, and Yong Liu. High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
    Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A followed style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.
    @inproceedings{xu2023hfg,
    title = {High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning},
    author = {Chao Xu and Junwei Zhu and Jiangning Zhang and Yue Han and Wenqing Chu and Ying Tai and Chengjie Wang and Zhifeng Xie and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    doi = {10.1109/CVPR52729.2023.00639},
    abstract = {Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A followed style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.}
    }
  • Chao Xu, Jiangning Zhang, Yue Han, and Yong Liu. Designing One Unified Framework for High-Fidelity Face Reenactment and Swapping. In European Conference on Computer Vision (ECCV), 2022.
    Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and practical-unfriendly. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute unsupervisedly. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We joint AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.
    @inproceedings{xu2022dou,
    title = {Designing One Unified Framework for High-Fidelity Face Reenactment and Swapping},
    author = {Chao Xu and Jiangning Zhang and Yue Han and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-19784-0_4},
    abstract = {Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and practical-unfriendly. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute unsupervisedly. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We joint AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.}
    }
  • Jiangning Zhang, Chao Xu, Jian Li, Yue Han, Yabiao Wang, Ying Tai, and Yong Liu. SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022.
    In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling operation for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneously Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes, which is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).
    @inproceedings{zhang2022scs,
    title = {SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution},
    author = {Jiangning Zhang and Chao Xu and Jian Li and Yue Han and Yabiao Wang and Ying Tai and Yong Liu},
    year = 2022,
    booktitle = {Proceedings of the 36th AAAI Conference on Artificial Intelligence},
    doi = {10.48550/arXiv.2201.04364},
    abstract = {In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling operation for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneously Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes, which is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).}
    }