Yue Han
MS Student
Institute of Cyber-Systems and Control, Zhejiang University, China
Biography
I am pursuing my master's degree in the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My major research interest is generative adversarial networks (GANs).
Research and Interests
- Computer Vision
- Generative Adversarial Network
Publications
- Yue Han, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Yong Liu, Lu Qi, Xiangtai Li, and Ming-Hsuan Yang. Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
Abstract: Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) Overfitting suppresses novel class objects; 2) Dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) A novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) Demonstrating that support object queries encode key factors after base training, allowing query features to be enhanced twice at both feature and query levels using simple cross-attention, thus avoiding complex spatial correlation interaction; 3) Introducing a class-enhanced base knowledge distillation loss to address the issue of DETR-like models struggling with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., +8.2/+9.4 performance gain over state-of-the-art methods with 10/30-shots. Source code and models will be available at this github site.
@article{han2024rta,
  title = {Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation},
  author = {Yue Han and Jiangning Zhang and Yabiao Wang and Chengjie Wang and Yong Liu and Lu Qi and Xiangtai Li and Ming-Hsuan Yang},
  year = 2024,
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  doi = {10.1109/TPAMI.2024.3421340}
}
- Xintian Shen, Jiangning Zhang, Jun Chen, Shipeng Bai, Yue Han, Yabiao Wang, Chengjie Wang, and Yong Liu. Learning Global-Aware Kernel for Image Harmonization. In 19th IEEE/CVF International Conference on Computer Vision (ICCV), pages 7501-7510, 2023.
Abstract: Image harmonization aims to solve the visual inconsistency problem in composited images by adaptively adjusting the foreground pixels with the background as references. Existing methods employ local color transformation or region matching between foreground and background, which neglects powerful proximity prior and independently distinguishes fore-/back-ground as a whole part for harmonization. As a result, they still show a limited performance across varied foreground objects and scenes. To address this issue, we propose a novel Global-aware Kernel Network (GKNet) to harmonize local regions with comprehensive consideration of long-distance background references. Specifically, GKNet includes two parts, i.e., harmony kernel prediction and harmony kernel modulation branches. The former includes a Long-distance Reference Extractor (LRE) to obtain long-distance context and Kernel Prediction Blocks (KPB) to predict multi-level harmony kernels by fusing global information with local features. To achieve this goal, a novel Selective Correlation Fusion (SCF) module is proposed to better select relevant long-distance background references for local harmonization. The latter employs the predicted kernels to harmonize foreground regions with local and global awareness. Abundant experiments demonstrate the superiority of our method for image harmonization over state-of-the-art methods, e.g., achieving 39.53dB PSNR that surpasses the best counterpart by +0.78dB ↑; decreasing fMSE/MSE by 11.5%↓/6.7%↓ compared with the SoTA method. Code will be available here.
@inproceedings{shen2023lga,
  title = {Learning Global-Aware Kernel for Image Harmonization},
  author = {Xintian Shen and Jiangning Zhang and Jun Chen and Shipeng Bai and Yue Han and Yabiao Wang and Chengjie Wang and Yong Liu},
  year = 2023,
  booktitle = {19th IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages = {7501--7510},
  doi = {10.1109/ICCV51070.2023.00693}
}
- Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, and Yong Liu. High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6609-6619, 2023.
Abstract: Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A followed style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.
@inproceedings{xu2023hfg,
  title = {High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning},
  author = {Chao Xu and Junwei Zhu and Jiangning Zhang and Yue Han and Wenqing Chu and Ying Tai and Chengjie Wang and Zhifeng Xie and Yong Liu},
  year = 2023,
  booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages = {6609--6619},
  doi = {10.1109/CVPR52729.2023.00639}
}
- Chao Xu, Jiangning Zhang, Yue Han, and Yong Liu. Designing One Unified Framework for High-Fidelity Face Reenactment and Swapping. In European Conference on Computer Vision (ECCV), 2022.
Abstract: Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and practical-unfriendly. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute unsupervisedly. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We join AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.
@inproceedings{xu2022dou,
  title = {Designing One Unified Framework for High-Fidelity Face Reenactment and Swapping},
  author = {Chao Xu and Jiangning Zhang and Yue Han and Yong Liu},
  year = 2022,
  booktitle = {European Conference on Computer Vision (ECCV)},
  doi = {10.1007/978-3-031-19784-0_4}
}
- Jiangning Zhang, Chao Xu, Jian Li, Yue Han, Yabiao Wang, Ying Tai, and Yong Liu. SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022.
Abstract: In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling operation for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneously Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes, which is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).
@inproceedings{zhang2022scs,
  title = {SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution},
  author = {Jiangning Zhang and Chao Xu and Jian Li and Yue Han and Yabiao Wang and Ying Tai and Yong Liu},
  year = 2022,
  booktitle = {Proceedings of the 36th AAAI Conference on Artificial Intelligence},
  doi = {10.48550/arXiv.2201.04364}
}