Chao Xu

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Address

Room 101, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: 21832066@zju.edu.cn

Biography

I am pursuing my M.S. degree at the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My major research interests include video generation and video understanding.

Research Interests

  • Generative Adversarial Networks (GANs)
  • Video Understanding

Publications

  • Chao Xu, Xia Wu, Mengmeng Wang, Feng Qiu, Yong Liu, and Jun Ren. Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture. Neurocomputing, 523:58–68, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for a man to interact naturally with a machine. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems since: (1) It is challenging to locate the temporal boundaries of performing gestures; (2) There are significant differences in performing gestures among different people, resulting in a variety of gestures; (3) There must be a trade-off between the accuracy and the computational consumption. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in time series, then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.
    @article{xv2022idg,
    title = {Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture},
    author = {Chao Xu and Xia Wu and Mengmeng Wang and Feng Qiu and Yong Liu and Jun Ren},
    year = 2023,
    journal = {Neurocomputing},
    volume = 523,
    pages = {58--68},
    doi = {10.1016/j.neucom.2022.12.022},
    abstract = {Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for a man to interact naturally with a machine. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems since: (1) It is challenging to locate the temporal boundaries of performing gestures; (2) There are significant differences in performing gestures among different people, resulting in a variety of gestures; (3) There must be a trade-off between the accuracy and the computational consumption. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in time series, then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.}
    }
  • Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, and Yong Liu. High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6609–6619, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic priors from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A subsequent style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.
    @inproceedings{xu2023hfg,
    title = {High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning},
    author = {Chao Xu and Junwei Zhu and Jiangning Zhang and Yue Han and Wenqing Chu and Ying Tai and Chengjie Wang and Zhifeng Xie and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {6609--6619},
    doi = {10.1109/CVPR52729.2023.00639},
    abstract = {Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic priors from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A subsequent style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.}
    }
  • Xuhai Chen, Jiangning Zhang, Chao Xu, Yabiao Wang, Chengjie Wang, and Yong Liu. Better “CMOS” Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1651–1661, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Most of the existing blind image Super-Resolution (SR) methods assume that the blur kernels are space-invariant. However, the blur involved in real applications is usually space-variant due to object motion, out-of-focus, etc., resulting in a severe performance drop for advanced SR methods. To address this problem, we first introduce two new datasets with out-of-focus blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research on blind SR with space-variant blur. Based on the datasets, we design a novel Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics simultaneously, which leads to improved SR results. It involves a feature Grouping Interactive Attention (GIA) module to make the two modalities interact more effectively and avoid inconsistency. GIA can also be used for the interaction of other features because of the universality of its structure. Qualitative and quantitative experiments compared with state-of-the-art methods on the above datasets and real-world images demonstrate the superiority of our method, e.g., improving PSNR/SSIM by +1.91↑/+0.0048↑ on NYUv2-BSR over MANet (code: https://github.com/ByChelsea/CMOS.git).
    @inproceedings{chen2023cmos,
    title = {Better "CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution},
    author = {Xuhai Chen and Jiangning Zhang and Chao Xu and Yabiao Wang and Chengjie Wang and Yong Liu},
    year = 2023,
    booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {1651--1661},
    doi = {10.1109/CVPR52729.2023.00165},
    abstract = {Most of the existing blind image Super-Resolution (SR) methods assume that the blur kernels are space-invariant. However, the blur involved in real applications is usually space-variant due to object motion, out-of-focus, etc., resulting in a severe performance drop for advanced SR methods. To address this problem, we first introduce two new datasets with out-of-focus blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research on blind SR with space-variant blur. Based on the datasets, we design a novel Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics simultaneously, which leads to improved SR results. It involves a feature Grouping Interactive Attention (GIA) module to make the two modalities interact more effectively and avoid inconsistency. GIA can also be used for the interaction of other features because of the universality of its structure. Qualitative and quantitative experiments compared with state-of-the-art methods on the above datasets and real-world images demonstrate the superiority of our method, e.g., improving PSNR/SSIM by +1.91↑/+0.0048↑ on NYUv2-BSR over MANet (code: https://github.com/ByChelsea/CMOS.git).}
    }
  • Chao Xu, Jiangning Zhang, Mengmeng Wang, Guanzhong Tian, and Yong Liu. Multi-level Spatial-temporal Feature Aggregation for Video Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7809–7820, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.
    @article{xu2022mls,
    title = {Multi-level Spatial-temporal Feature Aggregation for Video Object Detection},
    author = {Chao Xu and Jiangning Zhang and Mengmeng Wang and Guanzhong Tian and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Circuits and Systems for Video Technology},
    volume = {32},
    number = {11},
    pages = {7809--7820},
    doi = {10.1109/TCSVT.2022.3183646},
    abstract = {Video object detection (VOD) focuses on detecting objects for each frame in a video, which is a challenging task due to appearance deterioration in certain video frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they only perform at frame level or proposal level that cannot integrate spatial-temporal features sufficiently. To deal with this challenge, we treat VOD as a spatial-temporal hierarchical features interacting process and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework to fully exploit frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than only update reference features. The proposal-level feature aggregation then models object relation to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint to boost instance-level features by leveraging support object proposal features that belong to the same object. Besides, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve a more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments are conducted on ImageNet VID and UAVDT datasets that demonstrate the superiority of our method over state-of-the-art (SOTA) methods. Our method achieves 83.3% and 62.1% with ResNet-101 on two datasets, outperforming SOTA MEGA by 0.4% and 2.7%.}
    }
  • Jiangning Zhang, Xianfang Zeng, Chao Xu, and Yong Liu. Real-Time Audio-Guided Multi-Face Reenactment. IEEE Signal Processing Letters, 29:1–5, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Audio-guided face reenactment aims to generate authentic target faces whose facial expressions match the input audio, and many learning-based methods have successfully achieved this. However, most methods can only reenact a particular person once trained or suffer from low-quality generation of the target images. Also, nearly none of the current reenactment works consider the model size and running speed that are important for practical use. To solve the above challenges, we propose an efficient Audio-guided Multi-face reenactment model named AMNet, which can reenact target faces among multiple persons with corresponding source faces and drive signals as inputs. Concretely, we design a Geometric Controller (GC) module to inject the drive signals so that the model can be optimized in an end-to-end manner and generate more authentic images. Also, we adopt a lightweight network for our face reenactor so that the model can run in real time on both CPU and GPU devices. Abundant experiments prove our approach’s superiority over existing methods, e.g., averagely decreasing FID by 0.12 and increasing SSIM by 0.031 compared with APB2Face, while owning fewer parameters (×4↓) and faster CPU speed (×4↑).
    @article{zhang2022rta,
    title = {Real-Time Audio-Guided Multi-Face Reenactment},
    author = {Jiangning Zhang and Xianfang Zeng and Chao Xu and Yong Liu},
    year = 2022,
    journal = {IEEE Signal Processing Letters},
    volume = {29},
    pages = {1--5},
    doi = {10.1109/LSP.2021.3116506},
    abstract = {Audio-guided face reenactment aims to generate authentic target faces whose facial expressions match the input audio, and many learning-based methods have successfully achieved this. However, most methods can only reenact a particular person once trained or suffer from low-quality generation of the target images. Also, nearly none of the current reenactment works consider the model size and running speed that are important for practical use. To solve the above challenges, we propose an efficient Audio-guided Multi-face reenactment model named AMNet, which can reenact target faces among multiple persons with corresponding source faces and drive signals as inputs. Concretely, we design a Geometric Controller (GC) module to inject the drive signals so that the model can be optimized in an end-to-end manner and generate more authentic images. Also, we adopt a lightweight network for our face reenactor so that the model can run in real time on both CPU and GPU devices. Abundant experiments prove our approach's superiority over existing methods, e.g., averagely decreasing FID by 0.12 and increasing SSIM by 0.031 compared with APB2Face, while owning fewer parameters (×4↓) and faster CPU speed (×4↑).}
    }
  • Chao Xu, Jiangning Zhang, Yue Han, and Yong Liu. Designing One Unified framework for High-Fidelity Face Reenactment and Swapping. In European Conference on Computer Vision (ECCV), 2022.
    [BibTeX] [Abstract] [DOI]
    Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and unfriendly to practical applications. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute in an unsupervised manner. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We combine AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., to help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.
    @inproceedings{xu2022dou,
    title = {Designing One Unified framework for High-Fidelity Face Reenactment and Swapping},
    author = {Chao Xu and Jiangning Zhang and Yue Han and Yong Liu},
    year = 2022,
    booktitle = {European Conference on Computer Vision (ECCV)},
    doi = {10.1007/978-3-031-19784-0_4},
    abstract = {Face reenactment and swapping share a similar identity and attribute manipulating pattern, but most methods treat them separately, which is redundant and unfriendly to practical applications. In this paper, we propose an effective end-to-end unified framework to achieve both tasks. Unlike existing methods that directly utilize pre-estimated structures and do not fully exploit their potential similarity, our model sufficiently transfers identity and attribute based on learned disentangled representations to generate high-fidelity faces. Specifically, Feature Disentanglement first disentangles identity and attribute in an unsupervised manner. Then the proposed Attribute Transfer (AttrT) employs learned Feature Displacement Fields to transfer the attribute granularly, and Identity Transfer (IdT) explicitly models identity-related feature interaction to adaptively control the identity fusion. We combine AttrT and IdT according to their intrinsic relationship to further facilitate each task, i.e., to help improve identity consistency in reenactment and attribute preservation in swapping. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/xc-csc101/UniFace.}
    }
  • Chao Xu, Jiangning Zhang, Miao Hua, Qian He, Zili Yi, and Yong Liu. Region-Aware Face Swapping. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    This paper presents a novel Region-Aware Face Swapping (RAFSwap) network to achieve identity-consistent harmonious high-resolution face generation in a local-global manner: 1) Local Facial Region-Aware (FRA) branch augments local identity-relevant features by introducing the Transformer to effectively model misaligned cross-scale semantic interaction. 2) Global Source Feature Adaptive (SFA) branch further complements global identity-relevant cues for generating identity-consistent swapped faces. Besides, we propose a Face Mask Predictor (FMP) module incorporated with StyleGAN2 to predict identity-relevant soft facial masks in an unsupervised manner that is more practical for generating harmonious high-resolution faces. Abundant experiments qualitatively and quantitatively demonstrate the superiority of our method for generating more identity-consistent high-resolution swapped faces over SOTA methods, e.g., obtaining 96.70 ID retrieval that outperforms SOTA MegaFS by 5.87↑.
    @inproceedings{xu2022raf,
    title = {Region-Aware Face Swapping},
    author = {Chao Xu and Jiangning Zhang and Miao Hua and Qian He and Zili Yi and Yong Liu},
    year = 2022,
    booktitle = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    doi = {10.48550/arXiv.2203.04564},
    abstract = {This paper presents a novel Region-Aware Face Swapping (RAFSwap) network to achieve identity-consistent harmonious high-resolution face generation in a local-global manner: 1) Local Facial Region-Aware (FRA) branch augments local identity-relevant features by introducing the Transformer to effectively model misaligned cross-scale semantic interaction. 2) Global Source Feature Adaptive (SFA) branch further complements global identity-relevant cues for generating identity-consistent swapped faces. Besides, we propose a Face Mask Predictor (FMP) module incorporated with StyleGAN2 to predict identity-relevant soft facial masks in an unsupervised manner that is more practical for generating harmonious high-resolution faces. Abundant experiments qualitatively and quantitatively demonstrate the superiority of our method for generating more identity-consistent high-resolution swapped faces over SOTA methods, e.g., obtaining 96.70 ID retrieval that outperforms SOTA MegaFS by 5.87↑.}
    }
  • Jiangning Zhang, Chao Xu, Jian Li, Yue Han, Yabiao Wang, Ying Tai, and Yong Liu. SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneously Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: a colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and a super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes, which is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).
    @inproceedings{zhang2022scs,
    title = {SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution},
    author = {Jiangning Zhang and Chao Xu and Jian Li and Yue Han and Yabiao Wang and Ying Tai and Yong Liu},
    year = 2022,
    booktitle = {Proceedings of the 36th AAAI Conference on Artificial Intelligence},
    doi = {10.48550/arXiv.2201.04364},
    abstract = {In the practical application of restoring low-resolution grayscale images, we generally need to run three separate processes of image colorization, super-resolution, and downsampling for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform Simultaneously Image Colorization and Super-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: a colorization branch for learning color information that employs the proposed plug-and-play Pyramid Valve Cross Attention (PVCAttn) module to aggregate feature maps between source and reference images; and a super-resolution branch for integrating color and texture information to predict target images, which uses the designed Continuous Pixel Mapping (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes, which is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8↓ and 5.1↓ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than ×2↓) and faster running speed (more than ×3↑).}
    }
  • Chao Xu, Xia Wu, Yachun Li, Yining Jin, Mengmeng Wang, and Yong Liu. Cross-modality online distillation for multi-view action recognition. Neurocomputing, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recently, some multi-modality features have been introduced into multi-view action recognition methods in order to obtain more robust performance. However, it is intuitive that not all modalities are available in real applications, such as daily scenes that lack depth modal data and capture RGB sequences only. This raises the challenge of how to learn critical features from multi-modality data while relying on RGB sequences and still obtaining robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework: the teacher network takes advantage of multi-view geometry-and-texture features during training, while the student network is only given RGB sequences at test time. Specifically, in the first stage, a Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, we design a Viewpoint-Aware Attention (VAA) module which captures discriminative information across different views to effectively combine multi-view features. In the second stage, a Multi-view Features Strengthen (MFS) network that also contains the VAA module further strengthens the global view-invariant features of the student network. Besides, both CAT and MFS learn in an online distillation manner so that the teacher and the student network can be trained jointly. Extensive experiments on IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.
    @article{xu2021cmo,
    title = {Cross-modality online distillation for multi-view action recognition},
    author = {Chao Xu and Xia Wu and Yachun Li and Yining Jin and Mengmeng Wang and Yong Liu},
    year = 2021,
    journal = {Neurocomputing},
    doi = {10.1016/j.neucom.2021.05.077},
    abstract = {Recently, some multi-modality features have been introduced into multi-view action recognition methods in order to obtain more robust performance. However, it is intuitive that not all modalities are available in real applications, such as daily scenes that lack depth modal data and capture RGB sequences only. This raises the challenge of how to learn critical features from multi-modality data while relying on RGB sequences and still obtaining robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework: the teacher network takes advantage of multi-view geometry-and-texture features during training, while the student network is only given RGB sequences at test time. Specifically, in the first stage, a Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, we design a Viewpoint-Aware Attention (VAA) module which captures discriminative information across different views to effectively combine multi-view features. In the second stage, a Multi-view Features Strengthen (MFS) network that also contains the VAA module further strengthens the global view-invariant features of the student network. Besides, both CAT and MFS learn in an online distillation manner so that the teacher and the student network can be trained jointly. Extensive experiments on IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.}
    }
  • Jiangning Zhang, Chao Xu, Xiangrui Zhao, Liang Liu, Yong Liu, Jinqiang Yao, and Zaisheng Pan. Learning hierarchical and efficient Person re-identification for robotic navigation. International Journal of Intelligent Robotics and Applications, 5:104–118, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recent works on the person re-identification task mainly focus on model accuracy while ignoring factors related to efficiency, e.g., model size and latency, which are critical for practical application. In this paper, we propose a novel Hierarchical and Efficient Network (HENet) that learns an ensemble of hierarchical global, partial, and recovery features under the supervision of multiple loss combinations. To further improve the robustness against irregular occlusion, we propose a new dataset augmentation approach, dubbed random polygon erasing, to randomly erase irregular areas of the input image, imitating missing body parts. We also propose an Efficiency Score (ES) metric to evaluate model efficiency. Extensive experiments on the Market1501, DukeMTMC-ReID, and CUHK03 datasets show the efficiency and superiority of our approach compared with epoch-making methods. We further deploy HENet on a robotic car, and the experimental result demonstrates the effectiveness of our method for robotic navigation.
    @article{zhang2021lha,
    title = {Learning hierarchical and efficient Person re-identification for robotic navigation},
    author = {Jiangning Zhang and Chao Xu and Xiangrui Zhao and Liang Liu and Yong Liu and Jinqiang Yao and Zaisheng Pan},
    year = 2021,
    journal = {International Journal of Intelligent Robotics and Applications},
    volume = 5,
    pages = {104--118},
    doi = {10.1007/s41315-021-00167-2},
    issue = 2,
    abstract = {Recent works on the person re-identification task mainly focus on model accuracy while ignoring factors related to efficiency, e.g., model size and latency, which are critical for practical application. In this paper, we propose a novel Hierarchical and Efficient Network (HENet) that learns an ensemble of hierarchical global, partial, and recovery features under the supervision of multiple loss combinations. To further improve the robustness against irregular occlusion, we propose a new dataset augmentation approach, dubbed random polygon erasing, to randomly erase irregular areas of the input image, imitating missing body parts. We also propose an Efficiency Score (ES) metric to evaluate model efficiency. Extensive experiments on the Market1501, DukeMTMC-ReID, and CUHK03 datasets show the efficiency and superiority of our approach compared with epoch-making methods. We further deploy HENet on a robotic car, and the experimental result demonstrates the effectiveness of our method for robotic navigation.}
    }
  • Jiangning Zhang, Chao Xu, Jian Li, Wenzhou Chen, Yabiao Wang, Ying Tai, Shuo Chen, Chengjie Wang, Feiyue Huang, and Yong Liu. Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model. In Advances in Neural Information Processing Systems 34 – 35th Conference on Neural Information Processing Systems, pages 26674–26688, 2021.
    [BibTeX] [Abstract] [PDF]
    Inspired by biological evolution, we explain the rationality of the Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA) and derive that both of them have consistent mathematical representations. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly. Moreover, we introduce the space-filling curve into the current vision transformer to sequence image data into a uniform sequential format. Thus we can design a unified EAT framework to address multi-modal tasks, separating the network architecture from the data format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works while having fewer parameters and greater throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., Text-Based Image Retrieval, and our approach improves the rank-1 by +3.7 points over the baseline on the CSS dataset.
    @inproceedings{zhang2021analogous,
    title = {Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model},
    author = {Jiangning Zhang and Chao Xu and Jian Li and Wenzhou Chen and Yabiao Wang and Ying Tai and Shuo Chen and Chengjie Wang and Feiyue Huang and Yong Liu},
    year = 2021,
    booktitle = {Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems},
    pages = {26674--26688},
    abstract = {Inspired by biological evolution, we explain the rationality of the Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA) and derive that both of them have consistent mathematical representations. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly. Moreover, we introduce the space-filling curve into the current vision transformer to sequence image data into a uniform sequential format. Thus we can design a unified EAT framework to address multi-modal tasks, separating the network architecture from the data format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works while having fewer parameters and greater throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., Text-Based Image Retrieval, and our approach improves the rank-1 by +3.7 points over the baseline on the CSS dataset.}
    }
  • Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, and Yunliang Jiang. DTVNet: Dynamic Time-lapse Video Generation via Single Still Image. In European Conference on Computer Vision (ECCV), pages 300–315, 2020.
    [BibTeX] [Abstract] [DOI] [arXiv] [PDF]
    This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: an Optical Flow Encoder (OFE) and a Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that is processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated from different normalized motion vectors based on only one input image. We further conduct experiments on the Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety of videos generated with various motion information.
    @inproceedings{zhang2020dtvnet,
    title = {DTVNet: Dynamic Time-lapse Video Generation via Single Still Image},
    author = {Zhang, Jiangning and Xu, Chao and Liu, Liang and Wang, Mengmeng and Wu, Xia and Liu, Yong and Jiang, Yunliang},
    year = 2020,
    booktitle = {European Conference on Computer Vision (ECCV)},
    pages = {300--315},
    doi = {10.1007/978-3-030-58558-7_18},
    abstract = {This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: an Optical Flow Encoder (OFE) and a Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that is processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated from different normalized motion vectors based on only one input image. We further conduct experiments on the Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety of videos generated with various motion information.},
    arxiv = {https://arxiv.org/abs/2008.04776}
    }