Xia Wu

MS Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Address

Room 101, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: xiawu@zju.edu.cn

Biography

I am pursuing my M.S. degree in the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My major research interests include action recognition and gesture recognition.

Research Interests

  • Action Recognition
  • Gesture Recognition

Publications

  • Chao Xu, Xia Wu, Mengmeng Wang, Feng Qiu, Yong Liu, and Jun Ren. Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture. Neurocomputing, 523:58–68, 2023.
    Human–computer interaction technology brings great convenience to people, and dynamic gesture recognition makes it possible for a man to interact naturally with a machine. However, recognizing gestures quickly and precisely in untrimmed videos remains a challenge in real-world systems since: (1) It is challenging to locate the temporal boundaries of performing gestures; (2) There are significant differences in performing gestures among different people, resulting in a variety of gestures; (3) There must be a trade-off between the accuracy and the computational consumption. In this work, we propose an online lightweight two-stage framework, including a detection module and a gesture recognition module, to precisely detect and classify dynamic gestures in untrimmed videos. Specifically, we first design a low-power detection module to locate gestures in time series, then a temporal relational reasoning module is employed for gesture recognition. Moreover, we present a new dynamic gesture dataset named ZJUGesture, which contains nine classes of common gestures in various scenarios. Extensive experiments on the proposed ZJUGesture and 20-bn-Jester dataset demonstrate the attractive performance of our method with high accuracy and a low computational cost.
    @article{xv2022idg,
    title = {Improving Dynamic Gesture Recognition in Untrimmed Videos by An Online Lightweight Framework and A New Gesture Dataset ZJUGesture},
    author = {Chao Xu and Xia Wu and Mengmeng Wang and Feng Qiu and Yong Liu and Jun Ren},
    year = 2023,
    journal = {Neurocomputing},
    volume = 523,
    pages = {58--68},
    doi = {10.1016/j.neucom.2022.12.022}
    }
  • Chao Xu, Xia Wu, Yachun Li, Yining Jin, Mengmeng Wang, and Yong Liu. Cross-modality online distillation for multi-view action recognition. Neurocomputing, 2021.
    Recently, multi-modality features have been introduced into multi-view action recognition methods to obtain more robust performance. However, not all modalities are available in real applications; daily scenes, for example, often lack depth data and capture RGB sequences only. This raises the challenge of how to learn critical features from multi-modality data while relying on RGB sequences alone and still obtaining robust performance at test time. To address this challenge, our paper presents a novel two-stage teacher-student framework in which the teacher network takes advantage of multi-view geometry-and-texture features during training, while the student network is given only RGB sequences at test time. Specifically, in the first stage, a Cross-modality Aggregated Transfer (CAT) network is proposed to transfer multi-view cross-modality aggregated features from the teacher network to the student network. Moreover, we design a Viewpoint-Aware Attention (VAA) module that captures discriminative information across different views to effectively combine multi-view features. In the second stage, a Multi-view Features Strengthen (MFS) network, which also contains the VAA module, further strengthens the global view-invariant features of the student network. Both CAT and MFS learn in an online distillation manner so that the teacher and student networks can be trained jointly. Extensive experiments on IXMAS and Northwestern-UCLA demonstrate the effectiveness of the proposed method.
    @article{xu2021cmo,
    title = {Cross-modality online distillation for multi-view action recognition},
    author = {Chao Xu and Xia Wu and Yachun Li and Yining Jin and Mengmeng Wang and Yong Liu},
    year = 2021,
    journal = {Neurocomputing},
    doi = {10.1016/j.neucom.2021.05.077}
    }
  • Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, and Yunliang Jiang. DTVNet: Dynamic Time-lapse Video Generation via Single Still Image. In ECCV, pages 300–315, 2020.
    This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.
    @inproceedings{zhang2020dtvnet,
    title = {{DTVNet}: Dynamic Time-lapse Video Generation via Single Still Image},
    author = {Zhang, Jiangning and Xu, Chao and Liu, Liang and Wang, Mengmeng and Wu, Xia and Liu, Yong and Jiang, Yunliang},
    year = 2020,
    booktitle = {{ECCV}},
    pages = {300--315},
    doi = {10.1007/978-3-030-58558-7_18},
    arxiv = {https://arxiv.org/abs/2008.04776}
    }