Address

Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: jiazhengxing@zju.edu.cn

Jiazheng Xing

MS Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I am pursuing my M.S. degree in College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My major research interests include Deep Learning, Computer Vision and Action Recognition.

Research and Interests

  • Computer Vision
  • Action Recognition

Publications

  • Mengmeng Wang, Jiazheng Xing, Jing Su, Jun Chen, and Yong Liu. Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:3347-3362, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recent methods for action recognition always apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flows to present motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both the two kinds of features in a unified 2D CNN without any 3D convolution or optical flows calculation. In particular, we first design a channel-wise spatiotemporal module to present the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Secondly, we combine these two modules and an identity mapping path into one united block that can easily replaces the original residual block in the ResNet architecture, forming a simple yet effective network termed STM network by introducing very limited extra computation cost and parameters. Thirdly, we propose a novel Twins Training framework for action recognition by incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to fully stretch the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 & v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods in all the datasets.
    @article{wang2022lsm,
    title = {Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition},
    author = {Mengmeng Wang and Jiazheng Xing and Jing Su and Jun Chen and Yong Liu},
    year = 2022,
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume = 45,
    pages = {3347-3362},
    doi = {10.1109/TPAMI.2022.3173658},
    abstract = {Recent methods for action recognition always apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flows to present motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both the two kinds of features in a unified 2D CNN without any 3D convolution or optical flows calculation. In particular, we first design a channel-wise spatiotemporal module to present the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. Secondly, we combine these two modules and an identity mapping path into one united block that can easily replaces the original residual block in the ResNet architecture, forming a simple yet effective network termed STM network by introducing very limited extra computation cost and parameters. Thirdly, we propose a novel Twins Training framework for action recognition by incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to fully stretch the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 \& v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods in all the datasets.}
    }
  • Jiazheng Xing, Mengmeng Wang, Boyu Mu, and Yong Liu. Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition. In 37th AAAI Conference on Artificial Intelligence (AAAI), 2023.
    [BibTeX] [Abstract] [DOI]
    Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets.
    @inproceedings{xing2023rst,
    title = {Revisiting the Spatial and Temporal Modeling for Few-Shot Action Recognition},
    author = {Jiazheng Xing and Mengmeng Wang and Boyu Mu and Yong Liu},
    year = 2023,
    booktitle = {37th AAAI Conference on Artificial Intelligence (AAAI)},
    doi = {10.48550/arXiv.2301.07944},
    abstract = {Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based
    on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on
    the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets.}
    }