Address

Room 101, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: 637lsq@gmail.com

Shanqi Liu

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I am pursuing my PhD degree in the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. My main research interest is reinforcement learning, with a focus on multi-agent cooperation.

Research and Interests

  • Reinforcement Learning

Publications

  • Weiwei Liu, Wei Jing, Shanqi Liu, Yudi Ruan, Kexin Zhang, Jian Yang, and Yong Liu. Expert Demonstrations Guide Reward Decomposition for Multi-Agent Cooperation. Neural Computing and Applications, 35:19847-19863, 2023.
    Humans are able to achieve good teamwork through collaboration, since the contributions of the actions from human team members are properly understood by each individual. Therefore, reasonable credit assignment is crucial for multi-agent cooperation. Although existing work uses value decomposition algorithms to mitigate the credit assignment problem, since they decompose the global value function at multi-agents’ local value function level, the overall evaluation of the value function can easily lead to approximation errors. Moreover, such strategies are vulnerable to sparse reward scenarios. In this paper, we propose to use expert demonstrations to guide the team reward decomposition at each time step, rather than value decomposition. The proposed method computes the reward ratio of each agent according to the similarity between the state-action pair of the agent and the expert demonstrations. In addition, under this setting, each agent can independently train its value function and evaluate its behavior, which makes the algorithm highly robust to team rewards. Moreover, the proposed method constrains the policy to collect data with a distribution similar to the expert data during exploration, which makes the policy update more robust. We conduct extensive experiments to validate our proposed method in various MARL environments. The results show that our algorithm outperforms the state-of-the-art algorithms in most scenarios; our method is robust to various reward functions; and the trajectories produced by our policy are closer to those of the expert policy. (An illustrative sketch of the per-step reward split appears after the publication list.)
    @article{liu2023edg,
    title = {Expert Demonstrations Guide Reward Decomposition for Multi-Agent Cooperation},
    author = {Weiwei Liu and Wei Jing and Shanqi Liu and Yudi Ruan and Kexin Zhang and Jian Yang and Yong Liu},
    year = 2023,
    journal = {Neural Computing and Applications},
    volume = 35,
    pages = {19847-19863},
    doi = {10.1007/s00521-023-08785-6},
    abstract = {Humans are able to achieve good teamwork through collaboration, since the contributions of the actions from human team members are properly understood by each individual. Therefore, reasonable credit assignment is crucial for multi-agent cooperation. Although existing work uses value decomposition algorithms to mitigate the credit assignment problem, since they decompose the global value function at multi-agents' local value function level, the overall evaluation of the value function can easily lead to approximation errors. Moreover, such strategies are vulnerable to sparse reward scenarios. In this paper, we propose to use expert demonstrations to guide the team reward decomposition at each time step, rather than value decomposition. The proposed method computes the reward ratio of each agent according to the similarity between the state-action pair of the agent and the expert demonstrations. In addition, under this setting, each agent can independently train its value function and evaluate its behavior, which makes the algorithm highly robust to team rewards. Moreover, the proposed method constrains the policy to collect data with a distribution similar to the expert data during exploration, which makes the policy update more robust. We conduct extensive experiments to validate our proposed method in various MARL environments. The results show that our algorithm outperforms the state-of-the-art algorithms in most scenarios; our method is robust to various reward functions; and the trajectories produced by our policy are closer to those of the expert policy.}
    }
  • Shanqi Liu, Weiwei Liu, Wenzhou Chen, Guanzhong Tian, Jun Chen, Yao Tong, Junjie Cao, and Yong Liu. Learning Multi-Agent Cooperation via Considering Actions of Teammates. IEEE Transactions on Neural Networks and Learning Systems, 2023.
    Recently, value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative method among these methods, Q-network MIXing (QMIX), restricts the joint action Q values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as the ad hoc team play situation. In this work, we propose a novel Q values decomposition that considers both the return of an agent acting on its own and cooperating with other observable agents to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action searching method that can improve exploration and is not affected by changes in observable agents or changes in the order of agents' actions. In this way, our method can adapt to the ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly. (An illustrative sketch of this decomposition and the greedy action search appears after the publication list.)
    @article{liu2023lma,
    title = {Learning Multi-Agent Cooperation via Considering Actions of Teammates},
    author = {Shanqi Liu and Weiwei Liu and Wenzhou Chen and Guanzhong Tian and Jun Chen and Yao Tong and Junjie Cao and Yong Liu},
    year = 2023,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2023.3262921},
    abstract = {Recently, value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative method among these methods, Q-network MIXing (QMIX), restricts the joint action Q values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as the ad hoc team play situation. In this work, we propose a novel Q values decomposition that considers both the return of an agent acting on its own and cooperating with other observable agents to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action searching method that can improve exploration and is not affected by changes in observable agents or changes in the order of agents' actions. In this way, our method can adapt to the ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly.}
    }
  • Shanqi Liu, Junjie Cao, Yujie Wang, Wenzhou Chen, and Yong Liu. Self-play reinforcement learning with comprehensive critic in computer games. Neurocomputing, 2021.
    Self-play reinforcement learning, where agents learn by playing with themselves, has been successfully applied in many game scenarios. However, the training procedure for self-play reinforcement learning is unstable and more sample-inefficient than (general) reinforcement learning, especially in imperfect information games. To improve the self-play training process, we incorporate a comprehensive critic into the policy gradient method to form a self-play actor-critic (SPAC) method for training agents to play computer games. We evaluate our method in four different environments in both competitive and cooperative tasks. The results show that the agent trained with our SPAC method outperforms those trained with deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO) algorithms in many different evaluation approaches, which vindicate the effect of our comprehensive critic in the self-play training procedure. © 2021 Elsevier B.V. All rights reserved.
    @article{liu2021spr,
    title = {Self-play reinforcement learning with comprehensive critic in computer games},
    author = {Shanqi Liu and Junjie Cao and Yujie Wang and Wenzhou Chen and Yong Liu},
    year = 2021,
    journal = {Neurocomputing},
    doi = {10.1016/j.neucom.2021.04.006},
    abstract = {Self-play reinforcement learning, where agents learn by playing with themselves, has been successfully applied in many game scenarios. However, the training procedure for self-play reinforcement learning is unstable and more sample-inefficient than (general) reinforcement learning, especially in imperfect information games. To improve the self-play training process, we incorporate a comprehensive critic into the policy gradient method to form a self-play actor-critic (SPAC) method for training agents to play computer games. We evaluate our method in four different environments in both competitive and cooperative tasks. The results show that the agent trained with our SPAC method outperforms those trained with deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO) algorithms in many different evaluation approaches, which vindicate the effect of our comprehensive critic in the self-play training procedure. © 2021 Elsevier B.V. All rights reserved.}
    }
  • Weiwei Liu, Shanqi Liu, Junjie Cao, Qi Wang, Xiaolei Lang, and Yong Liu. Learning Communication for Cooperation in Dynamic Agent-Number Environment. IEEE/ASME Transactions on Mechatronics, 2021.
    The number of agents in many multi-agent systems in the real world changes all the time, such as storage robots and drone cluster systems. Still, most current multi-agent reinforcement learning algorithms are limited to fixed network dimensions, and prior knowledge is used to preset the number of agents in the training phase, which leads to a poor generalization of the algorithm. In addition, these algorithms use centralized training to solve the instability problem of multi-agent systems. However, the centralized learning of large-scale multi-agent reinforcement learning algorithms will lead to an explosion of network dimensions, which in turn leads to very limited scalability of centralized learning algorithms. To solve these two difficulties, we propose Group Centralized Training and Decentralized Execution-Unlimited Dynamic Agent-number Network (GCTDE-UDAN). Firstly, since we use the attention mechanism to select several leaders and establish a dynamic number of teams, and since UDAN performs a non-linear combination of all agents’ Q values when performing value decomposition, our method is not affected by changes in the number of agents. Moreover, our algorithm can unite any agent to form a group and conduct centralized training within the group, avoiding network dimension explosion caused by global centralized training of large-scale agents. Finally, we verified on the simulation and experimental platform that the algorithm can learn and perform cooperative behaviors in many dynamic multi-agent environments.
    @article{liu2021lcf,
    title = {Learning Communication for Cooperation in Dynamic Agent-Number Environment},
    author = {Weiwei Liu and Shanqi Liu and Junjie Cao and Qi Wang and Xiaolei Lang and Yong Liu},
    year = 2021,
    journal = {IEEE/ASME Transactions on Mechatronics},
    doi = {10.1109/TMECH.2021.3076080},
    abstract = {The number of agents in many multi-agent systems in the real world changes all the time, such as storage robots and drone cluster systems. Still, most current multi-agent reinforcement learning algorithms are limited to fixed network dimensions, and prior knowledge is used to preset the number of agents in the training phase, which leads to a poor generalization of the algorithm. In addition, these algorithms use centralized training to solve the instability problem of multi-agent systems. However, the centralized learning of large-scale multi-agent reinforcement learning algorithms will lead to an explosion of network dimensions, which in turn leads to very limited scalability of centralized learning algorithms. To solve these two difficulties, we propose Group Centralized Training and Decentralized Execution-Unlimited Dynamic Agent-number Network (GCTDE-UDAN). Firstly, since we use the attention mechanism to select several leaders and establish a dynamic number of teams, and since UDAN performs a non-linear combination of all agents' Q values when performing value decomposition, our method is not affected by changes in the number of agents. Moreover, our algorithm can unite any agent to form a group and conduct centralized training within the group, avoiding network dimension explosion caused by global centralized training of large-scale agents. Finally, we verified on the simulation and experimental platform that the algorithm can learn and perform cooperative behaviors in many dynamic multi-agent environments.}
    }
  • Weiwei Liu, Shanqi Liu, Jian Yang, and Yong Liu. Learning Intra-group Cooperation in Multi-agent Systems. In 2021 27th International Conference on Mechatronics and Machine Vision in Practice, pages 688-692, 2021.
    Reinforcement learning is one of the algorithms used in multi-agent systems to promote agent cooperation. However, most current multi-agent reinforcement learning algorithms improve the communication capabilities of agents for cooperation, but the overall communication is costly and even harmful due to bandwidth limitations. In addition, decentralized execution cannot generate joint actions, which is not conducive to cooperation. Therefore, we propose the Hierarchical Group Cooperation Network (HGCN). The high-level strategy, Group Network (GroNet), learns to group all agents based on their state rather than their location. The low-level strategy, Group Cooperation Network (GCoNet), is a method of centralized training and centralized execution within a group, which effectively promotes agent collaboration. Finally, we validate our method in various experiments.
    @inproceedings{liu2021lig,
    title = {Learning Intra-group Cooperation in Multi-agent Systems},
    author = {Weiwei Liu and Shanqi Liu and Jian Yang and Yong Liu},
    year = 2021,
    booktitle = {2021 27th International Conference on Mechatronics and Machine Vision in Practice},
    pages = {688-692},
    doi = {10.1109/M2VIP49856.2021.9665049},
    abstract = {Reinforcement learning is one of the algorithms used in multi-agent systems to promote agent cooperation. However, most current multi-agent reinforcement learning algorithms improve the communication capabilities of agents for cooperation, but the overall communication is costly and even harmful due to bandwidth limitations. In addition, decentralized execution cannot generate joint actions, which is not conducive to cooperation. Therefore, we propose the Hierarchical Group Cooperation Network (HGCN). The high-level strategy, Group Network (GroNet), learns to group all agents based on their state rather than their location. The low-level strategy, Group Cooperation Network (GCoNet), is a method of centralized training and centralized execution within a group, which effectively promotes agent collaboration. Finally, we validate our method in various experiments.}
    }
  • Shanqi Liu, Licheng Wen, Jinhao Cui, Xuemeng Yang, Junjie Cao, and Yong Liu. Moving Forward in Formation: A Decentralized Hierarchical Learning Approach to Multi-Agent Moving Together. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4777-4784, 2021.
    Multi-agent path finding in formation has many potential real-world applications like mobile warehouse robotics. However, previous multi-agent path finding (MAPF) methods hardly take formation into consideration. Furthermore, they are usually centralized planners and require the whole state of the environment. Other decentralized partially observable approaches to MAPF are reinforcement learning (RL) methods. However, these RL methods encounter difficulties when learning path finding and formation problems at the same time. In this paper, we propose a novel decentralized partially observable RL algorithm that uses a hierarchical structure to decompose the multi-objective task into unrelated ones. It also calculates a theoretical weight that makes each task's reward have equal influence on the final RL value function. Additionally, we introduce a communication method that helps agents cooperate with each other. Experiments in simulation show that our method outperforms other end-to-end RL methods and our method can naturally scale to large world sizes where centralized planners struggle. We also deploy and validate our method in a real-world scenario.
    @inproceedings{liu2021movingfi,
    title = {Moving Forward in Formation: A Decentralized Hierarchical Learning Approach to Multi-Agent Moving Together},
    author = {Shanqi Liu and Licheng Wen and Jinhao Cui and Xuemeng Yang and Junjie Cao and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems},
    pages = {4777-4784},
    doi = {10.1109/IROS51168.2021.9636224},
    abstract = {Multi-agent path finding in formation has many potential real-world applications like mobile warehouse robotics. However, previous multi-agent path finding (MAPF) methods hardly take formation into consideration. Furthermore, they are usually centralized planners and require the whole state of the environment. Other decentralized partially observable approaches to MAPF are reinforcement learning (RL) methods. However, these RL methods encounter difficulties when learning path finding and formation problems at the same time. In this paper, we propose a novel decentralized partially observable RL algorithm that uses a hierarchical structure to decompose the multi-objective task into unrelated ones. It also calculates a theoretical weight that makes each task's reward have equal influence on the final RL value function. Additionally, we introduce a communication method that helps agents cooperate with each other. Experiments in simulation show that our method outperforms other end-to-end RL methods and our method can naturally scale to large world sizes where centralized planners struggle. We also deploy and validate our method in a real-world scenario.}
    }
  • Shanqi Liu, Junjie Cao, Wenzhou Chen, Licheng Wen, and Yong Liu. HILONet: Hierarchical Imitation Learning from Non-Aligned Observations. In 2021 IEEE 10th Data Driven Control and Learning Systems Conference, 2021.
    It is challenging to learn from demonstrated observation-only trajectories in a non-time-aligned environment, because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation (HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Our method can solve all kinds of tasks by achieving these sub-goals, whether it has a single goal position or not. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency. (An illustrative sketch of the sub-goal selection appears after the publication list.)
    @inproceedings{liu2021hilonethi,
    title = {HILONet: Hierarchical Imitation Learning from Non-Aligned Observations},
    author = {Shanqi Liu and Junjie Cao and Wenzhou Chen and Licheng Wen and Yong Liu},
    year = 2021,
    booktitle = {2021 IEEE 10th Data Driven Control and Learning Systems Conference},
    doi = {10.48550/arXiv.2011.02671},
    abstract = {It is challenging to learn from demonstrated observation-only trajectories in a non-time-aligned environment, because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation (HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Our method can solve all kinds of tasks by achieving these sub-goals, whether it has a single goal position or not. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency.}
    }
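
Illustrative Code Sketches

The sketch below illustrates the per-step reward split described in "Expert Demonstrations Guide Reward Decomposition for Multi-Agent Cooperation" above. It is a minimal reconstruction from the abstract, not the paper's implementation: the similarity measure (negative distance to the nearest expert state-action pair), the softmax with a temperature, and the feature dimensions are all illustrative assumptions.

import numpy as np

def reward_ratios(agent_sa, expert_sa, temperature=1.0):
    # agent_sa : (n_agents, d) state-action features, one row per agent (assumed encoding).
    # expert_sa: (n_demos, d) state-action features taken from the expert demonstrations.
    # Similarity of each agent = negative distance to its closest expert sample.
    dists = np.linalg.norm(agent_sa[:, None, :] - expert_sa[None, :, :], axis=-1)
    similarity = -dists.min(axis=1)
    # A softmax over agents turns similarities into per-step reward ratios that sum to 1.
    logits = similarity / temperature
    logits = logits - logits.max()                 # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Each agent then trains its own value function on its share of the team reward.
rng = np.random.default_rng(0)
ratios = reward_ratios(rng.normal(size=(3, 8)), rng.normal(size=(100, 8)))
per_agent_rewards = ratios * 1.0                   # team reward of 1.0 at this step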
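
The next sketch corresponds to "Learning Multi-Agent Cooperation via Considering Actions of Teammates" above. It shows only the shape of the idea stated in the abstract: a per-agent value made of an own-return term plus a permutation-invariant sum over observable teammates' actions, followed by a greedy search over the agent's own actions. The linear placeholders W_self and W_coop stand in for the learned networks and are purely hypothetical.

import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, OBS_DIM = 4, 6
W_self = rng.normal(size=(N_ACTIONS, OBS_DIM))             # placeholder for the own-return head
W_coop = rng.normal(size=(N_ACTIONS, N_ACTIONS, OBS_DIM))  # placeholder for the pairwise cooperation head

def q_self(obs, action):
    # Return of the agent acting on its own.
    return float(obs @ W_self[action])

def q_coop(obs, action, teammate_action):
    # Contribution of cooperating with one observable teammate taking teammate_action.
    return float(obs @ W_coop[action, teammate_action])

def decomposed_q(obs, action, teammate_actions):
    # Own-return term plus a sum over observable teammates' actions. The sum is
    # permutation-invariant, so the value does not depend on the order of teammates
    # and tolerates teammates appearing or disappearing (ad hoc team play).
    return q_self(obs, action) + sum(q_coop(obs, action, a_j) for a_j in teammate_actions)

def greedy_action(obs, teammate_actions):
    # Greedy search over this agent's own actions given the observed teammate actions.
    return max(range(N_ACTIONS), key=lambda a: decomposed_q(obs, a, teammate_actions))

obs = rng.normal(size=OBS_DIM)
chosen = greedy_action(obs, teammate_actions=[1, 3])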
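
Finally, a sketch of the sub-goal mechanism described in "HILONet: Hierarchical Imitation Learning from Non-Aligned Observations" above. In the paper the high-level policy learns which demonstrated observation to pursue next; the nearest-unreached-observation heuristic and the distance-based low-level reward below are simplifying assumptions used only to make the idea concrete.

import numpy as np

def select_subgoal(current_obs, demo_obs, reach_radius=0.5):
    # demo_obs: (n_demo, d) observations from a (possibly non-time-aligned) demonstration.
    # Stand-in for the learned high-level policy: among demonstrated observations that
    # are not yet within reach_radius, pick the nearest one as the next sub-goal.
    dists = np.linalg.norm(demo_obs - current_obs, axis=1)
    candidates = np.where(dists > reach_radius)[0]
    if candidates.size == 0:                       # every demonstrated state already reached
        return demo_obs[-1]
    return demo_obs[candidates[np.argmin(dists[candidates])]]

def low_level_reward(next_obs, subgoal):
    # Dense shaping reward for the low-level policy: negative distance to the sub-goal.
    return -float(np.linalg.norm(next_obs - subgoal))

rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 3))
subgoal = select_subgoal(rng.normal(size=3), demo)
reward = low_level_reward(rng.normal(size=3), subgoal)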