Address

Room 101, Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang, China

Contact Information

Email: 11932061@zju.edu.cn

Weiwei Liu

PhD Student

Institute of Cyber-Systems and Control, Zhejiang University, China

Biography

I am pursuing my Ph.D. at the College of Control Science and Engineering, Zhejiang University, Hangzhou, China.

Research Interests

  • Reinforcement Learning

Publications

  • Weiwei Liu, Wei Jing, Shanqi Liu, Yudi Ruan, Kexin Zhang, Jian Yang, and Yong Liu. Expert Demonstrations Guide Reward Decomposition for Multi-Agent Cooperation. Neural Computing and Applications, 35:19847-19863, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Humans are able to achieve good teamwork through collaboration, since the contributions of the actions from human team members are properly understood by each individual. Therefore, reasonable credit assignment is crucial for multi-agent cooperation. Although existing work uses value decomposition algorithms to mitigate the credit assignment problem, since they decompose the global value function at multi-agents’ local value function level, the overall evaluation of the value function can easily lead to approximation errors. Moreover, such strategies are vulnerable to sparse reward scenarios. In this paper, we propose to use expert demonstrations to guide the team reward decomposition at each time step, rather than value decomposition. The proposed method computes the reward ratio of each agent according to the similarity between the state-action pair of the agent and the expert demonstrations. In addition, under this setting, each agent can independently train its value function and evaluate its behavior, which makes the algorithm highly robust to team rewards. Moreover, the proposed method constrains the policy to collect data with similar distribution to the expert data during the exploration, which makes policy update more robust. We conduct extensive experiments to validate our proposed method in various MARL environments, the results show that our algorithm outperforms the state-of-the-art algorithms in most scenarios; our method is robust to various reward functions; and the trajectories by our policy is closer to that of the expert policy.
    @article{liu2023edg,
    title = {Expert Demonstrations Guide Reward Decomposition for Multi-Agent Cooperation},
    author = {Weiwei Liu and Wei Jing and Shanqi Liu and Yudi Ruan and Kexin Zhang and Jian Yang and Yong Liu},
    year = 2023,
    journal = {Neural Computing and Applications},
    volume = 35,
    pages = {19847-19863},
    doi = {10.1007/s00521-023-08785-6},
    abstract = {Humans are able to achieve good teamwork through collaboration, since the contributions of the actions from human team members are properly understood by each individual. Therefore, reasonable credit assignment is crucial for multi-agent cooperation. Although existing work uses value decomposition algorithms to mitigate the credit assignment problem, since they decompose the global value function at multi-agents' local value function level, the overall evaluation of the value function can easily lead to approximation errors. Moreover, such strategies are vulnerable to sparse reward scenarios. In this paper, we propose to use expert demonstrations to guide the team reward decomposition at each time step, rather than value decomposition. The proposed method computes the reward ratio of each agent according to the similarity between the state-action pair of the agent and the expert demonstrations. In addition, under this setting, each agent can independently train its value function and evaluate its behavior, which makes the algorithm highly robust to team rewards. Moreover, the proposed method constrains the policy to collect data with similar distribution to the expert data during the exploration, which makes policy update more robust. We conduct extensive experiments to validate our proposed method in various MARL environments, the results show that our algorithm outperforms the state-of-the-art algorithms in most scenarios; our method is robust to various reward functions; and the trajectories by our policy is closer to that of the expert policy.}
    }
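    A minimal, illustrative sketch of the reward-splitting idea in the entry above (not the paper's implementation): the shared team reward is divided in proportion to how close each agent's state-action pair is to its nearest expert demonstration. The Gaussian kernel, the nearest-neighbour similarity, and all array shapes are assumptions made for this example.
      # Illustrative sketch: split a shared team reward among agents in proportion to how
      # similar each agent's state-action pair is to the nearest expert demonstration
      # (Gaussian-kernel similarity is an assumption, not the paper's metric).
      import numpy as np

      def decompose_team_reward(team_reward, agent_sa_pairs, expert_sa_pairs, sigma=1.0):
          """agent_sa_pairs: (n_agents, d) state-action vectors; expert_sa_pairs: (n_demo, d)."""
          similarities = []
          for sa in agent_sa_pairs:
              d2 = np.min(np.sum((expert_sa_pairs - sa) ** 2, axis=1))  # closest expert pair
              similarities.append(np.exp(-d2 / (2.0 * sigma ** 2)))     # Gaussian kernel
          similarities = np.asarray(similarities)
          ratios = similarities / (similarities.sum() + 1e-8)           # per-agent reward ratio
          return ratios * team_reward                                   # individual rewards

      # Example: three agents share a team reward of 10.
      rng = np.random.default_rng(0)
      print(decompose_team_reward(10.0, rng.normal(size=(3, 6)), rng.normal(size=(50, 6))))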
  • Weiwei Liu, Linpeng Peng, Licheng Wen, Jian Yang, and Yong Liu. Decomposing Shared Networks for Separate Cooperation with Multi-agent Reinforcement Learning. Information Sciences, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Sharing network parameters between agents is an essential and typical operation to improve the scalability of multi-agent reinforcement learning algorithms. However, agents with different tasks sharing the same network parameters are not conducive to distinguishing the agents’ skills. In addition, the importance of communication between agents undertaking the same task is much higher than that with external agents. Therefore, we propose Dual Cooperation Networks (DCN). In order to distinguish whether agents undertake the same task, all agents are grouped according to their status through the graph neural network instead of the traditional proximity. The agent communicates within the group to achieve strong cooperation. After that, the global value function is decomposed by groups to facilitate cooperation between groups. Finally, we have verified it in simulation and physical hardware, and the algorithm has achieved excellent performance.
    @article{liu2023dsn,
    title = {Decomposing Shared Networks for Separate Cooperation with Multi-agent Reinforcement Learning},
    author = {Weiwei Liu and Linpeng Peng and Licheng Wen and Jian Yang and Yong Liu},
    year = 2023,
    journal = {Information Sciences},
    doi = {10.1016/j.ins.2023.119085},
    abstract = {Sharing network parameters between agents is an essential and typical operation to improve the scalability of multi-agent reinforcement learning algorithms. However, agents with different tasks sharing the same network parameters are not conducive to distinguishing the agents' skills. In addition, the importance of communication between agents undertaking the same task is much higher than that with external agents. Therefore, we propose Dual Cooperation Networks (DCN). In order to distinguish whether agents undertake the same task, all agents are grouped according to their status through the graph neural network instead of the traditional proximity. The agent communicates within the group to achieve strong cooperation. After that, the global value function is decomposed by groups to facilitate cooperation between groups. Finally, we have verified it in simulation and physical hardware, and the algorithm has achieved excellent performance.}
    }
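    The two-level structure described in the DCN abstract above can be illustrated with a small stand-in sketch: agents are grouped by their state features (k-means here replaces the paper's graph neural network), utilities are combined within each group, and the group values are then combined into a global value. The clustering choice, the purely additive combination, and all shapes are assumptions for illustration only.
      # Stand-in sketch for the grouping-then-decomposition idea (not DCN itself):
      # k-means over state features replaces the GNN grouping, and simple sums
      # replace the learned mixing networks.
      import numpy as np

      def kmeans(x, k, iters=20, seed=0):
          rng = np.random.default_rng(seed)
          centers = x[rng.choice(len(x), k, replace=False)]
          for _ in range(iters):
              labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
              centers = np.stack([x[labels == i].mean(0) if np.any(labels == i) else centers[i]
                                  for i in range(k)])
          return labels

      def grouped_value(agent_states, agent_utilities, n_groups=2):
          groups = kmeans(agent_states, n_groups)                        # group by state, not proximity
          group_values = [agent_utilities[groups == g].sum() for g in range(n_groups)]
          return groups, group_values, sum(group_values)                 # per-group and global values

      states = np.random.default_rng(1).normal(size=(6, 4))
      print(grouped_value(states, np.arange(6, dtype=float)))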
  • Shanqi Liu, Weiwei Liu, Wenzhou Chen, Guanzhong Tian, Jun Chen, Yao Tong, Junjie Cao, and Yong Liu. Learning Multi-Agent Cooperation via Considering Actions of Teammates. IEEE Transactions on Neural Networks and Learning Systems, 2023.
    [BibTeX] [Abstract] [DOI] [PDF]
    Recently value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative method among these methods, Q-network MIXing (QMIX), restricts the joint action Q values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as ad hoc team play situation. In this work, we propose a novel Q values decomposition that considers both the return of an agent acting on its own and cooperating with other observable agents to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action searching method that can improve exploration and is not affected by changes in observable agents or changes in the order of agents' actions. In this way, our method can adapt to ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly.
    @article{liu2023lma,
    title = {Learning Multi-Agent Cooperation via Considering Actions of Teammates},
    author = {Shanqi Liu and Weiwei Liu and Wenzhou Chen and Guanzhong Tian and Jun Chen and Yao Tong and Junjie Cao and Yong Liu},
    year = 2023,
    journal = {IEEE Transactions on Neural Networks and Learning Systems},
    doi = {10.1109/TNNLS.2023.3262921},
    abstract = {Recently value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative method among these methods, Q-network MIXing (QMIX), restricts the joint action Q values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as ad hoc team play situation. In this work, we propose a novel Q values decomposition that considers both the return of an agent acting on its own and cooperating with other observable agents to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action searching method that can improve exploration and is not affected by changes in observable agents or changes in the order of agents' actions. In this way, our method can adapt to ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly.}
    }
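    A toy illustration of the decomposition and search described above: each agent's value is written as a "self" term plus pairwise "cooperation" terms with observable teammates, and a greedy coordinate search picks actions one agent at a time, with repeated sweeps to reduce sensitivity to agent order. The tabular Q arrays and the coordinate-ascent search are assumptions for this sketch, not the paper's network architecture.
      # Toy sketch: joint value = self terms + pairwise cooperation terms, maximised by a
      # greedy per-agent coordinate search (an assumption made here, not the paper's code).
      import numpy as np

      def joint_value(actions, q_self, q_coop):
          n = len(actions)
          value = sum(q_self[i][actions[i]] for i in range(n))
          value += sum(q_coop[i][j][actions[i], actions[j]]
                       for i in range(n) for j in range(n) if i != j)
          return value

      def greedy_action_search(q_self, q_coop, n_actions, sweeps=3):
          n, actions = len(q_self), [0] * len(q_self)
          for _ in range(sweeps):                                        # sweeps reduce order effects
              for i in range(n):
                  scores = [joint_value(actions[:i] + [a] + actions[i + 1:], q_self, q_coop)
                            for a in range(n_actions)]
                  actions[i] = int(np.argmax(scores))
          return actions

      rng = np.random.default_rng(0)
      n_agents, n_actions = 3, 4
      q_self = rng.normal(size=(n_agents, n_actions))
      q_coop = rng.normal(size=(n_agents, n_agents, n_actions, n_actions))
      acts = greedy_action_search(q_self, q_coop, n_actions)
      print(acts, joint_value(acts, q_self, q_coop))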
  • Gang Xu, Yansong Chen, Junjie Cao, Deye Zhu, Weiwei Liu, and Yong Liu. Multivehicle Motion Planning with Posture Constraints in Real World. IEEE/ASME Transactions on Mechatronics, 27(4):2125-2133, 2022.
    [BibTeX] [Abstract] [DOI] [PDF]
    This article addresses the posture constraints problem in multivehicle motion planning for specific applications such as ground exploration tasks. Unlike most of the related work in motion planning, this article investigates more practical applications in the real world for nonholonomic unmanned ground vehicles (UGVs). In this case, a strategy of diversion is designed to optimize the smoothness of motion. Considering the problem of the posture constraints, a postured collision avoidance algorithm is proposed for the motion planning of the multiple nonholonomic UGVs. Two simulation experiments were conducted to verify the effectiveness and analyze the quantitative performance of the proposed method. Then, the practicability of the proposed algorithm was verified with an experiment in a natural environment.
    @article{xu2022mmp,
    title = {Multivehicle Motion Planning with Posture Constraints in Real World},
    author = {Gang Xu and Yansong Chen and Junjie Cao and Deye Zhu and Weiwei Liu and Yong Liu},
    year = 2022,
    journal = {IEEE/ASME Transactions on Mechatronics},
    volume = {27},
    number = {4},
    pages = {2125-2133},
    doi = {10.1109/TMECH.2022.3173130},
    abstract = {This article addresses the posture constraints problem in multivehicle motion planning for specific applications such as ground exploration tasks. Unlike most of the related work in motion planning, this article investigates more practical applications in the real world for nonholonomic unmanned ground vehicles (UGVs). In this case, a strategy of diversion is designed to optimize the smoothness of motion. Considering the problem of the posture constraints, a postured collision avoidance algorithm is proposed for the motion planning of the multiple nonholonomic UGVs. Two simulation experiments were conducted to verify the effectiveness and analyze the quantitative performance of the proposed method. Then, the practicability of the proposed algorithm was verified with an experiment in a natural environment.}
    }
  • Weiwei Liu, Shanqi Liu, Junjie Cao, Qi Wang, Xiaolei Lang, and Yong Liu. Learning Communication for Cooperation in Dynamic Agent-Number Environment. IEEE/ASME Transactions on Mechatronics, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    The number of agents in many multi-agent systems in the real world changes all the time, such as storage robots and drone cluster systems. Still, most current multi-agent reinforcement learning algorithms are limited to fixed network dimensions, and prior knowledge is used to preset the number of agents in the training phase, which leads to a poor generalization of the algorithm. In addition, these algorithms use centralized training to solve the instability problem of multi-agent systems. However, the centralized learning of large-scale multi-agent reinforcement learning algorithms will lead to an explosion of network dimensions, which in turn leads to very limited scalability of centralized learning algorithms. To solve these two difficulties, we propose Group Centralized Training and Decentralized Execution-Unlimited Dynamic Agent-number Network (GCTDE-UDAN). Firstly, since we use the attention mechanism to select several leaders and establish a dynamic number of teams, and UDAN performs a non-linear combination of all agents’ Q values when performing value decomposition, it is not affected by changes in the number of agents. Moreover, our algorithm can unite any agent to form a group and conduct centralized training within the group, avoiding network dimension explosion caused by global centralized training of large-scale agents. Finally, we verified on the simulation and experimental platform that the algorithm can learn and perform cooperative behaviors in many dynamic multi-agent environments.
    @article{liu2021lcf,
    title = {Learning Communication for Cooperation in Dynamic Agent-Number Environment},
    author = {Weiwei Liu and Shanqi Liu and Junjie Cao and Qi Wang and Xiaolei Lang and Yong Liu},
    year = 2021,
    journal = {IEEE/ASME Transactions on Mechatronics},
    doi = {10.1109/TMECH.2021.3076080},
    abstract = {The number of agents in many multi-agent systems in the real world changes all the time, such as storage robots and drone cluster systems. Still, most current multi-agent reinforcement learning algorithms are limited to fixed network dimensions, and prior knowledge is used to preset the number of agents in the training phase, which leads to a poor generalization of the algorithm. In addition, these algorithms use centralized training to solve the instability problem of multi-agent systems. However, the centralized learning of large-scale multi-agent reinforcement learning algorithms will lead to an explosion of network dimensions, which in turn leads to very limited scalability of centralized learning algorithms. To solve these two difficulties, we propose Group Centralized Training and Decentralized Execution-Unlimited Dynamic Agent-number Network (GCTDE-UDAN). Firstly, since we use the attention mechanism to select several leaders and establish a dynamic number of teams, and UDAN performs a non-linear combination of all agents' Q values when performing value decomposition, it is not affected by changes in the number of agents. Moreover, our algorithm can unite any agent to form a group and conduct centralized training within the group, avoiding network dimension explosion caused by global centralized training of large-scale agents. Finally, we verified on the simulation and experimental platform that the algorithm can learn and perform cooperative behaviors in many dynamic multi-agent environments.}
    }
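    The leader-selection step described above can be sketched in a few lines: dot-product attention over agent feature vectors scores how much attention each agent receives, the top-scoring agents become leaders, and every agent joins the leader it attends to most, so the team structure adapts to however many agents are present. The attention form and leader count are assumptions; the UDAN value mixing itself is not reproduced here.
      # Hedged sketch of attention-based leader selection and dynamic teams
      # (illustrative only; not the paper's GCTDE-UDAN implementation).
      import numpy as np

      def softmax(x, axis=-1):
          z = x - x.max(axis=axis, keepdims=True)
          e = np.exp(z)
          return e / e.sum(axis=axis, keepdims=True)

      def form_teams(agent_feats, n_leaders=2):
          """agent_feats: (n_agents, d). Returns leader indices and each agent's leader."""
          scores = agent_feats @ agent_feats.T / np.sqrt(agent_feats.shape[1])
          importance = softmax(scores, axis=1).sum(axis=0)               # attention each agent receives
          leaders = np.argsort(importance)[-n_leaders:]                  # top-k agents become leaders
          attn_to_leaders = softmax(scores[:, leaders], axis=1)
          teams = leaders[np.argmax(attn_to_leaders, axis=1)]            # each agent follows one leader
          return leaders, teams

      feats = np.random.default_rng(2).normal(size=(5, 8))
      print(form_teams(feats))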
  • Weiwei Liu, Linpeng Peng, Junjie Cao, Xiaokuan Fu, Yong Liu, and Zaisheng Pan. Ensemble Bootstrapped Deep Deterministic Policy Gradient for Vision-Based Robotic Grasping. IEEE Access, 9:19916–19925, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    With sufficient practice, humans can grab objects they have never seen before through brain decision-making. However, the manipulators, which has a wide range of applications in industrial production, can still only grab specific objects. Because most of the grasp algorithms rely on prior knowledge such as hand-eye calibration results, object model features, and can only target specific types of objects. When the task scenario and the operation target change, it cannot perform effective redeployment. In order to solve the above problems, academia often uses reinforcement learning to train grasping algorithms. However, the method of reinforcement learning in the field of manipulators grasping mainly encounters these main problems: insufficient sample utilization, poor algorithm stability, and limited exploration. This article uses LfD, BC, and DDPG to improve sample utilization. Use multiple critics to integrate and evaluate input actions to solve the problem of algorithm instability. Finally, inspired by Thompson’s sampling idea, the input action is evaluated from different angles, which increases the algorithm’s exploration of the environment and reduces the number of interactions with the environment. EDDPG and EBDDPG algorithm is designed in the article. In order to further improve the generalization ability of the algorithm, this article does not use extra information that is difficult to obtain directly on the physical platform, such as the real coordinates of the target object and the continuous motion space at the end of the manipulator in the Cartesian coordinate system is used as the output of the decision. The simulation results show that, under the same number of interactions, the manipulators’ success rate in grabbing 1000 random objects has increased more than double and reached state-of-the-art(SOTA) performance.
    @article{liu2021ensemblebd,
    title = {Ensemble Bootstrapped Deep Deterministic Policy Gradient for Vision-Based Robotic Grasping},
    author = {Weiwei Liu and Linpeng Peng and Junjie Cao and Xiaokuan Fu and Yong Liu and Zaisheng Pan},
    year = 2021,
    journal = {IEEE Access},
    volume = 9,
    pages = {19916--19925},
    doi = {10.1109/ACCESS.2021.3049860},
    abstract = {With sufficient practice, humans can grab objects they have never seen before through brain decision-making. However, the manipulators, which has a wide range of applications in industrial production, can still only grab specific objects. Because most of the grasp algorithms rely on prior knowledge such as hand-eye calibration results, object model features, and can only target specific types of objects. When the task scenario and the operation target change, it cannot perform effective redeployment. In order to solve the above problems, academia often uses reinforcement learning to train grasping algorithms. However, the method of reinforcement learning in the field of manipulators grasping mainly encounters these main problems: insufficient sample utilization, poor algorithm stability, and limited exploration. This article uses LfD, BC, and DDPG to improve sample utilization. Use multiple critics to integrate and evaluate input actions to solve the problem of algorithm instability. Finally, inspired by Thompson's sampling idea, the input action is evaluated from different angles, which increases the algorithm's exploration of the environment and reduces the number of interactions with the environment. EDDPG and EBDDPG algorithm is designed in the article. In order to further improve the generalization ability of the algorithm, this article does not use extra information that is difficult to obtain directly on the physical platform, such as the real coordinates of the target object and the continuous motion space at the end of the manipulator in the Cartesian coordinate system is used as the output of the decision. The simulation results show that, under the same number of interactions, the manipulators' success rate in grabbing 1000 random objects has increased more than double and reached state-of-the-art(SOTA) performance.}
    }
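    The ensemble-of-critics idea highlighted above lends itself to a short PyTorch sketch: several bootstrapped critic heads score the same state-action pair, one randomly sampled head is followed during an episode (Thompson-sampling-style exploration), and the ensemble mean serves as the value estimate for updates. The network sizes, number of heads, and mean aggregation are assumptions; this is not the paper's EDDPG/EBDDPG code.
      # Illustrative ensemble critic with Thompson-sampling-style head selection
      # (sizes and aggregation rule are assumptions for this sketch).
      import torch
      import torch.nn as nn

      class EnsembleCritic(nn.Module):
          def __init__(self, state_dim, action_dim, n_heads=5, hidden=64):
              super().__init__()
              self.heads = nn.ModuleList(
                  nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))
                  for _ in range(n_heads))

          def forward(self, state, action):
              x = torch.cat([state, action], dim=-1)
              return torch.stack([head(x) for head in self.heads], dim=0)  # (n_heads, batch, 1)

      critic = EnsembleCritic(state_dim=12, action_dim=4)
      state, action = torch.randn(32, 12), torch.randn(32, 4)
      q_all = critic(state, action)
      head = torch.randint(len(critic.heads), (1,)).item()                 # one head per episode
      q_explore = q_all[head]                                              # guides exploration
      q_target = q_all.mean(dim=0)                                         # ensemble estimate for updates
      print(q_explore.shape, q_target.shape)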
  • Weiwei Liu, Shanqi Liu, Jian Yang, and Yong Liu. Learning Intra-group Cooperation in Multi-agent Systems. In 2021 27th International Conference on Mechatronics and Machine Vision in Practice, pages 688-692, 2021.
    [BibTeX] [Abstract] [DOI] [PDF]
    Reinforcement learning is one of the algorithms used in multi-agent systems to promote agent cooperation. However, most current multi-agent reinforcement learning algorithms improve the communication capabilities of agents for cooperation, but the overall communication is costly and even harmful due to bandwidth limitations. In addition, de-centralized execution cannot generate joint actions, which is not conducive to cooperation. Therefore, we proposed the Hierarchical Group Cooperation Network (HGCN). Advanced strategy, Group Network (GroNet), learns to group all agents based on their state rather than their location. The Low-level strategy, Group Cooperation Network (GCoNet), is a method of centralized training and centralized execution within a group, which effectively promotes agent collaboration. Finally, we validated our method in various experiments.
    @inproceedings{liu2021lig,
    title = {Learning Intra-group Cooperation in Multi-agent Systems},
    author = {Weiwei Liu and Shanqi Liu and Jian Yang and Yong Liu},
    year = 2021,
    booktitle = {2021 27th International Conference on Mechatronics and Machine Vision in Practice},
    pages = {688-692},
    doi = {10.1109/M2VIP49856.2021.9665049},
    abstract = {Reinforcement learning is one of the algorithms used in multi-agent systems to promote agent cooperation. However, most current multi-agent reinforcement learning algorithms improve the communication capabilities of agents for cooperation, but the overall communication is costly and even harmful due to bandwidth limitations. In addition, de-centralized execution cannot generate joint actions, which is not conducive to cooperation. Therefore, we proposed the Hierarchical Group Cooperation Network (HGCN). Advanced strategy, Group Network (GroNet), learns to group all agents based on their state rather than their location. The Low-level strategy, Group Cooperation Network (GCoNet), is a method of centralized training and centralized execution within a group, which effectively promotes agent collaboration. Finally, we validated our method in various experiments.}
    }
  • Junjie Cao, Weiwei Liu, Yong Liu, and Jian Yang. Generalize Robot Learning From Demonstration to Variant Scenarios With Evolutionary Policy Gradient. Frontiers in Neurorobotics, 14, 2020.
    [BibTeX] [Abstract] [DOI] [PDF]
    There has been substantial growth in research on the robot automation, which aims to make robots capable of directly interacting with the world or human. Robot learning for automation from human demonstration is central to such situation. However, the dependence of demonstration restricts robot to a fixed scenario, without the ability to explore in variant situations to accomplish the same task as in demonstration. Deep reinforcement learning methods may be a good method to make robot learning beyond human demonstration and fulfilling the task in unknown situations. The exploration is the core of such generalization to different environments. While the exploration in reinforcement learning may be ineffective and suffer from the problem of low sample efficiency. In this paper, we present Evolutionary Policy Gradient (EPG) to make robot learn from demonstration and perform goal oriented exploration efficiently. Through goal oriented exploration, our method can generalize robot learned skill to environments with different parameters. Our Evolutionary Policy Gradient combines parameter perturbation with policy gradient method in the framework of Evolutionary Algorithms (EAs) and can fuse the benefits of both, achieving effective and efficient exploration. With demonstration guiding the evolutionary process, robot can accelerate the goal oriented exploration to generalize its capability to variant scenarios. The experiments, carried out in robot control tasks in OpenAI Gym with dense and sparse rewards, show that our EPG is able to provide competitive performance over the original policy gradient methods and EAs. In the manipulator task, our robot can learn to open the door with vision in environments which are different from where the demonstrations are provided.
    @article{cao2020generalizerl,
    title = {Generalize Robot Learning From Demonstration to Variant Scenarios With Evolutionary Policy Gradient},
    author = {Junjie Cao and Weiwei Liu and Yong Liu and Jian Yang},
    year = 2020,
    journal = {Frontiers in Neurorobotics},
    volume = 14,
    doi = {10.3389/fnbot.2020.00021},
    abstract = {There has been substantial growth in research on the robot automation, which aims to make robots capable of directly interacting with the world or human. Robot learning for automation from human demonstration is central to such situation. However, the dependence of demonstration restricts robot to a fixed scenario, without the ability to explore in variant situations to accomplish the same task as in demonstration. Deep reinforcement learning methods may be a good method to make robot learning beyond human demonstration and fulfilling the task in unknown situations. The exploration is the core of such generalization to different environments. While the exploration in reinforcement learning may be ineffective and suffer from the problem of low sample efficiency. In this paper, we present Evolutionary Policy Gradient (EPG) to make robot learn from demonstration and perform goal oriented exploration efficiently. Through goal oriented exploration, our method can generalize robot learned skill to environments with different parameters. Our Evolutionary Policy Gradient combines parameter perturbation with policy gradient method in the framework of Evolutionary Algorithms (EAs) and can fuse the benefits of both, achieving effective and efficient exploration. With demonstration guiding the evolutionary process, robot can accelerate the goal oriented exploration to generalize its capability to variant scenarios. The experiments, carried out in robot control tasks in OpenAI Gym with dense and sparse rewards, show that our EPG is able to provide competitive performance over the original policy gradient methods and EAs. In the manipulator task, our robot can learn to open the door with vision in environments which are different from where the demonstrations are provided.}
    }
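    A minimal evolutionary-style sketch of the combination described in the last entry: a population of policy parameter vectors is perturbed with Gaussian noise (the evolutionary part), the fittest half is kept, and each survivor is then nudged by a gradient estimate (the policy-gradient part). The quadratic toy "fitness" stands in for episode return, the hand-written gradient stands in for a policy-gradient update, and all hyperparameters are assumptions; the demonstration-guided variant in the paper is not reproduced.
      # Toy sketch of parameter perturbation combined with gradient refinement
      # (illustrative only; the real EPG optimises episode return of a policy network).
      import numpy as np

      rng = np.random.default_rng(0)
      goal = rng.normal(size=10)                       # stand-in for "good" policy parameters

      def fitness(theta):
          return -np.sum((theta - goal) ** 2)          # placeholder for episode return

      def gradient_step(theta, lr=0.05):
          return theta - lr * 2.0 * (theta - goal)     # placeholder for a policy-gradient update

      population = [rng.normal(size=10) for _ in range(8)]
      for generation in range(50):
          candidates = population + [p + 0.1 * rng.normal(size=10) for p in population]  # perturb
          candidates.sort(key=fitness, reverse=True)                                     # select
          population = [gradient_step(p) for p in candidates[:8]]                        # refine
      print(round(float(fitness(population[0])), 4))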