Junjie Cao
PhD Student
Institute of Cyber-Systems and Control, Zhejiang University, China
Biography
I am currently working toward the Ph.D. degree at the College of Control Science and Engineering, Zhejiang University. My research interests include machine learning, sequential decision making, and robotics.
Research and Interests
- Machine Learning
- Sequential Decision Making
- Robot Control
Publications
- Yansong Chen, Yuchen Wu, Helei Yang, Junjie Cao, Qinqin Wang, and Yong Liu. A Distributed Pipeline for Collaborative Pursuit in the Target Guarding Problem. IEEE Robotics and Automation Letters (RA-L), 9:2064-2071, 2024.
[BibTeX] [Abstract] [DOI] [PDF]The target guarding problem (TGP) is a classical combat game where pursuers aim to capture evaders to protect a territory from intrusion. This paper proposes a distributed pipeline for multi-pursuer multi-evader TGP with the capability to accommodate varying numbers of evaders and criteria for successful pursuit. The pipeline integrates a cooperative encirclement-oriented distributed model predictive control (CEO-DMPC) method with a collaborative grouping strategy for trajectory planning of pursuers. This integration achieves cooperation and collision avoidance during the capture process across various scenarios. Besides, the objective function of CEO-DMPC employs sequences of predicted states instead of only a terminal state. Evaders are guided by the artificial potential field (APF) policy to reach their goals without being captured. Simulations with different parameters are conducted to validate the whole pipeline and the experiment results are illustrated and analyzed.
@article{chen2024adp, title = {A Distributed Pipeline for Collaborative Pursuit in the Target Guarding Problem}, author = {Yansong Chen and Yuchen Wu and Helei Yang and Junjie Cao and Qinqin Wang and Yong Liu}, year = 2024, journal = {IEEE Robotics and Automation Letters (RA-L)}, volume = 9, pages = {2064-2071}, doi = {10.1109/LRA.2024.3349977}, abstract = {The target guarding problem (TGP) is a classical combat game where pursuers aim to capture evaders to protect a territory from intrusion. This paper proposes a distributed pipeline for multi-pursuer multi-evader TGP with the capability to accommodate varying numbers of evaders and criteria for successful pursuit. The pipeline integrates a cooperative encirclement-oriented distributed model predictive control (CEO-DMPC) method with a collaborative grouping strategy for trajectory planning of pursuers. This integration achieves cooperation and collision avoidance during the capture process across various scenarios. Besides, the objective function of CEO-DMPC employs sequences of predicted states instead of only a terminal state. Evaders are guided by the artificial potential field (APF) policy to reach their goals without being captured. Simulations with different parameters are conducted to validate the whole pipeline and the experiment results are illustrated and analyzed.} }
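For readers unfamiliar with the artificial potential field (APF) policy that guides the evaders in this work, the snippet below is a minimal, generic APF sketch; the gains, influence radius, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def apf_evader_velocity(evader_pos, goal_pos, pursuer_positions,
                        k_att=1.0, k_rep=2.0, influence_radius=3.0, v_max=1.0):
    """Generic APF evader policy: attraction toward the goal plus repulsion
    from pursuers inside an influence radius (illustrative gains)."""
    force = k_att * (goal_pos - evader_pos)          # attractive term
    for p in pursuer_positions:
        diff = evader_pos - p
        dist = np.linalg.norm(diff)
        if 1e-6 < dist < influence_radius:           # repulsive term per nearby pursuer
            force += k_rep * (1.0 / dist - 1.0 / influence_radius) * diff / dist**3
    speed = np.linalg.norm(force)
    return force if speed <= v_max else force / speed * v_max   # saturate to max speed

# Example: one evader at (4, 2), goal at the origin, two pursuers nearby.
v = apf_evader_velocity(np.array([4.0, 2.0]), np.array([0.0, 0.0]),
                        [np.array([3.0, 3.0]), np.array([5.0, 1.0])])
```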
- Shanqi Liu, Weiwei Liu, Wenzhou Chen, Guanzhong Tian, Jun Chen, Yao Tong, Junjie Cao, and Yong Liu. Learning Multi-Agent Cooperation via Considering Actions of Teammates. IEEE Transactions on Neural Networks and Learning Systems, 35:11553-11564, 2024.
[BibTeX] [Abstract] [DOI] [PDF]Recently value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative method among these methods, Q-network MIXing (QMIX), restricts the joint action Q values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as ad hoc team play situation. In this work, we propose a novel Q values decomposition that considers both the return of an agent acting on its own and cooperating with other observable agents to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action searching method that can improve exploration and is not affected by changes in observable agents or changes in the order of agents' actions. In this way, our method can adapt to ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly.
@article{liu2024lma, title = {Learning Multi-Agent Cooperation via Considering Actions of Teammates}, author = {Shanqi Liu and Weiwei Liu and Wenzhou Chen and Guanzhong Tian and Jun Chen and Yao Tong and Junjie Cao and Yong Liu}, year = 2024, journal = {IEEE Transactions on Neural Networks and Learning Systems}, volume = 35, pages = {11553-11564}, doi = {10.1109/TNNLS.2023.3262921}, abstract = {Recently value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative method among these methods, Q-network MIXing (QMIX), restricts the joint action Q values to be a monotonic mixing of each agent's utilities. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, which is known as ad hoc team play situation. In this work, we propose a novel Q values decomposition that considers both the return of an agent acting on its own and cooperating with other observable agents to address the nonmonotonic problem. Based on the decomposition, we propose a greedy action searching method that can improve exploration and is not affected by changes in observable agents or changes in the order of agents' actions. In this way, our method can adapt to ad hoc team play situation. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Our extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play situation perfectly.} }
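The core idea of the value decomposition above, a utility for acting alone plus a term conditioned on observable teammates' actions, can be sketched in a few lines. The toy linear "networks", dimensions, and function names below are placeholders, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, N_ACTIONS = 8, 4
# Stand-in linear "utility networks"; a real implementation would learn these end to end.
W_self = rng.standard_normal((OBS_DIM, N_ACTIONS))
W_coop = rng.standard_normal((OBS_DIM + N_ACTIONS, N_ACTIONS))

def q_values(obs, teammate_actions):
    """Utility for acting alone plus one cooperation term per observable teammate action."""
    q_alone = obs @ W_self                                   # shape (N_ACTIONS,)
    q_coop = np.zeros(N_ACTIONS)
    for a_j in teammate_actions:                             # works for any number of teammates
        q_coop += np.concatenate([obs, np.eye(N_ACTIONS)[a_j]]) @ W_coop
    return q_alone + q_coop

def greedy_action(obs, teammate_actions):
    """Greedy search over the agent's own action given currently observable teammates."""
    return int(np.argmax(q_values(obs, teammate_actions)))

a = greedy_action(rng.standard_normal(OBS_DIM), teammate_actions=[2, 0])
```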
- Gang Xu, Xiao Kang, Helei Yang, Yuchen Wu, Weiwei Liu, Junjie Cao, and Yong Liu. Distributed Multi-Vehicle Task Assignment and Motion Planning in Dense Environments. IEEE Transactions on Automation Science and Engineering, 2023.
[BibTeX] [Abstract] [DOI]This article investigates the multi-vehicle task assignment and motion planning (MVTAMP) problem. In a dense environment, a fleet of non-holonomic vehicles is appointed to visit a series of target positions and then move to a specific ending area for real-world applications such as clearing threat targets, aid rescue, and package delivery. We presented a novel hierarchical method to simultaneously address the multiple vehicles’ task assignment and motion planning problem. Unlike most related work, our method considers the MVTAMP problem applied to non-holonomic vehicles in large-scale scenarios. At the high level, we proposed a novel distributed algorithm to address task assignment, which produces a closer to the optimal task assignment scheme by reducing the intersection paths between vehicles and tasks or between tasks and tasks. At the low level, we proposed a novel distributed motion planning algorithm that addresses the vehicle deadlocks in local planning and then quickly generates a feasible new velocity for the non-holonomic vehicle in dense environments, guaranteeing that each vehicle efficiently visits its assigned target positions. Extensive simulation experiments in large-scale scenarios for non-holonomic vehicles and two real-world experiments demonstrate the effectiveness and advantages of our method in practical applications. The source code of our method can be available at https://github.com/wuuya1/LRGO. Note to Practitioners-The motivation for this article stems from the need to solve the multi-vehicle task assignment and motion planning (MVTAMP) problem for non-holonomic vehicles in dense environments. Many real-world applications exist, such as clearing threat targets, aid rescue, and package delivery. However, when vehicles need to continuously visit a series of assigned targets, motion planning for non-holonomic vehicles becomes more difficult because it is more likely to occur sharp turns between adjacent target path nodes. In this case, a better task allocation scheme can often lead to more efficient target visits and save all vehicles’ total traveling distance. To bridge this, we proposed a hierarchical method for solving the MVTAMP problem in large-scale complex scenarios. The numerous large-scale simulations and two real-world experiments show the effectiveness of the proposed method. Our future work will focus on the integrated task assignment and motion planning problem for non-holonomic vehicles in highly dynamic scenarios.
@article{xu2023dmv, title = {Distributed Multi-Vehicle Task Assignment and Motion Planning in Dense Environments}, author = {Gang Xu and Xiao Kang and Helei Yang and Yuchen Wu and Weiwei Liu and Junjie Cao and Yong Liu}, year = 2023, journal = {IEEE Transactions on Automation Science and Engineering}, doi = {10.1109/TASE.2023.3336076}, abstract = {This article investigates the multi-vehicle task assignment and motion planning (MVTAMP) problem. In a dense environment, a fleet of non-holonomic vehicles is appointed to visit a series of target positions and then move to a specific ending area for real-world applications such as clearing threat targets, aid rescue, and package delivery. We presented a novel hierarchical method to simultaneously address the multiple vehicles' task assignment and motion planning problem. Unlike most related work, our method considers the MVTAMP problem applied to non-holonomic vehicles in large-scale scenarios. At the high level, we proposed a novel distributed algorithm to address task assignment, which produces a closer to the optimal task assignment scheme by reducing the intersection paths between vehicles and tasks or between tasks and tasks. At the low level, we proposed a novel distributed motion planning algorithm that addresses the vehicle deadlocks in local planning and then quickly generates a feasible new velocity for the non-holonomic vehicle in dense environments, guaranteeing that each vehicle efficiently visits its assigned target positions. Extensive simulation experiments in large-scale scenarios for non-holonomic vehicles and two real-world experiments demonstrate the effectiveness and advantages of our method in practical applications. The source code of our method can be available at https://github.com/wuuya1/LRGO. Note to Practitioners-The motivation for this article stems from the need to solve the multi-vehicle task assignment and motion planning (MVTAMP) problem for non-holonomic vehicles in dense environments. Many real-world applications exist, such as clearing threat targets, aid rescue, and package delivery. However, when vehicles need to continuously visit a series of assigned targets, motion planning for non-holonomic vehicles becomes more difficult because it is more likely to occur sharp turns between adjacent target path nodes. In this case, a better task allocation scheme can often lead to more efficient target visits and save all vehicles' total traveling distance. To bridge this, we proposed a hierarchical method for solving the MVTAMP problem in large-scale complex scenarios. The numerous large-scale simulations and two real-world experiments show the effectiveness of the proposed method. Our future work will focus on the integrated task assignment and motion planning problem for non-holonomic vehicles in highly dynamic scenarios.} }
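As a loose illustration of the assignment layer only (not the paper's intersection-reducing distributed algorithm), a plain greedy matching of vehicles to targets by distance looks like this:

```python
import numpy as np

def greedy_assignment(vehicle_positions, target_positions):
    """Toy assignment: repeatedly match the closest remaining (vehicle, target) pair."""
    vehicles = list(range(len(vehicle_positions)))
    targets = list(range(len(target_positions)))
    assignment = {}
    while vehicles and targets:
        # Distance matrix between the remaining vehicles and targets.
        d = np.array([[np.linalg.norm(vehicle_positions[v] - target_positions[t])
                       for t in targets] for v in vehicles])
        vi, ti = np.unravel_index(np.argmin(d), d.shape)
        assignment[vehicles.pop(vi)] = targets.pop(ti)
    return assignment

assign = greedy_assignment(
    [np.array([0.0, 0.0]), np.array([5.0, 5.0])],
    [np.array([1.0, 0.0]), np.array([4.0, 6.0]), np.array([9.0, 9.0])])
```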
- Helei Yang, Peng Ge, Junjie Cao, Yifan Yang, and Yong Liu. Large Scale Pursuit-Evasion Under Collision Avoidance Using Deep Reinforcement Learning. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2232-2239, 2023.
[BibTeX] [Abstract] [DOI] [PDF]This paper examines a pursuit-evasion game (PEG) involving multiple pursuers and evaders. The decentralized pursuers aim to collaborate to capture the faster evaders while avoiding collisions. The policies of all agents are learning-based and are subjected to kinematic constraints that are specific to unicycles. To address the challenge of high dimensionality encountered in large-scale scenarios, we propose a state processing method named Mix-Attention, which is based on Self-Attention. This method effectively mitigates the curse of dimensionality. The simulation results provided in this study demonstrate that the combination of Mix-Attention and Independent Proximal Policy Optimization (IPPO) surpasses alternative approaches when solving the multi-pursuer multi-evader PEG, particularly as the number of entities increases. Moreover, the trained policies showcase their ability to adapt to scenarios involving varying numbers of agents and obstacles without requiring retraining. This adaptability showcases their transferability and robustness. Finally, our proposed approach has been validated through physical experiments conducted with six robots.
@inproceedings{yang2023lsp, title = {Large Scale Pursuit-Evasion Under Collision Avoidance Using Deep Reinforcement Learning}, author = {Helei Yang and Peng Ge and Junjie Cao and Yifan Yang and Yong Liu}, year = 2023, booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, pages = {2232-2239}, doi = {10.1109/IROS55552.2023.10341975}, abstract = {This paper examines a pursuit-evasion game (PEG) involving multiple pursuers and evaders. The decentralized pursuers aim to collaborate to capture the faster evaders while avoiding collisions. The policies of all agents are learning-based and are subjected to kinematic constraints that are specific to unicycles. To address the challenge of high dimensionality encountered in large-scale scenarios, we propose a state processing method named Mix-Attention, which is based on Self-Attention. This method effectively mitigates the curse of dimensionality. The simulation results provided in this study demonstrate that the combination of Mix-Attention and Independent Proximal Policy Optimization (IPPO) surpasses alternative approaches when solving the multi-pursuer multi-evader PEG, particularly as the number of entities increases. Moreover, the trained policies showcase their ability to adapt to scenarios involving varying numbers of agents and obstacles without requiring retraining. This adaptability showcases their transferability and robustness. Finally, our proposed approach has been validated through physical experiments conducted with six robots.} }
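The way attention keeps the policy input fixed-size as the number of agents and obstacles grows can be illustrated with plain scaled dot-product attention; this is a generic sketch, not the Mix-Attention architecture itself:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(ego_feat, entity_feats, Wq, Wk, Wv):
    """Pool a variable number of entity features (other agents, obstacles)
    into a fixed-size vector for the ego agent."""
    q = ego_feat @ Wq                      # query from the ego state, shape (d,)
    K = entity_feats @ Wk                  # keys,   shape (n_entities, d)
    V = entity_feats @ Wv                  # values, shape (n_entities, d)
    weights = softmax(K @ q / np.sqrt(q.shape[0]))
    return weights @ V                     # shape (d,) regardless of n_entities

rng = np.random.default_rng(0)
d_in, d = 6, 16
Wq, Wk, Wv = (rng.standard_normal((d_in, d)) for _ in range(3))
pooled = attention_pool(rng.standard_normal(d_in), rng.standard_normal((9, d_in)), Wq, Wk, Wv)
```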
- Gang Xu, Deye Zhu, Junjie Cao, Yong Liu, and Jian Yang. Shunted Collision Avoidance for Multi-UAV Motion Planning with Posture Constraints. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3671-3678, 2023.
[BibTeX] [Abstract] [DOI] [PDF]This paper investigates the problem of fixed-wing unmanned aerial vehicles (UAVs) motion planning with posture constraints and the problem of the more general symmetrical situations where UAVs have more than one optimal solution. In this paper, the posture constraints are formulated in the 3D Dubins method, and the symmetrical situations are overcome by a more collaborative strategy called the shunted strategy. The effectiveness of the proposed method has been validated by conducting extensive simulation experiments. Meanwhile, we compared the proposed method with the other state-of-the-art methods, and the comparison results show that the proposed method advances the previous works. Finally, the practicability of the proposed algorithm was analyzed by the statistic in computational cost. The source code of our method can be available at https://github.com/wuuya1/SCA.
@inproceedings{xu2023sca, title = {Shunted Collision Avoidance for Multi-UAV Motion Planning with Posture Constraints}, author = {Gang Xu and Deye Zhu and Junjie Cao and Yong Liu and Jian Yang}, year = 2023, booktitle = {2023 IEEE International Conference on Robotics and Automation (ICRA)}, pages = {3671-3678}, doi = {10.1109/ICRA48891.2023.10160979}, abstract = {This paper investigates the problem of fixed-wing unmanned aerial vehicles (UAVs) motion planning with posture constraints and the problem of the more general symmetrical situations where UAVs have more than one optimal solution. In this paper, the posture constraints are formulated in the 3D Dubins method, and the symmetrical situations are overcome by a more collaborative strategy called the shunted strategy. The effectiveness of the proposed method has been validated by conducting extensive simulation experiments. Meanwhile, we compared the proposed method with the other state-of-the-art methods, and the comparison results show that the proposed method advances the previous works. Finally, the practicability of the proposed algorithm was analyzed by the statistic in computational cost. The source code of our method can be available at https://github.com/wuuya1/SCA.} }
- Gang Xu, Yansong Chen, Junjie Cao, Deye Zhu, Weiwei Liu, and Yong Liu. Multivehicle Motion Planning with Posture Constraints in Real World. IEEE/ASME Transactions on Mechatronics, 27(4):2125-2133, 2022.
[BibTeX] [Abstract] [DOI] [PDF]This article addresses the posture constraints problem in multivehicle motion planning for specific applications such as ground exploration tasks. Unlike most of the related work in motion planning, this article investigates more practical applications in the real world for nonholonomic unmanned ground vehicles (UGVs). In this case, a strategy of diversion is designed to optimize the smoothness of motion. Considering the problem of the posture constraints, a postured collision avoidance algorithm is proposed for the motion planning of the multiple nonholonomic UGVs. Two simulation experiments were conducted to verify the effectiveness and analyze the quantitative performance of the proposed method. Then, the practicability of the proposed algorithm was verified with an experiment in a natural environment.
@article{xu2022mmp, title = {Multivehicle Motion Planning with Posture Constraints in Real World}, author = {Gang Xu and Yansong Chen and Junjie Cao and Deye Zhu and Weiwei Liu and Yong Liu}, year = 2022, journal = {IEEE-ASME Transactions on Mechatronics}, volume = {27}, number = {4}, pages = {2125-2133}, doi = {10.1109/TMECH.2022.3173130}, abstract = {This article addresses the posture constraints problem in multivehicle motion planning for specific applications such as ground exploration tasks. Unlike most of the related work in motion planning, this article investigates more practical applications in the real world for nonholonomic unmanned ground vehicles (UGVs). In this case, a strategy of diversion is designed to optimize the smoothness of motion. Considering the problem of the posture constraints, a postured collision avoidance algorithm is proposed for the motion planning of the multiple nonholonomic UGVs. Two simulation experiments were conducted to verify the effectiveness and analyze the quantitative performance of the proposed method. Then, the practicability of the proposed algorithm was verified with an experiment in a natural environment.} }
- Junjie Cao, Yujie Wang, Yong Liu, and Xuesong Ni. Multi-Robot Learning Dynamic Obstacle Avoidance in Formation with Information-Directed Exploration. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(6):1357-1367, 2022.
[BibTeX] [Abstract] [DOI] [PDF]This paper presents an algorithm that generates distributed collision-free velocities for multi-robot while maintain formation as much as possible. The adaptive formation problem is cast as a sequential decision-making problem, which is solved using reinforcement learning that trains several distributed policies to avoid dynamic obstacles on the top of consensus velocities. We construct the policy with Bayesian Linear Regression based on a neural network (called BNL) to compute the state-action value uncertainty efficiently for sequential decision making. The information-directed sampling is applied in our BNL policy to achieve efficient exploration. By further combining the distributional reinforcement learning, we can estimate the intrinsic uncertainty of the state-action value globally and more accurately. For continuous control tasks, efficient exploration can be achieved by optimizing a policy with the sampled action value function from a BNL model. Through our experiments in some contextual Bandit and sequential decision-making tasks, we show that exploration with the BNL model has improved efficiency in both computation and training samples. By augmenting the consensus velocities with our BNL policy, experiments on Multi-Robot navigation demonstrate that adaptive formation is achieved.
@article{cao2021mrl, title = {Multi-Robot Learning Dynamic Obstacle Avoidance in Formation with Information-Directed Exploration}, author = {Junjie Cao and Yujie Wang and Yong Liu and Xuesong Ni}, year = 2022, journal = {IEEE Transactions on Emerging Topics in Computational Intelligence}, volume = {6}, number = {6}, pages = {1357-1367}, doi = {10.1109/TETCI.2021.3127925}, abstract = {This paper presents an algorithm that generates distributed collision-free velocities for multi-robot while maintain formation as much as possible. The adaptive formation problem is cast as a sequential decision-making problem, which is solved using reinforcement learning that trains several distributed policies to avoid dynamic obstacles on the top of consensus velocities. We construct the policy with Bayesian Linear Regression based on a neural network (called BNL) to compute the state-action value uncertainty efficiently for sequential decision making. The information-directed sampling is applied in our BNL policy to achieve efficient exploration. By further combining the distributional reinforcement learning, we can estimate the intrinsic uncertainty of the state-action value globally and more accurately. For continuous control tasks, efficient exploration can be achieved by optimizing a policy with the sampled action value function from a BNL model. Through our experiments in some contextual Bandit and sequential decision-making tasks, we show that exploration with the BNL model has improved efficiency in both computation and training samples. By augmenting the consensus velocities with our BNL policy, experiments on Multi-Robot navigation demonstrate that adaptive formation is achieved.} }
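A compact way to see the BNL idea is Bayesian linear regression on fixed features with posterior sampling of the value weights. The sketch below uses Thompson-style sampling as a simpler stand-in for the information-directed sampling used in the paper, and the priors, features, and dimensions are placeholder assumptions:

```python
import numpy as np

class BayesianLastLayer:
    """Bayesian linear regression on top of fixed features (standing in for a
    network's last hidden layer), maintained in precision form."""
    def __init__(self, n_features, noise_var=0.1, prior_var=1.0):
        self.precision = np.eye(n_features) / prior_var   # posterior precision matrix
        self.b = np.zeros(n_features)                     # precision-weighted mean
        self.noise_var = noise_var

    def update(self, phi, target):
        """Incorporate one (feature, return-target) pair."""
        self.precision += np.outer(phi, phi) / self.noise_var
        self.b += phi * target / self.noise_var

    def sample_weights(self, rng):
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.b
        return rng.multivariate_normal(mean, cov)

def thompson_action(layer, action_features, rng):
    """Sample one plausible Q-function from the posterior, then act greedily under it."""
    w = layer.sample_weights(rng)
    return int(np.argmax(action_features @ w))

rng = np.random.default_rng(0)
layer = BayesianLastLayer(n_features=4)
layer.update(np.array([1.0, 0.0, 0.5, -0.2]), target=1.0)
a = thompson_action(layer, rng.standard_normal((3, 4)), rng)   # 3 candidate actions
```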
- Shanqi Liu, Junjie Cao, Yujie Wang, Wenzhou Chen, and Yong Liu. Self-play reinforcement learning with comprehensive critic in computer games. Neurocomputing, 2021.
[BibTeX] [Abstract] [DOI] [PDF]Self-play reinforcement learning, where agents learn by playing with themselves, has been successfully applied in many game scenarios. However, the training procedure for self-play reinforcement learning is unstable and more sample-inefficient than (general) reinforcement learning, especially in imperfect information games. To improve the self-play training process, we incorporate a comprehensive critic into the policy gradient method to form a self-play actor-critic (SPAC) method for training agents to play computer games. We evaluate our method in four different environments in both competitive and cooperative tasks. The results show that the agent trained with our SPAC method outperforms those trained with deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO) algorithms in many different evaluation approaches, which vindicate the effect of our comprehensive critic in the self-play training procedure.
@article{liu2021spr, title = {Self-play reinforcement learning with comprehensive critic in computer games}, author = {Shanqi Liu and Junjie Cao and Yujie Wang and Wenzhou Chen and Yong Liu}, year = 2021, journal = {Neurocomputing}, doi = {10.1016/j.neucom.2021.04.006}, abstract = {Self-play reinforcement learning, where agents learn by playing with themselves, has been successfully applied in many game scenarios. However, the training procedure for self-play reinforcement learning is unstable and more sample-inefficient than (general) reinforcement learning, especially in imperfect information games. To improve the self-play training process, we incorporate a comprehensive critic into the policy gradient method to form a self-play actor-critic (SPAC) method for training agents to play computer games. We evaluate our method in four different environments in both competitive and cooperative tasks. The results show that the agent trained with our SPAC method outperforms those trained with deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO) algorithms in many different evaluation approaches, which vindicate the effect of our comprehensive critic in the self-play training procedure.} }
- Weiwei Liu, Shanqi Liu, Junjie Cao, Qi Wang, Xiaolei Lang, and Yong Liu. Learning Communication for Cooperation in Dynamic Agent-Number Environment. IEEE/ASME Transactions on Mechatronics, 2021.
[BibTeX] [Abstract] [DOI] [PDF]The number of agents in many multi-agent systems in the real world changes all the time, such as storage robots and drone cluster systems. Still, most current multi-agent reinforcement learning algorithms are limited to fixed network dimensions, and prior knowledge is used to preset the number of agents in the training phase, which leads to a poor generalization of the algorithm. In addition, these algorithms use centralized training to solve the instability problem of multi-agent systems. However, the centralized learning of large-scale multi-agent reinforcement learning algorithms will lead to an explosion of network dimensions, which in turn leads to very limited scalability of centralized learning algorithms. To solve these two difficulties, we propose Group Centralized Training and Decentralized Execution-Unlimited Dynamic Agent-number Network (GCTDE-UDAN). Firstly, since we use the attention mechanism to select several leaders and establish a dynamic number of teams, and UDAN performs a non-linear combination of all agents’ Q values when performing value decomposition, it is not affected by changes in the number of agents. Moreover, our algorithm can unite any agent to form a group and conduct centralized training within the group, avoiding network dimension explosion caused by global centralized training of large-scale agents. Finally, we verified on the simulation and experimental platform that the algorithm can learn and perform cooperative behaviors in many dynamic multi-agent environments.
@article{liu2021lcf, title = {Learning Communication for Cooperation in Dynamic Agent-Number Environment}, author = {Weiwei Liu and Shanqi Liu and Junjie Cao and Qi Wang and Xiaolei Lang and Yong Liu}, year = 2021, journal = {IEEE/ASME Transactions on Mechatronics}, doi = {10.1109/TMECH.2021.3076080}, abstract = {The number of agents in many multi-agent systems in the real world changes all the time, such as storage robots and drone cluster systems. Still, most current multi-agent reinforcement learning algorithms are limited to fixed network dimensions, and prior knowledge is used to preset the number of agents in the training phase, which leads to a poor generalization of the algorithm. In addition, these algorithms use centralized training to solve the instability problem of multi-agent systems. However, the centralized learning of large-scale multi-agent reinforcement learning algorithms will lead to an explosion of network dimensions, which in turn leads to very limited scalability of centralized learning algorithms. To solve these two difficulties, we propose Group Centralized Training and Decentralized Execution-Unlimited Dynamic Agent-number Network (GCTDE-UDAN). Firstly, since we use the attention mechanism to select several leaders and establish a dynamic number of teams, and UDAN performs a non-linear combination of all agents' Q values when performing value decomposition, it is not affected by changes in the number of agents. Moreover, our algorithm can unite any agent to form a group and conduct centralized training within the group, avoiding network dimension explosion caused by global centralized training of large-scale agents. Finally, we verified on the simulation and experimental platform that the algorithm can learn and perform cooperative behaviors in many dynamic multi-agent environments.} }
- Yunliang Jiang, Kang Zhao, Junjie Cao, Jing Fan, and Yong Liu. Asynchronous parallel hyperparameter search with population evolution. Control and Decision, 36:1825–1833, 2021.
[BibTeX] [Abstract] [DOI] [PDF]In recent years, with the continuous increase of deep learning models, especially deep reinforcement learning models, the training cost, that is, the search space of hyperparameters, has also continuously increased. However, most traditional hyperparameter search algorithms are based on sequential execution of training, which often takes weeks or even months to find a better hyperparameter configuration. In order to solve the problem of the long search time hyperparameters and the difficulty in finding a better hyperparameter of deep reinforcement learning configuration, this paper proposes a new hyper-parameter search algorithm, named asynchronous parallel hyperparameter search with population evolution. This algorithm combines the idea of evolutionary algorithms and uses a fixed resource budget to search the population model and its hyperparameters asynchronously and in parallel, thereby improving the performance of the algorithm. It is realized that a parameter search algorithm can run on the Ray parallel distributed framework. Experiments show that the parametric asynchronous parallel search based on population evolution on the parallel framework is better than the traditional hyperparameter search algorithm, and its performance is stable.
@article{fan2021aph, title = {Asynchronous parallel hyperparameter search with population evolution}, author = {Yunliang Jiang and Kang Zhao and Junjie Cao and Jing Fan and Yong Liu}, year = 2021, journal = {Control and Decision}, volume = 36, pages = {1825--1833}, doi = {10.13195/j.kzyjc.2019.1743}, issue = 8, abstract = {In recent years, with the continuous increase of deep learning models, especially deep reinforcement learning models, the training cost, that is, the search space of hyperparameters, has also continuously increased. However, most traditional hyperparameter search algorithms are based on sequential execution of training, which often takes weeks or even months to find a better hyperparameter configuration. In order to solve the problem of the long search time hyperparameters and the difficulty in finding a better hyperparameter of deep reinforcement learning configuration, this paper proposes a new hyper-parameter search algorithm, named asynchronous parallel hyperparameter search with population evolution. This algorithm combines the idea of evolutionary algorithms and uses a fixed resource budget to search the population model and its hyperparameters asynchronously and in parallel, thereby improving the performance of the algorithm. It is realized that a parameter search algorithm can run on the Ray parallel distributed framework. Experiments show that the parametric asynchronous parallel search based on population evolution on the parallel framework is better than the traditional hyperparameter search algorithm, and its performance is stable.} }
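The exploit-and-explore step of population-based search can be sketched as follows. This is a schematic, synchronous toy, whereas the paper runs the search asynchronously on the Ray framework; the hyperparameter names and perturbation rule are illustrative:

```python
import random

def exploit_and_explore(population, perturb=0.2):
    """One population-evolution step: each poorly performing member copies the
    weights and hyperparameters of a stronger one, then perturbs the hyperparameters."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, len(ranked) // 4)
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        donor = random.choice(top)
        member["hparams"] = {k: v * random.choice([1 - perturb, 1 + perturb])
                             for k, v in donor["hparams"].items()}
        member["weights"] = donor["weights"]          # inherit model parameters
    return population

population = [{"score": random.random(),
               "hparams": {"lr": 3e-4, "entropy_coef": 0.01},
               "weights": None} for _ in range(8)]
population = exploit_and_explore(population)
```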
- Weiwei Liu, Linpeng Peng, Junjie Cao, Xiaokuan Fu, Yong Liu, and Zaisheng Pan. Ensemble Bootstrapped Deep Deterministic Policy Gradient for Vision-Based Robotic Grasping. IEEE Access, 9:19916–19925, 2021.
[BibTeX] [Abstract] [DOI] [PDF]With sufficient practice, humans can grab objects they have never seen before through brain decision-making. However, the manipulators, which has a wide range of applications in industrial production, can still only grab specific objects. Because most of the grasp algorithms rely on prior knowledge such as hand-eye calibration results, object model features, and can only target specific types of objects. When the task scenario and the operation target change, it cannot perform effective redeployment. In order to solve the above problems, academia often uses reinforcement learning to train grasping algorithms. However, the method of reinforcement learning in the field of manipulators grasping mainly encounters these main problems: insufficient sample utilization, poor algorithm stability, and limited exploration. This article uses LfD, BC, and DDPG to improve sample utilization. Use multiple critics to integrate and evaluate input actions to solve the problem of algorithm instability. Finally, inspired by Thompson’s sampling idea, the input action is evaluated from different angles, which increases the algorithm’s exploration of the environment and reduces the number of interactions with the environment. EDDPG and EBDDPG algorithm is designed in the article. In order to further improve the generalization ability of the algorithm, this article does not use extra information that is difficult to obtain directly on the physical platform, such as the real coordinates of the target object and the continuous motion space at the end of the manipulator in the Cartesian coordinate system is used as the output of the decision. The simulation results show that, under the same number of interactions, the manipulators’ success rate in grabbing 1000 random objects has increased more than double and reached state-of-the-art(SOTA) performance.
@article{liu2021ensemblebd, title = {Ensemble Bootstrapped Deep Deterministic Policy Gradient for Vision-Based Robotic Grasping}, author = {Weiwei Liu and Linpeng Peng and Junjie Cao and Xiaokuan Fu and Yong Liu and Zaisheng Pan}, year = 2021, journal = {IEEE Access}, volume = 9, pages = {19916--19925}, doi = {10.1109/ACCESS.2021.3049860}, abstract = {With sufficient practice, humans can grab objects they have never seen before through brain decision-making. However, the manipulators, which has a wide range of applications in industrial production, can still only grab specific objects. Because most of the grasp algorithms rely on prior knowledge such as hand-eye calibration results, object model features, and can only target specific types of objects. When the task scenario and the operation target change, it cannot perform effective redeployment. In order to solve the above problems, academia often uses reinforcement learning to train grasping algorithms. However, the method of reinforcement learning in the field of manipulators grasping mainly encounters these main problems: insufficient sample utilization, poor algorithm stability, and limited exploration. This article uses LfD, BC, and DDPG to improve sample utilization. Use multiple critics to integrate and evaluate input actions to solve the problem of algorithm instability. Finally, inspired by Thompson's sampling idea, the input action is evaluated from different angles, which increases the algorithm's exploration of the environment and reduces the number of interactions with the environment. EDDPG and EBDDPG algorithm is designed in the article. In order to further improve the generalization ability of the algorithm, this article does not use extra information that is difficult to obtain directly on the physical platform, such as the real coordinates of the target object and the continuous motion space at the end of the manipulator in the Cartesian coordinate system is used as the output of the decision. The simulation results show that, under the same number of interactions, the manipulators' success rate in grabbing 1000 random objects has increased more than double and reached state-of-the-art(SOTA) performance.} }
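To illustrate how an ensemble of bootstrapped critics can evaluate an input action from several angles, here is a minimal sketch with linear stand-in critics; the aggregation rule and bootstrap-mask scheme are plausible choices for illustration, not necessarily the exact EBDDPG design:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, N_CRITICS = 10, 4, 5
# Stand-in linear critics; a real implementation would use bootstrapped Q-networks.
critics = [rng.standard_normal(STATE_DIM + ACTION_DIM) for _ in range(N_CRITICS)]

def ensemble_q(state, action):
    """Evaluate one action with every critic in the ensemble."""
    x = np.concatenate([state, action])
    return np.array([w @ x for w in critics])

def conservative_value(state, action, beta=0.5):
    """Aggregate the ensemble: mean estimate minus a disagreement penalty."""
    q = ensemble_q(state, action)
    return q.mean() - beta * q.std()

def bootstrap_masks(batch_size, p=0.8):
    """Per-critic Bernoulli masks so each critic trains on a different data subset."""
    return rng.binomial(1, p, size=(N_CRITICS, batch_size))

v = conservative_value(rng.standard_normal(STATE_DIM), rng.standard_normal(ACTION_DIM))
masks = bootstrap_masks(batch_size=32)
```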
- Shanqi Liu, Licheng Wen, Jinhao Cui, Xuemeng Yang, Junjie Cao, and Yong Liu. Moving Forward in Formation: A Decentralized Hierarchical Learning Approach to Multi-Agent Moving Together. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4777-4784, 2021.
[BibTeX] [Abstract] [DOI] [PDF]Multi-agent path finding in formation has many potential real-world applications like mobile warehouse robotics. However, previous multi-agent path finding (MAPF) methods hardly take formation into consideration. Furthermore, they are usually centralized planners and require the whole state of the environment. Other decentralized partially observable approaches to MAPF are reinforcement learning (RL) methods. However, these RL methods encounter difficulties when learning path finding and formation problems at the same time. In this paper, we propose a novel decentralized partially observable RL algorithm that uses a hierarchical structure to decompose the multi-objective task into unrelated ones. It also calculates a theoretical weight that makes each task's reward has equal influence on the final RL value function. Additionally, we introduce a communication method that helps agents cooperate with each other. Experiments in simulation show that our method outperforms other end-to-end RL methods and our method can naturally scale to large world sizes where centralized planner struggles. We also deploy and validate our method in a real-world scenario.
@inproceedings{liu2021movingfi, title = {Moving Forward in Formation: A Decentralized Hierarchical Learning Approach to Multi-Agent Moving Together}, author = {Shanqi Liu and Licheng Wen and Jinhao Cui and Xuemeng Yang and Junjie Cao and Yong Liu}, year = 2021, booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems}, pages = {4777-4784}, doi = {https://doi.org/10.1109/IROS51168.2021.9636224}, abstract = {Multi-agent path finding in formation has many potential real-world applications like mobile warehouse robotics. However, previous multi-agent path finding (MAPF) methods hardly take formation into consideration. Furthermore, they are usually centralized planners and require the whole state of the environment. Other decentralized partially observable approaches to MAPF are reinforcement learning (RL) methods. However, these RL methods encounter difficulties when learning path finding and formation problems at the same time. In this paper, we propose a novel decentralized partially observable RL algorithm that uses a hierarchical structure to decompose the multi-objective task into unrelated ones. It also calculates a theoretical weight that makes each task's reward has equal influence on the final RL value function. Additionally, we introduce a communication method that helps agents cooperate with each other. Experiments in simulation show that our method outperforms other end-to-end RL methods and our method can naturally scale to large world sizes where centralized planner struggles. We also deploy and validate our method in a real-world scenario.} }
- Shanqi Liu, Junjie Cao, Wenzhou Chen, Licheng Wen, and Yong Liu. HILONet: Hierarchical Imitation Learning from Non-Aligned Observations. In 2021 IEEE 10th Data Driven Control and Learning Systems Conference, 2021.
[BibTeX] [Abstract] [DOI] [PDF]It is challenging learning from demonstrated observation-only trajectories in a non-time-aligned environment because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation(HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Our method can solve all kinds of tasks by achieving these sub-goals, whether it has a single goal position or not. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency.
@inproceedings{liu2021hilonethi, title = {HILONet: Hierarchical Imitation Learning from Non-Aligned Observations}, author = {Shanqi Liu and Junjie Cao and Wenzhou Chen and Licheng Wen and Yong Liu}, year = 2021, booktitle = {2021 IEEE 10th Data Driven Control and Learning Systems Conference}, doi = {https://doi.org/10.48550/arXiv.2011.02671}, abstract = {It is challenging learning from demonstrated observation-only trajectories in a non-time-aligned environment because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation(HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Our method can solve all kinds of tasks by achieving these sub-goals, whether it has a single goal position or not. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency.} }
- Junjie Cao, Weiwei Liu, Yong Liu, and Jian Yang. Generalize Robot Learning From Demonstration to Variant Scenarios With Evolutionary Policy Gradient. Frontiers in Neurorobotics, 14, 2020.
[BibTeX] [Abstract] [DOI] [PDF]There has been substantial growth in research on the robot automation, which aims to make robots capable of directly interacting with the world or human. Robot learning for automation from human demonstration is central to such situation. However, the dependence of demonstration restricts robot to a fixed scenario, without the ability to explore in variant situations to accomplish the same task as in demonstration. Deep reinforcement learning methods may be a good method to make robot learning beyond human demonstration and fulfilling the task in unknown situations. The exploration is the core of such generalization to different environments. While the exploration in reinforcement learning may be ineffective and suffer from the problem of low sample efficiency. In this paper, we present Evolutionary Policy Gradient (EPG) to make robot learn from demonstration and perform goal oriented exploration efficiently. Through goal oriented exploration, our method can generalize robot learned skill to environments with different parameters. Our Evolutionary Policy Gradient combines parameter perturbation with policy gradient method in the framework of Evolutionary Algorithms (EAs) and can fuse the benefits of both, achieving effective and efficient exploration. With demonstration guiding the evolutionary process, robot can accelerate the goal oriented exploration to generalize its capability to variant scenarios. The experiments, carried out in robot control tasks in OpenAI Gym with dense and sparse rewards, show that our EPG is able to provide competitive performance over the original policy gradient methods and EAs. In the manipulator task, our robot can learn to open the door with vision in environments which are different from where the demonstrations are provided.
@article{cao2020generalizerl, title = {Generalize Robot Learning From Demonstration to Variant Scenarios With Evolutionary Policy Gradient}, author = {Junjie Cao and Weiwei Liu and Yong Liu and Jian Yang}, year = 2020, journal = {Frontiers in Neurorobotics}, volume = 14, doi = { https://doi.org/10.3389/fnbot.2020.00021}, abstract = {There has been substantial growth in research on the robot automation, which aims to make robots capable of directly interacting with the world or human. Robot learning for automation from human demonstration is central to such situation. However, the dependence of demonstration restricts robot to a fixed scenario, without the ability to explore in variant situations to accomplish the same task as in demonstration. Deep reinforcement learning methods may be a good method to make robot learning beyond human demonstration and fulfilling the task in unknown situations. The exploration is the core of such generalization to different environments. While the exploration in reinforcement learning may be ineffective and suffer from the problem of low sample efficiency. In this paper, we present Evolutionary Policy Gradient (EPG) to make robot learn from demonstration and perform goal oriented exploration efficiently. Through goal oriented exploration, our method can generalize robot learned skill to environments with different parameters. Our Evolutionary Policy Gradient combines parameter perturbation with policy gradient method in the framework of Evolutionary Algorithms (EAs) and can fuse the benefits of both, achieving effective and efficient exploration. With demonstration guiding the evolutionary process, robot can accelerate the goal oriented exploration to generalize its capability to variant scenarios. The experiments, carried out in robot control tasks in OpenAI Gym with dense and sparse rewards, show that our EPG is able to provide competitive performance over the original policy gradient methods and EAs. In the manipulator task, our robot can learn to open the door with vision in environments which are different from where the demonstrations are provided.} }
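The combination of parameter perturbation and a policy-gradient step can be sketched as a simple loop. The selection rule, perturbation scale, and the user-supplied evaluate and gradient_step callables below are illustrative assumptions rather than the paper's exact EPG update:

```python
import numpy as np

def evolve(theta, evaluate, gradient_step, pop_size=8, sigma=0.1, generations=10, seed=0):
    """Schematic EA + policy-gradient loop: perturb policy parameters to form a
    population, evaluate each member, keep the best, then refine it with a
    gradient-based step supplied by the caller."""
    rng = np.random.default_rng(seed)
    for _ in range(generations):
        population = [theta + sigma * rng.standard_normal(theta.shape) for _ in range(pop_size)]
        returns = [evaluate(p) for p in population]
        theta = population[int(np.argmax(returns))]      # selection
        theta = gradient_step(theta)                      # gradient-based refinement
    return theta

# Toy usage: maximize -||theta||^2 with a hand-written gradient step.
theta0 = np.ones(5)
best = evolve(theta0,
              evaluate=lambda p: -np.sum(p ** 2),
              gradient_step=lambda p: p - 0.05 * 2 * p)
```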
- Junjie Cao, Yong Liu, Jian Yang, and Zaisheng Pan. Model-Based Robot Learning Control with Uncertainty Directed Exploration. In 2020 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), pages 2004-2010, 2020.
[BibTeX] [Abstract] [DOI] [PDF]The Robot with nonlinear and stochastic dynamic challenges optimal control that relying on an analytical model. Model-free reinforcement learning algorithms have shown their potential in robot learning control without an analytical or statistical dynamic model. However, requiring numerous samples hinders its application. Model-based reinforcement learning that combines dynamic model learning with model predictive control provides promising methods to control the robot with complex dynamics. Robot exploration generates diverse data for dynamic model learning. Model predictive control exploits the approximated model to select an optimal action. There is a dilemma between exploration and exploitation. Uncertainty provides a direction for robot exploring, resulting in better exploration and exploitation trade-off. In this paper, we propose Model Predictive Control with Posterior Sampling (PSMPC) to make the robot learn to control efficiently. Our PSMPC does approximate sampling from the posterior of the dynamic model and applies model predictive control to achieve uncertainty directed exploration. In order to reduce the computational complexity of the resulting controller, we also propose a PSMPC guided policy optimization algorithm. The results of simulation in the high fidelity simulator “MuJoCo” show the effectiveness of our proposed robot learning control scheme.
@inproceedings{cao2020modelbasedrl, title = {Model-Based Robot Learning Control with Uncertainty Directed Exploration}, author = {Junjie Cao and Yong Liu and Jian Yang and Zaisheng Pan}, year = 2020, booktitle = {2020 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM)}, pages = {2004--2010}, doi = {https://doi.org/10.1109/aim43001.2020.9158962}, abstract = {The Robot with nonlinear and stochastic dynamic challenges optimal control that relying on an analytical model. Model-free reinforcement learning algorithms have shown their potential in robot learning control without an analytical or statistical dynamic model. However, requiring numerous samples hinders its application. Model-based reinforcement learning that combines dynamic model learning with model predictive control provides promising methods to control the robot with complex dynamics. Robot exploration generates diverse data for dynamic model learning. Model predictive control exploits the approximated model to select an optimal action. There is a dilemma between exploration and exploitation. Uncertainty provides a direction for robot exploring, resulting in better exploration and exploitation trade-off. In this paper, we propose Model Predictive Control with Posterior Sampling (PSMPC) to make the robot learn to control efficiently. Our PSMPC does approximate sampling from the posterior of the dynamic model and applies model predictive control to achieve uncertainty directed exploration. In order to reduce the computational complexity of the resulting controller, we also propose a PSMPC guided policy optimization algorithm. The results of simulation in the high fidelity simulator “MuJoCo” show the effectiveness of our proposed robot learning control scheme.} }
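A minimal sketch of the posterior-sampling MPC idea, drawing one dynamics model from the posterior and planning over it with random shooting, is shown below. The sample_model and reward_fn callables and the shooting planner are illustrative assumptions, not the paper's PSMPC implementation or its guided policy optimization:

```python
import numpy as np

def psmpc_action(state, sample_model, reward_fn, action_dim,
                 horizon=10, n_candidates=200, seed=0):
    """One planning step: sample a dynamics model from the posterior, then return
    the first action of the best random-shooting action sequence under that model."""
    rng = np.random.default_rng(seed)
    dynamics = sample_model(rng)                     # one model ~ posterior
    sequences = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    best_return, best_first_action = -np.inf, None
    for actions in sequences:
        s, total = state, 0.0
        for a in actions:                            # roll out under the sampled model
            s = dynamics(s, a)
            total += reward_fn(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

# Toy usage: a fixed linear "sampled" model and a quadratic cost on the state.
a0 = psmpc_action(np.zeros(3),
                  sample_model=lambda rng: (lambda s, a: 0.9 * s + 0.1 * a),
                  reward_fn=lambda s, a: -float(s @ s),
                  action_dim=3)
```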