未知环境下工业机械臂自主路径规划
Alternative Title: Autonomous Path Planning of Industrial Robot Arm in Unknown Environment
李振
Department: 工业控制网络与系统研究室
Thesis Advisor: 刘意杨
Keywords: path planning; DDPG; estimated reward; trust degree; weighted action
Pages: 74
Degree Discipline: Control Engineering
Degree Name: Professional Master's Degree
Date: 2021-05-21
Degree Grantor: 中国科学院沈阳自动化研究所 (Shenyang Institute of Automation, Chinese Academy of Sciences)
Place of Conferral: Shenyang (沈阳)
Abstract: The concept of Industry 4.0, formally proposed at the Hannover Messe in 2011, signals the gradual evolution of factories toward intelligent operation. To adapt to this change, industrial sites increasingly use robotic arms in place of human workers. Today's robotic arms, however, are still programmed mainly by manual teaching, which cannot cope with the complex, changeable environments of smart factories or the trend toward high levels of automation. Software-based path-planning methods have emerged in response, but traditional software still demands extensive manual preparation and cannot support autonomous path planning in arbitrary unknown environments. Reinforcement learning, by contrast, learns autonomously. This thesis applies a reinforcement learning algorithm in the CoppeliaSim Edu simulator to accomplish autonomous path planning for a seven-degree-of-freedom manipulator.

Because the manipulator's joint states are continuous, the agent's policy is trained with DDPG (Deep Deterministic Policy Gradient). Experiments in simple, medium, and difficult scenarios show that the algorithm can complete the path-planning task, but learns slowly. Three causes are identified: (1) the exploration space is huge, so the agent easily wastes effort exploring in wrong directions; (2) positive rewards are rare, and this sparse-reward setting slows learning; (3) the fitted values produced during neural-network value approximation have excessive variance.

The thesis proposes a remedy for each problem. First, the learning stability of the agent's policy network is used to judge whether the Q value at a given state has converged, and a trust value is iteratively adjusted according to that convergence. The trust value gives the agent two properties: (1) for states whose future return is positive, it prefers actions with more stable returns; (2) for states whose future return is negative, it abandons stable-return actions. Second, to counter sparse rewards, the thesis adapts the best-particle concept of particle swarm optimization into a best-state concept, so the agent collects an estimated reward at every state rather than a positive reward only upon task completion. Third, to prevent high value-estimate variance from producing extreme actions, a weighted-action method outputs the weighted combination of the actions obtained from several perturbed states. Experiments confirm that all three methods effectively improve the agent's performance, shortening training time and raising training efficiency.
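To make the three speed-ups concrete, the sketch below illustrates each of them in isolation with plain NumPy. It is a minimal illustration under stated assumptions, not the thesis's implementation: the actor stub, the function names, the shapes, and the constants are all hypothetical.

    import numpy as np

    # Minimal sketch of the thesis's three speed-ups in isolation.
    # The actor below is a stand-in (the thesis trains a neural network);
    # all names, shapes, and constants here are assumptions.

    def actor(state):
        """Placeholder policy: maps a joint-state vector to 7 joint commands."""
        return np.tanh(state[:7])

    def weighted_action(state, n_perturb=5, sigma=0.01, rng=None):
        """Weighted action: average the actor's output over several slightly
        perturbed copies of the state, damping the extreme actions that
        high-variance value fits can produce."""
        rng = np.random.default_rng() if rng is None else rng
        actions = np.stack([actor(state + rng.normal(0.0, sigma, state.shape))
                            for _ in range(n_perturb)])
        return actions.mean(axis=0)

    def estimated_reward(state, best_state, scale=1.0):
        """Estimated reward, borrowed from PSO's best particle: every state
        earns a dense shaping bonus based on its distance to the best state
        seen so far, instead of a positive reward only at task completion."""
        return -scale * float(np.linalg.norm(state - best_state))

    def trust(q_history, window=50):
        """Trust value: low variance among recent Q estimates suggests the
        value at this state has converged, so its action preferences can be
        weighted more heavily."""
        recent = np.asarray(q_history[-window:], dtype=float)
        return 1.0 / (1.0 + recent.var())

    # One decision step in a 7-DOF joint space.
    state = np.zeros(7)
    a = weighted_action(state)                          # smoothed joint command
    r = estimated_reward(state, best_state=np.ones(7))  # dense shaping term
    t = trust([1.0, 1.01, 0.99, 1.0])                   # near 1 => converged

In a full DDPG loop, weighted_action would replace the raw actor call at decision time, estimated_reward would be added to the environment reward, and trust would scale how strongly a state's Q estimate influences action selection.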
Language: Chinese
Contribution Rank: 1
Document Type: Thesis
Identifier: http://ir.sia.cn/handle/173321/28976
Collection: 工业控制网络与系统研究室
Affiliation: 中国科学院沈阳自动化研究所
Recommended Citation
GB/T 7714
李振. 未知环境下工业机械臂自主路径规划[D]. 沈阳: 中国科学院沈阳自动化研究所, 2021.
Files in This Item:
File: 未知环境下工业机械臂自主路径规划.pdf (3251 KB) | DocType: Thesis | Access: Open Access (CC BY-NC-SA)