Research on Autonomous Robot Skill Acquisition Methods Based on Imitation Learning and Reinforcement Learning (基于模仿学习和强化学习的机器人技能自动获取方法研究)
Alternative Title: Imitation Learning and Reinforcement Learning for Autonomous Robot Skills Acquisition
张会文 (Zhang Huiwen)
Department: Space Automation Technology Research Laboratory (空间自动化技术研究室)
Keywords: Skill Learning; Imitation Learning; Reinforcement Learning; Inverse Reinforcement Learning; Policy Optimization
Pages: 126
Degree Discipline: Mechatronic Engineering
Degree Name: Doctorate (博士)
2019-12-07
Degree Grantor: Shenyang Institute of Automation, Chinese Academy of Sciences
Place of Conferral: Shenyang
Abstract: With the rapid development of robotics, the tasks that robots face have become increasingly complex, and working environments exhibit unstructured characteristics. More critically, for complex tasks involving contact, friction, and flexibility, the classical programming paradigm of modeling, sensing, and control is no longer adequate. In contrast, learning-based methods have attracted great interest from researchers because of their programming flexibility and generalizability. Against this background, this dissertation studies robot skill learning methods based on imitation learning and reinforcement learning. According to the complexity of the skill and the modality of the observations, the work focuses on three levels of research.

The first is imitation learning of individual skills, whose core problem is how to represent and generalize demonstration data. Although this area has been widely studied, modeling methods that balance efficiency and accuracy for strongly nonlinear demonstrated motions are still lacking. Second, traditional skill learning usually requires dedicated sensors to collect demonstration data, and calibrating and installing such sensors is inconvenient; recent research therefore focuses on learning skills directly from raw visual observations, known as visuomotor skill learning. This dissertation studies the state representation and policy optimization problems in visuomotor skill learning. Finally, learning complex skills with a hierarchical structure or multiple subtasks requires algorithms that decompose tasks automatically. Two main families of solutions currently exist: heuristic methods and nonparametric Bayesian methods. Heuristic methods depend too heavily on prior knowledge, and Bayesian methods are computationally expensive, so developing algorithms that can segment tasks online is of great practical significance. Around these three problems, the dissertation systematically studies skill learning from low-level individual skills to high-level complex skills, and from low-dimensional state observations to visuomotor skills.

For individual-skill imitation learning, the dissertation proposes a curvilinear Gaussian model to encode demonstration data; by applying a nonlinear transformation to the traditional Gaussian, the model encodes nonlinear data better. To infer the parameters of the curvilinear Gaussian model, a cross-entropy optimization algorithm is proposed that automatically infers the number of components in the mixture model. On this basis, a curvilinear Gaussian mixture model and a curvilinear Gaussian mixture regression algorithm are further developed. The effectiveness of the proposed algorithms is verified on a handwritten-letter task, a handwriting-motion task, and a hammering task.

The core problems of visuomotor skill learning are state representation learning and policy optimization. Considering the characteristics of control tasks, the dissertation proposes a keypoint-centric approach to state representation learning and develops different feature learning methods for different tasks. For tasks with weak environmental interaction, such as walking, the keypoints are the agent's own states, for example joint positions and velocities; in this case a human mesh recovery algorithm is used to extract keypoint features. For tasks with strong environmental interaction, the keypoints are the states of the objects in the environment and of the manipulator; in this case a representation learning method based on a bottleneck network is used. For policy learning, to avoid explicitly defining a reward function, the idea of adversarial imitation learning is used to learn the reward function, and proximal policy optimization is used to solve for the parameters of the policy network. The overall framework achieves learning of walking skills from high-dimensional visual observations.

The difficulty of complex skill learning lies in automatic task decomposition. When demonstration data are available, task decomposition is equivalent to trajectory segmentation. The dissertation proposes a Bayesian task segmentation and learning framework that assumes the underlying subtask models follow Gaussian distributions, models segmentation points with posterior probabilities, and determines them by inferring the probability that a segmentation occurs at each time step. The method requires no prior knowledge and can run online. To identify repeated subtasks, a clustering algorithm using KL divergence as the distance metric is proposed, achieving high-accuracy task segmentation and recognition. When no demonstrations are available, the task is modeled with the option framework from hierarchical reinforcement learning, and a complex skill learning algorithm combining proto-value functions with option-critic theory is proposed. Finally, the effectiveness of the overall framework is verified on an "open-and-place" task.
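The single-skill pipeline in the abstract builds on Gaussian mixture encoding of demonstrations. The sketch below is a minimal illustration of that baseline only, not of the curvilinear Gaussian model itself: it fits a standard Gaussian mixture over (time, position) pairs with scikit-learn and reproduces the motion by Gaussian mixture regression. The synthetic trajectory and the fixed component count are assumptions made for illustration; the dissertation instead infers the number of components automatically via cross-entropy optimization.

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a demonstrated trajectory: a nonlinear 1-D motion x(t).
t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * t) + 0.05 * np.random.randn(t.size)
data = np.column_stack([t, x])                      # joint samples (t, x)

# Encode the demonstration with a standard GMM over (t, x).
gmm = GaussianMixture(n_components=5, covariance_type="full").fit(data)

def gmr(t_query):
    """Reproduce the motion: condition each Gaussian on time and blend."""
    mu_t, mu_x = gmm.means_[:, 0], gmm.means_[:, 1]
    s_tt = gmm.covariances_[:, 0, 0]
    s_xt = gmm.covariances_[:, 1, 0]
    # Responsibility of each component for the query time (constants cancel).
    resp = gmm.weights_ * np.exp(-0.5 * (t_query - mu_t) ** 2 / s_tt) / np.sqrt(s_tt)
    resp /= resp.sum()
    # Per-component conditional mean of x given t, blended by responsibility.
    return float(resp @ (mu_x + s_xt / s_tt * (t_query - mu_t)))

reproduced = np.array([gmr(tq) for tq in t])        # generalized trajectory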
Other Abstract: With the rapid development of robot techniques, the tasks that robots face are becoming more and more complex, and the working environment presents certain unstructured characteristics. More critically, programming robots with the classical modeling, sensing, and control paradigm is difficult for tasks involving contact, friction, and flexibility. In contrast, learning-based approaches have attracted great interest from researchers because of their flexible programming and generalizability. In this context, this dissertation conducts research on robot skill learning based on imitation learning and reinforcement learning. According to the complexity of the skills to be learned and the modality of the observational information, the dissertation focuses on three levels of research.

The first is imitation learning of individual skills, whose core issue is how to represent and generalize demonstration data. There are many studies in this area, but there is still a lack of methods that offer both computational efficiency and encoding precision for strongly nonlinear demonstrated motions. Secondly, traditional skill learning often requires specific sensors to collect demonstration data, and the calibration and installation of these sensors are inconvenient. Therefore, recent research has focused on learning skills directly from raw visual observations, which is called visuomotor skill learning. This dissertation studies the state representation learning and policy learning problems of visuomotor skill learning. Finally, in order to learn complex skills with a hierarchical structure or multiple subtasks, algorithms for autonomous task decomposition are needed. For this problem there are currently two main types of solutions: heuristic methods and nonparametric Bayesian methods. Heuristic methods rely heavily on prior knowledge, and Bayesian methods have high computational complexity, so it is of great practical significance to develop methods that can segment tasks online. Around these three issues, the dissertation systematically carries out research from low-level individual skill learning to complex hierarchical skill learning, and from low-dimensional state observations to visuomotor skill learning.

For the individual-skill imitation learning problem, this dissertation proposes a curvilinear Gaussian model to encode the demonstration data. The model fits nonlinear data better by applying a nonlinear transformation to the traditional Gaussian. To infer the parameters of the curvilinear Gaussian model, a cross-entropy optimization algorithm is proposed, which automatically infers the optimal number of components in the mixture model. The proposed algorithms are verified on a handwritten-letter task, a handwriting-motion task, and a hammer-over-a-nail task.

The core issues of visuomotor skill learning are state representation learning and policy optimization. In view of the characteristics of control tasks, this dissertation proposes a keypoint-centric state representation and two different feature learning methods for different tasks. For tasks with weak environmental interaction, such as walking, the keypoints are the joint positions and velocities of the agent itself; in this case, a human body mesh recovery algorithm is used to extract keypoint features. For manipulation tasks with strong environmental interaction, the keypoint features are the states of the objects in the environment and the state of the robotic arm; in this case, a representation learning method based on a bottleneck network is leveraged. In terms of policy learning, to avoid explicitly defining a reward function, the idea of adversarial imitation learning is used to learn the reward function, and the proximal policy optimization method is used to solve for the weights of the policy network. The entire framework enables the learning of walking skills from high-dimensional visual observations.

The difficulty of learning complex skills lies in task decomposition. When demonstration data are available, the task decomposition problem is equivalent to trajectory segmentation. This dissertation proposes a task segmentation and learning framework based on Bayesian theory. The framework assumes that the underlying model of each subtask follows a Gaussian distribution, uses posterior probabilities to model the segmentation points, and determines them by inferring the probability that a segmentation occurs at each time step. The method does not require any prior knowledge and can run online. To identify repeated subtasks, a clustering algorithm with KL divergence as the distance metric is proposed, achieving high-precision task segmentation and recognition. When no demonstration is given, the task is modeled using the option framework of hierarchical reinforcement learning, and a complex skill learning algorithm combining proto-value functions and option-critic theory is proposed. Finally, the proposed algorithms are verified on an open-and-place task.
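As a concrete illustration of the clustering step described above (identifying repeated subtasks with a KL-divergence distance), the sketch below computes the KL divergence between Gaussian subtask models and greedily groups segments whose symmetric distance falls below a threshold. The greedy assignment rule, the threshold value, and the (mean, covariance) segment-model format are assumptions made for illustration, not the dissertation's algorithm.

import numpy as np

def gaussian_kl(mu0, sig0, mu1, sig1):
    """KL( N(mu0, sig0) || N(mu1, sig1) ) for full-covariance Gaussians."""
    k = mu0.size
    inv1 = np.linalg.inv(sig1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ sig0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(sig1) / np.linalg.det(sig0)))

def cluster_segments(models, threshold=1.0):
    """Assign each segment model (mu, sigma) to the nearest existing cluster
    prototype under the symmetric KL distance, or open a new cluster (i.e. a
    new subtask label) when no prototype is close enough."""
    prototypes, labels = [], []
    for mu, sig in models:
        best, best_d = None, np.inf
        for idx, (pmu, psig) in enumerate(prototypes):
            d = 0.5 * (gaussian_kl(mu, sig, pmu, psig) + gaussian_kl(pmu, psig, mu, sig))
            if d < best_d:
                best, best_d = idx, d
        if best is not None and best_d < threshold:
            labels.append(best)
        else:
            prototypes.append((mu, sig))
            labels.append(len(prototypes) - 1)
    return labels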
Language: Chinese
Contribution Rank: 1
Document Type: Thesis (学位论文)
Identifier: http://ir.sia.cn/handle/173321/25936
Collection: Space Automation Technology Research Laboratory (空间自动化技术研究室)
Recommended Citation (GB/T 7714):
张会文. 基于模仿学习和强化学习的机器人技能自动获取方法研究[D]. 沈阳: 中国科学院沈阳自动化研究所, 2019.
Files in This Item:
File Name/Size: 基于模仿学习和强化学习的机器人技能自动获… (26465 KB)
DocType: 学位论文 (Thesis)
Access: Open Access (开放获取)
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.