基于全局交互的图像语义理解方法研究
Alternative Title: Research on Image Semantic Understanding Method Based on Global Interaction
熊艳彬
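Department: 数字工厂研究室 (Digital Factory Research Laboratory)
Thesis Advisor: 库涛
Keywords: convolutional neural network; recurrent neural network; image semantic understanding; global interaction mechanism; attention mechanism
Pages: 66
Degree Discipline: Pattern Recognition and Intelligent Systems
Degree Name: Master's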
2020-05-26
Degree Grantor: Shenyang Institute of Automation, Chinese Academy of Sciences
Place of Conferral: Shenyang
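Abstract: Image semantic understanding builds on image recognition and is an interdisciplinary research area that draws on computer science, psychology, and linguistics; it has also contributed significantly to research on cross-modal interaction between images and text. To understand a target image as a whole, rationally or perceptually, and to generate a natural language description that matches human conventions, a system must extract and recognize the scenes, objects, and attributes in the image, analyze the relationships among those objects and attributes (including each object's actions and form, and the psychology and emotions of any people), and then generate a text description from this information. This makes it a highly complex and challenging task. Traditional methods are mainly template-based methods and transfer-based generation methods. Their limitation is that the model depends too heavily on a fixed grammar template or on a reference image-text database and neglects the language model's ability to parse the image flexibly and generate entirely new text, so the outputs are unsatisfactory. In recent years, encoder-decoder neural network models have brought rapid progress on this task. This thesis studies how to improve encoder-decoder neural network models for image semantic understanding, focusing on five aspects: the ability of deep convolutional neural networks (CNNs) to extract image features; the use of a bidirectional gated recurrent unit (GRU) model for semantic parsing of images; the improvement obtained by adding a global image interaction mechanism to the bidirectional GRU; regularizing the image and text data and representing text with word2vec mappings to address data sparsity and skewness; and the application of an attention mechanism in the bidirectional GRU. The main work is as follows:
(1) Image semantic understanding first requires feature information from the target image. If the extracted features are unrepresentative or inaccurate, the semantic parsing stage can hardly distinguish the attributes of the objects in the image or the relationships among them, and so cannot produce an accurate description. To address this, and drawing on the rapid development of deep-CNN-based image classification and object detection algorithms, the thesis compares different CNNs on image feature extraction and classification and adopts a transfer-learning-based deep CNN for image feature extraction.
(2) After feature extraction, the focus shifts to the interaction between image data and text data and to the construction of the natural language model. To address the poor logical coherence of the baseline model's descriptions, the thesis proposes a global-interaction image semantic understanding model for image semantic generation. During text generation, a bidirectional recurrent neural network performs semantic parsing so that the model attends to both preceding and following context at every step, which keeps the output semantically coherent. Throughout parsing, the model also attends to the global information of the image to guide generation. The extracted image features and the text data are regularized, and text is represented with word2vec mappings, which reduces the influence of data noise and addresses high-dimensional data sparsity and data skewness.
(3) When recognizing more complex images or describing image details, the model must focus accurately on the salient features and detailed attributes of the target image rather than "blindly" attending to all of its features, which can cause the predicted next word to mismatch the image content. To address this, the thesis further improves the global-interaction model by introducing an attention mechanism, proposing an attention-based global-interaction image semantic understanding model that focuses on the important information and attributes of the target image while generating text, making the image semantic understanding more specific and accurate.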
Other Abstract: Image semantic understanding is based on image recognition and integrates interdisciplinary research from computer science, psychology, and linguistics. It has also made an important contribution to research on cross-modal interaction between images and text. Image semantic understanding aims to understand the whole target image, rationally or perceptually, and to generate a natural language description that conforms to human habits. This requires not only extracting and identifying the scenes, objects, and attributes contained in the target image, but also analyzing the relationships among the objects and attributes, including each object's action and form and the psychology and emotion of the people shown, and then generating a text description of the image from this information. It is therefore a very complex and challenging task. Traditional methods of image semantic understanding are mainly template-based methods and transfer-based generation methods. Their limitation is that the whole model depends too heavily on a particular syntax template or a reference image-text database, ignoring the process by which a language model flexibly parses the image and generates new text, so the output of the model is unsatisfactory. In recent years, the application of encoder-decoder neural network models to image semantic understanding has brought rapid progress on this task. This thesis studies how to effectively improve encoder-decoder neural network models for image semantic understanding. It focuses on five aspects: the ability of deep convolutional neural networks to extract image features; the use of a bidirectional gated recurrent unit (GRU) model for semantic parsing of images; the improvement that a global image interaction mechanism, added on top of the bidirectional GRU, brings to the model; regularizing the image and text data and using word2vec text mapping to represent text, which addresses data sparsity and skewness; and the application of an attention mechanism in the bidirectional GRU. The main work is as follows:
(1) Image semantic understanding first needs to obtain the feature information of the target image. If the extracted features lack representativeness or have low accuracy, it is difficult during semantic parsing to distinguish the object attributes and the relationships among the objects in the target image, and an accurate description of the target image cannot be generated. To solve this problem, and building on the rapid development of image classification and object detection algorithms based on deep convolutional neural networks, this thesis compares different convolutional neural networks on image feature extraction and classification and uses a deep convolutional neural network based on transfer learning to extract image features.
(2) After image feature extraction, the focus is on the interaction between image data and text data and on the construction of the natural language model. To address the poor logical coherence of the baseline model when generating semantic descriptions of the target image, this thesis proposes an image semantic understanding model based on global interaction for image semantic generation. In the process of generating text, a bidirectional recurrent neural network is used for semantic parsing, so that the model attends to the preceding and following context in real time and semantic coherence is maintained. During semantic parsing, the model also attends to the global information of the image in real time to guide semantic generation. The extracted image feature data and the text data are regularized, and word2vec text mapping is used to represent text, which reduces the influence of data noise and addresses high-dimensional data sparsity and data skewness.
(3) When recognizing more complex target images and describing image details, the model needs to attend accurately and selectively to the prominent features and detailed attributes of the target image. This avoids the situation in which the semantic parsing model "blindly" attends to all the features of the target image and, as a result, predicts a next word that does not correspond to the image content. To solve this problem, this thesis further improves the global-interaction image semantic understanding model by introducing an attention mechanism, and proposes an attention-based global-interaction image semantic understanding model that focuses on the important information and attributes of the target image while generating the image text, making the image semantic understanding more specific and accurate.
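The combination of global image information and attention described in items (2) and (3) can be illustrated with a minimal sketch. The abstract does not give the thesis's equations, so everything specific here is an assumption made for illustration: the dot-product scoring, the mean-pooled global feature, the concatenation-based fusion, the toy values, and the function names `attend` and `decoder_input`. It is not the author's model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden, regions):
    """Dot-product soft attention (assumed scoring function).

    Each region feature is weighted by its similarity to the current
    decoder hidden state; the context is the weighted sum of regions.
    """
    scores = [sum(h * r for h, r in zip(hidden, region)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, context


def decoder_input(word_vec, global_feat, context):
    """Hypothetical fusion step: concatenate the word embedding, the
    global image feature, and the attended context at every time step."""
    return word_vec + global_feat + context


# Toy values, assumed for illustration only (not from the thesis).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three region features
global_feat = [sum(r[d] for r in regions) / len(regions) for d in range(2)]
hidden = [2.0, 0.0]                               # current decoder state
word_vec = [0.5, -0.5, 0.1]                       # stand-in word embedding

weights, context = attend(hidden, regions)
x_t = decoder_input(word_vec, global_feat, context)
print(len(x_t), [round(w, 3) for w in weights])
```

In this toy case the hidden state aligns with the first and third regions, so they receive equal and larger weights than the second. The global feature is appended unchanged at each step, which is one simple way to keep whole-image information available while attention selects local detail.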
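Language: Chinese
Contribution Rank: 1
Document Type: Thesis
Identifier: http://ir.sia.cn/handle/173321/27120
Collection: 数字工厂研究室 (Digital Factory Research Laboratory)
Affiliation: Shenyang Institute of Automation, Chinese Academy of Sciences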
Recommended Citation
GB/T 7714
熊艳彬. 基于全局交互的图像语义理解方法研究[D]. 沈阳. 中国科学院沈阳自动化研究所,2020.
Files in This Item:
File Name: 基于全局交互的图像语义理解方法研究.pd
Size: 2148KB
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.