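Title: 混合数据聚类算法研究及在spark下的应用
Alternative Title: Research on Mixed Data Clustering Algorithms and Application in Spark
Author: Jiang Zhihan (姜智涵), affiliations 1, 2
Department: Digital Factory Research Laboratory (数字工厂研究室)
Thesis Advisor: Zhu Jun (朱军)
Keywords: mixed-attribute data; spectral clustering; soft subspace clustering; Spark
Pages: 75
Degree Discipline: Control Engineering
Degree Name: Master's
2019-05-17
Degree Grantor: Shenyang Institute of Automation, Chinese Academy of Sciences
Place of Conferral: Shenyang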
Abstract: Clustering is a key technique in data mining, with applications in many fields, including biology, economics, and medicine. Typical uses include document retrieval, image segmentation, and pattern recognition. This thesis studies several problems that arise when clustering mixed-attribute data, that is, data with both numerical and categorical attributes. The work has three parts.

(1) Most clustering algorithms handle only a single attribute type and cannot cluster mixed-attribute data, and most existing mixed-attribute clustering algorithms are sensitive to initialization and cannot handle clusters of arbitrary shape. To address these problems, an entropy-based spectral clustering algorithm for mixed-attribute data (EBSCMD) is proposed. First, a new similarity computation is introduced. The Gaussian kernel matrix built from the numerical attributes, as used in spectral clustering, is combined with an influence-factor matrix built from the categorical attributes under information-entropy weighting, and the combination replaces the traditional similarity matrix. The new similarity matrix avoids both conversion between the two attribute types and parameter tuning. Second, the new similarity matrix is used in the spectral clustering algorithm, so that data of arbitrary shape can be handled, and the clustering result is obtained. Experiments on data sets from the UCI Machine Learning Repository show that the algorithm clusters mixed-attribute data effectively, with high stability and good robustness.

(2) Traditional soft subspace clustering algorithms apply only to continuous attributes, not to mixed-attribute data, and most existing mixed-attribute clustering algorithms do not consider that different attributes contribute differently to different clusters. To address these problems, soft subspace clustering for mixed-attribute data is studied, and a new weighted soft subspace clustering algorithm for mixed-attribute data (WSSCMD) is proposed. First, a unified weighting scheme for numerical and categorical attributes is introduced to quantify the contribution of each attribute dimension to each cluster, which yields a new similarity measure. This measure avoids both under-estimating attribute contributions and parameter tuning. Second, the new similarity measure is used in a fuzzy clustering algorithm to better handle mixed-attribute data, and the clustering result is obtained. Experiments on several data sets show the advantage of the algorithm on mixed-attribute data.

(3) Traditional single-machine clustering algorithms for mixed-attribute data are computationally inefficient and unsuitable for large-scale mixed-attribute data. To address this, a parallel spectral clustering algorithm for mixed-attribute data based on Spark is studied. The mixed-attribute spectral clustering algorithm is deployed on a Spark cluster, and a new parallel method for computing the similarity matrix is proposed. Experiments on mixed-attribute data sets confirm the computational efficiency of the Spark-based algorithm.
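The abstract describes the EBSCMD similarity matrix only in words and gives no formulas, so the following is a minimal sketch of one plausible reading, not the thesis's actual method. Everything specific here is an assumption made for illustration: the function names, weighting each categorical attribute by its normalized Shannon entropy, scoring categorical agreement as the weighted share of matching attributes, combining the two matrices by an elementwise product, and the bandwidth parameter `sigma`.

```python
import math
from collections import Counter


def entropy_weights(cat_rows):
    """Weight each categorical attribute by its share of total Shannon entropy.

    Assumption: the abstract only says the categorical part is
    'entropy-weighted'; this normalization is a guess.
    """
    n = len(cat_rows)
    n_attrs = len(cat_rows[0])
    entropies = []
    for j in range(n_attrs):
        counts = Counter(row[j] for row in cat_rows)
        entropies.append(-sum((c / n) * math.log(c / n) for c in counts.values()))
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1.0 / n_attrs] * n_attrs
    return [h / total for h in entropies]


def gaussian_kernel(x, y, sigma):
    """Standard Gaussian (RBF) kernel on the numerical attributes."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def categorical_factor(a, b, weights):
    """Hypothetical 'influence factor': weighted share of matching attributes."""
    return sum(w for u, v, w in zip(a, b, weights) if u == v)


def mixed_similarity(num_rows, cat_rows, sigma=1.0):
    """Combine both parts into one similarity matrix (elementwise product, assumed)."""
    weights = entropy_weights(cat_rows)
    n = len(num_rows)
    return [
        [
            gaussian_kernel(num_rows[i], num_rows[j], sigma)
            * categorical_factor(cat_rows[i], cat_rows[j], weights)
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Under these assumptions the matrix is symmetric with ones on the diagonal, so it could be passed as a precomputed affinity to a standard spectral clustering routine (graph Laplacian, leading eigenvectors, then k-means). That step is omitted here, as is the Spark parallelization, since the abstract does not describe how either was implemented.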
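Language: Chinese
Contribution Rank: 1
Document Type: Thesis
Identifier: http://ir.sia.cn/handle/173321/25205
Collection: Digital Factory Research Laboratory (数字工厂研究室)
Affiliation: 1. Shenyang Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences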
Recommended Citation
GB/T 7714
姜智涵. 混合数据聚类算法研究及在spark下的应用[D]. 沈阳. 中国科学院沈阳自动化研究所,2019.
Files in This Item:
File Name/Size: 混合数据聚类算法研究及在spark下的应(2363KB)
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.