Institutional Repository of the Shenyang Institute of Automation, Chinese Academy of Sciences
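Title: 相关集合与数据挖掘方法研究
Alternative title: A Study on Correlativity Sets and Methods of Data Mining
Author: Wang Xiaofeng (王晓峰)
Advisors: Wang Tianran (王天然); Zhu Feng (朱枫)
Classification number: TP311.13
Keywords: correlativity set; data mining; correlativity measure; mining algorithm
Call number: TP311.13/W37/2002
Degree discipline: Mechatronic Engineering
Degree type: Doctoral
Defense date: 2002-12-06
Degree-granting institution: Shenyang Institute of Automation, Chinese Academy of Sciences
Place of degree conferral: Shenyang Institute of Automation, Chinese Academy of Sciences
Author department: Advanced Manufacturing Technology Laboratory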
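Chinese abstract (translated): Data mining is a product of the intersection of several disciplines, including databases, statistics, artificial intelligence, and machine learning, and it has broad application value in industry, agriculture, the military, society, business, economics, and science. Starting from epistemology and building on equivalence classes, this thesis uses correlativity sets to propose a series of new concepts, including correlativity intensity, correlativity measure, and the plausibility relation. It focuses on knowledge representation based on correlativity sets, a top-down projection-reduction method for mining long frequent itemsets, dual-space search, and incremental mining techniques and methods based on set intersection. On the theoretical side, it proposes a foundational framework for data mining based on correlativity sets. The main work is as follows. First, the new concepts proposed are mainly the correlativity set, correlativity intensity, correlativity measure, and plausibility relation. The thesis gives the properties of correlativity sets, the properties of the correlativity measure, and applications of correlativity sets in data mining. The correlativity set generalizes the equivalence class and is a new kind of set based on relations. It has the properties of a set and is also an expression of a relation: every element of a correlativity set must stand in some relation to certain elements of the universe. It expresses and studies relations through sets; it is both related to and distinct from the Rough Set, and it also differs from the Fuzzy Set. Taking the view that knowledge is the connection between things, the correlativity set can serve as a formal description of a knowledge base system, can address some practical problems in areas such as knowledge base simplification, knowledge discovery from databases, and machine learning, and can also be used to study the properties of relations. Correlativity intensity and correlativity measure reflect the strength of a relation and can therefore express the credibility of knowledge. New concepts such as the correlativity set and the correlativity measure offer a new line of research for data mining and knowledge discovery. Second, starting from the concept of the correlativity set, the thesis uses correlativity intensity, the maximal correlativity set, and their properties to propose a basic method for mining classification knowledge from databases. By using heuristic information, the method narrows the search range, which gives it practical value. The thesis also proposes a new correlativity-set-based method for simplifying knowledge bases that is simple and practical. Third, it compares correlativity sets with rough sets, points out that the correlativity measure and the rough degree are intrinsically connected, and proves theoretically that the rough degree is in essence also a correlativity measure. Fourth, it uses correlativity sets to redefine support and confidence in the Apriori algorithm, gives computing formulas and a recursive algorithm for support and confidence under data increments, and discusses ways to improve the efficiency of the incremental algorithm. The computing and recursion formulas are stated at the level of principle and can be used in classical association rule algorithms as well as in rough set theory and related algorithms. An important definition of the correlativity measure and its properties are presented, and the confidence increment theorem and the support increment theorem are proved. The correlativity measure provides a unified mathematical formula for the rough degree of the rough set method and the confidence (support) of the Apriori algorithm, laying a foundation for a unified method of mining classification rules and association rules. Such a unified method is significant for integrating data mining methods and developing data mining languages, and it may be an important future research direction. Fifth, it proposes the new concept of two search spaces formed respectively by item sets X and transaction sets T. The T and X spaces are dual: if an item set grows in length in X space, its support in T space decreases, and vice versa. The newly defined correlativity set mappings f and g reflect this correspondence and provide a powerful tool for converting between the T and X spaces. A new dual-space search method based on correlativity sets is proposed; it is a new class of methods for mining frequent itemsets, and item-first and transaction-first mining methods are each studied in detail. Searching in both directions from one space to the other makes full use of the structural information and numerical characteristics of both spaces and is much more efficient than searching a single space using numerical characteristics alone. The T and X set mappings and the related theorems analyze the properties and structural relations of frequent itemsets theoretically, and computer experiments show the new method to be correct and effective, and more effective than the FP-tree and TreeProject methods. Sixth, it proposes Top-Down_Miner, a new method for mining frequent itemsets that uses database projection reduction and a top-down search strategy to overcome the difficulty Apriori-like methods have with long frequent itemsets; theoretical analysis and computer experiments show that the method is very effective. Because the algorithm adopts some methods from rough sets, two originally different kinds of mining methods (one for association rules, one for classification rules) are merged, which offers a new idea for building integrated algorithms. Seventh, it analyzes the basic theory of data mining in detail, proposes the new concepts of the plausibility relation and plausibility degree, and points out that the essence of data mining is to discover the plausibility relations hidden behind the data in a database. It studies the relationship between the correlativity measure and the plausibility degree. Using the correlativity measure, key computations in data mining methods such as the rough degree in rough set theory, the confidence of association rules, and the Bayes formula can be unified, and on this basis a general method and structural model, GPDM, that expresses the essence of data mining is established. From this, a theoretical foundation and architecture capable of describing various data mining methods is built; the model and architecture meet the basic requirements of the ideal theoretical framework of Jiawei Han and M. Kamber, accommodate the existing classical mining methods, and are significant for developing data mining languages and for unified modeling. In summary, this thesis studies in some depth and systematically the concept and properties of correlativity sets together with correlativity intensity, the correlativity measure, the plausibility relation, and the plausibility degree, and applies these methods successfully to research on data mining theory and methods. It establishes a correlativity-set-based knowledge base reduction algorithm, a classification rule mining method, an incremental mining method for association rules, a dual-space search method for mining frequent itemsets, an effective top-down method for mining long frequent itemsets, and a theoretical foundation framework for data mining. These successful cases indicate that the ideas, methods, and concepts of correlativity sets are sound, have important application value, and deserve further detailed study.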
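Illustrative note (not part of the thesis record): the duality between item-set space X and transaction-set space T described in the abstract can be sketched in Python as follows. The record does not give the thesis's definitions, so everything here is an assumption: the toy database `DB` is hypothetical, and the names `f` and `g` are borrowed from the abstract without knowing which direction each denotes in the thesis. The sketch uses the common convention in which one mapping sends an item set to the transactions that contain it and the other sends a transaction set to the items those transactions share.

```python
from typing import Dict, FrozenSet

# Hypothetical toy database: transaction id -> items in that transaction.
DB: Dict[int, FrozenSet[str]] = {
    1: frozenset({"a", "b", "c"}),
    2: frozenset({"a", "b"}),
    3: frozenset({"a", "c"}),
    4: frozenset({"b", "c"}),
}

def g(items: FrozenSet[str], db: Dict[int, FrozenSet[str]] = DB) -> FrozenSet[int]:
    """Assumed X -> T mapping: ids of the transactions that contain every item."""
    return frozenset(tid for tid, t in db.items() if items <= t)

def f(tids: FrozenSet[int], db: Dict[int, FrozenSet[str]] = DB) -> FrozenSet[str]:
    """Assumed T -> X mapping: items common to all the given transactions."""
    if not tids:
        # Convention for the empty transaction set: every item qualifies.
        return frozenset().union(*db.values())
    return frozenset.intersection(*(db[t] for t in tids))

def support(items: FrozenSet[str], db: Dict[int, FrozenSet[str]] = DB) -> float:
    """Relative support: fraction of transactions containing the item set."""
    return len(g(items, db)) / len(db)

if __name__ == "__main__":
    a, ab = frozenset({"a"}), frozenset({"a", "b"})
    # Growing the item set can only shrink its transaction set.
    print(sorted(g(a)), sorted(g(ab)))
    print(support(a), support(ab))
```

Under these assumptions, extending an item set never enlarges the set of transactions that contain it, which is the monotone relationship between the two spaces that the abstract describes.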
English abstract: Data mining, a multidisciplinary subject involving databases, statistics, artificial intelligence, and machine learning, has important applications in industry, agriculture, the military, society, business, the economy, and scientific research. In this work, we present a new kind of set, the correlativity set, based on concepts from epistemology and on equivalence classes, and we propose a series of new concepts for data mining such as the correlativity measure, correlativity intensity, and plausibility. The main focus of this research includes the representation of knowledge, top-down projection-reduction algorithms for mining long frequent itemsets, dual-space search algorithms for mining frequent itemsets, and incremental mining techniques and methods based on set intersection. As a fundamental basis of data mining, a theoretical framework based on correlativity sets is studied. Our main contributions are summarized as follows. First, new concepts such as the correlativity set, correlativity intensity, correlativity measure, and plausibility relation are proposed, and the properties of the correlativity set and the correlativity measure are analyzed. The equivalence class is extended to the correlativity set, a new relation-based set. The correlativity set has the properties of a traditional set and is also a representation of a relation: every element of the correlativity set is related to certain elements in the universe. We can express and study a relation using the correlativity set. The correlativity set is related to both the rough set and the fuzzy set, but differs from each.
We consider that knowledge represents a certain relationship among things, and the correlativity set is a formalization of a knowledge-based system. Using this notion, practical problems such as the simplification of a knowledge base system and knowledge discovery from databases can be solved, and the properties of a relation can be studied. The correlativity set and correlativity measure also open a new avenue for data mining and Knowledge Discovery in Databases (KDD). Second, starting from the concept of the correlativity set and using the properties of the correlativity measure and the maximal correlativity set, we study a basic method of mining classification knowledge from a database. The method uses heuristic information to reduce the search domain. We also propose a new algorithm to simplify a knowledge base. Third, we compare the correlativity set with the rough set to identify the differences and the relationship between them. The correlativity measure and the rough degree are inherently related, and we prove that, in essence, the rough degree is also a correlativity measure. Fourth, we redefine support and confidence in the Apriori algorithm in terms of the correlativity set, present incremental computing formulas for support and confidence, and discuss ways of improving the efficiency of the incremental algorithm. The proposed recursion and computing formulas for support and confidence are foundational for mining association rules; they can be used in association rule mining algorithms as well as in rough set theory and related methods. We also present an important definition of the correlativity measure and its properties, and prove the theorems on the incremental computation of support and confidence.
The correlativity measure provides a uniform formula for the rough degree of a rough set and the confidence measure in the Apriori algorithm, and establishes a unified theoretical foundation for mining classification rules and association rules. This unified foundation is significant for synthesizing data mining approaches and for studying data mining languages, and it may become an important research direction in data mining. Fifth, we propose the new concept of a dual search space of item sets and transaction sets. A dual relation exists between space T (transaction sets) and space X (item sets): if an item set extends its length in space X, then its support decreases in space T, and vice versa. The mappings f and g defined on correlativity sets correspond to this dual relation and provide a powerful tool for converting between spaces X and T. A dual-space search method based on the correlativity set is proposed for mining frequent itemsets. Item-first and transaction-first search methods are each studied in detail. The dual-space search is a bidirectional search across the two spaces. Because it uses the structural information and numerical properties of both spaces, it is more efficient than a search based on only one space. The properties of frequent itemsets and their structural relations are analyzed theoretically by means of the mappings between sets T and X and the corresponding theorems, and the validity of the new approach is confirmed by computer experiments. The new methods mine frequent itemsets more efficiently than the FP-tree and TreeProject methods. Sixth, a new algorithm for mining frequent itemsets, Top-Down_Miner, is proposed. It adopts a top-down search strategy and database projection reduction, and it solves the difficulty that Apriori-like methods encounter with long frequent itemsets.
The efficiency of the proposed algorithm is high, as shown by theoretical analysis and computer experiments. In addition, because the algorithm employs some methods from rough sets, two different mining approaches (one for association rules, one for classification rules) are integrated, which provides a new idea for research on integrated mining methods. Finally, we analyze the basic theory of data mining in detail and propose the new concepts of the plausibility relation and the plausibility degree. We consider data mining to be a process of finding plausibility relations in a database, and the correlativity measure to be a particular plausibility relation based on correlativity sets. Key calculations in data mining, such as the accuracy of rough sets, the confidence of association rules, and the Bayesian formula, can be unified using the correlativity measure. A General Process of Data Mining (GPDM) approach is also proposed. The theoretical foundation and framework for data mining based on correlativity sets are also given and discussed in this work.
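Illustrative note (not part of the thesis record): the abstract mentions incremental formulas for support and confidence but does not state them. The Python sketch below shows only the generic idea that such formulas usually rest on, namely that occurrence counts are additive across data increments, so a new batch can be folded in without rescanning earlier data. The class `RuleCounts` and its fields are hypothetical names chosen for this example and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

@dataclass
class RuleCounts:
    """Running counts for one candidate rule X -> Y (hypothetical helper)."""
    n: int = 0     # transactions seen so far
    n_x: int = 0   # transactions containing X
    n_xy: int = 0  # transactions containing both X and Y

    def update(self, batch: Iterable[FrozenSet[str]],
               x: FrozenSet[str], y: FrozenSet[str]) -> None:
        """Fold in a new increment; only the new transactions are scanned."""
        xy = x | y
        for t in batch:
            self.n += 1
            if x <= t:
                self.n_x += 1
                if xy <= t:
                    self.n_xy += 1

    @property
    def support(self) -> float:
        """Fraction of all transactions that contain X and Y."""
        return self.n_xy / self.n if self.n else 0.0

    @property
    def confidence(self) -> float:
        """Fraction of transactions containing X that also contain Y."""
        return self.n_xy / self.n_x if self.n_x else 0.0

if __name__ == "__main__":
    x, y = frozenset({"a"}), frozenset({"b"})
    counts = RuleCounts()
    counts.update([frozenset({"a", "b"}), frozenset({"a"}), frozenset({"b"})], x, y)
    print(counts.support, counts.confidence)
    counts.update([frozenset({"a", "b"})], x, y)  # the data increment
    print(counts.support, counts.confidence)
```

Because the counts are additive, the updated support equals (old count + count in the increment) divided by (old size + increment size), and the result matches a full recomputation over the combined data.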
Language: Chinese
Ownership ranking: 1
Content type: Thesis
URI: http://ir.sia.cn/handle/173321/9575
Appears in Collections: Industrial Informatics Laboratory_Advanced Manufacturing Technology Laboratory_Theses

Files in This Item:
File Name/File Size: 相关集合与数据挖掘方法研究.pdf (1084KB)
Access: Restricted (contact the repository to obtain the full text)

Recommended Citation:
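Wang Xiaofeng. 相关集合与数据挖掘方法研究 (A Study on Correlativity Sets and Methods of Data Mining). [Doctoral dissertation]. Shenyang Institute of Automation, Chinese Academy of Sciences. 2002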
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 
