The use of tree-based data structure in association mining

Date of Award




Degree Name

Doctor of Philosophy (Ph.D.)


Electrical and Computer Engineering

First Committee Member

Miroslav Kubat, Committee Chair


Data mining is the process of discovering valuable information from large databases. One popular sub-field of data mining is association mining, which studies algorithms capable of discovering frequently co-occurring groups of items in transaction databases. Answering a targeted query means constraining the search to those association rules that contain certain user-specified items. One recent approach proposed a mechanism that converts the database into a data structure called an itemset tree (IT-tree) that facilitates speedy processing of such queries.

This dissertation addresses several research questions related to the IT-tree framework. The main goal is to improve the IT-tree approach to meet the needs of association mining.

First, we extend the IT-tree to general query answering, rectifying its major deficiency. The theoretical analysis and experimental results show that the proposed IT-Mining algorithm is very efficient and scales roughly linearly with the size of the database. It outperforms another tree-based algorithm, DepthProject, for mining long frequent itemsets with high pair-wise overlaps. Second, we propose the TF-Mining algorithm to achieve a significant reduction in processing time for answering targeted queries. TF-Mining runs faster than the original methods by two orders of magnitude. Third, we compare the IT-tree approach to similar tree-structure approaches, [AAP1999] and [HPY2000]. We introduce the targeted query concept into these two approaches by proposing the AAP-QueriedMining and FP-QueriedMining algorithms, respectively. Experimental results indicate that the IT-tree has distinct advantages when the minimum support varies between queries, while the other two are better when the queried minimum support is fixed. Fourth, we explore the effect of pruning heuristics designed to further speed up query processing. Our research shows that as long as the user is interested only in itemsets with very low supports, the size of the tree and the costs of query processing can be significantly reduced at the small price of missing an acceptable percentage of frequent itemsets. Finally, we extend the IT-tree paradigm to another type of specialized query, item constraints. The proposed IC-Mining algorithm is very efficient and scales with the size of the database. The performance analysis demonstrates that IC-Mining would outperform a similar approach, the Direct algorithm, on very large databases.
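To make the notion of a targeted query concrete, the following is a minimal sketch of a naive baseline that scans the transaction list directly. It is not the IT-tree, IT-Mining, or TF-Mining algorithm described in the dissertation; the function name `targeted_frequent_itemsets`, its parameters (including the `max_size` cap), and the sample transactions are assumptions made purely for illustration.

```python
from itertools import combinations


def targeted_frequent_itemsets(transactions, target, min_support, max_size=3):
    """Naive targeted query (illustrative only, not the IT-tree method).

    Returns {itemset: support} for every itemset of at most `max_size` items
    that contains all items in `target` and whose support (fraction of all
    transactions containing it) is at least `min_support`.
    """
    target = frozenset(target)
    n = len(transactions)
    # Only transactions that contain every target item can support a match.
    matching = [frozenset(t) for t in transactions if target <= frozenset(t)]
    counts = {}
    for t in matching:
        extras = sorted(t - target)
        # Extend the target with 0..k additional items from this transaction.
        for k in range(0, max_size - len(target) + 1):
            for combo in combinations(extras, k):
                itemset = target | frozenset(combo)
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical toy database of four market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

# Targeted query: frequent itemsets that must contain "bread".
result = targeted_frequent_itemsets(transactions, {"bread"}, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

A scan like this revisits the whole database for every query, which is the cost that a precomputed tree structure such as the IT-tree is meant to avoid.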


Computer Science
