Publication Date

2011-05-02

Availability

UM campus only

Embargo Period

2011-05-02

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PHD)

Department

Electrical and Computer Engineering (Engineering)

Date of Defense

2011-04-12

First Committee Member

Miroslav Kubat

Second Committee Member

Nigel M. John

Third Committee Member

Moiez A. Tapia

Fourth Committee Member

Akmal A. Younis

Fifth Committee Member

Geoff Sutcliffe

Abstract

Induction of classifiers from sets of preclassified training examples is one of the most popular machine learning tasks. This dissertation focuses on the techniques needed in the field of automated text categorization. Here, each document can be labeled with more than one class, sometimes with many classes. Moreover, the classes are hierarchically organized, their mutual relations typically being expressed as a generalization tree. Both aspects (multi-label classification and hierarchically organized classes) have so far received inadequate attention. Existing work largely assumes that it is enough to induce a separate binary classifier for each class, and the question of class hierarchy is rarely addressed. This, however, ignores some serious problems. For one thing, induction of thousands of classifiers from hundreds of thousands of examples described by tens of thousands of features (a common case in automated text categorization) incurs prohibitive computational costs: even a single binary classifier in domains of this kind often takes hours, or even days, to induce. For another, the hierarchical organization of the classes affects how the classification performance of the induced classifiers should be evaluated. The presented work proposes a technique referred to by the acronym "H-kNN-plus." The technique combines support vector machines and nearest neighbor classifiers with the aim of capitalizing on the strengths of both. As for performance evaluation, a variety of measures have been used to evaluate hierarchical classifiers, including the standard non-hierarchical criteria that assign the same weight to different types of error. The author proposes a performance measure that overcomes some of their weaknesses. The dissertation begins with a study of (non-hierarchical) multi-label classification.
One of the reasons for the poor performance of earlier techniques is the class-imbalance problem: a small number of positive examples is outnumbered by a great many negative examples. Another difficulty is that each class tends to be characterized by a different set of features. This means that most of the binary classifiers are induced from examples described by predominantly irrelevant features. Addressing these weaknesses by majority-class undersampling and feature selection, the proposed technique significantly improves the overall classification performance. Even more challenging is the issue of hierarchical classification. Here, the dissertation introduces a new induction mechanism, H-kNN-plus, and subjects it to extensive experiments with two real-world datasets. The results indicate its superiority, in these domains, over earlier work in terms of prediction performance as well as computational costs.
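
The abstract names majority-class undersampling without giving details. As a rough illustration only, and not the dissertation's actual procedure, the generic idea might be sketched as follows for one binary classifier in a multi-label setting. The function name `undersample_majority`, the `ratio` parameter, and the toy data are all hypothetical.

```python
import random


def undersample_majority(examples, labels, target_class, ratio=1.0, seed=0):
    """Build a binary training set for one class of a multi-label problem.

    labels[i] is the set of classes attached to examples[i].
    All positive examples are kept; negatives are randomly thinned so that
    roughly `ratio` negatives remain per positive (hypothetical knob).
    """
    rng = random.Random(seed)
    positives = [x for x, y in zip(examples, labels) if target_class in y]
    negatives = [x for x, y in zip(examples, labels) if target_class not in y]

    # Never ask for more negatives than exist.
    keep = min(len(negatives), int(round(ratio * len(positives))))
    sampled = rng.sample(negatives, keep)

    data = positives + sampled
    targets = [1] * len(positives) + [0] * len(sampled)
    return data, targets


# Toy data (made up): 2 documents tagged "sports", 8 tagged only "news".
docs = ["d%d" % i for i in range(10)]
tags = [{"sports"}, {"sports", "news"}] + [{"news"}] * 8
X, y = undersample_majority(docs, tags, "sports")
```

With `ratio=1.0`, the resulting training set for "sports" holds as many negatives as positives, so the induced classifier is no longer dominated by the majority class. Per-class feature selection, which the abstract also mentions, would be applied separately and is not shown here.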

Keywords

Induction; Text categorization; Hierarchical classification; Multi-label examples; Imbalanced classes
