Publication Date

2018-08-07

Availability

Embargoed

Embargo Period

2020-08-06

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PHD)

Department

Biostatistics (Medicine)

Date of Defense

2018-06-22

First Committee Member

Hemant Ishwaran

Second Committee Member

J. Sunil Rao

Third Committee Member

Steven Chen

Fourth Committee Member

Mei-Ling Shyu

Abstract

Random forest is a machine learning algorithm that has been applied to a variety of problems, though mostly in the supervised setting. Here we introduce a new method for applying random forests in the unsupervised setting, which we call sidClustering. sidClustering begins with what is called sidification of the features, which has two steps: first, staggering the features so that they have mutually exclusive ranges; and second, forming all pairwise interactions from these shifted variables. Sidification yields what are called the SID main features and the SID interaction features, respectively. A multivariate random forest (MVRF) from the randomForestSRC R package (whose splitting rules can handle continuous and categorical target variables at the same time) is then used to predict the SID main features. Sidification in conjunction with MVRF provides a better way to carve up the data space, which yields better measures of distance between observations. sidClustering's advantages are that it is adept at finding clusters arising from categorical and/or continuous variables, requires minimal tuning (just like random forests), and retains all of the advantages of random forests regarding computational scaling for big data, without distributional or specification assumptions. We then discuss the development of a rule generator and a method for estimating k (the number of clusters). The idea is that once the clusters have been determined, we need some way of describing them with human-readable output. This is done by first discretizing all the continuous features using the splits from a random forest and then creating rules based on a small set of features that most strongly drive the clusters. This method can also be applied in the semisupervised setting and gives us a way of peeking into the random forest model.
As for the estimation of k, we take advantage of the out-of-bag (OOB) set in the random forest algorithm and test for the k that produces the most stable clusters, the idea being that the correct value of k should yield the most consistent clusters. Lastly, we cover covariate-adjusted random forest statistics. These take advantage of the multivariate random forest framework to estimate the variability of, and covariance between, outcomes. By using random forest weights, we are able to develop weighted versions of these statistics for individual observations.
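
The sidification step described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's implementation (which relies on the randomForestSRC R package); it only shows the two stated steps for continuous features. The function name sidify, the gap parameter, the left-to-right staggering order, and the use of products as pairwise interactions are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def sidify(X, gap=1.0):
    """Illustrative sidification for continuous features (hypothetical helper).

    Step 1: shift each column so the columns occupy mutually exclusive ranges.
    Step 2: form all pairwise interactions of the shifted columns.
    Assumptions (not from the source): columns are staggered in their given
    order, consecutive ranges are separated by `gap`, and an interaction is
    the product of two shifted columns.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    main = np.empty_like(X)
    offset = gap  # start strictly above zero
    for j in range(p):
        col = X[:, j]
        # Translate the column so its minimum sits at the current offset.
        main[:, j] = col - col.min() + offset
        # The next column starts above this column's maximum.
        offset = main[:, j].max() + gap
    pairs = list(combinations(range(p), 2))
    if pairs:
        inter = np.column_stack([main[:, a] * main[:, b] for a, b in pairs])
    else:
        inter = np.empty((n, 0))
    return main, inter, pairs


# Small demonstration on made-up data.
X_demo = [[0.0, 10.0], [2.0, 30.0], [1.0, 20.0]]
sid_main, sid_inter, sid_pairs = sidify(X_demo)
```

In this sketch, sid_main plays the role of the SID main features and sid_inter the SID interaction features; fitting the multivariate random forest and deriving distances from it are outside its scope.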

Keywords

Unsupervised learning; random forest; SID

Available for download on Thursday, August 06, 2020
