Publication Date

2017-07-12

Availability

Embargoed

Embargo Period

2019-07-12

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PHD)

Department

Biostatistics (Medicine)

Date of Defense

2017-05-30

First Committee Member

Hemant Ishwaran

Second Committee Member

Sunil Rao

Third Committee Member

Daniel Feaster

Fourth Committee Member

Yongtao Guan

Abstract

Boosting is one of the most powerful machine learning methods used for modeling a univariate response. However, its application to multivariate responses is limited. We use a gradient boosting approach (a generic form of boosting) for modeling a multivariate response. Specifically, we focus on longitudinal data, in which repeated measurements are observed for a subject over time. Our gradient boosting approach is used to boost multivariate trees to fit a novel, flexible semi-nonparametric marginal model for longitudinal data. In this model, features are modeled nonparametrically using multivariate trees, while feature-time interactions are modeled semi-nonparametrically using P-splines with an estimated smoothing parameter. To avoid overfitting, we describe a relatively simple in-sample cross-validation method that can be used to estimate the optimal boosting iteration and that has the surprising added benefit of stabilizing certain parameter estimates. Our new multivariate tree boosting method is shown to be highly flexible, robust to covariance misspecification and unbalanced designs, and resistant to overfitting in high dimensions. Feature selection is performed using variable importance to identify important features and feature-time interactions. We also explain some modifications to the approach that improve prediction performance. These include using a new gradient component and using a random forest as the base learner. Additionally, we describe a new multivariate boosting approach for a multivariate response when the data are generated from a cross-sectional study. In this approach, our aim is to detect covariates that are related to most of the response variables in the high-dimensional sparse setting. Throughout, the efficiency of our approach is demonstrated using simulated as well as real datasets.

Keywords

Gradient boosting; Multivariate regression tree; P-spline; Marginal model; Longitudinal data; Multivariate regression

Available for download on Friday, July 12, 2019
