Publication Date

2019-08-06

Availability

Open access

Embargo Period

2019-08-06

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science (Arts and Sciences)

Date of Defense

2019-06-25

First Committee Member

Stefan Wuchty

Second Committee Member

Odelia Schwartz

Third Committee Member

Geoff Sutcliffe

Fourth Committee Member

Miroslav Kubat

Abstract

Feature Subset Selection (FSS) is a non-convex, non-linear, multi-modal, high-dimensional, multi-objective optimization problem in which finding the optimal solution is difficult due to the ultra-large search space and the very large number of local minima. Existing FSS methods suffer from problems such as stagnation in local optima and high computational cost. Evolutionary Computation (EC) techniques are well-known global search algorithms. Particle Swarm Optimization (PSO) is an EC technique that is simple to implement and converges faster than other methods such as Genetic Algorithms (GA), Differential Evolution (DE), and Ant Colony Optimization (ACO). PSO has been successfully applied to areas such as engineering, biology, image processing, job scheduling, robotics, and neural networks, but its potential for FSS has not been fully investigated. The overall goal of this thesis is to investigate and improve the capability of PSO for FSS, aiming to select the smallest, most compact subset of features with the best classification performance in the least amount of time. This thesis investigates the use of PSO in wrapper, filter, embedded, and hybrid approaches and applies it to both single- and multi-objective FSS.

FSS problems are characterized by increasing numbers of features and instances, which massively challenge classical methods. Representing efficient global search methods, Swarm Intelligence (SI)-based EC techniques have increasingly been applied to solve such FSS problems. Despite their popularity, PSO and other swarm-based algorithms still suffer from a variety of problems and shortcomings as a consequence of high-dimensional data. We start this dissertation with a comprehensive survey in which we formulate the challenges of SI algorithms that tackle FSS problems, discuss algorithmic solutions, and present an overview of metrics that allow the systematic and comparable assessment of algorithm performance.

In bioinformatics, PSO methods can select subsets of predictive genes for sample classification. However, the obtained gene subsets lack interpretability and are usually too large to be used as biomarkers. Such characteristics may be rooted in the observation that PSO algorithms usually lose the diversity of the swarm, leading to premature convergence and leaving many areas of the search space unexplored. Furthermore, high-dimensional FSS problems like those encountered in gene expression data pose an additional challenge to EC techniques, since their ability to pick out the important features must remain unchanged despite the increasing dimensionality of the feature space. Our goal is to develop an algorithm capable of scaling up to high-dimensional gene expression data while still selecting the smallest subsets that capture the information needed to carry out the desired prediction tasks. To address these drawbacks, we introduce a new PSO variant called Combinatorial PSO (COMB-PSO). The proposed method develops a new encoding mechanism, an enhanced Pareto non-dominated set maintenance scheme, a fast time-variant convergence technique, and a new diversity operator. The new algorithm is then implemented as a wrapper-based approach to solve both single- and multi-objective FSS on high-dimensional gene expression datasets. We also expect this research to contribute to a deeper understanding of the most important factors that impact the performance of PSO when scaling up to high-dimensional FSS problems.
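For illustration only, the following is a minimal Python sketch of a wrapper-style binary PSO for FSS of the kind described above, assuming scikit-learn, a k-nearest-neighbor classifier, and a simple scalarized accuracy-versus-size fitness; it is not the COMB-PSO implementation and omits its encoding mechanism, Pareto archive, convergence technique, and diversity operator.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y):
    # Hypothetical wrapper fitness: cross-validated accuracy minus a small
    # penalty proportional to subset size (two objectives scalarized for brevity).
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask == 1], y, cv=3).mean()
    return acc - 0.01 * mask.mean()

def binary_pso_fss(X, y, n_particles=20, n_iter=30, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    d = X.shape[1]
    pos = rng.integers(0, 2, size=(n_particles, d))        # one bit string per particle
    vel = rng.uniform(-1.0, 1.0, size=(n_particles, d))    # continuous velocities
    pbest, pbest_fit = pos.copy(), np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((n_particles, d)), rng.random((n_particles, d))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        # Sigmoid transfer function maps each velocity to a bit-selection probability.
        pos = (rng.random((n_particles, d)) < 1.0 / (1.0 + np.exp(-vel))).astype(int)
        fit = np.array([fitness(p, X, y) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)
print("selected features:", np.flatnonzero(binary_pso_fss(X, y)))

The penalty weight, swarm size, and inertia/acceleration coefficients here are illustrative defaults, not values used in the dissertation.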
Next, one of the problems with existing PSO-based FSS methods is their weakness in fine-tuning near local optima. Moreover, existing PSO-based approaches do not use feature correlation information to guide the search process. As a result, similar features have a high probability of being selected in the final subset, which reduces classifier performance. To overcome these problems, we propose a novel hybrid FSS algorithm based on PSO. The proposed method, called Combinatorial PSO with Local Search (COMB-PSO-LS), embeds a local search strategy in the PSO to select salient and less correlated features. The goal of the local search technique is to guide the search process of the PSO in such a way that less correlated (dissimilar) features have a higher probability of being selected than more correlated (similar) features. Moreover, the proposed method utilizes a subset size determination scheme to select feature subsets of reduced size. This ultimately leads to a significant performance gain when applying COMB-PSO-LS to FSS.

Finally, in classification tasks that involve multiple data sources, feature distributions may vary from one source to another. In such settings, knowledge of the dimensions that differ between the source and target data is important to reduce the distance between domains, allowing accurate knowledge transfer. In this work, we present a novel method that integrates bi-level programming with PSO, called Combinatorial PSO for Transfer Learning (COMB-PSO-TL), to identify variant and invariant features between source and target datasets and integrate such results to simultaneously reduce the variance between the two distributions while optimizing the size and classification error of the selected subset. To the best of our knowledge, this is the first time a bi-level multi-objective programming approach has been implemented to solve FSS problems for transfer learning using a PSO-based EC algorithm.
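For illustration only, the following is a minimal sketch of a correlation-guided local search step of the kind embedded in COMB-PSO-LS, assuming a wrapper fitness function like the one in the previous sketch; it is not the dissertation's actual local search operator or subset size determination scheme.

import numpy as np

def correlation_guided_local_search(mask, X, y, fitness):
    # Try to drop the most redundant selected feature if doing so does not
    # degrade the (assumed) wrapper fitness, so that less correlated
    # (dissimilar) features are favored in the subset.
    mask = mask.copy()
    selected = np.flatnonzero(mask)
    if selected.size < 2:
        return mask
    # Mean absolute Pearson correlation of each selected feature with the others.
    corr = np.abs(np.corrcoef(X[:, selected], rowvar=False))
    np.fill_diagonal(corr, 0.0)
    redundancy = corr.mean(axis=1)
    candidate = selected[redundancy.argmax()]   # most correlated (redundant) feature
    trial = mask.copy()
    trial[candidate] = 0
    if fitness(trial, X, y) >= fitness(mask, X, y):
        return trial                            # keep the smaller, less redundant subset
    return mask

Such a step would typically be applied to each particle's position (or to the global best) between PSO iterations; the exact placement and acceptance rule in COMB-PSO-LS are not reproduced here.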

Keywords

feature selection; evolutionary computation; particle swarm optimization; multi-objective optimization; transfer learning; bioinformatics
