Doctor of Philosophy (PhD)
Electrical and Computer Engineering (Engineering)
Advances in information science have enabled explosive growth in data, attracting more and more researchers to the field of big data analytics. Notably, in many real-world applications, large amounts of data are imbalanced because the events of interest occur infrequently. Classification of imbalanced data is an important research problem because many real-world datasets have skewed class distributions in which the majority of instances (examples) belong to one class and far fewer instances belong to the others. A classifier induced from an imbalanced dataset is likely to be biased towards the majority classes and to show very poor classification accuracy on the minority classes. In many applications, however, the minority instances actually represent the concept of interest (e.g., fraud in banking operations, abnormal cells in medical data), and the detection of these rare events has become more important. Despite extensive research efforts, rare event mining remains one of the most challenging problems in information retrieval, especially for multimedia big data. To tackle this challenge, in this dissertation, we propose an extended deep learning approach to achieve promising performance in classifying highly skewed multimedia datasets. Specifically, we investigate the integration of bootstrapping methods and a state-of-the-art deep learning approach, Convolutional Neural Networks (CNNs), with extensive empirical studies. Because deep learning approaches such as CNNs are usually computationally expensive, we propose to feed low-level features to CNNs and demonstrate the feasibility of this approach in achieving promising performance while substantially reducing training time. Furthermore, since large training datasets are required to train CNNs, we propose to extract features from pre-trained CNN models and feed those features to another fully connected neural network.
Implementations in big data environments show promising performance of our model in handling big datasets with respect to feasibility and scalability. To further improve the classification results and bridge the semantic gap between high-level concepts and low-level visual features, correlation discovery in semantic concept mining is worth exploring. Though inter-concept correlations have recently been utilized to address this issue, the very small number of instances in the minority classes often leads to the detection of imprecise correlations and unsatisfactory classification results. Meanwhile, correlation discovery is a computationally intensive task in the sense that it requires a deep analysis of very large and growing repositories. This dissertation further proposes a novel concept correlation analysis framework that utilizes the correlations between the retrieval scores and labels. By integrating the correlation information, the proposed framework can help imbalanced data classification and enhance rare class (event or concept) mining even with trivial scores from the minority classes. Not only deep learning but also numerous other classification algorithms have been developed for a variety of data types. However, it is nearly impossible for one classifier to perform best on all kinds of datasets all the time. Therefore, ensemble learning models, which aim to take advantage of different classifiers, have received considerable attention recently. In this dissertation, a scalable classifier ensemble framework assisted by a set of "judgers" is also proposed to integrate the outputs from multiple classifiers for multimedia big data classification. Specifically, based on the confusion matrices of different classifiers, a set of judgers is organized into a hierarchically structured decision model.
A testing instance is first input to the different classifiers, and the classification results are then passed to the proposed hierarchically structured decision model to derive the final result. The ensemble system can be run on Spark, which is designed for big data processing. All the proposed components are evaluated on multimedia datasets containing different kinds of data. The experimental results show the effectiveness of our framework in classifying severely imbalanced data with promising performance, and demonstrate that the proposed classifier ensemble framework outperforms several state-of-the-art model fusion approaches. Furthermore, the proposed framework is applied to two real-world applications, i.e., deep learning based text data analysis on an Amazon review dataset and efficient large-scale stance analysis on Twitter, and achieves promising results in both. In addition, we design a web-based information retrieval system and identify several future directions that could be explored to further improve the current work.
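The abstract mentions combining bootstrapping with CNN training but does not specify the sampling scheme. As a minimal sketch, assuming a class-balanced scheme in which each training batch draws an equal share of indices from every class with replacement, the idea could look as follows. The function name `balanced_bootstrap_batch` and its parameters are hypothetical and are not taken from the dissertation.

```python
import random
from collections import defaultdict


def balanced_bootstrap_batch(labels, batch_size, seed=None):
    """Draw one batch of indices with roughly equal counts per class.

    Assumption (not from the dissertation): indices are sampled with
    replacement within each class, so a rare class can fill its share
    of the batch even when it has very few instances.
    """
    rng = random.Random(seed)

    # Group instance indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    classes = sorted(by_class)
    per_class, remainder = divmod(batch_size, len(classes))

    batch = []
    for i, cls in enumerate(classes):
        # Spread any leftover slots over the first few classes.
        k = per_class + (1 if i < remainder else 0)
        batch.extend(rng.choices(by_class[cls], k=k))

    rng.shuffle(batch)
    return batch


# Toy data: 95 majority-class (0) and 5 minority-class (1) instances.
labels = [0] * 95 + [1] * 5
batch = balanced_bootstrap_batch(labels, batch_size=32, seed=0)
minority_in_batch = sum(labels[i] for i in batch)
```

In this toy setting the minority class makes up 5% of the data but half of each batch, which is one common way to keep a classifier from being dominated by the majority class during training.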
Deep Learning; Imbalanced Data; Information Retrieval; Big Data; Classifier Fusion; Stance Classification
Yan, Yilin, "Deep Learning Based Imbalanced Data Classification and Information Retrieval for Multimedia Big Data" (2018). Open Access Dissertations. 2145.