Publication Date



Open access

Embargo Period


Degree Type


Degree Name

Doctor of Philosophy (PHD)


Electrical and Computer Engineering (Engineering)

Date of Defense


First Committee Member

Mei-Ling Shyu

Second Committee Member

Xiaodong Cai

Third Committee Member

Saman Aliari Zonouz

Fourth Committee Member

Nigel John

Fifth Committee Member

Shu-Ching Chen


With the proliferation of digital photo-capture devices and the development of web technologies, the era of big data has arrived, posing challenges for processing and retrieving vast amounts of data with heterogeneous and diverse dimensionality. In the field of multimedia information retrieval, traditional keyword-based approaches perform well on text data, but they adapt poorly to images and videos because a large proportion of this data is unorganized: the textual descriptions of images or videos, also known as metadata, may be unavailable, incomplete, or even incorrect. Therefore, Content-Based Multimedia Information Retrieval (CBMIR) has emerged, which retrieves relevant images or videos by analyzing their visual content. Various data mining techniques, such as feature selection, classification, clustering, and filtering, have been utilized in CBMIR to address issues involving data imbalance, data quality and size, limited ground truth, user subjectivity, etc. However, as an intrinsic problem of CBMIR, the semantic gap between low-level visual features and high-level semantics remains difficult to bridge. The rapid popularization of social media repositories, which allow users to upload images and videos and assign tags to describe them, has brought new directions as well as new challenges to the area of multimedia information retrieval. As the name suggests, multimedia is a combination of different content forms, including text, audio, images, and videos. A series of research studies have been conducted to take advantage of one modality to compensate for the other in various tasks.
The framework proposed in this dissertation focuses on integrating visual information and text information, referred to as the content and the context modalities respectively, for multimedia big data retrieval. The framework contains two components, namely MCA-based feature selection and sparse linear integration. First, a feature selection method based on Multiple Correspondence Analysis (MCA) is proposed to select features that are highly correlated with a given class, since these features provide more discriminative information when predicting class labels. This is especially useful for the context modality, since the tags that users assign to images or videos are known to be very noisy. Selecting discriminative tags not only removes noise but also reduces the feature dimensionality. Because MCA is a technique for analyzing nominal features, a discretization method based on MCA is developed to handle numeric features. Then the sparse linear integration component takes the selected features from both modalities as inputs and builds a model that learns a pairwise instance similarity matrix. An optimization problem is formulated to minimize the differences between the learned similarity matrix and the similarity matrices generated from the context modality and from the content modality. Coordinate descent and soft-thresholding can be applied to solve the problem. Compared to existing approaches, the proposed framework is able to handle noisy and high-dimensional features in each of the modalities. Feature correlations are taken into account, and no local decision or handcrafted structure is required. The methods presented in this framework can be carried out in parallel, so parallel and distributed programming frameworks such as MapReduce can be adopted to increase computing capacity and scale to very large data sets.
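As a rough illustration of how coordinate descent and soft-thresholding combine to solve a sparse linear problem, the sketch below applies them to a generic lasso objective, 0.5 * ||y - Xw||^2 + lam * ||w||_1. This is an assumption chosen for illustration and not the dissertation's actual objective, which is defined over modality similarity matrices; the function names `soft_threshold` and `lasso_coordinate_descent` and the parameters `lam` and `n_iters` are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the band become 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 by cyclic coordinate descent.

    Generic lasso sketch (an assumption), not the dissertation's formulation.
    X is a list of rows, y is a list of targets.
    """
    n_rows, n_cols = len(X), len(X[0])
    w = [0.0] * n_cols
    residual = list(y)  # residual = y - X w, and w starts at zero
    for _ in range(n_iters):
        for j in range(n_cols):
            column = [X[i][j] for i in range(n_rows)]
            norm_sq = sum(v * v for v in column)
            if norm_sq == 0.0:
                continue  # an all-zero feature cannot reduce the loss
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * r for c, r in zip(column, residual)) + norm_sq * w[j]
            new_wj = soft_threshold(rho, lam) / norm_sq
            delta = new_wj - w[j]
            if delta != 0.0:
                # Keep the residual consistent with the updated weight.
                residual = [r - delta * c for r, c in zip(residual, column)]
                w[j] = new_wj
    return w


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [3.0, 0.5]
    print(lasso_coordinate_descent(X, y, lam=1.0))
```

Each coordinate update has a closed form, and the soft-thresholding step sets weakly supported weights exactly to zero, which is what yields a sparse solution.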
In the experiments, multiple public benchmark data sets, including collections of images and videos, are used to evaluate each of the components. Comparisons with several popular existing approaches verify the effectiveness of the proposed methods for the task of semantic concept retrieval. Two applications of the proposed methods to content-based recommender systems are presented. The first uses the sparse linear integration model to find similar items by considering information from both images and their metadata. Experiments and a subjective evaluation are conducted on a self-collected bag data set for online shopping recommendations. The second applies a topic model to the features extracted from videos and their metadata to determine topics in a unified manner. This application recommends to users movies with similar distributions over textual topics and visual topics. The benchmark MovieLens1M data set is used for evaluation. Several research directions are identified to improve the framework for various practical challenges.


Multimedia Retrieval; Information Integration; Recommender System; Semantic Concept Detection; Feature Selection