Publication Date



Open access

Embargo Period


Degree Type


Degree Name

Doctor of Philosophy (PHD)


Electrical and Computer Engineering (Engineering)

Date of Defense


First Committee Member

Mei-Ling Shyu

Second Committee Member

Xiaodong Cai

Third Committee Member

Saman Aliari Zonouz

Fourth Committee Member

Nigel John

Fifth Committee Member

Shu-Ching Chen


The growth of the Internet has dramatically increased the number of online videos, which places new demands on video search engines for automatic retrieval and classification. We propose an unsupervised moving object detection and retrieval framework that exploits and analyzes spatio-temporal visual information in video sequences. The motivation is to use visual content information to estimate the locations of moving objects in the spatio-temporal domain. Compared with existing approaches, the proposed detection algorithm is unsupervised: it does not need to train models for specific objects, which also makes it suitable for detecting unknown objects. After object detection, object-level features can be extracted for video retrieval. The proposed moving object detection algorithm consists of two layers: a global motion estimation layer and a local motion estimation layer. The two layers explore and estimate motion information at different scopes in the spatio-temporal domain. The global motion estimation layer uses a temporal-centered estimation method to obtain a preliminary region of motion. Specifically, it analyzes motion in the temporal domain using our proposed motion representation, the weighted histogram of Harris3D volume, which combines the optical flow field and the Harris3D corner detector to obtain a good spatio-temporal estimate in video sequences. The idea is to take advantage of two sources of motion knowledge, identified by different methods, so that complementary motion data are kept in the new motion representation. Because it considers integrated motion information, the method works well with dynamic backgrounds and camera motion, and demonstrates the advantages of integrating multiple spatio-temporal cues in the proposed framework. 
In addition, a center-surround coherency evaluation model is proposed to compute the local motion saliency and weight the spatio-temporal motion, so that the region of a moving object can be found by the integral density algorithm. The global motion estimation layer passes the preliminary region of motion to the local motion estimation layer. The latter uses a spatial-centered estimation method that integrates visual information spatially across adjacent frames to obtain the region of the moving object. The visual information in each frame is analyzed to find visual key locations, defined as the maxima and minima of the difference-of-Gaussian function. A motion map of adjacent frames, which represents the temporal information, is obtained from the differences between the outcomes of the simultaneous partition and class parameter estimation (SPCPE) framework. The motion map filters the visual key locations into key motion locations (KMLs), which imply the presence of a moving object. The integral density method is then employed to find the region with the highest density of KMLs, which is taken as the moving object. The features extracted from the motion region are used to train global Gaussian mixture models for video representation. This representation significantly reduces the classification model training time compared with using the whole feature set, and it also achieves better classification performance. When combined with scene information, the performance is further enhanced. Besides the proposed spatio-temporal object detection work, two related methods are also proposed because they play subsidiary roles in the detection model. One is a new key frame detection method that selects representative frames as key frames to provide key locations for the spatial-centered estimation method. 
By analyzing the visual differences between frames and applying a clustering technique, a set of key frame candidates is first selected at the shot level; the information within a video shot and between video shots is then used to adaptively filter the candidate set and generate the final set of key frames for spatial motion analysis. The other is a method to segment and track two objects under occlusion, which is useful in multiple-object detection scenarios.
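
The abstract does not spell out the integral density method, so the following is only an illustrative sketch of the general idea: key locations are filtered by a binary motion map into KMLs, and a summed-area table (integral image) is then used to find the window containing the most KMLs. The function names (`filter_key_locations`, `integral_image`, `densest_window`), the binary motion map, and the fixed window size are all assumptions made for this example and are not taken from the dissertation.

```python
from typing import List, Sequence, Tuple

Grid = List[List[int]]


def filter_key_locations(key_locations: Sequence[Tuple[int, int]],
                         motion_map: Grid) -> Grid:
    """Keep only key locations (row, col) that lie on a moving pixel.

    Returns a 0/1 mask of key motion locations (KMLs).
    Assumption: motion_map is a binary grid (non-zero means motion).
    """
    h, w = len(motion_map), len(motion_map[0])
    mask = [[0] * w for _ in range(h)]
    for y, x in key_locations:
        if 0 <= y < h and 0 <= x < w and motion_map[y][x]:
            mask[y][x] = 1
    return mask


def integral_image(mask: Grid) -> Grid:
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(mask), len(mask[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += mask[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def densest_window(mask: Grid, win_h: int, win_w: int) -> Tuple[int, int, int]:
    """Return (top, left, count) of the win_h x win_w window with the most KMLs.

    Assumption: a fixed window size; ties go to the first window found
    in row-major order.
    """
    h, w = len(mask), len(mask[0])
    if not (0 < win_h <= h and 0 < win_w <= w):
        raise ValueError("window must fit inside the mask")
    sat = integral_image(mask)
    best = (0, 0, -1)
    for top in range(h - win_h + 1):
        for left in range(w - win_w + 1):
            bottom, right = top + win_h, left + win_w
            # Four table lookups give the number of KMLs in the window.
            count = (sat[bottom][right] - sat[top][right]
                     - sat[bottom][left] + sat[top][left])
            if count > best[2]:
                best = (top, left, count)
    return best
```

With the table precomputed, each candidate window costs four lookups regardless of its size, which is why an integral-image formulation is a natural fit for a density search over many regions. In practice the key locations would come from a difference-of-Gaussian detector and the motion map from frame-to-frame segmentation differences; here both are simply passed in as inputs.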


spatio-temporal; object detection; video retrieval; action recognition; key frame; motion estimation