Publication Date

2010-12-16

Availability

Open access

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PHD)

Department

Electrical and Computer Engineering (Engineering)

Date of Defense

2010-12-14

First Committee Member

Akmal Younis - Committee Chair

Second Committee Member

Miroslav Kubat - Committee Member

Third Committee Member

Sawsan Khuri - Committee Member

Fourth Committee Member

Nigel M John - Committee Member

Fifth Committee Member

Mei-Ling Shyu - Committee Member

Abstract

Grid computing emerged as a framework for supporting complex operations over large datasets; it enables the harnessing of large numbers of processors working in parallel to solve computing problems that typically spread across various domains. We focus on the problems of data management in a grid/cloud environment. The broader context of designing a services oriented architecture (SOA) for information integration is studied, identifying the main components for realizing this architecture. The BioFederator is a web services-based data federation architecture for bioinformatics applications. Based on collaborations with bioinformatics researchers, several domain-specific data federation challenges and needs are identified. The BioFederator addresses such challenges and provides an architecture that incorporates a series of utility services; these address issues like automatic workflow composition, domain semantics, and the distributed nature of the data. The design also incorporates a series of data-oriented services that facilitate the actual integration of data. Schema integration is a core problem in the BioFederator context. Previous methods for schema integration rely on the exploration, implicit or explicit, of the multiple design choices that are possible for the integrated schema. Such exploration relies heavily on user interaction; thus, it is time consuming and labor intensive. Furthermore, previous methods have ignored the additional information that typically results from the schema matching process, that is, the weights and in some cases the directions that are associated with the correspondences. We propose a more automatic approach to schema integration that is based on the use of directed and weighted correspondences between the concepts that appear in the source schemas. A key component of our approach is a ranking mechanism for the automatic generation of the best candidate schemas. The algorithm gives more weight to schemas that combine the concepts with higher similarity or coverage. Thus, the algorithm makes certain decisions that otherwise would likely be taken by a human expert. We show that the algorithm runs in polynomial time and moreover has good performance in practice. The proposed methods and algorithms are compared to the state of the art approaches. The BioFederator design, services, and usage scenarios are discussed. We demonstrate how our architecture can be leveraged on real world bioinformatics applications. We preformed a whole human genome annotation for nucleosome exclusion regions. The resulting annotations were studied and correlated with tissue specificity, gene density and other important gene regulation features. We also study data processing models on grid environments. MapReduce is one popular parallel programming model that is proven to scale. However, using the low-level MapReduce for general data processing tasks poses the problem of developing, maintaining and reusing custom low-level user code. Several frameworks have emerged to address this problem; these frameworks share a top-down approach, where a high-level language is used to describe the problem semantics, and the framework takes care of translating this problem description into the MapReduce constructs. We highlight several issues in the existing approaches and alternatively propose a novel refined MapReduce model that addresses the maintainability and reusability issues, without sacrificing the low-level controllability offered by directly writing MapReduce code. We present MapReduce-LEGOS (MR-LEGOS), an explicit model for composing MapReduce constructs from simpler components, namely, "Maplets", "Reducelets" and optionally "Combinelets". Maplets and Reducelets are standard MapReduce constructs that can be composed to define aggregated constructs describing the problem semantics. This composition can be viewed as defining a micro-workflow inside the MapReduce job. Using the proposed model, complex problem semantics can be defined in the encompassing micro-workflow provided by MR-LEGOS while keeping the building blocks simple. We discuss the design details, its main features and usage scenarios. Through experimental evaluation, we show that the proposed design is highly scalable and has good performance in practice.

Keywords

Data Federation; Data Integration; Schema Integration; Bioinformatics; Grid Computing; Cloud Computing; Mapreduce; Hadoop; Data Management; Extract Transform Load

Share

COinS