META DATA:
THE KEY TO DATA WAREHOUSE DESIGN
( A SYSTEMS ENGINEERING APPROACH )

AUTHORS:

[Patrick Burton] [Stephanie Green] [ENSE623]
[Project Proposal] [Related Links]

Table of Contents
I. Introduction
II. Data Warehouse Life-Cycle
System Planning
System Requirements & Analysis
System Design
System Integration
System Verification and Maintenance
III. Conclusion
IV. Case Study
IV. Case Tool
V. Bibliography




INTRODUCTION

"The data warehouse concept sprang from the growing competitive need to quickly analyze business information." Typical relational databases which were designed for on-line transactional processing (OLTP) do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently then traditional relational databases. As the value of data warehouses and their associated OLAP capabilities have increased, comprehensive meta data management has proven to be a vital element to the success or failure of data warehouses. Without accurate meta data, a data warehouse rapidly becomes unmanageable and ineffective.

Meta data is literally "data about data". It describes the kind of information in the warehouse, where it stored, how it relates to other information, where it comes from, and how it is related to the business. The topic of standardizing meta data across various products and applying a systems engineering approach to this process in order to facilitate data warehouse design is what this project intends to address.

A data warehouse stores current and historical data from disparate operational systems (i.e. transactional databases) into one consolidated system where data is cleansed and restructured to support data analysis. We intend to create a design process for creating a data warehouse through the development of a metadata system. The meta data system we intend to create will provide a framework to organize and design the data warehouse. Using a systems engineering approach, this process will be used initially to help define the data warehouse requirements and it will them be used iteratively during the life of the data warehouse to update and integrate new dimensions of the warehouse. We will illustrate our process with a case study using an advertising/product-based business as the data warehouse application.

In order to be effective, the user of the data warehouse must have access to meta data that is accurate and up to date. Without a good source of meta data to operate from, the job of the analyst is much more difficult and the work required for analysis is compounded significantly.

Understanding the role of the enterprise data model relative to the data warehouse is critical to a successful design and implementation of a data warehouse and its meta data management. Some specific challenges which involve the physical design tradeoffs of a data warehouse include: For years, data has aggregated in the corporate world as a by-product of transaction processing and other uses of the computer. For the most part, the data that has accumulated has served only the immediate set of requirements for which is was initially intended. The data warehouse presents an alternative in data processing that represent the integrated information requirements of the enterprise. It is designed to support analytical processing for the entire business organization. Analytical processing looks across either various pieces of or the entire enterprise and identifies trends and patterns otherwise not apparent. These trends and patterns are absolutely vital to the vision of management in directing the organization.

Meta data itself is arguably the most critical element in effective data management. Effective tools must be used to properly store and use the meta data generated by the various systems. This paper outlines a Systems Engineering approach to designing and maintaining a data warehouse, emphasizing the key element of metadata. This approach is then used as a template for a case study of the development of an actual data warehouse in a product based business. Finally, a metadata system is created and explained as a tool for designing and updating a data warehouse.

[Menu]


SYSTEM PLANNING
The Systems Planning Phase of the life cycle communicates an overall vision for the data warehouse activity and its role in the organization's daily and weekly life. Decisions made during this phase have significant impact on the implementation, scope, and size of the effort. It begins with the identification of a need and then defines timeliness, tasks and deliverables. Then it proceeds through the following key planning decisions:

[Menu]


SYSTEM REQUIREMENTS & ANALYSIS
This Phase identifies the requirements/functions that the data warehouse will deliver. In addition to the features and functions needed, the requirements will clearly describe the operating environment in which the data warehouse will be delivered.

The amount of requirements gathering depends on the implementation approach that you take. The goal is to get an understanding of the core use of initial data and to identify other users who may need to leverage access to the data.

The primary purpose of collecting end user requirements for a data warehouse is to understand how users conduct their business, what data they currently use, and what they would like to do in the future. You should be able to further decompose this information into Business entities and their attributes, relationships between the entities, and hierarchies.

The requirements can be gathered through a series of interviews with the different users. Answers to the following questions from the different users will generate the requirements needed for further development of the data warehouse.
[Menu]


SYSTEM DESIGN
The systems design phase is important because it determines how the requirements will be met. The main focus of this phase is to convert the systems requirements into a set of system-level specifications through deriving logical and physical data models for the data warehouse (i.e. E-R Diagrams and MetaModels).

The specifications are then used to generate the data warehouse extractors and transformation, integration, summarization, and aggregation software. Next, a "Build versus Buy" trade-off analysis is performed. Decisions are complicated by technical, economic, and political considerations. A poor decision can ruin an otherwise successful analysis and design. On the other hand, an appropriate selection of vendor-supplied component or incorporation of existing investment can make it possible to build a quick and effective data warehouse.

Finally, processes are identified to connect the data sources, the data warehouse, and the end user access tools together.

[Menu]


SYSTEM INTEGRATION
The System Integration phase is started in conjunction with the design phase. This phase encompasses locating the source of the data in the operational systems doing analysis to understand what types of data transformations may be needed, and mapping the source data to the target data within the warehouse design. The challenge during this phase is to understand how to incorporate existing investments in platforms, technology, and know-how. Systems integration capabilities tie vendor systems together with existing data sources and existing and proposed access tools. Proper selection and evaluation of vendor supplied components and Metadata Management will prove invaluable in successfully integrating the disparate sources of data into the warehouse.

The main focus of this phase is developing procedures to extract and move data in a form that can then be loaded into the warehouse. Programs and transformation tools should be used to reduce the customized programming required to transform and integrate the data. Finally, the data must be analyzed to determine whether or not certain elements should be cleansed prior to putting it into the warehouse. Business rules should be established to standardize the data formats within the warehouse if possible.

[Menu]


SYSTEM VERIFICATION & MAINTENANCE
The Verification phase is relatively straightforward but it also requires a lot of diligence. The best methods for verifying the data in the warehouse is to formulate applications (reports) on the data in the warehouse and compare it to figures based on the data prior to being put in the warehouse which are known to be correct. Often the users themselves are the best people to verify the data because they are often more familiar with the specific type of data they are looking for. Finally, Maintenance is imperative at every stage of the life-cycle. Primarily this entails documentation of programs, applications and most importantly, metadata. Although most warehouse managers realize that maintaining metadata is of critical importance to the viability of the data warehouse, documentation is also the task which most often gets pushed to the side when more urgent tasks for the users arrive. This is a dangerous precedent to set, because once a growing data warehouse gets behind in documenting metadata, it is on its way to being a useless tool which will become irrelevant because no one will have any faith in the data which it stores.


[Menu]


CONCLUSION
As the case study below illustrates in detail, following a coherent methodology which emphasizes the importance of metadata is a critical factor which often determines the success or failure of a data warehouse project. By being diligent and insisting that proper documentation be maintained, you will ensure that the data warehouse will have a better design overall, and the pattern of waterfall sequence of the warehouse life-cycle will become more habitual.

[Menu]


CASE STUDY
[Menu]


CASE TOOl

SLATE (System Level Automation Tool for Engineers) by TD Technologies is a case tool that allows one to build a system design model in the form of a hierarchy of trees. An object oriented database is used where each tree is made up of blocks diagrams. Attached to each diagram are objects which represent Requirements, Documents, Abstraction Hierarchies, Traceability Relationships, Budgets and other information related objects used by engineers to develop systems. Slate, with its flexible design approaches, provides engineers with to create optimal designs from a global perspective.

Due to SLATE's explicit line of functions & features, we chose to utilize SLATE to demonstrate how important it is to begin any design with a High-Level Hierarchical view followed with bottom-up composition starting with MetaData.

The Data Warehouse Architecture divides into the following Abstraction Blocks: Data Source, Data Warehouse, Transportor, Access & Usage, and MetaData. Below are screen captures of the Hierarchical Design and further descriptions of each abstraction block.



Hierarchical tree structure (view)



Hierarchical outline structure (view)

  • Data Source Block
    This block represents the many different types of data that can be integrated into the Data Warehouse. Data comes from on-line transactional databases, historical data, and syndicated sources such as the Dow Jones Retrieval. MetaData must be developed on each pieces of data coming from the selected sources. This will provide information on the name of the data captured and extracted from the sources, the definition of the contents, the date of creation, and the source of the data.

  • Data Warehouse Block
    This block consist of a Re-Engineering abstraction and Modeling abstraction. The Re-Engineering component is responsible for conditioning the data to meet the analysis needs of the user. This involves Integrating (Integrator) the many different data types from multiple systems into the the Target system. Next, the Translator formats the data so it can be combined uniformly and consistently. It also re-maps the data to the original source to enable refreshing of the newly created data. The Modeling component derives data models from the enterprise or E-R models, if one exits, or creates new models. This is also where design issues concerning data partitions, subject areas, and granularity are handled. MetaData is further obtained in each of the above abstraction.

  • Transport Block
    This block addresses the task of transporting the data through the system. It addresses the necessary communications bridges between the hardware/software platforms, network protocols, network management systems, security, and middleware components. MetaData also must be provided in this block.

  • Access & Usage Block
    This block consist of two abstraction blocks: Access & Retrieval, and Analysis and Reporting. This block provides the payoff and the value to the entire data warehouse implementation. The Access & Retrieval abstraction is responsible for transforming the retrieved data into views for subsequent analysis. The Analysis & Reporting abstraction is responsible for the family of tools and applications needed to get value from the data warehouse. These tools can be classified into reporting tools, analysis and decision-support tools, and data mining tools. This abstraction is also responsible for developing navigation tools for metadata in order to understand and report on the contents of the data warehouse. Metadata development is also included in both abstraction. Metadata here can assist with verifying the reliability of the data, determining the location of the data ,and figuring out how to access and use the actual data.

  • MetaData Block
    This block provides assistance to the designer in understanding, monitoring, and managing the data in the data warehouse. It provides a road map of data to the user. Without metadata, the data warehouse becomes and abyss of meaningless scraps of information.

    The Hierarchical Architecture is designed to meet the requirements of all stakeholders of the data warehouse. It presents a unified and common way of looking at the components of a data warehouse solution while providing a firm foundation for planning, analysis, estimation, risk assessment, and a blueprint for implementation. The entire Architecture is built on the concept of data definitions, or Metadata. Metadata which is captured from each block pervades every activity of the data warehouse.


    BIBLIOGRAPHY

    Poe Vidette, "Building A Data Warehouse for Decision Support"
    ....Prentice Hall, New Jersey, 1995.

    Herb Edelstein, "Planning & Designing the Data Warehouse"
    ...Prentice Hall, New Jersey, 1996.

    John Makulowich, "Data Warehouse Market Draws a Crowd"
    ...Washington Technology, Oct. 24, 1996.

    Wayne Eckerson, "Metadata Hullabaloo"
    ...Oracle Magazine, Mar/Apr 1996.

    Dave Menninger, "Designing a Database for OLAP"
    ...Oracle Magazine, Mar/Apr 1996.

    Douglas Hackney, "Metadata Maturity"
    ...Data Management Review, Mar 1996.

    Larry Greenfield, "Don't Let Data Warehousing GOTCHAS Getcha"
    ...Datamation, Mar 1,1996.

    Satya Sachdeve, "Metadata: Guiding Users Through Disparate Data Layers"
    ...Application Development Tools, Dec 1995.

    Kevin Strehlo, "Data Warehousing: Avoid Planned Obsolescence"
    ...Datamation, Jan 15, 1996.

    Craig S. Mullins, "Dealing with Data Outhouses"
    ...Data Management Review, Mar 1996.

    Mark Austin, "ENSE623: Systems Engineering Design Projects, Volumes I & II"
    ...Institute for Systems Research, 1995-1996.


    [Menu]


    RELATED LINKS
    [Menu]



    [Return to Menu]