Literature review about data warehouse

Chapter 2 LITERATURE REVIEW

2.1 Introduction

Chapter 2 provides a literature review of data warehouse, OLAP MDDB and data mining concepts. We reviewed the concepts, characteristics, design and implementation approach of each of these technologies to identify a suitable data warehouse framework. This framework will support the integration of OLAP MDDB and the data mining model.

Section 2.2 discusses the fundamentals of the data warehouse, which include data warehouse models and data processing techniques such as extract, transform and load (ETL) processes. A comparative study was done on the data warehouse models introduced by William Inmon (Inmon, 1999), Ralph Kimball (Kimball, 1996) and Matthias Nicola (Nicola, 2000) to identify a suitable model, design and characteristics.

Section 2.3 introduces the OLAP model and architecture. We also discuss the concept of processing in OLAP-based MDDB, and MDDB schema design and implementation. Section 2.4 introduces data mining techniques, methods and processes for OLAP mining (OLAM), which is used to mine MDDB. Section 2.5 concludes the literature review, with particular attention to the pointers behind our decision to propose a new data warehouse model.

Since we propose to use Microsoft® products to implement the proposed model, we also discuss a product comparison to justify why Microsoft® products were selected.

2.2 DATA WAREHOUSE

According to William Inmon, a data warehouse is a "subject-oriented, integrated, time-variant, and non-volatile collection of data in support of the management's decision-making process" (Inmon, 1999). A data warehouse is a database containing data that usually represents the business history of an organization. This historical data is used for analysis that supports business decisions at many levels, from strategic planning to performance evaluation of a discrete organizational unit.

It provides an effective integration of operational databases into an environment that enables strategic use of data (Zhou, Hull, King and Franchitti, 1995). These technologies include relational and MDDB management systems, client/server architecture, meta-data modelling and repositories, graphical user interfaces and much more (Hammer, Garcia-Molina, Labio, Widom, and Zhuge, 1995; Harinarayan, Rajaraman, and Ullman, 1996).

The emergence of cross-discipline domains such as knowledge management in finance, health and e-commerce has shown that vast amounts of data need to be analysed. The evolution of data in a data warehouse can provide multiple dataset dimensions to solve various problems. Therefore, the critical decision-making process on these datasets needs a suitable data warehouse model (Barquin and Edelstein, 1996).

The main proponents of the data warehouse are William Inmon (Inmon, 1999) and Ralph Kimball (Kimball, 1996), but they have different perspectives on the data warehouse in terms of design and architecture. Inmon (Inmon, 1999) defined the data warehouse as a dependent data mart structure, while Kimball (Kimball, 1996) defined the data warehouse as a bus-based data mart structure. Table 2.1 discusses the differences in data warehouse structure between William Inmon and Ralph Kimball.

A data warehouse is a read-only data source where end users are not allowed to change the values or data elements. Inmon's (Inmon, 1999) data warehouse architecture strategy is different from Kimball's (Kimball, 1996). Inmon's data warehouse model splits off data marts as copies, distributed as an interface between the data warehouse and end users. Kimball views the data warehouse as a union of data marts: the data warehouse is the collection of data marts combined into one central repository. Figure 2.1 illustrates the differences between Inmon's and Kimball's data warehouse architectures, adopted from (Mailvaganam, 2007).

Although Inmon and Kimball have different design views of the data warehouse, they do agree that the successful implementation of a data warehouse depends on an effective collection of operational data and validation of the data marts. Database staging and ETL processes are inevitable components in both researchers' data warehouse designs. Both believed that a dependent data warehouse architecture is necessary to fulfil the requirements of enterprise end users in terms of preciseness, timing and data relevance.

2.2.1 DATA WAREHOUSE ARCHITECTURE

Although data warehouse architecture has a wide research scope and can be viewed from many perspectives, (Thilini and Hugh, 2005) and (Eckerson, 2003) provide some meaningful ways to view and analyse it. Eckerson states that a successful data warehouse system depends on a database staging process which derives data from different integrated Online Transactional Processing (OLTP) systems. In this case, the ETL process plays a crucial role in making the database staging process workable. A survey on the factors that influenced the selection of data warehouse architecture by (Thilini, 2005) identifies five data warehouse architectures that are in common use, as shown in Table 2.2.

Independent Data Marts

Independent data marts are also known as localized or small-scale data warehouses. They are mainly used by departments or divisions of a company to provide individual operational databases. This type of data mart is simple, yet it takes different forms derived from multiple design structures and various inconsistent database designs, which complicates cross-data-mart analysis. Since every organizational unit tends to build its own database, which operates as an independent data mart ((Thilini and Hugh, 2005), citing the work of (Winsberg, 1996) and (Hoss, 2002)), it is best used as an ad hoc data warehouse and as a prototype before building a real data warehouse.

Data Mart Bus Architecture

(Kimball, 1996) pioneered the design and architecture of the data warehouse as a union of data marts, known as the bus architecture or virtual data warehouse. The bus architecture allows data marts to be located not only on one server but also on different servers. This allows the data warehouse to function in a more virtual manner, combining all data marts and processing them as one data warehouse.

Hub-and-spoke architecture

(Inmon, 1999) developed the hub-and-spoke architecture. The hub is the central server that takes care of information exchange, and the spokes handle data transformation for all regional operational data stores. Hub-and-spoke mainly focuses on building a scalable and maintainable infrastructure for the data warehouse.

Centralized Data Warehouse Architecture

Centralized data warehouse architecture is built on the hub-and-spoke architecture but without the dependent data mart component. This architecture copies and stores heterogeneous operational and external data in a single, consistent data warehouse. It has only one data model, which is consistent and complete across all data sources. According to (Inmon, 1999) and (Kimball, 1996), a central data warehouse should include database staging, also known as an operational data store, as an intermediate stage for operational processing of data integration before the data is transformed into the data warehouse.

Federated Architecture

According to (Hackney, 2000), a federated data warehouse is an integration of multiple heterogeneous data marts, database staging or operational data stores, and a combination of analytical applications and reporting systems. The federated concept focuses on an integrated framework to make the data warehouse more reliable. (Jindal, 2004) concludes that the federated data warehouse is a practical approach, as it focuses on higher reliability and provides excellent value.

(Thilini and Hugh, 2005) conclude that the hub-and-spoke and centralized data warehouse architectures are similar. The centralized architecture is faster and easier to implement because no dependent data marts are required, and it scored higher than hub-and-spoke where there is an urgent need for a relatively fast implementation approach.

In this work, it is very important to identify which data warehouse architecture is robust and scalable in terms of building and deploying enterprise-wide systems. (Laney, 2000) states that the selection of an appropriate data warehouse architecture must incorporate the successful characteristics of various data warehouse models. It is evident that two data warehouse architectures prove to be popular, as shown by (Thilini and Hugh, 2005), (Eckerson, 2003) and (Mailvaganam, 2007): first, the hub-and-spoke architecture proposed by (Inmon, 1999), which is a data warehouse with dependent data marts; and second, the data mart bus architecture with dimensional data marts proposed by (Kimball, 1996). The new proposed model will use the hub-and-spoke data warehouse architecture, which can be used for MDDB modelling.

2.2.2 DATA WAREHOUSE EXTRACT, TRANSFORM, LOAD

The data warehouse architecture process begins with the ETL process, which ensures that the data passes the quality threshold. According to Evin (2001), it is essential to have the right dataset. ETL is an important component of the data warehouse environment, ensuring that datasets arriving in the data warehouse from various OLTP systems are cleansed. ETL processes are also responsible for running scheduled tasks that extract data from OLTP systems. Typically, a data warehouse is populated with historical data from within a particular organization (Bunger, Colby, Cole, McKenna, Mulagund, and Wilhite, 2001). The complete process descriptions of ETL are discussed in Table 2.3.

A data warehouse database can be populated from a wide variety of data sources in different locations, so collecting all the different datasets and storing them in one central location is an extremely challenging task (Calvanese, Giacomo, Lenzerini, Nardi, and Rosati, 2001). However, ETL processes eliminate much of the complexity of data population via a simplified process, as depicted in Figure 2.2. The ETL process begins with data extraction from operational databases, where data cleansing and scrubbing are done to ensure that all data is validated. The data is then transformed to meet the data warehouse standards before it is loaded into the data warehouse.
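The extract, cleanse, transform and load steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `visit_fact` table and every function name are hypothetical and do not come from the reviewed literature, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from OLTP source systems.
SOURCE_ROWS = [
    {"patient_id": "P001", "visit_date": "2009-01-15", "charge": "120.50"},
    {"patient_id": "p002 ", "visit_date": "2009-01-16", "charge": "80"},
    {"patient_id": "", "visit_date": "2009-01-17", "charge": "45.00"},    # missing key
    {"patient_id": "P003", "visit_date": "2009-01-18", "charge": "n/a"},  # invalid measure
]


def extract(rows):
    """Extract: read the raw records from the operational source."""
    return list(rows)


def clean(rows):
    """Cleanse: keep only rows that have a key and a numeric charge."""
    valid = []
    for row in rows:
        key = row["patient_id"].strip()
        try:
            charge = float(row["charge"])
        except ValueError:
            continue  # reject rows whose measure is not numeric
        if key:
            valid.append({**row, "patient_id": key, "charge": charge})
    return valid


def transform(rows):
    """Transform: conform values to the (assumed) warehouse standard."""
    return [(r["patient_id"].upper(), r["visit_date"], r["charge"]) for r in rows]


def load(conn, rows):
    """Load: write the conformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visit_fact "
        "(patient_id TEXT, visit_date TEXT, charge REAL)"
    )
    conn.executemany("INSERT INTO visit_fact VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, source_rows):
    """Run the full pipeline and return what ended up in the warehouse."""
    load(conn, transform(clean(extract(source_rows))))
    return conn.execute(
        "SELECT patient_id, visit_date, charge FROM visit_fact ORDER BY patient_id"
    ).fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"), SOURCE_ROWS)` loads only the two valid rows; the row with the missing key and the row with the non-numeric charge are rejected during cleansing. In a real deployment the pipeline would normally be triggered by a scheduled job, as the text describes.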

(Zhou et al, 1995) state that during the data integration process in a data warehouse, ETL can assist in the import and export of operational data between heterogeneous data sources using an Object Linking and Embedding Database (OLE-DB) based architecture, where the data is transformed so that only validated data is populated into the data warehouse.

In (Kimball, 1996), the data warehouse architecture depicted in Figure 2.3 focuses on three important modules: "the back room", the "presentation server" and "the front room". ETL processes are implemented in the back room, where the data staging services are in charge of gathering the operational databases of all source systems and extracting data from different file formats, systems and platforms. The second step is to run the transformation process to ensure that all inconsistencies are removed and data integrity is preserved. Finally, the data is loaded into data marts. The ETL processes are usually executed from a job control via scheduled tasks. The presentation server is the data warehouse, where data marts are stored and processed. Data is stored in a star schema consisting of dimension and fact tables. The data is then processed in the front room, where it is accessed by query services such as reporting tools, desktop tools, OLAP and data mining tools.

Although ETL processes prove to be an essential component in ensuring data integrity in the data warehouse, the issues of complexity and scalability play an important role in deciding the type of data warehouse architecture. One way to achieve a scalable, non-complex solution is to adopt a "hub-and-spoke" architecture for the ETL process. According to Evin (2001), ETL operates best in a hub-and-spoke architecture because of its flexibility and efficiency. A centralized data warehouse design can influence the maintenance of full access control over ETL processes.

ETL processes in a hub-and-spoke data warehouse architecture are recommended in (Inmon, 1999) and (Kimball, 1996). The hub is the data warehouse, populated after data has been processed from the operational database into the staging database, and the spokes are the data marts that distribute the data. Sherman, R (2005) states that the hub-and-spoke approach uses one-to-many interfaces from the data warehouse to many data marts. One-to-many interfaces are simpler to implement, cost-effective in the long run and ensure consistent dimensions. By comparison, the many-to-many approach is more complicated and costly.

2.2.3 DATA WAREHOUSE FAILURE AND SUCCESS FACTORS

Building a data warehouse is indeed a challenging task, as a data warehouse project inherits unique characteristics that may influence the overall reliability and robustness of the data warehouse. These factors can be addressed during the analysis, design and implementation phases to ensure a successful data warehouse system. Section 2.2.3.1 focuses on the factors that influence data warehouse project failure. Section 2.2.3.2 discusses the success factors, that is, implementing the right model to support a successful data warehouse project.

2.2.3.1 DATA WAREHOUSE FAILURE FACTORS

Studies by (Hayen, Rutashobya, and Vetter, 2007) show that implementing a data warehouse project is costly and risky, as a data warehouse project can cost over $1 million in the first year. It is estimated that two-thirds of the efforts to set up data warehouse projects will eventually fail. (Hayen et al, 2007), citing the work of (Briggs, 2002) and (Vassiliadis, 2004), noted three groups of factors behind the failure of data warehouse projects, namely environment, project and technical factors, as shown in Table 2.4.

Environment factors arise from organizational changes in terms of business, politics, mergers, takeovers and lack of top management support. They include human error, corporate culture, the decision-making process and poor change management (Watson, 2004) (Hayen et al, 2007).

Poor technical knowledge of the requirements for data definitions and data quality from different organizational units may cause data warehouse failure. Incompetent and insufficient knowledge of data integration, and poor selection of the data warehouse model and data warehouse analysis applications, may cause huge failures.

In spite of heavy investment in hardware, software and people, poor project management may lead to data warehouse project failure. For example, assigning a project manager who lacks knowledge and project experience in data warehousing may hinder the quantification of the return on investment (ROI) and the achievement of the project triple constraint (cost, scope, time).

Data ownership and accessibility is a potential factor that may cause data warehouse project failure. It is considered a sensitive issue within the organization: one must not share or acquire someone else's data, as this is regarded as losing authority over the data (Vassiliadis, 2004). Therefore, it emphasizes restricting any department from declaring total ownership of pure, clean and error-free data, which might cause potential problems over the ownership of data rights.

2.2.3.2 DATA WAREHOUSE SUCCESS FACTORS

(Hwang M.I., 2007) stresses that data warehouse implementations are an important area of research and industrial practice, but only a few studies have assessed the critical success factors for data warehouse implementations. He conducted a survey of six data warehouse studies (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003) on the success factors in a data warehouse project. He concluded his survey with a list of success factors that influence data warehouse implementation, as depicted in Figure 2.8. He shows eight implementation factors which will directly affect the six selected success variables.

The above-mentioned data warehouse success factors provide an important guideline for implementing successful data warehouse projects. The study by (Hwang M.I., 2007) shows that an integrated selection of various factors, such as end user participation, top management support, and acquisition of quality source data with profound and well-defined business needs, plays a crucial role in data warehouse implementation. Besides that, other factors highlighted by Hayen R.L. (2007), citing the work of Briggs (2002), Vassiliadis (2004) and Watson (2004), such as project, environment and technical knowledge, also influence data warehouse implementation.

Summary

In this work on the new proposed model, the hub-and-spoke architecture is used as a "central repository service", as many scholars including Inmon, Kimball, Evin, Sherman and Nicola adopt this data warehouse architecture. This approach allows the hub (data warehouse) and spokes (data marts) to be located centrally or distributed across a local or wide area network, depending on business requirements. In designing the new proposed model, the hub-and-spoke architecture clearly identifies six important components that a data warehouse should have: ETL, a staging database or operational data store, data marts, MDDB, OLAP, and data mining end-user applications such as data query, reporting, analysis and statistical tools. However, this process may differ from organization to organization. Depending on the ETL setup, some data warehouses may overwrite old data with new data, while others may only maintain a history and audit trail of all changes to the data.

2.3 ONLINE ANALYTICAL PROCESSING

The OLAP Council (1997) defines OLAP as a group of decision support systems that facilitate fast, consistent and interactive access to information that has been reformulated, transformed and summarized from relational datasets, mainly from the data warehouse, into MDDB, which allows optimal data retrieval and trend analysis.

According to Chaudhuri (1997), Burdick, D. et al. (2006) and Vassiladis, P. (1999), OLAP is an important concept for strategic database analysis. OLAP has the ability to analyse large amounts of data for the extraction of valuable information. Analytical development can take place in the business, education or medical sectors. The technologies of data warehouse, OLAP and analysis tools support that ability.

OLAP enables the discovery of patterns and relationships contained in business activity by querying large volumes of data from multiple database source systems at one time (Nigel. P., 2008). Processing database information using OLAP requires an OLAP server to organize, transform and build the MDDB. The MDDB is then separated into cubes for client OLAP tools to perform data analysis, which aims to discover new patterns and relationships between the cubes. Some popular OLAP server software products include Oracle (C), IBM (C) and Microsoft (C).

Madeira (2003) supports the view that OLAP and the data warehouse are complementary technologies which blend together. The data warehouse stores and manages data, while OLAP transforms data warehouse datasets into strategic information. OLAP functions range from basic navigation and browsing (often known as "slice and dice") to calculations and more serious analysis such as time series and complex modelling. As decision-makers implement more advanced OLAP capabilities, they move from basic data access to the creation of information and to the discovery of new knowledge.

2.3.4 OLAP ARCHITECTURE

In comparison to the data warehouse, which is usually based on relational technology, OLAP uses a multidimensional view to aggregate data and provide rapid access to strategic information for analysis. There are three types of OLAP architecture, based on the method by which they store multidimensional data and perform analysis operations on that dataset (Nigel, P., 2008). The categories are multidimensional OLAP (MOLAP), relational OLAP (ROLAP) and hybrid OLAP (HOLAP).

In MOLAP, as depicted in Diagram 2.11, datasets are stored and summarized in a multidimensional cube. The MOLAP architecture can perform faster than ROLAP and HOLAP (C). MOLAP cubes are designed and built for rapid data retrieval to support efficient slicing and dicing operations. MOLAP can perform complex calculations, which are pre-generated when the cube is created. MOLAP processing is restricted to the initial cube that was created and is not bound to any additional replication of the cube.

In ROLAP, as depicted in Diagram 2.12, data and aggregations are stored in relational database tables to provide the OLAP slicing and dicing functionality. ROLAP is the slowest of the OLAP flavours. ROLAP relies on manipulating data directly in the relational database to give the appearance of conventional OLAP slicing and dicing functionality. Essentially, each slicing and dicing action is equivalent to adding a "WHERE" clause to the SQL statement. (C)
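To make the WHERE-clause equivalence concrete, the following minimal Python sketch generates the SQL for a ROLAP-style aggregate and turns each slicing or dicing condition into one WHERE predicate. The `sales` table, its columns and the `rolap_query` helper are hypothetical names chosen for illustration only, and SQLite is used simply because it ships with Python.

```python
import sqlite3

# Dimension columns of the hypothetical sales table.
DIMENSIONS = {"region", "year", "product"}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, year INTEGER, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", 2008, "A", 100.0),
        ("North", 2009, "A", 150.0),
        ("South", 2008, "B", 200.0),
        ("South", 2009, "A", 50.0),
    ],
)


def rolap_query(conn, group_by, filters=None):
    """Sum amount per group_by value; every filter becomes a WHERE predicate."""
    filters = filters or {}
    for column in [group_by, *filters]:
        if column not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {column}")
    sql = f"SELECT {group_by}, SUM(amount) FROM sales"
    if filters:
        # One "column = ?" predicate per sliced or diced dimension.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    sql += f" GROUP BY {group_by} ORDER BY {group_by}"
    return conn.execute(sql, list(filters.values())).fetchall()


full_view = rolap_query(conn, "region")
sliced = rolap_query(conn, "region", {"year": 2008})
diced = rolap_query(conn, "region", {"year": 2009, "product": "A"})
```

Here `sliced` fixes one dimension and `diced` fixes two, so the generated statements differ from the unfiltered query only in their WHERE clause. Column names are checked against a fixed set because they are interpolated into the SQL text, while the filter values are passed as bound parameters.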

ROLAP can manage large amounts of data and has no limitation on data size. ROLAP can leverage the intrinsic functionality of a relational database. ROLAP is slow in performance because each ROLAP activity is essentially a SQL query, or multiple SQL queries, in the relational database. The query time and the number of SQL statements executed are determined by the complexity of the SQL statements, and can become a bottleneck if the underlying dataset is large. ROLAP essentially depends on generating SQL statements to query the relational database, and SQL does not cater for all needs, which makes ROLAP technology conventionally limited by what SQL functionality can offer. (C)

HOLAP, as depicted in Diagram 2.13, combines the technologies of MOLAP and ROLAP. Data is stored in ROLAP relational database tables and the aggregations are stored in a MOLAP cube. To acquire summary-type information, HOLAP leverages cube technology for faster performance, whereas to retrieve detail-type information, HOLAP can drill down from the multidimensional cube into the underlying relational data. (C)

In OLAP architectures (MOLAP, ROLAP and HOLAP), the datasets are stored in a multidimensional format, which involves the creation of multidimensional blocks called data cubes (Harinarayan, 1996). A cube in an OLAP architecture may have three axes (dimensions) or more. Each axis (dimension) represents a logical category of data. One axis may, for example, represent the geographic location of the data, while others may indicate a state of time or a specific school. Each of the categories, which will be described in the following section, can be broken down into successive levels, and it is possible to drill up or down between the levels.

Cabibo (1997) states that OLAP partitions are normally stored on an OLAP server, with the relational database frequently stored on a separate server from the OLAP server. The OLAP server must query across the network whenever it needs to access the relational tables to resolve a query. The impact of querying across the network depends on the performance characteristics of the network itself. Even when the relational database is placed on the same server as the OLAP server, inter-process calls and the associated context switching are required to retrieve relational data. With an OLAP partition, calls to the relational database, whether local or over the network, do not occur during querying.

2.3.3 OLAP FUNCTIONALITY

OLAP functionality offers dynamic multidimensional analysis, supporting end users with analytical activities that include calculations and modelling applied across dimensions, trend analysis over time periods, slicing subsets for on-screen viewing, and drilling to deeper levels of records (OLAP Council, 1997). OLAP is implemented in a multi-user client/server environment and provides reliably fast responses to queries, regardless of database size and complexity. OLAP helps the end user integrate enterprise information through relational, customized viewing and analysis of historical and present data in various "what-if" data model scenarios. This is achieved through the use of an OLAP server, as depicted in Diagram 2.9.

OLAP functionality is provided by an OLAP server. The OLAP server design and data structures are optimized for fast information retrieval in any orientation and for flexible calculation and transformation of raw data. The OLAP server may either physically stage the processed multidimensional information to deliver consistent and fast response times to end users, or it may populate its data structures in real time from relational databases, or offer a choice of both.

Essentially, OLAP creates information in cube form, which allows more complex analysis than a relational database. OLAP analysis techniques employ 'slice and dice' and 'drilling' methods to segregate data into sets of information depending on given parameters. A slice is a subset of the multidimensional array obtained by fixing a single value for one or more dimensions that are not part of the subset. The dice function applies the slice function on two or more dimensions of a multidimensional cube. The drilling function allows the end user to traverse between summarized data and the most detailed data unit, as depicted in Diagram 2.10.
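The slice, dice and drilling operations can be illustrated on a toy cube held in a Python dictionary. This is a sketch under stated assumptions rather than how any OLAP server is implemented: the dimension names, the cell values and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical three-dimensional cube: (location, year, product) -> measure.
DIMS = ("location", "year", "product")
CUBE = {
    ("East", 2008, "A"): 10,
    ("East", 2009, "A"): 20,
    ("East", 2009, "B"): 5,
    ("West", 2008, "B"): 7,
    ("West", 2009, "A"): 3,
}


def slice_cube(cube, dims, dim, value):
    """Slice: fix one dimension to a single value and drop it from the result."""
    i = dims.index(dim)
    cells = {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}
    return cells, dims[:i] + dims[i + 1:]


def dice_cube(cube, dims, selections):
    """Dice: restrict several dimensions at once, keeping all dimensions."""
    wanted = {dims.index(d): set(values) for d, values in selections.items()}
    return {
        k: v
        for k, v in cube.items()
        if all(k[i] in values for i, values in wanted.items())
    }


def aggregate(cube, dims, keep):
    """Summarize the measure over every dimension not listed in keep."""
    positions = [dims.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[p] for p in positions)] += v
    return dict(totals)


year_2009, remaining_dims = slice_cube(CUBE, DIMS, "year", 2009)
east_product_a = dice_cube(CUBE, DIMS, {"location": ["East"], "product": ["A"]})
by_location = aggregate(CUBE, DIMS, ["location"])               # summarized level
by_location_year = aggregate(CUBE, DIMS, ["location", "year"])  # one level deeper
```

Moving from `by_location` to `by_location_year` corresponds to drilling down, and the reverse direction to drilling up; the slice removes the fixed dimension from the result, whereas the dice keeps all three dimensions and only narrows their members.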

2.3.5 MULTIDIMENSIONAL DATABASE SCHEMA

The base of every data warehouse system is a relational database built using a dimensional model. A dimensional model consists of fact and dimension tables, which are described as a star schema or snowflake schema (Kimball, 1999). A schema is a collection of database objects, tables, views and indexes (Inmon, 1996). To understand dimensional data modelling, Table 2.10 defines some of the terms commonly used in this type of modelling:

In designing data models for a data warehouse, the most commonly used schema types are the star schema and the snowflake schema. In the star schema design, the fact table sits in the middle and is connected to the surrounding dimension tables like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

Most data warehouses use a star schema to represent the multidimensional data model. The database consists of a single fact table and a single table for each dimension. Each tuple in the fact table consists of a pointer, or foreign key, to each of the dimensions that provide its multidimensional coordinates, and stores the numeric measures for those coordinates. A tuple consists of a unit of data extracted from the cube in a range of members from one or more dimension tables (C, http://msdn.microsoft.com/en-us/library/aa216769%28SQL.80%29.aspx). Each dimension table consists of columns that correspond to attributes of the dimension. Diagram 2.14 shows an example of a star schema for a Medical Informatics System.
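A star schema of this kind can be sketched with SQLite from Python. The table and column names below (`fact_visit`, `dim_date`, `dim_clinic`) are hypothetical and are not taken from Diagram 2.14; they only illustrate a fact table whose foreign keys point to one table per dimension and which stores numeric measures.

```python
import sqlite3


def build_star_schema():
    """Create a small in-memory star schema with one fact and two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER
        );
        CREATE TABLE dim_clinic (
            clinic_key INTEGER PRIMARY KEY, clinic_name TEXT, city TEXT
        );
        -- Each fact row holds one foreign key per dimension plus the measures.
        CREATE TABLE fact_visit (
            date_key INTEGER REFERENCES dim_date(date_key),
            clinic_key INTEGER REFERENCES dim_clinic(clinic_key),
            visit_count INTEGER,
            total_charge REAL
        );
        INSERT INTO dim_date VALUES (1, 2009, 1), (2, 2009, 2);
        INSERT INTO dim_clinic VALUES
            (1, 'Clinic A', 'City X'), (2, 'Clinic B', 'City Y');
        INSERT INTO fact_visit VALUES
            (1, 1, 10, 500.0), (1, 2, 4, 200.0), (2, 1, 6, 300.0);
        """
    )
    return conn


def charge_by_city(conn):
    """Join the fact table to one dimension and aggregate a measure."""
    return conn.execute(
        """
        SELECT c.city, SUM(f.total_charge)
        FROM fact_visit AS f
        JOIN dim_clinic AS c ON c.clinic_key = f.clinic_key
        GROUP BY c.city
        ORDER BY c.city
        """
    ).fetchall()
```

With this sample data, `charge_by_city(build_star_schema())` needs only a single join between the fact table and the clinic dimension, which is the property that makes the star layout easy to query.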

Star schemas do not explicitly provide support for attribute hierarchies, so they are not well suited to architectures such as MOLAP, which require many hierarchies of dimension tables for efficient drilling of datasets.

Snowflake schemas provide a refinement of the star schema in which the dimensional hierarchy is explicitly represented by normalizing the dimension tables, as shown in Diagram 2.15. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joins against smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables. (C)
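The normalization step can be sketched as follows, again with SQLite and hypothetical table names. The city attributes are moved out of the clinic dimension into their own lookup table, so the clinic, city and state hierarchy is explicit in the schema at the cost of one more table to maintain and one more join per query.

```python
import sqlite3

# Columns that represent each level of the (assumed) clinic hierarchy.
LEVEL_COLUMNS = {
    "clinic": "cl.clinic_name",
    "city": "ci.city_name",
    "state": "ci.state",
}


def build_snowflake():
    """Create a fact table whose clinic dimension is normalized into two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_city (
            city_key INTEGER PRIMARY KEY, city_name TEXT, state TEXT
        );
        CREATE TABLE dim_clinic (
            clinic_key INTEGER PRIMARY KEY,
            clinic_name TEXT,
            city_key INTEGER REFERENCES dim_city(city_key)
        );
        CREATE TABLE fact_visit (
            clinic_key INTEGER REFERENCES dim_clinic(clinic_key),
            visit_count INTEGER
        );
        INSERT INTO dim_city VALUES (1, 'City X', 'State 1'), (2, 'City Y', 'State 1');
        INSERT INTO dim_clinic VALUES
            (1, 'Clinic A', 1), (2, 'Clinic B', 1), (3, 'Clinic C', 2);
        INSERT INTO fact_visit VALUES (1, 10), (2, 4), (3, 6);
        """
    )
    return conn


def visits_by_level(conn, level):
    """Total visits at one level of the clinic -> city -> state hierarchy."""
    column = LEVEL_COLUMNS[level]  # raises KeyError for an unknown level
    return conn.execute(
        f"""
        SELECT {column}, SUM(f.visit_count)
        FROM fact_visit AS f
        JOIN dim_clinic AS cl ON cl.clinic_key = f.clinic_key
        JOIN dim_city AS ci ON ci.city_key = cl.city_key
        GROUP BY {column}
        ORDER BY {column}
        """
    ).fetchall()
```

Calling `visits_by_level` with "state", then "city", then "clinic" walks down the hierarchy one level at a time, which is the drilling pattern that the explicit hierarchy is meant to support.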

Levene. M (2003) stresses that in addition to the fact and dimension tables, data warehouses store selected summary tables containing pre-aggregated data. In the simplest cases, the pre-aggregated data corresponds to aggregating the fact table on one or more selected dimensions. Such pre-aggregated summary data can be represented in the database in at least two ways. Whether to use a star or a snowflake schema mainly depends on business needs.
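As an illustration of the simplest case mentioned above, the sketch below materializes a summary table by aggregating a fact table on one selected dimension. The `fact_sales` and `summary_sales_by_year` tables are hypothetical, and storing the result as a separate table is only one possible representation of pre-aggregated data.

```python
import sqlite3


def build_fact_table():
    """Create a small hypothetical fact table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact_sales (year INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [
            (2008, "North", 100.0),
            (2008, "South", 200.0),
            (2009, "North", 150.0),
            (2009, "South", 50.0),
        ],
    )
    return conn


def refresh_summary(conn):
    """Rebuild the summary table by aggregating the fact table on year."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS summary_sales_by_year;
        CREATE TABLE summary_sales_by_year AS
            SELECT year, SUM(amount) AS total_amount, COUNT(*) AS row_count
            FROM fact_sales
            GROUP BY year;
        """
    )


def yearly_totals(conn):
    """Answer a summary query from the pre-aggregated table only."""
    return conn.execute(
        "SELECT year, total_amount, row_count "
        "FROM summary_sales_by_year ORDER BY year"
    ).fetchall()
```

After `refresh_summary` has run, `yearly_totals` never touches `fact_sales`, so summary queries stay cheap as the fact table grows; the trade-off is that the summary must be refreshed whenever new facts are loaded.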

2.3.2 OLAP EVALUATION

As OLAP technology takes a prominent place in the data warehouse industry, there should be a suitable assessment tool to evaluate it. E.F. Codd not only invented OLAP but also provided a set of rules, known as the 'Twelve Rules', for assessing the capability of OLAP products, which include data manipulation, unlimited dimensions and aggregation levels, and flexible reporting, as shown in Table 2.8 (Codd, 1993):

Codd's twelve rules of OLAP provide an essential tool to verify that the OLAP functions and OLAP models used are able to produce the desired result. Berson, A. (2001) stressed that a good OLAP system should also support a complete set of database management tools, as an integrated centralized utility that permits database management to perform distribution of databases within the enterprise. OLAP's ability to drill within the MDDB allows drill-down right to the source or root of the detail record level. This implies that the OLAP tool permits a smooth transition from the MDDB to the detail record level of the source relational database. OLAP systems must also support incremental database refreshes. This is an important feature for preventing stability issues in operations and usability problems when the size of the database increases.

2.3.1 OLTP and OLAP

The design of OLAP for multidimensional cubes is entirely different from that of OLTP for databases. OLTP is implemented in relational databases to support daily processing in an organization. The main function of an OLTP system is to capture data into computers. OLTP allows effective data manipulation and storage of data for daily operations, resulting in a huge quantity of transactional data. Organisations build multiple OLTP systems to handle huge quantities of daily operational transaction data in a short period of time.

OLAP is designed for data access and analysis to support the strategic decision-making process of managerial users. OLAP technology focuses on aggregating datasets into a multidimensional view without hindering system performance. Han, J. (2001) describes OLTP systems as "customer oriented" and OLAP systems as "market oriented". He summarized the major differences between OLTP and OLAP systems based on 17 key criteria, as shown in Table 2.7.

It is complicated to merge OLAP and OLTP into one centralized database system. The dimensional data design model used in OLAP is much more effective for querying than the relational database query used in OLTP systems. OLAP may use one central database as its data source, whereas OLTP uses different data sources from different database sites. The dimensional design of OLAP is not suitable for OLTP systems, mainly due to redundancy and the loss of referential integrity of the data. Organizations therefore choose to have two separate information systems, one OLTP and one OLAP (Poe, V., 1997).

We can conclude that the purpose of OLTP systems is to get data into computers, whereas the purpose of OLAP is to get data or information out of computers.

2.4 DATA MINING

Many data mining scholars (Fayyad, 1998; Freitas, 2002; Han, J. et al., 1996; Frawley, 1992) have defined data mining as discovering hidden patterns from historical datasets by using pattern recognition, as it involves searching for specific, unknown information in a database. Chung, H. (1999) and Fayyad et al. (1996) referred to data mining as a step of knowledge discovery in databases: it is the process of analyzing data and extracting knowledge from a large database, also known as a data warehouse (Han, J., 2000), and turning it into useful information.

Freitas (2002) and Fayyad (1996) have recognized data mining as an advantageous tool for extracting knowledge from a data warehouse. The results of the extraction uncover hidden patterns and inconsistencies that are not visible in the existing data sets. The discovery of such hidden patterns and data inconsistencies cannot be achieved by using conventional data analysis and query tools. Data mining techniques differ from the conventional data analysis approach in that data mining extracts hidden patterns from a dataset, while conventional data analysis tools only make assumptions about the results from a data set.

There are several data mining techniques, and different techniques are used in different areas of application. Data mining techniques cover association, classification, clustering and prediction (Citation). Freitas (2002) stressed that data mining problems have to be considered in terms of the potential for solving them using different data mining techniques. Therefore, to carry out a successful data mining application with the chosen data mining techniques, a process model is required, as it includes a series of steps that will lead to acceptable results. Section 2.4.1 discusses the data mining techniques and the preferred techniques used in this study. Section 2.4.2 presents the detailed data mining process model and also discusses the process model used throughout this research in deploying the experimental application tools, which is discussed further in Chapter 4.

2.4.1 Data Mining Techniques

In general, data mining is capable of predicting or forecasting future events based on historical data sets, and its purpose is to find hidden patterns in the database. There are several data mining techniques that are used and applied in different areas of data. Knowledge of how each data mining technique is used is essential for selecting the suitable technique for a specific area.

According to Mailvaganam (2007), data mining techniques consist of two kinds of model, descriptive and predictive, as described in Diagram 2.16. Descriptive models can be generated by applying association rule discovery and clustering algorithms. Predictive models are generated by using classification and regression algorithms. Descriptive models can provide knowledge of hidden relationships in a given data set; for example, in a student database, students who pass mathematics tend to pass science. Predictive models can be used to predict future results from a given data set; for example, in marketing, a customer's gender, age, and purchase history might predict the likelihood of a future sale.

Diagram 2.16 Descriptive and Predictive Model (adapted from Mailvaganam, 2007)

A data mining algorithm is the mechanism that generates a data mining model. In order to generate a data mining model, a data mining algorithm needs to be defined. The algorithm then analyzes a set of given data to look for identifiable hidden patterns and trends. These results are then used by the algorithm to define the parameters of the mining model. The parameters are then applied across the whole data set to extract actionable patterns and detailed statistics. More details on the data mining algorithms are given below:

The association algorithm is a powerful correlation counting engine. It can perform scalable and efficient analysis in identifying items in a collection that occur together. (Citation)
Classification is the process of finding a set of models that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects with unknown class labels. This is the decision tree algorithm, including both classification and regression. It can also build multiple trees in a single model to perform association analysis. (Citation)
The clustering algorithm includes two different clustering techniques: EM (expectation and maximization) and K-means. It automatically detects the number of natural clusters in the datasets and discovers groups or categories. (Citation)
Prediction can be viewed as a model constructed and used to assess the class of an unlabelled sample, or the value ranges of an attribute that a given sample is likely to have. (Citation)
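
To make the idea of "correlation counting" concrete, the sketch below counts how often pairs of items occur together and derives the usual support and confidence measures. The student records are invented for illustration, and this is a plain co-occurrence count rather than any particular product's association algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: the set of subjects each student passed.
passed = [
    {"mathematics", "science"},
    {"mathematics", "science", "history"},
    {"mathematics"},
    {"science", "history"},
    {"mathematics", "science"},
]

item_counts = Counter()
pair_counts = Counter()
for record in passed:
    item_counts.update(record)
    # Sort so that each pair is always counted under the same key.
    pair_counts.update(combinations(sorted(record), 2))


def support(a, b):
    """Fraction of all records that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(passed)


def confidence(a, b):
    """Of the records containing a, the share that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]


print(support("mathematics", "science"))
print(confidence("mathematics", "science"))
```

In this toy data set, three of the four students who passed mathematics also passed science, which is the kind of relationship a descriptive model would report.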

In data mining, the choice of the best algorithm depends on the specific business use case. It is possible to use different data mining algorithms to perform mining on the same business use case data sets; each algorithm will produce a different set of results, and some data mining algorithms can produce more than one type of result. Data mining algorithms are flexible and do not need to be used separately. Within a single data mining solution, one algorithm can first be used to explore the data set, and another algorithm can then be used to predict a specific result based on the explored data (Citation). For example, an algorithm such as clustering can be used to explore the data, recognize patterns and break the data set into groups, and another algorithm, such as a decision tree model based on the classification algorithm, can then be used to predict a specific outcome based on that data. (Citation)

Data mining models are used to predict values, find hidden trends and generate summaries of data. It is important to know which data mining algorithm to use for a given business use case. Table 2.11 shows suggestions on which algorithms to use for a data mining solution, adapted from Microsoft® (2009).

As data mining models are built, deployed and trained, the detailed results of the data mining model are stored as data mining model nodes. Each node collects the attributes, description, probabilities, and distribution information for the model element it represents, together with its relation to other nodes. Every data mining model node has an associated node type that helps in defining the data mining model. The model node is the topmost node, regardless of the actual structure of the model. All data mining models start with a model node. (Citation)

In this study, we used decision tree and clustering models as the main data mining techniques. These data mining techniques and models are discussed further in Chapters 3 and 4.

2.4.1.1 Decision Tree Model

Decision trees are a standard data mining model for classification and prediction techniques (Citation). Decision trees are preferred over neural networks because decision trees represent rules, and these rules can easily be expressed and understood. A decision tree model can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.

A decision tree model is a tree-like structure using classification techniques, in which each node in the tree structure represents a question used to further classify data. Decision trees are efficient and can be built faster than other models while achieving results of similar accuracy in some cases. They are therefore appropriate for large training data sets. A decision tree model is easy to understand and interpret, depending on the complexity of the tree, and it handles non-numerical data. The various methods used to create decision trees have been used widely for decades, and there is a large body of work describing these statistical techniques (Citation). According to Witten et al. (2000), the decision tree approach is known for its fast data mining modelling, as it uses a divide-and-conquer approach.

Witten et al. (2000) describe the decision tree as being constructed recursively. A model is placed at the root node of the tree, and one or more tree nodes are made, one for each possible value. The training set at a tree node is then divided into subsets, which make up Decision Tree 1, Decision Tree 2 or more tree nodes. This process is repeated recursively for each branch until all instances at a node have the same classification, at which point construction of that part of the tree stops. This means that a leaf node with a single class of "true" or "false" cannot be split further, which causes the recursive process to stop. The objective of the decision tree model is to build as simple a decision tree as possible that still produces good classification or predictive performance. (Citation)
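
The recursive, divide-and-conquer construction can be sketched in a few lines of Python. The text does not say how the attribute to split on is chosen, so the sketch assumes an information-gain (entropy) criterion in the style of ID3; the function names and the tiny data set are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stop when every row at this node has the same class: this is a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Also stop when no attributes remain; fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def remaining_entropy(attr):
        # Weighted entropy of the subsets produced by splitting on attr.
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    # Lowest remaining entropy is equivalent to highest information gain.
    best = min(attributes, key=remaining_entropy)
    rest = [a for a in attributes if a != best]
    subsets = {}
    for row in rows:
        subsets.setdefault(row[best], []).append(row)
    # One branch per observed value, each built by the same procedure.
    return {best: {value: build_tree(subset, rest, target)
                   for value, subset in subsets.items()}}


def classify(tree, row):
    """Walk from the root to a leaf (raises KeyError for unseen values)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree


rows = [
    {"fever": "yes", "cough": "yes", "flu": "true"},
    {"fever": "yes", "cough": "no", "flu": "true"},
    {"fever": "no", "cough": "yes", "flu": "false"},
    {"fever": "no", "cough": "no", "flu": "false"},
]
tree = build_tree(rows, ["fever", "cough"], "flu")
print(tree)
print(classify(tree, {"fever": "yes", "cough": "no"}))
```

Here a single split on the hypothetical "fever" attribute already yields pure leaves, so recursion stops after one level, which matches the goal of keeping the tree as simple as possible.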

In a decision tree-based model, as shown in Diagram 2.17, the model node serves as the root node of the tree. A decision tree model may have many tree nodes that make up the whole structure, but for each tree there is only one tree node from which all other nodes, such as interior and distribution nodes, descend. An interior node represents a split in the tree model and is also known as a "branch" node. A distribution node describes the distribution of values for one or more attributes according to the data represented by the node and is also known as a "leaf" node. A decision tree-based model always has one model node and at least one tree node (Citation).

2.4.1.2 Clustering Model

Clustering is a data mining technique that is used to divide a data set into groups or clusters based on the similarity between the data entities (Citation). Entities of a cluster share common features that differentiate them from other clusters. Similarity is measured in terms of the distance between elements or entities. Unlike classification, which has predefined labels (supervised learning), clustering is considered unsupervised learning because it automatically comes up with the labels (Citation).
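
A distance-based grouping of unlabelled points can be illustrated with a small K-means sketch. It assumes Euclidean distance, a fixed number of clusters k, and a naive choice of the first k points as starting centroids; all of these are simplifications made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def assign(points, centroids):
    """Place every point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iterations=20):
    centroids = list(points[:k])  # naive starting centroids (assumption)
    clusters = assign(points, centroids)
    for _ in range(max_iterations):
        # Move each centroid to the mean of its cluster; keep it if empty.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
        clusters = assign(points, centroids)
    return centroids, clusters


points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

No labels are supplied: the two groups emerge purely from the distances between points, which is what makes the technique unsupervised.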

According to Kogan, J. et al. (2006), clustering techniques are divided into partitioning and hierarchical methods. Partitioning methods construct various partitions of similar and dissimilar points into groups or clusters, which are evaluated by some criterion. Hierarchical methods build a hierarchical decomposition of the data set progressively, using either a top-down or a bottom-up approach (Citation). The top-down approach begins with one cluster containing all the data and breaks it down into smaller clusters known as sub-clusters. The bottom-up approach begins with small clusters and merges them recursively into larger clusters in a nested manner. The advantage of hierarchical clustering compared to partitioning is that it is flexible with regard to the level of granularity. Clustering techniques are assessed in terms of certain features related to size, distance between parts of the cluster, or shape of the cluster. Clustering techniques support applications that require segmenting the data into common groups (Citation).
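
The bottom-up approach can be sketched as repeated merging of the two closest clusters. The example assumes one-dimensional points and single linkage (the distance between two clusters is the distance between their closest members); both choices are illustrative assumptions, not part of the cited description.

```python
def agglomerate(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    merges = []                       # history of (left, right, distance)
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


clusters, merges = agglomerate([1.0, 2.0, 9.0, 10.0, 25.0], target_clusters=2)
print(clusters)
print([distance for _, _, distance in merges])
```

The merge history is the hierarchy: stopping earlier or later along it yields finer or coarser clusters, which is the flexibility in granularity mentioned above.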

In a clustering-based model, as shown in Diagram 2.18, the model node serves as the root node of the cluster (Citation). A cluster node gathers the attributes and data for the abstraction of the specific cluster. Essentially, it gathers the set of distributions that make up a cluster of cases for the data mining model. A cluster-based model always has one model node and at least one cluster node. A user does not need to specify in advance the number of clusters to be created. The clustering process automatically creates the number of clusters by specifying how similar the records within the individual clusters should be. The clustering approach works best with categorical and non-continuous variables.

2.4.2 Data Mining Process Model

A process model is required to implement a data mining project. This process model involves a sequence of steps that will produce correct results. Some examples of these process models are CRISP (Chapman et al., 2000) and TWOCROWS (Two Crows, 1999). In this study, the experimental application tools are based on the CRISP data mining process model. The different phases of the CRISP data mining process model are presented in Diagram 2.19 (Citation). The focus of this chapter is on the first three CRISP phases, which are relevant to the research objectives of this study.

According to Chapman et al. (2000), the CRISP data mining model is a life cycle for a data mining project, which also includes the tasks and the relationships between the tasks. The CRISP life cycle consists of six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment. The arrows in the diagram indicate the most important and frequent dependencies between phases (Citation INCLUDE PAGE NUMBER).

The CRISP data mining process model begins with business understanding of the project's objectives and requirements, as this is important for converting them into a data mining problem definition. The next step is data understanding, which works with the datasets to identify data quality problems and to discover interesting subsets from which to form hypotheses about hidden information. After the datasets have been identified, the data preparation phase loads all data into the modelling tools from the initial datasets. This phase is executed multiple times to complete the transformation and cleaning of data for modelling. In the modelling phase, various techniques are selected and applied to the data mining problem to obtain high-quality models for data analysis. In the evaluation phase, the model(s) are thoroughly reviewed to ensure that they achieve the specified business objectives. Finally, the deployment phase is executed to produce simple reports or complex data mining results; this phase is mainly triggered by the end users.

One of the major advantages of the CRISP data mining process model is that it is highly replicable, which supports this study. The process is flexible and can be applied to different types of data and used in any business area. It also provides a uniform framework for guidelines and documentation.

2.4.3 OLAP Mining

As mentioned in Section 2.4.2, data mining techniques are useful for mining hidden patterns in a relational database. However, according to Song Lin (2002), the combination of OLAP and data mining, known as OLAM or OLAP Mining, is a tool for mining hidden patterns in an MDDB. OLAM provides suggestions to the decision-maker according to the internal model with a few quantitative data mining methods, such as clustering or classification. Data mining has been introduced into OLAP, and this does not involve developing any new data mining algorithm. OLAM is the process of applying intelligent methods to extract data patterns; it provides automatic data analysis and prediction, gathers hidden patterns and predicts unknown information with data mining tools. Diagram 2.20 (Citation) depicts the OLAM concept, in which the MDDB integrates with data mining algorithms to produce effective reporting for decision makers.

According to Hans, J. (1997), the OLAM architecture provides a modular and systematic design for MDDB mining on a data warehouse. In Diagram 2.21 (Citation), the OLAM architecture consists of four layers. Layer 1 is the database layer, which consists of systematically constructed relational databases and performs data cleaning, integration, and consolidation in the building of data warehouses. Layer 2 is the MDDB layer, which offers an MDDB for OLAP and data mining. Layer 3 is the crucial layer for data mining, as the OLAP and OLAM engines blend together for the processing and mining of data. Lastly, layer 4 contains the graphical user interfaces, which allow users to build data warehouses and MDDBs, perform OLAP and mining, and visualize and explore the results. A well-designed OLAM architecture should use existing infrastructure in this way rather than constructing everything from scratch.

Diagram 2.21 Online Analytical Mining (adapted from Hans, 1998 page number)

The OLAM architecture benefits OLAP-based systems as it provides an exploratory data analysis environment using data mining technology (Citation). As depicted in layers 1, 2 and 3 of Diagram 2.21 (Citation), the integration of database, data warehouse, MDDB, OLAP and OLAM makes data mining possible on different subsets of data and at different levels of abstraction by drilling, pivoting, slicing and dicing an MDDB and intermediate data mining results. It therefore simplifies interactive data mining functions such as decision trees and clustering, and allows the results to be viewed with flexible knowledge visualization data mining tools.
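
The cube operations named above can be illustrated on a tiny in-memory cube. The sketch assumes a hypothetical cube with three dimensions (year, region, product) and a single sales measure; it only shows what slicing, dicing and moving between abstraction levels select, not how an OLAP engine stores or computes them.

```python
from collections import defaultdict

# Hypothetical cube: (year, region, product) -> sales measure.
DIMENSIONS = ("year", "region", "product")
cube = {
    (2008, "North", "A"): 10, (2008, "North", "B"): 4,
    (2008, "South", "A"): 7,  (2009, "North", "A"): 12,
    (2009, "South", "A"): 9,  (2009, "South", "B"): 3,
}


def slice_cube(cube, dimension, value):
    """Slice: fix one dimension to a single member."""
    axis = DIMENSIONS.index(dimension)
    return {k: v for k, v in cube.items() if k[axis] == value}


def dice_cube(cube, **members):
    """Dice: keep cells whose coordinates fall in the allowed member sets."""
    return {k: v for k, v in cube.items()
            if all(k[DIMENSIONS.index(d)] in allowed
                   for d, allowed in members.items())}


def roll_up(cube, keep):
    """Aggregate away every dimension not listed in keep."""
    axes = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[a] for a in axes)] += v
    return dict(totals)


by_year = roll_up(cube, ["year"])                    # coarse level
by_year_region = roll_up(cube, ["year", "region"])   # drill down one level
north = slice_cube(cube, "region", "North")
a_in_2009 = dice_cube(cube, year={2009}, product={"A"})
print(by_year, by_year_region, north, a_in_2009, sep="\n")
```

Each result is itself a smaller cube, so a mining algorithm such as clustering or a decision tree could be run on any of these subsets or abstraction levels.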

According to Hans (1998), OLAM integrates multiple data mining techniques such as association, classification and clustering to mine different parts of the data warehouse at different levels of abstraction. Data mining can be done with programs that analyze the data automatically. In order to better understand customer behaviour and preferences, businesses are using data mining to sift through the huge amounts of information gathered. Vacca (2002) summarized that OLAM is considered one of the different concepts and architectures for data mining systems. It combines OLAP with data mining and mines knowledge in MDDB cubes.

The results generated by OLAP and data mining applications are evaluated using the precision evaluation method. It covers the decision tree and clustering techniques, as shown in Table 2.12.

2.5 Summary

In conclusion, based on the literature review, we identified that the hub-and-spoke data warehouse model emerged as the prominent model, with six components that provide an efficient architecture. Multiple OLAP models were discussed, and the MOLAP model was identified as suitable for the MDDB architecture. In order to drill into and extract knowledge from an MDDB, various methods are used, such as query mining and data mining. Since query mining tools such as the cube browser, pivot table and Web-OLAP (WOLAP) are mostly embedded in OLAP applications, we will discuss them in Chapter 3. However, we discussed data mining, especially the classification and clustering techniques, which will be used in the OLAM approach.

Finally, we justify the use of Microsoft® data warehouse, OLAP and data mining tools in this research. Microsoft® provides a comprehensive data warehouse, OLAP and data mining platform with multiple capabilities (Kramer, 2002). Kramer (2002) conducted a product comparison of the data warehouse offerings from Microsoft, Hyperion Solutions, ORACLE Corporation and IBM Corporation. The summary of Kramer's findings is shown in Table 3.3. His evaluation indicated that the Microsoft®-based data warehouse, OLAP and data mining components are the winning components when compared with those from Hyperion Solutions, ORACLE Corporation and IBM Corporation.