Wednesday, December 26, 2012

Data warehouse concepts




Data Integrity    OLTP     OLAP     MOLAP     ROLAP     HOLAP      Measure

Cube      Dimension      Hierarchy      Fact table      Dimension tables       Star chema       Snowflake schema

Dimensional Data Model     

Dimensional modeling (Data Model) is a design technique that puts the data in a standard framework and provides easy access. You must create a database model in order to provide quick access and to get the information you need for reporting. Two kinds of schemas are used when designing data models, either a star schema or a snowflake schema

Conceptual Model Design     

A conceptual data model identifies the highest-level relationships between the different entities. Features of conceptual data model include:
   -  Includes the important entities and the relationships among them
   -  No attribute is specified
   -  No primary key is specified

Logical Model Design     

The Logical Model is a complete model and has all the information for building the database entities.

The Logical Model include:
   -  Includes all entities and relationships among them
   -  All attributes for each entity are specified
   -  The primary key for each entity is specified
   -  Foreign keys (keys identifying the relationship between different entities) are specified

In this model we don't have informations related to a specific database (Oracle, DB2, SQL Server, MySQL, etc. ). For instance, a VARCHAR2 column can have a generic name "String(30)".

Physical Model Design

The Physical Model Design is the Logical Model adapted to a specific database.
For instance, instead a generic "String(30)" we have  VARCHAR2(30).

Data Integrity   

Data integrity is a term used to refer to the accuracy and reliability of data.

OLTP   ( On Line Transaction Processing ) 

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

OLAP    ( On Line Analytical Processing )

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

MOLAP  ( Multidimensional OLAP )   

In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

        Advantages:
              -  Excellent performance: MOLAP cubes are built for fast data retrieval
              -  Good data compression techniques
              -  Can perform complex calculations: All calculations have been pre-generated (results returned quickly) when the cube is created.

       Disadvantages:
              -  Limited in the amount of data a cube can handle
              -  A cube rebuilt could be very long
              -  Requires additional investment: to buy the proprietary format + investments in human resources

Examples of commercial products that use MOLAP are Cognos Powerplay, Oracle Database OLAP Option, Microsoft Analysis Services, Essbase, TM1, Lilith Hicare and Daptech Keystone. There is also an open source MOLAP server Palo.

ROLAP   ( Relational OLAP ) 

In ROLAP, data is stored in the relational database.

        Advantages:
              -  Can handle large amounts of data: limited to the database storage limit
              -  Can leverage functionalities inherent in the relational database
              -  Cost less than the MOLAP

       Disadvantages:
              -  Performance can be slow: the measures are not pre-generated
              -  Limited by SQL functionalities


HOLAP  ( Hybrid OLAP )

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance.


Cube     

Cubes are logical representation of multidimensional data. The name of "cube" is visually related to a 3 dimensional model, but we can have more than 3 dimensions.

Here is a 3 dimensional cube:


Data Warehouse Concepts
You can see that a cube has some dimensions. In this case we have 3 dimensions: product, location, time. In real life we can have more than 3 dimensions. In that case, inside a cube we can have another cubes.

A 3 dimensional cube is made of cubes. Each cube can be imagines as a sum of cubes.

Dimension     

In the pink cube you can see, there are 3 dimensions:  product, location, time. A dimension is a structure that categorizes data in order to enable end users to answer business questions.

Measure

Each cube store a value at the intersection of each dimension and that value is named measure. In that cube, 200, 100 are measures.

Hierarchy

A hierarchy defines a set of parentage relationships between all or some of a dimension's members.


Fact table    
                          
The star shows how the data is stored. The middle is the central table (the fact table) containing the basic information and the points are dimension tables that show different views of the data. The central table has foreign keys to the dimension tables.


Dimension tables

The star shows how the data is stored. The middle is the central table (the fact table) containing the basic information and the points aredimension tables that show different views of the data. The central table has foreign keys to the dimension tables.


Star schema (= a form of dimensional model )

The star shows how the data is stored. The middle is the central table (the fact table) containing the basic information and the points are dimension tables that show different views of the data. The central table has foreign keys to the dimension tables. There are no foreign keys on the dimension tables.

Snowflake schema (= a form of dimensional model )

The snowflake shows how the data is stored. The middle is the central table (the fact table) containing the basic information and the points are dimension tables that show different views of the data. The central table has foreign keys to the dimension tables. There are also  foreign keyson the dimension tables. In snowflake schema, as opposed to its counter part star schema, relational keys are present inside dimensions also.