
What is a Data Warehouse?

A Data Warehouse is a central place where large amounts of data from different sources are stored. It is a database system used for data analysis and reporting, enabling better decision-making.

Characteristics of a Data Warehouse

A data warehouse is subject-oriented (organized around business subjects such as sales or customers), integrated (data from different sources is brought into a consistent format), time-variant (data is kept over long periods so trends can be analyzed), and non-volatile (once loaded, data is read rather than updated or deleted).

Need for Data Warehousing

In challenging times, good decision-making becomes critical. A data warehouse speeds up access to large amounts of data, making it easier for corporate decision-makers to derive insights that guide business strategy and decision-making.

OLTP and OLAP

OLTP (Online Transaction Processing)

OLTP is essentially a database that stores daily transactions, also known as current data. Typically, OLTP is utilized in online applications that require frequent updates to maintain data consistency. OLTP deals with a relatively small amount of data.

OLAP (Online Analytical Processing)

OLAP refers to the analytical processing performed on a Data Warehouse: analyzing data for planning and decision-making. For aggregations, the data in OLAP comes from various sources. Compared to OLTP, OLAP handles large amounts of data.

| OLTP | OLAP |
| --- | --- |
| Keeps transactional data and holds a small amount of data. | Stores historical data; a massive amount of data used to inform business decisions. |
| Stores records of the last few days or weeks; fast response time. | High performance for analytical queries. |
| Uses ER modelling techniques for database design. | Uses dimensional modelling techniques for data warehouse design. |
| Optimized for write operations (Insert, Update, and Delete). | Optimized for read operations that access millions of records. |
| Tables and joins are complex because they are normalized, which reduces redundant data and saves storage space. | Tables and joins are simple because they are de-normalized, which reduces response time for analytical queries. |

Database and Data Warehousing

Database

A database is an organized collection of data, optimized for transactional activities, that allows users to easily access, manage, and retrieve data using DBMS software. It can be treated as an OLTP system.

Data warehouse

A data warehouse is a centralized repository that stores and manages a large amount of data (historical data) from various sources, designed for analysis, reporting, and business intelligence (BI) purposes. This can be treated as an OLAP system.

| Database | Data Warehouse |
| --- | --- |
| Designed to store and manage data. | Designed to analyze data. |
| Contains relatively few records, i.e., current data, which can be updated. | Contains millions of records, i.e., historical data, which are not updated. |
| Has both read and write access. | Has read-only access. |
| Always normalized. | Denormalized. |
| The data view is always relational. | The data view is always multidimensional. |
| Contains detailed data. | Contains consolidated data. |
| Needs to be available 24/7, so downtime is costly. | Is far less affected by downtime. |

Data Mart

A Data Mart is a simplified form of Data Warehousing, focused on a single subject or line of business. A data mart is designed to serve the specific analytical needs of a particular department or business unit within an organization. Using Data Mart, the team can access data and gain insights faster because they don’t have to spend time searching within a complex Data Warehouse or manually aggregating data from different sources.

| Data Mart | Data Warehouse |
| --- | --- |
| Stores department data (a single subject). | Stores enterprise data (an integration of multiple departments). |
| Designed for middle management. | Designed for top-management access. |
| Size is typically less than 100 GB. | Size typically ranges from 100 GB to 1 TB or more. |
| Focuses on a specific group or department. | Covers all departments of an organization. |

Dimension and Dimension Table

A dimension is a descriptive attribute, such as city or designation, used to categorize and filter facts. A dimension table stores these attributes, as in the employee dimension table below:
| Emp_ID | Emp_Name | City | Designation | Emp_Mobile |
| --- | --- | --- | --- | --- |
| 101 | Santhosh | Chennai | Marketing | 94XXXXXX95 |
| 102 | Naresh | Hyderabad | Manager | 97XXXXXX53 |
| 103 | Brunda | Bangalore | Executive | 78XXXXXX28 |
| 104 | Srinivas | Hyderabad | Manager | 97XXXXXX52 |
| 105 | Mahesh | Mumbai | Accountant | 98XXXXXX43 |

Types of Dimensions

Fact & Fact Table

A fact is a numeric measure of a business process, such as sales or profit. A fact table stores these measures along with foreign keys to the dimension tables, as below:
| Date | Product_ID | Store_ID | Total_Sales | Profit | Average Sales |
| --- | --- | --- | --- | --- | --- |
| 2024-01-01 | P001 | S001 | 12,000 | 3,500 | 1,200 |
| 2024-01-01 | P002 | S001 | 8,500 | 2,000 | 850 |
| 2024-01-01 | P003 | S002 | 5,000 | 1,500 | 500 |
| 2024-01-02 | P001 | S001 | 15,000 | 4,200 | 1,500 |
| 2024-01-02 | P002 | S002 | 10,000 | 3,000 | 1,000 |
| 2024-01-02 | P003 | S003 | 6,500 | 1,700 | 650 |
| 2024-01-03 | P003 | S001 | 7,500 | 3,300 | 750 |

Types of Facts

Additive facts can be summed across all dimensions in the fact table. In the table above, Total_Sales is an additive fact because it can be summed across the date, product, and store dimensions.

Semi-additive facts can be summed across some dimensions but not others. For example, an account balance (Cur_Balance) can be summed across accounts, but summing it across time is meaningless.

Non-additive facts cannot be meaningfully summed across any dimension. Ratios such as Profit_Margin, or the Average Sales column above, are non-additive.
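The distinction between additive and non-additive facts can be sketched in plain Python against a few rows of the fact table above (row contents are illustrative):

```python
# Rows mirror the fact table above: (date, product_id, total_sales, avg_sales)
rows = [
    ("2024-01-01", "P001", 12000, 1200),
    ("2024-01-01", "P002", 8500, 850),
    ("2024-01-02", "P001", 15000, 1500),
]

# Additive fact: Total_Sales can be summed across any dimension.
total_p001 = sum(sales for _, pid, sales, _ in rows if pid == "P001")
print(total_p001)  # 27000 -- total sales for product P001 across all dates

# Non-additive fact: summing pre-computed averages is meaningless;
# 1200 + 1500 = 2700 is not the average sale for P001. A combined average
# must be recomputed from additive facts (total sales / total transaction count).
```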

Dimensions and Facts Flow Diagram

Dimension Data Modeling Techniques

The dimensional model in a DWH arranges data so that information and reports can be retrieved more easily and quickly. It is implemented by defining a schema, typically a Star or Snowflake schema.

Types of Schemas

Star Schema: A Star Schema is a design in which a centrally located fact table is surrounded by multiple dimension tables. Because the layout resembles a star, it is called a Star Schema.

Snowflake Schema: A Snowflake Schema is an extension of the Star Schema in which dimension tables are broken down into sub-dimensions. The dimension tables are normalized, splitting their data into additional dimension tables.
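A minimal star schema can be sketched with Python's built-in sqlite3 module; the table and column names here are illustrative, not from any specific system:

```python
import sqlite3

# Star schema sketch: one central fact table with foreign keys
# to two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_store   (store_id   TEXT PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (
    date        TEXT,
    product_id  TEXT REFERENCES dim_product(product_id),
    store_id    TEXT REFERENCES dim_store(store_id),
    total_sales INTEGER
);
INSERT INTO dim_product VALUES ('P001', 'Laptop'), ('P002', 'Phone');
INSERT INTO dim_store   VALUES ('S001', 'Chennai'), ('S002', 'Mumbai');
INSERT INTO fact_sales  VALUES
    ('2024-01-01', 'P001', 'S001', 12000),
    ('2024-01-01', 'P002', 'S001', 8500),
    ('2024-01-02', 'P001', 'S002', 15000);
""")

# A typical analytical query: join the fact table to a dimension and aggregate.
for row in con.execute("""
    SELECT p.product_name, SUM(f.total_sales)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.product_name ORDER BY p.product_name
"""):
    print(row)  # ('Laptop', 27000) then ('Phone', 8500)
```

Every analytical query follows the same shape: filter or group by dimension attributes, aggregate the facts.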


Testing fact and dimension tables involves verifying data completeness, accuracy, and consistency, and confirming that foreign keys and relationships between the tables work as intended. Both types of tables are essential in a data warehouse, and ensuring their integrity and alignment with business rules is crucial for delivering reliable, accurate reports and analysis. With these checks in place, fact and dimension tables will be well-tested, perform optimally, and meet the business requirements of the data warehouse.
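One of these checks, foreign-key integrity between a fact table and a dimension table, can be sketched in plain Python (the table contents and key names are illustrative):

```python
# Referential-integrity check: every fact row must reference an
# existing dimension key; rows that don't are "orphans".
dim_product_ids = {"P001", "P002", "P003"}
fact_rows = [
    ("2024-01-01", "P001", 12000),
    ("2024-01-01", "P002", 8500),
    ("2024-01-02", "P999", 6500),   # orphan: no matching dimension row
]

orphans = [r for r in fact_rows if r[1] not in dim_product_ids]
print(orphans)  # [('2024-01-02', 'P999', 6500)] -- this row fails the check
```

In practice the same check is usually a SQL anti-join run against the warehouse itself.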

Normalization & De-Normalization

Normalization is the process of dividing data into multiple tables and establishing relationships among them. It removes redundant data from the database so that only non-redundant, consistent data is stored in it.


Denormalization is used to combine multiple tables of data into a single table, allowing for faster query processing.
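The idea can be sketched with sqlite3: two normalized tables are pre-joined into one wide reporting table (all table and column names here are illustrative):

```python
import sqlite3

# Normalized tables combined into one denormalized reporting table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE emp  (emp_id INTEGER PRIMARY KEY, emp_name TEXT,
                   dept_id INTEGER REFERENCES dept(dept_id));
INSERT INTO dept VALUES (1, 'Dev'), (2, 'QA');
INSERT INTO emp  VALUES (101, 'Santhosh', 1), (102, 'Naresh', 2);

-- Denormalize: pre-join into a single wide table so analytical
-- queries no longer pay the cost of the join at read time.
CREATE TABLE emp_report AS
SELECT e.emp_id, e.emp_name, d.dept_name
FROM emp e JOIN dept d USING (dept_id);
""")

print(con.execute("SELECT * FROM emp_report ORDER BY emp_id").fetchall())
# [(101, 'Santhosh', 'Dev'), (102, 'Naresh', 'QA')]
```

The trade-off is the one the OLTP/OLAP table above describes: the wide table duplicates `dept_name` but answers reporting queries without joins.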

SCD (Slowly Changing Dimension)

Slowly Changing Dimensions (SCDs) handle attributes in dimension tables that change over time, irregularly rather than on a fixed schedule. Implementing an SCD mechanism lets users determine which category an item belonged to on any given date.

SCD Types

For example, consider an employee named Athul who lives in Hyderabad. This can be represented in a table as below:

| ID | EmpID | EmpName | EmpDept | EmpCity | Year |
| --- | --- | --- | --- | --- | --- |
| 1 | 1001 | Athul | Dev | Hyderabad | 2023 |

SCD Type 1: This is used when there is no need to store historical data in the dimension table. In this, the old data is overwritten with the new data in the dimension table. No history is kept.

Ex: Athul is working in Hyderabad and gets transferred to Bangalore. The table will be changed as follows:

| ID | EmpID | EmpName | EmpDept | EmpCity | Year |
| --- | --- | --- | --- | --- | --- |
| 1 | 1001 | Athul | Dev | Bangalore | 2025 |

In this case, the previous entry, which is considered historical, is no longer available.

SCD Type 2: Both the original record and the newly inserted record are kept in the same table, and the new record gets its own primary key. All historical data is preserved, making this the main approach for tracking historical changes. The table will be changed as follows:

| ID | EmpID | EmpName | EmpDept | EmpCity | Year |
| --- | --- | --- | --- | --- | --- |
| 1 | 1001 | Athul | Dev | Hyderabad | 2023 |
| 2 | 1001 | Athul | Dev | Bangalore | 2025 |
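An SCD Type 2 update can be sketched in plain Python: instead of overwriting the row, the old version is kept and a new version is appended with its own surrogate key (field names are illustrative):

```python
# Dimension table as a list of row dicts; "id" is the surrogate key,
# "emp_id" is the business key that repeats across versions.
dim_emp = [
    {"id": 1, "emp_id": 1001, "name": "Athul", "dept": "Dev",
     "city": "Hyderabad", "year": 2023},
]

def scd2_update(dim, emp_id, **changes):
    """Append a new version of the row instead of overwriting it."""
    current = [r for r in dim if r["emp_id"] == emp_id][-1]   # latest version
    new_row = {**current, **changes, "id": max(r["id"] for r in dim) + 1}
    dim.append(new_row)

scd2_update(dim_emp, 1001, city="Bangalore", year=2025)
for row in dim_emp:
    print(row)
# Both versions survive: Hyderabad/2023 (id=1) and Bangalore/2025 (id=2).
```

A production implementation would typically also carry effective-date and current-flag columns to mark which version is active.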

SCD Type 3: In this, the existing record is modified in place so that limited history is preserved without inserting a new record.

This is achieved by adding columns: one holding the previous value of the attribute of interest and one holding the current value, along with a column indicating when the current value became effective. Because only one level of history is kept, SCD Type 3 is rarely used. The table will be changed as follows:

| ID | EmpID | EmpName | EmpDept | Year | EmpCity (Old) | EmpCity (New) | Eff_Year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1001 | Athul | Dev | 2023 | Hyderabad | Bangalore | 2025 |

Surrogate Key in the Data Warehouse

A Surrogate Key is a unique, system-generated identifier (often an auto-incrementing integer) for each row in a table. Surrogate key values carry no business meaning. They are mainly used in SCDs and for faster query performance.

| SURID | EmpID | EmpName | EmpCity | EmpRole |
| --- | --- | --- | --- | --- |
| 1 | H001 | Ashok | Hyderabad | DEV |
| 2 | M002 | Raju | Mumbai | DEV |
| 3 | N003 | Ramu | Noida | QA |
| 4 | B004 | Srinivas | Bangalore | DEV |
| 5 | H001 | Ashok | Chennai | DEV |
| 6 | H001 | Ashok | Pune | DEV |
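Surrogate key generation can be sketched with a simple counter: the business key (EmpID) repeats when the same employee changes city, but every row still gets a unique SURID (the data here is illustrative):

```python
import itertools

# System-generated counter plays the role of an auto-incrementing column.
next_surid = itertools.count(1)
rows = []
for emp_id, name, city in [
    ("H001", "Ashok", "Hyderabad"),
    ("M002", "Raju", "Mumbai"),
    ("H001", "Ashok", "Chennai"),   # same business key, new version
]:
    rows.append({"surid": next(next_surid), "emp_id": emp_id,
                 "name": name, "city": city})

surids = [r["surid"] for r in rows]
print(surids)  # [1, 2, 3] -- unique even though H001 appears twice
```

This is exactly what makes surrogate keys useful for SCD Type 2: the business key alone could not serve as the primary key once multiple versions of a row exist.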

Testing in a Data Warehouse

Testing in a Data Warehouse is essential to ensure the quality, reliability, and performance of the data and processes. Testing happens at different stages of the Data Warehouse lifecycle, including ETL processes, data validation, performance checks, security, and user acceptance. By thoroughly testing these areas, you can ensure the data warehouse functions effectively and provides accurate, valuable insights.

1. ETL Testing (Extract, Transform, Load)

ETL processes move data from source systems to the data warehouse. Testing ensures the data is accurate and complete.

2. Data Validation and Reconciliation Testing

Ensure that the data in the warehouse is accurate, valid, and matches the data in the source systems.
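A common reconciliation check compares row counts and column totals between source and warehouse; a minimal sketch, with illustrative data, might look like this:

```python
# Reconciliation test: row counts and additive totals in the warehouse
# should match the source system after the ETL run.
source_rows = [("P001", 12000), ("P002", 8500), ("P003", 5000)]
warehouse_rows = [("P001", 12000), ("P002", 8500), ("P003", 5000)]

assert len(source_rows) == len(warehouse_rows), "row count mismatch"
assert sum(v for _, v in source_rows) == sum(v for _, v in warehouse_rows), \
    "total mismatch between source and warehouse"
print("reconciliation passed")
```

In a real pipeline the two lists would be query results from the source database and the warehouse, and the checks would run inside a test framework rather than as bare asserts.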

3. Data Warehouse Schema Testing

This testing ensures the data warehouse schema (tables, views, relationships, etc.) is designed and functioning as expected.

4. Performance Testing

Test the performance of the data warehouse, particularly as the data volume grows and more users access it.

5. Data Governance and Security Testing

Ensure sensitive data is protected and access is controlled appropriately.

6. Business Intelligence (BI) Reporting Testing

Test BI tools like Tableau or Power BI to ensure that reports and dashboards are accurate, complete, and function properly.

7. Regression Testing

Make sure the new ETL processes and schema changes do not introduce new issues or break existing functionality.

8. User Acceptance Testing (UAT)

Test the data warehouse from the end user’s perspective to make sure it meets business requirements and is ready for deployment.

Conclusion

Testing within a Data Warehouse is not only a technical task but an essential process for guaranteeing the accuracy, reliability, security, and performance of data-driven systems. By validating ETL processes and schema design, and by ensuring governance, scalability, and user satisfaction, comprehensive testing ensures that the data warehouse reliably provides accurate insights for business decisions. Through organized and thorough testing at every phase, organizations can maximize the value of their data resources while reducing risks and mistakes.