Importance of Data Data Management and Analytics
Data Warehousing
Why Data Warehousing?
Data is the core of any organization. Businesses and organizations make use of reliable, distinctive and consolidated data in critical decision making. Normally, the organizational data resides in different data repositories in different formats. For example, a company might have different departments such as HR, finance, accounts etc that are poorly integrated most of the time. The company would have a solid customer base as well. The customer data might have been stored as plain text files or in databases, customer responses as emails on the Web, accounts data in Excel file and so on. Suppose, the accounts manager asks an accountant regarding a particular accounting activity in such an organization. He might be able to answer as he needs to refer only the accounting related data which is readily available. But suppose the CEO questions a department head where he needs to refer to data from two or three departments to answer his boss. Will he be able to answer easily and quickly? No, having data in heterogeneous formats in different data storage systems will not be useful for any organization when it comes to complex querying. Here comes data warehousing into picture.
What is Data Warehousing?
Data warehousing is the process of collecting and storing data from different resources to make it a single repository. In other words, data warehousing is the process of centralized data management and retrieval. According to William Inmon, “a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process”. Data warehousing integrates data in multiple formats from different repositories in such a way that the data becomes consistent and integrated. Organizations use such data in critical decision making. Accurate and reliable information is required to make better decisions and large volumes of data are required to extract reliable information. Normally, databases that hold transactional data remove older data frequently whereas data warehouses contain historical data. This is the main difference between databases and data warehouses.
Data Warehouse Architecture
Data warehouse is a central repository of historical data which is fed from one or more resources and is accessed by many end-users. A data warehouse has many components including source systems, transformation or ETL tools, data warehouse, data mart and end user tools to make it really functional and useful. Source systems are transactional systems that provide inputs for the data warehouse and examples are Excel sheets, flat files, web logs, databases etc. ETL tools are responsible for the ETL process which is the major process in data warehousing. Data warehouse holds the cleansed and transformed data whereas data mart contains a subset of historical data stored in data warehouse. Different end user tools including data mining tools, online analytical processing tools and application development tools are available for users to interact with data warehouses.
Data Warehousing Techniques
Different techniques are applied while creating data warehouses. These techniques vary based on the nature of data, size of the data, data needs of the organization etc. ETL (Extract, Transform and Load) is the major process in data warehousing. ETL is responsible for extracting data from different data repositories, converting it into one consolidated data warehouse format and loading it into a data warehouse.
During Extraction process, data from different sources including data from flat files, XML files, World Wide Web, SAP, databases etc. are retrieved and consolidated. The Extraction process makes data ready for the next Transformation process. Usually, it is really difficult to make out the exact data of interest as the source might contain irrelevant and redundant data as well. So usually, more data than actually required get extracted.
During Transformation stage, data becomes ready for actual use. Obviously, data from different sources would be in different format and usually would be inconsistent. They need to be transformed into a common format. This is one of the major challenges data warehousing needs to sort out. Different processes such as data cleansing, data mapping, data validation etc happen for transforming data into a common format. The actual identification of relevant data happens at transformation stage by applying a set of rules. These rules or functions reject records that are not acceptable. A simple example is data cleansing by mapping Null to 0 and removing records with age value greater than 200. The rules or functions applied to validate the records vary based on the nature of the business and kind of target data expected. The nature of the data decides whether the data warehousing becomes simple or complex. The output of the Transformation process would be data that are ready to meet technical and business needs of the data warehouse.
Once the data is ready, it needs to be loaded into the target system. There are different methods to load data. The loading method is chosen mainly considering business needs. There are many ETL tools to manage ETL process and Oracle Warehouse Builder is an example. Using SQL insert statements, each record can be physically inserted as a new row into the table.
Data Storage Approach
There are two prominent methods for implementing data warehouses or for storing data in data warehouses. They are dimensional approach and normalized approach. In the dimensional approach, data is stored as facts or dimensions within data warehouse. Facts are number based transactional data whereas dimensions hold reference information that gives context to facts. Ralph Kimball made the dimensional approach more popular. In normalized approach, data is stored in third normal form in the data warehouse following database normalization rules. Data is divided into entities that are normally converted to different physical tables in a database. Normalized approach is also known as 3NF approach and this approach became popular by Bill Inmon.
Life Cycle
The entire process of data warehousing can be divided into various stages: design, prototype, deploy, operate and enhance. Data warehousing is not a one-time process and this cycle repeats every 12 to 18 months. During the design phase, the data analyst analyses business requirements in detail, conducts interview sessions with end users, defines critical business metrics and key performance indicators, etc and these acts as the inputs for the prototype stage. Prototyping is the stage where a working model of the designed data warehouse is developed. With this working model, end users could get a basic idea on the difference between what they asked for and what they are going to get. Normally, prototypes result in enhanced user satisfaction.
In the deployment stage, user approved prototype is formalized and becomes ready for actual production. All activities of a classical IT system deployment including documentation, training etc happen here as well. Operation stage includes the actual use of data warehouse which also involves day-to-day maintenance of the warehouse. During this stage, end users become familiar with different tools and services that are required to deal with data warehouse. Enhancement is the stage where modifications happen to schema, processes, design etc as required.
Advantages and Disadvantages
Business intelligence is an attractive term in today’s business world. It includes a number of methods to transform raw data into useful information. Actually, business intelligence becomes more reliable only with a clean and efficient data warehouse. In other words, success of business intelligence and data warehousing are closely related. With data warehousing, data becomes accessible at very simple to most complex levels. Thus, analysis and decision making becomes more accurate and productive. As data warehousing provides easy-to-use tools, users do not need to have lot of technical skills to use them. Thus, the process becomes lot more simple.
In data warehousing, data from different sources are stored at a central repository. Hence, large volume of historical data will be available for analysis and reporting which gives more accurate results. You can analyze trends and behaviors of different time periods to make better future predictions. Fast analysis and reporting is another major benefit of data warehousing. This helps business users to become more prolific as they need not have to spend much time to gather data and they can focus more on other productive tasks. In short, data warehousing helps organizations in quick, easy and efficient decision making.
Though data warehousing has many advantages, it is not a panacea for all problems of all companies. Data warehousing is actually a very expensive, complex, time consuming and risky process. Not every organization can afford such a complex and expensive system. Organizations need to consider the fact that data warehousing is not a one-time process, but requires constant maintenance. Moreover, if not implemented correctly, it could affect the whole business adversely.
Summary
If you could implement data warehousing successfully, you would be able to make critical decisions easily, quickly and proficiently. Thus, you would be able to increase profit, cut cost and stand out from the competition which is the ultimate objective of data warehousing.