Wednesday, February 18, 2015

Big Unstructured Data v/s Structured Relational Data

Data warehousing has become an essential part of any organization in today’s world. To understand data warehousing, we first need to understand databases, since warehouses are usually built on top of them. The data warehouse is then used for analytics and reporting. But before we talk about data warehousing, let’s look at the kind of data that organizations generate these days.



There was a time when only structured data was stored and analyzed. But as technology advances, extracting valuable information from unstructured data has become the norm as well. Before we get to extracting useful information from unstructured data, let’s first look at the difference between structured and unstructured data.

Unstructured Data


One of the most common ways of filing data is storing it in an unstructured form. Data is called unstructured when it has no identifiable structure associated with it. It is usually described as data that can’t be stored in rows and columns in a relational database. Examples include a document archived in a file folder, as well as images, audio and video.

Structured Data


Structured data follows a predefined schema for storage. A relational database system, for instance, stores fully structured data. Designing a database schema is a process in itself: the database designer has to define the schema based on the type and structure of the data and its relationships. The basic purpose of a well-defined schema is efficient processing of the data and ease of navigation through the database.

Comparison


Based on these brief descriptions, one apparent advantage of unstructured data is that no extra effort is required to classify it, whereas structured data needs a well-defined schema to be put in place first. On the other hand, structured data is a lot easier to navigate than unstructured data. Unstructured data is also highly flexible and comparatively more scalable.

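There is also something called semi-structured data. This type of data doesn't require a predefined schema, but it carries some organizing markers of its own (JSON and XML documents are common examples), so a schema can be defined for it if needed.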
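The three kinds of data described above can be contrasted in a few lines of Python. This is only a minimal sketch: the `customer` table, the sample records and the `field_names` helper are made up for illustration, and SQLite stands in for a relational database.

```python
import json
import sqlite3

# Structured: every row must fit a schema that is declared up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)", (1, "Asha", "Pune")
)
structured_row = conn.execute("SELECT id, name, city FROM customer").fetchone()

# Semi-structured: the record labels its own fields, and fields may differ
# from one record to the next. No schema has to exist beforehand.
semi_structured = json.loads('{"id": 2, "name": "Ben", "tags": ["new", "online"]}')

# Unstructured: free text with no fields that can be queried directly.
unstructured = "Ben called on Monday to ask about delivery to his new address."


def field_names(record):
    """Return the field names a record exposes, or an empty list for raw text."""
    if isinstance(record, dict):
        return sorted(record)
    return []
```

Asking `field_names` about the JSON record returns its keys, while the free-text note yields nothing to navigate by, which is the navigation trade-off the comparison describes.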

Data Warehousing


A data warehouse is used for online analytical processing (OLAP), whereas operational databases are used for online transaction processing (OLTP). A data warehouse primarily consists of aggregated historical data that is optimized for specific types of analysis. What is stored in the data warehouse depends mainly on the client/user requirements: what the user wants to see in the output, and at what levels of aggregation.

A typical data warehouse stores the following types of data:

Historical Data
Derived Data
Metadata

Historical Data – An organization typically stores several years of historical data in its data warehouse. Factors such as the storage infrastructure and the analysis needed to meet client requirements determine how much of that historical data is made available. This kind of data can come from transactional database archives, among other sources. Summary data is also derived from historical data, and most of the data in an organization revolves around this data type. Transactions make up the major chunk of an organization’s data volume.

Derived Data – This type of data is generated from existing data, usually by applying some data transformation technique or mathematical operation. Derived data is typically used to reduce the response time of a query or to support database maintenance operations. The volume of such data depends on the requirements. If performance is of key importance and a lot of information has to be derived from existing data, storing the derived data saves processing time when queries run.
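As a rough illustration of derived data, the sketch below computes monthly totals once and stores them in their own table, so later queries read the stored result instead of re-aggregating every transaction. The `sales` and `monthly_sales` tables, the sample rows and the `monthly_total` helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2015-01-05", "widget", 10.0),
        ("2015-01-20", "widget", 15.0),
        ("2015-02-03", "widget", 7.5),
        ("2015-02-03", "gadget", 20.0),
    ],
)

# Derived data: aggregate the detail rows once and keep the result as a table.
conn.execute(
    """
    CREATE TABLE monthly_sales AS
    SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total
    FROM sales
    GROUP BY substr(sale_date, 1, 7), product
    """
)


def monthly_total(month, product):
    """Look up a precomputed total; return 0.0 when no sales were recorded."""
    row = conn.execute(
        "SELECT total FROM monthly_sales WHERE month = ? AND product = ?",
        (month, product),
    ).fetchone()
    return row[0] if row else 0.0
```

The trade-off is extra storage and the need to rebuild `monthly_sales` whenever new sales arrive, in exchange for cheaper reads.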

Metadata – Data that describes stored data and other schema objects is called metadata. This type of data is also used by applications to access and compute the data properly.
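Most database engines expose their metadata through a catalog that applications can query. The sketch below uses SQLite as a stand-in (a real warehouse platform will have its own catalog views), and the `sales` table is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")

# sqlite_master is SQLite's catalog table: one row per schema object.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(sales)")}
```

Here `tables` lists the stored tables and `columns` maps each column name of `sales` to its declared type, which is the kind of information an application needs before it can read the data correctly.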

All the above listed types of data are stored in a data warehouse, which is modeled based on the given requirements. When it comes to analysis, the primary purpose of a data warehouse is to support strategic decision-making.

When analysis was done directly on transactional systems, there used to be several issues, such as slow queries and difficulty linking tables from separate systems. A data warehouse is designed specifically to address such issues:

In a data warehouse, all the data is centralized.
A data warehouse is designed to ease query writing and optimize reporting speed.
The linking of tables from different source transactional systems is facilitated by the key fields that the data warehouse creates when new records are added.
Derived data is stored at different levels of granularity, and can easily be rolled up to match the granularity of other data warehouse tables.
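The last two points can be sketched in plain Python. This is a simplified illustration under stated assumptions: the source system names (`billing`, `webshop`), their identifiers and the helper functions are all hypothetical, and a real warehouse would keep the key mapping in a dimension table rather than in memory.

```python
from collections import defaultdict
from itertools import count

# The warehouse hands out its own key for each (source system, source id)
# pair the first time it sees a record, and reuses it afterwards.
_next_key = count(1)
surrogate_keys = {}


def warehouse_key(source_system, source_id):
    """Return the warehouse-assigned key for a source record, creating it if new."""
    natural = (source_system, source_id)
    if natural not in surrogate_keys:
        surrogate_keys[natural] = next(_next_key)
    return surrogate_keys[natural]


# Daily facts arriving from two source systems with incompatible id formats.
daily_facts = [
    {"product_key": warehouse_key("billing", "P-17"), "date": "2015-01-05", "amount": 10.0},
    {"product_key": warehouse_key("webshop", 4021), "date": "2015-01-20", "amount": 15.0},
    {"product_key": warehouse_key("billing", "P-17"), "date": "2015-02-03", "amount": 7.5},
]


def roll_up_to_month(facts):
    """Aggregate daily facts to monthly granularity per product key."""
    totals = defaultdict(float)
    for fact in facts:
        month = fact["date"][:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[(fact["product_key"], month)] += fact["amount"]
    return dict(totals)
```

Because every fact carries a warehouse key instead of a source-specific id, rows from both systems can sit in the same table and be rolled up together.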

Limitations of using Data Warehousing


Despite all the advantages that data warehousing provides, there are certain areas where it falls short. Some of these disadvantages are listed below:

The data must be extracted, cleaned and loaded before it qualifies for storage in a data warehouse. This takes up most of the effort of building a data warehouse in the first place.
Proper training needs to be provided to the employees who use and maintain the data warehouse, since users vary in their skills and needs.
Since a data warehouse has to reconcile data from source systems that are incongruous with one another, it is usually quite difficult and complex to maintain.
A significant disadvantage of replicating data for use in a data warehouse is that the data contained in the warehouse might become inconsistent with the original sources. The updates are usually performed periodically, so if the analysis requires the most recent information, the warehouse may not provide the most accurate results.
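To give a feel for the preparation work mentioned in the first point of the list above, here is a minimal extract, clean and load sketch. The `fact_sales` table, the sample source rows and the cleaning rules (drop rows without a name or amount, normalize names) are assumptions chosen for illustration; real pipelines involve far more validation.

```python
import sqlite3


def extract(source_rows):
    """Pull rows out of the source system (here, just an in-memory list)."""
    return list(source_rows)


def clean(rows):
    """Drop incomplete rows and normalize the ones that remain."""
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip()
        amount = row.get("amount")
        if not name or amount is None:
            continue  # incomplete record: not fit for the warehouse
        cleaned.append({"name": name.title(), "amount": float(amount)})
    return cleaned


def load(conn, rows):
    """Insert cleaned rows into the warehouse table and return its row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (:name, :amount)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]


source = [
    {"name": "  asha ", "amount": "12.50"},
    {"name": "", "amount": "3.00"},
    {"name": "ben", "amount": None},
    {"name": "Chen", "amount": 7},
]
conn = sqlite3.connect(":memory:")
loaded = load(conn, clean(extract(source)))
```

Only two of the four source rows survive cleaning, which hints at why this stage consumes so much of the effort.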

So when a client’s needs are unpredictable, a data warehouse might not be the best approach to the solution.

Data Warehousing & its Future


Today’s business problems are becoming more complex than ever. This necessitates the development of better business intelligence and data warehousing tools. Let’s look at some of the promises and challenges that data warehousing holds for us in the future:

Real-time data warehousing: Data warehouses traditionally update their data on a periodic basis. This leaves a window during which the warehouse holds older data than the operational system. Real-time data warehousing means that data is made available more frequently. Near-real-time updates are possible, where data latency is typically in the range of minutes.
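Data latency can be made concrete as the gap between the warehouse's last refresh and the moment a query runs. The sketch below is illustrative only: the timestamps, the ten-minute threshold and the helper names are assumptions, not figures from any particular product.

```python
from datetime import datetime, timedelta


def data_latency(last_refresh, now):
    """How far the warehouse lags behind the operational system."""
    return now - last_refresh


def is_fresh_enough(last_refresh, now, max_latency):
    """True when the warehouse data is recent enough for the analysis at hand."""
    return data_latency(last_refresh, now) <= max_latency


now = datetime(2015, 2, 18, 15, 0)
nightly_refresh = datetime(2015, 2, 18, 2, 0)        # periodic batch load
micro_batch_refresh = datetime(2015, 2, 18, 14, 55)  # near-real-time load
```

With a ten-minute tolerance, `is_fresh_enough(nightly_refresh, now, timedelta(minutes=10))` is false while the same check on `micro_batch_refresh` passes.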

Software as a Service (SaaS): When IS applications are deployed as SaaS, the provider licenses its applications to customers as a service, on demand. Finding SaaS-based applications and resources that meet specific needs and requirements can be challenging. Software is becoming more agile by the day, and this significantly boosts the appeal and actual use of SaaS for data warehousing.


Cloud Computing: This is the newest trend in the market right now. Although it is fairly established for operational applications today, the cloud is not yet widely used for data warehouse platforms. Clouds can allocate resources dynamically, which helps when the data volume of a particular warehouse varies unpredictably and capacity planning is therefore difficult. Also, through the cloud, IS applications can scale up significantly based on requirements.
