Any finance director knows what happens if reports are run against the accounting system whilst period-end processing is underway: everything grinds to a halt.
For most business managers, the information that they require at their fingertips in order to run their operations comes from more than a single system. Also, the way the data is represented for transaction processing is often not suitable for really understanding what is going on.
These are the main issues that a data warehouse seeks to address. Many modern BI (Business Intelligence) tools can connect to multiple data sources and combine data from them to one degree or another; however, this is not a scalable solution, and in most cases the processing logic must be re-applied to each dashboard.
To complicate matters, database technologies designed for transactional workloads do not handle reporting and analytics very well, and trying to make them do so has an adverse effect upon the performance of transaction processing. The same can be said of the way data is organised within a database: a different design of database structure (or schema) is required to obtain optimal performance for each of the two workloads.
In simple terms, a data warehouse is a database, separate from your company’s transaction processing systems, that is optimised for reporting and analysis workloads rather than transactional tasks. Data may be combined from multiple data sources and transformed into a form that makes it easy and efficient to present business-relevant metrics and insight. Data warehouses do not necessarily use different technologies from transaction processing databases, but the higher-performance systems do.
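To make the idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes two hypothetical source systems (an order system and a customer system) whose rows are loaded into a separate, reporting-friendly database. The table names `fact_sales` and `dim_customer`, the function names and the sample data are all illustrative assumptions, not a recommended design.

```python
import sqlite3

def build_warehouse(orders, customers):
    """Load rows from two hypothetical source systems into one reporting database."""
    wh = sqlite3.connect(":memory:")  # separate from any transactional system
    wh.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT)")
    wh.execute("CREATE TABLE fact_sales (customer_id INTEGER, order_date TEXT, amount REAL)")
    wh.executemany("INSERT INTO dim_customer VALUES (?, ?)", customers)
    wh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", orders)
    wh.commit()
    return wh

def sales_by_region(wh):
    """A business-relevant metric: total sales per customer region."""
    rows = wh.execute(
        "SELECT d.region, SUM(f.amount) "
        "FROM fact_sales f JOIN dim_customer d ON d.customer_id = f.customer_id "
        "GROUP BY d.region ORDER BY d.region"
    ).fetchall()
    return dict(rows)

# Illustrative data only: (customer_id, order_date, amount) and (customer_id, region)
orders = [(1, "2024-01-05", 120.0), (2, "2024-01-06", 80.0), (1, "2024-02-01", 50.0)]
customers = [(1, "North"), (2, "South")]

wh = build_warehouse(orders, customers)
print(sales_by_region(wh))
```

The reporting query runs against the copy, so it places no load on the systems that take the orders in the first place.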
Business attitudes to the value of data changed radically in the nineties and noughties, due primarily to two factors. Firstly, computing became cheap enough to make it practical for companies of all sizes, rather than just the very largest corporations, to store and process large volumes of data. Secondly, companies that had large volumes of data, such as Tesco, started to leverage that data to understand, for example, customer behaviour, and therefore to provide highly targeted promotions that resulted in a significant uplift in business. Data thus became a valuable commodity, not just a byproduct of operations, and the data warehouse was a key component required to store and analyse large volumes of data for this purpose.
The noughties saw the digitisation of business accelerate, and with this came ever-expanding sources of data whose volume, variety and velocity overwhelmed traditional methods of storing and analysing it – even most data warehouses. This led to the coining of the term “Big Data” in 2005 and media hype suggesting that any significant business which had not jumped onto the Big Data bandwagon was doomed to fail. The hype was partially correct, in that any significant business that does not become data-driven (but not necessarily Big Data-driven) will almost certainly fall behind its competitors that do.
Big Data referred not just to structured data (which can be stored in a database) but also to semi-structured data (such as XML and JSON) and unstructured data such as documents, which require processing to derive structured “metadata” before they can be used for analysis purposes.
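As a small illustration of what turning semi-structured data into structured data involves, the sketch below flattens one nested JSON order into flat rows, one per line item, which could then be loaded into a database table. The document layout and the field names (`order_id`, `customer`, `items`, `sku`, `qty`) are assumptions made up for this example.

```python
import json

def flatten_order(doc):
    """Turn one nested order document into flat rows, one per line item."""
    order = json.loads(doc)
    return [
        {
            "order_id": order["order_id"],
            "customer": order["customer"]["name"],
            "sku": item["sku"],
            "quantity": item["qty"],
        }
        for item in order.get("items", [])  # an order with no items yields no rows
    ]

# Illustrative document only
raw = (
    '{"order_id": 17, "customer": {"name": "Acme"}, '
    '"items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'
)
rows = flatten_order(raw)
for row in rows:
    print(row)
```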
This led to companies hoarding data and documents without necessarily knowing what they were going to do with them. Without an idea of the business questions to be answered by the data, it was not possible to design effective storage structures (schemas), and hence the concept of the data lake was born, in which data is stored with the minimum structure required to contain it.
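The following is a deliberately simplified sketch of that idea, often described as applying the schema when the data is read rather than when it is written. A plain Python list stands in for the lake's storage, and the function names `land` and `read_as`, the source labels and the sample records are all hypothetical.

```python
import json

lake = []  # stand-in for a file store or object store

def land(source, payload):
    """Store a record as-is, with only enough structure to find it again."""
    lake.append({"source": source, "raw": payload})

def read_as(source, fields):
    """Apply a schema at read time: pick only the fields a given question needs."""
    out = []
    for entry in lake:
        if entry["source"] != source:
            continue
        record = json.loads(entry["raw"])
        # Fields missing from a record come back as None rather than failing.
        out.append({f: record.get(f) for f in fields})
    return out

# Illustrative records only; note that they do not share a common shape.
land("web", '{"user": "u1", "page": "/home", "ms": 120}')
land("web", '{"user": "u2", "page": "/pricing"}')
land("till", '{"store": 4, "total": 9.99}')

views = read_as("web", ["user", "page"])
print(views)
```

Nothing is rejected on the way in, which is what makes the lake easy to fill and, without some discipline, hard to make sense of later.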
I regard a data lake as a parking area for data that might be useful in the future, but whose purpose has not been determined to a point where a suitable schema can be designed.
I am going to ignore unstructured documents at this stage as relatively few companies are attempting to analyse them. For most, a data lake is still a parking area for structured, and possibly semi-structured, data.
As data volumes increase, so the lake gets bigger and the task of making sense of what it contains becomes ever more difficult, along with that of actually deriving business value from it. Once a company has lost track of what its data lake contains, or does not have the means to leverage it, the lake becomes a data swamp – contributing no value to the business.
At this point it is time to take a step back and consider what it was that you were trying to achieve in the first place. There are advantages in having access to the granular raw data that is often stored in a data lake, and the arguments for creating business-specific data schemas still apply, but data anarchy can only end in sorrow.
Next week we will discuss an alternative approach for structured and semi-structured data that combines the best of both worlds to create a modern data architecture, one that can support both the raw data and the structured schemas necessary to provide efficient business intelligence, without creating something that requires a raft of technologists to support it.