Blog > Garbage In, Garbage Out - Using Data Profiling
Garbage In, Garbage Out - Using Data Profiling
Friday May 24, 2019
I would image that everyone reading this has heard the adage “garbage in, garbage out” and many will know, through personal experience, exactly what this means. In the context of this blog series: you can’t make good decisions from bad data.
I talked about the data audit in my last blog. Having identified the sources of data available to you, the first step to determining its quality is to perform data profiling over the key elements that will be used to make decisions or feed other processes.
In its simplest form, data profiling identifies the completeness and consistency of a data element, as well as the type and range of values stored. A very simple example is that if an integer field is used to store a numerical representation of a month then all of the values should fall in the range 1-12. Data profiling should check this and also provide information relating to the distribution of values and the number of missing values, if any. Of course, missing values are not always a sign of a problem: ideally the “order cancel date” field should have the majority of its entries blank!
In more sophisticated scenarios, advanced analytics algorithms can be run across data to spot anomalies and outliers, which can then be investigated further to ensure that they are not errors. Algorithms can also be used to fill empty fields which, whilst imparting no relevant information to the record concerned, can at least maintain the statistics of the data element across the whole data set so that analysis results are not skewed by the synthesised values. This is generally used with numeric fields and is often necessary when using the data as input to a modelling algorithm as many of these cannot handle data sets with missing values. It is also important when calculating averages and the distribution of values, for example.
If data is to be combined across multiple sources then data quality issues usually become magnified and data profiling becomes even more important. The most common issue is the representation of the same thing in different ways across systems. Obvious examples include variations in company or product names and there are a number of tools available that will attempt to identify different representations of the same thing and present them as potential matches or duplicates. Some of these so-called “match-and-merge” tools can be configured to automatically merge information where the match is very strong however, in many (if not the majority of) cases, intervention by somebody from the business domain is required to determine if and how potential matches should be merged. The approach is similar when attempting to identify duplicate values (e.g. duplicate customer records) within a data set. In some cases a Master Data Management (MDM) approach would be appropriate and MDM tools will usually incorporate match-and-merge capabilities. The concept of MDM will be the subject for a future blog.
Another common problem which should be detected via data profiling is inconsistent data types, formats or units used to store what is ostensibly the same information in different systems. Looking at the range of values in a field is an extremely good way, for example, of spotting different units of measure and data types on both sides should be checked before trying to merge or relate fields between different sources.
I have mentioned this before, but, aside from structural issues, data quality, in the vast majority of cases, is not an IT issue, it is a business one. I have known a number of situations where IT has been tasked to “fix” dirty data without changes to the business processes that drive the capture or entry of that data, only to find that it deteriorates again rapidly over time.
The classic approach to missing data is to make fields mandatory at the point of entry. Anybody who has been through this process will know that this alone can result in bad data, which is worse than missing data as it is harder to detect, as people enter random information or select the easiest option. Unfortunately, it is also generally true that those responsible for capturing data are not the same people that will use it for analysis and decision making. Poor quality data generally results when somebody’s “day job” is hampered by mandating the capture of data when they do not see the relevance of it – particularly in time-stressed situations. What will help is for the implications of that data to other processes and decision making to be clearly communicated and frequently reinforced, however a business case approach should be adopted whereby the value of that data needs to be weighed up against the impact of capturing it.
Have you had to tackle issues with poor data quality? Has it had an impact upon your business? As ever, Like Share and comment or email us if you have specific questions.