Very often, when we talk or write about data quality in our industry, the discussion stays superficial. That leaves a lot of room for misunderstandings, which can make the discussion as a whole pointless. This post aims to help classify the arguments in the quality discussion and bring more depth to it.
The Five Dimensions of Quality
Let’s start with a general framework of quality to locate arguments in their semantic field. According to David Garvin (1984), there are five principal approaches to quality.
According to the transcendent approach, quality is defined as an innate excellence that is absolute and universal. “High quality data needs to be perfect and flawless.” A general problem is that it’s actually pretty hard to tell what “perfect data” looks like and how to achieve it. However, this approach is fairly common in research. Treating validity as a “transcendent goal”, for example, very often leads to the difficulty of finding a good trade-off between internal and external validity.
The product-based approach views quality as a result of the right ingredients and attributes of the product, in our case data. “High quality data has carefully selected respondents in the sample, who write a lot of words into open text fields.” Here, data quality is quite tangible and can be measured precisely. However, this understanding of quality is very formalistic and, therefore, too superficial.
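To make the “measured precisely” point concrete, here is a minimal sketch of one such product-based metric: the number of words respondents write into an open text field. The function name and the sample answers are hypothetical and only serve as an illustration; a real project would define its own attributes.

```python
from statistics import mean


def open_text_word_counts(responses):
    """Return the word count of each open-text answer.

    Missing (None) or empty answers count as 0 words.
    """
    return [len((text or "").split()) for text in responses]


# Hypothetical open-text answers from four respondents
answers = ["Great service and fast delivery", "ok", "", None]

counts = open_text_word_counts(answers)
print(counts)        # [5, 1, 0, 0]
print(mean(counts))  # 1.5
```

The metric is easy to compute and compare, which is exactly the appeal of the product-based view. It also shows the limitation: a long answer is not necessarily an insightful one.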
The user-based approach starts from the premise that different users may have different wants and requirements. Here, the highest data quality is what best satisfies these needs. Hence, data quality is highly individual and subjective: high quality for one user can be average or poor data quality for another.
The manufacturing-based definition focuses on the process of producing data or, in research terminology, on methodology. “Good data is collected in adherence to the scientific standards and the best practices of our industry.” While this approach makes data highly comparable, it sometimes doesn’t fit the researcher’s task at hand.
Last but not least, there is the value-based approach, which sees quality as a positive return on investment (or more specifically: Return on Insight). Here, data is of high quality if the cost of collecting it is minimal while the benefit from using it is maximal. At first sight, this approach seems reasonable, but it also has its downsides: it doesn’t tell us much about the properties of the data itself, but more about the information needs of the user.
Competing Views on Quality
These approaches very often lead to competing views on quality. Data collectors, for example, may pay attention to methodology and data formats, while research buyers tend to focus on their individual needs and the Return on Insight. And even within companies, there can be different perspectives. Members of the sales or marketing department may see the customers’ perspectives as paramount, while project managers see quality as well-defined specifications and processes. Being aware of these different views can help to improve communication about quality, and consequently improve the quality itself.
But even if you have everyone on the same page, you may have difficulty finding the right approach. Let’s take observation data as an example. This method can be the best choice to answer your research questions, but you may also run into the problem of complex data formats, missing values or outliers. This, in turn, can have an impact on the Return on Insight and demand a different approach.
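As an illustration of the missing-value and outlier problem, the following sketch checks a small series of observations. Everything in it is an assumption made for the example: the data (hypothetical observation durations in seconds, with None marking a missing value), the function names, and the choice of the common 1.5 × IQR rule for flagging outliers, which is only one of several possible rules.

```python
from statistics import quantiles


def missing_share(values):
    """Share of observations that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Return the non-missing values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # quartiles of the observed values
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]


# Hypothetical observation durations in seconds; None marks a missing value
durations = [12, 15, None, 14, 13, 240, None, 16]

print(missing_share(durations))  # 0.25
print(iqr_outliers(durations))   # [240]
```

Whether the flagged value is an error or a genuinely interesting observation is not something the rule can decide. That judgment depends on the research question, which is why such checks support the quality discussion rather than settle it.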
To keep it short, it’s not easy to tell what data quality actually is. Everyone claims to have it, but a closer look reveals that the corresponding arguments very often fall apart. It would probably be naïve to merely call for a more holistic perspective, as the different approaches are in inherent tension with each other. That doesn’t mean data quality is just an illusion or arbitrary, but it reminds us that data quality requires some effort and doesn’t fall into place by itself. In any case, good data quality starts with good communication of what is expected.
Let’s bring some quality into the quality discussion!
Continue: What is Data Quality? (Part 2/2)