Friday, September 7, 2012

Data Quality and Data Quality Assurance

This post covers data quality and how to go about assuring it. Watch my video, Data Quality Concepts (in which I explain multiple examples in detail), or read on...

First, let us understand data quality. Put simply, data are of high quality if they do not suffer from data issues. There are many potential issues with data (see examples below). Now, data are used for a number of organizational functions such as ongoing operations, dealing with customers, marketing, analysis, and decision making. If the data are not of high quality, a number of problems follow. Users get incorrect reports. Time and money are wasted on miscommunication. Bad data can lead to poor decisions. It can frustrate employees and, most importantly, it can frustrate customers.

Although data quality assurance is particularly useful for production databases, it applies equally well in software testing, since software testers need to ensure high data quality in gold test databases. Now, let us see examples of data issues that bring down data quality.

1. Incompleteness
Customer (FirstName, LastName, Billing Address, Shipping Address, Email)
Dave| |111 Main Street, London|NULL|dave@example.co --> Missing last name and shipping address; truncated email address
Robert|Smith|Billing Address|NULL|bob@example.com  --> Default billing address and missing shipping address
Also, the Customer table may miss records that exist in the source data. Or, the Customer table may not be updated with the latest data in the source(s).

2. Inaccuracy
Employee (EmployeeID, Name, Position, Department, Email)
1|Anna|Product Management|Business Analyst|Anna@example.com --> The position and department are switched.
2|Berry|Sftware Engineer|Product Development|Berry@example.com --> The position has a typo.
3|Charles|Test Analyst|^&*^$$%|Charles@example.com --> The department has a junk value.

3. Lack of consistency
Order (OrderNumber, CustomerID, ShipDate, Total)
1000|1|21/9/2012|999$
1001|2|10/9/2012|1000€ --> The total is not in the default currency.
OrderDetails (OrderNumber, ProductNumber, Quantity)
1002|1|100 --> Order number 1002 does not exist in the Order table (breaks referential integrity).

4. Invalidity
Product (ProductCode, Name, UnitsAvailable, RetireDate)
001|Widget1|-10|31/12/2012 --> The number of units available is negative.
002|Widget2|100|31/3/2009 --> The product is retired and no longer available.

5. Redundancy
Customer (FirstName, LastName, Email)
John|Smith|john@example.com
Jack|S.|john@example.com --> This record is a duplicate of the other record.

6. Not standard
Order (OrderNumber, CustomerID, ShipDate, Total)
1000|1|21/9/2012|19.99
1001|2|10/21/2012|24.9999 --> The shipping date and total are not in the required format.
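Several of the issues above (lack of consistency, invalidity, and non-standard values) can be caught with simple rule-based checks. Here is a minimal sketch in Python that validates the Order rows above against two hypothetical rules: ship dates must be in dd/mm/yyyy format and totals must be plain numbers with at most two decimal places. The function name and rules are illustrative, not part of any particular tool.

```python
import re
from datetime import datetime

def check_order(ship_date, total):
    """Return a list of rule violations for one Order row (illustrative rules)."""
    issues = []
    # Rule 1: ship date must parse as dd/mm/yyyy.
    try:
        datetime.strptime(ship_date, "%d/%m/%Y")
    except ValueError:
        issues.append("ship date not in dd/mm/yyyy format")
    # Rule 2: total must be a plain number with at most two decimals.
    if not re.fullmatch(r"\d+(\.\d{1,2})?", total):
        issues.append("total not a number with at most two decimals")
    return issues

print(check_order("21/9/2012", "19.99"))     # [] -- the row passes
print(check_order("10/21/2012", "24.9999"))  # both rules violated
```

The second call fails both checks, matching the non-standard row in the example above.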

With the above understanding, let us see the process of data quality assurance. This process has three main activities:

A. Data Profiling
Data profiling involves exploring the data to understand and analyze it. The data issues are identified. The data is summarized in terms of table row counts, blank values, duplicates, and so on. Data quality decreases as an individual data issue becomes more widespread. For example, data quality is higher if the Customer table is missing 2% of the source rows than if it is missing 20%. In fact, data quality can be expressed in data quality metrics such as % Complete, % Accurate, and % Conformance (to business rules, e.g. value >= lower bound, value within a specified list, etc.).
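To make the metrics concrete, here is a rough sketch of profiling a small in-memory Customer table (built from the examples above). Real profiling would run against the database with SQL or a tool; the field names and rows here are just for illustration.

```python
# Sample Customer rows drawn from the examples above (illustrative only).
rows = [
    {"FirstName": "Dave",   "LastName": "",      "Email": "dave@example.co"},
    {"FirstName": "Robert", "LastName": "Smith", "Email": "bob@example.com"},
    {"FirstName": "John",   "LastName": "Smith", "Email": "john@example.com"},
    {"FirstName": "Jack",   "LastName": "S.",    "Email": "john@example.com"},
]

row_count = len(rows)
# Count blank values in one column (here, LastName).
blank_last_names = sum(1 for r in rows if not r["LastName"].strip())
# Count duplicate email values.
duplicate_emails = row_count - len({r["Email"] for r in rows})
# Express one data quality metric: % Complete for LastName.
pct_complete = 100 * (row_count - blank_last_names) / row_count

print(f"Rows: {row_count}")                             # Rows: 4
print(f"% Complete (LastName): {pct_complete:.0f}%")    # 75%
print(f"Duplicate emails: {duplicate_emails}")          # 1
```

The same counts, run per column over the whole table, produce the % Complete and duplicate figures mentioned above.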

B. Data Cleansing
Data cleansing involves fixing the data issues. Simple updates fix some of them, e.g. misspellings/typos, incorrect formats, and missing foreign keys. In case of duplicates, one of the rows is selected as the master. Depending on its relevance, the data in the other rows is either used to enrich the master or discarded. Selective imports bring in the missing entities from the data source(s). Out-of-date data is deleted.
It is important to note that, because of the risk associated with database changes, fixes are not applied to the database directly. Each proposed change is listed. Then, each change is reviewed by a domain expert (a.k.a. data steward) and either approved and applied or rejected.
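This review workflow can be sketched as a queue of proposed changes that only a steward's approval releases. The change records and the approval rule below are hypothetical; a real data steward makes the decision, not code.

```python
# Proposed fixes queued for steward review (illustrative records).
proposed_changes = [
    {"table": "Employee", "key": 2, "column": "Position",
     "old": "Sftware Engineer", "new": "Software Engineer"},
    {"table": "Customer", "key": "Jack", "column": "Email",
     "old": "john@example.com", "new": None},  # duplicate flagged for merge
]

def steward_review(change):
    # Stand-in for the domain expert's judgment; here we approve only
    # changes that carry a concrete replacement value.
    return change["new"] is not None

approved = [c for c in proposed_changes if steward_review(c)]
rejected = [c for c in proposed_changes if not steward_review(c)]
print(f"Approved: {len(approved)}, Rejected: {len(rejected)}")
```

Only the approved list would ever be turned into database updates; the rejected changes go back for further investigation.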

C. Data Monitoring
Data cleansing is not the last activity. The data has to be maintained in a clean state. Instead of the data steward taking the entire responsibility for data cleanliness, responsibilities are assigned to the actors who interact with the data. Users entering new data carefully check business rules and look for duplicates before entering it, to avoid introducing issues. Users who consume the data submit data issues for review and correction. In addition, software can run checkers (e.g. parser, spell-checker, business rules checker, duplicate checker, etc.) to flag data issues to the data steward for review and correction.
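The automated side of monitoring can be sketched as a set of checker functions that each scan the data and report rows for the steward to review. The checkers and the email rule below are simple illustrations, not a production validator.

```python
import re

def email_checker(rows):
    """Flag rows whose email does not look like name@domain.tld (rough rule)."""
    pattern = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
    return [r for r in rows if not pattern.fullmatch(r["Email"])]

def duplicate_checker(rows):
    """Flag rows whose email was already seen earlier in the data."""
    seen, dupes = set(), []
    for r in rows:
        if r["Email"] in seen:
            dupes.append(r)
        seen.add(r["Email"])
    return dupes

rows = [
    {"Name": "Dave", "Email": "dave@example.co"},
    {"Name": "John", "Email": "john@example.com"},
    {"Name": "Jack", "Email": "john@example.com"},  # duplicate of John's
    {"Name": "Anna", "Email": "anna@example"},      # invalid: no top-level domain
]

# Each checker's findings go to the data steward for review and correction.
for checker in (email_checker, duplicate_checker):
    for issue in checker(rows):
        print(f"{checker.__name__}: review {issue['Name']}")
```

In practice such checkers would run on a schedule against the live tables, so new issues surface shortly after they are introduced.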

Lastly, let us see some challenges faced in data quality assurance:

a. Small sample size and lack of tools
Profiling data manually with SQL queries is effort intensive. Therefore, only a small subset of the entire data is profiled, and the data issues hiding in the rest of the data are never identified and fixed. However, tools such as SQL Server 2012 Data Quality Services have recently arrived; they do not require writing SQL queries and can run checks on the entire dataset quickly.

b. Lack of domain knowledge
Data stewards may lack domain knowledge, resulting in incomplete checks on the data and missed data issues. Now, ready-made knowledge bases are available from service providers, containing business rules for common domains such as Address, ZIP code, etc. Data stewards may even create their own knowledge bases, documenting their domain knowledge as business rules and matching policies.