In this post, Big Data professionals explain how to fix this misunderstanding.
C-level executives use the data collected by their BI and analytics initiatives to make strategic decisions that give the company a competitive advantage. Things get worse when that data is inaccurate or incorrect, because big data drives the company's big bets and shapes its direction and future. Bad data can yield wrong results and losses.
Some interesting facts and statistics about big data, data warehousing, and data quality-
- 90% of US companies use some sort of data quality solution today.
- Poor data quality costs the average organization $8.2 million every year.
- About 32% of the data is inaccurate.
- 46% of companies see data quality as a barrier to the adoption of BI products.
- About 40% of all business initiatives fail to achieve their target benefits due to poor data quality.
What explains the disconnect between the first point and the next four? If 90% of US companies are using some sort of data quality solution, why are so many companies still encountering bad data issues?
You must understand that data quality and data testing are two different things.
Data Quality (DQ) helps companies benefit from their client data by improving its quality through distinct capabilities, such as-
- Enhanced data entry
- De-duplication
- Real time validation
- Key enrichment capabilities
- Single customer view
In essence, data quality makes and keeps customer data –
- Complete
- Current
- Correct
- Unique
What benefits can companies get with data quality?
- Enhanced customer interaction
- Maintained reputation in the market
- Reduced costs
- Compliance with rules and regulations
- Optimized processes
Data quality tools –
- Profiling – Analyzing data to capture statistics or metadata about it.
- Parsing and standardization – Decomposing text fields into components and formatting them based on business rules and standards.
- Generalized “cleansing” – Modifying data values to meet integrity constraints, domain restrictions, or other business rules.
- Monitoring – Deploying controls to ensure that data continues to conform to business rules.
- Matching – Recognizing, linking, or merging related entries within or across data sets.
- Enrichment – Enhancing the value of the data by appending consumer geography and demographics.
- Subject-area-specific support – Standardization capabilities for specific data subject areas.
- Configuration environment – Capabilities for developing, managing, and applying data quality standards and rules.
- Metadata management – The ability to capture, reconcile, and correlate metadata that is relevant to or used in the quality process.
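To make the first item in the list concrete, here is a minimal sketch of what profiling a single column might look like. The `profile_column` helper and the sample `customers` records are hypothetical and exist only for illustration; commercial data quality tools do far more than this.

```python
from collections import Counter


def profile_column(rows, column):
    """Return basic profiling statistics for one column of a list of dict records."""
    values = [row.get(column) for row in rows]
    # Treat None and empty strings as missing values.
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "missing": len(values) - len(non_null),
        "distinct": len(counts),
        "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
    }


# Hypothetical sample records, for illustration only.
customers = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": "ann@example.com"},
    {"id": 4, "email": None},
]

email_profile = profile_column(customers, "email")
print(email_profile)
```

Statistics like these (missing values, distinct values, repeated values) are the kind of metadata that the cleansing, matching, and monitoring steps can then act on.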
Data testing is not the same as data quality. Data testing includes four things – data completeness, data transformation, data quality, and regression testing.
These four components are explained below-
Data Testing Methods
Many companies currently run data testing, data validation, and reconciliation processes because of their importance. The problem is that, despite all of the advances made in software for big data, databases, and data warehouses, the data testing process is still manual and carries the risk of producing bad data in bulk.
The two most prevalent methods for data testing are-
- Sampling or Stare and Compare
- Minus Queries
In the sampling method, the tester writes SQL code to extract data from the source and from the DWH or Big Data store. The tester then dumps the two result sets into Excel and compares them by eye. Because one test query can return as many as 40 billion data sets, and most teams have hundreds of these tests, the sampling method validates only a fraction of 1% of the data. This means that companies cannot count on this method as a reliable way to find data errors.
In the MINUS method, the tester queries the source data and the target data and subtracts the first result set from the second to find the set difference. If there is no difference, no result set remains. The tester then repeats the MINUS query in the other direction, subtracting the second set from the first. This has its value; however, the test results may not be accurate when there are duplicate rows. Moreover, the tester cannot produce historical data and reports using the MINUS method, and this is a real concern for audit and regulatory reviews.
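The duplicate-row weakness is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module, where the set-difference operator is spelled `EXCEPT` (Oracle spells it `MINUS`). The table names `source_orders` and `target_orders` and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER, amount INTEGER);
    CREATE TABLE target_orders (order_id INTEGER, amount INTEGER);
    -- The source holds order 2 twice; the target holds it only once.
    INSERT INTO source_orders VALUES (1, 100), (2, 250), (2, 250);
    INSERT INTO target_orders VALUES (1, 100), (2, 250);
""")


def minus(left, right):
    """Rows of `left` that do not appear in `right` (set difference)."""
    # Table names are fixed, hypothetical identifiers, not user input.
    query = (
        f"SELECT order_id, amount FROM {left} "
        f"EXCEPT SELECT order_id, amount FROM {right}"
    )
    return conn.execute(query).fetchall()


source_minus_target = minus("source_orders", "target_orders")
target_minus_source = minus("target_orders", "source_orders")
source_count = conn.execute("SELECT COUNT(*) FROM source_orders").fetchone()[0]
target_count = conn.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0]

print(source_minus_target, target_minus_source)  # both differences are empty
print(source_count, target_count)                # yet the row counts differ
```

Both differences come back empty because set difference works on distinct rows, so the test passes even though one row was lost in the load. Adding a row-count comparison catches this particular case, but it still leaves no record of the run for later review.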
What can be done? What is the solution?
Big Data service professionals and software vendors are turning to automated data testing. Companies can use these testing solutions to automate data testing and save time. This helps company management achieve better data quality, lower data costs and bad data risks, significant ROI, and better sharing of data health information.
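As a rough idea of what such automation involves, the sketch below compares two result sets in a duplicate-aware way and keeps a timestamped record of every run, addressing the two weaknesses of the MINUS method described earlier. It is not how any particular vendor's product works; `compare_rows`, `run_check`, and the sample rows are hypothetical names and data chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def compare_rows(source_rows, target_rows):
    """Duplicate-aware comparison: counts how often each row occurs on each side."""
    source_counts = Counter(source_rows)
    target_counts = Counter(target_rows)
    return {
        # Counter subtraction keeps only rows with a positive surplus.
        "missing_in_target": dict(source_counts - target_counts),
        "unexpected_in_target": dict(target_counts - source_counts),
    }


def run_check(name, source_rows, target_rows, history):
    """Run one comparison and append a timestamped result to the history list."""
    diff = compare_rows(source_rows, target_rows)
    result = {
        "check": name,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "passed": not diff["missing_in_target"] and not diff["unexpected_in_target"],
        **diff,
    }
    history.append(result)
    return result


# Hypothetical rows as they might be fetched from a source and a target table.
source = [(1, 100), (2, 250), (2, 250)]
target = [(1, 100), (2, 250)]

history = []
result = run_check("orders_load", source, target, history)
print(result["passed"], result["missing_in_target"])
```

In practice the history would be written to a database or report rather than kept in a list, so that past runs remain available for audits.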