Data quality software is any tool designed to improve the accuracy, completeness, relevance, and/or consistency of an organization’s data. Most data quality tools fall into one of three general categories: data cleansing, data auditing, and data migration.
Some data tools will focus on one category; however, as data analytics technology continues to mature, cross-functional solutions are becoming more prevalent. Before you choose a data quality solution, you’ll want to understand which of these areas you need help with.
Sometimes called data scrubbing or data cleaning software, data cleansing software is generally focused on the removal and/or correction of low-quality data. Data cleansing may or may not be performed as part of data wrangling, the manual conversion or mapping of data from one “raw” form into another to enable more convenient processing and analysis with automated tools. Data cleansing most often occurs in the intermediate staging area during the Extract-Transform-Load (ETL) process, but it can also be used to cleanse data in a cube, warehouse, or lake.
Data cleansing tools are essential for any data scientist; without automated data cleansing, data governance would be nearly impossible. Can you imagine de-duplicating, appending, and maintaining a database of 40 million records manually?
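To make the idea of automated cleansing concrete, here is a minimal sketch in Python of two common steps: standardizing fields and removing duplicates. The field names (`name`, `email`) and the rule of treating email as the duplicate key are assumptions chosen for illustration; a commercial tool would offer far richer matching logic.

```python
def normalize(record):
    """Return a copy of the record with trimmed, case-normalized fields."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def cleanse(records):
    """Normalize every record, then keep the first record seen per email."""
    seen = set()
    cleaned = []
    for record in records:
        clean = normalize(record)
        if clean["email"] in seen:
            continue  # duplicate of an earlier record, so skip it
        seen.add(clean["email"])
        cleaned.append(clean)
    return cleaned


# Illustrative sample data (assumed, not taken from any real system).
raw = [
    {"name": "  ada   lovelace", "email": "Ada@Example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "grace hopper", "email": "grace@example.com"},
]
cleaned = cleanse(raw)
```

The first two sample records collapse into one once whitespace and letter case are standardized, which is the kind of near-duplicate that is tedious to catch by hand at scale.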
Data cleansing is not to be confused with data sanitization, which involves the removal of specified information (such as names or contact information), usually for the purposes of privacy or security. Nor should it be confused with data validation; while similar, validation is usually performed only on newly acquired data and implies removal rather than replacement or appending. Occasionally, the data cleansing process is referred to as data pre-processing, a catch-all term that describes preparation for human-directed data mining.
Data scientists use data cleansing software to solve a number of problems with data quality:
When evaluating data cleansing tools, it’s important to understand how data scientists approach a project. A project typically begins with the detection and removal of major errors and inconsistencies, and it should be supported by automation that limits manual inspection and extensive programming.
Furthermore, any proper approach should be extensible — it is often outside data that enables data cleansing (e.g. address verification using public records), so the data cleansing tool you select should be able to work with these supplemental datasets in addition to your own internal data. Finally, the data cleansing tool you select should enable schema-related data transformations and a workflow that helps you replicate and execute your process in perpetuity.
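As a rough illustration of cleansing against outside data, the sketch below checks a record's city against a small reference table keyed by postal code. The reference table, the field names, and the status labels are all hypothetical stand-ins for whatever supplemental dataset and conventions a real tool would use.

```python
# Hypothetical reference data standing in for an external source
# such as public records.
REFERENCE_POSTCODES = {"10001": "New York", "60601": "Chicago"}


def verify_city(record, reference):
    """Verify or correct the city field using the reference dataset."""
    expected = reference.get(record["postcode"])
    if expected is None:
        # The outside data has no entry, so the record cannot be checked.
        return {**record, "status": "unverified"}
    if record["city"].strip().lower() == expected.lower():
        return {**record, "city": expected, "status": "verified"}
    # The internal value disagrees with the reference: replace it.
    return {**record, "city": expected, "status": "corrected"}


ok = verify_city({"postcode": "10001", "city": "new york"}, REFERENCE_POSTCODES)
fixed = verify_city({"postcode": "60601", "city": "Chicgo"}, REFERENCE_POSTCODES)
unknown = verify_city({"postcode": "99999", "city": "Nowhere"}, REFERENCE_POSTCODES)
```

Returning a status alongside the corrected value keeps a record of which rows were changed, which matters when the same workflow is rerun on new data.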
While data cleansing is arguably the most important of the data quality categories, it was not the first; many of the earliest data quality tools were simple database auditing software solutions.
Originally developed for the financial industry by companies such as CaseWare and ACL Services, data auditing tools help data scientists detect fraud, maintain compliance with business or regulatory standards, and support data discovery and modeling. Data scientists use data auditing tools to process and represent data in various reports and visualizations for internal and external stakeholders. Data auditing software is sometimes called data query, data examination, data profiling, data verification, or data monitoring software.
No data cleansing project or quality initiative is possible without a tool to digest and represent data in various forms. So-called “deep-dive” analytics, risk analysis, quality control, performance management, and change tracking are all performed using database auditing tools.
Perhaps one of the most useful features of any database auditing tool is the audit trail, or record of changes made to the schema or data. Essential for determining when and where erroneous data was introduced to the database, audit trail capabilities are a requirement for modern data governance.
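One simple way to picture an audit trail is a database trigger that logs every change to a separate table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, and sample values are assumptions for illustration, not the design of any particular auditing product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);

-- Each row records one change: when it happened and the old and new values.
CREATE TABLE audit_trail (
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    customer_id INTEGER,
    old_email TEXT,
    new_email TEXT
);

-- The trigger writes to the audit table whenever an email is updated.
CREATE TRIGGER log_email_change AFTER UPDATE OF email ON customers
BEGIN
    INSERT INTO audit_trail (customer_id, old_email, new_email)
    VALUES (OLD.id, OLD.email, NEW.email);
END;
""")

conn.execute("INSERT INTO customers (id, email) VALUES (1, 'ada@example.com')")
# An erroneous update (note the typo) that the audit trail will capture.
conn.execute("UPDATE customers SET email = 'ada@exmaple.com' WHERE id = 1")

trail = conn.execute(
    "SELECT customer_id, old_email, new_email FROM audit_trail"
).fetchall()
```

Querying `audit_trail` afterward shows which record changed and what the previous value was, and the `changed_at` column records when the change occurred.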
As data auditing technology has matured, vendors have added more advanced functionality. For example, the decreasing cost and increasing abilities of online analytical processing (OLAP), the commoditization of server hardware, and the rise of Hadoop have enabled real-time data monitoring and sophisticated modeling that would have been prohibitively expensive a few short years ago.
Data migration tools, sometimes referred to as data integration, database migration, or data transfer software, help data scientists aggregate various datasets into a data warehouse, cube, or in-memory database for the purpose of analysis, cleansing, storage, etc.
These tools should not be confused with consumer products of the same name, which are intended to migrate files or applications from one system to another, such as when transferring documents from an old PC to its replacement.
Without migration capabilities, any data quality or cleansing initiative must be repeated ad nauseam for each source. Any analysis or auditing must be performed on each silo and would require human observation or tedious spreadsheet merging.
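The following sketch shows the basic shape of a migration step: records from two silos with different field names are mapped onto one schema and loaded into a single table. The silo names, field names, and the use of an in-memory SQLite database as a stand-in for a warehouse are assumptions made for illustration.

```python
import sqlite3

# Hypothetical silos that describe the same kind of entity with different fields.
crm_rows = [{"customer_id": 1, "email": "ada@example.com"}]
billing_rows = [{"id": 2, "contact_email": "grace@example.com"}]


def load_into_warehouse(conn, crm, billing):
    """Map each silo onto a shared schema and load it into one table."""
    conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, source TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, 'crm')",
        [(r["customer_id"], r["email"]) for r in crm],
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, 'billing')",
        [(r["id"], r["contact_email"]) for r in billing],
    )
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load_into_warehouse(warehouse, crm_rows, billing_rows)
rows = warehouse.execute(
    "SELECT id, email, source FROM customers ORDER BY id"
).fetchall()
```

Once the data sits in one table, a single query or cleansing pass covers every source, and the `source` column preserves where each record came from.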
Before you can begin to consider and compare data quality software vendors, you must identify the various problems you want your new tool to solve. Solutions to some business problems can require multiple data quality functions.
For example, a merge/purge operation (combining multiple datasets and detecting duplicates) would involve functions from all three data quality software categories: data auditing helps discover missing or duplicated values; data cleansing tools help remove the duplicates and rectify incorrect or missing values; and a migration function moves the audited, cleansed data to a data warehouse.
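The merge/purge flow described here can be sketched as three small functions, one per category. This is a simplified illustration under stated assumptions: the field names and sample records are invented, email is assumed to be the duplicate key, records with a missing email are dropped rather than rectified, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3
from collections import Counter


def audit(records):
    """Auditing: report duplicate and missing values without changing the data."""
    counts = Counter(r["email"].lower() for r in records if r["email"])
    return {
        "duplicates": sum(n - 1 for n in counts.values() if n > 1),
        "missing_email": sum(1 for r in records if not r["email"]),
    }


def purge(records):
    """Cleansing: drop records without an email, keep the first record per email."""
    unique = {}
    for r in records:
        if r["email"]:
            key = r["email"].lower()
            unique.setdefault(key, {**r, "email": key})
    return list(unique.values())


def migrate(records, conn):
    """Migration: load the audited, cleansed records into the target database."""
    conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
    conn.executemany("INSERT INTO customers VALUES (:name, :email)", records)
    conn.commit()


# Two hypothetical source lists to be merged.
list_a = [{"name": "Ada", "email": "ada@example.com"}, {"name": "Grace", "email": ""}]
list_b = [
    {"name": "Ada L.", "email": "ADA@example.com"},
    {"name": "Alan", "email": "alan@example.com"},
]

merged = list_a + list_b
report = audit(merged)
clean = purge(merged)
warehouse = sqlite3.connect(":memory:")
migrate(clean, warehouse)
loaded = warehouse.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Running the audit before the purge gives a baseline count of problems, so the effect of the cleansing step can be checked against it.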
BSI is the United Kingdom’s National Standards Body and the originator of many of the world’s most commonly used standards. The company works with more than 64,000 clients in 150 countries. BSI used Oracle’s enterprise data quality management suite to create a single, accurate, complete record of each of these clients in just one month. As a result, the accuracy of its customer and corporate data has improved to nearly 100 percent, and BSI can refresh information four times faster.
The project began with a simple recognition: BSI wanted to optimize customer insight by creating master customer records that captured each client’s profile, purchasing history, relationships, and other attributes in a single view. The goal was to eliminate inaccurate, incomplete, nonstandard, multi-format, and duplicate customer and transactional data from the customer database, which was growing 3-4 percent each year. Senior officers knew that having complete and accurate data would improve the organization’s ability to segment customers, increase marketing effectiveness, boost sales per customer, and reduce churn. In addition, by standardizing names, dates, and values in corporate and customer records, BSI could raise the performance and productivity of its teams and better match its services to customer needs.
The need for standardization also extended to the publications, training documents, tools, and services that BSI sold online. BSI needed to ensure consistent coding, description, and pricing formats in its electronic catalogue. BSI wanted to complete this master data management (MDM) project using its existing staff resources, despite a 20 percent year-over-year increase in data volume.
BSI chose Oracle Enterprise Data Quality, a suite of automated cleansing, matching, and profiling solutions, for its adaptability and value. IT professionals used the software to aggregate more than five million records held in multiple databases and formats into a set of “golden” customer records that span a four-year period and include transaction histories of all products and services.
The results have been impressive:
BSI used the Oracle technology to deliver a single customer view to 2,000 workers in 60 countries, while enforcing best practices in data governance and management within the organization.
BSI eliminated inaccurate, incomplete, nonstandard, multi-format, and duplicate data for a client database that is growing 3-4 percent annually.
Better marketing campaigns have improved sales by boosting cross- and up-selling opportunities and, in turn, providing a more complete customer experience.
BSI has standardized product categorizations to accommodate a 20 percent expansion in data volume each year, without adding staff.
This is by no means an exhaustive list, and your unique business requirements may also be served by an unlisted solution.
At TechnologyAdvice, we’ve extensively researched the data quality software market. We’ve compiled product information, reviews, case studies, feature lists, video walkthroughs, and more — all to help you and other software buyers make the purchase decision that best fits your requirements.