What is Data Integrity?
This article was written by Adam Hart, JAPAC Strategic Advisor at MuleSoft.
Have you ever had to speak to a customer service representative to correct a personal detail, such as the misspelling of a name (e.g. Michelle, not Michael), a transposed street number (e.g. Unit 1/34, not Unit 34/1), or some other missing or incorrect detail? These are instances where data integrity has failed.
Other real-world consequences related to data integrity failures include:
- Difficulty recovering funds from a financial transaction due to an error in an account number.
- A failure in mobile payment due to an embedded smiley face emoji.
- Never receiving a parcel due to an error in addressing, for example using the wrong postcode for the same suburb name in a different state.
Ironically, the business policies designed to protect correct data can make fixing data mistakes painful, and government regulations protecting personally identifiable information (PII) make post-enrollment changes to PII difficult and labour-intensive for the customer. To avoid degrading the customer experience in these ways, we need strong data integrity.
Data integrity is a vital business process that counters the errors data accumulates as it is enrolled, replicated, and otherwise transcribed from real-world facts and events.
As organizations' digital processes become increasingly data-driven, especially through the use of machine learning, the ability to make effective data-driven business decisions depends ever more on the integrity of an organization's operational and analytical data.
This article will explore the role of data integrity in your organization.
What is data integrity?
Integrity means the data can be trusted and relied upon. In accounting, financial reporting standards require that the numbers in financial statements be accurate, complete, and consistent.
The same standards apply with data integrity. These factors can be applied to data to test its integrity:
- It is complete, with no missing data elements
- It is accurate, with no errors introduced at the source
- It is consistent across different contexts
- It is timely and up-to-date
When these four conditions are not met, the data fails integrity measures. Many such failures are not obvious and are sometimes only discovered through business process breakdowns or comprehensive data profiling efforts.
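The four conditions above can be expressed as simple record-level checks. The following is a minimal sketch in Python; the field names, required-field set, and postcode lookup are hypothetical, and a real profiling effort would run rules like these across whole data sets rather than one record at a time.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical required fields and a (tiny) assumed postcode-by-state lookup.
REQUIRED_FIELDS = {"name", "postcode", "state", "updated_at"}
POSTCODES_BY_STATE = {"VIC": {"3000"}, "NSW": {"2000"}}

def integrity_issues(rec: dict, max_age_days: int = 365) -> list:
    """Return a list of integrity problems found in one customer record."""
    issues = []
    # Completeness: every required field present and non-empty.
    missing = REQUIRED_FIELDS - {k for k, v in rec.items() if v}
    if missing:
        issues.append(f"incomplete: missing {sorted(missing)}")
    # Accuracy: postcode matches the expected 4-digit format.
    if not re.fullmatch(r"\d{4}", rec.get("postcode") or ""):
        issues.append("inaccurate: malformed postcode")
    # Consistency: postcode agrees with the state.
    if rec.get("postcode") not in POSTCODES_BY_STATE.get(rec.get("state"), set()):
        issues.append("inconsistent: postcode/state mismatch")
    # Timeliness: record refreshed within the allowed window.
    if "updated_at" in rec:
        updated = datetime.fromisoformat(rec["updated_at"])
        if datetime.now(timezone.utc) - updated > timedelta(days=max_age_days):
            issues.append("stale: not refreshed recently")
    return issues
```

An empty result means the record passed all four checks; each string in the result names which dimension of integrity failed and why.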
Due to historical limitations in information systems, not all data could be captured because there were insufficient fields. This is less common today thanks to extensible data schemas; however, it can still result in data being captured in the wrong field (misclassification) or condensed into one field when more are required. This introduces noise and reduces usability.
While many systems have mandatory fields, too many mandatory fields slow down customer enrollment processes. This business choice can also result in incomplete data.
Many types of data, such as SSNs and driver's licence numbers, consist of long strings of digits that are prone to human error. Spelling mistakes, or odd characters that must be scrubbed in downstream data wrangling efforts, also impact data usability.
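One common defence against typos in long numeric identifiers is a check digit validated at the point of entry. As a sketch, the Luhn algorithm (used by payment card numbers; SSNs and most licence numbers use different or no check-digit schemes) catches every single-digit error and most adjacent transpositions:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    digits = [int(d) for d in number if d.isdigit()]
    # Reject non-digit characters and trivially short inputs.
    if len(digits) < 2 or len(digits) != len(number):
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Rejecting an invalid identifier while the customer is still on the phone or the form is still open is far cheaper than repairing the error downstream.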
Other mistakes like vanity year of birth (deliberately making oneself seem younger or older) are harder to detect. Derived or inferred fields that have business logic errors also affect accuracy. Older systems that only support male/female genders are also problematic.
Data inaccuracy (and incompleteness) can occasionally be repaired through retrospective data matching against an authoritative source. With large data sets this approach may be ineffective or prohibited by regulatory constraints.
Another type of data integrity issue is inconsistency between the natural fact and the business processes that transcribe those true facts into corporate data stores and registries, both inside and across organizations. A related issue is inconsistency between a system of record and a secondary system holding copies of that data; the copy becomes a source of truth in its own right, resulting in two authoritative sources that disagree.
This happened at a major bank where a new “VIP” CRM system was stood up alongside a separate retail customer CRM, with the richer data captured in the new VIP CRM never being fed back to the retail CRM, even though they shared the same customer record.
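A periodic reconciliation job can surface this kind of drift before it causes a business process failure. A minimal sketch, assuming flat dictionary records keyed by the same customer ID (the systems and field names are illustrative):

```python
def reconcile(primary: dict, secondary: dict) -> dict:
    """Report fields where a secondary copy has drifted from the system of record."""
    drift = {}
    for field in primary.keys() | secondary.keys():
        a, b = primary.get(field), secondary.get(field)
        if a != b:
            drift[field] = {"system_of_record": a, "secondary": b}
    return drift

# Hypothetical records for the same customer held in two systems.
system_of_record = {"customer_id": "C123",
                    "email": "m.lee@example.com",
                    "phone": "+61 3 9000 0000"}
secondary_copy = {"customer_id": "C123",
                  "email": "michelle@example.com",
                  "phone": "+61 3 9000 0000"}
```

Whether the drift report triggers an automatic write-back to the secondary system or a manual review is a governance decision; the point is that the disagreement is detected at all.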
Even if data is complete, accurate, and consistent, there may still be an integrity issue: timeliness. The data may be out-of-date (due to batch/ETL latency); the payload may carry a miscalculated timestamp (system time rather than event time); the standard used to calculate the effective date may differ from the actual date; or the data may simply be stale and in need of a refresh.
This can happen when an invoice issue date is used instead of the purchase order date, which is the contracted date. It is also why customer contact processes constantly reconfirm a customer's key master data.
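The system-time versus event-time distinction is easy to illustrate: an event occurring just before midnight local time but batch-loaded hours later lands on the wrong calendar day if the load time is used. A minimal sketch with hypothetical timestamps:

```python
from datetime import datetime, timedelta, timezone

AEST = timezone(timedelta(hours=10))  # the event's local zone (assumed)

# The transaction happened at 23:50 on 30 June, local time...
event_time = datetime(2024, 6, 30, 23, 50, tzinfo=AEST)
# ...but the nightly batch recorded it hours later, in UTC.
system_time = datetime(2024, 7, 1, 4, 15, tzinfo=timezone.utc)

# Deriving the effective date from event time keeps it in the correct
# (and, here, the correct financial-year) reporting period.
effective_date = event_time.astimezone(AEST).date()
wrong_date = system_time.date()
```

Here the event-time date is 30 June while the system-time date is 1 July, a full reporting period away for any business closing its books at end of financial year.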
While not necessarily relevant to accounting data, with PII in particular we must also verify that the identity of the customer is authentic, especially at enrollment. With phishing and spoofing on the rise, companies must take pains to ensure a customer's identity data is accurate and complete the first time.
A type of inauthentic data that exists in production systems is test data. While best practice suggests there should be no test data in production systems, this is rarely the case as operators are forced to conduct tests in production for BAU changes.
The importance of data provenance for data integrity
The business processes that preserve (or degrade) data provenance are just as important to nurture as the business processes that grow revenue or reduce costs for an organization.
Data provenance ensures that the origin of the data (the facts) and the record of what has happened to it through replication and other changes (the lineage) are not corrupted or otherwise broken. And it is not just the data itself but the definition of the data (the metadata) that must remain as consistent as possible.
Many organizations have robust fact enrollment. In banking this is called KYC (Know Your Customer). In healthcare, FHIR and HL7 provide highly standardized patient and even medicines data. In other, less regulated industries, this process resides in the Customer 360.
Less regulated industries may choose to prioritize speed of enrollment over capturing exhaustive customer details (affecting completeness). Missing or inconsistent data then constrains downstream processes: the ability to maximize the value of the relationship, for example with highly personalized marketing campaigns, is limited, and conversion rates remain low because those campaigns are not sticky enough.
For business data to be maximally effective in a value exchange between business processes, critical data must be complete, accurate, and consistent with the true (or natural) facts and events, and consistent across relevant data stores within an organization's business boundaries. Data integrity for those facts and events reported to the outside world, for regulatory and compliance purposes, is equally critical.
Learn more about how MuleSoft’s modern API iPaaS can uplift and enforce your data integrity efforts at the same time as the data is being integrated.