Top Five Data Integration Patterns
Data is an extremely valuable business asset, but it can sometimes be difficult to access, orchestrate and interpret. When data is moving across systems, it isn’t always in a standard format; data integration aims to make data agnostic and usable quickly across the business, so it can be accessed and handled by its constituents. And in order to make that data usable even more quickly, data integration patterns can be created to standardize the integration process.
Like a hiking trail, patterns are discovered and established based on use. Patterns always come in degrees of perfection, but they can be optimized or adopted depending on which business needs require solutions. You can think of the business use case as an instantiation of the pattern, i.e. a use for the generic process of data movement and handling.
There are five data integration patterns that we have identified and built templates around, based on business use cases as well as particular integration patterns.
Data Integration Pattern 1: Migration
Migration is the act of moving a specific set of data at a point in time from one system to another. A migration contains a source system where the data resides prior to execution, criteria that determine the scope of the data to be migrated, a transformation that the data set will go through, a destination system where the data will be inserted, and an ability to capture the results of the migration so that the final state can be compared with the desired state.
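The five parts of a migration listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the in-memory lists standing in for the source and destination systems, the `migrate` function, the `MigrationResult` class and the sample records are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class MigrationResult:
    """Captures the outcome so the final state can be compared with the desired state."""
    migrated: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def migrate(source: Iterable[dict],
            in_scope: Callable[[dict], bool],
            transform: Callable[[dict], dict],
            destination: list) -> MigrationResult:
    """Move every in-scope record from source to destination at one point in time."""
    result = MigrationResult()
    for record in source:
        if not in_scope(record):          # criteria: skip records outside the scope
            continue
        try:
            destination.append(transform(record))
            result.migrated.append(record["id"])
        except Exception as exc:          # fail gracefully and keep going
            result.failed.append((record.get("id"), str(exc)))
    return result


# Hypothetical source system: the third record is missing a name.
legacy_crm = [
    {"id": 1, "name": "Ada", "active": True},
    {"id": 2, "name": "Bob", "active": False},
    {"id": 3, "active": True},
]
new_crm = []

outcome = migrate(
    legacy_crm,
    in_scope=lambda r: r["active"],
    transform=lambda r: {"id": r["id"], "full_name": r["name"].upper()},
    destination=new_crm,
)
print(outcome.migrated, outcome.failed)
```

Recording failures instead of aborting on the first one is a design choice that suits large, one-off runs: the result object tells you exactly which records still need attention.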
Why is it valuable?
Migrations are essential to all data systems and are used extensively in any organization that has data operations. We spend a lot of time creating and maintaining data, and migration is key to keeping that data independent of the tools that we use to create it, view it, and manage it. Without migration, we would be forced to lose all the data that we have amassed any time that we want to change tools, and this would cripple our ability to be productive in the digital world.
When is it useful?
Migrations will most commonly occur whenever you are moving from one system to another, moving from an instance of a system to another or newer instance of that system, spinning up a new system that extends your current infrastructure, backing up a dataset, adding nodes to database clusters, replacing database hardware, consolidating systems and many more.
Data Integration Pattern 2: Broadcast
Broadcast can also be called “one way sync from one to many”, and it is the act of moving data from a single source system to many destination systems on an ongoing, real-time (or near real-time) basis.
Whenever there is a need to keep data up to date between multiple systems over time, you will need either a broadcast, bi-directional sync, or correlation pattern. The distinction here is that the broadcast pattern, like the migration pattern, only moves data in one direction, from the source to the destination. The broadcast pattern, unlike the migration pattern, is transactional. This means it does not execute the logic of the message processors for all items which are in scope; rather, it executes the logic only for those items that have recently changed. Think of broadcast as a sliding window that only captures those items which have field values that have changed since the last time the broadcast ran. Another major difference is in how the implementation of the pattern is designed. Migration will be tuned to handle large volumes of data, process many records in parallel, and fail gracefully. Broadcast patterns are optimized for processing records as quickly as possible and for being highly reliable, so that critical data is not lost in transit, as they are usually employed with low human oversight in mission-critical applications.
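The sliding-window idea can be sketched as follows. This is one possible approach, assuming each record carries a `modified_at` timestamp and that destinations are simple dictionaries keyed by record id; the function name, field names and sample data are hypothetical.

```python
from datetime import datetime, timezone


def broadcast_changes(source, destinations, last_run):
    """Push records changed since last_run to every destination.

    Returns the new watermark to use as last_run on the next execution.
    """
    changed = [r for r in source if r["modified_at"] > last_run]
    for dest in destinations:
        for record in changed:
            dest[record["id"]] = dict(record)   # one-way copy: source -> destination
    # Advance the window to the latest change seen; keep it if nothing changed.
    return max((r["modified_at"] for r in changed), default=last_run)


def at_hour(hour):
    """Helper for the example: a fixed UTC timestamp on an arbitrary day."""
    return datetime(2024, 1, 1, hour, tzinfo=timezone.utc)


# Hypothetical source system and two destination systems.
orders = [
    {"id": "A", "status": "new", "modified_at": at_hour(9)},
    {"id": "B", "status": "paid", "modified_at": at_hour(11)},
]
dashboard, fulfilment = {}, {}

# The previous run happened at 10:00, so only order B is inside the window.
watermark = broadcast_changes(orders, [dashboard, fulfilment], last_run=at_hour(10))
print(sorted(dashboard), watermark)
```

Returning the watermark rather than storing it inside the function leaves the caller free to persist it wherever suits the deployment, for example after confirming that every destination accepted the batch.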
Why is it valuable?
The broadcast pattern is extremely valuable when system B needs to know some information in near real time that originates or resides in system A. For example, you may want to create a real-time reporting dashboard that is the destination of multiple broadcast applications, receiving updates so that you can know in real time what is going on across multiple systems. You may want to immediately start fulfilment of orders that come from your CRM, online e-shop, or internal tool where the fulfilment processing system is centralized regardless of which channel the order comes from. You may want to send a notification of the temperature of your steam turbine to a monitoring system every 100 ms. You may want to broadcast to a general practitioner’s patient management system when one of their regular patients is checked into an emergency room. There are countless examples of when you want to take an important piece of information from an originating system and broadcast it to one or more receiving systems as soon as possible after the event happens.
When is it useful?
The broadcast pattern’s “need” can easily be identified by the following criteria:
Does system B need to know as soon as the event happens – Yes
Does data need to flow from A to B automatically, without human involvement – Yes
Does system A need to know what happens with the object in system B – No
The first question will help you decide whether you should use the migration pattern or broadcast based on how real time the data needs to be. If the data needs to be refreshed more often than roughly once an hour, it will tend to be a broadcast pattern. However, there are always exceptions based on volumes of data. The second question generally rules out “on demand” applications; in general, broadcast patterns will be initiated either by a push notification or by a scheduled job and hence will not have human involvement. The last question will let you know whether you need to union the two data sets so that they are synchronized across the two systems, which is what we call bi-directional sync. Different needs will call for different data integration patterns, but in general the broadcast pattern is much more flexible in how you can couple the applications, and we would recommend using two broadcast applications over a bi-directional sync application.
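The three questions can be read as a small decision procedure. The sketch below is a deliberate simplification of the guidance above (it ignores the data-volume exceptions, for instance), and the function name, parameter names and return labels are hypothetical.

```python
def suggest_pattern(needs_near_real_time: bool,
                    fully_automatic: bool,
                    source_needs_feedback: bool) -> str:
    """Map the three yes/no questions to a candidate pattern (simplified)."""
    if not needs_near_real_time:
        return "migration"              # question 1 answered "No"
    if not fully_automatic:
        return "on-demand application"  # question 2 answered "No"
    if source_needs_feedback:
        return "bi-directional sync"    # question 3 answered "Yes"
    return "broadcast"                  # Yes, Yes, No


print(suggest_pattern(True, True, False))
```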
Data Integration Pattern 3: Bi-Directional Sync
The bi-directional sync data integration pattern is the act of combining two datasets in two different systems so that they behave as one, while respecting their need to exist as different datasets. This type of integration need comes from having different tools or different systems for accomplishing different functions on the same dataset. For example, you may have a system for taking and managing orders and a different system for customer support. You may find that these two systems are best of breed and it is important to use them rather than a suite which supports both functions and has a shared database. Using bi-directional sync to share the dataset will enable you to use both systems while maintaining a consistent real-time view of the data in both systems.
Why is it valuable?
Bi-directional sync can be both an enabler and a savior depending on the circumstances that justify its need. If you have two or more independent and isolated representations of the same reality, you can use bi-directional sync to optimize your processes, have the data representations be much closer to reality in both systems and reduce the compound cost of having to manually address the inconsistencies, lack of data or the impact to your business from letting the inconsistencies exist. On the other hand, you can use bi-directional sync to take you from a suite of products that work well together but may not be the best at their own individual function, to a suite that you hand pick and integrate together using an enterprise integration platform like our Anypoint Platform.
When is it useful?
The need, or demand, for a bi-directional sync integration application is synonymous with wanting object representations of reality to be comprehensive and consistent. For example, if you want a single view of your customer, you can solve that manually by giving everyone access to all the systems that have a representation of the notion of a customer. But a more elegant and efficient solution to the same problem is to list out which fields need to be visible for that customer object in which systems, and which systems are the owners. Most enterprise systems have a way to extend objects such that you can modify the customer object data structure to include those fields. Then you can create integration applications either as point-to-point applications (using a common integration platform) if it’s a simple solution, or with a more advanced routing system like a pub/sub or queue routing model if there are multiple systems at play. For example, a salesperson should know the status of a delivery, but they don’t need to know at which warehouse the delivery is. Similarly, the delivery person needs to know the name of the customer that the delivery is for without needing to know how much the customer paid for it. Bi-directional synchronization allows both of those people to have a real-time view of the same customer within the perspective they care about.
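The field-ownership approach described above can be sketched as a map from each shared field to the system that owns it. This is only an illustration under simple assumptions: the systems are in-memory dictionaries keyed by customer id, and the system names, field names and `sync_bidirectional` function are hypothetical.

```python
# Which system owns each shared field. Fields not listed here stay private.
SHARED_FIELDS = {
    "customer_name": "sales",
    "delivery_status": "delivery",
}


def sync_bidirectional(systems, shared_fields):
    """Copy each shared field from its owning system to every other system.

    Works on the union of record ids, so a record known to only one
    system is created in the others with the shared fields it is owed.
    """
    all_ids = set().union(*(records.keys() for records in systems.values()))
    for record_id in all_ids:
        for field_name, owner in shared_fields.items():
            owner_record = systems[owner].get(record_id)
            if owner_record is None or field_name not in owner_record:
                continue
            for name, records in systems.items():
                if name != owner:
                    records.setdefault(record_id, {})[field_name] = owner_record[field_name]


systems = {
    "sales": {"c1": {"customer_name": "Ada Lovelace", "amount_paid": 120.0}},
    "delivery": {"c1": {"delivery_status": "out for delivery", "warehouse": "W7"}},
}
sync_bidirectional(systems, SHARED_FIELDS)
print(systems["sales"]["c1"])
print(systems["delivery"]["c1"])
```

After the sync, the sales record shows the delivery status but not the warehouse, and the delivery record shows the customer name but not the amount paid, which mirrors the salesperson and delivery person example.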
Data Integration Pattern 4: Correlation
The correlation data integration pattern is a design that identifies the intersection of two data sets and does a bi-directional synchronization of that scoped dataset only if that item occurs in both systems naturally. Whereas the bi-directional pattern synchronizes the union of the scoped dataset, correlation synchronizes the intersection. In the case of the correlation pattern, those items that reside in both systems may have been manually created in each of those systems, like two sales representatives entering the same contact in both CRM systems. Or they may have been brought in as part of a different integration. The correlation pattern will not care where those objects came from; it will agnostically synchronize them as long as they are found in both systems.
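The difference from bi-directional sync is the scope: only records found in both systems are touched. A minimal sketch, assuming records can be matched on a shared key and that the caller supplies the merge rule; the `correlate` function, the `patient_id` key and the sample hospital records are hypothetical.

```python
import copy


def correlate(system_a, system_b, match_key, merge):
    """Synchronize only the records that already exist in both systems."""
    index_b = {rec[match_key]: rec for rec in system_b}
    synced = []
    for rec_a in system_a:
        rec_b = index_b.get(rec_a[match_key])
        if rec_b is None:
            continue                      # exists only in A: out of scope
        merged = merge(rec_a, rec_b)
        rec_a.update(merged)
        rec_b.update(copy.deepcopy(merged))  # avoid sharing mutable values
        synced.append(rec_a[match_key])
    return synced                         # records only in B are never visited


def merge_treatments(a, b):
    """Example merge rule: combine both treatment histories."""
    return {"treatments": sorted(set(a["treatments"]) | set(b["treatments"]))}


hospital_a = [
    {"patient_id": "p1", "treatments": ["x-ray"]},
    {"patient_id": "p2", "treatments": ["cast"]},
]
hospital_b = [
    {"patient_id": "p1", "treatments": ["stitches"]},
    {"patient_id": "p3", "treatments": ["mri"]},
]

synced_ids = correlate(hospital_a, hospital_b, "patient_id", merge_treatments)
print(synced_ids, hospital_a[0]["treatments"])
```

Neither system gains a new record: patient p2 stays only in Hospital A and p3 only in Hospital B.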
Why is it valuable?
The correlation data integration pattern is useful in the case where you have two groups or systems that want to share data only if they both have a record representing the same item or person in reality. For example, a hospital group has two hospitals in the same city. You might like to share data between the two hospitals so that if a patient uses either hospital, you will have an up-to-date record of what treatment they received at both locations. To accomplish an integration like this, you may decide to create two broadcast pattern integrations, one from Hospital A to Hospital B, and one from Hospital B to Hospital A. This will ensure that the data is synchronized; however, you now have two integration applications to manage. To alleviate the need to manage two applications, you can just use the bi-directional synchronization pattern between Hospital A and B. But to increase efficiency, you might like the synchronization to not bring in the records of patients of Hospital B if those patients have no association with Hospital A, and to bring a record in real time as soon as the patient’s record is created. The correlation pattern is valuable because it only bi-directionally synchronizes the objects on a “need to know” basis rather than always moving the full scope of the dataset in both directions.
When is it useful?
The correlation data integration pattern is most useful when having the extra data is more costly than beneficial, because it allows you to scope out the “unnecessary” data. For example, suppose you are a university that is part of a larger university system, and you are looking to generate reports across your students. You probably don’t want a bunch of students in those reports that never attended your university. But you may want to include the units that those students completed at other universities in your university system. Here, the correlation pattern would save you a lot of effort on either the integration or the report generation side because it would allow you to synchronize only the information for the students that attended both universities.
Data Integration Pattern 5: Aggregation
Aggregation is the act of taking or receiving data from multiple systems and inserting it into one. For example, customer data could reside in three different systems, and a data analyst might want to generate a report which uses data from all of them. One could create a daily migration from each of those systems to a data repository and then run queries against that database. But then there would be another database to keep track of and keep synchronized. In addition, as things change in the three other systems, the data repository would have to be constantly kept up to date. Another downside is that the data would be a day old, so for real-time reports, the analyst would have to either initiate the migrations manually or wait another day. One could set up three broadcast applications, achieving a situation where the reporting database is always up to date with the most recent changes in each of the systems. But there would still be a need to maintain this database, which only stores replicated data so that it can be queried every so often. In addition, a number of API calls would be wasted just to ensure that the database is never more than x minutes behind reality. This is where the aggregation pattern comes into play. If you build an application, or use one of our templates that is built on it, you will notice that you can query multiple systems on demand, merge the data set, and do as you please with it. For example, you can build an integration app which queries the various systems, merges the data and then produces a report. This way you avoid having a separate database and you can have the report arrive in a format like .csv or the format of your choice. You can place the report directly in the location where reports are stored.
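The query, merge and report flow described above might look like the sketch below. It is an illustration only: each source system is stood in for by a plain function returning rows, and the `aggregate_report` function, the `customer_id` key and the sample data are hypothetical.

```python
import csv
import io


def aggregate_report(sources, key):
    """Query every source on demand, merge rows on `key`, and return CSV text."""
    merged = {}
    for fetch in sources:
        for row in fetch():               # data is read at request time, not replicated
            merged.setdefault(row[key], {key: row[key]}).update(row)

    # Key column first, then every other column seen, in alphabetical order.
    columns = [key] + sorted({c for row in merged.values() for c in row} - {key})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    for row_key in sorted(merged):
        writer.writerow(merged[row_key])
    return buffer.getvalue()


# Hypothetical stand-ins for three systems holding customer data.
def crm():
    return [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Bob"}]


def billing():
    return [{"customer_id": 1, "balance": 30}]


def support():
    return [{"customer_id": 2, "open_tickets": 4}]


report = aggregate_report([crm, billing, support], key="customer_id")
print(report)
```

Because the sources are queried only when the report is requested, there is no intermediate database to maintain, at the cost of hitting each system at report time.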
Why is it valuable?
The aggregation pattern derives its value from allowing you to extract and process data from multiple systems in one united application. This means that the data is up to date at the time that you need it, does not get replicated, and can be processed or merged to produce the dataset you want.
When is it useful?
The aggregation pattern is valuable if you are creating orchestration APIs to “modernize” legacy systems, especially when you are creating an API which gets data from multiple systems, and then processes it into one response. Another use case is for creating reports or dashboards which similarly have to pull data from multiple systems and create an experience with that data. Finally, you may have systems that you use for compliance or auditing purposes which need to have related data from multiple systems. The aggregation pattern is helpful in ensuring that your compliance data lives in one system but can be the amalgamation of relevant data from multiple systems. You can therefore reduce the amount of learning that needs to take place across the various systems to ensure you have visibility into what is going on.