Data Transformation

Data is essential to the day-to-day operations of every enterprise. Unfortunately, the ad hoc development of many legacy systems has created information silos that contain redundant and inconsistent data. To harness data and make it valuable to the enterprise, it's important to integrate these information silos and leverage existing IT assets to create more flexible, agile enterprise systems. A key way to do this is through data transformation.

This article provides an overview of data transformation, introduces some of the tools and technologies used for transforming data, and discusses why it's important to find data integration solutions that can make the data generated by the enterprise useful.

What is Data Transformation?

Data transformation is the process of converting data from one format (e.g. a database file, XML document, or Excel sheet) to another. Because data often resides in different locations and formats across the enterprise, data transformation is necessary to ensure data from one application or database is intelligible to other applications and databases, a critical feature for applications integration.

In a typical scenario where information needs to be shared, data is extracted from the source application or data warehouse, transformed into another format, and then loaded into the target location. Extraction, transformation, and loading (together known as ETL) are the central processes of data integration. Depending on the nature of the integration scenario, data may need to be merged, aggregated, enriched, summarized, or filtered.

The first step of data transformation is data mapping. Data mapping determines the relationship between the data elements of two applications and establishes instructions for how the data from the source application is transformed before it is loaded into the target application. In other words, data mapping produces the critical metadata that is needed before the actual data conversion takes place.

For instance, in field mapping, the information in one application might be rendered in lowercase letters while another application stores information in uppercase letters. This means the data from the source application needs to be converted to uppercase letters before being loaded into the corresponding fields in the target application.

The structure of stored data may also vary between applications, requiring semantic mapping prior to the transformation process. For instance, two applications might store the same customer credit card information using slightly different structures:

APPLICATION A	EXAMPLE	APPLICATION B	EXAMPLE
Cardholder First Name	JOHN	Cardholder Name	JOHN DOE
Cardholder Last Name	DOE	Card Type	VISA
Card Type and Card Number	VISA 0123 4567 8910 1112	Card Number	0123 4567 8910 1112
Expiration Date	05/2012	Expiration Date	05/2012

To ensure that critical data isn’t lost when the two applications are integrated, information from Application A needs to be reorganized to fit the data structure of Application B. In some cases, data can be unstructured, such as an email from a customer, requiring data transformation to make the information contained in it usable.

Transformers: Data Integration Tools and Technologies

System administrators can choose from a variety of data integration products that automate the ETL process with visual mapping tools and drag-and-drop technologies. ETL capabilities can be found as standalone data integration software, as built-in tools in database servers, or as components of EAI middleware. These tools simplify and expedite complex data transformations by reducing the need for writing new code.

Standalone integration software and ETL tools found in database servers are particularly useful solutions for migrating data from enterprise applications to data warehouses and other IT projects geared towards data management. Informatica’s PowerCenter, for instance, is an example data integration solution that is capable of accessing data in a variety of formats, including flat files and web applications, while Talend’s Integration Suite offers similar data integration capabilities based on open source technologies. SQL Server Integration Services (SISS) is an ETL tool included in Microsoft SQL Server that extracts operational data from different sources for transformation and loading into a database.

For enterprises interested in integrating processes and data, middleware tools such as an Enterprise Service Bus (ESB) provide a way to share data between applications. Mule as an ESB, for instance, is not only capable of exchanging and transforming data through the DataWeave graphical interface, but also service creation and service orchestration - key processes for developing composite applications using existing units of functionality.

Data Transformation the enterprise

It's important to increase the amount of resource utilization by leveraging existing IT assets to build new business applications. This requires a fundamental shift in how data is understood. Legacy applications were originally designed for individual departments, with business functions tightly connected to data. Over time, however, the number of applications within the enterprise grew, resulting in the fragmentation of data into information silos.

But data no longer belongs to a single department, but rather to the entire enterprise. This means that data from one application should be usable in other applications and requires the loosening of data from business processes, and transformation into the right format. Without data transformation, data will fail to reach its potential in delivering tangible benefits to the enterprise.

Want to know more? Take a look at the whitepaper Big Data's Velocity and Variety Challenges.