Data doesn’t sit in one database, file system, data lake, or repository. Data created in a system of record must serve multiple business needs, integrate with other data sources, and then be used for analytics, customer-facing applications, or internal workflows. Examples include:
- Data from an e-commerce application is integrated with user analytics, customer data in a customer relationship management (CRM) system, or other master data sources to establish customer segments and tailor marketing messages.
- Internet of Things (IoT) sensor data is linked to operational and financial data stores and used to control throughput and report on the quality of a manufacturing process.
- An employee workflow application connects data and tools across multiple software-as-a-service (SaaS) platforms and internal data sources into one easy-to-use mobile interface.
Many organizations also have data scientists, data analysts, and innovation teams who increasingly need to integrate internal and external data sources. Data scientists developing predictive models often load multiple external data sources such as econometric, weather, census, and other public data and then blend them with internal sources. Innovation teams experimenting with artificial intelligence need to aggregate large and often complex data sources to train and test their algorithms. And business and data analysts who once performed their analyses in spreadsheets may now require more sophisticated tools to load, join, and process multiple data feeds.
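The load-join-process pattern described above can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database; the table names, columns, and rows are all hypothetical, standing in for an internal sales feed and an external weather feed.

```python
import sqlite3

# Hypothetical internal and external feeds, for illustration only.
internal_sales = [
    ("2024-01-01", "store-1", 1200.0),
    ("2024-01-02", "store-1", 950.0),
]
external_weather = [
    ("2024-01-01", 21.5),   # day, average temperature
    ("2024-01-02", 14.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, store TEXT, revenue REAL)")
conn.execute("CREATE TABLE weather (day TEXT, avg_temp REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", internal_sales)
conn.executemany("INSERT INTO weather VALUES (?, ?)", external_weather)

# Join the two feeds on date so an analyst can model revenue against weather.
blended = conn.execute(
    "SELECT s.day, s.revenue, w.avg_temp "
    "FROM sales s JOIN weather w ON s.day = w.day "
    "ORDER BY s.day"
).fetchall()

for row in blended:
    print(row)
```

Real workloads would pull these feeds from files, APIs, or databases, but the blending step itself is exactly this kind of join.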
Programming and scripting data integrations
For anyone with even basic programming skills, the most common way to move data from source to destination is to develop a short script. Code pulls data from one or more sources, performs any necessary data validations and manipulations, and pushes it to one or several destinations.
Developers can code point-to-point data integrations using many approaches, such as:
- A database-stored procedure that pushes data changes to other database systems
- A script that runs as a scheduled job or a service
- A webhook that alerts a service when an application’s end-user changes data
- A microservice that connects data between systems
- A small data-processing code snippet deployed to a serverless architecture
These approaches can pull data from multiple sources and join, filter, cleanse, validate, and transform it before shipping it to destination data stores.
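A point-to-point script of this kind typically has an extract, validate, transform, and load shape. The sketch below shows that structure with hypothetical record fields and validation rules; in practice the extract and load steps would call real APIs or databases.

```python
def extract():
    # Placeholder for querying an API or source database.
    return [
        {"id": 1, "email": "Ada@Example.com", "amount": "42.50"},
        {"id": 2, "email": "", "amount": "19.99"},        # fails validation
        {"id": 3, "email": "bob@example.com", "amount": "7.00"},
    ]

def validate(record):
    # Reject records missing required fields.
    return bool(record.get("email")) and record.get("amount") is not None

def transform(record):
    # Normalize field formats before loading.
    return {
        "id": record["id"],
        "email": record["email"].lower(),
        "amount": float(record["amount"]),
    }

def load(records, destination):
    # Placeholder for writing to a destination database or API.
    destination.extend(records)

destination = []
valid = [transform(r) for r in extract() if validate(r)]
load(valid, destination)
print(f"Loaded {len(destination)} of 3 records")  # prints "Loaded 2 of 3 records"
```

The record that fails validation is dropped silently here, which is exactly the sort of shortcut that separates a quick script from a production integration.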
Scripting might be a quick and easy approach to moving data, but it is not considered a professional-grade data processing method. A production-class data-processing script needs to automate the steps required to process and transport data and handle several operational needs.
For example, integrations that process large data volumes should be multithreaded, and jobs against many data sources require robust data validation and exception handling. If significant business logic and data transformations are required, developers should log the steps or take other measures to ensure that the integration is observable.
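To make the gap concrete, here is a sketch of some of that operational scaffolding: logging for observability, retries for flaky sources, and a thread pool for volume. The `fetch_source` function and source names are hypothetical placeholders.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def with_retries(fn, attempts=3, backoff=0.1):
    # Retry a flaky call, logging each failure for observability.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # back off between retries

def fetch_source(source_id):
    # Placeholder for an API call or database query.
    return {"source": source_id, "rows": 100}

sources = ["crm", "erp", "weather"]
# Multithreaded fetch, with each source wrapped in retry handling.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda s: with_retries(lambda: fetch_source(s)), sources))

log.info("fetched %d sources", len(results))
```

Even this simplified version shows how quickly exception handling, backoff policy, and concurrency decisions accumulate around what started as a simple data copy.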
The script programming to support these operational needs is not trivial. It requires the developer to anticipate everything that can go wrong with the data integration and program accordingly. In addition, developing custom scripts may not be cost effective when working with many experimental data sources. Finally, data integration scripts are often difficult to hand off and maintain across multiple developers.
For these reasons, organizations with many data integration requirements often look beyond programming and scripting data flows.
Features of robust data integration platforms
Data integration platforms enable the development, testing, running, and updating of multiple data pipelines. Organizations select them because they recognize that data integration is a platform and capability with specific development skills, testing requirements, and operational service-level expectations. When architects, IT leaders, CIOs, and chief data officers talk about scaling data integration competencies, they recognize that the capabilities they seek go beyond what software developers can easily accomplish with custom code.
Here is an overview of what you are likely to find in a data integration platform.
- A tool specialized for developing and enhancing integrations; often a low-code visual tool that lets developers drag and drop processing elements, then configure and connect them into data pipelines.
- Out-of-the-box connectors that enable rapid integration with common enterprise systems, SaaS platforms, databases, data lakes, big data platforms, APIs, and cloud data services. For example, suppose you want to connect to Salesforce data, capture accounts and contacts, and push the data to Amazon Relational Database Service (RDS). In that case, chances are the integration platform already has these connectors prebuilt and ready to be used in a data pipeline.
- The capability to handle multiple data structures and formats beyond relational data structures and file types. Data integration platforms typically support JSON, XML, Parquet, Avro, and ORC, and may also support industry-specific formats such as NACHA in financial services, HIPAA EDI in healthcare, and ACORD XML in insurance.
- Advanced data quality and master data management capabilities may be features of the data integration platform, or they may be add-on products that developers can interface from data pipelines.
- Some data integration platforms target data science and machine learning capabilities and include analytics processing elements and interface with machine learning models. Some platforms also offer data prep tools so that data scientists and analysts can prototype and develop integrations.
- Devops capabilities, such as support for version control, automating data pipeline deployments, spinning up and tearing down test environments, processing data in staging environments, scaling production pipeline infrastructure up and down, and enabling multithreaded execution.
- Multiple hosting options, including data center, public cloud, and SaaS.
- Dataops capabilities that maintain test data sets, capture data lineage, enable pipeline reuse, and automate testing.
- At runtime, data integration platforms can trigger data pipelines using multiple methods, such as scheduled jobs, event-driven triggers, or real-time streaming modalities.
- Observable production data pipelines provide reporting on performance, alert on data source issues, and have tools to diagnose data processing problems.
- Different tools support security, compliance, and data governance requirements, such as encryption formats, auditing capabilities, data masking, access management, and integrations with data catalogs.
- Data integration pipelines don’t run in isolation; top platforms integrate with IT Service Management, agile development, and other IT platforms.
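Among the governance capabilities listed above, data masking is easy to illustrate. The sketch below shows one common approach, salted one-way hashing of a sensitive field before data leaves a pipeline; the salt, field names, and record shape are hypothetical choices for illustration, not any particular platform's implementation.

```python
import hashlib

# Hypothetical pipeline-specific salt; a real deployment would manage
# this as a secret, not a hardcoded value.
SALT = b"pipeline-specific-salt"

def mask(value: str) -> str:
    # One-way hash: the masked value is stable (so joins still work
    # downstream) but cannot be reversed to recover the original.
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

record = {"customer_id": "C-1001", "email": "ada@example.com", "total": 59.90}
masked = {**record, "email": mask(record["email"])}

print(masked["customer_id"], masked["email"])
```

Because the hash is deterministic, the same email always masks to the same token, which preserves referential integrity across pipelines without exposing the raw value.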
How to shop for a data integration platform
The list of data integration capabilities and requirements can be daunting considering the types of platforms, the number of vendors competing in each space, and the analyst terminology used to categorize the options. So, how do you choose the right mix of tools for today's and future data integration requirements?
The simple answer is that it requires some discipline. Start by taking inventory of the integrations already in use, cataloging the use cases, and reverse engineering the requirements on data sources, formats, transformations, destination points, and triggering conditions. Then qualify the operating requirements, including service-level objectives, security requirements, compliance needs, and data validation requirements. Finally, consider adding some new or emerging use cases of high business importance that have requirements that differ from existing data integrations.
With this due diligence in hand, you will likely have ample evidence that do-it-yourself integrations are subpar solutions, along with clear guidance on what to look for when reviewing data integration platforms.