Data Engineering Streaming Overview

Use Informatica Data Engineering Streaming to prepare and process streams of data in real time and uncover insights in time to meet your business needs. Data Engineering Streaming provides pre-built connectors for Kafka, Amazon Kinesis, HDFS, and enterprise messaging systems, along with data transformations, so that you can define data integration logic without writing code.
Data Engineering Streaming builds on the best of open source technologies. It uses Spark Structured Streaming for stream processing, and supports other open source stream processing platforms and frameworks, such as Kafka and Hadoop. Spark Structured Streaming is a scalable and fault-tolerant open source stream processing engine built on the Spark engine.
You can create streaming mappings to collect machine, device, and social media data in the form of messages. The mapping builds the business logic for the data and pushes the logic to the Spark engine for processing. Use a Messaging connection to get data from Apache Kafka brokers, Amazon Kinesis, and Azure Event Hubs.
The Spark engine runs the streaming mapping continuously. The Spark engine reads the data, divides the data into micro batches, processes it, updates the results to a result table, and then writes to a target.
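The read, batch, process, update, and write cycle described above can be illustrated with a minimal, plain-Python sketch. This is not how Data Engineering Streaming or Spark implements the cycle. It is a simplified model that assumes fixed-size batches (Spark forms micro batches by trigger interval, not by message count), a per-device message count as the processing logic, and an in-memory list as the target. The names `micro_batches`, `run_pipeline`, and the sample messages are hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable, batch_size: int) -> Iterator[List]:
    """Divide an incoming stream into micro batches of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the (finite, simulated) stream is exhausted
        yield batch


def run_pipeline(stream: Iterable[dict], batch_size: int,
                 target: List[Dict[str, int]]) -> Dict[str, int]:
    """Read the stream, process each micro batch, update the result table,
    and write a snapshot of the table to the target after every batch."""
    result_table: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        # Process: count the messages in this batch per device (assumed logic).
        batch_counts = Counter(msg["device"] for msg in batch)
        # Update the running result table with the counts from this batch.
        result_table.update(batch_counts)
        # Write: append the current state of the result table to the target.
        target.append(dict(result_table))
    return dict(result_table)


# Hypothetical sensor messages standing in for a real message source.
messages = [
    {"device": "sensor-a", "value": 21.5},
    {"device": "sensor-b", "value": 19.0},
    {"device": "sensor-a", "value": 22.1},
    {"device": "sensor-a", "value": 22.4},
    {"device": "sensor-b", "value": 18.7},
]
target: List[Dict[str, int]] = []
final_counts = run_pipeline(messages, batch_size=2, target=target)
print(final_counts)
```

With a batch size of 2, the five messages form three micro batches, so the target receives three successive snapshots of the result table. A real streaming mapping never reaches the end of its input. The finite list here only keeps the sketch runnable.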
You can stream the following types of data:
  • Application and infrastructure log data
  • Change data capture (CDC) data from databases
  • Clickstreams from web servers
  • Geo-spatial data from devices
  • Sensor data
  • Time series data
  • Supervisory Control And Data Acquisition (SCADA) data
  • Message bus data
  • Programmable logic controller (PLC) data
  • Point of sale data from devices
You can stream data to different types of targets, such as Kafka, HDFS, Amazon Kinesis Firehose, Amazon S3, HBase tables, Hive tables, JDBC-compliant databases, Microsoft Azure Event Hubs, and Azure Data Lake Store.
Data Engineering Streaming works with Data Engineering Integration to provide streaming capabilities. In a Hadoop environment, Data Engineering Streaming uses YARN to manage the resources on a Spark cluster. It uses third-party distributions to connect to and push job processing to a Hadoop environment. In a Databricks environment, Data Engineering Streaming uses the built-in standalone resource manager to manage the Spark clusters.
Use Informatica Developer (the Developer tool) to create streaming mappings. To run a streaming mapping, use either the Hadoop run-time environment with the Spark engine or the Databricks run-time environment with the Databricks Spark engine, depending on which environment supports the sources and targets in the mapping.
You can configure high availability to run the streaming mappings on the Hadoop or Databricks cluster.
For more information about running mappings on the Spark engine, see the Data Engineering Integration User Guide.