Use Informatica Data Engineering Streaming to prepare and process streams of data in real time and uncover insights in time to meet your business needs. Data Engineering Streaming provides pre-built connectors for sources and targets such as Kafka, Amazon Kinesis, HDFS, and enterprise messaging systems, along with data transformations, to enable a code-free method of defining data integration logic.
Data Engineering Streaming builds on the best of open source technologies. It uses Spark Structured Streaming for stream processing, and supports other open source stream processing platforms and frameworks, such as Kafka and Hadoop. Spark Structured Streaming is a scalable and fault-tolerant open source stream processing engine built on the Spark engine.
You can create streaming mappings to collect machine, device, and social media data in the form of messages. The mapping builds the business logic for the data and pushes the logic to the Spark engine for processing. Use a Messaging connection to get data from Apache Kafka brokers, Amazon Kinesis, and Azure Event Hubs.
The Spark engine runs the streaming mapping continuously. It reads the data, divides it into micro batches, processes the batches, updates the results in a result table, and then writes the results to a target.
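The streaming mapping itself is code-free, but the following minimal PySpark Structured Streaming sketch illustrates, conceptually, the kind of read-process-write pipeline that the Spark engine runs continuously. The broker address, topic name, and message schema are hypothetical and stand in for whatever the Messaging connection and mapping logic define.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a continuous stream of messages from a Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
       .option("subscribe", "sensor-events")                # hypothetical topic
       .load())

# Kafka delivers the payload as binary; parse it with a hypothetical JSON schema.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])
events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Process the stream in micro batches and keep a continuously updated result table.
averages = (events
            .withWatermark("event_time", "10 minutes")
            .groupBy(window(col("event_time"), "5 minutes"), col("device_id"))
            .avg("reading"))

# Write the result table to a target; the query runs until it is stopped.
query = (averages.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()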
You can stream the following types of data:
Application and infrastructure log data
Change data (CDC) from databases
Clickstreams from web servers
Geo-spatial data from devices
Sensor data
Time series data
Supervisory Control And Data Acquisition (SCADA) data
Message bus data
Programmable logic controller (PLC) data
Point of sale data from devices
You can stream data to different types of targets, such as Kafka, HDFS, Amazon Kinesis Firehose, Amazon S3, HBase tables, Hive tables, JDBC-compliant databases, Microsoft Azure Event Hubs, and Azure Data Lake Store.
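For illustration only, the following self-contained PySpark sketch shows how a stream can be written to two different target types, a file-based target such as an HDFS path and a Kafka topic. It uses Spark's built-in "rate" source to generate test data, and the paths, broker address, and topic name are hypothetical placeholders, not values defined by Data Engineering Streaming.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-targets-sketch").getOrCreate()

# Built-in "rate" source generates a test stream; in practice the stream
# would come from a messaging source such as Kafka or Amazon Kinesis.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Target 1: append the stream to a directory target (for example, an HDFS path)
# in Parquet format; the checkpoint location enables fault-tolerant recovery.
file_query = (events.writeStream
              .format("parquet")
              .option("path", "hdfs:///data/streams/demo")               # hypothetical path
              .option("checkpointLocation", "hdfs:///checkpoints/demo")  # hypothetical path
              .outputMode("append")
              .start())

# Target 2: publish the stream to a Kafka topic; the Kafka sink expects
# string or binary "key" and "value" columns.
kafka_query = (events
               .selectExpr("CAST(value AS STRING) AS key",
                           "to_json(struct(*)) AS value")
               .writeStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker1:9092")        # hypothetical broker
               .option("topic", "demo-out")                              # hypothetical topic
               .option("checkpointLocation", "hdfs:///checkpoints/demo-kafka")
               .start())

spark.streams.awaitAnyTermination()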
Data Engineering Streaming works with Data Engineering Integration to provide streaming capabilities. In a Hadoop environment, Data Engineering Streaming uses YARN to manage the resources on a Spark cluster. It uses third-party distributions to connect to and push job processing to a Hadoop environment. In a Databricks environment, Data Engineering Streaming uses the built-in standalone resource manager to manage the Spark clusters.
Use Informatica Developer (the Developer tool) to create streaming mappings. To run the streaming mapping, you can use the Hadoop run-time environment and the Spark engine or the Databricks run-time environment and the Databricks Spark engine for the supported sources and targets.
You can configure high availability to run the streaming mappings on the Hadoop or Databricks cluster.
For more information about running mappings on the Spark engine, see the Data Engineering Integration User Guide.