Table of Contents

  1. Preface
  2. Introduction to Data Engineering Streaming
  3. Data Engineering Streaming Administration
  4. Sources in a Streaming Mapping
  5. Targets in a Streaming Mapping
  6. Streaming Mappings
  7. Window Transformation
  8. Appendix A: Connections
  9. Appendix B: Monitoring REST API Reference
  10. Appendix C: Sample Files

Streaming Process


A streaming mapping receives data from unbounded data sources. An unbounded data source is one where data is continuously flowing in and there is no definite boundary. Sources stream data as events. The Spark engine processes the data and continuously updates the results to a result table.
The following image shows how the Spark engine receives data and publishes data in micro batches:
The Spark engine receives data from a Kafka source and publishes the data in batches to an HDFS target.
The Spark engine uses Spark Structured Streaming to process the data that it receives in batches. Spark Structured Streaming receives data from streaming sources such as Kafka and divides the data into micro batches. The Spark engine continuously processes the data streams as a series of small batch jobs, with end-to-end latencies as low as 100 milliseconds and exactly-once fault tolerance guarantees.
For more information about Spark Structured Streaming, see the Apache Spark documentation at https://spark.apache.org/documentation.html.
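As a toy illustration of the micro-batch model described above (plain Python, not the Spark API; the event stream, batch size, and word-count logic are made up for the example), the following sketch divides a continuous stream of events into small batches and incrementally updates a result table after each batch:

```python
from collections import Counter
from typing import Iterable, Iterator, List

def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Divide an unbounded event stream into small batches,
    analogous to how Spark Structured Streaming micro-batches its input."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# The "result table" that the engine continuously updates:
# here, a running count of words seen so far.
result_table = Counter()

stream = ["kafka", "hdfs", "kafka", "spark", "kafka", "hdfs", "spark"]
for batch in micro_batches(stream, batch_size=3):
    result_table.update(batch)  # process one small batch job
    # After each batch, result_table reflects all events received so far.

print(result_table["kafka"])  # 3
```

In a real streaming mapping, the source is unbounded, so the loop never ends and the result table is updated for as long as events keep arriving.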
You can perform the following high-level tasks in a streaming mapping:
  1. Identify sources from which you need to stream data. You can access data that is in XML, JSON, Avro, flat, or binary format.
    In the Hadoop environment, you can use Kafka, Amazon Kinesis Streams, and Azure Event Hubs sources to connect to multiple data engineering sources.
  2. Configure the mapping and mapping logic to transform the data.
  3. Run the mapping on the Spark engine in the Hadoop environment or on the Databricks Spark engine in the Databricks environment.
  4. Write the data to Kafka targets, HDFS complex files, HBase, Azure Event Hubs, Amazon S3, Azure Data Lake Storage, JMS, and Kinesis Firehose delivery streams.
  5. Monitor the status of your processing jobs. The Monitoring tool displays monitoring statistics for each processing job.
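Conceptually, the tasks above form a pipeline from streaming source to target. The following plain-Python sketch (hypothetical names throughout; not an Informatica or Spark API) mimics that flow with JSON-formatted events, a simple transform as the mapping logic, an in-memory list standing in for an HDFS target, and a counter standing in for monitoring statistics:

```python
import json

# 1. Source: JSON-formatted events, standing in for a Kafka stream.
source_events = [
    '{"sensor": "s1", "temp": 21.5}',
    '{"sensor": "s2", "temp": 19.0}',
    '{"sensor": "s1", "temp": 22.1}',
]

# 2. Mapping logic: parse each event and derive a Fahrenheit reading.
def transform(event: str) -> dict:
    record = json.loads(event)
    record["temp_f"] = round(record["temp"] * 9 / 5 + 32, 1)
    return record

# 4. Target: an in-memory list standing in for an HDFS complex file.
target = []

# 5. Monitoring: simple statistics for the processing job.
stats = {"processed": 0}

# 3. Run the mapping over the stream.
for event in source_events:
    target.append(transform(event))
    stats["processed"] += 1

print(stats["processed"])  # 3
```

The numbering in the comments mirrors the task list above; in a real mapping, steps 1, 4, and 5 are handled by connections and the Monitoring tool rather than in user code.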