Big Data Streaming User Guide

Streaming Process

A streaming mapping receives data from unbounded data sources. An unbounded data source is one in which data flows in continuously and has no definite boundary. Sources stream data as events. The Spark engine receives the input data streams, divides the data into micro batches, processes each micro batch, and publishes the resulting data in batches.
The following image shows how the Spark engine receives data and publishes data in batches:
The Spark engine receives data from a Kafka source and publishes the data in batches to an HDFS target.
The Spark engine uses Spark Streaming to process the data that it receives in batches. Spark Streaming receives data from streaming sources such as Kafka and divides the data into discretized streams, or DStreams. A DStream is a continuous series of Resilient Distributed Datasets (RDDs), where each RDD holds the data received during one batch interval.
For more information about Spark Streaming, see the Apache Spark documentation at https://spark.apache.org/documentation.html.
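The micro-batching idea described above can be illustrated with a minimal plain-Python sketch. It does not use Spark or any Informatica API. The function name `micro_batches`, the `batch_interval` parameter, and the sample events are assumptions made for illustration only. The sketch groups timestamped events into consecutive fixed-length intervals, much as a DStream groups incoming data into one RDD per batch interval.

```python
from typing import Iterable, List, Tuple

# An event is (arrival time in seconds, payload). Hypothetical shape for illustration.
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], batch_interval: float) -> List[List[str]]:
    """Group events into consecutive fixed-length time intervals.

    Assumes events are sorted by arrival time and that time starts at 0.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_end = batch_interval
    for arrival, payload in events:
        # Close every interval that ended before this event arrived.
        # An interval with no events yields an empty batch.
        while arrival >= window_end:
            batches.append(current)
            current = []
            window_end += batch_interval
        current.append(payload)
    batches.append(current)  # flush the last, still-open interval
    return batches


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.4, "d")]
result = micro_batches(events, batch_interval=1.0)
print(result)
```

In this sketch, the interval from 2.0 to 3.0 seconds contains no events, so it produces an empty batch. A real streaming engine works on an unbounded stream and emits each batch as its interval ends rather than collecting all batches in a list.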
You can perform the following high-level tasks in a streaming mapping:
  1. Identify sources from which you need to stream data. You can access data that is in XML, JSON, Avro, flat, or binary format.
    You can use Kafka, JMS, Amazon Kinesis stream, Azure Event Hubs, and MapR stream sources to connect to multiple big data sources.
  2. Configure the mapping and mapping logic to transform the data.
  3. Run the mapping on the Spark engine in the Hadoop environment.
  4. Write the data to Kafka targets, HDFS complex files, HBase, MapR-DB, MapR streams, Azure Event Hubs, Azure Data Lake Store, JMS, Kinesis Firehose delivery streams, and Hive tables.
  5. Monitor the status of your processing jobs. You can view monitoring statistics for your processing jobs in the Monitoring tool.
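As a rough illustration of how the tasks above fit together, the following plain-Python sketch reads JSON events that arrive in micro batches, applies a transformation, writes each batch to a target, and keeps simple counters. It is not the product's API. The `transform` and `run_pipeline` functions, the Celsius-to-Fahrenheit logic, the in-memory `target` list, and the `stats` counters are all hypothetical stand-ins for the mapping logic, a target such as HDFS, and the monitoring statistics.

```python
import json
from typing import Dict, List


def transform(record: Dict) -> Dict:
    """Hypothetical mapping logic: convert a Celsius reading to Fahrenheit."""
    return {"sensor": record["sensor"], "temp_f": record["temp_c"] * 9 / 5 + 32}


def run_pipeline(batches: List[List[str]]) -> Dict:
    """Process JSON events batch by batch and publish each batch to a target."""
    target: List[List[Dict]] = []  # in-memory stand-in for a real target
    stats = {"batches": 0, "records": 0}  # stand-in for monitoring statistics
    for batch in batches:
        # Parse and transform every event in the micro batch.
        rows = [transform(json.loads(raw)) for raw in batch]
        # Publish the whole batch at once.
        target.append(rows)
        stats["batches"] += 1
        stats["records"] += len(rows)
    return {"target": target, "stats": stats}


source_batches = [
    ['{"sensor": "s1", "temp_c": 20}', '{"sensor": "s2", "temp_c": 25}'],
    ['{"sensor": "s1", "temp_c": 30}'],
]
out = run_pipeline(source_batches)
print(out["stats"])
```

In the product, the source connection, the mapping logic, and the target are configured in the mapping rather than written as code. The sketch only shows the order in which data moves through those stages.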
