The Data Integration Service manages jobs that are deployed to run in a cluster environment. When you enable the Data Integration Service for big data recovery, the Data Integration Service can recover Hadoop mapping jobs that run on the Spark engine.
To use big data recovery, you must configure jobs to run on the Spark engine, configure the Data Integration Service and log settings, and run the job from infacmd.
The Data Integration Service maintains a queue of jobs to run. The Data Integration Service assigns jobs from the queue to nodes, which prepare the jobs and send them to a compute cluster for processing.
The cluster assigns a YARN ID to each job and to each of its child tasks so that it can track the jobs as they run. The Data Integration Service gets the YARN IDs from the cluster and stores them in the Model repository database.
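Because the YARN IDs are persisted in the Model repository, a node that takes over after a failover can look them up and resume tracking jobs it did not originally submit. A minimal sketch of this idea, using sqlite3 as a stand-in for the Model repository database (the table and function names here are illustrative, not the product's actual schema):

```python
import sqlite3

def make_store(path=":memory:"):
    """Create a tiny stand-in for the repository table that maps jobs to YARN IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS yarn_ids (job TEXT PRIMARY KEY, yarn_id TEXT)"
    )
    return conn

def record_yarn_id(conn, job, yarn_id):
    """Persist the YARN ID the cluster assigned to a job."""
    conn.execute("INSERT OR REPLACE INTO yarn_ids VALUES (?, ?)", (job, yarn_id))
    conn.commit()

def lookup_yarn_id(conn, job):
    """Return the stored YARN ID for a job, or None if it was never submitted."""
    row = conn.execute(
        "SELECT yarn_id FROM yarn_ids WHERE job = ?", (job,)
    ).fetchone()
    return row[0] if row else None
```

A failover node can then query the cluster by the stored YARN IDs rather than resubmitting work that is already running.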
If the Data Integration Service runs on a grid or on multiple nodes, the Service Manager fails over to another node when a node fails. The Data Integration Service queries the cluster for the status of tasks as identified by their YARN IDs and compares the response with the status of the failed-over tasks. Depending on the status, the Data Integration Service takes the following actions:
If a task has no YARN ID, it submits the task to the cluster.
If a task that has a YARN ID has not been sent to the cluster, it submits the task for processing.
If all tasks have been sent, it continues to monitor communications from the cluster until completion.
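The decision logic above can be sketched as a small function. This is an illustrative model, not the product's implementation; the `yarn_id` and `sent` attributes stand in for the state the service reconstructs from the Model repository and the cluster:

```python
from enum import Enum

class Action(Enum):
    SUBMIT = "submit"     # send the task to the cluster
    MONITOR = "monitor"   # task already on the cluster; watch until completion

def recover_task(task):
    """Decide how to recover one failed-over task.

    `task` is a hypothetical object with:
      - yarn_id: the ID the cluster assigned, or None if never assigned
      - sent:    True if the task was already sent to the cluster
    """
    if task.yarn_id is None:
        return Action.SUBMIT        # no YARN ID: submit the task to the cluster
    if not task.sent:
        return Action.SUBMIT        # YARN ID exists but task never reached the cluster
    return Action.MONITOR           # already sent: keep monitoring cluster messages
```

Only tasks that never reached the cluster are resubmitted; everything else is simply monitored, which avoids running the same work twice.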
If the Data Integration Service runs on a single node, it attempts job recovery when the node is restored.
When the Data Integration Service restarts and runs a job, the job creates a cluster configuration under the disTemp directory. This process causes the disTemp directory to grow over time. Manage disk space by monitoring and periodically clearing the contents of the disTemp directory.
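A monitoring-and-cleanup routine for the disTemp directory might look like the following sketch. The path and the seven-day retention policy are examples only; substitute the disTemp location and retention that fit your environment:

```python
import os
import shutil
import time

DIS_TEMP = "/opt/informatica/disTemp"   # example path; use your service's disTemp location
MAX_AGE_DAYS = 7                        # example retention policy

def dis_temp_size_bytes(root):
    """Return the total size in bytes of all files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total

def clear_old_entries(root, max_age_days):
    """Delete top-level entries under `root` not modified within `max_age_days`."""
    cutoff = time.time() - max_age_days * 86400
    for entry in os.listdir(root):
        path = os.path.join(root, entry)
        if os.path.getmtime(path) < cutoff:
            if os.path.isdir(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)
```

Run such a routine only when no recoverable jobs still depend on the configurations being removed.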
The Data Integration Service begins the recovery process by verifying that the inactive nodes are truly unavailable, and then it assigns the recovered job to an available node. This verification of unavailable nodes might take several minutes before the job is reassigned.