Profiling and Discovery Sizing Guidelines

Scaling on Hadoop

Hadoop implements a grid differently from the Data Integration Service grid, but both systems improve performance by scaling out across more nodes. Hadoop uses a distributed file system so that the computation engine on each node can quickly access the data for that node.

When you run a profile on a Data Integration Service grid, the data might not be on the grid node that runs the profile, so the system must read it over the network. Hadoop minimizes this network I/O. Unlike the Data Integration Service grid, Hadoop runs only column profiles and data domain discovery, and it performs the computation for both on the Hadoop grid. The Profiling Service Module must run all other data discovery processes in the native Data Integration Service mode. Native sources do not support pushdown optimization.
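The routing rule described above can be sketched in a few lines. This is only an illustration of the decision, not product code: the function name, the profile type strings, and the run mode labels are hypothetical and do not correspond to any Informatica API.

```python
# Hypothetical labels for the two job types that the text says can run on Hadoop.
HADOOP_CAPABLE = {"column_profile", "data_domain_discovery"}


def choose_run_mode(profile_type: str) -> str:
    """Return "hadoop" for job types that can run on Hadoop, else "native".

    Every other data discovery process falls back to the native
    Data Integration Service mode, as the text states.
    """
    return "hadoop" if profile_type in HADOOP_CAPABLE else "native"


if __name__ == "__main__":
    for job in ("column_profile", "data_domain_discovery", "some_other_discovery"):
        print(job, "->", choose_run_mode(job))
```

When you size an environment, the point of the sketch is that any workload outside these two job types still consumes Data Integration Service resources, even when a Hadoop cluster is available.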

Hadoop processes all data locally. If the data source is not in the Hadoop environment, the Profiling Service Module first stages the data in Hadoop and then runs the profile on the staged data. This approach requires a fast network connection between the Hadoop cluster and the database.
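To see why the network connection matters, a rough lower bound on staging time is the source size divided by the link speed. The following is a minimal sketch under stated assumptions: the function name and the example figures are hypothetical, and the estimate ignores protocol overhead, database read throughput, and write speed on the Hadoop side, so real staging takes longer.

```python
def staging_seconds(source_bytes: int, link_bits_per_second: float) -> float:
    """Lower-bound time to copy a source into Hadoop over one network link.

    Assumes the link is the only bottleneck (an idealization).
    """
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    return source_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    size = 100 * 10**9  # hypothetical 100 GB source table
    print(staging_seconds(size, 1e9))   # over an assumed 1 Gbit/s link
    print(staging_seconds(size, 1e10))  # over an assumed 10 Gbit/s link
```

With these assumed figures, the idealized transfer takes 800 seconds at 1 Gbit/s and 80 seconds at 10 Gbit/s, before any profiling work begins.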
