You can run a column profile on data objects that use Sqoop. You can select the Hive or Hadoop run-time environment to run the column profiles.
On the Hive engine, to run a column profile on a relational data object that uses Sqoop, you must set the Sqoop argument
m
to 1 in the JDBC connection. Use the following syntax:
-m 1
When you run a column profile on a logical data object or customized data object, you can configure the num-mappers argument to achieve parallelism and optimize performance. You must also configure the split-by argument to specify the column based on which Sqoop must split the work units.
Use the following syntax:
--split-by <column_name>
If the primary key does not have an even distribution of values between the minimum and maximum range, you can configure the split-by argument to specify another column that has a balanced distribution of data to split the work units.
If you do not define the split-by column, Sqoop splits work units based on the following criteria:
If the data object contains a single primary key, Sqoop uses the primary key as the split-by column.
If the data object contains a composite primary key, Sqoop defaults to the behavior of handling composite primary keys without the split-by argument. See the Sqoop documentation for more information.
If a data object contains two tables with an identical column, you must define the split-by column with a table-qualified name. For example, if the table name is CUSTOMER and the column name is FULL_NAME, define the split-by column as follows:
--split-by CUSTOMER.FULL_NAME
If the data object does not contain a primary key, the value of the m argument and num-mappers argument default to 1.
When you use Cloudera Connector Powered by Teradata or Hortonworks Connector for Teradata and the Teradata table does not contain a primary key, the split-by argument is required.