User Guide

10.2.2 HotFix 1

Column Profiles for Sqoop Data Sources

You can run a column profile on data objects that use Sqoop. You can select the Hadoop run-time environment to run the column profiles.

When you run a column profile on a logical data object or customized data object, you can configure the num-mappers argument to achieve parallelism and optimize performance. You must also configure the split-by argument to specify the column that Sqoop uses to split the work units.

Use the following syntax:

--split-by <column_name>

If the values of the primary key are not evenly distributed between its minimum and maximum values, you can configure the split-by argument to specify another column that has a balanced distribution of data to split the work units.
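
The effect of an uneven distribution can be illustrated with a small Python sketch. It is a simplified model and not Sqoop's actual splitter: it assumes the work units are equal-width ranges between the minimum and maximum value of the split-by column. The function names and the sample key values are hypothetical.

```python
def split_ranges(lo, hi, num_mappers):
    """Cut the integer range [lo, hi] into num_mappers contiguous ranges of
    (nearly) equal width. Simplified model, assumed for illustration only."""
    width = (hi - lo + 1) / num_mappers
    bounds = [lo + round(i * width) for i in range(num_mappers)] + [hi + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]


def rows_per_mapper(values, num_mappers):
    """Count how many rows fall into each mapper's range."""
    ranges = split_ranges(min(values), max(values), num_mappers)
    return [sum(1 for v in values if a <= v <= b) for a, b in ranges]


# Hypothetical key columns: one with a few large outliers, one evenly spread.
skewed_ids = list(range(1, 97)) + [1000, 2000, 3000, 4000]
even_ids = list(range(1, 101))

print(rows_per_mapper(skewed_ids, 4))  # [97, 1, 1, 1]
print(rows_per_mapper(even_ids, 4))    # [25, 25, 25, 25]
```

With the skewed column, one mapper receives almost all of the rows and the other three are nearly idle, so the num-mappers setting brings little benefit. A column with a balanced distribution gives each mapper a similar share of the work.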

If you do not define the split-by column, Sqoop splits work units based on the following criteria:

If the data object contains a single primary key, Sqoop uses the primary key as the split-by column.

If the data object contains a composite primary key, Sqoop applies its default behavior for composite primary keys when no split-by argument is specified. See the Sqoop documentation for more information.

If a data object contains two tables that share a column name, you must define the split-by column with a table-qualified name. For example, if the table name is CUSTOMER and the column name is FULL_NAME, define the split-by column as follows:

--split-by CUSTOMER.FULL_NAME

If the data object does not contain a primary key, the values of the m and num-mappers arguments default to 1.
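
The rules above can be summarized as a small decision function. This is a sketch of the documented behavior under stated assumptions, not Informatica or Sqoop code: the function name, its parameters, and the SQOOP_DEFAULT marker are hypothetical, and the composite-key case is left to Sqoop because this topic does not describe it.

```python
SQOOP_DEFAULT = "<sqoop default>"  # hypothetical marker: Sqoop decides


def choose_split(primary_key_columns, split_by=None, num_mappers=4):
    """Return (split column, effective number of mappers) for a data object.

    primary_key_columns: list of primary key column names (may be empty).
    split_by: value of the split-by argument, or None if not defined.
    """
    if split_by:
        # An explicit split-by column always wins.
        return split_by, num_mappers
    if len(primary_key_columns) == 1:
        # Single primary key: it becomes the split-by column.
        return primary_key_columns[0], num_mappers
    if not primary_key_columns:
        # No primary key and no split-by: the mapper count defaults to 1.
        return None, 1
    # Composite primary key: behavior is defined by Sqoop itself.
    return SQOOP_DEFAULT, num_mappers


print(choose_split(["CUSTOMER_ID"]))
print(choose_split([], num_mappers=8))
print(choose_split(["ID", "REGION"], split_by="CUSTOMER.FULL_NAME"))
```

The column names CUSTOMER_ID, ID, and REGION are invented for the example. The third call shows a table-qualified split-by column overriding a composite primary key.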

When you use Cloudera Connector Powered by Teradata or Hortonworks Connector for Teradata and the Teradata table does not contain a primary key, the split-by argument is required.
