How to Use a Midstream Mapping to Parse Hierarchical Data
Develop a midstream mapping to parse hierarchical data in a source string on the Spark engine.
The following outline lists the high-level tasks to develop and run a mapping that parses hierarchical data in a source string.
Create a connection.
Create a connection to access data in complex files based on the file storage.
Create data objects.
Create a data object to represent the source file with the source string that contains hierarchical data in JSON or XML format.
Create a data object to represent the target that will include the hierarchical data in a struct.
Configure the data objects' properties.
Create or import a complex data type definition.
Use an existing .amodel complex data type definition that represents the schema for the hierarchical data in the source string.
Complex data type definitions are stored in the type definition library, which is a Model repository object. The default name of the type definition library is m_Type_Definition_Library.
Alternatively, create a complex data type definition from a representative sample file.
Intelligent Structure Discovery parses the sample data and discovers the schema for the hierarchical data in the source string.
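To make the idea of discovering a schema from a sample more concrete, the following Python sketch derives a nested type description from one sample JSON document. It is only an illustration of the concept and does not reproduce how Intelligent Structure Discovery works. The function name infer_schema, the type labels, and the sample record are hypothetical.

```python
import json

def infer_schema(value):
    """Return a nested description of the type of a parsed JSON value (hypothetical helper)."""
    if isinstance(value, dict):
        # A JSON object maps to a struct with one field per key.
        return {"struct": {key: infer_schema(item) for key, item in value.items()}}
    if isinstance(value, list):
        # Assumption: arrays are homogeneous, so the first element describes them all.
        return {"array": infer_schema(value[0]) if value else "unknown"}
    if isinstance(value, bool):
        # bool is checked before int because bool is a subclass of int in Python.
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "unknown"
    return "string"

# Hypothetical sample record standing in for the representative sample file.
sample = (
    '{"id": 7, "name": "Ada", "emails": ["ada@example.com"], '
    '"address": {"city": "London", "zip": "N1"}}'
)

schema = infer_schema(json.loads(sample))
print(json.dumps(schema, indent=2))
```

A single sample can only reveal the fields that it contains, which is why the sample file should be representative of the source data.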
Create a mapping and add mapping objects.
Create a mapping, and add Read and Write transformations.
Create a Read transformation to read the hierarchical data from the source string.
Create a Write transformation to write the hierarchical data to a target struct.
Create an Expression transformation for the PARSE_JSON or PARSE_XML function.
Based on the mapping logic, add other transformations that are supported on the run-time engine.
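As a rough illustration of what the Expression transformation computes for XML input, the following Python sketch turns an XML string into a nested structure. It uses the Python standard library and is not the PARSE_XML function itself. The function name parse_order, the element names, and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_order(xml_string):
    """Parse a hypothetical <order> document into a nested dictionary."""
    root = ET.fromstring(xml_string)
    return {
        "id": int(root.findtext("id")),
        "customer": root.findtext("customer"),
        # Each repeated <item> element becomes one entry in an array of structs.
        "items": [
            {"sku": item.findtext("sku"), "qty": int(item.findtext("qty"))}
            for item in root.findall("items/item")
        ],
    }

# Hypothetical source string containing hierarchical data in XML format.
source = (
    "<order><id>42</id><customer>Ada</customer><items>"
    "<item><sku>A1</sku><qty>2</qty></item>"
    "<item><sku>B7</sku><qty>1</qty></item>"
    "</items></order>"
)

order = parse_order(source)
print(order)
```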
Create and configure ports in transformations.
Create the Read ports, including the port with the string Type that contains the hierarchical data.
Create the Write ports, including the port with the struct Type that contains the parsed hierarchical data.
Create the Expression ports:
Configure the input string as input and output.
Configure the output struct as output. The Type Definition must reference the complex data type definition you created or imported. Configure the PARSE_JSON or PARSE_XML function for the expression.
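The port layout described above can be sketched in Python: the input string passes through unchanged, and a second output carries the struct typed by a definition. This is a conceptual sketch under stated assumptions, not the PARSE_JSON implementation. The Customer and Address classes stand in for a complex data type definition, and the names expression, json_string, and customer are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for a complex data type definition.
@dataclass
class Address:
    city: str
    zip: str

@dataclass
class Customer:
    id: int
    name: str
    emails: List[str]
    address: Address

def parse_customer(source: str) -> Customer:
    """Parse a JSON string into the Customer struct, coercing each field to its declared type."""
    raw = json.loads(source)
    return Customer(
        id=int(raw["id"]),
        name=str(raw["name"]),
        emails=[str(email) for email in raw["emails"]],
        address=Address(
            city=str(raw["address"]["city"]),
            zip=str(raw["address"]["zip"]),
        ),
    )

def expression(row: dict) -> dict:
    """Mimic the Expression ports: pass the string through and add the parsed struct."""
    return {
        "json_string": row["json_string"],                # input and output port
        "customer": parse_customer(row["json_string"]),   # output-only struct port
    }

row = {
    "json_string": '{"id": 7, "name": "Ada", "emails": ["ada@example.com"], '
                   '"address": {"city": "London", "zip": "N1"}}'
}
result = expression(row)
print(result["customer"])
```

Tying the output to an explicit type definition means that downstream transformations can rely on a fixed set of fields rather than on whatever keys happen to appear in each source string.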
Configure the transformations.
Link the ports and configure the transformation properties based on the mapping logic.
Configure the mapping properties.
Configure the mapping run-time properties: choose Spark as the validation environment and Hadoop as the execution environment.
Validate and run the mapping.
Validate the mapping to identify and correct any errors.
Optionally, view the engine execution plan to debug the logic.