Data Transformation User Guide

PdfToTxt_4

The PdfToTxt_4 document processor converts PDF files to text or XML.
The following table describes the properties of the PdfToTxt_4 document processor:

Property    Description
param1      Defines the PDF table layout. The param1 property has only one option: PdfLayout.
value       Defines the PDF table layout. Double-click the value property to open the table configuration editor.
The table configuration editor customizes the way tables are read. Use it to correct problems with column alignment, word wrapping, line spacing, and overflow from one cell to another. For more information, see PdfToTxt_4 Table Configuration Editor.
The PdfToTxt_4 document processor generates text output by default. Use the table configuration editor to select XML output. The XML conforms to the PDF4.xsd schema, which you can find in the following directory:
<INSTALL_DIR>\DataTransformation\doc
When you use the PdfToTxt_4 document processor, set the input encoding to UTF-8 to enable the Parser, Mapper, or Serializer to correctly read the document.
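If you save converted text to a file and want to confirm outside the product that it decodes as UTF-8, a small check such as the following might help. It is a minimal sketch, not part of Data Transformation; the function name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

To check a file, read it in binary mode and pass the bytes to the function, for example is_valid_utf8(open(path, "rb").read()).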
The PdfToTxt pre-processor might not support certain PDFs with embedded fonts. If the pre-processor fails, copy the text from the input PDF into Notepad to check for embedded fonts. If you cannot paste the text or if the pasted text is corrupted, the PDF probably contains embedded fonts.
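As a scripted complement to the Notepad check, you can scan the raw PDF bytes for the font descriptor keys that reference an embedded font program (FontFile, FontFile2, and FontFile3 in the PDF format). The following is a rough heuristic under stated assumptions, not a feature of Data Transformation: the function names are hypothetical, and a PDF that stores its font descriptors inside compressed object streams can contain embedded fonts without these keys being visible in the raw bytes, so a negative result is inconclusive.

```python
import re

# Keys that reference an embedded font program in a PDF font descriptor.
_FONT_FILE_KEY = re.compile(rb"/FontFile[23]?\b")


def has_embedded_font_marker(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain a FontFile, FontFile2, or FontFile3 key."""
    return _FONT_FILE_KEY.search(pdf_bytes) is not None


def file_has_embedded_font_marker(path: str) -> bool:
    """Read a PDF from disk and apply the same heuristic to its bytes."""
    with open(path, "rb") as handle:
        return has_embedded_font_marker(handle.read())
```

A True result suggests that the PDF embeds at least one font. A False result only means that no uncompressed marker was found.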


Updated September 26, 2018