Data Transformation User Guide

PdfToTxt_4

The PdfToTxt_4 document processor converts PDF files to text or XML.
The following table describes the properties of the PdfToTxt_4 document processor:

Property    Description
param1      Defines the PDF table layout. The param1 property has only one option: PdfLayout.
value       Defines the PDF table layout. Double-click the value property to open the table configuration editor.
The table configuration editor customizes the way tables are read. Use it to correct problems with column alignment, word wrapping, line spacing, and overflow from one cell to another. For more information, see PdfToTxt_4 Table Configuration Editor.
The PdfToTxt_4 document processor generates text output by default. Use the table configuration editor to select XML output. The XML conforms to the PDF4.xsd schema, which you can find in the following directory:
<INSTALL_DIR>\DataTransformation\doc
When you use the PdfToTxt_4 document processor, set the input encoding to UTF-8 to enable the Parser, Mapper, or Serializer to correctly read the document.
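If a downstream component still misreads the text, it can help to confirm that the bytes it receives are valid UTF-8. The following Python sketch is not part of Data Transformation. It assumes you have saved the processor output to a file yourself, and the function name first_invalid_utf8_offset is hypothetical.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for checking text that was saved from the
    processor output. It only tests the encoding, not the content.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start
    return None
```

To check a saved file, pass its raw bytes, for example pathlib.Path("output.txt").read_bytes(), where output.txt is a placeholder name. A result of None means that the whole file decodes as UTF-8.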
The PdfToTxt pre-processor might not support certain PDFs with embedded fonts. If the pre-processor fails, copy the text from the input PDF into Notepad to check for embedded fonts. If you cannot paste the text or if the pasted text is corrupted, the PDF probably contains embedded fonts.
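As a rough supplement to the Notepad check, you can scan the raw PDF bytes for the font descriptor keys that reference embedded font programs (/FontFile, /FontFile2, and /FontFile3 in the PDF specification). The following Python sketch is an assumption-laden heuristic and not a Data Transformation feature. The function name embedded_font_keys is hypothetical, and the scan can miss fonts in PDFs that store their objects in compressed object streams, so an empty result does not prove that no fonts are embedded.

```python
import re

# Font descriptors point to embedded font programs through these keys:
# /FontFile (Type 1), /FontFile2 (TrueType), /FontFile3 (other formats).
FONT_PROGRAM_KEY = re.compile(rb"/FontFile[23]?\b")


def embedded_font_keys(pdf_bytes: bytes) -> set:
    """Return the embedded-font keys that appear in uncompressed PDF bytes.

    Heuristic only: objects inside compressed object streams are not
    visible to a plain byte scan.
    """
    return {
        match.group(0).decode("ascii")
        for match in FONT_PROGRAM_KEY.finditer(pdf_bytes)
    }
```

Pass the file contents as bytes, for example pathlib.Path("input.pdf").read_bytes(), where input.pdf is a placeholder name. A non-empty result suggests that the PDF contains at least one embedded font.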