In advanced mode, the Source transformation can read text from PDF files.
The Source transformation extracts the full structure of the document, including text,
tables, headings, and metadata. You can extract text from documents with different
structures, such as invoices and reports, while preserving the order of the
text.
To read a PDF, use the Source tab and select Document. Data Integration sets the input type to PDF automatically.
To read a directory of PDFs, change the Source Type in the advanced properties to Directory. For the File Name Override, enter *.pdf.
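The effect of a wildcard such as *.pdf can be illustrated outside the product. The following Python sketch is not part of Data Integration. It assumes ordinary shell-style wildcard semantics, and the function name select_pdfs and the sample file names are hypothetical. How Data Integration itself evaluates the pattern, including letter case, is not stated here.

```python
from fnmatch import fnmatchcase


def select_pdfs(file_names, pattern="*.pdf"):
    """Return the file names that match a shell-style wildcard pattern.

    fnmatchcase compares case-sensitively on every platform, which keeps
    this illustration deterministic. The product's own matching rules
    are an assumption here, not a documented fact.
    """
    return [name for name in file_names if fnmatchcase(name, pattern)]


# Hypothetical directory listing.
names = ["invoice_001.pdf", "report.pdf", "notes.txt", "summary.docx"]
selected = select_pdfs(names)
print(selected)
```

Under these assumptions, only the names that end in .pdf are selected, and the other files in the directory are skipped.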
The Fields tab displays fields to store the text, file path, file type, and file name for each PDF.
You can pass the text to downstream Chunking and Vector Embedding transformations to
build a RAG ingestion pipeline, or you can process the text, create structured data from
it, and write it to a JSON file.
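As a rough illustration of what downstream processing does with the extracted text, the following Python sketch splits one record into overlapping chunks and serializes the result as JSON. It is a conceptual sketch only, not the Chunking transformation itself. The field names (text, file_path, file_type, file_name), the sample record, the helper functions, and the fixed-size character chunking strategy are all assumptions made for the example.

```python
import json


def chunk_text(text, size=40, overlap=10):
    """Split text into fixed-size character chunks that overlap.

    Each chunk starts (size - overlap) characters after the previous one,
    so neighboring chunks share `overlap` characters of context.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def to_chunk_rows(record, size=40, overlap=10):
    """Turn one extracted-PDF record into one row per chunk."""
    chunks = chunk_text(record["text"], size, overlap)
    return [
        {"file_name": record["file_name"], "chunk_index": i, "chunk_text": c}
        for i, c in enumerate(chunks)
    ]


# Hypothetical record shaped like the fields described for each PDF.
record = {
    "text": "Invoice 1001. Total due: 250.00 USD. Payment terms: net 30 days.",
    "file_path": "/data/invoices/invoice_1001.pdf",
    "file_type": "pdf",
    "file_name": "invoice_1001.pdf",
}

rows = to_chunk_rows(record)
payload = json.dumps(rows, indent=2)
print(payload)
```

Keeping the file name on every chunk row lets a later embedding or retrieval step trace each chunk back to its source PDF. Real pipelines often chunk by sentence, heading, or token count rather than by raw character count.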