This workflow is for extracting information from PDF files and populating your own docx file.
Typically these PDF files are created by a 3rd party. If you control the source, it is better to extract data from a docx file, because a PDF is more complex than the docx structure.
A typical use is creating personalized documents from data in 3rd party documents. It also suits a reliable and predictable in-house source that happens to be in PDF format.
The problem with doing this process manually is that the client details might occur many times throughout the document,
making it difficult to catch every occurrence with 100% accuracy and making manual editing time consuming and error prone.
Think about what text needs to change in our docx document. Suppose we want the final result to look something like below
Below is a screenshot of page 4 of our 3rd party PDF file which contains most of the information we need.
The full document is located here
Decide what information you want to extract and what tags you want to create in your Word docx
document, so the process will know what data to extract and where to put it.
You can use any text tag you like, but it has to be unique within the document.
In this example, we are going to extract a number of fields from a PDF file.
To make the document more readable, we are going to make the tags in our docx document self-descriptive.
Our docx file will look something like below
This is the file we will be uploading to the project. It is our output document template.
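The idea of a tagged template can be sketched in a few lines of Python. This is only an illustration, not how the tool itself works: a plain string stands in for the text of the docx file, `fill_tags` is a hypothetical helper, and `((client))` and the sample values are made up for the example (only `((address))` comes from this guide).

```python
def fill_tags(template, values):
    """Replace each tag in `template` with its value; matching is case sensitive."""
    result = template
    for tag, value in values.items():
        if tag not in result:
            # A missing tag usually means a typo or a case mismatch in the template.
            raise KeyError(f"tag {tag} not found in template")
        result = result.replace(tag, value)
    return result


# Made-up template text and values, standing in for the docx output template.
template = "Dear ((client)), this letter concerns the property at ((address))."
values = {"((client))": "J. Smith", "((address))": "12 Example Street"}
print(fill_tags(template, values))
```

Because the substitution is a plain text match, a tag that also appears as ordinary text elsewhere in the document would be replaced too, which is why the tag text needs to be unique.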
Create a new project. Fill in the project name and description.
Now upload your PDF input file(s). It should look something like this, but with the filename you selected.
Now upload your docx output file. It should look something like this, but with the filename you selected.
This is where we connect the data in the PDF file to how we want it populated into our docx file. This
operation can take a little setup the first time, because PDF files are not editable and any issues in the source
cannot be fixed up directly.
Understanding how a PDF stores text helps explain why the extraction rules are necessary.
In a PDF file each phrase of text is broken into a set of data which describes the text and its x,y location. So, in order to find
the text we wish to extract, we need to find a non-variable piece of text, like a title or label, as a frame of reference.
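As a rough sketch of this idea, the Python below models a page as a list of text fragments with positions and finds the fragment that follows a fixed label. The `Fragment` type, the `field_after_label` helper and the sample coordinates are all hypothetical, and the reading order (top to bottom, then left to right, with y growing down the page) is an assumption; a real PDF library may report coordinates differently.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One phrase of text and where it sits on the page (hypothetical model)."""
    text: str
    x: float
    y: float


def field_after_label(fragments, label):
    """Return the fragment that follows the first fragment containing `label`."""
    # Assumption: reading order is top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    for i, frag in enumerate(ordered[:-1]):
        if label in frag.text:
            return ordered[i + 1]
    return None  # label missing, or nothing follows it


# Made-up fragments, deliberately out of order as they might be stored in a PDF.
page = [
    Fragment("12 Example Street", 200.0, 100.0),
    Fragment("Property Address:", 50.0, 100.0),
    Fragment("Owner:", 50.0, 120.0),
    Fragment("J. Smith", 200.0, 120.0),
]
print(field_after_label(page, "Property Address:").text)
```

The label never changes from one document to the next, while the value beside it does, so the label is what the search anchors on.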
Let's work with extracting the address. The tag in our docx file is ((address)).
If we look at the PDF file, the address is labelled "Property Address:". This is how we will locate the actual address: via its descriptive label.
To describe how we find this, let's look at the data required. In most cases you will not have to enter anything more than the label in the PDF and the tag in the docx file. This applies when the descriptive label to search for is to the left of the actual data, which is the case most of the time.
Each item of data to be extracted has its own rule. Each rule has a set of attributes which we need to set up.
1. Name
Name is the text to search for in the PDF document, which in this case is "Property Address:".
Note that this is case sensitive.
2. Locator Position Default: +After
This describes where the actual data is in relation to the Name field above. The actual address is a field to the right
of Name, so it will be the next field in the PDF file. In this case you would choose +After. The + means 1 field after the label.
++After means 2 fields after the label.
3. Text Position Default: "Contains"
This is where the label being searched for must appear within the field text. Generally "Contains" is the best option.
"Start" means that the
4. Occurrence Default: 1
This selects which occurrence of the search text to match. Generally leave this at 1. If you set
it to 2, for example, the rule will match the 2nd occurrence of the phrase "Property Address:".
5. Start Position Default: 0
This is the index position within the field to start extracting the text from. Normally you would leave the default
to extract the entire field. If, for example, you wanted to skip the first 5 characters in the field, you would enter
a start position of 5.
6. Length Default: -1
This is the number of characters to extract from the field. The default of -1 means extract all. Otherwise the rule
copies the number of characters entered here when extracting the string.
7. Tag
This is the tag in the docx file to place the data into. It must be unique to work correctly. In this example
the tag is ((address)). Note that this is case sensitive.
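To show how these attributes could fit together, here is a minimal Python sketch of a rule applied to an ordered list of fields. It is an illustration under stated assumptions, not the tool's implementation: `Rule` and `apply_rule` are hypothetical names, `offset` stands in for the Locator Position (+After is 1, ++After is 2), "Start" is assumed to mean that the field begins with the label, and the field values are made-up sample data.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical stand-in for one extraction rule."""
    name: str                       # label to search for (case sensitive)
    tag: str                        # tag in the docx file, e.g. "((address))"
    offset: int = 1                 # +After -> 1, ++After -> 2
    text_position: str = "Contains"
    occurrence: int = 1             # which match of the label to use
    start: int = 0                  # index within the field to start from
    length: int = -1                # -1 means extract to the end of the field


def apply_rule(rule, fields):
    """Return the extracted text for `rule`, or None if nothing matches."""
    seen = 0
    for i, field in enumerate(fields):
        if rule.text_position == "Start":
            # Assumption: "Start" requires the field to begin with the label.
            matched = field.startswith(rule.name)
        else:
            matched = rule.name in field
        if not matched:
            continue
        seen += 1
        if seen < rule.occurrence:
            continue  # keep looking for a later occurrence
        target = i + rule.offset
        if target >= len(fields):
            return None  # no field that far after the label
        end = None if rule.length < 0 else rule.start + rule.length
        return fields[target][rule.start:end]
    return None


# Made-up fields in reading order, as they might come out of a PDF page.
fields = ["Property Address:", "12 Example Street", "Owner:", "J. Smith"]
rule = Rule(name="Property Address:", tag="((address))")
print(apply_rule(rule, fields))
```

With the defaults, only the label and the tag need to be supplied, which matches the common case where the label sits directly to the left of the data.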
Don't forget to save your project.
Go back to the projects list and click the play button, which will run the task. Once the task is finished, a download button will appear next to each document for you to download.