Isolated field format losses with Acrobat PDF to XML exports
ReadMe3-9-2019Isolated fieldformatlosseswithAcrobatPDFtoXML_exports.txt
Adobe Acrobat Pro XML Export Bug Report
by Rich Hartness, rhartness@mabnc.org
1 Title: Isolated field format losses with Acrobat PDF to XML exports
2 Intent
I'm writing to report what appears to be a small but significant bug in the PDF to XML exporter of Acrobat Pro 2017 and Acrobat Pro DC, to share my evidence, and to ask for your support in authorizing a code fix in a future update.
3 Problem description and scope
The problem seems to exist only in XML exports. In these cases, several individual fields of data are dumped into a single XML record, without preserving their individual field identities in the XML code. This problem does not exist in the other export formats, such as Word, Excel, rich text, HTML, and spreadsheet XML.
4 Steps to reproduce the issue: Perform PDF to XML exports from Acrobat Pro 2017 or Acrobat Pro DC. When you find multiple fields in the original file showing up in a single XML record, then export the same file to Word, Excel, Rich text, HTML or spreadsheet XML, and you will likely find the problem does not exist in those other formats. I have included a sample PDF file and corresponding XML export file that demonstrates the problem in several sections.
5 OS is Windows 8.1 and Windows 10. Applications are the latest versions of Acrobat Pro 2017 and Acrobat Pro DC.
6 Expected result: All individual fields of information in the source PDF should be preserved as distinct units in the XML file. Separate lines of information in the source file, like name, address, and city/state/zip, as well as column headers and individual cells in a table, should appear as distinct elements in the XML file.
7 Observed result: In multiple isolated cases, several fields of information that were separate and distinct in the source PDF file became indistinguishable members of a single XML record, without any delimiters in the XML file.
8 Attached zip containing important file samples
Zip name: Isolatedfield-formatlosseswithAcrobatPDFtoXMLexports.zip
The zip contains 7 files. They are as follows:
Source document: SampleBank1.pdf
Problematic file: SampleBank1.xml
Other file exports demonstrating greater format integrity:
SampleBank1.docx
SampleBank1.html
SampleBank1.rtf
SampleBank1.xlsx
SampleBank1_SpreadsheetXML.xml
9 Documented examples of problematic XML export behavior
The attached zip contains a redacted bank statement PDF, and 6 different file formats I exported it to.
There are 3 instances in this XML sample that illustrate this ambiguous XML field formatting export behavior. Strikingly, this bad export behavior does not exist in the other 5 file export formats. I cite them all below.
•9A Instance 1 Name/address/CityStateZip
Open the Word formatted file in Word. From the top of the document, search for the text string "JANE DOE" (not including the quotes). The cursor will land at the beginning of three consecutive lines of name and address information:
JANE DOE
1234 SESAME ST
ANYWHERE NC 12345-6789
This information is also presented as 3 distinct lines, Name, address and city/state/zip in rich text, HTML, and spreadsheet XML.
In Excel, the line containing JANE DOE appears at first to be in the same record as the street and city/state/zip lines, but when that cell is pasted into WordPad, it breaks out into 3 separate lines.
By contrast, in the XML file, these same three lines of data are placed in a single XML record, separated only by space characters. The 3 fields are indistinguishable, as follows:
JANE DOE 1234 SESAME ST ANYWHERE NC 12345-6789
The record containing the exact XML code taken directly from the XML file is as follows:
<TH>JANE DOE 1234 SESAME ST ANYWHERE NC 12345-6789 </TH>
There is no way to determine where the name ends and where the street address begins. The same is true for city/state/zip. Instead of a discernible delimiter between those fields, there is only an ambiguous space character.
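To illustrate the ambiguity, here is a minimal Python sketch. It assumes only that the record is known to hold 3 fields and that fields are split at spaces; the function name candidate_splits is my own and is not part of any Adobe tool. It lists every way the record text could be divided into 3 fields.

```python
from itertools import combinations

# Text of the <TH> record quoted above (trailing space dropped).
RECORD_TEXT = "JANE DOE 1234 SESAME ST ANYWHERE NC 12345-6789"


def candidate_splits(text, fields=3):
    """Return every way to cut the text into `fields` non-empty runs of words."""
    tokens = text.split()
    results = []
    # Pick fields-1 cut points among the gaps between adjacent words.
    for cuts in combinations(range(1, len(tokens)), fields - 1):
        bounds = (0, *cuts, len(tokens))
        results.append(
            tuple(" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:]))
        )
    return results


splits = candidate_splits(RECORD_TEXT)
print(len(splits))  # number of equally plausible 3-field readings
for split in splits[:3]:
    print(split)
```

With 8 words in the record, there are 21 possible 3-field readings, and nothing in the XML indicates which one matches the source PDF.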
•9B Instance 2 Column headers inside the "Summary of checks written" table
Open the Word formatted file in Word, and from the top of the document, search for the text string "Amount" (not including the quotes). The cursor lands in the 3rd column of the row of table headers inside the "Summary of checks written" table. The row of table headers appears in the Word file as:
Number(tab)Date(tab)Amount(tab)(tab)Number(tab)Date(tab)Amount(tab)(tab)Number(tab)Date(tab)Amount
The row is perfectly delimited, with a single tab character between Number, Date and Amount, and two consecutive tab characters between Amount and the next Number.
The same exact presentation with one and two tab characters is also found in the rich text file.
In HTML, spreadsheet XML, and Excel, format integrity is preserved: each column header is in its own column, and there is an empty column between AMOUNT and NUMBER where the table repeats further to the right.
By contrast, in the XML file, all 9 column header fields run together and appear in a single XML record as:
<P>Number Date Amount Number Date Amount Number Date Amount </P>
All 9 fields appear in that single XML record, and each column header is separated by a single space character. There's no way to tell where one column header ends and the next one begins. Similarly, there's no way to tell when the sequence of 3 columns is repeated.
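The difference between the two exports can be sketched in a few lines of Python. This is only an illustration: it assumes that each "(tab)" shown in the Word row above stands for one literal tab character, and the variable names are my own.

```python
# Header row as it appears in the Word export, with "(tab)" written as "\t".
WORD_ROW = "Number\tDate\tAmount\t\tNumber\tDate\tAmount\t\tNumber\tDate\tAmount"

# Text of the <P> record from the XML export.
XML_TEXT = "Number Date Amount Number Date Amount Number Date Amount "

# Splitting on the tab delimiter keeps the empty spacer columns.
word_fields = WORD_ROW.split("\t")

# Splitting on whitespace is the only option for the XML text.
xml_fields = XML_TEXT.split()

print(len(word_fields), word_fields)
print(len(xml_fields), xml_fields)
```

The Word row yields 11 fields, two of which are empty and mark the gaps between the repeated column groups. The XML text yields only 9 words, and the spacer columns are gone.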
•9C Instance 3 Column data inside the "Summary of checks written" table
Open the Word formatted file in Word, and from the top of the document, search for the second instance of the text string "299" (not including the quotes). The cursor is placed in the first column of the first row of table data inside the "Summary of checks written" table. The complete first row of table data appears as:
299(tab)11/30(tab)114.38(tab)(tab)305(tab)12/11(tab)30.00(tab)(tab)307(tab)12/20(tab)40.40
Perfect format integrity again. Just like the row containing the column headers, there is a single tab character between the adjacent columns containing 299, 11/30, and 114.38. There are also two consecutive tab characters between the Amount and Number columns, 114.38 and 305. Individual field identification is clear and unambiguous.
The same exact superb presentation with one and two tab characters is also true for rich text.
In HTML, spreadsheet XML, and Excel, format integrity is preserved. Each column data entry is in its own column, and there is an empty column between AMOUNT and NUMBER where the table repeats further to the right.
By contrast, and surprisingly, the XML file contains all 36 cells of data (the entire table) in a single record, with single spaces separating each field. Without meaningful field delimiters, there's no way to reliably distinguish the data fields, because it is not safe to assume that table entries will never contain spaces.
Record from XML file with exact XML coding:
<P>299 11/30 114.38 305 12/11 30.00 307 12/20 40.40 298 12/3 215.00 302 12/17 15.00 311 12/26 15.00 301 12/3 50.00 306 12/17 39.70 312 12/28 5.00 300 12/4 38.10 309 12/19 43.67 310 12/28 46.70 </P>
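The following Python sketch shows why a consumer of this record cannot rely on spaces as delimiters. It uses the standard library xml.etree.ElementTree parser. The helper names split_record and group_rows are my own, and the second record is a hypothetical example I made up, not data from the sample file.

```python
import xml.etree.ElementTree as ET

# Record copied from the XML export quoted above.
RECORD = (
    "<P>299 11/30 114.38 305 12/11 30.00 307 12/20 40.40 "
    "298 12/3 215.00 302 12/17 15.00 311 12/26 15.00 "
    "301 12/3 50.00 306 12/17 39.70 312 12/28 5.00 "
    "300 12/4 38.10 309 12/19 43.67 310 12/28 46.70 </P>"
)


def split_record(xml_fragment):
    """Return the whitespace-separated words of a single XML element."""
    text = ET.fromstring(xml_fragment).text or ""
    return text.split()


def group_rows(tokens, width=3):
    """Group words into (number, date, amount) tuples.

    Only valid under the ASSUMPTION that no field contains a space.
    """
    if len(tokens) % width != 0:
        raise ValueError(f"{len(tokens)} words cannot form groups of {width}")
    return [tuple(tokens[i:i + width]) for i in range(0, len(tokens), width)]


tokens = split_record(RECORD)
rows = group_rows(tokens)
print(len(tokens), len(rows), rows[0])

# HYPOTHETICAL record (not from the sample file): an amount written "1 114.38".
hypothetical = "<P>299 11/30 1 114.38 305 12/11 30.00 </P>"
try:
    group_rows(split_record(hypothetical))
    recovered = True
except ValueError:
    recovered = False
print(recovered)
```

Grouping by threes happens to work for this sample, but only because none of the 36 values contains a space. The hypothetical record breaks the guess as soon as one value does, and even in the good case the empty spacer columns and the row boundaries of the original table are not recoverable from the XML.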
10 Conclusion:
Adobe has done a terrific job creating PDF content and format export functionality for many other popular file formats. I have cited only the parts of PDF to XML file export that appear broken. That's a small part of the whole. The exporter does correctly identify and pass along most of the other content and formatting. There was another table earlier in the document, entitled "Transaction history", where every field of column headers and table data is properly exported and conveyed as a distinct element in the XML file, so the problem is not the entire approach of the XML exporter, only a couple of small instances.
The problematic formatting of the ambiguous records cited above is unique to XML file export. Since the other 5 export formats (Word, Excel, rich text, HTML and spreadsheet XML) export those records and maintain the format integrity of each contributing field, I believe the XML export code is broken or simply misbehaving. I hope you will join me in recognizing this rogue XML file export behavior as a bug or flaw, and pass it along to your change control process so that it may be corrected in an upcoming update.