Adobe Acrobat Pro bug - extracting text with ligatures
This is about both editing and exporting PDFs.
The attached PDF explains the issues in detail and includes screenshots. The text from the attached file is reproduced here.
I am using Adobe Acrobat Pro DC (2015 Release), version 2015.006.30355.
On the web page https://www.zm.gov.lv/mezi/statiskas-lapas/nozares-informacija/buklets-meza-nozare-latvija-?nid=1088#jump
there is a link to a PDF file, "Latvia's Forests During 20 Years of Independence" (2010): https://www.zm.gov.lv/public/ck/files/ZM/mezhi/buklets/MN20EN.pdf
The PDF appears to have been properly composed with Adobe InDesign CS5.5 (7.5.2).
The font used, Museo Slab, is publicly available, so I would not expect it to have any major issues.
On page 47 of that file there are multiple issues around the ligature ffi in the source word ‘office’. This is the only occurrence of the word ‘office’ in the file.
- When I Copy/Paste from Acrobat Pro to Notepad++ with the regular Select tool, I get the UTF-8 bytes C2 8A (U+008A, a C1 control character), which Notepad++ shows as [VTS].
- With the Text Edit tool (Tools > Edit PDF > Edit), I cannot select the entire word, only from the second letter onward. I tried selecting both with the mouse and with SHIFT+arrow keys; there is no way to select the first letter. For the remaining letters, however, Copy/Paste to Notepad++ outputs all the components of the ligature as correct letters:
Just ‘ffice’ of ‘office’.
This shows that there is a facility for getting normal text from ligatures, but it is either implemented incorrectly or missing entirely in some tools and scenarios, which makes them buggy. Why can I not select the leading ‘o’?
- Also, when the file is exported to HTML, the ligature is encoded as C2 8A in UTF-8, the same as for Copy/Paste.
If the routine that lets the Edit tool copy ffi correctly were also applied to the Select and Export tools, some of the Acrobat Pro bugs would be fixed.
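To make the byte values above concrete, here is a minimal Python sketch using only the standard library. It shows that the bytes C2 8A decode to U+008A, a C1 control character rather than any letter, and that a correctly mapped ligature (U+FB03, LATIN SMALL LIGATURE FFI) expands to plain 'ffi' under NFKC normalization. The helper name `describe` is hypothetical, and nothing here reflects how Acrobat works internally.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return code point, general category, and Unicode name (if any)."""
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"

# The two bytes that Copy/Paste and HTML export produce for the ligature.
pasted = bytes([0xC2, 0x8A]).decode("utf-8")
print(describe(pasted))

# What a correct extraction could contain instead: the ligature code point...
ligature = "\uFB03"
print(describe(ligature))

# ...which compatibility normalization (NFKC) expands to three plain letters.
print(unicodedata.normalize("NFKC", "o" + ligature + "ce"))
```

The category `Cc` reported for the pasted character marks it as a control code, which is why editors such as Notepad++ show a placeholder instead of a glyph.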
I have several other bugs on my list. If this makes sense to Adobe, please confirm, and I will keep posting. I also hope these will be fixed as soon as possible; they have been present since at least v9.
James Kidder commented
The principal offender I am familiar with seems to be the Cornell group that runs arXiv, the physics preprint server. Those papers routinely have ligature issues. Sometimes converting those papers to Microsquash Word does the trick, and sometimes it does not.
Mike Dolbow commented
Just wanted to add that I found a similar accessibility issue in a document exported from QGIS using the Calibri font, with "tt" or "ti" inside words. The ligatures produced by this export look fine in the PDF until you run an accessibility scan, at which point those words are flagged as having bad character encoding. A screen reader gets completely confused because it cannot read the ligatures.
Words like "matter" and "intention" show up as "ma..er" and "inten..on" respectively if you tag the section as a paragraph and ask a screen reader to read it.
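One way to check extracted text for this kind of damage before it reaches a screen reader is to scan for code points that should never appear in running prose. The sketch below is an assumption, not what any accessibility checker actually does: it flags control characters (other than tab, newline, and carriage return), private-use code points, unassigned code points, and U+FFFD. The function name `find_suspect_chars` is hypothetical, and which code points the QGIS export actually produced is not known from this report.

```python
import unicodedata

# Whitespace controls that are normal in extracted text.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}

def find_suspect_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for characters unlikely to be real text."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        is_bad_control = cat == "Cc" and ch not in ALLOWED_CONTROLS
        is_private_or_unassigned = cat in ("Co", "Cn")
        if is_bad_control or is_private_or_unassigned or ch == "\uFFFD":
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits

print(find_suspect_chars("o\u008ace"))  # C1 control where a ligature should be
print(find_suspect_chars("office"))     # clean text
```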
Over two years since reported...anyone know if it has any chance of being fixed?
Roberts Rozis commented
The same is true with other ligatures, too.
E.g., on page 4 the text reads: it offers ways
Copy/Paste with Selection tool yields: it o[OSC]ers ways / o ers ways
Copy/Paste with Edit tool: it off ers ways
Export to HTML yields: it o[OSC]ers ways / o ers ways
...none of which is correct. :(
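As a stopgap for text already extracted from this particular file, the stray control characters can be mapped back to the letters they stand for. The table below is inferred only from the two cases reported here (U+008A, shown as [VTS], for 'ffi' on page 47, and U+009D, shown as [OSC], for 'ff' on page 4). It is an assumption that the mapping holds across the whole document, and it will not transfer to other PDFs or fonts. The names `OBSERVED_LIGATURES` and `repair_ligatures` are hypothetical.

```python
# Mapping inferred from this one PDF only; other files will differ.
OBSERVED_LIGATURES = {
    "\u008A": "ffi",  # shown as [VTS] in Notepad++, seen in 'office' (page 47)
    "\u009D": "ff",   # shown as [OSC] in Notepad++, seen in 'offers' (page 4)
}

def repair_ligatures(text: str) -> str:
    """Replace the observed stray control characters with plain letters."""
    return text.translate(str.maketrans(OBSERVED_LIGATURES))

print(repair_ligatures("o\u008Ace"))
print(repair_ligatures("it o\u009Ders ways"))
```

This only patches the symptom after extraction; the underlying text mapping in the PDF stays unchanged, so Acrobat's Select and Export tools will keep producing the same bytes.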