Soft and hard hyphens in German text
German, as we know, is a language with some very long words. It is therefore very common for words to be split automatically across lines, using what we used to call a "soft" hyphen. When a German PDF is exported to a Word document, this "soft" hyphen is not distinguished from a "hard" hyphen (i.e. one produced by a key-stroke, as in the hyphen used just then).
I was asked yesterday to "Compare" a revised document in German (submitted as a PDF) with a former version in Word, and to submit an estimate for updating the English translation. The straight conversion using Adobe Acrobat produced an almost continuous red line when "Compared", because every "soft" hyphen had been transformed into a "hard" hyphen plus space, and hence a change was flagged.
I started a "Replace" running, accepting or rejecting each hyphen in turn, and after half an hour realised that estimating the job was likely to take as long as doing it. Frustrated, I gave up and had some dinner. I got up very early this morning (Tuesday) with this realisation:
Most "-[space]" in German texts were originally soft hyphens.
The main exceptions to this are "-[space]und", "-[space]oder" and "-[space]bzw."
Once I had this in mind, I searched for each of these exceptions and replaced it with an unlikely key, deleted all remaining "-[space]", and then replaced the unlikely key with the original. It was almost 100% successful.
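For anyone who would rather script this than do it by hand in Word, the three steps above can be sketched roughly as follows. This is only an illustration of my manual procedure, not how Acrobat works; the function name `rejoin_soft_hyphens` and the choice of placeholder character are my own assumptions, and the exception list is just the three cases named above.

```python
# Words after which "- " is usually a genuine hyphen, as in "Ein- und Ausgang".
# Only the three exceptions from the post; extend the tuple as needed.
EXCEPTIONS = ("und", "oder", "bzw.")

# The "unlikely key": a private-use character assumed not to occur in the text.
PLACEHOLDER = "\uE000"


def rejoin_soft_hyphens(text: str) -> str:
    """Remove "- " left over from line-break hyphenation, keeping the exceptions."""
    # Step 1: protect each exception by swapping its "- " for the placeholder.
    for word in EXCEPTIONS:
        text = text.replace("- " + word, PLACEHOLDER + word)
    # Step 2: delete every remaining "- ", which rejoins the split words.
    text = text.replace("- ", "")
    # Step 3: put the protected hyphen-plus-space back.
    return text.replace(PLACEHOLDER, "- ")


sample = "Die Donau- dampfschifffahrt hat Ein- und Ausgang."
print(rejoin_soft_hyphens(sample))
```

Like the manual version, this is a plain text substitution, so it will also swallow a dash used as punctuation between two spaces, and it will miss genuine hyphens followed by words other than the three listed. The result still wants a quick read-through.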
I don't, of course, know how Acrobat works (I'm 65 and my entire experience with data processing has been tentative) - but I think it's an algorithm based on the appearance of the text and then a best match (rather than dealing with the ideas of "soft" or "hard" hyphens). But it seriously messes up the result.
So, I offer my ideas for your further research. I think it will hugely improve the conversion of German in particular.
If it works, you can thank me in the nicest way you can think of :).
Girija Agarwala (Admin, Adobe) commented
Thanks for bringing this up.
Could you please help us with the pdf files and the exported output?