By Janna Katharina Müller
In my last blog post, I wrote about the source corpus for my master’s thesis – the journal “Monatliche Correspondenz zur Beförderung der Erd- und Himmelskunde” (MC) – and my plan to subject it to digital analysis. The main thing I needed for my analysis was a digital text. Thanks to the Thuringian University and State Library in Jena, the scanned originals of the MC are available online, but only as non-machine-readable PDF files. The first step towards usable data was thus to generate a text from image files.
It was essential, however, to consider the typeface used in the MC. As the example page below shows, the MC was printed in a typeface that uses, among other things, the long s (“ſ”), an archaic form of the lower-case letter s. Unlike most German publications of the early 19th century, it is not set in a blackletter (“broken”) typeface such as Fraktur but in an Antiqua typeface with serifs and rounded arcs. Antiqua was used primarily for Latin, Italian, and French texts and was rather uncommon in German prints.
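To make the long s a little more concrete: in Unicode it is a character of its own (U+017F), distinct from the regular s, so a text that preserves it will not match a search for the modern spelling. The following is only a minimal sketch of how one might count the long s and map it to a modern s. The function names and the sample word are my own illustrations and are not taken from the MC or from my actual workflow.

```python
# The long s is a separate Unicode character: U+017F LATIN SMALL LETTER LONG S.
LONG_S = "\u017f"


def count_long_s(text: str) -> int:
    """Count how often the long s occurs in a text."""
    return text.count(LONG_S)


def normalize_long_s(text: str) -> str:
    """Replace every long s with a regular lower-case s."""
    return text.replace(LONG_S, "s")


# Illustrative sample word (an assumption, not a quotation from the MC).
sample = "Wi\u017f\u017fen\u017fchaft"
normalized = normalize_long_s(sample)

print(sample, "->", normalized)
print("long s count:", count_long_s(sample))
```

A plain replacement like this discards the original spelling, so it would probably be sensible to keep the unnormalized text alongside the normalized one.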

OCR with Tesseract
One of the best-known ways to recognize text is Optical Character Recognition (OCR), the electronic or mechanical conversion of images into machine-encoded text based on the recognition of individual characters.
[...]
Source: https://href.hypotheses.org/2105