Why is this digital Haggadah special? – The DiJeSt Ground Truth Haggadah

Earlier this week we shared here a digitized Haggadah with a Ladino translation and asked for help identifying the edition. On the Hebrew Printing and Paleography Facebook group, Noam Sienna, who will soon defend his thesis on “Making Jewish Books in North Africa, 1700-1900”, came to our aid and identified it: an edition printed in Livorno (Leghorn) by Elia Benamozegh in 1867 (Yaari 958, Otzar Hahagadot 1292).

During this highly virtual Passover, social networks are replete with beautiful scanned pages of Haggadot shared online. Michelle Chesner tweeted a magnificent thread on Haggadot throughout the ages while cooking for the Seder. A rich variety of digital Haggadah texts is also shared openly, for example on the websites of Sefaria and the Open Siddur.

What is unique in our digital Haggadah?
The file we share below was created using the Transkribus platform, which lets its users not only transcribe scanned images manually or attempt to recognize the text automatically, but also train the computer to recognize text of the same kind. Although OCR (Optical Character Recognition) software has existed for many years, most of it works well mainly on standard prints in the most widely spoken languages, and less so on historical materials and on less common scripts, fonts and languages. Recent advances in machine learning make it possible to get better results even with unusual scripts and fonts, and even with handwriting.

Using Transkribus, we analyzed and transcribed the text so that each line of text is linked to its exact location on the page.

When the data is saved this way, it can be used as “ground truth”: an example that the computer trains on. The outcome of training is a model that can then be used to automatically recognize similar material. Using the model that was trained on the Tetouan Haggadah, we automatically read a page from the Venice Haggadah printed by Bragadin in 1629. The result, though not perfect, is relatively easy to correct. With more data, better models can be trained, and the road opens to automatically reading more Haggadot.
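
For readers who want to put a number on “not perfect”, a common measure in text recognition is the character error rate (CER): the number of single-character edits needed to turn the automatic reading into the corrected text, divided by the length of the corrected text. The sketch below is only an illustration in Python; the function names are our own, and it is not the code Transkribus itself uses.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    # prev[j] holds the distance between the first i-1 characters of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete a character of the reference
                curr[j - 1] + 1,     # insert a character of the hypothesis
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference (corrected) text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Here `reference` would be a manually corrected line and `hypothesis` the line as read by the model; a lower value means less correction work.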

 

Moreover, a ground truth text can be used to train automatic text recognition not only within the Transkribus platform, but in any other machine-learning tool for text recognition. The Hebrew-Ladino Haggadah ground truth is available for download here in both ALTO XML and PAGE XML formats.
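
To give an idea of what “lines linked to their location” looks like in practice, here is a minimal Python sketch that reads the text lines and their outline polygons from a PAGE XML document. The sample document embedded in the code is invented for illustration (it is not taken from our files), and the namespace shown is an assumption; the code matches elements by local name so that it does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET

# Invented, minimal PAGE XML sample (not part of the published ground truth).
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample_page.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,180 100,180"/>
        <TextEquiv><Unicode>הא לחמא עניא</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,200 900,200 900,280 100,280"/>
        <TextEquiv><Unicode>די אכלו אבהתנא</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(xml_text: str) -> list:
    """Return one dict per TextLine with its id, transcription and polygon."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        polygon, text = [], ""
        # Look only at direct children, so that word-level transcriptions
        # nested deeper in the line do not overwrite the line-level text.
        for child in element:
            name = local_name(child.tag)
            if name == "Coords":
                # The 'points' attribute is a space-separated list of "x,y" pairs.
                polygon = [
                    tuple(int(value) for value in pair.split(","))
                    for pair in child.get("points", "").split()
                ]
            elif name == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        lines.append({"id": element.get("id"), "text": text, "polygon": polygon})
    return lines


if __name__ == "__main__":
    for line in read_text_lines(SAMPLE_PAGE_XML):
        print(line["id"], line["polygon"][0], line["text"])
```

To try it on a real file, read the file's contents into a string and pass that to `read_text_lines` in place of the sample.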
