Our text recognition ground truth and model

Automated text recognition is based on machine learning. To train a tool that can read the Hebrew script in its various Jewish languages, types and orthographies, we had to provide it with a “ground truth” or “gold standard”: a corpus of quality transcriptions representing a variety of printed materials. We used the TRANSKRIBUS platform to prepare the DiJeSt ‘gold’ corpus, which comprises over 1,800 pages, including the following samples:

Hebrew

Several sets of pages make up the Hebrew part of the DiJeSt ground truth. The largest is a two-year span of the literary journal HaMeorer, which was fully transcribed by the HaMeorer team and is re-published monthly on the HaMeorer website. The journal was set in Meruba, the most common typeface in print until the mid-20th century, and several pages in each issue include vocalized poetry. Also included are pages from two volumes of S. A. Rosanes, Divrei Yemei Yisrael Be-Togarmah, transcribed by Adir Gupta and Dr. Lilach Assaf-Mølbach in the framework of DiJeSt. The first volume was printed in the Margalit typeface and the other in Frank-Rühl. Several issues from the JPRESS Historical Jewish Press collection, printed mainly in Meruba, were transcribed in the framework of the OMIlab Historical Press project. An issue of Hed Hamizrach, in Frank-Rühl, was prepared by Avital Rabinowicz-Perry from the Haifa University Program for Digital Humanities. An especially varied set is the Ephemera collection of the Haifa municipal archive, which was digitized in the framework of Dr. Aynat Rubinstein’s JEMH corpus project. It includes letters and posters in various typefaces, including typewriter fonts.

Yiddish – Vaybertaytsh

This part of the ground truth corpus includes 51 printed pages of Vaybertaytsh (װײַבערטײַטש‎), the semi-cursive typeface also known as mashket, mesheit, vaybershrift, vayberksav, tkhine ksav, and tsene-urene ksav.
These ground truth pages were transcribed by Dmitri Toperman in the framework of DiJeSt.

The sources are:
Yaakov ben Avraham’s 1602 “Eyn Shoyn Mayseh Bukh”. A digital image of a copy from the Basel University Library, made available on the e-rara platform at http://dx.doi.org/10.3931/e-rara-33068.
Eliahu Bahur’s Bove de Antona, 1541. Text in Ivri teitch (Jewish German), in “zur” script. The digital image was created from the copy held by the Zentralbibliothek Zürich (RaP 116,2) and is available on the www.e-rara.ch platform at http://dx.doi.org/10.3931/e-rara-68541.

Yiddish – Modern

The transcriptions and corrections of hundreds of pages of modern Yiddish works were contributed voluntarily by Beni Sorkin, and in part by Dmitry Toperman and Raphi Halff, in the framework of DiJeSt. This part of the corpus comprises several types and styles of modern Yiddish printed text, from a 1922 edition by the Berlin Klal-Verlag, to 1935 and 1939 editions from the Moscow Soviet “Der Emes” publishing house, to a 1948 New York edition by the Yiddisher Kultur Farband (YIKUF).

The sources were digitized and made available in the digital Yiddish library of the Yiddish Book Center.

Ladino

The ground truth pages were corrected by Matan Stein in the framework of DiJeSt. This part contains 30 pages from Rabbi Yaakov Culi’s 1730 edition and 20 pages from Rabbi Yitzhak Magriso’s 1753 edition of “Meam Loez”. The text of these transcriptions will be annotated and eventually made available in CoDiAJe, the Annotated Diachronic Corpus of Judeo-Spanish, on the TEITOK platform.

An issue of the newspaper La Epoca, provided by JPRESS, was also transcribed by R. Matan Stein. Finally, the Hebrew and Ladino Jalfon family Haggadah, on which you can read more here, was transcribed, corrected and vocalized by Michel Jalfon, Tomer Mazuz, Dr. Lilach Assaf-Mølbach and Dmitry Toperman, and is included in this collection.

Judeo-Arabic

Pages of the Judeo-Arabic newspapers “Or Hama’arav” and “Maguid Micharim”, also provided by JPRESS, were transcribed by Nadia Serba Ammuna in the framework of DiJeSt.

 
