Italian Trulli

EScripta: A New Digital Platform for the Study of Historical Texts and Writing

Peter Anthony Stokes (peter.stokes@ephe.psl.eu), École Pratique des Hautes Études – Université PSL and Daniel Stökl Ben Ezra (Daniel.Stoekl@ephe.psl.eu), École Pratique des Hautes Études – Université PSL and Benjamin Kiessling (benjamin.kiessling@psl.eu), Université PSL and Robin Tissot (tissotrobin@gmail.com), Université PSL

Throughout the history of DH, a great deal of effort and many research projects have been dedicated to the study of historical documents, particularly around linguistic annotation and critical and facsimile editions in TEI, but also in automated image analysis, handwritten text recognition (HTR), layout analysis, named entity recognition, and more. Furthermore, software implementing innovations in these fields is increasingly allowing for interconnections through Web APIs, a process that is further enabled by standards such as the International Image Interoperability Framework (IIIF) and Distributed Text Services (Almas et al. [n.d.]). All this strongly suggests that these different components can now usefully be brought together into a single interconnected infrastructure to allow for the online preparation, analysis and publication of texts, editions, images, and annotations. This is the goal of the eScripta platform which is currently under development at Université Paris Sciences et Lettres (PSL). A prototype platform is already available and will be demonstrated at DH2019. Although focussed on researchers at PSL, the code is already Open Source and freely available (Tissot et al. 2019), and, unlike most comparable systems, the trained models for HTR are also freely open and available through the Kraken model repository (Kiessling [n.d.]).

The platform is part of a larger funded project dedicated to the history and practice of writing. The host institution has an exceptional concentration of specialists in very diverse forms of writing, including manuscripts, documents and inscriptions not only in Latin, Greek, Hebrew and Arabic but also Egyptian (hieroglyphic, hieratic and demotic), Cuneiform (Sumerian and Akkadian), proto-Semitic, Aramaic, early Chinese, Old Vietnamese, Indic scripts in all forms, Batak, Javanese, Pyu, Tocharian, and many others. This wide expertise and range of use-cases means that the project presents an important opportunity to develop a deeply cross-cultural and cross-disciplinary infrastructure that allows for multigraphic and multilingual work of exceptional breadth.

A key existing component of this platform is Kraken (Kiessling [n.d.], Kiessling 2019), a module for handwritten text recognition (HTR) that allows automatic analysis and transcription of manuscripts and printed books. It has been developed specifically for multi-lingual documents and non-Latin scripts including non-alphabetic writing-systems and right-to-left, top-to-bottom, bottom-to-top and bidirectional writing. Kraken has already been integrated into the platform and provided with a new and significantly improved interface for text entry and correction, as well as enhanced line-detection for complex layouts (Kiessling et al. 2019).

eScripta Web interface to Kraken, ready to enter Ground Truth transcription.
Figure 1. eScripta Web interface to Kraken, ready to enter Ground Truth transcription.
Uncorrected results of newly-developed line detection on complex layout for right-to-left text (Kiessling 2019b).
Figure 2. Uncorrected results of newly-developed line detection on complex layout for right-to-left text (Kiessling 2019b).

A further key pre-existing component of the platform is Archetype (Brookes et al. 2015; Stokes et al. 2016), for deeply structured annotations to enable the exploration, study and communication of palaeographical and art-historical content; it is already being used in completed projects including DigiPal, Models of Authority and Exon Domesday.

The 'default' Archetype home page before customisation.
Figure 3. The 'default' Archetype home page before customisation.
Archetype applied to the Exon Domesday book showing occurrences of the 'gallows mark' (Exon 2018: Search > Graphs > Gallows Mark)
Figure 4. Archetype applied to the Exon Domesday book showing occurrences of the 'gallows mark' (Exon 2018: Search > Graphs > Gallows Mark)
Archetype used in the Models of Authority project to show images of a given title (‘marescallus’) in the charters (Models of Authority > Search > People > ‘Marescallus’)
Figure 5. Archetype used in the Models of Authority project to show images of a given title (‘marescallus’) in the charters (Models of Authority > Search > People > ‘Marescallus’)
Archetype used in the Exon Domesday project to display the text, transcription and translation of the book (Exon Domesday > Browse Text and Translation).
Figure 6. Archetype used in the Exon Domesday project to display the text, transcription and translation of the book (Exon Domesday > Browse Text and Translation).

Other important components to be added include linking to external services such as those provided by D ivaServices (Würsch, Liwicki and Ingold [n.d.]). These are to be complemented with an ergonomic interface for manual transcription and data-entry for different types of digital and printed editions, as well as translations, commentaries, and linguistic analyses. This component will again comprise a dialogue between manual linguistic annotation such as POS tagging for the preparation of training data with Pandora (Kestemont [n.d.]), and a module for post-correction using Pyrrha (Clérice et al. [n.d.]). The resulting texts will be disseminated through TEI Publisher or similar.

The Pyrrha interface for post-correction of automatic annotation
Figure 7. The Pyrrha interface for post-correction of automatic annotation
The Digital Mishnah (under development), built on TEI Publisher
Figure 8. The Digital Mishnah (under development), built on TEI Publisher

A further element is deep or allographic transcription, combining Archetype and Kraken with the interface already described. This will facilitate the production of facsimile editions and also training the computer for automatic transcription, alignment of existing texts with images, wordspotting, and text-image searches of the sort encouraged by Archetype (for examples see Figure 3 and Stokes et al. 2016). This combination of quantitative and qualitative approaches will also allow, for instance, the clustering of letters in a manuscript based on deep annotation, or analysing the distribution of allographs throughout an inscription or corpus. This becomes particularly powerful when combined with databases of manuscripts that are dated, geolocalised and/or written by the same scribe, such as Medium and Pinakes which are also planned for integration.

As well as problems around internationalisation and the many conventions and needs of such a broad project, another challenge here is how and to what degree these different components can be combined given their different data-models, purposes and assumptions. A Conceptual Reference Model is therefore being developed which will constitute a further important output of this project, based on the Biblissima ontology (an extension of FRBRoo) and CRMtex (Gehrke et al. 2015; Felicetti and Murano 2017; Murano and Felicetti 2017).

Appendix A

Bibliography
  1. Almas, B.M., Cayless, H., Clérice, T., Jolivet, V., Morlock, E., Robie, D., Tauber, J., Witt, J.C., and Liuzzo, P. [n.d.]. Distributed Text Services (DTS), https://distributed-text-services.github.io (accessed 26 November 2018).
  2. Archetype (2017). London: King’s College, https://archetype.ink (accessed 26 November 2018).
  3. Biblissima [n.d.]. Ontologie. Biblissima Documentation, http://doc.biblissima-condorcet.fr/ontologie (accessed 26 November 2018).
  4. Brookes, S., Stokes, P. A., Watson, M. and Matos, D. M. D.(2015). The DigiPal project for European scripts and decorations. In Conti, A., Da Rold, O. & Shaw, P. A. (eds), Writing Europe 500–1450: Texts and Contexts. D.S. Brewer, pp. 25–58.
  5. Clérice T., Pilla J. et al. [n.d.]. Pyrrha: A language-independent post-correction app for POS and lemmatization, https://github.com/hipster-philology/pyrrha (accessed 26 November 2018).
  6. Felicetti, A., Murano, F. (2017). Scripta manent: a CIDOC CRM semiotic reading of ancient texts. International Journal of Digital Libraries 18, 263–270. doi:10.1007/s00799-016-0189-z
  7. Gehrke, S., Frunzeanu, E., Charbonnier, P. and Muffat, M. (2015). Biblissima’s prototype on medieval manuscript illuminations and their context. In A. Zucker, I. Draelants, C.F. Zucker and A. Monnin (eds), SW4SH 2015: Proceedings of the First International Workshop on Semantic Web for Scientific Heritage, 43–48. http://ceur-ws.org/Vol-1364/paper5.pdf (accessed 26 November 2018).
  8. HIMANIS (2018). The HIMANIS Project: Indexing the Trésor des Chartes Registers. Paris: IRHT, http://www.himanis.org (accessed 26 November 2018).
  9. IIIF [n.d.]. IIIF: International Image Interoperability Framework, http://iiif.io/ (accessed 26 November 2018).
  10. Kestemont, M. [n.d.]. Pandora: A Tagger-Lemmatizer for Natural Languages, https://github.com/hipster-philology/pandora (accessed 26 November 2018).
  11. Kiessling, B. [n.d.]. Kraken, http://kraken.re (accessed 26 November 2018).
  12. Kiessling, B. (2019). Kraken: an universal text recognizer for the humanities. Digital Humanities Book of Abstracts. Utrecht. (in preparation).
  13. Kiessling B., Stoekl Ben-Ezra D., and Miller M. T. (2019). BADAM: A public dataset for baseline detection in Arabic-script manuscripts. Submitted to ICDAR 2019
  14. Medium (2016). Medium: Répertoire des manuscrits reproduits ou recensés. Paris: IRHT, http://medium.irht.cnrs.fr (accessed 26 November 2018).
  15. Models of Authority (2017). Models of Authority: Scottish Charters and the Emergence of Government 1100–1250. London: King’s College, https://www.modelsofauthority.ac.uk (accessed 24 November 2018).
  16. Murano, Francesca and Achille Felicetti (2017). Definition of the CRMtex: An Extension of CIDOC-CRM to Model Ancient Textual Entities, version 0.8 (CIDOC-CRM), http://www.cidoc-crm.org/crmtex/sites/default/files/CRMtex_v0.8.pdf (accessed 26 November 2018).
  17. Pinakes (2016). Pinakes: Textes et manuscrits grecs. Paris: IRHT, http://pinakes.irht.cnrs.fr (accessed 26 November 2018).
  18. Stokes, P.A. et al. (2016). The Models of Authority project: Extending the DigiPal Framework for script and decoration. Digital Humanities 2016: Conference Abstracts. Krakow: pp. 896-99, http://dh2016.adho.org/abstracts/387 (accessed 24 November 2018).
  19. Stokes, P.A. (2018). Modelling Multigraphism: The digital representation of multiple scripts and alphabets. Digital Humanities 2018 Book of Abstracts. Mexico City, pp. 292–96.
  20. Stokes, P.A., ed. (2018). Exon: The Domesday Survey of South-West England, Studies in Domesday, gen ed. J. Crick. London: King’s College, http://www.exondomesday.ac.uk (accessed 26 November 2018).
  21. Stutzmann, Dominique (2013). Ontologie des formes et encodage des textes manuscrits médiévaux. Le projet ORIFLAMMS, Document numérique 16: 81–95.
  22. Tissot, Robin et al. (2019). eScriptorium [Source Code]. Paris: INRIA, https://gitlab.inria.fr/scripta/escriptorium (accessed 29 April 2019).
  23. Würsch, M., M. Liwicki and R. Ingold [n.d.], Diva Services. Fribourg: University of Fribourg, https://diuf.unifr.ch/main/hisdoc/divaservices (accessed 26 November 2018).