Italian Trulli

Open Islamicate Texts Initiative: a Machine-Readable Corpus of Texts Produced the Premodern Islamicate World

Maxim Romanov (, University of Vienna, Austria and Masoumeh Seydi (, Leipzig University, Germany and Sarah Bowen Savant (, Aga Khan University—London, UK and Matthew Thomas Miller (, University of Maryland—College Park, USA

The written heritage of the “Islamicate” 1 cultures that stretch from modern Bengal to Spain is as vast as it is understudied and underrepresented in the digital humanities. The sheer volume and diversity of the surviving works produced in Arabic and Persian in the premodern period makes this body of texts ideal for computational analysis. While a great number of texts has been digitized over past two decades, OpenITI is the first corpus of Islamicate texts that is open, machine readable, and aims at being comprehensive. OpenITI strives to provide the essential textual infrastructure in Persian and Arabic for new forms of macro textual analysis and digital scholarship. The corpus is already actively used in several ERC projects.

Currently, OpenITI includes about 4,300 unique book titles (predominantly in Arabic) by over 1,800 authors, which amounts to approximately 750 million words (1,46 billion words with all versions). Most texts have been collected from open-access online collections of high-quality digital reproductions of premodern texts. 2

Figure 1. Chronological distribution of authors and books in OpenITI.

OpenITI is organized in compliance with Canonical Text Services (CTS) guidelines as implemented in the CapiTainS Suite (Clérice et al., 2017), with two exceptions made for practical purposes. First, we implemented human-readable URNs ( uniform resource name) as this allows for easier subsetting of the corpus and makes it easier to engage non-DH specialists in Islamicate studies in collaboration. Second, we are postponing conversion to TEI XML, since it poses a number of challenges for right-to-left languages, connected scripts, and extensive texts (many texts are over 1 million words).

Figure 2. CTS URN Structure: al-Ḏahabī’s Taʾrīḫ al-islām.

Figure 3. CapiTainS -Compliant Folder Structure: Versions of al-Ġazālī’s Iḥyāʾ ʿulūm al-dīn incorporated into the repository 0525AH within OpenITI.

CTS offers a powerful mechanism for building expandable and interoperable corpora. The power of CTS lies in the URN, which provides the permanent canonical references to texts and are used by CTS to identify or retrieve passages of text ( Figure 2). The tree-like structure of CTS URNs together with modular folder organization ( Figure 3) allow one to easily expand the corpus in a decentralized manner as well as to accommodate as many texts and their versions in any number of languages as might be necessary.

Texts have been automatically converted into our custom format— OpenITI mARkdown ( AR stands for Arabic). Facilitating conversion of raw texts into machine-actionable formats, this flavor of markdown: 1) simplifies work with multivolume texts that make up the core of the corpus; and 2) helps to avoid problems that one faces when paired symbols (e.g., angle brackets), LTR and RTL languages, and connected scripts occur in the same document, when even a simple editing task becomes overly complicated. Additionally, mARkdown will facilitate conversion into TEI XML, the de facto standard for publishing digital editions. The detailed description of OpenITI mARkdown can be found at

Figure 3. OpenITI mARkdown: A sample text ( biography) tagged morphologically and semantically (using EditPad Pro,

Available on GitHub (, OpenITI includes four different types of repositories:

  1. Raw texts repositories (naming pattern: `RAWrabica\d+`), where texts collected from online libraries are stored in their original formats. The total count exceeds 50,000 files. Texts here must be manually reviewed and moved into main working repositories (over 14,000 have been processed).
  2. Main working repositories (pattern: `\d{4}AH`), where texts in mARkdown are grouped into 25 year periods, based on authors’ death dates. This makes repositories more manageable, allowing scholars to focus on relevant texts. Manual editing, reviewing, and tagging of texts happens here.
  3. Instantiations (pattern: `i\.\w+`) include all texts of the corpus adapted for use with specific software. For example, in ` i.stylo`, texts are renamed and reformatted as required by the ` stylo` package for R (see, Eder et al., 2016).
  4. Releases ( planned) will have citable time-stamped versions of the corpus.

In order to achieve comprehensiveness, we currently are working on the improvement of Arabic-script OCR (see, Kiessling et al., 2017) and the development of a digital text production pipeline (collaboratively with SHARIAsource, Harvard Law School). Called CorpusBuilder, this user-friendly, web-based, open-source application will significantly lower the technological barriers to entry and help us involve colleagues, librarians and citizen scientists from around the world in our collaborative project of corpus creation.

Appendix A

  1. Clérice, T., Munson, M. and Almas, B. (2017). Capitains/Capitains.Github.Io: 2.0.0. Zenodo doi:10.5281/zenodo.570516. (accessed 29 April 2019).
  2. Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R: A Package for Computational Text Analysis. The R Journal, 8(1): 107–121.
  3. Hodgson, M. G. S. (1974). The Venture of Islam: The Classical Age of Islam, 3 vols. Chicago: University of Chicago Press.
  4. Kiessling, B., Miller, M. T., Savant, S. B. and Romanov, M. G. (2017). Important New Developments in Arabographic Optical Character Recognition (OCR). Al-ʿUṣūr Al-Wusṭá (The Journal of Middle East Medievalists), 25: 1–13.

Introduced by the University of Chicago historian Marshall Hodgson, the term “Islamicate” refers to anything Islamic and non-Islamic, religious and non-religious that was produced in the vast geographical region we refer to as the “Islamic world” (see, Hodgson, 1974).