Italian Trulli

Stylometry for Noisy Medieval Data: Evaluating Paul Meyer's Hagiographic Hypothesis

The history of the composition of French saint’s Lives collections in prose is still a mystery. At the beginning of the XIII th century, legendaries were already constituted and intermediary steps are missing to understand how the Lives have been gathered. One of the existing hypotheses is that small collections of saint’s Lives might have circulated as small independent units about one saint or a series of saints (Philippart, 1977), sometimes traceable to a single author or translator, under the material form of libelli.

The first milestone in the study of the composition of legendaries was laid by Paul Meyer (1906). His studies led him to discover that some of these legendaries were derived from successive compilations. Using their macrostructure, Meyer dispatched them into families. He also had the intuition that the families were constituted of smaller components, groups or rather sequences of texts. He proposed the existence of primitive series based on authorship and the repetitive grouping of selected lives, such as the hagiographic collection of Wauchier de Denain’s Seint Confessor of the family C or the consecutive and recurrent series in the family B and C of the following three lives: Sixte, Laurent and Hippolyte. However, because most of the French saint’s lives are anonymous and because the collections were rearranged by multiple editors over time, it is extremely difficult to find what could have been the primitive series and Meyer couldn’t go further. This serial composition of the Lives of Saints is a datum also noted by other specialists of Latin hagiography such as Perrot (1992) and Philippart (1977), who even points out that these hagiographic series must be studied in their entirety in the same way as a literary work. But it is still very rare today that hagiographic text editing concerns a complete author’s legendary, mostly because of a lacking certainty about these groupings.

From there, with an exploratory stylometric analysis, we first want to find if Meyer’s hypotheses are wrong, can be nuanced or completed. In a second time, we would like to discover if the observed proximity between saint’s lives can reveal series from the same author.

We based our study on a legendary from family C, namely the manuscript fr. 412 of the Bibliothèque nationale de France. The task is complex, because textual variants and the absence of a standardised spelling can affect the stylometric analysis (Kestemont et al., 2015). While the lemmatization of the texts could have nullified the problem of spelling variation, in our case, the preparation of the corpus would have been extremely time-consuming. We decided to work from witnesses written by a single hand in the same manuscript in order to limit biases that could have been induced by spelling variations linked to the scribes and not to the author.

The analysis of the complete legendary is made possible by the use of the OCR software Kraken (Kiessling, 2018), trained on about 8890 transcribed lines of ground truth and tested on 897 lines, which results in up to 95.2% success in Character Recognition Score. Such results allow us to hope that the stylometric analysis is not corrupted by the margin of error (Franzini et al., 2018).

We work from the text thus obtained. Because most of the texts of the legendary are anonymous, we follow an unsupervised approach to the analysis of the texts (Camps & Cafiero, 2012), using agglomerative hierarchical clustering with Ward’s criterion (Ward, 1963), guided by its ability to form coherent clusters.

The texts are, on average, quite short, a known difficulty for stylometry (Eder, 2015), with a median value of 16,863 characters (space excluded), but with extreme values of 1,364 and 85,378. Texts that are too short create a problem of reliability, as the observed frequencies may not accurately represent the actual probability of a given variable’s appearance (Moisl, 2011). To limit this issue, we removed texts below 5,000 characters. Because of the errors regarding segmentation in the OCR, we extracted character n-grams, ignoring spaces. We experimented with different lengths, but, following existing benchmarks (Stamatatos, 2013), we retained the 4000 most frequent 3-grams. The metric and choices of normalisation are also an important parameter, one to which much attention has been devoted (Evert et al., 2017; Jannidis et al., 2015).

Following the benchmark by Evert et al. 2017, we chose to use Manhattan distance with z‑transformation (Burrows’ Delta) and vector-length Euclidean normalisation. The results are partially presented in figures 1 & 2.

 Dendogram of agglomerative hierarchical clustering using Manhattan distance, z-transformation and vector length normalisation over 4000 most frequent 3-grams
Figure 1. Dendogram of agglomerative hierarchical clustering using Manhattan distance, z-transformation and vector length normalisation over 4000 most frequent 3-grams
 Dendogram of agglomerative hierarchical clustering using Kohonen SOM coordinates over 4000 most frequent 3-grams
Figure 2. Dendogram of agglomerative hierarchical clustering using Kohonen SOM coordinates over 4000 most frequent 3-grams

Because, at the same time, the corpus is homogeneous, the texts can be quite short, and the data is noisy, separating them in stable clusters can prove quite hard. We tried to improve the quality of the signal by applying, first, a Kohonen self-organising map (Kohonen, 1988:59–69), and then using the coordinates of the points in the SOM for hierarchical clustering (Camps and Cafiero, 2012).

In addition, the specificity of composition of the legendary C by successive additions (lives of A, then lives of B, and finally addition of new lives) allows us to ensure a quick control of the likelihood of some proposed groupings. The presence of the hagiographic collection of Seint Confessors of Wauchier de Denain where the author identifies itself twice (both in Dialogues de Sulpice Sévère and in Vie de saint Martial de Limoges) also serves as an indicator of validity.

The study has already shown interesting connections between the legendary of Wauchier de Denain and the Vie de Saint Lambert de Liège and some collections have been revealed. Two of them are quite certain, one of the first five texts of C, all about saint apostles and hypothetically from A, and another one of six virgin saints’ lives, all from B including the Lives of saint Agathe, Lucie, Agnès, Felicité, Christine and Cécile.

To conclude, our analysis attempts to evaluate the best parameters for our study and to overcome certain difficulties inherent to our corpus. Two major obstacles have to be overcome: the lack of spelling standardisation and the lack of homogeneity in the separation of words. At the end of this prospective study, we hope to be able to reveal new hagiographic series prior to the composition of legendaries that were transmitted to us and that could have escaped us so far.

Appendix A

Bibliography
  1. Camps, J.-B. and Cafiero, F. (2012). Setting bounds in a homogeneous corpus: a methodological study applied to medieval literature. Revue Des Nouvelles Technologies de l’Information, SHS- 1 (MASHS 2011/2012. Modèles et Apprentissages en Sciences Humaines et Sociales Rédacteurs invités : Mar): 55–84.
  2. Evert, S., Proisl, T., Jannidis, F., Reger, I., Pielström, S., Schöch, C. and Vitt, T. (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities, 32 (suppl_2): ii4 – 16 doi:10.1093/llc/fqx023.
  3. Franzini, G., Kestemont, M., Rotari, G., Jander, M., Ochab, J. K., Franzini, E., Byszuk, J. and Rybicki, J. (2018). Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm. Frontiers in Digital Humanities, 5 doi:10.3389/fdigh.2018.00004. https://www.frontiersin.org/articles/10.3389/fdigh.2018.00004/full (accessed 7 August 2018).
  4. Jannidis, F., Pielström, S., Schöch, C. and Vitt, T. (2015). Improving Burrows’ Delta – An empirical evaluation of text distance measures. Digital Humanities Conference: 11.
  5. Kestemont, M. (2014). Function Words in Authorship Attribution. From Black Magic to Theory?. Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL). Gothenburg, Sweden: Association for Computational Linguistics, pp. 59–66 doi:10.3115/v1/W14-0908. http://aclweb.org/anthology/W14-0908 (accessed 23 November 2018).
  6. Kestemont, M., Moens, S. and Deploige, J. (2015). Collaborative authorship in the twelfth century: a stylometric study of Hildegard of Bingen and Guibert of Gembloux. Digtial Scholarship in the Humanities, 30 (2): 199–224 doi:http://dx.doi.org/10.1093/llc/fqt063.
  7. Kiessling, B. (2018). OCR Engine for All the Languages. Contribute to Mittagessen/Kraken Development by Creating an Account on GitHub. Python https://github.com/mittagessen/kraken (accessed 26 November 2018).
  8. Kohonen, T. (1988). Neurocomputing: Foundations of Research. In Anderson, J. A. and Rosenfeld, E. (eds). Cambridge, MA, USA: MIT Press, pp. 509–521 http://dl.acm.org/citation.cfm?id=65669.104428 (accessed 27 November 2018).
  9. Meyer, P. (1906). Légendes hagiographiques en français. Histoire littéraire de la France, vol. 33. Imprimerie nationale. Paris, pp. 328–458 http://archive.org/details/histoirelittra33riveuoft (accessed 1 May 2018).
  10. Moisl, H. (2011). Finding the Minimum Document Length for Reliable Clustering of Multi-Document Natural Language Corpora. Journal of Quantitative Linguistics, 18 (1): 23–52 doi:10.1080/09296174.2011.533588.
  11. Perrot, J.-P. (1992). Le passionnaire français au Moyen âge. Genève, Suisse: Droz.
  12. Philippart, G. (1977). Les Légendiers latins et autres manuscrits hagiographiques. Turnhout (Belgique), Belgique: Brépols.
  13. Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60 (3): 538–56 doi:10.1002/asi.21001.
  14. Stamatatos, E. (2013) On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21 (2): 421–439.
  15. Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58 (301): 236–44 doi:10.1080/01621459.1963.10500845.
Ariane Pinche (ariane.pinche@chartes.psl.eu), Ecole nationale des chartes, France; Université Lyon 3 and Jean-Baptiste Camps (jean-baptiste.camps@chartes.psl.eu), Ecole nationale des chartes, France and Thibault Clérice (thibault.clerice@chartes.psl.eu), Ecole nationale des chartes, France