Comparing Lexicons Diachronically in Italian Literary Corpora

This article compares a selection of words in the Florentine vernacular and in modern Italian, using software written by the author. It focuses on two corpora: the first is a selection of Florentine vernacular literature, and the second is a group of literary works written in modern Italian from the end of the XIX Century to the present. The article demonstrates how some features of the software can be used to compare the two corpora by ranking the lexicographic entries with different strategies. The lexicon can be analysed under different types of sorting using only three parameters: the word frequency, the frequency as a percentage of the number of words in the corpus, and the percentage of texts in the corpus where the word is found. From these parameters a fourth one also arises: the level of persistence of words in each corpus. The software makes it possible to observe differences in the use of the lexicon across historical periods, comparing the Florentine vernacular, which was used in the Italian peninsula until the beginning of the XIX Century, with modern Italian.


Introduction
Over the years, diachronic linguistics has become a promising field for corpus linguistics. Analysing and comparing corpora from a diachronic point of view helps to better understand how language evolves over time. The researcher often moves from an overall word frequency analysis to a closer textual reading (Alessi, Partington, 2020, p. 9), a method that also involves statistical analysis.
The case of the Italian language can be studied with the tools of corpus linguistics. The language of literature, in fact, has a strong connection with the Florentine vernacular, which was used at least from the XIV Century until the beginning of the XIX Century. The structure of modern Italian is similar to that of the Florentine vernacular, although there are some differences. The 1827 edition of I promessi sposi by Alessandro Manzoni made use of a new language closer to modern Italian. Manzoni, already far from the Florentine vernacular, intended to adopt a language suited to cultured readers and to all others capable of reading (Dotti, 2020, p. 373). The lexicon also changed over time, but the majority of words coming from the Florentine vernacular are still in use today in modern Italian. Other words became obsolete and disappeared. A number of corpora built on modern Italian are available today (Rossini Favretti, 2002, p. 28), but corpus-based studies comparing literature written in modern Italian with literature written in the Florentine vernacular are still missing. An attempt to compare Italian-language corpora with texts from different epochs was already made by the author (Pavan, 2020). The program CorpStat, written by the author, is used here to conduct a corpus-based analysis. Software packages used to retrieve keywords, such as AntConc (Anthony, 2014), are not considered in this article.

Method
Some features of CorpStat were already described in a previous article (Pavan, 2020). The software was used to analyse two corpora, both assembled by the author. The first corpus is a selection of works written in the Florentine vernacular from the XIV to the XVIII Century. It consists mainly of works of literature and includes about 2,700,000 words. The second corpus is a selection of literary works from the end of the XIX Century until today and includes about 2,500,000 words. The two corpora are thus similar in size, as is advisable when comparing diachronic corpora (Kaunisto, p. 3). Some major works of the XIX Century (like Manzoni's I promessi sposi) were intentionally not included in the second corpus, on the assumption that modern Italian was not yet well established at that time.
Both corpora were first tokenized by CorpStat. After tokenization, CorpStat shows three parameters: the frequency of each word in the corpus, that frequency as a percentage of the total number of words in the corpus, and the percentage of texts in the corpus where the word is found. The words are then sorted in different ways according to each parameter. For example, if the parameter under consideration is the word frequency in the first corpus, CorpStat sorts the words by that frequency in descending order; only the words found in both corpora are sorted. As another example, if the parameter under consideration is the percentage of texts of the first corpus in which the word is found, CorpStat sorts the words by that percentage, again in descending order. It is also useful to draw a chart with these values (Fig. 1). In this case, it is possible to compare the words visually and check their level of persistence, moving diachronically from one corpus to the other.
A high value on the Y-axis means that the word is less popular (or absent) in the second corpus. Conversely, a low value shows that the word is found in both corpora to a certain degree. It is also possible to swap the first corpus with the second one, which yields the opposite results, if that is more convenient.
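The three parameters and the sorting of shared words described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CorpStat's actual code: the tokenizer (lowercased runs of letters), the function names corpus_parameters and sort_shared, and the toy texts are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a token is a lowercased run of letters (accented ones included).
    # CorpStat's real tokenization rules are not documented in the article.
    return re.findall(r"[^\W\d_]+", text.lower())


def corpus_parameters(texts):
    """Map each word to (frequency, % of all tokens, % of texts containing it)."""
    frequency = Counter()
    text_count = Counter()
    for text in texts:
        tokens = tokenize(text)
        frequency.update(tokens)
        text_count.update(set(tokens))  # count each text at most once per word
    total_tokens = sum(frequency.values())
    return {
        word: (
            frequency[word],
            100.0 * frequency[word] / total_tokens,
            100.0 * text_count[word] / len(texts),
        )
        for word in frequency
    }


def sort_shared(params_1, params_2, index):
    """Words found in both corpora, sorted by parameter `index` of corpus 1 (descending)."""
    shared = params_1.keys() & params_2.keys()
    return sorted(shared, key=lambda word: (-params_1[word][index], word))


# Toy texts invented for illustration; they are not taken from the two corpora.
corpus_1 = ["costui disse questo", "colui e costui", "questo medesimo"]
corpus_2 = ["questo libro", "lui disse questo", "questo e quello"]

params_1 = corpus_parameters(corpus_1)
params_2 = corpus_parameters(corpus_2)

by_frequency = sort_shared(params_1, params_2, 0)   # sort by raw frequency
by_text_share = sort_shared(params_1, params_2, 2)  # sort by % of texts
print(by_frequency)
```

Ties are broken alphabetically here so that the ordering is deterministic; how CorpStat orders words with equal values is not stated in the article.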

Results and discussion
The data collected from CorpStat make it possible to observe how the lexicons of the two corpora differ, and therefore how the written language of the Italian peninsula changed over time. For example, the differences between parameters can be used to study the historical persistence of words belonging to the same grammatical category, such as pronouns. During the era of the vernacular, several linguists wrote treatises on grammar: Giacomo Pergamini, writing about pronouns, listed among them questo, costui, colui, medesimo, esso (Pergamini, 1626, p. 79). As shown earlier, it is possible to draw a chart (Fig. 2).

Fig. 2 - Chart showing the level of persistence for a group of words
According to the chart, the pronoun costui is used less in modern Italian than in the Florentine vernacular. Colui follows in the chart, which means that it too was more popular in the vernacular. At present, esso is mostly avoided in writing and is often replaced by lui. However, in the XIX Century esso was still used to some degree, which would explain its rank in the analysis. Medesimo and especially questo are very popular in both corpora. However, they can also function as adjectives, and for this reason they are more likely to occur in the corpora.
With the same method, many different kinds of analysis could be performed. For example, one could compare a group of adjectives, prepositions, nouns, etc. with each other to draw conclusions about the persistence of words over time. Furthermore, CorpStat sorts the gaps (differences) between the parameters of the two corpora and ranks the words accordingly. This makes it possible to retrieve, for instance, the most popular words in both corpora or the least frequent words in one corpus.
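The ranking of words by the difference between the parameters of the two corpora can be sketched as follows. The article does not give an exact formula for the level of persistence, so this sketch assumes it is the simple difference between one parameter (here the percentage of texts) in corpus 1 and in corpus 2, with a word absent from corpus 2 counted as 0. The function names, the placeholder words, and all numbers are invented for illustration and are not CorpStat output.

```python
def parameter_gaps(values_1, values_2):
    """Gap between one parameter in corpus 1 and in corpus 2, per word.

    Assumption: a word missing from corpus 2 counts as 0 there, so a large
    gap means the word is less popular (or absent) in the second corpus.
    """
    return {word: value - values_2.get(word, 0.0) for word, value in values_1.items()}


def rank_by_gap(values_1, values_2):
    """Words of corpus 1, largest gap first; ties broken alphabetically."""
    gaps = parameter_gaps(values_1, values_2)
    return sorted(gaps, key=lambda word: (-gaps[word], word))


def popular_in_both(values_1, values_2):
    """Shared words ranked by the lower of their two values, highest first."""
    shared = values_1.keys() & values_2.keys()
    return sorted(shared, key=lambda word: (-min(values_1[word], values_2[word]), word))


# Invented percentages of texts containing each placeholder word.
pct_texts_1 = {"alpha": 80.0, "beta": 60.0, "gamma": 90.0, "delta": 40.0}
pct_texts_2 = {"alpha": 20.0, "beta": 50.0, "gamma": 85.0}

gaps = parameter_gaps(pct_texts_1, pct_texts_2)
ranking = rank_by_gap(pct_texts_1, pct_texts_2)
common = popular_in_both(pct_texts_1, pct_texts_2)
print(ranking)
print(common)
```

Swapping the two arguments inverts the perspective and ranks the words of the second corpus against the first.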
Compared with other languages, modern Italian has strong connections with its ancestor, the Florentine vernacular, because of their similar language structure. Older English, by contrast, looks more complicated than modern English because of the presence of unfamiliar words and spelling variants (Weisser, 2016, p. 15). In such a case, CorpStat cannot compare the words in the two corpora, since the software cannot detect how words were modified over time. The Italian language, however, offers the opportunity to analyse the lexicon diachronically quite easily. Moreover, CorpStat is able to analyse more than two corpora, as already demonstrated in a previous article (Pavan, 2020). In this case the words' level of persistence could easily be drawn as lines between pairs of corpora. In addition, words and their modified forms can be compared. This kind of analysis is quite common in corpus linguistics for checking the frequency of old and new words (Jones, Waller, 2015, p. 30).

Conclusion
Corpus linguistics is important in lexicography for, among other things, making an inventory of a language's lexicon (Zufferey, 2020, p. 3). Software packages like CorpStat can build the lexicon while at the same time showing the changes in language over time. The three parameters in the output of CorpStat can help trace how the use of words changed in the language over time. In fact, by analysing the two corpora with the software described in the article, it is possible to compare words diachronically. For the first time, the article introduces a special parameter, the level of persistence of words in a language, which shows how much the use of words changed over the centuries. However, the software described here has some limitations, especially if one wants to compare spelling evolution in a language other than Italian. In the future, new versions of the software could include capabilities for handling spelling, making it possible to analyse different languages. To understand the evolution of languages, the study of corpora needs instruments like the one described here to analyse the modifications that occur in different historical periods.