This paper presents an empirical evaluation of nine different measures of distinctiveness or ‘keyness’ in the context of Computational Literary Studies. We use nine different sets of literary texts (specifically, novels) written in seven different languages as a basis for this evaluation. The evaluation is performed as a downstream classification task, where segments of the novels need to be classified by subgenre or period of first publication. The classifier receives different numbers of features identified using different measures of distinctiveness. The main contribution of our paper is to show that across a wide variety of parameters, but especially when only a small number of features is used, (more recent) dispersion-based measures very often outperform (more established) frequency-based measures by significant margins. Our findings support an emerging trend to consider dispersion, in addition to frequency, as an important property of words.
Edward Tufte, the pioneer of data visualization, famously wrote: “At the heart of quantitative reasoning is a single question: Compared to what?” (
The above observation points to the fact that comparison is a fundamental operation in many domains working with numerical values. The same is true, however, of many text-based domains of research, whether statistically oriented or not (
The research we report on in this contribution is set in the wider context of our research into measures of distinctiveness for comparison of groups of texts. Previously, we have worked on the issue of qualitative validation of measures of distinctiveness (see
In this paper, we focus mainly on subgenres of the novel as our distinguishing category. This is motivated both by the fact that subgenres are an important classificatory principle in Literary Studies
Specifically for the task at hand, we further hypothesize that dispersion-based measures of distinctiveness should have an advantage over other measures. The reason for this, we assume, is twofold: first, features (single word forms, in our case) identified to be distinctive by a dispersion-based measure have a higher chance of appearing in shorter, randomly selected segments taken from an entire novel than features identified using other kinds of measures, in particular frequency-based measures; second, dispersion-based measures have a tendency to identify content-related words as distinctive, in contrast to (some) frequency-based measures, which tend to identify high-frequency function words as distinctive (as observed in
Our paper is structured as follows: First, we summarize related work (a) describing different measures of distinctiveness and (b) specifically comparing several measures of distinctiveness to each other (
Related work falls into two groups, either defining and/or describing one or several measures of ‘keyness’ or distinctiveness, or specifically comparing several measures of distinctiveness to each other based on their mathematical properties or on their performance.
The measures of distinctiveness implemented in our framework have their origins in the disciplines of Information Retrieval (IR), Computational Linguistics (CL), and Computational Literary Studies (CLS).
An overview of the measures of distinctiveness
TF-IDF | Term weighting | ||
Ratio of relative frequencies (RRF) | Frequency-based | ||
Chi-squared test | Frequency-based | ||
Log-likelihood ratio test (LLR) | Frequency-based | ||
Welch’s t-test (Welch) | Distribution-based | ||
Wilcoxon rank sum test (Wilcoxon) | Dispersion-based | ||
Burrows Zeta (Zeta_orig) | Dispersion-based | ||
logarithmic Zeta (Zeta_log) | Dispersion-based | ||
Eta | Dispersion-based | ||
In
When it comes to the amount and the variety of measures of distinctiveness,
One of the simplest measures is the ratio of relative frequencies (
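As an illustration, the ratio of relative frequencies can be sketched in a few lines of Python. The word lists and the smoothing constant below are our own assumptions for illustration, not part of the implementation used in our framework:

```python
from collections import Counter

def relative_frequencies(tokens):
    """Relative frequency of each word type in a list of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def rrf(target_tokens, comparison_tokens, smoothing=1e-9):
    """Ratio of relative frequencies (RRF) for each word in the target corpus.

    Values above 1 indicate the word is relatively more frequent in the
    target. The smoothing constant is an illustrative choice to avoid
    division by zero for words absent from the comparison corpus.
    """
    rel_t = relative_frequencies(target_tokens)
    rel_c = relative_frequencies(comparison_tokens)
    return {w: rel_t[w] / (rel_c.get(w, 0.0) + smoothing) for w in rel_t}

target = "the detective examined the crime scene".split()
comparison = "the lovers met under the old oak tree".split()
scores = rrf(target, comparison)
```

Words occurring only in the target corpus receive very high scores under this simple ratio, which is one reason more sophisticated measures are often preferred.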
The Chi-squared (
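Both the chi-squared test and the log-likelihood ratio test operate on a contingency table of word occurrences in the two corpora. A minimal sketch using SciPy, with invented counts (the implementation in our framework may differ in detail):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table for a single word:
# rows = target vs. comparison corpus,
# columns = occurrences of the word vs. all other tokens
table = [[120, 49880],   # target corpus (50,000 tokens)
         [60, 99940]]    # comparison corpus (100,000 tokens)

chi2, p_value, dof, expected = chi2_contingency(table)

# The log-likelihood ratio test (G2) uses the same table;
# lambda_="log-likelihood" switches the test statistic
g2, p_g2, _, _ = chi2_contingency(table, lambda_="log-likelihood")
```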
Welch’s t-test, named after its creator Bernard Lewis Welch, is an adaptation of Student’s t-test. Unlike Student’s t-test, it does not assume equal variance in the two populations (
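In SciPy, Welch’s variant is selected by disabling the equal-variance assumption of the standard independent-samples t-test. The per-segment frequencies below are randomly generated for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-segment relative frequencies of one word in two corpora;
# the distribution parameters are illustrative, not taken from our corpora
freq_target = rng.normal(loc=0.004, scale=0.002, size=40).clip(min=0)
freq_comparison = rng.normal(loc=0.002, scale=0.001, size=60).clip(min=0)

# equal_var=False selects Welch's variant of the t-test,
# which does not assume equal variances in the two populations
t_stat, p_value = stats.ttest_ind(freq_target, freq_comparison, equal_var=False)
```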
Unlike previous measures, the Wilcoxon rank sum test, also known as Mann-Whitney U-test, does not make any assumption concerning the statistical distribution of words in a corpus; in particular, it does not require the words to follow a normal distribution, as assumed by other tests such as the t-test. Corpus frequencies are usually not normally distributed, making the Wilcoxon test better suited (
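A minimal sketch of the rank sum test via SciPy’s Mann-Whitney U implementation, with invented per-segment counts:

```python
from scipy import stats

# Hypothetical per-segment counts of one word in target vs. comparison segments
target_counts = [3, 0, 5, 2, 7, 1, 4, 6, 2, 3]
comparison_counts = [0, 1, 0, 2, 0, 1, 0, 0, 1, 0]

# The test compares rank distributions and therefore makes
# no normality assumption about the underlying frequencies
u_stat, p_value = stats.mannwhitneyu(
    target_counts, comparison_counts, alternative="two-sided"
)
```

Because only ranks matter, a handful of extreme counts in a few segments cannot dominate the result, which is part of what makes the test robust for skewed corpus frequencies.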
In
Eta is another dispersion-based measure recently proposed by Du et al. (
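Although the exact formulas of the Zeta variants and Eta differ, the dispersion-based family builds on document proportions, i.e. the share of segments in which a word occurs at all. A minimal sketch of Burrows’ Zeta in this spirit (the toy segments are our own illustration; our framework operates on segments of novels):

```python
def document_proportion(word, segments):
    """Fraction of segments in which the word occurs at least once."""
    return sum(1 for seg in segments if word in seg) / len(segments)

def zeta(word, target_segments, comparison_segments):
    """Burrows' Zeta: difference in document proportions between the
    target and the comparison corpus. Ranges from -1 to 1; values close
    to 1 mark words dispersed widely across target segments only."""
    return (document_proportion(word, target_segments)
            - document_proportion(word, comparison_segments))

# Toy segments represented as sets of word types
target = [{"clue", "murder", "inspector"}, {"clue", "alibi"}, {"murder", "clue"}]
comparison = [{"love", "heart"}, {"clue"}, {"heart", "kiss"}]
score = zeta("clue", target, comparison)
```

Because only presence per segment counts, a word must be spread evenly across the target corpus to score highly, regardless of how often it occurs within any single segment.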
The evaluation of measures of distinctiveness is a non-trivial task for the simple reason that it is not feasible to ask human annotators to provide a gold-standard annotation. Unlike a given characteristic of tokens or phrases in many annotation tasks, a given word type is distinctive for a given corpus neither in itself, nor by virtue of a limited amount of context around it. Rather, it becomes distinctive for a given corpus based on a consideration of the entire target corpus when contrasted to an entire comparison corpus. Furthermore, whether or not a word can be considered to be distinctive depends on the category that serves to distinguish the target from the comparison corpus. Commonly used categories include genre or subgenre, authorship or author gender as well as period or geographical origin. For any meaningfully large target and comparison corpus, this is a task that is cognitively unfeasible for humans.
As a consequence, alternative methods of comparison and evaluation are required. In many cases, such an evaluation is in fact replaced by an explorative approach, based on the subjective interpretation of the word-lists resulting from two or more distinctiveness analyses, and performed by an expert who can relate the words in the word-lists to their knowledge about the two corpora that have been compared. More strictly evaluative methods (as described in more detail below) can either rely entirely on a comparison of the mathematical properties of measures (as in
We provide some more comments on previous work in this area. Kilgarriff (
Schöch et al. (
Egbert and Biber (
Du et al. (
Concerning evaluation across languages, evaluations of measures of distinctiveness that use corpora in more than one language are, to the best of our knowledge, virtually non-existent. The only example we are aware of is Schöch et al. (
For our analysis, we used nine text collections. The first two corpora consist of contemporary popular novels in French published between 1980 and 1999 (160 novels published in the 1980s and 160 novels published in the 1990s). To enable the comparison and classification of texts, we designed these custom-built corpora to contain the same number of novels for each of four subgroups: highbrow novels on the one hand, and lowbrow novels of three subgenres (sentimental novels, crime fiction and science fiction) on the other. For obvious reasons, the texts in these corpora are still protected by copyright. As a consequence, we cannot make them freely available as full texts. We have published them, however, in the form of a so-called “derived text format” (see
Another group of text corpora that we used for our analysis consists of seven collections of novels in seven different European languages taken from the
Overview of the corpora used in our experiments.
corpus name | size (million words) | document length: standard deviation | document length: mean | number of types | number of authors
fra_80s | 8.83 | 27,161 | 55,225 | 119,775 | 120 |
fra_90s | 8.48 | 26,976 | 53,010 | 111,501 | 124 |
ELTec_cze | 1.98 | 24,734 | 49,642 | 163,900 | 33 |
ELTec_deu | 4.62 | 101,915 | 115,531 | 158,726 | 30 |
ELTec_eng | 4.66 | 75,672 | 116,477 | 53,285 | 35 |
ELTec_fra | 3.31 | 86,926 | 82,802 | 65,799 | 37 |
ELTec_hun | 2.44 | 40,513 | 61,055 | 258,026 | 36 |
ELTec_por | 2.33 | 38,787 | 58,325 | 95,572 | 34 |
ELTec_rom | 2.41 | 36,493 | 60,395 | 156,103 | 37 |
To obtain a better understanding of the performance of different measures of distinctiveness, we evaluate how well the words selected by these measures are helpful for distinguishing texts into predefined groups. As mentioned above, we focus on subgenre (and, to a lesser degree, on time period) as the distinguishing category of these text groups here because these are both highly relevant categories in Literary Studies. This means that among the approaches for comparative evaluation outlined above, we have adopted the downstream classification task for the present study. The main reasons for this choice are that the rationale and the interpretation of this evaluation test is straightforward and that it can be implemented in a transparent and reproducible manner. In addition, we assume that it will give us an idea of how suitable the different measures are for identifying the words that are in fact distinctive of these groups.
In order to identify distinctive words, we first define a target corpus and a comparison corpus and run the analysis using nine different measures, including two variants of the Zeta measure. Concerning the first two corpora, which consist of contemporary French novels, we are interested in distinctive words for each of the four subgenres. Concerning the second, multilingual set of corpora, we make a separate comparison for each language based on two periods: earlier vs. later texts.
For the distinctiveness analysis of the contemporary French novels, we took novels from each subgenre as the target corpus and the novels from the remaining three subgenres as the comparison corpus. This means that we ran the distinctiveness analysis four times and obtained four lists of distinctive words for each subgenre and another four lists of distinctive words for each comparison corpus (words that are not ‘preferred’ by the target corpus). For the classification of these novels, which is a four-class classification scenario, we took the
For the multilingual set of corpora, the situation is simpler because there are only two classes. By running the distinctiveness analysis only once, with one class (novels from 1840 to 1860) as the target corpus and the other class (novels from 1900 to 1920) as the comparison corpus, we obtain two lists of distinctive words, one for each class. Here, we also took the
To observe the impact of
In order to create a baseline for the classification tasks, we randomly sample
This section describes the classification of French novel segments into four predefined classes: highbrow, sentimental, crime and scifi. Before running the tests on the corpora in different languages, we want to check the variance of the results within one language. Only by excluding language as a confounding variable can we conclude that differences in the performance of the measures across the ELTeC corpora are caused by differences between languages. This is why we built two corpora of French novels for our analysis: novels from the 1980s and novels from the 1990s.
First, we applied bag-of-words-based classification to both parts of the French novel corpus, testing four classifiers: Linear Support Vector Classification, multinomial Naive Bayes (NB), Logistic Regression and Decision Tree Classifier.
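The setup can be sketched as follows with scikit-learn. The toy documents below merely stand in for the novel segments, and the default hyperparameters need not match our actual configuration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for novel segments from two subgenres
docs = (["murder detective clue inspector alibi"] * 20
        + ["love heart kiss tears longing"] * 20)
labels = np.array([0] * 20 + [1] * 20)

# Bag-of-words features
X = CountVectorizer().fit_transform(docs)

classifiers = {
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(),
}

# Mean F1-macro over 10-fold cross-validation for each classifier
scores = {name: cross_val_score(clf, X, labels, cv=10,
                                scoring="f1_macro").mean()
          for name, clf in classifiers.items()}
```

In the real experiments, the vectorizer's vocabulary is restricted to the n most distinctive words selected by the respective measure, rather than the full vocabulary used here.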
Classification performance on the French corpus (1980s) with four classifiers, depending on the distinctiveness measure and the setting of
F1-macro score distribution from the 10-fold cross-validation obtained by the genre classification of the French 1980s-corpus with Multinomial NB. The green line is the baseline F1-score.
The classification based on the
Another observation based on
T-test performed on every pair of the F1-score distributions of the measures. F1-scores were obtained from the classification of the 1980s-corpus. The black line is the significance threshold.
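For one such pair, the comparison can be sketched as below. We assume a paired test here, since the same cross-validation folds underlie both score distributions; the F1-scores themselves are invented for illustration:

```python
from scipy import stats

# Hypothetical F1-macro scores from 10-fold cross-validation
# for two measures on the same folds
f1_zeta_log = [0.76, 0.74, 0.77, 0.75, 0.73, 0.78, 0.74, 0.76, 0.75, 0.77]
f1_rrf = [0.55, 0.58, 0.52, 0.57, 0.54, 0.56, 0.53, 0.55, 0.58, 0.54]

# Paired t-test on the per-fold scores
t_stat, p_value = stats.ttest_rel(f1_zeta_log, f1_rrf)
```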
In
(a) F1-score distributions for classification with
The more interesting observation, however, is that we have clear differences in F1-scores of the measures when a small number of features is used (e.g.
First of all, we can observe in
We can also observe, in
This observation applies to classifications with greater
Significance test on the F1-score distributions for each measure. F1-scores were obtained from the classification of the 1980s-corpus. The black line is the significance threshold.
Summarizing the information from the classification of both corpora, we can argue that Zeta_log, Zeta_orig, Eta and TF-IDF have the highest and the most robust performance when using the smallest number of features (
It is important to note that this group of the most successful measures has something in common: they are all dispersion-based (TF-IDF with some restrictions).
The above-mentioned conclusion regarding the superior performance of dispersion-based measures when compared to frequency-based measures is based on the specific use-case of our 20th-century French novel corpus. In order to verify whether this claim is also true when corpora in other languages are used, we performed the same tests on several subsets derived from ELTeC (as described above,
The classification task that we use differs from the previous one. We are not interested in classifying the texts by subgenre, but by their period of first publication (1840-1860 vs. 1900-1920). The main reason for this is practical: the corpora included in ELTeC do not have consistent metadata regarding the subgenre of the novels included, due to the large variability of definitions and practices in the various literary traditions that are covered by ELTeC. However, all collections cover a very similar temporal scope so that it is possible to use this as a shared criterion to define two groups for comparison.
Mean F1-score of classification across 7 ELTeC corpora (
We consider the performance across corpora and measures for
With regard to the frequency-based measures, we can observe that
For further analysis, we visualized
If we consider the stability of the measures across evaluations with different numbers of features, we can conclude that the results for several measures (RRF, Welch, Wilcoxon, Eta, Zeta_orig and Zeta_log) are stable: for almost all data sets, the number of significantly different results is less than 25%. This indicates that the setting of
Summarizing the results described above, we can conclude that dispersion-based and distribution-based measures have again been shown to yield higher performance in identifying distinctive words and to be more stable and robust than other measures. In contrast, the average performance of frequency-based measures remains considerably lower than that of the other measures.
To conclude, we have been able to show that a Naive Bayes classifier performs significantly better in two different classification tasks when it uses a small number of features selected using a dispersion- or distribution-based measure, compared to when it uses a small number of features selected using a measure based on frequency. This result was quite robust across all nine different corpora in seven different languages. In addition, we were able to observe it both for the four-class subgenre classification tasks and the two-class time period classification task. In this sense, our findings support an emerging trend (see e.g.
However, this result also comes with a number of provisos: We have observed this result only for small values of
The fact that these results can only be observed for small values of
Despite these results, there are of course a number of issues that we consider unsolved so far and that we would like to address in future work. The first issue was already mentioned above and concerns the length of the segments used in the classification task. As a next step, we would like to add
The second issue concerns the number and range of measures of distinctiveness implemented in our Python package so far. With nine different measures, we already provide a substantial number of measures. However, we plan to add several more measures to this list, notably Kullback-Leibler Divergence (a distribution-based measure, see:
Thirdly, it should be considered that almost all previous studies in the area of distinctiveness, our own included, do not allow any conclusions as to whether the words defined by a given measure as statistically distinctive are also perceived by humans as distinctive. Such an empirical evaluation is out of scope for our paper, but would certainly add a different kind of legitimacy to a measure of distinctiveness. In addition, words that prove to be statistically distinctive in a classification task are, strictly speaking, only shown to have a certain discriminatory power in the setting defined by the two groups of texts. Distinctiveness, however, can be understood in more ways than just discriminatory power; notably, distinctiveness can also be understood in terms of salience or aboutness.
Finally, we would of course like to expand our research regarding the elephant in the room, so to speak: not just evaluating statistically which measures perform more or less well in particular settings, but also explaining why they behave in this way. We believe that the distinction between measures based on frequency, distribution and dispersion is a good starting point for such an investigation, but pushing this further also requires including measures that capture only dispersion rather than a mix of dispersion and frequency, as recently demonstrated by Gries (2021). Measures of distinctiveness have clearly not yielded all their secrets to us yet.
Data can be found here:
Software can be found here:
The research reported here was conducted in the framework of the project ‘Zeta and Company’ funded by the German Research Foundation (DFG, project number 424211690). This project is part of the Priority Programme SPP 2207 “Computational Literary Studies”. We thank our reviewers for their valuable feedback on an earlier version of this paper.
See:
For a concise introduction to genre theory, see Hempfer (
Statistical hypothesis tests are based on the computation of a p-value that expresses the probability that the observed distributions of words in a target and a comparison corpus could have arisen under the assumption that both corpora are random samples from the same underlying corpus (
On dispersion, see Lyne (
See
Texts and metadata for these collections are available on Github:
LinearSVC, MultinomialNB, LogisticRegression and DecisionTreeClassifier from the Python package
According to
Classification of the 1980s-collection leads to lower variation in the F1-scores compared to the classification of the 1990s-collection.
When
This observation on the 1980s-dataset can also be seen in the results from tests on the 1990s-dataset.
RRF median = 0.22,
We observe a slightly different tendency for the classification of the 1990s-dataset: Both Zetas, Eta, TF-IDF, Welch and Wilcoxon do not have significant differences in F1-scores for
The results of the classification of the 1990s-dataset show the same tendency.
Zeta_log has the highest mean F1-score (1980s: 0.75, 1990s: 0.72), followed closely by Eta (1980s: 0.75, 1990s: 0.72), and then by Zeta_orig (1980s: 0.75, 1990s: 0.70), TF-IDF (1980s: 0.72, 1990s: 0.71).
Dispersion describes the even/uneven spread of words across a corpus or across each particular text in a corpus. We cannot claim, however, that the measures we have used rely exclusively on dispersion; rather, they are also influenced by frequency; see Gries (2021).
For information about the types of measures, see
The data is available in our GitHub repository:
For a theoretical take on both issues raised here, see Schröter et al. (