1. Introduction
The discussion about the capabilities and limitations of language models that began in 2021 was initially characterized by very general claims. On the one hand, some proclaimed that this was a major step towards Artificial General Intelligence (AGI). On the other hand, members of the linguistically oriented Natural Language Processing (NLP) and AI community criticized the language models as “stochastic parrots” (Bender et al. 2021), i.e., they produce language that looks like the language produced by humans but has severe deficits. These deficits are not explicitly tied to any particular task. While humans share a common ground and “model each other’s mental states as they communicate” (ibid. p. 616), “[t]ext generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind.” (ibid.). The authors therefore claim that textual output produced by machines has no meaning (ibid.). However, neither of these extreme positions really contributed to a better understanding of the actual capabilities of LLMs. Thus, they were soon replaced by more focused studies that attempted to clarify experimentally the capabilities of models in a particular domain, from a particular perspective, or for specific tasks, for example, LLMs’ abilities in logical reasoning (Mirzadeh et al. 2024) or their cognitive abilities to understand other people in terms of theory of mind (Trott 2022; Trott and Jones 2023; Trott et al. 2023). Similarly, the goal of our study is to investigate the ability of LLMs to understand literary texts, especially poetry, i.e., to perform specific tasks that humans can only perform if they have an adequate mental representation of a literary text and possess the knowledge and skills necessary to perform those tasks. AI and the new LLMs have been addressed by researchers interested in literary texts from very different angles: Kirschenbaum (2023) and Gengnagel et al.
(2024) examined the theoretical dimension of the concept of meaning and language that is realized or proved by LLMs. Walsh et al. (2024) explored their ability to generate literature. Most research in computational literary studies (CLS) examines task-specific performance without resorting to hermeneutically complex concepts such as interpretation and understanding (see, for instance, Hicke et al. 2025).
In many public discussions, LLMs have been seen as a challenge to established teaching practices, or as part of the neoliberal world order hostile to the spirit of critique and reflection in the humanities. Concerns have also been raised about their environmental impact, particularly their high energy consumption and resource-intensive training processes. In this study, we do not contribute to any of these debates.
On the one hand, we are interested in understanding as a general ability. On the other hand, we will approach this ability through a series of concrete tasks. Current advances in NLP and computational humanities tend to validate individual tasks performed by LLMs through benchmarking (Bamman et al. 2024; Yu et al. 2024).1 This implies that the much more abstract and general tasks LLMs are often used for nowadays can be validated in the same way as before. Irrespective of this general question, it will become evident in the following that understanding is such a complex and multi-layered concept that designing concrete benchmarking tasks in this area would certainly be premature. Our goal is to lay the foundation for future work through a dual strategy of probing experiments and reflections on the complexity of attributing the ability to understand literature to a machine. Instead of directly recording correct output and developing metrics for quantifying the machine’s correct performance, the current state of research requires us to reflect on the relationship between prompt and output as an at least seemingly cooperative practice between humans and machines. For this reason, it is necessary for us to know the object of interpretation as well as possible.
For pragmatic reasons, we focus on two German-language poems, Hälfte des Lebens by Friedrich Hölderlin (1804) and Unsere Toten by Hans Pfeifer (1922).2 Hölderlin’s poem is famous, so we can assume that the models have seen it, maybe even repeatedly, during training. Additionally, there are many interpretations of it, some of which may also have been in the training corpus. We used the influential interpretations by Strauss (1965) and Schmidt (1982) as references for most of our text descriptions and interpretations. Pfeifer’s poem, like the author, is completely unknown. To our knowledge, it has only been published once, in an anthology from 1922 that has not been digitized to date (Uhlmann-Bixterheide 1922, 290).3 We expect that this selection will allow at least preliminary conclusions to be drawn about one relevant confounder, namely, the dependency between an LLM’s adequate text understanding and whether the object of interpretation was seen during pre-training.
Since we break down the general notion of understanding into concrete tasks, we had to select a number of such tasks from the abundance of possible aspects of literary analysis. Our selection includes the following nine aspects that we consider particularly relevant:
meter
rhyme
assonance
lexis
phrases
syntax
figurative language
title
text meaning
As this selection reflects an approach that primarily focuses on the text itself, context integration will have to be addressed within each of these nine aspects. For each aspect, we developed a series of prompts to check how extensive, adequate, and knowledgeable the understanding of the literary text is. All prompts are included in Python scripts and sent to the LLMs through an API; they are thus accessible and replicable through the Jupyter Notebooks in our GitHub repository.4
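To make this setup concrete, the following sketch shows how such per-aspect prompts might be assembled into provider-agnostic API request payloads. All names here (`ASPECTS`, `build_request`, the model identifier, the prompt wording) are hypothetical stand-ins for illustration, not the exact contents of the published notebooks.

```python
# Illustrative sketch: one prompt per aspect of analysis, assembled into a
# chat-style request payload before being dispatched to an LLM API.

ASPECTS = [
    "meter", "rhyme", "assonance", "lexis", "phrases",
    "syntax", "figurative language", "title", "text meaning",
]

PROMPT_TEMPLATE = (
    "Analyze the following poem with respect to its {aspect}. "
    "Explain your reasoning step by step.\n\n{poem}"
)

def build_request(aspect: str, poem: str, model: str = "example-model") -> dict:
    """Assemble a provider-agnostic chat request payload for one aspect."""
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": PROMPT_TEMPLATE.format(aspect=aspect, poem=poem)}],
        "temperature": 0.0,  # as deterministic as possible for single runs
    }

# one payload per aspect, here for the (truncated) opening of the poem
requests_payloads = [build_request(a, "Mit gelben Birnen hänget ...") for a in ASPECTS]
```

In practice, each payload would be sent once to each of the three models; keeping the payload construction separate from the network call is what makes the prompts replicable from the notebooks.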
When people engage with literature, one can observe very different degrees of competence, ranging from an initial understanding of what is said on a literal level to the application of decades of expertise in interpreting complex aesthetic works of art. In our view, it is therefore important to be able to distinguish and scale different levels of complexity of understanding. These levels will be differentiated in the following sections, starting with the ability to generalize (section 3), to more complex and expertise-like reasoning (section 4), and the ability to perform more abstract steps of inductive and abductive reasoning (section 5). At the first level, we tested how well the models work at the level of general knowledge, i.e., roughly the knowledge that students have when they leave school with a high school diploma. At the second level, we want to know how well the models can solve problems like experts in literary studies. On the third level, we explored to what extent the models are able to abstract counterfactual rules from examples and apply them to the poems. The rules are counterfactual in that we invented them and therefore they have probably never been applied to a literary text before.5 Each of these nine aspects mentioned above will be investigated using this distinction between three levels of complexity of understanding.
This broad and inclusive use poses some challenges for our study design. Overall, we take an exploratory approach in our work and will not present quantitative results or benchmarks. Although we constantly set tasks for the models, we are not primarily interested in whether they solve them all flawlessly, but in how they approach the tasks. We do not want to test the models; rather, we are interested in what our experiments reveal about the appropriateness of our efforts to interpret the models’ output as the demonstration of an understanding ability. Even though this type of analysis may at first glance look like an attempt to evaluate LLMs, it is actually an attempt to prepare such – also quantitative – evaluation efforts on a theoretical but at the same time concrete and task-oriented basis. The reader will therefore see no tables showing quantitative results. As the nine aspects times three levels of complexity for two poems yield 54 mostly multi-step tasks, we did not perform any further model parameter diversification and testing beyond the different prompting strategies according to the three levels of complexity. We further chose an exclusively discursive way of aggregating and discussing the results instead of listing them all side by side in a pseudo-quantifying way. Our central argument is that, in order to clarify the interpretative capabilities of LLMs – and before benchmarking studies can be meaningfully conducted – the complexity of interpretation theory requires an interplay of qualitative probing experiments and theoretical reflection. Robust and quantifiable statements about the performance of LLMs in comprehension tasks, which are only tentatively indicated here, will be the subject of future research.
In a preliminary study, we found that smaller models (<=70B) made too many errors. Under the assumption that models develop qualitatively different abilities as the number of parameters increases, we concentrated on the large models: Claude Sonnet 3.5 (Anthropic), Gemini 1.5 (Google), and GPT4o (OpenAI). We usually performed single-run evaluations, meaning that our prompts were run only once across all three models, without systematic repetition.
Our paper has the following structure. In section 2, we introduce a theoretical framework that helps to overcome the extreme positions on understanding outlined above. This is meant to build a philosophically informed basis for the kind of analysis we offer in this paper. In sections 3 to 5, we report on the experiments on the three levels of general knowledge (section 3), expert knowledge (section 4), and abstraction and transfer (section 5). In our conclusion, we summarize our findings and discuss some follow-up research questions.
2. Understanding
The basic distinction we recommend for an appropriate framework is that between internalist and externalist approaches to the concept of understanding. The internalist perspective is interested in the conditions that have to be fulfilled in the (human) mind and consciousness for understanding to take place. Although there are different approaches, Wilhelm Dilthey’s can be seen as a classic internalist position (Dilthey 1974), which assumes the psychological reproduction of a psychological state of the interpreted utterer or author, and also requires the ability to charge the utterance to be understood with relevance to the interpreter’s personal life (Makkreel 2002).6 Extreme internalist positions, usually subject to accusations of psychologism, would claim that the criterion for understanding a poem (or anything else) is a completely subjective sense of evidence in the first-person perspective. Positions like Bender et al. (2021) are internalist in that they make the notion of the respective ability or property (communication, meaning, and understanding) dependent on some internal requirement, here a grounding human consciousness.
In contrast, according to externalist approaches, often associated with the late Wittgenstein of the Philosophical Investigations (Wittgenstein [1953] 2003), understanding occurs in the form of practices (Künne 2003; Strube 2003).7 From an externalist stance, whether someone has understood a poem or utterance does not depend on a certain subjective quality of experience, but on whether they can show through their behavior that they have understood that poem or utterance. Understanding is then seen as a practice of acquiring understanding and as a kind of rule-following.8 The most prominent approach to an externalist strategy of verifying some agent’s intellectual abilities is Alan Turing’s essay “Computing Machinery and Intelligence” (Turing [1950] 2021).
Although there are internal aspects of understanding that cannot be proven irrelevant by simply reducing understanding to external aspects,9 we will take an externalist stance in the broadest sense. The most obvious advantage of an externalist approach is that it avoids some of the internalist implications. As can be seen in Bender et al. (2021), internalism easily leads to a priori arguments about whether understanding per se requires a truly human agent.10 While relevant in certain areas of philosophical reasoning, such a priori discussion would be a dead end for a deeper understanding of the capabilities of language models. Externalism, in contrast, can help to find a viable balance between discussing LLMs’ understanding-related abilities and the complex implications of the concept of understanding. In order to have a kind of compass for the following analyses, we rely on a variant of externalism that describes the main difficulties in attributing understanding to machines. This is Dennett’s theory of intentional systems and, perhaps even more importantly, the subsequent discussions in the philosophy of mind. In Dennett’s approach, which was a dominant branch in the philosophy of mind in the 1980s and 1990s (Bieri 1987), the following aspects are central. Dennett distinguishes three stances we can take to explain events, processes, or actions: the physical stance, the design stance, and the intentional stance (Dennett 1971, 1987). It is the relation between the design stance and the intentional stance that becomes intricate when interpreting machines.
When we take the intentional stance, we treat our counterparts in a merely instrumentalist manner as intentional systems – regardless of their internal properties – and interpret their behavior as behavior aimed at achieving the agents’ goals (desires) based on their knowledge (beliefs) as rational action.11 When taking the design stance, in contrast, we describe and explain the behavior shown by our counterparts based on our knowledge of their internal functioning. The intentional stance requires an instrumentalist and externalist approach to interpreting the behavior of machines.
Dennett provides us with an important maxim that easily leads to confusion if disregarded. The maxim says that we must not confuse intentional explanation and the attribution of intentional states with assumptions at the level of the design stance. Bender et al. (2021) conflate the intentional explanation strategy with assumptions about system design, inferring, based on the reasonable premise that LLM design does not involve consciousness, the mistaken conclusion that attributing intentional states is fundamentally inappropriate. Vice versa – and more importantly – we must not make assumptions on emerging mental phenomena within the black box of LLMs based on justified intentional explanations of the machine’s behavior. In other words, Dennett frees us from the expectation of being able to make any statements about the actual internal knowledge or states of LLMs in general.
Here and in the following we will use the term ‘knowledge’ in the broad sense of the word, which includes not only declarative knowledge but also practical and procedural knowledge (Ryle 1945). Thus, when a model produces an answer A to a query, our use of ‘understanding’ refers to the complex attribution that we as humans would make, if we received A in response to a similar query from a human. The simplest form of understanding (section 3 on general knowledge) is then the ability to understand a text in the sense of forming true beliefs about an object and thus to say what is the case in a poem in terms of both form and content. More complex forms of understanding include inference and context integration (section 4) or even more complex transfer and abstraction processes (section 5), which require human observers of machine output to interpret that output as rational behavior.
A key benefit of taking Dennett’s instrumentalist externalism as a starting point is that it allows us to view the particular complexities and challenges involved in assessing machine output as acts of understanding in a straightforward yet theoretically sufficiently differentiated manner. Describing machine behavior as an act of understanding is a rationalizing rather than naturalizing interpretation made by an observer. When we rationalize a behavior, we do not empirically detect rationality. As interpreters we rather presuppose rationality a priori and ‘then’ interpret the behavior as intentional. This rationalizing interpretation eventually succeeds or fails. This a priori presumption is also known in hermeneutics as the principle of charity.12 It is, in particular, crucial to all interpretative situations that we will call ‘open games’ in the following. Open games are interpretive situations where we do not simply measure correct output but where we take an interpretive effort (here made by an LLM) as a result that requires interpretation on the part of the human observer. It is open for both sides: for the machine, which is supposed to understand the text, and for the human observer, who is supposed to recognize the machine’s behavior as an achievement of understanding. In a way that comes close to Turing’s original idea of an imitation game (Turing [1950] 2021, §§ 1–2), probing experiments on understanding abilities of LLMs will in the first instance have to take into account the ascription-based character of assessing a machine’s rational abilities. Although this ascription-based character and the principle of charity behind ascribing rational behavior is overlooked in some current publications on comprehension abilities of LLMs (Yu et al. 2024), it is crucial for appropriately accounting for the enormous differences in people’s willingness to accept LLMs as communication partners.
These differences are not only due to differences in the degree of benevolence or charity with which people interact with LLMs, but also have their roots in the under-determination of abstract interpretive statements. By ‘under-determined’ we mean that by saying that p is the case, we always ignore aspects of the object or situation that is correctly being described by saying that p. Each finite description is – in most empirical cases – less determined than the object or situation itself. This can easily be proven by providing further descriptions of the same object or situation that have not been included in the previous descriptions.13 This problem, which we refer to as the problem of observer dependence in the attribution of understanding, must be specifically reflected upon and discussed in the following chapters for the individual research experiments.
Just like probing experiments from the fields of NLP (Chang et al. 2024), psychology (Trott and Jones 2023; Trott et al. 2023), often with a special focus on psychometrics (Chollet 2019), or from a more general interest in the human-like abilities of LLMs (Mirzadeh et al. 2024), CLS research will have to address the development of appropriate metrics for measuring the correctness of LLMs’ behavior. Since we are dealing with stochastic machines that react on the basis of randomization and probability, future research will have to consider not only individual responses, but also variations of types or patterns and distributions of responses. This will require a more rigorous formalization of correct versus incorrect output to be evaluated. One of the aims of this paper is to lay the groundwork for future work on robust metrics that can be scaled for benchmarking tasks. The particular challenge, however, will be to address this problem when future work involves a more quantitative and metric-based evaluation of the comprehension abilities of LLMs. A quantitative approach to evaluating the capabilities of LLMs should not, as is currently often the case, ignore the effects of the principle of charity, observer dependency, and the attribution-based nature of interpreting machine behavior, but rather ensure that these are controlled for.
For the following experiments in this paper, the requirements of a sufficiently complex concept of understanding outlined above make it clear that we must not model the LLM agent with a simple task-driven approach, where each task is a self-contained unit. The behavior we observe is inevitably full of interpretations on our part. In particular, mistakes made by the LLMs and by us have helped us to change our understanding of the LLMs. A major challenge for future research lies in reconciling the variability that inevitably exists in open games with the fixed arrangements of benchmarking and quantitative evaluations.
3. General Knowledge
At the level of general knowledge, we test to what extent an LLM’s behavior can be interpreted according to the intentional stance in terms of forming correct beliefs and thus making correct statements about what is the case. On the one hand, these statements about what is the case involve statements about the text. On the other hand, we also focus on tasks that involve widely taught and culturally accessible knowledge of generally relevant context. We assume that the LLMs have already seen much of such contextually relevant information during training. However, our experiments aim to evaluate not only the semantic correctness of their outputs, but also their ability to approach tasks in ways that reflect a meaningful understanding of the poems in terms of generalization, pattern recognition, and meaning attribution (i.e. knowledge in terms of ›knowing how to interpret‹). From a theoretical perspective, it is mandatory to reflect on the conditions, our expectations, and our willingness – in terms of the principle of charity (section 2) – to accept the output as expressing correct text understanding.
Starting with the analysis of the metrical structure of the poems, we ask the LLMs in Notebook 1 (NB 1) to return correct statements regarding the scansion and to report their results in a summarizing way. For the poem Hälfte des Lebens, Sonnet consistently produced accurate scansion, whereas GPT4o and Gemini produced errors. Regarding the second step, a notable observation across all LLMs was their frequent inability to summarize the scansion patterns they had identified: they often reported more stressed syllables than they had marked, but never fewer. This discrepancy is probably related to the inability to count and to deal with symbols that do not coincide with token boundaries (Edman et al. 2024; Xu et al. 2024). These findings highlight that while some LLMs can recognize scansion, all models struggle with tasks requiring metrical abstraction.14 On the level of general knowledge, it is easy to define our expectation regarding correct output exactly and thus to evaluate the LLMs’ accuracy. Observer dependency is only a marginal problem at this level.
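The mismatch between detailed scansion and summary described above can be checked mechanically. The following minimal helper, a sketch of our own rather than code from the notebooks, compares the number of stresses a model actually marked against the total it reports; the notation (“/” for stressed, “x” for unstressed syllables) is an assumption for illustration.

```python
# Consistency check: does a model's self-reported stress count match the
# stresses it actually marked in its own scansion string?

def count_stresses(scansion: str) -> int:
    """Count stressed-syllable marks ("/") in a scansion string."""
    return scansion.count("/")

def summary_consistent(scansion: str, reported_total: int) -> bool:
    """True if the reported total matches the model's own marking."""
    return count_stresses(scansion) == reported_total

# e.g. a model marks five stresses in a line but claims six in its summary:
marked = "x/x/x/x/x/"
print(count_stresses(marked))          # 5
print(summary_consistent(marked, 6))   # False
```

Such a check only formalizes the error pattern; it does not explain it, which is why we point to tokenization-related counting problems above.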
The analyses of rhyme words (NB 2) and schemes are directly linked to the former analyses (NB 1). For Hälfte des Lebens, the absence of rhymes was generally detected, but GPT4o and Gemini occasionally produced false positives by counting non-terminal words as rhyming elements. This indicates that the representation of verse structure is not well modeled in these LLMs. In contrast, Sonnet provided an almost perfect description of German pronunciation and rhyme structure. Regarding the unknown poem Unsere Toten, which follows an AABB rhyme scheme with an internal rhyme in the final verse, GPT4o and Gemini correctly identified the rhyme words, but only Sonnet accurately detected the rhyme scheme. The other two models showed inconsistent results.
While the prompt designs for meter and rhyme consist of simple zero-shot detection tasks, the prompts for the detection of assonance (NB 3) include different definitions of assonance, ranging from simple descriptions to technical explanations involving phonemes (Zymner 2007). Based on the pre-registered standard of correct assonance detection, we found that all LLMs demonstrate low accuracy for the German poems, regardless of the definition provided and the poem. By contrast, their precision and recall on the English translations of the poems is markedly higher, independently of the definition provided and also for a prompting that does not offer any definition (see also the table summarizing the results in NB 3).
Even if (according to our considerations in section 2) it is advisable not to mix the levels of the functional and intentional stances, it is nevertheless useful to consider which system characteristics at the functional level prevent better performance at the level of behavioral explanation in the intentional stance. In this respect, the results indicate that training on phonetic features in German either did not play a major role during development or that such training was not sufficiently effective. To address this, we applied a two-step chain-of-thought (CoT) prompting method, asking the LLMs first to transcribe the poem into the International Phonetic Alphabet (IPA) and then to identify assonance based on the transcription. Though all LLMs perform the first step of transcribing the poems into IPA well, they are not able to base the second step effectively on the first. This finding highlights a difference in how humans and LLMs approach tasks that rely on foundational language skills. Humans are able to detect and process phonetic patterns like meter, rhyme, and assonance without the need for specialized training or explicit systems such as IPA. It can be assumed that this ability stems from a mixture of linguistic capacities and learned experience, which allows humans to recognize phonetic similarity. In contrast, LLMs seem to lack this phenomenological foundation. If concrete improvements at the functional level of the system lead to the behavior of LLMs being better interpretable at the level of the intentional stance as a rational understanding achievement, this also opens up further possibilities for combining design stance and intentional stance in a sufficiently sophisticated way.
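The two-step CoT setup can be sketched as a simple prompt chain in which the IPA output of the first step is fed into the second. The prompt wording and the `ask` stub below are hypothetical; in the actual experiments, each step would go to the respective LLM API.

```python
# Sketch of two-step chain-of-thought prompting for assonance detection:
# step 1 asks for an IPA transcription, step 2 asks for assonances based
# only on that transcription.

def step1_prompt(poem: str) -> str:
    return ("Transcribe the following poem into the International "
            f"Phonetic Alphabet (IPA), line by line:\n\n{poem}")

def step2_prompt(ipa_transcription: str) -> str:
    return ("Based only on the following IPA transcription, identify all "
            f"assonances (repeated vowel sounds):\n\n{ipa_transcription}")

def run_chain(poem: str, ask) -> str:
    """Chain the two prompts: feed the IPA answer of step 1 into step 2."""
    ipa = ask(step1_prompt(poem))
    return ask(step2_prompt(ipa))

# stub standing in for a real API call
result = run_chain("Mit gelben Birnen ...",
                   lambda p: f"[model answer to: {p[:30]}...]")
```

The chaining makes the intended dependency explicit: step 2 is only given the transcription, not the original orthography, which is precisely where the models fail to carry the phonetic information forward.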
Analyses of the lexis show that all models are able to identify the semantic field of selected nouns and verbs. The identification of the parts of speech is, however, partially flawed (NB 4). Our experiments aim at reconstructing how the LLMs understand the imagery, figurative speech, and meaning of the two poems. For this, the LLMs were tasked with identifying all instances of figurative speech in the poems and, for each instance, providing reasons for why it was identified as figurative speech (NB 5). This approach allows us to distinguish between linguistic devices that render an entire text as an overarching image and those that serve as localized elements of illustration within the text (Burdorf 2015). Our particular interest lies in these localized figures of speech, such as metaphors, metonymies, synecdoches, and symbols.15
All LLMs work remarkably well, as they identify many instances, even if none of the LLMs cover all cases. Indicators that an expression is supposed to be understood figuratively are, according to the models, that a literal understanding does not make sense, for example “Human qualities are attributed to inanimate objects (walls).” (Gemini on Hälfte des Lebens). Interpretations often refer to an established understanding of the symbol, which is then explicitly marked, e.g., “Roses are often symbols of beauty, passion, or transience” (GPT4o on Hälfte des Lebens).
In order to investigate the relation between figurative speech and literal understanding, a completion task using the “simple suffix prompting” (Liu et al. 2022) method with “that is to say” was conducted (NB 5). The task for the LLMs was to interpret the figurative phrase “Die Mauern stehn sprachlos und kalt” of the poem Hälfte des Lebens. In the prompt design, the suffix “that is to say” was inserted between the figurative phrase and its possible literal explanation, signaling the need for interpretation. Only Gemini correctly engaged with the syntactic structure of the prompt and completed the sentence with the literal description “indifferent, uncaring.”
At this stage, the complications coming from the observer dependence of the principle of charity (section 2) come into play. Our impression that all LLMs perform remarkably well when it comes to determining figurative speech could easily be countered by sceptics who do not find the characterization of metaphors and other devices sufficiently sophisticated. Characterizing figurative speech or abstractly describing the figurative content of a poem admits a range of plausible answers. Generous observers with a large portion of hermeneutic charity will accept as ‘correct understanding’ all output that is logically compatible with their description of that poem. Less generous observers will expect very specific answers. It is the generally under-determined semantic content of abstract and interpretive statements that leads to some variance in accepting correct answers. In order to control for this observer dependence in evaluating the LLMs’ performance, the LLMs were asked in the next step to choose the best completion for the figurative phrase from four options, evaluate their choice with regard to the context of the poem, and provide a confidence score for their decision. All three models selected the same completion: “the emptiness echoes within the confines of their silence,” assigning it an identical confidence score of 0.9. The chosen completion aligns with traditional interpretative approaches. However, it became clear that the theme of “speechlessness,” as discussed in scholarly research on Hälfte des Lebens, was not selected (Strauss 1965). Exploring how the three LLMs engage with figurative speech yields three interesting results: Firstly, we saw that their outputs are primarily shaped by conventional interpretations. All three LLMs draw on culturally entrenched associations rather than generating novel interpretations. Secondly, the more open the range of possible answers, the higher the risk of large variance between positive and negative evaluation.
Thirdly, it must be taken into account that the high level of agreement in the completion task is also due to the fact that the possible selection of answers is relatively small. Since LLMs are stochastic machines, it must be assumed that simple counting tasks (such as those for aspects 1 to 3) are very difficult because there are many possible answers and therefore a high probability of error, while seemingly complex tasks of identifying metaphors are better handled simply because the statistical risk of being wrong is lower.
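The forced-choice setup with confidence scores sketched above can be made machine-checkable by constraining the expected output format. The following parser is an illustrative assumption about one possible prompt design (“choice: <letter>, confidence: <number>”), not the format actually used in NB 5.

```python
# Illustrative parser for a forced-choice completion task: the model must
# pick one of four options (A-D) and attach a confidence score, which
# narrows the answer space and reduces observer-dependent judgment.

import re

def parse_choice(output: str):
    """Extract a ('B', 0.9)-style (choice, confidence) pair, or None."""
    m = re.search(r"choice:\s*([A-D]).*?confidence:\s*([01](?:\.\d+)?)",
                  output, re.IGNORECASE | re.DOTALL)
    if m is None:
        return None
    return m.group(1).upper(), float(m.group(2))

print(parse_choice("Choice: B, confidence: 0.9"))  # ('B', 0.9)
```

Constraining answers this way trades the richness of open interpretive output for evaluability, which is exactly the tension between open games and benchmarking discussed in section 2.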
The experiments on syntactic structure (NB 6) show that all models have generalized broad syntactic signals of German well. However, they did not manage to elaborate the difference between the two stanzas of Hälfte des Lebens in terms of enjambments (verse endings splitting a phrase).
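As a rough point of comparison for the enjambment phenomenon just mentioned, a crude surface heuristic can flag candidate enjambments by checking whether a verse line ends without phrase-final punctuation. This simplification is ours, for illustration only; reliable enjambment detection requires syntactic analysis of the kind the models failed to deliver.

```python
# Baseline heuristic: a verse line ending without phrase-final punctuation
# may split a syntactic phrase across the line break (enjambment candidate).

PHRASE_FINAL = (".", ",", ";", ":", "!", "?")

def candidate_enjambments(poem: str) -> list:
    """Return indices of verses whose line break may split a phrase."""
    lines = [l.rstrip() for l in poem.strip().splitlines()]
    return [i for i, line in enumerate(lines[:-1])   # last line never enjambs
            if line and not line.endswith(PHRASE_FINAL)]

# opening of the first stanza of Hälfte des Lebens:
stanza = ("Mit gelben Birnen hänget\n"
          "Und voll mit wilden Rosen\n"
          "Das Land in den See,")
print(candidate_enjambments(stanza))  # [0, 1]
```

On the first stanza this heuristic correctly flags the two enjambed lines, but it would of course also misfire on unpunctuated end-stopped verse; it marks only how low the baseline is that the models were measured against.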
The experiments on title (NB 8) and text meaning (NB 9) highlight the LLMs’ capabilities in interpreting central themes and motifs as well as generating plausible interpretation hypotheses. At the level of general knowledge, the models demonstrated a strong ability to generate conventional interpretations (NB 9). Zero-shot prompting reveals a strong focus on oppositional motifs or themes. In the case of Hälfte des Lebens, all three models focused on the central oppositional pair of “summer and winter” to summarize the poem’s thematic elements. Their interpretations, however, rarely ventured beyond these straightforward dichotomies to address more figurative or nuanced meanings. For the less familiar poem Unsere Toten, the models displayed greater diversity in their hypotheses, referencing historical contexts such as World Wars I and II. In addition, all LLMs provided a range of different interpretations. We then asked the models to handle some of the most salient or most surprising aspects of the poems’ titles (NB 8). Processing the title of a work of art is a complex interpretive operation that requires understanding the work itself, relating its meaning to the literary meaning of the title, and then thinking about the effects of connecting the two. For Hölderlin’s Hälfte des Lebens, the models recognized the title’s relevance to the poem’s dual structure, connecting the “half” to the juxtaposition of summer and winter imagery. However, interpretations diverged in their reasoning. GPT4o and Sonnet argue that the poem thematizes both halves of life, while Gemini claimed the poem exclusively addresses the first half, associated with summer. Despite failing to engage with the second stanza’s conditional structure (“wo nehm ich, wenn”), Gemini’s interpretation framed the title as emphasizing youth and vitality.
Meanwhile, GPT4o and Sonnet took a more abstract approach, interpreting “Hälfte” as representing a midpoint or turning point in life, reflecting a moment of awareness about life’s contrasting phases. For Unsere Toten, all models correctly identified the reference to “German soldiers” and the invocation of national identity as central elements of the title. Sonnet was the only model to explicitly and correctly associate the poem with World War I.
The tasks set here can mostly be described as a work-immanent approach in which semantic relations between a title phrase and the bundled sentence meanings of the respective poems are to be described. As with other tasks of linking semantic units that appear complex at first glance (meaning, metaphor), all language models perform very well in this respect, at a level that fulfills the requirements of general knowledge. Our interest, however, lies not only in the performance the LLMs showed in terms of their understanding abilities, but also in reflecting on the observer relativity at the level of evaluation. It has to be taken into consideration that our willingness to consider the output as acceptable answers stems both from the under-determined nature of such generalizing statements and from the limited range of obviously false answers.
4. Expert Knowledge
On the level of expert knowledge we designed prompts that forced the models to show more sophisticated behavior according to what one could call philological ways of reasoning.16 When adopting the intentional stance, interpreting such complex reasoning involves ascribing to the model concrete objectives (interpretive intentions in the literal sense) as well as beliefs regarding textual and contextual facts. Particularly challenging at this level is the assessment of the conditions under which an LLM’s answers represent adequate understanding. Irrespective of the inescapable observer dependency of assessing some behavior as an understanding ability (see section 2), we predefined conditions for correct answers for each experiment. In line with the theoretical discussion of the different dimensions of the philological concept of understanding (Künne 2003; Strube 2003), these conditions include the ability to integrate historical context knowledge, the ability not only to provide plausible answers but also to judge the empirical appropriateness of different explanatory hypotheses, the ability to connect different layers and aspects of the work, and the ability to find an appropriate level of abstraction.
When asked to identify instances in both poems where the meter is changed to indicate a semantic aspect, all models identified the change in meter between the first and second stanzas in Hölderlin’s poem and associated it with the change in meaning and emotion (NB 1). While this may be due to the fact that this change is mentioned in many interpretations of the text, all models also identified the change in meter in the last two lines of Pfeifer’s poem. Additionally, we asked for the verse meter of Unsere Toten. The correct answer is ‘Knittelvers’, which has the rhyme scheme AABB and four stressed syllables. Since it allows free filling of unstressed syllables and thus a variable number of syllables, simple bottom-up detection from smaller units is not feasible. Identifying the Knittelvers can thus be considered a basic form of what we call expert tasks, which require the interpreter to spontaneously take into account a non-trivial logical relationship between the categories to be considered. Two of the models, Sonnet and GPT4o, answered correctly, but only when asked for a “German verse meter”, not when asked for a verse meter in general. The texts were provided in German and the author’s name is typically German; nevertheless, the models applied the context of the German history of metrics to the task only when explicitly prompted. We consider this finding remarkable because it shows that the models did not spontaneously draw on the appropriate cultural horizon as the relevant context.
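To illustrate why such bottom-up detection falls short, one can sketch a naive check for only the AABB rhyme-scheme component of the Knittelvers. The sketch below is our own illustration, not part of the experiments; it uses crude orthographic suffix matching as a stand-in for real phonetic rhyme detection, and the Knittelvers’ variable number of unstressed syllables is precisely what such surface heuristics cannot capture.

```python
# Naive sketch (illustrative only): check the AABB rhyme-scheme component
# of the Knittelvers via crude orthographic suffix matching. Real rhyme
# detection would need phonetic transcription, and the variable filling of
# unstressed syllables escapes surface heuristics like this one.

def crude_rhyme(w1: str, w2: str, n: int = 2) -> bool:
    """Orthographic stand-in for phonetic rhyme: shared final letters."""
    return w1.lower()[-n:] == w2.lower()[-n:]

def is_aabb(lines: list[str]) -> bool:
    """Check an AABB scheme over groups of four verse lines."""
    finals = [line.rstrip(" .,;!?").split()[-1] for line in lines]
    if not finals or len(finals) % 4 != 0:
        return False
    return all(
        crude_rhyme(finals[i], finals[i + 1])
        and crude_rhyme(finals[i + 2], finals[i + 3])
        for i in range(0, len(finals), 4)
    )
```

Even where this heuristic labels a stanza AABB correctly, it says nothing about the four stressed syllables per line, which is the part that requires metrical analysis rather than string matching.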
Since Hälfte des Lebens does not rhyme, we focused on Unsere Toten for the analysis of rhyme (NB 2). The models were asked to describe two strategies for relating rhyme to the meaning of the poem, and then to determine whether any of the relations produced interesting insights into the poems. All models made plausible suggestions on a general level, and all applied their proposed approaches to the poem. Before we queried the models, we pre-registered a set of acceptable answers: one being a semantic relation between the rhymed words (semantic rhyme words), another a relation between the vowel structure of the poem or parts of it and the vowel structure of the rhyme words, both as indicators of a specific tone or mood (rhyme and mood). GPT4o, for example, proposed the rhyme and mood approach but described a relation that was not even plausible. Even in an open interpretive constellation into which we brought a large measure of hermeneutic charity, we did not succeed in explaining the LLM’s behavior as acceptable text understanding. We are therefore still a long way from benchmarking tasks of this type. Even if LLMs provide plausible answers in the future, it will not be easy to define meaningful acceptance ranges for adequate answers.
Since the models failed at the first level of general knowledge when prompted to detect assonance (NB 3), we relegate the summary of the respective results to NB 3. To investigate the ability of the models to analyze the lexis of the poems (NB 4), they were asked to focus on one semantic contrast that is triggered either by a morphologically complex word or within a phrase, and to elaborate on how this contrast contributes to the meaning of the poem. All three models identify “heilignüchterne” as the most striking example of semantic contrast in Hölderlin’s poem. Only Sonnet makes use of specialized vocabulary, describing “heilignüchtern” as an “oxymoronic combination”. No model refers to the classical topos of “sobria ebrietas” (Schmidt 1982).
Regarding the aspect of phrases (NB 5), we explored the ability of LLMs to contextualize ambiguous phrases and to translate non-figurative into figurative language. Our focus for this task was on the phrase “im Winde klirren die Fahnen.” As noted by Strauss (1965), the notion of a fabric flag is very likely to come to mind first but must be abandoned in favor of the concept of a weather vane to align with the intended meaning. In an initial zero-shot prompt, the LLMs were asked to interpret the meaning of “Fahne”, with all results being incorrect when taking Strauss’s semantic determination as ground truth. Based on the observation that the LLMs failed in this case when measured against a – perhaps dogmatically – predefined set of correct answers, we moved on to a more ‘open game’ with the models. Confronted with a given finite set of unlikely meanings of the phrase, GPT4o and Sonnet rejected most of them but selected the military context, while Gemini suggested a new metaphorical interpretation. For the unknown poem Unsere Toten, the study examined the models’ ability to generate and assign figurative phrases based on their interpretations. Specifically, the task addressed the transition from non-figurative to figurative language, using the phrase “die Füße mühn sich im zitternden Mondenschein”. The scenario assumed that part of the text was unreadable, leaving either only an interpretation or a gap-filled text for reference. In both cases, the LLMs demonstrated the ability to generate figurative phrases that thematically align with the poem.17 We would like to emphasize that this type of experiment is crucial for investigating the semantic interpretation capabilities of the models. Experiments of this type cannot easily be converted into simple benchmarking tasks, but require human interpretation of the correspondence between literal and figurative meaning, as well as between the given information and the generated results.
Engaging with a literary text on a research level often involves addressing literary theoretical positions. As Köppe and Winko (2013) note, it is impossible to read a text without theory. In examining the interpretative outputs of LLMs with regard to statements about textual meaning (NB 9), we therefore asked to what extent, and in what ways, the models reflect specific literary theoretical approaches. Can we identify latent representations of literary and cultural theories in the interpretations generated by LLMs? Our study focuses on the dimensions of representativeness in these outputs. The starting point was Görner (2016, p. 107) and his thesis for Hälfte des Lebens: “Postcolonial literary studies do not take us far in understanding [Hölderlin’s] work. By contrast, (post-)structuralists and deconstructionists appear – albeit unintentionally – to have prepared the way for interpreting Hölderlin.” Notably, all the LLMs produced postcolonial interpretations for both poems, incorporating key terms central to the theory. However, when ranking the literary theoretical positions, GPT4o and Sonnet indicated that the poem Hälfte des Lebens “lacks overt colonial references” or “lacks specific markers of colonization.” In contrast, for the unknown poem Unsere Toten, Gemini suggested that the “poem can be read as an allegory for the lasting impact of colonialism.” Recourse to interpretation-theoretical framework assumptions thus never went beyond very superficial remarks.
Regarding figurative language (NB 7), we tested for both poems the ability of the LLMs to change their understanding of figurative language when additional information was given about a specific term that was used figuratively. In the case of Hälfte des Lebens we added the information that its author was a great admirer of classical antiquity (which is common knowledge) and that in classical literature swans are often a metaphor for the poet, the latter being specialized knowledge first applied to this poem by Schmidt (1982). All models provided a before and after interpretation and used the information to change or deepen their reading. For example, the interpretation changes from “swans can be initially read as representing a harmonious connection between nature’s elements” to “swans become a symbol of the poet in his ideal state: connected to nature, inspired, and capable of creating beautiful and meaningful art”. Their understanding is remarkable in that they explain how this additional information changes not only the meaning of the swan in itself, but also how the situation of the swans and their actions gain additional meaning. This extends ‘upwards’ to the level of a textual meaning, when one model summarizes the whole poem as a “meditation on the poet’s role and the crisis of modern poetry versus ancient ideals”. In the case of Unsere Toten we added the information that the poem was first printed in 1922 (we added the full bibliographic information, but the models concentrated on the date). Though all models described the specific situation in Germany after World War I, only one understood that the returning people are the dead soldiers.
For the task of processing the poems’ titles, we expected the models to operate with context, here with intertextual resonance in the titles (NB 8). For Hälfte des Lebens, we deliberately provided an anachronistic and thus irrelevant but similar title: Mitte des Lebens, a novel from 1978 by Luise Rinser. For Unsere Toten, we offered a potentially relevant context by mentioning that there was a journal, Jahrbuch der Schiffbautechnischen Gesellschaft (1914), which had a section Unsere Toten. Our expectation was that the models would warn of a potentially anachronistic interpretation and would be able to abstract from the journal section Unsere Toten some potential genre-like rules of commemoration applicable to the poem. No model raised the risk of anachronism (with reference to Rinser’s novel). None of the models was able to refer to the content of the intertexts that were mentioned as a context. For Unsere Toten the models were able to infer the genre-like function of commemoration. Gemini connected this function with the poem’s phrase “Nur nicht vergessen! Uns nicht vergessen!” and thus highlighted the commemorative function of this part of the poem, without addressing the obvious differences. We conclude that when provided with potentially relevant context information, all models reason on the level of merely semantic surface relationships by extracting semantic information from the information that is available for the text and the context/pretext. It is particularly striking that the models process all the information provided as actually relevant and create associations between text and context. However, no relevance check of the offered context, which is characteristic of philologically sophisticated interpretation, took place.
With regard to evaluating the output, we can summarize that the abstraction performed by the models when they try to apply some aspect of a given context to the text being interpreted works well on a purely semantic level, but not on the level of relating works, events, objects, and persons as historical positivities. This result aligns well with the findings of other studies that do not see any complex world model included in the language model, which is, as its name says, a language model without anything beyond the semiotic relations of language itself.18
When looking at the complication that arises from the open interpretation game we enter when interpreting the models’ behavior, we can now make some of the challenges of evaluating their behavior in terms of their understanding abilities more clearly visible. Firstly, the reader may have noticed that we mostly tried to maximize hermeneutic charity, i.e. to interpret the behavior as rationally as possible. In all cases where the models’ output could not be rationalized in a satisfying way, it is clear that the models are far from showing understanding-like abilities. More debatable, however, are the cases where we believe the answers given were a remarkable demonstration of such abilities. If we recall the relatively convincing performance on figurative language (NB 7), we can imagine more sceptical observers who bring less hermeneutic charity into play and who may claim that the presumed ability could be better explained as statistically likely output from the design stance. In order to handle counter-arguments of this kind in a pragmatic way, we will introduce a third, more complex level. No matter how benevolently – in the sense of the principle of charity – one interprets the behavior of LLMs in individual cases as an act of understanding, one systematic challenge remains. This challenge can be captured in the following relationship: the more abstract an interpretative statement is, the more serious its under-determination. As a result, the more abstract an interpretative statement is, the more its acceptance depends on how much hermeneutic charity the observer shows when “interacting” with the machine.19
5. Abstraction and Transfer
Many of the more complex tasks from the previous section can still be considered tasks that reflect general cultural knowledge that is simply reproduced by the language models from their pre-training. In this section, we discuss a series of more complex probing experiments designed to test the models’ ability to infer counterfactual interpretive rules from given data or examples and apply them to the two poems. As a preliminary step, and in cases where the main task could not be solved successfully, we asked the models to apply counterfactual rules that we had explicitly articulated. By either giving the models counterfactual, i.e. made-up, rules or having them infer such rules inductively, we can assume that we are asking for analytical processes that the models could not have seen during training.
For the analysis of meter (NB 1), our plan was to give the models two tasks: first, we wanted to test their ability to apply arbitrary rules about meter to given data and analyze the results; secondly, we wanted them to abstract these rules from given example data. The rules we defined are simple. They mix information about the stress of words with the ability to detect vowels and to change the stress when a specific combination is met (accented syllables containing the vowel ‘i’ were changed to non-accented syllables). The task was then to apply this rule to the poems. No model was able to solve this task. Based on this result we decided to skip the second experiment, because counterfactual information on the level of meter seems to be a challenge even in a simple setting. For the analysis of assonance (NB 3), we used a similar task. We defined a made-up phenomenon called ‘eusonance’: when the vowel sounds of the stressed syllables in two consecutive words in the same line are an i-sound and a German e-sound (e.g. “Ich” and “esse” in “Ich esse Kuchen”). With the two-step CoT-prompting (see section 3) we asked the models first to transcribe the poems into IPA and then to give an analysis of all eusonances. The problems were, on a structural level, of the same types as with the basic detection of assonance. Most severely, the models posited words that were not even in the poem. Furthermore, phonemes that do not match the definition (i.e. a-, o-, and u-sounds) were claimed to generate eusonance. Also, very simple aspects of the definition were disregarded (for instance, combinations of two e-sounds were regularly included in the answers). The second task was to abstract and infer a rule of interpretation based on an interpretation provided for a different poem, Waldgespräch by Eichendorff (1815). The models are surprisingly good at inferring the intended rule from the interpretation.
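As a rough illustration of what the eusonance rule demands: the check itself is trivial once a stressed-vowel annotation exists; the hard part for the models was the phonetic transcription. The toy lexicon below is a hypothetical stand-in for that IPA step, not the setup used in our notebooks.

```python
# Toy sketch of the made-up 'eusonance' rule: two consecutive words in the
# same line whose stressed vowels are an i-sound followed by a German
# e-sound (following the order of the example "Ich esse").
# STRESSED_VOWEL is a hypothetical stand-in for a real IPA transcription.

I_SOUNDS = {"ɪ", "iː"}
E_SOUNDS = {"ɛ", "eː"}

STRESSED_VOWEL = {  # word -> IPA vowel of its stressed syllable
    "ich": "ɪ",
    "esse": "ɛ",
    "kuchen": "uː",
}

def find_eusonances(line: str) -> list[tuple[str, str]]:
    """Return consecutive word pairs in one verse line forming a eusonance."""
    words = [w.strip(".,;!?").lower() for w in line.split()]
    return [
        (w1, w2)
        for w1, w2 in zip(words, words[1:])
        if STRESSED_VOWEL.get(w1) in I_SOUNDS
        and STRESSED_VOWEL.get(w2) in E_SOUNDS
    ]

print(find_eusonances("Ich esse Kuchen"))  # [('ich', 'esse')]
```

The contrast between this trivially checkable definition and the models’ erroneous phoneme claims suggests that the failures lie in the transcription step, not in applying the rule itself.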
The difference between a well-known poem likely to be part of the training material and a poem previously unknown to the model is insignificant when it comes to sound qualities in the domain of the German language. Seemingly difficult tasks of inferring and abstracting rules and of performing meaning- and association-based operations between the lexical units of the poem are relatively easy for all models. It is mostly the ability to detect the basic properties of sound quality that is still largely missing in all LLMs.
To test their abstraction and transfer ability with lexis, the models received the very challenging task of interpreting attributive adjectives (in contrast to predicatively used ones) in terms of their antonym or semantic opposite (attributed to a made-up poet). In the case of Hölderlin, two of the models refused the task because they determined that the rules were inadequate for Hölderlin. Gemini only understood one part of the rule, namely the mapping to antonyms, but did not realize that whether the rule should be applied depended on the exact part of speech (NB 4). GPT4o could not identify the rule in the case of Unsere Toten.
Our experiments probing the ability of the LLMs to process poems at the phrase level (NB 5) started from the assumption that LLMs may lack a multimodal horizon of experience and perception, which can be crucial for interpreting certain phrases, particularly in poetry. To address this limitation, we introduced a new rule of meaning that posits: only the onomatopoetic level of a phrase carries significance. We gave two examples of this new rule. In both cases, the LLMs recognized and applied the new rule of meaning in their interpretations. Notably, the onomatopoetic translations produced by all three LLMs displayed striking similarities across the two poems. This consistency might suggest that the LLMs are capable of engaging with the onomatopoetic layer of meaning in literary texts, even though they lack direct sensory or experiential input.
In our experiments on the ability to process the meaning of texts, we introduced a non-referential rule of meaning to guide the overall interpretation of the poems (NB 9). According to this rule, the meaning of an expression lies solely in its function within a communicative act, independent of any direct connection to an extralinguistic reality. This rule was applied to two case studies on HÄUSER, a poem written by Helga Novak [1982] (2015) and Einsamkeit by Nikolaus Lenau [1834] (1995). In explaining the new rule of meaning, all three LLMs emphasized that the “communicative act” is central to the assignment of meaning. For both poems, the LLMs’ proposed interpretations focused primarily on the lyric first person, but the notion that extralinguistic reality does not play a meaningful role remained somewhat vague in their results. This suggests that while LLMs can conceptually engage with the idea of communicative meaning, they struggle to fully articulate its implications when detached from concrete referents.
Our experiment on the use of figurative language used only the poem by Pfeifer (NB 7). It started with the counterfactual claim that there was a group of authors in the Weimar Republic who interlaced hidden references to the new medium of film into their texts. The models were then tasked with identifying these references in Pfeifer’s poem and with adapting the interpretation of the text to this hidden reference. We expected the models to detect the unusual phrase “the trembling moon’s light”, which is hardly consistent with realistic moonlight. (The movement of the bodies in the same poem, which may remind modern readers of zombies, would be an anachronism, as the first zombie movie was made in 1932.) All models successfully identified the phrase and connected the poem to the medium of film, changing the interpretation of the text accordingly. In this case we did not ask the models to infer the rule themselves (that there are references to the medium of film was given), but we did not specify the relation, so the models had to apply this very general pattern to the elements of the text on their own.
To test the abilities of the models to reflect on the relation between text and title (NB 8), we asked the LLMs to infer a counterfactual interpretative rule from a poem written by Joseph von Eichendorff (1815) that was later published with the title Waldgespräch. Given an intuitively straightforward title based on a first reading (“Lorelei”), the difference between the actual later title and the made-up title was used to find some aspect of the poem that is highlighted by this difference. In a first run (task 3a in NB 8), the real titles were provided; in a second run (3b), made-up titles were used. In the final run, counterfactual (but here claimed to be real) titles were provided: “Des Dichters Leben” for Hälfte des Lebens and “Gefallene Geister” for Unsere Toten. Although it is relatively hard to rule out false rules inferred from one example, it is very interesting to see that the models tend to slightly over- or undergeneralize.
Again, when summarizing the observations one has to consider two different levels: that of the quality of the performance itself, and the hermeneutic complications coming from the observer dependency in open games. Regarding the general performance, we admit that the overall picture is one of very mixed and disparate impressions. In some cases, counterfactual rules were inferred by some LLMs remarkably well, namely for assonance (3) and titles (8) but not for meter (1) and lexis (4). Only for some aspects were the rules that were given or spontaneously inferred by the models applied consistently, namely for phrases (5) and figurative language (7) but not for assonance (3) and meaning (9). The most important take-away is that we do not find confirmation of the initial assumption that the performance consistently decreases with a higher level of complexity for all nine aspects or tasks. Quite the contrary: the LLMs consistently struggled with some of the aspects on all three levels of complexity but could deal with other aspects across all levels of complexity. Regarding the hermeneutic complexity in open games, a new challenge came into play on the level of abstraction and transfer. Abstraction is always an open task that allows for choosing an appropriate level of abstraction relative to its purpose within an argument. Thus, assessing whether an LLM chose an appropriate level of abstraction depends on the observer’s willingness to ascribe a rational argument to the LLM. The more complex and abstraction-based a task is, the more it remains a matter of human judgment to decide in which cases the models have reached a hermeneutically appropriate level of abstraction.
6. Conclusion
Our paper contributes to a hermeneutically founded strategy of evaluating the understanding abilities of LLMs in tasks of analyzing poetry. In general, we saw in our qualitative study that all models performed surprisingly well. More precisely, depending on the portion of hermeneutic charity that we as observers of the LLMs’ behavior bring into play, the performance looked good in the eyes of benevolent observers. The models were good at tasks that we believe are difficult for humans, such as processing non-literal meaning and combining different levels of semantics by finding non-literal association, equivalence, and opposition. In this respect, the performance of the LLMs often covaries more strongly with the nine differentiated aspects from meter to meaning than with the three levels of complexity (general knowledge to abstraction and transfer). In some aspects that we believe are comparatively easy tasks for humans, such as counting syllables or recognizing meter, all models struggle. One problem is counting. Another problem seems to be the lack of a stable representation of these qualities. Thus, the models can solve some tasks that rely on one pass through the text, such as detecting a relation between formal and semantic patterns, but have problems with tasks that demand repeated access to the formal features. Probably, this cannot be explained by a lack of training data alone. We assume that humans have a capability of processing formal aspects based on phonetic information and an understanding of phonetic similarity that is not yet available to the same extent to language models. These results are preliminary and need quantification in future work. This is a clear limitation of the qualitative approach. Irrespective of this limitation, it is possible to draw inferences about the structure of the LLMs’ strengths and weaknesses.
Firstly, we clearly found interesting types of errors. LLMs can fail to form correct beliefs about what is the case in the text, they can fail to take into account the correct cultural context, they can fail to infer the rule from a given example, they can fail to apply a rule in the correct way, and they can choose a level of abstraction that is not accepted as appropriate by the human observer. Processing semantic comparison, contrast, and association, as well as finding aspects in works that are connected through multi-step semantic abstraction, are closer to the core of the LLMs’ capabilities. Even on the level of abstraction and transfer, the models were efficient at inferring and applying interpretive rules. Providing historical and empirical arguments by giving good reasons for the correctness of specific historically explanatory hypotheses is not among the abilities that LLMs are currently particularly good at. This may have to do with the training material, but also with the way historical world knowledge is, or rather is not (yet), modeled in LLMs.
Beyond the covariation between the nine aspects, we are also able to identify significant performance problems in terms of complexity of understanding, as reflected by the three levels in sections 3 to 5, especially when focusing on expert knowledge and abstraction and transfer. First, all three LLMs draw on culturally established associations rather than producing novel or surprising connections and interpretations. They produce the expected rather than the original: LLMs follow the path of highest probability and expectability with little surprise, whereas sophisticated interpretations in literary studies at the level of expert communication are expected to look for new and unusual ideas.20 Second, it is well known that prompting is very important for communicating with LLMs. We observed three interesting behaviors, which we propose to summarize as the problem of culturally sensitive context integration:
- If there is information in the prompt, the models consider all of it to be relevant, regardless of whether it is actually relevant. Most of the time, they do not rely on a stable representation of the world that would allow them to reject some given information as obviously nonsensical or useless. Rather, they show an attitude that attributes a high level of competence to the user in selecting the information given in the prompt.
- Even if they are able to employ complex arguments when the prompt already engages them on this level, this is not their standard level of interaction. The language and the argumentative structure of the prompt seem to be far more important for this kind of framing than any explicitly described roles.
- If an answer can be found in a cultural context other than the main English one, the models work better if that context is explicitly stated in the prompt.
We found that understanding includes calling on the most appropriate rules. For instance, when the correct verse meter is to be inferred, the space of potential and culturally relevant verse meters has to be considered. We saw that LLMs often use the domains and cultural spaces associated with the language they are most thoroughly trained on. Following prompt-engineering practice, we usually expect users to intelligently control these cultural biases by adding information on the relevant context. If we ask about LLMs’ interpretive capabilities, however, we can observe that the failure to infer the ‘correct’ contexts is one of their most interesting shortcomings. This shortcoming affects both the level of expert knowledge (when describing the task as a matter of culturally and historically sensitive context selection) and the level of abstraction and transfer (when describing the abductive inferential reasoning needed for such tasks).
Our primary goal, however, was not only to present preliminary tests as first steps towards benchmarking studies, but also to critically reflect on the complications arising from the interpretative constellation in which we find ourselves as observers with our principle of charity, interpreting LLMs as rationally interpreting agents. It is important to keep in mind that whenever we as humans are engaged in an open game with the LLM, asking it to give complex interpretive answers, we are in a situation of human-computer interaction in which we take the intentional stance. Taking the intentional stance requires us to maximize hermeneutic charity, i.e. to presume the LLM to be acting rationally. In empirical interpretive constellations, this general principle of charity is counterbalanced by a greater or lesser willingness to assume rationality and to accept linguistic output as appropriate answers. It is possible to summarize some general rules that have strong implications for future benchmarking tasks on LLMs’ understanding capabilities: the more complex the task, the more open the interpretive situation in which the human interprets the LLM’s text-interpretive activities. The more open this situation, the more it depends on standards for defining the situationally appropriate levels of abstraction and other soft evaluation criteria. It is often the task of interpretation and text analysis itself to justify that the rules applied and the chosen level of abstraction are appropriate. For the observer evaluating the interpretation, the task is to assess whether the chosen rules and abstractions were appropriate and suitable for the situation.
If these rules generally apply, benchmarking faces a challenge: it needs datasets and fixed arrangements in which correct versus false output can be strictly defined. Our paper aims to help identify those task areas that can be prepared relatively quickly for benchmarking (much of which falls under the category of general knowledge) and those for which a challenging compromise must first be reached between the interpretive dependency of an LLM’s rational performance and the objectifiable quality of its output. In our opinion, this applies to both expert knowledge and abstraction and transfer. However, many of the tasks that seem to fall under simple ‘literary feature detection’, which we modeled as acts of forming correct beliefs about a text, also involve steps of reasoning. We therefore believe that hermeneutic complexity has to be taken into consideration even for tasks that we subsumed under general knowledge. For future research, it will be essential to control for, rather than ignore, the observer relativity of accepting an answer as a demonstration of the ability to understand a poem.
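To illustrate the contrast between strictly decidable and interpretively open tasks, the kind of benchmark that presupposes a fixed gold answer can be sketched in a few lines. This is a minimal illustration only: the items, the gold labels, and the stand-in model function are all hypothetical and are not taken from our actual experimental setup.

```python
# Minimal sketch of a strict-match benchmark loop. It works only where a
# single correct answer can be fixed in advance (e.g., naming a verse meter);
# all items and the model stand-in below are hypothetical illustrations.

def exact_match_score(items, answer_fn):
    """Fraction of items for which the model's answer equals the gold label."""
    correct = sum(1 for item in items if answer_fn(item["prompt"]) == item["gold"])
    return correct / len(items)

# Hypothetical 'general knowledge' items, each with one fixed gold answer.
meter_items = [
    {"prompt": "Name the meter of poem A.", "gold": "alexandrine"},
    {"prompt": "Name the meter of poem B.", "gold": "iambic pentameter"},
]

def fake_model(prompt):
    # Stand-in for an LLM call; a real evaluation would query a model here.
    return "iambic pentameter"

print(exact_match_score(meter_items, fake_model))  # prints 0.5 on this toy data
```

For interpretive tasks, by contrast, no single gold label can be fixed in advance, which is precisely why the soft evaluation criteria discussed above resist this strict-match format and require a compromise of the kind described.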
7. Data Availability
Data can be found here: https://github.com/cophi-wue/llms_read_hoelderlin. It has been archived and is persistently available at: https://doi.org/10.5281/zenodo.17754274.
8. Software Availability
Software can be found here: https://github.com/cophi-wue/llms_read_hoelderlin. It has been archived and is persistently available at: https://doi.org/10.5281/zenodo.17754274.
10. Author Contributions
Fotis Jannidis: Conceptualization, Writing – original draft, Methodology, Software, Investigation
Rabea Kleymann: Conceptualization, Writing – original draft, Methodology, Investigation, Software
Julian Schröter: Conceptualization, Writing – original draft, Methodology, Investigation, Formal Analysis, Writing – review & editing, Software
Heike Zinsmeister: Writing – review & editing, Methodology
Notes
- It is an open question for us to what extent benchmarking LLMs’ capabilities will be essential to computational humanities research in the future. [^]
- We chose German poetry because this is our field of philological expertise. However, we also used published English translations or produced our own in order to compare linguistic domains (Hölderlin 1965). For further information see the “Readme.md” file in the code repository (section 8). [^]
- We would like to thank Merten Kroencke, who digitized the anthology and made the poem available to us. [^]
- The nine tasks are distributed over nine Jupyter Notebooks (NB 1 to NB 9) in the code repository; see section 7 and section 8 at the end of the paper. [^]
- Three levels structure each of the nine task-related Jupyter Notebooks. These notebooks contain more detailed information on all our experiments, and we believe them to be an important part of this study. [^]
- The so-called continental European tradition of philosophical hermeneutics, with its phenomenological foundations and concepts of the ‘pre-structure’ of understanding (Gadamer 1965), shows a strong internalist tendency; see the critical analysis in Scholz (2005). [^]
- Note that the internal/external distinction we draw upon here is different from the distinction between internalism and externalism in semantics, where externalism refers to meaning constituted through referential grounding. For contributions reinforcing this semantic aspect with regard to the understanding abilities of LLMs, see Borg (2025) and Havlík (2024). [^]
- Note that externalist approaches (Stekeler-Weithofer 2002) that draw heavily on Wittgenstein’s notion of rule-following may make quasi-internalist demands when they claim that understanding in the full sense includes aspects of normativity, personal and social obligation, and other implications. We believe that such aspects will become more important in future discussions when relating AI understanding capabilities to a full sense of understanding in human and social contexts. [^]
- The most prominent argument against a purely externalist notion of human understanding is Searle’s Chinese Room argument (Searle 1980). For a summary of the debate see Cole (2024). [^]
- Think also of Searle’s argument that language use is a sufficient condition for assigning understanding only if the agent is a human being (Searle 1980). [^]
- Bieri (1987) provides a reconstruction that takes into consideration Dennett’s later withdrawal of a radical instrumentalist view. [^]
- The principle of charity has not only been developed and defended by Dennett (2017, 1987). For the long history of this principle in hermeneutics and for its many systematically different variants see Scholz (2016, 160); Spoerhase (2007, 229–251). [^]
- For further implications of this relationship between language and world regarding the question of how the world is represented within large language models, see Havlík (2024, 9). [^]
- Rerunning these experiments with later models didn’t change anything fundamentally, but in November 2025 Sonnet 4.5 and ChatGPT 5 could not identify the scansion pattern while Gemini 2.5 could. [^]
- The capability of LLMs to understand metaphors in non-literary language has been tested in Wachowiak and Gromann (2023) and more systematically in Tong et al. (2024) using older and smaller LLMs with very mixed results. For metaphors in literary texts see Boisson et al. (2025). [^]
- Similar to the notion of ’Styles of Reasoning’, see Hacking (1994). We can only hint at the new possibilities for distinguishing the different forms of philological reasoning that have been identified through extensive praxeological analysis in Winko et al. (2024). [^]
- Interestingly, the models did not account for metrical considerations in particular. Regarding the gap-filled text, GPT4o and Gemini both added references to cardinal directions, while Sonnet only added “Süden” (south) as an additional direction. Notably, the word “Schlürfen” (to slurp) was frequently completed with terms such as “wandern” (to wander) and “gehen” (to walk). As with all experiments, the more nuanced results can be found in the Jupyter Notebooks. [^]
- With regard to an analysis of the theory of language that is realized by LLMs and the thesis that LLMs largely actualize a post-structuralist notion of language, see Underwood (2023). [^]
- From the perspective of interpretation theory, this relationship is in accordance with interpretive pluralism, which assumes that different and even competing interpretive hypotheses for the same work can be acceptable. [^]
- The scope of this statement is limited, of course, as we have not systematically experimented with different temperature values. [^]
References
Bamman, David, Kent K. Chang, Li Lucy, and Naitian Zhou (2024). “On Classification with Large Language Models in Cultural Analytics”. In: Proceedings of the Computational Humanities Research Conference 2024. Ed. by Wouter Haverals, Marijn Koolen, and Laure Thompson, 494–527. https://ceur-ws.org/Vol-3834/paper119.pdf (visited on 11/21/2025).
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 610–623. http://doi.org/10.1145/3442188.3445922.
Bieri, Peter (1987). “Intentionale Systeme: Überlegungen zu Daniel Dennetts Theorie des Geistes”. In: Struktur und Erfahrung in der psychologischen Forschung. Ed. by Jochen Brandtstädter. De Gruyter, 208–252.
Boisson, Joanne, Zara Siddique, Hsuvas Borkakoty, Dimosthenis Antypas, Luis Espinosa Anke, and Jose Camacho-Collados (2025). “Automatic Extraction of Metaphoric Analogies from Literary Texts: Task Formulation, Dataset Construction, and Evaluation”. In: Proceedings of the 31st International Conference on Computational Linguistics. Ed. by Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert. Association for Computational Linguistics, 6692–6704. https://aclanthology.org/2025.coling-main.448/ (visited on 11/21/2025).
Borg, Emma (2025). “LLMs, Turing Tests and Chinese Rooms: The Prospects for Meaning in Large Language Models”. In: Inquiry. http://doi.org/10.1080/0020174X.2024.2446241.
Burdorf, Dieter (2015). Einführung in die Gedichtanalyse. J.B. Metzler. http://doi.org/10.1007/978-3-476-05422-7.
Chang, Yupeng, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie (2024). “A Survey on Evaluation of Large Language Models”. In: ACM Trans. Intell. Syst. Technol. 15 (3). http://doi.org/10.1145/3641289.
Chollet, François (2019). “On the Measure of Intelligence”. In: arXiv preprint. http://doi.org/10.48550/arXiv.1911.01547.
Cole, David (2024). “The Chinese Room Argument”. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta and Uri Nodelman. Winter 2024. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2024/entries/chinese-room/ (visited on 11/21/2025).
Dennett, Daniel C. (1971). “Intentional Systems”. In: The Journal of Philosophy 68 (4), 87–106. http://doi.org/10.2307/2025382.
Dennett, Daniel C. (1987). The Intentional Stance. MIT Press.
Dennett, Daniel C. (2017). Brainstorms : Philosophical Essays on Mind and Psychology. MIT Press. http://doi.org/10.7551/mitpress/11146.001.0001.
Dilthey, Wilhelm (1974). Der Aufbau der geschichtlichen Welt in den Geisteswissenschaften. Suhrkamp.
Edman, Lukas, Helmut Schmid, and Alexander Fraser (2024). “CUTE: Measuring LLMs’ Understanding of Their Tokens”. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Ed. by Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen. Association for Computational Linguistics, 3017–3026. http://doi.org/10.18653/v1/2024.emnlp-main.177.
Eichendorff, Joseph von (1815). “Waldgespräch”. In: Ahnung und Gegenwart. Ein Roman. Schrag, 285–286.
Gadamer, Hans-Georg (1965). Wahrheit und Methode : Grundzüge einer philosophischen Hermeneutik. 2nd ed., expanded with an addendum. J. C. B. Mohr (Paul Siebeck).
Gengnagel, Tessa, Fotis Jannidis, Rabea Kleymann, Julian Schröter, and Heike Zinsmeister (2024). “Bedeutung in Zeiten großer Sprachmodelle”. In: Book of Abstracts - DHd2024. Ed. by Joëlle Weis, Estelle Bunout, Thomas Haider, and Patrick Helling, 81–85. http://doi.org/10.5281/zenodo.10686565.
Görner, Rüdiger (2016). Hölderlin und die Folgen. J.B. Metzler Verlag.
Hacking, Ian (1994). “Styles of Scientific Thinking or Reasoning: A New Analytical Tool for Historians and Philosophers of the Sciences”. In: Trends in the Historiography of Science. Ed. by Kostas Gavroglu, Jean Christianidis, and Efthymios Nicolaidis. Springer, 31–48. http://doi.org/10.1007/978-94-017-3596-4_3.
Havlík, Vladimír (2024). “Meaning and Understanding in Large Language Models”. In: Synthese 205 (1), 1–21. http://doi.org/10.1007/s11229-024-04878-4.
Hicke, Rebecca M. M., Yuri Bizzoni, Pascale Feldkamp, and Ross Deans Kristensen-McLachlan (2025). “Says Who? Effective Zero-Shot Annotation of Focalization”. In: Anthology of Computers and the Humanities 3. Ed. by Taylor Arnold, Margherita Fantoli, and Ruben Ros, 739–755. http://doi.org/10.63744/xxqzxENxsh3b.
Hölderlin, Friedrich (1804). “Hälfte des Lebens”. In: Taschenbuch für das Jahr 1805. Der Liebe und Freundschaft gewidmet. Friedrich Wilmans, 85.
Hölderlin, Friedrich (1965). “Halves of Life”. In: An Anthology of German Poetry from Hölderlin to Rilke in English Translation. Ed. by Angel Flores. Trans. by Kate Flores. Peter Smith, 26–27.
Kirschenbaum, Matthew (2023). “Again Theory: A Forum on Language, Meaning, and Intent in the Time of Stochastic Parrots”. In: Critical Inquiry - In the Moment. https://critinq.wordpress.com/2023/06/26/again-theory-a-forum-on-language-meaning-and-intent-in-the-time-of-stochastic-parrots/ (visited on 11/24/2025).
Köppe, Tilmann and Simone Winko (2013). Neuere Literaturtheorien: Eine Einführung. 2nd ed. Metzler.
Künne, Wolfgang (2003). “Verstehen und Sinn: eine sprachanalytische Betrachtung”. In: Hermeneutik : Basistexte zur Einführung in die wissenschaftstheoretischen Grundlagen von Verstehen und Interpretation. Ed. by Axel Bühler. Synchron, 61–78.
Lenau, Nikolaus [1834] (1995). “Einsamkeit”. In: Werke und Briefe. Historisch-kritische Gesamtausgabe. Band 2. Neuere Gedichte und lyrische Nachlese. Ed. by Antal Mádl. Wien: Deuticke und Klett, 76.
Liu, Emmy, Chenxuan Cui, Kenneth Zheng, and Graham Neubig (2022). “Testing the Ability of Language Models to Interpret Figurative Language”. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Ed. by Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz. Association for Computational Linguistics, 4437–4452. http://doi.org/10.18653/v1/2022.naacl-main.330.
Makkreel, Rudolf A. (2002). “Pushing the Limits of Understanding in Kant and Dilthey”. In: Grenzen des Verstehens: philosophische und humanwissenschaftliche Perspektiven. Ed. by Gudrun Kühne-Bertram and Gunter Scholtz. Vandenhoeck & Ruprecht, 35–47.
Mirzadeh, Iman, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar (2024). “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”. In: arXiv preprint. http://doi.org/10.48550/arXiv.2410.05229.
Novak, Helga M. [1982] (2015). “HÄUSER”. In: Poesiealbum 320. Auswahl von Rita Jorek. Märkischer Verlag, 23.
Pfeifer, Hans (1922). “Unsere Toten”. In: Die deutsche Balladen-Chronik. Ein Balladenbuch von deutscher Geschichte und deutscher Art. Ed. by Wilhelm Uhlmann-Bixterheide. Ruhfus, 290.
Ryle, Gilbert (1945). “Knowing How and Knowing That: The Presidential Address”. In: Proceedings of the Aristotelian Society 46, 1–16. https://www.jstor.org/stable/4544405 (visited on 11/21/2025).
Schmidt, Jochen (1982). “Sobria ebrietas. Hölderlins ‘Hälfte des Lebens’”. In: Hölderlin-Jahrbuch. Ed. by Bernhard Böschenstein and Gerhard Kurz. Vol. 23. Mohr Siebeck, 182–190.
Scholz, Oliver R. (2005). “Die Vorstruktur des Verstehens. Ein Beitrag zur Klärung des Verhältnisses zwischen traditioneller Hermeneutik und ’philosophischer’ Hermeneutik”. In: Geschichte der Hermeneutik und die Methodik der textinterpretierenden Disziplinen. Ed. by Jörg Schönert. De Gruyter, 443–461.
Scholz, Oliver R. (2016). Verstehen und Rationalität : Untersuchungen zu den Grundlagen von Hermeneutik und Sprachphilosophie. 3rd, revised and expanded edition. Vittorio Klostermann.
Searle, John R. (1980). “Minds, Brains, and Programs”. In: Behavioral and Brain Sciences 3 (3), 417–457. http://doi.org/10.1017/S0140525X00005756.
Spoerhase, Carlos (2007). Autorschaft und Interpretation. Methodische Grundlagen einer philologischen Hermeneutik. De Gruyter. http://doi.org/10.1515/9783110921649.
Stekeler-Weithofer, Pirmin (2002). “Sind Sprechen und Verstehen ein Regelfolgen? Probleme konventionalistischer und intentionalistischer Theorien der Sprache”. In: Gibt es eine Sprache hinter dem Sprechen? Ed. by Sybille Krämer and Ekkehard König. Suhrkamp, 190–225.
Strauss, Ludwig (1965). “‘Hälfte des Lebens’”. In: Interpretationen, Band 1: Deutsche Lyrik von Weckherlin bis Benn. Ed. by Jost Schillemeit. Vol. 1. S. Fischer Verlag, 113–134.
Strube, Werner (2003). “Analyse des Verstehensbegriffs”. In: Hermeneutik: Basistexte zur Einführung in die wissenschaftstheoretischen Grundlagen von Verstehen und Interpretation. Ed. by Axel Bühler. Kolleg Synchron. Synchron, 79–98.
Tong, Xiaoyu, Rochelle Choenni, Martha Lewis, and Ekaterina Shutova (2024). “Metaphor Understanding Challenge Dataset for LLMs”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Association for Computational Linguistics. http://doi.org/10.18653/v1/2024.acl-long.193.
Trott, Sean (2022). How Could We Know if Large Language Models Understand Language? Substack newsletter. The Counterfactual. https://seantrott.substack.com/p/how-could-we-know-if-large-language (visited on 11/21/2025).
Trott, Sean and Cameron Jones (2023). Do Large Language Models Have a “Theory of Mind”? Substack newsletter. The Counterfactual. https://seantrott.substack.com/p/do-large-language-models-have-a-theory (visited on 11/21/2025).
Trott, Sean, Cameron Jones, Tyler Chang, James Michaelov, and Benjamin Bergen (2023). “Do Large Language Models Know What Humans Know?” In: Cognitive Science 47 (7). http://doi.org/10.1111/cogs.13309.
Turing, Alan M. [1950] (2021). Computing Machinery and Intelligence / Können Maschinen denken? Great Papers Philosophie. Translated from the English (1950) and edited by Achim Stephan and Sven Walter, with contributions from members of the Turing Study Project. Reclam.
Uhlmann-Bixterheide, Wilhelm, ed. (1922). Die deutsche Balladen-Chronik. Ein Balladenbuch von deutscher Geschichte und deutscher Art. Ruhfus.
Underwood, Ted (2023). “The Empirical Triumph of Theory”. In: Critical Inquiry - In the Moment. https://critinq.wordpress.com/2023/06/29/the-empirical-triumph-of-theory/ (visited on 11/21/2025).
Wachowiak, Lennart and Dagmar Gromann (2023). “Does GPT-3 Grasp Metaphors? Identifying Metaphor Mappings with Generative Language Models”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki. Association for Computational Linguistics, 1018–1032. http://doi.org/10.18653/v1/2023.acl-long.58.
Walsh, Melanie, Anna Preus, and Elizabeth Gronski (2024). “Does ChatGPT Have a Poetic Style?” In: Proceedings of the Computational Humanities Research Conference 2024. Ed. by Wouter Haverals, Marijn Koolen, and Laure Thompson, 1201–1219. https://ceur-ws.org/Vol-3834/paper122.pdf (visited on 11/21/2025).
Winko, Simone, Stefan Descher, Urania Milevski, Merten Kröncke, Fabian Finkendey, Loreen Dalski, and Julia Wagner (2024). Praktiken des Plausibilisierens. Untersuchungen zum Argumentieren in literaturwissenschaftlichen Interpretationstexten. Universitätsverlag Göttingen. http://doi.org/10.17875/gup2024-2639.
Wittgenstein, Ludwig [1953] (2003). Philosophische Untersuchungen. 1st ed. Suhrkamp.
Xu, Ruoxi, Yingfei Sun, Mengjie Ren, Shiguang Guo, Ruotong Pan, Hongyu Lin, Le Sun, and Xianpei Han (2024). “AI for Social Science and Social Science of AI: A Survey”. In: Information Processing & Management 61 (3). http://doi.org/10.1016/j.ipm.2024.103665.
Yu, Linhao, Qun Liu, and Deyi Xiong (2024). “LFED: A Literary Fiction Evaluation Dataset for Large Language Models”. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Ed. by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue. ELRA and ICCL, 10466–10475. https://aclanthology.org/2024.lrec-main.915/ (visited on 11/21/2025).
Zymner, Rüdiger (2007). “Assonanz”. In: Reallexikon der deutschen Literaturwissenschaft. Ed. by Klaus Weimar, Harald Fricke, and Jan-Dirk Müller. Vol. 1. De Gruyter, 156–157.