<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<!--<?xml-stylesheet type="text/xsl" href="article.xsl"?>-->
<article article-type="research-article" dtd-version="1.2" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id journal-id-type="issn">2940-1348</journal-id>
<journal-title-group>
<journal-title>Journal of Computational Literary Studies</journal-title>
</journal-title-group>
<issn pub-type="epub">2940-1348</issn>
<publisher>
<publisher-name>Universit&#228;ts- und Landesbibliothek Darmstadt</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.48694/jcls.4312</article-id>
<article-categories>
<subj-group>
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Interpretation, Argument, Evaluation</article-title>
<subtitle>A Workflow for Assessing LLM-Generated Interpretations of Poetry</subtitle>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-9177-7645</contrib-id>
<name>
<surname>Pichler</surname>
<given-names>Axel</given-names>
</name>
<email>axel.pichler@univie.ac.at</email>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0009-0003-7242-462X</contrib-id>
<name>
<surname>Endres</surname>
<given-names>Martin</given-names>
</name>
<xref ref-type="aff" rid="aff-2">2</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-3193-6170</contrib-id>
<name>
<surname>Reiter</surname>
<given-names>Nils</given-names>
</name>
<xref ref-type="aff" rid="aff-3">3</xref>
</contrib>
</contrib-group>
<aff id="aff-1"><label>1</label>Department of German Studies, University of Vienna <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ror.org/03prydq77">ROR</ext-link>, Vienna, Austria</aff>
<aff id="aff-2"><label>2</label>Institute for German and Dutch Philology, Freie Universit&#228;t Berlin <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ror.org/046ak2485">ROR</ext-link>, Berlin, Germany</aff>
<aff id="aff-3"><label>3</label>Department of Digital Humanities, University of Cologne <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ror.org/00rcxh774">ROR</ext-link>, Cologne, Germany</aff>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2026-01-14">
<day>14</day>
<month>01</month>
<year>2026</year>
</pub-date>
<pub-date pub-type="collection">
<year>2026</year>
</pub-date>
<volume>5</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>29</lpage>
<history>
<date date-type="received" iso-8601-date="2025-06-11">
<day>11</day>
<month>06</month>
<year>2025</year>
</date>
<date date-type="accepted" iso-8601-date="2025-12-21">
<day>21</day>
<month>12</month>
<year>2025</year>
</date>
<date date-type="published" iso-8601-date="2026-01-14">
<day>14</day>
<month>01</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright: &#x00A9; 2026 The Author(s)</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>The text of this work is released under the Creative Commons license CC BY 4.0 International. You can find the contract text of the license at <uri xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</uri>. The illustrations are excluded from this license, here the copyright lies with the respective rights holder.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://jcls.io/articles/10.48694/jcls.4312/"/>
<abstract>
<p>This paper examines how interpretations of poems generated by LLMs can be evaluated in a way that meets standards from literary studies. To this end, we develop and evaluate a workflow that draws on reference data from literary studies and their argumentative structures when generating interpretations. This enables the generation of interpretations that themselves exhibit such structures and can be evaluated with respect to both their argumentative coherence and literary scholarship standards. Our experiments demonstrate that this workflow can be applied successfully, and that the models under investigation generate reasonable descriptions of the poems, but fail at more abstract interpretative tasks.</p>
</abstract>
<kwd-group>
<kwd>interpretation</kwd>
<kwd>Large Language Models</kwd>
<kwd>generation</kwd>
<kwd>evaluation</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="S1">
<title>1. Introduction</title>
<p>Despite the increasing diversification of literary studies in recent decades, the interpretation of literary texts remains one of its central practices. This is evident not only in the prominence of chapters on interpretation in key introductory works, but also in empirical studies highlighting the prevalence of interpretive articles in scholarly journals.<xref ref-type="fn" rid="n1">1</xref> This focus on interpretation contrasts with current trends in computational literary studies, where machine learning methods such as large language models (LLMs) have been employed primarily for text-analytic questions, which typically involve classification problems such as genre attribution or sentiment analysis.<xref ref-type="fn" rid="n2">2</xref> Classification, the task of assigning previously defined categories to instances, can also be understood as a subform of description that is grounded in a theory and/or taxonomy of the relevant domain &#8211; for example, a theory of literary genres. Interpretation, by contrast, draws on specific theories of meaning &#8211; such as intentionalist or anti-intentionalist approaches &#8211; and, often in conjunction with text descriptions, attributes meanings to texts or text elements. An illustrative example may clarify this distinction: Identifying Hugo von Hofmannsthal&#8217;s poem <italic>Mein Garten</italic> as a sonnet &#8211; on the basis of its adherence to the characteristic features of this formally well-defined genre &#8211; relies on a genre theory but does not presuppose a particular theory of meaning. By contrast, claiming that the poem explores the opposition between art and nature presupposes a theory of meaning that enables the attribution of such semantic properties to the poem.</p>
<p>In this work, we take initial steps towards exploring how interpretations of literary texts generated by LLMs can be evaluated. Here, we refer specifically to instruction fine-tuned LLMs that operate according to the &#8216;prompt-and-generate&#8217; paradigm<xref ref-type="fn" rid="n3">3</xref>, enabling them to generate coherent outputs in response to open-ended textual prompts.</p>
<p>Any attempt to explore the potential of LLMs for generating literary interpretations must contend with a foundational characteristic of literary studies: There are different conceptualizations of what it means to interpret a text. These differing conceptualizations are associated with distinct standards by which actual interpretations are assessed or evaluated. Such standards are often contested within literary theory and are frequently described as theory-dependent. Nevertheless, to evaluate generated interpretations of literary texts in a consistent and transparent manner, an explicitly formulated standard is needed &#8211; ideally one that is accepted independently of specific theoretical presuppositions. In other words, a well-defined set of evaluation metrics is necessary to enable the assessment of LLM-generated interpretations in the first place.</p>
<p>To explore such a set of metrics and address the issue of the theory-dependence of existing practices of interpretation and evaluation in literary studies, we propose a method of generating interpretations that reduces them to their argumentative core. By &#8216;argumentative core&#8217;, we refer to the fundamental argumentative structure that underpins a literary interpretation, independent of its stylistic or rhetorical presentation. While the modes of textualization in literary interpretations vary depending on the approach, it seems largely undisputed that they involve argumentation.</p>
<p>Building on this notion of an argumentative core, this paper seeks to identify an evaluative framework &#8211; drawn from theoretical debates &#8211; and to select and refine criteria derived from this framework to test whether they are suitable for assessing the argumentative core of LLM-generated interpretations. Our primary objective, therefore, lies in the selection and refinement of evaluation criteria and in demonstrating that they can be applied in an intersubjectively consistent manner. By contrast, the actual evaluation of LLM-generated interpretations falls outside the scope of this study. We contend that a meaningful evaluation becomes feasible only when such criteria are explicitly defined and their application ensures a high level of consistency among evaluators.</p>
<p>In addition, we would like to highlight several further limitations of this study in order to prevent potential misunderstandings. First, we do not address the question of how LLMs generate meaning &#8211; nor how this process differs from human meaning-making &#8211; and what implications this has for their alleged &#8216;understanding of language&#8217;.<xref ref-type="fn" rid="n4">4</xref> For heuristic purposes, we adopt what we refer to as a <italic>pretense stance</italic>: We treat LLM outputs as if they were produced by intentional agents, while fully acknowledging that these models do not possess genuine mental states. This interpretive strategy allows for a pragmatically useful engagement with LLM-generated texts, particularly in communicative and evaluative contexts. Conceptually, this stance is grounded in an externalist view of meaning, which holds that meaning does not arise from internal mental representations but from social, pragmatic, and interpretive practices. Within this framework, linguistic outputs are treated as meaningful insofar as they can be situated within communicative contexts and interpreted through interaction. This perspective is compatible with a range of externalist positions in current debates on LLMs and meaning, including accounts following Dennett&#8217;s intentional stance &#8211; which legitimizes mentalistic attributions based on their explanatory utility rather than ontological commitments &#8211; as well as accounts of derived intentionality such as those proposed in Borg (<xref ref-type="bibr" rid="B5">2025</xref>) or Koch (<xref ref-type="bibr" rid="B22">2025</xref>).<xref ref-type="fn" rid="n5">5</xref></p>
<p>A second limitation pertains to our theoretical orientation within literary studies. Just as there is no single literary theory, there is no monolithic discipline of literary studies, but rather a plurality of approaches. Our perspective is rooted in a specific tradition &#8211; namely, the German-language debates on literary theory informed by analytical philosophy: analytical literary theory.<xref ref-type="fn" rid="n6">6</xref> This approach is not characterized by adherence to a specific method but is instead defined by its commitment to scientific standards, conceptual clarity, precise question formulation, and rigorous argumentation (<xref ref-type="bibr" rid="B23">K&#246;ppe 2008</xref>).</p>
<p>In summary, this paper introduces a workflow for generating and evaluating literary interpretations using large language models. We begin by outlining the theoretical background, offering a brief overview of key debates on interpretation within literary theory. Next, we present different evaluation models, from which we select one &#8211; the framework by Strube (<xref ref-type="bibr" rid="B35">1992</xref>) &#8211; and justify this selection. We then address the question of how literary interpretations &#8211; specifically of poetry &#8211; can be generated by instruction-following LLMs in a way that aligns with Strube&#8217;s criteria. To achieve this, we adopt an approach based on the argumentative reconstruction of interpretations, ensuring that the generated texts can be systematically evaluated. For this purpose, we operationalize Strube&#8217;s criteria in detail. Subsequently, we describe the construction of reference/training data based on the argumentative reconstruction of existing poem interpretations. This is followed by an outline of the experimental setup, the presentation of results, and finally, the conclusion, which includes a discussion of limitations and suggestions for future research.</p>
</sec>
<sec id="S2">
<title>2. Interpretation in Literary Theory and CLS</title>
<p>Interpretation is a central concept in literary studies.<xref ref-type="fn" rid="n7">7</xref> The term is used to describe both the act of interpreting and the written results of this act, which can refer to either a single statement about a literary text or a complete essay dedicated to an exegesis. However, the meaning of the term remains a subject of ongoing debate.<xref ref-type="fn" rid="n8">8</xref> B&#252;hler (<xref ref-type="bibr" rid="B9">1999</xref>), for example, describes 17 different uses of the word &#8216;to interpret&#8217; with regard to the exegesis of texts in German. This ambiguity arises from the fact that the meaning of the term varies depending on its context of use and is influenced by related concepts &#8211; such as meaning, text, and work of art &#8211; as well as the theoretical frameworks in which these are embedded. Given this complexity, we refrain from proposing a single, fixed definition of interpretation. Instead, we adopt a scheme developed by G&#246;ran Hermer&#233;n, which provides a structured way to capture its diverse uses. Hermer&#233;n describes &#8216;interpretation&#8217; as the following relation between five variables: &#8220;X interprets Y as Z for U in order to V&#8221; (<xref ref-type="bibr" rid="B19">Hermer&#233;n 1983, 142</xref>). This scheme makes it possible to differentiate between types of interpretation based on the definition of the variables: Depending on which object is interpreted in which way and with which purpose, a different type of interpretation results. According to Hermer&#233;n, the different types of interpretation correspond to different criteria to determine their &#8216;correctness&#8217;.<xref ref-type="fn" rid="n9">9</xref></p>
<p>However, the correctness or truth of interpretation statements is only one of several criteria that can be used to assess interpretations. Literary theory has worked out numerous such criteria, which to some extent were and are always determined by their theoretical and theoretical-historical standpoint. In 1992, Werner Strube proposed a set of criteria for the assessment of interpretations based on the language use and dominant interpretational practices in literary studies.<xref ref-type="fn" rid="n10">10</xref> Strube draws on the distinction between &#8216;Auslegung&#8217; (exegesis) and &#8216;Deutung&#8217; (interpretation) in German: He understands &#8216;Auslegung&#8217; as the use of a specific scheme to interpret parts of a given text. &#8216;Interpretation&#8217;, on the other hand, refers to the combination of several such schemes into a final interpretation that refers to the entire text. Based on this distinction, Strube identifies four dimensions of the given practice of interpretation in literary studies and outlines relevant assessment criteria: 1) the way in which literary texts are described in literary studies, 2) the exegesis of a text, from which 3) the interpretation of a text differs, and 4) the mode of argumentation. For each dimension, he specifies conditions for their successful or unsuccessful realization. For the description in disciplinary terminology of literary studies, these are accuracy, relevance and appropriateness; for the exegesis, plausibility and historical coherence; for the interpretation, specificity, integrity and comprehensiveness; and for the argumentative structure, coherence, unforcedness and freedom from contradiction. 
It is controversial in literary theory whether such criteria can be independent of the guiding theory and the overarching interpretative goals.<xref ref-type="fn" rid="n11">11</xref> However, it should be noted that these debates primarily concern the application of the criteria to a complete interpretation and the practices associated with it, rather than to argumentative cores. Underlying these debates is the fundamental question of whether there are assessment criteria that are valid <italic>in general</italic> and, if so, what these criteria might be.</p>
<p>The determination of generally valid criteria for assessing interpretations is closely tied to the problem of evaluating interpretations, or their arbitrariness. Lutz Danneberg has reconstructed this problem in the form of the following argument (<xref ref-type="bibr" rid="B12">Danneberg 1992, 15</xref>):</p>
<list list-type="bullet">
<list-item><p>If there are no acceptable (or justified) criteria for evaluating interpretations with respect to their validity claims, then interpretations cannot be assessed in terms of their validity claims.</p></list-item>
<list-item><p>If interpretations cannot be assessed in terms of their validity claims, then they are considered to be of equal evaluative rank.</p></list-item>
<list-item><p>If interpretations are considered to be of equal evaluative rank, then their choice is arbitrary.</p></list-item>
</list>
<p>The question of the evaluation of interpretation is accordingly one of the central questions of literary theory.</p>
<p>Given the complexity of these issues and the central role of interpretation in literary studies, it may come as no surprise that, to the best of our knowledge, no attempts have yet been made to develop (quantitative) evaluation metrics for interpretations (generated by LLMs). However, there are first attempts to explore the range of possible criteria for such evaluations and the benchmarking of LLMs&#8217; text-interpretive abilities. One example is Jannidis et al. (<xref ref-type="bibr" rid="B20">2025</xref>): Their study investigates how well contemporary LLMs can &#8220;understand&#8221; poetry by probing nine core aspects of literary analysis &#8211; from meter and rhyme to figurative language and meaning &#8211; across increasing levels of interpretive complexity. The authors show that while LLMs perform well on semantic and interpretive tasks, they struggle with formally grounded operations such as scansion, phonetic pattern recognition, and culturally sensitive context integration. This study positions itself as an exploratory first engagement with the problem of interpretation, and thus as an attempt to delineate the literary- and communication-theoretical foundations for future benchmarking of LLM-generated interpretive statements. By contrast, in the present paper we develop and evaluate an approach that integrates the generation and the evaluation of interpretive statements so closely that a genuinely human &#8211; and ultimately quantifiable &#8211; assessment of them already becomes possible.</p>
<p>In addition, there are experiments from literature didactics that take a different approach by refraining from developing and explicating precisely defined evaluation metrics. For example, Susteck and Perder (<xref ref-type="bibr" rid="B36">2023</xref>) use four canonical German poems to investigate the extent to which ChatGPT 3.5 can cope with writing tasks in high school poetry analysis. The authors found that ChatGPT performs very convincingly, particularly in the generation of interpretation hypotheses that &#8220;link texts with stereotyped, but often appropriate interpretation patterns due to their comparatively high degree of vagueness&#8221; (<xref ref-type="bibr" rid="B36">Susteck and Perder 2023, 12</xref>). <xref ref-type="bibr" rid="B36">Susteck and Perder</xref>&#8217;s approach differs from ours in that 1) the evaluation of the generated texts is based on high school objectives &#8211; such as summary, classification within an epochal context, topic definition, and form analysis &#8211; rather than on explicitly operationalized evaluation criteria derived from a specific literary theory framework, 2) the OpenAI online chat interface is used and not the API, 3) no batch prompting is used, but a dialogical chatbot interaction, and 4) interactive prompting is carried out.</p>
</sec>
<sec id="S3">
<title>3. Generating and Evaluating Interpretations</title>
<p>Our general workflow is depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>. Starting with a poem and existing interpretations, we first derive three argument reconstructions. These reconstructions are then used as examples in the prompt provided to the LLM, which also contains a new poem. The LLMs receive the reconstructed argumentations only in the form of individual statements, without any information about (a) which of Strube&#8217;s levels they correspond to or (b) which argumentative function they fulfill.<xref ref-type="fn" rid="n12">12</xref> The model then generates output in the same style (i.e., as the argumentative core of an interpretation). The generated outputs are subsequently evaluated through manual inspection.</p>
<fig id="F1">
<caption>
<p><bold>Figure 1:</bold> Schematic depiction of the workflow in this paper.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jcls-4312_pichler-g1.png"/>
</fig>
<p>In the following, we will explain this workflow in detail, discuss the rationale behind our choices, examine how it aligns with a specific form of evaluation, and justify why this particular form was selected.</p>
<sec id="S3.1">
<title>3.1 Evaluation of Automatically Generated Interpretations</title>
<p>LLMs are capable of generating texts that resemble interpretations of poems. This requires nothing more than a prompt that, in addition to the request to interpret a poem, contains the poem itself.</p>
<p>In principle, there are different ways to evaluate generated texts. A first possibility is the use of <bold>evaluation metrics from Natural Language Processing:</bold> In Natural Language Processing, the evaluation of generative language models is considered difficult.<xref ref-type="fn" rid="n13">13</xref> Currently, metrics from the field of machine translation are often used. Here, the generated texts are compared with reference texts. These reference texts are the texts that the model should ideally generate. The two most common metrics for evaluating text generated by LLMs are BLEU (<xref ref-type="bibr" rid="B30">Papineni et al. 2001</xref>) and ROUGE (<xref ref-type="bibr" rid="B25">Lin 2004</xref>). BLEU calculates the n-gram overlap between the generated text and the reference text. While BLEU focuses on precision, ROUGE is oriented towards recall and distinguishes between different variants: ROUGE-N, which examines the n-gram overlaps, ROUGE-L, which examines the longest common subsequence instead of the n-gram overlap, and ROUGE-S, which focuses on so-called skip-bigrams.</p>
<p>These metrics are not suitable for our purposes, as they were developed for unstructured text rather than for argument-like structured interpretations, which we aim to evaluate. While it would technically be possible to serialize the reconstruction into a stream of tokens, the linguistic variability of such texts is likely to be quite high. As a result, it is entirely possible for perfectly congruent interpretation arguments to be expressed in different words, making these metrics inadequate for our needs.</p>
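<p>The limitation can be illustrated with a minimal sketch. It assumes whitespace tokenization and computes only clipped n-gram precision (the quantity underlying BLEU) and n-gram recall (the quantity underlying ROUGE-N); it is not a full implementation of either metric, and the function names and the two example statements are hypothetical.</p>
<code language="python">
```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Clipped n-gram precision (BLEU-style) and recall (ROUGE-N-style).

    Simplified illustration: no brevity penalty, no averaging over n.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Each shared n-gram counts at most as often as it occurs in the reference.
    shared = sum(min(count, ref[gram]) for gram, count in cand.items())
    precision = shared / max(sum(cand.values()), 1)
    recall = shared / max(sum(ref.values()), 1)
    return precision, recall


# Two hypothetical statements that make roughly the same interpretive claim.
reference = "the poem contrasts art and nature"
paraphrase = "artifice is set against the natural world in this sonnet"

print(overlap_scores(reference, reference, n=2))   # identical wording
print(overlap_scores(paraphrase, reference, n=2))  # same claim, other words
```
</code>
<p>In this toy case, the paraphrase shares no bigram with the reference, so both scores drop to zero although the two statements express a comparable claim.</p>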
<p>Another possibility is to build on recent praxeological research. Praxeology understands interpretation as one of many practices in everyday literary studies and considers interpretive texts as manifestations of these practices (<xref ref-type="bibr" rid="B28">Martus and Spoerhase 2022</xref>). From this perspective, it would make sense to evaluate LLM-generated interpretations by involving literary studies scholars using the <bold>scientific questionnaire method</bold>.<xref ref-type="fn" rid="n14">14</xref> Such an approach would have the added benefit of not only evaluating but also providing valuable insights into the guiding background assumptions of the discipline. However, the design of these questionnaires would ultimately rely on existing evaluation criteria. For this reason, we have chosen a third option for the present paper: the use of existing <bold>criteria from literary theory/literary studies</bold>.</p>
<p>As explained in <xref ref-type="sec" rid="S2">section 2</xref>, there are catalogs of such criteria, but considering their significance, it is surprising how few of them actually exist. In the following, we will work with those of Werner Strube (<xref ref-type="bibr" rid="B35">1992</xref>). The reasons for this choice are as follows: Firstly, Strube claims to adopt a descriptive approach. His criteria were created on the basis of actual interpretation practice. Secondly, the argumentative structure of the interpretations plays a central role in his catalog of criteria, which makes them particularly suitable for application in the context given here. Thirdly, another advantage of applying Strube&#8217;s criteria to the argumentative cores of interpretations is that the interpretative goals characteristic of interpretations, as described by Hermer&#233;n, can be disregarded. Strube&#8217;s criteria, in the version adapted by us in the following, do not require their specification &#8211; but they do allow for the inclusion of these goals within a modular extension of the cores, if needed. Fourthly, research has already indicated that his criteria are specific enough in relation to the actual practice of interpretation and that the guiding theoretical criteria are largely acceptable within literary studies as well as suitable for operationalization (<xref ref-type="bibr" rid="B24">K&#246;ppe and Winko 2011</xref>). Section <xref ref-type="sec" rid="S3.3">3.3</xref> will be devoted to the latter.</p>
</sec>
<sec id="S3.2">
<title>3.2 Generation of Interpretations Suited for Evaluation</title>
<p>Literary interpretations consist of distinct components such as thesis statements, textual evidence, analytical reasoning, and contextualization. Evaluating such interpretations holistically risks conflating these components, making it difficult to determine which aspects of the interpretation meet the required standards and which do not. For this reason, it is necessary to isolate and evaluate individual components of the generated text in relation to specific criteria. With simple prompt-based generation, it is neither clear (a) which components of the output can or should be evaluated with regard to which criteria or, if such criteria exist, how the generated text should be broken down into its components so that these criteria can be applied to the corresponding components, nor (b) which literary-theoretical assumptions the LLM realizes during generation. To address these challenges, we adopt a procedure that already suggests a certain output structure via the prompt: the generation of argumentative cores of interpretations. This approach simplifies the isolation and evaluation of the individual components. From the perspective of actual interpretative practice in literary studies, this method may seem unconventional, as such practices typically do not adhere to rigid organizational or structural schemes for the interpretations they produce. Nevertheless, we believe that this limitation is outweighed by the possibilities to guide the generation in such a way that the output is structured to align with the expected levels of output components, thereby facilitating a systematic evaluation according to the selected criteria from literary studies.</p>
<p>From a machine learning perspective, it makes sense to enrich the prompts with reference data in order to achieve the above-mentioned goals. In this case, these data should consist of existing interpretations of literary texts. To assess their influence on the generation process and to evaluate texts generated with their help, it is useful to extract their central components in a structured form. To achieve this, we draw on a common practice in dealing with and analyzing scholarly texts: the reconstruction of their central arguments.<xref ref-type="fn" rid="n15">15</xref></p>
<p>Such reconstructions of arguments necessarily go beyond the literal wording of the texts examined. Descher and Petraschka (<xref ref-type="bibr" rid="B16">2018</xref>) identify the following dimensions of argument-reconstruction to which this applies in particular: reformulations, the clarification of text elements that require interpretation, the addition of argumentation steps, the sequence of arguments and the choice of argumentation scheme. Accordingly, reconstructions of arguments are themselves highly interpretative.</p>
<p>The guiding principle in our reconstructions is a specific version of the principle of charity: the aim of reconstructing the strongest possible argument from the texts. To achieve better alignment with Strube&#8217;s evaluation criteria, we base our reconstruction of the arguments on Strube&#8217;s distinction between the three levels of interpretation: description, exegesis, and interpretation. Therefore, we reconstruct only the first three levels of the potentially multi-level argumentation (<xref ref-type="fig" rid="F2">Figure 2</xref>). The lowest level consists of arguments whose premises are textual descriptions (= P<sub>1</sub> &#8211; P<sub>4</sub>). Their conclusions (= C<sub>1</sub> &#8211; C<sub>2</sub>) in turn figure as premises (= P<sub>5</sub> &#8211; P<sub>6</sub>) of the central argument of the interpretation (= C<sub>3</sub>).</p>
<fig id="F2">
<caption>
<p><bold>Figure 2:</bold> Reconstruction of an argument.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jcls-4312_pichler-g2.png"/>
</fig>
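<p>The layered structure described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class and function names are hypothetical, the statement texts are placeholders, and the grouping of P<sub>1</sub> and P<sub>2</sub> under C<sub>1</sub> and of P<sub>3</sub> and P<sub>4</sub> under C<sub>2</sub> is assumed for the example rather than prescribed by the reconstruction method.</p>
<code language="python">
```python
from dataclasses import dataclass


@dataclass
class Statement:
    label: str  # e.g. "P1" or "C3"
    text: str   # placeholder wording, not taken from a real interpretation


@dataclass
class Argument:
    premises: list
    conclusion: Statement


@dataclass
class Reconstruction:
    lower: list        # arguments from textual descriptions to C1 and C2
    central: Argument  # argument from P5 and P6 to C3


def build_example():
    """Build a placeholder reconstruction with the shape P1-P4, C1-C2, C3."""
    p = {i: Statement(f"P{i}", f"description {i}") for i in range(1, 5)}
    c1 = Statement("C1", "exegetical claim 1")
    c2 = Statement("C2", "exegetical claim 2")
    c3 = Statement("C3", "central interpretive thesis")
    # Assumed grouping: two descriptive premises per lower-level argument.
    lower = [Argument([p[1], p[2]], c1), Argument([p[3], p[4]], c2)]
    # C1 and C2 are reused as premises P5 and P6 of the central argument.
    central = Argument([Statement("P5", c1.text), Statement("P6", c2.text)], c3)
    return Reconstruction(lower, central)


def is_chained(rec):
    """Check that every lower-level conclusion reappears as a central premise."""
    central_texts = {s.text for s in rec.central.premises}
    return all(arg.conclusion.text in central_texts for arg in rec.lower)


def flatten(rec):
    """Serialize to plain statements without labels or level information."""
    lines = []
    for arg in rec.lower:
        lines.extend(s.text for s in arg.premises)
        lines.append(arg.conclusion.text)
    lines.append(rec.central.conclusion.text)
    return lines


rec = build_example()
print(is_chained(rec))
print("\n".join(flatten(rec)))
```
</code>
<p>The <monospace>flatten</monospace> helper mirrors the fact that the reconstructions are passed to the LLM as individual statements only, without their level or argumentative function.</p>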
<p>Reconstructions of the central arguments of an interpretation are not uncontroversial in literary studies. They are considered reductionist, as they ignore numerous aspects that can also be significant for the persuasive power of interpretations, such as their rhetorical and stylistic design (<xref ref-type="bibr" rid="B1">Albrecht and Danneberg 2021</xref>; <xref ref-type="bibr" rid="B16">Descher and Petraschka 2018</xref>). Reconstructions that focus on the core structure and theses therefore differ in several respects from those that are more closely aligned with the subject-specific culture and practice of interpretation in literary studies, such as those recently developed for the interpretation of canonical narrative texts in the German-speaking world as part of the ArguLit project at the University of G&#246;ttingen (<xref ref-type="bibr" rid="B39">Winko et al. 2024</xref>): While the latter strives for a &#8220;&#8216;dense description&#8217; of the argumentative contexts as well as the characteristics of the interpretative texts&#8221;, taking into account &#8220;above all the diversity and linguistic complexity of the means of representation used&#8221; and accordingly reconstructs them on &#8220;a hermeneutic basis&#8221; (<xref ref-type="bibr" rid="B39">Winko et al. 2024, 43</xref>), the method used here is far more economical, focusing on the basic argumentative structure of the interpretative texts analyzed. Accordingly, we refer to the results of our argument reconstructions in the following as &#8216;argument-like structured interpretation&#8217;.</p>
<p>It is important to note that such reconstructions serve only as proxies for the actual textual practices of literary studies, as they lack the detail and comprehensiveness of reconstructions that account for all dimensions of literary arguments. Nevertheless, we employ them for several pragmatic purposes. First, they allow literary interpretations to be distilled to their argumentative core, presenting them in a format that is both generic and comparable. This reduction makes them significantly easier to evaluate and contrast than the often rhetorically and stylistically complex source texts. Second, these reconstructions can be flexibly expanded with additional layers of information, such as the types of arguments employed, key textual passages, or meta-information, including the interpretation&#8217;s objective or its underlying literary-theoretical framework. This makes them a modular and adaptable format for analyzing and comparing interpretive practices.</p>
</sec>
<sec id="S3.3">
<title>3.3 Operationalization of Werner Strube&#8217;s Criteria for Evaluating an Interpretation</title>
<p>The criteria presented in Strube (<xref ref-type="bibr" rid="B35">1992</xref>) address different levels of the interpretive process: (1) the relationship between statements about the interpreted text and interpretive claims; (2) the relationship between these interpretive claims and higher-level interpretive statements; and (3) their argumentative connection. These criteria are expressed as single or multi-place predicates, which can be attributed either to individual statements within interpretations or to the relationships between them. In the following, we outline the stages of the interpretation process and specify the corresponding evaluation criteria. Each criterion is reformulated as a conditional rule, defining the conditions under which a given predicate may be attributed to (parts of) an interpretation. Where necessary, Strube&#8217;s original formulations are adapted to fit our approach, which focuses primarily on the reconstruction of argumentative structures.</p>
<p>Descriptions are</p>
<list list-type="bullet">
<list-item><p><italic>empirically correct or accurate</italic> if there is a correspondence between the description and the poem, i.e., if what the description claims is actually present in the text, and/or</p></list-item>
<list-item><p><italic>appropriate</italic> if the description serves as a premise for one of the arguments of the exegesis.</p></list-item>
</list>
<p>Exegeses of a text are</p>
<list list-type="bullet">
<list-item><p><italic>plausible</italic> if each &#8220;is sufficiently justified in the description assigned to it&#8221;, that is, if the exegesis is supported by the premises preceding it.<xref ref-type="fn" rid="n16">16</xref></p></list-item>
</list>
<p>Interpretations of a text are</p>
<list list-type="bullet">
<list-item><p><italic>integrative</italic>, if the conclusions of all subarguments flow into the final argumentation as premises, and/or</p></list-item>
<list-item><p><italic>specific enough</italic>, if the conclusion of all subarguments is not vague and general, but instead makes statements about the subject that are as precise and detailed as possible.</p></list-item>
</list>
<p>The argumentation is</p>
<list list-type="bullet">
<list-item><p><italic>free of contradictions</italic> if it does not contain any statements that are in logical opposition to each other, and/or</p></list-item>
<list-item><p><italic>coherent or ordered</italic> if &#8220;the exegesis is grounded in the description, the interpretation grounded in the exegesis&#8221; (<xref ref-type="bibr" rid="B35">Strube 1992, 198</xref>).</p></list-item>
</list>
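<p>Two of the structural criteria listed above lend themselves to a mechanical check once a reconstruction is encoded as data. The following is a minimal illustrative sketch and not part of the workflow described in this paper: the class <monospace>Argument</monospace>, both function names, and the assignment of P<sub>1</sub>&#8211;P<sub>4</sub> to C<sub>1</sub> and C<sub>2</sub> are assumptions made for the example. It treats a description as appropriate when it is used as a premise of some exegetical argument, and an interpretation as integrative when every subargument conclusion reappears as a premise of the final argument.</p>
<code language="python">
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    """One reconstructed argument, encoded by statement labels (hypothetical format)."""
    premises: tuple[str, ...]
    conclusion: str


def is_integrative(subarguments: list[Argument], final: Argument) -> bool:
    """True if every subargument conclusion is a premise of the final argument."""
    return all(sub.conclusion in final.premises for sub in subarguments)


def inappropriate_descriptions(descriptions: set[str],
                               subarguments: list[Argument]) -> set[str]:
    """Return the descriptions that serve as a premise of no exegetical argument."""
    used = {premise for sub in subarguments for premise in sub.premises}
    return descriptions - used


# Toy structure loosely mirroring Figure 2: P1 to P4 are descriptions,
# C1 and C2 are exegeses, C3 is the central interpretive thesis.
# Which premises support which conclusion is an assumption of this example.
subs = [Argument(("P1", "P2"), "C1"), Argument(("P3", "P4"), "C2")]
final = Argument(("C1", "C2"), "C3")

print(is_integrative(subs, final))
print(inappropriate_descriptions({"P1", "P2", "P3", "P4"}, subs))
```
</code>
<p>Criteria such as plausibility or specificity concern the content of the statements and would still require human judgment; the sketch only covers what follows from the structure of the reconstruction.</p>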
<p>The three levels of evaluation correspond to the three levels of reconstruction with which we work (<xref ref-type="fig" rid="F2">Figure 2</xref>). It follows from the frame of reference of the individual criteria that, when evaluating an interpretation that aims to provide an overall interpretation of a text or poem, different criteria apply at different levels.<xref ref-type="fn" rid="n17">17</xref> Criteria related to description apply exclusively to the premises of the reconstructed arguments; criteria relating to exegesis address the relationship between premises and material rules of inference at the second level; and criteria regarding interpretation pertain to the first or top level of the reconstructed argument. Strube himself, however, does not explicitly include material rules of inference in his list of criteria, likely because these are rarely made explicit in the actual practice of interpretation. By &#8216;material rules of inference&#8217;, we refer to domain-specific principles of reasoning that connect premises and conclusions based on substantive knowledge.<xref ref-type="fn" rid="n18">18</xref> Given their central importance for reconstructing the argument structure of interpretations, we have supplemented Strube&#8217;s list by incorporating such explicit rules of inference. We reconstruct them as conditionals &#8211; that is, if-then sentences &#8211; whose components consist of the generalized premises or theses. These are considered <italic>acceptable</italic> if they are &#8220;collective convictions that have been accepted by the majority and/or by experts in the course of previous argumentation&#8221; (<xref ref-type="bibr" rid="B39">Winko et al. 2024, 41</xref>). It also seems reasonable to locate Strube&#8217;s criterion of historical appropriateness here, which for him pertains to the so-called interpretive schemata.
According to this criterion, statements are <italic>historically coherent</italic> if they &#8220;correspond to what the author knew and could therefore have meant&#8221; (<xref ref-type="bibr" rid="B35">Strube 1992, 192</xref>).</p>
<p>Because we work exclusively with reconstructed arguments, the following restrictions must be applied in the selection and use of Strube&#8217;s criteria. We have omitted four of them: First, it is not possible to determine the <italic>relevance</italic> of the text-describing premises based solely on the reconstructions, as the literary-theoretical method is not explicitly mentioned within them.<xref ref-type="fn" rid="n19">19</xref> This also renders the category of <italic>comprehensiveness</italic> obsolete, as it depends on the category of <italic>relevance</italic>. Second, it can be assumed that the premises supporting those arguments whose material rules of inference have been interpretatively inferred by us will be <italic>appropriate</italic>, since the construction of their rules of inference is based on precisely these premises. The same applies, third, to the category of <italic>integrity</italic>: If the final material rule of inference is an inferred one, it is already reconstructed based on the aforementioned criterion. Fourth, we omit the category of <italic>unforcedness</italic>, as it conflicts with our understanding of the principle of charity: Our goal is to reconstruct the strongest possible arguments, which is why we exclude premises that cannot be integrated into the reconstruction. From a different perspective, this might appear as forced.</p>
</sec>
</sec>
<sec id="S4">
<title>4. Reconstructing Core Arguments</title>
<p>The experiments presented in this paper utilize reference data for the generation and evaluation of interpretations of poems. In the given case, reference data are interpretations of literary texts that are representative (not in a statistical sense) of the interpretation practices within the discipline. An interpretation can be considered representative if a) it is used in teaching or b) it is frequently cited. The former applies to texts from the Reclam publishing house. The Reclam Verlag is a German publishing house renowned for its pocket-sized editions of classic literature and accompanying interpretations. It plays a significant role in German education by making essential literary works accessible and affordable for students and readers.</p>
<p>At the end of the 1990s, Reclam published collections of interpretative texts on works by 12 canonical German-language poets as a series of collected interpretations of poems entitled <italic>Gedichte und Interpretationen</italic>. From this series, we have selected three interpretations: Jochen Schmidt&#8217;s interpretation of Friedrich H&#246;lderlin&#8217;s <italic>H&#228;lfte des Lebens</italic> (Schmidt <xref ref-type="bibr" rid="B33">1984</xref>), Hans-Georg Kemper&#8217;s interpretation of Georg Trakl&#8217;s <italic>Im Winter</italic> (Kemper <xref ref-type="bibr" rid="B21">1999</xref>) and Marco Meli&#8217;s interpretation of Gottfried Benn&#8217;s <italic>Der S&#228;nger</italic> (Meli <xref ref-type="bibr" rid="B29">1997</xref>). The guiding selection criteria were that (a) the poems analyzed should be described as comprehensively as possible in the context of the interpretation and (b) these descriptions should serve as premises in the actual interpretation.</p>
<p>To strengthen the influence of existing interpretations on the generation process and to establish a framework for the systematic evaluation of the generated texts, we employ argumentative reconstructions of these interpretations (see <xref ref-type="sec" rid="S3">subsection 3.2</xref>). In reconstructing the arguments, we follow the recommended procedure in Brun and Hirsch Hadorn (<xref ref-type="bibr" rid="B8">2021</xref>) and supplement it with insights from Winko et al. (<xref ref-type="bibr" rid="B39">2024</xref>): The reconstruction process begins with a close reading of the text to be interpreted, followed by the development of a structured overview of the interpretation texts. Based on this overview, we identify the central thesis along with its supporting premises, reformulate unclear, incomplete, or inconsistent statements, and add missing premises or conclusions where necessary. In doing so, we balance two opposing principles: On the one hand, we aim to stay as close as possible to the original formulations; on the other, we seek to strengthen the reconstructed arguments to ensure they provide sound and coherent reasoning. This tension is particularly relevant when adding missing rules of inference.</p>
<p>Such domain-specific rules of inference are, according to Winko et al. (<xref ref-type="bibr" rid="B39">2024, 263</xref>), &#8220;assumptions of interpreters [&#8230;] that underlie the plausibilization of their interpretative hypotheses, but usually represent the general framework assumptions or rules of the game for the plausibilization of interpretative hypotheses as implicit presuppositions that are potentially shared by many representatives of the subject&#8221;. As implicit presuppositions, these rules are typically not explicitly formulated in the interpretation texts themselves and must therefore be supplemented in the reconstruction process.<xref ref-type="fn" rid="n20">20</xref></p>
<p>When adding inference rules, we proceeded as follows: After isolating the main thesis of an interpretation and identifying its supporting premises, we first examined which inference rule could be supplemented with minimal intervention if no explicit rule was provided. In this process, we considered not only deductive arguments but also inductive reasoning and inference to the best explanation. For instance, if an interpretation argues &#8211; based on close readings &#8211; that the two stanzas of a poem stand in a relationship of allegory and reflection, we would reconstruct the argument as one from circumstantial evidence rather than explicitly formulating a deductive inference rule. Only when no inductive reconstruction was possible based on the given premises and conclusion did we introduce a conditional inference rule, adhering to the principles of argumentation reconstruction outlined in Brun and Hirsch Hadorn (<xref ref-type="bibr" rid="B8">2021</xref>).</p>
<p>In the following, we use Hans-Georg Kemper&#8217;s interpretation of Georg Trakl&#8217;s poem <italic>Im Winter</italic> as an example to illustrate the procedure of argumentative reconstruction of an interpretation of a poem. Kemper&#8217;s interpretation is divided into five parts, three of which focus on a particular dimension of the poem: After a brief introduction (<xref ref-type="bibr" rid="B21">Kemper 1999, 43</xref>), in which Kemper articulates his three central hypotheses, the first part (pp. 44&#8211;48) is devoted to the description and exegesis of Trakl&#8217;s expressionistic sequential style (Reihenstil). The second part (pp. 48&#8211;55) examines the sound-symbolic, motivic and structural repetitions of the poem, while the third part (pp. 55&#8211;58) explores the intra- and intertextual references to the rest of Trakl&#8217;s lyrical oeuvre. These dimensions are brought together concisely in an overall interpretation (p. 58).</p>
<p>Kemper opens his article with the following hypotheses: Trakl&#8217;s <italic>Im Winter</italic> belongs (1) &#8220;to the early examples of the expressionistic sequential style&#8221; and breaks with the characteristics of the classical and romantic tradition of German poetry. It simultaneously realizes (2) &#8220;the poetic design of a referenceable winter image of high sensual plasticity&#8221;, which, however (3) &#8220;through its sensual charge and connotative approximation of the motifs&#8221; leads these motifs &#8220;to lose their everyday linguistic meaning and an autonomization of the poetic texture setting in&#8221;.</p>
<p>Each of the following three sections is dedicated to the development and support of one of these three hypotheses. The first part explores in detail the realization of the expressionistic sequential style in Trakl&#8217;s poem and examines its consequences in relation to classical-romantic German poetry. The second part compares the poem with Bruegel&#8217;s <italic>The Hunters in the Snow</italic> to show that Trakl&#8217;s poem, like Bruegel&#8217;s painting, is characterized by &#8220;haunting plasticity and suggestiveness&#8221;. According to Kemper, the sound symbolism &#8211; especially assonances and alliterations &#8211; as well as motivic and structural repetitions contribute to this effect. In the third part, Kemper shows that &#8220;the multiplicity of the image parts and the approximation of the motifs [&#8230;] promoted by the form causes a tendency towards the autonomization of the vocabulary that runs counter to its referentiality&#8221;. This multiplicity is a result of the Trakl-specific connotations established throughout his lyrical oeuvre. In his conclusion, Kemper unites these three lines of argument to assert that Trakl&#8217;s poem combines the &#8220;destruction of traditional poetic meaning and the construction of an autonomous world of signs typical of Trakl&#8221; in such a way that their opposition is &#8220;&#8216;suspended&#8217; in sense of a refusal of meaning&#8221;.</p>
<p>If one attempts an argumentative reconstruction of Kemper&#8217;s interpretation, one is confronted with a complex argument. Kemper&#8217;s main thesis &#8211; and thus the conclusion of this argument &#8211; is the assertion that Trakl&#8217;s poem ultimately eludes a clear specification of meaning through the interplay of different principles of representation and form. The justification of this thesis can be reconstructed as a four-part argument (shown in Figures 3&#8211;6 in <xref ref-type="bibr" rid="A1">Appendix 1</xref>), with each part consisting of subarguments that have their own intermediate conclusions. The conclusions of the first three main arguments then form the premises of Kemper&#8217;s central argument. However, the reconstruction necessarily adopts substantive theoretical assumptions &#8211; concerning, for instance, the integrability of meaning levels and the criteria of definable meaning &#8211; that are not independently argued for in the original text but are nonetheless carried over into the reconstructed argument without being made explicit.</p>
</sec>
<sec id="S5">
<title>5. Experiments</title>
<sec id="S5.1">
<title>5.1 Experimental Setup</title>
<p>We conduct experiments with one LLM: Anthropic&#8217;s Claude-Sonnet-4.5<xref ref-type="fn" rid="n21">21</xref>. The model was selected by manually comparing the output of three different LLMs. The key reason for selecting Claude was its ability to account for the structure of the argumentation reconstructions of the interpretations without rigidly adhering to the semantics of individual segments of the prompted examples, unlike other models. We worked with the default temperature of 1, as this yielded the best results in manual, qualitative inspection. All inputs and outputs were generated in German, as we worked with German poems. Additionally, the prompt template already incorporated the modular structure that we consider useful for the continuation of our experiments. For instance, the titles of the poems were entered separately, which allows for future experiments to generate interpretations with or without the inclusion of poem titles. The input to the model was a prompt consisting of a simple task description, an example of an argument-like interpretation and the corresponding poem as well as the poem to be interpreted:</p>
<code>You are a literary scholar.</code>
<code>Interpret the following poem in an argumentative form.</code>
<code>- - -</code>
<code>### Orientation:</code>
<code>Below you will find an example that serves only as a structural template, not as a content template.</code>
<code>Use the argumentative structure as a guide, but develop new arguments that refer exclusively to the new poem.</code>
<code></code>
<code>### Example (structure template only)</code>
<code>Title: {title}</code>
<code>Example: {poem}</code>
<code></code>
<code>Interpretation (example): {interpretation}</code>
<code>- - -</code>
<code>### New Poem</code>
<code>Title: {title_x}</code>
<code>Poem: {poem_x}</code>
<code>- - -</code>
<code>### Interpretation:</code>
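<p>How such a template can be filled programmatically is sketched below. This is an illustration under stated assumptions rather than the code used in the experiments: the function <monospace>build_prompt</monospace> and the dictionary keys are hypothetical, the template is the English rendering shown above (the actual prompts were in German), and bracketed placeholder strings stand in for the poems and reconstructions.</p>
<code language="python">
```python
PROMPT_TEMPLATE = """You are a literary scholar.
Interpret the following poem in an argumentative form.
- - -
### Orientation:
Below you will find an example that serves only as a structural template, not as a content template.
Use the argumentative structure as a guide, but develop new arguments that refer exclusively to the new poem.

### Example (structure template only)
Title: {title}
Example: {poem}

Interpretation (example): {interpretation}
- - -
### New Poem
Title: {title_x}
Poem: {poem_x}
- - -
### Interpretation:"""


def build_prompt(example: dict[str, str], target: dict[str, str]) -> str:
    """Fill the one-shot template with one reference example and one target poem."""
    return PROMPT_TEMPLATE.format(
        title=example["title"],
        poem=example["poem"],
        interpretation=example["interpretation"],
        title_x=target["title"],
        poem_x=target["poem"],
    )


# Placeholder strings stand in for the actual poem texts and reconstructions.
example = {
    "title": "Im Winter",
    "poem": "[poem text]",
    "interpretation": "[argument-like structured interpretation]",
}
target = {"title": "Die gestundete Zeit", "poem": "[poem text]"}

prompt = build_prompt(example, target)
print(prompt)
```
</code>
<p>Because the titles are passed as separate fields, a variant without poem titles would only require a second template string, which is the kind of modularity mentioned above.</p>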
<p>During the creation of the prompt templates, we deliberately avoided extensive iterative prompt engineering, as without an algorithmically implemented evaluation procedure and a reliable gold standard data set, prompt optimization becomes prohibitively time-consuming and thus futile (<xref ref-type="bibr" rid="B32">Pichler et al. 2025</xref>). As examples (consisting of a poem and its interpretation in the form of an argument-like interpretation), we utilized the three argument-type reconstructions described in <xref ref-type="sec" rid="S4">section 4</xref>, hereafter referred to as reference data. These examples were also employed to iteratively refine the evaluation scheme (cf. <xref ref-type="sec" rid="S3">subsection 3.3</xref>). As test data, we selected six poems &#8211; three canonical works and three more contemporary pieces. These are Johann Wolfgang von Goethe&#8217;s <italic>&#220;ber allen Gipfeln</italic>, Hugo von Hofmannsthal&#8217;s <italic>Manche freilich &#8230;</italic>, Ingeborg Bachmann&#8217;s <italic>Die gestundete Zeit</italic>, Friederike Mayr&#246;cker&#8217;s <italic>was brauchst du</italic>, Durs Gr&#252;nbein&#8217;s <italic>Die leeren Zeichen 19</italic> and Elfriede Gerstl&#8217;s <italic>balance - balance</italic>. Each of the six poems was interpreted using the template above as a one-shot prompt, with a different reference reconstruction used as an example in each run (in the following marked with 1: Schmidt (<xref ref-type="bibr" rid="B33">1984</xref>), 2: Kemper (<xref ref-type="bibr" rid="B21">1999</xref>), 3: Meli (<xref ref-type="bibr" rid="B29">1997</xref>)). Considering the three argument-like structured interpretations that served as examples, this approach resulted in three interpretations per poem, yielding a total of 18 generated interpretations.
These interpretations were first assigned to Strube&#8217;s levels and then evaluated on the basis of the criteria developed in <xref ref-type="sec" rid="S3">subsection 3.3</xref>, with the first and second author of this paper acting as annotators and rating each interpretation statement on a four-point Likert scale.<xref ref-type="fn" rid="n22">22</xref> To verify the consistency in the application of the criteria, we calculate inter-annotator agreement (Cohen&#8217;s Kappa; <xref ref-type="bibr" rid="B11">Cohen 1960</xref>) as well as the percentage agreement.</p>
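<p>For readers less familiar with the two agreement measures, the following sketch shows how they can be computed for two annotators. It assumes the unweighted form of Cohen&#8217;s kappa, (p<sub>o</sub> &#8211; p<sub>e</sub>) / (1 &#8211; p<sub>e</sub>), with chance agreement p<sub>e</sub> derived from each annotator&#8217;s marginal score distribution; whether a weighted variant would be more suitable for ordinal Likert data is left open here. The function names and the ratings are invented for illustration and are not the study&#8217;s data.</p>
<code language="python">
```python
from collections import Counter


def percent_agreement(a: list[int], b: list[int]) -> float:
    """Share of items, in percent, on which both annotators chose the same score."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Unweighted Cohen's kappa for two annotators rating the same items.

    Undefined (division by zero) if chance agreement equals 1, i.e. if both
    annotators used one and the same score for every item.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: sum over scores of the product of marginal proportions.
    p_e = sum(counts_a[score] * counts_b[score] for score in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Invented ratings on a four-point scale for eight statements.
annotator_1 = [4, 3, 3, 2, 4, 1, 3, 4]
annotator_2 = [4, 3, 2, 2, 4, 1, 3, 3]

print(percent_agreement(annotator_1, annotator_2))
print(round(cohens_kappa(annotator_1, annotator_2), 4))
```
</code>
<p>In this toy example the annotators agree on six of eight statements (75%), while kappa is lower (about 0.65) because part of that agreement would be expected by chance.</p>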
<p>Subsequently, we analyze the generated argument-like structured interpretations by examining the agreement with regard to the different levels of argumentation according to Strube &#8211; i.e., description, exegesis, interpretation and rule of inference &#8211; and the correlation between agreement and the average Likert scores per annotator per level of argumentation.</p>
</sec>
<sec id="S5.2">
<title>5.2 Results</title>
<p><bold>Consistency of Evaluation Criteria</bold> (<xref ref-type="table" rid="T1">Table 1</xref>): With regard to the individual argument-like interpretations, we observe an average inter-annotator agreement of 0.74. The average standard deviations of the Likert scores are 0.61 and 0.64 for the two annotators, respectively. Taken together, these scores indicate that the annotators reached a solid and reliable agreement. The moderate standard deviations suggest that, while there was some variability in the ratings, the agreement remained within an acceptable range, which supports the robustness of the annotation results &#8211; but they also show that the full Likert range was rarely used. Still, this indicates that the operationalization of Strube&#8217;s evaluation criteria is reasonably reliable and consistent, given the complexity of the annotated reasoning.</p>
<table-wrap id="T1">
<caption>
<p><bold>Table 1:</bold> IAA, Average, and Standard Deviation of Likert scales for both annotators across argument-like structured interpretations.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top" rowspan="3">Name</td>
<td align="center" valign="top" rowspan="3">IAA</td>
<td align="center" valign="top" colspan="2">Annotator 1</td>
<td align="center" valign="top" colspan="2">Annotator 2</td>
</tr>
<tr>
<td colspan="2"><hr/></td>
<td colspan="2"><hr/></td>
</tr>
<tr>
<td align="center" valign="top">Mean</td>
<td align="center" valign="top">Std. Dev.</td>
<td align="center" valign="top">Mean</td>
<td align="center" valign="top">Std. Dev.</td>
</tr>
<tr>
<td colspan="6"><hr/></td>
</tr>
<tr>
<td align="left" valign="top">Gr&#252;nbein 3</td>
<td align="right" valign="top">0.9397</td>
<td align="right" valign="top">3.0800</td>
<td align="right" valign="top">0.4000</td>
<td align="right" valign="top">3.0400</td>
<td align="right" valign="top">0.3512</td>
</tr>
<tr>
<td align="left" valign="top">Goethe 1</td>
<td align="right" valign="top">0.8622</td>
<td align="right" valign="top">3.5250</td>
<td align="right" valign="top">0.6400</td>
<td align="right" valign="top">3.4359</td>
<td align="right" valign="top">0.6405</td>
</tr>
<tr>
<td align="left" valign="top">Gr&#252;nbein 1</td>
<td align="right" valign="top">0.8385</td>
<td align="right" valign="top">3.0385</td>
<td align="right" valign="top">0.4455</td>
<td align="right" valign="top">3.1538</td>
<td align="right" valign="top">0.5435</td>
</tr>
<tr>
<td align="left" valign="top">Goethe 3</td>
<td align="right" valign="top">0.8300</td>
<td align="right" valign="top">2.5909</td>
<td align="right" valign="top">0.5032</td>
<td align="right" valign="top">2.5455</td>
<td align="right" valign="top">0.5958</td>
</tr>
<tr>
<td align="left" valign="top">Goethe 2</td>
<td align="right" valign="top">0.8030</td>
<td align="right" valign="top">2.9574</td>
<td align="right" valign="top">1.0623</td>
<td align="right" valign="top">2.9783</td>
<td align="right" valign="top">1.0644</td>
</tr>
<tr>
<td align="left" valign="top">Hofmannsthal 1</td>
<td align="right" valign="top">0.7871</td>
<td align="right" valign="top">3.6970</td>
<td align="right" valign="top">0.4667</td>
<td align="right" valign="top">3.6970</td>
<td align="right" valign="top">0.4667</td>
</tr>
<tr>
<td align="left" valign="top">Hofmannsthal 2</td>
<td align="right" valign="top">0.7824</td>
<td align="right" valign="top">3.4130</td>
<td align="right" valign="top">0.5803</td>
<td align="right" valign="top">3.4348</td>
<td align="right" valign="top">0.5012</td>
</tr>
<tr>
<td align="left" valign="top">Mayr&#246;cker 2</td>
<td align="right" valign="top">0.7753</td>
<td align="right" valign="top">3.2388</td>
<td align="right" valign="top">0.7404</td>
<td align="right" valign="top">3.2985</td>
<td align="right" valign="top">0.7591</td>
</tr>
<tr>
<td align="left" valign="top">Hofmannsthal 3</td>
<td align="right" valign="top">0.7620</td>
<td align="right" valign="top">3.8286</td>
<td align="right" valign="top">0.4528</td>
<td align="right" valign="top">3.7143</td>
<td align="right" valign="top">0.5186</td>
</tr>
<tr>
<td align="left" valign="top">Gr&#252;nbein 2</td>
<td align="right" valign="top">0.7219</td>
<td align="right" valign="top">2.8387</td>
<td align="right" valign="top">0.6323</td>
<td align="right" valign="top">2.8226</td>
<td align="right" valign="top">0.6408</td>
</tr>
<tr>
<td align="left" valign="top">Bachmann 1</td>
<td align="right" valign="top">0.7157</td>
<td align="right" valign="top">2.8621</td>
<td align="right" valign="top">0.6394</td>
<td align="right" valign="top">3.0000</td>
<td align="right" valign="top">0.7071</td>
</tr>
<tr>
<td align="left" valign="top">Bachmann 2</td>
<td align="right" valign="top">0.7123</td>
<td align="right" valign="top">3.2653</td>
<td align="right" valign="top">0.6701</td>
<td align="right" valign="top">3.3265</td>
<td align="right" valign="top">0.6579</td>
</tr>
<tr>
<td align="left" valign="top">Mayr&#246;cker 3</td>
<td align="right" valign="top">0.6883</td>
<td align="right" valign="top">3.2653</td>
<td align="right" valign="top">0.5692</td>
<td align="right" valign="top">3.2653</td>
<td align="right" valign="top">0.6382</td>
</tr>
<tr>
<td align="left" valign="top">Gerstl 1</td>
<td align="right" valign="top">0.6779</td>
<td align="right" valign="top">2.7097</td>
<td align="right" valign="top">0.7829</td>
<td align="right" valign="top">2.7419</td>
<td align="right" valign="top">0.8152</td>
</tr>
<tr>
<td align="left" valign="top">Gerstl 2</td>
<td align="right" valign="top">0.6647</td>
<td align="right" valign="top">2.5942</td>
<td align="right" valign="top">0.8964</td>
<td align="right" valign="top">2.5507</td>
<td align="right" valign="top">0.9477</td>
</tr>
<tr>
<td align="left" valign="top">Bachmann 3</td>
<td align="right" valign="top">0.6385</td>
<td align="right" valign="top">2.7931</td>
<td align="right" valign="top">0.4123</td>
<td align="right" valign="top">3.1951</td>
<td align="right" valign="top">0.6411</td>
</tr>
<tr>
<td align="left" valign="top">Gerstl 3</td>
<td align="right" valign="top">0.5842</td>
<td align="right" valign="top">2.5556</td>
<td align="right" valign="top">0.6157</td>
<td align="right" valign="top">2.7222</td>
<td align="right" valign="top">0.4609</td>
</tr>
<tr>
<td align="left" valign="top">Mayr&#246;cker 1</td>
<td align="right" valign="top">0.5823</td>
<td align="right" valign="top">2.8750</td>
<td align="right" valign="top">0.5367</td>
<td align="right" valign="top">2.8750</td>
<td align="right" valign="top">0.6124</td>
</tr>
<tr>
<td align="left" valign="top">Average</td>
<td align="right" valign="top">0.7426</td>
<td align="right" valign="top">3.0627</td>
<td align="right" valign="top">0.6137</td>
<td align="right" valign="top">3.0999</td>
<td align="right" valign="top">0.6423</td>
</tr>
<tr>
<td colspan="6"><hr/></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Cohen&#8217;s Kappa and Average Likert Values per Interpretation and their Ratio</bold> (<xref ref-type="table" rid="T1">Table 1</xref>): The analysis reveals that there is substantial item-level variability. However, this variability does not map straightforwardly onto individual authors. For instance, texts by Gr&#252;nbein span a wide range, from comparatively moderate agreement (0.72 for Gr&#252;nbein 2) to the highest observed IAA overall (0.94 for Gr&#252;nbein 3). Mayr&#246;cker&#8217;s texts likewise show marked internal variation, ranging from 0.58 (Mayr&#246;cker 1) to 0.78 (Mayr&#246;cker 2). Bachmann also exhibits noticeable spread (0.64&#8211;0.72), while Gerstl&#8217;s texts cluster at the lower end of the distribution but still vary substantially (0.58&#8211;0.68). By contrast, Hofmannsthal&#8217;s items show relatively stable and consistently high agreement (0.76&#8211;0.79), and Goethe&#8217;s texts likewise fall within a comparatively narrow and elevated range (0.80&#8211;0.86).</p>
<p>The Likert means for both annotators cluster between the middle and the higher end of the scale (approximately 2.5&#8211;3.8), and the corresponding standard deviations are relatively homogeneous, mostly between about 0.35 and 1.06. The two annotators show similar dispersion across items, and there is no obvious systematic association between higher IAA and either higher or lower variance in the ratings. Likewise, there is no clear monotonic relationship between IAA and the level of the mean judgments themselves. Overall, the data suggest nuanced, item-specific differences in how interpretable or stable particular argument-like interpretations are, but they do not reveal strong, easily generalizable patterns at the level of individual authors.</p>
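<p>The summary figures quoted above can be recomputed from the kappa column of Table 1. The short sketch below copies those 18 values and derives the overall mean and the range per poet; the variable names are illustrative, and the trailing number in each label is the index of the reference reconstruction used in that run.</p>
<code language="python">
```python
from collections import defaultdict

# Cohen's kappa per generated interpretation, copied from Table 1.
IAA = {
    "Grünbein 3": 0.9397, "Goethe 1": 0.8622, "Grünbein 1": 0.8385,
    "Goethe 3": 0.8300, "Goethe 2": 0.8030, "Hofmannsthal 1": 0.7871,
    "Hofmannsthal 2": 0.7824, "Mayröcker 2": 0.7753, "Hofmannsthal 3": 0.7620,
    "Grünbein 2": 0.7219, "Bachmann 1": 0.7157, "Bachmann 2": 0.7123,
    "Mayröcker 3": 0.6883, "Gerstl 1": 0.6779, "Gerstl 2": 0.6647,
    "Bachmann 3": 0.6385, "Gerstl 3": 0.5842, "Mayröcker 1": 0.5823,
}

mean_iaa = sum(IAA.values()) / len(IAA)

# Group the values by poet, i.e. the label without its trailing run index.
by_poet = defaultdict(list)
for label, kappa in IAA.items():
    poet = label.rsplit(" ", 1)[0]
    by_poet[poet].append(kappa)

ranges = {poet: (min(values), max(values)) for poet, values in by_poet.items()}

print(round(mean_iaa, 4))
for poet, (low, high) in sorted(ranges.items()):
    print(f"{poet}: {low:.2f} to {high:.2f}")
```
</code>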
<p><bold>Percent Agreement per Argumentation Level</bold> (<xref ref-type="table" rid="T2">Table 2</xref>): To determine the agreement between the annotators with regard to the individual argumentative levels of the generated interpretations, we calculated the percentage agreement and its Pearson correlation coefficient with the average Likert scores. The results reveal a differentiated pattern across the four argumentative levels. First, average agreement exceeds 75% in three categories: The highest mean occurs in the rule-of-inference layer (90.89%), followed by the description layer (84.79%) and the interpretation layer (76.90%). Agreement is noticeably lower for exegesis (68.73%). At the same time, the item-level values exhibit substantial internal variability within all categories. Description agreement, for instance, ranges from as low as 0% (Hofmannsthal 1) to multiple instances of 100%. Exegesis spans a wide interval, from 25% (Goethe 3) to 94.12% (Gr&#252;nbein 3). Interpretation likewise displays considerable dispersion, extending from 0% (Mayr&#246;cker 1) to multiple cases of 100% (Goethe 1, Gr&#252;nbein 3, Hofmannsthal 1). Second, missing values (&#8211;) appear in two categories, most prominently in the rule-of-inference layer. In five cases, the generated interpretations contain no inferential structures that could be evaluated by both annotators, leading to the absence of a corresponding agreement value. Missing entries in the description category occur only twice and arise exclusively when the model did not produce a descriptive layer at all. Third, the comparatively lower mean agreement in exegesis and interpretation aligns with well-known tendencies in literary scholarship: Such statements are inherently more contestable than descriptive claims. In numerous instances, the model introduced exegetical or interpretative assertions without providing sufficiently clear or consistent descriptive grounding, which resulted in divergent annotator judgments.
This mechanism contributes to the broader spread of agreement values in both categories, in contrast to the more structurally constrained description and rule-of-inference layers.</p>
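<p>The correlation between percent agreement and average Likert score mentioned at the beginning of this paragraph is a standard Pearson coefficient. A minimal sketch of the calculation follows; the function name and the four value pairs are invented for illustration and do not reproduce any result of this study.</p>
<code language="python">
```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values: percent agreement and mean Likert score of four
# hypothetical interpretations at one argumentative level.
agreement = [50.0, 75.0, 100.0, 75.0]
mean_likert = [3.0, 3.0, 3.5, 2.5]

print(pearson_r(agreement, mean_likert))
```
</code>
<p>Items with a missing agreement value (marked &#8211; in Table 2) would have to be excluded pairwise before such a calculation.</p>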
<table-wrap id="T2">
<caption>
<p><bold>Table 2:</bold> Percent agreement (in %) for each category across poems, grouped by poet, including the arithmetic mean. A dash (&#8211;) indicates that the LLM did not generate text pertaining to the respective category.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top" rowspan="3">Name</td>
<td align="center" valign="top" colspan="4">Percent Agreement</td>
</tr>
<tr>
<td colspan="4"><hr/></td>
</tr>
<tr>
<td align="center" valign="top">Description</td>
<td align="center" valign="top">Exegesis</td>
<td align="center" valign="top">Interpretation</td>
<td align="center" valign="top">Rule of Inference</td>
</tr>
<tr>
<td colspan="5"><hr/></td>
</tr>
<tr>
<td align="left" valign="top">Bachmann 1</td>
<td align="right" valign="top">87.50</td>
<td align="right" valign="top">46.67</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">100.00</td>
</tr>
<tr>
<td align="left" valign="top">Bachmann 2</td>
<td align="right" valign="top">87.50</td>
<td align="right" valign="top">68.42</td>
<td align="right" valign="top">78.57</td>
<td align="right" valign="top">50.00</td>
</tr>
<tr>
<td align="left" valign="top">Bachmann 3</td>
<td align="right" valign="top">&#8211;</td>
<td align="right" valign="top">66.67</td>
<td align="right" valign="top">83.33</td>
<td align="right" valign="top">&#8211;</td>
</tr>
<tr>
<td align="left" valign="top">Goethe 1</td>
<td align="right" valign="top">88.24</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">100.00</td>
</tr>
<tr>
<td align="left" valign="top">Goethe 2</td>
<td align="right" valign="top">85.71</td>
<td align="right" valign="top">70.00</td>
<td align="right" valign="top">75.00</td>
<td align="right" valign="top">83.33</td>
</tr>
<tr>
<td align="left" valign="top">Goethe 3</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">25.00</td>
<td align="right" valign="top">75.00</td>
<td align="right" valign="top">&#8211;</td>
</tr>
<tr>
<td align="left" valign="top">Gr&#252;nbein 1</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">66.67</td>
<td align="right" valign="top">75.00</td>
<td align="right" valign="top">100.00</td>
</tr>
<tr>
<td align="left" valign="top">Gr&#252;nbein 2</td>
<td align="right" valign="top">83.33</td>
<td align="right" valign="top">76.92</td>
<td align="right" valign="top">91.67</td>
<td align="right" valign="top">87.50</td>
</tr>
<tr>
<td align="left" valign="top">Gr&#252;nbein 3</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">94.12</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">&#8211;</td>
</tr>
<tr>
<td align="left" valign="top">Hofmannsthal 1</td>
<td align="right" valign="top">0.00</td>
<td align="right" valign="top">84.00</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">100.00</td>
</tr>
<tr>
<td align="left" valign="top">Hofmannsthal 2</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">66.67</td>
<td align="right" valign="top">75.00</td>
<td align="right" valign="top">87.50</td>
</tr>
<tr>
<td align="left" valign="top">Hofmannsthal 3</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">80.00</td>
<td align="right" valign="top">75.00</td>
<td align="right" valign="top">100.00</td>
</tr>
<tr>
<td align="left" valign="top">Mayr&#246;cker 1</td>
<td align="right" valign="top">&#8211;</td>
<td align="right" valign="top">68.75</td>
<td align="right" valign="top">0.00</td>
<td align="right" valign="top">100.00</td>
</tr>
<tr>
<td align="left" valign="top">Mayr&#246;cker 2</td>
<td align="right" valign="top">87.50</td>
<td align="right" valign="top">47.06</td>
<td align="right" valign="top">72.22</td>
<td align="right" valign="top">85.71</td>
</tr>
<tr>
<td align="left" valign="top">Mayr&#246;cker 3</td>
<td align="right" valign="top">93.75</td>
<td align="right" valign="top">76.00</td>
<td align="right" valign="top">66.67</td>
<td align="right" valign="top">&#8211;</td>
</tr>
<tr>
<td align="left" valign="top">Gerstl 1</td>
<td align="right" valign="top">83.33</td>
<td align="right" valign="top">66.67</td>
<td align="right" valign="top">50.00</td>
<td align="right" valign="top">100.00</td>
</tr>
<tr>
<td align="left" valign="top">Gerstl 2</td>
<td align="right" valign="top">75.00</td>
<td align="right" valign="top">58.62</td>
<td align="right" valign="top">66.67</td>
<td align="right" valign="top">87.50</td>
</tr>
<tr>
<td align="left" valign="top">Gerstl 3</td>
<td align="right" valign="top">&#8211;</td>
<td align="right" valign="top">75.00</td>
<td align="right" valign="top">100.00</td>
<td align="right" valign="top">&#8211;</td>
</tr>
<tr>
<td colspan="5"><hr/></td>
</tr>
<tr>
<td align="left" valign="top">Average</td>
<td align="right" valign="top">84.79</td>
<td align="right" valign="top">68.73</td>
<td align="right" valign="top">76.90</td>
<td align="right" valign="top">90.89</td>
</tr>
</tbody>
</table>
</table-wrap>
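<p>The percent-agreement values and column means in Table 2 can be illustrated with a minimal Python sketch. It assumes that agreement is counted as an exact match between the two annotators&#8217; labels for each statement and that missing values are skipped when averaging; neither assumption is confirmed by the text above. The function names and the label values are hypothetical. Only the description column is taken from Table 2.</p>
<code language="python"><![CDATA[
```python
from typing import Optional, Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> Optional[float]:
    """Share of items (in percent) on which two annotators chose the same label.

    Returns None when there are no items to compare, which corresponds to a
    dash in Table 2.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        return None
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def mean_ignoring_missing(values: Sequence[Optional[float]]) -> float:
    """Arithmetic mean over the available values; None entries are skipped."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no values to average")
    return sum(present) / len(present)


# HYPOTHETICAL labels for eight statements of one generated interpretation.
annotator_1 = ["ok", "ok", "ok", "flawed", "ok", "ok", "ok", "ok"]
annotator_2 = ["ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok"]
example_pa = percent_agreement(annotator_1, annotator_2)  # 7 of 8 labels match

# Description column of Table 2 in row order; None marks the three dashes.
description_pa = [87.50, 87.50, None, 88.24, 85.71, 100.00, 100.00, 83.33, 100.00,
                  0.00, 100.00, 100.00, None, 87.50, 93.75, 83.33, 75.00, None]
description_mean = round(mean_ignoring_missing(description_pa), 2)

if __name__ == "__main__":
    print(example_pa)        # 87.5
    print(description_mean)  # 84.79
```
]]></code>
<p>Skipping the missing entries reproduces the reported description mean of 84.79 from the fifteen available values, which suggests, but does not prove, that the averages in Table 2 were computed over available values only.</p>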
<p><bold>Category-wise Evaluation and Correlations</bold> (<xref ref-type="table" rid="T3">Table 3</xref>): We observe a differentiated pattern in average Likert scores across the annotation layers, with systematically higher evaluations for descriptive components than for the higher argumentative levels: The mean value is highest in the description category (3.48), followed by exegesis (3.15), with lower means in interpretation (3.07) and the rule-of-inference layer (2.66). This pattern indicates that annotators tended to evaluate the descriptive parts of the interpretations more favorably than the more abstract argumentative components, although the difference between the interpretation and rule-of-inference layers remains small. The low Likert scores for the rule-of-inference layer are related to a problem that already became apparent during the reconstruction process and is also described by Winko et al. (<xref ref-type="bibr" rid="B39">2024</xref>): the reconstruction constitutes a highly generalizing supplement in which restrictive conditions are easily overlooked, and whose isolation simultaneously creates the impression of a deductive argumentation. As a result, many of the rules generated by the LLM on this basis are not convincing. In addition, the correlations between the individual Likert scores and the percent agreement show that high percent agreement does not necessarily coincide with high Likert scores. In the description category, correlations with agreement are weakly positive for Annotator 1 (0.30) and weakly negative for Annotator 2 (-0.19), suggesting no stable relationship between agreement and evaluative judgment at this level. The exegesis category exhibits moderate positive correlations for both annotators (Annotator 1: 0.45, Annotator 2: 0.40), pointing to a more consistent alignment between agreement and evaluation than in the description layer. 
By contrast, the rule-of-inference category displays negative correlations for both annotators (Annotator 1: -0.05, Annotator 2: -0.21), indicating that higher agreement at this level is not associated with higher Likert ratings and may even coincide with more critical evaluations. The interpretation category again shows weak correlations (Annotator 1: 0.26, Annotator 2: 0.05), reinforcing the conclusion that the relationship between agreement and evaluative judgment remains relatively unstable at this global interpretive level.</p>
<table-wrap id="T3">
<caption>
<p><bold>Table 3:</bold> Category-wise evaluation scores. Likert scores are averaged over both annotators, correlation measured between individual Likert scores and percent agreement.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top" rowspan="3">Category</td>
<td align="center" valign="top" rowspan="3">Likert<break/>Mean</td>
<td align="center" valign="top" colspan="2">Correlation of agreement with</td>
</tr>
<tr>
<td colspan="2"><hr/></td>
</tr>
<tr>
<td align="center" valign="top">A1</td>
<td align="center" valign="top">A2</td>
</tr>
<tr>
<td colspan="4"><hr/></td>
</tr>
<tr>
<td align="left" valign="top">Description</td>
<td align="right" valign="top">3.4768</td>
<td align="right" valign="top">0.3012</td>
<td align="right" valign="top">-0.1888</td>
</tr>
<tr>
<td align="left" valign="top">Exegesis</td>
<td align="right" valign="top">3.1524</td>
<td align="right" valign="top">0.4480</td>
<td align="right" valign="top">0.4017</td>
</tr>
<tr>
<td align="left" valign="top">Interpretation</td>
<td align="right" valign="top">3.0662</td>
<td align="right" valign="top">0.2589</td>
<td align="right" valign="top">0.0535</td>
</tr>
<tr>
<td align="left" valign="top">Rule of Inference</td>
<td align="right" valign="top">2.6600</td>
<td align="right" valign="top">-0.0458</td>
<td align="right" valign="top">-0.2061</td>
</tr>
</tbody>
</table>
</table-wrap>
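<p>The correlations in Table 3 relate per-poem percent agreement to one annotator&#8217;s Likert scores. The following Python sketch shows one way such a Pearson coefficient could be computed. It assumes that poems without an agreement value are dropped before correlating, which the text above does not state. The function names are hypothetical, and the Likert scores in the example are invented for illustration and are not data from the study.</p>
<code language="python"><![CDATA[
```python
from math import sqrt
from typing import Optional, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlate_available(agreement: Sequence[Optional[float]],
                        likert: Sequence[float]) -> float:
    """Correlate agreement with Likert scores, skipping poems without agreement."""
    pairs = [(a, s) for a, s in zip(agreement, likert) if a is not None]
    return pearson_r([a for a, _ in pairs], [s for _, s in pairs])


# First six rule-of-inference values of Table 2 (None marks a dash) ...
agreement = [100.00, 50.00, None, 100.00, 83.33, None]
# ... paired with HYPOTHETICAL Likert scores of one annotator (not from the study).
likert = [2.0, 3.5, 3.0, 2.5, 3.0, 4.0]

if __name__ == "__main__":
    print(round(correlate_available(agreement, likert), 2))
```
]]></code>
<p>With at most 18 poems per category, and fewer where agreement values are missing, coefficients of this kind rest on few data points, so small differences between them should be read with caution.</p>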
<p>In aggregate, percent agreement (PA) is generally high across categories, with the highest values in the rule-of-inference and descriptive layers, intermediate values in interpretation, and the lowest values in exegesis. Likert evaluations show only moderate variation in their mean values across categories, indicating that both annotators applied their judgments in a broadly comparable manner. The correlations between PA and Likert scores differ markedly by argumentative level: They are mixed in the descriptive category (positive for Annotator 1, negative for Annotator 2), clearly positive in exegesis, weak to near zero in interpretation, and negative in the rule-of-inference category. Taken together, these results show that the relationship between agreement and evaluative judgments is not uniform across levels but depends on the type of argumentative operation involved.</p>
</sec>
</sec>
<sec id="S6">
<title>6. Conclusions</title>
<p>In summary, the workflow developed here for evaluating generated argument-like interpretations of poems appears robust overall, yet its reliability varies across argumentative levels. While the overall moderate to high percent agreement, with notable variation across categories, indicates consistent annotation behavior, the divergent correlations between PA and Likert evaluations show that agreement and evaluative judgments do not align uniformly across descriptive, exegetical, interpretative, and inferential operations. Rather than indicating uniform effects, the correlations suggest category-specific relationships between inter-annotator agreement and perceived quality that admit multiple interpretations. Particularly noteworthy is the positive correlation between PA and Likert scores in the exegetical category, which cautiously suggests that higher annotator agreement may coincide with higher perceived quality of exegetical operations, without implying a strong or universal alignment between agreement and evaluation. In contrast, in the rule-of-inference category, the negative correlation in combination with comparatively high PA values can be read as tentative evidence that instances of agreement, where they occur, tend to coincide with lower Likert scores, potentially suggesting consensual identification of weak or unconvincing rules of inference. Conversely, the near-zero correlation in the interpretative category, alongside a moderate level of agreement, and the mixed correlations observed in the descriptive layer indicate areas where automated interpretations introduce forms of instability that human annotators detect to varying degrees. Together, these findings demonstrate both the sensitivity of the argument-like interpretive framework and the differentiated reliability of automated interpretive outputs depending on the type of argumentative operation involved.</p>
<p>We also see great potential for critical self-reflection on the practices of literary studies in the fact that evaluating generated interpretations forces researchers to make explicit the background assumptions that guide these practices. The analysis of inference rules would probably play a central role here, as these rules manifest framework assumptions that tacitly guide interpretation but are rarely made explicit. Because the inference rules are explicitly formulated when interpretations are generated as presented here, evaluating them can reveal much about the practices of the discipline, without the laborious work of partially or fully reconstructing existing interpretations in advance.</p>
</sec>
<sec id="S7">
<title>7. Future Work</title>
<p>Future work can build upon this study on various levels: For instance, a detailed comparison of existing catalogs of evaluation criteria in terms of their consistency and operationalizability could contribute to refining the approach presented here. Additionally, the number of argumentative levels considered during reconstruction could be gradually expanded to examine the extent to which LLMs can adequately transfer the argumentative structure to new texts at increasing levels of complexity. Furthermore, efforts could be made to improve the skeletal reconstruction of arguments using advanced techniques such as prompt-tuning. By refining prompt design, it may be possible to generate more accurate and nuanced representations of complex literary interpretations. Additionally, larger test datasets are required to ensure the robustness and generalizability of the findings. These datasets should encompass a broader range of literary-historical epochs and interpretive frameworks, enabling a more comprehensive evaluation of the methodology across diverse contexts. Building on such datasets, future work should also aim to evaluate not only the LLM-generated interpretations, but also the reference interpretations themselves, using the same set of criteria. Comparing these evaluations may yield valuable insights into differences in interpretive practice and argumentative structure between human and machine-generated interpretations. Moreover, if a sufficient amount of manually annotated evaluation data becomes available, a classifier could be trained to automate the evaluation process, thereby enhancing scalability and consistency in future assessments. Finally, it is essential to explore alternative reconstruction approaches that might offer different or complementary perspectives on literary argumentation, contributing to a more versatile and multifaceted framework for the analysis of interpretive texts.</p>
</sec>
<sec id="S8">
<title>8. Limitations</title>
<p>This study is subject to several limitations. First, the chosen form of reconstruction does not align with common textual practices in literary studies or established literary conventions of representation. As a result, the approach may appear overly simplistic and neglect the specific context of interpretation, such as the intended audience, purpose, or situational relevance. Second, the validity of the study is limited by the fact that the reference data used for comparison was not itself evaluated, leaving potential biases or inaccuracies in the reference data unaddressed. Third, the reproducibility of results poses a challenge due to the non-deterministic nature of large language models and the use of commercial models via API. Even with identical inputs and prompts, outputs may vary, making consistent replication difficult. Fourth, the experiment was conducted on a relatively small dataset of 18 examples, limiting the statistical robustness and generalizability of the findings. Fifth, the results cannot be generalized across all LLMs, as the analysis was restricted to a single model, which may not fully represent the capabilities or limitations of other models in the same category. Finally, the study evaluated reconstructed arguments and interpretations rather than directly assessing LLM-generated interpretations. This indirect approach might not capture the full potential or limitations of LLMs in generating literary analyses directly, leaving room for further exploration in future research.</p>
</sec>
<sec id="S9">
<title>9. Data Availability</title>
<p>Data and code can be found here: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/AxPic/poem-int-eval">https://github.com/AxPic/poem-int-eval</ext-link>. They have been archived and are persistently available at: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://doi.org/10.5281/zenodo.18166524">https://doi.org/10.5281/zenodo.18166524</ext-link></p>
</sec>
</body>
<back>
<sec id="S10">
<title>10. Acknowledgements</title>
<p>We would like to thank the reviewers as well as Janina Jacke and the participants of the General Meeting of the DFG Priority Programme &#8220;Computational Literary Studies&#8221; in W&#252;rzburg for their constructive feedback on earlier versions of this contribution.</p>
</sec>
<sec id="S11">
<title>11. Author Contributions</title>
<p><bold>Axel Pichler:</bold> Conceptualization, Data curation, Investigation, Methodology, Validation, Formal analysis, Writing &#8211; original draft, Writing &#8211; review &amp; editing</p>
<p><bold>Martin Endres:</bold> Validation</p>
<p><bold>Nils Reiter:</bold> Formal analysis, Writing &#8211; review &amp; editing, dealing with emojis in references</p>
</sec>
<fn-group>
<fn id="n1"><p>For example, in Martus (<xref ref-type="bibr" rid="B27">2021, 48&#8211;54</xref>), which draws on a corpus linguistic study of the renowned journal <italic>Deutsche Vierteljahrsschrift f&#252;r Literaturwissenschaft und Geistesgeschichte</italic> published in German-speaking countries, 630 of the 2,430 articles published there between 1923 and 2018 were identified as interpretations from the field of Modern German Literature.</p></fn>
<fn id="n2"><p>For classification tasks see the overview in Bamman et al. (<xref ref-type="bibr" rid="B2">2024</xref>). For the few exceptions of studies that deal with the interpretive competence of LLMs, see <xref ref-type="sec" rid="S2">section 2</xref>.</p></fn>
<fn id="n3"><p>For a description of this paradigm and its differences to alternative applications of LLMs, see Liu et al. (<xref ref-type="bibr" rid="B26">2023</xref>).</p></fn>
<fn id="n4"><p>A central reference point in this ongoing debate is Bender et al. (<xref ref-type="bibr" rid="B4">2021</xref>), who argue that LLMs merely mimic interpersonal language use, ultimately only predicting the next word, and therefore &#8211; particularly from an intentionalist perspective &#8211; cannot be considered genuine producers of meaning.</p></fn>
<fn id="n5"><p>An extensive and reflective justification of the added value of externalist positions in the debate on whether LLMs &#8216;understand&#8217; is provided by Jannidis et al. (<xref ref-type="bibr" rid="B20">2025</xref>).</p></fn>
<fn id="n6"><p>All direct quotations from German-language research are reproduced in English translations produced with DeepL.</p></fn>
<fn id="n7"><p>An excellent overview of the current debate on interpretation in literary studies is provided by Descher et al. (<xref ref-type="bibr" rid="B15">2015</xref>); see also Davies and Matheson (<xref ref-type="bibr" rid="B13">2008</xref>).</p></fn>
<fn id="n8"><p>A widespread understanding, which goes back to Gilbert Ryle, is that interpretation is the attribution of meaning. However, as Axel B&#252;hler (<xref ref-type="bibr" rid="B9">1999</xref>), among others, has shown, this definition only applies to the word or sentence level, but not to texts as a whole. With regard to the interpretative determination of texts that go beyond the sentence level, recent research speaks accordingly of &#8220;thematic interpretation&#8221;, which is realized in statements of the following structure: &#8220;Text X is about y&#8221; (<xref ref-type="bibr" rid="B39">Winko et al. 2024, 166</xref>).</p></fn>
<fn id="n9"><p>The question of whether there is one or more valid interpretations of individual literary phenomena or texts that contradict each other is a standing topic of debate in literary theory. The answers to this question range from interpretation-theoretical monism, which assumes that there is only one potentially correct interpretation within the framework of a particular type of interpretation, to interpretation-theoretical relativism. For an introduction to this debate: Davies and Matheson (<xref ref-type="bibr" rid="B13">2008</xref>); for a critique and rejection of interpretation-theoretical relativism in the sense of an acceptance of contradictory or incoherent interpretative statements: Descher (<xref ref-type="bibr" rid="B14">2017</xref>).</p></fn>
<fn id="n10"><p>Alternative systematizations of evaluation criteria are offered by Beardsley (<xref ref-type="bibr" rid="B3">1981</xref>), Zabka (<xref ref-type="bibr" rid="B40">2008</xref>), and Descher et al. (<xref ref-type="bibr" rid="B15">2015, 47</xref>) as well as Petraschka and Descher (<xref ref-type="bibr" rid="B31">2019, 54&#8211;70</xref>). Winko et al. (<xref ref-type="bibr" rid="B39">2024, 495&#8211;516</xref>) systematize the use of quality criteria in interpretative texts.</p></fn>
<fn id="n11"><p>Paradigmatic for this is the following statement by Steffen Martus at the end of a contribution on the practice of interpretation in literary studies with regard to the multiple relationality of this practice to other factors: &#8220;Because [&#8230;] the potential qualities of an interpretation are realized in varying degrees of quality and intensity, no schematically applicable evaluation rules can be given. &#8216;Neither truth nor method guarantee [&#8230;] that an interpretation is really good and acceptable to literary interpreters&#8217;, and the question of &#8216;how to distinguish between &#8220;good&#8221; and &#8220;not so good&#8221; practice&#8217; must be supplemented by the question: good for whom, for what and for which situation?&#8221; (<xref ref-type="bibr" rid="B27">Martus 2021, 74</xref>). Martus refers here to some of the variables which can also be found in Hermer&#233;n&#8217;s definition above to characterize the type of interpretation and thus also its conditions for success. The quotations in the quote are taken from Hempfer (<xref ref-type="bibr" rid="B18">2018</xref>).</p></fn>
<fn id="n12"><p>In future experiments, this information will be provided in a modular fashion in order to determine which combination yields the most reliable results.</p></fn>
<fn id="n13"><p>For an overview see: Celikyilmaz et al. (<xref ref-type="bibr" rid="B10">2021</xref>).</p></fn>
<fn id="n14"><p>The scientific questionnaire method involves systematically collecting and analyzing self-reported verbal and numerical data from respondents about their experiences and behavior. This is done using a self-administered scientific questionnaire, which can be distributed in person, by mail, online, or via mobile devices. Key elements are the respondents, the questionnaire, and the context in which it is completed (<xref ref-type="bibr" rid="B17">D&#246;ring 2023, 393ff.</xref>).</p></fn>
<fn id="n15"><p>According to Bowell et al. (<xref ref-type="bibr" rid="B6">2020, 144</xref>), the &#8220;goal of argument-reconstruction is to produce a clear and completely explicit statement of the argument that the arguer had in mind. The desired clarity and explicitness are achieved by putting all of the argument, and nothing but the argument, into standard form: this displays the argument&#8217;s premises, intermediate conclusions and conclusion, and indicates the inferences between them.&#8221;</p></fn>
<fn id="n16"><p>This is plausibility in the sense of justifiability, see Winko (<xref ref-type="bibr" rid="B38">2015</xref>).</p></fn>
<fn id="n17"><p>As K&#246;ppe and Winko (<xref ref-type="bibr" rid="B24">2011</xref>) critically note, Strube&#8217;s system of categories is geared towards this case. Interpretations that pursue other goals that do not concern the entire literary text &#8211; e.g. the clarification of a poetic image or the intertextual context of a single verse &#8211; do not possess the third of Strube&#8217;s levels of interpretation and would therefore be evaluated less favorably.</p></fn>
<fn id="n18"><p>In using the term &#8216;material rules of inference&#8217;, we draw on the tradition established by Wilfrid Sellars (<xref ref-type="bibr" rid="B34">1953</xref>) and developed in contemporary inferentialism by Robert Brandom (<xref ref-type="bibr" rid="B7">2001</xref>), as well as the closely related notion of field-dependent warrants in Stephen Toulmin&#8217;s (<xref ref-type="bibr" rid="B37">2003</xref>) model of argumentation.</p></fn>
<fn id="n19"><p>To assess the relevance of a description, one would have to determine, on the basis of the reconstruction, which literary-theoretical method is involved in the interpretation. This leaves a great deal of room for interpretation and is in many cases impossible, because many interpretations lack an explicit connection to literary theory or vocabulary specific to a particular literary theory. This is at least the case in the poetry interpretations we have reconstructed. The situation is similar in the corpus of interpretative texts examined as part of the ArgLit project. Here, too, the literary-theoretical standpoint could only rarely be determined (<xref ref-type="bibr" rid="B39">Winko et al. 2024, 155</xref>).</p></fn>
<fn id="n20"><p>In attempting to explicate such rules of inference, Winko et al. (<xref ref-type="bibr" rid="B39">2024, 269&#8211;272</xref>) have encountered numerous problems concerning, among other things, the degree of generality or the scope of these rules of inference.</p></fn>
<fn id="n21"><p><ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.anthropic.com/news/claude-4-5-sonnet">https://www.anthropic.com/news/claude-4-5-sonnet</ext-link></p></fn>
<fn id="n22"><p>Two examples of generated argument-like interpretations can be found in the appendix; see Hofmannsthal 3 and Gerstl 3.</p></fn>
</fn-group>
<ref-list>
<ref id="B1"><mixed-citation publication-type="book"><string-name><surname>Albrecht</surname>, <given-names>Andrea</given-names></string-name> and <string-name><given-names>Lutz</given-names> <surname>Danneberg</surname></string-name> (<year>2021</year>). <chapter-title>&#8220;Verstehen, Auslegen, Darstellen und Vermitteln: Literaturwissenschaftliche Interpretationstexte in praxeologischer Perspektive&#8221;</chapter-title>. In: <source>Doing Interpretation</source>. Ed. by <string-name><given-names>Johannes</given-names> <surname>Corrodi Katzenstein</surname></string-name>, <string-name><given-names>Andreas</given-names> <surname>Mauz</surname></string-name>, and <string-name><given-names>Christiane</given-names> <surname>Tietz</surname></string-name>. <publisher-name>Brill &#8212; Sch&#246;ningh</publisher-name>, <fpage>23</fpage>&#8211;<lpage>50</lpage>. <pub-id pub-id-type="doi">10.30965/9783657701551_003</pub-id>.</mixed-citation></ref>
<ref id="B2"><mixed-citation publication-type="webpage"><string-name><surname>Bamman</surname>, <given-names>David</given-names></string-name>, <string-name><given-names>Kent K.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>Li</given-names> <surname>Lucy</surname></string-name>, and <string-name><given-names>Naitian</given-names> <surname>Zhou</surname></string-name> (<year>2024</year>). <article-title>&#8220;On Classification with Large Language Models in Cultural Analytics&#8221;</article-title>. In: <source>Proceedings of the Computational Humanities Research Conference 2024</source>. Ed. by <string-name><given-names>Wouter</given-names> <surname>Haverals</surname></string-name>, <string-name><given-names>Marijn</given-names> <surname>Koolen</surname></string-name>, and <string-name><given-names>Laure</given-names> <surname>Thompson</surname></string-name>. <uri>https://ceur-ws.org/Vol-3834/paper119.pdf</uri> (visited on 12/08/2025).</mixed-citation></ref>
<ref id="B3"><mixed-citation publication-type="book"><string-name><surname>Beardsley</surname>, <given-names>Monroe C.</given-names></string-name> (<year>1981</year>). <source>Aesthetics. Problems in the Philosophy of Criticism</source>. <publisher-name>Hackett</publisher-name>.</mixed-citation></ref>
<ref id="B4"><mixed-citation publication-type="book"><string-name><surname>Bender</surname>, <given-names>Emily M.</given-names></string-name>, <string-name><given-names>Timnit</given-names> <surname>Gebru</surname></string-name>, <string-name><given-names>Angelina</given-names> <surname>McMillan-Major</surname></string-name>, and <string-name><given-names>Shmargaret</given-names> <surname>Shmitchell</surname></string-name> (<year>2021</year>). <chapter-title>&#8220;On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? &#129436;&#8221;</chapter-title>. In: <source>Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>. <publisher-name>ACM</publisher-name>, <fpage>610</fpage>&#8211;<lpage>623</lpage>. <pub-id pub-id-type="doi">10.1145/3442188.3445922</pub-id>.</mixed-citation></ref>
<ref id="B5"><mixed-citation publication-type="journal"><string-name><surname>Borg</surname>, <given-names>Emma</given-names></string-name> (<year>2025</year>). <article-title>&#8220;LLMs, Turing Tests and Chinese Rooms: the Prospects for Meaning in Large Language Models&#8221;</article-title>. In: <source>Inquiry</source>, <fpage>1</fpage>&#8211;<lpage>31</lpage>. <pub-id pub-id-type="doi">10.1080/0020174x.2024.2446241</pub-id>.</mixed-citation></ref>
<ref id="B6"><mixed-citation publication-type="book"><string-name><surname>Bowell</surname>, <given-names>Tracy</given-names></string-name>, <string-name><given-names>Robert</given-names> <surname>Cowan</surname></string-name>, and <string-name><given-names>Gary</given-names> <surname>Kemp</surname></string-name> (<year>2020</year>). <source>Critical Thinking: A Concise Guide</source>. <edition>5th</edition> edition. <publisher-name>Routledge</publisher-name>. <pub-id pub-id-type="doi">10.4324/9781351243735</pub-id>.</mixed-citation></ref>
<ref id="B7"><mixed-citation publication-type="book"><string-name><surname>Brandom</surname>, <given-names>Robert</given-names></string-name> (<year>2001</year>). <source>Making It Explicit: Reasoning, Representing, and Discursive Commitment</source>. <edition>4th</edition> edition. <publisher-name>Harvard Univ. Press</publisher-name>.</mixed-citation></ref>
<ref id="B8"><mixed-citation publication-type="book"><string-name><surname>Brun</surname>, <given-names>Georg</given-names></string-name> and <string-name><given-names>Gertrude Hirsch</given-names> <surname>Hadorn</surname></string-name> (<year>2021</year>). <source>Textanalyse in den Wissenschaften: Inhalte und Argumente analysieren und verstehen</source>. <edition>4th</edition> edition. <publisher-name>vdf Hochschulverlag AG an der ETH Z&#252;rich</publisher-name>. <pub-id pub-id-type="doi">10.3218/4034-0</pub-id>.</mixed-citation></ref>
<ref id="B9"><mixed-citation publication-type="journal"><string-name><surname>B&#252;hler</surname>, <given-names>Axel</given-names></string-name> (<year>1999</year>). <article-title>&#8220;Die Vielfalt des Interpretierens&#8221;</article-title>. In: <source>Analyse &amp; Kritik</source> <volume>21</volume> (<issue>1</issue>), <fpage>117</fpage>&#8211;<lpage>137</lpage>. <pub-id pub-id-type="doi">10.1515/auk-1999-0107</pub-id>.</mixed-citation></ref>
<ref id="B10"><mixed-citation publication-type="journal"><string-name><surname>Celikyilmaz</surname>, <given-names>Asli</given-names></string-name>, <string-name><given-names>Elizabeth</given-names> <surname>Clark</surname></string-name>, and <string-name><given-names>Jianfeng</given-names> <surname>Gao</surname></string-name> (<year>2021</year>). <article-title>&#8220;Evaluation of Text Generation: A Survey&#8221;</article-title>. In: <source>arXiv preprint</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2006.14799</pub-id>.</mixed-citation></ref>
<ref id="B11"><mixed-citation publication-type="journal"><string-name><surname>Cohen</surname>, <given-names>Jacob</given-names></string-name> (<year>1960</year>). <article-title>&#8220;A Coefficient of Agreement for Nominal Scales&#8221;</article-title>. In: <source>Educational and Psychological Measurement</source> <volume>20</volume> (<issue>1</issue>), <fpage>37</fpage>&#8211;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.1177/001316446002000104</pub-id>.</mixed-citation></ref>
<ref id="B12"><mixed-citation publication-type="book"><string-name><surname>Danneberg</surname>, <given-names>Lutz</given-names></string-name> (<year>1992</year>). <chapter-title>&#8220;Einleitung. Interpretation und Argumentation: Fragestellungen der Interpretationstheorie&#8221;</chapter-title>. In: <source>Vom Umgang mit Literatur und Literaturgeschichte</source>. Ed. by <string-name><given-names>Lutz</given-names> <surname>Danneberg</surname></string-name> and <string-name><given-names>Friedrich</given-names> <surname>Vollhardt</surname></string-name>. <publisher-name>Metzler</publisher-name>, <fpage>13</fpage>&#8211;<lpage>23</lpage>.</mixed-citation></ref>
<ref id="B13"><mixed-citation publication-type="book"><string-name><surname>Davies</surname>, <given-names>David</given-names></string-name> and <string-name><given-names>Carl</given-names> <surname>Matheson</surname></string-name>, eds. (<year>2008</year>). <source>Contemporary Readings in the Philosophy of Literature: An Analytic Approach</source>. <publisher-name>Broadview Press</publisher-name>.</mixed-citation></ref>
<ref id="B14"><mixed-citation publication-type="book"><string-name><surname>Descher</surname>, <given-names>Stefan</given-names></string-name> (<year>2017</year>). <source>Relativismus in der Literaturwissenschaft: Studien zu relativistischen Theorien der Interpretation literarischer Texte</source>. <publisher-name>Erich Schmidt Verlag GmbH &amp; Co. KG</publisher-name>. <pub-id pub-id-type="doi">10.37307/b.978-3-503-17461-4</pub-id>.</mixed-citation></ref>
<ref id="B15"><mixed-citation publication-type="book"><string-name><surname>Descher</surname>, <given-names>Stefan</given-names></string-name>, <string-name><given-names>Jan</given-names> <surname>Borkowski</surname></string-name>, <string-name><given-names>Felicitas</given-names> <surname>Ferder</surname></string-name>, and <string-name><given-names>Philipp David</given-names> <surname>Heine</surname></string-name> (<year>2015</year>). <chapter-title>&#8220;Probleme der Interpretation von Literatur &#8211; Ein &#220;berblick&#8221;</chapter-title>. In: <source>Literatur interpretieren: Interdisziplin&#228;re Beitr&#228;ge zur Theorie und Praxis</source>. <publisher-name>Brill &#8212; mentis</publisher-name>, <fpage>11</fpage>&#8211;<lpage>70</lpage>. <pub-id pub-id-type="doi">10.30965/9783957438973_003</pub-id>.</mixed-citation></ref>
<ref id="B16"><mixed-citation publication-type="journal"><string-name><surname>Descher</surname>, <given-names>Stefan</given-names></string-name> and <string-name><given-names>Thomas</given-names> <surname>Petraschka</surname></string-name> (<year>2018</year>). <article-title>&#8220;Die Explizierung des Impliziten&#8221;</article-title>. In: <source>Scientia Poetica</source> <volume>22</volume> (<issue>1</issue>), <fpage>180</fpage>&#8211;<lpage>208</lpage>. <pub-id pub-id-type="doi">10.1515/scipo-2018-007</pub-id>.</mixed-citation></ref>
<ref id="B17"><mixed-citation publication-type="book"><string-name><surname>D&#246;ring</surname>, <given-names>Nicola</given-names></string-name> (<year>2023</year>). <source>Forschungsmethoden und Evaluation in den Sozial- und Humanwissenschaften</source>. <edition>6th</edition> edition. <publisher-name>Springer</publisher-name>. <pub-id pub-id-type="doi">10.1007/978-3-662-64762-2</pub-id>.</mixed-citation></ref>
<ref id="B18"><mixed-citation publication-type="book"><string-name><surname>Hempfer</surname>, <given-names>Klaus W.</given-names></string-name> (<year>2018</year>). <source>Literaturwissenschaft &#8211; Grundlagen einer systematischen Theorie</source>. <series>Abhandlungen zur Literaturwissenschaft</series>. <publisher-name>J.B. Metzler</publisher-name>. <pub-id pub-id-type="doi">10.1007/978-3-476-04700-7</pub-id>.</mixed-citation></ref>
<ref id="B19"><mixed-citation publication-type="journal"><string-name><surname>Herm&#233;ren</surname>, <given-names>G&#246;ran</given-names></string-name> (<year>1983</year>). <article-title>&#8220;Interpretation. Types and Criteria&#8221;</article-title>. In: <source>Grazer Philosophische Studien</source> <volume>19</volume>, <fpage>131</fpage>&#8211;<lpage>161</lpage>. <pub-id pub-id-type="doi">10.5840/gps19831923</pub-id>.</mixed-citation></ref>
<ref id="B20"><mixed-citation publication-type="journal"><string-name><surname>Jannidis</surname>, <given-names>Fotis</given-names></string-name>, <string-name><given-names>Rabea</given-names> <surname>Kleymann</surname></string-name>, <string-name><given-names>Julian</given-names> <surname>Schr&#246;ter</surname></string-name>, and <string-name><given-names>Heike</given-names> <surname>Zinsmeister</surname></string-name> (<year>2025</year>). <article-title>&#8220;Do Large Language Models Understand Literature? Case Studies and Probing Experiments on German Poetry&#8221;</article-title>. In: <source>Journal of Computational Literary Studies</source> <volume>4</volume> (<issue>1</issue>). <pub-id pub-id-type="doi">10.48694/jcls.4225</pub-id>.</mixed-citation></ref>
<ref id="B21"><mixed-citation publication-type="book"><string-name><surname>Kemper</surname>, <given-names>Hans-Georg</given-names></string-name> (<year>1999</year>). <chapter-title>&#8220;Form-(De)-Konstruktion: Poetische Malerei im Reihungsstil&#8221;</chapter-title>. In: <source>Gedichte von Georg Trakl</source>. Ed. by <string-name><given-names>Hans-Georg</given-names> <surname>Kemper</surname></string-name>. <publisher-name>Reclam</publisher-name>, <fpage>43</fpage>&#8211;<lpage>59</lpage>.</mixed-citation></ref>
<ref id="B22"><mixed-citation publication-type="journal"><string-name><surname>Koch</surname>, <given-names>Steffen</given-names></string-name> (<year>2025</year>). <article-title>&#8220;Babbling Stochastic Parrots? A Kripkean Argument for Reference in Large Language Models&#8221;</article-title>. In: <source>Philosophy of AI</source> <volume>1</volume>, <fpage>19</fpage>&#8211;<lpage>33</lpage>. <pub-id pub-id-type="doi">10.18716/OJS/PHAI/2025.2325</pub-id>.</mixed-citation></ref>
<ref id="B23"><mixed-citation publication-type="book"><string-name><surname>K&#246;ppe</surname>, <given-names>Tilmann</given-names></string-name> (<year>2008</year>). <chapter-title>&#8220;Konturen einer analytischen Literaturtheorie&#8221;</chapter-title>. In: <source>Derrida und danach? Literaturtheoretische Diskurse der Gegenwart</source>. Ed. by <string-name><given-names>Gregor</given-names> <surname>Thuswaldner</surname></string-name>. <publisher-name>VS Verlag f&#252;r Sozialwissenschaften</publisher-name>, <fpage>67</fpage>&#8211;<lpage>83</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-531-91822-8_5</pub-id>.</mixed-citation></ref>
<ref id="B24"><mixed-citation publication-type="book"><string-name><surname>K&#246;ppe</surname>, <given-names>Tilmann</given-names></string-name> and <string-name><given-names>Simone</given-names> <surname>Winko</surname></string-name> (<year>2011</year>). <chapter-title>&#8220;Zum Vergleich literaturwissenschaftlicher Interpretationen&#8221;</chapter-title>. In: <source>Hermeneutik des Vergleichs</source>. Ed. by <string-name><given-names>Andreas</given-names> <surname>Mauz</surname></string-name> and <string-name><given-names>Hartmut von</given-names> <surname>Sass</surname></string-name>. <publisher-name>K&#246;nigshausen &amp; Neumann</publisher-name>, <fpage>305</fpage>&#8211;<lpage>320</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-0348-9261-2_4</pub-id>.</mixed-citation></ref>
<ref id="B25"><mixed-citation publication-type="webpage"><string-name><surname>Lin</surname>, <given-names>Chin-Yew</given-names></string-name> (<year>2004</year>). <chapter-title>&#8220;ROUGE: A Package for Automatic Evaluation of Summaries&#8221;</chapter-title>. In: <source>Text Summarization Branches Out</source>. <publisher-name>Association for Computational Linguistics</publisher-name>, <fpage>74</fpage>&#8211;<lpage>81</lpage>. <uri>https://aclanthology.org/W04-1013/</uri> (visited on 12/08/2025).</mixed-citation></ref>
<ref id="B26"><mixed-citation publication-type="journal"><string-name><surname>Liu</surname>, <given-names>Pengfei</given-names></string-name>, <string-name><given-names>Weizhe</given-names> <surname>Yuan</surname></string-name>, <string-name><given-names>Jinlan</given-names> <surname>Fu</surname></string-name>, <string-name><given-names>Zhengbao</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>Hiroaki</given-names> <surname>Hayashi</surname></string-name>, and <string-name><given-names>Graham</given-names> <surname>Neubig</surname></string-name> (<year>2023</year>). <article-title>&#8220;Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing&#8221;</article-title>. In: <source>ACM Computing Surveys</source> <volume>55</volume> (<issue>9</issue>), <fpage>1</fpage>&#8211;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1145/3560815</pub-id>.</mixed-citation></ref>
<ref id="B27"><mixed-citation publication-type="book"><string-name><surname>Martus</surname>, <given-names>Steffen</given-names></string-name> (<year>2021</year>). <chapter-title>&#8220;Interpretieren &#8211; Lesen &#8211; Schreiben: Zur hermeneutischen Praxis aus literaturwissenschaftlicher Perspektive&#8221;</chapter-title>. In: <source>Hermeneutik unter Verdacht</source>. Ed. by <string-name><given-names>Andreas</given-names> <surname>Kablitz</surname></string-name>, <string-name><given-names>Christoph</given-names> <surname>Markschies</surname></string-name>, and <string-name><given-names>Peter</given-names> <surname>Strohschneider</surname></string-name>. <publisher-name>De Gruyter</publisher-name>, <fpage>45</fpage>&#8211;<lpage>82</lpage>. <pub-id pub-id-type="doi">10.1515/9783110698084-003</pub-id>.</mixed-citation></ref>
<ref id="B28"><mixed-citation publication-type="book"><string-name><surname>Martus</surname>, <given-names>Steffen</given-names></string-name> and <string-name><given-names>Carlos</given-names> <surname>Spoerhase</surname></string-name> (<year>2022</year>). <source>Geistesarbeit. Eine Praxeologie der Geisteswissenschaften</source>. <publisher-name>Suhrkamp</publisher-name>. <pub-id pub-id-type="doi">10.1515/scipo2024-024</pub-id>.</mixed-citation></ref>
<ref id="B29"><mixed-citation publication-type="book"><string-name><surname>Meli</surname>, <given-names>Marco</given-names></string-name> (<year>1997</year>). <chapter-title>&#8220;Der S&#228;nger&#8221;</chapter-title>. In: <source>Gedichte von Gottfried Benn</source>. Ed. by <string-name><given-names>Harald</given-names> <surname>Steinhagen</surname></string-name>. <publisher-name>Reclam</publisher-name>, <fpage>87</fpage>&#8211;<lpage>99</lpage>.</mixed-citation></ref>
<ref id="B30"><mixed-citation publication-type="book"><string-name><surname>Papineni</surname>, <given-names>Kishore</given-names></string-name>, <string-name><given-names>Salim</given-names> <surname>Roukos</surname></string-name>, <string-name><given-names>Todd</given-names> <surname>Ward</surname></string-name>, and <string-name><given-names>Wei-Jing</given-names> <surname>Zhu</surname></string-name> (<year>2002</year>). <chapter-title>&#8220;BLEU: A Method for Automatic Evaluation of Machine Translation&#8221;</chapter-title>. In: <source>Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL &#8217;02</source>. <publisher-name>Association for Computational Linguistics</publisher-name>, <fpage>311</fpage>&#8211;<lpage>318</lpage>. <pub-id pub-id-type="doi">10.3115/1073083.1073135</pub-id>.</mixed-citation></ref>
<ref id="B31"><mixed-citation publication-type="book"><string-name><surname>Petraschka</surname>, <given-names>Thomas</given-names></string-name> and <string-name><given-names>Stefan</given-names> <surname>Descher</surname></string-name> (<year>2019</year>). <source>Argumentieren in der Literaturwissenschaft. Eine Einf&#252;hrung</source>. <publisher-name>Reclam Verlag</publisher-name>.</mixed-citation></ref>
<ref id="B32"><mixed-citation publication-type="book"><string-name><surname>Pichler</surname>, <given-names>Axel</given-names></string-name>, <string-name><given-names>Janis</given-names> <surname>Pagel</surname></string-name>, and <string-name><given-names>Nils</given-names> <surname>Reiter</surname></string-name> (<year>2025</year>). <chapter-title>&#8220;Evaluating LLM-Prompting for Sequence Labeling Tasks in Computational Literary Studies&#8221;</chapter-title>. In: <source>Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)</source>. Ed. by <string-name><given-names>Anna</given-names> <surname>Kazantseva</surname></string-name>, <string-name><given-names>Stan</given-names> <surname>Szpakowicz</surname></string-name>, <string-name><given-names>Stefania</given-names> <surname>Degaetano-Ortlieb</surname></string-name>, <string-name><given-names>Yuri</given-names> <surname>Bizzoni</surname></string-name>, and <string-name><given-names>Janis</given-names> <surname>Pagel</surname></string-name>. <publisher-name>Association for Computational Linguistics</publisher-name>, <fpage>32</fpage>&#8211;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2025.latechclfl-1.5</pub-id>.</mixed-citation></ref>
<ref id="B33"><mixed-citation publication-type="book"><string-name><surname>Schmidt</surname>, <given-names>Jochen</given-names></string-name> (<year>1984</year>). <chapter-title>&#8220;&#8216;Sobria ebrietas&#8217;. H&#246;lderlins &#8216;H&#228;lfte des Lebens&#8217;&#8221;</chapter-title>. In: <source>Gedichte und Interpretationen. Band 3: Klassik und Romantik</source>. Ed. by <string-name><given-names>Wulf</given-names> <surname>Segebrecht</surname></string-name>. <publisher-name>Reclam</publisher-name>, <fpage>256</fpage>&#8211;<lpage>267</lpage>.</mixed-citation></ref>
<ref id="B34"><mixed-citation publication-type="webpage"><string-name><surname>Sellars</surname>, <given-names>Wilfrid</given-names></string-name> (<year>1953</year>). <article-title>&#8220;Inference and Meaning&#8221;</article-title>. In: <source>Mind</source> <volume>62</volume> (<issue>247</issue>), <fpage>313</fpage>&#8211;<lpage>338</lpage>. <uri>http://www.jstor.org/stable/2251271</uri> (visited on 12/19/2025).</mixed-citation></ref>
<ref id="B35"><mixed-citation publication-type="book"><string-name><surname>Strube</surname>, <given-names>Werner</given-names></string-name> (<year>1992</year>). <chapter-title>&#8220;&#220;ber Kriterien der Beurteilung von Textinterpretationen&#8221;</chapter-title>. In: <source>Vom Umgang mit Literatur und Literaturgeschichte</source>. Ed. by <string-name><given-names>Lutz</given-names> <surname>Danneberg</surname></string-name>, <string-name><given-names>Friedrich</given-names> <surname>Vollhardt</surname></string-name>, <string-name><given-names>Hartmut</given-names> <surname>B&#246;hme</surname></string-name>, and <string-name><given-names>J&#246;rg</given-names> <surname>Sch&#246;nert</surname></string-name>. <publisher-name>Metzler</publisher-name>, <fpage>185</fpage>&#8211;<lpage>210</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-476-03386-4_8</pub-id>.</mixed-citation></ref>
<ref id="B36"><mixed-citation publication-type="journal"><string-name><surname>Susteck</surname>, <given-names>Sebastian</given-names></string-name> and <string-name><given-names>Christoph</given-names> <surname>Perder</surname></string-name> (<year>2023</year>). <article-title>&#8220;Schreiben durch K&#252;nstliche Intelligenz. ChatGPT und automatisierte Lyrikanalysen&#8221;</article-title>. In: <source>MiDU - Medien im Deutschunterricht</source>, <fpage>1</fpage>&#8211;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.18716/OJS/MIDU/2023.0.2</pub-id>.</mixed-citation></ref>
<ref id="B37"><mixed-citation publication-type="book"><string-name><surname>Toulmin</surname>, <given-names>Stephen E.</given-names></string-name> (<year>2003</year>). <source>The Uses of Argument</source>. <edition>2nd</edition> edition. <publisher-name>Cambridge University Press</publisher-name>.</mixed-citation></ref>
<ref id="B38"><mixed-citation publication-type="book"><string-name><surname>Winko</surname>, <given-names>Simone</given-names></string-name> (<year>2015</year>). <chapter-title>&#8220;Zur Plausibilit&#228;t als Beurteilungskriterium literaturwissenschaftlicher Interpretationen&#8221;</chapter-title>. In: <source>Theorien, Methoden und Praktiken des Interpretierens</source>. Ed. by <string-name><given-names>Andrea</given-names> <surname>Albrecht</surname></string-name>, <string-name><given-names>Lutz</given-names> <surname>Danneberg</surname></string-name>, <string-name><given-names>Olav</given-names> <surname>Kr&#228;mer</surname></string-name>, and <string-name><given-names>Carlos</given-names> <surname>Spoerhase</surname></string-name>. <publisher-name>De Gruyter</publisher-name>, <fpage>483</fpage>&#8211;<lpage>512</lpage>. <pub-id pub-id-type="doi">10.1515/9783110353983.483</pub-id>.</mixed-citation></ref>
<ref id="B39"><mixed-citation publication-type="book"><string-name><surname>Winko</surname>, <given-names>Simone</given-names></string-name>, <string-name><given-names>Stefan</given-names> <surname>Descher</surname></string-name>, <string-name><given-names>Urania</given-names> <surname>Milevski</surname></string-name>, <string-name><given-names>Merten</given-names> <surname>Kr&#246;ncke</surname></string-name>, <string-name><given-names>Fabian</given-names> <surname>Finkendey</surname></string-name>, <string-name><given-names>Loreen</given-names> <surname>Dalski</surname></string-name>, and <string-name><given-names>Julia</given-names> <surname>Wagner</surname></string-name> (<year>2024</year>). <source>Praktiken des Plausibilisierens: Untersuchungen zum Argumentieren in literaturwissenschaftlichen Interpretationstexten</source>. <publisher-name>G&#246;ttingen University Press</publisher-name>. <pub-id pub-id-type="doi">10.17875/gup2024-2639</pub-id>.</mixed-citation></ref>
<ref id="B40"><mixed-citation publication-type="journal"><string-name><surname>Zabka</surname>, <given-names>Thomas</given-names></string-name> (<year>2008</year>). <article-title>&#8220;Interpretationsverh&#228;ltnisse entfalten. Vorschl&#228;ge zur Analyse und Kritik literaturwissenschaftlicher Bedeutungszuweisungen&#8221;</article-title>. In: <source>Journal of Literary Theory</source> <volume>2</volume> (<issue>1</issue>), <fpage>51</fpage>&#8211;<lpage>69</lpage>. <pub-id pub-id-type="doi">10.1515/JLT.2008.005</pub-id>.</mixed-citation></ref>
</ref-list>
<sec id="A1">
<title>Appendix 1: Reconstruction of the Core Argumentation of <italic>Kemper</italic></title>
<list list-type="bullet">
<list-item><p><bold>Premise 1</bold> (= P<sub>1</sub>): The poem consists of three four-line stanzas with different individual images from the natural and human world.</p></list-item>
<list-item><p><bold>Premise 2</bold> (= P<sub>2</sub>): With the exception of three enjambments, the end of the sentence and the end of the verse coincide in the poem, which reinforces the pauses between the images.</p></list-item>
<list-item><p><bold>Premise 3</bold> (= P<sub>3</sub>): The three enjambments only connect main clauses and do not break up sentences.</p></list-item>
<list-item><p><bold>Premise 4</bold> (= P<sub>4</sub>): The simple, uniform sentence structure supports the clear separation of the images.</p></list-item>
<list-item><p><bold>Intermediate Conclusion 1</bold> (= C<sub>1</sub>): The poem realizes a new poetic image in each verse.</p></list-item>
<list-item><p><bold>Material rule of inference</bold> (= R<sub>1</sub>): If images in a poem are arranged as a strict sequence of independent units without syntactic dependence, this constitutes a sequence of independent individual images, which is characteristic of the expressionist sequential style.</p></list-item>
<list-item><p><bold>Intermediate Conclusion 2</bold> (= C<sub>2</sub>): The poem realizes the expressionist sequential style.</p></list-item>
<list-item><p><bold>Premise 5</bold> (= P<sub>5</sub>): The expressionist sequential style suspends the level of symbolic meaning.</p></list-item>
<list-item><p><bold>Conclusion</bold> (= Final C): The poem has no symbolic meaning.</p></list-item>
</list>
<fig id="F3">
<caption>
<p><bold>Figure 3:</bold> Reconstruction of Argument 1.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jcls-4312_pichler-g3.png"/>
</fig>
<list list-type="bullet">
<list-item><p><bold>Premise 1</bold> (= P<sub>1</sub>): The first and second verses of the poem show a repetitive but varied optical sequence of movement from the ground to the sky and from the sky to the ground and back.</p></list-item>
<list-item><p><bold>Premise 2</bold> (= P<sub>2</sub>): The first verse of the poem introduces two antonymic assonance groups (&#8216;a&#8217; and &#8216;ei&#8217;), which lead to a blending of optical and haptic perceptions.</p></list-item>
<list-item><p><bold>Premise 3</bold> (= P<sub>3</sub>): The contrasting and at the same time analogous perceptual values thus created are continued and intensified denotatively and connotatively in the following verses of the poem.</p></list-item>
<list-item><p><bold>Intermediate Conclusion 1</bold> (= C<sub>1</sub>): The poem is characterized by sound-symbolic, motivic and structural repetitions and correspondences.</p></list-item>
<list-item><p><bold>Material rule of inference</bold> (= R<sub>1</sub>): If a poem exhibits a high degree of sound-symbolic, motivic and structural repetitions and correspondences, then it potentially creates an impression of iconicity analogous to painting.</p></list-item>
<list-item><p><bold>Intermediate Conclusion 2</bold> (= C<sub>2</sub>): The poem creates an impression of iconicity analogous to painting.</p></list-item>
<list-item><p><bold>Material rule of inference</bold> (= R<sub>2</sub>): If a text achieves a painting-like iconicity and its images can be related to real winter scenes, then it offers a poetic winter image of high sensual plasticity that can be referentialized.</p></list-item>
<list-item><p><bold>Premise 4</bold> (= P<sub>4</sub>): The images in the poem can be related to real winter scenes.</p></list-item>
<list-item><p><bold>Conclusion</bold> (= Final C): The poem offers a poetic winter image of high sensual plasticity that can be referentialized.</p></list-item>
</list>
<fig id="F4">
<caption>
<p><bold>Figure 4:</bold> Reconstruction of Argument 2.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jcls-4312_pichler-g4.png"/>
</fig>
<list list-type="bullet">
<list-item><p><bold>Premise 1</bold> (= P<sub>1</sub>): Trakl uses a limited vocabulary of images and motifs, which he combines and varies in different poems.</p></list-item>
<list-item><p><bold>Premise 2</bold> (= P<sub>2</sub>): The recurring use of certain images and motifs in different poems creates a Trakl-specific intertextuality.</p></list-item>
<list-item><p><bold>Premise 3</bold> (= P<sub>3</sub>): The poem contains images and motifs that also recur in other poems by Trakl.</p></list-item>
<list-item><p><bold>Intermediate Conclusion</bold> (= C<sub>1</sub>): The poem participates in the Trakl-specific intertextuality.</p></list-item>
<list-item><p><bold>Premise 4</bold> (= P<sub>4</sub>): In the poem, numerous inter- and intratextual relations between the image parts and an approximation of the motifs are present.</p></list-item>
<list-item><p><bold>Material rule of inference</bold> (= R<sub>1</sub>): If, in a poem, numerous inter- and intratextual relations between the image parts and an approximation of the motifs are present, this tends to lead to an autonomization of its vocabulary.</p></list-item>
<list-item><p><bold>Conclusion</bold> (= Final C): The poem has a tendency toward an autonomization of its vocabulary.</p></list-item>
</list>
<fig id="F5">
<caption>
<p><bold>Figure 5:</bold> Reconstruction of Argument 3.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jcls-4312_pichler-g5.png"/>
</fig>
<list list-type="bullet">
<list-item><p><bold>Conclusion of argument 1</bold> (= C<sub>1</sub>): The poem has no symbolic meaning.</p></list-item>
<list-item><p><bold>Conclusion of argument 2</bold> (= C<sub>2</sub>): The poem offers a poetic winter image of high sensual plasticity that can be referentialized.</p></list-item>
<list-item><p><bold>Conclusion of argument 3</bold> (= C<sub>3</sub>): The poem has a tendency toward an autonomization of its vocabulary.</p></list-item>
<list-item><p><bold>Premise 4</bold> (= P<sub>4</sub>): These three aspects constitute different, non-integrable semantic levels of the poem.</p></list-item>
<list-item><p><bold>Intermediate Conclusion</bold> (= C<sub>4</sub>): The poem offers different meanings on different representational levels.</p></list-item>
<list-item><p><bold>Material rule of inference</bold> (= R<sub>1</sub>): If a poem offers different meanings on different representational levels, then it eludes a clear definition of its meaning.</p></list-item>
<list-item><p><bold>Conclusion</bold> (= Final C): The poem <italic>Im Winter</italic> eludes a clear definition of its meaning.</p></list-item>
</list>
<fig id="F6">
<caption>
<p><bold>Figure 6:</bold> Reconstruction of Argument 4.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="jcls-4312_pichler-g6.png"/>
</fig>
</sec>
<sec id="A2">
<title>Appendix 2: Generated Argument-Like Interpretations of Hofmannsthal&#8217;s <italic>Manche freilich &#8230;</italic> and Gerstl&#8217;s <italic>balance&#8211;balance</italic>, Both Prompted with Meli&#8217;s Reconstruction</title>
<p>The following sections show two LLM-generated argument cores that were manually evaluated in this paper: the interpretation core with the highest agreement and the highest average Likert scores (Hofmannsthal 3), and an interpretation core with high agreement but lower Likert scores (Gerstl 3).</p>
<sec id="A2.1">
<title>Hofmannsthal 3</title>
<code>1 Die ersten beiden Strophen des Gedichts &#187;Manche freilich &#8230;&#171; stellen zwei gegens&#228;tzliche soziale Existenzformen einander gegen&#252;ber.</code>
<code>2 Die eine Gruppe muss &#8222;drunten sterben&#8221; (V. 1) und liegt &#8222;mit schweren Gliedern / Bei den Wurzeln des verworrenen Lebens&#8221; (V. 5-6), w&#228;hrend die andere &#8222;droben&#8221; (V. 3) wohnt und bei &#8222;Sibyllen&#8221; und &#8222;K&#246;niginnen&#8221; (V. 8) sitzt.</code>
<code>3 Die r&#228;umliche Opposition von &#8222;drunten&#8221; und &#8222;droben&#8221; korrespondiert mit der Gegen&#252;berstellung von &#8222;schwer&#8221; und &#8222;leicht&#8221;.</code>
<code>4 Die Wiederholung des Wortes &#8222;manche&#8221; bzw. &#8222;andre&#8221; etabliert eine dualistische Weltordnung.</code>
<code>5 Diese dualistische Weltordnung wird in der dritten Strophe problematisiert.</code>
<code>6</code>
<code>7 Ein &#8222;Schatten&#8221; (V. 11) verbindet die getrennten Sph&#228;ren miteinander, und &#8222;die leichten sind an die schweren / Wie an Luft und Erde gebunden&#8221; (V. 13-14).</code>
<code>8 Die Naturmetapher &#8222;Luft und Erde&#8221; suggeriert, dass diese Verbindung existenziell und unaufl&#246;sbar ist.</code>
<code>9 In der vierten Strophe wechselt das Gedicht von der allgemeinen Beobachtung zur subjektiven Perspektive eines lyrischen Ich.</code>
<code>10 Das lyrische Ich kann &#8222;ganz vergessener V&#246;lker M&#252;digkeiten&#8221; nicht &#8222;abtun&#8221; (V. 15-16) und vermag das &#8222;stumme Niederfallen ferner Sterne&#8221; nicht von seiner &#8222;erschrockenen Seele&#8221; fernzuhalten (V. 17-18).</code>
<code>11 Das lyrische Ich artikuliert eine existenzielle Verbundenheit mit dem Leiden anderer und mit kosmischen Vorg&#228;ngen.</code>
<code>12 Die Unm&#246;glichkeit der Abgrenzung deutet auf eine ethische Verantwortung hin, die das privilegierte Individuum nicht ablegen kann.</code>
<code>13</code>
<code>14 Die Schlu&#223;strophe universalisiert diese Erkenntnis durch die Feststellung, dass &#8222;viele Geschicke&#8221; nebeneinander existieren (V. 19).</code>
<code>15 Das &#8222;Dasein&#8221; verwebt alle Schicksale &#8222;durcheinander&#8221; (V. 20), wodurch individuelle Existenz als Teil eines &#252;berindividuellen Zusammenhangs erscheint.</code>
<code>16 Das lyrische Ich behauptet, sein &#8222;Teil&#8221; sei &#8222;mehr als dieses Lebens / Schlanke Flamme oder schmale Leier&#8221; (V. 21-22).</code>
<code>17 Die Metaphern &#8222;Flamme&#8221; und &#8222;Leier&#8221; verweisen auf &#228;sthetische Autonomie und individuelles Leben.</code>
<code>18 Die Negation dieser Metaphern bedeutet eine Absage an die Vorstellung einer in sich geschlossenen &#228;sthetischen oder individuellen Existenz.</code>
<code>19 Das Gedicht &#187;Manche freilich &#8230;&#171; verhandelt das Verh&#228;ltnis von privilegierter und unterprivilegierter Existenz sowie die Verantwortung des Individuums gegen&#252;ber dem kollektiven Leiden.</code>
<code>20 Wenn ein Gedicht soziale Gegens&#228;tze thematisiert und zugleich die Rolle des Dichters reflektiert, kann es als sozialethischer und poetologischer Text gelesen werden.</code>
<code>21 Das Gedicht &#187;Manche freilich &#8230;&#171; ist ein sozialethischer und poetologischer Text.</code>
<code>22</code>
<code>23 Das lyrische Ich geh&#246;rt offenbar zur privilegierten Gruppe der &#8222;Leichten&#8221;.</code>
<code>24 Das lyrische Ich kann sich jedoch nicht von der Verbindung zu den &#8222;Schweren&#8221; l&#246;sen.</code>
<code>25 Diese unl&#246;sbare Verbindung wird nicht als Begrenzung, sondern als Erweiterung der eigenen Existenz dargestellt.</code>
<code>26</code>
<code>27 Das Gedicht pr&#228;sentiert eine Poetik der Empathie und der kollektiven Verantwortung, die &#252;ber &#228;sthetischen Autonomieanspruch hinausgeht.</code>
<code>28 Das Gedicht &#187;Manche freilich &#8230;&#171; steht damit in der Tradition einer ethisch engagierten Dichtung, die soziale Gegens&#228;tze nicht ausblendet, sondern in die poetische Selbstreflexion integriert.</code>
</sec>
<sec id="A2.2">
<title>Gerstl 3</title>
<code>1 Die erste Strophe des Gedichts &#187;balance - balance&#171; verneint explizit, dass r&#228;umliche Distanz eine &#187;l&#246;sung&#171; f&#252;r unbenannte Probleme darstellt (V. 1-4).</code>
<code>2 Die zweite Strophe wiederholt und erweitert das Motiv der r&#228;umlichen Entfernung, indem sie das &#187;sitzen auf stufen weit weg&#171; als Zustand der Schwebe charakterisiert (V. 5-9).</code>
<code>3</code>
<code>4 Das Gedicht &#187;balance - balance&#171; verhandelt das Verh&#228;ltnis von physischer Ortsverlagerung und innerer Problembew&#228;ltigung.</code>
<code>5 Die physische Distanz (&#187;rom&#171; vs. &#187;wien&#171;) wird als unwirksame Fluchtbewegung entlarvt, da die Probleme &#187;in meinem kopf&#171; verbleiben.</code>
<code>6 Das zentrale Motiv der &#187;schwebe&#171; (V. 8) korrespondiert mit dem Titel &#187;balance - balance&#171; und bezeichnet einen prek&#228;ren Gleichgewichtszustand.</code>
<code>7 Dieser Schwebezustand wird durch zwei gegens&#228;tzliche Bewegungen konstituiert: die triviale Ablenkung durch &#187;l&#228;stige papier-l&#252;sterne ameisen&#171; (V. 7) und die bedrohlichen &#187;abst&#252;rze&#171; (V. 9).</code>
<code>8</code>
<code>9 Das Gedicht arbeitet mit einer Gegen&#252;berstellung von scheinbaren und tats&#228;chlichen Abgr&#252;nden: Die Flugzeuge schweben &#187;&#252;ber den so genannten wirklichen abgr&#252;nden&#171; (V. 11), w&#228;hrend das lyrische Ich &#252;ber &#187;gar nicht unwahrscheinlichen abst&#252;rzen&#171; schwebt (V. 9).</code>
<code>10 Die Qualifizierung der Abgr&#252;nde als &#187;so genannte wirkliche&#171; (V. 11) stellt die Hierarchie zwischen &#228;u&#223;erer und innerer Bedrohung infrage.</code>
<code>11 Die Flucht erfolgt zweifach: r&#228;umlich von Wien nach Rom und mental durch &#187;wegschauen&#171; (V. 12) auf &#187;kleine k&#228;frige buchstaben&#171; (V. 13).</code>
<code>12 Das Schreiben selbst wird als Ablenkungsstrategie kenntlich gemacht, die Teil des prek&#228;ren Gleichgewichts ist.</code>
<code>13 Das Gedicht &#187;balance - balance&#171; pr&#228;sentiert einen Zustand permanenter Instabilit&#228;t, in dem weder r&#228;umliche Flucht noch k&#252;nstlerische Ablenkung eine genuine L&#246;sung bieten.</code>
<code>14 Die konsequente Kleinschreibung und die parataktische Syntax verst&#228;rken formal den Eindruck der Desorientierung und fehlenden Hierarchie.</code>
<code>15 Die Wiederholung des Titelworts &#187;balance&#171; deutet auf die Notwendigkeit st&#228;ndiger Ausbalancierung hin, nicht auf einen erreichten Gleichgewichtszustand.</code>
<code>16 Das Gedicht &#187;balance - balance&#171; verweigert jede Form von Transzendenz oder Aufl&#246;sung und beharrt stattdessen auf der Unausweichlichkeit des prek&#228;ren Schwebezustands als existenzielle Grundsituation der Moderne.</code>
</sec>
</sec>
</back>
</article>