1. Introduction
“The boy took the old army blanket off the bed and spread it over the back of the chair and over the old man’s shoulders.” – The Old Man and the Sea, Ernest Hemingway
Ernest Hemingway’s prose is famously sparse, yet it conjures vivid mental images: simple actions and objects – no florid descriptions, no overt emotional cues – and still the scene is immediately present, affective, and immersive. One might say that the strength of literary texts lies in their Imageability.
The concept of Imageability originates in psycholinguistics, where it describes the ease with which words evoke sensory experiences (Paivio et al. 1968). However, when speaking of the Imageability of literary texts, we go beyond individual words and touch upon implicit and evocative strategies, including imagery, narrated perception, and the overall immersiveness or experientiality of the text – strategies that have long been held to increase the appeal of texts and enhance the reading experience (Ellen J. Esrock 1994; Sharma Paudyal 2023).
These strategies are related. We can define imagery as language use that appeals to our senses – creating mental images (Lacey and Lawson 2013; Sharma Paudyal 2023). For example, Burroway (1987) notes that a certain use of nouns that evoke sensory images and of verbs that represent visual actions makes writing “come alive.” This effect aligns with the broader concept of literary experientiality (Fludernik 1996). Experientiality describes how narratives prompt embodied engagement through a “quasi-mimetic evocation of real-life experience,” drawing on knowledge readers have acquired from their physical presence in the world. Psychological and reader-response studies further support this link, associating imagery with the intensity of emotional responses in reading and feelings of embodiment (Blackwell 2020; Goetz et al. 1993; Martínez 2024). By measuring Imageability, we can gain insights into how texts evoke embodied responses and better understand the immersive qualities that shape reader engagement.
When measuring the Imageability of literary texts, studies typically rely on dictionaries developed in psycholinguistic research that assign Imageability or concreteness scores to words (Feldkamp et al. 2024; J. T. Kao and Jurafsky 2015). However, the use of such word-based dictionaries presents issues. We identify three main limitations in current computational approaches to Imageability: conceptual vagueness, poor lexical coverage, and lack of compositionality at the sentence level. First, Imageability itself is not a straightforward property but tends to be inherently vague, with human judgments often diverging.1 It spans literal and figurative language, object descriptions, and metaphorical expressions that rely on shared cognitive schemas – not merely knowledge of the visual world. The precise nature of what Imageability encodes remains debated, and it is often conflated with related constructs such as visualness and concreteness. While some studies suggest that Imageability primarily reflects visual features of language (Ellis 1991), it strongly correlates with concreteness and Visuality (Brysbaert et al. 2014), making it difficult to isolate as a distinct linguistic property. To test the relationship to these related concepts, we would need to compare dictionaries for Imageability, Visuality, and concreteness.
A second main issue is that dictionary-based scores struggle with sentence-level Imageability. The accuracy of aggregated Imageability, Visuality, or concreteness scores at the sentence level is often poor (Verma et al. 2023), which may stem from the limited coverage of dictionaries and the questionable assumption that averaging word-level scores yields a meaningful sentence-level representation. A third issue is that dictionary-based approaches fail to account for compositionality – phrases such as “She painted a dark picture” and “She painted a picture in the dark” have different Imageability despite containing similar words.2
Regarding the latter issue, recent advances in natural language processing (NLP) have attempted to quantify Imageability using multimodal models that integrate textual and visual information (Verma et al. 2023; S. Wu and Smith 2023). However, approaches such as gauging Imageability through text-to-image generation output show uneven relation to human judgment, especially for literary texts (S. Wu and Smith 2023).
In sum, existing methods for computing Imageability scores face three main challenges: (1) the vague conceptualization of Imageability, (2) the limited lexical coverage of dictionary-based approaches, and (3) the difficulty of generalizing from word- or phrase-level scores to sentence-level Imageability. These issues are relevant for literary texts: Imageability is a core stylistic and aesthetic device; and since literature constructs immersive sensory experiences through language alone, it is an ideal domain for testing computational models of Imageability. Unlike instructional or descriptive texts, literary language frequently employs figurative expressions, ellipsis, and symbolic imagery – requiring more nuanced tools to capture sensory evocativeness at the sentence or paragraph level.
To address these limitations, we propose a data-driven, scalable approach that moves beyond static dictionary-based methods. Given the impact of Imageability and concreteness on immersivity – and, by extension, reader appreciation – we explore automatic assessment techniques based on text representations. Prior work has demonstrated the visual knowledge of text-only models (Sharma et al. 2024), while recent advances in multimodal models (Radford et al. 2021) offer new opportunities for capturing the visual dimension of language.
Examining the shape of both text-based and multimodal embeddings, we test their ability to approximate Imageability, concreteness, and/or Visuality scores.
Specifically, we evaluate their efficacy in characterizing literary texts through three experiments:
Word-level analysis: We assess the relationship between human Imageability scores and metrics of multimodal word embeddings for the entries of the Imageability dictionary.
Sentence-level analysis: We compare dictionary-derived scores with metrics of multimodal sentence embeddings for literary texts.
Literary case study: We examine the discriminatory power of these embedding-based metrics in distinguishing between text types where Imageability is expected to differ: Imagist poems vs. love poems.3
By systematically evaluating these approaches, we aim to develop a more robust framework for measuring Imageability in literary texts – one that accounts for compositionality, sentence structure, and the multimodal nature of literary imagery.
While our study does not include human annotations, it represents an initial computational exploration aimed at (1) testing the relationship between Imageability and related constructs such as concreteness and Visuality, and (2) evaluating the potential of embedding-based metrics to model Imageability beyond static, word-level ratings.
2. Related Works
2.1 Imageability in Literary Texts
The evocation of mental imagery in literary texts has been a debated topic in literary and psychological scholarship (Kuzmičová 2014). Despite its prominence in early 20th-century literary movements like Imagism, which emphasized clear, visual language and the rejection of abstraction (Pound 1913), the role of imagery and the Imageability of literature was often overlooked in Structuralist and New Critical frameworks, which prioritized linguistic networks and meaning-making (Ellen J. Esrock 1994). However, in recent years, the concept of Imageability has regained attention, with both literary scholars and psychologists increasingly examining its role in reader response (Kuzmičová 2014; Magyari et al. 2020; Martínez 2024; Sharma Paudyal 2023).
Imageability, defined as the ability of a text to evoke sensory experiences and mental imagery, is closely linked to the heightened emotional responses that images can provoke (Goetz et al. 1993) and the embodied nature of reading experiences (Martínez 2024). Literary passages that employ concrete, sensory language – those that engage the senses without explicit emotional cues – have been shown to elicit emotional responses from readers. For instance, Hemingway’s minimalist style (see our example above), which uses stark imagery without overt emotional direction, is perceived as emotionally charged by human readers, despite being classified as neutral by automatic sentiment analysis systems (Feldkamp et al. 2024). Furthermore, the evocation of interoceptive or physiological states can activate a reader’s embodied experience (Martínez 2024), while concrete language enhances emotional engagement and heightens suspense (Auracher and Bosch 2016).
While imagery, concreteness, and Imageability have long been concepts employed in literary analysis (Ellen J. Esrock 1994; Sharma Paudyal 2023), computational literary history studies have further shown how quantifying Imageability can be employed to characterize certain literary texts. For instance, studies of poetry have shown that Imagism, with its focus on direct, visual language, is associated with higher levels of Imageability. Additionally, a historical shift toward more concrete and imageable language in poetry has been observed, suggesting a gradual evolution of literary style over time (J. T. Kao and Jurafsky 2015).
However, quantifying Imageability – along with related concepts like concreteness and Visuality – is a challenge for computational literary analysis. These concepts are often defined and operationalized differently across domains. For instance, in more communicative texts, such as journalism, Imageability is often tied directly to sensory or visual representations, whereas literary language frequently employs imagery in more abstract or symbolic ways, and may use it more strategically and with greater nuance. The concept of “implicit” expression – “show, don’t tell” – is particularly significant in literature, where imagery is often used to evoke affect without explicitly naming it (Feldkamp et al. 2024), and literary scholarship frequently uses terms like ‘evocative’ or ‘understatement’ to describe authorial styles (Daoshan and Shuo 2014; Strychacz 2002), further emphasizing the subtlety of literary Imageability as a strategic tool. Furthermore, literary studies have made efforts to distinguish between conceptually different types of imagery (Kuzmičová 2014).
This implicit nature of literary expression poses additional challenges for computational methods that rely on standardized lexical resources. Its subtleties may not align with the operational definitions of Imageability found in existing dictionaries or lexicons.
2.2 Dictionaries for Imageability
The terms Imageability, concreteness, and Visuality are often used interchangeably, though they capture distinct but overlapping dimensions of language. While concreteness typically refers to the degree to which a word denotes a tangible entity, Imageability extends beyond the purely visual to include mental representations, including interoceptive states (Dellantonio et al. 2014).
In literary studies, these constructs have been applied in different ways. For example, J. Kao and Jurafsky (2012) use concreteness – or its reverse, abstractness – to assess literary imagery, while J. T. Kao and Jurafsky (2015) measure “concrete imagery” through a combination of object-word frequency, abstract-word frequency, and dictionary-based concreteness and Imageability scores.
Even when focusing specifically on Imageability, various resources have been developed, beginning in the 1960s with early psycholinguistic studies (Paivio et al. 1968). One of the first large-scale lexicons, the MRC database, was compiled in the 1980s (Coltheart 1981) and remains widely used despite its limited scope. More recent and expansive dictionaries have been developed, such as the 40,000-lemma concreteness lexicon by Brysbaert et al. (2014), which significantly surpasses the 4,800 lemmas found in the MRC lexicon. However, coverage across different dimensions remains uneven, with Imageability, Visuality, and concreteness lexicons varying in size and consistency.
2.3 Development of Models for Imageability
Recently, the visual aspect of Imageability has gained significant attention in Natural Language Processing (NLP), particularly in the context of text-to-image models like DALL-E. These models rely on dual processing of text and images, yet struggle when dealing with long-form text containing spans of non-visual content. As a result, visualness has been proposed as a useful metric for characterizing the prompt prior to generation, with the goal of improving the accuracy of text-to-image generation models (Chen et al. 2025; Verma et al. 2023).
For instance, Verma et al. (2023) introduces a binary classification task distinguishing imageable from non-imageable sentences to enhance prompt characterization before image generation. Similarly, Chen et al. (2025) explores the role of visualness in guiding the generation process, aiming to refine model outputs. Apart from augmented image generation, identifying imageable text might also have further downstream applications, such as on-the-fly visuals (Liu et al. 2023) and image-assisted video navigation (Zhao et al. 2019). However, while such binary classification is practical for generation and other tasks, it is not as useful for describing nuanced data, where we ideally want to maintain a level of granularity: gauging more or less imageable text.
Table 1: How Imageability relates to neighboring mental constructs.

| Construct | Scope | Relation to Imageability |
|---|---|---|
| Concreteness | Tangibility of a word’s referent | Partial overlap. Concrete words tend to be imageable, but abstract items such as “whirlwind” can evoke vivid scenes; conversely, “road” is concrete yet often yields weak imagery in isolation. |
| Visualness | Strength of visual associations | Proper subset. Imageability spans all modalities (auditory, tactile, olfactory, …), not vision alone. |
| Imagery (literary studies) | Textual clusters of sensory details | Complementary. Imagery is a textual feature; Imageability is the reader-side potential. |
To address the shortcomings of existing Imageability dictionaries, S. Wu and Smith (2023) propose methods that incorporate sentence compositionality, aiming to better capture the nuances of how Imageability evolves across different sentence structures. While their work shows promise in addressing fixedness in representations, a critical challenge remains: Many texts can evoke strong mental imagery in readers without these images being strongly encoded in a culturally shared or visual sense. For instance, creative and poetic language can provoke vivid imagery that is not directly tied to shared or commonsense visual representations. When the consistency of generated images is used as a proxy for Imageability (S. Wu and Smith 2023), this may actually measure the stability of a text’s representation rather than its inherent Imageability. Conversely, a non-visual passage may still elicit a text-to-image model to generate superficially coherent images. With high uncertainty, text-to-image models often generate visually similar images (i.e., images of actual text) that may appear coherent but do not necessarily align with any human reader’s mental image of the text.
Finally, methods for gauging Imageability may show variability across genres. S. Wu and Smith (2023) finds an insignificant correlation between their Imageability measure and human assessments of poem lines, yet a significant correlation for news sentences. This suggests that the effectiveness of Imageability metrics may depend on the genre and its inherent stylistic and thematic characteristics.
3. What We Mean by Imageability
We follow the psycholinguistic tradition in defining Imageability as the ease with which a linguistic expression evokes sensory representations in the mind of a typical reader, but we extend the expression from individual words to any contiguous span of text.
Given a reader r and a text span t, the Imageability I(t, r) is the subjective vividness of the multi-sensory mental imagery spontaneously elicited by t. In group studies, we use the expected value over readers:

$$I(t) = \mathbb{E}_{r}\left[I(t, r)\right]$$
Two implications follow. First, Imageability is a psychological potential, approximated by behavioral data or cognitively motivated proxies. Second, it tends to be compositional: The vividness of “he smoked a crooked, emerald-green cigarette” might not be a linear sum of the scores for its component words.
Ideally, we would move from the word to the sentence level without sacrificing scalability. Because Imageability is, by definition, a reader experience, ultimate confirmation demands sentence- or passage-level ratings. The present work should therefore be read as a bridging effort: dictionary-validated, embedding-based metrics ready for human calibration.4
Table 2: Granularity choices when measuring Imageability.
| Level | Typical operationalization | Limitations |
|---|---|---|
| Word | Psycholinguistic norms from 25–40 k-entry lexica (MRC, Lancaster, BLP) | Coverage gaps for literary vocabulary; ignores syntax and context. |
| Sentence/line | Human ratings (rare) or context-aware embeddings (this work) | Ratings costly; embeddings need interpretability. |
| Whole text | Aggregations (mean, max, entropy) over sentences | Sensitive to length and genre; reliant on robust lower-level scores. |
4. Resources
4.1 Dictionaries
For Experiment I, we utilize lexicon-based resources to analyze the Imageability of words, primarily relying on the MRC Psycholinguistic Imageability Lexicon (Coltheart 1981). This lexicon comprises 4,828 lemmas that have been rated for their capacity to evoke mental imagery. Additionally, we draw on two other well-established resources: the Concreteness Lexicon (Brysbaert et al. 2014), which assigns ratings to words based on their perceived tangibility and sensory grounding, and the Lancaster Sensorimotor Norms (Lynott et al. 2020), which provide detailed modality-specific perceptual ratings (e.g., visual, auditory, and tactile associations). As the lexica of the three resources largely overlap, we can systematically compare how they conceptualize and quantify Imageability, concreteness, and sensory experience. Given that previous research has noted a strong correlation between Imageability and concreteness, but also some key distinctions between them (Paivio et al. 1968), our analysis seeks to clarify the extent to which dictionary-based Imageability measures capture cognitive and perceptual properties distinct from general word concreteness and modality-specific sensory attributes.
The MRC Psycholinguistic Imageability Lexicon (Coltheart 1981): One of the earliest large-scale resources for word Imageability, this lexicon contains 4,828 lemmas, each rated based on the extent to which they evoke mental imagery. The ratings were collected from human participants, making it an empirically grounded resource for word-level Imageability. We compare this resource (here, imag) with later expansions and refinements: (1) Cortese and Fugett’s Imageability Ratings (here, imag C) (Cortese and Fugett 2004): an updated version that increases the coverage of Imageability scores and refines earlier ratings. (2) Reilly and Kean’s Formal Distinctiveness Model (here, imag R) (Reilly and Kean 2007): a lexicon that integrates and updates multiple prior resources, including the MRC, while filtering out words with mid-range Imageability ratings to focus on words that are strongly imageable or non-imageable.
The Visuality Lexicon of the Lancaster Sensorimotor Norms (Lynott et al. 2020): The Lancaster Sensorimotor Norms provide modality-specific sensory ratings (e.g., visual, auditory, tactile, and motor associations) for 39,707 English words. In our experiments, we use the Visuality scores specifically. Unlike general Imageability, Visuality captures the extent to which a word evokes a visual percept. This distinction is important because some words may be highly imageable but not strongly visual (e.g., “fragrance” or “melody”). Comparing Visuality to Imageability allows us to examine how modality-specific sensory experience aligns with broader notions of literary imagery.
The Concreteness Lexicon (Brysbaert et al. 2014): This dataset provides concreteness ratings for 40,000 words, where concreteness is defined as the degree to which a word refers to a tangible, physical entity. While concreteness and Imageability are often correlated, they are not identical concepts: Some abstract words (e.g., “freedom”) might be highly imageable due to their symbolic richness, while some concrete words (e.g., “rock”) may elicit limited mental imagery despite being physically tangible. By including concreteness as a comparative measure, we assess how word-level concreteness and Imageability interact, particularly in literary contexts.
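To make the comparison concrete, a minimal sketch of this alignment step is shown below; file names and column labels are hypothetical placeholders, since each published resource ships in its own format:

```python
# Sketch: aligning the three lexica on shared lemmas and correlating their
# ratings. File names and column labels are hypothetical placeholders.
import pandas as pd
from scipy.stats import spearmanr

mrc = pd.read_csv("mrc_imageability.csv")         # columns: word, imag
conc = pd.read_csv("brysbaert_concreteness.csv")  # columns: word, conc
vis = pd.read_csv("lancaster_visual.csv")         # columns: word, visual

# Inner-join on the lemma so that every entry carries all three ratings.
lex = mrc.merge(conc, on="word").merge(vis, on="word")

rho, p = spearmanr(lex["imag"], lex["conc"])
print(f"Imageability vs. concreteness: rho={rho:.2f}, p={p:.3g}")
```

The inner join restricts the comparison to the shared lexicon, which is also the subset used for the word-level correlations in Experiment I.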
4.2 Literary Texts
For Experiments II and III, we use full sentences from literary texts. Moving from single words to entire sentences enables an assessment of how Imageability, concreteness, and Visuality manifest in context.
For Experiment II, the dataset includes two modernist novels alongside a large-scale corpus of fiction:
The Old Man and the Sea by Ernest Hemingway (1952): This novel is characterized by concise, concrete descriptions and direct, unembellished prose, making it an ideal candidate for evaluating Imageability in an economical (yet vivid) narrative style.
Mrs. Dalloway by Virginia Woolf (1925): In contrast, Woolf’s novel employs stream-of-consciousness narration, featuring long, fluid sentences that foreground subjective perception with immersive sensory detail. Its contrast with Hemingway’s style allows us to test whether opposite stylistic techniques correlate with distinct levels of Imageability.
Sentences from the Chicago Corpus (1880–2000): A diverse dataset of 9,000 sentences randomly sampled from 9,000 different novels, to ensure a broad coverage of stylistic and historical variation in fiction. The corpus from which the sentences are sampled includes works ranging from canonical literature to lesser-known fiction, providing a representative snapshot of Anglophone prose writing across the 19th and 20th centuries.5 For further details, see Bizzoni et al. (2024) and Y. Wu et al. (2024).
All texts were segmented into sentences using SpaCy.6 The inclusion of both single-author novels and a large, multi-author corpus allows us to assess how Imageability varies both within and between different literary styles. In this context, Hemingway and Woolf serve as controlled case studies for contrasting narrative techniques: Hemingway’s prose is marked by concise, concrete descriptions of low abstraction, whereas Woolf’s stream-of-consciousness style favors immersive, introspective, but often highly imageable narration. These stylistic differences provide a useful basis for testing whether Imageability metrics capture differences in literary technique. Meanwhile, the Chicago Corpus, composed of diverse works spanning more than a century, offers a broadly representative dataset that enables generalization beyond the idiosyncrasies of individual authors.
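As a minimal sketch of this preprocessing step (assuming the en_core_web_sm pipeline named in footnote 6; the sample text is the epigraph from section 1):

```python
# Sketch: sentence segmentation with SpaCy (en_core_web_sm, see footnote 6).
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("The boy took the old army blanket off the bed and spread it over "
        "the back of the chair and over the old man's shoulders.")
sentences = [sent.text.strip() for sent in nlp(text).sents]
print(sentences)
```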
Table 3: Data.
| | Sentences | Year | Type |
|---|---|---|---|
| Hemingway | 1,928 | 1952 | prose |
| Woolf | 3,578 | 1925 | prose |
| Chicago Corpus | 9,000 | 1880–2000 | prose |
| Imagist | 1,195 | 1915 | poetry |
| Modern Love | 1,126 | 1896–1939 | poetry |
For Experiment III, we conduct a literary case study aimed at distinguishing Imagist poetry from more topic-based, modern love-themed poetry using embedding-based metrics and dictionary-derived scores. The choice of Imagist poetry for testing the distinguishing power of dictionary- and embedding-based features was also motivated by previous work indicating that Imagist poetry stands out in these dimensions (Gleason 2007, 2009; J. T. Kao and Jurafsky 2015). To this end, we utilize two distinct poetic datasets:
Some Imagist Poets (1915): An anthology compiled by Amy Lowell, which includes 37 poems authored by six poets. Imagist poetry is characterized by its emphasis on precise and concrete imagery, minimalism, and a rejection of abstraction, making it an ideal test case for computational measures of Imageability.
Modern love-themed poems: A selection of 74 poems by 22 authors, collected from The Poetry Foundation’s curated category ‘Love.’7
By juxtaposing these two corpora, we aim to assess whether computationally derived Imageability and concreteness scores can effectively differentiate poetic traditions that prioritize sensory evocation (Imagism) from those that may contain a broader range of abstract or figurative language (general love poetry). This comparison provides insight into the applicability of our methods for distinguishing stylistic and thematic variations in literary texts.
Note that in gauging the relationship between the dictionary-based features and the embeddings, we sum the Imageability, concreteness, and Visuality scores assigned via the dictionaries across sentences without normalizing for sentence length. This approach allows us to capture the total intensity of these features in the sentence, rather than averaging the intensity of individual words. We do this because we are interested in the overall presence or weight of these features in a given sentence. Literary texts may rely on the cumulative effect of imagery across sentences. This approach allows us to reflect the broader contextual presence that we also expect the embeddings to capture. In Experiment III, where we compare Imagist and Love poetry lines, we do normalize for line length to replicate the methodology used in J. T. Kao and Jurafsky (2015), which assigns scores by dividing the summed Imageability score of words by the number of words (extant in the dictionary).
It is important to note that the choice between raw sums and length-normalized sums matters when differentiating between groups or authors – which is what we do in Experiment III. For example, in the data summary (Table 4), Love poetry has, on average, higher summed Imageability than Imagist poetry. However, when the scores are normalized for line length, this trend is reversed, with Imagist poetry averaging 376.2 and Love poetry 363.7 (a minimal sketch of the two scoring variants follows Table 4).
Table 4: Sentence-level average (and Standard Deviation) Imageability, concreteness, and Visuality of datasets.
| | Imageability | Concreteness | Visuality |
|---|---|---|---|
| Hemingway | 4359.2 ± 3241.6 | 38.1 ± 28.4 | 33.7 ± 26.0 |
| normalized | 352.6 ± 38.0 | 2.8 ± 0.3 | 2.4 ± 0.3 |
| Woolf | 5156.8 ± 6167.0 | 45.5 ± 55.0 | 42.3 ± 51.5 |
| normalized | 350.6 ± 47.7 | 2.7 ± 0.4 | 2.5 ± 0.4 |
| Chicago Corpus | 5173.8 ± 5292.2 | 47.3 ± 46.3 | 43.1 ± 43.1 |
| normalized | 345.7 ± 37.1 | 2.7 ± 0.3 | 2.5 ± 0.3 |
| Imagist | 1870.6 ± 1027.3 | 17.1 ± 8.7 | 14.9 ± 7.9 |
| normalized | 376.2 ± 64.5 | 3.0 ± 0.5 | 2.6 ± 0.6 |
| Love | 2112.53 ± 846.7 | 18.3 ± 6.6 | 16.2 ± 6.1 |
| normalized | 363.7 ± 53.1 | 2.8 ± 0.5 | 2.5 ± 0.5 |
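To make the operational difference between the two scoring variants explicit, a minimal sketch is given below; imag_dict stands in for a hypothetical word-to-score lookup built from the MRC lexicon:

```python
# Sketch: summed vs. length-normalized dictionary scoring of a sentence.
# `imag_dict` is a hypothetical word -> Imageability-score lookup built
# from the MRC lexicon; only dictionary-covered words contribute.
def sentence_imageability(tokens, imag_dict, normalize=False):
    covered = [imag_dict[t.lower()] for t in tokens if t.lower() in imag_dict]
    if not covered:
        return 0.0
    total = sum(covered)
    # Experiment III divides by the number of covered words, following
    # Kao and Jurafsky (2015); Experiments I-II use the raw sum.
    return total / len(covered) if normalize else total
```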
The decision to normalize feature scores when using dictionaries in literary experiments is a question of operationalization that we want to underline. For example: Is a long sentence with many low- and high-imageable words more or less imageable than a shorter sentence with high-imageable words, if their summed scores are the same? In other words, do factors like density or brevity affect the perceived Imageability of sentences? This question can only be answered by comparing human Imageability judgments against both methods of score assignment. We leave this to future work and focus here only on the relationship between different systems (embeddings vs. dictionaries), not their relation to human judgment.
4.3 Embeddings
When it comes to embeddings, we employ the CLIP model (Radford et al. 2021), a multimodal vision-language model trained on large-scale image-text pairs. CLIP is designed to align textual and visual representations, making it particularly suitable for capturing the visual and concrete dimensions of language that are directly relevant to our study of Imageability.
Given its training objective, the semantic space of CLIP’s embeddings is expected to encode visual salience and concreteness more effectively than purely text-based models. This suggests that CLIP-based embeddings may provide a more explicit representation of Imageability compared to traditional word embeddings derived from text-only corpora.
However, the extent to which multimodal representations differ from text-based embeddings in encoding sensory and Imagistic properties remains an open question. To address this, we compare CLIP-based embeddings against those generated by a text-only model, specifically BERT. BERT embeddings provide a useful contrast, as they are derived solely from linguistic contexts without access to visual grounding.8 This comparison allows us to evaluate whether Imageability-related features emerge naturally in textual embeddings or whether multimodal supervision enhances their representation.
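The sketch below shows one way to obtain both kinds of sentence embeddings via Hugging Face transformers. The CLIP checkpoint name is an assumption (the paper’s exact checkpoint may differ), while bert-base-cased follows footnote 8; mean-pooling BERT’s token vectors is one common choice, not necessarily the one used here:

```python
# Sketch: extracting a CLIP text embedding and a BERT sentence embedding.
# "openai/clip-vit-base-patch32" is an assumed public checkpoint;
# bert-base-cased follows footnote 8.
import torch
from transformers import CLIPModel, CLIPTokenizer, BertModel, BertTokenizer

sentence = "The boy spread the blanket over the old man's shoulders."

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_emb = clip.get_text_features(**clip_tok(sentence, return_tensors="pt"))[0]

bert = BertModel.from_pretrained("bert-base-cased")
bert_tok = BertTokenizer.from_pretrained("bert-base-cased")
with torch.no_grad():
    hidden = bert(**bert_tok(sentence, return_tensors="pt")).last_hidden_state
    bert_emb = hidden.mean(dim=1)[0]  # mean-pooled token vectors
```

From either vector, the shape metrics defined in section 5 can be computed directly.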
It is important to underline that the psycholinguistic lexica used in Experiments I–III enter our pipeline only for validation. They provide a widely accepted yardstick against which to gauge whether a candidate metric is even plausible. Crucially, the mapping from embedding shape to Imageability is not fitted on dictionary scores – indeed, no fitting is required, because norm and entropy are closed-form functions of the raw vectors. In downstream applications (e.g., analyzing an unedited novel draft or surfacing vivid quotations for digital exhibits), the dictionaries can be dropped entirely.
5. Methods
We analyze the shape of both word and sentence embedding representations to determine how Imageability manifests in textual and multimodal embeddings. Specifically, we examine whether embeddings of highly imageable and concrete words exhibit distinct distributional properties compared to those of abstract or less imageable words.
Our initial hypothesis (H1) is that words with higher Imageability and concreteness might have more localized values in the embedding space, meaning that they might cluster more tightly within specific regions of the vector space. In contrast, embeddings of more abstract words, which might lend themselves to a larger array of contexts, may be more dispersed, leading to higher entropy in their distribution and lower norm (i.e., “strength”). To illustrate this, consider the word “dog”: Especially in a multimodal model like CLIP, its embedding is likely to concentrate most of its information on specific dimensions, while a more abstract term like “beautiful” is less directly tied to a specific visual referent and may exhibit a broader, more diffuse representation across the semantic space. The difference in embedding structure between these two cases can be quantified by analyzing the distribution of activation values across all dimensions.9
The opposite hypothesis is also a possibility (H2). Under this view, concrete words such as “dog” or “tree” may activate a broader range of dimensions due to their rich sensory associations across multiple modalities. In contrast, abstract words like “justice” or “hope” may activate fewer, more specific dimensions, resulting in a sharp activation profile with, for example, higher norm but lower entropy – akin to a “spike” along certain representational axes. This could occur if embeddings for abstract concepts rely on a small number of high-level semantic features (e.g., valence, affect, discourse function), while embeddings for concrete words require a more distributed, multimodal representation that increases their variance and entropy.
To formally test this, we compute various vector shape metrics on the sentence embeddings. These include:

- Norm (‖e‖₂). The Euclidean norm measures the overall magnitude of the embedding vector. It provides a sense of how large the values in the vector are, though it does not necessarily tell us about the distribution across dimensions. It is defined as:

  $$\|e\|_2 = \sqrt{\sum_{i=1}^{n} e_i^2}$$

  A sparse embedding (where most values are near zero and only a few are large) might have a lower norm, while more uniform activation leads to a higher norm.

- Entropy (H(e)). Entropy is defined as:

  $$H(e) = -\sum_{i=1}^{n} p_i \log(p_i + \epsilon), \qquad p_i = \frac{|e_i|}{\sum_{j=1}^{n} |e_j|}$$

  where e is the embedding vector, the pᵢ are probabilities obtained by applying a normalization to the elements of e, n is the number of components in the vector, and ϵ is a small constant to avoid log(0). We primarily construct these probabilities by taking the absolute value of each dimension and normalizing to sum to 1, thus ensuring nonnegative values interpretable as a pseudo-probability distribution over dimensions. This approach was chosen over the softmax transformation to avoid amplifying the largest embedding values exponentially, which can distort the distribution and reduce sensitivity to variations across smaller dimensions.10 The entropy reflects how evenly the embedding’s values are distributed across dimensions.

- Variance (σ²). It measures the spread of values in the embedding vector, indicating how much they deviate from the mean:

  $$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left(e_i - \bar{e}\right)^2$$

  where eᵢ is the value of the embedding at dimension i, n is the total number of dimensions, and ē is the mean of these values. Higher variance suggests a more dispersed representation, potentially reflecting greater contextual flexibility, while lower variance may indicate a more compact, feature-specific encoding. This is crucial in evaluating whether highly imageable embeddings are tightly clustered or broadly distributed in a semantic space.

- Sparsity Ratio (S(e)). The sparsity ratio can be defined as:

  $$S(e) = \frac{\|e\|_1}{\sqrt{n}\,\|e\|_2}$$

  where e is the embedding vector, n is the number of components in the vector, ‖e‖₁ is the Manhattan norm (the sum of the absolute values of the elements of e), and ‖e‖₂ is the Euclidean norm (the square root of the sum of the squared elements of e). The sparsity ratio gives us an idea of how densely populated the embedding is, with lower values indicating higher sparsity.
Note: For norm and entropy, which are less intuitive measures, we show the distribution of values over embedding dimensions for the embeddings with the highest/lowest entropy and norm of the Chicago Corpus sentence samples (see Appendix A, Figure 4 and Figure 5).
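A compact reference implementation of the four metrics as defined above (with a small ϵ guarding against log(0) and division by zero) might look as follows:

```python
# Sketch: the four vector-shape metrics defined above, for a 1-D numpy
# embedding vector e. EPS guards against log(0) and division by zero.
import numpy as np

EPS = 1e-12

def norm(e):
    return float(np.linalg.norm(e))  # Euclidean norm ||e||_2

def entropy(e):
    p = np.abs(e) / (np.abs(e).sum() + EPS)  # absolute-value normalization
    return float(-(p * np.log(p + EPS)).sum())

def variance(e):
    return float(np.var(e))  # sigma^2 over dimensions

def sparsity_ratio(e):
    # ||e||_1 / (sqrt(n) * ||e||_2); lower values indicate higher sparsity.
    n = e.shape[0]
    return float(np.abs(e).sum() / (np.sqrt(n) * np.linalg.norm(e) + EPS))
```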
To evaluate the validity of these embedding-based metrics as indicators of Imageability, we correlate them with dictionary-based Imageability, concreteness, and Visuality scores. Specifically, we compare values derived from the lexical resources of the MRC Psycholinguistic Imageability Lexicon (Coltheart 1981), the Concreteness Lexicon (Brysbaert et al. 2014), and the Lancaster Sensorimotor Norms (Lynott et al. 2020) against our computed embedding shape properties (defined above). By computing Spearman correlations, we assess the degree to which embedding metrics reflect known psycholinguistic properties of the lexicon.
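Reusing the merged lexicon sketched in subsection 4.1 and the metric functions above, this validation step reduces to a few lines; embed stands in for a hypothetical word-to-vector lookup into the CLIP (or BERT) space:

```python
# Sketch: correlating an embedding-shape metric with dictionary ratings.
# `lex` is the merged lexicon sketched in subsection 4.1; `embed` is a
# hypothetical word -> vector lookup; `entropy` is defined above.
from scipy.stats import spearmanr

entropies = [entropy(embed(word)) for word in lex["word"]]
for col in ("imag", "conc", "visual"):
    rho, _ = spearmanr(entropies, lex[col])
    print(f"entropy vs. {col}: rho={rho:.2f}")
```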
Since dictionary-based scores are primarily word-level ratings, we then extend our analysis to the sentence level by aggregating word-level values across sentences. Specifically, for each sentence, we sum the Imageability, concreteness, and Visuality scores of the words that are present in the dictionaries (see the end of subsection 4.2 for details). This allows us to compare sentence embeddings with dictionary-based metrics computed over entire sentences.
It is important to note that the transition from word-level representations to sentence-level embeddings is non-trivial. The Imageability of a sentence is not simply the sum of its individual words’ scores; rather, it depends on syntactic structure, compositionality, and context-dependent meaning shifts, which cannot be captured by sums of dictionary scores. For example, a sentence like “The sky darkened before the storm” contains words with varying individual Imageability scores, but their combined effect creates a vivid, scene-setting description. Conversely, a sentence with highly imageable words may still lack clear imagery if its structure is abstract or ambiguous.
At the same time, sentence embeddings are not simple averages of word representations: They incorporate contextual interactions, modifying the contribution of each word depending on its grammatical role and semantic dependencies. This means that embedding-based metrics may diverge significantly from dictionary-based scores when applied at the sentence level.
6. Results
6.1 Experiment I: Correlations at the Word Level
Figure 1 presents the correlation matrix for dictionary-based and embedding-derived metrics, computed over the MRC dictionary lemmas (i.e., the subset of words appearing across all dictionaries used in this study). The figure illustrates the relationships both within dictionary-based scores and within embedding-based metrics, as well as their intercorrelations.
Figure 1: Comparison of embedding metrics and dictionary scores using the multimodal model and BERT. Numbers refer to the Spearman coefficient. Note that imag C and imag R refer to the two expansion dictionaries of Imageability, see subsection 4.1, while imag refers to the general MRC Imageability dictionary.
Internal Correlations in Dictionary-based Scores. We observe a strong mutual correlation among dictionary-based metrics. Notably, Reilly and Kean's Imageability (imag R) exhibits a correlation with MRC Imageability (imag) comparable in magnitude to its correlation with Concreteness. This suggests that, in practice, Imageability and concreteness are not sharply distinguished in these resources – at least at the word level. Although the two measures are conceptually distinct, their empirical overlap aligns with prior research indicating a strong connection between how vividly a word evokes imagery and how concrete its referent is.
Interestingly, Visuality shows a weaker correlation with both Imageability and Concreteness, suggesting that the dictionary-based concept of Imageability is more strongly associated with tactile or sensorimotor properties than with purely visual modalities. This reinforces the idea that Imageability, as defined in psycholinguistic lexica, captures a broader range of sensory experiences beyond visual salience alone.
Internal Correlations in Embedding-based Metrics. Turning to the embedding-derived metrics, we find an even stronger internal correlation structure. For instance, norm and entropy exhibit an inverse relationship, indicating that embeddings with higher activation magnitudes (higher norm) tend to have more localized values, while those with lower norm tend to have more evenly spread, high-entropy distributions. Because our entropy calculation involves absolute-value normalization, direct comparison of entropy values across embedding types (i.e., CLIP vs. BERT) should be interpreted cautiously, though relative trends within each model remain informative.1112
This is consistent with our hypothesis that abstract words may have sharp activation spikes in fewer dimensions, whereas concrete words may activate a broader set of features across the embedding space.
Correlations between Dictionary and Embedding Metrics. Across all embedding-based metrics, we find a significant correlation with dictionary-derived scores, particularly with Imageability and Concreteness (ρ = .55 – .61). This suggests that embedding norms, entropy, and related properties encode information that aligns with human-annotated word Imageability and concreteness ratings. As we transition to the sentence level in subsequent experiments, we investigate whether these correlations persist when compositional effects come into play.
Moreover, when comparing our results with a text-only model, we find that the previously observed correlations do not fully hold. While there are slight correlations between Imageability (imag R) and embedding-derived metrics such as norm, entropy, and variance, these are notably weaker than those found using the multimodal CLIP model (Figure 1).
Additionally, the internal correlations between embedding-based metrics are less pronounced in BERT. Specifically, the relation between variance and entropy drops, as well as the relation between norm and entropy (suggesting that the strong correlations between these in the CLIP model do relate to the model’s reliance on probability conversion).
Given the strong internal correlations of the embedding metrics in the CLIP model embeddings, we retain only norm and entropy for the next two experiments, while maintaining Imageability, Visuality, and concreteness as dictionary-based features.13
6.2 Experiment II: Sentence-level Analysis in Literary Texts
For correlations of metrics across sentences, presented in Table 5, we find patterns very similar to those in Experiment I. That is, across all three literary datasets, Imageability has a negative relationship to embedding norm and a positive relationship to embedding entropy. The direction and strength of these correlations are similar for Concreteness and Visuality. We find the strongest correlations within the Hemingway sentences (min. |ρ| = 0.61) and the weakest for the Chicago Corpus sentences (min. |ρ| = 0.31; Table 5). This suggests that embedding norm and entropy maintain their correlation direction with the dictionary-based features when scores are aggregated at the sentence level, but that the strength of the correlation differs with the type of literature. Among Chicago Corpus sentences, where correlations between embedding- and dictionary-based metrics are lowest, we might expect the diversity of the sample – across genres, styles, and decades – to play a role.
Table 5: Spearman correlations between dictionary scores of sentences and the norms and entropies of sentence embeddings across our literary data.
| | Data | Imageability | Visuality | Concreteness |
|---|---|---|---|---|
| Norm | Hemingway | -0.61 | -0.63 | -0.62 |
| | Woolf | -0.44 | -0.46 | -0.46 |
| | Chicago Corpus | -0.31 | -0.37 | -0.36 |
| Entropy | Hemingway | 0.60 | 0.63 | 0.62 |
| | Woolf | 0.42 | 0.44 | 0.44 |
| | Chicago Corpus | 0.31 | 0.37 | 0.37 |
Still, within each group of literary data (Hemingway, Woolf, Chicago Corpus), more and less imageable sentences (according to the dictionary) differ in the norm and entropy of their embeddings. To examine how norm and entropy are distributed within the groups, and to illustrate this effect, we selected two example sentences as reference points for high and low Imageability:
Highly imageable: “The thin white surgical gloves he wore as he pumped the gas looked like pale skin.”
Non-imageable: “Wishful thinking as the saying goes.”
These two sentences were among the top and bottom ten in terms of dictionary-based Imageability scores among all sentences in our data.
The contrast between these examples highlights the degree to which descriptive, sensory-rich language correlates with embedding structure, where highly imageable sentences appear to cluster in regions with higher entropy and lower norm.
As shown in Figure 2, all three groups of literary data tend to be skewed toward lower entropy, so that highly imageable sentences predominantly fall in the upper tail of the entropy distribution.
6.3 Experiment III: Comparing Poems
In our final experiment, we compare Imagist poems to an assorted set of Modern Love poems not constrained by any specific literary movement. Previous research by J. T. Kao and Jurafsky (2015) found that, compared to 19th-century poetry, Imagist poetry exhibits higher levels of object mentions, Imageability, and concreteness, and lower levels of abstraction – particularly when measured using the MRC Imageability Dictionary (Coltheart 1981) and the Brysbaert Concreteness Lexicon (Brysbaert et al. 2014) – both of which are also used in this study. To ensure consistency with previous methodologies, we compute Imageability, Visuality, and concreteness using the same approach as J. T. Kao and Jurafsky (2015).14
Figure 3 presents the juxtaposition of the two poetic traditions in terms of their embedding-derived and dictionary-based features. The marked lines indicate two poem lines that were among the top ten most imageable and the bottom ten least imageable lines of poetry in the full set (Imagist & Love poems). These were:
Highly imageable: “Homespun, dyed butternuts dark gold color.”
Non-imageable: “Of insidious intent”
The figure shows where these two lines are positioned in terms of each measure.
This comparison shows that embedding-based metrics do distinguish Imagist poetry, and that Imagist poetry does exhibit higher Imageability and concreteness at the sentence level, aligning with prior findings on its heightened emphasis on sensory detail and concrete imagery. These findings are supported by t-tests between the groups (Table 6).
Table 6: T-test and Mann-Whitney U results (for comparison) for dictionary-based features and embedding-based metrics between Imagist & Love poems groups. The largest statistic for each variable is in bold, and the second-largest is underlined. All tests were significant with p < 0.01.
| | T-test | Mann-Whitney U |
|---|---|---|
| embedding norm | -6.69 | 567,900.00 |
| embedding entropy | 7.19 | 783,363.00 |
| embedding sparsity | 6.44 | 773,000.00 |
| Imageability | 5.07 | 734,321.50 |
| Visuality | 4.98 | 749,496.00 |
| Concreteness | 7.93 | 796,327.00 |
7. Discussion
Our findings suggest that embedding structure meaningfully reflects psycholinguistic properties of words, particularly in relation to Imageability, concreteness, and Visuality. Across our experiments, we observe that embedding norm and entropy serve as reliable indicators of a text’s sensory specificity, but in a manner contrary to our initial hypothesis (H1). Instead, our results provide stronger evidence for H2, indicating that concrete and imageable words and sentences exhibit higher entropy and lower norm, while abstract words show lower entropy and higher norm. This suggests that highly concrete words are encoded in a more diffuse and broadly distributed manner, engaging multiple representational dimensions, whereas abstract words activate fewer dimensions more intensely, leading to sharper activation peaks (high norm) but a more compressed distribution (low entropy). We see this pattern when visualizing the embeddings with the highest and lowest norms, as well as the highest and lowest entropy, where low-entropy/high-norm embeddings appear to exhibit longer tails and more dimensions with zero values; while high-entropy/low-norm embeddings show values more evenly centered around zero (see Appendix A, Figure 4 and Figure 5).
This pattern challenges the assumption that concrete words, due to their contextual constraints, would be more compactly represented in embedding space. Instead, it appears that concreteness leads to a more dispersed activation profile. This might reflect semantic affordances – that is, concrete words can be associated with a rich variety of semantic features, leading to a broader spread across dimensions. In contrast, abstract words tend to be semantically constrained to fewer, high-level conceptual dimensions, which results in embeddings with spiky, high-norm activations concentrated in a limited set of representational axes. This raises new questions about how different embedding models distribute meaning across dimensions – particularly whether multimodal training systematically encourages broader activation patterns for sensory-rich words compared to text-only embeddings (i.e., BERT).
At the sentence level, we observe a similar effect: Literary texts exhibit a strong skew toward lower norm and higher entropy, with particularly imageable sentences spreading their representational load across more dimensions, while abstract sentences cluster in sharper, more concentrated regions of the embedding space.
Our analysis of Imagist vs. Modern Love poetry provides further confirmation that the shape of semantic embeddings encapsulates Imageability-related psycholinguistic features. Consistent with prior research, we find that Imagist poetry exhibits higher overall Imageability and concreteness, with embedding structures reflecting a more diffuse, multimodal distribution (low-norm/high-entropy). This finding reinforces the idea that Imageability is not merely a product of genre conventions but is actively shaped by individual sentence composition. In contrast, Modern Love poetry – while still employing rich figurative language – tends to contain more conceptual abstraction and affective expression, which aligns with a more sharply clustered, locally spiky, embedding representation – reflected in the generally high-norm/low-entropy shape of their embeddings.
Taken together, these findings suggest that norm/entropy may act as a dictionary-free proxy for readers’ experience of vividness. The Imagist–Love case study demonstrates genre-level separability even under lexical control, indicating that the signal is not reducible to word-level concreteness counts. At the same time, benchmarking against legacy dictionaries offers an interpretable bridge to prior literature.
8. Conclusion
Our study suggests that the computational representation of sensory experience in embeddings follows distinct structural patterns for concrete vs. abstract language. Specifically, we find that highly concrete and imageable words exhibit greater entropy and lower norm, reflecting a more distributed, multimodal representation, whereas abstract words show lower entropy and higher norm, indicative of sharper, more localized activation patterns. These findings challenge our original assumptions about the compactness of concrete word representations and the way linguistic meaning is distributed across high-dimensional embedding spaces. Moreover, our results reinforce the role of multimodal models like CLIP in capturing sensorimotor properties, while text-only models like BERT appear to encode Imageability less systematically.
Finally, if this approach is valid, it can constitute a method for dictionary-free inference on text Imageability. Once the mapping from embedding shape (norm, entropy) to an Imageability score is learned, no external lexicon is required at inference time. Any sentence – whether it contains out-of-vocabulary words, creative neologisms, or code-switched phrases – can be scored in a single forward pass through a pre-trained model.15
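As an illustration only – not the paper’s fitted model – such a mapping could be as simple as a linear regression from the two shape metrics to dictionary scores, after which inference needs no lexicon:

```python
# Sketch: a hypothetical learned mapping from embedding shape to an
# Imageability score. Reuses `lex`, `embed`, `norm`, and `entropy` from
# the earlier sketches; a linear model is an illustrative choice only.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[norm(embed(w)), entropy(embed(w))] for w in lex["word"]])
y = lex["imag"].to_numpy()
mapper = LinearRegression().fit(X, y)

def imageability_score(embedding):
    """Dictionary-free score for any sentence or word embedding."""
    return float(mapper.predict([[norm(embedding), entropy(embedding)]])[0])
```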
A key next step is to directly evaluate embedding-based metrics – alongside dictionary-derived features – against human judgments of sentence Imageability. While our study establishes that embedding norms and entropies exhibit trends similar to dictionary-based Imageability and concreteness, it remains unclear how well these computational features actually predict human-perceived sensory vividness.
This issue is particularly pressing because dictionaries, though widely used in psycholinguistics and NLP, are inherently limited – especially when extended to sentence-level interpretation, where contextual and compositional effects play an important role in human interpretation. We have demonstrated the relationship between these metrics, but it remains an open question whether embeddings might actually outperform lexicon-based methods in capturing human Imageability judgments – or whether they introduce biases or artifacts not present in traditional resources.
Further work should also explore a broader range of multimodal architectures, including models with more fine-grained visual-text alignment (e.g., DALL·E’s prior networks, BLIP, or fine-tuned vision-language transformers).
If multimodal embeddings systematically encode sensory experience, they could offer a scalable alternative to hand-annotated psycholinguistic resources, which are costly and relatively limited in scope. This is particularly relevant for literary studies, where large-scale human annotation of Imageability, concreteness, or perceptual vividness remains impractical outside standard modern English and a few other languages.
Further validation is needed before applying these embedding-based metrics to broader literary studies – though the same could arguably be said of applying Imageability dictionaries at the sentence level. If proven reliable, these methods could enable large-scale investigations into the evolution of prose styles, genre-specific Imageability trends, and historical shifts in literary sensory encoding. Additionally, it would be valuable to compare literary texts to non-literary domains, such as journalistic writing, political rhetoric, or scientific discourse, to better understand how Imageability and perceptual concreteness vary across communicative registers.
9. Data Availability
Data and code can be found here: https://github.com/centre-for-humanities-computing/imageability_jcls. They have been archived and are persistently available at: https://doi.org/10.5281/zenodo.17821875.
10. Author Contributions
Yuri Bizzoni: Conceptualization, Methodology, Formal analysis, Validation, Resources, Writing – original draft
Pascale Feldkamp: Conceptualization, Methodology, Formal analysis, Resources, Visualization, Writing – original draft
Kristoffer L. Nielbo: Methodology, Formal analysis, Validation, Funding acquisition, Writing – original draft
Notes
- In Verma et al. (2023), inter-annotator agreement for sentence-level Imageability ratings was 0.45 (Krippendorff’s α). Note that this was for non-literary texts; we might expect literary texts to yield even lower agreement, as seems to be the case in annotation tasks for other concepts (Feldkamp et al. 2024). [^]
- Also, note that most dictionaries assign Imageability scores at the lemma level, abstracting from the word-form level. Working from lemmas means that the variations in word forms – such as ‘painted’ vs. ‘paint’ – have the same Imageability score, even though differences in tense and part-of-speech category may evoke a different intensity and, in theory, a different set of sensory associations for human readers. [^]
- In this third experiment, we use Imagist poems as a testbed to probe whether multimodal embeddings capture stylistic and sensory variation. This should not be taken to imply a direct equivalence between Imageability and Imagism, nor a reduction of poetic imagery to literal visual representation. Nonetheless, the historical emphasis of Imagist poetry on economy, concreteness, and sensory immediacy makes it a useful comparative corpus for our purposes. [^]
- The final experiment of this paper already uses implicit human judgments by distinguishing two different literary genres. [^]
- It should be noted that the dataset is predominantly English-language fiction, potentially limiting its generalizability to other linguistic traditions. [^]
- Specifically, we employed the SpaCy en_core_web_sm model (Montani et al. 2023), which provides robust sentence segmentation for literary texts, ensuring consistent parsing across different styles. [^]
- To maintain clear genre distinctions, we excluded poets in Some Imagist Poets from the love-themed poetry dataset. For a complete list of included poems and authors, see Appendix B. The dataset is available at: https://huggingface.co/datasets/merve/poetry. [^]
- We selected BERT due to its widespread use in word embedding research and literary analysis. In particular, the bert-base-cased model (Devlin et al. 2019) is frequently applied for semantic and stylistic investigations in computational literary studies (Grisot et al. 2022; Paragini and Kestemont 2022; Silva et al. 2023). [^]
- This is also the hypothesis of Hessel et al. (2018, 2194) in quantifying visual concreteness, who write: “Intuitively, a visually concrete concept is one associated with more locally similar sets of images; for example, images associated with ‘dog’ will likely contain dogs, whereas images associated with ‘beautiful’ may contain flowers, sunsets, weddings, or an abundance of other possibilities.” [^]
- To verify the robustness of this approach, we also computed probabilities using a softmax transformation, which produced near-identical entropy values across our experiments. This confirms that our simpler absolute-value normalization provides a consistent and interpretable proxy for measuring the spread or concentration of embedding activations. Since the softmax introduces nonlinearity, embeddings with primarily positive values (e.g., CLIP) may yield systematically different entropy scores under it; the absolute-value approach balances interpretability and computational simplicity. [^]
- See section 5 for details on entropy computation and normalization procedures. Note that the relation between embeddings’ norms and entropy is not a necessary one, but a by-product of CLIP’s training objectives. The strong correlation between norm and entropy in the multimodal CLIP model likely stems from the way the model processes text: since CLIP relies on softmax at various stages to encode textual inputs, its embeddings inherently carry a probabilistic structure. Computing entropy, which also transforms embeddings into a probability distribution, can then introduce an automatic dependency on the norm – embeddings with higher norms tend to distribute probability mass differently, leading to a systematic correlation between norm and entropy. To avoid enforcing this relation, we ensured that we did not use softmax when computing entropy, although the softmax approach was also tested. [^]
- As with the relation between norm and entropy, the relation between variance and entropy is strong in the CLIP model but not in the BERT model. Again, this may be related to the normalization applied when generating CLIP embeddings, which ties the variance to the entropy – both then depend on the scaling of the data. [^]
- Alternative Imageability dictionaries were excluded due to the smaller size of their lexica and due to their high correlation with the MRC Imageability dictionary, making them redundant for our analysis. [^]
- Unlike previous studies, however, we conduct our analysis at the poem-line level rather than at the poem level. That is, we calculate the average Imageability score across the words in each line that appear in the dictionary, rather than aggregating over entire poems. This allows us to maintain higher granularity and a larger number of data points, providing a more detailed view of how Imagist poetic discourse – not just entire poems – manifests Imageability; a toy example of this scoring is given after these notes. [^]
- Scalability therefore stems from the hundreds of millions of image–text pairs used to fit CLIP, not from the 5,000–40,000 entries of psycholinguistic dictionaries. In this sense, our approach is data-driven in deployment, and we use the legacy lexica only as a hold-out benchmark during evaluation. The distinction mirrors practice in automatic speech-recognition research, where acoustic models are trained on broadcast audio but validated against a much smaller, human-transcribed test set. [^]
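For concreteness, the sentence segmentation step described in the notes above can be sketched as follows. This is a minimal illustration of how the en_core_web_sm pipeline is typically invoked, assuming plain-text passages as input; the `segment_sentences` helper is ours, not part of the original pipeline.

```python
# Minimal sketch: sentence segmentation with spaCy's en_core_web_sm pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")  # the model referenced in the notes above

def segment_sentences(text: str) -> list[str]:
    """Return the sentences of a passage, as detected by spaCy."""
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

print(segment_sentences("The old man was asleep. The boy sat and watched him."))
```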
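The entropy measure discussed in the notes above can likewise be sketched in a few lines; this is our minimal reconstruction under the stated assumptions (absolute-value normalization as the main method, softmax only as a robustness check), not the exact implementation.

```python
# Sketch: entropy of a sentence embedding under two normalizations.
import numpy as np

def entropy_abs(embedding: np.ndarray) -> float:
    """Shannon entropy after absolute-value normalization (no softmax)."""
    p = np.abs(embedding)
    p = p / p.sum()
    p = p[p > 0]                      # guard against log(0)
    return float(-(p * np.log(p)).sum())

def entropy_softmax(embedding: np.ndarray) -> float:
    """Robustness variant: softmax turns the embedding into probabilities."""
    z = embedding - embedding.max()   # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p)).sum())

e = np.random.default_rng(0).normal(size=512)  # stand-in for a CLIP sentence embedding
print(entropy_abs(e), entropy_softmax(e))
```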
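Finally, the line-level dictionary scoring described in the notes above reduces to a lookup-and-average. The toy lexicon and scores below are hypothetical (the experiments use the MRC Imageability norms), and the whitespace tokenizer is deliberately simplistic.

```python
# Toy sketch: average Imageability of the in-dictionary words of one poem line.
imageability = {"whirl": 4.9, "sea": 6.0, "pine": 5.7, "rock": 6.2}  # hypothetical scores

def line_score(line: str, lexicon: dict[str, float]) -> float | None:
    """Average the lexicon scores of the words in a line; None if no word is covered."""
    tokens = [t.strip(".,;:!?—").lower() for t in line.split()]
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else None

print(line_score("Whirl up, sea—", imageability))  # a line from H.D.'s "Oread"
```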
References
Auracher, Jan and Hildegard Bosch (2016). “Showing with Words: The Influence of Language Concreteness on Suspense”. In: Scientific Study of Literature 6 (2), 208–242. http://doi.org/10.1075/ssol.6.2.03aur.
Bizzoni, Yuri, Pascale Feldkamp Moreira, Ida Marie S. Lassen, Mads Rosendahl Thomsen, and Kristoffer Nielbo (2024). “A Matter of Perspective: Building a Multi-Perspective Annotated Dataset for the Study of Literary Quality”. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Ed. by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue. ELRA and ICCL, 789–800. https://aclanthology.org/2024.lrec-main.71 (visited on 11/11/2025).
Blackwell, Simon E. (2020). “Emotional Mental Imagery”. In: The Cambridge Handbook of the Imagination. Ed. by Anna Abraham. Cambridge Handbooks in Psychology. Cambridge University Press, 241–257. http://doi.org/10.1017/9781108580298.016.
Brysbaert, Marc, Amy Beth Warriner, and Victor Kuperman (2014). “Concreteness Ratings for 40 Thousand Generally Known English Word Lemmas”. In: Behavior Research Methods 46 (3), 904–911. http://doi.org/10.3758/s13428-013-0403-5.
Burroway, Janet (1987). Writing Fiction: A Guide to Narrative Craft. Little, Brown.
Chen, Yufeng, Guanghui Yue, Weide Liu, Chenlei Lv, Ruomei Wang, Fan Zhou, and Baoquan Zhao (2025). “Predicting Plain Text Imageability for Faithful Prompt-Conditional Image Generation”. In: PRICAI 2024: Trends in Artificial Intelligence. Ed. by Rafik Hadfi, Patricia Anthony, Alok Sharma, Takayuki Ito, and Quan Bai. Springer Nature, 89–95. http://doi.org/10.1007/978-981-96-0122-6_9.
Coltheart, Max (1981). “The MRC Psycholinguistic Database”. In: The Quarterly Journal of Experimental Psychology A: Human Experimental Psychology 33 (4), 497–505. http://doi.org/10.1080/14640748108400805.
Cortese, Michael J. and April Fugett (2004). “Imageability Ratings for 3,000 Monosyllabic Words”. In: Behavior Research Methods, Instruments, & Computers 36 (3), 384–387. http://doi.org/10.3758/BF03195585.
Ma, Daoshan and Shuo Zhang (2014). “A Discourse Study of the Iceberg Principle in A Farewell to Arms”. In: Studies in Literature and Language 8 (1), 80–84.
Dellantonio, Sara, Claudio Mulatti, Luigi Pastore, and Remo Job (2014). “Measuring Inconsistencies Can Lead You Forward: Imageability and the X-ception Theory”. In: Frontiers in Psychology 5, 1–9. http://doi.org/10.3389/fpsyg.2014.00708.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Ed. by Jill Burstein, Christy Doran, and Thamar Solorio. Association for Computational Linguistics, 4171–4186. http://doi.org/10.18653/v1/N19-1423.
Ellis, Nick (1991). “Chapter 21 Word Meaning and the Links between the Verbal System and Modalities of Perception and Imagery or In Verbal Memory the Eyes See Vividly, but Ears Only Faintly Hear, Fingers Barely Feel and the Nose Doesn’t Know”. In: Advances in Psychology. Ed. by Robert H. Logie and Michel Denis. Mental Images in Human Cognition 80. North-Holland, 313–329. http://doi.org/10.1016/S0166-4115(08)60521-X.
Esrock, Ellen J. (1994). The Reader’s Eye. Johns Hopkins University Press. http://archive.org/details/readerseyevisual00esro (visited on 11/11/2025).
Feldkamp, Pascale, Ea Lindhardt Overgaard, Kristoffer Laigaard Nielbo, and Yuri Bizzoni (2024). “Sentiment Below the Surface: Omissive and Evocative Strategies in Literature and Beyond”. In: CHR 2024: Computational Humanities Research Conference. Ed. by Wouter Haverals, Marijn Koolen, and Laure Thompson. https://ceur-ws.org/Vol-3834/paper98.pdf (visited on 11/11/2025).
Fludernik, Monika (1996). “Towards a ‘Natural’ Narratology”. In: Journal of Literary Semantics 25 (2), 97–141. http://doi.org/10.1515/jlse.1996.25.2.97.
Gleason, Daniel W. (2007). “Seeing Imagism: A Poetics of Literary Visualization”. Ph.D. Thesis. Evanston, Ill: Northwestern University.
Gleason, Daniel W. (2009). “The Visual Experience of Image Metaphor: Cognitive Insights into Imagist Figures”. In: Poetics Today 30 (3), 423–470. http://doi.org/10.1215/03335372-2009-002.
Goetz, Ernest T., Mark Sadoski, Michael L. Stowe, Thomas G. Fetsco, and Susan G. Kemp (1993). “Imagery and Emotional Response in Reading Literary Text: Quantitative and Qualitative Analyses”. In: Poetics 22 (1-2), 35–49. http://doi.org/10.1016/0304-422X(93)90019-D.
Grisot, Giulia, Federico Pennino, and J. Berenike Herrmann (2022). “Predicting Sentiments and Space in Swiss Literature Using BERT and Prodigy”. In: CHR 2022: Computational Humanities Research Conference. https://nbn-resolving.org/urn:nbn:de:0070-pub-29691146 (visited on 12/08/2025).
Hessel, Jack, David Mimno, and Lillian Lee (2018). “Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Ed. by Marilyn Walker, Heng Ji, and Amanda Stent. Association for Computational Linguistics, 2194–2205. http://doi.org/10.18653/v1/N18-1199.
Kao, Justine and Dan Jurafsky (2012). “A Computational Analysis of Style, Affect, and Imagery in Contemporary Poetry”. In: Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature. Ed. by David Elson, Anna Kazantseva, Rada Mihalcea, and Stan Szpakowicz. Association for Computational Linguistics, 8–17. https://aclanthology.org/W12-2502/ (visited on 11/11/2025).
Kao, Justine and Dan Jurafsky (2015). “A Computational Analysis of Poetic Style: Imagism and Its Influence on Modern Professional and Amateur Poetry”. In: Linguistic Issues in Language Technology 12 (3). https://aclanthology.org/2015.lilt-12.3/ (visited on 11/11/2025).
Kuzmičová, Anežka (2014). “Literary Narrative and Mental Imagery: A View from Embodied Cognition”. In: Style 48 (3), 275–293. https://www.jstor.org/stable/10.5325/style.48.3.275 (visited on 11/11/2025).
Lacey, Simon and Rebecca Lawson (2013). “Introduction”. In: Multisensory Imagery. Ed. by Simon Lacey and Rebecca Lawson. Springer, 1–8. http://doi.org/10.1007/978-1-4614-5879-1_1.
Liu, Xingyu “Bruce”, Vladimir Kirilyuk, Xiuxiu Yuan, Alex Olwal, Peggy Chi, Xiang “Anthony” Chen, and Ruofei Du (2023). “Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals”. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. Ed. by Albrecht Schmidt, Kaisa Väänänen, Tesh Goyal, Per Ola Kristensson, Anicia Peters, Stefanie Mueller, Julie R. Williamson, and Max L. Wilson. Association for Computing Machinery, 1–20. http://doi.org/10.1145/3544548.3581566.
Lynott, Dermot, Louise Connell, Marc Brysbaert, James Brand, and James Carney (2020). “The Lancaster Sensorimotor Norms: Multidimensional Measures of Perceptual and Action Strength for 40,000 English Words”. In: Behavior Research Methods 52 (3), 1271–1291. http://doi.org/10.3758/s13428-019-01316-z.
Magyari, Lilla, Anne Mangen, Anežka Kuzmičová, Arthur M. Jacobs, and Jana Lüdtke (2020). “Eye Movements and Mental Imagery During Reading of Literary Texts with Different Narrative Styles”. In: Journal of Eye Movement Research 13 (3), 1–35. http://doi.org/10.16910/jemr.13.3.3.
Martínez, María-Angeles (2024). “Imagining Emotions in Storyworlds: Physiological Narrated Perception and Emotional Mental Imagery”. In: Frontiers in Human Neuroscience 18. http://doi.org/10.3389/fnhum.2024.1336286.
Montani, Ines, Matthew Honnibal, Adriane Boyd, Sofie Van Landeghem, and Henning Peters (2023). explosion/spaCy: v3.7.2: Fixes for APIs and requirements. Zenodo. http://doi.org/10.5281/zenodo.10009823.
Paivio, Allan, John C. Yuille, and Stephen A. Madigan (1968). “Concreteness, Imagery, and Meaningfulness Values for 925 Nouns”. In: Journal of Experimental Psychology 76 (1), 1–25. http://doi.org/10.1037/h0025327.
Paragini, Margherita and Mike Kestemont (2022). “The Roots of Doubt: Fine-tuning a BERT Model to Explore a Stylistic Phenomenon”. In: CHR 2022: Computational Humanities Research Conference. Ed. by Folgert Karsdorp, Alie Lassche, and Kristoffer Nielbo, 72–91. https://anet.be/record/opacirua/c:irua:192413 (visited on 11/11/2025).
Pound, Ezra (1913). “A Few Don’ts by an Imagiste”. In: Poetry 1 (6), 200–206. https://www.jstor.org/stable/20569730 (visited on 11/11/2025).
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever (2021). “Learning Transferable Visual Models from Natural Language Supervision”. In: Proceedings of the 38th International Conference on Machine Learning. Ed. by Marina Meila and Tong Zhang. PMLR, 8748–8763. https://proceedings.mlr.press/v139/radford21a.html (visited on 11/11/2025).
Reilly, Jamie and Jacob Kean (2007). “Formal Distinctiveness of High- and Low-imageability Nouns: Analyses and Theoretical Implications”. In: Cognitive Science 31 (1), 157–168. http://doi.org/10.1080/03640210709336988.
Sharma, Pratyusha, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, and Antonio Torralba (2024). A Vision Check-up for Language Models. arXiv preprint. http://doi.org/10.48550/arXiv.2401.01862.
Sharma Paudyal, Homa Nath (2023). “The Use of Imagery and Its Significance in Literary Studies”. In: The Outlook: Journal of English Studies 14, 114–127. http://doi.org/10.3126/ojes.v14i1.56664.
Silva, Kanishka, Burcu Can, Frédéric Blain, Raheem Sarwar, Laura Ugolini, and Ruslan Mitkov (2023). “Authorship Attribution of Late 19th Century Novels Using GAN-BERT”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). Ed. by Vishakh Padmakumar, Gisela Vallejo, and Yao Fu. Association for Computational Linguistics, 310–320. http://doi.org/10.18653/v1/2023.acl-srw.44.
SpaCy (2024). en_core_web_sm-3.8.0. Models for English. explosion. https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.8.0 (visited on 11/11/2025).
Strychacz, Thomas (2002). “‘The Sort of Thing You Should Not Admit’: Ernest Hemingway’s Aesthetic of Emotional Restraint”. In: Boys Don’t Cry? Rethinking Narratives of Masculinity and Emotion in the U.S. Ed. by Milette Shamir and Jennifer Travis. Columbia University Press, 141–166. http://doi.org/10.7312/sham12034-009.
Verma, Gaurav, Ryan Rossi, Christopher Tensmeyer, Jiuxiang Gu, and Ani Nenkova (2023). “Learning the Visualness of Text Using Large Vision-Language Models”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Ed. by Houda Bouamor, Juan Pino, and Kalika Bali. Association for Computational Linguistics, 2394–2408. http://doi.org/10.18653/v1/2023.emnlp-main.147.
Wu, Si and David Smith (2023). “Composition and Deformance: Measuring Imageability with a Text-to-Image Model”. In: Proceedings of the 5th Workshop on Narrative Understanding. Ed. by Nader Akoury, Elizabeth Clark, Mohit Iyyer, Snigdha Chaturvedi, Faeze Brahman, and Khyathi Chandu. Association for Computational Linguistics, 106–117. http://doi.org/10.18653/v1/2023.wnu-1.16.
Wu, Yaru, Yuri Bizzoni, Pascale Moreira, and Kristoffer Nielbo (2024). “Perplexing Canon: A Study on GPT-based Perplexity of Canonical and Non-canonical Literary Works”. In: Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024). Ed. by Yuri Bizzoni, Stefania Degaetano-Ortlieb, Anna Kazantseva, and Stan Szpakowicz. Association for Computational Linguistics, 172–184. https://aclanthology.org/2024.latechclfl-1.16 (visited on 11/11/2025).
Zhao, Baoquan, Songhua Xu, Shujin Lin, Ruomei Wang, and Xiaonan Luo (2019). “A New Visual Interface for Searching and Navigating Slide-Based Lecture Videos”. In: IEEE International Conference on Multimedia and Expo (ICME). IEEE Computer Society, 928–933. http://doi.org/10.1109/ICME.2019.00164.
Appendix
A.1 Norm and Entropy of Embeddings
We visualize the full embedding for extreme cases of norm and entropy values to give an idea of what these measures imply.
Figure 5: The two sentence embeddings with the highest and lowest entropy among the 9,000 Chicago Corpus sentences. Note how the lowest-entropy embedding appears to contain stronger values (e.g., is bluer) (a) and more extreme values (in either direction) (b), mirroring the shape of the highest-norm embedding above.
A.2 Experiment III Data
Table 7: Poems and authors of the Modern Love poems.
| Author | Poem |
| --- | --- |
| Michael Anania | Motet |
| Louise Bogan | To a Dead Lover |
| | Leave-Taking |
| | Juan’s Song |
| | Epitaph for a Romantic Woman |
| | Knowledge |
| | Song for the Last Act |
| | A Tale |
| Basil Bunting | from Odes: 30. The Orotava Road |
| Hart Crane | Voyages |
| | from The Bridge: Southern Cross |
| E. E. Cummings | as freedom is a breakfastfood |
| | i carry your heart with me(i carry it in) |
| | love is more thicker than forget |
| Paul Laurence Dunbar | The Old Front Gate |
| | A Negro Love Song |
| | Invitation to Love |
| | Night of Love |
| | Thou Art My Lute |
| | Song (Wintah, summah, snow er shine) |
| T. S. Eliot | Portrait of a Lady |
| | The Love Song of J. Alfred Prufrock |
| Kenneth Fearing | Aphrodite Metropolis (2) |
| | X Minus X |
| Ivor Gurney | Photographs |
| Stephen Spender | Song |
| James Joyce | Tutto Sciolto |
| D. H. Lawrence | Last Words to Miriam |
| | Gloire de Dijon |
| | Cruelty and Love |
| | Tortoise Gallantry |
| | The Bride |
| | Song (Love has crept…) |
| | Tortoise Shout |
| Edgar Lee Masters | Lydia Puckett |
| | Lucinda Matlock |
| | Mrs. Meyers |
| | Sarah Brown |
| Marjorie Pickthall | Adam and Eve |
| Carl Sandburg | Bilbea |
| | At a Window |
| | How Much? |
| Kenneth Slessor | New Magic |
| Gertrude Stein | The house was just twinkling in the moon light |
| | Idem the Same: A Valentine to Sherwood Anderson |
| Wallace Stevens | Hymn from a Watermelon Pavilion |
| | Peter Quince at the Clavier |
| Sara Teasdale | Union Square |
| | Spring in War-Time |
| | The Old Maid |
| | Since There Is No Escape |
| | The Look |
| | Over the Roofs |
| | Faults |
| | Eight O’Clock |
| | Old Love and New |
| | Debt |
| Louis Untermeyer | Infidelity |
| | Feuerzauber |
| Elinor Wylie | Wild Peaches |
| | Valentine |
| William Butler Yeats | When You Are Old |
| | Politics |
| | The Circus Animals’ Desertion |
| | He wishes his Beloved were Dead |
| | Never give all the Heart |
| | To an Isle in the Water |
| | Reconciliation |
| | The Cap and Bells |
| | Down by the Salley Gardens |
| | The Song of Wandering Aengus |
| | Adam’s Curse |
| | No Second Troy |
| | A Drinking Song |
Table 8: Poems and authors of the Imagist poems.
| Author | Poem |
| --- | --- |
| Richard Aldington | Childhood |
| | The Poplar |
| | Round-Pond |
| | Daisy |
| | Epigrams |
| | The Faun sees Snow for the First Time |
| | Lemures |
| H. D. | The Pool |
| | The Garden |
| | Sea Lily |
| | Sea Iris |
| | Sea Rose |
| | Oread |
| | Orion Dead |
| John Gould Fletcher | The Blue Symphony |
| | London Excursion |
| F. S. Flint | Trees |
| | Lunch |
| | Malady |
| | Accident |
| | Fragment |
| | Houses |
| | Eau-Forte |
| D. H. Lawrence | Ballad of Another Ophelia |
| | Illicit |
| | Fireflies in the Corn |
| | A Woman and Her Dead Husband |
| | The Mowers |
| | Scent of Irises |
| | Green |
| Amy Lowell | Venus Transiens |
| | The Travelling Bear |
| | The Letter |
| | Grotesque |
| | Bullion |
| | Solitaire |
| | The Bombardment |