
Reconstructing Shuffled Text. Bad Results for NLP, but Good News for Using In-Copyright Texts


Abstract

Existing copyright laws in the European Union, the United States, and many other jurisdictions worldwide impose limitations on Text and Data Mining (TDM) that affect the storage, publication, and reuse of datasets built from in-copyright texts. This issue directly affects researchers in Computational Literary Studies (CLS), a field in which work on contemporary materials is desirable and in which Open Science principles are strongly established. As a solution, derived text formats (DTFs) have been proposed. One important question with respect to copyright law is whether or not the source text can be reconstructed from a given DTF. In this paper we present the first of a series of experiments we plan to conduct on this issue. For this experiment, we have fine-tuned a large language model to reconstruct source texts from DTFs. The results of the reconstruction vary depending on various conditions, but are on the whole not very successful. This suggests that reconstructing text from DTFs is not as simple as is sometimes assumed, and we believe that this result may encourage scholars to convert their in-copyright texts to DTFs and publish them as research data.

Keywords: derived text formats, copyright, reconstructibility, evaluation

How to Cite:

Du, K., Ackerschewski, S., Navruz, U., Sınır, N., Valline, J. & Schöch, C., (2025) “Reconstructing Shuffled Text. Bad Results for NLP, but Good News for Using In-Copyright Texts”, Journal of Computational Literary Studies 4(1). doi: https://doi.org/10.48694/jcls.4163


Published on
2025-12-04

Peer Reviewed

1. Introduction

In many Digital Humanities (DH) projects, texts are being digitized, collected and/or enriched in order to be used as research data. However, existing copyright laws in the European Union, the United States, and many other jurisdictions worldwide impose several limitations on Text and Data Mining (TDM) that affect the storage, publication, and reuse of datasets built from in-copyright texts. This undoubtedly has a negative impact on the reproducibility of published research results and on the spirit and practice of open science. As a potential solution to this problem, scholars have proposed and are currently utilizing derived text formats (DTFs), also known as extracted features, for non-consumptive research (e.g., Bhattacharyya et al. 2015; Jett et al. 2020; Y. Lin et al. 2012; Organisciak and Downie 2021; Schöch et al. 2020). The ‘Hathi Trust Extracted Features’ (Jett et al. 2020), for example, might be the most widely-used set of DTFs in the Digital Humanities. However, beyond the specific design choices of this particular DTF, many other kinds of DTF exist or could be envisioned.

The key idea behind DTFs is to selectively remove specific copyright-relevant information from in-copyright texts by applying various transformations to them, so that these texts are no longer readable by humans and do not contain copyright-relevant features. If done in a suitable manner, the publication of such texts as research data is unlikely to affect the rights of copyright holders. At the same time, they remain suitable for (at least some of the) TDM tasks in the Digital Humanities, such as authorship attribution, topic modeling, or sentiment analysis (e.g., Du 2023; Kocula 2021).

There are several types of transformations that can be used to convert a text into a DTF: removal, exchange, and replacement. For example, the sequence information in a text can be removed, that is, as the example in Table 1 shows, the order of the words can be shuffled. To convert a longer text (e.g., a novel) to this format, it is first split into chunks (for example, 1,000-word or 500-word chunks). Then, the sequence information, i.e., the order of the words in each chunk, is removed by randomizing their order. Note that the sequence of the chunks within each text is maintained. This makes the text much less readable while roughly preserving the main structure of the original. Another possibility is to reduce the information about individual tokens by replacing a certain percentage of word forms with their corresponding Part-of-Speech (PoS) tags, without affecting word sequence information. Furthermore, since the goal of transforming text into DTFs is to retain the textual information needed for different TDM tasks, word embeddings (both static and contextualized) are also a promising candidate for information-rich DTFs (Schöch et al. 2020).
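The chunk-and-shuffle transformation just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the seed handling, and the whitespace tokenization are assumptions, not the authors' actual code.

```python
import random

def shuffle_dtf(text: str, chunk_size: int = 1000, seed: int = 0) -> list[str]:
    """Split a text into fixed-length chunks and shuffle the word order
    inside each chunk; the order of the chunks themselves is preserved."""
    rng = random.Random(seed)
    words = text.split()
    # fixed-length chunks (e.g., 500 or 1,000 words, as discussed above)
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    for chunk in chunks:
        rng.shuffle(chunk)  # destroy the sequence information within the chunk
    return [" ".join(chunk) for chunk in chunks]
```

Each returned chunk contains exactly the words of the corresponding source chunk, only in random order, so word-frequency-based TDM tasks remain possible while local word sequences are destroyed.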

Table 1: An example of a text and its two variants in DTFs.

source text Sherlock Holmes took his bottle from the corner of the mantel-piece and his hypodermic syringe from its neat morocco case. With his long, white, nervous fingers he adjusted the delicate needle, and rolled back his left shirt-cuff.
word order shuffled his bottle from of mantel-piece With his syringe from its the neat case. His white, took the fingers he and hypodermic Sherlock the Morocco delicate needle, and nervous corner rolled back his left shirt-cuff. Holmes long, adjusted
50% of words replaced by POS tags Sherlock PROPN VERB his NOUN from DET corner of the NOUN-piece CCONJ PRON hypodermic NOUN ADP its ADJ morocco case. ADP his long, ADJ, ADJ NOUN he VERB the delicate NOUN, CCONJ rolled ADV his left shirt-NOUN.
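The PoS-replacement variant shown in the last row of Table 1 can be sketched as follows. The hardcoded tag lookup is a toy stand-in for a real PoS tagger (such as spaCy), and the function name and tag set are illustrative only, not the tooling actually used:

```python
import random

# Toy PoS lookup standing in for a real tagger; purely illustrative.
TOY_TAGS = {"holmes": "PROPN", "took": "VERB", "bottle": "NOUN",
            "the": "DET", "and": "CCONJ", "delicate": "ADJ"}

def replace_with_pos(words: list[str], ratio: float = 0.5, seed: int = 0) -> list[str]:
    """Replace a given share of tokens with their PoS tags, keeping word order."""
    rng = random.Random(seed)
    n_replace = int(len(words) * ratio)
    # sample which positions to replace; the rest keep their word forms
    positions = set(rng.sample(range(len(words)), n_replace))
    return [TOY_TAGS.get(w.lower(), "X") if i in positions else w
            for i, w in enumerate(words)]
```

Unlike shuffling, this transformation leaves the word order intact and instead removes lexical information at the replaced positions.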

Technically speaking, DTFs are simply text that contains noise, and it is not a difficult task to convert textual data into such a format. At the legal level, however, DTFs are more controversial. For example, there is the view that converting texts to DTFs and publishing them does not constitute a copyright infringement in itself, and that only reconstructing the source texts from DTF texts does; according to another view, however, even texts converted to DTFs could still be protected by copyright law and therefore cannot be made public. Against this background, the legal status of DTFs is discussed in detail in Iacino et al. (2025). Among other points, the article discusses the attitude of German courts towards the relationship between text length and copyright protection: as long as a DTF text does not contain text fragments longer than 11 words that are not sufficiently different from the original work, such a DTF text is unlikely to infringe copyright law.
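A hypothetical pre-publication check against this 11-word criterion could compute the longest contiguous word sequence that a DTF text still shares with its source. The sketch below is purely illustrative (and, of course, not legal advice); the function name and the whitespace tokenization are assumptions.

```python
def longest_shared_fragment(dtf_text: str, source_text: str) -> int:
    """Length (in words) of the longest contiguous word sequence that the
    DTF text shares with the source text. Under the 11-word criterion
    discussed above, a value <= 11 would be the target."""
    a, b = dtf_text.split(), source_text.split()
    best = 0
    # classic dynamic programme over common-suffix lengths, O(len(a)*len(b))
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

For the shuffled example in Table 1, such a check would confirm that no long verbatim fragment of the source survives the transformation.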

One important aspect of DTFs regarding copyright law is the question of whether or not the original texts can be reconstructed from a given DTF (i.e., the ‘reconstructibility’ criterion). If we want to prepare in-copyright texts in a DTF and make them available to others, we have to be careful that the source texts cannot be easily reconstructed. On this point, Raue and Schöch (2020) stress that the original texts should not be reconstructible with reasonable effort, for example on the basis of position information of the text sequences or other sequence information.1 Of course, the definition of “with reasonable effort” is very vague. Therefore, it is essential to demonstrate through practical experiments how easy or difficult it is to reconstruct text from DTFs. In the following, we first outline the motivation of our research in detail (section 2), then we describe our data and methodology (section 3) and provide a discussion of the relation between our research and the memorization issue in LLMs (section 4). After that, we present and discuss the results of the reconstruction experiments (section 5), before we conclude (section 6).

2. Motivation

In CLS, ensuring the reproducibility of research requires that underlying text collections be publicly accessible. However, when such collections consist of copyrighted texts, direct publication is legally restricted. One possible solution is to transform the texts into DTF, provided that three conditions are met: (1) the transformed texts remain suitable for addressing the research questions, (2) they are unreadable to human readers, and (3) the original texts cannot easily be reconstructed from them. Our work focuses on evaluating DTFs from the perspective of Natural Language Processing (NLP); we aim to share our knowledge in order to provide legal experts with supporting arguments when the legal status of DTFs is discussed. In particular, this paper addresses the third point just mentioned.

In recent years, technologies related to large language models (LLMs) have developed rapidly. BERT, for instance, is trained using two tasks: one where the model learns to predict a masked word from context, and one where it learns whether two sentences directly follow each other or not (Devlin et al. 2019). BART, in turn, is trained on texts with sentences in random order, learning to reconstruct the original sequence (Lewis et al. 2020). Such corrupted training data is very similar to the text in DTFs, and the training objectives of these LLMs are analogous to reconstructing text from DTFs. Therefore, it can be assumed that LLMs may be capable of reconstructing the original text from DTFs. And indeed, Kugler et al. (2024) demonstrated that publishing the encoder together with the contextualized embeddings makes it possible to generate data for training a decoder whose reconstruction accuracy is very likely sufficient to violate copyright. However, their test is not the same as the usage scenario we aim to investigate: their study aimed at inferring the data used to train LLMs, while we focus on reconstructing text from DTFs similar to those in Table 1.

In fact, most of the NLP experts we have encountered agree that it should be possible to reconstruct text from DTFs using LLMs; even if it is not possible now, they expect it to become feasible as NLP technology advances. However, to the best of our knowledge, there are no relevant studies addressing this issue yet. Our preliminary test with ChatGPT (free version) showed that reconstructing text from DTFs is not impossible, even though ChatGPT is not specifically trained for this particular task (see Figure 1).2 So we can assume that it may be possible to reconstruct the source text from different DTFs with the help of specially trained LLMs, even though, from the point of view of making in-copyright textual data publicly available, we would prefer the text reconstruction experiments to be unsuccessful. Therefore, we are planning a series of experiments to assess the degree of difficulty of reconstructing text from different DTFs. In the research reported here, and as a first step, we have tested the reconstruction of text whose word order was shuffled.

Figure 1: Reconstructing shuffled text using ChatGPT. The source text is in Table 1.

3. Data and Methodology

The primary aim of our research is to investigate whether LLMs can reconstruct shuffled literary texts. For our experiments, we utilized textual data from two English datasets: IMDb reviews (non-literary texts) and a subset of the Gutenberg corpus (literary texts).3 We have selected only English text because LLMs perform better when processing English text compared to other languages. Additionally, non-literary texts are generally considered less complex than literary texts. Therefore, it is reasonable to hypothesize that reconstructing IMDb reviews would be easier and more successful than reconstructing texts from the Gutenberg corpus. By comparing the results of these two datasets, we can test this hypothesis and gain a deeper understanding of the model’s ability to reconstruct shuffled text.

3.1 IMDb-reviews

In the experiments on reconstructing IMDb reviews, each review was used as one data point. To transform the IMDb reviews into the DTF format, we only shuffled the word order within each sentence. The order of the sentences in each review was not altered. Three sets of data containing 25,000, 50,000 and 75,000 reviews were prepared as training data, while the validation and testing data contained 5,000 unseen reviews each. By varying the amount of training data, we can test the hypothesis that more training data leads to better reconstruction results.
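This sentence-level shuffling can be sketched as follows. The naive period-based sentence splitting and the function name are assumptions for illustration; a real pipeline would likely use a proper sentence segmenter.

```python
import random
import re

def shuffle_within_sentences(review: str, seed: int = 0) -> str:
    """DTF transform for the IMDb reviews: shuffle the word order inside
    each sentence while keeping the order of the sentences."""
    rng = random.Random(seed)
    # naive sentence split on terminal punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]
    out = []
    for sentence in sentences:
        words = sentence.split()
        rng.shuffle(words)  # destroy word order within the sentence only
        out.append(" ".join(words))
    return " ".join(out)
```

Because only within-sentence order is randomized, the coarse discourse structure of each review (sentence order) is preserved in the DTF.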

3.2 Gutenberg Texts

In order to ensure that literary genre does not become a confounding factor in the test results, we randomly selected Gutenberg novels from four different genres: detective fiction, historical fiction, love stories, and science fiction. Two datasets were created for evaluating the impact of the amount of training data: one consisting of 3 novels from each genre (12 novels in total) and the other consisting of 15 novels from each genre (60 novels in total). All the novels were split into chunks, and the words within each chunk were randomly shuffled, while the order of the chunks was not altered. These chunks were then used as data points for model training, validation, and testing in a ratio of 80%, 10%, and 10%. The chunk lengths were set to 50, 100, and 500 words. By varying the chunk length, we can also test the hypothesis that reconstructing shorter chunks/texts will be more successful. Since we used either 12 or 60 novels as the dataset, the total number of chunks differs depending on the dataset and the chunk length. An overview of the number of chunks can be found in Table 2.
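The chunking and the 80/10/10 split can be sketched as follows; this is an illustrative sketch, and the paper's exact splitting code may differ (e.g., in how a remainder chunk shorter than the chunk length is handled).

```python
def chunk_and_split(words: list[str], chunk_len: int,
                    ratios: tuple = (0.8, 0.1, 0.1)):
    """Split a novel (given as a word list) into fixed-length chunks and
    partition the chunks 80/10/10 into training/validation/testing data."""
    chunks = [" ".join(words[i:i + chunk_len])
              for i in range(0, len(words), chunk_len)]
    n = len(chunks)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = chunks[:n_train]
    val = chunks[n_train:n_train + n_val]
    test = chunks[n_train + n_val:]
    return train, val, test
```

With a 100,000-word novel and a chunk length of 50, for example, this yields 2,000 chunks split 1,600/200/200, which is consistent with the orders of magnitude in Table 2.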

Table 2: Number of chunks used as training, validation and testing data.

                   chunk length: 50        chunk length: 100       chunk length: 500
                   12 novels   60 novels   12 novels   60 novels   12 novels   60 novels
training data      14,258      89,915      7,130       44,971      1,428       9,013
validation data    4,753       11,240      2,377       5,622       477         1,127
testing data       4,753       11,239      2,377       5,621       476         1,127

3.3 Method

We treated the reconstruction of texts in DTF as an automatic translation task and used the translation pipeline from Huggingface.4 Automatic translation converts a sequence of text from one language to another; in the context of our research, these two ‘languages’ are the DTF text and the original text. This task can be formulated as a sequence-to-sequence problem and therefore requires a sequence-to-sequence large language model. We used the pre-trained T5-base model and fine-tuned it using the DTF texts as the input and the unaltered source texts as the target of the translation. The ‘Text-to-Text Transfer Transformer’ (T5) model is a framework that treats tasks such as translation, question answering and classification as the same process: The model takes text as input and generates target text as output. In this way, the same model, loss function, hyperparameters, etc. can be used for different tasks (Raffel et al. 2020). After fine-tuning, the model was evaluated on the unseen testing data. For both fine-tuning and evaluation, six measures were used to compare the predicted text with the target text. These measures assess the similarity of strings and are often used to evaluate the results of automatic translation.

  • WER: The word error rate (WER) is derived from the Levenshtein distance, working at the word level. It indicates the average number of errors (substitutions, deletions and insertions) per word. The smaller the value is, the higher the similarity (Morris et al. 2004; Woodard and Nelson 1982).

  • ROUGE scores: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and is a set of metrics for comparing texts from different perspectives. ROUGE-1 is a unigram (1-gram) based score that counts the maximum number of unigrams occurring in both the predicted text and the reference text. ROUGE-2 is the corresponding bigram (2-gram) based score. ROUGE-L focuses on the longest common subsequence between the predicted and the reference text, and ROUGE-Lsum first splits the text into sentences and then performs the ROUGE-L calculation for each sentence individually. For all ROUGE scores, higher values indicate higher similarity between the predicted and the reference text (C.-Y. Lin 2004).

  • SacreBLEU: SacreBLEU is a tool for calculating shareable, comparable, and reproducible BiLingual Evaluation Understudy (BLEU) scores (Papineni et al. 2002) and reports scores between 0 and 100, with 0 meaning zero resemblance and 100 meaning identical sentences (Post 2018).
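To make the intuition behind these measures concrete, WER and the unigram overlap underlying ROUGE-1 can be sketched in a few lines. These are minimal reimplementations for illustration, not the evaluation libraries used in the experiments; the official implementations additionally handle normalization and (for ROUGE) optional stemming and precision/recall/F1 variants.

```python
from collections import Counter

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words. 0.0 means the texts are identical."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution or match
        prev = curr
    return prev[len(hyp)] / len(ref)

def rouge1_recall(reference: str, hypothesis: str) -> float:
    """Unigram-overlap recall, the core idea behind ROUGE-1."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum(min(count, hyp[word]) for word, count in ref.items())
    return overlap / sum(ref.values())
```

Note how a pure word-order shuffle leaves ROUGE-1 recall at 1.0 while WER can be very high, which is exactly the behaviour of the ‘scrambled_baseline’ discussed in section 5.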

4. Memorization in LLMs

In the test using ChatGPT to reconstruct the text in Figure 1, we see that ChatGPT recognizes the source of the input text. This shows that the model has memorized text it has seen during pre-training. When training data can be reproduced verbatim, this phenomenon is called memorization in LLMs (e.g., Biderman et al. 2023; Lee et al. 2022). This issue has been examined by inverting the BERT pipeline (Kugler et al. 2024), through name cloze inference (Chang et al. 2023), or by asking the LLMs to complete a passage extracted from a book and measuring the overlap of the first ten tokens it produces with the real text in the book (Zhang et al. 2024). The latter two studies prompted generative LLMs to examine which books exist in the training data. In the case of the present study, the datasets used are both publicly available and it is almost certain that they were used for the pre-training of most LLMs, meaning the LLMs have seen these texts already. From a technical perspective, our experiments are therefore also an examination of memorization in LLMs. However, our study differs from memorization-focused studies in the following aspects:

  1. The memorization-focused studies looked at inferring the training data of LLMs or proving that certain data was used for training LLMs. Thus, the LLMs are the object of their study. In comparison, our study uses LLMs to reconstruct source texts from DTFs. Therefore, the objects of our study are the DTFs, and the LLMs are used as research tools.

  2. Although our experiments have used data that LLMs are highly likely to have seen during training, which makes our experiments fit the scope of research on memorization, the real-world application of our approach is to reconstruct in-copyright texts that are not available online and less likely to have been included in the pre-training of LLMs. That’s a different task from examining memorization in LLMs.

  3. The motivation for our study is legal and practical: Our ultimate goal is to enable scholars to make in-copyright texts publicly available as research data, and DTFs are our solution to this problem. Therefore, regardless of whether or not LLMs have seen the original text: as long as the original text can be reliably reconstructed from the DTF text given as input to an LLM, this particular DTF cannot be considered a solution for making in-copyright texts public.

  4. Since the goal of our research is not to examine memorization in LLMs, questions such as the correlation between the memorization of books in LLMs and the appearance frequency or popularity of the same books on the web (Zhang et al. 2024) are not central to us.

5. Results

5.1 Reconstructing IMDb-reviews

The results for reconstructing the unseen 5,000 reviews in the testing data are presented in Figure 2, which compares the three trained models’ performance across six evaluation metrics. The ‘scrambled_baseline’ in the figure represents the string similarity between the text in DTF and the source text. This baseline allows us to examine the extent to which the reconstruction has brought the DTF text back to the original. The ‘25000_model’, ‘50000_model’ and ‘75000_model’ labels represent the scores achieved with models trained on 25,000, 50,000 and 75,000 reviews, respectively. Since WER differs from all other scores in that higher values represent poorer results, 1-WER is shown here to make the results easier to compare. Also, the sacreBLEU scores are scaled down by a factor of 100 for visualization convenience. All six measures show very similar results: Models trained with more data reconstruct text better. The model trained on 75,000 reviews achieves the best scores in all tests, except for the ROUGE-1 score, where the ‘scrambled_baseline’ is best. This is because ROUGE-1 is a unigram-based score: since only the order of the words is disrupted, it is no surprise that the baseline achieves a perfect score of 1.0. The other ROUGE-1 scores indicate that in the process of reconstructing the text, the model is not simply putting all the scrambled input words back into their correct original order, but ‘rewriting’ the text given the information provided by the input. This is very likely due to the fact that we treat the reconstruction as an automatic translation task, so the model is not given direct instructions to use all the words of the input text during training. Overall, judging by the scores, even in the best cases, the similarity between the reconstruction results and the original text remains limited.

Figure 2: Average similarity scores achieved by different models in reconstructing IMDb-reviews. Higher values represent better results.

Obviously, these numerical assessments are not sufficient to convey the full picture of the test results. We therefore selected three examples, including the source texts, their reconstructed texts, and their scores, in order to provide readers with a more intuitive idea of the results of the reconstruction.5 All three examples were selected from the texts reconstructed using the best model, ‘75000_model’. The similarity scores between the source texts and their reconstructions are presented in Figure 3, while the three reviews and their reconstructions are listed in Table 3. In terms of similarity scores, review No. 4,691 is the best reconstruction result, No. 4,320 is the second best, and No. 4,758 is the worst. The sacreBLEU score for No. 4,758 is 0.00085 after dividing by 100, so low that it is barely visible in the visualization.

Table 3: Three IMDb-reviews and their reconstruction.

Review type text
No. 4,691 source This movie starts with a lot of promise. The opening scene, featuring Sean Connery, is very entertaining. However, Connery disappears for most of the rest of the movie along with any talent that anyone else may have exhibited. The movie jumps from place to place with no coherent story. There is no sense of time. The editing is laughable. After the first 5 minutes there is nothing worth watching in this film.
No. 4,691 reconstructed This movie starts with a lot of promise. The opening scene featuring Sean Connery, is very entertaining. However, the rest of Connery may have exhibited any talent along with the rest of the movie. The movie jumps from place to place with no coherent story. There is no sense of time. The editing is laughable. After watching this film there is nothing worth watching.
No. 4,320 source I remember seeing this on T.V. in the early ’80’s, and even though I was still kind of young, I thought it was awful. Rock Hudson should really have been more selective of the scripts he accepted. some of his films are really good, and others like ‘embryo’ and this piece of drek should have been left to the next generation of actors.now for the scene that I thought was the funniest of the whole movie. it happens at the very end as the camera is pulling away and the screen starts to fade to black. If I remember the scene correctly, a group of people are still in either a wrecked hotel or a cave and some guys wife has just been declared dead. as the camera pulls back you can clearly see the ‘dead woman’ stand up and walk off set.
No. 4,320 reconstructed I remember seeing this on T.V., and even though it was in the early 80’s, I still thought it was awful. Rock Hudson should have been more selective for the scripts. I thought that this was the ‘embryon’ of the next generation of actors, and the whole piece of drek was really good and funniest. I think some of the films should have been the funniest and now the scene starts to fade away as the camera starts pulling away at the end. If the scene has been declared dead, or a group of guys are still in a cave or a wrecked hotel.
No. 4,758 source I couldn’t make it through the whole thing. It just wasn’t worth my time. Maybe one-fourth of the dialogue would have been worth listening to (or reading – since I don’t understand French) if the pseudo-profundity and pseudo-wittiness of the other three-fourths of the film were deleted. … [Here, around 230 words from the source text have been omitted.] … At least these films are interesting and enjoyable, which is much more than I can say about IN PRAISE OF LOVE (Éloge de l’amour). I give this film 2 out of 10 stars. Not quite offensive enough to rate 1 for ‘awful’ (such as ‘The Devils’ with Oliver Reed and Vanessa Redgrave). If you still want to watch it, go ahead. But don’t say I didn’t warn you!!!
No. 4,758 reconstructed I couldn’t make it through the whole thing. It just wasn’t worth my time. Maybe – since the fourths of the French dialogue were deleted (if the pseudo-wittiness of the other three) or – if the pseudo-pseudo-pseudo … [Here, “-pseudo” repeated 52 times.] … -pseudo-pseudo-pseudo. But don’t say I didn’t warn you!!!

Figure 3: String similarity of three reconstructed IMDb-reviews.

As we can see in Table 3, the reconstruction of review No. 4,691 is indeed more successful than the other cases. Two of the four sentences are identical to the source text, but the other two are not. The last sentence, in particular, expresses the opposite meaning and sentiment of the source text. In comparison, the reconstructed text of review No. 4,320 differs in length from the source text (and its DTF version), and most of the reconstructed text is inconsistent with the source text. Only one or two sentences or phrases are identical to the source text. It is worth noting that the difference between No. 4,691 and No. 4,320 in the ROUGE-1 score is not particularly large, but the difference in the quality of the reconstructed text is very obvious. This suggests that a large overlap of unigrams between the reconstructed text and the original text is relatively easy to achieve. In contrast, there is much less overlap between the longer sequences (bigrams etc.) of the reconstructed text and the source text. The reconstruction of review No. 4,758 can be described as a complete failure. Although the first and last sentences are the same as in the source text, the long middle part of the text has been replaced by the persistent repetition of the string “pseudo-”. In the reconstructions of the 5,000 unseen reviews, this multiple repetition of the same string can be observed quite often. We assume that this might be caused by greedy sampling, a problem which is fairly common in tasks that use LLMs to generate text (e.g., Fu et al. 2021; Holtzman et al. 2020; Welleck et al. 2019).

To determine how many reconstructed texts achieve a level of similarity comparable to review No. 4,691, we can refer to Figure 4. This figure presents the distribution of similarity scores for all 5,000 unseen reviews reconstructed using the ‘75000_model’. The data indicates that only a small minority of these reconstructed texts (well below 25%) attain a ROUGE-2 or ROUGE-L score greater than 0.7. Although the ROUGE-1 scores are much higher overall, the majority (around 75%) of them are still lower than 0.8. Thus, we can conclude that the reconstruction of the IMDb reviews is not successful.

Figure 4: Similarity score distribution of reconstructed IMDb reviews using the ‘75000_model’. The outliers are not visualized.

5.2 Reconstructing Gutenberg Texts

The reconstruction results for the Gutenberg text chunks are presented in Figure 5. The top plot shows the results using 12 novels, and the bottom plot shows the similarity scores achieved using 60 novels as data. In the top plot, all evaluation measures have relatively low scores across the three chunk lengths. As in the previous test, ROUGE-1 has slightly higher values compared to the other measures, but overall, the scores are low. In comparison, the scores in the bottom plot improve significantly. Compared to the other evaluation metrics, ROUGE-1 has the highest scores, especially for the smaller chunk lengths (50 and 100). The results suggest that both corpus size and chunk length have an impact on the reconstruction, with larger corpora and smaller chunk lengths generally yielding better reconstruction results. Altogether, this test confirms the observations made when reconstructing the IMDb reviews in the previous test. After reviewing all the reconstructed texts, we found that the persistent repetition of the same string caused by greedy sampling is more frequent when reconstructing text chunks of 500 words. This problem is clearly related to the length of the text that must be generated.6

Figure 5: Average similarity scores achieved in reconstructing Gutenberg text chunks.

To provide an idea of the quality of the reconstructed text, we selected three examples from the reconstruction results of the 60 novels segmented into 50-word chunks, as the reconstruction was most successful for shorter chunks and more data. Figure 6 shows the similarity scores for the three reconstructed chunks, while their source and reconstructed texts are provided in Table 4. Chunk No. 3,536 achieved the highest scores, including a perfect ROUGE-1 score. Its reconstructed text differs very little from the original in general; the first two sentences in particular are nearly identical to the original. However, the later sentences differ significantly in meaning due to the confused placement of the personal pronouns. In contrast, the scores for chunk No. 1,368 are much lower, and it is quite difficult to infer the source text from the reconstructed text. Chunk No. 5,481 had the least successful reconstruction, with minimal scores. Its reconstructed text consists only of a series of dots and three words.

Table 4: Three Gutenberg text chunks and their reconstruction.

Chunk type text
No.3,536 source David ! ” she cried,—“my dear David — ! ” Then she broke off . “ What is it ? ” she asked , in a different tone . He showed her the headlines of the newspaper he was carrying . “ Tragedy ! ” he answered hoarsely . “
No.3,536 reconstructed ! ” she cried,—“my dear David ! ” He answered hoarsely . “ What is it ? ” she asked . Then he broke off in a different tone . “ David ! ” he showed her the headlines of the newspaper . “ Tragedy ! ” she was carrying
No.1,368 source ”else , so that I had not so much as a glimpse of her face . But I knew that it was Mary . ““““““““ Come , ”””””””” said my lord , pleasantly . ““““““““ We will go to her . It may be , she will not have the”
No.1,368 reconstructed ”. ““““““““ Come , my lord , ”””””””” she said pleasantly . ““““““““ We may not have so much as a glimpse of her face . But it was so , as I knew , that Mary will not go . It will be so , that I will not”
No.5,481 source be humble . The thought had mingled with the sea ’s rhythmic lullaby as it hushed her restless soul to sleep last night . He had offered her a new God who was Love,–his God . One who gave him happiness and content . Why should she resist ? Was
No.5,481 reconstructed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . One who had

Figure 6: String similarity of three reconstructed Gutenberg 50-word text chunks.

Figure 7 provides an overview of the distribution of similarity scores for all unseen 50-word chunks. Although more than 75% of the ROUGE-1 scores are over 0.8, over 75% of the other similarity scores (for WER, even almost all of the scores) are lower than 0.4. This means that the majority of the reconstructed texts have a high degree of unigram overlap with the source text, but are of somewhat poorer quality than No. 1,368. In comparison, very successful reconstructions like No. 3,536, or very unsuccessful reconstructions like No. 5,481, are a very small minority.

Figure 7: Similarity score distribution of reconstructed Gutenberg 50-word chunks. The outliers are not visualized.
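To make the scores discussed above concrete, the two contrasting metrics can be sketched with a minimal reimplementation. This is an illustration only, not the exact scoring setup of our experiments (which rely on standard packages and their tokenization); the function names and the simple tokenizer are our own. The sketch shows why the two metrics diverge on shuffled text: a reconstruction that merely preserves the bag of words can reach a perfect ROUGE-1 while still having a high word error rate.

```python
import re
from collections import Counter

def tokenize(text):
    # Simple lowercased word/punctuation tokenization (an assumption;
    # the experiments may tokenize differently).
    return re.findall(r"\w+|[^\w\s]", text.lower())

def rouge1_f(reference, hypothesis):
    """ROUGE-1 F1: unigram overlap between reference and hypothesis."""
    ref, hyp = Counter(tokenize(reference)), Counter(tokenize(hypothesis))
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # Standard dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a hypothesis containing exactly the source words in a scrambled order scores a ROUGE-1 of 1.0, while its WER remains well above zero, which is the pattern visible in Figure 7.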

Considered together, the test results on the two datasets show that reconstructing DTF texts is quite challenging, especially for longer texts. Even for texts as short as 50 words, the reconstructed texts are still far from the original, especially at the semantic and content level. By comparing the score distributions of the most successful reconstructions across the two datasets (Figure 8), we can conclude that literary texts are indeed more difficult to reconstruct: the experiments on the Gutenberg dataset scored lower on metrics such as ROUGE-2 and ROUGE-L, which better reflect the quality of the reconstruction.

Figure 8: Comparison of the score distributions of the most successful reconstructions on the two datasets. The outliers are not visualized.

6. Conclusion

In this paper, we presented our experiments on reconstructing text from a DTF using a fine-tuned LLM. To gain a preliminary understanding of the ability of LLMs to reconstruct text, we fine-tuned the T5-base model on text with scrambled word order and used it to reconstruct unseen text.
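The data preparation behind this setup can be sketched as follows. This is a minimal, self-contained illustration under our own naming, not the project's actual pipeline: text is split into fixed-size chunks, each chunk's word order is scrambled, and (shuffled, original) pairs serve as input and target for sequence-to-sequence fine-tuning.

```python
import random

def to_chunks(tokens, size=50):
    """Split a token sequence into consecutive fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def shuffle_chunk(chunk, rng):
    """Return a copy of the chunk with its word order randomly permuted.
    Token frequencies are preserved, so bag-of-words statistics survive."""
    shuffled = list(chunk)
    rng.shuffle(shuffled)
    return shuffled

# Build (input, target) pairs for sequence-to-sequence fine-tuning:
# the model sees the shuffled chunk and learns to emit the original order.
rng = random.Random(42)
tokens = "He showed her the headlines of the newspaper he was carrying".split()
pairs = [(" ".join(shuffle_chunk(c, rng)), " ".join(c)) for c in to_chunks(tokens, 6)]
```

Note that the shuffled input and the original target of each pair always contain the same multiset of tokens; only the sequence information is destroyed, which is exactly what the model has to recover.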

The results of the reconstruction are mixed, but on the whole not very successful. What is clear is that the task of text reconstruction can very likely be improved by optimizing technical aspects, e.g., by using different model training strategies, more powerful models, or more training data, by constraining the reconstructed text to the length of the source text, or by choosing a sampling mechanism other than greedy sampling. On the other hand, if the text to be reconstructed is more complex, such as in-copyright, less well-known literary works that are not available on the Internet, which aligns more closely with real-world applications of DTFs, then the task becomes more difficult. Additionally, if the shuffling of word order goes beyond the level of sentences or 50 words (for example, extending to paragraphs, chunks larger than 1,000 words, or even entire texts), or if different DTF methods are combined (for example, replacing 10% of random words with their corresponding PoS tags in addition to shuffling the word order), reconstruction will undoubtedly become significantly more challenging. Conversely, if the same book is converted into different DTFs and all of these DTF texts are publicly available, it might be easier to reconstruct the text by combining them. All of these aspects remain to be studied, and we will continue working on this topic with further experiments in order to determine exactly how complex it is to reconstruct text from different DTFs, and what factors this depends on.
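A combination of two DTF methods of the kind mentioned above (shuffling plus partial PoS replacement) can be sketched as follows. Everything here is hypothetical illustration: the toy tag lookup stands in for a real PoS tagger (a real pipeline would use a tagger such as spaCy or NLTK), and the function name and parameters are our own.

```python
import random

# Toy stand-in for a PoS tagger; any token not in the map falls back to "X".
TOY_TAGS = {"he": "PRON", "she": "PRON", "newspaper": "NOUN", "headlines": "NOUN"}

def combined_dtf(tokens, replace_ratio=0.1, seed=None):
    """Apply two DTF transformations in sequence: shuffle the word order,
    then replace a fraction of the tokens with their (toy) PoS tags."""
    rng = random.Random(seed)
    out = list(tokens)
    rng.shuffle(out)  # first transformation: destroy sequence information
    n_replace = max(1, int(len(out) * replace_ratio))
    for i in rng.sample(range(len(out)), n_replace):
        # second transformation: mask selected tokens behind a PoS tag
        out[i] = TOY_TAGS.get(out[i].lower(), "X")
    return out
```

Unlike pure shuffling, this combined transformation also removes lexical material, so even a perfect reordering could no longer recover the source verbatim, which is why we expect it to make reconstruction considerably harder.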

As a possible reference for defining “with reasonable effort” (mentioned in the ‘Introduction’), we would also like to briefly report on the resources used to accomplish this work. This work was conducted as a collaboration between four NLP Masters students and a DH postdoctoral researcher, in close consultation with an established DH researcher. Our experiments show that reconstructing text from just one DTF is not a simple task for someone without sufficient expertise in NLP, as we needed to implement and test a custom-built reconstruction pipeline. The task also requires considerable resources: to train the model, we used a workstation equipped with an Nvidia GeForce RTX 4090 GPU, which costs several thousand euros and consumes considerable amounts of power during training and inference. In addition, the process takes time: depending on the size of the dataset, training the model and running inference on unseen data can take several hours to several days. In contrast, anyone can obtain digitized text of much better quality by taking photos of a printed book and running OCR on the page images (even the iPhone, for example, has OCR software integrated), which is much cheaper, faster, and easier.

Finally, while the presented work is both legal and practical in nature, its motivation derives from the needs of the CLS community. It can be understood as infrastructure development for the field, which is as important as conducting CLS research itself. We believe that the bad results of our experiments are good news for using in-copyright text as research data. We hope, at the very least, that the results presented here will encourage DH scholars to convert their in-copyright texts to DTFs and publish them as research data, which is very valuable for transparent and sustainable research and for access to large reference corpora.

7. Data Availability

Data and code have been archived and are persistently available at: https://doi.org/10.5281/zenodo.17198425.

8. Acknowledgements

This publication was created in the context of the work of the association German National Research Data Infrastructure (NFDI) e.V. NFDI is financed by the Federal Republic of Germany and the 16 federal states, and the consortium Text+ is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 460033370. The authors would like to thank them for the funding and support. Thanks also go to all institutions and actors who are committed to the association and its goals.

9. Author Contributions

Keli Du: Conceptualization, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing.

Sarah Ackerschewski: Methodology, Resources, Software.

Uygar Navruz: Data curation, Methodology, Software.

Nazan Sınır: Data curation, Methodology, Software.

Julian Valline: Methodology, Resources, Software.

Christof Schöch: Funding acquisition, Resources, Supervision, Writing – review & editing.

Notes

  1. Quote in the original German: “Neben der Nichterkennbarkeit wird man als zweite Anforderung von einem urheberrechtsfreien, abgeleiteten Textformat verlangen müssen, dass die ursprünglichen Texte nicht aufgrund von Positionsangaben der Textsequenzen oder sonstiger Sequenzinformationen mit verhältnismäßigem Aufwand rekonstruierbar sind.” (Raue and Schöch 2020). Translation: “Besides non-recognizability, a second requirement for a copyright-free derived text format must be that the original texts cannot be reconstructed with reasonable effort on the basis of positional information about the text sequences or other sequence information.” [^]
  2. Surely, as we can see from ChatGPT’s answer, this relatively successful text reconstruction is most likely due to the fact that the model has already seen the original text during training. Therefore, the output text may possibly be ‘memorized’ rather than ‘reconstructed’. The issue of memorization will be discussed in more detail in section 4. [^]
  3. All the textual data are available online. Please see section 7. [^]
  4. See: https://huggingface.co/docs/transformers/tasks/translation. [^]
  5. All the results of the reconstruction are available online. Please see section 7. [^]
  6. All the reconstructed 50-word, 100-word and 500-word chunks are available online. Please see section 7. [^]

References

Bhattacharyya, Sayan, Peter Organisciak, and J. Stephen Downie (2015). “A Fragmentizing Interface to a Large Corpus of Digitized Text: (Post)humanism and Non-consumptive Reading via Features”. In: Interdisciplinary Science Reviews 40 (1), 61–77.  http://doi.org/10.1179/0308018814Z.000000000105.

Biderman, Stella, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff (2023). “Emergent and Predictable Memorization in Large Language Models”. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. Ed. by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine. Curran Associates Inc. https://dl.acm.org/doi/10.5555/3666122.3667341 (visited on 11/19/2025).

Chang, Kent, Mackenzie Cramer, Sandeep Soni, and David Bamman (2023). “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Ed. by Houda Bouamor, Juan Pino, and Kalika Bali. Association for Computational Linguistics, 7312–7327.  http://doi.org/10.18653/v1/2023.emnlp-main.453.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Ed. by Jill Burstein, Christy Doran, and Thamar Solorio. Association for Computational Linguistics, 4171–4186.  http://doi.org/10.18653/v1/N19-1423.

Du, Keli (2023). “Understanding the Impact of Three Derived Text Formats on Authorship Classification with Delta”. In: DHd 2023 Open Humanities Open Culture. 9. Tagung des Verbands “Digital Humanities im deutschsprachigen Raum” (DHd 2023). Ed. by Peer Trilcke, Anna Busch, and Patrick Helling. Zenodo.  http://doi.org/10.5281/zenodo.7715299.

Fu, Zihao, Wai Lam, Anthony Man-Cho So, and Bei Shi (2021). “A Theoretical Analysis of the Repetition Problem in Text Generation”. In: Proceedings of the AAAI Conference on Artificial Intelligence 35 (14), 12848–12856.  http://doi.org/10.1609/aaai.v35i14.17520.

Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi (2020). “The Curious Case of Neural Text Degeneration”. In: arXiv preprint.  http://doi.org/10.48550/arXiv.1904.09751.

Iacino, Gianna, Paweł Kamocki, Keli Du, Christof Schöch, Andreas Witt, Philippe Genêt, and José Calvo Tello (2025). “Legal Status of Derived Text Formats – 2nd Deliverable of Text+ AG Legal and Ethical Issues –”. In: RuZ – Recht und Zugang 5 (3), 149–172.  http://doi.org/10.5771/2699-1284-2024-3-149.

Jett, Jacob, Boris Capitanu, Deren Kudeki, Timothy Cole, Yuerong Hu, Peter Organisciak, Ted Underwood, Eleanor Dickson Koehl, Ryan Dubnicek, and J. Stephen Downie (2020). The HathiTrust Research Center Extracted Features Dataset (2.0).  http://doi.org/10.13012/R2TE-C227.

Kocula, Martin (2021). Volltext vs. abgeleitetes Textformat: Systematische Evaluation der Performanz von Topic Modeling bei unterschiedlichen Textformaten mit Python.  http://doi.org/10.5281/zenodo.5552487.

Kugler, Kai, Simon Münker, Johannes Höhmann, and Achim Rettinger (2024). “InvBERT: Reconstructing Text from Contextualized Word Embeddings by Inverting the BERT pipeline”. In: Journal of Computational Literary Studies 2 (1).  http://doi.org/10.48694/jcls.3572.

Lee, Katherine, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini (2022). “Deduplicating Training Data Makes Language Models Better”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Smaranda Muresan, Preslav Nakov, and Aline Villavicencio. Association for Computational Linguistics, 8424–8445.  http://doi.org/10.18653/v1/2022.acl-long.577.

Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer (2020). “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Ed. by Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault. Association for Computational Linguistics, 7871–7880.  http://doi.org/10.18653/v1/2020.acl-main.703.

Lin, Chin-Yew (2004). “ROUGE: A Package for Automatic Evaluation of Summaries”. In: Text Summarization Branches Out. Association for Computational Linguistics, 74–81. https://aclanthology.org/W04-1013/ (visited on 05/30/2025).

Lin, Yuri, Jean-Baptiste Michel, Erez Aiden Lieberman, Jon Orwant, Will Brockman, and Slav Petrov (2012). “Syntactic Annotations for the Google Books NGram Corpus”. In: Proceedings of the ACL 2012 System Demonstrations. Ed. by Min Zhang. Association for Computational Linguistics, 169–174. https://aclanthology.org/P12-3029 (visited on 05/30/2025).

Morris, Andrew Cameron, Viktoria Maier, and Phil Green (2004). “From WER and RIL to MER and WIL: Improved Evaluation Measures for Connected Speech Recognition”. In: Interspeech 2004, 2765–2768.  http://doi.org/10.21437/Interspeech.2004-668.

Organisciak, Peter and J. Stephen Downie (2021). “Research Access to In-copyright Texts in the Humanities”. In: Information and Knowledge Organisation in Digital Humanities. Ed. by Koraljka Golub and Ying-Hsang Liu. Routledge, 157–177.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). “BLEU: a Method for Automatic Evaluation of Machine Translation”. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Ed. by Pierre Isabelle, Eugene Charniak, and Dekang Lin. Association for Computational Linguistics, 311–318.  http://doi.org/10.3115/1073083.1073135.

Post, Matt (2018). “A Call for Clarity in Reporting BLEU Scores”. In: Proceedings of the Third Conference on Machine Translation: Research Papers. Ed. by Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, and Karin Verspoor. Association for Computational Linguistics, 186–191.  http://doi.org/10.18653/v1/W18-6319.

Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu (2020). “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In: Journal of Machine Learning Research 21 (140), 1–67. http://jmlr.org/papers/v21/20-074.html (visited on 05/30/2025).

Raue, Benjamin and Christof Schöch (2020). “Zugang zu großen Textkorpora des 20. und 21. Jahrhunderts mit Hilfe abgeleiteter Textformate – Versöhnung von Urheberrecht und textbasierter Forschung”. In: RuZ – Recht und Zugang 1 (2), 118–127.  http://doi.org/10.5771/2699-1284-2020-2-118.

Schöch, Christof, Frédéric Döhl, Achim Rettinger, Evelyn Gius, Peer Trilcke, Peter Leinen, Fotis Jannidis, Maria Hinzmann, and Jörg Röpke (2020). “Abgeleitete Textformate: Text und Data Mining mit urheberrechtlich geschützten Textbeständen”. In: Zeitschrift für digitale Geisteswissenschaften 5.  http://doi.org/10.17175/2020_006.

Welleck, Sean, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston (2019). “Neural Text Generation with Unlikelihood Training”. In: arXiv preprint.  http://doi.org/10.48550/arXiv.1908.04319.

Woodard, J. P. and Jeremy T. Nelson (1982). “An Information Theoretic Measure of Speech Recognition Performance”. In: Workshop on Standardisation for Speech I/O Technology, Naval Air Development Center, Warminster, PA.

Zhang, Xinhao, Olga Seminck, and Pascal Amsili (2024). “Remember to Forget: A Study on Verbatim Memorization of Literature in Large Language Models”. In: Proceedings of the Computational Humanities Research Conference 2024. Ed. by Wouter Haverals, Marijn Koolen, and Laure Thompson. https://ceur-ws.org/Vol-3834/paper96.pdf (visited on 05/30/2025).